Add support for capturing return values from worker functions #15

Open
wants to merge 1 commit into master

Conversation

JamesWrigley
Member

This matches the behaviour of multiprocessing.Pool and ThreadPool.
@JamesWrigley JamesWrigley requested a review from philsmt December 27, 2022 12:16
@JamesWrigley JamesWrigley self-assigned this Dec 27, 2022
@codecov

codecov bot commented Dec 27, 2022

Codecov Report

Base: 91.34% // Head: 90.02% // Decreases project coverage by 1.32% ⚠️

Coverage data is based on head (9d77800) compared to base (5bbd9f0).
Patch coverage: 91.66% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #15      +/-   ##
==========================================
- Coverage   91.34%   90.02%   -1.32%     
==========================================
  Files           5        5              
  Lines         358      361       +3     
==========================================
- Hits          327      325       -2     
- Misses         31       36       +5     
Impacted Files Coverage Δ
pasha/context.py 88.98% <88.88%> (-4.18%) ⬇️
pasha/tests/test_context.py 100.00% <100.00%> (ø)


☔ View full report at Codecov.

@JamesWrigley
Member Author

Not sure what's up with the CI 🤔

@philsmt
Collaborator

philsmt commented Jan 2, 2023

Thanks for this MR!

Now, I would like to be a bit cautious with this one. I can immediately see use cases for this, e.g. aggregating metadata like event counters across workers, for which so far you would need a per-worker counter:

per_worker_counter = psh.alloc(1, dtype=int, per_worker=True)

def kernel(worker_id, ...):
    per_worker_counter[worker_id] += processed_events

num_events = per_worker_counter.sum()
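The counter pattern above can be simulated in plain Python, with a list standing in for pasha's per-worker allocation (`NUM_WORKERS` and the fixed event counts here are made up purely for illustration):

```python
from multiprocessing.pool import ThreadPool

NUM_WORKERS = 4
per_worker_counter = [0] * NUM_WORKERS  # stand-in for psh.alloc(..., per_worker=True)

def kernel(args):
    worker_id, processed_events = args
    # Each worker writes only to its own slot, so no locking is needed.
    per_worker_counter[worker_id] += processed_events

with ThreadPool(NUM_WORKERS) as pool:
    # One task per worker slot, each "processing" 10 events.
    pool.map(kernel, [(i, 10) for i in range(NUM_WORKERS)])

num_events = sum(per_worker_counter)  # aggregate across workers
```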

With this, one can simply return the result and have it aggregated for you:

def kernel(worker_id, ...):
    return processed_events

num_events = sum(psh.map(kernel, ...))
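The same return-value style can be sketched with `multiprocessing.pool.ThreadPool`, whose `map` behaviour this PR is matching (the kernel and event counts are invented for the example):

```python
from multiprocessing.pool import ThreadPool

def kernel(processed_events):
    return processed_events  # each task just reports its event count

with ThreadPool(4) as pool:
    # Pool.map collects the return values in input order; sum() aggregates them.
    num_events = sum(pool.map(kernel, [10, 20, 30, 40]))
```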

That being said, it opens up the possibility of a serious anti-pattern: pushing around the actual worker results. I suppose having a multiprocessing.Pool with fancier iterator support is reason enough for pasha to exist, but I would like to gently push novice programmers (the average photon scientist 😜) in the right direction.

Could you please run some numbers on the performance impact in edge situations, say, very short worker functions over very long iterables?

@JamesWrigley
Member Author

Sure, with:

%%time
import time

import pandas as pd
import psutil
import pasha

def foo(worker_id, index, value):
    return value

processes_times = pd.DataFrame(columns=["Input size", "Running time"])
pasha.set_default_context("processes", num_workers=psutil.cpu_count())

for i in range(1, 40):
    start = time.perf_counter()
    pasha.map(foo, range(i**4))
    running_time = time.perf_counter() - start

    processes_times.loc[i] = [i**4, running_time]

I get this on the map_return branch:
[plot: running time vs. input size on map_return]

And on the version in the xfel_anaconda3 kernel:
[plot: running time vs. input size on the xfel_anaconda3 version]

(I was too lazy to overlay the plots, sorry 🙈 )

That's going up to ~2.5 million elements, where the overhead is ~0.4s.

But if we use a larger kernel:

def foo(worker_id, index, value):
    return (value, value, value, value)

On map_return:
[plot: running time vs. input size on map_return, larger kernel]

Vs the version in xfel_anaconda3:
[plot: running time vs. input size on the xfel_anaconda3 version, larger kernel]

Not sure why the baseline is lower on map_return, but the overhead goes up to ~0.7s. I think most of it comes from the call to list(itertools.chain.from_iterable()), which I got from this SO answer: https://stackoverflow.com/a/953097
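For reference, the flattening step that call performs looks like this (the nested lists here are a made-up stand-in for each worker's collected return values):

```python
import itertools

# Hypothetical per-worker result lists, one inner list per worker.
per_worker_results = [[0, 1, 2], [3, 4], [5, 6, 7]]

# chain.from_iterable lazily concatenates the inner lists;
# list() materialises them into a single flat result list.
flat = list(itertools.chain.from_iterable(per_worker_results))
```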

I think that's acceptable, but I could make it optional and off by default?
