Add support for capturing return values from worker functions #15
base: master
Conversation
This matches the behaviour of multiprocessing.Pool and ThreadPool.
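For context, a minimal sketch of the behaviour being matched: both standard-library pools collect each worker function's return value into a list ordered by input position.

```python
from multiprocessing.pool import Pool, ThreadPool

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(4) as pool:
        # Pool.map gathers each worker's return value,
        # ordered by input position.
        assert pool.map(square, range(5)) == [0, 1, 4, 9, 16]

    with ThreadPool(4) as pool:
        # ThreadPool.map behaves identically.
        assert pool.map(square, range(5)) == [0, 1, 4, 9, 16]
```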
Codecov Report — Base: 91.34% // Head: 90.02% // Decreases project coverage by 1.32%.
Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master      #15      +/-   ##
==========================================
- Coverage   91.34%   90.02%   -1.32%
==========================================
  Files           5        5
  Lines         358      361       +3
==========================================
- Hits          327      325       -2
- Misses         31       36       +5
```
☔ View full report at Codecov.
Not sure what's up with the CI 🤔
Thanks for this MR! Now, I would like to be a bit cautious with this one. I can immediately see use cases for this, e.g. aggregating metadata like event counters across workers, for which so far you would need some per-worker counter:

```python
per_worker_counter = psh.alloc(1, dtype=int, per_worker=True)

def kernel(worker_id, ...):
    per_worker_counter[worker_id] += processed_events

num_events = per_worker_counter.sum()
```

With this, one can simply return the result and have it aggregated for you:

```python
def kernel(worker_id, ...):
    return processed_events

num_events = sum(psh.map(kernel, ...))
```

That being said, it opens up the possibility for a serious mispattern of pushing around the actual worker results. I suppose having …

Could you please run some numbers on the performance impact in edge situations, say very short worker functions for very long iterables?
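To make that concern concrete, here is a sketch of my own (not from the thread), assuming pasha's `alloc`/`map` API with the same kernel signature used elsewhere in this discussion, contrasting the intended shared-memory pattern with the mispattern of shipping results back through return values:

```python
import numpy as np
import pasha as psh

data = np.arange(1_000_000, dtype=np.float64)

# Intended pattern: allocate shared memory up front and have workers
# write into it; nothing needs to travel back to the parent process.
result = psh.alloc(like=data)

def kernel(worker_id, index, value):
    result[index] = value ** 2

psh.map(kernel, data)

# Potential mispattern: returning the per-element result itself means
# every value is funnelled back through the return-value channel.
def bad_kernel(worker_id, index, value):
    return value ** 2

squares = psh.map(bad_kernel, data)  # only meaningful with this MR applied
```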
Sure, with:

```python
%%time
import time

import pandas as pd
import psutil

import pasha

def foo(worker_id, index, value):
    return value

processes_times = pd.DataFrame(columns=["Input size", "Running time"])
pasha.set_default_context("processes", num_workers=psutil.cpu_count())

for i in range(1, 40):
    start = time.perf_counter()
    pasha.map(foo, range(i**4))
    running_time = time.perf_counter() - start
    processes_times.loc[i] = [i**4, running_time]
```

I get this on the …: [plot of running time vs. input size omitted]

And on the version in the …: [plot of running time vs. input size omitted]

(I was too lazy to overlay the plots, sorry 🙈)

That's going up to ~2.5 million elements, where the overhead is ~0.4 s. But if we use a larger kernel:

```python
def foo(worker_id, index, value):
    return (value, value, value, value)
```

[plot omitted] Vs the version in …: [plot omitted]

Not sure why the baseline is lower on …

I think that's acceptable, but I could make it optional and off by default?
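If collection were made opt-in as suggested, the call site could look something like the sketch below; the `return_values` flag is purely hypothetical and was not settled anywhere in this thread:

```python
# Hypothetical opt-in sketch; the actual flag name and placement
# (per-call vs. per-context) were not decided in this discussion.
psh.map(kernel, data)  # default: return values discarded, no overhead

results = psh.map(kernel, data, return_values=True)  # explicit opt-in
num_events = sum(results)
```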