Thanks all for the feedback! I especially appreciate that a couple of you
pointed out that, because I am avoiding standard-library threading constructs
like futures, I could be limiting access to the other features they provide.
(I am experiencing that issue with other wrapped libraries in my codebase, so
I can appreciate this feedback.)
One other detail of this module that I've found useful is that @parallel.task
decorated methods deeper in the call stack are run on the
concurrent.futures.ThreadPoolExecutor provided by the @parallel.threads
context manager without carrying around a reference to the ThreadPoolExecutor.

I agree that it doesn't seem appropriate for the standard library, but it
sounds like there could be an audience for a PyPI library.

Thanks and best!

Sean

On Mon, Feb 10, 2020 at 4:46 AM Kyle Stanley <[email protected]> wrote:

> > 2. I couldn't wrap my head around the async/await and future constructs
> > particularly quickly, and I was concerned that my team would also struggle
> > with this change.
>
> > 3. I believe the procedural style glue code we have is quite easy to
> > comprehend, which I think has a positive impact on scale.
>
> While I can certainly understand the appeal of the simplicity of the
> ``@parallel.task`` decorator used in the example, I strongly suspect that it
> will end up becoming increasingly tangled as the needs grow in complexity.
> I'd bet on something like this being highly convenient in the short term,
> but very costly in the long term if it eventually becomes unmaintainable
> and has to be reconstructed from the ground up (which seems rather likely).
> It would also lose out on much of the useful functionality of futures, such
> as cancellations, timeouts, and asynchronously iterating over results in
> the order of completion, as they finish, rather than the order sent (with
> ``cf.as_completed()``); just to name a few.
>
> I can also understand that it takes some time to get used to how futures
> work, but it's well worth the effort and time to develop a solid
> fundamental understanding for building scalable back-end systems. Many
> asynchronous and concurrent frameworks (including in other languages, such
> as C++, Java, and C#) utilize futures in a similar manner, so the general
> concepts apply universally for the most part.
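[For readers curious how a decorator/context-manager pair like this could hang
together, here is a minimal sketch. The names @parallel.task and
parallel.threads mirror the post, but everything else -- the env-var names
(PARALLEL_DISABLED, PARALLEL_WORKERS), the join-on-exit behavior, the
module-global executor -- is assumption, not Sean's actual implementation.]

```python
# Hypothetical sketch of a parallel.task / parallel.threads pair.
# Not the library from the post; details are guesses.
import concurrent.futures
import contextlib
import functools
import os

_executor = None   # set while inside the threads() context manager
_futures = []      # futures submitted by @task while the pool is open


@contextlib.contextmanager
def threads(max_workers=None):
    """Provide an ambient ThreadPoolExecutor to @task calls made inside."""
    global _executor
    if os.environ.get("PARALLEL_DISABLED"):  # hypothetical kill switch
        yield                                # everything runs serially
        return
    max_workers = max_workers or int(os.environ.get("PARALLEL_WORKERS", "8"))
    with concurrent.futures.ThreadPoolExecutor(max_workers) as pool:
        _executor = pool
        try:
            yield
            # Join all submitted tasks; .result() re-raises any exception.
            for fut in concurrent.futures.as_completed(list(_futures)):
                fut.result()
        finally:
            _executor = None
            _futures.clear()


def task(func):
    """Run on the ambient pool if one is active; otherwise run serially.

    Because the executor is ambient (module-level) rather than passed as an
    argument, decorated functions deeper in the call stack pick it up without
    carrying a reference -- the behavior described in the post.
    """
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if _executor is None:
            return func(*args, **kwargs)  # no-op fallback: serial execution
        _futures.append(_executor.submit(func, *args, **kwargs))
    return wrapper
```

Usage would then look like the unthreaded original with two lines added:
decorate the I/O-bound function with @task, and wrap the loop that calls it
in `with threads(4):`; the block exit waits for all queued work.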
> It's a similar story with async/await syntax (which is present in C# and
> JS; and upcoming in C++20).
>
> That being said, I think the above syntax could be useful for simple
> scripts, prototyping, or perhaps for educational purposes. I could see it
> potentially having some popularity on PyPI for those use cases, but I don't
> think it has a place in the standard library.
>
> On Sat, Feb 8, 2020 at 6:08 PM Sean McIntyre <[email protected]> wrote:
>
>> Hi folks,
>>
>> I'd like to get some feedback on a multi-threading interface I've been
>> thinking about and using for the past year or so. I won't bury the lede;
>> see my approach here
>> <https://gist.github.com/boxysean/3ed325ebb75db0303002f9484821e553#file-my_example-py>.
>>
>> *Background / problem:*
>>
>> A couple of years ago, I inherited my company's codebase to get data into
>> our data warehouse using an ELT approach (extract-and-loads done in Python,
>> transforms done in dbt/SQL). The codebase has dozens of Python scripts to
>> integrate first-party and third-party data from databases, FTPs, and APIs,
>> which are run on a scheduler (typically daily or hourly). The scripts I
>> inherited were single-threaded procedural scripts, looking like glue code
>> and spending most of their time in network I/O. (See example
>> <https://gist.github.com/boxysean/3ed325ebb75db0303002f9484821e553#file-unthreaded_example-py>.)
>> This got my company pretty far!
>>
>> As my team and I added more and more integrations with more and more
>> data, we wanted faster and faster scripts to reduce our dev cycles
>> and reduce our multi-hour nightly jobs to minutes.
>> Because our scripts were network-bound, multi-threading was a good way to
>> accomplish this, and so I looked into concurrent.futures (example
>> <https://gist.github.com/boxysean/3ed325ebb75db0303002f9484821e553#file-concurrent_futures_example-py>)
>> and asyncio (example
>> <https://gist.github.com/boxysean/3ed325ebb75db0303002f9484821e553#file-asyncio_example-py>),
>> but I decided against these options because:
>>
>> 1. It wasn't immediately apparent how to adapt my codebase to use these
>> libraries without fundamental changes to our execution platform, reworking
>> our scripts from the ground up, and/or adding significant lines of
>> multi-threading code to each script.
>>
>> 2. I couldn't wrap my head around the async/await and future constructs
>> particularly quickly, and I was concerned that my team would also struggle
>> with this change.
>>
>> 3. I believe the procedural-style glue code we have is quite easy to
>> comprehend, which I think has a positive impact on scale.
>>
>> *Solution:*
>>
>> And so, as mentioned at the top, I designed a different interface to
>> concurrent.futures.ThreadPoolExecutor that we are successfully using for
>> our extract-and-load pattern; see a basic example here
>> <https://gist.github.com/boxysean/3ed325ebb75db0303002f9484821e553#file-my_example-py>.
>> The design considerations of this interface include:
>>
>> - The usage is minimally invasive to the original unthreaded approach of
>> the codebase. (And so, teaching the library to team members has been
>> fairly straightforward despite the multi-threaded paradigm shift.)
>>
>> - The @parallel.task decorator should be used to encapsulate a
>> homogeneous method accepting different parameters. The contents of the
>> method should be primarily I/O to achieve the concurrency gains of Python
>> multi-threading.
>> - If no parallel.threads context manager has been entered, the
>> @parallel.task decorator acts as a no-op (and the code runs serially).
>>
>> - If an environment variable is set to disable the context manager, the
>> @parallel.task decorator acts as a no-op (and the code runs serially).
>>
>> - There is also an environment variable to change the number of workers
>> provided by parallel.threads (if not hard-coded).
>>
>> While it's possible to return a value from a @parallel.task method, I
>> encourage my team to use the decorator to start-and-complete work; think
>> of writing "embarrassingly parallel" methods that can be "mapped".
>>
>> A couple of other things we've implemented include a "thread barrier" for
>> the case where we want one set of tasks to complete before another set
>> begins, and a decorator for factory methods to produce cached thread-local
>> objects (helpful for ensuring thread-safe access to network clients that
>> are not thread-safe).
>>
>> *Your feedback:*
>>
>> - I'd love to hear your thoughts on my problem and solution.
>>
>> - I've done a bit of research of existing libraries on PyPI and PEPs, but
>> I don't see any similar libraries; are you aware of anything?
>>
>> - What do you suggest I do next? I'm considering publishing it, but could
>> use some tips on what to do here!
>>
>> Thanks!
>>
>> Sean McIntyre
>> _______________________________________________
>> Python-ideas mailing list -- [email protected]
>> To unsubscribe send an email to [email protected]
>> https://mail.python.org/mailman3/lists/python-ideas.python.org/
>> Message archived at
>> https://mail.python.org/archives/list/[email protected]/message/KGSMCQT4JIVFEPXULKIYMQOIZLQZUWW5/
>> Code of Conduct: http://python.org/psf/codeofconduct/
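[The "decorator for factory methods to produce cached thread-local objects"
mentioned in the quoted post could be sketched roughly as follows. The post
doesn't show the real implementation, so the decorator name and details here
are hypothetical -- only the idea (one cached instance per thread, built on
threading.local) is taken from the text.]

```python
# Hypothetical sketch of a thread-local caching factory decorator.
import functools
import threading


def thread_local_cached(factory):
    """Call the factory at most once per thread and cache the result.

    Each worker thread gets its own private instance, which makes it safe
    to use clients that are not themselves thread-safe.
    """
    local = threading.local()

    @functools.wraps(factory)
    def wrapper(*args, **kwargs):
        if not hasattr(local, "value"):
            local.value = factory(*args, **kwargs)
        return local.value
    return wrapper


@thread_local_cached
def get_client():
    # Stand-in for e.g. an FTP or HTTP client that must not be shared
    # across threads; constructed once per thread, then reused.
    return object()
```

Repeated calls within one thread return the same cached instance, while a
different thread constructs its own on first call.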
_______________________________________________
Python-ideas mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at
https://mail.python.org/archives/list/[email protected]/message/3RRJWEELZOWTYDFL2BHGR5DKT3TOFCNF/
Code of Conduct: http://python.org/psf/codeofconduct/
