Thanks all for the feedback! I especially appreciate that a couple of you
pointed out that, because I am avoiding standard-library threading constructs
like futures, I could be limiting access to the other features they provide.
(I am experiencing that issue with other wrapped libraries in my codebase, so
I can appreciate this feedback.)
One other detail of this module that I've found useful is that @parallel.task
decorated methods deeper in the call stack are run on the
concurrent.futures.ThreadPoolExecutor provided by the @parallel.threads
context manager without carrying around a reference to the ThreadPoolExecutor.

I agree that it doesn't seem appropriate for the standard library, but it
sounds like there could be an audience for a PyPI library.

Thanks and best!

Sean

On Mon, Feb 10, 2020 at 4:46 AM Kyle Stanley <[email protected]> wrote:

> > 2. I couldn't wrap my head around the async/await and future constructs
> > particularly quickly, and I was concerned that my team would also struggle
> > with this change.
>
> > 3. I believe the procedural style glue code we have is quite easy to
> > comprehend, which I think has a positive impact on scale.
>
> While I can certainly understand the appeal of the simplicity of the
> ``@parallel.task`` decorator used in the example, I strongly suspect that it
> will end up becoming increasingly tangled as the needs grow in complexity.
> I'd bet on something like this being highly convenient in the short term,
> but very costly in the long term if it eventually becomes unmaintainable
> and has to be reconstructed from the ground up (which seems rather likely).
> It would also lose out on much of the useful functionality of futures, such
> as cancellations, timeouts, and asynchronously iterating over results in
> the order of completion, as they finish, rather than the order sent (with
> ``cf.as_completed()``); just to name a few.
>
> I can also understand that it takes some time to get used to how futures
> work, but it's well worth the effort and time to develop a solid
> fundamental understanding for building scalable back-end systems. Many
> asynchronous and concurrent frameworks (including in other languages, such
> as C++, Java, and C#) utilize futures in a similar manner, so the general
> concepts apply universally for the most part.
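[For readers curious how a decorator/context-manager pair like this could hang
together, here is a minimal sketch. The names @parallel.task and
parallel.threads mirror the post, but everything else -- the env-var names
(PARALLEL_DISABLED, PARALLEL_WORKERS), the join-on-exit behavior, the
module-global executor -- is assumption, not Sean's actual implementation.]

```python
# Hypothetical sketch of a parallel.task / parallel.threads pair.
# Not the library from the post; details are guesses.
import concurrent.futures
import contextlib
import functools
import os

_executor = None   # set while inside the threads() context manager
_futures = []      # futures submitted by @task while the pool is open


@contextlib.contextmanager
def threads(max_workers=None):
    """Provide an ambient ThreadPoolExecutor to @task calls made inside."""
    global _executor
    if os.environ.get("PARALLEL_DISABLED"):  # hypothetical kill switch
        yield                                # everything runs serially
        return
    max_workers = max_workers or int(os.environ.get("PARALLEL_WORKERS", "8"))
    with concurrent.futures.ThreadPoolExecutor(max_workers) as pool:
        _executor = pool
        try:
            yield
            # Join all submitted tasks; .result() re-raises any exception.
            for fut in concurrent.futures.as_completed(list(_futures)):
                fut.result()
        finally:
            _executor = None
            _futures.clear()


def task(func):
    """Run on the ambient pool if one is active; otherwise run serially.

    Because the executor is ambient (module-level) rather than passed as an
    argument, decorated functions deeper in the call stack pick it up without
    carrying a reference -- the behavior described in the post.
    """
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if _executor is None:
            return func(*args, **kwargs)  # no-op fallback: serial execution
        _futures.append(_executor.submit(func, *args, **kwargs))
    return wrapper
```

Usage would then look like the unthreaded original with two lines added:
decorate the I/O-bound function with @task, and wrap the loop that calls it
in `with threads(4):`; the block exit waits for all queued work.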
> It's a similar story with async/await syntax (which is present in C# and
> JS; and upcoming in C++20).
>
> That being said, I think the above syntax could be useful for simple
> scripts, prototyping, or perhaps for educational purposes. I could see it
> potentially having some popularity on PyPI for those use cases, but I don't
> think it has a place in the standard library.
>
> On Sat, Feb 8, 2020 at 6:08 PM Sean McIntyre <[email protected]> wrote:
>
>> Hi folks,
>>
>> I'd like to get some feedback on a multi-threading interface I've been
>> thinking about and using for the past year or so. I won't bury the lede;
>> see my approach here
>> <https://gist.github.com/boxysean/3ed325ebb75db0303002f9484821e553#file-my_example-py>.
>>
>> *Background / problem:*
>>
>> A couple of years ago, I inherited my company's codebase to get data into
>> our data warehouse using an ELT approach (extract-and-loads done in Python,
>> transforms done in dbt/SQL). The codebase has dozens of Python scripts to
>> integrate first-party and third-party data from databases, FTPs, and APIs,
>> which are run on a scheduler (typically daily or hourly). The scripts I
>> inherited were single-threaded procedural scripts, looking like glue code
>> and spending most of their time in network I/O. (See example
>> <https://gist.github.com/boxysean/3ed325ebb75db0303002f9484821e553#file-unthreaded_example-py>.)
>> This got my company pretty far!
>>
>> As my team and I added more and more integrations with more and more
>> data, we wanted faster and faster scripts to reduce our dev cycles
>> and reduce our multi-hour nightly jobs to minutes.
>> Because our scripts were network-bound, multi-threading was a good way to
>> accomplish this, and so I looked into concurrent.futures (example
>> <https://gist.github.com/boxysean/3ed325ebb75db0303002f9484821e553#file-concurrent_futures_example-py>)
>> and asyncio (example
>> <https://gist.github.com/boxysean/3ed325ebb75db0303002f9484821e553#file-asyncio_example-py>),
>> but I decided against these options because:
>>
>> 1. It wasn't immediately apparent how to adapt my codebase to use these
>> libraries without fundamental changes to our execution platform, reworking
>> our scripts from the ground up, and/or adding significant lines of
>> multi-threading code to each script.
>>
>> 2. I couldn't wrap my head around the async/await and future constructs
>> particularly quickly, and I was concerned that my team would also struggle
>> with this change.
>>
>> 3. I believe the procedural-style glue code we have is quite easy to
>> comprehend, which I think has a positive impact on scale.
>>
>> *Solution:*
>>
>> And so, as mentioned at the top, I designed a different interface to
>> concurrent.futures.ThreadPoolExecutor that we are successfully using for
>> our extract-and-load pattern; see a basic example here
>> <https://gist.github.com/boxysean/3ed325ebb75db0303002f9484821e553#file-my_example-py>.
>> The design considerations of this interface include:
>>
>> - The usage is minimally invasive to the original unthreaded approach of
>> the codebase. (And so, teaching the library to team members has been
>> fairly straightforward despite the multi-threaded paradigm shift.)
>>
>> - The @parallel.task decorator should be used to encapsulate a
>> homogeneous method accepting different parameters. The contents of the
>> method should be primarily I/O to achieve the concurrency gains of Python
>> multi-threading.
>> - If no parallel.threads context manager has been entered, the
>> @parallel.task decorator acts as a no-op (and the code runs serially).
>>
>> - If an environment variable is set to disable the context manager, the
>> @parallel.task decorator acts as a no-op (and the code runs serially).
>>
>> - There is also an environment variable to change the number of workers
>> provided by parallel.threads (if not hard-coded).
>>
>> While it's possible to return a value from a @parallel.task method, I
>> encourage my team to use the decorator to start-and-complete work; think
>> of writing "embarrassingly parallel" methods that can be "mapped".
>>
>> A couple of other things we've implemented include a "thread barrier" for
>> the case where we want one set of tasks to complete before another set
>> begins, and a decorator for factory methods to produce cached thread-local
>> objects (helpful for ensuring thread-safe access to network clients that
>> are not thread-safe).
>>
>> *Your feedback:*
>>
>> - I'd love to hear your thoughts on my problem and solution.
>>
>> - I've done a bit of research of existing libraries on PyPI and PEPs, but
>> I don't see any similar libraries; are you aware of anything?
>>
>> - What do you suggest I do next? I'm considering publishing it, but could
>> use some tips on what to do here!
>>
>> Thanks!
>>
>> Sean McIntyre
>> _______________________________________________
>> Python-ideas mailing list -- [email protected]
>> To unsubscribe send an email to [email protected]
>> https://mail.python.org/mailman3/lists/python-ideas.python.org/
>> Message archived at
>> https://mail.python.org/archives/list/[email protected]/message/KGSMCQT4JIVFEPXULKIYMQOIZLQZUWW5/
>> Code of Conduct: http://python.org/psf/codeofconduct/
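[The "decorator for factory methods to produce cached thread-local objects"
mentioned in the quoted post could be sketched roughly as follows. The post
doesn't show the real implementation, so the decorator name and details here
are hypothetical -- only the idea (one cached instance per thread, built on
threading.local) is taken from the text.]

```python
# Hypothetical sketch of a thread-local caching factory decorator.
import functools
import threading


def thread_local_cached(factory):
    """Call the factory at most once per thread and cache the result.

    Each worker thread gets its own private instance, which makes it safe
    to use clients that are not themselves thread-safe.
    """
    local = threading.local()

    @functools.wraps(factory)
    def wrapper(*args, **kwargs):
        if not hasattr(local, "value"):
            local.value = factory(*args, **kwargs)
        return local.value
    return wrapper


@thread_local_cached
def get_client():
    # Stand-in for e.g. an FTP or HTTP client that must not be shared
    # across threads; constructed once per thread, then reused.
    return object()
```

Repeated calls within one thread return the same cached instance, while a
different thread constructs its own on first call.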
_______________________________________________
Python-ideas mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at
https://mail.python.org/archives/list/[email protected]/message/3RRJWEELZOWTYDFL2BHGR5DKT3TOFCNF/
Code of Conduct: http://python.org/psf/codeofconduct/
