[Python-Dev] Re: PoC: Subinterpreters 4x faster than sequential execution or threads on CPU-bound workaround
On Thu, 7 May 2020 at 01:34, Cameron Simpson wrote:
> Maybe I'm missing something, but the example that comes to my mind is embedding a Python interpreter in an existing non-Python programme.
>
> My pet one-day-in-the-future example is mutt, whose macro language is... crude. And mutt is single threaded.
>
> However, it is easy to envisage a monolithic multithreaded programme which has use for Python subinterpreters to work on the larger programme's in-memory data structures.
>
> I haven't a real world example to hand, but that is the architectural situation where I'd consider multiprocessing to be inappropriate or infeasible because the target data are all in the one memory space.

Vim would be a very good example of this. Vim has Python interpreter support, but multiprocessing would not be viable, as you say. And from my recollection, experiments with threading didn't end well when I tried them :-)

Paul
[Python-Dev] Re: PoC: Subinterpreters 4x faster than sequential execution or threads on CPU-bound workaround
On Wed, May 6, 2020 at 12:36 PM Nathaniel Smith wrote:
> Sure, zero cost is always better than some cost, I'm not denying that :-). What I'm trying to understand is whether the difference is meaningful enough to justify subinterpreters' increased complexity, fragility, and ecosystem breakage.
>
> If your data is in large raw memory buffers to start with (like numpy arrays or arrow dataframes), then yeah, serialization costs are a smaller proportion of IPC costs. And out-of-band buffers are an elegant way of letting pickle users take advantage of that speedup while still using the familiar pickle API. Thanks for writing that PEP :-).
>
> But when you're in the regime where you're working with large raw memory buffers, then that's also the regime where inter-process shared-memory becomes really efficient. Hence projects like Ray/Plasma [1], which exist today, and even work for sharing data across languages and across multi-machine clusters. And the pickle out-of-band buffer API is general enough to work with shared memory too.
>
> And even if you can't quite manage zero-copy, and have to settle for one-copy... optimized raw data copying is just *really fast*, similar to memory access speeds. And CPU-bound, big-data-crunching apps are by definition going to access that memory and do stuff with it that's much more expensive than a single memcpy. So I still have trouble figuring out how skipping a single memcpy will make subinterpreters significantly faster than subprocesses in any real-world scenario.

While large object copies are fairly fast -- I wouldn't say trivial, a gigabyte copy will introduce noticeable lag when processing enough of them -- the flip side of having large objects is that you want to avoid having so many copies that you run into memory pressure and the dreaded swapping. A multiprocessing engine that's fully parallel, where every fork takes a chunk of data and does everything needed to it, won't gain much from zero-copy as long as memory limits aren't hit. But a pipeline of processing would involve many copies, especially if you have a central dispatch thread that passes things from stage to stage.

This is a big deal where stages may run longer or slower at any time, especially in low-latency applications like video conferencing, where dispatch needs the flexibility to skip steps or add extra workers to shove a frame out the door, and using signals to tell separate processes to do so adds latency and overhead.

Not that I'm recommending someone go out and make a pure Python videoconferencing unit right now, but it's a use case I'm familiar with. (Since I use Python to test new ideas before converting them into C++.)
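For concreteness, a minimal sketch of the out-of-band buffer API mentioned above (PEP 574, pickle protocol 5, Python 3.8+), using a plain bytearray wrapped in pickle.PickleBuffer in place of a real numpy array or Arrow column; the payload and its size are illustrative only:

    import pickle

    # A large contiguous buffer standing in for a numpy array or Arrow column.
    payload = bytearray(b"\x00" * (8 * 1024 * 1024))  # 8 MiB

    # With protocol 5, PickleBuffer objects are handed to buffer_callback
    # instead of being copied into the pickle stream.
    buffers = []
    data = pickle.dumps(pickle.PickleBuffer(payload), protocol=5,
                        buffer_callback=buffers.append)

    # `data` is a tiny metadata stream; the raw bytes stay in `buffers` and
    # can be transferred or mapped separately (e.g. via shared memory)
    # without another copy on the serialization side.
    restored = pickle.loads(data, buffers=buffers)
    assert bytes(restored) == bytes(payload)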
[Python-Dev] for Glenn Linderman Re: Re: Improvement to SimpleNamespace
Apologies to other list members. Glenn, we were having a conversation off list and there's no evidence my replies reached you. Could you have a glance in your spam (if you have such a thing) to see if my messages are lying there idle? From the 15th and 20th of April.

GMail certainly seems to have a personal dislike for me, and I'm fearing something similar may be at play for you.

Again, my apologies to other list members.

Thanks, Cameron Simpson
[Python-Dev] Re: PoC: Subinterpreters 4x faster than sequential execution or threads on CPU-bound workaround
On Thu, May 7, 2020 at 2:50 AM Emily Bowman wrote:
> While large object copies are fairly fast -- I wouldn't say trivial, a gigabyte copy will introduce noticeable lag when processing enough of them -- the flip side of having large objects is that you want to avoid having so many copies that you run into memory pressure and the dreaded swapping. A multiprocessing engine that's fully parallel, where every fork takes a chunk of data and does everything needed to it, won't gain much from zero-copy as long as memory limits aren't hit. But a pipeline of processing would involve many copies, especially if you have a central dispatch thread that passes things from stage to stage. This is a big deal where stages may run longer or slower at any time, especially in low-latency applications like video conferencing, where dispatch needs the flexibility to skip steps or add extra workers to shove a frame out the door, and using signals to tell separate processes to do so adds latency and overhead.
>
> Not that I'm recommending someone go out and make a pure Python videoconferencing unit right now, but it's a use case I'm familiar with. (Since I use Python to test new ideas before converting them into C++.)

Thanks for the insight, Emily (and everyone else). It's really helpful to get many different expert perspectives on the matter. I am definitely not an expert on big-data/high-performance use cases so, personally, I rely on folks like Nathaniel, Travis Oliphant, and yourself. The more, the better. :)

Again, thanks!

-eric
[Python-Dev] Re: PoC: Subinterpreters 4x faster than sequential execution or threads on CPU-bound workaround
On Tue, May 5, 2020 at 6:44 PM Joseph Jenne via Python-Dev wrote:
> I'm seeing a drop in performance of both multiprocess and subinterpreter based runs in the 8-CPU case, where performance drops by about half despite having enough logical CPUs, while the other cases scale quite well. Is there some issue with python multiprocessing/subinterpreters on the same logical core?

This is not a Python issue at all, but a limitation of logical cores: logical cores on the same physical core share its execution resources, so once you go past the number of physical cores the workers start contending with each other.

Actually it would probably be bad if Python *didn't* scale this way, because that would indicate that a Python process that should be running full-blast isn't actually utilizing all the physical resources of a CPU!

-Cody
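For anyone reproducing the benchmark, a quick way to check whether the drop coincides with exhausting physical cores is to compare logical and physical counts. A small sketch, assuming the third-party psutil package is available:

    import os
    import psutil  # third-party; assumed installed

    logical = os.cpu_count()                    # logical CPUs (includes SMT siblings)
    physical = psutil.cpu_count(logical=False)  # physical cores only

    print("logical CPUs:  ", logical)
    print("physical cores:", physical)
    if logical and physical and logical > physical:
        print("SMT/hyper-threading is on: past", physical,
              "workers, new workers share execution units with existing ones.")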
[Python-Dev] Re: Issues with import_fresh_module
Maybe an initialization/import side-effect bug which is triggered if the module is imported twice?
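A hypothetical illustration of the kind of bug meant here: a module whose import-time side effects touch interpreter-wide state, so loading a second, fresh copy via import_fresh_module doubles the effect. Module and names below are made up:

    # spam.py -- hypothetical module with import-time side effects
    import atexit

    class Resource:
        def close(self):
            print("closing")

    # Interpreter-wide side effect at import time: a second (fresh) import
    # registers a second hook, so "closing" prints twice at shutdown.
    _resource = Resource()
    atexit.register(_resource.close)

    # Classes defined here are also duplicated by a fresh import, so objects
    # created through the first copy fail isinstance() checks against the
    # second copy's classes.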
[Python-Dev] Re: PoC: Subinterpreters 4x faster than sequential execution or threads on CPU-bound workaround
On Wed, May 6, 2020 at 1:14 PM Serhiy Storchaka wrote:
> 06.05.20 00:46, Victor Stinner wrote:
> > Subinterpreters and multiprocessing have basically the same speed on this benchmark.
>
> It does not look like there are any advantages of subinterpreters over multiprocessing.

There is not an implementation worthy of comparison at this point, no. I don't believe meaningful conclusions of that comparative nature can be drawn from the current work. We shouldn't be blocking any decision on reducing our existing tech debt around subinterpreters on a viable multi-core solution existing. There are benchmarks I could propose that I predict would show a different result even today, but I'm refraining because I believe such things to be a distraction.

> I am wondering how much 3.9 will be slower than 3.8 in single-thread single-interpreter mode after getting rid of all process-wide singletons and caches (Py_None, Py_True, Py_NotImplemented, small integers, strings, tuples, _Py_IDENTIFIER, _PyArg_Parser, etc.). Not to mention breaking binary compatibility.

I'm not worried, because it won't happen in 3.9. :) Nobody is seriously proposing that that be done in that manner. The existing example work Victor did here (thanks!) was a rapid prototype where the easiest approach to getting _something_ running parallel as a demo was just to disable a bunch of shared global things instead of also doing the much larger work to make those per-interpreter. That isn't how we'd likely ever actually land this kind of change.

Longer term we need to aim to get rid of process-global state by moving it into per-interpreter state. No matter what. This isn't something only needed by subinterpreters. Corralling everything into a per-interpreter state with proper initialization and finalization everywhere allows other nice things like multiple independent interpreters in a process. Even sequentially (spin up, tear down, spin up, tear down, repeat...). We cannot reliably do that today without side effects such as duplicate initializations and resulting resource leaks or worse.

Even if such per-interpreter state instead of per-process state isolation is never used for parallel execution, I still want to see it happen. Python already loses out to Lua because of this. Lua is easily embedded in a self-contained fashion. CPython has never been. This kind of work helps open up that world instead of relegating us to only single life-of-the-process long-lived language VM uses that we can serve today.

-gps
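To make the "spin up, tear down, repeat" idea concrete at the Python level, here is a minimal sketch using the private, experimental _xxsubinterpreters module that ships with current CPython (an implementation detail of the PEP 554 work; names and availability may change):

    import _xxsubinterpreters as interpreters  # private, experimental CPython module

    # Spin up an interpreter, run some isolated code in it, tear it down, repeat.
    for i in range(3):
        interp = interpreters.create()
        try:
            script = f"print('hello from sub-interpreter run {i}')"
            interpreters.run_string(interp, script)
        finally:
            interpreters.destroy(interp)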
[Python-Dev] Deprecate os.removedirs() and os.renames()
It seems to me that os.removedirs() and os.renames() were added just for symmetry with os.makedirs(). All three functions have a similar structure and were added in the same commit. It seems they were initially code examples of using some os.path and os functions.

Unlike the quite popular os.makedirs(), os.removedirs() and os.renames() are not used in the stdlib and are rarely used in third-party code. os.removedirs() is considered the opposite of os.makedirs(), and os.renames() is a combination of os.makedirs(), os.rename() and os.removedirs(). The problems with them are:

1. They do not remove a directory if any file or other subdirectory is left. They just stop removing and return success. To the user it looks like they do not work as expected, but they need to test the existence of the directory explicitly to check this.

2. They can remove more than expected. If the parent directory was empty before calling os.makedirs(), the following os.removedirs() will remove not just the newly created directories, but the parent directory, and its parent if it contained a single directory, and so on.

os.removedirs() is not the opposite of os.makedirs(). It can remove less or more, and you have no control over how much it will remove. It is better to use shutil.rmtree().

os.renames() correspondingly can be replaced by os.rename() or shutil.move(), with a possible addition of os.makedirs() and shutil.rmtree() if needed.

I propose to deprecate these functions and remove them in future Python versions.
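A small sketch of the two surprises described above, using a throwaway temporary directory; shutil.rmtree() is the explicit alternative:

    import os
    import shutil
    import tempfile

    root = tempfile.mkdtemp()

    # Surprise 1: stops silently at the first non-empty directory.
    os.makedirs(os.path.join(root, "a", "b", "c"))
    open(os.path.join(root, "a", "keep.txt"), "w").close()
    os.removedirs(os.path.join(root, "a", "b", "c"))  # removes c and b, keeps a
    print(os.path.exists(os.path.join(root, "a")))    # True: "a" quietly survived

    # Surprise 2: climbs past what you created.  If "parent" is left empty,
    # os.removedirs() deletes it too, not just the "child" you made.
    parent = os.path.join(root, "parent")
    os.makedirs(os.path.join(parent, "child"))
    os.removedirs(os.path.join(parent, "child"))
    print(os.path.exists(parent))                     # False: parent is gone too

    # The explicit alternative: remove exactly the tree you mean to remove.
    shutil.rmtree(root)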
[Python-Dev] Re: Deprecate os.removedirs() and os.renames()
Serhiy Storchaka wrote:
> I propose to deprecate these functions and remove them in future Python versions.

+1, assuming the deprecation lasts for at least two versions and the available alternatives are explicitly mentioned in the What's New entry (for both the version they're initially deprecated in and the one they're removed in).

Although, I suspect that the deprecation may need to be longer than two versions because of possible breakage in older libraries or legacy code. They don't seem at all common compared to shutil.move() and shutil.rmtree(), but I do vaguely recall seeing some usage of os.renames() and os.removedirs() in third-party code.
[Python-Dev] Re: PoC: Subinterpreters 4x faster than sequential execution or threads on CPU-bound workaround
On Wed, 6 May 2020 at 22:10, Serhiy Storchaka wrote:
> I am wondering how much 3.9 will be slower than 3.8 in single-thread single-interpreter mode after getting rid of all process-wide singletons and caches (Py_None, Py_True, Py_NotImplemented, small integers, strings, tuples, _Py_IDENTIFIER, _PyArg_Parser, etc.). Not to mention breaking binary compatibility.

There is no plan to remove caches like small integers, _Py_IDENTIFIER or _PyArg_Parser. The plan is to make these caches "per-interpreter". I already modified small integers to make them per-interpreter.

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.
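For readers unfamiliar with the cache being discussed: CPython pre-allocates small ints (roughly -5 to 256) as shared singletons, which is what is being moved from process-wide to per-interpreter state. The identity behaviour below is a CPython implementation detail, not a language guarantee:

    # CPython implementation detail: small ints come from a shared cache, so
    # constructing the "same" small value twice yields the same object.
    a = int("100")
    b = int("100")
    print(a is b)        # True on CPython: both names point at the cached singleton

    # Values outside the cache are distinct objects when created at runtime.
    c = int("10000")
    d = int("10000")
    print(c is d)        # False on CPython: two separate int objects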
[Python-Dev] Re: Issues with import_fresh_module
To expand on my earlier comment about changing the module under test to make your testing easier: asyncio is one library that has lots of tests of different combinations of its C and Python implementations being used together. As far as I know, it doesn't use import_fresh_module or similar hackery. Instead it exposes a private way of getting at the parallel Python implementation:

https://github.com/python/cpython/blob/b7a78ca74ab539943ab11b5c4c9cfab7f5b7ff5a/Lib/asyncio/futures.py#L271-L272

This is the kind of thing I was suggesting. (It might require more setup than this in your case.)

--Chris

On Thu, May 7, 2020 at 11:33 AM Brett Cannon wrote:
> Maybe an initialization/import side-effect bug which is triggered if the module is imported twice?
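The pattern behind the asyncio futures.py link above, sketched for a generic module (the module and accelerator names here are hypothetical): keep the pure-Python class reachable under a private name before the C version shadows it, so tests can reach both implementations without re-import tricks:

    # mylib/futures.py -- hypothetical module following the asyncio pattern

    class Future:
        """Pure-Python implementation."""
        def result(self):
            raise NotImplementedError("sketch only")

    # Keep the Python version importable for tests even after it is shadowed.
    _PyFuture = Future

    try:
        import _mylib_futures  # hypothetical C accelerator module
    except ImportError:
        _CFuture = None
    else:
        Future = _CFuture = _mylib_futures.Future

    # Tests can then do:
    #     from mylib.futures import _PyFuture, _CFuture
    # and parametrize over both implementations.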
[Python-Dev] Re: Latest PEP 554 updates.
On 06/05/2020 21:52, Eric Snow wrote:
> On Wed, May 6, 2020 at 2:25 PM Jeff Allen wrote:
>> ... My reason for worrying about this is that, while the C-API has been there for some time, it has not had heavy use in taxing cases AFAIK, and I think there is room for it to be incorrect. I am thinking more about Jython than CPython, but ideally they are the same structures. When I put the structures to taxing use cases on paper, they don't seem quite to work. Jython has been used in environments with thread-pools, concurrency, and multiple interpreters, and this aspect has had to be "fixed" several times.
>
> That insight would be super helpful and much appreciated. :) Is that all on the docs you've linked?

As far as it goes. I intended to (will eventually) elaborate the more complex cases, such as concurrency and application servers, where I think a Thread may have "history" in a runtime that should be ignored. There's more on my local repo, but not about this yet. I have linked you into one page of a large and rambling (at times) account of experiments I'm doing. Outside be dragons.

The other thing I might point to would be Jython bugs that may be clues something is still wrong conceptually, or at least justify getting those concepts clear (https://bugs.jython.org issues 2642, 2507, 2513, 2846, 2465, 2107, to name a few).

> This is great stuff, Jeff! Thanks for sharing it. I was able to skim through but don't have time to dig in at the moment. I'll reply in detail as soon as I can.

Thanks. I hope it's a positive contribution. Isn't PlantUML awesome?

The key argument (or where I'm mistaken) is that, once you start sharing objects, only the function you call knows the right Interpreter (import context) to use, so in principle it is different in every frame. You can't get to it from the current thread.

Jeff
