[Python-Dev] Re: PoC: Subinterpreters 4x faster than sequential execution or threads on CPU-bound workload

2020-05-07 Thread Paul Moore
On Thu, 7 May 2020 at 01:34, Cameron Simpson  wrote:

> Maybe I'm missing something, but the example that comes to my mind is
> embedding a Python interpreter in an existing non-Python programme.
>
> My pet one-day-in-the-future example is mutt, whose macro language is...
> crude.  And mutt is single threaded.
>
> However, it is easy to envisage a monolithic multithreaded programme
> which has use for Python subinterpreters to work on the larger
> programme's in-memory data structures.
>
> I haven't a real world example to hand, but that is the architectural
> situation where I'd consider multiprocessing to be inappropriate or
> infeasible because the target data are all in the one memory space.

Vim would be a very good example of this. Vim has Python interpreter
support, but multiprocessing would not be viable, as you say. And from
my recollection, experiments with threading didn't end well when I
tried them :-)

Paul


[Python-Dev] Re: PoC: Subinterpreters 4x faster than sequential execution or threads on CPU-bound workload

2020-05-07 Thread Emily Bowman
On Wed, May 6, 2020 at 12:36 PM Nathaniel Smith  wrote:

>
> Sure, zero cost is always better than some cost, I'm not denying that
> :-). What I'm trying to understand is whether the difference is
> meaningful enough to justify subinterpreters' increased complexity,
> fragility, and ecosystem breakage.
>
> If your data is in large raw memory buffers to start with (like numpy
> arrays or arrow dataframes), then yeah, serialization costs are a
> smaller proportion of IPC costs. And out-of-band buffers are an
> elegant way of letting pickle users take advantage of that speedup
> while still using the familiar pickle API. Thanks for writing that PEP
> :-).
>
> But when you're in the regime where you're working with large raw
> memory buffers, then that's also the regime where inter-process
> shared-memory becomes really efficient. Hence projects like Ray/Plasma
> [1], which exist today, and even work for sharing data across
> languages and across multi-machine clusters. And the pickle
> out-of-band buffer API is general enough to work with shared memory
> too.
>
> And even if you can't quite manage zero-copy, and have to settle for
> one-copy... optimized raw data copying is just *really fast*, similar
> to memory access speeds. And CPU-bound, big-data-crunching apps are by
> definition going to access that memory and do stuff with it that's
> much more expensive than a single memcpy. So I still have trouble
> figuring out how skipping a single memcpy will make subinterpreters
> significantly faster than subprocesses in any real-world scenario.
>
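
(For concreteness, here is a minimal sketch of the out-of-band buffer API
being discussed; it assumes CPython 3.8+ and uses NumPy only as a
convenient source of a large raw buffer.)

    import pickle
    import numpy as np

    data = np.zeros(100_000_000, dtype=np.uint8)   # a large raw buffer

    # With protocol 5 (PEP 574), large buffers are handed to
    # buffer_callback instead of being copied into the pickle stream,
    # so a transport can move them zero-copy (or place them in shared
    # memory).
    buffers = []
    payload = pickle.dumps(data, protocol=5,
                           buffer_callback=buffers.append)

    # The receiving side reattaches the buffers out of band.
    restored = pickle.loads(payload, buffers=buffers)
    assert np.array_equal(data, restored)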

While large object copies are fairly fast -- I wouldn't say trivial; a
gigabyte copy will introduce noticeable lag when you process enough of them
-- the flip side of having large objects is that you want to avoid holding
so many copies that you run into memory pressure and the dreaded swapping.
A multiprocessing engine that is fully parallel, where every fork takes a
chunk of data and does everything needed to it, won't gain much from
zero-copy as long as memory limits aren't hit. But a processing pipeline
involves many copies, especially if a central dispatch thread passes things
from stage to stage. This is a big deal when any stage can slow down or
fall behind at any time, especially in low-latency applications like video
conferencing, where dispatch needs the flexibility to skip steps or add
extra workers to shove a frame out the door; using signals to tell separate
processes to do so adds latency and overhead.

Not that I'm recommending someone go out and make a pure Python
videoconferencing unit right now, but it's a use case I'm familiar with.
(Since I use Python to test new ideas before converting them into C++.)
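
(For reference, the kind of inter-process shared memory mentioned above has
been in the stdlib since Python 3.8; a minimal sketch, with the buffer size
and contents chosen purely for illustration.)

    from multiprocessing import shared_memory
    import numpy as np

    # Producer: allocate a named shared-memory block and view it as an array.
    shm = shared_memory.SharedMemory(create=True, size=1_000_000)
    frame = np.ndarray((1_000_000,), dtype=np.uint8, buffer=shm.buf)
    frame[:] = 42                      # "render" a frame in place

    # Consumer: normally in another process; only the block's name is
    # passed from stage to stage, so the hand-off copies no frame data.
    peer = shared_memory.SharedMemory(name=shm.name)
    view = np.ndarray((1_000_000,), dtype=np.uint8, buffer=peer.buf)
    assert view[0] == 42

    peer.close()
    shm.close()
    shm.unlink()                       # producer releases the block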


[Python-Dev] for Glenn Linderman Re: Re: Improvement to SimpleNamespace

2020-05-07 Thread Cameron Simpson

Apologies to other list members.

Glenn, we were having a conversation off list and there's no evidence my 
replies reached you. Could you have a glance in your spam (if you have 
such a thing) to see if my messages are lying there idle? From the 15th 
and 20th of April.


GMail certainly seems to have a personal dislike for me, and I'm fearing 
something similar may be at play for you.


Again, my apologies to other list members.

Thanks,
Cameron Simpson 


[Python-Dev] Re: PoC: Subinterpreters 4x faster than sequential execution or threads on CPU-bound workload

2020-05-07 Thread Eric Snow
On Thu, May 7, 2020 at 2:50 AM Emily Bowman  wrote:
> While large object copies are fairly fast -- I wouldn't say trivial; a
> gigabyte copy will introduce noticeable lag when you process enough of them
> -- the flip side of having large objects is that you want to avoid holding
> so many copies that you run into memory pressure and the dreaded swapping.
> A multiprocessing engine that is fully parallel, where every fork takes a
> chunk of data and does everything needed to it, won't gain much from
> zero-copy as long as memory limits aren't hit. But a processing pipeline
> involves many copies, especially if a central dispatch thread passes things
> from stage to stage. This is a big deal when any stage can slow down or
> fall behind at any time, especially in low-latency applications like video
> conferencing, where dispatch needs the flexibility to skip steps or add
> extra workers to shove a frame out the door; using signals to tell separate
> processes to do so adds latency and overhead.
>
> Not that I'm recommending someone go out and make a pure Python 
> videoconferencing unit right now, but it's a use case I'm familiar with. 
> (Since I use Python to test new ideas before converting them into C++.)

Thanks for the insight, Emily (and everyone else).  It's really
helpful to get many different expert perspectives on the matter.  I am
definitely not an expert on big-data/high-performance use cases so,
personally, I rely on folks like Nathaniel, Travis Oliphant, and
yourself.  The more, the better. :)  Again, thanks!

-eric


[Python-Dev] Re: PoC: Subinterpreters 4x faster than sequential execution or threads on CPU-bound workload

2020-05-07 Thread Cody Piersall
On Tue, May 5, 2020 at 6:44 PM Joseph Jenne via Python-Dev
 wrote:
>
> I'm seeing a drop in performance of both the multiprocessing- and
> subinterpreter-based runs in the 8-CPU case, where performance drops by
> about half despite having enough logical CPUs, while the other cases
> scale quite well. Is there some issue with Python
> multiprocessing/subinterpreters on the same logical core?

This is not a Python issue at all, but a limitation of logical cores.
Sibling logical cores share the execution resources of one physical core,
so two busy workers scheduled onto the same physical core end up contending
for the same hardware.

Actually it would probably be bad if Python *didn't* scale this way,
because that would indicate that a Python process that should be
running full-blast isn't actually utilizing all the physical resources
of a CPU!
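
(If you want to size a CPU-bound pool by physical rather than logical
cores, here is a minimal sketch; psutil is a third-party assumption, since
the stdlib only reports logical cores.)

    import os
    import psutil  # third-party; assumed available

    logical = os.cpu_count()                    # counts hyper-threads
    physical = psutil.cpu_count(logical=False)  # counts physical cores

    # Two CPU-bound workers on sibling logical cores share one core's
    # execution units, so sizing the pool by physical cores avoids the
    # roughly-halved per-worker throughput described above.
    workers = physical or logical
    print(f"logical={logical} physical={physical} -> pool size {workers}")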

-Cody


[Python-Dev] Re: Issues with import_fresh_module

2020-05-07 Thread Brett Cannon
Maybe an initialization/import side-effect bug which is triggered if the module 
is imported twice?
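
(A minimal sketch of the double-import scenario, using the json/_json pair
from the documentation as a stand-in for the module in question; in 3.8 the
helper lives in test.support.)

    from test.support import import_fresh_module

    # A copy of the module with its C accelerator blocked...
    py_json = import_fresh_module('json', blocked=['_json'])
    # ...and a second, fresh copy with the accelerator re-imported.
    c_json = import_fresh_module('json', fresh=['_json'])

    # The module-level code has now run more than once.  Any import side
    # effect -- registering callbacks, mutating another module, caching
    # state in a process-wide global -- happens again on the second
    # import, which is where this class of bug tends to surface.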


[Python-Dev] Re: PoC: Subinterpreters 4x faster than sequential execution or threads on CPU-bound workload

2020-05-07 Thread Gregory P. Smith
On Wed, May 6, 2020 at 1:14 PM Serhiy Storchaka  wrote:

> On 06.05.20 00:46, Victor Stinner wrote:
> > Subinterpreters and multiprocessing have basically the same speed on
> > this benchmark.
>
> It does not look like subinterpreters have any advantage over
> multiprocessing.
>

There is not an implementation worthy of comparison at this point, no.  I
don't believe meaningful conclusions of that comparative nature can be
drawn from the current work.  We shouldn't block any decision on reducing
our existing tech debt around subinterpreters on the existence of a viable
multi-core solution.  There are benchmarks I could propose that I predict
would show a different result even today, but I'm refraining because I
believe such things would be a distraction.

> I am wondering how much 3.9 will be slower than 3.8 in single-thread
> single-interpreter mode after getting rid of all process-wide singletons
> and caches (Py_None, Py_True, Py_NotImplemented, small integers,
> strings, tuples, _Py_IDENTIFIER, _PyArg_Parser, etc). Not to mention
> breaking binary compatibility.
>

I'm not worried, because it won't happen in 3.9.  :)  Nobody is seriously
proposing that that be done in that manner.

The existing example work Victor did here (thanks!) was a rapid prototype
where the easiest approach to getting _something_ running in parallel as a
demo was just to disable a bunch of shared global things instead of also
doing much larger work to make those per-interpreter.

That isn't how we'd likely ever actually land this kind of change.

Longer term we need to aim to get rid of process global state by moving
that into per-interpreter state.  No matter what.  This isn't something
only needed by subinterpreters.  Corralling everything into a
per-interpreter state with proper initialization and finalization
everywhere allows other nice things like multiple independent interpreters
in a process.  Even sequentially (spin up, tear down, spin up, tear down,
repeat...).  We cannot reliably do that today without side effects such as
duplicate initializations and resulting resource leaks or worse.  Even if
such per-interpreter state instead of per-process state isolation is never
used for parallel execution, I still want to see it happen.
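
(Today the closest thing you can poke at from Python itself is the private,
experimental _xxsubinterpreters module; a minimal sketch of the
spin-up/tear-down cycle, not a supported API.)

    import _xxsubinterpreters as interpreters  # private, experimental (3.8+)

    for _ in range(3):
        interp_id = interpreters.create()              # spin up
        interpreters.run_string(interp_id, "x = sum(range(1000))")
        interpreters.destroy(interp_id)                # tear down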

Python already loses out to Lua because of this.  Lua is easily embedded in
a self-contained fashion; CPython never has been.  This kind of work helps
open up that world, instead of relegating us to the single,
life-of-the-process language-VM use case we can serve today.

-gps


[Python-Dev] Deprecate os.removedirs() and os.renames()

2020-05-07 Thread Serhiy Storchaka
It seems to me that os.removedirs() and os.renames() were added just for 
symmetry with os.makedirs(). All three functions have a similar structure 
and were added in the same commit. It seems they were initially code 
examples of using some os.path and os functions.


Unlike the quite popular os.makedirs(), os.removedirs() and os.renames() 
are not used in the stdlib and are rarely used in third-party code. 
os.removedirs() is considered the opposite of os.makedirs(), and 
os.renames() is a combination of os.makedirs(), os.rename() and 
os.removedirs(). The problems with them are:


1. They do not remove a directory if any file or other subdirectory is 
left in it. They just stop removing and return success. To the user it 
looks like they do not work as expected, and the user needs to test for 
the existence of the directory explicitly to check this.


2. They can remove more than expected. If the parent directory was empty 
before os.makedirs() was called, a following os.removedirs() will remove 
not just the newly created directories, but also the parent directory, 
its parent if it then becomes empty, and so on.


os.removedirs() is not the opposite of os.makedirs(). It can remove less 
or more than expected, and you have no control over how much it will 
remove. It is better to use shutil.rmtree().
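
A minimal sketch of problem 2 above (the path layout is purely
illustrative):

    import os
    import tempfile

    base = tempfile.mkdtemp()                 # an empty parent directory
    target = os.path.join(base, "a", "b", "c")

    os.makedirs(target)
    os.removedirs(target)                     # removes c, then b, then a...

    # ...and keeps walking up: since `base` is now empty as well, it is
    # also removed, even though makedirs() never created it.
    print(os.path.exists(base))               # False

    # shutil.rmtree(path) removes exactly the named tree and nothing
    # above it, which is why it is the safer replacement.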


os.renames() can correspondingly be replaced by os.rename() or 
shutil.move(), with the possible addition of os.makedirs() and 
shutil.rmtree() if needed.


I propose to deprecate these functions and remove them in future Python 
versions.



[Python-Dev] Re: Deprecate os.removedirs() and os.renames()

2020-05-07 Thread Kyle Stanley
Serhiy Storchaka wrote:
> I propose to deprecate these functions and remove them in future Python
versions.

+1, assuming the deprecation lasts for at least two versions and the
available alternatives are explicitly mentioned in the What's New
entries (both for the version they're initially deprecated in and for
the one they're removed in). Although I suspect the deprecation may
need to last longer than two versions, given possible breakage in
older libraries or legacy code. They don't seem at all common compared
to shutil.move() and shutil.rmtree(), but I do vaguely recall seeing
some usage of os.renames() and os.removedirs() in third-party code.

On Thu, May 7, 2020 at 4:06 PM Serhiy Storchaka  wrote:
>
> It seems to me that os.removedirs() and os.renames() were added just for
> symmetry with os.makedirs(). All three functions have a similar structure
> and were added in the same commit. It seems they were initially code
> examples of using some os.path and os functions.
>
> Unlike the quite popular os.makedirs(), os.removedirs() and os.renames()
> are not used in the stdlib and are rarely used in third-party code.
> os.removedirs() is considered the opposite of os.makedirs(), and
> os.renames() is a combination of os.makedirs(), os.rename() and
> os.removedirs(). The problems with them are:
>
> 1. They do not remove a directory if any file or other subdirectory is
> left in it. They just stop removing and return success. To the user it
> looks like they do not work as expected, and the user needs to test for
> the existence of the directory explicitly to check this.
>
> 2. They can remove more than expected. If the parent directory was empty
> before os.makedirs() was called, a following os.removedirs() will remove
> not just the newly created directories, but also the parent directory,
> its parent if it then becomes empty, and so on.
>
> os.removedirs() is not the opposite of os.makedirs(). It can remove less
> or more than expected, and you have no control over how much it will
> remove. It is better to use shutil.rmtree().
>
> os.renames() can correspondingly be replaced by os.rename() or
> shutil.move(), with the possible addition of os.makedirs() and
> shutil.rmtree() if needed.
>
> I propose to deprecate these functions and remove them in future Python
> versions.


[Python-Dev] Re: PoC: Subinterpreters 4x faster than sequential execution or threads on CPU-bound workload

2020-05-07 Thread Victor Stinner
On Wed, 6 May 2020 at 22:10, Serhiy Storchaka  wrote:
> I am wondering how much 3.9 will be slower than 3.8 in single-thread
> single-interpreter mode after getting rid of all process-wide singletons
> and caches (Py_None, Py_True, Py_NotImplemented, small integers,
> strings, tuples, _Py_IDENTIFIER, _PyArg_Parser, etc). Not to mention
> breaking binary compatibility.

There is no plan to remove caches like small integers, _Py_IDENTIFIER
or _PyArg_Parser.

The plan is to make these caches "per-interpreter". I already modified
small integers to make them per-interpreter.

Victor
-- 
Night gathers, and now my watch begins. It shall not end until my death.


[Python-Dev] Re: Issues with import_fresh_module

2020-05-07 Thread Chris Jerdonek
To expand on my earlier comment about changing the module under test to
make your testing easier, asyncio is one library that has lots of tests of
different combinations of its C and Python implementations being used
together.

As far as I know, it doesn't use import_fresh_module or similar hackery.
Instead it exposes a private way of getting at the parallel Python
implementation:
https://github.com/python/cpython/blob/b7a78ca74ab539943ab11b5c4c9cfab7f5b7ff5a/Lib/asyncio/futures.py#L271-L272
This is the kind of thing I was suggesting. (It might require more setup
than this in your case.)

--Chris


On Thu, May 7, 2020 at 11:33 AM Brett Cannon  wrote:

> Maybe an initialization/import side-effect bug which is triggered if the
> module is imported twice?


[Python-Dev] Re: Latest PEP 554 updates.

2020-05-07 Thread Jeff Allen

On 06/05/2020 21:52, Eric Snow wrote:

> On Wed, May 6, 2020 at 2:25 PM Jeff Allen  wrote:
> ...
>
>> My reason for worrying about this is that, while the C-API has been
>> there for some time, it has not had heavy use in taxing cases AFAIK, and
>> I think there is room for it to be incorrect. I am thinking more about
>> Jython than CPython, but ideally they are the same structures. When I
>> put the structures to taxing use cases on paper, they don't seem quite
>> to work. Jython has been used in environments with thread-pools,
>> concurrency, and multiple interpreters, and this aspect has had to be
>> "fixed" several times.
>
> That insight would be super helpful and much appreciated. :)  Is that
> all on the docs you've linked?


As far as it goes. I intend to (will eventually) elaborate the more 
complex cases, such as concurrency and the application-server case, where 
I think a Thread may have "history" in a runtime that should be ignored. 
There's more in my local repo, but not about this yet.


I have linked you into one page of a large and rambling (at times) 
account of experiments I'm doing. Outside be dragons.


The other thing I might point to would be Jython bugs that may be clues 
something is still wrong conceptually, or at least justify getting those 
concepts clear (https://bugs.jython.org issues 2642, 2507, 2513, 2846, 
2465, 2107 to name a few).



> This is great stuff, Jeff!  Thanks for sharing it.  I was able to skim
> through but don't have time to dig in at the moment.  I'll reply in
> detail as soon as I can.


Thanks. I hope it's a positive contribution. Isn't PlantUML awesome?

The key argument (or where I'm mistaken) is that, once you start sharing 
objects, only the function you call knows the right Interpreter (import 
context) to use, so in principle, it is different in every frame. You 
can't get to it from the current thread.


Jeff