One more question that came to my mind: How much performance improvement do
we gain on a real-world Python use case? Were the measurements more like
microbenchmarks where the Python UDF was called without the overhead of Flink?
I would just be curious how much the Python component contributes to the
overall runtime of a real-world job. Do we have some data on this?

Cheers,
Till

On Mon, Jan 3, 2022 at 11:45 AM Till Rohrmann <[email protected]> wrote:

> Hi Xingbo,
>
> Thanks for creating this FLIP. I have two general questions about the
> motivation for this FLIP because I have only very little exposure to our
> Python users:
>
> Is the slower performance currently the biggest pain point for our Python
> users?
>
> What else are our Python users mainly complaining about?
>
> Concerning the proposed changes, are there any changes required on the
> runtime side (changes to Flink)? How will the deployment and memory
> management be affected when using the thread execution mode?
>
> Cheers,
> Till
>
> On Fri, Dec 31, 2021 at 9:46 AM Xingbo Huang <[email protected]> wrote:
>
>> Hi Wei,
>>
>> Thanks a lot for your feedback. Very good questions!
>>
>> >>> 1. It seems that we dynamically load an embedded Python and user
>> dependencies in the TM process. Can they be uninstalled cleanly after the
>> task has finished? I.e., can we use the Thread Mode in session mode and the
>> PyFlink shell?
>>
>> I mentioned this limitation in the FLIP. There is no problem as long as the
>> Python interpreter stays the same, but if you need to switch to a different
>> Python interpreter, there is really no way to reload the Python library.
>> The problem is mainly caused by many Python libraries assuming that they
>> own the process alone.
>>
>> >>> 2. Does one TM have only one embedded Python running at the same time?
>> If all the Python operators in the TM share the same PVM, will there be a
>> loss in performance?
>>
>> Your understanding is correct that one TM has only one embedded Python
>> running at the same time. I guess you are worried about the performance
>> loss with multiple threads caused by the Python GIL. There is a one-to-one
>> correspondence between Java worker threads and Python subinterpreters.
>> Although subinterpreters have not yet completely overcome the GIL sharing
>> problem (the Python community's plan for a per-interpreter GIL is still
>> under discussion [1]), the performance of subinterpreters is very close to
>> that of multiprocessing [2].
>>
>> >>> 3. How do we load the relevant C library if the python.executable is
>> provided by users?
>>
>> Once python.executable is provided, PEMJA will dynamically load the
>> CPython library (libpython.*so or libpython.*dylib) and pemja.so installed
>> in the Python environment.
>>
>> >>> May there be a risk of version conflicts?
>>
>> I understand that this question is really about whether C/C++ has a way to
>> solve the problem of depending on different versions of a library.
>> First of all, we know that if there is only static linking, there will be
>> no such problem. I have studied the source code of CPython [3], and there
>> is no usage of dynamic linking. What remains is the case where dynamic
>> linking is used in a C library written by the users. There are many ways
>> to solve this problem with dynamic linking, but after all, this library is
>> written by users, and it is difficult for us to guarantee that there will
>> be no conflicts. In that case, Process Mode will be the fallback choice.
>>
>> [1]
>>
>> https://mail.python.org/archives/list/[email protected]/thread/S5GZZCEREZLA2PEMTVFBCDM52H4JSENR/#RIK75U3ROEHWZL4VENQSQECB4F4GDELV
>> [2]
>>
>> https://mail.python.org/archives/list/[email protected]/thread/PNLBJBNIQDMG2YYGPBCTGOKOAVXRBJWY/#L5OXHXPFONRKLR3W6U46LUSUIBN4FCZQ
>> [3] https://github.com/python/cpython
>>
>> Best,
>> Xingbo
>>
>> On Fri, Dec 31, 2021 at 11:49 AM Wei Zhong <[email protected]> wrote:
>>
>> > Hi Xingbo,
>> >
>> > Thanks for creating this FLIP. Big +1 for it!
>> >
>> > I have some questions about the Thread Mode:
>> >
>> > 1. It seems that we dynamically load an embedded Python and user
>> > dependencies in the TM process. Can they be uninstalled cleanly after
>> > the task has finished? I.e., can we use the Thread Mode in session mode
>> > and the PyFlink shell?
>> >
>> > 2. Does one TM have only one embedded Python running at the same time?
>> > If all the Python operators in the TM share the same PVM, will there be
>> > a loss in performance?
>> >
>> > 3. How do we load the relevant C library if the python.executable is
>> > provided by users? May there be a risk of version conflicts?
>> >
>> > Best,
>> > Wei
>> >
>> >
>> > > On Dec 29, 2021, at 11:56 AM, Xingbo Huang <[email protected]> wrote:
>> > >
>> > > Hi everyone,
>> > >
>> > > I would like to start a discussion thread on "Support PyFlink Runtime
>> > > Execution in Thread Mode"
>> > >
>> > > We have provided the PyFlink Runtime framework to support Python
>> > > user-defined functions since Flink 1.10. The PyFlink Runtime framework
>> > > is called Process Mode, which depends on an inter-process
>> > > communication architecture based on the Apache Beam Portability
>> > > framework. Although starting a dedicated process to execute Python
>> > > user-defined functions provides better resource isolation, it brings
>> > > greater resource and performance overhead.
>> > >
>> > > In order to overcome the resource and performance problems of Process
>> > > Mode, we propose a new execution mode which executes Python
>> > > user-defined functions in the same thread instead of a separate
>> > > process.
>> > >
>> > > I have drafted FLIP-206 [1]. Please feel free to reply to this email
>> > > thread. Looking forward to your feedback!
>> > >
>> > > Best,
>> > > Xingbo
>> > >
>> > > [1]
>> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-206%3A+Support+PyFlink+Runtime+Execution+in+Thread+Mode
>> >
>> >
>>
>
