Hi Dian and Jincheng,

Thanks for your explanation. Think again: maybe most users don't want to modify these parameters. We all agree that "batch.size" should be a larger value, so "bundle.size" must also be increased. Currently the default value of "bundle.size" is only 1000. I think you can update the design to provide meaningful default values for "batch.size" and "bundle.size".
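For readers following along, this is roughly how the two options under discussion surface to users, shown as a hypothetical flink-conf.yaml fragment. The values below are placeholders for illustration only, not the defaults being proposed in this thread:

```yaml
# Hypothetical configuration sketch; the option names come from this thread,
# the values are illustrative placeholders, not decided defaults.
python.fn-execution.bundle.size: 100000      # elements buffered per bundle
python.fn-execution.arrow.batch.size: 10000  # rows per Arrow batch sent to the Python worker
```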
Best,
Jingsong Lee

On Mon, Feb 10, 2020 at 4:36 PM Dian Fu <dian0511...@gmail.com> wrote:

> Hi Jincheng, Hequn & Jingsong,
>
> Thanks a lot for your suggestions. I have created FLIP-97[1] for this
> feature.
>
> > One little suggestion: maybe it would be nice if we can add some
> > performance explanation in the document? (I'm just very curious :))
>
> Thanks for the suggestion. I have updated the design doc in the
> "Background" section to explain where the performance gains come from.
>
> > It seems that a batch should always be in a bundle. The bundle size
> > should always be bigger than the batch size (if a batch cannot cross
> > bundles). Can you explain this relationship in the document?
>
> I have updated the design doc to explain more about these two
> configurations.
>
> > In the batch world, the vectorization batch size is about 1024+. What
> > do you think about the default value of "batch"?
>
> Is there any link to where this value comes from? I have performed a
> simple test with a Pandas UDF which performs a simple +1 operation. The
> performance is best when the batch size is set to 5000. I think it
> depends on the data type of each column, the functionality the Pandas
> UDF performs, etc. However, I agree with you that we could give a
> meaningful default value for the "batch" size which works in most
> scenarios.
>
> > Can we only configure one parameter and calculate the other
> > automatically? For example, if we just want to "pipeline", would
> > making "bundle.size" twice as much as "batch.size" work?
>
> I agree with Jincheng that this is not feasible. I think that giving a
> meaningful default value for "batch.size" which works in most scenarios
> is enough. What's your thought?
>
> Thanks,
> Dian
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-97%3A+Support+Scalar+Vectorized+Python+UDF+in+PyFlink
>
> On Mon, Feb 10, 2020 at 4:25 PM jincheng sun <sunjincheng...@gmail.com>
> wrote:
>
> > Hi Jingsong,
> >
> > Thanks for your feedback!
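To make the bundle/batch relationship discussed above concrete, here is a minimal stdlib-only Python sketch (my own illustration, not PyFlink code): elements are first grouped into bundles, and each bundle is independently split into Arrow-style batches, so a batch never crosses a bundle boundary and "bundle.size" effectively caps the batch size. Keeping the batch smaller than the bundle is what allows several batches to be in flight within one bundle.

```python
# Illustration of the bundle/batch relationship described in this thread.
# Not PyFlink internals; list slicing stands in for buffering and Arrow.

def split_into_batches(bundle, batch_size):
    """Split one bundle of elements into Arrow-style batches."""
    return [bundle[i:i + batch_size] for i in range(0, len(bundle), batch_size)]

def process(elements, bundle_size, batch_size):
    """Group elements into bundles, then split each bundle into batches."""
    batches = []
    for start in range(0, len(elements), bundle_size):
        bundle = elements[start:start + bundle_size]
        # A batch is always taken from within a single bundle, so the
        # last batch of each bundle may be truncated.
        batches.extend(split_into_batches(bundle, batch_size))
    return batches

# With bundle_size=5 and batch_size=3, the second batch of each bundle is
# truncated to 2 elements: a batch cannot span two bundles.
print(process(list(range(10)), bundle_size=5, batch_size=3))
# → [[0, 1, 2], [3, 4], [5, 6, 7], [8, 9]]
```

This also shows why the two options cannot simply be derived from each other: the bundle boundary is a synchronization point, while the batch size only controls how the bundle's contents are chunked on the way to the Python worker.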
> > I would like to share my thoughts regarding the following question:
> >
> > >> - Can we only configure one parameter and calculate the other
> > >> automatically? For example, if we just want to "pipeline", would
> > >> making "bundle.size" twice as much as "batch.size" work?
> >
> > I don't think this works. These two configurations are used for
> > different purposes and there is no direct relationship between them,
> > so I guess we cannot infer one configuration from the other.
> >
> > Best,
> > Jincheng
> >
> > On Mon, Feb 10, 2020 at 1:53 PM Jingsong Li <jingsongl...@gmail.com>
> > wrote:
> >
> > > Thanks Dian for your reply.
> > >
> > > +1 to create a FLIP too.
> > >
> > > About "python.fn-execution.bundle.size" and
> > > "python.fn-execution.arrow.batch.size": I got what you mean about
> > > "pipeline". I agree.
> > > It seems that a batch should always be in a bundle. The bundle size
> > > should always be bigger than the batch size (if a batch cannot
> > > cross bundles). Can you explain this relationship in the document?
> > >
> > > I think the default value is a very important thing; we can discuss:
> > > - In the batch world, the vectorization batch size is about 1024+.
> > > What do you think about the default value of "batch"?
> > > - Can we only configure one parameter and calculate the other
> > > automatically? For example, if we just want to "pipeline", would
> > > making "bundle.size" twice as much as "batch.size" work?
> > >
> > > Best,
> > > Jingsong Lee
> > >
> > > On Mon, Feb 10, 2020 at 11:55 AM Hequn Cheng <he...@apache.org>
> > > wrote:
> > >
> > > > Hi Dian,
> > > >
> > > > Thanks a lot for bringing up the discussion!
> > > >
> > > > It is great to see that the Pandas UDFs feature is going to be
> > > > introduced. I think this would improve the performance and also
> > > > the usability of user-defined functions (UDFs) in Python.
> > > > One little suggestion: maybe it would be nice if we can add some
> > > > performance explanation in the document?
> > > > (I'm just very curious :))
> > > >
> > > > +1 to create a FLIP for this big enhancement.
> > > >
> > > > Best,
> > > > Hequn
> > > >
> > > > On Mon, Feb 10, 2020 at 11:15 AM jincheng sun
> > > > <sunjincheng...@gmail.com> wrote:
> > > >
> > > > > Hi Dian,
> > > > >
> > > > > Thanks for bringing up this discussion. This is very important
> > > > > for the ecosystem of PyFlink. Adding Pandas support greatly
> > > > > enriches the available UDF library of PyFlink and greatly
> > > > > improves the usability of PyFlink!
> > > > >
> > > > > +1 for supporting scalar vectorized Python UDFs.
> > > > >
> > > > > I think we should create a FLIP for this big enhancement. :)
> > > > >
> > > > > What do you think?
> > > > >
> > > > > Best,
> > > > > Jincheng
> > > > >
> > > > > On Wed, Feb 5, 2020 at 6:01 PM dianfu <dia...@apache.org> wrote:
> > > > >
> > > > > > Hi Jingsong,
> > > > > >
> > > > > > Thanks a lot for the valuable feedback.
> > > > > >
> > > > > > 1. The configurations "python.fn-execution.bundle.size" and
> > > > > > "python.fn-execution.arrow.batch.size" are used for separate
> > > > > > purposes and I think they are both needed. If they were
> > > > > > unified, the Python operator would have to wait for the
> > > > > > execution results of the previous batch of elements before
> > > > > > processing the next batch. This means that the Python UDF
> > > > > > execution could not be pipelined between batches. With
> > > > > > separate configurations, there is no such problem.
> > > > > > 2. It means that the Java operator will convert input elements
> > > > > > to the Arrow memory format and then send them to the Python
> > > > > > worker, and vice versa. Regarding the zero-copy benefits
> > > > > > provided by Arrow, we gain them automatically by using Arrow.
> > > > > > 3. Good point!
> > > > > > As all the classes of the Python module are written in Java
> > > > > > and it's not suggested to introduce new Scala classes, I guess
> > > > > > it's not easy to do so right now. But I think this is
> > > > > > definitely a good improvement we can make in the future.
> > > > > > 4. You're right, and we will add a series of Arrow
> > > > > > ColumnVectors for each supported type.
> > > > > >
> > > > > > Thanks,
> > > > > > Dian
> > > > > >
> > > > > > On Wed, Feb 5, 2020 at 4:57 PM, Jingsong Li
> > > > > > <jingsongl...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Dian,
> > > > > > >
> > > > > > > +1 for this, thanks for driving it.
> > > > > > > The documentation looks very good. I can imagine a huge
> > > > > > > performance improvement and better integration with other
> > > > > > > Python libraries.
> > > > > > >
> > > > > > > A few thoughts:
> > > > > > > - About data splitting: can we unify
> > > > > > > "python.fn-execution.arrow.batch.size" with
> > > > > > > "python.fn-execution.bundle.size"?
> > > > > > > - Use of Apache Arrow as the exchange format: do you mean
> > > > > > > Arrow supports zero-copy between Java and Python?
> > > > > > > - It seems we can implement ArrowFieldWriter by code
> > > > > > > generation, but it is OK for the initial version to use
> > > > > > > virtual function calls.
> > > > > > > - For ColumnarRow vectorized reading, it seems that we need
> > > > > > > to implement ArrowColumnVectors.
> > > > > > >
> > > > > > > Best,
> > > > > > > Jingsong Lee
> > > > > > >
> > > > > > > On Wed, Feb 5, 2020 at 12:45 PM dianfu <dia...@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Hi all,
> > > > > > >>
> > > > > > >> Scalar Python UDF has already been supported in the coming
> > > > > > >> release 1.10 (FLIP-58[1]). It operates one row at a time.
It works in the > way > > > > that > > > > > > the > > > > > > >> Java operator serializes one input row to bytes and sends them > > to > > > > the > > > > > > >> Python worker; the Python worker deserializes the input row > and > > > > > > evaluates > > > > > > >> the Python UDF with it; the result row is serialized and sent > > back > > > > to > > > > > > the > > > > > > >> Java operator. > > > > > > >> > > > > > > >> It suffers from the following problems: > > > > > > >> 1) High serialization/deserialization overhead > > > > > > >> 2) It’s difficult to leverage the popular Python libraries > used > > by > > > > > data > > > > > > >> scientists, such as Pandas, Numpy, etc which provide high > > > > performance > > > > > > data > > > > > > >> structure and functions. > > > > > > >> > > > > > > >> Jincheng and I have discussed offline and we want to introduce > > > > > > vectorized > > > > > > >> Python UDF to address the above problems. This feature has > also > > > been > > > > > > >> mentioned in the discussion thread about the Python API > plan[2]. > > > For > > > > > > >> vectorized Python UDF, a batch of rows are transferred between > > JVM > > > > and > > > > > > >> Python VM in columnar format. The batch of rows will be > > converted > > > > to a > > > > > > >> collection of Pandas.Series and given to the vectorized Python > > UDF > > > > > which > > > > > > >> could then leverage the popular Python libraries such as > Pandas, > > > > > Numpy, > > > > > > etc > > > > > > >> for the Python UDF implementation. > > > > > > >> > > > > > > >> Please refer the design doc[3] for more details and welcome > any > > > > > > feedback. 
> > > > > > >> Regards,
> > > > > > >> Dian
> > > > > > >>
> > > > > > >> [1]
> > > > > > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-58%3A+Flink+Python+User-Defined+Stateless+Function+for+Table
> > > > > > >> [2]
> > > > > > >> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-What-parts-of-the-Python-API-should-we-focus-on-next-tt36119.html
> > > > > > >> [3]
> > > > > > >> https://docs.google.com/document/d/1ls0mt96YV1LWPHfLOh8v7lpgh1KsCNx8B9RrW1ilnoE/edit#heading=h.qivada1u8hwd
> > > > > > >
> > > > > > > --
> > > > > > > Best, Jingsong Lee
> > >
> > > --
> > > Best, Jingsong Lee

--
Best, Jingsong Lee