Sorry I'm coming to this rather late, but I would like to argue that DataStream.partitionCustom enables an important use case: partitioned enrichment, where each parallel instance preloads only the slice of a static dataset that it needs for the enrichment.
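The core of the pattern, as a rough Java sketch, looks like this. (It is only a sketch: Event, EnrichedEvent, ReferenceData, ReferenceDataLoader and EventSource are placeholder names I made up for illustration, not code taken from the example linked below.)

import java.util.Map;

import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Event> events = env.addSource(new EventSource());  // any source of keyed events (placeholder)

DataStream<EnrichedEvent> enriched = events
    // Route each event to the subtask that preloaded the matching slice of the reference data.
    .partitionCustom(
        (Partitioner<String>) (key, numPartitions) -> Math.floorMod(key.hashCode(), numPartitions),
        (KeySelector<Event, String>) Event::getKey)
    .map(new RichMapFunction<Event, EnrichedEvent>() {
        private transient Map<String, ReferenceData> slice;

        @Override
        public void open(Configuration parameters) throws Exception {
            // Each parallel instance loads only the part of the static data set
            // whose keys hash to its own subtask index.
            slice = ReferenceDataLoader.loadSlice(
                getRuntimeContext().getIndexOfThisSubtask(),
                getRuntimeContext().getNumberOfParallelSubtasks());
        }

        @Override
        public EnrichedEvent map(Event event) {
            return new EnrichedEvent(event, slice.get(event.getKey()));
        }
    });

The important property is that the Partitioner and the loader agree on how keys are mapped to subtask indices, so every event is routed to the instance that holds its slice of the reference data.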
For a complete, runnable example, see
https://github.com/knaufk/enrichments-with-flink/blob/master/src/main/java/com/github/knaufk/enrichments/CustomPartitionEnrichmenttJob.java

Regards,
David

On Fri, Jul 24, 2020 at 12:18 PM Shuiqiang Chen <acqua....@gmail.com> wrote:

> Hi Aljoscha,
>
> Thank you for your response. I'll keep these two helper methods in the Python DataStream implementation.
>
> And thank you all for joining in the discussion. It seems that we have reached a consensus. I will start a vote for this FLIP later today.
>
> Best,
> Shuiqiang
>
> Hequn Cheng <he...@apache.org> wrote on Fri, Jul 24, 2020 at 5:29 PM:
>
> > Thanks a lot for your valuable feedback and suggestions! @Aljoscha Krettek <aljos...@apache.org>
> > +1 to the vote.
> >
> > Best,
> > Hequn
> >
> > On Fri, Jul 24, 2020 at 5:16 PM Aljoscha Krettek <aljos...@apache.org> wrote:
> >
> > > Thanks for updating! And yes, I think it's ok to include the few helper methods such as "readFromFile" and "print".
> > >
> > > I think we can now proceed to a vote! Nice work, overall!
> > >
> > > Best,
> > > Aljoscha
> > >
> > > On 16.07.20 17:16, Hequn Cheng wrote:
> > > > Hi,
> > > >
> > > > Thanks a lot for your discussions.
> > > > I think Aljoscha makes good suggestions here! Those problematic APIs should not be added to the new Python DataStream API.
> > > >
> > > > Only one item I want to add based on the reply from Shuiqiang:
> > > > I would also tend to keep the readTextFile() method. Apart from print(), readTextFile() may also be very helpful and is frequently used when playing with Flink.
> > > > For example, it is used in our WordCount example[1], which is almost the first Flink program that every beginner runs.
> > > > It is more efficient for reading multi-line data than fromCollection(), while being far easier to use than Kafka, Kinesis, RabbitMQ, etc., when playing with Flink.
> > > >
> > > > What do you think?
> > > >
> > > > Best,
> > > > Hequn
> > > >
> > > > [1] https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/wordcount/WordCount.java
> > > >
> > > > On Thu, Jul 16, 2020 at 3:37 PM Shuiqiang Chen <acqua....@gmail.com> wrote:
> > > >
> > > >> Hi Aljoscha,
> > > >>
> > > >> Thank you for your valuable comments! I agree with you that there is room for optimization in the existing API, and that this can be applied to the Python DataStream API implementation.
> > > >>
> > > >> Based on your comments, I have grouped them into the following parts:
> > > >>
> > > >> 1. SingleOutputStreamOperator and DataStreamSource.
> > > >> Yes, SingleOutputStreamOperator and DataStreamSource are a bit redundant, so we can unify their APIs into DataStream to make it clearer.
> > > >>
> > > >> 2. The internal or low-level methods.
> > > >> - DataStream.get_id(): has been removed from the FLIP wiki page.
> > > >> - DataStream.partition_custom(): has been removed from the FLIP wiki page.
> > > >> - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: has been removed from the FLIP wiki page.
> > > >> Sorry for mistakenly making those internal methods public; we will not expose them to users in the Python API.
> > > >>
> > > >> 3. "Declarative" APIs.
> > > >> - KeyedStream.sum/min/max/min_by/max_by: have been removed from the FLIP wiki page. They can be well covered by the Table API.
> > > >>
> > > >> 4. Spelling problems.
> > > >> - StreamExecutionEnvironment.from_collections: should be from_collection().
> > > >> - StreamExecutionEnvironment.generate_sequenece: should be generate_sequence().
> > > >> Sorry for the spelling errors.
> > > >>
> > > >> 5. Predefined sources and sinks.
> > > >> As you said, most of the predefined sources are not suitable for production, so we can ignore them in the new Python DataStream API.
> > > >> One exception: I think we should add print(), since it is commonly used and very useful for debugging jobs. We can document that it should never be used in production.
> > > >> Meanwhile, as you mentioned, a good alternative that always prints on the client should also be supported. For this case, maybe we can add a collect method that returns an Iterator. With the iterator, users can print the content on the client. This is also consistent with the behavior in the Table API.
> > > >>
> > > >> 6. For Row.
> > > >> Do you mean that we should not expose the Row type in the Python API? Maybe I haven't fully understood your concern.
> > > >> We can use the tuple type in the Python DataStream API to support Row. (I have updated the example section of the FLIP to reflect the design.)
> > > >>
> > > >> Thanks again for your suggestions. Looking forward to your feedback.
> > > >>
> > > >> Best,
> > > >> Shuiqiang
> > > >>
> > > >> Aljoscha Krettek <aljos...@apache.org> wrote on Wed, Jul 15, 2020 at 5:58 PM:
> > > >>
> > > >>> Hi,
> > > >>>
> > > >>> thanks for the proposal! I have some comments about the API. We should not blindly copy the existing Java DataStream, because we made some mistakes with that and we now have a chance to fix them and not forward them to a new API.
> > > >>>
> > > >>> I don't think we need SingleOutputStreamOperator; in the Scala API we just have DataStream, and the relevant methods from SingleOutputStreamOperator are added to DataStream. Having this extra type is more confusing than helpful to users, I think. In the same vein, I think we also don't need DataStreamSource. The source methods can also just return a DataStream.
> > > >>>
> > > >>> There are some methods that I would consider internal and we shouldn't expose them:
> > > >>> - DataStream.get_id(): this is an internal method
> > > >>> - DataStream.partition_custom(): I think adding this method was a mistake because it's too low-level, but I could be convinced otherwise
> > > >>> - DataStream.print()/DataStream.print_to_error(): These are questionable because they print to the TaskManager log.
> > > >>> Maybe we could add a good alternative that always prints on the client, similar to the Table API.
> > > >>> - DataStream.write_to_socket(): It was a mistake to add this sink on DataStream; it is not fault-tolerant and shouldn't be used in production.
> > > >>>
> > > >>> - KeyedStream.sum/min/max/min_by/max_by: Nowadays the Table API should be used for "declarative" use cases, and I think these methods should not be in the DataStream API
> > > >>> - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: these are internal methods
> > > >>>
> > > >>> - StreamExecutionEnvironment.from_parallel_collection(): I think the usability is questionable
> > > >>> - StreamExecutionEnvironment.from_collections -> should be called from_collection
> > > >>> - StreamExecutionEnvironment.generate_sequenece -> should be called generate_sequence
> > > >>>
> > > >>> I think most of the predefined sources are questionable:
> > > >>> - fromParallelCollection: I don't know if this is useful
> > > >>> - readTextFile: most of the variants are not useful/fault-tolerant
> > > >>> - readFile: same
> > > >>> - socketTextStream: also not useful except for toy examples
> > > >>> - createInput: also not useful, and it's legacy DataSet InputFormats
> > > >>>
> > > >>> I think we need to think hard about whether we want to further expose Row in our APIs. I think adding it to flink-core was more an accident than anything else, but I can see that it would be useful for Python/Java interop.
> > > >>>
> > > >>> Best,
> > > >>> Aljoscha
> > > >>>
> > > >>> On Mon, Jul 13, 2020, at 04:38, jincheng sun wrote:
> > > >>>> Thanks for bringing up this DISCUSS, Shuiqiang!
> > > >>>>
> > > >>>> +1 for the proposal!
> > > >>>>
> > > >>>> Best,
> > > >>>> Jincheng
> > > >>>>
> > > >>>> Xingbo Huang <hxbks...@gmail.com> wrote on Thu, Jul 9, 2020 at 10:41 AM:
> > > >>>>
> > > >>>>> Hi Shuiqiang,
> > > >>>>>
> > > >>>>> Thanks a lot for driving this discussion.
> > > >>>>> Big +1 for supporting Python DataStream.
> > > >>>>> In many ML scenarios, operating on Objects will be more natural than operating on Tables.
> > > >>>>>
> > > >>>>> Best,
> > > >>>>> Xingbo
> > > >>>>>
> > > >>>>> Wei Zhong <weizhong0...@gmail.com> wrote on Thu, Jul 9, 2020 at 10:35 AM:
> > > >>>>>
> > > >>>>>> Hi Shuiqiang,
> > > >>>>>>
> > > >>>>>> Thanks for driving this. Big +1 for supporting the DataStream API in PyFlink!
> > > >>>>>>
> > > >>>>>> Best,
> > > >>>>>> Wei
> > > >>>>>>
> > > >>>>>> On Jul 9, 2020, at 10:29, Hequn Cheng <he...@apache.org> wrote:
> > > >>>>>>
> > > >>>>>>> +1 for adding the Python DataStream API and starting with the stateless part.
> > > >>>>>>> There are already some users who have expressed their wish to have the Python DataStream APIs. Once we have the APIs in PyFlink, we can cover more use cases for our users.
> > > >>>>>>>
> > > >>>>>>> Best, Hequn
> > > >>>>>>>
> > > >>>>>>> On Wed, Jul 8, 2020 at 11:45 AM Shuiqiang Chen <acqua....@gmail.com> wrote:
> > > >>>>>>>
> > > >>>>>>>> Sorry, the 3rd link is broken, please refer to this one: Support Python DataStream API
> > > >>>>>>>> <https://docs.google.com/document/d/1H3hz8wuk22-8cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit>
> > > >>>>>>>>
> > > >>>>>>>> Shuiqiang Chen <acqua....@gmail.com> wrote on Wed, Jul 8, 2020 at 11:13 AM:
> > > >>>>>>>>
> > > >>>>>>>>> Hi everyone,
> > > >>>>>>>>>
> > > >>>>>>>>> As we all know, Flink provides three layered APIs: the ProcessFunctions, the DataStream API and the SQL & Table API. Each API offers a different trade-off between conciseness and expressiveness and targets different use cases[1].
> > > >>>>>>>>>
> > > >>>>>>>>> Currently, the SQL & Table API is already supported in PyFlink. The API provides relational operations as well as user-defined functions, offering convenience for users who are familiar with Python and relational programming.
> > > >>>>>>>>>
> > > >>>>>>>>> Meanwhile, the DataStream API and ProcessFunctions provide more generic APIs for implementing stream processing applications. The ProcessFunctions expose time and state, which are the fundamental building blocks for any kind of streaming application.
> > > >>>>>>>>> To cover more use cases, we are planning to cover all these APIs in PyFlink.
> > > >>>>>>>>>
> > > >>>>>>>>> In this discussion (FLIP-130), we propose to support the Python DataStream API for the stateless part. For more detail, please refer to the FLIP wiki page here[2]. If you are interested in the stateful part, you can also take a look at the design doc here[3], which we are going to discuss in a separate FLIP.
> > > >>>>>>>>>
> > > >>>>>>>>> Any comments will be highly appreciated!
> > > >>>>>>>>>
> > > >>>>>>>>> [1] https://flink.apache.org/flink-applications.html#layered-apis
> > > >>>>>>>>> [2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298
> > > >>>>>>>>> [3] https://docs.google.com/document/d/1H3hz8wuk228cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit?usp=sharing
> > > >>>>>>>>>
> > > >>>>>>>>> Best,
> > > >>>>>>>>> Shuiqiang