Hi David,

Thank you for your reply! I have started the vote for this FLIP, but we can keep the discussion on this thread. From my perspective, I would not be against adding DataStream.partitionCustom to the Python DataStream API. However, more input is welcome.
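As a plain-Python sketch of the use case in question: with a custom partitioner, each parallel instance can preload only the slice of a static enrichment dataset whose keys route to it. Everything here (`partition_for`, `EnrichmentInstance`, the catalog data) is an illustrative assumption, not PyFlink API:

```python
# Illustration of custom-partitioned enrichment: each parallel instance
# preloads only the slice of a static dataset whose keys route to it,
# so no single instance has to hold the full enrichment table in memory.

PARALLELISM = 3

def partition_for(key: str, num_partitions: int) -> int:
    # A custom partitioner: deterministic key -> partition mapping.
    return sum(ord(c) for c in key) % num_partitions

# Static enrichment dataset (e.g. a product catalog).
CATALOG = {"apple": 1.2, "banana": 0.5, "cherry": 3.0, "date": 2.1}

class EnrichmentInstance:
    """One parallel instance; preloads only its slice of CATALOG."""
    def __init__(self, index: int):
        self.index = index
        self.slice = {k: v for k, v in CATALOG.items()
                      if partition_for(k, PARALLELISM) == index}

    def enrich(self, key: str):
        # The partitioner guarantees that matching keys arrive here.
        return (key, self.slice[key])

instances = [EnrichmentInstance(i) for i in range(PARALLELISM)]

events = ["apple", "cherry", "banana", "apple"]
enriched = [instances[partition_for(k, PARALLELISM)].enrich(k)
            for k in events]
print(enriched)
```

Because events and catalog entries use the same partitioner, every instance only ever sees keys it has preloaded, which is the point of the pattern.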
Best,
Shuiqiang

David Anderson <da...@alpinegizmo.com> wrote on Fri, Jul 24, 2020 at 7:52 PM:

Sorry I'm coming to this rather late, but I would like to argue that DataStream.partitionCustom enables an important use case. What I have in mind is performing partitioned enrichment, where each instance can preload a slice of a static dataset that is being used for enrichment.

For an example, consider https://github.com/knaufk/enrichments-with-flink/blob/master/src/main/java/com/github/knaufk/enrichments/CustomPartitionEnrichmenttJob.java.

Regards,
David

On Fri, Jul 24, 2020 at 12:18 PM Shuiqiang Chen <acqua....@gmail.com> wrote:

Hi Aljoscha,

Thank you for your response. I'll keep these two helper methods in the Python DataStream implementation.

And thank you all for joining the discussion. It seems that we have reached a consensus. I will start a vote for this FLIP later today.

Best,
Shuiqiang

Hequn Cheng <he...@apache.org> wrote on Fri, Jul 24, 2020 at 5:29 PM:

Thanks a lot for your valuable feedback and suggestions, @Aljoscha Krettek <aljos...@apache.org>!
+1 to the vote.

Best,
Hequn

On Fri, Jul 24, 2020 at 5:16 PM Aljoscha Krettek <aljos...@apache.org> wrote:

Thanks for updating! And yes, I think it's OK to include the few helper methods such as "readFromFile" and "print".

I think we can now proceed to a vote! Nice work, overall!

Best,
Aljoscha

On 16.07.20 17:16, Hequn Cheng wrote:

Hi,

Thanks a lot for your discussions. I think Aljoscha makes good suggestions here! Those problematic APIs should not be added to the new Python DataStream API.

Only one item I want to add based on the reply from Shuiqiang: I would also tend to keep the readTextFile() method.
Apart from print(), the readTextFile() method may also be very helpful and frequently used for playing with Flink. For example, it is used in our WordCount example[1], which is almost the first Flink program that every beginner runs. It is more efficient for reading multi-line data compared to fromCollection(), while being far easier to use compared to Kafka, Kinesis, RabbitMQ, etc., in cases of playing with Flink.

What do you think?

Best,
Hequn

[1] https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/wordcount/WordCount.java

On Thu, Jul 16, 2020 at 3:37 PM Shuiqiang Chen <acqua....@gmail.com> wrote:

Hi Aljoscha,

Thank you for your valuable comments! I agree with you that there is some room for optimization in the existing API, and that this can be applied to the Python DataStream API implementation.

I have grouped your comments into the following parts:

1. SingleOutputStreamOperator and DataStreamSource.
Yes, SingleOutputStreamOperator and DataStreamSource are a bit redundant, so we can unify their APIs into DataStream to make it clearer.

2. The internal or low-level methods.
- DataStream.get_id(): has been removed on the FLIP wiki page.
- DataStream.partition_custom(): has been removed on the FLIP wiki page.
- SingleOutputStreamOperator.can_be_parallel/forceNoParallel: has been removed on the FLIP wiki page.
Sorry for mistakenly making those internal methods public; we will not expose them to users in the Python API.

3. "Declarative" APIs.
- KeyedStream.sum/min/max/min_by/max_by: have been removed on the FLIP wiki page. They are well covered by the Table API.

4. Spelling problems.
- StreamExecutionEnvironment.from_collections should be from_collection().
- StreamExecutionEnvironment.generate_sequenece should be generate_sequence().
Sorry for the spelling errors.

5. Predefined sources and sinks.
As you said, most of the predefined sources are not suitable for production, so we can omit them from the new Python DataStream API. One exception is print(): I think we should add it, since it is commonly used by users and very useful for debugging jobs. We can add a comment for the API stating that it should never be used in production. Meanwhile, as you mentioned, a good alternative that always prints on the client should also be supported. For this case, we can add a collect() method that returns an Iterator. With the iterator, users can print the content on the client. This is also consistent with the behavior in the Table API.

6. Row.
Do you mean that we should not expose the Row type in the Python API? Maybe I haven't fully understood your concern. We can use the tuple type in the Python DataStream API to support Row. (I have updated the example section of the FLIP to reflect the design.)

Thank you again for your suggestions. Looking forward to your feedback.
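Point 6 above, mapping Row onto the Python tuple type, could look like the following sketch (the field names and the use of `namedtuple` are illustrative assumptions, not the final API):

```python
from collections import namedtuple

# A Row with named fields maps naturally onto a Python tuple:
# positional access matches the Row's field indices, while
# namedtuple adds readable field names without giving up
# plain-tuple semantics for Python/Java interop.
Row = namedtuple("Row", ["name", "amount"])

row = Row("apple", 3)

assert tuple(row) == ("apple", 3)   # behaves like a plain tuple
assert row[1] == 3                  # positional access by index
print(row.name, row.amount)         # named access for readability
```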
Best,
Shuiqiang

Aljoscha Krettek <aljos...@apache.org> wrote on Wed, Jul 15, 2020 at 5:58 PM:

Hi,

thanks for the proposal! I have some comments about the API. We should not blindly copy the existing Java DataStream, because we made some mistakes with it, and we now have a chance to fix them rather than forward them to a new API.

I don't think we need SingleOutputStreamOperator; in the Scala API we just have DataStream, and the relevant methods from SingleOutputStreamOperator are added to DataStream. Having this extra type is more confusing than helpful to users, I think. In the same vein, I think we also don't need DataStreamSource. The source methods can also just return a DataStream.

There are some methods that I would consider internal, and we shouldn't expose them:
- DataStream.get_id(): this is an internal method
- DataStream.partition_custom(): I think adding this method was a mistake because it's too low-level; I could be convinced otherwise
- DataStream.print()/DataStream.print_to_error(): these are questionable because they print to the TaskManager log. Maybe we could add a good alternative that always prints on the client, similar to the Table API
- DataStream.write_to_socket(): it was a mistake to add this sink on DataStream; it is not fault-tolerant and shouldn't be used in production

- KeyedStream.sum/min/max/min_by/max_by: nowadays the Table API should be used for "declarative" use cases, and I think these methods should not be in the DataStream API
- SingleOutputStreamOperator.can_be_parallel/forceNoParallel: these are internal methods

- StreamExecutionEnvironment.from_parallel_collection(): I think the usability is questionable
- StreamExecutionEnvironment.from_collections -> should be called from_collection
- StreamExecutionEnvironment.generate_sequenece -> should be called generate_sequence

I think most of the predefined sources are questionable:
- fromParallelCollection: I don't know if this is useful
- readTextFile: most of the variants are not useful/fault-tolerant
- readFile: same
- socketTextStream: also not useful except for toy examples
- createInput: also not useful, and it's legacy DataSet InputFormats

I think we need to think hard about whether we want to further expose Row in our APIs. I think adding it to flink-core was more an accident than anything else, but I can see that it would be useful for Python/Java interop.

Best,
Aljoscha

On Mon, Jul 13, 2020, at 04:38, jincheng sun wrote:

Thanks for bringing up this DISCUSS, Shuiqiang!

+1 for the proposal!
Best,
Jincheng

Xingbo Huang <hxbks...@gmail.com> wrote on Thu, Jul 9, 2020 at 10:41 AM:

Hi Shuiqiang,

Thanks a lot for driving this discussion. Big +1 for supporting Python DataStream. In many ML scenarios, operating on objects will be more natural than operating on Tables.

Best,
Xingbo

Wei Zhong <weizhong0...@gmail.com> wrote on Thu, Jul 9, 2020 at 10:35 AM:

Hi Shuiqiang,

Thanks for driving this. Big +1 for supporting the DataStream API in PyFlink!

Best,
Wei

On Jul 9, 2020, at 10:29, Hequn Cheng <he...@apache.org> wrote:

+1 for adding the Python DataStream API and starting with the stateless part. Some users have already expressed their wish to have the Python DataStream APIs. Once we have the APIs in PyFlink, we can cover more use cases for our users.
Best,
Hequn

On Wed, Jul 8, 2020 at 11:45 AM Shuiqiang Chen <acqua....@gmail.com> wrote:

Sorry, the 3rd link is broken; please refer to this one: Support Python DataStream API <https://docs.google.com/document/d/1H3hz8wuk22-8cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit>

Shuiqiang Chen <acqua....@gmail.com> wrote on Wed, Jul 8, 2020 at 11:13 AM:

Hi everyone,

As we all know, Flink provides three layered APIs: the ProcessFunctions, the DataStream API and the SQL & Table API. Each API offers a different trade-off between conciseness and expressiveness and targets different use cases[1].

Currently, the SQL & Table API is already supported in PyFlink. The API provides relational operations as well as user-defined functions, offering convenience for users who are familiar with Python and relational programming.

Meanwhile, the DataStream API and ProcessFunctions provide more generic APIs to implement stream processing applications. The ProcessFunctions expose time and state, which are the fundamental building blocks for any kind of streaming application. To cover more use cases, we are planning to cover all these APIs in PyFlink.
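To make the stateless scope concrete, the core operations involved (flat_map, key_by followed by reduce) can be simulated on a finite list in plain Python. This is a conceptual sketch of WordCount-style semantics only, not PyFlink code; the helper names are assumptions:

```python
from collections import defaultdict

def flat_map(stream, fn):
    # flat_map: each input element may produce zero or more outputs.
    return [out for element in stream for out in fn(element)]

def key_by_and_reduce(stream, key_fn, reduce_fn):
    # key_by: group elements by key; reduce: fold each group pairwise.
    groups = defaultdict(list)
    for element in stream:
        groups[key_fn(element)].append(element)
    out = []
    for elements in groups.values():
        acc = elements[0]
        for e in elements[1:]:
            acc = reduce_fn(acc, e)
        out.append(acc)
    return out

lines = ["to be or not to be"]
pairs = flat_map(lines, lambda line: [(w, 1) for w in line.split()])
counts = key_by_and_reduce(pairs, lambda p: p[0],
                           lambda a, b: (a[0], a[1] + b[1]))
print(sorted(counts))
```

The same three steps are what a Python DataStream WordCount would express on an unbounded stream, with the runtime handling the grouping and incremental reduction.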
In this discussion (FLIP-130), we propose to support the Python DataStream API for the stateless part. For more detail, please refer to the FLIP wiki page here[2]. If you are interested in the stateful part, you can also take a look at the design doc here[3], which we are going to discuss in a separate FLIP.

Any comments will be highly appreciated!

[1] https://flink.apache.org/flink-applications.html#layered-apis
[2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298
[3] https://docs.google.com/document/d/1H3hz8wuk228cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit?usp=sharing

Best,
Shuiqiang