Re: [DISCUSS] FLIP-130: Support for Python DataStream API (Stateless Part)

Hequn Cheng Fri, 24 Jul 2020 02:36:26 -0700

Thanks a lot for your valuable feedback and suggestions! @Aljoscha Krettek
<aljos...@apache.org>
+1 to the vote.


Best,
Hequn

On Fri, Jul 24, 2020 at 5:16 PM Aljoscha Krettek <aljos...@apache.org>
wrote:

> Thanks for updating! And yes, I think it's ok to include the few helper
> methods such as "readFromFile" and "print".
>
> I think we can now proceed to a vote! Nice work, overall!
>
> Best,
> Aljoscha
>
> On 16.07.20 17:16, Hequn Cheng wrote:
> > Hi,
> >
> > Thanks a lot for your discussions.
> > I think Aljoscha makes good suggestions here! Those problematic APIs
> should
> > not be added to the new Python DataStream API.
> >
> > Only one item I want to add based on the reply from Shuiqiang:
> > I would also tend to keep the readTextFile() method. Apart from print(),
> > the readTextFile() may also be very helpful and frequently used for
> playing
> > with Flink.
> > For example, it is used in our WordCount example[1] which is almost the
> > first Flink program that every beginner runs.
> > It is more efficient for reading multi-line data compared to
> > fromCollection() meanwhile far more easier to be used compared to Kafka,
> > Kinesis, RabbitMQ,etc., in
> > cases for playing with Flink.
> >
> > What do you think?
> >
> > Best,
> > Hequn
> >
> > [1]
> >
> https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/wordcount/WordCount.java
> >
> >
> > On Thu, Jul 16, 2020 at 3:37 PM Shuiqiang Chen <acqua....@gmail.com>
> wrote:
> >
> >> Hi Aljoscha,
> >>
> >> Thank you for your valuable comments! I agree with you that there is
> some
> >> optimization space for existing API and can be applied to the python
> >> DataStream API implementation.
> >>
> >> According to your comments, I have concluded them into the following
> parts:
> >>
> >> 1. SingleOutputStreamOperator and DataStreamSource.
> >> Yes, the SingleOutputStreamOperator and DataStreamSource are a bit
> >> redundant, so we can unify their APIs into DataStream to make it more
> >> clear.
> >>
> >> 2. The internal or low-level methods.
> >>   - DataStream.get_id(): Has been removed in the FLIP wiki page.
> >>   - DataStream.partition_custom(): Has been removed in the FLIP wiki
> page.
> >>   - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: Has been
> >> removed in the FLIP wiki page.
> >> Sorry for mistakenly making those internal methods public, we would not
> >> expose them to users in the Python API.
> >>
> >> 3. "declarative" Apis.
> >> - KeyedStream.sum/min/max/min_by/max_by: Has been removed in the FLIP
> wiki
> >> page. They could be well covered by Table API.
> >>
> >> 4. Spelling problems.
> >> - StreamExecutionEnvironment.from_collections. Should be
> from_collection().
> >> - StreamExecutionEnvironment.generate_sequenece. Should be
> >> generate_sequence().
> >> Sorry for the spelling error.
> >>
> >> 5. Predefined source and sink.
> >> As you said, most of the predefined sources are not suitable for
> >> production, we can ignore them in the new Python DataStream API.
> >> There is one exception that maybe I think we should add the print()
> since
> >> it is commonly used by users and it is very useful for debugging jobs.
> We
> >> can add comments for the API that it should never be used for
> production.
> >> Meanwhile, as you mentioned, a good alternative that always prints on
> the
> >> client should also be supported. For this case, maybe we can add the
> >> collect method and return an Iterator. With the iterator, uses can print
> >> the content on the client. This is also consistent with the behavior in
> >> Table API.
> >>
> >> 6. For Row.
> >> Do you mean that we should not expose the Row type in Python API? Maybe
> I
> >> haven't gotten your concerns well.
> >> We can use tuple type in Python DataStream to support Row. (I have
> updated
> >> the example section of the FLIP to reflect the design.)
> >>
> >> Highly appreciated for your suggestions again. Looking forward to your
> >> feedback.
> >>
> >> Best,
> >> Shuiqiang
> >>
> >> Aljoscha Krettek <aljos...@apache.org> 于2020年7月15日周三 下午5:58写道：
> >>
> >>> Hi,
> >>>
> >>> thanks for the proposal! I have some comments about the API. We should
> >> not
> >>> blindly copy the existing Java DataSteam because we made some mistakes
> >> with
> >>> that and we now have a chance to fix them and not forward them to a new
> >> API.
> >>>
> >>> I don't think we need SingleOutputStreamOperator, in the Scala API we
> >> just
> >>> have DataStream and the relevant methods from
> SingleOutputStreamOperator
> >>> are added to DataStream. Having this extra type is more confusing than
> >>> helpful to users, I think. In the same vain, I think we also don't need
> >>> DataStreamSource. The source methods can also just return a DataStream.
> >>>
> >>> There are some methods that I would consider internal and we shouldn't
> >>> expose them:
> >>>   - DataStream.get_id(): this is an internal method
> >>>   - DataStream.partition_custom(): I think adding this method was a
> >> mistake
> >>> because it's to low-level, I could be convinced otherwise
> >>>   - DataStream.print()/DataStream.print_to_error(): These are
> questionable
> >>> because they print to the TaskManager log. Maybe we could add a good
> >>> alternative that always prints on the client, similar to the Table API
> >>>   - DataStream.write_to_socket(): It was a mistake to add this sink on
> >>> DataStream it is not fault-tolerant and shouldn't be used in production
> >>>
> >>>   - KeyedStream.sum/min/max/min_by/max_by: Nowadays, the Table API
> should
> >>> be used for "declarative" use cases and I think these methods should
> not
> >> be
> >>> in the DataStream API
> >>>   - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: these
> are
> >>> internal methods
> >>>
> >>>   - StreamExecutionEnvironment.from_parallel_collection(): I think the
> >>> usability is questionable
> >>>   - StreamExecutionEnvironment.from_collections -> should be called
> >>> from_collection
> >>>   - StreamExecutionEnvironment.generate_sequenece -> should be called
> >>> generate_sequence
> >>>
> >>> I think most of the predefined sources are questionable:
> >>>   - fromParallelCollection: I don't know if this is useful
> >>>   - readTextFile: most of the variants are not useful/fault-tolerant
> >>>   - readFile: same
> >>>   - socketTextStream: also not useful except for toy examples
> >>>   - createInput: also not useful, and it's legacy DataSet InputFormats
> >>>
> >>> I think we need to think hard whether we want to further expose Row in
> >> our
> >>> APIs. I think adding it to flink-core was more an accident than
> anything
> >>> else but I can see that it would be useful for Python/Java interop.
> >>>
> >>> Best,
> >>> Aljoscha
> >>>
> >>>
> >>> On Mon, Jul 13, 2020, at 04:38, jincheng sun wrote:
> >>>> Thanks for bring up this DISCUSS Shuiqiang!
> >>>>
> >>>> +1 for the proposal!
> >>>>
> >>>> Best,
> >>>> Jincheng
> >>>>
> >>>>
> >>>> Xingbo Huang <hxbks...@gmail.com> 于2020年7月9日周四 上午10:41写道：
> >>>>
> >>>>> Hi Shuiqiang,
> >>>>>
> >>>>> Thanks a lot for driving this discussion.
> >>>>> Big +1 for supporting Python DataStream.
> >>>>> In many ML scenarios, operating Object will be more natural than
> >>> operating
> >>>>> Table.
> >>>>>
> >>>>> Best,
> >>>>> Xingbo
> >>>>>
> >>>>> Wei Zhong <weizhong0...@gmail.com> 于2020年7月9日周四 上午10:35写道：
> >>>>>
> >>>>>> Hi Shuiqiang,
> >>>>>>
> >>>>>> Thanks for driving this. Big +1 for supporting DataStream API in
> >>> PyFlink!
> >>>>>>
> >>>>>> Best,
> >>>>>> Wei
> >>>>>>
> >>>>>>
> >>>>>>> 在 2020年7月9日，10:29，Hequn Cheng <he...@apache.org> 写道：
> >>>>>>>
> >>>>>>> +1 for adding the Python DataStream API and starting with the
> >>> stateless
> >>>>>>> part.
> >>>>>>> There are already some users that expressed their wish to have
> >> the
> >>>>> Python
> >>>>>>> DataStream APIs. Once we have the APIs in PyFlink, we can cover
> >>> more
> >>>>> use
> >>>>>>> cases for our users.
> >>>>>>>
> >>>>>>> Best, Hequn
> >>>>>>>
> >>>>>>> On Wed, Jul 8, 2020 at 11:45 AM Shuiqiang Chen <
> >>> acqua....@gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>> Sorry, the 3rd link is broken, please refer to this one: Support
> >>>>> Python
> >>>>>>>> DataStream API
> >>>>>>>> <
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>
> >>
> https://docs.google.com/document/d/1H3hz8wuk22-8cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Shuiqiang Chen <acqua....@gmail.com> 于2020年7月8日周三 上午11:13写道：
> >>>>>>>>
> >>>>>>>>> Hi everyone,
> >>>>>>>>>
> >>>>>>>>> As we all know, Flink provides three layered APIs: the
> >>>>>> ProcessFunctions,
> >>>>>>>>> the DataStream API and the SQL & Table API. Each API offers a
> >>>>> different
> >>>>>>>>> trade-off between conciseness and expressiveness and targets
> >>>>> different
> >>>>>>>> use
> >>>>>>>>> cases[1].
> >>>>>>>>>
> >>>>>>>>> Currently, the SQL & Table API has already been supported in
> >>> PyFlink.
> >>>>>> The
> >>>>>>>>> API provides relational operations as well as user-defined
> >>> functions
> >>>>> to
> >>>>>>>>> provide convenience for users who are familiar with python and
> >>>>>> relational
> >>>>>>>>> programming.
> >>>>>>>>>
> >>>>>>>>> Meanwhile, the DataStream API and ProcessFunctions provide more
> >>>>> generic
> >>>>>>>>> APIs to implement stream processing applications. The
> >>>>> ProcessFunctions
> >>>>>>>>> expose time and state which are the fundamental building blocks
> >>> for
> >>>>> any
> >>>>>>>>> kind of streaming application.
> >>>>>>>>> To cover more use cases, we are planning to cover all these
> >> APIs
> >>> in
> >>>>>>>>> PyFlink.
> >>>>>>>>>
> >>>>>>>>> In this discussion(FLIP-130), we propose to support the Python
> >>>>>> DataStream
> >>>>>>>>> API for the stateless part. For more detail, please refer to
> >> the
> >>> FLIP
> >>>>>>>> wiki
> >>>>>>>>> page here[2]. If interested in the stateful part, you can also
> >>> take a
> >>>>>>>>> look the design doc here[3] for which we are going to discuss
> >> in
> >>> a
> >>>>>>>> separate
> >>>>>>>>> FLIP.
> >>>>>>>>>
> >>>>>>>>> Any comments will be highly appreciated!
> >>>>>>>>>
> >>>>>>>>> [1]
> >>> https://flink.apache.org/flink-applications.html#layered-apis
> >>>>>>>>> [2]
> >>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298
> >>>>>>>>> [3]
> >>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>
> >>
> https://docs.google.com/document/d/1H3hz8wuk228cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit?usp=sharing
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Shuiqiang
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>
>

Re: [DISCUSS] FLIP-130: Support for Python DataStream API (Stateless Part)

Reply via email to