Thanks a lot for your valuable feedback and suggestions! @Aljoscha Krettek <aljos...@apache.org> +1 to the vote.
Best, Hequn On Fri, Jul 24, 2020 at 5:16 PM Aljoscha Krettek <aljos...@apache.org> wrote: > Thanks for updating! And yes, I think it's ok to include the few helper > methods such as "readFromFile" and "print". > > I think we can now proceed to a vote! Nice work, overall! > > Best, > Aljoscha > > On 16.07.20 17:16, Hequn Cheng wrote: > > Hi, > > > > Thanks a lot for your discussions. > > I think Aljoscha makes good suggestions here! Those problematic APIs > should > > not be added to the new Python DataStream API. > > > > Only one item I want to add based on the reply from Shuiqiang: > > I would also tend to keep the readTextFile() method. Apart from print(), > > the readTextFile() may also be very helpful and frequently used for > playing > > with Flink. > > For example, it is used in our WordCount example[1] which is almost the > > first Flink program that every beginner runs. > > It is more efficient for reading multi-line data compared to > > fromCollection() meanwhile far more easier to be used compared to Kafka, > > Kinesis, RabbitMQ,etc., in > > cases for playing with Flink. > > > > What do you think? > > > > Best, > > Hequn > > > > [1] > > > https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/wordcount/WordCount.java > > > > > > On Thu, Jul 16, 2020 at 3:37 PM Shuiqiang Chen <acqua....@gmail.com> > wrote: > > > >> Hi Aljoscha, > >> > >> Thank you for your valuable comments! I agree with you that there is > some > >> optimization space for existing API and can be applied to the python > >> DataStream API implementation. > >> > >> According to your comments, I have concluded them into the following > parts: > >> > >> 1. SingleOutputStreamOperator and DataStreamSource. > >> Yes, the SingleOutputStreamOperator and DataStreamSource are a bit > >> redundant, so we can unify their APIs into DataStream to make it more > >> clear. > >> > >> 2. The internal or low-level methods. > >> - DataStream.get_id(): Has been removed in the FLIP wiki page. > >> - DataStream.partition_custom(): Has been removed in the FLIP wiki > page. > >> - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: Has been > >> removed in the FLIP wiki page. > >> Sorry for mistakenly making those internal methods public, we would not > >> expose them to users in the Python API. > >> > >> 3. "declarative" Apis. > >> - KeyedStream.sum/min/max/min_by/max_by: Has been removed in the FLIP > wiki > >> page. They could be well covered by Table API. > >> > >> 4. Spelling problems. > >> - StreamExecutionEnvironment.from_collections. Should be > from_collection(). > >> - StreamExecutionEnvironment.generate_sequenece. Should be > >> generate_sequence(). > >> Sorry for the spelling error. > >> > >> 5. Predefined source and sink. > >> As you said, most of the predefined sources are not suitable for > >> production, we can ignore them in the new Python DataStream API. > >> There is one exception that maybe I think we should add the print() > since > >> it is commonly used by users and it is very useful for debugging jobs. > We > >> can add comments for the API that it should never be used for > production. > >> Meanwhile, as you mentioned, a good alternative that always prints on > the > >> client should also be supported. For this case, maybe we can add the > >> collect method and return an Iterator. With the iterator, uses can print > >> the content on the client. This is also consistent with the behavior in > >> Table API. > >> > >> 6. For Row. > >> Do you mean that we should not expose the Row type in Python API? Maybe > I > >> haven't gotten your concerns well. > >> We can use tuple type in Python DataStream to support Row. (I have > updated > >> the example section of the FLIP to reflect the design.) > >> > >> Highly appreciated for your suggestions again. Looking forward to your > >> feedback. > >> > >> Best, > >> Shuiqiang > >> > >> Aljoscha Krettek <aljos...@apache.org> 于2020年7月15日周三 下午5:58写道: > >> > >>> Hi, > >>> > >>> thanks for the proposal! I have some comments about the API. We should > >> not > >>> blindly copy the existing Java DataSteam because we made some mistakes > >> with > >>> that and we now have a chance to fix them and not forward them to a new > >> API. > >>> > >>> I don't think we need SingleOutputStreamOperator, in the Scala API we > >> just > >>> have DataStream and the relevant methods from > SingleOutputStreamOperator > >>> are added to DataStream. Having this extra type is more confusing than > >>> helpful to users, I think. In the same vain, I think we also don't need > >>> DataStreamSource. The source methods can also just return a DataStream. > >>> > >>> There are some methods that I would consider internal and we shouldn't > >>> expose them: > >>> - DataStream.get_id(): this is an internal method > >>> - DataStream.partition_custom(): I think adding this method was a > >> mistake > >>> because it's to low-level, I could be convinced otherwise > >>> - DataStream.print()/DataStream.print_to_error(): These are > questionable > >>> because they print to the TaskManager log. Maybe we could add a good > >>> alternative that always prints on the client, similar to the Table API > >>> - DataStream.write_to_socket(): It was a mistake to add this sink on > >>> DataStream it is not fault-tolerant and shouldn't be used in production > >>> > >>> - KeyedStream.sum/min/max/min_by/max_by: Nowadays, the Table API > should > >>> be used for "declarative" use cases and I think these methods should > not > >> be > >>> in the DataStream API > >>> - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: these > are > >>> internal methods > >>> > >>> - StreamExecutionEnvironment.from_parallel_collection(): I think the > >>> usability is questionable > >>> - StreamExecutionEnvironment.from_collections -> should be called > >>> from_collection > >>> - StreamExecutionEnvironment.generate_sequenece -> should be called > >>> generate_sequence > >>> > >>> I think most of the predefined sources are questionable: > >>> - fromParallelCollection: I don't know if this is useful > >>> - readTextFile: most of the variants are not useful/fault-tolerant > >>> - readFile: same > >>> - socketTextStream: also not useful except for toy examples > >>> - createInput: also not useful, and it's legacy DataSet InputFormats > >>> > >>> I think we need to think hard whether we want to further expose Row in > >> our > >>> APIs. I think adding it to flink-core was more an accident than > anything > >>> else but I can see that it would be useful for Python/Java interop. > >>> > >>> Best, > >>> Aljoscha > >>> > >>> > >>> On Mon, Jul 13, 2020, at 04:38, jincheng sun wrote: > >>>> Thanks for bring up this DISCUSS Shuiqiang! > >>>> > >>>> +1 for the proposal! > >>>> > >>>> Best, > >>>> Jincheng > >>>> > >>>> > >>>> Xingbo Huang <hxbks...@gmail.com> 于2020年7月9日周四 上午10:41写道: > >>>> > >>>>> Hi Shuiqiang, > >>>>> > >>>>> Thanks a lot for driving this discussion. > >>>>> Big +1 for supporting Python DataStream. > >>>>> In many ML scenarios, operating Object will be more natural than > >>> operating > >>>>> Table. > >>>>> > >>>>> Best, > >>>>> Xingbo > >>>>> > >>>>> Wei Zhong <weizhong0...@gmail.com> 于2020年7月9日周四 上午10:35写道: > >>>>> > >>>>>> Hi Shuiqiang, > >>>>>> > >>>>>> Thanks for driving this. Big +1 for supporting DataStream API in > >>> PyFlink! > >>>>>> > >>>>>> Best, > >>>>>> Wei > >>>>>> > >>>>>> > >>>>>>> 在 2020年7月9日,10:29,Hequn Cheng <he...@apache.org> 写道: > >>>>>>> > >>>>>>> +1 for adding the Python DataStream API and starting with the > >>> stateless > >>>>>>> part. > >>>>>>> There are already some users that expressed their wish to have > >> the > >>>>> Python > >>>>>>> DataStream APIs. Once we have the APIs in PyFlink, we can cover > >>> more > >>>>> use > >>>>>>> cases for our users. > >>>>>>> > >>>>>>> Best, Hequn > >>>>>>> > >>>>>>> On Wed, Jul 8, 2020 at 11:45 AM Shuiqiang Chen < > >>> acqua....@gmail.com> > >>>>>> wrote: > >>>>>>> > >>>>>>>> Sorry, the 3rd link is broken, please refer to this one: Support > >>>>> Python > >>>>>>>> DataStream API > >>>>>>>> < > >>>>>>>> > >>>>>> > >>>>> > >>> > >> > https://docs.google.com/document/d/1H3hz8wuk22-8cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit > >>>>>>>>> > >>>>>>>> > >>>>>>>> Shuiqiang Chen <acqua....@gmail.com> 于2020年7月8日周三 上午11:13写道: > >>>>>>>> > >>>>>>>>> Hi everyone, > >>>>>>>>> > >>>>>>>>> As we all know, Flink provides three layered APIs: the > >>>>>> ProcessFunctions, > >>>>>>>>> the DataStream API and the SQL & Table API. Each API offers a > >>>>> different > >>>>>>>>> trade-off between conciseness and expressiveness and targets > >>>>> different > >>>>>>>> use > >>>>>>>>> cases[1]. > >>>>>>>>> > >>>>>>>>> Currently, the SQL & Table API has already been supported in > >>> PyFlink. > >>>>>> The > >>>>>>>>> API provides relational operations as well as user-defined > >>> functions > >>>>> to > >>>>>>>>> provide convenience for users who are familiar with python and > >>>>>> relational > >>>>>>>>> programming. > >>>>>>>>> > >>>>>>>>> Meanwhile, the DataStream API and ProcessFunctions provide more > >>>>> generic > >>>>>>>>> APIs to implement stream processing applications. The > >>>>> ProcessFunctions > >>>>>>>>> expose time and state which are the fundamental building blocks > >>> for > >>>>> any > >>>>>>>>> kind of streaming application. > >>>>>>>>> To cover more use cases, we are planning to cover all these > >> APIs > >>> in > >>>>>>>>> PyFlink. > >>>>>>>>> > >>>>>>>>> In this discussion(FLIP-130), we propose to support the Python > >>>>>> DataStream > >>>>>>>>> API for the stateless part. For more detail, please refer to > >> the > >>> FLIP > >>>>>>>> wiki > >>>>>>>>> page here[2]. If interested in the stateful part, you can also > >>> take a > >>>>>>>>> look the design doc here[3] for which we are going to discuss > >> in > >>> a > >>>>>>>> separate > >>>>>>>>> FLIP. > >>>>>>>>> > >>>>>>>>> Any comments will be highly appreciated! > >>>>>>>>> > >>>>>>>>> [1] > >>> https://flink.apache.org/flink-applications.html#layered-apis > >>>>>>>>> [2] > >>>>>>>>> > >>>>>>>> > >>>>>> > >>>>> > >>> > >> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298 > >>>>>>>>> [3] > >>>>>>>>> > >>>>>>>> > >>>>>> > >>>>> > >>> > >> > https://docs.google.com/document/d/1H3hz8wuk228cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit?usp=sharing > >>>>>>>>> > >>>>>>>>> Best, > >>>>>>>>> Shuiqiang > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > > > >