Hi David,

Thank you for your reply! I have started the vote for this FLIP, but we can keep the discussion on this thread. From my perspective, I would not be against adding DataStream.partitionCustom to the Python DataStream API. However, more input is welcome.
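As a plain-Python sketch of the use case in question: with a custom partitioner, each parallel instance can preload only the slice of a static enrichment dataset whose keys route to it. Everything here (`partition_for`, `EnrichmentInstance`, the catalog data) is an illustrative assumption, not PyFlink API:

```python
# Illustration of custom-partitioned enrichment: each parallel instance
# preloads only the slice of a static dataset whose keys route to it,
# so no single instance has to hold the full enrichment table in memory.

PARALLELISM = 3

def partition_for(key: str, num_partitions: int) -> int:
    # A custom partitioner: deterministic key -> partition mapping.
    return sum(ord(c) for c in key) % num_partitions

# Static enrichment dataset (e.g. a product catalog).
CATALOG = {"apple": 1.2, "banana": 0.5, "cherry": 3.0, "date": 2.1}

class EnrichmentInstance:
    """One parallel instance; preloads only its slice of CATALOG."""
    def __init__(self, index: int):
        self.index = index
        self.slice = {k: v for k, v in CATALOG.items()
                      if partition_for(k, PARALLELISM) == index}

    def enrich(self, key: str):
        # The partitioner guarantees that matching keys arrive here.
        return (key, self.slice[key])

instances = [EnrichmentInstance(i) for i in range(PARALLELISM)]

events = ["apple", "cherry", "banana", "apple"]
enriched = [instances[partition_for(k, PARALLELISM)].enrich(k)
            for k in events]
print(enriched)
```

Because events and catalog entries use the same partitioner, every instance only ever sees keys it has preloaded, which is the point of the pattern.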
Best,
Shuiqiang

David Anderson <da...@alpinegizmo.com> wrote on Fri, Jul 24, 2020 at 7:52 PM:

Sorry I'm coming to this rather late, but I would like to argue that DataStream.partitionCustom enables an important use case. What I have in mind is performing partitioned enrichment, where each instance can preload a slice of a static dataset that is being used for enrichment.

For an example, consider https://github.com/knaufk/enrichments-with-flink/blob/master/src/main/java/com/github/knaufk/enrichments/CustomPartitionEnrichmenttJob.java.

Regards,
David

On Fri, Jul 24, 2020 at 12:18 PM Shuiqiang Chen <acqua....@gmail.com> wrote:

Hi Aljoscha,

Thank you for your response. I'll keep these two helper methods in the Python DataStream implementation.

And thank you all for joining the discussion. It seems that we have reached a consensus. I will start a vote for this FLIP later today.

Best,
Shuiqiang

Hequn Cheng <he...@apache.org> wrote on Fri, Jul 24, 2020 at 5:29 PM:

Thanks a lot for your valuable feedback and suggestions, @Aljoscha Krettek <aljos...@apache.org>!
+1 to the vote.

Best,
Hequn

On Fri, Jul 24, 2020 at 5:16 PM Aljoscha Krettek <aljos...@apache.org> wrote:

Thanks for updating! And yes, I think it's OK to include the few helper methods such as "readFromFile" and "print".

I think we can now proceed to a vote! Nice work, overall!

Best,
Aljoscha

On 16.07.20 17:16, Hequn Cheng wrote:

Hi,

Thanks a lot for your discussions. I think Aljoscha makes good suggestions here! Those problematic APIs should not be added to the new Python DataStream API.

Only one item I want to add based on the reply from Shuiqiang: I would also tend to keep the readTextFile() method.
Apart from print(), the readTextFile() method may also be very helpful and frequently used for playing with Flink. For example, it is used in our WordCount example[1], which is almost the first Flink program that every beginner runs. It is more efficient for reading multi-line data compared to fromCollection(), while being far easier to use compared to Kafka, Kinesis, RabbitMQ, etc., in cases of playing with Flink.

What do you think?

Best,
Hequn

[1] https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/wordcount/WordCount.java

On Thu, Jul 16, 2020 at 3:37 PM Shuiqiang Chen <acqua....@gmail.com> wrote:

Hi Aljoscha,

Thank you for your valuable comments! I agree with you that there is some room for optimization in the existing API, and that this can be applied to the Python DataStream API implementation.

I have grouped your comments into the following parts:

1. SingleOutputStreamOperator and DataStreamSource.
Yes, SingleOutputStreamOperator and DataStreamSource are a bit redundant, so we can unify their APIs into DataStream to make it clearer.

2. The internal or low-level methods.
- DataStream.get_id(): has been removed on the FLIP wiki page.
- DataStream.partition_custom(): has been removed on the FLIP wiki page.
- SingleOutputStreamOperator.can_be_parallel/forceNoParallel: has been removed on the FLIP wiki page.
Sorry for mistakenly making those internal methods public; we will not expose them to users in the Python API.

3. "Declarative" APIs.
- KeyedStream.sum/min/max/min_by/max_by: have been removed on the FLIP wiki page. They are well covered by the Table API.

4. Spelling problems.
- StreamExecutionEnvironment.from_collections should be from_collection().
- StreamExecutionEnvironment.generate_sequenece should be generate_sequence().
Sorry for the spelling errors.

5. Predefined sources and sinks.
As you said, most of the predefined sources are not suitable for production, so we can omit them from the new Python DataStream API. One exception is print(): I think we should add it, since it is commonly used by users and very useful for debugging jobs. We can add a comment for the API stating that it should never be used in production. Meanwhile, as you mentioned, a good alternative that always prints on the client should also be supported. For this case, we can add a collect() method that returns an Iterator. With the iterator, users can print the content on the client. This is also consistent with the behavior in the Table API.

6. Row.
Do you mean that we should not expose the Row type in the Python API? Maybe I haven't fully understood your concern. We can use the tuple type in the Python DataStream API to support Row. (I have updated the example section of the FLIP to reflect the design.)

Thank you again for your suggestions. Looking forward to your feedback.
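Point 6 above, mapping Row onto the Python tuple type, could look like the following sketch (the field names and the use of `namedtuple` are illustrative assumptions, not the final API):

```python
from collections import namedtuple

# A Row with named fields maps naturally onto a Python tuple:
# positional access matches the Row's field indices, while
# namedtuple adds readable field names without giving up
# plain-tuple semantics for Python/Java interop.
Row = namedtuple("Row", ["name", "amount"])

row = Row("apple", 3)

assert tuple(row) == ("apple", 3)   # behaves like a plain tuple
assert row[1] == 3                  # positional access by index
print(row.name, row.amount)         # named access for readability
```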
Best,
Shuiqiang

Aljoscha Krettek <aljos...@apache.org> wrote on Wed, Jul 15, 2020 at 5:58 PM:

Hi,

thanks for the proposal! I have some comments about the API. We should not blindly copy the existing Java DataStream, because we made some mistakes with it, and we now have a chance to fix them rather than forward them to a new API.

I don't think we need SingleOutputStreamOperator; in the Scala API we just have DataStream, and the relevant methods from SingleOutputStreamOperator are added to DataStream. Having this extra type is more confusing than helpful to users, I think. In the same vein, I think we also don't need DataStreamSource. The source methods can also just return a DataStream.

There are some methods that I would consider internal, and we shouldn't expose them:
- DataStream.get_id(): this is an internal method
- DataStream.partition_custom(): I think adding this method was a mistake because it's too low-level; I could be convinced otherwise
- DataStream.print()/DataStream.print_to_error(): these are questionable because they print to the TaskManager log. Maybe we could add a good alternative that always prints on the client, similar to the Table API
- DataStream.write_to_socket(): it was a mistake to add this sink on DataStream; it is not fault-tolerant and shouldn't be used in production

- KeyedStream.sum/min/max/min_by/max_by: nowadays the Table API should be used for "declarative" use cases, and I think these methods should not be in the DataStream API
- SingleOutputStreamOperator.can_be_parallel/forceNoParallel: these are internal methods

- StreamExecutionEnvironment.from_parallel_collection(): I think the usability is questionable
- StreamExecutionEnvironment.from_collections -> should be called from_collection
- StreamExecutionEnvironment.generate_sequenece -> should be called generate_sequence

I think most of the predefined sources are questionable:
- fromParallelCollection: I don't know if this is useful
- readTextFile: most of the variants are not useful/fault-tolerant
- readFile: same
- socketTextStream: also not useful except for toy examples
- createInput: also not useful, and it's legacy DataSet InputFormats

I think we need to think hard about whether we want to further expose Row in our APIs. I think adding it to flink-core was more an accident than anything else, but I can see that it would be useful for Python/Java interop.

Best,
Aljoscha

On Mon, Jul 13, 2020, at 04:38, jincheng sun wrote:

Thanks for bringing up this DISCUSS, Shuiqiang!

+1 for the proposal!
Best,
Jincheng

Xingbo Huang <hxbks...@gmail.com> wrote on Thu, Jul 9, 2020 at 10:41 AM:

Hi Shuiqiang,

Thanks a lot for driving this discussion. Big +1 for supporting Python DataStream. In many ML scenarios, operating on objects will be more natural than operating on Tables.

Best,
Xingbo

Wei Zhong <weizhong0...@gmail.com> wrote on Thu, Jul 9, 2020 at 10:35 AM:

Hi Shuiqiang,

Thanks for driving this. Big +1 for supporting the DataStream API in PyFlink!

Best,
Wei

On Jul 9, 2020, at 10:29, Hequn Cheng <he...@apache.org> wrote:

+1 for adding the Python DataStream API and starting with the stateless part. Some users have already expressed their wish to have the Python DataStream APIs. Once we have the APIs in PyFlink, we can cover more use cases for our users.
Best,
Hequn

On Wed, Jul 8, 2020 at 11:45 AM Shuiqiang Chen <acqua....@gmail.com> wrote:

Sorry, the 3rd link is broken; please refer to this one: Support Python DataStream API <https://docs.google.com/document/d/1H3hz8wuk22-8cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit>

Shuiqiang Chen <acqua....@gmail.com> wrote on Wed, Jul 8, 2020 at 11:13 AM:

Hi everyone,

As we all know, Flink provides three layered APIs: the ProcessFunctions, the DataStream API and the SQL & Table API. Each API offers a different trade-off between conciseness and expressiveness and targets different use cases[1].

Currently, the SQL & Table API is already supported in PyFlink. The API provides relational operations as well as user-defined functions, offering convenience for users who are familiar with Python and relational programming.

Meanwhile, the DataStream API and ProcessFunctions provide more generic APIs to implement stream processing applications. The ProcessFunctions expose time and state, which are the fundamental building blocks for any kind of streaming application. To cover more use cases, we are planning to cover all these APIs in PyFlink.
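To make the stateless scope concrete, the core operations involved (flat_map, key_by followed by reduce) can be simulated on a finite list in plain Python. This is a conceptual sketch of WordCount-style semantics only, not PyFlink code; the helper names are assumptions:

```python
from collections import defaultdict

def flat_map(stream, fn):
    # flat_map: each input element may produce zero or more outputs.
    return [out for element in stream for out in fn(element)]

def key_by_and_reduce(stream, key_fn, reduce_fn):
    # key_by: group elements by key; reduce: fold each group pairwise.
    groups = defaultdict(list)
    for element in stream:
        groups[key_fn(element)].append(element)
    out = []
    for elements in groups.values():
        acc = elements[0]
        for e in elements[1:]:
            acc = reduce_fn(acc, e)
        out.append(acc)
    return out

lines = ["to be or not to be"]
pairs = flat_map(lines, lambda line: [(w, 1) for w in line.split()])
counts = key_by_and_reduce(pairs, lambda p: p[0],
                           lambda a, b: (a[0], a[1] + b[1]))
print(sorted(counts))
```

The same three steps are what a Python DataStream WordCount would express on an unbounded stream, with the runtime handling the grouping and incremental reduction.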
In this discussion (FLIP-130), we propose to support the Python DataStream API for the stateless part. For more detail, please refer to the FLIP wiki page here[2]. If you are interested in the stateful part, you can also take a look at the design doc here[3], which we are going to discuss in a separate FLIP.

Any comments will be highly appreciated!

[1] https://flink.apache.org/flink-applications.html#layered-apis
[2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298
[3] https://docs.google.com/document/d/1H3hz8wuk228cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit?usp=sharing

Best,
Shuiqiang