Sorry I'm coming to this rather late, but I would like to argue that DataStream.partitionCustom enables an important use case: partitioned enrichment, where each parallel instance preloads only the slice of a static dataset that it needs for the enrichment.
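The core of the pattern, as a rough Java sketch, looks like this. (It is only a sketch: Event, EnrichedEvent, ReferenceData, ReferenceDataLoader and EventSource are placeholder names I made up for illustration, not code taken from the example linked below.)

import java.util.Map;

import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Event> events = env.addSource(new EventSource());  // any source of keyed events (placeholder)

DataStream<EnrichedEvent> enriched = events
    // Route each event to the subtask that preloaded the matching slice of the reference data.
    .partitionCustom(
        (Partitioner<String>) (key, numPartitions) -> Math.floorMod(key.hashCode(), numPartitions),
        (KeySelector<Event, String>) Event::getKey)
    .map(new RichMapFunction<Event, EnrichedEvent>() {
        private transient Map<String, ReferenceData> slice;

        @Override
        public void open(Configuration parameters) throws Exception {
            // Each parallel instance loads only the part of the static data set
            // whose keys hash to its own subtask index.
            slice = ReferenceDataLoader.loadSlice(
                getRuntimeContext().getIndexOfThisSubtask(),
                getRuntimeContext().getNumberOfParallelSubtasks());
        }

        @Override
        public EnrichedEvent map(Event event) {
            return new EnrichedEvent(event, slice.get(event.getKey()));
        }
    });

The important property is that the Partitioner and the loader agree on how keys are mapped to subtask indices, so every event is routed to the instance that holds its slice of the reference data.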
For a complete, runnable example, see
https://github.com/knaufk/enrichments-with-flink/blob/master/src/main/java/com/github/knaufk/enrichments/CustomPartitionEnrichmenttJob.java

Regards,
David

On Fri, Jul 24, 2020 at 12:18 PM Shuiqiang Chen <acqua....@gmail.com> wrote:

> Hi Aljoscha,
>
> Thank you for your response. I'll keep these two helper methods in the Python DataStream implementation.
>
> And thank you all for joining in the discussion. It seems that we have reached a consensus. I will start a vote for this FLIP later today.
>
> Best,
> Shuiqiang
>
> Hequn Cheng <he...@apache.org> wrote on Fri, Jul 24, 2020 at 5:29 PM:
>
> > Thanks a lot for your valuable feedback and suggestions! @Aljoscha Krettek <aljos...@apache.org>
> > +1 to the vote.
> >
> > Best,
> > Hequn
> >
> > On Fri, Jul 24, 2020 at 5:16 PM Aljoscha Krettek <aljos...@apache.org> wrote:
> >
> > > Thanks for updating! And yes, I think it's ok to include the few helper methods such as "readFromFile" and "print".
> > >
> > > I think we can now proceed to a vote! Nice work, overall!
> > >
> > > Best,
> > > Aljoscha
> > >
> > > On 16.07.20 17:16, Hequn Cheng wrote:
> > > > Hi,
> > > >
> > > > Thanks a lot for your discussions.
> > > > I think Aljoscha makes good suggestions here! Those problematic APIs should not be added to the new Python DataStream API.
> > > >
> > > > Only one item I want to add based on the reply from Shuiqiang:
> > > > I would also tend to keep the readTextFile() method. Apart from print(), readTextFile() may also be very helpful and is frequently used when playing with Flink.
> > > > For example, it is used in our WordCount example[1], which is almost the first Flink program that every beginner runs.
> > > > It is more efficient for reading multi-line data than fromCollection(), while being far easier to use than Kafka, Kinesis, RabbitMQ, etc., when playing with Flink.
> > > >
> > > > What do you think?
> > > >
> > > > Best,
> > > > Hequn
> > > >
> > > > [1] https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/wordcount/WordCount.java
> > > >
> > > > On Thu, Jul 16, 2020 at 3:37 PM Shuiqiang Chen <acqua....@gmail.com> wrote:
> > > >
> > > >> Hi Aljoscha,
> > > >>
> > > >> Thank you for your valuable comments! I agree with you that there is room for optimization in the existing API, and that this can be applied to the Python DataStream API implementation.
> > > >>
> > > >> Based on your comments, I have grouped them into the following parts:
> > > >>
> > > >> 1. SingleOutputStreamOperator and DataStreamSource.
> > > >> Yes, SingleOutputStreamOperator and DataStreamSource are a bit redundant, so we can unify their APIs into DataStream to make it clearer.
> > > >>
> > > >> 2. The internal or low-level methods.
> > > >> - DataStream.get_id(): has been removed from the FLIP wiki page.
> > > >> - DataStream.partition_custom(): has been removed from the FLIP wiki page.
> > > >> - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: has been removed from the FLIP wiki page.
> > > >> Sorry for mistakenly making those internal methods public; we will not expose them to users in the Python API.
> > > >>
> > > >> 3. "Declarative" APIs.
> > > >> - KeyedStream.sum/min/max/min_by/max_by: have been removed from the FLIP wiki page. They can be well covered by the Table API.
> > > >>
> > > >> 4. Spelling problems.
> > > >> - StreamExecutionEnvironment.from_collections: should be from_collection().
> > > >> - StreamExecutionEnvironment.generate_sequenece: should be generate_sequence().
> > > >> Sorry for the spelling errors.
> > > >>
> > > >> 5. Predefined sources and sinks.
> > > >> As you said, most of the predefined sources are not suitable for production, so we can ignore them in the new Python DataStream API.
> > > >> One exception: I think we should add print(), since it is commonly used and very useful for debugging jobs. We can document that it should never be used in production.
> > > >> Meanwhile, as you mentioned, a good alternative that always prints on the client should also be supported. For this case, maybe we can add a collect method that returns an Iterator. With the iterator, users can print the content on the client. This is also consistent with the behavior in the Table API.
> > > >>
> > > >> 6. For Row.
> > > >> Do you mean that we should not expose the Row type in the Python API? Maybe I haven't fully understood your concern.
> > > >> We can use the tuple type in the Python DataStream API to support Row. (I have updated the example section of the FLIP to reflect the design.)
> > > >>
> > > >> Thanks again for your suggestions. Looking forward to your feedback.
> > > >>
> > > >> Best,
> > > >> Shuiqiang
> > > >>
> > > >> Aljoscha Krettek <aljos...@apache.org> wrote on Wed, Jul 15, 2020 at 5:58 PM:
> > > >>
> > > >>> Hi,
> > > >>>
> > > >>> thanks for the proposal! I have some comments about the API. We should not blindly copy the existing Java DataStream, because we made some mistakes with that and we now have a chance to fix them and not forward them to a new API.
> > > >>>
> > > >>> I don't think we need SingleOutputStreamOperator; in the Scala API we just have DataStream, and the relevant methods from SingleOutputStreamOperator are added to DataStream. Having this extra type is more confusing than helpful to users, I think. In the same vein, I think we also don't need DataStreamSource. The source methods can also just return a DataStream.
> > > >>>
> > > >>> There are some methods that I would consider internal and we shouldn't expose them:
> > > >>> - DataStream.get_id(): this is an internal method
> > > >>> - DataStream.partition_custom(): I think adding this method was a mistake because it's too low-level, but I could be convinced otherwise
> > > >>> - DataStream.print()/DataStream.print_to_error(): These are questionable because they print to the TaskManager log.
> > > >>> Maybe we could add a good alternative that always prints on the client, similar to the Table API.
> > > >>> - DataStream.write_to_socket(): It was a mistake to add this sink on DataStream; it is not fault-tolerant and shouldn't be used in production.
> > > >>>
> > > >>> - KeyedStream.sum/min/max/min_by/max_by: Nowadays the Table API should be used for "declarative" use cases, and I think these methods should not be in the DataStream API
> > > >>> - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: these are internal methods
> > > >>>
> > > >>> - StreamExecutionEnvironment.from_parallel_collection(): I think the usability is questionable
> > > >>> - StreamExecutionEnvironment.from_collections -> should be called from_collection
> > > >>> - StreamExecutionEnvironment.generate_sequenece -> should be called generate_sequence
> > > >>>
> > > >>> I think most of the predefined sources are questionable:
> > > >>> - fromParallelCollection: I don't know if this is useful
> > > >>> - readTextFile: most of the variants are not useful/fault-tolerant
> > > >>> - readFile: same
> > > >>> - socketTextStream: also not useful except for toy examples
> > > >>> - createInput: also not useful, and it's legacy DataSet InputFormats
> > > >>>
> > > >>> I think we need to think hard about whether we want to further expose Row in our APIs. I think adding it to flink-core was more an accident than anything else, but I can see that it would be useful for Python/Java interop.
> > > >>>
> > > >>> Best,
> > > >>> Aljoscha
> > > >>>
> > > >>> On Mon, Jul 13, 2020, at 04:38, jincheng sun wrote:
> > > >>>> Thanks for bringing up this DISCUSS, Shuiqiang!
> > > >>>>
> > > >>>> +1 for the proposal!
> > > >>>>
> > > >>>> Best,
> > > >>>> Jincheng
> > > >>>>
> > > >>>> Xingbo Huang <hxbks...@gmail.com> wrote on Thu, Jul 9, 2020 at 10:41 AM:
> > > >>>>
> > > >>>>> Hi Shuiqiang,
> > > >>>>>
> > > >>>>> Thanks a lot for driving this discussion.
> > > >>>>> Big +1 for supporting Python DataStream.
> > > >>>>> In many ML scenarios, operating on Objects will be more natural than operating on Tables.
> > > >>>>>
> > > >>>>> Best,
> > > >>>>> Xingbo
> > > >>>>>
> > > >>>>> Wei Zhong <weizhong0...@gmail.com> wrote on Thu, Jul 9, 2020 at 10:35 AM:
> > > >>>>>
> > > >>>>>> Hi Shuiqiang,
> > > >>>>>>
> > > >>>>>> Thanks for driving this. Big +1 for supporting the DataStream API in PyFlink!
> > > >>>>>>
> > > >>>>>> Best,
> > > >>>>>> Wei
> > > >>>>>>
> > > >>>>>> On Jul 9, 2020, at 10:29, Hequn Cheng <he...@apache.org> wrote:
> > > >>>>>>
> > > >>>>>>> +1 for adding the Python DataStream API and starting with the stateless part.
> > > >>>>>>> There are already some users who have expressed their wish to have the Python DataStream APIs. Once we have the APIs in PyFlink, we can cover more use cases for our users.
> > > >>>>>>>
> > > >>>>>>> Best, Hequn
> > > >>>>>>>
> > > >>>>>>> On Wed, Jul 8, 2020 at 11:45 AM Shuiqiang Chen <acqua....@gmail.com> wrote:
> > > >>>>>>>
> > > >>>>>>>> Sorry, the 3rd link is broken, please refer to this one: Support Python DataStream API
> > > >>>>>>>> <https://docs.google.com/document/d/1H3hz8wuk22-8cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit>
> > > >>>>>>>>
> > > >>>>>>>> Shuiqiang Chen <acqua....@gmail.com> wrote on Wed, Jul 8, 2020 at 11:13 AM:
> > > >>>>>>>>
> > > >>>>>>>>> Hi everyone,
> > > >>>>>>>>>
> > > >>>>>>>>> As we all know, Flink provides three layered APIs: the ProcessFunctions, the DataStream API and the SQL & Table API. Each API offers a different trade-off between conciseness and expressiveness and targets different use cases[1].
> > > >>>>>>>>>
> > > >>>>>>>>> Currently, the SQL & Table API is already supported in PyFlink. The API provides relational operations as well as user-defined functions, offering convenience for users who are familiar with Python and relational programming.
> > > >>>>>>>>>
> > > >>>>>>>>> Meanwhile, the DataStream API and ProcessFunctions provide more generic APIs for implementing stream processing applications. The ProcessFunctions expose time and state, which are the fundamental building blocks for any kind of streaming application.
> > > >>>>>>>>> To cover more use cases, we are planning to cover all these APIs in PyFlink.
> > > >>>>>>>>>
> > > >>>>>>>>> In this discussion (FLIP-130), we propose to support the Python DataStream API for the stateless part. For more detail, please refer to the FLIP wiki page here[2]. If you are interested in the stateful part, you can also take a look at the design doc here[3], which we are going to discuss in a separate FLIP.
> > > >>>>>>>>>
> > > >>>>>>>>> Any comments will be highly appreciated!
> > > >>>>>>>>>
> > > >>>>>>>>> [1] https://flink.apache.org/flink-applications.html#layered-apis
> > > >>>>>>>>> [2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298
> > > >>>>>>>>> [3] https://docs.google.com/document/d/1H3hz8wuk228cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit?usp=sharing
> > > >>>>>>>>>
> > > >>>>>>>>> Best,
> > > >>>>>>>>> Shuiqiang