Hi Rong, thanks for your answer. If I understood well, the option will be to use ProcessFunction [1] since it has the method onTimer(). But not the ProcessWindowFunction [2], because it does not have the method onTimer(). I will need this method to call Collector<OUT> out.collect(...) from the onTImer() method in order to emit a single value of my Distinct Count function.
Is that reasonable what I am saying? [1] https://ci.apache.org/projects/flink/flink-docs-master/api/java/index.html?org/apache/flink/streaming/api/datastream/DataStream.html [2] https://ci.apache.org/projects/flink/flink-docs-master/api/java/index.html?org/apache/flink/streaming/api/functions/windowing/ProcessWindowFunction.html Kind Regards, Felipe *--* *-- Felipe Gutierrez* *-- skype: felipe.o.gutierrez* *--* *https://felipeogutierrez.blogspot.com <https://felipeogutierrez.blogspot.com>* On Wed, Jun 12, 2019 at 3:41 AM Rong Rong <walter...@gmail.com> wrote: > Hi Felipe, > > there are multiple ways to do DISTINCT COUNT in Table/SQL API. In fact > there's already a thread going on recently [1] > Based on the description you provided, it seems like it might be a better > API level to use. > > To answer your question, > - You should be able to use other TimeCharacteristic. You might want to > try WindowProcessFunction and see if this fits your use case. > - Not sure I fully understand the question, your keyed by should be done > on your distinct key (or a combo key) and if you do keyby correctly then > yes all msg with same key is processed by the same TM thread. > > -- > Rong > > > > [1] > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/count-DISTINCT-in-flink-SQL-td28061.html > > On Tue, Jun 11, 2019 at 1:27 AM Felipe Gutierrez < > felipe.o.gutier...@gmail.com> wrote: > >> Hi all, >> >> I have implemented a Flink data stream application to compute distinct >> count of words. Flink does not have a built-in operator which does this >> computation. I used KeyedProcessFunction and I am saving the state on a >> ValueState descriptor. >> Could someone check if my implementation is the best way of doing it? >> Here is my solution: >> https://stackoverflow.com/questions/56524962/how-can-i-improve-my-count-distinct-for-data-stream-implementation-in-flink/56539296#56539296 >> >> I have some points that I could not understand better: >> - I only could use TimeCharacteristic.IngestionTime. >> - I split the words using "Tuple2<Integer, String>(0, word)", so I will >> have always the same key (0). As I understand, all the events will be >> processed on the same TaskManager which will not achieve parallelism if I >> am in a cluster. >> >> Kind Regards, >> Felipe >> *--* >> *-- Felipe Gutierrez* >> >> *-- skype: felipe.o.gutierrez* >> *--* *https://felipeogutierrez.blogspot.com >> <https://felipeogutierrez.blogspot.com>* >> >