Re: How can I improve this Flink application for "Distinct Count of elements" in the data stream?

Felipe Gutierrez Wed, 12 Jun 2019 00:27:33 -0700

Hi Rong,

thanks for your answer. If I understood well, the option will be to use
ProcessFunction [1] since it has the method onTimer(). But not the
ProcessWindowFunction [2], because it does not have the method onTimer(). I
will need this method to call Collector<OUT> out.collect(...) from the
onTImer() method in order to emit a single value of my Distinct Count
function.


Is that reasonable what I am saying?

[1]
https://ci.apache.org/projects/flink/flink-docs-master/api/java/index.html?org/apache/flink/streaming/api/datastream/DataStream.html
[2]
https://ci.apache.org/projects/flink/flink-docs-master/api/java/index.html?org/apache/flink/streaming/api/functions/windowing/ProcessWindowFunction.html

Kind Regards,
Felipe

*--*
*-- Felipe Gutierrez*

*-- skype: felipe.o.gutierrez*
*--* *https://felipeogutierrez.blogspot.com
<https://felipeogutierrez.blogspot.com>*


On Wed, Jun 12, 2019 at 3:41 AM Rong Rong <[email protected]> wrote:

> Hi Felipe,
>
> there are multiple ways to do DISTINCT COUNT in Table/SQL API. In fact
> there's already a thread going on recently [1]
> Based on the description you provided, it seems like it might be a better
> API level to use.
>
> To answer your question,
> - You should be able to use other TimeCharacteristic. You might want to
> try WindowProcessFunction and see if this fits your use case.
> - Not sure I fully understand the question, your keyed by should be done
> on your distinct key (or a combo key) and if you do keyby correctly then
> yes all msg with same key is processed by the same TM thread.
>
> --
> Rong
>
>
>
> [1]
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/count-DISTINCT-in-flink-SQL-td28061.html
>
> On Tue, Jun 11, 2019 at 1:27 AM Felipe Gutierrez <
> [email protected]> wrote:
>
>> Hi all,
>>
>> I have implemented a Flink data stream application to compute distinct
>> count of words. Flink does not have a built-in operator which does this
>> computation. I used KeyedProcessFunction and I am saving the state on a
>> ValueState descriptor.
>> Could someone check if my implementation is the best way of doing it?
>> Here is my solution:
>> https://stackoverflow.com/questions/56524962/how-can-i-improve-my-count-distinct-for-data-stream-implementation-in-flink/56539296#56539296
>>
>> I have some points that I could not understand better:
>> - I only could use TimeCharacteristic.IngestionTime.
>> - I split the words using "Tuple2<Integer, String>(0, word)", so I will
>> have always the same key (0). As I understand, all the events will be
>> processed on the same TaskManager which will not achieve parallelism if I
>> am in a cluster.
>>
>> Kind Regards,
>> Felipe
>> *--*
>> *-- Felipe Gutierrez*
>>
>> *-- skype: felipe.o.gutierrez*
>> *--* *https://felipeogutierrez.blogspot.com
>> <https://felipeogutierrez.blogspot.com>*
>>
>

Re: How can I improve this Flink application for "Distinct Count of elements" in the data stream?

Reply via email to