Re: How can I improve this Flink application for "Distinct Count of elements" in the data stream?

Rong Rong Tue, 11 Jun 2019 18:42:40 -0700

Hi Felipe,

There are multiple ways to do DISTINCT COUNT in the Table/SQL API. In fact,
there is already a recent thread on this topic [1].
Based on the description you provided, the Table/SQL API might be a better
API level for your use case.


To answer your questions:
- You should be able to use the other TimeCharacteristic settings (EventTime
or ProcessingTime). You might want to try ProcessWindowFunction and see if
it fits your use case.
- I am not sure I fully understand the question. Your keyBy should be done on
your distinct key (or a combination key), and if you do the keyBy correctly
then yes, all messages with the same key are processed by the same TM thread.
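
To illustrate the second point, here is a rough plain-Java sketch (not Flink
API code, and not taken from the linked implementation). It only simulates
what keying by the word itself achieves: each word is routed to one of several
subtasks, every subtask keeps its own set of distinct words, and the partial
counts are summed. The class name DistinctCountSketch and the helper
subtaskFor are hypothetical, and the hash-modulo routing is a simplification
of how Flink assigns keys to subtasks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java simulation (no Flink dependency) of keying a stream by the word itself.
final class DistinctCountSketch {

    // Hypothetical stand-in for key-to-subtask routing. Flink's real assignment
    // goes through key groups; the only property used here is that the same
    // key always maps to the same subtask.
    static int subtaskFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // Each simulated subtask tracks the distinct words it has seen; because a
    // word never lands on two subtasks, the partial counts can simply be added.
    static long distinctCount(List<String> words, int parallelism) {
        List<Set<String>> perSubtask = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            perSubtask.add(new HashSet<>());
        }
        for (String word : words) {
            perSubtask.get(subtaskFor(word, parallelism)).add(word);
        }
        long total = 0;
        for (Set<String> seen : perSubtask) {
            total += seen.size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> words =
                Arrays.asList("flink", "state", "flink", "key", "state", "window");
        // Four distinct words: flink, state, key, window.
        System.out.println(distinctCount(words, 4)); // prints 4
    }
}
```

With a constant key such as 0, every word would map to the same subtask, which
is why that approach does not parallelize. Keying by the word spreads the
state across subtasks, at the cost of a final step that adds up the partial
counts.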

--
Rong



[1]
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/count-DISTINCT-in-flink-SQL-td28061.html

On Tue, Jun 11, 2019 at 1:27 AM Felipe Gutierrez <
[email protected]> wrote:

> Hi all,
>
> I have implemented a Flink data stream application to compute distinct
> count of words. Flink does not have a built-in operator which does this
> computation. I used KeyedProcessFunction and I am saving the state on a
> ValueState descriptor.
> Could someone check if my implementation is the best way of doing it? Here
> is my solution:
> https://stackoverflow.com/questions/56524962/how-can-i-improve-my-count-distinct-for-data-stream-implementation-in-flink/56539296#56539296
>
> There are some points that I could not fully understand:
> - I could only use TimeCharacteristic.IngestionTime.
> - I split the words using "Tuple2<Integer, String>(0, word)", so I will
> always have the same key (0). As I understand it, all the events will be
> processed on the same TaskManager, which will not achieve parallelism if I
> am in a cluster.
>
> Kind Regards,
> Felipe
> *--*
> *-- Felipe Gutierrez*
>
> *-- skype: felipe.o.gutierrez*
> *--* *https://felipeogutierrez.blogspot.com
> <https://felipeogutierrez.blogspot.com>*
>
