Deduplicate messages from Kafka topic

2017-01-14 Thread ljwagerfield
As I understand it, the Flink Kafka Producer may emit duplicates to Kafka topics. How can I deduplicate these messages when reading them back with Flink (via the Flink Kafka Consumer)? For example, is there any out-the-box support for deduplicating messages, i.e. by incorporating something like "

Re: Kafka topic partition skewness causes watermark not being emitted

2017-01-14 Thread tao xiao
The case I described was for experiment only but data skewness would happen in production. The current implementation will block the watermark emission to downstream until all partition move forward which has great impact on latency. It may be a good idea to expose an API to users to decide what th

Re: Queryable State

2017-01-14 Thread Dawid Wysakowicz
Hi Nico, Recently I've tried the queryable state a bit differently, by using ValueState with a value of a util.ArrayList and a ValueSerializer for util.ArrayList and it works as expected. The non-working example you can browse here: https://github.com/dawidwys/flink-intro/tree/c66f01117b0fe3c0adc

Re: Terminology: Split, Group and Partition

2017-01-14 Thread Robert Schmidtke
Hi Fabian, I have opened a ticket for that, thanks. I have another question: now that I have obtained the proper local grouping, I did some aggregation of type [T] -> U, where one aggregated object is of type U, containing information of zero or more Ts. The Us are still tied to the hostname, and

Re: Reading compressed XML data

2017-01-14 Thread Robert Metzger
Hi Sebastian, I'm not aware of a better way of implementing this in Flink. You could implement your own XmlInputFormat using Flink's InputFormat abstractions, but you would end up with almost exactly the same code as Mahout / Hadoop. I wonder why the decompression with the XmlInputFormat doesn't w