Re: Hourly top-k statistics of DataStream

2016-06-09 Thread Philippe Caparroy
You should have a look at this project: https://github.com/addthis/stream-lib You can use it within Flink, storing intermediate values in a local state. > On 9 June 2016 at 15:29, Yukun Guo wrote: > > Thank you very much for the detailed answer. Now I understand a DataStream > can be re

Re: Hourly top-k statistics of DataStream

2016-06-09 Thread Christophe Salperwyck
Hi, There are some implementations that do this with a low memory footprint. Have a look at the count-min sketch, for example. There are some Java implementations. Christophe 2016-06-09 15:29 GMT+02:00 Yukun Guo : > Thank you very much for the detailed answer. Now I understand a DataStream > can be
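
For readers unfamiliar with the count-min sketch mentioned above, here is a minimal, self-contained sketch of the idea in plain Python. It is not the stream-lib API and not Flink code; the class name `CountMinSketch`, the helper `top_k`, and the default `width` and `depth` values are assumptions made for illustration only. The sketch keeps a fixed-size table of counters, so memory does not grow with the number of distinct words, and its estimates can overcount but never undercount.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory frequency estimator (illustrative, hypothetical names)."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        # depth rows of width counters; memory is independent of input size
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One hash function per row, derived by salting BLAKE2b with the row id
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions only ever inflate a counter, so the minimum across rows
        # is the tightest available upper bound on the true count
        return min(
            self.table[row][self._index(row, item)] for row in range(self.depth)
        )


def top_k(words, k, sketch=None):
    """Track at most k candidate words using sketch estimates (approximate)."""
    if sketch is None:
        sketch = CountMinSketch()
    candidates = {}
    for word in words:
        sketch.add(word)
        candidates[word] = sketch.estimate(word)
        if len(candidates) > k:
            # Evict the candidate with the smallest estimated count
            del candidates[min(candidates, key=candidates.get)]
    return sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
```

Within Flink, a structure like this would presumably live in operator state, as suggested elsewhere in the thread; the result is approximate, and the error shrinks as `width` and `depth` grow.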

Re: Hourly top-k statistics of DataStream

2016-06-09 Thread Yukun Guo
Thank you very much for the detailed answer. Now I understand a DataStream can be repartitioned or “joined” (don’t know the exact terminology) with keyBy. But another question: Despite the non-existence of an incremental top-k algorithm, I’d like to incrementally compute the local word count during o

Re: Hourly top-k statistics of DataStream

2016-06-07 Thread Jamie Grier
Suggestions in-line below... On Mon, Jun 6, 2016 at 7:26 PM, Yukun Guo wrote: > Hi, > > I'm working on a project which uses Flink to compute hourly log statistics > like top-K. The logs are fetched from Kafka by a FlinkKafkaProducer and > packed > into a DataStream. > > The problem is, I find th

Re: Hourly top-k statistics of DataStream

2016-06-06 Thread Yukun Guo
My algorithm is roughly as follows, taking the top-K words problem as an example (the purpose of computing a local “word count” is to deal with data imbalance): DataStream of words -> timeWindow of 1h -> converted to DataSet of words -> random partitioning by rebalance -> local “word count” using mapParti
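
The preview above is cut off after the local word-count step, so the following is only a rough illustration of the general two-phase pattern it appears to describe, written in plain Python rather than with the Flink API. The function names, the round-robin split standing in for `rebalance`, and the final merge and top-K steps are all assumptions, not the poster's actual code.

```python
from collections import Counter


def local_counts(partition):
    """Count words within a single partition (the local "word count")."""
    return Counter(partition)


def hourly_top_k(words, k, num_partitions=4):
    """Two-phase top-K over one window of words (illustrative only)."""
    # Round-robin split: a stand-in for random partitioning, so that a
    # skewed word distribution does not overload one worker
    partitions = [words[i::num_partitions] for i in range(num_partitions)]

    # Phase 1: each partition counts its own words independently
    partials = [local_counts(p) for p in partitions]

    # Phase 2 (assumed): merge the partial counts, then take the k largest
    total = Counter()
    for partial in partials:
        total.update(partial)
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

For example, `hourly_top_k(["a", "b", "a", "c", "a", "b"], 2)` returns `[("a", 3), ("b", 2)]`. Pre-aggregating per partition means the merge step sees one record per distinct word per partition instead of one record per occurrence.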