subject:"Hourly top\-k statistics of DataStream"

Re: Hourly top-k statistics of DataStream

2016-06-09 Thread Philippe Caparroy

You should have a look at this project : https://github.com/addthis/stream-lib You can use it within Flink, storing intermediate values in a local state. > Le 9 juin 2016 à 15:29, Yukun Guo a écrit : > > Thank you very much for the detailed answer. Now I understand a DataStream > can be re

Re: Hourly top-k statistics of DataStream

2016-06-09 Thread Christophe Salperwyck

Hi, There are some implementations to do that with low memory footprint. Have a look at the count min sketch for example. There are some Java implementations. Christophe 2016-06-09 15:29 GMT+02:00 Yukun Guo : > Thank you very much for the detailed answer. Now I understand a DataStream > can be

Re: Hourly top-k statistics of DataStream

2016-06-09 Thread Yukun Guo

Thank you very much for the detailed answer. Now I understand a DataStream can be repartitioned or “joined” (don’t know the exact terminology) with keyBy. But another question: Despite the non-existence of incremental top-k algorithm, I’d like to incrementally compute the local word count during o

Re: Hourly top-k statistics of DataStream

2016-06-07 Thread Jamie Grier

Suggestions in-line below... On Mon, Jun 6, 2016 at 7:26 PM, Yukun Guo wrote: > Hi, > > I'm working on a project which uses Flink to compute hourly log statistics > like top-K. The logs are fetched from Kafka by a FlinkKafkaProducer and > packed > into a DataStream. > > The problem is, I find th

Re: Hourly top-k statistics of DataStream

2016-06-06 Thread Yukun Guo

My algorithm is roughly like this taking top-K words problem as an example (the purpose of computing local “word count” is to deal with data imbalance): DataStream of words -> timeWindow of 1h -> converted to DataSet of words -> random partitioning by rebalance -> local “word count” using mapParti

Hourly top-k statistics of DataStream

2016-06-06 Thread Yukun Guo

Hi, I'm working on a project which uses Flink to compute hourly log statistics like top-K. The logs are fetched from Kafka by a FlinkKafkaProducer and packed into a DataStream. The problem is, I find the computation quite challenging to express with Flink's DataStream API: 1. If I use something

Re: Hourly top-k statistics of DataStream

Re: Hourly top-k statistics of DataStream

Re: Hourly top-k statistics of DataStream

Re: Hourly top-k statistics of DataStream

Re: Hourly top-k statistics of DataStream

Hourly top-k statistics of DataStream

6 matches

Site Navigation

Mail list logo

Footer information