Hey! What you are trying to do here is a global rolling aggregation, which is inherently a DOP 1 operation. Your observation is correct that if you want to use a simple stateful sink, you need to make sure that you set the parallelism to 1 in order to get correct results.
What you can do is to keep local top-ks in a parallel operator (let's say a flatmap) and periodically output the local top-k elements and merge them in a sink with parallelism=1 to produce a global top-k. I am not 100% sure how you implemented the same functionality in spark but there you probably achieved the semantics I described above. The whole problem is much easier if you are interested in the top-k elements grouped by some key, as then you can use partitioned operator states which will give you the correct results with arbitrary parallelism. Cheers, Gyula defstat <defs...@gmail.com> ezt írta (időpont: 2015. aug. 23., V, 21:40): > Hi. I am struggling the past few days to find a solution on the following > problem, using Apache Flink: > > I have a stream of vectors, represented by files in a local folder. After a > new text file is located using DataStream<String> text = > env.readFileStream(...), I transform (flatMap), the Input into a > SingleOutputStreamOperator<Tuple2<String, Integer>, ?>, with the Integer > being the score coming from a scoring function. > > I want to persist a global HashMap containing the top-k vectors, using > their > scores. I approached the problem using a statefull transformation. > 1. The first problem I have is that the HashMap retains per-sink data (so > for each thread of workers, one HashMap of data). How can I make that a > Global collection > > 2. Using Apache Spark, I made that possible by > JavaPairDStream<String, Integer> stateDstream = > tuples.updateStateByKey(updateFunction); > > and then making transformations on the stateDstream. Is there a way I can > get the same functionality using FLink? > > Thanks in advance! > > > > -- > View this message in context: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Statefull-computation-tp2494.html > Sent from the Apache Flink User Mailing List archive. mailing list archive > at Nabble.com. >