Hey!

What you are trying to do here is a global rolling aggregation, which is
inherently a DOP 1 operation. Your observation is correct that if you want
to use a simple stateful sink, you need to make sure that you set the
parallelism to 1 in order to get correct results.

What you can do is to keep local top-ks in a parallel operator (let's say a
flatmap) and periodically output the local top-k elements and merge them in
a sink with parallelism=1 to produce a global top-k.

I am not 100% sure how you implemented the same functionality in spark but
there you probably achieved the semantics I described above.

The whole problem is much easier if you are interested in the top-k
elements grouped by some key, as then you can use partitioned operator
states which will give you the correct results with arbitrary parallelism.

Cheers,
Gyula

defstat <defs...@gmail.com> ezt írta (időpont: 2015. aug. 23., V, 21:40):

> Hi. I am struggling the past few days to find a solution on the following
> problem, using Apache Flink:
>
> I have a stream of vectors, represented by files in a local folder. After a
> new text file is located using DataStream<String> text =
> env.readFileStream(...), I transform (flatMap), the Input into a
> SingleOutputStreamOperator<Tuple2&lt;String, Integer>, ?>, with the Integer
> being the score coming from a scoring function.
>
> I want to persist a global HashMap containing the top-k vectors, using
> their
> scores. I approached the problem using a statefull transformation.
> 1. The first problem I have is that the HashMap retains per-sink data (so
> for each thread of workers, one HashMap of data). How can I make that a
> Global collection
>
> 2. Using Apache Spark, I made that possible by
> JavaPairDStream<String, Integer> stateDstream =
> tuples.updateStateByKey(updateFunction);
>
> and then making transformations on the stateDstream. Is there a way I can
> get the same functionality using FLink?
>
> Thanks in advance!
>
>
>
> --
> View this message in context:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Statefull-computation-tp2494.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive
> at Nabble.com.
>

Reply via email to