Hello and thanks for the subscription! I am using Streaming API to develop a ML algorithm and i would like your opinions regarding the following issues:
1) The input is read from a big size file with d-dimensional points, and i want to perform a parallel count window. In each parallel count window, i want to perform a function that maintains a list of buckets in memory in order to be checkpointed(exploiting state feature) . For every parallel count window some(0*)! of the buckets will be updated or deleted. My thoughts: As there is no logical key and there is no parallel countWinowAll, the correct way is to perform a parallel flatmap operator? But then i assume that i must implement a custom buffering of input data using ListState to implement the countwindow? Also i could use again another ListState to maintain the list of buckets in memory. But then every time i want to update a specific buffer of the listState i must clear the ListState and reinsert all buffers again(not Optimal for big buffers)? The other way is to use a deterministic pseudo-key and use keyby.countwindow. The number of different keys will be the number of parallelism. In order to update some of the buckets for every key (parallel instance) i am considering the use of mapState(UK=Bucket index,UV= Bucket elements). In that case i think the use of pseudo-key is not the best technique? and also i am going to use unnecessary data shuffle (keyby)? What is the best way? Or is there another way to solve the previous issues? 2)When there is no more input data (EOF) or when a user “asks” for a part-evaluation of the ML algorithm through an external source, i want to collect the list of buckets from the parallel operator instances to another reduce-style operator with parallelism 1 to find the final list (classic scenario of map-reduce). When there is no user query or EOF, I don't want the parallel operator instances to emit anything. My thoughts: I don't know how the user will “ask” the flink parallel operator instances (parallel count window) to emit their results to the downstream operator of parallelism 1. I don't know how the operator instances will know that the file ended (if i use keyby.countwindow i can use a custom trigger with timer? Else in flatmap case? ) The concept is that the list of buckets in each parallel operator instance is a local Sketch and i want to collect the local sketches when the user “asks” to calculate the final Sketch. Any thoughts are very much appreciated!!! Thank you in advance. <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1474/Untitled.png> -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/