Hello, (Flink 1.2.1)
For performances reasons I'm trying to reduce the volume of data of my stream as soon as possible by windowing/folding it for 15 minutes before continuing to the rest of the chain that contains keyBys and windows that will transfer data everywhere. Because of the huge volume of data, I want to avoid "moving" the data between partitions as much as possible (not like a naïve KeyBy does). I wanted to create a custom ProcessFunction (using timer and state to fold data for X minutes) in order to fold my data over itself before keying the stream but even ProcessFunction needs a keyed stream... Is there a specific "key" value that would ensure me that my data won't be moved to another taskmanager (that it's hashcode will match the partition it is already in) ? I thought about the subtask id but I doubt I'd be that lucky :-) Suggestions · Wouldn't it be useful to be able to do a "partitionnedKeyBy" that would not move data between nodes, for windowing operations that can be parallelized. o Something like kafka => partitionnedKeyBy(0) => first folding => keyBy(0) => second folding => .... · Finally, aren't all streams keyed ? Even if they're keyed by a totally arbitrary partition id until the user chooses its own key, shouldn't we be able to do a window (not windowAll) or process over any normal Stream's partition ? B.R. Gwenhaël PASQUIERS