Hi, How can I maintain a local state, for instance a ConcurrentHashMap, across different steps in a streaming chain on a single machine/process? Static variable? (This doesn't seem to work well when running locally as it gets shared across multiple instances, a common "pipeline" store would be helpful)
Is it OK to checkpoint such a local state in a single map operation at the beginning of the pipeline, or does it need to be done for every function? Will multiple groupBy steps using the same key selector output pass data to the same machines? (To preserve data locality) How can I do a fold/reduce operation that only returns its result after a full window has been processed, even when the processing in the window includes streams that have been distributed and merged from different machines using groupBy? My scenario is as follows I want to build up and partition a large state across different machines by using groupBy on a stream. The processing occurs in a window and some processing needs to be done on multiple machines so I want to do additional groupBy operators to pass partial results to other machines. Pseudo code: flattenedWindowStream = streamSource.groupBy(myKeySelector). // Initial paritioning map(localStateSaverCheckpointMapper). //Checkpoint that saves local state, just passes through the data window(Count(100)).flatten(); localAndRemoteStream = flattenedWindowStream.split(event -> canBeProcessedLocally(event) ? "local" : "remote" ); remoteStream = localAndRemoteStream.select("remote"). map(partialProcessing). // Partially process what I can with my local state groupBy(myKeySelector). // Send the partial processing to the machines that own the rest of the data map(process); globalResult = localAndRemoteStream.select("local").map(process).union(remoteStream).broadcast(); // Broadcast all fully processed results to all machines globalResult.fold().addSink(globalWindowOutputSink) // fold/reduce, I want a result based on the full contents of the window Any help would be greatly appreciated! Thanks, William