Hi devs, while dealing with SPARK-31706 [1] we figured out the streaming output mode is only effective for stateful aggregation and not guaranteed on sink, which could expose data loss issue. SPARK-31724 [2] is filed to track the efforts on improving the streaming output mode.
Before we revisit the streaming output mode, I'd like to initiate the discussion around "complete" streaming output mode first, because I have no idea how it works for production use case. For me, it's only useful for niche cases and no other streaming framework has such concept. 1. It destroys the purpose of watermark and forces Spark to maintain all of state rows, growing incrementally. It only works when all keys are bounded to the limited set. 2. It has to provide all state rows as outputs per batch, hence the size of outputs is also growing incrementally. 3. It has to truncate the target before putting rows which might not be trivial for external storage if it should be executed per batch. 4. It enables some operations like sort on streaming query or couple of more things. But it will not work cleanly (state won't keep up) under reasonably high input rate, and we have to consider how the operation will work for streaming output mode hence non-trivial amount of consideration has to be added to maintain the mode. It would be a headache to retain the complete mode if we consider improving modes, as someone might concern about compatibility. It would be nice if we can make a consensus on the viewpoint of complete mode and drop supporting it if we agree with. Would like to hear everyone's opinions. It would be great if someone brings the valid cases where complete mode is being used in production. Thanks, Jungtaek Lim (HeartSaVioR) 1. https://issues.apache.org/jira/browse/SPARK-31706 2. https://issues.apache.org/jira/browse/SPARK-31724