Hi devs,

while dealing with SPARK-31706 [1] we figured out the streaming output mode
is only effective for stateful aggregation and not guaranteed on sink,
which could expose data loss issue. SPARK-31724 [2] is filed to track the
efforts on improving the streaming output mode.

Before we revisit the streaming output mode, I'd like to initiate the
discussion around "complete" streaming output mode first, because I have no
idea how it works for production use case. For me, it's only useful for
niche cases and no other streaming framework has such concept.

1. It destroys the purpose of watermark and forces Spark to maintain all of
state rows, growing incrementally. It only works when all keys are bounded
to the limited set.

2. It has to provide all state rows as outputs per batch, hence the size of
outputs is also growing incrementally.

3. It has to truncate the target before putting rows which might not be
trivial for external storage if it should be executed per batch.

4. It enables some operations like sort on streaming query or couple of
more things. But it will not work cleanly (state won't keep up) under
reasonably high input rate, and we have to consider how the operation will
work for streaming output mode hence non-trivial amount of consideration
has to be added to maintain the mode.

It would be a headache to retain the complete mode if we consider improving
modes, as someone might concern about compatibility. It would be nice if we
can make a consensus on the viewpoint of complete mode and drop supporting
it if we agree with.

Would like to hear everyone's opinions. It would be great if someone brings
the valid cases where complete mode is being used in production.

Thanks,
Jungtaek Lim (HeartSaVioR)

1. https://issues.apache.org/jira/browse/SPARK-31706
2. https://issues.apache.org/jira/browse/SPARK-31724

Reply via email to