Hey Jungtaek,

I totally agree with you about the issues of the complete mode you raised
here. However, not all streaming queries have unbounded states and
will grow quickly to a crazy state.

Actually, I found the complete mode is pretty useful when the states are
bounded and small. For example, a user can build a realtime dashboard based
on daily aggregation results (only 365 or 366 keys in one year, so less
than 40k keys in 100 years) using memory sink in the following steps:

- Write a streaming query to poll data from Kafka, calculate the
aggregation results, and save to the memory sink in the complete mode.
- In the same Spark application, start a thrift server with
"spark.sql.hive.thriftServer.singleSession=true" to expose the temp table
created by the memory sink through JDBC/ODBC.
- Connect a BI tool using JDBC/ODBC to query the temp table created by the
memory sink.
- Use the BI tool to build a realtime dashboard by polling the results in a
specified speed.

Best Regards,
Ryan


On Mon, May 18, 2020 at 8:44 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
wrote:

> Hi devs,
>
> while dealing with SPARK-31706 [1] we figured out the streaming output
> mode is only effective for stateful aggregation and not guaranteed on sink,
> which could expose data loss issue. SPARK-31724 [2] is filed to track the
> efforts on improving the streaming output mode.
>
> Before we revisit the streaming output mode, I'd like to initiate the
> discussion around "complete" streaming output mode first, because I have no
> idea how it works for production use case. For me, it's only useful for
> niche cases and no other streaming framework has such concept.
>
> 1. It destroys the purpose of watermark and forces Spark to maintain all
> of state rows, growing incrementally. It only works when all keys are
> bounded to the limited set.
>
> 2. It has to provide all state rows as outputs per batch, hence the size
> of outputs is also growing incrementally.
>
> 3. It has to truncate the target before putting rows which might not be
> trivial for external storage if it should be executed per batch.
>
> 4. It enables some operations like sort on streaming query or couple of
> more things. But it will not work cleanly (state won't keep up) under
> reasonably high input rate, and we have to consider how the operation will
> work for streaming output mode hence non-trivial amount of consideration
> has to be added to maintain the mode.
>
> It would be a headache to retain the complete mode if we consider
> improving modes, as someone might concern about compatibility. It would be
> nice if we can make a consensus on the viewpoint of complete mode and drop
> supporting it if we agree with.
>
> Would like to hear everyone's opinions. It would be great if someone
> brings the valid cases where complete mode is being used in production.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 1. https://issues.apache.org/jira/browse/SPARK-31706
> 2. https://issues.apache.org/jira/browse/SPARK-31724
>
>
>

Reply via email to