Hey Jungtaek, I totally agree with the issues you raised here about the complete mode. However, not all streaming queries have unbounded state that quickly grows out of control.
Actually, I found the complete mode is pretty useful when the state is bounded and small. For example, a user can build a realtime dashboard based on daily aggregation results (only 365 or 366 keys in one year, so fewer than 40k keys in 100 years) using the memory sink in the following steps:

- Write a streaming query to poll data from Kafka, calculate the aggregation results, and save them to the memory sink in the complete mode.
- In the same Spark application, start a thrift server with "spark.sql.hive.thriftServer.singleSession=true" to expose the temp table created by the memory sink through JDBC/ODBC.
- Connect a BI tool using JDBC/ODBC to query the temp table created by the memory sink.
- Use the BI tool to build a realtime dashboard by polling the results at a specified interval.

Best Regards,
Ryan

On Mon, May 18, 2020 at 8:44 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

> Hi devs,
>
> while dealing with SPARK-31706 [1] we figured out the streaming output
> mode is only effective for stateful aggregation and not guaranteed on sink,
> which could expose data loss issue. SPARK-31724 [2] is filed to track the
> efforts on improving the streaming output mode.
>
> Before we revisit the streaming output mode, I'd like to initiate the
> discussion around "complete" streaming output mode first, because I have no
> idea how it works for production use case. For me, it's only useful for
> niche cases and no other streaming framework has such concept.
>
> 1. It destroys the purpose of watermark and forces Spark to maintain all
> of state rows, growing incrementally. It only works when all keys are
> bounded to the limited set.
>
> 2. It has to provide all state rows as outputs per batch, hence the size
> of outputs is also growing incrementally.
>
> 3. It has to truncate the target before putting rows which might not be
> trivial for external storage if it should be executed per batch.
>
> 4. It enables some operations like sort on streaming query or couple of
> more things. But it will not work cleanly (state won't keep up) under
> reasonably high input rate, and we have to consider how the operation will
> work for streaming output mode hence non-trivial amount of consideration
> has to be added to maintain the mode.
>
> It would be a headache to retain the complete mode if we consider
> improving modes, as someone might concern about compatibility. It would be
> nice if we can make a consensus on the viewpoint of complete mode and drop
> supporting it if we agree with.
>
> Would like to hear everyone's opinions. It would be great if someone
> brings the valid cases where complete mode is being used in production.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 1. https://issues.apache.org/jira/browse/SPARK-31706
> 2. https://issues.apache.org/jira/browse/SPARK-31724
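
To put rough numbers on the daily-aggregation example at the top of this reply, here is a minimal plain-Python sketch. It is not Spark code: the function names, the start year, and the sample event dates are all made up for illustration. It counts the day keys in a 100-year span and simulates how a complete-mode query would re-emit the whole (small) result table after each micro-batch.

```python
from collections import Counter
from datetime import date, timedelta


def daily_key_count(start: date, end: date) -> int:
    """Number of distinct calendar-day keys in the half-open range [start, end)."""
    return (end - start).days


def run_complete_mode(batches):
    """Simulate complete-mode output: after every micro-batch, emit the
    entire aggregated table (one row per day), not only the changed rows."""
    state = Counter()  # day -> running event count
    for batch in batches:
        for day in batch:
            state[day] += 1
        # Complete mode: the full result table is the output of each batch.
        yield dict(state)


# Hypothetical input: three micro-batches of event dates.
d0 = date(2020, 5, 18)
batches = [
    [d0, d0, d0 + timedelta(days=1)],
    [d0 + timedelta(days=1)],
    [d0, d0 + timedelta(days=2)],
]
outputs = list(run_complete_mode(batches))

# The per-batch output size is bounded by the number of distinct days seen.
max_rows = max(len(rows) for rows in outputs)

# Upper bound on keys for a century of daily aggregates (assumed span: 2020-2119).
century_keys = daily_key_count(date(2020, 1, 1), date(2120, 1, 1))
print(max_rows, century_keys)
```

For the assumed 2020-2119 span the key count stays below the 40k bound mentioned above, so both the state and the per-batch output remain small no matter how many events arrive.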