CheneyYin commented on PR #7476: URL: https://github.com/apache/seatunnel/pull/7476#issuecomment-2320207764
> > > > https://github.com/apache/seatunnel/blob/1bba72385b6797dc5edd96fa5951376d0594e633/seatunnel-translation/seatunnel-translation-spark/seatunnel-translation-spark-3.3/src/main/java/org/apache/seatunnel/translation/spark/source/partition/micro/SeaTunnelMicroBatchPartitionReader.java#L27-L49
> > > >
> > > > https://github.com/apache/seatunnel/blob/1bba72385b6797dc5edd96fa5951376d0594e633/seatunnel-translation/seatunnel-translation-spark/seatunnel-translation-spark-3.3/src/main/java/org/apache/seatunnel/translation/spark/source/partition/batch/ParallelBatchPartitionReader.java#L87-L97
> > > >
> > > > The PartitionReader is never closed in streaming mode.
> > >
> > > Hi @CheneyYin, it seems that it will be closed after a checkpoint.
> >
> > Yes. If the reader does not receive new data for a long time, Spark ends the current micro batch. Spark's micro-batch mechanism does not fully meet the requirements of long-running streaming computation. First, creating a new reader for the next batch incurs some overhead. Second, the granularity of fault recovery is too coarse: the Spark micro-batch mechanism cannot restore the reader from the latest snapshot of the SeaTunnel reader. I am looking for strategies to alleviate these problems while still ensuring fault recovery. Currently, I add metadata to the SeaTunnel row and use a special identifier to represent the checkpoint event. After the source completes a checkpoint, it creates a checkpoint record and sends it downstream. After receiving the checkpoint record, the sink saves its snapshot and confirms the prepared checkpoint made by the source. These checkpoint operations are performed in a file system directory.
>
> I think this pattern is closer to Spark's continuous streaming mode, but it seems to completely lack fault tolerance.

It can ensure end-to-end at-least-once semantics. If the sink handles reprocessed data idempotently, it can also ensure exactly-once semantics.
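A minimal sketch of the checkpoint-record flow described above, to make the prepare/confirm steps concrete. Everything here is hypothetical: the class `CheckpointSketch`, the `Row` type, the method names, and the `prepared-`/`sink-`/`committed-` file naming are assumptions for illustration and are not taken from the SeaTunnel or Spark code. It assumes the source writes its reader snapshot as a "prepared" file, and the sink confirms it by renaming that file once its own snapshot is saved.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical sketch; none of these names come from SeaTunnel or Spark.
final class CheckpointSketch {

    /** A row that either carries data or marks a checkpoint event. */
    static final class Row {
        final boolean isCheckpoint;
        final long checkpointId; // meaningful only for checkpoint rows
        final String payload;    // meaningful only for data rows

        private Row(boolean isCheckpoint, long checkpointId, String payload) {
            this.isCheckpoint = isCheckpoint;
            this.checkpointId = checkpointId;
            this.payload = payload;
        }

        static Row data(String payload) { return new Row(false, -1L, payload); }
        static Row checkpoint(long id) { return new Row(true, id, null); }
    }

    /** Source side: persist the reader snapshot as "prepared", then emit a checkpoint row. */
    static Row prepare(Path dir, long id, String readerState) {
        write(dir.resolve("prepared-" + id), readerState);
        return Row.checkpoint(id);
    }

    /** Sink side: buffer data rows; on a checkpoint row, save the sink snapshot and confirm. */
    static void onRow(Path dir, Row row, List<String> buffer) {
        if (!row.isCheckpoint) {
            buffer.add(row.payload);
            return;
        }
        write(dir.resolve("sink-" + row.checkpointId), String.join("\n", buffer));
        try {
            // Confirm the source's prepared checkpoint by renaming it to "committed".
            Files.move(dir.resolve("prepared-" + row.checkpointId),
                       dir.resolve("committed-" + row.checkpointId),
                       StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Recovery: the highest confirmed checkpoint id, or -1 if none has been confirmed. */
    static long latestCommitted(Path dir) {
        try (Stream<Path> files = Files.list(dir)) {
            return files.map(p -> p.getFileName().toString())
                        .filter(n -> n.startsWith("committed-"))
                        .mapToLong(n -> Long.parseLong(n.substring("committed-".length())))
                        .max()
                        .orElse(-1L);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a temporary directory that stands in for the shared checkpoint directory. */
    static Path newCheckpointDir() {
        try {
            return Files.createTempDirectory("ckpt-sketch");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void write(Path path, String content) {
        try {
            Files.write(path, content.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = newCheckpointDir();
        List<String> sinkBuffer = new ArrayList<>();
        onRow(dir, Row.data("a"), sinkBuffer);
        onRow(dir, prepare(dir, 1L, "offset=1"), sinkBuffer);
        System.out.println("latest committed checkpoint: " + latestCommitted(dir));
    }
}
```

In this sketch, a restart would resume the reader from the snapshot of `latestCommitted`. Rows emitted after that checkpoint are replayed, which is why the result is at-least-once unless the sink deduplicates or writes idempotently.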