Re: [PR] [WIP][feature][spark] Support streaming [seatunnel]

via GitHub Thu, 29 Aug 2024 18:49:23 -0700


Carl-Zhou-CN commented on PR #7476:
URL: https://github.com/apache/seatunnel/pull/7476#issuecomment-2319657631


   > > > 
https://github.com/apache/seatunnel/blob/1bba72385b6797dc5edd96fa5951376d0594e633/seatunnel-translation/seatunnel-translation-spark/seatunnel-translation-spark-3.3/src/main/java/org/apache/seatunnel/translation/spark/source/partition/micro/SeaTunnelMicroBatchPartitionReader.java#L27-L49
   > > > 
   > > > 
https://github.com/apache/seatunnel/blob/1bba72385b6797dc5edd96fa5951376d0594e633/seatunnel-translation/seatunnel-translation-spark/seatunnel-translation-spark-3.3/src/main/java/org/apache/seatunnel/translation/spark/source/partition/batch/ParallelBatchPartitionReader.java#L87-L97
   > > > 
   > > > PartitionReader is never closed in streaming mode.
   > > 
   > > 
   > > Hi @CheneyYin, it seems that the reader is closed after a checkpoint.
   > 
   > Yes. If the reader does not receive new data for a long time, Spark will 
end the current micro batch. Spark's micro-batch mechanism does not fully meet 
the requirements of long-running streaming computation. First, creating a new 
reader for the next batch incurs some overhead. Second, the granularity of 
fault recovery is too coarse, and the Spark micro-batch mechanism cannot 
restore the reader from the latest snapshot of the SeaTunnel reader. I am 
looking for strategies to alleviate these problems while still ensuring fault 
recovery. Currently, I add metadata to the SeaTunnel row and use a special 
identifier to represent the checkpoint event. After the source completes a 
checkpoint, it creates a checkpoint record and sends it downstream. After 
receiving the checkpoint record, the sink saves its snapshot and confirms the 
prepared checkpoint made by the source. These checkpoint operations are backed 
by a directory on the file system.
   
   I think this pattern is closer to Spark's continuous streaming mode, 
but it seems to lack fault tolerance entirely.
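
   For reference, the checkpoint-record flow described in the quoted comment 
could be sketched roughly as below. This is only an illustration of the idea, 
not the code in this PR: `Row`, `Source`, `Sink`, and all of their members are 
hypothetical names, and in-memory lists stand in for the file system directory 
mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these types exist in SeaTunnel or Spark.
final class CheckpointRecordSketch {

    /** A row carrying metadata; a non-negative checkpointId marks a checkpoint event. */
    static final class Row {
        static final long NOT_A_CHECKPOINT = -1L;
        final String payload;
        final long checkpointId;

        private Row(String payload, long checkpointId) {
            this.payload = payload;
            this.checkpointId = checkpointId;
        }

        static Row data(String payload) {
            return new Row(payload, NOT_A_CHECKPOINT);
        }

        static Row checkpoint(long checkpointId) {
            return new Row(null, checkpointId);
        }

        boolean isCheckpoint() {
            return checkpointId != NOT_A_CHECKPOINT;
        }
    }

    /** Source side: prepares a checkpoint, then emits a marker row downstream. */
    static final class Source {
        // Assumption: these lists stand in for "prepared" and "committed" directories.
        final List<Long> prepared = new ArrayList<>();
        final List<Long> committed = new ArrayList<>();

        Row prepareCheckpoint(long checkpointId) {
            prepared.add(checkpointId);
            return Row.checkpoint(checkpointId);
        }

        /** Called by the sink once its own snapshot for this checkpoint is saved. */
        void confirm(long checkpointId) {
            if (prepared.remove(Long.valueOf(checkpointId))) {
                committed.add(checkpointId);
            }
        }
    }

    /** Sink side: buffers data rows and snapshots them when a marker arrives. */
    static final class Sink {
        final List<String> buffer = new ArrayList<>();
        final List<String> snapshots = new ArrayList<>();

        void write(Row row, Source source) {
            if (row.isCheckpoint()) {
                // Save the sink snapshot first, then confirm the source's prepared checkpoint.
                snapshots.add(row.checkpointId + ":" + String.join(",", buffer));
                buffer.clear();
                source.confirm(row.checkpointId);
            } else {
                buffer.add(row.payload);
            }
        }
    }

    public static void main(String[] args) {
        Source source = new Source();
        Sink sink = new Sink();
        sink.write(Row.data("a"), source);
        sink.write(Row.data("b"), source);
        sink.write(source.prepareCheckpoint(1L), source);
        System.out.println(sink.snapshots + " committed=" + source.committed);
    }
}
```

   The point of the ordering in `Sink.write` is that the source's checkpoint 
only moves from prepared to committed after the sink snapshot exists, so a 
failure in between leaves the checkpoint unconfirmed.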


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@seatunnel.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
