weizuo93 opened a new issue #7141:
URL: https://github.com/apache/incubator-doris/issues/7141


   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/incubator-doris/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### Description
   
   ### Background
   In a sample Doris application, the data flow is as follows:
   * Read streaming data from Kafka
   * Execute ETL in Flink
   * Sink data to Doris in batches using `stream load`
   
   Flink generates checkpoints at a regular, configurable interval and 
writes each checkpoint to a persistent storage system, such as HDFS. A 
checkpoint in Flink is a consistent snapshot of:
   * The current state of the application
   * The consumption progress of the data stream (`offset`)
   
   ![Screenshot 2021-11-17 20-43-16](https://user-images.githubusercontent.com/68884553/142202896-1402fe7f-adb2-42fb-9ac9-9f7dfe7e1acc.png)
   
   After a machine or Flink software failure, the restarted Flink 
application resumes processing from the most recent successfully completed 
checkpoint. Data consumed after that checkpoint is replayed, so part of the 
data is loaded into Doris twice, which produces duplicate rows.
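
The replay described above can be illustrated with a small in-memory simulation. This is only a sketch: `run_with_failure`, its parameters, and the list standing in for a Doris table are hypothetical names chosen for illustration, and no real Flink or Doris API is involved. The sink makes every record visible immediately, and a single failure rewinds the read position to the last checkpointed offset.

```python
def run_with_failure(records, checkpoint_every, fail_at):
    """Simulate a sink that makes each record visible as soon as it is written.

    All names here are hypothetical; `loaded` stands in for a Doris table.
    """
    loaded = []            # rows visible in the target table
    checkpoint_offset = 0  # offset saved by the last completed checkpoint
    offset = 0             # current read position in the stream
    failed = False
    while offset < len(records):
        if not failed and offset == fail_at:
            # Simulated crash: restart from the last completed checkpoint.
            failed = True
            offset = checkpoint_offset
            continue
        loaded.append(records[offset])
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint_offset = offset  # checkpoint completes here
    return loaded


rows = run_with_failure(list(range(10)), checkpoint_every=4, fail_at=6)
duplicates = sorted({r for r in rows if rows.count(r) > 1})
print(rows)
print(duplicates)
```

In this run the failure happens after records 4 and 5 were loaded but before the next checkpoint, so both are loaded a second time after the restart.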
   
   To provide exactly-once semantics, Doris must provide a way to commit or 
roll back loads in coordination with Flink's checkpoints. Stream load should 
therefore support `Two-Phase Commit (2PC)`.
   
   For the data sink to provide exactly-once guarantees, it must:
   * Write all data that arrives between two checkpoints to Doris through 
several stream load tasks (the data is not yet visible).
   * Commit all stream load tasks issued between the two checkpoints (the 
data becomes visible).
   
   After a machine or Flink software failure, the restarted application 
commits all stream load tasks issued between the two most recent checkpoints. 
Committing the same stream load task repeatedly is safe.
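
The commit and recovery behavior described above can be sketched with an in-memory model. Everything here is an assumption made for illustration: `FakeDoris`, `pre_commit`, `commit`, `abort`, `recover`, and the transaction ids are hypothetical and are not the Doris stream load API. The sketch also assumes that loads pre-committed after the last checkpoint are rolled back on restart, since their input data will be replayed.

```python
class FakeDoris:
    """In-memory stand-in for Doris (hypothetical, not the real API)."""

    def __init__(self):
        self.pending = {}       # txn_id -> rows that are pre-committed but invisible
        self.visible = []       # rows that readers can see
        self.committed = set()  # txn ids that were already committed

    def pre_commit(self, txn_id, rows):
        # First phase: data is stored durably but stays invisible.
        self.pending[txn_id] = list(rows)

    def commit(self, txn_id):
        # Second phase: idempotent, so a repeated commit is a no-op.
        if txn_id in self.committed:
            return
        self.visible.extend(self.pending.pop(txn_id))
        self.committed.add(txn_id)

    def abort(self, txn_id):
        # Roll back a pre-committed load that must not become visible.
        self.pending.pop(txn_id, None)


def recover(store, checkpoint_txns):
    """On restart, abort loads outside the checkpoint and re-commit the rest."""
    for txn_id in list(store.pending):
        if txn_id not in checkpoint_txns:
            store.abort(txn_id)
    for txn_id in checkpoint_txns:
        store.commit(txn_id)


store = FakeDoris()
store.pre_commit("t1", [1, 2])
store.pre_commit("t2", [3])
checkpoint_txns = ["t1", "t2"]       # txn ids saved in the checkpoint state
visible_before_commit = list(store.visible)

store.commit("t1")                   # crash happens before "t2" is committed
store.pre_commit("t3", [4])          # load started after the checkpoint

recover(store, checkpoint_txns)      # restart from the last checkpoint
print(store.visible)
```

Because `commit` is idempotent, `recover` does not need to know which transactions were already committed before the crash.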
   
   ### Design
   
   The design of two-phase commit for stream load is as follows:
   
   * First Phase:
   
   ![Screenshot 2021-11-02 15-55-26](https://user-images.githubusercontent.com/68884553/142198985-18bf0b3a-eb36-4ee1-bb37-fc79c6f70ab5.png)
   
   
   * Second Phase:
   ![Screenshot 2021-11-02 15-57-47](https://user-images.githubusercontent.com/68884553/142199038-4c36a277-cdcc-4de8-bcb1-4d0ffcde88f8.png)
   
   Once the `pre-commit` has completed, we must guarantee that the `commit` 
can succeed.
   
   ### Use case
   
   _No response_
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


