danny0405 commented on code in PR #13530:
URL: https://github.com/apache/hudi/pull/13530#discussion_r2191685711
##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/common/AbstractStreamWriteFunction.java:
##########
@@ -207,31 +207,33 @@ private void initCheckpointId(int attemptId, long restoredCheckpointId) throws E
     this.checkpointId = restoredCheckpointId;
   }
-  private void sendBootstrapEvent(int attemptId, boolean isRestored) throws Exception {
-    if (attemptId <= 0) {
-      if (isRestored) {
-        HoodieTimeline pendingTimeline = this.metaClient.getActiveTimeline().filterPendingExcludingCompaction();
-        // if the task is initially started, resend the pending event.
-        for (WriteMetadataEvent event : this.writeMetadataState.get()) {
-          // Must filter out the completed instants in case it is a partial failover,
-          // the write status should not be accumulated in such case.
-          if (pendingTimeline.containsInstant(event.getInstantTime())) {
-            // Reset taskID for event
-            event.setTaskID(taskID);
-            // The checkpoint succeed but the meta does not commit,
-            // re-commit the inflight instant
-            this.eventGateway.sendEventToCoordinator(event);
-            LOG.info("Send uncommitted write metadata event to coordinator, task[{}].", taskID);
-          }
-        }
-      }
-    } else {
-      // otherwise sends an empty bootstrap event instead.
+  private void sendBootstrapEvent(boolean isRestored) throws Exception {
+    if (!isRestored || !sendPendingCommitEvents()) {
       this.eventGateway.sendEventToCoordinator(WriteMetadataEvent.emptyBootstrap(taskID, checkpointId));
       LOG.info("Send bootstrap write metadata event to coordinator, task[{}].", taskID);
     }
   }
+  private boolean sendPendingCommitEvents() throws Exception {
+    boolean eventSent = false;
Review Comment:
This is by design: there is no need to resend the metadata events when
`attemptId > 0`, because the events are already in the coordinator. We only
need to send an empty event to clean up the legacy events.
There is still a very small chance that the send in the first attempt fails,
which could incur data loss. The odds should be very low: 1. it first requires
a restart of the whole job; 2. the event send must then fail. If 2 happens, it
usually indicates a network issue, which is why I want to keep the logic
simple at first. My estimate for this incidence is 0.2 (for a manual restart
of the job) * 0.001 (for the network issue) = 0.0002.
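
To make the behavioral difference concrete, here is a minimal standalone
sketch of the decision each version of `sendBootstrapEvent` makes. It is not
Hudi code: the class, the `Action` enum, and both method names are
hypothetical, and the `pendingEventSent` flag is an assumption standing in for
`sendPendingCommitEvents()` returning `true` (its body is cut off in the hunk
above). The "before" case also simplifies by assuming at least one restored
event is still pending.

```java
// Illustrative sketch only; none of these names exist in Hudi.
class BootstrapDecisionSketch {

  /** What a write task sends to the coordinator on startup (hypothetical). */
  enum Action { RESEND_PENDING, EMPTY_BOOTSTRAP, NOTHING }

  /**
   * Logic removed by the diff: restored events are resent only on the first
   * attempt; any later attempt sends an empty bootstrap event instead.
   */
  static Action before(int attemptId, boolean isRestored) {
    if (attemptId <= 0) {
      // As the removed lines read, a fresh (non-restored) first attempt sends nothing.
      return isRestored ? Action.RESEND_PENDING : Action.NOTHING;
    }
    return Action.EMPTY_BOOTSTRAP;
  }

  /**
   * Logic added by the diff: attemptId is ignored. Pending events are resent
   * whenever state was restored, and the empty bootstrap is the fallback.
   * In the real method the pending check is skipped when isRestored is false.
   */
  static Action after(boolean isRestored, boolean pendingEventSent) {
    if (!isRestored || !pendingEventSent) {
      return Action.EMPTY_BOOTSTRAP;
    }
    return Action.RESEND_PENDING;
  }

  public static void main(String[] args) {
    for (int attemptId : new int[] {0, 1}) {
      for (boolean isRestored : new boolean[] {false, true}) {
        System.out.printf("attemptId=%d isRestored=%b before=%s after=%s%n",
            attemptId, isRestored,
            before(attemptId, isRestored), after(isRestored, true));
      }
    }
  }
}
```

The row that differs in the way discussed here is `attemptId=1` with
`isRestored=true`: the old logic sends only the empty bootstrap event, while
the new logic resends the pending events.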