[ https://issues.apache.org/jira/browse/FLINK-29971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Flink Jira Bot updated FLINK-29971: ----------------------------------- Labels: pull-request-available stale-assigned (was: pull-request-available) I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help the community manage its development. I see this issue is assigned but has not received an update in 30 days, so it has been labeled "stale-assigned". If you are still working on the issue, please remove the label and add a comment updating the community on your progress. If this issue is waiting on feedback, please consider this a reminder to the committer/reviewer. Flink is a very active project, and so we appreciate your patience. If you are no longer working on the issue, please unassign yourself so someone else may work on it. > Hbase sink will lose data at extreme case > ----------------------------------------- > > Key: FLINK-29971 > URL: https://issues.apache.org/jira/browse/FLINK-29971 > Project: Flink > Issue Type: Bug > Components: Connectors / HBase > Affects Versions: 1.13.6, 1.15.3 > Reporter: wenchao.wu > Assignee: wenchao.wu > Priority: Critical > Labels: pull-request-available, stale-assigned > Attachments: image-2022-11-10-16-02-51-402.png, > image-2022-11-10-16-08-23-325.png, image-2022-11-10-16-10-50-711.png, > image-2022-11-10-16-24-01-396.png > > > h2. Situation: > When I use kafka as source and hbase as sink but the hbase table I didn't > have the permission, I send data to kafka one message with a long time gap. > In this situation the normal result will be when trigger checkpoint the job > will failed. But actually the jobs will continue to run and can trigger > checkpoint successfully. > h2. Analysis > The hbase sink will throw exception in *checkErrorAndRethrow()* funciton. And > this function will be called in two function, *invoke()* and {*}flush(){*}. > Beside {*}invoke(){*}, *flush()* will be called at two place, one is > {*}snapshot(){*}, one is in the scheduledThread as the follow snippet of code: > !image-2022-11-10-16-02-51-402.png! > We can see that in the scheduledThread the exception throw by *flush()* will > be catch and reset to failureThrowable. > So if there's no message come, the only way to throw the exception is in > {*}snapshot(){*}. But the snapshot function call flush() is conditional as > the follow snippet of code: > !image-2022-11-10-16-08-23-325.png! > But the scheduledThread will called flush() periodically and set > numPendingRequests as 0. > !image-2022-11-10-16-10-50-711.png! > So if no other message comes the snapshot will run successfully which means > the checkpoint will be success but that message was not written to hbase, the > message is loss. > > h2. Solution > I think the reason is that when trigger checkpoint and call snapshot > function, need to call *checkErrorAndRethrow()* first as follow: > !image-2022-11-10-16-24-01-396.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)