[
https://issues.apache.org/jira/browse/HBASE-29261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17944421#comment-17944421
]
Vinayak Hegde commented on HBASE-29261:
---------------------------------------
h3. Current Behavior
Our current delete logic attempts to determine if the backup in question is a
valid base full backup for *all* the possible PITR (Point-in-Time Recovery)
points for a given table.
If it is the *only* such backup that allows PITR for the entire allowed window,
we avoid deleting it unless the {{--force}} flag is used.
h3. Problem Statement
This logic is flawed and inconsistent.
It is *not practically possible* for a single full backup to serve as a valid
base full backup for *all* possible PITR points for a table.
We demonstrate this with the following reasoning.
h3. Task
Determine whether a given backup can act as a valid base full backup for *all*
PITR points for a specific table.
h3. Terminology
* *maxAllowedPITRTime (mapt):*
Maximum duration into the past from the current time to which PITR is supported
(e.g., 30 days). This is a cluster-level configuration.
So, PITR is possible from {{currentTime - mapt}} to {{currentTime}}.
* *continuousBackupStartTime (cst):*
The earliest timestamp from which continuous backups are available for the
table.
* *currentTime (ct):*
The current timestamp.
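As a minimal sketch of how these three values combine, the snippet below computes the start of the effective PITR window. It assumes all times are epoch milliseconds and that the window starts at whichever is later, {{currentTime - mapt}} or {{cst}}. The class and method names are hypothetical and are not part of the HBase backup API.

```java
/** Hypothetical helper; names are illustrative, not HBase API. */
final class PitrWindowSketch {
  private PitrWindowSketch() {}

  /**
   * Earliest timestamp PITR can target: the later of (ct - mapt) and cst.
   * ctMillis and cstMillis are epoch milliseconds; maptMillis is a duration.
   */
  static long windowStart(long ctMillis, long maptMillis, long cstMillis) {
    return Math.max(ctMillis - maptMillis, cstMillis);
  }

  public static void main(String[] args) {
    long day = 24L * 60 * 60 * 1000;
    long ct = 100 * day;
    // Continuous backup began on day 60; a 30-day mapt limits the window.
    System.out.println(windowStart(ct, 30 * day, 60 * day) / day); // prints 70
    // Continuous backup began on day 90; cst limits the window.
    System.out.println(windowStart(ct, 30 * day, 90 * day) / day); // prints 90
  }
}
```

The window always ends at {{currentTime}}, so only its start needs to be computed.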
h3. Assumptions
* Full backups take significant time (typically hours) because:
*# Snapshot creation happens first.
*# Then the snapshot is copied to a destination.
* Snapshot creation ({{fm}}) occurs *after* the full backup is triggered
({{fs}}) but *before* the full backup completes ({{fe}}).
* {{fm}} (snapshot time) is not currently recorded or exposed in our metadata.
h3. Timeline Definitions
Let:
* {{fs}}: Full backup *start time* (when the process begins)
* {{fm}}: Time when the *snapshot is taken* (logical freeze point of table
data)
* {{fe}}: Full backup *end time* (when snapshot copy completes)
Then:
* The full backup *may include data* up to {{fm}}.
* The full backup *does not* include data between {{fm}} and {{fe}}.
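The definitions above can be sketched as a small value class. This is illustrative only: the class name is hypothetical, timestamps are assumed to be epoch milliseconds, and {{fm}} is assumed to be known even though the current metadata does not record it.

```java
/** Hypothetical value class for one full backup's timeline; not HBase API. */
final class FullBackupTimelineSketch {
  final long fs; // epoch millis: full backup triggered
  final long fm; // epoch millis: snapshot taken (assumed known for this sketch)
  final long fe; // epoch millis: snapshot copy finished

  FullBackupTimelineSketch(long fs, long fm, long fe) {
    // The snapshot is taken after the backup starts and before it completes.
    if (fs > fm || fm > fe) {
      throw new IllegalArgumentException("expected fs <= fm <= fe");
    }
    this.fs = fs;
    this.fm = fm;
    this.fe = fe;
  }

  /** True if a write with this timestamp may be present after restoring the backup. */
  boolean mayContain(long writeTs) {
    return writeTs <= fm;
  }
}
```

A write that lands between {{fm}} and {{fe}} is reported as not contained, which matches the second bullet above.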
h3. Limitation
We do not have a reliable way to determine the {{fm}} (snapshot taken time)
from the current backup metadata.
h3. Case Analysis
*Timeline & Notation Explanation*
Timeline (here {{mapt}} marks the point {{currentTime - maxAllowedPITRTime}}):
{code:java}
                    cst                  mapt                               ct
--------------------|--------------------|----------------------------------|
{code}
Backup Notation:
{code:java}
|---|------|
^   ^      ^
|   |      |
fs  fm     fe{code}
----
h4. *Case 1: {{continuousBackupStartTime < maxAllowedPITRTime}}*
This means we *do* have continuous backup data to cover *all* PITR points
between {{currentTime - maxAllowedPITRTime}} and {{currentTime}}.
{code:java}
                    cst                  mapt                               ct
--------------------|--------------------|----------------------------------|
a.   |--|----|
b.            |--|----|
c.                 |--|----|
d.                        |--|----|
e.                                  |--|----|
f.                                      |--|----|
g.                                                     |--|----| {code}
a, b: Not valid. {{fm}} is before {{cst}}, so after restoring the full backup
the table will be missing the data between {{fm}} and {{cst}}.
c, d, e: Valid. Snapshot time {{fm}} is ≥ {{cst}} and ≤ {{mapt}}. PITR is fully
supported.
f, g: Not valid. Snapshot {{fm}} is after {{mapt}}, so PITR cannot reach the
points between {{mapt}} and {{fm}}.
h4. Effective condition for the backup to be a valid base full backup for all PITR points of the table
{{fm >= cst && fm <= mapt}}
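The condition above can be written as a small Java check. Because {{fm}} is not recorded, the sketch also includes a conservative variant that relies only on the assumption {{fs <= fm <= fe}}: if {{fs >= cst}} and {{fe <= mapt}}, the condition holds wherever {{fm}} actually falls. All names are hypothetical, timestamps are assumed to be epoch milliseconds, and {{pitrStart}} stands for the point {{currentTime - maxAllowedPITRTime}} (labelled {{mapt}} on the timeline).

```java
/** Hypothetical checks for Case 1; names are illustrative, not HBase API. */
final class Case1CheckSketch {
  private Case1CheckSketch() {}

  /** Exact check; requires the snapshot time fm, which is not currently recorded. */
  static boolean coversAllPitrPoints(long fm, long cst, long pitrStart) {
    return fm >= cst && fm <= pitrStart;
  }

  /**
   * Conservative check using only fs and fe. Since fs <= fm <= fe is assumed,
   * a true result implies the exact check would also pass; a false result
   * only means the answer cannot be established without fm.
   */
  static boolean certainlyCoversAllPitrPoints(long fs, long fe, long cst, long pitrStart) {
    return fs >= cst && fe <= pitrStart;
  }
}
```

The conservative variant can reject backups that would pass the exact check, for example one that starts before {{cst}} but whose snapshot was taken after it.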
----
h4. *Case 2: {{maxAllowedPITRTime < continuousBackupStartTime}}*
In this case, we do *not* have continuous backup coverage going all the way
back to {{currentTime - maxAllowedPITRTime}}.
So the effective PITR window becomes {{cst}} to {{currentTime}}.
{code:java}
                    mapt                 cst                                ct
--------------------|--------------------|----------------------------------|
a.   |--|----|
b.             |--|----|
c.                 |--|----|
d.                        |--|----|
e.                                 |--|----|
f.                                      |--|----|
g.                                                     |--|----| {code}
a to e: Not valid. Data between {{fm}} and {{cst}} will be missing after full
backup restore.
f, g: Not valid. Snapshot {{fm}} is after {{cst}}, so points between
{{cst}} and {{fm}} are not covered.
h4. Effective condition for the backup to be a valid base full backup for all PITR points of the table
*None* — no full backup can cover all PITR points in this scenario.
----
h4. *Case 3: {{continuousBackupStartTime == maxAllowedPITRTime}}*
This is a special case where the PITR window *starts exactly* from {{cst}},
meaning the window is {{cst}} to {{currentTime}}.
{code:java}
                               cst,mapt                           ct
-------------------------------|----------------------------------|
a.        |--|----|
b.                       |--|----|
c.                            |--|----|
d.                                           |--|----| {code}
a, b: Not valid. Snapshot {{fm}} is before or equal to {{cst}}, so data from
{{fm}} to {{cst}} is missing.
c, d: Not valid. Snapshot {{fm}} is after {{cst}}, so points between
{{cst}} and {{fm}} are not restorable.
h4. Effective condition for the backup to be a valid base full backup for all PITR points of the table
*None* — no full backup can cover all PITR points in this scenario.
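Here we can see that in Cases 2 and 3 no single full backup can serve as a valid
base full backup for *all* possible PITR points for a table, and in Case 1 the
check depends on {{fm}}, which is not recorded in the backup metadata.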
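The issue description quoted below proposes comparing the PITR range each full backup supports instead. One possible reading of that proposal is sketched here, under stated assumptions: a backup with snapshot time {{fm}} can serve PITR targets from {{max(fm, windowStart)}} up to {{currentTime}}, but only if {{fm >= cst}}; {{fm}} is assumed to be available; and all names are hypothetical. It is not the implemented HBase logic.

```java
/** Hypothetical sketch of a range-based deletion check; not HBase API. */
final class PitrDeletionSketch {
  private PitrDeletionSketch() {}

  /** Minimal view of a full backup; snapshotTs (fm) is assumed to be known. */
  static final class FullBackup {
    final String id;
    final long snapshotTs;

    FullBackup(String id, long snapshotTs) {
      this.id = id;
      this.snapshotTs = snapshotTs;
    }
  }

  /** Start of the PITR range the backup can serve, or Long.MAX_VALUE if it serves none. */
  static long rangeStart(FullBackup b, long cst, long ct, long mapt) {
    if (b.snapshotTs < cst || b.snapshotTs > ct) {
      return Long.MAX_VALUE; // no continuous backup data between fm and cst
    }
    long windowStart = Math.max(ct - mapt, cst);
    return Math.max(b.snapshotTs, windowStart);
  }

  /** True if the candidate serves no PITR point, or another backup covers its whole range. */
  static boolean safeToDelete(FullBackup candidate, FullBackup[] all,
      long cst, long ct, long mapt) {
    long start = rangeStart(candidate, cst, ct, mapt);
    if (start == Long.MAX_VALUE) {
      return true; // contributes nothing to PITR
    }
    for (FullBackup other : all) {
      // Every range ends at ct, so an earlier-or-equal start means full coverage.
      if (other != candidate && rangeStart(other, cst, ct, mapt) <= start) {
        return true;
      }
    }
    return false;
  }
}
```

For example, with {{cst = 10}}, {{ct = 100}} and {{mapt = 60}}, a backup with {{fm = 50}} is covered by one with {{fm = 20}}, while the {{fm = 20}} backup is not covered by the {{fm = 50}} one.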
> Investigate flaw in backup deletion validation of PITR-critical backups and
> propose correct approach
> ----------------------------------------------------------------------------------------------------
>
> Key: HBASE-29261
> URL: https://issues.apache.org/jira/browse/HBASE-29261
> Project: HBase
> Issue Type: Task
> Components: backup&restore
> Reporter: Vinayak Hegde
> Assignee: Vinayak Hegde
> Priority: Major
>
> This Jira investigates a flaw in our current logic used to validate whether a
> full backup—potentially critical for PITR (Point-In-Time Recovery)—can be
> safely deleted.
> The current approach incorrectly checks whether a full backup is the only
> valid base for *all* PITR target points, which is not a valid criterion. A
> full backup should not be required to support _all_ PITR points to be
> considered necessary. Instead, each full backup only contributes to a
> *specific PITR time range*, depending on when the backup was taken and
> the availability of continuous backups afterward.
> This ticket proposes a more accurate and conservative approach:
> * Determine the PITR range each full backup can support.
> * Identify if another full backup exists that fully covers the same range.
> * If such a backup exists, the original one can be considered safe for
> deletion.
> All edge cases and reasoning are explained in the comments for clarity.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)