[ 
https://issues.apache.org/jira/browse/HBASE-29261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17944421#comment-17944421
 ] 

Vinayak Hegde commented on HBASE-29261:
---------------------------------------

h3. Current Behavior

Our current delete logic attempts to determine if the backup in question is a 
valid base full backup for *all* the possible PITR (Point-in-Time Recovery) 
points for a given table.
If it is the *only* such backup that allows PITR for the entire allowed window, 
we avoid deleting it unless the {{--force}} flag is used.
h3. Problem Statement

This logic is flawed and inconsistent.
It is *not practically possible* for a single full backup to serve as a valid 
base full backup for *all* possible PITR points for a table.
We demonstrate this with the following reasoning.
h3. Task

Determine whether a given backup can act as a valid base full backup for *all* 
PITR points for a specific table.
h3. Terminology
 * *maxAllowedPITRTime (mapt):*
Maximum duration into the past from the current time to which PITR is supported 
(e.g., 30 days). This is a cluster-level configuration.
So, PITR is possible from {{currentTime - mapt}} to {{currentTime}}.

 * *continuousBackupStartTime (cst):*
The earliest timestamp from which continuous backups are available for the 
table.

 * *currentTime (ct):*
The current timestamp.
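
The three terms above combine into an effective PITR window. The following is a minimal sketch, assuming all timestamps are epoch milliseconds and {{mapt}} is a duration in milliseconds; the class and method names are hypothetical and are not part of the HBase backup API.

```java
/** Illustrative only: hypothetical helper, not HBase code. */
final class PitrWindowSketch {

    /**
     * Returns {windowStart, windowEnd} in epoch millis.
     * The window cannot start earlier than (ct - mapt), and it cannot
     * start earlier than cst, because no continuous backup data exists
     * before cst.
     */
    static long[] effectiveWindow(long ctMillis, long maptMillis, long cstMillis) {
        long earliestAllowed = ctMillis - maptMillis;
        long windowStart = Math.max(earliestAllowed, cstMillis);
        return new long[] { windowStart, ctMillis };
    }
}
```

When {{cst}} is older than {{currentTime - mapt}}, the window is bounded by the configuration; otherwise it is bounded by {{cst}}.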

h3. Assumptions
 * Full backups take significant time (typically hours) because:
 ## Snapshot creation happens first.
 ## Then the snapshot is copied to a destination.

 * Snapshot creation ({{fm}}) occurs *after* the full backup is triggered 
({{fs}}) but *before* the full backup completes ({{fe}}).

 * {{fm}} (snapshot time) is not currently recorded or exposed in our metadata.

 
h3. Timeline Definitions

Let:
 * {{fs}}: Full backup *start time* (when the process begins)

 * {{fm}}: Time when the *snapshot is taken* (logical freeze point of table 
data)

 * {{fe}}: Full backup *end time* (when snapshot copy completes)

Then:
 * The full backup *may include data* up to {{fm}}.

 * The full backup *does not* include data between {{fm}} and {{fe}}.

h3. Limitation

We do not have a reliable way to determine the {{fm}} (snapshot taken time) 
from the current backup metadata.

 
h3. Case Analysis

*Timeline & Notation Explanation*

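Timeline ({{mapt}} here marks the instant {{currentTime - mapt}}, the start of 
the allowed PITR window; the same convention applies to the case headings and 
conditions below):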
{code:java}
                   cst                 mapt                                ct
--------------------|--------------------|----------------------------------| 
{code}
Backup Notation:

 
{code:java}
|---|------|
^   ^      ^
|   |      |
fs  fm    fe{code}
 

 
----
h4. *Case 1: {{continuousBackupStartTime < maxAllowedPITRTime}}*

This means we *do* have continuous backup data to cover *all* PITR points 
between {{currentTime - maxAllowedPITRTime}} and {{currentTime}}.
{code:java}
                   cst                 mapt                                ct
--------------------|--------------------|----------------------------------|
a.  |--|----| 
b.              |--|----| 
c.                 |--|----| 
d.                        |--|----| 
e.                                  |--|----| 
f.                                     |--|----| 
g.                                               |--|----|  {code}
a, b: Not valid. {{fm}} is before {{cst}}, so after restoring the full backup 
the table is missing the data written between {{fm}} and {{cst}}.

c, d, e: Valid. Snapshot time {{fm}} is ≥ {{cst}} and ≤ {{mapt}}. PITR is fully 
supported.

f, g: Not valid. Snapshot {{fm}} is after {{mapt}}, so the PITR points between 
{{mapt}} and {{fm}} cannot be restored from this backup.
h4. Effective condition for a backup to be a valid base full backup for all PITR points of a table

{{fm >= cst && fm <= mapt}}
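
The condition above can be written as a small predicate. This is only a sketch: it assumes the snapshot time {{fm}} is available, which the current metadata does not provide, and it assumes epoch-millisecond timestamps. The class and method names are hypothetical. The point labelled {{mapt}} on the timeline is computed here as {{ct - mapt}}.

```java
/** Illustrative only: hypothetical helper, not HBase code. */
final class BaseBackupCheckSketch {

    /**
     * True if a full backup whose snapshot was taken at fmMillis could act
     * as the base for every PITR point in [ct - mapt, ct].
     * ASSUMPTION: fmMillis is known; today it is not recorded.
     */
    static boolean coversAllPitrPoints(long fmMillis, long cstMillis,
                                       long ctMillis, long maptMillis) {
        long windowStart = ctMillis - maptMillis; // "mapt" on the timeline
        boolean walAvailableFromSnapshot = fmMillis >= cstMillis;
        boolean snapshotNotAfterWindowStart = fmMillis <= windowStart;
        return walAvailableFromSnapshot && snapshotNotAfterWindowStart;
    }
}
```

When {{cst}} is later than {{ct - mapt}}, no value of {{fm}} can satisfy both comparisons, so the predicate is always false.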
----
h4. *Case 2: {{maxAllowedPITRTime < continuousBackupStartTime}}*

In this case, we do *not* have continuous backup coverage going all the way 
back to {{currentTime - maxAllowedPITRTime}}.
So the effective PITR window becomes {{cst}} to {{currentTime}}.

 
{code:java}
                   mapt                 cst                                 ct
--------------------|--------------------|----------------------------------|
a.  |--|----| 
b.              |--|----| 
c.                 |--|----| 
d.                        |--|----| 
e.                                  |--|----| 
f.                                     |--|----| 
g.                                               |--|----|  {code}
a to e: Not valid. Data between {{fm}} and {{cst}} will be missing after full 
backup restore.

f, g: Not valid. Snapshot {{fm}} is after {{cst}}, so points between 
{{cst}} and {{fm}} are not covered.
h4. Effective condition for a backup to be a valid base full backup for all PITR points of a table

*None*: no full backup can cover all PITR points in this scenario.
----
h4. *Case 3: {{continuousBackupStartTime == maxAllowedPITRTime}}*

This is a special case where the PITR window *starts exactly* at {{cst}}, 
meaning the window is {{cst}} to {{currentTime}}.
{code:java}
                            cst,mapt                              ct
-------------------------------|----------------------------------|
a.                   |--|----|
b.                         |--|----|
c.                            |--|----|
d.                                 |--|----| {code}
a, b: Not valid. Snapshot {{fm}} is before {{cst}}, so data from {{fm}} to 
{{cst}} is missing.

c, d: Not valid. Snapshot {{fm}} is after {{cst}}, so points between 
{{cst}} and {{fm}} are not restorable.
h4. Effective condition for a backup to be a valid base full backup for all PITR points of a table

*None*: no full backup can cover all PITR points in this scenario.

 

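Here we can see that a single full backup cannot, in general, serve as a valid 
base full backup for *all* possible PITR points for a table: only Case 1 admits 
such a backup, and confirming it requires {{fm}}, which is not recorded in the 
backup metadata.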
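
As an alternative to the all-points check, each backup can be reasoned about in terms of the PITR range it can serve. The sketch below is one possible reading of that idea, not an agreed design: it assumes {{fm}} is known, that a backup with {{fm}} before {{cst}} serves no PITR point, and that otherwise it serves every point from {{fm}} (clamped to the window start) up to {{ct}}. All class, field and method names are hypothetical.

```java
import java.util.List;

/** Illustrative only: hypothetical types, not HBase code. */
final class PitrRangeSketch {

    /** Hypothetical view of a full backup; fmMillis is ASSUMED to be known. */
    static final class FullBackup {
        final String id;
        final long fmMillis;
        FullBackup(String id, long fmMillis) {
            this.id = id;
            this.fmMillis = fmMillis;
        }
    }

    /** Returns {from, to} of the PITR points the backup can serve, or null if none. */
    static long[] supportedRange(FullBackup b, long cst, long ct, long mapt) {
        if (b.fmMillis < cst) {
            return null; // data between fm and cst would be missing
        }
        long windowStart = Math.max(cst, ct - mapt);
        long from = Math.max(b.fmMillis, windowStart);
        return from > ct ? null : new long[] { from, ct };
    }

    /** True if some other backup serves every PITR point that target serves. */
    static boolean isCoveredByAnother(FullBackup target, List<FullBackup> all,
                                      long cst, long ct, long mapt) {
        long[] range = supportedRange(target, cst, ct, mapt);
        if (range == null) {
            return true; // target serves no PITR point at all
        }
        for (FullBackup other : all) {
            if (other == target) {
                continue;
            }
            long[] r = supportedRange(other, cst, ct, mapt);
            // Every range ends at ct, so comparing the start is sufficient.
            if (r != null && r[0] <= range[0]) {
                return true;
            }
        }
        return false;
    }
}
```

Under these assumptions, a backup would be treated as safe to delete only when {{isCoveredByAnother}} returns true.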

> Investigate flaw in backup deletion validation of PITR-critical backups and 
> propose correct approach
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-29261
>                 URL: https://issues.apache.org/jira/browse/HBASE-29261
>             Project: HBase
>          Issue Type: Task
>          Components: backup&restore
>            Reporter: Vinayak Hegde
>            Assignee: Vinayak Hegde
>            Priority: Major
>
> This Jira investigates a flaw in our current logic used to validate whether a 
> full backup—potentially critical for PITR (Point-In-Time Recovery)—can be 
> safely deleted.
> The current approach incorrectly checks whether a full backup is the only 
> valid base for *all* PITR target points, which is not a valid criterion. A 
> full backup should not be required to support _all_ PITR points to be 
> considered necessary. Instead, each full backup only contributes to a 
> *specific PITR time range*, depending on when the backup was taken and 
> the availability of continuous backups afterward.
> This ticket proposes a more accurate and conservative approach:
>  * Determine the PITR range each full backup can support.
>  * Identify if another full backup exists that fully covers the same range.
>  * If such a backup exists, the original one can be considered safe for 
> deletion.
> All edge cases and reasoning are explained in the comments for clarity.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
