[
https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16274700#comment-16274700
]
Mike Drob commented on HBASE-17852:
-----------------------------------
The problem is that backups and restores cannot occur simultaneously.
Let's say that we have a hypothetical system set to back up nightly (via cron or
some other non-interactive mechanism). While this full system backup is
running, some problem is detected with a single table and it is determined that
the correct course of action is to restore that table. Given that we base
backup and restore operations on snapshots, this should be straightforward -
the large backup can continue to run while a restore of the specific table (to
the last known good state) is put in place without waiting for the backup to
complete.
The current options appear to be to wait until the backup finishes (maybe ok,
depending on sizes/bandwidth/etc...) or to cancel the nightly backup (very bad,
especially if we have to do manual cleanup of things). I think the position
that I'm slowly arriving at is that we shouldn't be recommending nightly
backups to folks at all - this is probably a use case better served by
replication and having a wider variety of sinks available instead of only
another HBase cluster (HBASE-18846 might help with this?). That said, we would
still need some kind of bulk restore wrappers. Let me think on this more...
> Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental
> backup)
> ------------------------------------------------------------------------------------
>
> Key: HBASE-17852
> URL: https://issues.apache.org/jira/browse/HBASE-17852
> Project: HBase
> Issue Type: Sub-task
> Reporter: Vladimir Rodionov
> Assignee: Vladimir Rodionov
> Fix For: 2.0.0
>
> Attachments: HBASE-17852-v1.patch, HBASE-17852-v2.patch,
> HBASE-17852-v3.patch, HBASE-17852-v4.patch, HBASE-17852-v5.patch,
> HBASE-17852-v6.patch, HBASE-17852-v7.patch, HBASE-17852-v8.patch,
> HBASE-17852-v9.patch
>
>
> Design approach (rollback-via-snapshot) implemented in this ticket:
> # Before a backup create/delete/merge starts, we take a snapshot of the backup
> meta-table (backup system table). This procedure is lightweight because the
> meta-table is small and usually fits in a single region.
> # When an operation fails on the server side, we handle the failure by
> cleaning up partial data in the backup destination, followed by restoring the
> backup meta-table from the snapshot.
> # When an operation fails on the client side (abnormal termination, for
> example), the next time the user tries create/merge/delete, they will see an
> error message saying that the system is in an inconsistent state and repair
> is required; they will need to run the backup repair tool.
> # To avoid multiple writers to the backup system table (the backup client and
> the BackupObservers), we introduce a small table used ONLY to keep the listing
> of bulk loaded files. All backup observers will work only with this new table.
> The reason: in case of a failure during backup create/delete/merge/restore,
> when the system performs automatic rollback, some data written by backup
> observers during the failed operation may be lost. This is what we try to
> avoid.
> # The second table keeps only bulk load related references. We do not care
> about the consistency of this table, because bulk load is an idempotent
> operation and can be repeated after failure. Partially written data in the
> second table does not affect the BackupHFileCleaner plugin, because this data
> (the list of bulk loaded files) corresponds to files which have not yet been
> loaded successfully and, hence, are not visible to the system.
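
The rollback-via-snapshot steps quoted above can be illustrated with a small in-memory sketch. This is not HBase code: the class, method, and table names below (BackupSystem, run_operation, observer_record_bulk_load, and so on) are hypothetical stand-ins chosen for illustration, and the "snapshot" is just a deep copy of a dict. It only aims to show why observer writes kept in a separate table survive a rollback of the meta-table.

```python
import copy


class BackupFailure(Exception):
    """Raised by an operation to simulate a server-side failure."""


class BackupSystem:
    """Hypothetical in-memory model of the two-table design (not HBase API)."""

    def __init__(self):
        self.meta_table = {}       # stand-in for the backup system (meta) table
        self.bulk_load_table = []  # separate table written only by observers
        self.destination = set()   # stand-in for data in the backup destination

    def observer_record_bulk_load(self, hfile):
        # Observers never write to meta_table, so a meta-table rollback
        # cannot erase what they recorded.
        self.bulk_load_table.append(hfile)

    def run_operation(self, operation):
        # Step 1: snapshot the (small) meta table before the operation starts.
        snapshot = copy.deepcopy(self.meta_table)
        destination_before = set(self.destination)
        try:
            operation(self)
        except BackupFailure:
            # Step 2: clean up partial data, then restore the meta table.
            self.destination = destination_before
            self.meta_table = snapshot
            return False
        return True


def failing_backup(system):
    # Simulated backup that writes partial state and then fails.
    system.meta_table["backup_1"] = "RUNNING"
    system.destination.add("backup_1/part-0")
    system.observer_record_bulk_load("hfile-a")  # concurrent observer write
    raise BackupFailure("simulated server-side failure")


system = BackupSystem()
system.meta_table["backup_0"] = "COMPLETE"
succeeded = system.run_operation(failing_backup)
```

After the failed run in this sketch, the meta table and destination are back to their pre-operation contents, while the bulk-load listing still holds the observer's entry. Had the observer written into the meta table instead, the restore would have discarded that entry, which is the loss the second table is meant to avoid.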