Re: [HACKERS] File based Incremental backup v8

Marco Nenciarini Wed, 04 Mar 2015 09:06:01 -0800

Hi Fujii,

On 03/03/15 11:48, Fujii Masao wrote:
> On Tue, Mar 3, 2015 at 12:36 AM, Marco Nenciarini
> <[email protected]> wrote:
>> On 02/03/15 14:21, Fujii Masao wrote:
>>> On Thu, Feb 12, 2015 at 10:50 PM, Marco Nenciarini
>>> <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> I've attached an updated version of the patch.
>>>
>>> basebackup.c:1565: warning: format '%lld' expects type 'long long
>>> int', but argument 8 has type '__off_t'
>>> basebackup.c:1565: warning: format '%lld' expects type 'long long
>>> int', but argument 8 has type '__off_t'
>>> pg_basebackup.c:865: warning: ISO C90 forbids mixed declarations and code
>>>
>>
>> I'll add an explicit cast at those two lines.
>>
>>> When I applied three patches and compiled the code, I got the above
>>> warnings.
>>>
>>> How can we get the full backup that we can use for the archive recovery,
>>> from the first full backup and subsequent incremental backups? What
>>> commands should we use for that, for example? It's better to document that.
>>>
>>
>> I've sent a Python PoC that supports the plain format only (not the tar one).
>> I'm currently rewriting it in C (with tar support as well) and I'll send a
>> new patch containing it ASAP.
> 
> Yeah, if a special tool is required for that purpose, the patch should
> include it.
>


I'm working on it. The interface will be exactly the same as that of the PoC
script I attached to

[email protected]

>>> What does "1" of the heading line in backup_profile mean?
>>>
>>
>> Nothing. It's a version number. If you think it's misleading I will remove 
>> it.
> 
> A version number of file format of backup profile? If it's required for
> the validation of backup profile file as a safe-guard, it should be included
> in the profile file. For example, it might be useful to check whether
> pg_basebackup executable is compatible with the "source" backup that
> you specify. But more info might be needed for such validation.
> 

The current implementation bails out with an error if the header line differs
from what it expects. It also reports an error if the second line is not the
start WAL location. That's all that pg_basebackup needs to start a new
incremental backup. All the other information is useful for reconstructing a
full backup from an incremental backup, or perhaps for checking the
completeness of an archived full backup.
Initially the profile was present only in incremental backups, but after some
discussion on the list we agreed to always write it.
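
To make the two checks concrete, here is a minimal Python sketch of that
validation. It is an illustration only, not the code in the patch: the header
text and the "START WAL LOCATION" line format below are assumptions, since
this message only says that the heading line carries a version number and that
the second line holds the start WAL location. The function name is
hypothetical too.

```python
import re

# ASSUMPTION: both the header text and the line-2 format are hypothetical.
EXPECTED_HEADER = "POSTGRESQL BACKUP PROFILE 1"
START_WAL_RE = re.compile(r"^START WAL LOCATION: ([0-9A-Fa-f]+)/([0-9A-Fa-f]+)")


def read_start_lsn(profile_text):
    """Return the start WAL location as an integer, or raise ValueError."""
    lines = profile_text.splitlines()
    if len(lines) < 2:
        raise ValueError("backup profile is too short")
    # Check 1: the header line (including its version number) must match.
    if lines[0] != EXPECTED_HEADER:
        raise ValueError("unexpected profile header: %r" % lines[0])
    # Check 2: the second line must be the start WAL location.
    match = START_WAL_RE.match(lines[1])
    if match is None:
        raise ValueError("second line is not the start WAL location")
    high, low = (int(part, 16) for part in match.groups())
    # Combine the two 32-bit halves of the "X/X" notation into one 64-bit LSN.
    return (high << 32) | low
```

Everything after the second line is ignored here, which matches the point
above: the rest of the profile is not needed to start a new incremental
backup.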

>>> Sorry if this has been already discussed so far. Why is a backup profile
>>> file necessary? Maybe it's necessary in the future, but currently seems not.
>>
>> It's necessary because it's the only way to detect deleted files.
> 
> Maybe I'm missing something. Seems we can detect that even without a profile.
> For example, please imagine the case where the file has been deleted since
> the last full backup and then the incremental backup is taken. In this case,
> that deleted file exists only in the full backup. We can detect the deletion
> of the file by checking both full and incremental backups.
> 

When you take an incremental backup, only changed files are sent. Without the
backup_profile in the incremental backup, you cannot detect a deleted file:
it is missing from the incremental backup exactly like a file that has not
changed, so the two cases are indistinguishable.
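
A small Python sketch of the reasoning, under the assumption that the profile
lists every file that existed when the incremental backup was taken. The
function and argument names are hypothetical and are not taken from the patch
or from the PoC script.

```python
def classify_files(full_backup_files, incremental_files, incremental_profile):
    """Classify every known path as 'sent', 'unchanged' or 'deleted'.

    full_backup_files:   paths stored in the previous full backup
    incremental_files:   paths physically present in the incremental backup
    incremental_profile: paths listed in the incremental backup_profile,
                         i.e. files that existed at incremental backup time
    """
    result = {}
    for path in set(full_backup_files) | set(incremental_profile):
        if path in incremental_files:
            result[path] = "sent"        # changed or new: use the incremental copy
        elif path in incremental_profile:
            result[path] = "unchanged"   # still exists: reuse the full backup copy
        else:
            result[path] = "deleted"     # only in the full backup: drop it
    return result
```

Without the third argument, the "unchanged" and "deleted" branches collapse
into one: in both cases the file is simply absent from the incremental backup.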

>>> Have we really reached consensus on the current design, especially that
>>> basically every file needs to be read to check whether it has been
>>> modified since the last backup, even when *no* modification has happened
>>> since the last backup?
>>
>> The real problem here is that there is currently no way to detect that a
>> file has not changed since the last backup. We agreed not to use file system
>> timestamps as they are not reliable for that purpose.
> 
> TBH I prefer a timestamp-based approach in the first version of incremental
> backup even if it's less reliable than an LSN-based one. I think that some
> users who are using timestamp-based rsync (i.e., default mode) for the backup
> would be satisfied with a timestamp-based one.

The original design was to compare size+timestamp+checksums (only if everything 
else matches and the file has been modified after the start of the backup), but 
the feedback from the list was that we cannot trust the filesystem mtime and we 
must use LSN instead.
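
For reference, the LSN check can be sketched in Python as follows. This is a
simplified illustration, not the code in the patch: it assumes a relation file
made of 8192-byte blocks (the default block size) whose page header starts
with the page LSN stored as two native-endian 32-bit integers, and the
function names are hypothetical. Files in the data directory that are not
paged relation files would need different handling.

```python
import struct

BLOCK_SIZE = 8192  # ASSUMPTION: default PostgreSQL block size


def read_blocks(path):
    """Yield the blocks of a relation file one at a time."""
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                return
            yield block


def changed_since(blocks, threshold_lsn):
    """Return True as soon as one block has an LSN above the threshold."""
    for block in blocks:
        if len(block) < 8:
            continue  # too short to contain a page LSN
        # The page LSN is the first field of the page header:
        # high 32 bits first, then low 32 bits, in native byte order.
        high, low = struct.unpack_from("=II", block, 0)
        if ((high << 32) | low) > threshold_lsn:
            return True  # stop scanning; the whole file must be sent
    return False  # every block was read, nothing newer than the threshold
```

A caller would use it as changed_since(read_blocks(path), start_lsn). The
early return is the advantage over checksums: a changed file is detected as
soon as the first newer block is seen, while an unchanged file still has to be
read to the end.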

> 
>> Using LSNs has a significant advantage over using checksums, as we can start
>> the full copy as soon as we find a block with an LSN greater than the
>> threshold.
>> There are two cases: 1) the file has changed, so we can assume that we
>> detect it after reading 50% of the file on average, then we send it taking
>> advantage of the file system cache; 2) the file has not changed, so we read
>> it without sending anything.
>> It will end up producing I/O comparable to a normal backup.
> 
> Yeah, it might make the situation better than today. But I'm afraid that
> many users might get disappointed about that behavior of an incremental
> backup after the release...

I don't understand what you mean here. Could you elaborate on this point?

Regards,
Marco

-- 
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
[email protected] | www.2ndQuadrant.it
