Re: Two patches to speed up pg_rewind.
Refactored the code a bit along with fixes. Manually tested them on CentOS & Ubuntu (the latter has copy_file_range()). For the first patch, I actually have a concern. My assumption is that the target pg_data directory should already be fsync-ed. This should normally be correct, but there is one scenario where it is not: a cleanly shut-down database's pgdata directory was copied to another directory, in which case the new pgdata is not fsync-ed. I'm not sure whether that happens in real production environments or not, but even considering it, we could still use the optimization for the case that calls ensureCleanShutdown(), since that ensures a pgdata fsync on the target.

v2-0001-Fsync-the-affected-files-directories-only-in-pg_r.patch
Description: v2-0001-Fsync-the-affected-files-directories-only-in-pg_r.patch
v2-0002-Use-copy_file_range-for-file-copying-in-pg_rewind.patch
Description: v2-0002-Use-copy_file_range-for-file-copying-in-pg_rewind.patch
Re: Freeze the inserted tuples during CTAS?
On Jan 27, 2021, at 5:33 PM, Darafei Komяpa Praliaskouski <m...@komzpa.net> wrote:

Hi, I confirm that my analytic workflows often do the CTAS and VACUUM of the relation right after, before the index creation, to mark stuff as all-visible for IOS to work. Freezing and marking as visible will help.

Thanks for letting me know there is such a real case in production environments. I attached the short patch. If there are no other concerns, I will register the patch in the commitfest.

-Paul

On Wed, Jan 27, 2021 at 12:29 PM Paul Guo <gu...@vmware.com> wrote:

Here is the simple patch,

diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index dce882012e..0391699423 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -552,7 +552,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 	myState->rel = intoRelationDesc;
 	myState->reladdr = intoRelationAddr;
 	myState->output_cid = GetCurrentCommandId(true);
-	myState->ti_options = TABLE_INSERT_SKIP_FSM;
+	myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;

MatView code already does this, and COPY does it if specified. I'm not sure how the community thinks about this. Personally I care most about the all-visible setting that comes with TABLE_INSERT_FROZEN, since it lets an index-only scan work right after creating an index on a CTAS table; otherwise people have to run VACUUM first before index-only scans are used. If people do not want freezing by default, could we at least introduce an option to specify the visibility during insert?

Regards,
Paul

v1-0001-Freeze-the-tuples-during-CTAS.patch
Description: v1-0001-Freeze-the-tuples-during-CTAS.patch
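To make the index-only-scan motivation concrete, here is a minimal illustration (the table and data are made up for this example, not taken from the patch). With TABLE_INSERT_FROZEN set during CTAS, the visibility map is populated at insert time, so a check like the following should show no heap fetches without an intervening VACUUM:

create table t as select i as a from generate_series(1, 100000) i;
create index on t (a);
set enable_seqscan = off;
-- On unpatched HEAD this tends to report "Heap Fetches: N" because the
-- pages are not yet marked all-visible; with frozen inserts the visibility
-- map is already set and heap fetches drop to zero.
explain (analyze, costs off) select a from t where a > 99990;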
Re: Freeze the inserted tuples during CTAS?
Attached is the v2 version, which fixes a test failure due to a plan change (bitmap index scan -> index only scan).

v2-0001-Freeze-the-tuples-during-CTAS.patch
Description: v2-0001-Freeze-the-tuples-during-CTAS.patch
Re: Freeze the inserted tuples during CTAS?
> On Mar 3, 2021, at 1:35 PM, Masahiko Sawada wrote:
>> On Sun, Feb 21, 2021 at 4:46 PM Paul Guo wrote:
>> Attached is the v2 version that fixes a test failure due to plan change
>> (bitmap index scan -> index only scan).
> I think this is a good idea.
> BTW, how much does this patch affect the CTAS performance? I expect
> it's negligible but If there is much performance degradation due to
> populating visibility map, it might be better to provide a way to
> disable it.

Yes, this is a good suggestion. I did a quick test yesterday. Configuration: shared_buffers = 1280MB, and the test system has 7GB of memory. Test queries:

checkpoint;
\timing
create table t1 (a, b, c, d) as select i,i,i,i from generate_series(1, 20000000) i;
\timing
select pg_size_pretty(pg_relation_size('t1'));

Here are the running times:

HEAD:  Time: 10299.268 ms (00:10.299) + 1537.876 ms (00:01.538)
Patch: Time: 12257.044 ms (00:12.257) + 14.247 ms

The table size is 800+MB, so the table should fit entirely in the buffer. I was surprised to see the patch increase the CTAS time by 19.x%, and it is also not better than "CTAS + VACUUM" on HEAD. In theory the visibility map buffer changes should not cost that much. I looked at the related code again (heap_insert()). I believe the overhead could decrease along with some of the discussed CTAS optimization ideas (multi-insert, raw insert, etc).

I tested COPY also. COPY FREEZE does not involve much more overhead than plain COPY, according to the experiment results below. COPY uses multi-insert; other than that, there seems to be no relevant difference from CTAS when writing a new table.

COPY + VACUUM         Time: 8826.995 ms (00:08.827) + 1599.260 ms (00:01.599)
COPY FREEZE + VACUUM  Time: 8836.107 ms (00:08.836) + 13.581 ms

So maybe we should revisit freezing in CTAS after the CTAS insert path itself is optimized? By the way, REFRESH MATERIALIZED VIEW does freeze by default, and a matview is quite similar to CTAS. I tested it also, and the conclusion is similar to that of CTAS. Not sure why FREEZE was enabled there though; maybe I missed something?
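For anyone wanting to verify the visibility-map effect directly rather than through timings, the pg_visibility extension (in contrib) can count the all-visible/all-frozen pages right after the CTAS. A rough check, assuming the t1 table from the test above:

create extension if not exists pg_visibility;
-- With TABLE_INSERT_FROZEN both counts should roughly equal the relation's
-- page count immediately after CTAS; on HEAD they stay 0 until VACUUM runs.
select count(*) filter (where all_visible) as all_visible_pages,
       count(*) filter (where all_frozen)  as all_frozen_pages
from pg_visibility_map('t1');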
Re: fdatasync performance problem with large number of DB files
> On 2021/3/15, 7:34 AM, "Thomas Munro" wrote:
>> On Mon, Mar 15, 2021 at 11:52 AM Thomas Munro wrote:
>> Time being of the essence, here is the patch I posted last year, this
>> time with a GUC and some docs. You can set sync_after_crash to
>> "fsync" (default) or "syncfs" if you have it.
> Cfbot told me to add HAVE_SYNCFS to Solution.pm, and I fixed a couple of
> typos.

By the way, there is a common case where we could skip the fsync: a standby generated by pg_rewind/pg_basebackup that is already fsync-ed. The state of those standbys is surely not DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY, so the pgdata directory is fsync-ed again during startup when those pg instances are started. We could ask users not to fsync during pg_rewind and pg_basebackup, but we probably want pg_rewind to fsync just some files (see [1]), so better to let the startup process skip the unnecessary fsync?

As to the solution, using a GUC or writing something into a file like backup_label(?) does not seem like a good idea, since:
1. With a GUC, we still expect fsync after a real crash recovery, so we would need to reset the GUC and also specify PGOPTIONS in the pg_ctl command.
2. If pg_rewind/pg_basebackup write some hint information into a file like backup_label(?), people might copy the pgdata directory, and then we still need the fsync.

The only simple solution I can think of is to let the user touch a file, before starting the pg instance, to hint the startup process.

[1] https://www.postgresql.org/message-id/flat/25CFBDF2-5551-4CC3-ADEB-434B6B1BAD16%40vmware.com#734e7dc77f0760a3a64e808476ecc592
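A minimal sketch of the touch-a-file idea, placed where the startup process currently calls SyncDataDirectory(). The signal file name and the wrapper function are hypothetical, made up for this sketch; only to illustrate the shape of the change, not an actual patch:

/* Hypothetical signal file; the name is invented for this sketch. */
#define DATA_SYNCED_SIGNAL_FILE "pg_data_is_synced.signal"

static void
MaybeSyncDataDirectory(void)
{
	struct stat st;

	if (stat(DATA_SYNCED_SIGNAL_FILE, &st) == 0)
	{
		/*
		 * The user asserted that pgdata is already durable.  Consume the
		 * hint so it cannot accidentally apply to a later, real crash.
		 */
		durable_unlink(DATA_SYNCED_SIGNAL_FILE, WARNING);
		ereport(LOG,
				(errmsg("skipping data directory sync, signal file found")));
		return;
	}

	SyncDataDirectory();
}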
Re: Freeze the inserted tuples during CTAS?
> to set visibility map bits on materialized views. I'll start a new
> thread to discuss that.

Thanks. I have also withdrawn the patch.
Re: fdatasync performance problem with large number of DB files
On Tue, Mar 16, 2021 at 4:29 PM Fujii Masao wrote:
>
> On 2021/03/16 8:15, Thomas Munro wrote:
> > On Tue, Mar 16, 2021 at 3:30 AM Paul Guo wrote:
> >> By the way, there is a common case where we could skip the fsync: a standby
> >> generated by pg_rewind/pg_basebackup that is already fsync-ed.
> >> The state of those standbys is surely not
> >> DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY, so the
> >> pgdata directory is fsync-ed again during startup when those pg instances
> >> are started. We could ask users not to fsync
> >> during pg_rewind and pg_basebackup, but we probably want pg_rewind to fsync
> >> just some files (see [1]), so better
> >> to let the startup process skip the unnecessary fsync? As to the solution,
> >> using a GUC or writing something into a file like
> >> backup_label(?) does not seem like a good idea, since
> >> 1. With a GUC, we still expect fsync after a real crash recovery, so we need to
> >> reset the GUC and also specify PGOPTIONS in the pg_ctl command.
> >> 2. If pg_rewind/pg_basebackup write some hint information into a file like
> >> backup_label(?), people might
> >> copy the pgdata directory and then we still need the fsync.
> >> The only simple solution I can think of is to let the user touch a file
> >> to hint startup, before starting the pg instance.
> >
> > As a thought experiment only, I wonder if there is a way to make your
> > touch-a-special-signal-file scheme more reliable and less dangerous
> > (considering people might copy the signal file around or otherwise
> > screw this up). It seems to me that invalidation is the key, and
> > "unlink the signal file after the first crash recovery" isn't good
> > enough. Hmm. What if the file contained a fingerprint containing...
> > let's see... checkpoint LSN, hostname, MAC address, pgdata path, ...

The hostname, MAC address, or pgdata path (or, e.g., the inode of a file?) might be the same after VM cloning or directory copying, though that is not usual. I cannot come up with a stable scheme that invalidates the information after VM/directory cloning or copying, so the simplest way seems to be to leave the decision (i.e. touching a file) to users, instead of having pg_rewind/pg_basebackup write the information automatically.

> > (add more seasoning to taste), and then also some flags to say what is
> > known to be fully fsync'd already: the WAL, pgdata but only as far as
> > changes up to the checkpoint LSN, or all of pgdata? Then you could be
> > conservative for a non-match, but skip the extra work in some common
> > cases like pg_basebackup, as long as you trust the fingerprint scheme
> > not to produce false positives. Or something like that...
> >
> > I'm not too keen to invent clever new schemes for PG14, though. This
> > sync_after_crash=syncfs scheme is pretty simple, and has the advantage
> > that it's very cheap to do it extra redundant times assuming nothing
> > else is creating new dirty kernel pages in serious quantities. Is
> > that useful enough? In particular it avoids the dreaded "open
> > 1,000,000 uncached files over high latency network storage" problem.
> >
> > I don't want to add a hypothetical sync_after_crash=none, because it
> > seems like generally a bad idea. We already have a
> > running-with-scissors mode you could use for that: fsync=off.
>
> I heard that some backup tools sync the database directory when restoring it.
> I guess that those who use such tools might want the option to disable such
> startup sync (i.e., sync_after_crash=none) because it's not necessary.
This scenario seems to support the file-touching solution, since we do not have an automatic way to skip the fsync. I thought about using sync_after_crash=none to fix my issue, but as I said, we would need to reset the GUC since we still expect fsync/syncfs after a second crash.

> They can skip that sync by fsync=off. But if they just want to skip only that
> startup sync and make subsequent recovery (or standby server) work with
> fsync=on, they would need to shutdown the server after that startup sync
> finishes, enable fsync, and restart the server. In this case, since the server
> is restarted with the state=DB_SHUTDOWNED_IN_RECOVERY, the startup sync
> would not be performed. This procedure is tricky. So IMO supporting

This makes the process complex. From the perspective of product design, it does not seem attractive.

> sync_after_crash=none would be helpful for this case and simple.

Regards,
Paul Guo (VMware)
Re: Multi Inserts in CREATE TABLE AS - revived patch
> On Nov 9, 2020, at 6:41 PM, Bharath Rupireddy wrote:
>
> On Tue, Nov 3, 2020 at 4:54 PM Bharath Rupireddy wrote:
>>
>> If the approach followed in the patch looks okay, I can work on a separate
>> patch for multi inserts in refresh materialized view cases.
>
> Hi, I'm attaching a v2 patch that has multi inserts for CTAS as well
> as REFRESH MATERIALiZED VIEW.
>
> I did some testing: exec time in seconds.
>
> Use case 1: 1 int and 1 text column. each row size 129 bytes, size of
> 1 text column 101 bytes, number of rows 100million, size of heap file
> 12.9GB.
> HEAD: 220.733, 220.428
> Patch: 151.923, 152.484
>
> Thoughts?
>
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com/

Thanks for doing this. There might be another solution: use the raw insert interfaces (i.e. raw_heap_insert()). Attached is a test (not formal) patch that verifies this idea. raw_heap_insert() writes pages into the table file directly and also writes the FPI WAL record when tuples fill up a whole page. This seems to be more efficient.

In addition, those raw write interfaces call smgrimmedsync() when they finish raw inserting. This is because the writes bypass shared buffers, so a CHECKPOINT plus a crash could cause data corruption: some FPI WAL records cannot be replayed, and those table files were not fsync-ed before the crash. It seems that a sync request could instead be forwarded to the checkpointer for each table segment file, and then we would not need to call smgrimmedsync(). If the theory is correct, this should be a separate patch. Anyway, I tested this idea too by simply commenting out the smgrimmedsync() call in heap_raw_insert_end() (a new function in the attached patch), since forwarding an fsync request is light-weight.

I did quick and simple testing. The test environment is a CentOS 6 VM with 7GB of memory on my Mac laptop; -O3 gcc compiler option; shared_buffers set to 2GB. I did not check whether parallel scanning is triggered by the test query, and the data volume is not large, so the test time is not long. Here is the test script:

create table t1 (a int, b int, c int, d int);
insert into t1 select i,i,i,i from generate_series(1,1000) i;
show shared_buffers;
\timing on
create table t2 as select * from t1;
\timing off

Here are the results:

HEAD (37d2ff380312):            Time: 5143.041 ms (00:05.143)
Multi insert patch:             Time: 4456.461 ms (00:04.456)
Raw insert (attached):          Time: 2317.453 ms (00:02.317)
Raw insert + no smgrimmedsync(): Time: 2103.610 ms (00:02.104)

From the above data, raw insert is better; forwarding the sync should improve things further. (Note my laptop has an SSD, so on machines with SATA/SAS I believe forwarding the sync would help even more.) FYI, I also tested removing smgrimmedsync in the "vacuum full" code, which uses raw insert as well:

HEAD:             Time: 3567.036 ms (00:03.567)
no smgrimmedsync: Time: 3023.487 ms (00:03.023)

Raw insert could be used for CTAS and CREATE MATERIALIZED VIEW. For REFRESH MATERIALIZED VIEW the code is a bit different; I did not spend more time on this, so I'm not sure raw insert could be used there. But I think the previous multi-insert work could still be used in at least "INSERT tbl SELECT…" (if the INSERT is a simple one, e.g.
no trigger, no index, etc).

Regards,
Paul

v1-0001-ctas-using-raw-insert.patch
Description: v1-0001-ctas-using-raw-insert.patch
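For readers unfamiliar with the raw interfaces, here is a compact sketch of the page-flush step, modeled on the existing raw_heap_insert()/end_heap_rewrite() code in rewriteheap.c (which VACUUM FULL/CLUSTER use), rather than on the attached patch: WAL-log the filled page as a full-page image, then write it past shared buffers with smgrextend().

/*
 * Sketch of flushing one filled page in the raw-insert path; modeled on
 * rewriteheap.c, not copied from the attached patch.
 */
static void
raw_flush_page(RewriteState state)
{
	/* WAL-log the full page image first, while the page is still private */
	if (RelationNeedsWAL(state->rs_new_rel))
		log_newpage(&state->rs_new_rel->rd_node, MAIN_FORKNUM,
					state->rs_blockno, state->rs_buffer, true);

	PageSetChecksumInplace(state->rs_buffer, state->rs_blockno);

	/*
	 * Write directly via smgr, bypassing shared buffers.  skipFsync=true is
	 * only safe because the caller either does smgrimmedsync() at the end,
	 * or (per the idea above) forwards a sync request to the checkpointer.
	 */
	smgrextend(state->rs_new_rel->rd_smgr, MAIN_FORKNUM,
			   state->rs_blockno++, (char *) state->rs_buffer, true);
}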
Re: Multi Inserts in CREATE TABLE AS - revived patch
> On Nov 13, 2020, at 7:21 PM, Bharath Rupireddy wrote:
>
> On Tue, Nov 10, 2020 at 3:47 PM Paul Guo wrote:
>>
>> Thanks for doing this. There might be another solution: use the raw insert
>> interfaces (i.e. raw_heap_insert()).
>> Attached is a test (not formal) patch that verifies this idea.
>> raw_heap_insert() writes pages into the
>> table file directly and also writes the FPI WAL record when tuples fill up
>> a whole page. This seems to be
>> more efficient.
>>
>
> Thanks. Will the new raw_heap_insert() APIs scale well (i.e. extend
> the table parallelly) with parallelism? The existing
> table_multi_insert() API scales well, see, for instance, the benefit
> with parallel copy[1] and parallel multi inserts in CTAS[2].

Yes, definitely some work needs to be done to make the raw heap insert interfaces fit the parallel work, but it seems there is no hard blocking issue here?

> [1] - https://www.postgresql.org/message-id/CALj2ACWeQVd-xoQZHGT01_33St4xPoZQibWz46o7jW1PE3XOqQ%40mail.gmail.com
> [2] - https://www.postgresql.org/message-id/CALj2ACWFq6Z4_jd9RPByURB8-Y8wccQWzLf%2B0-Jg%2BKYT7ZO-Ug%40mail.gmail.com
>
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com/
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Thanks for the review; please see the replies below.

> On Jan 5, 2021, at 9:07 AM, Kyotaro Horiguchi wrote:
>
> At Wed, 8 Jul 2020 12:56:44 +0000, Paul Guo wrote in
>> On 2020/01/15 19:18, Paul Guo wrote:
>> I further fixed the last test failure (due to a small bug in the test, not
>> in code). Attached are the new patch series. Let's see the CI pipeline
>> result.
>>
>> Thanks for updating the patches!
>>
>> I started reading the 0003 patch.
>>
>> The approach that the 0003 patch uses is not the perfect solution.
>> If the standby crashes after tblspc_redo() removes the directory and before
>> its subsequent COMMIT record is replayed, PANIC error would occur since
>> there can be some unresolved missing directory entries when we reach the
>> consistent state. The problem would very rarely happen, though...
>> Just idea; calling XLogFlush() to update the minimum recovery point just
>> before tblspc_redo() performs destroy_tablespace_directories() may be
>> safe and helpful for the problem?
>
> It seems to me that what the current patch does is too complex. What
> we need to do here is to remember every invalid operation then forget
> it when the prerequisite object is dropped.
>
> When a table space is dropped before consistency is established, we
> don't need to care what has been performed inside the tablespace. In
> this perspective, it is enough to remember tablespace ids when failed
> to do something inside it due to the absence of the tablespace and
> then forget it when we remove it. We could remember individual
> database id to show them in error messages, but I'm not sure it's
> useful. The reason log_invalid_page records block numbers is to allow
> the machinery handle partial table truncations, but this is not the
> case since dropping tablespace cannot leave some of containing
> databases.
>
> As the result, we won't see an unresolved invalid operations in a
> dropped tablespace.
>
> Am I missing something?

Yes, removing the database id from the hash key in the log/forget code should usually be fine, but the previous code does stricter/safer checking. Consider this scenario:

CREATE DATABASE newdb1 TEMPLATE template_db1;
CREATE DATABASE newdb2 TEMPLATE template_db2; <- in case the template_db2 database directory is somehow missing abnormally
DROP DATABASE template_db1;

The previous code could detect this, but if we remove the database id from the code, this bad scenario is missed.

> dbase_redo:
> + if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
> + {
> +	XLogRecordMissingDir(xlrec->tablespace_id, InvalidOid, parent_path);
>
> This means "record the belonging table space directory if it is not
> found OR it is not a directory". The former can be valid but the
> latter is unconditionally can not (I don't think we bother considering
> symlinks there).

Again, this is the safer check: in case parent_path is, for example, somehow a file, we should eventually PANIC for that case and let the user investigate and then redo the recovery.

> +/*
> + * Source directory may be missing. E.g. the template database used
> + * for creating this database may have been dropped, due to reasons
> + * noted above. Moving a database from one tablespace may also be a
> + * partner in the crime.
> + */
> +if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
> +{
> +	XLogLogMissingDir(xlrec->src_tablespace_id, xlrec->src_db_id, src_path);
>
> This is a part of *creation* of the target directory.
> Lack of the
> source directory cannot be valid even if the source directory is
> dropped afterwards in the WAL stream and we can allow that if the
> *target* tablespace is dropped afterwards. As the result, as I
> mentioned above, we don't need to record about the database directory.
>
> By the way the name XLogLogMiss.. is somewhat confusing. How about
> XLogReportMissingDir (named after report_invalid_page).

Agreed. Your comment also reminds me that we should skip the check once the consistency point is reached. Here is a git diff against the previous patch; I'll send out the new rebased patches once consensus is reached.

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 7ade385965..c8fe3fe228 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -90,7 +90,7 @@ typedef struct xl_missing_dir
 static HTAB *missing_dir_tab = NULL;
 
 void
-XLogLogMissingDir(Oid spcNode, Oid dbNode, char *path)
+XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path
Two patches to speed up pg_rewind.
While reading the pg_rewind code I found two things that could speed up pg_rewind. Attached are the patches.

First one: pg_rewind fsyncs the whole pgdata directory on the target by default, but that is a waste since usually only part of the files/directories on the target is modified. The other files on the target should have been flushed already, since pg_rewind requires a clean shutdown before doing the real work. This helps the scenario where the target postgres instance includes millions of files, which has been seen in a real environment. There are several things that may need further discussion:

1. PG_FLUSH_DATA_WORKS was introduced as "Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data", but now the code guarded by it is only relevant to pre_sync_fname(), so we might want to rename it to something like HAVE_PRE_SYNC?

2. The pre_sync_fname() implementation looks like this:

#if defined(HAVE_SYNC_FILE_RANGE)
	(void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
	(void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

I'm a bit suspicious about calling posix_fadvise() with POSIX_FADV_DONTNEED. I did not check the Linux kernel code, but according to the man page I suspect this option makes the kernel tend to evict the related pages from the page cache, which might not be what we expect. This is not a big issue, since sync_file_range() exists on most widely used Linux systems.

Also, I'm not sure how much we can benefit from the pre_sync code. Note that if the directory has a lot of files, or the IO is fast, pre_sync_fname() might slow the process down instead. The reasons: if there are a lot of files, it is possible that we need to read already-synced-and-evicted inodes from disk (by open()-ing) after rewinding, since the inode cache in the Linux kernel is limited; and if the IO is fast and the kernel flushes dirty pages in the background quickly, pre_sync_fname() might just waste CPU cycles. A better solution might be to launch a separate thread that fsyncs files one by one as pg_rewind finishes handling each file; pg_basebackup could use that solution as well. Anyway, this is independent of this patch.

The second one is to use copy_file_range() for the local rewind case to replace read()+write(). This introduces a copy_file_range() configure check and HAVE_COPY_FILE_RANGE so other code could use copy_file_range() if needed. copy_file_range() was introduced in newer Linux kernels; on older Linux kernels or other Unix-like OSes mmap() might beat read()+write(), but copy_file_range() is more interesting given that it can skip the actual data copy on some file systems; this could benefit Linux filesystems on network-based block storage even more.

Regards,
Paul

0001-Fsync-the-affected-files-directories-only-in-pg_rewi.patch
Description: 0001-Fsync-the-affected-files-directories-only-in-pg_rewi.patch
0002-Use-copy_file_range-for-file-copying-in-pg_rewind.patch
Description: 0002-Use-copy_file_range-for-file-copying-in-pg_rewind.patch
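For reference, a minimal sketch of what the local-copy path could look like with a fallback, assuming the HAVE_COPY_FILE_RANGE configure check described above; the function name and buffer size are made up here, not taken from the patch:

static void
local_copy_range(int src_fd, int dest_fd, size_t nbytes)
{
#ifdef HAVE_COPY_FILE_RANGE
	while (nbytes > 0)
	{
		/* In-kernel copy; may be a reflink-style zero-copy on some filesystems */
		ssize_t		copied = copy_file_range(src_fd, NULL, dest_fd, NULL,
											 nbytes, 0);

		if (copied < 0)
			pg_fatal("could not copy file range: %m");
		if (copied == 0)
			break;				/* unexpected EOF */
		nbytes -= copied;
	}
#else
	char		buf[32768];

	while (nbytes > 0)
	{
		ssize_t		nread = read(src_fd, buf, Min(sizeof(buf), nbytes));

		if (nread <= 0)
			pg_fatal("could not read source file: %m");
		if (write(dest_fd, buf, nread) != nread)
			pg_fatal("could not write target file: %m");
		nbytes -= nread;
	}
#endif
}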
Freeze the inserted tuples during CTAS?
Here is the simple patch,

diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index dce882012e..0391699423 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -552,7 +552,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 	myState->rel = intoRelationDesc;
 	myState->reladdr = intoRelationAddr;
 	myState->output_cid = GetCurrentCommandId(true);
-	myState->ti_options = TABLE_INSERT_SKIP_FSM;
+	myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;

MatView code already does this, and COPY does it if specified. I'm not sure how the community thinks about this. Personally I care most about the all-visible setting that comes with TABLE_INSERT_FROZEN, since it lets an index-only scan work right after creating an index on a CTAS table; otherwise people have to run VACUUM first before index-only scans are used. If people do not want freezing by default, could we at least introduce an option to specify the visibility during insert?

Regards,
Paul
Re: Two patches to speed up pg_rewind.
On Jan 28, 2021, at 3:31 PM, Michael Paquier <mich...@paquier.xyz> wrote:

On Wed, Jan 27, 2021 at 09:18:48AM +0000, Paul Guo wrote:
Second one is use copy_file_range() for the local rewind case to replace read()+write(). This introduces copy_file_range() check and HAVE_COPY_FILE_RANGE so other code could use copy_file_range() if needed. copy_file_range() was introduced in newer Linux kernels; on older Linux or other Unix-like OSes mmap() might be better than read()+write(), but copy_file_range() is more interesting given that it can skip the data copy on some file systems - this could benefit Linux filesystems on network-based block storage even more.

Have you done some measurements?

I did not test pg_rewind itself, but for patch 2 I tested copy_file_range() vs read()+write() on XFS on Ubuntu 20.04.1 while working on the patches. Here are the times for copying a 1GB file (fully populated with random data); the test is a simple C program.

copy_file_range() loop (actually it finished after one call) + fsync(): 0m0.048s
read()+write() loop with a 32K read/write buffer + fsync():             0m5.004s

For patch 1: it syncs fewer files, so it surely benefits performance.
Re: fdatasync performance problem with large number of DB files
On Wed, Mar 17, 2021 at 11:45 AM Thomas Munro wrote:
>
> On Tue, Mar 16, 2021 at 9:29 PM Fujii Masao wrote:
> > On 2021/03/16 8:15, Thomas Munro wrote:
> > > I don't want to add a hypothetical sync_after_crash=none, because it
> > > seems like generally a bad idea. We already have a
> > > running-with-scissors mode you could use for that: fsync=off.
> >
> > I heard that some backup tools sync the database directory when restoring it.
> > I guess that those who use such tools might want the option to disable such
> > startup sync (i.e., sync_after_crash=none) because it's not necessary.
>
> Hopefully syncfs() will return quickly in that case, without doing any work?

I just quickly reviewed the patch (the code part). It looks good. My only concern or question is do_syncfs() calling syncfs() on an fd opened from a symlink path; I'm not 100% sure about that.

I think we could consider reviewing and then pushing the syncfs patch at this point: the fsync issue really hurts, and even though there seems to be a better plan for the future, this patch may make the sync step in startup much faster. Today I encountered my first real case of this in a production environment, where startup spends more than an hour on the fsync step. I'm pretty sure the pgdata is already synced; per a pstack of the startup process, I saw it slowly open(), fsync() (surely doing nothing) and close() files one by one, and pre_sync_fname() is also a burden in such a scenario. So this issue is really annoying.

Even with this patch in, we could discuss further optimizing the special crash recovery scenario where users explicitly know the sync step can be skipped (this scenario is surely not unusual). The syncfs patch could be used for that scenario too, but the filesystem might be shared by other applications (this is not unusual, and happens in my customer's environment for example), so syncfs() could be (probably much) slower than skipping the sync step entirely.

--
Paul Guo (VMware)
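On the symlink question: here is a rough sketch of what a per-filesystem sync helper does (an approximation of the patch's do_syncfs(), not a copy of it). Note that open() follows symlinks, so an fd opened via a pg_tblspc link refers to the link target, and syncfs() then syncs the target's filesystem, which appears to be the desired behavior:

/* Approximate sketch of a per-filesystem sync helper; not the patch's code. */
static void
sync_filesystem_of(const char *path)
{
	/* open() resolves symlinks, so this fd is on the link target's fs */
	int			fd = OpenTransientFile(path, O_RDONLY);

	if (fd < 0)
	{
		ereport(LOG,
				(errcode_for_file_access(),
				 errmsg("could not open file \"%s\": %m", path)));
		return;
	}
	if (syncfs(fd) < 0)
		ereport(LOG,
				(errcode_for_file_access(),
				 errmsg("could not synchronize file system for file \"%s\": %m",
						path)));
	CloseTransientFile(fd);
}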
Re: fdatasync performance problem with large number of DB files
About the syncfs patch: my first impression of the GUC name sync_after_crash is that it sounds like a boolean. I'm not sure about other people's feelings. Do you think it would be better to rename it to a clearer name like sync_method_after_crash, or something similar?
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
On 2021/3/27, 10:23 PM, "Alvaro Herrera" wrote:
> Hmm, can you post a rebased set, where the points under discussion
> are marked in XXX comments explaining what the issue is? This thread is
> long and old ago that it's pretty hard to navigate the whole thing in
> order to find out exactly what is being questioned.

OK. Attached is the rebased version, which includes the change I discussed in my previous reply. I also added the POD documentation change for RecursiveCopy, and modified the patch to use the backup_options introduced in 081876d75ea15c3bd2ee5ba64a794fd8ea46d794 for tablespace mapping.

> I think 0004 can be pushed without further ado, since it's a clear and
> simple fix. 0001 needs a comment about the new parameter in
> RecursiveCopy's POD documentation.

Yeah, 0004 is not risky at all. One concern might be compatibility with some WAL dump/analysis tools(?). I have no idea about this, but if we do not backport 0004 we do not seem to need to worry about it.

> As I understand, this is a backpatchable bug-fix.

Yes. Thanks.

v11-0001-Support-node-initialization-from-backup-with-tab.patch
Description: v11-0001-Support-node-initialization-from-backup-with-tab.patch
v11-0002-Tests-to-replay-create-database-operation-on-sta.patch
Description: v11-0002-Tests-to-replay-create-database-operation-on-sta.patch
v11-0003-Fix-replay-of-create-database-records-on-standby.patch
Description: v11-0003-Fix-replay-of-create-database-records-on-standby.patch
v11-0004-Fix-database-create-drop-wal-description.patch
Description: v11-0004-Fix-database-create-drop-wal-description.patch
Some two phase optimization ideas
Hello hackers,

While working on two-phase-commit related issues, I found several things in the two-phase code that could be optimized.

1. The current implementation decouples PREPARE and COMMIT/ABORT PREPARED a lot. This is flexible, but if PREPARE and COMMIT/ABORT mostly happen on the same backend, we could use a caching mechanism to speed things up. E.g.

a. FinishPreparedTransaction()->LockGXact(gid, user) does:

for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
	... find the gxact that matches gid ...

For this, we can cache the gxact during PREPARE and use it for a fast path: if the cached gxact matches the gid, we do not need to walk through the gxact array. (By the way, if the gxact array is large, that becomes a separate performance issue; use a shared-memory hash table if needed?)

b. FinishPreparedTransaction() reads the PREPARE information from either the state file (stored during checkpoint) or the WAL file. We could cache that content during PREPARE, i.e. in EndPrepare(), and then FinishPreparedTransaction() can avoid reading the state file or the WAL file.

It is possible that some databases based on Postgres two-phase commit might not want the cache, e.g. if the PREPARE backend is always different from the COMMIT/ABORT PREPARED backend (I do not know of a database designed like that, though), but the gxact cache has almost no overhead, and for (b) we could use an #ifdef to guard the PREPARE WAL data copying code. The two optimizations are easy and small; I've verified them on Greenplum (based on Postgres 12).

2. WAL content duplication between PREPARE and COMMIT/ABORT PREPARED.

See the COMMIT PREPARED function call below. Those hdr->* values already exist in the PREPARE WAL record, so we do not need them in the COMMIT PREPARED WAL record as well. During recovery, we could load this information (for both COMMIT and ABORT) into memory, and in COMMIT/ABORT PREPARED redo use the corresponding data.

RecordTransactionCommitPrepared(xid,
								hdr->nsubxacts, children,
								hdr->ncommitrels, commitrels,
								hdr->ninvalmsgs, invalmsgs,
								hdr->initfileinval, gid);

One drawback is that this might involve non-trivial changes.

3. About the gid: it is currently defined as a char[]. I'm wondering if we should define an opaque type and let databases implement their own gid types using callbacks. Typically, if I want to use a 64-bit distributed xid as the gid, the current code is not that performance- and storage-friendly (e.g. we still need strcmp() to find the gxact in LockGXact()). We could keep a char[] implementation as the default. The gid is not widely used, so the change seems small (interfaces for copy, comparison, conversion from string to the internal gid type for the PREPARE statement, etc).

Any thoughts?

Regards,
Paul
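A minimal sketch of idea 1(a), to make it concrete. last_prepared_gxact and lock_gxact_checked() are hypothetical (not from any patch); the other names follow twophase.c, and the TwoPhaseStateLock handling is omitted for brevity:

/* Hypothetical backend-local cache, set when this backend runs PREPARE. */
static GlobalTransaction last_prepared_gxact = NULL;

static GlobalTransaction
LockGXact(const char *gid, Oid user)
{
	/*
	 * Fast path: if this backend prepared the transaction itself, the
	 * cached gxact almost always matches, so we skip the array walk.
	 */
	if (last_prepared_gxact != NULL &&
		last_prepared_gxact->valid &&
		strcmp(last_prepared_gxact->gid, gid) == 0)
		return lock_gxact_checked(last_prepared_gxact, user);

	/* Slow path: linear scan, as the current code does */
	for (int i = 0; i < TwoPhaseState->numPrepXacts; i++)
	{
		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];

		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
			return lock_gxact_checked(gxact, user);
	}

	ereport(ERROR,
			(errcode(ERRCODE_UNDEFINED_OBJECT),
			 errmsg("prepared transaction with identifier \"%s\" does not exist",
					gid)));
	return NULL;				/* keep compiler quiet */
}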
pg_ugprade test failure on data set with column with default value with type bit/varbit
Hello,

While testing pg_upgrade we seem to have found an issue related to the default value of a column with type bit/varbit. Below are the steps to reproduce. We added two 'create table' DDLs to the regression database and then saw a diff after pg_upgrade testing. The pg binaries are built from code pulled a couple of days ago.

[pguo@host67:/data2/postgres/src/bin/pg_upgrade]$ git diff
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
old mode 100644
new mode 100755
index 39983abea1..a41105859e
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -188,6 +188,9 @@ if "$MAKE" -C "$oldsrc" installcheck; then
 		psql -X -d regression -c "$fix_sql;" || psql_fix_sql_status=$?
 	fi
 
+	psql -X -d regression -c "CREATE TABLE t111 ( a40 bit varying(5) DEFAULT '1');"
+	psql -X -d regression -c "CREATE TABLE t222 ( a40 bit varying(5) DEFAULT B'1');"
+
 	pg_dumpall --no-sync -f "$temp_root"/dump1.sql || pg_dumpall1_status=$?
 	if [ "$newsrc" != "$oldsrc" ]; then

[pguo@host67:/data2/postgres/src/bin/pg_upgrade]$ make check
[pguo@host67:/data2/postgres/src/bin/pg_upgrade]$ diff -du tmp_check/dump1.sql tmp_check/dump2.sql
--- tmp_check/dump1.sql	2018-03-30 16:31:44.329021348 +0800
+++ tmp_check/dump2.sql	2018-03-30 16:31:54.100478202 +0800
@@ -10956,7 +10956,7 @@
 --
 
 CREATE TABLE public.t111 (
-    a40 bit varying(5) DEFAULT B'1'::bit varying
+    a40 bit varying(5) DEFAULT (B'1'::"bit")::bit varying
 );

There is no difference in the functionality of the dumped SQL, but it is annoying. The simple patch below could fix this. Thanks.

[pguo@host67:/data2/postgres/src/bin/pg_upgrade]$ git diff
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 2cd54ec33f..ef2ab2d681 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -9389,7 +9389,7 @@ get_const_expr(Const *constval, deparse_context *context, int showtype)
 
 		case BITOID:
 		case VARBITOID:
-			appendStringInfo(buf, "B'%s'", extval);
+			appendStringInfo(buf, "'%s'", extval);
 			break;
 
 		case BOOLOID:
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
On Tue, Sep 3, 2019 at 11:58 PM Alvaro Herrera wrote:
> On 2019-Aug-22, Anastasia Lubennikova wrote:
>
> > 22.08.2019 16:13, Paul Guo wrote:
> > > Thanks. I updated the patch to v5. It passes install-check testing and
> > > recovery testing.
> > Hi,
> > Thank you for working on this fix.
> > The overall design of the latest version looks good to me.
> > But during the review, I found a bug in the current implementation.
> > New behavior must apply to crash-recovery only, now it applies to
> > archiveRecovery too.
>
> Hello
>
> Paul, Kyotaro, are you working on updating this bugfix? FWIW the latest
> patch submitted by Paul is still current and CFbot says it passes its
> own test, but from Anastasia's email it still needs a bit of work.
>
> Also: it would be good to have this new bogus scenario described by
> Anastasia covered by a new TAP test. Anastasia, can we enlist you to
> write that? Maybe Kyotaro?

Thanks Anastasia and Alvaro for the comments and suggestions. Sorry, I've been busy working on some non-PG stuff recently. I've never worked on archive recovery, so I'll need a bit more time once I'm free (hopefully in several days) to take a look. Of course, Kyotaro, Anastasia, or anyone else should feel free to address the concern before then.
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
> > It seems there's minor breakage in the build, per CFbot. Can you > please rebase this? > > There is a code conflict. See attached for the new version. Thanks. v5-0001-Extact-common-functions-from-pg_basebackup-into-s.patch Description: Binary data v5-0002-Add-option-to-write-recovery-configuration-inform.patch Description: Binary data v5-0003-Ensure-target-clean-shutdown-at-the-beginning-of-.patch Description: Binary data
Re: Batch insert in CTAS/MatView code
On Fri, Aug 2, 2019 at 2:55 AM Heikki Linnakangas wrote:
> On 17/06/2019 15:53, Paul Guo wrote:
> > I noticed that to do batch insert, we might need an additional memory copy
> > sometimes compared with "single insert"
> > (that should be the reason we previously saw slight regressions), so a
> > good solution seems to be to fall back
> > to "single insert" if the tuple length is larger than a threshold. I set
> > this to 2000 after quick testing.
>
> Where does the additional memory copy come from? Can we avoid doing it
> in the multi-insert case?

Hi Heikki, sorry for the late reply. I took some time looking at and debugging the TupleTableSlotOps code for the various TupleTableSlot types carefully, especially the BufferHeapTupleTableSlot case, on which we seemed to see a regression when no threshold was set, and I also debugged and tested the CTAS case more. I found that my earlier claim of an "additional memory copy" (mainly a tuple content copy compared with single insert) is wrong based on the latest code (and it was probably wrong with the previous code too). So in theory we should not worry about additional tuple copy overhead now, and I tried the patch without a multi-insert threshold, as attached.

To make the test results more stable, this time I ran a simple 'select count(*) from tbl' before each CTAS to warm up the shared buffers, ran a checkpoint before each CTAS, disabled autovacuum ('autovacuum = off'), and set larger shared buffers (but < 25% of total memory, as recommended by the PG docs) so that the CTAS always hits shared buffers when warm buffers exist (double-checked via explain(analyze, buffers)). These seem to be reasonable conditions for performance testing. Each kind of CTAS test was run three times (note: before each run we warm up and checkpoint as mentioned). I mainly tested the t12 case (a normal table with tuple size ~2k), since for the others our patch performs either better or similarly.

Patch:                1st_run   2nd_run   3rd_run
t12_BufferHeapTuple   7883.400  7549.966  8090.080
t12_Virtual           8041.637  8191.317  8182.404

Baseline:             1st_run   2nd_run   3rd_run
t12_BufferHeapTuple   8264.290  7508.410  7681.702
t12_Virtual           8167.792  7970.537  8106.874

I also roughly tested the other tables we mentioned (t11 and t14); the test results and conclusions are the same. t12_BufferHeapTuple means: create table tt as select * from t12; t12_Virtual means: create table tt as select *partial columns* from t12.

So it looks like for t12 the results of our code and the baseline are similar, so not setting a threshold seems fine, though the t12_BufferHeapTuple results vary a lot (by up to 0.5 seconds) for both our patch and the baseline, versus the virtual case, which is quite stable. This actually confused me a bit, given we've cached the source table in shared buffers. I suspected checkpoint effects, so I disabled checkpoints during the CTAS by setting max_wal_size = 3000; the BufferHeapTuple case (see below) still varies some. I'm not sure what the reason is, but it does not seem to be a blocker for the patch.

Patch:                1st_run   2nd_run   3rd_run
t12_BufferHeapTuple   7717.304  7413.259  7452.773
t12_Virtual           7445.742  7483.148  7593.583

Baseline:             1st_run   2nd_run   3rd_run
t12_BufferHeapTuple   8186.302  7736.541  7759.056
t12_Virtual           8004.880  8096.712  7961.483

v4-0001-Multi-insert-in-Create-Table-As.patch
Description: Binary data
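The warm-up/measure cycle described above, written out as a psql script for clarity (the table name and exact settings here are illustrative):

-- repeat once per measured run
select count(*) from t12;   -- warm shared buffers
checkpoint;                 -- avoid a checkpoint landing mid-CTAS
\timing on
create table tt as select * from t12;
\timing off
drop table tt;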
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
> Thank for rebasing.
>
> I didn't like 0001 very much.
>
> * It seems now would be the time to stop pretending we're using a file
> called recovery.conf; I know we still support older server versions that
> use that file, but it sounds like we should take the opportunity to
> rename the function to be less misleading once those versions vanish out
> of existence.

How about renaming the functions as follows?

GenerateRecoveryConf -> GenerateRecoveryConfContents
WriteRecoveryConf -> WriteRecoveryConfInfo  <- it writes standby.signal on pg12+, so the name WriteRecoveryConfContents would not be accurate

and the variable writerecoveryconf -> write_recovery_conf_info?

> * disconnect_and_exit seems a bit out of place compared to the other
> parts of this new module. I think you only put it there so that the
> 'conn' can be a global, and that you can stop passing 'conn' as a
> variable to GenerateRecoveryConf. It seems more modular to me to keep
> it as a separate variable in each program and have it passed down to the
> routine that writes the file.
>
> * From modularity also seems better to me to avoid a global variable
> 'recoveryconfcontents' and instead return the string from
> GenerateRecoveryConf to pass as a param to WriteRecoveryConf.
> (In fact, I wonder why the current structure is like it is, namely to
> have ReceiveAndUnpackTarFile write the file; why wouldn't be its caller
> be responsible for writing it?)

It is reasonable to have the common code include fewer variables. I can try modifying the patches to remove the previously added variables below from the common code.

+/* Contents of configuration file to be generated */
+extern PQExpBuffer recoveryconfcontents;
+
+extern bool writerecoveryconf;
+extern char *replication_slot;
+PGconn	   *conn;

> I wonder about putting this new file in src/fe_utils instead of keeping
> it in pg_basebackup and symlinking to pg_rewind. Maybe if we make it a
> true module (recovery_config_gen.c) it makes more sense there.

I thought a bit about where to put the common code too. It seems pg_rewind and pg_basebackup are the only consumers of this small piece of common code; I doubt it deserves a separate file under src/fe_utils.

> 0003:
>
> I still don't understand why we need a command-line option to do this.
> Why should it not be the default behavior?

This was discussed, but frankly speaking I do not know how other postgres users or enterprise providers handle this (probably some have their own scripts?). I could easily remove the option code if more people agree on that, or at least we could turn it on by default?

Thanks
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
I've updated the patch series following the suggestions. Thanks.

v6-0001-Extact-common-functions-from-pg_basebackup-into-s.patch
Description: Binary data
v6-0003-Ensure-target-clean-shutdown-at-the-beginning-of-.patch
Description: Binary data
v6-0002-Add-option-to-write-recovery-configuration-inform.patch
Description: Binary data
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
The patch series failed on Windows CI. I modified the Windows build file to fix that. See attached for the v7 version. v7-0003-Ensure-target-clean-shutdown-at-the-beginning-of-.patch Description: Binary data v7-0001-Extact-common-recovery-related-functions-from-pg_.patch Description: Binary data v7-0002-Add-option-to-write-recovery-configuration-inform.patch Description: Binary data
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
> On 2019-Sep-20, Paul Guo wrote:
>
> > The patch series failed on Windows CI. I modified the Windows build file to
> > fix that. See attached for the v7 version.
>
> Thanks.
>
> Question about 0003. If we specify --skip-clean-shutdown and the cluster
> was not cleanly shut down, shouldn't we error out instead of trying to
> press on?

pg_rewind would error out in this case; see sanityChecks(). Users are expected to clean up themselves if they use this argument.
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
On Thu, Sep 26, 2019 at 1:48 AM Alvaro Herrera wrote:
> CC Alexey for reasons that become clear below.
>
> On 2019-Sep-25, Paul Guo wrote:
> > > Question about 0003. If we specify --skip-clean-shutdown and the cluster
> > > was not cleanly shut down, shouldn't we error out instead of trying to
> > > press on?
> >
> > pg_rewind would error out in this case, see sanityChecks().
> > Users are expected to clean up themselves if they use this argument.
>
> Ah, good. We should have a comment about that below the relevant
> stanza, I suggest. (Or maybe in the same comment that ends in line
> 272).
>
> I pushed 0001 with a few tweaks. Nothing really substantial, just my
> OCD that doesn't leave me alone ... but this means your subsequent
> patches need to be adjusted. One thing is that that patch touched
> pg_rewind for no reason (those changes should have been in 0002) --
> dropped those.
>
> Another thing in 0002 is that you're adding a "-R" switch to pg_rewind,
> but we have another patch in the commitfest using the same switch for a
> different purpose. Maybe you guys need to get to an agreement over who
> uses the letter :-) Also, it would be super helpful if you review
> Alexey's patch: https://commitfest.postgresql.org/24/1849/
>
> This line is far too long:
> + printf(_(" -s, --skip-clean-shutdown skip running single-mode postgres if needed to make sure target is clean shutdown\n"));
>
> Can we make the description more concise?

Thanks. I've updated the remaining two patches and attached them as v8. Note that in the 2nd patch the long option is changed as below. Both the option and the description now seem more concise, given that we want the DB state to be either DB_SHUTDOWNED or DB_SHUTDOWNED_IN_RECOVERY.

"-s, --no-ensure-shutdowned do not auto-fix unclean shutdown"

v8-0001-Add-option-to-write-recovery-configuration-inform.patch
Description: Binary data
v8-0002-Ensure-target-clean-shutdown-at-the-beginning-of-.patch
Description: Binary data
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
> Yes, -R is already used in pg_basebackup for the same functionality, so
> it seems natural to use it here as well for consistency.
>
> I will review options naming in my own patch and update it accordingly.
> Maybe -w/-W or -a/-A options will be good, since it's about WALs
> retrieval from archive.
>
> Thanks
>
> Regards
> --
> Alexey
>
> P.S. Just noticed that in v12 the full name of the -R option in pg_basebackup is
> still --write-recovery-conf, which is good for backward compatibility,
> but looks a little bit awkward, since recovery.conf doesn't exist
> already, doesn't it? However, one may read it as
> 'write-recovery-configuration', then it seems fine.

Yes, here is the description:

"--write-recovery-conf      write configuration for replication"

So it does not mention the file recovery.conf. People who do not know the recovery.conf history should not really be confused, since PostgreSQL has various configuration files.
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
> > Note in the 2nd patch, the long option is changed as below. Both the option
> > and description
> > now seem more concise since we want the DB state as either DB_SHUTDOWNED
> > or
> > DB_SHUTDOWNED_IN_RECOVERY.
> >
> > "-s, --no-ensure-shutdowned do not auto-fix unclean shutdown"
>
> Note that "shutdowned" is incorrect English; we've let
> it live in the code because it's not user-visible, but we should
> certainly not immortalize it where it becomes so. I suppose
> "--no-ensure-shutdown" is okay, although I think some may prefer
> "--no-ensure-shut-down". Opinions from native speakers would be
> welcome. Also, let's expand "auto-fix" to "automatically fix" (or
> "repair" if there's room in the line? Not sure. Can be bikeshedded to
> death I guess.)

I chose that one from the three below:

--no-ensure-shutdown
--no-ensure-shutdowned
--no-ensure-clean-shutdown

Now I agree that for the user experience we should not use the 2nd one. --no-ensure-clean-shutdown and --no-ensure-shut-down seem to have too many hyphens, so I'm using --no-ensure-shutdown in the new version unless there are better suggestions.

> Secondarily, I see no reason to test connstr_source rather than just
> "conn" in the other patch; doing it the other way is more natural, since
> it's that thing that's tested as an argument.
>
> pg_rewind.c: Please put the new #include line keeping the alphabetical
> order.

Agreed on the above suggestions. Attached is v9. Thanks.

v9-0001-Add-option-to-write-recovery-configuration-inform.patch
Description: Binary data
v9-0002-Ensure-target-clean-shutdown-in-pg_rewind.patch
Description: Binary data
Re: Batch insert in CTAS/MatView code
Asim, thanks for the review.

On Wed, Sep 25, 2019 at 6:39 PM Asim R P wrote:
>
> On Mon, Sep 9, 2019 at 4:02 PM Paul Guo wrote:
> >
> > So in theory
> > we should not worry about additional tuple copy overhead now, and then I tried the patch without setting
> > multi-insert threshold as attached.
>
> I reviewed your patch today. It looks good overall. My concern is that
> the ExecFetchSlotHeapTuple call does not seem appropriate. In a generic
> place such as createas.c, we should be using generic tableam API only.
> However, I can also see that there is no better alternative. We need to
> compute the size of accumulated tuples so far, in order to decide whether
> to stop accumulating tuples. There is no convenient way to obtain the
> length of the tuple, given a slot. How about making that decision solely
> based on number of tuples, so that we can avoid ExecFetchSlotHeapTuple call
> altogether?

For heapam, ExecFetchSlotHeapTuple() will be called again in heap_multi_insert() to prepare the final multi-insert. Looking at ExecFetchSlotHeapTuple(), we can see that calling it multiple times involves very little overhead in the BufferHeapTuple case. Note that for the virtual tuple case the 2nd ExecFetchSlotHeapTuple() call still copies the slot contents, but we have already called ExecCopySlot(batchslot, slot) to copy into a BufferHeap slot, so there is no worry for the virtual tuple case (as a source). Previously (long ago) I probably understood the code incorrectly, so I had the same concern and used sampling (for variable-length tuples), but apparently we do not need that now.

> The multi insert copy code deals with index tuples also, which I don't see
> in the patch. Don't we need to consider populating indexes?

The CREATE TABLE AS / CREATE MATERIALIZED VIEW DDL does not involve index creation on the new table/matview. The code could probably be used for REFRESH MATERIALIZED VIEW as well; for that, we would need to consider indexes if we use multi-insert there.
Re: Batch insert in CTAS/MatView code
On Thu, Sep 26, 2019 at 9:43 PM Alvaro Herrera wrote:
> On 2019-Sep-25, Asim R P wrote:
>
> > I reviewed your patch today. It looks good overall. My concern is that
> > the ExecFetchSlotHeapTuple call does not seem appropriate. In a generic
> > place such as createas.c, we should be using generic tableam API only.
> > However, I can also see that there is no better alternative. We need to
> > compute the size of accumulated tuples so far, in order to decide whether
> > to stop accumulating tuples. There is no convenient way to obtain the
> > length of the tuple, given a slot. How about making that decision solely
> > based on number of tuples, so that we can avoid ExecFetchSlotHeapTuple call
> > altogether?
>
> ... maybe we should add a new operation to slots, that returns the
> (approximate?) size of a tuple? That would make this easy. (I'm not
> sure however what to do about TOAST considerations -- is it size in
> memory that we're worried about?)
>
> Also:
>
> + myState->mi_slots_size >= 65535)
>
> This magic number should be placed in a define next to the other one,
> but I'm not sure that heapam.h is a good location, since surely this
> applies to matviews in other table AMs too.

Yes, a define for 65535 seems better. Let's fix this one later once we have more feedback. Thanks.
Re: Batch insert in CTAS/MatView code
On Sat, Sep 28, 2019 at 5:49 AM Andres Freund wrote:
> Hi,
>
> On 2019-09-09 18:31:54 +0800, Paul Guo wrote:
> > diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
> > index e9544822bf..8a844b3b5f 100644
> > --- a/src/backend/access/heap/heapam.c
> > +++ b/src/backend/access/heap/heapam.c
> > @@ -2106,7 +2106,6 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
> >  			CommandId cid, int options, BulkInsertState bistate)
> >  {
> >  	TransactionId xid = GetCurrentTransactionId();
> > -	HeapTuple  *heaptuples;
> >  	int			i;
> >  	int			ndone;
> >  	PGAlignedBlock scratch;
> > @@ -2115,6 +2114,10 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
> >  	Size		saveFreeSpace;
> >  	bool		need_tuple_data = RelationIsLogicallyLogged(relation);
> >  	bool		need_cids = RelationIsAccessibleInLogicalDecoding(relation);
> > +	/* Declare it as static to let this memory be not on stack. */
> > +	static HeapTuple	heaptuples[MAX_MULTI_INSERT_TUPLES];
> > +
> > +	Assert(ntuples <= MAX_MULTI_INSERT_TUPLES);
> >
> >  	/* currently not needed (thus unsupported) for heap_multi_insert() */
> >  	AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
> > @@ -2124,7 +2127,6 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
> >  			HEAP_DEFAULT_FILLFACTOR);
> >
> >  	/* Toast and set header data in all the slots */
> > -	heaptuples = palloc(ntuples * sizeof(HeapTuple));
> >  	for (i = 0; i < ntuples; i++)
> >  	{
> >  		HeapTuple	tuple;
>
> I don't think this is a good idea. We shouldn't unnecessarily allocate
> 8KB on the stack. Is there any actual evidence this is a performance
> benefit? To me this just seems like it'll reduce the flexibility of the
> API, without any benefit.

Previously heaptuples was palloc-ed for each batch, which in theory should be slower than pre-allocating and reusing the memory.

> I'll also note that you've apparently not
> updated tableam.h to document this new restriction.

Yes, it should be moved from heapam.h to that file, along with the 65535 definition.
Re: Batch insert in CTAS/MatView code
> > > However, I can also see that there is no better alternative. We need to
> > > compute the size of accumulated tuples so far, in order to decide whether
> > > to stop accumulating tuples. There is no convenient way to obtain the
> > > length of the tuple, given a slot. How about making that decision solely
> > > based on number of tuples, so that we can avoid ExecFetchSlotHeapTuple call
> > > altogether?
> >
> > ... maybe we should add a new operation to slots, that returns the
> > (approximate?) size of a tuple?
>
> Hm, I'm not convinced that it's worth adding that as a dedicated
> operation. It's not that clear what it'd exactly mean anyway - what
> would it measure? As referenced in the slot? As if it were stored on
> disk? etc?
>
> I wonder if the right answer wouldn't be to just measure the size of a
> memory context containing the batch slots, or something like that.

Probably a better way is to move that logic (appending a slot to the slot array, deciding when to flush, flushing, cleaning up the slots) into table_multi_insert()? Generally, the final implementation of table_multi_insert() should be able to know the sizes easily; see the sketch below. One concern is that currently only COPY in the repo uses multi-insert, so I'm not sure whether other callers in the future would want their own logic (or we could set up a flag to allow customization, but that seems a bit over-designed?).
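To illustrate the encapsulation idea, here is a hypothetical shape for such an API; none of these names exist in the tree, and the table AM would own the buffering and flush policy internally:

/* Hypothetical tableam-level batching API; names invented for illustration. */
typedef struct TableInsertBatch TableInsertBatch;	/* opaque, AM-private */

/* Set up AM-private batch state (slot array, size accounting, etc). */
extern TableInsertBatch *table_multi_insert_begin(Relation rel,
												  CommandId cid,
												  int options);

/*
 * Hand one tuple to the AM.  The AM copies it into its own slot, tracks the
 * accumulated size itself, and flushes internally whenever it decides the
 * batch is large enough - so callers never need ExecFetchSlotHeapTuple().
 */
extern void table_multi_insert_add(TableInsertBatch *batch,
								   TupleTableSlot *slot);

/* Flush any remaining tuples and release the batch state. */
extern void table_multi_insert_end(TableInsertBatch *batch);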
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
> I went through the remaining two patches and they seem to be very clear
> and concise. However, there are two points I could complain about:
>
> 1) Maybe I've missed it somewhere in the thread above, but currently
> pg_rewind allows to run itself with -R and --source-pgdata. In that case
> -R option is just swallowed and neither standby.signal, nor
> postgresql.auto.conf is written, which is reasonable though. Should it
> be stated somehow in the docs that -R option always has to go altogether
> with --source-server? Or should pg_rewind notify user that options are
> incompatible and no recovery configuration will be written?

I modified the code & docs to address this. In the code, pg_rewind now errors out for the local case.

> 2) Are you going to leave -R option completely without tap-tests?
> Attached is a small patch, which tests -R option along with the existing
> 'remote' case. If needed it may be split into two separate cases. First,
> it tests that pg_rewind is able to succeed with minimal permissions
> according to the Michael's patch d9f543e [1]. Next, it checks presence
> of standby.signal and adds REPLICATION permission to rewind_user to test
> that new standby is able to start with generated recovery configuration.
>
> [1] https://github.com/postgres/postgres/commit/d9f543e9e9be15f92abdeaf870e57ef289020191

It seems that we could further disable the recovery-configuration setup code for the 'remote' test case, like this?

-	my $port_standby = $node_standby->port;
-	$node_master->append_conf(
-		'postgresql.conf', qq(
-primary_conninfo='port=$port_standby'
-));
+	if ($test_mode ne "remote")
+	{
+		my $port_standby = $node_standby->port;
+		$node_master->append_conf(
+			'postgresql.conf',
+			qq(primary_conninfo='port=$port_standby'));

-	$node_master->set_standby_mode();
+		$node_master->set_standby_mode();
+	}

Thanks.

v10-0001-Add-option-to-write-recovery-configuration-infor.patch Description: Binary data
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
> > BTW in the future if you have two separate patches, please post them in > separate threads and use separate commitfest items for each, even if > they have minor conflicts. > Sure. Thanks.
Batch insert in CTAS/MatView code
Hello, Postgres hackers,

The copy code uses batch insert, via the function heap_multi_insert(), to speed things up. It seems that CREATE TABLE AS and materialized views could leverage that code to boost performance as well. Attached is a patch to implement that. It was done by Taylor (cc-ed) and me. The patch also modifies heap_multi_insert() a bit for a further code-level optimization: using static memory instead of a memory context and dynamic allocation.

For ModifyTable->insert, there seem to be more restrictions on batch insert (triggers, etc.?), but it seems possible to batch insert in the cases where it is allowed?

By the way, while looking at the code, I noticed that there are 9 local arrays with large lengths in toast_insert_or_update(), which seems to be a risk of stack overflow. Maybe we should make them static or global.

Here is a quick, simple performance test on a mirrorless Postgres instance with the SQL below. The tests cover tables with small column lengths, large column lengths, and toast.

-- tuples with small size.
drop table if exists t1;
create table t1 (a int);
insert into t1 select * from generate_series(1, 1000);
drop table if exists t2;
\timing
create table t2 as select * from t1;
\timing

-- tuples that are untoasted and data that is 1664 bytes wide
drop table if exists t1;
create table t1 (a name, b name, c name, d name, e name, f name, g name, h name, i name, j name, k name, l name, m name, n name, o name, p name, q name, r name, s name, t name, u name, v name, w name, x name, y name, z name);
insert into t1 select 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z' from generate_series(1, 50);
drop table if exists t2;
\timing
create table t2 as select * from t1;
\timing

-- tuples that are toastable.
drop table if exists t1;
create table t1 (a text, b text, c text, d text, e text, f text, g text, h text, i text, j text, k text, l text, m text, n text, o text, p text, q text, r text, s text, t text, u text, v text, w text, x text, y text, z text);
insert into t1 select i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i from (select repeat('123456789', 1) from generate_series(1,2000)) i;
drop table if exists t2;
\timing
create table t2 as select * from t1;
\timing

Here are the timing results:

With the patch,
Time: 4728.142 ms (00:04.728)
Time: 14203.983 ms (00:14.204)
Time: 1008.669 ms (00:01.009)

Baseline,
Time: 11096.146 ms (00:11.096)
Time: 13106.741 ms (00:13.107)
Time: 1100.174 ms (00:01.100)

For toast and large column sizes there is a < 10% decrease, but for small column sizes the improvement is very good. Actually, if I hardcode the batch count as 4, all test cases are better, but the improvement for small column sizes is smaller than with the current patch. The number 4 is quite case-specific, so I cannot hardcode it in the patch. Of course we could tune it further, but the current value seems to be a good trade-off?

Thanks.

0001-Heap-batch-insert-for-CTAS.patch Description: Binary data
Re: Batch insert in CTAS/MatView code
Sorry for the late reply.

To Michael: thank you. I know this commitfest is ongoing and I'm not targeting it.

On Thu, Mar 7, 2019 at 4:54 PM Heikki Linnakangas wrote:
> On 06/03/2019 22:06, Paul Guo wrote:
> > The patch also modifies heap_multi_insert() a bit to do a bit further
> > code-level optimization by using static memory, instead of using memory
> > context and dynamic allocation.
>
> If toasting is required, heap_prepare_insert() creates a palloc'd tuple.
> That is still leaked to the current memory context.

Thanks. I checked the code for that but apparently missed that one. I'll see what the proper context is for CTAS. For the copy code, maybe just revert my change.

> Leaking into the current memory context is not a bad thing, because
> resetting a memory context is faster than doing a lot of pfree() calls.
> The callers just need to be prepared for that, and use a short-lived
> memory context.
>
> > By the way, while looking at the code, I noticed that there are 9 local
> > arrays with large length in toast_insert_or_update() which seems to be a
> > risk of stack overflow. Maybe we should put it as static or global.
>
> Hmm. We currently reserve 512 kB between the kernel's limit, and the
> limit we check in check_stack_depth(). See STACK_DEPTH_SLOP. Those
> arrays add up to 52800 bytes on a 64-bit machine, if I did my math
> right. So there's still a lot of headroom. I agree that it nevertheless
> seems a bit excessive, though.

I was worried about some recursive calling of it, but there is probably no such worry for toast_insert_or_update().

> > With the patch,
> >
> > Time: 4728.142 ms (00:04.728)
> > Time: 14203.983 ms (00:14.204)
> > Time: 1008.669 ms (00:01.009)
> >
> > Baseline,
> > Time: 11096.146 ms (00:11.096)
> > Time: 13106.741 ms (00:13.107)
> > Time: 1100.174 ms (00:01.100)
>
> Nice speedup!
>
> > While for toast and large column size there is < 10% decrease but for
> > small column size the improvement is super good. Actually if I hardcode
> > the batch count as 4 all test cases are better but the improvement for
> > small column size is smaller than that with current patch. Pretty much
> > the number 4 is quite case specific so I can not hardcode that in the
> > patch. Of course we could further tune that but the current value seems
> > to be a good trade-off?
>
> Have you done any profiling, on why the multi-insert is slower with
> large tuples? In principle, I don't see why it should be slower.

Thanks for the suggestion. I'll explore this a bit more.

> - Heikki
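For what it's worth, a minimal sketch of the short-lived-context approach Heikki describes, assuming a dedicated per-batch context in the CTAS receiver (names are illustrative):

/* Sketch only: reset a dedicated context once per flushed batch instead
 * of pfree'ing each toasted tuple copy individually. */
MemoryContext batchcxt;
MemoryContext oldcxt;

batchcxt = AllocSetContextCreate(CurrentMemoryContext,
								 "CTAS batch insert",
								 ALLOCSET_DEFAULT_SIZES);

oldcxt = MemoryContextSwitchTo(batchcxt);
heap_multi_insert(rel, slots, nslots, mycid, ti_options, bistate);
MemoryContextSwitchTo(oldcxt);
MemoryContextReset(batchcxt);	/* reclaims any palloc'd toast copies */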
Re: Batch insert in CTAS/MatView code
On Mon, Mar 11, 2019 at 2:58 AM David Fetter wrote:
> On Wed, Mar 06, 2019 at 10:06:27PM +0800, Paul Guo wrote:
> > Hello, Postgres hackers,
> >
> > The copy code has used batch insert with function heap_multi_insert() to
> > speed up. It seems that Create Table As or Materialized View could leverage
> > that code also to boost the performance also. Attached is a patch to
> > implement that.
>
> This is great!
>
> Is this optimization doable for multi-row INSERTs, either with tuples
> spelled out in the body of the query or in constructs like INSERT ...
> SELECT ...?

Yes. That's the "batch insert" in the ModifyTable nodes which I mentioned in the first email. By the way, batching is a common optimization for iteration-style execution models (like the Postgres executor), so batching should, in theory, benefit many executor nodes.

> Best,
> David.
> --
> David Fetter http://fetter.org/
> Phone: +1 415 235 3778
>
> Remember to vote!
> Consider donating to Postgres: http://www.postgresql.org/about/donate
Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
Hello, Postgres hackers,

Please see the attached patches.

The first patch adds an option to automatically generate recovery conf contents in the related files, following pg_basebackup. In the patch, GenerateRecoveryConf(), WriteRecoveryConf() and escape_quotes() are almost the same as those in pg_basebackup. The main difference is due to replication slot support and code (variable) limits. It seems that we could refactor slightly later to put some common code into another file, after aligning pg_rewind with pg_basebackup. This was tested manually and was done by Jimmy (cc-ed), Ashwin (cc-ed) and me.

Another patch does an automatic clean shutdown by running a single-user-mode postgres instance if the target was not cleanly shut down, since that is required by pg_rewind. This was manually tested and was done by Jimmy (cc-ed) and me. I'm not sure if we want a test case for that though.

Thanks.

0001-Ensure-target-clean-shutdown-at-beginning-of-pg_rewi.patch Description: Binary data
0001-Auto-generate-recovery-conf-at-the-end-of-pg_re.patch Description: Binary data
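For reference, a rough sketch of how the clean-shutdown step can be driven from pg_rewind; the variable names and exact flags here are illustrative, not necessarily what the attached patch uses:

/* Sketch: run the target once in single-user mode so it performs crash
 * recovery and then shuts down cleanly, leaving a rewindable data dir. */
char		cmd[MAXPGPATH * 2];

snprintf(cmd, sizeof(cmd),
		 "\"%s\" --single -F -D \"%s\" template1 < \"%s\"",
		 postgres_path,		/* hypothetical: path to the postgres binary */
		 datadir_target,
		 DEVNULL);
if (system(cmd) != 0)
	pg_fatal("single-user postgres failed to bring the target to a clean shutdown\n");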
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
On Tue, Mar 19, 2019 at 2:18 PM Michael Paquier wrote:
> On Tue, Mar 19, 2019 at 02:09:03PM +0800, Paul Guo wrote:
> > The first patch adds an option to automatically generate recovery conf
> > contents in related files, following pg_basebackup. In the patch,
> > GenerateRecoveryConf(), WriteRecoveryConf() and escape_quotes() are almost
> > same as them on pg_basebackup. The main difference is due to replication
> > slot support and code (variables) limit. It seems that we could slightly
> > refactor later to put some common code into another file after aligning
> > pg_rewind with pg_basebackup. This was tested manually and was done by
> > Jimmy (cc-ed), Ashwin (cc-ed) and me.
>
> Interesting. The two routines have really the same logic, I would
> recommend to have a first patch which does the refactoring and have
> pg_rewind use it, and then a second patch which writes recovery.conf
> and uses the first patch to get the contents. Please note that the
> common routine needs to be version-aware as pg_basebackup requires
> compatibility with past versions, but you could just pass the version
> number from the connection, and have pg_rewind pass the compiled-in
> version value.

This is a good suggestion also. Will do it.

> > Another patch does automatic clean shutdown by running a single mode
> > postgres instance if the target was not clean shut down since that is
> > required by pg_rewind. This was manually tested and was done by Jimmy
> > (cc-ed) and me. I'm not sure if we want a test case for that though.
>
> I am not sure that I see the value in that. I'd rather let the
> required service start and stop out of pg_rewind and not introduce
> dependencies with other binaries. This step can also take quite some
> time depending on the amount of WAL to replay post-crash at recovery
> and the shutdown checkpoint which is required to reach a consistent
> on-disk state.

This makes the recovery more automatic. Yes, it adds a dependency on the postgres binary, but most of the time pg_rewind is shipped alongside postgres in the same install directory. From my experience of manually testing pg_rewind, I feel that this, together with the auto-written recovery conf, really eases the burden. I'm not sure how other users usually proceed before running pg_rewind when the target was not cleanly shut down, but perhaps we can add an argument to pg_rewind to give people who want to handle the target separately another choice? Default on or off, whatever.

The time is still required for people who want to make the target ready for pg_rewind in another way.

Thanks.
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
On Wed, Mar 20, 2019 at 1:20 PM Michael Paquier wrote:
> On Wed, Mar 20, 2019 at 12:48:52PM +0800, Paul Guo wrote:
> > This is a good suggestion also. Will do it.
>
> Please note also that we don't care about recovery.conf since v12 as
> recovery parameters are now GUCs. I would suggest appending those
> extra parameters to postgresql.auto.conf, which is what pg_basebackup
> does.

Yes, the recovery conf patch in the first email already does it like this, i.e. it writes postgresql.auto.conf & standby.signal.
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Thanks. I updated the patch to v5. It passes install-check testing and recovery testing.

On Fri, Aug 2, 2019 at 6:38 AM Thomas Munro wrote:
> On Mon, Jul 15, 2019 at 10:52 PM Paul Guo wrote:
> > Please see the attached v4 patch.
>
> While moving this to the next CF, I noticed that this needs updating
> for the new pg_list.h API.
>
> --
> Thomas Munro
> https://enterprisedb.com

v5-0001-skip-copydir-if-either-src-directory-or-dst-direc.patch Description: Binary data
Questions about SyncRepWaitForLSN()
Hello pg hackers,

This is the definition of the function:

SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)

1. In the code, it emits ereport(WARNING) for the ProcDiePending/QueryCancelPending case like this:

ereport(WARNING,
		(errcode(ERRCODE_ADMIN_SHUTDOWN),
		 errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
		 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));

The message "The transaction has already committed locally" is wrong for non-commit waiting, e.g. 2PC Prepare or AbortPrepare, right? So maybe we should emit the errdetail only for the commit==true case (sketch below).

2. I'm curious how the client should proceed in the ProcDiePending corner case in this function (assuming synchronous_commit is remote_write or above). In this scenario, a transaction has been committed locally on the master, but if ProcDiePending happens, we do not know whether the commit was replicated to the standby. The commit is then not in a safe state from the perspective of HA: for example, if auto-failover happens later, we may or may not lose the transaction commit on the standby, and the client only gets (and may not even get) a warning about the unknown replication status of the commit.

Thanks.
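A minimal sketch of point 1, making the errdetail conditional on the commit flag (illustrative wording, not a finished patch):

/* Sketch: only claim "committed locally" when we were actually waiting
 * for a commit record, not for a 2PC prepare/abort-prepare. */
ereport(WARNING,
		(errcode(ERRCODE_ADMIN_SHUTDOWN),
		 errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
		 commit ?
		 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.") : 0));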
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
On Fri, Jan 10, 2020 at 9:43 PM Alvaro Herrera wrote:
> On 2020-Jan-09, Alvaro Herrera wrote:
>
> > I looked at this a little while and was bothered by the perl changes; it
> > seems out of place to have RecursiveCopy be thinking about tablespaces,
> > which is way out of its league. So I rewrote that to use a callback:
> > the PostgresNode code passes a callback that's in charge to handle the
> > case of a symlink. Things look much more in place with that. I didn't
> > verify that all places that should use this are filled.
> >
> > In 0002 I found adding a new function unnecessary: we can keep backwards
> > compat by checking 'ref' of the third argument. With that we don't have
> > to add a new function. (POD changes pending.)
>
> I forgot to add that something in these changes is broken (probably the
> symlink handling callback) so the tests fail, but I couldn't stay away
> from my daughter's birthday long enough to figure out what or how. I'm
> on something else today, so if one of you can research and submit fixed
> versions, that'd be great.
>
> Thanks,

I spent some time on this before getting off work today. With the fix below, the 4th test is now OK, but the 5th (last) one hangs due to a panic:

(gdb) bt
#0 0x003397e32625 in raise () from /lib64/libc.so.6
#1 0x003397e33e05 in abort () from /lib64/libc.so.6
#2 0x00a90506 in errfinish (dummy=0) at elog.c:590
#3 0x00a92b4b in elog_finish (elevel=22, fmt=0xb2d580 "cannot find directory %s tablespace %d database %d") at elog.c:1465
#4 0x0057aa0a in XLogLogMissingDir (spcNode=16384, dbNode=0, path=0x1885100 "pg_tblspc/16384/PG_13_202001091/16389") at xlogutils.c:104
#5 0x0065e92e in dbase_redo (record=0x1841568) at dbcommands.c:2225
#6 0x0056ac94 in StartupXLOG () at xlog.c:7200

diff --git a/src/include/commands/dbcommands.h b/src/include/commands/dbcommands.h
index b71b400e700..f8f6d5ffd03 100644
--- a/src/include/commands/dbcommands.h
+++ b/src/include/commands/dbcommands.h
@@ -19,8 +19,6 @@
 #include "lib/stringinfo.h"
 #include "nodes/parsenodes.h"

-extern void CheckMissingDirs4DbaseRedo(void);
-
 extern Oid createdb(ParseState *pstate, const CreatedbStmt *stmt);
 extern void dropdb(const char *dbname, bool missing_ok, bool force);
 extern void DropDatabase(ParseState *pstate, DropdbStmt *stmt);

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index e6e7ea505d9..4eef8bb1985 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -615,11 +615,11 @@ sub _srcsymlink
 	my $srcrealdir = readlink($srcpath);

 	opendir(my $dh, $srcrealdir);
-	while (readdir $dh)
+	while (my $entry = (readdir $dh))
 	{
-		next if (/^\.\.?$/);
-		my $spath = "$srcrealdir/$_";
-		my $dpath = "$dstrealdir/$_";
+		next if ($entry eq '.' or $entry eq '..');
+		my $spath = "$srcrealdir/$entry";
+		my $dpath = "$dstrealdir/$entry";
 		RecursiveCopy::copypath($spath, $dpath);
 	}
 	closedir $dh;
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
I further fixed the last test failure (due to a small bug in the test, not in code). Attached are the new patch series. Let's see the CI pipeline result. v9-0001-Support-node-initialization-from-backup-with-tabl.patch Description: Binary data v9-0003-Fix-replay-of-create-database-records-on-standby.patch Description: Binary data v9-0002-Tests-to-replay-create-database-operation-on-stan.patch Description: Binary data
Re: Batch insert in CTAS/MatView code
I took some time digging into the issue yesterday. The main concern with the patch is that it gets the tuple length from ExecFetchSlotHeapTuple():

+	ExecCopySlot(batchslot, slot);
+	tuple = ExecFetchSlotHeapTuple(batchslot, true, NULL);
+
+	myState->mi_slots_num++;
+	myState->mi_slots_size += tuple->t_len;

We definitely should remove ExecFetchSlotHeapTuple() here, but we still need to know the tuple length (at least a rough length). One solution might be using the memory context stat info, as mentioned; the code looks like this:

	tup_len = MemoryContextUsedSize(batchslot->tts_mcxt);
	ExecCopySlot(batchslot, slot);
	ExecMaterializeSlot(batchslot);
	tup_len = MemoryContextUsedSize(batchslot->tts_mcxt) - tup_len;

MemoryContextUsedSize() is added to calculate the total used size (simply a hack on top of the stats interface):

+int
+MemoryContextUsedSize(MemoryContext context)
+{
+	MemoryContextCounters total;
+
+	memset(&total, 0, sizeof(total));
+	context->methods->stats(context, NULL, NULL, &total);
+
+	return total.totalspace - total.freespace;
+}

This basically works, but there are concerns: the length is not accurate (though we do not need it to be that accurate) since there are some additional memory allocations; we cannot be sure the measured size will not be much larger than the real length for some slot types in the future; and I'm not sure we can rely on the memory context holding at least the tuple length after materialization for all slot types in the future. Lastly, the code seems a bit ugly. As a reference, for "create table t1 as select * from t2", the above code returns a "tuple length" of 88 (the real tuple length is 4).

Another solution is to return the real size from ExecMaterializeSlot(), e.g.

	ExecMaterializeSlot(slot, &tup_len);

For this we would probably want to store the length in the slot struct for performance. In the COPY case the tuple length is known in advance, but I can imagine more cases that do not know the size yet need it for the multi-insert interface; at least I wonder whether we should use it in REFRESH MATERIALIZED VIEW and in a fast path for the Insert node (I have heard complaints about the performance of "insert into tbl from select..." from some of our users). So the concern is not just for the case in this patch.

Besides, my colleagues Ashwin Agrawal and Adam Lee suggested that maybe we could try raw_heap_insert()-style code for CTAS and compare, since CTAS inserts into a newly created table. That would involve more discussion, much more code change, and more testing (stability and performance). So multi-insert seems to be a stable solution in the short term, given that it has been used in COPY for a long time? Whatever the solution for CTAS, we need to address the concern about tuple size for the multi-insert cases.
standby recovery fails (tablespace related) (tentative patch and discussion)
Hello postgres hackers,

Recently my colleagues and I encountered an issue: a standby cannot recover after an unclean shutdown, and it is related to tablespaces. The issue is that the standby re-replays some xlog records that need tablespace directories (e.g. creating a database in a tablespace), but those tablespace directories were already removed during the previous replay. In detail: the standby normally finishes replaying the operations below, but due to the unclean shutdown, the redo LSN in pg_control is not updated and still points before the 'create db with tablespace' xlog record; since the tablespace directories were removed, replaying the database-create WAL reports an error.

create db with tablespace
drop database
drop tablespace

Here is the log on the standby:

2019-04-17 14:52:14.926 CST [23029] LOG: starting PostgreSQL 12devel on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4), 64-bit
2019-04-17 14:52:14.927 CST [23029] LOG: listening on IPv4 address "192.168.35.130", port 5432
2019-04-17 14:52:14.929 CST [23029] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2019-04-17 14:52:14.943 CST [23030] LOG: database system was interrupted while in recovery at log time 2019-04-17 14:48:27 CST
2019-04-17 14:52:14.943 CST [23030] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2019-04-17 14:52:14.949 CST [23030] LOG: entering standby mode
2019-04-17 14:52:14.950 CST [23030] LOG: redo starts at 0/30105B8
2019-04-17 14:52:14.951 CST [23030] FATAL: could not create directory "pg_tblspc/65546/PG_12_201904072/65547": No such file or directory
2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for Database/CREATE: copy dir 1663/1 to 65546/65547
2019-04-17 14:52:14.951 CST [23029] LOG: startup process (PID 23030) exited with exit code 1
2019-04-17 14:52:14.951 CST [23029] LOG: terminating any other active server processes
2019-04-17 14:52:14.953 CST [23029] LOG: database system is shut down

Steps to reproduce:

1. Set up a master and standby.
2. On both sides, run: mkdir /tmp/some_isolation2_pg_basebackup_tablespace
3. Run these SQLs:
   drop tablespace if exists some_isolation2_pg_basebackup_tablespace;
   create tablespace some_isolation2_pg_basebackup_tablespace location '/tmp/some_isolation2_pg_basebackup_tablespace';
4. Cleanly shut down and restart both postgres instances.
5. Run the following SQLs:
   drop database if exists some_database_with_tablespace;
   create database some_database_with_tablespace tablespace some_isolation2_pg_basebackup_tablespace;
   drop database some_database_with_tablespace;
   drop tablespace some_isolation2_pg_basebackup_tablespace;
   \! pkill -9 postgres; ssh host70 pkill -9 postgres

Note an immediate shutdown via pg_ctl should also be able to reproduce this, and the above steps probably do not reproduce it 100% of the time.

I created an initial patch for this issue (see the attachment). The idea is to re-create the missing directories recursively. The issue above exists in dbase_redo(), but TablespaceCreateDbspace() (for relation-file-create redo) is probably buggy too, so I modified that function as well. Even if there were no bug in that function, using a simple pg_mkdir_p() seems cleaner.

Note, reading TablespaceCreateDbspace(), it seems this issue was already considered, though insufficiently. Frankly, this solution (directory re-creation) is not perfect, given that this should really have been the responsibility of tablespace creation (and tablespace creation does more, like symlink creation, etc.). Also, I'm not sure whether we need to use the invalid-page mechanism (see xlogutils.c). Another solution: we actually create a checkpoint on createdb/movedb/dropdb/droptablespace, so maybe we should force the standby to create a restartpoint for this special kind of checkpoint WAL record; that means setting a flag in the checkpoint WAL record and letting the checkpoint redo code create a restartpoint if the flag is set. This solution seems safer.

Thanks,
Paul

0001-Recursively-create-tablespace-directories-if-those-a.patch Description: Binary data
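A minimal sketch of the re-creation idea in the attached patch, assuming we simply fall back to pg_mkdir_p() for the destination path in dbase_redo() (the error wording is illustrative):

/* Sketch: tolerate a parent directory removed by a later-replayed DROP
 * TABLESPACE by re-creating the whole path. Note pg_mkdir_p() may
 * modify its argument, so dst_path must be a writable buffer. */
if (pg_mkdir_p(dst_path, pg_dir_create_mode) != 0)
	ereport(ERROR,
			(errcode_for_file_access(),
			 errmsg("could not create directory \"%s\": %m", dst_path)));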
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
Hi Michael,

I updated the patches as attached, following the previous discussions. The two patches:

v2-0001-Extact-common-functions-from-pg_basebackup-into-s.patch
v2-0002-Add-option-to-write-recovery-configuration-inform.patch

1. 0001 moves those common functions & variables into two new files (one .c and one .h) for both pg_rewind and pg_basebackup to use; note the functions are slightly modified (e.g. because conn may be NULL in pg_rewind). I am not sure where the proper place for the new files is. Currently they are under pg_basebackup and are used in pg_rewind (the Makefile is modified to support that).

2. 0002 adds the option to write the recovery conf.

The patch below runs a single-user-mode postgres instance if needed, to make sure the target is cleanly shut down. A new option is added (off by default).

v2-0001-Ensure-target-clean-shutdown-at-beginning-of-pg_r.patch

I've manually tested them, and installcheck passes.

Thanks.

On Wed, Mar 20, 2019 at 1:23 PM Paul Guo wrote:
> On Wed, Mar 20, 2019 at 1:20 PM Michael Paquier wrote:
>> On Wed, Mar 20, 2019 at 12:48:52PM +0800, Paul Guo wrote:
>> > This is a good suggestion also. Will do it.
>>
>> Please note also that we don't care about recovery.conf since v12 as
>> recovery parameters are now GUCs. I would suggest appending those
>> extra parameters to postgresql.auto.conf, which is what pg_basebackup
>> does.
>
> Yes, the recovery conf patch in the first email did like this, i.e.
> writing postgresql.auto.conf & standby.signal

v2-0001-Ensure-target-clean-shutdown-at-beginning-of-pg_r.patch Description: Binary data
v2-0001-Extact-common-functions-from-pg_basebackup-into-s.patch Description: Binary data
v2-0002-Add-option-to-write-recovery-configuration-inform.patch Description: Binary data
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Please see my replies inline. Thanks.

On Fri, Apr 19, 2019 at 12:38 PM Asim R P wrote:
> On Wed, Apr 17, 2019 at 1:27 PM Paul Guo wrote:
> >
> > create db with tablespace
> > drop database
> > drop tablespace.
>
> Essentially, that sequence of operations causes crash recovery to fail
> if the "drop tablespace" transaction was committed before crashing.
> This is a bug in crash recovery in general and should be reproducible
> without configuring a standby. Is that right?

No. In general, a checkpoint is done for drop_db/create_db/drop_tablespace on the master. That makes the files/directories up to date, if I understand the related code correctly. For a standby, checkpoint redo does not ensure that.

> Your patch creates missing directories in the destination. Don't we
> need to create the tablespace symlink under pg_tblspc/?

The 'create db with tablespace' redo log does not include the real tablespace directory information. Yes, we could add it into the xlog, but that seems like overdesign.

> I would prefer extending the invalid page mechanism to deal with this, as
> suggested by Ashwin off-list. It will allow us to avoid creating
> directories and files only to remove them shortly afterwards when the
> drop database and drop tablespace records are replayed.

The 'invalid page' mechanism seems more appropriate for missing pages of a file. For missing directories, we could of course hack it (e.g. by reading some page of a relfile in that database) to make the tablespace-create code (without the symlink) safer, on the assumption that those directories will be deleted soon.

More feedback about all of the previously discussed solutions is welcome.
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database-create redo error, but I suspect some other kind of redo, one that depends on files under the directory (they are not copied since the directory is not created) and that is also not covered by the invalid-page mechanism, could still fail. Thanks.

On Mon, Apr 22, 2019 at 3:40 PM Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
> Oops! The comment in the previous patch is wrong.
>
> At Mon, 22 Apr 2019 16:15:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI wrote in <20190422.161513.258021727.horiguchi.kyot...@lab.ntt.co.jp>
> > At Mon, 22 Apr 2019 12:36:43 +0800, Paul Guo wrote in
> > > Please see my replies inline. Thanks.
> > >
> > > On Fri, Apr 19, 2019 at 12:38 PM Asim R P wrote:
> > > > On Wed, Apr 17, 2019 at 1:27 PM Paul Guo wrote:
> > > > >
> > > > > create db with tablespace
> > > > > drop database
> > > > > drop tablespace.
> > > >
> > > > Essentially, that sequence of operations causes crash recovery to fail
> > > > if the "drop tablespace" transaction was committed before crashing.
> > > > This is a bug in crash recovery in general and should be reproducible
> > > > without configuring a standby. Is that right?
> > >
> > > No. In general, checkpoint is done for drop_db/create_db/drop_tablespace on
> > > master. That makes the file/directory update-to-date if I understand the
> > > related code correctly. For standby, checkpoint redo does not ensure that.
> >
> > That's right partly. As you must have seen, fast shutdown forces
> > restartpoint for the last checkpoint and it prevents the problem
> > from happening. Anyway it seems to be a problem.
> >
> > > > Your patch creates missing directories in the destination. Don't we
> > > > need to create the tablespace symlink under pg_tblspc/? I would
> > >
> > > 'create db with tablespace' redo log does not include the tablespace real
> > > directory information. Yes, we could add in it into the xlog, but that
> > > seems to be an overdesign.
> >
> > But I don't think creating directory that is to be removed just
> > after is a wanted solution. The directory most likely to be be
> > removed just after.
> >
> > > > prefer extending the invalid page mechanism to deal with this, as
> > > > suggested by Ashwin off-list. It will allow us to avoid creating
> > > > directories and files only to remove them shortly afterwards when the
> > > > drop database and drop tablespace records are replayed.
> > >
> > > 'invalid page' mechanism seems to be more proper for missing pages of a
> > > file. For missing directories, we could, of course, hack to use that (e.g.
> > > reading any page of a relfile in that database) to make sure the tablespace
> > > create code (without symlink) safer (It assumes those directories will be
> > > deleted soon).
> > >
> > > More feedback about all of the previous discussed solutions is welcome.
> >
> > It doesn't seem to me that the invalid page mechanism is
> > applicable in straightforward way, because it doesn't consider
> > simple file copy.
> >
> > Drop failure is ignored any time. I suppose we can ignore the
> > error to continue recovering as far as recovery have not reached
> > consistency. The attached would work *at least* your case, but I
> > haven't checked this covers all places where need the same
> > treatment.
>
> The comment for the new function XLogMakePGDirectory is wrong:
>
> + * There is a possibility that WAL replay causes a creation of the same
> + * directory left by the previous crash. Issuing ERROR prevents the caller
> + * from continuing recovery.
>
> The correct one is:
>
> + * There is a possibility that WAL replay causes an error by creation of a
> + * directory under a directory removed before the previous crash. Issuing
> + * ERROR prevents the caller from continuing recovery.
>
> It is fixed in the attached.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
[Patch] Check file type before calling AllocateFile() for files out of pg data directory to avoid potential issues (e.g. hang).
Hello, Postgres hackers.

I happened to see a hang issue when running a simple copy query. The root cause and the repro are quite simple:

mkfifo /tmp/a
run SQL: copy (select generate_series(1, 10)) to '/tmp/a';

It hangs at AllocateFile()->fopen() because that file was previously created as a FIFO, and the open is not Ctrl+C cancellable (on Linux):

#0 0x7f52df1c45a0 in __open_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1 0x7f52df151f20 in _IO_file_open (is32not64=4, read_write=4, prot=438, posix_mode=, filename=0x7ffe64199a10 "\360\303[\001", fp=0x1649c40) at fileops.c:229
#2 _IO_new_file_fopen (fp=fp@entry=0x1649c40, filename=filename@entry=0x15bc3f0 "/tmp/a", mode=, mode@entry=0xaa0bb7 "w", is32not64=is32not64@entry=1) at fileops.c:339
#3 0x7f52df1465e4 in __fopen_internal (filename=0x15bc3f0 "/tmp/a", mode=0xaa0bb7 "w", is32=1) at iofopen.c:90
#4 0x007a0e90 in AllocateFile (name=0x15bc3f0 "/tmp/a", mode=mode@entry=0xaa0bb7 "w") at fd.c:2229
#5 0x005c51b4 in BeginCopyTo (pstate=pstate@entry=0x15b95f0, rel=rel@entry=0x0, query=, queryRelId=queryRelId@entry=0, filename=, is_program=, attnamelist=0x0, options=0x0) at copy.c:1919
#6 0x005c8999 in DoCopy (pstate=pstate@entry=0x15b95f0, stmt=stmt@entry=0x1596b60, stmt_location=0, stmt_len=48, processed=processed@entry=0x7ffe64199cd8) at copy.c:1078
#7 0x007d717a in standard_ProcessUtility (pstmt=0x1596ea0, queryString=0x1595dc0 "copy (select generate_series(1, 10)) to '/tmp/a';", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x1596f80, completionTag=0x7ffe64199f90 "") at utility.c:551

This is, in theory, not a 100% bug, but it is probably not unusual for files to conflict between PostgreSQL SQL commands and other applications on the same node, so I think a fix is needed. I checked all the code that calls AllocateFile() and wrote a simple patch to do a sanity check (if the file exists, it must be a regular file) for those files that are probably outside the postgres data directories; files inside the data directories we can probably ignore. This is actually not a perfect fix since it is not atomic (check, then open), but it should cover most scenarios. To be perfect, we might want to refactor AllocateFile() to allow an atomic check-and-create using either 'x' in fopen() or O_EXCL in open(); also, it seems we might not want to create temp files for AllocateFile() with fixed filenames. That is beyond this patch, of course.

Thanks.

0001-Make-sure-the-file-is-a-regular-file-if-it-exists-be.patch Description: Binary data
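A minimal sketch of the sanity-check idea (the error wording is illustrative; the attached patch may differ):

/* Sketch: refuse to open an existing non-regular file. Note this is
 * check-then-open, so it is not atomic, as discussed above. */
struct stat st;

if (stat(filename, &st) == 0 && !S_ISREG(st.st_mode))
	ereport(ERROR,
			(errcode(ERRCODE_WRONG_OBJECT_TYPE),
			 errmsg("\"%s\" exists but is not a regular file", filename)));

fp = AllocateFile(filename, PG_BINARY_W);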
Re: [Patch] Check file type before calling AllocateFile() for files out of pg data directory to avoid potential issues (e.g. hang).
On Wed, Apr 24, 2019 at 12:49 PM Andres Freund wrote:
> Hi,
>
> On 2019-04-24 12:46:15 +0800, Paul Guo wrote:
> > This is, in theory, not a 100% bug, but it is probably not unusual to see
> > conflicts of files between postgresql sqls and other applications on the
> > same node so I think the fix is needed. I checked all code that calls
> > AllocateFile() and wrote a simple patch to do sanity check (if the file
> > exists it must be a regular file) for those files which are probably out of
> > the postgres data directories which we probably want to ignore. This is
> > actually not a perfect fix since it is not atomic (check and open), but it
> > should fix most of the scenarios. To be perfect, we might want to refactor
> > AllocateFile() to allow atomic check&create using either 'x' in fopen()
> > or O_EXCL in open(), also it seems that we might not want to create temp
> > file for AllocateFile() with fixed filenames. This is beyond of this patch
> > of course.
>
> This seems like a bad idea to me. IMO we want to support using a pipe
> etc here. If the admin creates a fifo like this without attaching a
> program it seems like it's their fault.

Oh, I did not know about that usage scenario before. So yes, for this we need to keep the current code logic in the copy code.

> Greetings,
>
> Andres Freund
Re: [Patch] Check file type before calling AllocateFile() for files out of pg data directory to avoid potential issues (e.g. hang).
On Wed, Apr 24, 2019 at 10:36 PM Tom Lane wrote:
> Alvaro Herrera writes:
> > On 2019-Apr-24, Paul Guo wrote:
> >> On Wed, Apr 24, 2019 at 12:49 PM Andres Freund wrote:
> >>> This seems like a bad idea to me. IMO we want to support using a pipe
> >>> etc here. If the admin creates a fifo like this without attaching a
> >>> program it seems like it's their fault.
>
> >> Oh, I never know this application scenario before. So yes, for this, we
> >> need to keep the current code logic in copy code.
>
> > But the pgstat.c patch seems reasonable to me.
>
> Nah, I don't buy that one either. Nobody has any business creating any
> non-Postgres files in the stats directory ... and if somebody does want
> to stick a FIFO in there, perhaps for debug purposes, why should we stop
> them?

For the pgstat case, the files passed to AllocateFile() are actually temp files which are soon renamed to other file names. Users are unlikely to want those as FIFOs; for developers, 'tail -f' should be sufficient for debugging purposes.

	const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;

	fpout = AllocateFile(tmpfile, PG_BINARY_W);
	fwrite(fpout, ...);
	rename(tmpfile, statfile);

I'm not sure whether those hardcoded temp filenames (not just the ones in pgstat) are reused across postgres reboots. If not, we could instead call a libc function to create unique temp files and remove those hardcoded temp filename variables; otherwise we would also want them to be regular files.

> The case with COPY is a bit different, since there it's reasonable to be
> worried about collisions with other users' files --- but I agree with
> Andres that this change would eliminate too many valid use-cases.
>
> regards, tom lane
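A sketch of the unique-temp-file idea, using mkstemp() so the name can never collide with a pre-existing FIFO; this is purely illustrative and not how pgstat works today:

/* Sketch only: let libc pick a unique name under the stats directory. */
char		tmppath[MAXPGPATH];
int			fd;
FILE	   *fpout;

snprintf(tmppath, sizeof(tmppath), "%s/pgstat.tmp.XXXXXX",
		 PGSTAT_STAT_PERMANENT_DIRECTORY);
fd = mkstemp(tmppath);			/* creates with O_EXCL semantics */
if (fd < 0 || (fpout = fdopen(fd, "w")) == NULL)
	ereport(LOG,
			(errcode_for_file_access(),
			 errmsg("could not create temporary statistics file \"%s\": %m",
					tmppath)));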
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
On Wed, Apr 24, 2019 at 4:14 PM Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
> Mmm. I posted to wrong thread. Sorry.
>
> At Tue, 23 Apr 2019 16:39:49 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI wrote in <20190423.163949.36763221.horiguchi.kyot...@lab.ntt.co.jp>
> > At Tue, 23 Apr 2019 13:31:58 +0800, Paul Guo wrote in
> > > Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database
> > > create redo error, but I suspect some other kind of redo, which depends on
> > > the files under the directory (they are not copied since the directory is
> > > not created) and also cannot be covered by the invalid page mechanism,
> > > could fail. Thanks.
> >
> > If recovery starts from just after tablespace creation, that's
> > simple. The Symlink to the removed tablespace is already removed
> > in the case. Hence server innocently create files directly under
> > pg_tblspc, not in the tablespace. Finally all files that were
> > supposed to be created in the removed tablespace are removed
> > later in recovery.
> >
> > If recovery starts from recalling page in a file that have been
> > in the tablespace, XLogReadBufferExtended creates one (perhaps
> > directly in pg_tblspc as described above) and the files are
> > removed later in recoery the same way to above. This case doen't
> > cause FATAL/PANIC during recovery even in master.
> >
> > XLogReadBufferExtended@xlogutils.c
> > | * Create the target file if it doesn't already exist. This lets us cope
> > | * if the replay sequence contains writes to a relation that is later
> > | * deleted. (The original coding of this routine would instead suppress
> > | * the writes, but that seems like it risks losing valuable data if the
> > | * filesystem loses an inode during a crash. Better to write the data
> > | * until we are actually told to delete the file.)
> >
> > So buffered access cannot be a problem for the reason above. The
> > remaining possible issue is non-buffered access to files in
> > removed tablespaces. This is what I mentioned upthread:
> >
> > me> but I haven't checked this covers all places where need the same
> > me> treatment.
>
> RM_DBASE_ID is fixed by the patch.
>
> XLOG/XACT/CLOG/MULTIXACT/RELMAP/STANDBY/COMMIT_TS/REPLORIGIN/LOGICALMSG:
> - are not relevant.
>
> HEAP/HEAP2/BTREE/HASH/GIN/GIST/SEQ/SPGIST/BRIN/GENERIC:
> - Resources works on buffer is not affected.
>
> SMGR:
> - Both CREATE and TRUNCATE seems fine.
>
> TBLSPC:
> - We don't nest tablespace directories. No Problem.
>
> I don't find a similar case.

I took some time digging into the related code. It seems that ignoring the failure when the dst directory cannot be created directly should be fine, since the smgr redo code eventually creates the tablespace path by calling TablespaceCreateDbspace(). What's more, I found some additional issues.

1) The error message below is actually misleading.

2019-04-17 14:52:14.951 CST [23030] FATAL: could not create directory "pg_tblspc/65546/PG_12_201904072/65547": No such file or directory
2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for Database/CREATE: copy dir 1663/1 to 65546/65547

That is due to dbase_desc(). It can be fixed simply, following the code logic in GetDatabasePath().

2) It seems that the src directory could be missing, and then dbase_redo()->copydir() could error out. For example:

\!rm -rf /tmp/tbspace1
\!mkdir /tmp/tbspace1
\!rm -rf /tmp/tbspace2
\!mkdir /tmp/tbspace2
create tablespace tbs1 location '/tmp/tbspace1';
create tablespace tbs2 location '/tmp/tbspace2';
create database db1 tablespace tbs1;
alter database db1 set tablespace tbs2;
drop tablespace tbs1;

Let's say the standby finishes all replay, but the redo LSN in pg_control still points at the 'alter database' record; then kill postgres. In theory, at startup dbase_redo()->copydir() will ERROR, since 'drop tablespace tbs1' has already removed the directories (and symlink) of tbs1. The simple code change below could fix that:

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 9707afabd9..7d755c759e 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2114,6 +2114,15 @@ dbase_redo(XLogReaderState *record)
 	 */
 	FlushDatabaseBuffers(xlrec->src_db_id);

+	/*
+	 * It is possible that the source directory is missing if
+	 * we are re-replaying the xlog while subsequent xlogs
+	 * drop the tablespace in previous replaying. For this
+	 * we just skip.
+	 */
+	if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
I updated the original patch to:

1) Skip copydir() if either the src path or the dst parent path is missing in dbase_redo(). Both missing cases seem possible. For the missing-src-path case, mkdir_p() is meaningless. Moving the directory-existence check into dbase_redo() seems to have less impact on other code.

2) Fixed dbase_desc(). Now the xlog output looks correct:

rmgr: Database len (rec/tot): 42/42, tx: 486, lsn: 0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to pg_tblspc/16384/PG_12_201904281/16386
rmgr: Database len (rec/tot): 34/34, tx: 487, lsn: 0/01638EB8, prev 0/01638E40, desc: DROP dir pg_tblspc/16384/PG_12_201904281/16386

I was not familiar with the TAP test details before; I learned a lot about how to test such cases from Kyotaro's patch series. 👍

On Sun, Apr 28, 2019 at 3:33 PM Paul Guo wrote:
> On Wed, Apr 24, 2019 at 4:14 PM Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
>> Mmm. I posted to wrong thread. Sorry.
>>
>> At Tue, 23 Apr 2019 16:39:49 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI wrote in <20190423.163949.36763221.horiguchi.kyot...@lab.ntt.co.jp>
>> > At Tue, 23 Apr 2019 13:31:58 +0800, Paul Guo wrote in
>> > > Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database
>> > > create redo error, but I suspect some other kind of redo, which depends on
>> > > the files under the directory (they are not copied since the directory is
>> > > not created) and also cannot be covered by the invalid page mechanism,
>> > > could fail. Thanks.
>> >
>> > If recovery starts from just after tablespace creation, that's
>> > simple. The Symlink to the removed tablespace is already removed
>> > in the case. Hence server innocently create files directly under
>> > pg_tblspc, not in the tablespace. Finally all files that were
>> > supposed to be created in the removed tablespace are removed
>> > later in recovery.
>> >
>> > If recovery starts from recalling page in a file that have been
>> > in the tablespace, XLogReadBufferExtended creates one (perhaps
>> > directly in pg_tblspc as described above) and the files are
>> > removed later in recoery the same way to above. This case doen't
>> > cause FATAL/PANIC during recovery even in master.
>> >
>> > XLogReadBufferExtended@xlogutils.c
>> > | * Create the target file if it doesn't already exist. This lets us cope
>> > | * if the replay sequence contains writes to a relation that is later
>> > | * deleted. (The original coding of this routine would instead suppress
>> > | * the writes, but that seems like it risks losing valuable data if the
>> > | * filesystem loses an inode during a crash. Better to write the data
>> > | * until we are actually told to delete the file.)
>> >
>> > So buffered access cannot be a problem for the reason above. The
>> > remaining possible issue is non-buffered access to files in
>> > removed tablespaces. This is what I mentioned upthread:
>> >
>> > me> but I haven't checked this covers all places where need the same
>> > me> treatment.
>>
>> RM_DBASE_ID is fixed by the patch.
>>
>> XLOG/XACT/CLOG/MULTIXACT/RELMAP/STANDBY/COMMIT_TS/REPLORIGIN/LOGICALMSG:
>> - are not relevant.
>>
>> HEAP/HEAP2/BTREE/HASH/GIN/GIST/SEQ/SPGIST/BRIN/GENERIC:
>> - Resources works on buffer is not affected.
>>
>> SMGR:
>> - Both CREATE and TRUNCATE seems fine.
>>
>> TBLSPC:
>> - We don't nest tablespace directories. No Problem.
>>
>> I don't find a similar case.
>
> I took some time digging into the related code. It seems that ignoring
> the failure when the dst directory cannot be created directly should be fine,
> since the smgr redo code eventually creates the tablespace path by calling
> TablespaceCreateDbspace(). What's more, I found some additional issues.
>
> 1) The error message below is actually misleading.
>
> 2019-04-17 14:52:14.951 CST [23030] FATAL: could not create directory "pg_tblspc/65546/PG_12_201904072/65547": No such file or directory
> 2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for Database/CREATE: copy dir 1663/1 to 65546/65547
>
> That is due to dbase_desc(). It can be fixed simply, following the code
> logic in GetDatabasePath().
>
> 2) It seems that the src directory could be missing, and then
> dbase_redo()->copydir() could error out. For example,
>
> \!rm -r
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Thanks for the reply.

On Tue, May 7, 2019 at 2:47 PM Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
> + if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
> + {
>
> This patch is allowing missing source and destination directory
> even in consistent state. I don't think it is safe.

I do not understand this. Can you elaborate?

> +ereport(WARNING,
> +        (errmsg("directory \"%s\" for copydir() does not exists."
> +                "It is possibly expected. Skip copydir().",
> +                parent_path)));
>
> This message seems unfriendly to users, or it seems like an elog
> message. How about something like this. The same can be said for
> the source directory.
>
> | WARNING: skipped creating database directory: "%s"
> | DETAIL: The tabelspace %u may have been removed just before crash.

Yeah. That looks better.

> # I'm not confident in this at all:(
>
> > 2) Fixed dbase_desc(). Now the xlog output looks correct.
> >
> > rmgr: Database len (rec/tot): 42/42, tx: 486, lsn: 0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to pg_tblspc/16384/PG_12_201904281/16386
> >
> > rmgr: Database len (rec/tot): 34/34, tx: 487, lsn: 0/01638EB8, prev 0/01638E40, desc: DROP dir pg_tblspc/16384/PG_12_201904281/16386
>
> WAL records don't convey such information. The previous
> description seems right to me.

2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for Database/CREATE: copy dir 1663/1 to 65546/65547

The directories in this message are definitely wrong and misleading.

> > > > Also I'd suggest we should use pg_mkdir_p() in TablespaceCreateDbspace()
> > > > to replace the code block includes a lot of
> > > > get_parent_directory(), MakePGDirectory(), etc even it
> > > > is not fixing a bug since pg_mkdir_p() code change seems to be more
> > > > graceful and simpler.
>
> But I don't agree to this. pg_mkdir_p goes above two-parents up,
> which would be unwanted here.

I do not understand this either. pg_mkdir_p() is similar to 'mkdir -p'. The change just makes the code more concise, though in theory it is not needed.

> > > > Whatever ignore mkdir failure or mkdir_p, I found that these steps seem to
> > > > be error-prone along with postgres evolving since they are hard to test and
> > > > also we are not easy to think out various potential bad cases. Is it possible
> > > > that we should do real checkpoint (flush & update redo lsn) when seeing
> > > > checkpoint xlogs for these operations? This will slow down standby
> > > > but master also does this and also these operations are not usual,
> > > > espeically it seems that it does not slow down wal receiving usually?
>
> That dramatically slows recovery (not replication) if databases
> are created and deleted frequently. That wouldn't be acceptable.

This behavior is rare and seems to have the same impact on master & standby via checkpoint/restartpoint. We do not worry about the master, so we should not worry about the standby either.
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
On Tue, May 14, 2019 at 11:06 AM Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
> Hello.
>
> At Mon, 13 May 2019 17:37:50 +0800, Paul Guo wrote in <caeet0zf9yn4daxyuflzocayyxuff1ms_oqwea+rwv3gha5q...@mail.gmail.com>
> > Thanks for the reply.
> >
> > On Tue, May 7, 2019 at 2:47 PM Kyotaro HORIGUCHI wrote:
> > >
> > > + if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
> > > + {
> > >
> > > This patch is allowing missing source and destination directory
> > > even in consistent state. I don't think it is safe.
> >
> > I do not understand this. Can you elaborate?
>
> Suppose we were recoverying based on a backup at LSN1 targeting
> to LSN3 then it crashed at LSN2, where LSN1 < LSN2 <= LSN3. LSN2
> is called as "consistency point", before where the database is
> not consistent. It's because we are applying WAL recored older
> than those that were already applied in the second trial. The
> same can be said for crash recovery, where LSN1 is the latest
> checkpoint ('s redo LSN) and LSN2=LSN3 is the crashed LSN.
>
> Creation of an existing directory or dropping of a non-existent
> directory are apparently inconsistent or "broken" so we should
> stop recovery when seeing such WAL records while database is in
> consistent state.

This seems hard to detect. I thought about using the invalid_page mechanism a while ago, but it seems hard to fully detect a dropped tablespace with it.

> > > > 2) Fixed dbase_desc(). Now the xlog output looks correct.
> > > >
> > > > rmgr: Database len (rec/tot): 42/42, tx: 486, lsn: 0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to pg_tblspc/16384/PG_12_201904281/16386
> > > >
> > > > rmgr: Database len (rec/tot): 34/34, tx: 487, lsn: 0/01638EB8, prev 0/01638E40, desc: DROP dir pg_tblspc/16384/PG_12_201904281/16386
> > >
> > > WAL records don't convey such information. The previous
> > > description seems right to me.
> >
> > 2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for Database/CREATE: copy dir 1663/1 to 65546/65547
> >
> > The directories are definitely wrong and misleading.
>
> The original description is right in the light of how the server
> recognizes. The record exactly says that "copy dir 1663/1 to
> 65546/65547" and the latter path is converted in filesystem layer
> via a symlink.

In either $PG_DATA/pg_tblspc or the symlinked real tablespace directory, there is an additional directory like PG_12_201905221 between the tablespace oid and the database oid. See the directory layout below; the directory info in the xlog dump output was therefore not correct.

$ ls -lh data/pg_tblspc/
total 0
lrwxrwxrwx. 1 gpadmin gpadmin 6 May 27 17:23 16384 -> /tmp/2

$ ls -lh /tmp/2
total 0
drwx------. 3 gpadmin gpadmin 18 May 27 17:24 PG_12_201905221

> > > > > Also I'd suggest we should use pg_mkdir_p() in TablespaceCreateDbspace()
> > > > > to replace the code block includes a lot of
> > > > > get_parent_directory(), MakePGDirectory(), etc even it
> > > > > is not fixing a bug since pg_mkdir_p() code change seems to be more
> > > > > graceful and simpler.
> > >
> > > But I don't agree to this. pg_mkdir_p goes above two-parents up,
> > > which would be unwanted here.
> >
> > I do not understand this also. pg_mkdir_p() is similar to 'mkdir -p'.
> > This change just makes the code concise. Though in theory the change is
> > not needed.
>
> We don't want to create tablespace directory after concurrent
> DROPing, as the comment just above is saying:
>
> | * Acquire TablespaceCreateLock to ensure that no DROP TABLESPACE
> | * or TablespaceCreateDbspace is running concurrently.
>
> If the concurrent DROP TABLESPACE destroyed the grand parent
> directory, we mustn't create it again.

Yes, this is a good reason to keep the original code. Thanks.

By the way, based on your previous test patch, I added another test which can easily detect the missing-src-directory case.
Re: undefined symbol: PQgssEncInUse
Have you used the correct libpq library? If yes, you might want to check the build logs and related files to see what is wrong. In my environment it is OK with gssapi either enabled or disabled.

On Thu, May 30, 2019 at 9:22 AM Donald Dong wrote:
> Hi,
>
> After I make temp-install on HEAD with a clean build, I fail to start
> psql (tmp_install/usr/local/pgsql/bin/psql) and get an error message:
>
> ./psql: symbol lookup error: ./psql: undefined symbol: PQgssEncInUse
>
> However, make check and other tests still work. For me, it is fine
> until commit b0b39f72b9904bcb80f97b35837ccff1578aa4b8. I wonder if
> this only occurs to me?
>
> Thank you,
> Donald Dong
A WalSnd issue related to state WALSNDSTATE_STOPPING
Hello,

Recently I encountered a panic (assert failure) on a cassert-enabled build of Greenplum database (an MPP database based on Postgres). The underlying PostgreSQL version is 9.4 stable (9.4.19). The panic is not complex; please see below for details. This issue seems to be a bug in PostgreSQL as well, so I'm writing to the community.

(gdb) bt
#0 0x7fb8c56321f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x7fb8c56338e8 in __GI_abort () at abort.c:90
#2 0x00ae3fd2 in ExceptionalCondition (conditionName=0xdcec75 "!(syncWalSnd)", errorType=0xdce77a "FailedAssertion", fileName=0xdce770 "syncrep.c", lineNumber=505) at assert.c:66
#3 0x00924dbc in SyncRepReleaseWaiters () at syncrep.c:505
#4 0x00915e5e in ProcessStandbyReplyMessage () at walsender.c:1667
#5 0x00915bea in ProcessStandbyMessage () at walsender.c:1566
#6 0x00915ab5 in ProcessRepliesIfAny () at walsender.c:1502
#7 0x00916303 in WalSndLoop (send_data=0x916d46 ) at walsender.c:1923
#8 0x0091439b in StartReplication (cmd=0x15abea8) at walsender.c:704
#9 0x0091586e in exec_replication_command (cmd_string=0x161fa78 "START_REPLICATION 1/8400 TIMELINE 6") at walsender.c:1416
#10 0x00979915 in PostgresMain (argc=1, argv=0x15bafd8, dbname=0x15baec8 "", username=0x15baea8 "gpadmin") at postgres.c:5296

(gdb) p max_wal_senders
$1 = 1
(gdb) p WalSndCtl->walsnds[0]
$2 = {
  pid = 2765,
  state = WALSNDSTATE_STOPPING,
  sentPtr = 6509560784,
  sendKeepalive = 0 '\000',
  needreload = 0 '\000',
  write = 6509560784,
  flush = 6509560664,
  apply = 6509560664,
  caughtup_within_range = 1 '\001',
  xlogCleanUpTo = 6509560664,
  mutex = 0 '\000',
  latch = {
    is_set = 1,
    is_shared = 1 '\001',
    owner_pid = 2765
  },
  sync_standby_priority = 1,
  synchronous = 0 '\000',
  replica_disconnected_at = 0
}

Below is the related code:

	for (i = 0; i < max_wal_senders; i++)
	{
		/* use volatile pointer to prevent code rearrangement */
		volatile WalSnd *walsnd = &walsndctl->walsnds[i];

		if (walsnd->pid != 0 &&
			walsnd->state == WALSNDSTATE_STREAMING &&
			walsnd->sync_standby_priority > 0 &&
			(priority == 0 ||
			 priority > walsnd->sync_standby_priority) &&
			!XLogRecPtrIsInvalid(walsnd->flush))
		{
			priority = walsnd->sync_standby_priority;
			syncWalSnd = walsnd;
		}
	}

	/*
	 * We should have found ourselves at least.
	 */
	Assert(syncWalSnd);

The reason for the panic: there is just one WAL sender, and WalSndCtl->walsnds[0].state is WALSNDSTATE_STOPPING, so syncWalSnd ends up NULL, which trips the assertion.

The latest PostgreSQL code removes the Assert, but similar logic was kept. It looks like we need to modify the logic as below (PG 9.4 STABLE) to allow WALSNDSTATE_STOPPING, which seems reasonable here if I understand correctly:

	if (walsnd->pid != 0 &&
-		walsnd->state == WALSNDSTATE_STREAMING &&
+		walsnd->state >= WALSNDSTATE_STREAMING &&
		walsnd->sync_standby_priority > 0 &&
		(priority == 0 ||
		 priority > walsnd->sync_standby_priority) &&

For the latest master, the logic is as below, but it could be modified similarly (see the sketch after this message):

	/* Must be streaming */
	if (state != WALSNDSTATE_STREAMING)
		continue;

If yes, this change should go to all branches that have the patch which introduced WALSNDSTATE_STOPPING.

Thanks.
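For illustration, a minimal sketch of the corresponding master-branch change, assuming the stopping state should be treated like streaming here:

	/* Sketch: also accept WAL senders that have advanced to the stopping
	 * state, since they may still need to release syncrep waiters. */
	if (state != WALSNDSTATE_STREAMING &&
		state != WALSNDSTATE_STOPPING)
		continue;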
Re: A WalSnd issue related to state WALSNDSTATE_STOPPING
On Thu, Nov 22, 2018 at 1:29 PM Michael Paquier wrote: > On Wed, Nov 21, 2018 at 04:09:41PM +0900, Michael Paquier wrote: > > The checkpointer initializes a shutdown checkpoint where it tells to all > > the WAL senders to stop once all the children processes are gone, so it > > seems to me that there is little point in processing > > SyncRepReleaseWaiters() when a WAL sender is in WALSNDSTATE_STOPPING > > state as at this stage syncrep makes little sense. It is still > > necessary to process standby messages at this stage so as the primary > > can shut down when it is sure that all the standbys have flushed the > > shutdown checkpoint record of the primary. > > Just refreshed my memory with c6c33343, which is actually at the origin > of the issue, and my previous argument is flawed. If a WAL sender has > reached WALSNDSTATE_STOPPING no regular backends are around but a WAL > sender could always commit a transaction in parallel which may need to > make sure that its record is flushed and sync'ed, and this needs to make > sure that waiters are correctly released. So it is necessary to patch > up SyncRepGetSyncStandbysPriority and SyncRepGetSyncStandbysQuorum as > mentioned upthread, perhaps adding a comment when looking at > MyWalSnd->state looks adapted. Paul, would you like to write a patch? > -- > Michael > Yes, please see the attached patch. 0001-Allow-stopping-wal-senders-to-be-invovled-in-SyncRep.patch Description: Binary data
Re: A WalSnd issue related to state WALSNDSTATE_STOPPING
> > On Thu, Nov 22, 2018 at 06:31:04PM +0800, Paul Guo wrote: > > Yes, please see the attached patch. > > Thanks, I have reviewed your patch, and could not resist to change > SyncRepReleaseWaiters() on top of the rest with a comment to not be > confused about the WAL sender states allowed to release waiters. > > The previous experience is proving that it seems unwise to rely on the > order of the elements in WalSndState, so I have tweaked things so as the > state is listed for all the code paths involved. > > Paul, what do you think? > -- > Michael > Agree. Your code change is safer.
Two small issues related to table_relation_copy_for_cluster() and CTAS with no data.
Hello hackers,

While reading the latest master branch code, I found a couple of things that we may be able to improve.

1. The AM interface table_relation_copy_for_cluster().

    static inline void
    table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
                                    Relation OldIndex,
                                    bool use_sort,
                                    TransactionId OldestXmin,
                                    TransactionId *xid_cutoff,
                                    MultiXactId *multi_cutoff,
                                    double *num_tuples,
                                    double *tups_vacuumed,
                                    double *tups_recently_dead)

- A line for the parameter num_tuples should be added below "Output parameters" in the comment.
- Looking at the caller code, i.e. copy_table_data(), it initializes *num_tuples, *tups_vacuumed and *tups_recently_dead first. This does not seem to be good API design or implementation. We'd better let the AM API return the values without requiring initialization by the callers, right? (A sketch of this idea follows at the end of this message.)

2. CTAS (CREATE TABLE AS) with no data.

It seems that we never run into intorel_receive() in that case, while intorel_startup() can still be reached for "create table t1 as execute ... with no data". So it looks like we do not need to check into->skipData in intorel_receive(). If we really want to check into->skipData, we could add an assert check there, or, if I missed some code paths through which we could still reach that branch, we could instead call the code below in intorel_receive() to stop early, right?

    if (myState->into->skipData)
        return false;

Regards,
Paul
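For illustration, here is a minimal sketch of the first suggestion, assuming the change lands in the heap AM implementation (heapam_relation_copy_for_cluster() in heapam_handler.c); a sketch only, not the actual patch:

    static void
    heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
                                     Relation OldIndex, bool use_sort,
                                     TransactionId OldestXmin,
                                     TransactionId *xid_cutoff,
                                     MultiXactId *multi_cutoff,
                                     double *num_tuples,
                                     double *tups_vacuumed,
                                     double *tups_recently_dead)
    {
        /*
         * Own the output parameters here so that callers such as
         * copy_table_data() no longer need to zero them before calling in.
         */
        *num_tuples = 0;
        *tups_vacuumed = 0;
        *tups_recently_dead = 0;

        /* ... existing rewrite/sort logic unchanged ... */
    }

The corresponding initializations in copy_table_data() could then be removed, keeping the contract in one place.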
pg_rewind fails if there is a read only file.
Several weeks ago I saw this issue in a production environment. The read-only file looked like a credential file. Michael told me that usually such kinds of files are better kept in non-pgdata directories in production environments. Thinking about it further, it seems that pg_rewind should be more user friendly and tolerate such scenarios.

The failure error is "Permission denied" after open(). The reason is that open() fails with the below mode in open_target_file():

    mode = O_WRONLY | O_CREAT | PG_BINARY;
    if (trunc)
        mode |= O_TRUNC;
    dstfd = open(dstpath, mode, pg_file_create_mode);

The fix should be quite simple: if open() fails with "Permission denied", we call chmod() to add S_IWUSR just before open(), and call chmod() again to restore the flags right after open(). A stat() call to get the previous st_mode flags is needed.

Any other suggestions or thoughts?

Thanks,

-- Paul Guo (Vmware)
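For illustration, a minimal sketch of that idea (a sketch only, assuming we are inside open_target_file() with dstpath and mode already computed; error handling elided):

    dstfd = open(dstpath, mode, pg_file_create_mode);
    if (dstfd < 0 && errno == EACCES)
    {
        struct stat st;

        /* Remember the original mode and temporarily add owner write. */
        if (stat(dstpath, &st) == 0)
        {
            chmod(dstpath, st.st_mode | S_IWUSR);
            dstfd = open(dstpath, mode, pg_file_create_mode);
            /* Restore the original flags once the fd is open. */
            chmod(dstpath, st.st_mode);
        }
    }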
Re: pg_rewind fails if there is a read only file.
> Presumably the user has a reason for adding the file read-only to the > data directory, and we shouldn't lightly ignore that. > > Michael's advice is reasonable. This seems like a case of: I agree. Attached is a short patch to handle the case. The patch was tested in my dev environment. v1-0001-Fix-pg_rewind-failure-due-to-read-only-file-open-.patch Description: Binary data
sync request forward function ForwardSyncRequest() might hang for some time in a corner case?
Hi hackers,

I found this while reading the related code. Here is the scenario:

    bool
    RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
                        bool retryOnError)

For the case where retryOnError is true, the function calls ForwardSyncRequest() in a loop until it succeeds. But in ForwardSyncRequest(), if we run into the below branch, RegisterSyncRequest() will need to loop until the checkpointer absorbs the existing requests, so ForwardSyncRequest() might hang for some time until a checkpoint request is triggered. This scenario seems to be possible in theory, though the chance is not high.

ForwardSyncRequest():

    if (CheckpointerShmem->checkpointer_pid == 0 ||
        (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
         !CompactCheckpointerRequestQueue()))
    {
        /*
         * Count the subset of writes where backends have to do their own
         * fsync
         */
        if (!AmBackgroundWriterProcess())
            CheckpointerShmem->num_backend_fsync++;
        LWLockRelease(CheckpointerCommLock);
        return false;
    }

One fix is to add code similar to the below in RegisterSyncRequest(), to trigger a checkpoint for this scenario.

    // checkpointer_triggered: variable for one trigger only.
    if (!ret && retryOnError && ProcGlobal->checkpointerLatch &&
        !checkpointer_triggered)
        SetLatch(ProcGlobal->checkpointerLatch);

Any comments?

Regards,
Paul Guo (Vmware)
Re: pg_rewind fails if there is a read only file.
> You seem to have missed my point completely. The answer to this problem
> IMNSHO is "Don't put a read-only file in the data directory".

Oh, sorry. Well, if we really do not want this, we may want to document it and keep educating users. Meanwhile, though, the product should probably be more user friendly for this case, especially considering that the fix would be trivial, and I suspect it is inevitable that some extensions put read-only files (e.g. credential files) in pgdata.
Re: pg_rewind fails if there is a read only file.
On Wed, May 26, 2021 at 10:32 PM Andrew Dunstan wrote:
>
> On 5/26/21 2:43 AM, Laurenz Albe wrote:
> > On Wed, 2021-05-26 at 08:57 +0900, Michael Paquier wrote:
> >> On Tue, May 25, 2021 at 03:17:37PM -0400, Andrew Dunstan wrote:
> >>> If we do decide to do something the question arises what should it do?
> >>> If we're to allow for it I'm wondering if the best thing would be simply
> >>> to ignore such a file.
> >> Enforcing assumptions that any file could be read-only is a very bad
> >> idea, as that could lead to weird behaviors if a FS is turned as
> >> becoming read-only suddenly while doing a rewind. Another idea that
> >> has popped out across the years was to add an option to pg_rewind so
> >> as users could filter files manually. That could be easily dangerous
> >> though in the wrong hands, as one could think that it is a good idea
> >> to skip a control file, for example.
> >>
> >> The thing is that here we actually know the set of files we'd like to
> >> ignore most of the time, and we still want to have some automated
> >> control what gets filtered. So here is a new idea: we build a list of
> >> files based on a set of GUC parameters using postgres -C on the target
> >> data folder, and assume that these are safe enough to be skipped all
> >> the time, if these are in the data folder.
> > That sounds complicated, but should work.
> > There should be a code comment somewhere that warns people not to forget
> > to look at that when they add a new GUC.
> >
> > I can think of two alternatives to handle this:
> >
> > - Skip files that cannot be opened for writing and issue a warning.
> >   That is simple, but coarse.
> >   A slightly more sophisticated version would first check if files
> >   are the same on both machines and skip the warning for those.
> >
> > - Paul's idea to try and change the mode on the read-only file
> >   and reset it to the original state after pg_rewind is done.
>
> How about we only skip (with a warning) read-only files if they are in
> the data root, not a subdirectory? That would save us from silently

That assumption seems to limit the user scenarios.

> ignoring read-only filesystems and not involve using a GUC.

How about this: by default, change and reset the file mode before and after open() for read-only files, but also allow passing skip-file names to pg_rewind (e.g. pg_rewind --skip filenameN or --skip-list-file listfile) in case users really want to skip some files (probably not just read-only files). Presumably the people who run pg_rewind are DBAs or admins who have basic knowledge of this, so we should not worry too much about users skipping important files (we could even set up a deny list for this). Also, this solution works fine on a read-only file system, since pg_rewind quits quickly once adding write permission to any read-only file fails.

-- Paul Guo (Vmware)
Re: sync request forward function ForwardSyncRequest() might hang for some time in a corner case?
On Thu, May 27, 2021 at 7:11 PM Julien Rouhaud wrote:
>
> On Tue, May 25, 2021 at 4:39 PM Paul Guo wrote:
> >
> > Hi hackers,
> >
> > I found this while reading the related code. Here is the scenario:
> >
> > bool
> > RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
> >                     bool retryOnError)
> >
> > For the case where retryOnError is true, the function calls
> > ForwardSyncRequest() in a loop until it succeeds, but in
> > ForwardSyncRequest(), if we run into the below branch,
> > RegisterSyncRequest() will need to loop until the checkpointer absorbs
> > the existing requests, so ForwardSyncRequest() might hang for some time
> > until a checkpoint request is triggered. This scenario seems to be
> > possible in theory though the chance is not high.
>
> It seems like a really unlikely scenario, but maybe possible if you
> use a lot of unlogged tables maybe (as you could eventually
> dirty/evict more than NBuffers buffers without triggering enough WALs
> activity to trigger a checkpoint with any sane checkpoint
> configuration).

RegisterSyncRequest() handles the SYNC_UNLINK_REQUEST and SYNC_FORGET_REQUEST scenarios, besides the usual SYNC_REQUEST type for buffer sync.

> > ForwardSyncRequest():
> >
> >     if (CheckpointerShmem->checkpointer_pid == 0 ||
> >         (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
> >          !CompactCheckpointerRequestQueue()))
> >     {
> >         /*
> >          * Count the subset of writes where backends have to do their own
> >          * fsync
> >          */
> >         if (!AmBackgroundWriterProcess())
> >             CheckpointerShmem->num_backend_fsync++;
> >         LWLockRelease(CheckpointerCommLock);
> >         return false;
> >     }
> >
> > One fix is to add code similar to the below in RegisterSyncRequest(),
> > to trigger a checkpoint for this scenario.
> >
> >     // checkpointer_triggered: variable for one trigger only.
> >     if (!ret && retryOnError && ProcGlobal->checkpointerLatch &&
> >         !checkpointer_triggered)
> >         SetLatch(ProcGlobal->checkpointerLatch);
> >
> > Any comments?
>
> It looks like you intended to set the checkpointer_triggered var but

Yes, this is just pseudocode.

> didn't. Also this will wake up the checkpointer but won't force a
> checkpoint (unlike RequestCheckpoint()). It may be a good thing

I do not expect an immediate checkpoint. AbsorbSyncRequests() is enough, since after that RegisterSyncRequest() can finish.

> though as it would only absorb the requests and go back to sleep if no
> other threshold is reached. Apart from the implementation details it
> seems like it could help in this unlikely event.

-- Paul Guo (Vmware)
Re: sync request forward function ForwardSyncRequest() might hang for some time in a corner case?
On Thu, May 27, 2021 at 9:59 PM Paul Guo wrote:
>
> On Thu, May 27, 2021 at 7:11 PM Julien Rouhaud wrote:
> >
> > On Tue, May 25, 2021 at 4:39 PM Paul Guo wrote:
> > >
> > > Hi hackers,
> > >
> > > I found this while reading the related code. Here is the scenario:
> > >
> > > bool
> > > RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
> > >                     bool retryOnError)
> > >
> > > For the case where retryOnError is true, the function calls
> > > ForwardSyncRequest() in a loop until it succeeds, but in
> > > ForwardSyncRequest(), if we run into the below branch,
> > > RegisterSyncRequest() will need to loop until the checkpointer absorbs
> > > the existing requests, so ForwardSyncRequest() might hang for some time
> > > until a checkpoint request is triggered. This scenario seems to be
> > > possible in theory though the chance is not high.
> >
> > It seems like a really unlikely scenario, but maybe possible if you
> > use a lot of unlogged tables maybe (as you could eventually
> > dirty/evict more than NBuffers buffers without triggering enough WALs
> > activity to trigger a checkpoint with any sane checkpoint
> > configuration).
>
> RegisterSyncRequest() handles the SYNC_UNLINK_REQUEST and
> SYNC_FORGET_REQUEST scenarios, besides the usual SYNC_REQUEST type for
> buffer sync.
>
> > > ForwardSyncRequest():
> > >
> > >     if (CheckpointerShmem->checkpointer_pid == 0 ||
> > >         (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
> > >          !CompactCheckpointerRequestQueue()))
> > >     {
> > >         /*
> > >          * Count the subset of writes where backends have to do their own
> > >          * fsync
> > >          */
> > >         if (!AmBackgroundWriterProcess())
> > >             CheckpointerShmem->num_backend_fsync++;
> > >         LWLockRelease(CheckpointerCommLock);
> > >         return false;
> > >     }
> > >
> > > One fix is to add code similar to the below in RegisterSyncRequest(),
> > > to trigger a checkpoint for this scenario.
> > >
> > >     // checkpointer_triggered: variable for one trigger only.
> > >     if (!ret && retryOnError && ProcGlobal->checkpointerLatch &&
> > >         !checkpointer_triggered)
> > >         SetLatch(ProcGlobal->checkpointerLatch);
> > >
> > > Any comments?
> >
> > It looks like you intended to set the checkpointer_triggered var but
>
> Yes, this is just pseudocode.
>
> > didn't. Also this will wake up the checkpointer but won't force a
> > checkpoint (unlike RequestCheckpoint()). It may be a good thing
>
> I do not expect an immediate checkpoint. AbsorbSyncRequests()
> is enough, since after that RegisterSyncRequest() can finish.
>
> > though as it would only absorb the requests and go back to sleep if no
> > other threshold is reached. Apart from the implementation details it
> > seems like it could help in this unlikely event.

Also note that ForwardSyncRequest() does wake up the checkpointer if it thinks the request queue in shared memory is "too full", but does not wake it up when the queue is actually full. This does not seem reasonable. See the code below in ForwardSyncRequest():

    /* If queue is more than half full, nudge the checkpointer to empty it */
    too_full = (CheckpointerShmem->num_requests >=
                CheckpointerShmem->max_requests / 2);

    /* ... but not till after we release the lock */
    if (too_full && ProcGlobal->checkpointerLatch)
        SetLatch(ProcGlobal->checkpointerLatch);

-- Paul Guo (Vmware)
Re: sync request forward function ForwardSyncRequest() might hang for some time in a corner case?
> You said "trigger a checkpoint", which sounded more like forcing a > checkpointer rather than waking up the checkpointer so that it can > absorb the pending requests, so it seems worth to mention what it > would really do. Yes it is not accurate. Thanks for the clarification. -- Paul Guo (Vmware)
Re: sync request forward function ForwardSyncRequest() might hang for some time in a corner case?
On Thu, May 27, 2021 at 10:22 PM Julien Rouhaud wrote:
>
> On Thu, May 27, 2021 at 10:05 PM Paul Guo wrote:
> >
> > Also note that ForwardSyncRequest() does wake up the checkpointer if
> > it thinks the request queue in shared memory is "too full", but does
> > not wake it up when the queue is actually full. This does not seem
> > reasonable. See the code below in ForwardSyncRequest():
> >
> >     /* If queue is more than half full, nudge the checkpointer to empty it */
> >     too_full = (CheckpointerShmem->num_requests >=
> >                 CheckpointerShmem->max_requests / 2);
> >
> >     /* ... but not till after we release the lock */
> >     if (too_full && ProcGlobal->checkpointerLatch)
> >         SetLatch(ProcGlobal->checkpointerLatch);
>
> Ah indeed. Well it means that the checkpointer is woken up early
> enough to avoid reaching that point. I'm not sure that it's actually
> possible to reach a point where the list is full and the checkpointer
> is sitting idle.

In theory this is possible (when the system is under heavy parallel writes); otherwise we could remove that part of the code (CompactCheckpointerRequestQueue()) :-), though the chance is not high. If we do encounter this issue, the affected queries would suddenly hang until the next checkpointer wakeup.
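To summarize the fix discussed in this thread, here is a minimal sketch against RegisterSyncRequest() in src/backend/storage/sync/sync.c (a sketch only, not a tested patch; checkpointer_nudged is a hypothetical variable, and the local pendingOps fast path is elided unchanged):

    bool
    RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
                        bool retryOnError)
    {
        bool    ret;
        bool    checkpointer_nudged = false;

        for (;;)
        {
            ret = ForwardSyncRequest(ftag, type);

            /* Done, or told not to retry on failure. */
            if (ret || !retryOnError)
                break;

            /*
             * The request queue is full and could not be compacted.  Nudge
             * the checkpointer's latch once so it absorbs the queue, instead
             * of waiting for the next checkpoint trigger.
             */
            if (!checkpointer_nudged && ProcGlobal->checkpointerLatch)
            {
                SetLatch(ProcGlobal->checkpointerLatch);
                checkpointer_nudged = true;
            }

            pg_usleep(10000L);      /* then retry, as the existing loop does */
        }

        return ret;
    }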
Re: Two patches to speed up pg_rewind.
> On 2021/2/19, 10:33 AM, "Paul Guo" wrote: > Refactored the code a bit along with fixes. Manually tested them on centos > & Ubuntu (the later has copy_file_range()) > For the first patch, actually I have some concerns. My assumption is that > the target pg_data directory should be fsync-ed already. This should be > correct normally but there is one scenario: a cleanly-shutdown database’s > pgdata directory was copied to another directory, in this case the new pgdata > is not fsync-ed - I’m not sure if that exists in real production environment > or not, > but even considering this we could still use the optimization for the case > that > calls ensureCleanShutdown() since this ensures a pgdata fsync on the target. Did some small modification and rebased the code. See attached for the new version. v3-0001-Fsync-the-affected-files-directories-only-in-pg_r.patch Description: v3-0001-Fsync-the-affected-files-directories-only-in-pg_r.patch v3-0002-Use-copy_file_range-for-file-copying-in-pg_rewind.patch Description: v3-0002-Use-copy_file_range-for-file-copying-in-pg_rewind.patch
Should wal receiver reply to wal sender more aggressively?
While working on some related issues I found that the wal receiver runs the walrcv_receive() loop before replying with the write/flush/apply LSNs to the wal sender in XLogWalRcvSendReply(). It is possible that the walrcv_receive() loop receives and writes a lot of xlog, so it does not reply with that LSN information in time, which finally slows down those transactions due to syncrep wait (assuming the default synchronous_commit).

In my TPCB testing, I found the worst case was that 10,466,469 bytes were consumed in the walrcv_receive() loop. More seriously, we call XLogWalRcvSendReply(false, false) after handling those bytes; the first argument false means no force, i.e. it does not notify the wal sender unless the wal_receiver_status_interval GUC value (10s by default) has elapsed, so we may have to wait for other calls of XLogWalRcvSendReply() to notify the wal sender.

I thought about and tried enhancing this by force-replying to the wal sender each time a maximum number of bytes (e.g. 128K) is received, but several things confused me:

- What's the purpose of the wal_receiver_status_interval GUC? The OS kernel is usually not efficient when handling small packets, but we are not replying that aggressively, so why is this GUC there?
- I ran a simple TPCB test (1000 scaling, 200 connections, shared_buffers and max_connections tuned) but found no obvious performance difference with and without the code change. I did not see an obvious system (IO/CPU/network) bottleneck - probably the bottleneck is in PG itself. I did not investigate further at this moment, but the change should in theory help, no?

Another thing that came to my mind is the wal receiver logic: currently the wal receiver process does network IO, wal write and wal flush in one process. Network IO is async, blocking at epoll/poll; wal write is mostly non-blocking; but for wal flush, we could probably decouple it into a dedicated process. And maybe use sync_file_range() instead of a wal file fsync in issue_xlog_fsync()? We should sync the wal contents with lower LSNs first and reply to the wal sender in time, right?

Below is the related code:

    /* See if we can read data immediately */
    len = walrcv_receive(wrconn, &buf, &wait_fd);
    if (len != 0)
    {
        /*
         * Process the received data, and any subsequent data we
         * can read without blocking.
         */
        for (;;)
        {
            if (len > 0)
            {
                /*
                 * Something was received from primary, so reset
                 * timeout
                 */
                last_recv_timestamp = GetCurrentTimestamp();
                ping_sent = false;
                XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
            }
            else if (len == 0)
                break;
            else if (len < 0)
            {
                ereport(LOG,
                        (errmsg("replication terminated by primary server"),
                         errdetail("End of WAL reached on timeline %u at %X/%X.",
                                   startpointTLI,
                                   LSN_FORMAT_ARGS(LogstreamResult.Write))));
                endofwal = true;
                break;
            }
            len = walrcv_receive(wrconn, &buf, &wait_fd);
        }

        /* Let the primary know that we received some data. */
        XLogWalRcvSendReply(false, false);

        /*
         * If we've written some records, flush them to disk and
         * let the startup process and primary server know about
         * them.
         */
        XLogWalRcvFlush(false);

-- Paul Guo (Vmware)
Re: Should wal receiver reply to wal sender more aggressively?
[Resending the mail since I found my previous email had a very bad format that was hard to read.]

While working on some related issues I found that the wal receiver runs the walrcv_receive() loop before replying with the write/flush/apply LSNs to the wal sender in XLogWalRcvSendReply(). It is possible that the walrcv_receive() loop receives and writes a lot of xlog, so it does not reply with that LSN information in time, which finally slows down the transactions due to syncrep wait (assuming the default synchronous_commit).

During TPCB testing, I found the worst case was that 10,466,469 bytes were consumed in the walrcv_receive() loop. More seriously, we call XLogWalRcvSendReply(false, false) after handling those bytes; the first argument false means no force, i.e. it does not notify the wal sender unless the wal_receiver_status_interval GUC value (10s by default) has elapsed, so we may have to wait for other calls of XLogWalRcvSendReply() to notify the wal sender.

I thought about and tried enhancing this by force-flushing-and-replying each time a maximum number of bytes (e.g. 128K) is received (see the sketch after this message), but several things confused me:

- What's the purpose of the wal_receiver_status_interval GUC? The OS kernel is usually not efficient when handling small packets, but we are not replying that aggressively, so why is this GUC there?
- I ran a simple TPCB test (1000 scaling, 200 connections, shared_buffers and max_connections tuned) but found no obvious performance difference with and without the code change. I did not see an obvious system (IO/CPU/network) bottleneck - probably the bottleneck is in PG itself? I did not investigate further at this moment, but the change should in theory help, right? I may continue investigating, but probably won't do so unless I have some clear answers to these confusions.

Another thing that came to my mind is the wal receiver logic: currently the wal receiver process does network IO, wal write and wal flush in one process. Network IO is async, blocking at epoll/poll, etc.; wal write is mostly non-blocking; but for wal flush, could we probably decouple it into a dedicated process? And maybe use sync_file_range() instead of a wal file fsync in issue_xlog_fsync()? We should sync the wal contents with lower LSNs first and reply to the wal sender in time, right?

Below is the related code:

    /* See if we can read data immediately */
    len = walrcv_receive(wrconn, &buf, &wait_fd);
    if (len != 0)
    {
        /*
         * Process the received data, and any subsequent data we
         * can read without blocking.
         */
        for (;;)
        {
            if (len > 0)
            {
                /*
                 * Something was received from primary, so reset
                 * timeout
                 */
                last_recv_timestamp = GetCurrentTimestamp();
                ping_sent = false;
                XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
            }
            else if (len == 0)
                break;
            else if (len < 0)
            {
                ereport(LOG,
                        (errmsg("replication terminated by primary server"),
                         errdetail("End of WAL reached on timeline %u at %X/%X.",
                                   startpointTLI,
                                   LSN_FORMAT_ARGS(LogstreamResult.Write))));
                endofwal = true;
                break;
            }
            len = walrcv_receive(wrconn, &buf, &wait_fd);
        }

        /* Let the primary know that we received some data. */
        XLogWalRcvSendReply(false, false);

        /*
         * If we've written some records, flush them to disk and
         * let the startup process and primary server know about
         * them.
         */
        XLogWalRcvFlush(false);
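For illustration, a minimal sketch of the force-reply idea against the receive loop above (a sketch only; WALRCV_REPLY_THRESHOLD and bytes_since_reply are hypothetical names, and 128K is just the example threshold mentioned above):

    /* In the walreceiver, near the loop shown above. */
    #define WALRCV_REPLY_THRESHOLD  (128 * 1024)    /* hypothetical */

        static ssize_t  bytes_since_reply = 0;      /* hypothetical */

        if (len > 0)
        {
            last_recv_timestamp = GetCurrentTimestamp();
            ping_sent = false;
            XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);

            /*
             * Force a flush and a status reply once enough bytes have been
             * consumed, instead of waiting for the receive loop to drain.
             */
            bytes_since_reply += len;
            if (bytes_since_reply >= WALRCV_REPLY_THRESHOLD)
            {
                XLogWalRcvFlush(false);             /* flush what we wrote */
                XLogWalRcvSendReply(true, false);   /* force = true */
                bytes_since_reply = 0;
            }
        }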
Re: Two patches to speed up pg_rewind.
No worries, I'm working on this.

On 2021/6/17, 3:18 PM, "Michael Paquier" wrote:

    On Wed, Jun 02, 2021 at 05:02:10PM +0900, Michael Paquier wrote:
    > On Wed, Jun 02, 2021 at 06:20:30PM +1200, Thomas Munro wrote:
    > > The main thing I noticed was that Linux < 5.3 can fail with EXDEV if
    > > you cross a filesystem boundary, is that something we need to worry
    > > about there?
    >
    > Hmm. Good point. That may justify having a switch to control that.

    Paul, the patch set still needs some work, so I am switching it as
    waiting on author. I am pretty sure that we had better have a
    fallback implementation of copy_file_range() in src/port/, and that we
    are going to need an extra switch in pg_rewind to allow users to
    bypass copy_file_range()/EXDEV if they do a local rewind operation
    across different FSes with a kernel < 5.3.
    --
    Michael
Re: Two patches to speed up pg_rewind.
On Thu, Jun 17, 2021 at 3:19 PM Michael Paquier wrote:
>
> On Wed, Jun 02, 2021 at 05:02:10PM +0900, Michael Paquier wrote:
> > On Wed, Jun 02, 2021 at 06:20:30PM +1200, Thomas Munro wrote:
> > > The main thing I noticed was that Linux < 5.3 can fail with EXDEV if
> > > you cross a filesystem boundary, is that something we need to worry
> > > about there?
> >
> > Hmm. Good point. That may justify having a switch to control that.
>
> Paul, the patch set still needs some work, so I am switching it as
> waiting on author. I am pretty sure that we had better have a
> fallback implementation of copy_file_range() in src/port/, and that we
> are going to need an extra switch in pg_rewind to allow users to
> bypass copy_file_range()/EXDEV if they do a local rewind operation
> across different FSes with a kernel < 5.3.
> --

I modified the copy_file_range() patch yesterday by simply falling back to read()+write(), but I think it could be improved further.

We may add a function to determine whether two files/paths are copy_file_range()-capable (using the POSIX-standard statvfs() f_fsid?) - that could be used by other copy_file_range() users, although in the long run the function would not be needed. And even with this we may still need the fallback code. (A sketch of such a check follows below.)

- For pg_rewind, we may just determine that capability once on the src/dst pgdata, but since there might be soft links (tablespace/wal) in pgdata, we should still allow fallback for copying those files that are not copy_file_range()-capable.
- Also, it seems that sometimes copy_file_range() can return ENOTSUP/EOPNOTSUPP (the file system does not support it and the kernel does not fall back to simple copying?), although this is not documented and seems unusual.

Any ideas?
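For illustration, a minimal sketch of such a capability check (a hypothetical helper; note that the semantics of f_fsid are not fully portable across platforms, so this would need verification):

    #include <sys/statvfs.h>
    #include <stdbool.h>

    /*
     * Return true if both paths appear to live on the same filesystem, in
     * which case copy_file_range() should not fail with EXDEV even on
     * Linux < 5.3.  Be conservative on any error.
     */
    static bool
    same_filesystem(const char *path1, const char *path2)
    {
        struct statvfs sv1;
        struct statvfs sv2;

        if (statvfs(path1, &sv1) != 0 || statvfs(path2, &sv2) != 0)
            return false;

        return sv1.f_fsid == sv2.f_fsid;
    }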
A cost issue in ORDER BY + LIMIT
Hello,

Postgres seems to always optimize ORDER BY + LIMIT as a top-k sort. Recently I happened to notice that in this scenario the output tuple count reported for the sort node is not the same as the LIMIT tuple count. See below:

postgres=# explain analyze verbose select * from t1 order by a limit 10;

                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=69446.17..69446.20 rows=10 width=4) (actual time=282.896..282.923 rows=10 loops=1)
   Output: a
   ->  Sort  (cost=69446.17..71925.83 rows=991862 width=4) (actual time=282.894..282.896 rows=10 loops=1)
         Output: a
         Sort Key: t1.a
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Seq Scan on public.t1  (cost=0.00..44649.62 rows=991862 width=4) (actual time=0.026..195.438 rows=1001000 loops=1)
               Output: a
 Planning Time: 0.553 ms
 Execution Time: 282.961 ms
(10 rows)

We can see from the output that the LIMIT cost is also wrong because of this, since it assumes the input_rows from the sort node is 991862 (per gdb debugging). This could easily be fixed by the patch below:

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index fb28e6411a..800cf0b256 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2429,7 +2429,11 @@ cost_sort(Path *path, PlannerInfo *root,
 
 	startup_cost += input_cost;
 
-	path->rows = tuples;
+	if (limit_tuples > 0 && limit_tuples < tuples)
+		path->rows = limit_tuples;
+	else
+		path->rows = tuples;
+
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }

With the patch, the explain output looks like this:

postgres=# explain analyze verbose select * from t1 order by a limit 10;

                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=69446.17..71925.83 rows=10 width=4) (actual time=256.204..256.207 rows=10 loops=1)
   Output: a
   ->  Sort  (cost=69446.17..71925.83 rows=10 width=4) (actual time=256.202..256.203 rows=10 loops=1)
         Output: a
         Sort Key: t1.a
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Seq Scan on public.t1  (cost=0.00..44649.62 rows=991862 width=4) (actual time=1.014..169.509 rows=1001000 loops=1)
               Output: a
 Planning Time: 0.076 ms
 Execution Time: 256.232 ms
(10 rows)

Regards,
Paul
Re: Two patches to speed up pg_rewind.
On Tue, Jun 22, 2021 at 11:08 AM Paul Guo wrote:
>
> On Thu, Jun 17, 2021 at 3:19 PM Michael Paquier wrote:
> >
> > On Wed, Jun 02, 2021 at 05:02:10PM +0900, Michael Paquier wrote:
> > > On Wed, Jun 02, 2021 at 06:20:30PM +1200, Thomas Munro wrote:
> > > > The main thing I noticed was that Linux < 5.3 can fail with EXDEV if
> > > > you cross a filesystem boundary, is that something we need to worry
> > > > about there?
> > >
> > > Hmm. Good point. That may justify having a switch to control that.
> >
> > Paul, the patch set still needs some work, so I am switching it as
> > waiting on author. I am pretty sure that we had better have a
> > fallback implementation of copy_file_range() in src/port/, and that we
> > are going to need an extra switch in pg_rewind to allow users to
> > bypass copy_file_range()/EXDEV if they do a local rewind operation
> > across different FSes with a kernel < 5.3.
> > --
>
> I modified the copy_file_range() patch yesterday by simply falling back
> to read()+write(), but I think it could be improved further.
>
> We may add a function to determine whether two files/paths are
> copy_file_range()-capable (using the POSIX-standard statvfs() f_fsid?) -
> that could be used by other copy_file_range() users, although in the
> long run the function would not be needed. And even with this we may
> still need the fallback code.
>
> - For pg_rewind, we may just determine that capability once on the
> src/dst pgdata, but since there might be soft links (tablespace/wal)
> in pgdata, we should still allow fallback for copying those files that
> are not copy_file_range()-capable.
> - Also, it seems that sometimes copy_file_range() can return
> ENOTSUP/EOPNOTSUPP (the file system does not support it and the kernel
> does not fall back to simple copying?), although this is not documented
> and seems unusual.
>
> Any ideas?

I modified the copy_file_range() patch using the logic below: if the first call of copy_file_range() fails with errno EXDEV or ENOTSUP, pg_rewind will not use copy_file_range() for the rest of the run, and whenever copy_file_range() fails we fall back to the previous read()+write() logic for that file. (A sketch of this logic follows.)

v4-0001-Fsync-the-affected-files-directories-only-in-pg_r.patch Description: Binary data

v4-0002-Use-copy_file_range-if-possible-for-file-copying-.patch Description: Binary data
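For illustration, a minimal sketch of that fallback logic (a sketch only; names are illustrative, and the real patch loops until the whole file is copied):

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <errno.h>
    #include <stdbool.h>

    /* Disabled for the rest of the run after the first EXDEV/ENOTSUP. */
    static bool copy_file_range_ok = true;

    /* Copy one file, falling back to read()+write() on failure. */
    static void
    copy_one_file(int srcfd, int dstfd, size_t size)
    {
        if (copy_file_range_ok)
        {
            ssize_t nbytes = copy_file_range(srcfd, NULL, dstfd, NULL, size, 0);

            if (nbytes >= 0)
                return;                         /* real code loops until done */

            if (errno == EXDEV || errno == ENOTSUP)
                copy_file_range_ok = false;     /* don't try again this run */
            /* fall through: copy this file with read()+write() */
        }

        /* ... plain read()+write() fallback ... */
    }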
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Rebased. v12-0002-Tests-to-replay-create-database-operation-on-sta.patch Description: v12-0002-Tests-to-replay-create-database-operation-on-sta.patch v12-0001-Support-node-initialization-from-backup-with-tab.patch Description: v12-0001-Support-node-initialization-from-backup-with-tab.patch v12-0003-Fix-replay-of-create-database-records-on-standby.patch Description: v12-0003-Fix-replay-of-create-database-records-on-standby.patch v12-0004-Fix-database-create-drop-wal-description.patch Description: v12-0004-Fix-database-create-drop-wal-description.patch
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
On Wed, Aug 11, 2021 at 4:56 AM Robert Haas wrote:
>
> On Thu, Aug 5, 2021 at 6:20 AM Paul Guo wrote:
> > Rebased.
>
> The commit message for 0001 is not clear enough for me to understand
> what problem it's supposed to be fixing. The code comments aren't
> really either. They make it sound like there's some problem with
> copying symlinks but mostly they just talk about callbacks, which
> doesn't really help me understand what problem we'd have if we just
> didn't commit this (or reverted it later).

Thanks for reviewing. Let me explain a bit. The patch series includes four patches; 0001 and 0002 are test changes for the fix (0003).

- 0001 is the test framework change that is needed by 0002.
- 0002 is the test for the code fix (0003).

0003 is the code change, and its commit message explains the issue in detail. 0004, as said, is a small enhancement that is somewhat independent of the previous patches.

Basically, the issue is that without the fix, tablespace-related crash recovery might fail. Here is the log after I run the tests in 0001/0002 without the 0003 fix:

2021-08-04 10:00:42.231 CST [875] FATAL: could not create directory "pg_tblspc/16385/PG_15_202107261/16390": No such file or directory
2021-08-04 10:00:42.231 CST [875] CONTEXT: WAL redo at 0/3001320 for Database/CREATE: copy dir base/1 to pg_tblspc/16385/PG_15_202107261/16390

> I am not really convinced by Álvaro's claim that 0004 is a "fix"; I
> think I'd call it an improvement. But either way I agree that could
> just be committed.
>
> I haven't analyzed 0002 and 0003 yet.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com

-- Paul Guo (Vmware)
Re: Two patches to speed up pg_rewind.
Thanks for reviewing; please see the replies below.

On Tue, Aug 17, 2021 at 3:47 PM Michael Paquier wrote:
>
> On Thu, Aug 05, 2021 at 06:18:03PM +0800, Paul Guo wrote:
> > I modified the copy_file_range() patch using the below logic:
> >
> > If the first call of copy_file_range() fails with errno EXDEV or
> > ENOTSUP, pg_rewind would not use copy_file_range() in the rest of the
> > run, and if copy_file_range() fails we fall back to the previous
> > read()+write() code logic for the file.
>
> I have looked at 0001, and I don't really like it. One argument
> against this approach is that if pg_rewind fails in the middle of its
> operation then we would have done a set of fsync() for nothing, with
> the data folder still unusable. I would be curious to see some
> numbers to see how much it matters with many physical files (say cases
> with thousands of small relations?).
>
> +/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
> +#if defined(HAVE_SYNC_FILE_RANGE)
> +#define PG_FLUSH_DATA_WORKS 1
> +#elif !defined(WIN32) && defined(MS_ASYNC)
> +#define PG_FLUSH_DATA_WORKS 1
> +#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
> +#define PG_FLUSH_DATA_WORKS 1
>
> This is wrong for the code frontend on platforms that may finish by
> using MS_ASYNC, no? There is no such implementation in file_utils.c
> but there is one in fd.c.

Yes, it seems that we need to add the MS_ASYNC code (see the one in fd.c) to src/common/file_utils.c:pre_sync_fname().

> + fsync_fname("global/pg_control", false);
> + fsync_fname("backup_label", false);
> + if (access("recovery.conf", F_OK) == 0)
> +     fsync_fname("recovery.conf", false);
> + if (access("postgresql.auto.conf", F_OK) == 0)
> +     fsync_fname("postgresql.auto.conf", false);
>
> This list is wrong on various aspects, no? This would miss custom
> configuration files, or included files.

I did not understand this; can you please clarify? Anyway, let me explain: here we fsync these files additionally because pg_rewind (possibly) modifies them after rewinding. These files may not be handled/logged in the filemap:

- pg_control's action is FILE_ACTION_NONE;
- backup_label is excluded;
- recovery.conf is not logged in the filemap;
- postgresql.auto.conf may be logged, but let's fsync it anyway for safety.

> - if (showprogress)
> -     pg_log_info("syncing target data directory");
> - sync_target_dir();
> -
>   /* Also update the standby configuration, if requested. */
>   if (writerecoveryconf && !dry_run)
>       WriteRecoveryConfig(conn, datadir_target,
>                           GenerateRecoveryConfig(conn, NULL));
>
> + if (showprogress)
> +     pg_log_info("syncing target data directory");
> + perform_sync(filemap);
>
> Why inverting the order here?

We need perform_sync() to also synchronize the recovery configuration change, so it has to run last.

> + * Pre Linux 5.3 does not allow cross-fs copy_file_range() call
> + * (return EXDEV). Some fs do not support copy_file_range() (return
> + * ENOTSUP). Here we explicitly disable copy_file_range() for the
> + * two scenarios. For other failures we still allow subsequent
> + * copy_file_range() try.
> + */
> + if (errno == ENOTSUP || errno == EXDEV)
> +     copy_file_range_support = false;
>
> Are you sure that it is right to always cancel the support of
> copy_file_range() after it does not work once? Couldn't it be
> possible that things work properly depending on the tablespace being
> worked on by pg_rewind?

Ideally we should retry when we first run into a symlink (e.g. tablespace, wal), but it does not seem easy to do that gracefully.
> Having the facility for copy_file_range() in pg_rewind is not nice at > the end, and we are going to need a run-time check to fallback > dynamically to an equivalent implementation on errno={EXDEV,ENOTSUP}. > Hmm. What about locating all that in file_utils.c instead, with a > brand new routine name (pg_copy_file_range would be one choice)? We > still need the ./configure check, except that the conditions to use > the fallback implementation is in this routine, aka fallback on EXDEV, > ENOTSUP or !HAVE_COPY_FILE_RANGE. The backend may need to solve this > problem at some point, but logging and fd handling will likely just > locate that in fd.c, so having one routine for the purpose of all > frontends looks like a step in the right direction. Yes, seems better to make it generic.
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Looks like my previous reply was held for moderation (maybe due to my new email address). I configured my pg account today using the new email address; I guess this email will be held for moderation too. I'm now replying to my previous reply email and attaching the new patch series.

On Jul 6, 2020, at 10:18 AM, Paul Guo <gu...@vmware.com> wrote:

Thanks for the review. I'm now picking the work up again. I modified the code following the comments. Besides, I tweaked the test code a bit. There are several things I'm not 100% sure about; please see my replies below.

On Jan 27, 2020, at 11:24 PM, Fujii Masao <masao.fu...@oss.nttdata.com> wrote:

On 2020/01/15 19:18, Paul Guo wrote: I further fixed the last test failure (due to a small bug in the test, not in code). Attached are the new patch series. Let's see the CI pipeline result.

Thanks for updating the patches! I started reading the 0003 patch. The approach that the 0003 patch uses is not the perfect solution. If the standby crashes after tblspc_redo() removes the directory and before its subsequent COMMIT record is replayed, a PANIC error would occur since there can be some unresolved missing directory entries when we reach the consistent state. The problem would very rarely happen, though... Just an idea; calling XLogFlush() to update the minimum recovery point just before tblspc_redo() performs destroy_tablespace_directories() may be safe and helpful for the problem?

Yes, that looks like an issue. My understanding of the scenario is below:

    XLogLogMissingDir()
    XLogFlush() in redo (e.g. in a commit redo)  <- creates a minimum recovery point (call it LSN_A)
    tblspc_redo() -> XLogForgetMissingDir()      <- suppose we PANIC immediately after removing the
                                                    directory in tblspc_redo(); when we then replay
                                                    during crash recovery, we check consistency at
                                                    LSN_A and thus PANIC in XLogCheckMissingDirs()
    commit

We should add an XLogFlush() in tblspc_redo(). This brings several other questions to my mind also:

1. Should we call XLogFlush() in dbase_redo() for XLOG_DBASE_DROP also? It calls both XLogDropDatabase() and XLogForgetMissingDir(), which seem to have this issue as well.
2. xact_redo_abort() calls DropRelationFiles() also. Why don't we call XLogFlush() there?

(A sketch of the tblspc_redo() change follows at the end of this message.)

- appendStringInfo(buf, "copy dir %u/%u to %u/%u",
- xlrec->src_tablespace_id, xlrec->src_db_id,
- xlrec->tablespace_id, xlrec->db_id);
+ dbpath1 = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
+ dbpath2 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+ appendStringInfo(buf, "copy dir %s to %s", dbpath1, dbpath2);
+ pfree(dbpath2);
+ pfree(dbpath1);

If the patch is for the bug fix and would be back-ported, the above change would lead to a change in pg_waldump's output for CREATE/DROP DATABASE between minor versions. IMO it's better to avoid such a change and separate the above into a patch only for master.

I know we do not want to change the WAL format between minor releases, but does a change in the WAL description string between minor releases affect users? Anyway, I'll extract this part into a separate patch in the series, since this change is actually independent of the other changes.

- appendStringInfo(buf, " %u/%u",
- xlrec->tablespace_ids[i], xlrec->db_id);
+ {
+ dbpath1 = GetDatabasePath(xlrec->db_id, xlrec->tablespace_ids[i]);
+ appendStringInfo(buf, "%s", dbpath1);
+ pfree(dbpath1);
+ }

Same as above. BTW, the above "%s" should be " %s", i.e., a space character needs to be appended to the head of "%s".
OK.

+ get_parent_directory(parent_path);
+ if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+ {
+ XLogLogMissingDir(xlrec->tablespace_id, InvalidOid, dst_path);

The third argument of XLogLogMissingDir() should be parent_path instead of dst_path?

The argument is for debug message printing, so both should be fine, but admittedly we are logging for the tablespace directory, so parent_path might be better.

+ if (hash_search(missing_dir_tab, &key, HASH_REMOVE, NULL) == NULL)
+ elog(DEBUG2, "dir %s tablespace %d database %d is not missing",
+ path, spcNode, dbNode);

I think that this elog() is useless and rather confusing.

OK. Modified.

+ XLogForgetMissingDir(xlrec->ts_id, InvalidOid, "");

The third argument should be set to the actual path instead of an empty string. Otherwise XLogForgetMissingDir() may emit a confusing DEBUG2 message. Or the third argument of XLogForgetMissingDir() should be removed and the path in the DEBUG2 message should be calculated from the spcNode and dbNode in the hash entry in XLogForgetMissingDir().

I'm now removing the third argument and using GetDatabasePath() to get the path if the database id is not InvalidOid.

+#include "common/file_perm.h"

This seems not necessary.

Right.

v10-0001-Support-node-initialization-from-backu
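For illustration, a minimal sketch of the tblspc_redo() change discussed above (a sketch only, assuming record->EndRecPtr is the right LSN to advance the minimum recovery point to):

    else if (info == XLOG_TBLSPC_DROP)
    {
        xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);

        /*
         * Advance the minimum recovery point before removing the directory,
         * so a crash after the removal cannot replay from an LSN at which
         * the directory is still expected to exist when we reach the
         * consistency check.
         */
        XLogFlush(record->EndRecPtr);

        if (!destroy_tablespace_directories(xlrec->ts_id, true))
        {
            /* existing retry/error handling unchanged */
        }
    }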
Two fsync related performance issues?
Hello hackers,

1. StartupXLOG() does fsync on the whole data directory early in crash recovery, and I'm wondering if we could skip some directories (at least pg_log/ and the table directories), since WAL replay, etc. could ensure consistency. Here is the related code:

    if (ControlFile->state != DB_SHUTDOWNED &&
        ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
    {
        RemoveTempXlogFiles();
        SyncDataDirectory();
    }

I have this concern since I saw an issue in a real production environment where the startup process needed 10+ seconds to start WAL replay after a relaunch due to elog(PANIC) (it was seen on the Postgres-based product Greenplum, but it is a common issue in Postgres also). I highly suspect the delay was mostly due to this. Also, fsync is noticeably slower on public clouds than on local storage, so the slowness should be more severe there. If we at least skipped fsync on the table directories, we could avoid a lot of file fsyncs - this may save many seconds during crash recovery.

2. CheckPointTwoPhase()

This may be a small issue. See the code below:

    for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
        RecreateTwoPhaseFile(gxact->xid, buf, len);

RecreateTwoPhaseFile() writes a state file for a prepared transaction and does fsync. It might be better to write all the files first and then fsync them in one pass, given that the kernel can flush asynchronously while we write the remaining file contents. If TwoPhaseState->numPrepXacts is large, we could do batching to avoid the fd resource limit. (A sketch of this idea follows.)

I did not test these yet, but this should be able to speed up checkpoint/restartpoint a bit.

Any thoughts?

Regards.
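For illustration, a minimal sketch of the write-all-then-fsync-all idea for the two-phase state files (a sketch only; the function and parameter names are hypothetical, not the actual PostgreSQL functions, and error handling is elided):

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Hypothetical batched variant of the per-file write+fsync loop. */
    static void
    recreate_twophase_files_batched(int nxacts, const char *paths[],
                                    const char *bufs[], const size_t lens[])
    {
        int    *fds = malloc(nxacts * sizeof(int));

        /* Pass 1: write everything; the kernel may flush asynchronously. */
        for (int i = 0; i < nxacts; i++)
        {
            fds[i] = open(paths[i], O_CREAT | O_TRUNC | O_WRONLY, 0600);
            (void) write(fds[i], bufs[i], lens[i]);
        }

        /* Pass 2: one fsync sweep, then close. */
        for (int i = 0; i < nxacts; i++)
        {
            (void) fsync(fds[i]);
            close(fds[i]);
        }

        free(fds);
    }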
Re: Two fsync related performance issues?
Thanks for the replies.

On Tue, May 12, 2020 at 2:04 PM Michael Paquier wrote:
> On Tue, May 12, 2020 at 12:55:37PM +0900, Fujii Masao wrote:
> > On 2020/05/12 9:42, Paul Guo wrote:
> >> 1. StartupXLOG() does fsync on the whole data directory early in
> >> the crash recovery. I'm wondering if we could skip some
> >> directories (at least the pg_log/, table directories) since wal,
> >> etc could ensure consistency.
> >
> > I agree that we can skip log directory but I'm not sure if skipping
> > table directory is really safe. Also ISTM that we can skip the directories
> > that those contents are removed or zeroed during recovery,
> > for example, pg_snapshots, pg_substrans, etc.
>
> Basically excludeDirContents[] as of basebackup.c.

Table directory and WAL fsync probably dominate the fsync time. Do we know of any real scenario that requires table directory fsync?
Re: Batch insert in CTAS/MatView code
ect a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4 from t12;
14.2      14.46    -1.83%   create table tt as select * from t13;
11.88     12.05    -1.43%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4,a5,a6 from t13;
3.17      3.25     -2.52%   create table tt as select * from t14;
2.93      3.12     -6.48%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y from t14;

scenario2: all related kernel caches are populated after the previous testing.

baseline  patch    diff%    SQL
9.6       4.97     48.23%   create table tt as select * from t11;
10.41     5.32     48.90%   create table tt as select a,b,c from t11;
9.12      9.52     -4.38%   create table tt as select * from t12;
9.66      8.6      10.97%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4 from t12;
13.56     13.6     -0.30%   create table tt as select * from t13;
11.36     11.7     -2.99%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4,a5,a6 from t13;
3.08      3.13     -1.62%   create table tt as select * from t14;
2.95      3.03     -2.71%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y from t14;

From the above we can draw some tentative conclusions:

1. t11: For short tuples, batch insert improves performance a lot (40%+).
2. t12: For BufferHeapTupleTableSlot, the patch slows things down by 4.x%-6.x%, but for VirtualTupleTableSlot it improves them by 10.x%. Looking at execTuples.c, this seems quite relevant to the additional memory copy. VirtualTupleTableSlot appears to be more common than BufferHeapTupleTableSlot, so the current code should probably be fine for most real cases. Alternatively, we could decide on multi-insert according to the input slot type, but that seems ugly in code; or we could lower the threshold a bit more so that "create table tt as select * from t12;" also improves, although that hurts the VirtualTupleTableSlot case.
3. t13: The new code still uses single insert, so the difference should be small. I just wanted to see the regression even when we use "single insert".
4. For the toast case t14, the degradation is small; not a big deal.

By the way, did we try or think about allowing better prefetch (on Linux) for seqscans, i.e. POSIX_FADV_SEQUENTIAL in posix_fadvise(), to enlarge the kernel readahead window? Presumably this should help if sequential tuple handling is faster than the default kernel readahead setting. (A sketch follows after this message.)

v2 patch is attached.

On Thu, Mar 7, 2019 at 4:54 PM Heikki Linnakangas wrote:
> On 06/03/2019 22:06, Paul Guo wrote:
> > The patch also modifies heap_multi_insert() a bit to do a bit further
> > code-level optimization by using static memory, instead of using memory
> > context and dynamic allocation.
>
> If toasting is required, heap_prepare_insert() creates a palloc'd tuple.
> That is still leaked to the current memory context.
>
> Leaking into the current memory context is not a bad thing, because
> resetting a memory context is faster than doing a lot of pfree() calls.
> The callers just need to be prepared for that, and use a short-lived
> memory context.
>
> > By the way, while looking at the code, I noticed that there are 9 local
> > arrays with large length in toast_insert_or_update() which seems to be a
> > risk of stack overflow. Maybe we should put it as static or global.
>
> Hmm. We currently reserve 512 kB between the kernel's limit, and the
> limit we check in check_stack_depth(). See STACK_DEPTH_SLOP. Those
> arrays add up to 52800 bytes on a 64-bit machine, if I did my math
> right. So there's still a lot of headroom.
> I agree that it nevertheless seems a bit excessive, though.
>
> > With the patch,
> >
> > Time: 4728.142 ms (00:04.728)
> > Time: 14203.983 ms (00:14.204)
> > Time: 1008.669 ms (00:01.009)
> >
> > Baseline,
> > Time: 11096.146 ms (00:11.096)
> > Time: 13106.741 ms (00:13.107)
> > Time: 1100.174 ms (00:01.100)
>
> Nice speedup!
>
> > While for toast and large column size there is < 10% decrease but for
> > small column size the improvement is super good. Actually if I hardcode
> > the batch count as 4 all test cases are better but the improvement for
> > small column size is smaller than that with current patch. Pretty much
> > the number 4 is quite case specific so I can not hardcode that in the
> > patch. Of course we could further tune that but the current value seems
> > to be a good trade-off?
>
> Have you done any profiling, on why the multi-insert is slower with
> large tuples? In principle, I don't see why it should be slower.
>
> - Heikki

v2-0001-Heap-batch-insert-for-CTAS-MatView.patch Description: Binary data
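For illustration, a minimal sketch of the prefetch idea mentioned above (a sketch only; where exactly the hint would be issued, e.g. when the seqscan opens the underlying heap file, is an open question):

    #include <fcntl.h>

    /*
     * Hint the kernel that "fd" will be read sequentially, so it may
     * enlarge its readahead window.  The hint is advisory; a failure is
     * harmless and can be ignored.
     */
    static void
    hint_sequential(int fd)
    {
        (void) posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    }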
Re: Batch insert in CTAS/MatView code
On Mon, Jun 17, 2019 at 8:53 PM Paul Guo wrote: > Hi all, > > I've been working other things until recently I restarted the work, > profiling & refactoring the code. > It's been a long time since the last patch was proposed. The new patch has > now been firstly refactored due to > 4da597edf1bae0cf0453b5ed6fc4347b6334dfe1 (Make TupleTableSlots extensible, > finish split of existing slot type). > > Now that TupleTableSlot, instead of HeapTuple is one argument of > intorel_receive() so we can not get the > tuple length directly. This patch now gets the tuple length if we know all > columns are with fixed widths, else > we calculate an avg. tuple length using the first MAX_MULTI_INSERT_SAMPLES > (defined as 1000) tuples > and use for the total length of tuples in a batch. > > I noticed that to do batch insert, we might need additional memory copy > sometimes comparing with "single insert" > (that should be the reason that we previously saw a bit regressions) so a > good solution seems to fall back > to "single insert" if the tuple length is larger than a threshold. I set > this as 2000 after quick testing. > > To make test stable and strict, I run checkpoint before each ctas, the > test script looks like this: > > checkpoint; > \timing > create table tt as select a,b,c from t11; > \timing > drop table tt; > > Also previously I just tested the BufferHeapTupleTableSlot (i.e. create > table tt as select * from t11), > this time I test VirtualTupleTableSlot (i.e. create table tt as select > a,b,c from t11) additionally. > It seems that VirtualTupleTableSlot is very common in real cases. > > I tested four kinds of tables, see below SQLs. > > -- tuples with small size. > create table t11 (a int, b int, c int, d int); > insert into t11 select s,s,s,s from generate_series(1, 1000) s; > analyze t11; > > -- tuples that are untoasted and tuple size is 1984 bytes. > create table t12 (a name, b name, c name, d name, e name, f name, g name, > h name, i name, j name, k name, l name, m name, n name, o name, p name, q > name, r name, s name, t name, u name, v name, w name, x name, y name, z > name, a1 name, a2 name, a3 name, a4 name, a5 name); > insert into t12 select 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', > 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', > 'z', 'a', 'b', 'c', 'd', 'e' from generate_series(1, 50); > analyze t12; > > -- tuples that are untoasted and tuple size is 2112 bytes. > create table t13 (a name, b name, c name, d name, e name, f name, g name, > h name, i name, j name, k name, l name, m name, n name, o name, p name, q > name, r name, s name, t name, u name, v name, w name, x name, y name, z > name, a1 name, a2 name, a3 name, a4 name, a5 name, a6 name, a7 name); > insert into t13 select 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', > 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', > 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g' from generate_series(1, 50); > analyze t13; > > -- tuples that are toastable and tuple compressed size is 1084. > create table t14 (a text, b text, c text, d text, e text, f text, g text, > h text, i text, j text, k text, l text, m text, n text, o text, p text, q > text, r text, s text, t text, u text, v text, w text, x text, y text, z > text); > insert into t14 select i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, > i, i, i, i, i, i, i, i, i from (select repeat('123456789', 1) from > generate_series(1,5000)) i; > analyze t14; > > > I also tested two scenarios for each testing. 
> > One is to clean up all kernel caches (page & inode & dentry on Linux) > using the command below and then run the test, > sync; echo 3 > /proc/sys/vm/drop_caches > After running all tests all relation files will be in kernel cache (my > test system memory is large enough to accommodate all relation files), > then I run the tests again. I run like this because in real scenario the > result of the test should be among the two results. Also I rerun > each test and finally I calculate
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
On Mon, May 27, 2019 at 9:39 PM Paul Guo wrote: > > > On Tue, May 14, 2019 at 11:06 AM Kyotaro HORIGUCHI < > horiguchi.kyot...@lab.ntt.co.jp> wrote: > >> Hello. >> >> At Mon, 13 May 2019 17:37:50 +0800, Paul Guo wrote in < >> caeet0zf9yn4daxyuflzocayyxuff1ms_oqwea+rwv3gha5q...@mail.gmail.com> >> > Thanks for the reply. >> > >> > On Tue, May 7, 2019 at 2:47 PM Kyotaro HORIGUCHI < >> > horiguchi.kyot...@lab.ntt.co.jp> wrote: >> > >> > > >> > > + if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode))) >> > > + { >> > > >> > > This patch is allowing missing source and destination directory >> > > even in consistent state. I don't think it is safe. >> > > >> > >> > I do not understand this. Can you elaborate? >> >> Suppose we were recoverying based on a backup at LSN1 targeting >> to LSN3 then it crashed at LSN2, where LSN1 < LSN2 <= LSN3. LSN2 >> is called as "consistency point", before where the database is >> not consistent. It's because we are applying WAL recored older >> than those that were already applied in the second trial. The >> same can be said for crash recovery, where LSN1 is the latest >> checkpoint ('s redo LSN) and LSN2=LSN3 is the crashed LSN. >> >> Creation of an existing directory or dropping of a non-existent >> directory are apparently inconsistent or "broken" so we should >> stop recovery when seeing such WAL records while database is in >> consistent state. >> > > This seems to be hard to detect. I thought using invalid_page mechanism > long ago, > but it seems to be hard to fully detect a dropped tablespace. > > > > > 2) Fixed dbase_desc(). Now the xlog output looks correct. >> > > > >> > > > rmgr: Databaselen (rec/tot): 42/42, tx:486, lsn: >> > > > 0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to >> > > > pg_tblspc/16384/PG_12_201904281/16386 >> > > > >> > > > rmgr: Databaselen (rec/tot): 34/34, tx:487, lsn: >> > > > 0/01638EB8, prev 0/01638E40, desc: DROP dir >> > > > pg_tblspc/16384/PG_12_201904281/16386 >> > > >> > > WAL records don't convey such information. The previous >> > > description seems right to me. >> > > >> > >> > 2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for >> > Database/CREATE: copy dir 1663/1 to 65546/65547 >> > The directories are definitely wrong and misleading. >> >> The original description is right in the light of how the server >> recognizes. The record exactly says that "copy dir 1663/1 to >> 65546/65547" and the latter path is converted in filesystem layer >> via a symlink. >> > > In either $PG_DATA/pg_tblspc or symlinked real tablespace directory, > there is an additional directory like PG_12_201905221 between > tablespace oid and database oid. See the directory layout as below, > so the directory info in xlog dump output was not correct. > > $ ls -lh data/pg_tblspc/ > > > total 0 > > > lrwxrwxrwx. 1 gpadmin gpadmin 6 May 27 17:23 16384 -> /tmp/2 > > > $ ls -lh /tmp/2 > > > total 0 > > > drwx--. 3 gpadmin gpadmin 18 May 27 17:24 PG_12_201905221 > >> >> >> > > > > Also I'd suggest we should use pg_mkdir_p() in >> > > TablespaceCreateDbspace() >> > > > > to replace >> > > > > the code block includes a lot of >> > > > > get_parent_directory(), MakePGDirectory(), etc even it >> > > > > is not fixing a bug since pg_mkdir_p() code change seems to be >> more >> > > > > graceful and simpler. >> > > >> > > But I don't agree to this. pg_mkdir_p goes above two-parents up, >> > > which would be unwanted here. >> > > >> > > I do not understand this also. pg_mkdir_p() is similar to 'mkdir -p'. 
>> > This change just makes the code concise, though in theory the change
>> > is not needed.
>>
>> We don't want to create a tablespace directory after a concurrent
>> DROP, as the comment just above says:
>>
>> | * Acquire TablespaceCreateLock to ensure that no DROP TABLESPACE
>> | * or TablespaceCreateDbspace is running concurrently.
>>
>> If the concurrent DROP TABLESPACE destroyed the grandparent
>> directory, we mustn't create it again.
>
> Yes, this is a good reason to keep the original code. Thanks.
>
> By the way, based on your previous test patch I added another test
> which can easily detect the missing-source-directory case.

I updated the patch to v3. In this version, we skip the error if copydir fails due to a missing src/dst directory, but to make sure the ignoring is legal, I added a simple log/forget mechanism (using a List), similar to the xlog invalid-page tracking mechanism; a sketch of the idea is below. Two TAP tests are included: one is actually from a previous patch by Kyotaro in this email thread, and the other was added by me. In addition, dbase_desc() is fixed to make the message accurate. Thanks.

v3-0001-skip-copydir-if-either-src-directory-or-dst-direc.patch Description: Binary data
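To make the log/forget idea concrete, here is a toy sketch in backend C. All names are illustrative, not the actual patch: skipped directory operations are remembered while the database is still inconsistent, forgotten when a later record accounts for them, and anything left pending at the consistency point is treated as corruption, mirroring the invalid-page table in xlogutils.c.

#include "postgres.h"
#include "nodes/pg_list.h"

/* Paths whose copy/removal was skipped before the consistency point. */
static List *skipped_dirs = NIL;

static void
remember_skipped_dir(const char *path)
{
	skipped_dirs = lappend(skipped_dirs, pstrdup(path));
}

/* Called when a later WAL record resolves the skipped directory. */
static void
forget_skipped_dir(const char *path)
{
	ListCell   *lc;

	foreach(lc, skipped_dirs)
	{
		if (strcmp((const char *) lfirst(lc), path) == 0)
		{
			skipped_dirs = foreach_delete_current(skipped_dirs, lc);
			break;
		}
	}
}

/* Called once the consistency point is reached. */
static void
check_skipped_dirs(void)
{
	if (skipped_dirs != NIL)
		elog(PANIC, "unresolved skipped directories at consistency point");
}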
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
Rebased, aligned with recent changes in pg_rewind/pg_basebackup and then retested. Thanks.

On Mon, Jul 1, 2019 at 7:35 PM Thomas Munro wrote:
> On Fri, Apr 19, 2019 at 3:41 PM Paul Guo wrote:
> > I updated the patches as attached following previous discussions.
>
> Hi Paul,
>
> Could we please have a fresh rebase now that the CF is here?
>
> Thanks,
>
> --
> Thomas Munro
> https://enterprisedb.com

v3-0001-Extact-common-functions-from-pg_basebackup-into-s.patch Description: Binary data
v3-0002-Add-option-to-write-recovery-configuration-inform.patch Description: Binary data
v3-0003-Ensure-target-clean-shutdown-at-the-beginning-of-.patch Description: Binary data
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
On Tue, Jul 2, 2019 at 12:35 AM Alvaro Herrera wrote:
> On 2019-Apr-19, Paul Guo wrote:
>
> > The below patch runs single-mode Postgres if needed to make sure the
> > target is cleanly shut down. A new option is added (off by default).
> > v2-0001-Ensure-target-clean-shutdown-at-beginning-of-pg_r.patch
>
> Why do we need an option for this? Is there a reason not to do this
> unconditionally?

There is concern about this (see previous emails in this thread). On Greenplum (an MPP database based on Postgres), we do this unconditionally. I'm not sure how Postgres users usually handle an unclean shutdown, but providing an option seems safer, to avoid breaking existing scripts or services. If many people think the option is unnecessary, I'm fine with removing it and keeping the code logic unconditional. (A rough sketch of the clean-shutdown logic is below.)
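For reference, a rough sketch of the approach, under the assumption that running the target cluster once in single-user mode completes crash recovery and leaves the control file in a cleanly-shut-down state. Details such as path quoting, locating the postgres binary, and error reporting are simplified here; MAXPGPATH and DEVNULL are the usual PostgreSQL header defines.

#include <stdio.h>
#include <stdlib.h>

/*
 * Run the target cluster once in single-user mode so crash recovery
 * completes before rewinding. Single-user mode exits as soon as it
 * reads EOF on stdin, hence the redirect from DEVNULL.
 */
static void
ensureCleanShutdown(const char *exec_path, const char *datadir)
{
	char		cmd[MAXPGPATH * 2];

	/* -F skips fsync; pg_rewind syncs the data directory afterwards */
	snprintf(cmd, sizeof(cmd),
			 "\"%s\" --single -F -D \"%s\" template1 < \"%s\"",
			 exec_path, datadir, DEVNULL);

	if (system(cmd) != 0)
	{
		fprintf(stderr, "single-user mode run of target cluster failed\n");
		exit(1);
	}
}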
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
Yes, the patches changed the Makefile so that pg_rewind and pg_basebackup can share some common code, but for the Windows build, I'm not sure where those Windows build files are. Does anyone know about that? Thanks.

On Tue, Jul 9, 2019 at 6:55 AM Thomas Munro wrote:
> On Tue, Jul 2, 2019 at 5:46 PM Paul Guo wrote:
> > Rebased, aligned with recent changes in pg_rewind/pg_basebackup and
> > then retested. Thanks.
>
> Hi Paul,
>
> A minor build problem on Windows:
>
> src/bin/pg_rewind/pg_rewind.c(32): fatal error C1083: Cannot open
> include file: 'backup_common.h': No such file or directory
> [C:\projects\postgresql\pg_rewind.vcxproj]
>
> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.46422
> http://cfbot.cputube.org/paul-guo.html
>
> --
> Thomas Munro
> https://enterprisedb.com
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
On Wed, Jul 10, 2019 at 3:28 PM Michael Paquier wrote:
> On Tue, Jul 09, 2019 at 10:48:49PM +0800, Paul Guo wrote:
> > Yes, the patches changed the Makefile so that pg_rewind and pg_basebackup
> > can share some common code, but for the Windows build, I'm not sure where
> > those Windows build files are. Does anyone know about that? Thanks.
>
> The VS scripts are located in src/tools/msvc/. You will likely need
> to tweak things like $frontend_extraincludes or variables in the same
> area for this patch (please see Mkvcbuild.pm).

Thanks. Both Mkvcbuild.pm and pg_rewind/Makefile have been modified so that the Windows build passes in a local environment (hopefully it also passes the CI testing). In addition, pg_rewind/Makefile no longer creates a soft link for backup_common.h; instead, -I is used to specify the header directory. I also noticed that a documentation change is needed, so I updated the docs for the two new options accordingly. Please see the attached new patches.

v4-0002-Add-option-to-write-recovery-configuration-inform.patch Description: Binary data
v4-0003-Ensure-target-clean-shutdown-at-the-beginning-of-.patch Description: Binary data
v4-0001-Extact-common-functions-from-pg_basebackup-into-s.patch Description: Binary data
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
On Mon, Jul 8, 2019 at 11:16 AM Thomas Munro wrote:
> On Wed, Jun 19, 2019 at 7:22 PM Paul Guo wrote:
> > I updated the patch to v3. In this version, we skip the error if copydir
> > fails due to a missing src/dst directory, but to make sure the ignoring is
> > legal, I added a simple log/forget mechanism (using a List), similar to the
> > xlog invalid-page tracking mechanism. Two TAP tests are included: one is
> > actually from a previous patch by Kyotaro in this email thread, and the
> > other was added by me. In addition, dbase_desc() is fixed to make the
> > message accurate.
>
> Hello Paul,
>
> FYI t/011_crash_recovery.pl is failing consistently on Travis CI with
> this patch applied:
>
> https://travis-ci.org/postgresql-cfbot/postgresql/builds/555368907

This failure is because the previous v3 patch does not align with a recent commit:

commit 660a2b19038b2f6b9f6bcb2c3297a47d5e3557a8
Author: Noah Misch
Date: Fri Jun 21 20:34:23 2019 -0700

    Consolidate methods for translating a Perl path to a Windows path.

My patch used TestLib::real_dir, which was replaced with TestLib::perl2host in the above commit. I've updated the patch to v4 to align my code. The test now passes in my local environment. Please see the attached v4 patch. Thanks.

v4-0001-skip-copydir-if-either-src-directory-or-dst-direc.patch Description: Binary data
How to reclaim the space of dropped columns of a table?
Hello hackers,

I have had this question for a while, with no answer from various sources. As is known, after dropping a column using ALTER TABLE, the table is not rewritten, and VACUUM FULL does not remove the dropped column either (it is still visible in pg_attribute). The PG documentation says:

https://www.postgresql.org/docs/current/sql-altertable.html
"To force immediate reclamation of space occupied by a dropped column, you can execute one of the forms of ALTER TABLE that performs a rewrite of the whole table. This results in reconstructing each row with the dropped column replaced by a null value."

This seems a bit vague for users (how does one rewrite the table while keeping its definition?), and it still keeps the dropped columns (though as nulls). Wouldn't it be better to leave this functionality to a command like VACUUM FULL, which could completely remove the dropped columns (i.e., no dropped-column entries in pg_attribute and no null values for dropped columns in the table)? Thanks.
Re: Possible race condition in pg_mkdir_p()?
On Thu, Jul 18, 2019 at 4:57 PM Michael Paquier wrote:
> On Thu, Jul 18, 2019 at 04:17:22PM +0800, Ning Yu wrote:
> > This is still wrong with the current code logic, because when the statusdir
> > is a file the errno is also EEXIST, but it can pass the check here. Even if
> > we fix pg_mkdir_p() by following the `mkdir -p` way, the errno check here
> > is still wrong.
>
> Would you like to send a patch?

Michael, we'll send out the patch later.

Checking the code, it seems there is another related mkdir() issue. MakePGDirectory() is actually the mkdir() syscall, and the manpage gives this meaning for the EEXIST errno:

    EEXIST pathname already exists (not necessarily as a directory).
    This includes the case where pathname is a symbolic link,
    dangling or not.

However, it looks like some callers do not handle that correctly, e.g.

    if (MakePGDirectory(directory) < 0)
    {
        if (errno == EEXIST)
            return;

or

    if (MakePGDirectory(parentdir) < 0 && errno != EEXIST)

i.e., rather than checking errno == EEXIST alone, we had better verify with stat() and S_ISDIR() that the existing path really is a directory; a sketch of such a check is below. One possible fix is to add an argument like ignore_created to MakePGDirectory() (in case some callers want to fail if the path has already been created) and put that logic into it.
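A minimal sketch of the stricter check. The helper name and the wrapper shape are illustrative, not committed code; it only assumes the existing backend function int MakePGDirectory(const char *).

#include <sys/stat.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Create "path" if missing; treat EEXIST as success only when the
 * existing path really is a directory, since mkdir() also returns
 * EEXIST for a plain file or a symlink (dangling or not).
 */
static bool
MakePGDirectoryIfMissing(const char *path)
{
	struct stat st;

	if (MakePGDirectory(path) == 0)
		return true;

	if (errno == EEXIST &&
		stat(path, &st) == 0 && S_ISDIR(st.st_mode))
		return true;			/* already there as a directory: OK */

	/* a production version would preserve mkdir()'s errno here */
	return false;				/* file/symlink in the way, or other error */
}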
Re: pg_ugprade test failure on data set with column with default value with type bit/varbit
Thanks. I updated the patch as attached and double-checked that those tests pass.

2018-07-30 9:38 GMT+08:00 Thomas Munro:
> On Thu, May 17, 2018 at 8:20 PM, Paul Guo wrote:
> > Thanks. I tentatively submitted a patch (see the attachment).
>
> Hi Paul,
>
> It looks like you missed a couple of changes in the contrib/btree_gist
> bit and varbit tests, so make check-world fails:
>
> - Index Cond: ((a >= B'100'::"bit") AND (a <= B'101'::"bit"))
> + Index Cond: ((a >= '100'::"bit") AND (a <= '101'::"bit"))
>
> --
> Thomas Munro
> http://www.enterprisedb.com

0001-Fix-pg_upgrade-test-failure-caused-by-the-DDL-below.v2.patch Description: Binary data
[Patch] Create a new session in postmaster by calling setsid()
Hello,

Recently I encountered an issue during testing and root-caused it to what the title mentions. Postmaster children already create a new session by calling InitPostmasterChild(), but the postmaster itself does not. This can lead to some issues (e.g. signal handling). The test script below is a simple example.

$ cat test.sh
pg_ctl -D ./data -l stop.log stop
sleep 2   # wait for pg down
pg_ctl -D ./data -l start.log start
sleep 2   # wait for pg up
echo "pg_sleep()-ing"
psql -d postgres -c 'select pg_sleep(1000)'

Press Ctrl+C while pg_sleep() is running, and you will see the postmaster and its children are all gone.

Long ago PG had pmdaemonize() to daemonize the postmaster, which of course created a new session. That code was removed in the commit below.

commit f7ea6beaf4ca02b8e6dc576255e35a5b86035cb9
Author: Heikki Linnakangas
Date: Mon Jul 4 14:35:44 2011 +0300

    Remove silent_mode. You get the same functionality with "pg_ctl -l
    postmaster.log", or nohup. There was a small issue with LINUX_OOM_ADJ
    and silent_mode, namely that with silent_mode the postmaster process
    incorrectly used the OOM settings meant for backend processes. We
    certainly could've fixed that directly, but since silent_mode was
    redundant anyway, we might as well just remove it.

Here is the related discussion about that patch; it seems to be related to the OOM score setting in fork_process().
https://www.postgresql.org/message-id/4E046EA5.8070101%40enterprisedb.com

Personally I still like PG being a daemon, though there are probably real cases I do not know of where a daemon is not wanted. If we do not go back to pmdaemonize() with a further fix of fork_process() (assuming I understand correctly), we could simply call setsid() in the postmaster code to fix the issue above. I attached a simple patch. Thanks.

0001-Create-a-new-session-in-postmaster-by-calling-setsid.patch Description: Binary data
Re: [Patch] Create a new session in postmaster by calling setsid()
On Thu, Aug 2, 2018 at 10:30 PM, Tom Lane wrote:
> Paul Guo writes:
> > [ make the postmaster execute setsid() too ]
>
> I'm a bit skeptical of this proposal. Forcing the postmaster to
> dissociate from its controlling terminal is a good thing in some
> scenarios, but probably less good in others, and manual postmaster
> starts are probably mostly in the latter class.
>
> I wonder whether having "pg_ctl start" do a setsid() would accomplish
> the same result with less chance of breaking debugging usages.
>
> regards, tom lane

Yes, considering the case of starting the postmaster manually, we cannot create a new session in the postmaster itself, so pg_ctl seems to be a good place for the setsid() call. Attached is a newer patch; a rough sketch of the idea follows. Thanks.

0001-Create-a-new-session-for-postmaster-in-pg_ctl-by-cal.patch Description: Binary data
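A minimal sketch of the approach, assuming pg_ctl forks a child that execs the postmaster (the real change lives in pg_ctl's start_postmaster(), whose command building and error handling are more thorough):

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>

/*
 * In the pg_ctl child that will become the postmaster, detach from the
 * controlling terminal first, so a Ctrl+C in the starting shell can no
 * longer signal the postmaster's process group.
 */
static int
start_postmaster_sketch(void)
{
	pid_t		pid = fork();

	if (pid < 0)
		return -1;				/* fork failed */
	if (pid == 0)				/* child: will exec the postmaster */
	{
		if (setsid() < 0)		/* new session, no controlling terminal */
		{
			fprintf(stderr, "could not create new session: %s\n",
					strerror(errno));
			_exit(1);
		}
		/* ... build the command line and exec the postmaster here ... */
	}
	return 0;					/* parent continues (waits for startup) */
}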
Re: Parallelize stream replication process
> On Sep 16, 2020, at 11:15 AM, Li Japin wrote:
>
>> On Sep 15, 2020, at 3:41 PM, Fujii Masao wrote:
>>
>> On 2020/09/15 13:41, Bharath Rupireddy wrote:
>>> On Tue, Sep 15, 2020 at 9:27 AM Li Japin wrote:
>>>> For now, postgres uses a single process to send, receive and replay
>>>> the WAL when we use streaming replication. Is there any point in
>>>> parallelizing this process? If so, how do we start? Any thoughts?
>>
>> Probably this is another parallelism than what you're thinking,
>> but I was thinking to start up the walwriter process in the standby
>> server and make it fsync the streamed WAL data. This means that we
>> leave a part of the tasks of the walreceiver process to the walwriter.
>> The walreceiver performs WAL receive and write, and the walwriter
>> performs WAL flush, in parallel. I'm just expecting that this change
>> would improve the replication performance, e.g., reduce the time to
>> wait for sync replication.

Yes, this should be able to improve that in theory (a toy sketch of this receive/flush split is below).

>> Without this change (i.e., originally), only the walreceiver performs
>> WAL receive, write and flush. So while the walreceiver is fsyncing WAL
>> data, it cannot receive newly-arrived WAL data. If the WAL flush takes
>> time, the time to wait for sync replication on the primary is enlarged.
>>
>> Regards,
>>
>> --
>> Fujii Masao
>> Advanced Computing Technology Center
>> Research and Development Headquarters
>> NTT DATA CORPORATION
>
> Yeah, this might be a direction.
>
> Now I am thinking about how to parallelize WAL replay. If we can improve
> the efficiency of replay, then we can shorten the database recovery time
> (streaming replication or database crash recovery).

Yes, parallelization should help the scenarios where CPU utilization is 100%, or where it is below 100% but replay keeps blocking on operations that could be improved by running them in parallel. There are a lot of scenarios in WAL replay (memory operations, system calls, locking, etc). I do not have experience with real production environments, so I'm not sure whether or how much recovery suffers, but I believe parallel recovery should help to accelerate failover, which is quite important especially in cloud environments. To do this we may need to carefully analyze the dependencies between various WAL records first. It won't be a small effort. I've heard some databases have implemented this, though I'm not sure how much it helps.

As for a parallel walreceiver/walsender, their functionality is simple, so after decoupling the fsync operation into a separate process I'm not sure how beneficial further parallelization would be (surely it is needed if the walreceiver/walsender is 100% CPU utilized).

> For streaming replication, we may need to improve the transmission of
> WAL logs to improve the entire recovery process.
>
> I'm not sure if this is correct.
>
> Regards,
> Japin Li.
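To make the proposed walreceiver/walwriter split concrete, here is a toy model in plain POSIX threads. This is not PostgreSQL code (in the actual proposal these would be separate processes coordinating through shared memory); it only illustrates the scheme: one side receives and writes WAL, the other fsyncs, so receiving never stalls behind a slow flush.

#include <pthread.h>
#include <unistd.h>
#include <sys/types.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  more = PTHREAD_COND_INITIALIZER;
static off_t written_upto = 0;	/* written but possibly unflushed */
static off_t flushed_upto = 0;
static int   wal_fd;			/* the current WAL segment file */

/* "walreceiver" side: call after each write() so the flusher can run */
static void
receiver_wrote(off_t upto)
{
	pthread_mutex_lock(&lock);
	written_upto = upto;
	pthread_cond_signal(&more);
	pthread_mutex_unlock(&lock);
	/* receiver immediately goes back to reading from the network */
}

/* "walwriter" side: fsync in a loop, off the receive path */
static void *
flusher(void *arg)
{
	(void) arg;
	for (;;)
	{
		off_t		target;

		pthread_mutex_lock(&lock);
		while (written_upto == flushed_upto)
			pthread_cond_wait(&more, &lock);
		target = written_upto;
		pthread_mutex_unlock(&lock);

		fsync(wal_fd);			/* the slow part */

		pthread_mutex_lock(&lock);
		flushed_upto = target;	/* this is what sync replication waits on */
		pthread_mutex_unlock(&lock);
	}
	return NULL;
}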
[PATCH] Use access() to check file existence in GetNewRelFileNode().
The previous code uses BasicOpenFile() + close(). access() should be faster than BasicOpenFile() + close(), and it should also be more correct, since BasicOpenFile() can fail for various reasons (e.g. file permissions) even when the file exists. access() is supported on Linux/Unix. I do not have a Windows dev environment, but MSDN tells me that access() is supported on Windows also, and there are already access() calls in the tree, so I assume there is no portability issue. A sketch of the check is below. Thanks.

0001-Use-access-to-check-file-existence-in-GetNewRelFileN.patch Description: Binary data
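A minimal sketch of the existence check as it would slot into the retry loop in GetNewRelFileNode(), which keeps picking OIDs until no collision is found. The helper name is illustrative and the exact handling in the patch may differ:

#include <unistd.h>
#include <errno.h>
#include <stdbool.h>

/* rpath is the candidate relation file path for the new relfilenode */
static bool
relfile_exists(const char *rpath)
{
	if (access(rpath, F_OK) == 0)
		return true;			/* file exists: pick another relfilenode */
	if (errno != ENOENT)
		return true;			/* unexpected failure: be conservative */
	return false;				/* definitely absent: the OID is usable */
}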
Re: pg_ugprade test failure on data set with column with default value with type bit/varbit
Thanks. I tentatively submitted a patch (see the attachment).

By the way, the current pg_upgrade test script depends on the data left behind in the test database, but it seems that a lot of tables are dropped in those test SQL files, which considerably reduces pg_upgrade's test coverage. Maybe this needs to be addressed in the future. (Maybe pg_upgrade needs to be considered whenever anyone checks in test cases?)

2018-05-11 3:08 GMT+08:00 Robert Haas:
> On Fri, Mar 30, 2018 at 5:36 AM, Paul Guo wrote:
> > There is no diff in functionality of the dump SQLs, but it is annoying.
> > The simple patch below could fix this. Thanks.
> >
> > --- a/src/backend/utils/adt/ruleutils.c
> > +++ b/src/backend/utils/adt/ruleutils.c
> > @@ -9389,7 +9389,7 @@ get_const_expr(Const *constval, deparse_context
> > *context, int showtype)
> >
> >         case BITOID:
> >         case VARBITOID:
> > -           appendStringInfo(buf, "B'%s'", extval);
> > +           appendStringInfo(buf, "'%s'", extval);
> >             break;
> >
> >         case BOOLOID:
>
> My first reaction was to guess that this would be unsafe, but looking
> it over I don't see a problem. For the other types handled in that
> switch statement, we rely on the custom representation to avoid
> needing a typecast, but it seems that for BITOID and VARBITOID we
> insert a typecast no matter what. So maybe the presence or absence of
> the "B" makes no difference.
>
> This logic seems to have been added by commit
> c828ec88205a232a9789f157d8cf9c3d82f85152, Peter Eisentraut, vintage
> 2002.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

0001-Fix-pg_upgrade-test-failure-caused-by-the-DDL-below.patch Description: Binary data
Re: [PATCH] Use access() to check file existence in GetNewRelFileNode().
F_OK seems better than R_OK because we want to check file existence (not read permission) before creating the relation file at that path later.

2018-05-17 17:09 GMT+08:00 Michael Paquier:
> On Thu, May 17, 2018 at 04:09:27PM +0800, Paul Guo wrote:
> > Previous code uses BasicOpenFile() + close().
> >
> > access() should be faster than BasicOpenFile()+close() and access()
> > should be more correct since BasicOpenFile() could fail for various
> > cases (e.g. due to file permission, etc) even when the file exists.
>
> Failing because of file permissions would be correct. There have been
> cases in the past, particularly on Windows, where anti-virus software
> wildly scans files, causing EACCES at various points in the data folder.
>
> > access() is supported on Linux/Unix. I do not have a Windows dev
> > environment, but MSDN tells me that access() is supported on Windows
> > also, and there are already access() calls in the tree, so I assume
> > there is no portability issue.
>
> Yes, access() is spread already in the core code.
>
> -    fd = BasicOpenFile(rpath, O_RDONLY | PG_BINARY);
> -
> -    if (fd >= 0)
> +    if (access(rpath, F_OK) == 0)
>
> What you are looking for here is R_OK, no?
> --
> Michael