Re: Fix of fake unlogged LSN initialization

2019-10-24 Thread Dilip Kumar
On Sat, Oct 19, 2019 at 3:18 PM tsunakawa.ta...@fujitsu.com
 wrote:
>
> Hello,
>
>
> The attached trivial patch fixes the initialization of the fake unlogged LSN.
> Currently, BootstrapXLOG() in initdb sets the initial fake unlogged LSN to
> FirstNormalUnloggedLSN (=1000), but recovery and pg_resetwal set it to
> 1.  The patch modifies the latter two cases to match initdb.
>
> I don't know if this does actual harm, because the description of
> FirstNormalUnloggedLSN doesn't give me any idea.
>

I have noticed that in StartupXLOG we also reset it to 1; you might
want to fix that as well.

StartupXLOG
{
...
	/*
	 * Initialize unlogged LSN. On a clean shutdown, it's restored from the
	 * control file. On recovery, all unlogged relations are blown away, so
	 * the unlogged LSN counter can be reset too.
	 */
	if (ControlFile->state == DB_SHUTDOWNED)
		XLogCtl->unloggedLSN = ControlFile->unloggedLSN;
	else
		XLogCtl->unloggedLSN = 1;
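
For reference, a minimal sketch of the suggested change at this spot (an
illustration only, assuming the fix is simply to substitute the constant; the
actual patch may differ):

	/* Sketch, not the actual patch: agree with initdb's BootstrapXLOG(). */
	if (ControlFile->state == DB_SHUTDOWNED)
		XLogCtl->unloggedLSN = ControlFile->unloggedLSN;
	else
		XLogCtl->unloggedLSN = FirstNormalUnloggedLSN;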

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: dropdb --force

2019-10-24 Thread Amit Kapila
On Wed, Oct 23, 2019 at 12:59 PM Amit Kapila  wrote:
>
> On Tue, Oct 22, 2019 at 4:51 PM Pavel Stehule  wrote:
> >
> > On Tue, Oct 22, 2019 at 5:09 AM Amit Kapila wrote:
> >>
> >>
> >> CountOtherDBBackends is called from other places as well, so I don't
> >> think it is advisable to change the sleep time in that function.
> >> Also, I don't want to add a parameter for it.  I think you have a
> >> point that in some cases we might end up sleeping for 100ms when we
> >> could do with less sleeping time, but I think it is true to some
> >> extent today as well.  I think we can anyway change it in the future
> >> if there is a problem with the sleep timing, but for now, I think we
> >> can just call CountOtherDBBackends after sending SIGTERM and call it
> >> good.  You might want to add a futuristic note in the code.
> >>
> >
> > ok.
> >
> > I removed sleeping from TerminateOtherDBBackends().
> >
> > If you want to change any logic there, please, do it without any 
> > hesitations. Maybe I don't see, what you think.
> >
>
> Fair enough, I will see if I need to change anything.
>

While making some changes in the patch, I noticed that in
TerminateOtherDBBackends, there is a race condition where after we
release the ProcArrayLock, the backend process to which we decided to
send a signal exits by itself and the same pid can be assigned to
another backend which is connected to some other database.  This leads
to terminating a wrong backend.  I think we have some similar race
condition at a few other places like in pg_terminate_backend(),
ProcSleep() and CountOtherDBBackends().  I think here the risk is a
bit more because there could be a long list of pids.

One idea could be that we write a new function similar to IsBackendPid
which takes a dbid and ensures that the pid belongs to that database, and
use that before sending the kill signal, but it still will not be completely
safe.  But I think it would then be closer to the cases we already have in
the code.
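
For illustration, a minimal sketch of such a helper, built on the existing
BackendPidGetProc() (the name and shape here are hypothetical, not from the
attached patch):

/*
 * Hypothetical helper, illustrative only: like IsBackendPid(), but also
 * require that the backend is connected to the given database.  The pid can
 * still be recycled right after this check, so it only narrows the race.
 */
static bool
IsBackendPidForDatabase(int pid, Oid databaseid)
{
	PGPROC	   *proc = BackendPidGetProc(pid);

	return proc != NULL && proc->databaseId == databaseid;
}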

Another possible idea could be to use the SendProcSignal mechanism
similar to how we have used it in CancelDBBackends() to allow the
required backends to exit by themselves.  This might be safer.

I am not sure if we can entirely eliminate this race condition and
whether it is a good idea to accept such a race condition even though
it exists in other parts of code.  What do you think?

BTW, I have added/revised some comments in the code and done a few other
cosmetic changes, the result of which is attached.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


drop-database-force-20191024.amit.patch
Description: Binary data


Re: Ordering of header file inclusion

2019-10-24 Thread Amit Kapila
On Wed, Oct 23, 2019 at 10:23 AM Amit Kapila  wrote:
>
> Attached are patches for (a) and (b) after another round of review and
> fixes by Vignesh.   I am planning to commit the first one (a) tomorrow
> morning and then if everything is fine on buildfarm, I will commit the
> second one (b) and once both are good, I will look into the third one
> (c).   Another pair of eyes on these patches would be good.
>

I have pushed the first one.  I'll wait for some time and probably
commit the second one tomorrow.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Zedstore - compressed in-core columnar storage

2019-10-24 Thread Ashutosh Sharma
On Thu, Oct 17, 2019 at 2:11 PM Heikki Linnakangas  wrote:
>
> On 15/10/2019 13:49, Ashutosh Sharma wrote:
> > Hi,
> >
> > I got a chance to spend some time looking into the recent changes done
> > in the zedstore code, basically the functions for packing datums into
> > the attribute streams and handling attribute leaf pages. I didn't find
> > any issues, but there are some minor comments that I found while
> > reviewing. I have worked on those and attached is the patch with the
> > changes. See if the changes look meaningful to you.
>
> Thanks for looking! Applied to the development repository

Thank you. Here is one more observation:

When a zedstore table is queried using an *invalid* ctid, the server
crashes due to an assertion failure. See below:

postgres=# select * from t1 where ctid = '(0, 0)';
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

I believe the above should have either returned 0 rows or failed with a
user-friendly error.
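
As an illustration of the expected behaviour, a check along these lines early
in the fetch path (a sketch under my own assumptions, not zedstore's actual
code) would turn the crash into an ordinary "no rows" result:

	/* Sketch only: reject an invalid item pointer (e.g. offset 0) up front. */
	if (!ItemPointerIsValid(tid))
		return false;		/* caller then reports zero matching rows */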

Further, when the same table is queried using a non-existent ctid,
the query returns 0 rows. See below:

postgres=# select count(*) from t1;
 count
---
 2
(1 row)

postgres=# select * from t1 where ctid = '(0, 2)';
 a |  b
---+--
 2 | str2
(1 row)

postgres=# select * from t1 where ctid = '(0, 3)';
 a | b
---+---
(0 rows)

postgres=# select * from t1 where ctid = '(0, 4)';
 a | b
---+---
(0 rows)

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com




Re:Re: [BUG] standby node can not provide service even it replays all log files

2019-10-24 Thread Thunder
Thanks for the reply. I feel confused about the snapshot.

At 2019-10-23 11:51:19, "Kyotaro Horiguchi"  wrote:
>Hello.
>
>At Tue, 22 Oct 2019 20:42:21 +0800 (CST), Thunder   wrote in 
>> Update the patch.
>> 
>> 1. The STANDBY_SNAPSHOT_PENDING state is set when we replay the first 
>> XLOG_RUNNING_XACTS and the sub transaction ids are overflow.
>> 2. When we log XLOG_RUNNING_XACTS in master node, can we assume that all 
>> xact IDS < oldestRunningXid are considered finished?
>
>Unfortunately we can't. The standby needs to know that the *standby's*
>oldest active xid exceeds the pending xmin, not the master's. And it is
>already processed in ProcArrayApplyRecoveryInfo. We cannot assume that
>the oldest xids are the same on both sides of a replication pair.


This issue occurs when the master does not commit a transaction that has lots of
subtransactions while we restart or create a new standby node.
The standby node cannot provide service because of this issue.
Can the standby have any active xid while it cannot provide service?


>
>> 3. If we can assume this, when we replay XLOG_RUNNING_XACTS and change 
>> standbyState to STANDBY_SNAPSHOT_PENDING, can we record oldestRunningXid to 
>> a shared variable, like procArray->oldest_running_xid?
>> 4. In standby node when call GetSnapshotData if 
>> procArray->oldest_running_xid is valid, can we set xmin to be 
>> procArray->oldest_running_xid?
>> 
>> Appreciate any suggestion to this issue.
>
>At 2019-10-22 01:27:58, "Robert Haas"  wrote:
>>On Mon, Oct 21, 2019 at 4:13 AM Thunder  wrote:
>..
>> >I think that the issue you've encountered is design behavior.  In
>> >other words, it's intended to work that way.
>> >
>> >The comments for the code you propose to change say that we can allow
>> >connections once we've got a valid snapshot. So presumably the effect
>> >of your change would be to allow connections even though we don't have
>> >a valid snapshot.
>> >
>> >That seems bad.
>
>regards.
>
>-- 
>Kyotaro Horiguchi
>NTT Open Source Software Center
>


Re: v12.0: ERROR: could not find pathkey item to sort

2019-10-24 Thread Amit Langote
Sorry about the late reply.

On Mon, Oct 14, 2019 at 11:54 PM Tom Lane  wrote:
> Justin Pryzby  writes:
> > On Sun, Oct 13, 2019 at 01:30:29PM -0500, Justin Pryzby wrote:
> >> BTW it probably should've been documented as an "Open Item" for v12.
>
> > https://commitfest.postgresql.org/25/2278/
> > I realized possibly people were thinking of that as a "feature" and not a
> > bugfix for backpatch (?)
> > But, my issue is a query which worked under v11 PWJ but fails under v12
> > (apparently broken by d25ea01275).
>
> Yeah, this should have been dealt with as an open item, but it
> slipped through the cracks.  We'll make sure to get it fixed,
> one way or another, for 12.1.
>
> In view of the proposed patches being dependent on some other
> 13-only changes, I wonder if we should fix v12 by reverting
> d25ea0127.  The potential planner performance loss for large
> partition sets could be nasty, but failing to plan at all is worse.

Actually, the patch I proposed to fix the equivalence code can be applied
on its own.  The example I gave on that thread needs us to fix the
partitionwise code to even work, which is perhaps a 13-only change,
but we have an example here that is broken due to d25ea01275.
So applying my patch seems better than the alternatives, which are
either reverting d25ea01275 or avoiding partitionwise join for such
queries.

Since we've got 3373c71553 ("Speed up finding EquivalenceClasses for a
given set of rels") in HEAD, we need two versions of the patch; please
see the attached.

Thanks,
Amit


d25ea01275-fixup-v12.patch
Description: Binary data


d25ea01275-fixup-v13.patch
Description: Binary data


Re: WIP/PoC for parallel backup

2019-10-24 Thread Ibrar Ahmed
On Fri, Oct 18, 2019 at 4:12 PM Jeevan Chalke <
jeevan.cha...@enterprisedb.com> wrote:

>
>
> On Thu, Oct 17, 2019 at 10:51 AM Asif Rehman 
> wrote:
>
>>
>> Attached are the updated patches.
>>
>
> I had a quick look over these changes and they look good overall.
> However, here are a few review comments I caught while glancing through
> patches 0002 and 0003.
>
>
> --- 0002 patch
>
> 1.
> Can lsn option be renamed to start-wal-location? This will be more clear
> too.
>
> 2.
> +typedef struct
> +{
> +	char		name[MAXPGPATH];
> +	char		type;
> +	int32		size;
> +	time_t		mtime;
> +} BackupFile;
>
> I think it will be good if we keep this structure in a common place so that
> the client can also use it.
>
> 3.
> +	SEND_FILE_LIST,
> +	SEND_FILES_CONTENT,
> Can above two commands renamed to SEND_BACKUP_MANIFEST and SEND_BACKUP_FILE
> respectively?
> The reason behind the first name change is, we are not getting only file
> lists
> here instead we are getting a few more details with that too. And for
> others,
> it will be inline with START_BACKUP/STOP_BACKUP/SEND_BACKUP_MANIFEST.
>
> 4.
> Typos:
> non-exlusive => non-exclusive
> retured => returned
> optionaly => optionally
> nessery => necessary
> totoal => total
>
>
> --- 0003 patch
>
> 1.
> +static int
> +simple_list_length(SimpleStringList *list)
> +{
> +	int			len = 0;
> +	SimpleStringListCell *cell;
> +
> +	for (cell = list->head; cell; cell = cell->next, len++)
> +		;
> +
> +	return len;
> +}
>
> I think it will be good if it goes to simple_list.c. That will help in
> other
> usages as well.
>
> 2.
> Please revert these unnecessary changes:
>
> @@ -1475,6 +1575,7 @@ ReceiveAndUnpackTarFile(PGconn *conn, PGresult *res,
> int rownum)
>   */
>  snprintf(filename, sizeof(filename), "%s/%s", current_path,
>   copybuf);
> +
>  if (filename[strlen(filename) - 1] == '/')
>  {
>  /*
>
> @@ -1528,8 +1622,8 @@ ReceiveAndUnpackTarFile(PGconn *conn, PGresult *res,
> int rownum)
>   * can map them too.)
>   */
>  filename[strlen(filename) - 1] = '\0';/* Remove
> trailing slash */
> -
>  mapped_tblspc_path =
> get_tablespace_mapping(&copybuf[157]);
> +
>  if (symlink(mapped_tblspc_path, filename) != 0)
>  {
>  pg_log_error("could not create symbolic link from
> \"%s\" to \"%s\": %m",
>
> 3.
> Typos:
> retrive => retrieve
> takecare => take care
> tablespae => tablespace
>
> 4.
> ParallelBackupEnd() function does not do anything for parallelism. Will it
> be
> better to just rename it as EndBackup()?
>
> 5.
> To pass a tablespace path to the server in SEND_FILES_CONTENT, you are
> reusing
> a LABEL option, that seems odd. How about adding a new option for that?
>
> 6.
> It will be good if we have some comments explaining what the function is
> actually doing in its prologue. For functions like:
> GetBackupFilesList()
> ReceiveFiles()
> create_workers_and_fetch()
>
>
> Thanks
>
>
>>
>> Thanks,
>>
>> --
>> Asif Rehman
>> Highgo Software (Canada/China/Pakistan)
>> URL : www.highgo.ca
>>
>>
>
> --
> Jeevan Chalke
> Associate Database Architect & Team Lead, Product Development
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
>
>
I had a detailed discussion with Robert Haas at PostgreConf Europe about
parallel backup.
We discussed the current state of the patch and what needs to be done to
get the patch committed.

- The current patch uses processes to implement parallelism. There are many
reasons we need to use threads instead of processes. To start with, as this is
a client utility it makes more sense to use threads. The data needs to be
shared amongst the different threads and the main process, and handling that
is simpler than interprocess communication.

- Fetching a single file or multiple files was also discussed. We concluded in
our discussion that we need to benchmark to see if disk I/O is a bottleneck or
not and if parallel writing gives us any benefit. This benchmark needs to be
done on different hardware and different networks to identify the real
bottlenecks. In general, we agreed that we could start with fetching one file
at a time, but that will be revisited after the benchmarks are done.

- There is also an ongoing debate in this thread about whether we should have
one single tar file for all files or one tar file per thread. I really want to
have a single tar file, because the main purpose of the tar file is to reduce
the management of multiple files, but in the case of one file per thread we
end up with many tar files. Therefore we need to have one master thread which
is responsible for writing the tar file, and all the other threads will
receive the data from the network and stream it to the master thread. This
also supports the idea of using a thread-based model.

Re: [HACKERS] Block level parallel vacuum

2019-10-24 Thread Amit Kapila
On Thu, Oct 24, 2019 at 11:51 AM Dilip Kumar  wrote:
>
> On Fri, Oct 18, 2019 at 12:18 PM Dilip Kumar  wrote:
> >
> > On Fri, Oct 18, 2019 at 11:25 AM Amit Kapila  
> > wrote:
> > >
> > > I am thinking if we can write the patch for both the approaches (a.
> > > compute shared costs and try to delay based on that, b. try to divide
> > > the I/O cost among workers as described in the email above[1]) and do
> > > some tests to see the behavior of throttling, that might help us in
> > > deciding what is the best strategy to solve this problem, if any.
> > > What do you think?
> >
> > I agree with this idea.  I can come up with a POC patch for approach
> > (b).  Meanwhile, if someone is interested to quickly hack with the
> > approach (a) then we can do some testing and compare.  Sawada-san,
> > by any chance will you be interested to write POC with approach (a)?
> > Otherwise, I will try to write it after finishing the first one
> > (approach b).
> >
> I have come up with the POC for approach (a).
>

I think you mean to say approach (b).

> The idea is:
> 1) Before launching the workers, divide the current VacuumCostBalance
> among workers so that workers start accumulating the balance from that
> point.
> 2) Also, divide the VacuumCostLimit among the workers.
> 3) Once the workers are done with the index vacuum, send back the
> remaining balance to the leader.
> 4) The leader will sum all the balances and add that to its current
> VacuumCostBalance, and start accumulating its balance from this
> point.
>
> I was trying to test the behaviour of the vacuum I/O limit, but
> I could not find an easy way to test that, so I just put a tracepoint
> in the code and checked at what point we are giving the
> delay.
> I also printed the cost balance at various points to see after how
> much I/O accumulation we are hitting the delay.  Please feel free to
> suggest a better way to test this.
>

Can we compute the overall throttling (sleep time) in the operation
separately for heap and index, then divide the index's sleep_time by
the number of workers and add it to heap's sleep time?  Then, it will be
a bit easier to compare the data between the parallel and non-parallel
cases.
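
Something like the following tiny helper (illustrative only; the names are not
from any patch) captures that comparison:

static double
effective_vacuum_sleep(double heap_sleep_ms, double index_sleep_ms, int nworkers)
{
	/* normalize the index part by the worker count so runs are comparable */
	return heap_sleep_ms + index_sleep_ms / (nworkers > 0 ? nworkers : 1);
}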

> I have printed these logs for parallel vacuum patch (v30) vs v(30) +
> patch for dividing i/o limit (attached with the mail)
>
> Note: Patch and the test results are attached.
>

I think it is always a good idea to summarize the results and tell
your conclusion about it.  AFAICT, it seems to me this technique as
done in the patch might not work for cases where there is an uneven
amount of work done by parallel workers (say the index sizes vary,
maybe due to partial indexes, index column width, or some other
reason).  The reason is that when a worker finishes its
work we don't rebalance the cost among the other workers.  Can we generate
such a test and see how it behaves?  I think it might be possible to
address this if it turns out to be a problem.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Fix of fake unlogged LSN initialization

2019-10-24 Thread Simon Riggs
On Mon, 21 Oct 2019 at 06:03, Michael Paquier  wrote:

> On Sat, Oct 19, 2019 at 05:03:00AM +, tsunakawa.ta...@fujitsu.com
> wrote:
> > The attached trivial patch fixes the initialization of the fake
> > unlogged LSN.  Currently, BootstrapXLOG() in initdb sets the initial
> > fake unlogged LSN to FirstNormalUnloggedLSN (=1000), but
> > recovery and pg_resetwal set it to 1.  The patch modifies the
> > latter two cases to match initdb.
> >
> > I don't know if this does actual harm, because the description of
> > FirstNormalUnloggedLSN doesn't give me any idea.
>
> From xlogdefs.h added by 9155580:
> /*
>  * First LSN to use for "fake" LSNs.
>  *
>  * Values smaller than this can be used for special per-AM purposes.
>  */
> #define FirstNormalUnloggedLSN  ((XLogRecPtr) 1000)
>
> So it seems to me that you have caught a bug here, and that we had
> better back-patch to v12 so as recovery and pg_resetwal don't mess up
> with AMs using lower values than that.
>

I wonder why that value is 1000, rather than an aligned value or a whole
WAL page?

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Solutions for the Enterprise


Re: [HACKERS] Block level parallel vacuum

2019-10-24 Thread Dilip Kumar
On Thu, Oct 24, 2019 at 4:21 PM Amit Kapila  wrote:
>
> On Thu, Oct 24, 2019 at 11:51 AM Dilip Kumar  wrote:
> >
> > On Fri, Oct 18, 2019 at 12:18 PM Dilip Kumar  wrote:
> > >
> > > On Fri, Oct 18, 2019 at 11:25 AM Amit Kapila  
> > > wrote:
> > > >
> > > > I am thinking if we can write the patch for both the approaches (a.
> > > > compute shared costs and try to delay based on that, b. try to divide
> > > > the I/O cost among workers as described in the email above[1]) and do
> > > > some tests to see the behavior of throttling, that might help us in
> > > > deciding what is the best strategy to solve this problem, if any.
> > > > What do you think?
> > >
> > > I agree with this idea.  I can come up with a POC patch for approach
> > > (b).  Meanwhile, if someone is interested to quickly hack with the
> > > approach (a) then we can do some testing and compare.  Sawada-san,
> > > by any chance will you be interested to write POC with approach (a)?
> > > Otherwise, I will try to write it after finishing the first one
> > > (approach b).
> > >
> > I have come up with the POC for approach (a).
> >
>
> I think you mean to say approach (b).

Yeah, sorry for the confusion.  It's approach (b).
>
> > The idea is:
> > 1) Before launching the workers, divide the current VacuumCostBalance
> > among workers so that workers start accumulating the balance from that
> > point.
> > 2) Also, divide the VacuumCostLimit among the workers.
> > 3) Once the workers are done with the index vacuum, send back the
> > remaining balance to the leader.
> > 4) The leader will sum all the balances and add that to its current
> > VacuumCostBalance, and start accumulating its balance from this
> > point.
> >
> > I was trying to test the behaviour of the vacuum I/O limit, but
> > I could not find an easy way to test that, so I just put a tracepoint
> > in the code and checked at what point we are giving the
> > delay.
> > I also printed the cost balance at various points to see after how
> > much I/O accumulation we are hitting the delay.  Please feel free to
> > suggest a better way to test this.
> >
>
> Can we compute the overall throttling (sleep time) in the operation
> separately for heap and index, then divide the index's sleep_time by
> the number of workers and add it to heap's sleep time?  Then, it will be
> a bit easier to compare the data between the parallel and non-parallel
> cases.

Okay, I will try to do that.
>
> > I have printed these logs for parallel vacuum patch (v30) vs v(30) +
> > patch for dividing i/o limit (attached with the mail)
> >
> > Note: Patch and the test results are attached.
> >
>
> I think it is always a good idea to summarize the results and tell
> your conclusion about it.  AFAICT, it seems to me this technique as
> done in the patch might not work for cases where there is an uneven
> amount of work done by parallel workers (say the index sizes vary,
> maybe due to partial indexes, index column width, or some other
> reason).  The reason is that when a worker finishes its
> work we don't rebalance the cost among the other workers.

Right, that's one problem I observed.

> Can we generate
> such a test and see how it behaves?  I think it might be possible to
> address this if it turns out to be a problem.

Yeah, we can address this by rebalancing the cost.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: WIP/PoC for parallel backup

2019-10-24 Thread Asif Rehman
On Thu, Oct 24, 2019 at 3:21 PM Ibrar Ahmed  wrote:

>
>
> On Fri, Oct 18, 2019 at 4:12 PM Jeevan Chalke <
> jeevan.cha...@enterprisedb.com> wrote:
>
>>
>>
>> On Thu, Oct 17, 2019 at 10:51 AM Asif Rehman 
>> wrote:
>>
>>>
>>> Attached are the updated patches.
>>>
>>
>> I had a quick look over these changes and they look good overall.
>> However, here are a few review comments I caught while glancing through
>> patches 0002 and 0003.
>>
>>
>> --- 0002 patch
>>
>> 1.
>> Can lsn option be renamed to start-wal-location? This will be more clear
>> too.
>>
>> 2.
>> +typedef struct
>> +{
>> +	char		name[MAXPGPATH];
>> +	char		type;
>> +	int32		size;
>> +	time_t		mtime;
>> +} BackupFile;
>>
>> I think it will be good if we keep this structure in a common place so
>> that
>> the client can also use it.
>>
>> 3.
>> +	SEND_FILE_LIST,
>> +	SEND_FILES_CONTENT,
>> Can above two commands renamed to SEND_BACKUP_MANIFEST and
>> SEND_BACKUP_FILE
>> respectively?
>> The reason behind the first name change is, we are not getting only file
>> lists
>> here instead we are getting a few more details with that too. And for
>> others,
>> it will be inline with START_BACKUP/STOP_BACKUP/SEND_BACKUP_MANIFEST.
>>
>> 4.
>> Typos:
>> non-exlusive => non-exclusive
>> retured => returned
>> optionaly => optionally
>> nessery => necessary
>> totoal => total
>>
>>
>> --- 0003 patch
>>
>> 1.
>> +static int
>> +simple_list_length(SimpleStringList *list)
>> +{
>> +	int			len = 0;
>> +	SimpleStringListCell *cell;
>> +
>> +	for (cell = list->head; cell; cell = cell->next, len++)
>> +		;
>> +
>> +	return len;
>> +}
>>
>> I think it will be good if it goes to simple_list.c. That will help in
>> other
>> usages as well.
>>
>> 2.
>> Please revert these unnecessary changes:
>>
>> @@ -1475,6 +1575,7 @@ ReceiveAndUnpackTarFile(PGconn *conn, PGresult
>> *res, int rownum)
>>   */
>>  snprintf(filename, sizeof(filename), "%s/%s", current_path,
>>   copybuf);
>> +
>>  if (filename[strlen(filename) - 1] == '/')
>>  {
>>  /*
>>
>> @@ -1528,8 +1622,8 @@ ReceiveAndUnpackTarFile(PGconn *conn, PGresult
>> *res, int rownum)
>>   * can map them too.)
>>   */
>>  filename[strlen(filename) - 1] = '\0';/* Remove
>> trailing slash */
>> -
>>  mapped_tblspc_path =
>> get_tablespace_mapping(&copybuf[157]);
>> +
>>  if (symlink(mapped_tblspc_path, filename) != 0)
>>  {
>>  pg_log_error("could not create symbolic link
>> from \"%s\" to \"%s\": %m",
>>
>> 3.
>> Typos:
>> retrive => retrieve
>> takecare => take care
>> tablespae => tablespace
>>
>> 4.
>> ParallelBackupEnd() function does not do anything for parallelism. Will
>> it be
>> better to just rename it as EndBackup()?
>>
>> 5.
>> To pass a tablespace path to the server in SEND_FILES_CONTENT, you are
>> reusing
>> a LABEL option, that seems odd. How about adding a new option for that?
>>
>> 6.
>> It will be good if we have some comments explaining what the function is
>> actually doing in its prologue. For functions like:
>> GetBackupFilesList()
>> ReceiveFiles()
>> create_workers_and_fetch()
>>
>>
>> Thanks
>>
>>
>>>
>>> Thanks,
>>>
>>> --
>>> Asif Rehman
>>> Highgo Software (Canada/China/Pakistan)
>>> URL : www.highgo.ca
>>>
>>>
>>
>> --
>> Jeevan Chalke
>> Associate Database Architect & Team Lead, Product Development
>> EnterpriseDB Corporation
>> The Enterprise PostgreSQL Company
>>
>>
> I had a detailed discussion with Robert Haas at PostgreConf Europe about
> parallel backup.
> We discussed the current state of the patch and what needs to be done to
> get the patch committed.
>
> - The current patch uses processes to implement parallelism. There are many
> reasons we need to use threads instead of processes. To start with, as this is
> a client utility it makes more sense to use threads. The data needs to be
> shared amongst the different threads and the main process, and handling that
> is simpler than interprocess communication.
>

Yes, I agree. I have already converted the code to use threads instead of
processes. This avoids the overhead of interprocess communication.

With a single-file fetching strategy, the competing threads/processes need to
communicate with each other. Handling that in a multiprocess application
requires IPC; the current approach of multiple threads instead of processes
avoids this overhead.
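
As a rough sketch of that handoff (purely illustrative, under my own
assumptions; not the code in the patch), worker threads can pass received
chunks to the single tar-writing thread over a mutex-protected slot, with no
interprocess machinery involved:

#include <pthread.h>
#include <stddef.h>

/* Sketch only: a one-slot handoff between receiver threads and the writer.
 * The mutex and condition variables are assumed to be initialized elsewhere
 * (e.g. with pthread_mutex_init/pthread_cond_init). */
typedef struct
{
	pthread_mutex_t lock;
	pthread_cond_t	nonempty;
	pthread_cond_t	empty;
	char	   *chunk;			/* NULL means the slot is free */
	size_t		len;
} ChunkSlot;

static void
slot_put(ChunkSlot *s, char *chunk, size_t len)
{
	pthread_mutex_lock(&s->lock);
	while (s->chunk != NULL)
		pthread_cond_wait(&s->empty, &s->lock);
	s->chunk = chunk;
	s->len = len;
	pthread_cond_signal(&s->nonempty);
	pthread_mutex_unlock(&s->lock);
}

static char *
slot_take(ChunkSlot *s, size_t *len)
{
	char	   *chunk;

	pthread_mutex_lock(&s->lock);
	while (s->chunk == NULL)
		pthread_cond_wait(&s->nonempty, &s->lock);
	chunk = s->chunk;
	*len = s->len;
	s->chunk = NULL;
	pthread_cond_signal(&s->empty);
	pthread_mutex_unlock(&s->lock);
	return chunk;
}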


> - Fetching a single file or multiple files was also discussed. We concluded in
> our discussion that we need to benchmark to see if disk I/O is a bottleneck or
> not and if parallel writing gives us any benefit. This benchmark needs to be
> done on different hardware and different networks to identify the real
> bottlenecks. In general, we agreed that we could start with fetching one file
> at a time, but that will be revisited after the benchmarks are done.

Re: Fix of fake unlogged LSN initialization

2019-10-24 Thread Michael Paquier
On Thu, Oct 24, 2019 at 11:57:33AM +0100, Simon Riggs wrote:
> I wonder why is that value 1000, rather than an aligned value or a whole
> WAL page?

Good question.  Heikki, why this choice?
--
Michael




Re: pgbench - extend initialization phase control

2019-10-24 Thread Fabien COELHO


Hello Masao-san,


>> The benefit of controlling where begin/end actually occur is that it may
>> have an impact on performance, and it allows to check that.
>
> I still fail to understand the benefit of addition of () settings.
> Could you clarify what case () settings are useful for? You are
> thinking to execute all initialization SQL statements within
> single transaction, e.g., -I (dtgp), for some reasons?

Yep. Or anything else, including without (), to allow checking the
performance impact or non impact of transactions on the initialization
phase.

>>> When using ( and ) with the -I, the documentation should indicate that
>>> double quotes are required,
>>
>> Or single quotes, or backslash, if launched from the command line. I added a
>> mention of escaping or protection in the doc in that case.
>
> What about using, for example, b (BEGIN) and c (COMMIT) instead
> to avoid such restriction?

It is indeed possible. Using an open/close symmetric character ((), {},
[]) looks more pleasing and allows to see easily whether everything is
properly closed. I switched to {}, which does not generate the same quoting
issue in shell.

> I think that it's better to check whether "v" is enclosed with () or not
> at the beginning of pgbench, and report an error if it is.
>
> Otherwise, if -I (dtgv) is specified, pgbench reports an error after
> time-consuming data generation is performed, and of course that data
> generation is rolled back.

Patch v5, attached, adds a check for v inside (), although I'm not keen on
putting it there, and uses {} instead of ().


--
Fabien.diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index e3a0abb4c7..e150209b4e 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -167,7 +167,7 @@ pgbench  options  d
 init_steps specifies the
 initialization steps to be performed, using one character per step.
 Each step is invoked in the specified order.
-The default is dtgvp.
+The default is dt{g}vp.
 The available steps are:
 
 
@@ -193,12 +193,34 @@ pgbench  options  d
   
  
  
-  g (Generate data)
+  { (begin transaction)
+  
+   
+Emit a BEGIN.
+   
+   
+Beware that some steps may not work when called within an explicit transaction.
+   
+  
+ 
+ 
+  g or G (Generate data, client or server side)
   

 Generate data and load it into the standard tables,
 replacing any data already present.

+   
+When data is generated server side, there is no progress output.
+   
+  
+ 
+ 
+  } (commit transaction)
+  
+   
+Emit a COMMIT.
+   
   
  
  
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index e72ad0036e..be10630926 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -130,7 +130,13 @@ static int	pthread_join(pthread_t th, void **thread_return);
 /
  * some configurable parameters */
 
-#define DEFAULT_INIT_STEPS "dtgvp"	/* default -I setting */
+/*
+ * we do all data generation in one transaction to enable the backend's
+ * data-loading optimizations
+ */
+#define DEFAULT_INIT_STEPS "dt{g}vp"	/* default -I setting */
+
+#define ALL_INIT_STEPS "dtgGvpf{}"		/* all possible steps */
 
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
@@ -626,7 +632,7 @@ usage(void)
 		   "  %s [OPTION]... [DBNAME]\n"
 		   "\nInitialization options:\n"
 		   "  -i, --initialize invokes initialization mode\n"
-		   "  -I, --init-steps=[dtgvpf]+ (default \"dtgvp\")\n"
+		   "  -I, --init-steps=[" ALL_INIT_STEPS "]+ (default \"" DEFAULT_INIT_STEPS "\")\n"
 		   "   run selected initialization steps\n"
 		   "  -F, --fillfactor=NUM set fill factor\n"
 		   "  -n, --no-vacuum  do not run VACUUM during initialization\n"
@@ -3802,10 +3808,23 @@ append_fillfactor(char *opts, int len)
 }
 
 /*
- * Fill the standard tables with some data
+ * truncate away any old data, in one command in case there are foreign keys
  */
 static void
-initGenerateData(PGconn *con)
+initTruncateTables(PGconn *con)
+{
+	executeStatement(con, "truncate table "
+	 "pgbench_accounts, "
+	 "pgbench_branches, "
+	 "pgbench_history, "
+	 "pgbench_tellers");
+}
+
+/*
+ * Fill the standard tables with some data, from the client side
+ */
+static void
+initGenerateDataClientSide(PGconn *con)
 {
 	char		sql[256];
 	PGresult   *res;
@@ -3819,23 +3838,9 @@ initGenerateData(PGconn *con)
 remaining_sec;
 	int			log_interval = 1;
 
-	fprintf(stderr, "generating data...\n");
+	fprintf(stderr, "generating data client side...\n");
 
-	/*
-	 * 

Re: pgbench - extend initialization phase control

2019-10-24 Thread Fujii Masao
On Thu, Oct 24, 2019 at 9:16 PM Fabien COELHO  wrote:
>
>
> Hello Masao-san,
>
> >> The benefit of controlling where begin/end actually occur is that it may
> >> have an impact on performance, and it allows to check that.
> >
> > I still fail to understand the benefit of addition of () settings.
> > Could you clarify what case () settings are useful for? You are
> > thinking to execute all initialization SQL statements within
> > single transaction, e.g., -I (dtgp), for some reasons?
>
> Yep. Or anything else, including without (), to allow checking the
> performance impact or non impact of transactions on the initialization
> phase.

Is there actually such a performance impact? AFAIR the most time-consuming part
of the initialization phase is the generation of pgbench_accounts data. This
part is performed within a single transaction whether () are specified or not,
no? So I'm not sure how () are useful for checking the performance impact in
the init phase. Maybe I'm missing something...

> >>> When using ( and ) with the -I, the documentation should indicate that
> >>> double quotes are required,
> >>
> >> Or single quotes, or backslash, if launched from the command line. I added a
> >> mention of escaping or protection in the doc in that case.
> >
> > What about using, for example, b (BEGIN) and c (COMMIT) instead
> > to avoid such restriction?
>
> It is indeed possible. Using an open/close symmetric character ((), {},
> []) looks more pleasing and allows to see easily whether everything is
> properly closed. I switched to {}, which does not generate the same quoting
> issue in shell.
>
> > I think that it's better to check whether "v" is enclosed with () or not
> > at the beginning of pgbench, and report an error if it is.
> >
> > Otherwise, if -I (dtgv) is specified, pgbench reports an error after
> > time-consuming data generation is performed, and of course that data
> > generation is rolled back.
>
> Patch v5, attached, adds a check for v inside (), although I'm not keen on
> putting it there, and uses {} instead of ().

Thanks for updating the patch!

Regards,

-- 
Fujii Masao




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-10-24 Thread Amit Kapila
On Tue, Oct 22, 2019 at 10:30 AM Dilip Kumar  wrote:
>
> I have merged bugs_and_review_comments_fix.patch changes to 0001 and 0002.
>

I was wondering whether we have checked the code coverage after this
patch.  Previously, the existing tests seem to cover most parts
of the function ReorderBufferSerializeTXN [1].  After this patch, the
timing of calls to ReorderBufferSerializeTXN will change, so that might
impact its test coverage.  If it is already covered, then I
would like to either add a new test or extend an existing test with the
help of the new spill counters.  If it is not getting covered, then we
need to think of extending an existing test or writing a new test to
cover the function ReorderBufferSerializeTXN.

[1] - 
https://coverage.postgresql.org/src/backend/replication/logical/reorderbuffer.c.gcov.html

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Block level parallel vacuum

2019-10-24 Thread Masahiko Sawada
On Thu, Oct 24, 2019 at 3:21 PM Dilip Kumar  wrote:
>
> On Fri, Oct 18, 2019 at 12:18 PM Dilip Kumar  wrote:
> >
> > On Fri, Oct 18, 2019 at 11:25 AM Amit Kapila  
> > wrote:
> > >
> > > On Fri, Oct 18, 2019 at 8:45 AM Dilip Kumar  wrote:
> > > >
> > > > On Thu, Oct 17, 2019 at 4:00 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > > > On Thu, Oct 17, 2019 at 3:25 PM Dilip Kumar  
> > > > > wrote:
> > > > > >
> > > > > > On Thu, Oct 17, 2019 at 2:12 PM Masahiko Sawada 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Thu, Oct 17, 2019 at 5:30 PM Amit Kapila 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > Another point in this regard is that the user anyway has an 
> > > > > > > > option to
> > > > > > > > turn off the cost-based vacuum.  By default, it is anyway 
> > > > > > > > disabled.
> > > > > > > > So, if the user enables it we have to provide some sensible 
> > > > > > > > behavior.
> > > > > > > > If we can't come up with anything, then, in the end, we might 
> > > > > > > > want to
> > > > > > > > turn it off for a parallel vacuum and mention the same in docs, 
> > > > > > > > but I
> > > > > > > > think we should try to come up with a solution for it.
> > > > > > >
> > > > > > > I finally got your point and now understood the need. And the 
> > > > > > > idea I
> > > > > > > proposed doesn't work fine.
> > > > > > >
> > > > > > > So you meant that all workers share the cost count and if a 
> > > > > > > parallel
> > > > > > > vacuum worker increase the cost and it reaches the limit, does the
> > > > > > > only one worker sleep? Is that okay even though other parallel 
> > > > > > > workers
> > > > > > > are still running and then the sleep might not help?
> > > > > > >
> > > > >
> > > > > Remember that the other running workers will also increase
> > > > > VacuumCostBalance and whichever worker finds that it becomes greater
> > > > > than VacuumCostLimit will reset its value and sleep.  So, won't this
> > > > > make sure that overall throttling works the same?
> > > > >
> > > > > > I agree with this point.  There is a possibility that some of the
> > > > > > workers who are doing heavy I/O continue to work and OTOH other
> > > > > > workers who are doing very less I/O might become the victim and
> > > > > > unnecessarily delay its operation.
> > > > > >
> > > > >
> > > > > Sure, but will it impact the overall I/O?  I mean to say the rate
> > > > > limit we want to provide for overall vacuum operation will still be
> > > > > the same.  Also, isn't a similar thing happens now also where heap
> > > > > might have done a major portion of I/O but soon after we start
> > > > > vacuuming the index, we will hit the limit and will sleep.
> > > >
> > > > Actually, What I meant is that the worker who performing actual I/O
> > > > might not go for the delay and another worker which has done only CPU
> > > > operation might pay the penalty?  So basically the worker who is doing
> > > > CPU intensive operation might go for the delay and pay the penalty and
> > > > the worker who is performing actual I/O continues to work and do
> > > > further I/O.  Do you think this is not a practical problem?
> > > >
> > >
> > > I don't know.  Generally, we try to delay (if required) before
> > > processing (read/write) one page which means it will happen for I/O
> > > intensive operations, so I am not sure if the point you are making is
> > > completely correct.
> >
> > Ok, I agree with the point that we are checking it only when we are
> > doing the I/O operation.  But, we also need to consider that each I/O
> > operations have a different weightage.  So even if we have a delay
> > point at I/O operation there is a possibility that we might delay the
> > worker which is just performing read buffer with page
> > hit(VacuumCostPageHit).  But, the other worker who is actually
> > dirtying the page(VacuumCostPageDirty = 20) continue the work and do
> > more I/O.
> >
> > >
> > > > Stepping back a bit,  OTOH, I think that we can not guarantee that the
> > > > one worker who has done more I/O will continue to do further I/O and
> > > > the one which has not done much I/O will not perform more I/O in
> > > > future.  So it might not be too bad if we compute shared costs as you
> > > > suggested above.
> > > >
> > >
> > > I am thinking if we can write the patch for both the approaches (a.
> > > compute shared costs and try to delay based on that, b. try to divide
> > > the I/O cost among workers as described in the email above[1]) and do
> > > some tests to see the behavior of throttling, that might help us in
> > > deciding what is the best strategy to solve this problem, if any.
> > > What do you think?
> >
> > I agree with this idea.  I can come up with a POC patch for approach
> > (b).  Meanwhile, if someone is interested to quickly hack with the
> > approach (a) then we can do some testing and compare.  Sawada-san,
> > by any chance will you be interested to write POC with approach (a)?
> > > Otherwise, I will try to write it after finishing the first one
> > > (approach b).

Re: dropdb --force

2019-10-24 Thread Pavel Stehule
On Thu, Oct 24, 2019 at 11:10 AM Amit Kapila wrote:

> On Wed, Oct 23, 2019 at 12:59 PM Amit Kapila 
> wrote:
> >
> > On Tue, Oct 22, 2019 at 4:51 PM Pavel Stehule 
> wrote:
> > >
> > > On Tue, Oct 22, 2019 at 5:09 AM Amit Kapila wrote:
> > >>
> > >>
> > >> CountOtherDBBackends is called from other places as well, so I don't
> > >> think it is advisable to change the sleep time in that function.
> > >> Also, I don't want to add a parameter for it.  I think you have a
> > >> point that in some cases we might end up sleeping for 100ms when we
> > >> could do with less sleeping time, but I think it is true to some
> > >> extent today as well.  I think we can anyway change it in the future
> > >> if there is a problem with the sleep timing, but for now, I think we
> > >> can just call CountOtherDBBackends after sending SIGTERM and call it
> > >> good.  You might want to add a futuristic note in the code.
> > >>
> > >
> > > ok.
> > >
> > > I removed sleeping from TerminateOtherDBBackends().
> > >
> > > If you want to change any logic there, please, do it without any
> hesitations. Maybe I don't see, what you think.
> > >
> >
> > Fair enough, I will see if I need to change anything.
> >
>
> While making some changes in the patch, I noticed that in
> TerminateOtherDBBackends, there is a race condition where after we
> release the ProcArrayLock, the backend process to which we decided to
> send a signal exits by itself and the same pid can be assigned to
> another backend which is connected to some other database.  This leads
> to terminating a wrong backend.  I think we have some similar race
> condition at few other places like in pg_terminate_backend(),
> ProcSleep() and CountOtherDBBackends().  I think here the risk is a
> bit more because there could be a long list of pids.
>
> One idea could be that we write a new function similar to IsBackendPid
> which takes dbid and ensures that pid belongs to that database and use
> that before sending kill signal, but still it will not be completely
> safe.  But, I think it can be closer to cases like we already have in
> code.
>
> Another possible idea could be to use the SendProcSignal mechanism
> similar to how we have used it in CancelDBBackends() to allow the
> required backends to exit by themselves.  This might be safer.
>
> I am not sure if we can entirely eliminate this race condition and
> whether it is a good idea to accept such a race condition even though
> it exists in other parts of code.  What do you think?
>
> BTW, I have added/revised some comments in the code and done few other
> cosmetic changes, the result of which is attached.
>

Tomorrow I'll check the variants that you mentioned.

We can be sure that there are no new connections to the database being
removed, because we hold a lock there. So checking whether the database
oid is the same should be enough.

Pavel

>
>
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>


Re: WIP: System Versioned Temporal Table

2019-10-24 Thread Surafel Temesgen
hi Vik,
On Wed, Oct 23, 2019 at 9:02 PM Vik Fearing 
wrote:


>
> If we're going to be implicitly adding stuff to the PK, we also need to
> add that stuff to the other unique constraints, no?  And I think it
> would be better to add both the start and the end column to these keys.
> Most of the temporal queries will be accessing both.
>
>
Yes, it has to be added to the other constraints too, but adding both
system time columns to the PK will violate the constraint because it
allows multiple rows in the current data.


>
> Why aren't you following the standard syntax here?
>
>
>
Because we already have TIME and SYSTEM_P as keywords, and I am not sure
whether it is right to add another keyword that contains those two words
concatenated.


> > Any enlightenment?
> >
>
> There are quite a lot of typos and other things that aren't written "the
> Postgres way". But before I comment on any of that, I'd like to see the
> features be implemented correctly according to the SQL standard.
>

It is almost in SQL standard syntax except for the small difference above.
I can correct it and post a more complete patch soon.

regards
Surafel


Re: pgbench - extend initialization phase control

2019-10-24 Thread Fabien COELHO



Hello,


>> Yep. Or anything else, including without (), to allow checking the
>> performance impact or non impact of transactions on the initialization
>> phase.
>
> Is there actually such a performance impact? AFAIR the most time-consuming part
> of the initialization phase is the generation of pgbench_accounts data.

Maybe. If you cannot check, you can only guess. Probably it should be
small, but the current version does not allow to check whether it is so.


--
Fabien.




Add json_object(text[], json[])?

2019-10-24 Thread Paul Jungwirth

Hello,

I noticed that our existing 2-param json{,b}_object functions take 
text[] for both keys and values, so they are only able to build 
one-layer-deep JSON objects. I'm interested in adding json{,b}_object 
functions that take text[] for the keys and json{,b}[] for the values. 
It would otherwise behave the same as json_object(text[], text[]) (e.g. 
re NULL handling). Does that seem worthwhile to anyone?


I'll share my specific problem where I felt I could use this function, 
although you can stop reading here if that isn't interesting to you. :-) 
I was building a jsonb_dasherize(j jsonb) function, which converts 
snake_case JSON keys into dashed-case JSON keys. (It's because of a 
Javascript framework :-) My function needs to walk the whole JSON 
structure, doing this recursively when it sees objects inside arrays or 
other objects. Here is the definition, including a comment where my 
proposed jsonb_object would have helped:


CREATE FUNCTION jsonb_dasherize(j jsonb)
RETURNS jsonb
IMMUTABLE
AS
$$
DECLARE
t text;
key text;
val jsonb;
ret jsonb;
BEGIN
  t := jsonb_typeof(j);
  IF t = 'object' THEN
-- So close! If only jsonb_object took text[] and jsonb[] params
-- SELECT  jsonb_object(
--   array_agg(dasherize_key(k)),
--   array_agg(jsonb_dasherize(v)))
-- FROMjsonb_each(j) AS t(k, v);
ret := '{}';
FOR key, val IN SELECT * FROM jsonb_each(j) LOOP
  ret := jsonb_set(ret,
   array[REPLACE(key, '_', '-')],
   jsonb_dasherize(val), true);
END LOOP;
RETURN ret;
  ELSIF t = 'array' THEN
SELECT  COALESCE(jsonb_agg(jsonb_dasherize(elem)), '[]')
INTOret
FROMjsonb_array_elements(j) AS t(elem);
RETURN ret;
  ELSIF t IS NULL THEN
-- This should never happen internally
-- but only from a passed-in NULL.
RETURN NULL;
  ELSE
-- string/number/null:
RETURN j;
  END IF;
END;
$$
LANGUAGE plpgsql;

I also tried a recursive CTE there using jsonb_set, but it was too late 
at night for me to figure that one out. :-)


It seems like a json-taking json_object would be just what I needed. And 
in general I was surprised that Postgres didn't have a more convenient 
way to build multi-layer JSON. I'm happy to add this myself if other 
folks want it.


Regards,

--
Paul  ~{:-)
p...@illuminatedcomputing.com




Re: Add json_object(text[], json[])?

2019-10-24 Thread Nikita Glukhov


On 24.10.2019 18:17, Paul Jungwirth wrote:

Hello,

I noticed that our existing 2-param json{,b}_object functions take 
text[] for both keys and values, so they are only able to build 
one-layer-deep JSON objects. I'm interested in adding json{,b}_object 
functions that take text[] for the keys and json{,b}[] for the values. 
It would otherwise behave the same as json_object(text[], text[]) 
(e.g. re NULL handling). Does that seem worthwhile to anyone?


I'll share my specific problem where I felt I could use this function, 
although you can stop reading here if that isn't interesting to you. 
:-) I was building a jsonb_dasherize(j jsonb) function, which converts 
snake_case JSON keys into dashed-case JSON keys. (It's because of a 
Javascript framework :-) My function needs to walk the whole JSON 
structure, doing this recursively when it sees objects inside arrays 
or other objects. Here is the definition, including a comment where my 
proposed jsonb_object would have helped:


CREATE FUNCTION jsonb_dasherize(j jsonb)
RETURNS jsonb
IMMUTABLE
AS
$$
DECLARE
t text;
key text;
val jsonb;
ret jsonb;
BEGIN
  t := jsonb_typeof(j);
  IF t = 'object' THEN
    -- So close! If only jsonb_object took text[] and jsonb[] params
    -- SELECT  jsonb_object(
    --   array_agg(dasherize_key(k)),
    --   array_agg(jsonb_dasherize(v)))
    -- FROM    jsonb_each(j) AS t(k, v);
    ret := '{}';
    FOR key, val IN SELECT * FROM jsonb_each(j) LOOP
  ret := jsonb_set(ret,
   array[REPLACE(key, '_', '-')],
   jsonb_dasherize(val), true);
    END LOOP;
    RETURN ret;
  ELSIF t = 'array' THEN
    SELECT  COALESCE(jsonb_agg(jsonb_dasherize(elem)), '[]')
    INTO    ret
    FROM    jsonb_array_elements(j) AS t(elem);
    RETURN ret;
  ELSIF t IS NULL THEN
    -- This should never happen internally
    -- but only from a passed-in NULL.
    RETURN NULL;
  ELSE
    -- string/number/null:
    RETURN j;
  END IF;
END;
$$
LANGUAGE plpgsql;

I also tried a recursive CTE there using jsonb_set, but it was too 
late at night for me to figure that one out. :-)


It seems like a json-taking json_object would be just what I needed. 
And in general I was surprised that Postgres didn't have a more 
convenient way to build multi-layer JSON. I'm happy to add this myself 
if other folks want it.


Regards,



You can simply use jsonb_object_agg() to build a jsonb object from a sequence
of transformed key-value pairs:

SELECT COALESCE(jsonb_object_agg(REPLACE(k, '_', '-'),
 jsonb_dasherize(v)), '{}')
INTO ret
FROM jsonb_each(j) AS t(k, v);

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: WIP: System Versioned Temporal Table

2019-10-24 Thread Vik Fearing
On 24/10/2019 16:54, Surafel Temesgen wrote:
>
> hi Vik,
> On Wed, Oct 23, 2019 at 9:02 PM Vik Fearing <vik.fear...@2ndquadrant.com> wrote:
>  
>
>
> If we're going to be implicitly adding stuff to the PK, we also
> need to
> add that stuff to the other unique constraints, no?  And I think it
> would be better to add both the start and the end column to these
> keys. 
> Most of the temporal queries will be accessing both.
>
>  
> Yes, it has to be added to the other constraints too, but adding both
> system time columns to the PK will violate the constraint because it
> allows multiple rows in the current data.


I don't understand what you mean by this.


>  
>
>
> Why aren't you following the standard syntax here?
>
>
>
> Because we already have TIME and SYSTEM_P as keywords, and I am not sure
> whether it is right to add another keyword that contains those two words
> concatenated.


Yes, we have to do that.


>  
>  
>
> > Any enlightenment?
> >
>
> There are quite a lot of typos and other things that aren't
> written "the
> Postgres way". But before I comment on any of that, I'd like to
> see the
> features be implemented correctly according to the SQL standard.
>
>
> it is almost in sql standard syntax except the above small difference.
> i can correct it 
> and post more complete patch soon.


I don't mean just the SQL syntax, I also mean the behavior.





Re: Add json_object(text[], json[])?

2019-10-24 Thread Tom Lane
Paul Jungwirth  writes:
> I noticed that our existing 2-param json{,b}_object functions take 
> text[] for both keys and values, so they are only able to build 
> one-layer-deep JSON objects. I'm interested in adding json{,b}_object 
> functions that take text[] for the keys and json{,b}[] for the values. 
> It would otherwise behave the same as json_object(text[], text[]) (e.g. 
> re NULL handling). Does that seem worthwhile to anyone?

I think a potential problem is creation of ambiguity where there was
none before.  I prototyped this as

regression=# create function jsonb_object(text[], jsonb[]) returns jsonb
as 'select jsonb_object($1, $2::text[])' language sql;
CREATE FUNCTION

and immediately got

regression=# explain select jsonb_object('{a}', '{b}');
ERROR:  function jsonb_object(unknown, unknown) is not unique
LINE 1: explain select jsonb_object('{a}', '{b}');
   ^
HINT:  Could not choose a best candidate function. You might need to add 
explicit type casts.

which is something that works fine as long as there's only one
jsonb_object().  I'm not sure whether that's a big problem in
practice --- it seems like it will resolve successfully as long
as at least one input isn't an unknown literal.  But it could be
a problem for prepared statements, or clients using APIs that
involve prepared statements under the hood:

regression=# prepare foo as select jsonb_object($1,$2);
ERROR:  function jsonb_object(unknown, unknown) is not unique

Also, as the prototype implementation shows, it's not like you
can't get this functionality today ... you just need to cast
jsonb to text.  Admittedly that's annoying and wasteful.

regards, tom lane




psql tab-complete

2019-10-24 Thread Victor Spirin
I found a problem with tab-completion in version 12.  I checked
this on Windows.



Victor Spirin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index e00dbab5aa..5d7f24e57a 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -3577,7 +3577,7 @@ psql_completion(const char *text, int start, int end)
 
 /* WHERE */
/* Simple case of the word before the where being the table name */
-   else if (TailMatches(MatchAny, "WHERE"))
+   else if (TailMatches("WHERE", MatchAny))
COMPLETE_WITH_ATTR(prev2_wd, "");
 
 /* ... FROM ... */


Re: Add json_object(text[], json[])?

2019-10-24 Thread Paul A Jungwirth
On Thu, Oct 24, 2019 at 8:52 AM Tom Lane  wrote:
> I think a potential problem is creation of ambiguity where there was
> none before.

I agree that's not nice, and it seems like a new name might be better.

> Also, as the prototype implementation shows, it's not like you
> can't get this functionality today ... you just need to cast
> jsonb to text.  Admittedly that's annoying and wasteful.

I don't think that gives the same result, does it? For example:

# select jsonb_object(array['foo'], array['[{"bar-bar": ["baz"]}]'::jsonb]);
 jsonb_object
---
 {"foo": "[{\"bar-bar\": [\"baz\"]}]"}

You can see the values are JSON strings, not JSON arrays/objects/etc.

Regards,
Paul




Re: Add json_object(text[], json[])?

2019-10-24 Thread Paul A Jungwirth
On Thu, Oct 24, 2019 at 8:45 AM Nikita Glukhov  wrote:
> You can simply use jsonb_object_agg() to build a jsonb object from a sequence
> of transformed key-value pairs:

 I've even used that function before. :-) I tried
finding it on the JSON functions page but couldn't, so I thought maybe
I was going crazy. Of course it's on the aggregates page instead. As I
said it was late at night. :-) Your version works perfectly!

Even still, it may be nice to have a non-aggregate function that lets
you build nested JSON. But I agree jsonb_object_agg makes it less
needful.

Thanks!
Paul




Re: v12.0: ERROR: could not find pathkey item to sort

2019-10-24 Thread Tom Lane
Amit Langote  writes:
> On Mon, Oct 14, 2019 at 11:54 PM Tom Lane  wrote:
>> In view of the proposed patches being dependent on some other
>> 13-only changes, I wonder if we should fix v12 by reverting
>> d25ea0127.  The potential planner performance loss for large
>> partition sets could be nasty, but failing to plan at all is worse.

> Actually, the patch I proposed to fix equivalence code can be applied
> on its own.  The example I gave on that thread needs us to fix
> partitionwise code to even work, which is perhaps a 13-only change,
> but we have an example here that is broken due to d25ea01275.
> Perhaps, considering applying my patch seems better than alternatives
> which are either reverting d25ea01275 or avoiding doing partitionwise
> join for such queries.

I looked at this a bit, and I see that the core idea is to generate
the missing EC members by applying add_child_rel_equivalences when
building child join rels.  Perhaps we can make that work, but this
patch fails to, because you've ignored the comment pointing out
that the nullable_relids fixup logic only works for baserels:

 * And likewise for nullable_relids.  Note this code assumes
 * parent and child relids are singletons.

We could improve that to work for joinrels, I think, but it would become
very much more complicated (and slower).  AFAICS the logic would have
to be "iterate through top_parent_relids, see which ones are in
em_nullable_relids, and for each one that is, find the corresponding
child_relid and substitute that in new_nullable_relids".  This is a
bit of a problem because we don't have any really convenient way to
map individual top parent relids to child relids.  I wonder if we
should extend AppendRelInfo to include the top parent relid as well as
the immediate parent.  (Perhaps, while we're at it, we could make
adjust_appendrel_attrs_multilevel less of an inefficient and
underdocumented mess.)

Also, I'm pretty sure this addition is wrong/broken:

+
+/*
+ * There aren't going to be more expressions to translate in
+ * the same EC.
+ */
+break;

What makes you think that an EC can only contain one entry per rel?

More generally, as long as this patch requires changing
add_child_rel_equivalences' API anyway, I wonder if we should
rethink that altogether.  Looking at the code now, I realize that
d25ea01275 resulted in horribly bastardizing that function's API,
because the parent_rel and appinfo arguments are only consulted in
some cases, while in other cases we disregard them and rely on
child_rel->top_parent_relids to figure out how to translate stuff.
It would be possible to make the argument list be just (root, child_rel)
and always rely on child_rel->top_parent_relids.  At the very least,
if we keep the extra arguments, we should document them as being just
optimizations.

regards, tom lane




Re: Add json_object(text[], json[])?

2019-10-24 Thread Tom Lane
Paul A Jungwirth  writes:
> On Thu, Oct 24, 2019 at 8:52 AM Tom Lane  wrote:
>> Also, as the prototype implementation shows, it's not like you
>> can't get this functionality today ... you just need to cast
>> jsonb to text.  Admittedly that's annoying and wasteful.

> I don't think that gives the same result, does it?

Ah, you're right --- ENOCAFFEINE :-(.

regards, tom lane




Re: tuplesort test coverage

2019-10-24 Thread Andres Freund
Hi,

On 2019-10-15 13:05:32 +0100, Peter Geoghegan wrote:
> > - aborted abbreviated keys (lines 1522, 1608, 1774, 3620, 3739, 3867, 4266)
> 
> Also hard to test -- there was a bug here when abbreviated keys first
> went in -- that was detected by amcheck.
> 
> All of the places where we abort are essentially the same, though.

Why is it that hard?  Seems fairly easy to create cases that reliably
abort.

I really don't think it's ok to have as many abbrev abort related paths
without any coverage - the relevant code isn't that trivial. And
something like amcheck really doesn't strike me as sufficient. For one,
it doesn't provide any coverage either. For another, plenty of sorts don't
end up in a form that amcheck sees.

Tests aren't just there to verify that the current behaviour isn't
broken, they're also there to allow developing with some confidence. And
I don't think tuplesort as is really allows that, and e.g. abbreviated
keys made that substantially worse. That happens, but I think it'd be
good if you could help improve the situation.

E.g.
SELECT * FROM (SELECT ('----'||to_char(g.i, 
'FM'))::uuid uuid FROM generate_series(15000, 0, -1) g(i)) d ORDER 
BY uuid
reliably triggers abbreviated keys, and it looks to me like that should
be portable.  With a few tweaks it'd be fairly easy to use that to
provide some OK coverage for most of the abbrev key cases.
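
Spelled out, a query of that shape (with assumed zero padding for the uuid
prefix) would be:

-- every uuid shares the same first 8 bytes, so the abbreviated keys
-- (the first sizeof(Datum) bytes) carry no information and abbreviation aborts
SELECT u
FROM (SELECT ('00000000-0000-0000-0000-' || lpad(g.i::text, 12, '0'))::uuid AS u
      FROM generate_series(150000, 0, -1) g(i)) d
ORDER BY u
OFFSET 150000;   -- keep the output to a single row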


> > - in memory backwards scans (lines 1936, 3042) - *any* coverage for
> > TSS_SORTEDONTAPE (line 1964)
> 
> That used to exist, but it went away when we killed replacement selection 
> sort.

Yes, that's kind of my point? Either that patch reduced coverage, or it
created dead code. Neither is good.



> > - mergeruns without abbrev key (line 2582)
> 
> mergeruns() doesn't use abbreviated keys -- this code disables their
> use in the standard way.

Well, then reformulate the point that we should have coverage for
mergeruns() when initially abbreviated keys were set up.


> > - disk sorts with more than one run (lines 2707, 2789)
> > - any disk based tuplesort_begin_heap() (lines 3649, 3676)
> 
> I had to convince Tom to get the coverage of external sorts we have
> now. Apparently there are buildfarm animals that are very sensitive to
> that cost, that could have substantially increased test runtimes were
> we to do more. Perhaps this could be revisited.

Hm. I'm a bit confused. Isn't all that's required to set a tiny amount
of work_mem? Then it's easy to trigger many passes without a lot of IO?
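
For example, something along these lines (a sketch; the row count is
arbitrary) already forces an external sort with multiple runs while doing
very little I/O:

SET work_mem = '64kB';               -- the minimum allowed value
-- ~100k integers don't fit in 64kB, so tuplesort spills and merges runs
SELECT i FROM generate_series(1, 100000) g(i) ORDER BY i DESC OFFSET 99999;
RESET work_mem;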



> > I'm pretty unhappy that tuplesort has been whacked around pretty heavily
> > in the last few years, while *reducing* effective test coverage
> > noticeably, rather than increasing it.
> 
> I don't think that that's true, on balance. There are only 1,000 extra
> lines of code in tuplesort.c in master compared to 9.4, even though we
> added parallel sorts and abbreviated keys, two huge enhancements. In
> many ways, tuplesort is now simpler than ever.

I'm not saying that tuplesort has gotten worse or anything. Just that
there's been too much development without adding tests.


> > There's pretty substantial and
> > nontrivial areas without any tests - do we have actually have any
> > confidence that they work?
> 
> Everything that you're talking about has existed since v11 came out a
> year ago, and most of it is a year or two older than that. So yeah,
> I'm pretty confident that it works.

That may be true, but there's also basically no way to discover bugs
except manual testing, and users encountering the bugs. That's not good
enough.

Greetings,

Andres Freund




Re: Consider low startup cost in add_partial_path

2019-10-24 Thread James Coleman
On Fri, Oct 4, 2019 at 8:36 AM Robert Haas  wrote:
>
> On Wed, Oct 2, 2019 at 10:22 AM James Coleman  wrote:
> > In all cases I've been starting with:
> >
> > set enable_hashjoin = off;
> > set enable_nestloop = off;
> > set max_parallel_workers_per_gather = 4;
> > set min_parallel_index_scan_size = 0;
> > set min_parallel_table_scan_size = 0;
> > set parallel_setup_cost = 0;
> > set parallel_tuple_cost = 0;
> >
> > I've also tried various combinations of random_page_cost,
> > cpu_index_tuple_cost, cpu_tuple_cost.
> >
> > Interestingly I've noticed plans joining two relations that look like:
> >
> >  Limit
> >->  Merge Join
> >  Merge Cond: (t1.pk = t2.pk)
> >  ->  Gather Merge
> >Workers Planned: 4
> >->  Parallel Index Scan using t_pkey on t t1
> >  ->  Gather Merge
> >Workers Planned: 4
> >->  Parallel Index Scan using t_pkey on t t2
> >
> > Where I would have expected a Gather Merge above a parallelized merge
> > join. Is that reasonable to expect?
>
> Well, you told the planner that parallel_setup_cost = 0, so starting
> workers is free. And you told the planner that parallel_tuple_cost =
> 0, so shipping tuples from the worker to the leader is also free. So
> it is unclear why it should prefer a single Gather Merge over two
> Gather Merges: after all, the Gather Merge is free!
>
> If you use give those things some positive cost, even if it's smaller
> than the default, you'll probably get a saner-looking plan choice.

That makes sense.

Right now, trying to get a separate test for this feels a bit like a
distraction.

Given there doesn't seem to be an obvious way to reproduce the issue
currently, but we know we have a reproduction example along with
incremental sort, what is the path forward for this? Is it reasonable
to try to commit it anyway, knowing that it's a "correct" change and
has been demonstrated elsewhere?

James




Re: Creating foreign key on partitioned table is too slow

2019-10-24 Thread Alvaro Herrera
On 2019-Oct-23, kato-...@fujitsu.com wrote:

> Hello
> 
> To benchmark with the tpcb model, I tried to create a foreign key on the
> partitioned history table, but the backend process was killed by OOM.
> The number of partitions is 8192. I tried this on master (commit: ad4b7aeb84).
> 
> I did the same thing on another server which has 200GB of memory, but
> creating the foreign key did not finish within 24 hours.

Thanks for reporting.  It sounds like there must be a memory leak here.
I am fairly pressed for time at present so I won't be able to
investigate this until, at least, mid November.
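
For anyone who wants to poke at this in the meantime, a rough sketch of a
similar setup (names are illustrative; the report used the tpcb history
table, and creating 8192 partitions takes a while):

CREATE TABLE accounts (aid int PRIMARY KEY);
CREATE TABLE history (aid int) PARTITION BY HASH (aid);
-- generate the 8192 partitions from psql
SELECT format('CREATE TABLE history_%s PARTITION OF history
                 FOR VALUES WITH (MODULUS 8192, REMAINDER %s)', i, i)
FROM generate_series(0, 8191) i
\gexec
-- this is the step reported to blow up in memory / never finish
ALTER TABLE history ADD FOREIGN KEY (aid) REFERENCES accounts (aid);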

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




TOAST corruption in standby database

2019-10-24 Thread Alex Adriaanse
We have primary and hot standby databases running Postgres 11.3 inside Docker, 
with their data directories bind-mounted to a reflink-enabled XFS filesystem. 
The VM is running Debian's 4.19.16-1~bpo9+1 kernel inside an AWS EC2 instance.

I've seen TOAST corruption in one of the standby databases a few months ago in 
a ~44GB table, so I wiped the database and rebuilt it using pg_basebackup, 
which eliminated the corruption. This week I've seen corruption pop up again in 
the same table in one of the standby databases. The other standby database 
experienced no corruption.

The corrupted table has four columns of types integer, text, text, and jsonb. 
The corruption happens inside the jsonb column.

The corruption manifests itself as follows in the standby database:

SELECT length(json::text) FROM  WHERE identity = '...';
ERROR:  missing chunk number 0 for toast value 64265646 in pg_toast_16103925

SELECT ctid, chunk_id, chunk_seq, md5(chunk_data) FROM 
pg_toast.pg_toast_16103925 WHERE chunk_id = 64265646;
 ctid | chunk_id | chunk_seq | md5
--+--+---+-
(0 rows)

SELECT count(1) FROM pg_toast.pg_toast_16103925 WHERE chunk_id = 64265646;
 count
---
 2
(1 row)

Looking up the TOAST block that is supposed to contain this value you can see 
that the TOAST tuples are missing:

SELECT ctid, chunk_id, chunk_seq, md5(chunk_data) FROM 
pg_toast.pg_toast_16103925 WHERE ctid IN ('(1793121,1)', '(1793121,2)', 
'(1793121,3)', '(1793121,4)', '(1793121,5)', '(1793121,6)', '(1793121,7)');
ctid | chunk_id | chunk_seq |   md5
-+--+---+--
 (1793121,3) | 41259162 | 0 | 1bff36f306bac135cce9da44dd6d6bbb
 (1793121,4) | 41259162 | 1 | b754d688c5c847c7bc519e65741ffef1
 (1793121,5) | 41259163 | 0 | 10dfa4f5b3e32188f0b4b28c9be76abe
 (1793121,6) | 41259163 | 1 | 7dceb98b2c2f4ac3c72245c58c85323f
(4 rows)

For comparison here are the same queries against the primary database:

SELECT length(json::text) FROM  WHERE identity = '...';
 length

   7817
(1 row)

SELECT ctid, chunk_id, chunk_seq, md5(chunk_data) FROM 
pg_toast.pg_toast_16103925 WHERE chunk_id = 64265646;
ctid | chunk_id | chunk_seq |   md5
-+--+---+--
 (1793121,1) | 64265646 | 0 | a9a2642e8408fc178fb809b86c430997
 (1793121,2) | 64265646 | 1 | 371bc2628ac5bfc8b37d32f93d08fefe
(2 rows)

SELECT count(1) FROM pg_toast.pg_toast_16103925 WHERE chunk_id = 64265646;
 count
---
 2
(1 row)

SELECT ctid, chunk_id, chunk_seq, md5(chunk_data) FROM 
pg_toast.pg_toast_16103925 WHERE ctid IN ('(1793121,1)', '(1793121,2)', 
'(1793121,3)', '(1793121,4)', '(1793121,5)', '(1793121,6)', '(1793121,7)');
ctid | chunk_id | chunk_seq |   md5
-+--+---+--
 (1793121,1) | 64265646 | 0 | a9a2642e8408fc178fb809b86c430997
 (1793121,2) | 64265646 | 1 | 371bc2628ac5bfc8b37d32f93d08fefe
 (1793121,3) | 41259162 | 0 | 1bff36f306bac135cce9da44dd6d6bbb
 (1793121,4) | 41259162 | 1 | b754d688c5c847c7bc519e65741ffef1
 (1793121,5) | 41259163 | 0 | 10dfa4f5b3e32188f0b4b28c9be76abe
 (1793121,6) | 41259163 | 1 | 7dceb98b2c2f4ac3c72245c58c85323f
(6 rows)

Looking at the data file for the TOAST relation, the header data structures in 
the relevant block seem fine to me, which makes me think this is not caused by 
filesystem corruption (unless a write silently failed). The second half of that 
block is identical between the primary and corrupted standby, but in the first 
half the corrupted standby is missing data.

Standby (corrupted):

# dd if=data/base/18034/16103928.13 bs=8192 skip=89185 count=1 status=none | 
hexdump -C | head -8
  a3 0e 00 00 48 46 88 0e  00 00 05 00 30 00 58 0f  |HF..0.X.|
0010  00 20 04 20 00 00 00 00  00 00 00 00 00 00 00 00  |. . |
0020  10 98 e0 0f 98 97 e8 00  a8 8f e0 0f 58 8f 96 00  |X...|
0030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ||
*
0f50  00 00 00 00 00 00 00 00  32 b0 0a 01 00 00 00 00  |2...|
0f60  00 00 00 00 1b 00 61 5c  06 00 03 00 02 09 18 00  |..a\|
0f70  9b 90 75 02 01 00 00 00  ac 00 00 00 83 9f 64 00  |..u...d.|

Primary:

# dd if=data/base/18034/16103928.13 bs=8192 skip=89185 count=1 status=none | 
hexdump -C | head -8
  bd 0e 00 00 08 ad 32 b7  00 00 05 00 30 00 90 04  |..2.0...|
0010  00 20 04 20 00 00 00 00  68 87 e0 0f 90 84 a8 05  |. . h...|
0020  10 98 e0 0f 98 97 e8 00  a8 8f e0 0f 58 8f 96 00  |X...|
0030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ||
*
0490  a6 07 7e 02 00 00 00 00  00 00 00 00 1b 00 61 5c  |..~...a\|
04a0  02 00 03 00 02 09 18 00  ae 9d d4 03 01 00 00 00  |

Re: tuplesort test coverage

2019-10-24 Thread Andres Freund
Hi,

On 2019-10-24 11:10:34 -0700, Andres Freund wrote:
> On 2019-10-15 13:05:32 +0100, Peter Geoghegan wrote:
> > > - aborted abbreviated keys (lines 1522, 1608, 1774, 3620, 3739, 3867, 
> > > 4266)
> > 
> > Also hard to test -- there was a bug here when abbreviated keys first
> > went in -- that was detected by amcheck.
> > 
> > All of the places where we abort are essentially the same, though.
> 
> Why is it that hard?  Seems fairly easy to create cases that reliably
> abort.
> 
> I really don't think it's ok to have as many abbrev abort related paths
> without any coverage - the relevant code isn't that trivial. And
> something like amcheck really doesn't strike me as sufficient. For one,
> it doesn't provide any coverage either. For another, plenty of sorts don't
> end up in a form that amcheck sees.
> 
> Tests aren't just there to verify that the current behaviour isn't
> broken, they're also there to allow developing with some confidence. And
> I don't think tuplesort as is really allows that, and e.g. abbreviated
> keys made that substantially worse. That happens, but I think it'd be
> good if you could help improve the situation.
> 
> E.g.
> SELECT * FROM (SELECT ('----'||to_char(g.i, 
> 'FM'))::uuid uuid FROM generate_series(15000, 0, -1) g(i)) d 
> ORDER BY uuid
> reliably triggers abbreviated keys, and it looks to me like that should
> be portable.  With a few tweaks it'd be fairly easy to use that to
> provide some OK coverage for most of the abbrev key cases.

Here's a first stab at getting the coverage of tuplesort.c to a
satisfying level.  There's still bits uncovered, but that's largely
either a) trace_sort related b) hopefully unreachable stuff c) explain
related. The largest actually missing thing is a disk-based
mark/restore, which probably ought be covered.

I think the test time of this would still be OK, but if not we could
also work a bit more on that angle.

I'm pretty sure there's some minor copy & paste mistakes in the test,
but I want to get this out there and get some reactions before investing
further time.

Peter, Tom?

- Andres
>From 3c987fa79d1d2341869de226879a7201e035040a Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 24 Oct 2019 13:58:40 -0700
Subject: [PATCH v1 1/2] Add tests for tuplesort.c.

Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/test/regress/expected/tuplesort.out | 591 
 src/test/regress/parallel_schedule  |   2 +-
 src/test/regress/sql/tuplesort.sql  | 258 +++
 3 files changed, 850 insertions(+), 1 deletion(-)
 create mode 100644 src/test/regress/expected/tuplesort.out
 create mode 100644 src/test/regress/sql/tuplesort.sql

diff --git a/src/test/regress/expected/tuplesort.out b/src/test/regress/expected/tuplesort.out
new file mode 100644
index 000..d3b99df73ae
--- /dev/null
+++ b/src/test/regress/expected/tuplesort.out
@@ -0,0 +1,591 @@
+-- only use parallelism when explicitly intending to do so
+SET max_parallel_maintenance_workers = 0;
+SET max_parallel_workers = 0;
+-- A table with with contents that, when sorted, triggers abbreviated
+-- key aborts. One easy way to achieve that is to use uuids that all
+-- have the same prefix, as abbreviated keys for uuids just use the
+-- first sizeof(Datum) bytes.
+DROP TABLE IF EXISTS abbrev_abort_uuids;
+NOTICE:  table "abbrev_abort_uuids" does not exist, skipping
+CREATE TABLE abbrev_abort_uuids (
+id serial not null,
+abort_increasing uuid,
+abort_decreasing uuid,
+noabort_increasing uuid,
+noabort_decreasing uuid);
+INSERT INTO abbrev_abort_uuids (abort_increasing, abort_decreasing, noabort_increasing, noabort_decreasing)
+SELECT
+('----'||to_char(g.i, 'FM'))::uuid abort_increasing,
+('----'||to_char(2 - g.i, 'FM'))::uuid abort_decreasing,
+(to_char(g.i % 10009, 'FM')||'----'||to_char(g.i, 'FM'))::uuid noabort_increasing,
+(to_char(((2 - g.i) % 10009), 'FM')||'----'||to_char(2 - g.i, 'FM'))::uuid noabort_decreasing
+FROM generate_series(0, 2, 1) g(i);
+-- and a few NULLs
+INSERT INTO abbrev_abort_uuids(id) VALUES(0);
+INSERT INTO abbrev_abort_uuids DEFAULT VALUES;
+INSERT INTO abbrev_abort_uuids DEFAULT VALUES;
+-- add just a few duplicates
+INSERT INTO abbrev_abort_uuids (abort_increasing, abort_decreasing, noabort_increasing, noabort_decreasing)
+SELECT abort_increasing, abort_decreasing, noabort_increasing, noabort_decreasing
+FROM abbrev_abort_uuids
+WHERE (id < 10 OR id > 19990) AND id % 3 = 0 AND abort_increasing is not null;
+
+-- Check sort node uses of tuplesort wrt. abbreviated keys
+
+-- plain sort triggering abbreviated abort
+SELECT abort_increasing, abort_decreasing FROM abbrev_abort_uuids ORDER BY abort_increasing OFFSET 2 - 4;
+   abort_increasing

Re: psql tab-complete

2019-10-24 Thread Victor Spirin

Sorry for the wrong place and contents of my message.

It seems that the VA_ARGS_NARGS(__VA_ARGS__) macro always returns 1 on
Windows.


Victor Spirin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

24.10.2019 19:11, Victor Spirin wrote:
I found a problem with tab completion in version 12.  I checked
this on Windows.



Victor Spirin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company







Re: psql tab-complete

2019-10-24 Thread Tom Lane
Victor Spirin  writes:
> I found a problem with tab completion in version 12.  I checked
> this on Windows.

This change seems to break the case intended by the comment,
ie given the context

SELECT * FROM tablename WHERE 

we want to offer the columns of "tablename" as completions.

I'd be the first to agree that that's completely lame, as there
are any number of related cases it fails to cover ... but this
patch isn't making it better.

regards, tom lane




Re: psql tab-complete

2019-10-24 Thread Victor Spirin
Yes, I found that the VA_ARGS_NARGS(__VA_ARGS__) macro always returns 1 on
Windows.


Victor Spirin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

25.10.2019 0:48, Tom Lane wrote:

Victor Spirin  writes:

I found a problem with tab completion in version 12.  I checked
this on Windows.

This change seems to break the case intended by the comment,
ie given the context

SELECT * FROM tablename WHERE 

we want to offer the columns of "tablename" as completions.

I'd be the first to agree that that's completely lame, as there
are any number of related cases it fails to cover ... but this
patch isn't making it better.

regards, tom lane







Re: WIP: expression evaluation improvements

2019-10-24 Thread Soumyadeep Chakraborty
Hey Andres,

After looking at
v2-0006-jit-Reference-functions-by-name-in-IOCOERCE-steps.patch, I was
wondering about other places in the code where we have const pointers
to functions outside LLVM's purview: specially EEOP_FUNCEXPR* for any
function call expressions, EEOP_DISTINCT and EEOP_NULLIF which involve
operator specific comparison function call invocations, deserialization
and trans functions for aggregates etc. All of the above cases involve
to some degree some server functions that can be inlined/optimized.

If we do go down this road, the most immediate solution that comes to
mind would be to populate referenced_functions[] with these. Also, we
can replace all l_ptr_const() calls taking function addresses with
calls to llvm_function_reference() (this is safe as it falls back to an
l_ptr_const() call). We could do the l_ptr_const() ->
llvm_function_reference() change even if we don't go down this road.

One con with the approach above would be bloating of llvmjit_types.bc,
but we would be introducing @declares instead of @defines in the IR...
so I think that is fine.

Let me know your thoughts. I would like to submit a patch here in this
thread or elsewhere.

--
Soumyadeep


Re: Creating foreign key on partitioned table is too slow

2019-10-24 Thread Tomas Vondra

On Thu, Oct 24, 2019 at 03:48:57PM -0300, Alvaro Herrera wrote:

On 2019-Oct-23, kato-...@fujitsu.com wrote:


Hello

To benchmark with the tpcb model, I tried to create a foreign key on the
partitioned history table, but the backend process was killed by OOM.
The number of partitions is 8192. I tried this on master (commit: ad4b7aeb84).

I did the same thing on another server which has 200GB of memory, but
creating the foreign key did not finish within 24 hours.


Thanks for reporting.  It sounds like there must be a memory leak here.
I am fairly pressed for time at present so I won't be able to
investigate this until, at least, mid November.



I've briefly looked into this, and I think the main memory leak is in
RelationBuildPartitionDesc. It gets called with PortalContext; it
allocates a lot of memory building the descriptor, copies it into
CacheMemoryContext, but does not even try to free anything. So we end up
with something like this:

TopMemoryContext: 215344 total in 11 blocks; 47720 free (12 chunks); 167624 used
 pgstat TabStatusArray lookup hash table: 32768 total in 3 blocks; 9160 free (4 
chunks); 23608 used
 TopTransactionContext: 4194304 total in 10 blocks; 1992968 free (18 chunks); 
2201336 used
 RowDescriptionContext: 8192 total in 1 blocks; 6880 free (0 chunks); 1312 used
 MessageContext: 8192 total in 1 blocks; 3256 free (1 chunks); 4936 used
 Operator class cache: 8192 total in 1 blocks; 512 free (0 chunks); 7680 used
 smgr relation table: 32768 total in 3 blocks; 16768 free (8 chunks); 16000 used
 TransactionAbortContext: 32768 total in 1 blocks; 32504 free (0 chunks); 264 
used
 Portal hash: 8192 total in 1 blocks; 512 free (0 chunks); 7680 used
 TopPortalContext: 8192 total in 1 blocks; 7648 free (0 chunks); 544 used
   PortalContext: 1557985728 total in 177490 blocks; 9038656 free (167645 chunks); 1548947072 used: 
 Relcache by OID: 16384 total in 2 blocks; 3424 free (3 chunks); 12960 used

 CacheMemoryContext: 17039424 total in 13 blocks; 7181480 free (9 chunks); 
9857944 used
   partition key: 1024 total in 1 blocks; 168 free (0 chunks); 856 used: history
   index info: 2048 total in 2 blocks; 568 free (1 chunks); 1480 used: 
pg_class_tblspc_relfilenode_index
 ...
   index info: 2048 total in 2 blocks; 872 free (0 chunks); 1176 used: 
pg_class_oid_index
 WAL record construction: 49776 total in 2 blocks; 6344 free (0 chunks); 43432 
used
 PrivateRefCount: 8192 total in 1 blocks; 2584 free (0 chunks); 5608 used
 MdSmgr: 8192 total in 1 blocks; 5976 free (0 chunks); 2216 used
 LOCALLOCK hash: 65536 total in 4 blocks; 18584 free (12 chunks); 46952 used
 Timezones: 104128 total in 2 blocks; 2584 free (0 chunks); 101544 used
 ErrorContext: 8192 total in 1 blocks; 6840 free (4 chunks); 1352 used
Grand total: 1580997216 bytes in 177834 blocks; 18482808 free (167857 chunks); 
1562514408 used

(At which point I simply interrupted the query, it'd allocate more and
more memory until an OOM).

The attached patch trivially fixes that by adding a memory context
tracking all the temporary data, and then deleting it as a whole at
the end of the function. This significantly reduces the memory usage for
me; I'm not sure it's 100% correct.

FWIW, even with this fix it still takes an awful lot of time to create
the foreign key, because the CPU is stuck doing this

   60.78%60.78%  postgres  postgres[.] bms_equal
   32.58%32.58%  postgres  postgres[.] get_eclass_for_sort_expr
3.83% 3.83%  postgres  postgres[.] 
add_child_rel_equivalences
0.23% 0.00%  postgres  [unknown]   [.] 0x0005
0.22% 0.00%  postgres  [unknown]   [.] 
0.18% 0.18%  postgres  postgres[.] AllocSetCheck
  ...

Haven't looked into the details yet.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
index b207b765f2..7f15774266 100644
--- a/src/backend/partitioning/partdesc.c
+++ b/src/backend/partitioning/partdesc.c
@@ -72,6 +72,15 @@ RelationBuildPartitionDesc(Relation rel)
MemoryContext oldcxt;
int*mapping;
 
+   MemoryContext   tmp;
+   MemoryContext   old;
+
+   tmp = AllocSetContextCreate(CurrentMemoryContext,
+   "tmp context",
+   ALLOCSET_SMALL_SIZES);
+
+   old = MemoryContextSwitchTo(tmp);
+
/*
 * Get partition oids from pg_inherits.  This uses a single snapshot to
 * fetch the list of children, so while more children may be getting added
@@ -231,6 +240,9 @@ RelationBuildPartitionDesc(Relation rel)
}
}
 
+   MemoryContextSwitchTo(old);
+   MemoryContextDelete(tmp);
+
rel->rd_partdesc = partdesc;
 }
 


Re: WIP: expression evaluation improvements

2019-10-24 Thread Andres Freund
Hi,

On 2019-10-24 14:59:21 -0700, Soumyadeep Chakraborty wrote:
> After looking at
> v2-0006-jit-Reference-functions-by-name-in-IOCOERCE-steps.patch, I was
> wondering
> about other places in the code where we have const pointers to functions
> outside
> LLVM's purview: specially EEOP_FUNCEXPR* for any function call expressions,
> EEOP_DISTINCT and EEOP_NULLIF which involve operator specific comparison
> function call invocations, deserialization and trans functions for
> aggregates
> etc. All of the above cases involve to some degree some server functions
> that
> can be inlined/optimized.

I don't think there's other cases like this, except when we don't have a
symbol name. In the normal course that's "just" EEOP_PARAM_CALLBACK
IIRC.

For EEOP_PARAM_CALLBACK one solution would be to not use a callback
specified by pointer, but instead use an SQL level function taking an
INTERNAL parameter (to avoid it being called via SQL).


There's also a related edge-case where we are unable to figure out a symbol
name in llvm_function_reference(), and then resort to creating a global
variable pointing to the function.  This is a somewhat rare case (IIRC
it's mostly if not solely around language PL handlers), so I don't think
it matters *too* much.

We probably should change that to not initialize the global, and instead
resolve the symbol during link time. As long as we generate a symbol
name that llvm_resolve_symbol() can somehow resolve, we'd be good.  I
was a bit wary of doing syscache lookups from within
llvm_resolve_symbol(), otherwise we could just look up the function
address from within there.  So if we went this route I'd probably go for
a hashtable of additional symbol resolutions, which
llvm_resolve_symbol() would consult.

If indeed the only case this is being hit is language PL handlers, it
might be better to instead work out the symbol name for that handler -
we should be able to get that via pg_language.lanplcallfoid.


> If we do go down this road, the most immediate solution that comes to mind
> would
> be to populate referenced_functions[] with these. Also, we can replace all
> l_ptr_const() calls taking function addresses with calls to
> llvm_function_reference() (this is safe as it falls back to a l_pt_const()
> call). We could do the l_ptr_const() -> llvm_function_reference() even if we
> don't go down this road.

Which cases are you talking about here? Because I don't think there's
any others where we would know a symbol name to add to referenced_functions
in the first place?

I'm also not quite clear what adding to referenced_functions would buy
us wrt constants. The benefit of adding a function there is that we get
the correct signature of the function, which makes it much harder to
accidentally screw up and call with the wrong signature. I don't think
there's any benefits around symbol names?

I do want to benefit from getting accurate signatures for patch
[PATCH v2 26/32] WIP: expression eval: relative pointer suppport
I had a number of cases where I passed the wrong parameters, and llvm
couldn't tell me...


Greetings,

Andres Freund




Re: WIP: expression evaluation improvements

2019-10-24 Thread Andreas Karlsson

On 10/23/19 6:38 PM, Andres Freund wrote:

In the attached *prototype* patch series there's a lot of incremental
improvements (and some cleanups) (in time, not importance order):


You may already know this but your patch set seems to require clang 9.

I get the below compilation error which is probably cause by 
https://github.com/llvm/llvm-project/commit/90868bb0584f first being 
committed for clang 9 (I run "clang version 7.0.1-8 
(tags/RELEASE_701/final)").


In file included from gistutil.c:24:
../../../../src/include/utils/float.h:103:7: error: invalid output 
constraint '=@ccae' in asm

 : "=@ccae"(ret), [clobber_reg]"=&x"(clobber_reg)
   ^
1 error generated.





Re: WIP: expression evaluation improvements

2019-10-24 Thread Andres Freund
Hi,

On 2019-10-25 00:43:37 +0200, Andreas Karlsson wrote:
> On 10/23/19 6:38 PM, Andres Freund wrote:
> > In the attached *prototype* patch series there's a lot of incremental
> > improvements (and some cleanups) (in time, not importance order):
> 
> You may already know this but your patch set seems to require clang 9.

I didn't, so thanks!


> I get the below compilation error which is probably cause by
> https://github.com/llvm/llvm-project/commit/90868bb0584f first being
> committed for clang 9 (I run "clang version 7.0.1-8
> (tags/RELEASE_701/final)").
> 
> In file included from gistutil.c:24:
> ../../../../src/include/utils/float.h:103:7: error: invalid output
> constraint '=@ccae' in asm
>  : "=@ccae"(ret), [clobber_reg]"=&x"(clobber_reg)
>^
> 1 error generated.

I'll probably just drop this patch for now, it's not directly related. I
kind of wanted it on the list, so I have a place I can find it if I
forget :).

I think what really needs to happen instead is to improve the code
generated for __builtin_isinf[_sign]() by gcc/clang. They should produce
the constants like I did, instead of loading from the constant pool
every single time. That adds a fair bit of latency...

Greetings,

Andres Freund




Re: Creating foreign key on partitioned table is too slow

2019-10-24 Thread Tomas Vondra

On Fri, Oct 25, 2019 at 12:17:58AM +0200, Tomas Vondra wrote:


...

FWIW, even with this fix it still takes an awful lot of time to create
the foreign key, because the CPU is stuck doing this

  60.78%60.78%  postgres  postgres[.] bms_equal
  32.58%32.58%  postgres  postgres[.] get_eclass_for_sort_expr
   3.83% 3.83%  postgres  postgres[.] add_child_rel_equivalences
   0.23% 0.00%  postgres  [unknown]   [.] 0x0005
   0.22% 0.00%  postgres  [unknown]   [.] 
   0.18% 0.18%  postgres  postgres[.] AllocSetCheck
 ...

Haven't looked into the details yet.



OK, a bit more info. A better perf report (with framepointers etc) looks
like this:

 +   99.98% 0.00%  postgres  [unknown] [.] 0x495641002c4d133d
 +   99.98% 0.00%  postgres  libc-2.29.so  [.] __libc_start_main
 +   99.98% 0.00%  postgres  postgres  [.] startup_hacks
 +   99.98% 0.00%  postgres  postgres  [.] PostmasterMain
 +   99.98% 0.00%  postgres  postgres  [.] ServerLoop
 +   99.98% 0.00%  postgres  postgres  [.] BackendStartup
 +   99.98% 0.00%  postgres  postgres  [.] ExitPostmaster
 +   99.98% 0.00%  postgres  postgres  [.] PostgresMain
 +   99.98% 0.00%  postgres  postgres  [.] exec_simple_query
 +   99.98% 0.00%  postgres  postgres  [.] PortalRun
 +   99.98% 0.00%  postgres  postgres  [.] PortalRunMulti
 +   99.98% 0.00%  postgres  postgres  [.] PortalRunUtility
 +   99.98% 0.00%  postgres  postgres  [.] ProcessUtility
 +   99.98% 0.00%  postgres  postgres  [.] standard_ProcessUtility
 +   99.98% 0.00%  postgres  postgres  [.] ProcessUtilitySlow
 +   99.98% 0.00%  postgres  postgres  [.] AlterTable
 +   99.98% 0.00%  postgres  postgres  [.] ATController
 +   99.98% 0.00%  postgres  postgres  [.] ATRewriteTables
 +   99.98% 0.00%  postgres  postgres  [.] validateForeignKeyConstraint
 +   99.98% 0.00%  postgres  postgres  [.] RI_Initial_Check
 +   99.96% 0.00%  postgres  postgres  [.] SPI_execute_snapshot
 +   99.86% 0.00%  postgres  postgres  [.] _SPI_execute_plan
 +   99.70% 0.00%  postgres  postgres  [.] GetCachedPlan
 +   99.70% 0.00%  postgres  postgres  [.] BuildCachedPlan
 +   99.66% 0.00%  postgres  postgres  [.] pg_plan_queries
 +   99.66% 0.00%  postgres  postgres  [.] pg_plan_query
 +   99.66% 0.00%  postgres  postgres  [.] planner
 +   99.66% 0.00%  postgres  postgres  [.] standard_planner
 +   99.62% 0.00%  postgres  postgres  [.] subquery_planner
 +   99.62% 0.00%  postgres  postgres  [.] grouping_planner
 +   99.62% 0.00%  postgres  postgres  [.] query_planner
 +   99.31% 0.00%  postgres  postgres  [.] make_one_rel
 +   97.53% 0.00%  postgres  postgres  [.] set_base_rel_pathlists
 +   97.53% 0.00%  postgres  postgres  [.] set_rel_pathlist
 +   97.53% 0.01%  postgres  postgres  [.] set_append_rel_pathlist
 +   97.42% 0.00%  postgres  postgres  [.] set_plain_rel_pathlist
 +   97.40% 0.02%  postgres  postgres  [.] create_index_paths
 +   97.16% 0.01%  postgres  postgres  [.] get_index_paths
 +   97.12% 0.02%  postgres  postgres  [.] build_index_paths
 +   96.67% 0.01%  postgres  postgres  [.] build_index_pathkeys
 +   96.61% 0.01%  postgres  postgres  [.] make_pathkey_from_sortinfo
 +   95.70%21.27%  postgres  postgres  [.] get_eclass_for_sort_expr
 +   75.21%75.21%  postgres  postgres  [.] bms_equal
 +   48.72% 0.00%  postgres  postgres  [.] consider_index_join_clauses
 +   48.72% 0.00%  postgres  postgres  [.] 
consider_index_join_outer_rels
 +   48.72% 0.02%  postgres  postgres  [.] get_join_index_paths
 +1.78% 0.00%  postgres  postgres  [.] set_base_rel_sizes
 +1.78% 0.00%  postgres  postgres  [.] set_rel_size
 +1.78% 0.01%  postgres  postgres  [.] set_append_rel_size
 +1.66% 1.34%  postgres  postgres  [.] add_child_rel_equivalences

It is (pretty much) a single callstack, i.e. each function is simply
calling the one below it (with some minor exceptions at the end, but
that's pretty negligible here).

This essentially says that planning queries executed by RI_Initial_Check
with many partitions is damn expensive. An example query is this one:

test=# \timing

test=# SELECT fk."aid" FROM ONLY "public"."history_60" fk LEFT OUTER
JOIN "public"."accounts" pk ON ( pk."aid" OPERATOR(pg_catalog.=)
fk."aid") WHERE pk."aid" IS NULL AND (fk."aid" IS NOT NULL);

aid 
-

(0 rows)

Time: 28791.492 ms (00:28.791)

Bear in mind those are *empty* tables, so the execution is pretty cheap
(explain analyze says the execution takes ~65ms, but the planning itself
takes ~28 seconds). And we have 8192 such partitions, which means we'd
spend

Re: Creating foreign key on partitioned table is too slow

2019-10-24 Thread Andres Freund
Hi,

On 2019-10-23 05:59:01 +, kato-...@fujitsu.com wrote:
> To benchmark with the tpcb model, I tried to create a foreign key on the
> partitioned history table, but the backend process was killed by OOM.
> The number of partitions is 8192. I tried this on master (commit: ad4b7aeb84).

Obviously this should be improved. But I think it's also worthwhile to
note that using 8k partitions is very unlikely to be a good choice for
anything. The metadata, partition pruning, etc overhead is just going to
be very substantial.

Greetings,

Andres Freund




Re: v12.0: reindex CONCURRENTLY: lock ShareUpdateExclusiveLock on object 14185/39327/0 is already held

2019-10-24 Thread Michael Paquier
On Wed, Oct 23, 2019 at 10:08:21PM -0500, Justin Pryzby wrote:
> On Thu, Oct 24, 2019 at 11:42:04AM +0900, Michael Paquier wrote:
>> Please see the attached.  Justin, does it fix your problems regarding
>> the locks?
> 
> Confirmed.

Okay, committed and back-patched.  I have checked manually all the
interruptions for plain indexes and it is possible to clean up the
invalid indexes properly (old or new depending on the phase).
Partition indexes have other issues as you reported, but let's see
about that on the other thread. 
--
Michael




Re: Duplicate Workers entries in some EXPLAIN plans

2019-10-24 Thread Andres Freund
Hi,

On 2019-10-22 11:58:35 -0700, Maciek Sakrejda wrote:
> I originally reported this in pgsql-bugs [0], but there wasn't much
> feedback there, so moving the discussion here. When using JSON, YAML, or
> XML-format EXPLAIN on a plan that uses a parallelized sort, the Sort nodes
> list two different entries for "Workers", one for the sort-related info,
> and one for general worker info. This is what this looks like in JSON (some
> details elided):
>
> {
>   "Node Type": "Sort",
>   ...
>   "Workers": [
> {
>   "Worker Number": 0,
>   "Sort Method": "external merge",
>   "Sort Space Used": 20128,
>   "Sort Space Type": "Disk"
> },
> {
>   "Worker Number": 1,
>   "Sort Method": "external merge",
>   "Sort Space Used": 20128,
>   "Sort Space Type": "Disk"
> }
>   ],
>   ...
>   "Workers": [
> {

> This is technically valid JSON, but it's extremely difficult to work with,
> since default JSON parsing in Ruby, node, Python, Go, and even Postgres'
> own jsonb only keep the latest key

It's also quite confusing.
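
That behaviour is easy to verify directly against jsonb:

-- duplicate keys in a jsonb literal: only the last one survives
SELECT '{"Workers": [{"Worker Number": 0}], "Workers": [{"Worker Number": 1}]}'::jsonb;
-- returns {"Workers": [{"Worker Number": 1}]}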


> As Tom Lane pointed out in my pgsql-bugs thread, this has been
> reported before [1] and in that earlier thread, Andrew Dunstan suggested
> that perhaps the simplest solution is to just rename the sort-related
> Workers node.  Thoughts?

Yea, I think we should fix this. The current output basically makes no
sense.


> Tom expressed some concerns about a breaking change here,
> though I think the current behavior means vanishingly few users are parsing
> this data correctly.

Well, in a lot of the cases there's no parallel output for the sort, and
in other cases BUFFERS is not specified. In either case the 'duplicate
key' problem won't exist then.


While Tom said:

On 2019-10-16 09:16:56 +0200, Tom Lane wrote:
> I think the text-mode output is intentional, but the other formats
> need more work.

  Sort Method: external merge  Disk: 4920kB
  Worker 0:  Sort Method: external merge  Disk: 5880kB
  Worker 1:  Sort Method: external merge  Disk: 5920kB
  Buffers: shared hit=682 read=10188, temp read=1415 written=2101
  Worker 0: actual time=130.058..130.324 rows=1324 loops=1
Buffers: shared hit=337 read=3489, temp read=505 written=739
  Worker 1: actual time=130.273..130.512 rows=1297 loops=1
Buffers: shared hit=345 read=3507, temp read=505 written=744

I don't think this is close to being good enough to be worth
preserving. I think it's worth avoiding unnecessary breakage of explain
output, but we also shouldn't endlessly carry forward confusing output,
just because of that.

It clearly seems like it'd be better if this instead were

  Sort Method: external merge  Disk: 4920kB
  Buffers: shared hit=682 read=10188, temp read=1415 written=2101
  Worker 0: actual time=130.058..130.324 rows=1324 loops=1
Sort Method: external merge  Disk: 5880kB
Buffers: shared hit=337 read=3489, temp read=505 written=739
  Worker 1: actual time=130.273..130.512 rows=1297 loops=1
Buffers: shared hit=345 read=3507, temp read=505 written=744
Sort Method: external merge  Disk: 5920kB

I think the way this information was added in bf11e7ee2e36 and
33001fd7a707, contrasting to the output added in b287df70e408, is just
not right. If we add similar instrumentation reporting to more nodes,
we'll end up with duplicated information all over.  Additionally the
per-worker part of show_sort_info() basically just duplicated the rest
of the function.  I then also did something similar (although luckily
with a different key...), with the ExplainPrintJIT() call for Gather
nodes.

Unfortunately I think the fix isn't all that trivial, due to the way we
output the per-worker information at the end of ExplainNode(), by just
dumping things into a string.  It seems to me that a step in the right
direction would be for ExplainNode() to create
planstate->worker_instrument StringInfos, which can be handed to
routines like show_sort_info(), which would print the per-node
information into that, rather than directly dumping into
es->output. Most of the current "Show worker detail" would be done
earlier in ExplainNode(), at the place where we current display the
"actual rows" bit.

ISTM that should include removing the duplication of the contents of
show_sort_info(), and probably also for the Gather, GatherMerge blocks
(I've apparently skipped adding the JIT information to the latter, not
sure if we ought to fix that in the stable branches).

Any chance you want to take a stab at that?

I don't think we'll fix it soon, but damn, all this string appending
around just isn't a great way to reliably build nested data formats.

Greetings,

Andres Freund




Re: [HACKERS] Block level parallel vacuum

2019-10-24 Thread Amit Kapila
On Thu, Oct 24, 2019 at 8:12 PM Masahiko Sawada  wrote:
>
> On Thu, Oct 24, 2019 at 3:21 PM Dilip Kumar  wrote:
> >
> > I have come up with the POC for approach (a).
> >
> > The idea is
> > 1) Before launching the worker divide the current VacuumCostBalance
> > among workers so that workers start accumulating the balance from that
> > point.
> > 2) Also, divide the VacuumCostLimit among the workers.
> > 3) Once the worker are done with the index vacuum, send back the
> > remaining balance with the leader.
> > 4) The leader will sum all the balances and add that to its current
> > VacuumCostBalance.  And start accumulating its balance from this
> > point.
> >
> > I was trying to test how is the behaviour of the vacuum I/O limit, but
> > I could not find an easy way to test that so I just put the tracepoint
> > in the code and just checked that at what point we are giving the
> > delay.
> > I also printed the cost balance at various point to see that after how
> > much I/O accumulation we are hitting the delay.  Please feel free to
> > suggest a better way to test this.
> >
> > I have printed these logs for parallel vacuum patch (v30) vs v(30) +
> > patch for dividing i/o limit (attached with the mail)
> >
> > Note: Patch and the test results are attached.
> >
>
> Thank you!
>
> For approach (a) the basic idea I've come up with is that we have a
> shared balance value on DSM and each workers including the leader
> process add its local balance value to it in vacuum_delay_point, and
> then based on the shared value workers sleep. I'll submit that patch
> with other updates.
>

I think it would be better if we can prepare the I/O balance patches
on top of main patch and evaluate both approaches.  We can test both
the approaches and integrate the one which turned out to be good.

Note that, I will be away next week, so I won't be able to review your
latest patch unless you are planning to post today or tomorrow.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Block level parallel vacuum

2019-10-24 Thread Amit Kapila
On Fri, Oct 25, 2019 at 7:37 AM Amit Kapila  wrote:
>
> On Thu, Oct 24, 2019 at 8:12 PM Masahiko Sawada  wrote:
> >
> > On Thu, Oct 24, 2019 at 3:21 PM Dilip Kumar  wrote:
> > >
> > > I have come up with the POC for approach (a).
> > >
> > > The idea is
> > > 1) Before launching the worker divide the current VacuumCostBalance
> > > among workers so that workers start accumulating the balance from that
> > > point.
> > > 2) Also, divide the VacuumCostLimit among the workers.
> > > 3) Once the worker are done with the index vacuum, send back the
> > > remaining balance with the leader.
> > > 4) The leader will sum all the balances and add that to its current
> > > VacuumCostBalance.  And start accumulating its balance from this
> > > point.
> > >
> > > I was trying to test how is the behaviour of the vacuum I/O limit, but
> > > I could not find an easy way to test that so I just put the tracepoint
> > > in the code and just checked that at what point we are giving the
> > > delay.
> > > I also printed the cost balance at various point to see that after how
> > > much I/O accumulation we are hitting the delay.  Please feel free to
> > > suggest a better way to test this.
> > >
> > > I have printed these logs for parallel vacuum patch (v30) vs v(30) +
> > > patch for dividing i/o limit (attached with the mail)
> > >
> > > Note: Patch and the test results are attached.
> > >
> >
> > Thank you!
> >
> > For approach (a) the basic idea I've come up with is that we have a
> > shared balance value on DSM and each workers including the leader
> > process add its local balance value to it in vacuum_delay_point, and
> > then based on the shared value workers sleep. I'll submit that patch
> > with other updates.
> >
>
> I think it would be better if we can prepare the I/O balance patches
> on top of main patch and evaluate both approaches.  We can test both
> the approaches and integrate the one which turned out to be good.
>

Just to add something about testing both approaches:  I think we can
first come up with a way to compute the throttling vacuum does, as
mentioned by me in one of the emails above [1], or in some other way.
I think Dilip is planning to give it a try, and once we have that we
can evaluate both the patches.  Some of the tests I have in mind are:
a. All indexes have an equal amount of deleted data.
b. Indexes have an uneven amount of deleted data.
c. Try with a mix of indexes (btree, gin, gist, hash, etc.) on a table
   (one possible setup is sketched below).

Feel free to add more tests.
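
For (c), one possible setup could look like this (a sketch; the PARALLEL
option is the one the patch proposes, so the exact syntax may differ):

CREATE TABLE vac_mix (id int, arr int[], rng int4range, txt text);
CREATE INDEX ON vac_mix USING btree (id);
CREATE INDEX ON vac_mix USING gin (arr);
CREATE INDEX ON vac_mix USING gist (rng);
CREATE INDEX ON vac_mix USING hash (txt);
INSERT INTO vac_mix
  SELECT i, ARRAY[i], int4range(i, i + 10), md5(i::text)
  FROM generate_series(1, 100000) i;
DELETE FROM vac_mix WHERE id % 2 = 0;      -- leave dead tuples behind in every index
VACUUM (PARALLEL 2, VERBOSE) vac_mix;      -- assumed option name from the patch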

[1] - 
https://www.postgresql.org/message-id/CAA4eK1%2BPeiFLdTuwrE6CvbNdx80E-O%3DZxCuWB2maREKFD-RaCA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: EXPLAIN BUFFERS and I/O timing accounting questions

2019-10-24 Thread Andres Freund
Hi,

Migrating to -hackers; this seems clearly suboptimal and confusing.

The original thread is at
https://www.postgresql.org/message-id/20191025003834.2rswu7smheaddag3%40alap3.anarazel.de

On 2019-10-24 17:38:34 -0700, Andres Freund wrote:
> Perhaps you could send me the full plan and query privately? And, if you
> have access to that, the plain text explain?

Maciek did, and I think I see what's going on...


(I asked whether it's ok to include the query)
On 2019-10-24 18:58:27 -0700, Maciek Sakrejda wrote:
> Thanks for your help. Here is the query text:
> 
> WITH q AS (
>   DELETE FROM queries WHERE last_occurred_at < now() - '60 days'::interval
> RETURNING queries.id
> ),
>   t1 AS (DELETE FROM query_fingerprints WHERE query_id IN (SELECT id FROM q)),
>   t2 as (DELETE FROM query_table_associations WHERE query_id IN
> (SELECT id FROM q))
> SELECT COUNT(id) FROM q

Note that the t1 and t2 CTEs are not referenced in the query
itself. Normally that would mean that they're simply not evaluated - but
for CTEs that include DML we force evaluation, as the result would
otherwise be inconsistent.
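
That forcing is easy to see in isolation:

CREATE TEMP TABLE wcte_demo (x int);
-- "ins" is never referenced by the outer SELECT, yet the INSERT still runs
WITH ins AS (INSERT INTO wcte_demo VALUES (1)) SELECT 1;
SELECT count(*) FROM wcte_demo;   -- returns 1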

But that forcing happens not from within the normal query (the SELECT
COUNT(id) FROM q), but from the main executor. As the attribution of
child executor nodes to the layer above happens when the execution of a
node ends (see ExecProcNodeInstr()), the work of completing wCTEs is not
associated with any single node.

void
standard_ExecutorFinish(QueryDesc *queryDesc)
...
/* Run ModifyTable nodes to completion */
ExecPostprocessPlan(estate);


static void
ExecPostprocessPlan(EState *estate)
..
/*
 * Run any secondary ModifyTable nodes to completion, in case the main
 * query did not fetch all rows from them.  (We do this to ensure that
 * such nodes have predictable results.)
 */
foreach(lc, estate->es_auxmodifytables)
...
slot = ExecProcNode(ps);

In contrast to e.g. AFTER triggers, which also get executed at the end
of the query, we do not associate the cost of running this
postprocessing work with any superior node.

Which is quite confusing, because the DML nodes do look like
they're subsidiary to another node.

I think we ought to either attribute the cost of ExecPostprocessPlan()
to a separate instrumentation that we display at the end of the explain
plan when not zero, or at least associate the cost with the parent node.

Greetings,

Andres Freund




Re: dropdb --force

2019-10-24 Thread Amit Kapila
On Thu, Oct 24, 2019 at 8:22 PM Pavel Stehule  wrote:
>
> čt 24. 10. 2019 v 11:10 odesílatel Amit Kapila  
> napsal:
>>
>> While making some changes in the patch, I noticed that in
>> TerminateOtherDBBackends, there is a race condition where after we
>> release the ProcArrayLock, the backend process to which we decided to
>> send a signal exits by itself and the same pid can be assigned to
>> another backend which is connected to some other database.  This leads
>> to terminating a wrong backend.  I think we have some similar race
>> condition at few other places like in pg_terminate_backend(),
>> ProcSleep() and CountOtherDBBackends().  I think here the risk is a
>> bit more because there could be a long list of pids.
>>
>> One idea could be that we write a new function similar to IsBackendPid
>> which takes dbid and ensures that pid belongs to that database and use
>> that before sending kill signal, but still it will not be completely
>> safe.  But, I think it can be closer to cases like we already have in
>> code.
>>
>> Another possible idea could be to use the SendProcSignal mechanism
>> similar to how we have used it in CancelDBBackends() to allow the
>> required backends to exit by themselves.  This might be safer.
>>
>> I am not sure if we can entirely eliminate this race condition and
>> whether it is a good idea to accept such a race condition even though
>> it exists in other parts of code.  What do you think?
>>
>> BTW, I have added/revised some comments in the code and done few other
>> cosmetic changes, the result of which is attached.
>
>
> Tomorrow I'll check variants that you mentioned.
>
> We can be sure there are no new connections to the removed database, because
> we hold a lock there.
>

Right.

> So checking whether the database oid is the same should be enough.
>

We can do this before sending a kill signal, but is it enough?  As soon
as we release ProcArrayLock, the other process can exit at any time and
a new process can use its pid.  I think this is more of a theoretical
risk and might not be easy to hit, but still, we can't ignore it.
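
For comparison, the same database-scoped filtering (and the same theoretical
pid-reuse window) exists for the SQL-level way of doing this; purely as a
sketch:

-- terminate every backend connected to the database about to be dropped
-- (in principle a pid could be reused between the scan and the signal)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'target_db'        -- illustrative database name
  AND pid <> pg_backend_pid();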

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Block level parallel vacuum

2019-10-24 Thread Dilip Kumar
On Thu, Oct 24, 2019 at 8:12 PM Masahiko Sawada  wrote:
>
> On Thu, Oct 24, 2019 at 3:21 PM Dilip Kumar  wrote:
> >
> > On Fri, Oct 18, 2019 at 12:18 PM Dilip Kumar  wrote:
> > >
> > > On Fri, Oct 18, 2019 at 11:25 AM Amit Kapila  
> > > wrote:
> > > >
> > > > On Fri, Oct 18, 2019 at 8:45 AM Dilip Kumar  
> > > > wrote:
> > > > >
> > > > > On Thu, Oct 17, 2019 at 4:00 PM Amit Kapila  
> > > > > wrote:
> > > > > >
> > > > > > On Thu, Oct 17, 2019 at 3:25 PM Dilip Kumar  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Thu, Oct 17, 2019 at 2:12 PM Masahiko Sawada 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Thu, Oct 17, 2019 at 5:30 PM Amit Kapila 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > Another point in this regard is that the user anyway has an 
> > > > > > > > > option to
> > > > > > > > > turn off the cost-based vacuum.  By default, it is anyway 
> > > > > > > > > disabled.
> > > > > > > > > So, if the user enables it we have to provide some sensible 
> > > > > > > > > behavior.
> > > > > > > > > If we can't come up with anything, then, in the end, we might 
> > > > > > > > > want to
> > > > > > > > > turn it off for a parallel vacuum and mention the same in 
> > > > > > > > > docs, but I
> > > > > > > > > think we should try to come up with a solution for it.
> > > > > > > >
> > > > > > > > I finally got your point and now understood the need. And the 
> > > > > > > > idea I
> > > > > > > > proposed doesn't work fine.
> > > > > > > >
> > > > > > > > So you meant that all workers share the cost count and if a 
> > > > > > > > parallel
> > > > > > > > vacuum worker increase the cost and it reaches the limit, does 
> > > > > > > > the
> > > > > > > > only one worker sleep? Is that okay even though other parallel 
> > > > > > > > workers
> > > > > > > > are still running and then the sleep might not help?
> > > > > > > >
> > > > > >
> > > > > > Remember that the other running workers will also increase
> > > > > > VacuumCostBalance and whichever worker finds that it becomes greater
> > > > > > than VacuumCostLimit will reset its value and sleep.  So, won't this
> > > > > > make sure that overall throttling works the same?
> > > > > >
> > > > > > > I agree with this point.  There is a possibility that some of the
> > > > > > > workers who are doing heavy I/O continue to work and OTOH other
> > > > > > > workers who are doing very less I/O might become the victim and
> > > > > > > unnecessarily delay its operation.
> > > > > > >
> > > > > >
> > > > > > Sure, but will it impact the overall I/O?  I mean to say the rate
> > > > > > limit we want to provide for overall vacuum operation will still be
> > > > > > the same.  Also, isn't a similar thing happens now also where heap
> > > > > > might have done a major portion of I/O but soon after we start
> > > > > > vacuuming the index, we will hit the limit and will sleep.
> > > > >
> > > > > Actually, What I meant is that the worker who performing actual I/O
> > > > > might not go for the delay and another worker which has done only CPU
> > > > > operation might pay the penalty?  So basically the worker who is doing
> > > > > CPU intensive operation might go for the delay and pay the penalty and
> > > > > the worker who is performing actual I/O continues to work and do
> > > > > further I/O.  Do you think this is not a practical problem?
> > > > >
> > > >
> > > > I don't know.  Generally, we try to delay (if required) before
> > > > processing (read/write) one page which means it will happen for I/O
> > > > intensive operations, so I am not sure if the point you are making is
> > > > completely correct.
> > >
> > > Ok, I agree with the point that we are checking it only when we are
> > > doing the I/O operation.  But, we also need to consider that each I/O
> > > operations have a different weightage.  So even if we have a delay
> > > point at I/O operation there is a possibility that we might delay the
> > > worker which is just performing read buffer with page
> > > hit(VacuumCostPageHit).  But, the other worker who is actually
> > > dirtying the page(VacuumCostPageDirty = 20) continue the work and do
> > > more I/O.
> > >
> > > >
> > > > > Stepping back a bit,  OTOH, I think that we can not guarantee that the
> > > > > one worker who has done more I/O will continue to do further I/O and
> > > > > the one which has not done much I/O will not perform more I/O in
> > > > > future.  So it might not be too bad if we compute shared costs as you
> > > > > suggested above.
> > > > >
> > > >
> > > > I am thinking if we can write the patch for both the approaches (a.
> > > > compute shared costs and try to delay based on that, b. try to divide
> > > > the I/O cost among workers as described in the email above[1]) and do
> > > > some tests to see the behavior of throttling, that might help us in
> > > > deciding what is the best strategy to solve this problem, if any.
> > > > What do you think?
> > >
> > > I agree with this idea.  I can co

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-10-24 Thread Kyotaro Horiguchi
Hello. Thanks for the comment.

# Sorry in advance for possibly breaking the thread.

> MarkBufferDirtyHint() writes WAL even when rd_firstRelfilenodeSubid or
> rd_createSubid is set; see attached test case.  It needs to skip WAL whenever
> RelationNeedsWAL() returns false.

Thanks for pointing that out. And the test patch helped me very much.

Most callers can tell the function that, but SetHintBits() cannot do so
easily. Rather, I think we shouldn't even try. Instead, in the attached,
MarkBufferDirtyHint() asks storage.c for the sync-pending state of the
relfilenode for the buffer. In the attached patch (0003)
RelFileNodeSkippingWAL loops over pendingSyncs, but it is called only at
the time an FPW is added, so I believe it doesn't affect performance
much. However, we could use a hash for pendingSyncs instead of a linked
list. Anyway, the change is in its own file,
v21-0003-Fix-MarkBufferDirtyHint.patch, which will be merged into 0002.

AFAICS all XLogInsert calls are guarded by RelationNeedsWAL() or are in
non-wal_minimal code paths.

> Cylinder and track sizes are obsolete as user-visible concepts.  (They're not
> constant for a given drive, and I think modern disks provide no way to read
> the relevant parameters.)  I like the name "wal_skip_threshold", and my second

I strongly agree. Thanks for the draft; I used it as-is. I couldn't
come up with an appropriate second description of the GUC, so I just
removed it.

# it was "For rotating magnetic disks, it is around the size of a
# track or cylinder."

> the relevant parameters.)  I like the name "wal_skip_threshold", and
> my second choice would be "wal_skip_min_size".  Possibly documented
> as follows:
..
> Any other opinions on the GUC name?

I prefer the first candidate. I already use that terminology in
storage.c, and the name fits the context better.

> * We emit newpage WAL records for smaller size of relations.
> *
> * Small WAL records have a chance to be emitted at once along with
> * other backends' WAL records. We emit WAL records instead of syncing
> * for files that are smaller than a certain threshold expecting faster
- * commit. The threshold is defined by the GUC effective_io_block_size.
+ * commit. The threshold is defined by the GUC wal_skip_threshold.
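
To make the behavior that comment describes concrete, the commit-time
decision might be sketched as below.  The function name, the size
check, and the unit of the GUC are assumptions made for illustration,
not the patch itself.

#include "postgres.h"
#include "access/xloginsert.h"	/* log_newpage_range() */
#include "storage/smgr.h"		/* smgrimmedsync() */
#include "utils/rel.h"

/* Assumed to be provided by the GUC machinery, in kilobytes. */
extern int	wal_skip_threshold;

/*
 * Hypothetical commit-time handling of a relation whose WAL was
 * skipped: a small relation is WAL-logged page by page (the records
 * can be flushed together with other backends' WAL), a larger one is
 * synced directly so no WAL needs to be written for it at all.
 * Assumes the relation's storage manager handle is already open.
 */
static void
perform_pending_sync(Relation rel, BlockNumber nblocks)
{
	if ((uint64) nblocks * BLCKSZ < (uint64) wal_skip_threshold * 1024)
		log_newpage_range(rel, MAIN_FORKNUM, 0, nblocks, false);
	else
		smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
}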

The attached are:

- v21-0001-TAP-test-for-copy-truncation-optimization.patch
  same as v20

- v21-0002-Fix-WAL-skipping-feature.patch
  GUC name changed.

- v21-0003-Fix-MarkBufferDirtyHint.patch
  PoC of fixing the function. will be merged into 0002. (New)

- v21-0004-Documentation-for-wal_skip_threshold.patch
  GUC name and description changed. (Previous 0003)

- v21-0005-Additional-test-for-new-GUC-setting.patch
  including adjusted version of wal-optimize-noah-tests-v3.patch
  Maybe test names need further adjustment. (Previous 0004)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 34149545942480d8dcc1cc587f40091b19b5aa39 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH v21 1/5] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/018_wal_optimize.pl | 312 
 1 file changed, 312 insertions(+)
 create mode 100644 src/test/recovery/t/018_wal_optimize.pl

diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 00..b041121745
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,312 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica".  The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 26;
+
+sub check_orphan_relfilenodes
+{
+	my($node, $test_name) = @_;
+
+	my $db_oid = $node->safe_psql('postgres',
+	   "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+	my $prefix = "base/$db_oid/";
+	my $filepaths_referenced = $node->safe_psql('postgres', "
+	   SELECT pg_relation_filepath(oid) FROM pg_class
+	   WHERE reltablespace = 0 and relpersistence <> 't' and
+	   pg_relation_filepath(oid) IS NOT NULL;");
+	is_deeply([sort(map { "$prefix$_" }
+	grep(/^[0-9]+$/,
+		 slurp_dir($node->data_dir . "/$prefix")))],
+			  [sort split /\n/, $filepaths_referenced],
+			  $test_name);
+	return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+	my $wal_level = shift;
+
+	# Primary needs to have wal_level = minimal here
+	my $node = get_new_node("node_$wal_level");
+	$node->init;
+	$node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+max_prepared_transactions = 1
+));
+	$node->start;
+
+	# Setup
+	my $tablespace_dir = $node->bas

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-10-24 Thread Kyotaro Horiguchi
Ugh!

On Fri, Oct 25, 2019 at 1:13 PM Kyotaro Horiguchi  wrote:

> that. Instead, In the attached, MarkBufferDirtyHint() asks storage.c
> for sync-pending state of the relfilenode for the buffer. In the
> attached patch (0003)
> regards.
>

It's wrong that it also skips changing the flags.
I'll fix it soon.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: [HACKERS] Block level parallel vacuum

2019-10-24 Thread Masahiko Sawada
On Fri, Oct 25, 2019 at 12:44 PM Dilip Kumar  wrote:
>
> On Thu, Oct 24, 2019 at 8:12 PM Masahiko Sawada  wrote:
> >
> > On Thu, Oct 24, 2019 at 3:21 PM Dilip Kumar  wrote:
> > >
> > > On Fri, Oct 18, 2019 at 12:18 PM Dilip Kumar  
> > > wrote:
> > > >
> > > > On Fri, Oct 18, 2019 at 11:25 AM Amit Kapila  
> > > > wrote:
> > > > >
> > > > > On Fri, Oct 18, 2019 at 8:45 AM Dilip Kumar  
> > > > > wrote:
> > > > > >
> > > > > > On Thu, Oct 17, 2019 at 4:00 PM Amit Kapila 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Thu, Oct 17, 2019 at 3:25 PM Dilip Kumar 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Thu, Oct 17, 2019 at 2:12 PM Masahiko Sawada 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On Thu, Oct 17, 2019 at 5:30 PM Amit Kapila 
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > Another point in this regard is that the user anyway has an 
> > > > > > > > > > option to
> > > > > > > > > > turn off the cost-based vacuum.  By default, it is anyway 
> > > > > > > > > > disabled.
> > > > > > > > > > So, if the user enables it we have to provide some sensible 
> > > > > > > > > > behavior.
> > > > > > > > > > If we can't come up with anything, then, in the end, we 
> > > > > > > > > > might want to
> > > > > > > > > > turn it off for a parallel vacuum and mention the same in 
> > > > > > > > > > docs, but I
> > > > > > > > > > think we should try to come up with a solution for it.
> > > > > > > > >
> > > > > > > > > I finally got your point and now understood the need. And the 
> > > > > > > > > idea I
> > > > > > > > > proposed doesn't work fine.
> > > > > > > > >
> > > > > > > > > So you meant that all workers share the cost count and if a 
> > > > > > > > > parallel
> > > > > > > > > vacuum worker increase the cost and it reaches the limit, 
> > > > > > > > > does the
> > > > > > > > > only one worker sleep? Is that okay even though other 
> > > > > > > > > parallel workers
> > > > > > > > > are still running and then the sleep might not help?
> > > > > > > > >
> > > > > > >
> > > > > > > Remember that the other running workers will also increase
> > > > > > > VacuumCostBalance and whichever worker finds that it becomes 
> > > > > > > greater
> > > > > > > than VacuumCostLimit will reset its value and sleep.  So, won't 
> > > > > > > this
> > > > > > > make sure that overall throttling works the same?
> > > > > > >
> > > > > > > > I agree with this point.  There is a possibility that some of 
> > > > > > > > the
> > > > > > > > workers who are doing heavy I/O continue to work and OTOH other
> > > > > > > > workers who are doing very less I/O might become the victim and
> > > > > > > > unnecessarily delay its operation.
> > > > > > > >
> > > > > > >
> > > > > > > Sure, but will it impact the overall I/O?  I mean to say the rate
> > > > > > > limit we want to provide for overall vacuum operation will still 
> > > > > > > be
> > > > > > > the same.  Also, isn't a similar thing happens now also where heap
> > > > > > > might have done a major portion of I/O but soon after we start
> > > > > > > vacuuming the index, we will hit the limit and will sleep.
> > > > > >
> > > > > > Actually, What I meant is that the worker who performing actual I/O
> > > > > > might not go for the delay and another worker which has done only 
> > > > > > CPU
> > > > > > operation might pay the penalty?  So basically the worker who is 
> > > > > > doing
> > > > > > CPU intensive operation might go for the delay and pay the penalty 
> > > > > > and
> > > > > > the worker who is performing actual I/O continues to work and do
> > > > > > further I/O.  Do you think this is not a practical problem?
> > > > > >
> > > > >
> > > > > I don't know.  Generally, we try to delay (if required) before
> > > > > processing (read/write) one page which means it will happen for I/O
> > > > > intensive operations, so I am not sure if the point you are making is
> > > > > completely correct.
> > > >
> > > > Ok, I agree with the point that we are checking it only when we are
> > > > doing the I/O operation.  But, we also need to consider that each I/O
> > > > operations have a different weightage.  So even if we have a delay
> > > > point at I/O operation there is a possibility that we might delay the
> > > > worker which is just performing read buffer with page
> > > > hit(VacuumCostPageHit).  But, the other worker who is actually
> > > > dirtying the page(VacuumCostPageDirty = 20) continue the work and do
> > > > more I/O.
> > > >
> > > > >
> > > > > > Stepping back a bit,  OTOH, I think that we can not guarantee that 
> > > > > > the
> > > > > > one worker who has done more I/O will continue to do further I/O and
> > > > > > the one which has not done much I/O will not perform more I/O in
> > > > > > future.  So it might not be too bad if we compute shared costs as 
> > > > > > you
> > > > > > suggested above.
> > > > > >
> > > > >
> > > > > I am thinking if we can write the patch for both the approaches

Re: pglz performance

2019-10-24 Thread Andrey Borodin


> 21 окт. 2019 г., в 14:09, Andrey Borodin  написал(а):
> 
> With Silesian corpus pglz_decompress_hacked is actually decreasing 
> performance on high-entropy data.
> Meanwhile pglz_decompress_hacked8 is still faster than usual pglz_decompress.
> In spite of this benchmarks, I think that pglz_decompress_hacked8 is safer 
> option.

Here's v3, which takes the recent benchmarks with the Silesian corpus
into account and has better comments.
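
The core idea of the attached patch (expanding back-references with
memcpy in chunks instead of byte by byte) can be sketched roughly as
follows; this is simplified, self-contained illustration code based on
the discussion, not the patch itself.

#include <string.h>

/*
 * Expand a back-reference of "len" bytes located "off" bytes behind
 * the current write position "dp".  When source and destination do not
 * overlap, a single memcpy is enough; when they do overlap (off < len),
 * copy in chunks of at most "off" bytes so that each chunk reads bytes
 * that have already been written, reproducing the repeating pattern
 * that byte-by-byte copying would produce.
 */
static void
copy_match(unsigned char *dp, int off, int len)
{
	unsigned char *sp = dp - off;

	if (off >= len)
	{
		memcpy(dp, sp, len);
		return;
	}

	while (len > 0)
	{
		int			chunk = (off < len) ? off : len;

		memcpy(dp, sp, chunk);
		dp += chunk;
		sp += chunk;
		len -= chunk;
	}
}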

Thanks!


--
Andrey Borodin
Open source RDBMS development team leader
Yandex.Cloud


v3-0001-Use-memcpy-in-pglz-decompression.patch
Description: Binary data


Re: [HACKERS] Block level parallel vacuum

2019-10-24 Thread Dilip Kumar
On Fri, Oct 25, 2019 at 10:22 AM Masahiko Sawada  wrote:
>
> On Fri, Oct 25, 2019 at 12:44 PM Dilip Kumar  wrote:
> >
> > On Thu, Oct 24, 2019 at 8:12 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Thu, Oct 24, 2019 at 3:21 PM Dilip Kumar  wrote:
> > > >
> > > > On Fri, Oct 18, 2019 at 12:18 PM Dilip Kumar  
> > > > wrote:
> > > > >
> > > > > On Fri, Oct 18, 2019 at 11:25 AM Amit Kapila 
> > > > >  wrote:
> > > > > >
> > > > > > On Fri, Oct 18, 2019 at 8:45 AM Dilip Kumar  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Thu, Oct 17, 2019 at 4:00 PM Amit Kapila 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Thu, Oct 17, 2019 at 3:25 PM Dilip Kumar 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On Thu, Oct 17, 2019 at 2:12 PM Masahiko Sawada 
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > On Thu, Oct 17, 2019 at 5:30 PM Amit Kapila 
> > > > > > > > > >  wrote:
> > > > > > > > > > >
> > > > > > > > > > > Another point in this regard is that the user anyway has 
> > > > > > > > > > > an option to
> > > > > > > > > > > turn off the cost-based vacuum.  By default, it is anyway 
> > > > > > > > > > > disabled.
> > > > > > > > > > > So, if the user enables it we have to provide some 
> > > > > > > > > > > sensible behavior.
> > > > > > > > > > > If we can't come up with anything, then, in the end, we 
> > > > > > > > > > > might want to
> > > > > > > > > > > turn it off for a parallel vacuum and mention the same in 
> > > > > > > > > > > docs, but I
> > > > > > > > > > > think we should try to come up with a solution for it.
> > > > > > > > > >
> > > > > > > > > > I finally got your point and now understood the need. And 
> > > > > > > > > > the idea I
> > > > > > > > > > proposed doesn't work fine.
> > > > > > > > > >
> > > > > > > > > > So you meant that all workers share the cost count and if a 
> > > > > > > > > > parallel
> > > > > > > > > > vacuum worker increase the cost and it reaches the limit, 
> > > > > > > > > > does the
> > > > > > > > > > only one worker sleep? Is that okay even though other 
> > > > > > > > > > parallel workers
> > > > > > > > > > are still running and then the sleep might not help?
> > > > > > > > > >
> > > > > > > >
> > > > > > > > Remember that the other running workers will also increase
> > > > > > > > VacuumCostBalance and whichever worker finds that it becomes 
> > > > > > > > greater
> > > > > > > > than VacuumCostLimit will reset its value and sleep.  So, won't 
> > > > > > > > this
> > > > > > > > make sure that overall throttling works the same?
> > > > > > > >
> > > > > > > > > I agree with this point.  There is a possibility that some of 
> > > > > > > > > the
> > > > > > > > > workers who are doing heavy I/O continue to work and OTOH 
> > > > > > > > > other
> > > > > > > > > workers who are doing very less I/O might become the victim 
> > > > > > > > > and
> > > > > > > > > unnecessarily delay its operation.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Sure, but will it impact the overall I/O?  I mean to say the 
> > > > > > > > rate
> > > > > > > > limit we want to provide for overall vacuum operation will 
> > > > > > > > still be
> > > > > > > > the same.  Also, isn't a similar thing happens now also where 
> > > > > > > > heap
> > > > > > > > might have done a major portion of I/O but soon after we start
> > > > > > > > vacuuming the index, we will hit the limit and will sleep.
> > > > > > >
> > > > > > > Actually, What I meant is that the worker who performing actual 
> > > > > > > I/O
> > > > > > > might not go for the delay and another worker which has done only 
> > > > > > > CPU
> > > > > > > operation might pay the penalty?  So basically the worker who is 
> > > > > > > doing
> > > > > > > CPU intensive operation might go for the delay and pay the 
> > > > > > > penalty and
> > > > > > > the worker who is performing actual I/O continues to work and do
> > > > > > > further I/O.  Do you think this is not a practical problem?
> > > > > > >
> > > > > >
> > > > > > I don't know.  Generally, we try to delay (if required) before
> > > > > > processing (read/write) one page which means it will happen for I/O
> > > > > > intensive operations, so I am not sure if the point you are making 
> > > > > > is
> > > > > > completely correct.
> > > > >
> > > > > Ok, I agree with the point that we are checking it only when we are
> > > > > doing the I/O operation.  But, we also need to consider that each I/O
> > > > > operations have a different weightage.  So even if we have a delay
> > > > > point at I/O operation there is a possibility that we might delay the
> > > > > worker which is just performing read buffer with page
> > > > > hit(VacuumCostPageHit).  But, the other worker who is actually
> > > > > dirtying the page(VacuumCostPageDirty = 20) continue the work and do
> > > > > more I/O.
> > > > >
> > > > > >
> > > > > > > Stepping back a bit,  OTOH, I think that we can not guarantee 
> > > > > > > that the
> > > > > >

Duplicate entries in pg_depend after REINDEX CONCURRENTLY

2019-10-24 Thread Michael Paquier
Hi all,

While digging into a separate issue, I have found a new bug with
REINDEX CONCURRENTLY.  Once the new index is built and validated,
a couple of things are done at the swap phase, like switching
constraints, comments, and dependencies.  The current code moves all
the dependency entries in pg_depend from the old index to the new
index, but it never accounted for the fact that the new index may
already have some entries of its own.  So, once the swapping is done,
pg_depend ends up with duplicated entries: the ones coming from the
old index and the ones of the freshly-created index.  For example,
take an index which uses an attribute or an expression and therefore
has dependencies on the parent table's columns.

Attached is a patch to fix the issue.  As we know that the new index
will have a definition and dependencies matching those of the old one,
I think that we should just remove any dependency records on the new
index before moving the set of dependencies from the old index to the
new one.  The patch includes regression tests that scan pg_depend to
check that everything remains consistent after REINDEX CONCURRENTLY.

Any thoughts?
--
Michael
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index c48ad93e28..8d4e74969a 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1668,8 +1668,12 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 	}
 
 	/*
-	 * Move all dependencies of and on the old index to the new one
+	 * Move all dependencies of and on the old index to the new one.  First
+	 * remove any dependencies that the new index may have to provide an
+	 * initial clean state for the dependency move, and then move all the
+	 * dependencies from the old index to the new one.
 	 */
+	deleteDependencyRecordsFor(RelationRelationId, newIndexId, false);
 	changeDependenciesOf(RelationRelationId, oldIndexId, newIndexId);
 	changeDependenciesOn(RelationRelationId, oldIndexId, newIndexId);
 
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 324db1b6ae..5a93ac6e88 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1973,9 +1973,61 @@ ERROR:  conflicting key value violates exclusion constraint "concur_reindex_tab3
 DETAIL:  Key (c2)=([2,5)) conflicts with existing key (c2)=([1,3)).
 -- Check materialized views
 CREATE MATERIALIZED VIEW concur_reindex_matview AS SELECT * FROM concur_reindex_tab;
+-- Dependency lookup before and after the follow-up REINDEX commands.
+-- These should remain consistent.
+SELECT pg_describe_object(classid, objid, objsubid) as obj,
+   pg_describe_object(refclassid,refobjid,refobjsubid) as objref,
+   deptype
+FROM pg_depend
+WHERE classid = 'pg_class'::regclass AND
+  objid in ('concur_reindex_tab'::regclass,
+'concur_reindex_ind1'::regclass,
+	'concur_reindex_ind2'::regclass,
+	'concur_reindex_ind3'::regclass,
+	'concur_reindex_ind4'::regclass,
+	'concur_reindex_matview'::regclass)
+  ORDER BY 1, 2;
+   obj|   objref   | deptype 
+--++-
+ index concur_reindex_ind1| constraint concur_reindex_ind1 on table concur_reindex_tab | i
+ index concur_reindex_ind2| column c2 of table concur_reindex_tab  | a
+ index concur_reindex_ind3| column c1 of table concur_reindex_tab  | a
+ index concur_reindex_ind3| table concur_reindex_tab   | a
+ index concur_reindex_ind4| column c1 of table concur_reindex_tab  | a
+ index concur_reindex_ind4| column c1 of table concur_reindex_tab  | a
+ index concur_reindex_ind4| column c2 of table concur_reindex_tab  | a
+ materialized view concur_reindex_matview | schema public  | n
+ table concur_reindex_tab | schema public  | n
+(9 rows)
+
 REINDEX INDEX CONCURRENTLY concur_reindex_ind1;
 REINDEX TABLE CONCURRENTLY concur_reindex_tab;
 REINDEX TABLE CONCURRENTLY concur_reindex_matview;
+SELECT pg_describe_object(classid, objid, objsubid) as obj,
+   pg_describe_object(refclassid,refobjid,refobjsubid) as objref,
+   deptype
+FROM pg_depend
+WHERE classid = 'pg_class'::regclass AND
+  objid in ('concur_reindex_tab'::regclass,
+'concur_reindex_ind1'::regclass,
+	'concur_reindex_ind2'::regclass,
+	'concur_reindex_ind3'::regclass,
+	'concur_reindex_ind4'::regclass,
+	'concur_reindex_matview'::regclass)
+  ORDER BY 1, 2;
+   obj|   objref  

Re: Fix of fake unlogged LSN initialization

2019-10-24 Thread Heikki Linnakangas

On 24/10/2019 15:08, Michael Paquier wrote:
> On Thu, Oct 24, 2019 at 11:57:33AM +0100, Simon Riggs wrote:
>> I wonder why is that value 1000, rather than an aligned value or a whole
>> WAL page?
>
> Good question.  Heikki, why this choice?

No particular reason, it's just a nice round value in decimal.

- Heikki




Re: Fix of fake unlogged LSN initialization

2019-10-24 Thread Michael Paquier
On Thu, Oct 24, 2019 at 01:57:45PM +0530, Dilip Kumar wrote:
> I have noticed that in StartupXlog also we reset it with 1, you might
> want to fix that as well?

Tsunakawa-san's patch fixes that spot already.  Grepping for
unloggedLSN in the code, pg_resetwal is the only remaining spot on top
of what you are mentioning here.
--
Michael


signature.asc
Description: PGP signature


Re: Fix of fake unlogged LSN initialization

2019-10-24 Thread Michael Paquier
On Fri, Oct 25, 2019 at 02:07:04AM +, tsunakawa.ta...@fujitsu.com wrote:
> From: Simon Riggs 
>>  From xlogdefs.h added by 9155580:
>>  /*
>>   * First LSN to use for "fake" LSNs.
>>   *
>>   * Values smaller than this can be used for special per-AM purposes.
>>   */
>>  #define FirstNormalUnloggedLSN  ((XLogRecPtr) 1000)
> 
> Yeah, I had seen it, but I didn't understand what kind of usage is assumed.

There is an explanation in the commit message of 9155580: it is there
so that the interlocking logic in GiST can work where a valid LSN
needs to be used.  So a magic value was wanted.

By the way, your patch looks fine to me after a second look, so I
think that we had better commit it and back-patch it sooner rather
than later.  If there are any objections or more comments, please
feel free to share them.
--
Michael


signature.asc
Description: PGP signature