Re: MSVC Build support with visual studio 2019
On Wed, May 29, 2019 at 10:30 AM Haribabu Kommi wrote: > > Updated patches are attached. > > All patches apply, build and pass tests. The patch '0001-support-building-with-visual-studio-2019_v10_to_v9.6_v3.patch' applies on version 9.5. Not sure if more review is needed before moving to 'ready for committer', but I have no more comments to make on the current patches. Regards, Juan José Santamaría Flecha
Re: compiling PL/pgSQL plugin with C++
Tom, thanks for the prompt response. Based on this topic and some nearby ones, the problem turned out to be deeper than expected... as always. p.s. Sorry for the Cyrillic in the mailing list. At the beginning I wrote from my corporate email and could not change the sender name. If you can, please replace it. Regards, George
Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)
Re: Tom Lane 2019-06-04 <65800.1559662...@sss.pgh.pa.us> > > There is something wrong here. On Debian Buster/unstable, using > > system tzdata (2019a-1), if /etc/timezone is "Etc/UTC": > > Is your build using --with-system-tzdata? If so, which tzdb > release is the system on, and is it a completely stock copy > of that release? It's using system tzdata (2019a-1). There's one single patch on top of that: https://sources.debian.org/src/tzdata/2019a-1/debian/patches/ > BTW, does Debian set up /etc/timezone as a symlink, by any chance, > rather than a copy or hard link? If it's a symlink, we could improve > matters by teaching identify_system_timezone() to inspect it. In the meantime I realized that I was only testing /etc/timezone (which is a plain file with just the zone name), while not touching /etc/localtime at all. In this environment, it's a symlink: lrwxrwxrwx 1 root root 27 Mär 28 14:49 /etc/localtime -> /usr/share/zoneinfo/Etc/UTC ... but the name still gets canonicalized to Etc/UCT or UCT. Christoph
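For what it's worth, the symlink-inspection idea could look roughly like the sketch below. This is not the actual identify_system_timezone() code; the zoneinfo prefix and the way the zone name is derived from the link target are assumptions:

    #include <limits.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /*
     * Hedged sketch: if /etc/localtime is a symlink into the tzdata tree,
     * derive the zone name from the link target, e.g.
     * /usr/share/zoneinfo/Etc/UTC -> "Etc/UTC".  Returns NULL if it is not
     * a symlink or does not point under the expected prefix.
     */
    static const char *
    zone_name_from_localtime_link(char *buf, size_t buflen)
    {
        const char *prefix = "/usr/share/zoneinfo/";   /* assumed location */
        char        target[PATH_MAX];
        ssize_t     len;

        len = readlink("/etc/localtime", target, sizeof(target) - 1);
        if (len < 0)
            return NULL;            /* not a symlink (or unreadable) */
        target[len] = '\0';

        if (strncmp(target, prefix, strlen(prefix)) != 0)
            return NULL;            /* points somewhere unexpected */

        snprintf(buf, buflen, "%s", target + strlen(prefix));
        return buf;
    }

On the Debian setup above this would yield "Etc/UTC" directly from the link target, independent of how the zone name gets canonicalized from the binary zone file contents.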
Re: Confusing error message for REINDEX TABLE CONCURRENTLY
On Wed, 5 Jun 2019 at 18:11, Michael Paquier wrote: > > On Tue, Jun 04, 2019 at 11:26:44AM -0700, Ashwin Agrawal wrote: > > The variable is used in the else scope, hence I moved it there. But yes it's > > removed completely for this scope. > > Thanks for updating the patch. It does its job by having one separate > message for the concurrent and the non-concurrent cases as discussed. > David, what do you think? Perhaps you would like to commit it > yourself? Thanks. I've just pushed this with some additional comment changes. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: hyrax vs. RelationBuildPartitionDesc
On Tue, Jun 4, 2019 at 9:25 PM Robert Haas wrote: > On Fri, May 17, 2019 at 4:36 PM Tom Lane wrote: > > Yeah, I did some additional testing that showed that it's pretty darn > > hard to get the leak to amount to anything. The test case that I > > originally posted did many DDLs in a single transaction, and it > > seems that that's actually essential to get a meaningful leak; as > > soon as you exit the transaction the leaked contexts will be recovered > > during sinval cleanup. > > My colleague Amul Sul rediscovered this same leak when he tried to > attach lots of partitions to an existing partitioned table, all in the > course of a single transaction. This seems a little less artificial > than Tom's original reproducer, which involved attaching and detaching > the same partition repeatedly. > > Here is a patch that tries to fix this, along the lines I previously > suggested; Amul reports that it does work for him.

Thanks for the patch. I noticed a crash with one of the scenarios that I think the patch is meant to address. Let me describe the steps:

Attach gdb (or whatever) to session 1, in which we'll run a query that will scan at least two partitions, and set a breakpoint in expand_partitioned_rtentry(). Run the query, so the breakpoint will be hit. Step through up to the start of the following loop in this function:

    i = -1;
    while ((i = bms_next_member(live_parts, i)) >= 0)
    {
        Oid         childOID = partdesc->oids[i];
        Relation    childrel;
        RangeTblEntry *childrte;
        Index       childRTindex;
        RelOptInfo *childrelinfo;

        /* Open rel, acquiring required locks */
        childrel = table_open(childOID, lockmode);

Note that 'partdesc' in the above code is from the partition directory.

Before stepping into the loop, start another session and attach a new partition. On into the loop. When the 1st partition is opened, its locking will result in RelationClearRelation() being called on the parent relation due to the partition being attached concurrently, which, I observed, is actually invoked a couple of times due to recursion. The parent relation's rd_oldpdcxt will be set in this process, which contains the aforementioned partition descriptor.

Before the loop moves on to process the 2nd partition, attach another partition in session 2. When the 2nd partition is opened, RelationClearRelation() will run again and, in one of its recursive invocations, it destroys the rd_oldpdcxt that was set previously, taking the partition directory's partition descriptor with it. Anything that tries to access it later crashes.

I couldn't figure out what to blame here though -- the design of rd_pdoldcxt or the fact that RelationClearRelation() is invoked multiple times. I will try to study it more closely tomorrow.

Thanks, Amit
Re: Confusing comment for function ExecParallelEstimate
On Wed, Jun 5, 2019 at 11:27 AM Wu, Fei wrote: > > Thanks for your reply. > From the code below: > (https://github.com/postgres/postgres/blob/REL_10_7/src/backend/executor/execParallel.c)
> ###
>  443         /*
>  444          * Give parallel-aware nodes a chance to add to the estimates, and get a
>  445          * count of how many PlanState nodes there are.
>  446          */
>  447         e.pcxt = pcxt;
>  448         e.nnodes = 0;
>  449         ExecParallelEstimate(planstate, &e);
>  450
>  451         /* Estimate space for instrumentation, if required. */
>  452         if (estate->es_instrument)
>  453         {
>  454                 instrumentation_len =
>  455                         offsetof(SharedExecutorInstrumentation, plan_node_id) +
>  456                         sizeof(int) * e.nnodes;
>  457                 instrumentation_len = MAXALIGN(instrumentation_len);
>  458                 instrument_offset = instrumentation_len;
>  459                 instrumentation_len +=
>  460                         mul_size(sizeof(Instrumentation),
>  461                                  mul_size(e.nnodes, nworkers));
>  462                 shm_toc_estimate_chunk(&pcxt->estimator, instrumentation_len);
>  463                 shm_toc_estimate_keys(&pcxt->estimator, 1);
> ###
>
> It seems that e.nnodes, which is returned from ExecParallelEstimate(planstate, &e), determines how many instrumentation structures are reserved in the DSM (line459~line461). And e.nnodes also determines the length of SharedExecutorInstrumentation->plan_node_id (line454~line456).
>
> So, I think here it refers to instrumentation.
>
Right. I think the way it is mentioned (SharedPlanStateInstrumentation structures ..) in the comment can confuse readers. We can replace SharedPlanStateInstrumentation with Instrumentation in the comment. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
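To make the sizing concrete, here is a small stand-alone illustration of the arithmetic in lines 454~461. The node and worker counts are made up, and the struct below is only a stand-in for the real SharedExecutorInstrumentation, so the absolute numbers are not meaningful:

    #include <stddef.h>
    #include <stdio.h>

    /* Assume 8-byte maximum alignment, as on a typical 64-bit build. */
    #define MAXALIGN(x)  (((x) + 7) & ~((size_t) 7))

    /* Stand-in for SharedExecutorInstrumentation, for illustration only. */
    typedef struct SharedExecutorInstrumentation_stub
    {
        int     instrument_options;
        size_t  instrument_offset;
        int     num_workers;
        int     num_plan_nodes;
        int     plan_node_id[];     /* e.nnodes entries */
    } SharedExecutorInstrumentation_stub;

    int
    main(void)
    {
        int     nnodes = 10;        /* e.nnodes from ExecParallelEstimate() */
        int     nworkers = 4;
        size_t  sizeof_instrumentation = 128;   /* stand-in for sizeof(Instrumentation) */
        size_t  len;

        /* fixed header plus one plan_node_id per node (lines 454~457) */
        len = MAXALIGN(offsetof(SharedExecutorInstrumentation_stub, plan_node_id)
                       + sizeof(int) * nnodes);
        /* one Instrumentation per (plan node, worker) pair (lines 459~461) */
        len += sizeof_instrumentation * nnodes * nworkers;

        printf("estimated instrumentation chunk: %zu bytes\n", len);
        return 0;
    }

Either way, e.nnodes drives both the length of the plan_node_id array and the number of per-worker Instrumentation slots, which is why describing them as Instrumentation structures in the comment seems clearer.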
Re: Binary support for pgoutput plugin
Hi, On 05/06/2019 00:08, Andres Freund wrote: > Hi, > > On 2019-06-05 00:05:02 +0200, David Fetter wrote: >> Would it make sense to work toward a binary format that's not >> architecture-specific? I recall from COPY that our binary format is >> not standardized across, for example, big- and little-endian machines. > > I think you recall wrongly. It's obviously possible that we have bugs > around this, but output/input routines are supposed to handle a > endianess independent format. That usually means that you have to do > endianess conversions, but that doesn't make it non-standardized. > Yeah, there are really 3 formats of data we have, text protocol, binary network protocol and internal on disk format. The internal on disk format will not work across big/little-endian but network binary protocol will. FWIW I don't think the code for binary format was included in original logical replication patch (I really tried to keep it as minimal as possible), but the code and protocol is pretty much ready for adding that. That said, pglogical has code which handles this (I guess Andres means that by predecessor of pgoutput) so if you look for example at the write_tuple/read_tuple/decide_datum_transfer in https://github.com/2ndQuadrant/pglogical/blob/REL2_x_STABLE/pglogical_proto_native.c that can help you give some ideas on how to approach this. -- Petr Jelinek 2ndQuadrant - PostgreSQL Solutions https://www.2ndQuadrant.com/
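As a tiny illustration of what "endianness independent" means at the protocol level, sending and reading a 4-byte column value with the existing pqformat helpers looks roughly like this. pq_sendint32() and pq_getmsgint() are the existing helpers; everything else here is made up for the sketch:

    #include "postgres.h"
    #include "libpq/pqformat.h"

    /*
     * Hedged sketch: pq_sendint32() always writes network byte order
     * (big-endian) regardless of the host, and pq_getmsgint() converts
     * back to host order, so big- and little-endian peers interoperate.
     */
    static void
    send_int4_value(StringInfo out, int32 value)
    {
        pq_sendint32(out, value);       /* host -> network byte order */
    }

    static int32
    read_int4_value(StringInfo in)
    {
        return pq_getmsgint(in, 4);     /* network -> host byte order */
    }

The same idea applies per datatype in the pglogical code referenced above: whatever is put on the wire is converted to a fixed byte order, which is what makes a binary transfer format portable even though the on-disk representation is not.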
Re: Binary support for pgoutput plugin
Hi, On Wed, 5 Jun 2019 at 07:18, Petr Jelinek wrote: > Hi, > > On 05/06/2019 00:08, Andres Freund wrote: > > Hi, > > > > On 2019-06-05 00:05:02 +0200, David Fetter wrote: > >> Would it make sense to work toward a binary format that's not > >> architecture-specific? I recall from COPY that our binary format is > >> not standardized across, for example, big- and little-endian machines. > > > > I think you recall wrongly. It's obviously possible that we have bugs > > around this, but output/input routines are supposed to handle a > > endianess independent format. That usually means that you have to do > > endianess conversions, but that doesn't make it non-standardized. > > > > Yeah, there are really 3 formats of data we have, text protocol, binary > network protocol and internal on disk format. The internal on disk > format will not work across big/little-endian but network binary > protocol will. > > FWIW I don't think the code for binary format was included in original > logical replication patch (I really tried to keep it as minimal as > possible), but the code and protocol is pretty much ready for adding that. > Yes, I looked through the public history and could not find it. Thanks for confirming. > > That said, pglogical has code which handles this (I guess Andres means > that by predecessor of pgoutput) so if you look for example at the > write_tuple/read_tuple/decide_datum_transfer in > > https://github.com/2ndQuadrant/pglogical/blob/REL2_x_STABLE/pglogical_proto_native.c > that can help you give some ideas on how to approach this. > Thanks for the tip! Dave Cramer > > >
Re: "WIP: Data at rest encryption" patch and, PostgreSQL 11-beta3
Robert Haas wrote: > On Fri, May 31, 2019 at 2:59 AM Antonin Houska wrote: > > > Sounds good. I'm not quite sure how this is going to work. Ideally > > > you'd like the encryption key command to fetch the key from something > > > like ssh-agent, or maybe pop up a window on the user's terminal with a > > > key prompt. Just reading from stdin and writing to stdout is not > > > going to be very convenient... > > > > When I mentioned writing to stdout in my previous email, I viewed it from > > the > > perspective of postmaster: whichever way the command gets the password from > > the DBA, it'll always write it to stdout and postmaster will read it from > > there. This was to address your objection that the executables do not > > actually > > "return" strings. > > So the part about returning strings is really just a wording issue: > the documentation needs to talk about data sent to stdout or wherever, > because that's what really happens, and somebody writing such a > command from scratch needs to know what it must do. > > What I'm talking about here is that it also has to be reasonably > possible to write an encryption key command that does something > useful. I don't have a really clear vision for how that's going to > work. Nobody wants the server, which is probably being launched by > pg_ctl or systemd or both, to prompt using its own stdin/stderr, but > the we need to think about how the prompting is actually going to > work. Since you mentioned ssh-agent above, I think that postmaster can, instead of running a command, create an unix socket and read the key from there. (I refer only to the socket as a communication channel, not to the protocol - ssh-agent does not seem to actually send key over the socket.) Unlike the socket for backend connections, this one would only be readable by the user that runs postmaster, and would only exist during the encryption initialization. The simplest approach from the perspective of the DBA is that pg_ctl can write the key to the socket. Besides that we can also implement a separate utility to send the key, to be used in other special cases such as starting postgres via systemd. (If the unix socket is a problem on windows, we might need to use named pipe instead.) > > The header comment in pg_upgrade.c indicates that this is because of TOAST. > > So > > I think that dbnode and relfilenode are not preserved just because there was > > no strong reason to do so by now. > > Maybe you want to propose an independent patch making that change? I think of a separate diff in the encryption patch series. As the encryption is the only reason for this enhancement, I'm not sure if it deserves a separate CF entry. (Although it might be considered refactoring because eventually pg_upgrade won't need to handle the different relnodes, and thus it might become a little bit simpler.) -- Antonin Houska Web: https://www.cybertec-postgresql.com
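To make the socket idea slightly more concrete, the postmaster side could look roughly like the sketch below. This is entirely hypothetical and not code from any patch: the socket path handling, permissions and error handling are all simplified.

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/stat.h>
    #include <sys/un.h>
    #include <unistd.h>

    /*
     * Hedged sketch: create a Unix-domain socket readable only by the
     * postmaster's user, accept exactly one connection, and read the
     * encryption key from it.  Returns the number of key bytes read, or -1
     * on error.  pg_ctl (or a small helper utility) would connect to the
     * same path and write the key.
     */
    static ssize_t
    read_key_from_socket(const char *sockpath, char *keybuf, size_t keybuflen)
    {
        struct sockaddr_un addr;
        int         listenfd;
        int         connfd;
        ssize_t     nread;

        listenfd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (listenfd < 0)
            return -1;

        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, sockpath, sizeof(addr.sun_path) - 1);

        unlink(sockpath);               /* remove any stale socket */
        if (bind(listenfd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
            chmod(sockpath, 0600) < 0 ||    /* only this user may connect */
            listen(listenfd, 1) < 0)
        {
            close(listenfd);
            return -1;
        }

        connfd = accept(listenfd, NULL, NULL);
        nread = (connfd < 0) ? -1 : read(connfd, keybuf, keybuflen);

        if (connfd >= 0)
            close(connfd);
        close(listenfd);
        unlink(sockpath);       /* socket only exists during initialization */
        return nread;
    }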
pg_basebackup failure after setting default_table_access_method option
Hi,

I noticed a pg_basebackup failure when the default_table_access_method option is set.

Test steps:

Step 1: Init database
./initdb -D data

Step 2: Start server
./postgres -D data &

Step 3: Set GUC option
export PGOPTIONS='-c default_table_access_method=zheap'

Step 4: Perform backup
./pg_basebackup -D backup -p 5432 --no-sync
2019-06-05 20:35:04.088 IST [11601] FATAL: cannot read pg_class without having selected a database
pg_basebackup: error: could not connect to server: FATAL: cannot read pg_class without having selected a database

Reason why it is failing:

pg_basebackup does not use any database to connect to the server, as it backs up the whole data instance. Because the option default_table_access_method is set, the server tries to validate it, but validating the option reaches this check in the ScanPgRelation function:

    if (!OidIsValid(MyDatabaseId))
        elog(FATAL, "cannot read pg_class without having selected a database");

Since pg_basebackup uses no database, the command fails.

Fix:

The attached patch has the fix for the above issue. Let me know your opinion on this.

--
Regards,
vignesh
EnterpriseDB: http://www.enterprisedb.com

0001-pg-backup-guc-option-validation-failure.patch
Description: Binary data
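The attached patch is not shown inline, so purely as an illustration of the idea, a guard along these lines could be added to the GUC's check hook. This is a sketch under assumptions; the function name and details are not necessarily what the attached patch does:

    #include "postgres.h"

    #include "access/xact.h"
    #include "miscadmin.h"
    #include "utils/guc.h"
    #include "utils/syscache.h"

    /*
     * Hedged sketch only, not the attached patch: when the backend has no
     * database selected (as in a walsender serving pg_basebackup), skip the
     * catalog lookup and accept the value on faith; it can be validated
     * again where it is actually used.
     */
    static bool
    check_default_table_access_method_sketch(char **newval, void **extra,
                                             GucSource source)
    {
        if (**newval == '\0')
        {
            GUC_check_errdetail("default_table_access_method cannot be empty.");
            return false;
        }

        /* Cannot read pg_am without being in a transaction and a database. */
        if (!IsTransactionState() || !OidIsValid(MyDatabaseId))
            return true;

        if (!SearchSysCacheExists1(AMNAME, CStringGetDatum(*newval)))
        {
            GUC_check_errdetail("Table access method \"%s\" does not exist.",
                                *newval);
            return false;
        }

        return true;
    }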
Re: Binary support for pgoutput plugin
On Wed, 5 Jun 2019 at 07:21, Dave Cramer wrote: > Hi, > > > On Wed, 5 Jun 2019 at 07:18, Petr Jelinek > wrote: > >> Hi, >> >> On 05/06/2019 00:08, Andres Freund wrote: >> > Hi, >> > >> > On 2019-06-05 00:05:02 +0200, David Fetter wrote: >> >> Would it make sense to work toward a binary format that's not >> >> architecture-specific? I recall from COPY that our binary format is >> >> not standardized across, for example, big- and little-endian machines. >> > >> > I think you recall wrongly. It's obviously possible that we have bugs >> > around this, but output/input routines are supposed to handle a >> > endianess independent format. That usually means that you have to do >> > endianess conversions, but that doesn't make it non-standardized. >> > >> >> Yeah, there are really 3 formats of data we have, text protocol, binary >> network protocol and internal on disk format. The internal on disk >> format will not work across big/little-endian but network binary >> protocol will. >> >> FWIW I don't think the code for binary format was included in original >> logical replication patch (I really tried to keep it as minimal as >> possible), but the code and protocol is pretty much ready for adding that. >> > Yes, I looked through the public history and could not find it. Thanks for > confirming. > >> >> That said, pglogical has code which handles this (I guess Andres means >> that by predecessor of pgoutput) so if you look for example at the >> write_tuple/read_tuple/decide_datum_transfer in >> >> https://github.com/2ndQuadrant/pglogical/blob/REL2_x_STABLE/pglogical_proto_native.c >> that can help you give some ideas on how to approach this. >> > > Thanks for the tip! > Looking at: https://github.com/postgres/postgres/blob/8255c7a5eeba8f1a38b7a431c04909bde4f5e67d/src/backend/replication/pgoutput/pgoutput.c#L163 this seems completely ignored. What was the intent? Dave
Re: WITH NOT MATERIALIZED and DML CTEs
On 2019-Jun-03, Andres Freund wrote: > On 2019-06-03 11:45:51 -0400, Elvis Pranskevichus wrote: > > It is understandably late in the 12 cycle, so maybe prohibit NOT > > MATERIALIZED with DML altogether and revisit this in 13? > > I could see us adding an error, or just continuing to silently ignore > it. Hmm, shouldn't we be throwing an error for that case? I'm not sure it's defensible that we don't. -- Álvaro Herrera  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Binary support for pgoutput plugin
Hi On June 5, 2019 8:51:10 AM PDT, Dave Cramer wrote: >On Wed, 5 Jun 2019 at 07:21, Dave Cramer wrote: > >> Hi, >> >> >> On Wed, 5 Jun 2019 at 07:18, Petr Jelinek > >> wrote: >> >>> Hi, >>> >>> On 05/06/2019 00:08, Andres Freund wrote: >>> > Hi, >>> > >>> > On 2019-06-05 00:05:02 +0200, David Fetter wrote: >>> >> Would it make sense to work toward a binary format that's not >>> >> architecture-specific? I recall from COPY that our binary format >is >>> >> not standardized across, for example, big- and little-endian >machines. >>> > >>> > I think you recall wrongly. It's obviously possible that we have >bugs >>> > around this, but output/input routines are supposed to handle a >>> > endianess independent format. That usually means that you have to >do >>> > endianess conversions, but that doesn't make it non-standardized. >>> > >>> >>> Yeah, there are really 3 formats of data we have, text protocol, >binary >>> network protocol and internal on disk format. The internal on disk >>> format will not work across big/little-endian but network binary >>> protocol will. >>> >>> FWIW I don't think the code for binary format was included in >original >>> logical replication patch (I really tried to keep it as minimal as >>> possible), but the code and protocol is pretty much ready for adding >that. >>> >> Yes, I looked through the public history and could not find it. >Thanks for >> confirming. >> >>> >>> That said, pglogical has code which handles this (I guess Andres >means >>> that by predecessor of pgoutput) so if you look for example at the >>> write_tuple/read_tuple/decide_datum_transfer in >>> >>> >https://github.com/2ndQuadrant/pglogical/blob/REL2_x_STABLE/pglogical_proto_native.c >>> that can help you give some ideas on how to approach this. >>> >> >> Thanks for the tip! >> > >Looking at: >https://github.com/postgres/postgres/blob/8255c7a5eeba8f1a38b7a431c04909bde4f5e67d/src/backend/replication/pgoutput/pgoutput.c#L163 > >this seems completely ignored. What was the intent? That's about the output of the plugin, not the datatypes. And independent of text/binary output, the protocol contains non-printable chars. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH] Improve performance of NOTIFY over many databases (issue blocking on AccessExclusiveLock on object 0 of class 1262 of database 0)
Hoi hackers, Please find attached updated versions of the patches, I've now tested them. Also attached is a reproduction script to verify that they actually work. To test you need to create 150 databases as described in the script, then simply execute it. Before patching you get the following results (last figure is the CPU usage of Postgres): 1559749330 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.01, Avg: 0.01 [0.01/0.01/0.01/0.01/0.01], 269.07% 1559749335 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.01, Avg: 0.01 [0.01/0.01/0.01/0.01/0.01], 268.07% 1559749340 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.01, Avg: 0.01 [0.01/0.01/0.01/0.01/0.01], 270.94% After patching you get the following: 1559749840 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.02, Avg: 0.01 [0.01/0.01/0.01/0.01/0.01], 5.09% 1559749845 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.01, Avg: 0.01 [0.01/0.01/0.01/0.01/0.01], 5.06% 1559749850 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.01, Avg: 0.01 [0.01/0.01/0.01/0.01/0.01], 4.75% The async queue functions in postgres also no longer appear in the perf output (below measuring threshold). As for general method, it seems like the actual optimisation here is that the async queue tail pointer is only updated once per SLRU page instead of every message. This would require a significantly larger patch, but shouldn't be too difficult. Thoughts? Have a nice day, Martijn On Tue, 4 Jun 2019 at 09:08, Martijn van Oosterhout wrote: > > Hoi hackers, > > We've been having issues with NOTIFYs blocking over multiple databases > (see [1] for more details). That was 9.4 but we've updated the > database to 11.3 and still have the same issue. Now however we could > use perf to do profiling and got the following profile (useless > details elided): > > --32.83%--ProcessClientReadInterrupt >--32.68%--ProcessNotifyInterrupt > --32.16%--asyncQueueReadAllNotifications > --23.37%--asyncQueueAdvanceTail > --20.49%--LWLockAcquire >--18.93%--LWLockQueueSelf > --12.99%--LWLockWaitListLock > > (from: perf record -F 99 -ag -- sleep 600) > > That shows that more than 20% of the time is spent in that single > function, waiting for an exclusive lock on the AsyncQueueLock. This > will block any concurrent session doing a NOTIFY in any database on > the system. This would certainly explain the symptoms we're seeing > (process xxx still waiting for AccessExclusiveLock on object 0 of > class 1262 of database 0). > > Analysis of the code leads me to the following hypothesis (and hence > to the attached patches): > > We have ~150 databases, each of which has 2 active backends with an > active LISTEN. When a NOTIFY happens anywhere on any database it > (under an exclusive lock) makes a list of 300 backends to send a > signal to. It then wakes up all of those backends. Each backend then > examines the message and all but one discards it as being for the > wrong database. Each backend then calls asyncQueueAdvanceTail (because > the current position of the each backend was the tail) which then > takes an exclusive lock and checks all the other backends to see if > the tail can be advanced. All of these will conclude 'no', except the > very last one which concludes the tail can be advanced by about 50 > bytes or so. > > So the inner loop of asyncQueueAdvanceTail will, while holding a > global exclusive lock, execute 2*150*4000 (max backends) = 1.2 million > times for basically no benefit. During this time, no other transaction > anywhere in the system that does a NOTIFY will be able to commit. 
> > The attached patches attempt reduce the overhead in two ways: > > Patch 1: Changes asyncQueueAdvanceTail to do nothing unless the > QUEUE_HEAD is on a different page than the QUEUE_TAIL. The idea is > that there's no point trying to advance the tail unless we can > actually usefully truncate the SLRU. This does however mean that > asyncQueueReadAllNotifications always has to call > asyncQueueAdvanceTail since it can no longer be guaranteed that any > backend is still at the tail, which is one of the assumptions of the > current code. Not sure if this is a problem or if it can be improved > without tracking much more state. > > Patch 2: Changes SignalBackends to only notify other backends when (a) > they're the same database as me or (b) the notify queue has advanced > to a new SLRU page. This avoids backends being woken up for messages > which they are not interested in. > > As a consequence of these changes, we can reduce the number of > exclusive locks and backend wake ups in our case by a factor of 300. > You still however get a thundering herd at the end of each SLRU page. > > Note: these patches have not yet been extensively tested, and so > should only be used as basis for discussion. > > Comments? Suggestions? > > [1] > https://www.postgresql.org/message-id/cadwg95t0j9zf0uwdcmh81kmndsitavhxmbvgyqrrjcd-ilw...@mail.gmail.com > > -- > Marti
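For readers skimming the thread, the core of both ideas can be sketched roughly like this. This is simplified pseudo-code written against async.c's existing QUEUE_POS_PAGE()/QUEUE_HEAD/QUEUE_TAIL macros and QueuePosition type, not the attached patches themselves, and the exact conditions in the patches may differ:

    /*
     * Sketch of patch 1's idea: only try to advance the tail (and thus only
     * take AsyncQueueLock exclusively and scan all backends) when doing so
     * could actually allow an SLRU page to be truncated.
     */
    static void
    asyncQueueAdvanceTail_sketch(void)
    {
        if (QUEUE_POS_PAGE(QUEUE_HEAD) == QUEUE_POS_PAGE(QUEUE_TAIL))
            return;             /* nothing truncatable yet */

        /* ... existing logic: scan backends, compute min position, truncate ... */
    }

    /*
     * Sketch of patch 2's idea: when signalling listening backends, skip
     * those in other databases unless the queue head has moved onto a new
     * page since they last read, so they are still woken up often enough to
     * keep the tail advancing.
     */
    static bool
    should_signal_backend_sketch(Oid listener_db, QueuePosition listener_pos)
    {
        if (listener_db == MyDatabaseId)
            return true;
        return QUEUE_POS_PAGE(QUEUE_HEAD) != QUEUE_POS_PAGE(listener_pos);
    }

The effect is that the exclusive-lock work happens once per SLRU page rather than once per notification, which matches the measured drop in CPU usage above.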
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Mon, 3 Jun 2019 at 15:39, James Coleman wrote: > > On Sun, Jun 2, 2019 at 5:18 PM Tomas Vondra wrote: > > Currently, the costing logic (cost_incremental_sort) assumes all groups > > have the same size, which is fine for uniform distributions. But for > > skewed distributions, that may be an issue. > > > > Consider for example a table with 1M rows, two columns, 100 groups in each > > column, and index on the first one. > > > > CREATE table t (a INT, b INT); > > > > INSERT INTO t SELECT 100*random(), 100*random() > > FROM generate_series(1,1000000); > > > > Now, let's do a simple limit query to find the first row: > > > > SELECT * FROM t ORDER BY a, b LIMIT 1; > > > > In this case the current costing logic is fine - the groups are close to > > average size, and we only need to sort the first group, i.e. 1% of data. > > > > Now, let's say the first group is much larger: > > > > > > INSERT INTO t SELECT 0, 100*random() > > FROM generate_series(1,900000) s(i); > > > > INSERT INTO t SELECT 100*random(), 100*random() > > FROM generate_series(1,100000); > > > > That is, the first group is roughly 90% of data, but the number of groups > > is still the same. But we need to scan 90% of data. But the average group > > size is still the same, so the current cost model is oblivious to this. > > Thinking out loud here: the current implementation doesn't guarantee > that sort groups always have the same prefix column values because > (from code comments) "Sorting many small groups with tuplesort is > inefficient"). While this seems a reasonable optimization, I think it's > possible that thinking steered away from an optimization in the > opposite direction. Perhaps we should always track whether or not all > prefix tuples are the same (currently we only do that after reaching > DEFAULT_MIN_GROUP_SIZE tuples) and use that information to be able to > have tuplesort only care about the non-prefix columns (where currently > it has to sort on all pathkey columns even though for a large group > the prefix columns are guaranteed to be equal). > +1 for passing only the non-prefix columns to tuplesort. > Essentially I'm trying to think of ways that would get us to > comparable performance with regular sort in the case of large batch > sizes. > > One other thing about the DEFAULT_MIN_GROUP_SIZE logic is that in the > case where you have a very small group and then a very large batch, > we'd lose the ability to optimize in the above way. That makes me > wonder if we shouldn't intentionally optimize for the possibility of > large batch sizes, since a little extra expense per group/tuple is > more likely to be a non-concern with small groups anyway when there > are large numbers of input tuples but a relatively small limit. > What about using the knowledge of MCV here, if we know the next value is in MCV list then take the overhead of sorting this small group alone and then leverage the optimization for the larger group, by passing only the non-prefix columns. > Thoughts? > > > So I think we should look at the MCV list, and use that information when > > computing the startup/total cost. I think using the first/largest group to > > compute the startup cost, and the average group size for total cost would > > do the trick. > +1 > I think this sounds very reasonable. > > > I don't think we can do much better than this during planning. There will > > inevitably be cases where the costing model will push us to do the wrong > > thing, in either direction.
The question is how serious issue that is, and > > whether we could remedy that during execution somehow. > > > > When we "incorrectly" pick full sort (when the incremental sort would be > > faster), that's obviously sad but I think it's mostly fine because it's > > not a regression. > > > > The opposite direction (picking incremental sort, while full sort being > > faster) is probably more serious, because it's a regression between > > releases. > > > > I don't think we can fully fix that by refining the cost model. We have > > two basic options: > > > > 1) Argue/show that this is not an issue in practice, because (a) such > > cases are very rare, and/or (b) the regression is limited. In short, the > > benefits of the patch outweight the losses. > > My comments above go in this direction. If we can improve performance > in the worst case, I think it's plausible this concern becomes a > non-issue. > > > 2) Provide some fallback at execution time. For example, we might watch > > the size of the group, and if we run into an unexpectedly large one we > > might abandon the incremental sort and switch to a "full sort" mode. > > Are there good examples of our doing this in other types of nodes > (whether the fallback is an entirely different algorithm/node type)? I > like this idea in theory, but I also think it's likely it would add a > very significant amount of complexity. The other problem is knowing > where to draw the line: you end up creating th
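Spelling the MCV-based costing suggestion from earlier in this message out a bit (a sketch of the idea only, not the formulas in any patch): with N input rows, G estimated prefix groups, n_1 the size of the first (or largest, per the MCV list) group, and C_sort(n) the cost of sorting one n-row group on the remaining keys, the proposal amounts to roughly

    \[
      \mathrm{startup\_cost} \approx \mathrm{input\_startup\_cost} + C_{\mathrm{sort}}(n_1)
    \]
    \[
      \mathrm{total\_cost} \approx \mathrm{input\_total\_cost} + G \cdot C_{\mathrm{sort}}(N / G)
    \]

i.e. the startup cost reflects having to read and sort the whole first group before returning anything, while the total cost still uses the average group size.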
Re: Sort support for macaddr8
On Tue, Jun 4, 2019 at 3:50 PM Peter Geoghegan wrote: > On Tue, Jun 4, 2019 at 3:23 PM Andres Freund wrote: > > > This is half the reason why I ended up implementing itemptr_encode() > > > to accelerate the TID sort used by CREATE INDEX CONCURRENTLY some > > > years back -- TID is 6 bytes wide, which doesn't have the necessary > > > macro support within postgres.h. There is no reason why that couldn't > > > be added for the benefit of both TID and macaddr types, though it > > > probably wouldn't be worth it. > > > > I think we should definitely do that. It seems not unlikely that other > > people want to write new fixed width types, and we shouldn't have them > > deal with all this complexity unnecessarily. > > On second thought, maybe there is something to be said for being > exhaustive here. > > It seems like there is a preference for making macaddr8 pass-by-value > instead of adding abbreviated keys support to macaddr8, and possibly > doing the same with the original macaddr type. > > Do you think that you'll be able to work on the project with this > expanded scope, Melanie? > > I can take on making macaddr8 pass-by-value I tinkered a bit last night and got in/out mostly working (I think). I'm not sure about macaddr, TID, and user-defined types. -- Melanie Plageman
Re: Sort support for macaddr8
On 2019-Jun-05, Melanie Plageman wrote: > I can take on making macaddr8 pass-by-value > I tinkered a bit last night and got in/out mostly working (I think). > I'm not sure about macaddr, TID, and user-defined types. Yeah, let's see what macaddr8 looks like, and we can move from there -- I suppose adapting for macaddr would not be terribly different, but we don't have to do both in a single commit. I don't expect that TID would necessarily be similar since we have lots of bespoke code for that in lots of places; it might not affect anything (it should not!) but then it might. No reason not to move forward incrementally. -- Álvaro Herrera  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Sort support for macaddr8
Hi, On 2019-06-05 09:18:34 -0700, Melanie Plageman wrote: > On Tue, Jun 4, 2019 at 3:50 PM Peter Geoghegan wrote: > > > On Tue, Jun 4, 2019 at 3:23 PM Andres Freund wrote: > > > > This is half the reason why I ended up implementing itemptr_encode() > > > > to accelerate the TID sort used by CREATE INDEX CONCURRENTLY some > > > > years back -- TID is 6 bytes wide, which doesn't have the necessary > > > > macro support within postgres.h. There is no reason why that couldn't > > > > be added for the benefit of both TID and macaddr types, though it > > > > probably wouldn't be worth it. > > > > > > I think we should definitely do that. It seems not unlikely that other > > > people want to write new fixed width types, and we shouldn't have them > > > deal with all this complexity unnecessarily. > > > > On second thought, maybe there is something to be said for being > > exhaustive here. > > > > It seems like there is a preference for making macaddr8 pass-by-value > > instead of adding abbreviated keys support to macaddr8, and possibly > > doing the same with the original macaddr type. > > > > Do you think that you'll be able to work on the project with this > > expanded scope, Melanie? > > > > > I can take on making macaddr8 pass-by-value > I tinkered a bit last night and got in/out mostly working (I think). > I'm not sure about macaddr, TID, and user-defined types.

I'd much rather see this tackled in a general way than fiddling with individual datatypes. I think we should:

1) Make fetch_att(), store_att_byval() etc. support datums of any length between 1 and <= sizeof(Datum). Probably also convert them to inline functions. There's a few more functions to be adjusted, but not many, I think.

2) Remove the ability to pass PASSEDBYVALUE to CREATE TYPE, but instead compute whether byval is possible, solely based on INTERNALLENGTH (when INTERNALLENGTH > 0, obviously).

3) Fix the fallout, by fixing a few of the Datum<->type conversion functions affected by this change. That'll require a bit of work, but not too much. We should write those conversion routines in a way that'll keep them working for the scenarios where the type is actually passable by value, and not (required for > 4 byte datums).

Greetings, Andres Freund
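As an illustration of point 1), a generalized fetch_att() could have roughly the following shape. This is a sketch only, under the assumption that copying the low-order bytes is acceptable; a real implementation would have to pin down the endianness and alignment rules so that store_att_byval() round-trips the same bytes, which the current cast-based macros for 1/2/4/8-byte widths sidestep:

    #include "postgres.h"

    /*
     * Hedged sketch (not the real fetch_att): accept any by-value width
     * from 1 to sizeof(Datum), rather than only the power-of-two sizes.
     */
    static inline Datum
    fetch_att_any(const void *T, bool attbyval, int attlen)
    {
        Datum       datum = 0;

        if (!attbyval)
            return PointerGetDatum(T);

        Assert(attlen > 0 && attlen <= (int) sizeof(Datum));

        /* copy attlen bytes into a zeroed Datum */
        memcpy(&datum, T, attlen);
        return datum;
    }

With something like this, a 6-byte type such as TID or macaddr, or a user-defined fixed-width type, could be marked by-value without per-type macro support in postgres.h.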
Re: Sort support for macaddr8
On 2019-Jun-05, Andres Freund wrote: > I'd much rather see this tackled in a general way than fiddling with > individual datatypes. I think we should: > > 1) make fetch_att(), store_att_byval() etc support datums of any length > between 1 and <= sizeof(Datum). Probably also convert them to inline > functions. There's a few more functions to be adjusted, but not many, > I think. Does this mean that datatypes that are >4 and <=8 bytes need to handle both cases? Is it possible for them to detect the current environment? -- Álvaro Herrera  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Sort support for macaddr8
Hi, On June 5, 2019 12:14:42 PM PDT, Alvaro Herrera wrote: >On 2019-Jun-05, Andres Freund wrote: > >> I'd much rather see this tackled in a general way than fiddling with >> individual datatypes. I think we should: >> >> 1) make fetch_att(), store_att_byval() etc support datums of any >length >>between 1 and <= sizeof(Datum). Probably also convert them to >inline >>functions. There's a few more functions to be adjusted, but not >many, >>I think. > >Does this mean that datatypes that are >4 and <=8 bytes need to handle >both cases? Is it possible for them to detect the current environment? Well, the conversion macros need to know. You can look at float8 for an example of the difference - it's pretty centralized. We should provide a few helper macros to abstract that away. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
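For reference, the float8 pattern being referred to looks roughly like this. This is simplified; the exact definitions in postgres.h differ a bit between versions:

    #ifdef USE_FLOAT8_BYVAL
    /* value lives in the Datum itself: reinterpret the 8 bytes */
    static inline Datum
    Float8GetDatum(double X)
    {
        union { double f; int64 i; } u;

        u.f = X;
        return Int64GetDatum(u.i);
    }

    static inline double
    DatumGetFloat8(Datum X)
    {
        union { int64 i; double f; } u;

        u.i = DatumGetInt64(X);
        return u.f;
    }
    #else
    /* value lives in palloc'd memory: the Datum is a pointer to it */
    extern Datum Float8GetDatum(double X);      /* palloc + copy */
    #define DatumGetFloat8(X)  (*((float8 *) DatumGetPointer(X)))
    #endif

A new fixed-width type would want an equivalent pair of conversion helpers, so the rest of its code doesn't need to care which build it is in.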
Re: Question about some changes in 11.3
Thanks for taking a look at this Tom. On Mon, Jun 3, 2019 at 4:07 PM Tom Lane wrote: > Mat Arye writes: > > On Fri, May 24, 2019 at 5:05 PM Mat Arye wrote: > >> 11.3 included some change to partition table planning. Namely > >> commit 925f46f ("Fix handling of targetlist SRFs when scan/join > relation is > >> known empty.") seems to redo all paths for partitioned tables > >> in apply_scanjoin_target_to_paths. It clears the paths in: > >> > >> ``` > >> if (rel_is_partitioned) > >> rel->pathlist = NIL > >> ``` > >> > >> Then the code rebuild the paths. However, the rebuilt append path never > >> gets the > >> set_rel_pathlist_hook called. Thus, the work that hook did previously > gets > >> thrown away and the rebuilt append path can never be influenced by this > >> hook. Is this intended behavior? Am I missing something? > > Hm. I'd say this was already broken by the invention of > apply_scanjoin_target_to_paths; perhaps 11-before-11.3 managed to > still work for you, but it's not hard to envision applications of > set_rel_pathlist_hook for which it would not have worked. The contract > for set_rel_pathlist_hook is supposed to be that it gets to editorialize > on all normal (non-Gather) paths created by the core code, and that's > no longer the case now that apply_scanjoin_target_to_paths can add more. > Yeah it worked for our cases because (I guess) out paths turned out to be lower cost, but I see your point. > > > I've attached a small patch to address this discrepancy for when the > > set_rel_pathlist_hook is called so that's it is called for actual paths > > used for partitioned rels. Please let me know if I am misunderstanding > how > > this should be handled. > > I'm not very happy with this patch either, as it makes the situation > even more confused, not less so. The best-case scenario is that the > set_rel_pathlist_hook runs twice and does useless work; the worst case > is that it gets confused completely by being called twice for the same > rel. I think we need to maintain the invariant that that hook is > called exactly once per baserel. > Yeah getting called once per baserel is a nice invariant to have. > I wonder whether we could fix matters by postponing the > set_rel_pathlist_hook call till later in the same cases where > we postpone generate_gather_paths, ie, it's the only baserel. > > That would make its name pretty misleading, though. How would simply delaying the hook make the name misleading? I am also wondering if using the condition `rel->reloptkind == RELOPT_BASEREL && bms_membership(root->all_baserels) != BMS_SINGLETON` is sufficient. Is it really guaranteed that `apply_scanjoin_target_to_paths` will not be called in other cases? > Maybe we > should leave it alone and invent a separate hook to be called > by/after apply_scanjoin_target_to_paths? Although I don't > know if it'd be useful to add a new hook to v11 at this point. > Extensions would have a hard time knowing if they could use it. > I think for us, either approach would work. We just need a place to add/modify some paths. FWIW, I think delaying the hook is easier to deal with on our end if it could work since we don't have to deal with two different code paths but either is workable. Certainly if we go with the new hook approach I think it should be added to v11 as well. That way extensions that need the functionality can hook into it and deal with patch level differences instead of having no way at all to get at this functionality. I am more than happy to work on a new patch once we settle on an approach. 
> > regards, tom lane >
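For context, a typical extension installs the hook roughly like the sketch below (names are illustrative, not from any real extension). Since such hooks tend to assume they see the final path list exactly once per baserel, both the "hook runs twice" and the "hook never sees the rebuilt append path" behaviors discussed above are easy ways to break them:

    #include "postgres.h"
    #include "fmgr.h"
    #include "optimizer/paths.h"

    PG_MODULE_MAGIC;

    static set_rel_pathlist_hook_type prev_set_rel_pathlist_hook = NULL;

    static void
    my_set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
                        Index rti, RangeTblEntry *rte)
    {
        if (prev_set_rel_pathlist_hook)
            prev_set_rel_pathlist_hook(root, rel, rti, rte);

        /*
         * Inspect or replace rel->pathlist here, e.g. add a custom scan
         * path.  Extensions generally assume this runs once per baserel and
         * that the paths seen here are the ones the planner will consider.
         */
    }

    void
    _PG_init(void)
    {
        prev_set_rel_pathlist_hook = set_rel_pathlist_hook;
        set_rel_pathlist_hook = my_set_rel_pathlist;
    }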
Re: Fix runtime errors from -fsanitize=undefined
On Mon, Jun 3, 2019 at 3:22 PM Peter Eisentraut wrote: > After many years of trying, it seems the -fsanitize=undefined checking > in gcc is now working somewhat reliably. Attached is a patch that fixes > all errors of the kind Is this as of some particular gcc version? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Index Skip Scan
> To address this, probably we can do something like in the attached patch. > Altogether with distinct_pathkeys uniq_distinct_pathkeys are stored, which is > the same, but without the constants elimination. It's being used then for > getting the real number of distinct keys, and to check the order of the > columns > to not consider index skip scan if it's different. Hope it doesn't > look too hacky. > Thanks! I've verified that it works now. I was wondering if we're not too strict in some cases now though. Consider the following queries: postgres=# explain(analyze) select distinct on (m,f) m,f from t where m='M2'; QUERY PLAN --- Index Only Scan using t_m_f_t_idx on t (cost=0.29..11.60 rows=40 width=5) (actual time=0.056..0.469 rows=10 loops=1) Scan mode: Skip scan Index Cond: (m = 'M2'::text) Heap Fetches: 10 Planning Time: 0.095 ms Execution Time: 0.490 ms (6 rows) postgres=# explain(analyze) select distinct on (f) m,f from t where m='M2'; QUERY PLAN Unique (cost=0.29..849.83 rows=10 width=5) (actual time=0.088..10.920 rows=10 loops=1) -> Index Only Scan using t_m_f_t_idx on t (cost=0.29..824.70 rows=10052 width=5) (actual time=0.087..8.524 rows=1 loops=1) Index Cond: (m = 'M2'::text) Heap Fetches: 1 Planning Time: 0.078 ms Execution Time: 10.944 ms (6 rows) This is basically the opposite case - when distinct_pathkeys matches the filtered list of index keys, an index skip scan could be considered. Currently, the user needs to write 'distinct m,f' explicitly, even though he specifies in the WHERE-clause that 'm' can only have one value anyway. Perhaps it's fine like this, but it could be a small improvement for consistency. -Floris?
Add CREATE DATABASE LOCALE option
I propose this patch to add a LOCALE option to CREATE DATABASE. This sets both LC_COLLATE and LC_CTYPE with one option. Similar behavior is already supported in initdb, CREATE COLLATION, and createdb. With collation providers other than libc, having separate lc_collate and lc_ctype settings is not necessarily applicable, so this is also preparation for such future functionality. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From 5f64d10944b272a5b636d7b0033a0a6a3d28998b Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 4 Jun 2019 14:45:00 +0200 Subject: [PATCH] Add CREATE DATABASE LOCALE option This sets both LC_COLLATE and LC_CTYPE with one option. Similar behavior is already supported in initdb, CREATE COLLATION, and createdb. --- doc/src/sgml/ref/create_database.sgml | 15 +-- src/backend/commands/dbcommands.c | 20 src/bin/pg_dump/pg_dump.c | 23 +-- 3 files changed, 50 insertions(+), 8 deletions(-) diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml index b2c9e241c2..c5b7af5cf7 100644 --- a/doc/src/sgml/ref/create_database.sgml +++ b/doc/src/sgml/ref/create_database.sgml @@ -25,6 +25,7 @@ [ [ WITH ] [ OWNER [=] user_name ] [ TEMPLATE [=] template ] [ ENCODING [=] encoding ] + [ LOCALE [=] locale ] [ LC_COLLATE [=] lc_collate ] [ LC_CTYPE [=] lc_ctype ] [ TABLESPACE [=] tablespace_name ] @@ -111,6 +112,16 @@ Parameters + + locale + + +This is a shortcut for setting LC_COLLATE +and LC_CTYPE at once. If you specify this, +you cannot specify either of those parameters. + + + lc_collate @@ -287,7 +298,7 @@ Examples To create a database music with a different locale: CREATE DATABASE music -LC_COLLATE 'sv_SE.utf8' LC_CTYPE 'sv_SE.utf8' +LOCALE 'sv_SE.utf8' TEMPLATE template0; In this example, the TEMPLATE template0 clause is required if @@ -300,7 +311,7 @@ Examples different character set encoding: CREATE DATABASE music2 -LC_COLLATE 'sv_SE.iso885915' LC_CTYPE 'sv_SE.iso885915' +LOCALE 'sv_SE.iso885915' ENCODING LATIN9 TEMPLATE template0; diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c index 15207bf75a..ac275f7e64 100644 --- a/src/backend/commands/dbcommands.c +++ b/src/backend/commands/dbcommands.c @@ -124,6 +124,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) DefElem*downer = NULL; DefElem*dtemplate = NULL; DefElem*dencoding = NULL; + DefElem*dlocale = NULL; DefElem*dcollate = NULL; DefElem*dctype = NULL; DefElem*distemplate = NULL; @@ -184,6 +185,15 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) parser_errposition(pstate, defel->location))); dencoding = defel; } + else if (strcmp(defel->defname, "locale") == 0) + { + if (dlocale) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), +errmsg("conflicting or redundant options"), +parser_errposition(pstate, defel->location))); + dlocale = defel; + } else if (strcmp(defel->defname, "lc_collate") == 0) { if (dcollate) @@ -244,6 +254,11 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) parser_errposition(pstate, defel->location))); } + if (dlocale && (dcollate || dctype)) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), +errmsg("conflicting or redundant options"))); + if (downer && downer->arg) dbowner = defGetString(downer); if (dtemplate && dtemplate->arg) @@ -276,6 +291,11 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) parser_errposition(pstate, dencoding->location))); } } + if (dlocale && dlocale->arg) + { + dbcollate = defGetString(dlocale); + dbctype 
= defGetString(dlocale); + } if (dcollate && dcollate->arg) dbcollate = defGetString(dcollate); if (dctype && dctype->arg) diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index 9f59cc74ee..ba34677787 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -2812,1
Re: Add CREATE DATABASE LOCALE option
On Wed, Jun 5, 2019 at 5:17 PM Peter Eisentraut < peter.eisentr...@2ndquadrant.com> wrote: > > I propose this patch to add a LOCALE option to CREATE DATABASE. This > sets both LC_COLLATE and LC_CTYPE with one option. Similar behavior is > already supported in initdb, CREATE COLLATION, and createdb. > > With collation providers other than libc, having separate lc_collate and > lc_ctype settings is not necessarily applicable, so this is also > preparation for such future functionality. > Cool... would be nice also add some test cases. Regards, -- Fabrízio de Royes Mello Timbira - http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
Re: Why does pg_checksums -r not have a long option?
On 2019-05-28 04:56, Michael Paquier wrote: > You could also use a long option for that without a one-letter option, > like --file-path or such, so reserving a one-letter option for a > future, hypothetical use is not really a stopper in my opinion. In > consequence, I think that that it is fine to just use -f/--filenode. > Any objections or better suggestions from other folks here? I think -r/--relfilenode was actually a good suggestion. Because it doesn't actually check a *file* but potentially several files (forks, segments). The -f naming makes it sound like it operates on a specific file. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Cleaning up and speeding up string functions
On 2019-May-26, David Rowley wrote: > On Sun, 26 May 2019 at 04:50, Tom Lane wrote: > > Here the cost is code space rather than programmer-visible complexity, > > but I still doubt that it's worth it. > > I see on today's master the postgres binary did grow from 8633960 > bytes to 8642504 on my machine using GCC 8.3, so you might be right. > pg_receivewal grew from 96376 to 96424 bytes. I suppose one place that could be affected visibly is JSON object construction (json.c, jsonfuncs.c) that could potentially deal with millions of stringinfo manipulations, but most of those calls don't actually use appendStringInfoString with constant values, so it's probably not worth bothering with. -- Álvaro Herrera  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: crash testing suggestions for 12 beta 1
On 2019-May-23, Jeff Janes wrote: > Now that beta is out, I wanted to do some crash-recovery testing where I > inject PANIC-inducing faults and see if it recovers correctly. A long-lived > Perl process keeps track of what it should find after the crash, and > verifies that it finds it. You will probably be familiar with the general > theme from examples like the threads below. Would anyone like to nominate > some areas to focus on? Thanks for the offer! Your work has shown its value in previous cycles. REINDEX CONCURRENTLY would be one good area to focus on, I think, as well as ALTER TABLE ATTACH PARTITION. Maybe also INCLUDE columns in GiST, and the stuff in commits 9155580fd, fe280694d and 7df159a62. Thanks, -- Álvaro Herrera  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: crash testing suggestions for 12 beta 1
On Wed, Jun 5, 2019 at 2:11 PM Alvaro Herrera wrote: > REINDEX CONCURRENTLY would be one good area to focus on, I think, as > well as ALTER TABLE ATTACH PARTITION. Maybe also INCLUDE columns in > GiST, and the stuff in commits 9155580fd, fe280694d and 7df159a62. Those all seem like good things to target. Forgive me for droning on about amcheck once more, but maybe it'll help: amcheck has the capability to detect at least two historic bugs in CREATE INDEX CONCURRENTLY that made it into stable releases. The "heapallindexed" verification option's bt_tuple_present_callback() function has a header comment that talks about this. In short, any "unhandled" broken hot chain (i.e. broken hot chains that are somehow not correctly detected and handled) should be reported as corrupt by amcheck with the "heapallindexed" check, provided the tuple is visible to verification's heap scan. The CREATE INDEX CONCURRENTLY bug that Pavan found a couple of years back while testing the WARM patch is one example. A bug that was fallout from the DROP INDEX CONCURRENTLY work is another historic example. Alvaro will recall that this same check had a role in the "freeze the dead" business. -- Peter Geoghegan
Re: Remove useless associativity/precedence from parsers
On 2019-May-28, Andrew Gierth wrote: > The main cases I know of are: > > 1. RANGE UNBOUNDED PRECEDING - this one is actually a defect in the > standard SQL grammar, since UNBOUNDED is a non-reserved keyword and so > it can also appear as a legal , and the construct > RANGE PRECEDING allows to > appear as a . Should we report this to the SQL committee? -- Álvaro Herrera  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Binary support for pgoutput plugin
Hi, On Wed, 5 Jun 2019 at 12:01, Andres Freund wrote: > Hi > > On June 5, 2019 8:51:10 AM PDT, Dave Cramer wrote: > >On Wed, 5 Jun 2019 at 07:21, Dave Cramer wrote: > > > >> Hi, > >> > >> > >> On Wed, 5 Jun 2019 at 07:18, Petr Jelinek > > > >> wrote: > >> > >>> Hi, > >>> > >>> On 05/06/2019 00:08, Andres Freund wrote: > >>> > Hi, > >>> > > >>> > On 2019-06-05 00:05:02 +0200, David Fetter wrote: > >>> >> Would it make sense to work toward a binary format that's not > >>> >> architecture-specific? I recall from COPY that our binary format > >is > >>> >> not standardized across, for example, big- and little-endian > >machines. > >>> > > >>> > I think you recall wrongly. It's obviously possible that we have > >bugs > >>> > around this, but output/input routines are supposed to handle a > >>> > endianess independent format. That usually means that you have to > >do > >>> > endianess conversions, but that doesn't make it non-standardized. > >>> > > >>> > >>> Yeah, there are really 3 formats of data we have, text protocol, > >binary > >>> network protocol and internal on disk format. The internal on disk > >>> format will not work across big/little-endian but network binary > >>> protocol will. > >>> > >>> FWIW I don't think the code for binary format was included in > >original > >>> logical replication patch (I really tried to keep it as minimal as > >>> possible), but the code and protocol is pretty much ready for adding > >that. > >>> > >> Yes, I looked through the public history and could not find it. > >Thanks for > >> confirming. > >> > >>> > >>> That said, pglogical has code which handles this (I guess Andres > >means > >>> that by predecessor of pgoutput) so if you look for example at the > >>> write_tuple/read_tuple/decide_datum_transfer in > >>> > >>> > > > https://github.com/2ndQuadrant/pglogical/blob/REL2_x_STABLE/pglogical_proto_native.c > >>> that can help you give some ideas on how to approach this. > >>> > >> > >> Thanks for the tip! > >> > > > >Looking at: > > > https://github.com/postgres/postgres/blob/8255c7a5eeba8f1a38b7a431c04909bde4f5e67d/src/backend/replication/pgoutput/pgoutput.c#L163 > > > >this seems completely ignored. What was the intent? > > That's about the output of the plugin, not the datatypes. And independent > of text/binary output, the protocol contains non-printable chars. > > Andres > -- > Sent from my Android device with K-9 Mail. Please excuse my brevity. > So one of the things they would like added is to get not null information in the schema record. This is so they can mark the field Optional in Java. I presume this would also have some uses in other languages. As I understand it this would require a protocol bump. If this were to be accepted are there any outstanding asks that would useful to add if we were going to bump the protocol? Dave
Re: Binary support for pgoutput plugin
Hi, On 2019-06-05 18:47:57 -0400, Dave Cramer wrote: > So one of the things they would like added is to get not null information > in the schema record. This is so they can mark the field Optional in Java. > I presume this would also have some uses in other languages. As I > understand it this would require a protocol bump. If this were to be > accepted are there any outstanding asks that would useful to add if we were > going to bump the protocol? I'm pretty strongly opposed to this. What's the limiting factor when adding such information? I think clients that want something like this ought to query the database for catalog information when getting schema information. Greetings, Andres Freund
Re: Adding a test for speculative insert abort case
On Thu, May 16, 2019 at 8:46 PM Melanie Plageman wrote: > > Good idea. > I squashed the changes I suggested in previous emails, Ashwin's patch, my > suggested updates to that patch, and the index order check all into one > updated > patch attached. > > I've updated this patch to make it apply on master cleanly. Thanks to Alvaro for format-patch suggestion. The first patch in the set adds the speculative wait case discussed above from Ashwin's patch. The second patch in the set is another suggestion I have. I noticed that the insert-conflict-toast test mentions that it is "not guaranteed to lead to a failed speculative insertion" and, since it seems to be testing the speculative abort but with TOAST tables, I thought it might work to kill that spec file and move that test case into insert-conflict-specconflict so the test can utilize the existing advisory locks being used for the other tests in that file to make it deterministic which session succeeds in inserting the tuple. -- Melanie Plageman From e8652d872c953ca9bc077027cc34b26b870e5b73 Mon Sep 17 00:00:00 2001 From: csteam Date: Wed, 5 Jun 2019 15:24:19 -0700 Subject: [PATCH v2 2/2] Add TOAST case to spec conflict tests The spec insert-conflict-toast seems to be testing the speculative abort case -- given two concurrent inserts inserting duplicate values to a unique index, only one should succeed and, if the other has had a chance to insert the tuple into the table, this tuple should be killed. The comment in that test said that it was not deterministic, so, if making it deterministic required adding additional pg_advisory_locks similar to what is in the spec insert-conflict-specconflict, combining them made sense. --- .../expected/insert-conflict-specconflict.out | 56 +++ .../specs/insert-conflict-specconflict.spec | 30 ++ 2 files changed, 86 insertions(+) diff --git a/src/test/isolation/expected/insert-conflict-specconflict.out b/src/test/isolation/expected/insert-conflict-specconflict.out index 2f003ca1bf..a21ed0b165 100644 --- a/src/test/isolation/expected/insert-conflict-specconflict.out +++ b/src/test/isolation/expected/insert-conflict-specconflict.out @@ -112,6 +112,62 @@ keydata k1 inserted s1 with conflict update s2 +starting permutation: controller_locks controller_show s1_insert_toast s2_insert_toast controller_show controller_unlock_1_1 controller_unlock_2_1 controller_unlock_1_3 controller_unlock_2_3 controller_show controller_unlock_1_2 controller_show_count controller_unlock_2_2 controller_show_count +step controller_locks: SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock); +pg_advisory_locksess lock + + 1 1 + 1 2 + 1 3 + 2 1 + 2 2 + 2 3 +step controller_show: SELECT * FROM upserttest; +keydata + +step s1_insert_toast: INSERT INTO upserttest VALUES('k2', ctoast_large_val()) ON CONFLICT DO NOTHING; +step s2_insert_toast: INSERT INTO upserttest VALUES('k2', ctoast_large_val()) ON CONFLICT DO NOTHING; +step controller_show: SELECT * FROM upserttest; +keydata + +step controller_unlock_1_1: SELECT pg_advisory_unlock(1, 1); +pg_advisory_unlock + +t +step controller_unlock_2_1: SELECT pg_advisory_unlock(2, 1); +pg_advisory_unlock + +t +step controller_unlock_1_3: SELECT pg_advisory_unlock(1, 3); +pg_advisory_unlock + +t +step controller_unlock_2_3: SELECT pg_advisory_unlock(2, 3); +pg_advisory_unlock + +t +step controller_show: SELECT * FROM upserttest; +keydata + +step controller_unlock_1_2: SELECT pg_advisory_unlock(1, 2); +pg_advisory_unlock + +t +step 
s1_insert_toast: <... completed> +step controller_show_count: SELECT COUNT(*) FROM upserttest; +count + +1 +step controller_unlock_2_2: SELECT pg_advisory_unlock(2, 2); +pg_advisory_unlock + +t +step s2_insert_toast: <... completed> +step controller_show_count: SELECT COUNT(*) FROM upserttest; +count + +1 + starting permutation: controller_locks controller_show s1_begin s2_begin s1_upsert s2_upsert controller_show controller_unlock_1_1 controller_unlock_2_1 controller_unlock_1_3 controller_unlock_2_3 controller_show controller_unlock_1_2 controller_show controller_unlock_2_2 controller_show s1_commit controller_show s2_commit controller_show step controller_locks: SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock); pg_advisory_locksess lock diff --git a/src/test/isolation/specs/insert-conflict-specconflict.spec b/src/test
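For readers unfamiliar with how these specs are kept deterministic: a separate controller session pre-acquires advisory locks that the inserting sessions block on partway through their INSERTs (in insert-conflict-specconflict the lock acquisition happens inside a function used in the unique index's expression), so the controller dictates which session completes its speculative insertion first. A bare-bones sketch of that advisory-lock handshake outside the isolation tester (lock keys are arbitrary):

-- Controller session: take the lock before the workers start.
SELECT pg_advisory_lock(1, 1);

-- Worker session: this call blocks until the controller releases (1, 1),
-- so the controller decides exactly when the worker may proceed.
SELECT pg_advisory_lock(1, 1);
SELECT pg_advisory_unlock(1, 1);

-- Controller session: release the lock, letting the worker continue.
SELECT pg_advisory_unlock(1, 1);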
Re: Binary support for pgoutput plugin
Hi, On Wed, 5 Jun 2019 at 18:50, Andres Freund wrote: > Hi, > > On 2019-06-05 18:47:57 -0400, Dave Cramer wrote: > > So one of the things they would like added is to get not null information > > in the schema record. This is so they can mark the field Optional in > Java. > > I presume this would also have some uses in other languages. As I > > understand it this would require a protocol bump. If this were to be > > accepted, are there any outstanding asks that would be useful to add if we > were > > going to bump the protocol? > > I'm pretty strongly opposed to this. What's the limiting factor when > adding such information? I think clients that want something like this > ought to query the database for catalog information when getting schema > information. > I'm not intimately familiar with their code. I will query them more about the ask. I am curious why you are "strongly" opposed, however. We already have the information, and adding it doesn't seem onerous. Dave
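For reference, the kind of catalog query Andres is suggesting clients run is short. A sketch (the table name is a placeholder; a real client would more likely filter by the relation OID it already received in the pgoutput Relation message):

-- Per-column NOT NULL information for one table; 'public.mytable' is a
-- placeholder for whatever relation the client is replicating.
SELECT a.attname, a.attnotnull
FROM pg_attribute a
WHERE a.attrelid = 'public.mytable'::regclass
  AND a.attnum > 0
  AND NOT a.attisdropped
ORDER BY a.attnum;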
Re: No mention of no CIC support for partitioned index in docs
On Wed, 5 Jun 2019 at 08:10, Alvaro Herrera wrote: > > On 2019-May-24, David Rowley wrote: > > > It appears there is no mention of lack of support for CREATE INDEX > > CONCURRENTLY on partitioned index in the documents. > > I'll leave this one for you to handle, thanks. Thanks. I've just pushed something. I ended up deciding that we owe the user a bit more of an explanation of how they might work around the problem. Of course, a partitioned table index build is likely to take much longer than an index build on a normal table, since most likely a partitioned table is larger. So I went on to explain how they might minimise the time where writers will be blocked by creating indexes concurrently on each partition first. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
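The workaround David alludes to is, roughly, to build each partition's index concurrently and then attach it to an initially-invalid parent index. A sketch of that pattern, with invented table and index names:

-- Create the parent index without recursing to partitions; it starts out invalid.
CREATE INDEX measurement_logdate_idx ON ONLY measurement (logdate);

-- Build each partition's index without taking long-lived locks that block writes...
CREATE INDEX CONCURRENTLY measurement_y2008m01_logdate_idx
    ON measurement_y2008m01 (logdate);

-- ...and attach it to the parent index.
ALTER INDEX measurement_logdate_idx
    ATTACH PARTITION measurement_y2008m01_logdate_idx;

-- Repeat for the remaining partitions; once every partition's index has been
-- attached, the parent index becomes valid.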
Re: Top-N sorts in EXPLAIN, row count estimates, and parallelism
On Thu, May 23, 2019 at 7:48 PM David Rowley wrote: > > cost_sort() costs sorts as top-N heapsorts. While we cannot make an > > iron-clad guarantee that it will work out that way from within > > tuplesort.c, that doesn't seem like it closes off the possibility of > > more informative EXPLAIN output. For example, can't we at least report that > > the tuplesort will be "bounded" within EXPLAIN, indicating that we > > intend to attempt to sort using a top-N heap sort (i.e. we'll > > definitely do it that way if there is sufficient work_mem)? > > I think this really needs more of a concrete proposal. Remember > LIMIT/OFFSET don't need to be constants; they could be a Param or some > return value from a subquery, so the bound might not be known until > after executor startup, which EXPLAIN is not going to know > about. I was merely pointing out that it is clear when a sort *could* be a top-n sort, which could be exposed by EXPLAIN without anyone feeling misled. > After that, what would we do with it in EXPLAIN? Always show "Bound: > ", if it's not -1? I'm not sure. The distinction between a top-n sort and any other sort is an important one (it's certainly more important than the distinction between an internal and external sort), so it's worth being flexible in order to expose more information in EXPLAIN output. I would be willing to accept some kind of qualified or hedged description in the EXPLAIN output for a bounded sort node, even though that approach doesn't seem desirable in general. -- Peter Geoghegan
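For what it's worth, today the top-N behaviour is only visible after the fact, via the Sort Method line in EXPLAIN ANALYZE; plain EXPLAIN shows an ordinary Sort node. An illustrative example (table name and all numbers are invented):

EXPLAIN (ANALYZE, COSTS OFF)
SELECT * FROM events ORDER BY created_at DESC LIMIT 10;

--  Limit (actual time=... rows=10 loops=1)
--    ->  Sort (actual time=... rows=10 loops=1)
--          Sort Key: created_at DESC
--          Sort Method: top-N heapsort  Memory: 25kB
--          ->  Seq Scan on events (actual time=... rows=... loops=1)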
Re: pg_basebackup failure after setting default_table_access_method option
On Thu, Jun 6, 2019 at 1:46 AM vignesh C wrote: > > Hi, > > *I noticed pg_basebackup failure when default_table_access_method option > is set.* > > *Test steps:* > > Step 1: Init database > ./initdb -D data > > Step 2: Start Server > ./postgres -D data & > > Step 3: Set GUC option > export PGOPTIONS='-c default_table_access_method=zheap' > > Step 4: Perform backup > /pg_basebackup -D backup -p 5432 --no-sync > 2019-06-05 20:35:04.088 IST [11601] FATAL: cannot read pg_class without > having selected a database > pg_basebackup: error: could not connect to server: FATAL: cannot read > pg_class without having selected a database > > *Reason why it is failing:* > pg_basebackup does not use any database to connect to the server as it backs > up the whole data instance. > As the option default_table_access_method is set, > it tries to validate this option, but while validating the option in > ScanPgRelation function: > if (!OidIsValid(MyDatabaseId)) > elog(FATAL, "cannot read pg_class without having selected a database"); > > Here, as pg_basebackup uses no database, the command fails. > Thanks for the detailed steps to reproduce the bug; I am also able to reproduce the problem. > Fix: > The patch has the fix for the above issue: > > Let me know your opinion on this. > Thanks for the patch; it fixes the problem. Regards, Haribabu Kommi Fujitsu Australia
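As I understand it, the failure boils down to the fact that validating default_table_access_method requires looking the name up in pg_am, which is only possible while connected to a database; the replication connection pg_basebackup opens has none. In a regular session the candidate values can be seen with a query like:

-- Table access methods known to the server ('t' = table AM, 'i' = index AM)
SELECT amname FROM pg_am WHERE amtype = 't';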
RE: Confusing comment for function ExecParallelEstimate
Sorry, my last mail forgot to CC the mailing list. The comment is confusing as it stands; maybe someone should correct it. Here is a simple patch; what do you think? With Regards, Wu Fei -Original Message- From: Amit Kapila [mailto:amit.kapil...@gmail.com] Sent: Wednesday, June 05, 2019 7:18 PM To: Wu, Fei/吴 非 Cc: pgsql-hack...@postgresql.org Subject: Re: Confusing comment for function ExecParallelEstimate On Wed, Jun 5, 2019 at 11:27 AM Wu, Fei wrote: > > Thanks for your reply. > From the code below: > (https://github.com/postgres/postgres/blob/REL_10_7/src/backend/execut > or/execParallel.c) > ### > 443 /* > 444 * Give parallel-aware nodes a chance to add to the > estimates, and get a > 445 * count of how many PlanState nodes there are. > 446 */ > 447 e.pcxt = pcxt; > 448 e.nnodes = 0; > 449 ExecParallelEstimate(planstate, &e); > 450 > 451 /* Estimate space for instrumentation, if required. */ > 452 if (estate->es_instrument) > 453 { > 454 instrumentation_len = > 455 offsetof(SharedExecutorInstrumentation, > plan_node_id) + > 456 sizeof(int) * e.nnodes; > 457 instrumentation_len = MAXALIGN(instrumentation_len); > 458 instrument_offset = instrumentation_len; > 459 instrumentation_len += > 460 mul_size(sizeof(Instrumentation), > 461 mul_size(e.nnodes, > nworkers)); > 462 shm_toc_estimate_chunk(&pcxt->estimator, > instrumentation_len); > 463 shm_toc_estimate_keys(&pcxt->estimator, 1); > > ## > # It seems that e.nnodes, which is returned from > ExecParallelEstimate(planstate, &e), determines how many instrumentation > structures are allocated in the DSM (line459~line461). > And e.nnodes also determines the length of SharedExecutorInstrumentation-> > plan_node_id (line454~line456). > > So, I think here it refers to instrumentation. > Right. I think the way it is mentioned (SharedPlanStateInstrumentation structures ..) in the comment can confuse readers. We can replace SharedPlanStateInstrumentation with Instrumentation in the comment. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com fix_confusing_comment_for_ExecParallelEstimate.patch Description: fix_confusing_comment_for_ExecParallelEstimate.patch
Re: Confusing comment for function ExecParallelEstimate
On Thu, Jun 6, 2019 at 7:37 AM Wu, Fei wrote: > > Sorry, Last mail forget to CC the mailing list. > > Now the comment is confusing, Maybe someone should correct it. > Sure, for the sake of clarity, when this code was initially introduced in commit d1b7c1ff, the structure used was SharedPlanStateInstrumentation, but later when it got changed to Instrumentation structure in commit b287df70, we forgot to update the comment. So, we should backpatch this till 9.6 where it got introduced. I will commit this change by tomorrow or so. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Should we warn against using too many partitions?
On Fri, 24 May 2019 at 22:00, David Rowley wrote: > I've attached the pg10 and pg11 patches with that updated... and also > the master one (unchanged) with the hopes that the CF bot picks that > one. I got talking to Andres about this at PGCon after a use case of 250k partitions was brought to our attention. I was thinking about the best way to handle this on the long flight home and after studying the current docs I really feel that they fairly well describe what we've done so far implementing table partitioning, but they offer next to nothing on best practices on how to make the most of the feature. I've done some work on this today and what I've ended up with is an entirely new section to the partitioning docs about best practices which provides a bit of detail on how you might go about choosing the partition key. It gives an example of why LIST partitioning on a set of values that may grow significantly over time might be a bad idea. It talks about memory growth with more partitions and mentions that rel cache might become a problem even if queries are touching a small number of partitions per query, but a large number per session. The attached patch is aimed at master. PG11 will need the planner memory and performance part tweaked and for PG10 I'll do that plus remove the mention of PRIMARY KEY and UNIQUE constraints on the partitioned table. Does anyone see anything wrong with doing this? I don't think there should be an issue adding a section to the docs right at the end as it's not causing any resequencing. Or does anyone have any better ideas or better examples to give? or any comments? If it looks okay I can post version for PG11 and PG10 for review, but I'd like to get this in fairly soon. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services add_docs_about_best_partition_practices.patch Description: Binary data
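To make the "many small customers" example concrete, hash partitioning caps the partition count up front no matter how the customer base grows. A sketch (table, column, and partition names, and the modulus, are all invented):

CREATE TABLE orders (
    customer_id bigint NOT NULL,
    order_date  date NOT NULL,
    amount      numeric
) PARTITION BY HASH (customer_id);

-- A fixed, predictable number of partitions, independent of how many
-- customers eventually exist:
CREATE TABLE orders_p0 PARTITION OF orders FOR VALUES WITH (MODULUS 8, REMAINDER 0);
CREATE TABLE orders_p1 PARTITION OF orders FOR VALUES WITH (MODULUS 8, REMAINDER 1);
-- ... and so on up to REMAINDER 7.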
Re: Should we warn against using too many partitions?
I suggest just minor variations on language. On Thu, Jun 06, 2019 at 04:43:48PM +1200, David Rowley wrote: >diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml >index cce1618fc1..ab26630199 100644 >--- a/doc/src/sgml/ddl.sgml >+++ b/doc/src/sgml/ddl.sgml >@@ -4674,6 +4675,76 @@ EXPLAIN SELECT count(*) FROM measurement WHERE logdate >>= DATE '2008-01-01'; > > > >+ >+ >+ Declarative Partitioning Best Practices >+ >+ >+The choice of how to partition a table should be considered carefully as Either say "How to partition consider should be .." or "The choice should MADE carefully" ? >+ >+One of the most critical design decisions will be the column or columns >+which you partition your data by. Often the best choice will be to by which ? >+ >+Choosing the number of partitions to divide the table into is also a the TARGET number of partitions BY WHICH to divide the table ? >+critical decision to make. Not having enough partitions may mean that >+indexes remain too large and that data locality remains poor which could >+result in poor cache hit ratios. However, dividing the table into too >+many partitions can also cause issues. Too many partitions can mean >+slower query planning times and higher memory consumption during both >+query planning and execution. It's also important to consider what >+changes may occur in the future when choosing how to partition your table. >+For example, if you choose to have one partition per customer and you >+currently have a small number of large customers, what will the have ONLY ? >+implications be if in several years you obtain a large number of small >+customers. In this case, it may be better to choose to partition by >+HASH and choose a reasonable amount of partitions reasonable NUMBER ? >+ >+It is also important to consider the overhead of partitioning during >+query planning and execution. The query planner is generally able to >+handle partition hierarchies up a few thousand partitions fairly well, >+providing that the vast majority of them can be pruned during query provided ? I would say: "provided that typical queries prune all but a small number of partitions during planning time". >+DELETE commands. Also, even if most queries are >+able to prune a high number of partitions during query planning, it still LARGE number? >+may be undesirable to have a large number of partitions as each partition may still ? >+also will obtain a relation cache entry in each session which uses the will require ? Or occupy ? >+ >+With data warehouse type workloads it can make sense to use a larger >+number of partitions than with an OLTP type workload. Generally, in data >+warehouses, query planning time is less of a concern as the majority of >+processing time is generally spent during query execution. With either of remove the 2nd "generally" >+these two types of workload, it is important to make the right decisions >+early as re-partitioning large quantities of data can be painstakingly early COMMA ? PAINFULLY slow >+When performance is critical, performing workload simulations to >+assist in making the correct decisions can be beneficial. I would say: Simulations of the intended workload are beneficial for optimizing partitioning strategy. Thanks, Justin
Re: Should we warn against using too many partitions?
Hi, On Thu, Jun 6, 2019 at 1:44 PM David Rowley wrote: > > On Fri, 24 May 2019 at 22:00, David Rowley > wrote: > > I've attached the pg10 and pg11 patches with that updated... and also > > the master one (unchanged) with the hopes that the CF bot picks that > > one. > > I got talking to Andres about this at PGCon after a use case of 250k > partitions was brought to our attention. I was thinking about the best > way to handle this on the long flight home and after studying the > current docs I really feel that they fairly well describe what we've > done so far implementing table partitioning, but they offer next to > nothing on best practices on how to make the most of the feature. Agreed that some "best practices" text is overdue, so thanks for taking that up. > I've done some work on this today and what I've ended up with is an > entirely new section to the partitioning docs about best practices > which provides a bit of detail on how you might go about choosing the > partition key. It gives an example of why LIST partitioning on a set > of values that may grow significantly over time might be a bad idea. Design advice like this is good. > It talks about memory growth with more partitions and mentions that > rel cache might become a problem even if queries are touching a small > number of partitions per query, but a large number per session. I wasn't sure at first if stuff like this should be mentioned in the user-facing documentation, but your wording seems fine in general. Thanks, Amit
Re: hyrax vs. RelationBuildPartitionDesc
On Wed, Jun 5, 2019 at 8:03 PM Amit Langote wrote: > I noticed a crash with one of the scenarios that I think the patch is > meant to address. Let me describe the steps: > > Attach gdb (or whatever) to session 1 in which we'll run a query that > will scan at least two partitions and set a breakpoint in > expand_partitioned_rtentry(). Run the query, so the breakpoint will > be hit. Step through up to the start of the following loop in this > function: > > i = -1; > while ((i = bms_next_member(live_parts, i)) >= 0) > { > Oid childOID = partdesc->oids[i]; > Relationchildrel; > RangeTblEntry *childrte; > Index childRTindex; > RelOptInfo *childrelinfo; > > /* Open rel, acquiring required locks */ > childrel = table_open(childOID, lockmode); > > Note that 'partdesc' in the above code is from the partition > directory. Before stepping through into the loop, start another > session and attach a new partition. On into the loop. When the 1st > partition is opened, its locking will result in > RelationClearRelation() being called on the parent relation due to > partition being attached concurrently, which I observed, is actually > invoked a couple of times due to recursion. Parent relation's > rd_oldpdcxt will be set in this process, which contains the > aforementioned partition descriptor. > > Before moving the loop to process the 2nd partition, attach another > partition in session 2. When the 2nd partition is opened, > RelationClearRelation() will run again and in one of its recursive > invocations, it destroys the rd_oldpdcxt that was set previously, > taking the partition directory's partition descriptor with it. > Anything that tries to access it later crashes. > > I couldn't figure out what to blame here though -- the design of > rd_pdoldcxt or the fact that RelationClearRelation() is invoked > multiple times. I will try to study it more closely tomorrow. On further study, I concluded that it's indeed the multiple invocations of RelationClearRelation() on the parent relation that cause rd_pdoldcxt to be destroyed prematurely. While it's problematic that the session in which a new partition is attached to the parent relation broadcasts 2 SI messages to invalidate the parent relation [1], it's clearly better to fix how RelationClearRelation() manipulates rd_pdoldcxt so that duplicate SI messages are not harmful, since fixing the former is hardly an option. Attached is a patch that applies on top of Robert's pdoldcxt-v1.patch, which seems to fix this issue for me. In the rare case where inval messages due to multiple concurrent attachments get processed in a session holding a reference on a partitioned table, there will be multiple "old" partition descriptors and corresponding "old" contexts. All of the old contexts are chained together via context re-parenting, with the newest "old" context being accessible from the table's rd_pdoldcxt. Thanks, Amit [1]: inval.c: AddRelcacheInvalidationMessage() does try to prevent duplicate messages from being emitted, but the logic to detect duplicates doesn't work across CommandCounterIncrement(). There are at least two relcache inval requests in the ATTACH PARTITION code path, emitted by SetRelationHasSubclass and StorePartitionBound, resp., which are separated by at least one CCI, so the 2nd request isn't detected as the duplicate of the 1st. pdoldcxt-v1-amit-delta.patch Description: Binary data
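For anyone trying to reproduce this, the SQL side of the scenario is roughly the following (all names invented; the pauses are the debugger breakpoints described above, and each ATTACH in session 2 is issued while session 1 sits stopped inside the loop):

-- Session 1 (later paused with a breakpoint in expand_partitioned_rtentry()):
CREATE TABLE p (a int) PARTITION BY RANGE (a);
CREATE TABLE p1 PARTITION OF p FOR VALUES FROM (0) TO (10);
CREATE TABLE p2 PARTITION OF p FOR VALUES FROM (10) TO (20);
SELECT * FROM p;   -- must expand at least two partitions

-- Session 2: run the first pair before session 1 opens the 1st partition,
-- and the second pair before it moves on to the 2nd partition.
CREATE TABLE p3 (a int);
ALTER TABLE p ATTACH PARTITION p3 FOR VALUES FROM (20) TO (30);
CREATE TABLE p4 (a int);
ALTER TABLE p ATTACH PARTITION p4 FOR VALUES FROM (30) TO (40);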