MAKEFLAGS in non-GNU Makefile

2019-06-24 Thread Thomas Munro
Hi,

In commit a76200de we added a line to unset MAKELEVEL to fix a problem
with our temp-install logic.  I don't think it was a great idea to
clear MAKEFLAGS at the same time, because now when you type "make -s
-j8" on a non-GNU system it ignores you and is loud and slow.
Admittedly there is something slightly weird about passing flags to
both makes, but in the case of widely understood flags like those
ones, it works fine.

-- 
Thomas Munro
https://enterprisedb.com




Re: GiST VACUUM

2019-06-24 Thread Michael Paquier
Heikki,

On Thu, Apr 04, 2019 at 06:15:21PM +0300, Heikki Linnakangas wrote:
> I think we should fix this in a similar manner in B-tree, too, but that can
> be done separately. For B-tree, we need to worry about
> backwards-compatibility, but that seems simple enough; we just need to
> continue to understand old deleted pages, where the deletion XID is stored
> in the page opaque field.

This has been an open item for a couple of weeks now.  Are you
planning to tackle it soon?
--
Michael


signature.asc
Description: PGP signature


Re: dropdb --force

2019-06-24 Thread Anthony Nowocien
Hi,
the patch no longer applies (as of 12beta2).

postgres@ubuntudev:~/pg_testing/source/postgresql-12beta2$ patch -p1 < drop-database-force-20190310_01.patch
patching file doc/src/sgml/ref/drop_database.sgml
patching file doc/src/sgml/ref/dropdb.sgml
patching file src/backend/commands/dbcommands.c
Hunk #1 succeeded at 489 (offset 2 lines).
Hunk #2 succeeded at 779 (offset 2 lines).
Hunk #3 succeeded at 787 (offset 2 lines).
Hunk #4 succeeded at 871 (offset 2 lines).
Hunk #5 succeeded at 1056 (offset 2 lines).
Hunk #6 succeeded at 1186 (offset 2 lines).
patching file src/backend/nodes/copyfuncs.c
Hunk #1 succeeded at 3852 (offset 10 lines).
patching file src/backend/nodes/equalfuncs.c
Hunk #1 succeeded at 1666 (offset 3 lines).
patching file src/backend/parser/gram.y
Hunk #1 succeeded at 10194 (offset 45 lines).
Hunk #2 succeeded at 10202 (offset 45 lines).
patching file src/backend/storage/ipc/procarray.c
Hunk #1 succeeded at 2906 (offset -14 lines).
Hunk #2 succeeded at 2948 (offset -14 lines).
patching file src/backend/tcop/utility.c
patching file src/bin/scripts/dropdb.c
Hunk #1 succeeded at 34 (offset 1 line).
Hunk #2 succeeded at 50 (offset 1 line).
Hunk #3 succeeded at 63 (offset 2 lines).
Hunk #4 succeeded at 88 (offset 2 lines).
Hunk #5 succeeded at 128 (offset 2 lines).
Hunk #6 succeeded at 136 (offset 2 lines).
Hunk #7 succeeded at 164 (offset 1 line).
patching file src/bin/scripts/t/050_dropdb.pl
patching file src/include/commands/dbcommands.h
patching file src/include/nodes/parsenodes.h
Hunk #1 succeeded at 3133 (offset 16 lines).
patching file src/include/storage/procarray.h
Hunk #1 FAILED at 112.
1 out of 1 hunk FAILED -- saving rejects to file src/include/storage/procarray.h.rej

Could you please update it? Thanks.
Anthony

The new status of this patch is: Waiting on Author


Re: fix psql \conninfo & \connect when using hostaddr

2019-06-24 Thread Michael Paquier
On Fri, Jun 14, 2019 at 06:31:52PM -0400, Alvaro Herrera wrote:
> BTW I tested this manually and it seemed to work as intended, ie., if I
> change /etc/hosts for the hostname I am using between one connection and
> the next, we do keep the hostaddr if it was specified, or we resolve the
> name again if it's not.

Alvaro, did 313f56ce fix all the issues related to this thread?
Perhaps this open item can now be closed?
--
Michael


signature.asc
Description: PGP signature


Re: pglz performance

2019-06-24 Thread Andrey Borodin


> On 18 May 2019, at 11:44, Andrey Borodin wrote:
> 
Hi!
Here's a rebased version of the patches.

Best regards, Andrey Borodin.


0001-Use-memcpy-in-pglz-decompression-for-long-matches.patch
Description: Binary data


0001-Use-memcpy-in-pglz-decompression.patch
Description: Binary data
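The core idea of these patches can be sketched in a few lines of illustrative C (not the actual pglz code): a back-reference whose offset is at least its length cannot overlap its own output, so it can be copied with a single memcpy instead of byte-by-byte.

```c
#include <string.h>

/* Copy a back-reference match during LZ-style decompression.
 * dst points just past the bytes already produced; the match starts
 * "off" bytes back in the output and is "len" bytes long. */
static void
copy_match(unsigned char *dst, int off, int len)
{
    if (off >= len)
    {
        /* Non-overlapping: one memcpy is safe and fast. */
        memcpy(dst, dst - off, len);
    }
    else
    {
        /* Overlapping (e.g. off=1 replicates one byte): must copy
         * byte-by-byte so earlier output feeds later output. */
        while (len-- > 0)
        {
            *dst = *(dst - off);
            dst++;
        }
    }
}
```

Overlapping matches (offset smaller than length) still need the byte loop, because each copied byte may itself be the source of a later byte.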


Re: benchmarking Flex practices

2019-06-24 Thread John Naylor
I wrote:

> I'll look for other rules that could be more
> easily optimized, but I'm not terribly optimistic.

I found a possible other way to bring the size of the transition table
under 32k entries while keeping the existing no-backup rules in place:
Replace the "quotecontinue" rule with a new state. In the attached
draft patch, when Flex encounters a quote while inside any kind of
quoted string, it saves the current state and enters %xqs (think
'quotestop'). If it then sees {whitespace_with_newline}{quote}, it
reenters the previous state and continues to slurp the string,
otherwise, it throws back everything and returns the string it just
exited. Doing it this way is a bit uglier, but with some extra
commentary it might not be too bad.

The array is now 30883 entries. That's still a bit close for comfort,
but shrinks the binary by 171kB on Linux x86-64 with Flex 2.6.4. The
bad news is I have these baffling backup states in my new rules:

State #133 is non-accepting -
 associated rule line numbers:
551 554 564
 out-transitions: [ \000-\377 ]
 jam-transitions: EOF []

State #162 is non-accepting -
 associated rule line numbers:
551 554 564
 out-transitions: [ \000-\377 ]
 jam-transitions: EOF []

2 backing up (non-accepting) states.

I already explicitly handle EOF, so I don't know what it's trying to
tell me. If it can be fixed while keeping the array size, I'll do
performance tests.

--
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


v1-lexer-redo-quote-continuation.patch
Description: Binary data
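The save-state/look-ahead/throw-back logic described above can be illustrated with a small hand-rolled scanner (plain C; only a sketch of the state handling, not the actual flex rules, and it ignores doubled quotes and other SQL lexing details):

```c
#include <string.h>

/* Toy illustration of the "quotestop" idea: after a closing quote,
 * peek ahead; if we see only whitespace that includes a newline and
 * then another quote, the literal continues, otherwise it ends.
 * Returns the number of characters of s consumed; writes the literal
 * body into out. */
static size_t
scan_string(const char *s, char *out)
{
    const char *p = s;
    char *o = out;

    if (*p != '\'')
        return 0;
    p++;
    for (;;)
    {
        if (*p == '\0')
            break;
        if (*p != '\'')
        {
            *o++ = *p++;
            continue;
        }
        /* saw a quote: enter "quotestop" and look ahead */
        p++;
        {
            const char *q = p;
            int saw_newline = 0;

            while (*q == ' ' || *q == '\t' || *q == '\n')
            {
                if (*q == '\n')
                    saw_newline = 1;
                q++;
            }
            if (saw_newline && *q == '\'')
            {
                p = q + 1;      /* continuation: resume slurping */
                continue;
            }
            break;              /* throw back lookahead: string ends */
        }
    }
    *o = '\0';
    return (size_t) (p - s);
}
```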


Re: Use of reloptions by EXTENSIONs

2019-06-24 Thread Dent John
Thank you, Tom. 

(And sorry for the delay following up.)

I’ve marked myself down for review for this patch in the next CF.

I’ll see if I can get the patch applied and feed back on how far it moves
towards making my use case a viable proposition.

d.

> On 9 Jun 2019, at 17:21, Tom Lane  wrote:
> 
> Dent John  writes:
>> I guess my question is: am I correctly understanding that reloptions are
>> basically off-limits to EXTENSIONS?
> 
> IIRC that's basically true.  There's a lot of dissatisfaction with the
> current implementation of reloptions, although I think that it's been
> mainly focused on the fact that adding new ones is hard/error-prone
> even within the core code.  If you want to help move this along, you
> could review the existing patch in the area:
> 
> https://www.postgresql.org/message-id/flat/2083183.Rn7qOxG4Ov@x200m
> 
> and/or propose additional changes.
> 
>regards, tom lane





Re: using explicit_bzero

2019-06-24 Thread Dagfinn Ilmari Mannsåker
Michael Paquier  writes:

> On Sun, Jun 23, 2019 at 09:56:40PM +0200, Peter Eisentraut wrote:
>> On 2019-06-21 15:45, Dagfinn Ilmari Mannsåker wrote:
>>> Also, on Linux it requires libbsd: https://libbsd.freedesktop.org/
>> 
>> No, it's in glibc.
>
> From man:
> explicit_bzero() first appeared in glibc 2.25.

Ah, I was looking on my Debian Stretch (stable) box, which only has
glibc 2.24.  Buster (testing, due out next week) has 2.28 which indeed
has it.

- ilmari
-- 
- Twitter seems more influential [than blogs] in the 'gets reported in
  the mainstream press' sense at least.   - Matt McLeod
- That'd be because the content of a tweet is easier to condense down
  to a mainstream media article.  - Calle Dybedahl
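For platforms where explicit_bzero() is missing, the usual fallback pattern is a memset followed by a compiler barrier so the store cannot be optimized away. A GCC/Clang-flavored sketch (hypothetical name; not necessarily what PostgreSQL's port layer does):

```c
#include <stddef.h>
#include <string.h>

/* Fallback explicit_bzero: zero memory that is about to go out of
 * scope (e.g. a password buffer).  The empty asm with a "memory"
 * clobber makes the memset observable, so dead-store elimination
 * cannot remove it.  GCC/Clang-specific sketch. */
static void
my_explicit_bzero(void *buf, size_t len)
{
    memset(buf, 0, len);
    __asm__ __volatile__("" : : "r"(buf) : "memory");
}
```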




GiST "choose subtree" support function to inline penalty

2019-06-24 Thread Komяpa
Hi,

I'm looking at PostGIS geometry GiST index build times and trying to optimize
within the current GiST framework. The function that shows a lot on my
flame graphs is penalty.

I spent weekend rewriting PostGIS penalty to be as fast as possible.
(FYI https://github.com/postgis/postgis/pull/425/files)

However, I cannot get a meaningfully faster build time. Even when I strip
it down to "just return edge extension", index build time is the same.

Is there a way to inline the penalty into the above "choose subtree" loop
somehow? That would also let us stop bit-fiddling floats to simulate a more
complex choosing scheme.

-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa
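For readers unfamiliar with GiST penalties: the penalty function estimates how much a subtree key must grow to absorb a new entry, and "choose subtree" picks the child with the smallest penalty. A minimal area-enlargement sketch for an axis-aligned box (illustrative only; PostGIS's real penalty is more elaborate):

```c
typedef struct Box { double xmin, ymin, xmax, ymax; } Box;

/* The kind of computation a GiST penalty function performs for
 * box-like keys: how much does the key's bounding box have to grow
 * (in area) to absorb the new entry? */
static double
box_penalty(const Box *key, const Box *add)
{
    double nxmin = key->xmin < add->xmin ? key->xmin : add->xmin;
    double nymin = key->ymin < add->ymin ? key->ymin : add->ymin;
    double nxmax = key->xmax > add->xmax ? key->xmax : add->xmax;
    double nymax = key->ymax > add->ymax ? key->ymax : add->ymax;
    double old_area = (key->xmax - key->xmin) * (key->ymax - key->ymin);
    double new_area = (nxmax - nxmin) * (nymax - nymin);

    return new_area - old_area;     /* growth only; 0 if fully covered */
}
```

Because "choose subtree" calls this once per downlink on every descent during an index build, even a cheap body leaves the per-call overhead of the function-manager invocation, which is what the question above is about.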


Re: Custom table AMs need to include heapam.h because of BulkInsertState

2019-06-24 Thread Michael Paquier
On Sat, Jun 15, 2019 at 12:25:12PM +1200, David Rowley wrote:
> With the attached version I'm just calling table_finish_bulk_insert()
> once per partition at the end of CopyFrom().  We've got an array with
> all the ResultRelInfos we touched in the proute, so it's mostly a
> matter of looping over that array and calling the function on each
> ResultRelInfo's ri_RelationDesc.  However, to make it more complex,
> PartitionTupleRouting is private to execPartition.c so we can't do
> this directly... After staring at my screen for a while, I decided to
> write a function that calls a callback function on each ResultRelInfo
> in the PartitionTupleRouting.

Don't take this badly, but I don't find the solution of defining and using
a new callback to call the table AM callback very elegant, and keeping all
table AM callbacks at a higher level than the executor makes the code
easier to follow.  Shouldn't we try to keep any calls to
table_finish_bulk_insert() within copy.c for each partition
instead?

> The other thing I noticed is that we call
> table_finish_bulk_insert(cstate->rel, ti_options); in copy.c
> regardless of if we've done any bulk inserts or not. Perhaps that
> should be under an if (insertMethod != CIM_SINGLE)

Yeah, good point.
--
Michael


signature.asc
Description: PGP signature


Re: Custom table AMs need to include heapam.h because of BulkInsertState

2019-06-24 Thread David Rowley
On Mon, 24 Jun 2019 at 22:16, Michael Paquier  wrote:
>
> On Sat, Jun 15, 2019 at 12:25:12PM +1200, David Rowley wrote:
> > With the attached version I'm just calling table_finish_bulk_insert()
> > once per partition at the end of CopyFrom().  We've got an array with
> > all the ResultRelInfos we touched in the proute, so it's mostly a
> > matter of looping over that array and calling the function on each
> > ResultRelInfo's ri_RelationDesc.  However, to make it more complex,
> > PartitionTupleRouting is private to execPartition.c so we can't do
> > this directly... After staring at my screen for a while, I decided to
> > write a function that calls a callback function on each ResultRelInfo
> > in the PartitionTupleRouting.
>
> Don't take this badly, but I don't find the solution of defining and using
> a new callback to call the table AM callback very elegant, and keeping all
> table AM callbacks at a higher level than the executor makes the code
> easier to follow.  Shouldn't we try to keep any calls to
> table_finish_bulk_insert() within copy.c for each partition
> instead?

I'm not quite sure if I follow you since the call to
table_finish_bulk_insert() is within copy.c still.

The problem was that PartitionTupleRouting is private to
execPartition.c, and we need a way to determine which of the
partitions we routed tuples to. It seems inefficient to flush all of
them if only a small number had tuples inserted, and it seems equally
inefficient to add extra tracking in CopyFrom(), such as a hash table
of the partition Oids we inserted into. Using PartitionTupleRouting
makes sense; it's just a question of how to access it, which is not so
easy due to it being private.

I did suggest a few other ways that we could solve this. I'm not so
clear on which one of those you're suggesting or if you're thinking of
something new.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: GiST "choose subtree" support function to inline penalty

2019-06-24 Thread Andrey Borodin
Hi!

> On 24 June 2019, at 15:08, Darafei Komяpa Praliaskouski wrote:
> 
> I'm looking at PostGIS geometry GiST index build times and trying to optimize
> within the current GiST framework. The function that shows a lot on my flame
> graphs is penalty. 
> 
> I spent weekend rewriting PostGIS penalty to be as fast as possible. 
> (FYI https://github.com/postgis/postgis/pull/425/files) 
> 
> However, I cannot get a meaningfully faster build time. Even when I strip it
> down to "just return edge extension", index build time is the same.
> 
> Is there a way to inline the penalty into the above "choose subtree" loop
> somehow? That would also let us stop bit-fiddling floats to simulate a more 
> complex choosing scheme.

Maybe we could just add new opclass function for choosing subtree?
I've created GSoC item for this[0].


Best regards, Andrey Borodin.

[0] https://wiki.postgresql.org/wiki/GSoC_2019#GiST_API_advancement_.282019.29



Re: Index Skip Scan

2019-06-24 Thread Dmitry Dolgov
> On Sun, Jun 23, 2019 at 1:04 PM Floris Van Nee  
> wrote:
>
> However, _bt_readpage messes things up, because it only reads tuples that
> match all the provided keys (so where b=2)

Right, the problem you reported first had similar origins. I'm starting to
think that using _bt_readpage just like that is probably not exactly the right
thing to do, since the correct element is already found and there is no need
to check whether tuples match after one step back. I'll try to avoid it in the
next version of the patch.

> I was wondering about something else: don't we also have another problem with
> updating this current index tuple by skipping before calling
> btgettuple/_bt_next? I see there's some code in btgettuple to kill dead tuples
> when scan->kill_prior_tuple is true. I'm not too familiar with the concept of
> killing dead tuples while doing index scans, but by looking at the code it
> seems to be possible that btgettuple returns a tuple, caller processes it and
> sets kill_prior_tuple to true in order to have it killed. However, then the
> skip scan kicks in, which sets the current tuple to a completely different
> tuple. Then, on the next call of btgettuple, the wrong tuple gets killed. Is 
> my
> reasoning correct here or am I missing something?

Need to check, but probably we can avoid that by setting kill_prior_tuple to
false in case of skip scan as in index_rescan.

> On Sun, Jun 23, 2019 at 3:10 PM Tomas Vondra  
> wrote:
>
> I've done some initial review on v20 - just reading through the code, no
> tests at this point. Here are my comments:

Thank you!

> 2) indices.sgml
>
> The new section is somewhat unclear and difficult to understand, I think
> it'd deserve a rewording. Also, I wonder if we really should link to the
> wiki page about FSM problems. We have a couple of wiki links in the sgml
> docs, but those seem more generic while this seems like a development page
> that might disappear. But more importantly, that wiki page does not say
> anything about "Loose Index scans" so is it even the right wiki page?

Wow, indeed, looks like it's a totally wrong reference. I think Kyotaro already
mentioned it too, so probably I'm going to remove it (and instead describe the
idea in a few words in the documentation itself).

> 6) nodeIndexScan.c
>
> I wonder why we even add and initialize the ioss_ fields for IndexScan
> nodes, when the skip scans require index-only scans?

Skip scans required index-only scans until recently, when the patch was updated
to use the same approach for index scans too. My apologies, it looks like the
documentation and some comments are still inconsistent on this topic.

> 7) pathnode.c
>
> I wonder how much was the costing discussed. It seems to me the logic is
> fairly similar to ideas discussed in the incremental sort patch, and
> we've been discussing some weak points there. I'm not sure how much we
> need to consider those issues here.

Can you please elaborate in a few words on which issues you mean? Is it about
the non-uniform distribution of distinct values? If so, I believe it's partially
addressed when we have to skip too often, by searching the next index page.
Although yes, there is still an assumption of uniformly distributed distinct
groups at planning time.

> 9) pathnodes.h
>
> I don't understand what the uniq_distinct_pathkeys comment says :-(

Yeah, sorry, I'll try to improve the comments in the next version, where
I'm going to address all the feedback.




Re: fix psql \conninfo & \connect when using hostaddr

2019-06-24 Thread Alvaro Herrera
On 2019-Jun-24, Michael Paquier wrote:

> On Fri, Jun 14, 2019 at 06:31:52PM -0400, Alvaro Herrera wrote:
> > BTW I tested this manually and it seemed to work as intended, ie., if I
> > change /etc/hosts for the hostname I am using between one connection and
> > the next, we do keep the hostaddr if it was specified, or we resolve the
> > name again if it's not.
> 
> Alvaro, did 313f56ce fix all the issues related to this thread?

Yes, as far as I am aware it does.

> Perhaps this open item can now be closed?

I left it there so that others could double-check if there were still
some things needing tweaks.  Feel free to close it now, thanks.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




RE: Copy data to DSA area

2019-06-24 Thread Ideriha, Takeshi
>From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
>Sent: Friday, April 26, 2019 11:50 PM
>To: 'Kyotaro HORIGUCHI' ;
>thomas.mu...@enterprisedb.com; robertmh...@gmail.com
>Cc: pgsql-hack...@postgresql.org
>Subject: RE: Copy data to DSA area
>
>Hi, I've updated Thomas's quick PoC.

Hi.

I've rebased the patch to fit into the core code rather than an extension.
Regarding the shared memory context (ShmContext), I added three APIs:
- CreatePermShmContext
  creates a "permanent" context located in shared memory
- CreateTempShmContext
  creates a "temporary" context located in local memory,
  which has a buffer to keep track of objects that could be leaked
- ChangeToPermShmContext
  re-parents allocated objects to the permanent context

When you allocate something, add the object to the temporary ShmContext,
then re-parent it to the permanent ShmContext once it can no longer be leaked.
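The track-then-promote scheme can be sketched in plain C (illustrative names, not the patch's actual ShmContext API): allocations are registered in a temporary tracker, promoted to a permanent set once safely reachable, and a rollback frees only what is still tracked.

```c
#include <stdlib.h>

/* Illustration of track-then-promote leak avoidance.  Objects start
 * in the temporary tracker; promote() moves one to the permanent set;
 * rollback() frees everything still temporary. */
#define MAX_TRACKED 8

typedef struct Tracker
{
    void   *objs[MAX_TRACKED];
    int     n;
} Tracker;

static void *
tracked_alloc(Tracker *tmp, size_t size)
{
    void *p = malloc(size);

    tmp->objs[tmp->n++] = p;    /* leak candidate until promoted */
    return p;
}

static void
promote(Tracker *tmp, Tracker *perm, void *p)
{
    int i;

    for (i = 0; i < tmp->n; i++)
        if (tmp->objs[i] == p)
        {
            tmp->objs[i] = tmp->objs[--tmp->n]; /* unlink from temp */
            perm->objs[perm->n++] = p;          /* now permanent */
            return;
        }
}

static void
rollback(Tracker *tmp)
{
    while (tmp->n > 0)
        free(tmp->objs[--tmp->n]);  /* only unpromoted objects die */
}
```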

The current method of keeping track of objects and freeing them at rollback
works well for the case where we delete both the parent object and the child
object it points to, because no dangling pointer remains after rollback. If
the child object is freed but the parent object was already allocated in the
permanent context, the parent is left with a dangling pointer. On the other
hand, if an object is pointed to by an already-permanent object, the newly
allocated object won't be leaked, so in such cases we could skip the temporary
allocation and allocate it directly in the permanent context. On rollback we
could then just leave it in shared memory and let the upper-layer function
handle the "ghost" object in a sensible way. I don't see a clear solution
here yet. Do you have any thoughts?

MemoryContextMethods are not fully supported yet, but some methods
like stats() might be needed.

The current test case is very simple and the same as the previous one, but
implemented as an isolation test. It checks that an integer can be stored in
shared memory by one process and read by another. The current test case
doesn't cover all possible cases, so more cases should be added.

I'll add this to the coming CF.

P.S. Thomas, thank you for taking the time at PGCon
to discuss this item and the shared catalog cache. It was very helpful.
I'm going to submit an email about the shared catcache soon.

Regards,
Takeshi Ideriha


0001-Shared-memory-context-backed-by-DSA-and-its-test.patch
Description: 0001-Shared-memory-context-backed-by-DSA-and-its-test.patch


Re: Plugging some testing holes

2019-06-24 Thread Andrew Dunstan


On 6/23/19 10:27 PM, Michael Paquier wrote:
> On Sun, Jun 23, 2019 at 06:15:06PM -0400, Andrew Dunstan wrote:
>> Alvaro pointed out to me recently that the buildfarm client doesn't have
>> any provision for running module tests like commit_ts and
>> snapshot_too_old that use NO_INSTALLCHECK. On looking into this a bit
>> more, I noticed that we also don't run any TAP tests in
>> src/test/modules. I'm adding some code to the client to remedy both of
>> these, and crake has been running it quite happily for a week or so. Are
>> there any other holes of this nature that should be plugged?
> src/test/kerberos/ and src/test/ldap/.



We already make provision for those.  See PG_TEST_EXTRA in the config file.
>
> contrib modules having TAP tests are actually able to run the tests.
> Only an installcheck triggered from contrib/ happens at step
> contrib-install-check-C, right?


Yes, but I will add in support for the contrib TAP tests, thanks.


>
>> We'll need some MSVC build tools support for some of it.
> This one is more complex.  We don't actually track TAP_TESTS in
> src/tools/msvc/ yet.  Cough.


We do have support for some TAP tests; I will make sure we can run all
of them.


cheers


andrew


-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-06-24 Thread James Coleman
On Sat, Apr 7, 2018 at 4:56 PM, Alexander Korotkov <
a(dot)korotkov(at)postgrespro(dot)ru> wrote:
> I agree with that. For bounded sort, attached patch now selects minimal
> group
> size as Min(DEFAULT_MIN_GROUP_SIZE, bound). That should improve
> "LIMIT small_number" case.

As I was working on some benchmarking I noticed that incremental sort
never seemed to switch into the top-n heapsort mode, which meant for
very large groups it significantly underperformed a regular sort since
it would have to spill to disk every time. Perhaps this indicates some
missing tests also.

I tracked that down to a missing case for IncrementalSortState in
ExecSetTupleBound and have updated the patch to correct the issue
(and confirmed it now properly switches sort modes).

That also means the optimization of choosing the min group size based
on bounds (if available) wasn't previously working.

I also haven't seen incremental sort used in any parallel plans,
though there seems to be some code intended to support it. I haven't
really dug into it at all yet, so I can't comment further.

I'm attaching updated patch, and will reply separately with more
detailed comments on my current benchmarking work.

James Coleman


incremental-sort-28.patch
Description: Binary data


Re: Index Skip Scan

2019-06-24 Thread Tomas Vondra

On Mon, Jun 24, 2019 at 01:44:14PM +0200, Dmitry Dolgov wrote:
>> On Sun, Jun 23, 2019 at 1:04 PM Floris Van Nee  wrote:
>>
>> However, _bt_readpage messes things up, because it only reads tuples that
>> match all the provided keys (so where b=2)
>
> Right, the problem you reported first had similar origins. I'm starting to
> think that using _bt_readpage just like that is probably not exactly the
> right thing to do, since the correct element is already found and there is
> no need to check whether tuples match after one step back. I'll try to
> avoid it in the next version of the patch.
>
>> I was wondering about something else: don't we also have another problem with
>> updating this current index tuple by skipping before calling
>> btgettuple/_bt_next? I see there's some code in btgettuple to kill dead tuples
>> when scan->kill_prior_tuple is true. I'm not too familiar with the concept of
>> killing dead tuples while doing index scans, but by looking at the code it
>> seems to be possible that btgettuple returns a tuple, caller processes it and
>> sets kill_prior_tuple to true in order to have it killed. However, then the
>> skip scan kicks in, which sets the current tuple to a completely different
>> tuple. Then, on the next call of btgettuple, the wrong tuple gets killed. Is my
>> reasoning correct here or am I missing something?
>
> Need to check, but probably we can avoid that by setting kill_prior_tuple to
> false in case of skip scan as in index_rescan.
>
>> On Sun, Jun 23, 2019 at 3:10 PM Tomas Vondra  wrote:
>>
>> I've done some initial review on v20 - just reading through the code, no
>> tests at this point. Here are my comments:
>
> Thank you!
>
>> 2) indices.sgml
>>
>> The new section is somewhat unclear and difficult to understand, I think
>> it'd deserve a rewording. Also, I wonder if we really should link to the
>> wiki page about FSM problems. We have a couple of wiki links in the sgml
>> docs, but those seem more generic while this seems like a development page
>> that might disappear. But more importantly, that wiki page does not say
>> anything about "Loose Index scans" so is it even the right wiki page?
>
> Wow, indeed, looks like it's a totally wrong reference. I think Kyotaro
> already mentioned it too, so probably I'm going to remove it (and instead
> describe the idea in a few words in the documentation itself).
>
>> 6) nodeIndexScan.c
>>
>> I wonder why we even add and initialize the ioss_ fields for IndexScan
>> nodes, when the skip scans require index-only scans?
>
> Skip scans required index-only scans until recently, when the patch was
> updated to use the same approach for index scans too. My apologies, it
> looks like the documentation and some comments are still inconsistent on
> this topic.

Yes, if that's the case then various bits of docs and comments are rather
misleading, and fields in IndexScanState should be named 'iss_'.

>> 7) pathnode.c
>>
>> I wonder how much was the costing discussed. It seems to me the logic is
>> fairly similar to ideas discussed in the incremental sort patch, and
>> we've been discussing some weak points there. I'm not sure how much we
>> need to consider those issues here.
>
> Can you please elaborate in a few words on which issues you mean? Is it
> about the non-uniform distribution of distinct values? If so, I believe
> it's partially addressed when we have to skip too often, by searching the
> next index page. Although yes, there is still an assumption of uniformly
> distributed distinct groups at planning time.

Right, it's mostly about what happens when the group sizes are not close
to average size. The question is what happens in such cases - how much
slower will the plan be, compared to the "current" plan without a skip scan?

I don't have a very good idea of the additional overhead associated with
skip-scans - presumably it's a bit more expensive, right?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Choosing values for multivariate MCV lists

2019-06-24 Thread Dean Rasheed
On Mon, 24 Jun 2019 at 00:42, Tomas Vondra  wrote:
>
> On Sun, Jun 23, 2019 at 10:23:19PM +0200, Tomas Vondra wrote:
> >On Sun, Jun 23, 2019 at 08:48:26PM +0100, Dean Rasheed wrote:
> >>On Sat, 22 Jun 2019 at 15:10, Tomas Vondra  
> >>wrote:
> >>>One annoying thing I noticed is that the base_frequency tends to end up
> >>>being 0, most likely due to getting too small. It's a bit strange, though,
> >>>because with statistic target set to 10k the smallest frequency for a
> >>>single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is
> >>>something the float8 can represent).
> >>>
> >>
> >>Yeah, it should be impossible for the base frequency to underflow to
> >>0. However, it looks like the problem is with mcv_list_items()'s use
> >>of %f to convert to text, which is pretty ugly.
> >>
> >
> >Yeah, I realized that too, eventually. One way to fix that would be
> >adding %.15f to the sprintf() call, but that just adds ugliness. It's
> >probably time to rewrite the function to build the tuple from datums,
> >instead of relying on BuildTupleFromCStrings.
> >
>
> OK, attached is a patch doing this. It's pretty simple, and it does
> resolve the issue with frequency precision.
>
> There's one issue with the signature, though - currently the function
> returns null flags as bool array, but values are returned as simple
> text value (formatted in array-like way, but still just a text).
>
> In the attached patch I've reworked both to proper arrays, but obviously
> that'd require a CATVERSION bump - and there's not much appetite for that
> past beta2, I suppose. So I'll just undo this bit.
>

Hmm, I didn't spot that the old code was using a single text value
rather than a text array. That's clearly broken, especially since it
wasn't even necessarily constructing a valid textual representation of
an array (e.g., if an individual value's textual representation
included the array markers "{" or "}").

IMO fixing this to return a text array is worth doing, even though it
means a catversion bump.

Regards,
Dean
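The underlying precision problem is easy to reproduce: plain %f always prints six fixed decimal places, so any frequency below about 5e-7 comes out as 0.000000, while a wider format such as %.15g (or building the tuple from float8 datums, as the patch does) preserves it. A small sketch:

```c
#include <stdio.h>

/* Format a frequency either the old mcv_list_items() way (%f, six
 * fixed decimals) or with a precision-preserving format. */
static int
format_frequency(char *buf, size_t buflen, double freq, int precise)
{
    if (precise)
        return snprintf(buf, buflen, "%.15g", freq);
    return snprintf(buf, buflen, "%f", freq);   /* rounds tiny values to 0 */
}
```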




Re: Tweaking DSM and DSA limits

2019-06-24 Thread Robert Haas
On Thu, Jun 20, 2019 at 5:00 PM David Fetter  wrote:
> Is there perhaps a way to make raising max_connections not require a
> restart? There are plenty of situations out there where restarts
> aren't something that can be done on a whim.

Sure, if you want to make this take about 100x more work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: allow_system_table_mods stuff

2019-06-24 Thread Robert Haas
On Fri, Jun 21, 2019 at 4:37 PM Tom Lane  wrote:
> This line of thought leads to the conclusion that we do want
> separate "allow_system_table_dml" and "allow_system_table_ddl"
> bools.  Otherwise, the backwards-compatibility hack would need
> to turn on a level of unsafety that extension scripts have *not*
> had before and surely shouldn't have by default.

Right, exactly.

I'm repeating myself, but I still think it's super-useful to
distinguish things which are "for expert use only" from things which
are "totally bonkers." You can argue that if you're an expert, you
should know enough to avoid the totally bonkers things, but PostgreSQL
is pretty popular these days [citation needed] and there are a lot of
people administering databases who know what they are doing to a
pretty reasonable degree but don't have anywhere near the level of
understanding of someone who spends their days hacking core.  Putting
up some kind of a stop sign that lets you know when you're about to go
from adventurous to lethal will help those people.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-06-24 Thread Robert Haas
On Fri, Jun 21, 2019 at 6:54 PM Tom Lane  wrote:
> > That's not a bad goal, although invoking a user-supplied callback
> > while holding a buffer lock is a little scary.
>
> I nominate Robert for Understater of the Year.  I think there's pretty
> much 0 chance of that working reliably.

It's an honor to be nominated, although I am pretty sure this is not
my best work in category, even for 2019.  There are certainly useful
things that could be done by such a callback without doing anything
that touches shared memory and without doing anything that consumes
more than a handful of CPU cycles, so it doesn't seem utterly crazy to
think that such a design might survive. However, the constraints we'd
have to impose might chafe.

I am more inclined to ditch the callback model altogether in favor of
putting any necessary looping logic on the caller side. That seems a
lot more flexible, and the only trick is figuring out how to keep it
cheap. Providing some kind of context object that can hold onto one or
more pins seems like the most reasonable approach. Last week it seemed
to me that we would need several, but at the moment I can't think of a
reason why we would need more than one. I think we just want to
optimize the case where several undo lookups in quick succession are
actually reading from the same page, and we don't want to go to the
expense of looking that page up multiple times. It doesn't seem at all
likely that we would have a chain of undo records that leaves a
certain page and then comes back to it later, because this is a log
that grows forward, not some kind of random-access thing. So a cache
of size >1 probably wouldn't help.

Unless I'm still confused.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: MAKEFLAGS in non-GNU Makefile

2019-06-24 Thread Tom Lane
Thomas Munro  writes:
> In commit a76200de we added a line to unset MAKELEVEL to fix a problem
> with our temp-install logic.  I don't think it was a great idea to
> clear MAKEFLAGS at the same time, because now when you type "make -s
> -j8" on a non-GNU system it ignores you and is loud and slow.

Feel free to undo that.  I was concerned about possible incompatibilities
in the makeflags, but if typical cases like this one seem to work, let's
allow it.

regards, tom lane




Re: Disconnect from SPI manager on error

2019-06-24 Thread Robert Haas
On Fri, Jun 21, 2019 at 3:45 AM RekGRpth  wrote:
> >It is not plpgsql's job to clean up after other backend subsystems
> during a transaction abort.
> But plpgsql do clean up on success! I suggest only do cleanup and on 
> exception.

Except that's wrong, because when an error happens, cleanup is - in
most cases - the job of (sub)transaction abort, not something that
should be done by individual bits of code.

PostgreSQL has a centralized system for processing exception cleanup
for a very good reason: there are LOTS of places where errors can be
thrown, and if each of those places has to have its own error cleanup
logic, you end up with a real mess. Instead we've gone the other way:
you can throw an error from anywhere without doing any cleanup, and
it's the job of the error-handling machinery to invoke subtransaction
abort logic, which is responsible for cleaning up whatever mess has
been left behind.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-06-24 Thread James Coleman
On Thu, Jun 13, 2019 at 11:38:12PM -0400, James Coleman wrote:
>I think the first thing to do is get some concrete numbers on performance if 
>we:
>
>1. Only sort one group at a time.
>2. Update the costing to prefer traditional sort unless we have very
>high confidence we'll win with incremental sort.
>
>It'd be nice not to have to add additional complexity if at all possible.

I've been focusing my efforts so far on seeing how much we can
eliminate performance penalties (relative to traditional sort). It
seems that if we can improve things enough there that we'd limit the
amount of adjustment needed to costing -- we'd still need to consider
cases where the lower startup cost results in picking significantly
different plans in a broad sense (presumably due to lower startup cost
and the ability to short circuit on a limit). But I'm hopeful then we
might be able to avoid having to consult MCV lists (and we wouldn't
have that available in all cases anyway).


As I see it the two most significant concerning cases right now are:
1. Very large batches (in particular where the batch is effectively
all of the matching rows such that we're really just doing a standard
sort).
2. Many very small batches.

(From reading the whole thread through again, there's a third case:
skewed group sizes, but I'm ignoring that for now because my intuition
is that if the sort can reasonably handle the above two cases *in
execution not just planning* the skew problem will become a
non-issue.)

For (1), I'd really like to be able to find a way to still benefit
from knowing we have prefix columns already sorted. Currently the
patch doesn't do that because of the optimization for small batches
precludes knowing that all prefixes in a batch are indeed equal. As
such I've temporarily removed that optimization in my testing to see
how much of a gain we get from not needing to sort prefix columns.
Intuitively it seems plausible that this would always beat a standard
sort, since a.) equality is sometimes much cheaper than ordering, and
b.) it reduces the likelihood of spilling to disk. In addition my fix
to make top-n heapsort improves things. The one thing I'm not sure
about here is the fact that I haven't seen parallel incremental sort
happen at all, and a parallel regular sort can sometimes beat single
process incremental sort (though at the cost of higher system
resources of course). As noted upthread though I haven't really
investigated the parallel component yet at all.

I did confirm a measurable speedup with suffix-only sorting (roughly
5%, relative to the current patch).

For (2), I'd love to have an optimization for the "batches are a
single tuple" case (indeed in my most common real-world use case,
that's virtually guaranteed). But that optimization is also
incompatible with the current minimum batch size optimization. There
was previously reference to the cost of copying the tuple slot being
the problem factor with small batches, and I've confirmed there is a
performance penalty for small batches if you handle every prefix as a
separate batch (relative to the current patch), but what I plan to
investigate next is whether or not that penalty is primarily due to
the bookkeeping of frequent copying of a pivot tuple or whether it's
due to running the sort algorithm itself so frequently. If it's just
the overhead of copying the pivot tuple, then I'm wondering if adding
functionality to tuplesort to allow peeking at the first inserted
tuple might be worth it.

Alternatively I've been thinking about ways to achieve a hybrid
approach: using single-batch suffix-only sort for large batches and
multi-batch full sort for very small batches. For example, I could
imagine maintaining two different tuplesorts (one for full sort and
one for suffix-only sort) and inserting tuples into the full sorter
until the minimum batch size, and then (like now) checking every tuple
to see when we've finished the batch. If we finish the batch by, say,
2 * minimum batch size, then we perform the full sort. If we don't
find a new batch by that point, we'd go back and check to see if the
tuples we've accumulated so far are actually all the same batch. If
so, we'd move all of them to the suffix-only sorter and continue
adding rows until we hit a new batch. If not, we'd perform the full
sort, but only return tuples out of the full sorter until we encounter
the current batch, and then we'd move the remainder into the
suffix-only sorter.

Using this kind of approach you can also imagine further optimizations
like checking the current tuple against the pivot tuple only every 10
processed tuples or so. That adds another level of complication,
obviously, but could be interesting.

Unrelated: There was some discussion upthread about whether or not
abbreviated column support could/should be supported. The comments
around tuplesort_set_bound in tuplesort.c claim that for bounded sort
abbreviated keys are not useful, and, in fact, the code there disables
abbreviated sort. G

Re: Usage of epoch in txid_current

2019-06-24 Thread Robert Haas
On Fri, Jun 21, 2019 at 7:01 PM Alexander Korotkov
 wrote:
> On Thu, Mar 28, 2019 at 8:30 AM Thomas Munro  wrote:
> > Thanks for the reviews!  Pushed.
>
> Any ideas we should move towards 64-bit xids in more places?  That has
> been discussed several times already.  I think last time it was
> discussed in person during FOSDEM PGDay 2018 Developer Meeting [1].
> There we've discussed that it's probably not worth it to change
> 32-bit on-disk xids in heap.  It's better to leave existing heap "as
> is", but allow other pluggable table access methods to support 64-bit
> xids.  Given now we have pluggable table access methods, we may build
> a plan on full support of 64-bit xids in core.
>
> In my vision sketchy plan may look like this.
>
> 1. Change all non-heap types of xids from TransactionId to
> FullTransactionId.

I think it's fine to replace TransactionId with FullTransactionId
without stressing about it too much in code that's not that heavily
trafficked. However, I'm not sure we can do that across the board. For
example, converting snapshots to use 64-bit XIDs would mean that in
the worst case a snapshot will use twice as many cache lines, and that
might have performance implications on some workloads.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: allow_system_table_mods stuff

2019-06-24 Thread Tom Lane
Robert Haas  writes:
> On Fri, Jun 21, 2019 at 4:37 PM Tom Lane  wrote:
>> This line of thought leads to the conclusion that we do want
>> separate "allow_system_table_dml" and "allow_system_table_ddl"
>> bools.  Otherwise, the backwards-compatibility hack would need
>> to turn on a level of unsafety that extension scripts have *not*
>> had before and surely shouldn't have by default.

> Right, exactly.

> I'm repeating myself, but I still think it's super-useful to
> distinguish things which are "for expert use only" from things which
> are "totally bonkers."

Agreed, although "DML vs DDL" is a pretty poor approximation of that
boundary.  As shown in examples upthread, you can find reasonable things
to do and totally-catastrophic things to do in both categories.

The position I'm maintaining is that it's not worth our trouble to try to
mechanically distinguish which things are which.  Once you've broken the
glass and flipped either the big red switch or the slightly smaller orange
switch, it's entirely on you to not screw up your database beyond
recovery.

I do see value in two switches not one, but it's what I said above,
to not need to give people *more* chance-to-break-things than they
had before when doing manual catalog fixes.  That is, we need a
setting that corresponds more or less to current default behavior.

There's an aesthetic argument to be had about whether to have two
bools or one three-way switch, but I prefer the former; there's
no backward-compatibility issue here since allow_system_table_mods
couldn't be set by applications anyway.

regards, tom lane




Re: allow_system_table_mods stuff

2019-06-24 Thread Robert Haas
On Mon, Jun 24, 2019 at 11:21 AM Tom Lane  wrote:
> > I'm repeating myself, but I still think it's super-useful to
> > distinguish things which are "for expert use only" from things which
> > are "totally bonkers."
>
> Agreed, although "DML vs DDL" is a pretty poor approximation of that
> boundary.  As shown in examples upthread, you can find reasonable things
> to do and totally-catastrophic things to do in both categories.

I agree.  I would like it if there were a way to do better, but I'm
not sure that there is, at least for a reasonable level of effort.

> There's an aesthetic argument to be had about whether to have two
> bools or one three-way switch, but I prefer the former; there's
> no backward-compatibility issue here since allow_system_table_mods
> couldn't be set by applications anyway.

I'm happy to defer on that point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Usage of epoch in txid_current

2019-06-24 Thread Andres Freund
Hi,

On June 24, 2019 8:19:13 AM PDT, Robert Haas  wrote:
>On Fri, Jun 21, 2019 at 7:01 PM Alexander Korotkov
> wrote:
>> On Thu, Mar 28, 2019 at 8:30 AM Thomas Munro 
>wrote:
>> > Thanks for the reviews!  Pushed.
>>
>> Any ideas we should move towards 64-bit xids in more places?  That
>has
>> been discussed several times already.  I think last time it was
>> discussed in person during FOSDEM PGDay 2018 Developer Meeting [1].
>> There we've discussed that it's probably not worth it to change
>> 32-bit on-disk xids in heap.  It's better to leave existing heap "as
>> is", but allow other pluggable table access methods to support 64-bit
>> xids.  Given now we have pluggable table access methods, we may build
>> a plan on full support of 64-bit xids in core.
>>
>> In my vision sketchy plan may look like this.
>>
>> 1. Change all non-heap types of xids from TransactionId to
>> FullTransactionId.
>
>I think it's fine to replace TransactionId with FullTransactionId
>without stressing about it too much in code that's not that heavily
>trafficked. However, I'm not sure we can do that across the board. For
>example, converting snapshots to use 64-bit XIDs would mean that in
>the worst case a snapshot will use twice as many cache lines, and that
>might have performance implications on some workloads.

We could probably expand the transaction IDs on access or when computing them
for most of these, as usually they'll largely be about currently running
transactions. It e.g. seems sensible to keep the procarray at 32-bit xids,
expand xmin/xmax to 64 when computing snapshots, and leave the snapshot's
transaction ID array at 32-bit xids. That ought to be a negligible overhead.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: Misleading comment about single_copy, and some bikeshedding

2019-06-24 Thread Tom Lane
Thomas Munro  writes:
> Hmm.  I wonder if we should rename force_parallel_mode to
> force_gather_node in v13.  The current name has always seemed slightly
> misleading to me; it sounds like some kind of turbo boost button but
> really it's a developer-only test mode.  Also, does it belong under
> DEVELOPER_OPTIONS instead of QUERY_TUNING_OTHER?  I'm also wondering
> if the variable single_copy would be better named
> no_leader_participation or single_participant.  I find "copy" a
> slightly strange way to refer to the number of copies *allowed to
> run*, but maybe that's just me.

FWIW, I agree 100% that these names are opaque.  I don't know if your
suggestions are the best we can do, but they each seem like improvements.
And yes, force_parallel_mode should be under DEVELOPER_OPTIONS; it's a
performance-losing test option.

regards, tom lane




Re: Do we need to do better for pg_ctl timeouts?

2019-06-24 Thread Peter Eisentraut
On 2019-06-20 18:33, Andres Freund wrote:
> I wonder if we need to split the timeout into two: One value for
> postmaster to acknowledge the action, one for that action to
> complete. It seems to me that that'd be useful for all of starting,
> restarting and stopping.
> 
> I think we have all the necessary information in the pid file, we would
> just need to check for PM_STATUS_STARTING for start, PM_STATUS_STOPPING
> for restart/stop.

A related thing I came across the other day: systemd has a new
sd_notify() functionality EXTEND_TIMEOUT_USEC where the service can
notify systemd to extend the timeout.  I think that's the same idea:
You want to timeout if you're stuck, but you want to keep going as long
as you're doing useful work.

So yes, improving that would be welcome.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: benchmarking Flex practices

2019-06-24 Thread John Naylor
I wrote:

> > I'll look for other rules that could be more
> > easily optimized, but I'm not terribly optimistic.
>
> I found a possible other way to bring the size of the transition table
> under 32k entries while keeping the existing no-backup rules in place:
> Replace the "quotecontinue" rule with a new state. In the attached
> draft patch, when Flex encounters a quote while inside any kind of
> quoted string, it saves the current state and enters %xqs (think
> 'quotestop'). If it then sees {whitespace_with_newline}{quote}, it
> reenters the previous state and continues to slurp the string,
> otherwise, it throws back everything and returns the string it just
> exited. Doing it this way is a bit uglier, but with some extra
> commentary it might not be too bad.

I had an epiphany and managed to get rid of the backup states.
Regression tests pass. The array is down to 30367 entries and the
binary is smaller by 172kB on Linux x86-64. Performance is identical
to master on both tests mentioned upthread. I'll clean this up and add
it to the commitfest.

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index e1cae859e8..67ad06da4f 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -56,6 +56,8 @@ fprintf_to_ereport(const char *fmt, const char *msg)
 	ereport(ERROR, (errmsg_internal("%s", msg)));
 }
 
+static int state_before;
+
 /*
  * GUC variables.  This is a DIRECT violation of the warning given at the
  * head of gram.y, ie flex/bison code must not depend on any GUC variables;
@@ -168,6 +170,7 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
  *   delimited identifiers (double-quoted identifiers)
  *   hexadecimal numeric string
  *   standard quoted strings
+ *   quote stop (detect continued strings)
  *   extended quoted strings (support backslash escape sequences)
  *   $foo$ quoted strings
  *   quoted identifier with Unicode escapes
@@ -185,6 +188,7 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
 %x xd
 %x xh
 %x xq
+%x xqs
 %x xe
 %x xdolq
 %x xui
@@ -231,19 +235,7 @@ special_whitespace		({space}+|{comment}{newline})
 horiz_whitespace		({horiz_space}|{comment})
 whitespace_with_newline	({horiz_whitespace}*{newline}{special_whitespace}*)
 
-/*
- * To ensure that {quotecontinue} can be scanned without having to back up
- * if the full pattern isn't matched, we include trailing whitespace in
- * {quotestop}.  This matches all cases where {quotecontinue} fails to match,
- * except for {quote} followed by whitespace and just one "-" (not two,
- * which would start a {comment}).  To cover that we have {quotefail}.
- * The actions for {quotestop} and {quotefail} must throw back characters
- * beyond the quote proper.
- */
 quote			'
-quotestop		{quote}{whitespace}*
-quotecontinue	{quote}{whitespace_with_newline}{quote}
-quotefail		{quote}{whitespace}*"-"
 
 /* Bit string
  * It is tempting to scan the string for only those characters
@@ -476,21 +468,10 @@ other			.
 	startlit();
 	addlitchar('b', yyscanner);
 }
-{quotestop}	|
-{quotefail} {
-	yyless(1);
-	BEGIN(INITIAL);
-	yylval->str = litbufdup(yyscanner);
-	return BCONST;
-}
 {xhinside}	|
 {xbinside}	{
 	addlit(yytext, yyleng, yyscanner);
 }
-{quotecontinue}	|
-{quotecontinue}	{
-	/* ignore */
-}
 <>		{ yyerror("unterminated bit string literal"); }
 
 {xhstart}		{
@@ -505,13 +486,6 @@ other			.
 	startlit();
 	addlitchar('x', yyscanner);
 }
-{quotestop}	|
-{quotefail} {
-	yyless(1);
-	BEGIN(INITIAL);
-	yylval->str = litbufdup(yyscanner);
-	return XCONST;
-}
 <>		{ yyerror("unterminated hexadecimal string literal"); }
 
 {xnstart}		{
@@ -568,28 +542,65 @@ other			.
 	BEGIN(xus);
 	startlit();
 }
-{quotestop}	|
-{quotefail} {
-	yyless(1);
-	BEGIN(INITIAL);
+
+{quote} {
+	state_before = YYSTATE;
+	BEGIN(xqs);
+}
+{whitespace_with_newline}{quote} {
+	/* resume scanning string that started on a previous line */
+	BEGIN(state_before);
+}
+{quote} {
 	/*
-	 * check that the data remains valid if it might have been
-	 * made invalid by unescaping any chars.
+	 * SQL requires at least one newline in the whitespace separating
+	 * string literals that are to be concatenated, so throw an error
+	 * if we see the start of a new string on the same line.
 	 */
-	if (yyextra->saw_non_ascii)
-		pg_verifymbstr(yyextra->literalbuf,
-	   yyextra->literallen,
-	   false);
-	yylval->str = litbufdup(yyscanner);
-	return SCONST;
+	SET_YYLLOC();
+	ADVANCE_YYLLOC(yyleng - 1);
+	yyerror("syntax error");
 }
-{quotestop} |
-{quotefail} {
-	/* throw back all but the quote */
-	yyless(1);
-	/* xusend state looks for possible UESCAPE */
-	BEGIN(xusend);
+{whitespace

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-06-24 Thread Simon Riggs
On Wed, 5 Jun 2019 at 17:14, Rafia Sabih  wrote:


> Regarding this, I came across this,
> /*
>   * Incremental sort can't be used with either EXEC_FLAG_REWIND,
>   * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only current
>   * bucket in tuplesortstate.
>   */
> I think that is quite understandable. How are you planning to support
> backwards scan for this? In other words, when will incremental sort be
> useful for backward scan.
>

We stopped materializing the sort by default about 15 years ago because it
wasn't a common use case and it was very expensive for large sorts.

It's no real problem if incremental sorts don't support backwards scans -
we just won't use incremental in that case.

-- 
Simon Riggs                http://www.2ndQuadrant.com/

PostgreSQL Solutions for the Enterprise


Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-06-24 Thread Simon Riggs
On Mon, 24 Jun 2019 at 16:10, James Coleman  wrote:

> On Thu, Jun 13, 2019 at 11:38:12PM -0400, James Coleman wrote:
> >I think the first thing to do is get some concrete numbers on performance
> if we:
> >
> >1. Only sort one group at a time.
> >2. Update the costing to prefer traditional sort unless we have very
> >high confidence we'll win with incremental sort.
> >
> >It'd be nice not to have to add additional complexity if at all possible.
>
> I've been focusing my efforts so far on seeing how much we can
> eliminate performance penalties (relative to traditional sort). It
> seems that if we can improve things enough there that we'd limit the
> amount of adjustment needed to costing -- we'd still need to consider
> cases where the lower startup cost results in picking significantly
> different plans in a broad sense (presumably due to lower startup cost
> and the ability to short circuit on a limit). But I'm hopeful then we
> might be able to avoid having to consult MCV lists (and we wouldn't
> have that available in all cases anyway)
>
> As I see it the two most significant concerning cases right now are:
> 1. Very large batches (in particular where the batch is effectively
> all of the matching rows such that we're really just doing a standard
> sort).
> 2. Many very small batches.
>

What is the specific use case for this? This sounds like quite a general case.

Do we know something about the nearly-sorted rows that could help us? Or
could we introduce some information elsewhere that would help with the sort?

Could we for-example, pre-sort the rows block by block, or filter out the
rows that are clearly out of order, so we can re-merge them later?

-- 
Simon Riggs                http://www.2ndQuadrant.com/

PostgreSQL Solutions for the Enterprise


Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-06-24 Thread James Coleman
On Mon, Jun 24, 2019 at 12:56 PM Simon Riggs  wrote:
>
> On Mon, 24 Jun 2019 at 16:10, James Coleman  wrote:
>>
>> On Thu, Jun 13, 2019 at 11:38:12PM -0400, James Coleman wrote:
>> >I think the first thing to do is get some concrete numbers on performance 
>> >if we:
>> >
>> >1. Only sort one group at a time.
>> >2. Update the costing to prefer traditional sort unless we have very
>> >high confidence we'll win with incremental sort.
>> >
>> >It'd be nice not to have to add additional complexity if at all possible.
>>
>> I've been focusing my efforts so far on seeing how much we can
>> eliminate performance penalties (relative to traditional sort). It
>> seems that if we can improve things enough there that we'd limit the
>> amount of adjustment needed to costing -- we'd still need to consider
>> cases where the lower startup cost results in picking significantly
>> different plans in a broad sense (presumably due to lower startup cost
>> and the ability to short circuit on a limit). But I'm hopeful then we
>> might be able to avoid having to consult MCV lists (and we wouldn't
>> have that available in all cases anyway)
>>
>> As I see it the two most significant concerning cases right now are:
>> 1. Very large batches (in particular where the batch is effectively
>> all of the matching rows such that we're really just doing a standard
>> sort).
>> 2. Many very small batches.
>
>
> What is the specific use case for this? This sounds like quite a general case.

They are both general cases in some sense, but the concerns lie mostly
with what happens when they're unexpectedly encountered. For example,
if the expected row count or group size is off by a good bit and we
effectively have to perform a sort of all (or most) possible rows.

If we can get the performance to a point where that misestimated row
count or group size doesn't much matter, then ISTM including the patch
becomes a much more obvious total win.

> Do we know something about the nearly-sorted rows that could help us? Or 
> could we introduce some information elsewhere that would help with the sort?
>
> Could we for-example, pre-sort the rows block by block, or filter out the 
> rows that are clearly out of order, so we can re-merge them later?

I'm not sure what you mean by "block by block"?

James Coleman




Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-06-24 Thread Robert Haas
On Fri, Jun 21, 2019 at 11:24 AM Stephen Frost  wrote:
> That's not quite accurate, given that it isn't how the ALTER SYSTEM call
> itself works, and clearly isn't how the authors of that feature expected
> things to work or they would have actually made it work.  They didn't,
> and it doesn't actually work.
>
> The notion that pg_basebackup was correct in this, when it wasn't tested
> at all, evidently, even after the concern was raised, and ALTER SYSTEM
> was wrong, even though it worked just fine before some later patch
> started making changes to the file, based on the idea that it's the
> "natural approach" doesn't make sense to me.
>
> If the change to pg_basebackup had included a change to ALTER SYSTEM to
> make it work the *same* way that pg_basebackup now does, or at least to
> work with the changes that pg_basebackup were making, then maybe
> everything would have been fine.

This argument boils down to: two people's patches don't play nicely
together, and we should assume that the first patch had it right and
the second patch had it wrong, because the first patch was first.

I don't think it works like that. I think we should decide which patch
had it right by looking at what the nicest behavior actually is, not
by which one came first.  In my mind having ALTER SYSTEM drop
duplicate that other tools may have introduced is a clear winner with
basically no downside. You are arguing that it will produce confusion,
but I don't really understand who is going to be confused or why they
are going to be confused.  We can document whatever we do, and it
should be fine.  Humans aren't generally supposed to be examining this
file anyway, so they shouldn't get confused very often.

In my view, the original ALTER SYSTEM patch just has a bug -- it
doesn't modify the right copy of the setting when multiple copies are
present -- and we should just fix the bug.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-06-24 Thread Robert Haas
On Fri, Jun 21, 2019 at 12:55 PM Tom Lane  wrote:
> Ah, got it.  So it seems like the correct behavior might be for
> ALTER SYSTEM to
> (a) run through the whole file and remove any conflicting lines;
> (b) append new setting at the end.

That is exactly the behavior for which I am arguing.  Stephen also
wants a warning, but I disagree, because the warning is totally
non-actionable.  It tells you that some tool, at some point in the
past, did something bad. You can't do anything about that, and you
wouldn't need to except for the arbitrary decision to label duplicate
lines as bad in the first place.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-06-24 Thread Robert Haas
On Sat, Jun 22, 2019 at 5:13 PM Stephen Frost  wrote:
> All that said, whatever code it is that we write for pg_basebackup to do this 
> properly should go into our client side library, so other tools can leverage 
> that and avoid having to write it themselves.

That is probably only going to help people who are writing in C (or
maybe some close family member) and a lot of tools for managing
PostgreSQL will be written in scripting languages.  It is unlikely
that those people are going to get all of the rules for parsing a file
full of GUC settings exactly right, because translating flex into
Python is probably not anybody's idea of a fun time. So you'll end up
with a bunch of rewrite-postgresql.auto.conf tools written in
different languages at varying degrees of quality many of which will
misfire in corner cases where the GUC names contain funny characters
or the whitespace is off or there's unusual quoting involved.

If you just decreed that it was OK to append to the file, you could
avoid all that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Remove HeapTuple and Buffer dependency for predicate locking functions

2019-06-24 Thread Ashwin Agrawal
Proposing the following changes to make the predicate locking and
checking functions generic and remove their dependency on HeapTuple
and the heap AM. We made these changes to help with Zedstore. I think
the changes should help Zheap and other AMs in general.

- Change PredicateLockTuple() to PredicateLockTID(). So, instead of
  passing a HeapTuple to it, just pass the ItemPointer and the tuple's
  inserting transaction id if known. This was also discussed earlier
  in [1]; I don't think it was done in that context, but it would be
  helpful in the future if such a requirement comes up as well.

- CheckForSerializableConflictIn() takes a blocknum instead of a
  buffer. Currently, the function does nothing with the buffer anyway;
  it just needs the blocknum. This also helps to decouple the
  dependency on buffers, as not all AMs have a one-to-one mapping
  between blocknum and a single buffer. In zedstore, for example, a
  tuple is stored across individual column buffers, so we wish to have
  a way to lock not a physical buffer but a logical blocknum.

- CheckForSerializableConflictOut() no longer takes a HeapTuple or a
  buffer; instead it just takes an xid. Push the heap-specific parts
  of CheckForSerializableConflictOut() into its own function,
  HeapCheckForSerializableConflictOut(), which calls
  CheckForSerializableConflictOut(). The alternative option could be
  for CheckForSerializableConflictOut() to take a callback function
  and callback arguments, which get called if required after
  performing prechecks. Though currently I feel an AM having its own
  wrapper to perform AM-specific tasks and then calling
  CheckForSerializableConflictOut() is fine.

Attaching a patch which makes these changes.

This way PredicateLockTID(), CheckForSerializableConflictIn() and
CheckForSerializableConflictOut() functions become usable by any AM.


[1]
https://www.postgresql.org/message-id/CAEepm%3D2QbqQ_%2BKQQCnhKukF6NEAeq4SqiO3Qxe%2BfHza5-H-jKA%40mail.gmail.com
From aac57c17f078f869bf360677556842d58d5d33ea Mon Sep 17 00:00:00 2001
From: Ashwin Agrawal 
Date: Fri, 21 Jun 2019 13:42:00 -0700
Subject: [PATCH v1] Remove HeapTuple dependency for predicate locking
 functions.

Following changes help to make predicate locking functions generic and
not tied to particular AM.

- PredicateLockTuple() is renamed to PredicateLockTID(). It takes
  ItemPointer and inserting transaction id now instead of HeapTuple.

- CheckForSerializableConflictIn() takes blocknum instead of buffer

- CheckForSerializableConflictOut() no more takes HeapTuple nor buffer
---
 src/backend/access/gin/ginbtree.c|   2 +-
 src/backend/access/gin/ginfast.c |   2 +-
 src/backend/access/gin/gininsert.c   |   4 +-
 src/backend/access/gist/gist.c   |   2 +-
 src/backend/access/hash/hashinsert.c |   2 +-
 src/backend/access/heap/heapam.c | 103 +---
 src/backend/access/heap/heapam_handler.c |   7 +-
 src/backend/access/index/indexam.c   |   4 +-
 src/backend/access/nbtree/nbtinsert.c|   4 +-
 src/backend/storage/lmgr/predicate.c | 119 ---
 src/include/access/heapam.h  |   2 +
 src/include/storage/predicate.h  |   9 +-
 12 files changed, 150 insertions(+), 110 deletions(-)

diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 11a8ed7bbc2..e795375495b 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -89,7 +89,7 @@ ginFindLeafPage(GinBtree btree, bool searchMode,
 	stack->predictNumber = 1;
 
 	if (rootConflictCheck)
-		CheckForSerializableConflictIn(btree->index, NULL, stack->buffer);
+		CheckForSerializableConflictIn(btree->index, NULL, btree->rootBlkno);
 
 	for (;;)
 	{
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 2b3dd1c677f..f8ffeb06f8a 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -246,7 +246,7 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
 	 * tree, so it conflicts with all serializable scans.  All scans acquire a
 	 * predicate lock on the metabuffer to represent that.
 	 */
-	CheckForSerializableConflictIn(index, NULL, metabuffer);
+	CheckForSerializableConflictIn(index, NULL, GIN_METAPAGE_BLKNO);
 
 	if (collector->sumsize + collector->ntuples * sizeof(ItemIdData) > GinListPageSize)
 	{
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 55eab146173..046a20a3d41 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -221,7 +221,7 @@ ginEntryInsert(GinState *ginstate,
 			return;
 		}
 
-		CheckForSerializableConflictIn(ginstate->index, NULL, stack->buffer);
+		CheckForSerializableConflictIn(ginstate->index, NULL, BufferGetBlockNumber(stack->buffer));
 		/* modify an existing leaf entry */
 		itup = addItemPointersToLeafTuple(ginstate, itup,
 		  items, nitem, buildStats, stack->buffer);
@@ -230,7 +230,7 @@ ginEntryInsert(GinState *ginstate,
 	}
 	else
 	{
-		CheckForSerializableConflictI

Re: Usage of epoch in txid_current

2019-06-24 Thread Alvaro Herrera
On 2019-Jun-22, Alexander Korotkov wrote:

> 2. Also introduce FullMultixactId, and apply to MultixactId the
> similar change as #1.
> 3. Change SLRU on-disk format to handle 64-bit xids and multixacts.
> In particular make SLRU page numbers 64-bit too.  Add SLRU upgrade
> procedure to pg_upgrade.

I think enlarging multixacts to 64 bits is a terrible idea.  I would
rather get rid of multixacts completely; zheap is proposing not to use
multixacts at all, for example.  The amount of bloat caused by
pg_multixact data is already pretty bad ... because of which requiring
pg_multixact to be rewritten by pg_upgrade would cause a severe slowdown
for upgrades.  (It worked for FSM because the volume is tiny, but that's
not the case for multixact.)

I think the pg_upgrade problem can be worked around by creating a new
dir pg_multixact64 (an example) which is populated from the upgrade
point onwards; so you keep the old data unchanged, and new multixacts
use the new location, but the system knows to read the old one when
reading old tuples.  But, as I said above, I would rather not have
multixacts at all.

Another idea: create a new table AM that mimics heapam (I think ß-heap
"eszett-heap" is a good name), except that it reuses zheap's idea of
keeping "transaction locks" separately for tuple locks rather than
multixacts; heapam continues to use 32bits multixact.  Tables can be
migrated from heapam to ß-heap (alter table ... set access method) to
incrementally reduce reliance on multixacts going forward.  No need for
pg_upgrade-time disruption.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Comment typo in tableam.h

2019-06-24 Thread Ashwin Agrawal
On Mon, Jun 3, 2019 at 5:24 PM Ashwin Agrawal  wrote:

> There were a few more minor typos I had collected for table AM, passing them
> along here.
>
> Some of the required callback functions are missing Assert checking (minor
> thing), adding them in separate patch.
>

Curious to know if we need to register such small typo-fixing and
assertion-adding patches to the commitfest as well?


Re: Remove HeapTuple and Buffer dependency for predicate locking functions

2019-06-24 Thread Andres Freund
Hi,

On 2019-06-24 10:41:06 -0700, Ashwin Agrawal wrote:
> Proposing following changes to make predicate locking and checking
> functions generic and remove dependency on HeapTuple and Heap AM. We
> made these changes to help with Zedstore. I think the changes should
> help Zheap and other AMs in general.

Indeed.


> - Change PredicateLockTuple() to PredicateLockTID(). So, instead of
>   passing HeapTuple to it, just pass ItemPointer and tuple insert
>   transaction id if known. This was also discussed earlier in [1],
>   don't think was done in that context but would be helpful in future
>   if such requirement comes up as well.

Right.


> - CheckForSerializableConflictIn() take blocknum instead of
>   buffer. Currently, the function anyways does nothing with the buffer
>   just needs blocknum. Also, helps to decouple dependency on buffer as
>   not all AMs may have one to one notion between blocknum and single
>   buffer. Like for zedstore, tuple is stored across individual column
>   buffers. So, wish to have way to lock not physical buffer but
>   logical blocknum.

Hm.  I wonder if we somehow ought to generalize the granularity scheme
for predicate locks to not be tuple/page/relation.  But even if, that's
probably a separate patch.


> - CheckForSerializableConflictOut() no more takes HeapTuple nor
>   buffer, instead just takes xid. Push heap specific parts from
>   CheckForSerializableConflictOut() into its own function
>   HeapCheckForSerializableConflictOut() which calls
>   CheckForSerializableConflictOut(). The alternative option could be
>   CheckForSerializableConflictOut() take callback function and
>   callback arguments, which gets called if required after performing
>   prechecks. Though currently I feel AM having its own wrapper to
>   perform AM specific task and then calling
>   CheckForSerializableConflictOut() is fine.

I think it's right to move the xid handling out of
CheckForSerializableConflictOut(). But I think we also ought to move the
subtransaction handling out of the function - e.g. zheap doesn't
want/need that.


> Attaching patch which makes these changes.

Please make sure that there's a CF entry for this (I'm in a plane with a
super slow connection, otherwise I'd check).

Greetings,

Andres Freund




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-06-24 Thread Simon Riggs
On Mon, 24 Jun 2019 at 18:01, James Coleman  wrote:

> On Mon, Jun 24, 2019 at 12:56 PM Simon Riggs 
> wrote:
>
> > What is the specific use case for this? This sounds like quite a general case.
>
> They are both general cases in some sense, but the concerns lie mostly
> with what happens when they're unexpectedly encountered. For example,
> if the expected row count or group size is off by a good bit and we
> effectively have to perform a sort of all (or most) possible rows.
>
> If we can get the performance to a point where that misestimated row
> count or group size doesn't much matter, then ISTM including the patch
> becomes a much more obvious total win.
>

I was trying to think of ways of using external information/circumstance
to knowingly avoid negative use cases, i.e., don't treat the sort as a
black box; use its context.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/

PostgreSQL Solutions for the Enterprise


Re: Minimal logical decoding on standbys

2019-06-24 Thread Amit Khandekar
On Thu, 20 Jun 2019 at 00:31, Andres Freund  wrote:
>
> > > Or else, do you think we can just increment the record pointer by
> > > doing something like (lastReplayedEndRecPtr % XLOG_BLCKSZ) +
> > > SizeOfXLogShortPHD() ?
> >
> > I found out that we can't do this, because we don't know whether the
> > xlog header is SizeOfXLogShortPHD or SizeOfXLogLongPHD. In fact, in
> > our context, it is SizeOfXLogLongPHD. So we indeed need the
> > XLogReaderState handle.
>
> Well, we can determine whether a long or a short header is going to be
> used, as that's solely dependent on the LSN:

Discussion of this point (plus some more points) is in a separate
reply. You can reply to my comments there:
https://www.postgresql.org/message-id/CAJ3gD9f_HjQ6qP%3D%2B1jwzwy77fwcbT4-M3UvVsqpAzsY-jqM8nw%40mail.gmail.com

>
>
> >  /*
> > + * Get the wal_level from the control file.
> > + */
> > +int
> > +ControlFileWalLevel(void)
> > +{
> > + return ControlFile->wal_level;
> > +}
>
> Any reason not to return the type enum WalLevel instead?  I'm not sure I
> like the function name - perhaps something like GetActiveWalLevel() or
> such? The fact that it's in the control file doesn't seem relevant
> here.  I think it should be close to DataChecksumsEnabled() etc, which
> all return information from the control file.

Done.

>
>
> > +/*
> >   * Initialization of shared memory for XLOG
> >   */
> >  Size
> > @@ -9843,6 +9852,17 @@ xlog_redo(XLogReaderState *record)
> >   /* Update our copy of the parameters in pg_control */
> >   memcpy(&xlrec, XLogRecGetData(record), 
> > sizeof(xl_parameter_change));
> >
> > + /*
> > +  * Drop logical slots if we are in hot standby and master 
> > does not have
> > +  * logical data. Don't bother to search for the slots if 
> > standby is
> > +  * running with wal_level lower than logical, because in that 
> > case,
> > +  * we would have disallowed creation of logical slots.
> > +  */
>
> s/disallowed creation/disallowed creation or previously dropped/

Did this:
* we would have either disallowed creation of logical slots or dropped
* existing ones.

>
> > + if (InRecovery && InHotStandby &&
> > + xlrec.wal_level < WAL_LEVEL_LOGICAL &&
> > + wal_level >= WAL_LEVEL_LOGICAL)
> > + ResolveRecoveryConflictWithSlots(InvalidOid, 
> > InvalidTransactionId);
> > +
> >   LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
> >   ControlFile->MaxConnections = xlrec.MaxConnections;
> >   ControlFile->max_worker_processes =
> >   xlrec.max_worker_processes;
>
> Not for this patch, but I kinda feel the individual replay routines
> ought to be broken out of xlog_redo().
Yeah, agree.

>
>
> >  /* 
> >   * Functions for decoding the data and block references in a record.
> >   * 
> > diff --git a/src/backend/replication/logical/decode.c 
> > b/src/backend/replication/logical/decode.c
> > index 151c3ef..c1bd028 100644
> > --- a/src/backend/replication/logical/decode.c
> > +++ b/src/backend/replication/logical/decode.c
> > @@ -190,11 +190,23 @@ DecodeXLogOp(LogicalDecodingContext *ctx, 
> > XLogRecordBuffer *buf)
> >* can restart from there.
> >*/
> >   break;
> > + case XLOG_PARAMETER_CHANGE:
> > + {
> > + xl_parameter_change *xlrec =
> > + (xl_parameter_change *) 
> > XLogRecGetData(buf->record);
> > +
> > + /* Cannot proceed if master itself does not have 
> > logical data */
> > + if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
> > + ereport(ERROR,
> > + 
> > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > +  errmsg("logical decoding on 
> > standby requires "
> > + "wal_level >= 
> > logical on master")));
> > + break;
> > + }
>
> This should also HINT to drop the replication slot.

In this case, DecodeXLogOp() is being called because somebody is using
the slot itself. I'm not sure it makes sense to hint that the user drop
the very slot he/she is using. A hint to drop the slot would have made
more sense if something else were being done that does not require a
slot, so that dropping the now-troublesome slot avoids the error. What
do you say? Arguably the error message itself already hints at setting
wal_level back to logical.

>
>
> > + /*
> > +  * It is not guaranteed that the restart_lsn points to a valid
> > +  * record location. E.g. on standby, restart_lsn initially points to 
> 

Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-06-24 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Fri, Jun 21, 2019 at 11:24 AM Stephen Frost  wrote:
> > That's not quite accurate, given that it isn't how the ALTER SYSTEM call
> > itself works, and clearly isn't how the authors of that feature expected
> > things to work or they would have actually made it work.  They didn't,
> > and it doesn't actually work.
> >
> > The notion that pg_basebackup was correct in this, when it wasn't tested
> > at all, evidently, even after the concern was raised, and ALTER SYSTEM
> > was wrong, even though it worked just fine before some later patch
> > started making changes to the file, based on the idea that it's the
> > "natural approach" doesn't make sense to me.
> >
> > If the change to pg_basebackup had included a change to ALTER SYSTEM to
> > make it work the *same* way that pg_basebackup now does, or at least to
> > work with the changes that pg_basebackup were making, then maybe
> > everything would have been fine.
> 
> This argument boils down to: two people's patches don't play nicely
> together, and we should assume that the first patch had it right and
> the second patch had it wrong, because the first patch was first.

No, the point I was making is that one wasn't "natural" compared to the
other as we have two patches which clearly chose differently.  Had they
picked the same, as I said above, maybe everything would have been fine.

> I don't think it works like that. I think we should decide which patch
> had it right by looking at what the nicest behavior actually is, not
> by which one came first.  In my mind having ALTER SYSTEM drop
> duplicate that other tools may have introduced is a clear winner with
> basically no downside. You are arguing that it will produce confusion,
> but I don't really understand who is going to be confused or why they
> are going to be confused.  We can document whatever we do, and it
> should be fine.  Humans aren't generally supposed to be examining this
> file anyway, so they shouldn't get confused very often.

I'm not the only one who feels that it would be confusing for ALTER
SYSTEM to drop duplicates while every other tool creates them and
doesn't do anything to prevent them from being there.  As for who:
anyone who deals with PostgreSQL on a regular basis will end up running
into the "oh, huh, after pg_basebackup ran, I ended up with duplicates
in postgresql.auto.conf, I wonder if that's ok?" followed by "oh, errr,
I used to have duplicates but now they're gone?!?!  how'd that
happen?", unless, perhaps, they are reading this thread, in which case
they'll certainly know and expect it.  You can probably guess which
camp is larger.

When telling other tool authors how to manipulate PGDATA files, I really
dislike the "do as I say, not as I do" approach that you're advocating
for here.  Let's come up with a specific, clear, and ideally simple way
for everything to modify postgresql.auto.conf and let's have everything
use it.

> In my view, the original ALTER SYSTEM patch just has a bug -- it
> doesn't modify the right copy of the setting when multiple copies are
> present -- and we should just fix the bug.

Removing duplicates wouldn't be necessary for ALTER SYSTEM to just
modify the 'correct' version.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-06-24 Thread Tom Lane
Robert Haas  writes:
> On Fri, Jun 21, 2019 at 12:55 PM Tom Lane  wrote:
>> Ah, got it.  So it seems like the correct behavior might be for
>> ALTER SYSTEM to
>> (a) run through the whole file and remove any conflicting lines;
>> (b) append new setting at the end.

> That is exactly the behavior for which I am arguing.  Stephen also
> wants a warning, but I disagree, because the warning is totally
> non-actionable.  It tells you that some tool, at some point in the
> past, did something bad. You can't do anything about that, and you
> wouldn't need to except for the arbitrary decision to label duplicate
> lines as bad in the first place.

Agreed; there's no particular reason to consider the situation as wrong.
guc.c has always had the policy that dups are fine and the last one wins.
The very design of ALTER SYSTEM owes its workability to that policy, so
we can hardly say that A.S. should have a different policy internally.

The problem here is simply that ALTER SYSTEM is failing to consider the
possibility that there are dups in postgresql.auto.conf, and that seems
like little more than an oversight to be fixed.

There's more than one way we could implement a fix, perhaps, but I don't
really see a reason to work harder than is sketched above.

(BTW, has anyone checked whether ALTER SYSTEM RESET is prepared to remove
multiple lines for the same var?)

regards, tom lane




Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-06-24 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Fri, Jun 21, 2019 at 12:55 PM Tom Lane  wrote:
> > Ah, got it.  So it seems like the correct behavior might be for
> > ALTER SYSTEM to
> > (a) run through the whole file and remove any conflicting lines;
> > (b) append new setting at the end.
> 
> That is exactly the behavior for which I am arguing.  Stephen also
> wants a warning, but I disagree, because the warning is totally
> non-actionable.  It tells you that some tool, at some point in the
> past, did something bad. You can't do anything about that, and you
> wouldn't need to except for the arbitrary decision to label duplicate
> lines as bad in the first place.

Stephen and Magnus want a warning, because it's an indication that a
tool author, or *something* modified the file in an unexpected way, and
that we are having to do some kind of cleanup on the file because of it.

If it was a tool author, who it certainly may very well be as they're
writing in support for the v12 changes, they'd almost certainly go and
fix their code to avoid doing that, lest users complain, which would be
exactly the behavior we want.

If it was the user themselves, which is also *entirely* likely, then
hopefully they'd realize that they really shouldn't be modifying that
file.

Thanks,

Stephen




Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-06-24 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
> On Sat, Jun 22, 2019 at 5:13 PM Stephen Frost  wrote:
> > All that said, whatever code it is that we write for pg_basebackup to do 
> > this properly should go into our client side library, so other tools can 
> > leverage that and avoid having to write it themselves.
> 
> That is probably only going to help people who are writing in C (or
> maybe some close family member) and a lot of tools for managing
> PostgreSQL will be written in scripting languages.  It is unlikely
> that those people are going to get all of the rules for parsing a file
> full of GUC settings exactly right, because translating flex into
> Python is probably not anybody's idea of a fun time. So you'll end up
> with a bunch of rewrite-postgresql.auto.conf tools written in
> different languages at varying degrees of quality many of which will
> misfire in corner cases where the GUC names contain funny characters
> or the whitespace is off or there's unusual quoting involved.

Calling into C functions from Python certainly isn't new, nor is it
difficult to do from Perl, or various other languages, someone just
needs to write the bindings.  I'm not sure where the idea came from that
someone would translate flex into Python, that's certainly not what I
was suggesting at any point in this discussion.

> If you just decreed that it was OK to append to the file, you could
> avoid all that.

As I said elsewhere on this thread, I have absolutely no problem with
that as the documented approach to working with this file- but if that's
what we're going to have be the documented approach, then everything
should be using that approach...

Thanks,

Stephen




Ear on mailing

2019-06-24 Thread Sascha Kuhl
Does Microsoft or any other DB manufacturer have an ear on this mailing
list?

Sascha Kuhl


Re: Ear on mailing

2019-06-24 Thread Alvaro Herrera
On 2019-Jun-23, Sascha Kuhl wrote:

> Does Microsoft or any other DB manufacturer have an ear on this mailing
> list?

This is a public mailing list, so anybody with an interest can subscribe
to it.  If you search the archives, you might find a more explicit
answer to your question.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-06-24 Thread Alvaro Herrera
On 2019-Jun-24, Robert Haas wrote:

> On Sat, Jun 22, 2019 at 5:13 PM Stephen Frost  wrote:
> > All that said, whatever code it is that we write for pg_basebackup to do 
> > this properly should go into our client side library, so other tools can 
> > leverage that and avoid having to write it themselves.
> 
> That is probably only going to help people who are writing in C (or
> maybe some close family member) and a lot of tools for managing
> PostgreSQL will be written in scripting languages.

But we already have ALTER SYSTEM, so why do we need to write it again?
You just need to check whether the system is running: if it is, connect
and do "ALTER SYSTEM".  If it isn't, do `echo ALTER SYSTEM | postgres
--single`.  Maybe we can embed smarts to do that in, say, pg_ctl; then
everybody has access to it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-06-24 Thread Tom Lane
Stephen Frost  writes:
> Stephen and Magnus want a warning, because it's an indication that a
> tool author, or *something* modified the file in an unexpected way, and
> that we are having to do some kind of cleanup on the file because of it.

But you're presuming something that not everybody agrees with, which
is that this situation should be considered unexpected.

In particular, in order to consider it unexpected, you have to suppose
that the content rules for postgresql.auto.conf are different from those
for postgresql.conf (wherein we clearly allow last-one-wins).  Can you
point to any user-facing documentation that says that?

regards, tom lane




Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-06-24 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Robert Haas  writes:
> > On Fri, Jun 21, 2019 at 12:55 PM Tom Lane  wrote:
> >> Ah, got it.  So it seems like the correct behavior might be for
> >> ALTER SYSTEM to
> >> (a) run through the whole file and remove any conflicting lines;
> >> (b) append new setting at the end.
> 
> > That is exactly the behavior for which I am arguing.  Stephen also
> > wants a warning, but I disagree, because the warning is totally
> > non-actionable.  It tells you that some tool, at some point in the
> > past, did something bad. You can't do anything about that, and you
> > wouldn't need to except for the arbitrary decision to label duplicate
> > lines as bad in the first place.
> 
> Agreed; there's no particular reason to consider the situation as wrong.
> guc.c has always had the policy that dups are fine and the last one wins.
> The very design of ALTER SYSTEM owes its workability to that policy, so
> we can hardly say that A.S. should have a different policy internally.
> 
> The problem here is simply that ALTER SYSTEM is failing to consider the
> possibility that there are dups in postgresql.auto.conf, and that seems
> like little more than an oversight to be fixed.
> 
> There's more than one way we could implement a fix, perhaps, but I don't
> really see a reason to work harder than is sketched above.

Why bother removing the duplicate lines?

If ALTER SYSTEM should remove them, why shouldn't other tools?

> (BTW, has anyone checked whether ALTER SYSTEM RESET is prepared to remove
> multiple lines for the same var?)

No, it doesn't handle that today either, as discussed earlier in this
thread.

If we want to get to should/must kind of language, then we could say
that tools should remove duplicated values, and must append to the end,
but I'm not sure that really changes things from what I'm proposing
anyway.

Thanks,

Stephen




Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-06-24 Thread Stephen Frost
Greetings,

* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
> On 2019-Jun-24, Robert Haas wrote:
> 
> > On Sat, Jun 22, 2019 at 5:13 PM Stephen Frost  wrote:
> > > All that said, whatever code it is that we write for pg_basebackup to do 
> > > this properly should go into our client side library, so other tools can 
> > > leverage that and avoid having to write it themselves.
> > 
> > That is probably only going to help people who are writing in C (or
> > maybe some close family member) and a lot of tools for managing
> > PostgreSQL will be written in scripting languages.
> 
> But we already have ALTER SYSTEM, so why do we need to write it again?
> You just need to check whether the system is running: if it is, connect
> and do "ALTER SYSTEM".  If it isn't, do `echo ALTER SYSTEM | postgres
> --single`.  Maybe we can embed smarts to do that in, say, pg_ctl; then
> everybody has access to it.

While I'm not against adding some kind of support like that if we feel
like we really need it, I tend to think that just having it in
libpgcommon would be enough for most tool authors to use..

Thanks,

Stephen




Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-06-24 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > Stephen and Magnus want a warning, because it's an indication that a
> > tool author, or *something* modified the file in an unexpected way, and
> > that we are having to do some kind of cleanup on the file because of it.
> 
> But you're presuming something that not everybody agrees with, which
> is that this situation should be considered unexpected.

And, at least at present, not everyone seems to agree that having
duplicates should be considered expected, either.  Using only ALTER
SYSTEM, you'd never end up with duplicates in the first place.

> In particular, in order to consider it unexpected, you have to suppose
> that the content rules for postgresql.auto.conf are different from those
> for postgresql.conf (wherein we clearly allow last-one-wins).  Can you
> point to any user-facing documentation that says that?

The backend and frontend tools don't modify postgresql.conf, and we
don't document how to modify postgresql.auto.conf at *all*, even though
we clearly now expect tool authors to go modifying it so that they can
provide the same capabilities that pg_basebackup does and which they
used to through recovery.conf, so I don't really see that as being
comparable.

The only thing we used to have to go on was what ALTER SYSTEM did, and
then pg_basebackup went and did something different, and enough so that
they ended up conflicting with each other, leading to this discussion.

Thanks,

Stephen




Re: Ear on mailing

2019-06-24 Thread Stephen Frost
Greetings,

* Sascha Kuhl (yogidaba...@gmail.com) wrote:
> Does Microsoft or any other DB manufacturer have an ear on this mailing
> list?

Many, many, many of the people who are on this mailing list work for DB
manufacturers.

I suspect that most of them are manufacturing PostgreSQL.

Thanks,

Stephen




Prevent invalid memory access in LookupFuncName

2019-06-24 Thread Alexander Lakhin
Hello hackers,

When running on REL_11_STABLE the following query:
CREATE PROCEDURE test_ambiguous_procname(int) as $$ begin end; $$
language plpgsql;
CREATE PROCEDURE test_ambiguous_procname(text) as $$ begin end; $$
language plpgsql;
DROP PROCEDURE test_ambiguous_procname;
under valgrind I get the memory access errors:

2019-06-24 22:21:39.925 MSK|law|regression|5d1122c2.2921|LOG: 
statement: DROP PROCEDURE test_ambiguous_procname;
==00:00:00:07.756 10529== Conditional jump or move depends on
uninitialised value(s)
==00:00:00:07.756 10529==    at 0x4C35E60: __memcmp_sse4_1 (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==00:00:00:07.756 10529==    by 0x2E9A7B: LookupFuncName (parse_func.c:2078)
==00:00:00:07.756 10529==    by 0x2E9D96: LookupFuncWithArgs
(parse_func.c:2141)
==00:00:00:07.756 10529==    by 0x2A2F7E: get_object_address
(objectaddress.c:893)
==00:00:00:07.756 10529==    by 0x31C9C0: RemoveObjects (dropcmds.c:71)
==00:00:00:07.756 10529==    by 0x50D4FF: ExecDropStmt (utility.c:1738)
==00:00:00:07.756 10529==    by 0x5100BC: ProcessUtilitySlow
(utility.c:1580)
==00:00:00:07.756 10529==    by 0x50EDFE: standard_ProcessUtility
(utility.c:835)
==00:00:00:07.756 10529==    by 0x50F07D: ProcessUtility (utility.c:360)
==00:00:00:07.756 10529==    by 0x50B4D2: PortalRunUtility (pquery.c:1178)
==00:00:00:07.756 10529==    by 0x50C169: PortalRunMulti (pquery.c:1324)
==00:00:00:07.756 10529==    by 0x50CEFF: PortalRun (pquery.c:799)
==00:00:00:07.756 10529==  Uninitialised value was created by a stack
allocation
==00:00:00:07.756 10529==    at 0x2E9C31: LookupFuncWithArgs
(parse_func.c:2106)
==00:00:00:07.756 10529==
...
==00:00:00:07.756 10529== Conditional jump or move depends on
uninitialised value(s)
==00:00:00:07.756 10529==    at 0x2E9A7E: LookupFuncName (parse_func.c:2078)
==00:00:00:07.756 10529==    by 0x2E9D96: LookupFuncWithArgs
(parse_func.c:2141)
==00:00:00:07.757 10529==    by 0x2A2F7E: get_object_address
(objectaddress.c:893)
==00:00:00:07.757 10529==    by 0x31C9C0: RemoveObjects (dropcmds.c:71)
==00:00:00:07.757 10529==    by 0x50D4FF: ExecDropStmt (utility.c:1738)
==00:00:00:07.757 10529==    by 0x5100BC: ProcessUtilitySlow
(utility.c:1580)
==00:00:00:07.757 10529==    by 0x50EDFE: standard_ProcessUtility
(utility.c:835)
==00:00:00:07.757 10529==    by 0x50F07D: ProcessUtility (utility.c:360)
==00:00:00:07.757 10529==    by 0x50B4D2: PortalRunUtility (pquery.c:1178)
==00:00:00:07.757 10529==    by 0x50C169: PortalRunMulti (pquery.c:1324)
==00:00:00:07.757 10529==    by 0x50CEFF: PortalRun (pquery.c:799)
==00:00:00:07.757 10529==    by 0x5090FF: exec_simple_query
(postgres.c:1145)
==00:00:00:07.757 10529==  Uninitialised value was created by a stack
allocation
==00:00:00:07.757 10529==    at 0x2E9C31: LookupFuncWithArgs
(parse_func.c:2106)

As I see it, the code in LookupFuncName can fall through the "if (nargs ==
-1)" condition and execute memcmp with nargs == -1.
The proposed patch is attached.

Best regards,
Alexander
diff --git a/src/backend/parser/parse_func.c b/src/backend/parser/parse_func.c
index 44257154b81..ee9f73ef977 100644
--- a/src/backend/parser/parse_func.c
+++ b/src/backend/parser/parse_func.c
@@ -2071,6 +2071,7 @@ LookupFuncName(List *funcname, int nargs, const Oid *argtypes, bool noError)
 		 errmsg("could not find a function named \"%s\"",
							NameListToString(funcname))));
 		}
+		return InvalidOid;
 	}
 
 	while (clist)


Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-06-24 Thread Tomas Vondra

On Mon, Jun 24, 2019 at 01:00:50PM -0400, James Coleman wrote:

On Mon, Jun 24, 2019 at 12:56 PM Simon Riggs  wrote:


On Mon, 24 Jun 2019 at 16:10, James Coleman  wrote:


On Thu, Jun 13, 2019 at 11:38:12PM -0400, James Coleman wrote:
>I think the first thing to do is get some concrete numbers on performance if we:
>
>1. Only sort one group at a time.
>2. Update the costing to prefer traditional sort unless we have very
>high confidence we'll win with incremental sort.
>
>It'd be nice not to have to add additional complexity if at all possible.

I've been focusing my efforts so far on seeing how much we can
eliminate performance penalties (relative to traditional sort). It
seems that if we can improve things enough there that we'd limit the
amount of adjustment needed to costing -- we'd still need to consider
cases where the lower startup cost results in picking significantly
different plans in a broad sense (presumably due to lower startup cost
and the ability to short circuit on a limit). But I'm hopeful then we
might be able to avoid having to consult MCV lists (and we wouldn't
have that available in all cases anyway)

As I see it the two most significant concerning cases right now are:
1. Very large batches (in particular where the batch is effectively
all of the matching rows such that we're really just doing a standard
sort).
2. Many very small batches.



What is the specific use case for this? This sounds like quite a general case.


They are both general cases in some sense, but the concerns lie mostly
with what happens when they're unexpectedly encountered. For example,
if the expected row count or group size is off by a good bit and we
effectively have to perform a sort of all (or most) possible rows.

If we can get the performance to a point where that misestimated row
count or group size doesn't much matter, then ISTM including the patch
becomes a much more obvious total win.



Yes, that seems like a reasonable approach. Essentially, we're trying to
construct plausible worst case examples, and then minimize the overhead
compared to regular sort. If we get sufficiently close, then it's fine
to rely on somewhat shaky stats - like group size estimates.



Do we know something about the nearly-sorted rows that could help us?
Or could we introduce some information elsewhere that would help with
the sort?

Could we for-example, pre-sort the rows block by block, or filter out
the rows that are clearly out of order, so we can re-merge them
later?


I'm not sure what you mean by "block by block"?



I'm not sure what "block by block" means either.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-06-24 Thread Tomas Vondra

On Mon, Jun 24, 2019 at 07:05:24PM +0100, Simon Riggs wrote:

On Mon, 24 Jun 2019 at 18:01, James Coleman  wrote:


On Mon, Jun 24, 2019 at 12:56 PM Simon Riggs 
wrote:

> What is the specific use case for this? This sounds like quite a general case.

They are both general cases in some sense, but the concerns lie mostly
with what happens when they're unexpectedly encountered. For example,
if the expected row count or group size is off by a good bit and we
effectively have to perform a sort of all (or most) possible rows.

If we can get the performance to a point where that misestimated row
count or group size doesn't much matter, then ISTM including the patch
becomes a much more obvious total win.



I was trying to think of ways of using external information/circumstance to
knowingly avoid negative use cases. i.e. don't treat sort as a black box,
use its context.



Like what, for example? I'm not saying there's no such additional
information, but I can't think of anything at the moment.

One of the issues is that while the decision whether to use incremental
sort is done during planning, most of the additional (and reliable)
information about the data is only available at execution time. And it's
likely not available explicitly - we need to deduce it. Which I think is
exactly what the min group size heuristics is about, no?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Update list of combining characters

2019-06-24 Thread Peter Eisentraut
On 2019-06-19 21:55, Tom Lane wrote:
> Peter Eisentraut  writes:
>>> Indeed.  Here is an updated script and patch.
> 
>> committed (to master)
> 
> Cool, but should we also put your recalculation script into git, to help
> the next time we decide that we need to update this list?  It's
> demonstrated to be nontrivial to get it right ;-)

For PG12, having the script in the archives is sufficient, I think.  Per
thread "more Unicode data updates", we should come up with a method that
updates all (currently three) places where Unicode data is applied,
which would involve some larger restructuring, probably.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: more Unicode data updates

2019-06-24 Thread Peter Eisentraut
On 2019-06-19 22:34, Peter Eisentraut wrote:
> src/include/common/unicode_norm_table.h also should be updated to the
> latest Unicode tables, as described in src/common/unicode.  See attached
> patches.  This also passes the tests described in
> src/common/unicode/README.  (That is, the old code does not pass the
> current Unicode test file, but the updated code does pass it.)

committed

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: check_recovery_target_lsn() does a PG_CATCH without a throw

2019-06-24 Thread Peter Eisentraut
On 2019-06-24 06:06, Michael Paquier wrote:
> -   if (strcmp(*newval, "epoch") == 0 ||
> -   strcmp(*newval, "infinity") == 0 ||
> -   strcmp(*newval, "-infinity") == 0 ||
> Why do you remove these?  They should still be rejected because they
> make no sense as recovery targets, no?

Yeah but the new code already rejects those anyway.  Note how
timestamptz_in() has explicit switch cases to accept those, and we
didn't carry those over into check_recovery_time().

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-06-24 Thread James Coleman
On Mon, Jun 24, 2019 at 4:16 PM Tomas Vondra
 wrote:
>
> On Mon, Jun 24, 2019 at 01:00:50PM -0400, James Coleman wrote:
> >On Mon, Jun 24, 2019 at 12:56 PM Simon Riggs  wrote:
> >>
> >> On Mon, 24 Jun 2019 at 16:10, James Coleman  wrote:
> >>>
> >>> On Thu, Jun 13, 2019 at 11:38:12PM -0400, James Coleman wrote:
...
> >>> As I see it the two most significant concerning cases right now are:
> >>> 1. Very large batches (in particular where the batch is effectively
> >>> all of the matching rows such that we're really just doing a standard
> >>> sort).
> >>> 2. Many very small batches.
> >>
> >>
> >> What is the specific use case for this? This sounds like quite a general case.
> >
> >They are both general cases in some sense, but the concerns lie mostly
> >with what happens when they're unexpectedly encountered. For example,
> >if the expected row count or group size is off by a good bit and we
> >effectively have to perform a sort of all (or most) possible rows.
> >
> >If we can get the performance to a point where that misestimated row
> >count or group size doesn't much matter, then ISTM including the patch
> >becomes a much more obvious total win.
> >
>
> Yes, that seems like a reasonable approach. Essentially, we're trying to
> construct plausible worst case examples, and then minimize the overhead
> compared to regular sort. If we get sufficiently close, then it's fine
> to rely on somewhat shaky stats - like group size estimates.

I have a bit of a mystery in my performance testing. I've been setting
up a table like so:

create table foo(pk serial primary key, owner_fk integer, created_at timestamp);
insert into foo(owner_fk, created_at)
select fk_t.i, now() - (time_t.i::text || ' minutes')::interval
from generate_series(1, 10000) time_t(i)
cross join generate_series(1, 1000) fk_t(i);
-- double up on one set to guarantee matching prefixes
insert into foo (owner_fk, created_at) select owner_fk, created_at
from foo where owner_fk = 23;
create index idx_foo_on_owner_and_created_at on foo(owner_fk, created_at);
analyze foo;

and then I have the following query:

select *
from foo
where owner_fk = 23
order by created_at desc, pk desc
limit 20000;

The idea here is to force a bit of a worst case for small groups: we
have 10,000 batches (i.e., equal prefix groups) of 2 tuples each and
then query with a limit matching the actual number of rows we know
will match the query -- so even though there's a limit we're forcing a
total sort (and also guaranteeing both plans have to touch the same
number of rows). Note: I know that batches of size 1 is actually the
worst case, but I chose batches of two because I've also been testing
a change that would skip the sort entirely for single tuple batches.

On master (really the commit right before the current revision of the
patch), I get:
latency average = 14.271 ms
tps = 70.075243 (excluding connections establishing)

With the patch (and incremental sort enabled):
latency average = 11.975 ms
tps = 83.512090 (excluding connections establishing)

With the patch (but incremental sort disabled):
latency average = 11.884 ms
tps = 84.149834 (excluding connections establishing)

All of those are 60 seconds runs on pgbench with a single thread.

So we have a very substantial speedup with the patch *even if the new
feature isn't enabled*. I've confirmed the plan looks the same on
patched (with incremental sort disabled) and master. The only changes
that would seem to really affect execution time would be the changes
to tuplesort.c, but looking through them I don't see anything I'd
expect to change things so dramatically.

Any thoughts on this?

James Coleman




Re: fix psql \conninfo & \connect when using hostaddr

2019-06-24 Thread Noah Misch
On Mon, Jun 24, 2019 at 08:52:00AM -0400, Alvaro Herrera wrote:
> On 2019-Jun-24, Michael Paquier wrote:
> > On Fri, Jun 14, 2019 at 06:31:52PM -0400, Alvaro Herrera wrote:
> > > BTW I tested this manually and it seemed to work as intended, ie., if I
> > > change /etc/hosts for the hostname I am using between one connection and
> > > the next, we do keep the hostaddr if it was specified, or we resolve the
> > > name again if it's not.
> > 
> > Alvaro, did 313f56ce fix all the issues related to this thread?
> 
> Yes, as far as I am aware it does.

I agree, it did.




Re: New vacuum option to do only freezing

2019-06-24 Thread Michael Paquier
On Sun, Jun 23, 2019 at 10:29:25PM +0900, Michael Paquier wrote:
> So, are there any objections to this patch?  Or do people think that
> it's too late for v12 and that it is better to wait until v13 opens
> for business?

Committed, and open item closed.
--
Michael


signature.asc
Description: PGP signature


Re: fix psql \conninfo & \connect when using hostaddr

2019-06-24 Thread Michael Paquier
On Mon, Jun 24, 2019 at 04:51:04PM -0700, Noah Misch wrote:
> On Mon, Jun 24, 2019 at 08:52:00AM -0400, Alvaro Herrera wrote:
>> On 2019-Jun-24, Michael Paquier wrote:
>>> Alvaro, did 313f56ce fix all the issues related to this thread?
>> 
>> Yes, as far as I am aware it does.
> 
> I agree, it did.

Indeed.  I have been looking at the thread and I can see the
difference with and without the patch committed.  This open item is
closed.
--
Michael


signature.asc
Description: PGP signature


Re: check_recovery_target_lsn() does a PG_CATCH without a throw

2019-06-24 Thread Michael Paquier
On Mon, Jun 24, 2019 at 11:27:26PM +0200, Peter Eisentraut wrote:
> Yeah but the new code already rejects those anyway.  Note how
> timestamptz_in() has explicit switch cases to accept those, and we
> didn't carry those over into check_recovery_time().

Ditto.  I was not paying much attention to the code.  Your patch
indeed rejects anything else than DTK_DATE.  So we are good here,
sorry for the noise.
--
Michael


signature.asc
Description: PGP signature


Re: Use of reloptions by EXTENSIONs

2019-06-24 Thread Michael Paquier
On Mon, Jun 24, 2019 at 11:47:09AM +0200, Dent John wrote:
> I’ll see if I can get the patch applied and feed back on how much it
> moves towards making my use case a viable proposition.

There is another patch which provides more coverage for reloptions:
https://commitfest.postgresql.org/23/2064/

Based on my last lookup, I was rather unhappy with its shape because
of the assumptions behind the tests and the extra useless work it was
doing with parameter strings (the set of WARNING is also something we
don't need).  If we get that first in, we can then make sure that any
extra refactoring has hopefully no impact.
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-06-24 Thread Ian Barwick

On 6/25/19 4:06 AM, Alvaro Herrera wrote:
> On 2019-Jun-24, Robert Haas wrote:
>
>> On Sat, Jun 22, 2019 at 5:13 PM Stephen Frost  wrote:
>>> All that said, whatever code it is that we write for pg_basebackup to
>>> do this properly should go into our client side library, so other tools
>>> can leverage that and avoid having to write it themselves.
>>
>> That is probably only going to help people who are writing in C (or
>> maybe some close family member) and a lot of tools for managing
>> PostgreSQL will be written in scripting languages.
>
> But we already have ALTER SYSTEM, so why do we need to write it again?
> You just need to check whether the system is running: if it is, connect
> and do "ALTER SYSTEM".  If it isn't, do `echo ALTER SYSTEM | postgres
> --single`.  Maybe we can embed smarts to do that in, say, pg_ctl; then
> everybody has access to it.

Unfortunately, to quote the emitted log message, "standby mode is not
supported by single-user servers", which as-is renders this approach useless for
setting up replication configuration on a standby server (unless I'm missing
something).

I've looked in to what might be involved into creating a client-side function
for modifying .auto.conf while the system is not running, and basically
it seems to involve maintaining a stripped down version of ParseConfigFp()
which doesn't recurse (because we don't allow "include" directives in
.auto.conf, right? Right? I'll send in a different patch for that later...)
and somehow exposing write_auto_conf_file().

And for all those scripts which can't call the putative frontend C function,
we could provide a utility called "pg_alter_system" or similar which accepts
a name and a value and (provided the system is not running) "safely"
writes it to .auto.conf (though of course it won't be able to validate the
provided parameter(s)).

Alternatively (waves hand vaguely in air) there might be some way of
creating a single user startup mode for the express purpose of leveraging
the backend code to modify .auto.conf.

Bur that seems like a lot of effort and complexity to replace what, in Pg11
and earlier, was just a case of writing to recovery.conf.

Which brings me to another thought which AFAIK hasn't been discussed -
what use-cases are there for modifying .auto.conf when the system isn't
running?

The only one I can think of is the case at hand, i.e. configuring replication
after cloning a standby in a manner which *guarantees* that the
replication configuration will be read at startup, which was the case
with recovery.conf in Pg11 and earlier.

For anything else, it seems reasonable to me to expect any customised
settings to be written (e.g. by a provisioning system) to the normal
configuration file(s).

Having pg_basebackup write the replication configuration to a normal file
is icky because there's no guarantee the configuration will be written
last, or even included at all, which is a regression against earlier
versions as there you could clone a standby and (assuming there are no
issues with any cloned configuration files) have the standby start up
reliably.


Regards

Ian Barwick

--
 Ian Barwick   https://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-06-24 Thread Ian Barwick

> In particular, in order to consider it unexpected, you have to suppose
>> that the content rules for postgresql.auto.conf are different from those
>> for postgresql.conf (wherein we clearly allow last-one-wins).  Can you
>> point to any user-facing documentation that says that?
>
> The backend and frontend tools don't modify postgresql.conf, and we
> don't document how to modify postgresql.auto.conf at *all*, even though
> we clearly now expect tool authors to go modifying it so that they can
> provide the same capabilities that pg_basebackup does and which they
> used to through recovery.conf, so I don't really see that as being
> comparable.
>
> The only thing we used to have to go on was what ALTER SYSTEM did, and
> then pg_basebackup went and did something different, and enough so that
> they ended up conflicting with each other, leading to this discussion.

Or looking at it from another perspective - previously there was no
particular use-case for appending to .auto.conf, until it (implicitly)
became the only way of doing what recovery.conf used to do, and happened to
expose the issue at hand.

Leaving aside pg_basebackup and the whole issue of writing replication
configuration, .auto.conf remains a text file which could potentially
include duplicate entries, no matter how much we stipulate it shouldn't.
As-is, ALTER SYSTEM fails to deal with this case, which in my opinion
is a bug and a potential footgun which needs fixing.

(Though we'll still need to define/provide a way of writing configuration
while the server is not running, which will be guaranteed to be read in last
when it starts up).


Regards

Ian Barwick

--
 Ian Barwick   https://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services





pgbench prints suspect tps numbers

2019-06-24 Thread Daniel Wood
Short benchmark runs are bad if the runs aren't long enough to produce 
consistent results.

Having to do long runs because a benchmarking tool 'converges to reality' over 
time in reporting a tps number, due to miscalculation, is also bad.


I want to measure TPS at a particular connection count.  A fully cached Select 
Only pgbench produces fairly consistent numbers over short runs of a few 
minutes.

pgbench's "including connections establishing" number is polluted by fact that 
for many seconds the benchmark is running with less than the expected number of 
connections.  I thought that was why the 'excluding' number was also printed 
and I had been relying on that number.

pgbench's "excluding connections establishing" number seems to be a total 
garbage number which can be way way over the actual tps.  During a period when 
I had a bug causing slow connections I noticed a consistent value of about 100K 
tps over the measurement intervals.  At the end of a 5 minute run it reported 
450K tps!  There was no point anywhere during the benchmark that it ran 
anywhere near that number.


I had been using 'excluding' because it 'seemed' perhaps right in the past.  It 
was only when I got crazy numbers I looked at the calculation to find:


tps_exclude = total->cnt / (time_include - 
(INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients));


The 'cnt' is the total across the entire run including the period when 
connections are ramping up.  I don't see how dividing by the total time minus 
the average connection time produces the correct result.


Even without buggy slow connections, when connecting 1000 clients, I've 
wondered why the 'excluding' number seemed a bit higher than any given 
reporting interval numbers over a 5 minute run.  I now understand why.  NOTE: 
When the system hits 100% CPU utilization (after about the first 100 
connections) on a fully cached select-only pgbench, further connections can 
struggle to get connected, which really skews the results.


How about a patch which offered the option to wait on an advisory lock as a 
mechanism to let the main thread delay the start of the workload after all 
clients have connected and entered a READY state?  This would produce a much 
cleaner number.

Re: Prevent invalid memory access in LookupFuncName

2019-06-24 Thread Michael Paquier
On Mon, Jun 24, 2019 at 11:10:54PM +0300, Alexander Lakhin wrote:
> When running on REL_11_STABLE the following query:
> CREATE PROCEDURE test_ambiguous_procname(int) as $$ begin end; $$
> language plpgsql;
> CREATE PROCEDURE test_ambiguous_procname(text) as $$ begin end; $$
> language plpgsql;
> DROP PROCEDURE test_ambiguous_procname;
> under valgrind I get the memory access errors:

Thanks!  I have been able to reproduce the problem, and the error is
obvious looking at the code.  I have changed the patch to be more
consistent with HEAD though, returning InvalidOid in the code paths
generating the error.  The logic is the same, but that looked cleaner
to me, and I have added some comments on the way, similarly to what
bfb456c1 has done for HEAD (where LookupFuncNameInternal is doing the
right thing already).  This has been incorrect since aefeb68, so
back-patched down to 10.
--
Michael


signature.asc
Description: PGP signature


Re: old_snapshot_threshold vs indexes

2019-06-24 Thread Thomas Munro
On Fri, Jun 21, 2019 at 1:21 AM Thomas Munro  wrote:
> I ran into someone with a system where big queries scanning 8GB+ of
> all-in-cache data took consistently ~2.5x longer on a primary server
> than on a replica.  Both servers had concurrent activity on them but
> plenty of spare capacity and similar specs.  After some investigation
> it turned out that on the primary there were (1) some select()
> syscalls waiting for 1ms, which might indicate contended
> SpinLockAcquire() back-offs, and (2) a huge amount of time spent in:
>
> + 93,31% 0,00% postgres postgres [.] index_getnext
> + 93,30% 0,00% postgres postgres [.] index_fetch_heap
> + 81,66% 0,01% postgres postgres [.] heap_page_prune_opt
> + 75,85% 0,00% postgres postgres [.] TransactionIdLimitedForOldSnapshots
> + 75,83% 0,01% postgres postgres [.] RelationHasUnloggedIndex
> + 75,79% 0,00% postgres postgres [.] RelationGetIndexList
> + 75,79% 75,78% postgres postgres [.] list_copy

On my laptop, all prewarmed, no concurrency, the mere existence of 10
brin indexes causes a sequential scan to take ~5% longer and an
uncorrelated index scan to take ~45% longer (correlated index scans
don't suffer).  Here's a draft patch for v13 that fixes that problem
by caching the result of RelationHasUnloggedIndex().

Reproducer scripts also attached.  I ran them with shared_buffers=8GB,
old_snapshot_threshold=10s and pg_prewarm installed.

I didn't try to look into the complaint about suspected spinlock contention.

-- 
Thomas Munro
https://enterprisedb.com


0001-Cache-the-result-of-RelationHasUnloggedIndex.patch
Description: Binary data


test-seqscan.sql
Description: Binary data


test-indexscan.sql
Description: Binary data


Re: Problem with default partition pruning

2019-06-24 Thread yuzuko
Hello Shawn, Alvaro,

Thank you for testing the patches, and for your comments.
Yes, there are two patches:
(1) v4_default_partition_pruning.patch fixes problems with default
partition pruning
and (2) v3_ignore_contradictory_where_clauses_at_partprune_step.patch determines
if a given clause contradicts a sub-partitioned table's partition constraint.
I'll post two patches together next time.

Anyway, I'll rebase the two patches to apply on master and fix the whitespace.

Regards,
Yuzuko Hosoya

On Mon, Jun 24, 2019 at 12:45 PM Alvaro Herrera
 wrote:
>
> On 2019-Jun-24, shawn wang wrote:
>
> Hello,
>
> > Thank you for your reply.
> > You can see that the mail start time is February 22. So I looked at the
> > latest version at that time. I found that v11.2 was the newest branch at
> > the time. So I tried to merge this patch into the code, and I found that
> > everything worked.
>
> I see -- I only tried master, didn't occur to me to try it against
> REL_11_STABLE.
>
> > So I tested on this branch and got the results.
> > You need to add the v4_default_partition_pruning.patch
> > first,
> > and then add the
> > v3_ignore_contradictory_where_clauses_at_partprune_step.patch
>
> Oh, so there are two patches?  It's easier to keep track if they're
> always posted together.  Anyway, I may have some time to have a look
> tomorrow (Monday).
>
> --
> Álvaro Herrerahttps://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



-- 
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center




Re: pgbench prints suspect tps numbers

2019-06-24 Thread Fabien COELHO



Hello Daniel,


I want to measure TPS at a particular connection count. [...]

pgbench's "including connections establishing" number is polluted by 
fact that for many seconds the benchmark is running with less than the 
expected number of connections.  I thought that was why the 'excluding' 
number was also printed and I had been relying on that number.


pgbench's "excluding connections establishing" number seems to be a 
total garbage number which can be way way over the actual tps.  During a 
period when I had a bug causing slow connections I noticed a consistent 
value of about 100K tps over the measurement intervals.  At the end of a 
5 minute run it reported 450K tps!  There was no point anywhere during 
the benchmark that it ran anywhere near that number.


Could you report the precise version, settings and hardware?

In particular, how many threads, clients and what is the underlying 
hardware?


Are you reconnecting on each transaction?

I had been using 'excluding' because it 'seemed' perhaps right in the 
past.  It was only when I got crazy numbers I looked at the calculation 
to find:


   tps_exclude = total->cnt / (time_include - 
(INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients));


The 'cnt' is the total across the entire run including the period when 
connections are ramping up.


Yep. The threads are running independently, so there is no p

I don't see how dividing by the total time minus the average connection 
time produces the correct result.


The above formula looks okay to me, at least at 7AM:-) Maybe the variable 
could be given better names.



Even without buggy slow connections, when connecting 1000 clients,


That is a lot. Really.

I've wondered why the 'excluding' number seemed a bit higher than any 
given reporting interval numbers, over a 5 minute run.  I now understand 
why.  NOTE: When the system hits 100% cpu utilization(after about the 
first 100 connections),


Obviously.

on a fully cached Select only pgbench, further connections can struggle 
to get connected which really skews the results.


Sure, with 1000 clients the system can only by highly overloaded.

How about a patch which offered the option to wait on an advisory lock 
as a mechanism to let the main thread delay the start of the workload 
after all clients have connected and entered a READY state?  This would 
produce a much cleaner number.


A barrier could be implemented, but it should be pretty useless because 
without reconnections the connection time is expected to be negligible.


--
Fabien




Re: tableam vs. TOAST

2019-06-24 Thread Prabhat Sahu
On Tue, Jun 11, 2019 at 9:47 PM Robert Haas  wrote:

> On Tue, May 21, 2019 at 2:10 PM Robert Haas  wrote:
> > Updated and rebased patches attached.
>
> And again.
>

Hi Robert,

I have tested the TOAST patches (v3) with different storage options
(MAIN, EXTERNAL, EXTENDED, etc.) and
combinations of compression and out-of-line storage options.
I have used a few dummy tables with various tuple counts, say 10k, 20k, 40k,
etc., with different column lengths.
Used manual CHECKPOINT option with (checkpoint_timeout = 1d, max_wal_size =
10GB) before the test to avoid performance fluctuations,
and calculated the results as a median value of a few consecutive test
executions.

Please find the SQL script attached herewith, which I have used to perform
the observation.

Below are the test scenarios, how I have checked the behavior and
performance of TOAST patches against PG master.
1. where a single column is compressed(SCC)
2. where multiple columns are compressed(MCC)
-- ALTER the table column/s for storage as "MAIN" to make sure that
the column values are COMPRESSED.

3. where a single column is pushed to the TOAST table but not
compressed(SCTNC)
4. where multiple columns are pushed to the TOAST table but not
compressed(MCTNC)
-- ALTER the table column/s for storage as "EXTERNAL" to make sure
that the column values are pushed to the TOAST table but not COMPRESSED.

5. where a single column is pushed to the TOAST table and also
compressed(SCTC)
6. where multiple columns are pushed to the TOAST table and also
compressed(MCTC)
-- ALTER the table column/s for storage as "EXTENDED" to make sure
that the column values are pushed to the TOAST table and also COMPRESSED.

7. updating the tuples with similar data shouldn't affect the behavior of
storage options.

Please find my observation as below:
System Used: (VCPUs: 8, RAM: 16GB, Size: 640GB)
Test              Tuples   Without Patch                With Patch
1. SCC INSERT     10K      125921.737 ms (02:05.922)    125992.563 ms (02:05.993)
                  20K      234263.295 ms (03:54.263)    235952.336 ms (03:55.952)
                  40K      497290.442 ms (08:17.290)    502820.139 ms (08:22.820)
                  80K      948470.603 ms (15:48.471)    941778.952 ms (15:41.779)
1. SCC UPDATE     10K      263017.814 ms (04:23.018)    270893.910 ms (04:30.894)
                  20K      488393.748 ms (08:08.394)    507937.377 ms (08:27.937)
                  40K      1078862.613 ms (17:58.863)   1053029.428 ms (17:33.029)
                  80K      2037119.576 ms (33:57.120)   2023633.862 ms (33:43.634)
2. MCC INSERT     10K      35415.089 ms (00:35.415)     35910.552 ms (00:35.911)
                  20K      70899.737 ms (01:10.900)     70800.964 ms (01:10.801)
                  40K      142185.996 ms (02:22.186)    142241.913 ms (02:22.242)
2. MCC UPDATE     10K      72043.757 ms (01:12.044)     73848.732 ms (01:13.849)
                  20K      137717.696 ms (02:17.718)    137577.606 ms (02:17.578)
                  40K      276358.752 ms (04:36.359)    276520.727 ms (04:36.521)
3. SCTNC INSERT   10K      26377.274 ms (00:26.377)     25600.189 ms (00:25.600)
                  20K      45702.630 ms (00:45.703)     45163.510 ms (00:45.164)
                  40K      99903.299 ms (01:39.903)     100013.004 ms (01:40.013)
3. SCTNC UPDATE   10K      78385.225 ms (01:18.385)     76680.325 ms (01:16.680)
                  20K      151823.250 ms (02:31.823)    153503.971 ms (02:33.504)
                  40K      308197.734 ms (05:08.198)    308474.937 ms (05:08.475)
4. MCTNC INSERT   10K      26214.069 ms (00:26.214)     25383.522 ms (00:25.384)
                  20K      50826.522 ms (00:50.827)     50221.669 ms (00:50.222)
                  40K      106034.338 ms (01:46.034)    106122.827 ms (01:46.123)
4. MCTNC UPDATE   10K      78423.817 ms (01:18.424)     75154.593 ms (01:15.155)
                  20K      158885.787 ms (02:38.886)    156530.964 ms (02:36.531)
                  40K      319721.266 ms (05:19.721)    322385.709 ms (05:22.386)
5. SCTC INSERT    10K      38451.022 ms (00:38.451)     38652.520 ms (00:38.653)
                  20K      71590.748 ms (01:11.591)     71048.975 ms (01:11.049)
                  40K      143327.913 ms (02:23.328)    142593.207 ms (02:22.593)
5. SCTC UPDATE    10K      82069.311 ms (01:22.069)     81678.131 ms (01:21.678)
                  20K      138763.508 ms (02:18.764)    138625.473 ms (02:18.625)
                  40K      277534.080 ms (04:37.534)    277091.611 ms (04:37.092)
6. MCTC INSERT    10K      36325.730 ms (00:36.326)     35803.368 ms (00:35.803)
                  20K      73285.204 ms (01:13.285)     72728.371 ms (01:12.728)
                  40K      142324.859 ms (02:22.325)    144368.335 ms (02:24.368)
6. MCTC UPDATE    10K      73740.729 ms (01:13.741)     73002.511 ms (01:13.003)
                  20K      141309.859 ms (02:21.310)    139676.173 ms (02:19.676)
                  40K      278906.647 ms (04:38.907)    279522.408 ms (04:39.522)

All the observations look good to me,
except for "Test 1": for SCC UPDATE with tuple counts (10K/20K) and for SCC
INSERT with tuple count (40K) there was a slight increase in time taken
in the "with patch" result. For a better observation, I also ran
the same "Test 1" for a higher tuple count (i.e. 80K), and it also looks fine.

I have also performed the below tests with TOAST table objects.
8. pg_dump/restore, pg_upgrade with these
9. Streaming Replication setup
10. Concurrent Transactions

While testing a few concurrent transactions I have the below query:
-- Concurrent transactions acquire a lock for the TOAST option (ALTER TABLE ..
SET STORAGE .. MAIN/EXTERNAL/EXTENDED etc.)

-- Session 1:
CREATE TABLE a (a_id text PRIMARY KEY);
CREATE TABLE b (b_id text);
INSERT INTO a VALUES ('a'), ('b');
INSERT INTO b VALUES ('a'), ('b'),