date:20201201


On 01.12.2020 03:08, James Coleman wrote:

On Tue, Nov 3, 2020 at 4:39 PM Tomas Vondra
 wrote:

I've pushed the 0001 part, i.e. the main fix. Not sure about the other
parts (comments and moving the code back to postgres_fdw) yet.

I noticed the CF entry [1] got moved to the next CF; I'm thinking this
entry should be marked as committed since the fix for the initial bug
reported on this thread has been pushed. We have the parallel safety
issue outstanding, but there's a separate thread and patch for that,
so I'll make a new CF entry for that.

I can mark it as committed, but I'm not sure how to "undo" (or if
that's desirable) the move to the next CF.

Thoughts?

James

1: https://commitfest.postgresql.org/30/2754/



Oops...
I must have rushed with this one, thank you for noticing.
I don't see how to move it back either. I think it's fine to mark it as 
Committed where it is now.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2

2020-12-01 Thread Daniel Gustafsson

> On 1 Dec 2020, at 06:38, Michael Paquier  wrote:
> 
> On Mon, Nov 30, 2020 at 02:29:29PM +0100, Daniel Gustafsson wrote:
>> Yeah, that's along the lines of what I was thinking of.
> 
> Hmm.  I have looked at that, and thought first about having directly a
> reference to the resowner directly in pg_cryptohash_ctx, but that's
> not a good plan for two reasons:
> - common/cryptohash.h would get knowledge of that, requiring bundling
> in a bunch of dependencies.
> - There is no need for that in the non-OpenSSL case.
> So, instead, I have been thinking about using an extra context layer
> only for cryptohash_openssl.c with a structure saved as
> pg_cryptohash_context->data that stores the information about 
> EVP_MD_CTX* and the resource owner.  Then, I was thinking about
> storing directly pg_cryptohash_ctx in the resowner EVP array and just
> call pg_cryptohash_free() from resowner.c without the need of an
> extra routine.  I have not tested this idea but that should work.
> What's your take?

That sounds like it would work.  Since the cryptohash implementation owns and
controls the void *data contents it can store whatever it needs to properly
free it.

> In parallel, I have spent more time today polishing and reviewing 0001
> (indented, adjusted a couple of areas and added also brackets and
> extra comments as you suggested) and tested it on Linux and Windows,
> with and without OpenSSL down to 1.0.1, the oldest version supported
> on HEAD.  So I'd like to apply the attached first and sort out the
> resowner stuff in a next step.

+1 on separating the API, EVP migration, resowner patches.

Reading through the v6 patch I see nothing sticking out and all review comments
addressed, +1 on applying that one and then we'll take if from there with the
remaining ones in the patchset.

cheers ./daniel

Re: [HACKERS] Custom compression methods

2020-12-01 Thread Dilip Kumar

On Wed, Nov 25, 2020 at 12:50 AM Alvaro Herrera  wrote:
>
> On 2020-Nov-24, Tom Lane wrote:
>
> > Robert Haas  writes:
>
> > > Oh, I thought it had been suggested in previous discussions that these
> > > should be treated as access methods rather than inventing a whole new
> > > concept just for this, and it seemed like a good idea to me. I guess I
> > > missed the fact that the patch wasn't doing it that way. Hmm.
> >
> > FWIW, I kind of agree with Robert's take on this.  Heap and index AMs
> > are pretty fundamentally different animals, yet we don't have a problem
> > sticking them in the same catalog.  I think anything that is related to
> > storage access could reasonably go into that catalog, rather than
> > inventing a new one.
>
> Right -- Something like amname=lz4, amhandler=lz4handler, amtype=c.
> The core code must of course know how to instantiate an AM of type
> 'c' and what to use it for.
>
> https://postgr.es/m/20171213151818.75a20...@postgrespro.ru

I have changed this, I agree that using the access method for creating
compression has simplified the code.  I will share the updated patch
set after fixing other review comments by Robert.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Improving spin-lock implementation on ARM.

On Tue, Dec 1, 2020 at 6:26 AM Krunal Bauskar  wrote:
> On Tue, 1 Dec 2020 at 02:31, Alexander Korotkov  wrote:
>> BTW, how do you get that required gcc version is 9.4?  I've managed to
>> use LSE with gcc 9.3.
>
> Did they backported it to 9.3?
> I am just looking at the gcc guide.
> https://gcc.gnu.org/gcc-9/changes.html
>
> GCC 9.4
>
> Target Specific Changes
>
> AArch64
>
> The option -moutline-atomics has been added to aid deployment of the Large 
> System Extensions (LSE)

No, you've misread this release notes item.  This item relates to a
new option for LSE support.  See the rest of the description.

"When the option is specified code is emitted to detect the presence
of LSE instructions at runtime and use them for standard atomic
operations. For more information please refer to the documentation."

LSE support itself was introduced in gcc 6.
https://gcc.gnu.org/gcc-6/changes.html
 * The ARMv8.1-A architecture and the Large System Extensions are now
supported. They can be used by specifying the -march=armv8.1-a option.
Additionally, the +lse option extension can be used in a similar
fashion to other option extensions. The Large System Extensions
introduce new instructions that are used in the implementation of
atomic operations.

But, -moutline-atomics looks very interesting itself.  I think we
should add this option to our arm64 packages if we haven't already.

--
Regards,
Alexander Korotkov

Re: Improving spin-lock implementation on ARM.

On Tue, Dec 1, 2020 at 9:01 AM Tom Lane  wrote:
> I did what I could in this department.  It's late and I'm not going to
> have time to run read/write benchmarks before bed, but here are some
> results for the "pgbench -S" cases.  I tried to match your testing
> choices, but could not entirely:
>
> * Configure options are --enable-debug, --disable-cassert, no other
> special configure options or CFLAG choices.
>
> * I have not been able to find a way to make Apple's compiler not
> use the LSE spinlock instructions, so all of these correspond to
> your LSE cases.
>
> * I used shared_buffers = 1GB, because this machine only has 16GB
> RAM so 32GB is clearly out of reach.  Also I used pgbench scale
> factor 100 not 1000.  Since we're trying to measure contention
> effects not I/O speed, I don't think a huge test case is appropriate.
>
> * I still haven't gotten pgbench to work with -j settings above 128,
> so these runs use -j equal to half -c.  Shouldn't really affect
> conclusions about scaling.  (BTW, I see a similar limitation on
> macOS Catalina x86_64, so whatever that is, it's not new.)
>
> * Otherwise, the first plot shows median of three results from
> "pgbench -S -M prepared -T 120 -c $n -j $j", as you had it.
> The right-hand plot shows all three of the values in yerrorbars
> format, just to give a sense of the noise level.
>
> Clearly, there is something weird going on at -c 4.  There's a cluster
> of results around 180K TPS, and another cluster around 210-220K TPS,
> and nothing in between.  I suspect that the scheduler is doing
> something bogus with sometimes putting pgbench onto the slow cores.
> Anyway, I believe that the apparent gap between HEAD and the other
> curves at -c 4 is probably an artifact: HEAD had two 180K-ish results
> and one 220K-ish result, while the other curves had the reverse, so
> the medians are different but there's probably not any non-chance
> effect there.
>
> Bottom line is that these patches don't appear to do much of
> anything on M1, as you surmised.

Great, thank you for your research!

--
Regards,
Alexander Korotkov

Re: Deleting older versions in unique indexes to avoid page splits

2020-12-01 Thread Heikki Linnakangas

This is a wholly new concept with a lot of heuristics. It's a lot of 
swallow. But here are a few quick comments after a first read-through:


On 30/11/2020 21:50, Peter Geoghegan wrote:

+/*
+ * State used when calling table_index_delete_check() to perform "bottom up"
+ * deletion of duplicate index tuples.  State is intialized by index AM
+ * caller, while state is finalized by tableam, which modifies state.
+ */
+typedef struct TM_IndexDelete
+{
+   ItemPointerData tid;/* table TID from index tuple */
+   int16   id; /* Offset into 
TM_IndexStatus array */
+} TM_IndexDelete;
+
+typedef struct TM_IndexStatus
+{
+   OffsetNumber idxoffnum; /* Index am page offset number */
+   int16   tupsize;/* Space freed in index if 
tuple deleted */
+   boolispromising;/* Duplicate in index? */
+   booldeleteitup; /* Known dead-to-all? */
+} TM_IndexStatus;
...
+ * The two arrays are conceptually one single array.  Two arrays/structs are
+ * used for performance reasons.  (We really need to keep the TM_IndexDelete
+ * struct small so that the tableam can do an initial sort by TID as quickly
+ * as possible.)


Is it really significantly faster to have two arrays? If you had just 
one array, each element would be only 12 bytes long. That's not much 
smaller than TM_IndexDeletes, which is 8 bytes.



+   /* First sort caller's array by TID */
+   heap_tid_shellsort(delstate->deltids, delstate->ndeltids);
+
+   /* alltids caller visits all blocks, so make sure that happens */
+   if (delstate->alltids)
+   return delstate->ndeltids;
+
+   /* Calculate per-heap-block count of TIDs */
+   blockcounts = palloc(sizeof(IndexDeleteCounts) * delstate->ndeltids);
+   for (int i = 0; i < delstate->ndeltids; i++)
+   {
+   ItemPointer deltid = &delstate->deltids[i].tid;
+   TM_IndexStatus *dstatus = delstate->status + 
delstate->deltids[i].id;
+   boolispromising = dstatus->ispromising;
+
+   if (curblock != ItemPointerGetBlockNumber(deltid))
+   {
+   /* New block group */
+   nblockgroups++;
+
+   Assert(curblock < ItemPointerGetBlockNumber(deltid) ||
+  !BlockNumberIsValid(curblock));
+
+   curblock = ItemPointerGetBlockNumber(deltid);
+   blockcounts[nblockgroups - 1].ideltids = i;
+   blockcounts[nblockgroups - 1].ntids = 1;
+   blockcounts[nblockgroups - 1].npromisingtids = 0;
+   }
+   else
+   {
+   blockcounts[nblockgroups - 1].ntids++;
+   }
+
+   if (ispromising)
+   blockcounts[nblockgroups - 1].npromisingtids++;
+   }


Instead of sorting the array by TID, wouldn't it be enough to sort by 
just the block numbers?



 * While in general the presence of promising tuples (the hint that 
index
+* AMs provide) is the best information that we have to go on, it is 
based
+* on simple heuristics about duplicates in indexes that are understood 
to
+* have specific flaws.  We should not let the most promising heap pages
+* win or lose on the basis of _relatively_ small differences in the 
total
+* number of promising tuples.  Small differences between the most
+* promising few heap pages are effectively ignored by applying this
+* power-of-two bucketing scheme.
+*


What are the "specific flaws"?

I understand that this is all based on rough heuristics, but is there 
any advantage to rounding/bucketing the numbers like this? Even if small 
differences in the total number of promising tuple is essentially noise 
that can be safely ignored, is there any harm in letting those small 
differences guide the choice? Wouldn't it be the essentially the same as 
picking at random, or are those small differences somehow biased?


- Heikki

Re: Improving spin-lock implementation on ARM.

On Tue, 1 Dec 2020 at 15:16, Alexander Korotkov 
wrote:

> On Tue, Dec 1, 2020 at 6:26 AM Krunal Bauskar 
> wrote:
> > On Tue, 1 Dec 2020 at 02:31, Alexander Korotkov 
> wrote:
> >> BTW, how do you get that required gcc version is 9.4?  I've managed to
> >> use LSE with gcc 9.3.
> >
> > Did they backported it to 9.3?
> > I am just looking at the gcc guide.
> > https://gcc.gnu.org/gcc-9/changes.html
> >
> > GCC 9.4
> >
> > Target Specific Changes
> >
> > AArch64
> >
> > The option -moutline-atomics has been added to aid deployment of the
> Large System Extensions (LSE)
>
> No, you've misread this release notes item.  This item relates to a
> new option for LSE support.  See the rest of the description.
>

Some confusion here.

What I meant was outline-atomics support was added in GCC-9.4 and was made
default in gcc-10.
LSE support is present for quite some time.

In one of my up-thread reply  I mentioned about the outline atomic that
would help us enable lse if underline architecture support it.


2. As you pointed out LSE is enabled starting only with arm-v8.1 but not
all aarch64 tag machines are arm-v8.1 compatible.
This means we would need a separate package or a more optimal way would
be to compile pgsql with gcc-9.4 (or gcc-10.x (default)) with
-moutline-atomics that would emit both traditional and lse code and
flow would dynamically select depending on the target machine
(I have blogged about it in MySQL context
https://mysqlonarm.github.io/ARM-LSE-and-MySQL/)


Problem with this approach is it would need gcc-9.4+ (I also ready
gcc-9.3.1 would do) or gcc10.x but most of the distro doesn't support it so
user using old compiler especially
default one shipped with OS would not be able to take advantage of this.


>
> "When the option is specified code is emitted to detect the presence
> of LSE instructions at runtime and use them for standard atomic
> operations. For more information please refer to the documentation."
>
> LSE support itself was introduced in gcc 6.
> https://gcc.gnu.org/gcc-6/changes.html
>  * The ARMv8.1-A architecture and the Large System Extensions are now
> supported. They can be used by specifying the -march=armv8.1-a option.
> Additionally, the +lse option extension can be used in a similar
> fashion to other option extensions. The Large System Extensions
> introduce new instructions that are used in the implementation of
> atomic operations.
>
> But, -moutline-atomics looks very interesting itself.  I think we
> should add this option to our arm64 packages if we haven't already.
>
> --
> Regards,
> Alexander Korotkov
>


-- 
Regards,
Krunal Bauskar

Commitfest 2020-11 is closed


Hi, hackers!

Commitfest 2020-11 is officially closed now.
Many thanks to everyone who participated by posting patches, reviewing 
them, committing and sharing ideas in discussions!


Today, me and Georgios will move the remaining items to the next CF or 
return them with feedback. We're planning to leave Ready For Committer 
till the end of the week, to make them more visible and let them get the 
attention they deserve. And finally in the weekend I'll gather and share 
some statistics.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Improving spin-lock implementation on ARM.

2020-12-01 Thread Amit Khandekar

On Tue, 1 Dec 2020 at 15:33, Krunal Bauskar  wrote:
> What I meant was outline-atomics support was added in GCC-9.4 and was made 
> default in gcc-10.
> LSE support is present for quite some time.

FWIW, here is an earlier discussion on the same (also added the
proposal author here) :

https://www.postgresql.org/message-id/flat/099F69EE-51D3-4214-934A-1F28C0A1A7A7%40amazon.com

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-12-01 Thread Amit Kapila

On Tue, Dec 1, 2020 at 11:38 AM Dilip Kumar  wrote:
>
> On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila  wrote:
> >
> >
> > What is going on here is that the expected streaming file is missing.
> > Normally, the first time we send a stream of changes (some percentage
> > of transaction changes) we create the streaming file, and then in
> > respective streams we just keep on writing in that file the changes we
> > receive from the publisher, and on commit, we read that file and apply
> > all the changes.
> >
> > The above kind of error can happen due to the following reasons: (a)
> > the first time we sent the stream and created the file and that got
> > removed before the second stream reached the subscriber. (b) from the
> > publisher-side, we never sent the indication that it is the first
> > stream and the subscriber directly tries to open the file thinking it
> > is already there.
> >
> > Now, the publisher and subscriber log doesn't directly indicate any of
> > the above problems but I have some observations.
> >
> > The subscriber log indicates that before the apply worker exits due to
> > an error the new apply worker gets started. We delete the
> > streaming-related temporary files on proc_exit, so one possibility
> > could have been that the new apply worker has created the streaming
> > file which the old apply worker has removed but that is not possible
> > because we always create these temp-files by having procid in the
> > path.
>
> Yeah, and I have tried to test on this line, basically, after the
> streaming has started I have set the binary=on.  Now using gdb I have
> made the worker wait before it deletes the temp file and meanwhile the
> new worker started and it worked properly as expected.
>
> > The other thing I observed in the code is that we can mark the
> > transaction as streamed (via ReorderBufferTruncateTxn) if we try to
> > stream a transaction that has no changes the first time we try to
> > stream the transaction. This would lead to symptom (b) because the
> > second-time when there are more changes we would stream the changes as
> > it is not the first time. However, this shouldn't happen because we
> > never pick-up a transaction to stream which has no changes. I can try
> > to fix the code here such that we don't mark the transaction as
> > streamed unless we have streamed at least one change but I don't see
> > how it is related to this particular test failure.
>
> Yeah, this can be improved but as you mentioned that we never select
> an empty transaction for streaming so this case should not occur.  I
> will perform some testing/review around this and report.
>

On further thinking about this point, I think the message seen on
subscriber [1] won't occur if missed the first stream. This is because
we always check the value of fileset from the stream hash table
(xidhash) and it won't be there if we directly send the second stream
and that would have lead to a different kind of problem (probably
crash). This symptom seems to be due to the reason (a) mentioned above
unless we are missing something else. Now, I am not sure how the file
can be removed without the corresponding entry in hash table (xidhash)
is still present. The only reasons that come to mind are that some
other process cleaned pgsql_tmp directory thinking these temporary
file are not required or one manually removes it, none of those seems
plausible reasons.

[1] - ERROR: could not open temporary file "16393-510.changes.0" from
BufFile "16393-510.changes": No such file or directory

-- 
With Regards,
Amit Kapila.

Avoid using lcons and list_delete_first in plan_union_children()

2020-12-01 Thread Hou, Zhijie

Hi,

In function plan_union_children(),  
I found the lcons and list_delete_first here is easy to be replaced by lappend 
and list_delete_last.

And I also found a previous commit do similar thing, so I try to improve this 
one.

Previous commit:
d97b714a219959a50f9e7b37ded674f5132f93f3

Best regards,
houzj




0001-Avoid-using-lcons-and-list_delete_first.patch
Description: 0001-Avoid-using-lcons-and-list_delete_first.patch

Re: [HACKERS] Custom compression methods

2020-12-01 Thread Dilip Kumar

On Sat, Nov 21, 2020 at 3:50 AM Robert Haas  wrote:

While working on this comment I have doubts.

> I wonder in passing about TOAST tables and materialized views, which
> are the other things that have storage. What gets stored for
> attcompression? For a TOAST table it probably doesn't matter much
> since TOAST table entries shouldn't ever be toasted themselves, so
> anything that doesn't crash is fine (but maybe we should test that
> trying to alter the compression properties of a TOAST table doesn't
> crash, for example).

Yeah for the toast table it doesn't matter,  but I am not sure what do
you mean by altering the compression method for the toast table. Do you
mean manually update the pg_attribute tuple for the toast table and
set different compression methods?  Or there is some direct way to
alter the toast table?

 For a materialized view it seems reasonable to
> want to set column properties, but I'm not quite sure how that works
> today for things like STORAGE anyway. If we do allow setting STORAGE
> or COMPRESSION for materialized view columns then dump-and-reload
> needs to preserve the values.

I see that we allow setting the STORAGE for the materialized view but
I am not sure what is the use case.  Basically, the tuples are
directly getting selected from the host table and inserted in the
materialized view without checking target and source storage type.
The behavior is the same if you execute INSERT INTO dest_table SELECT
* FROM source_table.  Basically, if the source_table attribute has
extended storage and the target table has plain storage, still the
value will be inserted directly into the target table without any
conversion.  However, in the table, you can insert the new tuple and
that will be stored as per the new storage method so that is still
fine but I don't know any use case for the materialized view.  Now I am
thinking what should be the behavior for the materialized view?

For the materialized view can we have the same behavior as storage?  I
think for the built-in compression method that might not be a problem
but for the external compression method how can we handle the
dependency, I mean when the materialized view has created the table
was having an external compression method "cm1" and we have created
the materialized view based on that now if we alter table and set the
new compression method and force table rewrite then what will happen
to the tuple inside the materialized view, I mean tuple is still
compressed with "cm1" and there is no attribute is maintaining the
dependency on "cm1" because the materialized view can point to any
compression method.  Now if we drop the cm1 it will be allowed to
drop.  So I think for the compression method we can consider the
materialized view same as the table, I mean we can allow setting the
compression method for the materialized view and we can always ensure
that all the tuple in this view is compressed with the current or the
preserved compression methods.  So whenever we are inserting in the
materialized view then we should compare the datum compression method
with the target compression method.

> +   /*
> +* Use default compression method if the existing compression method 
> is
> +* invalid but the new storage type is non plain storage.
> +*/
> +   if (!OidIsValid(attrtuple->attcompression) &&
> +   (newstorage != TYPSTORAGE_PLAIN))
> +   attrtuple->attcompression = DefaultCompressionOid;
>
> You have a few too many parens in there.
>
> I don't see a particularly good reason to treat plain and external
> differently.

Yeah, I think they should be treated the same.

 More generally, I think there's a question here about
> when we need an attribute to have a valid compression type and when we
> don't. If typstorage is plan or external, then there's no point in
> ever having a compression type and maybe we should even reject
> attempts to set one (but I'm not sure).

I agree.

> However, the attstorage is a
> different case. Suppose the column is created with extended storage
> and then later it's changed to plain. That's only a hint, so there may
> still be toasted values in that column, so the compression setting
> must endure. At any rate, we need to make sure we have clear and
> sensible rules for when attcompression (a) must be valid, (b) may be
> valid, and (c) must be invalid. And those rules need to at least be
> documented in the comments, and maybe in the SGML docs.

IIUC, even if we change the attstorage the existing tuples are stored
as it is without changing the tuple storage.  So I think even if the
attstorage is changed the attcompression should not have any change.

After observing this behavior of storage I tend to think that for
built-in compression methods also we should have the same behavior,  I mean
if the tuple is compressed with one of the built-in compression
methods and if we are altering the compression method or we are doing
INSERT INTO SELECT to th

Re: Two fsync related performance issues?

2020-12-01 Thread Craig Ringer

On Wed, 14 Oct 2020, 13:06 Michael Paquier,  wrote:

> On Wed, Oct 14, 2020 at 02:48:18PM +1300, Thomas Munro wrote:
> > On Wed, Oct 14, 2020 at 12:53 AM Michael Banck
> >  wrote:
> >> One question about this: Did you consider the case of a basebackup being
> >> copied/restored somewhere and the restore/PITR being started? Shouldn't
> >> Postgres then sync the whole data directory first in order to assure
> >> durability, or do we consider this to be on the tool that does the
> >> copying? Or is this not needed somehow?
> >
> > To go with precise fsyncs, we'd have to say that it's the job of the
> > creator of the secondary copy.  Unfortunately that's not a terribly
> > convenient thing to do (or at least the details vary).
>
> Yeah, it is safer to assume that it is the responsability of the
> backup tool to ensure that because it could be possible that a host is
> unplugged just after taking a backup, and having Postgres do this work
> at the beginning of archive recovery would not help in most cases.
>

Let's document that assumption in the docs for pg_basebackup and the file
system copy based replica creation docs. With a reference to initdb's
datadir sync option.

IMO this comes back to the point where we usually should not care much
> how long a backup  takes as long as it is done right.  Users care much
> more about how long a restore takes until consistency is reached.  And
> this is in line with things that have been done via bc34223b or
> 96a7128.
> --
> Michael
>

Consider Parallelism While Planning For REFRESH MATERIALIZED VIEW

2020-12-01 Thread Bharath Rupireddy

Hi,

I think we can pass CURSOR_OPT_PARALLEL_OK to pg_plan_query() for
refresh mat view so that parallelism can be considered for the SELECT
part of the previously created mat view. The refresh mat view queries
can be faster in cases where SELECT is parallelized.

Attaching a small patch. Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v1-0001-Allow-parallel-mode-in-REFRESH-MAT-VIEW-planning.patch
Description: Binary data

Re: Improving spin-lock implementation on ARM.

On Tue, 1 Dec 2020 at 02:16, Alexander Korotkov 
wrote:

> On Mon, Nov 30, 2020 at 9:21 PM Tom Lane  wrote:
> > Alexander Korotkov  writes:
> > > I tend to think that LSE is enabled by default in Apple's clang based
> > > on your previous message[1].  In order to dispel the doubts could you
> > > please provide assembly of SpinLockAcquire for following clang
> > > options.
> > > "-O2"
> > > "-O2 -march=armv8-a+lse"
> > > "-O2 -march=armv8-a"
> >
> > Huh.  Those options make exactly zero difference to the code generated
> > for SpinLockAcquire/SpinLockRelease; it's the same as I showed upthread,
> > for either the HEAD definition of TAS() or the CAS patch's version.
> >
> > So now I'm at a loss as to the reason for the performance difference
> > I got.  -march=armv8-a+lse does make a difference to code generation
> > someplace, because the overall size of the postgres executable changes
> > by 16kB or so.  One might argue that the performance difference is due
> > to better code elsewhere than the spinlocks ... but the test I'm running
> > is basically just
> >
> > while (count-- > 0)
> > {
> > XLogGetLastRemovedSegno();
> >
> > CHECK_FOR_INTERRUPTS();
> > }
> >
> > so it's hard to see where a non-spinlock-related code change would come
> > in.  That loop itself definitely generates the same code either way.
> >
> > I did find this interesting output from "clang -v":
> >
> > -target-cpu vortex -target-feature +v8.3a -target-feature +fp-armv8
> -target-feature +neon -target-feature +crc -target-feature +crypto
> -target-feature +fullfp16 -target-feature +ras -target-feature +lse
> -target-feature +rdm -target-feature +rcpc -target-feature +zcm
> -target-feature +zcz -target-feature +sha2 -target-feature +aes
> >
> > whereas adding -march=armv8-a+lse changes that to just
> >
> > -target-cpu vortex -target-feature +neon -target-feature +lse
> -target-feature +zcm -target-feature +zcz
> >
> > On the whole, that would make one think that -march=armv8-a+lse
> > should produce worse code than the default settings.
>
> Great, thanks.
>
> So, I think the following hypothesis isn't disproved yet.
> 1) On ARM with LSE support, PostgreSQL built with LSE is faster than
> PostgreSQL built without LSE.  Even if the latter is patched with
> anything considered in this thread.
> 2) None of the patches considered in this thread give a clear
> advantage for PostgreSQL built with LSE.
>
> To further confirm this let's wait for Kunpeng 920 tests by Krunal
> Bauskar and Amit Khandekar.  Also it would be nice if someone will run
> benchmarks similar to [1] on Apple M1.
>

--

I have completed benchmarking with lse.

Graph attached.

* For me, lse failed to perform. With lse enabled performance with higher
scalability was infact less than baseline (with TAS).
  [This is true with and without patch (cas)]. So lse is unable to push
things.

* Graph traces all 4 lines (baseline (tas), baseline-patched (cas),
baseline-lse (tas+lse), baseline-patched-lse (cas+lse))

- for select, there is no clear winner but baseline-patched-lse (cas+lse)
perform better than others (2-4%).

- for update/tpcb like baseline-patched (cas) continue to outperform all
the 3 options (10-40%).
  In fact, with lse enabled a regression (compared to baseline/head) is
observed.

[I am not surprised since MySQL showed the same behavior with lse of-course
with a lesser percentage].

I have repeated the test 2 times to confirm the behavior.
Also, to reduce noise I normally run 4 rounds discarding 1st round and
taking the median of the last 3 rounds.

In all, I see consistent numbers.
conf:
https://github.com/mysqlonarm/benchmark-suites/blob/master/pgsql-pbench/conf/pgsql.cnf/postgresql.conf

---

>From all the observations till now following points are clear:

* Patch doesn't seem to make a significant difference with lse enabled.
  Ir-respective of the machine (Kunpeng, Graviton2, M1). [infact introduces
regression with Kunpeng].

* Patch helps generate significant +ve difference in performace with lse
disabled.
  Observed and Confirmed with both Kunpeng, Graviton2.
  (not possible to disable lse on m1).

* Does the patch makes things worse with lse enabled.
  * Graviton2: No (based on Alexander graph)
  * M1: No (based on Tom's graph. Wondering why was debug build used
(--enable-debug?))
  * Kunpeng: regression observed.

--

Should we enable lse?

Based on the comment from [1] I see pgsql aligns with decision made by
compiler/distribution vendors. If the compiler/distribution vendors
has not made something default then there could be reason for it.

-

Without lse as we have seen patch continue to score.

Effect of

Re: Additional improvements to extended statistics

2020-12-01 Thread Tomas Vondra

On 12/1/20 9:15 AM, Dean Rasheed wrote:
> On Sun, 29 Nov 2020 at 21:02, Tomas Vondra
>  wrote:
>>
>> Those are fairly minor issues. I don't have any deeper objections, and
>> it seems committable. Do you plan to do that sometime soon?
>>
> 
> OK, I've updated the patch status in the CF app, and I should be able
> to push it in the next day or so.
> 

Cool, thanks.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: runtime error copying oids field

On 2020-Nov-30, Zhihong Yu wrote:

> Alvaro, et al:
> Please let me know how to proceed with the patch.
> 
> Running test suite with the patch showed no regression.

That's good to hear.  I'll get it pushed today.  Thanks for following
up.

Re: Display individual query in pg_stat_activity

2020-12-01 Thread Georgios Kokolatos

This patch fails in the cfbot for quite some time now.
I shall close it as Return With Feedback and not move it to the next CF.
Please feel free to register an updated version afresh for the next CF.

Cheers,
//Georgios

Re: Notes on physical replica failover with logical publisher or subscriber

2020-12-01 Thread Amit Kapila

On Mon, Nov 30, 2020 at 9:30 AM Craig Ringer
 wrote:
>
> Hi all
>
> I recently wrote some notes on interaction between physical
> replication failover/promotion and logical replication publisher
> and/or standby.
>
> As you probably all know, right now we don't support physical failover
> for logical replication publishers at all, either for in-core logical
> replication or for 3rd party solutions. And while we support physical
> failover and promotion of subscribers, there turn out to be some
> issues with that too.
>
> I'm not trying to solve those issues right now.  But since there are
> various out-of-tree replication tools trying to work around these
> limitations, I wanted to share my knowledge of the various hazards and
> challenges involved in doing so, so I've written a wiki article on it.
>
> https://wiki.postgresql.org/wiki/Logical_replication_and_physical_standby_failover
>
> I tried to address many of these issues with failover slots, but I am
> not trying to beat that dead horse now. I know that at least some
> people here are of the opinion that effort shouldn't go into
> logical/physical replication interoperation anyway - that we should
> instead address the remaining limitations in logical replication so
> that it can provide complete HA capabilities without use of physical
> replication.
>

What is the advantage of using logical replication for HA over
physical replication? It sends more WAL in most cases and probably
requires more set up in terms of the publisher-subscriber. I see the
major use as upgrades can be easier because it can provide
cross-version replication and others as mentioned in docs [1]. I can
also imagine that one might want to use it for scale-out by carefully
having transactions specific to different master nodes or probably by
having a third-party transaction coordinator.

> So for now I'm just trying to save others who go looking
> into these issues some time and warn them about some of the less
> obvious booby-traps.
>
> I do want to add some info to the logical decoding docs around slot
> fast-forward behaviour and how to write clients to avoid missing or
> double-processing transactions.
>

+1.

> I'd welcome advice on the best way to
> do that in a manner that would be accepted by this community.
>

Why not enhance test_decoding apart from mentioning in docs to explain
such concepts? It might help in some code-coverage as well depending
on what we want to cover.


[1] - https://www.postgresql.org/docs/devel/logical-replication.html

-- 
With Regards,
Amit Kapila.

Re: Cost overestimation of foreign JOIN

2020-12-01 Thread Ashutosh Bapat

On Mon, Nov 30, 2020 at 11:56 PM Andrey Lepikhov
 wrote:
>
> On 30.11.2020 22:38, Tom Lane wrote:
> > Andrey Lepikhov  writes:
> >> Maybe it is needed to swap lines 2908 and 2909 (see attachment)?
> >
> > No; as explained in the comment immediately above here, we're assuming
> > that the join conditions will be applied on the cross product of the
> > input relations.
>
> Thank you. Now it is clear to me.
> >
> > Now admittedly, that's a worst-case assumption, since it amounts to
> > expecting that the remote server will do the join in the dumbest
> > possible nested-loop way.  If the remote can use a merge or hash
> > join, for example, the cost is likely to be a lot less.
>
> My goal is scaling Postgres on a set of the same servers with same
> postgres instances. If one server uses for the join a hash-join node, i
> think it is most likely that the other server will also use for this
> join a hash-join node (Maybe you remember, I also use the statistics
> copying technique to provide up-to-date statistics on partitions). Tests
> show good results with such an approach. But maybe this is my special case.
>
> >  But it is
> > not the job of this code path to outguess the remote planner.  It's
> > certainly not appropriate to invent an unprincipled cost estimate
> > as a substitute for trying to guess that.
>
> Agreed.
> >
> > If you're unhappy with the planning results you get for this,
> > why don't you have use_remote_estimate turned on?
>
> I have a mixed load model. Large queries are suitable for additional
> estimate queries. But for many simple SELECT's that touch a small
> portion of the data, the latency has increased significantly. And I
> don't know how to switch the use_remote_estimate setting in such case.

You may disable use_remote_estimates for given table or a server. So
if tables participating in short queries are different from those in
the large queries, you could set use_remote_estimate at table level to
turn it off for the first set. Otherwise, we need a FDW level GUC
which can be turned on/off for a given session or a query.

Generally use_remote_estimate isn't scalable and there have been
discussions about eliminating the need of it. But no concrete proposal
has come yet.

-- 
Best Wishes,
Ashutosh Bapat

Re: Consider parallel for lateral subqueries with limit

2020-12-01 Thread James Coleman

On Mon, Nov 30, 2020 at 7:00 PM James Coleman  wrote:
>
> I've been investigating parallelizing certain correlated subqueries,
> and during that work stumbled across the fact that
> set_rel_consider_parallel disallows parallel query on what seems like
> a fairly simple case.
>
> Consider this query:
>
> select t.unique1
> from tenk1 t
> join lateral (select t.unique1 from tenk1 offset 0) l on true;
>
> Current set_rel_consider_parallel sets consider_parallel=false on the
> subquery rel because it has a limit/offset. That restriction makes a
> lot of sense when we have a subquery whose results conceptually need
> to be "shared" (or at least be the same) across multiple workers
> (indeed the relevant comment in that function notes that cases where
> we could prove a unique ordering would also qualify, but punts on
> implementing that due to complexity). But if the subquery is LATERAL,
> then no such conceptual restriction.
>
> If we change the code slightly to allow considering parallel query
> even in the face of LIMIT/OFFSET for LATERAL subqueries, then our
> query above changes from the following plan:
>
>  Nested Loop
>Output: t.unique1
>->  Gather
>  Output: t.unique1
>  Workers Planned: 2
>  ->  Parallel Index Only Scan using tenk1_unique1 on public.tenk1 t
>Output: t.unique1
>->  Gather
>  Output: NULL::integer
>  Workers Planned: 2
>  ->  Parallel Index Only Scan using tenk1_hundred on public.tenk1
>Output: NULL::integer
>
> to this plan:
>
>  Gather
>Output: t.unique1
>Workers Planned: 2
>->  Nested Loop
>  Output: t.unique1
>  ->  Parallel Index Only Scan using tenk1_unique1 on public.tenk1 t
>Output: t.unique1
>  ->  Index Only Scan using tenk1_hundred on public.tenk1
>Output: NULL::integer
>
> The code change itself is quite simple (1 line). As far as I can tell
> we don't need to expressly check parallel safety of the limit/offset
> expressions; that appears to happen elsewhere (and that makes sense
> since the RTE_RELATION case doesn't check those clauses either).
>
> If I'm missing something about the safety of this (or any other
> issue), I'd appreciate the feedback.

Note that near the end of grouping planner we have a similar check:

if (final_rel->consider_parallel && root->query_level > 1 &&
!limit_needed(parse))

guarding copying the partial paths from the current rel to the final
rel. I haven't managed to come up with a test case that exposes that
though since simple examples like the one above get converted into a
JOIN, so we're not in grouping_planner for a subquery. Making the
subquery above correlated results in us getting to that point, but
isn't currently marked as parallel safe for other reasons (because it
has params), so that's not a useful test. I'm not sure if there are
cases where we can't convert to a join but also don't involve params;
haven't thought about it a lot though.

James

Re: Error on failed COMMIT

2020-12-01 Thread Georgios Kokolatos

Hi,

this patch fails on the cfbot yet it has received an update during the current 
CF.

I will move it to the next CF and mark it there as Waiting on Author.

Cheers,
Georgios

The new status of this patch is: Needs review

Re: Terminate the idle sessions


On 25.11.2020 05:18, Li Japin wrote:




On Nov 24, 2020, at 11:20 PM, David G. Johnston 
mailto:david.g.johns...@gmail.com>> wrote:


On Mon, Nov 23, 2020 at 11:22 PM Li Japin > wrote:



How about use “foreign-data wrapper” replace “postgres_fdw”?


I don't see much value in avoiding mentioning that specific term - my 
proposal turned it into an example instead of being exclusive.



-         This parameter should be set to zero if you use some
connection-pooling software,
-         or pg servers used by postgres_fdw, because connections
might be closed unexpectedly.
+         This parameter should be set to zero if you use
connection-pooling software,
+         or PostgreSQL servers
connected to using foreign-data
+         wrapper, because connections might be closed unexpectedly.
         


Maybe:

+ or your PostgreSQL server receives connection from postgres_fdw or 
similar middleware.

+ Such software is expected to self-manage its connections.


Thank you for your suggestion and patient! Fixed.

```
+        
+         This parameter should be set to zero if you use 
connection-pooling software,
+         or PostgreSQL servers connected 
to using postgres_fdw
+         or similar middleware (such software is expected to 
self-manage its connections),

+         because connections might be closed unexpectedly.
+        
```

--
Best regards
Japin Li



Status update for a commitfest entry.
As far as I see, all recommendations from reviewers were addressed in 
the last version of the patch.


It passes CFbot successfully, so I move it to Ready For Committer.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

RE: Disable WAL logging to speed up data loading

2020-12-01 Thread osumi.takami...@fujitsu.com

Hi

Sorry for being late.
On Tuesday, December 1, 2020 10:42 AM Tsunakawa, Takayuki 
 wrote:
> From: Kyotaro Horiguchi  
> > Yeah, although it's enough only to restrict non-harmful records
> > practically, if we find that only a few kinds of records are needed,
> > maybe it's cleaner to allow only required record type(s).
> >
> > Maybe it's right that if we can filter-out records looking only rmid,
> > since the xlog facility doesn't need to know about record types of a
> > resource manager.  But if we need to finer-grained control on the
> > record types, I'm afraid that that's wrong.  However, if we need only
> > the XLOG_CHECKPOINT_SHUTDOWN record, it might be better to let
> > XLogInsert filter records rather than inserting that filtering code to
> > all the caller sites.
> 
> Agreed.  As the kind of WAL records to be emitted is very limited, I think
> XLogInsert() can filter them where the current patch does.  
Yeah, I can do that.
I'll restrict the types of WAL in a strict manner,
which means filtering out the types of WAL in XLogInsert().
I'll modify this point in the next patch.

> > I don't dislike "none" since it seems to me practically "none".  It
> > seems rather correct if we actually need only the shutdown checkpoint
> > record.
> >
> > "unrecoverable" is apparently misleading. "crash_unsafe" is precise
> > but seems somewhat alien being among "logical", "replica" and
> > "minimal".
> 
> OK, I'm happy with "none" too.  We can describe in the manual that some
> limited amount of WAL is emitted.
Sure, no problem. Let's keep adopting "none", then.

By the way, I've conducted one additional performance evaluation
to see whether reducing WAL could contribute to boost the speed of data loading 
or not.

I prepared a test VM with 24 cores, HDD and 128GB memory
and used a 9.3GB input file which can be generated by pgbench's scale factor 
1000.
The test table is a partitioned table whose number of child tables is 20.
The reason why I choose partitioned table is to remove the concern
that any delay of the loading time comes from lock contention of table.
Therefore, I divided the input file into 20 files and loaded each file
by different 20 sessions respectively toward each correspondent table.
I compared wal_level=minimal and wal_level=none this time
by using the server with the PG configurations below.

-
wal_level = minimal or minimal
archive_mode = off
max_prepared_transactions = 128
max_worker_processes = 128
max_parallel_workers = 128
max_wal_senders = 0
max_wal_size = 100GB
wal_buffers = 1GB
checkpoint_timeout = 1d
shared_buffers = 32GB # 25% of RAM
maintenance_work_mem = 26GB # 20% of RAM
work_mem = 26GB # 20% of RAM
--

I executed each wal_level three times and calculated the average time
and found that disabling WAL logging reduced about 73 % of the minimal's 
loading speed
in this test. This speed-up came from the difference of generated WAL sizes.
In this test, to load the data generated more than 10GB of WALs with 
wal_level=minimal
while wal_level=none emits just 760 bytes of WALs.

I double-checked this fact from pg_wal_lsn_diff() for each wal_level
and had a look at the folder size of pg_wal.
I expect this size for none will become smaller when
I take the modification to filter out the types of WAL which is discussed above.
Also, I monitored numbers of iostat's 'util' and noticed that
util's spike to use I/O reduced from twice to once when I changed the level
from minimal to none, which should be the effect of the patch.

Any comments ?

Best,
Takamichi Osumi

Re: Strange behavior with polygon and NaN


On 25.11.2020 11:14, Kyotaro Horiguchi wrote:

At Wed, 25 Nov 2020 11:39:39 +0900 (JST), Kyotaro Horiguchi 
 wrote in

So that line of thought prompts me to tread *very* carefully when
trying to dodge NaN results.  We need to be certain that we
introduce only logically-defensible special cases.  Something like
float8_coef_mul() seems much more likely to lead us into errors
than away from them.

Agreed on that point.  I'm going to rewirte the patch in that
direction.

Removed the function float8_coef_mul().


I noticed that the check you proposed to add to line_closept_point
doesn't work for the following case:

select line('{1,-1,0}') <-> point(1e300, 'Infinity');

Ax + By + C = 1 * 1e300 + -1 * Inf + 0 = -Inf is not NaN so we go on
the following steps.

derive the perpendicular line: => line(-1, -1, Inf}
derive the cross point   : => point(Inf, Inf)
calculate the distance   : => NaN  (which should be Infinity)

So I left the check whether distance is NaN in this version. In the previous 
version the check is done before directly calculating the distance, but since 
we already have the result of Ax+Bx+C so I decided not to use point_dt() in this
version.

Although I wrote that it should be wrong that applying FPzero() to
coefficients, there are some places already doing that so I followed
those predecessors.


Reverted the change of pg_hypot().


While checking the regression results, I noticed that the follwoing
calculation, which seems wrong.

select line('{3,NaN,5}') = line('{3,NaN,5}');
  ?column?
--
  t

But after looking point_eq(), I decided to let the behavior alone
since I'm not sure the reason for the behavior of the functions. At
least the comment for point_eq() says that is the delibarate
behvior. box_same, poly_same base on the point_eq_point so they behave
the same way.


By the way, '=' doesn't compare the shape but compares the area.
However, what is the area of a line?  That should be always 0 even if
we considered it. And it is also strange that we don't have
corresponding comparison ('<' and so) operators.  It seems to me as if
a mistake of '~='.  If it is correct, I should revert the change of
line_eq() along with fixing operator assignment.

regards.



Status update for a commitfest entry.

The commitfest is closed now and this entry is "Waiting on author".
As far as I see, part of the fixes is already committed. Is there 
anything left to work on or this patch needs review/ ready for committer 
now?


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: error_severity of brin work item

On 2020-Nov-30, Justin Pryzby wrote:

> On Mon, Nov 30, 2020 at 08:47:32PM -0300, Alvaro Herrera wrote:
> > The more I look at this, the less I like it.  This would set a precedent
> > that any action that can be initiated from an autovac work-item has a
> > requirement of silently being discarded when it referenced a
> > non-existant relation.
> 
> My original request was to change to INFO, which is what vacuum proper does at
> vacuum_open_relation().  I realize that still means that new work item 
> handlers
> would have to do the LockOid, try_open dance - maybe it could be factored out.

As I understand, INFO is not well suited to messages that are not going
to the client.  Anyway, my point is about the contortions that are
needed to support the case, rather than what exactly do we do when it
happens.

Retrospectively, it's strange that this problem (what happens when
indexes with pending work-items are dropped) hadn't manifested.  It
seems a pretty obvious one.

> > I'd rather have the code that drops the index go through the list of
> > work-items and delete those that reference that relation.
> > 
> > I'm not sure if this is something that ought to be done in index_drop();
> > One objection might be that if the drop is rolled back, the work-items
> > are lost.
> 
> Should it be done in an AtCommit hook or something like that ?

I didn't like this idea much on first read, on extensibility grounds,
but perhaps it's not so bad because we can generalize it whenever
there's pressure to add a second type of work-item (*if* that ever
happens).

I suppose the idea is that index_drop saves the index OID when a BRIN
index with autosummarization=on is dropped, and then the
AtCommit_WorkItems() call scans the items list and drops those that
match any OIDs in the list.  (It does require to be mindful of subxact
aborts, of course.)

Re: Unnecessary delay in streaming replication due to replay lag


On 20.11.2020 11:21, Michael Paquier wrote:

On Tue, Sep 15, 2020 at 05:30:22PM +0800, lchch1...@sina.cn wrote:

I read the code and test the patch, it run well on my side, and I have several 
issues on the
patch.

+   RequestXLogStreaming(ThisTimeLineID,
+startpoint,
+PrimaryConnInfo,
+PrimarySlotName,
+wal_receiver_create_temp_slot);

This patch thinks that it is fine to request streaming even if
PrimaryConnInfo is not set, but that's not fine.

Anyway, I don't quite understand what you are trying to achieve here.
"startpoint" is used to request the beginning of streaming.  It is
roughly the consistency LSN + some alpha with some checks on WAL
pages (those WAL page checks are not acceptable as they make
maintenance harder).  What about the case where consistency is
reached but there are many segments still ahead that need to be
replayed?  Your patch would cause streaming to begin too early, and
a manual copy of segments is not a rare thing as in some environments
a bulk copy of segments can make the catchup of a standby faster than
streaming.

It seems to me that what you are looking for here is some kind of
pre-processing before entering the redo loop to determine the LSN
that could be reused for the fast streaming start, which should match
the end of the WAL present locally.  In short, you would need a
XLogReaderState that begins a scan of WAL from the redo point until it
cannot find anything more, and use the last LSN found as a base to
begin requesting streaming.  The question of timeline jumps can also
be very tricky, but it could also be possible to not allow this option
if a timeline jump happens while attempting to guess the end of WAL
ahead of time.  Another thing: could it be useful to have an extra
mode to begin streaming without waiting for consistency to finish?
--
Michael



Status update for a commitfest entry.

This entry was "Waiting On Author" during this CF, so I've marked it as 
returned with feedback. Feel free to resubmit an updated version to a 
future commitfest.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Corner-case bug in pg_rewind


On 16.11.2020 05:49, Ian Lawrence Barwick wrote:


Note that the patch  may require reworking for HEAD due to changes in
commit 9c4f5192f6. I'll try to take another look this week.


Regards

Ian Barwick


--
EnterpriseDB: https://www.enterprisedb.com




Status update for a commitfest entry.

The patch is Waiting on Author for some time. As this is a bug fix, I am 
moving it to the next CF.

Ian, are you planning to continue working on it?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [PATCH] SET search_path += octopus

2020-12-01 Thread Abhijit Menon-Sen

At 2020-10-28 18:29:30 -0700, and...@anarazel.de wrote:
>
> I think my gut feeling about this still is that it's not worth
> implementing something that doesn't work in postgresql.conf. The
> likelihood of ending up with something that makes it hard to to
> eventually implement proper postgresql.conf seems high.

OK, thanks for explaining. That seems a reasonable concern.

I can't think of a reasonable way to accommodate the variety of possible
modifications to settings in postgresql.conf without introducing some
kind of functional syntax:

shared_preload_libraries =
list('newval', $shared_preload_libraries, 'otherval')

I rather doubt my ability to achieve anything resembling consensus on a
feature like that, even if it were restricted solely to list operations
on a few settings to begin with. I am also concerned that such a feature
would make it substantially harder to understand where and how the value
of a particular setting is derived (even if I do find some way to record
multiple sources in pg_settings—a problem that was brought up in earlier
discussions).

I'm certainly not in favour of introducing multiple operators to express
the various list operations, like += for prepend and =+ for append. (By
the standard of assembling such things from spare parts, the right thing
to do would be to add M4 support to postgresql.conf, but I'm not much of
a fan of that idea either.)

Therefore, for lack of any plausible way to proceed, I do not plan to
work on this feature any more.

-- Abhijit

Re: Corner-case bug in pg_rewind

2020-12-01 Thread Pavel Borisov

>
> Status update for a commitfest entry.
>
> The patch is Waiting on Author for some time. As this is a bug fix, I am
> moving it to the next CF.
> Ian, are you planning to continue working on it?
>

As a reviewer, I consider the patch useful and good overall. The comments I
left were purely cosmetic. It's a pity to me that this bugfix delayed for
such a small reason and outdated, therefore. It would be nice to complete
this fix on the next CF.

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: Improving spin-lock implementation on ARM.

On Tue, Dec 1, 2020 at 3:44 PM Krunal Bauskar  wrote:
> I have completed benchmarking with lse.
>
> Graph attached.

Thank you for benchmarking.

Now I agree with this comment by Tom Lane

> In general, I'm pretty skeptical of *all* the results posted so far on
> this thread, because everybody seems to be testing exactly one machine.
> If there's one thing that it's safe to assume about ARM, it's that
> there are a lot of different implementations; and this area seems very
> very likely to differ across implementations.

Different ARM implementations look too different.  As you pointed out,
LSE is enabled in gcc-10 by default.  I doubt we can accept a patch,
which gives benefits for specific platform and only when the compiler
isn't very modern.  Also, we didn't cover all ARM planforms.  Given
they are so different, we can't guarantee that patch doesn't cause
regression of some ARM.  Additionally, the effect of the CAS patch
even for Kunpeng seems modest.  It makes the drop off of TPS more
smooth, but it doesn't change the trend.

--
Regards,
Alexander Korotkov

Re: Improving spin-lock implementation on ARM.

On Tue, Dec 1, 2020 at 1:10 PM Amit Khandekar  wrote:
> On Tue, 1 Dec 2020 at 15:33, Krunal Bauskar  wrote:
> > What I meant was outline-atomics support was added in GCC-9.4 and was made 
> > default in gcc-10.
> > LSE support is present for quite some time.
>
> FWIW, here is an earlier discussion on the same (also added the
> proposal author here) :
>
> https://www.postgresql.org/message-id/flat/099F69EE-51D3-4214-934A-1F28C0A1A7A7%40amazon.com

Thank you for pointing! I wonder why the effect of LSE on Graviton2
observed by Tsahi Zidenberg is so modest.  It's probably because he
runs the tests with a low number of clients.

--
Regards,
Alexander Korotkov

Re: Corruption during WAL replay


On 06.11.2020 14:40, Masahiko Sawada wrote:


So I agree to
proceed with the patch that adds a critical section independent of
fixing other related things discussed in this thread. If Teja seems
not to work on this I’ll write the patch.

Regards,


--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/



Status update for a commitfest entry.

The commitfest is closed now. As this entry is a bug fix, I am moving it 
to the next CF.

Are you planning to continue working on it?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: runtime error copying oids field

On 2020-Dec-01, Alvaro Herrera wrote:

> On 2020-Nov-30, Zhihong Yu wrote:
> 
> > Alvaro, et al:
> > Please let me know how to proceed with the patch.
> > 
> > Running test suite with the patch showed no regression.
> 
> That's good to hear.  I'll get it pushed today.  Thanks for following
> up.

Done, thanks for reporting this.

Re: Strange behavior with polygon and NaN

Anastasia Lubennikova  writes:
> The commitfest is closed now and this entry is "Waiting on author".
> As far as I see, part of the fixes is already committed. Is there 
> anything left to work on or this patch needs review/ ready for committer 
> now?

I think it should be "needs review" now.

regards, tom lane

Re: Reduce the time required for a database recovery from archive.


On 09.11.2020 19:31, Stephen Frost wrote:

Greetings,

* Dmitry Shulga (d.shu...@postgrespro.ru) wrote:

On 19 Oct 2020, at 23:25, Stephen Frost  wrote:

process finishes a WAL file but then just sit around doing nothing while
waiting for the applying process to finish another segment.

I believe that for typical set-up the parameter max_restore_command_workers 
would have value 2 or 3 in order to supply
a delivered WAL file on time just before it be started processing.

This use case is for environment where time required for delivering WAL file 
from archive is greater  than time required for applying records contained in 
the WAL file.
If time required for WAL file delivering lesser than than time required for 
handling records contained in it then max_restore_command_workers shouldn't be 
specified at all

That's certainly not correct at all- the two aren't really all that
related, because any time spent waiting for a WAL file to be delivered
is time that the applying process *could* be working to apply WAL
instead of waiting.  At a minimum, I'd expect us to want to have, by
default, at least one worker process running out in front of the
applying process to hopefully eliminate most, if not all, time where the
applying process is waiting for a WAL to show up.  In cases where the
applying process is faster than a single fetching process, a user might
want to have two or more restore workers, though ultimately I still
contend that what they really want is as many workers as needed to make
sure that the applying process doesn't ever need to wait- up to some
limit based on the amount of space that's available.

And back to the configuration side of this- have you considered the
challenge that a user who is using very large WAL files might run
into with the proposed approach that doesn't allow them to control the
amount of space used?  If I'm using 1G WAL files, then I need to have
16G available to have *any* pre-fetching done with this proposed
approach, right?  That doesn't seem great.

Thanks,

Stephen


Status update for a commitfest entry.

The commitfest is closed now. As this entry has been Waiting on Author 
for a while, I've marked it as returned with feedback. Dmitry, feel free 
to resubmit an updated version to a future commitfest.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Improving spin-lock implementation on ARM.

On Tue, 1 Dec 2020 at 20:25, Alexander Korotkov 
wrote:

> On Tue, Dec 1, 2020 at 3:44 PM Krunal Bauskar 
> wrote:
> > I have completed benchmarking with lse.
> >
> > Graph attached.
>
> Thank you for benchmarking.
>
> Now I agree with this comment by Tom Lane
>
> > In general, I'm pretty skeptical of *all* the results posted so far on
> > this thread, because everybody seems to be testing exactly one machine.
> > If there's one thing that it's safe to assume about ARM, it's that
> > there are a lot of different implementations; and this area seems very
> > very likely to differ across implementations.
>
> Different ARM implementations look too different.  As you pointed out,
> LSE is enabled in gcc-10 by default.  I doubt we can accept a patch,
> which gives benefits for specific platform and only when the compiler
> isn't very modern.  Also, we didn't cover all ARM planforms.  Given
> they are so different, we can't guarantee that patch doesn't cause
> regression of some ARM.  Additionally, the effect of the CAS patch
> even for Kunpeng seems modest.  It makes the drop off of TPS more
> smooth, but it doesn't change the trend.
>

There are 2 parts:

** Does CAS patch help scale PGSQL. Yes.*
** Is LSE beneficial for all architectures. Probably No.*

The patch addresses only the former one which is true for all cases.
(Enabling LSE should be an independent process).

gcc-10 made it default but when I read [1] it quotes that canonical decided
to remove it as default
as part of* Ubuntu-20.04 which means LSE has not proven the test of
canonical (probably).*
Also, most of the distro has not yet started shipping GCC-10 which is way
far before it makes it to all distro.

So if we keep the LSE effect aside and just look at the patch from
performance improvement it surely helps
achieve a good gain. I see an improvement in the range of 10-40%.
Amit during his independent testing also observed the gain in the same
range and your testing with G-2 has re-attested the same point.
Pardon me if this is modest as per pgsql standards.

With 1024 scalability PGSQL on other arches (beyond ARM) struggle to scale
so there is something more
inherent that needs to be addressed from a generic perspective.

Also, the said patch non-only helps pgbench kind of workload but other
workloads too.

--

I would request you guys to re-think it from this perspective to help
ensure that PGSQL can scale well on ARM.
s_lock becomes a top-most function and LSE is not a universal solution but
CAS surely helps ease the main bottleneck.

And surely let me know if more data is needed.

Link:
[1]:
https://www.postgresql.org/message-id/flat/099F69EE-51D3-4214-934A-1F28C0A1A7A7%40amazon.com

> --
> Regards,
> Alexander Korotkov
>

-- 
Regards,
Krunal Bauskar

Re: BUG #15383: Join Filter cost estimation problem in 10.5


On 30.10.2020 19:33, David G. Johnston wrote:
On Fri, Oct 30, 2020 at 9:16 AM Anastasia Lubennikova 
mailto:lubennikov...@gmail.com>> wrote:


Status update for a commitfest entry.

It looks like there was no real progress on this issue since
April. I see only an experimental patch. What kind of review does
it need right now? Do we need more testing or maybe production
statistics for such queries?

David, are you going to continue working on it?
Can you, please, provide a short summary of the problem and list
open questions for reviewers.

The new status of this patch is: Waiting on Author


I agree that updating a wiki seems like an unappealing conclusion to 
this entry.  I'm on the fence whether we just leave it in the cf and 
get a periodic nag when it gets bumped.  As we are describing real 
problems or potential improvements to live portions of the codebase 
ISTM that adding code comments near to those locations would be 
warranted.  The commit message can reference this thread.  There seems 
to already be a convention for annotating code comments in such a way 
that they can be searched for - which basically is what sticking it in 
the wiki would accomplish but the searcher would have to look in the 
codebase and not on a website.  For the target audience of this 
specific patch that seems quite reasonable.


David J.


Here we are again.
The commitfest is closed now and the discussion haven't moved from where 
it was a month ago.
I don't see any use in moving it to the next CF as is and I don't want 
to return it either, as this is clearly a bug.


I think we have a few options:

1. We can add it into TODO until better times.
2. We can write a patch to address this problem in code comments.
3. We can maybe CC this thread to someone and ask for help. Do you know 
people, who have  expertise in this area?


None of these options is perfect, but the second option will perhaps be 
a good compromise.

David, can you, please submit a patch?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

Hi Justin,

Thanks for all the comments. I'll incorporate everything and submit an
updated version later.

On 2020-Nov-30, Justin Pryzby wrote:

> On Tue, Nov 03, 2020 at 08:56:06PM -0300, Alvaro Herrera wrote:

> > +++ b/src/bin/psql/describe.c
> > -   printfPQExpBuffer(&tmpbuf, _("Partition of: %s %s"), 
> > parent_name,
> > - partdef);
> > +   printfPQExpBuffer(&tmpbuf, _("Partition of: %s %s%s"), 
> > parent_name,
> > + partdef,
> > + strcmp(detached, "t") 
> > == 0 ? " DETACHED" : "");
> 
> The attname "detached" is a stretch of what's intuitive (it's more like
> "detachING" or half-detached).  But I think psql should for sure show 
> something
> more obvious to users.  Esp. seeing as psql output isn't documented.  Let's
> figure out what to show to users and then maybe rename the column that, too.

OK.  I agree that "being detached" is the state we want users to see, or
maybe "detach pending", or "unfinisheddetach" (ugh).  I'm not sure that
pg_inherits.inhbeingdetached" is a great column name.  Opinions welcome.

> > +PG_KEYWORD("finalize", FINALIZE, UNRESERVED_KEYWORD, BARE_LABEL)
> 
> Instead of finalize .. deferred ?  Or ??

Well, I'm thinking that this has to be a verb in the imperative mood.
The user is commanding the server to "finalize this detach operation".
I'm not sure that DEFERRED fits that grammatical role.  If there are
other ideas, let's discuss them.

ALTER TABLE tst DETACH PARTITION tst_1 FINALIZE  <-- decent
ALTER TABLE tst DETACH PARTITION tst_1 COMPLETE  <-- I don't like it
ALTER TABLE tst DETACH PARTITION tst_1 DEFERRED  <-- grammatically faulty?

> ATExecDetachPartition:
> Doesn't this need to lock the table before testing for default partition ?

Correct, it does.

> I ended up with apparently broken constraint when running multiple loops 
> around
> a concurrent detach / attach:
> 
> while psql -h /tmp postgres -c "ALTER TABLE p ATTACH PARTITION p1 FOR VALUES 
> FROM (1)TO(2)" -c "ALTER TABLE p DETACH PARTITION p1 CONCURRENTLY"; do :; 
> done&
> while psql -h /tmp postgres -c "ALTER TABLE p ATTACH PARTITION p1 FOR VALUES 
> FROM (1)TO(2)" -c "ALTER TABLE p DETACH PARTITION p1 CONCURRENTLY"; do :; 
> done&
> 
> "p1_check" CHECK (true)
> "p1_i_check" CHECK (i IS NOT NULL AND i >= 1 AND i < 2)

Not good.

Re: [HACKERS] Custom compression methods

2020-12-01 Thread Dilip Kumar

On Tue, Dec 1, 2020 at 4:50 PM Dilip Kumar  wrote:
>
> On Sat, Nov 21, 2020 at 3:50 AM Robert Haas  wrote:
>
> While working on this comment I have doubts.
>
> > I wonder in passing about TOAST tables and materialized views, which
> > are the other things that have storage. What gets stored for
> > attcompression? For a TOAST table it probably doesn't matter much
> > since TOAST table entries shouldn't ever be toasted themselves, so
> > anything that doesn't crash is fine (but maybe we should test that
> > trying to alter the compression properties of a TOAST table doesn't
> > crash, for example).
>
> Yeah for the toast table it doesn't matter,  but I am not sure what do
> you mean by altering the compression method for the toast table. Do you
> mean manually update the pg_attribute tuple for the toast table and
> set different compression methods?  Or there is some direct way to
> alter the toast table?
>
>  For a materialized view it seems reasonable to
> > want to set column properties, but I'm not quite sure how that works
> > today for things like STORAGE anyway. If we do allow setting STORAGE
> > or COMPRESSION for materialized view columns then dump-and-reload
> > needs to preserve the values.
>
> I see that we allow setting the STORAGE for the materialized view but
> I am not sure what is the use case.  Basically, the tuples are
> directly getting selected from the host table and inserted in the
> materialized view without checking target and source storage type.
> The behavior is the same if you execute INSERT INTO dest_table SELECT
> * FROM source_table.  Basically, if the source_table attribute has
> extended storage and the target table has plain storage, still the
> value will be inserted directly into the target table without any
> conversion.  However, in the table, you can insert the new tuple and
> that will be stored as per the new storage method so that is still
> fine but I don't know any use case for the materialized view.  Now I am
> thinking what should be the behavior for the materialized view?
>
> For the materialized view can we have the same behavior as storage?  I
> think for the built-in compression method that might not be a problem
> but for the external compression method how can we handle the
> dependency, I mean when the materialized view has created the table
> was having an external compression method "cm1" and we have created
> the materialized view based on that now if we alter table and set the
> new compression method and force table rewrite then what will happen
> to the tuple inside the materialized view, I mean tuple is still
> compressed with "cm1" and there is no attribute is maintaining the
> dependency on "cm1" because the materialized view can point to any
> compression method.  Now if we drop the cm1 it will be allowed to
> drop.  So I think for the compression method we can consider the
> materialized view same as the table, I mean we can allow setting the
> compression method for the materialized view and we can always ensure
> that all the tuple in this view is compressed with the current or the
> preserved compression methods.  So whenever we are inserting in the
> materialized view then we should compare the datum compression method
> with the target compression method.
>
>
> > +   /*
> > +* Use default compression method if the existing compression 
> > method is
> > +* invalid but the new storage type is non plain storage.
> > +*/
> > +   if (!OidIsValid(attrtuple->attcompression) &&
> > +   (newstorage != TYPSTORAGE_PLAIN))
> > +   attrtuple->attcompression = DefaultCompressionOid;
> >
> > You have a few too many parens in there.
> >
> > I don't see a particularly good reason to treat plain and external
> > differently.
>
> Yeah, I think they should be treated the same.
>
>  More generally, I think there's a question here about
> > when we need an attribute to have a valid compression type and when we
> > don't. If typstorage is plan or external, then there's no point in
> > ever having a compression type and maybe we should even reject
> > attempts to set one (but I'm not sure).
>
> I agree.
>
> > However, the attstorage is a
> > different case. Suppose the column is created with extended storage
> > and then later it's changed to plain. That's only a hint, so there may
> > still be toasted values in that column, so the compression setting
> > must endure. At any rate, we need to make sure we have clear and
> > sensible rules for when attcompression (a) must be valid, (b) may be
> > valid, and (c) must be invalid. And those rules need to at least be
> > documented in the comments, and maybe in the SGML docs.
>
> IIUC, even if we change the attstorage the existing tuples are stored
> as it is without changing the tuple storage.  So I think even if the
> attstorage is changed the attcompression should not have any change.
>

I have put some more thought into this and IMHO the rules should be as b

Re: Confusing behavior of psql's \e

On 11/30/20 22:38, Laurenz Albe wrote:

> I have been annoyed about this myself, and I have heard other prople
> complain about it, so I propose to clear the query buffer if the
> editor exits without modifying the file.

Or how about keeping the query buffer content unchanged, but not
executing it? Then it's still there if you have another idea about
editing it. It's easy enough to \r if you really want it gone.

One downside I can foresee is that I could form a habit of doing
something that would be safe in psql version x but would bite me
one day if I'm in psql version x-- for some reason.

Regards,
-Chap

Re: Improving spin-lock implementation on ARM.

On Tue, Dec 1, 2020 at 6:19 PM Krunal Bauskar  wrote:
> I would request you guys to re-think it from this perspective to help ensure 
> that PGSQL can scale well on ARM.
> s_lock becomes a top-most function and LSE is not a universal solution but 
> CAS surely helps ease the main bottleneck.

CAS patch isn't proven to be a universal solution as well.  We have
tested the patch on just a few processors, and Tom has seen the
regression [1].  The benchmark used by Tom was artificial, but the
results may be relevant for some real-life workload.

I'm expressing just my personal opinion, other committers can have
different opinions.  I don't particularly think this topic is
necessarily a non-starter.  But I do think that given ambiguity we've
observed in the benchmark, much more research is needed to push this
topic forward.

Links.
1. https://www.postgresql.org/message-id/741389.1606530957%40sss.pgh.pa.us

--
Regards,
Alexander Korotkov

Re: Phrase search vs. multi-lexeme tokens

On Thu, Nov 12, 2020 at 4:09 PM Alexander Korotkov  wrote:
> This issue looks like the much more complex design bug in phrase
> search.  Fixing this would require some kind of readahead or multipass
> processing, because we don't know how to process 'pg_class' in
> advance.
>
> Is this really a design bug existing in phrase search from the
> beginning.  Or am I missing something?

No feedback yet.  I've added this to the commitfest to don't lose track of this.
https://commitfest.postgresql.org/31/2854/

--
Regards,
Alexander Korotkov

Re: Avoid using lcons and list_delete_first in plan_union_children()

2020-12-01 Thread Heikki Linnakangas


On 01/12/2020 12:52, Hou, Zhijie wrote:

Hi,

In function plan_union_children(),
I found the lcons and list_delete_first here is easy to be replaced by lappend 
and list_delete_last.

And I also found a previous commit do similar thing, so I try to improve this 
one.

Previous commit:
d97b714a219959a50f9e7b37ded674f5132f93f3


This doesn't matter much in practice. I was able to measure a 
performance difference with a very extreme example: "SELECT 1 UNION 
SELECT 2 UNION ... UNION SELECT 1". That was about 30% faster with 
this patch.


I don't see any downside, though. And like the commit message of 
d97b714a21, it's better to have good examples than bad ones in the code 
base, if someone copy-pastes it to somewhere where it matters.


But then again, I'm not excited about changing these one by one. If this 
is worth changing, what about the one in simplify_or_arguments()? Or in 
CopyMultiInsertInfoFlush()?


Perhaps it would make sense to invent a new Queue abstraction for this. 
Or maybe it would be overkill..


- Heikki

Re: Confusing behavior of psql's \e

2020-12-01 Thread Laurenz Albe

On Tue, 2020-12-01 at 11:03 -0500, Chapman Flack wrote:
> On 11/30/20 22:38, Laurenz Albe wrote:
> > I have been annoyed about this myself, and I have heard other prople
> > complain about it, so I propose to clear the query buffer if the
> > editor exits without modifying the file.
> 
> Or how about keeping the query buffer content unchanged, but not
> executing it? Then it's still there if you have another idea about
> editing it. It's easy enough to \r if you really want it gone.

What if the buffer was empty?  Would you want to get the previous
query in the query buffer?  I'd assume not...

Yours,
Laurenz Albe

Re: Add session statistics to pg_stat_database

2020-12-01 Thread Magnus Hagander

On Fri, Nov 20, 2020 at 3:41 PM Laurenz Albe 
wrote:

> On Tue, 2020-11-17 at 17:33 +0100, Magnus Hagander wrote:
> > I've taken a look as well, and here are a few short notes:
>
> Much appreciated!
>

Sorry about the delay in getting back to you on this one. FYI, while the
patch has been bumped to the next CF by now, I do intend to continue
working on it before that starts.


> * It talks about "number of connections" but "number of aborted
> sessions". We should probably
> >   be consistent about talking either about connections or sessions? In
> particular, connections
> >   seems wrong in this case, because it only starts counting after
> authentication is complete
> >   (since otherwise we send no stats)? (This goes for both docs and
> actual function names)
>
> Yes, that is true.  I have changed "connections" to "sessions" and renamed
> the new
> column "connections" to "session_count".
>
> I think that most people will understand a session as started after a
> successful
> connection.
>

Yeah, I agree, and as long as it's consistent we don't need more
explanations than that.

Further int he views, it's a bit strange to have session_count and
aborted_session, but I'm not sure what to suggest. "aborted_session_count"
seems too long. Maybe just "sessions" instead of "session_count" -- no
other counters actually have the "_count" suffix.


> * Is this actually a fix that's independent of the new stats? It seems in
> general to be
> >   changing the behaviour of "force", which is more generic?
> > -   !have_function_stats)
> > +   !have_function_stats && !force)
>
> The comment right above that reads:
> /* Don't expend a clock check if nothing to do */
> So it is just a quick exit if there is nothing to do.
>
> But with that patch we have something to do if "force" (see below) is true:
> Report the remaining session duration and if the session was closed
> normally.
>
> Thus the additional check.
>

Ah yeah, makes sense. It becomes more clear with the rename.


> * in pgstat_send_connstat() you pass the parameter "force" in as
> "disconnect".
> >   That behaviour at least requires a comment saying why, I think. My
> understanding is
> >   it relies on that "force" means this is a "backend is shutting down",
> but that is not
> >   actually documented anywhere. Maybe the "force" parameter should
> actually be renamed
> >   to indicate this is really what it means, to avoid a future mistake in
> the area?
> >   But even with that, how does that turn into disconnect?
>
> "pgstat_report_stat(true)" is only called from "pgstat_beshutdown_hook()",
> so
> it is currently only called when the backend is about to exit.
>
> According the the comments the flag means that "caller wants to force
> stats out".
> I guess that the author thought that there may arise other reasons to
> force sending
> statistics in the future (commit 641912b4d from 2007).
>
> However, since that has not happened, I have renamed the flag to
> "disconnect" and
> adapted the documentation.  This doesn't change the current behavior, but
> establishes
> a new rule.
>

That makes it a lot more clear. And I agree, if nobody came up with a
reason since 2007, then we are free to repurpose it :)



> * Maybe rename pgStatSessionDisconnected to
> pgStatSessionNormalDisconnected?
> >   To avoid having to go back to the setting point and look it up in a
> comment.
>
> Long, descriptive names are a good thing.
> I have decided to use "pgStatSessionDisconnectedNormally", since that is
> even longer
> and seems to fit the "yes or no" category better.
>

WFM.


> I wonder if there would also be a way to count "sessions that crashed" as
> well.
> >  That is,the ones that failed in a way that caused the postmaster to
> restart the system.
> >  But that's information we'd have to send from the postmaster, but I'm
> actually unsure
> >  if we're "allowed" to send things to the stats collector from the
> postmaster.
> >  But I think it could be quite useful information to have. Maybe we can
> find some way
> >  to piggyback on the fact that we're restarting the stats collector as a
> result?
>
> Sure, a crash count would be useful.  I don't know if it is easy for the
> stats collector
> to tell the difference between a start after a backend crash and - say -
> starting from
> a base backup.
>
> Patch v6 attached.
>
> I think that that would be material for another patch, and I don't think
> it should go
> to "pg_stat_database", because a) it might be hard to tell to which
> database the crashed
> backend was attached, b) it might be a background process that doesn't
> belong to a database
> and c) if the crash were caused by - say - corruption in a shared catalog,
> it would be
> misleading


I'm not sure it is outside the scope of this patch, because I think it
might be easier to do than I (and I think you) first thought. We don't need
to track which database crashed -- if we track all *other* ways a database
exits, then crashes are a

Re: Confusing behavior of psql's \e

On 12/01/20 11:21, Laurenz Albe wrote:
> On Tue, 2020-12-01 at 11:03 -0500, Chapman Flack wrote:
>>> complain about it, so I propose to clear the query buffer if the
>>> editor exits without modifying the file.
>>
>> Or how about keeping the query buffer content unchanged, but not
>> executing it? Then it's still there if you have another idea about
>> editing it. It's easy enough to \r if you really want it gone.
> 
> What if the buffer was empty?  Would you want to get the previous
> query in the query buffer?  I'd assume not...

I took your proposal to be specifically about what happens if the editor
is exited with no change to the buffer, and in that case, I would suggest
making no change to the buffer, but not re-executing it.

If the editor is exited after deliberately emptying the editor buffer,
I would expect that to be treated as emptying the query buffer.

I don't foresee any case that would entail bringing a /previous/ query
back into the query buffer.

Regards,
-Chap

Re: Improving spin-lock implementation on ARM.

Alexander Korotkov  writes:
> On Tue, Dec 1, 2020 at 6:19 PM Krunal Bauskar  wrote:
>> I would request you guys to re-think it from this perspective to help ensure 
>> that PGSQL can scale well on ARM.
>> s_lock becomes a top-most function and LSE is not a universal solution but 
>> CAS surely helps ease the main bottleneck.

> CAS patch isn't proven to be a universal solution as well.  We have
> tested the patch on just a few processors, and Tom has seen the
> regression [1].  The benchmark used by Tom was artificial, but the
> results may be relevant for some real-life workload.

Yeah.  I think that the main conclusion from what we've seen here is
that on smaller machines like M1, a standard pgbench benchmark just
isn't capable of driving PG into serious spinlock contention.  (That
reflects very well on the work various people have done over the years
to get rid of spinlock contention, because ten or so years ago it was
a huge problem on this size of machine.  But evidently, not any more.)
Per the results others have posted, nowadays you need dozens of cores
and hundreds of client threads to measure any such issue with pgbench.

So that is why I experimented with a special test that does nothing
except pound on one spinlock.  Sure it's artificial, but if you want
to see the effects of different spinlock implementations then it's
just too hard to get any results with pgbench's regular scripts.

And that's why it disturbs me that the CAS-spinlock patch showed up
worse in that environment.  The fact that it's not visible in the
regular pgbench test just means that the effect is too small to
measure in that test.  But in a test where we *can* measure an effect,
it's not looking good.

It would be interesting to see some results from the same test I did
on other processors.  I suspect the results would look a lot different
from mine ... but we won't know unless someone does it.  Or, if someone
wants to propose some other test case, let's have a look.

> I'm expressing just my personal opinion, other committers can have
> different opinions.  I don't particularly think this topic is
> necessarily a non-starter.  But I do think that given ambiguity we've
> observed in the benchmark, much more research is needed to push this
> topic forward.

Yeah.  I'm not here to say "do nothing".  But I think we need results
from more machines and more test cases to convince ourselves whether
there's a consistent, worthwhile win from any specific patch.

regards, tom lane

Re: Improving spin-lock implementation on ARM.

On Tue, Dec 1, 2020 at 7:57 PM Zidenberg, Tsahi  wrote:
> > On 01/12/2020, 16:59, "Alexander Korotkov"  wrote:
> >  On Tue, Dec 1, 2020 at 1:10 PM Amit Khandekar  
> > wrote:
> >   > FWIW, here is an earlier discussion on the same (also added the
> >> proposal author here) :
>
> Thanks for looping me in!
>
> >>
> >> 
> > https://www.postgresql.org/message-id/flat/099F69EE-51D3-4214-934A-1F28C0A1A7A7%40amazon.com
> >
> >
> >Thank you for pointing! I wonder why the effect of LSE on Graviton2
> >observed by Tsahi Zidenberg is so modest.  It's probably because he
> >runs the tests with a low number of clients.
>
> There are multiple possible reasons why I saw a smaller effect of LSE, but I 
> think an important one was that I
> used a 32-core instance rather than a 64-core one. The reason I did so, was 
> that 32-cores gave me better
> absolute results than 64 cores, and I didn't want to feel like I could 
> misguide anyone.

BTW, what number of clients did you use?  I can't find it in your message.

--
Regards,
Alexander Korotkov

Re: Improving spin-lock implementation on ARM.

2020-12-01 Thread Zidenberg, Tsahi

> On 01/12/2020, 16:59, "Alexander Korotkov"  wrote:
>  On Tue, Dec 1, 2020 at 1:10 PM Amit Khandekar  wrote:
>   > FWIW, here is an earlier discussion on the same (also added the
>> proposal author here) :

Thanks for looping me in!

>>
>> 
> https://www.postgresql.org/message-id/flat/099F69EE-51D3-4214-934A-1F28C0A1A7A7%40amazon.com
>
>
>Thank you for pointing! I wonder why the effect of LSE on Graviton2
>observed by Tsahi Zidenberg is so modest.  It's probably because he
>runs the tests with a low number of clients.

There are multiple possible reasons why I saw a smaller effect of LSE, but I 
think an important one was that I
used a 32-core instance rather than a 64-core one. The reason I did so, was 
that 32-cores gave me better
absolute results than 64 cores, and I didn't want to feel like I could misguide 
anyone.

The 64-core instance results is a good example for the benefit of LSE. LSE 
becomes most important in edges,
and with adversarial workloads. If multiple CPUs try to acquire a lock 
simultaneously - LSE ensures one CPU
will indeed get the lock (with just one transaction), while LDRX/STRX could 
have multiple CPUS looping and
no-one acquiring a lock. This is why I believe just looking at "reasonable" 
benchmarks misses out on effects
real customers will run into.

Happy to see another arm-optimization thread so quickly :)

Thank you!
Tsahi.

Re: Minor documentation error regarding streaming replication protocol

2020-12-01 Thread Jeff Davis

On Thu, 2020-10-08 at 23:17 -0400, Bruce Momjian wrote:
> As I said before, it is more raw bytes, but
> we
> don't have an SQL data type for that.

Sorry to jump in to this thread late.

Can't we just set both "filename" and "content" to be BYTEA, but then
set the format code to 1 (indicating binary)?

For clients that do look at the descriptor, they are more likely to get
it right; and for clients that don't look at the descriptor, there will
be no change.

Regards,
Jeff Davis

Off-Schedule Minor Release Consideration?

2020-12-01 Thread David G. Johnston

Hackers,

Given that we have already heard about 3 separate minor release regressions
in the most recent update I'm putting forth for discussion whether an
off-schedule release should be done.  I probably wouldn't care as much if
it wasn't for the fact that the same release fixes two CVEs.

https://www.postgresql.org/message-id/flat/1b8561412e8a4f038d7a491c8b922788%40smhi.se
(notification queue)
https://www.postgresql.org/message-id/flat/16746-44b30e2edf4335d4%40postgresql.org
(\connect)
https://www.postgresql.org/message-id/flat/831E420C-D9D5-4563-ABA4-F05F03E6D7DE%40guidance.nl#718d1785b6533f13b992c800f3bd6647
(create table like)

David J.

Re: error_severity of brin work item

2020-12-01 Thread Justin Pryzby

On Tue, Dec 01, 2020 at 11:07:30AM -0300, Alvaro Herrera wrote:
> > Should it be done in an AtCommit hook or something like that ?
> 
> I didn't like this idea much on first read, on extensibility grounds,
> but perhaps it's not so bad because we can generalize it whenever
> there's pressure to add a second type of work-item (*if* that ever
> happens).
> 
> I suppose the idea is that index_drop saves the index OID when a BRIN
> index with autosummarization=on is dropped, and then the
> AtCommit_WorkItems() call scans the items list and drops those that
> match any OIDs in the list.  (It does require to be mindful of subxact
> aborts, of course.)

This was an idea I made up - I don't know any of the details of this, but if
you give a hint I could look at it more.  There'd (still) be a race window, but
I think that's ok.

Another idea is if perform_work_item() were responsible for discarding
relations which disappear.  Currently it does this, which is racy since it
holds no lock.

cur_relname = get_rel_name(workitem->avw_relation);
cur_nspname = 
get_namespace_name(get_rel_namespace(workitem->avw_relation));
cur_datname = get_database_name(MyDatabaseId);
if (!cur_relname || !cur_nspname || !cur_datname)
goto deleted2;

-- 
Justin

Re: error_severity of brin work item

On 2020-Dec-01, Justin Pryzby wrote:

> This was an idea I made up - I don't know any of the details of this, but if
> you give a hint I could look at it more.  There'd (still) be a race window, 
> but
> I think that's ok.

See CommitTransaction() and friends, where AtEOXact_on_commit_actions()
and others are called.  You'd have to create a new routine (say
AtEOXact_Autovacuum or more specific AtEOXact_AutovacuumWorkItems), to
be called at the right places in xact.c.  Keep a global variable, say a
list of OIDs.  On subxact commit, the list is reassigned to its parent
transaction; on subxact abort, the list is discarded.  On top xact
commit, the list of OIDs is passed to some new routine in autovacuum.c
that scans the workitem array and deletes items as appropriate.

Not sure what's a good place for OIDs to be added to the list.  We don't
have AM-specific entry points for relation drop.  I think this is the
weakest point of this.

> Another idea is if perform_work_item() were responsible for discarding
> relations which disappear.  Currently it does this, which is racy since it
> holds no lock.

That has the property that it remains contained in autovacuum.c, but no
other advantages I think.

Re: proposal: unescape_text function

>> po 30. 11. 2020 v 22:15 odesílatel Pavel Stehule 
>> napsal:
>>> I checked this and it is "prefix backslash-u hex" used by Java,
>>> JavaScript  or RTF -
>>> https://billposer.org/Software/ListOfRepresentations.html

If I look on that page, it appears that RTF is using a similar-looking
escape but in decimal rather than hex.

It would be important to define what is done with non-BMP characters?
Will there be another escape for a six- or eight-hexdigit format for
the codepoint, or will it be represented by two four-hexdigit escapes
for consecutive UTF-16 surrogates?

Regards,
-Chap

Re: PG vs LLVM 12 on seawasp, next round

2020-12-01 Thread Fabien COELHO



Hello Thomas,


Since seawasp's bleeding edge LLVM installation moved to "trunk
20201114 c8f4e06b 12.0.0" ~16 days ago, it has been red.  Further
updates didn't help it and it's now on "trunk 20201127 6ee22ca6
12.0.0".  I wonder if there is something in Fabien's scripting that
needs to be tweaked, perhaps a symlink name or similar.


The compiler compilation script is quite straightforward (basically, get 
sources, configure and compile), even for a such a moving target…


The right approach is to wait for some time before looking at the issue, 
typically one week for the next recompilation, in case the problem 
evaporates, so you were right not to jump on it right away:-)


Andres investigated a few days ago, managed to reproduce the issue 
locally, and has one line patch. I'm unsure if it should be prevently 
back-patched, though.


--
Fabien.

Re: PG vs LLVM 12 on seawasp, next round

2020-12-01 Thread Andres Freund

Hi,

On 2020-12-01 21:04:44 +0100, Fabien COELHO wrote:
> Andres investigated a few days ago, managed to reproduce the issue locally,
> and has one line patch. I'm unsure if it should be prevently back-patched,
> though.

I see no reason not to backpatch - it's more correct for past versions
of LLVM as well.

Greetings,

Andres Freund

Re: proposal: unescape_text function

2020-12-01 Thread Pavel Stehule

út 1. 12. 2020 v 20:20 odesílatel Chapman Flack 
napsal:

> >> po 30. 11. 2020 v 22:15 odesílatel Pavel Stehule <
> pavel.steh...@gmail.com>
> >> napsal:
> >>> I checked this and it is "prefix backslash-u hex" used by Java,
> >>> JavaScript  or RTF -
> >>> https://billposer.org/Software/ListOfRepresentations.html
>
> If I look on that page, it appears that RTF is using a similar-looking
> escape but in decimal rather than hex.
>
> It would be important to define what is done with non-BMP characters?
> Will there be another escape for a six- or eight-hexdigit format for
> the codepoint, or will it be represented by two four-hexdigit escapes
> for consecutive UTF-16 surrogates?
>

the detection of decimal or hexadecimal codes can be a hard problem -
string "12" is valid in both systems, but the numbers are different. So
there should be external specification as an argument.

Regards

Pavel



> Regards,
> -Chap
>

Re: Index Skip Scan (new UniqueKeys)

2020-12-01 Thread Dmitry Dolgov

> On Mon, Nov 30, 2020 at 04:42:20PM +0200, Heikki Linnakangas wrote:
>
> I had a quick look at this patch. I haven't been following this thread, so
> sorry if I'm repeating old arguments, but here we go:

Thanks!

> - I'm surprised you need a new index AM function (amskip) for this. Can't
> you just restart the scan with index_rescan()? The btree AM can check if the
> new keys are on the same page, and optimize the rescan accordingly, like
> amskip does. That would speed up e.g. nested loop scans too, where the keys
> just happen to be clustered.

An interesting point. At the moment I'm not sure whether it's possible
to implement skipping via index_rescan or not, need to take a look. But
checking if the new keys are on the same page would introduce some
overhead I guess, wouldn't it be too invasive to add it into already
existing btree AM?

> - Does this optimization apply to bitmap index scans?

No, from what I understand it doesn't.

> - This logic in build_index_paths() is not correct:
>
> > +   /*
> > +* Skip scan is not supported when there are qual conditions, 
> > which are not
> > +* covered by index. The reason for that is that those 
> > conditions are
> > +* evaluated later, already after skipping was applied.
> > +*
> > +* TODO: This implementation is too restrictive, and doesn't 
> > allow e.g.
> > +* index expressions. For that we need to examine index_clauses 
> > too.
> > +*/
> > +   if (root->parse->jointree != NULL)
> > +   {
> > +   ListCell *lc;
> > +
> > +   foreach(lc, (List *)root->parse->jointree->quals)
> > +   {
> > +   Node *expr, *qual = (Node *) lfirst(lc);
> > +   Var *var;
> > +   bool found = false;
> > +
> > +   if (!is_opclause(qual))
> > +   {
> > +   not_empty_qual = true;
> > +   break;
> > +   }
> > +
> > +   expr = get_leftop(qual);
> > +
> > +   if (!IsA(expr, Var))
> > +   {
> > +   not_empty_qual = true;
> > +   break;
> > +   }
> > +
> > +   var = (Var *) expr;
> > +
> > +   for (int i = 0; i < index->ncolumns; i++)
> > +   {
> > +   if (index->indexkeys[i] == 
> > var->varattno)
> > +   {
> > +   found = true;
> > +   break;
> > +   }
> > +   }
> > +
> > +   if (!found)
> > +   {
> > +   not_empty_qual = true;
> > +   break;
> > +   }
> > +   }
> > +   }
>
> If you care whether the qual is evaluated by the index AM or not, you need
> to also check that the operator is indexable. Attached is a query that
> demonstrates that problem.
> ...
> Also, you should probably check that the index quals are in the operator
> family as that used for the DISTINCT.

Yes, good point, will change this in the next version.

> I'm actually a bit confused why we need this condition. The IndexScan
> executor node should call amskip() only after checking the additional quals,
> no?

This part I don't quite get, what exactly you mean by checking the
additional quals in the executor node? But at the end of the day this
condition was implemented exactly to address the described issue, which
was found later and added to the tests.

Re: [sqlsmith] Failed assertion during partition pruning

I wrote:
> What it looks like to me is that the code for setting up run-time
> partition pruning has failed to consider the possibility of nested
> partitioning: it's expecting that every partitioned table will have
> at least one direct child that is a leaf.  I'm not sure though
> whether just the Assert is wrong, or there's more fundamental
> issues here.

After looking into the git history I realized that this assertion is
quite new, stemming from David's a929e17e5a8 of 2020-11-02.  So there's
something not right about that.

regards, tom lane

room for improvement in amcheck btree checking?

2020-12-01 Thread Mark Dilger

Peter,

While working on the pg_amcheck code for [1], I discovered an unexpected
deficiency in the way btree indexes are checked. So far as I can tell from the
docs [2], the deficiency does not violate any promises that amcheck is making,
but I found it rather surprising all the same. To reproduce:

1) Create a (possibly empty) table and btree index over the table.
2) Flush buffers and backup a copy of the heap relation file.
3) Load (more) data into the table.
4) Flushing buffers as needed, revert the heap relation file to the backup
previously taken.
5) Run bt_index_check and bt_index_parent_check with and without heapallindexed
and/or rootdescend. Note that the index passes all checks.
6) Run a SQL query that uses a sequential scan on the table and observe no
errors.
7) Run a SQL query that uses an index scan on the table and see that it errors
with something like:

ERROR: could not read block 0 in file "base/13097/16391": read only 0 of
8192 bytes

I found it surprising that even when precisely zero of the tids in the index
exist in the table the index checks all come back clean. The heapallindexed
check is technically running as advertised, checking that all of the zero
tuples in the heap are present in the index. That is a pretty useless check
under this condition, though. Is a "indexallheaped" option (by some less crazy
name) needed?

Users might also run into this problem when a heap relation file gets
erroneously shortened by some number of blocks but not fully truncated, or
perhaps with torn page writes.

Have you already considered and rejected a "indexallheaped" type check?

Background
---

I have up until recently been focused on corruption caused by twiddling the
bits within heap and index relation pages, but real-world user error, file
system error, and perhaps race conditions in the core postgresql code seem at
least as likely to result in missing or incorrect versions of blocks of
relation files rather than individual bytes within those blocks being wrong.
Per our discussions in [3], not all corruptions that can be created under
laboratory conditions are equally likely to occur in the wild, and it may be
reasonable to only harden the amcheck code against corruptions that are more
likely to happen in actual practice.

To make it easier for tap tests to cover common corruption type scenarios, I
have been extending PostgresNode.pm with functions to perform these kinds of
file system corruptions. I expect to post that work in another thread soon. I
am not embedding any knowledge of the internal structure of heap, index, or
toast relations in PostgresNode, only creating functions to archive versions of
files and perform full or partial reversions of them later.

The ultimate goal of this work is to have sufficient regression tests to
demonstrate that pg_amcheck can be run with default options against a system
corrupted in these common ways without crashing, and with reasonable likelihood
of detecting these common corruptions. Users might understand that hard to
detect corruption will go unnoticed, but it would be harder to explain to them
why, immediately after getting a clean bill of health on their system, a query
bombed out with the sort of error shown above.

[1] https://commitfest.postgresql.org/31/2670/
[2] https://www.postgresql.org/docs/13/amcheck.html
[3]
https://www.postgresql.org/message-id/flat/CAH2-WznaU6HcahLV4Hg-DnhEmW8DuSdYfn3vfWXoj3Me9jq%3DsQ%40mail.gmail.com#0691475da5e9163d21b13fc415095801

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Index Skip Scan (new UniqueKeys)

2020-12-01 Thread Heikki Linnakangas


On 01/12/2020 22:21, Dmitry Dolgov wrote:

On Mon, Nov 30, 2020 at 04:42:20PM +0200, Heikki Linnakangas wrote:

I had a quick look at this patch. I haven't been following this thread, so
sorry if I'm repeating old arguments, but here we go:


Thanks!


- I'm surprised you need a new index AM function (amskip) for this. Can't
you just restart the scan with index_rescan()? The btree AM can check if the
new keys are on the same page, and optimize the rescan accordingly, like
amskip does. That would speed up e.g. nested loop scans too, where the keys
just happen to be clustered.


An interesting point. At the moment I'm not sure whether it's possible
to implement skipping via index_rescan or not, need to take a look. But
checking if the new keys are on the same page would introduce some
overhead I guess, wouldn't it be too invasive to add it into already
existing btree AM?


I think it'll be OK. But if it's not, you could add a hint argument to 
index_rescan() to hint the index AM that the new key is known to be 
greater than the previous key.



- Does this optimization apply to bitmap index scans?


No, from what I understand it doesn't.


Would it be hard to add? Don't need to solve everything in the first 
version of this, but I think in principle you could do the same 
optimization for bitmap index scans, so if the current API can't do it, 
that's maybe an indication that the API isn't quite right.



- This logic in build_index_paths() is not correct:


+   /*
+* Skip scan is not supported when there are qual conditions, 
which are not
+* covered by index. The reason for that is that those 
conditions are
+* evaluated later, already after skipping was applied.
+*
+* TODO: This implementation is too restrictive, and doesn't 
allow e.g.
+* index expressions. For that we need to examine index_clauses 
too.
+*/
+   if (root->parse->jointree != NULL)
+   {
+   ListCell *lc;
+
+   foreach(lc, (List *)root->parse->jointree->quals)
+   {
+   Node *expr, *qual = (Node *) lfirst(lc);
+   Var *var;
+   bool found = false;
+
+   if (!is_opclause(qual))
+   {
+   not_empty_qual = true;
+   break;
+   }
+
+   expr = get_leftop(qual);
+
+   if (!IsA(expr, Var))
+   {
+   not_empty_qual = true;
+   break;
+   }
+
+   var = (Var *) expr;
+
+   for (int i = 0; i < index->ncolumns; i++)
+   {
+   if (index->indexkeys[i] == 
var->varattno)
+   {
+   found = true;
+   break;
+   }
+   }
+
+   if (!found)
+   {
+   not_empty_qual = true;
+   break;
+   }
+   }
+   }


If you care whether the qual is evaluated by the index AM or not, you need
to also check that the operator is indexable. Attached is a query that
demonstrates that problem.
...
Also, you should probably check that the index quals are in the operator
family as that used for the DISTINCT.


Yes, good point, will change this in the next version.


I'm actually a bit confused why we need this condition. The IndexScan
executor node should call amskip() only after checking the additional quals,
no?


This part I don't quite get, what exactly you mean by checking the
additional quals in the executor node? But at the end of the day this
condition was implemented exactly to address the described issue, which
was found later and added to the tests.


As I understand this, the executor logic goes like this:

query: SELECT DISTINCT ON (a, b)  a, b FROM foo where c like '%y%' and a 
like 'a%' and b = 'b';


1. Call index_beginscan, keys: a >= 'a', b = 'b'

2. Call index_getnext, which returns first row to the Index Scan node

3. Evaluates the qual "c like '%y%'" on the tuple. If it's false, goto 
step 2 to get next tuple.


4. Return tuple to parent node

5. index_amskip(), to the next tuple with a > 'a'. Goto 2.

The logic should work fine, even if there are quals that are not 
indexable, like "c like '%y'" in the above example. So why doesn't it 
work? What am I mis

Re: Setof RangeType returns

2020-12-01 Thread Patrick Handja

Thanks for the feedback Tom!

> TypeCacheEntry *typcache;
> PG_RETURN_RANGE_P(range_serialize(typcache, &lower, &upper, false));

The use of typcache really confuses me. range_get_typcache() is used in
order to initialize typcache
> typcache =range_get_typcache(fcinfo, RangeTypeGetOid(r1));

In my case, I do not have a range as an argument, I am receiving 2 int,
which I am using to create a range. How can I initialize typcache in this
case?
That's the part where I am really stuck.

Datum
make_range_griis(PG_FUNCTION_ARGS){
RangeBound lower;
RangeBound upper;

int32   start = PG_GETARG_INT32(0);
int32   finish = PG_GETARG_INT32(1);

lower.val = (Datum) (start);
lower.infinite = false;
lower.lower = true;

upper.val = (Datum) (finish);
upper.infinite = false;
upper.lower = false;

if (!lower.infinite && !lower.inclusive){
lower.val = DirectFunctionCall2(int4pl, lower.val, Int32GetDatum(1));
lower.inclusive = true;
}

if (!upper.infinite && upper.inclusive){
upper.val = DirectFunctionCall2(int4pl, upper.val, Int32GetDatum(1));
upper.inclusive = false;
}

TypeCacheEntry *typcache;
//> typcache = ??;
PG_RETURN_RANGE_P(range_serialize(typcache, &lower, &upper, false));
}

regards,

*Andjasubu Bungama, Patrick *




Le ven. 27 nov. 2020 à 14:24, Tom Lane  a écrit :

> Patrick Handja  writes:
> > This is what I am doing:
>
> > static int
> > get_range_lower(FunctionCallInfo fcinfo, RangeType *r1)
> > {
> > /* Return NULL if there's no finite lower bound */
> > if (empty || lower.infinite)
> > PG_RETURN_NULL();
>
> You can't really use PG_RETURN_NULL here, mainly because there is
> no good value for it to return from get_range_lower().
>
> > return (lower.val);
>
> Here and elsewhere, you're cavalierly casting between Datum and int.
> While you can get away with that as long as the SQL type you're
> working with is int4, it's bad style; mainly because it's confusing,
> but also because you'll have a hard time adapting the code if you
> ever want to work with some other type.  Use DatumGetInt32 or
> Int32GetDatum as appropriate.
>
> > TypeCacheEntry *typcache;
> > PG_RETURN_RANGE_P(range_serialize(typcache, &lower, &upper, false));
>
> This sure appears to be passing an uninitialized typcache pointer
> to range_serialize().  If your compiler isn't whining about that,
> you don't have adequately paranoid warning options enabled.
> That's an excellent way to waste a lot of time, as you just have.
> C is an unforgiving language, so you need all the help you can get.
>
> BTW, use of PG_RETURN_RANGE_P here isn't very appropriate either,
> since the function is not declared as returning Datum.
>
> regards, tom lane
>

Re: Deleting older versions in unique indexes to avoid page splits

2020-12-01 Thread Peter Geoghegan

On Tue, Dec 1, 2020 at 1:50 AM Heikki Linnakangas  wrote:
> This is a wholly new concept with a lot of heuristics. It's a lot of
> swallow.

Thanks for taking a look! Yes, it's a little unorthodox.

Ideally, you'd find time to grok the patch and help me codify the
design in some general kind of way. If there are general lessons to be
learned here (and I suspect that there are), then this should not be
left to chance. The same principles can probably be applied in heapam,
for example. Even if I'm wrong about the techniques being highly
generalizable, it should still be interesting! (Something to think
about a little later.)

Some of the intuitions behind the design might be vaguely familiar to
you as the reviewer of my earlier B-Tree work. In particular, the
whole way that we reason about how successive related operations (in
this case bottom-up deletion passes) affect individual leaf pages over
time might give you a feeling of déjà vu. It's a little like the
nbtsplitloc.c stuff that we worked on together during the Postgres 12
cycle. We can make what might seem like rather bold assumptions about
what's going on, and adapt to the workload. This is okay because we're
sure that the downside of our being wrong is a fixed, low performance
penalty. (To a lesser degree it's okay because the empirical evidence
shows that we're almost always right, because we apply the
optimization again and again precisely because it worked out last
time.)

I like to compare it to the familiar "rightmost leaf page applies leaf
fillfactor" heuristic, which applies an assumption that is wrong in
the general case, but nevertheless obviously helps enormously as a
practical matter. Of course it's still true that anybody reviewing
this patch ought to start with a principled skepticism of this claim
-- that's how you review any patch. I can say for sure that that's the
idea behind the patch, though. I want to be clear about that up front,
to save you time -- if this claim is wrong, then I'm wrong about
everything.

Generational garbage collection influenced this work, too. It also
applies pragmatic assumptions about where garbage is likely to appear.
Assumptions that are based on nothing more than empirical observations
about what is likely to happen in the real world, that are validated
by experience and by benchmarking.

> On 30/11/2020 21:50, Peter Geoghegan wrote:
> > +} TM_IndexDelete;

> > +} TM_IndexStatus;

> Is it really significantly faster to have two arrays? If you had just
> one array, each element would be only 12 bytes long. That's not much
> smaller than TM_IndexDeletes, which is 8 bytes.

Yeah, but the swap operations really matter here. At one point I found
that to be the case, anyway. That might no longer be true, though. It
might be that the code became less performance critical because other
parts of the design improved, resulting in the code getting called
less frequently. But if that is true then it has a lot to do with the
power-of-two bucketing that you go on to ask about -- that helped
performance a lot in certain specific cases (as I go into below).

I will add a TODO item for myself, to look into this again. You may
well be right.

> > + /* First sort caller's array by TID */
> > + heap_tid_shellsort(delstate->deltids, delstate->ndeltids);

> Instead of sorting the array by TID, wouldn't it be enough to sort by
> just the block numbers?

I don't understand. Yeah, I guess that we could do our initial sort of
the deltids array (the heap_tid_shellsort() call) just using
BlockNumber (not TID). But OTOH there might be some small locality
benefit to doing a proper TID sort at the level of each heap page. And
even if there isn't any such benefit, does it really matter?

If you are asking about the later sort of the block counts array
(which helps us sort the deltids array a second time, leaving it in
its final order for bottom-up deletion heapam.c processing), then the
answer is no. This block counts metadata array sort is useful because
it allows us to leave the deltids array in what I believe to be the
most useful order for processing. We'll access heap blocks primarily
based on the number of promising tuples (though as I go into below,
sometimes the number of promising tuples isn't a decisive influence on
processing order).

> >* While in general the presence of promising tuples (the hint that 
> > index
> > +  * AMs provide) is the best information that we have to go on, it is 
> > based
> > +  * on simple heuristics about duplicates in indexes that are 
> > understood to
> > +  * have specific flaws.  We should not let the most promising heap 
> > pages
> > +  * win or lose on the basis of _relatively_ small differences in the 
> > total
> > +  * number of promising tuples.  Small differences between the most
> > +  * promising few heap pages are effectively ignored by applying this
> > +  * power-of-two bucketing scheme.
> > +  *
>
> What are the "specific flaws"?

Re: Setof RangeType returns

2020-12-01 Thread Patrick Handja

Just figured out. I think I can use RANGE_EMPTY and it will be like:

> typcache =range_get_typcache(fcinfo, RangeTypeGetOid(RANGE_EMPTY));

Regards,

*Andjasubu Bungama, Patrick *



Le mar. 1 déc. 2020 à 16:42, Patrick Handja  a
écrit :

> Thanks for the feedback Tom!
>
> > TypeCacheEntry *typcache;
> > PG_RETURN_RANGE_P(range_serialize(typcache, &lower, &upper, false));
>
> The use of typcache really confuses me. range_get_typcache() is used in
> order to initialize typcache
> > typcache =range_get_typcache(fcinfo, RangeTypeGetOid(r1));
>
> In my case, I do not have a range as an argument, I am receiving 2 int,
> which I am using to create a range. How can I initialize typcache in this
> case?
> That's the part where I am really stuck.
>
> Datum
> make_range_griis(PG_FUNCTION_ARGS){
> RangeBound lower;
> RangeBound upper;
>
> int32   start = PG_GETARG_INT32(0);
> int32   finish = PG_GETARG_INT32(1);
>
> lower.val = (Datum) (start);
> lower.infinite = false;
> lower.lower = true;
>
> upper.val = (Datum) (finish);
> upper.infinite = false;
> upper.lower = false;
>
> if (!lower.infinite && !lower.inclusive){
> lower.val = DirectFunctionCall2(int4pl, lower.val, Int32GetDatum(1));
> lower.inclusive = true;
> }
>
> if (!upper.infinite && upper.inclusive){
> upper.val = DirectFunctionCall2(int4pl, upper.val, Int32GetDatum(1));
> upper.inclusive = false;
> }
>
> TypeCacheEntry *typcache;
> //> typcache = ??;
> PG_RETURN_RANGE_P(range_serialize(typcache, &lower, &upper, false));
> }
>
> regards,
>
> *Andjasubu Bungama, Patrick *
>
>
>
>
> Le ven. 27 nov. 2020 à 14:24, Tom Lane  a écrit :
>
>> Patrick Handja  writes:
>> > This is what I am doing:
>>
>> > static int
>> > get_range_lower(FunctionCallInfo fcinfo, RangeType *r1)
>> > {
>> > /* Return NULL if there's no finite lower bound */
>> > if (empty || lower.infinite)
>> > PG_RETURN_NULL();
>>
>> You can't really use PG_RETURN_NULL here, mainly because there is
>> no good value for it to return from get_range_lower().
>>
>> > return (lower.val);
>>
>> Here and elsewhere, you're cavalierly casting between Datum and int.
>> While you can get away with that as long as the SQL type you're
>> working with is int4, it's bad style; mainly because it's confusing,
>> but also because you'll have a hard time adapting the code if you
>> ever want to work with some other type.  Use DatumGetInt32 or
>> Int32GetDatum as appropriate.
>>
>> > TypeCacheEntry *typcache;
>> > PG_RETURN_RANGE_P(range_serialize(typcache, &lower, &upper, false));
>>
>> This sure appears to be passing an uninitialized typcache pointer
>> to range_serialize().  If your compiler isn't whining about that,
>> you don't have adequately paranoid warning options enabled.
>> That's an excellent way to waste a lot of time, as you just have.
>> C is an unforgiving language, so you need all the help you can get.
>>
>> BTW, use of PG_RETURN_RANGE_P here isn't very appropriate either,
>> since the function is not declared as returning Datum.
>>
>> regards, tom lane
>>
>

Re: error_severity of brin work item

2020-12-01 Thread Justin Pryzby

On Tue, Dec 01, 2020 at 03:57:24PM -0300, Alvaro Herrera wrote:
> On 2020-Dec-01, Justin Pryzby wrote:
> 
> > This was an idea I made up - I don't know any of the details of this, but if
> > you give a hint I could look at it more.  There'd (still) be a race window, 
> > but
> > I think that's ok.
> 
> See CommitTransaction() and friends, where AtEOXact_on_commit_actions()

..thanks.  I was going to persue the other idea first since this is new to me.

> Not sure what's a good place for OIDs to be added to the list.  We don't
> have AM-specific entry points for relation drop.  I think this is the
> weakest point of this.

I assume it would just add OIDs of *all* dropped rels, and the autovacuum half
would silently ignore any OID for which there's no work item.  (As it would do
in any case).

> > Another idea is if perform_work_item() were responsible for discarding
> > relations which disappear.  Currently it does this, which is racy since it
> > holds no lock.
> 
> That has the property that it remains contained in autovacuum.c, but no
> other advantages I think.

It has the advantage that it moves all the try_open stuff out of brin.

I started implementing this, and then realized that the try_open stuff *has* to
be in the brin_summarize function, to handle the case that someone passes a
non-index, since it's SQL exposed.
So maybe we should use your LockOid patch now, and refactor in the future if we
add additional work-item types.

-- 
Justin

Re: Setof RangeType returns

On 12/01/20 17:19, Patrick Handja wrote:
> Just figured out. I think I can use RANGE_EMPTY and it will be like:
> 
>> typcache =range_get_typcache(fcinfo, RangeTypeGetOid(RANGE_EMPTY));

Are you sure you are not simply looking for
range_get_typcache(fcinfo, INT4RANGEOID) ?

Top-posting is not used for replies on these lists.

Regards,
-Chap

Recent eelpout failures on 9.x branches

For about a week, eelpout has been failing the pg_basebackup test
more often than not, but only in the 9.5 and 9.6 branches:

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=eelpout&br=REL9_6_STABLE
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=eelpout&br=REL9_5_STABLE

The failures all look pretty alike:

# Running: pg_basebackup -D 
/home/tmunro/build-farm/buildroot/REL9_6_STABLE/pgsql.build/src/bin/pg_basebackup/tmp_check/tmp_test_jJOm/backupxs
 -X stream
pg_basebackup: could not send copy-end packet: server closed the connection 
unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
pg_basebackup: child process exited with exit code 1
not ok 44 - pg_basebackup -X stream runs

What shows up in the postmaster log is

2020-12-02 09:04:53.064 NZDT [29536:1] [unknown] LOG:  connection received: 
host=[local]
2020-12-02 09:04:53.065 NZDT [29536:2] [unknown] LOG:  replication connection 
authorized: user=tmunro
2020-12-02 09:04:53.175 NZDT [29537:1] [unknown] LOG:  connection received: 
host=[local]
2020-12-02 09:04:53.178 NZDT [29537:2] [unknown] LOG:  replication connection 
authorized: user=tmunro
2020-12-02 09:05:42.860 NZDT [29502:2] LOG:  using stale statistics instead of 
current ones because stats collector is not responding
2020-12-02 09:05:53.074 NZDT [29542:1] LOG:  using stale statistics instead of 
current ones because stats collector is not responding
2020-12-02 09:05:53.183 NZDT [29537:3] pg_basebackup LOG:  terminating 
walsender process due to replication timeout
2020-12-02 09:05:53.183 NZDT [29537:4] pg_basebackup LOG:  disconnection: 
session time: 0:01:00.008 user=tmunro database= host=[local]
2020-12-02 09:06:33.996 NZDT [29536:3] pg_basebackup LOG:  disconnection: 
session time: 0:01:40.933 user=tmunro database= host=[local]

The "using stale statistics" gripes seem to be from autovacuum, so they
may be unrelated to the problem; but they suggest that the system
is under very heavy load, or else that there's some kernel-level issue.
Note however that some of the failures don't have those messages, and
I also see those messages in some runs that didn't fail.

Perhaps this is just a question of the machine being too slow to complete
the test, in which case we ought to raise wal_sender_timeout.  But it's
weird that it would've started to fail just now, because I don't really
see any changes in those branches that would explain a week-old change
in the test runtime.

Any thoughts?

regards, tom lane

Re: Setof RangeType returns

Patrick Handja  writes:
> In my case, I do not have a range as an argument, I am receiving 2 int,
> which I am using to create a range. How can I initialize typcache in this
> case?

You should be passing down the pointer from the outer function, which
does have it at hand, no?

regards, tom lane

Re: Recent eelpout failures on 9.x branches

2020-12-01 Thread Thomas Munro

On Wed, Dec 2, 2020 at 11:36 AM Tom Lane  wrote:
> Perhaps this is just a question of the machine being too slow to complete
> the test, in which case we ought to raise wal_sender_timeout.  But it's
> weird that it would've started to fail just now, because I don't really
> see any changes in those branches that would explain a week-old change
> in the test runtime.

Unfortunately, eelpout got kicked off the nice shiny ARM server it was
running on so last week I moved it to a rack mounted Raspberry Pi.  It
seems to be totally I/O starved causing some timeouts to be reached,
and I'm looking into fixing that by adding fast storage.  This may
take a few days.  Sorry for the noise.

Re: proposal: unescape_text function

2020-12-01 Thread Andrew Dunstan



On 11/30/20 8:14 AM, Peter Eisentraut wrote:
> On 2020-11-29 18:36, Pavel Stehule wrote:
>>
>>     I don't really get the point of this function.  There is AFAICT no
>>     function to produce this escaped format, and it's not a recognized
>>     interchange format.  So under what circumstances would one need to
>>     use this?
>>
>>
>> Some corporate data can be in CSV format with escaped unicode
>> characters. Without this function it is not possible to decode these
>> files without external application.
>
> I would like some supporting documentation on this.  So far we only
> have one stackoverflow question, and then this implementation, and
> they are not even the same format.  My worry is that if there is not
> precise specification, then people are going to want to add things in
> the future, and there will be no way to analyze such requests in a
> principled way.
>
>
>


Also, should this be an extension? I'm dubious about including such
marginal uses in the core code unless there's a really good case for it.


cheers


andrew

Re: Recent eelpout failures on 9.x branches

Thomas Munro  writes:
> Unfortunately, eelpout got kicked off the nice shiny ARM server it was
> running on so last week I moved it to a rack mounted Raspberry Pi.  It
> seems to be totally I/O starved causing some timeouts to be reached,
> and I'm looking into fixing that by adding fast storage.  This may
> take a few days.  Sorry for the noise.

Ah-hah.  Now that I look, eelpout is very clearly slower overall
than it was a couple weeks ago, so all is explained.

It might still be reasonable to raise wal_sender_timeout in the
buildfarm environment, though.  We usually try to make sure that
buildfarm timeouts border on ridiculous, not just because of
underpowered critters but also for cases like CLOBBER_CACHE_ALWAYS
animals.

I'm also wondering a bit why the issue isn't affecting the newer
branches.  It's certainly not because we made the test shorter ...

regards, tom lane

Re: room for improvement in amcheck btree checking?

On 2020-Dec-01, Mark Dilger wrote:

> 7) Run a SQL query that uses an index scan on the table and see that it 
> errors with something like:
> 
>ERROR:  could not read block 0 in file "base/13097/16391": read only 0 of 
> 8192 bytes
> 
> I found it surprising that even when precisely zero of the tids in the
> index exist in the table the index checks all come back clean.

Yeah, I've seen this kind of symptom in production databases (indexes
pointing to non-existant heap pages).

I think one useful cross-check that amcheck could do, is verify that if
a heap page is referenced from the index, then the heap page must exist.
Otherwise, it's a future index corruption of sorts: the old index
entries will point to the wrong new heap tuples as soon as the table
grows again.

Re: [DOC] Document concurrent index builds waiting on each other

On 2020-Nov-30, James Coleman wrote:

> On Mon, Nov 30, 2020 at 4:53 PM Alvaro Herrera  
> wrote:
> >
> > On 2020-Sep-30, Michael Paquier wrote:

> > Yeah, I think it might be more sensible to document this in
> > maintenance.sgml, as part of the paragraph that discusses removing
> > tuples "to save space".  But making it inline with the rest of the flow,
> > it seems to distract from higher-level considerations, so I suggest to
> > make it a footnote instead.
> 
> I have mixed feelings about wholesale moving it; users aren't likely
> to read the vacuum doc when considering how running CIC might impact
> their system, though I do understand why it otherwise fits there.

Makes sense.  ISTM that if we want to have a cautionary blurb CIC docs,
it should go in REINDEX CONCURRENTLY as well.

> > I'm not sure on the wording to use; what about this?
> 
> The wording seems fine to me.

Great, thanks.

> This is a replacement for what was 0002 earlier? And 0001 from earlier
> still seems to be a useful standalone patch?

0001 is the one that I got pushed yesterday, I think -- correct?
src/tools/git_changelog says:

Author: Alvaro Herrera 
Branch: master [58ebe967f] 2020-11-30 18:24:55 -0300
Branch: REL_13_STABLE [3fe0e7c3f] 2020-11-30 18:24:55 -0300
Branch: REL_12_STABLE [b2603f16a] 2020-11-30 18:24:55 -0300
Branch: REL_11_STABLE [ed9c9b033] 2020-11-30 18:24:55 -0300
Branch: REL_10_STABLE [d3bd36a63] 2020-11-30 18:24:55 -0300
Branch: REL9_6_STABLE [b3d33bf59] 2020-11-30 18:24:55 -0300
Branch: REL9_5_STABLE [968a537b4] 2020-11-30 18:24:55 -0300

Document concurrent indexes waiting on each other

Because regular CREATE INDEX commands are independent, and there's no
logical data dependency, it's not immediately obvious that transactions
held by concurrent index builds on one table will block the second phase
of concurrent index creation on an unrelated table, so document this
caveat.

Backpatch this all the way back.  In branch master, mention that only
some indexes are involved.

Author: James Coleman 
Reviewed-by: David Johnston 
Discussion: 
https://postgr.es/m/CAAaqYe994=purn8cjz4ueo_s-ffrr_3ogeryhtdghab2wg_...@mail.gmail.com

Re: room for improvement in amcheck btree checking?

2020-12-01 Thread Peter Geoghegan

On Tue, Dec 1, 2020 at 12:39 PM Mark Dilger
 wrote:
> I found it surprising that even when precisely zero of the tids in the index 
> exist in the table the index checks all come back clean.  The heapallindexed 
> check is technically running as advertised, checking that all of the zero 
> tuples in the heap are present in the index.  That is a pretty useless check 
> under this condition, though.  Is a "indexallheaped" option (by some less 
> crazy name) needed?
>
> Users might also run into this problem when a heap relation file gets 
> erroneously shortened by some number of blocks but not fully truncated, or 
> perhaps with torn page writes.

It would probably be possible to detect this exact condition (though
not other variants of the more general problem) relatively easily.
Perhaps by doing something with RelationOpenSmgr() with the heap
relation, while applying a little knowledge of what must be true about
the heap relation given what is known about the index relation. (I'm
not 100% sure that that's possible, but offhand it seems like it
probably is.)

See also: commit 6754fe65a4c, which tightened things up in this area
for the index relation itself.

> Have you already considered and rejected a "indexallheaped" type check?

Actually, I pointed out that we should do something along these lines
very recently:

https://postgr.es/m/CAH2-Wz=dy--fg5ij0kpcqums0w5g+xqed3t-7he+uqak_hm...@mail.gmail.com

I'd be happy to see you pursue it.

> Background
> ---
>
> I have up until recently been focused on corruption caused by twiddling the 
> bits within heap and index relation pages, but real-world user error, file 
> system error, and perhaps race conditions in the core postgresql code seem at 
> least as likely to result in missing or incorrect versions of blocks of 
> relation files rather than individual bytes within those blocks being wrong.  
> Per our discussions in [3], not all corruptions that can be created under 
> laboratory conditions are equally likely to occur in the wild, and it may be 
> reasonable to only harden the amcheck code against corruptions that are more 
> likely to happen in actual practice.

Right. I also like to harden amcheck in ways that happen to be easy,
especially when it seems like it might kind of work as a broad
defense that doesn't consider any particular scenario. For example,
hardening to make sure the an index tuple's lp_len matches
IndexTupleSize() for the tuple proper (this also kind of works as
storage format documentation). I am concerned about both costs and
benefits.

FWIW, this is a perfect example of the kind of hardening that makes
sense to me. Clearly this kind of failure is a reasonably plausible
scenario (probably even a known real world scenario with catalog
corruption), and clearly we could do something about it that's pretty
simple and obviously low risk. It easily meets the standard that I
might apply here.

--
Peter Geoghegan

Re: [DOC] Document concurrent index builds waiting on each other

2020-12-01 Thread James Coleman

On Tue, Dec 1, 2020 at 6:51 PM Alvaro Herrera  wrote:
>
> On 2020-Nov-30, James Coleman wrote:
>
> > On Mon, Nov 30, 2020 at 4:53 PM Alvaro Herrera  
> > wrote:
> > >
> > > On 2020-Sep-30, Michael Paquier wrote:
>
> > > Yeah, I think it might be more sensible to document this in
> > > maintenance.sgml, as part of the paragraph that discusses removing
> > > tuples "to save space".  But making it inline with the rest of the flow,
> > > it seems to distract from higher-level considerations, so I suggest to
> > > make it a footnote instead.
> >
> > I have mixed feelings about wholesale moving it; users aren't likely
> > to read the vacuum doc when considering how running CIC might impact
> > their system, though I do understand why it otherwise fits there.
>
> Makes sense.  ISTM that if we want to have a cautionary blurb CIC docs,
> it should go in REINDEX CONCURRENTLY as well.

Agreed. Or, alternatively, a blurb something like "Please note how CIC
interacts with VACUUM ...", and then the primary language in
maintenance.sgml. That would have the benefit of maintaining the core
language in only one place.

> > > I'm not sure on the wording to use; what about this?
> >
> > The wording seems fine to me.
>
> Great, thanks.
>
> > This is a replacement for what was 0002 earlier? And 0001 from earlier
> > still seems to be a useful standalone patch?
>
> 0001 is the one that I got pushed yesterday, I think -- correct?
> src/tools/git_changelog says:
>
> Author: Alvaro Herrera 
> Branch: master [58ebe967f] 2020-11-30 18:24:55 -0300
> Branch: REL_13_STABLE [3fe0e7c3f] 2020-11-30 18:24:55 -0300
> Branch: REL_12_STABLE [b2603f16a] 2020-11-30 18:24:55 -0300
> Branch: REL_11_STABLE [ed9c9b033] 2020-11-30 18:24:55 -0300
> Branch: REL_10_STABLE [d3bd36a63] 2020-11-30 18:24:55 -0300
> Branch: REL9_6_STABLE [b3d33bf59] 2020-11-30 18:24:55 -0300
> Branch: REL9_5_STABLE [968a537b4] 2020-11-30 18:24:55 -0300
>
> Document concurrent indexes waiting on each other
>
> Because regular CREATE INDEX commands are independent, and there's no
> logical data dependency, it's not immediately obvious that transactions
> held by concurrent index builds on one table will block the second phase
> of concurrent index creation on an unrelated table, so document this
> caveat.
>
> Backpatch this all the way back.  In branch master, mention that only
> some indexes are involved.
>
> Author: James Coleman 
> Reviewed-by: David Johnston 
> Discussion: 
> https://postgr.es/m/CAAaqYe994=purn8cjz4ueo_s-ffrr_3ogeryhtdghab2wg_...@mail.gmail.com

Ah, yes, somehow I'd missed that that had been pushed.

James

Re: Allow some recovery parameters to be changed with reload





On 2020/11/27 12:05, Kyotaro Horiguchi wrote:

At Fri, 27 Nov 2020 09:48:25 +0900, Fujii Masao  
wrote in



On 2020/11/27 9:30, Kyotaro Horiguchi wrote:

At Thu, 26 Nov 2020 22:43:48 +0900, Fujii Masao
 wrote in



On 2020/11/12 4:38, Sergei Kornilov wrote:

Hello


Anyway, for now I think that your first patch would be enough, i.e.,
just change the context of restore_command to PGC_SIGHUP.

Glad to hear. Attached a rebased version of the original proposal.


Thanks for rebasing the patch!

  This parameter is required for archive recovery,

I found the above description in config.sgml. I was just wondering
if it should be updated so that the actual specification is described
or not.
The actual spec is that restore_command is required to start archive
recovery, but optional (i.e., the parameter can be reset to an empty)
after archive recovery has started. But this updated version of
description would be rather confusing to users. So I'm now thinking
not to update that.

Does anyone object to the patch? If no, I'm thinking to commit the
patch.

Although I don't object to make the parameter reloadable, I think it
needs to be documented that server could stop after reloading if the
server failed to execute the new command line.


You mean that we should document that if restore_command is set to
improper command mistakenly, archive recovery may fail to restore some
archived WAL files and finish without replaying those WAL? But isn't
this true even without applying the patch?


If we set a wrong command in .conf and start the server in recovery
mode, the server immediately stops. It's fine.  If we change
restore_command wrong way on a running server, "pg_ctl reload" stops
the server.  If it is a hot standby, the server stops unexpectedly.

However, after rechecking, I found that recovery_end_command with
wrong content causes server stop at the end of recovery, or at
promotion. And that variable is PGC_SIGHUP.

So I agree not to document that.  Sorry for the noise.


OK, so I pushed the patch. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: room for improvement in amcheck btree checking?

2020-12-01 Thread Mark Dilger




> On Dec 1, 2020, at 4:38 PM, Peter Geoghegan  wrote:
> 
> On Tue, Dec 1, 2020 at 12:39 PM Mark Dilger
>  wrote:
>> I found it surprising that even when precisely zero of the tids in the index 
>> exist in the table the index checks all come back clean.  The heapallindexed 
>> check is technically running as advertised, checking that all of the zero 
>> tuples in the heap are present in the index.  That is a pretty useless check 
>> under this condition, though.  Is a "indexallheaped" option (by some less 
>> crazy name) needed?
>> 
>> Users might also run into this problem when a heap relation file gets 
>> erroneously shortened by some number of blocks but not fully truncated, or 
>> perhaps with torn page writes.
> 
> It would probably be possible to detect this exact condition (though
> not other variants of the more general problem) relatively easily.
> Perhaps by doing something with RelationOpenSmgr() with the heap
> relation, while applying a little knowledge of what must be true about
> the heap relation given what is known about the index relation. (I'm
> not 100% sure that that's possible, but offhand it seems like it
> probably is.)
> 
> See also: commit 6754fe65a4c, which tightened things up in this area
> for the index relation itself.

Yes, some of the test code I've been playing with already hit the error 
messages introduced in that commit.

>> Have you already considered and rejected a "indexallheaped" type check?
> 
> Actually, I pointed out that we should do something along these lines
> very recently:
> 
> https://postgr.es/m/CAH2-Wz=dy--fg5ij0kpcqums0w5g+xqed3t-7he+uqak_hm...@mail.gmail.com

Right.

> I'd be happy to see you pursue it.

I'm not sure how much longer I'll be pursuing corruption checkers before 
switching to another task.  Certainly, I'm interested in pursuing this if time 
permits.

>> Background
>> ---
>> 
>> I have up until recently been focused on corruption caused by twiddling the 
>> bits within heap and index relation pages, but real-world user error, file 
>> system error, and perhaps race conditions in the core postgresql code seem 
>> at least as likely to result in missing or incorrect versions of blocks of 
>> relation files rather than individual bytes within those blocks being wrong. 
>>  Per our discussions in [3], not all corruptions that can be created under 
>> laboratory conditions are equally likely to occur in the wild, and it may be 
>> reasonable to only harden the amcheck code against corruptions that are more 
>> likely to happen in actual practice.
> 
> Right. I also like to harden amcheck in ways that happen to be easy,
> especially when it seems like it might kind of work as a broad
> defense that doesn't consider any particular scenario. For example,
> hardening to make sure the an index tuple's lp_len matches
> IndexTupleSize() for the tuple proper (this also kind of works as
> storage format documentation). I am concerned about both costs and
> benefits.
> 
> FWIW, this is a perfect example of the kind of hardening that makes
> sense to me. Clearly this kind of failure is a reasonably plausible
> scenario (probably even a known real world scenario with catalog
> corruption), and clearly we could do something about it that's pretty
> simple and obviously low risk. It easily meets the standard that I
> might apply here.

Ok, it seems we're in general agreement about that.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

RE: Disable WAL logging to speed up data loading

2020-12-01 Thread tsunakawa.ta...@fujitsu.com

From: Osumi, Takamichi/大墨 昂道 
> I executed each wal_level three times and calculated the average time
> and found that disabling WAL logging reduced about 73 % of the minimal's
> loading speed
> in this test. This speed-up came from the difference of generated WAL sizes.

So, it's 4x speedup when the WAL buffer and/or disk is likely to be saturated.  
That's nice!


> In this test, to load the data generated more than 10GB of WALs with
> wal_level=minimal
> while wal_level=none emits just 760 bytes of WALs.
> 
> I expect this size for none will become smaller when
> I take the modification to filter out the types of WAL which is discussed 
> above.

I don't expect so, because the WAL volume is already very small in this 
workload (probably only the commit records.)


> Also, I monitored numbers of iostat's 'util' and noticed that
> util's spike to use I/O reduced from twice to once when I changed the level
> from minimal to none, which should be the effect of the patch.

It's a bit mysterious.  I thought %util would be constantly 100% when wal_level 
= minimal.  That may be because wal_buffers is large.


Regards
Takayuki Tsunakawa

Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2

2020-12-01 Thread Michael Paquier

On Tue, Dec 01, 2020 at 10:10:45AM +0100, Daniel Gustafsson wrote:
> That sounds like it would work.  Since the cryptohash implementation owns and
> controls the void *data contents it can store whatever it needs to properly
> free it.

That seems to work properly.  I have injected some elog(ERROR) with
the attached in some of the SQL functions from cryptohashes.c and the
cleanup happens with the correct resowner references.  It felt a bit
strange to use directly the term EVP in resowner.c once I did this
switch, so this is renamed to "CryptoHash" instead.

> Reading through the v6 patch I see nothing sticking out and all review 
> comments
> addressed, +1 on applying that one and then we'll take if from there with the
> remaining ones in the patchset.

Thanks.  0001 has been applied and the buildfarm does not complain, so
it looks like we are good (I'll take care of any issues, like the one
Fujii-san has just reported).  Attached are new patches for 0002, the
EVP switch.  One thing I noticed is that we need to free the backup
manifest a bit earlier once we begin to use resource owner in
basebackup.c as there is a specific step that may do a double-free.
This would not happen when not using OpenSSL or on HEAD.  It would be
easy to separate the resowner and cryptohash portions of the patch
here, but both are tightly linked, so I'd prefer to keep them
together.

Another thing to note is that this makes 0003 for pgcrypto
meaningless, because this would track down only states created by
cryptohash_openssl.c, and not directly EVP_MD_CTX.  Consistency in the
backend code is for the best, and considering that pgcrypto has
similar linked lists for ciphers it feels a bit weird to switch only
half of it to use something in resowner.c.  So, I am not sure if we
need to do anything here, and I am discarding this part.
--
Michael
From d2c9f2a9c9a36d80a014d588334e21a709a1c27d Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 2 Dec 2020 11:51:43 +0900
Subject: [PATCH v7] Switch cryptohash_openssl.c to use EVP

Postgres is two decades late for this switch.
---
 src/include/utils/resowner_private.h  |   7 ++
 src/backend/replication/basebackup.c  |   8 +-
 src/backend/utils/resowner/resowner.c |  62 
 src/common/cryptohash_openssl.c   | 132 --
 src/tools/pgindent/typedefs.list  |   1 +
 5 files changed, 158 insertions(+), 52 deletions(-)

diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index a781a7a2aa..c373788bc1 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -95,4 +95,11 @@ extern void ResourceOwnerRememberJIT(ResourceOwner owner,
 extern void ResourceOwnerForgetJIT(ResourceOwner owner,
    Datum handle);
 
+/* support for cryptohash context management */
+extern void ResourceOwnerEnlargeCryptoHash(ResourceOwner owner);
+extern void ResourceOwnerRememberCryptoHash(ResourceOwner owner,
+			Datum handle);
+extern void ResourceOwnerForgetCryptoHash(ResourceOwner owner,
+		  Datum handle);
+
 #endif			/* RESOWNER_PRIVATE_H */
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 22be7ca9d5..79b458c185 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -729,11 +729,17 @@ perform_base_backup(basebackup_options *opt)
  errmsg("checksum verification failure during base backup")));
 	}
 
+	/*
+	 * Make sure to free the manifest before the resource owners as
+	 * manifests use cryptohash contexts that may depend on resource
+	 * owners (like OpenSSL).
+	 */
+	FreeBackupManifest(&manifest);
+
 	/* clean up the resource owner we created */
 	WalSndResourceCleanup(true);
 
 	pgstat_progress_end_command();
-	FreeBackupManifest(&manifest);
 }
 
 /*
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 8bc2c4e9ea..546ad8d1c5 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -20,6 +20,7 @@
  */
 #include "postgres.h"
 
+#include "common/cryptohash.h"
 #include "common/hashfn.h"
 #include "jit/jit.h"
 #include "storage/bufmgr.h"
@@ -128,6 +129,7 @@ typedef struct ResourceOwnerData
 	ResourceArray filearr;		/* open temporary files */
 	ResourceArray dsmarr;		/* dynamic shmem segments */
 	ResourceArray jitarr;		/* JIT contexts */
+	ResourceArray cryptohasharr;	/* cryptohash contexts */
 
 	/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
 	int			nlocks;			/* number of owned locks */
@@ -175,6 +177,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintCryptoHashLeakWarning(Datum handle);
 
 
 /*
@@ -444,6 +447,7 @@ Re

Re: Recent eelpout failures on 9.x branches

2020-12-01 Thread Noah Misch

On Tue, Dec 01, 2020 at 06:07:17PM -0500, Tom Lane wrote:
> Thomas Munro  writes:
> > Unfortunately, eelpout got kicked off the nice shiny ARM server it was
> > running on so last week I moved it to a rack mounted Raspberry Pi.  It
> > seems to be totally I/O starved causing some timeouts to be reached,
> > and I'm looking into fixing that by adding fast storage.  This may
> > take a few days.  Sorry for the noise.
> 
> Ah-hah.  Now that I look, eelpout is very clearly slower overall
> than it was a couple weeks ago, so all is explained.
> 
> It might still be reasonable to raise wal_sender_timeout in the
> buildfarm environment, though.  We usually try to make sure that
> buildfarm timeouts border on ridiculous, not just because of
> underpowered critters but also for cases like CLOBBER_CACHE_ALWAYS
> animals.

My buildfarm animals override these:

extra_config =>{
DEFAULT => [
"authentication_timeout = '600s'",
"wal_receiver_timeout = '18000s'",
"wal_sender_timeout = '18000s'",
],
},
build_env =>{
PGCTLTIMEOUT => 18000,
},

Each of those timeouts caused failures before I changed it.  For animals fast
enough to make them irrelevant, I've not yet encountered a disadvantage.

> I'm also wondering a bit why the issue isn't affecting the newer
> branches.  It's certainly not because we made the test shorter ...

That is peculiar.

Re: autovac issue with large number of tables

2020-12-01 Thread Masahiko Sawada

On Tue, Dec 1, 2020 at 5:31 PM Masahiko Sawada  wrote:
>
> On Tue, Dec 1, 2020 at 4:32 PM Fujii Masao  
> wrote:
> >
> >
> >
> > On 2020/12/01 16:23, Masahiko Sawada wrote:
> > > On Tue, Dec 1, 2020 at 1:48 PM Kasahara Tatsuhito
> > >  wrote:
> > >>
> > >> Hi,
> > >>
> > >> On Mon, Nov 30, 2020 at 8:59 PM Fujii Masao 
> > >>  wrote:
> > >>>
> > >>>
> > >>>
> > >>> On 2020/11/30 10:43, Masahiko Sawada wrote:
> >  On Sun, Nov 29, 2020 at 10:34 PM Kasahara Tatsuhito
> >   wrote:
> > >
> > > Hi, Thanks for you comments.
> > >
> > > On Fri, Nov 27, 2020 at 9:51 PM Fujii Masao 
> > >  wrote:
> > >>
> > >>
> > >>
> > >> On 2020/11/27 18:38, Kasahara Tatsuhito wrote:
> > >>> Hi,
> > >>>
> > >>> On Fri, Nov 27, 2020 at 1:43 AM Fujii Masao 
> > >>>  wrote:
> > 
> > 
> > 
> >  On 2020/11/26 10:41, Kasahara Tatsuhito wrote:
> > > On Wed, Nov 25, 2020 at 8:46 PM Masahiko Sawada 
> > >  wrote:
> > >>
> > >> On Wed, Nov 25, 2020 at 4:18 PM Kasahara Tatsuhito
> > >>  wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>> On Wed, Nov 25, 2020 at 2:17 PM Masahiko Sawada 
> > >>>  wrote:
> > 
> >  On Fri, Sep 4, 2020 at 7:50 PM Kasahara Tatsuhito
> >   wrote:
> > >
> > > Hi,
> > >
> > > On Wed, Sep 2, 2020 at 2:10 AM Kasahara Tatsuhito
> > >  wrote:
> > >>> I wonder if we could have table_recheck_autovac do two 
> > >>> probes of the stats
> > >>> data.  First probe the existing stats data, and if it shows 
> > >>> the table to
> > >>> be already vacuumed, return immediately.  If not, *then* 
> > >>> force a stats
> > >>> re-read, and check a second time.
> > >> Does the above mean that the second and subsequent 
> > >> table_recheck_autovac()
> > >> will be improved to first check using the previous refreshed 
> > >> statistics?
> > >> I think that certainly works.
> > >>
> > >> If that's correct, I'll try to create a patch for the PoC
> > >
> > > I still don't know how to reproduce Jim's troubles, but I was 
> > > able to reproduce
> > > what was probably a very similar problem.
> > >
> > > This problem seems to be more likely to occur in cases where 
> > > you have
> > > a large number of tables,
> > > i.e., a large amount of stats, and many small tables need 
> > > VACUUM at
> > > the same time.
> > >
> > > So I followed Tom's advice and created a patch for the PoC.
> > > This patch will enable a flag in the table_recheck_autovac 
> > > function to use
> > > the existing stats next time if VACUUM (or ANALYZE) has 
> > > already been done
> > > by another worker on the check after the stats have been 
> > > updated.
> > > If the tables continue to require VACUUM after the refresh, 
> > > then a refresh
> > > will be required instead of using the existing statistics.
> > >
> > > I did simple test with HEAD and HEAD + this PoC patch.
> > > The tests were conducted in two cases.
> > > (I changed few configurations. see attached scripts)
> > >
> > > 1. Normal VACUUM case
> > >   - SET autovacuum = off
> > >   - CREATE tables with 100 rows
> > >   - DELETE 90 rows for each tables
> > >   - SET autovacuum = on and restart PostgreSQL
> > >   - Measure the time it takes for all tables to be 
> > > VACUUMed
> > >
> > > 2. Anti wrap round VACUUM case
> > >   - CREATE brank tables
> > >   - SELECT all of these tables (for generate stats)
> > >   - SET autovacuum_freeze_max_age to low values and 
> > > restart PostgreSQL
> > >   - Consumes a lot of XIDs by using txid_curent()
> > >   - Measure the time it takes for all tables to be 
> > > VACUUMed
> > >
> > > For each test case, the following results were obtained by 
> > > changing
> > > autovacuum_max_workers parameters to 1, 2, 3(def) 5 and 10.
> > > Also changing num of tables to 1000, 5000, 1 and 2.
> > >
> > > Due to the poor VM environment (2 VCPU/4 GB), the results are 
> > > a little unstable,
> > > but I think it's enough to ask for a trend.
> > >
> > > ==

Re: Improving spin-lock implementation on ARM.

On Tue, 1 Dec 2020 at 22:19, Tom Lane  wrote:

> Alexander Korotkov  writes:
> > On Tue, Dec 1, 2020 at 6:19 PM Krunal Bauskar 
> wrote:
> >> I would request you guys to re-think it from this perspective to help
> ensure that PGSQL can scale well on ARM.
> >> s_lock becomes a top-most function and LSE is not a universal solution
> but CAS surely helps ease the main bottleneck.
>
> > CAS patch isn't proven to be a universal solution as well.  We have
> > tested the patch on just a few processors, and Tom has seen the
> > regression [1].  The benchmark used by Tom was artificial, but the
> > results may be relevant for some real-life workload.
>
> Yeah.  I think that the main conclusion from what we've seen here is
> that on smaller machines like M1, a standard pgbench benchmark just
> isn't capable of driving PG into serious spinlock contention.  (That
> reflects very well on the work various people have done over the years
> to get rid of spinlock contention, because ten or so years ago it was
> a huge problem on this size of machine.  But evidently, not any more.)
> Per the results others have posted, nowadays you need dozens of cores
> and hundreds of client threads to measure any such issue with pgbench.
>
> So that is why I experimented with a special test that does nothing
> except pound on one spinlock.  Sure it's artificial, but if you want
> to see the effects of different spinlock implementations then it's
> just too hard to get any results with pgbench's regular scripts.
>
> And that's why it disturbs me that the CAS-spinlock patch showed up
> worse in that environment.  The fact that it's not visible in the
> regular pgbench test just means that the effect is too small to
> measure in that test.  But in a test where we *can* measure an effect,
> it's not looking good.
>
> It would be interesting to see some results from the same test I did
> on other processors.  I suspect the results would look a lot different
> from mine ... but we won't know unless someone does it.  Or, if someone
> wants to propose some other test case, let's have a look.
>
> > I'm expressing just my personal opinion, other committers can have
> > different opinions.  I don't particularly think this topic is
> > necessarily a non-starter.  But I do think that given ambiguity we've
> > observed in the benchmark, much more research is needed to push this
> > topic forward.
>
> Yeah.  I'm not here to say "do nothing".  But I think we need results
> from more machines and more test cases to convince ourselves whether
> there's a consistent, worthwhile win from any specific patch.
>

I think there is
*an ambiguity with lse and that has been the*
*source of some confusion* so let's make another attempt to
understand all the observations and then define the next steps.

-


*1. CAS patch (applied on the baseline)*   - Kunpeng: 10-45% improvement
observed [1]
   - Graviton2: 30-50% improvement observed [2]
   - M1: Only select results are available cas continue to maintain a
marginal gain but not significant. [3]
 [inline with what we observed with Kunpeng and Graviton2 for select
results too].


*2. Let's ignore CAS for a sec and just think of LSE independently*   -
Kunpeng: regression observed
   - Graviton2: gain observed
   - M1: regression observed
 [while lse probably is default explicitly enabling it with +lse causes
regression on the head itself [4].
  client=2/4: 1816/714  vs   892/610]

   There is enough reason not to immediately consider enabling LSE given
its unable to perform consistently on all hardware.
-

With those 2 aspects clear let's evaluate what options we have in hand


*1. Enable CAS approach*   *- What we gain:* pgsql scale on
Kunpeng/Graviton2
 (m1 awaiting read-write result but may marginally scale  [[5]: "but
the patched numbers are only about a few percent better"])
   *- What we lose:* Nothing for now.


*2. LSE:*   *- What we gain: *Scaled workload with Graviton2
  * - What we lose:* regression on M1 and Kunpeng.

Let's think of both approaches independently.

- Enabling CAS would help us scale on all hardware (Kunpeng/Graviton2/M1)
- Enabling LSE would help us scale only on some but regress on others.
  [LSE could be considered in the future once it stabilizes and all
hardware adapts to it]

---

*Let me know what do you think about this analysis and any specific
direction that we should consider to help move forward.*

---

Links:
[1]:
https://www.postgresql.org/message-id/attachment/116612/Screenshot%20from%202020-12-01%2017-55-21.png
[2]: https://www.postgresql.org/message-id/attachment/116521/arm-rw.png
[3]:
https://www.postgresql.org/message-id/1367116.1606802480%40sss.pgh.pa.us
[4]:
https://www.postgresql.org/message-id

pg_stat_statements oddity with track = all

2020-12-01 Thread Julien Rouhaud

Hi,

Someone raised an interested point recently on pg_stat_kcache extension for
handling nested statements, which also applies to pg_stat_statements.

The root issue is that when pg_stat_statements tracks nested statements,
there's no way to really make sense of the counters, as top level statements
will also account for underlying statements.  Using a naive example:

=# CREATE FUNCTION f1() RETURNS VOID AS $$BEGIN PERFORM pg_sleep(5); END; $$ 
LANGUAGE plpgsql;
CREATE FUNCTION

=# SELECT pg_stat_statements_reset();
 pg_stat_statements_reset
--

(1 row)

=# SELECT f1();
 f1


(1 row)

=# select sum(total_exec_time) from pg_stat_statements;
 sum
--
 10004.403601
(1 row)

It looks like there was 10s total execution time overall all statements, which
doesn't really make sense.

It's of course possible to avoid that using track = top, but tracking all
nested statements is usually quite useful so it could be better to find a way
to better address that problem.

The only idea I have for that is to add a new field to entry key, for instance
is_toplevel.  The immediate cons is obviously that it could amplify quite a lot
the number of entries tracked, so people may need to increase
pg_stat_statements.max to avoid slowdown if that makes them reach frequent
entry eviction.

Should we address the problem, and in that case do you see a better way for
that, or should we just document this behavior?

Re: Add table access method as an option to pgbench

2020-12-01 Thread David Zhang


Thanks a lot for the comments, Fabien.

Some feedback about v3:

In the doc I find TABLEACCESSMETHOD quite hard to read. Use TABLEAM 
instead? 

Yes, in this case, *TABLEAM* is easy to read, updated.

Also, the next entry uses lowercase (tablespace), why change the style?
The original style is not so consistent, for example, in doc, 
--partition-method using *NAME*, but --table-space using *tablespace*; 
in help, --partition-method using *(rang|hash)*, but --tablespace using 
*TABLESPACE*. To keep it more consistent, I would rather use *TABLEAM* 
in both doc and help.
Remove space after comma in help string. I'd use the more readable 
TABLEAM in the help string rather than the mouthful.

Yes the *space* is removed after comma.


It looks that the option is *silently* ignored when creating a 
partitionned table, which currently results in an error, and not 
passed to partitions, which would accept them. This is pretty weird.
The input check is added with an error message when both partitions and 
table access method are used.


I'd suggest that the table am option should rather by changing the 
default instead, so that it would apply to everything relevant 
implicitely when appropriate.
I think this is a better idea, so in v4 patch I change it to something 
like "set default_table_access_method to *TABLEAM*" instead of using the 
*using* clause.


About tests :

They should also trigger failures, eg 
"--table-access-method=no-such-table-am", to be added to the @errors 
list.
No sure how to address this properly, since the table access method 
potentially can be *any* name.


I do not understand why you duplicated all possible option entry.

Test with just table access method looks redundant if the feature is 
make to work orthonogonally to partitions, which should be the case.
Only one positive test case added using *heap* as the table access 
method to verify the initialization.


By the way, I saw the same style for other variables, such as 
escape_tablespace, should this be fixed as well?


Nope, let it be.


--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca
From c25b55580b1b3f3faac63b4d72291c80fb2c9c1f Mon Sep 17 00:00:00 2001
From: David Zhang 
Date: Tue, 1 Dec 2020 20:47:01 -0800
Subject: [PATCH] add table access method as an option to pgbench

---
 doc/src/sgml/ref/pgbench.sgml| 10 +++
 src/bin/pgbench/pgbench.c| 31 +++-
 src/bin/pgbench/t/001_pgbench_with_server.pl | 12 
 3 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 7180fedd65..992fbbdb4b 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -359,6 +359,16 @@ pgbench  options 
 d
   
  
 
+ 
+  
--table-access-method=TABLEAM
+  
+   
+Create tables using the specified table access method, rather than the 
default.
+tablespace.
+   
+  
+ 
+ 
  
   
--tablespace=tablespace
   
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 3057665bbe..961841b255 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -184,10 +184,11 @@ doublethrottle_delay = 0;
 int64  latency_limit = 0;
 
 /*
- * tablespace selection
+ * tablespace and table access method selection
  */
 char  *tablespace = NULL;
 char  *index_tablespace = NULL;
+char  *table_access_method = NULL;
 
 /*
  * Number of "pgbench_accounts" partitions.  0 is the default and means no
@@ -641,6 +642,9 @@ usage(void)
   "  --partition-method=(range|hash)\n"
   "   partition pgbench_accounts with 
this method (default: range)\n"
   "  --partitions=NUM partition pgbench_accounts into 
NUM parts (default: 0)\n"
+  "  --table-access-method=TABLEAM\n"
+  "   create tables with using 
specified table access method,\n"
+  "   rather than the default.\n"
   "  --tablespace=TABLESPACE  create tables in the specified 
tablespace\n"
   "  --unlogged-tablescreate tables as unlogged 
tables\n"
   "\nOptions to select what to run:\n"
@@ -3747,6 +3751,20 @@ initCreateTables(PGconn *con)
 
fprintf(stderr, "creating tables...\n");
 
+   if (table_access_method != NULL && partition_method == PART_NONE && 
partitions == 0)
+   {
+   char   *escaped_table_access_method;
+
+   initPQExpBuffer(&query);
+   escaped_table_access_method = PQescapeIdentifier(con,
+   table_access_method, 
strlen(table_access_method));
+   appendPQExpBuffer(&query, "set default_table_access_method to 
%s;\n",
+   escaped_table_access_method);
+   PQfr

Re: Add statistics to pg_stat_wal view for wal related parameter tuning

On 2020/12/01 14:01, Fujii Masao wrote:

On 2020/11/26 16:07, Masahiro Ikeda wrote:

On 2020-11-25 20:19, Fujii Masao wrote:

On 2020/11/19 16:31, Masahiro Ikeda wrote:

On 2020-11-17 11:46, Fujii Masao wrote:

On 2020/11/16 16:35, Masahiro Ikeda wrote:

On 2020-11-12 14:58, Fujii Masao wrote:

On 2020/11/06 10:25, Masahiro Ikeda wrote:

On 2020-10-30 11:50, Fujii Masao wrote:

On 2020/10/29 17:03, Masahiro Ikeda wrote:

Hi,

Thanks for your comments and advice. I updated the patch.

On 2020-10-21 18:03, Kyotaro Horiguchi wrote:

At Tue, 20 Oct 2020 16:11:29 +0900, Masahiro Ikeda
 wrote in

On 2020-10-20 12:46, Amit Kapila wrote:
> I see that we also need to add extra code to capture these stats (some
> of which is in performance-critical path especially in
> XLogInsertRecord) which again makes me a bit uncomfortable. It might
> be that it is all fine as it is very important to collect these stats
> at cluster-level in spite that the same information can be gathered at
> statement-level to help customers but I don't see a very strong case
> for that in your proposal.

We should avoid that duplication as possible even if the both number
are important.

Also about performance, I thought there are few impacts because it
increments stats in memory. If I can implement to reuse pgWalUsage's
value which already collects these stats, there is no impact in
XLogInsertRecord.
For example, how about pg_stat_wal() calculates the accumulated
value of wal_records, wal_fpi, and wal_bytes to use pgWalUsage's
value?

I don't think that works, but it would work that pgstat_send_wal()
takes the difference of that values between two successive calls.

WalUsage prevWalUsage;
...
pgstat_send_wal()
{
..
   /* fill in some values using pgWalUsage */
   WalStats.m_wal_bytes   = pgWalUsage.wal_bytes   - prevWalUsage.wal_bytes;
   WalStats.m_wal_records = pgWalUsage.wal_records - prevWalUsage.wal_records;
   WalStats.m_wal_wal_fpi = pgWalUsage.wal_fpi - prevWalUsage.wal_fpi;
...
   pgstat_send(&WalStats, sizeof(WalStats));

   /* remember the current numbers */
   prevWalUsage = pgWalUsage;

Thanks for Horiguchi-san's advice, I changed to reuse pgWalUsage
which is already defined and eliminates the extra overhead.

+    /* fill in some values using pgWalUsage */
+    WalStats.m_wal_bytes = pgWalUsage.wal_bytes - prevWalUsage.wal_bytes;
+    WalStats.m_wal_records = pgWalUsage.wal_records - prevWalUsage.wal_records;
+    WalStats.m_wal_fpi = pgWalUsage.wal_fpi - prevWalUsage.wal_fpi;

It's better to use WalUsageAccumDiff() here?

Yes, thanks. I fixed it.

prevWalUsage needs to be initialized with pgWalUsage?

+    if (AmWalWriterProcess()){
+    WalStats.m_wal_write_walwriter++;
+    }
+    else
+    {
+    WalStats.m_wal_write_backend++;
+    }

I think that it's better not to separate m_wal_write_xxx into two for
walwriter and other processes. Instead, we can use one m_wal_write_xxx
counter and make pgstat_send_wal() send also the process type to
the stats collector. Then the stats collector can accumulate the counters
per process type if necessary. If we adopt this approach, we can easily
extend pg_stat_wal so that any fields can be reported per process type.

I'll remove the above source code because these counters are not useful.

On 2020-10-30 12:00, Fujii Masao wrote:

On 2020/10/20 11:31, Masahiro Ikeda wrote:

Hi,

I think we need to add some statistics to pg_stat_wal view.

Although there are some parameter related WAL,
there are few statistics for tuning them.

I think it's better to provide the following statistics.
Please let me know your comments.

```
postgres=# SELECT * from pg_stat_wal;
-[ RECORD 1 ]---+--
wal_records | 2000224
wal_fpi | 47
wal_bytes   | 248216337
wal_buffers_full    | 20954
wal_init_file   | 8
wal_write_backend   | 20960
wal_write_walwriter | 46
wal_write_time  | 51
wal_sync_backend    | 7
wal_sync_walwriter  | 8
wal_sync_time   | 0
stats_reset | 2020-10-20 11:04:51.307771+09
```

1. Basic statistics of WAL activity

- wal_records: Total number of WAL records generated
- wal_fpi: Total number of WAL full page images generated
- wal_bytes: Total amount of WAL bytes generated

To understand DB's performance, first, we will check the performance
trends for the entire database instance.
For example, if the number of wal_fpi becomes higher, users may tune
"wal_compression", "checkpoint_timeout" and so on.

Although users can check the above statistics via EXPLAIN, auto_explain,
autovacuum and pg_stat_statements now,
if users want to see the performance trends  for the entire database,
they must recalculate the statistics.

I think it is useful to add the sum of the basic statistics.

2.  WAL segment file creation

- wal_init_file: Total number of WAL segment files created.

To create a new WAL file may have an impact

Re: proposal: unescape_text function

2020-12-01 Thread Pavel Stehule

st 2. 12. 2020 v 0:05 odesílatel Andrew Dunstan 
napsal:

>
> On 11/30/20 8:14 AM, Peter Eisentraut wrote:
> > On 2020-11-29 18:36, Pavel Stehule wrote:
> >>
> >> I don't really get the point of this function.  There is AFAICT no
> >> function to produce this escaped format, and it's not a recognized
> >> interchange format.  So under what circumstances would one need to
> >> use this?
> >>
> >>
> >> Some corporate data can be in CSV format with escaped unicode
> >> characters. Without this function it is not possible to decode these
> >> files without external application.
> >
> > I would like some supporting documentation on this.  So far we only
> > have one stackoverflow question, and then this implementation, and
> > they are not even the same format.  My worry is that if there is not
> > precise specification, then people are going to want to add things in
> > the future, and there will be no way to analyze such requests in a
> > principled way.
> >
> >
> >
>
>
> Also, should this be an extension? I'm dubious about including such
> marginal uses in the core code unless there's a really good case for it.
>

I am not sure, and  I am inclined so it should be core functionality.

1. Although this use case is marginal, this is related to most used
encodings - ascii and unicode. 8 bit encodings enhanced about escaped
multibyte chars will be used for a very long time. Unfortunately - this
will be worse, because Postgres will be used more in the corporate
environment, where there is a bigger press to conserve very legacy
technologies without correct multibyte support. The core problem so this
issue is out of concept bytea -> text or text -> bytea transformations
supported by Postgres. This is text -> text transformation (for almost all
encoding based on ascii), that is not supported by Postgres now.

2. Postgres already has this functionality - but unfortunately there is a
limit just only literal constants.

create or replace function uunescape(text)
returns text as $$
declare r text;
begin
  -- don't use this code!!!
  execute 'select e''' || $1 ||  into r;
  return r;
end;
$$ language plpgsql immutable;

But one way how anybody can use it is SQL injection vulnerable and slow. So
some simple buildin solution can be protection against some future security
issues. Personally I am happy with just this limited function that will be
safe (although the design based on introducing new encoding and conversions
can be more complete and accurate). I agree so this case is marginal, but
it is a fully valid use case, and supporting unicode escaped codes just by
parser is a needless limit.

3. there are new disadvantages of extensions in current DBaaS times. Until
the extension is not directly accepted by a cloud provider, then the
extension is not available for users. The acceptance of extensions is not
too agile - so moving this code to extension doesn't solve this problem.
Without DBaaS the implementation of this feature as the extensions can be
good enough.

Regards

Pavel

>
> cheers
>
>
> andrew
>
>

Re: pg_stat_statements oddity with track = all

2020-12-01 Thread Nikolay Samokhvalov

On Tue, Dec 1, 2020 at 8:05 PM Julien Rouhaud  wrote:

> Hi,
>
> Someone raised an interested point recently on pg_stat_kcache extension for
> handling nested statements, which also applies to pg_stat_statements.
>
...

> The only idea I have for that is to add a new field to entry key, for
> instance
> is_toplevel.

This particular problem often bothered me when dealing with
pg_stat_statements contents operating under "track = all" (especially when
performing the aggregated analysis, like you showed).

I think the idea of having a flag to distinguish the top-level entries is
great.

> The immediate cons is obviously that it could amplify quite a lot
> the number of entries tracked, so people may need to increase
> pg_stat_statements.max to avoid slowdown if that makes them reach frequent
> entry eviction.
>

If all top-level records in pg_stat_statements have "true" in the new
column (is_toplevel), how would this lead to the need to increase
pg_stat_statements.max? The number of records would remain the same, as
before extending pg_stat_statements.

Re: pg_stat_statements oddity with track = all

2020-12-01 Thread Julien Rouhaud

On Tue, Dec 01, 2020 at 10:08:06PM -0800, Nikolay Samokhvalov wrote:
> On Tue, Dec 1, 2020 at 8:05 PM Julien Rouhaud  wrote:
> 
> > Someone raised an interested point recently on pg_stat_kcache extension for
> > handling nested statements, which also applies to pg_stat_statements.
> >
> ...
> 
> > The only idea I have for that is to add a new field to entry key, for
> > instance
> > is_toplevel.
> 
> 
> This particular problem often bothered me when dealing with
> pg_stat_statements contents operating under "track = all" (especially when
> performing the aggregated analysis, like you showed).
> 
> I think the idea of having a flag to distinguish the top-level entries is
> great.
> 

Ok!

> > The immediate cons is obviously that it could amplify quite a lot
> > the number of entries tracked, so people may need to increase
> > pg_stat_statements.max to avoid slowdown if that makes them reach frequent
> > entry eviction.
> >
> 
> If all top-level records in pg_stat_statements have "true" in the new
> column (is_toplevel), how would this lead to the need to increase
> pg_stat_statements.max? The number of records would remain the same, as
> before extending pg_stat_statements.

If the same query is getting executed both at top level and as a nested
statement, two entries will then be created.  That's probably unlikely for
things like RI trigger queries, but I don't know what to expect for client
application queries.

Re: autovac issue with large number of tables





On 2020/12/02 12:53, Masahiko Sawada wrote:

On Tue, Dec 1, 2020 at 5:31 PM Masahiko Sawada  wrote:


On Tue, Dec 1, 2020 at 4:32 PM Fujii Masao  wrote:




On 2020/12/01 16:23, Masahiko Sawada wrote:

On Tue, Dec 1, 2020 at 1:48 PM Kasahara Tatsuhito
 wrote:


Hi,

On Mon, Nov 30, 2020 at 8:59 PM Fujii Masao  wrote:




On 2020/11/30 10:43, Masahiko Sawada wrote:

On Sun, Nov 29, 2020 at 10:34 PM Kasahara Tatsuhito
 wrote:


Hi, Thanks for you comments.

On Fri, Nov 27, 2020 at 9:51 PM Fujii Masao  wrote:




On 2020/11/27 18:38, Kasahara Tatsuhito wrote:

Hi,

On Fri, Nov 27, 2020 at 1:43 AM Fujii Masao  wrote:




On 2020/11/26 10:41, Kasahara Tatsuhito wrote:

On Wed, Nov 25, 2020 at 8:46 PM Masahiko Sawada  wrote:


On Wed, Nov 25, 2020 at 4:18 PM Kasahara Tatsuhito
 wrote:


Hi,

On Wed, Nov 25, 2020 at 2:17 PM Masahiko Sawada  wrote:


On Fri, Sep 4, 2020 at 7:50 PM Kasahara Tatsuhito
 wrote:


Hi,

On Wed, Sep 2, 2020 at 2:10 AM Kasahara Tatsuhito
 wrote:

I wonder if we could have table_recheck_autovac do two probes of the stats
data.  First probe the existing stats data, and if it shows the table to
be already vacuumed, return immediately.  If not, *then* force a stats
re-read, and check a second time.

Does the above mean that the second and subsequent table_recheck_autovac()
will be improved to first check using the previous refreshed statistics?
I think that certainly works.

If that's correct, I'll try to create a patch for the PoC


I still don't know how to reproduce Jim's troubles, but I was able to reproduce
what was probably a very similar problem.

This problem seems to be more likely to occur in cases where you have
a large number of tables,
i.e., a large amount of stats, and many small tables need VACUUM at
the same time.

So I followed Tom's advice and created a patch for the PoC.
This patch will enable a flag in the table_recheck_autovac function to use
the existing stats next time if VACUUM (or ANALYZE) has already been done
by another worker on the check after the stats have been updated.
If the tables continue to require VACUUM after the refresh, then a refresh
will be required instead of using the existing statistics.

I did simple test with HEAD and HEAD + this PoC patch.
The tests were conducted in two cases.
(I changed few configurations. see attached scripts)

1. Normal VACUUM case
   - SET autovacuum = off
   - CREATE tables with 100 rows
   - DELETE 90 rows for each tables
   - SET autovacuum = on and restart PostgreSQL
   - Measure the time it takes for all tables to be VACUUMed

2. Anti wrap round VACUUM case
   - CREATE brank tables
   - SELECT all of these tables (for generate stats)
   - SET autovacuum_freeze_max_age to low values and restart PostgreSQL
   - Consumes a lot of XIDs by using txid_curent()
   - Measure the time it takes for all tables to be VACUUMed

For each test case, the following results were obtained by changing
autovacuum_max_workers parameters to 1, 2, 3(def) 5 and 10.
Also changing num of tables to 1000, 5000, 1 and 2.

Due to the poor VM environment (2 VCPU/4 GB), the results are a little unstable,
but I think it's enough to ask for a trend.

===
[1.Normal VACUUM case]
  tables:1000
   autovacuum_max_workers 1:   (HEAD) 20 sec VS (with patch)  20 sec
   autovacuum_max_workers 2:   (HEAD) 18 sec VS (with patch)  16 sec
   autovacuum_max_workers 3:   (HEAD) 18 sec VS (with patch)  16 sec
   autovacuum_max_workers 5:   (HEAD) 19 sec VS (with patch)  17 sec
   autovacuum_max_workers 10:  (HEAD) 19 sec VS (with patch)  17 sec

  tables:5000
   autovacuum_max_workers 1:   (HEAD) 77 sec VS (with patch)  78 sec
   autovacuum_max_workers 2:   (HEAD) 61 sec VS (with patch)  43 sec
   autovacuum_max_workers 3:   (HEAD) 38 sec VS (with patch)  38 sec
   autovacuum_max_workers 5:   (HEAD) 45 sec VS (with patch)  37 sec
   autovacuum_max_workers 10:  (HEAD) 43 sec VS (with patch)  35 sec

  tables:1
   autovacuum_max_workers 1:   (HEAD) 152 sec VS (with patch)  153 sec
   autovacuum_max_workers 2:   (HEAD) 119 sec VS (with patch)   98 sec
   autovacuum_max_workers 3:   (HEAD)  87 sec VS (with patch)   78 sec
   autovacuum_max_workers 5:   (HEAD) 100 sec VS (with patch)   66 sec
   autovacuum_max_workers 10:  (HEAD)  97 sec VS (with patch)   56 sec

  tables:2
   autovacuum_max_workers 1:   (HEAD) 338 sec VS (with patch)  339 sec
   autovacuum_max_workers 2:   (HEAD) 231 sec VS (with patch)  229 sec
   autovacuum_max_workers 3:   (HEAD) 220 sec VS (with patch)  191 sec
   autovacuum_max_workers 5:   (HEAD) 234 sec VS (with patch)  147 sec
   autovacuum_max_workers 10:  (HEAD) 320 sec VS (with patch)  113 sec

[2.Anti wrap round VACUUM case]
  tables:1000
   autovacuum_max_workers 1:   (HEAD) 19 sec VS (with patch) 18 sec

Re: pg_stat_statements oddity with track = all