Re: Move catalog toast table and index declarations

2020-11-05 Thread Peter Eisentraut

On 2020-10-27 13:12, John Naylor wrote:
There's nothing wrong; it's just a minor point of consistency. For the 
first part, I mean defined symbols in this file that are invisible to 
the C compiler are written


#define SOMETHING()

If some are written

#define SOMETHING() extern int no_such_variable

I imagine some future reader will wonder why there's a difference.


The difference is that CATALOG() is followed in actual use by something like

{ ... } FormData_pg_attribute;

so it becomes a valid C statement.  For DECLARE_INDEX() etc., we need to 
do something else to make it valid.  I guess this could be explained in 
more detail (as I'm attempting in this email), but this isn't materially 
changed by this patch.
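
For illustration, here is a simplified sketch of how the two kinds of macros
end up as valid C for the compiler.  The macro bodies and the OIDs below are
placeholders for illustration, not the exact genbki.h definitions:

/* CATALOG() expands only to the *start* of a typedef ... */
#define CATALOG(name,oid,oidmacro)  typedef struct FormData_##name

/* ... and the header text that follows it completes the statement: */
CATALOG(pg_attribute,1249,AttributeRelationId)
{
    int         attrelid;       /* real fields elided for brevity */
} FormData_pg_attribute;

/*
 * DECLARE_TOAST()/DECLARE_INDEX() are not followed by anything, so they
 * must expand to some complete (if meaningless) declaration themselves:
 */
#define DECLARE_TOAST(name,toastoid,indexoid)  extern int no_such_variable

DECLARE_TOAST(pg_attribute, 4444, 4445);   /* -> extern int no_such_variable; */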


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-11-05 Thread Kyotaro Horiguchi
At Thu, 5 Nov 2020 11:07:21 +0530, Amit Kapila  wrote 
in 
> On Thu, Nov 5, 2020 at 8:26 AM k.jami...@fujitsu.com
>  wrote:
> > > > Few comments on patches:
> > > > ==
> > > > v29-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks
> > > > --
> > > > -
> > > > 1.
> > > > -smgrnblocks(SMgrRelation reln, ForkNumber forknum)
> > > > +smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *accurate)
> > > >  {
> > > >   BlockNumber result;
> > > >
> > > >   /*
> > > >   * For now, we only use cached values in recovery due to lack of a
> > > > shared
> > > > - * invalidation mechanism for changes in file size.
> > > > + * invalidation mechanism for changes in file size.  The cached
> > > > + values
> > > > + * could be smaller than the actual number of existing buffers of the 
> > > > file.
> > > > + * This is caused by lseek of buggy Linux kernels that might not have
> > > > + * accounted for the recent write.
> > > >   */
> > > >   if (InRecovery && reln->smgr_cached_nblocks[forknum] !=
> > > > InvalidBlockNumber)
> > > > + {
> > > > + if (accurate != NULL)
> > > > + *accurate = true;
> > > > +
> > > >
> > > > I don't understand this comment. Few emails back, I think we have
> > > > discussed that cached value can't be less than the number of buffers
> > > > during recovery. If that happens to be true then we have some problem.
> > > > If you want to explain 'accurate' variable then you can do the same
> > > > atop of function. Would it be better to name this variable as
> > > > 'cached'?
> > >
> > > (I agree that the comment needs to be fixed.)
> > >
> > > FWIW I don't think 'cached' suggests the characteristics of the returned 
> > > value
> > > on its interface.  It was introduced to reduce fseek() calls, and after 
> > > that we
> > > have found that it can be regarded as the authoritative source of the 
> > > file size.
> > > The "accurate" means that it is guaranteed that we don't have a buffer 
> > > for the
> > > file blocks further than that number.  I don't come up with a more proper
> > > word than "accurate" but also I don't think "cached" is proper here.
> >
> 
> Sure but that is not the guarantee this API gives. It has to be
> guaranteed by the logic else-where, so not sure if it is a good idea
> to try to reflect the same here. The comments in the caller where we
> use this should explain why it is safe to use this value.

Isn't it already guaranteed by the bufmgr code that we don't have
buffers for nonexistent file blocks?  What is needed here is, yeah,
the returned value from smgrnblocks is "reliable".  If "reliable" is
still not proper, I give up and agree to "cached".

> > I also couldn't think of a better parameter name. Accurate seems to be 
> > better fit
> > as it describes a measurement close to an accepted value.
> > How about fixing the comment like below, would this suffice?
> >
> > /*
> >  *  smgrnblocks() -- Calculate the number of blocks in the
> >  *   supplied relation.
> >  *
> > *   accurate flag acts as an authoritative source of the file 
> > size and
> >  *  ensures that no buffers exist for blocks after the file 
> > size is known
> >  *  to the process.
> >  */
> > BlockNumber
> > smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *accurate)
> > {
> > BlockNumber result;
> >
> > /*
> >  * For now, we only use cached values in recovery due to lack of a 
> > shared
> >  * invalidation mechanism for changes in file size.  In recovery, 
> > the cached
> >  * value returned by the first lseek could be smaller than the 
> > actual number
> >  * of existing buffers of the file, which is caused by buggy Linux 
> > kernels
> >  * that might not have accounted for the recent write.  However, we 
> > can
> >  * still rely on the cached value even if we get a bogus value from 
> > first
> >  * lseek since it is impossible to have buffer for blocks after the 
> > file size.
> >  */
> >
> >
> > > By the way, if there's a case where we extend a file by more than one 
> > > block the
> > > cached value becomes invalid.  I'm not sure if it actually happens, but 
> > > the
> > > following sequence may lead to a problem. We need a protection for that
> > > case.
> > >
> > > smgrnblocks(): cached n
> > > truncate to n-5  : cached n-5
> > > extend to m + 2  : cached invalid
> > > (fsync failed)
> > > smgrnblocks(): returns and cached n-5
> >
> 
> I think one possible idea is to actually commit the Assert patch
> (v29-0001-Prevent-invalidating-blocks-in-smgrextend-during) to ensure
> that it can't happen during recovery. And even if it happens why would
> there be any buffer with the block in it left when the fsync failed?
> And if there is no buffer with a block which doesn't account due to
> lseek lies then there shouldn't be

Re: [PoC] Non-volatile WAL buffer

2020-11-05 Thread Takashi Menjo
Hi Gang,

I appreciate your patience. I reproduced the results you reported to me, on
my environment.

First of all, the condition you gave to me was a little unstable on my
environment, so I made the values of {max_,min_,nv}wal_size larger and the
pre-warm duration longer to get stable performance. I didn't modify your
table, query, or benchmark duration.

Under the stable condition, Original (PMEM) still got better performance
than Non-volatile WAL Buffer. To sum up, the reason was that Non-volatile
WAL Buffer on Optane PMem spent much more time than Original (PMEM) for
XLogInsert when using your table and query. It offset the improvement of
XLogFlush, and degraded performance in total. VTune told me that
Non-volatile WAL Buffer took more CPU time than Original (PMEM) for
(XLogInsert => XLogInsertRecord => CopyXLogRecordsToWAL =>) memcpy while it
took less time for XLogFlush. This profile was very similar to the one you
reported.

In general, when WAL buffers are on Optane PMem rather than DRAM, it is
obvious that it takes more time to memcpy WAL records into the buffers
because Optane PMem is a little slower than DRAM. In return for that,
Non-volatile WAL Buffer reduces the time needed to make the records durable
on the device, because it doesn't need to write them out of the buffers to
somewhere else, but only needs to flush them out of the CPU caches to the
underlying memory-mapped file.
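
For readers unfamiliar with the PMEM programming model, a minimal sketch of
that insert-then-flush pattern using libpmem (PMDK) looks like the following.
The file path, sizes, and record contents are placeholders, and this is of
course not the patch's actual code:

#include <libpmem.h>

#define WAL_BUF_SIZE (16 * 1024 * 1024)     /* placeholder size */

int
main(void)
{
    size_t      mapped_len;
    int         is_pmem;
    char       *walbuf;
    const char  record[] = "fake WAL record payload";

    /* Map a file on a DAX filesystem as the (non-volatile) WAL buffer. */
    walbuf = pmem_map_file("/mnt/pmem/walbuf", WAL_BUF_SIZE,
                           PMEM_FILE_CREATE, 0600, &mapped_len, &is_pmem);
    if (walbuf == NULL)
        return 1;

    /*
     * "XLogInsert": copy the record straight into the persistent buffer.
     * This memcpy is the part that gets slower on Optane PMem than on
     * DRAM-resident WAL buffers.
     */
    pmem_memcpy_nodrain(walbuf, record, sizeof(record));

    /*
     * "XLogFlush": no write()/fsync() of a separate WAL segment is needed;
     * flushing/draining the CPU caches to the memory-mapped file is enough
     * to make the record durable.
     */
    if (is_pmem)
        pmem_drain();
    else
        pmem_msync(walbuf, sizeof(record));

    pmem_unmap(walbuf, mapped_len);
    return 0;
}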

Your report shows that Non-volatile WAL Buffer on Optane PMem is not good
for certain kinds of transactions, and is good for others. I have tried
changing how WAL records are inserted and flushed, and the configurations or
constants that could affect performance such as NUM_XLOGINSERT_LOCKS, but
Non-volatile WAL Buffer has not achieved better performance than Original
(PMEM) yet when using your table and query. I will continue to work on this
issue and will report if I have any update.

By the way, did the performance progress reported by pgbench with the -P
option drop to zero when you ran Non-volatile WAL Buffer? If so, your
{max_,min_,nv}wal_size might be too small or your checkpoint configuration
might not be appropriate. Could you check your results again?

Best regards,
Takashi

-- 
Takashi Menjo 


Re: Some doubious code in pgstat.c

2020-11-05 Thread Kyotaro Horiguchi
At Thu, 5 Nov 2020 11:48:24 +0530, Amit Kapila  wrote 
in 
> On Thu, Nov 5, 2020 at 9:44 AM Masahiko Sawada  wrote:
> >
> > On Thu, Nov 5, 2020 at 11:18 AM Kyotaro Horiguchi
> >  wrote:
> > > As another issue, just replace memcpy with strlcpy makes compiler
> > > complain of type mismatch, as the first paramter to memcpy had an
> > > needless "&" operator. I removed it in this patch.
> > >
> > > (&msg.m_slotname is a "char (*)[NAMEDATALEN]", not a "char *".)
> > >
> >
> > The patch looks good to me.
> >
> 
> LGTM as well but the proposed commit message seems to be a bit
> unclear. How about something like this:
> "Use strlcpy instead of memcpy for copying the slot name in pgstat.c.
> 
> There is no outright bug here but it is better to be consistent with
> the usage at other places in the same file. In the passing, fix a wrong
> Assertion in pgstat_recv_replslot."

Looks better, thanks.

By the way, I noticed the following sequence.

pgstat.c: 3204
3204>   lbeentry.st_appname[0] = '\0';
3205>   if (MyProcPort && MyProcPort->remote_hostname)
3206>   strlcpy(lbeentry.st_clienthostname, MyProcPort->remote_hostname,
3207>   NAMEDATALEN);
3208>   else
3209>   lbeentry.st_clienthostname[0] = '\0';
3210>   lbeentry.st_activity_raw[0] = '\0';
3211>   /* Also make sure the last byte in each string area is always 0 */
3212>   lbeentry.st_appname[NAMEDATALEN - 1] = '\0';
3213>   lbeentry.st_clienthostname[NAMEDATALEN - 1] = '\0';
3214>   lbeentry.st_activity_raw[pgstat_track_activity_query_size - 1] = '\0';


The strlcpy at line 3206 makes sure that st_clienthostname is
null-terminated, so line 3213 is pointless.  st_appname and
st_activity_raw are set to zero-length strings.

Is there any point in adding a terminating NUL to them?

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: overriding current_timestamp

2020-11-05 Thread thehesiod
So I used the method here:
https://gist.github.com/thehesiod/d0314d599c721216f075375c667e2d9a
and indeed it does not work for current_timestamp and similar functions.
There is another major problem: it doesn't seem to work for existing
triggers either; it seems like functions referenced by triggers are bound
when the function is created :(. I created an account to post this tidbit
for others trying to achieve the same thing.



--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html




Re: Protect syscache from bloating with negative cache entries

2020-11-05 Thread Heikki Linnakangas

On 19/11/2019 12:48, Kyotaro Horiguchi wrote:

1. Inserting a branch in SearchCatCacheInternal. (CatCache_Pattern_1.patch)

  This is the most straightforward way to add an alternative feature.

pattern 1 | 8459.73 |  28.15  # 9% (>> 1%) slower than 7757.58
pattern 1 | 8504.83 |  55.61
pattern 1 | 8541.81 |  41.56
pattern 1 | 8552.20 |  27.99
master    | 7757.58 |  22.65
master    | 7801.32 |  20.64
master    | 7839.57 |  25.28
master    | 7925.30 |  38.84

  It's so slow that it cannot be used.


This is very surprising. A branch that's never taken ought to be 
predicted by the CPU's branch-predictor, and be very cheap.


Do we actually need a branch there? If I understand correctly, the point 
is to bump up a usage counter on the catcache entry. You could increment 
the counter unconditionally, even if the feature is not used, and avoid 
the branch that way.


Another thought is to bump up the usage counter in ReleaseCatCache(), 
and only when the refcount reaches zero. That might be somewhat cheaper, 
if it's a common pattern to acquire additional leases on an entry that's 
already referenced.


Yet another thought is to replace 'refcount' with an 'acquirecount' and 
'releasecount'. In SearchCatCacheInternal(), increment acquirecount, and 
in ReleaseCatCache, increment releasecount. When they are equal, the 
entry is not in use. Now you have a counter that gets incremented on 
every access, with the same number of CPU instructions in the hot paths 
as we have today.
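
A minimal sketch of that last idea (field and struct names here are
illustrative only; the real CatCTup in catcache.h has many more fields):

#include <stdbool.h>
#include <stdint.h>

typedef struct MyCatCTup
{
    uint64_t    acquirecount;   /* bumped in SearchCatCacheInternal() */
    uint64_t    releasecount;   /* bumped in ReleaseCatCache() */
    /* ... cached tuple data ... */
} MyCatCTup;

/* Same number of hot-path instructions as today's ct->refcount++ ... */
static inline void
my_acquire(MyCatCTup *ct)
{
    ct->acquirecount++;         /* doubles as a usage counter for pruning */
}

static inline void
my_release(MyCatCTup *ct)
{
    ct->releasecount++;
}

/* ... and the entry is "not in use" exactly when the two counts are equal. */
static inline bool
my_entry_in_use(const MyCatCTup *ct)
{
    return ct->acquirecount != ct->releasecount;
}

The pruning logic could then read acquirecount as the access-frequency signal
without adding anything to the hot paths.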


Or maybe there are some other ways we could micro-optimize 
SearchCatCacheInternal(), to buy back the slowdown that this feature 
would add? For example, you could remove the "if (cl->dead) continue;" 
check, if dead entries were kept out of the hash buckets. Or maybe the 
catctup struct could be made slightly smaller somehow, so that it would 
fit more comfortably in a single cache line.


My point is that I don't think we want to complicate the code much for 
this. All the indirection stuff seems over-engineered for this. Let's 
find a way to keep it simple.


- Heikki




Re: Some doubious code in pgstat.c

2020-11-05 Thread Amit Kapila
On Thu, Nov 5, 2020 at 2:13 PM Kyotaro Horiguchi
 wrote:
>
> At Thu, 5 Nov 2020 11:48:24 +0530, Amit Kapila  
> wrote in
> > On Thu, Nov 5, 2020 at 9:44 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Thu, Nov 5, 2020 at 11:18 AM Kyotaro Horiguchi
> > >  wrote:
> > > > As another issue, just replace memcpy with strlcpy makes compiler
> > > > complain of type mismatch, as the first paramter to memcpy had an
> > > > needless "&" operator. I removed it in this patch.
> > > >
> > > > (&msg.m_slotname is a "char (*)[NAMEDATALEN]", not a "char *".)
> > > >
> > >
> > > The patch looks good to me.
> > >
> >
> > LGTM as well but the proposed commit message seems to be a bit
> > unclear. How about something like this:
> > "Use strlcpy instead of memcpy for copying the slot name in pgstat.c.
> >
> > There is no outright bug here but it is better to be consistent with
> > the usage at other places in the same file. In the passing, fix a wrong
> > Assertion in pgstat_recv_replslot."
>
> Looks better, thanks.
>
> By the way, I noticed the following sequence.
>
> pgstat.c: 3204
> 3204>   lbeentry.st_appname[0] = '\0';
> 3205>   if (MyProcPort && MyProcPort->remote_hostname)
> 3206>   strlcpy(lbeentry.st_clienthostname, 
> MyProcPort->remote_hostname,
> 3207>   NAMEDATALEN);
> 3208>   else
> 3209>   lbeentry.st_clienthostname[0] = '\0';
> 3210>   lbeentry.st_activity_raw[0] = '\0';
> 3211>   /* Also make sure the last byte in each string area is always 0 */
> 3212>   lbeentry.st_appname[NAMEDATALEN - 1] = '\0';
> 3213>   lbeentry.st_clienthostname[NAMEDATALEN - 1] = '\0';
> 3214>   lbeentry.st_activity_raw[pgstat_track_activity_query_size - 1] = '\0';
>
>
> The strlcpy at the line 3206 makes sure that st_clienthostname is
> null-terminated so it's nonsense to do line 3213.  st_appname and
> st_activity_raw are set to zero-length string.
>
> Is there any point in setting terminating nul to them?
>

I also don't see any reason for the same except being extra careful.
This is not directly related to this patch so I think we can leave
this or if you want you can discuss this in a separate thread. It
seems to have been introduced in commit 85ccb689.

-- 
With Regards,
Amit Kapila.




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-11-05 Thread Amit Kapila
On Thu, Nov 5, 2020 at 1:59 PM Kyotaro Horiguchi
 wrote:
>
> At Thu, 5 Nov 2020 11:07:21 +0530, Amit Kapila  
> wrote in
> > On Thu, Nov 5, 2020 at 8:26 AM k.jami...@fujitsu.com
> >  wrote:
> > > > > Few comments on patches:
> > > > > ==
> > > > > v29-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks
> > > > > --
> > > > > -
> > > > > 1.
> > > > > -smgrnblocks(SMgrRelation reln, ForkNumber forknum)
> > > > > +smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *accurate)
> > > > >  {
> > > > >   BlockNumber result;
> > > > >
> > > > >   /*
> > > > >   * For now, we only use cached values in recovery due to lack of a
> > > > > shared
> > > > > - * invalidation mechanism for changes in file size.
> > > > > + * invalidation mechanism for changes in file size.  The cached
> > > > > + values
> > > > > + * could be smaller than the actual number of existing buffers of 
> > > > > the file.
> > > > > + * This is caused by lseek of buggy Linux kernels that might not have
> > > > > + * accounted for the recent write.
> > > > >   */
> > > > >   if (InRecovery && reln->smgr_cached_nblocks[forknum] !=
> > > > > InvalidBlockNumber)
> > > > > + {
> > > > > + if (accurate != NULL)
> > > > > + *accurate = true;
> > > > > +
> > > > >
> > > > > I don't understand this comment. Few emails back, I think we have
> > > > > discussed that cached value can't be less than the number of buffers
> > > > > during recovery. If that happens to be true then we have some problem.
> > > > > If you want to explain 'accurate' variable then you can do the same
> > > > > atop of function. Would it be better to name this variable as
> > > > > 'cached'?
> > > >
> > > > (I agree that the comment needs to be fixed.)
> > > >
> > > > FWIW I don't think 'cached' suggests the characteristics of the 
> > > > returned value
> > > > on its interface.  It was introduced to reduce fseek() calls, and after 
> > > > that we
> > > > have found that it can be regarded as the authoritative source of the 
> > > > file size.
> > > > The "accurate" means that it is guaranteed that we don't have a buffer 
> > > > for the
> > > > file blocks further than that number.  I don't come up with a more 
> > > > proper
> > > > word than "accurate" but also I don't think "cached" is proper here.
> > >
> >
> > Sure but that is not the guarantee this API gives. It has to be
> > guaranteed by the logic else-where, so not sure if it is a good idea
> > to try to reflect the same here. The comments in the caller where we
> > use this should explain why it is safe to use this value.
>
> Isn't it already guaranteed by the bugmgr code that we don't have
> buffers for nonexistent file blocks?  What is needed here is, yeah,
> the returned value from smgrblocks is "reliable".  If "reliable" is
> still not proper, I give up and agree to "cached".
>


I still feel 'cached' is a better name.

>
> > > I am not sure if the patch should cover this or should be a separate 
> > > thread altogether since
> > > a number of functions also rely on the smgrnblocks(). But I'll take it 
> > > into consideration.
> > >
> > >
> > > > > v29-0003-Optimize-DropRelFileNodeBuffers-during-recovery
> > > > > --
> > > > > 
> > > > > 2.
> > > > > + /* Check that it is in the buffer pool. If not, do nothing. */
> > > > > + LWLockAcquire(bufPartitionLock, LW_SHARED); buf_id =
> > > > > + BufTableLookup(&bufTag, bufHash); LWLockRelease(bufPartitionLock);
> > > > > +
> > > > > + if (buf_id < 0)
> > > > > + continue;
> > > > > +
> > > > > + bufHdr = GetBufferDescriptor(buf_id);
> > > > > +
> > > > > + buf_state = LockBufHdr(bufHdr);
> > > > > +
> > > > > + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> > > > >
> > > > > I think a pre-check for RelFileNode might be better before LockBufHdr
> > > > > for the reasons mentioned in this function few lines down.
> > > >
> > > > The equivalent check is already done by BufTableLookup(). The last line 
> > > > in
> > > > the above is not a precheck but the final check.
> > >
> >
> > Which check in that API you are talking about? Are you telling because
> > we are trying to use a hash value corresponding to rnode.node to find
> > the block then I don't think it is equivalent because there is a
> > difference in actual values. But even if we want to rely on that, a
> > comment is required but I guess we can do the check as well because it
> > shouldn't be a costly pre-check.
>
> I think the only problematic case is that BufTableLookup wrongly
> misses buffers actually to be dropped. (And the case of too-many
> false-positives, not critical though.)  If omission is the case, we
> cannot adopt this optimization at all.  And if the false-positive is
> the case, maybe we need to adopt more restrictive prechecking, but
> RelFileNodeEquals is *not* more restrictive than BufTableLookup in t

Re: Move OpenSSL random under USE_OPENSSL_RANDOM

2020-11-05 Thread Daniel Gustafsson
> On 5 Nov 2020, at 04:44, Michael Paquier  wrote:
> 
> On Wed, Nov 04, 2020 at 10:05:48AM +0100, Magnus Hagander wrote:
>> Yes, we should absolutely do that. We already do this for
>> pg_strong_random() itself, and we should definitely repeat the pattern
>> in the init function.
> 
> This poked at my curiosity, so I looked at it.  The result looks
> indeed like an improvement to me, while taking care of the point of
> upthread to make the implementation stuff controlled only by
> USE_OPENSSL_RANDOM.  Per se the attached.

This must check for USE_OPENSSL as well as per my original patch, since we'd
otherwise fail to perform post-fork initialization in case one uses OpenSSL with
another PRNG for pg_strong_random.  That might be theoretical at this point, but
if we ever support that and miss updating this it would be problematic.

+#if defined(USE_OPENSSL_RANDOM)

I'm not sure this comment adds any value; we currently have two non-TLS library
PRNGs in pg_strong_random, so even if we add NSS it will at best be 50%:

+ * Note that this applies normally to SSL implementations, so when
+ * implementing a new one, be careful to consider this initialization
+ * step.

> This could make random number generation predictible when an extension
> calls directly RAND_bytes() if USE_OPENSSL_RANDOM is not used while
> building with OpenSSL, but perhaps I am just too much of a pessimistic
> nature.

Any extension blindly calling RAND_bytes and expecting there to automagically
be an OpenSSL library available seems questionable.

cheers ./daniel



Re: Multi Inserts in CREATE TABLE AS - revived patch

2020-11-05 Thread Bharath Rupireddy
On Tue, Nov 3, 2020 at 4:54 PM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:
>
> Use case 1- 100mn tuples, 2 integer columns, exec time in sec:
> HEAD: 131.507 when the select part is not parallel, 128.832 when the
> select part is parallel
> Patch: 98.925 when the select part is not parallel, 52.901 when the
> select part is parallel
>
> Use case 2- 10mn tuples, 4 integer and 6 text columns, exec time in sec:
> HEAD: 76.801 when the select part is not parallel, 66.074 when the select
> part is parallel
> Patch: 74.083 when the select part is not parallel, 57.739 when the
> select part is parallel
>

I did some more testing with v1 patch: execution time is in seconds, each
test is run 2 times, with custom configuration [1].

Use case 3: 1 int and 1 text column. Each row size 129 bytes, size of 1
text column 101 bytes, number of rows 100 million, size of heap file 12.9GB.
HEAD: 253.227, 259.575
Patch: 177.921, 174.196

We get 1.4X better performance with the patch.

Use case 4: 1 int and 30 text columns. Each row size 28108 bytes, size of 1
text column 932 bytes, number of rows 10000, size of heap file 281.08MB.
HEAD: 222.812, 218.837
Patch: 222.492, 222.295

We don't see much difference with and without the patch. Each time only 2
tuples (2*28108 = 56216 bytes < MAX_MULTI_INSERT_BUFFERED_BYTES (65535
bytes)) are buffered and flushed.

Use case 5: 1 int and 75 text columns. Each row size 70228 bytes, size of 1
text column 932 bytes, number of rows 10000, size of heap file 702.28MB.
HEAD: 554.709, 562.745
Patch: 553.378, 560.370

We don't see much difference with and without the patch. Since each row
size (70228 bytes) is bigger than MAX_MULTI_INSERT_BUFFERED_BYTES (65535
bytes), the multi insert code is not used; each row is inserted with
table_tuple_insert() itself.

Use case 6: 1 int and 1 text column. Each row size 9205 bytes, size of 1
text column 9173 bytes, number of rows 10000, size of heap file 92.05MB.
HEAD: 70.583, 70.251
Patch: 72.633, 73.521

We see 2-3 seconds more with the patch. When I intentionally set the computed
tuple size to 0 (sz = 0) after GetTupleSize(), which means single inserts
happen, the results were 70.364, 70.406. It looks like this 2-3 seconds of
extra time is due to the multi insert code and happens only with this use
case. And I think this should not be a problem as the difference is not
huge.

+sz = GetTupleSize(slot, MAX_MULTI_INSERT_BUFFERED_BYTES);
+
+    sz = 0;
+
+/* In case the computed tuple size is 0, we go for single inserts. */
+if (sz != 0)
+{

[1] - The postgresql.conf used:
shared_buffers = 40GB
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


Re: Move catalog toast table and index declarations

2020-11-05 Thread John Naylor
On Thu, Nov 5, 2020 at 4:24 AM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2020-10-27 13:12, John Naylor wrote:
> > There's nothing wrong; it's just a minor point of consistency. For the
> > first part, I mean defined symbols in this file that are invisible to
> > the C compiler are written
> >
> > #define SOMETHING()
> >
> > If some are written
> >
> > #define SOMETHING() extern int no_such_variable
> >
> > I imagine some future reader will wonder why there's a difference.
>
> The difference is that CATALOG() is followed in actual use by something
> like
>
>  { ... } FormData_pg_attribute;
>
> so it becomes a valid C statement.  For DECLARE_INDEX() etc., we need to
> do something else to make it valid.  I guess this could be explained in
> more detail (as I'm attempting in this email), but this isn't materially
> changed by this patch.
>

I think we're talking past each other. Here's a concrete example:

#define BKI_ROWTYPE_OID(oid,oidmacro)
#define DECLARE_TOAST(name,toastoid,indexoid) extern int no_such_variable

I understand these to be functionally equivalent as far as what the C
compiler sees. If not, I'd be curious to know what the difference is. I was
thinking this is just a random style difference, and if so, they should be
the same now that they're in the same file together:

#define BKI_ROWTYPE_OID(oid,oidmacro)
#define DECLARE_TOAST(name,toastoid,indexoid)

And yes, this doesn't materially change the patch, it's just nitpicking :-).
Materially, I believe it's fine.

-- 
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Move OpenSSL random under USE_OPENSSL_RANDOM

2020-11-05 Thread Michael Paquier
On Thu, Nov 05, 2020 at 10:49:45AM +0100, Daniel Gustafsson wrote:
> This must check for USE_OPENSSL as well as per my original patch, since we'd
> otherwise fail to perform post-fork initialization in case one use OpenSSL 
> with
> anothe PRNG for pg_strong_random.  That might be theoretical at this point, 
> but
> if we ever support that and miss updating this it would be problematic.

That's actually the same point I tried to make at the end of my last
email, but worded differently, isn't it?  In short we have
USE_OPENSSL, but !USE_OPENSSL_RANDOM and we still need an
initialization.  We could just do something like the following:
#ifdef USE_OPENSSL
RAND_poll();
#endif
#if defined(USE_OPENSSL_RANDOM)
/* OpenSSL is done above, because blah.. */
#elif etc..
[...]
#error missing an init, pal.
#endif

Or do you have something else in mind?

> +#if defined(USE_OPENSSL_RANDOM)
> 
> I'm not sure this comment adds any value, we currently have two non-TLS 
> library
> PRNGs in pg_strong_random, so even if we add NSS it will at best be 50%:

I don't mind removing this part, the compilation hint may be enough,
indeed.
--
Michael


signature.asc
Description: PGP signature


Re: Move OpenSSL random under USE_OPENSSL_RANDOM

2020-11-05 Thread Daniel Gustafsson
> On 5 Nov 2020, at 13:12, Michael Paquier  wrote:
> 
> On Thu, Nov 05, 2020 at 10:49:45AM +0100, Daniel Gustafsson wrote:
>> This must check for USE_OPENSSL as well as per my original patch, since we'd
>> otherwise fail to perform post-fork initialization in case one use OpenSSL 
>> with
>> anothe PRNG for pg_strong_random.  That might be theoretical at this point, 
>> but
>> if we ever support that and miss updating this it would be problematic.
> 
> That's actually the same point I tried to make at the end of my last
> email, but worded differently, isn't it?  

Ah, ok, then I failed to parse it that way.  At least we are in agreement then
which is good.

> In short we have
> USE_OPENSSL, but !USE_OPENSSL_RANDOM and we still need an
> initialization.  We could just do something like the following:
> #ifdef USE_OPENSSL
>RAND_poll();
> #endif
> #if defined(USE_OPENSSL_RANDOM)
>/* OpenSSL is done above, because blah.. */
> #elif etc..
> [...]
> #error missing an init, pal.
> #endif
> 
> Or do you jave something else in mind?

What about the (hypothetical) situation where USE_OPENSSL_RANDOM is used
without USE_OPENSSL? Wouldn't the below make sure we cover all bases?

#if defined(USE_OPENSSL) || defined(USE_OPENSSL_RANDOM)

cheers ./daniel



redundant error messages

2020-11-05 Thread Peter Eisentraut
A few client tools duplicate error messages already provided by libpq, 
such as


pg_rewind: fatal: could not connect to server: could not connect to 
server: No such file or directory


pg_basebackup: error: could not connect to server: could not connect to 
server: No such file or directory


psql: error: could not connect to server: could not connect to server: 
No such file or directory


The psql case is actually a regression introduced in PG12, but the other 
two appear to be ancient.


Other client tools provide a different error message so in aggregate it 
looks like this:


createdb: error: could not connect to database template1: could not 
connect to server: No such file or directory


The attached patch removes the redundant message from the client tools. 
I suppose it's a bit dubious because there is no guarantee what the 
level of detail the message supplied by libpq has.  But I think these 
few cases are not particularly hard to keep in sync.
From a2c1e26ceff05ec6ebbdb9cf74a3171ebf969911 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 5 Nov 2020 13:24:47 +0100
Subject: [PATCH] Fix redundant error messages in client tools

A few client tools duplicate error messages already provided by libpq.
---
 src/bin/pg_basebackup/streamutil.c | 5 ++---
 src/bin/pg_rewind/pg_rewind.c  | 3 +--
 src/bin/psql/startup.c | 2 +-
 3 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_basebackup/streamutil.c 
b/src/bin/pg_basebackup/streamutil.c
index be653ebb2d..5c536eb5fd 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -183,7 +183,7 @@ GetConnection(void)
 */
if (!tmpconn)
{
-   pg_log_error("could not connect to server");
+   pg_log_error("out of memory");
exit(1);
}
 
@@ -200,8 +200,7 @@ GetConnection(void)
 
if (PQstatus(tmpconn) != CONNECTION_OK)
{
-   pg_log_error("could not connect to server: %s",
-PQerrorMessage(tmpconn));
+   pg_log_error("%s", PQerrorMessage(tmpconn));
PQfinish(tmpconn);
free(values);
free(keywords);
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 421a45ef5b..52e3fc40e8 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -282,8 +282,7 @@ main(int argc, char **argv)
conn = PQconnectdb(connstr_source);
 
if (PQstatus(conn) == CONNECTION_BAD)
-   pg_fatal("could not connect to server: %s",
-PQerrorMessage(conn));
+   pg_fatal("%s", PQerrorMessage(conn));
 
if (showprogress)
pg_log_info("connected to server");
diff --git a/src/bin/psql/startup.c b/src/bin/psql/startup.c
index e8d35a108f..586fcb3366 100644
--- a/src/bin/psql/startup.c
+++ b/src/bin/psql/startup.c
@@ -295,7 +295,7 @@ main(int argc, char *argv[])
 
if (PQstatus(pset.db) == CONNECTION_BAD)
{
-   pg_log_error("could not connect to server: %s", 
PQerrorMessage(pset.db));
+   pg_log_error("%s", PQerrorMessage(pset.db));
PQfinish(pset.db);
exit(EXIT_BADCONN);
}
-- 
2.28.0



Re: Move OpenSSL random under USE_OPENSSL_RANDOM

2020-11-05 Thread Michael Paquier
On Thu, Nov 05, 2020 at 01:18:15PM +0100, Daniel Gustafsson wrote:
> What about the (hypothetical) situation where USE_OPENSSL_RANDOM is used
> without USE_OPENSSL? Wouldn't the below make sure we cover all bases?

You can actually try that combination, because it is possible today to
compile without --with-openssl but try to enforce USE_OPENSSL_RANDOM.
This leads to a compilation failure.  I think that it is important to
have the #if/#elif business in the init function match the conditions
of the main function.

> #if defined(USE_OPENSSL) || defined(USE_OPENSSL_RANDOM)

It seems to me that this one would become incorrect if compiling with
OpenSSL but selecting a random source that requires an initialization, as
it would enforce only OpenSSL initialization all the time.
Theoretical point now, of course, because such combination does not
exist yet in the code.
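
To make that concrete, here is a hypothetical sketch of the shape I have in
mind; the function name, macro names, and comments are assumptions for
illustration, not committed code:

#ifdef USE_OPENSSL
#include <openssl/rand.h>
#endif

void
pg_strong_random_init(void)
{
    /*
     * Re-seed OpenSSL's PRNG after fork whenever we are built with OpenSSL,
     * regardless of which source pg_strong_random() itself uses, so that
     * extensions calling RAND_bytes() directly don't get predictable output.
     */
#ifdef USE_OPENSSL
    RAND_poll();
#endif

    /*
     * This ladder mirrors the source selection in pg_strong_random() and
     * fails the build if a new source is added without considering its
     * initialization needs.
     */
#if defined(USE_OPENSSL_RANDOM)
    /* Handled by the RAND_poll() above. */
#elif defined(USE_WIN32_RANDOM)
    /* No post-fork initialization needed. */
#elif defined(USE_DEV_URANDOM)
    /* No post-fork initialization needed. */
#else
#error no source of random numbers configured
#endif
}
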
--
Michael


signature.asc
Description: PGP signature


Re: libpq compression

2020-11-05 Thread Matthias van de Meent
On Mon, 2 Nov 2020 at 20:20, Konstantin Knizhnik
 wrote:
>
> On 02.11.2020 19:53, Matthias van de Meent wrote:
> > This is the result of network traffic of two backends one with enabled
> >> compression and another with disable compression
> >> after execution of "select * from pg_class" command:
> >>
> >> select * from pg_stat_network_traffic;
> >>   pid  | rx_raw_bytes | tx_raw_bytes | rx_compressed_bytes | tx_compressed_bytes
> >> -------+--------------+--------------+---------------------+---------------------
> >>  22276 |           29 |        86327 |                  38 |               10656
> >>  22282 |           73 |        86327 |                   0 |                   0
> > The current names and values of these columns are confusing me:
> > What column contains the amount of bytes sent to/received from the
> > client? Is the compression method of pid 22282 extremely efficient at
> > compressing, or does it void the data (compresses down to 0 bytes)?
> Names of the columns can be changed if you or somebody else will propose
> better alternatives.

How about Xx_logical_bytes for the raw pg command stream data, and
keeping Xx_compressed_bytes for the compressed data in/out?

> This view pg_stat_network_traffic reports traffic from server (backend)
> point of view, i.e.
> rx_bytes (received bytes) are commands sent from client to the server
> tx_bytes (transmitted bytes) are responses sent by server to the client.
>
> If compression is not used then rx_compressed_bytes =
> tx_compressed_bytes = 0
> It seems to be more natural then assigning them the same values as (raw
> bytes).
> Because it can really happen that for BLOBs with already compressed data
> (video images or sound)
> compressed data will be almost the same as raw data even if compression
> is enabled.
> So it seems to be important to distinguished situations when data can
> not be compressed and
> when it is not compressed at all.

Looking at it from that viewpoint, I agree. My primary reason for
suggesting this was that it would be useful to expose how much data
was transferred between the client and the server, which cannot be
constructed from that view for compression-enabled connections. That
is because the compression methods' counting only starts after some
bytes have already been transferred, and the raw/logical counter
starts deviating once compression is enabled.




Re: ModifyTable overheads in generic plans

2020-11-05 Thread Amit Langote
On Wed, Nov 4, 2020 at 11:32 AM Amit Langote  wrote:
> On Tue, Nov 3, 2020 at 9:05 PM Heikki Linnakangas  wrote:
> > A) We could change FDW API so that BeginDirectModify takes the same
> > arguments as BeginForeignModify(). That avoids the assumption that it's
> > a ForeignScan node, because BeginForeignModify() doesn't take
> > ForeignScanState as argument. That would be consistent, which is nice.
> > But I think we'd somehow still need to pass the ResultRelInfo to the
> > corresponding ForeignScan, and I'm not sure how.
>
> Maybe ForeignScan doesn't need to contain any result relation info
> then?  ForeignScan.operation != CMD_SELECT is enough to tell it to
> call IterateDirectModify() as today.
>
> > B) Look up the ResultRelInfo, and call BeginDirectModify(), on the first
> > call to ForeignNext().
> >
> > C) Accept the Assertion. And add an elog() check in the planner for that
> > with a proper error message.
> >
> > I'm leaning towards B), but maybe there's some better solution I didn't
> > think of?   Perhaps changing the API would make sense in any case, it is a
> > bit weird as it is. Backwards-incompatible API changes are not nice, but
> > I don't think there are many FDWs out there that implement the
> > DirectModify functions. And those functions are pretty tightly coupled
> > with the executor and ModifyTable node details anyway, so I don't feel
> > like we can, or need to, guarantee that they stay unchanged across major
> > versions.
>
> B is not too bad, but I tend to prefer doing A too.

How about I update the 0001 patch to implement A?

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: Move OpenSSL random under USE_OPENSSL_RANDOM

2020-11-05 Thread Magnus Hagander
On Thu, Nov 5, 2020 at 1:28 PM Michael Paquier  wrote:
>
> On Thu, Nov 05, 2020 at 01:18:15PM +0100, Daniel Gustafsson wrote:
> > What about the (hypothetical) situation where USE_OPENSSL_RANDOM is used
> > without USE_OPENSSL? Wouldn't the below make sure we cover all bases?
>
> You can actually try that combination, because it is possible today to
> compile without --with-openssl but try to enforce USE_OPENSSL_RANDOM.
> This leads to a compilation failure.  I think that it is important to
> have the #if/#elif business in the init function match the conditions
> of the main function.

+1 -- whatever those are, they should be the same.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Move OpenSSL random under USE_OPENSSL_RANDOM

2020-11-05 Thread Daniel Gustafsson
> On 5 Nov 2020, at 13:28, Michael Paquier  wrote:

> It seems to me that this one would become incorrect if compiling with
> OpenSSL but select a random source that requires an initialization, as
> it would enforce only OpenSSL initialization all the time.

Right, how about something like the attached (untested) diff?

> Theoretical point now, of course, because such combination does not
> exist yet in the code.

Not yet, and potentially never will.  Given the consequences of a PRNG which
hasn't been properly initialized I think it's ok to be defensive in this
codepath however.

cheers ./daniel



openssl_random_macros-v3.patch
Description: Binary data


RE: Parallel copy

2020-11-05 Thread Hou, Zhijie
Hi

> 
> my $bytes = $ARGV[0];
> for(my $i = 0; $i < $bytes; $i+=8){
>  print "longdata";
> }
> print "\n";
> 
> 
> postgres=# copy longdata from program 'perl /tmp/longdata.pl 1'
> with (parallel 2);
> 
> This gets stuck forever (or at least I didn't have the patience to wait
> it finish). Both worker processes are consuming 100% of CPU.

I had a look at this problem.

The ParallelCopyDataBlock has a size limit:
    uint8   skip_bytes;
    char    data[DATA_BLOCK_SIZE];  /* data read from file */

It seems the input line is so long that the leader process runs out of the
shared memory used among the parallel copy workers, and the leader process
keeps waiting for a free block.

The worker process, meanwhile, waits until line_state becomes
LINE_LEADER_POPULATED, but the leader process won't set the line_state until
it has read the whole line.

So it is stuck forever.
Maybe we should reconsider this situation.

The stack is as follows:

Leader stack:
#3  0x0075f7a1 in WaitLatch (latch=, 
wakeEvents=wakeEvents@entry=41, timeout=timeout@entry=1, 
wait_event_info=wait_event_info@entry=150994945) at latch.c:411
#4  0x005a9245 in WaitGetFreeCopyBlock 
(pcshared_info=pcshared_info@entry=0x7f26d2ed3580) at copyparallel.c:1546
#5  0x005a98ce in SetRawBufForLoad (cstate=cstate@entry=0x2978a88, 
line_size=67108864, copy_buf_len=copy_buf_len@entry=65536, 
raw_buf_ptr=raw_buf_ptr@entry=65536, 
copy_raw_buf=copy_raw_buf@entry=0x7fff4cdc0e18) at copyparallel.c:1572
#6  0x005a1963 in CopyReadLineText (cstate=cstate@entry=0x2978a88) at 
copy.c:4058
#7  0x005a4e76 in CopyReadLine (cstate=cstate@entry=0x2978a88) at 
copy.c:3863

Worker stack:
#0  GetLinePosition (cstate=cstate@entry=0x29e1f28) at copyparallel.c:1474
#1  0x005a8aa4 in CacheLineInfo (cstate=cstate@entry=0x29e1f28, 
buff_count=buff_count@entry=0) at copyparallel.c:711
#2  0x005a8e46 in GetWorkerLine (cstate=cstate@entry=0x29e1f28) at 
copyparallel.c:885
#3  0x005a4f2e in NextCopyFromRawFields (cstate=cstate@entry=0x29e1f28, 
fields=fields@entry=0x7fff4cdc0b48, nfields=nfields@entry=0x7fff4cdc0b44) at 
copy.c:3615
#4  0x005a50af in NextCopyFrom (cstate=cstate@entry=0x29e1f28, 
econtext=econtext@entry=0x2a358d8, values=0x2a42068, nulls=0x2a42070) at 
copy.c:3696
#5  0x005a5b90 in CopyFrom (cstate=cstate@entry=0x29e1f28) at 
copy.c:2985


Best regards,
houzj





Re: redundant error messages

2020-11-05 Thread Euler Taveira
On Thu, 5 Nov 2020 at 09:27, Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> A few client tools duplicate error messages already provided by libpq,
> such as
>
> pg_rewind: fatal: could not connect to server: could not connect to
> server: No such file or directory
>
Good catch!


> Other client tools provide a different error message so in aggregate it
> looks like this:
>
> createdb: error: could not connect to database template1: could not
> connect to server: No such file or directory
>
Is the database name important for this message? You should inform which
database you want to connect for all client tools except pg_dumpall. Hence,
you already know which database has the connection problem. IMO the pg_dumpall
message should inform the database name. My suggestion is:

if (fail_on_error)
{
    pg_log_error("database \"%s\": %s",
                 dbname, PQerrorMessage(conn));
    exit_nicely(1);
}

and remove the redundant 'could not connect to database %s' from
scripts/common.c.


-- 
Euler Taveira http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: pg_ls_tmpdir to show directories and shared filesets (and pg_ls_*)

2020-11-05 Thread Justin Pryzby
On Wed, Oct 28, 2020 at 02:34:02PM -0500, Justin Pryzby wrote:
> On Tue, Sep 08, 2020 at 02:51:26PM -0500, Justin Pryzby wrote:
> > On Sat, Jul 18, 2020 at 03:15:32PM -0500, Justin Pryzby wrote:
> > > Still waiting for feedback from a committer.
> > 
> > This patch has been waiting for input from a committer on the approach I've
> > taken with the patches since March 10, so I'm planning to set to "Ready" - 
> > at
> > least ready for senior review.
> 
> @cfbot: rebased

Rebased on e152506adef4bc503ea7b8ebb4fedc0b8eebda81

-- 
Justin
>From 3e98e1e782f46d3b4ae8240d8cbb608af80bc9f8 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Mon, 16 Mar 2020 14:12:55 -0500
Subject: [PATCH v23 01/10] Document historic behavior of links to
 directories..

Backpatch to 9.5: pg_stat_file
---
 doc/src/sgml/func.sgml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7b1dc264f6..f068260188 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25908,7 +25908,7 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
 Returns a record containing the file's size, last access time stamp,
 last modification time stamp, last file status change time stamp (Unix
 platforms only), file creation time stamp (Windows only), and a flag
-indicating if it is a directory.
+indicating if it is a directory (or a symbolic link to a directory).


 This function is restricted to superusers by default, but other users
-- 
2.17.0

>From 9cef25af7c265e040def5cf89286ec068edca075 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Mon, 30 Mar 2020 18:59:16 -0500
Subject: [PATCH v23 02/10] pg_stat_file and pg_ls_dir_* to use lstat()..

pg_ls_dir_* will now skip (no longer show) symbolic links, same as other
non-regular file types, as we advertize we do since 8b6d94cf6.  That seems to
be the intented behavior, since irregular file types are 1) less portable; and,
2) we don't currently show a file's type except for "bool is_dir".

pg_stat_file will now 1) show metadata of links themselves, rather than their
target; and, 2) specifically, show links to directories with "is_dir=false";
and, 3) not error on broken symlinks.
---
 doc/src/sgml/func.sgml  | 2 +-
 src/backend/utils/adt/genfile.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index f068260188..7b1dc264f6 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25908,7 +25908,7 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
 Returns a record containing the file's size, last access time stamp,
 last modification time stamp, last file status change time stamp (Unix
 platforms only), file creation time stamp (Windows only), and a flag
-indicating if it is a directory (or a symbolic link to a directory).
+indicating if it is a directory.


 This function is restricted to superusers by default, but other users
diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c
index d34182a7b0..9f4927220b 100644
--- a/src/backend/utils/adt/genfile.c
+++ b/src/backend/utils/adt/genfile.c
@@ -406,7 +406,7 @@ pg_stat_file(PG_FUNCTION_ARGS)
 
 	filename = convert_and_check_filename(filename_t);
 
-	if (stat(filename, &fst) < 0)
+	if (lstat(filename, &fst) < 0)
 	{
 		if (missing_ok && errno == ENOENT)
 			PG_RETURN_NULL();
@@ -632,7 +632,7 @@ pg_ls_dir_files(FunctionCallInfo fcinfo, const char *dir, bool missing_ok)
 
 		/* Get the file info */
 		snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
-		if (stat(path, &attrib) < 0)
+		if (lstat(path, &attrib) < 0)
 		{
 			/* Ignore concurrently-deleted files, else complain */
 			if (errno == ENOENT)
-- 
2.17.0

>From 3b72b3e0941bd10e796744019adea3570fc64063 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 17 Mar 2020 13:16:24 -0500
Subject: [PATCH v23 03/10] Add tests on pg_ls_dir before changing it

---
 src/test/regress/expected/misc_functions.out | 18 ++
 src/test/regress/sql/misc_functions.sql  |  5 +
 2 files changed, 23 insertions(+)

diff --git a/src/test/regress/expected/misc_functions.out b/src/test/regress/expected/misc_functions.out
index d3acb98d04..2e87c548eb 100644
--- a/src/test/regress/expected/misc_functions.out
+++ b/src/test/regress/expected/misc_functions.out
@@ -201,6 +201,24 @@ select count(*) > 0 from
  t
 (1 row)
 
+select * from (select pg_ls_dir('.', false, true) as name) as ls where ls.name='.'; -- include_dot_dirs=true
+ name 
+--
+ .
+(1 row)
+
+select * from (select pg_ls_dir('.', false, false) as name) as ls where ls.name='.'; -- include_dot_dirs=false
+ name 
+--
+(0 rows)
+
+select pg_ls_dir('does not exist', true, false); -- ok with missingok=true
+ pg_ls_dir 
+---
+(0 rows)
+
+select pg_ls_dir('does not exist'); --

Re: redundant error messages

2020-11-05 Thread Isaac Morland
On Thu, 5 Nov 2020 at 08:34, Euler Taveira 
wrote:

> Is the database name important for this message? You should inform which
> database you want to connect for all client tools except pg_dumpall.
> Hence, you
> already know which database has the connection problem. IMO the pg_dumpall
> message should inform the database name. My suggestion is:
>

In principle, the client knows the database name. In practice, if it's
coming from PGDATABASE or via a service configuration, one may be confused
about the database; having the error message be explicit will avoid many
problems. I can easily imagine that "unable to connect to database" would
be mystifying, whereas "unable to connect to database foo" would elicit the
response, "wait, I'm trying to connect to what now?" leading much more
quickly to a resolution.


Re: enable pg_stat_statements to track rows processed by REFRESH MATERIALIZED VIEW

2020-11-05 Thread Seino Yuki

On 2020-11-02 20:01, Fujii Masao wrote:

On 2020/11/02 14:02, Yuki Seino wrote:
The following review has been posted through the commitfest 
application:

make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

+1.
I checked the patch and there were no problems.


+   PG_END_TRY();
+   SetQueryCompletion(qc, CMDTAG_REFRESH_MATERIALIZED_VIEW, 
processed);

Isn't it better to call SetQueryCompletion() in ExecRefreshMatView()
instead of ProcessUtilitySlow() (e.g., ExecCreateTableAs() does)?

Regards,



Sorry. I missed it.
I've incorporated your point into this patch.
So the changes to "matview.h" and "utility.c" have been cancelled.

We also confirmed that the new patch passed the regression test.

Regards,

diff --git a/contrib/pg_stat_statements/expected/pg_stat_statements.out b/contrib/pg_stat_statements/expected/pg_stat_statements.out
index e0edb134f3..2a303a7f07 100644
--- a/contrib/pg_stat_statements/expected/pg_stat_statements.out
+++ b/contrib/pg_stat_statements/expected/pg_stat_statements.out
@@ -530,8 +530,8 @@ SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C";
 
 --
 -- Track the total number of rows retrieved or affected by the utility
--- commands of COPY, FETCH, CREATE TABLE AS, CREATE MATERIALIZED VIEW
--- and SELECT INTO
+-- commands of COPY, FETCH, CREATE TABLE AS, CREATE MATERIALIZED VIEW,
+-- REFRESH MATERIALIZED VIEW and SELECT INTO
 --
 SELECT pg_stat_statements_reset();
  pg_stat_statements_reset 
@@ -543,6 +543,7 @@ CREATE TABLE pgss_ctas AS SELECT a, 'ctas' b FROM generate_series(1, 10) a;
 SELECT generate_series(1, 10) c INTO pgss_select_into;
 COPY pgss_ctas (a, b) FROM STDIN;
 CREATE MATERIALIZED VIEW pgss_matv AS SELECT * FROM pgss_ctas;
+REFRESH MATERIALIZED VIEW pgss_matv;
 BEGIN;
 DECLARE pgss_cursor CURSOR FOR SELECT * FROM pgss_matv;
 FETCH NEXT pgss_cursor;
@@ -586,10 +587,11 @@ SELECT query, plans, calls, rows FROM pg_stat_statements ORDER BY query COLLATE
  FETCH FORWARD 5 pgss_cursor | 0 | 1 |5
  FETCH FORWARD ALL pgss_cursor   | 0 | 1 |7
  FETCH NEXT pgss_cursor  | 0 | 1 |1
+ REFRESH MATERIALIZED VIEW pgss_matv | 0 | 1 |   13
  SELECT generate_series(1, 10) c INTO pgss_select_into   | 0 | 1 |   10
  SELECT pg_stat_statements_reset()   | 0 | 1 |1
  SELECT query, plans, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C" | 1 | 0 |0
-(12 rows)
+(13 rows)
 
 --
 -- Track user activity and reset them
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 1eac9edaee..1c2ac24cf6 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1173,11 +1173,12 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 		/*
 		 * Track the total number of rows retrieved or affected by
 		 * the utility statements of COPY, FETCH, CREATE TABLE AS,
-		 * CREATE MATERIALIZED VIEW and SELECT INTO.
+		 * CREATE MATERIALIZED VIEW, REFRESH MATERIALIZED VIEW and SELECT INTO.
 		 */
 		rows = (qc && (qc->commandTag == CMDTAG_COPY ||
 	   qc->commandTag == CMDTAG_FETCH ||
-	   qc->commandTag == CMDTAG_SELECT)) ?
+	   qc->commandTag == CMDTAG_SELECT ||
+	   qc->commandTag == CMDTAG_REFRESH_MATERIALIZED_VIEW)) ?
 			qc->nprocessed : 0;
 
 		/* calc differences of buffer counters. */
diff --git a/contrib/pg_stat_statements/sql/pg_stat_statements.sql b/contrib/pg_stat_statements/sql/pg_stat_statements.sql
index 996a24a293..e9f5bb84e3 100644
--- a/contrib/pg_stat_statements/sql/pg_stat_statements.sql
+++ b/contrib/pg_stat_statements/sql/pg_stat_statements.sql
@@ -252,8 +252,8 @@ SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C";
 
 --
 -- Track the total number of rows retrieved or affected by the utility
--- commands of COPY, FETCH, CREATE TABLE AS, CREATE MATERIALIZED VIEW
--- and SELECT INTO
+-- commands of COPY, FETCH, CREATE TABLE AS, CREATE MATERIALIZED VIEW,
+-- REFRESH MATERIALIZED VIEW and SELECT INTO
 --
 SELECT pg_stat_statements_reset();
 
@@ -265,6 +265,7 @@ COPY pgss_ctas (a, b) FROM STDIN;
 13	copy
 \.
 CREATE MATERIALIZED VIEW pgss_matv AS SELECT * FROM pgss_ctas;
+REFRESH MATERIALIZED VIEW pgss_matv;
 BEGIN;
 DECLARE pgss_cursor CURSOR FOR SELECT * FROM pgss_matv;
 FETCH NEXT pgss_cursor;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index f80a9e96a9..8af8bd6aa8 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -356,6 +356

Re: How to retain lesser paths at add_path()?

2020-11-05 Thread Dmitry Dolgov
> On Tue, Jan 14, 2020 at 12:46:02AM +0900, Kohei KaiGai wrote:
> The v2 patch is attached.
>
> This adds two dedicated lists on the RelOptInfo to preserve lesser paths
> if extension required to retain the path-node to be removed in usual manner.
> These lesser paths are kept in the separated list, so it never expand the 
> length
> of pathlist and partial_pathlist. That was the arguable point in the 
> discussion
> at the last October.
>
> The new hook is called just before the path-node removal operation, and
> gives extension a chance for extra decision.
> If extension considers the path-node to be removed can be used in the upper
> path construction stage, they can return 'true' as a signal to preserve this
> lesser path-node.
> In case when same kind of path-node already exists in the preserved_pathlist
> and the supplied lesser path-node is cheaper than the old one, extension can
> remove the worse one arbitrarily to keep the length of preserved_pathlist.
> (E.g, PG-Strom may need one GpuJoin path-node either pathlist or preserved-
> pathlist for further opportunity of combined usage with GpuPreAgg path-node.
> It just needs "the best GpuJoin path-node" somewhere, not two or more.)
>
> Because PostgreSQL core has no information which preserved path-node can
> be removed, extensions that uses path_removal_decision_hook() has 
> responsibility
> to keep the length of preserved_(partial_)pathlist reasonable.

Hi,

Thanks for the patch! I had a quick look at it and have a few questions:

* What would be the exact point/hook at which an extension can use
  preserved pathlists? I guess it's important, since one of the issues
  mentioned in the thread is that such an extension may have to re-do a
  significant part of the calculations from add_path.

* Do you have any benchmark results with some extension using this
  hook? The idea with another pathlist of "discarded" paths sounds like
  a lightweight solution, and indeed I've performed few tests with two
  workloads (simple queries, queries with joins of 10 tables) and the
  difference between master and patched versions is rather small (no
  stable difference for the former, couple of percent for the latter).
  But it's of course with an empty hook, so it would be good to see
  other benchmarks as well.

* Does it make sense to do something similar with add_path_precheck,
  which also excludes paths in some situations?

* This part sounds dangerous for me:

> Because PostgreSQL core has no information which preserved path-node can
> be removed, extensions that uses path_removal_decision_hook() has 
> responsibility
> to keep the length of preserved_(partial_)pathlist reasonable.

  since an extension can keep a limited number of paths in the list, but
  then the same hook could be reused by another extension which will
  also try to limit such paths, but together they'll explode.
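
For what it's worth, here is a hypothetical sketch of what an extension-side
implementation of such a hook might look like; the hook signature, the
CustomPath test, and the policy are assumptions made for illustration, not
the patch's actual API:

#include "postgres.h"

#include "nodes/pathnodes.h"

/*
 * Return true to ask add_path() to move the path it is about to discard
 * onto the relation's preserved list instead of freeing it.
 */
static bool
my_path_removal_decision(PlannerInfo *root, RelOptInfo *rel, Path *old_path)
{
    /*
     * Keep our custom (e.g. GPU join) paths around even when a cheaper
     * ordinary path exists, so they can still be combined with a custom
     * upper node later.  Everything else is discarded as usual.
     */
    if (IsA(old_path, CustomPath))
        return true;

    return false;
}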




Re: redundant error messages

2020-11-05 Thread Alvaro Herrera
On 2020-Nov-05, Isaac Morland wrote:

> In principle, the client knows the database name. In practice, if it's
> coming from PGDATABASE or via a service configuration, one may be confused
> about the database; having the error message be explicit will avoid many
> problems. I can easily imagine that "unable to connect to database" would
> be mystifying, whereas "unable to connect to database foo" would elicit the
> response, "wait, I'm trying to connect to what now?" leading much more
> quickly to a resolution.

Also consider cases like running something via cron, where the person
reading the error output does not necessarily know what command is being
run: it might be hidden inside a script.  It's often very helpful to
have object names in error messages, even if for the normal usage it
seems that the object being operated on is very obvious by just looking
at the command.




Re: libpq compression

2020-11-05 Thread Konstantin Knizhnik



On 05.11.2020 15:40, Matthias van de Meent wrote:


How about Xx_logical_bytes for raw the pg command stream data, and
keeping Xx_compressed_bytes for the compressed data in/out?


Frankly speaking, I do not like the word "logical" in this context.
These are in any case physical bytes, received from the peer.
Speaking about compression or encryption, "raw" is much more widely used
for uncompressed or plain data.


This view, pg_stat_network_traffic, reports traffic from the server
(backend) point of view, i.e.
rx_bytes (received bytes) are commands sent from the client to the server,
and tx_bytes (transmitted bytes) are responses sent by the server to the client.

If compression is not used, then rx_compressed_bytes =
tx_compressed_bytes = 0.
That seems more natural than assigning them the same values as the raw
bytes, because it can really happen that for BLOBs containing already
compressed data (video, images or sound) the compressed data will be
almost the same size as the raw data even if compression is enabled.
So it seems important to distinguish the situation where data cannot
be compressed from the one where it is not compressed at all.

Looking at it from that viewpoint, I agree. My primary reason for
suggesting this was that it would be useful to expose how much data
was transferred between the client and the server, which cannot be
constructed from that view for compression-enabled connections. That
is because the compression methods' counting only starts after some
bytes have already been transferred, and the raw/logical counter
starts deviating once compression is enabled.


Sorry, I do not understand your point.
This view reports network traffic from the server's side.
But the client's traffic information is the "mirror" of this statistic:
server_tx=client_rx and vice versa.


Yes, the first few bytes exchanged by the client and server during the
handshake are not compressed.
But they are correctly counted as "raw bytes". And certainly these few
bytes cannot have any influence on
the measured average compression ratio (the main goal of using this network
traffic statistic, from my point of view).
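
For instance, the average compression ratio could be obtained with a query
like the following (a rough sketch against the proposed view; the exact
column names are still being discussed and are only assumed here):

SELECT tx_raw_bytes::numeric / nullif(tx_compressed_bytes, 0) AS tx_ratio,
       rx_raw_bytes::numeric / nullif(rx_compressed_bytes, 0) AS rx_ratio
FROM pg_stat_network_traffic;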



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



REFRESH MATERIALIZED VIEW and completion tag output

2020-11-05 Thread Fujii Masao

Hi,

The patch that makes pg_stat_statements track the number of rows that
REFRESH MATERIALIZED VIEW command processes was proposed at [1].
When reviewing the patch, I just wondered why the completion tag output
for REFRESH MATERIALIZED VIEW doesn't include the row count. Is this
an intentional behavior? I've not found the past discussion about this yet.

OTOH, CREATE MATERIALIZED VIEW reports something like "SELECT NNN"
(NNN is the row count) as the completion tag output. If WITH NO DATA is
specified, "CREATE MATERIALIZED VIEW" is reported instead. Is it better
to make REFRESH MATERIALIZED VIEW do a similar thing? Or should we
not do that, for example because changing the completion tag output may
break client apps?
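
For illustration, this is roughly what the two commands report in psql
today (a sketch using a throwaway matview):

postgres=# CREATE MATERIALIZED VIEW mv AS SELECT * FROM generate_series(1, 3) AS i;
SELECT 3
postgres=# REFRESH MATERIALIZED VIEW mv;
REFRESH MATERIALIZED VIEW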

[1] https://postgr.es/m/71f6bc72f8bbaa06e701f8bd2562c...@oss.nttdata.com

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: REFRESH MATERIALIZED VIEW and completion tag output

2020-11-05 Thread Mark Dilger



> On Nov 5, 2020, at 8:20 AM, Fujii Masao  wrote:
> 
> The patch that makes pg_stat_statements track the number of rows that
> REFRESH MATERIALIZED VIEW command processes was proposed at [1].
> When reviewing the patch, I just wondered why the completion tag output
> for REFRESH MATERIALIZED VIEW doesn't include the row count. Is this
> an intentional behavior? I've not found the past discussion about this yet.

Of 191 command tags, only 7 return the number of rows.  See 
src/include/tcop/cmdtaglist.h.  REFRESH MATERIALIZED VIEW is similar to CREATE 
MATERIALIZED VIEW and also CREATE TABLE AS, which also do not return the row 
count.  I think this behavior is historical, and preserved for compatibility.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Yet another fast GiST build

2020-11-05 Thread Andrey Borodin



> 2 нояб. 2020 г., в 19:45, Heikki Linnakangas  написал(а):
> 
>> How do you think, should this patch and patch with pageinspect GiST 
>> functions be registered on commitfest?
> 
> Yeah, that'd be good.
I've registered both patches in the January CF.
The pageinspect patch's code looks good to me (besides the TODO item there), but it
lacks docs and tests. I can add some info and calls in future reviews.

> 
> On 30/10/2020 20:20, Andrey Borodin wrote:
> A few quick comments:
> 
> * Currently with float8, you immediately abort abbreviation if SIZEOF_DATUM 
> == 4. Like in the 'inet' above, you could convert the float8 to float4 
> instead.
> 
> * Some of the ALTER OPERATOR FAMILY commands in btree_gist--1.6--1.7.sql are 
> copy-pasted wrong. Here's one example:
> 
> ALTER OPERATOR FAMILY gist_timestamptz_ops USING gist ADD
>   FUNCTION11  (text, text) gbt_text_sortsupport;
> 
> * It's easy to confuse the new comparison functions with the existing 
> comparisons used by the picksplit functions. Some comments and/or naming 
> changes would be good to clarify how they're different.
> 
> * It would be straightforward to have abbreviated functions for macaddr and 
> macaddr8 too.

I'll fix these issues soon. But things like this bother me a lot:
> ALTER OPERATOR FAMILY gist_timestamptz_ops USING gist ADD
>   FUNCTION11  (text, text) gbt_text_sortsupport;

To test that these functions are actually called for a sorted build, we should
support forcing a sorted build, something like "CREATE INDEX ON A USING GIST(B)
WITH (sorting=surely, and fail if not)".
If we can force an unconditional sorted build and an unconditional buffered build,
we can check the opclass code better.
The problem is that the current reloption is called "buffering". It would be strange
to give this enum a possible value meaning "not buffering, but very much like
buffering, just another way".
What do you think: is it OK to add the reloption value "buffering=sorting" to
enhance the tests?

Best regards, Andrey Borodin.



Re: Fix brin_form_tuple to properly detoast data

2020-11-05 Thread Alvaro Herrera
On 2020-Nov-04, Tomas Vondra wrote:

> The first test is fairly trivial - it simply builds index on toasted data
> and then shows how an insert and select fail. There's a caveat, that this
> requires a DELETE + VACUUM, and the VACUUM actually has to cleanup the rows.
> So there must be no concurrent transactions that might need the rows, which
> is unlikely in regression tests. So this requires waiting for all running
> transactions to finish - I did that by building an index concurrently. It's
> a bit strange, but it's better than any other solution I could think of
> (timeout or some custom wait for xacts).

There are recent changes in vacuum for temp tables (commit 94bc27b57680?)
that would maybe make this stable enough, without having to have the CIC
there.  At least, I tried it locally a few times and it appears to work well.
This won't work for older releases though, just master.  This is patch
0001 attached here.

> The second test is a bit redundant - it merely checks that both CREATE INDEX
> and INSERT INTO fail the same way when the index tuple gets too large.
> Before the fix there were some inconsistencies - the CREATE INDEX succeeded
> because it used TOASTed data. So ultimately this tests the same thing, but
> from a different perspective.

Hmm.  This one shows page size in the error messages, so it'll fail on
nonstandard builds.  I think we try to stay away from introducing those,
so I'd leave this test out.

The code fix looks all right -- I'd just move the #include lines to
their place.  Patch 0002.

You add this comment:

> + /*
> +  * Do nothing if value is not of varlena type. We don't 
> need to
> +  * care about NULL values here, thanks to bv_allnulls 
> above.
> +  *
> +  * If value is stored EXTERNAL, must fetch it so we are 
> not
> +  * depending on outside storage.
> +  *
> +  * XXX Is this actually true? Could it be that the 
> summary is
> +  * NULL even for range with non-NULL data? E.g. 
> degenerate bloom
> +  * filter may be thrown away, etc.
> +  */

I think the XXX comment points to a bug that we don't have right now,
since neither minmax nor inclusion can end up with a NULL summary
starting from non-NULL data.  But if the comment about bloom is correct,
then surely it'll become a bug when bloom is added.

I don't think we need the second part of this comment:

> +/*
> + * This enables de-toasting of index entries.  Needed until VACUUM is
> + * smart enough to rebuild indexes from scratch.
> + */

... because, surely, we're now never working on having VACUUM rebuild
indexes from scratch.   In fact, I wonder if we need the #define at
all.  I propose to remove all those #ifdef lines in your patch.

The fix looks good to me.  I just added a comment in 0003.

From 101638065fb13e540f58f96ff839af982d7c5344 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Thu, 5 Nov 2020 14:12:38 -0300
Subject: [PATCH 1/3] Simplify test to use a temp table rather than vacuum.

---
 src/test/regress/expected/brin.out | 7 +--
 src/test/regress/sql/brin.sql  | 8 +---
 2 files changed, 2 insertions(+), 13 deletions(-)

diff --git a/src/test/regress/expected/brin.out b/src/test/regress/expected/brin.out
index 79ec028f69..0650e24a31 100644
--- a/src/test/regress/expected/brin.out
+++ b/src/test/regress/expected/brin.out
@@ -527,7 +527,7 @@ EXPLAIN (COSTS OFF) SELECT * FROM brin_test WHERE b = 1;
 (2 rows)
 
 -- make sure data are properly de-toasted in BRIN index
-CREATE TABLE brintest_3 (a text, b text, c text, d text);
+CREATE TEMP TABLE brintest_3 (a text, b text, c text, d text);
 -- long random strings (~2000 chars each, so ~6kB for min/max on two
 -- columns) to trigger toasting
 WITH rand_value AS (SELECT string_agg(md5(i::text),'') AS val FROM generate_series(1,60) s(i))
@@ -535,11 +535,6 @@ INSERT INTO brintest_3
 SELECT val, val, val, val FROM rand_value;
 CREATE INDEX brin_test_toast_idx ON brintest_3 USING brin (b, c);
 DELETE FROM brintest_3;
--- We need to wait a bit for all transactions to complete, so that the
--- vacuum actually removes the TOAST rows. Creating an index concurrently
--- is a one way to achieve that, because it does exactly such wait.
-CREATE INDEX CONCURRENTLY brin_test_temp_idx ON brintest_3(a);
-DROP INDEX brin_test_temp_idx;
 -- vacuum the table, to discard TOAST data
 VACUUM brintest_3;
 -- retry insert with a different random-looking (but deterministic) value
diff --git a/src/test/regress/sql/brin.sql b/src/test/regress/sql/brin.sql
index f1b510fb0b..9b284738dd 100644
--- a/src/test/regress/sql/brin.sql
+++ b/src/test/regress/sql/brin.sql
@@ -472,7 +472,7 @@ EXPLAIN (COSTS OFF) SELECT * FROM brin_test WHERE a = 1;
 EXPLAIN (COSTS OFF) SELECT * FROM brin_test WHERE b = 1;
 
 -- make sure data are properly de-toasted in BRIN index
-

Re: Yet another fast GiST build

2020-11-05 Thread Justin Pryzby
On Thu, Nov 05, 2020 at 10:11:52PM +0500, Andrey Borodin wrote:
> To test that these functions are actually called for a sorted build, we should
> support forcing a sorted build, something like "CREATE INDEX ON A USING GIST(B)
> WITH (sorting=surely, and fail if not)".

Maybe you could add a DEBUG1 message for that, and include that in regression
tests, which would then fail if sorting wasn't used.

Maybe you'd need to make it consistent by setting GUCs like work_mem /
max_parallel_maintenance_workers / ??

Similar to this

postgres=# SET client_min_messages =debug;
postgres=# CREATE INDEX ON A USING GIST(i) WITH(buffering=off);
DEBUG:  building index "a_i_idx2" on table "a" serially
CREATE INDEX

> If we can force an unconditional sorted build and an unconditional buffered
> build, we can check the opclass code better.
> The problem is that the current reloption is called "buffering". It would be
> strange to give this enum a possible value meaning "not buffering, but very
> much like buffering, just another way".
> What do you think: is it OK to add the reloption value "buffering=sorting" to
> enhance the tests?
> 
> Best regards, Andrey Borodin.




Re: Fix brin_form_tuple to properly detoast data

2020-11-05 Thread Tomas Vondra




On 11/5/20 6:17 PM, Alvaro Herrera wrote:

On 2020-Nov-04, Tomas Vondra wrote:


The first test is fairly trivial - it simply builds index on toasted data
and then shows how an insert and select fail. There's a caveat, that this
requires a DELETE + VACUUM, and the VACUUM actually has to cleanup the rows.
So there must be no concurrent transactions that might need the rows, which
is unlikely in regression tests. So this requires waiting for all running
transactions to finish - I did that by building an index concurrently. It's
a bit strange, but it's better than any other solution I could think of
(timeout or some custom wait for xacts).


There are recent changes in vacuum for temp tables (commit 94bc27b57680?)
that would maybe make this stable enough, without having to have the CIC
there.  At least, I tried it locally a few times and it appears to work well.
This won't work for older releases though, just master.  This is patch
0001 attached here.



IIUC you're suggesting to use a temporary table in the test? 
Unfortunately, this does not work on older releases, and IMHO the test 
should be backpatched too. IMHO the CIC "hack" is acceptable, unless 
there's a better solution that I'm not aware of.



The second test is a bit redundant - it merely checks that both CREATE INDEX
and INSERT INTO fail the same way when the index tuple gets too large.
Before the fix there were some inconsistencies - the CREATE INDEX succeeded
because it used TOASTed data. So ultimately this tests the same thing, but
from a different perspective.


Hmm.  This one shows page size in the error messages, so it'll fail on
nonstandard builds.  I think we try to stay away from introducing those,
so I'd leave this test out.


Hmm, OK. I don't think having "redundant" test is a big deal, but I 
haven't thought about builds with different block sizes. I'll leave this 
out.




The code fix looks all right -- I'd just move the #include lines to
their place.  Patch 0002.



OK


You add this comment:


+   /*
+* Do nothing if value is not of varlena type. We don't 
need to
+* care about NULL values here, thanks to bv_allnulls 
above.
+*
+* If value is stored EXTERNAL, must fetch it so we are 
not
+* depending on outside storage.
+*
+* XXX Is this actually true? Could it be that the 
summary is
+* NULL even for range with non-NULL data? E.g. 
degenerate bloom
+* filter may be thrown away, etc.
+*/


I think the XXX comment points to a bug that we don't have right now,
since neither minmax nor inclusion can end up with a NULL summary
starting from non-NULL data.  But if the comment about bloom is correct,
then surely it'll become a bug when bloom is added.



Yeah, but that'd be something for the bloom patch to fix, I think.

This got me thinking though - wouldn't it be better to handle too large 
values by treating the range as "degenerate" (i.e. store NULL and 
consider it as matching all queries), instead of failing the CREATE 
INDEX or DML? I find the current behavior rather annoying, because it 
depends on the other rows in the page range, not just on the one row the 
user deals with. Perhaps this might even be considered an information 
leak about the other data. Of course, not something this patch should 
deal with.



I don't think we need the second part of this comment:


+/*
+ * This enables de-toasting of index entries.  Needed until VACUUM is
+ * smart enough to rebuild indexes from scratch.
+ */


... because, surely, we're now never working on having VACUUM rebuild
indexes from scratch.   In fact, I wonder if we need the #define at
all.  I propose to remove all those #ifdef lines in your patch.



That's a verbatim copy of a comment from indextuple.c. IMHO we should 
keep it the same in both places.



The fix looks good to me.  I just added a comment in 0003.



Thanks. Any opinions on fixing this in existing clusters? Any better 
ideas than just giving users the SQL query to list possibly-affected 
indexes?



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: libpq compression

2020-11-05 Thread Matthias van de Meent
On Thu, 5 Nov 2020 at 17:01, Konstantin Knizhnik
 wrote:
>
> Sorry, I do not understand your point.
> This view reports network traffic from the server's side.
> But the client's traffic information is the "mirror" of this statistic:
> server_tx=client_rx and vice versa.
>
> Yes, the first few bytes exchanged by the client and server during the
> handshake are not compressed.
> But they are correctly counted as "raw bytes". And certainly these few
> bytes cannot have any influence on the measured average compression ratio
> (the main goal of using this network traffic statistic, from my point of view).

As I understand it, the current metrics are as follows:

Server
 |<- |<- Xx_raw_bytes
 |  Compression
 |   |<- Xx_compressed_bytes
Client connection
 |
Network

From the views' name 'pg_stat_network_traffic', to me 'Xx_raw_bytes'
would indicate the amount of bytes sent/received over the client
connection (e.g. measured between the Client connection and Network
part, or between the Server/Client connection and Compression/Client
connection sections), because that is my natural understanding of
'raw tx network traffic'. This is why I proposed 'logical' instead
of 'raw', as 'raw' is quite apparently understood differently when
interpreted by different people, whereas 'logical' already implies
that the value is an application logic-determined value (e.g. size
before compression).

The current name implies a 'network' viewpoint when observing this
view, not the 'server'/'backend' viewpoint you describe. If the
'server'/'backend' viewpoint is the desired default viewpoint, then
I suggest to rename the view to `pg_stat_network_compression`, as
that moves the focus to the compression used, and subsequently
clarifies `raw` as the raw application command data.

If instead the name `pg_stat_network_traffic` is kept, I suggest
changing the metrics collected to the following scheme:

Server
 |<- |<- Xx_logical_bytes
 |  Compression
 |   |<- Xx_compressed_bytes (?)
 |<- |<- Xx_raw_bytes
Client connection
 |
Network

This way, `raw` in the context of 'network_traffic' means
"sent-over-the-connection" data, and 'logical' is 'application
logic' data (as I'd expect from both a network and an application point of
view). 'Xx_compressed_bytes' is a nice addition, but not strictly
necessary, as you can subtract raw from logical to derive the bytes
saved by compression.
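
As a sketch, under that scheme the savings would be easy to derive
(column names are hypothetical here, following the diagram above):

SELECT tx_logical_bytes - tx_raw_bytes AS tx_bytes_saved,
       rx_logical_bytes - rx_raw_bytes AS rx_bytes_saved
FROM pg_stat_network_traffic;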




Re: Move catalog toast table and index declarations

2020-11-05 Thread Peter Eisentraut

On 2020-11-05 12:59, John Naylor wrote:

I think we're talking past each other. Here's a concrete example:

#define BKI_ROWTYPE_OID(oid,oidmacro)
#define DECLARE_TOAST(name,toastoid,indexoid) extern int no_such_variable

I understand these to be functionally equivalent as far as what the C 
compiler sees.


The issue is that you can't have a bare semicolon at the top level of a 
C compilation unit, at least on some compilers.  So doing


#define FOO(stuff) /*empty*/

and then

FOO(123);

won't work.  You need to fill the definition of FOO with some stuff to 
make it valid.


BKI_ROWTYPE_OID on the other hand is not used at the top level like 
this, so it can be defined to empty.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: libpq compression

2020-11-05 Thread Peter Eisentraut

On 2020-11-02 20:50, Andres Freund wrote:

On 2020-10-31 22:25:36 +0500, Andrey Borodin wrote:

But the price of compression is 1 cpu for 500MB/s (zstd). With a
20Gbps network adapters cost of recompressing all traffic is at most
~4 cores.


It's not quite that simple, because presumably each connection is going
to be handled by one core at a time in the pooler. So it's easy to slow
down peak throughput if you also have to deal with TLS etc.


Also, current deployments of connection poolers use rather small machine 
sizes.  Telling users you need 4 more cores per instance now to 
decompress and recompress all the traffic doesn't seem very attractive. 
Also, it's not unheard of to have more than one layer of connection 
pooling.  With that, this whole design sounds a bit like a 
heat-generation system. ;-)





list_free() in index_get_partition()

2020-11-05 Thread Justin Pryzby
Seems to be missing.

The 2nd patch does some more cleanup - Before, a failed syscache lookup would
ERROR, but I don't think that's supposed to happen.  get_rel_relispartition()
would instead return false, and we won't call get_partition_parent().

commit a8ad949f22b8dd7b23049b0b0704e5be9233e319
Author: Justin Pryzby 
Date:   Thu Nov 5 12:06:49 2020 -0600

list_free() in index_get_partition()

which was added at: a6da0047158b8a227f883aeed19eb7fcfbef11fb

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 239ac017fa..4dfac39adf 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -153,44 +153,45 @@ Oid
 index_get_partition(Relation partition, Oid indexId)
 {
List   *idxlist = RelationGetIndexList(partition);
ListCell   *l;
 
foreach(l, idxlist)
{
Oid partIdx = lfirst_oid(l);
HeapTuple   tup;
Form_pg_class classForm;
boolispartition;
 
tup = SearchSysCache1(RELOID, ObjectIdGetDatum(partIdx));
if (!HeapTupleIsValid(tup))
elog(ERROR, "cache lookup failed for relation %u", 
partIdx);
classForm = (Form_pg_class) GETSTRUCT(tup);
ispartition = classForm->relispartition;
ReleaseSysCache(tup);
if (!ispartition)
continue;
-   if (get_partition_parent(lfirst_oid(l)) == indexId)
+   if (get_partition_parent(partIdx) == indexId)
{
list_free(idxlist);
return partIdx;
}
}
 
+   list_free(idxlist);
return InvalidOid;
 }

commit 0a01cb7561d6ec74aa5829040bd1478e7b113d89
Author: Justin Pryzby 
Date:   Thu Nov 5 12:14:24 2020 -0600

index_get_partition(): use lsyscache ??

The old code failed when !HeapTupleIsValid(), the new code will return 
false -
is that ok ??  Two of the existing callers error anyway.

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 4dfac39adf..3d78bc0872 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -28,6 +28,7 @@
 #include "partitioning/partbounds.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/fmgroids.h"
+#include "utils/lsyscache.h"
 #include "utils/partcache.h"
 #include "utils/rel.h"
 #include "utils/syscache.h"
@@ -158,17 +159,8 @@ index_get_partition(Relation partition, Oid indexId)
foreach(l, idxlist)
{
Oid partIdx = lfirst_oid(l);
-   HeapTuple   tup;
-   Form_pg_class classForm;
-   boolispartition;
-
-   tup = SearchSysCache1(RELOID, ObjectIdGetDatum(partIdx));
-   if (!HeapTupleIsValid(tup))
-   elog(ERROR, "cache lookup failed for relation %u", 
partIdx);
-   classForm = (Form_pg_class) GETSTRUCT(tup);
-   ispartition = classForm->relispartition;
-   ReleaseSysCache(tup);
-   if (!ispartition)
+
+   if (!get_rel_relispartition(partIdx))
continue;
if (get_partition_parent(partIdx) == indexId)
{
From 90e155f660314ec70de14177241dd1ec029d9311 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 5 Nov 2020 12:06:49 -0600
Subject: [PATCH 1/2] Fixes to index_get_partition()

which was added at: a6da0047158b8a227f883aeed19eb7fcfbef11fb

free list in cases where there's no parent
use lsyscache: get_rel_relispartition()
---
 src/backend/catalog/partition.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 239ac017fa..4dfac39adf 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -170,13 +170,14 @@ index_get_partition(Relation partition, Oid indexId)
 		ReleaseSysCache(tup);
 		if (!ispartition)
 			continue;
-		if (get_partition_parent(lfirst_oid(l)) == indexId)
+		if (get_partition_parent(partIdx) == indexId)
 		{
 			list_free(idxlist);
 			return partIdx;
 		}
 	}
 
+	list_free(idxlist);
 	return InvalidOid;
 }
 
-- 
2.17.0

From 768f91199b5e197759b2ead6d838a72974dca6b6 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 5 Nov 2020 12:14:24 -0600
Subject: [PATCH 2/2] index_get_partition(): use lsyscache ??

The old code failed when !HeapTupleIsValid(), the new code will return false -
is that ok ??
---
 src/backend/catalog/partition.c | 14 +++---
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 4dfac39adf..3d78bc0872 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -28,6 +28,7 @@
 #include "partitionin

Re: default result formats setting

2020-11-05 Thread Peter Eisentraut

On 2020-10-26 09:45, Pavel Stehule wrote:

The attached patch implements this.  For example, to get int2, int4,
int8 in binary by default, you could set

SET default_result_formats = '21=1,23=1,20=1';


Using SET statement for this case looks very obscure :/

This is a protocol related issue, and should be solved by protocol 
extending. I don't think so SQL level is good for that.


We could also make it a protocol message, but it would essentially 
implement the same thing, just again separately.  And then you'd have no 
support to inspect the current setting, test out different settings 
interactively, etc.  That seems pretty wasteful and complicated for no 
real gain.
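
For example, with the proposed GUC all of that comes for free (a sketch,
using the setting format shown above):

SET default_result_formats = '21=1,23=1,20=1';
SHOW default_result_formats;
BEGIN;
SET LOCAL default_result_formats = '';  -- temporary override for one transaction
COMMIT;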


> More, this format is not practical for custom types, and the list can
> be pretty long.

The list is what the list is.  I don't see how you can make it any 
shorter.  You have to list the data types that you're interested in 
somehow.  Any other ideas?


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: default result formats setting

2020-11-05 Thread Peter Eisentraut

On 2020-10-26 15:35, Tom Lane wrote:

In the discussion in [0], I pondered a new protocol message for that,
but after further thought, a GUC setting would do just as well.


I think a GUC is conceptually the wrong level ...


It does feel that way, but it gets the job done well and you can use all 
the functionality already existing, such as being able to inspect 
settings, temporarily change settings, etc.  Otherwise we'd have to 
implement a lot of things like that again.  That would turn this 200 
line patch into a 2000 line patch without any real additional benefit.



In order to be able to test this via libpq, I had to add a little hack.


... which is part of the reason why you have to kluge this.  I'm not
entirely certain which levels of the client stack need to know about
this, but surely libpq is one.

>
> I'm also quite worried about failures (maybe even security problems)
> arising from the "wrong level" of the client stack setting the GUC.

I don't think libpq needs to know about this very deeply.  The protocol 
provides format information with the result set.  Libpq programs can 
query that with PQfformat() and act accordingly.  Nothing else is needed.


The real consumer of this would be the JDBC driver, which has built-in 
knowledge of the binary formats of some data types.  Libpq doesn't, so 
it wouldn't use this facility anyway.  (Not saying someone couldn't 
write a higher-level C library that does this, but it doesn't exist now. 
... hmm ... ecpg ...)



Independently of that, how would you implement "says otherwise" here,
ie do a single-query override of the session's prevailing setting?
Maybe the right thing for that is to define -1 all the way down to the
protocol level as meaning "use the session's per-type default", and
then if you don't want that you can pass 0 or 1.  An advantage of that
is that you couldn't accidentally break an application that wasn't
ready for this feature, because it would not be the default to use it.


Yeah, that sounds a lot better.  I'll look into that.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Fix brin_form_tuple to properly detoast data

2020-11-05 Thread Alvaro Herrera
On 2020-Nov-05, Tomas Vondra wrote:

> On 11/5/20 6:17 PM, Alvaro Herrera wrote:

> > There are recent changes in vacuum for temp tables (commit 94bc27b57680?)
> > that would maybe make this stable enough, without having to have the CIC
> > there.  At least, I tried it locally a few times and it appears to work 
> > well.
> > This won't work for older releases though, just master.  This is patch
> > 0001 attached here.
> 
> IIUC you're suggesting to use a temporary table in the test? Unfortunately,
> this does not work on older releases, and IMHO the test should be
> backpatched too. IMHO the CIC "hack" is acceptable, unless there's a better
> solution that I'm not aware of.

Oh, sure, the CIC hack is acceptable for the older branches.  I'm just
saying that you can use a temp table everywhere, and keep CIC in the old
branches and no CIC in master.

> This got me thinking though - wouldn't it be better to handle too large
> values by treating the range as "degenerate" (i.e. store NULL and consider
> it as matching all queries), instead of failing the CREATE INDEX or DML? I
> find the current behavior rather annoying, because it depends on the other
> rows in the page range, not just on the one row the user deals with. Perhaps
> this might even be considered an information leak about the other data. Of
> course, not something this patch should deal with.

Hmm.  Regarding text I remember thinking we could just truncate values
(as we do for LIKE, as I recall).  I suppose that strategy would work
even for bytea.


> > > +/*
> > > + * This enables de-toasting of index entries.  Needed until VACUUM is
> > > + * smart enough to rebuild indexes from scratch.
> > > + */
> > 
> > ... because, surely, we're now never working on having VACUUM rebuild
> > indexes from scratch.   In fact, I wonder if we need the #define at
> > all.  I propose to remove all those #ifdef lines in your patch.
> 
> That's a verbatim copy of a comment from indextuple.c. IMHO we should keep
> it the same in both places.

Sure, if you want to.

> > The fix looks good to me.  I just added a comment in 0003.
> 
> Thanks. Any opinions on fixing this in existing clusters? Any better ideas
> than just giving users the SQL query to list possibly-affected indexes?

None here.




Re: Reduce the number of special cases to build contrib modules on windows

2020-11-05 Thread David Rowley
Thank you for looking at this.

On Tue, 3 Nov 2020 at 09:49, Andres Freund  wrote:
> > diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
> > index 90594bd41b..491a465e2f 100644
> > --- a/src/tools/msvc/Mkvcbuild.pm
> > +++ b/src/tools/msvc/Mkvcbuild.pm
> > @@ -32,16 +32,13 @@ my $libpq;
> >  my @unlink_on_exit;
> >
> >  # Set of variables for modules in contrib/ and src/test/modules/
> > -my $contrib_defines = { 'refint' => 'REFINT_VERBOSE' };
> > -my @contrib_uselibpq = ('dblink', 'oid2name', 'postgres_fdw', 'vacuumlo');
> > -my @contrib_uselibpgport   = ('oid2name', 'pg_standby', 'vacuumlo');
> > -my @contrib_uselibpgcommon = ('oid2name', 'pg_standby', 'vacuumlo');
> > +my $contrib_defines = undef;
> > +my @contrib_uselibpq = undef;
> > +my @contrib_uselibpgport   = ('pg_standby');
> > +my @contrib_uselibpgcommon = ('pg_standby');
> >  my $contrib_extralibs  = undef;
> >  my $contrib_extraincludes = { 'dblink' => ['src/backend'] };
> > -my $contrib_extrasource = {
> > - 'cube' => [ 'contrib/cube/cubescan.l', 'contrib/cube/cubeparse.y' ],
> > - 'seg'  => [ 'contrib/seg/segscan.l',   'contrib/seg/segparse.y' ],
> > -};
> > +my $contrib_extrasource = undef;
>
> Hm - Is that all the special case stuff we get rid of?

What else did you have in mind?

> What's with the now unef'd arrays/hashes? First, wouldn't an empty array be
> more appropriate? Second, can we just get rid of them?

Yes, those should be empty hashtables/arrays. I've changed that.

We could get rid of the variables too. I've just left them in there
for now, as I wasn't sure whether it might be a good idea to keep them
in case we really need to brute-force something in the future. I found
parsing makefiles quite tedious, so it didn't seem unrealistic to me
that someone might struggle in the future to make something work.

> And why is the special stuff for pg_standby still needed?

I'm not much of an expert, but I didn't see anything in the makefile
for pg_standby that indicates we should link libpgport or libpgcommon.
It would be good if someone could explain how that works.

> >  my @contrib_excludes = (
> >   'bool_plperl',  'commit_ts',
> >   'hstore_plperl','hstore_plpython',
> > @@ -163,7 +160,7 @@ sub mkvcbuild
> >   $postgres = $solution->AddProject('postgres', 'exe', '', 
> > 'src/backend');
> >   $postgres->AddIncludeDir('src/backend');
> >   $postgres->AddDir('src/backend/port/win32');
> > - $postgres->AddFile('src/backend/utils/fmgrtab.c');
> > + $postgres->AddFile('src/backend/utils/fmgrtab.c', 1);
> >   $postgres->ReplaceFile('src/backend/port/pg_sema.c',
> >   'src/backend/port/win32_sema.c');
> >   $postgres->ReplaceFile('src/backend/port/pg_shmem.c',
> > @@ -316,8 +313,8 @@ sub mkvcbuild
>
> Why do so many places need this new parameter? Looks like all explicit
> calls use it? Can't we just use it by default, using a separate function
> for the internal cases? Would make this a lot more readable...

That makes sense. I've updated the patch so that AddFile() always adds any
additional files, and I've also added a new function named
AddFileConditional which does what AddFile(..., 0) did.

> >   my $isolation_tester =
> > $solution->AddProject('isolationtester', 'exe', 'misc');
> > - $isolation_tester->AddFile('src/test/isolation/isolationtester.c');
> > - $isolation_tester->AddFile('src/test/isolation/specparse.y');
> > - $isolation_tester->AddFile('src/test/isolation/specscanner.l');
> > - $isolation_tester->AddFile('src/test/isolation/specparse.c');
> > + $isolation_tester->AddFile('src/test/isolation/isolationtester.c', 1);
> > + $isolation_tester->AddFile('src/test/isolation/specparse.y', 1);
> > + $isolation_tester->AddFile('src/test/isolation/specscanner.l', 1);
> > + $isolation_tester->AddFile('src/test/isolation/specparse.c', 1);
> >   $isolation_tester->AddIncludeDir('src/test/isolation');
> >   $isolation_tester->AddIncludeDir('src/port');
> >   $isolation_tester->AddIncludeDir('src/test/regress');
> > @@ -342,8 +339,8 @@ sub mkvcbuild
>
> Why aren't these dealth with using the .c->.l/.y logic you added?

Yeah, some of those could be removed. I mostly only paid attention to
contrib though.

> > + # Process custom compiler flags
> > + if ($mf =~ /^PG_CPPFLAGS\s*=\s*(.*)$/mg)
>
> Probably worth mentioning in pgxs.mk or such.

I'm not quite sure I understand what you mean here.

> > + {
> > + foreach my $flag (split /\s+/, $1)
> > + {
> > + if ($flag =~ /^-D(.*)$/)
> > + {
> > + foreach my $proj (@projects)
> > + {
> > + $proj->AddDefine($1);
> > + }
> > + }
> > + elsif ($flag =~ /^-I(.*)$/)
> > + {
> > + for

Re: Collation versioning

2020-11-05 Thread Thomas Munro
On Wed, Nov 4, 2020 at 9:11 PM Michael Paquier  wrote:
> On Wed, Nov 04, 2020 at 08:44:15AM +0100, Juan José Santamaría Flecha wrote:
> >  We could create a static table with the conversion based on what was
> > discussed for commit a169155, please find attached a spreadsheet with the
> > comparison. This would require maintenance as new LCIDs are released [1].
> >
> > [1]
> > https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-lcid/70feba9f-294e-491e-b6eb-56532684c37f
>
> I am honestly not a fan of something like that as it has good chances
> to rot.

No opinion on that, other than that we'd surely want a machine
readable version.  As for *when* we use that information, I'm
wondering if it would make sense to convert datcollate to a language
tag in initdb, and also change pg_upgrade's equivalent_locale()
function to consider "English_United States.*" and "en-US" to be
equivalent when upgrading to 14 (which would then be the only point
you'd ever have to have faith that we can convert the old style names
to the new names correctly).  I'm unlikely to work on this myself as I
have other operating systems to fix, but I'll certainly be happy if
somehow we can get versioning for default on Windows in PG14 and not
have to come up with weasel words in the manual.

Just by the way, I think Windows does one thing pretty nicely here: it
has versions with a major and a minor part.  If the minor part goes
up, it means that they only added new code points, but didn't change
the ordering of any existing code points, so in some circumstances you
don't have to rebuild (which I think is the case for many Unicode
updates, adding new Chinese characters or emojis or whatever).  I
thought about whether we should replace the strcmp() comparison with a
call into provider-specific code, and in the case of Win32 locales it
could maybe understand that.  But there are two problems of limited
surmountability: (1) You have an index built with version 42.1, and now
version 42.3 is present; OK, we can read this index, but if we write
any new data, then a streaming replica that has 42.2 will think it's
OK to read data, but it's not OK; so as soon as you write, you'd need
to update the catalogue, which is quite complicated (cf enum types);
(2) The whole theory only holds together if you didn't actually use
any of the new codepoints introduced by 42.3 while the index said
42.1, yet PostgreSQL isn't validating the codepoints you use against
the collation provider's internal map of valid code points.  So I gave
up with that line of thinking for now.




Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-11-05 Thread David Rowley
On Mon, 2 Nov 2020 at 20:43, David Rowley  wrote:
>
> On Tue, 20 Oct 2020 at 22:30, David Rowley  wrote:
> I did some further tests this time with some tuple deforming.  Again,
> it does seem that v9 is slower than v8.
>
> Graphs attached
>
> Looking at profiles, I don't really see any obvious reason as to why
> this is.  I'm very much inclined to just pursue the v8 patch (separate
> Result Cache node) and just drop the v9 idea altogether.

Nobody raised any objections, so I'll start taking a more serious look
at the v8 version (the patch with the separate Result Cache node).

One thing that I had planned to come back to about now is the name
"Result Cache".  I admit to not thinking for too long on the best name
and always thought it was something to come back to later when there's
some actual code to debate a better name for. "Result Cache" was
always a bit of a placeholder name.

Some other names that I'd thought of were:

"MRU Hash"
"MRU Cache"
"Parameterized Tuple Cache" (bit long)
"Parameterized Cache"
"Parameterized MRU Cache"

I know Robert had shown some interest in using a different name.  It
would be nice to settle on something most people are happy with soon.

David




Re: Why does to_json take "anyelement" rather than "any"?

2020-11-05 Thread Mohamed Wael Khobalatte
On Tue, Nov 3, 2020 at 11:54 AM Nikhil Benesch 
wrote:

> to_json is declared as taking "anyelement" as input, which means
> you can't pass it something of unknown type:
>
>  postgres=# SELECT to_json('foo');
>  ERROR:  could not determine polymorphic type because input has type
> unknown
>
> But this works fine with the very similar json_build_array function:
>
>  postgres=# SELECT json_build_array('foo');
>   json_build_array
>  --
>   ["foo"]
>  (1 row)
>
> The difference is that json_build_array takes type "any" as input, while
> to_json takes "anyelement" as input.
>
> Is there some reason to_json couldn't be switched to take "any" as input?
> Hacking this together seems to mostly just work:
>
>  postgres=# CREATE FUNCTION my_to_json ("any") RETURNS json LANGUAGE
> 'internal' AS 'to_json';
>  postgres=# SELECT my_to_json('foo');
>   my_to_json
>  
>   "foo"
>  (1 row)
>
> Is there something I'm missing?
>
> Nikhil
>
>
Hm, good question. I am also curious as to why this happens.
`json_build_array` ends up casting unknowns to text (from reading the
code), which seems like a reasonable (although not completely tight)
assumption. Not sure why `to_json` can't just do the same. You can always
cast to text yourself, of course, but I am not familiar with the type
hierarchy enough to tell why `to_json` can't deduce that as text whereas
the other function can.
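
For reference, the explicit cast workaround mentioned above looks like this
(a quick sketch):

postgres=# SELECT to_json('foo'::text);
 to_json
---------
 "foo"
(1 row)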


Re: [Patch] Optimize dropping of relation buffers using dlist

2020-11-05 Thread Thomas Munro
On Thu, Nov 5, 2020 at 10:47 PM Amit Kapila  wrote:
> I still feel 'cached' is a better name.

Amusingly, this thread is hitting all the hardest problems in computer
science according to the well known aphorism...

Here's a devil's advocate position I thought about:  It's OK to leave
stray buffers (clean or dirty) in the buffer pool if files are
truncated underneath us by gremlins, as long as your system eventually
crashes before completing a checkpoint.  The OID can't be recycled
until after a successful checkpoint, so the stray blocks can't be
confused with the blocks of another relation, and weird errors are
expected on a system that is in serious trouble.  It's actually much
worse that we can give incorrect answers to queries when files are
truncated by gremlins (in the window of time before we presumably
crash because of EIO), because we're violating basic ACID principles
in user-visible ways.  In this thread, discussion has focused on
availability (ie avoiding failures when trying to write back stray
buffers to a file that has been unlinked), but really a system that
can't see arbitrary committed transactions *shouldn't be available*.
This argument applies whether you think SEEK_END can only give weird
answers in the specific scenario I demonstrated with NFS, or whether
you think it's arbitrarily b0rked and reports random numbers: we
fundamentally can't tolerate that, so why are we trying to?




Re: Move OpenSSL random under USE_OPENSSL_RANDOM

2020-11-05 Thread Michael Paquier
On Thu, Nov 05, 2020 at 01:59:11PM +0100, Daniel Gustafsson wrote:
> Not yet, and potentially never will.  Given the consequences of a PRNG which
> hasn't been properly initialized I think it's ok to be defensive in this
> codepath however.

+   /*
+* In case the backend is using the PRNG from OpenSSL without being built
+* with support for OpenSSL, make sure to perform post-fork initialization.
+* If the backend is using OpenSSL then we have already performed this
+* step. The same version caveat as discussed in the comment above applies
+* here as well.
+*/
+#ifndef USE_OPENSSL
+   RAND_poll();
+#endif

I still don't see the point of this extra complexity, as
USE_OPENSSL_RANDOM implies USE_OPENSSL, and we also call RAND_poll() a
couple of lines down in the main function under USE_OPENSSL_RANDOM.
So I would just remove this whole block, and replace the comment by a
simple "initialization already done above".
--
Michael


signature.asc
Description: PGP signature


RE: extension patch of CREATE OR REPLACE TRIGGER

2020-11-05 Thread osumi.takami...@fujitsu.com
Hi,


On Thursday, November 5, 2020 1:22 PM
Peter Smith  wrote:
> On Wed, Nov 4, 2020 at 2:53 PM osumi.takami...@fujitsu.com
>  wrote:
> > > (1) COMMENT
> > > File: NA
> > > Maybe next time consider using format-patch to make the patch. Then
> > > you can include a comment to give the background/motivation for this
> change.
> > OK. How about v15 ?
> 
> Yes, it is good now, apart from some typos in the first sentence:
> "grammer" --> "grammar"
> "exisiting" --> "existing"
Sorry for such minor mistakes. Fixed.


> > > ===
> > >
> > > (2) COMMENT
> > > File: doc/src/sgml/ref/create_trigger.sgml
> > > @@ -446,6 +454,13 @@ UPDATE OF
> > > column_name1
> > > [, column_name2 > > Currently it says:
> > > When replacing an existing trigger with CREATE OR REPLACE TRIGGER,
> > > there are restrictions. You cannot replace constraint triggers. That
> > > means it is impossible to replace a regular trigger with a
> > > constraint trigger and to replace a constraint trigger with another
> constraint trigger.
> > >
> > > --
> > >
> > > Is that correct wording? I don't think so. Saying "to replace a
> > > regular trigger with a constraint trigger" is NOT the same as "replace a
> constraint trigger".
> > I corrected my wording in create_trigger.sgml, which should cause less
> > confusion than v14. The reason why I changed the documents is described
> below.
> 
> Yes, OK. But it might be simpler still just to it like:
> "CREATE OR REPLACE TRIGGER works only for replacing a regular (not
> constraint) trigger with another regular trigger."
Yeah, this kind of supplementary wording helps users understand the
exact usage of this feature. Thanks.

> 
> >
> > > Maybe I am mistaken but I think the help and the code are no longer
> > > in sync anymore. e.g. In previous versions of this patch you used to
> > > verify replacement trigger kinds (regular/constraint) match. AFAIK
> > > you are not doing that in the current code (but you should be). So
> > > although you say "impossible to replace a regular trigger with a
> > > constraint trigger" I don't see any code to check/enforce that ( ??
> > > ) IMO when you simplified this v14 patch you may have removed some
> > > extra trigger kind conditions that should not have been removed.
> > >
> > > Also, the test code should have detected this problem, but I think
> > > the tests have also been broken in v14. See later COMMENT (9).
> > Don't worry and those are not broken.
> >
> > I changed some codes in gram.y to throw a syntax error when OR REPLACE
> > clause is used with CREATE CONSTRAINT TRIGGER.
> >
> > In the previous discussion with Tsunakawa-San in this thread, I judged
> > that OR REPLACE clause is not necessary for *CONSTRAINT* TRIGGER to
> > achieve the purpose of this patch.
> > It is to support the database migration from Oracle to Postgres by
> > supporting the same syntax for trigger replacement. Here, because the
> > constraint trigger is unique to the latter, I prohibited the usage of
> > CREATE CONSTRAINT TRIGGER and OR REPLACE clauses at the same
> time at
> > the grammatical level.
> > Did you agree with this way of modification ?
> >
> > To prohibit the combination OR REPLACE and CONSTRAINT clauses may
> seem
> > a little bit radical but I refer to an example of the combination to
> > use CREATE CONSTRAINT TRIGGER and AFTER clause.
> > When the timing of trigger is not AFTER for CONSTRAINT TRIGGER, an
> > syntax error is caused. I learnt and followed the way for my
> > modification from it.
> 
> OK, I understand now. In my v14 review I failed to notice that you did it this
> way, which is why I thought a check was missing in the code.
> 
> I do think this is a bit subtle. Perhaps this should be asserted and commented
> a bit more in the code to make it much clearer what you did.
> For example:
> --
> BEFORE
> /*
>  * can't replace constraint trigger.
>  */
>  if (OidIsValid(existing_constraint_oid))
> AFTER
> /*
>  * It is not allowed to replace with a constraint trigger.
>  * The OR REPLACE syntax is not available for constraint triggers (see
> gram.y).
>  */
> Assert(!stmt->isconstraint);
> 
> /*
>  * It is not allowed to replace an existing constraint trigger.
>  */
>  if (OidIsValid(existing_constraint_oid))
> --
Agreed.
Note that this part of the latest patch v16 shows a different indentation of the
comments than what you gave me in your previous reply, but that came from
running pgindent.

> 
> > > (9) COMMENT
> > > File: src/test/regress/expected/triggers.out
> > > +-- test for detecting incompatible replacement of trigger create
> > > +table my_table (id integer); create function funcA() returns
> > > +trigger as $$ begin
> > > +  raise notice 'hello from funcA';
> > > +  return null;
> > > +end; $$ language plpgsql;
> > > +create function funcB() returns trigger as $$ begin
> > > +  raise notice 'hello from funcB';
> > > +  return null;
> > > +end; $$ language plpgsql;
> > > +create or replace trigger my_trig
> > > +  after insert on my_table
> >

Re: list_free() in index_get_partition()

2020-11-05 Thread Michael Paquier
On Thu, Nov 05, 2020 at 02:36:06PM -0600, Justin Pryzby wrote:
> Seems to be missing.

All code paths calling index_get_partition() are in tablecmds.c,
involving commands that are not that hot for such lookups, but this
could be an issue if it gets called by some out-of-core code where
memory context cleanup is sloppy.  So I agree to fix this one and
backpatch down to 12.  I'd like to apply your fix, except if Alvaro
thinks differently as that's his commit originally.  So let's wait a
bit first.

> The 2nd patch does some more cleanup - Before, a failed syscache lookup would
> ERROR, but I don't think that's supposed to happen.  get_rel_relispartition()
> would instead return false, and we won't call get_partition_parent().

The cache lookup error is here as a safeguard to remind us that any code
path calling index_get_partition() needs a proper lock on the
relations involved before doing such lookups, so that's a defensive
move.  By switching the code as you do in 0002, you could mask such
logical problems as different errors could get raised.  So I don't
think that it is a good idea.
--
Michael


signature.asc
Description: PGP signature


Re: Why does to_json take "anyelement" rather than "any"?

2020-11-05 Thread David G. Johnston
On Thu, Nov 5, 2020 at 3:43 PM Mohamed Wael Khobalatte <
mkhobala...@grubhub.com> wrote:

>  You can always cast to text yourself, of course, but I am not familiar
> with the type hierarchy enough to tell why `to_json` can't deduce that as
> text whereas the other function can.
>

My understanding is that "any" is defined to accept that behavior -
allowing any pseudo-type and unknown.  The "anyelement" polymorphic
pseudo-type is defined such that only concrete known types are allowed to
match - and then the rules of polymorphism apply when performing a lookup.
My uninformed conclusion is that since to_json only defines a single
parameter that changing it from "anyelement" to "any" would be reasonable
and the hack described probably "just works" (though I'd test it on a
wide range of built-in types first if I was actually going to use the hack).

You only get to use "any" for a C-language function but that is indeed the
case here.

David J.


Re: REFRESH MATERIALIZED VIEW and completion tag output

2020-11-05 Thread Fujii Masao




On 2020/11/06 1:56, Mark Dilger wrote:




On Nov 5, 2020, at 8:20 AM, Fujii Masao  wrote:

The patch that makes pg_stat_statements track the number of rows that
REFRESH MATERIALIZED VIEW command processes was proposed at [1].
When reviewing the patch, I just wondered why the completion tag output
for REFRESH MATERIALIZED VIEW doesn't include the row count. Is this
an intentional behavior? I've not found the past discussion about this yet.


Of 191 command tags, only 7 return the number of rows.  See 
src/include/tcop/cmdtaglist.h.  REFRESH MATERIALIZED VIEW is similar to CREATE 
MATERIALIZED VIEW and also CREATE TABLE AS, which also do not return the row 
count.


Yes, so I was wondering if REFRESH MATERIALIZED VIEW also should use
CMDTAG_SELECT like CREATE MATERIALIZED VIEW does.


 I think this behavior is historical, and preserved for compatibility.


Maybe true. The patch that I mentioned upthread tries to change
REFRESH MATERIALIZED VIEW so that it registers the row count in
its query completion, so that pg_stat_statements can track
the number of processed rows. But display_rowcount for
CMDTAG_REFRESH_MATERIALIZED_VIEW should still be false
in cmdtaglist.h to avoid changing the completion tag output.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Add statistics to pg_stat_wal view for wal related parameter tuning

2020-11-05 Thread Masahiro Ikeda

On 2020-10-30 11:50, Fujii Masao wrote:

On 2020/10/29 17:03, Masahiro Ikeda wrote:

Hi,

Thanks for your comments and advice. I updated the patch.

On 2020-10-21 18:03, Kyotaro Horiguchi wrote:

At Tue, 20 Oct 2020 16:11:29 +0900, Masahiro Ikeda
 wrote in

On 2020-10-20 12:46, Amit Kapila wrote:
> I see that we also need to add extra code to capture these stats (some
> of which is in performance-critical path especially in
> XLogInsertRecord) which again makes me a bit uncomfortable. It might
> be that it is all fine as it is very important to collect these stats
> at cluster-level in spite that the same information can be gathered at
> statement-level to help customers but I don't see a very strong case
> for that in your proposal.


We should avoid that duplication as much as possible even if both numbers
are important.


Also, about performance, I thought there is little impact because it
only increments stats in memory. If I can implement it to reuse pgWalUsage's
values, which already collect these stats, there is no impact in
XLogInsertRecord.
For example, how about having pg_stat_wal() calculate the accumulated
values of wal_records, wal_fpi, and wal_bytes from pgWalUsage's
values?


I don't think that works, but it would work that pgstat_send_wal()
takes the difference of that values between two successive calls.

WalUsage prevWalUsage;
...
pgstat_send_wal()
{
..
   /* fill in some values using pgWalUsage */
   WalStats.m_wal_bytes   = pgWalUsage.wal_bytes   - 
prevWalUsage.wal_bytes;
   WalStats.m_wal_records = pgWalUsage.wal_records - 
prevWalUsage.wal_records;
   WalStats.m_wal_wal_fpi = pgWalUsage.wal_fpi - 
prevWalUsage.wal_fpi;

...
   pgstat_send(&WalStats, sizeof(WalStats));

   /* remember the current numbers */
   prevWalUsage = pgWalUsage;


Thanks to Horiguchi-san's advice, I changed the patch to reuse pgWalUsage,
which is already defined, and eliminated the extra overhead.


+   /* fill in some values using pgWalUsage */
+   WalStats.m_wal_bytes = pgWalUsage.wal_bytes - prevWalUsage.wal_bytes;
+	WalStats.m_wal_records = pgWalUsage.wal_records - 
prevWalUsage.wal_records;

+   WalStats.m_wal_fpi = pgWalUsage.wal_fpi - prevWalUsage.wal_fpi;

It's better to use WalUsageAccumDiff() here?


Yes, thanks. I fixed it.


prevWalUsage needs to be initialized with pgWalUsage?

+   if (AmWalWriterProcess()){
+   WalStats.m_wal_write_walwriter++;
+   }
+   else
+   {
+   WalStats.m_wal_write_backend++;
+   }

I think that it's better not to separate m_wal_write_xxx into two for
walwriter and other processes. Instead, we can use one m_wal_write_xxx
counter and make pgstat_send_wal() send also the process type to
the stats collector. Then the stats collector can accumulate the 
counters

per process type if necessary. If we adopt this approach, we can easily
extend pg_stat_wal so that any fields can be reported per process type.


I'll remove the above source code because these counters are not useful.


On 2020-10-30 12:00, Fujii Masao wrote:

On 2020/10/20 11:31, Masahiro Ikeda wrote:

Hi,

I think we need to add some statistics to pg_stat_wal view.

Although there are some parameter related WAL,
there are few statistics for tuning them.

I think it's better to provide the following statistics.
Please let me know your comments.

```
postgres=# SELECT * from pg_stat_wal;
-[ RECORD 1 ]---+--
wal_records | 2000224
wal_fpi | 47
wal_bytes   | 248216337
wal_buffers_full    | 20954
wal_init_file   | 8
wal_write_backend   | 20960
wal_write_walwriter | 46
wal_write_time  | 51
wal_sync_backend    | 7
wal_sync_walwriter  | 8
wal_sync_time   | 0
stats_reset | 2020-10-20 11:04:51.307771+09
```

1. Basic statistics of WAL activity

- wal_records: Total number of WAL records generated
- wal_fpi: Total number of WAL full page images generated
- wal_bytes: Total amount of WAL bytes generated

To understand DB's performance, first, we will check the performance
trends for the entire database instance.
For example, if the number of wal_fpi becomes higher, users may tune
"wal_compression", "checkpoint_timeout" and so on.

Although users can check the above statistics via EXPLAIN, 
auto_explain,

autovacuum and pg_stat_statements now,
if users want to see the performance trends  for the entire database,
they must recalculate the statistics.

I think it is useful to add the sum of the basic statistics.


2.  WAL segment file creation

- wal_init_file: Total number of WAL segment files created.

To create a new WAL file may have an impact on the performance of
a write-heavy workload generating lots of WAL. If this number is 
reported high,
to reduce the number of this initialization, we can tune WAL-related 
parameters

so that more "recycled" WAL fi

Re: Why does to_json take "anyelement" rather than "any"?

2020-11-05 Thread Nikhil Benesch

On 11/5/20 7:38 PM, David G. Johnston wrote:

My understanding is that "any" is defined to accept that behavior - allowing any pseudo-type and unknown.  The 
"anyelement" polymorphic pseudo-type is defined such that only concrete known types are allowed to match - and then the 
rules of polymorphism apply when performing a lookup.  My uninformed conclusion is that since to_json only defines a single 
parameter that changing it from "anyelement" to "any" would be reasonable and the hack described probably 
"just works" (though I'd test it on a wide range of built-in types first if I was actually going to use the hack).

You only get to use "any" for a C-language function but that is indeed the case 
here.


That exactly matches my understanding as well. I'll put together a patch.

Nikhil




Re: Why does to_json take "anyelement" rather than "any"?

2020-11-05 Thread Tom Lane
Nikhil Benesch  writes:
> On 11/5/20 7:38 PM, David G. Johnston wrote:
>> My understanding is that "any" is defined to accept that behavior - allowing 
>> any pseudo-type and unknown.  The "anyelement" polymorphic pseudo-type is 
>> defined such that only concrete known types are allowed to match - and then 
>> the rules of polymorphism apply when performing a lookup.  My uninformed 
>> conclusion is that since to_json only defines a single parameter that 
>> changing it from "anyelement" to "any" would be reasonable and the hack 
>> describe probably "just works" (though I'd test it on a wide-range of 
>> built-in types first if I was actually going to use the hack).
>> 
>> You only get to use "any" for a C-language function but that is indeed the 
>> case here.

> That exactly matches my understanding as well. I'll put together a patch.

"any" is a dinosaur IMO.  It's definitely lower-level than anyelement;
for example the function has to be prepared to deal with raw "unknown"
literals.  So I feel like the proposed solution here is a bit of a hack.
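
(To make "lower-level" concrete, a rough sketch of the per-argument work a C
function declared with "any" has to do itself; this is illustrative only, not
the actual to_json code.)

```
/* Illustrative sketch, not the real to_json(): a function taking "any"
 * must resolve each argument's type at run time, including raw "unknown". */
#include "postgres.h"

#include "catalog/pg_type.h"
#include "utils/builtins.h"
#include "utils/lsyscache.h"

static char *
any_arg_as_cstring(FunctionCallInfo fcinfo, int argno)
{
	Oid			argtype = get_fn_expr_argtype(fcinfo->flinfo, argno);
	Datum		arg = PG_GETARG_DATUM(argno);
	Oid			typoutput;
	bool		typisvarlena;

	if (!OidIsValid(argtype))
		elog(ERROR, "could not determine data type of argument %d", argno + 1);

	/* A bare string literal arrives as "unknown", physically a C string. */
	if (argtype == UNKNOWNOID)
		return pstrdup(DatumGetCString(arg));

	/* Otherwise use the type's regular output function. */
	getTypeOutputInfo(argtype, &typoutput, &typisvarlena);
	return OidOutputFunctionCall(typoutput, arg);
}
```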

What I'm wondering about as I think about this is why we don't allow
unknown literals to be resolved as text when matching to anyelement.
Maybe that was intentional, or maybe just overly conservative; or maybe
there is a good reason for it.  I don't recall, but it would be worth
excavating in the list archives to see if it was discussed when the
polymorphic types were being designed.

A relevant data point is that we *do* allow the case with the more
recently added "anycompatible" polymorphics:

regression=# create function foo(anycompatible) returns anycompatible as 
'select $1' language sql;
CREATE FUNCTION
regression=# select foo('bar');
 foo 
-----
 bar
(1 row)

regression=# select pg_typeof(foo('bar'));
 pg_typeof 
-----------
 text
(1 row)

So even if we decide that changing the rules for "anyelement" is
too scary, I think switching to_json to anycompatible would be
preferable to switching it to "any".

regards, tom lane




RE: pgbench: option delaying queries till connections establishment?

2020-11-05 Thread kuroda.hay...@fujitsu.com
Dear Fabien, 

> Indeed. I scanned the file but did not find other places that needed 
> updating.

> Yes.

> Not sure either. I'm not for having too many braces anyway, so I removed 
> them.

I checked your fixes and I think it's OK. 
Finally, please move the doc fixes to patch B in order to separate patches
completely.

> Indeed. I took your next patch with an added explanation. I'm unclear 
> whether proceeding makes much sense though, that is some thread would run 
> the test and other would have aborted. Hmmm.

Your comment looks good, thanks.
In the previous version pgbench starts benchmarking even if some connections 
fail.
And users can notice the connection failure by stderr output.
Hence the current specification may be enough.
If you agree, please remove the following lines:

```
+* It is unclear whether it is worth doing anything rather than
+* coldly exiting with an error message.
```

> ISTM that there is another patch in the queue which needs barriers to 
> delay some initialization so as to fix a corner case bug, in which case 
> the behavior would be mandatory. The current submission could add an 
> option to skip the barrier synchronization, but I'm not enthusiastic to 
> add an option and remove it shortly later.

Could you tell me which patch you mention? Basically I agree what you say,
but I want to check it.

Hayato Kuroda
FUJITSU LIMITED
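
(For context, a minimal sketch of the barrier-based start synchronization this
thread is about, assuming POSIX barriers; the names are illustrative and not
pgbench's actual code.)

```
#include <pthread.h>

/* initialized in main() with pthread_barrier_init(&start_barrier, NULL, nthreads) */
static pthread_barrier_t start_barrier;

static void *
thread_run(void *arg)
{
	(void) arg;

	/* 1. establish all of this thread's client connections first */

	/* 2. wait here until every thread has finished connecting */
	(void) pthread_barrier_wait(&start_barrier);

	/* 3. only now start sending queries, so that connection time stays out
	 *    of the measured interval and all threads start together */
	return NULL;
}
```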





Re: Add Information during standby recovery conflicts

2020-11-05 Thread Masahiko Sawada
On Fri, Oct 30, 2020 at 6:02 PM Drouvot, Bertrand  wrote:
>
> Hi,
>
> On 10/30/20 4:25 AM, Fujii Masao wrote:
> >
> > On 2020/10/30 10:29, Masahiko Sawada wrote:
> >> ,
> >>
> >> On Thu, 29 Oct 2020 at 00:16, Fujii Masao
> >>  wrote:
> >>>
> >>> I read v8 patch. Here are review comments.
> >>
> >> Thank you for your review.
> >>

Thank you for updating the patch!

Looking at the latest version patch
(v8-0002-Log-the-standby-recovery-conflict-waits.patch), I think it
doesn't address some comments from Fujii-san.

> >>>
> >>> When recovery conflict with buffer pin happens, log message is output
> >>> every deadlock_timeout. Is this intentional behavior? If yes, IMO
> >>> that's
> >>> not good because lots of messages can be output.
> >>
> >> Agreed.

I think the latest patch doesn't fix the above comment. Log message
for recovery conflict on buffer pin is logged every deadlock_timeout.
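
(Roughly what the "log only once per conflict" behavior would look like,
reusing the waitStart / logged variables discussed below; a sketch, not the
patch itself.)

```
/* Sketch: emit the "still waiting" message only once per conflict, once
 * deadlock_timeout has elapsed, instead of on every iteration. */
if (log_recovery_conflict_waits && !logged && waitStart > 0 &&
	TimestampDifferenceExceeds(waitStart, GetCurrentTimestamp(),
							   DeadlockTimeout))
{
	ereport(LOG,
			(errmsg("recovery still waiting after %d ms: recovery conflict on buffer pin",
					DeadlockTimeout)));
	logged = true;
}
```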

> >>
> >> if we were to log the recovery conflict only once in bufferpin
> >> conflict case, we would log it multiple times only in lock conflict
> >> case. So I guess it's better to do that in all conflict cases.
> >
> > Yes, I agree that this behavior basically should be consistent between
> > all cases.

The latest patch seems not to address this comment as well.

> >
> >>
> >>>
> >>> +   if (log_recovery_conflict_waits)
> >>> +   waitStart = GetCurrentTimestamp();
> >>>
> >>> What happens if log_recovery_conflict_waits is off at the beginning and
> >>> then it's changed during waiting for the conflict? In this case,
> >>> waitStart is
> >>> zero, but log_recovery_conflict_waits is on. This may cause some
> >>> problems?
> >>
> >> Hmm, I didn't see a path that happens to reload the config file during
> >> waiting for buffer cleanup lock. Even if the startup process receives
> >> SIGHUP during that, it calls HandleStartupProcInterrupts() at the next
> >> convenient time. It could be the beginning of main apply loop or
> >> inside WaitForWALToBecomeAvailable() and so on but I didn’t see it in
> >> the wait loop for taking a buffer cleanup.
> >
> > Yes, you're right. I seem to have read the code wrongly.
> >
> >> However, I think it’s
> >> better to use (waitStart > 0) for safety when checking if we log the
> >> recovery conflict instead of log_recovery_conflict_waits.
> >>
> >>>
> >>> +   if (report_waiting)
> >>> +   ts = GetCurrentTimestamp();
> >>>
> >>> GetCurrentTimestamp() doesn't need to be called every cycle
> >>> in the loop after "logged" is true and "new_status" is not NULL.
> >>
> >> Agreed
> >>
> >>>
> >>> +extern const char *get_procsignal_reason_desc(ProcSignalReason
> >>> reason);
> >>>
> >>> Is this garbage?
> >>
> >> Yes.
> >>
> >>>
> >>> When log_lock_waits is enabled, both "still waiting for ..." and
> >>> "acquired ..."
> >>> messages are output. Like this, when log_recovery_conflict_waits is
> >>> enabled,
> >>> not only "still waiting ..." but also "resolved ..." message should
> >>> be output?
> >>> The latter message might not need to be output if the conflict is
> >>> canceled
> >>> due to max_standby_xxx_delay parameter. The latter message would be
> >>> useful to know how long the recovery was waiting for the conflict.
> >>> Thought?
> >>> It's ok to implement this as a separate patch later, though.
> >>
> >> There was a discussion that the latter message without waiting time is
> >> not necessarily needed because the canceled process will log
> >> "canceling statement due to conflict with XXX" as you mentioned. I
> >> agreed with that. But I agree that the latter message with waiting
> >> time would help users, for instance when the startup process is
> >> waiting for multiple processes and it takes a time to cancel all of
> >> them.
> >
> > I agree that it's useful to output the wait time.
> >
> > But as I told in previous email, it's ok to focus on the current patch
> > for now and then implement this as a separate patch later.

Agreed.

Regards,

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




subscribing

2020-11-05 Thread meission

subscribing

Re: [HACKERS] logical decoding of two-phase transactions

2020-11-05 Thread Ajin Cherian
On Wed, Nov 4, 2020 at 9:31 PM Amit Kapila  wrote:

> Okay, so it is mentioned in the comment why we need to execute
> invalidations even when the database is not the same. So, we are
> probably good here if we are executing the invalidations whenever we
> skip to decode the prepared xact.

Updated to execute invalidations while skipping prepared transactions.
Also ran pgindent on the
source files with updated typedefs.
Attaching v17 with 1,2 and 3.

regards,
Ajin Cherian
Fujitsu Australia


v17-0001-Support-2PC-txn-base.patch
Description: Binary data


v17-0002-Support-2PC-txn-backend.patch
Description: Binary data


v17-0003-Support-2PC-test-cases-for-test_decoding.patch
Description: Binary data


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-11-05 Thread k.jami...@fujitsu.com
On Thursday, October 22, 2020 3:15 PM, Kyotaro Horiguchi 
 wrote:
> I'm not sure about the exact steps of the test, but it can be expected if we
> have many small relations to truncate.
> 
> Currently BUF_DROP_FULL_SCAN_THRESHOLD is set to Nbuffers / 512,
> which is quite arbitrary that comes from a wild guess.
> 
> Perhaps we need to run benchmarks that drops one relation with several
> different ratios between the number of buffers to-be-dropped and Nbuffers,
> and preferably both on spinning rust and SSD.

Sorry to get back to you on this just now.
Since we're prioritizing the vacuum patch, we also need to finalize which 
threshold value to use.
I proceeded testing with my latest set of patches because Amit-san's comments 
on the code, the ones we addressed,
don't really affect the performance. I'll post the updated patches for 0002 & 
0003 after we come up with the proper
boolean parameter name for smgrnblocks and the buffer full scan threshold value.

Test the VACUUM performance with the following thresholds: 
   NBuffers/512, NBuffers/256, NBuffers/128,
and determine which of the ratios has the best performance in terms of speed.

I tested this on my machine (CPU 4v, 8GB memory, ext4) running on SSD.
Configure streaming replication environment.
shared_buffers = 100GB
autovacuum = off
full_page_writes = off
checkpoint_timeout = 30min

[Steps]
1. Create TABLE
2. INSERT data
3. DELETE from TABLE
4. Pause WAL replay on standby
5. VACUUM. Stop the primary.
6. Resume WAL replay and promote standby.

With 1 relation, there were no significant changes that we can observe:
(In seconds)
| s_b   | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 |
|-------|--------|--------------|--------------|--------------|
| 128MB | 0.106  | 0.105        | 0.105        | 0.105        |
| 100GB | 0.106  | 0.105        | 0.105        | 0.105        |

So I tested with 100 tables and got more convincing measurements:

| s_b   | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 |
|-------|--------|--------------|--------------|--------------|
| 128MB | 1.006  | 1.007        | 1.006        | 0.107        |
| 1GB   | 0.706  | 0.606        | 0.606        | 0.605        |
| 20GB  | 1.907  | 0.606        | 0.606        | 0.605        |
| 100GB | 7.013  | 0.706        | 0.606        | 0.607        |

The threshold NBuffers/128 has the best performance for the default shared_buffers
(128MB) at 0.107 s, and performs equally well with large shared_buffers up to 100GB.

We can use NBuffers/128 for the threshold, although I don't have a measurement 
for HDD yet. 
However, I wonder if the above method would suffice to determine the final 
threshold that we
can use. If anyone has suggestions on how we can come up with the final value, 
like if I need
to modify some steps above, I'd appreciate it.

Regarding the parameter name: instead of "accurate", we can use "cached" as
originally intended from the early versions of the patch, since it is the smgr
that handles smgrnblocks and returns the cached number of blocks
(smgr_cached_nblocks). "accurate" may confuse us because the cached value may
not be actually accurate.
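
To make the two open items concrete, here is roughly how the caller side looks
in the current patch set (the parameter name and the threshold divisor are
exactly what is still being decided, so treat this as an outline rather than
the final patch):

```
/* Sketch of the DropRelFileNodeBuffers() optimization being measured. */
#define BUF_DROP_FULL_SCAN_THRESHOLD	(NBuffers / 128)	/* divisor under test */

	bool		cached = false;		/* proposed name for the out parameter */
	BlockNumber nblocks;

	nblocks = smgrnblocks(smgr_reln, forkNum, &cached);

	if (InRecovery && cached && nblocks < BUF_DROP_FULL_SCAN_THRESHOLD)
	{
		/*
		 * The cached size is trustworthy during recovery, so look up each
		 * to-be-dropped block in the buffer mapping table and invalidate
		 * only those buffers.
		 */
	}
	else
	{
		/* Fall back to scanning the whole of shared_buffers. */
	}
```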

Regards,
Kirk Jamison





Re: Why does to_json take "anyelement" rather than "any"?

2020-11-05 Thread Nikhil Benesch

On 11/5/20 8:58 PM, Tom Lane wrote:

"any" is a dinosaur IMO.  It's definitely lower-level than anyelement;
for example the function has to be prepared to deal with raw "unknown"
literals.  So I feel like the proposed solution here is a bit of a hack.


I see what you are saying, but since the code for to_jsonb is shared with
the code for jsonb_build_array and jsonb_build_object, which already
handle the pitfalls of "any", the patch seems to be literally this
simple:

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index c01da4bf01..11adf748c9 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8466,7 +8466,7 @@
   prosrc => 'json_object_two_arg' },
 { oid => '3176', descr => 'map input to json',
   proname => 'to_json', provolatile => 's', prorettype => 'json',
-  proargtypes => 'anyelement', prosrc => 'to_json' },
+  proargtypes => 'any', prosrc => 'to_json' },
 { oid => '3261', descr => 'remove object fields with null values from json',
   proname => 'json_strip_nulls', prorettype => 'json', proargtypes => 'json',
   prosrc => 'json_strip_nulls' },
@@ -9289,7 +9289,7 @@
   proargtypes => '_text _text', prosrc => 'jsonb_object_two_arg' },
 { oid => '3787', descr => 'map input to jsonb',
   proname => 'to_jsonb', provolatile => 's', prorettype => 'jsonb',
-  proargtypes => 'anyelement', prosrc => 'to_jsonb' },
+  proargtypes => 'any', prosrc => 'to_jsonb' },
 { oid => '3265', descr => 'jsonb aggregate transition function',
   proname => 'jsonb_agg_transfn', proisstrict => 'f', provolatile => 's',
   prorettype => 'internal', proargtypes => 'internal anyelement',

I think my argument is that regardless of which of
{any,anyelement,anycompatible} is best, it seems like to_jsonb,
jsonb_build_array, and jsonb_build_object should all use the same
type.


What I'm wondering about as I think about this is why we don't allow
unknown literals to be resolved as text when matching to anyelement.
Maybe that was intentional, or maybe just overly conservative; or maybe
there is a good reason for it.  I don't recall, but it would be worth
excavating in the list archives to see if it was discussed when the
polymorphic types were being designed.


I excavated two separate threads from 2003 from when you and Joe Conway
were designing SQL99 array support and the initial polymorphic types:

https://www.postgresql.org/message-id/3E701869.4020301%40joeconway.com
https://www.postgresql.org/message-id/1272.1048633920%40sss.pgh.pa.us

I didn't see anything obvious about unknown coercions, though I
certainly could have overlooked something. For what it's worth, the
error message "could not determine polymorphic type because input has
type unknown" has existed, with slightly different wording, since
the very first commit of the feature:

https://github.com/postgres/postgres/commit/730840c9b (parse_coerce.c, L840)


A relevant data point is that we *do* allow the case with the more
recently added "anycompatible" polymorphics:

<...snipped examples...>

So even if we decide that changing the rules for "anyelement" is
too scary, I think switching to_json to anycompatible would be
preferable to switching it to "any".


Oh these new polymorphic types are interesting. I hadn't seen these.

So, I don't feel particularly qualified to determine how to proceed.
These are the options that I would be excited about:

  1. Switch to_jsonb to take "any", as in the above patch.
  2. Convert all of to_jsonb, jsonb_build_array, and jsonb_build_object
 to use the new "anycompatible" type.
  3. Switch to_jsonb to take "anyelement", but change "anyelement" and
 friends so that "unknown" arguments are coerced to text.

Would someone care to offer guidance on which path to choose?

Nikhil




Re: Dumping/restoring fails on inherited generated column

2020-11-05 Thread Masahiko Sawada
On Wed, Nov 4, 2020 at 4:23 AM Peter Eisentraut
 wrote:
>
> On 2020-08-27 13:30, Masahiko Sawada wrote:
> > This issue is not fixed yet. I've attached the updated version patch
> > and registered it to commit fest so as not to forget. Please review
> > it.
>
> A few fixes have been committed in this thread, basically to prevent
> situations that shouldn't have been allowed.
>
> What's left is the originally reported issue that some parts of the
> regression test database are dumped incorrectly.  The two proposed
> patches in their most recent versions are
>
> https://www.postgresql.org/message-id/attachment/107447/v3-0001-pg_dump-Fix-dumping-of-inherited-generated-column.patch
> (message
> https://www.postgresql.org/message-id/b1c831dd-d520-5e7f-0304-0eeed39c9996%402ndquadrant.com)
>
> and
>
> https://www.postgresql.org/message-id/attachment/113487/fix_gcolumn_dump_v2.patch
> (message
> https://www.postgresql.org/message-id/CA%2Bfd4k6pLzrZDQsdsxcS06AwGRf1DgwOw84sFq9oXNw%2B83nB1g%40mail.gmail.com)
>
> Both of these result in the same change to the dump output.  Both of
> them have essentially the same idea.  The first one adds the
> conditionals during the information gathering phase of pg_dump, the
> second one adds the conditionals during the output phase.
>
> Any further thoughts?

I think the first one is better than the second (mine) because it can
reduce the number of intermediate objects.

Regards,


--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-11-05 Thread Amit Kapila
On Fri, Nov 6, 2020 at 5:02 AM Thomas Munro  wrote:
>
> On Thu, Nov 5, 2020 at 10:47 PM Amit Kapila  wrote:
> > I still feel 'cached' is a better name.
>
> Amusingly, this thread is hitting all the hardest problems in computer
> science according to the well known aphorism...
>
> Here's a devil's advocate position I thought about:  It's OK to leave
> stray buffers (clean or dirty) in the buffer pool if files are
> truncated underneath us by gremlins, as long as your system eventually
> crashes before completing a checkpoint.  The OID can't be recycled
> until after a successful checkpoint, so the stray blocks can't be
> confused with the blocks of another relation, and weird errors are
> expected on a system that is in serious trouble.  It's actually much
> worse that we can give incorrect answers to queries when files are
> truncated by gremlins (in the window of time before we presumably
> crash because of EIO), because we're violating basic ACID principles
> in user-visible ways.  In this thread, discussion has focused on
> availability (ie avoiding failures when trying to write back stray
> buffers to a file that has been unlinked), but really a system that
> can't see arbitrary committed transactions *shouldn't be available*.
> This argument applies whether you think SEEK_END can only give weird
> answers in the specific scenario I demonstrated with NFS, or whether
> you think it's arbitrarily b0rked and reports random numbers: we
> fundamentally can't tolerate that, so why are we trying to?
>

It is not very clear to me how this argument applies to the patch
in-discussion where we are relying on the cached value of blocks
during recovery. I understand your point that we might skip scanning
the pages and thus might not show some recently added data but that
point is not linked with what we are trying to do with this patch.
AFAIU, the theory we discussed above is that there shouldn't be any
stray blocks in the buffers with this patch because even if the
smgrnblocks(SEEK_END) didn't give us the right answers, we shouldn't
have any buffers for the blocks after the size returned by smgrnblocks
during recovery. I think the problem could happen if we extend the
relation by multiple blocks which will invalidate the cached value
during recovery and then probably the future calls to smgrnblocks can
lead to problems if it lies to us, but as far as I know we don't do
that. Can you please be more specific how this patch can lead to a
problem?


-- 
With Regards,
Amit Kapila.




Re: Implementing Incremental View Maintenance

2020-11-05 Thread Justin Pryzby
On Mon, Oct 05, 2020 at 06:16:18PM +0900, Yugo NAGATA wrote:
> Hi,
> 
> Attached is the rebased patch (v18) to add support for Incremental

This needs to be rebased again - the last version doesn't apply anymore.
http://cfbot.cputube.org/yugo-nagata.html

I looked though it a bit and attach some fixes to the user-facing docs.

There's some more typos in the source that I didn't fix:
constrains
materliazied
cluase
immediaite
clumn
Temrs
migth
recalculaetd
speified
secuirty

commit message: comletion

psql and pg_dump say 13 but should say 14 now:
pset.sversion >= 13 

# bag union
big union?

+   relisivm bool
+  
+  
+   True if materialized view enables incremental view maintenance

This isn't clear, but I think it should say "True for materialized views which
are enabled for incremental view maintenance (IVM)."

-- 
Justin
>From 568f8626e2d9ab0deb25ac9e10089a79abecdab0 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 5 Nov 2020 22:54:23 -0600
Subject: [PATCH] incremental view doc fixes

---
 doc/src/sgml/catalogs.sgml|   2 +-
 .../sgml/ref/create_materialized_view.sgml|  44 +++---
 .../sgml/ref/refresh_materialized_view.sgml   |   4 +-
 doc/src/sgml/rules.sgml   | 132 +-
 4 files changed, 90 insertions(+), 92 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 9b52119382..b942605a1b 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -3479,7 +3479,7 @@ SCRAM-SHA-256$:&l
The dependent object was created as part of creation of the Materialized
View with Incremental View Maintenance reference, and is really just a 
part of its internal implementation. The dependent object must not be
-   dropped unless materialized view dropped.
+   dropped unless the materialized view is also dropped.
   
  
 
diff --git a/doc/src/sgml/ref/create_materialized_view.sgml b/doc/src/sgml/ref/create_materialized_view.sgml
index 6f4e634ab0..e034a2c17f 100644
--- a/doc/src/sgml/ref/create_materialized_view.sgml
+++ b/doc/src/sgml/ref/create_materialized_view.sgml
@@ -67,7 +67,7 @@ CREATE [ INCREMENTAL ] MATERIALIZED VIEW [ IF NOT EXISTS ] table_na
  
  
   There are restrictions of query definitions allowed to use this
-  option. Followings are supported query definitions for IMMV:
+  option. The following are supported in query definitions for IMMV:
   
 

@@ -78,12 +78,12 @@ CREATE [ INCREMENTAL ] MATERIALIZED VIEW [ IF NOT EXISTS ] table_na
 

 
- Outer joins with following restrictions:
+ Outer joins with the following restrictions:
 
  
   

-Outer join view's targetlist must contain attributes used in the
+Outer join view's targetlist must contain all attributes used in the
 join conditions.
 
 CREATE INCREMENTAL MATERIALIZED VIEW mv AS SELECT a.i FROM mv_base_a a LEFT
@@ -95,7 +95,7 @@ ERROR:  targetlist must contain vars in the join condition for IVM with outer jo
 
   

-Outer join view's targetlist cannot contain non strict functions.
+Outer join view's targetlist cannot contain non-strict functions.
 
 CREATE INCREMENTAL MATERIALIZED VIEW mv AS SELECT a.i, b.i, (k > 10 OR k = -1)
 FROM mv_base_a a LEFT JOIN mv_base_b b ON a.i=b.i;
@@ -135,7 +135,7 @@ ERROR:  WHERE cannot contain non null-rejecting predicates for IVM with outer jo
 
   

-Aggregate is not supported with outer join.
+Aggregates are not supported with outer join.
 
 CREATE INCREMENTAL MATERIALIZED VIEW mv(a,b,v) AS SELECT a.i, b.i, sum(k) FROM
 mv_base_a a LEFT JOIN mv_base_b b ON a.i=b.i GROUP BY a.i, b.i;
@@ -165,7 +165,7 @@ ERROR:  subquery is not supported by IVM together with outer join
 

 
- Subqueries. However following forms are not supported.
+ Subqueries. However, the following forms are not supported.
 
 
 
@@ -176,14 +176,14 @@ mv_base_a WHERE i IN (SELECT i FROM mv_base_b WHERE k < 103 );
  
 
 
- subqueries in target list is not supported:
+ subqueries in the target list are not supported:
  
 CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm05 AS SELECT i,j, (SELECT k
 FROM mv_base_b b WHERE a.i = b.i) FROM mv_base_a a;
  
 
 
- Nested EXISTS subqueries is not supported:
+ Nested EXISTS subqueries are not supported:
  
 CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm11 AS SELECT a.i,a.j FROM
 mv_base_a a WHERE EXISTS(SELECT 1 FROM mv_base_b b WHERE EXISTS(SELECT
@@ -210,14 +210,14 @@ a.i > 5;
 

 
- Simple CTEs which do not contain aggregates or DISTINCT.
+ Simple CTEs that do not contain aggregates or DISTINCT.
 
  

Re: REFRESH MATERIALIZED VIEW and completion tag output

2020-11-05 Thread Mark Dilger



> On Nov 5, 2020, at 4:45 PM, Fujii Masao  wrote:
> 
> 
> 
> On 2020/11/06 1:56, Mark Dilger wrote:
>>> On Nov 5, 2020, at 8:20 AM, Fujii Masao  wrote:
>>> 
>>> The patch that makes pg_stat_statements track the number of rows that
>>> REFRESH MATERIALIZED VIEW command processes was proposed at [1].
>>> When reviewing the patch, I just wondered why the completion tag output
>>> for REFRESH MATERIALIZED VIEW doesn't include the row count. Is this
>>> an intentional behavior? I've not found the past discussion about this yet.
>> Of 191 command tags, only 7 return the number of rows.  See 
>> src/include/tcop/cmdtaglist.h.  REFRESH MATERIALIZED VIEW is similar to 
>> CREATE MATERIALIZED VIEW and also CREATE TABLE AS, which also do not return 
>> the row count.
> 
> Yes, so I was wondering if REFRESH MATERIALIZED VIEW also should use
> CMDTAG_SELECT like CREATE MATERIALIZED VIEW does.

For both CREATE MATERIALIZED VIEW and CREATE TABLE AS, ExecCreateTableAs() does:

  SetQueryCompletion(qc, CMDTAG_SELECT, queryDesc->estate->es_processed);

but ExecRefreshMatView() does nothing with its QueryCompletion parameter.  I 
don't know why this case should behave differently than the other two.
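
For what it's worth, making it consistent looks like roughly a one-liner at
the end of ExecRefreshMatView(), assuming the number of rows filled into the
matview is available in a variable (called "processed" here, which is a
hypothetical name):

```
/* sketch: mirror what ExecCreateTableAs() does */
if (qc)
	SetQueryCompletion(qc, CMDTAG_SELECT, processed);
```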

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Why does to_json take "anyelement" rather than "any"?

2020-11-05 Thread Pavel Stehule
On Fri, Nov 6, 2020 at 1:39 AM David G. Johnston <
david.g.johns...@gmail.com> wrote:

> On Thu, Nov 5, 2020 at 3:43 PM Mohamed Wael Khobalatte <
> mkhobala...@grubhub.com> wrote:
>
>>  You can always cast to text yourself, of course, but I am not familiar
>> with the type hierarchy enough to tell why `to_json` can't deduce that as
>> text whereas the other function can.
>>
>
> My understanding is that "any" is defined to accept that behavior -
> allowing any pseudo-type and unknown.  The "anyelement" polymorphic
> pseudo-type is defined such that only concrete known types are allowed to
> match - and then the rules of polymorphism apply when performing a lookup.
> My uninformed conclusion is that since to_json only defines a single
> parameter that changing it from "anyelement" to "any" would be reasonable
> and the hack described probably "just works" (though I'd test it on a
> wide-range of built-in types first if I was actually going to use the hack).
>
> You only get to use "any" for a C-language function but that is indeed the
> case here.
>

Type "anyelement"  can force the function's result type directly. But there
cannot be function that returns UNKNOWN.

Type "any" just accept any argument without any impact on result type.
Unfortunately, inside a function is necessary to do much more work related
to casting types, and the execution can be slower.

I checked the source code of to_json and this function can use "any"
without any change.

Regards

Pavel



> David J.
>
>


Re: default result formats setting

2020-11-05 Thread Pavel Stehule
On Thu, Nov 5, 2020 at 9:48 PM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2020-10-26 09:45, Pavel Stehule wrote:
> > The attached patch implements this.  For example, to get int2, int4,
> > int8 in binary by default, you could set
> >
> > SET default_result_formats = '21=1,23=1,20=1';
> >
> >
> > Using SET statement for this case looks very obscure :/
> >
> > This is a protocol related issue, and should be solved by protocol
> > extending. I don't think so SQL level is good for that.
>
> We could also make it a protocol message, but it would essentially
> implement the same thing, just again separately.  And then you'd have no
> support to inspect the current setting, test out different settings
> interactively, etc.  That seems pretty wasteful and complicated for no
> real gain.
>

If you need a debug API, then it can be better implemented with functions.
But why do you need it at the SQL level?

This is a protocol related thing.


>  > More, this format is not practical for custom types, and the list can
>  > be pretty long.
>
> The list is what the list is.  I don't see how you can make it any
> shorter.  You have to list the data types that you're interested in
> somehow.  Any other ideas?
>
> --
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
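
(For reference, this is the kind of SQL-level interaction being debated; a
minimal libpq sketch assuming the default_result_formats GUC proposed in this
thread.)

```
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("");
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
		return 1;

	/* ask for int2/int4/int8 (OIDs 21/23/20) in binary by default */
	res = PQexec(conn, "SET default_result_formats = '21=1,23=1,20=1'");
	PQclear(res);

	/* ...and inspect the current setting like any other GUC */
	res = PQexec(conn, "SHOW default_result_formats");
	if (PQresultStatus(res) == PGRES_TUPLES_OK)
		printf("current setting: %s\n", PQgetvalue(res, 0, 0));
	PQclear(res);

	PQfinish(conn);
	return 0;
}
```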


Re: [Patch] Optimize dropping of relation buffers using dlist

2020-11-05 Thread Thomas Munro
On Fri, Nov 6, 2020 at 5:09 PM Amit Kapila  wrote:
> On Fri, Nov 6, 2020 at 5:02 AM Thomas Munro  wrote:
> > Here's a devil's advocate position I thought about:  It's OK to leave
> > stray buffers (clean or dirty) in the buffer pool if files are
> > truncated underneath us by gremlins, as long as your system eventually
> > crashes before completing a checkpoint.  The OID can't be recycled
> > until after a successful checkpoint, so the stray blocks can't be
> > confused with the blocks of another relation, and weird errors are
> > expected on a system that is in serious trouble.  It's actually much
> > worse that we can give incorrect answers to queries when files are
> > truncated by gremlins (in the window of time before we presumably
> > crash because of EIO), because we're violating basic ACID principles
> > in user-visible ways.  In this thread, discussion has focused on
> > availability (ie avoiding failures when trying to write back stray
> > buffers to a file that has been unlinked), but really a system that
> > can't see arbitrary committed transactions *shouldn't be available*.
> > This argument applies whether you think SEEK_END can only give weird
> > answers in the specific scenario I demonstrated with NFS, or whether
> > you think it's arbitrarily b0rked and reports random numbers: we
> > fundamentally can't tolerate that, so why are we trying to?
>
> It is not very clear to me how this argument applies to the patch
> in-discussion where we are relying on the cached value of blocks
> during recovery. I understand your point that we might skip scanning
> the pages and thus might not show some recently added data but that
> point is not linked with what we are trying to do with this patch.

It's an argument for giving up the hard-to-name cache trick completely
and going back to using unmodified smgrnblocks(), both in recovery and
online.  If the only mechanism for unexpected file shrinkage is
writeback failure, then your system will be panicking soon enough
anyway -- so is it really that bad if there are potentially some other
weird errors logged some time before that?  Maybe those errors will
even take the system down sooner, and maybe that's appropriate?  If
there are other mechanisms for random file shrinkage that don't imply
a panic in your near future, then we have bigger problems that can't
be solved by any number of bandaids, at least not without
understanding the details of this hypothetical unknown failure mode.

The main argument I can think of against the idea of using plain old
smgrnblocks() is that the current error messages on smgrwrite()
failure for stray blocks would be indistinguishable from cases where
an external actor unlinked the file.  I don't mind getting an error
that prevents checkpointing -- your system is in big trouble! -- but
it'd be nice to be able to detect that *we* unlinked the file,
implying the filesystem and bufferpool are out of sync, and spit out a
special diagnostic message.  I suppose if it's the checkpointer doing
the writing, it could check if the relfilenode is on the
queued-up-for-delete-after-the-checkpoint list, and if so, it could
produce a different error message just for this edge case.
Unfortunately that's not a general solution, because any backend might
try to write a buffer out and they aren't synchronised with
checkpoints.

I'm not sure what the best approach is.  It'd certainly be nice to be
able to drop small tables quickly online too, as a benefit of this
approach.




Re: libpq compression

2020-11-05 Thread Konstantin Knizhnik




On 05.11.2020 21:07, Matthias van de Meent wrote:

On Thu, 5 Nov 2020 at 17:01, Konstantin Knizhnik
 wrote:

Sorry, I do not understand your point.
This view reports network traffic from the server's side.
But the client's traffic information is a "mirror" of this statistic:
server_tx=client_rx and vice versa.

Yes, the first few bytes exchanged by the client and server during the handshake are not
compressed. But they are correctly counted as "raw bytes". And certainly these few bytes
cannot have any influence on the measured average compression ratio (the main goal of using
this network traffic statistic, from my point of view).

As I understand it, the current metrics are as follows:

Server
  |<- |<- Xx_raw_bytes
  |  Compression
  |   |<- Xx_compressed_bytes
Client connection
  |
Network

 From the views' name 'pg_stat_network_traffic', to me 'Xx_raw_bytes'
would indicate the amount of bytes sent/received over the client
connection (e.g. measured between the Client connection and Network
part, or between the Server/Client connection and Compression/Client
connection sections), because that is my natural understanding of
'raw tx network traffic'. This is why I proposed 'logical' instead
of 'raw', as 'raw' is quite apparently understood differently when
interpreted by different people, whereas 'logical' already implies
that the value is an application logic-determined value (e.g. size
before compression).

The current name implies a 'network' viewpoint when observing this
view, not the 'server'/'backend' viewpoint you describe. If the
'server'/'backend' viewpoint is the desired default viewpoint, then
I suggest to rename the view to `pg_stat_network_compression`, as
that moves the focus to the compression used, and subsequently
clarifies `raw` as the raw application command data.

If instead the name `pg_stat_network_traffic` is kept, I suggest
changing the metrics collected to the following scheme:

Server
  |<- |<- Xx_logical_bytes
  |  Compression
  |   |<- Xx_compressed_bytes (?)
  |<- |<- Xx_raw_bytes
Client connection
  |
Network

This way, `raw` in the context of 'network_traffic' means
"sent-over-the-connection"-data, and 'logical' is 'application logic'
-data (as I'd expect from both a network as an application point of
view). 'Xx_compressed_bytes' is a nice addition, but not strictly
necessary, as you can subtract raw from logical to derive the bytes
saved by compression.
Sorry, but "raw" in this context means "not transformed", i.e. not 
compressed.
I have not used term uncompressed, because it assumes that there are 
"compressed" bytes which is not true if compression is not used.
So "raw" bytes are not bytes which we sent through network - quite 
opposite: application writes "raw" (uncompressed) data,

it is compressed ans then compressed bytes are sent.
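
(In other words, the two counters are meant to be maintained on either side of
the compressor. A toy, self-contained illustration, where compress_chunk() is
only a stand-in for the real zstd/zlib call:)

```
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for the real compressor: pretend we get 2:1 compression. */
static size_t
compress_chunk(const char *in, size_t inlen, char *out, size_t outcap)
{
	size_t		outlen = inlen / 2 + 1;

	(void) in;
	if (outlen > outcap)
		outlen = outcap;
	memset(out, 'x', outlen);
	return outlen;
}

static uint64_t tx_raw_bytes;			/* what the application handed over */
static uint64_t tx_compressed_bytes;	/* what actually goes onto the wire */

static void
send_with_compression(const char *buf, size_t len)
{
	char		outbuf[8192];
	size_t		outlen;

	tx_raw_bytes += len;				/* "raw" = before compression */
	outlen = compress_chunk(buf, len, outbuf, sizeof(outbuf));
	tx_compressed_bytes += outlen;		/* "compressed" = after compression */
	/* ... write outbuf[0..outlen) to the socket here ... */
}

int
main(void)
{
	send_with_compression("SELECT 1;", 9);
	printf("raw=%llu compressed=%llu\n",
		   (unsigned long long) tx_raw_bytes,
		   (unsigned long long) tx_compressed_bytes);
	return 0;
}
```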

Maybe I am wrong, but the term "logical" is much more confusing and
overloaded than "raw".
Especially taking into account that it is widely used in Postgres for
logical replication.

The antonym of "logical" is "physical", i.e. something materialized.
But in the case of data exchanged between client and server, which one can
be called physical and which one logical?
Have you ever heard about the logical size of a file (given that it may
contain holes or be compressed by the file system)?

In ZFS it is called the "apparent" size.

Also, I do not understand from your picture why Xx_compressed_bytes may be
different from Xx_raw_bytes?







Re: libpq compression

2020-11-05 Thread Konstantin Knizhnik




On 05.11.2020 22:22, Peter Eisentraut wrote:

On 2020-11-02 20:50, Andres Freund wrote:

On 2020-10-31 22:25:36 +0500, Andrey Borodin wrote:

But the price of compression is 1 cpu for 500MB/s (zstd). With a
20Gbps network adapters cost of recompressing all traffic is at most
~4 cores.


It's not quite that simple, because presumably each connection is going
to be handled by one core at a time in the pooler. So it's easy to slow
down peak throughput if you also have to deal with TLS etc.


Also, current deployments of connection poolers use rather small 
machine sizes.  Telling users you need 4 more cores per instance now 
to decompress and recompress all the traffic doesn't seem very 
attractive. Also, it's not unheard of to have more than one layer of 
connection pooling.  With that, this whole design sounds a bit like a 
heat-generation system. ;-)


Compression will be mostly useful for:
1. Replication protocol
2. COPY command
3. OLAP queries returning large result sets
4. Queries returning BLOBs/JSON

It does not seem to be such a good idea to switch on compression for all
connections.

And the cases described above usually affect only a small number of backends.

Also please notice that compression may significantly reduce the size of
transferred data, and copying data between multiple levels of the network
protocol also consumes a significant amount of CPU. The number of system calls
can also be proportional to the size of the transferred data, and in many cases
the performance of servers is now limited by the number of system calls rather
than by network throughput/latency and other factors.


So I accept your arguments, but still think that the picture is not so
straightforward.
This is why it is very interesting to me to know the results of using
compression with Odyssey in a real production environment.






Re: libpq compression

2020-11-05 Thread Andrey Borodin



> On 6 Nov 2020, at 00:22, Peter Eisentraut 
>  wrote:
> 
> On 2020-11-02 20:50, Andres Freund wrote:
>> On 2020-10-31 22:25:36 +0500, Andrey Borodin wrote:
>>> But the price of compression is 1 cpu for 500MB/s (zstd). With a
>>> 20Gbps network adapters cost of recompressing all traffic is at most
>>> ~4 cores.
>> It's not quite that simple, because presumably each connection is going
>> to be handled by one core at a time in the pooler. So it's easy to slow
>> down peak throughput if you also have to deal with TLS etc.
> 
> Also, current deployments of connection poolers use rather small machine 
> sizes.  Telling users you need 4 more cores per instance now to decompress 
> and recompress all the traffic doesn't seem very attractive. Also, it's not 
> unheard of to have more than one layer of connection pooling.  With that, 
> this whole design sounds a bit like a heat-generation system. ;-)

User should ensure good bandwidth between pooler and DB. At least they must be 
within one availability zone. This makes compression between pooler and DB 
unnecessary.
Cross-datacenter traffic is many times more expensive.

I agree that switching between compression levels (including turning it off) 
seems like a nice feature. But
1. Scope of its usefulness is an order of magnitude smaller than compression of 
the whole connection.
2. Protocol for this feature is significantly more complicated.
3. Restarted compression is much less efficient and effective.

Can we design a protocol so that this feature may be implemented in future, 
currently focusing on getting things compressed? Are there any drawbacks in 
this approach?

Best regards, Andrey Borodin.



Refactor MD5 implementations and switch to EVP for OpenSSL

2020-11-05 Thread Michael Paquier
Hi all,

Please find attached a patch set to sanitize the use of MD5 we have in
the core tree.

As of now, there are two duplicated implementations of MD5, one in
contrib/pgcrypto/ used as a fallback when not compiling with OpenSSL,
and one in src/common/ used by the backend when compiling with *or*
without OpenSSL.  This is bad on several aspects:
- There is no need to have the same implementation twice, obviously.
- When compiling with OpenSSL, we use an incorrect implementation,
causing Postgres to cheat if FIPS is enabled because MD5 should not be
authorized.  Making use of what OpenSSL provides with EVP allows us to
rely on OpenSSL to control such restrictions.  So we authorize MD5
authentications while these should be blocked, making Postgres not
completely compliant with STIG and its kind.

The attached patch set does a bit of rework to make the Postgres code
more consistent with OpenSSL, similarly to the work I did for all the
SHA2 implementations with EVP in [1]:
- 0001 is something stolen from the SHA2 set, adding to resowner.c
control of EVP contexts, so as it is possible to clean up anything
allocated by OpenSSL.
- 0002 is the central piece, that moves the duplicated
implementation.  src/common/ and pgcrypto/ use the same code, but I
have reused pgcrypto as it was already doing the init/update/final
split similarly to PostgreSQL.  New APIs are designed to control MD5
contexts, similarly to the work done for SHA2.  Upon using this patch,
note that pgcrypto+OpenSSL uses our in-core implementation instead of
OpenSSL's one, but that's fixed in 0003.  We have a set of three
convenience routines used to generate MD5-hashed passwords, that I
have moved to a new file in src/common/md5_common.c, aimed at being
shared between all the implementations.
- 0003 adds the MD5 implementation based on OpenSSL's EVP, ending the
work.

This set of patches is independent on the SHA2 refactoring, even if it
shares a part with the SHA2 refactoring in its design.  Note that 0001
and 0002 don't depend on each other, but 0003 depends on both.

Thanks,

[1]: https://www.postgresql.org/message-id/20200924025314.ge7...@paquier.xyz
--
Michael
From 0e0404232072d096cae61fbf1e3135516b18fd0e Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 5 Nov 2020 15:53:00 +0900
Subject: [PATCH 1/3] Add APIs to control EVP contexts for resource owners

This will be used by a set of upcoming patches for EVP contexts with
SHA2 and MD5.
---
 src/include/utils/resowner_private.h  |  7 +++
 src/backend/utils/resowner/resowner.c | 65 +++
 2 files changed, 72 insertions(+)

diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index a781a7a2aa..5ce6fcf882 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -95,4 +95,11 @@ extern void ResourceOwnerRememberJIT(ResourceOwner owner,
 extern void ResourceOwnerForgetJIT(ResourceOwner owner,
    Datum handle);
 
+/* support for EVP context management */
+extern void ResourceOwnerEnlargeEVP(ResourceOwner owner);
+extern void ResourceOwnerRememberEVP(ResourceOwner owner,
+	 Datum handle);
+extern void ResourceOwnerForgetEVP(ResourceOwner owner,
+   Datum handle);
+
 #endif			/* RESOWNER_PRIVATE_H */
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 8bc2c4e9ea..1efb5e98b4 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -20,6 +20,10 @@
  */
 #include "postgres.h"
 
+#ifdef USE_OPENSSL
+#include <openssl/evp.h>
+#endif
+
 #include "common/hashfn.h"
 #include "jit/jit.h"
 #include "storage/bufmgr.h"
@@ -128,6 +132,7 @@ typedef struct ResourceOwnerData
 	ResourceArray filearr;		/* open temporary files */
 	ResourceArray dsmarr;		/* dynamic shmem segments */
 	ResourceArray jitarr;		/* JIT contexts */
+	ResourceArray evparr;		/* EVP contexts */
 
 	/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
 	int			nlocks;			/* number of owned locks */
@@ -175,6 +180,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintEVPLeakWarning(Datum handle);
 
 
 /*
@@ -444,6 +450,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
 	ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
 	ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
 	ResourceArrayInit(&(owner->jitarr), PointerGetDatum(NULL));
+	ResourceArrayInit(&(owner->evparr), PointerGetDatum(NULL));
 
 	return owner;
 }
@@ -553,6 +560,17 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 
 			jit_release_context(context);
 		}
+
+		/* Ditto for EVP contexts */
+		while (ResourceArrayGetAny(&(owner->evparr), &foundres))
+		{
+			if (isCommit)
+P