Re: pglz performance
On Fri, Aug 02, 2019 at 07:52:39PM +0200, Tomas Vondra wrote:
> On Fri, Aug 02, 2019 at 10:12:58AM -0700, Andres Freund wrote:
>> Why would they be stuck continuing to *compress* with pglz? As we
>> fully retoast on write anyway we can just gradually switch over to the
>> better algorithm. Decompression speed is another story, of course.
>
> Hmmm, I don't remember the details of those patches so I didn't realize
> it allows incremental recompression. If that's possible, that would mean
> existing systems can start using it. Which is good.

It may become a problem on some platforms though (Windows?), so patches to improve either the compression or decompression of pglz are not that crazy, as pglz is still likely going to be used, and for read-mostly workloads switching to a new algorithm may not be worth the extra cost, so it is not as if we are going to drop it completely either. My take, still the same as upthread, is that it mostly depends on the amount of complexity each patch introduces compared to the performance gain.

> Another question is whether we'd actually want to include the code in
> core directly, or use system libraries (and if some packagers might
> decide to disable that, for whatever reason).

Linking to system libraries would make our maintenance much easier; when it comes to carrying a copy of something else in the tree, we would be stuck with more maintenance around it, and such copies tend to rot easily. Then there is the case where the compression algorithm is available in the binary on one server but not on another, in which case we get an automatic ERROR for a mismatching algorithm, or a FATAL when decompressing FPWs at recovery if wal_compression is used.
--
Michael

signature.asc
Description: PGP signature
Re: Ought to use heap_multi_insert() for pg_attribute/depend insertions?
> On 2 Aug 2019, at 00:41, Thomas Munro wrote:
>
> On Tue, Jul 9, 2019 at 11:07 PM Daniel Gustafsson wrote:
>> The attached v3 also has that fix in order to see if the cfbot is happier
>> with this.
>
> Noticed while moving this to the next CF:
>
> heap.c: In function ‘StorePartitionKey’:
> heap.c:3582:3: error: ‘referenced’ undeclared (first use in this function)
>   referenced.classId = RelationRelationId;
>   ^

Thanks, the attached v4 updates the patch to handle a0555ddab9b672a046 as well.

cheers ./daniel

catalog_multi_insert-v4.patch
Description: Binary data
Re: pglz performance
Hi,

On 2019-08-05 00:19:28 +0200, Petr Jelinek wrote:
> It carries that information inside the compressed value, like I said in the
> other reply, that's why the extra byte.

I'm not convinced that that is a good plan - imo the reference to the compressed data should carry that information. I.e. in the case of toast, at least toast pointers should hold enough information to determine the compression algorithm. And in the case of WAL, the WAL record should contain that.

Consider e.g. adding support for slice fetching of datums compressed with some algorithm - we should be able to determine whether that's possible without fetching the datum (so we can take a non-exceptional path for datums compressed otherwise). Similarly, for WAL, we should be able to detect whether an incompatible compression format is used without having to invoke a generic compression routine that then fails in some way. Or adding compression reporting for WAL to xlogdump.

I also don't particularly like baking in the assumption that we don't support tuples larger than 1GB in further places. To me it seems likely that we're going to have to fix that, and it's hard enough already... I know that my patch did that too...

For external datums I suggest encoding the compression method as a distinct VARTAG_ONDISK_COMPRESSED, and then have that include the compression method as a field.

For in-line compressed values (so VARATT_IS_4B_C), doing something roughly like you did, indicating the type of metadata following using the high bit, sounds reasonable. But I think I'd make it so that if the high bit is set, the struct is instead entirely different, keeping a full 4-byte length, and including the compression type header inside the struct. Perhaps I'd combine the compression type with the high-bit-set part? So when the high bit is set, it'd be something like

{
    int32   vl_len_;        /* varlena header (do not touch directly!) */

    /*
     * Actually only 7 bits, the high bit determines whether this
     * is the old compression header (if unset), or this type of header
     * (if set).
     */
    uint8   type;

    /*
     * Stored as uint8[4], to avoid unnecessary alignment padding.
     */
    uint8   length[4];

    char    va_data[FLEXIBLE_ARRAY_MEMBER];
}

I think it's worth spending some effort trying to get this right - we'll be bound by design choices for a while.

It's kinda annoying that toast datums aren't better designed. Creating them from scratch, I'd make it something like:

1) variable-width integer describing the "physical length", so that tuple deforming can quickly determine length - all the ifs necessary to determine lengths are a bottleneck. I'd probably just use a 7-bit-per-byte encoding: if the high bit is set, there's a following length byte.

2) type of toasted datum, probably also in a variable-width encoding, starting at 1 byte. Not that I think it's likely we'd overrun 256 types - but it's cheap enough to just declare the high bit as a length extension bit.

These are always stored unaligned. So there's no need to deal with padding bytes having to be zero to determine whether we're dealing with a 1-byte datum etc.

Then, type-dependent:

For in-line uncompressed datums:
3a) alignment padding, amount determined by 2) above, i.e. we'd just have different types for different amounts of alignment. Probably using some heuristic to use unaligned when either dealing with data that doesn't need alignment, or when the datum is fairly small, so copying to get the data as unaligned won't be a significant penalty.
4a) data

For in-line compressed datums:
3b) compression metadata {varint rawsize, varint compression algorithm}
4b) unaligned compressed data - there's no benefit in keeping it aligned

For external toast for uncompressed data:
3d) {toastrelid, valueid, varint rawsize}

For external toast for compressed data:
3e) {valueid, toastrelid, varint compression_algorithm, varint rawsize, varint extsize}

That'd make it a lot more extensible, easier to understand, faster to decode in a lot of cases, and remove a lot of arbitrary limits. Yes, it'd increase the header size for small datums to two bytes, but I think that'd be more than bought back by the other improvements.

Greetings,

Andres Freund
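For illustration, a minimal sketch of the variable-width length encoding described in 1) above, assuming a 7-bit-per-byte scheme where a set high bit means "another length byte follows" (the function names are illustrative, not from any patch):

    #include <stdint.h>

    /* Encode "value" using 7 data bits per byte; returns number of bytes written. */
    static int
    encode_varlen(uint64_t value, unsigned char *buf)
    {
        int     n = 0;

        do
        {
            unsigned char b = value & 0x7F;

            value >>= 7;
            if (value != 0)
                b |= 0x80;          /* continuation: another length byte follows */
            buf[n++] = b;
        } while (value != 0);

        return n;
    }

    /* Decode a length produced by encode_varlen(); *consumed gets the byte count. */
    static uint64_t
    decode_varlen(const unsigned char *buf, int *consumed)
    {
        uint64_t        value = 0;
        int             shift = 0;
        int             n = 0;
        unsigned char   b;

        do
        {
            b = buf[n++];
            value |= (uint64_t) (b & 0x7F) << shift;
            shift += 7;
        } while (b & 0x80);

        *consumed = n;
        return value;
    }

With this scheme, lengths up to 127 need only a single header byte, and each additional byte extends the range by 7 bits.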
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
At Sun, 4 Aug 2019 00:04:01 -0700 (MST), legrand legrand wrote in <1564902241482-0.p...@n3.nabble.com>
> > However having the nested queryid in
> > pg_stat_activity would be convenient to track
> > what a long stored function is currently doing.
>
> +1
>
> And this could permit to get wait event sampling per queryid when
> pg_stat_statements.track = all

I'm strongly on this side emotionally, but I'm also on Tom and Andres's side that exposing queryid that way is not the right thing. Doing that means we don't need exact correspondence between the top-level query and queryId (in nested or multistatement queries) in this patch. pg_stat_statements will allow us to do the same thing by having an additional uint64[MaxBackends] array in pgssSharedState, instead of expanding the PgBackendStatus array in core by the same size.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
SSL Connection still showing TLSv1.3 even though it is disabled in ssl_ciphers
Hi,

While testing with OpenSSL 1.1.1c, I only enabled TLSv1.2 and the rest, including TLSv1.3, have been disabled, like this -

postgres=# show ssl_ciphers ;
                 ssl_ciphers
----------------------------------------------
 TLSv1.2:!aNULL:!SSLv2:!SSLv3:!TLSv1:!TLSv1.3

To confirm the same, there is a tool called sslyze (SSLyze is a Python library and a CLI tool that can analyze the SSL configuration of a server by connecting to it, https://github.com/nabla-c0d3/sslyze) which I configured on my machine. Run this command -

[root@localhost Downloads]# python -m sslyze --sslv2 --sslv3 --tlsv1 --tlsv1_1 --tlsv1_2 --tlsv1_3 localhost:5432 --starttls=postgres --hide_rejected_ciphers

AVAILABLE PLUGINS
- CompressionPlugin HttpHeadersPlugin OpenSslCcsInjectionPlugin OpenSslCipherSuitesPlugin SessionResumptionPlugin FallbackScsvPlugin CertificateInfoPlugin RobotPlugin HeartbleedPlugin SessionRenegotiationPlugin

CHECKING HOST(S) AVAILABILITY
- localhost:5432 => 127.0.0.1

SCAN RESULTS FOR LOCALHOST:5432 - 127.0.0.1
-------------------------------------------

 * SSLV2 Cipher Suites:
     Server rejected all cipher suites.

 * TLSV1_3 Cipher Suites:
     Server rejected all cipher suites.

 * SSLV3 Cipher Suites:
     Server rejected all cipher suites.

 * TLSV1_1 Cipher Suites:
     Server rejected all cipher suites.

 * TLSV1_2 Cipher Suites:
     Forward Secrecy    OK - Supported
     RC4                OK - Not Supported

     Preferred:
       TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384       ECDH-256 bits   256 bits

     Accepted:
       TLS_DHE_RSA_WITH_AES_256_CBC_SHA256         DH-2048 bits    256 bits
       RSA_WITH_AES_256_CCM_8                      -               256 bits
       TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256 -               256 bits
       TLS_DHE_RSA_WITH_CHACHA20_POLY1305_SHA256   -               256 bits
       TLS_RSA_WITH_CAMELLIA_256_CBC_SHA256        -               256 bits
       RSA_WITH_AES_256_CCM                        -               256 bits
       TLS_DHE_RSA_WITH_CAMELLIA_256_CBC_SHA256    -               256 bits
       ARIA256-GCM-SHA384                          -               256 bits
       TLS_RSA_WITH_AES_256_CBC_SHA256             -               256 bits
       TLS_ECDHE_RSA_WITH_CAMELLIA_256_CBC_SHA384  -               256 bits
       TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384       ECDH-256 bits   256 bits
       DHE_RSA_WITH_AES_256_CCM_8                  -               256 bits
       ECDHE-ARIA256-GCM-SHA384                    -               256 bits
       TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384       ECDH-256 bits   256 bits
       TLS_DHE_RSA_WITH_AES_256_GCM_SHA384         DH-2048 bits    256 bits
       TLS_RSA_WITH_AES_256_GCM_SHA384             -               256 bits
       TLS_DHE_RSA_WITH_AES_256_CCM                -               256 bits
       DHE-RSA-ARIA256-GCM-SHA384                  -               256 bits
       TLS_RSA_WITH_CAMELLIA_128_CBC_SHA256        -               128 bits
       RSA_WITH_AES_128_CCM_8                      -               128 bits
       RSA_WITH_AES_128_CCM                        -               128 bits
       DHE_RSA_WITH_AES_128_CCM                    -               128 bits
       DHE_RSA_WITH_AES_128_CCM_8                  -               128 bits
       ARIA128-GCM-SHA256                          -               128 bits
       TLS_ECDHE_RSA_WITH_CAMELLIA_128_CBC_SHA256  -               128 bits
       TLS_DHE_RSA_WITH_AES_128_CBC_SHA256         DH-2048 bits    128 bits
       TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256       ECDH-256 bits   128 bits
       TLS_RSA_WITH_AES_128_CBC_SHA256             -               128 bits
       ECDHE-ARIA128-GCM-SHA256                    -               128 bits
       TLS_DHE_RSA_WITH_AES_128_GCM_SHA256         DH-2048 bits    128 bits
       TLS_RSA_WITH_AES_128_GCM_SHA256             -               128 bits
       TLS_DHE_RSA_WITH_CAMELLIA_128_CBC_SHA256    -               128 bits
       DHE-RSA-ARIA128-GCM-SHA256                  -               128 bits
       TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256       ECDH-256 bits   128 bits

 * TLSV1 Cipher Suites:
     Server rejected all cipher suites.

 SCAN COMPLETED IN 0.84 S

These are the ones which got rejected for TLSV1_3:

 * TLSV1_3 Cipher Suites:
     Rejected:
       TLS_CHACHA20_POLY1305_SHA256    TLS / Alert: protocol version
       TLS_AES_256_GCM_SHA384          TLS / Alert: protocol version
       TLS_AES_128_GCM_SHA256          TLS / Alert: protocol version
       TLS_AES_128_CCM_SHA256          TLS / Alert: protocol version
       TLS_AES_128_CCM_8_SHA256        TLS / Alert: protocol version

When I connect to the psql terminal -

psql.bin (10.9)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

postgres=# show ssl_ciphers ;
                 ssl_ciphers
----------------------------------------------
 TLSv1.2:!aNULL:!SSLv2:!SSLv3:!TLSv1:!TLSv1.3
(1 row)
Re: More issues with pg_verify_checksums and checksum verification in base backups
On Sat, Aug 03, 2019 at 06:47:48PM +0200, Michael Banck wrote:
> first off, a bit of a meta-question: Did the whitelist approach die
> completely, or are we going to tackle it again for v13 or later?

At this stage, it is buried. Amen.

> This is something I still have in the test suite of my pg_checksums
> fork, cause I never reverted that one from isRelFile() back to
> skipfile() (so it doesn't fail on the above cause `123.' is not
> considered a relation file worth checksumming).

We could actually fix this one. It is confusing to have pg_checksums generate a report about a segment number which is actually incorrect.

> Independently of the whitelist/blacklist question, I believe
> pg_checksums should not error out as soon as it encounters a weird-looking
> file, but either (i) still checksum it or (ii) skip it? Or is that to be
> considered a pilot error and it's fine for pg_checksums to fold?

That's actually the distinction between the blacklist and the whitelist, which would have handled that.
--
Michael

signature.asc
Description: PGP signature
Re: Ought to use heap_multi_insert() for pg_attribute/depend insertions?
On Mon, Aug 05, 2019 at 09:20:07AM +0200, Daniel Gustafsson wrote:
> Thanks, the attached v4 updates the patch to handle a0555ddab9b672a046 as well.

+   if (referenced->numrefs == 1)
+       recordDependencyOn(depender, &referenced->refs[0], behavior);
+   else
+       recordMultipleDependencies(depender,
+                                  referenced->refs, referenced->numrefs,
+                                  behavior);

This makes me wonder if we should not just add a shortcut in recordMultipleDependencies() to use recordDependencyOn() if there is only one reference in the set. That would save the effort of a multi-insert for all callers of recordMultipleDependencies() this way, including future ones. And that could also be done independently of the addition of InsertPgAttributeTuples(), no?
--
Michael

signature.asc
Description: PGP signature
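A minimal sketch of the kind of shortcut suggested here, assuming the patched recordMultipleDependencies() performs the multi-insert internally; record_single_dependency() is a hypothetical stand-in for the plain single-row insertion path (used here to avoid recursing back through recordDependencyOn()):

    void
    recordMultipleDependencies(const ObjectAddress *depender,
                               const ObjectAddress *referenced,
                               int nreferenced,
                               DependencyType behavior)
    {
        if (nreferenced == 0)
            return;                 /* nothing to record */

        if (nreferenced == 1)
        {
            /* single reference: skip the multi-insert machinery entirely */
            record_single_dependency(depender, referenced, behavior);
            return;
        }

        /* ... build the batch of pg_depend tuples and heap_multi_insert() it ... */
    }

With such a shortcut in place, callers would not need to special-case the single-reference situation themselves, which is what the quoted hunk above is doing.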
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
On Mon, Aug 5, 2019 at 9:28 AM Kyotaro Horiguchi wrote: > > At Sun, 4 Aug 2019 00:04:01 -0700 (MST), legrand legrand > wrote in <1564902241482-0.p...@n3.nabble.com> > > > However having the nested queryid in > > > pg_stat_activity would be convenient to track > > > what is a long stored functions currently doing. > > > > +1 > > > > And this could permit to get wait event sampling per queryid when > > pg_stat_statements.track = all > > I'm strongly on this side emotionally, but also I'm on Tom and > Andres's side that exposing querid that way is not the right > thing. > > Doing that means we don't need exact correspondence between > top-level query and queryId (in nested or multistatement queries) > in this patch. pg_stat_statements will allow us to do the same > thing by having additional uint64[MaxBackends] array in > pgssSharedState, instead of expanding PgBackendStatus array in > core by the same size. Sure, but the problem with this approach is that all extensions that compute their own queryid would have to do the same. I hope that we can come up with an approach friendlier for those extensions.
Re: Remove HeapTuple and Buffer dependency for predicate locking functions
On Sat, Aug 3, 2019 at 11:56 AM Ashwin Agrawal wrote:
> On Wed, Jul 31, 2019 at 2:06 PM Andres Freund wrote:
>> I'm also a bit confused why we don't need to pass in the offset of the
>> current tuple, rather than the HOT root tuple here. That's not related
>> to this patch. But aren't we locking the wrong tuple here, in case of
>> HOT?
>
> Yes, root is being locked here instead of the HOT. But I don't have full
> context on the same. If we wish to fix it though, can be easily done now with
> the patch by passing "tid" instead of &(heapTuple->t_self).

Here are three relevant commits:

1. Commit dafaa3efb75 "Implement genuine serializable isolation level." (2011) locked the root tuple, because it set t_self to *tid. Possibly due to confusion about the effect of the preceding line ItemPointerSetOffsetNumber(tid, offnum).

2. Commit 81fbbfe3352 "Fix bugs in SSI tuple locking." (2013) fixed that by adding ItemPointerSetOffsetNumber(&heapTuple->t_self, offnum).

3. Commit b89e151054a "Introduce logical decoding." (2014) also did ItemPointerSet(&(heapTuple->t_self), BufferGetBlockNumber(buffer), offnum), for the benefit of historical MVCC snapshots (unnecessarily, considering the change in commit #2), but then, intending to "reset to original, non-redirected, tid", clobbered it, reintroducing the bug fixed by #2.

My first guess is that commit #3 above was developed before commit #2, and finished up clobbering it. In fact, both logical decoding and SSI want offnum, so we should be able to just remove the "reset" bit (perhaps like in the attached sketch, not really tested, though it passes). This must be in want of an isolation test, but I haven't yet tried to get my head around how to write a test that would show the difference.

--
Thomas Munro
https://enterprisedb.com

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 94309949fa..107605d276 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1576,7 +1576,6 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 		heapTuple->t_data = (HeapTupleHeader) PageGetItem(dp, lp);
 		heapTuple->t_len = ItemIdGetLength(lp);
 		heapTuple->t_tableOid = RelationGetRelid(relation);
-		ItemPointerSetOffsetNumber(&heapTuple->t_self, offnum);
 
 		/*
 		 * Shouldn't see a HEAP_ONLY tuple at chain start.
@@ -1608,15 +1607,14 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 			 * of the root tuple of the HOT chain. This is important because
 			 * the *Satisfies routine for historical mvcc snapshots needs the
 			 * correct tid to decide about the visibility in some cases.
+			 * SSI also needs offnum.
 			 */
-			ItemPointerSet(&(heapTuple->t_self), BufferGetBlockNumber(buffer), offnum);
+			ItemPointerSetOffsetNumber(&heapTuple->t_self, offnum);
 
 			/* If it's visible per the snapshot, we must return it */
 			valid = HeapTupleSatisfiesVisibility(heapTuple, snapshot, buffer);
 			CheckForSerializableConflictOut(valid, relation, heapTuple, buffer, snapshot);
-			/* reset to original, non-redirected, tid */
-			heapTuple->t_self = *tid;
 
 			if (valid)
 			{
Re: partition routing layering in nodeModifyTable.c
Hi Andres, Fujita-san, On Sat, Aug 3, 2019 at 3:01 AM Andres Freund wrote: > On 2019-07-31 17:04:38 +0900, Amit Langote wrote: > > I looked into trying to do the things I mentioned above and it seems > > to me that revising BeginDirectModify()'s API to receive the > > ResultRelInfo directly as Andres suggested might be the best way > > forward. I've implemented that in the attached 0001. Patches that > > were previously 0001, 0002, and 0003 are now 0002, 003, and 0004, > > respectively. 0002 is now a patch to "remove" > > es_result_relation_info. > > Thanks! Some minor quibbles aside, the non FDW patches look good to me. > > Fujita-san, do you have any comments on the FDW API change? Or anybody > else? Based on the discussion, I have updated the patch. > I'm a bit woried about the move of BeginDirectModify() into > nodeModifyTable.c - it just seems like an odd control flow to me. Not > allowing any intermittent nodes between ForeignScan and ModifyTable also > seems like an undesirable restriction for the future. I realize that we > already do that for BeginForeignModify() (just btw, that already accepts > resultRelInfo as a parameter, so being symmetrical for BeginDirectModify > makes sense), but it still seems like the wrong direction to me. > > The need for that move, I assume, comes from needing knowing the correct > ResultRelInfo, correct? I wonder if we shouldn't instead determine the > at plan time (in setrefs.c), somewhat similar to how we determine > ModifyTable.resultRelIndex. Doesn't look like that'd be too hard? The patch adds a resultRelIndex field to ForeignScan node, which is set to >= 0 value for non-SELECT queries. I first thought to set it only if direct modification is being used, but maybe it'd be simpler to set it even if direct modification is not used. To set it, the patch teaches set_plan_refs() to initialize resultRelIndex of ForeignScan plans that appear under ModifyTable. Fujita-san said he plans to revise the planning of direct-modification style queries to not require a ModifyTable node anymore, but maybe he'll just need to add similar code elsewhere but not outside setrefs.c. > Then we could just have BeginForeignModify, BeginDirectModify, > BeginForeignScan all be called from ExecInitForeignScan(). I too think that it would've been great if we could call both BeginForeignModify and BeginDirectModify from ExecInitForeignScan, but the former's API seems to be designed to be called from ExecInitModifyTable from the get-go. Maybe we should leave that as-is? > Path 04 is such a nice improvement. Besides getting rid of a substantial > amount of code, it also makes the control flow a lot easier to read. Thanks. > > @@ -4644,9 +4645,7 @@ GetAfterTriggersTableData(Oid relid, CmdType cmdType) > > * If there are no triggers in 'trigdesc' that request relevant transition > > * tables, then return NULL. > > * > > - * The resulting object can be passed to the ExecAR* functions. The caller > > - * should set tcs_map or tcs_original_insert_tuple as appropriate when > > dealing > > - * with child tables. > > + * The resulting object can be passed to the ExecAR* functions. 
> > * > > * Note that we copy the flags from a parent table into this struct (rather > > * than subsequently using the relation's TriggerDesc directly) so that we > > can > > @@ -5750,14 +5749,26 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo > > *relinfo, > >*/ > > if (row_trigger && transition_capture != NULL) > > { > > - TupleTableSlot *original_insert_tuple = > > transition_capture->tcs_original_insert_tuple; > > - TupleConversionMap *map = transition_capture->tcs_map; > > + TupleTableSlot *original_insert_tuple; > > + PartitionRoutingInfo *pinfo = relinfo->ri_PartitionInfo; > > + TupleConversionMap *map = pinfo ? > > + > > pinfo->pi_PartitionToRootMap : > > + > > relinfo->ri_ChildToRootMap; > > booldelete_old_table = > > transition_capture->tcs_delete_old_table; > > boolupdate_old_table = > > transition_capture->tcs_update_old_table; > > boolupdate_new_table = > > transition_capture->tcs_update_new_table; > > boolinsert_new_table = > > transition_capture->tcs_insert_new_table; > > > > /* > > + * Get the originally inserted tuple from the global variable > > and set > > + * the latter to NULL because any given tuple must be read > > only once. > > + * Note that the TransitionCaptureState is shared across many > > calls > > + * to this function. > > + */ > > + original_insert_tuple = > > transition_capture->tcs_original_insert_tuple; > > + transition_capture->tcs_origin
Re: Index Skip Scan
Thanks for the new patch. I've reviewed the skip scan patch, but haven't taken a close look at the uniquekeys patch yet.

In my previous review I mentioned that queries of the form:

select distinct on(a) a,b from a where a=1;

do not lead to a skip scan with the patch, even though the skip scan would be much faster. It's about this piece of code in planner.c:

    /* Consider index skip scan as well */
    if (enable_indexskipscan &&
        IsA(path, IndexPath) &&
        ((IndexPath *) path)->indexinfo->amcanskip &&
        root->distinct_pathkeys != NIL)

The root->distinct_pathkeys is already filtered for redundant keys, so column 'a' is not in there anymore. Still, it'd be much faster to use the skip scan here, because a regular scan will scan all entries with a=1, even though we're really only interested in the first one. In previous versions, this would be fixed by changing the check in planner.c to use root->uniq_distinct_pathkeys instead of root->distinct_pathkeys, but things change a bit now that the patch is rebased on the unique-keys patch. Would it be valid to change this check to root->query_uniquekeys != NIL to consider skip scans also for this query?

Second is about the use of _bt_skip and _bt_read_closest in nbtsearch.c. I don't think _bt_read_closest is correctly implemented, and I'm not sure if it can be used at all, due to concerns by Tom and Peter about such an approach. I had a similar idea to only partially read items from a page for another use case, for which I submitted a patch last Friday. However, both Tom and Peter find this idea quite scary [1]. You could take a look at my patch on that thread to see the approach taken to correctly partially read a page (well, correct as far as I can see so far...), but perhaps we need to just use the regular _bt_readpage function which reads everything, although this is unfortunate from a performance point of view, since most of the time we're indeed just interested in the first tuple.

E.g. it looks like there are some mix-ups between index offsets and heap TIDs in _bt_read_closest:

    /*
     * Save the current item and the previous, even if the
     * latter does not pass scan key conditions
     */
    ItemPointerData tid = prevItup->t_tid;
    OffsetNumber prevOffnum = ItemPointerGetOffsetNumber(&tid);

    _bt_saveitem(so, itemIndex, prevOffnum, prevItup);
    itemIndex++;

    _bt_saveitem(so, itemIndex, offnum, itup);
    itemIndex++;

The 'prevOffnum' is the offset number for the heap TID, and not the index offset, so it looks like just a random item is saved. Furthermore, index offsets may change due to insertions and vacuums, so if we, at any point, release the lock, these offsets are not necessarily valid anymore. However, currently, the patch just reads the closest item and then doesn't consider this page at all anymore if the first tuple skipped to turns out to be not visible. Consider the following SQL to illustrate:

create table a (a int, b int, c int);
insert into a (select vs, ks, 10 from generate_series(1,5) vs, generate_series(1, 1) ks);
create index on a (a,b);
analyze a;

select distinct on (a) a,b from a order by a,b;
 a | b
---+---
 1 | 1
 2 | 1
 3 | 1
 4 | 1
 5 | 1
(5 rows)

delete from a where a=2 and b=1;
DELETE 1

select distinct on (a) a,b from a order by a,b;
 a | b
---+-----
 1 | 1
 2 | 249   ->> this should be b=2, because we deleted a=2 && b=1; however, it
              doesn't consider any tuples from that page anymore and gives us
              the first tuple from the next page.
 3 | 1
 4 | 1
 5 | 1
(5 rows)
-Floris [1] https://www.postgresql.org/message-id/flat/26641.1564778586%40sss.pgh.pa.us#dd8f23e1704f45447185894e1c2a4f2a
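For reference, a minimal sketch of the planner.c change questioned above, assuming the uniquekeys patch provides root->query_uniquekeys (this shows only the modified condition, not a tested patch):

    /* Consider index skip scan as well */
    if (enable_indexskipscan &&
        IsA(path, IndexPath) &&
        ((IndexPath *) path)->indexinfo->amcanskip &&
        root->query_uniquekeys != NIL)      /* was: root->distinct_pathkeys != NIL */
    {
        /* ... build the skip scan path as before ... */
    }

This would let "select distinct on(a) a,b from a where a=1" qualify even though its only DISTINCT pathkey was removed as redundant.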
Re: POC: Cleaning up orphaned files using undo logs
On Mon, Aug 5, 2019 at 12:09 PM Heikki Linnakangas wrote:
>
> On 05/08/2019 07:23, Thomas Munro wrote:
> > On Mon, Aug 5, 2019 at 3:54 PM Amit Kapila wrote:
> >> On Sun, Aug 4, 2019 at 2:46 PM Heikki Linnakangas wrote:
> >>> Could we leave out the UNDO and discard worker processes for now?
> >>> Execute all UNDO actions immediately at rollback, and after crash
> >>> recovery. That would be fine for cleaning up orphaned files,
> >>
> >> Even if we execute all the undo actions on rollback, we need the discard
> >> worker to discard undo at regular intervals. Also, what if we get an
> >> error while applying undo actions during rollback? Right now, we have
> >> a mechanism to push such a request to a background worker and allow the
> >> session to continue. Instead, we might want to PANIC in such cases if
> >> we don't want to have background undo workers.
> >>
> >>> and it
> >>> would cut down the size of the patch to review.
> >>
> >> If we can find some way to handle all cases and everyone agrees to it,
> >> that would be good. In fact, we can try to get the basic stuff
> >> committed first and then try to get the rest (undo-worker machinery)
> >> done.
> >
> > I think it's definitely worth exploring.
>
> Yeah. For cleaning up orphaned files, if unlink() fails, we can just log
> the error and move on. That's what we do in the main codepath, too. For
> any other error, PANIC seems ok. We're not expecting any errors during
> undo processing, so it doesn't seem safe to continue running.
>
> Hmm. Since applying the undo record is WAL-logged, you could run out of
> disk space while creating the WAL record. That seems unpleasant.
>

We might get away with doing some minimal error handling for the orphaned-file cleanup patch, but this facility was supposed to be a generic facility. Even assuming all of us agree on the error-handling stuff, I think we still might not be able to avoid the requirement for a discard worker to discard the logs.

> >>> Can this race condition happen: Transaction A creates a table and an
> >>> UNDO record to remember it. The transaction is rolled back, and the file
> >>> is removed. Another transaction, B, creates a different table, and
> >>> chooses the same relfilenode. It loads the table with data, and commits.
> >>> Then the system crashes. After crash recovery, the UNDO record for the
> >>> first transaction is applied, and it removes the file that belongs to
> >>> the second table, created by transaction B.
> >>
> >> I don't think such a race exists, but we should verify it once.
> >> Basically, once the rollback is complete, we mark the transaction
> >> rollback as complete in the transaction header in undo and write a WAL
> >> record for it. After crash recovery, we will skip such a transaction. Isn't
> >> that sufficient to prevent such a race condition?
>
> Ok, I didn't realize there's a flag in the undo record to mark it as
> applied. Yeah, that fixes it. Seems a bit heavy-weight, but I guess it's
> fine. Do you do something different in zheap? I presume writing a WAL
> record for every applied undo record would be too heavy there.
>

For zheap, we collect all the records of a page, apply them together and then write the entire page in WAL. The progress of a transaction is updated either at transaction end (rollback complete) or after processing some threshold of undo records. So, generally, the WAL won't be written for each individual undo record applied.

> This needs some performance testing. We're creating one extra WAL record
> and one UNDO record for every file creation, and another WAL record on
> abort. It's probably cheap compared to all the other work done during
> table creation, but we should still get some numbers on it.
>

Makes sense.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: Identity columns should own only one sequence
Peter Eisentraut wrote: > On 2019-05-08 16:49, Laurenz Albe wrote: > > I believe we should have both: > > > > - Identity columns should only use sequences with an INTERNAL dependency, > > as in Peter's patch. > > I have committed this. Thanks! > > - When a column default is dropped, remove all dependencies between the > > column and sequences. > > There is no proposed patch for this, AFAICT. There was one in https://www.postgresql.org/message-id/3916586ef7f33948235fe60f54a3750046f5d940.camel%40cybertec.at > So I have closed this commit fest item for now. That's fine with me. Yours, Laurenz Albe
Re: Removing unneeded self joins
On 02/08/2019 04:54, Thomas Munro wrote: On Thu, Jun 27, 2019 at 6:42 PM Andrey Lepikhov wrote: Version v.17 of the patch that fix the bug see in attachment. While moving this to the September CF, I noticed that it needs to be updated for the recent pg_list.h API changes. The patch was updated: 1. Changes caused by pg_list.h API changes. 2. Fix the problem of joint clause_relids and required_relids changes [1]. 3. Add eclass mentions of removed relation into the kept relation (field eclass_indexes was introduced by commit 3373c71553). [1] https://www.postgresql.org/message-id/flat/5c21029d-81a2-c999-6744-6a898fcc9a19%40postgrespro.ru -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company >From 72ac38de7fb55fe741677ea75b280eecb443e978 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Fri, 2 Aug 2019 11:01:47 +0500 Subject: [PATCH] Remove self joins v18 --- src/backend/nodes/outfuncs.c | 19 +- src/backend/optimizer/path/indxpath.c | 28 +- src/backend/optimizer/path/joinpath.c |6 +- src/backend/optimizer/plan/analyzejoins.c | 1021 +++-- src/backend/optimizer/plan/planmain.c |7 +- src/backend/optimizer/util/pathnode.c |3 +- src/backend/optimizer/util/relnode.c | 26 +- src/include/nodes/nodes.h |1 + src/include/nodes/pathnodes.h | 22 +- src/include/optimizer/pathnode.h |4 + src/include/optimizer/paths.h |3 +- src/include/optimizer/planmain.h |7 +- src/test/regress/expected/equivclass.out | 32 + src/test/regress/expected/join.out| 256 +- src/test/regress/sql/equivclass.sql | 17 + src/test/regress/sql/join.sql | 127 +++ 16 files changed, 1480 insertions(+), 99 deletions(-) diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c index 86c31a48c9..9d511d1d1b 100644 --- a/src/backend/nodes/outfuncs.c +++ b/src/backend/nodes/outfuncs.c @@ -2270,7 +2270,8 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node) WRITE_OID_FIELD(userid); WRITE_BOOL_FIELD(useridiscurrent); /* we don't try to print fdwroutine or fdw_private */ - /* can't print unique_for_rels/non_unique_for_rels; BMSes aren't Nodes */ + WRITE_NODE_FIELD(unique_for_rels); + /* can't print non_unique_for_rels; BMSes aren't Nodes */ WRITE_NODE_FIELD(baserestrictinfo); WRITE_UINT_FIELD(baserestrict_min_security); WRITE_NODE_FIELD(joininfo); @@ -2342,6 +2343,19 @@ _outStatisticExtInfo(StringInfo str, const StatisticExtInfo *node) WRITE_BITMAPSET_FIELD(keys); } +static void +_outUniqueRelInfo(StringInfo str, const UniqueRelInfo *node) +{ + WRITE_NODE_TYPE("UNIQUERELINFO"); + + WRITE_BITMAPSET_FIELD(outerrelids); + if (node->index) + { + WRITE_OID_FIELD(index->indexoid); + WRITE_NODE_FIELD(column_values); + } +} + static void _outEquivalenceClass(StringInfo str, const EquivalenceClass *node) { @@ -4108,6 +4122,9 @@ outNode(StringInfo str, const void *obj) case T_StatisticExtInfo: _outStatisticExtInfo(str, obj); break; + case T_UniqueRelInfo: +_outUniqueRelInfo(str, obj); +break; case T_ExtensibleNode: _outExtensibleNode(str, obj); break; diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c index 5f339fdfde..b57be54df6 100644 --- a/src/backend/optimizer/path/indxpath.c +++ b/src/backend/optimizer/path/indxpath.c @@ -3568,7 +3568,8 @@ ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel, * relation_has_unique_index_for * Determine whether the relation provably has at most one row satisfying * a set of equality conditions, because the conditions constrain all - * columns of some unique index. + * columns of some unique index. 
If index_info is not null, it is set to + * point to a new UniqueRelInfo containing the index and conditions. * * The conditions can be represented in either or both of two ways: * 1. A list of RestrictInfo nodes, where the caller has already determined @@ -3589,12 +3590,16 @@ ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel, bool relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel, List *restrictlist, - List *exprlist, List *oprlist) + List *exprlist, List *oprlist, + UniqueRelInfo **unique_info) { ListCell *ic; Assert(list_length(exprlist) == list_length(oprlist)); + if (unique_info) + *unique_info = NULL; + /* Short-circuit if no indexes... */ if (rel->indexlist == NIL) return false; @@ -3645,6 +3650,7 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel, { IndexOptInfo *ind = (IndexOptInfo *) lfirst(ic); int c; + List *column_values = NIL; /* * If the index is not unique, or not immediately enforced, or if it's @@ -3693,6 +3699,9 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel, if (match_index_to_op
Re: Identity columns should own only one sequence
Peter Eisentraut wrote: > On 2019-05-08 16:49, Laurenz Albe wrote: > > I believe we should have both: > > > > - Identity columns should only use sequences with an INTERNAL dependency, > > as in Peter's patch. > > I have committed this. Since this is a bug fix, shouldn't it be backpatched? Yours, Laurenz Albe
Re: Optimize single tuple fetch from nbtree index
Hi Peter,

> Actually, having looked at the test case in more detail, that now
> seems less likely. The test case seems designed to reward making it
> cheaper to access one specific tuple among a fairly large group of
> related tuples -- reducing the number of inner scans is not going to
> be possible there.

> If this really is totally representative of the case that Floris cares
> about, I suppose that the approach taken more or less makes sense.
> Unfortunately, it doesn't seem like an optimization that many other
> users would find compelling, partly because it's only concerned with
> fixed overheads, and partly because most queries don't actually look
> like this.

Thanks for taking a look. Unfortunately this is exactly the case I care about. I'm a bit puzzled as to why this case wouldn't come up more often for other users though. We have many large tables with timeseries data, and it seems to me that with timeseries data two of the most common queries are:

(1) What is the state of { a1, a2, a3 ... } at timepoint t (but you don't know that there's an update *exactly* at timepoint t - so you're left with trying to find the latest update smaller than t)

(2) What is the state of { a } at timepoints { t1, t2, t3 ... }

Given that a1, a2, a3 ... are independently updating, but similar, time series (e.g. sensor a1 and sensor a2 both provide a temperature value but update independently from each other). Both of these can also be done with some kind of DISTINCT ON clause, but this is often already much slower than just doing a nested loop of fast index lookups with LIMIT 1 (this depends on the frequency of the timeseries data itself versus the sampling rate of your query though; for high-frequency time series and/or low-frequency sampling the LIMIT 1 approach is much faster).

Note that there is actually some related work to this - in the Index Skip Scan thread [1] a function called _bt_read_closest was developed which also partially reads the page. A Skip Scan has a very similar access pattern to the use case I describe here, because it's also very likely to just require one tuple from the page. Even though the implementation in that patch is currently incorrect, performance of the Skip Scan would likely also be quite a bit faster if it had a correct implementation of this partial page read and it wouldn't have to read the full page every time.

I have one further question about these index offsets. There are several comments in master that indicate that it's impossible for an item to move 'left' on a page if we continuously hold a pin on the page. For example, _bt_killitems has a comment like this:

 * Note that if we hold a pin on the target page continuously from initially
 * reading the items until applying this function, VACUUM cannot have deleted
 * any items from the page, and so there is no need to search left from the
 * recorded offset. (This observation also guarantees that the item is still
 * the right one to delete, which might otherwise be questionable since heap
 * TIDs can get recycled.) This holds true even if the page has been modified
 * by inserts and page splits, so there is no need to consult the LSN.

Still, exactly this case happens in practice.
In my tests I was able to get behavior like:

1) pin + lock a page in _bt_first
2) read a tuple, record the indexOffset (for example offset=100) and heap TID
3) unlock the page, but *keep* the pin (end of _bt_first in my patch)
4) lock the page again in _bt_next (we still hold the pin, so vacuum shouldn't have occurred)
5) look inside the current page for the heap TID that we registered earlier
6) find that this tuple is now at indexOffset=98, i.e. it moved left

This should not be possible. This case sometimes randomly happens when running 'make check', which is why I added code in my patch to also look left on the page from the previous indexOffset. However, this is in contradiction with the comments (and code) of _bt_killitems. Is the comment incorrect/outdated, or is there a bug in vacuum or any other part of Postgres that might move index items left even though there are others holding a pin?

-Floris

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGKW4dXTP9G%2BWBskjT09tzD%2B9aMWEm%3DFpeb6RS5SXfPyKw%40mail.gmail.com#21abe755d5cf36aabaaa048c8a282169
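For what it's worth, a rough sketch of the kind of leftward re-search described above, assuming a still-pinned and re-locked leaf page (this is purely illustrative and not the actual patch code):

    #include "postgres.h"
    #include "access/nbtree.h"
    #include "storage/bufpage.h"

    static OffsetNumber
    refind_saved_item(Page page, OffsetNumber prev_offnum, ItemPointer saved_htid)
    {
        BTPageOpaque    opaque = (BTPageOpaque) PageGetSpecialPointer(page);
        OffsetNumber    minoff = P_FIRSTDATAKEY(opaque);
        OffsetNumber    start = Min(prev_offnum, PageGetMaxOffsetNumber(page));
        OffsetNumber    offnum;

        /* walk left from the remembered offset until the saved heap TID is found */
        for (offnum = start; offnum >= minoff; offnum = OffsetNumberPrev(offnum))
        {
            ItemId      itemid = PageGetItemId(page, offnum);
            IndexTuple  itup = (IndexTuple) PageGetItem(page, itemid);

            if (ItemPointerEquals(&itup->t_tid, saved_htid))
                return offnum;      /* may be left of prev_offnum, as observed */
        }

        return InvalidOffsetNumber; /* not found on this page */
    }

Whether such a re-search should ever be necessary while the pin is held continuously is exactly the question raised above.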
psql's meta command \d is broken as of 12 beta2.
Hi,

Sorry if this report is a duplicate, but there is no column relhasoids in pg_catalog.pg_class, which is required by the underlying query of the \d meta-command of psql.
Re: psql's meta command \d is broken as of 12 beta2.
Hi po 5. 8. 2019 v 13:35 odesílatel Dmitry Igrishin napsal: > Hi, > > Sorry if this report is duplicate but there is no column relhasoids in > pg_catalog.pg_class required by the underlying query of the \d meta > command of psql. > do you use psql from this release? The psql client should be higher or same like server. Pavel
Re: psql's meta command \d is broken as of 12 beta2.
пн, 5 авг. 2019 г. в 14:42, Pavel Stehule : > > Hi > > > po 5. 8. 2019 v 13:35 odesílatel Dmitry Igrishin napsal: >> >> Hi, >> >> Sorry if this report is duplicate but there is no column relhasoids in >> pg_catalog.pg_class required by the underlying query of the \d meta >> command of psql. > > > do you use psql from this release? > > The psql client should be higher or same like server. > > Pavel Oops, I'm wrong. When I looked at describe.c I understood my mistake. Sorry for noise and thank you!
Re: psql's meta command \d is broken as of 12 beta2.
po 5. 8. 2019 v 13:46 odesílatel Dmitry Igrishin napsal: > пн, 5 авг. 2019 г. в 14:42, Pavel Stehule : > > > > Hi > > > > > > po 5. 8. 2019 v 13:35 odesílatel Dmitry Igrishin > napsal: > >> > >> Hi, > >> > >> Sorry if this report is duplicate but there is no column relhasoids in > >> pg_catalog.pg_class required by the underlying query of the \d meta > >> command of psql. > > > > > > do you use psql from this release? > > > > The psql client should be higher or same like server. > > > > Pavel > Oops, I'm wrong. When I looked at describe.c I understood my mistake. > Sorry for noise and thank you! > no problem Pavel
Re: POC: Cleaning up orphaned files using undo logs
On Mon, Aug 5, 2019 at 6:16 AM Amit Kapila wrote: > For zheap, we collect all the records of a page, apply them together > and then write the entire page in WAL. The progress of transaction is > updated at either transaction end (rollback complete) or after > processing some threshold of undo records. So, generally, the WAL > won't be for each undo record apply. This explanation omits a crucial piece of the mechanism, because Heikki is asking what keeps the undo from being applied multiple times. When we apply the undo records to a page, we also adjust the undo pointers in the page. Since we have an undo pointer per transaction slot, and each transaction has its own slot, if we apply all the undo for a transaction to a page, we can just clear the slot; if we somehow end up back at the same point later, we'll know not to apply the undo a second time because we'll see that there's no transaction slot pointing to the undo we were thinking of applying. If we roll back to a savepoint, or for some other reason choose to apply only some of the undo to a page, we can set the undo record pointer for the transaction back to the value it had before we generated any newer undo. Then, we'll know that the newer undo doesn't need to be applied but the older undo can be applied. At least, I think that's how it's supposed to work. If you just update the progress field, it doesn't guarantee anything, because in the event of a crash, we could end up keeping the page changes but losing the update to the progress, as they are part of separate undo records. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
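To make the idempotence argument concrete, here is a pseudo-code sketch of the check described above; every name in it (find_transaction_slot, slot_undo_ptr, and so on) is a hypothetical placeholder used for illustration, not an API from the patch or from zheap:

    /*
     * Decide whether an undo record may still be applied to a page.
     * Illustrative only; all helper functions are hypothetical.
     */
    static bool
    undo_still_applies(Page page, TransactionId xid, UndoRecPtr urecptr)
    {
        int     slot = find_transaction_slot(page, xid);

        /* Slot already cleared: all undo for xid was applied to this page. */
        if (slot == -1)
            return false;

        /*
         * The slot's undo pointer was already rewound past this record (for
         * example by a completed rollback to savepoint), so this newer
         * record must not be applied again.
         */
        if (urecptr > slot_undo_ptr(page, slot))
            return false;

        return true;
    }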
Re: POC: Cleaning up orphaned files using undo logs
On Sun, Aug 4, 2019 at 5:16 AM Heikki Linnakangas wrote:
> I feel that the level of abstraction is not quite right. There are a
> bunch of fields, like uur_block, uur_offset, uur_tuple, that are
> probably useful for some UNDO resource managers (zheap I presume), but
> seem kind of arbitrary. How is uur_tuple different from uur_payload?
> Should they be named more generically as uur_payload1 and uur_payload2?
> And why two, why not three or four different payloads? In the WAL record
> format, there's a concept of "block id", which allows you to store N
> number of different payloads in the record, I think that would be a
> better approach. Or only have one payload, and let the resource manager
> code divide it as it sees fit.
>
> Many of the fields support a primitive type of compression, where a
> field can be omitted if it has the same value as on the first record on
> an UNDO page. That's handy. But again I don't like the fact that the
> fields have been hard-coded into the UNDO record format. I can see e.g.
> the relation oid to be useful for many AMs. But not all. And other AMs
> might well want to store and deduplicate other things, aside from the
> fields that are in the patch now. I'd like to move most of the fields to
> AM specific code, and somehow generalize the compression. One approach
> would be to let the AM store an arbitrary struct, and run it through a
> general-purpose compression algorithm, using the UNDO page's first
> record as the "dictionary".

I thought about this, too. I agree that there's something a little unsatisfying about the current structure, but I haven't been able to come up with something that seems definitively better. I think something along the lines of what you are describing here might work well, but I am VERY doubtful about the idea of a fixed-size struct. I think AMs are going to want to store variable-length data: especially tuples, but maybe also other stuff. For instance, imagine some AM that wants to implement locking that's more fine-grained than the four levels of tuple locks we have today: instead of just having key locks and all-columns locks, you could want to store the exact columns to be locked. Or maybe your TIDs are variable-width.

And the problem is that as soon as you move to something where you pack in a bunch of variable-sized fields, you lose the ability to refer to things using reasonable names. That's where I came up with the idea of an UnpackedUndoRecord: give the common fields that "everyone's going to need" human-readable names, and jam only the strange, AM-specific stuff into the payload. But if those needs are not actually universal but very much AM-specific, then I'm afraid we're going to end up with deeply inscrutable code for packing and unpacking records. I imagine it's possible to come up with a good structure for that, but I don't think we have one today.

> I don't like the way UndoFetchRecord returns a palloc'd
> UnpackedUndoRecord. I would prefer something similar to the xlogreader
> API, where a new call to UndoFetchRecord invalidates the previous
> result. On efficiency grounds, to avoid the palloc, but also to be
> consistent with xlogreader.

I don't think that's going to work very well, because we often need to deal with multiple records at a time.
There is (or was) a bulk-fetch interface, but I've also found while experimenting with this code that it can be useful to do things like:

    current = undo_fetch(starting_record);
    loop:
        next = undo_fetch(current->next_record_ptr);
        if some_test(next):
            break;
        undo_free(current);
        current = next;

I think we shouldn't view such cases as exceptions to the general paradigm of looking at undo records one at a time, but instead as the normal case for which everything is optimized. Cases like orphaned file cleanup where the number of undo records is probably small and they're all independent of each other will, I think, turn out to be the exception rather than the rule.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: block-level incremental backup
On Fri, Aug 2, 2019 at 9:13 AM vignesh C wrote: > + rc = system(copycmd); I don't think this patch should be calling system() in the first place. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Problem with default partition pruning
On 2019-Aug-05, yuzuko wrote:

> However, I'm still concerned that the block
> -
> ...
> -
> is written in the right place as Amit explained [1].

Yeah, I have that patch installed locally and I'm looking into it. Thanks for the reminder. I also have an eye on Horiguchi's patch elsewhere in the thread.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: SSL Connection still showing TLSv1.3 even though it is disabled in ssl_ciphers
tushar writes:
> when I connect to the psql terminal -
> psql.bin (10.9)
> SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384,
> bits: 256, compression: off)
> Type "help" for help.
> postgres=# show ssl_ciphers ;
>                  ssl_ciphers
> ----------------------------------------------
>  TLSv1.2:!aNULL:!SSLv2:!SSLv3:!TLSv1:!TLSv1.3
> (1 row)

My guess is that OpenSSL ignored your ssl_ciphers setting on the grounds that it's stupid to reject all possible ciphers. In any case, this would be something to raise with them, not us. PG does nothing with that value except pass it to SSL_CTX_set_cipher_list().

regards, tom lane
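A minimal sketch of the call path Tom describes, with one caveat that is a likely explanation rather than something established in this thread: in OpenSSL 1.1.1, SSL_CTX_set_cipher_list() only governs the TLSv1.2-and-older cipher list, while TLSv1.3 cipher suites are configured separately through SSL_CTX_set_ciphersuites(), which would explain why the "!TLSv1.3" entry in ssl_ciphers has no visible effect:

    #include <openssl/ssl.h>

    /* Hand the ssl_ciphers GUC string straight to OpenSSL, as the server does. */
    static int
    apply_ssl_ciphers(SSL_CTX *context, const char *ssl_ciphers)
    {
        /*
         * Affects TLSv1.2-and-older cipher selection only; TLSv1.3 suites
         * would be controlled separately, e.g.:
         *     SSL_CTX_set_ciphersuites(context, "TLS_AES_256_GCM_SHA384");
         * Returns 1 on success, 0 if no cipher could be selected.
         */
        return SSL_CTX_set_cipher_list(context, ssl_ciphers);
    }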
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On Fri, Aug 02, 2019 at 06:38:46PM -0400, Stephen Frost wrote:
> >On Fri, Aug 2, 2019 at 18:27 Tom Lane wrote:
> >>Tomas Vondra writes:
> >>> There seems to be a consensus that this is not a pg_basebackup issue
> >>> (i.e. duplicate values don't make the file invalid), and it should be
> >>> handled in ALTER SYSTEM.
> >>
> >>Yeah. I doubt pg_basebackup is the only actor that can create such
> >>situations.
> >>
> >>> The proposal seems to be to run through the .auto.conf file, remove any
> >>> duplicates, and append the new entry at the end. That seems reasonable.
> >>
> >>+1
> >
> >I disagree that this should only be addressed in alter system, as I’ve said
> >before and as others have agreed with. Having one set of code that can be
> >used to update parameters in the auto.conf and then have that be used by
> >pg_basebackup, alter system, and external tools, is the right approach.
> >
> >The idea that alter system should be the only thing that doesn’t just
> >append changes to the file is just going to lead to confusion and bugs down
> >the road.
>
> I don't remember any suggestions ALTER SYSTEM should be the only thing
> that can rewrite the config file, but maybe it's buried somewhere in the
> thread history. The current proposal certainly does not prohibit any
> external tool from doing so, it just says we should expect duplicates.

There's an ongoing assumption that's been made that only ALTER SYSTEM could make these changes because nothing else has the full GUC system and a running PG instance to validate everything. The suggestion that an external tool could do it goes against that.

If we can't, for whatever reason, work our way towards having code that external tools could leverage to manage .auto.conf, then if we could at least document what the expectations are and what tools can/can't do with the file, that would put us in a better position than where we are now.

I strongly believe that whatever the rules and expectations are that we come up with, both ALTER SYSTEM and the in-core and external tools should follow them.

If we say that tools should expect duplicates in the file, then ALTER SYSTEM should as well, which was the whole issue in the first place- ALTER SYSTEM didn't expect duplicates, but the external tools and the GUC system did.

If we say that it's acceptable for something to remove duplicate GUC entries from the file, keeping the last one, then external tools should feel comfortable doing that too, and we should make it clear what "duplicate" means here and how to identify one.

If we say it's optional for a tool to remove duplicates, then we should point out the risk of "running out of disk space" for tool authors to consider. I don't agree with the idea that tool authors should be asked to depend on someone running ALTER SYSTEM to address that risk. If there's a strong feeling that tool authors should be able to depend on PG to perform that cleanup for them, then we should use a mechanism to do so which doesn't involve an entirely optional feature.

For reference, all of the above, while not as cleanly as it could have been, was addressed with the recovery.conf/recovery.done system. Tool authors had a good sense that they could replace that file, and that PG would clean it up at exactly the right moment, and there wasn't this ugly interaction with ALTER SYSTEM to have to worry about. That none of this was really even discussed or addressed previously, even after being pointed out, is really disappointing.

Just to be clear, I brought up this exact concern back in *November*:

https://www.postgresql.org/message-id/20181127153405.GX3415%40tamriel.snowman.net

And now, because this was pushed forward and the concerns that I raised were ignored, we're having to deal with this towards the end of the release cycle instead of during normal development. The things we're talking about now, and which I'm getting push-back on because of the release cycle situation, were specifically suggestions I made in the above email in November, where I also brought up the concern that ALTER SYSTEM would be confused by the duplicates: giving external tools guidance on how to modify .auto.conf, or providing them a tool (or library), or both. None of this should be coming as a surprise to anyone who was following, and I feel we should be upset that this was left to such a late point in the release cycle to address these issues.

> >>> There was a discussion whether to print warnings about the duplicates. I
> >>> personally see not much point in doing that - if we consider duplicates
> >>> to be expected, and if ALTER SYSTEM has the license to rework the config
> >>> file any way it wants, why warn about it?
> >>
> >>Personally I agree that warnings are unnecessary.
> >
> >And at least Magnus and I disagree with that, as I recall from this
> >thread. Let’s have a clean and clear way to modify the auto.conf and have
> >everything that touch
Re: block-level incremental backup
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Fri, Aug 2, 2019 at 9:13 AM vignesh C wrote: > > + rc = system(copycmd); > > I don't think this patch should be calling system() in the first place. +1. Thanks, Stephen signature.asc Description: PGP signature
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Here's a radical suggestion: replace postgresql.auto.conf with a directory containing multiple files. Each file is named after a configuration parameter, and its content is the value of the parameter. So to remove a special configuration parameter, delete its file. To set it, write the file, replacing an existing file if it exists. For this to work unambiguously we would have to specify an exact, case-sensitive, form of every parameter name that must be used within the auto conf directory. I would suggest using the form listed in the documentation (i.e., lower case, to my knowledge). In order to prevent confusing and surprising behaviour, the system should complain vociferously if it finds a configuration parameter file that is not named after a defined parameter, rather than just ignoring it. On Mon, 5 Aug 2019 at 10:21, Stephen Frost wrote: > Greetings, > > * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote: > > On Fri, Aug 02, 2019 at 06:38:46PM -0400, Stephen Frost wrote: > > >On Fri, Aug 2, 2019 at 18:27 Tom Lane wrote: > > >>Tomas Vondra writes: > > >>> There seems to be a consensus that this this not a pg_basebackup > issue > > >>> (i.e. duplicate values don't make the file invalid), and it should be > > >>> handled in ALTER SYSTEM. > > >> > > >>Yeah. I doubt pg_basebackup is the only actor that can create such > > >>situations. > > >> > > >>> The proposal seems to be to run through the .auto.conf file, remove > any > > >>> duplicates, and append the new entry at the end. That seems > reasonable. > > >> > > >>+1 > > > > > >I disagree that this should only be addressed in alter system, as I’ve > said > > >before and as others have agreed with. Having one set of code that can > be > > >used to update parameters in the auto.conf and then have that be used by > > >pg_basebackup, alter system, and external tools, is the right approach. > > > > > >The idea that alter system should be the only thing that doesn’t just > > >append changes to the file is just going to lead to confusion and bugs > down > > >the road. > > > > I don't remember any suggestions ALTER SYSTEM should be the only thing > > that can rewrite the config file, but maybe it's buried somewhere in the > > thread history. The current proposal certainly does not prohibit any > > external tool from doing so, it just says we should expect duplicates. > > There's an ongoing assumption that's been made that only ALTER SYSTEM > could make these changes because nothing else has the full GUC system > and a running PG instance to validate everything. > > The suggestion that an external tool could do it goes against that. > > If we can't, for whatever reason, work our way towards having code that > external tools could leverage to manage .auto.conf, then if we could at > least document what the expectations are and what tools can/can't do > with the file, that would put us in a better position than where we are > now. > > I strongly believe that whatever the rules and expectations are that we > come up with, both ALTER SYSTEM and the in-core and external tools > should follow them. > > If we say to that tools should expect duplicates in the file, then > ALTER SYSTEM should as well, which was the whole issue in the first > place- ALTER SYSTEM didn't expect duplicates, but the external tools and > the GUC system did. 
> > If we say that it's acceptable for something to remove duplicate GUC > entries from the file, keeping the last one, then external tools should > feel comfortable doing that too and we should make it clear what > "duplicate" means here and how to identify one. > > If we say it's optional for a tool to remove duplicates, then we should > point out the risk of "running out of disk space" for tool authors to > consider. I don't agree with the idea that tool authors should be asked > to depend on someone running ALTER SYSTEM to address that risk. If > there's a strong feeling that tool authors should be able to depend on > PG to perform that cleanup for them, then we should use a mechanism to > do so which doesn't involve an entirely optional feature. > > For reference, all of the above, while not as cleanly as it could have > been, was addressed with the recovery.conf/recovery.done system. Tool > authors had a good sense that they could replace that file, and that PG > would clean it up at exactly the right moment, and there wasn't this > ugly interaction with ALTER SYSTEM to have to worry about. That none of > this was really even discussed or addressed previously even after being > pointed out is really disappointing. > > Just to be clear, I brought up this exact concern back in *November*: > > > https://www.postgresql.org/message-id/20181127153405.GX3415%40tamriel.snowman.net > > And now because this was pushed forward and the concerns that I raised > ignored, we're having to deal with this towards the end of the release > cycle instead of during normal development. The things we're talking > about now and which I'm gettin
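A minimal sketch of how a server-side loader for the per-parameter configuration directory proposed at the top of this message could look. The directory layout, the assign callback, and every name here are assumptions made up for illustration, not an existing PostgreSQL feature; a real loader would also have to verify that each file name is a known, canonically spelled parameter, as suggested above.

#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for whatever the server would do with each (name, value) pair. */
typedef void (*assign_hook) (const char *name, const char *value);

/*
 * Read a hypothetical configuration directory in which every file is named
 * after exactly one parameter and contains only that parameter's value.
 * Removing a setting is then just unlinking its file; setting it is an
 * atomic file replace.
 */
static void
read_auto_conf_dir(const char *dirpath, assign_hook assign)
{
	DIR		   *dir = opendir(dirpath);
	struct dirent *de;

	if (dir == NULL)
		return;					/* a real loader would report this error */

	while ((de = readdir(dir)) != NULL)
	{
		char		path[1024];
		char		value[1024];
		FILE	   *fp;

		if (de->d_name[0] == '.')
			continue;			/* skip ".", ".." and hidden files */

		snprintf(path, sizeof(path), "%s/%s", dirpath, de->d_name);
		fp = fopen(path, "r");
		if (fp == NULL)
			continue;
		if (fgets(value, sizeof(value), fp) != NULL)
		{
			value[strcspn(value, "\n")] = '\0';	/* drop trailing newline */
			assign(de->d_name, value);			/* file name is the GUC name */
		}
		fclose(fp);
	}
	closedir(dir);
}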
Re: Problem with default partition pruning
On 2019-Aug-05, yuzuko wrote: > So I proposed moving the if() block to the current place. > The latest patch can solve both queries but I found the latter > problem can be solved by setting constraint_exclusion = on. So we have three locations for that test: one is where it currently is, which handles a small subset of the cases; another is where Amit first proposed putting it, which handles some additional cases; and the third is where your latest patch puts it, which seems to handle all cases. Isn't that what Amit is saying? If that's correct (and that's what I want to imply with the comment changes I proposed), then we should just accept that version of the patch. I don't think that we care about what happens when constraint_exclusion is on. That's not the recommended value for that setting anyway. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On Mon, Aug 05, 2019 at 10:21:39AM -0400, Stephen Frost wrote: Greetings, * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote: On Fri, Aug 02, 2019 at 06:38:46PM -0400, Stephen Frost wrote: >On Fri, Aug 2, 2019 at 18:27 Tom Lane wrote: >>Tomas Vondra writes: >>> There seems to be a consensus that this this not a pg_basebackup issue >>> (i.e. duplicate values don't make the file invalid), and it should be >>> handled in ALTER SYSTEM. >> >>Yeah. I doubt pg_basebackup is the only actor that can create such >>situations. >> >>> The proposal seems to be to run through the .auto.conf file, remove any >>> duplicates, and append the new entry at the end. That seems reasonable. >> >>+1 > >I disagree that this should only be addressed in alter system, as I’ve said >before and as others have agreed with. Having one set of code that can be >used to update parameters in the auto.conf and then have that be used by >pg_basebackup, alter system, and external tools, is the right approach. > >The idea that alter system should be the only thing that doesn’t just >append changes to the file is just going to lead to confusion and bugs down >the road. I don't remember any suggestions ALTER SYSTEM should be the only thing that can rewrite the config file, but maybe it's buried somewhere in the thread history. The current proposal certainly does not prohibit any external tool from doing so, it just says we should expect duplicates. There's an ongoing assumption that's been made that only ALTER SYSTEM could make these changes because nothing else has the full GUC system and a running PG instance to validate everything. The suggestion that an external tool could do it goes against that. If we can't, for whatever reason, work our way towards having code that external tools could leverage to manage .auto.conf, then if we could at least document what the expectations are and what tools can/can't do with the file, that would put us in a better position than where we are now. IMO documenting the basic rules, and then doing some cleanup/validation at instance start is the only practical solution, really. You can't really validate "everything" without a running instance, because that's the only place where you have GUCs defined by extensions. I don't see how that could work for external tools, expected to run exactly when the instance is not running. I can't think of a use case where simply appending to the file would not be perfectly sufficient. You can't really do much when the instance is not running. I strongly believe that whatever the rules and expectations are that we come up with, both ALTER SYSTEM and the in-core and external tools should follow them. I'm not against giving external tools such capability, in whatever way we think is appropriate (library, command-line binary, ...). I'm against (a) making that a requirement for the external tools, instead of just allowing them to append to the file, and (b) trying to do that in PG12. We're at beta3, we don't even have any patch, and it does quite work for past releases (although it's not that pressing there, thanks to still having recovery.conf). If we say to that tools should expect duplicates in the file, then ALTER SYSTEM should as well, which was the whole issue in the first place- ALTER SYSTEM didn't expect duplicates, but the external tools and the GUC system did. Sure. 
If we say that it's acceptable for something to remove duplicate GUC entries from the file, keeping the last one, then external tools should feel comfortable doing that too and we should make it clear what "duplicate" means here and how to identify one. Sure. I don't see why the external tools would bother with doing that, but I agree there's no reason not to document what duplicates mean. If we say it's optional for a tool to remove duplicates, then we should point out the risk of "running out of disk space" for tool authors to consider. I don't agree with the idea that tool authors should be asked to depend on someone running ALTER SYSTEM to address that risk. If there's a strong feeling that tool authors should be able to depend on PG to perform that cleanup for them, then we should use a mechanism to do so which doesn't involve an entirely optional feature. Considering the external tools are only allowed to modify the file while the instance is not running, and that most instances are running all the time, I very much doubt this is a risk we need to worry about. And I don't see why we'd have to run ALTER SYSTEM - I proposed to do the cleanup at instance start too. For reference, all of the above, while not as cleanly as it could have been, was addressed with the recovery.conf/recovery.done system. Tool authors had a good sense that they could replace that file, and that PG would clean it up at exactly the right moment, and there wasn't this ugly interaction with ALTER SYSTEM to have to worry about. That none of this was really even discussed or addressed previ
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Tomas Vondra writes: > On Mon, Aug 05, 2019 at 10:21:39AM -0400, Stephen Frost wrote: >> I'd be happier with one set of code at least being the recommended >> approach to modifying the file and only one set of code in our codebase >> that's messing with .auto.conf, so that, hopefully, it's done >> consistently and properly and in a way that everything agrees on and >> expects, but if we can't get there due to concerns about where we are in >> the release cycle, et al, then let's at least document what is >> *supposed* to happen and have our code do so. > I think fixing ALTER SYSTEM to handle duplicities, and documenting the > basic rules about modifying .auto.conf is the way to go. I agree. So the problem at the moment is we lack a documentation patch. Who wants to write it? regards, tom lane
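To make the intended semantics concrete, here is a simplified, standalone model of the "last duplicate wins" cleanup being discussed. It deliberately uses its own toy struct rather than guc.c's real ConfigVariable list, so treat it as an illustration of the rule the code and documentation should describe, not as the eventual implementation.

#include <stdio.h>
#include <string.h>
#include <strings.h>

/* Simplified stand-in for a parsed "name = value" line. */
typedef struct ConfLine
{
	const char *name;
	const char *value;
	int			keep;			/* cleared if a later duplicate wins */
} ConfLine;

/*
 * Mark all but the last occurrence of each parameter as dropped.
 * "Duplicate" here means a case-insensitive match on the parameter name,
 * which is how GUC names are compared.
 */
static void
drop_earlier_duplicates(ConfLine *lines, int nlines)
{
	for (int i = 0; i < nlines; i++)
	{
		lines[i].keep = 1;
		for (int j = i + 1; j < nlines; j++)
		{
			if (strcasecmp(lines[i].name, lines[j].name) == 0)
			{
				lines[i].keep = 0;	/* a later entry overrides this one */
				break;
			}
		}
	}
}

int
main(void)
{
	ConfLine	lines[] = {
		{"primary_conninfo", "host=old", 0},
		{"wal_level", "replica", 0},
		{"primary_conninfo", "host=new", 0},	/* appended later, wins */
	};
	int			n = sizeof(lines) / sizeof(lines[0]);

	drop_earlier_duplicates(lines, n);
	for (int i = 0; i < n; i++)
		if (lines[i].keep)
			printf("%s = '%s'\n", lines[i].name, lines[i].value);
	return 0;
}

Rewriting the file with only the kept lines, then appending the newly set value, matches the proposal quoted above: run through the file, drop duplicates, and append the new entry at the end.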
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Isaac Morland writes: > Here's a radical suggestion: replace postgresql.auto.conf with a directory > containing multiple files. Each file is named after a configuration > parameter, and its content is the value of the parameter. Hmm ... that seems like a lot of work and code churn --- in particular, guaranteed breakage of code that works today --- to solve a problem we haven't got. The problem we do have is lack of documentation, which this wouldn't in itself remedy. > In order to prevent confusing and surprising behaviour, the system should > complain vociferously if it finds a configuration parameter file that is > not named after a defined parameter, rather than just ignoring it. As has been pointed out repeatedly, the set of known parameters just isn't that stable. Different builds can recognize different sets of GUCs, even without taking extensions into account. regards, tom lane
Re: POC: Cleaning up orphaned files using undo logs
On Tue, Jul 30, 2019 at 4:02 AM Andres Freund wrote: > I'm a bit worried about expanding the use of > ReadBufferWithoutRelcache(). Not so much because of the relcache itself, > but because it requires doing separate smgropen() calls. While not > crazily expensive, it's also not free. Especially combined with closing > all such relations at transaction end (c.f. AtEOXact_SMgr). > > I'm somewhat inclined to think that this requires a slightly bigger > refactoring than done in this patch. Imo at the very least the smgr > entries ought not to be unowned. But working towards not haven to > re-open the smgr entry for every single trival request ought to be part > of this too. I spent some time trying to analyze this today and I agree with you that there seems to be room for improvement here. When I first looked at your comments, I wasn't too convinced, because access patterns that skip around between undo logs seem like they may be fairly common. Admittedly, there are cases where we want to read from just one undo log over and over again, and it would be good to optimize those, but I was initially a bit unconvinced that that there was a problem here worth being concerned about. Then I realized that you would also repeat the smgropen() if you read a single record that happens to be split across two pages, which seems a little silly. But then I realized that we're being a little silly even in the case where we're reading a single undo record that is stored entirely on a single page. We are certainly going to need to look up the undo log, but as things stand, we'll basically do it twice. For example, in the write path, we'll call UndoLogAllocate() and it will look up an UndoLogControl object for the undo log of interest, and then we'll call ReadBufferWithoutRelcache() which will call smgropen() which will do a hash table lookup to find the SMgrRelation associated with that undo log. That's not a large cost, as you say, but it does seem like it might be better to avoid having two different lookups in the same commonly-used code path, each of which peeks into a different backend-private data structure for information about the very same undo log. The obvious thing to do seems to be to have UndoLogControl objects own SmgrRelations. That would be something of a novelty, since it looks like currently only a Relation ever owns an SMgrRelation, but the smgr infrastructure seems to have been set up in a generic way so as to permit that sort of thing, so it seems like it should be workable. Perhaps UndoLogAllocate() function could return a pointer to the UndoLogControl object as well as UndoRecPtr. Then, there could be a function UndoLogWrite(UndoLogControl *, UndoRecPtr, char *, Size). On the read side, instead of calling UndoRecPtrAssignRelFileNode, maybe the undo log storage layer should provide a function that again returns an UndoLogControl, and then we could have a matching function UndoLogRead(UndoLogControl *, UndoRecPtr, char *, Size). I think this kind of design would address your concerns about using the unowned list, too, since the UndoLogControl objects would be owning the SMgrRelations. It took me a while to understand why you were concerned about using the unowned list, so I'm going to repeat it in my own words to make sure I've got it right, and also to possibly help out anyone else who may also have had difficulty grokking your concern. 
If we have a bunch of short transactions each of which accesses the same relation, the relcache entry will remain open and the file won't get closed in between, but if we have a bunch of short transactions each of which accesses the same undo log, the undo log will be closed and reopened at the operating system level for each individual transaction. That happens because when an SMgrRelation is "owned," the owner takes care of closing it, and so can keep it open across transactions, but when it's "unowned," it's automatically closed during transaction cleanup. And we should fix it, because closing and reopening the same file for every transaction unnecessarily might be expensive enough to matter, at least a little bit. How does all that sound? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
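A rough sketch of what that ownership could look like in code. The UndoLogControl fields and the undo_log_get_rnode() helper are made up for illustration (the real undo patch may arrange this differently); smgropen(), smgrsetowner() and InvalidBackendId are the existing smgr APIs.

#include "postgres.h"

#include "storage/backendid.h"
#include "storage/smgr.h"

/*
 * Sketch: cache an owned SMgrRelation inside each UndoLogControl so the
 * underlying undo files stay open across transactions instead of being
 * closed as "unowned" by AtEOXact_SMgr().
 */
static SMgrRelation
undo_log_smgr(UndoLogControl *log)	/* UndoLogControl is from the patch */
{
	if (log->smgr == NULL)		/* hypothetical cached field */
	{
		RelFileNode rnode = undo_log_get_rnode(log);	/* hypothetical helper */

		log->smgr = smgropen(rnode, InvalidBackendId);

		/*
		 * Register ownership: if anything else smgrclose()s the entry, our
		 * pointer is reset to NULL instead of dangling, and the entry stays
		 * off the unowned list that transaction cleanup would close.
		 */
		smgrsetowner(&log->smgr, log->smgr);
	}
	return log->smgr;
}

The proposed UndoLogRead()/UndoLogWrite() functions would then go through this cached handle instead of calling smgropen() once per request.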
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Re: Tomas Vondra 2019-08-03 <20190803124111.2aaddumd7url5wmq@development>
> If we really want to give external tools a sensible (and optional) API
> to access the file, a simple command-line tool seems much better. Say we
> have something like
>
> pg_config_file -f PATH --set KEY VALUE
> pg_config_file -f PATH --get KEY

Fwiw, Debian has pg_conftool (based on the perl lib around PgCommon.pm):

NAME
       pg_conftool - read and edit PostgreSQL cluster configuration files

SYNOPSIS
       pg_conftool [options] [version cluster] [configfile] command

DESCRIPTION
       pg_conftool allows to show and set parameters in PostgreSQL
       configuration files.

       If version cluster is omitted, it defaults to the default cluster
       (see user_clusters(5) and postgresqlrc(5)). If configfile is omitted,
       it defaults to postgresql.conf. configfile can also be a path, in
       which case version cluster is ignored.

OPTIONS
       -b, --boolean
              Format boolean value as on or off (not for "show all").

       -s, --short
              Show only the value (instead of key = value pair).

       -v, --verbose
              Verbose output.

       --help
              Print help.

COMMANDS
       show parameter|all
              Show a parameter, or all present in this config file.

       set parameter value
              Set or update a parameter.

       remove parameter
              Remove (comment out) a parameter from a config file.

       edit
              Open the config file in an editor. Unless $EDITOR is set, vi
              is used.

SEE ALSO
       user_clusters(5), postgresqlrc(5)

AUTHOR
       Christoph Berg

Debian                            2019-07-15                    PG_CONFTOOL(1)
Re: pgbench - implement strict TPC-B benchmark
Hello Andres, If not, do you think advisable to spend time improving the evaluator & variable stuff and possibly other places for an overall 15% gain? Also, what would be the likelyhood of such optimization patch to pass? I could do a limited variable management improvement patch, eventually, I have funny ideas to speedup the thing, some of which outlined above, some others even more terrible. Attached is a quick PoC. sh> cat set.sql \set x 0 \set y :x \set z :y \set w :z \set v :w \set x :v \set y :x \set z :y \set w :z \set v :w \set x :v \set y :x \set z :y \set w :z \set v :w \set x1 :x \set x2 :x \set y1 :z \set y0 :w \set helloworld :x Before the patch: sh> ./pgbench -T 10 -f vars.sql ... tps = 802966.183240 (excluding connections establishing) # 0.8M After the patch: sh> ./pgbench -T 10 -f vars.sql ... tps = 2665382.878271 (excluding connections establishing) # 2.6M Which is a (somehow disappointing) * 3.3 speedup. The impact on the 3 complex expressions tests is not measurable, though. Probably variable management should be reworked more deeply to cleanup the code. Questions: - how likely is such a patch to pass? (IMHO not likely) - what is its impact to overall performance when actual queries are performed (IMHO very small). -- Fabien.diff --git a/src/bin/pgbench/Makefile b/src/bin/pgbench/Makefile index 25abd0a875..39bba12c23 100644 --- a/src/bin/pgbench/Makefile +++ b/src/bin/pgbench/Makefile @@ -7,7 +7,7 @@ subdir = src/bin/pgbench top_builddir = ../../.. include $(top_builddir)/src/Makefile.global -OBJS = pgbench.o exprparse.o $(WIN32RES) +OBJS = pgbench.o exprparse.o symbol_table.o $(WIN32RES) override CPPFLAGS := -I. -I$(srcdir) -I$(libpq_srcdir) $(CPPFLAGS) LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport) diff --git a/src/bin/pgbench/exprparse.y b/src/bin/pgbench/exprparse.y index 17c9ec32c5..e5a15ae136 100644 --- a/src/bin/pgbench/exprparse.y +++ b/src/bin/pgbench/exprparse.y @@ -212,6 +212,7 @@ make_variable(char *varname) expr->etype = ENODE_VARIABLE; expr->u.variable.varname = varname; + expr->u.variable.varnum = getOrCreateSymbolId(symbol_table, varname); return expr; } diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index 570cf3306a..14153ebc35 100644 --- a/src/bin/pgbench/pgbench.c +++ b/src/bin/pgbench/pgbench.c @@ -118,6 +118,7 @@ typedef int pthread_attr_t; static int pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine) (void *), void *arg); static int pthread_join(pthread_t th, void **thread_return); +// TODO: mutex? #elif defined(ENABLE_THREAD_SAFETY) /* Use platform-dependent pthread capability */ #include @@ -232,7 +233,9 @@ const char *progname; volatile bool timer_exceeded = false; /* flag from signal handler */ /* - * Variable definitions. + * Variable contents. + * + * Variable names are managed in symbol_table with a number. * * If a variable only has a string value, "svalue" is that value, and value is * "not set". 
If the value is known, "value" contains the value (in any @@ -243,11 +246,14 @@ volatile bool timer_exceeded = false; /* flag from signal handler */ */ typedef struct { - char *name; /* variable's name */ - char *svalue; /* its value in string form, if known */ - PgBenchValue value; /* actual variable's value */ + int number; /* variable symbol number */ + char *svalue; /* its value in string form, if known */ + PgBenchValue value; /* actual variable's value */ } Variable; +#define undefined_variable_value(v) \ + v.svalue == NULL && v.value.type == PGBT_NO_VALUE + #define MAX_SCRIPTS 128 /* max number of SQL scripts allowed */ #define SHELL_COMMAND_SIZE 256 /* maximum size allowed for shell command */ @@ -395,7 +401,6 @@ typedef struct /* client variables */ Variable *variables; /* array of variable definitions */ int nvariables; /* number of variables */ - bool vars_sorted; /* are variables sorted by name? */ /* various times about current transaction */ int64 txn_scheduled; /* scheduled start time of transaction (usec) */ @@ -493,6 +498,7 @@ static const char *QUERYMODE[] = {"simple", "extended", "prepared"}; * varprefix SQL commands terminated with \gset have this set *to a non NULL value. If nonempty, it's used to prefix the *variable name that receives the value. + * set_varnum variable symbol number for \set and \setshell. * expr Parsed expression, if needed. * stats Time spent in this command. */ @@ -505,6 +511,7 @@ typedef struct Command int argc; char *argv[MAX_ARGS]; char *varprefix; + int set_varnum; PgBenchExpr *expr; SimpleStats stats; } Command; @@ -523,6 +530,10 @@ static int64 total_weight = 0; static int debug = 0; /* debug flag */ +/* also used by exprparse.y */ +#define INITIAL_SYMBOL_TABLE_SIZE 128 +SymbolTable symbol_table = NULL; + /* Bu
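A standalone illustration of the idea behind the PoC above: resolve each variable name to a small integer once at parse time, so that execution can index an array instead of doing a per-use name lookup. The linear scan and memory handling are simplified for brevity and are not how the actual patch implements getOrCreateSymbolId().

#include <stdlib.h>
#include <string.h>

typedef struct SymbolTable
{
	char	  **names;			/* symbol number -> variable name */
	int			nsymbols;
	int			capacity;
} SymbolTable;

/*
 * Return the symbol number for a variable name, creating it if needed.
 * Called while parsing \set and :var references, so that later evaluation
 * can address variables by number.  Error handling omitted.
 */
static int
getOrCreateSymbolId(SymbolTable *st, const char *name)
{
	for (int i = 0; i < st->nsymbols; i++)
	{
		if (strcmp(st->names[i], name) == 0)
			return i;
	}

	if (st->nsymbols == st->capacity)
	{
		st->capacity = (st->capacity > 0) ? st->capacity * 2 : 16;
		st->names = realloc(st->names, st->capacity * sizeof(char *));
	}
	st->names[st->nsymbols] = strdup(name);
	return st->nsymbols++;
}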
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On Mon, Aug 05, 2019 at 03:52:07PM +0900, Ian Barwick wrote: > On 8/4/19 1:59 AM, Tom Lane wrote:> Tomas Vondra > writes: > >> On Fri, Aug 02, 2019 at 06:08:02PM -0700, Andres Freund wrote: > >>> We're WAY WAY past feature freeze. This isn't the time to rewrite guc.c, > >>> guc-file.l to be suitable for running outside of a backend environment. > > > >> Right. And even if we had the code, it's not quite backpatchable (which > >> we probably should do, considering this is a general ALTER SYSTEM issue, > >> so not pg12-only). > > > >> Not to mention there's no clear consensus this is actually desirable. > >> I'd argue forcing external tools (written in arbitrary language) to use > >> this library (written in C), just to modify a "stupid" text file is a > >> bit overkill. IMO duplicates don't make the file invalid, we should > >> handle that correctly/gracefully, so I don't see why external tools > >> could not simply append to the file. We can deduplicate the file when > >> starting the server, on ALTER SYSTEM, or some other time. > > > > Yup. I'd also point out that even if we had a command-line tool of this > > sort, there would be scenarios where it's not practical or not convenient > > to use. We need not go further than "my tool needs to work with existing > > PG releases" to think of good examples. > > I suspect this hasn't been an issue before, simply because until the removal > of recovery.conf AFAIK there hasn't been a general use-case where you'd need > to modify pg.auto.conf while the server is not running. The use-case which now > exists (i.e. for writing replication configuration) is one where the tool will > need to be version-aware anyway (like pg_basebackup is), so I don't see that > as > a particular deal-breaker. > > But... > > > I think we should just accept the facts on the ground, which are that > > some tools modify pg.auto.conf by appending to it > > +1. It's just a text file... So are crontab and /etc/passwd, but there are gizmos that help make it difficult for people to write complete gobbledygook to those. Does it make sense to discuss tooling of that type? Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: pgbench - implement strict TPC-B benchmark
Hi, On 2019-08-05 17:38:23 +0200, Fabien COELHO wrote: > Which is a (somehow disappointing) * 3.3 speedup. The impact on the 3 > complex expressions tests is not measurable, though. I don't know why that could be disappointing. We put in much more work for much smaller gains in other places. > Probably variable management should be reworked more deeply to cleanup the > code. Agreed. > Questions: > - how likely is such a patch to pass? (IMHO not likely) I don't see why? I didn't review the patch in any detail, but it didn't look crazy in quick skim? Increasing how much load can be simulated using pgbench, is something I personally find much more interesting than adding capabilities that very few people will ever use. FWIW, the areas I find current pgbench "most lacking" during development work are: 1) Data load speed. The data creation is bottlenecked on fprintf in a single process. The index builds are done serially. The vacuum could be replaced by COPY FREEZE. For a lot of meaningful tests one needs 10-1000s of GB of testdata - creating that is pretty painful. 2) Lack of proper initialization integration for custom scripts. I.e. have steps that are in the custom script that allow -i, vacuum, etc to be part of the script, rather than separately executable steps. --init-steps doesn't do anything for that. 3) pgbench overhead, although that's to a significant degree libpq's fault 4) Ability to cancel pgbench and get approximate results. That currently works if the server kicks out the clients, but not when interrupting pgbench - which is just plain weird. Obviously that doesn't matter for "proper" benchmark runs, but often during development, it's enough to run pgbench past some events (say the next checkpoint). > - what is its impact to overall performance when actual queries >are performed (IMHO very small). Obviously not huge - I'd also not expect it to be unobservably small either. Especially if somebody went and fixed some of the inefficiency in libpq, but even without that. And even moreso, if somebody revived the libpq batch work + the relevant pgbench patch, because that removes a lot of the system/kernel overhead, due to the crazy number of context switches (obviously not realistic for all workloads, but e.g. for plenty java workloads, it is), but leaves the same number of variable accesses etc. Greetings, Andres Freund
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
David Fetter writes: > On Mon, Aug 05, 2019 at 03:52:07PM +0900, Ian Barwick wrote: >> On 8/4/19 1:59 AM, Tom Lane wrote: >>> I think we should just accept the facts on the ground, which are that >>> some tools modify pg.auto.conf by appending to it >> +1. It's just a text file... > So are crontab and /etc/passwd, but there are gizmos that help make it > difficult for people to write complete gobbledygook to those. Does it > make sense to discuss tooling of that type? Perhaps as a future improvement, but it's not happening for v12, at least not unless you accept "use ALTER SYSTEM in a standalone backend" as a usable answer. regards, tom lane
Re: Adding column "mem_usage" to view pg_prepared_statements
On 31.07.2019 1:38, Daniel Migowski wrote: On 31.07.2019 at 00:29, Tom Lane wrote: Daniel Migowski writes: Ok, just have read about the commitfest thing. Is the patch OK for that? Or is there generally no love for a mem_usage column here? If it was, I really would add some memory monitoring in our app regarding this. You should certainly put it into the next commitfest. We might or might not accept it, but if it's not listed in the CF we might not remember to even review it. (The CF app is really a to-do list for patches ...) OK, added it there. Thanks for your patience :). Regards, Daniel Migowski The patch does not apply to the most recent sources because an extra parameter was added to the CreateTemplateTupleDesc method. Please rebase it - the fix is trivial. I think that including information in pg_prepared_statements about the memory used by each statement is very useful. The CachedPlanMemoryUsage function may be useful not only for this view; for example, it is also needed in my autoprepare patch. I wonder if you would consider going further and not only reporting but also controlling the memory used by prepared statements? For example, implementing some LRU replacement discipline on top of the prepared statement cache which can evict rarely used prepared statements to avoid memory overflow. We have such a patch for PgPro-EE, but it limits only the number of prepared statements, not taking into account the amount of memory used by them. I think that a memory-based limit would be more accurate (although it adds more overhead). If you want, I can be a reviewer of your patch. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
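A toy model of the memory-bounded LRU being suggested here. The entry layout and the drop callback are placeholders; a real patch would hang this off the backend's prepared-statement hash table and feed the memory accounting from something like the CachedPlanMemoryUsage() figure discussed in this thread.

#include <stddef.h>

typedef struct CacheEntry
{
	struct CacheEntry *prev;
	struct CacheEntry *next;
	size_t		mem_usage;			/* estimated memory of this statement */
	void	   *stmt;				/* placeholder for the cached statement */
} CacheEntry;

typedef struct LruCache
{
	CacheEntry *head;				/* most recently used */
	CacheEntry *tail;				/* least recently used */
	size_t		total_mem;
	size_t		mem_limit;
	void		(*drop) (void *stmt);	/* callback that frees one statement */
} LruCache;

/* Detach an entry from the list; total_mem bookkeeping is the caller's job. */
static void
lru_unlink(LruCache *c, CacheEntry *e)
{
	if (e->prev)
		e->prev->next = e->next;
	else
		c->head = e->next;
	if (e->next)
		e->next->prev = e->prev;
	else
		c->tail = e->prev;
	e->prev = e->next = NULL;
}

/* Mark an entry as just used: move it to the head of the list. */
static void
lru_touch(LruCache *c, CacheEntry *e)
{
	lru_unlink(c, e);
	e->next = c->head;
	if (c->head)
		c->head->prev = e;
	c->head = e;
	if (c->tail == NULL)
		c->tail = e;
}

/* Evict least-recently-used statements until we are back under the limit. */
static void
lru_enforce_limit(LruCache *c)
{
	while (c->total_mem > c->mem_limit && c->tail != NULL)
	{
		CacheEntry *victim = c->tail;

		lru_unlink(c, victim);
		c->total_mem -= victim->mem_usage;
		c->drop(victim->stmt);
	}
}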
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On Mon, Aug 5, 2019 at 12:25 PM Tom Lane wrote: > Perhaps as a future improvement, but it's not happening for v12, > at least not unless you accept "use ALTER SYSTEM in a standalone > backend" as a usable answer. I'm not sure that would even work for the cases at issue ... because we're talking about setting up recovery parameters, and wouldn't the server run recovery before it got around to do anything with ALTER SYSTEM? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Remove HeapTuple and Buffer dependency for predicate locking functions
Hi, On 2019-08-05 20:58:05 +1200, Thomas Munro wrote: > On Sat, Aug 3, 2019 at 11:56 AM Ashwin Agrawal wrote: > > On Wed, Jul 31, 2019 at 2:06 PM Andres Freund wrote: > >> I'm also a bit confused why we don't need to pass in the offset of the > >> current tuple, rather than the HOT root tuple here. That's not related > >> to this patch. But aren't we locking the wrong tuple here, in case of > >> HOT? > > > > Yes, root is being locked here instead of the HOT. But I don't have full > > context on the same. If we wish to fix it though, can be easily done now > > with the patch by passing "tid" instead of &(heapTuple->t_self). > > Here are three relevant commits: Thanks for digging! > 1. Commit dafaa3efb75 "Implement genuine serializable isolation > level." (2011) locked the root tuple, because it set t_self to *tid. > Possibly due to confusion about the effect of the preceding line > ItemPointerSetOffsetNumber(tid, offnum). > > 2. Commit commit 81fbbfe3352 "Fix bugs in SSI tuple locking." (2013) > fixed that by adding ItemPointerSetOffsetNumber(&heapTuple->t_self, > offnum). Hm. It's not at all sure that it's ok to report the non-root tuple tid here. I.e. I'm fairly sure there was a reason to not just set it to the actual tid. I think I might have written that up on the list at some point. Let me dig in memory and list. Obviously possible that that was also obsoleted by parallel changes. > 3. Commit b89e151054a "Introduce logical decoding." (2014) also did > ItemPointerSet(&(heapTuple->t_self), BufferGetBlockNumber(buffer), > offnum), for the benefit of historical MVCC snapshots (unnecessarily, > considering the change in the commit #2), but then, intending to > "reset to original, non-redirected, tid", clobbered it, reintroducing > the bug fixed by #2. > My first guess is that commit #3 above was developed before commit #2, > and finished up clobbering it. Yea, that sounds likely. > This must be in want of an isolation test, but I haven't yet tried to > get my head around how to write a test that would show the difference. Indeed. Greetings, Andres Freund
Re: POC: Cleaning up orphaned files using undo logs
Hi, On 2019-08-05 11:25:10 -0400, Robert Haas wrote: > The obvious thing to do seems to be to have UndoLogControl objects own > SmgrRelations. That would be something of a novelty, since it looks > like currently only a Relation ever owns an SMgrRelation, but the smgr > infrastructure seems to have been set up in a generic way so as to > permit that sort of thing, so it seems like it should be workable. Yea, I think that'd be a good step. I'm not 100% convinced it's quite enough, due to the way the undo smgr only ever has a single file descriptor open, and that undo log segments are fairly small, and that there'll often be multiple persistence levels active at the same time. But the undo fd handling is probably a separate concern than from who owns the smgr relations. > I think this kind of design would address your concerns about using > the unowned list, too, since the UndoLogControl objects would be > owning the SMgrRelations. Yup. > How does all that sound? A good move in the right direction, imo. Greetings, Andres Freund
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Robert Haas writes: > On Mon, Aug 5, 2019 at 12:25 PM Tom Lane wrote: >> Perhaps as a future improvement, but it's not happening for v12, >> at least not unless you accept "use ALTER SYSTEM in a standalone >> backend" as a usable answer. > I'm not sure that would even work for the cases at issue ... because > we're talking about setting up recovery parameters, and wouldn't the > server run recovery before it got around to do anything with ALTER > SYSTEM? Yeah, good point. There are a lot of other cases where you really don't want system startup to happen, too. On the other hand, people have also opined that they want full error checking on the inserted values, and that seems out of reach with less than a full running system (mumble extensions mumble). In the end, I think I don't buy Stephen's argument that there should be a one-size-fits-all answer. It seems entirely reasonable that we'll have more than one way to do it, because the constraints are different depending on what use-case you think about. regards, tom lane
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Hi, On 2019-08-05 12:43:24 -0400, Tom Lane wrote: > Yeah, good point. There are a lot of other cases where you really > don't want system startup to happen, too. Agreed. > On the other hand, people have also opined that they want full error > checking on the inserted values, and that seems out of reach with less > than a full running system (mumble extensions mumble). I think the error checking ought to be about as complete as the one we do during a normal postmaster startup. Afaict that requires loading shared_preload_library extensions, but does *not* require booting up far enough to run GUC checks in a context with database access. The one possible "extension" to that that I can see is that arguably we might want to error out if DefineCustom*Variable() doesn't think the value is valid for a shared_preload_library, rather than just WARNING (i.e. refuse to start). We can't really do that for other libraries, but for shared_preload_libraries it might make sense. Although I suspect the better approach would be to just generally convert that to an error, rather than having some startup specific logic. Greetings, Andres Freund
Re: [HACKERS] Cached plans and statement generalization
On 01.08.2019 19:56, Konstantin Knizhnik wrote: On 31.07.2019 19:56, Heikki Linnakangas wrote: On 09/07/2019 23:59, Konstantin Knizhnik wrote: Fixed patch version of the path is attached. Much of the patch and the discussion has been around the raw parsing, and guessing which literals are actually parameters that have been inlined into the SQL text. Do we really need to do that, though? The documentation mentions pgbouncer and other connection poolers, where you don't have session state, as a use case for this. But such connection poolers could and should still be using the extended query protocol, with Parse+Bind messages and query parameters, even if they don't use named prepared statements. I'd want to encourage applications and middleware to use out-of-band query parameters, to avoid SQL injection attacks, regardless of whether they use prepared statements or cache query plans. So how about dropping all the raw parse tree stuff, and doing the automatic caching of plans based on the SQL string, somewhere in the exec_parse_message? Check the autoprepare cache in exec_parse_message(), if it was an "unnamed" prepared statement, i.e. if the prepared statement name is an empty string. (I'm actually not sure if exec_parse/bind_message is the right place for this, but I saw that your current patch did it in exec_simple_query, and exec_parse/bind_message are the equivalent of that for the extended query protocol). - Heikki I decided to implement your proposal. Much simple version of autoprepare patch is attached. At my computer I got the following results: pgbench -M simple -S 22495 TPS pgbench -M extended -S 47633 TPS pgbench -M prepared -S 47683 TPS So autoprepare speedup execution of queries sent using extended protocol more than twice and it is almost the same as with explicitly prepared statements. I failed to create regression test for it because I do not know how to force psql to use extended protocol. Any advice is welcome. Slightly improved and rebased version of the patch. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/doc/src/sgml/autoprepare.sgml b/doc/src/sgml/autoprepare.sgml new file mode 100644 index 000..c4d96e6 --- /dev/null +++ b/doc/src/sgml/autoprepare.sgml @@ -0,0 +1,62 @@ + + + + Autoprepared statements + + + autoprepared statements + + + +PostgreSQL makes it possible prepare +frequently used statements to eliminate cost of their compilation +and optimization on each execution of the query. On simple queries +(like ones in pgbench -S) using prepared statements +increase performance more than two times. + + + +Unfortunately not all database applications are using prepared statements +and, moreover, it is not always possible. For example, in case of using +pgbouncer or any other session pooler, +there is no session state (transactions of one client may be executed at different +backends) and so prepared statements can not be used. + + + +Autoprepare mode allows to overcome this limitation. +In this mode Postgres stores plans for all queries pass using extended protocol. +Speed of execution of autoprepared statements is almost the same as of explicitly prepared statements. + + + +By default autoprepare mode is switched off. To enable it, assign non-zero +value to GUC variable autoprepare_limit or autoprepare_memory_limit. +This variables specify limit for number of autoprepared statement or memory used by them. +Autoprepare is enabled if one of thme is non zero. Value -1 means unlimited. 
+Please notice that event autoprepare is anabled, Postgres makes a decision about using +generalized plan vs. customized execution plans based on the results +of comparison of average time of five customized plans with +time of generalized plan. + + + +Too large number of autoprepared statements can cause memory overflow +(especially if there are many active clients, because prepared statements cache +is local to the backend). So using unlimited autoprepare hash is debgerous and not recommended. +Postgres is using LRU policy to keep in memory most frequently used queries. + + + +Autoprepare hash is local to the backend. It is implicitely reseted on any change of database schema or +session variables. + + + +It is possible to inspect autoprepared queries in the backend using +pg_autoprepared_statements view. It shows original text of the +query, types of the extracted parameters (replacing literals), +query execution counter and used memory. + + + diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index cdc30fa..28c9343 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -5326,6 +5326,39 @@ SELECT * FROM parent WHERE key = 2400; + + autoprepare_limit (int
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On Mon, Aug 5, 2019 at 10:21 AM Stephen Frost wrote: > Just to be clear, I brought up this exact concern back in *November*: > > https://www.postgresql.org/message-id/20181127153405.GX3415%40tamriel.snowman.net > > And now because this was pushed forward and the concerns that I raised > ignored, we're having to deal with this towards the end of the release > cycle instead of during normal development. I disagree. My analysis is that you're blocking a straightforward bug fix because we're not prepared to redesign the world to match your expectations. The actual point of controversy at the moment, as I understand it, is this: if the backend, while rewriting postgresql.auto.conf, discovers that it contains duplicates, should we (a) give a WARNING or (b) not? The argument for not doing that is pretty simple: if we give a WARNING when this happens, then every tool that appends to postgresql.auto.conf has to be responsible for making sure to remove duplicates along the way. To do that reliably, it needs a client-accessible version of all the GUC parsing stuff. You refer to this above as an "assumption," but it seems to me that a more accurate word would be "fact." Now, I don't think anyone would disagree with the idea that it possible to do it in an only-approximately-correct way pretty easily: just match the first word of the line against the GUC you're proposing to set, and drop the line if it matches. If you want it to be exactly correct, though, you either need to run the original code, or your own custom code that behaves in exactly the same way. And since the original code runs only in the server, it follows directly that if you are not running inside the server, you cannot be running the original code. How you can label any of that as an "assumption" is beyond me. Now, I agree that IF we were prepared to provide a standalone config-editing tool that removes duplicates, THEN it would not be crazy to emit a WARNING if we find a duplicate, because we could reasonably tell people to just use that tool. However, such a tool is not trivial to construct, as evidenced by the fact that, on this very thread, Ian tried and Andres thought the result contained too much code duplication. Moreover, we are past feature freeze, which is the wrong time to add altogether new things to the release, even if we had code that everybody liked. Furthermore, even if we had such a tool and even if it had already been committed, I would still not be in favor of the WARNING, because duplicate settings in postgresql.auto.conf are harmless unless you include a truly insane number of them, and there is no reason for anybody to ever do that. In a way, I sorta hope somebody does do that, because if I get a problem report from a user that they put 10 million copies of their recovery settings in postgresql.auto.conf and the server now starts up very slowly, I am going to have a good private laugh, and then suggest that they maybe not do that. In general, I am sympathetic to the argument that we ought to do tight integrity-checking on inputs: that's one of the things for which PostgreSQL is known, and it's a strength of the project. In this case, though, the cost-benefit trade-off seems very poor to me: it just makes life complicated without really buying us anything. 
The whole reason postgresql.conf is a text file in the first place instead of being stored in the catalogs is because you might not be able to start the server if it's not set right, and if you can't edit it without being able to start the server, then you're stuck. Indeed, one of the key arguments in getting ALTER SYSTEM accepted in the first place was that, if you put dumb settings into postgresql.auto.conf and borked your system so it wouldn't start, you could always use a text editor to undo it. Given that history, any argument that postgresql.auto.conf is somehow different and should be subjected to tighter integrity constraints does not resonate with me. Its mission is to be a machine-editable postgresql.conf, not to be some other kind of file that plays by a different set of rules. I really don't understand why you're fighting so hard about this. We have a long history of being skeptical about WARNING messages. If, on the one hand, they are super-important, they might still get ignored because it could be an automated context where nobody will see it; and if, on the other hand, they are not that important, then emitting them is just clutter in the first place. The particular WARNING under discussion here is one that would likely only fire long after the fact, when it's far too late to do anything about it, and when, in all probability, no real harm has resulted anyway. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Adding column "mem_usage" to view pg_prepared_statements
Hi, On 2019-07-28 06:20:40 +, Daniel Migowski wrote: > how do you want to generalize it? Are you thinking about a view solely > for the display of the memory usage of different objects? I'm not quite sure. I'm just not sure that adding separate infrastructure for various objects is a sutainable approach. We'd likely want to have this for prepared statements, for cursors, for the current statement, for various caches, ... I think an approach would be to add an 'owning_object' field to memory contexts, which has to point to a Node type if set. A table returning reporting function could recursively walk through the memory contexts, starting at TopMemoryContext. Whenever it encounters a context with owning_object set, it prints a string version of nodeTag(owning_object). For some node types it knows about (e.g. PreparedStatement, Portal, perhaps some of the caches), it prints additional metadata specific to the type (so for prepared statement it'd be something like 'prepared statement', '[name of prepared statement]'), and prints information about the associated context and all its children. The general context information probably should be something like: context_name, context_ident, context_total_bytes, context_total_blocks, context_total_freespace, context_total_freechunks, context_total_used, context_total_children context_self_bytes, context_self_blocks, context_self_freespace, context_self_freechunks, context_self_used, context_self_children, It might make sense to have said function return a row for the contexts it encounters that do not have an owner set too (that way we'd e.g. get CacheMemoryContext handled), but then still recurse. Arguably the proposed owning_object field would be a bit redundant with the already existing ident/MemoryContextSetIdentifier field, which e.g. already associates the query string with the contexts used for a prepared statement. But I'm not convinced that's going to be enough context in a lot of cases, because e.g. for prepared statements it could be interesting to have access to both the prepared statement name, and the statement. The reason I like something like this is that we wouldn't add new columns to a number of views, and lack views to associate such information to for some objects. And it'd be disproportional to add all the information to numerous places anyway. One counter-argument is that it'd be more expensive to get information specific to prepared statements (or other object types) that way. I'm not sure I buy that that's a problem - this isn't something that's likely going to be used at a high frequency. But if it becomes a problem, we can add a function that starts that process at a distinct memory context (e.g. a function that does this just for a single prepared statement, identified by name) - but I'd not start there. > While being interesting I still believe monitoring the mem usage of > prepared statements is a bit more important than that of other objects > because of how they change memory consumption of the server without > using any DDL or configuration options and I am not aware of other > objects with the same properties, or are there some? And for the other > volatile objects like tables and indexes and their contents PostgreSQL > already has it's information functions. Plenty other objects have that property. E.g. cursors. And for the catalog/relation/... caches it's even more pernicious - the client might have closed all its "handles", but we still use memory (and it's absolutely crucial for performance). 
Greetings, Andres Freund
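A sketch of the recursive walk described above, against the existing MemoryContext tree. The owning_object field does not exist today (it is the proposal), and report_row()/context_self_bytes() are stand-ins for the tuplestore output and per-context accounting a real reporting function would need.

#include "postgres.h"

#include "nodes/nodes.h"
#include "utils/memutils.h"

/* Stand-ins for the output and accounting pieces (not real functions). */
static void report_row(int depth, NodeTag owner_tag, const char *name,
					   const char *ident, Size self_bytes);
static Size context_self_bytes(MemoryContext context);

/*
 * Walk the context tree depth-first.  Contexts whose (proposed)
 * owning_object field is set get a row tagged with the owner's node type;
 * a real implementation might also emit rows for unowned contexts such as
 * CacheMemoryContext, as suggested above, and recurse into children either
 * way.
 */
static void
walk_contexts(MemoryContext context, int depth)
{
	MemoryContext child;

	if (context->owning_object != NULL)		/* proposed new field */
		report_row(depth,
				   nodeTag(context->owning_object),
				   context->name,
				   context->ident,
				   context_self_bytes(context));

	for (child = context->firstchild; child != NULL; child = child->nextchild)
		walk_contexts(child, depth + 1);
}

/* The reporting SRF would start at the top: walk_contexts(TopMemoryContext, 0); */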
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings, * Tom Lane (t...@sss.pgh.pa.us) wrote: > Robert Haas writes: > > On Mon, Aug 5, 2019 at 12:25 PM Tom Lane wrote: > >> Perhaps as a future improvement, but it's not happening for v12, > >> at least not unless you accept "use ALTER SYSTEM in a standalone > >> backend" as a usable answer. > > > I'm not sure that would even work for the cases at issue ... because > > we're talking about setting up recovery parameters, and wouldn't the > > server run recovery before it got around to do anything with ALTER > > SYSTEM? > > Yeah, good point. There are a lot of other cases where you really > don't want system startup to happen, too. On the other hand, > people have also opined that they want full error checking on > the inserted values, and that seems out of reach with less than > a full running system (mumble extensions mumble). There have been ideas brought up about some way to provide "full validation" but I, at least, don't recall seeing anyone actually say that they *want* that- just different people suggesting that it could be done. I agree that full validation is a pipe dream for this kind of system and isn't what I was intending to suggest at any point. > In the end, I think I don't buy Stephen's argument that there should > be a one-size-fits-all answer. It seems entirely reasonable that > we'll have more than one way to do it, because the constraints are > different depending on what use-case you think about. This doesn't seem to me, at least, to be an accurate representation of my thoughts on this- there could be 15 different ways to modify the file, but let's have a common set of code to provide those ways instead of different code between the backend ALTER SYSTEM and the frontend pg_basebackup (and if we put it in the common library that we already have for sharing code between the backend and the frontend, and which we make available for external tools, then those external tools could use those methods in the same way that we do). I'm happy to be told I'm wrong, but as far as I know, there's nothing in appending to the file or removing duplicates that actually requires validation of the values which are included in order to apply those operations correctly. I'm sure I'll be told again about how we can't do this for 12, and I do appreciate that, but it's because we ignored these issues during development that we're here and that's really just not something that's acceptable in my view- we shouldn't be pushing features which have known issues that we then have to fight about how to fix it at the last minute and with the constraint that we can't make any big changes. Thanks, Stephen signature.asc Description: PGP signature
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Mon, Aug 5, 2019 at 10:21 AM Stephen Frost wrote: > > Just to be clear, I brought up this exact concern back in *November*: > > > > https://www.postgresql.org/message-id/20181127153405.GX3415%40tamriel.snowman.net > > > > And now because this was pushed forward and the concerns that I raised > > ignored, we're having to deal with this towards the end of the release > > cycle instead of during normal development. > > I disagree. It's unclear what you're disagreeing with here as the below response doesn't seem to discuss the question about if these issues were brought up and pointed out previously, nor about if I was the one who raised them, nor about if we're towards the end of the release cycle. > My analysis is that you're blocking a straightforward bug > fix because we're not prepared to redesign the world to match your > expectations. The actual point of controversy at the moment, as I > understand it, is this: if the backend, while rewriting > postgresql.auto.conf, discovers that it contains duplicates, should we > (a) give a WARNING or (b) not? No, that isn't the point of the controversy nor does it relate, at all, to what you quoted above, so I don't think there's much value in responding to the discussion about WARNING or not that you put together below. Thanks, Stephen signature.asc Description: PGP signature
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
Kyotaro Horiguchi-4 wrote > At Sun, 4 Aug 2019 00:04:01 -0700 (MST), legrand legrand < > legrand_legrand@ > > wrote in < > 1564902241482-0.post@.nabble >> >> > However having the nested queryid in >> > pg_stat_activity would be convenient to track >> > what is a long stored functions currently doing. >> >> +1 >> >> And this could permit to get wait event sampling per queryid when >> pg_stat_statements.track = all > > I'm strongly on this side emotionally, but also I'm on Tom and > Andres's side that exposing querid that way is not the right > thing. > > Doing that means we don't need exact correspondence between > top-level query and queryId (in nested or multistatement queries) > in this patch. pg_stat_statements will allow us to do the same > thing by having additional uint64[MaxBackends] array in > pgssSharedState, instead of expanding PgBackendStatus array in > core by the same size. > > regards. > > -- > Kyotaro Horiguchi > NTT Open Source Software Center Hi Kyotaro, Thank you for this answer. What you propose here is already available inside the pg_stat_sql_plans extension (a derivative of pg_stat_statements and pg_store_plans), and I'm used to this queryid behavior with top-level queries... My emotion was high, but I will accept it! Regards PAscal -- Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On 2019-08-05 12:43:24 -0400, Tom Lane wrote: > > On the other hand, people have also opined that they want full error > > checking on the inserted values, and that seems out of reach with less > > than a full running system (mumble extensions mumble). > > I think the error checking ought to be about as complete as the one we > do during a normal postmaster startup. Afaict that requires loading > shared_preload_library extensions, but does *not* require booting up far > enough to run GUC checks in a context with database access. I'm not following this thread of the discussion. You're not suggesting that pg_basebackup perform this error checking after it modifies the file, are you..? Where are you thinking this error checking would be happening? Thanks, Stephen signature.asc Description: PGP signature
Re: Unused header file inclusion
On 2019-Aug-04, vignesh C wrote: > Made the fixes based on your comments, updated patch has the changes > for the same. Well, you fixed the two things that were quoted as examples of problems, but you didn't fix other occurrences of the same issues elsewhere. For example, you remove lwlock.h from dsa.c but there are structs there that have LWLocks as members. That's just the first hunk of the patch; I didn't look at the others, but it wouldn't surprise me if they have similar issues. I suggest this patch should be rejected. Then there's the <limits.h> removal in tuplesort.c, where INT_MAX, as added by commit d26559dbf356, is still present ... FWIW sharedtuplestore.c, a very young file, also includes <limits.h> but that appears to be copy-paste of includes from some other file (and also in an inappropriate place), so I have no objections to obliterating that one. But other than that one line, this patch needs more "adult supervision". -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Hi, On 2019-08-05 13:34:39 -0400, Stephen Frost wrote: > * Andres Freund (and...@anarazel.de) wrote: > > On 2019-08-05 12:43:24 -0400, Tom Lane wrote: > > > On the other hand, people have also opined that they want full error > > > checking on the inserted values, and that seems out of reach with less > > > than a full running system (mumble extensions mumble). > > > > I think the error checking ought to be about as complete as the one we > > do during a normal postmaster startup. Afaict that requires loading > > shared_preload_library extensions, but does *not* require booting up far > > enough to run GUC checks in a context with database access. > > I'm not following this thread of the discussion. It's about the future, not v12. > Where are you thinking this error checking would be happening? A hypothethical post v12 tool that can set config values with as much checking as feasible. The IMO most realistic tool to do so is postmaster itself, similar to it's already existing -C. Boot it up until shared_preload_libraries have been processed, run check hook(s) for the new value(s), change postgresql.auto.conf, shutdown. > You're not suggesting that pg_basebackup perform this error checking > after it modifies the file, are you..? Not at the moment, at least. Greetings, Andres Freund
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On Mon, Aug 5, 2019 at 1:29 PM Stephen Frost wrote: > No, that isn't the point of the controversy nor does it relate, at all, > to what you quoted above, so I don't think there's much value in > responding to the discussion about WARNING or not that you put together > below. Well, if that's not what we're arguing about, then what the heck are we arguing about? All we need to do to resolve this issue is have Tom commit his patch. The set of people who are objecting to that is either {} or {Stephen Frost}. Even if it's the latter, we should just proceed, because there are clearly enough votes in favor of the patch to proceed, including 2 from RMT members, and if it's the former, we should DEFINITELY proceed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings, On Mon, Aug 5, 2019 at 14:11 Andres Freund wrote: > On 2019-08-05 13:34:39 -0400, Stephen Frost wrote: > > * Andres Freund (and...@anarazel.de) wrote: > > > On 2019-08-05 12:43:24 -0400, Tom Lane wrote: > > > > On the other hand, people have also opined that they want full error > > > > checking on the inserted values, and that seems out of reach with > less > > > > than a full running system (mumble extensions mumble). > > > > > > I think the error checking ought to be about as complete as the one we > > > do during a normal postmaster startup. Afaict that requires loading > > > shared_preload_library extensions, but does *not* require booting up > far > > > enough to run GUC checks in a context with database access. > > > > I'm not following this thread of the discussion. > > It's about the future, not v12. I’m happy to chat about post-v12, certainly. As I said up thread, I get that we are in this unfortunate situation for v12 and we can do what needs doing here (where I agree with what Tom said, “a doc patch”- and with fixes for ALTER SYSTEM to be in line with that doc patch, along with pg_basebackup, should any changes be needed, of course). > Where are you thinking this error checking would be happening? > > A hypothethical post v12 tool that can set config values with as much > checking as feasible. The IMO most realistic tool to do so is postmaster > itself, similar to it's already existing -C. Boot it up until > shared_preload_libraries have been processed, run check hook(s) for the > new value(s), change postgresql.auto.conf, shutdown. That’s a nice idea but I don’t think it’s really necessary and I’m not sure how useful this level of error checking would end up being as part of pg_basebackup. I can certainly see value in a tool that could be run to verify a postgresql.conf+auto.conf is valid to the extent that we are able to do so, since that could, ideally, be run by the init script system prior to a restart to let the user know that their restart is likely to fail. Having that be some option to the postmaster could work, as long as it is assured to not do anything that would upset a running PG instance (like, say, try to allocate shared buffers). > You're not suggesting that pg_basebackup perform this error checking > > after it modifies the file, are you..? > > Not at the moment, at least. Since pg_basebackup runs, typically, on a server other than the one that PG is running on, it certainly would have to have a way to at least disable anything that caused it to try and load libraries on the destination side, or do anything else that required something external in order to validate- but frankly I don’t think it should ever be loading libraries that it has no business with, not even if it means that the error checking of the postgresql.conf would be wonderful. Thanks, Stephen >
Re: Unused header file inclusion
Alvaro Herrera writes: > Then there's the removal, which is in tuplesort.c because of > INT_MAX as added by commit d26559dbf356 and still present ... One has to be especially wary of removing system-header inclusions; the fact that they don't seem to be needed on your own machine doesn't prove they aren't needed elsewhere. > ... I suggest this patch should be rejected. Yeah. If we do anything along this line it should be based on pgrminclude results, and even then I think it'd require manual review, especially for changes in header files. The big picture here is that removing #includes is seldom worth the effort it takes :-( regards, tom lane
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Robert Haas writes: > All we need to do to resolve this issue is have Tom commit his patch. I think Stephen is not being unreasonable to suggest that we need some documentation about what external tools may safely do to pg.auto.conf. So somebody's got to write that. But I agree that we could go ahead with the code patch. (At this point I won't risk doing so before the wrap, though.) regards, tom lane
Re: POC: Cleaning up orphaned files using undo logs
Hi, (as I was out of context due to dealing with bugs, I've switched to lookign at the current zheap/undoprocessing branch. On 2019-07-30 01:02:20 -0700, Andres Freund wrote: > +/* > + * Insert a previously-prepared undo records. > + * > + * This function will write the actual undo record into the buffers which are > + * already pinned and locked in PreparedUndoInsert, and mark them dirty. > This > + * step should be performed inside a critical section. > + */ Again, I think it's not ok to just assume you can lock an essentially unbounded number of buffers. This seems almost guaranteed to result in deadlocks. And there's limits on how many lwlocks one can hold etc. As far as I can tell there's simply no deadlock avoidance scheme in use here *at all*? I must be missing something. > + /* Main loop for writing the undo record. */ > + do > + { I'd prefer this to not be a do{} while(true) loop - as written I need to read to the end to see what the condition is. I don't think we have any loops like that in the code. > + /* > + * During recovery, there might be some blocks which > are already > + * deleted due to some discard command so we can just > skip > + * inserting into those blocks. > + */ > + if (!BufferIsValid(buffer)) > + { > + Assert(InRecovery); > + > + /* > + * Skip actual writing just update the context > so that we have > + * write offset for inserting into next blocks. > + */ > + SkipInsertingUndoData(&ucontext, BLCKSZ - > starting_byte); > + if (ucontext.stage == UNDO_PACK_STAGE_DONE) > + break; > + } How exactly can this happen? > + else > + { > + page = BufferGetPage(buffer); > + > + /* > + * Initialize the page whenever we try to write > the first > + * record in page. We start writing > immediately after the > + * block header. > + */ > + if (starting_byte == UndoLogBlockHeaderSize) > + UndoPageInit(page, BLCKSZ, > prepared_undo->urec->uur_info, > + > ucontext.already_processed, > + > prepared_undo->urec->uur_tuple.len, > + > prepared_undo->urec->uur_payload.len); > + > + /* > + * Try to insert the record into the current > page. If it > + * doesn't succeed then recall the routine with > the next page. > + */ > + InsertUndoData(&ucontext, page, starting_byte); > + if (ucontext.stage == UNDO_PACK_STAGE_DONE) > + { > + MarkBufferDirty(buffer); > + break; At this point we're five indentation levels deep. I'd extract at least either the the per prepared undo code or the code performing the writing across block boundaries into a separate function. Perhaps both. > +/* > + * Helper function for UndoGetOneRecord > + * > + * If any of rmid/reloid/xid/cid is not available in the undo record, then > + * it will get the information from the first complete undo record in the > + * page. > + */ > +static void > +GetCommonUndoRecInfo(UndoPackContext *ucontext, UndoRecPtr urp, > + RelFileNode rnode, UndoLogCategory > category, Buffer buffer) > +{ > + /* > + * If any of the common header field is not available in the current > undo > + * record then we must read it from the first complete record of the > page. > + */ How is it guaranteed that the first record on the page is actually from the current transaction? Can't there be a situation where that's from another transaction? > +/* > + * Helper function for UndoFetchRecord and UndoBulkFetchRecord > + * > + * curbuf - If an input buffer is valid then this function will not release > the > + * pin on that buffer. 
If the buffer is not valid then it will assign curbuf > + * with the first buffer of the current undo record and also it will keep the > + * pin and lock on that buffer in a hope that while traversing the undo chain > + * the caller might want to read the previous undo record from the same > block. > + */ Wait,
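On the deadlock-avoidance question raised in the review above: the convention used elsewhere in PostgreSQL when several buffers must be content-locked at once is to take the locks in a fixed global order, typically increasing block number. A minimal sketch of that idea follows; the function and array names are hypothetical and this is not code from the patch, just the standard technique being referred to.

#include "postgres.h"
#include "storage/bufmgr.h"

/*
 * Sketch only: lock a set of already-pinned buffers of one relation fork
 * in increasing block-number order, so that two backends locking
 * overlapping sets cannot deadlock against each other.
 */
static int
buffer_blockno_cmp(const void *a, const void *b)
{
    BlockNumber ba = BufferGetBlockNumber(*(const Buffer *) a);
    BlockNumber bb = BufferGetBlockNumber(*(const Buffer *) b);

    if (ba < bb)
        return -1;
    if (ba > bb)
        return 1;
    return 0;
}

static void
lock_buffers_in_order(Buffer *buffers, int nbuffers)
{
    qsort(buffers, nbuffers, sizeof(Buffer), buffer_blockno_cmp);

    for (int i = 0; i < nbuffers; i++)
        LockBuffer(buffers[i], BUFFER_LOCK_EXCLUSIVE);
}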
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings, On Mon, Aug 5, 2019 at 14:29 Tom Lane wrote: > Robert Haas writes: > > All we need to do to resolve this issue is have Tom commit his patch. > > I think Stephen is not being unreasonable to suggest that we need some > documentation about what external tools may safely do to pg.auto.conf. I dare say that if we had some documentation around what to expect and how to deal with it, for our own tools as well as external ones, then maybe we wouldn’t be in this situation in the first place. Clearly ALTER SYSTEM and the pg_basebackup modifications had different understandings and expectations. > So somebody's got to write that. But I agree that we could go ahead > with the code patch. I haven’t looked at the code patch at all, just to be clear. That said, if you’re comfortable with it and it’s in line with what we document as being how you handle pg.auto.conf (for ourselves as well as external tools..), then that’s fine with me. Thanks, Stephen >
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On Mon, Aug 5, 2019 at 2:29 PM Tom Lane wrote: > Robert Haas writes: > > All we need to do to resolve this issue is have Tom commit his patch. > > I think Stephen is not being unreasonable to suggest that we need some > documentation about what external tools may safely do to pg.auto.conf. > So somebody's got to write that. I mean, really? We're going to document that if you want to add a setting to the file, you can just append it, but that if you find yourself desirous of appending so many settings that the entire disk will fill up, you should maybe reconsider? Perhaps I'm being mean here, but that seems like it's straight out of the blinding-flashes-of-the-obvious department. If we were going to adopt Stephen's proposed rule that you must remove duplicates or be punished later with an annoying WARNING, I would agree that it ought to be documented, because I believe many people would find that a POLA violation. And to be clear, I'm not objecting to a sentence someplace that says that duplicates in postgresql.auto.conf will be ignored and the last value will be used, same as for any other PostgreSQL configuration file. However, I think it's highly likely people would have assumed that anyway. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Robert Haas writes: > On Mon, Aug 5, 2019 at 2:29 PM Tom Lane wrote: >> I think Stephen is not being unreasonable to suggest that we need some >> documentation about what external tools may safely do to pg.auto.conf. >> So somebody's got to write that. > I mean, really? We're going to document that if you want to add a > setting to the file, you can just append it, but that if you find > yourself desirous of appending so many settings that the entire disk > will fill up, you should maybe reconsider? Perhaps I'm being mean > here, but that seems like it's straight out of the > blinding-flashes-of-the-obvious department. I don't think we need to go on about it at great length, but it seems to me that it'd be reasonable to point out that (a) you'd be well advised not to touch the file while the postmaster is up, and (b) last setting wins. Those things are equally true of postgresql.conf of course, but I don't recall whether they're already documented. regards, tom lane
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On Mon, Aug 5, 2019 at 2:42 PM Tom Lane wrote: > I don't think we need to go on about it at great length, but it seems > to me that it'd be reasonable to point out that (a) you'd be well > advised not to touch the file while the postmaster is up, and (b) > last setting wins. Those things are equally true of postgresql.conf > of course, but I don't recall whether they're already documented. OK, fair enough. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On Mon, Aug 5, 2019 at 2:35 PM Stephen Frost wrote: > I dare say that if we had some documentation around what to expect and how to > deal with it, for our own tools as well as external ones, then maybe we > wouldn’t be in this situation in the first place. Clearly ALTER SYSTEM and > the pg_basebackup modifications had different understands and expectations. But that's not the problem. The problem is that ALTER SYSTEM modifies the first match instead of the last one, when it's a well-established principle that when reading from a PostgreSQL configuration file, the system adopts the value from the last match, not the first one. I admit that if somebody had thought to document what ALTER SYSTEM was doing, that person would probably have also realized that they had made a mistake in the code, and then they would have fixed the bug, and that would be great. But we have exactly zero need to invent a new set of principles explaining how to deal with postgresql.auto.conf. We just need to make the ALTER SYSTEM code conform to the same general rule that has been well-established for many years. The only reason why we're still carrying on about this 95 messages later is because you're trying to make an argument that postgresql.auto.conf is a different kind of thing from postgresql.conf and therefore can have its own novel set of rules which consequently need to be debated. IMHO, it's not, it shouldn't, and they don't. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: The unused_oids script should have a reminder to use the 8000-8999 OID range
On Fri, Aug 2, 2019 at 1:28 PM Julien Rouhaud wrote: > I'm fine with it! Pushed a version with similar wording just now. Thanks! -- Peter Geoghegan
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings, On Mon, Aug 5, 2019 at 14:38 Robert Haas wrote: > On Mon, Aug 5, 2019 at 2:29 PM Tom Lane wrote: > > Robert Haas writes: > > > All we need to do to resolve this issue is have Tom commit his patch. > > > > I think Stephen is not being unreasonable to suggest that we need some > > documentation about what external tools may safely do to pg.auto.conf. > > So somebody's got to write that. > > I mean, really? We're going to document that if you want to add a > setting to the file, you can just append it, but that if you find > yourself desirous of appending so many settings that the entire disk > will fill up, you should maybe reconsider? Perhaps I'm being mean > here, but that seems like it's straight out of the > blinding-flashes-of-the-obvious department. > > If we were going to adopt Stephen's proposed rule that you must remove > duplicates or be punished later with an annoying WARNING, I would > agree that it ought to be documented, because I believe many people > would find that a POLA violation. And to be clear, I'm not objecting > to a sentence someplace that says that duplicates in > postgresql.auto.conf will be ignored and the last value will be used, > same as for any other PostgreSQL configuration file. However, I think > it's highly likely people would have assumed that anyway. The authors and committer for ALTER SYSTEM didn’t. It’s not uncommon for us to realize when different people and/or parts of the system make different assumption about something and they end up causing bugs, we try to document the “right way” and what expectations one can have. Also, to be clear, if we document it then I don’t feel we need a WARNING to be issued- because then it’s expected and handled. Yes, there was a lot of discussion about how it’d be nice to go further than documentation and actually provide a facility for tools to use to modify the file, so we could have the same code being used by pg_basebackup and ALTER SYSTEM, but the argument was made that we can’t make that happen for v12. I’m not sure I agree with that because a lot of the responses have been blowing up the idea of what amounts to a simple parser/editor of PG config files (which, as Christoph pointed out, has already been done externally and I doubt it’s actually all that’s much code) to a full blown we must validate everything and load every extension’s .so file to make sure the value is ok, but even so, I’ve backed away from that position and agreed that a documentation fix would be enough for v12 and hopefully someone will revisit it in the future and improve on it- at least with the documentation, there would be a higher chance that they’d get it right and not end up making different assumptions than those made by other hackers. Thanks, Stephen
Re: The unused_oids script should have a reminder to use the 8000-8999 OID range
On Mon, Aug 5, 2019 at 8:51 PM Peter Geoghegan wrote: > > Pushed a version with similar wording just now. Thanks!
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings, On Mon, Aug 5, 2019 at 14:43 Tom Lane wrote: > Robert Haas writes: > > On Mon, Aug 5, 2019 at 2:29 PM Tom Lane wrote: > >> I think Stephen is not being unreasonable to suggest that we need some > >> documentation about what external tools may safely do to pg.auto.conf. > >> So somebody's got to write that. > > > I mean, really? We're going to document that if you want to add a > > setting to the file, you can just append it, but that if you find > > yourself desirous of appending so many settings that the entire disk > > will fill up, you should maybe reconsider? Perhaps I'm being mean > > here, but that seems like it's straight out of the > > blinding-flashes-of-the-obvious department. > > I don't think we need to go on about it at great length, but it seems > to me that it'd be reasonable to point out that (a) you'd be well > advised not to touch the file while the postmaster is up, and (b) > last setting wins. Those things are equally true of postgresql.conf > of course, but I don't recall whether they're already documented. Folks certainly modify postgresql.conf while the postmaster is running pretty routinely, and we expect them to which is why we have a reload option, so I don’t think we can say that the auto.conf and postgresql.conf are to be handled in the same way. Last setting wins, duplicates should be ignored and may be removed, comments should be ignored and may be removed, and appending to the file is acceptable for modifying a value. I’m not sure how much we really document the structure of the file itself offhand- back when users were editing it we could probably be a bit more fast and loose with it, but now that we have different parts of the system modifying it along with external tools doing so, we should probably write it down a bit more clearly/precisely. I suspect the authors of pg_conftool would appreciate that too, so they could make sure that they aren’t doing anything unexpected or incorrect. Thanks, Stephen >
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Robert Haas writes: > But that's not the problem. The problem is that ALTER SYSTEM modifies > the first match instead of the last one, when it's a well-established > principle that when reading from a PostgreSQL configuration file, the > system adopts the value from the last match, not the first one. I > admit that if somebody had thought to document what ALTER SYSTEM was > doing, that person would probably have also realized that they had > made a mistake in the code, and then they would have fixed the bug, > and that would be great. Well, actually, the existing code is perfectly clear about this: /* Search the list for an existing match (we assume there's only one) */ That assumption is fine *if* you grant that only ALTER SYSTEM itself is authorized to write that file. I think the real argument here centers around who else is authorized to write the file, and when and how. In view of the point you made upthread that we explicitly made pg.auto.conf a plain text file so that one could recover from mistakes by hand-editing it, I think it's pretty silly to adopt a position that external mods are disallowed. regards, tom lane
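To make the first-match-versus-last-match point concrete: since later entries in a configuration file override earlier ones, the entry that actually takes effect is the last match, and that is the one ALTER SYSTEM should update. The following is a simplified sketch of that rule over a generic name/value list; it is not the actual ConfigVariable machinery from guc-file.l.

#include <string.h>

/* Simplified stand-in for one parsed configuration entry. */
typedef struct ConfigEntry
{
    const char *name;
    const char *value;
    struct ConfigEntry *next;
} ConfigEntry;

/*
 * Return the last entry matching "name", i.e. the one that takes effect,
 * because later entries override earlier ones.  Updating the first match
 * instead, as ALTER SYSTEM previously did, can leave a later duplicate
 * silently overriding the new value.
 */
static ConfigEntry *
find_effective_entry(ConfigEntry *head, const char *name)
{
    ConfigEntry *match = NULL;

    for (ConfigEntry *item = head; item != NULL; item = item->next)
    {
        if (strcmp(item->name, name) == 0)
            match = item;       /* keep scanning; a later entry wins */
    }
    return match;
}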
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Robert Haas writes: > On Mon, Aug 5, 2019 at 2:42 PM Tom Lane wrote: >> I don't think we need to go on about it at great length, but it seems >> to me that it'd be reasonable to point out that (a) you'd be well >> advised not to touch the file while the postmaster is up, and (b) >> last setting wins. Those things are equally true of postgresql.conf >> of course, but I don't recall whether they're already documented. > OK, fair enough. Concretely, how about the attached? (Digging around in config.sgml, I found that last-one-wins is stated, but only in the context of one include file overriding another. That's not *directly* a statement about what happens within a single file, and it's in a different subsection anyway, so repeating the info in 19.1.2 doesn't seem unreasonable.) regards, tom lane diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index cdc30fa..f5986b2 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -153,6 +153,8 @@ shared_buffers = 128MB identifiers or numbers must be single-quoted. To embed a single quote in a parameter value, write either two quotes (preferred) or backslash-quote. + If the file contains multiple entries for the same parameter, + all but the last one are ignored. @@ -185,18 +187,27 @@ shared_buffers = 128MB In addition to postgresql.conf, a PostgreSQL data directory contains a file postgresql.auto.confpostgresql.auto.conf, - which has the same format as postgresql.conf but should - never be edited manually. This file holds settings provided through - the command. This file is automatically - read whenever postgresql.conf is, and its settings take - effect in the same way. Settings in postgresql.auto.conf - override those in postgresql.conf. + which has the same format as postgresql.conf but + is intended to be edited automatically not manually. This file holds + settings provided through the command. + This file is read whenever postgresql.conf is, + and its settings take effect in the same way. Settings + in postgresql.auto.conf override those + in postgresql.conf. + + + + External tools might also + modify postgresql.auto.conf, typically by appending + new settings to the end. It is not recommended to do this while the + server is running, since a concurrent ALTER SYSTEM + command could overwrite such changes. The system view pg_file_settings - can be helpful for pre-testing changes to the configuration file, or for + can be helpful for pre-testing changes to the configuration files, or for diagnosing problems if a SIGHUP signal did not have the desired effects.
Cleanup of intro.sgml
-hackers, I went through and made some readability and modernization of the intro.sgml today. Patch attached. JD -- Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc Postgres centered full stack support, consulting and development. Advocate: @amplifypostgres || Get help: https://commandprompt.com/ * Unless otherwise stated, opinions are my own. * diff --git a/doc/src/sgml/intro.sgml b/doc/src/sgml/intro.sgml index 3038826311..7b23a8a918 100644 --- a/doc/src/sgml/intro.sgml +++ b/doc/src/sgml/intro.sgml @@ -4,78 +4,77 @@ Preface - This book is the official documentation of + This is the official documentation of PostgreSQL. It has been written by the - PostgreSQL developers and other - volunteers in parallel to the development of the - PostgreSQL software. It describes all - the functionality that the current version of - PostgreSQL officially supports. + PostgreSQL community in parallel with the + development of the PostgreSQL database. + It describes the functionality of current version of + PostgreSQL. - + - To make the large amount of information about - PostgreSQL manageable, this book has been - organized in several parts. Each part is targeted at a different - class of users, or at users in different stages of their + This book has been broken up in to categories to better manage + the large amount of information about + PostgreSQL. Each part is targeted at + a different category of user, or at users in different stages of their PostgreSQL experience: is an informal introduction for new users. + It includes information on topics such as installation, architecture + and how to create your first database. documents the SQL query - language environment, including data types and functions, as well - as user-level performance tuning. Every - PostgreSQL user should read this. + language, including data types and functions. It also includes + user-level performance tuning hints. All + PostgreSQL users should read this. describes the installation and - administration of the server. Everyone who runs a - PostgreSQL server, be it for private - use or for others, should read this part. + administration of the PostgreSQL server. + Everyone who administrates a + PostgreSQL server, should read this part. - describes the programming - interfaces for PostgreSQL client - programs. + describes the C based programming + interfaces for PostgreSQL. Topics such as + Python or Go are not discussed in this section. - contains information for - advanced users about the extensibility capabilities of the - server. Topics include user-defined data types and - functions. + contains information on + the extensibility of PostgreSQL. + server. Topics include user-defined data types, procedural languages, + and triggers. - contains reference information about - SQL commands, client and server programs. This part supports - the other parts with structured information sorted by command or - program. + contains reference information including + SQL commands, client applications and server applications. contains assorted information that might be of - use to PostgreSQL developers. + use to PostgreSQL internals developers. 
@@ -120,6 +119,29 @@ + PostgreSQL also supports a comprehensive array of Enterprise level + features: + + + + Unstructured data via JSON + + Binary and Logical Replication + + + Hot Backups + + + Partioniing + + + Materialized Views + + + Procedures + + + Also, PostgreSQL can be extended by the user in many ways, for example by adding new @@ -146,8 +168,8 @@ - And because of the liberal license, - PostgreSQL can be used, modified, and + The liberal license allows + PostgreSQL to be used, modified, and distributed by anyone free of charge for any purpose, be it private, commercial, or academic.
Re: Adding column "mem_usage" to view pg_prepared_statements
Am 05.08.2019 um 18:30 schrieb Konstantin Knizhnik: On 31.07.2019 1:38, Daniel Migowski wrote: Am 31.07.2019 um 00:29 schrieb Tom Lane: Daniel Migowski writes: Ok, just have read about the commitfest thing. Is the patch OK for that? Or is there generally no love for a mem_usage column here? If it was, I really would add some memory monitoring in our app regarding this. You should certainly put it into the next commitfest. We might or might not accept it, but if it's not listed in the CF we might not remember to even review it. (The CF app is really a to-do list for patches ...) OK, added it there. Thanks for your patience :). The patch is not applied to the most recent source because extra parameter was added to CreateTemplateTupleDesc method. Please rebase it - the fix is trivial. OK, will do. I think that including in pg_prepared_statements information about memory used this statement is very useful. CachedPlanMemoryUsage function may be useful not only for this view, but for example it is also need in my autoprepare patch. I would love to use your work if it's done, and would also love to work together here. I am quite novice in C thought, I might take my time to get things right. I wonder if you consider go further and not only report but control memory used by prepared statements? For example implement some LRU replacement discipline on top of prepared statements cache which can evict rarely used prepared statements to avoid memory overflow. THIS! Having some kind of safety net here would finally make sure that my precious processes will not grow endlessly until all mem is eaten up, even with prep statement count limits. While working on stuff I noticed there are three things stored in a CachedPlanSource. The raw query tree (a relatively small thing), the query list (analyzed-and-rewritten query tree) which takes up the most memory (at least here, maybe different with your usecases), and (often after the 6th call) the CachedPlan, which is more optimized that the query list and often needs less memory (half of the query list here). The query list seems to take the most time to create here, because I hit the GEQO engine here, but it could be recreated easily (up to 500ms for some queries). Creating the CachedPlan afterwards takes 60ms in some usecase. IF we could just invalidate them from time to time, that would be grate. Also, invalidating just the queries or the CachedPlan would not invalidate the whole prepared statement, which would break clients expectations, but just make them a slower, adding much to the stability of the system. I would pay that price, because I just don't use manually named prepared statements anyway and just autogenerate them as performance sugar without thinking about what really needs to be prepared anyway. There is an option in the JDBC driver to use prepared statements automatically after you have used them a few time. We have such patch for PgPro-EE but it limits only number of prepared statement, not taken in account amount of memory used by them. I think that memory based limit will be more accurate (although it adds more overhead). Limiting them by number is already done automatically here and would really not be of much value, but having a mem limit would be great. We could have a combined memory limit for your autoprepared statements as well as the manually prepared ones, so clients can know for sure the server processes won't eat up more that e.g. 800MB for prepared statements. 
And also I would like to have this value spread across all client processes, e.g. specifying max_prepared_statement_total_mem=5G for the server, and maybe max_prepared_statement_mem=1G for client processes. Of course we would have to implement cross client process invalidation here, and I don't know if communicating client processes are even intended. Anyway, a memory limit won't really add that much more overhead. At least not more than having no prepared statements at all because of the fear of server OOMs, or having just a small count of those statements. I was even thinking about a prepared statement reaper that checks pg_prepared_statements every few minutes to clean things up manually, but having this in the server would be of great value to me. If you want, I can be a reviewer of your patch. I'd love to have you as my reviewer. Regards, Daniel Migowski
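To sketch the kind of memory-based limit discussed above: track an estimated size per cached statement and evict least-recently-used entries once a configured budget is exceeded. Everything here is hypothetical (the struct, the limit, and the accounting function name) and is only meant to illustrate the shape of such a policy, not an existing PostgreSQL API.

#include <stddef.h>

/* Hypothetical bookkeeping for one cached prepared statement. */
typedef struct CachedStmt
{
    const char *name;
    size_t      mem_bytes;      /* e.g. from a CachedPlanMemoryUsage-style function */
    unsigned long last_used;    /* logical clock for LRU ordering */
    struct CachedStmt *next;
} CachedStmt;

/*
 * Evict least-recently-used statements until the total fits under the
 * limit.  A real implementation would also need to drop the statement's
 * plans (or only its generic plan, keeping the statement itself valid,
 * as suggested above) and update the shared accounting.
 */
static void
enforce_prepared_stmt_limit(CachedStmt **head, size_t limit_bytes,
                            size_t *total_bytes)
{
    while (*total_bytes > limit_bytes && *head != NULL)
    {
        CachedStmt **lru_link = head;
        CachedStmt *victim;

        for (CachedStmt **link = &(*head)->next; *link != NULL;
             link = &(*link)->next)
        {
            if ((*link)->last_used < (*lru_link)->last_used)
                lru_link = link;
        }

        victim = *lru_link;
        *total_bytes -= victim->mem_bytes;
        *lru_link = victim->next;
        /* free the victim's plan memory here in a real implementation */
    }
}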
Re: tableam vs. TOAST
On Fri, Aug 2, 2019 at 6:42 PM Andres Freund wrote: > Hm, those all include writing, right? And for read-only we don't expect > any additional overhead, correct? The write overhead is probably too > large show a bit of function call overhead - but if so, it'd probably be > on unlogged tables? And with COPY, because that utilizes multi_insert, > which means more toasting in a shorter amount of time? Yes and yes. I guess we could test the unlogged case and with COPY, but in any realistic case you're still looking for a tiny CPU overhead in a sea of I/O costs. Even if an extremely short COPY on an unlogged table regresses slightly, we do not normally reject patches that improve code quality on the grounds that they add function call overhead in a few places. Code like this hard to optimize and maintain; as you remarked yourself, there are multiple opportunities to do this stuff better that are hard to see in the current structure. > .oO(why does everyone attach attachements out of order? Is that > a gmail thing?) Must be. > I wonder if toasting.c should be moved too? I mean, we could, but I don't really see a reason. It'd just be moving it from one fairly-generic place to another, and I'd rather minimize churn. > trailing whitespace after "#ifndef DETOAST_H ". Will fix. > Hm. This leaves toast_insert_or_update() as a name. That makes it sound > like it's generic toast code, rather than heap specific? I'll rename it to heap_toast_insert_or_update(). But I think I'll put that in 0004 with the other renames. > It's definitely nice how a lot of repetitive code has been deduplicated. > Also makes it easier to see how algorithmically expensive > toast_insert_or_update() is :(. Yep. > Shouldn't this condition be the other way round? I had to fight pretty hard to stop myself from tinkering with the algorithm -- this can clearly be done better, but I wanted to make it match the existing structure as far as possible. It also only needs to be tested once, not on every loop iteration, so if we're going to start changing things, we should go further than just swapping the order of the tests. For now I prefer to draw a line in the sand and change nothing. > Couldn't most of these be handled via colflags, instead of having > numerous individual checks? I.e. if you had TOASTCOL_COMPRESSED, > TOASTCOL_IGNORED, TOASTCOL_MAIN, TOASTFLAG_EXTERNAL, etc, all but the > size check ought to boil down to a single mask test? I'm not really seeing how more flags would significantly simplify this logic, but I might be missing something. > I wonder if a better prefix wouldn't be toasting_... I'm open to other votes, but I think it's toast_tuple is more specific than toasting_ and thus better. > > +/* > > + * Information about one column of a tuple being toasted. > > + * > > + * NOTE: toast_action[i] can have these values: > > + * ' ' default handling > > + * 'p' already processed --- don't touch it > > + * 'x' incompressible, but OK to move off > > + * > > + * NOTE: toast_attr[i].tai_size is only made valid for varlena attributes > > with > > + * toast_action[i] different from 'p'. > > + */ > > +typedef struct > > +{ > > +struct varlena *tai_oldexternal; > > +int32 tai_size; > > +uint8 tai_colflags; > > +} ToastAttrInfo; > > I think that comment is outdated? Oops. Will fix. > > +/* > > + * Flags indicating the overall state of a TOAST operation. > > + * > > + * TOAST_NEEDS_DELETE_OLD indicates that one or more old TOAST datums need > > + * to be deleted. 
> > + * > > + * TOAST_NEEDS_FREE indicates that one or more TOAST values need to be > > freed. > > + * > > + * TOAST_HAS_NULLS indicates that nulls were found in the tuple being > > toasted. > > + * > > + * TOAST_NEEDS_CHANGE indicates that a new tuple needs to built; in other > > + * words, the toaster did something. > > + */ > > +#define TOAST_NEEDS_DELETE_OLD 0x0001 > > +#define TOAST_NEEDS_FREE0x0002 > > +#define TOAST_HAS_NULLS 0x0004 > > +#define TOAST_NEEDS_CHANGE 0x0008 > > I'd make these enums. They're more often accessible in a debugger... Ugh, I hate that style. Abusing enums to make flag bits makes my skin crawl. I always wondered what the appeal was -- I guess now I know. Blech. > I'm quite unconvinced that making the chunk size specified by the AM is > a good idea to do in isolation. We have TOAST_MAX_CHUNK_SIZE in > pg_control etc. It seems a bit dangerous to let AMs provide the size, > without being very clear that any change of the value will make data > inaccessible. It'd be different if the max were only used during > toasting. I was actually thinking about proposing that we rip TOAST_MAX_CHUNK_SIZE out of pg_control. No real effort has been made here to make this something that users could configure, and I don't know of a good reason to configure it. It also seems pretty out of place in a world where there are mult
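For reference, the enum-as-flag-bits alternative being suggested (and disliked) above would look roughly like this; illustrative only, the names simply mirror the macros quoted from the patch.

/*
 * Each enumerator is a single bit.  The constants are visible by name in
 * a debugger, which is the advantage being argued for; a variable holding
 * an OR of several of them is still just an integer as far as the type
 * system is concerned, which is part of the counter-argument.
 */
typedef enum ToastFlag
{
    TOAST_NEEDS_DELETE_OLD = 0x0001,
    TOAST_NEEDS_FREE       = 0x0002,
    TOAST_HAS_NULLS        = 0x0004,
    TOAST_NEEDS_CHANGE     = 0x0008
} ToastFlag;

/* Combinations of flags are still carried around as a plain integer. */
typedef int ToastFlagSet;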
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On Mon, Aug 5, 2019 at 3:07 PM Tom Lane wrote: > Concretely, how about the attached? Works for me, for whatever that's worth. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: tableam vs. TOAST
Robert Haas writes: > I don't feel entirely convinced that there's any rush to do all of > this right now, and the more I change the harder it is to make sure > that I haven't broken anything. How strongly do you feel about this > stuff? FWIW, I agree with your comment further up that this patch ought to just refactor, not change any algorithms. The latter can be done in separate patches. regards, tom lane
Re: pgbench - implement strict TPC-B benchmark
On Fri, Aug 2, 2019 at 2:38 AM Fabien COELHO wrote: > Ok, one thread cannot feed an N core server if enough client are executed > per thread and the server has few things to do. Right ... where N is, uh, TWO. > The point I'm clumsily trying to make is that pgbench-specific overheads > are quite small: Any benchmark driver would have pretty much at least the > same costs, because you have the cpu cost of the tool itself, then the > library it uses, eg lib{pq,c}, then syscalls. Even if the first costs are > reduced to zero, you still have to deal with the database through the > system, and this part will be the same. I'm not convinced. Perhaps you're right; after all, it's not like pgbench is doing any real work. On the other hand, I've repeatedly been annoyed by how inefficient pgbench is, so I'm not totally prepared to concede that any benchmark driver would have the same costs, or that it's a reasonably well-optimized client application. When I run the pgbench, I want to know how fast the server is, not how fast pgbench is. > What name would you suggest, if it were to be made available from pgbench > as a builtin, that avoids confusion with "tpcb-like"? I'm not in favor of adding it as a built-in. If we were going to do it, I don't know that we could do better than tcpb-like-2, and I'm not excited about that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Cleanup of intro.sgml
On 8/5/19 3:20 PM, Joshua D. Drake wrote: > intro.sgml today. Patch attached. Things I noticed quickly: "broken up in to categories": s/in to/into/; "Unstructured data via JSON": (or XML ?); "Partioniing": s/Partioniing/Partitioning/ Regards, -Chap
Re: Redacting information from logs
On Sat, 2019-08-03 at 19:14 -0400, Tom Lane wrote: > It seems to me that it'd be sufficient to do the annotation by > inserting wrapper functions, like the errparam() you suggest above. > If we just had errparam() choosing whether to return "..." instead of > its argument string, we'd have what we need, without messing with > the format language. I'm having trouble getting the ergonomics to work out here so that it can generate both a redacted and an unredacted message. If errparam() is a normal argument to errmsg(), then errparam() will be evaluated first. Will it return the redacted version, the unredacted version, or a special type that holds both? If I try to use macros to force multiple evaluation (to get one redacted and one unredacted string), then it seems like that would happen for all arguments (not just errparam arguments), which would be bad. Suggestions? Regards, Jeff Davis
Re: Index Skip Scan
> On Mon, Aug 5, 2019 at 12:05 PM Floris Van Nee > wrote: > > The root->distinct_pathkeys is already filtered for redundant keys, so column > 'a' is not in there anymore. Still, it'd be much faster to use the skip scan > here, because a regular scan will scan all entries with a=1, even though > we're really only interested in the first one. In previous versions, this > would be fixed by changing the check in planner.c to use > root->uniq_distinct_pathkeys instead of root->distinct_pathkeys, but things > change a bit now that the patch is rebased on the unique-keys patch. Would it > be valid to change this check to root->query_uniquekeys != NIL to consider > skip scans also for this query? [including a commentary from Jesper] On Mon, Aug 5, 2019 at 6:55 PM Jesper Pedersen wrote: Yes, the check should be for that. However, the query in question doesn't have any query_pathkeys, and hence query_uniquekeys in standard_qp_callback(), so therefore it isn't supported. Your scenario is covered by one of the test cases in case the functionality is supported. But, I think that is outside the scope of the current patch. > However, currently, the patch just reads the closest and then doesn't > consider this page at all anymore, if the first tuple skipped to turns out to > be not visible. Consider the following SQL to illustrate: For the record, the purpose of `_bt_read_closest` is not so much to reduce the amount of data we read from a page, but more to correctly handle those situations we were discussing before with reading forward/backward in cursors, since for that in some cases we need to remember previous values for stepping to the next. I've limited the number of items fetched in this function just because I was misled by having a check for dead tuples in `_bt_skip`. Of course we can modify it to read a whole page and leave the visibility check for the code after `index_getnext_tid` (although if we know that all tuples on this page are visible I guess it's not strictly necessary, but we still get an improvement from the skipping itself).
Re: pgbench - implement strict TPC-B benchmark
Hello Andres, Which is a (somehow disappointing) * 3.3 speedup. The impact on the 3 complex expressions tests is not measurable, though. I don't know why that could be disappointing. We put in much more work for much smaller gains in other places. Probably, but I thought I would have a better deal by eliminating most string stuff from variables. Questions: - how likely is such a patch to pass? (IMHO not likely) I don't see why? I didn't review the patch in any detail, but it didn't look crazy in quick skim? Increasing how much load can be simulated using pgbench, is something I personally find much more interesting than adding capabilities that very few people will ever use. Yep, but my point is that the bottleneck is mostly libpq/system, as I tried to demonstrate with the few experiments I reported. FWIW, the areas I find current pgbench "most lacking" during development work are: 1) Data load speed. The data creation is bottlenecked on fprintf in a single process. snprintf actually, could be replaced. I submitted a patch to add more control on initialization, including a server-side loading feature, i.e. the client does not send data, the server generates its own, see 'G': https://commitfest.postgresql.org/24/2086/ However on my laptop it is slower than client-side loading on a local socket. The client version is doing around 70 MB/s, the client load is 20-30%, postgres load is 85%, but I'm not sure I can hope for much more on my SSD. On my laptop the bottleneck is postgres/disk, not fprintf. The index builds are done serially. The vacuum could be replaced by COPY FREEZE. Well, it could be added? For a lot of meaningful tests one needs 10-1000s of GB of testdata - creating that is pretty painful. Yep. 2) Lack of proper initialization integration for custom scripts. Hmmm… You can always write a psql script for schema and possibly simplistic data initialization? However, generating meaningful pseudo-random data for an arbitrary schema is a pain. I did an external tool for that a few years ago: http://www.coelho.net/datafiller.html but it is still a pain. I.e. have steps that are in the custom script that allow -i, vacuum, etc to be part of the script, rather than separately executable steps. --init-steps doesn't do anything for that. Sure. It just gives some control. 3) pgbench overhead, although that's to a significant degree libpq's fault I'm afraid that is currently the case. 4) Ability to cancel pgbench and get approximate results. That currently works if the server kicks out the clients, but not when interrupting pgbench - which is just plain weird. Obviously that doesn't matter for "proper" benchmark runs, but often during development, it's enough to run pgbench past some events (say the next checkpoint). Do you mean have a report anyway on "Ctrl-C"? I usually do a -P 1 to see the progress, but making Ctrl-C work should be reasonably easy. - what is its impact to overall performance when actual queries are performed (IMHO very small). Obviously not huge - I'd also not expect it to be unobservably small either. Hmmm… Indeed, the 20 \set script runs at 2.6 M/s, that is 0.019 µs per \set, and any discussion over the connection is at least 15 µs (for one client on a local socket). -- Fabien.
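On point 4) above (getting approximate results when interrupting pgbench), the usual shape of such a change is small: have a SIGINT handler set a flag and let the main loop finish the current transaction and print what it has. The sketch below is self-contained and purely illustrative; it is not pgbench's actual code.

#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t cancel_requested = 0;

static void
handle_sigint(int signo)
{
    /* only set a flag; reporting happens in the main loop */
    cancel_requested = 1;
}

int
main(void)
{
    long completed = 0;

    signal(SIGINT, handle_sigint);

    /* stand-in for the benchmark loop */
    while (!cancel_requested /* && !duration_elapsed */)
    {
        /* run one transaction here ... */
        completed++;
    }

    printf("interrupted: %ld transactions completed (approximate results)\n",
           completed);
    return 0;
}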
Re: Adding column "mem_usage" to view pg_prepared_statements
Am 05.08.2019 um 19:16 schrieb Andres Freund: On 2019-07-28 06:20:40 +, Daniel Migowski wrote: how do you want to generalize it? Are you thinking about a view solely for the display of the memory usage of different objects? I'm not quite sure. I'm just not sure that adding separate infrastructure for various objects is a sutainable approach. We'd likely want to have this for prepared statements, for cursors, for the current statement, for various caches, ... I think an approach would be to add an 'owning_object' field to memory contexts, which has to point to a Node type if set. A table returning reporting function could recursively walk through the memory contexts, starting at TopMemoryContext. Whenever it encounters a context with owning_object set, it prints a string version of nodeTag(owning_object). For some node types it knows about (e.g. PreparedStatement, Portal, perhaps some of the caches), it prints additional metadata specific to the type (so for prepared statement it'd be something like 'prepared statement', '[name of prepared statement]'), and prints information about the associated context and all its children. I understand. So it would be something like the output of MemoryContextStatsInternal, but in table form with some extra columns. I would have loved this extra information already in MemoryContextStatsInternal btw., so it might be a good idea to upgrade it first to find the information and wrap a table function over it afterwards. The general context information probably should be something like: context_name, context_ident, context_total_bytes, context_total_blocks, context_total_freespace, context_total_freechunks, context_total_used, context_total_children context_self_bytes, context_self_blocks, context_self_freespace, context_self_freechunks, context_self_used, context_self_children, It might make sense to have said function return a row for the contexts it encounters that do not have an owner set too (that way we'd e.g. get CacheMemoryContext handled), but then still recurse. A nice way to learn about the internals of the server and to analyze the effects of memory reducing enhancements. Arguably the proposed owning_object field would be a bit redundant with the already existing ident/MemoryContextSetIdentifier field, which e.g. already associates the query string with the contexts used for a prepared statement. But I'm not convinced that's going to be enough context in a lot of cases, because e.g. for prepared statements it could be interesting to have access to both the prepared statement name, and the statement. The identifier seems to be more like a category at the moment, because it does not seem to hold any relevant information about the object in question. So a more specific name would be nice. The reason I like something like this is that we wouldn't add new columns to a number of views, and lack views to associate such information to for some objects. And it'd be disproportional to add all the information to numerous places anyway. I understand your argumentation, but things like Cursors and Portals are rather short living while prepared statements seem to be the place where memory really builds up. One counter-argument is that it'd be more expensive to get information specific to prepared statements (or other object types) that way. I'm not sure I buy that that's a problem - this isn't something that's likely going to be used at a high frequency. 
But if it becomes a problem, we can add a function that starts that process at a distinct memory context (e.g. a function that does this just for a single prepared statement, identified by name) - but I'd not start there. I also see no problem here, and with Konstantin Knizhnik's autoprepare I wouldn't use this very often anyway, more just for monitoring purposes, where I don't care if my query is a bit more complex. While being interesting I still believe monitoring the mem usage of prepared statements is a bit more important than that of other objects because of how they change memory consumption of the server without using any DDL or configuration options and I am not aware of other objects with the same properties, or are there some? And for the other volatile objects like tables and indexes and their contents PostgreSQL already has it's information functions. Plenty other objects have that property. E.g. cursors. And for the catalog/relation/... caches it's even more pernicious - the client might have closed all its "handles", but we still use memory (and it's absolutely crucial for performance). Maybe we can do both? Add a single column to pg_prepared_statements, and add another table for the output of MemoryContextStatsDetail? This has the advantage that the single real memory indicator useful for end users (to the question: How much mem takes my sh*t up?) is in pg_prepared_statements and some more intrinsic information in a detail view. Thinking about the latter I am against such a
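For reference, the recursive walk Andres describes (starting at TopMemoryContext and visiting every child) only needs the child links that memory contexts already have; the proposed owning_object field and the per-context reporting callback below are hypothetical, illustrating how such a table-returning function might be structured.

#include "postgres.h"
#include "nodes/memnodes.h"
#include "utils/memutils.h"

typedef void (*context_report_fn) (MemoryContext context, int depth);

/*
 * Sketch only: visit 'context' and all of its descendants, handing each
 * one to a callback that could emit one row per context (name, ident,
 * totals, and, if the proposal is adopted, the owning object's node tag).
 */
static void
report_contexts_recurse(MemoryContext context, int depth,
                        context_report_fn report)
{
    report(context, depth);

    for (MemoryContext child = context->firstchild;
         child != NULL;
         child = child->nextchild)
        report_contexts_recurse(child, depth + 1, report);
}

/* Typical entry point:  report_contexts_recurse(TopMemoryContext, 0, fn); */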
Re: Adding column "mem_usage" to view pg_prepared_statements
Hi, On 2019-08-05 22:46:47 +0200, Daniel Migowski wrote: > > Arguably the proposed owning_object field would be a bit redundant with > > the already existing ident/MemoryContextSetIdentifier field, which > > e.g. already associates the query string with the contexts used for a > > prepared statement. But I'm not convinced that's going to be enough > > context in a lot of cases, because e.g. for prepared statements it could > > be interesting to have access to both the prepared statement name, and > > the statement. > The identifier seems to be more like a category at the moment, because it > does not seem to hold any relevant information about the object in question. > So a more specific name would be nice. I think you might be thinking of the context's name, not ident? E.g. for prepared statements the context name is: source_context = AllocSetContextCreate(CurrentMemoryContext, "CachedPlanSource", ALLOCSET_START_SMALL_SIZES); which is obviously the same for every statement. But then there's MemoryContextSetIdentifier(source_context, plansource->query_string); which obviously differs. > > The reason I like something like this is that we wouldn't add new > > columns to a number of views, and lack views to associate such > > information to for some objects. And it'd be disproportional to add all > > the information to numerous places anyway. > I understand your argumentation, but things like Cursors and Portals are > rather short living while prepared statements seem to be the place where > memory really builds up. That's not necessarily true, especially given WITH HOLD cursors. Nor does one only run out of memory in the context of long-lived objects. > > > While being interesting I still believe monitoring the mem usage of > > > prepared statements is a bit more important than that of other objects > > > because of how they change memory consumption of the server without > > > using any DDL or configuration options and I am not aware of other > > > objects with the same properties, or are there some? And for the other > > > volatile objects like tables and indexes and their contents PostgreSQL > > > already has it's information functions. > > Plenty other objects have that property. E.g. cursors. And for the > > catalog/relation/... caches it's even more pernicious - the client might > > have closed all its "handles", but we still use memory (and it's > > absolutely crucial for performance). > > Maybe we can do both? Add a single column to pg_prepared_statements, and add > another table for the output of MemoryContextStatsDetail? This has the > advantage that the single real memory indicator useful for end users (to the > question: How much mem takes my sh*t up?) is in pg_prepared_statements and > some more intrinsic information in a detail view. I don't see why we'd want to do both. Just makes pg_prepared_statements a considerably more expensive. And that's used by some applications / clients in an automated manner. > Thinking about the latter I am against such a table, at least in the form > where it gives information like context_total_freechunks, because it would > just be useful for us developers. Developers are also an audience for us. I mean we certainly can use this information during development. But even for bugreports such information would be useufl. > Why should any end user care for how many > chunks are still open in a MemoryContext, except when he is working on > C-style extensions. Could just be a source of confusion for them. Meh. 
As long as the crucial stuff is first, that's imo enough. > Let's think about the goal this should have: The end user should be able to > monitor the memory consumption of things he's in control of or could affect > the system performance. Should such a table automatically aggregate some > information? I think so. I would not add more than two memory columns to the > view, just mem_used and mem_reserved. And even mem_used is questionable, > because in his eyes only the memory he cannot use for other stuff because of > object x is important for him (that was the reason I just added one column). > He would even ask: WHY is there 50% more memory reserved than used, and how > I can optimize it? (Would lead to more curious PostgreSQL developers maybe, > so that's maybe a plus). It's important because it influences how memory usage will grow. > On the other hand: The Generic Plan had been created for the first > invocation of the prepared statement, why not store it immediatly. It is a > named statement for a reason that it is intended to be reused, even when it > is just twice, and since memory seems not to be seen as a scarce resource in > this context why not store that immediately. Would drop the need for a > hierarchy here also. Well, we'll maybe never use
Re: Redacting information from logs
Hi, On 2019-08-05 13:37:50 -0700, Jeff Davis wrote: > On Sat, 2019-08-03 at 19:14 -0400, Tom Lane wrote: > > It seems to me that it'd be sufficient to do the annotation by > > inserting wrapper functions, like the errparam() you suggest above. > > If we just had errparam() choosing whether to return "..." instead of > > its argument string, we'd have what we need, without messing with > > the format language. > > I'm having trouble getting the ergonomics to work out here so that it > can generate both a redacted and an unredacted message. > > If errparam() is a normal argument to errmsg(), then errparam() will be > evaluated first. Will it return the redacted version, the unredacted > version, or a special type that holds both? I was thinking that it'd just store a struct ErrParam, which'd reference the passed value and metadata like the name (for structured log output) and redaction category. The bigger problem I see is handling the different types of arguments - but perhaps the answer there would be to just make the macro typesafe? Or have one for scalar values and one for pointer types? We could even allocate the necessary information for this on the stack, with some macro trickery. Not sure if it's worth it, but ... Doing something like this does however require not directly using plain sprintf. I'm personally not bothered by that, but I'd not be surprised if e.g. Tom saw that differently. > If I try to use macros to force multiple evaluation (to get one > redacted and one unredacted string), then it seems like that would > happen for all arguments (not just errparam arguments), which would be > bad. Yea, multiple evaluation clearly is a no-go. Greetings, Andres Freund
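To make the struct idea concrete, here is one hypothetical shape for it; nothing below exists in elog.h today, and the category names are simply borrowed from the examples used elsewhere in this thread.

/* Hypothetical redaction categories. */
typedef enum ErrParamClass
{
    MSGDEFAULT,                 /* safe to log verbatim */
    MSGUSERDATA                 /* user data; may be replaced by "..." */
} ErrParamClass;

/*
 * Hypothetical per-parameter record that an errparam() wrapper could
 * register with the current ErrorData, so that emit_log_hook or a
 * structured-log target can see each parameter's name, class, and value
 * separately from the formatted message text.
 */
typedef struct ErrParam
{
    const char     *name;       /* e.g. "constraintname" */
    ErrParamClass   pclass;     /* drives the redaction decision */
    const char     *value;      /* string form of the value */
} ErrParam;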
Re: Redacting information from logs
On Sun, 2019-08-04 at 11:17 -0700, Andres Freund wrote: > I'm imagining something like > > #define pg_printf(fmt, ...) \ > do { \ > if ( __builtin_constant_p(fmt)) \ > { \ > static processed_fmt processed_fmt_ = {.format = fmt}; \ > if (unlikely(!processed_fmt_.parsed)) \ >preprocess_format_string(&processed_fmt_) \ > pg_printf_processed(&processed_fmt_, __VA_ARGS__); \ > } \ > else \ > pg_printf_unprocessed(fmt, __VA_ARGS__); \ > } while (0) \ What would you do in the preprocessing exactly? Create a list of indexes into the string where the format codes are? > I think we've quite repeatedly had requests for a log format that can > be > parsed reasonably, and annotated with additional information that's > too > verbose for the main message. It's a pain to have to parse quite > complex strings in production, just to extract the value of the > actual > information in the message. > > ts="2019-08-04 10:19:21.286 PDT" pid=32416 vxid=12/9 sev=LOG \ > msg="Prozess 15027 erlangte AccessShareLock-Sperre auf Relation > 1259 der Datenbank 13416 nach 29808,865 ms" \ > fmt="process %d acquired %s on %s after %ld.%03d ms" \ > p:modename="AccessShareLock-Sperre auf Relation 1259" \ > p:msec=18343 p:usec=173 \ > p:locktag=190345/134343/0/0/relation/default \ > c:statement="LOCK TABLE pg_class;" \ > l:func=ProcSleep l:file=proc.c l:line=1495 > ... > If we have enough information to pass to the logging hook, we don't > even > need to define how all of this is going to look like exactly > (although > I'd probably argue that a logfmt or json target ought to be in core). I think I see where you are going with this now: it is almost orthogonal to your new-style format strings ( %{foo}s ), but not quite. You're suggesting that we save all of the arguments, along with some annotation, in the ErrorData struct, and then let emit_log_hook sort it out (perhaps by constructing some JSON message, perhaps translating the message_id, etc.). I like the idea, but still playing with the ergonomics a bit, and how it interacts with various message parts (msg, hint, detail, etc.). If we had the name-based format strings, then the message parts could all share a common set of parameters; but with the current positional format strings, I think each message part would need its own set of arguments. Positional: ereport(ERROR, (errcode(ERRCODE_UNIQUE_VIOLATION), errmsg_params("duplicate key value violates unique constraint \"%s\"", errparam("constraintname", MSGDEFAULT, RelationGetRelationName(rel)), errdetail_params("Key %s already exists.", errparam("key", MSGUSERDATA, key_desc))) ); Named: ereport(ERROR, (errcode(ERRCODE_UNIQUE_VIOLATION), errmsg_rich("duplicate key value violates unique constraint \"%{constraintname}s\""), errdetail_rich("Key %{key}s already exists."), errparam("key", MSGUSERDATA, key_desc)) errparam("constraintname", MSGDEFAULT, RelationGetRelationName(rel))) ); I think either one needs some ergonomic improvements, but it seems like we are going in the right direction. Maybe we can make the parameters common to different message parts by using an integer index to reference the parameter, like: ereport(ERROR, (errcode(ERRCODE_UNIQUE_VIOLATION), errmsg_rich("duplicate key value violates unique constraint \"%s\"", 1 /* second errparam */), errdetail_rich("Key %s already exists.", 0 /* first errparam */), errparam("key", MSGUSERDATA, key_desc)) errparam("constraintname", MSGDEFAULT, RelationGetRelationName(rel))) ); Not quite ideal, but might get us closer. Regards, Jeff Davis
Re: Redacting information from logs
On Mon, 2019-08-05 at 14:10 -0700, Andres Freund wrote: > I was thinking that it'd just store a struct ErrParam, which'd > reference > the passed value and metadata like the name (for structured log > output) and redaction category. The bigger problem I see is handling > the different types of arguments - but perhaps the answer there would > be > to just make the macro typesafe? Or have one for scalar values and > one > for pointer types? Losing the compile-time checks for compatibility between format codes and arguments would be a shame. Are you saying there's a potential solution for that? > Doing something like this does however require not directly using > plain > sprintf. I'm personally not bothered by that, but I'd not be > surprised > if e.g. Tom saw that differently. It may be possible to still use sprintf if we translate the ErrParams into plain values first. Regards, Jeff Davis
Re: Redacting information from logs
Hi, On 2019-08-05 14:26:44 -0700, Jeff Davis wrote: > On Sun, 2019-08-04 at 11:17 -0700, Andres Freund wrote: > > I'm imagining something like > > > > #define pg_printf(fmt, ...) \ > > do { \ > > if ( __builtin_constant_p(fmt)) \ > > { \ > > static processed_fmt processed_fmt_ = {.format = fmt}; \ > > if (unlikely(!processed_fmt_.parsed)) \ > >preprocess_format_string(&processed_fmt_) \ > > pg_printf_processed(&processed_fmt_, __VA_ARGS__); \ > > } \ > > else \ > > pg_printf_unprocessed(fmt, __VA_ARGS__); \ > > } while (0) \ > > What would you do in the preprocessing exactly? Create a list of > indexes into the string where the format codes are? Yea, basically. If you look at snprint.c's dopr(), there's a fair bit of parsing going on that is going to stay the same from call to call in nearly all cases (and the cases where not we possibly ought to fix). And things like the $ processing for argument order, or having named arguments as I suggest, make that more pronounced. > > If we have enough information to pass to the logging hook, we don't > > even > > need to define how all of this is going to look like exactly > > (although > > I'd probably argue that a logfmt or json target ought to be in core). > > I think I see where you are going with this now: it is almost > orthogonal to your new-style format strings ( %{foo}s ), but not quite. > > You're suggesting that we save all of the arguments, along with some > annotation, in the ErrorData struct, and then let emit_log_hook sort it > out (perhaps by constructing some JSON message, perhaps translating the > message_id, etc.). Right. > I like the idea, but still playing with the ergonomics a bit, and how > it interacts with various message parts (msg, hint, detail, etc.). If > we had the name-based format strings, then the message parts could all > share a common set of parameters; but with the current positional > format strings, I think each message part would need its own set of > arguments. Right, I think that's a good part of where I was coming from. > Maybe we can make the parameters common to different message parts by > using an integer index to reference the parameter, like: > >ereport(ERROR, >(errcode(ERRCODE_UNIQUE_VIOLATION), > errmsg_rich("duplicate key value violates unique constraint > \"%s\"", 1 /* second errparam */), > errdetail_rich("Key %s already exists.", 0 /* first > errparam */), > errparam("key", MSGUSERDATA, key_desc)) > errparam("constraintname", MSGDEFAULT, > RelationGetRelationName(rel))) >); > > Not quite ideal, but might get us closer. If we insist that errmsg_rich/errdetail_rich may not have parameters, then they can just use the same set of arguments, without any of this, at the cost of sometimes more complicated % syntax (i.e. %1$d to refer to the first argument). I think the probable loss of gcc format warnings would be the biggest issue with this whole proposal, and potential translator trouble the biggest impediment for named parameters. Greetings, Andres Freund
Re: Redacting information from logs
Hi, On 2019-08-05 14:32:36 -0700, Jeff Davis wrote: > On Mon, 2019-08-05 at 14:10 -0700, Andres Freund wrote: > > I was thinking that it'd just store a struct ErrParam, which'd > > reference > > the passed value and metadata like the name (for structured log > > output) and redaction category. The bigger problem I see is handling > > the different types of arguments - but perhaps the answer there would > > be > > to just make the macro typesafe? Or have one for scalar values and > > one > > for pointer types? > > Losing the compile-time checks for compatibility between format codes > and arguments would be a shame. Are you saying there's a potential > solution for that? Yea, I'd just written that in another reply to yours. I did actually think about this some recently: https://www.postgresql.org/message-id/20190730181845.jyyk4selyohagwnf%40alap3.anarazel.de Not sure if any of that is really applicable. Once more I really am regretting that PG is C rather than C++. > > Doing something like this does however require not directly using > > plain > > sprintf. I'm personally not bothered by that, but I'd not be > > surprised > > if e.g. Tom saw that differently. > > It may be possible to still use sprintf if we translate the ErrParams > into plain values first. It's pretty hard to call those functions with a runtime variable number of arguments. va_list as an interface is not great... Greetings, Andres Freund
Re: Assertion for logically decoding multi inserts into the catalog
> On 31 Jul 2019, at 19:20, Heikki Linnakangas wrote: > This patch makes the assertion more strict than it was before. I don't see > how it could possibly make a regression failure go away. On the contrary. So, > huh? Yeah, this is clearly fat-fingered, the intent is to only run the Assert in case XLH_INSERT_CONTAINS_NEW_TUPLE is set in xlrec->flags, as it only applies under that condition. The attached is tested in both in the multi-insert patch and on HEAD, but I wish I could figure out a better way to express this Assert. cheers ./daniel logdec_assert-v2.patch Description: Binary data
Re: Redacting information from logs
On Mon, 2019-08-05 at 14:44 -0700, Andres Freund wrote: > at the cost of sometimes more complicated % syntax (i.e. %1$d to > refer > to the first argument). > > I think the probable loss of gcc format warnings would be the biggest > issue with this whole proposal, and potential translator trouble the > biggest impediment for named parameters. I'd be OK with '%1$d' syntax. That leaves type safety as the main problem. Your solution involving _Generic is interesting -- I didn't even know that existed. I don't think it needs to be supported on all compilers, as long as we are getting errors from somewhere. They would be runtime errors instead of compile-time errors, though. Regards, Jeff Davis
Putting kerberos/ldap logs somewhere useful
I got frustrated just now because this: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2019-08-05%2021%3A18%3A23 is essentially undebuggable, thanks to the buildfarm's failure to capture any error output from slapd. That's not the buildfarm script's fault: it's willing to capture everything placed in the agreed-on log directory. But the TAP test script randomly places the daemon's log file somewhere else, one level up. The kerberos test script has the same problem. Hence, I propose the attached. This just moves the actual log files ... we could possibly move the daemons' .conf files as well, but I think they're probably not variable enough to be interesting. regards, tom lane diff --git a/src/test/kerberos/t/001_auth.pl b/src/test/kerberos/t/001_auth.pl index 34845a7..e3eb052 100644 --- a/src/test/kerberos/t/001_auth.pl +++ b/src/test/kerberos/t/001_auth.pl @@ -67,8 +67,8 @@ my $realm= 'EXAMPLE.COM'; my $krb5_conf = "${TestLib::tmp_check}/krb5.conf"; my $kdc_conf= "${TestLib::tmp_check}/kdc.conf"; -my $krb5_log= "${TestLib::tmp_check}/krb5libs.log"; -my $kdc_log = "${TestLib::tmp_check}/krb5kdc.log"; +my $krb5_log= "${TestLib::log_path}/krb5libs.log"; +my $kdc_log = "${TestLib::log_path}/krb5kdc.log"; my $kdc_port= get_free_port(); my $kdc_datadir = "${TestLib::tmp_check}/krb5kdc"; my $kdc_pidfile = "${TestLib::tmp_check}/krb5kdc.pid"; diff --git a/src/test/ldap/t/001_auth.pl b/src/test/ldap/t/001_auth.pl index 47bc090..5efb87a 100644 --- a/src/test/ldap/t/001_auth.pl +++ b/src/test/ldap/t/001_auth.pl @@ -52,7 +52,7 @@ my $ldap_datadir = "${TestLib::tmp_check}/openldap-data"; my $slapd_certs = "${TestLib::tmp_check}/slapd-certs"; my $slapd_conf= "${TestLib::tmp_check}/slapd.conf"; my $slapd_pidfile = "${TestLib::tmp_check}/slapd.pid"; -my $slapd_logfile = "${TestLib::tmp_check}/slapd.log"; +my $slapd_logfile = "${TestLib::log_path}/slapd.log"; my $ldap_conf = "${TestLib::tmp_check}/ldap.conf"; my $ldap_server = 'localhost'; my $ldap_port = get_free_port();
Re: An out-of-date comment in nodeIndexonlyscan.c
On Fri, Jun 28, 2019 at 12:55 PM Ashwin Agrawal wrote: > On Thu, Jun 27, 2019 at 4:33 PM Thomas Munro wrote: >> I wonder if it might be possible to avoid page locks on both the heap >> *and* index in some limited cases, as I mentioned over here (just an >> idea, could be way off base): >> >> https://www.postgresql.org/message-id/CA%2BhUKGJGDVfhHmoUGzi-_J%2BN8FmRjmWTY0MCd1ZV5Fj9qdyb1w%40mail.gmail.com > > I am in same boat as you. One can get serializable conflict error today based > on tuple gets HOT updated or not. HOT is logically internal code optimization > and not so much user functionality, so ideally feels shouldn't affect > serializable behavior. But it does currently, again, due to index locking. > Disable HOT update and 4 isolation tests fail due to "could not serialize > access due to read/write dependencies among transactions" otherwise not. If > the tuple happens to fit on same page user will not get the error, if the > tuple gets inserted to different page the error can happen, due to index page > locking. I had discussed this with Heikki and thinking is we shouldn't need > to take the lock on the index page, if the index key was not changed, even if > it was a non-HOT update. Not sure of complications and implications, but just > a thought. Oh, I think the idea I was suggesting might be the same as this item from README-SSI (unrelated to HOT, but related to index uniqueness, particularly in index-only-scan where you might be able to skip the btree page lock): "Several optimizations are possible, though not all are implemented yet: ... * An index scan which is comparing for equality on the entire key for a unique index need not acquire a predicate lock as long as a key is found corresponding to a visible tuple which has not been modified by another transaction -- there are no "between or around" gaps to cover." -- Thomas Munro https://enterprisedb.com
Re: POC: Cleaning up orphaned files using undo logs
On Mon, Aug 5, 2019 at 12:42 PM Andres Freund wrote: > A good move in the right direction, imo. I spent some more time thinking about this and talking to Thomas about it and I'd like to propose a somewhat more aggressive restructuring proposal, with the aim of getting a cleaner separation between layers of this patch set. Right now, the undo log storage stuff knows nothing about the contents of an undo log, whereas the undo interface storage knows everything about the contents of an undo log. In particular, it knows that it's a series of records, and those records are grouped into transactions, and it knows both the format of the individual records and also the details of how transaction headers work. Nothing can use the undo log storage system except for the undo interface layer, because the undo interface layer assumes that all the data in the undo storage system conforms to the record/recordset format which it defines. However, there are a few warts: while the undo log storage patch doesn't know anything about the contents of undo logs, it does know that that transaction boundaries matter, and it signals to the undo interface layer whether a transaction header should be inserted for a new record. That's a strange thing for the storage layer to be doing. Also, in addition to three persistence levels, it knows about a fourth undo log category for "special" data for multixact or TPD-like things. That's another wart. Suppose that we instead invent a new layer which sits on top of the undo log storage layer. This layer manages what I'm going to call GHOBs, growable hunks of bytes. (This is probably not the best name, but I thought of it in 5 seconds during a complex technical conversation, so bear with me.) The GHOB layer supports open/close/grow/write/overwrite operations. Conceptually, you open a GHOB with an initial size and a persistence level, and then you can subsequently grow it unless you fill up the undo log in which case you can't grow it any more; when you're done, you close it. Opening and closing a GHOB are operations that only make in-memory state changes. Opening a GHOB finds a place where you could write the initial amount of data you specify, but it doesn't actually write any data or change any persistent state yet, except for making sure that nobody else can grab that space as long as you have the GHOB open. Closing a GHOB tells the system that you're not going to grow the object any more, which means some other GHOB can be placed immediately after the last data you wrote. Growing a GHOB doesn't do anything persistent either; it just tests whether there would be room to write those bytes. So, the only operations that make actual persistent changes are write and overwrite. These operations just copy data into shared buffers and mark them dirty, but they are set up so that you can integrate this with whatever WAL-logging your doing for those operations, so that you can make the same writes happen at redo time. Then, on top of the GHOB layer, you have separate submodules for different kinds of GHOBs. Most importantly, you have a transaction-GHOB manager, which opens a GHOB per persistence level the first time somebody wants to write to it and closes those GHOBs at end-of-xact. AMs push records into the transaction-GHOB manager, and it pushes them into GHOBs on the other side. Then you can also have a multi-GHOB manager, which would replace what Thomas now has as a separate undo log category. 
The undo-log-storage layer wouldn't have any fixed limit on the number of GHOBs that could be open at the same time; it would just be the sum of whatever the individual GHOB type managers can open. It would be important to keep that number fairly small since there's not an unlimited supply of undo logs, but that doesn't seem like a problem for any of the uses we currently have in mind. Each GHOB would begin with a magic number identifying the GHOB type, and would have callbacks for everything else, like "how big is this GHOB?" and "is it discardable?". I'm not totally sure I've thought through all of the problems here, but it seems like this might help us fix some of the aforementioned layering inversions. The undo log storage system only knows about storage: it doesn't have to help with things like transaction boundaries any more, and it continues to be indifferent to the actual contents of the storage. At the GHOB layer, we know that we've got chunks of storage which are the unit of undo discard, and we know that they start with a magic number that identifies the type, but it doesn't know whether they are internally broken into records or, if so, how those records are organized. The individual GHOB managers do know that stuff; for example, the transaction-GHOB manager would know that AMs insert undo records and how those records are compressed and so forth. One thing that feels good about this system is that you could actually write something like the test_undo module that Thomas ha
Re: Psql patch to show access methods info
On Wed, Jul 24, 2019 at 4:59 PM Alexander Korotkov wrote: > On Wed, Jul 24, 2019 at 9:01 AM Andres Freund wrote: > > Based on a quick skim of the thread - which means I most definitely > > missed things - there's not been discussion of why we actually want to > > add this. Who's the prospective user of this facility? And why wouldn't > > they just query pg_am[proc]? None of this information seems like it's > > going to be even remotely targeted towards even advanced users. For > > developers it's not clear what these add? > > I see your point regarding pg_am details. Probably nobody expect > developers need this. And probably even developers don't need this, > because it's easier to see IndexAmRoutine directly with more details. > So, +1 for removing this. > > pg_amproc for gin/gist/sp-gist/brin is probably for developers. But I > think pg_amproc for btree/hash could be useful for advanced users. > btree/hash opclasses could be written by advanced users using > pl/something, I've faced that several times. Revised patch is attached. Changes to \dA+ command are reverted. It also contains some minor improvements. Second patch looks problematic for me, because it provides index description alternative to \d+. IMHO, if there is something really useful to display about index, we should keep it in \d+. So, I propose to postpone this. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company 0001-Add-psql-AM-info-commands-v09.patch Description: Binary data
Re: Implementing Incremental View Maintenance
It's not mentioned below but some bugs including seg fault when --enable-casser is enabled was also fixed in this patch. BTW, I found a bug with min/max support in this patch and I believe Yugo is working on it. Details: https://github.com/sraoss/pgsql-ivm/issues/20 Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp From: Yugo Nagata Subject: Re: Implementing Incremental View Maintenance Date: Wed, 31 Jul 2019 18:08:51 +0900 Message-ID: <20190731180851.73856441d8abb494bf5e6...@sraoss.co.jp> > Hi, > > Attached is the latest patch for supporting min and max aggregate functions. > > When new tuples are inserted into base tables, if new values are small > (for min) or large (for max), matview just have to be updated with these > new values. Otherwise, old values just remains. > > However, in the case of deletion, this is more complicated. If deleted > values exists in matview as current min or max, we have to recomputate > new min or max values from base tables for affected groups, and matview > should be updated with these recomputated values. > > Also, regression tests for min/max are also added. > > In addition, incremental update algorithm of avg aggregate values is a bit > improved. If an avg result in materialized views is updated incrementally > y using the old avg value, numerical errors in avg values are accumulated > and the values get wrong eventually. To prevent this, both of sum and count > values are contained in views as hidden columns and use them to calculate > new avg value instead of using old avg values. > > Regards, > > On Fri, 26 Jul 2019 11:35:53 +0900 > Yugo Nagata wrote: > >> Hi, >> >> I've updated the wiki page of Incremental View Maintenance. >> >> https://wiki.postgresql.org/wiki/Incremental_View_Maintenance >> >> On Thu, 11 Jul 2019 13:28:04 +0900 >> Yugo Nagata wrote: >> >> > Hi Thomas, >> > >> > Thank you for your review and discussion on this patch! >> > >> > > > 2019年7月8日(月) 15:32 Thomas Munro : >> > > > >> > > > > On Fri, Jun 28, 2019 at 10:56 PM Yugo Nagata >> > > > > wrote: >> > > > > > Attached is a WIP patch of IVM which supports some aggregate >> > > > > > functions. >> > > > > >> > > > > Hi Nagata-san and Hoshiai-san, >> > > > > >> > > > > Thank you for working on this. I enjoyed your talk at PGCon. I've >> > > > > added Kevin Grittner just in case he missed this thread; he has >> > > > > talked >> > > > > often about implementing the counting algorithm, and he wrote the >> > > > > "trigger transition tables" feature to support exactly this. While >> > > > > integrating trigger transition tables with the new partition >> > > > > features, >> > > > > we had to make a number of decisions about how that should work, and >> > > > > we tried to come up with answers that would work for IMV, and I hope >> > > > > we made the right choices! >> > >> > Transition tables is a great feature. I am now using this in my >> > implementation >> > of IVM, but the first time I used this feature was when I implemented a PoC >> > for extending view updatability of PostgreSQL[1]. At that time, I didn't >> > know >> > that this feature is made originally aiming to support IVM. >> > >> > [1] https://www.pgcon.org/2017/schedule/events/1074.en.html >> > >> > I think transition tables is a good choice to implement a statement level >> > immediate view maintenance where materialized views are refreshed in a >> > statement >> > level after trigger. 
However, when implementing a transaction level >> > immediate >> > view maintenance where views are refreshed per transaction, or deferred >> > view >> > maintenance, we can't update views in a after trigger, and we will need an >> > infrastructure to manage change logs of base tables. Transition tables can >> > be >> > used to collect these logs, but using logical decoding of WAL is another >> > candidate. >> > In any way, if these logs can be collected in a tuplestore, we might able >> > to >> > convert this to "ephemeral named relation (ENR)" and use this to calculate >> > diff >> > sets for views. >> > >> > > > > >> > > > > I am quite interested to learn how IVM interacts with SERIALIZABLE. >> > > > > >> > > > >> > > > Nagata-san has been studying this. Nagata-san, any comment? >> > >> > In SERIALIZABLE or REPEATABLE READ level, table changes occurred in other >> > ransactions are not visible, so views can not be maintained correctly in >> > AFTER >> > triggers. If a view is defined on two tables and each table is modified in >> > different concurrent transactions respectively, the result of view >> > maintenance >> > done in trigger functions can be incorrect due to the race condition. This >> > is the >> > reason why such transactions are aborted immediately in that case in my >> > current >> > implementation. >> > >> > One idea to resolve this is performing view maintenance in two phases. >> > Firstly, >> > views
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Wed, Jul 31, 2019 at 03:29:59PM +0900, Masahiko Sawada wrote: > Just to confirm, we have 21 bits left for nonce in CTR? We have LSN (8 > bytes), page-number (4 bytes) and counter (11 bits) in 16 bytes nonce > space. Even though we have 21 bits left we cannot store relfilenode to > the IV. No. The nonce is the LSN and page number, the CTR counter (11 bits), and 21 extra bits. CTR needs a nonce for every 16-byte block, if you want to think of it that way. Even though there isn't space for the relfilenode in the nonce, We could use the relfilenode/tablespace/database id by mixing into a derived key, based on the master key, but as I stated in another email, we don't want do that unles we have to. > BTW I've received a review about the current design by some > cryptologists in our company. They recommended to use CTR rather than > CBC. The main reason is that in block cipher it's important to ensure > the uniqueness even for every input block to the block cipher. CBC is > hard to ensure that because the previous output is the next block's > input. Whereas in CTR, it encrypts each blocks separately with > key+nonce. Therefore if we can ensure the uniqueness of IV we can meet > that. Also it's not necessary to encrypt IV as it's okey to be > predictable. So I vote for CTR for at least for tables/indexes > encryption, there already might be consensus though. Yes, you are more likely to get duplicate nonce in CBC mode rather than the CTR mode we are proposing. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Wed, Jul 31, 2019 at 04:58:49PM +0900, Masahiko Sawada wrote: > On Wed, Jul 31, 2019 at 3:29 PM Masahiko Sawada wrote: > > > > > > For WAL encryption, before flushing WAL we encrypt whole 8k WAL page > > and then write only the encrypted data of the new WAL record using > > pg_pwrite() rather than write whole encrypted page. So each time we > > encrypt 8k WAL page we end up with encrypting different data with the > > same key+nonce but since we don't write to the disk other than space > > where we actually wrote WAL records it's not a problem. Is that right? > > Hmm that's incorrect. We always write an entire 8k WAL page even if we > write a few WAl records into a page. It's bad because we encrypt > different pages with the same key+IV, but we cannot change IV for each > WAL writes as we end up with changing also > already-flushed-WAL-records. So we might need to change the WAL write > so that it write only WAL records we actually wrote. Uh, I don't understand. We use the LSN to write the 8k page, and we use a different nonce scheme for the WAL. The LSN changes each time the page is modified. The 8k page in the WAL is encrypted just like the rest of the WAL. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Wed, Jul 31, 2019 at 09:43:00AM -0400, Sehrope Sarkuni wrote: > On Wed, Jul 31, 2019 at 2:32 AM Masahiko Sawada wrote: > > Just to confirm, we have 21 bits left for nonce in CTR? We have LSN (8 > bytes), page-number (4 bytes) and counter (11 bits) in 16 bytes nonce > space. Even though we have 21 bits left we cannot store relfilenode to > the IV. > > > Fields like relfilenode, database, or tablespace could be added to the derived > key, not the per-page IV. There's no space limitations as they are additional > inputs into the HKDF (key derivation function). Yes, but we want to avoid that for other reasons. > For WAL encryption, before flushing WAL we encrypt whole 8k WAL page > and then write only the encrypted data of the new WAL record using > pg_pwrite() rather than write whole encrypted page. So each time we > encrypt 8k WAL page we end up with encrypting different data with the > same key+nonce but since we don't write to the disk other than space > where we actually wrote WAL records it's not a problem. Is that right? > > Ah, this is what I was referring to in my previous mail. I'm not familiar with > how the writes happen yet (reading up...) but, yes, we would need to ensure > that encrypted data is not written more than once (i.e. no writing of encrypt > (zero) followed by writing of encrypt(non-zero) at the same spot). Right. The 8k page LSN changes each time the page is modified, and the is part of the page nonce. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Wed, Jul 31, 2019 at 04:11:03PM +0900, Masahiko Sawada wrote: > On Wed, Jul 31, 2019 at 5:48 AM Bruce Momjian wrote: > > I am thinking for the heap/index IV, it would be: > > > > uint64 lsn; > > unint32 page number; > > /* only uses 11 bits for a zero-based CTR counter for 32k pages */ > > uint32 counter; > > > > +1 > IIUC since this would require to ensure uniqueness by using key+IV we > need to use different keys for different relations. Is that right? No. My other email states that the LSN is only used for a single relation, so there is no need for the relfilenode in the nonce. A single LSN writing to multiple parts of the relation generates a unique nonce since the page number is also part of the nonce. > > and for WAL it would be: > > > > uint64 segment_number; > > uint32counter; > > /* guarantees this IV doesn't match any relation IV */ > > uint32 2^32-1 /* all 1's */ > > I would propose to include the page number within a WAL segment to IV > so that we can encrypt each WAL page with the counter always starting > from 0. And if we use different encryption keys for tables/indexes and What is the value of that? > And if we use different encryption keys for tables/indexes and > WAL I think we don't need 2^32-1. I see little value to using different encryption keys for tables/indexes and WAL. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings, * Tom Lane (t...@sss.pgh.pa.us) wrote: > Robert Haas writes: > > On Mon, Aug 5, 2019 at 2:42 PM Tom Lane wrote: > >> I don't think we need to go on about it at great length, but it seems > >> to me that it'd be reasonable to point out that (a) you'd be well > >> advised not to touch the file while the postmaster is up, and (b) > >> last setting wins. Those things are equally true of postgresql.conf > >> of course, but I don't recall whether they're already documented. > > > OK, fair enough. > > Concretely, how about the attached? > (Digging around in config.sgml, I found that last-one-wins is stated, > but only in the context of one include file overriding another. > That's not *directly* a statement about what happens within a single > file, and it's in a different subsection anyway, so repeating the > info in 19.1.2 doesn't seem unreasonable.) Agreed. > diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml > index cdc30fa..f5986b2 100644 > --- a/doc/src/sgml/config.sgml > +++ b/doc/src/sgml/config.sgml > @@ -153,6 +153,8 @@ shared_buffers = 128MB > identifiers or numbers must be single-quoted. To embed a single > quote in a parameter value, write either two quotes (preferred) > or backslash-quote. > + If the file contains multiple entries for the same parameter, > + all but the last one are ignored. > Looking at this patch quickly but also in isolation, so I could be wrong here, but it seems like the above might be a good place to mention "duplicate entries and comments may be removed." > > @@ -185,18 +187,27 @@ shared_buffers = 128MB > In addition to postgresql.conf, > a PostgreSQL data directory contains a file > > postgresql.auto.confpostgresql.auto.conf, > - which has the same format as postgresql.conf but > should > - never be edited manually. This file holds settings provided through > - the command. This file is > automatically > - read whenever postgresql.conf is, and its settings > take > - effect in the same way. Settings in > postgresql.auto.conf > - override those in postgresql.conf. > + which has the same format as postgresql.conf but > + is intended to be edited automatically not manually. This file holds > + settings provided through the command. > + This file is read whenever postgresql.conf is, > + and its settings take effect in the same way. Settings > + in postgresql.auto.conf override those > + in postgresql.conf. > + The above hunk looks fine. > + > + External tools might also > + modify postgresql.auto.conf, typically by appending > + new settings to the end. It is not recommended to do this while the > + server is running, since a concurrent ALTER SYSTEM > + command could overwrite such changes. > Alternatively, or maybe also here, we could say "note that appending to the file as a mechanism for setting a new value by an external tool is acceptable even though it will cause duplicates- PostgreSQL will always use the last value set and other tools should as well. Duplicates and comments may be removed when rewriting the file, and parameters may be lower-cased." (istr that last bit being true too but I haven't checked lately). > > The system view >linkend="view-pg-file-settings">pg_file_settings > - can be helpful for pre-testing changes to the configuration file, or for > + can be helpful for pre-testing changes to the configuration files, or > for > diagnosing problems if a SIGHUP signal did not > have the > desired effects. > This hunk looks fine. 
Reviewing https://www.postgresql.org/docs/current/config-setting.html again, it looks reasonably comprehensive regarding the format of the file- perhaps we should link to there from the "external tools might also modify" para..? "Tool authors should review to understand the structure of postgresql.auto.conf". Thanks! Stephen signature.asc Description: PGP signature
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Wed, Jul 31, 2019 at 09:25:01AM -0400, Sehrope Sarkuni wrote: > On Tue, Jul 30, 2019 at 4:48 PM Bruce Momjian wrote: > > I had more time to think about the complexity of adding relfilenode to > the IV. Since relfilenode is only unique within a database/tablespace, > we would need to have pg_upgrade preserve database/tablespace oids > (which I assume are the same as the directory and tablespace symlinks). > Then, to decode a page, you would need to look up those values. This is > in addition to the new complexity of CREATE DATABASE and moving files > between tablespaces. I am also concerned that crash recovery operations > and cluster forensics and repair would need to also deal with this. > > I am not even clear if pg_upgrade preserving relfilenode is possible --- > when we wrap the relfilenode counter, does it start at 1 or at the > first-user-relation-oid? If the former, it could conflict with oids > assigned to new system tables in later major releases. Tying the > preservation of relations to two restrictions seems risky. > > > Agreed. Unless you know for sure the input is going to immutable across copies > or upgrades, including anything in either the IV or key derivation gets risky > and could tie you down for the future. That's partly why I like the idea > separate salt (basically you directly pay for the complexity by tracking > that). Yes, fragility is something to be concerned about. The system is already very complex, and we occasionally have to do forensic work or repairs. > Even if we do not include a separate per-relation salt or things like > relfilenode when generating a derived key, we can still include other types of > immutable attributes. For example the fork type could be included to > eventually > allow multiple forks for the same relation to be encrypted with the same IV = > LSN + Page Number as the derived key per-fork would be distinct. Yes, the fork number could be useful in this case. I was thinking we would just leave the extra bits as zeros and we can then set it to '1' or something else for a different fork. > > Using just the page LSN and page number allows a page to be be > decrypted/encrypted independently of its file name, tablespace, and > database, and I think that is a win for simplicity. Of course, if it is > insecure we will not do it. > > > As LSN + Page Number combo is unique for all relations (not just one relation) > I think we're good for pages. Yes, since a single LSN can only be used for a single relation, and I added an Assert to check that. Good. > I am thinking for the heap/index IV, it would be: > > uint64 lsn; > unint32 page number; > /* only uses 11 bits for a zero-based CTR counter for 32k pages */ > uint32 counter; > > > Looks good. > > > and for WAL it would be: > > uint64 segment_number; > uint32 counter; > /* guarantees this IV doesn't match any relation IV */ > uint32 2^32-1 /* all 1's */ > > > I need to read up more on the structure of the WAL records but here's some > high > level thoughts: > > WAL encryption should not use the same key as page encryption so there's no > need to design the IV to try to avoid matching the page IVs. Even a basic > derivation with a single fixed WDEK = HKDF(MDEK, "WAL") and TDEK = HKDF(MDEK, > "PAGE") would ensure separate keys. That's the the literal string "WAL" or > "PAGE" being added as a salt to generate the respective keys, all that matters > is they're different. I was thinking the WAL would use the same key since the nonce is unique between the two. 
What value is there in using a different key? > Ideally WAL encryption would generating new derived keys as part of the WAL > stream. The WAL stream is not fixed so you have the luxury of being able to > add > a "Use new random salt XZY going forward" records. Forcing generation of a new > salt/key upon promotion of a replica would ensure that at least the WAL is > unique going forward. Could also generate a new upon server startup, after Ah, yes, good point, and using a derived key would make that easier. The tricky part is what to use to create the new derived key, unless we generate a random number and store that somewhere in the data directory, but that might lead to fragility, so I am worried. We have pg_rewind, which allows to make the WAL go backwards. What is the value in doing this? > every N bytes, or a new one for each new WAL file. There's much more > flexibility compared to page encryption. > > As WAL is a single continuous stream, we can start the IV for each derived WAL > key from zero. There's no need to complicate it further as Key + IV will never > be reused. Uh, you want a new random key for each WAL file? I was going to use the WAL segment number as the nonce, which is always increasing, and easily determined. The f
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Wed, Jul 31, 2019 at 9:23 AM Anastasia Lubennikova wrote: > > * Included my own pageinspect hack to visualize the minimum TIDs in > > posting lists. It's broken out into a separate patch file. The code is > > very rough, but it might help someone else, so I thought I'd include > > it. > Cool, I think we should add it to the final patchset, > probably, as separate function by analogy with tuple_data_split. Good idea. Attached is v5, which is based on your v4. The three main differences between this and v4 are: * Removed BT_COMPRESS_THRESHOLD stuff, for the reasons explained in my July 24 e-mail. We can always add something like this back during performance validation of the patch. Right now, having no BT_COMPRESS_THRESHOLD limit definitely improves space utilization for certain important cases, which seems more important than the uncertain/speculative downside. * We now have experimental support for unique indexes. This is broken out into its own patch. * We now handle LP_DEAD items in a special way within _bt_insertonpg_in_posting(). As you pointed out already, we do need to think about LP_DEAD items directly, rather than assuming that they cannot be on the page that _bt_insertonpg_in_posting() must process. More on that later. > If sizeof(t_info) + sizeof(key) < sizeof(t_tid), resulting posting tuple > can be > larger. It may happen if keysize <= 4 byte. > In this situation original tuples must have been aligned to size 16 > bytes each, > and resulting tuple is at most 24 bytes (6+2+4+6+6). So this case is > also safe. I still need to think about the exact details of alignment within _bt_insertonpg_in_posting(). I'm worried about boundary cases there. I could be wrong. > I changed DEBUG message to ERROR in v4 and it passes all regression tests. > I doubt that it covers all corner cases, so I'll try to add more special > tests. It also passes my tests, FWIW. > Hmm, I can't get the problem. > In current implementation each posting tuple is smaller than BTMaxItemSize, > so no split can lead to having tuple of larger size. That sounds correct, then. > No, we don't need them both. I don't mind combining them into one macro. > Actually, we never needed BTreeTupleGetMinTID(), > since its functionality is covered by BTreeTupleGetHeapTID. I've removed BTreeTupleGetMinTID() in v5. I think it's fine to just have a comment next to BTreeTupleGetHeapTID(), and another comment next to BTreeTupleGetMaxTID(). > The main reason why I decided to avoid applying compression to unique > indexes > is the performance of microvacuum. It is not applied to items inside a > posting > tuple. And I expect it to be important for unique indexes, which ideally > contain only a few live values. I found that the performance of my experimental patch with unique index was significantly worse. It looks like this is a bad idea, as you predicted, though we may still want to do deduplication/compression with NULL values in unique indexes. I did learn a few things from implementing unique index support, though. BTW, there is a subtle bug in how my unique index patch does WAL-logging -- see my comments within index_compute_xid_horizon_for_tuples(). The bug shouldn't matter if replication isn't used. I don't think that we're going to use this experimental patch at all, so I didn't bother fixing the bug. > if (ItemIdIsDead(itemId)) > continue; > > In the previous review Rafia asked about "some reason". > Trying to figure out if this situation possible, I changed this line to > Assert(!ItemIdIsDead(itemId)) in our test version. 
And it failed in a > performance > test. Unfortunately, I was not able to reproduce it. I found it easy enough to see LP_DEAD items within _bt_insertonpg_in_posting() when running pgbench with the extra unique index patch. To give you a simple example of how this can happen, consider the comments about BTP_HAS_GARBAGE within _bt_delitems_vacuum(). That probably isn't the only way it can happen, either. ISTM that we need to be prepared for LP_DEAD items during deduplication, rather than trying to prevent deduplication from ever having to see an LP_DEAD item. v5 makes _bt_insertonpg_in_posting() prepared to overwrite an existing item if it's an LP_DEAD item that falls in the same TID range (that's _bt_compare()-wise "equal" to an existing tuple, which may or may not be a posting list tuple already). I haven't made this code do something like call index_compute_xid_horizon_for_tuples(), even though that's needed for correctness (i.e. this new code is currently broken in the same way that I mentioned unique index support is broken). I also added a nearby FIXME comment to _bt_insertonpg_in_posting() -- I don't think think that the code for splitting a posting list in two is currently crash-safe. How do you feel about officially calling this deduplication, not compression? I think that it's a more accurate name for the technique. -- Peter Geoghegan v5-0001-Compression-deduplication-in-nbtree.patch Descr