Re: FETCH FIRST clause PERCENT option

2019-08-07 Thread Surafel Temesgen
Hi
On Wed, Aug 7, 2019 at 6:11 AM Kyotaro Horiguchi 
wrote:

>
> I have some comments.
>
> This patch uses distinct parameters for exact number and
> percentage. On the othe hand planner has a notion of
> tuple_fraction covering the both. The same handling is also used
> for tuple number estimation. Using the same scheme will make
> things far simpler. See the comment of grouping_planner().
>
>
It's because of a data type difference; in the planner the same data type covers both.

In executor part, addition to LimiteState.position, this patch
> uses LimiteState.backwardPosition to count how many tuples we're
> back from the end of the current tuplestore. But things will get
> simpler by just asking the tuplestore for the number of holding
> tuples.
>
>
backwardPosition holds how many tuples have been returned in the backward scan.


>
> +slot = node->subSlot;
>
> Why is this needed? The variable is properly set before use and
> the assignment is bogus in the first place.
>
>
It's because the tuplestore needs an initialized slot.


> The new code block in LIMIT_RESCAN in ExecLimit is useless since
> it is exatctly the same with existing code block. Why didn't you
> use the existing if block?
>

But they test different scenarios.


>
> >if (node->limitOption == PERCENTAGE)
> +{
> +node->perExactCount = ceil(node->percent *
> node->position / 100.0);
> +
> +
>
> node->position is the number of tuples returned to upper node so
> far (not the number of tuples this node has read from the lower
> node so far).  I don't understand what the expression means.
>

node->position holds the number of tuples this node has read from the lower
node so far; see the LIMIT_RESCAN state.


>
>
> +if (node->perExactCount == node->perExactCount + 1)
> +node->perExactCount++;
>
> What? The condition never be true. As the result, the following
> if block below won't run.
>

It becomes true depending on the number of tuples returned from the lower
node so far and the percentage specification.


>
> >/*
> + * Return the tuple up to the number of exact count in
> OFFSET
> + * clause without percentage value consideration.
> + */
> +if (node->perExactCount > 0)
> +{
> +
>
>
>
>
> +/*
> + * We may needed this tuple in backward scan so put it into
> + * tuplestore.
> + */
> +if (node->limitOption == PERCENTAGE)
> +{
> +tuplestore_puttupleslot(node->tupleStore, slot);
> +tuplestore_advance(node->tupleStore, true);
> +}
>
> "needed"->"need" ? The comment says that this is needed for
> backward scan, but it is actually required even in forward
> scan. More to the point, tuplestore_advance lacks comment.


ok


>
> Anyway, the code in LIMIT_RESCAN is broken in some ways. For
> example, if it is used as the inner scan of a NestLoop, the
> tuplestore accumulates tuples by every scan. You will see that
> the tuplestore stores up to 1000 tuples (10 times of the inner)
> by the following query.
>

This is because, in the percentage case, we scan the whole table.


regards
Surafel


Handling RestrictInfo in expression_tree_walker

2019-08-07 Thread Konstantin Knizhnik

Hi hackers,

I wonder if there is some particular reason for not handling the
T_RestrictInfo node tag in expression_tree_walker?
There are many data structures in Postgres which contain lists of
RestrictInfo, or expressions with RestrictInfo as a parameter (for example,
orclause in RestrictInfo).
To handle such cases one currently has to write code that iterates over the
list, calls expression_tree_walker for each list element, and handles
RestrictInfo in the callback function:


static bool
change_varno_walker(Node *node, ChangeVarnoContext *context)
{
    if (node == NULL)
        return false;

    /* Rewrite any Var that still references the old relation index. */
    if (IsA(node, Var) && ((Var *) node)->varno == context->oldRelid)
    {
        ((Var *) node)->varno = context->newRelid;
        ((Var *) node)->varnoold = context->newRelid;
        return false;
    }

    /* RestrictInfo must be handled here, since the walker would error out. */
    if (IsA(node, RestrictInfo))
    {
        change_rinfo((RestrictInfo *) node, context->oldRelid,
                     context->newRelid);
        return false;
    }

    return expression_tree_walker(node, change_varno_walker, context);
}

Are there any complaints against handling RestrictInfo in 
expression_tree_walker?
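
For concreteness, such handling might look roughly like the following case
inside expression_tree_walker() (just a sketch, assuming that walking the
clause and orclause fields is enough):

        case T_RestrictInfo:
            {
                RestrictInfo *rinfo = (RestrictInfo *) node;

                if (walker((Node *) rinfo->clause, context))
                    return true;
                if (walker((Node *) rinfo->orclause, context))
                    return true;
            }
            break;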


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: partition routing layering in nodeModifyTable.c

2019-08-07 Thread Amit Langote
On Wed, Aug 7, 2019 at 12:00 PM Etsuro Fujita  wrote:
> IIUC, I think we reached a consensus at least on the 0001 patch.
> Andres, would you mind if I commit that patch?

I just noticed obsolete references to es_result_relation_info that
0002 failed to remove.  One of them is in fdwhandler.sgml:


TupleTableSlot *
IterateDirectModify(ForeignScanState *node);


... The data that was actually inserted, updated
 or deleted must be stored in the
 
es_result_relation_info->ri_projectReturning->pi_exprContext->ecxt_scantuple
 of the node's EState.

We will need to rewrite this without mentioning
es_result_relation_info.  How about as follows:

- 
es_result_relation_info->ri_projectReturning->pi_exprContext->ecxt_scantuple
- of the node's EState.
+ 
ri_projectReturning->pi_exprContext->ecxt_scantuple
+ of the result relation's ResultRelInfo that has
+ been made available via node.

I've updated 0001 with the above change.

Also, I updated 0002 to remove other references.

Thanks,
Amit


v6-0001-Revise-BeginDirectModify-API-to-pass-ResultRelInf.patch
Description: Binary data


v6-0002-Remove-es_result_relation_info.patch
Description: Binary data


v6-0004-Refactor-transition-tuple-capture-code-a-bit.patch
Description: Binary data


v6-0003-Rearrange-partition-update-row-movement-code-a-bi.patch
Description: Binary data


Re: POC: Cleaning up orphaned files using undo logs

2019-08-07 Thread Kuntal Ghosh
Hello Andres,

On Tue, Aug 6, 2019 at 1:26 PM Andres Freund  wrote:
> > +/* Each worker queue is a binary heap. */
> > +typedef struct
> > +{
> > + binaryheap *bh;
> > + union
> > + {
> > + UndoXidQueue *xid_elems;
> > + UndoSizeQueue *size_elems;
> > + UndoErrorQueue *error_elems;
> > + }   q_choice;
> > +} UndoWorkerQueue;
>
> As we IIRC have decided to change this into a rbtree, I'll ignore
> related parts of the current code.  What is the status of that work?
> I've checked the git trees, without seeing anything? Your last mail with
> patches
> https://www.postgresql.org/message-id/CAA4eK1KKAFBCJuPnFtgdc89djv4xO%3DZkAdXvKQinqN4hWiRbvA%40mail.gmail.com
> doesn't seem to contain that either?
>
Yeah, we're changing this into an rbtree. This is still a work in progress.

>
> > .
> > +#define GetErrorQueueNthElem(n) \
> > +( \
> > + AssertMacro(!ErrorQueueIsEmpty()), \
> > + DatumGetPointer(binaryheap_nth(UndoWorkerQueues[ERROR_QUEUE].bh, n)) \
> > +)
>
>
> -ETOOMANYMACROS
>
> I think nearly all of these shouldn't exist. See further below.
>
>
> > +#define SetErrorQueueElem(elem, e_dbid, e_full_xid, e_start_urec_ptr, 
> > e_retry_at, e_occurred_at) \
> > +( \
> > + GetErrorQueueElem(elem).dbid = e_dbid, \
> > + GetErrorQueueElem(elem).full_xid = e_full_xid, \
> > + GetErrorQueueElem(elem).start_urec_ptr = e_start_urec_ptr, \
> > + GetErrorQueueElem(elem).next_retry_at = e_retry_at, \
> > + GetErrorQueueElem(elem).err_occurred_at = e_occurred_at \
> > +)
>
> It's very very rarely a good idea to have macros that evaluate their
> arguments multiple times. It'll also never be a good idea to get the
> same element multiple times from a queue.  If needed - I'm very doubtful
> of that, given that there's a single caller - it should be a static
> inline function that gets the element once, stores it in a local
> variable, and then updates all the fields.
>
Noted. Earlier, Robert also raised the point about using so many macros.
He also suggested using a single type of object that stores all the
information we need, which will make things simpler and easier to
understand. In the upcoming patch set, we're removing all these macros.
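
For reference, the static-inline shape suggested above might look roughly
like this (a sketch; the element type and the timestamp types are
assumptions based on the quoted macros, not the final patch):

static inline void
SetErrorQueueElem(int elem, Oid dbid, FullTransactionId full_xid,
                  UndoRecPtr start_urec_ptr, TimestampTz retry_at,
                  TimestampTz occurred_at)
{
    /* Evaluate GetErrorQueueElem() once instead of five times. */
    UndoErrorQueue *e = &GetErrorQueueElem(elem);

    e->dbid = dbid;
    e->full_xid = full_xid;
    e->start_urec_ptr = start_urec_ptr;
    e->next_retry_at = retry_at;
    e->err_occurred_at = occurred_at;
}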


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: Handling RestrictInfo in expression_tree_walker

2019-08-07 Thread Amit Langote
Hi Konstantin,

On Wed, Aug 7, 2019 at 4:24 PM Konstantin Knizhnik
 wrote:
>
> Hi hackers,
>
> I wonder if there is some particular reason for not handling
> T_RestrictInfo node tag in expression_tree_walker?
> There are many data structure in Postgres which contains lists of
> RestrictInfo or expression with RestrictInfo as parameter (for example
> orclause in RestrictInfo).
> To handle such cases now it is needed to write code performing list
> iteration and calling expression_tree_walker for each list element and
> handling RrestrictInfo in callback function:
>
> static bool
> change_varno_walker(Node *node, ChangeVarnoContext *context)
> {
>  if (node == NULL)
>  return false;
>
>  if (IsA(node, Var) && ((Var *) node)->varno == context->oldRelid)
>  {
>  ((Var *) node)->varno = context->newRelid;
>  ((Var *) node)->varnoold = context->newRelid;
>  return false;
>  }
>  if (IsA(node, RestrictInfo))
>  {
>  change_rinfo((RestrictInfo*)node, context->oldRelid,
> context->newRelid);
>  return false;
>  }
>  return expression_tree_walker(node, change_varno_walker, context);
> }
>
> Are there any complaints against handling RestrictInfo in
> expression_tree_walker?

As I understand it, RestrictInfo is not something that appears in
query trees or plan trees, but only in the planner data structures as
means of caching some information about the clauses that they wrap.  I
see this comment describing what expression_tree_walker() is supposed
to handle:

 * The node types handled by expression_tree_walker include all those
 * normally found in target lists and qualifier clauses during the planning
 * stage.

You may also want to read this discussion:

https://www.postgresql.org/message-id/553FC9BC.5060402%402ndquadrant.com

Thanks,
Amit




Re: Built-in connection pooler

2019-08-07 Thread Konstantin Knizhnik

Hi, Li

Thank you very much for reporting the problem.

On 07.08.2019 7:21, Li Japin wrote:

I inspected the code and found the following code in the DefineRelation function:

if (stmt->relation->relpersistence != RELPERSISTENCE_TEMP
      && stmt->oncommit != ONCOMMIT_DROP)
      MyProc->is_tainted = true;

For a temporary table, MyProc->is_tainted might be true, so I changed it as
follows:

if (stmt->relation->relpersistence == RELPERSISTENCE_TEMP
      || stmt->oncommit == ONCOMMIT_DROP)
      MyProc->is_tainted = true;

For temporary tables, it works. I'm not sure the change is right.

Sorry, it really is a bug.
My intention was to mark the backend as tainted if it creates a temporary
table without an ON COMMIT DROP clause (in the latter case the temporary
table is local to the transaction and so causes no problems for the pooler).

The right condition is:

    if (stmt->relation->relpersistence == RELPERSISTENCE_TEMP
        && stmt->oncommit != ONCOMMIT_DROP)
        MyProc->is_tainted = true;


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: POC: Cleaning up orphaned files using undo logs

2019-08-07 Thread Dilip Kumar
On Tue, Aug 6, 2019 at 1:26 PM Andres Freund  wrote:
>
> Hi,
>
> On 2019-08-05 11:29:34 -0700, Andres Freund wrote:
> > Need to do something else for a bit. More later.
>
> Here we go.
>
Thanks for the review.  I will work on the comments.  Currently, I need
suggestions on some of them.
>
> > + /*
> > +  * Compute the header size of the undo record.
> > +  */
> > +Size
> > +UndoRecordHeaderSize(uint16 uur_info)
> > +{
> > + Sizesize;
> > +
> > + /* Add fixed header size. */
> > + size = SizeOfUndoRecordHeader;
> > +
> > + /* Add size of transaction header if it presets. */
> > + if ((uur_info & UREC_INFO_TRANSACTION) != 0)
> > + size += SizeOfUndoRecordTransaction;
> > +
> > + /* Add size of rmid if it presets. */
> > + if ((uur_info & UREC_INFO_RMID) != 0)
> > + size += sizeof(RmgrId);
> > +
> > + /* Add size of reloid if it presets. */
> > + if ((uur_info & UREC_INFO_RELOID) != 0)
> > + size += sizeof(Oid);
> > +

> There's numerous blocks with one if for each type, and the body copied
> basically the same for each alternative. That doesn't seem like a
> reasonable approach to me. Means that many places need to be adjusted
> when we invariably add another type, and seems likely to lead to bugs
> over time.
>
I agree with the point that we are repeating this in a couple of functions
while performing different actions, e.g. in this function we are computing
the size and in some other function we are copying the field.  I am not
sure what the best way to handle it would be.  One approach could be to
write one function which handles all these cases, with the caller
indicating what action to take.  Basically, it would look like this:

Function(uur_info, action)
{
    if ((uur_info & UREC_INFO_TRANSACTION) != 0)
    {
        /* if action is "compute header size" */
        size += SizeOfUndoRecordTransaction;
        /* else if action is "copy to dest" */
        dest = src;
        ...
    }

    /* repeat for the other types */
}

But IMHO such a function would make it confusing for anyone to see what
exactly it is trying to achieve.  If anyone has a better idea, please
suggest it.

> > +/*
> > + * Insert the undo record into the input page from the unpack undo context.
> > + *
> > + * Caller can  call this function multiple times until desired stage is 
> > reached.
> > + * This will write the undo record into the page.
> > + */
> > +void
> > +InsertUndoData(UndoPackContext *ucontext, Page page, int starting_byte)
> > +{
> > + char   *writeptr = (char *) page + starting_byte;
> > + char   *endptr = (char *) page + BLCKSZ;
> > +
> > + switch (ucontext->stage)
> > + {
> > + case UNDO_PACK_STAGE_HEADER:
> > + /* Insert undo record header. */
> > + if (!InsertUndoBytes((char *) &ucontext->urec_hd,
> > +  
> > SizeOfUndoRecordHeader, &writeptr, endptr,
> > +  
> > &ucontext->already_processed,
> > +  
> > &ucontext->partial_bytes))
> > + return;
> > + ucontext->stage = UNDO_PACK_STAGE_TRANSACTION;
> > + /* fall through */
> > +
>
> I don't understand. The only purpose of this is that we can partially
> write a packed-but-not-actually-packed record onto a bunch of pages? And
> for that we have an endless chain of copy and pasted code calling
> InsertUndoBytes()? Copying data into shared buffers in tiny increments?
>
> If we need to this, what is the whole packed record format good for?
> Except for adding a bunch of functions with 10++ ifs and nearly
> identical code?
>
> Copying data is expensive. Copying data in tiny increments is more
> expensive. Copying data in tiny increments, with a bunch of branches, is
> even more expensive. Copying data in tiny increments, with a bunch of
> branches, is even more expensive, especially when it's shared
> memory. Copying data in tiny increments, with a bunch of branches, is
> even more expensive, especially when it's shared memory, especially when
> all that shared meory is locked at once.

My idea is, instead of keeping all these fields duplicated in the context,
to just allocate a single memory segment equal to the expected record size
(maybe the payload data can be kept separate).  Then, based on uur_info,
pack all the fields of UnpackedUndoRecord into that memory segment.  After
that, in InsertUndoData, we just need one call to InsertUndoBytes to copy
the complete header in one shot and another call to copy the payload data
(see the sketch below).  Does this sound reasonable to you?
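
A rough sketch of that, using the function names from the quoted code (the
packing step itself is only indicated by a comment, since its details depend
on the record layout):

    /* Compute the header size implied by uur_info and pack the header once. */
    Size        hdr_size = UndoRecordHeaderSize(uur->uur_info);
    char       *packed_hdr = palloc(hdr_size);

    /* ... append each optional header field selected by uur->uur_info ... */

    /* A single copy call for the whole header, instead of one per field. */
    if (!InsertUndoBytes(packed_hdr, hdr_size, &writeptr, endptr,
                         &ucontext->already_processed,
                         &ucontext->partial_bytes))
        return;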

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Handling RestrictInfo in expression_tree_walker

2019-08-07 Thread Konstantin Knizhnik




On 07.08.2019 10:42, Amit Langote wrote:

Hi Konstantin,

On Wed, Aug 7, 2019 at 4:24 PM Konstantin Knizhnik
 wrote:

Hi hackers,

I wonder if there is some particular reason for not handling
T_RestrictInfo node tag in expression_tree_walker?
There are many data structure in Postgres which contains lists of
RestrictInfo or expression with RestrictInfo as parameter (for example
orclause in RestrictInfo).
To handle such cases now it is needed to write code performing list
iteration and calling expression_tree_walker for each list element and
handling RrestrictInfo in callback function:

static bool
change_varno_walker(Node *node, ChangeVarnoContext *context)
{
  if (node == NULL)
  return false;

  if (IsA(node, Var) && ((Var *) node)->varno == context->oldRelid)
  {
  ((Var *) node)->varno = context->newRelid;
  ((Var *) node)->varnoold = context->newRelid;
  return false;
  }
  if (IsA(node, RestrictInfo))
  {
  change_rinfo((RestrictInfo*)node, context->oldRelid,
context->newRelid);
  return false;
  }
  return expression_tree_walker(node, change_varno_walker, context);
}

Are there any complaints against handling RestrictInfo in
expression_tree_walker?

As I understand it, RestrictInfo is not something that appears in
query trees or plan trees, but only in the planner data structures as
means of caching some information about the clauses that they wrap.  I
see this comment describing what expression_tree_walker() is supposed
to handle:

  * The node types handled by expression_tree_walker include all those
  * normally found in target lists and qualifier clauses during the planning
  * stage.

You may also want to read this discussion:

https://www.postgresql.org/message-id/553FC9BC.5060402%402ndquadrant.com

Thanks,
Amit

Thank you very much for the response and for pointing me to this thread.
Unfortunately, I do not understand from that thread how the problem was
solved for the pull_vars case: right now pull_varnos_walker and
pull_varattnos_walker do not handle RestrictInfo.

Also, I do not completely understand the argument that "RestrictInfo is not
a general expression node and support for it has been deliberately omitted
from expression_tree_walker()".  If there is a BoolExpr which contains a
RestrictInfo among its arguments, then either this expression is not
correct, or RestrictInfo should be considered an "expression node".

Frankly speaking, I do not see good reasons for not handling RestrictInfo
in expression_tree_walker; it would really simplify writing mutators and
walkers.  And I do not think that reporting an error instead of handling
this tag adds any extra safety or error protection.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-07 Thread Masahiko Sawada
On Wed, Aug 7, 2019 at 2:55 AM Bruce Momjian  wrote:
>
> On Wed, Aug  7, 2019 at 12:31:58AM +0900, Masahiko Sawada wrote:
> > Well, so you mean that for example we encrypt only 100 bytes WAL
> > record when append 100 bytes WAL records?
> >
> > For WAL encryption, if we encrypt the entire 8k WAL page and write the
> > entire page, the encrypted-and-written page will contain 100 bytes WAL
> > record data and (8192-100) bytes garbage (omitted WAL page header for
> > simplify), although WAL data on WAL buffer is still not encrypted
> > state. And then if we append 200 bytes again, the
> > encrypted-and-written page will contain 300 bytes WAL record data and
> > (8192-300)bytes garbage, data on WAL buffer is still not encrypted
> > state though.
> >
> > In this case I think the first 100 bytes of two 8k WAL pages are the
> > same because we encrypted both from the beginning of the page with the
> > counter = 0. But the next 200 bytes are different; it's (encrypted)
> > garbage in the former case but it's (encrypted) WAL record data in the
> > latter case. I think that's a problem.
> >
> > On the other hand, if we encrypt 8k WAL page with the different
> > counter of nonce after append 200 byes WAL record, the first 100 byte
> > (and of course the entire 8k page also) will be different. However
> > since it's the same thing doing as changing already-flushed WAL record
> > on the disk it's bad.
> >
> > Also, if we encrypt only append data instead of entire 8k page, we
> > would need to have the information  in somewhere about how much byte
> > the WAL page has valid values. Otherwise reading WAL would not work
> > fine.
>
> OK, onlist reply.  We are going to encrypt the _entire_ WAL stream as we
> write it, which is possible with CTR.  If we write 200 bytes of WAL, we
> encrypt/XOR 200 bytes of WAL.  If we write 10k of WAL, and 8k of that is
> an 8k page, we encrypt the entire 10k of WAL --- we don't care if there
> is an 8k page in there or not.
>
> CTR mode creates a bit stream for the first 16 bytes with nonce of
> (segment_number, counter = 0), and the next 16 bytes with
> (segment_number, counter = 1), etc.  We only XOR using the parts of the
> bit stream we want to use.  We don't care what the WAL content is --- we
> just XOR it with the stream with the matching counter for that part of
> the WAL.
>
> It is true we are encrypting the same 8k page in the heap/index page,
> and in WAL, with different key/nonce combinations, which I think is
> secure.  What is insecure is to encrypt two different pieces of data
> with the same key/nonce combination.
>

Understood. IIUC, in your approach postgres processes encrypt WAL records
when inserting them into the WAL buffer, so the WAL data is encrypted even
in the WAL buffer.

It works, but I think the implementation might be complex. For example,
using OpenSSL we would use the EVP functions to encrypt data with
AES-256-CTR. We would need to build the IV and pass it to them, but these
functions don't manage the counter value of the nonce, as far as I can
tell. That is, we need to calculate the correct counter value for each
encryption and pass it to the EVP functions. Suppose we encrypt 20 bytes of
WAL: the first 16 bytes are encrypted with the nonce (segment_number, 0)
and the next 4 bytes with the nonce (segment_number, 1). Suppose we then
encrypt another 12 bytes of WAL. We cannot use the nonce
(segment_number, 2) but must reuse (segment_number, 1). Therefore we would
need 4 bytes of padding, encrypt it, and then throw those 4 bytes away.
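
To illustrate the bookkeeping, here is a sketch of encrypting a WAL span
that starts at an arbitrary byte offset with the EVP interface.  The IV
layout (an 8-byte segment nonce followed by an 8-byte big-endian block
counter) and the helper itself are assumptions for illustration only, not a
decided design:

#include <stdint.h>
#include <openssl/evp.h>

#define AES_BLOCK 16

/* Encrypt "len" bytes of WAL beginning at byte "start_offset" of a segment
 * using AES-256-CTR.  Returns 1 on success, 0 on failure. */
static int
encrypt_wal_span(const unsigned char *key, uint64_t segno,
                 uint64_t start_offset,
                 const unsigned char *in, unsigned char *out, int len)
{
    EVP_CIPHER_CTX *ctx;
    unsigned char iv[16];
    unsigned char sink[AES_BLOCK] = {0};
    uint64_t    counter = start_offset / AES_BLOCK;
    int         skip = (int) (start_offset % AES_BLOCK);
    int         outlen;
    int         i;

    /* IV = 8-byte segment nonce || 8-byte big-endian block counter. */
    for (i = 0; i < 8; i++)
    {
        iv[i] = (unsigned char) (segno >> (56 - 8 * i));
        iv[8 + i] = (unsigned char) (counter >> (56 - 8 * i));
    }

    if ((ctx = EVP_CIPHER_CTX_new()) == NULL)
        return 0;

    if (EVP_EncryptInit_ex(ctx, EVP_aes_256_ctr(), NULL, key, iv) != 1 ||
        /* Discard keystream up to the intra-block starting position. */
        (skip > 0 && EVP_EncryptUpdate(ctx, sink, &outlen, sink, skip) != 1) ||
        EVP_EncryptUpdate(ctx, out, &outlen, in, len) != 1)
    {
        EVP_CIPHER_CTX_free(ctx);
        return 0;
    }

    EVP_CIPHER_CTX_free(ctx);
    return 1;
}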


Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: POC: Cleaning up orphaned files using undo logs

2019-08-07 Thread Dilip Kumar
On Tue, Aug 6, 2019 at 1:26 PM Andres Freund  wrote:
>
> Hi,
>
> On 2019-08-05 11:29:34 -0700, Andres Freund wrote:
> > Need to do something else for a bit. More later.
>
> > + *   false, otherwise.
> > + */
> > +static bool
> > +UndoAlreadyApplied(FullTransactionId full_xid, UndoRecPtr to_urecptr)
> > +{
> > + UnpackedUndoRecord *uur = NULL;
> > + UndoRecordFetchContext  context;
> > +
> > + /* Fetch the undo record. */
> > + BeginUndoFetch(&context);
> > + uur = UndoFetchRecord(&context, to_urecptr);
> > + FinishUndoFetch(&context);
>
> Literally all the places that fetch a record, fetch them with exactly
> this combination of calls. If that's the pattern, what do we gain by
> this split?   Note that UndoBulkFetchRecord does *NOT* use an
> UndoRecordFetchContext, for reasons that are beyond me.

Actually, this is for zheap or any other AM that needs to traverse a
transaction's undo chain. For example, in zheap we get the latest undo
record pointer from the slot, but we need to traverse the undo record chain
backward, using the prevundo pointer stored in each undo record, to find
the undo record for a particular tuple.  Earlier, there was a loop in
UndoFetchRecord which traversed the undo chain until it found the matching
record; the record was matched using a callback.  There was also an
optimization that if the current record didn't satisfy the callback, we
kept the pin held on the buffer and moved to the previous record in the
chain.  Later, based on review comments from Robert, we decided that
finding the matching undo record should be the caller's responsibility, so
we moved the loop out of UndoFetchRecord and kept it in the zheap code.
The reason for keeping the context is that we can keep the buffer pin held
and remember that buffer in the context, so that the caller can call
UndoFetchRecord in a loop and the pin will be held on the buffer from which
we read the last undo record.

I agree that in the undo-processing patch set we always need to fetch one
record, so instead of repeating this pattern everywhere we can write one
function and move this sequence of calls into it, as sketched below.
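
Something as simple as the following would do (a sketch; the wrapper name is
illustrative, the calls are the ones from the quoted hunk):

static UnpackedUndoRecord *
UndoFetchOneRecord(UndoRecPtr urecptr)
{
    UndoRecordFetchContext context;
    UnpackedUndoRecord *uur;

    BeginUndoFetch(&context);
    uur = UndoFetchRecord(&context, urecptr);
    FinishUndoFetch(&context);

    return uur;
}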

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Handling RestrictInfo in expression_tree_walker

2019-08-07 Thread Amit Langote
Hi,

On Wed, Aug 7, 2019 at 5:07 PM Konstantin Knizhnik
 wrote:
> On 07.08.2019 10:42, Amit Langote wrote:
> > You may also want to read this discussion:
> >
> > https://www.postgresql.org/message-id/553FC9BC.5060402%402ndquadrant.com
> >
> Thank you very much for response and pointing me to this thread.
> Unfortunately I do not understand from this thread how the problem was
> solved with pullvars - right now  pull_varnos_walker and
> pull_varattnos_walker
> are not handling RestrictInfo.
>
> Also I do not completely understand the argument "RestrictInfo is not a
> general expression node and support for it has
> been deliberately omitted from expression_tree_walker()". If there is
> BoolOp expression which contains RestrictInfo expression as it
> arguments, then either this expression is not
> correct, either RestrictInfo should be considered as "expression node".
>
> Frankly speaking I do not see some good reasons for not handling
> RestrictInfo in expression_tree_worker. It can really simplify writing
> of mutators/walkers.
> And I do not think that reporting error instead of handling this tag
> adds some extra safety or error protection.

Well, Tom has expressed in various words in that thread that expecting
to successfully run expression_tree_walker() on something containing
RestrictInfos may be a sign of bad design somewhere in the code that
you're trying to add.  I have recollections of submitting such code,
but later realizing that there's some other way to do things
differently that doesn't require walking expressions containing
RestrictInfos.

Btw, looking at the example walker function you've shown in the first
email, maybe you want to use a mutator, not a walker.  The latter
class of functions is only supposed to inspect the input tree, not
modify it.

Thanks,
Amit




Re: Small patch to fix build on Windows

2019-08-07 Thread Kyotaro Horiguchi
Hi,

At Tue, 6 Aug 2019 22:50:14 +0300, Dmitry Igrishin  wrote in 

> The attached self-documented patch fixes build on Windows in case when
> path to Python has embedded spaces.

-  $solution->{options}->{python} . "\\python -c \"$pythonprog\"";
+  "\"$solution->{options}->{python}\\python\" -c \"$pythonprog\"";

Solution.pm has the following line:

>   my $opensslcmd =
> $self->{options}->{openssl} . "\\bin\\openssl.exe version 2>&1";

AFAICS that's all.


-if ($lib =~ m/\s/)
-{
-$lib = '&quot;' . $lib . '&quot;';
-}
+# Since VC automatically quotes paths specified as the data of
+#  in VC project file, it's mistakably
+# to quote them here. Thus, it's okay if $lib contains spaces.

I'm not sure, but it's not likely that someone added this without
actually stumbling on space-containing paths with an earlier
version. Anyway, we shouldn't touch this unless the existing
code causes an actual problem.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




s/rewinded/rewound/?

2019-08-07 Thread Thomas Munro
Hello,

The word "rewinded" appears in our manual and in a comment.  That
sounds strange to my ears.  Isn't it a mistake?  Oxford lists the form
as "poetic" and "rare", and then says it was used by one specific
Victorian poet.  Perhaps I'll send them a pull request: it's now G. M.
Hopkins and PostgreSQL?  Or maybe it's in common usage in another part
of the world?

-- 
Thomas Munro
https://enterprisedb.com




Re: s/rewinded/rewound/?

2019-08-07 Thread Magnus Hagander
On Wed, Aug 7, 2019 at 10:49 AM Thomas Munro  wrote:

> Hello,
>
> The word "rewinded" appears in our manual and in a comment.  That
> sounds strange to my ears.  Isn't it a mistake?  Oxford lists the form
> as "poetic" and "rare", and then says it was used by one specific
> Victorian poet.  Perhaps I'll send them a pull request: it's now G. M.
> Hopkins and PostgreSQL?  Or maybe it's in common usage in another part
> of the world?
>

To me this sounds like a classic non-English-native-speaker-mistake.  But
it seems at least the one in the docs comes from Bruce, who definitely is...
So perhaps it's intentional to refer to "what pg_rewind does", and not
necessarily to the regular word for it?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: s/rewinded/rewound/?

2019-08-07 Thread Michael Paquier
On Wed, Aug 07, 2019 at 10:53:45AM +0200, Magnus Hagander wrote:
> To me this sounds like a classic non-English-native-speaker-mistake.  But
> it seems at least the one in the docs comes from Bruce, who definitely is...
> So perhaps it's intentional to refer to "what pg_rewind does", and not
> necessarily to the regular word for it?

I am not sure :)
"rewound" sounds much more natural.
--
Michael


signature.asc
Description: PGP signature


Re: partition routing layering in nodeModifyTable.c

2019-08-07 Thread Etsuro Fujita
Amit-san,

On Wed, Aug 7, 2019 at 4:28 PM Amit Langote  wrote:
> On Wed, Aug 7, 2019 at 12:00 PM Etsuro Fujita  wrote:
> > IIUC, I think we reached a consensus at least on the 0001 patch.
> > Andres, would you mind if I commit that patch?
>
> I just noticed obsolete references to es_result_relation_info that
> 0002 failed to remove.  One of them is in fdwhandler.sgml:
>
> 
> TupleTableSlot *
> IterateDirectModify(ForeignScanState *node);
> 
>
> ... The data that was actually inserted, updated
>  or deleted must be stored in the
>  
> es_result_relation_info->ri_projectReturning->pi_exprContext->ecxt_scantuple
>  of the node's EState.
>
> We will need to rewrite this without mentioning
> es_result_relation_info.  How about as follows:
>
> - 
> es_result_relation_info->ri_projectReturning->pi_exprContext->ecxt_scantuple
> - of the node's EState.
> + 
> ri_projectReturning->pi_exprContext->ecxt_scantuple
> of the result relation's ResultRelInfo that has
> + been made available via node.
>
> I've updated 0001 with the above change.

Good catch!

This would be nitpicking, but:

* IIUC, we don't use the term "result relation" in fdwhandler.sgml.
For consistency with your change to the doc for BeginDirectModify, how
about using the term "target foreign table" instead of "result
relation"?

* ISTM that "ResultRelInfo that has been made
available via node" would be a bit fuzzy to FDW authors.  To be more
specific, how about changing it to
"ResultRelInfo passed to
BeginDirectModify" or something like that?

Best regards,
Etsuro Fujita




Re: no default hash partition

2019-08-07 Thread Magnus Hagander
On Wed, Aug 7, 2019 at 5:26 AM Robert Haas  wrote:

> On Tue, Aug 6, 2019 at 6:58 PM Tom Lane  wrote:
> > Hmm.  So given the point about it being hard to predict which hash
> > partitions would receive what values ... under what circumstances
> > would it be sensible to not create a full set of partitions?  Should
> > we just enforce that there is a full set, somehow?
>
> I think it would only be sensible as a temporary state.  The system
> allows more than one modulus so that you can do partition split
> incrementally.  For example if you have 8 partitions all with modulus
> 8 and with remainders 0..7, you could:
>
> - detach the partition with (modulus 8, remainder 0)
> - attach two new partitions with (modulus 16, remainder 0) and
> (modulus 16, remainder 8)
> - move the data from the old partition to the new ones
>
> Then you'd have 9 partitions, and you'd only have taken the amount of
> downtime needed to repartition 1/8th of your data.  You could then
> repeat this process one partition at a time during additional
> maintenance windows, and end up with 16 partitions in the end.
> Without the ability to have more than one modulus, or if you had
> chosen not to double the modulus but to change it to some other value
> like 13, you would've needed to repartition all the data at once,
> which would have required one much longer outage.  You can argue about
> whether the ability to do this kind of thing is useful, but it seemed
> to me that it was.
>
> I think, as Amit says, that having an automatic partition creation
> feature for hash partitions (and maybe other kinds, but certainly for
> hash) would be a useful thing to add to the system. I also think that
> it might be useful to add some commands to automate partition
> splitting (and maybe combining) although I think there's some design
> work to be done there to figure out exactly what we should build.  I
> don't think it's ever useful to have a hash-partitioned table with an
> incomplete set of partitions long term, but it makes things simpler to
> allow that temporarily, for example during dump restoration.
> Therefore, I see no reason why we would want to go to the trouble of
> allowing hash-partitioned tables to have default partitions; it would
> just encourage people to do things that don't really make any sense.
>

Another use case for not having all partitions temporarily is if some of
them should be different enough that you don't want them auto-created. A
common one would be that they should be on different tablespaces, but that
can of course be solved by moving the partition after it has been
auto-created (and that should be fast, since at this point it would be
empty). But imagine you wanted one partition to be a FOREIGN one, for
example: you can't ALTER a partition to become foreign, you'd have to drop
it and recreate it, in which case not having created it in the first place
would've been better. That's pretty weird for hash partitioning, but one
could certainly imagine having *all* partitions of a hash-partitioned table
be FOREIGN...

None of that is solved by having a default partition, though, since it would
only be a temporary state. It only goes to show that if we do want to
auto-create the hash partitions (which I think would be really useful for
the most common use case), we should have a way not to do it: either by only
auto-creating them if a specific keyword is given, or by having a keyword
that prevents it.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2019-08-07 Thread Evgeny Efimkin
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

Hi!
The patch looks good to me.

The new status of this patch is: Ready for Committer


Re: Small patch to fix build on Windows

2019-08-07 Thread Dmitry Igrishin
On Wed, Aug 7, 2019 at 11:29, Kyotaro Horiguchi :
>
> Hi,
>
> At Tue, 6 Aug 2019 22:50:14 +0300, Dmitry Igrishin  wrote 
> in 
> > The attached self-documented patch fixes build on Windows in case when
> > path to Python has embedded spaces.
>
> -  $solution->{options}->{python} . "\\python -c \"$pythonprog\"";
> +  "\"$solution->{options}->{python}\\python\" -c \"$pythonprog\"";
>
> Solution.pm has the following line:
>
> >   my $opensslcmd =
> > $self->{options}->{openssl} . "\\bin\\openssl.exe version 2>&1";
>
> AFAICS that's all.
Thank you! The attached 2nd version of the patch fixes this too.

>
>
> -if ($lib =~ m/\s/)
> -{
> -$lib = '&quot;' . $lib . '&quot;';
> -}
> +# Since VC automatically quotes paths specified as the data of
> +#  in VC project file, it's mistakably
> +# to quote them here. Thus, it's okay if $lib contains spaces.
>
> I'm not sure, but it's not likely that someone adds it without
> actually stumbling on space-containing paths with the ealier
> version. Anyway if we shouldn't touch this unless the existing
> code makes actual problem.
So, do you think a comment is not needed here?
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index d1d0aed07e..76834f5188 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -495,7 +495,7 @@ sub mkvcbuild
 		my $pythonprog = "import sys;print(sys.prefix);"
 		  . "print(str(sys.version_info[0])+str(sys.version_info[1]))";
 		my $prefixcmd =
-		  $solution->{options}->{python} . "\\python -c \"$pythonprog\"";
+		  "\"$solution->{options}->{python}\\python\" -c \"$pythonprog\"";
 		my $pyout = `$prefixcmd`;
 		die "Could not query for python version!\n" if $?;
 		my ($pyprefix, $pyver) = split(/\r?\n/, $pyout);
diff --git a/src/tools/msvc/Project.pm b/src/tools/msvc/Project.pm
index b5d1dc6e89..28893f072d 100644
--- a/src/tools/msvc/Project.pm
+++ b/src/tools/msvc/Project.pm
@@ -132,10 +132,9 @@ sub AddLibrary
 {
 	my ($self, $lib, $dbgsuffix) = @_;
 
-	if ($lib =~ m/\s/)
-	{
-		$lib = '&quot;' . $lib . '&quot;';
-	}
+	# Since VC automatically quotes paths specified as the data of
+	#  in VC project file, it's mistakably
+	# to quote them here. Thus, it's okay if $lib contains spaces.
 
 	push @{ $self->{libraries} }, $lib;
 	if ($dbgsuffix)
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index 318594db5d..327e556c53 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -126,7 +126,7 @@ sub GetOpenSSLVersion
 	# Attempt to get OpenSSL version and location.  This assumes that
 	# openssl.exe is in the specified directory.
 	my $opensslcmd =
-	  $self->{options}->{openssl} . "\\bin\\openssl.exe version 2>&1";
+	  "\"$self->{options}->{openssl}\\bin\\openssl.exe\" version 2>&1";
 	my $sslout = `$opensslcmd`;
 
 	$? >> 8 == 0


Re: POC: Cleaning up orphaned files using undo logs

2019-08-07 Thread Amit Kapila
On Tue, Aug 6, 2019 at 1:26 PM Andres Freund  wrote:
> On 2019-08-05 11:29:34 -0700, Andres Freund wrote:
>

I am responding to some of the points where I need more input or where some
discussion is required.  Some of the things need more thought, and I will
respond to them later; some are quite straightforward and don't need much
discussion.

>
> > +/*
> > + * Binary heap comparison function to compare the time at which an error
> > + * occurred for transactions.
> > + *
> > + * The error queue is sorted by next_retry_at and err_occurred_at.  
> > Currently,
> > + * the next_retry_at has some constant delay time (see 
> > PushErrorQueueElem), so
> > + * it doesn't make much sense to sort by both values.  However, in future, 
> > if
> > + * we have some different algorithm for next_retry_at, then it will work
> > + * seamlessly.
> > + */
>
> Why is it useful to have error_occurred_at be part of the comparison at
> all? If we need a tiebraker, err_occurred_at isn't that (if we can get
> conflicts for next_retry_at, then we can also get conflicts in
> err_occurred_at).  Seems better to use something actually guaranteed to
> be unique for a tiebreaker.
>

This was to distinguish the case where a request has failed multiple times
from the case where it failed for the first time.  I agree that we need a
better unique identifier, like the FullTransactionId, though.  Do let me
know if you have any other suggestion.
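
For example, with a FullTransactionId tiebreaker the error-queue comparator
could look roughly like this (a sketch; the struct and field names follow
the patch, and the binary heap's min/max ordering convention is glossed
over):

static int
undo_error_queue_cmp(Datum a, Datum b, void *arg)
{
    UndoErrorQueue *ea = (UndoErrorQueue *) DatumGetPointer(a);
    UndoErrorQueue *eb = (UndoErrorQueue *) DatumGetPointer(b);

    /* Earlier retry time sorts first. */
    if (ea->next_retry_at != eb->next_retry_at)
        return (ea->next_retry_at < eb->next_retry_at) ? 1 : -1;

    /* Tiebreaker: the older transaction sorts first. */
    if (U64FromFullTransactionId(ea->full_xid) !=
        U64FromFullTransactionId(eb->full_xid))
        return FullTransactionIdPrecedes(ea->full_xid, eb->full_xid) ? 1 : -1;

    return 0;
}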

>
>
> > + /*
> > +  * The rollback hash table is used to avoid duplicate undo requests by
> > +  * backends and discard worker.  The table must be able to accomodate 
> > all
> > +  * active undo requests.  The undo requests must appear in both xid 
> > and
> > +  * size requests queues or neither.  In same transaction, there can 
> > be two
> > +  * requests one for logged relations and another for unlogged 
> > relations.
> > +  * So, the rollback hash table size should be equal to two request 
> > queues,
> > +  * an error queue (currently this is same as request queue) and max
>
> "the same"? I assume this intended to mean the same size?
>

Yes. I will add the word "size" to make it clearer.

>
> > +  * backends. This will ensure that it won't get filled.
> > +  */
>
> How does this ensure anything?
>

Because, based on this, we will have a hard limit on the number of undo
requests, after which we won't allow more requests.  See the more detailed
explanation later in this email.  I think the comment needs to be updated.

>
> > +  * the binary heaps which can change.
> > +  */
> > + Assert(LWLockHeldByMeInMode(RollbackRequestLock, LW_EXCLUSIVE));
> > +
> > + /*
> > +  * We normally push the rollback request to undo workers if the size 
> > of
> > +  * same is above a certain threshold.
> > +  */
> > + if (req_size >= rollback_overflow_size * 1024 * 1024)
> > + {
>
> Why is this being checked with the lock held? Seems like this should be
> handled in a pre-check?
>

Yeah, it can be a pre-check, but I thought it better to encapsulate
everything in the function, as this is not an expensive check.  I think we
can move it outside the lock to avoid any such confusion.

>
> > + * allow_peek - if true, peeks a few element from each queue to check 
> > whether
> > + * any request matches current dbid.
> > + * remove_from_queue - if true, picks an element from the queue whose dbid
> > + * matches current dbid and remove it from the queue before returning the 
> > same
> > + * to caller.
> > + * urinfo - this is an OUT parameter that returns the details of undo 
> > request
> > + * whose undo action is still pending.
> > + * in_other_db_out - this is an OUT parameter.  If we've not found any work
> > + * for current database, but there is work for some other database, we set
> > + * this parameter as true.
> > + */
> > +bool
> > +UndoGetWork(bool allow_peek, bool remove_from_queue, UndoRequestInfo 
> > *urinfo,
> > + bool *in_other_db_out)
> > +{
>
>
> > + /*
> > +  * If some undo worker is already processing the rollback 
> > request or
> > +  * it is already processed, then we drop that request from 
> > the queue
> > +  * and fetch the next entry from the queue.
> > +  */
> > + if (!rh || UndoRequestIsInProgress(rh))
> > + {
> > + RemoveRequestFromQueue(cur_queue, 0);
> > + cur_undo_queue++;
> > + continue;
> > + }
>
> When is it possible to hit the in-progress case?
>

The same request is in two queues. It is possible that while the request is
being processed from the xid queue by one worker, the request from the other
queue is picked up by another worker.  I think this case won't exist after
we switch to rbtree-based queues.

> > +/*
> > + * UpdateUndoApplyProgress - Updates how far undo actions from a particular
> > + * log have been applied while rolling back a trans

Re: SegFault on 9.6.14

2019-08-07 Thread vignesh C
On Wed, Jul 31, 2019 at 9:37 AM Amit Kapila  wrote:
>
> On Wed, Jul 31, 2019 at 12:05 AM Robert Haas  wrote:
> >
> > On Thu, Jul 18, 2019 at 9:45 AM Tom Lane  wrote:
> > > I think this is going in the wrong direction.  Nodes should *always*
> > > assume that a rescan is possible until ExecEndNode is called.
> > > If you want to do otherwise, you are going to be inventing a whole
> > > bunch of complicated and doubtless-initially-buggy control logic
> > > to pass down information about whether a rescan might be possible.
> > > That doesn't sound like a recipe for a back-patchable fix.  Perhaps
> > > we could consider redesigning the rules around REWIND in a future
> > > version, but that's not where to focus the bug fix effort.
> >
> > So, if I can summarize how we got here, as best I understand it:
> >
>
> Thanks for the summarization.  This looks mostly correct to me.
>
> > 0. The historic behavior of the executor is to assume it's OK to leak
> > resources for the lifetime of the query.  Nodes that are executed to
> > completion generally do some cleanup, but we feel free (as under
> > Limit) to just stop executing a node without giving it any hint that
> > it should release resources.  So a Sort may hold onto a terabyte of
> > memory and an index scan may keep holding a pin even after there's no
> > theoretical way of ever needing those resources again, and we just
> > don't care.
> >
> > 1. Parallel query made that perhaps-already-shaky assumption a lot
> > more problematic. Partly that's because workers are a a more scarce
> > and considerably heavier resource than anything else, and moreover act
> > as a container for anything else, so whatever you were leaking before,
> > you can now leak N times more of it, plus N processes, until the end
> > of the query. However, there's a correctness reason too, which is that
> > when a node has a copy in the leader and a copy in every worker, each
> > copy has its own instrumentation data (startup time, run time, nloops,
> > etc) and we can only fold all that together once the node is done
> > executing, because it's really hard to add up a bunch of numbers
> > before the numbers are done changing.  We could've made the
> > instrumentation shared throughout, but if we had, we could have
> > contention for updating the instrumentation data, which seems like
> > it'd be bad.
> >
> > 2. To fix that correctness problem, we decided to try to shut down the
> > node under a limit node when we're done with it (commit
> > 85c9d3475e4f680dbca7c04fe096af018f3b8760). At a certain level, this
> > looks fundamentally necessary to me.  If you're going to have N
> > separate copies of the instrumentation, and you want to add them up
> > when you're done, then you have to decide to be done at some point;
> > otherwise you don't know when to add them up, and maybe won't add them
> > up at all, and then you'll be sad.  This does not mean that the exact
> > timing couldn't be changed somehow, but if you want a correct
> > implementation, you have to shut down Limit's sub-node after you're
> > done executing it (so that you can get the instrumentation data from
> > the workers after it's final) and before you start destroying DSM
> > segments and stuff (so that you get the instrumentation data from the
> > workers before it vanishes).
> >
> > 3. The aforementioned commit turned out to be buggy in at least to two
> > ways, precisely because it didn't do a good enough job predicting when
> > the Limit needed to be shut down.  First, there was commit
> > 2cd0acfdade82f3cab362fd9129d453f81cc2745, where we missed the fact
> > that you could hit the Limit and then back up.
> >
>
> We have not missed it; rather, we decided to handle it separately because
> it appears to impact some different cases as well [1][2].
>
> >  Second, there's the
> > present issue, where the Limit gets rescanned.
> >
> > So, given all that, if we want to adopt Tom's position that we should
> > always cater to a possible rescan, then we're going to have to rethink
> > the way that instrumentation data gets consolidated from workers into
> > the leader in such a way that we can consolidate multiple times
> > without ending up with the wrong answer.
> >
>
> The other idea we had discussed which comes closer to adopting Tom's
> position was that during ExecShutdownNode, we just destroy parallel
> workers, collect instrumentation data and don't destroy the parallel
> context.  The parallel context could be destroyed in ExecEndNode
> (ExecEndGather(Merge)) code path.  The problem with this idea is that
> ExitParallelMode doesn't expect parallel context to be active.  Now,
> we can either change the location of Exit/EnterParallelMode or relax
> that restriction.  As mentioned above that restriction appears good to
> me, so I am not in favor of changing it unless we have some other
> solid way to install it.   I am not sure if this idea is better than
> other approaches we are discussing.
>
>
I have made a patch based on the abo

Re: block-level incremental backup

2019-08-07 Thread Jeevan Chalke
On Mon, Aug 5, 2019 at 7:13 PM Robert Haas  wrote:

> On Fri, Aug 2, 2019 at 9:13 AM vignesh C  wrote:
> > + rc = system(copycmd);
>
> I don't think this patch should be calling system() in the first place.
>

So, do you mean we should just do fread() and fwrite() for the whole file?

I thought it would be better if it were done by the OS itself instead of
reading 1 GB into memory and writing the same back to the file.


> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: s/rewinded/rewound/?

2019-08-07 Thread Liudmila Mantrova

On 8/7/19 12:00 PM, Michael Paquier wrote:

On Wed, Aug 07, 2019 at 10:53:45AM +0200, Magnus Hagander wrote:

To me this sounds like a classic non-English-native-speaker-mistake.  But
it seems at least the one in the docs come from Bruce, who definitely is...
So perhaps it's intentional to refer to "what pg_rewind does", and not
necessarily to the regular word for it?

I am not sure :)
"rewound" sounds much more natural.
--
Michael


+1 for "rewound" from a non-native English speaker. The use of "rewound"
in the same file also supports Michael's view.


If we decide to fix this, we should probably revise and back-patch the whole
paragraph where it appears, as it seems to mix up scanning the target
cluster's WAL and applying the source cluster's WAL. A small patch is
attached for your consideration (originally proposed on pgsql-docs [1]).


[1] 
https://www.postgresql.org/message-id/ad6ac5bb-6689-ddb0-dc60-c5fc197d728e%40postgrespro.ru 



--
Liudmila Mantrova
Technical writer at Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 52a1caa..a7e1705 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -66,14 +66,12 @@ PostgreSQL documentation
can be found either on the target timeline, the source timeline, or their common
ancestor. In the typical failover scenario where the target cluster was
shut down soon after the divergence, this is not a problem, but if the
-   target cluster ran for a long time after the divergence, the old WAL
+   target cluster ran for a long time after the divergence, its old WAL
files might no longer be present. In that case, they can be manually
-   copied from the WAL archive to the pg_wal directory, or
-   fetched on startup by configuring  or
-   .  The use of
-   pg_rewind is not limited to failover, e.g.  a standby
-   server can be promoted, run some write transactions, and then rewinded
-   to become a standby again.
+   copied from the WAL archive to the pg_wal directory.
+   The use of pg_rewind is not limited to failover,
+   e.g. a standby server can be promoted, run some write transactions, and then
+   get rewound to become a standby again.
   
 
   


Re: block-level incremental backup

2019-08-07 Thread Ibrar Ahmed
On Wed, Aug 7, 2019 at 2:47 PM Jeevan Chalke 
wrote:

>
>
> On Mon, Aug 5, 2019 at 7:13 PM Robert Haas  wrote:
>
>> On Fri, Aug 2, 2019 at 9:13 AM vignesh C  wrote:
>> > + rc = system(copycmd);
>>
>> I don't think this patch should be calling system() in the first place.
>>
>
> So, do you mean we should just do fread() and fwrite() for the whole file?
>
> I thought it is better if it was done by the OS itself instead of reading
> 1GB
> into the memory and writing the same to the file.
>
It is not necessary to read the whole 1 GB into RAM.
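
For example, a chunked loop along these lines (just a sketch; the buffer
size and error handling are arbitrary) keeps memory use small:

#include <stdio.h>
#include <stdlib.h>

/* Copy a file in fixed-size chunks instead of loading it whole. */
static void
copy_file_chunked(const char *srcpath, const char *dstpath)
{
    char        buf[65536];         /* 64 kB at a time, not 1 GB */
    FILE       *src = fopen(srcpath, "rb");
    FILE       *dst = fopen(dstpath, "wb");
    size_t      nread;

    if (src == NULL || dst == NULL)
    {
        fprintf(stderr, "could not open file\n");
        exit(1);
    }

    while ((nread = fread(buf, 1, sizeof(buf), src)) > 0)
    {
        if (fwrite(buf, 1, nread, dst) != nread)
        {
            fprintf(stderr, "write failed\n");
            exit(1);
        }
    }

    fclose(src);
    fclose(dst);
}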


>
>> --
>> Robert Haas
>> EnterpriseDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
>>
>
>
> --
> Jeevan Chalke
> Technical Architect, Product Development
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
>
>

-- 
Ibrar Ahmed


Re: Remove HeapTuple and Buffer dependency for predicate locking functions

2019-08-07 Thread Heikki Linnakangas

On 06/08/2019 13:35, Thomas Munro wrote:

On Tue, Aug 6, 2019 at 9:26 PM Heikki Linnakangas  wrote:

Attached is a patch that contains your fix.txt with the changes for
clarity mentioned above, and an isolationtester test case.


LGTM.


Pushed, thanks!

- Heikki




Re: POC: Cleaning up orphaned files using undo logs

2019-08-07 Thread Dilip Kumar
On Tue, Jul 30, 2019 at 1:54 PM Dilip Kumar  wrote:
>
> On Tue, Jul 30, 2019 at 12:21 PM Thomas Munro  wrote:
> >
> > One data structure that could perhaps hold this would be
> > UndoLogTableEntry (the per-backend cache, indexed by undo log number,
> > with pretty fast lookups; used for things like
> > UndoLogNumberGetCategory()).  As long as you never want to have
> > inter-transaction compression, that should have the right scope to
> > give recovery per-undo log tracking.  If you ever wanted to do
> > compression between transactions too, maybe UndoLogSlot could work,
> > but that'd have more complications.
>
> I think this could be a good idea.  I had thought of keeping in the
> slot as my 3rd option but later I removed it thinking that we need to
> expose the compression field to the undo log layer.  I think keeping
> in the UndoLogTableEntry is a better idea than keeping in the slot.
> But, I still have the same problem that we need to expose undo
> record-level fields to undo log layer to compute the cache entry size.
>   OTOH, If we decide to get from the first record of the page (as I
> mentioned up thread) then I don't think there is any performance issue
> because we are inserting on the same page.  But, for doing that we
> need to unpack the complete undo record (guaranteed to be on one
> page).   And, UnpackUndoData will internally unpack the payload data
> as well which is not required in our case unless we change
> UnpackUndoData such that it unpacks only what the caller wants (one
> input parameter will do).
>
> I am not sure out of these two which idea is better?
>

I have one more problem, related to compression of the command id field.
Basically, the problem is that we don't set the command id in the WAL, and
we always store FirstCommandId in the undo [1].  So suppose there were two
operations under different CIDs: during DO time both undo records will
store the CID field, but during REDO time all the commands will store the
same CID (FirstCommandId), so as per the compression logic the subsequent
record for the same transaction will not store the CID field.  I am not
sure what the best way to handle this is, but I have a few ideas.

1) Never compress the CID field.
2) Write the CID in the WAL; but just for compressing the CID field in the
undo (which may not necessarily go to disk), we don't want to add an extra
4 bytes to the WAL.

Any better idea to handle this?

[1] 
https://www.postgresql.org/message-id/CAFiTN-u2Ny2E-NgT8nmE65awJ7keOzePODZTEg98ceF%2BsNhRtw%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Store FullTransactionId in TwoPhaseFileHeader/GlobalTransactionData

2019-08-07 Thread vignesh C
On Mon, Aug 5, 2019 at 8:31 AM Andres Freund  wrote:
>
> Hi,
>
> On 2019-08-05 14:44:37 +1200, Thomas Munro wrote:
> > Yeah. I think we're agreed for now that we don't want to change
> > procarray (though we still need to figure out how to compute the 64
> > bit horizons correctly and efficiently)
>
> Hm. Is that actually hard? Can't we just use the current logic to
> compute the horizons in 32bit, and then extend that to 64bit by
> comparing to nextFullXid or something like that? That shouldn't be more
> than a few instructions, outside of the contended locks?
>
>
> > and we probably don't want to change any high volume WAL contents, so
> > maybe I was starting down the wrong path there (I was thinking of the
> > subxid list for 2pc as infrequent, but obviously it isn't for some
> > people).  I don't have time to look into that any more right now, but
> > today's experiment made me feel more certain about my earlier
> > statement, that we shouldn't worry about the subxid list for now if we
> > don't actually have to.
>
> I don't think we should start to change existing wal contents to 64bit
> xids before there's a benefit from doing so (that obviously doesn't mean
> that records for a hypothetical AM employing 64bit xids shouldn't
> contain them, if they are actually long-term values, rather than just
> about the current transaction). I think that's just going to be
> confusing, without providing much in the way of benefits.
>
>
> > > > Vignesh's patch achieves something important on its own: it avoids the
> > > > needs for zheap to do a 32->64 conversion.
> > >
> > > Hm, is that an actual problem?
> >
> > It creates a place in the undo worker patch set that wants to do an
> > xid -> fxid translation, as discussed here:
> >
> > https://www.postgresql.org/message-id/CAA4eK1L9BhvnQfa_RJCTpKQf9QZ15pyUW7s32BH78iBC3KbV0g%40mail.gmail.com
>
> Right, but I think that can just be done the suggestion from above.
>
>
> > I'm trying to stop people from supplying a general purpose footgun
> > that looks like "GuessFullTransactionId(xid)".
>
> I think you're right - but I think we should be able to provide
> functions that ensure safety for most if not all of these. For WAL
> replay routines we can reference xlogrecord, for GetOldestXmin() we can
> just expand internally, by referencing a 64bit xid in ShmemVariableCache
> or such.
>
>
> > I suspect that any time you think you want to do that, there is
> > probably a better way that doesn't involve having to convince everyone
> > that we didn't mess up the epoch part in some unlikely race, which
> > probably involves holding onto an fxid that you had somewhere earlier
> > that came ultimately from the next fxid generator, or deriving it with
> > reference to the next fxid or a known older-but-still-running fxid
> > with the right interlocking, or something like that.  I and others
> > said, well, why don't we just put the fxid in the 2pc file.  That's
> > what Vignesh has proposed, and AFAIK it solves that immediate problem.
> >
> > That caused people to ask -- entirely reasonably -- why we don't
> > change ALL the xids in there to fxids, which brings us here.  I think
> > it includes lots of tricky decisions that I don't want to make right
> > now, hence inclination to defer that question for now.
>
> I'm against doing this just for one part of the record. That seems
> supremely confusing. And I don't buy that it meaningfully helps us in
> the first place, given the multitude of other records containing 32bit
> xids.
>
Going by the discussion, shall we conclude that we don't need to
convert the subxids into fxids as part of this fix?
Let me know if any further changes need to be done.
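
As a side note, the 32-to-64-bit widening Andres describes above (deriving a
full xid by reference to a 64-bit "next" xid) could be sketched roughly as
below.  This is only an illustration with made-up names; it assumes the caller
has a recent 64-bit next-fxid obtained under suitable locking, and that the
xid being widened is not in the future of it and no more than ~2^32 behind it.

#include <stdint.h>
#include <assert.h>

/*
 * Sketch: widen a 32-bit xid to 64 bits by reference to a 64-bit "next"
 * transaction id.  Assumes xid is not newer than next_fxid and is no more
 * than ~2^32 transactions older.
 */
static inline uint64_t
widen_xid(uint64_t next_fxid, uint32_t xid)
{
    uint32_t    next32 = (uint32_t) next_fxid;
    uint32_t    behind = next32 - xid;  /* modular 32-bit arithmetic */

    assert(behind <= next_fxid);        /* cannot precede the start of time */
    return next_fxid - behind;
}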

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-07 Thread Heikki Linnakangas

On 05/08/2019 16:24, Robert Haas wrote:

On Sun, Aug 4, 2019 at 5:16 AM Heikki Linnakangas  wrote:

I feel that the level of abstraction is not quite right. There are a
bunch of fields, like uur_block, uur_offset, uur_tuple, that are
probably useful for some UNDO resource managers (zheap I presume), but
seem kind of arbitrary. How is uur_tuple different from uur_payload?
Should they be named more generically as uur_payload1 and uur_payload2?
And why two, why not three or four different payloads? In the WAL record
format, there's a concept of "block id", which allows you to store N
number of different payloads in the record, I think that would be a
better approach. Or only have one payload, and let the resource manager
code divide it as it sees fit.

Many of the fields support a primitive type of compression, where a
field can be omitted if it has the same value as on the first record on
an UNDO page. That's handy. But again I don't like the fact that the
fields have been hard-coded into the UNDO record format. I can see e.g.
the relation oid to be useful for many AMs. But not all. And other AMs
might well want to store and deduplicate other things, aside from the
fields that are in the patch now. I'd like to move most of the fields to
AM specific code, and somehow generalize the compression. One approach
would be to let the AM store an arbitrary struct, and run it through a
general-purpose compression algorithm, using the UNDO page's first
record as the "dictionary".


I thought about this, too. I agree that there's something a little
unsatisfying about the current structure, but I haven't been able to
come up with something that seems definitively better. I think
something along the lines of what you are describing here might work
well, but I am VERY doubtful about the idea of a fixed-size struct. I
think AMs are going to want to store variable-length data: especially
tuples, but maybe also other stuff. For instance, imagine some AM that
wants to implement locking that's more fine-grained that the four
levels of tuple locks we have today: instead of just having key locks
and all-columns locks, you could want to store the exact columns to be
locked. Or maybe your TIDs are variable-width.


Sure, a fixed-size struct is quite limiting. My point is that all that 
packing of data into UNDO records should be AM-specific. Maybe it would 
be handy to have a few common fields in the undo record header itself, 
but most data should be in the AM-specific payload, because it varies 
across AMs.



And the problem is that as soon as you move to something where you
pack in a bunch of variable-sized fields, you lose the ability to
refer to thinks using reasonable names.  That's where I came up with
the idea of an UnpackedUndoRecord: give the common fields that
"everyone's going to need" human-readable names, and jam only the
strange, AM-specific stuff into the payload.  But if those needs are
not actually universal but very much AM-specific, then I'm afraid
we're going to end up with deeply inscrutable code for packing and
unpacking records.  I imagine it's possible to come up with a good
structure for that, but I don't think we have one today.


Yeah, that's also a problem with complicated WAL record types. Hopefully 
the complex cases are an exception, not the norm. A complex case is 
unlikely to fit any pre-defined set of fields anyway. (We could look at 
how e.g. protobuf works, if this is really a big problem. I'm not 
suggesting that we add a dependency just for this, but there might be 
some patterns or interfaces that we could mimic.)


If you remember, we did a big WAL format refactoring in 9.5, which moved 
some information from AM-specific structs to the common headers. Namely, 
the information on the relation blocks that the WAL record applies to. 
That was a very handy refactoring, and allowed tools like pg_waldump to 
print more detailed information about all WAL record types. For WAL 
records, moving the block information was natural, because there was 
special handling for full-page images anyway. However, I don't think we 
have enough experience with UNDO log yet, to know which fields would be 
best to include in the common undo header, and which to leave as 
AM-specific payload. I think we should keep the common header slim, and 
delegate to the AM routines.
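
Just to illustrate what I mean by a slim common header, something like the
struct below.  This is a sketch only, not the record layout used by the patch;
it assumes PostgreSQL's usual c.h typedefs (uint8, uint16, TransactionId), and
the field names are made up.

/* A sketch only -- not the record layout used by the patch. */
typedef struct UndoRecordCommon
{
    uint8       urec_rmid;          /* owning AM / resource manager id */
    uint8       urec_type;          /* AM-specific record type */
    uint16      urec_payload_len;   /* length of AM-specific payload */
    TransactionId urec_xid;         /* transaction that wrote the undo */

    /*
     * An AM-specific payload of urec_payload_len bytes follows.  The core
     * undo layer treats it as an opaque byte string and leaves packing and
     * field compression entirely to the AM's callbacks.
     */
} UndoRecordCommon;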


For UNDO records, having an XID on every record probably makes sense; 
all the use cases for UNDO log we've discussed are transactional. The 
rules on which UNDO records to apply and what/when to discard, depend on 
whether a transaction committed or aborted and when, so you need the XID 
for that. Although, the rule also depends on the AM; for cleaning up 
orphaned files, an UNDO record for a committed transaction can be 
discarded immediately, while zheap and zedstore records need to be kept 
around longer. So the exact rules for that will need to be AM-specific, 
too. Or maybe there are only a few different cases and we can enumerate 
them,

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-07 Thread Bruce Momjian
On Wed, Aug  7, 2019 at 05:13:31PM +0900, Masahiko Sawada wrote:
> I understood. IIUC in your approach postgres processes encrypt WAL
> records when inserting to the WAL buffer. So WAL data is encrypted
> even on the WAL buffer.
> 
> It works but I think the implementation might be complex; For example
> using openssl, we would use EVP functions to encrypt data by
> AES-256-CTR. We would need to make IV and pass it to them and these
> functions however don't manage the counter value of nonce as long as I
> didn't miss. That is, we need to calculate the correct counter value
> for each encryption and pass it to EVP functions. Suppose we encrypt
> 20 bytes of WAL. The first 16 bytes is encrypted with nonce of
> (segment_number, 0) and the next 4 bytes is encrypted with nonce of
> (segment_number, 1). After that suppose we encrypt 12 bytes of WAL. We
> cannot use nonce of (segment_number, 2) but should use nonce of
> (segment_number , 1). Therefore we would need 4 bytes padding and to
> encrypt it and then to throw that 4 bytes away .

Since we want to have per-byte control over encryption, for both
heap/index pages (skip LSN and CRC), and WAL (encrypt to the last byte),
I assumed we would need to generate a bit stream of a specified size and
do the XOR ourselves against the data.  I assume ssh does this, so we
would have to study the method.
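
To make the counter arithmetic concrete, the mapping from a byte offset to a
CTR counter block and an intra-block offset could look like the sketch below.
It assumes a 16-byte AES block and an IV laid out as (segment number, block
counter); the function name and layout are illustrative, not a settled design,
and the keystream generation and XOR themselves are left out.

#include <stdint.h>

#define AES_BLOCK_SIZE 16

/*
 * Sketch: compute the CTR counter block and intra-block offset for an
 * arbitrary byte offset into a WAL segment.  The high 8 bytes of the
 * counter carry the segment number, the low 8 bytes the block number
 * (big-endian, since CTR mode increments from the rightmost byte).
 */
static void
wal_ctr_position(uint64_t segno, uint64_t byte_off,
                 uint8_t counter[AES_BLOCK_SIZE], unsigned *intra_off)
{
    uint64_t    blockno = byte_off / AES_BLOCK_SIZE;
    int         i;

    *intra_off = (unsigned) (byte_off % AES_BLOCK_SIZE);
    for (i = 0; i < 8; i++)
    {
        counter[7 - i] = (uint8_t) (segno >> (8 * i));
        counter[15 - i] = (uint8_t) (blockno >> (8 * i));
    }

    /*
     * Encrypting this counter block under the WAL key yields one keystream
     * block; XOR it against the data starting at *intra_off.
     */
}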

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




SQL/JSON path: collation for comparisons, minor typos in docs

2019-08-07 Thread Markus Winand
Hi!

I was playing around with JSON path quite a bit and might have found one case 
where the current implementation doesn’t follow the standard.

The functionality in question is the comparison operators other than ==. They use 
the database default collation rather than the standard-mandated "Unicode 
codepoint collation" (SQL-2:2016 9.39 General Rule 12 c iii 2 D, last sentence 
in the first paragraph).

I guess this is the relevant part of the code: 
src/backend/utils/adt/jsonpath_exec.c (compareItems)

            case jbvString:
                if (op == jpiEqual)
                    return jb1->val.string.len != jb2->val.string.len ||
                        memcmp(jb1->val.string.val,
                               jb2->val.string.val,
                               jb1->val.string.len) ? jpbFalse : jpbTrue;

                cmp = varstr_cmp(jb1->val.string.val, jb1->val.string.len,
                                 jb2->val.string.val, jb2->val.string.len,
                                 DEFAULT_COLLATION_OID);
                break;

Testcase:

postgres 12beta3=# select * from jsonb_path_query('"dummy"', '$ ? ("a" < "A")');
 jsonb_path_query
------------------
 "dummy"
(1 row)

In code points, lower case 'a' is not less than upper case 'A'; the result
should be empty.

To convince myself:

postgres 12beta3=# select datcollate, 'a' < 'A', 'a' < 'A' COLLATE ucs_basic from pg_database where datname=current_database();
 datcollate  | ?column? | ?column? 
-------------+----------+----------
 en_US.UTF-8 | t        | f
(1 row)
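
For what it's worth, codepoint order can be obtained with a plain bytewise
comparison in a UTF-8 database, since UTF-8 byte order coincides with codepoint
order.  A minimal sketch (assuming UTF-8 strings; this is just an illustration,
not necessarily how the fix should look):

#include <string.h>

/*
 * Sketch: compare two UTF-8 strings in Unicode codepoint order.  Bytewise
 * comparison of UTF-8 is equivalent to codepoint order, so no collation
 * library is needed.  Returns <0, 0 or >0.
 */
static int
codepoint_cmp(const char *s1, int len1, const char *s2, int len2)
{
    int         cmp = memcmp(s1, s2, len1 < len2 ? len1 : len2);

    return cmp != 0 ? cmp : len1 - len2;
}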

I also found two minor typos in the docs. Patch attached.

-markus
ps.: I’ve created 230 test cases. Besides the WIP topic .datetime(), the 
collation issue is the only one I found. Excellent work. Down to the SQLSTATEs. 
For sure the most complete and correct SQL/JSON path implementation I've seen.



0001-Doc-Fix-typos-in-json-path-documentation.patch
Description: Binary data


Grouping isolationtester tests in the schedule

2019-08-07 Thread Heikki Linnakangas
The list of tests in src/test/isolation/isolation_schedule has grown 
over the years. Originally, they were all related to Serializable 
Snapshot Isolation, but there are different kinds of concurrency tests 
there now. More tests is good, but the schedule file has grown into a 
big inscrutable list with zero comments.


I propose to categorize the tests and add some divider comments to the 
file, see attached.


- Heikki
# Serializable Snapshot Isolation tests.
test: read-only-anomaly
test: read-only-anomaly-2
test: read-only-anomaly-3
test: read-write-unique
test: read-write-unique-2
test: read-write-unique-3
test: read-write-unique-4
test: simple-write-skew
test: receipt-report
test: temporal-range-integrity
test: project-manager
test: classroom-scheduling
test: total-cash
test: referential-integrity
test: ri-trigger
test: partial-index
test: two-ids
test: multiple-row-versions
test: index-only-scan
test: predicate-lock-hot-tuple
test: predicate-hash
test: predicate-gist
test: predicate-gin
test: serializable-parallel
test: serializable-parallel-2

# Deadlock detection tests
test: deadlock-simple
test: deadlock-hard
test: deadlock-soft
test: deadlock-soft-2
test: deadlock-parallel

# Concurrent DML tests
test: fk-contention
test: fk-deadlock
test: fk-deadlock2
test: fk-partitioned-1
test: fk-partitioned-2
test: eval-plan-qual
test: lock-update-delete
test: lock-update-traversal
test: inherit-temp
test: nowait
test: nowait-2
test: nowait-3
test: nowait-4
test: nowait-5
test: skip-locked
test: skip-locked-2
test: skip-locked-3
test: skip-locked-4

# INSERT ON CONFLICT tests.
test: insert-conflict-do-nothing
test: insert-conflict-do-nothing-2
test: insert-conflict-do-update
test: insert-conflict-do-update-2
test: insert-conflict-do-update-3
test: insert-conflict-toast
test: insert-conflict-specconflict

# Multixact (FOR KEY SHARE/UPDATE) tests.
test: delete-abort-savept
test: delete-abort-savept-2
test: aborted-keyrevoke
test: multixact-no-deadlock
test: multixact-no-forget
test: lock-committed-update
test: lock-committed-keyupdate
test: update-locked-tuple
test: propagate-lock-delete
test: tuplelock-conflict
test: tuplelock-update
test: tuplelock-upgrade-no-deadlock
test: freeze-the-dead

# Concurrent DDL tests
test: reindex-concurrently
test: drop-index-concurrently-1
test: multiple-cic
test: alter-table-1
test: alter-table-2
test: alter-table-3
test: alter-table-4
test: create-trigger
test: sequence-ddl
test: vacuum-concurrent-drop

# Misc tests
test: async-notify
test: vacuum-reltuples
test: timeouts
test: vacuum-conflict
test: vacuum-skip-locked
test: partition-key-update-1
test: partition-key-update-2
test: partition-key-update-3
test: partition-key-update-4
test: plpgsql-toast
test: truncate-conflict


Re: POC: Cleaning up orphaned files using undo logs

2019-08-07 Thread Heikki Linnakangas

On 07/08/2019 13:52, Dilip Kumar wrote:

I have one more problem related to compression of the command id
field.  Basically, the problem is that we don't set the command id in
the WAL and we will always store FirstCommandId in the undo[1].   So
suppose there were 2 operations under a different CID then during DO
time both the undo record will store the CID field in their respective
undo records but during REDO time, all the commands will store the
same CID(FirstCommandId) so as per the compression logic the
subsequent record for the same transaction will not store the CID
field.  I am not sure what is the best way to handle this but I have
few ideas.

1) Don't compress the CID field ever.
2) Write CID in WAL,  but just for compressing the CID field in undo
(which may not necessarily go to disk) we don't want to add extra 4
bytes in the WAL.


Most transactions have only a few commands, so you could optimize for 
that. If you use some kind of a variable-byte encoding for it, it could 
be a single byte or even just a few bits, for the common cases.
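
For illustration, such a variable-byte encoding could be as simple as the usual
7-bits-per-byte scheme sketched below; the function names are made up and this
is only an example of the idea, not code from the patch.

#include <stdint.h>

/*
 * Sketch: encode/decode a command id using 7 bits per byte, with the high
 * bit set on all but the last byte.  CIDs below 128 (the overwhelmingly
 * common case) take a single byte.  Both functions return the number of
 * bytes consumed/produced (1..5).
 */
static int
encode_cid(uint32_t cid, uint8_t *out)
{
    int         n = 0;

    while (cid >= 0x80)
    {
        out[n++] = (uint8_t) ((cid & 0x7F) | 0x80);
        cid >>= 7;
    }
    out[n++] = (uint8_t) cid;
    return n;
}

static int
decode_cid(const uint8_t *in, uint32_t *cid)
{
    uint32_t    result = 0;
    int         shift = 0;
    int         n = 0;

    do
    {
        result |= (uint32_t) (in[n] & 0x7F) << shift;
        shift += 7;
    } while (in[n++] & 0x80);

    *cid = result;
    return n;
}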


For the first version, I'd suggest keeping it simple, though, and 
optimize later.


- Heikki




Re: POC: Cleaning up orphaned files using undo logs

2019-08-07 Thread Thomas Munro
On Thu, Aug 1, 2019 at 1:22 AM Amit Kapila  wrote:
> On Wed, Jul 31, 2019 at 10:13 AM Amit Kapila  wrote:
> > On Tue, Jul 30, 2019 at 5:26 PM Thomas Munro  wrote:
> > >  but
> > > here's a small thing: I managed to reach an LWLock self-deadlock in
> > > the undo worker launcher:
> > >
> >
> > I could see the problem, will fix in next version.
>
> Fixed both of these problems in the patch just posted by me [1].

I reran the script that found that problem, so I could play with the
linger logic.  It creates N databases, and then it creates tables in
random databases (because I'm testing with the orphaned table cleanup
patch) and commits or rolls back at (say) 100 tx/sec.  While it's
doing that, you can look at the pg_stat_undo_logs view to see the
discard and insert pointers whizzing along nicely, but if you look at
the process table with htop or similar you can see that it's forking
undo apply workers at 100/sec (the pid keeps changing), whenever there
is more than one database involved.  With a single database it lingers
as I was expecting (and then creates problems when you want to drop
the database).  What I was expecting to see is that if you configure
the test to generate undo work in  2, 3 or 4 dbs, and you have
max_undo_workers set to 4, then you should finish up with 4 undo apply
workers hanging around to service the work calmly without any new
forking happening.  If you generate undo work in more than 4
databases, I was expecting to see the undo workers exiting and being
forked so that a total of 4 workers (at any time) can work their way
around the more-than-4 databases, but not switching as fast as they
can, so that we don't waste all our energy on forking and setup (how
fast exactly they should switch, I don't know, that's what I wanted to
see).  A more advanced thing to worry about, not yet tested, is how
well they'll handle asymmetrical work distributions (not enough
workers, but some databases producing a lot and some a little undo
work).  Script attached.

-- 
Thomas Munro
https://enterprisedb.com
# Install: python-psycopg2 (Debian/Ubuntu)
# Run with python2 ./test_undo_worker_local_balancing.py

import psycopg2 as pg
import random
import time

def make_conn(dbname):
  return pg.connect("dbname=" + dbname)

def run_test(dbs, txs_per_sec, commit_ratio, runtime):

  # first, create a control connection that we'll use to create databases
  conn = make_conn("postgres")
  conn.set_isolation_level(pg.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
  cursor = conn.cursor()

  # recreate all the databases
  for n in range(dbs):
cursor.execute("drop database if exists db%d" % n)
cursor.execute("create database db%d" % n)

  # next, open a separate session to each database
  conns = [make_conn("db%d" % n) for n in range(dbs)]
  cursors = [conn.cursor() for conn in conns]

  # set up interesting GUCs in each session
  for cursor in cursors:
cursor.execute("set rollback_overflow_size = '0kB'")
cursor.connection.commit()

  # now do random work at the requested rate until our time runs out
  start = time.time()
  finish_at = start + runtime
  txs = 0
  table_number = 0
  now = start
  while now < finish_at:

# choose a random session, and run a transaction
cursor = random.choice(cursors)
cursor.execute("create table t%d ()" % table_number)
# decide whether to commit or roll back
if random.uniform(0.0, 1.0) < commit_ratio:
  cursor.connection.commit()
else:
  cursor.connection.rollback()
table_number += 1

# wait until it's time to start the next transaction
txs += 1
next_tx_at = (txs / txs_per_sec) + start
if next_tx_at < now:
  print "can't run transactions fast enough, not sleeping"
else:
  time.sleep(next_tx_at - now)
now = time.time()

if __name__ == "__main__":
  dbs = 4
  txs_per_sec = 1 #100.0
  commit_ratio = 0.0 # 0.0 = always roll back, 1.0 = always commit
  runtime = 120 # how long to run for, in seconds
  run_test(dbs, txs_per_sec, commit_ratio, runtime)


Re: Grouping isolationtester tests in the schedule

2019-08-07 Thread Thomas Munro
On Wed, Aug 7, 2019 at 11:28 PM Heikki Linnakangas  wrote:
> The list of tests in src/test/isolation/isolation_schedule has grown
> over the years. Originally, they were all related to Serializable
> Snapshot Isolation, but there are different kinds of concurrency tests
> there now. More tests is good, but the schedule file has grown into a
> big inscrutable list with zero comments.

+1

> I propose to categorize the tests and add some divider comments to the
> file, see attached.

I think I'd put nowait and skip locked under a separate category "FOR
UPDATE" or "row locking" or something, but maybe that's just me... can
you call that stuff DML?

-- 
Thomas Munro
https://enterprisedb.com




Re: s/rewinded/rewound/?

2019-08-07 Thread Chapman Flack
On 08/07/19 04:48, Thomas Munro wrote:

> as "poetic" and "rare", and then says it was used by one specific
> Victorian poet.  Perhaps I'll send them a pull request: it's now G. M.
> Hopkins and PostgreSQL?

It does seem counter, original, spare, strange.

Regards,
-Chap




Re: Grouping isolationtester tests in the schedule

2019-08-07 Thread Heikki Linnakangas

On 07/08/2019 14:42, Thomas Munro wrote:

On Wed, Aug 7, 2019 at 11:28 PM Heikki Linnakangas  wrote:

The list of tests in src/test/isolation/isolation_schedule has grown
over the years. Originally, they were all related to Serializable
Snapshot Isolation, but there are different kinds of concurrency tests
there now. More tests is good, but the schedule file has grown into a
big inscrutable list with zero comments.


+1


I propose to categorize the tests and add some divider comments to the
file, see attached.


I think I'd put nowait and skip locked under a separate category "FOR
UPDATE" or "row locking" or something, but maybe that's just me... can
you call that stuff DML?


Yeah, I guess SELECT FOR UPDATE isn't really DML. Separate "Row locking" 
category works for me. Or maybe "Concurrent DML and row locking". There 
is also DML in some of those tests.


- Heikki




Re: Small patch to fix build on Windows

2019-08-07 Thread Juan José Santamaría Flecha
On Wed, Aug 7, 2019 at 11:11 AM Dmitry Igrishin  wrote:
>
> ср, 7 авг. 2019 г. в 11:29, Kyotaro Horiguchi :
> >
> > Solution.pm has the following line:
> >
> > >   my $opensslcmd =
> > > $self->{options}->{openssl} . "\\bin\\openssl.exe version 2>&1";
> >
> > AFAICS that's all.
> Thank you! The attached 2nd version of the patch fixes this too.
>

At some point the proposed patch for opensslcmd was like:

+ my $opensslprog = '\bin\openssl.exe version 2>&1';
+ my $opensslcmd = '"' . $self->{options}->{openssl} . '"' . $opensslprog;

It can be a question of taste, but I think the dot is easier to read.

Regards,

Juan José Santamaría Flecha




initdb: Use varargs macro for PG_CMD_PRINTF

2019-08-07 Thread Peter Eisentraut
Small patch to simplify some code that no longer needs to be complicated,
using varargs macros.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 5004af6cc9174e9cdc20a6bf4f959d22e209a8e8 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 7 Aug 2019 00:06:21 +0200
Subject: [PATCH 1/4] initdb: Use varargs macro for PG_CMD_PRINTF

I left PG_CMD_PUTS around even though it could be handled by
PG_CMD_PRINTF since PG_CMD_PUTS is sometimes called with non-literal
arguments, and so that would create a potential problem if such a
string contained percent signs.
---
 src/bin/initdb/initdb.c | 28 
 1 file changed, 8 insertions(+), 20 deletions(-)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 9c303e9cc4..551d379d85 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -307,21 +307,9 @@ do { \
output_failed = true, output_errno = errno; \
 } while (0)
 
-#define PG_CMD_PRINTF1(fmt, arg1) \
+#define PG_CMD_PRINTF(fmt, ...) \
 do { \
-   if (fprintf(cmdfd, fmt, arg1) < 0 || fflush(cmdfd) < 0) \
-   output_failed = true, output_errno = errno; \
-} while (0)
-
-#define PG_CMD_PRINTF2(fmt, arg1, arg2) \
-do { \
-   if (fprintf(cmdfd, fmt, arg1, arg2) < 0 || fflush(cmdfd) < 0) \
-   output_failed = true, output_errno = errno; \
-} while (0)
-
-#define PG_CMD_PRINTF3(fmt, arg1, arg2, arg3)  \
-do { \
-   if (fprintf(cmdfd, fmt, arg1, arg2, arg3) < 0 || fflush(cmdfd) < 0) \
+   if (fprintf(cmdfd, fmt, __VA_ARGS__) < 0 || fflush(cmdfd) < 0) \
output_failed = true, output_errno = errno; \
 } while (0)
 
@@ -1490,7 +1478,7 @@ setup_auth(FILE *cmdfd)
PG_CMD_PUTS(*line);
 
if (superuser_password)
-   PG_CMD_PRINTF2("ALTER USER \"%s\" WITH PASSWORD E'%s';\n\n",
+   PG_CMD_PRINTF("ALTER USER \"%s\" WITH PASSWORD E'%s';\n\n",
   username, 
escape_quotes(superuser_password));
 }
 
@@ -1684,7 +1672,7 @@ setup_description(FILE *cmdfd)
"   objsubid int4, "
"   description text);\n\n");
 
-   PG_CMD_PRINTF1("COPY tmp_pg_description FROM E'%s';\n\n",
+   PG_CMD_PRINTF("COPY tmp_pg_description FROM E'%s';\n\n",
   escape_quotes(desc_file));
 
PG_CMD_PUTS("INSERT INTO pg_description "
@@ -1697,7 +1685,7 @@ setup_description(FILE *cmdfd)
" classname name, "
" description text);\n\n");
 
-   PG_CMD_PRINTF1("COPY tmp_pg_shdescription FROM E'%s';\n\n",
+   PG_CMD_PRINTF("COPY tmp_pg_shdescription FROM E'%s';\n\n",
   escape_quotes(shdesc_file));
 
PG_CMD_PUTS("INSERT INTO pg_shdescription "
@@ -1738,7 +1726,7 @@ setup_collation(FILE *cmdfd)
 * in pg_collation.h.  But add it before reading system collations, so
 * that it wins if libc defines a locale named ucs_basic.
 */
-   PG_CMD_PRINTF3("INSERT INTO pg_collation (oid, collname, collnamespace, 
collowner, collprovider, collisdeterministic, collencoding, collcollate, 
collctype)"
+   PG_CMD_PRINTF("INSERT INTO pg_collation (oid, collname, collnamespace, 
collowner, collprovider, collisdeterministic, collencoding, collcollate, 
collctype)"
   "VALUES 
(pg_nextoid('pg_catalog.pg_collation', 'oid', 
'pg_catalog.pg_collation_oid_index'), 'ucs_basic', 'pg_catalog'::regnamespace, 
%u, '%c', true, %d, 'C', 'C');\n\n",
   BOOTSTRAP_SUPERUSERID, COLLPROVIDER_LIBC, 
PG_UTF8);
 
@@ -1982,12 +1970,12 @@ setup_schema(FILE *cmdfd)
 
free(lines);
 
-   PG_CMD_PRINTF1("UPDATE information_schema.sql_implementation_info "
+   PG_CMD_PRINTF("UPDATE information_schema.sql_implementation_info "
   "  SET character_value = '%s' "
   "  WHERE implementation_info_name = 'DBMS 
VERSION';\n\n",
   infoversion);
 
-   PG_CMD_PRINTF1("COPY information_schema.sql_features "
+   PG_CMD_PRINTF("COPY information_schema.sql_features "
   "  (feature_id, feature_name, 
sub_feature_id, "
   "  sub_feature_name, is_supported, comments) 
"
   " FROM E'%s';\n\n",
-- 
2.22.0



Re: Built-in connection pooler

2019-08-07 Thread Ryan Lambert
> First of all default value of this parameter is 1000, not 10.


Oops, my bad!  Sorry about that, I'm not sure how I got that in my head
last night but I see how that would make it act strange now.  I'll adjust
my notes before re-testing. :)

Thanks,

*Ryan Lambert*


On Wed, Aug 7, 2019 at 4:57 AM Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:

> Hi Ryan,
>
> On 07.08.2019 6:18, Ryan Lambert wrote:
> > Hi Konstantin,
> >
> > I did some testing with the latest patch [1] on a small local VM with
> > 1 CPU and 2GB RAM with the intention of exploring pg_pooler_state().
> >
> > Configuration:
> >
> > idle_pool_worker_timeout = 0 (default)
> > connection_proxies = 2
> > max_sessions = 10 (default)
> > max_connections = 1000
> >
> > Initialized pgbench w/ scale 10 for the small server.
> >
> > Running pgbench w/out connection pooler with 300 connections:
> >
> > pgbench -p 5432 -c 300 -j 1 -T 60 -P 15 -S bench_test
> > starting vacuum...end.
> > progress: 15.0 s, 1343.3 tps, lat 123.097 ms stddev 380.780
> > progress: 30.0 s, 1086.7 tps, lat 155.586 ms stddev 376.963
> > progress: 45.1 s, 1103.8 tps, lat 156.644 ms stddev 347.058
> > progress: 60.6 s, 652.6 tps, lat 271.060 ms stddev 575.295
> > transaction type: <builtin: select only>
> > scaling factor: 10
> > query mode: simple
> > number of clients: 300
> > number of threads: 1
> > duration: 60 s
> > number of transactions actually processed: 63387
> > latency average = 171.079 ms
> > latency stddev = 439.735 ms
> > tps = 1000.918781 (including connections establishing)
> > tps = 1000.993926 (excluding connections establishing)
> >
> >
> > It crashes when I attempt to run with the connection pooler, 300
> > connections:
> >
> > pgbench -p 6543 -c 300 -j 1 -T 60 -P 15 -S bench_test
> > starting vacuum...end.
> > connection to database "bench_test" failed:
> > server closed the connection unexpectedly
> >This probably means the server terminated abnormally
> >before or while processing the request.
> >
> > In the logs I get:
> >
> > WARNING:  PROXY: Failed to add new client - too much sessions: 18
> > clients, 1 backends. Try to increase 'max_sessions' configuration
> > parameter.
> >
> > The logs report 1 backend even though max_sessions is the default of
> > 10.  Why is there only 1 backend reported?  Is that error message
> > getting the right value?
> >
> > Minor grammar fix, the logs on this warning should say "too many
> > sessions" instead of "too much sessions."
> >
> > Reducing pgbench to only 30 connections keeps it from completely
> > crashing but it still does not run successfully.
> >
> > pgbench -p 6543 -c 30 -j 1 -T 60 -P 15 -S bench_test
> > starting vacuum...end.
> > client 9 aborted in command 1 (SQL) of script 0; perhaps the backend
> > died while processing
> > client 11 aborted in command 1 (SQL) of script 0; perhaps the backend
> > died while processing
> > client 13 aborted in command 1 (SQL) of script 0; perhaps the backend
> > died while processing
> > ...
> > ...
> > progress: 15.0 s, 5734.5 tps, lat 1.191 ms stddev 10.041
> > progress: 30.0 s, 7789.6 tps, lat 0.830 ms stddev 6.251
> > progress: 45.0 s, 8211.3 tps, lat 0.810 ms stddev 5.970
> > progress: 60.0 s, 8466.5 tps, lat 0.789 ms stddev 6.151
> > transaction type: <builtin: select only>
> > scaling factor: 10
> > query mode: simple
> > number of clients: 30
> > number of threads: 1
> > duration: 60 s
> > number of transactions actually processed: 453042
> > latency average = 0.884 ms
> > latency stddev = 7.182 ms
> > tps = 7549.373416 (including connections establishing)
> > tps = 7549.402629 (excluding connections establishing)
> > Run was aborted; the above results are incomplete.
> >
> > Logs for that run show (truncated):
> >
> >
> > 2019-08-07 00:19:37.707 UTC [22152] WARNING:  PROXY: Failed to add new
> > client - too much sessions: 18 clients, 1 backends. Try to increase
> > 'max_sessions' configuration parameter.
> > 2019-08-07 00:31:10.467 UTC [22151] WARNING:  PROXY: Failed to add new
> > client - too much sessions: 15 clients, 4 backends. Try to increase
> > 'max_sessions' configuration parameter.
> > 2019-08-07 00:31:10.468 UTC [22152] WARNING:  PROXY: Failed to add new
> > client - too much sessions: 15 clients, 4 backends. Try to increase
> > 'max_sessions' configuration parameter.
> > ...
> > ...
> >
> >
> > Here it is reporting fewer clients with more backends. Still, only 4
> > backends reported with 15 clients doesn't seem right.  Looking at the
> > results from pg_pooler_state() at the same time (below) showed 5 and 7
> > backends for the two different proxies, so why are the logs only
> > reporting 4 backends when pg_pooler_state() reports 12 total?
> >
> > Why is n_idle_clients negative?  In this case it showed -21 and -17.
> > Each proxy reported 7 clients, with max_sessions = 10, having those
> > n_idle_client results doesn't make sense to me.
> >
> >
> > postgres=# SELECT * FROM pg_pooler_state();
> >  pid  | n_clients | n_ssl_clients | n_pools | n_backends |
> > n_dedica

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-07 Thread Sehrope Sarkuni
On Mon, Aug 5, 2019 at 9:02 PM Bruce Momjian  wrote:

> On Wed, Jul 31, 2019 at 09:25:01AM -0400, Sehrope Sarkuni wrote:
> > Even if we do not include a separate per-relation salt or things like
> > relfilenode when generating a derived key, we can still include other
> types of
> > immutable attributes. For example the fork type could be included to
> eventually
> > allow multiple forks for the same relation to be encrypted with the same
> IV =
> > LSN + Page Number as the derived key per-fork would be distinct.
>
> Yes, the fork number could be useful in this case.  I was thinking we
> would just leave the extra bits as zeros and we can then set it to '1'
> or something else for a different fork.
>

Key derivation has more flexibility as you're not limited by the number of
unused bits in the IV.


> > WAL encryption should not use the same key as page encryption so there's
> no
> > need to design the IV to try to avoid matching the page IVs. Even a basic
> > derivation with a single fixed WDEK = HKDF(MDEK, "WAL") and TDEK =
> HKDF(MDEK,
> > "PAGE") would ensure separate keys. That's the the literal string "WAL"
> or
> > "PAGE" being added as a salt to generate the respective keys, all that
> matters
> > is they're different.
>
> I was thinking the WAL would use the same key since the nonce is unique
> between the two.  What value is there in using a different key?
>

Never having to worry about overlap in Key + IV usage is the main advantage.
While it's possible to structure IVs to avoid that from happening, it's
much easier to completely avoid that situation by ensuring different parts
of an application are using separate derived keys.


> > Ideally WAL encryption would generating new derived keys as part of the
> WAL
> > stream. The WAL stream is not fixed so you have the luxury of being able
> to add
> > a "Use new random salt XZY going forward" records. Forcing generation of
> a new
> > salt/key upon promotion of a replica would ensure that at least the WAL
> is
> > unique going forward. Could also generate a new upon server startup,
> after
>
> Ah, yes, good point, and using a derived key would make that easier.
> The tricky part is what to use to create the new derived key, unless we
> generate a random number and store that somewhere in the data directory,
> but that might lead to fragility, so I am worried.


Simplest approach for derived keys would be to use immutable attributes of
the WAL files as an input to the key derivation. Something like HKDF(MDEK,
"WAL:" || timeline_id || wal_segment_num) should be fine for this as it is:

* Unique per WAL file
* Known prior to writing to a given WAL file
* Known prior to reading a given WAL file
* Does not require any additional persistence
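
A sketch of that derivation using OpenSSL's HKDF interface (available since
OpenSSL 1.1.0) is below.  The key lengths, function name, and the exact layout
of the "info" string are illustrative assumptions, not a settled design.

#include <stdint.h>
#include <stdio.h>
#include <openssl/evp.h>
#include <openssl/kdf.h>

/*
 * Sketch: derive a per-WAL-segment key (WDEK) from the master key (MDEK)
 * with HKDF-SHA256, using the timeline and segment number as the "info"
 * input.  Returns 1 on success, 0 on failure.
 */
static int
derive_wal_key(const unsigned char *mdek, int mdek_len,
               uint32_t timeline_id, uint64_t segno,
               unsigned char *wdek, size_t wdek_len)
{
    EVP_PKEY_CTX *pctx = EVP_PKEY_CTX_new_id(EVP_PKEY_HKDF, NULL);
    char        info[64];
    int         infolen;
    int         ok = 0;

    if (pctx == NULL)
        return 0;

    /* immutable, per-segment context information */
    infolen = snprintf(info, sizeof(info), "WAL:%u:%llu",
                       timeline_id, (unsigned long long) segno);

    if (EVP_PKEY_derive_init(pctx) > 0 &&
        EVP_PKEY_CTX_set_hkdf_md(pctx, EVP_sha256()) > 0 &&
        EVP_PKEY_CTX_set1_hkdf_key(pctx, mdek, mdek_len) > 0 &&
        EVP_PKEY_CTX_add1_hkdf_info(pctx, (const unsigned char *) info, infolen) > 0 &&
        EVP_PKEY_derive(pctx, wdek, &wdek_len) > 0)
        ok = 1;

    EVP_PKEY_CTX_free(pctx);
    return ok;
}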

> We have pg_rewind,
> which allows to make the WAL go backwards.  What is the value in doing
> this?
>

Good point re: pg_rewind. Having key rotation records in the stream would
complicate that as you'd have to jump back / forward to figure out which
key to use. It's doable but much more complicated.

A unique WDEK per WAL file that is derived from the segment number would
not have that problem. A unique key per-file means the IVs can all start at
zero and each file can be treated as one encrypted stream. Any
encryption/decryption code would only need to touch the write/read
callsites.


> > every N bytes, or a new one for each new WAL file. There's much more
> > flexibility compared to page encryption.
> >
> > As WAL is a single continuous stream, we can start the IV for each
> derived WAL
> > key from zero. There's no need to complicate it further as Key + IV will
> never
> > be reused.
>
> Uh, you want a new random key for each WAL file?  I was going to use the
> WAL segment number as the nonce, which is always increasing, and easily
> determined.  The file is 16MB.
>

Ideally yes as it would allow for multiple replicas promoted off the same
primary to immediately diverge as each would have its own keys. I don't
consider it a requirement but if it's possible without significant added
complexity I say that's a win.

I'm still reading up on the file and record format to understand how
complex that would be. Though given your point re: pg_rewind and the lack
of handling for page encryption divergence when promoting multiple
replicas, I doubt the complexity will be worth it.


> > If WAL is always written as full pages we need to ensure that the empty
> parts
> > of the page are actual zeros and not "encrypted zeroes". Otherwise an
> XOR of
> > the empty section of the first write of a page against a subsequent one
> would
> > give you the plain text.
>
> Right, I think we need the segment number as part of the nonce for WAL.
>

+1 to using segment number but it's better as a derived key instead of
coming up with new IV constructs and reusing the MDEK.


> > The non-fixed size of the WAL allows for the addition of a MAC though
> I'm not
> > sure yet the best way to incorporate it. It could be part of each
> encrypted

Re: SQL/JSON path: collation for comparisons, minor typos in docs

2019-08-07 Thread Alexander Korotkov
Hi!

On Wed, Aug 7, 2019 at 2:25 PM Markus Winand  wrote:
> I was playing around with JSON path quite a bit and might have found one case 
> where the current implementation doesn’t follow the standard.
>
> The functionality in question are the comparison operators except ==. They 
> use the database default collation rather than the standard-mandated "Unicode 
> codepoint collation” (SQL-2:2016 9.39 General Rule 12 c iii 2 D, last 
> sentence in first paragraph).

Thank you for pointing that out!  Nikita is about to write a patch fixing that.

> I also found two minor typos in the docs. Patch attached.

Pushed, thanks.

> -markus
> ps.: I’ve created 230 test cases. Besides the WIP topic .datetime(), the 
> collation issue is the only one I found. Excellent work. Down to the 
> SQLSTATEs. For sure the most complete and correct SQL/JSON path 
> implementation I've seen.

Thank you!

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Regression test failure in regression test temp.sql

2019-08-07 Thread Michael Paquier
Hi all,

While browsing the buildfarm failures, I have found this problem on
anole for the test temp:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2019-08-07%2006%3A39%3A35
  select relname from pg_class where relname like 'temp_parted_oncommit_test%';
 relname
  
-  temp_parted_oncommit_test
   temp_parted_oncommit_test1
  (2 rows)

  drop table temp_parted_oncommit_test;
  --- 276,283 
  select relname from pg_class where relname like 'temp_parted_oncommit_test%';
 relname
  
   temp_parted_oncommit_test1
+  temp_parted_oncommit_test
  (2 rows)
   
This could be solved just with an ORDER BY as per the attached.  Any
objections?

Thanks,
--
Michael
diff --git a/src/test/regress/expected/temp.out b/src/test/regress/expected/temp.out
index eab75dbe2c..7538ca8ad6 100644
--- a/src/test/regress/expected/temp.out
+++ b/src/test/regress/expected/temp.out
@@ -273,7 +273,8 @@ select * from temp_parted_oncommit_test;
 (1 row)
 
 -- two relations remain in this case.
-select relname from pg_class where relname like 'temp_parted_oncommit_test%';
+select relname from pg_class where relname like 'temp_parted_oncommit_test%'
+  order by relname;
   relname   
 
  temp_parted_oncommit_test
diff --git a/src/test/regress/sql/temp.sql b/src/test/regress/sql/temp.sql
index 761955bfe6..6158b5a3df 100644
--- a/src/test/regress/sql/temp.sql
+++ b/src/test/regress/sql/temp.sql
@@ -210,7 +210,8 @@ commit;
 -- preserved.
 select * from temp_parted_oncommit_test;
 -- two relations remain in this case.
-select relname from pg_class where relname like 'temp_parted_oncommit_test%';
+select relname from pg_class where relname like 'temp_parted_oncommit_test%'
+  order by relname;
 drop table temp_parted_oncommit_test;
 
 -- Check dependencies between ON COMMIT actions with inheritance trees.


signature.asc
Description: PGP signature


Re: stress test for parallel workers

2019-08-07 Thread Heikki Linnakangas

On 07/08/2019 02:57, Thomas Munro wrote:

On Wed, Jul 24, 2019 at 5:15 PM Tom Lane  wrote:

So I think I've got to take back the assertion that we've got
some lurking generic problem.  This pattern looks way more
like a platform-specific issue.  Overaggressive OOM killer
would fit the facts on vulpes/wobbegong, perhaps, though
it's odd that it only happens on HEAD runs.


chipmunk also:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2019-08-06%2014:16:16

I wondered if the build farm should try to report OOM kill -9 or other
signal activity affecting the postmaster.


FWIW, I looked at the logs in /var/log/* on chipmunk, and found no 
evidence of OOM killings. I can see nothing unusual in the OS logs 
around the time of that failure.


- Heikki




Re: Small patch to fix build on Windows

2019-08-07 Thread Dmitry Igrishin
ср, 7 авг. 2019 г. в 15:33, Juan José Santamaría Flecha
:
>
> On Wed, Aug 7, 2019 at 11:11 AM Dmitry Igrishin  wrote:
> >
> > ср, 7 авг. 2019 г. в 11:29, Kyotaro Horiguchi :
> > >
> > > Solution.pm has the following line:
> > >
> > > >   my $opensslcmd =
> > > > $self->{options}->{openssl} . "\\bin\\openssl.exe version 2>&1";
> > >
> > > AFAICS that's all.
> > Thank you! The attached 2nd version of the patch fixes this too.
> >
>
> At some point the propossed patch for opensslcmd was like:
>
> + my $opensslprog = '\bin\openssl.exe version 2>&1';
> + my $opensslcmd = '"' . $self->{options}->{openssl} . '"' . $opensslprog;
>
> It can be a question of taste, but I think the dot is easier to read.
Well, the style is inconsistent anyway; for example, in file Project.pm:

$self->{def}  = "./__CFGNAME__/$self->{name}/$self->{name}.def";
$self->{implib}   = "__CFGNAME__/$self->{name}/$libname";

So, I don't know what style is preferable. Personally, I don't care.




Unix-domain socket support on Windows

2019-08-07 Thread Peter Eisentraut
It works!

(apparently as of Windows 10 version 1803)

Here are some patches to get a discussion rolling.

Basically, it just works, but you need to define your own struct
sockaddr_un.  (This is what configure currently uses as a proxy for
HAVE_UNIX_SOCKETS, so (a) that needs a bit of tweaking, and (b) that is
the reason why builds haven't blown up already.)

But we'll now need to make things work so that binaries with Unix-domain
socket support work on systems without run-time support.  We already did
that exercise with IPv6 support, so some of the framework is already in
place.

Depending on your Windows environment, there might not be a suitable
/tmp directory, so you'll need to specify a directory explicitly using
postgres -k or similar.  This leads to the question of what the default for
DEFAULT_PGSOCKET_DIR should be on Windows.  I think it's probably best,
at least for now, to set it so that by default, neither server nor libpq
use Unix sockets unless explicitly selected.  This can be done easily on
the server side by defining DEFAULT_PGSOCKET_DIR as "".  But in libpq, I
don't think the code would handle that correctly everywhere, so it would
need some more analysis and restructuring possibly.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 1f3575f38ca1321e0c5d71217cfb46f8efc35caa Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 7 Aug 2019 15:44:19 +0200
Subject: [PATCH 1/5] Enable Unix-domain sockets support on Windows

As of Windows 10 version 1803, Unix-domain sockets are supported on
Windows.  But it's not automatically detected by configure because it
looks for struct sockaddr_un and Windows doesn't define that.  So we
just make our own definition on Windows and override the configure
result.
---
 config/c-library.m4|  5 +++--
 configure  |  5 -
 src/include/c.h| 11 +++
 src/include/pg_config.h.in |  6 +++---
 src/include/pg_config.h.win32  |  6 +++---
 src/include/pg_config_manual.h |  7 ---
 6 files changed, 24 insertions(+), 16 deletions(-)

diff --git a/config/c-library.m4 b/config/c-library.m4
index 6f2b0fbb4e..1469b07d2f 100644
--- a/config/c-library.m4
+++ b/config/c-library.m4
@@ -121,10 +121,11 @@ AC_DEFUN([PGAC_UNION_SEMUN],
 
 # PGAC_STRUCT_SOCKADDR_UN
 # ---
-# If `struct sockaddr_un' exists, define HAVE_UNIX_SOCKETS.
+# If `struct sockaddr_un' exists, define HAVE_STRUCT_SOCKADDR_UN.
+# If it is missing then one could define it.
 # (Requires test for <sys/un.h>!)
 AC_DEFUN([PGAC_STRUCT_SOCKADDR_UN],
-[AC_CHECK_TYPE([struct sockaddr_un], [AC_DEFINE(HAVE_UNIX_SOCKETS, 1, [Define to 1 if you have unix sockets.])], [],
+[AC_CHECK_TYPES([struct sockaddr_un], [], [],
 [#include <sys/types.h>
 #ifdef HAVE_SYS_UN_H
 #include <sys/un.h>
diff --git a/configure b/configure
index 7a6bfc2339..6e87537be3 100755
--- a/configure
+++ b/configure
@@ -14135,7 +14135,10 @@ ac_fn_c_check_type "$LINENO" "struct sockaddr_un" "ac_cv_type_struct_sockaddr_un
 "
 if test "x$ac_cv_type_struct_sockaddr_un" = xyes; then :
 
-$as_echo "#define HAVE_UNIX_SOCKETS 1" >>confdefs.h
+cat >>confdefs.h <<_ACEOF
+#define HAVE_STRUCT_SOCKADDR_UN 1
+_ACEOF
+
 
 fi
 
diff --git a/src/include/c.h b/src/include/c.h
index 2a082afab1..434c403269 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -1263,6 +1263,17 @@ extern unsigned long long strtoull(const char *str, char **endptr, int base);
 #define NON_EXEC_STATIC static
 #endif
 
+#ifdef HAVE_STRUCT_SOCKADDR_UN
+#define HAVE_UNIX_SOCKETS 1
+#elif defined(WIN32)
+struct sockaddr_un
+{
+   unsigned short sun_family;
+   char sun_path[108];
+};
+#define HAVE_UNIX_SOCKETS 1
+#endif
+
 /* /port compatibility functions */
 #include "port.h"
 
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 512213aa32..e75c090f85 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -594,6 +594,9 @@
 /* Define to 1 if `__ss_len' is a member of `struct sockaddr_storage'. */
 #undef HAVE_STRUCT_SOCKADDR_STORAGE___SS_LEN
 
+/* Define to 1 if the system has the type `struct sockaddr_un'. */
+#undef HAVE_STRUCT_SOCKADDR_UN
+
 /* Define to 1 if `tm_zone' is a member of `struct tm'. */
 #undef HAVE_STRUCT_TM_TM_ZONE
 
@@ -682,9 +685,6 @@
 /* Define to 1 if you have the <unistd.h> header file. */
 #undef HAVE_UNISTD_H
 
-/* Define to 1 if you have unix sockets. */
-#undef HAVE_UNIX_SOCKETS
-
 /* Define to 1 if you have the `unsetenv' function. */
 #undef HAVE_UNSETENV
 
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index 2d903c82b8..b947a9fbde 100644
--- a/src/include/pg_config.h.win32
+++ b/src/include/pg_config.h.win32
@@ -463,6 +463,9 @@
 /* Define to 1 if `__ss_len' is member of `struct sockaddr_storage'. */
 /* #undef HAVE_STRUCT_SOCKADDR_STORAGE___SS_LEN */
 
+/* Define to 1 if the system has the type `struct sockaddr_un'. */
+/* #undef HAVE_STRUCT_SOCKADDR_UN */
+
 /* D

Re: stress test for parallel workers

2019-08-07 Thread Tom Lane
Heikki Linnakangas  writes:
> On 07/08/2019 02:57, Thomas Munro wrote:
>> On Wed, Jul 24, 2019 at 5:15 PM Tom Lane  wrote:
>>> So I think I've got to take back the assertion that we've got
>>> some lurking generic problem.  This pattern looks way more
>>> like a platform-specific issue.  Overaggressive OOM killer
>>> would fit the facts on vulpes/wobbegong, perhaps, though
>>> it's odd that it only happens on HEAD runs.

>> chipmunk also:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2019-08-06%2014:16:16

> FWIW, I looked at the logs in /var/log/* on chipmunk, and found no 
> evidence of OOM killings. I can see nothing unusual in the OS logs 
> around the time of that failure.

Oh, that is very useful info, thanks.  That seems to mean that we
should be suspecting a segfault, assertion failure, etc inside
the postmaster.  I don't see any TRAP message in chipmunk's log,
so assertion failure seems to be ruled out, but other sorts of
process-crashing errors would fit the facts.

A stack trace from the crash would be mighty useful info along
about here.  I wonder whether chipmunk has the infrastructure
needed to create such a thing.  From memory, the buildfarm requires
gdb for that, but not sure if there are additional requirements.
Also, if you're using systemd or something else that thinks it
ought to interfere with where cores get dropped, that could be
a problem.

regards, tom lane




Re: no default hash partition

2019-08-07 Thread Stephen Frost
Greetings,

* Amit Langote (amitlangot...@gmail.com) wrote:
> On Wed, Aug 7, 2019 at 1:59 PM Kyotaro Horiguchi
>  wrote:
> > At Tue, 6 Aug 2019 23:26:19 -0400, Robert Haas  
> > wrote:
> > > On Tue, Aug 6, 2019 at 6:58 PM Tom Lane  wrote:
> > > I think, as Amit says, that having an automatic partition creation
> > > feature for hash partitions (and maybe other kinds, but certainly for
> > > hash) would be a useful thing to add to the system. I also think that
> > > it might be useful to add some commands to automate partition
> > > splitting (and maybe combining) although I think there's some design
> > > work to be done there to figure out exactly what we should build.  I
> > > don't think it's ever useful to have a hash-partitioned table with an
> > > incomplete set of partitions long term, but it makes things simpler to
> > > allow that temporarily, for example during dump restoration.
> > > Therefore, I see no reason why we would want to go to the trouble of
> > > allowing hash-partitioned tables to have default partitions; it would
> > > just encourage people to do things that don't really make any sense.
> >
> > +1.
> >
> > By the way, couldn't we offer a means to check for gaps in a hash
> > partition? For example, the output of current \d+ 
> > contains the Partitions section that shows a list of
> > partitions. I think that we can show all gaps there.
> >
> > =# \d+ p
> >Partitioned table "public.p"
> > ...
> > Partition key: HASH (a)
> > Partitions: c1 FOR VALUES WITH (modulus 4, remainder 0),
> > c3 FOR VALUES WITH (modulus 4, remainder 3),
> > GAP (modulus 4, remainder 1),
> > GAP (modulus 4, remainder 2)
> >
> > Or
> >
> > Partitions: c1 FOR VALUES WITH (modulus 4, remainder 0),
> > c3 FOR VALUES WITH (modulus 4, remainder 3),
> > Gaps: (modulus 4, remainder 1), (modulus 4, remainder 2)
> 
> I imagine showing this output would require some non-trivial code on
> the client side (?) to figure out the gaps.  If our intention in the
> long run is to make sure that such gaps only ever appear temporarily,
> that is, when running a command to increase the number of hash
> partitions (as detailed in Robert's email), then a user would never
> see those gaps.  So, maybe writing such code wouldn't be worthwhile in
> the long run?

I tend to agree that it might not be useful to have this code,
particularly not on the client side, but we've dealt with the issue of
"the client would need non-trivial code for this" in the past by having
a server-side function for the client to call (eg: pg_get_expr(),
pg_get_ruledef()).  If we really think this would be valuable to show
and we don't want the client to have to have a bunch of code for it,
doing something similar here could address that.

One thing I've wished for is a function that would give me a range type
back for a partition (would be neat to be able to use a range type to
specify a partition's range when creating it too).

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: initdb: Use varargs macro for PG_CMD_PRINTF

2019-08-07 Thread Tom Lane
Peter Eisentraut  writes:
> Small patch to simplify some no longer necessary complicated code, using
> varargs macros.

+1

regards, tom lane




Re: Unix-domain socket support on Windows

2019-08-07 Thread Heikki Linnakangas

On 07/08/2019 16:56, Peter Eisentraut wrote:

It works!


Cool!

Am I reading the patches correctly, that getpeereid() still doesn't work 
on Windows? That means that peer authentication doesn't work, right? 
That's a bit sad. One of the big advantages of unix domain sockets over 
TCP is peer authentication.


- Heikki




Re: Regression test failure in regression test temp.sql

2019-08-07 Thread Tom Lane
Michael Paquier  writes:
> While browsing the buildfarm failures, I have found this problem on
> anole for the test temp:
> ...
> This could be solved just with an ORDER BY as per the attached.  Any
> objections?

There's no reason to expect stability of row order in pg_class, so
in principle this is a reasonable fix, but I kind of wonder why it's
necessary.  The plan I get for this query is

regression=# explain select relname from pg_class where relname like 'temp_parted_oncommit_test%';
                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Index Only Scan using pg_class_relname_nsp_index on pg_class  (cost=0.28..4.30 rows=1 width=64)
   Index Cond: ((relname >= 'temp'::text) AND (relname < 'temq'::text))
   Filter: (relname ~~ 'temp_parted_oncommit_test%'::text)
(3 rows)

which ought to deliver sorted rows natively.  Adding ORDER BY doesn't
change this plan one bit.  So what actually happened on anole to cause
a non-sorted result?

Not objecting to the patch, exactly, just feeling like there's
more here than meets the eye.  Not quite sure if it's worth
investigating closer, or what we'd even need to do to do so.

BTW, I realize from looking at the plan that LIKE is interpreting the
underscores as wildcards.  Maybe it's worth s/_/\_/ while you're
at it.

regards, tom lane




Re: stress test for parallel workers

2019-08-07 Thread Heikki Linnakangas

On 07/08/2019 16:57, Tom Lane wrote:

Heikki Linnakangas  writes:

On 07/08/2019 02:57, Thomas Munro wrote:

On Wed, Jul 24, 2019 at 5:15 PM Tom Lane  wrote:

So I think I've got to take back the assertion that we've got
some lurking generic problem.  This pattern looks way more
like a platform-specific issue.  Overaggressive OOM killer
would fit the facts on vulpes/wobbegong, perhaps, though
it's odd that it only happens on HEAD runs.



chipmunk also:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2019-08-06%2014:16:16



FWIW, I looked at the logs in /var/log/* on chipmunk, and found no
evidence of OOM killings. I can see nothing unusual in the OS logs
around the time of that failure.


Oh, that is very useful info, thanks.  That seems to mean that we
should be suspecting a segfault, assertion failure, etc inside
the postmaster.  I don't see any TRAP message in chipmunk's log,
so assertion failure seems to be ruled out, but other sorts of
process-crashing errors would fit the facts.

A stack trace from the crash would be mighty useful info along
about here.  I wonder whether chipmunk has the infrastructure
needed to create such a thing.  From memory, the buildfarm requires
gdb for that, but not sure if there are additional requirements.


It does have gdb installed.


Also, if you're using systemd or something else that thinks it
ought to interfere with where cores get dropped, that could be
a problem.


I think they should just go to a file called "core", I don't think I've 
changed any settings related to it, at least. I tried "find / -name 
core*", but didn't find any core files, though.


- Heikki




Re: stress test for parallel workers

2019-08-07 Thread Tom Lane
Heikki Linnakangas  writes:
> On 07/08/2019 16:57, Tom Lane wrote:
>> Also, if you're using systemd or something else that thinks it
>> ought to interfere with where cores get dropped, that could be
>> a problem.

> I think they should just go to a file called "core", I don't think I've 
> changed any settings related to it, at least. I tried "find / -name 
> core*", but didn't find any core files, though.

On Linux machines the first thing to check is

cat /proc/sys/kernel/core_pattern

On a Debian machine I have handy, that just says "core", but Red Hat
tends to mess with it ...

regards, tom lane




Re: Unix-domain socket support on Windows

2019-08-07 Thread Peter Eisentraut
On 2019-08-07 16:06, Heikki Linnakangas wrote:
> Am I reading the patches correctly, that getpeereid() still doesn't work 
> on Windows? That means that peer authentication doesn't work, right? 
> That's a bit sad. One of the big advantages of unix domain sockets over 
> TCP is peer authentication.

Correct, it's not supported.  I think it's plausible that they will add
this in the future.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: no default hash partition

2019-08-07 Thread Alvaro Herrera
On 2019-Aug-07, Amit Langote wrote:

> That hash-partitioned tables can't have default partition is mentioned
> in the CREATE TABLE page:
> 
> "If DEFAULT is specified, the table will be created as a default
> partition of the parent table. The parent can either be a list or
> range partitioned table. A partition key value not fitting into any
> other partition of the given parent will be routed to the default
> partition. There can be only one default partition for a given parent
> table."

This approach of documenting by omission seems unhelpful.  Yes, I'd like
to expand that too.


-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Grouping isolationtester tests in the schedule

2019-08-07 Thread Tom Lane
Heikki Linnakangas  writes:
> The list of tests in src/test/isolation/isolation_schedule has grown 
> over the years. Originally, they were all related to Serializable 
> Snapshot Isolation, but there are different kinds of concurrency tests 
> there now. More tests is good, but the schedule file has grown into a 
> big inscrutable list with zero comments.

> I propose to categorize the tests and add some divider comments to the 
> file, see attached.

+1 for concept, didn't review your divisions.

Something related I've been wondering about is whether we could
parallelize the isolation tests.  A difficulty here is that the
slowest ones tend to also be timing-sensitive, such that running
them in parallel would increase the risk of failure.  But we
could likely get at least some improvement.

regards, tom lane




Re: Unix-domain socket support on Windows

2019-08-07 Thread Magnus Hagander
On Wed, Aug 7, 2019 at 4:59 PM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2019-08-07 16:06, Heikki Linnakangas wrote:
> > Am I reading the patches correctly, that getpeereid() still doesn't work
> > on Windows? That means that peer authentication doesn't work, right?
> > That's a bit sad. One of the big advantages of unix domain sockets over
> > TCP is peer authentication.
>
> Correct, it's not supported.  I think it's plausible that they will add
> this in the future.
>

Does it work well enough that SSPI auth can run over it? SSPI auth with the
local provider gives you more or less the same results as peer, doesn't it?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: is necessary to recheck cached data in fn_extra?

2019-08-07 Thread Tom Lane
Pavel Stehule  writes:
> I should to use a cache accessed via fn_extra. There will be stored data
> about function parameters (types). If I understand correctly, these data
> should be stable in query, and then recheck is not necessary. Is it true?

I wouldn't trust that.  You don't really know what the lifespan of
a fn_extra cache is.

regards, tom lane




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-07 Thread Sehrope Sarkuni
On Wed, Aug 7, 2019 at 7:19 AM Bruce Momjian  wrote:

> On Wed, Aug  7, 2019 at 05:13:31PM +0900, Masahiko Sawada wrote:
> > I understood. IIUC in your approach postgres processes encrypt WAL
> > records when inserting to the WAL buffer. So WAL data is encrypted
> > even on the WAL buffer.
>

I was originally thinking of not encrypting the shared WAL buffers but that
may have issues. If the buffers are already encrypted and contiguous in
shared memory, it's possible to write out many via a single pg_pwrite(...)
call as is currently done in XLogWrite(...).

If they're not encrypted you'd need to do more work in that critical
section. That'd involve allocating a commensurate amount of memory to hold
the encrypted pages and then encrypting them all prior to the single
pg_pwrite(...) call. Reusing one buffer is possible but it would require
encrypting and writing the pages one by one. Both of those seem like a bad
idea.

Better to pay the encryption cost at the time of WAL record creation and
keep the writing process as fast and simple as possible.


> > It works but I think the implementation might be complex; For example
> > using openssl, we would use EVP functions to encrypt data by
> > AES-256-CTR. We would need to make IV and pass it to them and these
> > functions however don't manage the counter value of nonce as long as I
> > didn't miss. That is, we need to calculate the correct counter value
> > for each encryption and pass it to EVP functions. Suppose we encrypt
> > 20 bytes of WAL. The first 16 bytes is encrypted with nonce of
> > (segment_number, 0) and the next 4 bytes is encrypted with nonce of
> > (segment_number, 1). After that suppose we encrypt 12 bytes of WAL. We
> > cannot use nonce of (segment_number, 2) but should use nonce of
> > (segment_number , 1). Therefore we would need 4 bytes padding and to
> > encrypt it and then to throw that 4 bytes away .
>
> Since we want to have per-byte control over encryption, for both
> heap/index pages (skip LSN and CRC), and WAL (encrypt to the last byte),
> I assumed we would need to generate a bit stream of a specified size and
> do the XOR ourselves against the data.  I assume ssh does this, so we
> would have to study the method.
>

The lower level non-EVP OpenSSL functions allow specifying the offset
within the 16-byte AES block from which the encrypt/decrypt should proceed.
It's the "num" parameter of their encrypt/decrypt functions. For a
continuous encrypted stream such as a WAL file, a "pread(...)" of a
possibly non-16-byte aligned section would involve determining the 16-byte
counter (byte_offset / 16) and the intra-block offset (byte_offset % 16).
I'm not sure how one handles initializing the internal encrypted counter
and that might be one more step that would need be done. But it's
definitely possible to read / write less than a block via those APIs (not
the EVP ones).

I don't think the EVP functions have parameters for the intra-block offset
but you can mimic it by initializing the IV/block counter and then skipping
over the intra-block offset by either reading or writing a dummy partial
block. The EVP read and write functions both deal with individual bytes so
once you've seeked to your desired offset you can read or write the real
individual bytes.
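
To make that concrete, here is a rough sketch of the EVP variant described
above; the nonce layout (segment number followed by a big-endian block
counter), the function name, and the omission of error handling are all just
assumptions for illustration:

#include <stdint.h>
#include <openssl/evp.h>

/*
 * Illustrative sketch only: position an AES-256-CTR stream at an arbitrary
 * byte offset by putting the block counter into the IV and then discarding
 * the intra-block remainder with a dummy partial-block update.
 */
static void
ctr_crypt_at_offset(const unsigned char *key,   /* assumed 32-byte key */
                    uint64_t segno, uint64_t byte_offset,
                    const unsigned char *in, unsigned char *out, int len)
{
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    unsigned char iv[16];
    unsigned char zeroes[16] = {0};
    unsigned char sink[16];
    int     outl;
    uint64_t block = byte_offset / 16;  /* 16-byte counter value */
    int     skip = byte_offset % 16;    /* intra-block offset */
    int     i;

    /* Assumed nonce layout: segment number, then big-endian block counter. */
    for (i = 0; i < 8; i++)
        iv[i] = (segno >> (56 - 8 * i)) & 0xFF;
    for (i = 0; i < 8; i++)
        iv[8 + i] = (block >> (56 - 8 * i)) & 0xFF;

    EVP_EncryptInit_ex(ctx, EVP_aes_256_ctr(), NULL, key, iv);

    /* Burn the first "skip" key-stream bytes of the partially used block. */
    if (skip > 0)
        EVP_EncryptUpdate(ctx, sink, &outl, zeroes, skip);

    /* Now process the real bytes, starting exactly at byte_offset. */
    EVP_EncryptUpdate(ctx, out, &outl, in, len);
    EVP_CIPHER_CTX_free(ctx);
}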

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: Handling RestrictInfo in expression_tree_walker

2019-08-07 Thread Tom Lane
Konstantin Knizhnik  writes:
> Frankly speaking I do not see some good reasons for not handling 
> RestrictInfo in expression_tree_worker. It can really simplify writing 
> of mutators/walkers.

I don't buy this; what seems more likely is that you're trying to apply
an expression tree mutator to something you shouldn't.  The caching
aspects of RestrictInfo, and the fact that the planner often assumes
that RestrictInfos don't get copied (so that pointer equality is a
useful test), are both good reasons to be wary of applying general
mutations to those nodes.

Or in other words, if you want a walker/mutator to descend through
those nodes, you almost certainly need special logic at those nodes
anyway.  Omitting them from the nodeFuncs support guarantees you
don't forget that.
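
To illustrate, the kind of special-casing I have in mind looks roughly like
this; it is only a sketch of the pattern, not code from anywhere in the tree:

#include "postgres.h"
#include "nodes/nodeFuncs.h"
#include "nodes/pathnodes.h"

static bool
my_walker(Node *node, void *context)
{
    if (node == NULL)
        return false;

    if (IsA(node, RestrictInfo))
    {
        RestrictInfo *rinfo = (RestrictInfo *) node;

        /*
         * Decide explicitly what "descending through" a RestrictInfo means
         * here: look only at the bare clause, deliberately ignoring the
         * cached fields (orclause, derived selectivities, etc.).
         */
        return my_walker((Node *) rinfo->clause, context);
    }

    return expression_tree_walker(node, my_walker, context);
}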

regards, tom lane




Re: s/rewinded/rewound/?

2019-08-07 Thread Tom Lane
Magnus Hagander  writes:
> On Wed, Aug 7, 2019 at 10:49 AM Thomas Munro  wrote:
>> The word "rewinded" appears in our manual and in a comment.  That
>> sounds strange to my ears.  Isn't it a mistake?

Certainly.

> To me this sounds like a classic non-English-native-speaker-mistake.  But
> it seems at least the one in the docs come from Bruce, who definitely is...

He might've just been committing somebody else's words without having
reviewed carefully.

regards, tom lane




Re: Grouping isolationtester tests in the schedule

2019-08-07 Thread Alvaro Herrera
On 2019-Aug-07, Tom Lane wrote:

> Something related I've been wondering about is whether we could
> parallelize the isolation tests.  A difficulty here is that the
> slowest ones tend to also be timing-sensitive, such that running
> them in parallel would increase the risk of failure.  But we
> could likely get at least some improvement.

Yeah, there's some improvement to be had there.  We've discussed it
previously:
https://postgr.es/m/20180124231006.z7spaz5gkzbdvob5@alvherre.pgsql

I'm not really happy about this grouping if we mean we're restricted in
how we can make tests run in parallel.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: s/rewinded/rewound/?

2019-08-07 Thread Alvaro Herrera
On 2019-Aug-07, Tom Lane wrote:

> Magnus Hagander  writes:

> > To me this sounds like a classic non-English-native-speaker-mistake.  But
> > it seems at least the one in the docs come from Bruce, who definitely is...
> 
> He might've just been committing somebody else's words without having
> reviewed carefully.

The commit message for 878bd9accb55 doesn't mention that.  He didn't
add a mailing list reference, but this is easy to find at
https://postgr.es/m/20160720180706.gf24...@momjian.us
I lean towards the view that he was using the literal program name as a
verb, rather than trying to decline a verb normally.  Note that the word
"rewound" did not appear in that SGML source when he committed that;
that was only introduced in bfc80683ce51 three years later.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: is necessary to recheck cached data in fn_extra?

2019-08-07 Thread Chapman Flack
On 08/07/19 11:39, Tom Lane wrote:
> Pavel Stehule  writes:
>> I should to use a cache accessed via fn_extra. There will be stored data
>> about function parameters (types). If I understand correctly, these data
>> should be stable in query, and then recheck is not necessary. Is it true?
> 
> I wouldn't trust that.  You don't really know what the lifespan of
> a fn_extra cache is.

It is going to be either the last thing I put there, or NULL, right?
So a null check is sufficient?

Other than when the SRF_* api has commandeered it for other purposes?

Regards,
-Chap




Re: stress test for parallel workers

2019-08-07 Thread Heikki Linnakangas

On 07/08/2019 17:45, Tom Lane wrote:
> Heikki Linnakangas  writes:
>> On 07/08/2019 16:57, Tom Lane wrote:
>>> Also, if you're using systemd or something else that thinks it
>>> ought to interfere with where cores get dropped, that could be
>>> a problem.
> 
>> I think they should just go to a file called "core", I don't think I've
>> changed any settings related to it, at least. I tried "find / -name
>> core*", but didn't find any core files, though.
> 
> On Linux machines the first thing to check is
> 
> cat /proc/sys/kernel/core_pattern
> 
> On a Debian machine I have handy, that just says "core", but Red Hat
> tends to mess with it ...

It's just "core" on chipmunk, too.

- Heikki




Re: s/rewinded/rewound/?

2019-08-07 Thread Geoff Winkless
On Wed, 7 Aug 2019 at 16:59, Alvaro Herrera  wrote:
> He didn't
> add a mailing list reference, but this is easy to find at
> https://postgr.es/m/20160720180706.gf24...@momjian.us
> I lean towards the view that he was using the literal program name as a
> verb, rather than trying to decline a verb normally.

I go with that, although I think it's confusing to not use the full
app name. If I were discussing a block of data that had been passed to
a "rewind" function, I might well put "this data has been rewind()ed"
(or just rewinded). But if I were discussing the concept itself, I
would say rewound.

eg In the example given, I would accept "and then
pg_rewinded to become a standby".

Although I would probably have reworded it to use "and then
pg_rewind run again to set it to standby"
or something similar, because the "ed" form really does look odd in
documentation.

I don't think using "rewound" instead is explicit enough in this instance.

But that's just me. Feel free to ignore.

Geoff




Re: is necessary to recheck cached data in fn_extra?

2019-08-07 Thread Pavel Stehule
st 7. 8. 2019 v 17:39 odesílatel Tom Lane  napsal:

> Pavel Stehule  writes:
> > I should to use a cache accessed via fn_extra. There will be stored data
> > about function parameters (types). If I understand correctly, these data
> > should be stable in query, and then recheck is not necessary. Is it true?
>
> I wouldn't trust that.  You don't really know what the lifespan of
> a fn_extra cache is.
>

The fn_extra cache cannot live longer than the query. And if I understand
correctly, it is not possible to change parameter types inside a query?
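
For context, the pattern I have in mind is roughly the following; this is only
a sketch with invented names, doing a null check plus a cheap recheck of the
argument type in case fn_extra lives longer than I expect:

#include "postgres.h"
#include "fmgr.h"

typedef struct MyArgCache
{
    Oid         argtype;        /* type the cached info was derived for */
    /* ... whatever derived data the function needs ... */
} MyArgCache;

PG_FUNCTION_INFO_V1(my_func);

Datum
my_func(PG_FUNCTION_ARGS)
{
    MyArgCache *cache = (MyArgCache *) fcinfo->flinfo->fn_extra;
    Oid         argtype = get_fn_expr_argtype(fcinfo->flinfo, 0);

    if (cache == NULL || cache->argtype != argtype)
    {
        /* first call, or fn_extra lived longer than we assumed */
        cache = MemoryContextAllocZero(fcinfo->flinfo->fn_mcxt,
                                       sizeof(MyArgCache));
        cache->argtype = argtype;
        /* ... derive and store the per-parameter data here ... */
        fcinfo->flinfo->fn_extra = cache;
    }

    /* ... use the cache; real result construction omitted ... */
    PG_RETURN_NULL();
}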

Pavel


> regards, tom lane
>


Re: Removing unneeded self joins

2019-08-07 Thread Konstantin Knizhnik



On 05.08.2019 14:24, Andrey Lepikhov wrote:
> On 02/08/2019 04:54, Thomas Munro wrote:
>> On Thu, Jun 27, 2019 at 6:42 PM Andrey Lepikhov
>>  wrote:
>>> Version v.17 of the patch that fix the bug see in attachment.
>> 
>> While moving this to the September CF, I noticed that it needs to be
>> updated for the recent pg_list.h API changes.
> 
> The patch was updated:
> 1. Changes caused by pg_list.h API changes.
> 2. Fix the problem of joint clause_relids and required_relids changes [1].
> 3. Add eclass mentions of removed relation into the kept relation
> (field eclass_indexes was introduced by commit 3373c71553).
> 
> [1]
> https://www.postgresql.org/message-id/flat/5c21029d-81a2-c999-6744-6a898fcc9a19%40postgrespro.ru

One more bug is fixed in this patch: OR clauses were not correctly
updated in the case of self-join removal.




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e6ce8e2..1af8d75 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2274,7 +2274,8 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node)
 	WRITE_OID_FIELD(userid);
 	WRITE_BOOL_FIELD(useridiscurrent);
 	/* we don't try to print fdwroutine or fdw_private */
-	/* can't print unique_for_rels/non_unique_for_rels; BMSes aren't Nodes */
+	WRITE_NODE_FIELD(unique_for_rels);
+	/* can't print non_unique_for_rels; BMSes aren't Nodes */
 	WRITE_NODE_FIELD(baserestrictinfo);
 	WRITE_UINT_FIELD(baserestrict_min_security);
 	WRITE_NODE_FIELD(joininfo);
@@ -2347,6 +2348,19 @@ _outStatisticExtInfo(StringInfo str, const StatisticExtInfo *node)
 }
 
 static void
+_outUniqueRelInfo(StringInfo str, const UniqueRelInfo *node)
+{
+	WRITE_NODE_TYPE("UNIQUERELINFO");
+
+	WRITE_BITMAPSET_FIELD(outerrelids);
+	if (node->index)
+	{
+		WRITE_OID_FIELD(index->indexoid);
+		WRITE_NODE_FIELD(column_values);
+	}
+}
+
+static void
 _outEquivalenceClass(StringInfo str, const EquivalenceClass *node)
 {
 	/*
@@ -4112,6 +4126,9 @@ outNode(StringInfo str, const void *obj)
 			case T_StatisticExtInfo:
 _outStatisticExtInfo(str, obj);
 break;
+			case T_UniqueRelInfo:
+_outUniqueRelInfo(str, obj);
+break;
 			case T_ExtensibleNode:
 _outExtensibleNode(str, obj);
 break;
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 5f339fd..b57be54 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -3568,7 +3568,8 @@ ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel,
  * relation_has_unique_index_for
  *	  Determine whether the relation provably has at most one row satisfying
  *	  a set of equality conditions, because the conditions constrain all
- *	  columns of some unique index.
+ *	  columns of some unique index. If index_info is not null, it is set to
+ *	  point to a new UniqueRelInfo containing the index and conditions.
  *
  * The conditions can be represented in either or both of two ways:
  * 1. A list of RestrictInfo nodes, where the caller has already determined
@@ -3589,12 +3590,16 @@ ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel,
 bool
 relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel,
 			  List *restrictlist,
-			  List *exprlist, List *oprlist)
+			  List *exprlist, List *oprlist,
+			  UniqueRelInfo **unique_info)
 {
 	ListCell   *ic;
 
 	Assert(list_length(exprlist) == list_length(oprlist));
 
+	if (unique_info)
+		*unique_info = NULL;
+
 	/* Short-circuit if no indexes... */
 	if (rel->indexlist == NIL)
 		return false;
@@ -3645,6 +3650,7 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel,
 	{
 		IndexOptInfo *ind = (IndexOptInfo *) lfirst(ic);
 		int			c;
+		List *column_values = NIL;
 
 		/*
 		 * If the index is not unique, or not immediately enforced, or if it's
@@ -3693,6 +3699,9 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel,
 if (match_index_to_operand(rexpr, c, ind))
 {
 	matched = true; /* column is unique */
+	column_values = lappend(column_values, rinfo->outer_is_left
+			? get_leftop(rinfo->clause)
+			: get_rightop(rinfo->clause));
 	break;
 }
 			}
@@ -3735,7 +3744,22 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel,
 
 		/* Matched all key columns of this index? */
 		if (c == ind->nkeycolumns)
+		{
+			if (unique_info)
+			{
+/* This may be called in GEQO memory context. */
+MemoryContext oldContext = MemoryContextSwitchTo(root->planner_cxt);
+*unique_info = makeNode(UniqueRelInfo);
+(*unique_info)->index = ind;
+(*unique_info)->column_values = list_copy(column_values);
+MemoryContextSwitchTo(oldContext);
+			}
+			if (column_values)
+list_free(column_values);
 			return true;
+		}
+		if (column_values)
+			list_free(column_values);
 	}
 
 	return false;
diff --git a/src/backend/optimizer/path/joinpath.c b/src/b

Re: no default hash partition

2019-08-07 Thread Alvaro Herrera
On 2019-Aug-06, Stephen Frost wrote:

> Yeah, that's a fair argument, but giving the user a way to say that
> would address it.  As in, "create me a list-partitioned table for these
> values, plus a default."  Anyhow, I'm sure that I'm taking this beyond
> what we need to do right now, just sharing where I think it'd be good
> for things to go.

Fabien Coelho already submitted a patch for this IIRC.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Rethinking opclass member checks and dependency strength

2019-08-07 Thread Tom Lane
Over in [1] we realized that it would be a good idea to remove the <@
operator from contrib/intarray's GiST opclasses.  Unfortunately, doing
that isn't a simple matter of generating an extension update script
containing ALTER OPERATOR FAMILY DROP OPERATOR, because that operator
is marked as internally dependent on its opclass which means that
dependency.c will object.  We could do some direct hacking on
pg_depend to let the DROP be allowed, but ugh.

I started to wonder why GiST opclass operators are ever considered
as required members of their opclass.  The existing rule (cf.
opclasscmds.c) is that everything mentioned in CREATE OPERATOR CLASS
will have an internal dependency on the opclass, but if you add
operators or functions with ALTER OPERATOR FAMILY ADD, those just
have AUTO dependencies on their operator family.  So the assumption
is that opclass creators will only put the bare minimum of required
stuff into CREATE OPERATOR CLASS and then add optional stuff with
ALTER ... ADD.  But none of our contrib modules do it like that,
and I'd lay long odds against any third party code doing it either.

This leads to the thought that maybe we could put some intelligence
into an index-AM-specific callback instead.  For example, for btree
and hash the appropriate rule is probably that cross-type operators
and functions should be tied to the opfamily while single-type
members are internally tied to the associated opclass.  For GiST,
GIN, and SPGiST it's not clear to me that *any* operator deserves
an INTERNAL dependency; only the implementation functions do.

Furthermore, if we had an AM callback that were charged with
deciding the right dependency links for all the operators/functions,
we could also have it do some validity checking on those things,
thus moving some of the checks made by amvalidate into a more
useful place.
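
To be concrete, I'm imagining something of roughly this shape (the name and
signature are entirely hypothetical at this point):

/*
 * Entirely hypothetical: a per-AM callback, alongside amvalidate, that gets
 * the opfamily, the opclass being created (if any), and the lists of member
 * operators and support functions, and decides what kind of dependency to
 * record for each one.
 */
typedef void (*amcheckmembers_function) (Oid opfamilyoid,
                                         Oid opclassoid,
                                         List *operators,
                                         List *functions);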

If we went along this line, then a dump/restore or pg_upgrade
would be enough to change an opclass's dependencies to the new
style, which would get us to a place where intarray's problem
could be fixed with ALTER OPERATOR FAMILY DROP OPERATOR and
nothing else.  Such an upgrade script wouldn't work in older
releases, but I think we don't generally care about that.

Thoughts?

regards, tom lane

[1] https://www.postgresql.org/message-id/458.1565114...@sss.pgh.pa.us




Re: no default hash partition

2019-08-07 Thread Alvaro Herrera
On 2019-Aug-07, Amit Langote wrote:

> That hash-partitioned tables can't have default partition is mentioned
> in the CREATE TABLE page:
> 
> "If DEFAULT is specified, the table will be created as a default
> partition of the parent table. The parent can either be a list or
> range partitioned table. A partition key value not fitting into any
> other partition of the given parent will be routed to the default
> partition. There can be only one default partition for a given parent
> table."

Actually, it also says this (in the blurb for the PARTITION OF clause):

  Creates the table as a partition of the specified
  parent table. The table can be created either as a partition for specific
  values using FOR VALUES or as a default partition
  using DEFAULT.  This option is not available for
  hash-partitioned tables.

which I think is sufficient.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Grouping isolationtester tests in the schedule

2019-08-07 Thread Heikki Linnakangas

On 07/08/2019 18:52, Alvaro Herrera wrote:
> On 2019-Aug-07, Tom Lane wrote:
>> Something related I've been wondering about is whether we could
>> parallelize the isolation tests.  A difficulty here is that the
>> slowest ones tend to also be timing-sensitive, such that running
>> them in parallel would increase the risk of failure.  But we
>> could likely get at least some improvement.
> 
> Yeah, there's some improvement to be had there.  We've discussed it
> previously:
> https://postgr.es/m/20180124231006.z7spaz5gkzbdvob5@alvherre.pgsql
> 
> I'm not really happy about this grouping if we mean we're restricted in
> how we can make tests run in parallel.

The elephant in the room is the 'timeouts' test, which takes about 40
seconds, out of a total runtime of 90 seconds. So we'd really want to
run that in parallel with everything else. Or split 'timeouts' into
multiple tests that could run in parallel. I don't think grouping the
rest of the tests differently will make much difference to how easy or
hard that is.

In any case, we can scramble the list again later, if that's needed for
running the tests in parallel, and we think it's worth it. Until then, a
more logical grouping and some comments would be nice.

- Heikki




Re: Grouping isolationtester tests in the schedule

2019-08-07 Thread Alvaro Herrera
On 2019-Aug-07, Heikki Linnakangas wrote:

> The elephant in the room is the 'timeouts' test, which takes about 40
> seconds, out of a total runtime of 90 seconds. So we'd really want to run
> that in parallel with everything else. Or split 'timeouts' into multiple
> tests that could run in parallel.

Hmm, that test has 8 permutations, five seconds each ... if we split it
in 3 and run those in parallel, we reduce the total isolation runtime
by 25 seconds even if we do *nothing else at all*.  If we tweak the
other things, I think we could make the whole set run in about 30
seconds total in a normal machine.

Splitting the test never crossed my mind.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: no default hash partition

2019-08-07 Thread Alvaro Herrera
On 2019-Aug-07, Tom Lane wrote:

> Alvaro Herrera  writes:
> > Actually, it also says this (in the blurb for the PARTITION OF clause):
> 
> >   Creates the table as a partition of the 
> > specified
> >   parent table. The table can be created either as a partition for 
> > specific
> >   values using FOR VALUES or as a default partition
> >   using DEFAULT.  This option is not available for
> >   hash-partitioned tables.
> 
> > which I think is sufficient.
> 
> Hm, that's rather confusingly worded IMO.  Is the antecedent of "this
> option" just DEFAULT, or does it mean that you can't use FOR VALUES,
> or perchance it means that you can't use a PARTITION OF clause
> at all?

Uh, you're right, I hadn't noticed that.  Not my text.  I think this can
be fixed easily as in the attached.  There are other options, but I like
this one the best.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 9009addb9c..a74fde42d4 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -399,8 +399,7 @@ WITH ( MODULUS numeric_literal, REM
   Creates the table as a partition of the specified
   parent table. The table can be created either as a partition for specific
   values using FOR VALUES or as a default partition
-  using DEFAULT.  This option is not available for
-  hash-partitioned tables.
+  using DEFAULT.
  
 
  
@@ -491,8 +490,8 @@ WITH ( MODULUS numeric_literal, REM
 
  
   If DEFAULT is specified, the table will be
-  created as a default partition of the parent table. The parent can
-  either be a list or range partitioned table. A partition key value
+  created as a default partition of the parent table.  This option
+  is not available for hash-partitioned tables.  A partition key value
   not fitting into any other partition of the given parent will be
   routed to the default partition. There can be only one default
   partition for a given parent table.


Re: Grouping isolationtester tests in the schedule

2019-08-07 Thread Alvaro Herrera
On 2019-Aug-07, Tom Lane wrote:

> The problem in "timeouts" is that it has to use drearily long timeouts
> to be sure that the behavior will be stable even on really slow machines
> (think CLOBBER_CACHE_ALWAYS or valgrind --- it can take seconds for them
> to reach a waiting state that other machines reach quickly).  If we run
> such tests in parallel with anything else, that risks re-introducing the
> instability.  I'm not very sure what we can do about that.  But you might
> be right that unless we can solve that, there's not going to be much to be
> gained from parallelizing the rest.

It runs 8 different permutations serially.  If we run the same
permutations in parallel, it would finish much quicker, and we wouldn't
run it in parallel with anything that would take up CPU time, since
they're all just sleeping.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: is necessary to recheck cached data in fn_extra?

2019-08-07 Thread Pavel Stehule
st 7. 8. 2019 v 18:39 odesílatel Tom Lane  napsal:

> Pavel Stehule  writes:
> > st 7. 8. 2019 v 17:39 odesílatel Tom Lane  napsal:
> >> I wouldn't trust that.  You don't really know what the lifespan of
> >> a fn_extra cache is.
>
> > fn_extra cache cannot be longer than query.
>
> There are fn_extra caches that are not tied to queries.  Admittedly
> they're for special purposes like I/O functions and index support
> functions, and maybe you can assume that your function can't be
> used in such ways.  I don't think it's a great programming model
> though.
>
> > And if I understand well, then
> > is not possible to change parameter types inside query?
>
> Most places dealing with composite types assume that the rowtype *could*
> change intraquery.  I believe this was a live possibility in the past,
> though it might not be today.  (The issue was inheritance queries, but
> I think we now force tuples from child tables to be converted to the
> parent rowtype.  Whether that's 100% bulletproof is unclear.)  If you're
> not dealing with composites then it's an okay assumption.  I think.
>

ok, thank you for your reply.

Regards

Pavel


> regards, tom lane
>


crash 11.5~

2019-08-07 Thread Justin Pryzby
A daily report crashed repeatedly this morning running pg11.4.
I compiled 11.5 and reproduced it there, too, so I'm including a backtrace with
-O0.

I'm trying to dig further into it; it seems to crash under load, but
not when I narrow it down to a single report, which then runs to
completion when run individually.

postmaster[6750]: segfault at 7f4a527545bc ip 004834ae sp 
7ffd547e2760 error 4 in postgres[40+6e6000]
postmaster[29786]: segfault at 7f49a968d000 ip 7f49b24abb0f sp 
7ffd547e2268 error 4 in libc-2.12.so (deleted)[7f49b2422000+18b000]

Core was generated by `postgres: telsasoft ts 192.168.122.11(35454) SELECT  
  '.
Program terminated with signal 11, Segmentation fault.

#0  0x004877df in slot_deform_tuple (slot=0xe204638, natts=3) at 
heaptuple.c:1465
#1  0x00487f5b in slot_getsomeattrs (slot=0xe204638, attnum=3) at 
heaptuple.c:1673
#2  0x006d5aae in ExecInterpExpr (state=0xe257948, econtext=0xe204578, 
isnull=0x75f38077) at execExprInterp.c:443
#3  0x006ed221 in ExecEvalExprSwitchContext (state=0xe257948, 
econtext=0xe204578, isNull=0x75f38077) at 
../../../src/include/executor/executor.h:313
#4  0x006ed30e in ExecQual (state=0xe257948, econtext=0xe204578) at 
../../../src/include/executor/executor.h:382
#5  0x006ed5f4 in ExecScan (node=0x252edc8, accessMtd=0x71c54c 
, recheckMtd=0x71c623 ) at execScan.c:190
#6  0x0071c670 in ExecSeqScan (pstate=0x252edc8) at nodeSeqscan.c:129
#7  0x006eb855 in ExecProcNodeInstr (node=0x252edc8) at 
execProcnode.c:461
#8  0x006f6555 in ExecProcNode (node=0x252edc8) at 
../../../src/include/executor/executor.h:247
#9  0x006f6a05 in ExecAppend (pstate=0xe20e7b0) at nodeAppend.c:294
#10 0x006eb855 in ExecProcNodeInstr (node=0xe20e7b0) at 
execProcnode.c:461
#11 0x00708a96 in ExecProcNode (node=0xe20e7b0) at 
../../../src/include/executor/executor.h:247
#12 0x00709b0d in ExecHashJoinOuterGetTuple (outerNode=0xe20e7b0, 
hjstate=0xe20e4d8, hashvalue=0x75f3827c) at nodeHashjoin.c:821
#13 0x00708f8d in ExecHashJoinImpl (pstate=0xe20e4d8, parallel=false) 
at nodeHashjoin.c:355
#14 0x007094ca in ExecHashJoin (pstate=0xe20e4d8) at nodeHashjoin.c:565
#15 0x006eb855 in ExecProcNodeInstr (node=0xe20e4d8) at 
execProcnode.c:461
#16 0x0071d9da in ExecProcNode (node=0xe20e4d8) at 
../../../src/include/executor/executor.h:247
#17 0x0071db26 in ExecSort (pstate=0xe20e3c0) at nodeSort.c:107
#18 0x006eb855 in ExecProcNodeInstr (node=0xe20e3c0) at 
execProcnode.c:461
#19 0x006eb826 in ExecProcNodeFirst (node=0xe20e3c0) at 
execProcnode.c:445
#20 0x006f74c5 in ExecProcNode (node=0xe20e3c0) at 
../../../src/include/executor/executor.h:247
#21 0x006f79bb in fetch_input_tuple (aggstate=0xe20df70) at 
nodeAgg.c:406
#22 0x006f9d11 in agg_retrieve_direct (aggstate=0xe20df70) at 
nodeAgg.c:1755
#23 0x006f98fa in ExecAgg (pstate=0xe20df70) at nodeAgg.c:1570
#24 0x006eb855 in ExecProcNodeInstr (node=0xe20df70) at 
execProcnode.c:461
#25 0x006eb826 in ExecProcNodeFirst (node=0xe20df70) at 
execProcnode.c:445
#26 0x00708a96 in ExecProcNode (node=0xe20df70) at 
../../../src/include/executor/executor.h:247
#27 0x00709b0d in ExecHashJoinOuterGetTuple (outerNode=0xe20df70, 
hjstate=0x2778580, hashvalue=0x75f3865c) at nodeHashjoin.c:821
#28 0x00708f8d in ExecHashJoinImpl (pstate=0x2778580, parallel=false) 
at nodeHashjoin.c:355
#29 0x007094ca in ExecHashJoin (pstate=0x2778580) at nodeHashjoin.c:565
#30 0x006eb855 in ExecProcNodeInstr (node=0x2778580) at 
execProcnode.c:461
#31 0x006eb826 in ExecProcNodeFirst (node=0x2778580) at 
execProcnode.c:445
#32 0x00702a9b in ExecProcNode (node=0x2778580) at 
../../../src/include/executor/executor.h:247
#33 0x00702d12 in MultiExecPrivateHash (node=0x27783a8) at 
nodeHash.c:164
#34 0x00702c8c in MultiExecHash (node=0x27783a8) at nodeHash.c:114
#35 0x006eb8f7 in MultiExecProcNode (node=0x27783a8) at 
execProcnode.c:501
#36 0x00708e2a in ExecHashJoinImpl (pstate=0x2777210, parallel=false) 
at nodeHashjoin.c:290
#37 0x007094ca in ExecHashJoin (pstate=0x2777210) at nodeHashjoin.c:565
#38 0x006eb855 in ExecProcNodeInstr (node=0x2777210) at 
execProcnode.c:461
#39 0x006eb826 in ExecProcNodeFirst (node=0x2777210) at 
execProcnode.c:445
#40 0x0071d9da in ExecProcNode (node=0x2777210) at 
../../../src/include/executor/executor.h:247
#41 0x0071db26 in ExecSort (pstate=0x2776ce0) at nodeSort.c:107
#42 0x006eb855 in ExecProcNodeInstr (node=0x2776ce0) at 
execProcnode.c:461
#43 0x006eb826 in ExecProcNodeFirst (node=0x2776ce0) at 
execProcnode.c:445
#44 0x006f74c5 in ExecProcNode (node=0x2776ce0) at 
../../../src/include/executor/executor.h:247
#45 0

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-08-07 Thread Bruce Momjian
On Wed, Aug  7, 2019 at 11:41:51AM -0400, Sehrope Sarkuni wrote:
> On Wed, Aug 7, 2019 at 7:19 AM Bruce Momjian  wrote:
> 
> On Wed, Aug  7, 2019 at 05:13:31PM +0900, Masahiko Sawada wrote:
> > I understood. IIUC in your approach postgres processes encrypt WAL
> > records when inserting to the WAL buffer. So WAL data is encrypted
> > even on the WAL buffer.
> 
> 
> I was originally thinking of not encrypting the shared WAL buffers but that 
> may
> have issues. If the buffers are already encrypted and contiguous in shared
> memory, it's possible to write out many via a single pg_pwrite(...) call as is
> currently done in XLogWrite(...).

The shared buffers will not be encrypted --- they are encrypted only
when being written to storage.  We felt encrypting shared buffers will
be too much overhead, for little gain.  I don't know if we will encrypt
while writing to the WAL buffers or while writing the WAL buffers to
the file system.

> If they're not encrypted you'd need to do more work in that critical section.
> That'd involve allocating a commensurate amount of memory to hold the 
> encrypted
> pages and then encrypting them all prior to the single pg_pwrite(...) call.
> Reusing one buffer is possible but it would require encrypting and writing the
> pages one by one. Both of those seem like a bad idea.

Well, right now the 8k pages are part of the WAL stream, so I don't know
that it would be any more overhead than other WAL writes.  I am hoping we can
generate the encryption bit stream in chunks earlier so we can just do
the XOR as we are writing the data to the WAL buffers.

> Better to pay the encryption cost at the time of WAL record creation and keep
> the writing process as fast and simple as possible.

Yes, I don't think we know at the time of WAL record creation what
_offset_ the records will have when they are written to WAL, so I am
thinking we need to do it later, and as I said, I am hoping we can
generate the encryption bit stream earlier.

> > It works but I think the implementation might be complex; For example
> > using openssl, we would use EVP functions to encrypt data by
> > AES-256-CTR. We would need to make IV and pass it to them and these
> > functions however don't manage the counter value of nonce as long as I
> > didn't miss. That is, we need to calculate the correct counter value
> > for each encryption and pass it to EVP functions. Suppose we encrypt
> > 20 bytes of WAL. The first 16 bytes is encrypted with nonce of
> > (segment_number, 0) and the next 4 bytes is encrypted with nonce of
> > (segment_number, 1). After that suppose we encrypt 12 bytes of WAL. We
> > cannot use nonce of (segment_number, 2) but should use nonce of
> > (segment_number , 1). Therefore we would need 4 bytes padding and to
> > encrypt it and then to throw that 4 bytes away .
> 
> Since we want to have per-byte control over encryption, for both
> heap/index pages (skip LSN and CRC), and WAL (encrypt to the last byte),
> I assumed we would need to generate a bit stream of a specified size and
> do the XOR ourselves against the data.  I assume ssh does this, so we
> would have to study the method.
> 
> 
> The lower level non-EVP OpenSSL functions allow specifying the offset within
> the 16-byte AES block from which the encrypt/decrypt should proceed. It's the
> "num" parameter of their encrypt/decrypt functions. For a continuous encrypted
> stream such as a WAL file, a "pread(...)" of a possibly non-16-byte aligned
> section would involve determining the 16-byte counter (byte_offset / 16) and
> the intra-block offset (byte_offset % 16). I'm not sure how one handles
> initializing the internal encrypted counter and that might be one more step
> that would need be done. But it's definitely possible to read / write less 
> than
> a block via those APIs (not the EVP ones).
> 
> I don't think the EVP functions have parameters for the intra-block offset but
> you can mimic it by initializing the IV/block counter and then skipping over
> the intra-block offset by either reading or writing a dummy partial block. The
> EVP read and write functions both deal with individual bytes so once you've
> seeked to your desired offset you can read or write the real individual bytes.

Can we generate the bit stream in 1MB chunks or something and just XOR
as needed?
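
Roughly what I am picturing, purely as a sketch (the names and the 1MB figure
are placeholders, not a design):

#include <stddef.h>
#include <stdint.h>

/* Pure sketch: a pre-generated key-stream chunk, XORed in as data is copied. */
#define KEYSTREAM_CHUNK_SIZE    (1024 * 1024)

static unsigned char keystream[KEYSTREAM_CHUNK_SIZE];  /* AES-CTR over zeroes */
static uint64_t keystream_base;     /* WAL byte offset corresponding to keystream[0] */

static void
xor_into_wal_buffer(unsigned char *dest, const unsigned char *src,
                    size_t len, uint64_t wal_offset)
{
    size_t      pos = wal_offset - keystream_base;
    size_t      i;

    /* caller must have refreshed the chunk so [pos, pos + len) is in range */
    for (i = 0; i < len; i++)
        dest[i] = src[i] ^ keystream[pos + i];
}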

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: crash 11.5~ (and 11.4)

2019-08-07 Thread Justin Pryzby
I checked this still happens with max_parallel_workers_per_gather=0.
Now, I just reproduced using SELECT * FROM that table:

(gdb) p thisatt->attrelid
$4 = 2015128626

ts=# SELECT 2015128626::regclass;
regclass | child.huawei_umts_ucell_201908

(gdb) p thisatt->attnum 
$1 = 2
(gdb) p attnum # For earlier crash this is 1
$2 = 1

ts=# \dt+ child.huawei_umts_ucell_201908
 child  | huawei_umts_ucell_201908 | table | telsasoft | 612 MB | 

\d+ child.huawei_umts_ucell_201908
 bsc6900ucell   | text| 
  | not null | | extended |  | 
 ne_name| text| 
  | not null | | extended |  | 
[...]
Partition of: huawei_umts_ucell_metrics FOR VALUES FROM ('2019-08-01 00:00:00') 
TO ('2019-09-01 00:00:00')
Partition constraint: ((start_time IS NOT NULL) AND (start_time >= '2019-08-01 
00:00:00'::timestamp without time zone) AND (start_time < '2019-09-01 
00:00:00'::timestamp without time zone))
Indexes:
"huawei_umts_ucell_201908_unique_idx" UNIQUE, btree (start_time, device_id, 
bsc6900ucell, ne_name, interval_seconds)
"huawei_umts_ucell_201908_idx" brin (start_time) WITH (autosummarize='true')
"huawei_umts_ucell_201908_site_idx" btree (site_id)
Check constraints:
"huawei_umts_ucell_201908_start_time_check" CHECK (start_time >= 
'2019-08-01 00:00:00'::timestamp without time zone AND start_time < '2019-09-01 
00:00:00'::timestamp without time zone)
Statistics objects:
"child"."huawei_umts_ucell_201908" (ndistinct) ON bsc6900ucell, ne_name, 
start_time, interval_seconds, device_id FROM child.huawei_umts_ucell_201908
Options: autovacuum_analyze_threshold=2, autovacuum_analyze_scale_factor=0.005

ts=# SELECT COUNT(1) FROM pg_attribute WHERE 
attrelid='child.huawei_umts_ucell_201908'::regclass;
count | 560

ts=# SELECT * FROM pg_attribute WHERE 
attrelid='child.huawei_umts_ucell_201908'::regclass AND attnum>=0 ORDER BY 
attnum;

attrelid  | 2015128626
attname   | bsc6900ucell
atttypid  | 25
attstattarget | -1
attlen| -1
attnum| 1
attndims  | 0
attcacheoff   | -1
atttypmod | -1
attbyval  | f
attstorage| x
attalign  | i
attnotnull| t
atthasdef | f
atthasmissing | f
attidentity   | 
attisdropped  | f
attislocal| f
attinhcount   | 1
attcollation  | 100
attacl| 
attoptions| 
attfdwoptions | 
attmissingval | 

The only interesting thing about the parent/siblings is that the previous month
was partitioned with "daily" granularity.  I adjusted our threshold for that,
so that August is partitioned monthly:

[...]
child.huawei_umts_ucell_20190730 FOR VALUES FROM ('2019-07-30 00:00:00') TO 
('2019-07-31 00:00:00'),
child.huawei_umts_ucell_20190731 FOR VALUES FROM ('2019-07-31 00:00:00') TO 
('2019-08-01 00:00:00'),
child.huawei_umts_ucell_201908 FOR VALUES FROM ('2019-08-01 00:00:00') TO 
('2019-09-01 00:00:00')

Program terminated with signal 11, Segmentation fault.
#0  0x004877df in slot_deform_tuple (slot=0x22b5860, natts=554) at 
heaptuple.c:1465
1465off = att_align_pointer(off, 
thisatt->attalign, -1,

(gdb) bt
#0  0x004877df in slot_deform_tuple (slot=0x22b5860, natts=554) at 
heaptuple.c:1465
#1  0x00487e4b in slot_getallattrs (slot=0x22b5860) at heaptuple.c:1626
#2  0x0048aab9 in printtup (slot=0x22b5860, self=0x2161b28) at 
printtup.c:392
#3  0x008d0346 in RunFromStore (portal=0x21d5638, 
direction=ForwardScanDirection, count=0, dest=0x2161b28) at pquery.c:1106
#4  0x008cfe88 in PortalRunSelect (portal=0x21d5638, forward=true, 
count=0, dest=0x2161b28) at pquery.c:928
#5  0x008cfb4c in PortalRun (portal=0x21d5638, 
count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x2161b28, 
altdest=0x2161b28, completionTag=0x75f39100 "") at pquery.c:773
#6  0x008c9b22 in exec_simple_query (query_string=0x2160d28 "FETCH 
FORWARD 999 FROM _psql_cursor") at postgres.c:1145
#7  0x008cddf3 in PostgresMain (argc=1, argv=0x218bfb0, 
dbname=0x218beb0 "ts", username=0x218be90 "pryzbyj") at postgres.c:4182
#8  0x0082a098 in BackendRun (port=0x2181ac0) at postmaster.c:4358
#9  0x00829806 in BackendStartup (port=0x2181ac0) at postmaster.c:4030
#10 0x00825cab in ServerLoop () at postmaster.c:1707
#11 0x008255dd in PostmasterMain (argc=3, argv=0x215b9a0) at 
postmaster.c:1380
#12 0x0074ba30 in main (argc=3, argv=0x215b9a0) at main.c:228

#0  0x004877df in slot_deform_tuple (slot=0x22b5860, natts=554) at 
heaptuple.c:1465
thisatt = 0x22dec50
tuple = 0x22b58a0
tupleDesc = 0x22debc0
values = 0x22b58c0
isnull = 0x22b6a10
tup = 0x263f9a0
hasnulls = false
attnum = 1
tp = 0x263fa05 
"\374\023\026\213^s#\347(\235=\326\321\067\032\245\321B\026}܋FS\375\244\

Re: Problem with default partition pruning

2019-08-07 Thread Simon Riggs
On Tue, 6 Aug 2019 at 23:18, Alvaro Herrera 
wrote:

> On 2019-Aug-06, Alvaro Herrera wrote:
>
> > Well, if this is really all that duplicative, one thing we could do is
> > run this check in get_partprune_steps_internal only if
> > constraint_exclusion is a value other than on; we should achieve the
> > same effect with no repetition.  Patch for that is attached.  However,
> > if I run the server with constraint_exclusion=on, the regression test
> > fail with the attached diff.  I didn't look at what test is failing, but
> > it seems to me that it's not really duplicative in all cases, only some.
> > Therefore we can't do it.
>
> Right ... One of the failing cases is one that was benefitting from
> constraint_exclusion in the location where we were doing it previously.
>
> I think trying to fix this problem by selectively moving where to apply
> constraint exclusion would be very bug-prone, and hard to detect whether
> we're missing one spot or doing it multiple times in some other cases.
> So I think we shouldn't try.  If this is a real problem, then we should
> add a flag to the reloptinfo and set it when we've done pruning, then
> do nothing if we already did it.  I'm not sure that this is correct, and
> I'm even less sure that it is worth the trouble.
>
> In short, I propose to get this done as the patch I posted in
> https://postgr.es/m/20190806133053.GA23706@alvherre.pgsql


I saw your recent commit and it scares me in various places, noted below.

"Commit: Apply constraint exclusion more generally in partitioning"

"This applies particularly to the default partition..."

My understanding of the thread was the complaint was about removing the
default partition. I would prefer to see code executed just for that case,
so that people who do not define a default partition are unaffected.

"So in certain cases
we're scanning partitions that we don't need to."

Avoiding that has been the subject of months of work.

"This has the unwanted side-effect of testing some (most? all?)
constraints more than once if constraint_exclusion=on.  That seems
unavoidable as far as I can tell without some additional work, but
that's not the recommended setting for that parameter anyway.
However, because this imposes additional processing cost for all
queries using partitioned tables..."

One of the major features of PG12 is the additional performance and
scalability of partitioning, but we don't seem to have benchmarked the
effect of this patch on those issues.

Please could we do perf checks, with tests up to 1000s of partitions? And
if there is a regression, I would vote to revoke this patch or address the
request in a less general way.

Hopefully I have misunderstood and/or there is no regression.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Solutions for the Enterprise


Re: Remove HeapTuple and Buffer dependency for predicate locking functions

2019-08-07 Thread Ashwin Agrawal
On Fri, Aug 2, 2019 at 4:56 PM Ashwin Agrawal  wrote:

>
> On Wed, Jul 31, 2019 at 2:06 PM Andres Freund  wrote:
>
>> Looking at the code as of master, we currently have:
>>
>
> Super awesome feedback and insights, thank you!
>
> - PredicateLockTuple() calls SubTransGetTopmostTransaction() to figure
>>   out a whether the tuple has been locked by the current
>>   transaction. That check afaict just should be
>>   TransactionIdIsCurrentTransactionId(), without all the other
>>   stuff that's done today.
>>
>
> Agree. v1-0002 patch attached does that now. Please let me know if that's
> what you meant.
>
>   TransactionIdIsCurrentTransactionId() imo ought to be optimized to
>>   always check for the top level transactionid first - that's a good bet
>>   today, but even moreso for the upcoming AMs that won't have separate
>>   xids for subtransactions.  Alternatively we shouldn't make that a
>>   binary search for each subtrans level, but just have a small
>>   simplehash hashtable for xids.
>>
>
> v1-0001 patch checks for GetTopTransactionIdIfAny() first in
> TransactionIdIsCurrentTransactionId() which seems yes better in general and
> more for future. That mostly meets the needs for current discussion.
>
> The alternative of not using binary search seems bigger refactoring and
> should be handled as separate optimization exercise outside of this thread.
>
>
>> - CheckForSerializableConflictOut() wants to get the toplevel xid for
>>   the tuple, because that's the one the predicate hashtable stores.
>>
>>   In your patch you've already moved the HTSV() call etc out of
>>   CheckForSerializableConflictOut(). I'm somewhat inclined to think that
>>   the SubTransGetTopmostTransaction() call ought to go along with that.
>>   I don't really think that belongs in predicate.c, especially if
>>   most/all new AMs don't use subtransaction ids.
>>
>>   The only downside is that currently the
>>   TransactionIdEquals(xid, GetTopTransactionIdIfAny()) check
>>   avoids the SubTransGetTopmostTransaction() check.
>>
>>   But again, the better fix for that seems to be to improve the generic
>>   code. As written the check won't prevent a subtrans lookup for heap
>>   when subtransactions are in use, and it's IME pretty common for tuples
>>   to get looked at again in the transaction that has created them. So
>>   I'm somewhat inclined to think that SubTransGetTopmostTransaction()
>>   should have a fast-path for the current transaction - probably just
>>   employing TransactionIdIsCurrentTransactionId().
>>
>
> That optimization, as Kuntal also mentioned, seems something which can be
> done on-top afterwards on current patch.
>
>
>> I don't really see what we gain by having the subtrans handling in the
>> predicate code. Especially given that we've already moved the HTSV()
>> handling out, it seems architecturally the wrong place to me - but I
>> admit that that's a fuzzy argument.  The relevant mapping should be one
>> line in the caller.
>>
>
> Okay, I moved the sub transaction handling out of
> CheckForSerializableConflictOut() and have it along side HTSV() now.
>
> The reason I felt leaving subtransaction handling in generic place, was it
> might be premature to thing no future AM will need it. Plus, all
> serializable function api's having same expectations is easier. Like
> PredicateLockTuple() can be passed top or subtransaction id and it can
> handle it but with the change CheckForSerializableConflictOut() only be
> feed top transaction ID. But its fine and can see the point of AM needing
> it can easily get top transaction ID and feed it as heap.
>
>
>> I wonder if it'd be wroth to combine the
>> TransactionIdIsCurrentTransactionId() calls in the heap cases that
>> currently do both, PredicateLockTuple() and
>> HeapCheckForSerializableConflictOut(). The heap_fetch() case probably
>> isn't commonly that hot a pathq, but heap_hot_search_buffer() is.
>>
>
> Maybe, will give thought to it separate from the current patch.
>
>
>> Minor notes:
>> - I don't think 'insert_xid' is necessarily great - it could also be the
>>   updating xid etc. And while you can argue that an update is an insert
>>   in the current heap, that's not the case for future AMs.
>> - to me
>> @@ -1621,7 +1622,8 @@ heap_hot_search_buffer(ItemPointer tid, Relation
>> relation, Buffer buffer,
>> if (valid)
>> {
>> ItemPointerSetOffsetNumber(tid, offnum);
>> -   PredicateLockTuple(relation, heapTuple,
>> snapshot);
>> +   PredicateLockTID(relation,
>> &(heapTuple)->t_self, snapshot,
>> +
>> HeapTupleHeaderGetXmin(heapTuple->t_data));
>> if (all_dead)
>> *all_dead = false;
>> return true;
>>
>>  What are those parens - as placed they can't do anything. Did you
>>  intend to write &(heapTuple->t_self)? Even that is pret

Re: crash 11.5~ (and 11.4)

2019-08-07 Thread Justin Pryzby
Just found this, although I'm not sure what to do about it.  If it's corrupt
table data, I can restore from backup.

ts=# VACUUM FREEZE VERBOSE child.huawei_umts_ucell_201908;
INFO:  0: aggressively vacuuming "child.huawei_umts_ucell_201908"
LOCATION:  lazy_scan_heap, vacuumlazy.c:502
ERROR:  XX001: found xmin 73850277 from before relfrozenxid 111408920
LOCATION:  heap_prepare_freeze_tuple, heapam.c:6853

I confirmed I updated to 11.4 immediately on its release:
[pryzbyj@database ~]$ ls -ltc /usr/pgsql-11/bin/postgres
-rwxr-xr-x. 1 root root 7291736 Jun 20 07:09 /usr/pgsql-11/bin/postgres

That table would've been created shortly after midnight on Aug 1 when we loaded
first data for the month.  So it was created and processed only on pg11.4,
although the parent has probably been around since pg_upgrade last October.

Here's a usable-looking backtrace:

#0  heap_prepare_freeze_tuple (tuple=0x7fd3d8fa0058, relfrozenxid=111408920, 
relminmxid=1846, cutoff_xid=111658731, cutoff_multi=1846, frz=0x223e930, 
totally_frozen_p=0x75f3848f) at heapam.c:6832
changed = 211
xmax_already_frozen = 127
xmin_frozen = false
freeze_xmax = false
xid = 0
__func__ = "heap_prepare_freeze_tuple"

#1  0x006c8e6e in lazy_scan_heap (onerel=0x7fd37e4f8fa0, options=13, 
vacrelstats=0x223e790, Irel=0x223e8d8, nindexes=3, aggressive=true)
at vacuumlazy.c:1151
tuple_totally_frozen = false
itemid = 0x7fd3d8f9e918
buf = 168313
page = 0x7fd3d8f9e900 "\020:"
offnum = 1
maxoff = 3
hastup = true
nfrozen = 0
freespace = 35907664
all_frozen = true
tupgone = false
prev_dead_count = 0
all_visible_according_to_vm = true
all_visible = true
has_dead_tuples = false
visibility_cutoff_xid = 73850277
nblocks = 75494
blkno = 34915
tuple = {t_len = 2212, t_self = {ip_blkid = {bi_hi = 0, 
  bi_lo = 34915}, ip_posid = 1}, t_tableOid = 2015128626, 
  t_data = 0x7fd3d8fa0058}
relname = 0x7fd37e4f6ac8 "huawei_umts_ucell_201908"
relfrozenxid = 111408920
relminmxid = 1846
empty_pages = 0
vacuumed_pages = 0
next_fsm_block_to_vacuum = 0
num_tuples = 1
live_tuples = 1
tups_vacuumed = 0
nkeep = 0
nunused = 0
indstats = 0x223e850
i = 32512
ru0 = {tv = {tv_sec = 1565204981, tv_usec = 609217}, ru = {ru_utime = {
  tv_sec = 13, tv_usec = 865892}, ru_stime = {tv_sec = 1, 
  tv_usec = 960701}, ru_maxrss = 136988, ru_ixrss = 0, 
ru_idrss = 0, ru_isrss = 0, ru_minflt = 48841, ru_majflt = 1, 
ru_nswap = 0, ru_inblock = 196152, ru_oublock = 292656, 
ru_msgsnd = 0, ru_msgrcv = 0, ru_nsignals = 0, ru_nvcsw = 849, 
ru_nivcsw = 6002}}
vmbuffer = 4267
next_unskippable_block = 34916
skipping_blocks = false
frozen = 0x223e930
buf = {data = 0x0, len = -168590068, maxlen = 32767, 
  cursor = -168589904}
initprog_index = {0, 1, 5}
initprog_val = {1, 75494, 21968754}
__func__ = "lazy_scan_heap"
#2  0x006c77c7 in lazy_vacuum_rel (onerel=0x7fd37e4f8fa0, options=13, 
params=0x75f38a70, bstrategy=0x22b8cf8) at vacuumlazy.c:265
vacrelstats = 0x223e790
Irel = 0x223e8d8
nindexes = 3
ru0 = {tv = {tv_sec = 140733193388036, tv_usec = 281474985219646}, 
  ru = {ru_utime = {tv_sec = 12933152, tv_usec = 35123496}, 
ru_stime = {tv_sec = 140737319765996, tv_usec = 4687312}, 
ru_maxrss = 0, ru_ixrss = -1887226286461968708, 
ru_idrss = 35351240, ru_isrss = 35249392, 
ru_minflt = 140737319765776, ru_majflt = 35249392, ru_nswap = 20, 
ru_inblock = 0, ru_oublock = 140737319765936, 
ru_msgsnd = 10990471, ru_msgrcv = 140546333970336, 
ru_nsignals = 35249264, ru_nvcsw = 4, ru_nivcsw = 4687312}}
starttime = 0
secs = 72057594037927936
usecs = 2015128626
read_rate = 6.9533474784178538e-310
write_rate = 6.9439115263673584e-310
aggressive = true
scanned_all_unfrozen = false
xidFullScanLimit = 111658735
mxactFullScanLimit = 1846
new_rel_pages = 0
new_rel_allvisible = 36424984
new_live_tuples = 6.9533474784194348e-310
new_frozen_xid = 9080197
new_min_multi = 0
__func__ = "lazy_vacuum_rel"
#3  0x006c72b3 in vacuum_rel (relid=2015128626, relation=0x2161820, 
options=13, params=0x75f38a70) at vacuum.c:1557
lmode = 4
onerel = 0x7fd37e4f8fa0
onerelid = {relId = 2015128626, dbId = 16564}
toast_relid = 2015128673
save_userid = 18712
save_sec_context = 0
save_nestlevel = 2
rel_lock = true
__func_

Re: initdb: Use varargs macro for PG_CMD_PRINTF

2019-08-07 Thread Ibrar Ahmed
Hi,
The patch does not apply; a rebased patch on
(68343b4ad75305391b38f4b42734dc07f2fe7ee2) is attached.

On Wed, Aug 7, 2019 at 7:04 PM Tom Lane  wrote:

> Peter Eisentraut  writes:
> > Small patch to simplify some no longer necessary complicated code, using
> > varargs macros.
>
> +1
>
> regards, tom lane
>
>
>

-- 
Ibrar Ahmed


initdb-print_v2.patch
Description: Binary data


Re: Duplicated LSN in ReorderBuffer

2019-08-07 Thread Alvaro Herrera
On 2019-Jul-26, Andres Freund wrote:

> 2) We could simply assign the subtransaction to the parent using
>ReorderBufferAssignChild() in SnapBuildProcessNewCid() or it's
>caller. That ought to also fix the bug
> 
>I also has the advantage that we can save some memory in transactions
>that have some, but fewer than the ASSIGNMENT limit subtransactions,
>because it allows us to avoid having a separate base snapshot for
>them (c.f. ReorderBufferTransferSnapToParent()).

I'm not sure I understood this suggestion correctly.  I first tried with
this, which seems the simplest rendition:

--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -772,6 +772,12 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId 
xid,
 {
CommandId   cid;
 
+   if ((SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT) &&
+   (xlrec->top_xid != xid))
+   {
+   ReorderBufferAssignChild(builder->reorder, xlrec->top_xid, xid, 
lsn);
+   }
+
/*
 * we only log new_cid's if a catalog tuple was modified, so mark the
 * transaction as containing catalog modifications

test_decoding's tests pass with that, but if I try the example script
provided by Ildar, all pgbench clients die with this:

client 19 script 1 aborted in command 1 query 0: ERROR:  subtransaction logged 
without previous top-level txn record

I thought I would create the main txn before calling AssignChild in
snapbuild; however, ReorderBufferTXNByXid is static in reorderbuffer.c.
So that seems out.  My next try was to remove the elog() that was
causing the failure ... but that leads pretty quickly to a crash with
this backtrace:

#2  0x5653241fb823 in ExceptionalCondition 
(conditionName=conditionName@entry=0x5653243c1960 "!(prev_first_lsn < 
cur_txn->first_lsn)", 
errorType=errorType@entry=0x565324250596 "FailedAssertion", 
fileName=fileName@entry=0x5653243c18e8 
"/pgsql/source/master/src/backend/replication/logical/reorderbuffer.c", 
lineNumber=lineNumber@entry=680) at 
/pgsql/source/master/src/backend/utils/error/assert.c:54
#3  0x565324062a84 in AssertTXNLsnOrder (rb=rb@entry=0x565326304fa8)
at /pgsql/source/master/src/backend/replication/logical/reorderbuffer.c:680
#4  0x565324062e39 in ReorderBufferTXNByXid (rb=rb@entry=0x565326304fa8, 
xid=, xid@entry=185613, create=create@entry=true, 
is_new=is_new@entry=0x0, lsn=lsn@entry=2645271944, 
create_as_top=create_as_top@entry=true)
at /pgsql/source/master/src/backend/replication/logical/reorderbuffer.c:559
#5  0x565324067365 in ReorderBufferAddNewTupleCids (rb=0x565326304fa8, 
xid=185613, lsn=lsn@entry=2645271944, node=..., tid=..., cmin=0, 
cmax=4294967295, combocid=4294967295) at 
/pgsql/source/master/src/backend/replication/logical/reorderbuffer.c:2100
#6  0x565324069451 in SnapBuildProcessNewCid (builder=0x56532630afd8, 
xid=185614, lsn=2645271944, xlrec=0x5653262efc78)
at /pgsql/source/master/src/backend/replication/logical/snapbuild.c:787

Now this failure goes away if I relax the < to <= in the
complained-about line ... but at this point it's two sanity checks that
I've lobotomized in order to get this to run at all.  Not really
comfortable with that.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5fa3d7323e..55538fa44c 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -677,7 +677,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 
 		/* Current initial LSN must be strictly higher than previous */
 		if (prev_first_lsn != InvalidXLogRecPtr)
-			Assert(prev_first_lsn < cur_txn->first_lsn);
+			Assert(prev_first_lsn <= cur_txn->first_lsn);
 
 		/* known-as-subtxn txns must not be listed */
 		Assert(!cur_txn->is_known_as_subxact);
@@ -778,9 +778,6 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	txn = ReorderBufferTXNByXid(rb, xid, true, &new_top, lsn, true);
 	subtxn = ReorderBufferTXNByXid(rb, subxid, true, &new_sub, lsn, false);
 
-	if (new_top && !new_sub)
-		elog(ERROR, "subtransaction logged without previous top-level txn record");
-
 	if (!new_sub)
 	{
 		if (subtxn->is_known_as_subxact)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index dc64b1e0c2..c8b1fcd96c 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -772,6 +772,10 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 {
 	CommandId	cid;
 
+	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
+		xlrec->top_xid != xid)
+		ReorderBufferAssignChild(builder->reorder, xlrec->top_xid, xid, lsn);
+
 	/*
 	 * we only log new_cid's if a catalog tu

Re: Problem with default partition pruning

2019-08-07 Thread Alvaro Herrera
On 2019-Aug-07, Simon Riggs wrote:

> I saw your recent commit and it scares me in various places, noted below.
> 
> "Commit: Apply constraint exclusion more generally in partitioning"
> 
> "This applies particularly to the default partition..."
> 
> My understanding of the thread was the complaint was about removing the
> default partition. I would prefer to see code executed just for that case,
> so that people who do not define a default partition are unaffected.

Well, as the commit message noted, it applies to other cases also, not
just the default partition.  The default partition just happens to be
the most visible case.

> "So in certain cases
> we're scanning partitions that we don't need to."
> 
> Avoiding that has been the subject of months of work.

Well, yes, avoiding that is the point of this commit also: we were
scanning some partitions for some queries; after this patch, we're
supposed not to.

> "This has the unwanted side-effect of testing some (most? all?)
> constraints more than once if constraint_exclusion=on.  That seems
> unavoidable as far as I can tell without some additional work, but
> that's not the recommended setting for that parameter anyway.
> However, because this imposes additional processing cost for all
> queries using partitioned tables..."
> 
> One of the major features of PG12 is the additional performance and
> scalability of partitioning, but we don't seem to have benchmarked the
> effect of this patch on those issues.
> 
> Please could we do perf checks, with tests up to 1000s of partitions? And
> if there is a regression, I would vote to revoke this patch or address the
> request in a less general way.

I'll have a look.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Adding a test for speculative insert abort case

2019-08-07 Thread Melanie Plageman
On Wed, Jul 24, 2019 at 11:48 AM Andres Freund  wrote:

> > diff --git a/src/test/isolation/specs/insert-conflict-specconflict.spec
> b/src/test/isolation/specs/insert-conflict-specconflict.spec
> > index 3a70484fc2..7f29fb9d02 100644
> > --- a/src/test/isolation/specs/insert-conflict-specconflict.spec
> > +++ b/src/test/isolation/specs/insert-conflict-specconflict.spec
> > @@ -10,7 +10,7 @@ setup
> >  {
> >   CREATE OR REPLACE FUNCTION blurt_and_lock(text) RETURNS text
> IMMUTABLE LANGUAGE plpgsql AS $$
> >   BEGIN
> > -RAISE NOTICE 'called for %', $1;
> > +RAISE NOTICE 'blurt_and_lock() called for %', $1;
> >
> >   -- depending on lock state, wait for lock 2 or 3
> >  IF
> pg_try_advisory_xact_lock(current_setting('spec.session')::int, 1) THEN
> > @@ -23,9 +23,16 @@ setup
> >  RETURN $1;
> >  END;$$;
> >
> > +CREATE OR REPLACE FUNCTION blurt_and_lock2(text) RETURNS text
> IMMUTABLE LANGUAGE plpgsql AS $$
> > +BEGIN
> > +RAISE NOTICE 'blurt_and_lock2() called for %', $1;
> > +PERFORM
> pg_advisory_xact_lock(current_setting('spec.session')::int, 4);
> > +RETURN $1;
> > +END;$$;
> > +
>
> Any chance for a bit more descriptive naming than *2? I can live with
> it, but ...
>
>
Taylor Vesely and I paired on updating this test, and it became clear
that the way the steps and functions are named makes it very
difficult to understand what the test is doing. That is, I helped
write this test and, after a month away, I could no longer understand
what it was doing at all.

We changed the text of the blurts from "blocking" to "acquiring
advisory lock ..." because we realized the old message printed even
when the lock was acquired immediately, which is a little misleading
for the reader.

He's taking a stab at some renaming/refactoring to make it clearer
(including renaming blurt_and_lock2()).


>
> > +step "controller_print_speculative_locks" { SELECT
> locktype,classid,objid,mode,granted FROM pg_locks WHERE
> locktype='speculative
> > +token' ORDER BY granted; }
>
> I think showing the speculative locks is possibly going to be unreliable
> - the release time of speculative locks is IIRC not that reliable. I
> think it could e.g. happen that speculative locks are held longer
> because autovacuum spawned an analyze in the background.
>
>
I actually think having the "controller_print_speculative_locks" step
wouldn't be a problem, because we have not yet released the advisory
lock on 4 in s2 that allows it to complete its speculative insertion,
so s1 will still be in speculative wait.

The step that might be a problem if autovacuum delays release of the
speculative locks is the "controller_show" step, because, at that
point, if the lock wasn't released, then s1 would still be waiting and
wouldn't have updated.


>
> > +   # Should report s1 is waiting on speculative lock
> > +   "controller_print_speculative_locks"
>
> Hm, I might be missing something, but I don't think it currently
> does. Looking at the expected file:
>

+step controller_print_speculative_locks: SELECT
> locktype,classid,objid,mode,granted FROM pg_locks WHERE
> locktype='speculative
> +token' ORDER BY granted;
> +locktype   classidobjid  mode   granted
>
> +
>
>
Oops! Due to an errant newline, the query wasn't correct.


> And if it showed something, it'd make the test not work, because
> classid/objid aren't necessarily going to be the same from test to test.
>
>
Good point. In the attached patch, classid/objid columns are removed
from the SELECT list.
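
For reference, the adjusted step presumably ends up looking something
like the query below (a sketch, not necessarily the exact column list
in the attached patch); dropping classid/objid keeps the expected
output stable, since those values vary from run to run:

-- Hedged sketch of the corrected isolation-test step: only columns with
-- stable values are selected, so the expected output is reproducible.
SELECT locktype, mode, granted
  FROM pg_locks
 WHERE locktype = 'speculative token'
 ORDER BY granted;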

Melanie & Taylor


v3-0001-Test-additional-speculative-conflict-scenarios.patch
Description: Binary data


Re: is necessary to recheck cached data in fn_extra?

2019-08-07 Thread Tom Lane
Pavel Stehule  writes:
> On Wed, Aug 7, 2019 at 17:39, Tom Lane wrote:
>> I wouldn't trust that.  You don't really know what the lifespan of
>> a fn_extra cache is.

> The fn_extra cache cannot live longer than the query.

There are fn_extra caches that are not tied to queries.  Admittedly
they're for special purposes like I/O functions and index support
functions, and maybe you can assume that your function can't be
used in such ways.  I don't think it's a great programming model
though.

> And if I understand correctly, then
> it is not possible to change parameter types inside a query?

Most places dealing with composite types assume that the rowtype *could*
change intraquery.  I believe this was a live possibility in the past,
though it might not be today.  (The issue was inheritance queries, but
I think we now force tuples from child tables to be converted to the
parent rowtype.  Whether that's 100% bulletproof is unclear.)  If you're
not dealing with composites then it's an okay assumption.  I think.

regards, tom lane




Re: Unused header file inclusion

2019-08-07 Thread Alvaro Herrera
On 2019-Aug-05, Tom Lane wrote:

> Alvaro Herrera  writes:
> > Then there's the <limits.h> removal, which is in tuplesort.c because of
> > INT_MAX as added by commit d26559dbf356 and still present ...
> 
> One has to be especially wary of removing system-header inclusions;
> the fact that they don't seem to be needed on your own machine doesn't
> prove they aren't needed elsewhere.

As far as I can see, this line was just carried over from tuplestore.c,
but that file uses INT_MAX and this one doesn't use any of those macros
as far as I can tell.

I pushed this change and hereby consider this to be over.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Duplicated LSN in ReorderBuffer

2019-08-07 Thread Andres Freund
Hi,

On 2019-08-07 16:19:13 -0400, Alvaro Herrera wrote:
> On 2019-Jul-26, Andres Freund wrote:
> 
> > 2) We could simply assign the subtransaction to the parent using
> >ReorderBufferAssignChild() in SnapBuildProcessNewCid() or it's
> >caller. That ought to also fix the bug
> > 
> >It also has the advantage that we can save some memory in transactions
> >that have some, but fewer than the ASSIGNMENT limit subtransactions,
> >because it allows us to avoid having a separate base snapshot for
> >them (c.f. ReorderBufferTransferSnapToParent()).
> 
> I'm not sure I understood this suggestion correctly.  I first tried with
> this, which seems the simplest rendition:
> 
> --- a/src/backend/replication/logical/snapbuild.c
> +++ b/src/backend/replication/logical/snapbuild.c
> @@ -772,6 +772,12 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId 
> xid,
>  {
>   CommandId   cid;
>  
> + if ((SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT) &&
> + (xlrec->top_xid != xid))
> + {
> + ReorderBufferAssignChild(builder->reorder, xlrec->top_xid, xid, 
> lsn);
> + }
> +

I think we would need to do this for all values of
SnapBuildCurrentState() - after all the problem occurs because we
*previously* didn't assign subxids to the toplevel xid.  Compared to the
cost of catalog changes, ReorderBufferAssignChild() is really cheap. So
I don't think there's any problem just calling it unconditionally (when
top_xid <> xid, of course).

If the above is the only change, I think the body of the if should be
unreachable; DecodeHeap2Op guards against that:

/*
 * If we don't have snapshot or we are just fast-forwarding, there is no
 * point in decoding changes.
 */
if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT ||
ctx->fast_forward)
return;


> I thought I would create the main txn before calling AssignChild in
> snapbuild; however, ReorderBufferTXNByXid is static in reorderbuffer.c.
> So that seems out.

There shouldn't be any need for doing that; ReorderBufferAssignChild
already does it.

Greetings,

Andres Freund




Re: Grouping isolationtester tests in the schedule

2019-08-07 Thread Tom Lane
Heikki Linnakangas  writes:
> The elephant in the room is the 'timeouts' test, which takes about 40 
> seconds, out of a total runtime of 90 seconds. So we'd really want to 
> run that in parallel with everything else. Or split 'timeouts' into 
> multiple tests that could run in parallel. I don't think grouping the 
> rest of the tests differently will make much difference to how easy or 
> hard that is.

The problem in "timeouts" is that it has to use drearily long timeouts
to be sure that the behavior will be stable even on really slow machines
(think CLOBBER_CACHE_ALWAYS or valgrind --- it can take seconds for them
to reach a waiting state that other machines reach quickly).  If we run
such tests in parallel with anything else, that risks re-introducing the
instability.  I'm not very sure what we can do about that.  But you might
be right that unless we can solve that, there's not going to be much to be
gained from parallelizing the rest.

I wonder if there's some way to scale the timeout values based on
machine speed?  But how would the test get that info?

regards, tom lane




Re: Grouping isolationtester tests in the schedule

2019-08-07 Thread Tom Lane
Alvaro Herrera  writes:
> On 2019-Aug-07, Tom Lane wrote:
>> The problem in "timeouts" is that it has to use drearily long timeouts
>> to be sure that the behavior will be stable even on really slow machines
>> (think CLOBBER_CACHE_ALWAYS or valgrind --- it can take seconds for them
>> to reach a waiting state that other machines reach quickly).  If we run
>> such tests in parallel with anything else, that risks re-introducing the
>> instability.  I'm not very sure what we can do about that.  But you might
>> be right that unless we can solve that, there's not going to be much to be
>> gained from parallelizing the rest.

> It runs 8 different permutations serially.  If we run the same
> permutations in parallel, it would finish much quicker, and we wouldn't
> run it in parallel with anything that would take up CPU time, since
> they're all just sleeping.

Wrong ... they're *not* just sleeping, in the problem cases.  They're
eating cycles due to CLOBBER_CACHE_ALWAYS or valgrind.  They're on their
way to sleeping; but they have to get there before the timeout elapses,
or the test shows unexpected results.

Admittedly, as long as you've got more CPUs than tests, it should still
be OK.  But if you don't, boom.

regards, tom lane




Re: Avoid full GIN index scan when possible

2019-08-07 Thread Tom Lane
Nikita Glukhov  writes:
> Attached 6th version of the patches.

I spent a bit of time looking at these.  Attached is a proposed revision
of the 0001 patch, with some minor changes:

* I didn't adopt your move of the "Non-default modes require the index
to have placeholders" test to after the stanza handling zero-key cases.
I think that move would be correct for 0001 as it stands, but it's far
less clear that it's still safe after 0002/0003 or whatever variant of
those we end up with.  We should leave that code where it is for now,
enforcing the v1-index requirement for all non-default search modes, and
reconsider after the dust settles.  (Or if we never do reconsider, it
won't be a big deal --- I doubt many v0 indexes are still out there.)

* Rearranged the selfuncs.c logic to match ginNewScanKey better.

* Cleaned up my own sloppiness in the new gin.sql test cases.

I think this would be committable as it stands, except that replacing
an ALL scan with an EVERYTHING scan could be a performance regression
if the index contains many null items.  We need to do something about
that before committing.

Unfortunately I'm not sold on either 0002 or 0003 as they stand;
they seem overly complicated, I'm not convinced they're correct,
and you haven't really provided examples showing that all this
extra complexity is worthwhile.

In particular:

* I don't really like the whole business about detecting a constant-true
ALL condition by applying the consistentFn at this stage.  That just
feels wrong to me: the consistentFn should be expecting some data about
the index contents and we don't have any to give.  If it works, it's
accidental, and probably it's fragile.  Moreover, the only gain we'd get
from it is maybe not having to set forcedRecheck, and that doesn't look
to me like it would make all that much difference.

* The code seems to be assuming that a zero-key ALL query is necessarily
precisely equivalent to a NOT NULL condition.  This seems flat out wrong.
At the very least it's possible for such a query to be constant-false,
rather than constant-true-for-non-null-items.  Admittedly, that would
suggest rather stupid coding of the opclass query-extract function, as
it could have reported a constant-false condition in an optimizable way
rather than an unoptimizable one.  But we aren't entitled to assume that
the opclass isn't being dumb; the API says it can do this, so it can.
I think we have to check the scankey regardless, either in the index or
via forcedRecheck.

* I really dislike the refcount business in 0003.  It's not clear why we
need that or whether it's correct, and I think it would be unmaintainable
even if it were documented (which it isn't).


ISTM we could get where we need to go in a much simpler way.  A couple
of alternative ideas:

* During ginNewScanKey, separate out ALL-mode queries and don't add them
to the scankey list immediately.  After examining all the keys, if we
found any normal (DEFAULT or INCLUDE_EMPTY) queries, then go ahead and
add in the ALL-mode queries so that we can check them in the index, but
don't cause a full scan.  Otherwise, discard all the ALL-mode queries
and emit a NOT_NULL scan key, setting forcedRecheck so that we apply the
filtering that way.

* Or we could just discard ALL-mode queries on sight, setting
forcedRecheck always, and then emit NOT_NULL if we had any
of those and no normal queries.

The thing that seems hard to predict here is whether it is worth tracking
the presence of additional keys in the index to avoid a recheck in the
heap.  It's fairly easy to imagine that for common keys, avoiding the
recheck would be a net loss because it requires more work in the index.
If we had statistics about key counts, which of course we don't, you
could imagine making this decision dynamically depending on whether an
ALL query is asking about common or uncommon keys.

BTW --- any way you slice this, it seems like we'd end up with a situation
where we never execute an ALL query against the index in the way we do
now, meaning that the relevant code in the scanning logic is dead and
could be removed.  Given that, we don't really need a new NOT_NULL search
mode; we could just redefine what ALL mode actually does at execution.
This would be good IMO, because it's not obvious what the difference
between ALL and NOT_NULL modes is anyway.

regards, tom lane

diff --git a/contrib/pg_trgm/expected/pg_trgm.out b/contrib/pg_trgm/expected/pg_trgm.out
index b3e709f..3e5ba9b 100644
--- a/contrib/pg_trgm/expected/pg_trgm.out
+++ b/contrib/pg_trgm/expected/pg_trgm.out
@@ -3498,6 +3498,68 @@ select count(*) from test_trgm where t ~ '[qwerty]{2}-?[qwerty]{2}';
   1000
 (1 row)
 
+-- check handling of indexquals that generate no searchable conditions
+explain (costs off)
+select count(*) from test_trgm where t like '%99%' and t like '%qwerty%';
+ QUERY PLAN  
+-

Re: no default hash partition

2019-08-07 Thread Tom Lane
Alvaro Herrera  writes:
> Actually, it also says this (in the blurb for the PARTITION OF clause):

>   Creates the table as a partition of the specified
>   parent table. The table can be created either as a partition for 
> specific
>   values using FOR VALUES or as a default partition
>   using DEFAULT.  This option is not available for
>   hash-partitioned tables.

> which I think is sufficient.

Hm, that's rather confusingly worded IMO.  Is the antecedent of "this
option" just DEFAULT, or does it mean that you can't use FOR VALUES,
or perchance it means that you can't use a PARTITION OF clause
at all?
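
For context, the underlying behavior is easy to demonstrate, and it shows
that only DEFAULT is disallowed for hash partitioning, not PARTITION OF or
FOR VALUES (a quick sketch of mine; the exact error wording is from memory
and may differ slightly):

-- Hash partitions themselves are fine; only DEFAULT is rejected.
CREATE TABLE h (a int) PARTITION BY HASH (a);
CREATE TABLE h0 PARTITION OF h FOR VALUES WITH (MODULUS 2, REMAINDER 0);  -- ok
CREATE TABLE h1 PARTITION OF h FOR VALUES WITH (MODULUS 2, REMAINDER 1);  -- ok
CREATE TABLE hd PARTITION OF h DEFAULT;
-- ERROR:  a hash-partitioned table may not have a default partition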

regards, tom lane




Documentation clarification re: ANALYZE

2019-08-07 Thread Isaac Morland
I'm looking at https://www.postgresql.org/docs/current/sql-analyze.html,
where it says “Without a table_and_columns list, ANALYZE processes every
table and materialized view in the current database that the current user
has permission to analyze.”.

I don’t believe there is a separate “analyze” permission, so which tables
is this? Tables owned by the user? Ones where it can insert/update/delete?
Ones where it can select?

If somebody can tell me, I'll make it a weekend project to propose a
specific update to the documentation to make this more clear. Or maybe
there should just be a cross-reference to another existing part of the
documentation that explains more about this.


Re: no default hash partition

2019-08-07 Thread Tom Lane
Alvaro Herrera  writes:
> On 2019-Aug-07, Tom Lane wrote:
>> Hm, that's rather confusingly worded IMO.  Is the antecedent of "this
>> option" just DEFAULT, or does it mean that you can't use FOR VALUES,
>> or perchance it means that you can't use a PARTITION OF clause
>> at all?

> Uh, you're right, I hadn't noticed that.  Not my text.  I think this can
> be fixed easily as in the attached.  There are other options, but I like
> this one the best.

OK, but maybe also s/created as a default partition/created as the default
partition/ ?  Writing "a" carries the pretty clear implication that there
can be more than one, and contradicting that a sentence later doesn't
improve it.

regards, tom lane




Re: Documentation clarification re: ANALYZE

2019-08-07 Thread David G. Johnston
On Wed, Aug 7, 2019 at 2:14 PM Isaac Morland 
wrote:

> I'm looking at https://www.postgresql.org/docs/current/sql-analyze.html,
> where it says “Without a table_and_columns list, ANALYZE processes every
> table and materialized view in the current database that the current user
> has permission to analyze.”.
>
> I don’t believe there is a separate “analyze” permission, so which tables
> is this? Tables owned by the user? Ones where it can insert/update/delete?
> Ones where it can select?
>

Owners only - at least in previous releases.  I don't recall whether the
addition of new roles to cover subsets of administrative privileges ever
was extended to cover vacuum/analyze but I do not think it has.

David J.


Re: Documentation clarification re: ANALYZE

2019-08-07 Thread Isaac Morland
On Wed, 7 Aug 2019 at 17:31, David G. Johnston 
wrote:

> On Wed, Aug 7, 2019 at 2:14 PM Isaac Morland 
> wrote:
>
>> I'm looking at https://www.postgresql.org/docs/current/sql-analyze.html,
>> where it says “Without a table_and_columns list, ANALYZE processes every
>> table and materialized view in the current database that the current user
>> has permission to analyze.”.
>>
>> I don’t believe there is a separate “analyze” permission, so which tables
>> is this? Tables owned by the user? Ones where it can insert/update/delete?
>> Ones where it can select?
>>
>
> Owners only - at least in previous releases.  I don't recall whether the
> addition of new roles to cover subsets of administrative privileges ever
> was extended to cover vacuum/analyze but I do not think it has.
>

Thanks. So presumably I would also have permission if I have SET ROLEd to
the owner, or to a role which is an INHERIT member of the owner.
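
As far as I can tell, yes on both counts: ownership checks go through role
membership with inheritance.  A quick sketch (mine, run as a superuser;
role and table names are made up):

-- "app_owner" owns the table; "dev" is an INHERIT member of app_owner.
CREATE ROLE app_owner;
CREATE ROLE dev LOGIN IN ROLE app_owner;
CREATE TABLE app_t (a int);
ALTER TABLE app_t OWNER TO app_owner;
SET ROLE dev;
ANALYZE app_t;        -- allowed via inherited membership in app_owner
RESET ROLE;
SET ROLE app_owner;
ANALYZE app_t;        -- allowed as the owner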


Re: Unused header file inclusion

2019-08-07 Thread Thomas Munro
On Thu, Aug 8, 2019 at 9:00 AM Alvaro Herrera  wrote:
> On 2019-Aug-05, Tom Lane wrote:
> > Alvaro Herrera  writes:
> > > Then there's the <limits.h> removal, which is in tuplesort.c because of
> > > INT_MAX as added by commit d26559dbf356 and still present ...
> >
> > One has to be especially wary of removing system-header inclusions;
> > the fact that they don't seem to be needed on your own machine doesn't
> > prove they aren't needed elsewhere.
>
> As far as I can see, this line was just carried over from tuplestore.c,
> but that file uses INT_MAX and this one doesn't use any of those macros
> as far as I can tell.

Yeah, probably, or maybe I used one of those sorts of macros in an
earlier version of the patch.

By the way, I see now that I had put the offending #include <limits.h>
*after* project headers, which is a convention I picked up from other
projects, mentors and authors, but not what PostgreSQL usually does.
In my own code I do that to maximise the chances that project headers
will fail to compile if they themselves forget to include the system
headers they depend on.  Of course an attempt to compile every header
in the project individually as a transaction unit also achieves that.

/me runs away

-- 
Thomas Munro
https://enterprisedb.com




Re: Unused header file inclusion

2019-08-07 Thread Thomas Munro
On Thu, Aug 8, 2019 at 9:47 AM Thomas Munro  wrote:
> transaction unit

*translation unit

-- 
Thomas Munro
https://enterprisedb.com




Re: Documentation clarification re: ANALYZE

2019-08-07 Thread Tom Lane
"David G. Johnston"  writes:
> On Wed, Aug 7, 2019 at 2:14 PM Isaac Morland 
> wrote:
>> I'm looking at https://www.postgresql.org/docs/current/sql-analyze.html,
>> where it says “Without a table_and_columns list, ANALYZE processes every
>> table and materialized view in the current database that the current user
>> has permission to analyze.”.
>> I don’t believe there is a separate “analyze” permission, so which tables
>> is this? Tables owned by the user? Ones where it can insert/update/delete?
>> Ones where it can select?

> Owners only - at least in previous releases.  I don't recall whether the
> addition of new roles to cover subsets of administrative privileges ever
> was extended to cover vacuum/analyze but I do not think it has.

Actually, looking in the source code finds

 * We allow the user to vacuum or analyze a table if he is superuser, the
 * table owner, or the database owner (but in the latter case, only if
 * it's not a shared relation).

It's definitely a documentation omission that this isn't spelled out in
the ANALYZE reference page (VACUUM's page does have text about it).
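
A minimal illustration of that rule (a sketch of mine; "alice" and "t"
are made-up names and the warning text is approximate):

CREATE ROLE alice LOGIN;   -- plain role, not the owner of t
CREATE TABLE t (a int);    -- owned by the current user
SET ROLE alice;
ANALYZE t;
-- WARNING:  skipping "t" --- only table or database owner can analyze it
RESET ROLE;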

regards, tom lane




Re: SQL/JSON path: collation for comparisons, minor typos in docs

2019-08-07 Thread Alexander Korotkov
On Wed, Aug 7, 2019 at 4:11 PM Alexander Korotkov
 wrote:
> On Wed, Aug 7, 2019 at 2:25 PM Markus Winand  wrote:
> > I was playing around with JSON path quite a bit and might have found one 
> > case where the current implementation doesn’t follow the standard.
> >
> > The functionality in question is the comparison operators other than ==. They
> > use the database default collation rather than the standard-mandated
> > "Unicode codepoint collation" (SQL-2:2016 9.39 General Rule 12 c iii 2 D,
> > last sentence in first paragraph).
>
> Thank you for pointing!  Nikita is about to write a patch fixing that.

Please, see the attached patch.

Our idea is to not sacrifice "==" operator performance for standard
conformance.  So "==" remains a per-byte comparison.  For consistency,
the other operators compare code points first, then do a per-byte
comparison.  In some edge cases, when the same Unicode code points have
different binary representations in the database encoding, this behavior
diverges from the standard.  In the future we can implement strict
standard conformance by normalizing input JSON strings.
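
As a concrete illustration of the intended behavior (the example is mine,
not from the patch's regression tests): under Unicode codepoint collation,
"B" (U+0042) sorts before "a" (U+0061), regardless of what the database's
default collation would say, so the filter below matches:

SELECT jsonb_path_query('["a", "B"]', '$[*] ? (@ < "a")');
--  jsonb_path_query
-- ------------------
--  "B"
-- (1 row)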

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0001-Use-Unicode-codepoint-collation-in-jsonpath-2.patch
Description: Binary data


  1   2   >