Re: Parallel copy

2020-07-12 Thread Rafia Sabih
On Sat, 11 Jul 2020 at 08:55, Bharath Rupireddy
 wrote:
>
> Thanks Vignesh for the review. Addressed the comments in 0006 patch.
>
> >
> > we can create a local variable and use in place of
> > cstate->pcdata->curr_data_block.
>
> Done.
>
> > +   if (cstate->raw_buf_index + sizeof(fld_count) >= (DATA_BLOCK_SIZE - 
> > 1))
> > +   AdjustFieldInfo(cstate, 1);
> > +
> > +   memcpy(&fld_count,
> > &cstate->pcdata->curr_data_block->data[cstate->raw_buf_index],
> > sizeof(fld_count));
> > Should this be like below, as the remaining size can fit in current block:
> >if (cstate->raw_buf_index + sizeof(fld_count) >= DATA_BLOCK_SIZE)
> >
> > +   if ((cstate->raw_buf_index + sizeof(fld_size)) >= (DATA_BLOCK_SIZE 
> > - 1))
> > +   {
> > +   AdjustFieldInfo(cstate, 2);
> > +   *new_block_pos = pcshared_info->cur_block_pos;
> > +   }
> > Same like above.
>
> Yes you are right. Changed.
>
> >
> > +   movebytes = DATA_BLOCK_SIZE - cstate->raw_buf_index;
> > +
> > +   cstate->pcdata->curr_data_block->skip_bytes = movebytes;
> > +
> > +   data_block = &pcshared_info->data_blocks[block_pos];
> > +
> > +   if (movebytes > 0)
> > Instead of the above check, we can have an assert check for movebytes.
>
> No, we can't use an assert here. For the edge case where the current data
> block is exactly full (DATA_BLOCK_SIZE bytes), movebytes will be 0, but we
> still need to get a new data block. The movebytes > 0 check lets us skip
> the memmove in that case.
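For readers following the thread, a minimal standalone sketch of the boundary case being described (DATA_BLOCK_SIZE and skip_bytes follow the quoted patch; the struct and helper below are simplified assumptions, not the patch code):

#include <stdint.h>
#include <string.h>

#define DATA_BLOCK_SIZE 65536

typedef struct DataBlock
{
	uint32_t	skip_bytes;		/* trailing bytes that continue in the next block */
	char		data[DATA_BLOCK_SIZE];
} DataBlock;

/*
 * Carry the partial field at the tail of 'curr' (starting at 'buf_index')
 * over to the start of 'next' and return the new read position.  When the
 * block is exactly full, movebytes is 0 and the copy is skipped, which is
 * why an Assert(movebytes > 0) would be wrong here.
 */
static uint32_t
carry_partial_field(DataBlock *curr, DataBlock *next, uint32_t buf_index)
{
	uint32_t	movebytes = DATA_BLOCK_SIZE - buf_index;

	curr->skip_bytes = movebytes;

	if (movebytes > 0)
		memcpy(next->data, &curr->data[buf_index], movebytes);

	return 0;					/* reading resumes at offset 0 of 'next' */
}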
>
> > +   if (mode == 1)
> > +   {
> > +   cstate->pcdata->curr_data_block = data_block;
> > +   cstate->raw_buf_index = 0;
> > +   }
> > +   else if(mode == 2)
> > +   {
> > +   ParallelCopyDataBlock *prev_data_block = NULL;
> > +   prev_data_block = cstate->pcdata->curr_data_block;
> > +   prev_data_block->following_block = block_pos;
> > +   cstate->pcdata->curr_data_block = data_block;
> > +
> > +   if (prev_data_block->curr_blk_completed  == false)
> > +   prev_data_block->curr_blk_completed = true;
> > +
> > +   cstate->raw_buf_index = 0;
> > +   }
> >
> > This code is common for both, keep in common flow and remove if (mode == 1)
> > cstate->pcdata->curr_data_block = data_block;
> > cstate->raw_buf_index = 0;
> >
>
> Done.
>
> > +#define CHECK_FIELD_COUNT \
> > +{\
> > +   if (fld_count == -1) \
> > +   { \
> > +   if (IsParallelCopy() && \
> > +   !IsLeader()) \
> > +   return true; \
> > +   else if (IsParallelCopy() && \
> > +   IsLeader()) \
> > +   { \
> > +   if
> > (cstate->pcdata->curr_data_block->data[cstate->raw_buf_index +
> > sizeof(fld_count)] != 0) \
> > +   ereport(ERROR, \
> > +
> > (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), \
> > +   errmsg("received copy
> > data after EOF marker"))); \
> > +   return true; \
> > +   } \
> > We only copy sizeof(fld_count), Shouldn't we check fld_count !=
> > cstate->max_fields? Am I missing something here?
>
> fld_count != cstate->max_fields check is done after the above checks.
>
> > +   if ((DATA_BLOCK_SIZE - cstate->raw_buf_index) >= fld_size)
> > +   {
> > +   cstate->raw_buf_index = cstate->raw_buf_index + fld_size;
> > +   }
> > We can keep the check like cstate->raw_buf_index + fld_size < ..., for
> > better readability and consistency.
> >
>
> I think this is okay. It reads naturally: if the available bytes in
> the current data block are greater than or equal to fld_size, then the
> tuple lies in the current data block.
>
> > +static pg_attribute_always_inline void
> > +CopyReadBinaryAttributeLeader(CopyState cstate, FmgrInfo *flinfo,
> > +   Oid typioparam, int32 typmod, uint32 *new_block_pos,
> > +   int m, ParallelCopyTupleInfo *tuple_start_info_ptr,
> > +   ParallelCopyTupleInfo *tuple_end_info_ptr, uint32 *line_size)
> > flinfo, typioparam & typmod is not used, we can remove the parameter.
> >
>
> Done.
>
> > +static pg_attribute_always_inline void
> > +CopyReadBinaryAttributeLeader(CopyState cstate, FmgrInfo *flinfo,
> > +   Oid typioparam, int32 typmod, uint32 *new_block_pos,
> > +   int m, ParallelCopyTupleInfo *tuple_start_info_ptr,
> > +   ParallelCopyTupleInfo *tuple_end_info_ptr, uint32 *line_size)
> > I felt this function need not be an inline function.
>
> Yes. Changed.
>
> >
> > +   /* binary format */
> > +   /* for paralle copy leader, fill in the error
> > There are some typos, run spell check
>
> Done.
>
> >
> > +   /* raw_buf_index should never cross data block size,
> > +* as the required number of data blocks would have
> > +* been obtained in the above while loop.
> 

Re: WIP: BRIN multi-range indexes

2020-07-12 Thread Sascha Kuhl
Thanks, I see there is some understanding, though.

Tomas Vondra  schrieb am So., 12. Juli 2020,
01:30:

> On Sat, Jul 11, 2020 at 03:32:43PM +0200, Sascha Kuhl wrote:
> >Tomas Vondra  schrieb am Sa., 11. Juli
> 2020,
> >13:24:
> >
> >> On Fri, Jul 10, 2020 at 04:44:41PM +0200, Sascha Kuhl wrote:
> >> >Tomas Vondra  schrieb am Fr., 10. Juli
> >> 2020,
> >> >14:09:
> >> >
> >> >> On Fri, Jul 10, 2020 at 06:01:58PM +0900, Masahiko Sawada wrote:
> >> >> >On Fri, 3 Jul 2020 at 09:58, Tomas Vondra <
> >> tomas.von...@2ndquadrant.com>
> >> >> wrote:
> >> >> >>
> >> >> >> On Sun, Apr 05, 2020 at 08:01:50PM +0300, Alexander Korotkov
> wrote:
> >> >> >> >On Sun, Apr 5, 2020 at 8:00 PM Tomas Vondra
> >> >> >> > wrote:
> >> >> >> ...
> >> >> >> >> >
> >> >> >> >> >Assuming we're not going to get 0001-0003 into v13, I'm not so
> >> >> >> >> >inclined to rush on these three as well.  But you're willing
> to
> >> >> commit
> >> >> >> >> >them, you can count round of review on me.
> >> >> >> >> >
> >> >> >> >>
> >> >> >> >> I have no intention to get 0001-0003 committed. I think those
> >> changes
> >> >> >> >> are beneficial on their own, but the primary reason was to
> support
> >> >> the
> >> >> >> >> new opclasses (which require those changes). And those parts
> are
> >> not
> >> >> >> >> going to make it into v13 ...
> >> >> >> >
> >> >> >> >OK, no problem.
> >> >> >> >Let's do this for v14.
> >> >> >> >
> >> >> >>
> >> >> >> Hi Alexander,
> >> >> >>
> >> >> >> Are you still interested in reviewing those patches? I'll take a
> >> look at
> >> >> >> 0001-0003 to check that your previous feedback was addressed. Do
> you
> >> >> >> have any comments about 0004 / 0005, which I think are the more
> >> >> >> interesting parts of this series?
> >> >> >>
> >> >> >>
> >> >> >> Attached is a rebased version - I realized I forgot to include
> 0005
> >> in
> >> >> >> the last update, for some reason.
> >> >> >>
> >> >> >
> >> >> >I've done a quick test with this patch set. I wonder if we can
> improve
> >> >> >brin_page_items() SQL function in pageinspect as well. Currently,
> >> >> >brin_page_items() is hard-coded to support only normal brin indexes.
> >> >> >When we pass brin-bloom or brin-multi-range to that function the
> >> >> >binary values are shown in 'value' column but it seems not helpful
> for
> >> >> >users. For instance, here is an output of brin_page_items() with a
> >> >> >brin-multi-range index:
> >> >> >
> >> >> >postgres(1:12801)=# select * from
> brin_page_items(get_raw_page('mul',
> >> >> >2), 'mul');
> >> >> >-[ RECORD 1 ]------------------------------------------------------
> >> >> >
> >> >> >itemoffset  | 1
> >> >> >blknum  | 0
> >> >> >attnum  | 1
> >> >> >allnulls| f
> >> >> >hasnulls| f
> >> >> >placeholder | f
> >> >> >value   | {\x01001b002100e570e670e770e870e970ea70eb70ec70ed70ee70ef70f070f170f270f370f470f570f670f770f870f970fa70fb70fc70fd70fe70ff700071}
> >> >> >
> >> >>
> >> >> Hmm. I'm not sure we can do much better, without making the function
> >> >> much more complicated. I mean, even with regular BRIN indexes we
> don't
> >> >> really know if the value is plain min/max, right?
> >> >>
> >> >You can be sure with the next node. "The value is in" can be a false
> >> >positive. "The value is out" is definite. You can detect the change
> >> >between in and out.
> >> >
> >>
> >> I'm sorry, I don't understand what you're suggesting. How is any of this
> >> related to false positive rate, etc?
> >>
> >
> >Hi,
> >
> >You use the bloom filter to check whether a value you're searching for is
> >part of the node, right?
> >
> >In case the value is in the bloom filter, you could be mistaken, because
> >another value could have the same hash profile, no?
> >
> >However, if the value is out, the filter cannot be wrong: you can be sure
> >that the value is out.
> >
> >If you're looking for a range or many ranges of values, you traverse many
> >nodes. By knowing which values are out, you can state a clear set of nodes
> >that form the range. However, the border is somewhat fuzzy because of the
> >false positives.
> >
> >I am not sure if we are writing about the same thing. Please confirm if
> >this is needed.
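To make the asymmetry being described concrete, here is a minimal, self-contained sketch of a bloom-filter membership test; it can return false positives but never false negatives. The filter size and hash function are arbitrary assumptions for illustration, not the opclass code from the patch:

#include <stdbool.h>
#include <stdint.h>

#define FILTER_BITS 1024
#define NUM_HASHES  3

typedef struct BloomFilter
{
	uint8_t		bits[FILTER_BITS / 8];
} BloomFilter;

/* Toy hash; a real filter would use independent, high-quality hashes. */
static uint32_t
hash_k(uint32_t value, uint32_t k)
{
	return (value * 2654435761u + k * 40503u) % FILTER_BITS;
}

static void
bloom_add(BloomFilter *f, uint32_t value)
{
	for (uint32_t k = 0; k < NUM_HASHES; k++)
	{
		uint32_t	bit = hash_k(value, k);

		f->bits[bit / 8] |= (uint8_t) (1 << (bit % 8));
	}
}

/*
 * Returns false only if the value is definitely absent; a true result may
 * be a false positive caused by other values having set the same bits.
 */
static bool
bloom_maybe_contains(const BloomFilter *f, uint32_t value)
{
	for (uint32_t k = 0; k < NUM_HASHES; k++)
	{
		uint32_t	bit = hash_k(value, k);

		if ((f->bits[bit / 8] & (1 << (bit % 8))) == 0)
			return false;
	}
	return true;
}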
> >
>
> Probably not. Masahiko-san pointed out that pageinspect (which also has
> a function to print pages from a BRIN index) does not understand the
> summary of the new opclasses and just prints the bytea verbatim.
>
> That has nothing to do with inspecting the bloom filter, or anything
> like that. So I think there's some confusion ...
>
>
> regards
>
> --
> Tomas Vondra  

Re: Towards easier AMs: Cleaning up inappropriate use of name "relkind"

2020-07-12 Thread Michael Paquier
On Sat, Jul 11, 2020 at 03:32:55PM -0400, Tom Lane wrote:
> Mark Dilger  writes:
>> There are code paths where relkind is sometimes '\0' under normal,
>> non-exceptional conditions.  This happens in
>> 
>>  allpaths.c: set_append_rel_size
>>  rewriteHandler.c: view_query_is_auto_updatable
>>  lockcmds.c: LockViewRecurse_walker
>>  pg_depend.c: getOwnedSequences_internal

There are more code paths than what's mentioned upthread when it comes
to relkinds and \0.  For example, a quick grep shows that acl.c relies
on get_rel_relkind() returning \0 when the relkind cannot be found.
And we do the same for get_typtype() as well in the syscache.

>> Doesn't this justify having RELKIND_NULL in the enum?
> 
> I'd say no.  I think including an intentionally invalid value in such
> an enum is horrid, mainly because it will force a lot of places to cover
> that value when they shouldn't (or else draw "enum value not handled in
> switch" warnings).  The confusion factor about whether it maybe *is*
> a valid value is not to be discounted, either.

I agree here that the situation could be improved because we never
store this value in the catalogs.  Perhaps there would be a benefit in
switching to an enum in the long run, I am not sure.  But if we do so,
RELKIND_NULL should not be around.
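As a standalone illustration of the calling pattern in question: the '\0' "not found" case is handled before any switch over real relkind values, so no sentinel enum member is ever needed. The character values below match pg_class.relkind; the function itself is invented for illustration:

#define RELKIND_RELATION	'r'
#define RELKIND_VIEW		'v'
#define RELKIND_INDEX		'i'

static const char *
describe_relkind(char relkind)
{
	/* handle "relation not found" before the switch */
	if (relkind == '\0')
		return "no such relation";

	switch (relkind)
	{
		case RELKIND_RELATION:
			return "table";
		case RELKIND_VIEW:
			return "view";
		case RELKIND_INDEX:
			return "index";
		default:
			return "other relkind";
	}
}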
--
Michael




Re: Default setting for enable_hashagg_disk

2020-07-12 Thread Tomas Vondra

On Sat, Jul 11, 2020 at 08:47:54PM -0400, Tom Lane wrote:

Tomas Vondra  writes:

I don't know, but one of the main arguments against simply suggesting
people to bump up work_mem (if they're hit by the hashagg spill in v13)
was that it'd increase overall memory usage for them. It seems strange
to then propose a new GUC set to a default that would result in higher
memory usage *for everyone*.


It seems like a lot of the disagreement here is focused on Peter's
proposal to make hash_mem_multiplier default to 2.0.  But it doesn't
seem to me that that's a critical element of the proposal.  Why not just
make it default to 1.0, thus keeping the default behavior identical
to what it is now?

If we find that's a poor default, we can always change it later;
but it seems to me that the evidence for a higher default is
a bit thin at this point.



You're right, I was specifically pushing against that aspect of the
proposal. Sorry for not making that clearer, I assumed it's clear from
the context of this (sub)thread.

I agree that making it 1.0 (or equal to work_mem, if it's not a multiplier)
by default, while allowing it to be increased if needed, would address most
of the spilling issues.
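For concreteness, a standalone sketch of the arithmetic behind the proposal: with hash_mem_multiplier = 1.0 the hash limit equals work_mem and current behavior is preserved, while a larger value raises the limit for hash-based nodes only. Names follow the thread; this is not the committed implementation:

#include <limits.h>

static int
hash_mem_kb(int work_mem_kb, double hash_mem_multiplier)
{
	double		hash_mem = (double) work_mem_kb * hash_mem_multiplier;

	/* clamp to keep the result representable as int kilobytes */
	if (hash_mem > (double) INT_MAX)
		return INT_MAX;

	return (int) hash_mem;
}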


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Default setting for enable_hashagg_disk

2020-07-12 Thread Tomas Vondra

On Sat, Jul 11, 2020 at 10:26:22PM -0700, David G. Johnston wrote:

On Saturday, July 11, 2020, Tom Lane  wrote:


"David G. Johnston"  writes:
> On Sat, Jul 11, 2020 at 5:47 PM Tom Lane  wrote:
>> It seems like a lot of the disagreement here is focused on Peter's
>> proposal to make hash_mem_multiplier default to 2.0.  But it doesn't
>> seem to me that that's a critical element of the proposal.  Why not just
>> make it default to 1.0, thus keeping the default behavior identical
>> to what it is now?

> If we don't default it to something other than 1.0 we might as well just
> make it memory units and let people decide precisely what they want to
use
> instead of adding the complexity of a multiplier.

Not sure how that follows?  The advantage of a multiplier is that it
tracks whatever people might do to work_mem automatically.






I was thinking that setting -1 would basically do that.



I think Tom meant that the multiplier would automatically track any
changes to work_mem, and adjust the hash_mem accordingly. With -1 (and
the GUC in units) you could only keep it exactly equal to work_mem, but
then as soon as you change it you'd have to update both.


  In general
I'd view work_mem as the base value that people twiddle to control
executor memory consumption.  Having to also twiddle this other value
doesn't seem especially user-friendly.



I’ll admit I don’t have a feel for what is or is not user-friendly when
setting these GUCs in a session to override the global defaults.  But as
far as the global defaults I say it’s a wash between (32mb, -1) -> (32mb,
48mb) and (32mb, 1.0) -> (32mb, 1.5)

If you want 96MB for the session/query hash, setting it to 96MB is
invariant, while setting it to 3.0 means it can change in the future if the
system work_mem changes.  Knowing the multiplier is 1.5 and choosing 64MB
for work_mem in the session is possible, but that is also mutable and has
side-effects.  If the user is going to set both values to make it invariant
we are back to it being a wash.

I don’t believe using a multiplier will promote better comprehension for
why this setting exists compared to “-1 means use work_mem but you can
override a subset if you want.”

Is having a session level memory setting be mutable something we want to
introduce?

Is it more user-friendly?



I still think it should be in simple units, TBH. We already have
somewhat similar situation with cost parameters, where we often say that
seq_page_cost = 1.0 is the baseline for the other cost parameters, yet
we have not coded that as multipliers.


If we find that's a poor default, we can always change it later;

>> but it seems to me that the evidence for a higher default is
>> a bit thin at this point.

> So "your default is 1.0 unless you installed the new database on or after
> 13.4 in which case it's 2.0"?

What else would be new?  See e.g. 848ae330a.  (Note I'm not suggesting
that we'd change it in a minor release.)



A minor-release update is what I had thought, and to an extent that was made
possible by not using the multiplier up front.

I agree options are wide open come v14 and beyond.

David J.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: A patch for get origin from commit_ts.

2020-07-12 Thread Michael Paquier
On Fri, Jul 10, 2020 at 10:06:06AM +0900, Michael Paquier wrote:
> Please note that I have switched the patch as ready for committer.  So
> I'll try to get that done, with roident as attribute name.  If
> somebody prefers a different name or has an objection, please feel
> free to chime in.

Hearing nothing, committed after fixing few things:
- The docs reversed  and .
- The comment on top of GetLatestCommitTsData() mentioned "extra"
instead of "nodeid".  Not an issue of this patch but I have just fixed
it.
- We could just have used memset for nulls when the data could not be
found in pg_xact_commit_timestamp_origin().
- Added some casts to Oid for the new ObjectIdGetDatum() calls.
- Changed the tests to not show numerical values for roident, checking
instead that the values are non-zero for the cases where we don't
expect a valid replication origin.  For the valid cases, I have just
used a join with pg_replication_origin to grab roname.  This makes the
tests more portable.

After applying the patch as of b1e48bb, longfin has also complained
that regression tests should prefix replication origins with
"regress_".  This has been fixed with ea3e15d.
--
Michael




Re: [PATCH] Performance Improvement For Copy From Binary Files

2020-07-12 Thread vignesh C
On Fri, Jul 10, 2020 at 8:51 AM Amit Langote  wrote:
>
> Hi Bharath,
> Here the numbers with the updated patch:
>
>          HEAD    patched (v2)
> foo5     8.5     6.1
> foo10    14      9.4
> foo20    25      16.7
>

Patch applies cleanly, make check & make check-world passes.
I had reviewed the changes. I felt one minor change required:
+ * CopyReadFromRawBuf
+ * Reads 'nbytes' bytes from cstate->copy_file via
cstate->raw_buf and
+ * writes then to 'saveTo'
+ *
+ * Useful when reading binary data from the file.
Should "writes then to 'saveTo'" be "writes them to 'dest'"?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: calling procedures is slow and consumes extra much memory against calling function

2020-07-12 Thread Pavel Stehule
On Sat, 11 Jul 2020 at 7:38, Pavel Stehule wrote:

>
>
> On Thu, 9 Jul 2020 at 8:28, Amit Khandekar wrote:
>
>> On Wed, 17 Jun 2020 at 13:54, Pavel Stehule 
>> wrote:
>> >
>> >
>> >
>> > On Wed, 17 Jun 2020 at 7:52, Amit Khandekar wrote:
>> >>
>> >> On Wed, 10 Jun 2020 at 17:12, Pavel Stehule 
>> wrote:
>> >> > On Wed, 10 Jun 2020 at 12:26, Amit Khandekar <amitdkhan...@gmail.com> wrote:
>> >> >> Could you show an example testcase that tests this recursive
>> scenario,
>> >> >> with which your earlier patch fails the test, and this v2 patch
>> passes
>> >> >> it ? I am trying to understand the recursive scenario and the re-use
>> >> >> of expr->plan.
>> >> >
>> >> >
>> >> > it hangs on plpgsql tests. So you can apply first version of patch
>> >> >
>> >> > and "make check"
>> >>
>> >> I could not reproduce the make check hang with the v1 patch. But I
>> >> could see a crash with the below testcase. So I understand the purpose
>> >> of the plan_owner variable that you introduced in v2.
>> >>
>> >> Consider this recursive test :
>> >>
>> >> create or replace procedure p1(in r int) as $$
>> >> begin
>> >>RAISE INFO 'r : % ', r;
>> >>if r < 3 then
>> >>   call p1(r+1);
>> >>end if;
>> >> end
>> >> $$ language plpgsql;
>> >>
>> >> do $$
>> >> declare r int default 1;
>> >> begin
>> >> call p1(r);
>> >> end;
>> >> $$;
>> >>
>> >> In p1() with r=2, when the stmt "call p1(r+1)" is being executed,
>> >> consider this code of exec_stmt_call() with your v2 patch applied:
>> >> if (expr->plan && !expr->plan->saved)
>> >> {
>> >>if (plan_owner)
>> >>   SPI_freeplan(expr->plan);
>> >>expr->plan = NULL;
>> >> }
>> >>
>> >> Here, plan_owner is false. So SPI_freeplan() will not be called, and
>> >> expr->plan is set to NULL. Now I have observed that the stmt pointer
>> >> and expr pointer is shared between the p1() execution at this r=2
>> >> level and the p1() execution at r=1 level. So after the above code is
>> >> executed at r=2, when the upper level (r=1) exec_stmt_call() lands to
>> >> the same above code snippet, it gets the same expr pointer, but it's
>> >> expr->plan is already set to NULL without being freed. From this
>> >> logic, it looks like the plan won't get freed whenever the expr/stmt
>> >> pointers are shared across recursive levels, since expr->plan is set
>> >> to NULL at the lowermost level ? Basically, the handle to the plan is
>> >> lost so no one at the upper recursion level can explicitly free it
>> >> using SPI_freeplan(), right ? This looks the same as the main issue
>> >> where the plan does not get freed for non-recursive calls. I haven't
>> >> got a chance to check if we can develop a testcase for this, similar
>> >> to your testcase where the memory keeps on increasing.
>> >
>> >
>> > This is a good consideration.
>> >
>> > I am sending updated patch
>>
>> Checked the latest patch. Looks like using a local plan rather than
>> expr->plan pointer for doing the checks does seem to resolve the issue
>> I raised. That made me think of another scenario :
>>
>> Now we are checking for plan value and then null'ifying the expr->plan
>> value. What if expr->plan is different from plan ? Is it possible ? I
>> was thinking of such scenarios. But couldn't find one. As long as a
>> plan is always created with saved=true for all levels, or with
>> saved=false for all levels, we are ok. If we can have a mix of saved
>> and unsaved plans at different recursion levels, then expr->plan can
>> be different from the outer local plan because then the expr->plan
>> will not be set to NULL in the inner level, while the outer level may
>> have created its own plan. But I think a mix of saved and unsaved
>> plans are not possible. If you agree, then I think we should at least
>> have an assert that looks like :
>>
>> if (plan && !plan->saved)
>> {
>> if (plan_owner)
>> SPI_freeplan(plan);
>>
>> /* If expr->plan  is present, it must be the same plan that we
>> allocated */
>>Assert ( !expr->plan || plan == expr->plan) );
>>
>> expr->plan = NULL;
>> }
>>
>> Other than this, I have no other issues. I understand that we have to
>> do this special handling only for this exec_stmt_call() because it is
>> only here that we call exec_prepare_plan() with keep_plan = false, so
>> doing special handling for freeing the plan seems to make sense.
>>
>
> attached patch with assert.
>
> all regress tests passed. I think this short patch can be applied on older
> releases as bugfix.
>
> This weekend I'll try to check different strategy - try to save a plan and
> release it at the end of the transaction.
>

I checked it, and this state of the patch is good enough for the moment. Any
further fix needs more invasive changes to plan cache handling.

Regards

Pavel


> Regards
>
> Pavel
>


Re: Avoid useless retrieval of defaults and check constraints in pg_dump -a

2020-07-12 Thread Justin Pryzby
On Sun, Jul 12, 2020 at 07:48:50AM +0200, Julien Rouhaud wrote:
> Currently, getTableAttrs() always retrieves info about column defaults and
> check constraints, while this will never be used if the --data-only option is
> used.
> This seems like a waste of resources, so here's a patch to skip those parts
> when the DDL won't be generated.

Note that the speed of default and constraint handling has come up before:
https://www.postgresql.org/message-id/flat/CAMkU%3D1xPqHP%3D7YPeChq6n1v_qd4WGf%2BZvtnR-b%2BgyzFqtJqMMQ%40mail.gmail.com
https://www.postgresql.org/message-id/CAMkU=1x-e+maqefhm1ymesij8j9z+sjhgw7c9bqo3e3jmg4...@mail.gmail.com

I'd pointed out that a significant fraction of our pg_upgrade time was in
pg_dump, due to having wide tables with many child tables, and "default 0" on
every column.  (I've since dropped our defaults so this is no longer an issue
here).

It appears your patch would avoid doing unnecessary work in the --data-only
case, but it wouldn't help the pg_upgrade case.

-- 
Justin




Re: Improving psql slash usage help message

2020-07-12 Thread Tom Lane
Hamid Akhtar  writes:
> On Sun, Apr 26, 2020 at 1:03 AM David G. Johnston <
> david.g.johns...@gmail.com> wrote:
>> On Sat, Apr 25, 2020 at 12:29 PM Hamid Akhtar 
>> wrote:
>>> "\dE" displays the list with a "List of relations" heading whereas "\det"
>>> displays "List of foreign tables". So, to differentiate the two, I suggest
>>> to change the help message for "\dE" to:
>>> \dE[S+] [PATTERN]  list foreign relations

>> help.c and the documentation need to be synchronized a bit more than this
>> single issue.
>> Calling it "foreign relation" for \dE and "foreign table" for \det does
>> convey that there is a difference - not sure it a huge improvement though.
>> The "\d[Eimstv]" family of meta-commands should, in the help, probably be
>> moved together to show the fact that they are basically "list relation
>> names [of this type only]" while "\det" is "list foreign table info".

FWIW, I agree with David on this point.  I find it bizarre and unhelpful
that slashUsage shows \dt, \di, etc as separate commands and fails to
indicate that they can be combined.  We could merge these entries into

    fprintf(output, _("  \\d[tivmsE][S+] [PATRN] list relations of specified type(s)\n"));

which would both remind people that they can give more than one type,
and shorten the already-overly-long list.

> I think from a user perspective, grouping these wouldn't be helpful. For
> example, it may cause FDW related commands to be spread through out the
> help.

That seems quite irrelevant to this proposal.

regards, tom lane




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-07-12 Thread Dilip Kumar
On Mon, Jul 6, 2020 at 11:43 AM Dilip Kumar  wrote:
>
> On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila  wrote:
> >
> > On Sun, Jul 5, 2020 at 4:47 PM Dilip Kumar  wrote:
> > >
> > > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila  
> > > wrote:
> > > >
> > >
> > > > 9.
> > > > +ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN 
> > > > *txn,
> > > > {
> > > > ..
> > > > + ReorderBufferToastReset(rb, txn);
> > > > + if (specinsert != NULL)
> > > > + ReorderBufferReturnChange(rb, specinsert);
> > > > ..
> > > > }
> > > >
> > > > Why do we need to do these here when we wouldn't have been done for
> > > > any exception other than ERRCODE_TRANSACTION_ROLLBACK?
> > >
> > > Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK"
> > > gracefully and we are continuing with further decoding so we need to
> > > return this change back.
> > >
> >
> > Okay, then I suggest we should do these before calling stream_stop and
> > also move ReorderBufferResetTXN after calling stream_stop  to follow a
> > pattern similar to try block unless there is a reason for not doing
> > so.  Also, it would be good if we can initialize specinsert with NULL
> > after returning the change as we are doing at other places.
>
> Okay
>
> > > > 10.  I have got the below failure once.  I have not investigated this
> > > > in detail as the patch is still under progress.  See, if you have any
> > > > idea?
> > > > #   Failed test 'check extra columns contain local defaults'
> > > > #   at t/013_stream_subxact_ddl_abort.pl line 81.
> > > > #  got: '2|0'
> > > > # expected: '1000|500'
> > > > # Looks like you failed 1 test of 2.
> > > > make[2]: *** [check] Error 1
> > > > make[1]: *** [check-subscription-recurse] Error 2
> > > > make[1]: *** Waiting for unfinished jobs
> > > > make: *** [check-world-src/test-recurse] Error 2
> > >
> > > Even I got the failure once and after that, it did not reproduce.  I
> > > have executed it multiple time but it did not reproduce again.  Are
> > > you able to reproduce it consistently?
> > >
> >
> > No, I am also not able to reproduce it consistently but I think this
> > can fail if a subscriber sends the replay_location before actually
> > replaying the changes.  First, I thought that extra send_feedback we
> > have in apply_handle_stream_commit might have caused this but I guess
> > that can't happen because we need the commit time location for that
> > and we are storing the same at the end of apply_handle_stream_commit
> > after applying all messages.  I am not sure what is going on here.  I
> > think we somehow need to reproduce this or some variant of this test
> > consistently to find the root cause.
>
> And I think it appeared first time for me,  so maybe either induced
> from past few versions so some changes in the last few versions might
> have exposed it.  I have noticed that almost 50% of the time I am able
> to reproduce after the clean build so I can trace back from which
> version it started appearing that way it will be easy to narrow down.

I think the reason for the failure is that we are not setting
remote_final_lsn in the streaming mode.  I added extra logging, and from
the logs it appeared that some of the logical WAL did not get replayed
due to the below check in should_apply_changes_for_rel:
return (rel->state == SUBREL_STATE_READY || (rel->state ==
SUBREL_STATE_SYNCDONE && rel->statelsn <= remote_final_lsn));

I still need to do a detailed analysis of why this fails only in some
cases.  Basically, most of the time rel->state is SUBREL_STATE_READY, so
this check passes, but whenever the state is SUBREL_STATE_SYNCDONE it
fails because we never update remote_final_lsn.  I will try to set this
value in apply_handle_stream_commit and see whether it ever fails after
that.
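A standalone restatement of the quoted check may help show why an unset remote_final_lsn skips changes: if the streaming path never assigns it, it stays at InvalidXLogRecPtr (0), so the SYNCDONE branch can never be true for any real statelsn. The constants mirror the server's values; the wrapper function is illustrative only:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr		((XLogRecPtr) 0)

#define SUBREL_STATE_SYNCDONE	's'
#define SUBREL_STATE_READY		'r'

static bool
should_apply_changes_for_rel_sketch(char state, XLogRecPtr statelsn,
									XLogRecPtr remote_final_lsn)
{
	/*
	 * With remote_final_lsn left at InvalidXLogRecPtr, the SYNCDONE branch
	 * is false for every real statelsn, so those changes are silently
	 * skipped -- matching the observed test failure.
	 */
	return (state == SUBREL_STATE_READY ||
			(state == SUBREL_STATE_SYNCDONE &&
			 statelsn <= remote_final_lsn));
}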

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Online checksums verification in the backend

2020-07-12 Thread Justin Pryzby
Small language fixes in comments and user-facing docs.


diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 88efb38556..39596db193 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26162,7 +26162,7 @@ SELECT 
convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
 

 The functions shown in 
-provide means to check for health of data file in a cluster.
+provide a means to check for health of a data file in a cluster.

 

@@ -26179,8 +26179,8 @@ SELECT 
convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
 pg_check_relation(relation 
regclass [, fork 
text])

setof record
-   Validate the checksums for all blocks of all or the given fork of
-   a given relation.
+   Validate the checksum for all blocks of a relation.
+   
   
  
 
@@ -26190,15 +26190,15 @@ SELECT 
convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
 pg_check_relation


-pg_check_relation iterates over all the blocks of a
-given relation and verify their checksum.  If provided,
-fork should be 'main' for the
+pg_check_relation iterates over all blocks of a
+given relation and verifies their checksums.  If passed,
+fork specifies that only checksums of the given
+fork are to be verified.  Fork should be 'main' for the
 main data fork, 'fsm' for the free space map,
 'vm' for the visibility map, or
-'init' for the initialization fork, and only this
-specific fork will be verifies, otherwise all forks will.  The function
-returns the list of blocks for which the found checksum doesn't match the
-expected one.  See 'init' for the initialization fork.
+The function returns a list of blocks for which the computed and stored
+checksums don't match.  See  for
 information on how to configure cost-based verification delay.  You must be
 a member of the pg_read_all_stats role to use this
diff --git a/src/backend/storage/page/checksum.c 
b/src/backend/storage/page/checksum.c
index eb2c919c34..17cd95ec95 100644
--- a/src/backend/storage/page/checksum.c
+++ b/src/backend/storage/page/checksum.c
@@ -36,7 +36,7 @@
  * actual storage, you have to discard the operating system cache before
  * running those functions.
  *
- * To avoid torn page and possible false positive when reading data, and
+ * To avoid torn pages and possible false positives when reading data, and to
  * keeping overhead as low as possible, the following heuristics are used:
  *
  * - a shared LWLock is taken on the target buffer pool partition mapping, and
@@ -92,8 +92,8 @@ check_one_block(Relation relation, ForkNumber forknum, 
BlockNumber blkno,
*chk_expected = *chk_found = NoComputedChecksum;
 
/*
-* To avoid too much overhead, the buffer will be first read without
-* the locks that would guarantee the lack of false positive, as such
+* To avoid excessive overhead, the buffer will be first read without
+* the locks that would prevent false positives, as such
 * events should be quite rare.
 */
 Retry:
@@ -120,10 +120,10 @@ Retry:
 }
 
 /*
- * Perform a checksum check on the passed page.  Returns whether the page is
+ * Perform a checksum check on the passed page.  Return True iff the page is
  * valid or not, and assign the expected and found checksum in chk_expected and
  * chk_found, respectively.  Note that a page can look like new but could be
- * the result of a corruption.  We still check for this case, but we can't
+ * the result of corruption.  We still check for this case, but we can't
  * compute its checksum as pg_checksum_page() is explicitly checking for
  * non-new pages, so NoComputedChecksum will be set in chk_found.
  */
@@ -139,7 +139,7 @@ check_buffer(char *buffer, uint32 blkno, uint16 
*chk_expected,
if (PageIsNew(page))
{
/*
-* Check if the page is really new or if there's a corruption 
that
+* Check if the page is really new or if there's corruption that
 * affected PageIsNew detection.  Note that PageIsVerified 
won't try to
 * detect checksum corruption in this case, so there's no risk 
of
 * duplicated corruption report.
@@ -151,7 +151,7 @@ check_buffer(char *buffer, uint32 blkno, uint16 
*chk_expected,
}
 
/*
-* There's a corruption, but since this affect PageIsNew, we
+* There's corruption, but since this affects PageIsNew, we
 * can't compute a checksum, so set NoComputedChecksum for the
 * expected checksum.
 */
@@ -218,8 +218,8 @@ check_delay_point(void)
  * held.  Reading with this lock is to avoid the unlikely but possible case
  * that a buffer wasn't present in shared buffers when we checked but it then
  * alloc'ed in shared_buffers, modified and flushe

Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-07-12 Thread Andrey Lepikhov



On 22.06.2020 17:11, Ashutosh Bapat wrote:



On Wed, 17 Jun 2020 at 11:54, Andrey V. Lepikhov <a.lepik...@postgrespro.ru> wrote:


On 6/15/20 10:26 AM, Ashutosh Bapat wrote:
 > Thanks Andrey for the patch. I am glad that the patch has taken care
 > of some corner cases already but there exist still more.
 >
 > COPY command constructed doesn't take care of dropped columns. There
 > is code in deparseAnalyzeSql which constructs list of columns for a
 > given foreign relation. 0002 patch attached here, moves that code
to a
 > separate function and reuses it for COPY. If you find that code
change
 > useful please include it in the main patch.

Thanks, i included it.

 > 2. In the same case, if the foreign table declared locally didn't
have
 > any non-dropped columns but the relation that it referred to on the
 > foreign server had some non-dropped columns, COPY command fails. I
 > added a test case for this in 0002 but haven't fixed it.

I fixed it.
This is a very special corner case. The problem was that COPY FROM does
not support semantics like "INSERT INTO .. DEFAULT VALUES". To simplify
the solution, I switched off bulk copying for this case.

  > I think this work is useful. Please add it to the next commitfest so
  > that it's tracked.
Ok.


It looks like we call BeginForeignInsert and EndForeignInsert even 
though actual copy is performed using BeginForeignCopy, ExecForeignCopy 
and EndForeignCopy. BeginForeignInsert constructs the INSERT query which 
looks unnecessary. Also some of the other PgFdwModifyState members are 
initialized unnecessarily. It also gives an impression that we are using 
INSERT underneath the copy. Instead a better way would be to 
call BeginForeignCopy instead of BeginForeignInsert and EndForeignCopy 
instead of EndForeignInsert, if we are going to use COPY protocol to 
copy data to the foreign server. Corresponding postgres_fdw 
implementations need to change in order to do that.

Fixed.
I renamed the CopyIn FDW API routines. Also, the partition routing
initializer now calls the BeginForeignInsert or BeginForeignCopyIn routine
in accordance with the value of ResultRelInfo::UseBulkModifying.
I introduced this parameter because foreign partitions can be placed on
foreign servers with different types of foreign data wrapper, and not all
wrappers can support the CopyIn API.
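As an aside, a self-contained sketch of the capability check this implies: fall back to row-by-row inserts when a wrapper provides no CopyIn callbacks. The struct names below are simplified stand-ins, not the patch's actual executor structures:

#include <stdbool.h>
#include <stddef.h>

typedef struct FdwRoutineSketch
{
	void		(*BeginForeignInsert) (void *rinfo);
	void		(*BeginForeignCopyIn) (void *rinfo);	/* NULL if the wrapper has no CopyIn API */
} FdwRoutineSketch;

typedef struct ResultRelInfoSketch
{
	struct FdwRoutineSketch *fdwroutine;
	bool		use_bulk_copy;	/* analogue of UseBulkModifying */
} ResultRelInfoSketch;

/* Choose the bulk COPY path only for wrappers that provide it. */
static void
begin_foreign_modify(ResultRelInfoSketch *rinfo)
{
	rinfo->use_bulk_copy = (rinfo->fdwroutine->BeginForeignCopyIn != NULL);

	if (rinfo->use_bulk_copy)
		rinfo->fdwroutine->BeginForeignCopyIn(rinfo);
	else
		rinfo->fdwroutine->BeginForeignInsert(rinfo);
}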

Also, I ran the Tomas Vondra benchmark. On my laptop the results are:
* regular: 5000 ms.
* Tomas buffering patch: 11000 ms.
* This CopyIn patch: 8000 ms.

--
regards,
Andrey Lepikhov
Postgres Professional
From ac43384af911acd0a07b3fae0ab25a9131a4504c Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Thu, 9 Jul 2020 11:16:56 +0500
Subject: [PATCH] Fast COPY FROM into the foreign or sharded table.

This feature enables bulk COPY into foreign table in the case of
multi inserts is possible and foreign table has non-zero number of columns.
---
 contrib/postgres_fdw/deparse.c|  60 -
 .../postgres_fdw/expected/postgres_fdw.out|  33 ++-
 contrib/postgres_fdw/postgres_fdw.c   | 130 ++
 contrib/postgres_fdw/postgres_fdw.h   |   1 +
 contrib/postgres_fdw/sql/postgres_fdw.sql |  28 ++
 src/backend/commands/copy.c   | 239 --
 src/backend/executor/execMain.c   |   1 +
 src/backend/executor/execPartition.c  |  34 ++-
 src/include/commands/copy.h   |   5 +
 src/include/foreign/fdwapi.h  |  15 ++
 src/include/nodes/execnodes.h |   8 +
 11 files changed, 456 insertions(+), 98 deletions(-)

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index ad37a74221..a37981ff66 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -184,6 +184,8 @@ static void appendAggOrderBy(List *orderList, List *targetList,
 static void appendFunctionName(Oid funcid, deparse_expr_cxt *context);
 static Node *deparseSortGroupClause(Index ref, List *tlist, bool force_colno,
 	deparse_expr_cxt *context);
+static List *deparseRelColumnList(StringInfo buf, Relation rel,
+  bool enclose_in_parens);
 
 /*
  * Helper functions
@@ -1758,6 +1760,20 @@ deparseInsertSql(StringInfo buf, RangeTblEntry *rte,
 		 withCheckOptionList, returningList, retrieved_attrs);
 }
 
+/*
+ * Deparse COPY FROM into given buf.
+ * We need to use list of parameters at each query.
+ */
+void
+deparseCopyFromSql(StringInfo buf, Relation rel)
+{
+	appendStringInfoString(buf, "COPY ");
+	deparseRelation(buf, rel);
+	(void) deparseRelColumnList(buf, rel, true);
+
+	appendStringInfoString(buf, " FROM STDIN ");
+}
+
 /*
  * deparse remote UPDATE statement
  *
@@ -2061,6 +2077,30 @@ deparseAnalyzeSizeSql(StringInfo buf, Relation rel)
  */
 void
 deparseAnalyzeSql(StringInfo buf, Relation rel, List **retrieved_attrs)
+{
+	appendStringInfoString(buf, "SELECT ");
+	*retrieved_attrs = 

Re: explain HashAggregate to report bucket and memory stats

2020-07-12 Thread Daniel Gustafsson
> On 9 Apr 2020, at 00:24, Jeff Davis  wrote:
> 
> On Wed, 2020-04-08 at 16:00 -0500, Justin Pryzby wrote:
>> 90% of the initial goal of this patch was handled by instrumentation
>> added by
>> "hash spill to disk" (1f39bce02), but this *also* adds:
>> 
>> - separate accounting for tuples vs hashtable;
>> - number of hash buckets;
>> - handles other agg nodes, and bitmap scan;
>> 
>> Should I continue pursuing this patch?
>> Does it still serve any significant purpose?
> 
> Those things would be useful for me trying to tune the performance and
> cost model. I think we need to put some of these things under "VERBOSE"
> or maybe invent a new explain option to provide this level of detail,
> though.

This thread has stalled and the patch has been Waiting on Author since March,
and skimming the thread there seem to be questions raised over the value
proposition.  Is there progress happening behind the scenes or should we close
this entry for now, to re-open in case there is renewed activity/interest?

cheers ./daniel



Re: Warn when parallel restoring a custom dump without data offsets

2020-07-12 Thread Tom Lane
David Gilman  writes:
> On Thu, Jul 02, 2020 at 05:25:21PM -0400, Tom Lane wrote:
>> I guess I'm completely confused about the purpose of these patches.
>> Far from coping with the situation of an unseekable file, they appear
>> to change pg_restore so that it fails altogether if it can't seek
>> its input file.  Why would we want to do this?

> I'm not sure where the "fails altogether if it can't seek" is.

I misread the patch, is where :-(

As penance, I spent some time studying this patchset, and have a few
comments:

1. The proposed doc change in 0001 seems out-of-date; isn't it adding a
warning about exactly the deficiency that the rest of the patch is
eliminating?  Note that the preceding para already says that the input
has to be seekable, so that's covered.  Maybe there is reason for
documenting that parallel restore will be slower if the archive was
written in a non-seekable way ... but that's not what this says.

2. It struck me that the patch is still pretty inefficient, in that
anytime it has to back up in an offset-less archive, it blindly rewinds
to dataStart and rescans everything.  In the worst case that'd still be
O(N^2) work, and it's really not necessary, because once we've seen a
given data block we know where it is.  We just have to remember that,
which seems easy enough.  (Well, on Windows it's a bit trickier because
the state in question is shared across threads; but that's good, it might
save some work.)

3. Extending on #2, we actually don't need the rewind and retry logic
at all.  If we are looking for a block we haven't already seen, and we
get to the end of the archive, it ain't there.  (This is a bit less
obvious in the Windows case than otherwise, but I think it's still true,
given that the start state is either "all offsets known" or "no offsets
known".  A particular thread might skip over some blocks on the strength
of an offset established by another thread, but the blocks ahead of that
spot must now all have known offsets.)

4. Patch 0002 seems mighty expensive for the amount of code coverage
it's adding.  On my machine it seems to raise the overall runtime of
pg_dump's "make installcheck" by about 10%, and the only new coverage
is of the few lines added here.  I wonder if we couldn't cover that
more cheaply by testing what happens when we use a "-L" option with
an intentionally mis-sorted restore list.

5. I'm inclined to reject 0003.  It's not saving anything very meaningful,
and we'd just have to put the code back whenever somebody gets around
to making pg_backup_tar.c capable of out-of-order restores like
pg_backup_custom.c is now able to do.

The attached 0001 rewrites your 0001 as per the above ideas (dropping
the proposed doc change for now), and includes your 0004 for simplicity.
I'm including your 0002 verbatim just so the cfbot will be able to do a
meaningful test on 0001; but as stated, I don't really want to commit it.

regards, tom lane

diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index 6ab122242c..2659676fd9 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -421,13 +421,47 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
 	{
 		/*
 		 * We cannot seek directly to the desired block.  Instead, skip over
-		 * block headers until we find the one we want.  This could fail if we
-		 * are asked to restore items out-of-order.
+		 * block headers until we find the one we want.  Remember the
+		 * positions of skipped-over blocks, so that if we later decide we
+		 * need to read one, we'll be able to seek to it.
 		 */
-		_readBlockHeader(AH, &blkType, &id);
-
-		while (blkType != EOF && id != te->dumpId)
+		for (;;)
 		{
+			pgoff_t		thisBlkPos = _getFilePos(AH, ctx);
+			TocEntry   *otherte;
+
+			_readBlockHeader(AH, &blkType, &id);
+
+			if (blkType == EOF || id == te->dumpId)
+break;
+
+			otherte = getTocEntryByDumpId(AH, id);
+			if (otherte && otherte->formatData)
+			{
+lclTocEntry *othertctx = (lclTocEntry *) otherte->formatData;
+
+/*
+ * Note: on Windows, multiple threads might access/update the
+ * same lclTocEntry concurrently, but that should be safe as
+ * long as we update dataPos before dataState.  Ideally, we'd
+ * use pg_write_barrier() to enforce that, but the needed
+ * infrastructure doesn't exist in frontend code.  But Windows
+ * only runs on machines with strong store ordering, so it
+ * should be okay for now.
+ */
+if (othertctx->dataState == K_OFFSET_POS_NOT_SET)
+{
+	othertctx->dataPos = thisBlkPos;
+	othertctx->dataState = K_OFFSET_POS_SET;
+}
+else if (othertctx->dataPos != thisBlkPos ||
+		 othertctx->dataState != K_OFFSET_POS_SET)
+{
+	/* sanity check */
+	pg_log_warning("data block %d has wrong seek position", id);
+}
+			}
+
 			switch (blkType)
 			{
 case BLK_DATA:
@@ -443,7 +477,6 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
 	

Re: should INSERT SELECT use a BulkInsertState?

2020-07-12 Thread Daniel Gustafsson
> On 4 Jun 2020, at 19:30, Andres Freund  wrote:
> On 2020-05-08 02:25:45 -0500, Justin Pryzby wrote:

>> Seems to me it should, at least conditionally.  At least if there's a 
>> function
>> scan or a relation or ..
> 
> Well, the problem is that this can cause very very significant
> regressions. As in 10x slower or more. The ringbuffer can cause constant
> XLogFlush() calls (due to the lsn interlock), and the eviction from
> shared_buffers (regardless of actual available) will mean future vacuums
> etc will be much slower.  I think this is likely to cause pretty
> widespread regressions on upgrades.
> 
> Now, it sucks that we have this problem in the general facility that's
> supposed to be used for this kind of bulk operation. But I don't really
> see it realistic as expanding use of bulk insert strategies unless we
> have some more fundamental fixes.

Based on the above, and the lack of activity in the thread, it sounds like this
patch should be marked Returned with Feedback; but Justin: you set it to
Waiting on Author at the start of the commitfest, are you working on a new
version?

cheers ./daniel



Re: ALTER TABLE validate foreign key dependency problem

2020-07-12 Thread Simon Riggs
On Sun, 12 Jul 2020 at 05:51, David Rowley  wrote:


> > I also considered that we could just delay all foreign key validations
> > until phase 3, but I ended up just doing then only when a rewrite is
> > pending.
>
> I still wonder if it's best to delay the validation of the foreign key
> regardless of if there's a pending table rewrite, but the patch as it
> is now only delays if there's a pending rewrite.
>

Consistency seems the better choice, so I agree we should validate later in
all cases. Does changing that have any other effects?

-- 
Simon Riggs    http://www.2ndQuadrant.com/

Mission Critical Databases


Re: Parallel grouping sets

2020-07-12 Thread Daniel Gustafsson
> On 25 Mar 2020, at 15:35, Pengzhou Tang  wrote:

> Thanks a lot. The patch had a memory leak in lookup_hash_entries: it used
> a list_concat there, causing a 64-byte leak for every tuple. I have fixed
> that.
> 
> Also, resolved conflicts and rebased the code.

While there hasn't been a review of this version, it no longer applies to HEAD.
There was also considerable discussion in a (virtual) hallway-track session
during PGCon which reviewed the approach (for lack of a better description),
deeming that nodeAgg.c needs a refactoring before complicating it further.
Based on that, and an off-list discussion with Melanie who had picked up the
patch, I'm marking this Returned with Feedback.

cheers ./daniel



Re: Libpq support to connect to standby server as priority

2020-07-12 Thread Daniel Gustafsson
> On 6 Jul 2020, at 14:19, Greg Nancarrow  wrote:
> 
>> This patch no longer applies, can you please submit a rebased version?  I've
>> marked the entry as Waiting on Author in the meantime.
>> 
> 
> Here's a rebased version of the patch.

Thanks, but now the tests no longer work as the nodes in the test suite are
renamed.  While simple enough for a committer to fix, it's always good to see
the tests pass in the CFBot to make sure the variable name error isn't hiding
an actual test error.

cheers ./daniel



Re: [PATCH] Performance Improvement For Copy From Binary Files

2020-07-12 Thread Thomas Munro
On Mon, Jul 13, 2020 at 1:13 AM vignesh C  wrote:
> On Fri, Jul 10, 2020 at 8:51 AM Amit Langote  wrote:
> > Here the numbers with the updated patch:
> >
> >          HEAD    patched (v2)
> > foo5     8.5     6.1
> > foo10    14      9.4
> > foo20    25      16.7
> >
>
> Patch applies cleanly, make check & make check-world passes.

This error showed up when cfbot tried it:

 COPY BINARY stud_emp FROM
'/home/travis/build/postgresql-cfbot/postgresql/src/test/regress/results/stud_emp.data';
+ERROR:  could not read from COPY file: Bad address




Re: WIP: BRIN multi-range indexes

2020-07-12 Thread Alvaro Herrera
On 2020-Jul-10, Tomas Vondra wrote:

> > postgres(1:12801)=# select * from brin_page_items(get_raw_page('mul',
> > 2), 'mul');
> > -[ RECORD 1 ]------------------------------------------------------
> > 
> > itemoffset  | 1
> > blknum  | 0
> > attnum  | 1
> > allnulls| f
> > hasnulls| f
> > placeholder | f
> > value   | {\x01001b002100e570e670e770e870e970ea70eb70ec70ed70ee70ef70f070f170f270f370f470f570f670f770f870f970fa70fb70fc70fd70fe70ff700071}
> 
> Hmm. I'm not sure we can do much better, without making the function
> much more complicated. I mean, even with regular BRIN indexes we don't
> really know if the value is plain min/max, right?

Maybe we can try to handle this with some other function that interprets
the bytea in 'value' and returns a user-readable text.  I think it'd
have to be a superuser-only function, because otherwise you could easily
cause a crash by passing a value of a different opclass.  But since this
seems a developer-only thing, that restriction seems fine to me.

(I don't know what's a good way to represent a bloom filter, mind.)

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-07-12 Thread Tomas Vondra

On Sun, Jul 12, 2020 at 07:58:54PM -0400, Alvaro Herrera wrote:

On 2020-Jul-10, Tomas Vondra wrote:


> postgres(1:12801)=# select * from brin_page_items(get_raw_page('mul',
> 2), 'mul');
> -[ RECORD 1 ]------------------------------------------------------
> 
> itemoffset  | 1
> blknum  | 0
> attnum  | 1
> allnulls| f
> hasnulls| f
> placeholder | f
> value   | {\x01001b002100e570e670e770e870e970ea70eb70ec70ed70ee70ef70f070f170f270f370f470f570f670f770f870f970fa70fb70fc70fd70fe70ff700071}

Hmm. I'm not sure we can do much better, without making the function
much more complicated. I mean, even with regular BRIN indexes we don't
really know if the value is plain min/max, right?


Maybe we can try to handle this with some other function that interprets
the bytea in 'value' and returns a user-readable text.  I think it'd
have to be a superuser-only function, because otherwise you could easily
cause a crash by passing a value of a different opclass.  But since this
seems a developer-only thing, that restriction seems fine to me.



Ummm, I disagree a superuser check is sufficient protection from a
segfault or similar issues. If we really want to print something nicer,
I'd say it needs to be a special function in the BRIN opclass.


(I don't know what's a good way to represent a bloom filter, mind.)



Me neither, but I guess we could print either some stats (size, number
of bits set, etc.) and/or then the bitmap.
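For instance, a small self-contained sketch of the kind of summary such a function could print for a bloom filter: total bits, bits set, and the implied fill factor. The flat byte-array layout is an assumption for illustration, not the opclass's on-disk format:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

static void
print_bloom_stats(const uint8_t *filter, size_t nbytes)
{
	size_t		nbits = nbytes * 8;
	size_t		nset = 0;

	for (size_t i = 0; i < nbytes; i++)
	{
		uint8_t		b = filter[i];

		while (b)				/* count set bits byte by byte */
		{
			nset += b & 1;
			b >>= 1;
		}
	}

	printf("bloom: %zu bits, %zu set (%.1f%% full)\n",
		   nbits, nset, nbits ? 100.0 * nset / nbits : 0.0);
}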


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: WIP: BRIN multi-range indexes

2020-07-12 Thread Alvaro Herrera
On 2020-Jul-13, Tomas Vondra wrote:

> On Sun, Jul 12, 2020 at 07:58:54PM -0400, Alvaro Herrera wrote:

> > Maybe we can try to handle this with some other function that interprets
> > the bytea in 'value' and returns a user-readable text.  I think it'd
> > have to be a superuser-only function, because otherwise you could easily
> > cause a crash by passing a value of a different opclass.  But since this
> > seems a developer-only thing, that restriction seems fine to me.
> 
> Ummm, I disagree a superuser check is sufficient protection from a
> segfault or similar issues.

My POV there is that it's the user's responsibility to call the right
function; and if they fail to do so, it's their fault.  I agree it's not
ideal, but frankly these pageinspect things are not critical to get 100%
user-friendly.

> If we really want to print something nicer, I'd say it needs to be a
> special function in the BRIN opclass.

If that can be done, then +1.  We just need to ensure that the function
knows and can verify the type of index that the value comes from.  I
guess we can pass the index OID so that it can extract the opclass from
catalogs to verify.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Performance Improvement For Copy From Binary Files

2020-07-12 Thread Bharath Rupireddy
Thanks Thomas for checking this feature.

> On Mon, Jul 13, 2020 at 4:06 AM Thomas Munro  wrote:
>
> This error showed up when cfbot tried it:
>
>  COPY BINARY stud_emp FROM
> '/home/travis/build/postgresql-cfbot/postgresql/src/test/regress/results/stud_emp.data';
> +ERROR:  could not read from COPY file: Bad address

This is due to the recent commit
cd22d3cdb9bd9963c694c01a8c0232bbae3ddcfb, in which we restricted the
raw_buf and line_buf allocations for binary files. Since we are using
raw_buf for this performance improvement feature, it's now enough to
restrict only line_buf for binary files. I made the changes
accordingly in the v3 patch attached here.

Regression tests(make check & make check-world) ran cleanly with the v3 patch.
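A simplified, self-contained sketch of the allocation rule the v3 patch moves to: line_buf only matters for text/CSV parsing, while raw_buf is now needed for binary COPY as well. This is a stand-in for illustration; the real code works with CopyStateData, StringInfo and palloc:

#include <stdbool.h>
#include <stdlib.h>

#define RAW_BUF_SIZE 65536

typedef struct CopyBuffersSketch
{
	char	   *raw_buf;		/* chunked reads from the input file */
	char	   *line_buf;		/* per-line text; not needed for binary */
} CopyBuffersSketch;

static void
setup_copy_buffers(CopyBuffersSketch *bufs, bool is_binary)
{
	/* raw_buf is allocated unconditionally, line_buf only for text/CSV */
	bufs->raw_buf = malloc(RAW_BUF_SIZE + 1);
	bufs->line_buf = is_binary ? NULL : malloc(RAW_BUF_SIZE);
}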

Please also find my responses for:

Vignesh's comment:

> On Sun, Jul 12, 2020 at 6:43 PM vignesh C  wrote:
> I had reviewed the changes. I felt one minor change required:
> + * CopyReadFromRawBuf
> + * Reads 'nbytes' bytes from cstate->copy_file via
> cstate->raw_buf and
> + * writes then to 'saveTo'
> + *
> + * Useful when reading binary data from the file.
> Should "writes then to 'saveTo'" be "writes them to 'dest'"?
>

Thanks Vignesh for reviewing the patch. Modified 'saveTo' to 'dest' in v3 patch.

Amit's comment:

>
> > For the case where required nbytes may not fit into the buffer in
> > CopyReadFromRawBuf, I'm sure this can happen only for field data,
> > (field count , and field size are of fixed length and can fit in the
> > buffer), instead of reading them in parts of buff size into the buffer
> > (using CopyLoadRawBuf) and then DRAIN_COPY_RAW_BUF() to the
> > destination, I think we can detect this condition using requested
> > bytes and the buffer size and directly read from the file to the
> > destination buffer and then reload the raw_buffer for further
> > processing. I think this way, it will be good.
>
> Hmm, I'm afraid that this will make the code more complex for
> apparently small benefit.  Is this really that much of a problem
> performance wise?
>

Yes, it would make the CopyReadFromRawBuf() code a bit complex from a
readability perspective. I'm convinced not to make the change I mentioned
above, for 3 reasons: 1) readability/understandability; 2) how many use
cases can we have where the requested field size is greater than
RAW_BUF_SIZE (64KB)? I think very few (I may be wrong here); 3)
performance-wise it may not matter much, as we do one extra memcpy only in
situations where field sizes are greater than 64KB, and (as you also
observed in one of your responses [1]) the memcpy cost for this case
should be negligible.

Considering all of the above, I'm okay with keeping the
CopyReadFromRawBuf() function the way it is currently.
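For context, a self-contained sketch of the buffering idea behind CopyReadFromRawBuf(): serve reads from a 64KB in-memory chunk and call fread() only to refill it, trading one extra memcpy for far fewer read calls. This is a simplified stand-in, not the patch code:

#include <stdio.h>
#include <string.h>

#define RAW_BUF_SIZE 65536

typedef struct BufferedReader
{
	FILE	   *file;
	char		buf[RAW_BUF_SIZE];
	int			buf_index;		/* next unread byte */
	int			buf_len;		/* number of valid bytes in buf */
} BufferedReader;

static size_t
buffered_read(BufferedReader *r, char *dest, size_t nbytes)
{
	size_t		copied = 0;

	while (copied < nbytes)
	{
		if (r->buf_index >= r->buf_len)
		{
			/* refill the buffer with the next chunk of the file */
			r->buf_len = (int) fread(r->buf, 1, RAW_BUF_SIZE, r->file);
			r->buf_index = 0;
			if (r->buf_len == 0)
				break;			/* EOF or error */
		}

		size_t		avail = (size_t) (r->buf_len - r->buf_index);
		size_t		chunk = (nbytes - copied < avail) ? nbytes - copied : avail;

		memcpy(dest + copied, r->buf + r->buf_index, chunk);
		r->buf_index += (int) chunk;
		copied += chunk;
	}

	return copied;				/* may be short at end of file */
}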

[1]
> >
> > One solution to this problem is to read data from binary file in 
> > RAW_BUF_SIZE(64KB) chunks to avoid repeatedly calling fread()(thus possibly 
> > avoiding few disk IOs). This is similar to the approach followed for 
> > csv/text files.
>
> I agree that having the buffer in front of the file makes sense,
> although we do now have an extra memcpy, that is, from raw_buf to
> attribute_buf.data.  Currently, fread() reads directly into
> attribute_buf.data.  But maybe that's okay as I don't see the new copy
> being all that bad.
>

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


CopyFrom-binary-buffering_v3.patch
Description: Binary data


Re: Physical slot restart_lsn advances incorrectly requiring restore from archive

2020-07-12 Thread Kyotaro Horiguchi
Hello, John.

At Fri, 10 Jul 2020 20:44:30 +, "Hsu, John"  wrote in 
> Hi hackers,
>  
> We believe we’re seeing a problem with how physical slot’s restart_lsn is 
> advanced leading to the replicas needing to restore from archive in order for 
> replication to resume. 
> The logs below are from reproductions against 10.13. I’m still working on 
> reproducing it for 12.3.
>  
> 1. A WAL write spans two WAL segments.
> 2. The write to the first WAL segment is complete, but the write to the
>    second segment is not.
> 3. The write to the first WAL segment is acknowledged as flushed by the
>    Postgres replica.
> 4. The primary restarts before the write to the second segment completes,
>    which also means the complete WAL record was never written.
> 5. Crash recovery finishes at a record before the incomplete WAL write.
> 6. The replica, though, now starts the slot at the next WAL segment, since
>    the previous segment was already acknowledged as flushed.
...
> Redo finishes at 0/2BB0 even though the flush we received from
> the replica is already at 0/2C00.
> This is problematic because the replica reconnects to the slot
> telling it to start past the new redo point.
 
Yeah, that is a problem not only related to restart_lsn. The same
cause leads to another issue, inconsistent archives, as discussed in
[1].

1: 
https://www.postgresql.org/message-id/CBDDFA01-6E40-46BB-9F98-9340F4379505%40amazon.com

> The attached patch (against 10) attempts to address this by keeping
> track of the first flushLsn in the current segNo, and wait until we
> receive one after that before updating. This prevents the WAL from
> rotating out of the primary and a reboot from the replica will fix
> it instead of needing to restore from archive.

On the other hand, we can and should advance restart_lsn when we know
that the last record is complete. I think a patch in the thread [2]
would fix your issue. With the patch, the primary doesn't send a
continuation record at the end of a segment until the whole record is
flushed into the WAL file.

2: 
https://www.postgresql.org/message-id/20200625.153532.379700510444980240.horikyota.ntt%40gmail.com


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: should INSERT SELECT use a BulkInsertState?

2020-07-12 Thread Justin Pryzby
On Thu, Jun 04, 2020 at 10:30:47AM -0700, Andres Freund wrote:
> On 2020-05-08 02:25:45 -0500, Justin Pryzby wrote:
> > Seems to me it should, at least conditionally.  At least if there's a 
> > function
> > scan or a relation or ..
> 
> Well, the problem is that this can cause very very significant
> regressions. As in 10x slower or more. The ringbuffer can cause constant
> XLogFlush() calls (due to the lsn interlock), and the eviction from
> shared_buffers (regardless of actual availability) will mean future vacuums
> etc will be much slower.  I think this is likely to cause pretty
> widespread regressions on upgrades.
> 
> Now, it sucks that we have this problem in the general facility that's
> supposed to be used for this kind of bulk operation. But I don't really
> see it realistic as expanding use of bulk insert strategies unless we
> have some more fundamental fixes.

I made this conditional on BEGIN BULK/SET bulk, so I'll solicit comments on 
that.

postgres=# \t on \\ \set QUIET \\ VACUUM FULL t; \dt+ t \\ begin ; \timing on 
\\ INSERT INTO t SELECT * FROM t; rollback; SELECT COUNT(1), usagecount FROM 
pg_buffercache GROUP BY 2 ORDER BY 2; 
| public | t| table | pryzbyj | 35 MB | 
|Time: 9497.318 ms (00:09.497)
|33 |  1
| 3 |  2
|18 |  3
| 5 |  4
|  4655 |  5
| 11670 |   

vs

postgres=# \t on \\ \set QUIET \\ VACUUM FULL t; \dt+ t \\ begin BULK ; \timing 
on \\ INSERT INTO t SELECT * FROM t; rollback; SELECT COUNT(1), usagecount FROM 
pg_buffercache GROUP BY 2 ORDER BY 2; 
| public | t| table | pryzbyj | 35 MB | 
|Time: 8268.780 ms (00:08.269)
|  2080 |  1
| 3 |  2
|19 |  4
|   234 |  5
| 14048 |   

And:

postgres=# begin ; \x \\ \t \\ SELECT statement_timestamp(); \o /dev/null \\ 
SELECT 'INSERT INTO t VALUES(0)' FROM generate_series(1,99); \set ECHO 
errors \\ \set QUIET on \\ \o \\ \gexec \\ SELECT statement_timestamp(); abort; 
\x \\ SELECT COUNT(1), usagecount FROM pg_buffercache GROUP BY 2 ORDER BY 2; a
|statement_timestamp | 2020-07-12 20:31:43.717328-05
|statement_timestamp | 2020-07-12 20:36:16.692469-05
|
|52 |  1
|24 |  2
|17 |  3
| 6 |  4
|  4531 |  5
| 11754 |   

vs

postgres=# begin BULK ; \x \\ \t \\ SELECT statement_timestamp(); \o /dev/null 
\\ SELECT 'INSERT INTO t VALUES(0)' FROM generate_series(1,99); \set ECHO 
errors \\ \set QUIET on \\ \o \\ \gexec \\ SELECT statement_timestamp(); abort; 
\x \\ SELECT COUNT(1), usagecount FROM pg_buffercache GROUP BY 2 ORDER BY 2; a
|statement_timestamp | 2020-07-12 20:43:47.089538-05
|statement_timestamp | 2020-07-12 20:48:04.798138-05
|
|  4456 |  1
|22 |  2
| 1 |  3
| 7 |  4
|79 |  5
| 11819 |

-- 
Justin
>From 0362537e0f3a8496ac760574931db66c59f7c1ba Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 8 May 2020 02:17:32 -0500
Subject: [PATCH v2] Allow INSERT SELECT to use a BulkInsertState

---
 src/backend/executor/nodeModifyTable.c | 23 +--
 src/backend/parser/gram.y  |  6 +-
 src/backend/tcop/utility.c |  4 
 src/backend/utils/misc/guc.c   | 12 +++-
 src/include/executor/nodeModifyTable.h |  2 ++
 src/include/nodes/execnodes.h  |  2 ++
 src/include/parser/kwlist.h|  1 +
 7 files changed, 46 insertions(+), 4 deletions(-)

diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 20a4c474cc..5ff4a2e901 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -75,6 +75,8 @@ static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
 static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
 static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
    int whichplan);
+/* guc */
+bool insert_in_bulk = false;
 
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
@@ -578,7 +580,7 @@ ExecInsert(ModifyTableState *mtstate,
 			table_tuple_insert_speculative(resultRelationDesc, slot,
 		   estate->es_output_cid,
 		   0,
-		   NULL,
+		   NULL, /* Bulk insert not supported */
 		   specToken);
 
 			/* insert index entries for tuple */
@@ -617,7 +619,7 @@ ExecInsert(ModifyTableState *mtstate,
 			/* insert the tuple normally */
 			table_tuple_insert(resultRelationDesc, slot,
 			   estate->es_output_cid,
-			   0, NULL);
+			   0, mtstate->bistate);
 
 			/* insert index entries for tuple */
 			if (resultRelInfo->ri_NumIndices > 0)
@@ -2332,6 +2334,17 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 
 	mtstate->mt_arowmarks = (List **) palloc0(sizeof(List *) * nplans);
 	mtstate->mt_nplans = nplans;
+	mtstate->bistate = NULL;
+	if (operation == CMD_INSERT &&
+			node->onCon

Re: archive status ".ready" files may be created too early

2020-07-12 Thread Kyotaro Horiguchi
Hello.

# Sorry, I wrongly thought that I replied to this thread..

At Tue, 7 Jul 2020 09:02:56 +, "matsumura@fujitsu.com" 
 wrote in 
> At Monday, July 6, 2020 05:13:40 +,  "Kyotaro Horiguchi 
> " wrote in
> > > > after WAL buffer is filled up to the requested position. So when it
> > > > crosses a segment boundary we know that all past cross-segment-boundary
> > > > records are stable. That means all we need to remember is only the
> > > > position of the latest cross-boundary record.
> > > 
> > > I could not agree. In the following case, it may not work well.
> > > - record-A and record-B (record-B is a newer one) is copied, and
> > > - lastSegContRecStart/End points to record-B's, and
> > > - FlushPtr is proceeded to in the middle of record-A.
> >
> > IIUC, that means record-B is a cross segment-border record and we have
> > flushed beyond record-B. In that case crash recovery afterwards
> > can read the complete record-B and will finish recovery *after*
> > record-B. That's what we need here.
> 
> I'm sorry I didn't explain enough.
> 
> Record-A and Record-B are cross segment-border records.
> Record-A spans segment X and X+1
> Record-B spans segment X+2 and X+3.

Ok.


> If both records have been inserted into the WAL buffer, lastSegContRecStart/End
> points to Record-B.
> If a writer flushes up to the middle of segment X+1, NotifyStableSegments()
> allows the writer to notify segment X.
> Is my understanding correct?

I think that cannot happen, since segment X must have been flushed by
the time Record-A is completely flushed out. When we write to the next
segment, we have already flushed and closed the whole previous
segment. If that were not the case, we would be archiving segment
files that are not fully flushed and would get broken archive files.

Am I missing something here?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [PATCH] Performance Improvement For Copy From Binary Files

2020-07-12 Thread Amit Langote
On Mon, Jul 13, 2020 at 10:19 AM Bharath Rupireddy
 wrote:
> > On Mon, Jul 13, 2020 at 4:06 AM Thomas Munro  wrote:
> >
> > This error showed up when cfbot tried it:
> >
> >  COPY BINARY stud_emp FROM
> > '/home/travis/build/postgresql-cfbot/postgresql/src/test/regress/results/stud_emp.data';
> > +ERROR:  could not read from COPY file: Bad address
>
> This is due to the recent commit
> cd22d3cdb9bd9963c694c01a8c0232bbae3ddcfb, in which we restricted the
> raw_buf and line_buf allocations for binary files. Since we are using
> raw_buf for this performance improvement feature, now, it's enough to
> restrict only line_buf for binary files. I made the changes
> accordingly in the v3 patch attached here.
>
> Regression tests (make check & make check-world) ran cleanly with the v3 patch.

Thank you Bharath.  I was a bit surprised that you had also submitted
a patch to NOT allocate raw_buf for COPY FROM ... BINARY. :-)

> Please also find my responses for:
>
> Vignesh's comment:
>
> > On Sun, Jul 12, 2020 at 6:43 PM vignesh C  wrote:
> > I had reviewed the changes. I felt one minor change required:
> > + * CopyReadFromRawBuf
> > + * Reads 'nbytes' bytes from cstate->copy_file via
> > cstate->raw_buf and
> > + * writes then to 'saveTo'
> > + *
> > + * Useful when reading binary data from the file.
> > Should "writes then to 'saveTo'" be "writes them to 'dest'"?
> >
>
> Thanks Vignesh for reviewing the patch. Modified 'saveTo' to 'dest' in v3 
> patch.

My bad.

> Amit's comment:
>
> >
> > > For the case where required nbytes may not fit into the buffer in
> > > CopyReadFromRawBuf, I'm sure this can happen only for field data,
> > > (field count , and field size are of fixed length and can fit in the
> > > buffer), instead of reading them in parts of buff size into the buffer
> > > (using CopyLoadRawBuf) and then DRAIN_COPY_RAW_BUF() to the
> > > destination, I think we can detect this condition using requested
> > > bytes and the buffer size and directly read from the file to the
> > > destination buffer and then reload the raw_buffer for further
> > > processing. I think this way, it will be good.
> >
> > Hmm, I'm afraid that this will make the code more complex for
> > apparently small benefit.  Is this really that much of a problem
> > performance wise?
> >
>
> Yes, it makes the CopyReadFromRawBuf() code a bit more complex from a
> readability perspective. I'm convinced not to make the change I
> mentioned above, for 3 reasons: 1) readability/understandability;
> 2) how many use cases can we have where the requested field size is
> greater than RAW_BUF_SIZE (64KB)? I think very few cases, though I may
> be wrong here; 3) performance-wise it may not matter much, as we do
> one extra memcpy only in situations where field sizes are greater than
> 64KB (as we have already seen, and as you also observed in one of your
> responses [1]), so the memcpy cost for this case may be negligible.

Actually, an extra memcpy is incurred on every call of
CopyReadFromRawBuf(), but I haven't seen it to be very problematic.

By the way, considering the rebase over cd22d3cdb9b, it seemed to me
that we needed to update the comments in CopyStateData struct
definition a bit more.  While doing that, I realized
CopyReadFromRawBuf as a name for the new function might be misleading
as long as we are only using it for binary data.  Maybe
CopyReadBinaryData is more appropriate?  See attached v4 with these
and a few other cosmetic changes.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com


CopyFrom-binary-buffering_v4.patch
Description: Binary data


Re: [PATCH] Performance Improvement For Copy From Binary Files

2020-07-12 Thread Bharath Rupireddy
>
> > This is due to the recent commit
> > cd22d3cdb9bd9963c694c01a8c0232bbae3ddcfb, in which we restricted the
> > raw_buf and line_buf allocations for binary files. Since we are using
> > raw_buf for this performance improvement feature, now, it's enough to
> > restrict only line_buf for binary files. I made the changes
> > accordingly in the v3 patch attached here.
> >
> > Regression tests (make check & make check-world) ran cleanly with the v3
> > patch.
>
> Thank you Bharath.  I was a bit surprised that you had also submitted
> a patch to NOT allocate raw_buf for COPY FROM ... BINARY. :-)
>

Yes, that was by me, before I started to work on this feature. I think
we can backpatch that change (assuming we don't backpatch this
feature); I will make the request accordingly.

Anyway, now we don't allow line_buf allocation for binary files,
which is also a good thing.

>
> > > > For the case where required nbytes may not fit into the buffer in
> > > > CopyReadFromRawBuf, I'm sure this can happen only for field data,
> > > > (field count , and field size are of fixed length and can fit in the
> > > > buffer), instead of reading them in parts of buff size into the buffer
> > > > (using CopyLoadRawBuf) and then DRAIN_COPY_RAW_BUF() to the
> > > > destination, I think we can detect this condition using requested
> > > > bytes and the buffer size and directly read from the file to the
> > > > destination buffer and then reload the raw_buffer for further
> > > > processing. I think this way, it will be good.
> > >
> > > Hmm, I'm afraid that this will make the code more complex for
> > > apparently small benefit.  Is this really that much of a problem
> > > performance wise?
> > >
> >
> > Yes, it makes the CopyReadFromRawBuf() code a bit more complex from a
> > readability perspective. I'm convinced not to make the change I
> > mentioned above, for 3 reasons: 1) readability/understandability;
> > 2) how many use cases can we have where the requested field size is
> > greater than RAW_BUF_SIZE (64KB)? I think very few cases, though I may
> > be wrong here; 3) performance-wise it may not matter much, as we do
> > one extra memcpy only in situations where field sizes are greater than
> > 64KB (as we have already seen, and as you also observed in one of your
> > responses [1]), so the memcpy cost for this case may be negligible.
>
> Actually, an extra memcpy is incurred on every call of
> CopyReadFromRawBuf(), but I haven't seen it to be very problematic.
>

Yes.

>
> CopyReadFromRawBuf as a name for the new function might be misleading
> as long as we are only using it for binary data.  Maybe
> CopyReadBinaryData is more appropriate?  See attached v4 with these
> and a few other cosmetic changes.
>

CopyReadBinaryData() looks meaningful. +1.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Creating a function for exposing memory usage of backend process

2020-07-12 Thread Fujii Masao




On 2020/07/10 17:32, torikoshia wrote:

On 2020-07-09 02:03, Andres Freund wrote:

Hi,

I think this is an incredibly useful feature.


Thanks for your kind comments and suggestion!



On 2020-07-07 22:02:10 +0900, torikoshia wrote:

> There can be multiple memory contexts with the same name. So I'm afraid
> that it's difficult to identify the actual parent memory context from
> this
> "parent" column. This is ok when logging memory contexts by calling
> MemoryContextStats() via gdb. Because child memory contexts are printed
> just under their parent, with indents. But this doesn't work in the
> view.
> To identify the actual parent memory or calculate the memory contexts
> tree
> from the view, we might need to assign unique ID to each memory context
> and display it. But IMO this is overkill. So I'm fine with current
> "parent"
> column. Thought? Do you have any better idea?

Indeed.
I also feel it's not usual to assign a unique ID, which
can vary every time the view is displayed.


Hm. I wonder if we just could include the address of the context
itself. There might be reasons not to do so (e.g. security concerns
about leaked pointers making attacks easier), but I think it's worth
considering.



I tried exposing addresses of each context and their parent.
Attached a poc patch.

   =# SELECT name, address, parent_address, total_bytes FROM pg_backend_memory_contexts;

            name            |  address  | parent_address | total_bytes
   -------------------------+-----------+----------------+-------------
    TopMemoryContext        | 0x1280da0 |                |       80800
    TopTransactionContext   | 0x1309040 | 0x1280da0      |        8192
    Prepared Queries        | 0x138a480 | 0x1280da0      |       16384
    Type information cache  | 0x134b8c0 | 0x1280da0      |       24624
    ...
    CacheMemoryContext      | 0x12cb390 | 0x1280da0      |     1048576
    CachedPlanSource        | 0x13c47f0 | 0x12cb390      |        4096
    CachedPlanQuery         | 0x13c9ae0 | 0x13c47f0      |        4096
    CachedPlanSource        | 0x13c7310 | 0x12cb390      |        4096
    CachedPlanQuery         | 0x13c1230 | 0x13c7310      |        4096
    ...


Now it's possible to identify the actual parent memory context even when
there are multiple memory contexts with the same name.
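
For example, the tree could be reconstructed from the view with a
recursive query along these lines (just an illustration against the
columns shown above, assuming the top-level context has a NULL
parent_address; this query is not part of the patch):

    WITH RECURSIVE ctx AS (
        SELECT name, address, parent_address, total_bytes, 1 AS depth
          FROM pg_backend_memory_contexts
         WHERE parent_address IS NULL
        UNION ALL
        SELECT c.name, c.address, c.parent_address, c.total_bytes,
               p.depth + 1
          FROM pg_backend_memory_contexts c
          JOIN ctx p ON c.parent_address = p.address
    )
    SELECT repeat('  ', depth - 1) || name AS context, total_bytes
      FROM ctx;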

I'm not sure, but I'm also worried that this might incur some
security-related problems..

I'd like to hear more opinions about:

- whether information for identifying the parent-child relation is necessary
or it's overkill
- if this information is necessary, whether the memory address is suitable or
other means like assigning unique numbers are required


To consider this, I'd like to know what security issue can actually
happen when memory addresses are exposed. I have no idea about this..

Regards,


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: POC and rebased patch for CSN based snapshots

2020-07-12 Thread Andrey V. Lepikhov

On 7/4/20 7:56 PM, movead...@highgo.ca wrote:



As far as I know about Clock-SI, the left part of the blue line will be
set up as a snapshot if the master requires a snapshot at time t1. But
in fact data A should be in the snapshot but is not, and data B should
be out of the snapshot but is not.

Could this scenario appear in your original patch? Or is something in
my understanding of Clock-SI wrong?





Sorry for the late answer.

I have doubts that I fully understood your question, but still.
What real problems do you see here? Transaction t1 doesn't get the state
of shard2 until the time at the node with shard2 reaches the start time
of t1. If the transaction that inserted B wants to know its position in
time relative to t1, it will generate a CSN, attach to node1 and will
see that t1 has not started yet.

Maybe you are talking about the case where someone who has a faster data
channel can use the knowledge from node1 to change the state at node2?

If so, I think it is not a problem; otherwise, please explain your idea.

--
regards,
Andrey Lepikhov
Postgres Professional




Re: Bug with indexes on whole-row expressions

2020-07-12 Thread Ashutosh Bapat
On Mon, Jun 29, 2020 at 3:43 PM Laurenz Albe  wrote:
>
> CREATE TABLE boom (a integer, b integer);
>
> -- index on whole-row expression
> CREATE UNIQUE INDEX ON boom ((boom));
>
> INSERT INTO boom VALUES
>(1, 2),
>(1, 3);
>
> ALTER TABLE boom DROP b;
>
> TABLE boom;
>
>  a
> ---
>  1
>  1
> (2 rows)
>
> REINDEX TABLE boom;
> ERROR:  could not create unique index "boom_boom_idx"
> DETAIL:  Key ((boom.*))=((1)) is duplicated.
>
> The problem here is that there *is* a "pg_depend" entry for the
> index, but it only depends on the whole table, not on specific columns.
>
> I have been thinking what would be the right approach to fix this:
>
> 1. Don't fix it, because it is an artificial corner case.
>(But I can imagine someone trying to exclude duplicate rows with
>a unique index.)
>
> 2. Add code that checks if there is an index with a whole-row reference
>in the definition before dropping a column.
>That feels like a wart for a special case.

Do we need to do something about adding a new column or modifying an
existing one, especially if the latter changes the uniqueness of the
row expression?
>
> 3. Forbid indexes on whole-row expressions.
>After all, you can do the same with an index on all the columns.
>That would open the question what to do about upgrading old databases
>that might have such indexes today.

This would be the best case. However, a whole-row expression is not
necessarily the same as "all columns" at that moment. A whole-row
expression means whatever columns there are at any given point in time.
So creating an index on a whole-row expression is not the same as an
index on all the columns. However, I can't imagine a case where we
want an index on a "whole-row expression". An index containing all the
columns would always suffice and would be deterministic as well.

So this option looks best.
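
For instance, with the boom table from the example above, a plain
multi-column index presumably gives the same practical effect while
participating in the normal per-column dependency tracking:

    CREATE UNIQUE INDEX ON boom (a, b);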

>
> 4. Add dependencies on all columns whenever a whole-row expression
>is used in an index.
>That would need special processing for pg_upgrade.

Again, that dependency needs to be maintained as and when the columns
are added and dropped.


--
Best Wishes,
Ashutosh Bapat




Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

2020-07-12 Thread Ashutosh Bapat
On Wed, Jul 8, 2020 at 6:10 PM Bharath Rupireddy
 wrote:
>
> I couldn't think of adding a test case to the existing postgres_fdw
> regression test suite with an automated scenario of the remote backend
> getting killed.

You could get the remote backend's PID using PQbackendPID and then kill
it by calling pg_terminate_backend(), to automate the scenario of the
remote backend being killed.
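
Since the regression suite uses a loopback server, something along
these lines might work (a rough sketch, assuming the remote session
keeps the default fallback_application_name of 'postgres_fdw' and that
ft1 is one of the suite's foreign tables):

    SELECT 1 FROM ft1 LIMIT 1;         -- establish the cached connection
    SELECT pg_terminate_backend(pid)   -- kill the "remote" backend
      FROM pg_stat_activity
     WHERE application_name = 'postgres_fdw';
    SELECT 1 FROM ft1 LIMIT 1;         -- with the patch, this should reconnect and succeed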

>
> I would like to thank Ashutosh Bapat (ashutosh.bapat@gmail.com)
> for the suggestion to fix this and the review of my initial patch
> attached in [2]. I tried to address the review comments provided on my
> initial patch [3].
>
> For one of Ashutosh's review comments from [3], "the fact that the
> same connection may be used by multiple plan nodes", I tried a few
> use cases where there are joins on two foreign tables on the same
> remote server in a single query, so essentially the same connection
> was used for multiple plan nodes. In this case we avoid retrying for
> the second GetConnection() request for the second foreign table,
> thanks to the check entry->xact_depth <= 0: after the first
> GetConnection() and the first remote query, xact_depth becomes 1, so
> we don't hit the retry logic, and it seems we are safe here. Please
> add to this if I'm missing something here.
>
> Request the community to consider the patch for further review if the
> overall idea seems beneficial.

I think this idea will be generally useful if your work on dropping
stale connections uses idle_connection_timeout or something like that
on the remote server.

About the patch: it seems we could just catch the error from
begin_remote_xact() in GetConnection() and retry the connection if the
error is "bad connection". Retrying using PQreset() might be better
than always calling PQconnect*.
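
Roughly something like this inside GetConnection(), just as a sketch of
the idea (not the actual patch; the exact placement and error filtering
would need more care):

    PG_TRY();
    {
        begin_remote_xact(entry);
    }
    PG_CATCH();
    {
        /* Retry only if the cached connection itself has gone bad. */
        if (entry->conn != NULL &&
            PQstatus(entry->conn) == CONNECTION_BAD &&
            entry->xact_depth <= 0)
        {
            FlushErrorState();        /* discard the "bad connection" error */
            PQreset(entry->conn);     /* reconnect with the saved parameters */
            begin_remote_xact(entry); /* retry the remote transaction once */
        }
        else
            PG_RE_THROW();
    }
    PG_END_TRY();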


>
> [1]  
> https://www.postgresql.org/message-id/CAExHW5t21B_XPQy_hownm1Qq%3DhMrgOhX%2B8gDj3YEKFvpk%3DVRgw%40mail.gmail.com
> [2]  
> https://www.postgresql.org/message-id/CALj2ACXp6DQ3iLGx5g%2BLgVtGwC4F6K9WzKQJpyR4FfdydQzC_g%40mail.gmail.com
> [3]  
> https://www.postgresql.org/message-id/CAExHW5u3Gyv6Q1BEr6zMg0t%2B59e8c4KMfKVrV3Z%3D4UKKjJ19nQ%40mail.gmail.com
>
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com



--
Best Wishes,
Ashutosh Bapat




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-07-12 Thread Amit Kapila
On Fri, Jul 10, 2020 at 3:37 PM Dilip Kumar  wrote:
>
> On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila  wrote:
> >
> >
> > 8. We can't stream the transaction before we reach the
> > SNAPBUILD_CONSISTENT state because some other output plugin can apply
> > those changes unlike what we do with pgoutput plugin (which writes to
> > file). And, I think applying the transactions without reaching a
> > consistent state would be anyway wrong.  So, we should avoid that and
> > if do that then we should have an Assert for streamed txns rather than
> > sending abort for them in ReorderBufferForget.
>
> I was analyzing this point so currently, we only enable streaming in
> StartReplicationSlot so basically in CreateReplicationSlot the
> streaming will be always off because by that time plugins are not yet
> startup that will happen only on StartReplicationSlot.
>

What do you mean by 'startup' in the above sentence?  AFAICS, we do
call startup_cb_wrapper in CreateInitDecodingContext which is called
from both CreateReplicationSlot and create_logical_replication_slot
before the start of decoding.  In CreateInitDecodingContext, we call
StartupDecodingContext which should load the plugin.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-07-12 Thread Dilip Kumar
On Mon, Jul 13, 2020 at 10:14 AM Amit Kapila  wrote:
>
> On Fri, Jul 10, 2020 at 3:37 PM Dilip Kumar  wrote:
> >
> > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila  wrote:
> > >
> > >
> > > 8. We can't stream the transaction before we reach the
> > > SNAPBUILD_CONSISTENT state because some other output plugin can apply
> > > those changes unlike what we do with pgoutput plugin (which writes to
> > > file). And, I think applying the transactions without reaching a
> > > consistent state would be anyway wrong.  So, we should avoid that and
> > > if do that then we should have an Assert for streamed txns rather than
> > > sending abort for them in ReorderBufferForget.
> >
> > I was analyzing this point so currently, we only enable streaming in
> > StartReplicationSlot so basically in CreateReplicationSlot the
> > streaming will be always off because by that time plugins are not yet
> > startup that will happen only on StartReplicationSlot.
> >
>
> What do you mean by 'startup' in the above sentence?  AFAICS, we do
> call startup_cb_wrapper in CreateInitDecodingContext which is called
> from both CreateReplicationSlot and create_logical_replication_slot
> before the start of decoding.  In CreateInitDecodingContext, we call
> StartupDecodingContext which should load the plugin.

Yeah, you are right that we do call startup_cb_wrapper from
CreateInitDecodingContext as well.  I think I got confused by the below
comment in patch 0007

@@ -1016,6 +1016,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
WalSndPrepareWrite, WalSndWriteData,
WalSndUpdateProgress);
+ /*
+ * Make sure streaming is disabled here - we may have the methods,
+ * but we don't have anywhere to send the data yet.
+ */
+ ctx->streaming = false;
+

Basically, during CreateReplicationSlot we forcefully disable streaming
with the comment "we don't have anywhere to send the data yet".  So my
point is that during CreateReplicationSlot the streaming will always be
off, and once we are done creating the slot we will have a consistent
snapshot.  So can we just check, while decoding, that we do not start
streaming until the current LSN reaches the start_decoding_at point,
and start only after that?  At that time we can have an assert that the
snapshot should be CONSISTENT.  However, before doing that I need to
check why we are setting ctx->streaming to false after creating the
slot.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: max_slot_wal_keep_size and wal_keep_segments

2020-07-12 Thread Fujii Masao



On 2020/07/09 1:20, Alvaro Herrera wrote:

On 2020-Jul-09, Fujii Masao wrote:


I attached the patch that renames wal_keep_segments to wal_keep_size.


Looks good in a quick once-over.  Just two small wording comments:


Thanks for review comments!






 Independently of max_wal_size,
-+ 1 most recent WAL files are
+   most recent  megabytes
+   WAL files plus one WAL file are
 kept at all times. Also, if WAL archiving is used, old segments can not be
 removed or recycled until they are archived. If WAL archiving cannot keep 
up
 with the pace that WAL is generated, or if 
archive_command


This reads a little strange to me.  Maybe "the N most recent megabytes
plus ..."


Yes, fixed.





/* determine how many segments slots can be kept by 
slots ... */
-   keepSegs = XLogMBVarToSegs(max_slot_wal_keep_size_mb, 
wal_segment_size);
-   /* ... and override by wal_keep_segments as needed */
-   keepSegs = Max(keepSegs, wal_keep_segments);
+   slotKeepSegs = 
XLogMBVarToSegs(max_slot_wal_keep_size_mb, wal_segment_size);
+   /* ... and override by wal_keep_size as needed */
+   keepSegs = XLogMBVarToSegs(wal_keep_size_mb, 
wal_segment_size);


Since you change the way these two variables are used, I would not say
"override" in the above comment, nor keep the ellipses; perhaps just
change them to "determine how many segments can be kept by slots" and
"ditto for wal_keep_size".


Yes, fixed.

Attached is the updated version of the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index e9cdff4864..3d973a5c8e 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -11249,7 +11249,7 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts 
ppx
  extended means
   that max_wal_size is exceeded but the files are
   still retained, either by the replication slot or
-  by wal_keep_segments.
+  by wal_keep_size.
  
 
 
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b353c61683..3384a5197d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3124,7 +3124,7 @@ include_dir 'conf.d'
 checkpoints. This is a soft limit; WAL size can exceed
 max_wal_size under special circumstances, such as
 heavy load, a failing archive_command, or a high
-wal_keep_segments setting.
+wal_keep_size setting.
 If this value is specified without units, it is taken as megabytes.
 The default is 1 GB.
 Increasing this parameter can increase the amount of time needed for
@@ -3751,21 +3751,21 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" 
"%p"'  # Windows

   
 
-  
-   wal_keep_segments (integer)
+  
+   wal_keep_size (integer)

-wal_keep_segments configuration 
parameter
+wal_keep_size configuration 
parameter




-Specifies the minimum number of past log file segments kept in the
+Specifies the minimum size of past log file segments kept in the
 pg_wal
 directory, in case a standby server needs to fetch them for streaming
-replication. Each segment is normally 16 megabytes. If a standby
+replication. If a standby
 server connected to the sending server falls behind by more than
-wal_keep_segments segments, the sending server 
might remove
-a WAL segment still needed by the standby, in which case the
+wal_keep_size megabytes, the sending server might
+remove a WAL segment still needed by the standby, in which case the
 replication connection will be terminated.  Downstream connections
 will also eventually fail as a result.  (However, the standby
 server can recover by fetching the segment from archive, if WAL
@@ -3773,14 +3773,15 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" 
"%p"'  # Windows

 

-This sets only the minimum number of segments retained in
+This sets only the minimum size of segments retained in
 pg_wal; the system might need to retain more 
segments
 for WAL archival or to recover from a checkpoint. If
-wal_keep_segments is zero (the default), the system
+wal_keep_size is zero (the default), the system
 doesn't keep any extra segments for standby purposes, so the number
 of old WAL segments available to standby servers is a function of
 the location of the previous checkpoint and status of WAL
 archiving.
+If this value is specified without units, it is taken as megabytes.
 This parameter can only be set in th

Re: max_slot_wal_keep_size and wal_keep_segments

2020-07-12 Thread Fujii Masao




On 2020/07/09 13:47, Kyotaro Horiguchi wrote:

At Thu, 9 Jul 2020 00:37:57 +0900, Fujii Masao  
wrote in



On 2020/07/02 2:18, David Steele wrote:

On 7/1/20 10:54 AM, Alvaro Herrera wrote:

On 2020-Jul-01, Fujii Masao wrote:


On 2020/07/01 12:26, Alvaro Herrera wrote:

On 2020-Jun-30, Fujii Masao wrote:


When I talked about max_slot_wal_keep_size as new feature in v13
at the conference, I received the question like "Why are the units of
setting values in max_slot_wal_keep_size and wal_keep_segments
different?"
from audience. That difference looks confusing for users and
IMO it's better to use the same unit for them. Thought?


Do we still need wal_keep_segments for anything?


Yeah, personally I like wal_keep_segments because its setting is very
simple and no extra operations on replication slots are necessary.


Okay.  In that case I +1 the idea of renaming to wal_keep_size.

+1 for renaming to wal_keep_size.


I attached the patch that renames wal_keep_segments to wal_keep_size.


It fails on 019_replslot_limit.pl for a reason uncertain to me..


I could not reproduce this...





@@ -11323,7 +11329,7 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, 
TimeLineID *stoptli_p)
 * If archiving is enabled, wait for all the required WAL files to be
 * archived before returning. If archiving isn't enabled, the required 
WAL
 * needs to be transported via streaming replication (hopefully with
-* wal_keep_segments set high enough), or some more exotic mechanism 
like
+* wal_keep_size set high enough), or some more exotic mechanism like
 * polling and copying files from pg_wal with script. We have no 
knowledge

Isn't this time a good chance to mention replication slots?


+1 to do that. But I found there are other places where replication slots
need to be mentioned. So I think it's better to do this as a separate patch.





-   "ALTER SYSTEM SET wal_keep_segments to 8; SELECT pg_reload_conf();");
+   "ALTER SYSTEM SET wal_keep_size to '128MB'; SELECT pg_reload_conf();");

wal_segment_size is set to 1MB here, so that conversion is not correct.
(However, that test works as long as it is more than
max_slot_wal_keep_size, so it's practically no problem.)


So I changed 128MB to 8MB. Is this OK?
I attached the updated version of the patch upthread.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-07-12 Thread Dilip Kumar
On Sun, Jul 12, 2020 at 9:56 PM Dilip Kumar  wrote:
>
> On Mon, Jul 6, 2020 at 11:43 AM Dilip Kumar  wrote:
> >
> > On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila  wrote:
> > >
> > > On Sun, Jul 5, 2020 at 4:47 PM Dilip Kumar  wrote:
> > > >
> > > > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila  
> > > > wrote:
> > > > >
> > > >
> > > > > 9.
> > > > > +ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, 
> > > > > ReorderBufferTXN *txn,
> > > > > {
> > > > > ..
> > > > > + ReorderBufferToastReset(rb, txn);
> > > > > + if (specinsert != NULL)
> > > > > + ReorderBufferReturnChange(rb, specinsert);
> > > > > ..
> > > > > }
> > > > >
> > > > > Why do we need to do these here when we wouldn't have been done for
> > > > > any exception other than ERRCODE_TRANSACTION_ROLLBACK?
> > > >
> > > > Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK"
> > > > gracefully and we are continuing with further decoding so we need to
> > > > return this change back.
> > > >
> > >
> > > Okay, then I suggest we should do these before calling stream_stop and
> > > also move ReorderBufferResetTXN after calling stream_stop  to follow a
> > > pattern similar to try block unless there is a reason for not doing
> > > so.  Also, it would be good if we can initialize specinsert with NULL
> > > after returning the change as we are doing at other places.
> >
> > Okay
> >
> > > > > 10.  I have got the below failure once.  I have not investigated this
> > > > > in detail as the patch is still under progress.  See, if you have any
> > > > > idea?
> > > > > #   Failed test 'check extra columns contain local defaults'
> > > > > #   at t/013_stream_subxact_ddl_abort.pl line 81.
> > > > > #  got: '2|0'
> > > > > # expected: '1000|500'
> > > > > # Looks like you failed 1 test of 2.
> > > > > make[2]: *** [check] Error 1
> > > > > make[1]: *** [check-subscription-recurse] Error 2
> > > > > make[1]: *** Waiting for unfinished jobs
> > > > > make: *** [check-world-src/test-recurse] Error 2
> > > >
> > > > Even I got the failure once and after that, it did not reproduce.  I
> > > > have executed it multiple time but it did not reproduce again.  Are
> > > > you able to reproduce it consistently?
> > > >
> > >
> > > No, I am also not able to reproduce it consistently but I think this
> > > can fail if a subscriber sends the replay_location before actually
> > > replaying the changes.  First, I thought that extra send_feedback we
> > > have in apply_handle_stream_commit might have caused this but I guess
> > > that can't happen because we need the commit time location for that
> > > and we are storing the same at the end of apply_handle_stream_commit
> > > after applying all messages.  I am not sure what is going on here.  I
> > > think we somehow need to reproduce this or some variant of this test
> > > consistently to find the root cause.
> >
> > And I think it appeared for the first time for me, so maybe it was
> > induced in the past few versions; some changes in the last few versions
> > might have exposed it.  I have noticed that almost 50% of the time I am
> > able to reproduce it after a clean build, so I can trace back from which
> > version it started appearing; that way it will be easy to narrow down.
>
> I think the reason for the failure is that we are not setting
> remote_final_lsn in the streaming mode.  I added multiple logs and ran
> the test, and from the logs it appeared that some of the logical WAL
> did not get replayed due to the below check in
> should_apply_changes_for_rel:
> return (rel->state == SUBREL_STATE_READY || (rel->state ==
> SUBREL_STATE_SYNCDONE && rel->statelsn <= remote_final_lsn));
>
> I still need to do a detailed analysis of why this fails in some
> cases.  Basically, most of the time rel->state is SUBREL_STATE_READY,
> so this check passes, but whenever the state is SUBREL_STATE_SYNCDONE
> it failed because we never update remote_final_lsn.  I will try to set
> this value in apply_handle_stream_commit and see whether it ever fails
> or not.

I have verified that after setting remote_final_lsn in
apply_handle_stream_commit, I don't see that regression failure in
over 70 runs, whereas without that change it failed 6 times in 50 runs.
Apart from this, I have noticed one more thing related to the same
point.  Basically, in apply_handle_commit we are calling
process_syncing_tables, whereas we are not calling it in
apply_handle_stream_commit.
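
For reference, the change I experimented with is essentially just the
following, mirroring what apply_handle_commit already does (a sketch of
the experiment, not a final patch; commit_data stands for the data read
from the stream-commit message):

    /* In apply_handle_stream_commit(), before replaying the spooled changes: */
    remote_final_lsn = commit_data.commit_lsn;

    /*
     * ... then replay the spooled changes and finish up the way
     * apply_handle_commit() does, which would also cover the missing
     * process_syncing_tables() call mentioned above.
     */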

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: GSSENC'ed connection stalls while reconnection attempts.

2020-07-12 Thread Kyotaro Horiguchi
At Fri, 10 Jul 2020 12:01:10 -0400, Tom Lane  wrote in 
> Kyotaro Horiguchi  writes:
> > If psql is connected using GSSAPI auth and the server restarts, the
> > reconnection sequence stalls and won't return.
> 
> Yeah, reproduced here.  (I wonder if there's any reasonable way to
> exercise this scenario in src/test/kerberos/.)
> 
> > I found that psql (libpq) sends the startup packet via GSS
> > encryption. conn->gssenc should be reset when the encryption state is
> > freed.
> 
> Actually, it looks to me like the GSS support was wedged in by somebody
> who was paying no attention to how SSL is managed, or else we forgot
> to pay attention to GSS the last time we rearranged SSL support.  It's
> completely broken for the multiple-host-addresses scenario as well,
> because try_gss is being set and cleared in the wrong places altogether.
> conn->gcred is not being handled correctly either I think --- we need
> to make sure that it's dropped in pqDropConnection.
> 
> The attached patch makes this all act more like the way SSL is handled,
> and for me it resolves the reconnection problem.

It looks good to me.

> > The reason that psql doesn't notice the error is that pqPacketSend returns
> > STATUS_OK when a write error occurred.  That behavior contradicts the
> > comment of the function. The function is used only while making a
> > connection, so it's ok to error out immediately on write failure.  I
> > think other usages of pqFlush while making a connection should report
> > write failures the same way.
> 
> I'm disinclined to mess with that, because (a) I don't think it's the
> actual source of the problem, and (b) it would affect way more than
> just GSS mode.

If I did that in pqFlush your objection would be right, but
pqPacketSend is defined as "RETURNS: STATUS_ERROR if the write fails",
so not doing that is just wrong. (pqSendSome reported the write failure
in this case.) As for other parts of the authentication code, I don't
think it affects them badly because authentication should proceed
without any read/write overlapping.

> > Finally, It's user-friendly if psql shows error message for error on
> > reset attempts. (This perhaps should be arguable.)
> 
> Meh, that's changing fairly longstanding behavior that I don't think
> we've had many complaints about.

Yeah, I haven't seen the message for any reason other than the
absence of the server. :p And I noticed that, in the first place, I
would see that message in the next connection attempt from scratch.

I agree with you on this point.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-07-12 Thread Amit Kapila
On Mon, Jul 13, 2020 at 10:40 AM Dilip Kumar  wrote:
>
> On Mon, Jul 13, 2020 at 10:14 AM Amit Kapila  wrote:
> >
> > On Fri, Jul 10, 2020 at 3:37 PM Dilip Kumar  wrote:
> > >
> > > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila  
> > > wrote:
> > > >
> > > >
> > > > 8. We can't stream the transaction before we reach the
> > > > SNAPBUILD_CONSISTENT state because some other output plugin can apply
> > > > those changes unlike what we do with pgoutput plugin (which writes to
> > > > file). And, I think applying the transactions without reaching a
> > > > consistent state would be anyway wrong.  So, we should avoid that and
> > > > if do that then we should have an Assert for streamed txns rather than
> > > > sending abort for them in ReorderBufferForget.
> > >
> > > I was analyzing this point so currently, we only enable streaming in
> > > StartReplicationSlot so basically in CreateReplicationSlot the
> > > streaming will be always off because by that time plugins are not yet
> > > startup that will happen only on StartReplicationSlot.
> > >
> >
> > What do you mean by 'startup' in the above sentence?  AFAICS, we do
> > call startup_cb_wrapper in CreateInitDecodingContext which is called
> > from both CreateReplicationSlot and create_logical_replication_slot
> > before the start of decoding.  In CreateInitDecodingContext, we call
> > StartupDecodingContext which should load the plugin.
>
> Yeah, you are right that we do call startup_cb_wrapper from
> CreateInitDecodingContext as well.  I think I got confused by below
> comment in patch 0007
>
> @@ -1016,6 +1016,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
> WalSndPrepareWrite, WalSndWriteData,
> WalSndUpdateProgress);
> + /*
> + * Make sure streaming is disabled here - we may have the methods,
> + * but we don't have anywhere to send the data yet.
> + */
> + ctx->streaming = false;
> +
>
> Basically, during CreateReplicationSlot we forcefully disable streaming
> with the comment "we don't have anywhere to send the data yet".  So my
> point is that during CreateReplicationSlot the streaming will always be
> off, and once we are done creating the slot we will have a consistent
> snapshot.  So can we just check, while decoding, that we do not start
> streaming until the current LSN reaches the start_decoding_at point,
> and start only after that?  At that time we can have an assert that the
> snapshot should be CONSISTENT.  However, before doing that I need to
> check why we are setting ctx->streaming to false after creating the
> slot.
>

I think you can refer to the commit message as well for that: "We however
must explicitly disable streaming replication during replication slot
creation, even if the plugin supports it. We don't need to replicate
the changes accumulated during this phase, and moreover, we don't have
a replication connection open so we don't have where to send the data
anyway.".  I don't think this is a good way to hack the streaming flag,
because for SQL APIs we don't have a good reason to disable the
streaming in this way.  I guess if we had a condition related to
reaching a CONSISTENT snapshot during streaming then we wouldn't need to
hack the streaming flag in this way.  Once we reach the CONSISTENT
snapshot state, we come out of the replication-slot-creation phase (see
how we use DecodingContextReady to achieve that).  So, I feel we
should remove the setting of ctx->streaming to false and add a CONSISTENT
snapshot check during streaming, unless you have a reason for not doing
so.
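
As a rough sketch of what such a check could look like (illustration
only, not the actual patch; the helper name is made up), the streaming
path could consult the snapshot builder before streaming anything:

    /* Can this in-progress transaction be streamed yet? (illustrative) */
    static bool
    CanStartStreaming(ReorderBuffer *rb)
    {
        LogicalDecodingContext *ctx = rb->private_data;
        SnapBuild  *builder = ctx->snapshot_builder;

        /* Do not stream until a consistent snapshot has been reached. */
        return SnapBuildCurrentState(builder) == SNAPBUILD_CONSISTENT;
    }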

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: partition routing layering in nodeModifyTable.c

2020-07-12 Thread Amit Langote
On Wed, Jul 1, 2020 at 6:56 PM Daniel Gustafsson  wrote:
>
> > On 2 Mar 2020, at 06:08, Amit Langote  wrote:
> >
> > On Mon, Mar 2, 2020 at 4:43 AM Tom Lane  wrote:
> >> Amit Langote  writes:
> >>> Rebased again.
> >>
> >> Seems to need that again, according to cfbot :-(
> >
> > Thank you, done.
>
> ..and another one is needed as it no longer applies, please submit a rebased
> version.

Sorry, it took me a while to get to this.

It's been over 11 months since there was any significant commentary on
the contents of the patches themselves, so perhaps I should reiterate
what the patches are about and why it might still be a good idea to
consider them.

The thread started with some very valid criticism of the way the
executor's partition tuple routing logic looks randomly sprinkled over
nodeModifyTable.c and execPartition.c.  In the process of making it
look less random, we decided to get rid of the global variable
es_result_relation_info to avoid complex maneuvers of
setting/resetting it correctly when performing partition tuple
routing, causing some other churn besides the partitioning code.  Same
with another global variable, TransitionCaptureState.tcs_map.  So, the
patches neither add any new capabilities nor improve performance, but
they do make the code in this area a bit easier to follow.

Actually, there is a problem that some of the changes here conflict
with patches being discussed on other threads ([1], [2]), so much so
that I decided to absorb some changes here into another "refactoring"
patch that I have posted at [2].

Attached rebased patches.

0001 contains preparatory FDW API changes to stop relying on
es_result_relation_info being set correctly.

0002 removes es_result_relation_info in favor of passing the active
result relation around as a parameter in the various functions that
need it

0003 moves the UPDATE tuple-routing logic into a new function

0004 removes the global variable TransitionCaptureState.tcs_map, which
needed to be set/reset whenever the active result relation changed, in
favor of a new field in ResultRelInfo to store the same map

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

[1] https://commitfest.postgresql.org/28/2575/
[2] https://commitfest.postgresql.org/28/2621/
From db6849c6ef3c3329911bcba5d75a88d86164ca7b Mon Sep 17 00:00:00 2001
From: Etsuro Fujita 
Date: Thu, 8 Aug 2019 21:41:12 +0900
Subject: [PATCH v11 1/4] Remove dependency on estate->es_result_relation_info
 from FDW APIs.

FDW APIs for executing a foreign table direct modification assumed that
the FDW would obtain the target foreign table's ResultRelInfo from
estate->es_result_relation_info of the passed-in ForeignScanState node,
but the upcoming patch(es) to refactor partitioning-related code in
nodeModifyTable.c will remove the es_result_relation_info variable.
Revise BeginDirectModify()'s API to pass the ResultRelInfo explicitly, to
remove the dependency on that variable from the FDW APIs.  For
ExecInitForeignScan() to efficiently get the ResultRelInfo to pass to
BeginDirectModify(), add a field to ForeignScan that gives the index of
the target foreign table in the list of the query result relations.

Patch by Amit Langote, following a proposal by Andres Freund, reviewed by
Andres Freund and me

Discussion: https://postgr.es/m/20190718010911.l6xcdv6birtxiei4@alap3.anarazel.de
---
 contrib/postgres_fdw/postgres_fdw.c | 25 +++--
 doc/src/sgml/fdwhandler.sgml|  8 ++--
 src/backend/executor/nodeForeignscan.c  | 14 ++
 src/backend/nodes/copyfuncs.c   |  1 +
 src/backend/nodes/outfuncs.c|  1 +
 src/backend/nodes/readfuncs.c   |  1 +
 src/backend/optimizer/plan/createplan.c |  2 ++
 src/backend/optimizer/plan/setrefs.c| 15 +++
 src/include/foreign/fdwapi.h|  1 +
 src/include/nodes/plannodes.h   |  3 +++
 10 files changed, 59 insertions(+), 12 deletions(-)

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 9fc53ca..3af5b24 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -218,6 +218,7 @@ typedef struct PgFdwDirectModifyState
 	int			num_tuples;		/* # of result tuples */
 	int			next_tuple;		/* index of next one to return */
 	Relation	resultRel;		/* relcache entry for the target relation */
+	ResultRelInfo *resultRelInfo;	/* ResultRelInfo for the target relation */
 	AttrNumber *attnoMap;		/* array of attnums of input user columns */
 	AttrNumber	ctidAttno;		/* attnum of input ctid column */
 	AttrNumber	oidAttno;		/* attnum of input oid column */
@@ -361,7 +362,9 @@ static bool postgresPlanDirectModify(PlannerInfo *root,
 	 ModifyTable *plan,
 	 Index resultRelation,
 	 int subplan_index);
-static void postgresBeginDirectModify(ForeignScanState *node, int eflags);
+static void postgresBeginDirectModify(ForeignScanState *node,
+		  ResultRelInfo *rinfo,
+		  int eflags);
 st

Re: WIP: BRIN multi-range indexes

2020-07-12 Thread Masahiko Sawada
On Mon, 13 Jul 2020 at 09:33, Alvaro Herrera  wrote:
>
> On 2020-Jul-13, Tomas Vondra wrote:
>
> > On Sun, Jul 12, 2020 at 07:58:54PM -0400, Alvaro Herrera wrote:
>
> > > Maybe we can try to handle this with some other function that interprets
> > > the bytea in 'value' and returns a user-readable text.  I think it'd
> > > have to be a superuser-only function, because otherwise you could easily
> > > cause a crash by passing a value of a different opclass.  But since this
> > > seems a developer-only thing, that restriction seems fine to me.
> >
> > Ummm, I disagree a superuser check is sufficient protection from a
> > segfault or similar issues.
>
> My POV there is that it's the user's responsibility to call the right
> function; and if they fail to do so, it's their fault.  I agree it's not
> ideal, but frankly these pageinspect things are not critical to get 100%
> user-friendly.
>
> > If we really want to print something nicer, I'd say it needs to be a
> > special function in the BRIN opclass.
>
> If that can be done, then +1.  We just need to ensure that the function
> knows and can verify the type of index that the value comes from.  I
> guess we can pass the index OID so that it can extract the opclass from
> catalogs to verify.

+1 from me, too. Perhaps we can make it optional. If a BRIN opclass
doesn't have it, the 'values' can be null.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-07-12 Thread Amit Kapila
On Mon, Jul 13, 2020 at 10:47 AM Dilip Kumar  wrote:
>
> On Sun, Jul 12, 2020 at 9:56 PM Dilip Kumar  wrote:
> >
> > On Mon, Jul 6, 2020 at 11:43 AM Dilip Kumar  wrote:
> > >
> > > On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila  
> > > wrote:
> > > >
> > >
> > > > > > 10.  I have got the below failure once.  I have not investigated 
> > > > > > this
> > > > > > in detail as the patch is still under progress.  See, if you have 
> > > > > > any
> > > > > > idea?
> > > > > > #   Failed test 'check extra columns contain local defaults'
> > > > > > #   at t/013_stream_subxact_ddl_abort.pl line 81.
> > > > > > #  got: '2|0'
> > > > > > # expected: '1000|500'
> > > > > > # Looks like you failed 1 test of 2.
> > > > > > make[2]: *** [check] Error 1
> > > > > > make[1]: *** [check-subscription-recurse] Error 2
> > > > > > make[1]: *** Waiting for unfinished jobs
> > > > > > make: *** [check-world-src/test-recurse] Error 2
> > > > >
> > > > > Even I got the failure once and after that, it did not reproduce.  I
> > > > > have executed it multiple time but it did not reproduce again.  Are
> > > > > you able to reproduce it consistently?
> > > > >
> > > >
...
..
> >
> > I think the reason for the failure is that we are not setting
> > remote_final_lsn in the streaming mode.  I added multiple logs and ran
> > the test, and from the logs it appeared that some of the logical WAL
> > did not get replayed due to the below check in
> > should_apply_changes_for_rel:
> > return (rel->state == SUBREL_STATE_READY || (rel->state ==
> > SUBREL_STATE_SYNCDONE && rel->statelsn <= remote_final_lsn));
> >
> > I still need to do a detailed analysis of why this fails in some
> > cases.  Basically, most of the time rel->state is SUBREL_STATE_READY,
> > so this check passes, but whenever the state is SUBREL_STATE_SYNCDONE
> > it failed because we never update remote_final_lsn.  I will try to set
> > this value in apply_handle_stream_commit and see whether it ever fails
> > or not.
>
> I have verified that after setting remote_final_lsn in
> apply_handle_stream_commit, I don't see that regression failure in
> over 70 runs, whereas without that change it failed 6 times in 50 runs.
>

Your analysis and fix seem correct to me.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Don't choke on files that are removed while pg_rewind runs.

2020-07-12 Thread Justin Pryzby
commit b36805f3c54fe0e50e58bb9e6dad66daca46fbf6
Author: Heikki Linnakangas 
Date:   Sun Jun 28 21:35:51 2015 +0300

...

|@@ -175,22 +175,31 @@ libpqProcessFileList(void)
|        pg_fatal("unexpected result set while fetching file list\n");
|
|    /* Read result to local variables */
|    for (i = 0; i < PQntuples(res); i++)
|    {
|        char       *path = PQgetvalue(res, i, 0);
|        int         filesize = atoi(PQgetvalue(res, i, 1));
|        bool        isdir = (strcmp(PQgetvalue(res, i, 2), "t") == 0);
|        char       *link_target = PQgetvalue(res, i, 3);
|        file_type_t type;
|
|+       if (PQgetisnull(res, 0, 1))
...
|+           continue;

Every other access to "res" in this loop is to res(i), which I believe is what
was intended here, too.  Currently, it will dumbly loop but skip *every* row if
the 2nd column (1: size) of the first row (0) is null.
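
In other words, the check presumably ought to be (a sketch of the
intended fix, not a committed change):

    if (PQgetisnull(res, i, 1))
    {
        /* The file was removed on the source while the query ran; skip it. */
        continue;
    }

so that each row's own size column is tested rather than the first row's.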

-- 
Justin




Re: proposal: possibility to read dumped table's name from file

2020-07-12 Thread Pavel Stehule
On Sun, 12 Jul 2020 at 1:06, Justin Pryzby
wrote:

> On Mon, Jul 06, 2020 at 06:34:15AM +0200, Pavel Stehule wrote:
> > >> > Comment support is still missing but easily added :)
> > >> Still missing from the latest patch.
> > >
> > > I can implement a comment support. But I am not sure about the format.
> The
> >
> > here is support for comment's line - first char should be #
>
> Thanks, that's how I assumed it would look.
>
> > >> With some added documentation, I think this can be RfC.
>
> Do you want to add any more documentation ?
>

done


> Few more things:
>
> > +exit_invalid_filter_format(FILE *fp, char *filename, char *message,
> char *line, int lineno)
> > +{
> > + pg_log_error("invalid format of filter file \"%s\": %s",
> > +  *filename == '-' ? "stdin" : filename,
> > +  message);
>
> You refer to as "stdin" any filename beginning with -.
>
> I think you could just print "-" and not "stdin".
> In any case, *filename=='-' is wrong since it only checks filename[0].
> In a few places you compare ==stdin (which is right).
>

done


> Also, I think "f" isn't as good a name as "fp".
>

done


> You're adding 139 lines to a 600 line main(), and no other option is more
> than 15 lines.  Would you put it in a separate function ?
>

done

please, check attached patch

Regards

Pavel


> --
> Justin
>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 7a37fd8045..2f2bfb4dbf 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -755,6 +755,99 @@ PostgreSQL documentation
   
  
 
+ 
+  --filter=filename
+  
+   
+This option reads object filters from the specified file.
+If you use "-" as the filename, stdin is used as the source. Each
+line of this file has to have the following format:
+
+(+|-)[tnfd] objectname
+
+Only one object name can be specified per line:
+
++t mytable1
++t mytable2
++f some_foreign_table
+-d mytable3
+
+With this file the dump includes the tables mytable1 and
+mytable2. The data of the foreign table
+some_foreign_table will be dumped too, and the data
+of mytable3 will not be dumped.
+   
+
+   
+The first character, + or -, specifies
+whether the object name is used as an include or an exclude filter.
+   
+
+   
+The second character,
+t,
+n,
+f, or
+d,
+specifies the object type.
+
+
+ 
+  t
+  
+   
+In the inclusive form (+) it works like
+--table. In the exclusive form (-)
+it works like --exclude-table.
+   
+  
+ 
+
+ 
+  n
+  
+   
+In the inclusive form (+) it works like
+--schema. In the exclusive form (-)
+it works like --exclude-schema.
+   
+  
+ 
+
+ 
+  f
+  
+   
+In the inclusive form (+) it works like
+--include-foreign-data. The exclusive form
+(-) is not allowed.
+   
+  
+ 
+
+ 
+  d
+  
+   
+The inclusive form (+) is not allowed.
+In the exclusive form (-) it works like
+--exclude-table-data.
+   
+  
+ 
+
+   
+
+   
+The option --filter can be used together with options
+--table, --exclude-table,
+--schema, --exclude-schema,
+--include-foreign-data and
+--exclude-table-data.
+   
+  
+ 
+
  
   --if-exists
   
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index e758b5c50a..2fe6db9b05 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -290,6 +290,7 @@ static void appendReloptionsArrayAH(PQExpBuffer buffer, const char *reloptions,
 static char *get_synchronized_snapshot(Archive *fout);
 static void setupDumpWorker(Archive *AHX);
 static TableInfo *getRootTableInfo(TableInfo *tbinfo);
+static void read_patterns_from_file(char *filename, DumpOptions *dopt);
 
 
 int
@@ -364,6 +365,7 @@ main(int argc, char **argv)
 		{"enable-row-security", no_argument, &dopt.enable_row_security, 1},
 		{"exclude-table-data", required_argument, NULL, 4},
 		{"extra-float-digits", required_argument, NULL, 8},
+		{"filter", required_argument, NULL, 12},
 		{"if-exists", no_argument, &dopt.if_exists, 1},
 		{"inserts", no_argument, NULL, 9},
 		{"lock-wait-timeout", required_argument, NULL, 2},
@@ -603,6 +605,10 @@ main(int argc, char **argv)
 		  optarg);
 break;
 
+			case 12:			/* filter implementation */
+read_patterns_from_file(optarg, &dopt);
+break;
+
 			default:
 fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
 exit_nicely(1);
@@ -1022,6 +1028,7 @@ help(const char *progna

Re: Don't choke on files that are removed while pg_rewind runs.

2020-07-12 Thread Daniel Gustafsson
> On 13 Jul 2020, at 08:10, Justin Pryzby  wrote:

> Every other access to "res" in this loop is to res(i), which I believe is
> what was intended here, too.  Currently, it will dumbly loop but skip
> *every* row if the 2nd column (1: size) of the first row (0) is null.

Yeah, I agree with that, seems like the call should've been
PQgetisnull(res, i, 1); to match the loop.

cheers ./daniel



Re: POC and rebased patch for CSN based snapshots

2020-07-12 Thread movead...@highgo.ca

>I have doubts that I fully understood your question, but still.
>What real problems do you see here? Transaction t1 doesn't get the state
>of shard2 until the clock at the node holding shard2 reaches the start
>time of t1. If the transaction that inserted B wants to know its position
>in time relative to t1, it will generate a CSN, attach to node1 and will
>see that t1 has not started yet.
 
>Maybe you mean the case where someone who has a faster data
>channel can use the knowledge from node1 to change the state at node2?
>If so, I think it is not a problem; otherwise, please explain your idea.
Sorry, I think this was my misunderstanding of Clock-SI. At first I expected
we could get an absolute snapshot, for example that B should not be included
in the snapshot because it happened after time t1. However, Clock-SI cannot
guarantee that, and no design guarantees that at all.
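
For reference, a rough standalone illustration (my reading of the Clock-SI
rule referenced above, not code from the patch): a snapshot CSN assigned on
one node can only be used on another node once that node's clock has passed
it, which is the delay that keeps t1's snapshot on shard2 consistent.

#include <stdint.h>
#include <time.h>
#include <unistd.h>

typedef uint64_t csn_t;

/* A node-local clock in microseconds, standing in for the CSN source. */
static csn_t
local_clock_now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (csn_t) ts.tv_sec * 1000000 + (csn_t) (ts.tv_nsec / 1000);
}

/*
 * Clock-SI delay rule (illustration only): before taking a snapshot here
 * for a transaction whose snapshot CSN was assigned on another node, wait
 * until the local clock has caught up with that CSN.
 */
static void
wait_for_snapshot_csn(csn_t snapshot_csn)
{
	while (local_clock_now() < snapshot_csn)
		usleep(100);
}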




Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca