Re: Skipping schema changes in publication

2022-08-08 Thread vignesh C
On Fri, Jun 3, 2022 at 3:36 PM vignesh C  wrote:
>
> On Thu, May 26, 2022 at 7:04 PM osumi.takami...@fujitsu.com
>  wrote:
> >
> > On Monday, May 23, 2022 2:13 PM vignesh C  wrote:
> > > Attached v7 patch which fixes the buildfarm warning for an unused
> > > variable in release mode, as in [1].
> > Hi, thank you for the patches.
> >
> >
> > I'll share several review comments.
> >
> > For v7-0001.
> >
> > (1) I'll suggest some minor rewording.
> >
> > +  
> > +   The RESET clause will reset the publication to the
> > +   default state which includes resetting the publication options, setting
> > +   ALL TABLES flag to false and
> > +   dropping all relations and schemas that are associated with the 
> > publication.
> >
> > My suggestion is
> > "The RESET clause will reset the publication to the
> > default state. It resets the publication operations,
> > sets ALL TABLES flag to false and drops all relations
> > and schemas associated with the publication."
>
> I felt the existing looks better. I would prefer to keep it that way.
>
> > (2) typo and rewording
> >
> > +/*
> > + * Reset the publication.
> > + *
> > + * Reset the publication options, setting ALL TABLES flag to false and drop
> > + * all relations and schemas that are associated with the publication.
> > + */
> >
> > The "setting" in this sentence should be "set".
> >
> > How about changing like below ?
> > FROM:
> > "Reset the publication options, setting ALL TABLES flag to false and drop
> > all relations and schemas that are associated with the publication."
> > TO:
> > "Reset the publication operations, set ALL TABLES flag to false and drop
> > all relations and schemas associated with the publication."
>
>  I felt the existing looks better. I would prefer to keep it that way.
>
> > (3) AlterPublicationReset
> >
> > Do we need to call CacheInvalidateRelcacheAll() or
> > InvalidatePublicationRels() at the end of
> > AlterPublicationReset() like AlterPublicationOptions() ?
>
> CacheInvalidateRelcacheAll should be called if we change all tables
> from true to false, else the cache will not be invalidated. Modified
>
> >
> > For v7-0002.
> >
> > (4)
> >
> > +   if (stmt->for_all_tables)
> > +   {
> > +       bool        isdefault = CheckPublicationDefValues(tup);
> > +
> > +       if (!isdefault)
> > +           ereport(ERROR,
> > +                   errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> > +                   errmsg("adding ALL TABLES requires the publication to have default publication options, no tables/schemas associated and ALL TABLES flag should not be set"),
> > +                   errhint("Use ALTER PUBLICATION ... RESET to reset the publication"));
> >
> >
> > The errmsg string has three messages for the user and is a bit long
> > (we have two sentences there connected by 'and').
> > Can't we make it concise and split it into a couple of lines for code
> > readability?
> >
> > I'll suggest a change below.
> > FROM:
> > "adding ALL TABLES requires the publication to have default publication 
> > options, no tables/schemas associated and ALL TABLES flag should not be set"
> > TO:
> > "adding ALL TABLES requires the publication defined not for ALL TABLES"
> > "to have default publish actions without any associated tables/schemas"
>
> Added errdetail and split it
>
> > (5) typo
> >
> >
> > +EXCEPT TABLE
> > +
> > + 
> > +  This clause specifies a list of tables to exclude from the 
> > publication.
> > +  It can only be used with FOR ALL TABLES.
> > + 
> > +
> > +   
> > +
> >
> > Kindly change
> > FROM:
> > This clause specifies a list of tables to exclude from the publication.
> > TO:
> > This clause specifies a list of tables to be excluded from the publication.
> > or
> > This clause specifies a list of tables excluded from the publication.
>
> Modified
>
> > (6) Minor suggestion for an expression change
> >
> >Marks the publication as one that replicates changes for all tables 
> > in
> > -  the database, including tables created in the future.
> > +  the database, including tables created in the future. If
> > +  EXCEPT TABLE is specified, then exclude 
> > replicating
> > +  the changes for the specified tables.
> >
> >
> > I'll suggest a minor rewording.
> > FROM:
> > ...exclude replicating the changes for the specified tables
> > TO:
> > ...exclude replication changes for the specified tables
>
> I felt the existing is better.
>
> > (7)
> > (7-1)
> >
> > +/*
> > + * Check if the publication has default values
> > + *
> > + * Check the following:
> > + * a) Publication is not set with "FOR ALL TABLES"
> > + * b) Publication is having default options
> > + * c) Publication is not associated with schemas
> > + * d) Publication is not associated with relations
> > + */
> > +static bool
> > +CheckPublicationDefValues(HeapTuple tup)
> >
> >
> > I think this header comment can be improved.
> > FROM:
> > Check the 

Re: optimize lookups in snapshot [sub]xip arrays

2022-08-08 Thread Bharath Rupireddy
On Mon, Aug 8, 2022 at 12:17 PM John Naylor
 wrote:
>
> On Sun, Aug 7, 2022 at 4:25 AM Nathan Bossart  
> wrote:
> >
> > [v8]
>
> Okay, I think it's basically in good shape. Since it should be a bit
> faster than a couple versions ago, would you be up for retesting with
> the original test having 8 to 512 writers? And also add the const
> markers we discussed upthread? Aside from that, I plan to commit this
> week unless there is further bikeshedding.

I quickly reviewed the v8 patch set; a few comments:

1) pg_lfind32 - why just uint32? If it's not possible to define
functions for char, unsigned char, int16, uint16, int32, int64, uint64
and so on, can we add a few comments about that? Also, the comments
could explain whether the element data type of the array or the data
type of the key matters when using pg_lfind32.

2) I think this is not just for the remaining elements but also for
non-USE_SSE2 cases. Also, please specify in which cases we reach here
when USE_SSE2 is defined.
+/* Process the remaining elements the slow way. */

3) Can pg_lfind32 return the index of the key found, for instance to
use it for setting/resetting the found element in the array? (A naive
sketch of such a variant appears after these comments.)
+ * pg_lfind32
+ *
+ * Returns true if there is an element in 'base' that equals 'key'.  Otherwise,
+ * returns false.
+ */
+static inline bool
+pg_lfind32(uint32 key, uint32 *base, uint32 nelem)

4) Can we, right away, use this API to replace linear searches, say, in
SimpleLruReadPage_ReadOnly(), ATExecAttachPartitionIdx(),
AfterTriggerSetState()? I'm sure I'm missing other places, but
can we replace the ones we do find with the new function?
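
To make (3) concrete, here is a naive, purely illustrative sketch of what an
index-returning variant could look like (pg_lfind32_idx is a made-up name,
and this scalar loop ignores the SSE2 fast path used by the actual patch):

#include "c.h"                  /* assumes a PostgreSQL source tree, for uint32 */

/*
 * Hypothetical variant of pg_lfind32() that reports the position of the
 * first match.  Illustration only -- not part of the proposed patch.
 */
static inline bool
pg_lfind32_idx(uint32 key, const uint32 *base, uint32 nelem, uint32 *idx)
{
    uint32      i;

    for (i = 0; i < nelem; i++)
    {
        if (base[i] == key)
        {
            *idx = i;
            return true;
        }
    }
    return false;
}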

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/




Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2022-08-08 Thread Amit Langote
Hi Andres,

On Sat, Aug 6, 2022 at 5:37 AM Andres Freund  wrote:
> On 2022-08-04 17:01:48 +0900, Amit Langote wrote:
> > On Wed, Aug 3, 2022 at 12:00 AM Andres Freund  wrote:
> > > Honestly, this code seems like it should just be rewritten from scratch.
> >
> > Based on what I wrote above, please let me know if I've misunderstood
> > your concerns about over-allocation of coercion state.  I can try to
> > rewrite one more time if I know what this should look like instead.
>
> I don't know. I don't understand the design of what needs to have error
> handling, and what not.

AFAIK, there are two things that the JSON path code considers can
cause an error when evaluating a JsonExpr:

* Actual JSON path evaluation in ExecEvalJsonExpr(), because it
invokes JsonPath*() family of functions defined in jsonpath_exec.c,
which can possibly cause an error.  Actually, I looked again today as
to what goes on in them and it seems there is a throwErrors variable
being used to catch and prevent any errors found by the JSON path
machinery itself, and it has the same value as the throwErrors
variable in ExecEvalJsonExpr().  Given that the latter is set per the
ON ERROR specification (throw errors or return NULL / a default
expression in lieu), maybe this part doesn't really need a
sub-transaction.  To check, I took off the sub-transaction around this
part and can see that no tests fail.

* Evaluating coercion expression in ExecEvalJsonExprCoercion(), which
involves passing a user-specified expression through either
EEOP_IOCOERCE or json_populate_type(), both of which can cause errors
that are not suppressible as those in jsonpath_exec.c are.  So, this
part does need a sub-transaction to satisfy the ON ERROR behavior.  To
check, I took out the sub-transaction around the coercion evaluation,
and some tests in jsonb_sqljson do indeed fail, like this, for example:

 SELECT JSON_VALUE(jsonb '"aaa"', '$' RETURNING int);
- json_value
-
-
-(1 row)
-
+ERROR:  invalid input syntax for type integer: "aaa"

Note that both the JSON path expression and the coercion would run as
part of the one EEOP_JSONEXPR ExecEvalStep before this patch and thus
would be wrapped under the same sub-transaction, even if only the
latter apparently needs it.

With this patch, I tried to keep that same behavior, but because the
coercion evaluation has now been broken out into its own step, it must
use another sub-transaction, given that the same sub-transaction can
no longer wrap both things.  But given my finding that the JSON path
expression doesn't really need one, maybe the new EEOP_JSONEXPR_PATH
step can run without one, while keeping it for the new
EEOP_JSONEXPR_COERCION step.

> > > > >   I first was confused why the code tries to load the jump target
> > > > >   dynamically. But then I saw that the interpreted code sets it 
> > > > > dynamically -
> > > > >   why? That's completely unnecessary overhead afaics? There's just two
> > > > >   possible jump targets, no?
> > > >
> > > > Hmm, I looked at the code for other expressions that jump, especially
> > > > CASE WHEN, but they use ready-made EEOP_JUMP_IF_* steps, which can be
> > > > added statically.  I thought we can't use them in this case, because
> > > > the conditions are very ad-hoc, like if the JSON path computation
> > > > returned an "empty" item or if the "error" flag was set during that
> > > > computation, etc.
> > >
> > > The minimal fix would be to return the jump target from the function, and 
> > > then
> > > jump to that. That at least avoids the roundtrip to memory you have right 
> > > now.
> >
> > You mean like this:
> >
> > LLVMValueRef v_args[2];
> > LLVMValueRef v_ret;
> >
> > /*
> >  * Call ExecEvalJsonExprSkip() to decide if JSON path
> >  * evaluation can be skipped.  This returns the step
> >  * address to jump to.
> >  */
> > v_args[0] = v_state;
> > v_args[1] = l_ptr_const(op, l_ptr(StructExprEvalStep));
> > v_ret = LLVMBuildCall(b,
> >                       llvm_pg_func(mod, "ExecEvalJsonExprSkip"),
> >                       v_args, lengthof(v_args), "");
> >
> > Actually, this is how I had started, but never figured out how to jump
> > to the address in v_ret.  As in, how to extract the plain C int32
> > value that is the jump address from v_ret, an LLVMValueRef, to use in
> > the following statement:
> >
> > LLVMBuildBr(b, opblocks[]);
>
> We could make that work, but even keeping it more similar to your current
> code, you're already dealing with a variable jump target. Only that you load
> it from a variable, instead of the function return value. So you could just
> use v_ret instead of v_jumpaddr, and your code would be simpler and faster.

Ah, I see you mean to continue to use all the LLVMBuildCondBr
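
To make that suggestion concrete, here is a rough, illustrative sketch that
builds on the fragment quoted above (jump_target_a, jump_target_b and opno are
placeholders for whatever targets the step actually has; this is not code from
the patch):

    LLVMValueRef v_ret;
    LLVMValueRef v_switch;

    /*
     * Use the step address returned by ExecEvalJsonExprSkip() directly as
     * the jump target, by switching over the jump targets known at compile
     * time, instead of storing the target into memory and reloading it.
     */
    v_ret = LLVMBuildCall(b,
                          llvm_pg_func(mod, "ExecEvalJsonExprSkip"),
                          v_args, lengthof(v_args), "");

    /* fall through to the next step unless a known jump target is returned */
    v_switch = LLVMBuildSwitch(b, v_ret, opblocks[opno + 1], 2);
    LLVMAddCase(v_switch, LLVMConstInt(LLVMInt32Type(), jump_target_a, false),
                opblocks[jump_target_a]);
    LLVMAddCase(v_switch, LLVMConstInt(LLVMInt32Type(), jump_target_b, false),
                opblocks[jump_target_b]);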

Re: Checking pgwin32_is_junction() errors

2022-08-08 Thread r . zharkov

On 2022-08-06 08:02, Thomas Munro wrote:


> Pushed.
>
> Hmm, this stuff could *really* use a little test framework that's run
> by check-world, that exercises these various replacement operations.
> But I also suspect that problems in this area are likely to be due to
> concurrency.  It's hard to make a simple test that simulates the case
> where a file is unlinked between system calls within stat() and hits
> the STATUS_DELETE_PENDING case.  That check is code I cargo-culted in
> this patch.  So much of the stuff we've had in the tree relating to
> that area has been wrong in the past...


Hello, hackers!

initdb on my windows 10 system stopped working after the commit
c5cb8f3b: "Provide lstat() for Windows."
The error message is: creating directory C:/HOME/data ... initdb:
error: could not create directory "C:/HOME": File exists

"C:/HOME" is the junction point to the second volume on my hard drive -
"\??\Volume{GUID}\" which name pgreadlink() erroneously strips here:
https://github.com/postgres/postgres/blob/7e29a79a46d30dc236d097825ab849158929d977/src/port/dirmod.c#L357.
So initdb could not stat the file with name "Volume{GUID}", tried to
create it and failed.
With the attached patch initdb works fine again.

--
regards,

Roman

diff --git a/src/port/dirmod.c b/src/port/dirmod.c
index 2818bfd2e95..9d73620a6ba 100644
--- a/src/port/dirmod.c
+++ b/src/port/dirmod.c
@@ -354,7 +354,8 @@ pgreadlink(const char *path, char *buf, size_t size)
 	 * If the path starts with "\??\", which it will do in most (all?) cases,
 	 * strip those out.
 	 */
-	if (r > 4 && strncmp(buf, "\\??\\", 4) == 0)
+	if (r > 4 && strncmp(buf, "\\??\\", 4) == 0 &&
+		strncmp(buf, "\\??\\Volume", 10) != 0)
 	{
 		memmove(buf, buf + 4, strlen(buf + 4) + 1);
 		r -= 4;


Re: logical decoding and replication of sequences

2022-08-08 Thread Kyotaro Horiguchi
At Mon, 8 Aug 2022 18:15:46 +1200, Thomas Munro  wrote 
in 
> On Mon, Aug 8, 2022 at 9:09 AM Thomas Munro  wrote:
> > Thanks for the repro patch and bisection work.  Looking...
> 
> I don't have the complete explanation yet, but it's something like
> this.  We hit the following branch in xlogrecovery.c...
> 
>         if (StandbyMode &&
>             !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
>         {
>             /*
>              * Emit this error right now then retry this page immediately.
>              * Use errmsg_internal() because the message was already
>              * translated.
>              */
>             if (xlogreader->errormsg_buf[0])
>                 ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
>                         (errmsg_internal("%s", xlogreader->errormsg_buf)));
>
>             /* reset any error XLogReaderValidatePageHeader() might have set */
>             xlogreader->errormsg_buf[0] = '\0';
>             goto next_record_is_invalid;
>         }
> 
> ... but, even though there was a (suppressed) error, nothing
> invalidates the reader's page buffer.  Normally,
> XLogReadValidatePageHeader() failure or any other kind of error
> encountered by xlogreader.c'd decoding logic would do that, but here
> the read_page callback is directly calling the header validation.
> Without prefetching, that doesn't seem to matter, but reading ahead
> can cause us to have the problem page in our buffer at the wrong time,
> and then not re-read it when we should.  Or something like that.
> 
> The attached patch that simply moves the cache invalidation into
> report_invalid_record(), so that it's reached by the above code and
> everything else that reports an error, seems to fix the problem in
> src/bin/pg_ctl/t/003_promote.pl with Noah's spanner-in-the-works patch
> applied, and passes check-world without it.  I need to look at this
> some more, though, and figure out if it's the right fix.

If WaitForWALToBecomeAvailable returned by promotion, ReadPageInternal
misses the chance to invalidate reader-state.  That state is not an
error while in StandbyMode.

In the repro case, XLogPageRead returns XLREAD_WOULDBLOCK after the
first failure.  This situation (of course) was not considered when
that code was introduced. If that function is going to return with
XLREAD_WOULDBLOCK while lastSourceFailed, it should be turned into
XLREAD_FAIL. So, the following also works.

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 21088e78f6..9f242fe656 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3220,7 +3220,9 @@ retry:
                                            xlogreader->nonblocking))
     {
         case XLREAD_WOULDBLOCK:
-            return XLREAD_WOULDBLOCK;
+            if (!lastSourceFailed)
+                return XLREAD_WOULDBLOCK;
+            /* Fall through.  */
         case XLREAD_FAIL:
             if (readFile >= 0)
                 close(readFile);

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Column Filtering in Logical Replication

2022-08-08 Thread Peter Smith
PSA patch version v1* for a new "Column Lists" pgdocs section

This is just a first draft, but I wanted to post it as-is, with the
hope that I can get some feedback while continuing to work on it.

--
Kind Regards,
Peter Smith.
Fujitsu Australia


v1-0001-Column-List-replica-identity-rules.patch
Description: Binary data


v1-0002-Column-Lists-new-pgdocs-section.patch
Description: Binary data


Re: logical decoding and replication of sequences

2022-08-08 Thread Kyotaro Horiguchi
At Mon, 08 Aug 2022 17:33:22 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> If WaitForWALToBecomeAvailable returned by promotion, ReadPageInternal
> misses the chance to invalidate reader-state.  That state is not an
> error while in StandbyMode.

Mmm... Maybe I wanted to say:  (Still I'm not sure the rewrite works..)

If WaitForWALToBecomeAvailable returned by promotion, ReadPageInternal
would miss the chance to invalidate reader-state.  When XLogPageRead
is called in blocking mode while in StandbyMode (that is, the
traditional condition), the function continues retrying until it
succeeds, or returns XLREAD_FAIL if promotion is triggered.  In other
words, it was not supposed to return non-failure while the header
validation is failing in standby mode.  But while in nonblocking
mode, the function can return non-failure with lastSourceFailed =
true, which seems wrong.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-08 Thread John Naylor
On Mon, Aug 8, 2022 at 2:26 PM Bharath Rupireddy
 wrote:
>
>
> 1) pg_lfind32 - why just uint32? If it's not possible to define
> functions for char, unsigned char, int16, uint16, int32, int64, uint64
> and so on, can we add a few comments around that? Also, the comments

Future work, as far as I'm  concerned. I'm interested in using a char
version for json strings.

> 3) Can pg_lfind32 return the index of  the key found, for instance to
> use it for setting/resetting the found element in the array?

That was just discussed. It's slightly faster not to return an index.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-08 Thread Bharath Rupireddy
On Mon, Aug 8, 2022 at 2:30 PM John Naylor  wrote:
>
> On Mon, Aug 8, 2022 at 2:26 PM Bharath Rupireddy
>  wrote:
>
> > 3) Can pg_lfind32 return the index of  the key found, for instance to
> > use it for setting/resetting the found element in the array?
>
> That was just discussed. It's slightly faster not to return an index.

I haven't looked upthread, please share the difference. How about
another version of the function that returns the index too?

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/




Re: Skipping schema changes in publication

2022-08-08 Thread vignesh C
On Mon, Aug 8, 2022 at 12:46 PM vignesh C  wrote:
>
> On Fri, Jun 3, 2022 at 3:36 PM vignesh C  wrote:
> >
> > On Thu, May 26, 2022 at 7:04 PM osumi.takami...@fujitsu.com
> >  wrote:
> > >
> > > On Monday, May 23, 2022 2:13 PM vignesh C  wrote:
> > > > Attached v7 patch which fixes the buildfarm warning for an unused
> > > > variable in release mode, as in [1].
> > > Hi, thank you for the patches.
> > >
> > >
> > > I'll share several review comments.
> > >
> > > For v7-0001.
> > >
> > > (1) I'll suggest some minor rewording.
> > >
> > > +  
> > > +   The RESET clause will reset the publication to the
> > > +   default state which includes resetting the publication options, 
> > > setting
> > > +   ALL TABLES flag to false and
> > > +   dropping all relations and schemas that are associated with the 
> > > publication.
> > >
> > > My suggestion is
> > > "The RESET clause will reset the publication to the
> > > default state. It resets the publication operations,
> > > sets ALL TABLES flag to false and drops all relations
> > > and schemas associated with the publication."
> >
> > I felt the existing looks better. I would prefer to keep it that way.
> >
> > > (2) typo and rewording
> > >
> > > +/*
> > > + * Reset the publication.
> > > + *
> > > + * Reset the publication options, setting ALL TABLES flag to false and 
> > > drop
> > > + * all relations and schemas that are associated with the publication.
> > > + */
> > >
> > > The "setting" in this sentence should be "set".
> > >
> > > How about changing like below ?
> > > FROM:
> > > "Reset the publication options, setting ALL TABLES flag to false and drop
> > > all relations and schemas that are associated with the publication."
> > > TO:
> > > "Reset the publication operations, set ALL TABLES flag to false and drop
> > > all relations and schemas associated with the publication."
> >
> >  I felt the existing looks better. I would prefer to keep it that way.
> >
> > > (3) AlterPublicationReset
> > >
> > > Do we need to call CacheInvalidateRelcacheAll() or
> > > InvalidatePublicationRels() at the end of
> > > AlterPublicationReset() like AlterPublicationOptions() ?
> >
> > CacheInvalidateRelcacheAll should be called if we change all tables
> > from true to false, else the cache will not be invalidated. Modified
> >
> > >
> > > For v7-0002.
> > >
> > > (4)
> > >
> > > +   if (stmt->for_all_tables)
> > > +   {
> > > +       bool        isdefault = CheckPublicationDefValues(tup);
> > > +
> > > +       if (!isdefault)
> > > +           ereport(ERROR,
> > > +                   errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> > > +                   errmsg("adding ALL TABLES requires the publication to have default publication options, no tables/schemas associated and ALL TABLES flag should not be set"),
> > > +                   errhint("Use ALTER PUBLICATION ... RESET to reset the publication"));
> > >
> > >
> > > The errmsg string has three messages for the user and is a bit long
> > > (we have two sentences there connected by 'and').
> > > Can't we make it concise and split it into a couple of lines for code
> > > readability?
> > >
> > > I'll suggest a change below.
> > > FROM:
> > > "adding ALL TABLES requires the publication to have default publication 
> > > options, no tables/schemas associated and ALL TABLES flag should not be 
> > > set"
> > > TO:
> > > "adding ALL TABLES requires the publication defined not for ALL TABLES"
> > > "to have default publish actions without any associated tables/schemas"
> >
> > Added errdetail and split it
> >
> > > (5) typo
> > >
> > >
> > > +EXCEPT TABLE
> > > +
> > > + 
> > > +  This clause specifies a list of tables to exclude from the 
> > > publication.
> > > +  It can only be used with FOR ALL TABLES.
> > > + 
> > > +
> > > +   
> > > +
> > >
> > > Kindly change
> > > FROM:
> > > This clause specifies a list of tables to exclude from the publication.
> > > TO:
> > > This clause specifies a list of tables to be excluded from the 
> > > publication.
> > > or
> > > This clause specifies a list of tables excluded from the publication.
> >
> > Modified
> >
> > > (6) Minor suggestion for an expression change
> > >
> > >Marks the publication as one that replicates changes for all 
> > > tables in
> > > -  the database, including tables created in the future.
> > > +  the database, including tables created in the future. If
> > > +  EXCEPT TABLE is specified, then exclude 
> > > replicating
> > > +  the changes for the specified tables.
> > >
> > >
> > > I'll suggest a minor rewording.
> > > FROM:
> > > ...exclude replicating the changes for the specified tables
> > > TO:
> > > ...exclude replication changes for the specified tables
> >
> > I felt the existing is better.
> >
> > > (7)
> > > (7-1)
> > >
> > > +/*
> > > + * Check if the publication has default values
> > > + *
> > > + * C

Re: bug on log generation ?

2022-08-08 Thread Marcos Pegoraro
>
>
> How are you running postgres? If the logger process runs into trouble it
> might
> write to stderr.
>
> Is there a chance your huge statements would make you run out of space?
>

Well, I don't think it is an out-of-space problem, because it
doesn't stop logging; it just splits that message. As you can see, the next
message is logged properly. And that statement is not so huge; these
statements are no more than 10 or 20 kB. And as I said, these statements
occur dozens of times a day, but only once or twice one is not correctly
logged.
An additional detail: the split message has an out-of-order log time. At
that time the log file was getting 2 or 3 log entries per second, and that
message was timestamped 1 or 2 minutes later. It seems like it occurs now
but is stored a minute or two later.

thanks
Marcos


Re: Reducing planning time on tables with many indexes

2022-08-08 Thread Luc Vlaming Hummel
On 27.07.22, 18:39, "Tom Lane"  wrote:



> David Geier  writes:
> > We tracked down the root cause of this slowdown to lock contention in
> > 'get_relation_info()'. The index lock of every single index of every single
> > table used in that query is acquired. We attempted a fix by pre-filtering
> > out all indexes that anyways cannot be used with a certain query, without
> > taking the index locks (credits to Luc Vlaming for idea and
> > implementation). The patch does so by caching the columns present in every
> > index, inside 'struct Relation', similarly to 'rd_indexlist'.
>
> I wonder how much thought you gave to the costs imposed by that extra
> cache space.  We have a lot of users who moan about relcache bloat
> already.  But more to the point, I do not buy the assumption that
> an index's set of columns is a good filter for which indexes are of
> interest.  A trivial counterexample from the regression database is
>
> regression=# explain select count(*) from tenk1;
>                                      QUERY PLAN
> ------------------------------------------------------------------------------------
>  Aggregate  (cost=219.28..219.29 rows=1 width=8)
>    ->  Index Only Scan using tenk1_hundred on tenk1  (cost=0.29..194.28 rows=10000 width=0)
> (2 rows)
>
> It looks to me like the patch also makes unwarranted assumptions about
> being able to discard all but the smallest index having a given set
> of columns.  This would, for example, possibly lead to dropping the
> index that has the most useful sort order, or that has the operator
> class needed to support a specific WHERE clause.

Thanks for checking out the patch!

Just to make sure we're on the same page: we're only making this assumption if 
you select no fields at all.
If you select any fields at all it will check for column overlap, and if 
there's any overlap with any referenced field, 
then the index will not be filtered out.

For producing a row count with no referenced fields it is true that it should 
select the truly cheapest 
index to produce the row count and there should be some Index-am callback 
introduced for that. 
For now it was just a quick-and-dirty solution.
Wouldn't a callback that would estimate the amount of data read be good enough 
though?

For sort orders the field to sort by should be listed and hence the index 
should not be filtered out,
or what am I missing? Likely I've missed some fields that are referenced 
somehow (potentially indirectly),
but that shouldn't disqualify the approach completely.

> In short, I'm not sure I buy this concept at all.  I think it might
> be more useful to attack the locking overhead more directly.  I kind
> of wonder why we need per-index locks at all during planning ---
> I think that we already interpret AccessShareLock on the parent table
> as being sufficient to block schema changes on existing indexes.

Could you elaborate as to why this approach is not good enough? To me it seems 
that avoiding work
ahead of time is generally useful. Or are you worried that we remove too much?

> Unfortunately, as things stand today, the planner needs more than the
> right to look at the indexes' schemas, because it makes physical accesses
> to btree indexes to find out their tree height (and I think there are some
> comparable behaviors in other AMs).  I've never particularly cared for
> that implementation, and would be glad to rip out that behavior if we can
> find another way.  Maybe we could persuade VACUUM or ANALYZE to store that
> info in the index's pg_index row, or some such, and then the planner
> could use it with no lock?
>
> regards, tom lane


The thing you're touching on is specific for a btree. Not sure this generalizes 
to all index types that
are out there though? I could see there being some property that allows you to 
be "no-lock",
and then a callback that allows you to cache some generic data that can be 
transformed
when the indexopt info structs are filled. Is that roughly what you have in 
mind?

Best,
Luc



Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-08-08 Thread Bharath Rupireddy
On Sun, Aug 7, 2022 at 9:22 PM Bharath Rupireddy
 wrote:
>
> On Sun, Aug 7, 2022 at 3:19 PM Thomas Munro  wrote:
> >
> > > A second thing is that pg_pwritev_with_retry_and_write_zeros() is
> > > designed to work on WAL segments initialization and it uses
> > > XLOG_BLCKSZ and PGAlignedXLogBlock for the job, but there is nothing
> > > in its name that tells us so.  This makes me question whether
> > > file_utils.c is a good location for this second thing.  Could a new
> > > file be a better location?  We have a xlogutils.c in the backend, and
> > > a name similar to that in src/common/ would be one possibility.
> >
> > Yeah, I think it should probably be disconnected from XLOG_BLCKSZ, or
> > maybe it's OK to use BLCKSZ with a comment to say that it's a bit
> > arbitrary, or maybe it's better to define a new zero buffer of some
> > arbitrary size just in this code if that is too strange.  We could
> > experiment with different size buffers to see how it performs, bearing
> > in mind that every time we double it you halve the number of system
> > calls, but also bearing in mind that at some point it's too much for
> > the stack.  I can tell you that the way that code works today was not
> > really written with performance in mind (unlike, say, the code
> > reverted from 9.4 that tried to do this with posix_fallocate()), it
> > was just finding an excuse to call pwritev(), to exercise new fallback
> > code being committed for use by later AIO stuff (more patches coming
> > soon).  The retry support was added because it seemed plausible that
> > some system out there would start to do short writes as we cranked up
> > the sizes for some implementation reason other than ENOSPC, so we
> > should make a reusable retry routine.
>
> Yes, doubling the zerobuffer size to say 2 * XLOG_BLCKSZ or 2 * BLCKSZ
> reduces the system calls to half (right now, pg_pwritev_with_retry()
> gets called 64 times per 16MB WAL file, it writes in the batches of 32
> blocks per call).
>
> Is there a ready-to-use tool or script or specific settings for
> pgbench (pgbench command line options or GUC settings) that I can play
> with to measure the performance?

I played with a simple insert use-case [1] that generates ~380 WAL
files, with different block sizes. To my surprise, I have not seen any
improvement with larger block sizes. I may be doing something wrong
here; suggestions on how to test and see the benefits are welcome.

> > I think this should also handle the remainder after processing whole
> > blocks, just for completeness.  If I call the code as presented with size
> > 8193, I think this code will only write 8192 bytes.
>
> Hm, I will fix it.

Fixed.

I'm attaching v5 patch-set. I've addressed review comments received so
far and fixed a compiler warning that CF bot complained about.

Please review it further.
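
As a point of reference for the buffer-size experiments discussed above, here
is a hedged sketch of the batched zero-fill pattern; the pg_pwritev_with_retry()
prototype below and the ZERO_BUF_SIZE value are assumptions made for the
illustration, not part of the patch set:

#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* assumed prototype, matching the existing retry wrapper */
extern ssize_t pg_pwritev_with_retry(int fd, const struct iovec *iov,
                                     int iovcnt, off_t offset);

#define ZERO_BUF_SIZE   (32 * 8192)     /* e.g. 32 blocks per system call */

static int
zero_fill_file(int fd, off_t total_size)
{
    static char zerobuf[ZERO_BUF_SIZE]; /* static, so already zero-filled */
    off_t       offset = 0;

    while (offset < total_size)
    {
        struct iovec iov;
        size_t      len = (total_size - offset) < ZERO_BUF_SIZE ?
            (size_t) (total_size - offset) : ZERO_BUF_SIZE;

        iov.iov_base = zerobuf;
        iov.iov_len = len;

        /* doubling ZERO_BUF_SIZE roughly halves the number of calls here */
        if (pg_pwritev_with_retry(fd, &iov, 1, offset) < 0)
            return -1;
        offset += (off_t) len;
    }
    return 0;
}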

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/


v3-0001-Move-pg_pwritev_with_retry-to-file_utils.c.patch
Description: Binary data


v3-0002-Use-pg_pwritev_with_retry-instead-of-write-in-wal.patch
Description: Binary data


Asking for feedback on Pgperffarm

2022-08-08 Thread Yedil Serzhan
Dear hackers,

I'm Yedil. I'm working on the project "Postgres Performance Farm" during
GSoC. Pgperffarm is a project like the Postgres build farm but focused on the
performance of the database. It currently has 2 types of benchmarks, pgbench
and TPC-H. The website is online here, and the repo is here.

I would like you to take a look at our website and, if possible, give some
feedback on, for example, what other data should be collected or what other
metrics could be used to compare performance.

Thanks for your time in advance!

Best regards
Yedil


SUBTRANS: Minimizing calls to SubTransSetParent()

2022-08-08 Thread Simon Riggs
On Thu, 4 Aug 2022 at 13:11, Simon Riggs  wrote:
>
> On Wed, 3 Aug 2022 at 20:18, Andres Freund  wrote:
>
> > I think we should consider redesigning subtrans more substantially - even 
> > with
> > the changes you propose here, there's still plenty ways to hit really bad
> > performance. And there's only so much we can do about that without more
> > fundamental design changes.
>
> I completely agree - you will be glad to hear that I've been working
> on a redesign of the subtrans module.
...
> I will post my patch, when complete, in a different thread.

The attached patch reduces the overhead of SUBTRANS by minimizing the
number of times SubTransSetParent() is called, to below 1% of the
current rate in common cases.

Instead of blindly calling SubTransSetParent() for every subxid, this
proposal only calls SubTransSetParent() when that information will be
required for later use. It does this by analyzing all of the callers
of SubTransGetParent() and uses these pre-conditions to filter out
calls/subxids that will never be required, for various reasons. It
redesigns the way XactLockTableWait() calls
SubTransGetTopmostTransactionId() to allow this.

This short patchset compiles and passes make check-world, with lengthy comments.

This might then make viable a simple rewrite of SUBTRANS using a hash
table, as proposed by Andres. But in any case, it will allow us to
design a densely packed SUBTRANS replacement that does not generate as
much contention and I/O.

NOTE that this patchset does not touch SUBTRANS at all, it just
minimizes the calls in preparation for a later redesign in a later
patch. If this patch/later versions of it is committed in Sept CF,
then we should be in good shape to post a subtrans redesign patch by
major patch deadline at end of year.

Patches 001 and 002 are common elements of a different patch,
"Smoothing the subtrans performance catastrophe", but other than that,
the two patches are otherwise independent of each other.

Where does this come from? I learnt a lot about subxids when coding
Hot Standby, specifically commit 06da3c570f21394003. This patch just
builds upon that earlier understanding.

Comments please.

--
Simon Riggs    http://www.EnterpriseDB.com/


003_minimize_calls_to_SubTransSetParent.v7.patch
Description: Binary data


002_new_isolation_tests_for_subxids.v2.patch
Description: Binary data


001_subx_include_subxids_even_if_overflowed.v2.patch
Description: Binary data


Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

2022-08-08 Thread Bharath Rupireddy
On Fri, Aug 5, 2022 at 8:19 AM Kyotaro Horiguchi
 wrote:
>
> At Tue, 26 Apr 2022 08:26:59 +0200, Laurenz Albe  
> wrote in
> > While this may mitigate the problem, I don't think it will deal with
> > all the cases which could cause a transaction to end up committed locally,
> > but not on the synchronous standby.  I think that only using the full
> > power of two-phase commit can make this bulletproof.
> >
> > Is it worth adding additional complexity that is not a complete solution?
>
> I would agree to this. Likewise 2PC, whatever we do to make it
> perfect, we're left with unresolvable problems at least for now.
>
> Doesn't it meet your requirements if we have a means to know the last
> transaction on the current session is locally committed or aborted?
>
> We are already internally managing last committed LSN. I think we can
> do the same thing about transaction abort and last inserted LSN and we
> can expose them any way. This is way simpler than the (maybe)
> uncompletable attempt to fill up the deep gap.

There can be more txns that are
locally-committed-but-not-yet-replicated. Even if we have that
information stored somewhere, what do we do with it? Those txns are
committed from the client perspective but not committed from the
server's perspective.

Can you please explain more about your idea, I may be missing something?

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/




Refactor backup related code (was: Is it correct to say, "invalid data in file \"%s\"", BACKUP_LABEL_FILE in do_pg_backup_stop?)

2022-08-08 Thread Bharath Rupireddy
On Sat, Jul 30, 2022 at 5:37 AM Bharath Rupireddy
 wrote:
>
> On Tue, Jul 26, 2022 at 5:50 PM David Steele  wrote:
> >
> > >> I would prefer to have all the components of backup_label stored
> > >> separately and then generate backup_label from them in pg_backup_stop().
> > >
> > > +1, because pg_backup_stop is the one that's returning backup_label
> > > contents, so it does make sense for it to prepare it once and for all
> > > and return.
> > >
> > >> For PG16 I am planning to add some fields to backup_label that are not
> > >> known when pg_backup_start() is called, e.g. min recovery time.
> > >
> > > Can you please point to your patch that does above?
> >
> > Currently it is a plan, not a patch. So there is nothing to show yet.
> >
> > > Yes, right now, backup_label or tablespace_map contents are being
> > > filled in by pg_backup_start and are never changed again. But if your
> > > above proposal is for fixing some issue, then it would make sense for
> > > us to carry all the info in a structure to pg_backup_stop and then let
> > > it prepare the backup_label and tablespace_map contents.
> >
> > I think this makes sense even if I don't get these changes into PG16.
> >
> > > If the approach is okay for the hackers, I would like to spend time on it.
> >
> > +1 from me.
>
> Here comes the v1 patch. This patch tries to refactor backup related
> code, advantages of doing so are following:
>
> 1) backup state is more structured now - all in a single structure,
> callers can create backup_label contents whenever required, either
> during the pg_backup_start or the pg_backup_stop or in between.
> 2) no parsing of backup_label file lines now in pg_backup_stop, no
> error checking for invalid parsing.
> 3) backup_label and history file contents have most of the things in
> common, they can now be created within a single function.
> 4) makes backup related code extensible and readable.
>
> One downside is that it creates a lot of diff with previous versions.
>
> Please review.

I added this to current CF - https://commitfest.postgresql.org/39/3808/
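
For readers skimming the thread, a hedged sketch of the "single structure" idea
described in the list quoted above (the struct and field names here are
illustrative guesses, not the patch's actual definitions):

#include "postgres.h"           /* assumes a PostgreSQL source tree */
#include "access/xlogdefs.h"    /* XLogRecPtr, TimeLineID */
#include "pgtime.h"             /* pg_time_t */

/*
 * Illustration only: collect everything needed to emit backup_label and the
 * backup history file in one place, so pg_backup_stop() can build the file
 * contents without re-parsing strings produced by pg_backup_start().
 */
typedef struct BackupStateSketch
{
    char       *name;           /* backup label given by the user */
    XLogRecPtr  startpoint;     /* backup start WAL location */
    TimeLineID  starttli;       /* timeline at backup start */
    XLogRecPtr  checkpointloc;  /* checkpoint location */
    pg_time_t   starttime;      /* backup start timestamp */
    bool        started_in_recovery;    /* backup started on a standby? */
    XLogRecPtr  stoppoint;      /* backup stop WAL location */
    TimeLineID  stoptli;        /* timeline at backup stop */
    pg_time_t   stoptime;       /* backup stop timestamp */
} BackupStateSketch;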

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/




Re: bug on log generation ?

2022-08-08 Thread Andrew Dunstan


On 2022-08-08 Mo 07:34, Marcos Pegoraro wrote:
>
>
> How are you running postgres? If the logger process runs into
> trouble it might
> write to stderr.
>
> Is there a chance your huge statements would make you run out of
> space?
>
> Well, I don't think it is an out-of-space problem, because it
> doesn't stop logging; it just splits that message. As you can see, the
> next message is logged properly. And that statement is not so huge;
> these statements are no more than 10 or 20 kB. And as I said, these
> statements occur dozens of times a day, but only once or twice one is
> not correctly logged.
> An additional detail: the split message has an out-of-order log
> time. At that time the log file was getting 2 or 3 log entries per
> second, and that message was timestamped 1 or 2 minutes later. It seems
> like it occurs now but is stored a minute or two later.
>
>

It looks like a failure of the log chunking protocol, with long messages
being improperly interleaved. I don't think we've had reports of such a
failure since commit c17e863bc7 back in 2012, but maybe my memory is
failing.

What platform is this on? Is it possible that on some platform the chunk
size we're using is not doing an atomic write?


syslogger.h says:


#ifdef PIPE_BUF
/* Are there any systems with PIPE_BUF > 64K?  Unlikely, but ... */
#if PIPE_BUF > 65536
#define PIPE_CHUNK_SIZE  65536
#else
#define PIPE_CHUNK_SIZE  ((int) PIPE_BUF)
#endif
#else   /* not defined */
/* POSIX says the value of PIPE_BUF must be at least 512, so use that */
#define PIPE_CHUNK_SIZE  512
#endif


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: bug on log generation ?

2022-08-08 Thread Tom Lane
Andrew Dunstan  writes:
> What platform is this on? Is it possible that on some platform the chunk
> size we're using is not doing an atomic write?

Another idea is that some of the write() calls are failing --- elog.c
doesn't check for that.  Eyeing the POSIX spec for write(), I wonder
if somehow the pipe has gotten set into O_NONBLOCK mode and we're
not retrying EAGAIN failures.
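
For illustration, a minimal sketch of the kind of checking and retrying that
would cover both possibilities (this is not elog.c's actual code):

#include <errno.h>
#include <unistd.h>

/*
 * Illustrative only: check write()'s result and retry transient failures
 * such as EINTR/EAGAIN, so a nonblocking pipe cannot silently drop part of
 * a log chunk.
 */
static ssize_t
write_all(int fd, const char *buf, size_t len)
{
    size_t      written = 0;

    while (written < len)
    {
        ssize_t     rc = write(fd, buf + written, len - written);

        if (rc < 0)
        {
            if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK)
                continue;       /* transient; retry (real code might sleep) */
            return -1;          /* hard failure: at least make it visible */
        }
        written += (size_t) rc;
    }
    return (ssize_t) written;
}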

regards, tom lane




Re: Remaining case where reltuples can become distorted across multiple VACUUM operations

2022-08-08 Thread Peter Geoghegan
On Fri, Aug 5, 2022 at 5:39 PM Peter Geoghegan  wrote:
> Attached patch closes the remaining gap. With the patch applied,
> the second and subsequent vacuum verbose operations from the test case
> will show that reltuples is still 1 (it won't ever change). The
> patch just extends an old behavior that was applied when scanned_pages
> == 0 to cases where scanned_pages <= 1 (unless we happened to scan all
> of the relation's pages, of course).

My plan is to commit this later in the week, barring objections. Maybe
on Thursday.

-- 
Peter Geoghegan




Re: `make check` doesn't pass on MacOS Catalina

2022-08-08 Thread Andrew Dunstan


On 2022-08-06 Sa 12:10, Andrew Dunstan wrote:
> On 2022-08-06 Sa 11:25, Tom Lane wrote:
>> Andrew Dunstan  writes:
>>> I came across this when I was working on setting up some Dockerfiles for
>>> the buildfarm. Apparently LD_LIBRARY_PATH doesn't work on Alpine, at
>>> least out of the box, as it uses a different linker, and "make check"
>>> relies on it (or the moral equivalent) if "make install" hasn't been run.
>> I did some quick googling on this point.  We seem not to be the only
>> project having linking issues on Alpine, and yet it does support
>> LD_LIBRARY_PATH according to some fairly authoritative-looking pages, eg
>>
>> https://www.musl-libc.org/doc/1.0.0/manual.html
>>
>> I suspect the situation is similar to macOS, ie there is some limitation
>> somewhere on whether LD_LIBRARY_PATH gets passed through.  If memory
>> serves, the problem on SIP-enabled Mac is that DYLD_LIBRARY_PATH is
>> cleared upon invoking bash, so that we lose it anywhere that "make"
>> invokes a shell to run a subprogram.  (Hmm ... I wonder whether ninja
>> uses the shell ...)  I don't personally care at all about Alpine, but
>> maybe somebody who does could dig a little harder and characterize
>> the problem there better.
>>
>>  
>
> We probably should care about Alpine, because it's a good distro to use
> as the basis for Docker images, being fairly secure, very small, and
> booting very fast.
>
> I'll dig some more, and possibly set up a (docker based) buildfarm instance.
>
>

It appears that LD_LIBRARY_PATH is supported on Alpine but it fails if
chained, which seems somewhat braindead. The regression tests get errors
like this:


+ERROR:  could not load library
"/app/buildroot/HEAD/pgsql.build/tmp_install/app/buildroot/HEAD/inst/lib/postgresql/libpqwalreceiver.so":
Error loading shared library libpq.so.5: No such file or directory
(needed by
/app/buildroot/HEAD/pgsql.build/tmp_install/app/buildroot/HEAD/inst/lib/postgresql/libpqwalreceiver.so)


If the check stage is delayed until after the install stage the tests pass.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: bug on log generation ?

2022-08-08 Thread Marcos Pegoraro
>
> What platform is this on? Is it possible that on some platform the chunk
> size we're using is not doing an atomic write?
>

Until last year we were on Ubuntu 16.04 and Postgres 11 with the latest minor
update.
This January we changed to Ubuntu 20.04 and Postgres 14, now updated to
14.4.

But the problem occurred on both the old and new OS and Postgres versions.
Right now I opened the current log file and there are 20 or 30 of these
statements and all of them are fine; maybe tomorrow the problem comes back,
maybe this afternoon.

thanks
Marcos


Re: Remaining case where reltuples can become distorted across multiple VACUUM operations

2022-08-08 Thread Matthias van de Meent
On Mon, 8 Aug 2022 at 16:52, Peter Geoghegan  wrote:
>
> On Fri, Aug 5, 2022 at 5:39 PM Peter Geoghegan  wrote:
> > Attached patch closes the remaining gap. With the patch applied,
> > the second and subsequent vacuum verbose operations from the test case
> > will show that reltuples is still 1 (it won't ever change). The
> > patch just extends an old behavior that was applied when scanned_pages
> > == 0 to cases where scanned_pages <= 1 (unless we happened to scan all
> > of the relation's pages, of course).
>
> My plan is to commit this later in the week, barring objections. Maybe
> on Thursday.

I do not have intimate knowledge of this code, but shouldn't we also
add some safety guarantees like the following in these blocks? Right
now, we'll keep underestimating the table size even when we know that
the count is incorrect.

if (scanned_tuples > old_rel_tuples)
return some_weighted_scanned_tuples;

Kind regards,

Matthias van de Meent




Re: failing to build preproc.c on solaris with sun studio

2022-08-08 Thread Andrew Dunstan


On 2022-08-07 Su 02:46, Andres Freund wrote:
>
>> I think if we want to get this past the finish line, we need to
>> acknowledge that the initial commit isn't going to be perfect.
>> The whole point of continuing to maintain the Makefiles is to
>> give us breathing room to fix remaining issues in a leisurely
>> fashion.
> Wholeheartedly agreed.
>

I'm waiting for that first commit so I can start working on the
buildfarm client changes. Ideally (from my POV) this would happen by
early Sept when I will be leaving on a trip for some weeks, and this
would be a good project to take with me. Is that possible?


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Clarifying Commitfest policies

2022-08-08 Thread Matthias van de Meent
On Thu, 4 Aug 2022 at 20:38, Jacob Champion  wrote:
>
> On Wed, Aug 3, 2022 at 2:05 PM Matthias van de Meent
>  wrote:
> > On Wed, 3 Aug 2022 at 20:04, Jacob Champion  wrote:
> > > Is that enough, or should we do more?
> >
> > "The CF Checklist" seems to refer to only the page that is (or seems
> > to be) intended for the CFM only. We should probably also update the
> > pages of "Commitfest", "Submitting a patch", "Reviewing a Patch", "So,
> > you want to be a developer?", and the "Developer FAQ" page, which
> > doesn't have to be more than removing outdated information and
> > refering to any (new) documentation on how to participate in the
> > PostgreSQL development and/or Commitfest workflow as a non-CFM.
>
> Agreed, a sweep of those materials would be helpful as well. I'm
> personally focused on CFM tasks, since it's fresh in my brain and
> documentation is almost non-existent for it, but if you have ideas for
> those areas, I definitely don't want to shut down that line of the
> conversation.

Nor would I want to hold you back on CFM documentation.

> > Additionally, we might want to add extra text to the "developers"
> > section of the main website [0] to refer to any new documentation.
> > This suggestion does depend on whether the new documentation has a
> > high value for potential community members.
>
> Right. So what kinds of info do we want to highlight in this
> documentation, to make it high-quality?

I think it would be a combined and abbreviated version of the detailed
manuals that we (will) have: The pages "Submitting a patch" and
"Reviewing a patch" on the wiki, and the CommitFest manual (plus
potentially info on CFBot).

The first part of "So, you want to be a developer?" seems like a very
good starting point for dense, high-quality entry-level documentation.
Each section should then further refer to the relevant sections of the
"Developer FAQ" and the "Submitting / Reviewing a Patch" pages for the
in-and-outs of the specific procedure (such as "installing development
dependencies", "reviewing changes", "code style", etc.).

> Drawing from some of the questions I've seen recently, we could talk about
> - CF "power" structure (perhaps simply to highlight that the CFM has
> no additional authority to get patches in)
> - the back-and-forth process on the mailing list, maybe including
> expected response times
> - what to do when a patch is returned (or rejected)
>
> > As an example, the GitHub mirror of the main PostgreSQL repository
> > receives a decent amount of pull request traffic. When a project has a
> > CONTRIBUTING.md -file at the top level people writing the pull request
> > message will be pointed to those contributing guidelines. This could
>
> (I think some text got cut here.)

... This could help reduce the amount of mis-addressed (maybe better
word: mis-located?) contributions, and potentially help the
contributor get involved at -hackers. Indeed this process is much more
involved than 'just' opening a pull request, but at least it is now
slightly more visible.

> The mirror bot will point you to the "So, you want to be a developer?"
> wiki when you open a PR, but I agree that a CONTRIBUTING doc would
> help prevent that small embarrassment.

That's news to me, but nice to see some improvements there. I have
previously noticed that there were PRs on GitHub that went unnoticed
for several weeks, so this bot is a significant improvement.


Kind regards,

Matthias van de Meent




Get the statistics based on the application name and IP address

2022-08-08 Thread Ibrar Ahmed
While working on pg_stat_statements, I got some questions from customers about
having statistics by application and IP address. I know that we are collecting
the statistics by query id, user id, database id and top-level query, but there
is no way to collect statistics based on IP address and application name. It
is possible that multiple applications issue the same queries with the same
user on the same database, and we cannot tell which application a given query
came from. I know we can get this in the log file with log_line_prefix, but I
want to see aggregates, like call counts, based on IP and application name. I
did some POC work and have a patch.

But before sharing the patch, I need to know if there has been any previous
discussion about this topic; by the way, I did some Googling to find that but
failed.

Thoughts?


-- 

Ibrar Ahmed.
Senior Software Engineer, PostgreSQL Consultant.


Re: Remaining case where reltuples can become distorted across multiple VACUUM operations

2022-08-08 Thread Peter Geoghegan
On Mon, Aug 8, 2022 at 8:14 AM Matthias van de Meent
 wrote:
> I do not have intimate knowledge of this code, but shouldn't we also
> add some safety guarantees like the following in these blocks? Right
> now, we'll keep underestimating the table size even when we know that
> the count is incorrect.
>
> if (scanned_tuples > old_rel_tuples)
> return some_weighted_scanned_tuples;

Not sure what you mean -- we do something very much like that already.

We take the existing tuple density, and assume that that hasn't
changed for any unscanned pages -- that is used to build a total
number of tuples for the unscanned pages. Then we add the number of
live tuples/scanned_tuples that the vacuumlazy.c caller just
encountered on scanned_pages. That's often where the final reltuples
value comes from.
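
Spelled out as a rough sketch (the function and variable names here are
illustrative; the real logic lives in vac_estimate_reltuples() in vacuum.c):

#include <math.h>

static double
estimate_reltuples_sketch(double old_rel_pages, double old_rel_tuples,
                          double total_pages, double scanned_pages,
                          double scanned_tuples)
{
    double      old_density;
    double      unscanned_pages;

    /* scanned everything, or nothing to extrapolate from (simplification) */
    if (scanned_pages >= total_pages || old_rel_pages == 0)
        return scanned_tuples;

    /* assume unscanned pages still have the previously observed density */
    old_density = old_rel_tuples / old_rel_pages;
    unscanned_pages = total_pages - scanned_pages;

    return floor(old_density * unscanned_pages + scanned_tuples + 0.5);
}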

-- 
Peter Geoghegan




Re: Remaining case where reltuples can become distorted across multiple VACUUM operations

2022-08-08 Thread Matthias van de Meent
On Mon, 8 Aug 2022 at 17:26, Peter Geoghegan  wrote:
>
> On Mon, Aug 8, 2022 at 8:14 AM Matthias van de Meent
>  wrote:
> > I do not have intimate knowledge of this code, but shouldn't we also
> > add some safety guarantees like the following in these blocks? Right
> > now, we'll keep underestimating the table size even when we know that
> > the count is incorrect.
> >
> > if (scanned_tuples > old_rel_tuples)
> > return some_weighted_scanned_tuples;
>
> Not sure what you mean -- we do something very much like that already.
>
> We take the existing tuple density, and assume that that hasn't
> changed for any unscanned pages -- that is used to build a total
> number of tuples for the unscanned pages. Then we add the number of
> live tuples/scanned_tuples that the vacuumlazy.c caller just
> encountered on scanned_pages. That's often where the final reltuples
> value comes from.

Indeed we often apply this, but not always. It is the default case,
but it is never applied in the special cases.

For example, if currently the measured 2% of the pages contains more
than 100% of the previous count of tuples, or with your patch the last
page contains more than 100% of the previous count of the tuples, that
new count is ignored, which seems silly considering that the vacuum
count is supposed to be authoritative.

Kind regards,

Matthias van de Meent




Re: bug on log generation ?

2022-08-08 Thread Andrew Dunstan


On 2022-08-08 Mo 11:07, Marcos Pegoraro wrote:
>
> What platform is this on? Is it possible that on some platform the
> chunk
> size we're using is not doing an atomic write?
>
>  
> Until last year we were Ubuntu 16.04 and Postgres 11 with the latest
> minor update.
> This January we changed to Ubuntu 20.04 and Postgres 14, now updated
> to 14.4.
>
> But the problem occurred on both the old and new OS and Postgres versions.
> Right now I opened the current log file and there are 20 or 30 of
> these statements and all of them are fine, maybe tomorrow the problem
> comes back, maybe this afternoon.
>
>

OK, we really need a repeatable test if possible. Perhaps a pgbench run
with lots of concurrent runs of some very long query would do the trick.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: [PATCH] CF app: add "Returned: Needs more interest"

2022-08-08 Thread Jacob Champion
On Thu, Aug 4, 2022 at 3:00 PM Andres Freund  wrote:
> On 2022-08-04 11:19:28 -0700, Jacob Champion wrote:
> > My intention had not quite been for this to be a referendum on the
> > decision for every patch -- we can do that if it helps, but I don't
> > think we necessarily have to have unanimity on the bucketing for every
> > patch in order for the new state to be useful.
>
> Sorry, I should have been clearer. It wasn't mine either! I was just trying to
> understand what you see as the usecase / get a better feel for it. I'm now a
> bit more convinced it's useful than before.

Great!

> > >> - https://commitfest.postgresql.org/38/3310/
> > >
> > > I don't really understand why this has been RwF'd, doesn't seem that long
> > > since the last review leading to changes.
> >
> > Eight months without feedback, when we expect authors to turn around a
> > patch in two weeks or less to avoid being RwF'd, is a long time IMHO.
>
> Why is it better to mark it as "lacks interest" than RwF if there actually
> *has* been feedback?

Because I don't think the utility of RwF is in saying "we gave you
feedback once and then ghosted"; I think it's in saying "this patchset
needs work before the next round of review." If an author has
responded to the feedback and the patchset is just sitting there for
months, the existence of the feedback is less relevant.

> > I don't think a patch should sit motionless in CF for eight months; it's not
> > at all fair to the author.
>
> It's not great, I agree, but wishes don't conjure up resources :(

I see this less as a wish for resources, and more as an honest
admission -- we don't currently have enough resources to give each
patch the eyes it deserves, so if an author finds themselves in this
state, they'll have to put in some more work to find those eyes
somewhere.

> > >> - https://commitfest.postgresql.org/38/3050/
> > >
> > > Given that a non-author did a revision of the patch, listed a number of 
> > > TODO
> > > items and said "I'll create regression tests firstly." - I don't think 
> > > "lacks
> > > interest" would have been appropriate, and RwF is?
> >
> > That was six months ago, and prior to that there was another six month
> > silence. I'd say that lacks interest, and I don't feel like it's
> > currently reviewable in CF.
>
> I don't think the entry needs more review - it needs changes:
> https://www.postgresql.org/message-id/CAOKkKFtc45uNFoWYOCo4St19ayxrh-_%2B4TnZtwxGZz6-3k_GSA%40mail.gmail.com
> That contains quite a few things that should be changed.
>
> A patch that has gotten feedback, but that feedback hasn't been processed
> pretty much is the definition of RwF, no?

Looking through again, I see now what you're saying. Yes, I agree that
RwF would have been a fine fit there.

> I agree that "lacks interest" could be useful. But I'm wary of it becoming
> just a renaming if we end up marking patches that should be RwF or rejected as
> "lacks interest".

Agreed. This probably bleeds over into the other documentation thread
a bit -- how do we want to communicate the subtle points to people in
a CF?

--Jacob




Re: [PATCH] CF app: add "Returned: Needs more interest"

2022-08-08 Thread Andres Freund
Hi,

On 2022-08-08 08:37:41 -0700, Jacob Champion wrote:
> Agreed. This probably bleeds over into the other documentation thread
> a bit -- how do we want to communicate the subtle points to people in
> a CF?

We should write a docs patch for it and then reference it from a bunch of
places. I started down that road a few years back [1] but unfortunately lost
steam.

Regards,

Andres

[1] https://postgr.es/m/20180302224056.3fps7kc6hokqk3th%40alap3.anarazel.de




Re: Remaining case where reltuples can become distorted across multiple VACUUM operations

2022-08-08 Thread Peter Geoghegan
On Mon, Aug 8, 2022 at 8:33 AM Matthias van de Meent
 wrote:
> For example, if currently the measured 2% of the pages contains more
> than 100% of the previous count of tuples, or with your patch the last
> page contains more than 100% of the previous count of the tuples, that
> new count is ignored, which seems silly considering that the vacuum
> count is supposed to be authoritative.

The 2% thing is conditioned on the new relpages value precisely
matching the existing relpages from pg_class -- which makes it very
targeted. I don't see why scanned_tuples greatly exceeding the
existing reltuples from pg_class is interesting (any more interesting
than the other way around).

We'll always accept scanned_tuples as authoritative when VACUUM
actually scans all pages, no matter what. Currently it isn't possible
for VACUUM to skip pages in a table that is 32 pages or less in size.
So even the new "single page" thing from the patch cannot matter
there.


--
Peter Geoghegan




2022-08-11 release announcement draft

2022-08-08 Thread Jonathan S. Katz

Hi,

Please see attached draft of the 2022-08-11 release announcement.

Please provide feedback on {technical accuracy, omissions, any other 
errors} no later than 2022-08-11 0:00 AoE[1].


Thanks,

Jonathan

[1] https://en.wikipedia.org/wiki/Anywhere_on_Earth
The PostgreSQL Global Development Group has released an update to all supported
versions of PostgreSQL, including 14.5, 13.8, 12.12, 11.17, and 10.22, as well
as the third beta release of PostgreSQL 15. This release fixes over 40 bugs
reported over the last three months.

For the full list of changes, please review the
[release notes](https://www.postgresql.org/docs/release/).

PostgreSQL 10 EOL Upcoming
--------------------------

PostgreSQL 10 will stop receiving fixes on November 10, 2022. If you are
running PostgreSQL 10 in a production environment, we strongly advise that you
make plans to upgrade to a newer, supported version of PostgreSQL so you can
continue to receive bug and security fixes. Please see our
[versioning policy](https://www.postgresql.org/support/versioning/) for more
information.

A Note on the PostgreSQL 15 Beta
--------------------------------

This release marks the third beta release of PostgreSQL 15 and puts the
community one step closer to general availability tentatively around the end of
the third quarter.

In the spirit of the open source PostgreSQL community, we strongly encourage you
to test the new features of PostgreSQL 15 on your systems to help us eliminate
bugs or other issues that may exist. While we do not advise you to run
PostgreSQL 15 Beta 3 in production environments, we encourage you to find ways
to run your typical application workloads against this beta release.

Your testing and feedback will help the community ensure that PostgreSQL 15
upholds our standards of delivering a stable, reliable release of the world's
most advanced open source relational database. Please read more about our
[beta testing process](https://www.postgresql.org/developer/beta/) and how you
can contribute:

  
[https://www.postgresql.org/developer/beta/](https://www.postgresql.org/developer/beta/)

You can find information about all of the PostgreSQL 15 features and changes in
the [release notes](https://www.postgresql.org/docs/15/release-15.html):

  
[https://www.postgresql.org/docs/15/release-15.html](https://www.postgresql.org/docs/15/release-15.html)

Bug Fixes and Improvements
--

This update fixes over 40 bugs that were reported in the last several months.
The issues listed below affect PostgreSQL 14. Some of these issues may also
affect other supported versions of PostgreSQL.

Included in this release:

* Fix replay of [`CREATE 
DATABASE`](https://www.postgresql.org/docs/current/sql-createdatabase.html)
write-ahead log (WAL) records on standby servers when encountering a missing
[tablespace](https://www.postgresql.org/docs/current/manage-ag-tablespaces.html)
directory.
* Add support for 
[tablespaces](https://www.postgresql.org/docs/current/manage-ag-tablespaces.html)
that are plain directories instead of symbolic links to other directories.
* Fix permission checks in [`CREATE 
INDEX`](https://www.postgresql.org/docs/current/sql-createindex.html)
to use the user's permissions. This fixes broken dump/restore scenarios that
relied on the behavior prior to the fix for
[CVE-2022-1552](https://www.postgresql.org/support/security/CVE-2022-1552/).
* In the extended query protocol, force an immediate commit after
[`CREATE 
DATABASE`](https://www.postgresql.org/docs/current/sql-createdatabase.html)
and other commands that can't run in a transaction block.
* Fix a race condition around checking transaction visibility that was more
likely to happen when using synchronous replication.
* Fix incorrect permission-checking code for
[extended 
statistics](https://www.postgresql.org/docs/current/planner-stats.html#PLANNER-STATS-EXTENDED).
* Fix extended statistics machinery to handle
[most common 
value](https://www.postgresql.org/docs/current/planner-stats.html#id-1.5.13.5.4.13)
([MCV](https://www.postgresql.org/docs/current/planner-stats.html#id-1.5.13.5.4.13))-type
statistics on boolean-valued expressions.
* Avoid planner core dump with constant
[`= 
ANY(array)`](https://www.postgresql.org/docs/current/functions-subquery.html#FUNCTIONS-SUBQUERY-ANY-SOME)
clauses when there are
[MCV-type extended 
statistics](https://www.postgresql.org/docs/current/planner-stats.html#id-1.5.13.5.4.13)
on the `array` variable.
* Allow cancellation of 
[`ANALYZE`](https://www.postgresql.org/docs/current/sql-analyze.html)
while it is computing extended statistics.
* Fix [`ALTER TABLE ... ENABLE/DISABLE 
TRIGGER`](https://www.postgresql.org/docs/current/sql-altertable.html)
to handle recursion for triggers on partitioned tables.
* Reject `ROW()` expressions and functions in `FROM` that have more than 1600
columns.
* Fix memory leak in logical replication subscribers.
* Fix checks in logical replication of replica identity when the t

Re: [RFC] building postgres with meson - v10

2022-08-08 Thread Andres Freund
Hi,

On 2022-07-21 15:26:05 +0300, Bilal Yavuz wrote:
> > On 2022-07-06 11:03:31 +0200, Peter Eisentraut wrote:
> > > dff7b5a960 meson: prereq: regress: allow to specify directory containing
> > > expected files.
> > >
> > > This could use a bit more explanation, but it doesn't look
> > > controversial so far.
> 
> While testing ECPG, C and exe files are generated by meson, so these files
> are in meson's build directory, but the expected files are in the source
> directory. However, there was no way to set different paths for the inputs
> (C and exe files) and the expected files' directory. So, I added
> `--expecteddir` to separately set the expected files' directory.

Attached is a version of this patch that also removes the copying of these
files from ecpg's makefile.

Bilal's version checked different directories for expected files, but I don't
think that's necessary. Bilal, do you remember why you added that?


I'm somewhat tempted to rename ecpg's pg_regress to pg_regress_ecpg as part of
this, given the .c file is named pg_regress_ecpg.c and that pg_regress is a
pre-existing binary.

Greetings,

Andres Freund
>From 9645e0a01305ce42fdacaae73bd0d021a5a1c47e Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 30 Jun 2022 18:28:17 -0700
Subject: [PATCH] meson: prereq: regress: allow to specify directory containing
 expected files

The ecpg tests have their input directory in the build directory as the tests
need to be built. Until now that required copying the expected/ directory to
the build directory in VPATH builds. To avoid needing to implement the same
for the meson build, add support for specifying the expected/ directory.

Remove the copying of ecpg's expected/ to the build dir in VPATH builds.

Author: Nazir Bilal Yavuz 
Author: Andres Freund 
Reviewed-By: Andres Freund 
Discussion: https://postgr.es/m/20220718202327.pspcqz5mwbi2y...@awork3.anarazel.de
---
 src/interfaces/ecpg/test/Makefile  | 28 --
 src/interfaces/ecpg/test/pg_regress_ecpg.c |  8 +++
 src/test/regress/pg_regress.c  |  7 ++
 src/test/regress/pg_regress.h  |  1 +
 4 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/src/interfaces/ecpg/test/Makefile b/src/interfaces/ecpg/test/Makefile
index 10f53c708c8..d7a7d1d1ca5 100644
--- a/src/interfaces/ecpg/test/Makefile
+++ b/src/interfaces/ecpg/test/Makefile
@@ -59,24 +59,16 @@ pg_regress_ecpg.o: pg_regress_ecpg.c $(top_builddir)/src/port/pg_config_paths.h
 $(top_builddir)/src/port/pg_config_paths.h: $(top_builddir)/src/Makefile.global
 	$(MAKE) -C $(top_builddir)/src/port pg_config_paths.h
 
-# When doing a VPATH build, copy over the .pgc, .stdout and .stderr
-# files so that the driver script can find them.  We have to use an
-# absolute path for the targets, because otherwise make will try to
-# locate the missing files using VPATH, and will find them in
-# $(srcdir), but the point here is that we want to copy them from
-# $(srcdir) to the build directory.
-
-ifdef VPATH
-remaining_files_src := $(wildcard $(srcdir)/*/*.pgc) $(wildcard $(srcdir)/expected/*.c) $(wildcard $(srcdir)/expected/*.stdout) $(wildcard $(srcdir)/expected/*.stderr)
-remaining_files_build := $(patsubst $(srcdir)/%, $(abs_builddir)/%, $(remaining_files_src))
-
-all: $(remaining_files_build)
-$(remaining_files_build): $(abs_builddir)/%: $(srcdir)/%
-	ln -s $< $@
-endif
-
-# Common options for tests. Also pick up anything passed in EXTRA_REGRESS_OPTS
-REGRESS_OPTS = --dbname=ecpg1_regression,ecpg2_regression --create-role=regress_ecpg_user1,regress_ecpg_user2 $(EXTRA_REGRESS_OPTS)
+# Common options for tests
+#
+# Need to specify expecteddir explicitly, as the inputdir is located in the
+# build directory, because the files need to be compiled. Other pg_regress
+# style tests have the expecteddir in the source directory.
+#
+# Also pick up anything passed in EXTRA_REGRESS_OPTS.
+REGRESS_OPTS =  --expecteddir=$(srcdir) \
+  --dbname=ecpg1_regression,ecpg2_regression --create-role=regress_ecpg_user1,regress_ecpg_user2 \
+  $(EXTRA_REGRESS_OPTS)
 
 check: all
 	$(with_temp_install) ./pg_regress $(REGRESS_OPTS) --temp-instance=./tmp_check $(TEMP_CONF) --bindir= $(pg_regress_locale_flags) $(THREAD) --schedule=$(srcdir)/ecpg_schedule sql/twophase
diff --git a/src/interfaces/ecpg/test/pg_regress_ecpg.c b/src/interfaces/ecpg/test/pg_regress_ecpg.c
index f920af4560c..84e45ceebe0 100644
--- a/src/interfaces/ecpg/test/pg_regress_ecpg.c
+++ b/src/interfaces/ecpg/test/pg_regress_ecpg.c
@@ -164,7 +164,7 @@ ecpg_start_test(const char *testname,
 	char	   *appnameenv;
 
 	snprintf(inprg, sizeof(inprg), "%s/%s", inputdir, testname);
-	snprintf(insource, sizeof(insource), "%s.c", testname);
+	snprintf(insource, sizeof(insource), "%s/%s.c", inputdir, testname);
 
 	/* make a version of the test name that has dashes in place of slashes */
 	initStringInfo(&testname_dash);
@@ -177,13 +177,13 @@ ecpg_start_test(const char *testname,
 
 	snprintf(expectfile_stdout, sizeof(expectf

Re: shared-memory based stats collector - v70

2022-08-08 Thread Greg Stark
I'm trying to wrap my head around the shared memory stats collector
infrastructure from
<20220406030008.2qxipjxo776dw...@alap3.anarazel.de> committed in
5891c7a8ed8f2d3d577e7eea34dacff12d7b6bbd.

I have one question about locking -- afaics there's nothing protecting
reading the shared memory stats. There is an lwlock protecting
concurrent updates of the shared memory stats, but that lock isn't
taken when we read the stats. Are we ok relying on atomic 64-bit reads
or is there something else going on that I'm missing?

In particular I'm looking at pgstat.c:847 in pgstat_fetch_entry()
which does this:

memcpy(stats_data,
   pgstat_get_entry_data(kind, entry_ref->shared_stats),
   kind_info->shared_data_len);

stats_data is the returned copy of the stats entry with all the
statistics in it. But it's copied from the shared memory location
directly using memcpy and there's no locking or change counter or
anything protecting this memcpy that I can see.
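
Just to illustrate the kind of protection I was expecting, here is a
seqlock-style changecount read, purely as a sketch -- the struct, field,
and function names below are invented and don't reflect the actual
shared-memory stats layout:

/*
 * Sketch only: a changecount (seqlock-style) read of a shared stats entry.
 * "HypotheticalSharedEntry" and its fields are invented names; the real
 * shared-memory stats structs need not look like this.
 */
#include "postgres.h"
#include "port/atomics.h"

typedef struct HypotheticalSharedEntry
{
    uint32      changecount;    /* writer bumps this before and after updates */
    int64       counters[8];    /* the statistics payload */
} HypotheticalSharedEntry;

static void
read_entry_consistently(volatile HypotheticalSharedEntry *shared, int64 *out)
{
    for (;;)
    {
        uint32      before = shared->changecount;

        pg_read_barrier();
        memcpy(out, (const void *) shared->counters, sizeof(shared->counters));
        pg_read_barrier();

        /* retry if a writer was active (odd) or completed an update meanwhile */
        if ((before & 1) == 0 && before == shared->changecount)
            break;
    }
}

Whether something like this is needed, whether the lwlock is meant to
cover readers too, or whether torn reads are deliberately tolerated is
exactly what I'm trying to understand.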

-- 
greg




Re: failing to build preproc.c on solaris with sun studio

2022-08-08 Thread Andres Freund
Hi,

On 2022-08-08 11:14:58 -0400, Andrew Dunstan wrote:
> I'm waiting for that first commit so I can start working on the
> buildfarm client changes. Ideally (from my POV) this would happen by
> early Sept when I will be leaving on a trip for some weeks, and this
> would be a good project to take with me. Is that possible?

Yes, I think that should be possible. I think what's required before then is
1) a minimal docs patch and 2) a discussion about where to store test results,
etc. It'll clearly not be finished, but we agreed that a project like this can
only be done incrementally after a certain stage...

I've been doing a lot of cleanup over the last few days, and I'll send a new
version soon and then kick off the discussion for 2).

Greetings,

Andres Freund




Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2022-08-08 Thread Önder Kalacı
Hi,

Thanks for the feedback, see my reply below.

>
> > It turns out that we already invalidate the relevant entries in
> LogicalRepRelMap/LogicalRepPartMap when "ANALYZE" (or VACUUM) updates any
> of the statistics in pg_class.
> >
> > The call-stack for analyze is roughly:
> > do_analyze_rel()
> >-> vac_update_relstats()
> >  -> heap_inplace_update()
> >  -> if needs to apply any statistical change
> >  -> CacheInvalidateHeapTuple()
> >

> Yeah, it appears that this will work but I see that we don't update
> here for inherited stats, how does it work for such cases?


There, the expansion of the relation list to partitions happens one level
above on the call stack. So, the call stack looks like the following:

autovacuum_do_vac_analyze() (or ExecVacuum)
  -> vacuum()
 -> expand_vacuum_rel()
  -> rel_list=parent+children partitions
  -> for rel in rel_list
  ->analyze_rel()
  ->do_analyze_rel
  ... (and the same call stack as above)

I also added one variation of a similar test for partitioned tables, which
I earlier added for non-partitioned tables as well:

Added a test which triggers this behavior. The test is as follows:
> - Create two indexes on the target, on column_a and column_b
> - Initially load data such that the column_a has a high cardinality
> - Show that we use the index on column_a on a *child* table
> - Load more data such that the column_b has higher cardinality
> - ANALYZE on the *parent* table
> - Show that we use the index on column_b afterwards on the *child* table


My answer for the above assumes that your question is regarding what
happens if you ANALYZE on a partitioned table. If your question is
something different, please let me know.


> >> BTW, do we want to consider partial indexes for the scan in this
> >> context? I mean it may not have data of all rows so how that would be
> >> usable?
> >>
> >
> > As far as I can see, check_index_predicates() never picks a partial
> index for the baserestrictinfos we create in
> CreateReplicaIdentityFullPaths(). The reason is that we have roughly the
> following call stack:
> >
> > -check_index_predicates
> >  --predicate_implied_by
> > ---predicate_implied_by_recurse
> > predicate_implied_by_simple_clause
> > -operator_predicate_proof
> >
> > And, inside operator_predicate_proof(), there is never going to be an
> equality. Because, we push `Param`s to the baserestrictinfos whereas the
> index predicates are always `Const`.
> >
>
> I agree that the way currently baserestrictinfos are formed by patch,
> it won't select the partial path, and chances are that that will be
> true in future as well but I think it is better to be explicit in this
> case to avoid creating a dependency between two code paths.
>
>
Yes, it makes sense. So, I changed the Assert into a function that filters out
partial indexes and expression-only indexes, so that we do not create such
dependencies between the planner and this code.

If one day the planner supports matching column values against indexes on
expressions, this code simply won't be able to use the optimization until we
make some improvements in this code path. I think that is a fair trade-off
for now.
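
For reference, a minimal sketch of such a filter could look like the
following (the function name is made up, and the actual patch may
structure this differently):

/*
 * Sketch only: reject indexes that the apply path could not use for
 * REPLICA IDENTITY FULL lookups.  The function name is hypothetical.
 */
#include "postgres.h"
#include "catalog/index.h"
#include "nodes/execnodes.h"
#include "utils/rel.h"

static bool
IsIndexUsableForReplicaIdentityFull(Relation idxrel)
{
    IndexInfo  *indexInfo = BuildIndexInfo(idxrel);

    /* a partial index may not cover every row on the target */
    if (indexInfo->ii_Predicate != NIL)
        return false;

    /* require at least one plain column among the key attributes */
    for (int i = 0; i < indexInfo->ii_NumIndexKeyAttrs; i++)
    {
        if (indexInfo->ii_IndexAttrNumbers[i] != 0)
            return true;
    }

    /* all key columns are expressions; no column values to match against */
    return false;
}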

Few other comments:
> ==
> 1. Why is it a good idea to choose the index selected even for the
> bitmap path (T_BitmapHeapScan or T_BitmapIndexScan)? We use index scan
> during update/delete, so not sure how we can conclude to use index for
> bitmap paths.
>

In our case, during update/delete we are searching for a single tuple on
the target. And, it seems like using an index is probably going to be
cheaper for finding the single tuple. In general, I thought we should use
an index if the planner ever decides to use it with the given restrictions.

Also, for the majority of the use-cases, I think we'd probably expect an
index on a column with high cardinality -- hence an index scan. So, bitmap
index scans are probably not going to be that common.

Still, I don't see a problem with using such indexes. Of course, it is
possible that I might be missing something. Do you have any specific
concerns in this area?


>
> 2. The index info is built even on insert, so workload, where there
> are no updates/deletes or those are not published then this index
> selection work will go waste. Will it be better to do it at first
> update/delete? One can say that it is not worth the hassle as anyway
> it will be built the first time we perform an operation on the
> relation or after the relation gets invalidated.


With the current approach, the index (re)calculation is coupled with the
(in)validation of the relevant cache entries. So, I'd argue that for the
simplicity of the code we can afford this small overhead. According to my
local measurements, especially for large tables, the index oid calculation
is mostly insignificant compared to the rest of the steps. Does that sound
OK to you?



> If we think so, then
> probabl

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-08-08 Thread Imseih (AWS), Sami
> The server seems to have started as a standby after a crash of the
> primary. Is that correct?

Yes, that is correct.

2022-08-05 17:18:51 UTC::@:[359]:LOG:  database system was interrupted; last 
known up at 2022-08-05 17:08:52 UTC
2022-08-05 17:18:51 UTC::@:[359]:LOG:  creating missing WAL directory 
"pg_wal/archive_status"
recovering 0002.history
2022-08-05 17:18:51 UTC::@:[359]:LOG:  entering standby mode
2022-08-05 17:18:51 UTC::@:[359]:LOG:  ### [S] @0/48A3400: abort=(0/0)0/0, 
miss=(0/0)0/0, SbyMode=0, SbyModeReq=1
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  checkpoint record is at 0/48A3388
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  redo record is at 0/48A3388; shutdown 
true
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  next transaction ID: 533; next OID: 
16395
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  next MultiXactId: 1; next 
MultiXactOffset: 0
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  oldest unfrozen transaction ID: 479, 
in database 1
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  oldest MultiXactId: 1, in database 1
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  commit timestamp Xid oldest/newest: 0/0
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  transaction ID wrap limit is 
2147484126, limited by database with OID 1
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  MultiXactId wrap limit is 2147483648, 
limited by database with OID 1
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  starting up replication slots
2022-08-05 17:18:51 UTC::@:[359]:LOG:  database system was not properly shut 
down; automatic recovery in progress
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  resetting unlogged relations: cleanup 
1 init 0
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  initializing for hot standby
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  recovery snapshots are now enabled
2022-08-05 17:18:51 UTC::@:[359]:LOG:  redo starts at 0/48A3400
2022-08-05 17:18:51 UTC::@:[359]:LOG:  ### [F] @0/1000: abort=(0/0)0/0, 
miss=(0/0)0/0, SbyMode=0, SbyModeReq=1
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  reached end of WAL in pg_wal, entering 
archive recovery
2022-08-05 17:18:51 UTC::@:[359]:LOG:  consistent recovery state reached at 
0/1000
2022-08-05 17:18:51 UTC::@:[357]:LOG:  database system is ready to accept read 
only connections


> Archive recovery ended here. The server should have promoted that
> time..  Do you see some interesting log lines around this time?

The server did promote around that time

2022-08-05 18:38:14 UTC::@:[359]:LOG:  ### [F] @6/B6CB27D0: abort=(0/0)0/0, 
miss=(0/0)0/0, SbyMode=1, SbyModeReq=1
2022-08-05 18:38:14 UTC::@:[359]:LOG:  received promote request
2022-08-05 18:38:14 UTC::@:[359]:LOG:  redo done at 6/B6CB27A8
2022-08-05 18:38:14 UTC::@:[359]:LOG:  last completed transaction was at log 
time 2022-08-05 18:38:00.832047+00
recovering 00010006002D


2022-08-05 18:38:14 UTC::@:[359]:LOG:  ### [S] @6/B6CB27D0: abort=(0/0)0/0, 
miss=(0/0)0/0, SbyMode=0, SbyModeReq=1
recovering 0002.history
2022-08-05 18:38:14 UTC::@:[359]:LOG:  selected new timeline ID: 2
2022-08-05 18:38:14 UTC::@:[359]:LOG:  archive recovery complete
recovering 0001.history
2022-08-05 18:38:14 UTC::@:[357]:LOG:  database system is ready to accept 
connections
2022-08-05 18:38:15 UTC::@:[367]:LOG:  restartpoint complete: wrote 21388 
buffers (2.1%); 0 WAL file(s) added, 9 removed, 0 recycled; write=98.394 s, 
sync=0.041 s, total=98.586 s; sync files=46, longest=0.012 s, average=0.001 s; 
distance=1048565 kB, estimate=1048584 kB
2022-08-05 18:38:15 UTC::@:[367]:LOG:  recovery restart point at 6/5C0003B0

>But, recovery continues in non-standby mode.  I don't see how come it
>behaves that way.

But the server crashes sometime after that, which is why recovery starts.

2022-08-05 18:50:13 UTC::@:[357]:LOG:  listening on IPv4 address "0.0.0.0", port 
5432
2022-08-05 18:50:13 UTC::@:[357]:LOG:  listening on IPv6 address "::", port 5432
2022-08-05 18:50:13 UTC::@:[357]:LOG:  listening on Unix socket 
"/tmp/.s.PGSQL.5432"
2022-08-05 18:50:13 UTC::@:[359]:LOG:  database system was interrupted; last 
known up at 2022-08-05 18:38:15 UTC
2022-08-05 18:50:13 UTC::@:[359]:LOG:  creating missing WAL directory 
"pg_wal/archive_status"
recovering 0003.history
2022-08-05 18:50:13 UTC::@:[359]:LOG:  entering standby mode
recovering 0002.history
2022-08-05 18:50:13 UTC::@:[359]:LOG:  ### [S] @6/B8000198: abort=(0/0)0/0, 
miss=(0/0)0/0, SbyMode=0, SbyModeReq=1
2022-08-05 18:50:13 UTC::@:[359]:LOG:  database system was not properly shut 
down; automatic recovery in progress
2022-08-05 18:50:13 UTC::@:[359]:LOG:  redo starts at 6/B8E8

And a few hours later, is when we see a panic

Thanks

--
Sami Imseih
Amazon Web Services (AWS)




Re: bug on log generation ?

2022-08-08 Thread Andres Freund
Hi,

On 2022-08-08 10:32:22 -0400, Tom Lane wrote:
> Andrew Dunstan  writes:
> > What platform is this on? Is it possible that on some platform the chunk
> > size we're using is not doing an atomic write?
> 
> Another idea is that some of the write() calls are failing --- elog.c
> doesn't check for that.

I was suspicious of those as well. It might be a good idea to at least write
such failures to stderr, otherwise it's just about impossible to debug. Not
that stderr will always point anywhere useful...

I can imagine that a system under heavy memory pressure might fail writing, if
there are a lot of writes in a row or such.

Greetings,

Andres Freund




Re: make update-po@master stops at pg_upgrade

2022-08-08 Thread Alvaro Herrera
On 2022-Jul-28, Tom Lane wrote:

> Thanks for testing it.  I think the only remaining concern is
> Peter's objection that $(wildcard) might pick up random junk files
> that end in ".c".  That's true, but the backend's nls.mk also
> picks up everything matching "*.c" (over the whole backend tree,
> not just one directory!),

Now that I did the translation chores again after a few years I am
reminded of a point about this argument:  in reality, few people ever
run this recipe manually (I know I never do), because it's easier to
grab the already-merged files from the NLS website.  It all happens
mechanically and there's nobody leaving random junk files.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"El que vive para el futuro es un iluso, y el que vive para el pasado,
un imbécil" (Luis Adler, "Los tripulantes de la noche")




Re: Remaining case where reltuples can become distorted across multiple VACUUM operations

2022-08-08 Thread Matthias van de Meent
On Mon, 8 Aug 2022 at 17:49, Peter Geoghegan  wrote:
>
> On Mon, Aug 8, 2022 at 8:33 AM Matthias van de Meent
>  wrote:
> > For example, if currently the measured 2% of the pages contains more
> > than 100% of the previous count of tuples, or with your patch the last
> > page contains more than 100% of the previous count of the tuples, that
> > new count is ignored, which seems silly considering that the vacuum
> > count is supposed to be authoritative.
>
> The 2% thing is conditioned on the new relpages value precisely
> matching the existing relpages from pg_class -- which makes it very
> targeted. I don't see why scanned_tuples greatly exceeding the
> existing reltuples from pg_class is interesting (any more interesting
> than the other way around).

Because if a subset of the pages of a relation contains more tuples
than your current total expected tuple count for the table, you should
update your expectations regardless of which blocks, or how many blocks,
you've scanned - the previously stored value is a strictly worse
estimate than your last measurement.

> We'll always accept scanned_tuples as authoritative when VACUUM
> actually scans all pages, no matter what. Currently it isn't possible
> for VACUUM to skip pages in a table that is 32 pages or less in size.
> So even the new "single page" thing from the patch cannot matter
> there.

A 33-block relation whose first 32 pages hold one tuple each can still
have a last page with 250 tuples, which would be ignored in that
scheme, leaving a total tuple count of 33 or so. Sure, this is an
artificial sample, but you can construct similarly wrong vacuum
samples: two classes of tuples with distinct update regimes, one
with 32B tuples and one with MaxFreeSpaceRequestSize-d tuples. When
you start running VACUUM against these separate classes of updated
blocks you'll see that the relation tuple count will also tend to
1*nblocks, due to the disjoint nature of these updates and the
tendency to ignore all updates to densely packed blocks.

With current code, we ignore the high counts of those often-updated
blocks and expect low density in the relation, precisely because we
ignore areas that are extremely dense and updated in VACUUM cycles
that are different from bloated blocks.

Kind regards,

Matthias van de Meent




Re: bug on log generation ?

2022-08-08 Thread Tom Lane
Andres Freund  writes:
> On 2022-08-08 10:32:22 -0400, Tom Lane wrote:
>> Another idea is that some of the write() calls are failing --- elog.c
>> doesn't check for that.

> I was suspicious of those as well. It might be a good idea to at least write
> such failures to stderr, otherwise it's just about impossible to debug. Not
> that stderr will always point anywhere useful...

Uh ... what we are talking about is a failure to write to stderr.
It's not likely that adding more output will help.

Having said that, looping on EAGAIN seems like a reasonably harmless
change.  Whether it will help here is really hard to say, though.
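
Something along these lines is what I have in mind -- a minimal sketch
only, with invented names, not the actual elog.c code:

#include <errno.h>
#include <unistd.h>

/*
 * Sketch only: write the whole buffer, retrying on EINTR/EAGAIN and
 * handling short writes.  On a hard error we give up silently, since
 * this is the code that writes to stderr in the first place.
 */
static void
write_all_or_give_up(int fd, const char *buf, size_t len)
{
    while (len > 0)
    {
        ssize_t     rc = write(fd, buf, len);

        if (rc < 0)
        {
            if (errno == EINTR || errno == EAGAIN)
                continue;       /* transient; try again */
            return;             /* hard failure; nothing useful to do */
        }
        buf += rc;
        len -= (size_t) rc;
    }
}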

regards, tom lane




Re: make update-po@master stops at pg_upgrade

2022-08-08 Thread Tom Lane
Alvaro Herrera  writes:
> On 2022-Jul-28, Tom Lane wrote:
>> Thanks for testing it.  I think the only remaining concern is
>> Peter's objection that $(wildcard) might pick up random junk files
>> that end in ".c".  That's true, but the backend's nls.mk also
>> picks up everything matching "*.c" (over the whole backend tree,
>> not just one directory!),

> Now that I did the translation chores again after a few years I am
> reminded of a point about this argument:  in reality, few people ever
> run this recipe manually (I know I never do), because it's easier to
> grab the already-merged files from the NLS website.  It all happens
> mechanically and there's nobody leaving random junnk files.

Hmm, so where does the NLS website get its data?

I'd be all for flushing the recipe altogether if no one uses it.
However, the existence of this thread suggests otherwise.

regards, tom lane




Re: bug on log generation ?

2022-08-08 Thread Andres Freund
Hi,

On 2022-08-08 12:19:22 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2022-08-08 10:32:22 -0400, Tom Lane wrote:
> >> Another idea is that some of the write() calls are failing --- elog.c
> >> doesn't check for that.
> 
> > I was suspicious of those as well. It might be a good idea to at least write
> > such failures to stderr, otherwise it's just about impossible to debug. Not
> > that stderr will always point anywhere useful...
> 
> Uh ... what we are talking about is a failure to write to stderr.
> It's not likely that adding more output will help.

I forgot that we don't preserve the original stderr in some other fd, likely
because the logger itself still has it open and can use write_stderr().

Greetings,

Andres Freund




Re: 2022-08-11 release announcement draft

2022-08-08 Thread Justin Pryzby
On Mon, Aug 08, 2022 at 11:50:16AM -0400, Jonathan S. Katz wrote:
> * Fix [`pg_upgrade`](https://www.postgresql.org/docs/current/pgupgrade.html) to
> detect non-upgradable usages of functions accepting `anyarray` parameters.

use or usage

> and regressions before the general availability of PostgreSQL 15. As
> this is a beta release , changes to database behaviors, feature details,

Extraneous space before comma

> APIs are still possible. Your feedback and testing will help determine the 
> final

and APIs

> tweaks on the new features, so please test in the near future. The quality of

remove "new" before features ?

-- 
Justin




Re: Remaining case where reltuples can become distorted across multiple VACUUM operations

2022-08-08 Thread Peter Geoghegan
On Mon, Aug 8, 2022 at 9:17 AM Matthias van de Meent
 wrote:
> Because if a subset of the pages of a relation contains more tuples
> than your current total expected tuples in the table, you should
> update your expectations regardless of which blocks or which number of
> blocks you've scanned - the previous stored value is a strictly worse
> estimation than your last measurement.

The previous stored value could be -1, which represents the idea that
we don't know the tuple density yet. So it doesn't necessarily follow
that the new estimate is strictly better, even in this exact scenario.

> A 33-block relation with first 32 1-tuple pages is still enough to
> have a last page with 250 tuples, which would be ignored in that
> scheme and have a total tuple count of 33 or so.

The simple fact is that there is only so much we can do with the
limited information/context that we have. Heuristics are not usually
free of all bias. Often the bias is the whole point -- the goal can be
to make sure that we have the bias that we know we can live with, and
not the opposite bias, which is much worse. The details are
usually very domain specific.

I presented my patch with a very simple test case -- a very clear
problem. Can you do the same for this scenario?

I accept that it is possible that we'll keep an old reltuples which is
provably less accurate than doing something with the latest
information from vacuumlazy.c. But the conditions under which this can
happen are *very* narrow. I am not inclined to do anything about it
for that reason.

-- 
Peter Geoghegan




Re: Asking for feedback on Pgperffarm

2022-08-08 Thread Mark Wong
Hi Yedil,

On Mon, Aug 08, 2022 at 02:50:17PM +0200, Yedil Serzhan wrote:
> Dear hackers,
> 
> I'm Yedil. I'm working on the project "Postgres Performance Farm" during
> Gsoc. Pgperffarm is a project like Postgres build farm but focuses on the
> performance of the database. Now it has 2 types of benchmarks, pgbench and
> tpc-h. The website is online here , and the repo
> is here .
> 
> I would like you to take a look at our website and, if possible, give some
> feedback on, for example, what other data should be collected or what other
> metrics could be used to compare performance.

Nice work!

We need to be careful with how results based on the TPC-H specification
are presented.  It needs to be changed, but maybe not dramatically.
Something like "Fair use derivation of TPC-H".  It needs to be clear
that it's not an official TPC-H result.

I think I've hinted at it in the #perffarm slack channel that I think
it would be better if you leveraged one of the already existing TPC-H
derived kits.  While I'm partial to dbt-3, because I'm trying to
maintain it and because it sounded like you were starting to do
something similar to it, I think you can save a good amount of effort
by not reimplementing another kit from scratch.

Regards,
Mark

--
Mark Wong
EDB https://enterprisedb.com





Re: Get the statistics based on the application name and IP address

2022-08-08 Thread Julien Rouhaud
Hi,

On Mon, Aug 08, 2022 at 08:21:06PM +0500, Ibrar Ahmed wrote:
> While working on pg_stat_statements, I got some questions from customers to
> have statistics by application and IP address.
> [...]
> name. I did some POC and had a patch. But before sharing the patch.
>
> I need to know if there has been any previous discussion about this topic;
> by the way,

I don't think there was any discussion on this exactly, but there have been
some related discussions.

This would likely bring two problems.  First, for now each entry contains its own
query text in the query file.  There can already be some duplication, which
isn't great, but adding the application_name and/or IP address will make things
way worse, so you would probably need to fix that first.  There has been some
discussion about it recently [1] but more work and benchmarking are needed.

The other problem is the multiplication of entries.  It's a well known
limitation that pg_stat_statements evictions are so costly that they make it
unusable.  The last numbers I saw about it were ~55% overhead [2].  Adding
application_name or IP address to the key would probably make
pg_stat_statements unusable for anyone who would actually need those metrics.

[1]: 
https://www.postgresql.org/message-id/flat/604E3199-2DD2-47DD-AC47-774A6F97DCA9%40amazon.com
[2]: https://twitter.com/AndresFreundTec/status/1105585237772263424




Re: make update-po@master stops at pg_upgrade

2022-08-08 Thread Alvaro Herrera
On 2022-Aug-08, Tom Lane wrote:

> Alvaro Herrera  writes:

> > Now that I did the translation chores again after a few years I am
> > reminded of a point about this argument:  in reality, few people ever
> > run this recipe manually (I know I never do), because it's easier to
> > grab the already-merged files from the NLS website.  It all happens
> > mechanically and there's nobody leaving random junk files.
> 
> Hmm, so where does the NLS website get its data?

Well, the NLS website does invoke the recipe.  Just not manually.

> I'd be all for flushing the recipe altogether if no one uses it.
> However, the existence of this thread suggests otherwise.

I just meant it's not normally run manually.  But if you do run it
manually, and you translate a file that has a few extra messages because
of the hypothetical junk source file, then you'll upload a catalog with
those extra messages; these extra messages will be dropped the next time
your file is merged through the NLS website.  Maybe you'll do some extra
work (translating useless messages) but there'll be no harm.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Patch to provide example for ssl certification authentication

2022-08-08 Thread Dave Cramer
I couldn't find a clear document showing how to set up SSL certificate
authentication. The example would help.


Dave Cramer


ssl_cert_auth.patch
Description: Binary data


Re: automatically generating node support functions

2022-08-08 Thread Tom Lane
Amit Kapila  writes:
> On Sun, Aug 7, 2022 at 8:19 PM Tom Lane  wrote:
>> Meh ... it's not checking the data files themselves.  Here's
>> a patch based on the logic for invoking genbki.  Completely
>> untested, would somebody try it?

> I tried it on commit a69959fab2 just before the commit (1349d2790b)
> which was causing problems for me. On running "perl mkvcbuild.pl", I
> got the below error:
> wrong number of input files, expected nodes/nodes.h nodes/primnodes.h
> nodes/parsenodes.h nodes/pathnodes.h nodes/plannodes.h
> nodes/execnodes.h access/amapi.h access/sdir.h access/tableam.h
> access/tsmapi.h commands/event_trigger.h commands/trigger.h
> executor/tuptable.h foreign/fdwapi.h nodes/extensible.h
> nodes/lockoptions.h nodes/replnodes.h nodes/supportnodes.h
> nodes/value.h utils/rel.h

Ah.  It'd help if that complaint said what the command input actually
is :-(.  But on looking closer, I missed stripping the empty strings
that "split" will produce at the ends of the array.  I think the
attached will do the trick, and I really do want to get rid of this
copy of the file list if possible.

regards, tom lane

diff --git a/src/backend/nodes/gen_node_support.pl b/src/backend/nodes/gen_node_support.pl
index 86cf1b39d0..b707a09f56 100644
--- a/src/backend/nodes/gen_node_support.pl
+++ b/src/backend/nodes/gen_node_support.pl
@@ -174,7 +174,7 @@ push @scalar_types, qw(QualCost);
 
 
 ## check that we have the expected number of files on the command line
-die "wrong number of input files, expected @all_input_files\n"
+die "wrong number of input files, expected:\n@all_input_files\ngot:\n@ARGV\n"
   if ($#ARGV != $#all_input_files);
 
 ## read input
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index 383b8a7c62..cc82668457 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -797,36 +797,30 @@ EOF
 		close($chs);
 	}
 
-	if (IsNewer(
-			'src/backend/nodes/node-support-stamp',
-			'src/backend/nodes/gen_node_support.pl'))
+	my $nmf = Project::read_file('src/backend/nodes/Makefile');
+	$nmf =~ s{\\\r?\n}{}g;
+	$nmf =~ /^node_headers\s*:?=(.*)$/gm
+	  || croak "Could not find node_headers in Makefile\n";
+	my @node_headers = split /\s+/, $1;
+	@node_headers = grep { $_ ne '' } @node_headers;
+	my @node_files = map { "src/include/$_" } @node_headers;
+
+	my $need_node_support = 0;
+	foreach my $nodefile (@node_files)
 	{
-		# XXX duplicates node_headers list in src/backend/nodes/Makefile
-		my @node_headers = qw(
-		  nodes/nodes.h
-		  nodes/primnodes.h
-		  nodes/parsenodes.h
-		  nodes/pathnodes.h
-		  nodes/plannodes.h
-		  nodes/execnodes.h
-		  access/amapi.h
-		  access/sdir.h
-		  access/tableam.h
-		  access/tsmapi.h
-		  commands/event_trigger.h
-		  commands/trigger.h
-		  executor/tuptable.h
-		  foreign/fdwapi.h
-		  nodes/extensible.h
-		  nodes/lockoptions.h
-		  nodes/replnodes.h
-		  nodes/supportnodes.h
-		  nodes/value.h
-		  utils/rel.h
-		);
-
-		my @node_files = map { "src/include/$_" } @node_headers;
+		if (IsNewer('src/backend/nodes/node-support-stamp', $nodefile))
+		{
+			$need_node_support = 1;
+			last;
+		}
+	}
+	$need_node_support = 1
+	  if IsNewer(
+		'src/backend/nodes/node-support-stamp',
+		'src/backend/nodes/gen_node_support.pl');
 
+	if ($need_node_support)
+	{
 		system("perl src/backend/nodes/gen_node_support.pl --outdir src/backend/nodes @node_files");
 		open(my $f, '>', 'src/backend/nodes/node-support-stamp')
 		  || confess "Could not touch node-support-stamp";


Re: Temporary file access API

2022-08-08 Thread Antonin Houska
Robert Haas  wrote:

> On Fri, Jul 29, 2022 at 11:36 AM Antonin Houska  wrote:
> > Attached is a new version. It allows the user to set "elevel" (i.e. ERROR is
> > not necessarily thrown on I/O failure, if the user prefers to check the 
> > number
> > of bytes read/written) and to specify the buffer size. Also, 0015 adds a
> > function to copy data from one file descriptor to another.
> 
> I basically agree with Peter Eisentraut's earlier comment: "I don't
> understand what this patch set is supposed to do." The intended
> advantages, as documented in buffile.c, are:
> 
> + * 1. Less code is needed to access the file.
> + *
> + * 2.. Buffering reduces the number of I/O system calls.
> + *
> + * 3. The buffer size can be controlled by the user. (The larger the buffer,
> + * the fewer I/O system calls are needed, but the more data needs to be
> + * written to the buffer before the user recognizes that the file access
> + * failed.)
> + *
> + * 4. It should make features like Transparent Data Encryption less invasive.
> 
> It does look to me like there is a modest reduction in the total
> amount of code from these patches, so I do think they might be
> achieving (1).
> 
> I'm not so sure about (2), though. A number of these places are cases
> where we only do a single write to the file - like in patches 0010,
> 0011, 0012, and 0014, for example. And even in 0013, where we do
> multiple I/Os, it makes no sense to introduce a second buffering
> layer. We're reading from a file into a buffer several kilobytes in
> size and then shipping the data out. The places where you want
> buffering are places where we might be doing system calls for a few
> bytes of I/O at a time, not places where we're already doing
> relatively large I/Os. An extra layer of buffering probably makes
> things slower, not faster.

The buffering seemed to me like a good opportunity to plug in the encryption,
because the data needs to be encrypted in memory before it's written to
disk. Of course it'd be better to use the existing buffering for that than to
add another layer.
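
To show what I mean, here is a minimal sketch under invented names
(MyBufFile, my_buffer_flush() and my_encrypt_block() are not the API
proposed in the patch set); the encryption simply hooks into the flush of
the in-memory buffer:

#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

#define MY_BUFFER_SIZE 8192

/* hypothetical primitive: ciphertext has the same length as the plaintext */
extern void my_encrypt_block(const char *plain, char *cipher, size_t len);

typedef struct MyBufFile
{
    int         fd;                     /* underlying file descriptor */
    char        buf[MY_BUFFER_SIZE];    /* plaintext staging buffer */
    size_t      used;                   /* bytes currently buffered */
} MyBufFile;

/* encrypt the buffered plaintext in memory, then write it out in one call */
static bool
my_buffer_flush(MyBufFile *file)
{
    char        cipher[MY_BUFFER_SIZE];

    if (file->used == 0)
        return true;

    my_encrypt_block(file->buf, cipher, file->used);

    if (write(file->fd, cipher, file->used) != (ssize_t) file->used)
        return false;

    file->used = 0;
    return true;
}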

> I don't think that (3) is a separate advantage from (1) and (2), so I
> don't have anything separate to say about it.

I thought that the uncontrollable buffer size was one of the things you
complained about in

https://www.postgresql.org/message-id/CA+TgmoYGjN_f=fcerx49bzjhng+gocty+a+xhnrwcvvdy8u...@mail.gmail.com

> I find it really unclear whether, or how, it achieves (4). In general,
> I don't think that the large number of raw calls to read() and write()
> spread across the backend is a particularly good thing. TDE is an
> example of a feature that might affect every byte we read or write,
> but you can also imagine instrumentation and tracing facilities that
> might want to be able to do something every time we do an I/O without
> having to modify a zillion separate call sites. Such features can
> certainly benefit from consolidation, but I suspect it isn't helpful
> to treat every I/O in exactly the same way. For instance, let's say we
> settle on a design where everything in the cluster is encrypted but,
> just to make things simple for the sake of discussion, let's say there
> are no stored nonces anywhere. Can we just route every I/O in the
> backend through a wrapper layer that does encryption and decryption?

> Probably not. I imagine postgresql.conf and pg_hba.conf aren't going
> to be encrypted, for example, because they're intended to be edited by
> users. And Bruce has previously proposed designs where the WAL would
> be encrypted with a different key than the data files. Another idea,
> which I kind of like, is to encrypt WAL on a record level rather than
> a block level. Either way, you can't just encrypt every single byte
> the same way. There are probably other relevant distinctions: some
> data is temporary and need not survive a server restart or be shared
> with other backends (except maybe in the case of parallel query) and
> some data is permanent. Some data is already block-structured using
> blocks of size BLCKSZ; some data is block-structured using some other
> block size; and some data is not block-structured. Some data is in
> standard-format pages that are part of the buffer pool and some data
> isn't. I feel like some of those distinctions probably matter to TDE,
> and maybe to other things that we might want to do.
> 
> For instance, to go back to my earlier comments, if the data is
> already block-structured, we do not need to insert a second layer of
> buffering; if it isn't, maybe we should, not just for TDE but for
> other reasons as well. If the data is temporary, TDE might want to
> encrypt it using a temporary key which is generated on the fly and
> only stored in server memory, but if it's data that needs to be
> readable after a server restart, we're going to need to use a key that
> we'll still have access to in the future. Or maybe that particular
> thing isn't relevant, but I just can't believe that every single I/O
> needs ex

Re: bug on log generation ?

2022-08-08 Thread Marcos Pegoraro
>
> OK, we really need a repeatable test if possible. Perhaps a pgbench run
> with lots of concurrent runs of a some very long query would do the trick.
>

OK, I can do it, but ... strangely that error usually occurs at random
times, sometimes at 08:00, sometimes at 19:00, and it's busier between
10:00 and 16:00. If I cron some of those queries to run every second, is
that enough? What exactly do you expect to see in the log files?


Re: 2022-08-11 release announcement draft

2022-08-08 Thread Jonathan S. Katz

On 8/8/22 12:44 PM, Justin Pryzby wrote:
> On Mon, Aug 08, 2022 at 11:50:16AM -0400, Jonathan S. Katz wrote:
>> * Fix [`pg_upgrade`](https://www.postgresql.org/docs/current/pgupgrade.html) to
>> detect non-upgradable usages of functions accepting `anyarray` parameters.
>
> use or usage

This line comes directly from the release notes and appears to be
correct as is.

>> and regressions before the general availability of PostgreSQL 15. As
>> this is a beta release , changes to database behaviors, feature details,
>
> Extraneous space before comma

Fixed.

>> APIs are still possible. Your feedback and testing will help determine the final
>
> and APIs

Fixed.

>> tweaks on the new features, so please test in the near future. The quality of
>
> remove "new" before features ?

I believe this is personal preference. I would like to leave in "new" as
we are releasing new features in PostgreSQL 15.


Thanks,

Jonathan


OpenPGP_signature
Description: OpenPGP digital signature


Managing my own index partitions

2022-08-08 Thread Chris Cleveland
I'm building a Postgres index access method. For a variety of reasons it's
more efficient to store the index data in multiple physical files on disk
rather than in the index's main fork.

I'm trying to create separate rels that can be created and destroyed by the
parent index access method. I've succeeded in creating them with
heap.c/heap_create() and also with heap_create_with_catalog(). Where I'm
struggling is in getting the rels to be dropped when the parent index is
dropped.

When I use heap_create(), followed by recordDependencyOn(parent_oid,
child_oid), the child doesn't get dropped when the parent index is dropped.

When I use heap_create_with_catalog(), followed by
recordDependencyOn(parent_oid, child_oid), I get a "cache lookup failed for
index " triggered from within CommandCounterIncrement(), and
no amount of spelunking in the code turns up which cache entry is the
problem.

Backing up just a minute, am I going about this the right way?

Should I be using heap_create() at all?

Is there some other way to do this?


***
Here is my code:

Oid namespaceId = get_namespace_oid("relevantdb", false);
Oid tablespaceId = parent_index_rel->rd_rel->reltablespace;
Oid new_seg_oid = get_new_filenode(tablespaceId);
char *new_seg_name = make_seg_name(parent_index_rel, new_seg_oid);
Oid new_seg_filenode = InvalidOid; // heap_create() will create it

// this is required, unfortunately. It goes in the relcache.
// it creates a totally empty tupdesc. The natts=1 arg is to avoid an error.
TupleDesc segTupleDesc = CreateTemplateTupleDesc(1);
TupleDescInitEntry(segTupleDesc, (AttrNumber) 1,
   "dummy",
   INT4OID,
   -1, 0);
bool shared_relation = parent_index_rel->rd_rel->relisshared;
bool mapped_relation = RelationIsMapped(parent_index_rel);
bool allow_system_table_mods = false;

// not used
TransactionId relfrozenxid;
MultiXactId relminmxid;

Relation seg_rel = heap_create(
new_seg_name,
namespaceId,
tablespaceId,
new_seg_oid,
new_seg_filenode,
0, // access method oid
segTupleDesc,
RELKIND_INDEX,  // or RELKIND_RELATION or RELKIND_PARTITIONED_INDEX?
RELPERSISTENCE,
shared_relation,
mapped_relation,
allow_system_table_mods,
&relfrozenxid,
&relminmxid
);

Assert(relfrozenxid == InvalidTransactionId);
Assert(relminmxid == InvalidMultiXactId);
Assert(new_seg_oid == RelationGetRelid(seg_rel));

// make changes visible
CommandCounterIncrement();

// record dependency so seg gets dropped when index dropped
Oid parent_oid = parent_index_rel->rd_id;
record_dependency(parent_oid, new_seg_oid);

table_close(seg_rel, NoLock);/* do not unlock till end of xact */

...

void record_dependency(Oid parent_oid, Oid child_oid) {
ObjectAddress baseobject;
ObjectAddress segobject;
baseobject.classId = IndexRelationId;
baseobject.objectId = parent_oid;
baseobject.objectSubId = 0;
segobject.classId = IndexRelationId;
segobject.objectId = child_oid;
segobject.objectSubId = 0;
recordDependencyOn(&segobject, &baseobject, DEPENDENCY_INTERNAL);
}


The code where I use heap_create_with_catalog() is substantially the same.


Re: automatically generating node support functions

2022-08-08 Thread Tom Lane
I wrote:
> Ah.  It'd help if that complaint said what the command input actually
> is :-(.  But on looking closer, I missed stripping the empty strings
> that "split" will produce at the ends of the array.  I think the
> attached will do the trick, and I really do want to get rid of this
> copy of the file list if possible.

I tried this version on the cfbot, and it seems happy, so pushed.

regards, tom lane




Managing my own index partitions

2022-08-08 Thread Chris Cleveland
(resending because I sent the original from the wrong email address...)

I'm building a Postgres index access method. For a variety of reasons it's
more efficient to store the index data in multiple physical files on disk
rather than in the index's main fork.

I'm trying to create separate rels that can be created and destroyed by the
parent index access method. I've succeeded in creating them with
heap.c/heap_create() and also with heap_create_with_catalog(). Where I'm
struggling is in getting the rels to be dropped when the parent index is
dropped.

When I use heap_create(), followed by recordDependencyOn(parent_oid,
child_oid), the child doesn't get dropped when the parent index is dropped.

When I use heap_create_with_catalog(), followed by
recordDependencyOn(parent_oid, child_oid), I get a "cache lookup failed for
index " triggered from within CommandCounterIncrement(), and
no amount of spelunking in the code turns up which cache entry is the
problem.

Backing up just a minute, am I going about this the right way?

Should I be using heap_create() at all?

Is there some other way to do this?


***
Here is my code:

Oid namespaceId = get_namespace_oid("relevantdb", false);
Oid tablespaceId = parent_index_rel->rd_rel->reltablespace;
Oid new_seg_oid = get_new_filenode(tablespaceId);
char *new_seg_name = make_seg_name(parent_index_rel, new_seg_oid);
Oid new_seg_filenode = InvalidOid; // heap_create() will create it

// this is required, unfortunately. It goes in the relcache.
// it creates a totally empty tupdesc. The natts=1 arg is to avoid an error.
TupleDesc segTupleDesc = CreateTemplateTupleDesc(1);
TupleDescInitEntry(segTupleDesc, (AttrNumber) 1,
   "dummy",
   INT4OID,
   -1, 0);
bool shared_relation = parent_index_rel->rd_rel->relisshared;
bool mapped_relation = RelationIsMapped(parent_index_rel);
bool allow_system_table_mods = false;

// not used
TransactionId relfrozenxid;
MultiXactId relminmxid;

Relation seg_rel = heap_create(
new_seg_name,
namespaceId,
tablespaceId,
new_seg_oid,
new_seg_filenode,
0, // access method oid
segTupleDesc,
RELKIND_INDEX,  // or RELKIND_RELATION or RELKIND_PARTITIONED_INDEX?
RELPERSISTENCE,
shared_relation,
mapped_relation,
allow_system_table_mods,
&relfrozenxid,
&relminmxid
);

Assert(relfrozenxid == InvalidTransactionId);
Assert(relminmxid == InvalidMultiXactId);
Assert(new_seg_oid == RelationGetRelid(seg_rel));

// make changes visible
CommandCounterIncrement();

// record dependency so seg gets dropped when index dropped
Oid parent_oid = parent_index_rel->rd_id;
record_dependency(parent_oid, new_seg_oid);

table_close(seg_rel, NoLock);/* do not unlock till end of xact */

...

void record_dependency(Oid parent_oid, Oid child_oid) {
ObjectAddress baseobject;
ObjectAddress segobject;
baseobject.classId = IndexRelationId;
baseobject.objectId = parent_oid;
baseobject.objectSubId = 0;
segobject.classId = IndexRelationId;
segobject.objectId = child_oid;
segobject.objectSubId = 0;
recordDependencyOn(&segobject, &baseobject, DEPENDENCY_INTERNAL);
}


The code where I use heap_create_with_catalog() is substantially the same.




-- 
Chris Cleveland
312-339-2677 mobile


Re: Patch to address creation of PgStat* contexts with null parent context

2022-08-08 Thread Andres Freund
Hi,

On 2022-08-08 15:12:08 +0900, Kyotaro Horiguchi wrote:
> At Sat, 6 Aug 2022 19:19:39 -0700, Andres Freund  wrote in
> > Hi,
> >
> > On 2022-08-05 17:22:38 +0900, Kyotaro Horiguchi wrote:
> > > I think it a bit different.  Previously that memory (but for a bit
> > > different use, precisely) was required only when stats data is read so
> > > almost all server processes didn't need it.  Now, every server process
> > > that uses pgstats requires the two memory if it is going to write
> > > stats.  Even if that didn't happen until process termination, that
> > > memory eventually required to flush possibly remaining data.  That
> > > final write might be avoidable but I'm not sure it's worth the
> > > trouble.  As the result, calling pgstat_initialize() is effectively
> > > the declaration that the process requires the memory.
> >
> > I don't think every process will end up calling pgstat_setup_memcxt() -
> > e.g. walsender, bgwriter, checkpointer probably don't? What do we gain by
> > creating the contexts eagerly?
>
> Yes, they actually do, in the shmem_shutdown hook function, during the
> at-termination stats write.  I didn't consider making that not happen
> just to save 2kB of memory on such a small number of processes.

That's true for checkpointer, but not e.g. for walwriter, bgwriter. I don't
see why we should force allocate memory that we're never going to use in
background processes.


> And we don't hastle to avoid maybe-empty at-process-termination
> writes..

Hm?

Greetings,

Andres Freund




Re: conchuela doesn't like gnu_printf anymore

2022-08-08 Thread Mikael Kjellström




On 2022-08-06 18:59, Tom Lane wrote:


This is seemingly an intentional configuration change, because the
animal is reporting different config_env than before.  However,
we decide what to set PG_PRINTF_ATTRIBUTE to based on what CC
likes, and if CXX doesn't like it then you'll get these warnings.
(The warnings only appear in C++ compiles, else there'd REALLY
be a lot of them.)


Yes, when I upgraded to the latest DragonFly BSD 6.2.2 I was meaning to
switch to CLANG14 for both C and C++.  I guess I fat-fingered the
configuration somehow.


I have switched to CLANG14 for both C and C++ now.

Let's see if the warnings disappear now.

/Mikael




Support logical replication of global object commands

2022-08-08 Thread Zheng Li
Hello,

Support for logical replication of DDL commands is being worked on in [1].
However, global object commands are quite different from other
non-global object DDL commands and need to be handled differently. For
example, global object commands include ROLE statements, DATABASE
statements, TABLESPACE statements and a subset of GRANT/REVOKE
statements if the object being modified is a global object. These
commands are different from other DDL commands in that:

1. Global object commands can be executed in any database.
2. Global objects are not schema qualified.
3. Global object commands are not captured by event triggers.

I’ve put together a prototype to support logical replication of global
object commands in the attached patch. This patch builds on the DDL
replication patch from ZJ in [2] and must be applied on top of it.
Here is a list of the global object commands that the patch replicates; you
can find more details in the function LogGlobalObjectCommand:

/* ROLE statements */
CreateRoleStmt
AlterRoleStmt
AlterRoleSetStmt
DropRoleStmt
ReassignOwnedStmt
GrantRoleStmt

/* Database statements */
CreatedbStmt
AlterDatabaseStmt
AlterDatabaseRefreshCollStmt
AlterDatabaseSetStmt
DropdbStmt

/* TableSpace statements */
CreateTableSpaceStmt
DropTableSpaceStmt
AlterTableSpaceOptionsStmt

/* GrantStmt and RevokeStmt if objtype is a global object determined
by EventTriggerSupportsObjectType() */
GrantStmt
RevokeStmt

The idea with this patch is to support replication of global object
commands by WAL-logging the command using the same function used for DDL
logging - LogLogicalDDLMessage - towards the end of
standard_ProcessUtility. Because global objects are not schema
qualified, we can skip the deparser invocation and directly log the
original command string for replay on the subscriber.
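
To illustrate, a simplified sketch of the classification involved (this is
not the patch code; the actual logging call and its signature come from the
DDL replication patch, so the helper below only shows the idea):

#include "postgres.h"

#include "commands/event_trigger.h"
#include "nodes/parsenodes.h"

/*
 * Sketch only: return true if a utility statement operates on a global
 * object and should therefore be WAL-logged for replication.
 */
static bool
IsGlobalObjectCommand(Node *parsetree)
{
    switch (nodeTag(parsetree))
    {
            /* ROLE statements */
        case T_CreateRoleStmt:
        case T_AlterRoleStmt:
        case T_AlterRoleSetStmt:
        case T_DropRoleStmt:
        case T_ReassignOwnedStmt:
        case T_GrantRoleStmt:
            /* DATABASE statements */
        case T_CreatedbStmt:
        case T_AlterDatabaseStmt:
        case T_AlterDatabaseRefreshCollStmt:
        case T_AlterDatabaseSetStmt:
        case T_DropdbStmt:
            /* TABLESPACE statements */
        case T_CreateTableSpaceStmt:
        case T_DropTableSpaceStmt:
        case T_AlterTableSpaceOptionsStmt:
            return true;

            /* GRANT/REVOKE, but only when the object type is global */
        case T_GrantStmt:
            {
                GrantStmt  *stmt = (GrantStmt *) parsetree;

                return !EventTriggerSupportsObjectType(stmt->objtype);
            }

        default:
            return false;
    }
}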

A key problem to address is that global objects can become
inconsistent between the publisher and the subscriber if a command
modifying the global object gets executed in a database (on the source
side) that doesn't replicate the global object commands. I think we
can work on the following two aspects in order to avoid such
inconsistency:

1. Introduce a publication option for replication of global object
commands, and document that logical replication of global object
commands should preferably be enabled on all databases. Otherwise
inconsistency can happen if a command modifies the global object in a
database that doesn't replicate global object commands.

For example, we could introduce the following publication option
publish_global_object_command :
CREATE PUBLICATION mypub
FOR ALL TABLES
WITH (publish = 'insert, delete, update', publish_global_object_command = true);

We may consider other fine-tuned global command options such as
“publish_role_statements”, “publish_database_statements”,
“publish_tablespace_statements” and "publish_grant_statements", i.e.
you pick which global commands you want replicated. For example, you
can do this if you need a permission or tablespace to be set up
differently on the target cluster. In addition, we may need to adjust
the syntax once the DDL replication syntax is finalized.

2. Introduce the following database cluster level logical replication
commands to avoid such inconsistency; this is especially handy when
there is a large number of databases to configure for logical
replication.

CREATE PUBLICATION GROUP mypub_
FOR ALL DATABASES
WITH (publish = 'insert, delete, update', publish_global_object_command = true);

CREATE SUBSCRIPTION GROUP mysub_
CONNECTION 'dbnames = "path to file" host=hostname user=username port=5432'
PUBLICATION GROUP mypub_;

Under the hood, the CREATE PUBLICATION GROUP command generates one
CREATE PUBLICATION mypub_n sub-command for each database in the
cluster, where n is a monotonically increasing integer starting at 1. The
command outputs the (dbname, publication name) pairs which can be
saved in a file and then used on the subscription side.

Similarly, the CREATE SUBSCRIPTION GROUP command will generate one
CREATE SUBSCRIPTION mysub_n sub-command for each database in the
dbnames file. The dbnames file contains the (dbname, publication name)
pairs which come from the output of the CREATE PUBLICATION GROUP
command. Notice the connection string doesn’t have the dbname field,
During execution the connection string will be appended the dbname
retrieved from the dbnames file. By default the target DB name is the
same as the source DB name, optionally user can specify the source_db
to target_db mapping in the dbnames file.

In addition, we might want to create dependencies for the
publications/subscriptions created by the above commands in order to
guarantee group consistency. Also, we need to enforce that there is
only one group of publications/subscriptions for database-cluster-level
replication.

Logical replication of all commands across an entire cluster (instead
of on a per-database basis) is a separate topic. We can start another
thread after implementing a prototype.

Please let me know 

Re: Support logical replication of DDLs

2022-08-08 Thread Zheng Li
> In general, I agree with your comment below that we can work on this
> after we have some more concrete plans/discussions. I think we can
> probably consider this when we have more discussion around the
> publication commands for the DDL objects. However, it would be good if
> you can add some more details about the above two options.

Thanks for the support. I have started a new thread on supporting replication of
global object commands and have added a more detailed discussion:
https://www.postgresql.org/message-id/CAAD30UKD7YPEbYcs_L9PYLcLZjnxyqO%3DJF5_mnAwx7g_PtOi3A%40mail.gmail.com

> I noticed that LogLogicalDDLMessage() uses MyDatabaseId and then
> decoding can take some action (filtering) based on that. Is it Okay to
> use that function for global objects, if so, you might want to add a
> comment for the same?

Could you elaborate on the concern? The dbid filtering happens in
logicalddlmsg_decode but I don't see why I can't use LogLogicalDDLMessage
to log global commands. Are there any global commands that are
non-transactional?
logicalddlmsg_decode:
if (message->dbId != ctx->slot->data.database ||
FilterByOrigin(ctx, origin_id))
return;

> As per my understanding, the overall work on this project includes the
> following sub-tasks:
> a. DDL Deparsing: This is required to support DDL replication of
> non-global objects. The work for this is in progress, this is based on
> prior work by Alvaro.
> b. DDL Replication: This includes replication of DDL commands based on
> event triggers and DDL deparsing. The work on this is also in
> progress.
> c. DDL Replication of global objects: It requires a different approach
> due to the reasons quoted above in your email. Zheng has started
> working on it.
> d. Initial Sync: I think there is a brief discussion about this in the
> thread but no concrete proposal yet. I think it is important to solve
> this part of the puzzle as well to have an overall design ready for
> this project. Do let me know if you, Sawada-San, or anybody else
> intends to work on it? I think that will avoid overlap of work.

Euler mentioned that he plans to work on the initial schema sync in
[1]. We can help with this effort as well.

Regards,
Zheng

[1] 
https://www.postgresql.org/message-id/45d0d97c-3322-4054-b94f-3c08774bbd90%40www.fastmail.com




Re: Checking pgwin32_is_junction() errors

2022-08-08 Thread Thomas Munro
On Mon, Aug 8, 2022 at 8:23 PM  wrote:
> initdb on my windows 10 system stopped working after the commit
> c5cb8f3b: "Provide lstat() for Windows."
> The error message is: creating directory C:/HOME/data ... initdb:
> error: could not create directory "C:/HOME": File exists
>
> "C:/HOME" is the junction point to the second volume on my hard drive -
> "\??\Volume{GUID}\" which name pgreadlink() erroneously strips here:
> https://github.com/postgres/postgres/blob/7e29a79a46d30dc236d097825ab849158929d977/src/port/dirmod.c#L357.
> So initdb could not stat the file with name "Volume{GUID}", tried to
> create it and failed.
> With the attached patch initdb works fine again.

-if (r > 4 && strncmp(buf, "\\??\\", 4) == 0)
+if (r > 4 && strncmp(buf, "\\??\\", 4) == 0 &&
+strncmp(buf, "\\??\\Volume", 10) != 0)
 {
 memmove(buf, buf + 4, strlen(buf + 4) + 1);
 r -= 4;

Hmm.  I suppose the problem must be in pg_mkdir_p().  Our symlink()
emulation usually adds this "\??\" prefix (making it an "NT object
path"?), because junction points only work if they are in that format.
Then our readlink() emulation removes it again, but in the case of
your \??\Volume{GUID} path, created by you, not our symlink()
emulation, removing "\??\" apparently makes it unopenable with
CreateFile() (I guess that's what fails?  What's the error?).  So your
patch just says: don't strip "\??\" if it's followed by "Volume".

I don't understand all the kinds of DOS, Windows and NT paths (let me
take a moment to say how much I love Unix), but here's a guess: could
it be that NT "\??\C:\foo" = DOS "C:\foo", but NT "\??\Volume..." =
DOS "\Volume..."?  In other words, if it hasn't got a drive letter,
maybe it still needs an initial "\" (or if not that, then *something*
special, because otherwise it looks like a relative path).  Would it
be better to say: if it doesn't begin with "\??\X:", where X could be
any letter, then don't modify it?

Maybe [1] has some clues.  It seems to give the info in a higher
density form than the Windows docs (at least to the uninitiated like
me wanting a quick overview with examples).  Hmm, I wonder if we could
get away from doing our own path mangling and use some of the proper
library calls mentioned on that page...

[1] 
https://googleprojectzero.blogspot.com/2016/02/the-definitive-guide-on-win32-to-nt.html




Re: logical replication restrictions

2022-08-08 Thread Euler Taveira
On Wed, Jul 13, 2022, at 2:34 PM, Melih Mutlu wrote:

[Sorry for the delay...]

> 22. src/test/subscription/t/032_apply_delay.pl
>> 
>> I received the following error when trying to run these 'subscription' tests:
>> 
>> t/032_apply_delay.pl ... No such class log_location at
>> t/032_apply_delay.pl line 49, near "my log_location"
>> syntax error at t/032_apply_delay.pl line 49, near "my log_location ="
> 
> I'm having these errors too. Seems like some declarations are missing.
Fixed in v5.

> 
>>> +  specified amount of time. If this value is specified without 
>>> units,
>>> +  it is taken as milliseconds. The default is zero, adding no 
>>> delay.
>>> + 
> I'm also having an issue when I give min_apply_delay parameter without units.
> I expect that if I set  min_apply_delay to 5000 (without any unit), it will 
> be interpreted as 5000 ms.
Good catch. I fixed it in v5.

> 
> Lastly, I have a question about this delay during tablesync. 
> It's stated here that apply delays are not for initial tablesync.
> 
>>>  
>>> +  The delay occurs only on WAL records for transaction begins and 
>>> after
>>> +  the initial table synchronization. It is possible that the
>>> +  replication delay between publisher and subscriber exceeds the 
>>> value
>>> +  of this parameter, in which case no delay is added. Note that the
>>> +  delay is calculated between the WAL time stamp as written on
>>> +  publisher and the current time on the subscriber. Delays in 
>>> logical
>>> +  decoding and in transfer the transaction may reduce the actual 
>>> wait
>>> +  time. If the system clocks on publisher and subscriber are not
>>> +  synchronized, this may lead to apply changes earlier than 
>>> expected.
>>> +  This is not a major issue because a typical setting of this 
>>> parameter
>>> +  are much larger than typical time deviations between servers.
>>> + 
> 
> There might be a case where tablesync workers are in SYNCWAIT state and 
> waiting for apply worker to tell them to CATCHUP. 
> And if apply worker is waiting in apply_delay function, tablesync workers 
> will be stuck at SYNCWAIT state and this might delay tablesync at least 
> "min_apply_delay" amount of time or more.
> Is it something we would want? What do you think?
Good catch. That's an oversight. It should wait for the initial table
synchronization before starting to apply the delay. The main reason is the
current logical replication worker design. It only closes the tablesync workers
after the catchup phase. As you noticed, we cannot impose the delay as soon as
the COPY finishes because it would take a long time to finish due to a possible
lack of workers.  Instead, let's wait for all tables to reach the READY state and
then apply the delay. I added an explanation for it.
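
For clarity, the delay check is conceptually like the sketch below. This is
only an illustration with assumed names (the real code is in worker.c in the
attached patch, and an existing wait event is reused here just for the
example):

#include "postgres.h"

#include "miscadmin.h"
#include "storage/latch.h"
#include "utils/timestamp.h"
#include "utils/wait_event.h"

/*
 * Sketch: wait until at least min_apply_delay milliseconds have elapsed
 * since the commit timestamp recorded on the publisher.
 */
static void
maybe_apply_delay(TimestampTz commit_ts, int min_apply_delay)
{
    while (min_apply_delay > 0)
    {
        long        diffms;

        /* remaining delay = configured delay - time already elapsed */
        diffms = min_apply_delay -
            TimestampDifferenceMilliseconds(commit_ts, GetCurrentTimestamp());
        if (diffms <= 0)
            break;

        /* sleep, but stay responsive to latch wakeups and shutdown */
        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         diffms,
                         WAIT_EVENT_RECOVERY_APPLY_DELAY);
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
    }
}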

I also modified the test a bit to use the new function
wait_for_subscription_sync introduced in the commit
0c20dd33db1607d6a85ffce24238c1e55e384b49.

I attached a v6.


--
Euler Taveira
EDB   https://www.enterprisedb.com/
From 465a6d5aa491855e2925da24785103cac0c520a2 Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Sat, 6 Nov 2021 11:31:10 -0300
Subject: [PATCH v6] Time-delayed logical replication subscriber

Similar to physical replication, a time-delayed copy of the data for
logical replication is useful for some scenarios (particularly to fix
errors that might cause data loss).

If the subscriber sets min_apply_delay parameter, the logical
replication worker will delay the transaction commit for min_apply_delay
milliseconds.

The delay is calculated between the WAL time stamp and the current time
on the subscriber.

The delay occurs only on WAL records for transaction begins. The main
reason is to avoid keeping a transaction open for a long time. Regular
and prepared transactions are covered. Streamed transactions are also
covered.

Author: Euler Taveira
Discussion: https://postgr.es/m/CAB-JLwYOYwL=xtyaxkih5ctm_vm8kjkh7aaitckvmch4rzr...@mail.gmail.com
---
 doc/src/sgml/catalogs.sgml |  10 ++
 doc/src/sgml/ref/alter_subscription.sgml   |   5 +-
 doc/src/sgml/ref/create_subscription.sgml  |  43 +-
 src/backend/catalog/pg_subscription.c  |   1 +
 src/backend/catalog/system_views.sql   |   7 +-
 src/backend/commands/subscriptioncmds.c|  48 +-
 src/backend/replication/logical/worker.c   | 100 +
 src/backend/utils/adt/timestamp.c  |  32 
 src/bin/pg_dump/pg_dump.c  |  17 ++-
 src/bin/pg_dump/pg_dump.h  |   1 +
 src/bin/psql/describe.c|   9 +-
 src/bin/psql/tab-complete.c|   4 +-
 src/include/catalog/pg_subscription.h  |   3 +
 src/include/datatype/timestamp.h   |   2 +
 src/include/utils/timestamp.h  |   2 +
 src/test/regress/expected/subscription.out | 165 -
 src/test/regress/sql/subscription.sql  |  20 +

dropping datumSort field

2022-08-08 Thread Zhihong Yu
Hi,
I was looking at ExecSort() w.r.t. datum sort.

I wonder if the datumSort field can be dropped.
Here is a patch illustrating the potential simplification.

Please take a look.

Thanks


drop-datum-sort-field.patch
Description: Binary data


Re: logical replication restrictions

2022-08-08 Thread Euler Taveira
On Wed, Aug 3, 2022, at 10:27 AM, Amit Kapila wrote:
> Your explanation makes sense to me. The other point to consider is
> that there can be cases where we may not apply operation for the
> transaction because of empty transactions (we don't yet skip empty
> xacts for prepared transactions). So, won't it be better to apply the
> delay just before we apply the first change for a transaction? Do we
> want to apply the delay during table sync as we sometimes do need to
> enter apply phase while doing table sync?
I thought about the empty transactions but decided to not complicate the code
mainly because skipping transactions is not a code path that will slow down
this feature. As explained in the documentation, there is no harm in delaying a
transaction for more than min_apply_delay; it cannot be applied earlier. Having said
that, I decided to do nothing. I'm also not sure if it deserves a comment or if
this email is a possible explanation for this decision.

Regarding the table sync that was mentioned by Melih, I sent a new version (v6)
that fixed this oversight. The current logical replication worker design makes
it difficult to apply the delay in the catchup phase; tablesync workers are not
closed as soon as the COPY finishes (which means possibly running out of
workers sooner). After all tablesync workers have reached READY state, the
apply delay is activated. The documentation was correct; the code wasn't.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: optimize lookups in snapshot [sub]xip arrays

2022-08-08 Thread Nathan Bossart
On Mon, Aug 08, 2022 at 01:46:48PM +0700, John Naylor wrote:
> Okay, I think it's basically in good shape. Since it should be a bit
> faster than a couple versions ago, would you be up for retesting with
> the original test having 8 to 512 writers?

Sure thing.  The results are similar.  As before, the improvements are most
visible when the arrays are large.

writers  head  patch
8    672   680
16   639   664
32   701   689
64   705   703
128  628   653
256  576   627
512  530   584
768  450   536
1024 350   494

> And also add the const
> markers we discussed upthread?

Oops, sorry about that.  This is done in v9.

> Aside from that, I plan to commit this
> week unless there is further bikeshedding.

Great, thanks.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 816c56bc8779a1e8ab85db8ca61ba8d3438957d7 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 3 Aug 2022 09:49:04 -0700
Subject: [PATCH v9 1/2] Introduce optimized routine for linear searches
 through an array of integers.

If SSE2 is available, this function uses it to speed up the search.  Otherwise,
it uses a simple 'for' loop.  This is a prerequisite for a follow-up commit
that will use this function to optimize [sub]xip lookups in
XidInMVCCSnapshot(), but it can be used anywhere that might benefit from such
an optimization.

It might be worthwhile to add an ARM-specific code path to this function in the
future.

Author: Nathan Bossart
Reviewed by: Andres Freund, John Naylor
Discussion: https://postgr.es/m/20220713170950.GA3116318%40nathanxps13
---
 src/include/port/pg_lfind.h | 69 +
 1 file changed, 69 insertions(+)
 create mode 100644 src/include/port/pg_lfind.h

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
new file mode 100644
index 00..4a9484a16d
--- /dev/null
+++ b/src/include/port/pg_lfind.h
@@ -0,0 +1,69 @@
+/*-
+ *
+ * pg_lfind.h
+ *	  Optimized linear search routines.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_lfind.h
+ *
+ *-
+ */
+#ifndef PG_LFIND_H
+#define PG_LFIND_H
+
+#include "port/simd.h"
+
+/*
+ * pg_lfind32
+ *
+ * Returns true if there is an element in 'base' that equals 'key'.  Otherwise,
+ * returns false.
+ */
+static inline bool
+pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
+{
+	uint32		i = 0;
+
+	/* If possible, use SSE2 intrinsics to speed up the search. */
+#ifdef USE_SSE2
+	const __m128i keys = _mm_set1_epi32(key); /* load 4 copies of key */
+	uint32		iterations = nelem & ~0xF;/* round down to multiple of 16 */
+
+	for (; i < iterations; i += 16)
+	{
+		/* load the next 16 values into __m128i variables */
+		const __m128i vals1 = _mm_loadu_si128((__m128i *) &base[i]);
+		const __m128i vals2 = _mm_loadu_si128((__m128i *) &base[i + 4]);
+		const __m128i vals3 = _mm_loadu_si128((__m128i *) &base[i + 8]);
+		const __m128i vals4 = _mm_loadu_si128((__m128i *) &base[i + 12]);
+
+		/* perform the comparisons */
+		const __m128i result1 = _mm_cmpeq_epi32(keys, vals1);
+		const __m128i result2 = _mm_cmpeq_epi32(keys, vals2);
+		const __m128i result3 = _mm_cmpeq_epi32(keys, vals3);
+		const __m128i result4 = _mm_cmpeq_epi32(keys, vals4);
+
+		/* shrink the results into a single variable */
+		const __m128i tmp1 = _mm_or_si128(result1, result2);
+		const __m128i tmp2 = _mm_or_si128(result3, result4);
+		const __m128i result = _mm_or_si128(tmp1, tmp2);
+
+		/* see if there was a match */
+		if (_mm_movemask_epi8(result) != 0)
+			return true;
+	}
+#endif
+
+	/* Process the remaining elements the slow way. */
+	for (; i < nelem; i++)
+	{
+		if (key == base[i])
+			return true;
+	}
+
+	return false;
+}
+
+#endif			/* PG_LFIND_H */
-- 
2.25.1

>From 6942097a6406bca2c52851bbad40e5f679cc18ef Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 3 Aug 2022 09:59:28 -0700
Subject: [PATCH v9 2/2] Optimize linear searches in XidInMVCCSnapshot().

This change makes use of the recently-introduced optimized linear search
routine to speed up searches through the [sub]xip arrays when possible, which
should improve performance significantly when the arrays are large.

Author: Nathan Bossart
Reviewed by: Andres Freund, John Naylor
Discussion: https://postgr.es/m/20220713170950.GA3116318%40nathanxps13
---
 src/backend/utils/time/snapmgr.c | 28 +++-
 1 file changed, 7 insertions(+), 21 deletions(-)

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 5bc2a15160..9b504c9745 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -56,6 +56,7 @@
 #include "datatype/timestamp.h"
 #include "lib/pairingheap.h"
 #include "miscadmin.h"

Re: Checking pgwin32_is_junction() errors

2022-08-08 Thread Thomas Munro
On Tue, Aug 9, 2022 at 8:30 AM Thomas Munro  wrote:
> On Mon, Aug 8, 2022 at 8:23 PM  wrote:
> > "C:/HOME" is the junction point to the second volume on my hard drive -
> > "\??\Volume{GUID}\" which name pgreadlink() erroneously strips here:
> > https://github.com/postgres/postgres/blob/7e29a79a46d30dc236d097825ab849158929d977/src/port/dirmod.c#L357.

> ... Would it
> be better to say: if it doesn't begin with "\??\X:", where X could be
> any letter, then don't modify it?

Concretely, I wonder if this is a good fix at least in the short term.
Does this work for you, and do the logic and explanation make sense?
From 9772b2bf5a1b11605ea16cfd17b3b50e0274aadf Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 9 Aug 2022 10:29:08 +1200
Subject: [PATCH] Fix readlink() for more complex Windows paths.

Our symlink() and readlink() replacements perform a very naive
transformation from "DOS" paths to "NT" paths and back, as required by
the junction point APIs.  This was enough for the simple paths we expect
users to provide for tablespaces, for example 'd:\my\fast\storage'.

Since commit c5cb8f3b taught stat() to follow symlinks, and since initdb
uses pg_mkdir_p(), and that examines parent directories, our humble
readlink() implementation can now be exposed to junction points not of
PostgreSQL origin.  Those might be corrupted by our naive path mangling,
which doesn't really understand NT paths in general.

It's possible that we should investigate using Rtl*() library routines
to do path transformation, but for now, simply decline to transform
paths that don't fit the simple drive-qualified pattern.  That means
that readlink() returns an NT path, which works just as well as a DOS
path when following the link.

Reported-by: Roman Zharkov 
Discussion: https://postgr.es/m/4590c37927d7b8ee84f9855d83229018%40postgrespro.ru
---
 src/port/dirmod.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/port/dirmod.c b/src/port/dirmod.c
index 2818bfd2e9..821563c290 100644
--- a/src/port/dirmod.c
+++ b/src/port/dirmod.c
@@ -352,9 +352,11 @@ pgreadlink(const char *path, char *buf, size_t size)
 
 	/*
 	 * If the path starts with "\??\", which it will do in most (all?) cases,
-	 * strip those out.
+	 * strip those out, but only if it's of the simple drive-letter-qualified
+	 * type that we know how to transform from NT to DOS format.
 	 */
-	if (r > 4 && strncmp(buf, "\\??\\", 4) == 0)
+	if (r >= 6 && strncmp(buf, "\\??\\", 4) == 0 && isalpha(buf[4]) &&
+		buf[5] == ':')
 	{
 		memmove(buf, buf + 4, strlen(buf + 4) + 1);
 		r -= 4;
-- 
2.35.1



Re: optimize lookups in snapshot [sub]xip arrays

2022-08-08 Thread Nathan Bossart
On Mon, Aug 08, 2022 at 12:56:28PM +0530, Bharath Rupireddy wrote:
> 1) pg_lfind32 - why just uint32? If it's not possible to define
> functions for char, unsigned char, int16, uint16, int32, int64, uint64
> and so on, can we add a few comments around that? Also, the comments
> can mention whether the base type or element data type of the array, or
> the data type of the key, matters when using pg_lfind32.

I figured that we'd add functions for other types when needed.  I
considered making the new function generic by adding an argument for the
element size.  Then, we could branch to optimized routines based on the
element size (e.g., pg_lfind() would call pg_lfind32() if the element size
was 4 bytes).  However, that seemed like more complexity than is required,
and it's probably nice to avoid the extra branching.
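
For reference, the generic interface I had in mind was roughly the following
(never written as a real patch; shown only to illustrate the extra dispatch
it would introduce):

#include "postgres.h"

#include <string.h>

#include "port/pg_lfind.h"

/*
 * Sketch of the rejected generic search: dispatch to the optimized routine
 * when the element size allows it, otherwise fall back to memcmp().
 */
static inline bool
pg_lfind(const void *key, void *base, uint32 nelem, size_t elemsize)
{
    if (elemsize == sizeof(uint32))
        return pg_lfind32(*(const uint32 *) key, (uint32 *) base, nelem);

    for (uint32 i = 0; i < nelem; i++)
    {
        const char *elem = (const char *) base + i * elemsize;

        if (memcmp(elem, key, elemsize) == 0)
            return true;
    }

    return false;
}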

> 2) I think this is not just for the remaining elements but also for
> non-USE_SSE2 cases. Also, please specify in which cases we reach here
> for USE_SSE2 cases.
> +/* Process the remaining elements the slow way. */

Well, in the non-SSE2 case, all of the elements are remaining at this
point.  :)

> 3) Can pg_lfind32 return the index of  the key found, for instance to
> use it for setting/resetting the found element in the array?

As discussed upthread, only returning whether the element is present in the
array is slightly faster.  If we ever needed a version that returned the
address of the matching element, we could reevaluate whether this small
boost was worth creating a separate function or if we should just modify
pg_lfind32() to be a tad slower.  I don't think we need to address that
now, though.

> 4) Can we, right away, use this API to replace linear search, say, in
> SimpleLruReadPage_ReadOnly(), ATExecAttachPartitionIdx(),
> AfterTriggerSetState()? I'm sure I might be missing other places, but
> can we replace the possible found areas with the new function?

I had found a few eligible linear searches earlier [0], but I haven't done
any performance analysis that proved such changes were worthwhile.  While
substituting linear searches with pg_lfind32() is probably an improvement
in most cases, I think we ought to demonstrate the benefits for each one.

[0] https://postgr.es/m/20220802221301.GA742739%40nathanxps13
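
As an example of the shape of such a substitution (purely illustrative, not
taken from any of those call sites):

#include "postgres.h"

#include "port/pg_lfind.h"

static bool
xid_is_listed(uint32 xid, uint32 *xids, uint32 nxids)
{
    /*
     * Replaces the usual open-coded loop:
     *
     *     for (uint32 i = 0; i < nxids; i++)
     *         if (xids[i] == xid)
     *             return true;
     *     return false;
     */
    return pg_lfind32(xid, xids, nxids);
}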

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: conchuela doesn't like gnu_printf anymore

2022-08-08 Thread Tom Lane
Mikael Kjellström  writes:
> Yes, when I upgraded to the latest DragonFly BSD 6.2.2 I was meaning to 
> switch to CLANG14 for both C and C++.  I guess I fat-fingered the 
> configuration somehow.
> I have switched to CLANG14 for both C and C++ now.
> Let's see if the warnings disappear now.

Yup, looks clean now.  Thanks!

regards, tom lane




Re: Temporary file access API

2022-08-08 Thread Bruce Momjian
On Mon, Aug  8, 2022 at 08:27:27PM +0200, Antonin Houska wrote:
> > For instance, to go back to my earlier comments, if the data is
> > already block-structured, we do not need to insert a second layer of
> > buffering; if it isn't, maybe we should, not just for TDE but for
> > other reasons as well. If the data is temporary, TDE might want to
> > encrypt it using a temporary key which is generated on the fly and
> > only stored in server memory, but if it's data that needs to be
> > readable after a server restart, we're going to need to use a key that
> > we'll still have access to in the future. Or maybe that particular
> > thing isn't relevant, but I just can't believe that every single I/O
> > needs exactly the same treatment, so there have got to be some
> > patterns here that do matter. And I can't really tell whether this
> > patch set has a theory about that and, if so, what it is.
> 
> I didn't want to use this API for relations (file reads/writes for these only
> affect a couple of places in md.c), nor for WAL. The intention was to use the
> API for temporary files, and also for other kinds of "auxiliary" files where
> the current coding is rather verbose but not too special.
> 
> To enable the encryption, we'd need to add an extra argument (structure) to
> the appropriate function, which provides the necessary information on the
> encryption, such as cipher kind (block/stream), the encryption key or the
> nonce. I just don't understand whether the TDE alone (4) justifies a patch like
> this, so I tried to make more use of it.

Yes, I assumed a parameter would be added to each call site to indicate
the type of key to be used for encryption, or no encryption.  If these
are in performance-critical places, we can avoid the additional level of
calls by adding an 'if (tde)' check to only do the indirection when
necessary.  We can also skip the indirection for files we know will not
need to be encrypted, e.g., postgresql.conf.

I can also imagine needing to write the data encrypted without
encrypting the in-memory version, so centralizing that code in one place
would be good.
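
To sketch what I mean (all names below are hypothetical, not an existing
API; the point is just the extra argument plus the "if (tde)"-style check):

#include <stdbool.h>
#include <unistd.h>

/* hypothetical descriptor passed down to the write routine */
typedef struct FileEncryptionInfo
{
    bool        encrypt;        /* write this file encrypted? */
    int         key_kind;       /* e.g. permanent key vs. temporary key */
} FileEncryptionInfo;

/* hypothetical encryption path, stubbed out for the example */
static ssize_t
encrypted_write(int fd, const void *buf, size_t len, int key_kind)
{
    /* real code would encrypt buf with the key selected by key_kind */
    return write(fd, buf, len);
}

static ssize_t
file_write(int fd, const void *buf, size_t len, const FileEncryptionInfo *enc)
{
    /* only pay for the indirection when encryption is actually wanted */
    if (enc != NULL && enc->encrypt)
        return encrypted_write(fd, buf, len, enc->key_kind);

    return write(fd, buf, len);
}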

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: Cleaning up historical portability baggage

2022-08-08 Thread Thomas Munro
On Mon, Aug 8, 2022 at 12:27 PM Tom Lane  wrote:
> BTW, that commit really should have updated the explanation at the top of
> instr_time.h:
>
>  * This file provides an abstraction layer to hide portability issues in
>  * interval timing.  On Unix we use clock_gettime() if available, else
>  * gettimeofday().  On Windows, gettimeofday() gives a low-precision result
>  * so we must use QueryPerformanceCounter() instead.  These macros also give
>  * some breathing room to use other high-precision-timing APIs.
>
> Updating the second sentence is easy enough, but as for the third,
> I wonder if it's still true in view of 24c3ce8f1.  Should we revisit
> whether to use gettimeofday vs. QueryPerformanceCounter?  At the very
> least I suspect it's no longer about "low precision", but about which
> API is faster.

Yeah, that's not true anymore, and QueryPerformanceCounter() is faster
than GetSystemTimePreciseAsFileTime()[1], but there doesn't
really seem to be any point in mentioning that or gettimeofday() at
all here.  I propose to cut it down to just:

  * This file provides an abstraction layer to hide portability issues in
- * interval timing.  On Unix we use clock_gettime() if available, else
- * gettimeofday().  On Windows, gettimeofday() gives a low-precision result
- * so we must use QueryPerformanceCounter() instead.  These macros also give
- * some breathing room to use other high-precision-timing APIs.
+ * interval timing.  On Unix we use clock_gettime(), and on Windows we use
+ * QueryPerformanceCounter().  These macros also give some breathing room to
+ * use other high-precision-timing APIs.

FWIW I expect this stuff to get whacked around some more for v16[2].

[1] https://devblogs.microsoft.com/oldnewthing/20170921-00/?p=97057
[2] https://commitfest.postgresql.org/39/3751/




Re: Cleaning up historical portability baggage

2022-08-08 Thread Tom Lane
Thomas Munro  writes:
> Yeah, that's not true anymore, and QueryPerformanceCounter() is faster
> than GetSystemTimePreciseAsFileTime()[1], but there doesn't
> really seem to be any point in mentioning that or gettimeofday() at
> all here.  I propose to cut it down to just:

>   * This file provides an abstraction layer to hide portability issues in
> - * interval timing.  On Unix we use clock_gettime() if available, else
> - * gettimeofday().  On Windows, gettimeofday() gives a low-precision result
> - * so we must use QueryPerformanceCounter() instead.  These macros also give
> - * some breathing room to use other high-precision-timing APIs.
> + * interval timing.  On Unix we use clock_gettime(), and on Windows we use
> + * QueryPerformanceCounter().  These macros also give some breathing room to
> + * use other high-precision-timing APIs.

WFM.

> FWIW I expect this stuff to get whacked around some more for v16[2].
> [2] https://commitfest.postgresql.org/39/3751/

Meh.  I think trying to use rdtsc is a fool's errand; you'll be fighting
CPU quirks forever.

regards, tom lane




Re: pg15b2: large objects lost on upgrade

2022-08-08 Thread Bruce Momjian
On Tue, Aug  2, 2022 at 03:32:05PM -0400, Tom Lane wrote:
> "Jonathan S. Katz"  writes:
> > On 8/2/22 1:12 PM, Tom Lane wrote:
> >> Sadly, we're still not out of the woods.  I see three buildfarm
> >> failures in this test since Robert resolved the "-X" problem [1][2][3]:
> 
> > Looking at the test code, is there anything that could have changed the 
> > relfrozenxid or relminxid independently of the test on these systems?
> 
> Hmmm ... now that you mention it, I see nothing in 002_pg_upgrade.pl
> that attempts to turn off autovacuum on either the source server or
> the destination.  So one plausible theory is that autovac moved the
> numbers since we checked.

Uh, pg_upgrade assumes autovacuum is not running, and tries to enforce
this:

start_postmaster()
...
/*
 * Use -b to disable autovacuum.
 *
 * Turn off durability requirements to improve object creation speed, and
 * we only modify the new cluster, so only use it there.  If there is a
 * crash, the new cluster has to be recreated anyway.  fsync=off is a big
 * win on ext4.
 *
 * Force vacuum_defer_cleanup_age to 0 on the new cluster, so that
 * vacuumdb --freeze actually freezes the tuples.
 */
snprintf(cmd, sizeof(cmd),
 "\"%s/pg_ctl\" -w -l \"%s/%s\" -D \"%s\" -o \"-p %d -b%s %s%s\" start",
 cluster->bindir,
 log_opts.logdir,
 SERVER_LOG_FILE, cluster->pgconfig, cluster->port,
 (cluster == &new_cluster) ?
 " -c synchronous_commit=off -c fsync=off -c full_page_writes=off -c vacuum_defer_cleanup_age=0" : "",
 cluster->pgopts ? cluster->pgopts : "", socket_string);

Perhaps the test script should do something similar, or this method
doesn't work anymore.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: out of date comment in commit_ts.c

2022-08-08 Thread Thomas Munro
On Thu, Jul 28, 2022 at 8:30 AM Nathan Bossart  wrote:
> On Tue, Jul 26, 2022 at 10:33:43AM -0700, Nathan Bossart wrote:
> > IIUC the ability for callers to request WAL record generation is no longer
> > possible as of 08aa89b [0].  Should the second sentence be removed?
>
> Here's a patch.

Pushed.




Re: pg15b2: large objects lost on upgrade

2022-08-08 Thread Tom Lane
Bruce Momjian  writes:
>> Hmmm ... now that you mention it, I see nothing in 002_pg_upgrade.pl
>> that attempts to turn off autovacuum on either the source server or
>> the destination.  So one plausible theory is that autovac moved the
>> numbers since we checked.

> Uh, pg_upgrade assumes autovacuum is not running, and tries to enforce
> this:

The problems come from autovac running before or after pg_upgrade.

> Perhaps the test script should do something similar,

I'm not on board with that, for the reasons I gave upthread.

regards, tom lane




Fix unmatched file identifications

2022-08-08 Thread Masahiko Sawada
Hi,

I found that there are two .c and .h files whose identification in the
header comment doesn't match its actual path.

src/include/common/compression.h has:

  * IDENTIFICATION
  *src/common/compression.h
  *-

src/fe_utils/cancel.c has:

  * src/fe-utils/cancel.c
  *
  *

The attached small patch fixes them.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


v1-0001-Fix-unmatched-file-identifications.patch
Description: Binary data


Re: optimize lookups in snapshot [sub]xip arrays

2022-08-08 Thread Masahiko Sawada
On Tue, Aug 9, 2022 at 7:33 AM Nathan Bossart  wrote:
>
> On Mon, Aug 08, 2022 at 01:46:48PM +0700, John Naylor wrote:
> > Okay, I think it's basically in good shape. Since it should be a bit
> > faster than a couple versions ago, would you be up for retesting with
> > the original test having 8 to 512 writers?
>
> Sure thing.  The results are similar.  As before, the improvements are most
> visible when the arrays are large.
>
> writers  head  patch
> 8    672   680
> 16   639   664
> 32   701   689
> 64   705   703
> 128  628   653
> 256  576   627
> 512  530   584
> 768  450   536
> 1024 350   494
>
> > And also add the const
> > markers we discussed upthread?
>
> Oops, sorry about that.  This is done in v9.
>
> > Aside from that, I plan to commit this
> > week unless there is further bikeshedding.
>
> Great, thanks.

The patch looks good to me. One minor point is:

+ * IDENTIFICATION
+ *   src/port/pg_lfind.h

The path doesn't match the actual file path, src/include/port/pg_lfind.h.

Regards,


--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Fix unmatched file identifications

2022-08-08 Thread John Naylor
On Tue, Aug 9, 2022 at 8:58 AM Masahiko Sawada  wrote:
> I found that there are two .c and .h files whose identification in the
> header comment doesn't match its actual path.

> The attached small patch fixes them.

Pushed, thanks!

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: pg15b2: large objects lost on upgrade

2022-08-08 Thread Bruce Momjian
On Mon, Aug  8, 2022 at 09:51:46PM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> >> Hmmm ... now that you mention it, I see nothing in 002_pg_upgrade.pl
> >> that attempts to turn off autovacuum on either the source server or
> >> the destination.  So one plausible theory is that autovac moved the
> >> numbers since we checked.
> 
> > Uh, pg_upgrade assumes autovacuum is not running, and tries to enforce
> > this:
> 
> The problems come from autovac running before or after pg_upgrade.
> 
> > Perhaps the test script should do something similar,
> 
> I'm not on board with that, for the reasons I gave upthread.

Uh, I assume it is this paragraph:

> If that is the explanation, then it leaves us with few good options.
> I am not in favor of disabling autovacuum in the test: ordinary
> users are not going to do that while pg_upgrade'ing, so it'd make
> the test less representative of real-world usage, which seems like
> a bad idea.  We could either drop this particular check again, or
> weaken it to allow new relfrozenxid >= old relfrozenxid, likewise
> relminxid.

I thought the test was setting up a configuration that would never be
used by normal servers.  Is that false?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: Fix unmatched file identifications

2022-08-08 Thread Masahiko Sawada
On Tue, Aug 9, 2022 at 11:24 AM John Naylor
 wrote:
>
> On Tue, Aug 9, 2022 at 8:58 AM Masahiko Sawada  wrote:
> > I found that there are two .c and .h files whose identification in the
> > header comment doesn't match its actual path.
>
> > The attached small patch fixes them.
>
> Pushed, thanks!

Thank you!

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: pg15b2: large objects lost on upgrade

2022-08-08 Thread Tom Lane
Bruce Momjian  writes:
> I thought the test was setting up a configuration that would never be
> used by normal servers.  Is that false?

*If* we made it disable autovac before starting pg_upgrade,
then that would be a process not used by normal users.
I don't care whether pg_upgrade disables autovac during its
run; that's not what's at issue here.

regards, tom lane




Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work

2022-08-08 Thread Thomas Munro
Working on get_dirent_type() reminded me of this thread.  I was going
to commit the first of these patches, but then I noticed Andres said
he was planning to, so I'll wait another day.  Here they are, with
commit messages but otherwise unchanged from Nathan's v12 except for a
slight comment tweak:

-   /* we're only handling directories here, skip if it's not ours */
+   /* we're only handling directories here, skip if it's not one */

The only hunk I'm having second thoughts about is the following, which
makes unexpected stray files break checkpoints:

-* We just log a message if a file doesn't fit the pattern, it's
-* probably some editors lock/state file or similar...
 */
if (sscanf(snap_de->d_name, "%X-%X.snap", &hi, &lo) != 2)
-   {
-   ereport(LOG,
+   ereport(ERROR,
(errmsg("could not parse file
name \"%s\"", path)));

Bharath mentioned other places that loop over stat(), but I think
those are places that want stuff we don't already have, like st_size.
From f0be5188830655b9ce9375b6d13b1ca728dbfcdd Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 9 Aug 2022 13:11:08 +1200
Subject: [PATCH v13 1/2] Make more use of get_dirent_type().

We don't need to call stat() or lstat() in several ReadDir() loops, if
d_type is available, as it usually is on most systems.  If not,
get_dirent_type() will do that under the covers.

In a couple of places this will now raise an ERROR if an internal
stat/lstat fails, where previously errors were silently ignored.  ENOENT
was not expected in these locations, because the loop in question is the
one responsible for unlinking files during checkpoint, or runs at
startup.

Author: Nathan Bossart 
Reviewed-by: Thomas Munro 
Reviewed-by: Bharath Rupireddy 
Reviewed-by: Andres Freund 
Discussion: https://postgr.es/m/20220215231123.GA2584239%40nathanxps13
---
 src/backend/access/heap/rewriteheap.c |  4 +-
 .../replication/logical/reorderbuffer.c   | 12 +++---
 src/backend/replication/logical/snapbuild.c   |  5 +--
 src/backend/replication/slot.c|  4 +-
 src/backend/storage/file/copydir.c| 21 +++---
 src/backend/storage/file/fd.c | 20 +
 src/backend/utils/misc/guc-file.l | 42 +++
 src/timezone/pgtz.c   |  8 +---
 8 files changed, 48 insertions(+), 68 deletions(-)

diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 9dd885d936..525039d5b2 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -113,6 +113,7 @@
 #include "access/xact.h"
 #include "access/xloginsert.h"
 #include "catalog/catalog.h"
+#include "common/file_utils.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1213,7 +1214,6 @@ CheckPointLogicalRewriteHeap(void)
 	mappings_dir = AllocateDir("pg_logical/mappings");
 	while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
 	{
-		struct stat statbuf;
 		Oid			dboid;
 		Oid			relid;
 		XLogRecPtr	lsn;
@@ -1227,7 +1227,7 @@ CheckPointLogicalRewriteHeap(void)
 			continue;
 
 		snprintf(path, sizeof(path), "pg_logical/mappings/%s", mapping_de->d_name);
-		if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
+		if (get_dirent_type(path, mapping_de, false, ERROR) != PGFILETYPE_REG)
 			continue;
 
 		/* Skip over files that cannot be ours. */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 88a37fde72..f2b6cdf473 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -91,6 +91,7 @@
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
+#include "common/file_utils.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -4407,15 +4408,10 @@ ReorderBufferCleanupSerializedTXNs(const char *slotname)
 {
 	DIR		   *spill_dir;
 	struct dirent *spill_de;
-	struct stat statbuf;
 	char		path[MAXPGPATH * 2 + 12];
 
 	sprintf(path, "pg_replslot/%s", slotname);
 
-	/* we're only handling directories here, skip if it's not ours */
-	if (lstat(path, &statbuf) == 0 && !S_ISDIR(statbuf.st_mode))
-		return;
-
 	spill_dir = AllocateDir(path);
 	while ((spill_de = ReadDirExtended(spill_dir, path, INFO)) != NULL)
 	{
@@ -4463,6 +4459,7 @@ StartupReorderBuffer(void)
 {
 	DIR		   *logical_dir;
 	struct dirent *logical_de;
+	char		path[MAXPGPATH * 2 + 12];
 
 	logical_dir = AllocateDir("pg_replslot");
 	while ((logical_de = ReadDir(logical_dir, "pg_replslot")) != NULL)
@@ -4475,6 +4472,11 @@ StartupReorderBuffer(void)
 		if (!ReplicationSlotValidateName(logical_de->d_name, DEBUG2))
 			continue;
 
+		/* we're only handling directories here, skip if it's not one */
+		sprin

Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work

2022-08-08 Thread Tom Lane
Thomas Munro  writes:
> The only hunk I'm having second thoughts about is the following, which
> makes unexpected stray files break checkpoints:

Sounds like a pretty bad idea.  What's the upside?

regards, tom lane




Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work

2022-08-08 Thread Tom Lane
I wrote:
> Thomas Munro  writes:
>> The only hunk I'm having second thoughts about is the following, which
>> makes unexpected stray files break checkpoints:

> Sounds like a pretty bad idea.  What's the upside?

Actually, having now read the patch, I don't think there is any
part of 0002 that is a good idea.  It's blithely removing the
comments that explain why the existing coding is the way it is,
and not providing a shred of justification for making checkpoints
more brittle.

I have not tried to analyze the error-handling properties of 0001,
but if it's being equally cavalier then it shouldn't be committed
either.  Most of this behavior is the result of decades of hard-won
experience; discarding it because it doesn't fit conveniently
into some refactoring plan isn't smart.

regards, tom lane




Re: out of date comment in commit_ts.c

2022-08-08 Thread Nathan Bossart
On Tue, Aug 09, 2022 at 01:02:15PM +1200, Thomas Munro wrote:
> Pushed.

Thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work

2022-08-08 Thread Thomas Munro
On Tue, Aug 9, 2022 at 3:27 PM Tom Lane  wrote:
> Actually, having now read the patch, I don't think there is any
> part of 0002 that is a good idea.  It's blithely removing the
> comments that explain why the existing coding is the way it is,
> and not providing a shred of justification for making checkpoints
> more brittle.

0002 also contradicts the original $SUBJECT and goal of this thread,
which is possibly why it was kept separate.  I was only thinking of
committing 0001 myself, which is the one I'd reviewed an earlier
version of.

> I have not tried to analyze the error-handling properties of 0001,
> but if it's being equally cavalier then it shouldn't be committed
> either.  Most of this behavior is the result of decades of hard-won
> experience; discarding it because it doesn't fit conveniently
> into some refactoring plan isn't smart.

0001 does introduce new errors, as mentioned in the commit message, in
the form of elevel ERROR passed into get_dirent_type(), which might be
thrown if your OS has no d_type and lstat() fails (also if you asked
to follow symlinks, but in those cases errors were already thrown).
But in those cases, it seems at least a little fishy that we ignored
errors from the existing lstat().  I wondered if that was because they
expected that any failure meant ENOENT and they wanted to tolerate
that, but that does not seem to be the case, so I considered the error
to be an improvement.




Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work

2022-08-08 Thread Tom Lane
Thomas Munro  writes:
> On Tue, Aug 9, 2022 at 3:27 PM Tom Lane  wrote:
>> I have not tried to analyze the error-handling properties of 0001,
>> but if it's being equally cavalier then it shouldn't be committed
>> either.

> 0001 does introduce new errors, as mentioned in the commit message, in
> the form of elevel ERROR passed into get_dirent_type(), which might be
> thrown if your OS has no d_type and lstat() fails (also if you asked
> to follow symlinks, but in those cases errors were already thrown).
> But in those cases, it seems at least a little fishy that we ignored
> errors from the existing lstat().

Hmmm ... I'll grant that ignoring lstat errors altogether isn't great.
But should the replacement behavior be elog-LOG-and-press-on,
or elog-ERROR-and-fail-the-surrounding-operation?  I'm not in any
hurry to believe that the latter is more appropriate without some
analysis of what the callers are doing.

The bottom line here is that I'm distrustful of behavioral changes
introduced to simplify refactoring rather than to solve a live
problem.

regards, tom lane




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-08 Thread Nathan Bossart
On Tue, Aug 09, 2022 at 10:57:44AM +0900, Masahiko Sawada wrote:
> The patch looks good to me. One minor point is:

Thanks for taking a look.

> + * IDENTIFICATION
> + *   src/port/pg_lfind.h
> 
> The path doesn't match to the actual file path, src/include/port/pg_lfind.h.

Fixed in v10.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From ed80cfaf146f82b930ea09b8efc062cc4088d4b6 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 3 Aug 2022 09:49:04 -0700
Subject: [PATCH v10 1/2] Introduce optimized routine for linear searches
 through an array of integers.

If SSE2 is available, this function uses it to speed up the search.  Otherwise,
it uses a simple 'for' loop.  This is a prerequisite for a follow-up commit
that will use this function to optimize [sub]xip lookups in
XidInMVCCSnapshot(), but it can be used anywhere that might benefit from such
an optimization.

It might be worthwhile to add an ARM-specific code path to this function in the
future.

Author: Nathan Bossart
Reviewed by: Andres Freund, John Naylor, Bharath Rupireddy, Masahiko Sawada
Discussion: https://postgr.es/m/20220713170950.GA3116318%40nathanxps13
---
 src/include/port/pg_lfind.h | 69 +
 1 file changed, 69 insertions(+)
 create mode 100644 src/include/port/pg_lfind.h

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
new file mode 100644
index 00..24de544f63
--- /dev/null
+++ b/src/include/port/pg_lfind.h
@@ -0,0 +1,69 @@
+/*-
+ *
+ * pg_lfind.h
+ *	  Optimized linear search routines.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/include/port/pg_lfind.h
+ *
+ *-
+ */
+#ifndef PG_LFIND_H
+#define PG_LFIND_H
+
+#include "port/simd.h"
+
+/*
+ * pg_lfind32
+ *
+ * Returns true if there is an element in 'base' that equals 'key'.  Otherwise,
+ * returns false.
+ */
+static inline bool
+pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
+{
+	uint32		i = 0;
+
+	/* If possible, use SSE2 intrinsics to speed up the search. */
+#ifdef USE_SSE2
+	const __m128i keys = _mm_set1_epi32(key); /* load 4 copies of key */
+	uint32		iterations = nelem & ~0xF;/* round down to multiple of 16 */
+
+	for (; i < iterations; i += 16)
+	{
+		/* load the next 16 values into __m128i variables */
+		const __m128i vals1 = _mm_loadu_si128((__m128i *) &base[i]);
+		const __m128i vals2 = _mm_loadu_si128((__m128i *) &base[i + 4]);
+		const __m128i vals3 = _mm_loadu_si128((__m128i *) &base[i + 8]);
+		const __m128i vals4 = _mm_loadu_si128((__m128i *) &base[i + 12]);
+
+		/* perform the comparisons */
+		const __m128i result1 = _mm_cmpeq_epi32(keys, vals1);
+		const __m128i result2 = _mm_cmpeq_epi32(keys, vals2);
+		const __m128i result3 = _mm_cmpeq_epi32(keys, vals3);
+		const __m128i result4 = _mm_cmpeq_epi32(keys, vals4);
+
+		/* shrink the results into a single variable */
+		const __m128i tmp1 = _mm_or_si128(result1, result2);
+		const __m128i tmp2 = _mm_or_si128(result3, result4);
+		const __m128i result = _mm_or_si128(tmp1, tmp2);
+
+		/* see if there was a match */
+		if (_mm_movemask_epi8(result) != 0)
+			return true;
+	}
+#endif
+
+	/* Process the remaining elements the slow way. */
+	for (; i < nelem; i++)
+	{
+		if (key == base[i])
+			return true;
+	}
+
+	return false;
+}
+
+#endif			/* PG_LFIND_H */
-- 
2.25.1

>From 28bc28d898afad0762c1966b21e9582294e8b101 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 3 Aug 2022 09:59:28 -0700
Subject: [PATCH v10 2/2] Optimize linear searches in XidInMVCCSnapshot().

This change makes use of the recently-introduced optimized linear search
routine to speed up searches through the [sub]xip arrays when possible, which
should improve performance significantly when the arrays are large.

Author: Nathan Bossart
Reviewed by: Andres Freund, John Naylor, Bharath Rupireddy, Masahiko Sawada
Discussion: https://postgr.es/m/20220713170950.GA3116318%40nathanxps13
---
 src/backend/utils/time/snapmgr.c | 28 +++-
 1 file changed, 7 insertions(+), 21 deletions(-)

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 5bc2a15160..9b504c9745 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -56,6 +56,7 @@
 #include "datatype/timestamp.h"
 #include "lib/pairingheap.h"
 #include "miscadmin.h"
+#include "port/pg_lfind.h"
 #include "storage/predicate.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
@@ -2284,8 +2285,6 @@ RestoreTransactionSnapshot(Snapshot snapshot, void *source_pgproc)
 bool
 XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 {
-	uint32		i;
-
 	/*
 	 * Make a quick range check to eliminate most XIDs without looking at the
 	 * xip arrays.  Note that this is OK even if we convert a subxact XID to
@@ -2317,13 +2316,8 @@ Xid

Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work

2022-08-08 Thread Bharath Rupireddy
On Tue, Aug 9, 2022 at 9:20 AM Tom Lane  wrote:
>
> Thomas Munro  writes:
> > On Tue, Aug 9, 2022 at 3:27 PM Tom Lane  wrote:
> >> I have not tried to analyze the error-handling properties of 0001,
> >> but if it's being equally cavalier then it shouldn't be committed
> >> either.
>
> > 0001 does introduce new errors, as mentioned in the commit message, in
> > the form of elevel ERROR passed into get_dirent_type(), which might be
> > thrown if your OS has no d_type and lstat() fails (also if you asked
> > to follow symlinks, but in those cases errors were already thrown).
> > But in those cases, it seems at least a little fishy that we ignored
> > errors from the existing lstat().
>
> Hmmm ... I'll grant that ignoring lstat errors altogether isn't great.
> But should the replacement behavior be elog-LOG-and-press-on,
> or elog-ERROR-and-fail-the-surrounding-operation?  I'm not in any
> hurry to believe that the latter is more appropriate without some
> analysis of what the callers are doing.
>
> The bottom line here is that I'm distrustful of behavioral changes
> introduced to simplify refactoring rather than to solve a live
> problem.

+1. I agree with Tom not to change elog-LOG to elog-ERROR and fail the
checkpoint operation. Because the checkpoint is more important than
why a single snapshot file (out thousands or even million files) isn't
removed at that moment. Also, I originally proposed to change
elog-ERROR to elog-LOG in CheckPointLogicalRewriteHeap for unlink()
failures for the same reason.

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/




Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work

2022-08-08 Thread Andres Freund
On 2022-08-09 15:10:41 +1200, Thomas Munro wrote:
> I was going to commit the first of these patches, but then I noticed Andres
> said he was planning to, so I'll wait another day.

I'd be happy for you to take this on...




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-08 Thread Bharath Rupireddy
On Tue, Aug 9, 2022 at 4:37 AM Nathan Bossart  wrote:
>
> On Mon, Aug 08, 2022 at 12:56:28PM +0530, Bharath Rupireddy wrote:
> > 1) pg_lfind32 - why just uint32? If it's not possible to define
> > functions for char, unsigned char, int16, uint16, int32, int64, uint64
> > and so on, can we add a few comments around that? Also, the comments
> > can mention whether the base type or element data type of the array, or
> > the data type of the key, matters when using pg_lfind32.
>
> I figured that we'd add functions for other types when needed.  I
> considered making the new function generic by adding an argument for the
> element size.  Then, we could branch to optimized routines based on the
> element size (e.g., pg_lfind() would call pg_lfind32() if the element size
> was 4 bytes).  However, that seemed like more complexity than is required,
> and it's probably nice to avoid the extra branching.
>
> > 3) Can pg_lfind32 return the index of  the key found, for instance to
> > use it for setting/resetting the found element in the array?
>
> As discussed upthread, only returning whether the element is present in the
> array is slightly faster.  If we ever needed a version that returned the
> address of the matching element, we could reevaluate whether this small
> boost was worth creating a separate function or if we should just modify
> pg_lfind32() to be a tad slower.  I don't think we need to address that
> now, though.

Isn't it a good idea to capture the above two points as comments in
port/pg_lfind.h, just so we don't lose track of them? I know these points
are in the hackers thread, but having them in the form of comments helps
developers who later attempt to change or use the new function.
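
For concreteness, here is a hypothetical sketch of the index-returning
variant mentioned above; the name pg_lfind32_idx and its return convention
are invented here for illustration and are not part of the patch:

/*
 * Hypothetical variant that returns the index of the first element equal
 * to 'key', or -1 if there is no match.  Name and contract are invented
 * for illustration only; the patch deliberately returns just a boolean
 * because that was measured to be slightly faster.
 */
static inline int
pg_lfind32_idx(uint32 key, const uint32 *base, uint32 nelem)
{
	uint32		i;

	for (i = 0; i < nelem; i++)
	{
		if (base[i] == key)
			return (int) i;
	}
	return -1;
}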

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/




Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work

2022-08-08 Thread Nathan Bossart
On Tue, Aug 09, 2022 at 09:29:16AM +0530, Bharath Rupireddy wrote:
> On Tue, Aug 9, 2022 at 9:20 AM Tom Lane  wrote:
>> Hmmm ... I'll grant that ignoring lstat errors altogether isn't great.
>> But should the replacement behavior be elog-LOG-and-press-on,
>> or elog-ERROR-and-fail-the-surrounding-operation?  I'm not in any
>> hurry to believe that the latter is more appropriate without some
>> analysis of what the callers are doing.
>>
>> The bottom line here is that I'm distrustful of behavioral changes
>> introduced to simplify refactoring rather than to solve a live
>> problem.
> 
> +1. I agree with Tom not to change elog-LOG to elog-ERROR and fail the
> checkpoint operation, because completing the checkpoint is more important
> than the reason a single snapshot file (out of thousands or even millions
> of files) could not be removed at that moment. Also, I originally proposed
> changing elog-ERROR to elog-LOG in CheckPointLogicalRewriteHeap for
> unlink() failures for the same reason.

This was my initial instinct as well, but this thread has received
contradictory feedback during the months since.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-08 Thread Nathan Bossart
On Tue, Aug 09, 2022 at 09:40:15AM +0530, Bharath Rupireddy wrote:
> Isn't it a good idea to capture the above two points as comments in
> port/pg_lfind.h, just so we don't lose track of them? I know these points
> are in the hackers thread, but having them in the form of comments helps
> developers who later attempt to change or use the new function.

Hm.  My first impression is that this is exactly the sort of information
that is better captured on the lists.  I'm not sure that the lack of such
commentary really poses any threat for future changes, which would need to
be judged on their own merit, anyway.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-08 Thread Tom Lane
Nathan Bossart  writes:
> On Tue, Aug 09, 2022 at 09:40:15AM +0530, Bharath Rupireddy wrote:
>> Isn't it a good idea to capture the above two points as comments in
>> port/pg_lfind.h, just so we don't lose track of them? I know these points
>> are in the hackers thread, but having them in the form of comments helps
>> developers who later attempt to change or use the new function.

> Hm.  My first impression is that this is exactly the sort of information
> that is better captured on the lists.  I'm not sure that the lack of such
> commentary really poses any threat for future changes, which would need to
> be judged on their own merit, anyway.

It's clearly unproductive (not to say impossible) to enumerate every
possible alternative design and say why you didn't choose it.  If
there's some particular "obvious" choice that you feel a need to
refute, then sure write a comment about that.  Notably, if we used
to do X and now do Y because X was found to be broken, then it's good
to have a comment trail discouraging future hackers from reinventing
X.  But that doesn't lead to needing comments about an unrelated
option Z.

regards, tom lane




Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work

2022-08-08 Thread Thomas Munro
On Tue, Aug 9, 2022 at 4:28 PM Nathan Bossart  wrote:
> On Tue, Aug 09, 2022 at 09:29:16AM +0530, Bharath Rupireddy wrote:
> > On Tue, Aug 9, 2022 at 9:20 AM Tom Lane  wrote:
> >> Hmmm ... I'll grant that ignoring lstat errors altogether isn't great.
> >> But should the replacement behavior be elog-LOG-and-press-on,
> >> or elog-ERROR-and-fail-the-surrounding-operation?  I'm not in any
> >> hurry to believe that the latter is more appropriate without some
> >> analysis of what the callers are doing.
> >>
> >> The bottom line here is that I'm distrustful of behavioral changes
> >> introduced to simplify refactoring rather than to solve a live
> >> problem.
> >
> > +1. I agree with Tom not to change elog-LOG to elog-ERROR and fail the
> > checkpoint operation, because completing the checkpoint is more important

The changes were not from elog-LOG to elog-ERROR; they were changes
from silently eating errors.  But I take your (plural) general point.

> > than the reason a single snapshot file (out of thousands or even millions
> > of files) could not be removed at that moment. Also, I originally proposed
> > changing elog-ERROR to elog-LOG in CheckPointLogicalRewriteHeap for
> > unlink() failures for the same reason.
>
> This was my initial instinct as well, but this thread has received
> contradictory feedback during the months since.

OK, so there aren't many places in 0001 that error out where we
previously did not.

For the one in CheckPointLogicalRewriteHeap(), if get_dirent_type()
fails, you don't just want to skip the entry -- it might be in the
range that we need to fsync().  So what if we just deleted that
get_dirent_type() != PGFILETYPE_REG check altogether?  Depending on the
name range check, we'd either attempt to unlink() and LOG if that
fails, or we'd attempt to open()-or-ERROR and then fsync()-or-PANIC.
What functionality have we lost without the DT_REG check?  Well, now if
someone ill-advisedly puts an important character device, socket, or
symlink (ie any kind of non-DT_REG entry) into that directory with a
name conforming to the unlinkable range, we'll try to unlink it and
possibly succeed, where previously we'd have skipped it.  And if it's
in the checkpointable range, we'd try to fsync() it and likely fail,
which is good, because this is supposed to be a checkpoint.

That's already what happens if lstat() fails in master.  We fall back
to treating it like a DT_REG anyway:

	if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
		continue;
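
For illustration, a LOG-and-press-on use of get_dirent_type() in that spot
might look roughly like this -- a sketch only, with the surrounding variable
names assumed from CheckPointLogicalRewriteHeap():

	PGFileType	ftype = get_dirent_type(path, mapping_de, false, LOG);

	if (ftype != PGFILETYPE_REG && ftype != PGFILETYPE_ERROR)
		continue;				/* definitely not a regular file: skip it */
	/*
	 * PGFILETYPE_ERROR: lstat() failed and was already reported at LOG
	 * level; press on and treat the entry like a regular file, just as
	 * master does in the fallback quoted above.
	 */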

For the one in CheckSnapBuild(), it's similar but unlink only.

For the one in StartupReplicationSlots(), you could possibly take the
same line: if it's not a directory, we'll try to rmdir() it anyway and
then emit a WARNING if that fails.  Again, that's exactly what master
would do if that lstat() failed.

In other words, if we're going to LOG and press on, maybe we should
just press on anyway.  Would the error message be any less
informative?  Would the risk of unlinking a character device that you
accidentally stored in
/clusters/pg16-main/pgdata/pg_logical/mappings/map-1234-5678 be a
problem for anyone?
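
To make that "just press on" idea concrete, here is a small standalone
POSIX-style sketch (plain readdir()/unlink()/fsync(), not PostgreSQL code;
the two name-range helpers are invented stand-ins): entries in the
unlinkable range get a best-effort unlink() with a logged complaint on
failure, entries in the checkpointable range are opened and fsync()'d with
failure treated as fatal, and no file-type check is made at all.

#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Invented stand-ins for the real checks, which parse the mapping file
 * name and compare its LSN against the checkpoint cutoff. */
static int in_unlinkable_range(const char *name) { return strncmp(name, "old-", 4) == 0; }
static int in_checkpoint_range(const char *name) { return strncmp(name, "map-", 4) == 0; }

static void
checkpoint_mappings_dir(const char *dirpath)
{
	DIR		   *dir = opendir(dirpath);
	struct dirent *de;
	char		path[4096];

	if (dir == NULL)
	{
		perror("opendir");
		exit(1);
	}

	while ((de = readdir(dir)) != NULL)
	{
		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
			continue;

		snprintf(path, sizeof(path), "%s/%s", dirpath, de->d_name);

		/* No d_type/lstat() check at all: act on the name range alone. */
		if (in_unlinkable_range(de->d_name))
		{
			/* best effort: complain and press on, analogous to elog(LOG) */
			if (unlink(path) != 0)
				fprintf(stderr, "could not remove \"%s\": %s\n",
						path, strerror(errno));
		}
		else if (in_checkpoint_range(de->d_name))
		{
			/* must not be skipped: a failure here should fail the caller */
			int			fd = open(path, O_RDONLY);

			if (fd < 0 || fsync(fd) != 0)
			{
				fprintf(stderr, "cannot sync \"%s\": %s\n",
						path, strerror(errno));
				exit(1);		/* analogous to ERROR/PANIC */
			}
			close(fd);
		}
	}
	closedir(dir);
}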

Separately from that topic, it would be nice to have a comment in each
place where we tolerate unlink() failure to explain why that is
harmless except for wasted disk.




Re: Perform streaming logical transactions by background workers and parallel apply

2022-08-08 Thread Dilip Kumar
On Mon, Aug 8, 2022 at 11:41 AM Dilip Kumar  wrote:
>
> On Mon, Aug 8, 2022 at 10:18 AM Dilip Kumar  wrote:
> >
> > > Based on above, we plan to first introduce the patch to perform streaming
> > > logical transactions by background workers, and then introduce parallel 
> > > apply
> > > normal transaction which design is different and need some additional 
> > > handling.
> >
> > Yeah, I think that makes sense.  Since the streamed transactions are
> > sent to the standby interleaved, we can take advantage of parallelism,
> > and we can also avoid the I/O, so that will speed things up as well.
>
> Some review comments on the latest version of the patch.
>
> 1.
> +/* Queue size of DSM, 16 MB for now. */
> +#define DSM_QUEUE_SIZE	16000
>
> Why don't we directly use 16 * 1024 * 1024?  That would be exactly 16 MB,
> so it would match the comment and also be more readable.
>
> 2.
> +/*
> + * There are three fields in message: start_lsn, end_lsn and send_time. 
> Because
> + * we have updated these statistics in apply worker, we could ignore these
> + * fields in apply background worker. (see function LogicalRepApplyLoop)
> + */
> +#define SIZE_STATS_MESSAGE (3 * sizeof(uint64))
>
> Instead of assuming there are 3 uint64 fields, why not directly use
> 2 * sizeof(XLogRecPtr) + sizeof(TimestampTz), so that if these data types
> ever change we don't need to remember to update this definition as well?
>
> 3.
> +/*
> + * Entry for a hash table we use to map from xid to our apply background 
> worker
> + * state.
> + */
> +typedef struct ApplyBgworkerEntry
> +{
> +	TransactionId xid;
> +	ApplyBgworkerState *wstate;
> +} ApplyBgworkerEntry;
>
> Mention in the comment of the structure, or of the member, that xid is
> the key of the hash table.  Refer to other such structures for
> reference.
>
> I am doing a more detailed review but this is what I got so far.

Some more comments

+	/*
+	 * Exit if any relation is not in the READY state and if any worker is
+	 * handling the streaming transaction at the same time. Because for
+	 * streaming transactions that is being applied in apply background
+	 * worker, we cannot decide whether to apply the change for a relation
+	 * that is not in the READY state (see should_apply_changes_for_rel) as we
+	 * won't know remote_final_lsn by that time.
+	 */
+	if (list_length(ApplyBgworkersFreeList) != list_length(ApplyBgworkersList) &&
+		!AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply workers for subscription \"%s\" will restart",
+						MySubscription->name),
+				 errdetail("Cannot handle streamed replication transaction by apply "
+						   "background workers until all tables are synchronized")));
+
+		proc_exit(0);
+	}

How can this situation occur?  I mean, while starting a background
worker itself, we can check whether all tables are sync-ready or not,
right?
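
A rough sketch of what such a start-time check could look like; treat this
as illustrative only, since apply_bgworker_start() here is assumed from the
patch under review and AllTablesyncsReady() is the existing launcher check:

/*
 * Illustrative only: do the readiness check once, when we try to launch an
 * apply background worker for a streamed transaction, instead of on every
 * streamed change.  Returning NULL means "apply this transaction serially".
 */
static ApplyBgworkerState *
apply_bgworker_start_if_ready(TransactionId xid)
{
	if (!AllTablesyncsReady())
		return NULL;

	return apply_bgworker_start(xid);
}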

+	/* Check the status of apply background worker if any. */
+	apply_bgworker_check_status();
+

What is the need to check each worker's status on every commit?  I
mean, if there are a lot of small transactions along with some
streaming transactions, won't it affect the apply performance for
those small transactions?


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-08 Thread John Naylor
On Tue, Aug 9, 2022 at 10:51 AM Nathan Bossart  wrote:
> Fixed in v10.

I decided I wasn't quite comfortable changing snapshot handling
without further guarantees.  To this end, 0002 in the attached v11 is
an addendum that adds assert checking (also pgindent and some
comment-smithing). As I suspected, make check-world passes even with
purposefully screwed-up coding. 0003 uses pg_lfind32 in syscache.c and
I verified that sticking in the wrong answer will lead to a crash in
assert-enabled builds in short order. I'd kind of like to throw this
(or something else suitable) at the build farm first for that reason.
It's simpler than the qsort/qunique/binary search that was there
before, so that's nice, but I've not tried to test performance.
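
For anyone following along, the snapshot-side usage is essentially just a
membership test over the xip/subxip arrays.  A rough sketch (not copied from
the patches; field names are assumed from SnapshotData):

/*
 * Sketch only: does 'xid' appear in the snapshot's xip or subxip arrays?
 * Assumes postgres.h and utils/snapshot.h are already included.
 */
#include "port/pg_lfind.h"

static bool
xid_in_snapshot_arrays(TransactionId xid, Snapshot snapshot)
{
	if (pg_lfind32(xid, snapshot->xip, snapshot->xcnt))
		return true;

	return pg_lfind32(xid, snapshot->subxip, snapshot->subxcnt);
}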

-- 
John Naylor
EDB: http://www.enterprisedb.com
From 1c89b9b7d3c71bb1a703caaf7c01c93bc9e2515f Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Tue, 9 Aug 2022 11:51:52 +0700
Subject: [PATCH v11 2/4] Add assert checking, run pgindent, comment changes

---
 src/include/port/pg_lfind.h | 78 ++---
 1 file changed, 56 insertions(+), 22 deletions(-)

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index 24de544f63..fb125977b2 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -18,51 +18,85 @@
 /*
  * pg_lfind32
  *
- * Returns true if there is an element in 'base' that equals 'key'.  Otherwise,
- * returns false.
+ * Return true if there is an element in 'base' that equals 'key', otherwise
+ * return false.
  */
 static inline bool
 pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 {
 	uint32		i = 0;
 
-	/* If possible, use SSE2 intrinsics to speed up the search. */
+	/* Use SIMD intrinsics where available. */
 #ifdef USE_SSE2
-	const __m128i keys = _mm_set1_epi32(key); /* load 4 copies of key */
-	uint32		iterations = nelem & ~0xF;/* round down to multiple of 16 */
 
-	for (; i < iterations; i += 16)
+	/*
+	 * A 16-byte register only has four 4-byte lanes. For better
+	 * instruction-level parallelism, each loop iteration operates on a block
+	 * of four registers. Testing has shown this is ~40% faster than using a
+	 * block of two registers.
+	 */
+	const		__m128i keys = _mm_set1_epi32(key); /* load 4 copies of key */
+	uint32		iterations = nelem & ~0xF;	/* round down to multiple of 16 */
+
+#if defined(USE_ASSERT_CHECKING)
+	bool		assert_result = false;
+
+	/* pre-compute the result for assert checking */
+	for (i = 0; i < nelem; i++)
 	{
-		/* load the next 16 values into __m128i variables */
-		const __m128i vals1 = _mm_loadu_si128((__m128i *) &base[i]);
-		const __m128i vals2 = _mm_loadu_si128((__m128i *) &base[i + 4]);
-		const __m128i vals3 = _mm_loadu_si128((__m128i *) &base[i + 8]);
-		const __m128i vals4 = _mm_loadu_si128((__m128i *) &base[i + 12]);
+		if (key == base[i])
+		{
+			assert_result = true;
+			break;
+		}
+	}
+#endif
 
-		/* perform the comparisons */
-		const __m128i result1 = _mm_cmpeq_epi32(keys, vals1);
-		const __m128i result2 = _mm_cmpeq_epi32(keys, vals2);
-		const __m128i result3 = _mm_cmpeq_epi32(keys, vals3);
-		const __m128i result4 = _mm_cmpeq_epi32(keys, vals4);
+	for (i = 0; i < iterations; i += 16)
+	{
+		/* load the next block into 4 registers holding 4 values each */
+		const		__m128i vals1 = _mm_loadu_si128((__m128i *) & base[i]);
+		const		__m128i vals2 = _mm_loadu_si128((__m128i *) & base[i + 4]);
+		const		__m128i vals3 = _mm_loadu_si128((__m128i *) & base[i + 8]);
+		const		__m128i vals4 = _mm_loadu_si128((__m128i *) & base[i + 12]);
 
-		/* shrink the results into a single variable */
-		const __m128i tmp1 = _mm_or_si128(result1, result2);
-		const __m128i tmp2 = _mm_or_si128(result3, result4);
-		const __m128i result = _mm_or_si128(tmp1, tmp2);
+		/* compare each value to the key */
+		const		__m128i result1 = _mm_cmpeq_epi32(keys, vals1);
+		const		__m128i result2 = _mm_cmpeq_epi32(keys, vals2);
+		const		__m128i result3 = _mm_cmpeq_epi32(keys, vals3);
+		const		__m128i result4 = _mm_cmpeq_epi32(keys, vals4);
+
+		/* combine the results into a single variable */
+		const		__m128i tmp1 = _mm_or_si128(result1, result2);
+		const		__m128i tmp2 = _mm_or_si128(result3, result4);
+		const		__m128i result = _mm_or_si128(tmp1, tmp2);
 
 		/* see if there was a match */
 		if (_mm_movemask_epi8(result) != 0)
+		{
+#if defined(USE_ASSERT_CHECKING)
+			Assert(assert_result == true);
+#endif
 			return true;
+		}
 	}
-#endif
+#endif			/* USE_SSE2 */
 
-	/* Process the remaining elements the slow way. */
+	/* Process the remaining elements one at a time. */
 	for (; i < nelem; i++)
 	{
 		if (key == base[i])
+		{
+#if defined(USE_SSE2) && defined(USE_ASSERT_CHECKING)
+			Assert(assert_result == true);
+#endif
 			return true;
+		}
 	}
 
+#if defined(USE_SSE2) && defined(USE_ASSERT_CHECKING)
+	Assert(assert_result == false);
+#endif
 	return false;
 }
 
-- 
2.36.1

From ff77224f9227bcff88a68e63f39754296810351c Mon Sep 17 00:00:00 2001
From: Nathan Boss