Re: XLog size reductions: Reduced XLog record header size for PG17

2023-09-20 Thread Michael Paquier
On Tue, Sep 19, 2023 at 12:07:07PM +0200, Matthias van de Meent wrote:
> V5 is a rebased version of v4, and includes the latest patch from
> "smaller XLRec block header" [0] as 0001.

0001 and 0007 are the meat of the changes.

-#define XLR_CHECK_CONSISTENCY  0x02
+#define XLR_CHECK_CONSISTENCY  (0x20)

I can't help but notice that there are a few stylistic choices like
this one in the patch.  Using parentheses around hex values is
inconsistent with the usual practice I've seen in the tree.

 #define COPY_HEADER_FIELD(_dst, _size)\
 do {\
-if (remaining < _size)\
+if (remaining < (_size))\
 goto shortdata_err;\

There are a couple of stylistic changes like this one, that I guess
could just use their own patch to make these macros easier to use.

-#define XLogRecGetInfo(decoder) ((decoder)->record->header.xl_info)
+#define XLogRecGetInfo(decoder) ((decoder)->record->header.xl_info & XLR_INFO_MASK)
+#define XLogRecGetRmgrInfo(decoder) (((decoder)->record->header.xl_info) & XLR_RMGR_INFO_MASK)

This stuff in 0002 is independent of 0001, am I right?  Doing this
split with an extra macro is okay by me, reducing the presence of
XLR_INFO_MASK and bitwise operations based on it.

0003 is also mechanical, but if you begin to enforce the use of
XLR_RMGR_INFO_MASK as the bits allowed to be passed down to the RMGR
identity callback, shouldn't we have at least a validity check to make
sure that nothing, not even a custom RMGR, passes down unexpected bits?
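
Something along these lines, say (a hypothetical sketch, not from the
patch; only XLR_RMGR_INFO_MASK comes from it):

/*
 * Sketch of a validity check at the entry of the rmgr identity path,
 * rejecting info bits outside the rmgr-owned part of xl_info, even
 * for custom rmgrs.
 */
static inline void
check_rmgr_info_bits(uint8 info)
{
    if ((info & ~XLR_RMGR_INFO_MASK) != 0)
        elog(ERROR, "unexpected info bits %02X passed to rmgr identity callback",
             info & ~XLR_RMGR_INFO_MASK);
}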

I am not convinced that XLOG_INCLUDE_XID is a good interface, TBH, and
I fear that people are going to forget to set it.  Wouldn't it be
better to use an option where the XID is excluded instead, making the
inclusion of an XID the default?

> The resource manager has ID = 0, thus requiring some special
> handling in other code. Apart from being generally useful, it is
> used in future patches to detect the end of wal in lieu of a zero-ed
> fixed-size xl_tot_len field.

Err, no, that may not be true.  See for example this thread, which
discusses improving the checks of xl_tot_len and relying on this value
once a record header has been validated, even across page borders:
https://www.postgresql.org/message-id/17928-aa92416a70ff4...@postgresql.org

That aside, in which cases could an invalid RMGR be useful?
--
Michael


signature.asc
Description: PGP signature


Re: Oversight in reparameterize_path_by_child leading to executor crash

2023-09-20 Thread Andrey Lepikhov

On 23/8/2023 12:37, Richard Guo wrote:

If we go with the "tablesample scans can't be reparameterized" approach
in the back branches, I'm a little concerned about what happens if we
find more cases in the future where we need to modify RTEs for
reparameterization. So I spent some time looking and have managed to
find one: there might be lateral references in a scan path's restriction
clauses, and currently reparameterize_path_by_child fails to adjust them.
This may help you: in [1], we designed a feature that applies the
partitionwise join technique to a JOIN between partitioned and
non-partitioned tables. Unfortunately, it has fallen out of community
discussion, but we still support it for sharding usage - it is helpful
for implementing 'global' tables in a distributed configuration. There
we were stuck on the same problem of lateral relids adjustment. So this
patch may give you a more general view of the problem.


[1] Asymmetric partition-wise JOIN
https://www.postgresql.org/message-id/flat/CAOP8fzaVL_2SCJayLL9kj5pCA46PJOXXjuei6-3aFUV45j4LJQ%40mail.gmail.com

--
regards,
Andrey Lepikhov
Postgres Professional





Re: disfavoring unparameterized nested loops

2023-09-20 Thread Andrey Lepikhov

On 29/9/2022 21:32, Benjamin Coutu wrote:

I'd like to revamp this important discussion.

As is well described in this fairly recent paper here 
https://www.vldb.org/pvldb/vol9/p204-leis.pdf (which also looks at Postgres) 
"estimation errors quickly grow as the number of joins increases, and that these 
errors are usually the reason for bad plans" - I think we can all get behind that 
statement.

While nested loop joins work great when cardinality estimates are correct, they 
are notoriously bad when the optimizer underestimates and they degrade very 
fast in such cases - the good vs. bad here is very asymmetric. On the other 
hand hash joins degrade much more gracefully - they are considered very robust 
against underestimation. The above-mentioned paper illustrates that all major
DBMS (including Postgres) tend to underestimate and usually that 
underestimation increases drastically with the number of joins (see Figures 3+4 
of the paper).

Now, a simple approach to guarding against bad plans that arise from 
underestimation could be to use what I would call a 
nested-loop-conviction-multiplier based on the current depth of the join tree, 
e.g. for a base table that multiplier would obviously be 1, but would then grow 
(e.g.) quadratically. That conviction-multiplier would *NOT* be used to skew 
the cardinality estimates themselves, but rather be applied to the overall 
nested loop join cost at each particular stage of the plan when comparing it to 
other more robust join strategies like hash or sort-merge joins. That way when 
we can be sure to have a good estimate at the bottom of the join tree we treat 
all things equal, but favor nested loops less and less as we move up the join 
tree for the sake of robustness.
Also, we can expand the multiplier whenever we fall back to using the default 
cardinality constant as surely all bets are off at that point - we should 
definitely treat nested loop joins as out of favor in this instance and that 
could easily be incorporated by simply increasing the conviction-multiplier.

What are your thoughts on this simple idea - is it perhaps too simple?
In my practice, a parameterized nested loop (PNL) reduces execution
time, sometimes drastically. If your query touches a lot of tables but
extracts only a tiny part of the data, and you have good index
coverage, PNL works great.
Moreover, I have pondered extending parameterization through subqueries 
and groupings.
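
To illustrate the quoted multiplier idea with a rough sketch (purely
hypothetical; nothing below exists in PostgreSQL):

/*
 * Hypothetical sketch of the conviction multiplier: penalize nested
 * loop cost more the deeper we are in the join tree, and much more
 * once a default selectivity constant was used.  Names and constants
 * are illustrative only.
 */
static Cost
nestloop_cost_with_conviction(Cost nestloop_cost, int join_depth,
                              bool used_default_selectivity)
{
    double      multiplier = 1.0 + 0.1 * join_depth * join_depth;

    if (used_default_selectivity)
        multiplier *= 10.0;     /* all bets are off at this point */

    /* applied only when comparing against hash/merge join costs */
    return nestloop_cost * multiplier;
}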


What would you say about a different approach: a hybrid join? In MS SQL Server,
they have such a feature [1], and, according to their description, it
requires low overhead. They start with a HashJoin and switch to a NestLoop
if the inner input contains too few tuples. That would solve the issue, wouldn't it?


[1] 
https://techcommunity.microsoft.com/t5/sql-server-blog/introducing-batch-mode-adaptive-joins/ba-p/385411


--
regards,
Andrey Lepikhov
Postgres Professional





Re: Show WAL write and fsync stats in pg_stat_io

2023-09-20 Thread Nazir Bilal Yavuz
Hi,

Thanks for the review!

Current status of the patch is:
- IOOBJECT_WAL / IOCONTEXT_NORMAL read, write and fsync stats are added.
- IOOBJECT_WAL / IOCONTEXT_NORMAL write and fsync tests are added.
- IOOBJECT_WAL / IOCONTEXT_INIT stats are added.
- pg_stat_io shows different op_bytes for the IOOBJECT_WAL operations.
- Working on which 'BackendType / IOContext / IOOp' should be banned in
pg_stat_io.
- PendingWalStats.wal_sync and PendingWalStats.wal_write_time /
PendingWalStats.wal_sync_time are moved to pgstat_count_io_op_n() /
pgstat_count_io_op_time() respectively.

TODOs:
- Documentation.
- Try to set op_bytes for BackendType / IOContext.
- Decide which 'BackendType / IOContext / IOOp' should not be tracked.
- Add IOOBJECT_WAL / IOCONTEXT_NORMAL read tests.
- Add IOOBJECT_WAL / IOCONTEXT_INIT tests.

I am adding the tracking of BackendType / IOContext / IOOp as tables; an
empty cell means it is not decided yet:

IOCONTEXT_NORMAL / Backend / IOOp table:

╔═════════════════════╦═══════╦═══════╦═══════╗
║ IOCONTEXT_NORMAL    ║ read  ║ write ║ fsync ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ autovacuum launcher ║ FALSE ║ FALSE ║ FALSE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ autovacuum worker   ║ FALSE ║  TRUE ║  TRUE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ client backend      ║       ║  TRUE ║  TRUE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ background worker   ║       ║       ║       ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ background writer   ║       ║  TRUE ║  TRUE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ checkpointer        ║       ║  TRUE ║  TRUE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ standalone backend  ║  TRUE ║  TRUE ║  TRUE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ startup             ║  TRUE ║       ║       ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ walreceiver         ║       ║       ║       ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ walsender           ║       ║       ║       ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ walwriter           ║       ║  TRUE ║  TRUE ║
╚═════════════════════╩═══════╩═══════╩═══════╝


IOCONTEXT_WAL_INIT / Backend / IOOp table:

╔═════════════════════╦═══════╦═══════╗
║ IOCONTEXT_WAL_INIT  ║ write ║ fsync ║
╠═════════════════════╬═══════╬═══════╣
║ autovacuum launcher ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ autovacuum worker   ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ client backend      ║  TRUE ║  TRUE ║
╠═════════════════════╬═══════╬═══════╣
║ background worker   ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ background writer   ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ checkpointer        ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ standalone backend  ║  TRUE ║  TRUE ║
╠═════════════════════╬═══════╬═══════╣
║ startup             ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ walreceiver         ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ walsender           ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ walwriter           ║       ║       ║
╚═════════════════════╩═══════╩═══════╝


On Wed, 9 Aug 2023 at 21:52, Melanie Plageman 
wrote:
>
> > On Sat, 22 Jul 2023 at 01:30, Melanie Plageman
> >  wrote:
> > > I think it would be good to count WAL reads even though they are not
> > > currently represented in pg_stat_wal. Here is a thread discussing this
> > > [1].
> >
> > I used the same implementation in the thread link [1]. I added 'WAL
> > read' to only xlogrecovery.c for now. I didn't add 'WAL read' to
> > xlogreader.c and walsender.c because they cause some failures on:
> > '!pgStatLocal.shmem->is_shutdown' asserts. I will spend more time on
> > these. Also, I added Bharath to CC. I have a question about 'WAL
> > read':
> > 1. There are two places where 'WAL read' happens.
> > a. In WALRead() in xlogreader.c, it reads 'count' bytes; most of the
> > time count is equal to XLOG_BLCKSZ, but there are some cases where it is not.
> > For example:
> > - in XLogSendPhysical() in walsender.c, WALRead() is called with nbytes
> > - in WALDumpReadPage() in pg_waldump.c, WALRead() is called with count
> > These nbytes and count variables could be different from XLOG_BLCKSZ.
> >
> > b. In XLogPageRead() in xlogrecovery.c, it reads exactly XLOG_BLCKSZ bytes:
> > pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
> >
> > So, what should op_bytes be set to for 'WAL read' operations?
>
> If there is any combination of BackendType and IOContext which will
> always read XLOG_BLCKSZ bytes, we could use XLOG_BLCKSZ for that row's
> op_bytes. For other cases, we may have to consider using op_bytes 1 and
> tracking reads and write IOOps in number of bytes (instead of number of
> pages). I don't actually know if there is a clear separation by
> BackendType for these different cases.

I agree. I will edit that later, added to TODOs.
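
For what it's worth, the byte-based accounting could look roughly like
this (a sketch assuming op_bytes = 1 for the row; IOOBJECT_WAL is from
the patch under discussion, not from core):

/* in the WAL read path, count the actual bytes instead of pages */
ssize_t     readbytes;

readbytes = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
if (readbytes > 0)
    pgstat_count_io_op_n(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
                         (uint32) readbytes);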

>
> The other alternative I see

Re: pg_rewind with cascade standby doesn't work well

2023-09-20 Thread Fujii Masao




On 2023/09/20 12:04, Michael Paquier wrote:

This is a known issue.  I guess that the same as this thread and this
CF entry:
https://commitfest.postgresql.org/44/4244/
https://www.postgresql.org/message-id/flat/zarvomifjze7f...@paquier.xyz


I think this is a separate issue, and we should still use Kuwamura-san's patch
even after the one you posted on the thread gets accepted. BTW, I was able to
reproduce the assertion failure Kuwamura-san reported, even after applying
your latest patch from the thread.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-09-20 Thread Dilip Kumar
On Wed, Sep 20, 2023 at 12:12 PM Amit Kapila  wrote:
>
> On Wed, Sep 20, 2023 at 11:51 AM Dilip Kumar  wrote:
> >
> > On Wed, Sep 20, 2023 at 11:00 AM Hayato Kuroda (Fujitsu)
> >  wrote:
> > >
> > > Dear Amit,
> > >
> > > Thank you for reviewing! PSA new version. In this version I ran pgindent 
> > > again.
> > >
> >
> > + /*
> > + * There is a possibility that following records may be generated
> > + * during the upgrade.
> > + */
> > + if (!CHECK_WAL_RECORD(rmid, info, RM_XLOG_ID, XLOG_CHECKPOINT_SHUTDOWN) &&
> > + !CHECK_WAL_RECORD(rmid, info, RM_XLOG_ID, XLOG_CHECKPOINT_ONLINE) &&
> > + !CHECK_WAL_RECORD(rmid, info, RM_XLOG_ID, XLOG_SWITCH) &&
> > + !CHECK_WAL_RECORD(rmid, info, RM_XLOG_ID, XLOG_FPI_FOR_HINT) &&
> > + !CHECK_WAL_RECORD(rmid, info, RM_XLOG_ID, XLOG_PARAMETER_CHANGE) &&
> > + !CHECK_WAL_RECORD(rmid, info, RM_STANDBY_ID, XLOG_RUNNING_XACTS) &&
> > + !CHECK_WAL_RECORD(rmid, info, RM_HEAP2_ID, XLOG_HEAP2_PRUNE))
> > + is_valid = false;
> > +
> > + CHECK_FOR_INTERRUPTS();
> >
> > Just wondering why XLOG_HEAP2_VACUUM or other vacuum-related commands
> > can not occur during the upgrade?
> >
>
> Because autovacuum is disabled during upgrade. See comment: "Use -b to
> disable autovacuum" in start_postmaster().

Okay got it, thanks.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Fix error handling in be_tls_open_server()

2023-09-20 Thread Daniel Gustafsson
> On 19 Sep 2023, at 10:06, Sergey Shinderuk  wrote:
> 
> On 19.09.2023 03:54, Michael Paquier wrote:
>> One doubt that I have is if we shouldn't let X509_NAME_print_ex() be
>> as it is now, and not force a failure on the bio if this calls fails.
> 
> If malloc fails inside X509_NAME_print_ex, then we will be left with empty 
> port->peer_dn.

Looking at the OpenSSL code, there are other (albeit esoteric) errors that
return -1 as well.  I agree that we should handle this error.

X509_NAME_print_ex is not documented to return -1 in OpenSSL 1.0.2 but reading
the code it's clear that it does, so checking for -1 is safe for all supported
OpenSSL versions (supported by us that is).

Attached is a v2 on top of HEAD with commit message etc, which I propose to
backpatch to v15 where it was introduced.

--
Daniel Gustafsson



v2-0001-Avoid-potential-pfree-on-NULL-on-OpenSSL-errors.patch
Description: Binary data


Re: CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }

2023-09-20 Thread Pavel Stehule
Hi

On Wed, 20 Sep 2023 at 9:34, Maciek Sakrejda wrote:

> On Tue, Sep 19, 2023 at 5:56 PM Jeff Davis  wrote:
> >...
> > On Tue, 2023-09-19 at 11:41 -0400, Robert Haas wrote:
> > > That leads to a second idea, which is having it continue
> > > to be a GUC but only affect directly-entered SQL, with all
> > > indirectly-entered SQL either being stored as a node tree or having a
> > > search_path property attached somewhere.
> >
> > That's not too far from the proposed patch and I'd certainly be
> > interested to hear more and/or adapt my patch towards this idea.
>
> As an interested bystander, that's the same thing I was thinking when
> reading this. I reread your original e-mail, Jeff, and I still think
> that.
>
> I wonder if something like CURRENT (i.e., the search path at function
> creation time) might be a useful keyword addition. I can see some uses
> (more forgiving than SYSTEM but not as loose as SESSION), but I don't
> know if it would justify its presence.
>

Personally, I dislike this, because the value of the search path is hidden
in this case.

I agree that it can be convenient, but it can be confusing for review,
migration, ...

Regards

Pavel


> Thanks for working on this.
>
> Thanks,
> Maciek
>
>
>


Re: Fix error handling in be_tls_open_server()

2023-09-20 Thread Sergey Shinderuk

On 20.09.2023 11:42, Daniel Gustafsson wrote:

Attached is a v2 on top of HEAD with commit message etc, which I propose to
backpatch to v15 where it was introduced.


There is a typo: "upon en error". Otherwise, looks good to me. Thank you.

--
Sergey Shinderuk
https://postgrespro.com/





Re: Fix error handling in be_tls_open_server()

2023-09-20 Thread Daniel Gustafsson
> On 20 Sep 2023, at 10:55, Sergey Shinderuk  wrote:
> 
> On 20.09.2023 11:42, Daniel Gustafsson wrote:
>> Attached is a v2 on top of HEAD with commit message etc, which I propose to
>> backpatch to v15 where it was introduced.
> There is a typo: "upon en error". Otherwise, looks good to me. Thank you.

Thanks, will fix before pushing.

--
Daniel Gustafsson





Re: There should be a way to use the force flag when restoring databases

2023-09-20 Thread Peter Eisentraut

On 06.08.23 21:39, Ahmed Ibrahim wrote:
I have addressed the pg version compatibility with the FORCE option in 
drop. Here is the last version of the patch


The patch is pretty small, but I think there is some disagreement about
whether we want this option at all?  Maybe some more people can make
their opinions more explicit?






Re: POC, WIP: OR-clause support for indexes

2023-09-20 Thread Peter Eisentraut

On 29.08.23 05:37, a.rybakina wrote:
Thank you for your interest in this problem and your help, and I'm sorry
that I didn't respond to this email for a long time. To be honest, I
wanted to investigate the problems in more detail and give a clearer
answer, but unfortunately I have not found anything more significant yet.


What is the status of this patch?  It is registered in the commitfest. 
It looks like a stalled research project?  The last posted patch doesn't 
contain any description or tests, so it doesn't look very ready.






Re: bug fix and documentation improvement about vacuumdb

2023-09-20 Thread Kuwamura Masaki
Thank you for all your reviews!

>>> PATTERN should be changed to SCHEMA because -n and -N options don't
support
>>> pattern matching for schema names. The attached patch 0001 fixes this.
>>
>> True, there is no pattern matching performed.  I wonder if it's worth
lifting
>> the pattern matching from pg_dump into common code such that tools like
this
>> can use it?
>
> I agree that this should be changed to SCHEMA.  It might be tough to add
> pattern matching with the current catalog query, and I don't know whether
> there is demand for such a feature, but I wouldn't discourage someone from
> trying.

I think that supporting pattern matching would be quite nice.
But I suspect it would be not only tough but also a breaking change.
So I guess this change should be committed either way.

>>> Yeah, I think we can fix the JOIN as you suggest.  I quickly put a patch
>>> together to demonstrate.
>
> Looks good from a quick skim.

I do agree with these updates. Thank you!

>> We should probably add some tests...
>
> Agreed.

The attached patch includes new tests for this bug.
Also, I fixed the current test for the -N option, which seems to be incorrect.

>>> It seems to work fine. However, if we're aiming for consistent
>>> spacing, the "IS  NULL" (two spaces in between) might be a concern.
>>
>> I don't think that's a problem.  I would rather have readable C code and
two
>> spaces in the generated SQL than contorting the C code to produce less
>> whitespace in a query few will read in its generated form.
>
> I think we could pretty easily avoid the extra space and keep the C code
> relatively readable.  These sorts of things bug me, too (see 2af3336).

Though I don't think it affects readability, I'm neutral about this.

>> >> Third, for the description of the -N option, I wonder if "vacuum all
tables except
>> >> in the specified schema(s)" might be clearer. The current one says
nothing about
>> >> tables not in the specified schema.
>> >
>> > Maybe, but the point of vacuumdb is to analyze a database so I'm not
sure who
>> > would expect anything else than vacuuming everything but the excluded
schema
>> > when specifying -N.  What else could "vacuumdb -N foo" be interpreted
to do
>> > that can be confusing?
>>
>> I agree with Daniel on this one.
>
> +1.

That makes sense. I retract my suggestion.


v1_vacuumdb_add_tests.patch
Description: Binary data


Re: disfavoring unparameterized nested loops

2023-09-20 Thread David Rowley
On Wed, 20 Sept 2023 at 19:56, Andrey Lepikhov
 wrote:
> What would you say about a different approach: a hybrid join? In MS SQL Server,
> they have such a feature [1], and, according to their description, it
> requires low overhead. They start with a HashJoin and switch to a NestLoop
> if the inner input contains too few tuples. That would solve the issue, wouldn't it?

A complexity which you may not be considering here is that Nested Loop
joins always preserve the tuple order from the outer side of the join,
whereas hash joins will not do this when multi-batching.

I've no idea how the SQL Server engineers solved that.

David

> [1]
> https://techcommunity.microsoft.com/t5/sql-server-blog/introducing-batch-mode-adaptive-joins/ba-p/385411




Re: logical decoding and replication of sequences, take 2

2023-09-20 Thread Dilip Kumar
On Wed, Aug 16, 2023 at 7:57 PM Tomas Vondra
 wrote:
>

I was reading through 0001, I noticed this comment in
ReorderBufferSequenceIsTransactional() function

+ * To decide if a sequence change should be handled as transactional or applied
+ * immediately, we track (sequence) relfilenodes created by each transaction.
+ * We don't know if the current sub-transaction was already assigned to the
+ * top-level transaction, so we need to check all transactions.

It says "We don't know if the current sub-transaction was already
assigned to the top-level transaction, so we need to check all
transactions". But IIRC as part of the steaming of in-progress
transactions we have ensured that whenever we are logging the first
change by any subtransaction we include the top transaction ID in it.

Refer this code

LogicalDecodingProcessRecord(LogicalDecodingContext *ctx,
XLogReaderState *record)
{
...
/*
* If the top-level xid is valid, we need to assign the subxact to the
* top-level xact. We need to do this for all records, hence we do it
* before the switch.
*/
if (TransactionIdIsValid(txid))
{
ReorderBufferAssignChild(ctx->reorder,
txid,
XLogRecGetXid(record),
buf.origptr);
}
}

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [PoC] Reducing planning time when tables have many partitions

2023-09-20 Thread Yuya Watari
Hello Ashutosh and Andrey,

Thank you for your email, and I really apologize for my late response.

On Thu, Sep 7, 2023 at 3:43 PM Ashutosh Bapat
 wrote:
> It seems that you are still investigating and fixing issues. But the
> CF entry is marked as "needs review". I think a better status is
> "WoA". Do you agree with that?

Yes, I am now investigating and fixing issues. I agree with you and
changed the entry's status to "Waiting on Author". Thank you for your
advice.

On Tue, Sep 19, 2023 at 5:21 PM Andrey Lepikhov
 wrote:
> Working on self-join removal in the nearby thread [1], I got stuck on a
> problem which adds another argument for working in this new direction,
> beyond the previous ones.
> With indexing positions in the list of equivalence members, we make some
> optimizations like join elimination more complicated - it may need to
> remove some clauses and equivalence class members.
> For changing lists of derives or ec_members, we should go through all
> the index lists and fix them, which is a non-trivial operation.

Thank you for looking into this and pointing that out. I understand
that this problem will occur in places like your patch [1], quoted
below, because we need to modify RelOptInfo->eclass_child_members in
addition to ec_members. Is my understanding correct? (Of course, I
know about ec_[no]rel_members, but I doubt we need them.)

=
+static void
+update_eclass(EquivalenceClass *ec, int from, int to)
+{
+   List   *new_members = NIL;
+   ListCell   *lc;
+
+   foreach(lc, ec->ec_members)
+   {
+   EquivalenceMember  *em = lfirst_node(EquivalenceMember, lc);
+   boolis_redundant = false;
+
...
+
+   if (!is_redundant)
+   new_members = lappend(new_members, em);
+   }
+
+   list_free(ec->ec_members);
+   ec->ec_members = new_members;
=

I think we may be able to remove the eclass_child_members field by
making child members on demand. v20 makes child members at
add_[child_]join_rel_equivalences() and adds them into
RelOptInfo->eclass_child_members. Instead of doing that, if we
translate on demand when child members are requested,
RelOptInfo->eclass_child_members is no longer necessary. After that,
there is only ec_members, which consists of parent members, so
removing clauses will still be simple. Do you think this idea will
solve your problem? If so, I will experiment with this and share a new
patch version. The main concern with this idea is that the same child
member will be created many times, wasting time and memory. Some
techniques like caching might solve this.
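
A minimal sketch of that on-demand translation with a naive cache (all
names other than the planner APIs are hypothetical; in particular,
ec_child_member_cache does not exist anywhere):

static EquivalenceMember *
get_child_eclass_member(PlannerInfo *root, EquivalenceClass *ec,
                        EquivalenceMember *parent_em,
                        RelOptInfo *child_rel)
{
    EquivalenceMember *child_em;
    AppendRelInfo **appinfos;
    int         nappinfos;
    ListCell   *lc;

    /* naive cache probe; a real cache would also key on parent_em */
    foreach(lc, ec->ec_child_member_cache)
    {
        child_em = (EquivalenceMember *) lfirst(lc);
        if (bms_equal(child_em->em_relids, child_rel->relids))
            return child_em;
    }

    /* translate the parent member's expression for this child rel */
    appinfos = find_appinfos_by_relids(root, child_rel->relids, &nappinfos);
    child_em = makeNode(EquivalenceMember);
    memcpy(child_em, parent_em, sizeof(EquivalenceMember));
    child_em->em_expr = (Expr *)
        adjust_appendrel_attrs(root, (Node *) parent_em->em_expr,
                               nappinfos, appinfos);
    child_em->em_relids = bms_copy(child_rel->relids);
    pfree(appinfos);

    ec->ec_child_member_cache = lappend(ec->ec_child_member_cache, child_em);
    return child_em;
}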

[1] 
https://www.postgresql.org/message-id/flat/64486b0b-0404-e39e-322d-0801154901f3%40postgrespro.ru

-- 
Best regards,
Yuya Watari




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-09-20 Thread Amit Kapila
On Wed, Sep 20, 2023 at 12:16 PM Amit Kapila  wrote:
>
> On Wed, Sep 20, 2023 at 11:00 AM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > Dear Amit,
>
> +int
> +count_old_cluster_logical_slots(void)
> +{
> + int dbnum;
> + int slot_count = 0;
> +
> + for (dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++)
> + slot_count += old_cluster.dbarr.dbs[dbnum].slot_arr.nslots;
> +
> + return slot_count;
> +}
>
> In this code, aren't we assuming that 'slot_arr.nslots' will be zero
> for versions <=PG16? On my Windows machine, this value is not zero but
> rather some uninitialized negative value which makes its caller try to
> allocate some undefined memory and fail. I think you need to
> initialize this in get_old_cluster_logical_slot_infos() for lower
> versions.
>

+{ oid => '8046', descr => 'for use by pg_upgrade',
+  proname => 'binary_upgrade_validate_wal_records',
+  prorows => '10', proretset => 't', provolatile => 's', prorettype => 'bool',
+  proargtypes => 'pg_lsn', proallargtypes => '{pg_lsn,bool}',
+  proargmodes => '{i,o}', proargnames => '{start_lsn,is_ok}',
+  prosrc => 'binary_upgrade_validate_wal_records' },

In this many of the fields seem bogus. For example, we don't need
prorows => '10', proretset => 't' for this function. Similarly
proargmodes also look incorrect as we don't have any out parameter.

-- 
With Regards,
Amit Kapila.




Re: Comment about set_join_pathlist_hook()

2023-09-20 Thread David Rowley
On Wed, 20 Sept 2023 at 22:06, Etsuro Fujita  wrote:
> So I would like to propose to extend the comment to explain what they
> can do, as in the comment about set_rel_pathlist_hook() in allpaths.c.
> Attached is a patch for that.

Looks good to me.

I see you've copy/edited the comment just above the call to
set_rel_pathlist_hook(). That makes sense.

David




Re: [PoC] Reducing planning time when tables have many partitions

2023-09-20 Thread Ashutosh Bapat
On Wed, Sep 20, 2023 at 3:35 PM Yuya Watari  wrote:

> I think we may be able to remove the eclass_child_members field by
> making child members on demand. v20 makes child members at
> add_[child_]join_rel_equivalences() and adds them into
> RelOptInfo->eclass_child_members. Instead of doing that, if we
> translate on demand when child members are requested,
> RelOptInfo->eclass_child_members is no longer necessary. After that,
> there is only ec_members, which consists of parent members, so
> removing clauses will still be simple. Do you think this idea will
> solve your problem? If so, I will experiment with this and share a new
> patch version. The main concern with this idea is that the same child
> member will be created many times, wasting time and memory. Some
> techniques like caching might solve this.
>

While working on RestrictInfo translations patch I was thinking on
these lines. [1] uses hash table for storing translated RestrictInfo.
An EC can have a hash table to store ec_member translations. The same
patchset also has some changes in the code which generates
RestrictInfo clauses from ECs. I think that code will be simplified by
your approach.

[1] 
https://www.postgresql.org/message-id/caexhw5u0yyyr2mwvlrvvy_qnld65kpc9u-bo0ox7bglkgba...@mail.gmail.com

-- 
Best Wishes,
Ashutosh Bapat




Re: bug fix and documentation improvement about vacuumdb

2023-09-20 Thread Daniel Gustafsson
> On 20 Sep 2023, at 11:46, Kuwamura Masaki  
> wrote:

> I think that supporting pattern matching would be quite nice.
> But I suspect it would be not only tough but also a breaking change.
> So I guess this change should be committed either way.

I agree.  Supporting pattern matching should, if anyone is interested in
trying, be done separately in its own thread; no need to move the goalposts
here. Sorry if I made it sound like that upthread.

> The attached patch includes new tests for this bug.
> Also, I fixed the current test for -N option seems to be incorrect.

When sending an update, please include the previous patch as well with your new
tests as a 0002 patch in a patchset.  The CFBot can only apply and build/test
patches when the entire patchset is attached to the email.  The below
testresults indicate that the patch failed the new tests (which in a way is
good since without the fix the tests *should* fail), since the fix patch was
not applied:

http://cfbot.cputube.org/masaki-kuwamura.html

--
Daniel Gustafsson





Re: persist logical slots to disk during shutdown checkpoint

2023-09-20 Thread Ashutosh Bapat
On Thu, Sep 14, 2023 at 2:41 PM Amit Kapila  wrote:
>
> On Thu, Sep 14, 2023 at 7:20 AM Michael Paquier  wrote:
> >
> > On Wed, Sep 13, 2023 at 04:20:37PM +0530, Amit Kapila wrote:
> > > The patch is updated as per recent discussion.
> >
> > WFM.  Thanks for the updated version.
> >
>
> Pushed.

Commitfest entry https://commitfest.postgresql.org/44/4536/ is in
"Ready for committer" state. Is there something remaining here? We
should probably set it as "committed".

-- 
Best Wishes,
Ashutosh Bapat




Re: persist logical slots to disk during shutdown checkpoint

2023-09-20 Thread Michael Paquier
On Wed, Sep 20, 2023 at 04:48:00PM +0530, Ashutosh Bapat wrote:
> Commitfest entry https://commitfest.postgresql.org/44/4536/ is in
> "Ready for committer" state. Is there something remaining here? We
> should probably set it as "committed".

Thanks, I've switched that to "Committed".
--
Michael


signature.asc
Description: PGP signature


RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-09-20 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> +int
> +count_old_cluster_logical_slots(void)
> +{
> + int dbnum;
> + int slot_count = 0;
> +
> + for (dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++)
> + slot_count += old_cluster.dbarr.dbs[dbnum].slot_arr.nslots;
> +
> + return slot_count;
> +}
> 
> In this code, aren't we assuming that 'slot_arr.nslots' will be zero
> for versions <=PG16? On my Windows machine, this value is not zero but
> rather some uninitialized negative value which makes its caller try to
> allocate some undefined memory and fail. I think you need to
> initialize this in get_old_cluster_logical_slot_infos() for lower
> versions.

Good catch; I did not notice it because it worked well on my RHEL machine.
Here is the updated version.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v42-0001-pg_upgrade-Allow-to-replicate-logical-replicatio.patch
Description:  v42-0001-pg_upgrade-Allow-to-replicate-logical-replicatio.patch


Comment about set_join_pathlist_hook()

2023-09-20 Thread Etsuro Fujita
Hi,

What I am concerned about from the report [1] is that this comment is
a bit too terse; it might cause a misunderstanding that extensions can
do different things than we intend to allow:

/*
 * 6. Finally, give extensions a chance to manipulate the path list.
 */
if (set_join_pathlist_hook)
set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
   jointype, &extra);

So I would like to propose to extend the comment to explain what they
can do, as in the comment about set_rel_pathlist_hook() in allpaths.c.
Attached is a patch for that.

Best regards,
Etsuro Fujita

[1] 
https://www.postgresql.org/message-id/CACawEhV%3D%2BQ0HXrcDergbTR9EkVFukgRPMTZbRFL-YK5CRmvYag%40mail.gmail.com


update-set_join_pathlist_hook-comment.patch
Description: Binary data


RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-09-20 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

Thank you for reviewing! New version can be available in [1].

> 
> +{ oid => '8046', descr => 'for use by pg_upgrade',
> +  proname => 'binary_upgrade_validate_wal_records',
> +  prorows => '10', proretset => 't', provolatile => 's', prorettype => 
> 'bool',
> +  proargtypes => 'pg_lsn', proallargtypes => '{pg_lsn,bool}',
> +  proargmodes => '{i,o}', proargnames => '{start_lsn,is_ok}',
> +  prosrc => 'binary_upgrade_validate_wal_records' },
> 
> In this many of the fields seem bogus. For example, we don't need
> prorows => '10', proretset => 't' for this function. Similarly
> proargmodes also look incorrect as we don't have any out parameter.
>

That part was written in old versions and has been kept till now. I
rechecked the fields and changed them as below:

* This function just returns a boolean, so proretset was changed to 'f'.
* Based on the above, prorows should be zero. Removed.
* The returned value depends heavily on internal state, so provolatile
  was changed to 'v'.
* There are no OUT or INOUT arguments, so there is no need to set
  proallargtypes and proargmodes. Removed.
* Anonymous arguments are allowed, so proargnames was removed.
* This function is not expected to be called in parallel, so proparallel
  was set to 'u'.
* The argument must not be NULL, and we should error out in that case,
  so proisstrict was changed to 'f'. Also, a NULL check was added to the
  function.

[1]: 
https://www.postgresql.org/message-id/TYAPR01MB586615579356A84A8CF29A00F5F9A%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Memory consumed by child SpecialJoinInfo in partitionwise join planning

2023-09-20 Thread Amit Langote
Hi Ashutosh,

On Wed, Aug 16, 2023 at 2:28 PM Ashutosh Bapat
 wrote:
> On Fri, Aug 4, 2023 at 2:11 PM Ashutosh Bapat
>  wrote:
> >
> > Attached patchset fixing those.
> > 0001 - patch to report planning memory, with to explain.out regression 
> > output fix. We may consider committing this as well.
> > 0002 - with your comment addressed above.
>
> 0003 - Added this patch for handling SpecialJoinInfos for inner joins.
> These SpecialJoinInfos are created on the fly for parent joins. They
> can be created on the fly for child joins as well without requiring
> any translations. Thus they do not require any new memory. This patch
> is intended to be merged into 0002 finally.

I read this thread and have been reading the latest patch.  At first
glance, it seems quite straightforward to me.  I agree with Richard
that pfree()'ing 4 bitmapsets may not be a lot of added overhead.  I
will study the patch a bit more.

Just one comment on 0003:

+   /*
+* Dummy SpecialJoinInfos do not have any translated fields and hence have
+* nothing to free.
+*/
+   if (child_sjinfo->jointype == JOIN_INNER)
+   return;

Should this instead be Assert(child_sjinfo->jointype != JOIN_INNER)?

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




Re: There should be a way to use the force flag when restoring databases

2023-09-20 Thread Daniel Gustafsson
> On 20 Sep 2023, at 11:24, Peter Eisentraut  wrote:
> 
> On 06.08.23 21:39, Ahmed Ibrahim wrote:
>> I have addressed the pg version compatibility with the FORCE option in drop. 
>> Here is the last version of the patch
> 
> The patch is pretty small, but I think there is some disagreement whether we 
> want this option at all?  Maybe some more people can make their opinions more 
> explicit?

My concern is that a --force parameter conveys to the user that it's a big
hammer to override things and get them done, when in reality this doesn't do
that.  Taking the example from the pg_restore documentation, which currently
has a dropdb step:


:~ $ ./bin/createdb foo
:~ $ ./bin/psql -c "create table t(a int);" foo
CREATE TABLE
:~ $ ./bin/pg_dump --format=custom -f foo.dump foo
:~ $ ./bin/pg_restore -d foo -C -c --force foo.dump
pg_restore: error: could not execute query: ERROR:  cannot drop the currently open database
Command was: DROP DATABASE foo WITH(FORCE);
pg_restore: error: could not execute query: ERROR:  database "foo" already exists
Command was: CREATE DATABASE foo WITH TEMPLATE = template0 ENCODING = 'UTF8' LOCALE_PROVIDER = libc LOCALE = 'en_US.UTF-8';


pg_restore: error: could not execute query: ERROR:  relation "t" already exists
Command was: CREATE TABLE public.t (
a integer
);


pg_restore: warning: errors ignored on restore: 3


Without knowing internals, I would expect an option named --force to make that
just work, especially given the documentation provided in this patch.  I think
the risk for user confusion outweighs the benefits, or maybe I'm just not smart
enough to see all the benefits?  If so, I would argue that more documentation
is required.

Skimming the patch itself, it updates the --help output with --force for
pg_dump but not for pg_restore.  Additionally it produces a compiler warning:

pg_restore.c:127:26: warning: incompatible pointer types initializing 'int *' with an expression of type 'bool *' [-Wincompatible-pointer-types]
        {"force", no_argument, &force, 1},
                               ^~~~~~
1 warning generated.

--
Daniel Gustafsson





Re: POC, WIP: OR-clause support for indexes

2023-09-20 Thread a.rybakina

Hi!

When I sent this patch version to the commitfest, I thought that the work
on this topic was complete. The patch version and test results are in [0].

But in the process of discussing this patch, we found out that there is
another place where the transformation can be made, specifically during
the calculation of selectivity. I implemented a raw version [1], but
unfortunately it didn't pass the regression tests.

I'm sorry that I didn't write about the status earlier; I was very
overwhelmed with tasks at work due to releases and preparations for the
conference. I have returned to work on this patch, and today or tomorrow
I'll post the new version.



[0]

https://www.postgresql.org/message-id/4bac271d-1700-db24-74ac-8414f2baf9fd%40postgrespro.ru

https://www.postgresql.org/message-id/11403645-b342-c400-859e-47d0f41ec22a%40postgrespro.ru

[1] 
https://www.postgresql.org/message-id/b301dce1-09fd-72b1-834a-527ca428db5e%40yandex.ru


On 20.09.2023 12:37, Peter Eisentraut wrote:

On 29.08.23 05:37, a.rybakina wrote:
Thank you for your interest in this problem and your help, and I'm sorry
that I didn't respond to this email for a long time. To be honest, I
wanted to investigate the problems in more detail and give a clearer
answer, but unfortunately I have not found anything more significant yet.


What is the status of this patch?  It is registered in the commitfest. 
It looks like a stalled research project?  The last posted patch 
doesn't contain any description or tests, so it doesn't look very ready.







Re: Infinite Interval

2023-09-20 Thread Dean Rasheed
On Wed, 20 Sept 2023 at 11:27, jian he  wrote:
>
> if I remove IntervalAggState's element: MemoryContext, it will not work.
> so I don't understand what the above sentence means.. Sorry. (it's
> my problem)
>

I don't see why it won't work. The point is to try to simplify
do_interval_accum() as much as possible. Looking at the current code,
I see a few places that could be simpler:

+X.day = newval->day;
+X.month = newval->month;
+X.time = newval->time;
+
+temp.day = state->sumX.day;
+temp.month = state->sumX.month;
+temp.time = state->sumX.time;

Why do we need these local variables X and temp? It could just add the
values from newval directly to those in state->sumX.

+/* The rest of this needs to work in the aggregate context */
+old_context = MemoryContextSwitchTo(state->agg_context);

Why? It's not allocating any memory here, so I don't see a need to
switch context.

So basically, do_interval_accum() could be simplified to:

static void
do_interval_accum(IntervalAggState *state, Interval *newval)
{
/* Count infinite intervals separately from all else */
if (INTERVAL_IS_NOBEGIN (newval))
{
state->nInfcount++;
return;
}
if (INTERVAL_IS_NOEND(newval))
{
state->pInfcount++;
return;
}

/* Update count of finite intervals */
state->N++;

/* Update sum of finite intervals */
if (unlikely(pg_add_s32_overflow(state->sumX.month, newval->month,
 &state->sumX.month)))
ereport(ERROR,
errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
errmsg("interval out of range"));

if (unlikely(pg_add_s32_overflow(state->sumX.day, newval->day,
 &state->sumX.day)))
ereport(ERROR,
errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
errmsg("interval out of range"));

if (unlikely(pg_add_s64_overflow(state->sumX.time, newval->time,
 &state->sumX.time)))
ereport(ERROR,
errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
errmsg("interval out of range"));

return;
}

and that can be further refactored, as described below, and similarly
for do_interval_discard(), except using pg_sub_s32/64_overflow().


> > Also, this needs to include serialization and deserialization
> > functions, otherwise these aggregates will no longer be able to use
> > parallel workers. That makes a big difference to queryE, if the size
> > of the test data is scaled up.
> >
> I tried, but failed. sum(interval) result is correct, but
> avg(interval) result is wrong.
>
> SELECT sum(b) ,avg(b)
> ,avg(b) = sum(b)/count(*) as should_be_true
> ,avg(b) * count(*) = sum(b) as should_be_true_too
> from interval_aggtest_1m; --1million row.
> The above query expects two bool columns to return true, but actually
> both return false. (I spent some time and found out that parallel mode
> makes the number of rows 1_000_002 when it should be 1_000_000.)
>

I think the reason for your wrong results is this code in
interval_avg_combine():

+if (state2->N > 0)
+{
+/* The rest of this needs to work in the aggregate context */
+old_context = MemoryContextSwitchTo(agg_context);
+
+/* Accumulate interval values */
+do_interval_accum(state1, &state2->sumX);
+
+MemoryContextSwitchTo(old_context);
+}

The problem is that using do_interval_accum() to add the 2 sums
together also adds 1 to the count N, making it incorrect. This code
should only be adding state2->sumX to state1->sumX, not touching
state1->N. And, as in do_interval_accum(), there is no need to switch
memory context.

Given that there are multiple places in this file that need to add
intervals, I think it makes sense to further refactor, and add a local
function to add 2 finite intervals, along the lines of the code above.
This can then be called from do_interval_accum(),
interval_avg_combine(), and interval_pl(). And similarly for
subtracting 2 finite intervals.
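
For example, the finite-addition helper could look something like this
(a sketch; the function name is hypothetical, while the overflow helpers
and INTERVAL_NOT_FINITE() come from the code under discussion):

static void
finite_interval_pl(const Interval *span1, const Interval *span2,
                   Interval *result)
{
    Assert(!INTERVAL_IS_NOBEGIN(span1) && !INTERVAL_IS_NOEND(span1));
    Assert(!INTERVAL_IS_NOBEGIN(span2) && !INTERVAL_IS_NOEND(span2));

    /* error out on overflow, or if the sum looks like an infinity */
    if (pg_add_s32_overflow(span1->month, span2->month, &result->month) ||
        pg_add_s32_overflow(span1->day, span2->day, &result->day) ||
        pg_add_s64_overflow(span1->time, span2->time, &result->time) ||
        INTERVAL_NOT_FINITE(result))
        ereport(ERROR,
                errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
                errmsg("interval out of range"));
}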

Regards,
Dean




Re: should frontend tools use syncfs() ?

2023-09-20 Thread Maxim Orlov
On Thu, 7 Sept 2023 at 03:34, Nathan Bossart 
wrote:

> Committed.
>

Hi! Great job!

But here is one problem I've encountered while working on some unrelated
stuff. Now we have two different things called by the same name – sync_method.
One in xlog:
int			sync_method = DEFAULT_SYNC_METHOD;
...and another one in "bins":
static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;

With the current include order, this is not a problem, but imagine you add
a couple of new includes, for example:
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -18,6 +18,8 @@
 #include "storage/block.h"
 #include "storage/item.h"
 #include "storage/off.h"
+#include "postgres.h"
+#include "utils/rel.h"

And the build will be broken, because we now have two different things
called "sync_method" with different types:
In file included from .../src/bin/pg_rewind/pg_rewind.c:33:
In file included from .../src/include/storage/bufpage.h:22:
In file included from .../src/include/utils/rel.h:18:
.../src/include/access/xlog.h:27:24: error: redeclaration of 'sync_method'
with a different type: 'int' vs 'DataDirSyncMethod' (aka 'enum
DataDirSyncMethod')
extern PGDLLIMPORT int sync_method;
...

As a solution, I suggest renaming sync_method in the xlog module to
wal_sync_method. In fact, the corresponding GUC for this variable is
called "wal_sync_method", and I see no reason not to use the exact same
name for the variable in the xlog module.

-- 
Best regards,
Maxim Orlov.
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 424ecba028f..3aa1fc60cb4 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -18,6 +18,8 @@
 #include "storage/block.h"
 #include "storage/item.h"
 #include "storage/off.h"
+#include "postgres.h"
+#include "utils/rel.h"
 
 /*
  * A postgres disk page is an abstraction layered on top of a postgres


v10-0001-Fix-conflicting-types-for-sync_method.patch
Description: Binary data


Re: Infinite Interval

2023-09-20 Thread Dean Rasheed
On Wed, 20 Sept 2023 at 13:09, Dean Rasheed  wrote:
>
> So basically, do_interval_accum() could be simplified to:
>

Oh, and I guess it also needs an INTERVAL_NOT_FINITE() check, to make
sure that finite values don't sum to our representation of infinity,
as in interval_pl().

Regards,
Dean




Re: Improve the log message output of basic_archive when basic_archive.archive_directory parameter is not set

2023-09-20 Thread Daniel Gustafsson
> On 19 Sep 2023, at 11:21, bt23nguyent  wrote:

> Please let me know if you have any further suggestions for how I can
> improve it.

+*logdetail = pstrdup("WAL archiving failed because basic_archive.archive_directory is not set");

Nitpick: detail messages should end with a period per the error message style
guide [0].

-archiving will proceed only when it returns true.
+archiving will proceed only when it returns true. The
+archiver may also emit the detail explaining how the module is not configured
+to the sever log if the archive module has any.

I think this paragraph needs to be updated to describe how the returned
logdetail is emitted, since it currently mentions the WARNING without the
added detail in case one is returned.  It would also be good to mention
that it should be an allocated string which the caller can free.
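
For illustration, the emitting side could look roughly like this (a
sketch with assumed names; the actual callback plumbing and signature
are whatever the patch defines):

char       *logdetail = NULL;

if (!ArchiveCallbacks->check_configured_cb(archive_module_state, &logdetail))
{
    ereport(WARNING,
            (errmsg("archive_mode enabled, yet archiving is not configured"),
             logdetail ? errdetail_internal("%s", logdetail) : 0));
    if (logdetail)
        pfree(logdetail);   /* the module returned an allocated string */
    return false;
}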

--
Daniel Gustafsson

[0] https://www.postgresql.org/docs/devel/error-style-guide.html



Re: pg_upgrade and logical replication

2023-09-20 Thread Amit Kapila
On Fri, Sep 15, 2023 at 3:08 PM vignesh C  wrote:
>
> The attached v8 version patch has the changes for the same.
>

Is the check to ensure remote_lsn is valid correct in function
check_for_subscription_state()? How about the case where the apply
worker didn't receive any change but just marked the relation as
'ready'?

Also, the patch seems to be allowing subscription relations from PG
>=10 to be migrated, but how will that work if the corresponding
publisher is also upgraded without slots? Won't the corresponding
workers start failing as soon as you restart the upgraded server? Do we
need to document the steps for users?

-- 
With Regards,
Amit Kapila.




Dump-restore loosing 'attnotnull' bit for DEFERRABLE PRIMARY KEY column(s).

2023-09-20 Thread Amul Sul
Hi,

On the latest master head, I can see a $subject bug that seems to be
related to commit b0e96f311985:

Here is the table definition:
create table foo(i int, j int, CONSTRAINT pk PRIMARY KEY(i) DEFERRABLE);

And after a restore from the dump, the relation descriptor shows column
'i' not marked NOT NULL:

=# \d foo
Table "public.foo"
 Column |  Type   | Collation | Nullable | Default
+-+---+--+-
 i  | integer |   |  |
 j  | integer |   |  |
Indexes:
"pk" PRIMARY KEY, btree (i) DEFERRABLE

The pg_attribute entry:

=# select attname, attnotnull from pg_attribute
where attrelid  = 'foo'::regclass and attnum > 0;

 attname | attnotnull
---------+------------
 i       | f
 j       | f
(2 rows)

-- 
Regards,
Amul Sul
EDB: http://www.enterprisedb.com


Re: Memory consumed by child SpecialJoinInfo in partitionwise join planning

2023-09-20 Thread Ashutosh Bapat
On Wed, Sep 20, 2023 at 5:24 PM Amit Langote  wrote:
>
> I read this thread and have been reading the latest patch.  At first
> glance, it seems quite straightforward to me.  I agree with Richard
> that pfree()'ing 4 bitmapsets may not be a lot of added overhead.  I
> will study the patch a bit more.

Thanks.

>
> Just one comment on 0003:
>
> +   /*
> +* Dummy SpecialJoinInfos do not have any translated fields and hence have
> +* nothing to free.
> +*/
> +   if (child_sjinfo->jointype == JOIN_INNER)
> +   return;
>
> Should this instead be Assert(child_sjinfo->jointype != JOIN_INNER)?

try_partitionwise_join() calls free_child_sjinfo_members()
unconditionally, i.e. it will be called even for SpecialJoinInfos of
INNER joins. The condition above filters those out early. In fact, if
we converted it into an Assert, in a production build the rest of the
code would free Relids which are referenced somewhere else, causing
dangling pointers.
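
To make that concrete, the function presumably has roughly this shape
(reconstructed from the thread's description of freeing 4 bitmapsets;
details hypothetical):

static void
free_child_sjinfo_members(SpecialJoinInfo *child_sjinfo)
{
    /*
     * Dummy SpecialJoinInfos of INNER joins are built on the fly and do
     * not have any translated fields, hence nothing to free.  The caller
     * invokes us unconditionally, so filter them out early; an Assert
     * would let a production build fall through and free Relids still
     * referenced elsewhere.
     */
    if (child_sjinfo->jointype == JOIN_INNER)
        return;

    bms_free(child_sjinfo->min_lefthand);
    bms_free(child_sjinfo->min_righthand);
    bms_free(child_sjinfo->syn_lefthand);
    bms_free(child_sjinfo->syn_righthand);
}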

--
Best Wishes,
Ashutosh Bapat




Re: Infinite Interval

2023-09-20 Thread Ashutosh Bapat
On Mon, Sep 18, 2023 at 5:09 PM Dean Rasheed  wrote:
>
> On Wed, 13 Sept 2023 at 11:13, Ashutosh Bapat
>  wrote:
> >
> > On Tue, Sep 12, 2023 at 2:39 PM Dean Rasheed  
> > wrote:
> > >
> > > and it looks like the infinite interval
> > > input code is broken.
> >
> > The code required to handle 'infinity' as an input value was removed by
> > d6d1430f404386162831bc32906ad174b2007776. I have added a separate
> > commit which reverts that commit as 0004, which should be merged into
> > 0003.
> >
>
> I think that simply reverting d6d1430f404386162831bc32906ad174b2007776
> is not sufficient. This does not make it clear what the point is of
> the code in the "case RESERV" block. That code really should check the
> value returned by DecodeSpecial(), otherwise invalid inputs are not
> caught until later, and the error reported is not ideal. For example:
>
> select interval 'now';
> ERROR:  unexpected dtype 12 while parsing interval "now"
>
> So DecodeInterval() should return DTERR_BAD_FORMAT in such cases (see
> similar code in DecodeTimeOnly(), for example).

Since the code was there earlier, I missed that part. Sorry.
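
For reference, the suggested handling amounts to something like this in
DecodeInterval()'s RESERV case (a sketch based on the discussion, not
the exact patch text):

case RESERV:
    type = DecodeSpecial(i, field[i], &val);

    /* only accept infinities here; reject "now" and friends */
    switch (val)
    {
        case DTK_LATE:
            dtype = DTK_LATE;
            break;
        case DTK_EARLY:
            dtype = DTK_EARLY;
            break;
        default:
            return DTERR_BAD_FORMAT;
    }

    /* itm_in is deliberately left untouched: infinite intervals
     * carry no finite field values */
    break;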

>
> I'd also suggest a comment to indicate why itm_in isn't updated in
> this case (see similar case in DecodeDateTime(), for example).
>

Added but in the function prologue since it's part of the API.

>
> Another point to consider is what should happen if "ago" is specified
> with infinite inputs. As it stands, it is accepted, but does nothing:
>
> select interval 'infinity ago';
>  interval
> --
>  infinity
> (1 row)
>
> select interval '-infinity ago';
>  interval
> ---
>  -infinity
> (1 row)
>
> This could be made to invert the sign, as it does for finite inputs,
> but I think perhaps it would be better to simply reject such inputs.

Fixed this. After studying what DecodeInterval is doing, I think it's
better to treat all infinity specifications similarly to "ago": they
need to be the last part of the input string. The rest of the code makes
sure that nothing precedes an infinity specification, as the other case
blocks do not handle RESERV or DTK_LATE and DTK_EARLY. This means that
"+infinity", "-infinity" or "infinity" can be the only word in a valid
interval input when one of them is specified.

I will post these changes in another email along with other patches.

--
Best Wishes,
Ashutosh Bapat




Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

2023-09-20 Thread David Geier

Hi Dmitry,

Thanks for looking at the patch and sorry for the long line.

On 3/17/23 21:14, Dmitry Dolgov wrote:

The idea sounds reasonable to me, but I have to admit snapshot_and_stats
implementation looks awkward. Maybe it would be better to have a
separate structure field for both stats and snapshot, which will be set
to point to a corresponding place in the shared FAM e.g. when the worker
is getting initialized? shm_toc_allocate mentions BUFFERALIGN to handle
possibility of some atomic operations needing it, so I guess that would
have to be an alignment in this case as well.

Probably another option would be to allocate two separate pieces of
shared memory, which resolves questions like proper alignment, but
annoyingly will require an extra lookup and a new key.


I considered the other options and it seems to me none of them is 
particularly superior. All of them have pros and cons with the cons 
mostly outweighing the pros. Let me quickly elaborate:


1. Use multiple shm_toc entries: Shared state is split into multiple
pieces. Extra pointers in BitmapHeapScanState are needed to point at the
split-out data. BitmapHeapScanState already has a shared_info member,
which is not a pointer to the shared memory but a pointer to the
leader-local data allocated to store the instrumentation data from the
workers. This is confusing but at least consistent with how it's done in
other places (e.g. execSort.c, nodeHash.c, nodeIncrementalSort.c).
Having another pointer there which points to the shared memory makes it
even more confusing. If we go this way we would have e.g.
shared_info_copy and shared_info members in BitmapHeapScanState.


2. Store two extra pointers to the shared FAM entries in 
BitmapHeapScanState: IMHO, that is the better alternative of (1) as it 
doesn't need an extra TOC entry but comes with the same confusion of 
multiple pointers to SharedBitmapHeapScanInfo in BitmapHeapScanState. 
But maybe that's not too bad?


3. Solution in initial patch (use two functions to obtain pointers 
where/when needed): Avoids the need for another pointer in 
BitmapHeapScanState at the cost / ugliness of having to call the helper 
functions.


Another, not yet discussed, option I can see work is:

4. Allocate a fixed amount of memory for the instrumentation stats based
on MAX_PARALLEL_WORKER_LIMIT: MAX_PARALLEL_WORKER_LIMIT is 1024 and is
used as the limit of the max_parallel_workers GUC. This way
MAX_PARALLEL_WORKER_LIMIT * sizeof(BitmapHeapScanInstrumentation) = 1024
* 8 = 8192 bytes would be allocated. To cut this down in half we could
additionally change the type of lossy_pages and exact_pages from long to
uint32. Only the memory actually needed would have to be initialized; the
remaining unused memory would stay untouched so as not to waste cycles.
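
In code, option (4) would amount to something like this (hypothetical
shape; struct and field names are illustrative, not from the patch):

typedef struct BitmapHeapScanInstrumentation
{
    uint32      exact_pages;    /* narrowed from long to halve the size */
    uint32      lossy_pages;
} BitmapHeapScanInstrumentation;

typedef struct SharedBitmapHeapScanInfo
{
    /* ... snapshot data and other shared state ... */
    int         num_workers;
    /* fixed-size, no FAM: only the first num_workers entries get used */
    BitmapHeapScanInstrumentation sinstrument[MAX_PARALLEL_WORKER_LIMIT];
} SharedBitmapHeapScanInfo;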


My first preference is the new option (4). My second preference is 
option (1). What's your take?


--
David Geier
(ServiceNow)





Re: Infinite Interval

2023-09-20 Thread Ashutosh Bapat
On Wed, Sep 20, 2023 at 5:44 PM Dean Rasheed  wrote:
>
> On Wed, 20 Sept 2023 at 13:09, Dean Rasheed  wrote:
> >
> > So basically, do_interval_accum() could be simplified to:
> >
>
> Oh, and I guess it also needs an INTERVAL_NOT_FINITE() check, to make
> sure that finite values don't sum to our representation of infinity,
> as in interval_pl().

Fixed in the latest patch set.

-- 
Best Wishes,
Ashutosh Bapat




Re: Disabling Heap-Only Tuples

2023-09-20 Thread Stephen Frost
Greetings,

* Laurenz Albe (laurenz.a...@cybertec.at) wrote:
> On Tue, 2023-09-19 at 12:52 -0400, Robert Haas wrote:
> > On Tue, Sep 19, 2023 at 12:30 PM Alvaro Herrera  
> > wrote:
> > > I was thinking something vaguely like "a table size that's roughly what
> > > an optimal autovacuuming schedule would leave the table at" assuming 0.2
> > > vacuum_scale_factor.  You would determine the absolute minimum size for
> > > the table given the current live tuples in the table, then add 20% to
> > > account for a steady state of dead tuples and vacuumed space.  So it's
> > > not 1.2x of the "current" table size at the time the local_update_limit
> > > feature is installed, but 1.2x of the optimal table size.
> > 
> > Right, that would be great. And honestly if that's something we can
> > figure out, then why does the parameter even need to be an integer
> > instead of a Boolean? If the system knows the optimal table size, then
> > the user can just say "try to compact this table" and need not say to
> > what size. The 1.2 multiplier is probably situation dependent and
> > maybe the multiplier should indeed be a configuration parameter, but
> > we would be way better off if the absolute size didn't need to be.
> 
> I don't have high hopes for a reliable way to automatically determine
> the target table size.  There are these queries floating around to estimate
> table bloat, which are used by various monitoring systems.  I find that they
> get it right a lot of the time, but sometimes they get it wrong.  Perhaps
> we can do better than that, but I vastly prefer a setting that I can control
> (even at the danger that I can misconfigure it) over an automatism that I
> cannot control and that sometimes gets it wrong.

I'm not completely against a setting, but I would certainly prefer that
this be done in a more automated way, if possible.

To that end, my thought would be some kind of regular review of the FSM,
or maybe actual review by walking through the table (as VACUUM already
does...) to get an idea of where there's space and where there's used up
areas and then use that to inform various operations (either VACUUM
itself or perhaps UPDATEs from SQL).  We could also try to 'start
simple' and look for cases that we can say "well, that's definitely not
good" and address those initially.

Consider (imagine as a histogram; X is used space, . is empty):

 1: XXX
 2: XXX
 3: XXX
 4: XXX
 5: X
 6: X
 7: .
 8: .
 9: .
10: .
11: .
12: .
13: .
14: .
15: .
16: .
17: .
18: .
19: .
20: X

Well, obviously there's tons of free space in the middle and if we could
just move those few tuples/pages/whatever that are near the end to
earlier in the table then we'd be able to truncate off and shrink a
lot of the table.

> I like Alvaro's idea to automatically reset "local_update_limit" when the
> table has shrunk enough.  Why not perform that task during vacuum truncation?
> If vacuum truncation has taken place, check if the table size is no bigger
> than "local_update_limit" * (1 + "autovacuum_vacuum_scale_factor"), and if
> it is no bigger, reset "local_update_limit".  That way, we would not have
> to worry about a lock, because vacuum truncation already has the table locked.

Agreed on this too.  Essentially, once we've done some truncation, we
should 'reset'.
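
As a rough illustration of that check (all names here are illustrative,
not existing code):

	/* after vacuum has truncated the relation ... */
	BlockNumber target = (BlockNumber)
		(local_update_limit * (1.0 + autovacuum_vacuum_scale_factor));

	if (RelationGetNumberOfBlocks(rel) <= target)
		reset_local_update_limit(rel);	/* clear the reloption */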

I've no doubt that there's some better algorithm for this, but I keep
coming back to something as simple as- if the entire second half of the
table will fit into the entire first half then the table is twice as
large as it needs to be and perhaps that triggers a preference for
placing tuples in the first half of the table.  As for what handles
this- maybe have both UPDATE and VACUUM able to, but prefer for UPDATE
to do so and only have VACUUM kick in once the tuples at the end of the
relation are older than some xid-based threshold (perhaps all of the
tuples on a given page have to be old enough?).

While it feels a bit 'late' in terms of when to start taking this
action, we could possibly start with 'all frozen' as an indicator of
'old enough'?  Then, between the FSM and the VM, VACUUM could decide
that pages at the end of the table should be moved to be earlier and go
about making that happen.  I'm a bit concerned about the risk of some
kind of deadlock or similar happening between VACUUM and user processes
if we're trying to do this with multiple tuples at a time but hopefully
we could come up with a way to avoid that.  This process naturally would
have to involve updating indexes and the VM and FSM as the tuples get
moved.

In terms of what this would look like, my thinking is that VACUUM would
scan the table and the FSM and perhaps the VM and then say "ok, this
table is bigger than it needs to be, let's try to fix that" and then set
a flag on the table, which a user could also explicitly set to give them
control over this process happening sooner or not happening at all, and
that would indicate to UPDATE to prefer earlier pages over the current
page or HOT updates, w

Re: Index range search optimization

2023-09-20 Thread Alexander Korotkov
Hi, Peter!

Thank you for your interest in this patch.

On Tue, Sep 19, 2023 at 1:48 AM Peter Geoghegan  wrote:
>
> On Thu, Sep 14, 2023 at 3:23 AM Alexander Korotkov  
> wrote:
> > The attached patch allows Postgres to skip scan keys required for 
> > directional scans (even when other keys are present in the scan).  I'll 
> > soon post the testing results and a more polished version of this patch.
>
> This is very interesting to me, partly because it seems related to my
> ongoing work on SAOP execution within nbtree.
>
> My patch gives _bt_readpage and particularly _bt_checkkeys more
> high-level context, which they use to intelligently control the scan.
> That enables us to dynamically decide whether we should now perform
> another descent of the index via another call to _bt_first, or if we
> should prefer to continue on the leaf level for now. Maybe we will
> match many distinct sets of array keys on the same leaf page, in the
> same call to _bt_readpage. We don't want to miss out on such
> opportunities, but we also want to quickly notice when we're on a page
> where matching any more array keys is just hopeless.
>
> There is a need to keep these two things in balance. We need to notice
> the hopeless cases before wasting too many cycles on it. That creates
> a practical need to do an early precheck of the high key (roughly the
> same check that we do already). If the high key indicates that
> continuing on this page is truly hopeless, then we should give up and
> do another primitive index scan -- _bt_first will reposition us onto
> the leaf page that we need to go to next, which is (hopefully) far
> away from the leaf page we started on.

This is a pretty neat optimization indeed!

> Your patch therefore has the potential to help my own patch. But, it
> also has some potential to conflict with it, because my patch makes
> the meaning of SK_BT_REQFWD and SK_BT_REQBKWD more complicated (though
> only in cases where we have SK_SEARCHARRAY scan keys). I'm sure that
> this can be managed sensibly, though.

OK!  Let me know if you feel that I need to change something in this
patch to lower the potential conflict.

> Some feedback on your patch:
>
> * I notice that you're not using the high key for this, even in a
> forward scan -- you're using the last non-pivot tuple on the page. Why
> is that? (I have some idea why, actually, but I'd like to hear your
> thoughts first.)

I'm using the last non-pivot tuple on the page instead of hikey
because it's lower than hikey.  As you stated below, the most
significant column in the hikey is likely different from that of the
last non-pivot tuple.  So, it's more likely to use the optimization
with the last non-pivot tuple.

> * Separately, I don't think that it makes sense to use the same
> requiredDirMatched value (which came from the last non-pivot tuple on
> the page) when the special _bt_checkkeys call for the high key takes
> place. I don't think that this will lead to wrong answers, but it's
> weird, and is likely to defeat the existing optimization in some
> important cases.
>
> Due to the influence of suffix truncation, it's relatively likely that
> the most significant column in the high key will be different to the
> corresponding value from the last few non-pivot tuples on the page --
> the high key tends to be "aligned with natural boundaries in the key
> space", and so "gives us a good preview of the right sibling page". We
> don't want to treat it the same as non-pivot tuples here, because it's
> quite different, in ways that are subtle but still important.

This definitely makes sense. I've removed the usage of
requiredDirMatched from this _bt_checkkeys() call.

> * I would avoid using the terminology "preprocess scan keys" for this.
> That exact terminology is already used to describe what
> _bt_preprocess_keys() does.
>
> That function is actually involved in Konstantin's patch, so that
> could be very confusing. When we "preprocess" a scan key, it outputs a
> processed scan key with markings such as the required markings that
> you're using in the patch -- it's something that acts on/changes the
> scan keys themselves. Whereas your patch is exploiting information
> from already-processed scan keys, by applying it to the key space of a
> page.
>
> I suggest calling it "prechecking the page", or something like that. I
> don't feel very strongly about what you call it, provided it isn't
> confusing or ambiguous.


This also makes sense. I've rephrased the comment.

--
Regards,
Alexander Korotkov


0001-Skip-checking-of-scan-keys-required-for-direction-v2.patch
Description: Binary data


Re: Infinite Interval

2023-09-20 Thread Dean Rasheed
On Wed, 20 Sept 2023 at 15:00, Ashutosh Bapat
 wrote:
>
> 0005 - Refactored Jian's code fixing window functions. Does not
> contain the changes for serialization and deserialization. Jian,
> please let me know if I have missed anything else.
>

That looks a lot neater. One thing I don't care for is this code
pattern in finite_interval_pl():

+result->month = span1->month + span2->month;
+/* overflow check copied from int4pl */
+if (SAMESIGN(span1->month, span2->month) &&
+!SAMESIGN(result->month, span1->month))
+ereport(ERROR,

The problem is that this is a bug waiting to happen for anyone who
uses this function with "result" pointing to the same Interval struct
as "span1" or "span2". I understand that the current code avoids this
by careful use of temporary Interval structs, but it's still a pretty
ugly pattern. This can be avoided by using pg_add_s32/64_overflow(),
which then allows the callers to be simplified, getting rid of the
temporary Interval structs and memcpy()'s.
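
For illustration, a minimal sketch of that rewrite, assuming the usual
Interval layout (int32 month and day, int64 time) and the
INTERVAL_NOT_FINITE() check mentioned earlier:

static void
finite_interval_pl(const Interval *span1, const Interval *span2,
				   Interval *result)
{
	/*
	 * The overflow helpers from common/int.h read both inputs before
	 * writing through the out parameter, so "result" may safely alias
	 * span1 or span2, and callers need no temporary Interval structs.
	 */
	if (pg_add_s32_overflow(span1->month, span2->month, &result->month) ||
		pg_add_s32_overflow(span1->day, span2->day, &result->day) ||
		pg_add_s64_overflow(span1->time, span2->time, &result->time) ||
		INTERVAL_NOT_FINITE(result))
		ereport(ERROR,
				(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
				 errmsg("interval out of range")));
}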

Also, in do_interval_discard(), this seems a bit risky:

+neg_val.day = -newval->day;
+neg_val.month = -newval->month;
+neg_val.time = -newval->time;

because it could in theory turn a finite large negative interval into
an infinite one (-INT_MAX -> INT_MAX), leading to an assertion failure
in finite_interval_pl(). Now maybe that's not possible for some other
reasons, but I think we may as well do the same refactoring for
interval_mi() as we're doing for interval_pl() -- i.e., introduce a
finite_interval_mi() function, making the addition and subtraction
code match, and removing the need for neg_val in
do_interval_discard().

Regards,
Dean




Re: Dump-restore loosing 'attnotnull' bit for DEFERRABLE PRIMARY KEY column(s).

2023-09-20 Thread Alvaro Herrera
On 2023-Sep-20, Amul Sul wrote:

> On the latest master head, I can see a $subject bug that seems to be related
> commit #b0e96f311985:
> 
> Here is the table definition:
> create table foo(i int, j int, CONSTRAINT pk PRIMARY KEY(i) DEFERRABLE);

Interesting, thanks for the report.  Your attribution to that commit is
correct.  The table is dumped like this:

CREATE TABLE public.foo (
i integer CONSTRAINT pgdump_throwaway_notnull_0 NOT NULL NO INHERIT,
j integer
);
ALTER TABLE ONLY public.foo
ADD CONSTRAINT pk PRIMARY KEY (i) DEFERRABLE;
ALTER TABLE ONLY public.foo DROP CONSTRAINT pgdump_throwaway_notnull_0;

so the problem here is that the deferrable PK is not considered a reason
to keep attnotnull set, so we produce a throwaway constraint that we
then drop.  This is already bogus, but what is more bogus is the fact
that the backend accepts the DROP CONSTRAINT at all.

The pg_dump failing should be easy to fix, but fixing the backend to
error out sounds more critical.  So, the reason for this behavior is
that RelationGetIndexList doesn't want to list an index that isn't
marked indimmediate as a primary key.  I can easily hack around that by
doing

diff --git a/src/backend/utils/cache/relcache.c 
b/src/backend/utils/cache/relcache.c
index 7234cb3da6..971d9c8738 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -4794,7 +4794,6 @@ RelationGetIndexList(Relation relation)
 * check them.
 */
if (!index->indisunique ||
-   !index->indimmediate ||
!heap_attisnull(htup, Anum_pg_index_indpred, 
NULL))
continue;
 
@@ -4821,6 +4820,9 @@ RelationGetIndexList(Relation relation)
 relation->rd_rel->relkind == 
RELKIND_PARTITIONED_TABLE))
pkeyIndex = index->indexrelid;
 
+   if (!index->indimmediate)
+   continue;
+
if (!index->indisvalid)
continue;


But of course this is not great, since it impacts unrelated bits of code
that are relying on relation->pkindex or RelationGetIndexAttrBitmap
having their current behavior with non-immediate index.

I think a real solution is to stop relying on RelationGetIndexAttrBitmap
in ATExecDropNotNull().  (And, again, pg_dump needs some patch as well
to avoid printing a throwaway NOT NULL constraint at this point.)

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: disfavoring unparameterized nested loops

2023-09-20 Thread Lepikhov Andrei



On Wed, Sep 20, 2023, at 4:49 PM, David Rowley wrote:
> On Wed, 20 Sept 2023 at 19:56, Andrey Lepikhov
>  wrote:
>> What could you say about a different way: hybrid join? In MS SQL Server,
>> they have such a feature [1], and, according to their description, it
>> requires low overhead. They start from HashJoin and switch to NestLoop
>> if the inner input contains too few tuples. It solves the issue, doesn't it?
>
> A complexity which you may not be considering here is that Nested Loop
> joins always preserve the tuple order from the outer side of the join,
> whereas hash joins will not do this when multi-batching.

My idea here is the same as what the MS SQL engineers did: prefetch some
predefined number of tuples from the HashJoin's inner side and, if the
planner has made a mistake and overestimated that side, switch the hash
join's inner to being the outer of a NestLoop.
The opposite strategy for underestimation, starting with NestLoop and
switching to HashJoin, looks more difficult, but the main question is:
is it worthwhile to research?
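
To make this concrete, a hypothetical executor-level sketch; none of
these names exist in PostgreSQL today, and the threshold is arbitrary:

typedef enum
{
	JOIN_AS_HASHJOIN,
	JOIN_AS_NESTLOOP
} HybridJoinChoice;

static HybridJoinChoice
hybrid_join_choose(PlanState *inner_ps, Tuplestorestate *buffer)
{
	int			nbuffered = 0;

	for (;;)
	{
		TupleTableSlot *slot = ExecProcNode(inner_ps);

		if (TupIsNull(slot))
			return JOIN_AS_NESTLOOP;	/* tiny inner: drive a NestLoop */

		/* keep the prefetched tuples so either strategy can replay them */
		tuplestore_puttupleslot(buffer, slot);

		if (++nbuffered >= 64)			/* illustrative threshold */
			return JOIN_AS_HASHJOIN;	/* big enough: build the hash table */
	}
}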

> I've no idea how the SQL Server engineers solved that.

>> [1]
>> https://techcommunity.microsoft.com/t5/sql-server-blog/introducing-batch-mode-adaptive-joins/ba-p/385411

-- 
Regards,
Andrei Lepikhov




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2023-09-20 Thread Damir
> Although the v7 patch doesn't have a commit message, I think keeping a
> commit message is good for reviewers.

Sure, didn't notice it. Added the commit message to the updated patch.

>  * Note: DON'T convert this error to "soft" style (errsave/ereturn). We
>  * want this data type to stay permanently in the hard-error world so that
>  * it can be used for testing that such cases still work reasonably.
>
> From this point of view, I think this is the intended way of using widget.

I agree, it's a good approach for checking datatype errors, because
that's what was intended.

> OTOH widget is declared in create_type.sql and I'm not sure it's ok to
> use it in another test copy2.sql.

I think future regress tests that use the 'widget' type need not live
only in create_type.sql, so it's not a problem that some type or
function is taken from another regress test. For example, the table
'onek' is used in many regress tests.



Regards,

Damir Belyalov

Postgres Professional
From 0e1193e00bb5ee810a015a2baaf7c79e395a54c7 Mon Sep 17 00:00:00 2001
From: Damir Belyalov 
Date: Fri, 15 Sep 2023 11:14:57 +0300
Subject: [PATCH v7] Add new COPY option IGNORE_DATATYPE_ERRORS

Currently entire COPY fails even when there is one unexpected data
regarding data type or range.
IGNORE_DATATYPE_ERRORS ignores these errors and skips them and COPY
data which don't contain problem.

This patch uses the soft error handling infrastructure, which is
introduced by d9f7f5d32f20.

Author: Damir Belyalov, Atsushi Torikoshi

---
 doc/src/sgml/ref/copy.sgml   | 13 +
 src/backend/commands/copy.c  | 13 +
 src/backend/commands/copyfrom.c  | 37 
 src/backend/commands/copyfromparse.c | 20 ++---
 src/bin/psql/tab-complete.c  |  3 +-
 src/include/commands/copy.h  |  1 +
 src/include/commands/copyfrom_internal.h |  3 ++
 src/test/regress/expected/copy2.out  | 28 ++
 src/test/regress/sql/copy2.sql   | 26 +
 9 files changed, 139 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 4d614a0225..d5cdbb4025 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -43,6 +43,7 @@ COPY { table_name [ ( column_name [, ...] ) | * }
 FORCE_NOT_NULL ( column_name [, ...] )
 FORCE_NULL ( column_name [, ...] )
+IGNORE_DATATYPE_ERRORS [ boolean ]
 ENCODING 'encoding_name'
 
  
@@ -370,6 +371,18 @@ COPY { table_name [ ( 

 
+   
+IGNORE_DATATYPE_ERRORS
+
+ 
+  Drops rows that contain malformed data while copying. These are rows
+  with columns where the data type's input-function raises an error.
+  This option is not allowed when using binary format.  Note that this
+  is only supported in current COPY syntax.
+ 
+
+   
+

 ENCODING
 
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index f14fae3308..beb73f5357 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -419,6 +419,7 @@ ProcessCopyOptions(ParseState *pstate,
 	bool		format_specified = false;
 	bool		freeze_specified = false;
 	bool		header_specified = false;
+	bool		ignore_datatype_errors_specified = false;
 	ListCell   *option;
 
 	/* Support external use for option sanity checking */
@@ -458,6 +459,13 @@ ProcessCopyOptions(ParseState *pstate,
 			freeze_specified = true;
 			opts_out->freeze = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "ignore_datatype_errors") == 0)
+		{
+			if (ignore_datatype_errors_specified)
+errorConflictingDefElem(defel, pstate);
+			ignore_datatype_errors_specified = true;
+			opts_out->ignore_datatype_errors = defGetBoolean(defel);
+		}
 		else if (strcmp(defel->defname, "delimiter") == 0)
 		{
 			if (opts_out->delim)
@@ -594,6 +602,11 @@ ProcessCopyOptions(ParseState *pstate,
 (errcode(ERRCODE_SYNTAX_ERROR),
  errmsg("cannot specify DEFAULT in BINARY mode")));
 
+	if (opts_out->binary && opts_out->ignore_datatype_errors)
+		ereport(ERROR,
+(errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("cannot specify IGNORE_DATATYPE_ERRORS in BINARY mode")));
+
 	/* Set defaults for omitted options */
 	if (!opts_out->delim)
 		opts_out->delim = opts_out->csv_mode ? "," : "\t";
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 70871ed819..b18aea6376 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -752,6 +752,14 @@ CopyFrom(CopyFromState cstate)
 		ti_options |= TABLE_INSERT_FROZEN;
 	}
 
+	/* Set up soft error handler for IGNORE_DATATYPE_ERRORS */
+	if (cstate->opts.ignore_datatype_errors)
+	{
+		ErrorSaveContext escontext = {T_ErrorSaveContext};
+		escontext.details_wanted = true;
+		cstate->escontext = escontext;
+	}
+
 	/*
 	 * We need a ResultRelInfo so we can use the regular executor's
 	 * index-entry-making mach

pg_upgrade --check fails to warn about abstime

2023-09-20 Thread Alvaro Herrera
I got a complaint that pg_upgrade --check fails to raise red flags when
the source database contains type abstime when upgrading from pg11.  The
type (along with reltime and tinterval) was removed by pg12.


In passing, while testing this, I noticed that the translation
infrastructure in pg_upgrade/util.c is broken: we do have the messages
in the translation catalog, but the translations for the messages from
prep_status are never displayed.  So I made the quick hack of adding _()
around the fmt, and this was the result:

Verificando Consistencia en Vivo en el Servidor Antiguo
---
Verificando las versiones de los clústerséxito
Verificando que el usuario de base de datos es el usuario de instalaciónéxito
Verificando los parámetros de conexión de bases de datoséxito
Verificando transacciones preparadas  éxito
Verificando tipos compuestos definidos por el sistema en tablas de usuarioéxito
Verificando tipos de datos reg* en datos de usuario   éxito
Verificando contrib/isn con discordancia en mecanismo de paso de bigintéxito
Checking for incompatible "aclitem" data type in user tables  éxito
Checking for removed "abstime" data type in user tables   éxito
Checking for removed "reltime" data type in user tables   éxito
Checking for removed "tinterval" data type in user tables éxito
Verificando conversiones de codificación definidas por el usuarioéxito
Verificando operadores postfix definidos por el usuario   éxito
Verificando funciones polimórficas incompatibles éxito
Verificando tablas WITH OIDS  éxito
Verificando columnas de usuario del tipo «sql_identifier»   éxito
Verificando la presencia de las bibliotecas requeridaséxito
Verificando que el usuario de base de datos es el usuario de instalaciónéxito
Verificando transacciones preparadas  éxito
Verificando los directorios de tablespaces para el nuevo clústeréxito

Note how nicely they line up ... not.  There is some code that claims to
do this correctly, but apparently it counts bytes, not characters, and
also it appears to be measuring the original rather than the
translation.

I think we're trimming the strings in the wrong places.  We need to
apply _() to the originals, not the trimmed ones.  Anyway, clearly
nobody has looked at this very much.
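
To illustrate the direction I mean, a rough sketch loosely based on
util.c (the character counting here assumes UTF-8 output and ignores
double-width characters):

static void
prep_status_sketch(const char *fmt,...)
{
	char		message[MAX_STRING];
	const char *p;
	va_list		args;
	int			nbytes;
	int			nchars = 0;

	va_start(args, fmt);
	/* translate the original format string, then format */
	vsnprintf(message, sizeof(message), _(fmt), args);
	va_end(args);

	nbytes = (int) strlen(message);
	for (p = message; *p; p++)
	{
		/* count characters by skipping UTF-8 continuation bytes */
		if ((*p & 0xC0) != 0x80)
			nchars++;
	}

	/* "%-*s" pads by bytes, so widen the field by the multibyte excess */
	pg_log(PG_REPORT, "%-*s", MESSAGE_WIDTH + (nbytes - nchars), message);
}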

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"We’ve narrowed the problem down to the customer’s pants being in a situation
 of vigorous combustion" (Robert Haas, Postgres expert extraordinaire)
From 0c11fb69019e0c694249eaa70eb325a015f107f1 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 19 Sep 2023 12:30:11 +0200
Subject: [PATCH] pg_upgrade: check for types removed in pg12

Commit cda6a8d01d39 removed a few datatypes, but didn't update
pg_upgrade --check to throw error if these types are used.  So the users
find that pg_upgrade --check tells them that everything is fine, only to
fail when the real upgrade is attempted.
---
 src/bin/pg_upgrade/check.c | 48 +-
 1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 56e313f562..3a72c5bfd7 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -26,6 +26,8 @@ static void check_for_tables_with_oids(ClusterInfo *cluster);
 static void check_for_composite_data_type_usage(ClusterInfo *cluster);
 static void check_for_reg_data_type_usage(ClusterInfo *cluster);
 static void check_for_aclitem_data_type_usage(ClusterInfo *cluster);
+static void check_for_removed_data_type_usage(ClusterInfo *cluster,
+			  const char *datatype);
 static void check_for_jsonb_9_4_usage(ClusterInfo *cluster);
 static void check_for_pg_role_prefix(ClusterInfo *cluster);
 static void check_for_new_tablespace_dir(void);
@@ -111,6 +113,16 @@ check_and_dump_old_cluster(bool live_check)
 	if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1500)
 		check_for_aclitem_data_type_usage(&old_cluster);
 
+	/*
+	 * PG 12 removed types abstime, reltime, tinterval.
+	 */
+	if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1100)
+	{
+		check_for_removed_data_type_usage(&old_cluster, "abstime");
+		check_for_removed_data_type_usage(&old_cluster, "reltime");
+		check_for_removed_data_type_usage(&old_cluster, "tinterval");
+	}
+
 	/*
 	 * PG 14 changed the function signature of encoding conversion functions.
 	 * Conversions from older versions cannot be upgraded automatically
@@ -1215,7 +1227,8 @@ check_for_aclitem_data_type_usage(ClusterInfo *cluster)
 {
 	char		output_path[MAXPGPATH];
 
-	prep_status("Checking for incompatible \"aclitem\" data type in user tables");
+	prep_status("Checking for incompatible \"%s\" data type in user tables",
+"aclitem");
 
 	snprintf(output_path, sizeof(output_path), "tables_using_aclitem.txt");
 
@@ -12

Re: pg_upgrade --check fails to warn about abstime

2023-09-20 Thread Tristan Partin

> +/*
> + * check_for_removed_data_type_usage
> + *
> + *similar to the above, but for types that were removed in 12.
> + */
> +static void
> +check_for_removed_data_type_usage(ClusterInfo *cluster, const char *datatype)

Seems like you could make this more generic instead of hardcoding
version 12, and then you could use it for any future removed types as
well.

Just a thought.
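
For illustration, one possible shape for that generalization.  This is
hypothetical: check_data_type_usage() stands in for the query logic the
sibling checks share, and is not a real function.

static void
check_for_removed_data_type_usage(ClusterInfo *cluster, const char *datatype,
								  int removed_in_version)
{
	char		output_path[MAXPGPATH];

	prep_status("Checking for removed \"%s\" data type in user tables",
				datatype);

	snprintf(output_path, sizeof(output_path), "tables_using_%s.txt",
			 datatype);

	if (check_data_type_usage(cluster, datatype, output_path))
		pg_fatal("Your installation contains the \"%s\" data type in user tables.\n"
				 "This type was removed in PostgreSQL version %d, so this\n"
				 "cluster cannot currently be upgraded.  A list of the problem\n"
				 "columns is in the file:\n"
				 "    %s", datatype, removed_in_version, output_path);
	else
		check_ok();
}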

--
Tristan Partin
Neon (https://neon.tech)




Re: pg_upgrade --check fails to warn about abstime

2023-09-20 Thread Alvaro Herrera
On 2023-Sep-20, Tristan Partin wrote:

> > +/*
> > + * check_for_removed_data_type_usage
> > + *
> > + *similar to the above, but for types that were removed in 12.
> > + */
> > +static void
> > +check_for_removed_data_type_usage(ClusterInfo *cluster, const char 
> > *datatype)
> 
> Seems like you could make this more generic instead of hardcoding version
> 12, and then you could use it for any future removed types as well.

Yeah, I thought about that, and then closed that with "we can whack it
around when we need it".  At this point I imagine there's very few other
datatypes we can remove from the core server, if any.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"La libertad es como el dinero; el que no la sabe emplear la pierde" (Alvarez)




Re: dikkop seems unhappy because of openssl stuff (FreeBSD 14-BETA1)

2023-09-20 Thread Tomas Vondra



On 9/20/23 01:24, Tom Lane wrote:
> Tomas Vondra  writes:
>> bsd@freebsd:~ $ tclsh8.6
>> % clock scan "1/26/2010"
>> time value too large/small to represent
> 
> In hopes of replicating this, I tried installing FreeBSD 14-BETA2
> aarch64 on my Pi 3B.  This test case works fine:
> 
> $ tclsh8.6
> % clock scan "1/26/2010"
> 1264482000
> 
> $ tclsh8.7
> % clock scan "1/26/2010"
> 1264482000
> 
> and unsurprisingly, pltcl's regression tests pass.  I surmise
> that something is broken in BETA1 that they fixed in BETA2.
> 
> plpython works too, with the python 3.9 package (and no older
> python).
> 
> However, all is not peachy, because plperl doesn't work.
> Trying to CREATE EXTENSION either plperl or plperlu leads
> to a libperl panic:
> 
> pl_regression=# create extension plperl;
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Succeeded.
> 
> with this in the postmaster log:
> 
> panic: pthread_key_create failed
> 
> That message is certainly not ours, so it must be coming out of libperl.
> 
> Another thing that seemed strange is that ecpg's preproc.o takes
> O(forever) to compile.  I killed the build after observing that the
> compiler had gotten to 40 minutes of CPU time, and redid that step
> with PROFILE=-O0, which allowed it to compile in 20 seconds or so.
> (I also tried -O1, but gave up after a few minutes.)  This machine
> can compile the main backend grammar in a minute or two, so there is
> something very odd there.
> 
> I'm coming to the conclusion that 14-BETA is, well, beta grade.
> I'll be interested to see if you get the same results when you
> update to BETA2.

Thanks, I'll try that when I'll be at the office next week.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: dikkop seems unhappy because of openssl stuff (FreeBSD 14-BETA1)

2023-09-20 Thread Tomas Vondra
On 9/20/23 19:59, Tomas Vondra wrote:
> 
> 
> On 9/20/23 01:24, Tom Lane wrote:
>> Tomas Vondra  writes:
>>> bsd@freebsd:~ $ tclsh8.6
>>> % clock scan "1/26/2010"
>>> time value too large/small to represent
>>
>> In hopes of replicating this, I tried installing FreeBSD 14-BETA2
>> aarch64 on my Pi 3B.  This test case works fine:
>>
>> $ tclsh8.6
>> % clock scan "1/26/2010"
>> 1264482000
>>
>> $ tclsh8.7
>> % clock scan "1/26/2010"
>> 1264482000
>>
>> and unsurprisingly, pltcl's regression tests pass.  I surmise
>> that something is broken in BETA1 that they fixed in BETA2.
>>
>> plpython works too, with the python 3.9 package (and no older
>> python).
>>
>> However, all is not peachy, because plperl doesn't work.
>> Trying to CREATE EXTENSION either plperl or plperlu leads
>> to a libperl panic:
>>
>> pl_regression=# create extension plperl;
>> server closed the connection unexpectedly
>> This probably means the server terminated abnormally
>> before or while processing the request.
>> The connection to the server was lost. Attempting reset: Succeeded.
>>
>> with this in the postmaster log:
>>
>> panic: pthread_key_create failed
>>
>> That message is certainly not ours, so it must be coming out of libperl.
>>
>> Another thing that seemed strange is that ecpg's preproc.o takes
>> O(forever) to compile.  I killed the build after observing that the
>> compiler had gotten to 40 minutes of CPU time, and redid that step
>> with PROFILE=-O0, which allowed it to compile in 20 seconds or so.
>> (I also tried -O1, but gave up after a few minutes.)  This machine
>> can compile the main backend grammar in a minute or two, so there is
>> something very odd there.
>>
>> I'm coming to the conclusion that 14-BETA is, well, beta grade.
>> I'll be interested to see if you get the same results when you
>> update to BETA2.
> 
> Thanks, I'll try that when I'll be at the office next week.
> 

FWIW when I disabled tcl, the tests pass (it's running with --nostatus
--nosend, so it's not visible on the buildfarm site). Including the
plperl stuff:

== running regression test queries==
test plperl   ... ok  397 ms
test plperl_lc... ok  152 ms
test plperl_trigger   ... ok  374 ms
test plperl_shared... ok  163 ms
test plperl_elog  ... ok  184 ms
test plperl_util  ... ok  210 ms
test plperl_init  ... ok  150 ms
test plperlu  ... ok  117 ms
test plperl_array ... ok  228 ms
test plperl_call  ... ok  189 ms
test plperl_transaction   ... ok  412 ms
test plperl_plperlu   ... ok  238 ms

==
 All 12 tests passed.
==

I wonder if this got broken between BETA1 and BETA2.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_upgrade --check fails to warn about abstime

2023-09-20 Thread Tristan Partin

On Wed Sep 20, 2023 at 12:58 PM CDT, Alvaro Herrera wrote:
> On 2023-Sep-20, Tristan Partin wrote:
>
> > > +/*
> > > + * check_for_removed_data_type_usage
> > > + *
> > > + *similar to the above, but for types that were removed in 12.
> > > + */
> > > +static void
> > > +check_for_removed_data_type_usage(ClusterInfo *cluster, const char *datatype)
> >
> > Seems like you could make this more generic instead of hardcoding version
> > 12, and then you could use it for any future removed types as well.
>
> Yeah, I thought about that, and then closed that with "we can whack it
> around when we need it".  At this point I imagine there's very few other
> datatypes we can remove from the core server, if any.

Makes complete sense to me. Patch looks good to me with one comment.


+pg_fatal("Your installation contains the \"%s\" data type in user 
tables.\n"
+ "Data type \"%s\" has been removed in PostgreSQL 
version 12,\n"
+ "so this cluster cannot currently be upgraded.  
You can drop the\n"
+ "problem columns, or change them to another data 
type, and restart\n"
+ "the upgrade.  A list of the problem columns is in 
the file:\n"
+ "%s", datatype, datatype, output_path);


I would wrap the second \"%s\" in commas.


Data type, "abstime", has been...


Maybe also add a "The" to start that sentence to make it less terse. Up 
to you.


--
Tristan Partin
Neon (https://neon.tech)




Re: should frontend tools use syncfs() ?

2023-09-20 Thread Nathan Bossart
On Wed, Sep 20, 2023 at 03:12:56PM +0300, Maxim Orlov wrote:
> As a solution, I suggest renaming sync_method in xlog module to
> wal_sync_method. In fact,
> appropriate GUC for this variable, called "wal_sync_method" and I see no
> reason not to use
> the exact same name for a variable in xlog module.

+1

I think we should also consider renaming things like SYNC_METHOD_FSYNC to
WAL_SYNC_METHOD_FSYNC, and sync_method_options to wal_sync_method_options.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: remaining sql/json patches

2023-09-20 Thread Andrew Dunstan


On 2023-09-19 Tu 23:07, Amit Langote wrote:
> On Tue, Sep 19, 2023 at 9:00 PM Amit Langote  wrote:
> > On Tue, Sep 19, 2023 at 7:37 PM jian he  wrote:
> > > ---
> > > https://www.postgresql.org/docs/current/extend-type-system.html#EXTEND-TYPES-POLYMORPHIC
> > >
> > > When the return value of a function is declared as a polymorphic type,
> > > there must be at least one argument position that is also polymorphic,
> > > and the actual data type(s) supplied for the polymorphic arguments
> > > determine the actual result type for that call.
> > >
> > > select json_query(jsonb'{"a":[{"a":[2,3]},{"a":[4,5]}]}','$.a[*].a?(@<=3)'
> > > returning anyrange);
> > > should fail. Now it returns NULL. Maybe we can validate it in
> > > transformJsonFuncExpr?
> > > ---
> >
> > I'm not sure whether we should make the parser complain about the
> > weird types being specified in RETURNING.
>
> Sleeping over this, maybe adding the following to
> transformJsonOutput() does make sense?
>
> +   if (get_typtype(ret->typid) == TYPTYPE_PSEUDO)
> +       ereport(ERROR,
> +               errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +               errmsg("returning pseudo-types is not supported in
> SQL/JSON functions"));
> +



Seems reasonable.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com


Re: Add 'worker_type' to pg_stat_subscription

2023-09-20 Thread Nathan Bossart
On Tue, Sep 19, 2023 at 08:36:35AM +0530, Amit Kapila wrote:
> I am of the opinion that worker_type should be 'apply' instead of
> 'leader apply' because even when it is a leader for parallel apply
> worker, it could perform individual transactions apply. For reference,
> I checked pg_stat_activity.backend_type, there is nothing called main
> or leader backend even when the backend is involved in parallel query.

Okay.  Here is v9 of the patch with this change.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From b8268a4c95ff217742bc2e127f74f67c9a417233 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 20 Sep 2023 12:22:21 -0700
Subject: [PATCH v9 1/1] Add worker type to pg_stat_subscription.

Thanks to 2a8b40e368, the logical replication worker type is easily
determined, and this information is a nice addition to the
pg_stat_subscription view.  The worker type could already be
deduced via other columns such as leader_pid and relid, but that is
unnecessary complexity for users.

Bumps catversion.

Author: Peter Smith
Reviewed-by: Michael Paquier, Maxim Orlov, Amit Kapila
Discussion: https://postgr.es/m/CAHut%2BPtmbSMfErSk0S7xxVdZJ9XVE3xVLhqBTmT91kf57BeKDQ%40mail.gmail.com
---
 doc/src/sgml/monitoring.sgml   | 13 -
 src/backend/catalog/system_views.sql   |  1 +
 src/backend/replication/logical/launcher.c | 18 +-
 src/include/catalog/pg_proc.dat|  6 +++---
 src/test/regress/expected/rules.out|  3 ++-
 src/test/subscription/t/004_sync.pl|  2 +-
 6 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4ff415d6a0..9c4930e9ae 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1993,6 +1993,17 @@ description | Waiting for a newly initialized WAL file to reach durable storage
   
  
 
+ 
+  
+   worker_type text
+  
+  
+   Type of the subscription worker process.  Possible types are
+   apply, parallel apply, and
+   table synchronization.
+  
+ 
+
  
   
pid integer
@@ -2008,7 +2019,7 @@ description | Waiting for a newly initialized WAL file to reach durable storage
   
   
Process ID of the leader apply worker if this process is a parallel
-   apply worker; NULL if this process is a leader apply worker or a
+   apply worker; NULL if this process is a leader apply worker or a table
synchronization worker
   
  
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 77b06e2a7a..fcb14976c0 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -949,6 +949,7 @@ CREATE VIEW pg_stat_subscription AS
 SELECT
 su.oid AS subid,
 su.subname,
+st.worker_type,
 st.pid,
 st.leader_pid,
 st.relid,
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 7882fc91ce..501910b445 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -1278,7 +1278,7 @@ GetLeaderApplyWorkerPid(pid_t pid)
 Datum
 pg_stat_get_subscription(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_SUBSCRIPTION_COLS	9
+#define PG_STAT_GET_SUBSCRIPTION_COLS	10
 	Oid			subid = PG_ARGISNULL(0) ? InvalidOid : PG_GETARG_OID(0);
 	int			i;
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
@@ -1339,6 +1339,22 @@ pg_stat_get_subscription(PG_FUNCTION_ARGS)
 		else
 			values[8] = TimestampTzGetDatum(worker.reply_time);
 
+		switch (worker.type)
+		{
+			case WORKERTYPE_APPLY:
+values[9] = CStringGetTextDatum("apply");
+break;
+			case WORKERTYPE_PARALLEL_APPLY:
+values[9] = CStringGetTextDatum("parallel apply");
+break;
+			case WORKERTYPE_TABLESYNC:
+values[9] = CStringGetTextDatum("table synchronization");
+break;
+			case WORKERTYPE_UNKNOWN:
+/* Should never happen. */
+elog(ERROR, "unknown worker type");
+		}
+
 		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
 			 values, nulls);
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 9805bc6118..f0b7b9cbd8 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5484,9 +5484,9 @@
   proname => 'pg_stat_get_subscription', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => 'oid',
-  proallargtypes => '{oid,oid,oid,int4,int4,pg_lsn,timestamptz,timestamptz,pg_lsn,timestamptz}',
-  proargmodes => '{i,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{subid,subid,relid,pid,leader_pid,received_lsn,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time}',
+  proallargtypes => '{oid,oid,oid,int4,int4,pg_lsn,timestamptz,timestamptz,pg_lsn,timestamptz,text}',
+  pro

Re: New WAL record to detect the checkpoint redo location

2023-09-20 Thread Robert Haas
On Mon, Sep 18, 2023 at 2:57 PM Robert Haas  wrote:
> I've been brainstorming about this today, trying to figure out some
> ideas to make it work.

Here are some patches.

0001 refactors XLogInsertRecord to unify a couple of separate tests of
isLogSwitch, hopefully making it cleaner and cheaper to add more
special cases.

0002 is a very minimal patch to add XLOG_CHECKPOINT_REDO without using
it for anything.

0003 modifies CreateCheckPoint() to insert an XLOG_CHECKPOINT_REDO
record for any non-shutdown checkpoint, and modifies
XLogInsertRecord() to treat that as a new special case, wherein after
inserting the record the redo pointer is reset while still holding the
WAL insertion locks.
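
To illustrate the idea, a rough sketch of that special case; variable
names follow xlog.c, but the actual patch may differ:

	/*
	 * After inserting an XLOG_CHECKPOINT_REDO record, and while all WAL
	 * insertion locks are still held, make the record's start LSN the new
	 * redo pointer, so no concurrent record can fall between the record
	 * and the redo location it announces.
	 */
	if (rechdr->xl_rmid == RM_XLOG_ID && info == XLOG_CHECKPOINT_REDO)
	{
		/* StartPos is where the just-inserted record begins */
		Insert->RedoRecPtr = StartPos;

		SpinLockAcquire(&XLogCtl->info_lck);
		XLogCtl->RedoRecPtr = StartPos;
		SpinLockRelease(&XLogCtl->info_lck);
	}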

I've tested this to the extent of running the regression tests, and I
also did one (1) manual test where it looked like the right thing was
happening, but that's it, so this might be buggy or perform like
garbage for all I know. But my hope is that it isn't buggy and
performs adequately. If there's any chance of getting some comments on
the basic design choices before I spend time testing and polishing it,
that would be very helpful.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v6-0001-Unify-two-isLogSwitch-tests-in-XLogInsertRecord.patch
Description: Binary data


v6-0002-Minimally-add-XLOG_CHECKPOINT_REDO.patch
Description: Binary data


v6-0003-WIP-Insert-XLOG_CHECKPOINT_REDO-at-the-redo-point.patch
Description: Binary data


Re: LLVM 16 (opaque pointers)

2023-09-20 Thread Thomas Munro
Belated thanks Dmitry, Ronan, Andres for your feedback.  Here's a new
version, also including Dmitry's patch for 17 which it is now also
time to push.  It required a bit more trivial #if magic to be
conditional, as Dmitry already mentioned.  I just noticed that Dmitry
had the LLVMPassBuilderOptionsSetInlinerThreshold() function added to
LLVM 17's C API for this patch.  Thanks!  (Better than putting stuff
in llvmjit_wrap.c, if you can get it upstreamed in time.)

I thought I needed to block users from building with too-old clang and
too-new LLVM, but I haven't managed to find a combination that
actually breaks anything.  I wouldn't recommend it, but for example
clang 10 bitcode seems to be inlinable without problems by LLVM 16 on
my system (I didn't use an assert build of LLVM though).  I think that
could be a separate adjustment if we learn that we need to enforce or
document a constraint there.

So far I've tested LLVM versions 10, 15, 16, 17, 18 (= their main
branch) against PostgreSQL versions 14, 15, 16.  I've attached the
versions that apply to master and 16, and pushed back-patches to 14
and 15 to public branches if anyone's interested[1].  Back-patching
further seems a bit harder.  I'm quite willing to do it, but ... do we
actually need to, ie does anyone really *require* old PostgreSQL
release branches to work with new LLVM?

(I'll start a separate thread about the related question of when we
get to drop support for old LLVMs.)

One point from an earlier email:

On Sat, Aug 12, 2023 at 6:09 AM Andres Freund  wrote:
> >  AttributeTemplate(PG_FUNCTION_ARGS)
> >  {
> > + PGFunction  fp PG_USED_FOR_ASSERTS_ONLY;
> > +
> > + fp = &AttributeTemplate;

> Other parts of the file do this by putting the functions into
> referenced_functions[], i'd copy that here and below.

Actually here I just wanted to assert that the 3 template functions
match certain function pointer types.  To restate what these functions
are about:  in the JIT code I need the function type, but we have only
the function pointer type, and it is now impossible to go from a
function pointer type to a function type, so I needed to define some
example functions with the right prototype (well, one of them existed
already but I needed more), and then I wanted to assert that they are
assignable to the appropriate function pointer types.  Does that make
sense?

In this version I changed it to what I hope is a more obvious/explicit
expression of that goal:

+   AssertVariableIsOfType(&ExecEvalSubroutineTemplate,
+  ExecEvalSubroutine);

[1] https://github.com/macdice/postgres/tree/llvm16-14 and -15
From d5e4ac6dd9c2016c00bedfd8cd404e2f277701e1 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 9 May 2022 08:27:47 +1200
Subject: [PATCH v2 1/2] jit: Support opaque pointers in LLVM 16.

Remove use of LLVMGetElementType() and provide the type of all pointers
to LLVMBuildXXX() functions when emitting IR, as required by modern LLVM
versions[1].

 * For LLVM <= 14, we'll still use the old LLVMBuildXXX() functions.
 * For LLVM == 15, we'll continue to do the same, explicitly opting
   out of opaque pointer mode.
 * For LLVM >= 16, we'll use the new LLVMBuildXXX2() functions that take
   the extra type argument.

The difference is hidden behind some new IR emitting wrapper functions
l_load(), l_gep(), l_call() etc.  The change is mostly mechanical,
except that at each site the correct type had to be provided.

In some places we needed to do some extra work to get functions types,
including some new wrappers for C++ APIs that are not yet exposed in
LLVM's C API, and some new "example" functions in llvmjit_types.c
because it's no longer possible to start from the function pointer type
and ask for the function type.

Back-patch to 12, because it's a little trickier in 11 and we agreed not
to put the latest LLVM support into the upcoming final release of 11.

[1] https://llvm.org/docs/OpaquePointers.html

Reviewed-by: Dmitry Dolgov <9erthali...@gmail.com>
Reviewed-by: Ronan Dunklau 
Reviewed-by: Andres Freund 
Discussion: https://postgr.es/m/CA%2BhUKGKNX_%3Df%2B1C4r06WETKTq0G4Z_7q4L4Fxn5WWpMycDj9Fw%40mail.gmail.com
---
 src/backend/jit/llvm/llvmjit.c|  57 ++--
 src/backend/jit/llvm/llvmjit_deform.c | 119 
 src/backend/jit/llvm/llvmjit_expr.c   | 401 --
 src/backend/jit/llvm/llvmjit_types.c  |  39 ++-
 src/backend/jit/llvm/llvmjit_wrap.cpp |  12 +
 src/backend/jit/llvm/meson.build  |   2 +-
 src/include/jit/llvmjit.h |   7 +
 src/include/jit/llvmjit_emit.h| 106 +--
 8 files changed, 479 insertions(+), 264 deletions(-)

diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 09650e2c70..06bb440503 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -85,8 +85,11 @@ LLVMTypeRef StructExprState;
 LLVMTypeRef StructAggState;
 LLVMTypeRef StructAggStatePerGroupData;
 LLVMTypeRef StructAggSt

Re: Improvements in pg_dump/pg_restore toc format and performances

2023-09-20 Thread Nathan Bossart
On Tue, Sep 19, 2023 at 12:15:55PM +0200, Pierre Ducroquet wrote:
> Attached updated patches fix this regression, I'm sorry I missed that.

Thanks for the new patches.  0001 and 0002 look reasonable to me.  This is
a nitpick, but we might want to use atooid() in 0002, which is just
shorthand for the strtoul() call you are using.

> - WriteStr(AH, NULL); /* Terminate List */
> + WriteInt(AH, -1);   /* Terminate List */

I think we need to be cautious about using WriteInt and ReadInt here.  OIDs
are unsigned, so we probably want to use InvalidOid (0) instead.

> + if (AH->version >= K_VERS_1_16)
>   {
> - depSize *= 2;
> - deps = (DumpId *) pg_realloc(deps, 
> sizeof(DumpId) * depSize);
> + depId = ReadInt(AH);
> + if (depId == -1)
> + break;  /* end of list 
> */
> + if (depIdx >= depSize)
> + {
> + depSize *= 2;
> + deps = (DumpId *) 
> pg_realloc(deps, sizeof(DumpId) * depSize);
> + }
> + deps[depIdx] = depId;
> + }
> + else
> + {
> + tmp = ReadStr(AH);
> + if (!tmp)
> + break;  /* end of list 
> */
> + if (depIdx >= depSize)
> + {
> + depSize *= 2;
> + deps = (DumpId *) 
> pg_realloc(deps, sizeof(DumpId) * depSize);
> + }
> + deps[depIdx] = strtoul(tmp, NULL, 10);
> + free(tmp);
>   }

I would suggest making the resizing logic common.

if (AH->version >= K_VERS_1_16)
{
depId = ReadInt(AH);
if (depId == InvalidOid)
break;  /* end of list */
}
else
{
tmp = ReadStr(AH);
if (!tmp)
break;  /* end of list */
depId = strtoul(tmp, NULL, 10);
free(tmp);
}

if (depIdx >= depSize)
{
depSize *= 2;
deps = (DumpId *) pg_realloc(deps, sizeof(DumpId) * depSize);
}
deps[depIdx] = depId;

Also, can we make depId more locally scoped?

I have yet to look into 0004 still.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Guiding principle for dropping LLVM versions?

2023-09-20 Thread Thomas Munro
Hi,

Currently we claim to support all versions of LLVM from 3.9 up.  It's
now getting quite inconvenient to test changes on older releases with
single digit major versions, because they aren't available through
usual package channels on current distributions, and frankly it feels
like pointless busy-work to build those older versions from source
(not to mention that it takes huuurrs to compile that much C++).
At the other end of the window, we've also been back-patching support
for the latest LLVM versions into all supported releases, which might
make slightly more sense, but I don't know.

For the trailing end of the window, would it make sense to say that
when PostgreSQL 17 ships, it doesn't need to support any LLVM versions
that are no longer available in the default package repositories of
current major distros?

I'm trying to understand the practical constraints.  Perhaps a package
maintainer could correct me if I have this wrong.  Distros typically
support a range of releases from the past few years, and then bless
one as 'default' by making it the one you get if you install a meta
package eg 'llvm' without a number (for example, on Debian 12 this is
LLVM 14, though LLVM 13 is still available).  Having a default
encourages sharing, eg one LLVM library can be used by many different
things.  The maintainer of the PostgreSQL package then chooses which
one to link against, and it's usually the default one unless we can't
use that one yet for technical reasons (a situation that might arise
from time to time in bleeding edge distros).  So if we just knew the
*oldest default* on every live distro at release time, I assume no
package maintainer would get upset if we ripped out support for
everything older, and that'd let us vacuum a lot of old versions out
of our tree.

A more conservative horizon would be: which is the *oldest* LLVM you
can still get through the usual channels on every relevant distro, for
the benefit of people compiling from source, who for some reason want
to use a version older than the default on their distro?  I don't know
what the motivation would be.

What reason could there be to be more conservative than that?

I wonder if there is a good way to make this sort of thing more
systematic.  If we could agree on a guiding principle vaguely like the
above, then perhaps we just need a wiki page that lists relevant
distributions, versions and EOL dates, that could be used to reduce
the combinations of stuff we have to consider and make the pruning
decisions into no-brainers.




Re: Add 'worker_type' to pg_stat_subscription

2023-09-20 Thread Peter Smith
On Thu, Sep 21, 2023 at 5:30 AM Nathan Bossart  wrote:
>
> On Tue, Sep 19, 2023 at 08:36:35AM +0530, Amit Kapila wrote:
> > I am of the opinion that worker_type should be 'apply' instead of
> > 'leader apply' because even when it is a leader for parallel apply
> > worker, it could perform individual transactions apply. For reference,
> > I checked pg_stat_activity.backend_type, there is nothing called main
> > or leader backend even when the backend is involved in parallel query.
>
> Okay.  Here is v9 of the patch with this change.
>

One question -- the patch comment still says "Bumps catversion.", but
catversion.h change is missing from the v9 patch?

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: pg_upgrade and logical replication

2023-09-20 Thread Michael Paquier
On Wed, Sep 20, 2023 at 04:54:36PM +0530, Amit Kapila wrote:
> Also, the patch seems to be allowing subscription relations from PG
> >=10 to be migrated but how will that work if the corresponding
> publisher is also upgraded without slots? Won't the corresponding
> workers start failing as soon as you restart the upgraded server? Do we
> need to document the steps for users?

Hmm?  How is that related to the upgrade of the subscribers?  And how
is that different from the case where a subscriber tries to connect
back to a publisher where a slot has been dropped?  There is no need
of pg_upgrade to reach such a state:
ERROR:  could not start WAL streaming: ERROR:  replication slot "popo" does not 
exist
--
Michael


signature.asc
Description: PGP signature


Re: Add 'worker_type' to pg_stat_subscription

2023-09-20 Thread Michael Paquier
On Thu, Sep 21, 2023 at 09:01:01AM +1000, Peter Smith wrote:
> One question -- the patch comment still says "Bumps catversion.", but
> catversion.h change is missing from the v9 patch?

Yeah, previous patches did that, but it is no big deal.  My take is
that it is a good practice to never do a catversion bump in posted
patches, and that it is equally a good practice from Nathan to be
reminded about that with the addition of a note in the commit message
of the patch posted.
--
Michael


signature.asc
Description: PGP signature


Re: Add 'worker_type' to pg_stat_subscription

2023-09-20 Thread Nathan Bossart
On Thu, Sep 21, 2023 at 08:14:23AM +0900, Michael Paquier wrote:
> On Thu, Sep 21, 2023 at 09:01:01AM +1000, Peter Smith wrote:
>> One question -- the patch comment still says "Bumps catversion.", but
>> catversion.h change is missing from the v9 patch?
> 
> Yeah, previous patches did that, but it is no big deal.  My take is
> that it is a good practice to never do a catversion bump in posted
> patches, and that it is equally a good practice from Nathan to be
> reminded about that with the addition of a note in the commit message
> of the patch posted.

Right, I'll take care of it before committing.  I'm trying to make sure I
don't forget.  :)

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Add 'worker_type' to pg_stat_subscription

2023-09-20 Thread Peter Smith
On Thu, Sep 21, 2023 at 9:34 AM Nathan Bossart  wrote:
>
> On Thu, Sep 21, 2023 at 08:14:23AM +0900, Michael Paquier wrote:
> > On Thu, Sep 21, 2023 at 09:01:01AM +1000, Peter Smith wrote:
> >> One question -- the patch comment still says "Bumps catversion.", but
> >> catversion.h change is missing from the v9 patch?
> >
> > Yeah, previous patches did that, but it is no big deal.  My take is
> > that it is a good practice to never do a catversion bump in posted
> > patches, and that it is equally a good practice from Nathan to be
> > reminded about that with the addition of a note in the commit message
> > of the patch posted.
>
> Right, I'll take care of it before committing.  I'm trying to make sure I
> don't forget.  :)

OK, all good.

~~~

This is a bit of a late entry, but looking at the PG DOCS, I felt it
might be simpler if we don't always refer to every other worker type
when explaining NULLs. The descriptions are already bigger than they
need to be, and if more types ever get added they will keep growing.

~

BEFORE
leader_pid integer
Process ID of the leader apply worker if this process is a parallel
apply worker; NULL if this process is a leader apply worker or a table
synchronization worker

SUGGESTION
leader_pid integer
Process ID of the leader apply worker; NULL if this process is not a
parallel apply worker

~

BEFORE
relid oid
OID of the relation that the worker is synchronizing; NULL for the
leader apply worker and parallel apply workers

SUGGESTION
relid oid
OID of the relation being synchronized; NULL if this process is not a
table synchronization worker

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: LLVM 16 (opaque pointers)

2023-09-20 Thread Devrim Gündüz


Hi Thomas,

On Thu, 2023-09-21 at 08:22 +1200, Thomas Munro wrote:
> So far I've tested LLVM versions 10, 15, 16, 17, 18 (= their main
> branch) against PostgreSQL versions 14, 15, 16.  I've attached the
> versions that apply to master and 16, and pushed back-patches to 14
> and 15 to public branches if anyone's interested[1].  Back-patching
> further seems a bit harder.  I'm quite willing to do it, but ... do we
> actually need to, ie does anyone really *require* old PostgreSQL
> release branches to work with new LLVM?

RHEL releases new LLVM version along with their new minor releases every
6 month, and we have to build older versions with new LLVM each time.
From RHEL's point of view, it would be great if we could back-patch back to
v12 :(

Regards,
-- 
Devrim Gündüz
Open Source Solution Architect, PostgreSQL Major Contributor
Twitter: @DevrimGunduz , @DevrimGunduzTR




Re: Guiding principle for dropping LLVM versions?

2023-09-20 Thread Devrim Gündüz
Hi,

On Thu, 2023-09-21 at 10:54 +1200, Thomas Munro wrote:
> I'm trying to understand the practical constraints.  Perhaps a package
> maintainer could correct me if I have this wrong.  Distros typically
> support a range of releases from the past few years, and then bless
> one as 'default' by making it the one you get if you install a meta
> package eg 'llvm' without a number (for example, on Debian 12 this is
> LLVM 14, though LLVM 13 is still available).  Having a default
> encourages sharing, eg one LLVM library can be used by many different
> things.  The maintainer of the PostgreSQL package then chooses which
> one to link against, and it's usually the default one unless we can't
> use that one yet for technical reasons (a situation that might arise
> from time to time in bleeding edge distros).  So if we just knew the
> *oldest default* on every live distro at release time, I assume no
> package maintainer would get upset if we ripped out support for
> everything older, and that'd let us vacuum a lot of old versions out
> of our tree.

RPM packager speaking:

Even though older LLVM versions exist on both RHEL and Fedora, they
don't provide older Clang packages, which means we have to link against
the latest release anyway (for example, Fedora 38 packages are currently
waiting for the LLVM 16 patch, as they cannot be linked against LLVM 15).

Regards,
-- 
Devrim Gündüz
Open Source Solution Architect, PostgreSQL Major Contributor
Twitter: @DevrimGunduz , @DevrimGunduzTR




Re: LLVM 16 (opaque pointers)

2023-09-20 Thread Thomas Munro
On Thu, Sep 21, 2023 at 12:24 PM Devrim Gündüz  wrote:
> On Thu, 2023-09-21 at 08:22 +1200, Thomas Munro wrote:
> > So far I've tested LLVM versions 10, 15, 16, 17, 18 (= their main
> > branch) against PostgreSQL versions 14, 15, 16.  I've attached the
> > versions that apply to master and 16, and pushed back-patches to 14
> > and 15 to public branches if anyone's interested[1].  Back-patching
> > further seems a bit harder.  I'm quite willing to do it, but ... do we
> > actually need to, ie does anyone really *require* old PostgreSQL
> > release branches to work with new LLVM?
>
> RHEL releases new LLVM version along with their new minor releases every
> 6 month, and we have to build older versions with new LLVM each time.
> From RHEL point of view, it would be great if we can back-patch back to
> v12 :(

Got it.  OK, I'll work on 12 and 13 now.




Re: Memory consumed by child SpecialJoinInfo in partitionwise join planning

2023-09-20 Thread Amit Langote
On Wed, Sep 20, 2023 at 10:24 PM Ashutosh Bapat
 wrote:
> On Wed, Sep 20, 2023 at 5:24 PM Amit Langote  wrote:
> > Just one comment on 0003:
> >
> > +   /*
> > +* Dummy SpecialJoinInfos do not have any translated fields and hence 
> > have
> > +* nothing to free.
> > +*/
> > +   if (child_sjinfo->jointype == JOIN_INNER)
> > +   return;
> >
> > Should this instead be Assert(child_sjinfo->jointype != JOIN_INNER)?
>
> try_partitionwise_join() calls free_child_sjinfo_members()
> unconditionally i.e. it will be called even SpecialJoinInfos of INNER
> join. Above condition filters those out early. In fact if we convert
> into an Assert, in a production build the rest of the code will free
> Relids which are referenced somewhere else causing dangling pointers.

You're right.  I hadn't realized that the parent SpecialJoinInfo
passed to try_partitionwise_join() might itself be a dummy.  I guess I
should've read the whole thing before commenting.


--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




Bug fix in vacuumdb --buffer-usage-limit xxx -Z

2023-09-20 Thread Ryoga Yoshida

Hi,

When the --buffer-usage-limit option is specified, vacuumdb issues a VACUUM or
VACUUM ANALYZE command with the BUFFER_USAGE_LIMIT option. If both
--buffer-usage-limit and -Z options are specified, vacuumdb should likewise
issue an ANALYZE command with the BUFFER_USAGE_LIMIT option. But it does not.
That is, vacuumdb -Z fails to handle the --buffer-usage-limit option.
This seems to be a bug.


You can see my patch in the attached file and how it works by adding -e 
option in vacuumdb.
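
For illustration (the size value and database name below are arbitrary), the
behavior is easy to check because -e makes vacuumdb echo the commands it sends:

    $ vacuumdb -e -Z --buffer-usage-limit 2MB postgres

With the fix, the echoed command should include the option, something like:

    ANALYZE (BUFFER_USAGE_LIMIT '2MB');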


Ryoga Yoshida

diff --git a/src/bin/scripts/vacuumdb.c b/src/bin/scripts/vacuumdb.c
index f03d5b1c6c..57355639c0 100644
--- a/src/bin/scripts/vacuumdb.c
+++ b/src/bin/scripts/vacuumdb.c
@@ -964,6 +964,13 @@ prepare_vacuum_command(PQExpBuffer sql, int serverVersion,
 appendPQExpBuffer(sql, "%sVERBOSE", sep);
 sep = comma;
 			}
+			if (vacopts->buffer_usage_limit)
+			{
+				Assert(serverVersion >= 160000);
+				appendPQExpBuffer(sql, "%sBUFFER_USAGE_LIMIT '%s'", sep,
+								  vacopts->buffer_usage_limit);
+				sep = comma;
+			}
 			if (sep != paren)
 appendPQExpBufferChar(sql, ')');
 		}


Re: bug fix and documentation improvement about vacuumdb

2023-09-20 Thread Kuwamura Masaki
>
> I agree.  Supporting pattern matching should, if anyone is interested in
> trying, be done separately in its own thread, no need to move the goalposts
> here. Sorry if I made it sound like so upthread.
>
I got it.


> When sending an update, please include the previous patch as well with
> your new
> tests as a 0002 patch in a patchset.  The CFBot can only apply and
> build/test
> patches when the entire patchset is attached to the email.  The below
> testresults indicate that the patch failed the new tests (which in a way is
> good since without the fix the tests *should* fail), since the fix patch
> was
> not applied:
>
> http://cfbot.cputube.org/masaki-kuwamura.html
>
I'm sorry, I didn't know that. I attached both the test and the fix patch to
this mail.
(The fix patch is clearly Nathan-san's, though.)
If I'm still going about this the wrong way, please let me know.

Masaki Kuwamura


v1_vacuumdb_add_tests.patch
Description: Binary data


vacuumdb_fix.patch
Description: Binary data


Re: Improve the log message output of basic_archive when basic_archive.archive_directory parameter is not set

2023-09-20 Thread bt23nguyent

On 2023-09-20 21:14, Daniel Gustafsson wrote:
>> On 19 Sep 2023, at 11:21, bt23nguyent wrote:
>>
>> Please let me know if you have any further suggestions that I can
>> improve more.
>
> +*logdetail = pstrdup("WAL archiving failed because
> basic_archive.archive_directory is not set");
>
> Nitpick: detail messages should end with a period per the error message
> style guide [0].

Yes! I totally missed this detail.

> -archiving will proceed only when it returns true.
> +archiving will proceed only when it returns true. The
> +archiver may also emit the detail explaining how the module is not configured
> +to the sever log if the archive module has any.
>
> I think this paragraph needs to be updated to include how the returned
> logdetail is emitted, since it currently shows the WARNING without mentioning
> the added detail in case returned.  It would also be good to mention that it
> should be an allocated string which the caller can free.
>
> --
> Daniel Gustafsson
>
> [0] https://www.postgresql.org/docs/devel/error-style-guide.html



Thank you for your kind review comment!

I agree with you that this document update is not explanatory enough.
So here is an updated patch.

If there is any further suggestion, please let me know.

Best regards,
Tung Nguyen

diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c
index 4d78c31859..dd0f2816dc 100644
--- a/contrib/basic_archive/basic_archive.c
+++ b/contrib/basic_archive/basic_archive.c
@@ -48,7 +48,7 @@ typedef struct BasicArchiveData
 static char *archive_directory = NULL;
 
 static void basic_archive_startup(ArchiveModuleState *state);
-static bool basic_archive_configured(ArchiveModuleState *state);
+static bool basic_archive_configured(ArchiveModuleState *state, char **logdetail);
 static bool basic_archive_file(ArchiveModuleState *state, const char *file, const char *path);
 static void basic_archive_file_internal(const char *file, const char *path);
 static bool check_archive_directory(char **newval, void **extra, GucSource source);
@@ -159,9 +159,15 @@ check_archive_directory(char **newval, void **extra, GucSource source)
  * Checks that archive_directory is not blank.
  */
 static bool
-basic_archive_configured(ArchiveModuleState *state)
+basic_archive_configured(ArchiveModuleState *state, char **logdetail)
 {
-	return archive_directory != NULL && archive_directory[0] != '\0';
+	if (archive_directory == NULL || archive_directory[0] == '\0')
+	{
+		*logdetail = pstrdup("WAL archiving failed because basic_archive.archive_directory is not set.");
+		return false;
+	}
+	else
+		return true;
 }
 
 /*
diff --git a/doc/src/sgml/archive-modules.sgml b/doc/src/sgml/archive-modules.sgml
index 7064307d9e..b58ed238b4 100644
--- a/doc/src/sgml/archive-modules.sgml
+++ b/doc/src/sgml/archive-modules.sgml
@@ -101,7 +101,7 @@ typedef void (*ArchiveStartupCB) (ArchiveModuleState *state);
 assumes the module is configured.
 
 
-typedef bool (*ArchiveCheckConfiguredCB) (ArchiveModuleState *state);
+typedef bool (*ArchiveCheckConfiguredCB) (ArchiveModuleState *state, char **logdetail);
 
 
 If true is returned, the server will proceed with
@@ -112,7 +112,10 @@ typedef bool (*ArchiveCheckConfiguredCB) (ArchiveModuleState *state);
 WARNING:  archive_mode enabled, yet archiving is not configured
 
 In the latter case, the server will periodically call this function, and
-archiving will proceed only when it returns true.
+archiving will proceed only when it returns true. The
+archiver may also emit a detail message to the server log explaining why the
+module is not configured, if the archive module supplies one. The detail
+message should be an allocated string which the caller can free.

   
 
diff --git a/src/backend/archive/shell_archive.c b/src/backend/archive/shell_archive.c
index 157ca4c751..c957a18ee7 100644
--- a/src/backend/archive/shell_archive.c
+++ b/src/backend/archive/shell_archive.c
@@ -23,7 +23,7 @@
 #include "common/percentrepl.h"
 #include "pgstat.h"
 
-static bool shell_archive_configured(ArchiveModuleState *state);
+static bool shell_archive_configured(ArchiveModuleState *state, char **logdetail);
 static bool shell_archive_file(ArchiveModuleState *state,
 			   const char *file,
 			   const char *path);
@@ -43,7 +43,7 @@ shell_archive_init(void)
 }
 
 static bool
-shell_archive_configured(ArchiveModuleState *state)
+shell_archive_configured(ArchiveModuleState *state, char **logdetail)
 {
 	return XLogArchiveCommand[0] != '\0';
 }
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 46af349564..2fd1d03b09 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -390,7 +390,8 @@ pgarch_ArchiverCopyLoop(void)
 		{
 			struct stat stat_buf;
 			char		pathname[MAXPGPATH];
-
+			char	   *logdetail;
+
 			/*
 			 * Do not initiate any more archive commands after receiving
 			 * SIGTERM, nor

Re: Bug fix in vacuumdb --buffer-usage-limit xxx -Z

2023-09-20 Thread Michael Paquier
On Thu, Sep 21, 2023 at 10:44:49AM +0900, Ryoga Yoshida wrote:
> When the --buffer-usage-limit option is specified, vacuumdb issues a VACUUM or
> VACUUM ANALYZE command with the BUFFER_USAGE_LIMIT option. If both
> --buffer-usage-limit and -Z options are specified, vacuumdb should likewise
> issue an ANALYZE command with the BUFFER_USAGE_LIMIT option. But it does not.
> That is, vacuumdb -Z fails to handle the --buffer-usage-limit option. This
> seems to be a bug.
> 
> You can see my patch in the attached file and how it works by adding -e
> option in vacuumdb.

Good catch, indeed the option is missing from the ANALYZE commands
built under analyze_only.  I also notice that we have no tests for
this option in src/bin/scripts/t checking the shape of the commands
generated.  Could you add something for ANALYZE and VACUUM?  The
option could just be appended in one of the existing cases, for
instance.
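
For instance, in the style of the existing tests in
src/bin/scripts/t/100_vacuumdb.pl (a sketch only; the size value and test name
are arbitrary):

    $node->issues_sql_like(
        [ 'vacuumdb', '-Z', '--buffer-usage-limit', '512kB', 'postgres' ],
        qr/ANALYZE \(BUFFER_USAGE_LIMIT '512kB'\)/,
        'vacuumdb -Z --buffer-usage-limit');
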
--
Michael


signature.asc
Description: PGP signature


Re: Comment about set_join_pathlist_hook()

2023-09-20 Thread Lepikhov Andrei
On Wed, Sep 20, 2023, at 5:05 PM, Etsuro Fujita wrote:
> Hi,
>
> What I am concerned about from the report [1] is that this comment is
> a bit too terse; it might cause a misunderstanding that extensions can
> do different things than we intend to allow:
>
> /*
>  * 6. Finally, give extensions a chance to manipulate the path list.
>  */
> if (set_join_pathlist_hook)
> set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
>jointype, &extra);
>
> So I would like to propose to extend the comment to explain what they
> can do, as in the comment about set_rel_pathlist_hook() in allpaths.c.
> Attached is a patch for that.

It makes sense. But why do you restrict additions to the pathlist to only the
add_path() routine? It can fail to add a path to the pathlist. We need to find
out the result of the add_path() operation and check the pathlist each
time. So, sometimes, it can be better to add a path manually.
One more slip-up could be prevented by the comment: when removing a path from
the pathlist, we should remember to update the cheapest_* pointers.
Also, it may be good to remind users that jointype and
extra->sjinfo->jointype aren't always the same.
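
For readers following along, here is a minimal sketch of how an extension
typically installs this hook and offers a path via add_path() (the hook type
and add_path() are the real APIs; the module name and the commented-out path
construction are assumptions, not from this thread):

#include "postgres.h"
#include "fmgr.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"

PG_MODULE_MAGIC;

static set_join_pathlist_hook_type prev_hook = NULL;

static void
my_join_pathlist(PlannerInfo *root, RelOptInfo *joinrel,
                 RelOptInfo *outerrel, RelOptInfo *innerrel,
                 JoinType jointype, JoinPathExtraData *extra)
{
    /* Chain to any previously installed hook. */
    if (prev_hook)
        prev_hook(root, joinrel, outerrel, innerrel, jointype, extra);

    /*
     * Build a custom path and offer it to the core logic.  add_path() may
     * immediately free a dominated path, so the path must not be reused
     * afterwards without rescanning joinrel->pathlist.
     */
    /* add_path(joinrel, (Path *) my_custom_path); */
}

void
_PG_init(void)
{
    prev_hook = set_join_pathlist_hook;
    set_join_pathlist_hook = my_join_pathlist;
}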

> [1] 
> https://www.postgresql.org/message-id/CACawEhV%3D%2BQ0HXrcDergbTR9EkVFukgRPMTZbRFL-YK5CRmvYag%40mail.gmail.com

-- 
Regards,
Andrei Lepikhov




Re: Infinite Interval

2023-09-20 Thread jian he
> On Wed, 20 Sept 2023 at 15:00, Ashutosh Bapat
>  wrote:
> >
> > 0005 - Refactored Jian's code fixing window functions. Does not
> > contain the changes for serialization and deserialization. Jian,
> > please let me know if I have missed anything else.
> >

Attached are the serialization and deserialization functions.


>
> Also, in do_interval_discard(), this seems a bit risky:
>
> +neg_val.day = -newval->day;
> +neg_val.month = -newval->month;
> +neg_val.time = -newval->time;
>

We already have an interval negate function, so I changed the code to use
interval_um_internal. Based on the 20230920 patches, I have made the attached
changes.

The serialization does make a big difference when configured for parallel mode.
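
As a quick way to see the effect (the table name ivals and the column t are
assumptions for illustration): with serialize/deserialize functions in place,
avg(interval) becomes eligible for parallel aggregation:

    SET max_parallel_workers_per_gather = 2;
    EXPLAIN (COSTS OFF) SELECT avg(t) FROM ivals;
    -- the plan can now show Partial/Finalize Aggregate under a Gather node
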
From bff5e3dfa8607a8b45aa287a7c55fda9d984f339 Mon Sep 17 00:00:00 2001
From: pgaddict 
Date: Thu, 21 Sep 2023 10:05:21 +0800
Subject: [PATCH v22 7/7] interval aggregate serialization and deserialization

add interval aggregate serialization and deserialization functions.
fix a typo.
change manual negation of an interval to use interval_um_internal.
---
 src/backend/utils/adt/timestamp.c| 86 +---
 src/include/catalog/pg_aggregate.dat |  4 ++
 src/include/catalog/pg_proc.dat  |  6 ++
 3 files changed, 88 insertions(+), 8 deletions(-)

diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index c6dc2d44..2b86171a 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -75,7 +75,7 @@ typedef struct
 
 /*
  * The transition datatype for interval aggregates is declared as Internal. It's
- * a pointer to a NumericAggState allocated in the aggregate context.
+ * a pointer to an IntervalAggState allocated in the aggregate context.
  */
 typedef struct IntervalAggState
 {
@@ -3984,10 +3984,7 @@ interval_avg_combine(PG_FUNCTION_ARGS)
 		state1->N = state2->N;
 		state1->pInfcount = state2->pInfcount;
 		state1->nInfcount = state2->nInfcount;
-
-		state1->sumX.day = state2->sumX.day;
-		state1->sumX.month = state2->sumX.month;
-		state1->sumX.time = state2->sumX.time;
+		memcpy(&state1->sumX, &state2->sumX, sizeof(Interval));
 
 		PG_RETURN_POINTER(state1);
 	}
@@ -4008,6 +4005,81 @@ interval_avg_combine(PG_FUNCTION_ARGS)
 	PG_RETURN_POINTER(state1);
 }
 
+/*
+ * interval_avg_serialize
+ *		Serialize IntervalAggState for interval aggregates.
+ */
+Datum
+interval_avg_serialize(PG_FUNCTION_ARGS)
+{
+	IntervalAggState *state;
+	StringInfoData buf;
+	bytea	   *result;
+	/* Ensure we disallow calling when not in aggregate context */
+	if (!AggCheckCallContext(fcinfo, NULL))
+		elog(ERROR, "aggregate function called in non-aggregate context");
+	state = (IntervalAggState *) PG_GETARG_POINTER(0);
+	pq_begintypsend(&buf);
+	/* N */
+	pq_sendint64(&buf, state->N);
+	/* Interval struct elements, one by one. */
+	pq_sendint64(&buf, state->sumX.time);
+	pq_sendint32(&buf, state->sumX.day);
+	pq_sendint32(&buf, state->sumX.month);
+	/* pInfcount */
+	pq_sendint64(&buf, state->pInfcount);
+	/* nInfcount */
+	pq_sendint64(&buf, state->nInfcount);
+	result = pq_endtypsend(&buf);
+	PG_RETURN_BYTEA_P(result);
+}
+
+/*
+ * interval_avg_deserialize
+ *		Deserialize bytea into IntervalAggState for interval aggregates.
+ */
+Datum
+interval_avg_deserialize(PG_FUNCTION_ARGS)
+{
+	bytea	   *sstate;
+	IntervalAggState *result;
+	StringInfoData buf;
+
+	if (!AggCheckCallContext(fcinfo, NULL))
+		elog(ERROR, "aggregate function called in non-aggregate context");
+
+	sstate = PG_GETARG_BYTEA_PP(0);
+
+	/*
+	 * Copy the bytea into a StringInfo so that we can "receive" it using the
+	 * standard recv-function infrastructure.
+	 */
+	initStringInfo(&buf);
+	appendBinaryStringInfo(&buf,
+		   VARDATA_ANY(sstate), VARSIZE_ANY_EXHDR(sstate));
+
+	result = makeIntervalAggState(fcinfo);
+
+	/* N */
+	result->N = pq_getmsgint64(&buf);
+
+	/* Interval struct elements, one by one. */
+	result->sumX.time = pq_getmsgint64(&buf);
+	result->sumX.day = pq_getmsgint(&buf, sizeof(result->sumX.day));
+	result->sumX.month = pq_getmsgint(&buf, sizeof(result->sumX.month));
+
+	/* pInfcount */
+	result->pInfcount = pq_getmsgint64(&buf);
+
+	/* nInfcount */
+	result->nInfcount = pq_getmsgint64(&buf);
+
+	pq_getmsgend(&buf);
+	pfree(buf.data);
+
+	PG_RETURN_POINTER(result);
+}
+
 /*
  * Remove the given interval value from the aggregated state.
  */
@@ -4034,9 +4106,7 @@ do_interval_discard(IntervalAggState *state, Interval *newval)
 		Interval	temp;
 		Interval	neg_val;
 
-		neg_val.day = -newval->day;
-		neg_val.month = -newval->month;
-		neg_val.time = -newval->time;
+		interval_um_internal(newval, &neg_val);
 
 		memcpy(&temp, &state->sumX, sizeof(Interval));
 		finite_interv

CI: Unfamiliar error while testing macOS

2023-09-20 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

While developing my patch, I found that the CI for macOS failed with an
unknown error [1].
Do you know why it happened? Please tell me if you have any workarounds...

It failed at the "Upload 'ccache' cache" step. The Cirrus app showed the
following message:

> Persistent worker failed to start the task: remote agent failed: failed to 
> run agent: wait: remote command exited without exit status or exit signal

[1]: https://cirrus-ci.com/task/5439712639320064

Best Regards,
Hayato Kuroda
FUJITSU LIMITED





Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

2023-09-20 Thread Bowen Shi
Thanks for the patch.

I reran the test in
https://www.postgresql.org/message-id/flat/ZQtzcH2lvo8leXEr%40paquier.xyz#cc5ed83e0edc0b9a1c1305f08ff7a335
We can discuss all the problems in this thread.

First I encountered the error "FATAL:  could not find
recovery.signal or standby.signal when recovering with backup_label",
then I deleted the backup_label file and started the instance
successfully.

> Delete a backup_label from a fresh base backup can easily lead to data
> corruption, as the startup process would pick up as LSN to start
> recovery from the control file rather than the backup_label file.
> This would happen if a checkpoint updates the redo LSN in the control
> file while a backup happens and the control file is copied after the
> checkpoint, for instance. If one wishes to deploy a new primary from
> a base backup, recovery.signal is the way to go, making sure that the
> new primary is bumped into a new timeline once recovery finishes, on
> top of making sure that the startup process starts recovery from a
> position where the cluster would be able to achieve a consistent
> state.

ereport(FATAL,
		(errmsg("could not find redo location referenced by checkpoint record"),
		 errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" and add required recovery options.\n"
				 "If you are not restoring from a backup, try removing the file \"%s/backup_label\".\n"
				 "Be careful: removing \"%s/backup_label\" will result in a corrupt cluster if restoring from a backup.",
				 DataDir, DataDir, DataDir)));

There are two similar error messages in xlogrecovery.c. Maybe we can
make them consistent with each other.
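
For reference, the procedure the hint points at boils down to something like
this (the paths and the restore_command are placeholders, not from this
thread):

    $ touch $PGDATA/recovery.signal
    $ echo "restore_command = 'cp /path/to/archive/%f %p'" >> $PGDATA/postgresql.conf
    $ pg_ctl -D $PGDATA start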

--
Bowen Shi


On Thu, 21 Sept 2023 at 11:01, Michael Paquier  wrote:
>
> On Wed, Jul 19, 2023 at 11:21:17AM -0700, David Zhang wrote:
> > 1) simply start server from a base backup
> >
> > FATAL:  could not find recovery.signal or standby.signal when recovering
> > with backup_label
> >
> > HINT:  If you are restoring from a backup, touch
> > "/media/david/disk1/pg_backup1/recovery.signal" or
> > "/media/david/disk1/pg_backup1/standby.signal" and add required recovery
> > options.
>
> Note the difference when --write-recovery-conf is specified, where a
> standby.signal is created with a primary_conninfo in
> postgresql.auto.conf.  So, yes, that's expected by default with the
> patch.
>
> > 2) touch a recovery.signal file and then try to start the server, the
> > following error was encountered:
> >
> > FATAL:  must specify restore_command when standby mode is not enabled
>
> Yes, that's also something expected in the scope of the v1 posted.
> The idea behind this restriction is that specifying recovery.signal is
> equivalent to asking for archive recovery, but not specifying
> restore_command is equivalent to not provide any options to be able to
> recover.  See validateRecoveryParameters() and note that this
> restriction exists since the beginning of times, introduced in commit
> 66ec2db.  I tend to agree that there is something to be said about
> self-contained backups taken from pg_basebackup, though, as these
> would fail if no restore_command is specified, and this restriction is
> in place before Postgres has introduced replication and easier ways to
> have base backups.  As a whole, I think that there is a good argument
> in favor of removing this restriction for the case where archive
> recovery is requested if users have all their WAL in pg_wal/ to be
> able to recover up to a consistent point, keeping these GUC
> restrictions if requesting a standby (not recovery.signal, only
> standby.signal).
>
> > 3) touch a standby.signal file, then the server successfully started,
> > however, it operates in standby mode, whereas the intended behavior was for
> > it to function as a primary server.
>
> standby.signal implies that the server will start in standby mode.  If
> one wants to deploy a new primary, that would imply a timeline jump at
> the end of recovery, you would need to specify recovery.signal
> instead.
>
> We need more discussions and more opinions, but the discussion has
> stalled for a few months now.  In case, I am adding Thomas Munro in CC
> who has mentioned to me at PGcon that he was interested in this patch
> (this thread's problem is not directly related to the fact that the
> checkpointer now runs in crash recovery, though).
>
> For now, I am attaching a rebased v2.
> --
> Michael




Re: Synchronizing slots from primary to standby

2023-09-20 Thread shveta malik
On Tue, Sep 19, 2023 at 10:29 AM shveta malik  wrote:
>
> On Fri, Sep 15, 2023 at 2:22 PM Peter Smith  wrote:
> >
> > Hi. Here are some review comments for v17-0002.
> >
>
> Thanks Peter for the feedback.  I have addressed most of these in v18
> except 2.  Please find my comments for the ones not addressed.
>
> > This is a WIP and a long way from complete, but I wanted to send what
> > I have so far (while it is still current with your latest posted
> > patches).
> >
> > ==
>
> >
> > 34. ListSlotDatabaseOIDs - sorting/logic
> >
> > Maybe explain better the reason for having the qsort and other logic.
> >
> > TBH, I was not sure of the necessity for the names lists and the
> > sorting and bsearch logic. AFAICT these are all *only* used to check
> > for uniqueness and existence of the slot name. So I was wondering if a
> > hashmap keyed by the slot name might be more appropriate and also
> > simpler than all this list sorting/searching.
> >
>
> Pending. I will revisit this soon and will let you know more on this.
> IMO, it was done to optimize the search as slot_names list could be
> pretty huge if max_replication_slots is set to high value.
>
> > ~~
> >
> > 35. ListSlotDatabaseOIDs
> >
> > + for (int slotno = 0; slotno < max_replication_slots; slotno++)
> > + {
> >
> > loop variable declaration
> >
> >
> > ==
> > src/backend/storage/lmgr/lwlock.c
> > OK
> >
> > ==
> > src/backend/storage/lmgr/lwlocknames.txt
> > OK
> >
> > ==
> > .../utils/activity/wait_event_names.txt
> > TODO
> >
> > ==
> > src/backend/utils/misc/guc_tables.c
> > OK
> >
> > ==
> > src/backend/utils/misc/postgresql.conf.sample
> >
> > 36.
> >   # primary to streaming replication standby server
> > +#max_slotsync_workers = 2 # max number of slot synchronization
> > workers on a standby
> >
> > IMO it is better to say "maximum" instead of "max" in the comment.
> >
> > (make sure the GUC description text is identical)
> >
> > ==
> > src/include/catalog/pg_proc.dat
> >
> > 37.
> > +{ oid => '6312', descr => 'get invalidate cause of a replication slot',
> > +  proname => 'pg_get_invalidation_cause', provolatile => 's',
> > proisstrict => 't',
> > +  prorettype => 'int2', proargtypes => 'name',
> > +  prosrc => 'pg_get_invalidation_cause' },
> >
> > 37a.
> > SUGGESTION (descr)
> > what caused the replication slot to become invalid
> >
> > ~
> >
> > 37b
> > 'pg_get_invalidation_cause' seemed like a poor function name because
> > it doesn't have any context -- not even the word "slot" in it.
> >
> > ==
> > src/include/commands/subscriptioncmds.h
> > OK
> >
> > ==
> > src/include/nodes/replnodes.h
> > OK
> >
> > ==
> > src/include/postmaster/bgworker_internals.h
> >
> > 38.
> >  #define MAX_PARALLEL_WORKER_LIMIT 1024
> > +#define MAX_SLOT_SYNC_WORKER_LIMIT 50
> >
> > Consider SLOTSYNC instead of SLOT_SYNC for consistency with other
> > names of this worker.
> >
> > ==
> > OK
> >
> > ==
> > src/include/replication/logicalworker.h
> >
> > 39.
> >  extern void ApplyWorkerMain(Datum main_arg);
> >  extern void ParallelApplyWorkerMain(Datum main_arg);
> >  extern void TablesyncWorkerMain(Datum main_arg);
> > +extern void ReplSlotSyncMain(Datum main_arg);
> >
> > The name is not consistent with others nearby. At least it should
> > include the word "Worker" like everything else does.
> >
> > ==
> > src/include/replication/slot.h
> > OK
> >
> > ==
> > src/include/replication/walreceiver.h
> >
> > 40.
> > +/*
> > + * Slot's DBid related data
> > + */
> > +typedef struct WalRcvRepSlotDbData
> > +{
> > + Oid database; /* Slot's DBid received from remote */
> > + TimestampTz last_sync_time; /* The last time we tried to launch sync
> > + * worker for above Dbid */
> > +} WalRcvRepSlotDbData;
> > +
> >
> >
> > Is that comment about field 'last_sync_time' correct? I thought this
> > field is the last time the slot was synced -- not the last time the
> > worker was launched.
>
> Sorry for the confusion. The comment is correct; the name is misleading. I
> have changed the name in v18.
>
> >
> > ==
> > src/include/replication/worker_internal.h
> >
> > 41.
> > - /* User to use for connection (will be same as owner of subscription). */
> > + /* User to use for connection (will be same as owner of subscription
> > + * in case of LogicalRep worker). */
> >   Oid userid;
> > +} WorkerHeader;
> >
> > 41a.
> >
> > This is not the normal style for a multi-line comment.
> >
> > ~
> >
> > 41b.
> > I wondered if the name "WorkerHeader" is just a bit *too* generic and
> > might cause future trouble because of the vague name.
> >
>
> I agree. Can you please suggest a better name for it?
>
>
> thanks
> Shveta
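
Coming back to review comment 34 above (sorted name lists vs. a hashmap), a
dynahash-based name set would look roughly like this (a sketch only; the table
name, sizing, and error wording are assumptions):

    HASHCTL     ctl = {0};
    HTAB       *slotnames;
    bool        found;

    ctl.keysize = NAMEDATALEN;
    ctl.entrysize = NAMEDATALEN;
    slotnames = hash_create("synced slot names", max_replication_slots,
                            &ctl, HASH_ELEM | HASH_STRINGS);

    /* uniqueness check and insertion in one step */
    hash_search(slotnames, slot_name, HASH_ENTER, &found);
    if (found)
        elog(ERROR, "duplicate slot name \"%s\"", slot_name);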


Currently in patch001, synchronize_slot_names is a GUC on both the primary
and the physical standby. This GUC specifies which logical slots need to
be synced to physical standbys from the primary. Ideally it should be
a GUC on the physical standby alone, and each physical standby should be
able to communicate the value to the primary.

Re: Bug fix in vacuumdb --buffer-usage-limit xxx -Z

2023-09-20 Thread David Rowley
On Thu, 21 Sept 2023 at 13:45, Ryoga Yoshida
 wrote:
> When the --buffer-usage-limit option is specified, vacuumdb issues a VACUUM or
> VACUUM ANALYZE command with the BUFFER_USAGE_LIMIT option. If both
> --buffer-usage-limit and -Z options are specified, vacuumdb should likewise
> issue an ANALYZE command with the BUFFER_USAGE_LIMIT option. But it does not.
> That is, vacuumdb -Z fails to handle the --buffer-usage-limit option.
> This seems to be a bug.
>
> You can see my patch in the attached file and how it works by adding -e
> option in vacuumdb.

Thanks for the report and the patch. I agree this has been overlooked.

I also wonder if we should be escaping the buffer-usage-limit string
sent in the command.  It seems quite an unlikely attack vector, as the
user would have command line access and could likely just use psql
anyway, but I had thought about something along the lines of:

$ vacuumdb --buffer-usage-limit "1MB'); drop database postgres;--" postgres
vacuumdb: vacuuming database "postgres"
vacuumdb: error: processing of database "postgres" failed: ERROR:
VACUUM cannot run inside a transaction block

That seems like it won't work, due to it sending multiple commands at once,
but I think we should fix it anyway.
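
For the archives' sake, the fix boils down to quoting the value as a string
literal instead of splicing it in raw; roughly (a sketch only;
appendStringLiteralConn() is the existing fe_utils helper, while the
surrounding variable names are assumptions following vacuumdb.c):

    /* before: value interpolated raw into the command text */
    appendPQExpBuffer(sql, "%sBUFFER_USAGE_LIMIT '%s'", sep,
                      vacopts->buffer_usage_limit);

    /* after: value escaped as a proper string literal */
    appendPQExpBuffer(sql, "%sBUFFER_USAGE_LIMIT ", sep);
    appendStringLiteralConn(sql, vacopts->buffer_usage_limit, conn);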

David




Re: Memory consumed by child SpecialJoinInfo in partitionwise join planning

2023-09-20 Thread Ashutosh Bapat
On Thu, Sep 21, 2023 at 6:37 AM Amit Langote  wrote:
>
> On Wed, Sep 20, 2023 at 10:24 PM Ashutosh Bapat
>  wrote:
> > On Wed, Sep 20, 2023 at 5:24 PM Amit Langote  
> > wrote:
> > > Just one comment on 0003:
> > >
> > > +   /*
> > > +* Dummy SpecialJoinInfos do not have any translated fields and hence have
> > > +* nothing to free.
> > > +*/
> > > +   if (child_sjinfo->jointype == JOIN_INNER)
> > > +   return;
> > >
> > > Should this instead be Assert(child_sjinfo->jointype != JOIN_INNER)?
> >
> > try_partitionwise_join() calls free_child_sjinfo_members()
> > unconditionally, i.e. it will be called even for SpecialJoinInfos of INNER
> > joins. The above condition filters those out early. In fact, if we convert
> > it into an Assert, in a production build the rest of the code will free
> > Relids which are referenced somewhere else, causing dangling pointers.
>
> You're right.  I hadn't realized that the parent SpecialJoinInfo
> passed to try_partitionwise_join() might itself be a dummy.  I guess I
> should've read the whole thing before commenting.

Maybe there's something to improve in terms of readability, I don't know.

-- 
Best Wishes,
Ashutosh Bapat




Re: Guiding principle for dropping LLVM versions?

2023-09-20 Thread Tom Lane
Thomas Munro  writes:
> I wonder if there is a good way to make this sort of thing more
> systematic.  If we could agree on a guiding principle vaguely like the
> above, then perhaps we just need a wiki page that lists relevant
> distributions, versions and EOL dates, that could be used to reduce
> the combinations of stuff we have to consider and make the pruning
> decisions into no-brainers.

FWIW, I think "compile older Postgres on newer infrastructure"
is a more common and more defensible scenario than "compile
newer Postgres on older infrastructure".  We've spent a ton of
effort on the latter scenario (and I've helped lead the charge
in many cases), but I think the real-world demand for it isn't
truly that high once you get beyond a year or two back.  On the
other hand, if you have an app that depends on PG 11 behavioral
details and you don't want to update it right now, you might
nonetheless need to put that server onto recent infrastructure
for practical reasons.

Thus, I think it's worthwhile to spend effort on back-patching
new-LLVM compatibility fixes into old PG branches, but I agree
that newer PG branches can drop compatibility with obsolete
LLVM versions.

LLVM is maybe not the poster child for these concerns -- for
either direction of compatibility problems, someone could build
without JIT support and not really be dead in the water.

In any case, I agree with your prior decision to not touch v11
for this.  With that branch's next release being its last,
I think the odds of introducing a bug we can't fix later
outweigh any arguable portability gain.

regards, tom lane




Re: Bug fix in vacuumdb --buffer-usage-limit xxx -Z

2023-09-20 Thread David Rowley
On Thu, 21 Sept 2023 at 16:18, David Rowley  wrote:
> Thanks for the report and the patch. I agree this has been overlooked.
>
> I also wonder if we should be escaping the buffer-usage-limit string
> sent in the command.  It seems quite an unlikely attack vector, as the
> user would have command line access and could likely just use psql
> anyway, but I had thought about something along the lines of:
>
> $ vacuumdb --buffer-usage-limit "1MB'); drop database postgres;--" postgres
> vacuumdb: vacuuming database "postgres"
> vacuumdb: error: processing of database "postgres" failed: ERROR:
> VACUUM cannot run inside a transaction block
>
> That seems like it won't work, due to it sending multiple commands at once,
> but I think we should fix it anyway.

I've pushed your patch plus some additional code to escape the text
specified in --buffer-usage-limit before passing it to the server in
commit 5cfba1ad6.

Thanks again for the report.

David




Re: Comment about set_join_pathlist_hook()

2023-09-20 Thread Etsuro Fujita
Hi,

On Thu, Sep 21, 2023 at 11:49 AM Lepikhov Andrei
 wrote:
> On Wed, Sep 20, 2023, at 5:05 PM, Etsuro Fujita wrote:
> > What I am concerned about from the report [1] is that this comment is
> > a bit too terse; it might cause a misunderstanding that extensions can
> > do different things than we intend to allow:
> >
> > /*
> >  * 6. Finally, give extensions a chance to manipulate the path list.
> >  */
> > if (set_join_pathlist_hook)
> > set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
> >jointype, &extra);
> >
> > So I would like to propose to extend the comment to explain what they
> > can do, as in the comment about set_rel_pathlist_hook() in allpaths.c.
> > Attached is a patch for that.
>
> It makes sense. But why do you restrict additions to the pathlist to only the
> add_path() routine? It can fail to add a path to the pathlist. We need to
> find out the result of the add_path() operation and check the pathlist
> each time. So, sometimes, it can be better to add a path manually.

I do not agree with you on this point; I think you can do so at your
own responsibility, but I think it is better for extensions to use
add_path(), because that makes them stable.  (Assuming that add_path()
has a bug and we change the logic of it to fix the bug, extensions
that do not follow the standard procedure might not work anymore.)

> One more slip-up could be prevented by the comment: when removing a path from
> the pathlist, we should remember to update the cheapest_* pointers.

Do we really need to do this?  I mean we do set_cheapest() afterward.
See standard_join_search().

> Also, it may be good to remind users that jointype and
> extra->sjinfo->jointype aren't always the same.

That might be an improvement, but IMO that is not the point here,
because the purpose to expand the comment is to avoid extensions doing
different things than we intend to allow.

Thanks for looking!

Best regards,
Etsuro Fujita




Re: pg_upgrade and logical replication

2023-09-20 Thread Michael Paquier
On Fri, Sep 15, 2023 at 03:08:21PM +0530, vignesh C wrote:
> On Tue, 12 Sept 2023 at 14:25, Hayato Kuroda (Fujitsu)
>  wrote:
>> Is there a possibility that an apply worker on the old cluster connects to the
>> publisher during the upgrade? Regarding pg_upgrade on the publisher, we
>> refuse TCP/IP connections from remote hosts and the port number is also
>> changed, so we can assume that the subscriber does not connect. But IIUC such
>> settings may not affect the connection source, so the apply worker may try to
>> connect to the publisher. Also, are there any hazards if that happens?
> 
> Yes, there is a possibility that the apply worker gets started and new
> transaction data is being synced from the publisher. I have made a fix
> not to start the launcher process in binary upgrade mode as we don't
> want the launcher to start an apply worker during upgrade.

Hmm.  I was wondering if 0001 is the right way to handle this case,
but at the end I'm OK to paint one extra isBinaryUpgrade in the code
path where apply launchers are registered.  I don't think that the
patch is complete, though.  A comment should be added in pg_upgrade's
server.c, specifically in start_postmaster(), to mention that -b also stops
apply workers.  I am attaching an updated version that
I'd be OK to apply.

I don't really think that we need to worry about a subscriber
connecting back to a publisher in this case, though?  I mean, each
postmaster instance started by pg_upgrade restricts the access to the
instance with unix_socket_directories set to a custom path and
permissions at 0700, and a subscription's connection string does not
know the unix path used by pg_upgrade.  I certainly agree that
stopping these processes could lead to inconsistencies in the data the
subscribers have been holding though, if we are not careful, so
preventing them from running is a good practice anyway.

I have also reviewed 0002.  As a whole, I think that I'm OK with the
main approach of the patch in pg_dump to use a new type of dumpable
object for subscription relations that are dumped with their upgrade
functions after.  This still needs more work, and more documentation.
Also, perhaps we should really have an option to control if this part
of the copy happens or not.  With a --no-subscription-relations for
pg_dump at least?

+{ oid => '4551', descr => 'add a relation with the specified relation state to pg_subscription_rel table',

During a development cycle, any new function added needs to use an OID
in the range 8000-9999.  Running unused_oids will suggest new random OIDs.

FWIW, I am not convinced that there is a need for two functions to add
an entry to pg_subscription_rel, with sole difference between both the
handling of a valid or invalid LSN.  We should have only one function
that's able to handle NULL for the LSN.  So let's remove rel_state_a
and rel_state_b, and have a single rel_state().  The description of
the SQL functions is inconsistent with the other binary upgrade ones;
I would suggest for the two functions:
"for use by pg_upgrade (relation for pg_subscription_rel)"
"for use by pg_upgrade (remote_lsn for origin)"

+   i_srsublsn = PQfnumber(res, "srsublsn");
[...]
+   subrinfo[cur_rel].srsublsn = pg_strdup(PQgetvalue(res, i, i_srsublsn));

In getSubscriptionTables(), this should check for PQgetisnull()
because we would have a NULL value for InvalidXLogRecPtr in the
catalog.  Using a char* for srsublsn is OK, but just assign NULL to
it, then just pass a hardcoded NULL value to the function as we do in
other places.  So I don't quite get why this is not the same handling
as suboriginremotelsn.
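
Concretely, something like this (a sketch; it mirrors the NULL-handling
pattern used elsewhere in pg_dump):

    if (PQgetisnull(res, i, i_srsublsn))
        subrinfo[cur_rel].srsublsn = NULL;
    else
        subrinfo[cur_rel].srsublsn = pg_strdup(PQgetvalue(res, i, i_srsublsn));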

getSubscriptionTables() is entirely skipped if we don't want any
subscriptions, if we deal with a server of 9.6 or older or if we don't
do binary upgrades, which is OK.

+/*
+ * getSubscriptionTables
+ *   get information about subscription membership for dumpable tables.
+ */
This comment is slightly misleading and should mention that this is an
upgrade-only path?

The code for dumpSubscriptionTable() is a copy-paste of
dumpPublicationTable(), but a lot of what you are doing here is
actually pointless if we are not in binary mode?  Why should this code
path not be taken only under dataOnly?  I mean, this is a code path we
should never take except if we are in binary mode.  This should have
at least a cross-check to make sure that we never have a
DO_SUBSCRIPTION_REL in this code path if we are in non-binary mode.

+if (dopt->binary_upgrade && subinfo->suboriginremotelsn)
+{
+appendPQExpBufferStr(query,
+ "SELECT 
pg_catalog.binary_upgrade_replorigin_advance(");
+appendStringLiteralAH(query, subinfo->dobj.name, fout);
+appendPQExpBuffer(query, ", '%s');\n", subinfo->suboriginremotelsn);
+}

Hmm..  Could it be actually useful even for debugging to still have
this query if suboriginremotelsn is an InvalidXLogRecPtr?  I think
that this should have a comment of the kind "\n-- For b

Re: Bug fix in vacuumdb --buffer-usage-limit xxx -Z

2023-09-20 Thread Michael Paquier
On Thu, Sep 21, 2023 at 05:50:26PM +1200, David Rowley wrote:
> I've pushed your patch plus some additional code to escape the text
> specified in --buffer-usage-limit before passing it to the server in
> commit 5cfba1ad6

That was fast.  If I may ask, why don't you have some regression tests
for the two code paths of vacuumdb that append this option to the
commands generated for VACUUM and ANALYZE?
--
Michael


signature.asc
Description: PGP signature


Re: pg_upgrade and logical replication

2023-09-20 Thread Michael Paquier
On Wed, Sep 20, 2023 at 04:54:36PM +0530, Amit Kapila wrote:
> Is the check to ensure remote_lsn is valid correct in function
> check_for_subscription_state()? How about the case where the apply
> worker didn't receive any change but just marked the relation as
> 'ready'?

I may be missing something, of course, but a relation is switched to
SUBREL_STATE_READY only once a sync happened and its state was
SUBREL_STATE_SYNCDONE, implying that SubscriptionRelState->lsn is
never InvalidXLogRecPtr, no?

For instance, nothing happens when a
Assert(!XLogRecPtrIsInvalid(rstate->lsn)) is added in
process_syncing_tables_for_apply().
--
Michael


signature.asc
Description: PGP signature


Re: Comment about set_join_pathlist_hook()

2023-09-20 Thread Lepikhov Andrei
On Thu, Sep 21, 2023, at 12:53 PM, Etsuro Fujita wrote:
> Hi,
>
> On Thu, Sep 21, 2023 at 11:49 AM Lepikhov Andrei
>  wrote:
>> On Wed, Sep 20, 2023, at 5:05 PM, Etsuro Fujita wrote:
>> > What I am concerned about from the report [1] is that this comment is
>> > a bit too terse; it might cause a misunderstanding that extensions can
>> > do different things than we intend to allow:
>> >
>> > /*
>> >  * 6. Finally, give extensions a chance to manipulate the path list.
>> >  */
>> > if (set_join_pathlist_hook)
>> > set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
>> >jointype, &extra);
>> >
>> > So I would like to propose to extend the comment to explain what they
>> > can do, as in the comment about set_rel_pathlist_hook() in allpaths.c.
>> > Attached is a patch for that.
>>
>> It makes sense. But why do you restrict additions to the pathlist to only the
>> add_path() routine? It can fail to add a path to the pathlist. We need to
>> find out the result of the add_path() operation and check the pathlist
>> each time. So, sometimes, it can be better to add a path manually.
>
> I do not agree with you on this point; I think you can do so at your
> own responsibility, but I think it is better for extensions to use
> add_path(), because that makes them stable.  (Assuming that add_path()
> has a bug and we change the logic of it to fix the bug, extensions
> that do not follow the standard procedure might not work anymore.)

OK, I got it. This question relates to the add_path() interface itself, not to
the comment. It is awkward to check the pathlist every time for the result of
the add_path() call.

>> One more slip-up could be prevented by the comment: when removing a path from
>> the pathlist, we should remember to update the cheapest_* pointers.
>
> Do we really need to do this?  I mean we do set_cheapest() afterward.
> See standard_join_search().

Agreed, in the case of the current join it doesn't make sense. I got stuck in
this situation because, when providing an additional path at the current level,
I needed to arrange paths for the inner and outer relations too.

Thanks for the explanation!

-- 
Regards,
Andrei Lepikhov




Re: Unlogged relation copy is not fsync'd

2023-09-20 Thread Noah Misch
On Fri, Sep 15, 2023 at 02:47:45PM +0300, Heikki Linnakangas wrote:
> On 05/09/2023 21:20, Robert Haas wrote:

> Thinking about this some more, I think this is still not 100% correct, even
> with the patch I posted earlier:
> 
> > /*
> >  * When we WAL-logged rel pages, we must nonetheless fsync them.  The
> >  * reason is that since we're copying outside shared buffers, a 
> > CHECKPOINT
> >  * occurring during the copy has no way to flush the previously written
> >  * data to disk (indeed it won't know the new rel even exists).  A crash
> >  * later on would replay WAL from the checkpoint, therefore it wouldn't
> >  * replay our earlier WAL entries. If we do not fsync those pages here,
> >  * they might still not be on disk when the crash occurs.
> >  */
> > if (use_wal || relpersistence == RELPERSISTENCE_UNLOGGED)
> > smgrimmedsync(dst, forkNum);
> 
> Let's walk through each case one by one:
> 
> 1. Temporary table. No fsync() needed. This is correct.
> 
> 2. Unlogged rel, main fork. Needs to be fsync'd, because we skipped WAL, and
> also because we bypassed the buffer manager. Correct.

Agreed.

> 3. Unlogged rel, init fork. Needs to be fsync'd, even though we WAL-logged
> it, because we bypassed the buffer manager. Like the comment explains. This
> is now correct with the patch.

This has two subtypes:

3a. Unlogged rel, init fork, use_wal (wal_level!=minimal). Matches what
you wrote.

3b. Unlogged rel, init fork, !use_wal (wal_level==minimal). Needs to be
fsync'd because we didn't WAL-log it and RelationCreateStorage() won't call
AddPendingSync().  (RelationCreateStorage() could start calling
AddPendingSync() for this case.  I think we chose not to do that because there
will never be additional writes to the init fork, and smgrDoPendingSyncs()
would send the fork to FlushRelationsAllBuffers() even though the fork will
never appear in shared buffers.  On the other hand, grouping the sync with the
other end-of-xact syncs could improve efficiency for some filesystems.  Also,
the number of distinguishable cases is unpleasantly high.)

> 4. Permanent rel, use_wal == true. Needs to be fsync'd, even though we
> WAL-logged it, because we bypassed the buffer manager. Like the comment
> explains. Correct.
> 
> 5. Permanent rel, use_wal == false. We skip fsync, because it means that
> it's a new relation, so we have a sync request pending for it. (An assertion
> for that would be nice). At the end of transaction, in smgrDoPendingSyncs(),
> we will either fsync it or we will WAL-log all the pages if it was a small
> relation. I believe this is *wrong*. If smgrDoPendingSyncs() decides to
> WAL-log the pages, we have the same race condition that's explained in the
> comment, because we bypassed the buffer manager:
> 
> 1. RelationCopyStorage() copies some of the pages.
> 2. Checkpoint happens, which fsyncs the relation (smgrcreate() registered a
> dirty segment when the relation was created)
> 3. RelationCopyStorage() copies the rest of the pages.
> 4. smgrDoPendingSyncs() WAL-logs all the pages.

smgrDoPendingSyncs() delegates to log_newpage_range().  log_newpage_range()
loads each page into the buffer manager and calls MarkBufferDirty() on each.
Hence, ...

> 5. Another checkpoint happens. It does *not* fsync the relation.

... I think this will fsync the relation.  No?

> 6. Crash.




Re: pg_upgrade --check fails to warn about abstime

2023-09-20 Thread Suraj Kharage
Thanks, Alvaro, for working on this.

The patch looks good to me.

+ * similar to the above, but for types that were removed in 12.
The comment should start with a capital letter.

Also, we need to backport this, right?
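
As a side note, the check essentially boils down to a catalog query of this
shape, run against the old cluster where the type still exists (illustrative
only; pg_upgrade's actual query differs in its details):

    SELECT n.nspname, c.relname, a.attname
    FROM pg_catalog.pg_class c
         JOIN pg_catalog.pg_namespace n ON c.relnamespace = n.oid
         JOIN pg_catalog.pg_attribute a ON a.attrelid = c.oid
    WHERE a.atttypid = 'abstime'::pg_catalog.regtype
      AND NOT a.attisdropped
      AND c.relkind IN ('r', 'm');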

On Wed, Sep 20, 2023 at 10:24 PM Alvaro Herrera 
wrote:

> I got a complaint that pg_upgrade --check fails to raise red flags when
> the source database contains type abstime when upgrading from pg11.  The
> type (along with reltime and tinterval) was removed by pg12.
>
>
> In passing, while testing this, I noticed that the translation
> infrastructure in pg_upgrade/util.c is broken: we do have the messages
> in the translation catalog, but the translations for the messages from
> prep_status are never displayed.  So I made the quick hack of adding _()
> around the fmt, and this was the result:
>
> Verificando Consistencia en Vivo en el Servidor Antiguo
> ---
> Verificando las versiones de los clústerséxito
> Verificando que el usuario de base de datos es el usuario de
> instalaciónéxito
> Verificando los parámetros de conexión de bases de datoséxito
> Verificando transacciones preparadas  éxito
> Verificando tipos compuestos definidos por el sistema en tablas de
> usuarioéxito
> Verificando tipos de datos reg* en datos de usuario   éxito
> Verificando contrib/isn con discordancia en mecanismo de paso de
> bigintéxito
> Checking for incompatible "aclitem" data type in user tables  éxito
> Checking for removed "abstime" data type in user tables   éxito
> Checking for removed "reltime" data type in user tables   éxito
> Checking for removed "tinterval" data type in user tables éxito
> Verificando conversiones de codificación definidas por el usuarioéxito
> Verificando operadores postfix definidos por el usuario   éxito
> Verificando funciones polimórficas incompatibles éxito
> Verificando tablas WITH OIDS  éxito
> Verificando columnas de usuario del tipo «sql_identifier»   éxito
> Verificando la presencia de las bibliotecas requeridaséxito
> Verificando que el usuario de base de datos es el usuario de
> instalaciónéxito
> Verificando transacciones preparadas  éxito
> Verificando los directorios de tablespaces para el nuevo clústeréxito
>
> Note how nicely they line up ... not.  There is some code that claims to
> do this correctly, but apparently it counts bytes, not characters, and
> also it appears to be measuring the original rather than the
> translation.
>
> I think we're trimming the strings in the wrong places.  We need to
> apply _() to the originals, not the trimmed ones.  Anyway, clearly
> nobody has looked at this very much.
>
> --
> Álvaro Herrera PostgreSQL Developer  —
> https://www.EnterpriseDB.com/
> "We’ve narrowed the problem down to the customer’s pants being in a
> situation
>  of vigorous combustion" (Robert Haas, Postgres expert extraordinaire)
>


--
Thanks & Regards,
Suraj Kharage
edbpostgres.com


Re: Bug fix in vacuumdb --buffer-usage-limit xxx -Z

2023-09-20 Thread David Rowley
On Thu, 21 Sept 2023 at 17:59, Michael Paquier  wrote:
> That was fast.  If I may ask, why don't you have some regression tests
> for the two code paths of vacuumdb that append this option to the
> commands generated for VACUUM and ANALYZE?

I think we have differing standards for what constitutes a useful
test.  For me, there would have to be a much higher likelihood of this
ever getting broken again.

I deem it pretty unlikely that someone will accidentally remove the
code that I just committed, and a test checking that vacuumdb -Z
--buffer-usage-limit ... passes the BUFFER_USAGE_LIMIT option would
likely just forever mark that we once had a trivial bug where we forgot to
include --buffer-usage-limit when -Z was specified.

If others feel strongly that a test is worthwhile, then I'll reconsider.

David