Re: Allow an extension to be updated without a script

2023-02-01 Thread Yugo NAGATA
On Wed, 1 Feb 2023 14:37:27 +0800
Julien Rouhaud  wrote:

> Hi,
> 
> On Tue, Jan 31, 2023 at 11:17:22AM +0900, Yugo NAGATA wrote:
> >
> > On Mon, 30 Jan 2023 16:05:52 -0500
> > Tom Lane  wrote:
> >
> > > If you have no update script, why call it a new version?  The point
> > > of extension versions is to distinguish different states of the
> > > extension's SQL objects.  We do not consider mods in underlying C code
> > > to justify a new version.
> >
> > Although, as you say, the extension manager doesn't track changes in C code
> > functions, a new version could be released with changes only in C
> > functions that implement a new feature or fix some bugs, because it looks
> > like a new version from the user's point of view.  Actually, there are
> > several extensions that provide empty update scripts in order to allow
> > users to install such new versions, for example;
> >
> > [...]
> > - hypopg
> >  
> > (https://github.com/HypoPG/hypopg/blob/REL1_STABLE/hypopg--1.3.1--1.3.2.sql)
> > [...]
> 
> Indeed, almost all users don't really understand the difference between the
> module / C code and the extension, and that gets worse when
> shared_preload_libraries gets in the way.
> 
> I personally choose to ship "empty" extension versions so that people can
> upgrade it if they want to have e.g. the OS level version match the SQL level
> version.  I know some extensions that chose a different approach: keep the
> first 2 digits for anything that involves extension changes and have a 3rd
> digit for C level bugfix only.  But they get very frequent bug reports about
> version mismatch any time a C bugfix is released, so I will again personally
> keep shipping those useless versions.  That being said, I agree with Tom here
> and we shouldn't add hacks for that.

Thank you for your comment and explanation. That helped me understand extension
release approaches.

I will withdraw the proposal, since just providing empty update scripts can
resolve the problem and it wouldn't be worth the confusion.
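
For reference, such an "empty" update script is just a new file containing no
commands, plus a default_version bump in the control file. A minimal sketch for
a hypothetical extension named myext:

    -- myext--1.3.1--1.3.2.sql: intentionally empty update script.
    -- With default_version = '1.3.2' in myext.control, users can then run:
    ALTER EXTENSION myext UPDATE TO '1.3.2';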

Regards,
Yugo Nagata

-- 
Yugo NAGATA 




Re: Exit walsender before confirming remote flush in logical replication

2023-02-01 Thread Masahiko Sawada
On Fri, Jan 20, 2023 at 7:45 PM Amit Kapila  wrote:
>
> On Tue, Jan 17, 2023 at 2:41 PM Amit Kapila  wrote:
> >
> > Let me try to summarize the discussion till now. The problem we are
> > trying to solve here is to allow a shutdown to complete when walsender
> > is not able to send the entire WAL. Currently, in such cases, the
> > shutdown fails. As per our current understanding, this can happen when
> > (a) walreceiver/walapply process is stuck (not able to receive more
> > WAL) due to locks or some other reason; (b) a long time delay has been
> > configured to apply the WAL (we don't yet have such a feature for
> > logical replication but the discussion for same is in progress).
> >
> > Both reasons mostly apply to logical replication because there is no
> > separate walreceiver process whose job is to just flush the WAL. In
> > logical replication, the process that receives the WAL also applies
> > it. So, while applying, it can get stuck for a long time waiting for some
> > heavy-weight lock to be released by some other long-running
> > transaction by the backend.
> >
>
> While checking the commits and email discussions in this area, I came
> across the email [1] from Michael where something similar seems to
> have been discussed. Basically, whether the early shutdown of
> walsender can prevent a switchover between publisher and subscriber
> but that part was never clearly answered in that email chain. It might
> be worth reading the entire discussion [2]. That discussion finally
> lead to the following commit:
>
> commit c6c333436491a292d56044ed6e167e2bdee015a2
> Author: Andres Freund 
> Date:   Mon Jun 5 18:53:41 2017 -0700
>
> Prevent possibility of panics during shutdown checkpoint.
>
> When the checkpointer writes the shutdown checkpoint, it checks
> afterwards whether any WAL has been written since it started and
> throws a PANIC if so.  At that point, only walsenders are still
> active, so one might think this could not happen, but walsenders can
> also generate WAL, for instance in BASE_BACKUP and logical decoding
> related commands (e.g. via hint bits).  So they can trigger this panic
> if such a command is run while the shutdown checkpoint is being
> written.
>
> To fix this, divide the walsender shutdown into two phases.  First,
> checkpointer, itself triggered by postmaster, sends a
> PROCSIG_WALSND_INIT_STOPPING signal to all walsenders.  If the backend
> is idle or runs an SQL query this causes the backend to shutdown, if
> logical replication is in progress all existing WAL records are
> processed followed by a shutdown.
> ...
> ...
>
> Here, as mentioned in the commit, we are trying to ensure that before
> checkpoint writes its shutdown WAL record, we ensure that "if logical
> replication is in progress all existing WAL records are processed
> followed by a shutdown.". I think even before this commit, we tried to
> send the entire WAL before shutdown, but I'm not completely sure. There was
> no discussion on what happens if the logical walreceiver/walapply
> process is waiting on some heavy-weight lock and the network socket
> buffer is full due to which walsender is not able to process its WAL.
> Is it okay for the shutdown to fail in such a case, as it is happening now,
> or shall we somehow detect that and shut down the walsender, or should we
> just allow the logical walsender to always exit immediately as soon as the
> shutdown signal comes?

+1 to eliminate condition (b) for logical replication.

Regarding (a), as Amit mentioned before[1], I think we should check that
pq_is_send_pending() is false. Otherwise, we will end up terminating
the WAL stream without the done message, which will lead to an error
message "ERROR:  could not receive data from WAL stream: server closed
the connection unexpectedly" on the subscriber even at a clean
shutdown. In a case where pq_is_send_pending() doesn't become false
for a long time (e.g., the network socket buffer got full because the
apply worker is waiting on a lock), I think users should unblock it by
themselves. Or it might be practically better to shut down the
subscriber first in the logical replication case, unlike the physical
replication case. I've not studied the time-delayed logical
replication patch yet, though.

Regards,

[1] 
https://www.postgresql.org/message-id/CAA4eK1%2BpD654%2BXnrPugYueh7Oh22EBGTr6dA_fS0%2BgPiHayG9A%40mail.gmail.com

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-01 Thread Kyotaro Horiguchi
At Wed, 1 Feb 2023 08:38:11 +0530, Amit Kapila  wrote 
in 
> On Wed, Feb 1, 2023 at 8:13 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Tue, 31 Jan 2023 15:12:14 +0530, Amit Kapila  
> > wrote in
> > > So, shall we check if the result of parse_int is in the range 0 and
> > > PG_INT32_MAX to ameliorate this concern?
> >
> > Yeah, it is exactly what I wanted to suggest.
> >
> > > If this works then we need to
> > > probably change the return value of defGetMinApplyDelay() to int32.
> >
> > I didn't think of doing that; int can store all values in the valid
> > range (I'm assuming we implicitly assume int >= int32 in bit width)
> > and it is the natural integer in C.  Either will do for me, but I
> > slightly prefer to use int there.
> >
> 
> I think it would be clear to use int32 because the parameter where we
> store the return value is also int32.

I'm fine with that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




transition tables and UPDATE

2023-02-01 Thread Alvaro Herrera
Earlier today I gave a talk about MERGE and wanted to provide an example
with FOR EACH STATEMENT triggers using transition tables.  However, I
can't find a non-ugly way to obtain the NEW row that corresponds to each
OLD row ...  I had to resort to an ugly trick with OFFSET n LIMIT 1.
Can anyone suggest anything better?  I couldn't find any guidance in the
docs.

This is the example function I wrote:

CREATE FUNCTION wine_audit() RETURNS trigger LANGUAGE plpgsql AS $$
  BEGIN
IF (TG_OP = 'DELETE') THEN
  INSERT INTO wine_audit
   SELECT 'D', now(), row_to_json(o), NULL FROM old_table o;
ELSIF (TG_OP = 'INSERT') THEN
  INSERT INTO wine_audit
   SELECT 'I', now(), NULL, row_to_json(n) FROM new_table n;
ELSIF (TG_OP = 'UPDATE') THEN
  DECLARE
oldrec record;
newrec jsonb;
i  integer := 0;
  BEGIN
FOR oldrec IN SELECT * FROM old_table LOOP
  newrec := row_to_json(n) FROM new_table n OFFSET i LIMIT 1;
  i := i + 1;
  INSERT INTO wine_audit
   SELECT 'U', now(), row_to_json(oldrec), newrec;
END LOOP;
  END;

END IF;
RETURN NULL;
  END;
$$;

CREATE TABLE wines (winery text, brand text, variety text, year int, bottles int);
CREATE TABLE shipment (LIKE wines);
CREATE TABLE wine_audit (op varchar(1), datetime timestamptz,
 oldrow jsonb, newrow jsonb);

CREATE TRIGGER wine_update
  AFTER UPDATE ON wines
  REFERENCING OLD TABLE AS old_table NEW TABLE AS new_table
  FOR EACH STATEMENT EXECUTE FUNCTION wine_audit();
-- I omit triggers on insert and delete because the trigger code for those is trivial

INSERT INTO wines VALUES ('Concha y Toro', 'Sunrise', 'Chardonnay', 2021, 12),
('Concha y Toro', 'Sunrise', 'Merlot', 2022, 12);

INSERT INTO shipment VALUES ('Concha y Toro', 'Sunrise', 'Chardonnay', 2021, 96),
('Concha y Toro', 'Sunrise', 'Merlot', 2022, 120),
('Concha y Toro', 'Marqués de Casa y Concha', 'Carmenere', 2021, 48),
('Concha y Toro', 'Casillero del Diablo', 'Cabernet Sauvignon', 2019, 240);

ALTER TABLE shipment ADD COLUMN marked timestamp with time zone;

WITH unmarked_shipment AS
 (UPDATE shipment SET marked = now() WHERE marked IS NULL
 RETURNING winery, brand, variety, year, bottles)
MERGE INTO wines AS w
 USING (SELECT winery, brand, variety, year,
sum(bottles) as bottles
   FROM unmarked_shipment
   GROUP BY winery, brand, variety, year) AS s
ON (w.winery, w.brand, w.variety, w.year) =
   (s.winery, s.brand, s.variety, s.year)
WHEN MATCHED THEN
 UPDATE SET bottles = w.bottles + s.bottles
WHEN NOT MATCHED THEN
 INSERT (winery, brand, variety, year, bottles)
 VALUES (s.winery, s.brand, s.variety, s.year, s.bottles)
;


If you examine table wine_audit after pasting all of the above, you'll
see this, which is correct:

─[ RECORD 1 ]
op   │ U
datetime │ 2023-02-01 01:16:44.704036+01
oldrow   │ {"year": 2021, "brand": "Sunrise", "winery": "Concha y Toro", "bottles": 12, "variety": "Chardonnay"}
newrow   │ {"year": 2021, "brand": "Sunrise", "winery": "Concha y Toro", "bottles": 108, "variety": "Chardonnay"}
─[ RECORD 2 ]
op   │ U
datetime │ 2023-02-01 01:16:44.704036+01
oldrow   │ {"year": 2022, "brand": "Sunrise", "winery": "Concha y Toro", "bottles": 12, "variety": "Merlot"}
newrow   │ {"year": 2022, "brand": "Sunrise", "winery": "Concha y Toro", "bottles": 132, "variety": "Merlot"}

My question is how to obtain the same rows without the LIMIT/OFFSET line
in the trigger function.
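
One workaround for this particular schema, though not an answer to the general
question, is to join the two transition tables on the columns that effectively
act as the key here (the MERGE above never modifies them). A sketch of the
UPDATE branch without the OFFSET/LIMIT loop:

    ELSIF (TG_OP = 'UPDATE') THEN
      INSERT INTO wine_audit
        SELECT 'U', now(), row_to_json(o), row_to_json(n)
          FROM old_table o
          JOIN new_table n USING (winery, brand, variety, year);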


Also: how can we "subtract" both JSON blobs so that the 'newrow' only
contains the members that differ?  I would like to have this:

─[ RECORD 1 ]
op   │ U
datetime │ 2023-02-01 01:16:44.704036+01
oldrow   │ {"year": 2021, "brand": "Sunrise", "winery": "Concha y Toro", "bottles": 12, "variety": "Chardonnay"}
newrow   │ {"bottles": 108}
─[ RECORD 2 ]
op   │ U
datetime │ 2023-02-01 01:16:44.704036+01
oldrow   │ {"year": 2022, "brand": "Sunrise", "winery": "Concha y Toro", "bottles": 12, "variety": "Merlot"}
newrow   │ {"bottles": 132}

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"La gente vulgar sólo piensa en pasar el tiempo;
el que tiene talento, en aprovecharlo"




Re: [PATCH] New [relation] option engine

2023-02-01 Thread Nikolay Shaplov
In a message from Wednesday, 1 February 2023, 12:04:26 MSK, Alvaro Herrera wrote:
> On 2023-Jan-31, vignesh C wrote:
> > On Tue, 3 Jan 2023 at 18:38, vignesh C  wrote:
> > > The patch does not apply on top of HEAD as in [1], please post a rebased
> > > patch: === Applying patches on top of PostgreSQL commit ID
> > > 92957ed98c5c565362ce665266132a7f08f6b0c0 ===
> > > === applying patch ./new_options_take_two_v03f.patch
> > > patching file src/include/access/reloptions.h
> > > Hunk #1 FAILED at 1.
> > > 1 out of 4 hunks FAILED -- saving rejects to file
> > > src/include/access/reloptions.h.rej
> > 
> > There has been no updates on this thread for some time, so this has
> > been switched as Returned with Feedback. Feel free to open it in the
> > next commitfest if you plan to continue on this.
> 
> Well, no feedback has been given, so I'm not sure this is a great
> outcome.  In the interest of keeping it alive, I've rebased it.  It
> turns out that the only conflict is with the 2022 -> 2023 copyright line
> update.
> 
> I have not reviewed it.
Sorry guys, I am totally unable to do anything this month... I don't have
spare resources to switch my attention to anything, even something this simple...

I hope next month will be better... And I will try to do some more reviewing of
other patches...

-- 
Nikolay Shaplov aka Nataraj
Fuzzing Engineer at Postgres Professional
Matrix IM: @dhyan:nataraj.su




Re: Exit walsender before confirming remote flush in logical replication

2023-02-01 Thread Amit Kapila
On Wed, Feb 1, 2023 at 2:09 PM Masahiko Sawada  wrote:
>
> On Fri, Jan 20, 2023 at 7:45 PM Amit Kapila  wrote:
> >
> > On Tue, Jan 17, 2023 at 2:41 PM Amit Kapila  wrote:
> > >
> > > Let me try to summarize the discussion till now. The problem we are
> > > trying to solve here is to allow a shutdown to complete when walsender
> > > is not able to send the entire WAL. Currently, in such cases, the
> > > shutdown fails. As per our current understanding, this can happen when
> > > (a) walreceiver/walapply process is stuck (not able to receive more
> > > WAL) due to locks or some other reason; (b) a long time delay has been
> > > configured to apply the WAL (we don't yet have such a feature for
> > > logical replication but the discussion for same is in progress).
> > >
> > > Both reasons mostly apply to logical replication because there is no
> > > separate walreceiver process whose job is to just flush the WAL. In
> > > logical replication, the process that receives the WAL also applies
> > > it. So, while applying, it can get stuck for a long time waiting for some
> > > heavy-weight lock to be released by some other long-running
> > > transaction by the backend.
> > >
...
...
>
> +1 to eliminate condition (b) for logical replication.
>
> Regarding (a), as Amit mentioned before[1], I think we should check if
> pq_is_send_pending() is false.
>

Sorry, but your suggestion is not completely clear to me. Do you mean
to say that for logical replication, we shouldn't wait for all the WAL
to be successfully replicated, but we should make sure to inform the
subscriber that XLOG streaming is done (by ensuring
pq_is_send_pending() is false and by calling EndCommand and pq_flush())?
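
To be a bit more concrete about the idea being discussed (only a sketch that
reuses existing walsender names such as got_STOPPING, EndCommand and pq_flush;
this is not the actual patch), the exit condition for a logical walsender
could look roughly like:

    /* In the walsender's shutdown path, for a logical walsender only */
    if (got_STOPPING && !pq_is_send_pending())
    {
        QueryCompletion qc;

        /* Inform the subscriber that XLOG streaming is done */
        SetQueryCompletion(&qc, CMDTAG_COPY, 0);
        EndCommand(&qc, DestRemote, false);
        pq_flush();

        /* Exit without waiting for the remote flush location to catch up */
        proc_exit(0);
    }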

> Otherwise, we will end up terminating
> the WAL stream without the done message, which will lead to an error
> message "ERROR:  could not receive data from WAL stream: server closed
> the connection unexpectedly" on the subscriber even at a clean
> shutdown.
>

But will that be a problem? As per docs of shutdown [1] ( “Smart” mode
disallows new connections, then waits for all existing clients to
disconnect. If the server is in hot standby, recovery and streaming
replication will be terminated once all clients have disconnected.),
there is no such guarantee. I see that it is required for the
switchover in physical replication to ensure that all the WAL is sent
and replicated but we don't need that for logical replication.

> In a case where pq_is_send_pending() doesn't become false
> for a long time, (e.g., the network socket buffer got full due to the
> apply worker waiting on a lock), I think users should unblock it by
> themselves. Or it might be practically better to shutdown the
> subscriber first in the logical replication case, unlike the physical
> replication case.
>

Yeah, will users like such a dependency? And what will they gain by doing so?


[1] - https://www.postgresql.org/docs/devel/app-pg-ctl.html

-- 
With Regards,
Amit Kapila.




Re: transition tables and UPDATE

2023-02-01 Thread Thomas Munro
On Wed, Feb 1, 2023 at 10:18 PM Alvaro Herrera  wrote:
> Earlier today I gave a talk about MERGE and wanted to provide an example
> with FOR EACH STATEMENT triggers using transition tables.  However, I
> can't find a non-ugly way to obtain the NEW row that corresponds to each
> OLD row ...  I had to resort to an ugly trick with OFFSET n LIMIT 1.
> Can anyone suggest anything better?  I couldn't find any guidance in the
> docs.

I don't know the answer, either in PostgreSQL or the SQL spec.  I
wondered if there *should* be a way here:

https://www.postgresql.org/message-id/CAEepm=1ncxBNna-pXGr2hnMHRyYi_6_AwG_352-Jn=mwdfd...@mail.gmail.com




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-01 Thread shiy.f...@fujitsu.com
On Mon, Jan 30, 2023 6:05 PM Takamichi Osumi (Fujitsu) 
 wrote:
> 
> On Saturday, January 28, 2023 1:28 PM I wrote:
> > Attached the updated patch v24.
> Hi,
> 
> 
> I've conducted the rebase affected by the commit(1e8b61735c)
> by renaming the GUC to logical_replication_mode accordingly,
> because it's utilized in the TAP test of this time-delayed LR feature.
> There is no other change for this version.
> 
> Kindly have a look at the attached v25.
> 

Thanks for your patch. Here are some comments.

1.
+   /*
+* The min_apply_delay parameter is ignored until all tablesync workers
+* have reached READY state. This is because if we allowed the delay
+* during the catchup phase, then once we reached the limit of tablesync
+* workers it would impose a delay for each subsequent worker. That would
+* cause initial table synchronization completion to take a long time.
+*/
+   if (!AllTablesyncsReady())
+   return;

I saw that the new parameter becomes effective only after all tables are in the
ready state, because the apply worker can't set the state to catchup during the
delay. But can we call process_syncing_tables() in the while-loop of
maybe_apply_delay()? Then the tablesync could finish without delay. If we can't
do so, it might be better to add a comment explaining why.

2.
+# Make sure the apply worker knows to wait for more than 500ms
+check_apply_delay_log($node_subscriber, $offset, "0.5");

I think the last parameter should be 500.
Besides, I am not sure checking the log like this is a stable test. Is it
possible that there's no such log on a slow machine? I modified the code to
sleep 1s at the beginning of apply_dispatch(), and then the newly added test
failed because the server log did not match.

Regards,
Shi yu




Re: [DOCS] Stats views and functions not in order?

2023-02-01 Thread Alvaro Herrera
On 2023-Jan-30, Peter Smith wrote:

> On Fri, Jan 27, 2023 at 10:30 PM Peter Eisentraut
>  wrote:

> > We could change the chunking boundary to be sect2 globally.  This is
> > easily configurable (chunk.section.depth).

> > Thinking about it now, maybe this is what we need.  As the documentation
> > grows, as it clearly does, the depth of the structure increases and
> > pages get longer.  This can also be seen in other chapters.

> This chunk configuration idea sounds a better approach. If somebody
> else wants to champion that change separately then I can maybe help to
> review it.

Changing the chunking depth will change every single doc URL, though, so
the website will need some work to ensure there's a good transition
mechanism for the "this page in older/newer versions" functionality.

It sounds doable, but someone will need to craft it and test it.  (Maybe
it would work to populate a table with all URLs at each side of the
divide, and its equivalent at the other side.)

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: Syncrep and improving latency due to WAL throttling

2023-02-01 Thread Jakub Wartak
On Mon, Jan 30, 2023 at 9:16 AM Bharath Rupireddy
 wrote:

Hi Bharath, thanks for reviewing.

> I think measuring the number of WAL flushes with and without this
> feature that the postgres generates is great to know this feature
> effects on IOPS. Probably it's even better with variations in
> synchronous_commit_flush_wal_after or the throttling limits.

It's the same as point 19, so I covered it there.

> Although v2 is a WIP patch, I have some comments:
> 1. Coding style doesn't confirm to the Postgres standard:

Fixed all those that you mentioned, and I've removed elog() and the
timing code. BTW: is there a way to pgindent only my changes (on a git
diff)?

> 2. It'd be better to add a TAP test hitting the WAL throttling.

TODO; any hints on what that test should look like are welcome
(checking pg_stat_wal?). I could spot only one place for it --
src/test/recovery/007_sync_rep.pl

> 3. We generally don't need timings to be calculated explicitly when we
> emit before and after log messages. If needed one can calculate the
> wait time from timestamps of the log messages. If it still needs an
> explicit time difference which I don't think is required, because we
> have a special event and before/after log messages, I think it needs
> to be under track_wal_io_timing to keep things simple.

Removed, it was just debugging aid to demonstrate the effect in the
session waiting.

> 4. XLogInsertRecord() can be a hot path for high-write workload, so
> effects of adding anything in it needs to be measured. So, it's better
> to run benchmarks with this feature enabled and disabled.

When the GUC is off (0, the default), the impact in my tests is none (it's
just a set of simple IFs). However, if the feature is enabled, the INSERT is
slowed down: I've repeated the initial test from the first post, and the
single-statement INSERT generating 50MB of WAL still takes the same 4-5s when
the feature is off, and ~23s +/- 1s when it is activated with an RTT of
10.1ms, but that slowdown is intentional.

> 5. Missing documentation of this feature and the GUC. I think we can
> describe extensively why we'd need a feature of this kind in the
> documentation for better adoption or usability.

TODO; I'm planning on adding documentation when we're a little bit closer to
adding this to the CF.

> 6. Shouldn't the max limit be MAX_KILOBYTES?
> +&synchronous_commit_flush_wal_after,
> +0, 0, 1024L*1024L,

Fixed.

> 7. Can this GUC name be something like
> synchronous_commit_wal_throttle_threshold to better reflect what it
> does for a backend?
> +{"synchronous_commit_flush_wal_after", PGC_USERSET,
> REPLICATION_SENDING,

Done.

> 8. A typo - s/confiration/confirmation
[..]
> 9. This
> "Sets the maximum logged WAL in kbytes, after which wait for sync
> commit confiration even without commit "
> better be something like below?
> "Sets the maximum amount of WAL in kilobytes a backend generates after
> which it waits for synchronous commit confiration even without commit
> "

Fixed as you have suggested.

> 10. I think there's nothing wrong in adding some assertions in
> HandleXLogDelayPending():
> Assert(synchronous_commit_flush_wal_after > 0);
> Assert(backendWalInserted > synchronous_commit_flush_wal_after * 1024L);
> Assert(XactLastThrottledRecEnd is not InvalidXLogRecPtr);

Sure, added.

> 11. Specify the reason in the comments as to why we had to do the
> following things:
> Here:
> +/* flush only up to the last fully filled page */
> +XLogRecPtr LastFullyWrittenXLogPage = XactLastThrottledRecEnd
> - (XactLastThrottledRecEnd % XLOG_BLCKSZ);
> Talk about how this avoids multiple-flushes for the same page.
>
> Here:
> + * Called from ProcessMessageInterrupts() to avoid waiting while
> being in critical section
> + */
> +void HandleXLogDelayPending()
> Talk about how waiting in a critical section, that is in
> XLogInsertRecord() causes locks to be held longer durations and other
> effects.

Added.

> Here:
> +/* WAL throttling */
> +backendWalInserted += rechdr->xl_tot_len;
> Be a bit more verbose about why we try to throttle WAL and why only
> for sync replication, the advantages, effects etc.

Added.

> 12. This better be a DEBUG1 or 2 message instead of WARNINGs to not
> clutter server logs.
> +/* XXX Debug for now */
> +elog(WARNING, "throttling WAL down on this session
> (backendWalInserted=%d, LSN=%X/%X flushingTo=%X/%X)",
> +backendWalInserted,
> +LSN_FORMAT_ARGS(XactLastThrottledRecEnd),
> +LSN_FORMAT_ARGS(LastFullyWrittenXLogPage));
>
> +elog(WARNING, "throttling WAL down on this session - end (%f
> ms)", timediff);

OK, that's entirely removed.

> 13. BTW, we don't need to hold interrupts while waiting for sync
> replication ack as it may block query cancels or proc die pendings.
> You can remove these, unless there's a strong reason.
> +//HOLD_INTERRUPTS();
> +//RESUME_INTERRUPTS();

Sure, removed. However, one problem I've discovered is that we were
hitting Assert(InterruptHoldoffCount > 0) in

Re: Generating "Subplan Removed" in EXPLAIN

2023-02-01 Thread Yugo NAGATA
On Wed, 1 Feb 2023 16:52:07 +1300
David Rowley  wrote:

> On Wed, 1 Feb 2023 at 15:53, Yugo NAGATA  wrote:
> > Maybe, you missed to set plan_cache_mode to force_generic_plan.
> > "Subplan Removed" doesn't appear when using a custom plan.
> 
> I wouldn't say that's 100% true. The planner is only able to prune
> using values which are known during planning. Constant folding is
> going to evaluate any immutable functions during planning, but nothing
> more.
> 
> Partition pruning might be delayed until execution time if some
> expression that's being compared to the partition key is stable. e.g:
> 
> create table rp (t timestamp not null) partition by range(t);
> create table rp2022 partition of rp for values from ('2022-01-01') to
> ('2023-01-01');
> create table rp2023 partition of rp for values from ('2023-01-01') to
> ('2024-01-01');
> 
> explain select * from rp where t >= now();
> 
>  Append  (cost=0.00..95.33 rows=1506 width=8)
>Subplans Removed: 1
>->  Seq Scan on rp2023 rp_1  (cost=0.00..43.90 rows=753 width=8)
>  Filter: (t >= now())
> 

I am sorry, my explanation was not completely correct. Thank you for
your clarification.
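
For completeness, the generic-plan case referred to above can be seen with the
same rp table from the quoted example; the EXPLAIN output in the comments is
only approximate:

SET plan_cache_mode = force_generic_plan;
PREPARE q(timestamp) AS SELECT * FROM rp WHERE t >= $1;
EXPLAIN (COSTS OFF) EXECUTE q(now()::timestamp);
--  Append
--    Subplans Removed: 1
--    ->  Seq Scan on rp2023 rp_1
--          Filter: (t >= $1)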

Regards,
Yugo Nagata


-- 
Yugo NAGATA 




Re: MacOS: xsltproc fails with "warning: failed to load external entity"

2023-02-01 Thread Aleksander Alekseev
Hi hackers,

> Concretely, I'm thinking something like the attached.  Notes:

> > 1. I have not tested the meson changes.
> Works here.

Took me a while to figure out how to build the documentation with Meson:

```
XML_CATALOG_FILES=/usr/local/etc/xml/catalog ninja -C build alldocs
```

It works. Perhaps, while we are at it, we should add the

```
ninja -C build alldocs
```

command to the installation.sgml file, in the "17.4.1 Building and
Installation with Meson / Short Version" section.

> > 2. As this is written, you can't override the --nonet options very
> > easily in the Makefile build (you could do so at runtime by setting
> > XSLTPROC, but not at configure time); and you can't override them at
> > all in the meson build.  Given the lack of evidence that it's still
> > useful to allow net access, I'm untroubled by that.  I did intentionally
> > skip using "override" in the Makefile, though, to allow that case.
>
> I'm not troubled by this either.

Neither am I.

> 3. For consistency with the directions for other platforms, I made
> the package lists for macOS just mention libxslt.  That should
> be enough to pull in libxml2 as well.

Fair enough.

> 4. Use of --nonet changes the error message you get if xsltproc
> can't find the DTDs.  I copied the error I get from MacPorts'
> version of xsltproc, but can you confirm it's the same on Homebrew?

Yes, the message is the same.

-- 
Best regards,
Aleksander Alekseev




Re: MacOS: xsltproc fails with "warning: failed to load external entity"

2023-02-01 Thread Andres Freund
Hi,

On 2023-02-01 13:05:32 +0300, Aleksander Alekseev wrote:
> Took me a while to figure out how to build the documentation with Meson:
> 
> ```
> XML_CATALOG_FILES=/usr/local/etc/xml/catalog ninja -C build alldocs
> ```
> 
> It works. Perhaps we should add:
> 
> ```
> ninja -C build alldocs
> ```

FWIW, just 'docs' builds only the multi-page HTML/man pages;
alldocs takes a lot longer...

And yes, adding that to the docs is a good idea.

Greetings,

Andres Freund




Re: Clarify deleting comments and security labels in synopsis

2023-02-01 Thread Dagfinn Ilmari Mannsåker
Tom Lane  writes:

> =?utf-8?Q?Dagfinn_Ilmari_Manns=C3=A5ker?=  writes:
>> Tom Lane  writes:
>>> Agreed; as-is, the syntax summary is not just confusing but outright
>>> wrong.
>>> 
>>> I think we could go further and split the entry under Parameters
>>> to match:
>
>> Makes sense. Something like the attached v2?
>
> WFM, will push.

Thanks!

>   regards, tom lane

- ilmari




Re: Weird failure with latches in curculio on v15

2023-02-01 Thread Andres Freund
Hi,

On 2023-02-01 16:21:16 +1300, Thomas Munro wrote:
> My database off assertion failures has four like that, all 15 and master:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2023-02-01%2001:05:17
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2023-01-11%2011:16:40
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2022-11-22%2012:19:21
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2022-11-17%2021:47:02
> 
> It's always in proc_exit() in StartupProcShutdownHandler(), a SIGTERM
> handler which is allowed to call that while in_restore_command is
> true.

Ugh, no wonder we're getting crashes. This whole business seems bogus as
hell.


RestoreArchivedFile():
...
/*
 * Check signals before restore command and reset afterwards.
 */
PreRestoreCommand();

/*
 * Copy xlog from archival storage to XLOGDIR
 */
ret = shell_restore(xlogfname, xlogpath, lastRestartPointFname);

PostRestoreCommand();


/* SIGTERM: set flag to abort redo and exit */
static void
StartupProcShutdownHandler(SIGNAL_ARGS)
{
int save_errno = errno;

if (in_restore_command)
proc_exit(1);
else
shutdown_requested = true;
WakeupRecovery();

errno = save_errno;
}

Where PreRestoreCommand()/PostRestoreCommand() set in_restore_command.



There's *a lot* of stuff happening inside shell_restore() that's not
compatible with doing proc_exit() inside a signal handler. We're
allocating memory! Interacting with stdout!

There's also the fact that system() isn't signal safe, but that's a much
less likely problematic issue.


This appears to have gotten worse over a sequence of commits. The
following commits each added something between PreRestoreCommand() and
PostRestoreCommand().


commit 1b06d7bac901e5fd20bba597188bae2882bf954b
Author: Fujii Masao 
Date:   2021-11-22 10:28:21 +0900
 
Report wait events for local shell commands like archive_command.

added pgstat_report_wait_start/end. Unlikely to cause big issues, but
not good.


commit 7fed801135bae14d63b11ee4a10f6083767046d8
Author: Tom Lane 
Date:   2022-08-29 13:55:38 -0400

Clean up inconsistent use of fflush().

Made it a bit worse by adding an fflush(). That certainly seems like it
could cause hangs.


commit 9a740f81eb02e04179d78f3df2ce671276c27b07
Author: Michael Paquier 
Date:   2023-01-16 16:31:43 +0900

Refactor code in charge of running shell-based recovery commands

which completely broke the mechanism. We suddenly run the entirety of
shell_restore(), which does pallocs etc to build the string passed to
system, and raises errors, all within a section in which a signal
handler can invoke proc_exit().  That's just completely broken.


Sorry, but particularly in this area, you've got to be a heck of a lot more
careful.

I don't see a choice but to revert the recent changes. They need a
fairly large rewrite.

Greetings,

Andres Freund




Re: heapgettup refactoring

2023-02-01 Thread David Rowley
On Tue, 31 Jan 2023 at 12:18, Melanie Plageman
 wrote:
> v7 attached

I've been looking over the v7-0002 patch today and I did make a few
adjustments to heapgettup_initial_block(), as I would prefer to see the
branching of all these helper functions follow the pattern of:

if ()
{
if ()

else

}
else
{

}

which wasn't quite what the function was doing.

Along the way, I noticed that 0002 has a subtle bug that does not seem
to be present once the remaining patches are applied.  I think I'm
happier to push these along the lines of how you have them in the
patches, so I've held off pushing for now due to the bug and the
change I had to make to fix it.

The problem is around the setting of scan->rs_inited = true; you've
moved that into heapgettup_initial_block() and you've correctly not
initialised the scan for empty tables when you return
InvalidBlockNumber, however, you've not correctly considered the fact
that table_block_parallelscan_nextpage() could also return
InvalidBlockNumber if the parallel workers manage to grab all of the
blocks before the current process gets the first block. I don't know
for sure, but it looks like this could cause problems when
heapgettup() or heapgettup_pagemode() got called again for a rescan.
We'd have returned the NULL tuple to indicate that no further tuples
exist, but we'll have left rs_inited set to true which looks like
it'll cause issues.

I wondered if it might be better to do the scan->rs_inited = true; in
heapgettup() and heapgettup_pagemode() instead. The attached v8 patch
does it this way. Despite this fixing that bug, I think this might be
a slightly better division of duties.

If you're ok with the attached (and everyone else is too), then I can
push it in the (NZ) morning.  The remaining patches would need to be
rebased due to my changes.

David
From cbd37463bdaa96afed4c7c739c8e91b770a9f8a7 Mon Sep 17 00:00:00 2001
From: David Rowley 
Date: Wed, 1 Feb 2023 19:35:16 +1300
Subject: [PATCH v8] Refactor heapam.c adding heapgettup_initial_block function

Here we adjust heapgettup() and heapgettup_pagemode() to move the code
that fetches the first block out into a helper function.  This removes
some code duplication.

Author: Melanie Plageman
Reviewed-by: David Rowley
Discussion: 
https://postgr.es/m/CAAKRu_bvkhka0CZQun28KTqhuUh5ZqY=_t8qeqzqol02rpi...@mail.gmail.com
---
 src/backend/access/heap/heapam.c | 225 ++-
 1 file changed, 103 insertions(+), 122 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0a8bac25f5..40168cc9ca 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -483,6 +483,67 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
scan->rs_ntuples = ntup;
 }
 
+/*
+ * heapgettup_initial_block - return the first BlockNumber to scan
+ *
+ * Returns InvalidBlockNumber when there are no blocks to scan.  This can
+ * occur with empty tables and in parallel scans when parallel workers get all
+ * of the pages before we can get a chance to get our first page.
+ */
+static BlockNumber
+heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
+{
+   Assert(!scan->rs_inited);
+
+   /* When there are no pages to scan, return InvalidBlockNumber */
+   if (scan->rs_nblocks == 0 || scan->rs_numblocks == 0)
+   return InvalidBlockNumber;
+
+   if (ScanDirectionIsForward(dir))
+   {
+   /* serial scan */
+   if (scan->rs_base.rs_parallel == NULL)
+   return scan->rs_startblock;
+   else
+   {
+   /* parallel scan */
+            table_block_parallelscan_startblock_init(scan->rs_base.rs_rd,
+                                                     scan->rs_parallelworkerdata,
+                                                     (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel);
+
+            /* may return InvalidBlockNumber if there are no more blocks */
+            return table_block_parallelscan_nextpage(scan->rs_base.rs_rd,
+                                                     scan->rs_parallelworkerdata,
+                                                     (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel);
+        }
+    }
+    else
+    {
+        /* backward parallel scan not supported */
+        Assert(scan->rs_base.rs_parallel == NULL);
+
+        /*
+         * Disable reporting to syncscan logic in a backwards scan; it's not
+         * very likely anyone else is doing the same thing at the same time,
+         * and much more likely that we'll just bollix things for

Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-02-01 Thread Melih Mutlu
Hi,

Please see attached patches for the below changes.

shveta malik wrote on Fri, 27 Jan 2023 at 13:12:

> On Thu, Jan 26, 2023 at 7:53 PM Melih Mutlu 
> wrote:
> 1.
> --Can we please add a few more points to the Summary to make it more clear.
> a) something telling that reusability of workers is for tables under
> one subscription and not across multiple subscriptions.
> b) Since we are reusing both workers and slots, can we add:
> --when do we actually end up spawning a new worker
> --when do we actually end up creating a new slot in a worker rather
> than using existing one.
> --if we reuse existing slots, what happens to the snapshot?
>

I modified the commit message if that's what you mean by the Summary.


> 2.
> +   The last used ID for tablesync workers. This ID is used to
> +   create replication slots. The last used ID needs to be stored
> +   to make logical replication can safely proceed after any
> interruption.
> +   If sublastusedid is 0, then no table has been synced yet.
>
> --typo:
>  to make logical replication can safely proceed ==> to make logical
> replication safely proceed
>

Done


> 3.
> is_first_run;
> move_to_next_rel;
> --Do you think one variable is enough here as we do not get any extra
> info by using 2 variables? Can we have 1 which is more generic like
> 'ready_to_reuse'. Otherwise, please let me know if we must use 2.
>

Right. Removed is_first_run and renamed move_to_next_rel as ready_to_reuse.


> 4.
> /* missin_ok = true, since the origin might be already dropped. */
> typo: missing_ok
>

Done.


> 5. GetReplicationSlotNamesBySubId:
> errmsg("not tuple returned."));
>
> Can we have a better error msg:
> ereport(ERROR,
> errmsg("could not receive list of slots
> associated with subscription %d, error: %s", subid, res->err));
>

Done.


> 6.
> static void
> clean_sync_worker(void)
>
> --can we please add introductory comment for this function.
>

Done.


> 7.
> /*
>  * Pick the table for the next run if there is not another worker
>  * already picked that table.
>  */
> Pick the table for the next run if it is not already picked up by
> another worker.
>

Done.


> 8.
> process_syncing_tables_for_sync()
>
> /* Cleanup before next run or ending the worker. */
> --can we please improve this comment:
> if there is no more work left for this worker, stop the worker
> gracefully, else do clean-up and make it ready for the next
> relation/run.
>

Done


> 9.
> LogicalRepSyncTableStart:
>  * Read previous slot name from the catalog, if exists.
>  */
> prev_slotname = (char *) palloc0(NAMEDATALEN);
> Do we need to free this at the end?
>

Pfree'd prev_slotname after we're done with it.


> 10.
> if (strlen(prev_slotname) == 0)
> {
> elog(ERROR, "Replication slot could not be
> found for relation %u",
>  MyLogicalRepWorker->relid);
> }
> shall we mention subid also in error msg.
>

Done.

Thanks for reviewing,
-- 
Melih Mutlu
Microsoft


v5-0001-Add-replication-protocol-cmd-to-create-a-snapshot.patch
Description: Binary data


v9-0002-Reuse-Logical-Replication-Background-worker.patch
Description: Binary data


Re: Can we do something to help stop users mistakenly using force_parallel_mode?

2023-02-01 Thread David Rowley
On Thu, 30 Jun 2022 at 00:31, Justin Pryzby  wrote:
>
> On Wed, Jun 29, 2022 at 03:23:27PM +1200, David Rowley wrote:
> > Over on [1] I noticed that the user had set force_parallel_mode to
> > "on" in the hope that would trick the planner into making their query
> > run more quickly.  Of course, that's not what they want since that GUC
> > is only there to inject some parallel nodes into the plan in order to
> > verify the tuple communication works.

> Note that it was already changed to be a developer GUC
> https://www.postgresql.org/message-id/20210404012546.GK6592%40telsasoft.com

> Since the user in this recent thread is running v13.7, I'm *guessing* that
> if that had been backpatched, they wouldn't have made this mistake.

I was just reading [1] where a PG15 user made this mistake, so it
seems people are still falling for it even now it's been changed to a
developer GUC.

I don't really share Laurenz's worry [2] about compatibility break
from renaming this GUC. I think the legitimate usages of this setting
are probably far more rare than the illegitimate ones. I'm not overly
concerned about renaming if it helps stop people from making this
mistake. I believe the current GUC is just too conveniently named, and
users are likely to incorrectly assume it does exactly what they want,
because what else could it possibly do?!

I think something like debug_parallel_query is much less likely to be misused.

David

[1] 
https://www.postgresql.org/message-id/can4ko3b4y75pg5ro_oajwf8l1hysygxcdgss6nzotvqokkn...@mail.gmail.com
[2] 
https://www.postgresql.org/message-id/26139c03e118bec967c77da374d947e9ecf81333.ca...@cybertec.at




RLS makes COPY TO process child tables

2023-02-01 Thread Antonin Houska
While working on [1] I noticed that if RLS gets enabled, the COPY TO command
includes the contents of child tables in the result, although the
documentation says it should not:

"COPY TO can be used only with plain tables, not views, and does not
copy rows from child tables or child partitions. For example, COPY
table TO copies the same rows as SELECT * FROM ONLY table. The syntax
COPY (SELECT * FROM table) TO ... can be used to dump all of the rows
in an inheritance hierarchy, partitioned table, or view."

A test case is attached (rls.sql) as well as fix proposal
(copy_rls_no_inh.diff).

[1] https://commitfest.postgresql.org/41/3641/

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

create table a(i int);
insert into a values (1);

create table a1() inherits(a);
insert into a1 values (1);

-- Only the parent table is copied, as the documentation claims.
copy a to stdout;

alter table a enable row level security;
create role u;
create policy pol_perm on a as permissive for select to u using (true);
grant select on table a to u;
set role u;

-- Both "a" and "a1" appears in the output.
copy a to stdout;
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index e34f583ea7..3b8c25dadd 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -249,6 +249,7 @@ DoCopy(ParseState *pstate, const CopyStmt *stmt,
 			from = makeRangeVar(get_namespace_name(RelationGetNamespace(rel)),
 pstrdup(RelationGetRelationName(rel)),
 -1);
+			from->inh = false;
 
 			/* Build query */
 			select = makeNode(SelectStmt);


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-02-01 Thread Melih Mutlu
Hi,

I mistakenly attached v9 in my previous email.
Please see attached v6 and v10 for the previous and below changes.

shveta malik wrote on Tue, 31 Jan 2023 at 12:59:

> On Fri, Jan 27, 2023 at 3:41 PM shveta malik 
> wrote:
> 1)
> REPLICATION_SLOT_SNAPSHOT
> --Do we need 'CREATE' prefix with it i.e. CREATE_REPLICATION_SNAPSHOT
> (or some other brief one with CREATE?).  'REPLICATION_SLOT_SNAPSHOT'
> does not look like a command/action and thus is confusing.
>

Renamed it as CREATE_REPLICATION_SNAPSHOT


> 2)
> is used in the currenct transaction. This command is currently only
> supported
> for logical replication.
> slots.
> --typo: currenct-->current
> --slots can be moved to previous line
>

Done.


> 3)
> /*
>  * Signal that we don't need the timeout mechanism. We're just creating
>  * the replication slot and don't yet accept feedback messages or send
>  * keepalives. As we possibly need to wait for further WAL the walsender
>  * would otherwise possibly be killed too soon.
>  */
> We're just creating the replication slot --> We're just creating the
> replication snapshot
>

Done.


> 4)
> I see XactReadOnly check in CreateReplicationSlot, do we need the same
> in ReplicationSlotSnapshot() as well?
>

Added this check too.


> ===
> v9-0002:
>
> 5)
>   /* We are safe to drop the replication trackin origin after this
> --typo: tracking
>

Done.


> 6)
> slot->data.catalog_xmin = xmin_horizon;
> slot->effective_xmin = xmin_horizon;
> SpinLockRelease(&slot->mutex);
> xmin_horizon =
> GetOldestSafeDecodingTransactionId(!need_full_snapshot);
> ReplicationSlotsComputeRequiredXmin(true);
>
> --do we need to set xmin_horizon in slot after
> 'GetOldestSafeDecodingTransactionId' call, otherwise it will be set to
> InvalidId in slot. Is that intentional? I see that we do set this
> correct xmin_horizon in builder->initial_xmin_horizon but the slot is
> carrying Invalid one.
>

I think you're right. Moved GetOldestSafeDecodingTransactionId call before
xmin_horizon assignment.

Thanks,
-- 
Melih Mutlu
Microsoft


v6-0001-Add-replication-protocol-cmd-to-create-a-snapshot.patch
Description: Binary data


v10-0002-Reuse-Logical-Replication-Background-worker.patch
Description: Binary data


Re: Perform streaming logical transactions by background workers and parallel apply

2023-02-01 Thread Amit Kapila
On Tue, Jan 31, 2023 at 9:04 AM houzj.f...@fujitsu.com
 wrote:
>
> I think your comment makes sense, thanks.
> I updated the patch for the same.
>

The patch looks mostly good to me. I have made a few changes in the
comments and docs, see attached.

-- 
With Regards,
Amit Kapila.


v91-0001-Allow-the-logical_replication_mode-to-be-used-on.patch
Description: Binary data


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-02-01 Thread shveta malik
On Wed, Feb 1, 2023 at 5:05 PM Melih Mutlu  wrote:
>
> Hi,
>
> Please see attached patches for the below changes.
>

> Thanks for reviewing,
> --
> Melih Mutlu
> Microsoft

Hello Melih,
Thank you for making the changes.

I have few more comments:
1)
src/backend/replication/logical/worker.c: (errmsg("logical replication
table synchronization worker for subscription \"%s\", table \"%s\" has
started",
src/backend/replication/logical/worker.c: (errmsg("logical replication
table synchronization worker for subscription \"%s\" has moved to sync
table \"%s\".",
src/backend/replication/logical/tablesync.c: (errmsg("logical
replication table synchronization worker for subscription \"%s\",
table \"%s\" has finished",

In the above, can we have rep_slot_id in the trace message as well? Otherwise
it is not clear which worker moved to the next relation. We could have:
logical replication table synchronization worker_%d for subscription
\"%s\" has moved to sync table, rep_slot_id,

Overall we need to improve the tracing. I will give my suggestions on
this later (in detail).

2) I found a crash in the previous patch (v9), but have not tested it
on the latest one yet. The crash happens when all the replication slots are
consumed and we are trying to create a new one. I tweaked the settings like
below so that it can be reproduced easily:
max_sync_workers_per_subscription=3
max_replication_slots = 2
and then ran the test case shared by you. I think there is some memory
corruption happening. (I did test in debug mode, have not tried
release mode.) I tried to put some traces in to identify the root cause.
I observed that worker_1 keeps moving from one table to another
correctly, but at some point it gets corrupted, i.e. the origin name
obtained for it is wrong; it tries to advance that origin, and since that
origin does not exist, it asserts and then something else crashes.
From the log: (new trace lines added by me are prefixed by "shveta"; I also
tweaked the code as per my comment 1 above, for clarity on the worker-id).

From the traces below, it is clear that worker_1 was moving from one
relation to another, always getting the correct origin 'pg_16688_1', but
at the end it got 'pg_16688_49', which does not exist. The second part of
the trace shows who updated 'pg_16688_49': it was worker_49, which did not
even get a chance to create this origin because the max_replication_slots
limit was reached.
==
..
2023-02-01 16:01:38.041 IST [9243] LOG:  logical replication table
synchronization worker_1 for subscription "mysub", table "table_93"
has finished
2023-02-01 16:01:38.047 IST [9243] LOG:  logical replication table
synchronization worker_1 for subscription "mysub" has moved to sync
table "table_98".
2023-02-01 16:01:38.055 IST [9243] LOG:  shveta-
LogicalRepSyncTableStart- worker_1 get-origin-id2 originid:2,
originname:pg_16688_1
2023-02-01 16:01:38.055 IST [9243] LOG:  shveta-
LogicalRepSyncTableStart- Worker_1 reusing
slot:pg_16688_sync_1_7195132648087016333, originid:2,
originname:pg_16688_1
2023-02-01 16:01:38.094 IST [9243] LOG:  shveta-
LogicalRepSyncTableStart- worker_1 updated-origin2
originname:pg_16688_1
2023-02-01 16:01:38.096 IST [9243] LOG:  logical replication table
synchronization worker_1 for subscription "mysub", table "table_98"
has finished
2023-02-01 16:01:38.096 IST [9243] LOG:  logical replication table
synchronization worker_1 for subscription "mysub" has moved to sync
table "table_60".
2023-02-01 16:01:38.099 IST [9243] LOG:  shveta-
LogicalRepSyncTableStart- worker_1 get-origin originid:0,
originname:pg_16688_49
2023-02-01 16:01:38.099 IST [9243] LOG:  could not drop replication
slot "pg_16688_sync_49_7195132648087016333" on publisher: ERROR:
replication slot "pg_16688_sync_49_7195132648087016333" does not exist
2023-02-01 16:01:38.103 IST [9243] LOG:  shveta-
LogicalRepSyncTableStart- Worker_1 reusing
slot:pg_16688_sync_1_7195132648087016333, originid:0,
originname:pg_16688_49
TRAP: failed Assert("node != InvalidRepOriginId"), File: "origin.c",
Line: 892, PID: 9243
postgres: logical replication worker for subscription 16688 sync 16384
(ExceptionalCondition+0xbb)[0x56019194d3b7]
postgres: logical replication worker for subscription 16688 sync 16384
(replorigin_advance+0x6d)[0x5601916b53d4]
postgres: logical replication worker for subscription 16688 sync 16384
(LogicalRepSyncTableStart+0xbb4)[0x5601916cb648]
postgres: logical replication worker for subscription 16688 sync 16384
(+0x5d25e2)[0x5601916d35e2]
postgres: logical replication worker for subscription 16688 sync 16384
(+0x5d282c)[0x5601916d382c]
postgres: logical replication worker for subscription 16688 sync 16384
(ApplyWorkerMain+0x17b)[0x5601916d4078]
postgres: logical replication worker for subscription 16688 sync 16384
(StartBackgroundWorker+0x248)[0x56019167f943]
postgres: logical replication worker for subscription 16688 sync 16384
(+0x589ad3)[0x56019168aad3]
postgres: logical replication worker for subscription 16688 sync 16384
(+0x589ee3)[0x56019168aee3]
postgres: logical r

Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-02-01 Thread Melih Mutlu
Hi,

wangw.f...@fujitsu.com wrote on Tue, 31 Jan 2023 at 13:40:

> Sorry, I forgot to add the link to the email. Please refer to [1].
>
> [1] -
> https://www.postgresql.org/message-id/OSZPR01MB631013C833C98E826B3CFCB9FDC69%40OSZPR01MB6310.jpnprd01.prod.outlook.com


Thanks for pointing out this review. I somehow skipped that, sorry.

Please see attached patches.

shiy.f...@fujitsu.com wrote on Tue, 17 Jan 2023 at 10:46:

> On Wed, Jan 11, 2023 4:31 PM Melih Mutlu  wrote:
> 0001 patch
> 
> 1. walsender.c
> +   /* Create a tuple to send consisten WAL location */
>
> "consisten" should be "consistent" I think.
>

Done.


> 2. logical.c
> +   if (need_full_snapshot)
> +   {
> +   LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
> +
> +   SpinLockAcquire(&slot->mutex);
> +   slot->effective_catalog_xmin = xmin_horizon;
> +   slot->data.catalog_xmin = xmin_horizon;
> +   slot->effective_xmin = xmin_horizon;
> +   SpinLockRelease(&slot->mutex);
> +
> +   xmin_horizon =
> GetOldestSafeDecodingTransactionId(!need_full_snapshot);
> +   ReplicationSlotsComputeRequiredXmin(true);
> +
> +   LWLockRelease(ProcArrayLock);
> +   }
>
> It seems that we should first get the safe decoding xid, then inform the
> slot
> machinery about the new limit, right? Otherwise the limit will be
> InvalidTransactionId and that seems inconsistent with the comment.
>

You're right. Moved that call before assigning xmin_horizon.
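
So the reordered sequence, sketched from the hunk quoted above (not the exact
patch text), would look roughly like:

    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);

    /* compute the safe horizon first ... */
    xmin_horizon = GetOldestSafeDecodingTransactionId(!need_full_snapshot);

    /* ... then install it in the slot ... */
    SpinLockAcquire(&slot->mutex);
    slot->effective_catalog_xmin = xmin_horizon;
    slot->data.catalog_xmin = xmin_horizon;
    slot->effective_xmin = xmin_horizon;
    SpinLockRelease(&slot->mutex);

    /* ... and let the slot machinery recompute the global minimum */
    ReplicationSlotsComputeRequiredXmin(true);

    LWLockRelease(ProcArrayLock);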


> 3. doc/src/sgml/protocol.sgml
> +   is used in the currenct transaction. This command is currently
> only supported
> +   for logical replication.
> +   slots.
>
> We don't need to put "slots" in a new line.
>

Done.


> 0002 patch
> 
> 1.
> In pg_subscription_rel.h, I think the type of "srrelslotname" can be
> changed to
> NameData, see "subslotname" in pg_subscription.h.
>
> 2.
> +* Find the logical replication sync
> worker if exists store
> +* the slot number for dropping associated
> replication slots
> +* later.
>
> Should we add comma after "if exists"?
>

Done.

3.
> +   PG_FINALLY();
> +   {
> +   pfree(cmd.data);
> +   }
> +   PG_END_TRY();
> +   \
> +   return tablelist;
> +}
>
> Do we need the backslash?
>

Removed it.


> 4.
> +   /*
> +* Advance to the LSN got from walrcv_create_slot. This is WAL
> +* logged for the purpose of recovery. Locks are to prevent the
> +* replication origin from vanishing while advancing.
>
> "walrcv_create_slot" should be changed to
> "walrcv_create_slot/walrcv_slot_snapshot" I think.


Right, done.


>
>
5.
> +   /* Replication drop might still exist. Try to drop
> */
> +   replorigin_drop_by_name(originname, true, false);
>
> Should "Replication drop" be "Replication origin"?
>

Done.


> 6.
> I saw an assertion failure in the following case, could you please look
> into it?
> The backtrace is attached.
>
> -- pub
> CREATE TABLE tbl1 (a int, b text);
> CREATE TABLE tbl2 (a int primary key, b text);
> create publication pub for table tbl1, tbl2;
> insert into tbl1 values (1, 'a');
> insert into tbl1 values (1, 'a');
>
> -- sub
> CREATE TABLE tbl1 (a int primary key, b text);
> CREATE TABLE tbl2 (a int primary key, b text);
> create subscription sub connection 'dbname=postgres port=5432' publication
> pub;
>
> Subscriber log:
> 2023-01-17 14:47:10.054 CST [1980841] LOG:  logical replication apply
> worker for subscription "sub" has started
> 2023-01-17 14:47:10.060 CST [1980843] LOG:  logical replication table
> synchronization worker for subscription "sub", table "tbl1" has started
> 2023-01-17 14:47:10.070 CST [1980845] LOG:  logical replication table
> synchronization worker for subscription "sub", table "tbl2" has started
> 2023-01-17 14:47:10.073 CST [1980843] ERROR:  duplicate key value violates
> unique constraint "tbl1_pkey"
> 2023-01-17 14:47:10.073 CST [1980843] DETAIL:  Key (a)=(1) already exists.
> 2023-01-17 14:47:10.073 CST [1980843] CONTEXT:  COPY tbl1, line 2
> 2023-01-17 14:47:10.074 CST [1980821] LOG:  background worker "logical
> replication worker" (PID 1980843) exited with exit code 1
> 2023-01-17 14:47:10.083 CST [1980845] LOG:  logical replication table
> synchronization worker for subscription "sub", table "tbl2" has finished
> 2023-01-17 14:47:10.083 CST [1980845] LOG:  logical replication table
> synchronization worker for subscription "sub" has moved to sync table
> "tbl1".
> TRAP: failed Assert("node != InvalidRepOriginId"), File: "origin.c", Line:
> 892, PID: 1980845
>

I'm not able to reproduce this yet. Will look into it further.

Thanks,
-- 
Melih Mutlu
Microsoft


v10-0002-Reuse-Logical-Replication-Background-worker.patch
Description: Binary data


v7-0001-Add

Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-02-01 Thread Melih Mutlu
Hi Shveta,

shveta malik wrote on Wed, 1 Feb 2023 at 15:01:

> On Wed, Feb 1, 2023 at 5:05 PM Melih Mutlu  wrote:
> 2) I found a crash in the previous patch (v9), but have not tested it
> on the latest yet. Crash happens when all the replication slots are
> consumed and we are trying to create new. I tweaked the settings like
> below so that it can be reproduced easily:
> max_sync_workers_per_subscription=3
> max_replication_slots = 2
> and then ran the test case shared by you. I think there is some memory
> corruption happening. (I did test in debug mode, have not tried in
> release mode). I tried to put some traces to identify the root-cause.
> I observed that worker_1 keeps on moving from 1 table to another table
> correctly, but at some point, it gets corrupted i.e. origin-name
> obtained for it is wrong and it tries to advance that and since that
> origin does not exist, it  asserts and then something else crashes.
> From log: (new trace lines added by me are prefixed by shveta, also
> tweaked code to have my comment 1 fixed to have clarity on worker-id).
>
> form below traces, it is clear that worker_1 was moving from one
> relation to another, always getting correct origin 'pg_16688_1', but
> at the end it got 'pg_16688_49' which does not exist. Second part of
> trace shows who updated 'pg_16688_49', it was done by worker_49 which
> even did not get chance to create this origin due to max_rep_slot
> reached.
>

Thanks for investigating this error. I think it's the same error as the one
Shi reported earlier. [1]
I couldn't reproduce it yet but will apply your tweaks and try again.
Looking into this.

[1]
https://www.postgresql.org/message-id/OSZPR01MB631013C833C98E826B3CFCB9FDC69%40OSZPR01MB6310.jpnprd01.prod.outlook.com


Thanks,
-- 
Melih Mutlu
Microsoft


Re: transition tables and UPDATE

2023-02-01 Thread Alvaro Herrera
On 2023-Feb-01, Thomas Munro wrote:

> On Wed, Feb 1, 2023 at 10:18 PM Alvaro Herrera  
> wrote:
> > Earlier today I gave a talk about MERGE and wanted to provide an example
> > with FOR EACH STATEMENT triggers using transition tables.  However, I
> > can't find a non-ugly way to obtain the NEW row that corresponds to each
> > OLD row ...  I had to resort to an ugly trick with OFFSET n LIMIT 1.
> > Can anyone suggest anything better?  I couldn't find any guidance in the
> > docs.
> 
> I don't know the answer, either in PostgreSQL or the SQL spec.  I
> wondered if there *should* be a way here:
> 
> https://www.postgresql.org/message-id/CAEepm=1ncxBNna-pXGr2hnMHRyYi_6_AwG_352-Jn=mwdfd...@mail.gmail.com

I had tried to tie these relations using WITH ORDINALITY, but the only
way I could think of (array_agg to then unnest() WITH ORDINALITY) was
even uglier than what I already had.  So yeah, I think it might be
useful if we had a way to inject a counter or something in there.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"The important things in the world are problems with society that we don't
understand at all. The machines will become more complicated but they won't
be more complicated than the societies that run them."(Freeman Dyson)




Re: Progress report of CREATE INDEX for nested partitioned tables

2023-02-01 Thread Alvaro Herrera
Hmm, count_leaf_partitions has to scan pg_inherits and do a syscache
lookup for every single element therein ... this sounds slow. 

In one of the callsites, we already have the partition descriptor
available.  We could just scan partdesc->is_leaf[] and add one for each
'true' value we see there.
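
Roughly like this (an untested sketch; partdesc->nparts and
partdesc->is_leaf[] are the fields I mean):

	int		nleaves = 0;

	for (int i = 0; i < partdesc->nparts; i++)
	{
		if (partdesc->is_leaf[i])
			nleaves++;
	}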

In the other callsite, we had the table open just a few lines before the
place you call count_leaf_partitions.  Maybe we can rejigger things by
examining its state before closing it: if relkind is not partitioned we
know leaf_partitions=0, and only if partitioned we count leaf partitions.
I think that would save some work.  I also wonder if it's worth writing
a bespoke function for counting leaf partitions rather than relying on
find_all_inheritors.

I think there's probably not much point optimizing it further than that.
If there was, then we could think about creating a data representation
that we can build for the entire partitioning hierarchy in a single pass
with the count of leaf partitions that sit below each specific non-leaf;
but I think that's just over-engineering.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"After a quick R of TFM, all I can say is HOLY CR** THAT IS COOL! PostgreSQL was
amazing when I first started using it at 7.2, and I'm continually astounded by
learning new features and techniques made available by the continuing work of
the development team."
Berend Tober, http://archives.postgresql.org/pgsql-hackers/2007-08/msg01009.php




Re: Can we do something to help stop users mistakenly using force_parallel_mode?

2023-02-01 Thread John Naylor
On Wed, Feb 1, 2023 at 6:41 PM David Rowley  wrote:
>
> I don't really share Laurenz's worry [2] about compatibility break
> from renaming this GUC. I think the legitimate usages of this setting
> are probably far more rare than the illegitimate ones. I'm not overly
> concerned about renaming if it helps stop people from making this
> mistake. I believe the current name is just too conveniently named and
> that users are likely just to incorrectly assume it does exactly what
> they want because what else could it possibly do?!
>
> I think something like debug_parallel_query is much less likely to be
misused.

+1 on both points.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: transition tables and UPDATE

2023-02-01 Thread Dean Rasheed
On Wed, 1 Feb 2023 at 12:12, Alvaro Herrera  wrote:
>
> I had tried to tie these relations using WITH ORDINALITY, but the only
> way I could think of (array_agg to then unnest() WITH ORDINALITY) was
> even uglier than what I already had.  So yeah, I think it might be
> useful if we had a way to inject a counter or something in there.
>

You could use a pair of cursors like this:

CREATE OR REPLACE FUNCTION wine_audit() RETURNS trigger LANGUAGE plpgsql AS $$
  BEGIN
IF (TG_OP = 'DELETE') THEN
  INSERT INTO wine_audit
   SELECT 'D', now(), row_to_json(o), NULL FROM old_table o;
ELSIF (TG_OP = 'INSERT') THEN
  INSERT INTO wine_audit
   SELECT 'I', now(), NULL, row_to_json(n) FROM new_table n;
ELSIF (TG_OP = 'UPDATE') THEN
  DECLARE
oldcur CURSOR FOR SELECT row_to_json(o) FROM old_table o;
newcur CURSOR FOR SELECT row_to_json(n) FROM new_table n;
oldrec jsonb;
newrec jsonb;
  BEGIN
OPEN oldcur;
OPEN newcur;

LOOP
  FETCH oldcur INTO oldrec;
  EXIT WHEN NOT FOUND;

  FETCH newcur INTO newrec;
  EXIT WHEN NOT FOUND;

  INSERT INTO wine_audit VALUES('U', now(), oldrec, newrec);
END LOOP;

CLOSE oldcur;
CLOSE newcur;
  END;

END IF;
RETURN NULL;
  END;
$$;

though it would be nicer if there was a way to do it in a single SQL statement.

Regards,
Dean




Re: meson: Optionally disable installation of test modules

2023-02-01 Thread Nazir Bilal Yavuz

Hi,

On 1/31/23 11:44, Peter Eisentraut wrote:

On 30.01.23 18:42, Andres Freund wrote:

Bilal, with a bit of help by me, worked on an alternative approach to
this. It's a lot more verbose in the initial change, but wouldn't increase
the amount of work/lines for new test modules. The main advantage is that
we wouldn't have to disable the modules by default, which I think would be
quite likely to result in plenty of people not running the tests.

Sending a link instead of attaching, in case you already registered a 
cfbot entry:
https://github.com/anarazel/postgres/commit/d1d192a860da39af9aa63d7edf643eed07c4 



Probably worth adding an install-test-modules target for manual use.


Looks like a good idea.  I'm happy to proceed along that line.


I am attaching a patch with an alternative approach. Also, I updated the 
commit at the link by adding regress_module, autoinc_regress and 
refint_regress to test_install_libs.



Regards,
Nazir Bilal Yavuz
Microsoft
From 9e5d2a7f2e81360d728d02abf42959b529942ba3 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 1 Feb 2023 14:10:20 +0300
Subject: [PATCH v2] meson: prevent installation of test files during main
 install

---
 meson.build   | 32 +-
 src/backend/meson.build   |  7 
 src/test/install_additional_files | 33 +++
 src/test/modules/delay_execution/meson.build  |  6 ++--
 src/test/modules/dummy_index_am/meson.build   |  9 ++---
 src/test/modules/dummy_seclabel/meson.build   |  9 ++---
 src/test/modules/plsample/meson.build | 10 ++
 src/test/modules/spgist_name_ops/meson.build  | 10 ++
 .../ssl_passphrase_callback/meson.build   |  6 ++--
 src/test/modules/test_bloomfilter/meson.build |  9 ++---
 .../modules/test_copy_callbacks/meson.build   |  9 ++---
 .../modules/test_custom_rmgrs/meson.build |  9 ++---
 src/test/modules/test_ddl_deparse/meson.build |  9 ++---
 src/test/modules/test_extensions/meson.build  |  4 +--
 .../modules/test_ginpostinglist/meson.build   |  9 ++---
 src/test/modules/test_integerset/meson.build  |  9 ++---
 src/test/modules/test_lfind/meson.build   |  9 ++---
 src/test/modules/test_oat_hooks/meson.build   |  6 ++--
 src/test/modules/test_parser/meson.build  |  9 ++---
 .../test_pg_db_role_setting/meson.build   |  9 ++---
 src/test/modules/test_pg_dump/meson.build |  4 +--
 src/test/modules/test_predtest/meson.build|  9 ++---
 src/test/modules/test_rbtree/meson.build  |  9 ++---
 src/test/modules/test_regex/meson.build   |  9 ++---
 src/test/modules/test_rls_hooks/meson.build   |  6 ++--
 src/test/modules/test_shm_mq/meson.build  |  9 ++---
 src/test/modules/test_slru/meson.build|  9 ++---
 src/test/modules/worker_spi/meson.build   |  9 ++---
 src/test/regress/meson.build  | 18 --
 29 files changed, 144 insertions(+), 151 deletions(-)
 create mode 100644 src/test/install_additional_files

diff --git a/meson.build b/meson.build
index e379a252a51..e815ca9606c 100644
--- a/meson.build
+++ b/meson.build
@@ -2805,6 +2805,10 @@ backend_code = declare_dependency(
   dependencies: os_deps + backend_both_deps + backend_deps,
 )
 
+# install these files only during test, not main install
+test_install_files = []
+test_install_libs = []
+
 # src/backend/meson.build defines backend_mod_code used for extension
 # libraries.
 
@@ -2825,6 +2829,10 @@ subdir('doc/src/sgml')
 
 generated_sources_ac += {'': ['GNUmakefile']}
 
+# After processing src/test, add test_install_libs to the testprep_targets
+# to build them
+testprep_targets += test_install_libs
+
 
 # If there are any files in the source directory that we also generate in the
 # build directory, they might get preferred over the newly generated files,
@@ -2907,14 +2915,36 @@ meson_install_args = meson_args + ['install'] + {
 'muon': []
 }[meson_impl]
 
+# setup tests should  be run first,
+# so define priority for these
+setup_tests_priority = 100
 test('tmp_install',
 meson_bin, args: meson_install_args ,
 env: {'DESTDIR':test_install_destdir},
-priority: 100,
+priority: setup_tests_priority,
 timeout: 300,
 is_parallel: false,
 suite: ['setup'])
 
+# get full paths of test_install_libs to copy them
+test_install_libs_fp = []
+foreach lib: test_install_libs
+  test_install_libs_fp += lib.full_path()
+endforeach
+
+install_additional_files = files('src/test/install_additional_files')
+test('install_additional_files',
+python, args: [
+  install_additional_files,
+  '--sharedir', test_install_location / contrib_data_args['install_dir'],
+  '--libdir', test_install_location / dir_lib_pkg,
+  '--install_files', test_install_files,
+  '--install_libs', test_install_libs_fp,
+],
+priority: setup_tests_priority,
+is_parallel: false,
+suite: ['setup'])
+
 test_result_dir = meson.build_root() / 'testrun'
 
 
diff --git a/src/backen

Re: transition tables and UPDATE

2023-02-01 Thread Yugo NAGATA
On Wed, 1 Feb 2023 10:03:26 +0100
Alvaro Herrera  wrote:

> Earlier today I gave a talk about MERGE and wanted to provide an example
> with FOR EACH STATEMENT triggers using transition tables.  However, I
> can't find a non-ugly way to obtain the NEW row that corresponds to each
> OLD row ...  I had to resort to an ugly trick with OFFSET n LIMIT 1.
> Can anyone suggest anything better?  I couldn't find any guidance in the
> docs.

What I could come up with is to join old_table and new_table using keys
of the wine table (winery, brand, variety, year), or join them using
values from row_number(), like;

 INSERT INTO wine_audit
  SELECT 'U', now(), row_o, row_n
   FROM (SELECT row_number() OVER() i, row_to_json(o) row_o FROM old_table o) 
   JOIN (SELECT row_number() OVER() i, row_to_json(n) row_n FROM new_table n) 
 USING (i);
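
The key-based variant would look like this (just a sketch, assuming
(winery, brand, variety, year) uniquely identifies a row in wines):

 INSERT INTO wine_audit
  SELECT 'U', now(), row_to_json(o), row_to_json(n)
   FROM old_table o
   JOIN new_table n USING (winery, brand, variety, year);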

> 
> This is the example function I wrote:
> 
> CREATE FUNCTION wine_audit() RETURNS trigger LANGUAGE plpgsql AS $$
>   BEGIN
> IF (TG_OP = 'DELETE') THEN
>   INSERT INTO wine_audit
>SELECT 'D', now(), row_to_json(o), NULL FROM old_table o;
> ELSIF (TG_OP = 'INSERT') THEN
>   INSERT INTO wine_audit
>SELECT 'I', now(), NULL, row_to_json(n) FROM new_table n;
> ELSIF (TG_OP = 'UPDATE') THEN
>   DECLARE
> oldrec record;
>   newrec jsonb;
>   i  integer := 0;
>   BEGIN
> FOR oldrec IN SELECT * FROM old_table LOOP
>   newrec := row_to_json(n) FROM new_table n OFFSET i LIMIT 1;
>   i := i + 1;
>   INSERT INTO wine_audit
>SELECT 'U', now(), row_to_json(oldrec), newrec;
> END LOOP;
>   END;
> 
> END IF;
> RETURN NULL;
>   END;
> $$;
> 
> CREATE TABLE wines (winery text, brand text, variety text, year int, bottles 
> int);
> CREATE TABLE shipment (LIKE wines);
> CREATE TABLE wine_audit (op varchar(1), datetime timestamptz,
>oldrow jsonb, newrow jsonb);
> 
> CREATE TRIGGER wine_update
>   AFTER UPDATE ON wines
>   REFERENCING OLD TABLE AS old_table NEW TABLE AS new_table
>   FOR EACH STATEMENT EXECUTE FUNCTION wine_audit();
> -- I omit triggers on insert and update because the trigger code for those is 
> trivial
> 
> INSERT INTO wines VALUES ('Concha y Toro', 'Sunrise', 'Chardonnay', 2021, 12),
> ('Concha y Toro', 'Sunrise', 'Merlot', 2022, 12);
> 
> INSERT INTO shipment VALUES ('Concha y Toro', 'Sunrise', 'Chardonnay', 2021, 
> 96),
> ('Concha y Toro', 'Sunrise', 'Merlot', 2022, 120),
> ('Concha y Toro', 'Marqués de Casa y Concha', 'Carmenere', 2021, 48),
> ('Concha y Toro', 'Casillero del Diablo', 'Cabernet Sauvignon', 2019, 240);
> 
> ALTER TABLE shipment ADD COLUMN marked timestamp with time zone;
> 
> WITH unmarked_shipment AS
>  (UPDATE shipment SET marked = now() WHERE marked IS NULL
>  RETURNING winery, brand, variety, year, bottles)
> MERGE INTO wines AS w
>  USING (SELECT winery, brand, variety, year,
> sum(bottles) as bottles
>FROM unmarked_shipment
>GROUP BY winery, brand, variety, year) AS s
> ON (w.winery, w.brand, w.variety, w.year) =
>(s.winery, s.brand, s.variety, s.year)
> WHEN MATCHED THEN
>  UPDATE SET bottles = w.bottles + s.bottles
> WHEN NOT MATCHED THEN
>  INSERT (winery, brand, variety, year, bottles)
>  VALUES (s.winery, s.brand, s.variety, s.year, s.bottles)
> ;
> 
> 
> If you examine table wine_audit after pasting all of the above, you'll
> see this, which is correct:
> 
> ─[ RECORD 1 
> ]
> op   │ U
> datetime │ 2023-02-01 01:16:44.704036+01
> oldrow   │ {"year": 2021, "brand": "Sunrise", "winery": "Concha y Toro", 
> "bottles": 12, "variety": "Chardonnay"}
> newrow   │ {"year": 2021, "brand": "Sunrise", "winery": "Concha y Toro", 
> "bottles": 108, "variety": "Chardonnay"}
> ─[ RECORD 2 
> ]
> op   │ U
> datetime │ 2023-02-01 01:16:44.704036+01
> oldrow   │ {"year": 2022, "brand": "Sunrise", "winery": "Concha y Toro", 
> "bottles": 12, "variety": "Merlot"}
> newrow   │ {"year": 2022, "brand": "Sunrise", "winery": "Concha y Toro", 
> "bottles": 132, "variety": "Merlot"}
> 
> My question is how to obtain the same rows without the LIMIT/OFFSET line
> in the trigger function.
> 
> 
> Also: how can we "subtract" both JSON blobs so that the 'newrow' only
> contains the members that differ?  I would like to have this:
> 
> ─[ RECORD 1 
> ]
> op   │ U
> datetime │ 2023-02-01 01:16:44.704036+01
> oldrow   │ {"year": 2021, "brand": "Sunrise", "winery": "Concha y Toro", 
> "bottles": 12, "variety": "Chardonnay"}
> newrow   │ {"bottles": 108}
> ─[ RECORD 2 
> ]

Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-02-01 Thread shveta malik
On Wed, Feb 1, 2023 at 5:42 PM Melih Mutlu  wrote:

>
>
> Thanks for investigating this error. I think it's the same error as the one 
> Shi reported earlier. [1]
> I couldn't reproduce it yet but will apply your tweaks and try again.
> Looking into this.
>
> [1] 
> https://www.postgresql.org/message-id/OSZPR01MB631013C833C98E826B3CFCB9FDC69%40OSZPR01MB6310.jpnprd01.prod.outlook.com

I tried Shi-san's testcase earlier but I too could not reproduce it,
so I assumed that it is fixed in one of your patches already and thus
thought that the current issue is a new one.

thanks
Shveta




Re: [Commitfest 2023-01] has started

2023-02-01 Thread vignesh C
On Wed, 1 Feb 2023 at 11:25, Julien Rouhaud  wrote:
>
> On Wed, Feb 1, 2023 at 10:44 AM vignesh C  wrote:
> >
> > I don't have permissions to close the commitfest, could one of them
> > help in closing the commitfest.
>
> It's technically 17:53 at Anywhere on Earth, so we usually wait for
> the day to be over before doing so.  But since you already took care of
> triaging all CF entries I closed the CF.

I had updated the entries at EOD of the 31st, India time; probably I should
have waited for the other parts of the world too. I will take care of
this next time. Thanks for closing the CF.

Regards,
Vignesh




Re: Support for dumping extended statistics

2023-02-01 Thread Justin Pryzby
On Wed, Feb 01, 2023 at 04:38:17AM +, Hari krishna Maddileti wrote:
> Hi Justin,
> Thanks for the update. I have attached the updated patch, now compatible 
> with meson, and addressed the warnings from the makefile too.

Thanks - I see it compiles now under both build systems.

But there's build warnings, and it fails regression tests.

http://cfbot.cputube.org/david-kimura.html

On 15/01/23, 2:27 AM, "Justin Pryzby"  wrote:
> 
> On Tue, Jan 10, 2023 at 11:28:36AM +, Hari krishna Maddileti wrote:
> > Thanks Team for showing interest.
> >
> > Please find the attached patch, which uses the same approach as mentioned 
> > in previous email to implement input functions to parse pg_distinct, 
> > pg_dependency and pg_mcv_list strings.
> 
> The patch is failing ; you need to make the corresponding update to
> meson as you did for make.
> 
> But actually, it also fails to compile with "make".




Re: Syncrep and improving latency due to WAL throttling

2023-02-01 Thread Tomas Vondra



On 2/1/23 11:04, Jakub Wartak wrote:
> On Mon, Jan 30, 2023 at 9:16 AM Bharath Rupireddy
>  wrote:
> 
> Hi Bharath, thanks for reviewing.
> 
>> I think measuring the number of WAL flushes with and without this
>> feature that the postgres generates is great to know this feature
>> effects on IOPS. Probably it's even better with variations in
>> synchronous_commit_flush_wal_after or the throttling limits.
> 
> It's the same as point 19, so I covered it there.
> 
>> Although v2 is a WIP patch, I have some comments:
>> 1. Coding style doesn't conform to the Postgres standard:
> 
> Fixed all those that you mentioned and I've removed elog() and code
> for timing. BTW: is there a way to pgindent only my changes (on git
> diff?)
> 
>> 2. It'd be better to add a TAP test hitting the WAL throttling.
> 
> TODO, any hints on what that test should look like are welcome
> (checking pg_stat_wal?). I could spot only 1 place for it --
> src/test/recovery/007_sync_rep.pl
> 
>> 3. We generally don't need timings to be calculated explicitly when we
>> emit before and after log messages. If needed one can calculate the
>> wait time from timestamps of the log messages. If it still needs an
>> explicit time difference which I don't think is required, because we
>> have a special event and before/after log messages, I think it needs
>> to be under track_wal_io_timing to keep things simple.
> 
> Removed, it was just debugging aid to demonstrate the effect in the
> session waiting.
> 
>> 4. XLogInsertRecord() can be a hot path for high-write workload, so
>> effects of adding anything in it needs to be measured. So, it's better
>> to run benchmarks with this feature enabled and disabled.
> 
> When the GUC is off (0, the default), in my tests the impact is none (it's
> just a set of simple IFs). However, if the feature is enabled then the
> INSERT is slowed down: I've repeated the initial test from the 1st post,
> and the single-statement INSERT for 50MB of WAL is the same 4-5s, versus
> ~23s +/- 1s when the feature is activated and the RTT is 10.1ms, but
> that's intentional.
> 
>> 5. Missing documentation of this feature and the GUC. I think we can
>> describe extensively why we'd need a feature of this kind in the
>> documentation for better adoption or usability.
> 
> TODO, I'm planning on adding documentation when we're a little bit
> closer to adding this to the CF.
> 
>> 6. Shouldn't the max limit be MAX_KILOBYTES?
>> +&synchronous_commit_flush_wal_after,
>> +0, 0, 1024L*1024L,
> 
> Fixed.
> 
>> 7. Can this GUC name be something like
>> synchronous_commit_wal_throttle_threshold to better reflect what it
>> does for a backend?
>> +{"synchronous_commit_flush_wal_after", PGC_USERSET,
>> REPLICATION_SENDING,
> 
> Done.
> 
>> 8. A typo - s/confiration/confirmation
> [..]
>> 9. This
>> "Sets the maximum logged WAL in kbytes, after which wait for sync
>> commit confiration even without commit "
>> better be something like below?
>> "Sets the maximum amount of WAL in kilobytes a backend generates after
>> which it waits for synchronous commit confiration even without commit
>> "
> 
> Fixed as you have suggested.
> 
>> 10. I think there's nothing wrong in adding some assertions in
>> HandleXLogDelayPending():
>> Assert(synchronous_commit_flush_wal_after > 0);
>> Assert(backendWalInserted > synchronous_commit_flush_wal_after * 1024L);
>> Assert(XactLastThrottledRecEnd is not InvalidXLogRecPtr);
> 
> Sure, added.
> 
>> 11. Specify the reason in the comments as to why we had to do the
>> following things:
>> Here:
>> +/* flush only up to the last fully filled page */
>> +XLogRecPtr LastFullyWrittenXLogPage = XactLastThrottledRecEnd
>> - (XactLastThrottledRecEnd % XLOG_BLCKSZ);
>> Talk about how this avoids multiple-flushes for the same page.
>>
>> Here:
>> + * Called from ProcessMessageInterrupts() to avoid waiting while
>> being in critical section
>> + */
>> +void HandleXLogDelayPending()
>> Talk about how waiting in a critical section, that is in
>> XLogInsertRecord() causes locks to be held longer durations and other
>> effects.
> 
> Added.
> 
>> Here:
>> +/* WAL throttling */
>> +backendWalInserted += rechdr->xl_tot_len;
>> Be a bit more verbose about why we try to throttle WAL and why only
>> for sync replication, the advantages, effects etc.
> 
> Added.
> 
>> 12. This better be a DEBUG1 or 2 message instead of WARNINGs to not
>> clutter server logs.
>> +/* XXX Debug for now */
>> +elog(WARNING, "throttling WAL down on this session
>> (backendWalInserted=%d, LSN=%X/%X flushingTo=%X/%X)",
>> +backendWalInserted,
>> +LSN_FORMAT_ARGS(XactLastThrottledRecEnd),
>> +LSN_FORMAT_ARGS(LastFullyWrittenXLogPage));
>>
>> +elog(WARNING, "throttling WAL down on this session - end (%f
>> ms)", timediff);
> 
> OK, that's entirely removed.
> 
>> 13. BTW, we don't need to hold interrupts while waiting for sync
>> replication ack as it may block query cancels or proc die pendings.
>

Re: Show various offset arrays for heap WAL records

2023-02-01 Thread Robert Haas
On Tue, Jan 31, 2023 at 6:20 PM Peter Geoghegan  wrote:
> Actually the really wide output comes from COMMIT records. After I run
> the regression tests, and execute some of my own custom pg_walinspect
> queries, I see that some individual COMMIT records have a
> length(description) of over 10,000 bytes/characters. There is even one
> particular COMMIT record whose length(description) is about 46,000
> bytes/characters. So *ludicrously* verbose GetRmgr() strings are not
> uncommon today. The worst case (or even particularly bad cases) won't
> be made any worse by this patch, because there are obviously limits on
> the width of the arrays that it outputs details descriptions of, that
> don't apply to these COMMIT records.

If we're dumping a lot of details out of each WAL record, we might
want to switch to a multi-line format of some kind. No one enjoys a
460-character wide line, let alone 46000.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: recovery modules

2023-02-01 Thread Andres Freund
Hi,

On 2023-01-31 15:30:13 -0800, Nathan Bossart wrote:


> +/*
> + * basic_archive_startup
> + *
> + * Creates the module's memory context.
> + */
> +void *
> +basic_archive_startup(void)
> +{
> +	return (void *) AllocSetContextCreate(TopMemoryContext,
> +	                                       "basic_archive",
> +	                                       ALLOCSET_DEFAULT_SIZES);
>  }

I'd make basic_archive's private data a struct, with a member for the
context, but it's not that important.

I'd also be inclined to do the same for the private_state you're passing
around for each module. Even if it's just to reduce the number of
functions accepting void * - loosing compiler type checking isn't great.

So maybe an ArchiveModuleState { void *private_data } that's passed to
basic_archive_startup() and all the other callbacks.
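
Roughly (just a sketch, names and details to taste):

typedef struct ArchiveModuleState
{
	void	   *private_data;	/* module-private state, e.g. a memory context */
} ArchiveModuleState;

void	basic_archive_startup(ArchiveModuleState *state);
bool	basic_archive_file(ArchiveModuleState *state,
						   const char *file, const char *path);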

Greetings,

Andres Freund




Re: Support for dumping extended statistics

2023-02-01 Thread Tomas Vondra



On 1/7/23 03:39, Bruce Momjian wrote:
> On Thu, Jan  5, 2023 at 06:29:03PM +, Hari krishna Maddileti wrote:
>> Hi Team,
>> In order to restore dumped extended statistics (stxdndistinct,
>> stxddependencies, stxdmcv) we need to provide input functions to parse
>> pg_distinct/pg_dependency/pg_mcv_list strings.
>>
>> Today we get the ERROR "cannot accept a value of type pg_ndistinct/
>> pg_dependencies/pg_mcv_list" when we try to do an insert of any type.
>>
>> Approach tried:
>>
>> - Using yacc grammar file (statistics_gram.y) to parse the input string to 
>> its
>> internal format for the types pg_distinct and pg_dependencies
>>
>> - We are just calling byteain() for serialized input text of type 
>> pg_mcv_list.
>>
>> Currently the changes are working locally. I would like to push the
>> changes upstream if there is any use case for postgres. Would like to
>> know if there is any interest from the postgres side.
> 
> There is certainly interest in allowing the optimizer statistics to be
> dumped and reloaded.  This could be used by pg_restore and pg_upgrade.
> 

Indeed, although I think it'd be better to deal with regular statistics
(which is what 99% of systems use). Furthermore, we should probably
think about differences between major versions - until now we could
change the on-disk format of the statistics, because we have always reset
them. It'd be silly to do a dump on version X, and then fail to restore it
on (X+1) just because the statistics changed a bit.

So we need to be able to determine whether the statistics have the correct
format/version. And we need to do that for pg_upgrade.

At the very least we need an option to skip restoring statistics, or
something like that.

regards
-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Syncrep and improving latency due to WAL throttling

2023-02-01 Thread Jakub Wartak
On Wed, Feb 1, 2023 at 2:14 PM Tomas Vondra
 wrote:

> > Maybe we should avoid calling fsyncs for WAL throttling? (by teaching
> > HandleXLogDelayPending()->XLogFlush()->XLogWrite() to NOT to sync when
> > we are flushing just because of WAL thortting ?) Would that still be
> > safe?
>
> It's not clear to me how could this work and still be safe. I mean, we
> *must* flush the local WAL first, otherwise the replica could get ahead
> (if we send unflushed WAL to replica and then crash). Which would be
> really bad, obviously.

Well it was just a thought: in this particular test - with no other
concurrent activity happening - we are fsyncing() uncommitted
Heap/INSERT data much earlier than the final Transaction/COMMIT WAL
record comes into play. I agree that some other concurrent backend's
COMMIT could fsync it, but I was wondering whether that's a sensible
optimization to perform (so that issue_fsync() would be called only for
commit/rollback records). I can imagine a scenario with 10 such
concurrent backends running - all of them with this $thread-GUC set -
but that would cause 20k unnecessary fsyncs (?) -- assuming a single
HDD with IOlat=20ms and a standby capable of sync-ack < 0.1ms, close
to 400s would be wasted just on local fsyncs. I don't have
a strong opinion or deep knowledge on this, but that smells like IO waste.
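
(Spelled out: 20,000 extra fsyncs * ~20 ms each comes to roughly 400 s of
waiting, which is where that number comes from.)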

-J.




Performance issues with parallelism and LIMIT

2023-02-01 Thread David Geier

Hi hackers,

While migrating from PostgreSQL 14 to 15, we encountered the following 
performance degradation caused by commit 46846433a03dff: "shm_mq: Update 
mq_bytes_written less often", discussion in [1].


The batching can make queries with a LIMIT clause run significantly 
slower compared to PostgreSQL 14, because neither the ring buffer write 
position is updated, nor the latch to inform the leader that there's 
data available is set, before a worker's queue is 1/4th full. This can 
be seen in the number of rows produced by a parallel worker. Worst-case, 
the data set is large and all rows to answer the query appear early, but 
are not big enough to fill the queue to 1/4th (e.g. when the LIMIT and 
the tuple sizes are small). Here is an example to reproduce the problem.


CREATE TABLE t(id1 INT, id2 INT, id3 INT, id4 INT, id5 INT);
INSERT INTO t(id1, id2, id3, id4, id5) SELECT i%1000, i, i, i, i FROM 
generate_series(1, 1000) AS i;

ANALYZE t;
SET parallel_tuple_cost = 0;
SET parallel_setup_cost = 0;
SET min_parallel_table_scan_size = 0;
SET max_parallel_workers_per_gather = 8;
EXPLAIN ANALYZE VERBOSE SELECT id2 FROM t WHERE id1 = 100 LIMIT 100;

PostgreSQL 15:

 Limit  (cost=0.00..797.43 rows=100 width=4) (actual time=65.083..69.207 rows=100 loops=1)
   Output: id2
   ->  Gather  (cost=0.00..79320.18 rows=9947 width=4) (actual time=65.073..68.417 rows=100 loops=1)
         Output: id2
         Workers Planned: 8
         Workers Launched: 7
         ->  Parallel Seq Scan on public.t  (cost=0.00..79320.18 rows=1243 width=4) (actual time=0.204..33.049 rows=100 loops=7)
               Output: id2
               Filter: (t.id1 = 100)
               Rows Removed by Filter: 99345
               Worker 0:  actual time=0.334..32.284 rows=100 loops=1
               Worker 1:  actual time=0.060..32.680 rows=100 loops=1
               Worker 2:  actual time=0.637..33.954 rows=98 loops=1
               Worker 3:  actual time=0.136..33.301 rows=100 loops=1
               Worker 4:  actual time=0.140..31.942 rows=100 loops=1
               Worker 5:  actual time=0.062..33.673 rows=100 loops=1
               Worker 6:  actual time=0.062..33.512 rows=100 loops=1
 Planning Time: 0.113 ms
 Execution Time: 69.772 ms

PostgreSQL 14:

 Limit  (cost=0.00..797.75 rows=100 width=4) (actual time=30.602..38.459 rows=100 loops=1)
   Output: id2
   ->  Gather  (cost=0.00..79320.18 rows=9943 width=4) (actual time=30.592..37.669 rows=100 loops=1)
         Output: id2
         Workers Planned: 8
         Workers Launched: 7
         ->  Parallel Seq Scan on public.t  (cost=0.00..79320.18 rows=1243 width=4) (actual time=0.221..5.181 rows=15 loops=7)
               Output: id2
               Filter: (t.id1 = 100)
               Rows Removed by Filter: 15241
               Worker 0:  actual time=0.129..4.840 rows=15 loops=1
               Worker 1:  actual time=0.125..4.924 rows=15 loops=1
               Worker 2:  actual time=0.314..5.249 rows=17 loops=1
               Worker 3:  actual time=0.252..5.341 rows=15 loops=1
               Worker 4:  actual time=0.163..5.179 rows=15 loops=1
               Worker 5:  actual time=0.422..5.248 rows=15 loops=1
               Worker 6:  actual time=0.139..5.489 rows=16 loops=1
 Planning Time: 0.084 ms
 Execution Time: 38.880 ms

I had a quick look at the code and I started wondering if we can't 
achieve the same performance improvement without batching by e.g.:


- Only set the latch if new data is written to an empty queue. 
Otherwise, the leader should keep trying to read from the queues anyway, 
without waiting for the latch, so there is no need to set the latch again.


- Reorganize struct shm_mq. There seems to be false sharing happening 
between at least mq_ring_size and the atomics, and potentially also 
between the atomics themselves. I'm wondering if that's not the root 
cause of the "slow atomics" observed in [1]? I'm happy to do some profiling.


Alternatively, we could always set the latch if numberTuples in 
ExecutePlan() is reasonably low. To do so, the DestReceiver's receive() 
method would only need an additional "force flush" argument.



A slightly different but related problem is when some workers have 
already produced enough rows to answer the LIMIT query, but other 
workers are still running without producing any new rows. In that case 
the "already done" workers will stop running even though they haven't 
reached 1/4th of the queue size, because the for-loop in execMain.c 
bails out in the following condition:


    if (numberTuples && numberTuples == current_tuple_count)
    break;

Subsequently, the leader will end the plan and then wait in the Gather 
node for all workers to shutdown. However, workers still running but not 
producing any new rows will never reach the following condition in 
execMain.c to check if they're supposed to stop (the shared memory queue 
dest receiver will return false on detached queues):


    /*
 * If we are no

Re: Progress report of CREATE INDEX for nested partitioned tables

2023-02-01 Thread Ilya Gladyshev

> 1 февр. 2023 г., в 16:01, Alvaro Herrera  написал(а):
> 
> Hmm, count_leaf_partitions has to scan pg_inherits and do a syscache
> lookup for every single element therein ... this sounds slow. 
> 
> In one of the callsites, we already have the partition descriptor
> available.  We could just scan partdesc->is_leaf[] and add one for each
> 'true' value we see there.

The problem is that partdesc contains only direct children of the table and we 
need all the children down the inheritance tree to count the total number of 
leaf partitions in the first callsite.

> In the other callsite, we had the table open just a few lines before the
> place you call count_leaf_partitions.  Maybe we can rejigger things by
> examining its state before closing it: if relkind is not partitioned we
> know leaf_partitions=0, and only if partitioned we count leaf partitions.
> I think that would save some work.  I also wonder if it's worth writing
> a bespoke function for counting leaf partitions rather than relying on
> find_all_inheritors.

Sure, added this condition to avoid the extra work here.


v3-0001-create-index-progress-increment.patch
Description: Binary data





Re: Non-superuser subscription owners

2023-02-01 Thread Robert Haas
On Tue, Jan 31, 2023 at 7:01 PM Andres Freund  wrote:
> I don't really understand that - the run-as approach seems like a
> necessary piece of improving the security model.
>
> I think it's perfectly reasonable to want to replicate from one system
> in another, but to not want to allow logical replication to insert into
> pg_class or whatnot. So not using superuser to execute the replication
> makes sense.
>
> This is particularly the case if you're just replicating a small part of
> the tables from one system to another. E.g. in a sharded setup, you may
> want to replicate metadata too servers.

I don't think that a system catalog should be considered a valid
replication target, no matter who owns the subscription, so ISTM that
writing to pg_class should be blocked regardless. The thing I'm
struggling to understand is: if you only want to replicate into tables
that Alice can write, why not just make Alice own the subscription?
For a run-as user to make sense, you need a scenario where we want the
replication to target only tables that Alice can touch, but we also
don't want Alice herself to be able to touch the subscription, so you
make Alice the run-as user and yourself the owner, or something like
that. But I'm not sure what that scenario is exactly.

Mark was postulating a scenario where the publisher and subscriber
don't trust each other. I was thinking a little bit more about that. I
still maintain that the current system is poorly set up to make that
work, but suppose we wanted to do better. We could add filtering on
the subscriber side, like you list schemas or specific relations that
you are or are not willing to replicate into. Then you could, for
example, connect your subscription to a certain remote publication,
but with the restriction that you're only willing to replicate into
the "headquarters" schema. Then we'll replicate whatever tables they
send us, but if the dorks at headquarters mess up the publications on
their end (intentionally or otherwise) and add some tables from the
"locally_controlled_stuff" schema, we'll refuse to replicate that into
our eponymous schema. I don't think this kind of system is well-suited
to environments where people are totally hostile to each other,
because you still need to have replication slots on the remote side
and stuff. Also, having the remote side decode stuff and ignoring it
locally is expensive, and I bet if we add stuff like this then people
will misuse it and be sad. But it would make the system easier to
reason about: I know for sure that this subscription will only write
to these places, because that's all I've given it permission to do.

In the sharding scenario you mention, if you want to prevent
accidental writes to unrelated tables due to the publication not being
what we expect, you can either make the subscription owned by the same
role that owns the sharded tables, or a special-purpose role that has
permission to write to exactly the set of tables that you expect to be
touched and no others. Or, if you had something like what I posited in
the last paragraph, you could use that instead. But I don't see how a
separate run-as user helps. If I'm just being super-dense here, I hope
that one of you will explain using short words. :-)

> I think we'll need two things to improve upon the current situation:
>
> 1) run-as user, to reduce the scope of potential danger
>
> 2) Option to run the database inserts as the owner of the table, with a
>check that the run-as is actually allowed to perform work as the
>owning role. That prevents escalation from table owner (who could add
>default expressions etc) from gettng the privs of the
>run-as/replication owner.

I'm not quite sure what we do here now, but I agree that trigger
firing seems like a problem. It might be that we need to worry about
the user on the origin server, too. If Alice inserts a row that causes
a replicated table owned by Bob to fire a trigger or evaluate a
default expression or whatever due the presence of a subscription
owned by Charlie, there is a risk that Alice might try to attack
either Bob or Charlie, or that Bob might try to attack Charlie.

> > And if we suppose that that already works and is safe, well then
> > what's the case where I do need a run-as user?
>
> It's not at all safe today, IMO. You need to trust that nothing bad will
> be replicated, otherwise the owner of the subscription has to be
> considered compromised.

What kinds of things are bad to replicate?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Support for dumping extended statistics

2023-02-01 Thread Tom Lane
Tomas Vondra  writes:
> On 1/7/23 03:39, Bruce Momjian wrote:
>> There is certainly interest in allowing the optimizer statistics to be
>> dumped and reloaded.  This could be used by pg_restore and pg_upgrade.

> Indeed, although I think it'd be better to deal with regular statistics
> (which is what 99% of systems use). Furthermore, we should probably
> think about differences between major versions - until now we could
> change on-disk format of the statistics, because we have reset them.

Yeah, it's extremely odd to be proposing dump/reload for extended
stats when we don't yet have it for plain stats.  And yes, the main
stumbling block is that you need to have a plan for stats changing
across versions, or even just environmental issues.  For example,
what if the target DB doesn't use the same collation as the source?
That would affect string sorting and therefore at least partially
invalidate histograms for text columns.

I actually did some work on this, probably close to ten years ago
now, and came up with some hacks that didn't pass community review.
It'd be a good idea to dig up those old discussions if you want to
re-open the topic.

regards, tom lane




Re: Weird failure with latches in curculio on v15

2023-02-01 Thread Tom Lane
Andres Freund  writes:
> On 2023-02-01 16:21:16 +1300, Thomas Munro wrote:
>> It's always in proc_exit() in StartupProcShutdownHandler(), a SIGTERM
>> handler which is allowed to call that while in_restore_command is
>> true.

> Ugh, no wonder we're getting crashes. This whole business seems bogus as
> hell.

Indeed :-(

> I don't see a choice but to revert the recent changes. They need a
> fairly large rewrite.

9a740f81e clearly made things a lot worse, but it wasn't great
before.  Can we see a way forward to removing the problem entirely?

The fundamental issue is that we have no good way to break out
of system(), and I think the original idea was that
in_restore_command would be set *only* for the duration of the
system() call.  That's clearly been lost sight of completely,
but maybe as a stopgap we could try to get back to that.
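
That is, something along these lines (untested; "restoreCmd" standing in
for wherever we keep the already-built command string):

	/* allow the SIGTERM handler to proc_exit() only while blocked here */
	in_restore_command = true;
	rc = system(restoreCmd);
	in_restore_command = false;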

regards, tom lane




Re: Progress report of CREATE INDEX for nested partitioned tables

2023-02-01 Thread Matthias van de Meent
On Wed, 1 Feb 2023 at 15:21, Ilya Gladyshev  wrote:
>
> > 1 февр. 2023 г., в 16:01, Alvaro Herrera  
> > написал(а):
> >
> > Hmm, count_leaf_partitions has to scan pg_inherits and do a syscache
> > lookup for every single element therein ... this sounds slow.
> >
> > In one of the callsites, we already have the partition descriptor
> > available.  We could just scan partdesc->is_leaf[] and add one for each
> > 'true' value we see there.
>
> The problem is that partdesc contains only direct children of the table and 
> we need all the children down the inheritance tree to count the total number 
> of leaf partitions in the first callsite.
>
> > In the other callsite, we had the table open just a few lines before the
> > place you call count_leaf_partitions.  Maybe we can rejigger things by
> > examining its state before closing it: if relkind is not partitioned we
> > know leaf_partitions=0, and only if partitioned we count leaf partitions.
> > I think that would save some work.  I also wonder if it's worth writing
> > a bespoke function for counting leaf partitions rather than relying on
> > find_all_inheritors.
>
> Sure, added this condition to avoid the extra work here.
>

>When creating an index on a partitioned table, this column is set to
> -   the total number of partitions on which the index is to be created.
> +   the total number of leaf partitions on which the index is to be 
> created or attached.

I think we should also add a note about the (now) non-constant nature
of the value, something along the lines of "This value is updated as
we're processing and discovering partitioned tables in the partition
hierarchy".

Kind regards,

Matthias van de Meent




Re: Progress report of CREATE INDEX for nested partitioned tables

2023-02-01 Thread Justin Pryzby
On Wed, Feb 01, 2023 at 04:21:35PM +0100, Matthias van de Meent wrote:
> On Wed, 1 Feb 2023 at 15:21, Ilya Gladyshev  
> wrote:
> > > 1 февр. 2023 г., в 16:01, Alvaro Herrera  
> > > написал(а):
> > > Hmm, count_leaf_partitions has to scan pg_inherits and do a syscache
> > > lookup for every single element therein ... this sounds slow.
> > >
> > > In one of the callsites, we already have the partition descriptor
> > > available.  We could just scan partdesc->is_leaf[] and add one for each
> > > 'true' value we see there.
> >
> > The problem is that partdesc contains only direct children of the table and 
> > we need all the children down the inheritance tree to count the total 
> > number of leaf partitions in the first callsite.
> >
> > > In the other callsite, we had the table open just a few lines before the
> > > place you call count_leaf_partitions.  Maybe we can rejigger things by
> > > examining its state before closing it: if relkind is not partitioned we
> > > know leaf_partitions=0, and only if partitioned we count leaf partitions.
> > > I think that would save some work.  I also wonder if it's worth writing
> > > a bespoke function for counting leaf partitions rather than relying on
> > > find_all_inheritors.
> >
> > Sure, added this condition to avoid the extra work here.
> >
> 
> >When creating an index on a partitioned table, this column is set to
> > -   the total number of partitions on which the index is to be created.
> > +   the total number of leaf partitions on which the index is to be 
> > created or attached.
> 
> I think we should also add a note about the (now) non-constant nature
> of the value, something along the lines of "This value is updated as
> we're processing and discovering partitioned tables in the partition
> hierarchy".

But the TOTAL is constant, right ?  Updating the total when being called
recursively is the problem these patches fix.

-- 
Justin




Re: Weird failure with latches in curculio on v15

2023-02-01 Thread Nathan Bossart
On Wed, Feb 01, 2023 at 10:12:26AM -0500, Tom Lane wrote:
> Andres Freund  writes:
>> On 2023-02-01 16:21:16 +1300, Thomas Munro wrote:
>>> It's always in proc_exit() in StartupProcShutdownHandler(), a SIGTERM
>>> handler which is allowed to call that while in_restore_command is
>>> true.
> 
>> Ugh, no wonder we're getting crashes. This whole business seems bogus as
>> hell.
> 
> Indeed :-(

Ugh.  My bad.

> The fundamental issue is that we have no good way to break out
> of system(), and I think the original idea was that
> in_restore_command would be set *only* for the duration of the
> system() call.  That's clearly been lost sight of completely,
> but maybe as a stopgap we could try to get back to that.

+1.  I'll produce some patches.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: RLS makes COPY TO process child tables

2023-02-01 Thread Yugo NAGATA
On Wed, 01 Feb 2023 12:45:57 +0100
Antonin Houska  wrote:

> While working on [1] I noticed that if RLS gets enabled, the COPY TO command
> includes the contents of child table into the result, although the
> documentation says it should not:
> 
>   "COPY TO can be used only with plain tables, not views, and does not
>   copy rows from child tables or child partitions. For example, COPY
>   table TO copies the same rows as SELECT * FROM ONLY table. The syntax
>   COPY (SELECT * FROM table) TO ... can be used to dump all of the rows
>   in an inheritance hierarchy, partitioned table, or view."
> 
> A test case is attached (rls.sql) as well as fix proposal
> (copy_rls_no_inh.diff).

I think this is a bug because the current behaviour is different from
the documentation. 

When RLS is enabled on a table in `COPY ... TO ...`, the query is converted
to `COPY (SELECT * FROM ...) TO ...` to allow the rewriter to add in the RLS
clauses. This causes the rows of child tables to be dumped as well.
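
For example (my own minimal sketch, not the attached rls.sql; "alice" is
just a placeholder for any role that is subject to RLS, i.e. not the
table owner and without BYPASSRLS):

CREATE TABLE parent (a int);
CREATE TABLE child () INHERITS (parent);
INSERT INTO parent VALUES (1);
INSERT INTO child VALUES (2);
ALTER TABLE parent ENABLE ROW LEVEL SECURITY;
CREATE POLICY p ON parent USING (true);
GRANT SELECT ON parent TO alice;
SET ROLE alice;
COPY parent TO STDOUT;  -- includes the row from child, unlike SELECT * FROM ONLY parent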

The patch fixes this by setting "inh" of the table in the converted query
to false. This seems reasonable and actually fixes the problem.

However, I think we would want a comment on the added line. Also, the
attached test should be added to the regression tests.

Regards,
Yugo Nagata

> 
> [1] https://commitfest.postgresql.org/41/3641/
> 
> -- 
> Antonin Houska
> Web: https://www.cybertec-postgresql.com
> 


-- 
Yugo NAGATA 




pg_dump versus hash partitioning

2023-02-01 Thread Tom Lane
Over at [1] we have a complaint that dump-and-restore fails for
hash-partitioned tables if a partitioning column is an enum,
because the enum values are unlikely to receive the same OIDs
in the destination database as they had in the source, and the
hash codes are dependent on those OIDs.  So restore tries to
load rows into the wrong leaf tables, and it's all a mess.
The patch approach proposed at [1] doesn't really work, but
what does work is to use pg_dump's --load-via-partition-root
option, so that the tuple routing decisions are all re-made.
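
To make the mechanism concrete (a quick illustration, not from the
report): the enum hash support function just hashes the value's
pg_enum OID, e.g.

CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');
SELECT oid, enumlabel, hashenum(enumlabel::text::mood)
  FROM pg_enum WHERE enumtypid = 'mood'::regtype;

and a fresh CREATE TYPE in the destination database will hand out
different OIDs for the same labels, hence different hash codes and
different leaf assignments.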

I'd initially proposed that we force --load-via-partition-root
if we notice that we have hash partitioning on an enum column.
But the more I thought about this, the more comparable problems
came to mind:

1. Hash partitioning on text columns will likely fail if the
destination uses a different encoding.

2. Hash partitioning on float columns will fail if you use
--extra-float-digits to round off the values.  And then
there's the fact that the behavior of strtod() might vary
across platforms.

3. Hash partitioning on floats is also endian-dependent,
and the same is likely true for some other types.

4. Anybody want to bet that complex types such as jsonb
are entirely free of similar hazards?  (Yes, somebody
thought it'd be a good idea to provide jsonb_hash.)

In general, we've never thought that hash values are
required to be consistent across platforms.

That was leading me to think that we should force
--load-via-partition-root for any hash-partitioned table,
just to pre-emptively avoid these problems.  But then
I remembered that

5. Range partitioning on text columns will likely fail if the
destination uses a different collation.

This is looking like a very large-caliber foot-gun, isn't it?
And remember that --load-via-partition-root acts at pg_dump
time, not restore.  If all you have is a dump file with no
opportunity to go back and get a new one, and it won't load
into your new server, you have got a nasty problem.

I don't think this is an acceptable degree of risk, considering
that the primary use-cases for pg_dump involve target systems
that aren't 100.00% identical to the source.

So here's what I think we should actually do: make
--load-via-partition-root the default.  We can provide a
switch to turn it off, for those who are brave or foolish
enough to risk that in the name of saving a few cycles,
but it ought to be the default.

Furthermore, I think we should make this happen in time for
next week's releases.  I can write the patch easily enough,
but we need a consensus PDQ that this is what to do.

Anyone want to bikeshed on the spelling of the new switch?
I'm contemplating "--load-via-partition-leaf" or perhaps
"--no-load-via-partition-root".

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/765e5968-6c39-470f-95bf-7b14e6b9a1c0%40app.fastmail.com




Re: Progress report of CREATE INDEX for nested partitioned tables

2023-02-01 Thread Matthias van de Meent
On Wed, 1 Feb 2023 at 16:53, Justin Pryzby  wrote:
>
> On Wed, Feb 01, 2023 at 04:21:35PM +0100, Matthias van de Meent wrote:
> > On Wed, 1 Feb 2023 at 15:21, Ilya Gladyshev  
> > wrote:
> > > > 1 февр. 2023 г., в 16:01, Alvaro Herrera  
> > > > написал(а):
> > > > Hmm, count_leaf_partitions has to scan pg_inherits and do a syscache
> > > > lookup for every single element therein ... this sounds slow.
> > > >
> > > > In one of the callsites, we already have the partition descriptor
> > > > available.  We could just scan partdesc->is_leaf[] and add one for each
> > > > 'true' value we see there.
> > >
> > > The problem is that partdesc contains only direct children of the table 
> > > and we need all the children down the inheritance tree to count the total 
> > > number of leaf partitions in the first callsite.
> > >
> > > > In the other callsite, we had the table open just a few lines before the
> > > > place you call count_leaf_partitions.  Maybe we can rejigger things by
> > > > examining its state before closing it: if relkind is not partitioned we
> > > > know leaf_partitions=0, and only if partitioned we count leaf 
> > > > partitions.
> > > > I think that would save some work.  I also wonder if it's worth writing
> > > > a bespoke function for counting leaf partitions rather than relying on
> > > > find_all_inheritors.
> > >
> > > Sure, added this condition to avoid the extra work here.
> > >
> >
> > >When creating an index on a partitioned table, this column is set 
> > > to
> > > -   the total number of partitions on which the index is to be 
> > > created.
> > > +   the total number of leaf partitions on which the index is to be 
> > > created or attached.
> >
> > I think we should also add a note about the (now) non-constant nature
> > of the value, something along the lines of "This value is updated as
> > we're processing and discovering partitioned tables in the partition
> > hierarchy".
>
> But the TOTAL is constant, right ?  Updating the total when being called
> recursively is the problem these patches fix.

If that's the case, then I'm not seeing the 'fix' part of the patch. I
thought this patch was fixing the provably incorrect TOTAL value where
DONE > TOTAL due to the recursive operation overwriting the DONE/TOTAL
values instead of updating them.

In HEAD we set TOTAL to whatever number partitioned table we're
currently processing has - regardless of whether we're the top level
statement.
With the patch we instead add the number of child relations to that
count, for which REL_HAS_STORAGE(child) -- or at least, in the v3
posted by Ilya. Approximately immediately after updating that count we
recurse to the child relations, and that only returns once it is done
creating the indexes, so both TOTAL and DONE go up as we process more
partitions in the hierarchy.

An example hierarchy:
CREATE TABLE parent (a, b) partition by list (a);
CREATE TABLE a1
PARTITION OF parent FOR VALUES IN (1)
PARTITION BY LIST (b);
CREATE TABLE a1bd
PARTITION OF a1 DEFAULT;

CREATE TABLE a2
PARTITION OF parent FOR VALUES IN (2)
PARTITION BY LIST (b);
CREATE TABLE a2bd
PARTITION OF a2 DEFAULT;

INSERT INTO parent (a, b) SELECT * from generate_series(1, 2) a(a)
cross join generate_series(1, 10) b(b);
CREATE INDEX ON parent(a,b);

This will only discover that a2bd will need to be indexed after a1bd
is done (or vice versa, depending on the order in which a1 and a2 are
processed).

Kind regards,

Matthias van de Meent




Re: RLS makes COPY TO process child tables

2023-02-01 Thread Tom Lane
Yugo NAGATA  writes:
> Antonin Houska  wrote:
>> While working on [1] I noticed that if RLS gets enabled, the COPY TO command
>> includes the contents of child table into the result, although the
>> documentation says it should not:

> I think this is a bug because the current behaviour is different from
> the documentation. 

I agree, it shouldn't do that.

> When RLS is enabled on a table in `COPY ... TO ...`, the query is converted
> to `COPY (SELECT * FROM ...) TO ...` to allow the rewriter to add in RLS
> clauses. This causes to dump the rows of child tables.

Do we actually say that in so many words, either in the code or docs?
If so, it ought to read `COPY (SELECT * FROM ONLY ...) TO ...`
instead.  (If we say that in the docs, then arguably the code *does*
conform to the docs.  But I don't see it in the COPY ref page at least.)

regards, tom lane




Re: Timeline ID hexadecimal format

2023-02-01 Thread Sébastien Lardière

On 31/01/2023 20:16, Greg Stark wrote:

The fact that the *filename* has it encoded in hex is an
implementation detail and really gets exposed here because it's giving
you the underlying system error that caused the problem.



It's an implementation detail, but an exposed one, so people refer 
to the filename to find the timeline ID (that's how it happened to me).




  The confusion
only arises when the two are juxtaposed. A hint or something just in
that case might be enough?




Thanks, I got your point.

 Note that my proposal was to remove the ambiguous notation, which 
happens in some cases (such as 11 <-> 17). A hint is useless in most 
cases, because there is no ambiguity. That's why I thought of using 
hexadecimal format everywhere.



At least, may I propose improving the documentation to state that the 
timeline ID appears in hexadecimal in filenames but must be given in 
decimal to recovery_target_timeline and pg_waldump?



regards,

--
Sébastien





Re: Weird failure with latches in curculio on v15

2023-02-01 Thread Andres Freund
Hi,

On 2023-02-01 10:12:26 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2023-02-01 16:21:16 +1300, Thomas Munro wrote:
> >> It's always in proc_exit() in StartupProcShutdownHandler(), a SIGTERM
> >> handler which is allowed to call that while in_restore_command is
> >> true.
>
> > Ugh, no wonder we're getting crashes. This whole business seems bogus as
> > hell.
>
> Indeed :-(
>
> > I don't see a choice but to revert the recent changes. They need a
> > fairly large rewrite.
>
> 9a740f81e clearly made things a lot worse, but it wasn't great
> before.  Can we see a way forward to removing the problem entirely?

Yea, I think we can - we should stop relying on system(). If we instead
run the command properly as a subprocess, we don't need to do bad things
in the signal handler anymore.
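
Roughly: fork() the child ourselves and wait for it in a loop where we
can react to latches and shutdown requests (a sketch only, error
handling and the exec details elided):

	pid_t	child = fork();

	if (child == 0)
	{
		/* exec the restore command here, however we end up quoting it */
		_exit(127);
	}

	for (;;)
	{
		int		status;

		if (waitpid(child, &status, WNOHANG) == child)
			break;				/* inspect status, report failure etc. */

		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 100 /* ms */ , 0 /* or a suitable wait event */ );
		ResetLatch(MyLatch);

		/* on a shutdown request we can kill(child, SIGTERM) and return */
	}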


> The fundamental issue is that we have no good way to break out
> of system(), and I think the original idea was that
> in_restore_command would be set *only* for the duration of the
> system() call.  That's clearly been lost sight of completely,
> but maybe as a stopgap we could try to get back to that.

We could push the functions setting in_restore_command down into
ExecuteRecoveryCommand(). But I don't think that'd end up necessarily
being right either - we'd now use the mechanism in places we previously
didn't (cleanup/end commands).

And there's just plenty other stuff in the 14bdb3f13de 9a740f81eb0 that
doesn't look right:
- We now have two places open-coding what BuildRestoreCommand did

- I'm doubtful that the new shell_* functions are the base for a good
  API to abstract restoring files

- the error message for a failed restore command seems to have gotten
  worse:
  could not restore file \"%s\" from archive: %s"
  ->
  "%s \"%s\": %s", commandName, command

- shell_* imo is not a good namespace for something called from xlog.c,
  xlogarchive.c. I realize the intention is that shell_archive.c is
  going to be its own "restore module", but for now it imo looks odd

- The comment moved out of RestoreArchivedFile() seems less useful at
  its new location

- explanation of why we use GetOldestRestartPoint() is halfway lost


My name is listed as the first Reviewed-by, but I certainly haven't done
any meaningful review of these patches. I just replied to the top-level
email proposing "recovery modules".

Greetings,

Andres Freund




Re: About PostgreSQL Core Team

2023-02-01 Thread Pavel Borisov
Hi, Adherent!

IMO "not liking" that you quote in the picture is just other words for
expressing caution for the patch or for the general direction of some
change. At least I never felt personal or arbitrary presumptions in
relation to my patches. So if you can join a discussion with your proposals
to address the sources of caution, and improve or review the patches, it
would be really helpful. And these are really the things that move patches
forward, not just complaining about words.

Regards,
Pavel Borisov,
Supabase


Re: Weird failure with latches in curculio on v15

2023-02-01 Thread Robert Haas
On Wed, Feb 1, 2023 at 11:58 AM Andres Freund  wrote:
> > 9a740f81e clearly made things a lot worse, but it wasn't great
> > before.  Can we see a way forward to removing the problem entirely?
>
> Yea, I think we can - we should stop relying on system(). If we instead
> run the command properly as a subprocess, we don't need to do bad things
> in the signal handler anymore.

I like the idea of not relying on system(). In most respects, doing
fork() + exec() ourselves seems superior. We can control where the
output goes, what we do while waiting, etc. But system() runs the
command through the shell, so that for example you don't have to
invent your own way of splitting a string into words to be passed to
exec[whatever](). I've never understood how you're supposed to get
that behavior other than by calling system().

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: meson: pkgconfig difference

2023-02-01 Thread Peter Eisentraut

On 01.02.23 08:55, Andres Freund wrote:

Hi,

On January 31, 2023 11:40:52 PM PST, Peter Eisentraut 
 wrote:

I think there is a tiny typo in src/interfaces/ecpg/ecpglib/meson.build:

diff --git a/src/interfaces/ecpg/ecpglib/meson.build 
b/src/interfaces/ecpg/ecpglib/meson.build
index dba9e3c3d9..da8d304f54 100644
--- a/src/interfaces/ecpg/ecpglib/meson.build
+++ b/src/interfaces/ecpg/ecpglib/meson.build
@@ -57,7 +57,7 @@ pkgconfig.generate(
   description: 'PostgreSQL libecpg library',
   url: pg_url,
   libraries: ecpglib_so,
-  libraries_private: [frontend_shlib_code, thread_dep],
+  libraries_private: [frontend_stlib_code, thread_dep],
   requires_private: ['libpgtypes', 'libpq'],
)

This makes it match the other libraries.

Without this change, we get

Libs.private: ... -lpgport_shlib -lpgcommon_shlib

which seems wrong.


Ugh, yes, that's wrong. Do you want me to apply the fix?


I've done it now.





Re: Weird failure with latches in curculio on v15

2023-02-01 Thread Andres Freund
Hi,

On 2023-02-01 12:08:24 -0500, Robert Haas wrote:
> On Wed, Feb 1, 2023 at 11:58 AM Andres Freund  wrote:
> > > 9a740f81e clearly made things a lot worse, but it wasn't great
> > > before.  Can we see a way forward to removing the problem entirely?
> >
> > Yea, I think we can - we should stop relying on system(). If we instead
> > run the command properly as a subprocess, we don't need to do bad things
> > in the signal handler anymore.
> 
> I like the idea of not relying on system(). In most respects, doing
> fork() + exec() ourselves seems superior. We can control where the
> output goes, what we do while waiting, etc. But system() runs the
> command through the shell, so that for example you don't have to
> invent your own way of splitting a string into words to be passed to
> exec[whatever](). I've never understood how you're supposed to get
> that behavior other than by calling system().

We could just exec the shell in the forked process, using -c to invoke
the command. That should give us pretty much the same efficiency as
system(), with a lot more control.

I think we already do that somewhere... Ah, yes, spawn_process() in
pg_regress.c.  I suspect we couldn't exec the command directly for
restore_command etc., as I think it's not uncommon to use && in the
command.
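
For illustration, a minimal sketch of that approach (a hypothetical helper,
not from any patch): fork, then exec the shell with -c, so word-splitting,
&& and redirection still work, while the parent keeps full control over how
it waits. /bin/sh is hard-coded here only for brevity.

#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run "command" through the shell; return the wait status, or -1 on error. */
static int
run_via_shell(const char *command)
{
	pid_t		pid = fork();

	if (pid < 0)
		return -1;				/* fork failed */
	if (pid == 0)
	{
		/* child: hand the whole string to the shell, as system() would */
		execl("/bin/sh", "sh", "-c", command, (char *) NULL);
		_exit(127);				/* exec failed */
	}

	for (;;)
	{
		int			status;

		/* parent: could instead use WNOHANG and check latches between tries */
		if (waitpid(pid, &status, 0) == pid)
			return status;
		if (errno != EINTR)
			return -1;
	}
}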


Perhaps we should abstract the relevant pieces of spawn_process() into
something more general? The OS specifics are sufficiently complicated
that I don't think it'd be good to have multiple copies.


It's too bad that we have the history of passing things to the shell,
otherwise we could define common argument handling for the GUC and just
execve ourselves, but that ship has sailed.

Greetings,

Andres Freund




Re: MacOS: xsltproc fails with "warning: failed to load external entity"

2023-02-01 Thread Tom Lane
Andres Freund  writes:
> On 2023-02-01 13:05:32 +0300, Aleksander Alekseev wrote:
>> It works. Perhaps we should add:
>> ninja -C build alldocs

> FWIW, just 'docs' would build just the multi-page html/man pages,
> alldocs takes a lot longer...

Hmm ... why does 'docs' include the man pages, and not just the html?
It's unlike what "make -C doc/src/sgml all" does in the Makefile
system, and I don't find it to be an improvement.  I want the man
pages approximately never, so I don't care to wait around for them
to be built.

While I'm bitching ... section 17.1 doesn't mention that you need
ninja to use meson, much less mention the minimum version.  And
the minimum version appears to be newer than RHEL8's 1.8.2,
which I find pretty unfortunate.  On RHEL8, it fails with

$ ninja
ninja: error: build.ninja:6771: multiple outputs aren't (yet?) supported by 
depslog; bring this up on the mailing list if it affects you

I did manage to test this stuff on bleeding-edge Fedora,
but ...

regards, tom lane




Re: Weird failure with latches in curculio on v15

2023-02-01 Thread Tom Lane
Andres Freund  writes:
> On 2023-02-01 12:08:24 -0500, Robert Haas wrote:
>> I like the idea of not relying on system(). In most respects, doing
>> fork() + exec() ourselves seems superior. We can control where the
>> output goes, what we do while waiting, etc. But system() runs the
>> command through the shell, so that for example you don't have to
>> invent your own way of splitting a string into words to be passed to
>> exec[whatever](). I've never understood how you're supposed to get
>> that behavior other than by calling system().

> We could just exec the shell in the forked process, using -c to invoke
> the command. That should give us pretty much the same efficiency as
> system(), with a lot more control.

The main thing that system() brings to the table is platform-specific
knowledge of where the shell is.  I'm not very sure that we want to
wire in "/bin/sh".

regards, tom lane




Re: pg_dump versus hash partitioning

2023-02-01 Thread Robert Haas
On Wed, Feb 1, 2023 at 11:18 AM Tom Lane  wrote:
> Over at [1] we have a complaint that dump-and-restore fails for
> hash-partitioned tables if a partitioning column is an enum,
> because the enum values are unlikely to receive the same OIDs
> in the destination database as they had in the source, and the
> hash codes are dependent on those OIDs.

It seems to me that this is the root of the problem. We can't expect
to hash on something that's not present in the dump file and have
anything work.

> So here's what I think we should actually do: make
> --load-via-partition-root the default.  We can provide a
> switch to turn it off, for those who are brave or foolish
> enough to risk that in the name of saving a few cycles,
> but it ought to be the default.
>
> Furthermore, I think we should make this happen in time for
> next week's releases.  I can write the patch easily enough,
> but we need a consensus PDQ that this is what to do.

This seems extremely precipitous to me and I'm against it. I like the
fact that we have --load-via-partition-root, but it is a bit of a
hack. You don't get a single copy into the partition root, you get one
per child table -- and those COPY statements are listed as data for
the partitions where the data lives now, not for the parent table. I
am not completely sure whether there is a scenario where that's an
issue, but it's certainly an oddity. Also, and I think pretty
significantly, using --load-via-partition-root forces you to pay the
overhead of rerouting every tuple to the target partition whether you
need it or not, which is potentially a large unnecessary expense. I
don't think we should just foist that kind of overhead onto everyone
in every situation for every data type because somebody had a problem
in a certain case.

And even if we do decide to do that at some point, I don't think it is
right at all to rush such a change out on a short time scale, with
little time to mull over consequences and alternative fixes. I think
that could easily hurt more people than it helps.

I think that not all of the cases that you list are of the same type.
Loading a dump under a different encoding or on a different endianness
are surely corner cases. They might come up for some people
occasionally, but they're not typical. In the case of endianness,
that's because little-Endian has pretty much taken over the world; in
the case of encoding, that's because converting data between encodings
is a real pain, and combining that with a database dump and restore is
likely to be very little fun. It's hard to argue that collation
changes fall into the same category: we know that they get changed all
the time, often silently. But none of us database geeks think that's a
good thing: just that it's a thing that we have to deal with.

The enum case seems completely different to me. That's not the result
of trying to migrate your data to another architecture or of the glibc
maintainers not believing in sorting working the same way on Tuesday
that it did on Monday. That's the result of the PostgreSQL project
hashing data in a way that does not make any real sense for the
application at hand. Any hash function that we use for partitioning
has to work based on data that is preserved by the dump-and-restore
process. I would argue that the float case is not of the same kind:
yes, if you round your data off, then the values are going to hash
differently, but if you truncate your strings, those will hash
differently too. Duh. Intentionally changing the value is supposed to
change the hash code, that's kind of the point of hashing.

So I think we should be asking ourselves what we could do about the
enum case specifically, rather than resorting to a bazooka.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Show various offset arrays for heap WAL records

2023-02-01 Thread Peter Geoghegan
On Wed, Feb 1, 2023 at 5:20 AM Robert Haas  wrote:
> If we're dumping a lot of details out of each WAL record, we might
> want to switch to a multi-line format of some kind. No one enjoys a
> 460-character wide line, let alone 46000.

I generally prefer it when I can use psql without using expanded table
format mode, and without having to use a pager. Of course that isn't
always possible, but it often is. I just don't think that that's going
to become feasible with pg_walinspect queries any time soon, since it
really requires a comprehensive strategy to deal with the issue of
verbosity.

It seems practically mandatory to use a pager when running
pg_walinspect queries in psql right now -- pspg is good for this. I
really can't use expanded table mode here, since it obscures the
relationship between adjoining records. I'm usually looking through
rows/records in LSN order, and want to be able to easily compare the
LSNs (or other details) of groups of adjoining records.

--
Peter Geoghegan




Re: MacOS: xsltproc fails with "warning: failed to load external entity"

2023-02-01 Thread Andres Freund
Hi,

On 2023-02-01 12:23:27 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2023-02-01 13:05:32 +0300, Aleksander Alekseev wrote:
> >> It works. Perhaps we should add:
> >> ninja -C build alldocs
> 
> > FWIW, just 'docs' would build just the multi-page html/man pages,
> > alldocs takes a lot longer...
> 
> Hmm ... why does 'docs' include the man pages, and not just the html?

I think it's because the makefile is doing things a bit oddly, and I
didn't quite grok that in the right moment.

I probably just saw:
all: html man

but before that there's

# Make "html" the default target, since that is what most people tend
# to want to use.
html:


> It's unlike what "make -C doc/src/sgml all" does in the Makefile
> system, and I don't find it to be an improvement.

Well, that'd actually build the manpages too, afaics :). But I get the
point.

I really have no opinion on what we should build under what name.
Happy to change what's included in 'docs', add additional targets,
etc.


> I want the man pages approximately never, so I don't care to wait
> around for them to be built.
> 
> While I'm bitching ... section 17.1 doesn't mention that you need
> ninja to use meson, much less mention the minimum version.

Peter rewrote the requirements (almost?) entirely while committing the
docs from Samay and hasn't responded to my concerns about the new
form...


Normally the ninja version that's pulled in by meson should suffice. I
suspect that the problem you found can be worked around.

> And the minimum version appears to be newer than RHEL8's 1.8.2, which
> I find pretty unfortunate.  On RHEL8, it fails with
> $ ninja
> ninja: error: build.ninja:6771: multiple outputs aren't (yet?) supported by 
> depslog; bring this up on the mailing list if it affects you

What's in that line +- 2 lines?  And/or what are the steps that got you
to that point?

I'll try building 1.8.2 and reproing.


> I did manage to test this stuff on bleeding-edge Fedora,
> but ...

Yea, I worked a fair bit to avoid requiring too new a version; I'll try
to figure out what went wrong.  I did build on rhel8 not long ago, so I
suspect it's a corner case somewhere.

Greetings,

Andres Freund




Re: Progress report of CREATE INDEX for nested partitioned tables

2023-02-01 Thread Ilya Gladyshev


> On 1 Feb 2023, at 20:27, Matthias van de Meent wrote:
> 
> On Wed, 1 Feb 2023 at 16:53, Justin Pryzby wrote:
>> 
>> On Wed, Feb 01, 2023 at 04:21:35PM +0100, Matthias van de Meent wrote:
>>> On Wed, 1 Feb 2023 at 15:21, Ilya Gladyshev  
>>> wrote:
> On 1 Feb 2023, at 16:01, Alvaro Herrera wrote:
> Hmm, count_leaf_partitions has to scan pg_inherits and do a syscache
> lookup for every single element therein ... this sounds slow.
> 
> In one of the callsites, we already have the partition descriptor
> available.  We could just scan partdesc->is_leaf[] and add one for each
> 'true' value we see there.
 
 The problem is that partdesc contains only direct children of the table 
 and we need all the children down the inheritance tree to count the total 
 number of leaf partitions in the first callsite.
 
> In the other callsite, we had the table open just a few lines before the
> place you call count_leaf_partitions.  Maybe we can rejigger things by
> examining its state before closing it: if relkind is not partitioned we
> know leaf_partitions=0, and only if partitioned we count leaf partitions.
> I think that would save some work.  I also wonder if it's worth writing
> a bespoke function for counting leaf partitions rather than relying on
> find_all_inheritors.
 
 Sure, added this condition to avoid the extra work here.
 
>>> 
   When creating an index on a partitioned table, this column is set to
 -   the total number of partitions on which the index is to be created.
 +   the total number of leaf partitions on which the index is to be 
 created or attached.
>>> 
>>> I think we should also add a note about the (now) non-constant nature
>>> of the value, something along the lines of "This value is updated as
>>> we're processing and discovering partitioned tables in the partition
>>> hierarchy".
>> 
>> But the TOTAL is constant, right ?  Updating the total when being called
>> recursively is the problem these patches fix.
> 
> If that's the case, then I'm not seeing the 'fix' part of the patch. I
> thought this patch was fixing the provably incorrect TOTAL value where
> DONE > TOTAL due to the recursive operation overwriting the DONE/TOTAL
> values instead of updating them.
> 
> In HEAD we set TOTAL to whatever number of partitions the partitioned
> table we're currently processing has - regardless of whether we're the
> top level statement.
> With the patch we instead add the number of child relations to that
> count, for which REL_HAS_STORAGE(child) -- or at least, in the v3
> posted by Ilya. Approximately immediately after updating that count we
> recurse to the child relations, and that only returns once it is done
> creating the indexes, so both TOTAL and DONE go up as we process more
> partitions in the hierarchy.

The TOTAL in the patch is set only when processing the top-level parent and it 
is not updated when we recurse, so yes, it is constant. From v3:

@@ -1219,8 +1243,14 @@ DefineIndex(Oid relationId,
 		Relation	parentIndex;
 		TupleDesc	parentDesc;
 
-		pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_TOTAL,
-									 nparts);
+		if (!OidIsValid(parentIndexId))
+		{
+			int			total_parts;
+
+			total_parts = count_leaf_partitions(relationId);
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_TOTAL,
+										 total_parts);
+		}


It is set to the total number of children on all levels of the hierarchy, not 
just the current one, so the total value doesn’t need to be updated later, 
because it is set to the correct value from the very beginning. 
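
For context, a rough sketch of what such a count_leaf_partitions() helper
could look like: walk the whole inheritance tree and count the relations
that have storage. Lock choice and error handling are elided, and the
actual function in the patch may differ.

#include "postgres.h"

#include "catalog/pg_class.h"
#include "catalog/pg_inherits.h"
#include "nodes/pg_list.h"
#include "utils/lsyscache.h"

static int
count_leaf_partitions(Oid relid)
{
	int			nleaves = 0;
	List	   *inhOids = find_all_inheritors(relid, NoLock, NULL);
	ListCell   *lc;

	foreach(lc, inhOids)
	{
		Oid			childOid = lfirst_oid(lc);

		/* count only relations that actually store rows */
		if (RELKIND_HAS_STORAGE(get_rel_relkind(childOid)))
			nleaves++;
	}

	list_free(inhOids);
	return nleaves;
}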

It is the DONE counter that is updated, and when we attach an index of a 
partition that is itself a partitioned table (like a2 in your example, if it 
already had an index created), it will be updated by the number of children of 
the partition.

@@ -1431,9 +1463,25 @@ DefineIndex(Oid relationId,
 
 					SetUserIdAndSecContext(child_save_userid,
 										   child_save_sec_context);
 				}
+				else
+				{
+					int			attached_parts = 1;
+
+					if (RELKIND_HAS_PARTITIONS(child_relkind))
+						attached_parts = count_leaf_partitions(childRelid);
+
+					/*

Re: Show various offset arrays for heap WAL records

2023-02-01 Thread Robert Haas
On Wed, Feb 1, 2023 at 12:47 PM Peter Geoghegan  wrote:
> On Wed, Feb 1, 2023 at 5:20 AM Robert Haas  wrote:
> > If we're dumping a lot of details out of each WAL record, we might
> > want to switch to a multi-line format of some kind. No one enjoys a
> > 460-character wide line, let alone 46000.
>
> I generally prefer it when I can use psql without using expanded table
> format mode, and without having to use a pager. Of course that isn't
> always possible, but it often is. I just don't think that that's going
> to become feasible with pg_walinspect queries any time soon, since it
> really requires a comprehensive strategy to deal with the issue of
> verbosity.

Well, if we're thinking of making the output a lot more verbose, it
seems like we should at least do a bit of brainstorming about what
that strategy could be.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Weird failure with latches in curculio on v15

2023-02-01 Thread Nathan Bossart
On Wed, Feb 01, 2023 at 08:58:01AM -0800, Andres Freund wrote:
> On 2023-02-01 10:12:26 -0500, Tom Lane wrote:
>> The fundamental issue is that we have no good way to break out
>> of system(), and I think the original idea was that
>> in_restore_command would be set *only* for the duration of the
>> system() call.  That's clearly been lost sight of completely,
>> but maybe as a stopgap we could try to get back to that.
> 
> We could push the functions setting in_restore_command down into
> ExecuteRecoveryCommand(). But I don't think that'd end up necessarily
> being right either - we'd now use the mechanism in places we previously
> didn't (cleanup/end commands).

Right, we'd only want to set it for restore_command.  I think that's
doable.

> And there's just plenty other stuff in the 14bdb3f13de 9a740f81eb0 that
> doesn't look right:
> - We now have two places open-coding what BuildRestoreCommand did

This was done because BuildRestoreCommand() had become a thin wrapper
around replace_percent_placeholders().  I can add it back if you don't
think this was the right decision.

> - I'm doubtful that the new shell_* functions are the base for a good
>   API to abstract restoring files

Why?

> - the error message for a failed restore command seems to have gotten
>   worse:
>   could not restore file \"%s\" from archive: %s"
>   ->
>   "%s \"%s\": %s", commandName, command

Okay, I'll work on improving this message.

> - shell_* imo is not a good namespace for something called from xlog.c,
>   xlogarchive.c. I realize the intention is that shell_archive.c is
>   going to be its own "restore module", but for now it imo looks odd

What do you propose instead?  FWIW this should go away with recovery
modules.  This is just an intermediate state to simplify those patches.

> - The comment moved out of RestoreArchivedFile() doesn't seems less
>   useful at its new location

Where do you think it should go?

> - explanation of why we use GetOldestRestartPoint() is halfway lost

Okay, I'll work on adding more context here.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Non-superuser subscription owners

2023-02-01 Thread Mark Dilger



> On Feb 1, 2023, at 6:43 AM, Robert Haas  wrote:

> The thing I'm
> struggling to understand is: if you only want to replicate into tables
> that Alice can write, why not just make Alice own the subscription?
> For a run-as user to make sense, you need a scenario where we want the
> replication to target only tables that Alice can touch, but we also
> don't want Alice herself to be able to touch the subscription, so you
> make Alice the run-as user and yourself the owner, or something like
> that. But I'm not sure what that scenario is exactly.

This "run-as" idea came about because we didn't want arbitrary roles to be able 
to change the subscription's connection string.  A competing idea was to have a 
server object rather than a string, with roles like Alice being able to use the 
server object if they have been granted usage privilege, and not otherwise.  So 
the "run-as" and "server" ideas were somewhat competing.

> Mark was postulating a scenario where the publisher and subscriber
> don't trust each other. I was thinking a little bit more about that. I
> still maintain that the current system is poorly set up to make that
> work, but suppose we wanted to do better. We could add filtering on
> the subscriber side, like you list schemas or specific relations that
> you are or are not willing to replicate into. Then you could, for
> example, connect your subscription to a certain remote publication,
> but with the restriction that you're only willing to replicate into
> the "headquarters" schema. Then we'll replicate whatever tables they
> send us, but if the dorks at headquarters mess up the publications on
> their end (intentionally or otherwise) and add some tables from the
> "locally_controlled_stuff" schema, we'll refuse to replicate that into
> our eponymous schema.

That example is good, though I don't see how "filters" are better than 
roles+privileges.  Care to elaborate?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Weird failure with latches in curculio on v15

2023-02-01 Thread Andres Freund
Hi,

On 2023-02-01 12:27:19 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2023-02-01 12:08:24 -0500, Robert Haas wrote:
> >> I like the idea of not relying on system(). In most respects, doing
> >> fork() + exec() ourselves seems superior. We can control where the
> >> output goes, what we do while waiting, etc. But system() runs the
> >> command through the shell, so that for example you don't have to
> >> invent your own way of splitting a string into words to be passed to
> >> exec[whatever](). I've never understood how you're supposed to get
> >> that behavior other than by calling system().
> 
> > We could just exec the shell in the forked process, using -c to invoke
> > the command. That should give us pretty much the same efficiency as
> > system(), with a lot more control.
> 
> The main thing that system() brings to the table is platform-specific
> knowledge of where the shell is.  I'm not very sure that we want to
> wire in "/bin/sh".

We seem to be doing OK with using SHELLPROG in pg_regress, which just
seems to be using $SHELL from the build environment.

Greetings,

Andres Freund




Re: Optimizing PostgreSQL with LLVM's PGO+LTO

2023-02-01 Thread João Paulo Labegalini de Carvalho
On Mon, Jan 30, 2023 at 10:47 AM Andres Freund  wrote:

> For some reason my notes for using LTO include changing RANLIB to point to
> gcc/llvm-ranlib of the appropriate version. Won't even be used on HEAD, but
> before that it can make a difference.
>

I will try that.


> Depending on how you built clang, it could be that the above recipe ends up
> using the system lld, which might be too old.
>

I double checked and I am using the lld that I built from source.


> What are the crashes you're getting?
>

When I run make check, the server starts up fine but the test queries seem
to not execute. I don't see any errors, the check step just quits after a
while.

2023-02-01 13:00:38.703 EST postmaster[28750] LOG:  starting PostgreSQL
14.5 on x86_64-pc-linux-gnu, compiled by clang version 15.0.6, 64-bit
2023-02-01 13:00:38.703 EST postmaster[28750] LOG:  listening on Unix
socket "/tmp/pg_regress-h8Fmqu/.s.PGSQL.58085"
2023-02-01 13:00:38.704 EST startup[28753] LOG:  database system was shut
down at 2023-02-01 13:00:38 EST
2023-02-01 13:00:38.705 EST postmaster[28750] LOG:  database system is
ready to accept connections

-- 
João Paulo L. de Carvalho
Ph.D Computer Science |  IC-UNICAMP | Campinas , SP - Brazil
Postdoctoral Research Fellow | University of Alberta | Edmonton, AB - Canada
joao.carva...@ic.unicamp.br
joao.carva...@ualberta.ca


Re: pg_dump versus hash partitioning

2023-02-01 Thread Tom Lane
Robert Haas  writes:
> On Wed, Feb 1, 2023 at 11:18 AM Tom Lane  wrote:
>> Over at [1] we have a complaint that dump-and-restore fails for
>> hash-partitioned tables if a partitioning column is an enum,
>> because the enum values are unlikely to receive the same OIDs
>> in the destination database as they had in the source, and the
>> hash codes are dependent on those OIDs.

> It seems to me that this is the root of the problem.

Well, that was what I thought too to start with, but I now think that
it is far too narrow-minded a view of the problem.  The real issue
is something I said that you trimmed:

>> In general, we've never thought that hash values are
>> required to be consistent across platforms.

hashenum() is doing something that is perfectly fine in the context
of hash indexes, which have traditionally been our standard of how
reproducible hash values need to be.  Hash partitioning has moved
those goalposts, and if there was any discussion of the consequences
of that, I didn't see it.
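
For reference, hashenum() boils down to hashing the enum value's OID,
which is assigned at CREATE TYPE time and so will generally differ in the
destination database -- roughly:

Datum
hashenum(PG_FUNCTION_ARGS)
{
	/* the enum value's OID is database-local, hence the non-portable hash */
	return hash_uint32((uint32) PG_GETARG_OID(0));
}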

>> Furthermore, I think we should make this happen in time for
>> next week's releases.  I can write the patch easily enough,
>> but we need a consensus PDQ that this is what to do.

> This seems extremely precipitous to me and I'm against it.

Yeah, it's been this way for awhile, so if there's debate then
we can let it go awhile longer.

> So I think we should be asking ourselves what we could do about the
> enum case specifically, rather than resorting to a bazooka.

My idea of solving this with a bazooka would be to deprecate hash
partitioning.  Quite frankly that's where I think we will end up
someday, but I'm not going to try to make that argument today.

In the meantime, I think we need to recognize that hash values are
not very portable.  I do not think we do our users a service by
letting them discover the corner cases the hard way.

I'd be willing to compromise on the intermediate idea of forcing
--load-via-partition-root just for hashed partitioning.

regards, tom lane




Re: Progress report of CREATE INDEX for nested partitioned tables

2023-02-01 Thread Matthias van de Meent
On Wed, 1 Feb 2023 at 18:51, Ilya Gladyshev  wrote:
>
> On 1 Feb 2023, at 20:27, Matthias van de Meent wrote:
>
>> In HEAD we set TOTAL to whatever number of partitions the partitioned
>> table we're currently processing has - regardless of whether we're the
>> top level statement.
>> With the patch we instead add the number of child relations to that
>> count, for which REL_HAS_STORAGE(child) -- or at least, in the v3
>> posted by Ilya. Approximately immediately after updating that count we
>> recurse to the child relations, and that only returns once it is done
>> creating the indexes, so both TOTAL and DONE go up as we process more
>> partitions in the hierarchy.
>
>
> The TOTAL in the patch is set only when processing the top-level parent and 
> it is not updated when we recurse, so yes, it is constant. From v3:

Ugh, I misread the patch, more specifically count_leaf_partitions and
the !OidIsValid(parentIndexId) condition changes.

You are correct, sorry for the noise.

Kind regards,

Matthias van de Meent




Re: MacOS: xsltproc fails with "warning: failed to load external entity"

2023-02-01 Thread Tom Lane
Andres Freund  writes:
> On 2023-02-01 12:23:27 -0500, Tom Lane wrote:
>> It's unlike what "make -C doc/src/sgml all" does in the Makefile
>> system, and I don't find it to be an improvement.

> Well, that'd actually build the manpages too, afaics :). But I get the
> point.

Ah, sorry, I too had forgotten that "all" isn't the default target
there.  I actually just go into that directory and type "make".

> I really have no opinion on what we should should build under what
> name. Happy to change what's included in 'docs', add additional targets,
> etc.

I think "docs" for just the html and "alldocs" for all supported
outputs is probably reasonable.  If we ever get to the point of
building distribution tarballs with meson, we might need another
target for html+man, but I suppose that's a long way off.

>> And the minimum version appears to be newer than RHEL8's 1.8.2, which
>> I find pretty unfortunate.  On RHEL8, it fails with
>> $ ninja
>> ninja: error: build.ninja:6771: multiple outputs aren't (yet?) supported by 
>> depslog; bring this up on the mailing list if it affects you

> What's in that line +- 2 lines?  And/or what are the steps that got you
> to that point?

"meson setup build" is sufficient to see it --- apparently ninja
gets invoked at the end of that, and it's already unhappy.  But
it repeats after "cd build; ninja".

It seems to be unhappy about the stanza for building sql_help.c?
Line 6771 is the blank line after "description" in this bit:

build src/bin/psql/sql_help.c src/bin/psql/sql_help.h: CUSTOM_COMMAND_DEP  | 
../src/bin/psql/create_help.pl /usr/bin/perl
 DEPFILE = src/bin/psql/sql_help.dep
 DEPFILE_UNQUOTED = src/bin/psql/sql_help.dep
 COMMAND = /usr/bin/perl ../src/bin/psql/create_help.pl --docdir 
../doc/src/sgml/ref --depfile src/bin/psql/sql_help.dep --outdir src/bin/psql 
--basename sql_help
 description = Generating$ psql_help$ with$ a$ custom$ command

build src/bin/psql/psql.p/meson-generated_.._psqlscanslash.c.o: c_COMPILER 
src/bin/psql/psqlscanslash.c || src/bin/psql/sql_help.h 
src/include/catalog/pg_aggregate_d.h src/include/catalog/pg_am_d.h 
src/include/catalog/pg_amop_d.h src/include/catalog/pg_amproc_d.h 
src/include/catalog/pg_attrdef_d.h src/include/catalog/pg_attribute_d.h 
src/include/catalog/pg_auth_members_d.h src/include/catalog/pg_authid_d.h 
src/include/catalog/pg_cast_d.h src/include/catalog/pg_class_d.h 
src/include/catalog/pg_collation_d.h src/include/catalog/pg_constraint_d.h 
src/include/catalog/pg_conversion_d.h src/include/catalog/pg_database_d.h 
src/include/catalog/pg_db_role_setting_d.h 
src/include/catalog/pg_default_acl_d.h src/include/catalog/pg_depend_d.h 
src/include/catalog/pg_description_d.h src/include/catalog/pg_enum_d.h 
src/include/catalog/pg_event_trigger_d.h src/include/catalog/pg_extension_d.h 
src/include/catalog/pg_foreign_data_wrapper_d.h 
src/include/catalog/pg_foreign_server_d.h 
src/include/catalog/pg_foreign_table_d.h src/include/catalog/pg_index_d.h 
src/include/catalog/pg_inherits_d.h src/include/catalog/pg_init_privs_d.h 
src/include/catalog/pg_language_d.h src/include/catalog/pg_largeobject_d.h 
src/include/catalog/pg_largeobject_metadata_d.h 
src/include/catalog/pg_namespace_d.h src/include/catalog/pg_opclass_d.h 
src/include/catalog/pg_operator_d.h src/include/catalog/pg_opfamily_d.h 
src/include/catalog/pg_parameter_acl_d.h 
src/include/catalog/pg_partitioned_table_d.h src/include/catalog/pg_policy_d.h 
src/include/catalog/pg_proc_d.h src/include/catalog/pg_publication_d.h 
src/include/catalog/pg_publication_namespace_d.h 
src/include/catalog/pg_publication_rel_d.h src/include/catalog/pg_range_d.h 
src/include/catalog/pg_replication_origin_d.h 
src/include/catalog/pg_rewrite_d.h src/include/catalog/pg_seclabel_d.h 
src/include/catalog/pg_sequence_d.h src/include/catalog/pg_shdepend_d.h 
src/include/catalog/pg_shdescription_d.h src/include/catalog/pg_shseclabel_d.h 
src/include/catalog/pg_statistic_d.h src/include/catalog/pg_statistic_ext_d.h 
src/include/catalog/pg_statistic_ext_data_d.h 
src/include/catalog/pg_subscription_d.h 
src/include/catalog/pg_subscription_rel_d.h 
src/include/catalog/pg_tablespace_d.h src/include/catalog/pg_transform_d.h 
src/include/catalog/pg_trigger_d.h src/include/catalog/pg_ts_config_d.h 
src/include/catalog/pg_ts_config_map_d.h src/include/catalog/pg_ts_dict_d.h 
src/include/catalog/pg_ts_parser_d.h src/include/catalog/pg_ts_template_d.h 
src/include/catalog/pg_type_d.h src/include/catalog/pg_user_mapping_d.h 
src/include/catalog/postgres.bki src/include/catalog/schemapg.h 
src/include/catalog/system_constraints.sql src/include/catalog/system_fk_info.h 
src/include/nodes/nodetags.h src/include/utils/errcodes.h
 DEPFILE = src/bin/psql/psql.p/meson-generated_.._psqlscanslash.c.o.d
 DEPFILE_UNQUOTED = src/bin/psql/psql.p/meson-generated_.._psqlscanslash.c.o.d
 ARGS = -Isrc/bin/psql/psql.p -Isrc/bin/psql -I../src/bin/psql -Isrc/include 
-I../src/include -Isrc/interfaces/libpq -I../src/interf

Re: MacOS: xsltproc fails with "warning: failed to load external entity"

2023-02-01 Thread Andres Freund
Hi,

On 2023-02-01 09:49:00 -0800, Andres Freund wrote:
> On 2023-02-01 12:23:27 -0500, Tom Lane wrote:
> > And the minimum version appears to be newer than RHEL8's 1.8.2, which
> > I find pretty unfortunate.  On RHEL8, it fails with
> > $ ninja
> > ninja: error: build.ninja:6771: multiple outputs aren't (yet?) supported by 
> > depslog; bring this up on the mailing list if it affects you
> 
> What's in that line +- 2 lines?  And/or what are the steps that got you
> to that point?
> 
> I'll try building 1.8.2 and reproing.
> 
> 
> > I did manage to test this stuff on bleeding-edge Fedora,
> > but ...
> 
> Yea, I worked a fair bit to avoid requiring a too new version, I'll try
> to figure out what went wrong.  I did built on rhel8 not long ago, so I
> suspect it's a corner case somewhere.

Unfortunately the test script accidentally pulled in ninja from epel,
hence not noticing the issue.


There's three issues:

One is easy enough, albeit slightly annoying: 1.8.2 wants the
"depending" file to be named only once in a dependency file. Slightly
uglier code in snowball_create.pl, but whatever.

The second is one case of multiple outputs with a depfile:
create_help.pl creates both sql_help.c and sql_help.h. Not immediately
sure what a good solution here is. The brute force solution would be to
invoke it twice, but I don't like that at all.

The last case is the various man directories. That'd be easy enough to
avoid if we generated them inside a man/ directory.

Greetings,

Andres Freund




Re: MacOS: xsltproc fails with "warning: failed to load external entity"

2023-02-01 Thread Tom Lane
Andres Freund  writes:
> On 2023-02-01 09:49:00 -0800, Andres Freund wrote:
>> On 2023-02-01 12:23:27 -0500, Tom Lane wrote:
>>> And the minimum version appears to be newer than RHEL8's 1.8.2, which
>>> I find pretty unfortunate.

> Unfortunately the test script accidentally pulled in ninja from epel,
> hence not noticing the issue.

Ah.  For myself, pulling the newer version from epel would not be a big
problem.  I think what we need to do is figure out what is the minimum
ninja version we want to support, and then see if we need to make any
of these changes.  I don't have hard data on which distros have which
versions of ninja, but surely somebody checked that at some point?

regards, tom lane




Re: Introduce "log_connection_stages" setting.

2023-02-01 Thread Sergey Dudoladov
Hi again,

Justin, thank you for the fast review. The new version is attached.

Regards,
Sergey Dudoladov
From 994a86e6ac3abb647d93bdaf0f42be76f42b83a8 Mon Sep 17 00:00:00 2001
From: Sergey Dudoladov 
Date: Tue, 8 Nov 2022 18:56:26 +0100
Subject: [PATCH] Introduce 'log_connection_messages'

This patch removes 'log_connections' and 'log_disconnections'
in favor of 'log_connection_messages', thereby introducing incompatibility

Author: Sergey Dudoladov

Reviewed-by: Justin Pryzby

Discussion:
https://www.postgresql.org/message-id/flat/CAA8Fd-qCB96uwfgMKrzfNs32mqqysi53yZFNVaRNJ6xDthZEgA%40mail.gmail.com
---
 doc/src/sgml/config.sgml  | 89 +--
 src/backend/commands/variable.c   | 77 
 src/backend/libpq/auth.c  |  5 +-
 src/backend/postmaster/postmaster.c   |  5 +-
 src/backend/tcop/postgres.c   | 11 ++-
 src/backend/utils/init/postinit.c |  2 +-
 src/backend/utils/misc/guc_tables.c   | 30 +++
 src/backend/utils/misc/postgresql.conf.sample |  5 +-
 src/include/postgres.h|  9 ++
 src/include/postmaster/postmaster.h   |  1 -
 src/include/utils/guc_hooks.h |  2 +
 src/test/authentication/t/001_password.pl |  2 +-
 src/test/authentication/t/003_peer.pl |  2 +-
 src/test/kerberos/t/001_auth.pl   |  2 +-
 src/test/ldap/t/001_auth.pl   |  2 +-
 src/test/ldap/t/002_bindpasswd.pl |  2 +-
 src/test/recovery/t/013_crash_restart.pl  |  2 +-
 src/test/recovery/t/022_crash_temp_files.pl   |  2 +-
 src/test/recovery/t/032_relfilenode_reuse.pl  |  2 +-
 src/test/ssl/t/SSL/Server.pm  |  2 +-
 src/tools/ci/pg_ci_base.conf  |  3 +-
 21 files changed, 182 insertions(+), 75 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 1cf53c74ea..ccefe144c3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -140,7 +140,7 @@
  An example of what this file might look like is:
 
 # This is a comment
-log_connections = yes
+log_connection_messages = all
 log_destination = 'syslog'
 search_path = '"$user", public'
 shared_buffers = 128MB
@@ -335,7 +335,7 @@ UPDATE pg_settings SET setting = reset_val WHERE name = 'configuration_parameter
passed to the postgres command via the
-c command-line parameter.  For example,
 
-postgres -c log_connections=yes -c log_destination='syslog'
+postgres -c log_connection_messages=all -c log_destination='syslog'
 
Settings provided in this way override those set via
postgresql.conf or ALTER SYSTEM,
@@ -7085,23 +7085,74 @@ local0.*/var/log/postgresql
   
  
 
- 
-  log_connections (boolean)
+ 
+  log_connection_messages (string)
   
-   log_connections configuration parameter
+   log_connection_messages configuration parameter
   
   
   

-Causes each attempted connection to the server to be logged,
-as well as successful completion of both client authentication (if
-necessary) and authorization.
+Causes the specified stages of each connection attempt to the server to be logged. Example: authorized,disconnected.
+The default is the empty string, which means that nothing is logged.
 Only superusers and users with the appropriate SET
 privilege can change this parameter at session start,
 and it cannot be changed at all within a session.
-The default is off.

 
+
+ Connection messages
+ 
+  
+  
+  
+   
+Name
+Description
+   
+  
+  
+   
+received
+Logs receipt of a connection. At this point a connection has been
+received, but no further work has been done: PostgreSQL is about to start
+interacting with a client.
+   
+
+   
+authenticated
+Logs the original identity used by an authentication method
+to identify a user. In most cases, the identity string matches the
+PostgreSQL username, but some third-party authentication methods may
+alter the original user identifier before the server stores it. Failed
+authentication is always logged regardless of the value of this setting.
+
+   
+
+   
+authorized
+Logs successful completion of authorization. At this point the
+connection has been fully established.
+
+   
+
+   
+disconnected
+Logs session termination. The log output provides information
+similar to authorized, plus the duration of the session.
+
+   
+
+   
+all
+A convenience alias. If present, it must be the only value in the
+list.
+   
+
+  
+ 
+
+

 
  Some client programs, like psql, attempt
@@ -7113,26 +716

Re: recovery modules

2023-02-01 Thread Nathan Bossart
On Wed, Feb 01, 2023 at 03:54:26AM -0800, Andres Freund wrote:
> I'd make basic_archive's private data a struct, with a member for the
> context, but it's not that important.
> 
> I'd also be inclined to do the same for the private_state you're passing
> around for each module. Even if it's just to reduce the number of
> functions accepting void * - loosing compiler type checking isn't great.
> 
> So maybe an ArchiveModuleState { void *private_data } that's passed to
> basic_archive_startup() and all the other callbacks.

Here's a new patch set in which I've attempted to address this feedback and
Michael's feedback.
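
For reference, the kind of wrapper being discussed might look roughly like
this (names per Andres's sketch above, not final):

typedef struct ArchiveModuleState
{
	/*
	 * Private data for the archive module; allocated and freed by the
	 * module itself, e.g. in its startup and shutdown callbacks.
	 */
	void	   *private_data;
} ArchiveModuleState;

/* callbacks would then take the state instead of a bare void pointer */
typedef bool (*ArchiveFileCB) (ArchiveModuleState *state,
							   const char *file, const char *path);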

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From e54ed061b772e7f0859a37612ed1effebc5c6ba1 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 27 Jan 2023 21:01:22 -0800
Subject: [PATCH v3 1/3] s/ArchiveContext/ArchiveCallbacks

---
 src/backend/postmaster/pgarch.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 3c714a79c6..dda6698509 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -97,7 +97,7 @@ char	   *XLogArchiveLibrary = "";
  */
 static time_t last_sigterm_time = 0;
 static PgArchData *PgArch = NULL;
-static ArchiveModuleCallbacks ArchiveContext;
+static ArchiveModuleCallbacks ArchiveCallbacks;
 
 
 /*
@@ -406,8 +406,8 @@ pgarch_ArchiverCopyLoop(void)
 			HandlePgArchInterrupts();
 
 			/* can't do anything if not configured ... */
-			if (ArchiveContext.check_configured_cb != NULL &&
-!ArchiveContext.check_configured_cb())
+			if (ArchiveCallbacks.check_configured_cb != NULL &&
+!ArchiveCallbacks.check_configured_cb())
 			{
 ereport(WARNING,
 		(errmsg("archive_mode enabled, yet archiving is not configured")));
@@ -508,7 +508,7 @@ pgarch_archiveXlog(char *xlog)
 	snprintf(activitymsg, sizeof(activitymsg), "archiving %s", xlog);
 	set_ps_display(activitymsg);
 
-	ret = ArchiveContext.archive_file_cb(xlog, pathname);
+	ret = ArchiveCallbacks.archive_file_cb(xlog, pathname);
 	if (ret)
 		snprintf(activitymsg, sizeof(activitymsg), "last was %s", xlog);
 	else
@@ -814,7 +814,7 @@ HandlePgArchInterrupts(void)
 /*
  * LoadArchiveLibrary
  *
- * Loads the archiving callbacks into our local ArchiveContext.
+ * Loads the archiving callbacks into our local ArchiveCallbacks.
  */
 static void
 LoadArchiveLibrary(void)
@@ -827,7 +827,7 @@ LoadArchiveLibrary(void)
  errmsg("both archive_command and archive_library set"),
  errdetail("Only one of archive_command, archive_library may be set.")));
 
-	memset(&ArchiveContext, 0, sizeof(ArchiveModuleCallbacks));
+	memset(&ArchiveCallbacks, 0, sizeof(ArchiveModuleCallbacks));
 
 	/*
 	 * If shell archiving is enabled, use our special initialization function.
@@ -844,9 +844,9 @@ LoadArchiveLibrary(void)
 		ereport(ERROR,
 (errmsg("archive modules have to define the symbol %s", "_PG_archive_module_init")));
 
-	(*archive_init) (&ArchiveContext);
+	(*archive_init) (&ArchiveCallbacks);
 
-	if (ArchiveContext.archive_file_cb == NULL)
+	if (ArchiveCallbacks.archive_file_cb == NULL)
 		ereport(ERROR,
 (errmsg("archive modules must register an archive callback")));
 
@@ -859,6 +859,6 @@ LoadArchiveLibrary(void)
 static void
 pgarch_call_module_shutdown_cb(int code, Datum arg)
 {
-	if (ArchiveContext.shutdown_cb != NULL)
-		ArchiveContext.shutdown_cb();
+	if (ArchiveCallbacks.shutdown_cb != NULL)
+		ArchiveCallbacks.shutdown_cb();
 }
-- 
2.25.1

>From 0bcfb2dc8ef1544dcd6a8cc07a784b36a38febad Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 27 Jan 2023 21:16:34 -0800
Subject: [PATCH v3 2/3] move archive module exports to dedicated headers

---
 contrib/basic_archive/basic_archive.c   |  2 +-
 src/backend/postmaster/pgarch.c |  2 ++
 src/backend/postmaster/shell_archive.c  |  3 +-
 src/backend/utils/misc/guc_tables.c |  1 +
 src/include/postmaster/archive_module.h | 47 +
 src/include/postmaster/pgarch.h | 39 
 src/include/postmaster/shell_archive.h  | 22 
 7 files changed, 75 insertions(+), 41 deletions(-)
 create mode 100644 src/include/postmaster/archive_module.h
 create mode 100644 src/include/postmaster/shell_archive.h

diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c
index 3d29711a31..87bbb2174d 100644
--- a/contrib/basic_archive/basic_archive.c
+++ b/contrib/basic_archive/basic_archive.c
@@ -32,7 +32,7 @@
 
 #include "common/int.h"
 #include "miscadmin.h"
-#include "postmaster/pgarch.h"
+#include "postmaster/archive_module.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
 #include "utils/guc.h"
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index dda6698509..ec47b2cc20 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -34,8 +34,10 @@
 #include "lib/bina

Re: pg_dump versus hash partitioning

2023-02-01 Thread Tom Lane
Robert Haas  writes:
> ... I like the
> fact that we have --load-via-partition-root, but it is a bit of a
> hack. You don't get a single copy into the partition root, you get one
> per child table -- and those COPY statements are listed as data for
> the partitions where the data lives now, not for the parent table. I
> am not completely sure whether there is a scenario where that's an
> issue, but it's certainly an oddity.

I spent a bit more time thinking about that, and while I agree that
it's an oddity, I don't see that it matters in the case of hash
partitioning.  You would notice an issue if you tried to do a selective
restore of just one partition --- but under what circumstance would
that be a useful thing to do?  By definition, under hash partitioning
there is no user-meaningful difference between different partitions.
Moreover, in the case at hand you would get constraint failures without
--load-via-partition-root, or tuple routing failures with it,
so what's the difference?  (Unless you'd created all the partitions
to start with and were doing a selective restore of just one partition's
data, in which case the outcome is "fails" or "works" respectively.)

> Also, and I think pretty
> significantly, using --load-via-partition-root forces you to pay the
> overhead of rerouting every tuple to the target partition whether you
> need it or not, which is potentially a large unnecessary expense.

Oddly, I always thought that we prioritize correctness over speed.
I don't mind having an option that allows people to select a less-safe
way of doing this, but I do think it's unwise for less-safe to be the
default, especially when it's something you can't fix after the fact.

What do you think of "--load-via-partition-root=on/off/auto", where
auto means "not with hash partitions" or the like?  I'm still
uncomfortable about the collation aspect, but I'm willing to concede
that range partitioning is less likely to fail in this way than hash.

regards, tom lane




Re: postgres_fdw, dblink, and CREATE SUBSCRIPTION security

2023-02-01 Thread Jacob Champion
On Mon, Jan 30, 2023 at 2:21 PM Robert Haas  wrote:
> On Mon, Jan 30, 2023 at 4:12 PM Jacob Champion  
> wrote:
> > For our case, assuming that connections have side effects, one
> > solution could be for the client to signal to the server that the
> > connection should use in-band authentication only; i.e. fail the
> > connection if the credentials provided aren't good enough by
> > themselves to authenticate the client. (This has some overlap with
> > SASL negotiation, maybe.)
>
> I'm not an expert on this stuff, but to me that feels like a weak and
> fuzzy concept. If the client is going to tell the server something,
> I'd much rather have it say something like "i'm proxying a request
> from my local user rhaas, who authenticated using such and such a
> method and connected from such and such an IP yadda yadda". That feels
> to me like really clear communication that the server can then be
> configured to something about via pg_hba.conf or similar. Saying "use
> in-band authentication only", to me, feels much murkier. As the
> recipient of that message, I don't know exactly what to do about it,
> and it feels like whatever heuristic I adopt might end up being wrong
> and something bad happens anyway.

Is it maybe just a matter of terminology? If a proxy tells the server,
"This user is logging in. Here's the password I have for them. DO NOT
authenticate using anything else," and the HBA says to use ident auth
for that user, then the server fails the connection. That's what I
mean by in-band -- the proxy says, "here are the credentials for this
connection." That's it.

Alternatively, if you really don't like making this server-side: any
future "connection side effects" we add, such as logon triggers, could
either be opted into by the client or explicitly invoked by the client
after it's happy with the authentication exchange. Or it could be
disabled at the server side for forms of ambient authn. (This is
getting closer to HTTP's method safety concept.)

> > I agree. (But for the record, I think that an outbound proxy filter is
> > also insufficient. Someone, somewhere, is going to want to safely
> > proxy through localhost _and_ have peer authentication set up.)
>
> Well then they're indeed going to need some way to distinguish a
> proxied connection from a non-proxied one. You can't send identical
> connection requests in different scenarios and get different
> results

Yeah. Most of these solutions require explicitly labelling things that
were implicit before.

> I think what we really need here is an example or three of a proposed
> configuration file syntax. I think it would be good if we could pick a
> syntax that doesn't require a super-complicated parser

Agreed. The danger from my end is, I'm trained on configuration
formats that have infinite bells and whistles. I don't really want to
go too crazy with it.

> and that maybe
> has something in common with our existing configuration file syntaxes.
> But if we have to invent something new, then we can do that.

Okay. Personally I'd like
- the ability to set options globally (so filters are optional)
- the ability to maintain many options for a specific scope (host? IP
range?) without making my config lines grow without bound
- the ability to audit a configuration without trusting its comments

But getting all of my wishlist into a sane configuration format that
handles all the use cases is the tricky part. I'll think about it.

--Jacob




Re: Non-superuser subscription owners

2023-02-01 Thread Andres Freund
Hi,

On 2023-02-01 09:43:39 -0500, Robert Haas wrote:
> On Tue, Jan 31, 2023 at 7:01 PM Andres Freund  wrote:
> > I don't really understand that - the run-as approach seems like a
> > necessary piece of improving the security model.
> >
> > I think it's perfectly reasonable to want to replicate from one system
> > in another, but to not want to allow logical replication to insert into
> > pg_class or whatnot. So not using superuser to execute the replication
> > makes sense.
> >
> > This is particularly the case if you're just replicating a small part of
> > the tables from one system to another. E.g. in a sharded setup, you may
> > want to replicate metadata to servers.
> 
> I don't think that a system catalog should be considered a valid
> replication target, no matter who owns the subscription, so ISTM that
> writing to pg_class should be blocked regardless.

The general point is that IMO is that in many setups you should use a
user with fewer privileges than a superuser.  It doesn't really matter
whether we have an ad-hoc restriction for system catalogs. More often
than not being able to modify other tables will give you a lot of
privileges too.


> The thing I'm struggling to understand is: if you only want to
> replicate into tables that Alice can write, why not just make Alice
> own the subscription?

Because it implies that the replication happens as a user that's
privileged enough to change the configuration of replication.


> Mark was postulating a scenario where the publisher and subscriber
> don't trust each other.

FWIW, I don't think this is mainly about "trust", but instead about
layering security / the principle of least privilege. The "run-as" user
(i.e. currently the owner) is constantly performing work on behalf of a
remote node, including executing code (default clauses etc). To make it
harder to use such a cross-system connection to move from one system to
the next, it's a good idea to execute it in the least privileged context
possible. And I don't see why it'd need the permission to modify the
definition of the subscription and similar "admin" tasks.

It's not that such an extra layer would necessarily completely stop an
attacker. But it might delay them and make their attack more noisy.


Similarly, if I were to operate an important production environment
again, I'd not have relations owned by the [pseudo]superuser, but by a
user controlled by the [pseudo]superuser. That way somebody tricking the
superuser into a REINDEX or such only gets the ability to execute code
in a less privileged context.




> I was thinking a little bit more about that. I
> still maintain that the current system is poorly set up to make that
> work, but suppose we wanted to do better. We could add filtering on
> the subscriber side, like you list schemas or specific relations that
> you are or are not willing to replicate into.

Isn't that largely a duplication of the ACLs on relations etc?


> > I think we'll need two things to improve upon the current situation:
> >
> > 1) run-as user, to reduce the scope of potential danger
> >
> > 2) Option to run the database inserts as the owner of the table, with a
> >check that the run-as is actually allowed to perform work as the
> >owning role. That prevents escalation from table owner (who could add
> >default expressions etc) from gettng the privs of the
> >run-as/replication owner.
> 
> I'm not quite sure what we do here now, but I agree that trigger
> firing seems like a problem. It might be that we need to worry about
> the user on the origin server, too. If Alice inserts a row that causes
> a replicated table owned by Bob to fire a trigger or evaluate a
> default expression or whatever due the presence of a subscription
> owned by Charlie, there is a risk that Alice might try to attack
> either Bob or Charlie, or that Bob might try to attack Charlie.

The attack on Bob exists without logical replication too - a REINDEX or
such is executed as the owner of the relation and re-evaluates index
expressions, constraints etc.  Given our security model I don't think we
can protect the relation owner if they trust somebody to insert rows, so
I don't really know what we can do to protect Charlie against Bob.



> > > And if we suppose that that already works and is safe, well then
> > > what's the case where I do need a run-as user?
> >
> > It's not at all safe today, IMO. You need to trust that nothing bad will
> > be replicated, otherwise the owner of the subscription has to be
> > considered compromised.
> 
> What kinds of things are bad to replicate?

I think that's unfortunately going to be specific to a setup.

Greetings,

Andres Freund




Re: pg_dump versus hash partitioning

2023-02-01 Thread Robert Haas
On Wed, Feb 1, 2023 at 1:23 PM Tom Lane  wrote:
> Well, that was what I thought too to start with, but I now think that
> it is far too narrow-minded a view of the problem.  The real issue
> is something I said that you trimmed:
>
> >> In general, we've never thought that hash values are
> >> required to be consistent across platforms.
>
> hashenum() is doing something that is perfectly fine in the context
> of hash indexes, which have traditionally been our standard of how
> reproducible hash values need to be.  Hash partitioning has moved
> those goalposts, and if there was any discussion of the consequences
> of that, I didn't see it.

Right, I don't think it was ever discussed, and I think that was a
mistake on my part.

> In the meantime, I think we need to recognize that hash values are
> not very portable.  I do not think we do our users a service by
> letting them discover the corner cases the hard way.

I think you're not really engaging with the argument that "not
completely portable" and "totally broken" are two different things,
and I still think that's an important point here. One idea that I had
is to add a flag somewhere to indicate whether a particular opclass or
opfamily is suitable for hash partitioning, or perhaps better, an
alternative to opcdefault that sets the default for partitioning,
which could be different from the default for indexing. Then we could
either prohibit this case, or make it work. Of course we would have to
define what "suitable for hash partitioning" means, but "would be
likely to survive a dump and reload on the same machine without any
software changes" is probably a reasonable minimum standard.

I don't think the fact that our *traditional* standard for how stable
a hash function needs to be has been XYZ carries any water. Needs
change over time, and we adapt the code to meet the new needs. Since
we have no system for type properties in PostgreSQL -- a design
decision I find questionable -- we tie all such properties to operator
classes. That's why, for example, we have HASHEXTENDED_PROC, even
though hash indexes don't need 64-bit hash values or a seed. We added
that for hash partitioning, and it's now used in other places too,
because 32-bits aren't enough for everything just because they're
enough for hash indexes, and seeds are handy. That's also why we have
BTINRANGE_PROC, which doesn't exist to support btree indexes, but
rather window frames. The fact that a certain feature requires us to
graft some additional stuff into the operator class/family mechanism,
or that it doesn't quite work with everything that's already part of
that mechanism, isn't an argument against the feature. That's just how
we do things around here. Indeed, if somebody, instead of implementing
hash partitioning by tying it into hash opfamilies, were to make up
some completely new hashing infrastructure that had exactly the
properties they wanted for partitioning, that would be *totally*
unacceptable and surely a reason for rejecting such a feature
outright. The fact that it tries to make use of the existing
infrastructure is a good thing about that feature, not a bad thing,
even though it is turning out that there are some problems.

On the question of whether hash partitioning is a good feature in
general, I can only say that I disagree with what seems to be your
position, which as best as I can tell is "it sucks and we should kill
it with fire". I do think that partitioning in general leaves a lot to
be desired in PostgreSQL in general, and perhaps the issues are even
worse for hash partitioning than they are elsewhere. However, I think
that the solution to that is for people to keep putting more work into
making it better, not to give up and say "ah, partitioning (or hash
partitioning specifically) is a stupid feature that nobody wants". To
think that, you have to be living in a bubble. It's unfortunate that
with all the work that so many people have put into this area we don't
have better results to show for it, but AFAICS there's no help for
that but to keep hacking. Amit Langote's work on locking before
pruning, for example, will be hugely impactful for some kinds of
queries if it gets committed, and it's been a long time coming, partly
because so many other problems needed to be sorted out first. But you
can't run the simplest workload with any kind of partitioning, range,
hash, whatever, and not run into that problem immediately.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] CF app: add "Returned: Needs more interest"

2023-02-01 Thread Jacob Champion
On Wed, Jan 4, 2023 at 9:33 AM Jacob Champion  wrote:
> Is there a good way to remind people that, hey, this exists as a
> patchset? (Other than me pinging the list every so often.)

I've withdrawn this patchset for now, but if anyone has any ideas on
where and how I can better propose features for CF itself, I'm all
ears.

Thanks,
--Jacob




Re: Add connection active, idle time to pg_stat_activity

2023-02-01 Thread Sergey Dudoladov
Hello hackers,

I've sketched the first version of a patch to add pg_stat_session.
Please review this early version.
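
For a sense of how the proposed view would be queried, here is a small
sketch based on the columns the patch adds:

```
SELECT pid,
       total_running_time, total_running_count,
       total_idle_time, total_idle_count
FROM pg_stat_session
WHERE pid = pg_backend_pid();
```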

Regards,
Sergey.
From 31f781ecd69fc42aaadd9bcdbebaf8f72449946c Mon Sep 17 00:00:00 2001
From: Sergey Dudoladov 
Date: Tue, 22 Nov 2022 09:23:32 +0100
Subject: [PATCH] Add pg_stat_session

Author: Sergey Dudoladov

Adds pg_stat_session view to track statistics accumulated during
 lifetime of a session.

Reviewed-by: Aleksander Alekseev, Bertrand Drouvot, and Atsushi
Torikoshi

Discussion:
https://www.postgresql.org/message-id/flat/CA%2BFpmFcJF0vwi-SWW0wYO-c-FbhyawLq4tCpRDCJJ8Bq%3Dja-gA%40mail.gmail.com
---
 doc/src/sgml/monitoring.sgml| 153 
 src/backend/catalog/system_views.sql|  15 ++
 src/backend/utils/activity/backend_status.c |  58 ++--
 src/backend/utils/adt/pgstatfuncs.c |  70 +
 src/include/catalog/pg_proc.dat |   9 ++
 src/include/utils/backend_status.h  |  37 +
 src/test/regress/expected/rules.out |  12 ++
 7 files changed, 345 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1756f1a4b6..38cc29810a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -414,6 +414,20 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
See .
   
  
+
+ 
+  
+   pg_stat_session
+   pg_stat_session
+  
+  
+   One row per server process, showing information related to
+   the currently accumulated activity of that process, such as time spent in
+   a certain state.
+   See 
+   pg_stat_session for details.
+  
+ 
 

   
@@ -5315,6 +5329,129 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i

   
 
+  
+   pg_stat_session View
+   
+
+ 
+  
+   Column Type
+  
+  
+   Description
+  
+ 
+
+
+
+ 
+  
+   pid integer
+  
+  
+   Process ID of this backend.
+  
+ 
+
+ 
+  
+   total_running_time double precision
+  
+  
+   Time in milliseconds this backend spent in the running state.
+  
+ 
+
+
+  
+   total_running_count bigint
+  
+  
+   Number of times this backend switched to the running state.
+  
+ 
+
+ 
+  
+   total_idle_time double precision
+  
+  
+   Time in milliseconds this backend spent in the idle state.
+  
+ 
+
+ 
+  
+   total_idle_count bigint
+  
+  
+   Number of times this backend switched to the idle state.
+  
+ 
+
+ 
+  
+   total_transaction_idle_time double precision
+  
+  
+   Time in milliseconds this backend spent in the idle in transaction
+   state.
+  
+ 
+
+ 
+  
+   total_transaction_idle_count bigint
+  
+  
+   Number of times this backend switched to the idle in transaction
+   state.
+  
+ 
+
+ 
+  
+   total_transaction_idle_aborted_time double precision
+  
+  
+   Time in milliseconds this backend spent in the idle in transaction (aborted) 
+   state.
+  
+ 
+
+ 
+  
+   total_transaction_idle_aborted_count bigint
+  
+  
+   Number of times this backend switched to the idle in transaction (aborted)
+   state.
+  
+ 
+
+ 
+  
+   total_fastpath_time double precision
+  
+  
+   Time in milliseconds this backend spent in the fastpath state.
+  
+ 
+
+ 
+  
+   total_fastpath_count bigint
+  
+  
+   Number of times this backend switched to the fastpath
+   state.
+  
+ 
+
+
+   
+  
+
  
 
  
@@ -5382,6 +5519,22 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i

   
 
+  
+   
+
+ pg_stat_get_session
+
+pg_stat_get_session ( integer )
+setof record
+   
+   
+Returns a record of information about the backend with the specified
+process ID, or one record for each active backend in the system
+if NULL is specified.  The fields returned are a
+subset of those in the pg_stat_session view.
+   
+  
+
   

 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8608e3fa5b..8f68a6ea00 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -870,6 +870,21 @@ CREATE VIEW pg_stat_activity AS
 LEFT JOIN pg_database AS D ON (S.datid = D.oid)
 LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
 
+CREATE VIEW pg_stat_session AS
+SELECT
+s.pid,
+s.total_running_time,
+s.total_running_count,
+s.total_idle_time,
+s.total_idle_cou

Re: Non-superuser subscription owners

2023-02-01 Thread Andres Freund
Hi,

On 2023-01-30 15:32:34 -0500, Robert Haas wrote:
> I had a long think about what to do with ALTER SUBSCRIPTION ... OWNER
> TO in terms of permissions checks. The previous version required that
> the new owner have permissions of pg_create_subscription, but there
> seems to be no particular reason for that rule except that it happens
> to be what I made the code do. So I changed it to say that the current
> owner must have CREATE privilege on the database, and must be able to
> SET ROLE to the new owner. This matches the rule for CREATE SCHEMA.
> Possibly we should *additionally* require that the person performing
> the rename still have pg_create_subscription, but that shouldn't be
> the only requirement.

As long as owner and run-as are the same, I think it's strongly
preferable to *not* require pg_create_subscription.


> There seems to be a good deal of inconsistency here. If you want to
> give someone a schema, YOU need CREATE on the database. But if you
> want to give someone a table, THEY need CREATE on the containing
> schema. It make sense that we check permissions on the containing
> object, which could be a database or a schema depending on what you're
> renaming, but it's unclear to me why we sometimes check on the person
> performing the ALTER command and at other times on the recipient. It's
> also somewhat unclear to me why we are checking CREATE in the first
> place, especially on the donor. It might make sense to have a rule
> that you can't own an object in a place where you couldn't have
> created it, but there is no such rule, because you can give someone
> CREATE on a schema, they can create an object, and they you can take
> CREATE a way and they still own an object there. So it kind of looks
> to me like we made it up as we went along and that the result isn't
> very consistent, but I'm inclined to follow CREATE SCHEMA here unless
> there's some reason to do otherwise.

Yuck. No idea what the best policy around this is.
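
For reference, the asymmetry being described, in command form (role and
object names are hypothetical):

```
-- Giving away a schema: the user running the command needs CREATE on the database.
ALTER SCHEMA app_schema OWNER TO bob;
-- Giving away a table: the new owner (bob) needs CREATE on the containing schema.
ALTER TABLE app_schema.t OWNER TO bob;
```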


> Another question around ALTER SUBSCRIPTION ... OWNER TO and also ALTER
> SUBSCRIPTION .. RENAME is whether they ought to fail if you're not a
> superuser and password_required false is set.

I don't really see a benefit in allowing it, so I'm inclined to go for
the more restrictive option. But this is a really weakly held opinion.



> > > If there is, I think we could fix it by moving the LockSharedObject call 
> > > up
> > > higher, above object_ownercheck. The only problem with that is it lets you
> > > lock an object on which you have no permissions: see
> > > 2ad36c4e44c8b513f6155656e1b7a8d26715bb94. To really fix that, we'd need an
> > > analogue of RangeVarGetRelidExtended.
> >
> > Yea, we really should have something like RangeVarGetRelidExtended() for 
> > other
> > kinds of objects. It'd take a fair bit of work / time to use it widely, but
> > it'll take even longer if we start in 5 years ;)
>
> We actually have something sort of like that in the form of
> get_object_address(). It doesn't allow for a callback, but it does
> have a retry loop.

Hm, sure looks like that code doesn't do any privilege checking...


> @@ -1269,13 +1270,19 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
>   
> slotname,
>   
> NAMEDATALEN);
>
> + /* Is the use of a password mandatory? */
> + must_use_password = MySubscription->passwordrequired &&
> + !superuser_arg(MySubscription->owner);

There are a few repetitions of this - perhaps worth putting it into a helper?
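
Something like the following, perhaps -- the helper name is made up, it just
centralizes the test quoted above:

```
/*
 * Is a password mandatory for connections made on behalf of this
 * subscription?  Mirrors the repeated check in the patch.
 */
static bool
subscription_must_use_password(Subscription *sub)
{
	return sub->passwordrequired && !superuser_arg(sub->owner);
}
```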


> @@ -180,6 +181,13 @@ libpqrcv_connect(const char *conninfo, bool logical, 
> const char *appname,
>   if (PQstatus(conn->streamConn) != CONNECTION_OK)
>   goto bad_connection_errmsg;
>
> + if (must_use_password && !PQconnectionUsedPassword(conn->streamConn))
> + ereport(ERROR,
> + 
> (errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED),
> +  errmsg("password is required"),
> +  errdetail("Non-superuser cannot connect if the 
> server does not request a password."),
> +  errhint("Target server's authentication method 
> must be changed. or set password_required=false in the subscription 
> attributes\
.")));
> +
>   if (logical)
>   {
>   PGresult   *res;

This still leaks the connection on error, no?

Greetings,

Andres Freund




Re: recovery modules

2023-02-01 Thread Andres Freund
Hi,

On 2023-02-01 12:15:29 -0800, Nathan Bossart wrote:
> Here's a new patch set in which I've attempted to address this feedback and
> Michael's feedback.

Looks better!


> @@ -25,12 +34,14 @@ extern PGDLLIMPORT char *XLogArchiveLibrary;
>   * For more information about the purpose of each callback, refer to the
>   * archive modules documentation.
>   */
> -typedef bool (*ArchiveCheckConfiguredCB) (void);
> -typedef bool (*ArchiveFileCB) (const char *file, const char *path);
> -typedef void (*ArchiveShutdownCB) (void);
> +typedef void (*ArchiveStartupCB) (ArchiveModuleState *state);
> +typedef bool (*ArchiveCheckConfiguredCB) (ArchiveModuleState *state);
> +typedef bool (*ArchiveFileCB) (const char *file, const char *path, 
> ArchiveModuleState *state);
> +typedef void (*ArchiveShutdownCB) (ArchiveModuleState *state);

Personally I'd always pass ArchiveModuleState *state as the first arg,
but it's not important.

Greetings,

Andres Freund




Re: pg_dump versus hash partitioning

2023-02-01 Thread Tom Lane
Robert Haas  writes:
> On Wed, Feb 1, 2023 at 1:23 PM Tom Lane  wrote:
>> In the meantime, I think we need to recognize that hash values are
>> not very portable.  I do not think we do our users a service by
>> letting them discover the corner cases the hard way.

> I think you're not really engaging with the argument that "not
> completely portable" and "totally broken" are two different things,
> and I still think that's an important point here.

I don't buy that argument.  From the user's perspective, it's broken
if her use-case fails.  "It works for most people" is cold comfort,
most especially so if there's no convenient path to fixing it after
a failure.

> I don't think the fact that our *traditional* standard for how stable
> a hash function needs to be has been XYZ carries any water.

Well, it wouldn't need to if we had a practical way of changing the
behavior of an existing hash function, but guess what: we don't.
Andrew's original proposal for fixing this was exactly to change the
behavior of hashenum().  There were some problems with the idea of
depending on enumsortorder instead of enum OID, but the really
fundamental issue is that you can't change hashing behavior without
breaking pg_upgrade completely.  Not only will your hash indexes be
corrupt, but your hash-partitioned tables will be broken, in exactly
the same way that we're trying to solve for dump/reload cases (which
of course will *also* be broken by redefining the hash function, if
you didn't use --load-via-partition-root).  Moreover, while we can
always advise people to reindex, there's no similarly easy way to fix
broken partitioning.

That being the case, I don't think moving the goalposts for hash
function stability is going to lead to a workable solution.

> On the question of whether hash partitioning is a good feature in
> general, I can only say that I disagree with what seems to be your
> position, which as best as I can tell is "it sucks and we should kill
> it with fire".

As I said, I'm not prepared to litigate that case today ... but
I do have a sneaking suspicion that we will eventually reach that
conclusion.  In any case, if we don't want to reach that conclusion,
we need some practical solution to these dump/reload problems.
Have you got a better idea than --load-via-partition-root?

regards, tom lane




Re: heapgettup refactoring

2023-02-01 Thread Melanie Plageman
On Thu, Feb 02, 2023 at 12:21:20AM +1300, David Rowley wrote:
> On Tue, 31 Jan 2023 at 12:18, Melanie Plageman
>  wrote:
> > v7 attached
> 
> I've been looking over the v7-0002 patch today and I did make a few
> adjustments to heapgettup_initial_block() as I would prefer to see the
> branching of each of all these helper functions follow the pattern of:
> 
> if ()
> {
> if ()
> 
> else
> 
> }
> else
> {
> 
> }
> 
> which wasn't quite what the function was doing.

I'm fine with this. One code comment about the new version inline.

> Along the way, I noticed that 0002 has a subtle bug that does not seem
> to be present once the remaining patches are applied.  I think I'm
> happier to push these along the lines of how you have them in the
> patches, so I've held off pushing for now due to the bug and the
> change I had to make to fix it.
> 
> The problem is around the setting of scan->rs_inited = true; you've
> moved that into heapgettup_initial_block() and you've correctly not
> initialised the scan for empty tables when you return
> InvalidBlockNumber, however, you've not correctly considered the fact
> that table_block_parallelscan_nextpage() could also return
> InvalidBlockNumber if the parallel workers manage to grab all of the
> blocks before the current process gets the first block. I don't know
> for sure, but it looks like this could cause problems when
> heapgettup() or heapgettup_pagemode() got called again for a rescan.
> We'd have returned the NULL tuple to indicate that no further tuples
> exist, but we'll have left rs_inited set to true which looks like
> it'll cause issues.

Ah, yes. In the later patches in the series, I handle all end of scan
cases (regardless of whether or not there was a beginning) in a single
place at the end of the function. There I release the buffer and reset
all state -- including setting rs_inited to false. So, that made it okay
to set rs_inited to true in heapgettup_initial_block().

When splitting it up, I made a mistake and missed the case you
mentioned. Thanks for catching that!

FWIW, I like setting rs_inited in heapgettup_initial_block() better in
the final refactor, but I agree with you that in this patch on its own
it is better in the body of heapgettup() and heapgettup_pagemode().
 
> I wondered if it might be better to do the scan->rs_inited = true; in
> heapgettup() and heapgettup_pagemode() instead. The attached v8 patch
> does it this way. Despite this fixing that bug, I think this might be
> a slightly better division of duties.

LGTM.

> From cbd37463bdaa96afed4c7c739c8e91b770a9f8a7 Mon Sep 17 00:00:00 2001
> From: David Rowley 
> Date: Wed, 1 Feb 2023 19:35:16 +1300
> Subject: [PATCH v8] Refactor heapam.c adding heapgettup_initial_block function
> 
> Here we adjust heapgettup() and heapgettup_pagemode() to move the code
> that fetches the first block out into a helper function.  This removes
> some code duplication.
> 
> Author: Melanie Plageman
> Reviewed-by: David Rowley
> Discussion: 
> https://postgr.es/m/CAAKRu_bvkhka0CZQun28KTqhuUh5ZqY=_t8qeqzqol02rpi...@mail.gmail.com
> ---
>  src/backend/access/heap/heapam.c | 225 ++-
>  1 file changed, 103 insertions(+), 122 deletions(-)
> 
> diff --git a/src/backend/access/heap/heapam.c 
> b/src/backend/access/heap/heapam.c
> index 0a8bac25f5..40168cc9ca 100644
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -483,6 +483,67 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
>   scan->rs_ntuples = ntup;
>  }
>  
> +/*
> + * heapgettup_initial_block - return the first BlockNumber to scan
> + *
> + * Returns InvalidBlockNumber when there are no blocks to scan.  This can
> + * occur with empty tables and in parallel scans when parallel workers get 
> all
> + * of the pages before we can get a chance to get our first page.
> + */
> +static BlockNumber
> +heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
> +{
> + Assert(!scan->rs_inited);
> +
> + /* When there are no pages to scan, return InvalidBlockNumber */
> + if (scan->rs_nblocks == 0 || scan->rs_numblocks == 0)
> + return InvalidBlockNumber;
> +
> + if (ScanDirectionIsForward(dir))
> + {
> + /* serial scan */
> + if (scan->rs_base.rs_parallel == NULL)
> + return scan->rs_startblock;

I believe this else is superfluous since we returned above.

> + else
> + {
> + /* parallel scan */
> + 
> table_block_parallelscan_startblock_init(scan->rs_base.rs_rd,
> + 
>  scan->rs_parallelworkerdata,
> + 
>  (ParallelBlockTableScanDesc) 
> scan->rs_base.rs_parallel);
> +
> + /* may return InvalidBlockNumber 

Re: pg_dump versus hash partitioning

2023-02-01 Thread Peter Geoghegan
On Wed, Feb 1, 2023 at 12:34 PM Tom Lane  wrote:
> > Also, and I think pretty
> > significantly, using --load-via-partition-root forces you to pay the
> > overhead of rerouting every tuple to the target partition whether you
> > need it or not, which is potentially a large unnecessary expense.
>
> Oddly, I always thought that we prioritize correctness over speed.
> I don't mind having an option that allows people to select a less-safe
> way of doing this, but I do think it's unwise for less-safe to be the
> default, especially when it's something you can't fix after the fact.

+1

As you pointed out already, pg_dump's primary use-cases all involve
target systems that aren't identical to the source system. Anybody
using pg_dump is unlikely to be particularly concerned about
performance.

> What do you think of "--load-via-partition-root=on/off/auto", where
> auto means "not with hash partitions" or the like?  I'm still
> uncomfortable about the collation aspect, but I'm willing to concede
> that range partitioning is less likely to fail in this way than hash.

Currently, pg_dump ignores collation versions entirely, except when
run by pg_upgrade. So pg_dump already understands that sometimes it's
important that the collation behavior be completely identical when the
database is restored, and sometimes it's desirable to produce a dump
with any available "logically equivalent" collation. This is about the
high level requirements, which makes sense to me.

ISTM that range partitioning is missing a good high level model that
builds on that. What's really needed is a fully worked out abstraction
that recognizes how collations can be equivalent for some purposes,
but not other purposes. The indirection between "logical and physical
collations" is underdeveloped. There isn't even an official name for
that idea.

-- 
Peter Geoghegan




Re: pg_dump versus hash partitioning

2023-02-01 Thread Robert Haas
On Wed, Feb 1, 2023 at 3:34 PM Tom Lane  wrote:
> I spent a bit more time thinking about that, and while I agree that
> it's an oddity, I don't see that it matters in the case of hash
> partitioning.  You would notice an issue if you tried to do a selective
> restore of just one partition --- but under what circumstance would
> that be a useful thing to do?  By definition, under hash partitioning
> there is no user-meaningful difference between different partitions.
> Moreover, in the case at hand you would get constraint failures without
> --load-via-partition-root, or tuple routing failures with it,
> so what's the difference?  (Unless you'd created all the partitions
> to start with and were doing a selective restore of just one partition's
> data, in which case the outcome is "fails" or "works" respectively.)

I guess I was worried that pg_dump's dependency ordering stuff might
do something weird in some case that I'm not smart enough to think up.

> > Also, and I think pretty
> > significantly, using --load-via-partition-root forces you to pay the
> > overhead of rerouting every tuple to the target partition whether you
> > need it or not, which is potentially a large unnecessary expense.
>
> Oddly, I always thought that we prioritize correctness over speed.

That's a bit rich.

It seems to me that the job of pg_dump is to produce a dump that, when
reloaded on another system, recreates the same database state. That
means that we end up with all of the same objects, each defined in the
same way, and that all of the tables end up with all the same contents
that they had before. Here, you'd like to argue that it's perfectly
fine if we instead insert some of the rows into different tables than
where they were on the original system. Under normal circumstances, of
course, we wouldn't consider any such thing, because then we would not
be faithfully replicating the database state, which would be
incorrect. But here you want to argue that it's OK to create a
different database state because trying to recreate the same one would
produce an error and the user might not like getting an error so let's
just do something else instead and not even bother telling them.

As you have quite rightly pointed out, the --load-via-partition-root
behavior is useful for working around a variety of unfortunate things
that can happen. If a user is willing to say that getting a row into
one partition of some table is just as good as getting it into another
partition of that same table and that they don't mind paying the cost
associated with that, then that is something that we can do for that
user. But just as we normally prioritize correctness over speed, so
also do we normally throw errors when things aren't right instead of
silently accepting bad input. The partitions in this scenario are
tables that have constraints. If a dump contains a row that doesn't
satisfy some constraint on the table into which it is being loaded,
that's an error. Keep in mind that there's no rule that a user can't
query a partition directly.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: recovery modules

2023-02-01 Thread Nathan Bossart
On Wed, Feb 01, 2023 at 01:06:06PM -0800, Andres Freund wrote:
> On 2023-02-01 12:15:29 -0800, Nathan Bossart wrote:
>> Here's a new patch set in which I've attempted to address this feedback and
>> Michael's feedback.
> 
> Looks better!

Thanks!

>> @@ -25,12 +34,14 @@ extern PGDLLIMPORT char *XLogArchiveLibrary;
>>   * For more information about the purpose of each callback, refer to the
>>   * archive modules documentation.
>>   */
>> -typedef bool (*ArchiveCheckConfiguredCB) (void);
>> -typedef bool (*ArchiveFileCB) (const char *file, const char *path);
>> -typedef void (*ArchiveShutdownCB) (void);
>> +typedef void (*ArchiveStartupCB) (ArchiveModuleState *state);
>> +typedef bool (*ArchiveCheckConfiguredCB) (ArchiveModuleState *state);
>> +typedef bool (*ArchiveFileCB) (const char *file, const char *path, 
>> ArchiveModuleState *state);
>> +typedef void (*ArchiveShutdownCB) (ArchiveModuleState *state);
> 
> Personally I'd always pass ArchiveModuleState *state as the first arg,
> but it's not important.

Yeah, that's nicer.  cfbot is complaining about a missing #include, so I
need to send a new revision anyway.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From e54ed061b772e7f0859a37612ed1effebc5c6ba1 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 27 Jan 2023 21:01:22 -0800
Subject: [PATCH v4 1/3] s/ArchiveContext/ArchiveCallbacks

---
 src/backend/postmaster/pgarch.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 3c714a79c6..dda6698509 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -97,7 +97,7 @@ char	   *XLogArchiveLibrary = "";
  */
 static time_t last_sigterm_time = 0;
 static PgArchData *PgArch = NULL;
-static ArchiveModuleCallbacks ArchiveContext;
+static ArchiveModuleCallbacks ArchiveCallbacks;
 
 
 /*
@@ -406,8 +406,8 @@ pgarch_ArchiverCopyLoop(void)
 			HandlePgArchInterrupts();
 
 			/* can't do anything if not configured ... */
-			if (ArchiveContext.check_configured_cb != NULL &&
-!ArchiveContext.check_configured_cb())
+			if (ArchiveCallbacks.check_configured_cb != NULL &&
+!ArchiveCallbacks.check_configured_cb())
 			{
 ereport(WARNING,
 		(errmsg("archive_mode enabled, yet archiving is not configured")));
@@ -508,7 +508,7 @@ pgarch_archiveXlog(char *xlog)
 	snprintf(activitymsg, sizeof(activitymsg), "archiving %s", xlog);
 	set_ps_display(activitymsg);
 
-	ret = ArchiveContext.archive_file_cb(xlog, pathname);
+	ret = ArchiveCallbacks.archive_file_cb(xlog, pathname);
 	if (ret)
 		snprintf(activitymsg, sizeof(activitymsg), "last was %s", xlog);
 	else
@@ -814,7 +814,7 @@ HandlePgArchInterrupts(void)
 /*
  * LoadArchiveLibrary
  *
- * Loads the archiving callbacks into our local ArchiveContext.
+ * Loads the archiving callbacks into our local ArchiveCallbacks.
  */
 static void
 LoadArchiveLibrary(void)
@@ -827,7 +827,7 @@ LoadArchiveLibrary(void)
  errmsg("both archive_command and archive_library set"),
  errdetail("Only one of archive_command, archive_library may be set.")));
 
-	memset(&ArchiveContext, 0, sizeof(ArchiveModuleCallbacks));
+	memset(&ArchiveCallbacks, 0, sizeof(ArchiveModuleCallbacks));
 
 	/*
 	 * If shell archiving is enabled, use our special initialization function.
@@ -844,9 +844,9 @@ LoadArchiveLibrary(void)
 		ereport(ERROR,
 (errmsg("archive modules have to define the symbol %s", "_PG_archive_module_init")));
 
-	(*archive_init) (&ArchiveContext);
+	(*archive_init) (&ArchiveCallbacks);
 
-	if (ArchiveContext.archive_file_cb == NULL)
+	if (ArchiveCallbacks.archive_file_cb == NULL)
 		ereport(ERROR,
 (errmsg("archive modules must register an archive callback")));
 
@@ -859,6 +859,6 @@ LoadArchiveLibrary(void)
 static void
 pgarch_call_module_shutdown_cb(int code, Datum arg)
 {
-	if (ArchiveContext.shutdown_cb != NULL)
-		ArchiveContext.shutdown_cb();
+	if (ArchiveCallbacks.shutdown_cb != NULL)
+		ArchiveCallbacks.shutdown_cb();
 }
-- 
2.25.1

>From 37e19c986b548f5d281190e196e01ce94f8c4391 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 27 Jan 2023 21:16:34 -0800
Subject: [PATCH v4 2/3] move archive module exports to dedicated headers

---
 contrib/basic_archive/basic_archive.c   |  2 +-
 src/backend/postmaster/pgarch.c |  2 ++
 src/backend/postmaster/shell_archive.c  |  3 +-
 src/backend/utils/misc/guc_tables.c |  1 +
 src/include/postmaster/archive_module.h | 47 +
 src/include/postmaster/pgarch.h | 39 
 src/include/postmaster/shell_archive.h  | 24 +
 7 files changed, 77 insertions(+), 41 deletions(-)
 create mode 100644 src/include/postmaster/archive_module.h
 create mode 100644 src/include/postmaster/shell_archive.h

diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c
index 3d29711a31..87bbb2174d 100644
--- a/contrib/basic_archive/

Re: pg_dump versus hash partitioning

2023-02-01 Thread Peter Geoghegan
On Wed, Feb 1, 2023 at 1:14 PM Robert Haas  wrote:
> It seems to me that the job of pg_dump is to produce a dump that, when
> reloaded on another system, recreates the same database state. That
> means that we end up with all of the same objects, each defined in the
> same way, and that all of the tables end up with all the same contents
> that they had before. Here, you'd like to argue that it's perfectly
> fine if we instead insert some of the rows into different tables than
> where they were on the original system. Under normal circumstances, of
> course, we wouldn't consider any such thing, because then we would not
> be faithfully replicating the database state, which would be
> incorrect. But here you want to argue that it's OK to create a
> different database state because trying to recreate the same one would
> produce an error and the user might not like getting an error so let's
> just do something else instead and not even bother telling them.

This is a misrepresentation of Tom's words. It isn't actually
self-evident what "we end up with all of the same objects, each
defined in the same way, and that all of the tables end up with all
the same contents that they had before" actually means here, in
general. Tom's main concern seems to be just that -- the ambiguity
itself.

If there was a fully worked out idea of what that would mean, then I
suspect it would be quite subtle and complicated -- it's an inherently
tricky area. You seem to be saying that the way that this stuff
currently works is correct by definition, except when it isn't.

-- 
Peter Geoghegan




Re: pg_dump versus hash partitioning

2023-02-01 Thread Peter Geoghegan
On Wed, Feb 1, 2023 at 12:39 PM Robert Haas  wrote:
> I don't think the fact that our *traditional* standard for how stable
> a hash function needs to be has been XYZ carries any water. Needs
> change over time, and we adapt the code to meet the new needs. Since
> we have no system for type properties in PostgreSQL -- a design
> decision I find questionable -- we tie all such properties to operator
> classes.

Are you familiar with B-Tree opclass support function 4, equalimage?
It's used to determine whether a B-Tree index can use deduplication at
CREATE INDEX time. ISTM that the requirements are rather similar here
-- perhaps even identical.

See: https://www.postgresql.org/docs/devel/btree-support-funcs.html
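
To see which btree opfamilies actually provide that support function, one
can poke at the catalogs directly (the query shape is mine, not from that
page):

```
SELECT fam.opfname, p.amproclefttype::regtype, p.amproc
FROM pg_amproc p
JOIN pg_opfamily fam ON fam.oid = p.amprocfamily
WHERE p.amprocnum = 4
  AND fam.opfmethod = (SELECT oid FROM pg_am WHERE amname = 'btree')
ORDER BY fam.opfname;
```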

-- 
Peter Geoghegan




Re: pg_dump versus hash partitioning

2023-02-01 Thread Robert Haas
On Wed, Feb 1, 2023 at 4:12 PM Tom Lane  wrote:
> > I don't think the fact that our *traditional* standard for how stable
> > a hash function needs to be has been XYZ carries any water.
>
> Well, it wouldn't need to if we had a practical way of changing the
> behavior of an existing hash function, but guess what: we don't.
> Andrew's original proposal for fixing this was exactly to change the
> behavior of hashenum().  There were some problems with the idea of
> depending on enumsortorder instead of enum OID, but the really
> fundamental issue is that you can't change hashing behavior without
> breaking pg_upgrade completely.  Not only will your hash indexes be
> corrupt, but your hash-partitioned tables will be broken, in exactly
> the same way that we're trying to solve for dump/reload cases (which
> of course will *also* be broken by redefining the hash function, if
> you didn't use --load-via-partition-root).  Moreover, while we can
> always advise people to reindex, there's no similarly easy way to fix
> broken partitioning.
>
> That being the case, I don't think moving the goalposts for hash
> function stability is going to lead to a workable solution.

I don't see that there is any easy, clean way to solve this in
released branches. The idea that I proposed could be implemented in
master, and I think it is the right kind of fix, but it is not
back-patchable. However, I think your argument rests on the premise
that making --load-via-partition-root the default behavior in some or
all cases will not break anything for anyone, and I'm skeptical. I
think that's a significant behavior change and that some people will
notice, and some will find it an improvement while others will find it
worse than the current behavior. I also think that there must be a lot
more people using partitioning in general, and even hash partitioning
specifically, than there are people using hash partitioning on an enum
column.

Personally, I would rather disallow this case in the back-branches --
i.e. make pg_dump barf if it is encountered and block CREATE TABLE
from setting up any new situations of this type -- than foist
--load-via-partition-root on many people who aren't affected by the
issue. I'm not saying that's a great answer, but we have to pick from
the choices we have.

I also don't accept that if someone has hit this issue they are just
hosed and there's no way out. Yeah, it's not a lot of fun: you
probably have to use "CREATE TABLE unpartitioned AS SELECT * FROM
borked; DROP TABLE borked;" or so to rescue your data. But what would
we do if we discovered that the btree opclass sorts 1 before 0, or
something? Surely we wouldn't refuse to fix the opclass just because
some users have existing indexes on disk that would be out of order
with the new opclass definition. We'd just change it and people would
have to deal. People with indexes would need to reindex. People with
partitioning boundaries between 0 and 1 would need to repartition.
This case isn't the same because hashenum() isn't broken in general,
just for this particular purpose. But I think you're trying to argue
that we should fix this by changing something other than the thing
that is broken, and I don't agree with that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_dump versus hash partitioning

2023-02-01 Thread Tom Lane
Robert Haas  writes:
> It seems to me that the job of pg_dump is to produce a dump that, when
> reloaded on another system, recreates the same database state. That
> means that we end up with all of the same objects, each defined in the
> same way, and that all of the tables end up with all the same contents
> that they had before.

No, its job is to produce *logically* the same state.  We don't expect
it to produce, say, the same CTID value that each row had before ---
if you want that, you use pg_upgrade or physical replication, not
pg_dump.  There are fairly strong constraints on how identical we
can make the results given that we're not replicating physical state.
And that's not even touching the point that frequently what the user
is after is not an identical copy anyway.

> Here, you'd like to argue that it's perfectly
> fine if we instead insert some of the rows into different tables than
> where they were on the original system.

I can agree with that argument for range or list partitioning, where
the partitions have some semantic meaning to the user.  I don't buy it
for hash partitioning.  It was an implementation artifact to begin
with that a given row ended up in partition 3 not partition 11, so why
would users care which partition it ends up in after a dump/reload?
If they think there is a difference between the partitions, they need
education.

> ... But just as we normally prioritize correctness over speed, so
> also do we normally throw errors when things aren't right instead of
> silently accepting bad input. The partitions in this scenario are
> tables that have constraints.

Again, for hash partitioning those constraints are implementation
artifacts not something that users should have any concern with.

> Keep in mind that there's no rule that a user can't
> query a partition directly.

Sure, and in the case of a hash partition, what he is going to
get is an implementation-dependent subset of the rows.  I use
"implementation-dependent" here in the same sense that the SQL
standard does, which is "there is a rule but the implementation
doesn't have to tell you what it is".  In particular, we are not
bound to make the subset be the same on different installations.
We already didn't promise that, because of issues like endianness.
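
To make that concrete -- a sketch with made-up object names: the partition a
given enum value maps to comes from hashenum(), which hashes the enum value's
OID, so the mapping is an implementation artifact that need not match between
installations that created the type independently.

```
CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');
CREATE TABLE t (m mood) PARTITION BY HASH (m);
CREATE TABLE t_p0 PARTITION OF t FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE t_p1 PARTITION OF t FOR VALUES WITH (MODULUS 2, REMAINDER 1);
INSERT INTO t VALUES ('happy');
-- Which partition holds the row depends on the OID assigned to 'happy' at
-- CREATE TYPE time, which is why a dump/reload can route it differently.
SELECT tableoid::regclass AS partition, m FROM t;
```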

regards, tom lane




Re: pg_dump versus hash partitioning

2023-02-01 Thread Robert Haas
On Wed, Feb 1, 2023 at 4:44 PM Peter Geoghegan  wrote:
> This is a misrepresentation of Tom's words. It isn't actually
> self-evident what "we end up with all of the same objects, each
> defined in the same way, and that all of the tables end up with all
> the same contents that they had before" actually means here, in
> general. Tom's main concern seems to be just that -- the ambiguity
> itself.

As far as I can see, Tom didn't admit that there was any ambiguity. He
just said that I was advocating for wrong behavior for the sake of
performance. I don't think that is what I was doing. I also don't
really think Tom thinks that that is what I was doing. But it is what
he said I was doing.

I do agree with you that the ambiguity is the root of the issue. I
mean, if we can't put the rows back into the same partitions where
they were before, does the user care about that, or do they only care
that the rows end up in some partition of the toplevel partitioned
table? I think that there's no single definition of correctness that
is the only defensible one here, and we can't know which one the user
wants a priori. I also think that they can care which one they're
getting, and thus I think that changing the default in a minor release
is a bad plan. Tom, as I understand it, is arguing that the
--load-via-partition-root behavior has negligible downsides and is
almost categorically better than the current default behavior, and
thus making that the new default in some or all situations in a minor
release is totally fine.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: MacOS: xsltproc fails with "warning: failed to load external entity"

2023-02-01 Thread Aleksander Alekseev
Hi,

Here are my two cents.

> the minimum version appears to be newer than RHEL8's 1.8.2,
> which I find pretty unfortunate.  On RHEL8, it fails with

> $ ninja
> ninja: error: build.ninja:6771: multiple outputs aren't (yet?) supported by 
> depslog; bring this up on the mailing list if it affects you

> [...]  I don't have hard data on which distros have which
> versions of ninja, but surely somebody checked that at some point?

I'm using three different systems at the moment and the minimum
version of Ninja that is known to work is 1.10.1.

> Normally the ninja version that's pulled in by meson should suffice.

There are several ways to install Meson one of which, if you want the
latest version, is just using PIP:

```
pip3 install --user meson
```

Naturally Ninja will not be pulled in this case.
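
If you go that route, a sufficiently new Ninja can be pulled in the same way
(assuming the PyPI ninja package is acceptable in your environment):

```
pip3 install --user ninja
```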

-- 
Best regards,
Aleksander Alekseev




Re: pg_dump versus hash partitioning

2023-02-01 Thread Tom Lane
Robert Haas  writes:
> Tom, as I understand it, is arguing that the
> --load-via-partition-root behavior has negligible downsides and is
> almost categorically better than the current default behavior, and
> thus making that the new default in some or all situations in a minor
> release is totally fine.

I think it's categorically better than a failed restore.  I wouldn't
be proposing this if there were no such problem; but there is,
and I don't buy your apparent position that we should leave affected
users to cope as best they can.  Yes, it's clearly only a minority
of users that are affected, else we'd have heard complaints before.
But it could be absolutely catastrophic for an affected user,
if they're trying to restore their only backup.  I'd rather impose
an across-the-board cost on all users of hash partitioning than
risk such outcomes for a few.

Also, you've really offered no evidence for your apparent position
that --load-via-partition-root has unacceptable overhead.  We've
done enough work on partition routing over the last few years that
whatever measurements might've originally justified that idea
don't necessarily apply anymore.  Admittedly, I've not measured
it either.  But we don't tell people to avoid partitioning because
INSERT is unduly expensive.  Partition routing is just the cost of
doing business in that space.

regards, tom lane




Re: pg_dump versus hash partitioning

2023-02-01 Thread Peter Geoghegan
On Wed, Feb 1, 2023 at 2:12 PM Robert Haas  wrote:
> On Wed, Feb 1, 2023 at 4:44 PM Peter Geoghegan  wrote:
> > This is a misrepresentation of Tom's words. It isn't actually
> > self-evident what "we end up with all of the same objects, each
> > defined in the same way, and that all of the tables end up with all
> > the same contents that they had before" actually means here, in
> > general. Tom's main concern seems to be just that -- the ambiguity
> > itself.
>
> As far as I can see, Tom didn't admit that there was any ambiguity.

Tom said:

"I spent a bit more time thinking about that, and while I agree that
it's an oddity, I don't see that it matters in the case of hash
partitioning.  You would notice an issue if you tried to do a
selective restore of just one partition --- but under what
circumstance would that be a useful thing to do?"

While the word ambiguity may not have actually been used, Tom very
clearly admitted some ambiguity. But even if he didn't, so what? It's
perfectly obvious that that's the major underlying issue, and that
this is a high level problem rather than a low level problem.

> He just said that I was advocating for wrong behavior for the sake of
> performance. I don't think that is what I was doing. I also don't
> really think Tom thinks that that is what I was doing. But it is what
> he said I was doing.

And I think that you're making a mountain out of a molehill.

> I do agree with you that the ambiguity is the root of the issue. I
> mean, if we can't put the rows back into the same partitions where
> they were before, does the user care about that, or do they only care
> that the rows end up in some partition of the toplevel partitioned
> table?

That was precisely the question that Tom posed to you, in the same
email as the one that you found objectionable.

> Tom, as I understand it, is arguing that the
> --load-via-partition-root behavior has negligible downsides and is
> almost categorically better than the current default behavior, and
> thus making that the new default in some or all situations in a minor
> release is totally fine.

I don't know why you seem to think that it was such an absolutist
position as that.

You mentioned "minor releases" here. Who said anything about that?

-- 
Peter Geoghegan




Re: Weird failure with latches in curculio on v15

2023-02-01 Thread Nathan Bossart
On Wed, Feb 01, 2023 at 09:58:06AM -0800, Nathan Bossart wrote:
> On Wed, Feb 01, 2023 at 08:58:01AM -0800, Andres Freund wrote:
>> On 2023-02-01 10:12:26 -0500, Tom Lane wrote:
>>> The fundamental issue is that we have no good way to break out
>>> of system(), and I think the original idea was that
>>> in_restore_command would be set *only* for the duration of the
>>> system() call.  That's clearly been lost sight of completely,
>>> but maybe as a stopgap we could try to get back to that.
>> 
>> We could push the functions setting in_restore_command down into
>> ExecuteRecoveryCommand(). But I don't think that'd end up necessarily
>> being right either - we'd now use the mechanism in places we previously
>> didn't (cleanup/end commands).
> 
> Right, we'd only want to set it for restore_command.  I think that's
> doable.

Here is a first draft for the proposed stopgap fix.  If we want to proceed
with this, I can provide patches for the back branches.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From c703a9f7ac6c43e65fc32117980495ac7980e2e3 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 1 Feb 2023 14:32:02 -0800
Subject: [PATCH v1 1/1] stopgap fix for restore_command

---
 src/backend/access/transam/shell_restore.c | 22 --
 src/backend/access/transam/xlogarchive.c   |  7 ---
 2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/transam/shell_restore.c b/src/backend/access/transam/shell_restore.c
index 8458209f49..abec023c1a 100644
--- a/src/backend/access/transam/shell_restore.c
+++ b/src/backend/access/transam/shell_restore.c
@@ -21,6 +21,7 @@
 #include "access/xlogarchive.h"
 #include "access/xlogrecovery.h"
 #include "common/percentrepl.h"
+#include "postmaster/startup.h"
 #include "storage/ipc.h"
 #include "utils/wait_event.h"
 
@@ -124,8 +125,7 @@ shell_recovery_end(const char *lastRestartPointFileName)
  * human-readable name describing the command emitted in the logs. If
  * 'failOnSignal' is true and the command is killed by a signal, a FATAL
  * error is thrown. Otherwise, 'fail_elevel' is used for the log message.
- * If 'exitOnSigterm' is true and the command is killed by SIGTERM, we exit
- * immediately.
+ * If 'exitOnSigterm' is true and SIGTERM is received, we exit immediately.
  *
  * Returns whether the command succeeded.
  */
@@ -146,7 +146,25 @@ ExecuteRecoveryCommand(const char *command, const char *commandName,
 	 */
 	fflush(NULL);
 	pgstat_report_wait_start(wait_event_info);
+
+	/*
+	 * PreRestoreCommand() is used to tell the SIGTERM handler for the startup
+	 * process that it is okay to proc_exit() right away on SIGTERM.  This is
+	 * done for the duration of the system() call because there isn't a good
+	 * way to break out while it is executing.  Since we might call proc_exit()
+	 * in a signal handler here, it is extremely important that nothing but the
+	 * system() call happens between the calls to PreRestoreCommand() and
+	 * PostRestoreCommand().  Any additional code must go before or after this
+	 * section.
+	 */
+	if (exitOnSigterm)
+		PreRestoreCommand();
+
 	rc = system(command);
+
+	if (exitOnSigterm)
+		PostRestoreCommand();
+
 	pgstat_report_wait_end();
 
 	if (rc != 0)
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 4b89addf97..66312c816b 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -147,18 +147,11 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 	else
 		XLogFileName(lastRestartPointFname, 0, 0L, wal_segment_size);
 
-	/*
-	 * Check signals before restore command and reset afterwards.
-	 */
-	PreRestoreCommand();
-
 	/*
 	 * Copy xlog from archival storage to XLOGDIR
 	 */
 	ret = shell_restore(xlogfname, xlogpath, lastRestartPointFname);
 
-	PostRestoreCommand();
-
 	if (ret)
 	{
 		/*
-- 
2.25.1



Re: pg_dump versus hash partitioning

2023-02-01 Thread Robert Haas
On Wed, Feb 1, 2023 at 5:08 PM Tom Lane  wrote:
> > Here, you'd like to argue that it's perfectly
> > fine if we instead insert some of the rows into different tables than
> > where they were on the original system.
>
> I can agree with that argument for range or list partitioning, where
> the partitions have some semantic meaning to the user.  I don't buy it
> for hash partitioning.  It was an implementation artifact to begin
> with that a given row ended up in partition 3 not partition 11, so why
> would users care which partition it ends up in after a dump/reload?
> If they think there is a difference between the partitions, they need
> education.

I see your point. I think it's somewhat valid. However, I also think
it muddies the definition of what pg_dump is allowed to do in a way
that I do not like. I think there's a difference between the CTID or
XMAX of a tuple changing and it ending up in a totally different
partition. It feels like it makes the definition of correctness
subjective: we do think that people care about range and list
partitions as individual entities, so we'll put the rows back where
they were and complain if we can't, but we don't think they think
about hash partitions that way, so we will err on the side of making
the dump restoration succeed. That's a level of guessing what the user
meant that I think is uncomfortable. I feel like when somebody around
here discovers that sort of reasoning in some other software's code,
or in a proposed patch, it pretty often gets blasted on this mailing
list with considerable vigor.

I think you can construct plausible cases where it's not just
academic. For instance, suppose I intend to use some kind of logical
replication system, not necessarily the one built into PostgreSQL, to
replicate data between two systems. Before engaging that system, I
need to make the initial database contents match. The system is
oblivious to partitioning, and just replicates each table to a table
with a matching name. Well, if the partitions don't actually match up
1:1, I kind of need to know about that. In this use case, the rows
silently moving around doesn't meet my needs. Or, suppose I dump and
restore two databases. It works perfectly. I then run a comparison
tool of some sort that compares the two databases. EDB has such a
tool! I don't know whether it would perform the comparison via the
partition root or not, because I don't know how it works. But I find
it pretty plausible that some such tool would show differences between
the source and target databases. Now, if I had done the data migration
using --load-with-partition-root, I would expect that. I might even be
looking for it, to see what got moved around. But otherwise, it might
be unexpected.

Another subtle problem with this whole situation is: suppose that on
host A, I set up a table hash-partitioned by an enum column and make a
bunch of hash partitions. Then, on host B, I set up the same table
with a bunch of foreign table partitions, each corresponding to the
matching partition on the other node. I guess that just doesn't work,
whereas if the column were of any other data type, it would work. If
it were a collatable data type, it would need the collations and
collation definitions to match, too.

I know these things are subtle, and maybe these specific things will
never happen to anyone, or nobody will care. I don't know. I just have
a really hard time accepting a categorical statement that this
just can't ever matter to anyone.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_dump versus hash partitioning

2023-02-01 Thread Tom Lane
Peter Geoghegan  writes:
> You mentioned "minor releases" here. Who said anything about that?

I did: I'd like to back-patch the fix if possible.  I think changing
the default --load-via-partition-root choice could be back-patchable.

If Robert is resistant to that but would accept it in master,
I'd settle for that in preference to having no fix.

regards, tom lane




Re: pg_dump versus hash partitioning

2023-02-01 Thread Tom Lane
Robert Haas  writes:
> On Wed, Feb 1, 2023 at 5:08 PM Tom Lane  wrote:
>> I can agree with that argument for range or list partitioning, where
>> the partitions have some semantic meaning to the user.  I don't buy it
>> for hash partitioning.  It was an implementation artifact to begin
>> with that a given row ended up in partition 3 not partition 11, so why
>> would users care which partition it ends up in after a dump/reload?
>> If they think there is a difference between the partitions, they need
>> education.

> I see your point. I think it's somewhat valid. However, I also think
> it muddies the definition of what pg_dump is allowed to do in a way
> that I do not like. I think there's a difference between the CTID or
> XMAX of a tuple changing and it ending up in a totally different
> partition. It feels like it makes the definition of correctness
> subjective: we do think that people care about range and list
> partitions as individual entities, so we'll put the rows back where
> they were and complain if we can't, but we don't think they think
> about hash partitions that way, so we will err on the side of making
> the dump restoration succeed. That's a level of guessing what the user
> meant that I think is uncomfortable.

I see your point too, and to me it's evidence for the position that
we should never have done hash partitioning in the first place.
It's precisely because you want to analyze it in the same terms
as range/list partitioning that we have these issues.  Or we could
have built it on some other infrastructure than hash index opclasses
... but we didn't do that, and now we have a mess.  I don't see a way
out other than relaxing the guarantees about how hash partitioning
works compared to the other kinds.

regards, tom lane




Re: pg_dump versus hash partitioning

2023-02-01 Thread David G. Johnston
On Wed, Feb 1, 2023 at 3:38 PM Tom Lane  wrote:

> Peter Geoghegan  writes:
> > You mentioned "minor releases" here. Who said anything about that?
>
> I did: I'd like to back-patch the fix if possible.  I think changing
> the default --load-via-partition-root choice could be back-patchable.
>
> If Robert is resistant to that but would accept it in master,
> I'd settle for that in preference to having no fix.
>

 I'm for accepting --load-via-partition-root as the solution to this problem.
I think it is better than doing nothing and, at the moment, I don't see any
alternatives to pick from.
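
For anyone following along, the behavior in question already exists as an
opt-in pg_dump flag; the debate here is only about the default.  A typical
invocation would be roughly:

```
pg_dump --load-via-partition-root -Fc -f mydb.dump mydb
pg_restore -d newdb mydb.dump
```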

As evidenced from the current influx of collation problems related to
indexes, we would be foolish to discount the collation issues with just
plain text, so limiting this only to the enum case (which is a must-have)
doesn't seem wise.

pg_dump should be conservative in what it produces - which in this
situation means having minimal environmental dependencies and internal
volatility.

In the interest of compatibility, having an escape hatch to do things as
they are done today is something we should provide.  We got this one wrong
and that is going to cause some pain.  Though at least with the escape
hatch we shouldn't be dealing with as many unresolvable complaints as when
we back-patched removing the public schema from search_path.

In the worst case, with the conservative behavior, the user can still
restore their dump file into a local database and get access to their data
in a usable format with minimal hassle.  The few that would like "same
table name or bust" behavior because of external or application-specific
requirements can still get that behavior.

David J.

