Re: Avoid detoast overhead when possible

2023-12-05 Thread Nikita Malakhov
Hi,

Hmmm, I've checked this patch and can't see a performance difference on a
large (2 key-value pairs) json; using the toasted json column several times
makes no difference compared to the current implementation on master (like
the queries mentioned above).

Maybe I'm doing something wrong?

On Tue, Dec 5, 2023 at 4:16 AM  wrote:

>
> Hi,
>
> Matthias van de Meent  writes:
>
> > SELECT toastable_col FROM t1
> > WHERE f(t1.toastable_col)
> > ORDER BY nonindexed;
>
> Thanks for this example! It's true that the current design requires more
> memory to sort, since toastable_col is detoasted at the scan stage and it
> is output to the sort node. That should be avoided.
>
> > SELECT ev_class
> > FROM pg_rewrite
> > WHERE octet_length(ev_action) > 1
> > ORDER BY ev_class;
>
> This one is different I think, since ev_action (the toastable_col) is
> *NOT* output to the sort node, so no extra memory is required IIUC.
>
>  * CP_SMALL_TLIST specifies that a narrower tlist is preferred.  This is
>  * passed down by parent nodes such as Sort and Hash, which will have to
>  * store the returned tuples.
>
> We can also verify this by
>
> explain (costs off, verbose) SELECT ev_class
> FROM pg_rewrite
> WHERE octet_length(ev_action) > 1
> ORDER BY ev_class;
> QUERY PLAN
> --
>  Sort
>Output: ev_class
>Sort Key: pg_rewrite.ev_class
>->  Seq Scan on pg_catalog.pg_rewrite
>  Output: ev_class
>  Filter: (octet_length((pg_rewrite.ev_action)::text) > 1)
> (6 rows)
>
> Only ev_class is output to the Sort node.
>
> So if we want to make sure there is no performance regression for all the
> existing queries in any case, we can add 1 more restriction into the
> saved-detoast-value logic. It must be (NOT under CP_SMALL_TLIST) OR (the
> toastable_col is not in the output list). It can be a planner decision.
>
> If we code like this, the result will be that we need to detoast N times
> for the toastable_col in the quals of the below query.
>
> SELECT toastable_col FROM t
> WHERE f1(toastable_col)
> AND f2(toastable_col)
> ..
> AND fn(toastable_col)
> ORDER BY any-target-entry;
>
> However
>
> SELECT
>   f1(toastable_col),
>   f2(toastable_col),
>   ..
>   fn(toastable_col)
> FROM t
> ORDER BY any-target-entry;
>
> the current path still works for it.
>
> This one is my favorite one so far. Another option is saving the
> detoast-value in some other memory or an existing slot in place for
> different situations; that would require more expression-evaluation
> changes and planner changes. I just checked all the queries in my hand,
> and the current design can cover all of them.
>
> --
> Best Regards
> Andy Fan
>
>
>
>

-- 
Regards,

--
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: Synchronizing slots from primary to standby

2023-12-05 Thread Drouvot, Bertrand

Hi,

On 12/5/23 6:08 AM, shveta malik wrote:

On Mon, Dec 4, 2023 at 10:07 PM Drouvot, Bertrand
 wrote:

Maybe another option could be to have the walreceiver provide a way to let
the slot sync worker know that it (the walreceiver) was not able to start
due to a non-existing replication slot on the primary? (That way we'd avoid
the slot sync worker having to talk to the primary.)


Few points:
1) I think if we do it, we should do it in a generic way, i.e. the slotsync
worker should go to no-op if the walreceiver is not able to start for any
reason, and not only due to an invalid primary_slot_name.


Agree.


2) Secondly, the slotsync worker needs to make sure it has synced the
slots so far, i.e. the worker should not go to no-op immediately on seeing
a missing WalRcv process if there are pending slots to be synced.


Agree.


So the generic way I see to have this optimization is:
1) Slotsync worker can use 'WalRcv->pid' to figure out if WalReceiver
is running or not.


Not sure that would work, because the walreceiver keeps trying to re-start
and so gets a pid before reaching the "could not start WAL streaming:
ERROR:  replication slot "" does not exist" error.

We may want to add an extra check on walrcv->walRcvState (or that check
could be enough on its own).
But walrcv->walRcvState is set to WALRCV_STREAMING way before
walrcv_startstreaming().

Wouldn't it make sense to move it once we are sure that
walrcv_startstreaming() returns true and first_stream is true, here?

"
if (first_stream)
+   {
ereport(LOG,
(errmsg("started streaming WAL from 
primary at %X/%X on timeline %u",

LSN_FORMAT_ARGS(startpoint), startpointTLI)));
+   SpinLockAcquire(&walrcv->mutex);
+   walrcv->walRcvState = WALRCV_STREAMING;
+   SpinLockRelease(&walrcv->mutex);
+   }
"


2) The slotsync worker should check for a null 'WalRcv->pid' only when
no activity has been observed for a threshold time, i.e. it can do it during
the existing logic of increasing naptime.
3) On finding a null 'WalRcv->pid', the worker can mark a flag to go to
no-op unless WalRcv->pid becomes valid again. Marking this flag while
increasing naptime will guarantee that the worker has taken all the
changes so far, i.e. the standby is not lagging in terms of slots.



2) and 3) look good to me but with a check on walrcv->walRcvState
looking for the WALRCV_STREAMING state instead of looking for a non-null
WalRcv->pid.

And only if it makes sense to move the walrcv->walRcvState = WALRCV_STREAMING
assignment as mentioned above (I think it does).

Thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid detoast overhead when possible

2023-12-05 Thread zhihuifan1213


Nikita Malakhov  writes:

> Hi,
>
> Hmmm, I've checked this patch and can't see a performance difference on a
> large (2 key-value pairs) json; using the toasted json column several times
> makes no difference compared to the current implementation on master (like
> the queries mentioned above).
>
> Maybe I'm doing something wrong?

Could you try something like below? (Set jit to off to turn on this
feature.) Or could you tell me the steps you used?  I also attached the
setup.sql at the beginning of this thread.

select big->'1', big->'2', big->'3', big->'5', big->'10' from b;  
  QUERY PLAN   
---
 Seq Scan on b (actual time=1.731..1577.911 rows=1001 loops=1)
 Planning Time: 0.099 ms
 Execution Time: 1578.411 ms
(3 rows) 

set jit to off;

select big->'1', big->'2', big->'3', big->'5', big->'10' from b;  
  QUERY PLAN  
--
 Seq Scan on b (actual time=0.417..309.937 rows=1001 loops=1)
 Planning Time: 0.097 ms
 Execution Time: 310.255 ms
(3 rows)

(I used 'jit=off' to turn on this feature just because I'm still not
ready for JIT code.)

-- 
Best Regards
Andy Fan





Re: Test 002_pg_upgrade fails with olddump on Windows

2023-12-05 Thread Alexander Lakhin

Hi Michael,

05.12.2023 10:56, Michael Paquier wrote:


Or you have used the test suite with an old installation that has the
same major version as the new installation, meaning that the filtering
was not happening, still you have detected some diffs?  It sounds to
me that we should just apply the filters to the dumps all the time if
you have used matching versions.  The filtering would remove only the
comments, some extra newlines and replace the CRLFs in this case.


Yes, my case is with the same version, literally:
build>echo create database regression>c:\temp\dump.sql
build>set olddump=c:\temp\dump.sql& set oldinstall=%CD%/tmp_install/usr/local/pgsql& meson test pg_upgrade/002_pg_upgrade

So removing the condition "if ($oldnode->pg_version != $newnode->pg_version)"
works here as well, but maybe switching the file mode (to preserve EOLs
produced by pg_dump) in the block "After dumping, update references ..."
is more efficient than filtering dumps (on all OSes?).

Best regards,
Alexander




Re: pg_upgrade and logical replication

2023-12-05 Thread Amit Kapila
On Tue, Dec 5, 2023 at 10:56 AM Michael Paquier  wrote:
>
> On Mon, Dec 04, 2023 at 04:30:49PM +0530, Amit Kapila wrote:
> > I have made minor changes in the comments and code at various places.
> > See and let me know if you are not happy with the changes. I think
> > unless there are more suggestions or comments, we can proceed with
> > committing it.
>
> Yeah.  I am planning to look more closely at what you have here, and
> it is going to take me a bit more time though (some more stuff planned
> for next CF, an upcoming conference and end/beginning-of-year
> vacations), but I think that targeting the beginning of next CF in
> January would be OK.
>
> Overall, I have the impression that the patch looks pretty solid, with
> a restriction in place for "init" and "ready" relations, while there
> are tests to check all the states that we expect.  Seeing coverage
> about all that makes me a happy hacker.
>
> + * If retain_lock is true, then don't release the locks taken in this 
> function.
> + * We normally release the locks at the end of transaction but in 
> binary-upgrade
> + * mode, we expect to release those immediately.
>
> I think that this should be documented in pg_upgrade_support.c where
> the caller expects the locks to be released, and why these should be
> released.  There is a risk that this comment becomes obsolete if
> AddSubscriptionRelState() with locks released is called in a different
> code path.  Anyway, I am not sure to get why this is OK, or even
> necessary.  It seems like a good practice to keep the locks on the
> subscription until the transaction that updates its state.  If there's
> a specific reason explaining why that's better, the patch should tell
> why.
>

It is to be consistent with other code paths in the upgrade. We
followed existing coding rules like what we do in
binary_upgrade_set_missing_value->SetAttrMissing(). The probable
theory is that during the upgrade we are not worried about concurrent
operations being blocked till the transaction ends. In this
particular case, we know that the apply worker won't try to sync any
of those relations, nor will a concurrent DDL try to remove them from
pg_subscription_rel. This point is not explicitly commented on
because of its similarity with the existing code.

>
> +my $result = $old_sub->safe_psql('postgres',
> +   "SELECT count(1) = 1 FROM pg_subscription_rel WHERE srsubstate = 'i'");
> +is($result, qq(t), "Check that the table is in init state");
>
> Hmm.  Not sure that this is safe.  Shouldn't this be a
> poll_query_until(), polling that the state of the relation is what we
> want it to be after requesting a refresh of the publication on the
> subscriber?
>

This is safe because the init state should be marked by the "Alter
Subscription ... Refresh .." command itself. What exactly makes you
think that such a poll would be required?

-- 
With Regards,
Amit Kapila.




Re: [PoC] Federated Authn/z with OAUTHBEARER

2023-12-05 Thread Daniel Gustafsson
> On 8 Nov 2023, at 20:00, Jacob Champion  wrote:

> Unfortunately the configure/Makefile build of libpq now seems to be
> pulling in an `exit()` dependency in a way that Meson does not.

I believe this comes from libcurl, specifically the ntlm_wb support,
which is often enabled in system and package-manager-provided installations.
There isn't really a fix here apart from requiring a libcurl not compiled with
ntlm_wb support, or adding an exception to the exit() check in the Makefile.

Bringing this up with other curl developers to see if it could be fixed, we
instead decided to deprecate the whole module, as it's quirky and not used much.
This won't help with existing installations, but at least it will be deprecated
and removed by the time v17 ships, so gating on a curl version shipped after its
removal will avoid it.

https://github.com/curl/curl/commit/04540f69cfd4bf16e80e7c190b645f1baf505a84

> (Or maybe Meson isn't checking?) I still need to investigate that
> difference and fix it, so I recommend Meson if you're looking to
> test-drive a build.

There is no corresponding check in the Meson build, which seems like a TODO.

--
Daniel Gustafsson





Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2023-12-05 Thread Alena Rybakina

Hi!

Thank you for your contribution to this thread.

On 04.12.2023 05:23, jian he wrote:

hi.
here is my implementation based on previous discussions

add a new COPY FROM flag save_error.
save_error only works with non-BINARY mode.
save_error is easier for me to implement; if using "save error" (two
words), I worry gram.y will not work.
save_error also works with other flags like {csv mode, force_null, force_not_null}

overall logic is:
if save_error is specified then
    if the error_holding table does not exist then create one
    if the error_holding table exists then set error_firsttime to false.
if save_error is not specified then work as the master branch does.

if errors happen then insert the error info into the error_holding table.
if errors do not exist and error_firsttime is true then drop the table.
if errors do not exist and error_firsttime is false then raise a
notice: All the past error holding saved at %s.%s

error holding table:
the schema will be the same as the COPY destination table.
the table name will be: the COPY destination name concatenated with "_error".

error_holding table definition:
CREATE TABLE err_nsp.error_rel (LINENO BIGINT, LINE TEXT,
FIELD TEXT, SOURCE TEXT, ERR_MESSAGE TEXT,
ERR_DETAIL TEXT, ERRORCODE TEXT);

the following field is not implemented.
FIELDS  text[], separated, de-escaped string fields (the data that was
or would be fed to input functions)

because imagine the following case:
create type test as (a int, b text);
create table copy_comp (c1 int, c2 test default '(11,test)', c3 date);
copy copy_comp from stdin with (default '\D');
1 \D '2022-07-04'
\.
table copy_comp;

I feel it's hard to get the text[] `(11,test)` from the textual '\D' via SPI.
--
demo:

create table copy_default_error_save (
id integer,
text_value text not null default 'test',
ts_value timestamp without time zone not null default '2022-07-05'
);
copy copy_default_error_save from stdin with (save_error, default '\D');
k value '2022-07-04'
z \D '2022-07-03ASKL'
s \D \D
\.

NOTICE:  3 rows were skipped because of error. skipped row saved to
table public.copy_default_error_save_error
select * from copy_default_error_save_error;
 lineno |           line           |  field   |      source      |                         err_message                          | err_detail | errorcode
--------+--------------------------+----------+------------------+---------------------------------------------------------------+------------+-----------
      1 | k   value   '2022-07-04' | id       | k                | invalid input syntax for type integer: "k"                    |            | 22P02
      2 | z   \D  '2022-07-03ASKL' | id       | z                | invalid input syntax for type integer: "z"                    |            | 22P02
      2 | z   \D  '2022-07-03ASKL' | ts_value | '2022-07-03ASKL' | invalid input syntax for type timestamp: "'2022-07-03ASKL'"   |            | 22007
      3 | s   \D  \D               | id       | s                | invalid input syntax for type integer: "s"                    |            | 22P02
(4 rows)

The doc is not so good.

COPY FROM (save_error) will not be as fast as COPY FROM (save_error false).
With save_error, we can only use InputFunctionCallSafe, which I
believe is not as fast as InputFunctionCall.
If any conversion error happens, we need to call the SPI interface,
which adds more overhead. Also, we can only insert error rows one by
one. (Maybe we can insert into error_save with values(error1), (error2);
I will try that later.)
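
(For reference, the soft-error path mentioned here follows the existing
InputFunctionCallSafe idiom, roughly as in the sketch below; the variable
names are illustrative, not taken from the patch.)

    ErrorSaveContext escontext = {T_ErrorSaveContext};
    Datum       value;

    escontext.details_wanted = true;

    /* parse one field, capturing any conversion error instead of aborting */
    if (!InputFunctionCallSafe(&in_function, field_string,
                               typioparam, typmod,
                               (Node *) &escontext, &value))
    {
        /*
         * Soft failure: escontext.error_data holds the ErrorData, which
         * can be recorded in the error-holding table instead of raised.
         */
    }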

The main code is about constructing the SPI query, plus the tests and test output.

I reviewed it and have a few questions.

1. I have seen that you delete the table to which you want to add errors
from a failed "copy from" operation before creating it. I think this
is wrong, because this table can hold useful data for the user.
At a minimum, we should warn the user about this, but I think we can
just add a number at the end of the name, such as name_table_1,
name_table_2.


2. I noticed that you are forming the table name using the type of errors
that prevented rows from being added during the 'copy from' operation.
I think it would be better to use the name of the source file that was
used while 'copy from' was running.

In addition, there may be several such files; that is also worth considering.

3. I found a spelling error:

/* no err_nsp.error_rel table then crete one. for holding error. */

4. Maybe rewrite this comment

these info need, no error will drop err_nsp.error_rel table

to:

this information is necessary; no error will lead to the deletion of the
err_nsp.error_rel table.


5. Is this part of the comment needed? I think it duplicates the
information below where we form the query.


 * . column list (order by attnum, begin from ctid) =
 *    {ctid, lineno, line, field, source, err_message, err_detail, errorcode}
 * . data types (from attnum = -1) = {tid, int8, text, text, text, text, text, text}


I'm not sure if we need to order the rows by number. It might be easier
to work with these

Re: Avoid detoast overhead when possible

2023-12-05 Thread Nikita Malakhov
Hi,

With your setup (table created with setup.sql):
postgres@postgres=# explain analyze select big->'1', big->'2', big->'3',
big->'5', big->'10' from b;
  QUERY PLAN
--
 Seq Scan on b  (cost=0.00..29.52 rows=1001 width=160) (actual
time=0.656..359.964 rows=1001 loops=1)
 Planning Time: 0.042 ms
 Execution Time: 360.177 ms
(3 rows)

Time: 361.054 ms
postgres@postgres=# explain analyze select big->'1' from b;
 QUERY PLAN

 Seq Scan on b  (cost=0.00..19.51 rows=1001 width=32) (actual
time=0.170..63.996 rows=1001 loops=1)
 Planning Time: 0.042 ms
 Execution Time: 64.063 ms
(3 rows)

Time: 64.626 ms

Without patch, the same table and queries:
postgres@postgres=# explain analyze select big->'1', big->'2', big->'3',
big->'5', big->'10' from b;
  QUERY PLAN
--
 Seq Scan on b  (cost=0.00..29.52 rows=1001 width=160) (actual
time=0.665..326.399 rows=1001 loops=1)
 Planning Time: 0.035 ms
 Execution Time: 326.508 ms
(3 rows)

Time: 327.132 ms
postgres@postgres=# explain analyze select big->'1' from b;
 QUERY PLAN

 Seq Scan on b  (cost=0.00..19.51 rows=1001 width=32) (actual
time=0.159..62.807 rows=1001 loops=1)
 Planning Time: 0.033 ms
 Execution Time: 62.879 ms
(3 rows)

Time: 63.504 ms

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: Synchronizing slots from primary to standby

2023-12-05 Thread shveta malik
On Tue, Dec 5, 2023 at 2:18 PM Drouvot, Bertrand
 wrote:
>
> Hi,
>
> On 12/5/23 6:08 AM, shveta malik wrote:
> > On Mon, Dec 4, 2023 at 10:07 PM Drouvot, Bertrand
> >  wrote:
> >> Maybe another option could be to have the walreceiver provide a way to let
> >> the slot sync worker know that it (the walreceiver) was not able to start
> >> due to a non-existing replication slot on the primary? (That way we'd avoid
> >> the slot sync worker having to talk to the primary.)
> >
> > Few points:
> > 1) I think if we do it, we should do it in a generic way, i.e. the slotsync
> > worker should go to no-op if the walreceiver is not able to start for any
> > reason, and not only due to an invalid primary_slot_name.
>
> Agree.
>
> > 2) Secondly, the slotsync worker needs to make sure it has synced the
> > slots so far, i.e. the worker should not go to no-op immediately on seeing
> > a missing WalRcv process if there are pending slots to be synced.
>
> Agree.
>
> > So the generic way I see to have this optimization is:
> > 1) Slotsync worker can use 'WalRcv->pid' to figure out if WalReceiver
> > is running or not.
>
> Not sure that would work, because the walreceiver keeps trying to re-start
> and so gets a pid before reaching the "could not start WAL streaming:
> ERROR:  replication slot "" does not exist" error.
>

Yes, right. The pid will keep on toggling.

> We may want to add an extra check on walrcv->walRcvState (or that check
> could be enough on its own).
> But walrcv->walRcvState is set to WALRCV_STREAMING way before
> walrcv_startstreaming().
>

Agree. A check on 'walrcv->walRcvState' alone should suffice.

> Wouldn't it make sense to move it once we are sure that
> walrcv_startstreaming() returns true and first_stream is true, here?
>
> "
> if (first_stream)
> +   {
>         ereport(LOG,
>                 (errmsg("started streaming WAL from primary at %X/%X on timeline %u",
>                         LSN_FORMAT_ARGS(startpoint), startpointTLI)));
> +   SpinLockAcquire(&walrcv->mutex);
> +   walrcv->walRcvState = WALRCV_STREAMING;
> +   SpinLockRelease(&walrcv->mutex);
> +   }
> "
>

Yes, it makes sense and is the basis for the current slot-sync worker
changes being discussed.

> > 2) The slotsync worker should check for a null 'WalRcv->pid' only when
> > no activity has been observed for a threshold time, i.e. it can do it during
> > the existing logic of increasing naptime.
> > 3) On finding a null 'WalRcv->pid', the worker can mark a flag to go to
> > no-op unless WalRcv->pid becomes valid again. Marking this flag while
> > increasing naptime will guarantee that the worker has taken all the
> > changes so far, i.e. the standby is not lagging in terms of slots.
> >
>
> 2) and 3) look good to me but with a check on walrcv->walRcvState
> looking for the WALRCV_STREAMING state instead of looking for a non-null
> WalRcv->pid.

Yes. But I think the worker should enter no-op when walRcvState is
WALRCV_STOPPED, and not when walRcvState != WALRCV_STREAMING, as it is
okay to have WALRCV_WAITING/STARTING/RESTARTING. But the worker should
exit no-op only when it finds walRcvState switched back to
WALRCV_STREAMING.
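
In rough pseudocode, those enter/exit conditions would be something like
the following (illustrative only; 'in_noop_mode' and 'all_slots_synced()'
are made-up names for the mechanism being discussed):

    WalRcvState rcv_state;

    SpinLockAcquire(&WalRcv->mutex);
    rcv_state = WalRcv->walRcvState;
    SpinLockRelease(&WalRcv->mutex);

    if (!in_noop_mode && rcv_state == WALRCV_STOPPED && all_slots_synced())
        in_noop_mode = true;    /* walreceiver is gone and we are caught up */
    else if (in_noop_mode && rcv_state == WALRCV_STREAMING)
        in_noop_mode = false;   /* streaming resumed, start syncing again */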

>
> And only if it makes sense to move the walrcv->walRcvState = WALRCV_STREAMING
> assignment as mentioned above (I think it does).
>

yes, I agree.

thanks
Shveta




Re: Add semi-join pushdown to postgres_fdw

2023-12-05 Thread Alexander Pyhalov

Alexander Korotkov wrote on 2023-12-03 23:52:

Hi, Alexander!

On Mon, Nov 27, 2023 at 5:11 PM Alexander Pyhalov
 wrote:

Alexander Korotkov wrote on 2023-11-27 03:49:

> Thank you for the revision.
>
> I've revised the patch myself.  I've replaced StringInfo with
> additional conds into a list of strings as I proposed before.  I think
> the code became much clearer.  Also, it gets rid of some unnecessary
> allocations.
>
> I think the code itself is not in bad shape.  But patch lacks some
> high-level description of semi-joins processing as well as comments on
> each manipulation with additional conds.  Could you please add this?
>

Hi. The updated patch looks better. It seems I had failed to fix the logic
in deparseFromExprForRel() when I tried to convert StringInfos to Lists.

I've added some comments. The most complete description of how SEMI-JOIN
is processed is located in deparseFromExprForRel(). Unfortunately,
there seems to be no single place describing the current JOIN deparsing
logic.


Looks good to me. I've made some grammar and formatting adjustments.
Also, I've written the commit message.

Now, I think this looks good.  I'm going to push this if no objections.

--
Regards,
Alexander Korotkov


Hi. No objections from my side.

Perhaps some rephrasing is needed in the comment in semijoin_target_ok():

"The planner can create semi-joins, which refer to inner rel
vars in its target list."

Perhaps change "semi-joins, which refer" to "a semi-join, which refers
...", as later we speak about "its" target list.

--
Best regards,
Alexander Pyhalov,
Postgres Professional




Re: POC, WIP: OR-clause support for indexes

2023-12-05 Thread Andrei Lepikhov
Here is a fresh version with the pg_dump.pl regex fixed. Now it should pass
the buildfarm.


Under development:
1. Explanation of the general idea in comments (Robert's note)
2. Issue with hiding some optimizations (Alexander's note and example 
with overlapping clauses on two partial indexes)
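
For reference, the transformation in the attached patch rewrites a chain of
ORs over the same expression into a single ANY, e.g. (t and a being a
stand-in table and column):

-- as written by the user:
SELECT * FROM t WHERE a = 1 OR a = 2 OR a = 3;
-- equivalent form produced by the transform:
SELECT * FROM t WHERE a = ANY ('{1,2,3}');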


--
regards,
Andrei Lepikhov
Postgres Professional
From 71746caae3eb0c771792b563fd53244fc1ac575b Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Thu, 23 Nov 2023 16:00:13 +0700
Subject: [PATCH] Transform OR clause to ANY expressions.

Replace (expr op C1) OR (expr op C2) ... with expr op ANY(C1, C2, ...) at the
preliminary stage of optimization, when we are still working with the tree
expression.
Here C is a constant expression, 'expr' is a non-constant expression, and 'op'
is an operator which returns a boolean result and has a commuter (for the case
of reversed order of the constant and non-constant parts of the expression,
like 'CX op expr').
Sometimes it can lead to a suboptimal plan. But we think it is better to have
an array of elements instead of a lot of OR clauses. There is room for further
optimization in decomposing that array into more optimal parts.
---
 .../postgres_fdw/expected/postgres_fdw.out|  16 +-
 src/backend/nodes/queryjumblefuncs.c  |  30 ++
 src/backend/parser/parse_expr.c   | 310 +-
 src/backend/utils/misc/guc_tables.c   |  11 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/bin/pg_dump/t/002_pg_dump.pl  |   6 +-
 src/include/nodes/queryjumble.h   |   1 +
 src/include/parser/parse_expr.h   |   1 +
 src/test/regress/expected/create_index.out| 156 -
 src/test/regress/expected/inherit.out |   2 +-
 src/test/regress/expected/join.out|  62 +++-
 src/test/regress/expected/partition_prune.out | 219 +++--
 src/test/regress/expected/rules.out   |   4 +-
 src/test/regress/expected/stats_ext.out   |  12 +-
 src/test/regress/expected/sysviews.out|   3 +-
 src/test/regress/expected/tidscan.out |  23 +-
 src/test/regress/sql/create_index.sql |  35 ++
 src/test/regress/sql/join.sql |  10 +
 src/test/regress/sql/partition_prune.sql  |  22 ++
 src/test/regress/sql/tidscan.sql  |   6 +
 src/tools/pgindent/typedefs.list  |   2 +
 21 files changed, 862 insertions(+), 70 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out 
b/contrib/postgres_fdw/expected/postgres_fdw.out
index 0a5bdf8bcc..ff69b2cd3b 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1349,7 +1349,7 @@ SELECT t1.c1, t1.c2, t2.c1, t2.c2 FROM ft4 t1 LEFT JOIN 
(SELECT * FROM ft5 WHERE
  Foreign Scan
Output: t1.c1, t1.c2, ft5.c1, ft5.c2
Relations: (public.ft4 t1) LEFT JOIN (public.ft5)
-   Remote SQL: SELECT r1.c1, r1.c2, r4.c1, r4.c2 FROM ("S 1"."T 3" r1 LEFT 
JOIN "S 1"."T 4" r4 ON (((r1.c1 = r4.c1)) AND ((r4.c1 < 10 WHERE (((r4.c1 < 
10) OR (r4.c1 IS NULL))) AND ((r1.c1 < 10))
+   Remote SQL: SELECT r1.c1, r1.c2, r4.c1, r4.c2 FROM ("S 1"."T 3" r1 LEFT 
JOIN "S 1"."T 4" r4 ON (((r1.c1 = r4.c1)) AND ((r4.c1 < 10 WHERE (((r4.c1 
IS NULL) OR (r4.c1 < 10))) AND ((r1.c1 < 10))
 (4 rows)
 
 SELECT t1.c1, t1.c2, t2.c1, t2.c2 FROM ft4 t1 LEFT JOIN (SELECT * FROM ft5 
WHERE c1 < 10) t2 ON (t1.c1 = t2.c1)
@@ -3112,7 +3112,7 @@ select array_agg(distinct (t1.c1)%5) from ft4 t1 full 
join ft5 t2 on (t1.c1 = t2
  Foreign Scan
Output: (array_agg(DISTINCT (t1.c1 % 5))), ((t2.c1 % 3))
Relations: Aggregate on ((public.ft4 t1) FULL JOIN (public.ft5 t2))
-   Remote SQL: SELECT array_agg(DISTINCT (r1.c1 % 5)), (r2.c1 % 3) FROM ("S 
1"."T 3" r1 FULL JOIN "S 1"."T 4" r2 ON (((r1.c1 = r2.c1 WHERE (((r1.c1 < 
20) OR ((r1.c1 IS NULL) AND (r2.c1 < 5 GROUP BY 2 ORDER BY 
array_agg(DISTINCT (r1.c1 % 5)) ASC NULLS LAST
+   Remote SQL: SELECT array_agg(DISTINCT (r1.c1 % 5)), (r2.c1 % 3) FROM ("S 
1"."T 3" r1 FULL JOIN "S 1"."T 4" r2 ON (((r1.c1 = r2.c1 WHERE r1.c1 IS 
NULL) AND (r2.c1 < 5)) OR (r1.c1 < 20))) GROUP BY 2 ORDER BY array_agg(DISTINCT 
(r1.c1 % 5)) ASC NULLS LAST
 (4 rows)
 
 select array_agg(distinct (t1.c1)%5) from ft4 t1 full join ft5 t2 on (t1.c1 = 
t2.c1) where t1.c1 < 20 or (t1.c1 is null and t2.c1 < 5) group by (t2.c1)%3 
order by 1;
@@ -3130,7 +3130,7 @@ select array_agg(distinct (t1.c1)%5 order by (t1.c1)%5) 
from ft4 t1 full join ft
  Foreign Scan
Output: (array_agg(DISTINCT (t1.c1 % 5) ORDER BY (t1.c1 % 5))), ((t2.c1 % 
3))
Relations: Aggregate on ((public.ft4 t1) FULL JOIN (public.ft5 t2))
-   Remote SQL: SELECT array_agg(DISTINCT (r1.c1 % 5) ORDER BY ((r1.c1 % 5)) 
ASC NULLS LAST), (r2.c1 % 3) FROM ("S 1"."T 3" r1 FULL JOIN "S 1"."T 4" r2 ON 
(((r1.c1 = r2.c1 WHERE (((r1.c1 < 20) OR ((r1.c1 IS NULL) AND (r2.c1 < 
5 GROUP BY 2 ORDER BY array_agg(DISTINCT (r1.c1 % 5) ORDER BY ((r1.c1 % 5)) 
ASC NULLS LAST) ASC NUL

backtrace_on_internal_error

2023-12-05 Thread Peter Eisentraut
We have backtrace support for server errors.  You can activate that 
either by setting backtrace_functions or by explicitly attaching 
errbacktrace() to an ereport() call.


I would like an additional mode that essentially triggers a backtrace 
anytime elog() (for internal errors) is called.  This is not well 
covered by backtrace_functions, because there are many equally-worded 
low-level errors in many functions.  And if you find out where the error 
is, then you need to manually rewrite the elog() to ereport() to attach 
the errbacktrace(), which is annoying.  Having a backtrace automatically 
on every elog() call would be very helpful during development for 
various kinds of common errors from palloc, syscache, node support, etc.


I think the implementation would be very simple, something like

diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 6aeb855e49..45d40abe92 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -498,9 +498,11 @@ errfinish(const char *filename, int lineno, const char *funcname)


/* Collect backtrace, if enabled and we didn't already */
if (!edata->backtrace &&
-   edata->funcname &&
-   backtrace_functions &&
-   matches_backtrace_functions(edata->funcname))
+   ((edata->funcname &&
+ backtrace_functions &&
+ matches_backtrace_functions(edata->funcname)) ||
+(edata->sqlerrcode == ERRCODE_INTERNAL_ERROR &&
+ backtrace_on_internal_error)))
set_backtrace(edata, 2);

/*

where backtrace_on_internal_error would be a GUC variable.
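
For illustration, the GUC itself would presumably be a plain boolean entry
in guc_tables.c, something like the sketch below (the category and flags
here are guesses, not settled):

    /* in ConfigureNamesBool[], src/backend/utils/misc/guc_tables.c */
    {
        {"backtrace_on_internal_error", PGC_SUSET, DEVELOPER_OPTIONS,
            gettext_noop("Logs a backtrace for any error with SQLSTATE XX000 (internal error)."),
            NULL,
            GUC_NOT_IN_SAMPLE
        },
        &backtrace_on_internal_error,
        false,
        NULL, NULL, NULL
    },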

Would others find this useful?  Any other settings or variants in this 
area that should be considered while we're here?





UBSan pointer overflow in xlogreader.c

2023-12-05 Thread Thomas Munro
Hi,

xlogreader.c has a pointer overflow bug, as revealed by the
combination of -fsanitize=undefined -m32, the new 039_end_of_wal.pl
test and Robert's incremental backup patch[1].  The bad code tests
whether an object could fit using something like base + size <= end,
which can be converted to something like size <= end - base to avoid
the overflow.  See experimental fix patch, attached.
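
To illustrate the rewrite (hypothetical variable names, not the actual hunk):

    /* before: "base + size" can overflow, which is undefined behavior
     * for pointer arithmetic */
    if (base + size <= end)
        /* record fits */ ;

    /* after: both pointers are into the same buffer, so "end - base"
     * is well defined, and we compare sizes instead */
    if (size <= (size_t) (end - base))
        /* record fits */ ;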

I took a while to follow up because I wanted to understand exactly why
it doesn't break in practice despite being that way since v15.  I
think I have it now:

1.  In the case of a bad/garbage size at end-of-WAL, the
following-page checks will fail first before anything bad happens as a
result of the overflow.

2.  In the case of a real oversized record on current 64 bit
architectures including amd64, aarch64, power and riscv64, the pointer
can't really overflow anyway because the virtual address space is < 64
bit, typically around 48, and record lengths are 32 bit.

3.  In the case of the 32 bit kernels I looked at, including Linux,
FreeBSD and cousins, Solaris and Windows, the top 1GB plus a bit more
of virtual address space is reserved for system use*, so I think a
real oversized record shouldn't be able to overflow the pointer there
either.

A 64 bit kernel running a 32 bit process could run into trouble,
though :-(.  Those don't need to block out that high memory segment.
You'd need to have the WAL buffer in that address range and decode
large enough real WAL records and then things could break badly.  I
guess 32/64 configurations must be rare these days outside developers
testing 32 bit code, and that is what happened here (ie CI); and with
some minor tweaks to the test it can be reached without Robert's patch
too.  There may of course be other more exotic systems that could
break, but I don't know specifically what.

TLDR; this is a horrible bug, but all damage seems to be averted on
"normal" systems.  The best thing I can say about all this is that the
new test found a bug, and the fix seems straightforward.  I will study
and test this some more, but wanted to share what I have so far.

(*I think the old 32 bit macOS kernels might have been an exception to
this pattern but 32 bit kernels and even processes are history there.)
[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKGLCuW_CE9nDAbUNV40G2FkpY_kcPZkaORyVBVive8FQHQ%40mail.gmail.com#d0d00ca5cc3f756656466adc9f2ec186


0001-Fix-pointer-overflow-in-xlogreader.c.patch
Description: Binary data


Re: Synchronizing slots from primary to standby

2023-12-05 Thread Amit Kapila
On Tue, Dec 5, 2023 at 10:38 AM shveta malik  wrote:
>
> On Mon, Dec 4, 2023 at 10:07 PM Drouvot, Bertrand
>  wrote:
> >
> > >
> > >> ~~~
> > >> 4. primary_slot_name GUC value test:
> > >>
> > >> When standby is started with a non-existing primary_slot_name, the
> > >> wal-receiver gives an error but the slot-sync worker does not raise
> > >> any error/warning. It is no-op though as it has a check 'if
> > >> (XLogRecPtrIsInvalid(WalRcv->latestWalEnd)) do nothing'.   Is this
> > >> okay or shall the slot-sync worker too raise an error and exit?
> > >>
> > >> In another case, when standby is started with valid primary_slot_name,
> > >> but it is changed to some invalid value in runtime, then walreceiver
> > >> starts giving error but the slot-sync worker keeps on running. In this
> > >> case, unlike the previous case, it even did not go to no-op mode (as
> > >> it sees valid WalRcv->latestWalEnd from the earlier run) and keep
> > >> pinging primary repeatedly for slots.  Shall here it should error out
> > >> or at least be no-op until we give a valid primary_slot_name?
> > >>
> > >
> >
> > Nice catch, thanks!
> >
> > > I reviewed it. There is no way to test the existence/validity of
> > > 'primary_slot_name' on standby without making a connection to primary.
> > > If primary_slot_name is invalid from the start, slot-sync worker will
> > > be no-op (as you tested) as WalRcv->latestWalEnd will be invalid, and
> > > if 'primary_slot_name' is changed to invalid on runtime, slot-sync
> > > worker will still keep on pinging primary. But that should be okay (in
> > > fact needed) as it needs to sync at-least the previous slot's
> > > positions (in case it is delayed in doing so for some reason earlier).
> > > And once the slots are up-to-date on standby, even if worker pings
> > > primary, it will not see any change in slots lsns and thus go for
> > > longer nap. I think, it is not worth the effort to introduce the
> > > complexity of checking validity of 'primary_slot_name' on primary from
> > > standby for this rare scenario.
> > >
> >
> > Maybe another option could be to have the walreceiver a way to let the slot 
> > sync
> > worker knows that it (the walreceiver) was not able to start due to non 
> > existing
> > replication slot on the primary? (that way we'd avoid the slot sync worker 
> > having
> > to talk to the primary).
>
> Few points:
> 1) I think if we do it, we should do it in generic way i.e. slotsync
> worker should go to no-op if walreceiver is not able to start due to
> any reason and not only due to invalid primary_slot_name.
> 2) Secondly, slotsync worker needs to make sure it has synced the
> slots so far i.e. worker should not go to no-op immediately on seeing
> missing WalRcv process if there are pending slots to be synced.
>

Won't it be better to just ping and check the validity of
'primary_slot_name' at the start of slot-sync and if it is changed
anytime? I think it would be better to avoid adding a dependency on
the walreceiver state, as that sounds like needless complexity.

-- 
With Regards,
Amit Kapila.




Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

2023-12-05 Thread Tomas Vondra
On 12/5/23 08:14, Shlok Kyal wrote:
> Hi,
> 
>> As for the test results, I very much doubt the differences are not
>> caused simply by random timing variations, or something like that. And I
>> don't understand what "Performance Machine Linux" is, considering those
>> timings are slower than the other two machines.
> 
> The machine has Total Memory of 755.536 GB, 120 CPUs and RHEL 7 Operating 
> System
> Also find the detailed info of the performance machine attached.
> 

Thanks for the info. I don't think the tests really benefit from this
much resources, I would be rather surprised if it was faster beyond 8
cores or so. The CPU frequency likely matters much more. Which probably
explains why this machine was the slowest.

Also, I wonder how much the results vary between the runs. I suppose you
only did a single run for each, right?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2023-12-05 Thread Amit Kapila
On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra
 wrote:
>
> Thanks for the script. Are you also measuring the time it takes to
> decode this using test_decoding?
>
> FWIW I did more comprehensive suite of tests over the weekend, with a
> couple more variations. I'm attaching the updated scripts, running it
> should be as simple as
>
>   ./run.sh BRANCH TRANSACTIONS RUNS
>
> so perhaps
>
>   ./run.sh master 1000 3
>
> to do 3 runs with 1000 transactions per client. And it'll run a bunch of
> combinations hard-coded in the script, and write the timings into a CSV
> file (with "master" in each row).
>
> I did this on two machines (i5 with 4 cores, xeon with 16/32 cores). I
> did this with current master, the basic patch (without the 0002 part),
> and then with the optimized approach (single global hash table, see the
> 0004 part). That's what master / patched / optimized in the results is.
>
> Interestingly enough, the i5 handled this much faster, it seems to be
> better in single-core tasks. The xeon is still running, so the results
> for "optimized" only have one run (out of 3), but shouldn't change much.
>
> Attached is also a table summarizing this, and visualizing the timing
> change (vs. master) in the last couple columns. Green is "faster" than
> master (but we don't really expect that), and "red" means slower than
> master (the more red, the slower).
>
> The results are grouped by script (see the attached .tgz), with either
> 32 or 96 clients (which does affect the timing, but not between master
> and patch). Some executions have no pg_sleep() calls, some have 0.001
> wait (but that doesn't seem to make much difference).
>
> Overall, I'd group the results into about three groups:
>
> 1) good cases [nextval, nextval-40, nextval-abort]
>
> These are cases that slow down a bit, but the slowdown is mostly within
> reasonable bounds (we're making the decoding do more stuff, so it'd
> be a bit silly to require that extra work to make no impact). And I do
> think this is reasonable, because this is pretty much an extreme / worst
> case behavior. People don't really do just nextval() calls, without
> doing anything else. Not to mention doing aborts for 100% transactions.
>
> So in practice this is going to be within noise (and in those cases the
> results even show speedup, which seems a bit surprising). It's somewhat
> dependent on CPU too - on xeon there's hardly any regression.
>
>
> 2) nextval-40-abort
>
> Here the slowdown is clear, but I'd argue it generally falls in the same
> group as (1). Yes, I'd be happier if it didn't behave like this, but if
> someone can show me a practical workload affected by this ...
>
>
> 3) irrelevant cases [all the alters taking insane amounts of time]
>
> I absolutely refuse to care about these extreme cases where decoding
> 100k transactions takes 5-10 minutes (on i5), or up to 30 minutes (on
> xeon). If this was a problem for some practical workload, we'd have
> already heard about it I guess. And even if there was such workload, it
> wouldn't be up to this patch to fix that. There's clearly something
> misbehaving in the snapshot builder.
>
>
> I was hopeful the global hash table would be an improvement, but that
> doesn't seem to be the case. I haven't done much profiling yet, but I'd
> guess most of the overhead is due to ReorderBufferQueueSequence()
> starting and aborting a transaction in the non-transactional case. Which
> is unfortunate, but I don't know if there's a way to optimize that.
>

Before discussing the alternative ideas you shared, let me try to
clarify my understanding so that we are on the same page. I see two
observations based on the testing and discussion we had: (a) for
non-transactional cases, the overhead observed is mainly due to
starting/aborting a transaction for each change; (b) for transactional
cases, we see overhead due to traversing all the top-level txns and
checking the hash table for each one to find whether the change is
transactional.

Am I missing something?

-- 
With Regards,
Amit Kapila.




Re: Avoid detoast overhead when possible

2023-12-05 Thread Andy Fan


Hi

Nikita Malakhov  writes:
>
> With your setup (table created with setup.sql):

You need to "set jit to off" to turn on this feature, as I stated in [1]
and [2].

[1] https://www.postgresql.org/message-id/87ttoyihgm.fsf%40163.com
[2] https://www.postgresql.org/message-id/877cltvxgt.fsf%40163.com

-- 
Best Regards
Andy Fan





Make attstattarget nullable

2023-12-05 Thread Peter Eisentraut
In [0] it was discussed that we could make attstattarget a nullable 
column, instead of always storing an explicit -1 default value for most 
columns.  This patch implements this.


This changes the pg_attribute field attstattarget into a nullable field 
in the variable-length part of the row.  If no value is set by the user 
for attstattarget, it is now null instead of previously -1.  This saves 
space in pg_attribute and tuple descriptors for most practical 
scenarios.  (ATTRIBUTE_FIXED_PART_SIZE is reduced from 108 to 104.) 
Also, null is the semantically more correct value.


The ANALYZE code internally continues to represent the default 
statistics target by -1, so that that code can avoid having to deal with 
null values.  But that is now contained to ANALYZE code.  The DDL code 
deals with attstattarget possibly null.


For system columns, the field is now always null but the effective value 
0 (don't analyze) is assumed.


To set a column's statistics target to the default value, the new 
command form ALTER TABLE ... SET STATISTICS DEFAULT can be used.  (SET 
STATISTICS -1 still works.)
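
For example, the command forms would behave like this (an illustrative
session, assuming a table t with a column c):

ALTER TABLE t ALTER COLUMN c SET STATISTICS 500;      -- explicit per-column target
ALTER TABLE t ALTER COLUMN c SET STATISTICS DEFAULT;  -- new form; stores null
ALTER TABLE t ALTER COLUMN c SET STATISTICS -1;       -- still accepted, same effect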



[0]: 
https://www.postgresql.org/message-id/flat/d07ffc2b-e0e8-77f7-38fb-be921dff71af%40enterprisedb.com
From 174d51a66c80a3284ad97b347a286a7b65a2a440 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 5 Dec 2023 13:43:30 +0100
Subject: [PATCH v1] Make attstattarget nullable

This changes the pg_attribute field attstattarget into a nullable
field in the variable-length part of the row.  If no value is set by
the user for attstattarget, it is now null instead of previously -1.
This saves space in pg_attribute and tuple descriptors for most
practical scenarios.  (ATTRIBUTE_FIXED_PART_SIZE is reduced from 108
to 104.)  Also, null is the semantically more correct value.

The ANALYZE code internally continues to represent the default
statistics target by -1, so that that code can avoid having to deal
with null values.  But that is now contained to ANALYZE code.  The DDL
code deals with attstattarget possibly null.

For system columns, the field is now always null but the effective
value 0 (don't analyze) is assumed.

To set a column's statistics target to the default value, the new
command form ALTER TABLE ... SET STATISTICS DEFAULT can be used.  (SET
STATISTICS -1 still works.)

TODO: catversion
---
 doc/src/sgml/ref/alter_table.sgml  |  4 +-
 src/backend/access/common/tupdesc.c|  4 --
 src/backend/bootstrap/bootstrap.c  |  1 -
 src/backend/catalog/genbki.pl  |  1 -
 src/backend/catalog/heap.c | 25 ++--
 src/backend/catalog/index.c| 23 ---
 src/backend/commands/analyze.c |  7 +++-
 src/backend/commands/tablecmds.c   | 44 +-
 src/backend/parser/gram.y  | 18 ++---
 src/backend/utils/cache/lsyscache.c| 17 +++--
 src/bin/pg_dump/pg_dump.c  |  7 +++-
 src/include/catalog/pg_attribute.h | 16 
 src/include/commands/vacuum.h  |  2 +-
 src/test/regress/expected/create_index.out |  4 +-
 14 files changed, 113 insertions(+), 60 deletions(-)

diff --git a/doc/src/sgml/ref/alter_table.sgml 
b/doc/src/sgml/ref/alter_table.sgml
index e1d207bc60..9d637157eb 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -50,7 +50,7 @@
 ALTER [ COLUMN ] column_name 
ADD GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( 
sequence_options ) ]
 ALTER [ COLUMN ] column_name 
{ SET GENERATED { ALWAYS | BY DEFAULT } | SET 
sequence_option | RESTART [ [ WITH ] restart ] } [...]
 ALTER [ COLUMN ] column_name 
DROP IDENTITY [ IF EXISTS ]
-ALTER [ COLUMN ] column_name 
SET STATISTICS integer
+ALTER [ COLUMN ] column_name 
SET STATISTICS { integer | DEFAULT 
}
 ALTER [ COLUMN ] column_name 
SET ( attribute_option = 
value [, ... ] )
 ALTER [ COLUMN ] column_name 
RESET ( attribute_option [, ... ] )
 ALTER [ COLUMN ] column_name 
SET STORAGE { PLAIN | EXTERNAL | EXTENDED | MAIN | DEFAULT }
@@ -317,7 +317,7 @@ Description
   sets the per-column statistics-gathering target for subsequent
   ANALYZE operations.
   The target can be set in the range 0 to 10000; alternatively, set it
-  to -1 to revert to using the system default statistics
+  to DEFAULT to revert to using the system default 
statistics
   target ().
   For more information on the use of statistics by the
   PostgreSQL query planner, refer to
diff --git a/src/backend/access/common/tupdesc.c 
b/src/backend/access/common/tupdesc.c
index 8826519e5e..054ccff1e2 100644
--- a/src/backend/access/common/tupdesc.c
+++ b/src/backend/access/common/tupdesc.c
@@ -453,8 +453,6 @@ equalTupleDescs(TupleDesc tupdesc1, TupleDesc tupdesc2)
return false;
if (attr1->atttypid != attr2->atttypid)
return false;
-   if (attr1->attstattarget

Re: table inheritance versus column compression and storage settings

2023-12-05 Thread Peter Eisentraut

On 05.12.23 05:26, Ashutosh Bapat wrote:

- When inheriting from multiple parents with different settings, an
explicit setting in the child is required.

When no explicit setting for the child is specified, it will throw an
error as it does today. Right?


Yes, it would throw an error, but a different error than today, saying 
something like "the settings in the parents conflict, so you need to 
specify one here to override the conflict".
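
For instance, with column compression the proposed behavior would look like
this (an illustrative sketch; the exact error wording is not settled):

CREATE TABLE parent1 (a text COMPRESSION pglz);
CREATE TABLE parent2 (a text COMPRESSION lz4);

-- error: the parents' settings conflict and the child has no explicit setting
CREATE TABLE child (a text) INHERITS (parent1, parent2);

-- ok: the child overrides the conflict explicitly
CREATE TABLE child (a text COMPRESSION lz4) INHERITS (parent1, parent2);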






Re: Synchronizing slots from primary to standby

2023-12-05 Thread Drouvot, Bertrand

Hi,

On 12/5/23 11:29 AM, shveta malik wrote:

On Tue, Dec 5, 2023 at 2:18 PM Drouvot, Bertrand
 wrote:

Wouldn't it make sense to move it once we are sure that
walrcv_startstreaming() returns true and first_stream is true, here?

"
if (first_stream)
+   {
        ereport(LOG,
                (errmsg("started streaming WAL from primary at %X/%X on timeline %u",
                        LSN_FORMAT_ARGS(startpoint), startpointTLI)));
+   SpinLockAcquire(&walrcv->mutex);
+   walrcv->walRcvState = WALRCV_STREAMING;
+   SpinLockRelease(&walrcv->mutex);
+   }
"



Yes, it makes sense and is the basis for current slot-sync worker
changes being discussed.


I think this change deserves its own dedicated thread and patch, does
that make sense?

If so, I'll submit one.



2) and 3) look good to me but with a check on walrcv->walRcvState
looking for the WALRCV_STREAMING state instead of looking for a non-null
WalRcv->pid.


Yes. But I think the worker should enter no-op when walRcvState is
WALRCV_STOPPED, and not when walRcvState != WALRCV_STREAMING, as it is
okay to have WALRCV_WAITING/STARTING/RESTARTING. But the worker should
exit no-op only when it finds walRcvState switched back to
WALRCV_STREAMING.



Yeah, fully agree.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2023-12-05 Thread Drouvot, Bertrand

Hi,

On 12/5/23 12:32 PM, Amit Kapila wrote:

On Tue, Dec 5, 2023 at 10:38 AM shveta malik  wrote:


On Mon, Dec 4, 2023 at 10:07 PM Drouvot, Bertrand
 wrote:




Maybe another option could be to have the walreceiver provide a way to let
the slot sync worker know that it (the walreceiver) was not able to start
due to a non-existing replication slot on the primary? (That way we'd avoid
the slot sync worker having to talk to the primary.)


Few points:
1) I think if we do it, we should do it in a generic way, i.e. the slotsync
worker should go to no-op if the walreceiver is not able to start for any
reason, and not only due to an invalid primary_slot_name.
2) Secondly, the slotsync worker needs to make sure it has synced the
slots so far, i.e. the worker should not go to no-op immediately on seeing
a missing WalRcv process if there are pending slots to be synced.



Won't it be better to just ping and check the validity of
'primary_slot_name' at the start of slot-sync and if it is changed
anytime? I think it would be better to avoid adding dependency on
walreciever state as that sounds like needless complexity.


I think the overall extra complexity is linked to the fact that we first
want to ensure that the slots are in sync before shutting down the
sync slot worker.

I think that talking to the primary or relying on the walreceiver state
is "just" what would trigger the decision to shut down the sync slot worker.

Relying on the walreceiver state looks better to me (as it avoids possibly
useless round trips with the primary).

Also the walreceiver could be down for multiple reasons, and I think there
is no point in having a sync slot worker running if the slots are in sync and
there is no walreceiver running (even if primary_slot_name is a valid one).

That said, I'm also ok with the "ping primary" approach if others have another
point of view and find it better.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Andrew Dunstan



On 2023-12-04 Mo 17:55, Davin Shearer wrote:
Sorry about the top posting / top quoting... the link you sent me 
gives me a 404.  I'm not exactly sure what top quoting / posting means 
and Googling those terms wasn't helpful for me, but I've removed the 
quoting that my mail client is automatically "helpfully" adding to my 
emails.  I mean no offense.



Hmm. Luckily the Wayback Machine has a copy: 



Maybe I'll put a copy in the developer wiki.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Postgres Partitions Limitations (5.11.2.3)

2023-12-05 Thread Ashutosh Bapat
On Tue, Dec 5, 2023 at 1:40 AM Laurenz Albe  wrote:
>
> On Fri, 2023-12-01 at 18:49 +0530, Ashutosh Bapat wrote:
> > On Thu, Nov 30, 2023 at 10:29 PM Laurenz Albe  
> > wrote:
> > >
> > > On Thu, 2023-11-30 at 19:22 +0530, Ashutosh Bapat wrote:
> > > > May be attach the patch to hackers thread (this) as well?
> > >
> > > If you want, sure.  I thought it was good enough if the thread
> > > is accessible via the commitfest app.
> >
> > The addition is long enough that it deserved to be outside of parentheses.
> >
> > I think it's worth mentioning the exception but in a way that avoids
> > repeating what's mentioned in the last paragraph of just the previous
> > section. I don't have brilliant ideas about how to rephrase it.
> >
> > Maybe "Using ONLY to add or drop a constraint, other than PRIMARY and
> > UNIQUE, on only the partitioned table is supported as long as there
> > are no partitions. ...".
>
> I agree that the parenthesis is too long.  I shortened it in the attached
> patch. Is that acceptable?

It's still longer than the actual sentence :). I am fine with it if
somebody else finds it acceptable.

-- 
Best Wishes,
Ashutosh Bapat




Re: introduce dynamic shared memory registry

2023-12-05 Thread Joe Conway

On 12/4/23 22:46, Nathan Bossart wrote:

Every once in a while, I find myself wanting to use shared memory in a
loadable module without requiring it to be loaded at server start via
shared_preload_libraries.  The DSM API offers a nice way to create and
manage dynamic shared memory segments, so creating a segment after server
start is easy enough.  However, AFAICT there's no easy way to teach other
backends about the segment without storing the handles in shared memory,
which puts us right back at square one.

The attached 0001 introduces a "DSM registry" to solve this problem.  The
API provides an easy way to allocate/initialize a segment or to attach to
an existing one.  The registry itself is just a dshash table that stores
the handles keyed by a module-specified string.  0002 adds a test for the
registry that demonstrates basic usage.

I don't presently have any concrete plans to use this for anything, but I
thought it might be useful for extensions for caching, etc. and wanted to
see whether there was any interest in the feature.


Notwithstanding any dragons there may be, and not having actually looked
at the patches, I love the concept! +1
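
Roughly, usage of such a registry might look like this (an illustrative
sketch only; the type and function names below are made up, not necessarily
the actual API from the patches):

typedef struct MyModuleState
{
    slock_t     mutex;
    int         counter;
} MyModuleState;

static MyModuleState *state = NULL;

static void
mymodule_init(void *ptr)
{
    MyModuleState *s = (MyModuleState *) ptr;

    /* runs only in the backend that first creates the segment */
    SpinLockInit(&s->mutex);
    s->counter = 0;
}

static void
mymodule_attach(void)
{
    bool        found;

    /* creates and initializes the segment on first use, attaches afterwards */
    state = dsm_registry_init_or_attach("mymodule", sizeof(MyModuleState),
                                        mymodule_init, &found);
}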


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Detecting some cases of missing backup_label

2023-12-05 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> I recently mentioned to Robert (and also Heikki earlier), that I think I see a
> way to detect an omitted backup_label in a relevant subset of the cases (it'd
> apply to the pg_control as well, if we moved to that).  Robert encouraged me
> to share the idea, even though it does not provide complete protection.

That would certainly be nice.

> The subset I think we can address is the following:
> 
> a) An omitted backup_label would lead to corruption, i.e. without the
>backup_label we won't start recovery at the right position. Obviously it'd
>be better to also catch a wrong procedure when it'd not cause corruption -
>perhaps my idea can be extended to handle that, with a small bit of
>overhead.
> 
> b) The backup has been taken from a primary. Unfortunately that probably can't
>be addressed - but the vast majority of backups are taken from a primary,
>so I think it's still a worthwhile protection.

Agreed that this is a worthwhile set to try and address, even if we
can't address other cases.

> Here's my approach
> 
> 1) We add a XLOG_BACKUP_START WAL record when starting a base backup on a
>primary, emitted just *after* the checkpoint completed
> 
> 2) When replaying a base backup start record, we create a state file that
>includes the corresponding LSN in the filename
> 
> 3) On the primary, the state file for XLOG_BACKUP_START is *not* created at
>that time. Instead the state file is created during pg_backup_stop().
> 
> 4) When replaying a XLOG_BACKUP_END record, we verify that the state file
>created by XLOG_BACKUP_START is present, and error out if not.  Backups
>that started before the redo LSN from backup_label are ignored
>(necessitates remembering that LSN, but we've been discussing that anyway).
> 
> 
> Because the backup state file on the primary is only created during
> pg_backup_stop(), a copy of the data directory taken between pg_backup_start()
> and pg_backup_stop() does *not* contain the corresponding "backup state
> file". Because of this, an omitted backup_label is detected if recovery does
> not start early enough - recovery won't encounter the XLOG_BACKUP_START record
> and thus would not create the state file, leading to an error in 4).

While I see the idea here, I think, doesn't it end up being an issue if
things happen like this:

pg_backup_start -> XLOG_BACKUP_START WAL written -> new checkpoint
happens -> pg_backup_stop -> XLOG_BACKUP_STOP WAL written -> crash

In that scenario, we'd go back to the new checkpoint (the one *after*
the checkpoint that happened before we wrote XLOG_BACKUP_START), start
replaying, and then hit the XLOG_BACKUP_STOP and then error out, right?
Even though we're actually doing crash recovery and everything should be
fine as long as we replay all of the WAL.

Perhaps we can make the pg_backup_stop and(/or?) the writing out of
XLOG_BACKUP_STOP wait until just before the next checkpoint and
hopefully minimize that window ... but I'm not sure if we could make
that window zero and what happens if someone does end up hitting it?
Doesn't seem like there's any way around it, which seems like it might
be a problem.  I suppose it wouldn't be hard to add some option to tell
PG to ignore the XLOG_BACKUP_STOP ... but then that's akin to removing
backup_label which lands us possibly back into the issue of people
mis-using that.

> It is not a problem that the primary does not create the state file before the
> pg_backup_stop() - if the primary crashes before pg_backup_stop(), there is no
> XLOG_BACKUP_END and thus no error will be raised.  It's a bit odd that the
> sequence differs between normal processing and recovery, but I think that's
> nothing a good comment couldn't explain.

Right, crashing before pg_backup_stop() is fine, but crashing *after*
would be an issue, I think, as outlined above, until the next checkpoint
completes, so we've moved the window but not eliminated it.

> I haven't worked out the details, but I think we might be able to extend this to
> catch errors even if there is no checkpoint during the base backup, by
> emitting the WAL record *before* the RequestCheckpoint(), and creating the
> corresponding state file during backup_label processing at the start of
> recovery.  That'd probably make the logic for when we can remove the backup
> state files a bit more complicated, but I think we could deal with that.

Not entirely following this - are you meaning that we might be able to
make something here work in the case where we don't have
pg_backup_start() wait for a checkpoint to happen (which I have some
serious doubts about?), or are you saying that the above doesn't work
unless there's at least one post-pg_backup_start() checkpoint?  I don't
immediately see why that would be the case though.  Also, if we wrote
out the XLOG_BACKUP_START before the checkpoint that we start replay
from and instead move that logic to backup_label processing ... 

Re: introduce dynamic shared memory registry

2023-12-05 Thread Robert Haas
On Tue, Dec 5, 2023 at 10:35 AM Joe Conway  wrote:
> Notwithstanding any dragons there may be, and not having actually looked
> at the patches, I love the concept! +

Seems fine to me too. I haven't looked at the patches or searched for
dragons either, though.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }

2023-12-05 Thread John Naylor
On Tue, Dec 5, 2023 at 7:55 AM Jeff Davis  wrote:
>
> On Tue, 2023-11-21 at 09:24 -0500, Robert Haas wrote:
> > As to the second, could we somehow come
> > up with an API for guc.c where you can ask for an opaque handle that
> > will later allow you to do a fast-SET of a GUC?
>
> Yes, attached. That provides a significant speedup: my test goes from
> around ~7300ms to ~6800ms.
>
> This doesn't seem very controversial or complex, so I'll probably
> commit this soon unless someone else has a comment.

+ * set_config_option_ext: sets option with the given handle to the given
+ * value.

Copy-paste-o of the other function name.

+config_handle *
+get_config_handle(const char *name)
+{
+ struct config_generic *record;
+
+ record = find_option(name, false, false, 0);
+ if (record == NULL)
+ return 0;

Part of this code this was copied from a function that returned int,
but this one returns a pointer.
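
Presumably the intent was something like this (sketch):

    config_handle *
    get_config_handle(const char *name)
    {
        struct config_generic *record;

        record = find_option(name, false, false, 0);
        if (record == NULL)
            return NULL;        /* a pointer, not the int 0 */

        return (config_handle *) record;
    }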




Re: logical decoding and replication of sequences, take 2

2023-12-05 Thread Tomas Vondra
On 12/5/23 13:17, Amit Kapila wrote:
> ...
>> I was hopeful the global hash table would be an improvement, but that
>> doesn't seem to be the case. I haven't done much profiling yet, but I'd
>> guess most of the overhead is due to ReorderBufferQueueSequence()
> starting and aborting a transaction in the non-transactional case. Which
>> is unfortunate, but I don't know if there's a way to optimize that.
>>
> 
> Before discussing the alternative ideas you shared, let me try to
> clarify my understanding so that we are on the same page. I see two
> observations based on the testing and discussion we had (a) for
> non-transactional cases, the overhead observed is mainly due to
> starting/aborting a transaction for each change;

Yes, I believe that's true. See the attached profiles for nextval.sql
and nextval-40.sql from master and optimized build (with the global
hash), and also a perf-diff. I only include the top 1000 lines for each
profile, that should be enough.

master - current master without patches applied
optimized - master + sequence decoding with global hash table

For nextval, there's almost no difference in the profile. Decoding the
other changes (inserts) is the dominant part, as we only log sequences
every 32 increments.

For nextval-40, the main increase is likely due to this part

  |--11.09%--seq_decode
  | |
  | |--9.25%--ReorderBufferQueueSequence
  | | |
  | | |--3.56%--AbortCurrentTransaction
  | | ||
  | | | --3.53%--AbortSubTransaction
  | | ||
  | | ||--0.95%--AtSubAbort_Portals
  | | ||  |
  | | ||   --0.83%--hash_seq_search
  | | ||
  | | | --0.83%--ResourceOwnerReleaseInternal
  | | |
  | | |--2.06%--BeginInternalSubTransaction
  | | |  |
  | | |   --1.10%--CommitTransactionCommand
  | | | |
  | | |  --1.07%--StartSubTransaction
  | | |
  | | |--1.28%--CleanupSubTransaction
  | | |  |
  | | |   --0.64%--AtSubCleanup_Portals
  | | | |
  | | |  --0.55%--hash_seq_search
  | | |
  | |  --0.67%--RelidByRelfilenumber

So yeah, that's the transaction stuff in ReorderBufferQueueSequence.
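
For reference, the shape of that non-transactional path is roughly the
following (a simplified sketch consistent with the profile, not the
exact code):

    /* per non-transactional sequence change -- simplified */
    BeginInternalSubTransaction("sequence decode");
    /* look up the sequence relation, build the tuple and invoke the
     * output plugin's sequence callback here */
    AbortCurrentTransaction(); /* AbortSubTransaction etc., as above */

i.e. a full subtransaction start/abort per change, which is what shows
up under seq_decode above.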

There's also a perf-diff, comparing individual functions.

> (b) for transactional
> cases, we see overhead due to traversing all the top-level txns and
> checking the hash table for each one to find whether the change is
> transactional.
> 

Not really, no. As I explained in my preceding e-mail, this check makes
almost no difference - I did expect it to matter, but it doesn't. And I
was a bit disappointed the global hash table didn't move the needle.

Most of the time is spent in

78.81% 0.00%  postgres  postgres  [.] DecodeCommit (inlined)
  |
  ---DecodeCommit (inlined)
 |
 |--72.65%--SnapBuildCommitTxn
 | |
 |  --72.61%--SnapBuildBuildSnapshot
 ||
 | --72.09%--pg_qsort
 ||
 ||--66.24%--pg_qsort
 ||  |

And there's almost no difference between master and build with sequence
decoding - see the attached diff-alter-sequence.perf, comparing the two
branches (perf diff -c delta-abs).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:
  alter-sequence-master.perf.gz
  alter-sequence-optimized.perf.gz
  diff-alter-sequence.perf.gz
  diff-nextval.perf.gz
  diff-nextval-40.perf.gz
  nextval-40-master.perf.gz
  nextval-40-optimized.perf.gz
  nextval-master.perf.gz
  nextval-optimized.perf.gz


Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Joe Conway

On 12/4/23 21:54, Joe Conway wrote:

On 12/4/23 17:55, Davin Shearer wrote:

There are however a few characters that need to be escaped



 1. |"|(double quote)
 2. |\|(backslash)
 3. |/|(forward slash)
 4. |\b|(backspace)
 5. |\f|(form feed)
 6. |\n|(new line)
 7. |\r|(carriage return)
 8. |\t|(horizontal tab)

These characters should be represented in the test cases to see how the 
escaping behaves and to ensure that the escaping is done properly per 
JSON requirements.


I can look at adding these as test cases.

So I did a quick check:
8<--
with t(f1) as
(
  values
(E'aaa\"bbb'::text),
(E'aaa\\bbb'::text),
(E'aaa\/bbb'::text),
(E'aaa\bbbb'::text),
(E'aaa\fbbb'::text),
(E'aaa\nbbb'::text),
(E'aaa\rbbb'::text),
(E'aaa\tbbb'::text)
)
select
  length(t.f1),
  t.f1,
  row_to_json(t)
from t;
 length | f1  |row_to_json
+-+---
  7 | aaa"bbb | {"f1":"aaa\"bbb"}
  7 | aaa\bbb | {"f1":"aaa\\bbb"}
  7 | aaa/bbb | {"f1":"aaa/bbb"}
  7 | aaa\x08bbb  | {"f1":"aaa\bbbb"}
  7 | aaa\x0Cbbb  | {"f1":"aaa\fbbb"}
  7 | aaa+| {"f1":"aaa\nbbb"}
| bbb |
  7 | aaa\rbbb| {"f1":"aaa\rbbb"}
  7 | aaa bbb | {"f1":"aaa\tbbb"}
(8 rows)

8<--

This is all independent of my patch for COPY TO. If I am reading that 
correctly, everything matches Davin's table *except* the forward slash 
("/"). I defer to the experts on the thread to debate that...


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Possible segfault when sending notification within a ProcessUtility hook

2023-12-05 Thread Anthonin Bonnefoy
Hi,

I've encountered the following segfault:

#0: 0x000104e821a8 postgres`list_head(l=0x7f7f7f7f7f7f7f7f) at
pg_list.h:130:17
#1: 0x000104e81c9c postgres`PreCommit_Notify at async.c:932:16
#2: 0x000104dd02f8 postgres`CommitTransaction at xact.c:2236:2
#3: 0x000104dcfc24 postgres`CommitTransactionCommand at xact.c:3061:4
#4: 0x00010528a880 postgres`finish_xact_command at postgres.c:2777:3
#5: 0x0001052883ac postgres`exec_simple_query(query_string="notify
test;") at postgres.c:1298:4

This happens when a transaction block fails and a ProcessUtility hook
sends a notification during the rollback command.

When a transaction block fails, it will enter in a TBLOCK_ABORT state,
waiting for a rollback. Calling rollback will switch to a
TBLOCK_ABORT_END state and will only go through CleanupTransaction.
If a hook sends a notification during the rollback command, a
notification will be queued but its content will be wiped when the
TopTransactionContext is destroyed.
Trying to send a notification immediately after will segfault in
PreCommit_Notify as pendingNotifies->events will be invalid.

There's a test_notify_rollback test module attached to the patch that reproduces
the issue.

Moving notification clean up from AbortTransaction to CleanupTransaction fixes
the issue as it will clear pendingActions in the same function that destroys the
TopTransactionContext.
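
The shape of the fix is roughly this (sketch only; AtCleanup_Notify is
an invented name for the relocated cleanup):

    /* xact.c, CleanupTransaction() -- sketch */
    static void
    CleanupTransaction(void)
    {
        ...
        AtCleanup_Notify();     /* invented name: resets pendingActions
                                 * and pendingNotifies */

        /* TopTransactionContext, which holds pendingNotifies, is
         * destroyed just below, so the list and its memory go away
         * together */
        AtCleanup_Memory();
        ...
    }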

Regards,
Anthonin


Attachment: v1-0001-Fix-segfault-when-notifying-in-a-ProcessUtility-h.patch


Add checks in pg_rewind to abort if backup_label file is present

2023-12-05 Thread Krishnakumar R
Hi,

Please find the attached patch for $subject and associated test. Please review.

--
Thanks and Regards,
Krishnakumar (KK).
[Microsoft]
From 80ad7293b57a2b346b0775b8f6e0e06198617154 Mon Sep 17 00:00:00 2001
From: "Krishnakumar R (KK)" 
Date: Tue, 5 Dec 2023 02:36:32 -0800
Subject: [PATCH v1] Add checks in pg_rewind to abort if backup_label file is
 found in either source or target cluster. Append pg_rewind TAP tests to verify
 the check.

---
 src/bin/pg_rewind/file_ops.c | 17 +
 src/bin/pg_rewind/file_ops.h |  2 +-
 src/bin/pg_rewind/libpq_source.c | 22 
 src/bin/pg_rewind/local_source.c | 14 
 src/bin/pg_rewind/pg_rewind.c|  6 +++-
 src/bin/pg_rewind/rewind_source.h|  5 +++
 src/bin/pg_rewind/t/001_basic.pl | 46 ++--
 src/test/perl/PostgreSQL/Test/Cluster.pm | 31 
 8 files changed, 138 insertions(+), 5 deletions(-)

diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index dd22685932..d4568e7d09 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -299,6 +299,23 @@ sync_target_dir(void)
 	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method);
 }
 
+/*
+ * Check if a file is present using the stat function. Return 'true'
+ * if present and 'false' if not.
+ */
+bool
+is_file_present(const char *datadir, const char *path)
+{
+	struct stat s;
+	char		fullpath[MAXPGPATH];
+
+	snprintf(fullpath, sizeof(fullpath), "%s/%s", datadir, path);
+
+	if (stat(fullpath, &s) < 0)
+		return false;
+
+	return true;
+}
 
 /*
  * Read a file into memory. The file to be read is /.
diff --git a/src/bin/pg_rewind/file_ops.h b/src/bin/pg_rewind/file_ops.h
index 427cf8e0b5..c063175fac 100644
--- a/src/bin/pg_rewind/file_ops.h
+++ b/src/bin/pg_rewind/file_ops.h
@@ -20,7 +20,7 @@ extern void truncate_target_file(const char *path, off_t newsize);
 extern void create_target(file_entry_t *entry);
 extern void remove_target(file_entry_t *entry);
 extern void sync_target_dir(void);
-
+extern bool is_file_present(const char *datadir, const char *path);
 extern char *slurpFile(const char *datadir, const char *path, size_t *filesize);
 
 typedef void (*process_file_callback_t) (const char *path, file_type_t type, size_t size, const char *link_target);
diff --git a/src/bin/pg_rewind/libpq_source.c b/src/bin/pg_rewind/libpq_source.c
index 417c74cfef..66ed6e156d 100644
--- a/src/bin/pg_rewind/libpq_source.c
+++ b/src/bin/pg_rewind/libpq_source.c
@@ -61,6 +61,7 @@ static void appendArrayEscapedString(StringInfo buf, const char *str);
 static void process_queued_fetch_requests(libpq_source *src);
 
 /* public interface functions */
+static bool libpq_is_file_present(rewind_source *source, const char *path);
 static void libpq_traverse_files(rewind_source *source,
  process_file_callback_t callback);
 static void libpq_queue_fetch_file(rewind_source *source, const char *path, size_t len);
@@ -87,6 +88,7 @@ init_libpq_source(PGconn *conn)
 
 	src = pg_malloc0(sizeof(libpq_source));
 
+	src->common.is_file_present = libpq_is_file_present;
 	src->common.traverse_files = libpq_traverse_files;
 	src->common.fetch_file = libpq_fetch_file;
 	src->common.queue_fetch_file = libpq_queue_fetch_file;
@@ -627,6 +629,26 @@ appendArrayEscapedString(StringInfo buf, const char *str)
 	appendStringInfoCharMacro(buf, '\"');
 }
 
+/*
+ * Check if a file is present using the connection to the
+ * database.
+ */
+static bool
+libpq_is_file_present(rewind_source *source, const char *path)
+{
+	PGconn	   *conn = ((libpq_source *) source)->conn;
+	PGresult   *res;
+	const char *paramValues[1];
+
+	paramValues[0] = path;
+	res = PQexecParams(conn, "SELECT pg_stat_file($1)",
+	   1, NULL, paramValues, NULL, NULL, 1);
+	if (PQresultStatus(res) != PGRES_TUPLES_OK)
+		return false;
+
+	return true;
+}
+
 /*
  * Fetch a single file as a malloc'd buffer.
  */
diff --git a/src/bin/pg_rewind/local_source.c b/src/bin/pg_rewind/local_source.c
index 9bd43cba74..0d6bf403b4 100644
--- a/src/bin/pg_rewind/local_source.c
+++ b/src/bin/pg_rewind/local_source.c
@@ -25,6 +25,7 @@ typedef struct
 	const char *datadir;		/* path to the source data directory */
 } local_source;
 
+static bool local_is_file_present(rewind_source *source, const char *path);
 static void local_traverse_files(rewind_source *source,
  process_file_callback_t callback);
 static char *local_fetch_file(rewind_source *source, const char *path,
@@ -43,6 +44,7 @@ init_local_source(const char *datadir)
 
 	src = pg_malloc0(sizeof(local_source));
 
+	src->common.is_file_present = local_is_file_present;
 	src->common.traverse_files = local_traverse_files;
 	src->common.fetch_file = local_fetch_file;
 	src->common.queue_fetch_file = local_queue_fetch_file;
@@ -62,6 +64,18 @@ local_traverse_files(rewind_source *source, process_file_callback_t callback)
 	traverse_datadir(((local_source *) source)->datadir, callback);
 }
 
+

Re: backtrace_on_internal_error

2023-12-05 Thread Nathan Bossart
On Tue, Dec 05, 2023 at 11:55:05AM +0100, Peter Eisentraut wrote:
> Would others find this useful?

Yes.  I think I would use this pretty frequently.

> Any other settings or variants in this area
> that should be considered while we're here?

IMO it would be nice to have a way to turn on backtraces for everything, or
at least everything above a certain logging level.  That would primarily be
useful for when you don't know exactly which C function is producing the
error.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Davin Shearer
Thanks for the wayback machine link Andrew.  I read it, understood it, and
will comply.

Joe, those test cases look great and the outputs are the same as `jq`.

As for forward slashes being escaped, I found this:
https://stackoverflow.com/questions/1580647/json-why-are-forward-slashes-escaped
.

Forward slash escaping is optional, so not escaping forward slashes in
Postgres is okay. The important thing is that the software _reading_ JSON
interprets both '\/' and '/' as '/'.


Re: UBSan pointer overflow in xlogreader.c

2023-12-05 Thread Nathan Bossart
On Wed, Dec 06, 2023 at 12:03:53AM +1300, Thomas Munro wrote:
> xlogreader.c has a pointer overflow bug, as revealed by the
> combination of -fsanitize=undefined -m32, the new 039_end_of_wal.pl
> test and Robert's incremental backup patch[1].  The bad code tests
> whether an object could fit using something like base + size <= end,
> which can be converted to something like size <= end - base to avoid
> the overflow.  See experimental fix patch, attached.

The patch LGTM.  I wonder if it might be worth creating some special
pointer arithmetic routines (perhaps using the stuff in common/int.h) to
help prevent this sort of thing in the future.  But that'd require you to
realize that your code is at risk of overflow, at which point it's probably
just as easy to restructure the logic like you've done here.
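
For anyone skimming, the transformation is along these lines
(illustrative, not the exact xlogreader.c hunk):

    /* before: "base + size" can wrap past the end of the address space
     * with 32-bit pointers, which is undefined behavior */
    if (ptr + total_len <= end)
        /* record fits */ ;

    /* after: end >= ptr is known to hold here, so the subtraction is
     * well-defined and no overflow is possible */
    if (total_len <= (size_t) (end - ptr))
        /* record fits */ ;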

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: backtrace_on_internal_error

2023-12-05 Thread Robert Haas
On Tue, Dec 5, 2023 at 12:40 PM Nathan Bossart  wrote:
> On Tue, Dec 05, 2023 at 11:55:05AM +0100, Peter Eisentraut wrote:
> > Would others find this useful?
>
> Yes.  I think I would use this pretty frequently.

I think we should consider unconditionally emitting a backtrace when
an elog() is hit, instead of requiring a GUC. Or at least any elog()
that's not at a DEBUGn level. If the extra output annoys anybody, that
means they're regularly hitting an elog(), and it ought to be promoted
to ereport().

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Make COPY format extendable: Extract COPY TO format implementations

2023-12-05 Thread Nathan Bossart
On Mon, Dec 04, 2023 at 03:35:48PM +0900, Sutou Kouhei wrote:
> I want to work on making COPY format extendable. I attach
> the first patch for it. I'll send more patches after this is
> merged.

Given the current discussion about adding JSON, I think this could be a
nice bit of refactoring that could ultimately open the door to providing
other COPY formats via shared libraries.

> In other words, this is just a refactoring for further
> changes to make COPY format extendable. If I use "complete
> the task and then request reviews for it" approach, it will
> be difficult to review because changes for it will be
> large. So I want to work on this step by step. Is it
> acceptable?

I think it makes sense to do this part independently, but we should be
careful to design this with the follow-up tasks in mind.

> So I measured COPY TO time with/without this change. You can
> see there is no significant loss of performance.
> 
> Data: Random 32 bit integers:
> 
> CREATE TABLE data (int32 integer);
> INSERT INTO data
>   SELECT random() * 1
> FROM generate_series(1, ${n_records});

Seems encouraging.  I assume the performance concerns stem from the use of
function pointers.  Or was there something else?
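
For context, the refactoring boils down to a per-format table of
callbacks along these lines (names illustrative, not necessarily the
patch's):

    typedef struct CopyToFormatOps
    {
        void    (*start) (CopyToState cstate, TupleDesc tupDesc);
        void    (*one_row) (CopyToState cstate, TupleTableSlot *slot);
        void    (*end) (CopyToState cstate);
    } CopyToFormatOps;

The table would be chosen once at BeginCopyTo() time, so the per-row
cost is a single indirect call -- which is what the benchmark above is
exercising.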

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: backtrace_on_internal_error

2023-12-05 Thread Nathan Bossart
On Tue, Dec 05, 2023 at 01:16:22PM -0500, Robert Haas wrote:
> I think we should consider unconditionally emitting a backtrace when
> an elog() is hit, instead of requiring a GUC. Or at least any elog()
> that's not at a DEBUGn level. If the extra output annoys anybody, that
> means they're regularly hitting an elog(), and it ought to be promoted
> to ereport().

Perhaps this should be a GUC that defaults to LOG or ERROR.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: backtrace_on_internal_error

2023-12-05 Thread Robert Haas
On Tue, Dec 5, 2023 at 1:28 PM Nathan Bossart  wrote:
> On Tue, Dec 05, 2023 at 01:16:22PM -0500, Robert Haas wrote:
> > I think we should consider unconditionally emitting a backtrace when
> > an elog() is hit, instead of requiring a GUC. Or at least any elog()
> > that's not at a DEBUGn level. If the extra output annoys anybody, that
> > means they're regularly hitting an elog(), and it ought to be promoted
> > to ereport().
>
> Perhaps this should be a GUC that defaults to LOG or ERROR.

Why?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: psql not responding to SIGINT upon db reconnection

2023-12-05 Thread Tristan Partin

On Wed Nov 29, 2023 at 11:48 AM CST, Tristan Partin wrote:
I am not completely in love with the code I have written. Lots of 
conditional compilation which makes it hard to read. Looking forward to 
another round of review to see what y'all think.


Ok. Here is a patch which just uses select(2) with a timeout of 1s or 
pselect(2) if it is available. I also moved the state machine processing 
into its own function.


Thanks for your comments thus far.

--
Tristan Partin
Neon (https://neon.tech)
From b22f286d3733d6d5ec2a924682679f6884b3600c Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Mon, 24 Jul 2023 11:12:59 -0500
Subject: [PATCH v6] Allow SIGINT to cancel psql database reconnections

After installing the SIGINT handler in psql, SIGINT can no longer cancel
database reconnections. For instance, if the user starts a reconnection
and then needs to do some form of interaction (ie psql is polling),
there is no way to cancel the reconnection process currently.

Use PQconnectStartParams() in order to insert a CancelRequested check
into the polling loop.
---
 meson.build|   1 +
 src/bin/psql/command.c | 224 -
 src/include/pg_config.h.in |   3 +
 src/tools/msvc/Solution.pm |   1 +
 4 files changed, 228 insertions(+), 1 deletion(-)

diff --git a/meson.build b/meson.build
index ee58ee7a06..2d63485c53 100644
--- a/meson.build
+++ b/meson.build
@@ -2440,6 +2440,7 @@ func_checks = [
   ['posix_fadvise'],
   ['posix_fallocate'],
   ['ppoll'],
+  ['pselect'],
   ['pstat'],
   ['pthread_barrier_wait', {'dependencies': [thread_dep]}],
   ['pthread_is_threaded_np', {'dependencies': [thread_dep]}],
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 82cc091568..e87401b790 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -11,6 +11,11 @@
 #include 
 #include 
 #include 
+#if HAVE_POLL
+#include 
+#else
+#include 
+#endif
 #ifndef WIN32
 #include 			/* for stat() */
 #include 			/* for setitimer() */
@@ -40,6 +45,7 @@
 #include "large_obj.h"
 #include "libpq-fe.h"
 #include "libpq/pqcomm.h"
+#include "libpq/pqsignal.h"
 #include "mainloop.h"
 #include "portability/instr_time.h"
 #include "pqexpbuffer.h"
@@ -3251,6 +3257,169 @@ copy_previous_query(PQExpBuffer query_buf, PQExpBuffer previous_buf)
 	return false;
 }
 
+/*
+ * Check a file descriptor for read and/or write data, possibly waiting.
+ * If neither forRead nor forWrite are set, immediately return a timeout
+ * condition (without waiting).  Return >0 if condition is met, 0
+ * if a timeout occurred, -1 if an error or interrupt occurred.
+ *
+ * Timeout is infinite if end_time is -1.  Timeout is immediate (no blocking)
+ * if end_time is 0 (or indeed, any time before now).
+ */
+static int
+pqSocketPoll(int sock, int forRead, int forWrite, time_t end_time)
+{
+	/*
+	 * We use functions in the following order if available:
+	 *   - ppoll(2) OR pselect(2)
+	 *   - poll(2) OR select(2)
+	 */
+#ifdef HAVE_POLL
+	struct pollfd input_fd;
+#ifdef HAVE_PPOLL
+	int			rc;
+	sigset_t	emptyset;
+	sigset_t	blockset;
+	sigset_t	origset;
+	struct timespec timeout;
+	struct timespec *ptr_timeout;
+#else
+	int			timeout_ms;
+#endif
+
+	if (!forRead && !forWrite)
+		return 0;
+
+	input_fd.fd = sock;
+	input_fd.events = POLLERR;
+	input_fd.revents = 0;
+
+	if (forRead)
+		input_fd.events |= POLLIN;
+	if (forWrite)
+		input_fd.events |= POLLOUT;
+
+	/* Compute appropriate timeout interval */
+#ifdef HAVE_PPOLL
+	sigemptyset(&blockset);
+	sigaddset(&blockset, SIGINT);
+	sigprocmask(SIG_BLOCK, &blockset, &origset);
+
+	if (end_time == ((time_t) -1))
+		ptr_timeout = NULL;
+	else
+	{
+		timeout.tv_sec = end_time;
+		timeout.tv_nsec = 0;
+		ptr_timeout = &timeout;
+	}
+#else
+	if (end_time == ((time_t) -1))
+		timeout_ms = -1;
+	else
+	{
+		time_t		now = time(NULL);
+
+		if (end_time > now)
+			timeout_ms = (end_time - now) * 1000;
+		else
+			timeout_ms = 0;
+	}
+#endif
+
+#ifdef HAVE_PPOLL
+	sigemptyset(&emptyset);
+	if (cancel_pressed)
+	{
+		sigprocmask(SIG_SETMASK, &origset, NULL);
+		return 1;
+	}
+
+	rc = ppoll(&input_fd, 1, ptr_timeout, &emptyset);
+	sigprocmask(SIG_SETMASK, &origset, NULL);
+	return rc;
+#else
+	return poll(&input_fd, 1, timeout_ms);
+#endif
+#else			/* !HAVE_POLL */
+
+	fd_set		input_mask;
+	fd_set		output_mask;
+	fd_set		except_mask;
+#ifdef HAVE_PSELECT
+	int			rc;
+	sigset_t	emptyset;
+	sigset_t	blockset;
+	sigset_t	origset;
+	struct timespec timeout;
+	struct timespec *ptr_timeout;
+#else
+	struct timeval timeout;
+	struct timeval *ptr_timeout;
+#endif
+
+	if (!forRead && !forWrite)
+		return 0;
+
+	FD_ZERO(&input_mask);
+	FD_ZERO(&output_mask);
+	FD_ZERO(&except_mask);
+	if (forRead)
+		FD_SET(sock, &input_mask);
+
+	if (forWrite)
+		FD_SET(sock, &output_mask);
+	FD_SET(sock, &except_mask);
+
+	/* Compute appropriate timeout interval */
+#ifdef HAVE_PSELECT
+	sigemptyset(&blockset);
+	sigaddset(&blockset, SIGINT);
+	sigprocmask(SIG_BLOCK, &blockset, &origset);
+
+

Remove WIN32 conditional compilation from win32common.c

2023-12-05 Thread Tristan Partin
The file is only referenced in Meson and MSVC scripts from what I can 
tell, and the Meson reference is protected by a Windows check.


--
Tristan Partin
Neon (https://neon.tech)
From bc9ffc7b0b141959a4c2a3f8b731f457ff5825c1 Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Tue, 5 Dec 2023 10:18:37 -0600
Subject: [PATCH v1] Remove WIN32 conditional compilation from win32common.c

This file is only compiled when WIN32 is defined, so the #ifdef was
not needed.
---
 src/port/win32common.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/src/port/win32common.c b/src/port/win32common.c
index 2fd78f7f93..a075d01209 100644
--- a/src/port/win32common.c
+++ b/src/port/win32common.c
@@ -19,8 +19,6 @@
 #include "postgres.h"
 #endif
 
-#ifdef WIN32
-
 /*
  * pgwin32_get_file_type
  *
@@ -64,5 +62,3 @@ pgwin32_get_file_type(HANDLE hFile)
 
 	return fileType;
 }
-
-#endif			/* WIN32 */
-- 
Tristan Partin
Neon (https://neon.tech)



Re: backtrace_on_internal_error

2023-12-05 Thread Matthias van de Meent
On Tue, 5 Dec 2023 at 19:30, Robert Haas  wrote:
>
> On Tue, Dec 5, 2023 at 1:28 PM Nathan Bossart  
> wrote:
> > On Tue, Dec 05, 2023 at 01:16:22PM -0500, Robert Haas wrote:
> > > I think we should consider unconditionally emitting a backtrace when
> > > an elog() is hit, instead of requiring a GUC. Or at least any elog()
> > > that's not at a DEBUGn level. If the extra output annoys anybody, that
> > > means they're regularly hitting an elog(), and it ought to be promoted
> > > to ereport().
> >
> > Perhaps this should be a GUC that defaults to LOG or ERROR.
>
> Why?

I can't speak for Nathan, but my reason would be that I'm not in the
habit to attach a debugger to my program to keep track of state
progression, but instead use elog() during patch development. I'm not
super stoked for getting my developmental elog(LOG)-s spammed with
stack traces, so I'd want to set this at least to ERROR, while in
production LOG could be fine.

Similarly, there are probably extensions that do not use ereport()
directly, but instead use elog(), because of reasons like 'not
planning on doing translations' and 'elog() is the easier API'.
Forcing a change over to ereport because of stack trace spam in logs
caused by elog would be quite annoying.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Joe Conway

On 12/5/23 12:43, Davin Shearer wrote:

Joe, those test cases look great and the outputs are the same as `jq`.




Forward slash escaping is optional, so not escaping them in Postgres is 
okay. The important thing is that the software _reading_ JSON 
interprets both '\/' and '/' as '/'.


Thanks for the review and info. I modified the existing regression test 
thus:


8<--
create temp table copyjsontest (
id bigserial,
f1 text,
f2 timestamptz);

insert into copyjsontest
  select g.i,
 CASE WHEN g.i % 2 = 0 THEN
   'line with '' in it: ' || g.i::text
 ELSE
   'line with " in it: ' || g.i::text
 END,
 'Mon Feb 10 17:32:01 1997 PST'
  from generate_series(1,5) as g(i);

insert into copyjsontest (f1) values
(E'aaa\"bbb'::text),
(E'aaa\\bbb'::text),
(E'aaa\/bbb'::text),
(E'aaa\bbbb'::text),
(E'aaa\fbbb'::text),
(E'aaa\nbbb'::text),
(E'aaa\rbbb'::text),
(E'aaa\tbbb'::text);
copy copyjsontest to stdout json;
{"id":1,"f1":"line with \" in it: 1","f2":"1997-02-10T20:32:01-05:00"}
{"id":2,"f1":"line with ' in it: 2","f2":"1997-02-10T20:32:01-05:00"}
{"id":3,"f1":"line with \" in it: 3","f2":"1997-02-10T20:32:01-05:00"}
{"id":4,"f1":"line with ' in it: 4","f2":"1997-02-10T20:32:01-05:00"}
{"id":5,"f1":"line with \" in it: 5","f2":"1997-02-10T20:32:01-05:00"}
{"id":1,"f1":"aaa\"bbb","f2":null}
{"id":2,"f1":"aaa\\bbb","f2":null}
{"id":3,"f1":"aaa/bbb","f2":null}
{"id":4,"f1":"aaa\","f2":null}
{"id":5,"f1":"aaa\fbbb","f2":null}
{"id":6,"f1":"aaa\nbbb","f2":null}
{"id":7,"f1":"aaa\rbbb","f2":null}
{"id":8,"f1":"aaa\tbbb","f2":null}
8<--

I think the code, documentation, and tests are in pretty good shape at 
this point. Latest version attached.


Any other comments or complaints out there?


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Add json format mode to COPY TO

Add json format mode support to COPY TO, which includes three output
variations: 1) "json lines" which is each row as a json object delimited
by newlines (the default); 2) "json lines", except include comma delimiters
between json objects; and 3) "json array" which is the same as #2, but with
the addition of a leading "[" and trailing "]" to form a valid json array.

Early versions: helpful hints/reviews provided by Nathan Bossart,
Tom Lane, and Maciek Sakrejda. Final versions: reviewed by Andrew Dunstan
and Davin Shearer.

Requested-by: Davin Shearer
Author: Joe Conway
Reviewed-by: Andrew Dunstan, Davin Shearer 
Discussion: https://postgr.es/m/flat/24e3ee88-ec1e-421b-89ae-8a47ee0d2df1%40joeconway.com#a5e6b8829f9a74dfc835f6f29f2e44c5


diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 18ecc69..af8777b 100644
*** a/doc/src/sgml/ref/copy.sgml
--- b/doc/src/sgml/ref/copy.sgml
*** COPY { ta
*** 43,48 
--- 43,50 
  FORCE_QUOTE { ( column_name [, ...] ) | * }
  FORCE_NOT_NULL { ( column_name [, ...] ) | * }
  FORCE_NULL { ( column_name [, ...] ) | * }
+ FORCE_ARRAY [ boolean ]
+ FORCE_ROW_DELIMITER [ boolean ]
  ENCODING 'encoding_name'
  
   
*** COPY { ta
*** 206,214 
--- 208,221 
Selects the data format to be read or written:
text,
csv (Comma Separated Values),
+   json (JavaScript Object Notation),
or binary.
The default is text.
   
+  
+   The json option is allowed only in
+   COPY TO.
+  
  
 
  
*** COPY { ta
*** 372,377 
--- 379,410 
   
  
 
+ 
+
+ FORCE_ROW_DELIMITER
+ 
+  
+   Force output of commas as row delimiters, in addition to the usual
+   end of line characters. This option is allowed only in
+   COPY TO, and only when using
+   JSON format.
+   The default is false.
+  
+ 
+
+ 
+
+ FORCE_ARRAY
+ 
+  
+   Force output of array decorations at the beginning and end of output.
+   This option implies the FORCE_ROW_DELIMITER
+   option. It is allowed only in COPY TO, and only
+   when using JSON format.
+   The default is false.
+  
+ 
+
  
 
  ENCODING
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfad47b..0236a9e 100644
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
*** ProcessCopyOptions(ParseState *pstate,
*** 419,424 
--- 419,426 
  	bool		format_specified = false;
  	bool		freeze_specified = false;
  	bool		header_specified = false;
+ 	bool		force_row_delimiter_specified = false;
+ 	bool		force_array_specified = false;
  	ListCell   *option;
  
  	/* Support external use for option sanity checking */
*** ProcessCopyOptions(ParseState *pstate,
*** 443,448 
--- 445,452 
   /* default format */ ;
  			else if (strcmp(fmt, "csv") == 0)
  opts_out->csv_mode = true;
+ 			else if 

Re: collect_corrupt_items_vacuum.patch

2023-12-05 Thread Dmitry Koval

Hi!

I agree with Alexander Lakhin about PROC_IN_VACUUM and 
VISHORIZON_DATA_STRICT:
1) probably manipulations with the PROC_IN_VACUUM flag in 
pg_visibility.c were needed for condition [1] and can be removed now;
2) the VISHORIZON_DATA_STRICT macro is probably unnecessary too (since 
we are not going to use it in the GlobalVisHorizonKindForRel() function).


Also It would be nice to remove the get_strict_xid_horizon() function 
from the comment (replace to GetStrictOldestNonRemovableTransactionId()?).


[1] 
https://github.com/postgres/postgres/blob/4d0cf0b05defcee985d5af38cb0db2b9c2f8dbae/src/backend/storage/ipc/procarray.c#L1812


--
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com




Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Davin Shearer
Hi Joe,

In reviewing the 005 patch, I think that when used with FORCE ARRAY, we
should also _imply_ FORCE ROW DELIMITER.  I can't envision a use case where
someone would want to use FORCE ARRAY without also using FORCE ROW
DELIMITER.  I can, however, envision a use case where someone would want
FORCE ROW DELIMITER without FORCE ARRAY, like maybe including into a larger
array.  I definitely appreciate these options and the flexibility that they
afford from a user perspective.

In the test output, will you also show the different variations with FORCE
ARRAY and FORCE ROW DELIMITER => {(false, false), (true, false), (false,
true), (true, true)}?  Technically you've already shown me the (false,
false) case as those are the defaults.

Thanks!


Re: CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }

2023-12-05 Thread Jeff Davis
On Tue, 2023-12-05 at 23:22 +0700, John Naylor wrote:
> Copy-paste-o of the other function name.

...

> Part of this code this was copied from a function that returned int,
> but this one returns a pointer.

Thank you, fixed.

Also, I forward-declared config_generic in guc.h to eliminate the cast.

Regards,
Jeff Davis

From 72b00b1b094945845e4ea4d427e426eafd5650c2 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Mon, 4 Dec 2023 16:20:05 -0800
Subject: [PATCH v2] Cache opaque handle for GUC option to avoid repeated
 lookups.

When setting GUCs from proconfig, performance is important, and hash
lookups in the GUC table are significant.

Per suggestion from Robert Haas.

Discussion: https://postgr.es/m/ca+tgmoypkxhr3hod9syk2xwcauvpa0+ba0xpnwwbcyxtklk...@mail.gmail.com
---
 src/backend/utils/fmgr/fmgr.c | 32 --
 src/backend/utils/misc/guc.c  | 61 ---
 src/include/utils/guc.h   |  8 +
 3 files changed, 88 insertions(+), 13 deletions(-)

diff --git a/src/backend/utils/fmgr/fmgr.c b/src/backend/utils/fmgr/fmgr.c
index 9dfdf890c5..6aac698e4b 100644
--- a/src/backend/utils/fmgr/fmgr.c
+++ b/src/backend/utils/fmgr/fmgr.c
@@ -612,7 +612,7 @@ struct fmgr_security_definer_cache
 {
 	FmgrInfo	flinfo;			/* lookup info for target function */
 	Oid			userid;			/* userid to set, or InvalidOid */
-	List	   *configNames;	/* GUC names to set, or NIL */
+	List	   *configHandles;	/* GUC handles to set, or NIL */
 	List	   *configValues;	/* GUC values to set, or NIL */
 	Datum		arg;			/* passthrough argument for plugin modules */
 };
@@ -670,11 +670,25 @@ fmgr_security_definer(PG_FUNCTION_ARGS)
 		if (!isnull)
 		{
 			ArrayType  *array;
+			ListCell   *lc;
+			List	   *names;
 
 			oldcxt = MemoryContextSwitchTo(fcinfo->flinfo->fn_mcxt);
 			array = DatumGetArrayTypeP(datum);
-			TransformGUCArray(array, &fcache->configNames,
+			TransformGUCArray(array, &names,
 			  &fcache->configValues);
+
+			/* transform names to config handles to avoid lookup cost */
+			fcache->configHandles = NIL;
+			foreach(lc, names)
+			{
+char	   *name = (char *) lfirst(lc);
+
+fcache->configHandles = lappend(fcache->configHandles,
+get_config_handle(name));
+			}
+			list_free_deep(names);
+
 			MemoryContextSwitchTo(oldcxt);
 		}
 
@@ -687,7 +701,7 @@ fmgr_security_definer(PG_FUNCTION_ARGS)
 
 	/* GetUserIdAndSecContext is cheap enough that no harm in a wasted call */
 	GetUserIdAndSecContext(&save_userid, &save_sec_context);
-	if (fcache->configNames != NIL) /* Need a new GUC nesting level */
+	if (fcache->configHandles != NIL)	/* Need a new GUC nesting level */
 		save_nestlevel = NewGUCNestLevel();
 	else
 		save_nestlevel = 0;		/* keep compiler quiet */
@@ -696,17 +710,17 @@ fmgr_security_definer(PG_FUNCTION_ARGS)
 		SetUserIdAndSecContext(fcache->userid,
 			   save_sec_context | SECURITY_LOCAL_USERID_CHANGE);
 
-	forboth(lc1, fcache->configNames, lc2, fcache->configValues)
+	forboth(lc1, fcache->configHandles, lc2, fcache->configValues)
 	{
 		GucContext	context = superuser() ? PGC_SUSET : PGC_USERSET;
 		GucSource	source = PGC_S_SESSION;
 		GucAction	action = GUC_ACTION_SAVE;
-		char	   *name = lfirst(lc1);
+		config_handle *handle = lfirst(lc1);
 		char	   *value = lfirst(lc2);
 
-		(void) set_config_option(name, value,
- context, source,
- action, true, 0, false);
+		(void) set_config_with_handle(handle, value,
+	  context, source, GetUserId(),
+	  action, true, 0, false);
 	}
 
 	/* function manager hook */
@@ -749,7 +763,7 @@ fmgr_security_definer(PG_FUNCTION_ARGS)
 
 	fcinfo->flinfo = save_flinfo;
 
-	if (fcache->configNames != NIL)
+	if (fcache->configHandles != NIL)
 		AtEOXact_GUC(true, save_nestlevel);
 	if (OidIsValid(fcache->userid))
 		SetUserIdAndSecContext(save_userid, save_sec_context);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e76c083003..157e39d654 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3357,6 +3357,52 @@ set_config_option_ext(const char *name, const char *value,
 	  bool is_reload)
 {
 	struct config_generic *record;
+
+	if (elevel == 0)
+	{
+		if (source == PGC_S_DEFAULT || source == PGC_S_FILE)
+		{
+			/*
+			 * To avoid cluttering the log, only the postmaster bleats loudly
+			 * about problems with the config file.
+			 */
+			elevel = IsUnderPostmaster ? DEBUG3 : LOG;
+		}
+		else if (source == PGC_S_GLOBAL ||
+ source == PGC_S_DATABASE ||
+ source == PGC_S_USER ||
+ source == PGC_S_DATABASE_USER)
+			elevel = WARNING;
+		else
+			elevel = ERROR;
+	}
+
+	record = find_option(name, true, false, elevel);
+	if (record == NULL)
+		return 0;
+
+	return set_config_with_handle(record, value,
+  context, source, srole,
+  action, changeVal, elevel,
+  is_reload);
+}
+
+
+/*
+ * set_config_with_handle: sets option with the given handle to the given
+ * value.
+ *
+ * This should be us

Re: backtrace_on_internal_error

2023-12-05 Thread Robert Haas
On Tue, Dec 5, 2023 at 1:47 PM Matthias van de Meent
 wrote:
> I can't speak for Nathan, but my reason would be that I'm not in the
> habit to attach a debugger to my program to keep track of state
> progression, but instead use elog() during patch development. I'm not
> super stoked for getting my developmental elog(LOG)-s spammed with
> stack traces, so I'd want to set this at least to ERROR, while in
> production LOG could be fine.
>
> Similarly, there are probably extensions that do not use ereport()
> directly, but instead use elog(), because of reasons like 'not
> planning on doing translations' and 'elog() is the easier API'.
> Forcing a change over to ereport because of stack trace spam in logs
> caused by elog would be quite annoying.

That does seem like a fair complaint. But I also think it would be
really good if we had something that could be enabled unconditionally
instead of via a GUC... because if it's gated by a GUC then it often
won't be there when you need it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Possible segfault when sending notification within a ProcessUtility hook

2023-12-05 Thread Tom Lane
Anthonin Bonnefoy  writes:
> This happens when a transaction block fails and a ProcessUtility hook
> sends a notification during the rollback command.

Why should we regard that as anything other than a bug in the
ProcessUtility hook?  A failed transaction should not send any
notifies.

> Moving notification clean up from AbortTransaction to CleanupTransaction fixes
> the issue as it will clear pendingActions in the same function that destroys 
> the
> TopTransactionContext.

Maybe that would be okay, or maybe not, but I'm disinclined to
mess with it without a better argument for changing it.  It seems
like there would still be room to break things with mistimed
calls to async.c functions.

regards, tom lane




Re: backtrace_on_internal_error

2023-12-05 Thread Nathan Bossart
On Tue, Dec 05, 2023 at 07:47:25PM +0100, Matthias van de Meent wrote:
> On Tue, 5 Dec 2023 at 19:30, Robert Haas  wrote:
>> On Tue, Dec 5, 2023 at 1:28 PM Nathan Bossart  
>> wrote:
>> > Perhaps this should be a GUC that defaults to LOG or ERROR.
>>
>> Why?

Sorry, I should've explained why in my message.

> I can't speak for Nathan, but my reason would be that I'm not in the
> habit to attach a debugger to my program to keep track of state
> progression, but instead use elog() during patch development. I'm not
> super stoked for getting my developmental elog(LOG)-s spammed with
> stack traces, so I'd want to set this at least to ERROR, while in
> production LOG could be fine.
> 
> Similarly, there are probably extensions that do not use ereport()
> directly, but instead use elog(), because of reasons like 'not
> planning on doing translations' and 'elog() is the easier API'.
> Forcing a change over to ereport because of stack trace spam in logs
> caused by elog would be quite annoying.

My main concern was forcing extra logging that users won't have a great way
to turn off (except for maybe raising log_min_messages or something).
Also, it'd give us a way to slowly ramp up backtraces over a few years
without suddenly spamming everyones logs in v17.  For example, maybe this
GUC defaults to PANIC or FATAL in v17.  Once we've had a chance to address
any common backtraces there, we could bump it to ERROR or WARNING in v18.
And so on.  If we just flood everyone's logs immediately, I worry that
folks will just turn it off, and we won't get the reports we are hoping
for.

I know we already have so many GUCs and would like to avoid adding new ones
when possible.  Maybe this is one that could eventually be retired as we
gain confidence that it won't obliterate the log files.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: backtrace_on_internal_error

2023-12-05 Thread Tom Lane
Matthias van de Meent  writes:
> On Tue, 5 Dec 2023 at 19:30, Robert Haas  wrote:
>>> I think we should consider unconditionally emitting a backtrace when
>>> an elog() is hit, instead of requiring a GUC.

>> Perhaps this should be a GUC that defaults to LOG or ERROR.

> I can't speak for Nathan, but my reason would be that I'm not in the
> habit to attach a debugger to my program to keep track of state
> progression, but instead use elog() during patch development. I'm not
> super stoked for getting my developmental elog(LOG)-s spammed with
> stack traces, so I'd want to set this at least to ERROR, while in
> production LOG could be fine.

Yeah, I would not be happy either with elog(LOG) suddenly getting
10x more verbose.  I think it might be okay to unconditionally do this
when elevel >= ERROR, though.

(At the same time, I don't have a problem with the idea of a GUC
controlling the minimum elevel to cause the report.  Other people
might have other use-cases than I do.)
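
For concreteness, in errfinish() that could amount to something like
this (sketch; backtrace_min_level is an invented GUC name):

    /* unconditional at ERROR and above, optionally lower via a GUC */
    if (edata->elevel >= ERROR ||
        edata->elevel >= backtrace_min_level)
        set_backtrace(edata, 2);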

regards, tom lane




Re: Add checks in pg_rewind to abort if backup_label file is present

2023-12-05 Thread Heikki Linnakangas

On 05/12/2023 19:36, Krishnakumar R wrote:

Hi,

Please find the attached patch for $subject and associated test. Please review.


Thanks for picking up this long-standing TODO!


+/*
+ * Check if a file is present using the connection to the
+ * database.
+ */
+static bool
+libpq_is_file_present(rewind_source *source, const char *path)
+{
+   PGconn *conn = ((libpq_source *) source)->conn;
+   PGresult   *res;
+   const char *paramValues[1];
+
+   paramValues[0] = path;
+   res = PQexecParams(conn, "SELECT pg_stat_file($1)",
+  1, NULL, paramValues, NULL, NULL, 1);
+   if (PQresultStatus(res) != PGRES_TUPLES_OK)
+   return false;
+
+   return true;
+}


The backup_label file cannot be present when the server is running. No 
need to check for that when connected to a live server.



--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -729,7 +729,11 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 static void
 sanityChecks(void)
 {
-   /* TODO Check that there's no backup_label in either cluster */
+   if (source->is_file_present(source, "backup_label"))
+   pg_fatal("The backup_label file is present in source cluster");
+
+   if (is_file_present(datadir_target, "backup_label"))
+   pg_fatal("The backup_label file is present in target  cluster");
 
 	/* Check system_identifier match */

if (ControlFile_target.system_identifier != 
ControlFile_source.system_identifier)


The error message isn't very user friendly. It's pretty dangerous 
actually: I think a lot of users would just delete the backup_label file 
when they see that message. Because then the file is no longer present 
and problem solved, right?


The point of checking for backup_label is that if it's present, the 
cluster wasn't really shut down cleanly. The correct fix is to start it, 
let WAL recovery finish, and shut it down again. The error message 
should make that clear. Perhaps make it similar to the existing "target 
server must be shut down cleanly" message.
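
Something along these lines, perhaps (wording is only a sketch):

    if (is_file_present(datadir_target, "backup_label"))
        pg_fatal("target server appears to have been restored from a backup and has not completed recovery\n"
                 "Start the target server, let WAL recovery finish, and shut it down cleanly, then try again.");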


I think today if you try to run pg_rewind on a cluster that was restored 
from a backup, so that backup_label is present, you get the "target 
server must be shut down cleanly" message. But we don't have any tests 
for it. We do have a test for when the server is still running, but not 
for the restored-from-backup case. Would be nice to add one.


Perhaps move the backup_label check later in sanityChecks(), after 
checking the state in the control file. That way, you still normally hit 
the "target server must be shut down cleanly" case, and the backup_label 
check would be just an additional "can't happen" sanity check.


In createBackupLabel() we have this:


/* TODO: move old file out of the way, if any. */
open_target_file("backup_label", true); /* BACKUP_LABEL_FILE */
write_target_range(buf, 0, len);
close_target_file();


That TODO comment needs to go away now. And we probably should complain 
if the file already exists. With your patch, we already checked earlier 
that it doesn't exists, so if it exists when we reach that code, 
something's gone wrong.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: UBSan pointer overflow in xlogreader.c

2023-12-05 Thread Robert Haas
On Tue, Dec 5, 2023 at 1:04 PM Nathan Bossart  wrote:
> On Wed, Dec 06, 2023 at 12:03:53AM +1300, Thomas Munro wrote:
> > xlogreader.c has a pointer overflow bug, as revealed by the
> > combination of -fsanitize=undefined -m32, the new 039_end_of_wal.pl
> > test and Robert's incremental backup patch[1].  The bad code tests
> > whether an object could fit using something like base + size <= end,
> > which can be converted to something like size <= end - base to avoid
> > the overflow.  See experimental fix patch, attached.
>
> The patch LGTM.  I wonder if it might be worth creating some special
> pointer arithmetic routines (perhaps using the stuff in common/int.h) to
> help prevent this sort of thing in the future.  But that'd require you to
> realize that your code is at risk of overflow, at which point it's probably
> just as easy to restructure the logic like you've done here.

The patch LGTM, too. Thanks for investigating and writing the code.
The part about how the reserved kernel memory prevents the bug from
appearing on 32-bit systems but not 64-bit systems running in 32-bit
mode is pretty interesting -- I don't want to think about how long it
would have taken me to figure that out.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Andrew Dunstan



On 2023-12-05 Tu 14:50, Davin Shearer wrote:

Hi Joe,

In reviewing the 005 patch, I think that when used with FORCE ARRAY, 
we should also _imply_ FORCE ROW DELIMITER.  I can't envision a use 
case where someone would want to use FORCE ARRAY without also using 
FORCE ROW DELIMITER.  I can, however, envision a use case where 
someone would want FORCE ROW DELIMITER without FORCE ARRAY, like maybe 
including into a larger array.  I definitely appreciate these options 
and the flexibility that they afford from a user perspective.


In the test output, will you also show the different variations with 
FORCE ARRAY and FORCE ROW DELIMITER => {(false, false), (true, false), 
(false, true), (true, true)}?  Technically you've already shown me the 
(false, false) case as those are the defaults.





I don't understand the point of FORCE_ROW_DELIMITER at all. There is 
only one legal delimiter of array items in JSON, and that's a comma. 
There's no alternative and it's not optional. So in the array case you 
MUST have commas and in any other case (e.g. LINES) I can't see why you 
would have them.



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: UBSan pointer overflow in xlogreader.c

2023-12-05 Thread Nathan Bossart
On Tue, Dec 05, 2023 at 03:48:33PM -0500, Robert Haas wrote:
> The patch LGTM, too. Thanks for investigating and writing the code.
> The part about how the reserved kernel memory prevents the bug from
> appearing on 32-bit systems but not 64-bit systems running in 32-bit
> mode is pretty interesting -- I don't want to think about how long it
> would have taken me to figure that out.

+1

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Joe Conway

On 12/5/23 15:55, Andrew Dunstan wrote:


On 2023-12-05 Tu 14:50, Davin Shearer wrote:

Hi Joe,

In reviewing the 005 patch, I think that when used with FORCE ARRAY, 
we should also _imply_ FORCE ROW DELIMITER.  I can't envision a use 
case where someone would want to use FORCE ARRAY without also using 
FORCE ROW DELIMITER.  I can, however, envision a use case where 
someone would want FORCE ROW DELIMITER without FORCE ARRAY, like maybe 
including into a larger array.  I definitely appreciate these options 
and the flexibility that they afford from a user perspective.


In the test output, will you also show the different variations with 
FORCE ARRAY and FORCE ROW DELIMITER => {(false, false), (true, false), 
(false, true), (true, true)}?  Technically you've already shown me the 
(false, false) case as those are the defaults.





I don't understand the point of FORCE_ROW_DELIMITER at all. There is
only one legal delimiter of array items in JSON, and that's a comma.
There's no alternative and it's not optional. So in the array case you
MUST have commas and in any other case (e.g. LINES) I can't see why you
would have them.


The current patch already *does* imply row delimiters in the array case. 
It says so here:

8<---
+
+ FORCE_ARRAY
+ 
+  
+   Force output of array decorations at the beginning and end of 
output.

+   This option implies the FORCE_ROW_DELIMITER
+   option. It is allowed only in COPY TO, and only
+   when using JSON format.
+   The default is false.
+  
8<---

and it does so here:
8<---
+ if (opts_out->force_array)
+ opts_out->force_row_delimiter = true;
8<---

and it shows that here:
8<---
+ copy copytest to stdout (format json, force_array);
+ [
+  {"style":"DOS","test":"abc\r\ndef","filler":1}
+ ,{"style":"Unix","test":"abc\ndef","filler":2}
+ ,{"style":"Mac","test":"abc\rdef","filler":3}
+ ,{"style":"esc\\ape","test":"a\\r\\\r\\\n\\nb","filler":4}
+ ]
8<---

It also does not allow explicitly setting row delimiters false while 
force_array is true here:

8<---

+   if (opts_out->force_array &&
+   force_row_delimiter_specified &&
+   !opts_out->force_row_delimiter)
+   ereport(ERROR,
+   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ 	 errmsg("cannot specify FORCE_ROW_DELIMITER false with 
FORCE_ARRAY true")));

8<---

Am I understanding something incorrectly?

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Joe Conway

On 12/5/23 16:02, Joe Conway wrote:

On 12/5/23 15:55, Andrew Dunstan wrote:

and in any other case (e.g. LINES) I can't see why you
would have them.


Oh I didn't address this -- I saw examples in the interwebs of MSSQL 
server I think [1] which had the non-array with commas import and export 
style. It was not that tough to support and the code as written already 
does it, so why not?


[1] 
https://learn.microsoft.com/en-us/sql/relational-databases/json/remove-square-brackets-from-json-without-array-wrapper-option?view=sql-server-ver16#example-multiple-row-result



--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Andrew Dunstan



On 2023-12-05 Tu 16:02, Joe Conway wrote:

On 12/5/23 15:55, Andrew Dunstan wrote:


On 2023-12-05 Tu 14:50, Davin Shearer wrote:

Hi Joe,

In reviewing the 005 patch, I think that when used with FORCE ARRAY, 
we should also _imply_ FORCE ROW DELIMITER.  I can't envision a use 
case where someone would want to use FORCE ARRAY without also using 
FORCE ROW DELIMITER.  I can, however, envision a use case where 
someone would want FORCE ROW DELIMITER without FORCE ARRAY, like 
maybe including into a larger array.  I definitely appreciate these 
options and the flexibility that they afford from a user perspective.


In the test output, will you also show the different variations with 
FORCE ARRAY and FORCE ROW DELIMITER => {(false, false), (true, 
false), (false, true), (true, true)}? Technically you've already 
shown me the (false, false) case as those are the defaults.





I don't understand the point of FORCE_ROW_DELIMITER at all. There is
only one legal delimiter of array items in JSON, and that's a comma.
There's no alternative and it's not optional. So in the array case you
MUST have commas and in any other case (e.g. LINES) I can't see why you
would have them.


The current patch already *does* imply row delimiters in the array 
case. It says so here:

8<---
+    
+ FORCE_ARRAY
+ 
+  
+   Force output of array decorations at the beginning and end of 
output.

+   This option implies the FORCE_ROW_DELIMITER
+   option. It is allowed only in COPY TO, and 
only

+   when using JSON format.
+   The default is false.
+  
8<---

and it does so here:
8<---
+ if (opts_out->force_array)
+ opts_out->force_row_delimiter = true;
8<---

and it shows that here:
8<---
+ copy copytest to stdout (format json, force_array);
+ [
+  {"style":"DOS","test":"abc\r\ndef","filler":1}
+ ,{"style":"Unix","test":"abc\ndef","filler":2}
+ ,{"style":"Mac","test":"abc\rdef","filler":3}
+ ,{"style":"esc\\ape","test":"a\\r\\\r\\\n\\nb","filler":4}
+ ]
8<---

It also does not allow explicitly setting row delimiters false while 
force_array is true here:

8<---

+ if (opts_out->force_array &&
+ force_row_delimiter_specified &&
+ !opts_out->force_row_delimiter)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+  errmsg("cannot specify FORCE_ROW_DELIMITER 
false with FORCE_ARRAY true")));

8<---

Am I understanding something incorrectly?



But what's the point of having it if you're not using FORCE_ARRAY?


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Joe Conway

On 12/5/23 16:12, Andrew Dunstan wrote:


On 2023-12-05 Tu 16:02, Joe Conway wrote:

On 12/5/23 15:55, Andrew Dunstan wrote:


On 2023-12-05 Tu 14:50, Davin Shearer wrote:

Hi Joe,

In reviewing the 005 patch, I think that when used with FORCE ARRAY, 
we should also _imply_ FORCE ROW DELIMITER.  I can't envision a use 
case where someone would want to use FORCE ARRAY without also using 
FORCE ROW DELIMITER.  I can, however, envision a use case where 
someone would want FORCE ROW DELIMITER without FORCE ARRAY, like 
maybe including into a larger array.  I definitely appreciate these 
options and the flexibility that they afford from a user perspective.


In the test output, will you also show the different variations with 
FORCE ARRAY and FORCE ROW DELIMITER => {(false, false), (true, 
false), (false, true), (true, true)}? Technically you've already 
shown me the (false, false) case as those are the defaults.





I don't understand the point of FORCE_ROW_DELIMITER at all. There is
only one legal delimiter of array items in JSON, and that's a comma.
There's no alternative and it's not optional. So in the array case you
MUST have commas and in any other case (e.g. LINES) I can't see why you
would have them.


The current patch already *does* imply row delimiters in the array 
case. It says so here:

8<---
+    
+ FORCE_ARRAY
+ 
+  
+   Force output of array decorations at the beginning and end of 
output.

+   This option implies the FORCE_ROW_DELIMITER
+   option. It is allowed only in COPY TO, and 
only

+   when using JSON format.
+   The default is false.
+  
8<---

and it does so here:
8<---
+ if (opts_out->force_array)
+ opts_out->force_row_delimiter = true;
8<---

and it shows that here:
8<---
+ copy copytest to stdout (format json, force_array);
+ [
+  {"style":"DOS","test":"abc\r\ndef","filler":1}
+ ,{"style":"Unix","test":"abc\ndef","filler":2}
+ ,{"style":"Mac","test":"abc\rdef","filler":3}
+ ,{"style":"esc\\ape","test":"a\\r\\\r\\\n\\nb","filler":4}
+ ]
8<---

It also does not allow explicitly setting row delimiters false while 
force_array is true here:

8<---

+ if (opts_out->force_array &&
+ force_row_delimiter_specified &&
+ !opts_out->force_row_delimiter)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+  errmsg("cannot specify FORCE_ROW_DELIMITER 
false with FORCE_ARRAY true")));

8<---

Am I understanding something incorrectly?



But what's the point of having it if you're not using FORCE_ARRAY?



See the follow up email -- other databases support it so why not? It 
seems to be a thing...


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Andrew Dunstan



On 2023-12-05 Tu 16:09, Joe Conway wrote:

On 12/5/23 16:02, Joe Conway wrote:

On 12/5/23 15:55, Andrew Dunstan wrote:

and in any other case (e.g. LINES) I can't see why you
would have them.


Oh I didn't address this -- I saw examples in the interwebs of MSSQL 
server I think [1] which had the non-array with commas import and 
export style. It was not that tough to support and the code as written 
already does it, so why not?


[1] 
https://learn.microsoft.com/en-us/sql/relational-databases/json/remove-square-brackets-from-json-without-array-wrapper-option?view=sql-server-ver16#example-multiple-row-result





That seems quite absurd, TBH. I know we've catered for some absurdity in 
the CSV code (much of it down to me), so maybe we need to be liberal in 
what we accept here too. IMNSHO, we should produce either a single JSON 
document (the ARRAY case) or a series of JSON documents, one per row 
(the LINES case).
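
To make the two shapes concrete (illustrative only, reusing the option
syntax from the patch):

copy copytest to stdout (format json, force_array);
-- [
--  {"style":"DOS","test":"abc\r\ndef","filler":1}
-- ,{"style":"Unix","test":"abc\ndef","filler":2}
-- ]

copy copytest to stdout (format json);
-- {"style":"DOS","test":"abc\r\ndef","filler":1}
-- {"style":"Unix","test":"abc\ndef","filler":2}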



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-05 Thread Masahiko Sawada
On Mon, Dec 4, 2023 at 5:21 PM John Naylor  wrote:
>
> On Mon, Nov 27, 2023 at 1:45 PM Masahiko Sawada  wrote:
>
> > Since the variable-length values support is a big deal and would be
> > related to API design I'd like to discuss the API design first.
>
> Thanks for the fine summary of the issues here.
>
> [Swapping this back in my head]
>
> > RT_VALUE_TYPE
> > RT_GET(RT_RADIX_TREE *tree, uint64 key, bool *found);
> > or for variable-length value support,
> > RT_GET(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);
> >
> > If an entry already exists, return its pointer and set "found" to
> > true. Otherwise, insert an empty value with sz bytes, return its
> > pointer, and set "found" to false.
>
> > ---
> > bool
> > RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
> > or for variable-length value support,
> > RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, size_t sz);
> >
> > If an entry already exists, update its value to 'value_p' and return
> > true. Otherwise set the value and return false.
>
> I'd have to double-check, but I think RT_SET is vestigial and I'm not
> sure it has any advantage over RT_GET as I've sketched it out. I'm
> pretty sure it's only there now because changing the radix tree
> regression tests is much harder than changing TID store.

Agreed.

>
> > Given variable-length value support, RT_GET() would have to do
> > repalloc() if the existing value size is not big enough for the new
> > value, but it cannot as the radix tree doesn't know the size of each
> > stored value.
>
> I think we have two choices:
>
> - the value stores the "length". The caller would need to specify a
> function to compute size from the "length" member. Note this assumes
> there is an array. I think both aspects are not great.
> - the value stores the "size". Callers that store an array (as
> PageTableEntry's do) would compute length when they need to. This
> sounds easier.

As for the second idea, do we always need to require the value to have
the "size" (e.g. int32) in the first field of its struct? If so, the
caller will be able to use only 4 bytes in embedded value cases (or
won't be able to use it at all if the pointer size is 4 bytes).
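
For illustration, the "size in the first field" convention would look
roughly like this (hypothetical struct, not from the patch; the tree
would read only the leading field):

typedef struct VarlenValue
{
    uint32      size;   /* total bytes, including this header */
    char        data[FLEXIBLE_ARRAY_MEMBER];    /* caller-defined payload */
} VarlenValue;

With values embedded directly in a 64-bit slot, that header leaves only
4 bytes of payload, which is the concern above.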

>
> > Another idea is that the radix tree returns the pointer
> > to the slot and the caller updates the value accordingly.
>
> I did exactly this in v43 TidStore if I understood you correctly. If I
> misunderstood you, can you clarify?

I meant to expose RT_GET_SLOT_RECURSIVE() so that the caller updates
the value as they want.

>
> > But it means
> > that the caller has to update the slot properly while considering the
> > value size (embedded vs. single-leaf value), which seems not a good
> > idea.
>
> For this optimization, callers will have to know about pointer-sized
> values and treat them differently, but they don't need to know the
> details about how or where they are stored.
>
> While we want to keep embedded values in the back of our minds, I
> really think the details should be postponed to a follow-up commit.

Agreed.

>
> > To deal with this problem, I think we can somewhat change RT_GET() API
> > as follow:
> >
> > RT_VALUE_TYPE
> > RT_INSERT(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);
> >
> > If the entry already exists, replace the value with a new empty value
> > with sz bytes and set "found" to true. Otherwise, insert an empty
> > value, return its pointer, and set "found" to false.
> >
> > We probably will find a better name but I use RT_INSERT() for
> > discussion. RT_INSERT() returns an empty slot regardless of existing
> > values. It can be used to insert a new value or to replace the value
> > with a larger value.
>
> For the case we are discussing, bitmaps, updating an existing value is
> a bit tricky. We need the existing value to properly update it with
> set or unset bits. This can't work in general without a lot of work
> for the caller.

True.

>
> However, for vacuum, we have all values that we need up front. That
> gives me an idea: Something like this insert API could be optimized
> for "insert-only": If we only free values when we free the whole tree
> at the end, that's a clear use case for David Rowley's proposed "bump
> context", which would save 8 bytes per allocation and be a bit faster.
> [1] (RT_GET for varlen values would use an aset context, to allow
> repalloc, and nodes would continue to use slab).

Interesting idea, and worth trying. Do we need to protect the whole
tree as insert-only for safety? It would be problematic if the user
mixed RT_INSERT() and RT_GET().

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Joe Conway

On 12/5/23 16:20, Andrew Dunstan wrote:

On 2023-12-05 Tu 16:09, Joe Conway wrote:

On 12/5/23 16:02, Joe Conway wrote:

On 12/5/23 15:55, Andrew Dunstan wrote:

and in any other case (e.g. LINES) I can't see why you
would have them.


Oh I didn't address this -- I saw examples in the interwebs of MSSQL 
server I think [1] which had the non-array with commas import and 
export style. It was not that tough to support and the code as written 
already does it, so why not?


That seems quite absurd, TBH. I know we've catered for some absurdity in
the CSV code (much of it down to me), so maybe we need to be liberal in
what we accept here too. IMNSHO, we should produce either a single JSON
document (the ARRAY case) or a series of JSON documents, one per row
(the LINES case).



So your preference would be to not allow the non-array-with-commas case 
but if/when we implement COPY FROM we would accept that format? As in 
Postel's law ("be conservative in what you do, be liberal in what you 
accept from others")?


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: [PATCH] plpython function causes server panic

2023-12-05 Thread Tom Lane
Hao Zhang  writes:
>> The only readily-reachable error case in BeginInternalSubTransaction
>> is this specific one about IsInParallelMode, which was added later
>> than the original design and evidently with not a lot of thought or
>> testing.  The comment for it speculates about whether we could get
>> rid of it, so I wonder if our thoughts about this ought to go in that
>> direction.

> IMHO, there are other error reports in the function
> BeginInternalSubTransaction(), like

Sure, but all the other ones are extremely hard to hit, which is why
we didn't bother to worry about them to begin with.  If we want to
make this more formally bulletproof, my inclination would be to
(a) get rid of the IsInParallelMode restriction and then (b) turn
the function into a critical section, so that any other error gets
treated as a PANIC.  Maybe at some point we'd be willing to make a
variant of BeginInternalSubTransaction that has a different API and
can manage such cases without a PANIC, but that seems far down the
road to me, and certainly not something to be back-patched.
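
Mechanically, (b) would amount to something like this minimal sketch
(illustrative only, not a committed change -- inside a critical section
any ERROR is promoted to PANIC, though a real patch would have to
consider what is legal inside one, e.g. memory allocation):

/* sketch: escalate subtransaction-startup errors to PANIC */
START_CRIT_SECTION();
BeginInternalSubTransaction(NULL);
END_CRIT_SECTION();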

The main reason for my caution here is that, by catching an error
and allowing Python (or Perl, or something else) code to decide
what to do next, we are very dependent on that code doing the right
thing.  This is already a bit of a leap of faith for run-of-the-mill
errors.  For errors in transaction startup or shutdown, I think it's
a bigger leap than I care to make.  We're pretty well hosed if we
can't make the transaction machinery work, so imagining that we can
clean up after such an error and march merrily onwards seems mighty
optimistic.

regards, tom lane




Re: [PATCH] ltree hash functions

2023-12-05 Thread Tommy Pavlicek
Thanks.

I've attached the latest version that updates the naming in line with
the convention.

On Mon, Dec 4, 2023 at 12:46 AM jian he  wrote:
>
> On Fri, Dec 1, 2023 at 8:44 AM Tommy Pavlicek  wrote:
> >
> >
> > Patch updated for those comments (and a touch of cleanup in the tests) 
> > attached.
>
> hash_ltree would be a better name than ltree_hash; similar logic
> applies to ltree_hash_extended.
> That would be the convention. See: 
> https://stackoverflow.com/a/69650940/15603477
>
>
> Other than that, it looks good.


0001-ltree-hash-v4.patch
Description: Binary data


Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Davin Shearer
> Am I understanding something incorrectly?

No, you've got it.  You already covered the concerns there.

> That seems quite absurd, TBH. I know we've catered for some absurdity in
> the CSV code (much of it down to me), so maybe we need to be liberal in
> what we accept here too. IMNSHO, we should produce either a single JSON
> document (the ARRAY case) or a series of JSON documents, one per row
> (the LINES case).

For what it's worth, I agree with Andrew on this.  I also agree with COPY
FROM allowing for potentially bogus commas at the end of non-arrays for
interop with other products, but to not do that in COPY TO (unless there is
some real compelling case to do so).  Emitting bogus JSON (non-array with
commas) feels wrong and would be nice to not perpetuate that, if possible.

Thanks again for doing this.  If I can be of any help, let me know.
If/when this makes it into the production product, I'll be using this
feature for sure.

-Davin


Re: Faster "SET search_path"

2023-12-05 Thread Jeff Davis
On Mon, 2023-11-20 at 17:13 -0800, Jeff Davis wrote:
> Will commit 0005 soon.

Committed.

> I also attached a trivial 0006 patch that uses SH_STORE_HASH. I
> wasn't
> able to show much benefit, though, even when there's a bucket
> collision. Perhaps there just aren't enough elements to matter -- I
> suppose there would be a benefit if there are lots of unique
> search_path strings, but that doesn't seem very plausible to me. If
> someone thinks it's worth committing, then I will, but I don't see
> any
> real upside or downside.

I tried again by forcing a hash table with ~25 entries and 13
collisions, and even then, SH_STORE_HASH didn't make a difference in my
test. Maybe a microbenchmark would show a difference, but I didn't see
much reason to commit 0006. (There's also no downside, so I was tempted
to commit it just so I wouldn't have to put more thought into whether
it's a problem or not.)
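
For reference, enabling the knob only takes two extra defines in the
simplehash instantiation. A sketch with illustrative names (not the
0006 patch itself; assumes #include "common/hashfn.h" for hash_bytes()):

typedef struct PathCacheEntry
{
    const char *searchpath;     /* key */
    uint32      hash;           /* kept for SH_STORE_HASH */
    char        status;         /* required by simplehash */
} PathCacheEntry;

#define SH_PREFIX pathcache
#define SH_ELEMENT_TYPE PathCacheEntry
#define SH_KEY_TYPE const char *
#define SH_KEY searchpath
#define SH_HASH_KEY(tb, key) hash_bytes((const unsigned char *) (key), strlen(key))
#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
#define SH_STORE_HASH
#define SH_GET_HASH(tb, a) ((a)->hash)
#define SH_SCOPE static inline
#define SH_DECLARE
#define SH_DEFINE
#include "lib/simplehash.h"

With SH_STORE_HASH, probes compare the stored hash before calling
SH_EQUAL, which only pays off when full key comparisons are frequent
and expensive -- consistent with seeing no difference at ~25 entries.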

Regards,
Jeff Davis





Re: Change GUC hashtable to use simplehash?

2023-12-05 Thread John Naylor
On Tue, Dec 5, 2023 at 1:57 AM Jeff Davis  wrote:
>
> On Mon, 2023-12-04 at 12:12 +0700, John Naylor wrote:

> There's already a patch to use simplehash, and the API is a bit
> cleaner, and there's a minor performance improvement. It seems fairly
> non-controversial -- should I just proceed with that patch?

I won't object if you want to commit that piece now, but I hesitate to
call it a performance improvement on its own.

- The runtime measurements I saw reported were well within the noise level.
- The memory usage starts out better, but with more entries is worse.

> > From my point of view, it would at least be useful for C-strings,
> > where we don't have the length available up front.
>
> That's good news.
>
> By the way, is there any reason that we would need hash_bytes(s,
> strlen(s)) == cstring_hash(s)?

"git grep cstring_hash" found nothing, so not sure what you're asking.

> Each approach has its own optimization techniques. In (a), we can use
> the |= 0x20 trick, and for equality do a memcmp() check first.

I will assume you are referring to semantics, but on the odd chance
readers take this to mean the actual C library call, that wouldn't be
an optimization, that'd be a pessimization.
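
For the archives, here is a toy sketch of what the "|= 0x20" trick
means in approach (a). Illustrative only; the mixing function is not
the patch's:

/*
 * For ASCII letters, OR-ing 0x20 maps 'A'..'Z' to 'a'..'z'.  Real code
 * must treat non-letters carefully, since e.g. '@' | 0x20 == '`'.
 */
static inline unsigned char
fold_ascii(unsigned char ch)
{
    return (ch >= 'A' && ch <= 'Z') ? (ch | 0x20) : ch;
}

/* hash folded bytes so that "WORK_MEM" and "work_mem" collide */
static uint32
guc_name_hash_sketch(const char *name)
{
    uint32      result = 0;

    for (const unsigned char *p = (const unsigned char *) name; *p; p++)
        result = result * 31 + fold_ascii(*p);  /* toy mixing only */
    return result;
}

The matching equality function folds both sides the same way, ideally
after a cheap exact-match fast path for the common case.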

> As a tangential point, we may eventually want to provide a more
> internationalized definition of "case insensitive" for GUC names. That
> would be slightly easier with (b) than with (a), but we can cross that
> bridge if and when we come to it.

The risk/reward ratio seems pretty bad.

> It seems you are moving toward (a) whereas my patches moved toward (b).
> I am fine with either approach but I wanted to clarify which approach
> we are using.

I will make my case:

> In the abstract, I kind of like approach (b) because we don't need to
> be as special/clever with the hash functions.

In the abstract, I consider (b) to be a layering violation. As a
consequence, the cleverness in (b) is not confined to one or two
places, but is smeared over a whole bunch of places. I find it hard to
follow.

Concretely, it also adds another pointer to the element struct. That's
not good for a linear open-addressing array, which simplehash has.

Further, remember the equality function is important as well. In v3,
it was "strcmp(a,b)==0", which is a holdover from the dynahash API.
One of the advantages of the simplehash API is that we can 1) use an
equality function, which should be slightly cheaper than a full
comparison function, and 2) we have the option to inline it. (It
doesn't make sense in turn, to jump to a shared lib page and invoke an
indirect function call.) Once we've done that, it's already "special",
so it's not a stretch to make it do what we want to begin with. If a
nicer API is important, why not use it?




Bug in nbtree optimization to skip > operator comparisons (or < comparisons in backwards scans)

2023-12-05 Thread Peter Geoghegan
As I recently went into on the thread where we've been discussing my
nbtree SAOP patch [1], there is good reason to suspect that one of the
optimizations added by commit e0b1ee17 is buggy in the presence of an
opfamily lacking the full set of cross-type comparisons. The attached
test case confirms that these suspicions were correct. Running the
test case against HEAD will lead to an assertion failure (or a wrong
answer when assertions are disabled).

To recap, the optimization in question (which is not to be confused
with the "precheck" optimization from the same commit) is based on the
idea that _bt_first must always land the scan ahead of the position
that the scan would end on, were the scan direction to change (from
forwards to backwards, say). It follows that inequality strategy scan
keys that are required in the opposite-to-scan direction *only* must
be redundant in the current scan direction (in the sense that
_bt_checkkeys needn't bother comparing them at all). Unfortunately,
that rationale is at least slightly wrong.

Although some version of the same assumption must really hold in the
case of required equality strategy scan keys (old comments in
_bt_checkkeys and in _bt_first say as much), it isn't really
guaranteed in the case of inequalities. In fact, the following
sentence appears in old comments above _bt_preprocess_keys, directly
contradicting the theory behind the optimization in question:

"In general, when inequality keys are present, the initial-positioning
code only promises to position before the first possible match, not
exactly at the first match, for a forward scan; or after the last
match for a backward scan."

My test case mostly just demonstrates how to reproduce the scenario
described by this sentence.

It's probably possible to salvage the optimization, but that will
require bookkeeping sufficient to detect these unsafe cases, so that
_bt_checkkeys only skips the comparisons when it's truly safe. As far
as I know, the only reason that inequalities differ from equalities in
this respect is the issue that the test case highlights. (There were
also issues with NULLs, but AFAICT Alexander dealt with that aspect of
the problem already.)

[1] 
https://postgr.es/m/CAH2-Wz=BuxYEHxpNH0tPvo=+g1wte1pamrovu1devow1vy9...@mail.gmail.com
-- 
Peter Geoghegan


require_opposite_dir_repro.sql
Description: Binary data


Re: Bug in nbtree optimization to skip > operator comparisons (or < comparisons in backwards scans)

2023-12-05 Thread Tom Lane
Peter Geoghegan  writes:
> As I recently went into on the thread where we've been discussing my
> nbtree SAOP patch [1], there is good reason to suspect that one of the
> optimizations added by commit e0b1ee17 is buggy in the presence of an
> opfamily lacking the full set of cross-type comparisons. The attached
> test case confirms that these suspicions were correct. Running the
> test case against HEAD will lead to an assertion failure (or a wrong
> answer when assertions are disabled).

Hmm ... I had not paid any attention to this commit, but the rationale
given in the commit message is just flat wrong:

Imagine the ordered B-tree scan for the query like this.

SELECT * FROM tbl WHERE col > 'a' AND col < 'b' ORDER BY col;

The (col > 'a') scan key will be always matched once we find the location to
start the scan.  The (col < 'b') scan key will match every item on the page
as long as it matches the last item on the page.

That argument probably holds for the index's first column, but it is
completely and obviously wrong for every following column.  Nonetheless
it looks like we're trying to apply the optimization to every scan key.

regards, tom lane




Re: Bug in nbtree optimization to skip > operator comparisons (or < comparisons in backwards scans)

2023-12-05 Thread Peter Geoghegan
On Tue, Dec 5, 2023 at 4:53 PM Tom Lane  wrote:
> Hmm ... I had not paid any attention to this commit, but the rationale
> given in the commit message is just flat wrong:
>
> Imagine the ordered B-tree scan for the query like this.
>
> SELECT * FROM tbl WHERE col > 'a' AND col < 'b' ORDER BY col;
>
> The (col > 'a') scan key will be always matched once we find the location 
> to
> start the scan.  The (col < 'b') scan key will match every item on the 
> page
> as long as it matches the last item on the page.
>
> That argument probably holds for the index's first column, but it is
> completely and obviously wrong for every following column.  Nonetheless
> it looks like we're trying to apply the optimization to every scan key.

Just to be clear, you're raising a concern that seems to me to apply
to "the other optimization" from the same commit, specifically -- the
precheck optimization. Not the one I found a problem in. (They're
closely related but distinct optimizations.)

I *think* that that part is handled correctly, because non-required
scan keys are not affected (by either optimization). I have no
specific reason to doubt the proposition that 'b' could only be marked
required in situations where it is indeed safe to assume that the col
< 'b' condition must also apply to earlier tuples transitively (i.e.
this must be true because it was true for the final tuple on the
page during the _bt_readpage precheck).

That being said, I wouldn't rule out problems for the precheck
optimization in the presence of opfamilies like the one from my test
case. I didn't get as far as exploring that side of things, at least.

-- 
Peter Geoghegan




Re: Test 002_pg_upgrade fails with olddump on Windows

2023-12-05 Thread Michael Paquier
On Tue, Dec 05, 2023 at 12:00:00PM +0300, Alexander Lakhin wrote:
> So removing the condition "if ($oldnode->pg_version != $newnode->pg_version)"
> works here as well, but maybe switching the file mode (to preserve EOLs
> produced by pg_dump) in the block "After dumping, update references ..."
> is more efficient than filtering dumps (on all OSes?).

Well, there's the argument that we replace the library references in
a SQL file that we are handling as a text file, so switching it to use
the binary mode is not right.  A second argument is to apply the same
filtering logic across both the old and new dumps, even if we know
that the second dump file taken by pg_dump will not append CRLFs.

At the end, just applying the filtering all the time makes the most
sense to me, so I've applied a patch doing just that.
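
The filtering amounts to newline normalization applied to both dump
files before they are compared; roughly this (a sketch, not the
literal committed code):

# normalize CRLFs so the old and new dumps compare equal no matter
# how pg_dump wrote them on Windows
my $contents = slurp_file($dump_file);
$contents =~ s/\r\n/\n/g;
open my $fh, '>', $filtered_file or die $!;
binmode $fh;    # don't reintroduce CRLFs on write
print $fh $contents;
close $fh;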
--
Michael


signature.asc
Description: PGP signature


Re: Bug in nbtree optimization to skip > operator comparisons (or < comparisons in backwards scans)

2023-12-05 Thread Peter Geoghegan
On Tue, Dec 5, 2023 at 4:41 PM Peter Geoghegan  wrote:
> "In general, when inequality keys are present, the initial-positioning
> code only promises to position before the first possible match, not
> exactly at the first match, for a forward scan; or after the last
> match for a backward scan."
>
> My test case mostly just demonstrates how to reproduce the scenario
> described by this sentence.

I just realized that my test case wasn't quite minimized correctly. It
depended on a custom function that was no longer created.

Attached is a revised version that uses btint84cmp instead.

-- 
Peter Geoghegan


require_opposite_dir_repro_v2.sql
Description: Binary data


ExecSetupTransitionCaptureState not declared in nodeModifyTable.c

2023-12-05 Thread jian he
hi.

static void
ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate);

not declared in src/backend/executor/nodeModifyTable.c.
do we need to add the declaration?




Re: ExecSetupTransitionCaptureState not declared in nodeModifyTable.c

2023-12-05 Thread Tom Lane
jian he  writes:
> static void
> ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate);

> not declared in src/backend/executor/nodeModifyTable.c.
> do we need to add the declaration?

Not if the compiler's not complaining about it.  We don't have a
policy requiring all static functions to be forward-declared.

regards, tom lane




Re: Make COPY format extendable: Extract COPY TO format implementations

2023-12-05 Thread Sutou Kouhei
Hi,

Thanks for replying to this proposal!

In <20231205182458.GC2757816@nathanxps13>
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on 
Tue, 5 Dec 2023 12:24:58 -0600,
  Nathan Bossart  wrote:

> I think it makes sense to do this part independently, but we should be
> careful to design this with the follow-up tasks in mind.

OK. I'll keep updating the "TODOs" section in the original
e-mail. It also covers the design of the follow-up tasks. We
can discuss the design separately from the patch
submissions. (The currently submitted patch just focuses on
refactoring, but we can discuss the final design.)

> I assume the performance concerns stem from the use of
> function pointers.  Or was there something else?

I think so too.

The original e-mail that mentioned the performance concern
[1] didn't state the reason, but the use of function
pointers might be the concern.

If the currently supported formats ("text", "csv" and
"binary") were implemented as extensions, there might be more
concerns, but we will keep them as built-in formats for
compatibility. So I think that no further concerns exist for
these formats.
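
For readers following along: the pattern whose overhead is discussed
here is a per-format table of function pointers, dispatched once per
row. A rough sketch of the shape (member names are illustrative; see
the patch for the real CopyToFormatOps):

typedef struct CopyToFormatOpsSketch
{
    void        (*start) (CopyToState cstate, TupleDesc tupdesc);
    void        (*one_row) (CopyToState cstate, TupleTableSlot *slot);
    void        (*end) (CopyToState cstate);
} CopyToFormatOpsSketch;

One indirect call per row should be negligible next to the per-field
output work, which is consistent with the benchmark results posted
with the patch.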


[1]: 
https://www.postgresql.org/message-id/flat/3741749.1655952719%40sss.pgh.pa.us#2bb7af4a3d2c7669f9a49808d777a20d


Thanks,
-- 
kou




Re: Make COPY format extendable: Extract COPY TO format implementations

2023-12-05 Thread Junwang Zhao
On Wed, Dec 6, 2023 at 10:45 AM Sutou Kouhei  wrote:
>
> Hi,
>
> Thanks for replying to this proposal!
>
> In <20231205182458.GC2757816@nathanxps13>
>   "Re: Make COPY format extendable: Extract COPY TO format implementations" 
> on Tue, 5 Dec 2023 12:24:58 -0600,
>   Nathan Bossart  wrote:
>
> > I think it makes sense to do this part independently, but we should be
> > careful to design this with the follow-up tasks in mind.
>
> OK. I'll keep updating the "TODOs" section in the original
> e-mail. It also includes design in the follow-up tasks. We
> can discuss the design separately from the patches
> submitting. (The current submitted patch just focuses on
> refactoring but we can discuss the final design.)
>
> > I assume the performance concerns stem from the use of
> > function pointers.  Or was there something else?
>
> I think so too.
>
> The original e-mail that mentioned the performance concern
> [1] didn't say about the reason but the use of function
> pointers might be concerned.
>
> If the currently supported formats ("text", "csv" and
> "binary") are implemented as an extension, it may have more
> concerns but we will keep them as built-in formats for
> compatibility. So I think that no more concerns exist for
> these formats.
>

For the modern formats (parquet, orc, avro, etc.), will they be
implemented as extensions or in core?

The patch looks good except for a pair of extra curly braces.

>
> [1]: 
> https://www.postgresql.org/message-id/flat/3741749.1655952719%40sss.pgh.pa.us#2bb7af4a3d2c7669f9a49808d777a20d
>
>
> Thanks,
> --
> kou
>
>


-- 
Regards
Junwang Zhao




Re: Bug in nbtree optimization to skip > operator comparisons (or < comparisons in backwards scans)

2023-12-05 Thread Alexander Korotkov
Hi, Peter!

On Wed, Dec 6, 2023 at 3:46 AM Peter Geoghegan  wrote:
> On Tue, Dec 5, 2023 at 4:41 PM Peter Geoghegan  wrote:
> > "In general, when inequality keys are present, the initial-positioning
> > code only promises to position before the first possible match, not
> > exactly at the first match, for a forward scan; or after the last
> > match for a backward scan."
> >
> > My test case mostly just demonstrates how to reproduce the scenario
> > described by this sentence.
>
> I just realized that my test case wasn't quite minimized correctly. It
> depended on a custom function that was no longer created.
>
> Attached is a revised version that uses btint84cmp instead.

Thank you for raising this issue.  Preprocessing of btree scan keys is
normally removing the redundant scan keys.  However, redundant scan
keys aren't removed when they have arguments of different types.
Please give me a bit of time to figure out how to workaround this.

--
Regards,
Alexander Korotkov




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-12-05 Thread vignesh C
On Tue, 5 Dec 2023 at 11:11, Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Sawada-san, hackers,
>
> Based on comments I made a fix. PSA the patch.
>

Thanks for the patch, the changes look good to me.

Regards,
Vignesh




Re: Bug in nbtree optimization to skip > operator comparisons (or < comparisons in backwards scans)

2023-12-05 Thread Peter Geoghegan
On Tue, Dec 5, 2023 at 8:06 PM Alexander Korotkov  wrote:
> Thank you for raising this issue.  Preprocessing of btree scan keys is
> normally removing the redundant scan keys.  However, redundant scan
> keys aren't removed when they have arguments of different types.
> Please give me a bit of time to figure out how to workaround this.

Couldn't you condition the use of the optimization on
_bt_preprocess_keys being able to use cross-type operators when it
checked for redundant or contradictory scan keys? The vast majority of
index scans will be able to do that.

As I said already, what you're doing here isn't all that different to
the way that we rely on required equality strategy scan keys being
used to build our initial insertion scan key, that determines where
the scan is initially positioned to within _bt_first. Inequalities
aren't all that different to equalities.

-- 
Peter Geoghegan




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-12-05 Thread Amit Kapila
On Wed, Dec 6, 2023 at 9:40 AM vignesh C  wrote:
>
> On Tue, 5 Dec 2023 at 11:11, Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > Dear Sawada-san, hackers,
> >
> > Based on comments I made a fix. PSA the patch.
> >
>
> Thanks for the patch, the changes look good to me.
>

Thanks, I have added a comment and updated the commit message. I'll
push this tomorrow unless there are more comments.

-- 
With Regards,
Amit Kapila.


v2-0001-Fix-issues-in-binary_upgrade_logical_slot_has_cau.patch
Description: Binary data


Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

2023-12-05 Thread Shlok Kyal
On Tue, 5 Dec 2023 at 17:18, Tomas Vondra  wrote:
>
> On 12/5/23 08:14, Shlok Kyal wrote:
> > Hi,
> >
> >> As for the test results, I very much doubt the differences are not
> >> caused simply by random timing variations, or something like that. And I
> >> don't understand what "Performance Machine Linux" is, considering those
> >> timings are slower than the other two machines.
> >
> > The machine has Total Memory of 755.536 GB, 120 CPUs and RHEL 7 Operating 
> > System
> > Also find the detailed info of the performance machine attached.
> >
>
> Thanks for the info. I don't think the tests really benefit from this
> much resources, I would be rather surprised if it was faster beyond 8
> cores or so. The CPU frequency likely matters much more. Which probably
> explains why this machine was the slowest.
>
> Also, I wonder how much the results vary between the runs. I suppose you
> only did a single run for each, right?

I did 10 runs for each of the cases and reported the median result in
the previous thread.
I have documented the result of the runs and have attached the same here.

Thanks and Regards,
Shlok Kyal


Performance_Test.xlsx
Description: MS-Excel 2007 spreadsheet


Re: Synchronizing slots from primary to standby

2023-12-05 Thread Amit Kapila
On Tue, Dec 5, 2023 at 7:38 PM Drouvot, Bertrand
 wrote:
>
> On 12/5/23 12:32 PM, Amit Kapila wrote:
> > On Tue, Dec 5, 2023 at 10:38 AM shveta malik  wrote:
> >>
> >> On Mon, Dec 4, 2023 at 10:07 PM Drouvot, Bertrand
> >>  wrote:
> 
> >>>
> >>> Maybe another option could be to have the walreceiver a way to let the 
> >>> slot sync
> >>> worker knows that it (the walreceiver) was not able to start due to non 
> >>> existing
> >>> replication slot on the primary? (that way we'd avoid the slot sync 
> >>> worker having
> >>> to talk to the primary).
> >>
> >> Few points:
> >> 1) I think if we do it, we should do it in generic way i.e. slotsync
> >> worker should go to no-op if walreceiver is not able to start due to
> >> any reason and not only due to invalid primary_slot_name.
> >> 2) Secondly, slotsync worker needs to make sure it has synced the
> >> slots so far i.e. worker should not go to no-op immediately on seeing
> >> missing WalRcv process if there are pending slots to be synced.
> >>
> >
> > Won't it be better to just ping and check the validity of
> > 'primary_slot_name' at the start of slot-sync and if it is changed
> > anytime? I think it would be better to avoid adding dependency on
> > walreciever state as that sounds like needless complexity.
>
> I think the overall extra complexity is linked to the fact that we first
> want to ensure that the slots are in sync before shutting down the
> sync slot worker.
>
> I think that talking to the primary or relying on the walreceiver state
> is "just" what would trigger the decision to shutdown the sync slot worker.
>
> Relying on the walreceiver state looks better to me (as it avoids possibly
> useless round trips with the primary).
>

But the round trip will only happen once in the beginning, and again if
the user changes the GUC primary_slot_name, which shouldn't be that often.

> Also the walreceiver could be down for multiple reasons, and I think there
> is no point of having a sync slot worker running if the slots are in sync and
> there is no walreceiver running (even if primary_slot_name is a valid one).
>

I feel that is indirectly relying on the fact that the primary won't
advance logical slots unless physical standby has consumed data. Now,
it is possible that slot-sync worker lags behind and still needs to
sync more data for slots in which it makes sense for slot-sync worker
to be alive. I think we can try to avoid checking walreceiver status
till we can get more data to avoid the problem I mentioned but it
doesn't sound like a clean way to achieve our purpose.

-- 
With Regards,
Amit Kapila.




Re: RFI: Extending the TOAST Pointer

2023-12-05 Thread Michael Paquier
On Wed, May 24, 2023 at 11:50:21AM +0200, Matthias van de Meent wrote:
> I think you might not have picked up on what I was arguing for, but I
> agree with what you just said.
> 
> My comment on not needing to invent a new ToastCompressionId was on
> the topic of adding capabilities^ to toast pointers that do things
> differently than the current TOAST and need more info than just sizes,
> 2x 32-bit ID and a compression algorithm.

I am not sure I understand why a new vartag is really required just
for the sake of compression when it comes to VARTAG_EXTERNAL, because
that's where the compression information has been stored for ages.  The
code footprint also gets more invasive if you add more compression
methods, as a new vartag implies more code areas to handle.

> ^ capabilities such as compression dictionaries (which would need to
> store a dictionary ID in the pointer), TOAST IDs that are larger than
> 32 bits, and other such advances.

That said, I don't really see why we cannot just do both, because it
is clear that many other projects want to fill in more data into
varlena headers for their own needs.  Hence, I would do:
1) Use the last bit of va_extinfo in varatt_external to link in more
info related to compression, and keep the compression information
close to varatt_external.
2) Add a new kind of "custom" vartag, for whatever other requirements
want it to be.
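
For reference, here is the current packing that 1) would carve a bit
out of (these macros are from PostgreSQL's headers; the new encoding
itself is still open):

/* low 30 bits: external saved size; top 2 bits: compression method */
#define VARLENA_EXTSIZE_BITS    30
#define VARLENA_EXTSIZE_MASK    ((1U << VARLENA_EXTSIZE_BITS) - 1)

#define VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer) \
    ((toast_pointer).va_extinfo & VARLENA_EXTSIZE_MASK)
#define VARATT_EXTERNAL_GET_COMPRESS_METHOD(toast_pointer) \
    ((toast_pointer).va_extinfo >> VARLENA_EXTSIZE_BITS)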
--
Michael


signature.asc
Description: PGP signature


Re: logical decoding and replication of sequences, take 2

2023-12-05 Thread Dilip Kumar
On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra
 wrote:
>

> Some time ago I floated the idea of maybe "queuing" the sequence changes
> and only replay them on the next commit, somehow. But we did ran into
> problems with which snapshot to use, that I didn't know how to solve.
> Maybe we should try again. The idea is we'd queue the non-transactional
> changes somewhere (can't be in the transaction, because we must keep
> them even if it aborts), and then "inject" them into the next commit.
> That'd mean we wouldn't do the separate start/abort for each change.

Why can't we use the same concept of
SnapBuildDistributeNewCatalogSnapshot(), I mean we keep queuing the
non-transactional changes (have some base snapshot before the first
change), and whenever there is any catalog change, queue new snapshot
change also in the queue of the non-transactional sequence change so
that while sending it to downstream whenever it is necessary we will
change the historic snapshot?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Is WAL_DEBUG related code still relevant today?

2023-12-05 Thread Michael Paquier
On Mon, Dec 04, 2023 at 10:14:36AM +0900, Michael Paquier wrote:
> I can add the flag in one of my nix animals if we don't have any to
> provide minimal coverage, that's not an issue for me.  I'd suggest to
> just fix the build on Windows, this flag is a low maintenance burden.

Hearing nothing about that, I've reproduced the failure, checked that
the proposed fix is OK, and applied it down to 13 where this was
introduced.

Regarding the tests, like Noah, I am not really sure that it is worth
spending resources on fixing them, as they'd require wal_debug = on to
break.
--
Michael


signature.asc
Description: PGP signature


Re: reindexing an invalid index should not use ERRCODE_INDEX_CORRUPTED

2023-12-05 Thread Michael Paquier
On Sat, Nov 18, 2023 at 04:32:36PM -0800, Noah Misch wrote:
> On Sat, Nov 18, 2023 at 03:09:58PM -0800, Andres Freund wrote:
>> Unfortunately, there is a case of such an sqlstate that's not at all 
>> indicating
>> corruption, namely REINDEX CONCURRENTLY when the index is invalid:
>> 
>> if (!indexRelation->rd_index->indisvalid)
>> ereport(WARNING,
>> (errcode(ERRCODE_INDEX_CORRUPTED),
>>  errmsg("cannot reindex invalid index 
>> \"%s.%s\" concurrently, skipping",
>> 
>> get_namespace_name(get_rel_namespace(cellOid)),
>> get_rel_name(cellOid;
>> 
>> The only thing required to get to this is an interrupted CREATE INDEX
>> CONCURRENTLY, which I don't think can be fairly characterized as 
>> "corruption".
>> 
>> ISTM something like ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE would be more
>> appropriate?
> 
> +1, that's a clear improvement.

The same thing can be said a couple of lines above where the code uses
ERRCODE_FEATURE_NOT_SUPPORTED but your suggestion of
ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE would be better.

Would the attached be OK for you?

> The "cannot" part of the message is also inaccurate, and it's not clear to me
> why we have this specific restriction at all.  REINDEX INDEX CONCURRENTLY
> accepts such indexes, so I doubt it's an implementation gap.

If you would reword that, what would you change?

> Since an INVALID
> index often duplicates some valid index, I could see an argument that
> reindexing INVALID indexes as part of a table-level REINDEX is wanted less
> often than not.

The argument behind this restriction is that repeated interruptions of
a table-level REINDEX CONCURRENTLY would bloat the entire relation with
index entries if invalid entries are rebuilt.  This was discussed back
on the original thread back in 2019, around here:
https://www.postgresql.org/message-id/20190411132704.gc30...@paquier.xyz
--
Michael
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 412e1ba84f..05570eb3f3 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -3524,7 +3524,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	if (!indexRelation->rd_index->indisvalid)
 		ereport(WARNING,
-(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  errmsg("cannot reindex invalid index \"%s.%s\" concurrently, skipping",
 		get_namespace_name(get_rel_namespace(cellOid)),
 		get_rel_name(cellOid;
@@ -3576,7 +3576,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		if (!indexRelation->rd_index->indisvalid)
 			ereport(WARNING,
-	(errcode(ERRCODE_INDEX_CORRUPTED),
+	(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 	 errmsg("cannot reindex invalid index \"%s.%s\" concurrently, skipping",
 			get_namespace_name(get_rel_namespace(cellOid)),
 			get_rel_name(cellOid;


signature.asc
Description: PGP signature


Re: Synchronizing slots from primary to standby

2023-12-05 Thread shveta malik
On Wed, Dec 6, 2023 at 10:56 AM Amit Kapila  wrote:
>
> On Tue, Dec 5, 2023 at 7:38 PM Drouvot, Bertrand
>  wrote:
> >
> > On 12/5/23 12:32 PM, Amit Kapila wrote:
> > > On Tue, Dec 5, 2023 at 10:38 AM shveta malik  
> > > wrote:
> > >>
> > >> On Mon, Dec 4, 2023 at 10:07 PM Drouvot, Bertrand
> > >>  wrote:
> > 
> > >>>
> > >>> Maybe another option could be to have the walreceiver a way to let the 
> > >>> slot sync
> > >>> worker knows that it (the walreceiver) was not able to start due to non 
> > >>> existing
> > >>> replication slot on the primary? (that way we'd avoid the slot sync 
> > >>> worker having
> > >>> to talk to the primary).
> > >>
> > >> Few points:
> > >> 1) I think if we do it, we should do it in generic way i.e. slotsync
> > >> worker should go to no-op if walreceiver is not able to start due to
> > >> any reason and not only due to invalid primary_slot_name.
> > >> 2) Secondly, slotsync worker needs to make sure it has synced the
> > >> slots so far i.e. worker should not go to no-op immediately on seeing
> > >> missing WalRcv process if there are pending slots to be synced.
> > >>
> > >
> > > Won't it be better to just ping and check the validity of
> > > 'primary_slot_name' at the start of slot-sync and if it is changed
> > > anytime? I think it would be better to avoid adding dependency on
> > > walreciever state as that sounds like needless complexity.
> >
> > I think the overall extra complexity is linked to the fact that we first
> > want to ensure that the slots are in sync before shutting down the
> > sync slot worker.
> >
> > I think that talking to the primary or relying on the walreceiver state
> > is "just" what would trigger the decision to shutdown the sync slot worker.
> >
> > Relying on the walreceiver state looks better to me (as it avoids possibly
> > useless round trips with the primary).
> >
>
> But the round trip will only happen once in the beginning, and again if
> the user changes the GUC primary_slot_name, which shouldn't be that often.
>
> > Also the walreceiver could be down for multiple reasons, and I think there
> > is no point of having a sync slot worker running if the slots are in sync 
> > and
> > there is no walreceiver running (even if primary_slot_name is a valid one).
> >
>
> I feel that is indirectly relying on the fact that the primary won't
> advance logical slots unless physical standby has consumed data.

Yes, that is the basis of this discussion. But now on rethinking, if
the user has not set 'standby_slot_names' on the primary in the first place,
then even if walreceiver on standby is down, slots on primary will
keep on advancing and thus we need to sync. We have no check currently
that mandates users to set standby_slot_names.

> Now,
> it is possible that slot-sync worker lags behind and still needs to
> sync more data for slots in which it makes sense for slot-sync worker
> to be alive. I think we can try to avoid checking walreceiver status
> till we can get more data to avoid the problem I mentioned but it
> doesn't sound like a clean way to achieve our purpose.
>




Re: Remove MSVC scripts from the tree

2023-12-05 Thread Peter Eisentraut

On 04.12.23 21:11, Andrew Dunstan wrote:
I just had a look at shifting bowerbird to use meson, and it got stymied 
at the c99 test, which apparently doesn't compile with anything less 
than VS2019.


If that is the case, then wouldn't that invalidate the documented claim 
that you can build with VS2015 or newer?






Re: Make COPY format extendable: Extract COPY TO format implementations

2023-12-05 Thread Sutou Kouhei
Hi,

In 
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on 
Wed, 6 Dec 2023 11:18:35 +0800,
  Junwang Zhao  wrote:

> For the modern formats (parquet, orc, avro, etc.), will they be
> implemented as extensions or in core?

I think that they should be implemented as extensions
because they will depend on external libraries and may not
be written in C. For example, C++ will be used for Apache Parquet
because the official Apache Parquet C++ implementation
exists but the C implementation doesn't.

(I can implement an extension for Apache Parquet after we
complete this feature. I'll implement an extension for
Apache Arrow with the official Apache Arrow C++
implementation. And it's easy to convert Apache Arrow
data to Apache Parquet with the official Apache Parquet
implementation.)

> The patch looks good except for a pair of extra curly braces.

Thanks for the review! I attach the v2 patch that removes
extra curly braces for "if (isnull)".


Thanks,
-- 
kou
>From 2cd0d344d68667db71b621a8c94f376ddf1707c3 Mon Sep 17 00:00:00 2001
From: Sutou Kouhei 
Date: Mon, 4 Dec 2023 12:32:54 +0900
Subject: [PATCH v2] Extract COPY TO format implementations

This is a part of making COPY format extendable. See also these past
discussions:
* New Copy Formats - avro/orc/parquet:
  https://www.postgresql.org/message-id/flat/20180210151304.fonjztsynewldfba%40gmail.com
* Make COPY extendable in order to support Parquet and other formats:
  https://www.postgresql.org/message-id/flat/CAJ7c6TM6Bz1c3F04Cy6%2BSzuWfKmr0kU8c_3Stnvh_8BR0D6k8Q%40mail.gmail.com

This doesn't change the current behavior. This just introduces
CopyToFormatOps, which just has function pointers of format
implementation like TupleTableSlotOps, and use it for existing "text",
"csv" and "binary" format implementations.

Note that CopyToFormatOps can't be used from extensions yet because
CopySend*() aren't exported yet. Extensions can't send formatted data
to a destination without CopySend*(). They will be exported by
subsequent patches.

Here is a benchmark result with/without this change because there was
a discussion that we should care about performance regression:

https://www.postgresql.org/message-id/3741749.1655952719%40sss.pgh.pa.us

> I think that step 1 ought to be to convert the existing formats into
> plug-ins, and demonstrate that there's no significant loss of
> performance.

You can see that there is no significant loss of performance:

Data: Random 32 bit integers:

CREATE TABLE data (int32 integer);
INSERT INTO data
  SELECT random() * 1
FROM generate_series(1, ${n_records});

The number of records: 100K, 1M and 10M

100K without this change:

format,elapsed time (ms)
text,22.527
csv,23.822
binary,24.806

100K with this change:

format,elapsed time (ms)
text,22.919
csv,24.643
binary,24.705

1M without this change:

format,elapsed time (ms)
text,223.457
csv,233.583
binary,242.687

1M with this change:

format,elapsed time (ms)
text,224.591
csv,233.964
binary,247.164

10M without this change:

format,elapsed time (ms)
text,2330.383
csv,2411.394
binary,2590.817

10M with this change:

format,elapsed time (ms)
text,2231.307
csv,2408.067
binary,2473.617
---
 src/backend/commands/copy.c   |   8 +
 src/backend/commands/copyto.c | 383 --
 src/include/commands/copy.h   |  27 ++-
 3 files changed, 262 insertions(+), 156 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfad47b562..27a1add456 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -427,6 +427,8 @@ ProcessCopyOptions(ParseState *pstate,
 
 	opts_out->file_encoding = -1;
 
+	/* Text is the default format. */
+	opts_out->to_ops = CopyToFormatOpsText;
 	/* Extract options from the statement node tree */
 	foreach(option, options)
 	{
@@ -442,9 +444,15 @@ ProcessCopyOptions(ParseState *pstate,
 			if (strcmp(fmt, "text") == 0)
  /* default format */ ;
 			else if (strcmp(fmt, "csv") == 0)
+			{
 opts_out->csv_mode = true;
+opts_out->to_ops = CopyToFormatOpsCSV;
+			}
 			else if (strcmp(fmt, "binary") == 0)
+			{
 opts_out->binary = true;
+opts_out->to_ops = CopyToFormatOpsBinary;
+			}
 			else
 ereport(ERROR,
 		(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c
index c66a047c4a..79806b9a1b 100644
--- a/src/backend/commands/copyto.c
+++ b/src/backend/commands/copyto.c
@@ -131,6 +131,234 @@ static void CopySendEndOfRow(CopyToState cstate);
 static void CopySendInt32(CopyToState cstate, int32 val);
 static void CopySendInt16(CopyToState cstate, int16 val);
 
+/*
+ * CopyToFormatOps implementations.
+ */
+
+/*
+ * CopyToFormatOps implementation for "text" and "csv". CopyToFormatText*()
+ * refer cstate->opts.csv_mode and change their behavior. We can split this
+ * impleme

Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-05 Thread John Naylor
On Wed, Dec 6, 2023 at 4:34 AM Masahiko Sawada  wrote:
>
> On Mon, Dec 4, 2023 at 5:21 PM John Naylor  wrote:

> > > Given variable-length value support, RT_GET() would have to do
> > > repalloc() if the existing value size is not big enough for the new
> > > value, but it cannot as the radix tree doesn't know the size of each
> > > stored value.
> >
> > I think we have two choices:
> >
> > - the value stores the "length". The caller would need to specify a
> > function to compute size from the "length" member. Note this assumes
> > there is an array. I think both aspects are not great.
> > - the value stores the "size". Callers that store an array (as
> > PageTableEntry's do) would compute length when they need to. This
> > sounds easier.
>
> As for the second idea, do we always need to require the value to have
> the "size" (e.g. int32) in the first field of its struct? If so, the
> caller will be able to use only 4 bytes in embedded value cases (or
> won't be able to use at all if the pointer size is 4 bytes).

We could have an RT_SIZE_TYPE for varlen value types. That's easy.
There is another way, though: (This is a digression into embedded
values, but it does illuminate some issues even aside from that)

My thinking a while ago was that an embedded value had no explicit
length/size, but could be "expanded" into a conventional value for the
caller. For bitmaps, the smallest full value would have length 1 and
whatever size (For tid store maybe 16 bytes). This would happen
automatically via a template function.

Now I think that could be too complicated (especially for page table
entries, which have more bookkeeping than vacuum needs) and slow.
Imagine this as an embedded value:

typedef struct BlocktableEntry
{
  uint16 size;

  /* later: uint8 flags; for bitmap scan */

  /* 64 bit: 3 elements , 32-bit: 1 element */
  OffsetNumber offsets[( sizeof(Pointer) - sizeof(int16) ) /
sizeof(OffsetNumber)];

  /* end of embeddable value */

  bitmapword words[FLEXIBLE_ARRAY_MEMBER];
} BlocktableEntry;

Here we can use a slot to store up to 3 offsets, no matter how big
they are. That's great because a bitmap could be mostly wasted space.
But now the caller can't know up front how many bytes it needs until
it retrieves the value and sees what's already there. If there are
already three values, the caller needs to tell the tree "alloc this
much, update this slot you just gave me with the alloc (maybe DSA)
pointer, and return the local pointer". Then copy the 3 offsets into
set bits, and set whatever else it needs to. With normal values, same
thing, but with realloc.

This is a bit complex, but I see an advantage: the tree doesn't need to
care so much about the size, so the value doesn't need to contain the
size. For our case, we can use length (number of bitmapwords) without
the disadvantages I mentioned above, with length zero (or maybe -1)
meaning "no bitmapword array, the offsets are all in this small
array".

> > > Another idea is that the radix tree returns the pointer
> > > to the slot and the caller updates the value accordingly.
> >
> > I did exactly this in v43 TidStore if I understood you correctly. If I
> > misunderstood you, can you clarify?
>
> I meant to expose RT_GET_SLOT_RECURSIVE() so that the caller updates
> the value as they want.

Did my sketch above get closer to that? Side note: I don't think we
can expose that directly (e.g. need to check for create or extend
upwards), but some functionality can be a thin wrapper around it.

> > However, for vacuum, we have all values that we need up front. That
> > gives me an idea: Something like this insert API could be optimized
> > for "insert-only": If we only free values when we free the whole tree
> > at the end, that's a clear use case for David Rowley's proposed "bump
> > context", which would save 8 bytes per allocation and be a bit faster.
> > [1] (RT_GET for varlen values would use an aset context, to allow
> > repalloc, and nodes would continue to use slab).
>
> Interesting idea and worth trying it. Do we need to protect the whole
> tree as insert-only for safety? It's problematic if the user uses
> mixed RT_INSERT() and RT_GET().

You're right, but I'm not sure what the policy should be.




Re: Remove MSVC scripts from the tree

2023-12-05 Thread Shubham Khanna
On Fri, Nov 17, 2023 at 6:31 AM Michael Paquier  wrote:
>
> On Wed, Nov 15, 2023 at 05:07:03PM -0800, Andres Freund wrote:
> > On 2023-11-15 13:49:06 +0900, Michael Paquier wrote:
> > Where "Windows" actually seems to solely describe visual studio? That seems
> > confusing.
>
> Yeah, switch that to Visual.
>
> > Huh, so this was wrong since the code was added? For a moment I thought I'd
> > unintentionally promoted it to be built by default, but ...
>
> Yes, I was wondering if there could be an argument for simplifying
> some code here by pushing more logic into this wrapper, but I'm
> finding that a bit unappealing, and building it under Visual has no
> actual consequence: it seems that we never call pg_strsignal() under
> WIN32.
>
> >> -# FIXME: copied from Mkvcbuild.pm, but I don't think that's the right 
> >> approach
> >>  pgevent_link_args = []
> >>  if cc.get_id() == 'msvc'
> >>pgevent_link_args += '/ignore:4104'
> >
> > I think it's worth leaving a trail indicating that adding this
> > warning-suppression is dubious at best.  It seems to pretty obviously paper
> > over us exporting the symbols the wrong way:
> > https://learn.microsoft.com/en-us/cpp/error-messages/tool-errors/linker-tools-warning-lnk4104?view=msvc-170
> >
> > Which pretty clearly explains that pgevent.def is wrong.
> >
> > I just can't really test it, nor does it have test. Otherwise I might have
> > fixed it.
>
> Agreed that there is a good argument for removing it at some point,
> with a separate investigation.  I've just added a XXX comment for now.
>
> >> @@ -53,10 +53,25 @@ AC_DEFUN([PGAC_CHECK_PERL_CONFIGS],
> >>  # would be fatal to try to compile PL/Perl to a different libc ABI than 
> >> core
> >>  # Postgres uses.  The available information says that most symbols that 
> >> affect
> >>  # Perl's own ABI begin with letters, so it's almost sufficient to adopt -D
> >> -# switches for symbols not beginning with underscore.  Some exceptions 
> >> are the
> >> -# Windows-specific -D_USE_32BIT_TIME_T and -D__MINGW_USE_VC2005_COMPAT; 
> >> see
> >> -# Mkvcbuild.pm for details.  We absorb the former when Perl reports it.  
> >> Perl
> >> -# never reports the latter, and we don't attempt to deduce when it's 
> >> needed.
> >> +# switches for symbols not beginning with underscore.
> >> +
> >> +# Some exceptions are the Windows-specific -D_USE_32BIT_TIME_T and
> >> +# -D__MINGW_USE_VC2005_COMPAT. To be exact, Windows offers several 32-bit 
> >> ABIs.
> >> +# Perl is sensitive to sizeof(time_t), one of the ABI dimensions.  To get
> >> +# 32-bit time_t, use "cl -D_USE_32BIT_TIME_T" or plain "gcc".  For 64-bit
> >> +# time_t, use "gcc -D__MINGW_USE_VC2005_COMPAT" or plain "cl".  Before 
> >> MSVC
> >> +# 2005, plain "cl" chose 32-bit time_t.  PostgreSQL doesn't support 
> >> building
> >> +# with pre-MSVC-2005 compilers, but it does support linking to Perl built 
> >> with
> >> +# such a compiler.  MSVC-built Perl 5.13.4 and later report 
> >> -D_USE_32BIT_TIME_T
> >> +# in $Config{ccflags} if applicable, but MinGW-built Perl never reports
> >> +# -D_USE_32BIT_TIME_T despite typically needing it.
> >
> > Hm, it's pretty odd to have comments about cl.exe here, given that it can't
> > even be used with msvc.
> >
> > My impression from testing this is that absorbing the flag from perl 
> > suffices
> > with strawberry perl and mingw perl, both when building with mingw and msvc.
>
> I was a bit uncomfortable with removing these references, but I
> suspect that you are right and that they're outdated artifacts of the
> past.  So I'm OK to remove the cl and gcc parts as the flags come from
> $PERL.
>
> >> +# Ignore the $Config{ccflags} opinion about -D_USE_32BIT_TIME_T, and use a
> >> +# runtime test to deduce the ABI Perl expects.  Specifically, test use of
> >> +# PL_modglobal, which maps to a PerlInterpreter field whose position 
> >> depends
> >> +# on sizeof(time_t).  We absorb the former when Perl reports it.  Perl 
> >> never
> >> +# reports the latter, and we don't attempt to deduce when it's needed.
> >
> > I don't think this is implemented anywhere now?
>
> Indeed, that's now gone.
>
> >> +   
> >> +PostgreSQL for Windows can be built using meson, as described
> >> +in .
> >> +The native Windows port requires a 32 or 64-bit version of Windows
> >> +2000 or later. Earlier operating systems do
> >> +not have sufficient infrastructure (but Cygwin may be used on
> >> +those).
> >> +   
> >
> > Is this actually true? I don't think we build on win2k...
>
> Nah, this reference has been outdated for ages.  495ed0ef2d72 has even
> bumped _WIN32_WINNT to require Windows 10 as the minimal runtime
> version supported, so this needs to be updated and backpatched.  The
> first two sentences can be simplified like this:
> -The native Windows port requires a 32 or 64-bit version of Windows
> -2000 or later. Earlier operating systems do
> -not have sufficient infrastructure (but Cygwin may be used on
> -those)

Re: Oversight in reparameterize_path_by_child leading to executor crash

2023-12-05 Thread Richard Guo
On Fri, Oct 20, 2023 at 2:52 AM Alena Rybakina 
wrote:

> Thank you for your work on the subject.
>
Thanks for taking an interest in this patch.


> While reviewing your patch I didn't understand why you are checking that
> the variable path, and not new_path, is of type T_SamplePath (I
> highlighted it)?
>
> Is it a typo, and should it be new_path?
>
I don't think there is any difference: path and new_path are the same
pointer in this case.


> Besides, it may not be important, but while reviewing your code I kept
> stumbling over the wording of this comment (I highlighted it too). This:
>
> * path_is_reparameterizable_by_child
> *     Given a path parameterized by the parent of the given child
> *     relation, see if it can be translated to be parameterized by the
> *     child relation.
> *
> * This must return true if *and only if* reparameterize_path_by_child()
> * would succeed on this path.
>
> Maybe it is better to rewrite it more simply:
>
> * This must return true *only if* reparameterize_path_by_child()
> * would succeed on this path.
>
I don't think so.  "if and only if" is more accurate to me.


> And can we add an assert to the reparameterize_pathlist_by_child function
> that pathlist is not NIL, because according to the comment it should
> hold there:
>
Hmm, I'm not sure, as in REPARAMETERIZE_CHILD_PATH_LIST we have already
explicitly checked that the pathlist is not NIL.
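
For context, the guard being referred to looks roughly like the
following (paraphrased, so treat the details as approximate): an empty
pathlist never reaches reparameterize_pathlist_by_child() in the first
place.

/*
 * Rough paraphrase of REPARAMETERIZE_CHILD_PATH_LIST: the list is only
 * handed to reparameterize_pathlist_by_child() when it is non-empty,
 * so an Assert inside that function would be redundant.
 */
#define REPARAMETERIZE_CHILD_PATH_LIST(pathlist) \
do { \
    if ((pathlist) != NIL) \
    { \
        (pathlist) = reparameterize_pathlist_by_child(root, (pathlist), \
                                                      child_rel); \
        if ((pathlist) == NIL) \
            return NULL; \
    } \
} while (0)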

Thanks
Richard


Re: PATCH: Add REINDEX tag to event triggers

2023-12-05 Thread Alexander Lakhin

Hi Michael,

05.12.2023 02:45, Michael Paquier wrote:

Popping a snapshot at this stage when there are no indexes was a
decision taken by the original commit, 5dc92b844e68, because we had no
need for one yet, but we may have now, depending on the function
triggered.  I have been looking at the whole stack, and something like
the attached, which makes the pop conditional, seems to be sufficient
to satisfy all the cases I've been able to come up with, including the
one reported here.

Alexander, do you see any hole in that?  Perhaps the snapshot push
should be more aggressive at the end of ReindexRelationConcurrently()
as well (in the last transaction when rebuilds happen)?


Thank you for the proposed fix!

I agree with it. I was a bit worried about ReindexRelationConcurrently()
becoming two-faced for callers (it can either leave the snapshot in
place or pop it), but I couldn't find a way to hide this duality inside
the function without adding more complexity. On the other hand,
ReindexRelationConcurrently() now satisfies EnsurePortalSnapshotExists()
in all cases.
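
To make the shape of the fix concrete for readers skimming the thread,
the pattern amounts to something like the minimal sketch below. The
function and flag names are illustrative, not the committed code;
PushActiveSnapshot(), PopActiveSnapshot() and GetTransactionSnapshot()
are the real snapshot-manager APIs:

#include "postgres.h"
#include "utils/snapmgr.h"

/* Hypothetical caller-controlled variant of the snapshot handling. */
static void
reindex_guts(bool keep_snapshot)
{
    PushActiveSnapshot(GetTransactionSnapshot());

    /* ... reindex work that requires an active snapshot ... */

    /*
     * Previously the snapshot was popped unconditionally here; making
     * the pop conditional lets a caller that still needs a snapshot
     * (e.g. an event trigger function run right afterwards) keep it.
     */
    if (!keep_snapshot)
        PopActiveSnapshot();
}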

Best regards,
Alexander




Re: Make COPY format extendable: Extract COPY TO format implementations

2023-12-05 Thread Junwang Zhao
On Wed, Dec 6, 2023 at 2:19 PM Sutou Kouhei  wrote:
>
> Hi,
>
> In 
>   "Re: Make COPY format extendable: Extract COPY TO format implementations" 
> on Wed, 6 Dec 2023 11:18:35 +0800,
>   Junwang Zhao  wrote:
>
> > For the modern formats(parquet, orc, avro, etc.), will they be
> > implemented as extensions or in core?
>
> I think that they should be implemented as extensions
> because they will depend on external libraries and may not
> be written in C. For example, C++ will be used for Apache
> Parquet because the official Apache Parquet C++
> implementation exists but a C implementation doesn't.
>
> (I can implement an extension for Apache Parquet after we
> complete this feature. I'll implement an extension for
> Apache Arrow with the official Apache Arrow C++
> implementation. And it's easy to convert Apache Arrow
> data to Apache Parquet with the official Apache Parquet
> implementation.)
>
> > The patch looks good except for a pair of extra curly braces.
>
> Thanks for the review! I've attached the v2 patch, which removes
> the extra curly braces for "if (isnull)".
>
For the extra curly braces, I mean the following code block in
CopyToFormatBinaryStart:

+ {                        <-- I thought this is useless?
+     /* Generate header for a binary copy */
+     int32 tmp;
+
+     /* Signature */
+     CopySendData(cstate, BinarySignature, 11);
+     /* Flags field */
+     tmp = 0;
+     CopySendInt32(cstate, tmp);
+     /* No header extension */
+     tmp = 0;
+     CopySendInt32(cstate, tmp);
+ }

>
> Thanks,
> --
> kou



-- 
Regards
Junwang Zhao




Re: speed up a logical replica setup

2023-12-05 Thread Euler Taveira
On Thu, Nov 9, 2023, at 8:12 PM, Michael Paquier wrote:
> On Thu, Nov 09, 2023 at 03:41:53PM +0100, Peter Eisentraut wrote:
> > On 08.11.23 00:12, Michael Paquier wrote:
> >> - Should the subdirectory pg_basebackup be renamed into something more
> >> generic at this point?  All these things are frontend tools that deal
> >> in some way with the replication protocol to do their work.  Say
> >> a replication_tools?
> > 
> > Seems like unnecessary churn.  Nobody has complained about any of the other
> > tools in there.
> 
> Not sure.  We rename things across releases in the tree from time to
> time, and here that's straightforward.

Based on this discussion it seems we have a consensus that this tool should be
in the pg_basebackup directory. (If/when we agree on the directory renaming,
it could be done in a separate patch.) Besides this move, v3 provides a
dry-run mode. It basically executes every routine but skips the steps that
would make modifications. It is a useful option to check whether you will be
able to run the tool without hitting issues with connectivity, permissions,
or existing objects (replication slots, publications, subscriptions). Tests
were slightly improved. Messages were changed so that INFO messages are *not*
emitted by default; --verbose emits INFO messages and --verbose --verbose
also emits DEBUG messages. I also refactored the connect_database() function
so that the connection always uses the logical replication mode. A bug in the
transient replication slot name was fixed. Ashutosh's review [1] was
incorporated. The code was also indented.

There are a few suggestions from Ashutosh [2] that I will reply in another
email.

I'm still planning to work on the following points:

1. improve the cleanup routine to point out leftover objects if there is any
   connection issue.
2. remove the physical replication slot if the standby is using one
   (primary_slot_name).
3. provide instructions to promote the logical replica to primary, that is,
   stop the replication between the nodes and remove the replication setup
   (publications, subscriptions, replication slots), or even include another
   action that does it. We could add both too.

Point 1 should be done. Points 2 and 3 aren't essential but will provide a
nicer UI for users who would like to use them.


[1] 
https://postgr.es/m/CAExHW5sCAU3NvPKd7msScQKvrBN-x_AdDQD-ZYAwOxuWG%3Doz1w%40mail.gmail.com
[2] 
https://postgr.es/m/caexhw5vhfemfvtuhe+7xwphvzjxrexz5h3dd4uqi7cwmdmj...@mail.gmail.com


--
Euler Taveira
EDB   https://www.enterprisedb.com/
From 003255b64910ce73f15931a43def25c37be96b81 Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Mon, 5 Jun 2023 14:39:40 -0400
Subject: [PATCH v3] Creates a new logical replica from a standby server

A new tool called pg_subscriber can convert a physical replica into a
logical replica. It runs on the target server and should be able to
connect to the source server (publisher) and the target server
(subscriber).

The conversion requires a few steps. Check that the target data
directory has the same system identifier as the source data directory.
Stop the target server if it is running as a standby server. Create one
replication slot per specified database on the source server. One
additional replication slot is created at the end to get the consistent
LSN (this consistent LSN will be used as (a) a stopping point for the
recovery process and (b) a starting point for the subscriptions). Write
recovery parameters into the target data directory and start the target
server (wait until the target server is promoted). Create one
publication (FOR ALL TABLES) per specified database on the source
server. Create one subscription per specified database on the target
server (use the replication slot and publication created in a previous
step; don't enable the subscriptions yet). Set the replication progress
to the consistent LSN obtained in a previous step. Enable the
subscription for each specified database on the target server. Remove
the additional replication slot that was used to get the consistent
LSN. Stop the target server. Change the system identifier of the
target server.
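
(Illustratively, the per-database part of this sequence boils down to a
handful of SQL commands driven over two libpq connections, as in the
sketch below. This is not code from the patch: the object names, the
connection string, the replication origin name and the LSN are all
made-up placeholders, and error handling is omitted.)

#include <libpq-fe.h>

static void
setup_one_database(PGconn *source, PGconn *target)
{
    /* On the source server: publish all tables. */
    PQclear(PQexec(source,
                   "CREATE PUBLICATION pg_subscriber_pub FOR ALL TABLES"));

    /*
     * On the target server: reuse the slot created earlier, skip the
     * initial copy, and keep the subscription disabled until the
     * replication progress has been set.
     */
    PQclear(PQexec(target,
                   "CREATE SUBSCRIPTION pg_subscriber_sub "
                   "CONNECTION 'host=source dbname=mydb' "
                   "PUBLICATION pg_subscriber_pub "
                   "WITH (create_slot = false, copy_data = false, "
                   "enabled = false, slot_name = 'pg_subscriber_slot')"));

    /* Set the replication progress to the consistent LSN. */
    PQclear(PQexec(target,
                   "SELECT pg_replication_origin_advance('pg_16384', "
                   "'0/3000060')"));

    PQclear(PQexec(target, "ALTER SUBSCRIPTION pg_subscriber_sub ENABLE"));
}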

Depending on your workload and database size, creating a logical replica
from scratch might not be an option due to resource constraints (the WAL
backlog must remain available until all table data is synchronized). The
initial data copy and catching up on replication tend to be faster on a
physical replica. The purpose of this tool is to speed up a logical
replica setup.
---
 doc/src/sgml/ref/allfiles.sgml                |    1 +
 doc/src/sgml/ref/pg_subscriber.sgml           |  284 +++
 doc/src/sgml/reference.sgml                   |    1 +
 src/bin/pg_basebackup/Makefile                |    8 +-
 src/bin/pg_basebackup/meson.build             |   19 +
 src/bin/pg_basebackup/pg_subscriber.c         | 1517 +
 src/bin/pg_basebackup/t/040_pg_subscriber.pl  |   44 +
 .../t/041_pg_subscriber_standby.pl            |  139 ++
 src/tools/msvc/Mkvcbuild.pm

Re: Make COPY format extendable: Extract COPY TO format implementations

2023-12-05 Thread Sutou Kouhei
Hi,

In 
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on 
Wed, 6 Dec 2023 15:11:34 +0800,
  Junwang Zhao  wrote:

> For the extra curly braces, I mean the following code block in
> CopyToFormatBinaryStart:
> 
> + {                        <-- I thought this is useless?
> +     /* Generate header for a binary copy */
> +     int32 tmp;
> +
> +     /* Signature */
> +     CopySendData(cstate, BinarySignature, 11);
> +     /* Flags field */
> +     tmp = 0;
> +     CopySendInt32(cstate, tmp);
> +     /* No header extension */
> +     tmp = 0;
> +     CopySendInt32(cstate, tmp);
> + }

Oh, I see. I've removed it and attached the v3 patch. In general,
I don't change variable names and so on in this patch; I just
move code around. But I also removed the "tmp" variable in this
case because I think that name isn't suitable for a larger scope.
(I think "tmp" is acceptable in a small scope like the code above.)

New code:

/* Generate header for a binary copy */
/* Signature */
CopySendData(cstate, BinarySignature, 11);
/* Flags field */
CopySendInt32(cstate, 0);
/* No header extension */
CopySendInt32(cstate, 0);


Thanks,
-- 
kou
>From 9fe0087d9a6a79a7d1a7d0af63eb16abadbf0d4a Mon Sep 17 00:00:00 2001
From: Sutou Kouhei 
Date: Mon, 4 Dec 2023 12:32:54 +0900
Subject: [PATCH v3] Extract COPY TO format implementations

This is a part of making COPY format extendable. See also these past
discussions:
* New Copy Formats - avro/orc/parquet:
  https://www.postgresql.org/message-id/flat/20180210151304.fonjztsynewldfba%40gmail.com
* Make COPY extendable in order to support Parquet and other formats:
  https://www.postgresql.org/message-id/flat/CAJ7c6TM6Bz1c3F04Cy6%2BSzuWfKmr0kU8c_3Stnvh_8BR0D6k8Q%40mail.gmail.com

This doesn't change the current behavior. It just introduces
CopyToFormatOps, which holds function pointers for a format
implementation (similar to TupleTableSlotOps), and uses it for the
existing "text", "csv" and "binary" format implementations.

Note that CopyToFormatOps can't be used from extensions yet because the
CopySend*() functions aren't exported yet. Extensions can't send
formatted data to a destination without CopySend*(). These functions
will be exported by subsequent patches.

Here is a benchmark result with/without this change because there was
a discussion that we should care about performance regression:

https://www.postgresql.org/message-id/3741749.1655952719%40sss.pgh.pa.us

> I think that step 1 ought to be to convert the existing formats into
> plug-ins, and demonstrate that there's no significant loss of
> performance.

You can see that there is no significant loss of performance:

Data: Random 32 bit integers:

CREATE TABLE data (int32 integer);
INSERT INTO data
  SELECT random() * 1
    FROM generate_series(1, ${n_records});

The number of records: 100K, 1M and 10M

100K without this change:

format,elapsed time (ms)
text,22.527
csv,23.822
binary,24.806

100K with this change:

format,elapsed time (ms)
text,22.919
csv,24.643
binary,24.705

1M without this change:

format,elapsed time (ms)
text,223.457
csv,233.583
binary,242.687

1M with this change:

format,elapsed time (ms)
text,224.591
csv,233.964
binary,247.164

10M without this change:

format,elapsed time (ms)
text,2330.383
csv,2411.394
binary,2590.817

10M with this change:

format,elapsed time (ms)
text,2231.307
csv,2408.067
binary,2473.617
---
 src/backend/commands/copy.c   |   8 +
 src/backend/commands/copyto.c | 377 --
 src/include/commands/copy.h   |  27 ++-
 3 files changed, 256 insertions(+), 156 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfad47b562..27a1add456 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -427,6 +427,8 @@ ProcessCopyOptions(ParseState *pstate,
 
     opts_out->file_encoding = -1;
 
+    /* Text is the default format. */
+    opts_out->to_ops = CopyToFormatOpsText;
     /* Extract options from the statement node tree */
     foreach(option, options)
     {
@@ -442,9 +444,15 @@ ProcessCopyOptions(ParseState *pstate,
             if (strcmp(fmt, "text") == 0)
                  /* default format */ ;
             else if (strcmp(fmt, "csv") == 0)
+            {
                 opts_out->csv_mode = true;
+                opts_out->to_ops = CopyToFormatOpsCSV;
+            }
             else if (strcmp(fmt, "binary") == 0)
+            {
                 opts_out->binary = true;
+                opts_out->to_ops = CopyToFormatOpsBinary;
+            }
             else
                 ereport(ERROR,
                         (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c
index c66a047c4a..8f51090a03 100644
--- a/src/backend/commands/copyto.c
+++ b/src/backend/commands/copyto.c
@@ -131,6 +131,228 @@ static void CopySendEndOfRow(CopyToState cstate);
 static void CopySendInt32(CopyToState cstate, int32 val);
 static void CopySendInt16(CopyToState cstate, int16 val);
 
+/*
+ * CopyToFormatOps implementations.
+ */
+
+/*
+ * CopyToFormatOps implementat

Re: Oversight in reparameterize_path_by_child leading to executor crash

2023-12-05 Thread Richard Guo
I've self-reviewed this patch again and I think it's now in a
committable state.  I'm wondering if we can mark it as 'Ready for
Committer' now, or whether we need more review comments/feedback.

To recap, this patch postpones reparameterization of paths until
createplan.c, which would help avoid building the reparameterized
expressions we might not use.  More importantly, it makes it possible to
modify the expressions in RTEs (e.g. tablesample) and in RelOptInfos
(e.g. baserestrictinfo) for reparameterization.  Failing to do that can
lead to executor crashes, wrong results, or planning errors, as we have
already seen.

Thanks
Richard