[PATCH] Faster, easier valgrind runs with make USE_VALGRIND=1 check

2021-05-05 Thread Craig Ringer
Hi all

The attached patch adds support for running any temp-install based tests
(check, isolationcheck, src/test/recovery, etc) under the control of
valgrind with a simple

 make USE_VALGRIND=1 check

It's based on a script I've been using for some time to run faster, simpler
Valgrind checks on the codebase with much less irrelevant noise than the
usual approaches.

There are no C code changes at all in this patch, it only touches
Makefile.global and adds a new src/tools/valgrind_wrapper tool.

When you specify USE_VALGRIND=1, the PATH used by $(with_temp_install) is
prefixed with a tmp_install/bin_valgrind_wrapper/ directory. Each binary in
$(bindir) gets a corresponding wrapper script in bin_valgrind_wrapper in
the temp install. The wrappers intercept execution of every command in the
bindir and exec it under the control of valgrind (or skip valgrind and
exec the target directly, if desired).

This has many advantages over the usual approaches of an installcheck-based
valgrind run or "valgrind make check":

* There's no custom setup, it works out of the box
* It produces much less irrelevant log output and runs a lot faster because
it only runs postgres-related binaries under valgrind, not unrelated noise
like perl interpreters, make, shell scripts, etc.
* It's much more targeted and selective - if you're only interested in some
extension or new backend feature, you can trivially set it to target just
the backend, skip checking of initdb, and skip checking of psql, so you get
more relevant log output and faster runs.

I'll follow up to this post with some timing and log output numbers but
wanted to share what I had first.

-DUSE_VALGRIND is also added to CFLAGS at compile time when USE_VALGRIND=1
is passed to make. This gets rid of the need to hack pg_config_manual.h or
fiddle with configure re-runs when you want to build with valgrind support.
Arguably it'd be better to add a --enable-valgrind option to configure. LMK
if that's preferable.

Note that there's a bit of a hack in the wrapper script to work around
Valgrind's inability to set the argv[0] of a process run under valgrind to
anything other than the exact command-name to be executed. I have a patch
for valgrind pending to add that capability (like "exec -a" in bash) but a
workaround is necessary for now. It's made a bit more complicated by
PostgreSQL's determination to canonicalize paths and follow symlinks in
find_my_exec(). The script's hardlink based workarounds for this could be
removed if we could agree to support a debug env-var or command-line option
that could be used to supply an override path to be returned by
find_my_exec() instead of performing normal discovery. If you'd prefer that
approach to the current workaround in the script let me know.

I'm also willing to add valgrind-support-detection logic that will cause
valgrind launched via "make USE_VALGRIND=1" to refuse to run if it detects
that the target postgres was not built with -DUSE_VALGRIND for proper
instrumentation. This can be done with the valgrind --require-text-symbol
option and a dummy export symbol added to the symbol table only when
compiled with -DUSE_VALGRIND. If that's desirable let me know, it should be
quick to add.

You can find more detail in the patch commit message (attached) and in the
src/test/valgrind_wrapper script it adds. If you're wondering why the
valgrind options --trace-children=yes --trace-children-skip=pattern
--trace-children-skip-by-arg=pattern don't solve this problem, read the
script's comments.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise
From e730e862a0731dc3d2786c5004a146aff7dd6bf7 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Tue, 4 May 2021 10:34:01 +0800
Subject: [PATCH v3] Make Valgrind runs simpler with make USE_VALGRIND=1

To run Valgrind based tests on postgres, it's generally been necessary
to either:

* Initdb and start your own cluster under the control of valgrind then
  use "make installcheck". This works won't work with TAP tests, and
  it's cumbersome to set up and run.
  -or-
* Run "make check" under the control of valgrind. This approach
  runs all the uninteresting processes under valgrind, with all
  the associated overhead. Every make, shell, utility command,
  perl interpreter, etc gets run under valgrind, which slows
  everything down a lot and produces a great deal of uninteresting
  valgrind output. There's no practical way to target just the
  server backend this way.

Provide a faster, simpler and more targeted way to run valgrind on
postgres by adding support for a "USE_VALGRIND=1" parameter to
the makefiles. This has two main effects:

* It adds -DUSE_VALGRIND to CFLAGS, so enhanced valgrind runtime
  support is compiled into postgres; and
* It interposes a wrapper script before all executions of any binary
  installed in the $(bindir) of a temp-install. The wrapper script
  invokes the real binary under valgrind.

The valgrind i

Re: Small issues with CREATE TABLE COMPRESSION

2021-05-05 Thread Dilip Kumar
On Wed, May 5, 2021 at 11:02 AM Dilip Kumar  wrote:
>
> On Wed, May 5, 2021 at 12:06 AM Robert Haas  wrote:
> >
>
> > > There are no tests in pg_dump to make sure that some ALTER
> > > MATERIALIZED VIEW or ALTER TABLE commands are generated when the
> > > compression of a matview's or table's column is changed.
> >
> > True, but it does seem to work. I am happy if you or anyone want to
> > write some tests.
>
> I think it will be really hard to generate such a test in pg_dump,
> because default we are compiling --without-lz4, which means we have
> only one compression option available, and if there is only one option
> available then the column compression method and the default
> compression method will be same so the dump will not generate an extra
> command of type ALTER TABLE... SET COMPRESSION.

I think we already have such test cases, at least through pg_upgrade.
Basically, if you look at compression.sql, we are not dropping the tables,
so that pg_upgrade can dump and test them.  So if the tests are run
--with-lz4, then in the pg_upgrade dump we can see ALTER TABLE... SET
COMPRESSION commands.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




RE: Use simplehash.h instead of dynahash in SMgr

2021-05-05 Thread Jakub Wartak
Hi David, Alvaro, -hackers 

> Hi David,
> 
> You're probably aware of this, but just to make it explicit: Jakub Wartak was
> testing performance of recovery, and one of the bottlenecks he found in
> some of his cases was dynahash as used by SMgr.  It seems quite possible
> that this work would benefit some of his test workloads.

I might be a little bit out of the loop, but as Alvaro stated - Thomas did
plenty of excellent work related to recovery performance in that thread. In my
humble opinion, and if I'm not mistaken (I'm speculating here), it might *not* be
how the Smgr hash works, but how often it is being exercised, and that would also
explain the relatively lower than expected(?) gains here. There are at least two
very important emails from him that I'm aware of that touch on the topic of
reordering/compacting/batching calls to Smgr:
https://www.postgresql.org/message-id/CA%2BhUKG%2B2Vw3UAVNJSfz5_zhRcHUWEBDrpB7pyQ85Yroep0AKbw%40mail.gmail.com
https://www.postgresql.org/message-id/flat/CA%2BhUKGK4StQ%3DeXGZ-5hTdYCmSfJ37yzLp9yW9U5uH6526H%2BUeg%40mail.gmail.com

Another potential option that we've discussed is that the redo generation
itself is likely a brake on efficient recovery performance today (e.g.
INSERT-SELECT on a table with indexes generates interleaved WAL records that
often touch a limited set of blocks, which usually puts Smgr into the spotlight).

-Jakub Wartak.




Re: few ideas for pgbench

2021-05-05 Thread Fabien COELHO



Hello Pavel,

> This is not a simple question. Personally I prefer to show this info every
> time, although it can be redundant. Just for check and for more simple
> automatic processing.
>
> When I run pgbench, I usually work with more releases together, so the
> server version is important info.

Ok. Yes.

>>> 2. can ve generate some output in structured format - XML, JSON ?
>>
>> It is obviously possible, but that would mean some code. ISTM that the
>> various outputs are easy enough to parse and convert to anything without
>> needing a special format? Is there some particular part you have in mind?
>>
>
> I thought about something what I can simply import to  Postgres or to R.
> But maybe XML or JSON is a bad idea.
>
> What about CSV? Any run can produce one row.

Yep, CSV is simple and nice. It depends on what information you would
like. For instance for progress report (-P 1) or logs/sampling (-l) would
be relevant candidates for CSV. Not so much for the final report, though.


--
Fabien.




Re: few ideas for pgbench

2021-05-05 Thread Pavel Stehule
st 5. 5. 2021 v 11:55 odesílatel Fabien COELHO  napsal:

>
> Hello Pavel,
>
> > This is not a simple question. Personally I prefer to show this info
> every
> > time, although it can be redundant. Just for check and for more simple
> > automatic processing.
> >
> > When I run pgbench, I usually work with more releases together, so the
> > server version is important info.
>
> Ok. Yes.
>
> >>> 2. can ve generate some output in structured format - XML, JSON ?
> >>
> >> It is obviously possible, but that would mean some code. ISTM that the
> >> various outputs are easy enough to parse and convert to anything without
> >> needing a special format? Is there some particular part you have in
> mind?
> >>
> >
> > I thought about something what I can simply import to  Postgres or to R.
> > But maybe XML or JSON is a bad idea.
> >
> > What about CSV? Any run can produce one row.
>
> Yep, CSV is simple and nice. It depends on what information you would
> like. For instance for progress report (-P 1) or logs/sampling (-l) would
> be relevant candidates for CSV. Not so much for the final report, though.
>

I think it could contain almost all the information. We have to ensure
consistency of the columns.

The basic usage can be

for 
do
  pgbench ... >> logfile
done

and the log file can look like

start time, rowno, serverver, clientver, connections, scale, readonly,
jobs, tps, latency, ...

The header row can be optional
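
To illustrate the "simply import to Postgres" side, here is a minimal sketch
of loading such a one-row-per-run CSV log; the table name and the exact column
types are made up to match the column list above:

CREATE TABLE pgbench_runs (
    start_time  timestamptz,
    rowno       int,
    serverver   text,
    clientver   text,
    connections int,
    scale       int,
    readonly    boolean,
    jobs        int,
    tps         numeric,
    latency_ms  numeric
);

-- assuming the optional header row is written only once at the top of the file
COPY pgbench_runs FROM '/tmp/pgbench_runs.csv' WITH (FORMAT csv, HEADER true);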

>
> --
> Fabien.
>


Re: Small issues with CREATE TABLE COMPRESSION

2021-05-05 Thread Michael Paquier
On Wed, May 05, 2021 at 01:41:03PM +0530, Dilip Kumar wrote:
> I think we already have such test cases at least through pg_upgrade.
> Basically, if you see in compression.sql we are not dropping the table
> so that pg_upgrade and dump them and test.  So if test run --with-lz4
> then in pg_upgrade dump we can see ALTER TABLE... SET COMPRESSION type
> of commands.

The TAP tests of pg_dump are much more picky than what pg_upgrade is
able to do.  With the existing set of tests in place, what you are
able to detect is that pg_upgrade does not *break* if there are tables
with attributes using various compression types, but that would not be
enough to make sure that the correct compression method is *applied*
depending on the context expected (default_toast_compression + the
attribute compression + pg_dump options), which is what the TAP tests
of pg_dump are able to correctly detect if extended in an appropriate
way.

With what's on HEAD, we would easily miss any bugs introduced in
pg_dump that change the set of commands generated depending on the
options given by a user, but still allow pg_upgrade to work correctly.
For example, there could be issues where we finish by setting up the
incorrect compression option, with pg_upgrade happily finishing.
There is a gap in the test coverage here.
--
Michael




RFC: Detailed reorder buffer stats dumps

2021-05-05 Thread Craig Ringer
Hi all

I'm thinking of piggy-backing on the approach used in the "Get memory
contexts of an arbitrary backend process" patch in order to provide access
to detailed reorder buffer content statistics from walsenders on request.

Right now the reorder buffer is mostly a black-box. I mostly rely on gdb or
on dynamic probes (perf, systemtap) to see what it's doing. I intend a
patch soon to add a couple of fields to struct WalSnd to report some very
coarse reorder buffer stats - at least oldest buffered xid, number of
buffered txns, total bytes of buffered txns in memory, total bytes of
buffered txns spilled to disk.

But sometimes what I really want is details on the txns that're in the
reorder buffer, and that's not feasible to export via always-enabled
reporting like struct WalSnd. So I'm thinking that the same approach used
for the memory context stats patch might work well for asking the walsender
for a detailed dump of reorder buffer contents. Something like a
per-buffered-txn table of txn topxid, start-lsn, most recent change lsn,
number of changes, number of subxids, number of invalidations, number of
catalog changes, buffer size in memory, buffer size spilled to disk.
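
As a purely hypothetical sketch of the user-facing side, modelled on the
request function added by the memory context stats patch, it could look
something like this (the function names and result columns below are
invented, not a concrete proposal):

-- hypothetical: ask the walsender with PID 12345 to write a detailed dump
-- of its reorder buffer contents to the server log
SELECT pg_log_walsender_reorder_buffer(12345);

-- or, equally hypothetical, a set-returning function with one row per
-- buffered transaction:
-- SELECT * FROM pg_get_reorder_buffer_stats(12345);
-- returning (topxid, start_lsn, last_change_lsn, nchanges, nsubxids,
--            ninvalidations, ncatalog_changes, mem_bytes, spilled_bytes)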

Anyone drastically opposed to the idea?

(I know I have to finish up with the LWLock tracepoint patchset first; this
is an RFC at this stage).

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2021-05-05 Thread Pavel Luzanov

Hello,

I found this in the documentation, section '5.11.3. Partitioning Using 
Inheritance'[1]:
"Some operations require a stronger lock when using declarative 
partitioning than when using table inheritance. For example, removing a 
partition from a partitioned table requires taking an ACCESS EXCLUSIVE 
lock on the parent table, whereas a SHARE UPDATE EXCLUSIVE lock is 
enough in the case of regular inheritance."


This point is no longer valid with some restrictions. If the table has a 
default partition, then removing a partition still requires taking an 
ACCESS EXCLUSIVE lock.
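
For reference, the concurrent form looks like this (table names made up); as
far as I understand it takes only a SHARE UPDATE EXCLUSIVE lock on the
partitioned table, cannot be run inside a transaction block, and is rejected
when a default partition exists:

ALTER TABLE measurements DETACH PARTITION measurements_2020 CONCURRENTLY;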


Maybe it makes sense to add some details about DETACH CONCURRENTLY to the
section '5.11.2.2. Partition Maintenance' and completely remove this point?


1. 
https://www.postgresql.org/docs/devel/ddl-partitioning.html#DDL-PARTITIONING-USING-INHERITANCE


Pavel Luzanov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company




Re: Small issues with CREATE TABLE COMPRESSION

2021-05-05 Thread Dilip Kumar
On Wed, May 5, 2021 at 4:00 PM Michael Paquier  wrote:
>
> On Wed, May 05, 2021 at 01:41:03PM +0530, Dilip Kumar wrote:
> > I think we already have such test cases at least through pg_upgrade.
> > Basically, if you see in compression.sql we are not dropping the table
> > so that pg_upgrade and dump them and test.  So if test run --with-lz4
> > then in pg_upgrade dump we can see ALTER TABLE... SET COMPRESSION type
> > of commands.
>
> The TAP tests of pg_dump are much more picky than what pg_upgrade is
> able to do.  With the existing set of tests in place, what you are
> able to detect is that pg_upgrade does not *break* if there are tables
> with attributes using various compression types, but that would not be
> enough to make sure that the correct compression method is *applied*
> depending on the context expected (default_toast_compression + the
> attribute compression + pg_dump options), which is what the TAP tests
> of pg_dump are able to correctly detect if extended in an appropriate
> way.

 Okay, got your point.

> With what's on HEAD, we would easily miss any bugs introduced in
> pg_dump that change the set of commands generated depending on the
> options given by a user, but still allow pg_upgrade to work correctly.
> For example, there could be issues where we finish by setting up the
> incorrect compression option, with pg_upgrade happily finishing.
> There is a gap in the test coverage here.

Basically, the problem is that the default compilation is --without-lz4, so
by default there is only one compression method, and with only one
compression method we cannot generate the test case you asked for,
because that will be the default compression method and we don't dump
the default compression method.

So basically, if we have to write this test case in pg_dump then we
will have to use lz4, which means it will generate different output
--with-lz4 vs --without-lz4.  With a simple regress test it is easy to
deal with such cases by keeping multiple .out files, but I am not sure
whether we can do this easily with a pg_dump test without adding much
complexity.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Use simplehash.h instead of dynahash in SMgr

2021-05-05 Thread David Rowley
Hi Jakub,

On Wed, 5 May 2021 at 20:16, Jakub Wartak  wrote:
> I might be a little bit out of the loop, but as Alvaro stated - Thomas did 
> plenty of excellent job related to recovery performance in that thread. In my 
> humble opinion and if I'm not mistaken (I'm speculating here) it might be 
> *not* how Smgr hash works, but how often it is being exercised and that would 
> also explain relatively lower than expected(?) gains here. There are at least 
> two very important emails from him that I'm aware that are touching the topic 
> of reordering/compacting/batching calls to Smgr:
> https://www.postgresql.org/message-id/CA%2BhUKG%2B2Vw3UAVNJSfz5_zhRcHUWEBDrpB7pyQ85Yroep0AKbw%40mail.gmail.com
> https://www.postgresql.org/message-id/flat/CA%2BhUKGK4StQ%3DeXGZ-5hTdYCmSfJ37yzLp9yW9U5uH6526H%2BUeg%40mail.gmail.com

I'm not much of an expert here and I didn't follow the recovery
prefetching stuff closely. So, with that in mind, I think there is a
lot that could be done along the lines of what Thomas is mentioning.
Batching WAL records up by filenode then replaying each filenode one
by one when our batching buffer is full.  There could be some sort of
parallel options there too, where workers replay a filenode each.
However, that wouldn't really work for recovery on a hot-standby
though.  We'd need to ensure we replay the commit record for each
transaction last. I think you'd have to batch by filenode and
transaction in that case.  Each batch might be pretty small on a
typical OLTP workload, so it might not help much there, or it might
hinder.

But having said that, I don't think any of those possibilities should
stop us speeding up smgropen().

> Another potential option that we've discussed is that the redo generation 
> itself is likely a brake of efficient recovery performance today (e.g. 
> INSERT-SELECT on table with indexes, generates interleaved WAL records that 
> touch often limited set of blocks that usually put Smgr into spotlight).

I'm not quite sure if I understand what you mean here.  Is this
queuing up WAL records up during transactions and flush them out to
WAL every so often after rearranging them into an order that's more
optimal for replay?

David




Re: Performance degradation of REFRESH MATERIALIZED VIEW

2021-05-05 Thread Tomas Vondra




On 4/27/21 8:22 PM, Andres Freund wrote:
> Hi,
>
> On 2021-04-28 00:44:47 +0900, Masahiko Sawada wrote:
>> On Wed, Apr 28, 2021 at 12:26 AM Masahiko Sawada  wrote:
>>>
>>> What Andres is suggesting (I think) is to modify ExecInsert() to pass a
>>> valid bistate to table_tuple_insert, instead of just NULL, and store the
>>> vmbuffer in it.
>>
>> Understood. This approach keeps using the same vmbuffer until we need
>> another vm page corresponding to the target heap page, which seems
>> better.
>>
>> But how is ExecInsert() related to REFRESH MATERIALIZED VIEW?
>
> I was thinking of the CONCURRENTLY path for REFRESH MATERIALIZED VIEW I
> think. Or something.
>
> That actually makes it easier - we already pass in a bistate in the relevant
> paths. So if we add a current_vmbuf to BulkInsertStateData, we can avoid
> needing to pin so often. It seems that'd end up with a good bit cleaner and
> less risky code than the skip_vmbuffer_for_frozen_tuple_insertion_v3.patch
> approach.
>
> The current RelationGetBufferForTuple() interface / how it's used in heapam.c
> doesn't make this quite as trivial as it could be... Attached is a quick hack
> implementing this. For me it reduces the overhead noticably:
>
> REFRESH MATERIALIZED VIEW mv;
> before:
> Time: 26542.333 ms (00:26.542)
> after:
> Time: 23105.047 ms (00:23.105)



Thanks, that looks promising. I repeated the tests I did on 26/4, and 
the results look like this:


old (0c7d3bb99): 497ms
master:  621ms
patched: 531ms

So yeah, that's a big improvement - it does not remove the regression
entirely, but +5% is much better than +25%.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: RFC: Detailed reorder buffer stats dumps

2021-05-05 Thread Ashutosh Bapat
On Wed, May 5, 2021 at 4:03 PM Craig Ringer  wrote:
>
> Hi all
>
> I'm thinking of piggy-backing on the approach used in the "Get memory 
> contexts of an arbitrary backend process" patch in order to provide access to 
> detailed reorder buffer content statistics from walsenders on request.
>
> Right now the reorder buffer is mostly a black-box. I mostly rely on gdb or 
> on dynamic probes (perf, systemtap) to see what it's doing. I intend a patch 
> soon to add a couple of fields to struct WalSnd to report some very coarse 
> reorder buffer stats - at least oldest buffered xid, number of buffered txns, 
> total bytes of buffered txns in memory, total bytes of buffered txns spilled 
> to disk.
>
> But sometimes what I really want is details on the txns that're in the 
> reorder buffer, and that's not feasible to export via always-enabled 
> reporting like struct WalSnd. So I'm thinking that the same approach used for 
> the memory context stats patch might work well for asking the walsender for a 
> detailed dump of reorder buffer contents. Something like a per-buffered-txn 
> table of txn topxid, start-lsn, most recent change lsn, number of changes, 
> number of subxids, number of invalidations, number of catalog changes, buffer 
> size in memory, buffer size spilled to disk.
>

+1.

Will be really useful to troubleshoot what's going on in the ReorderBuffer.

If we put that in WalSnd it will not be useful for connections that
use normal backends to get logical changes through the built-in
functions.
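
For example, a client consuming changes like this (slot name invented) never
goes through a walsender, so WalSnd-based stats would not cover it:

SELECT * FROM pg_logical_slot_get_changes('my_slot', NULL, NULL);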
-- 
Best Wishes,
Ashutosh Bapat




Re: Small issues with CREATE TABLE COMPRESSION

2021-05-05 Thread Robert Haas
On Wed, May 5, 2021 at 7:09 AM Dilip Kumar  wrote:
> So basically, if we have to write this test case in pg_dump then we
> will have to use lz4 which means it will generate different output
> --with-lz4 vs --without-lz4.  With a simple regress test it easy to
> deal with such cases by keeping multiple .out files but I am not sure
> can we do this easily with pg_dump test without adding much
> complexity?

TAP tests have a facility for conditionally skipping tests; see
perldoc Test::More. That's actually superior to what you can do with
pg_regress. We'd need to come up with some logic to determine when to
skip or not, though. Perhaps the easiest method would be to have the
relevant Perl script try to create a table with an lz4 column. If that
works, then perform the LZ4-based tests. If it fails, check the error
message. If it says something about LZ4 not being supported by this build,
skip those tests. If it says anything else, die.
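
Something along these lines, i.e. the Perl script could run a probe statement
like the following and branch on the outcome (table name made up; the exact
error wording would need to be checked against a build without lz4):

-- succeeds in a --with-lz4 build; otherwise fails with an error saying
-- the compression method is not supported
CREATE TABLE lz4_probe (t text COMPRESSION lz4);
DROP TABLE lz4_probe;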

-- 
Robert Haas
EDB: http://www.enterprisedb.com




RE: Use simplehash.h instead of dynahash in SMgr

2021-05-05 Thread Jakub Wartak
Hey David,

> I think you'd have to batch by filenode and transaction in that case. Each 
> batch might be pretty small on a typical OLTP workload, so it might not help 
> much there, or it might hinder.

True, it is very workload dependent (I was mainly chasing multi-value INSERTs and
INSERT-SELECTs that often hit the same $block), certainly not OLTP. I would even
say that INSERT-as-SELECT would be more suited to DWH-like processing.

> But having said that, I don't think any of those possibilities should stop us 
> speeding up smgropen().

Of course! I've tried a couple of much smaller ideas, but without any big
gains. I was able to squeeze out something like 300-400k function calls per second
(WAL records/s); that was the point, I think, where smgropen() got abused.

> > Another potential option that we've discussed is that the redo generation
> itself is likely a brake of efficient recovery performance today (e.g. INSERT-
> SELECT on table with indexes, generates interleaved WAL records that touch
> often limited set of blocks that usually put Smgr into spotlight).
> 
> I'm not quite sure if I understand what you mean here.  Is this queuing up
> WAL records up during transactions and flush them out to WAL every so
> often after rearranging them into an order that's more optimal for replay?

Why not both? 😉 We were very concentrated on the standby side, but on the primary
side one could also change how WAL records are generated:

1) Minimization of records targeting the same repeated $block, e.g. the Heap2
table_multi_insert() API already does this, and it matters for generating a more
optimal stream for replay:

postgres@test=# create table t (id bigint primary key);
postgres@test=# insert into t select generate_series(1, 10);

results in many calls, due to interleaving heap with btree records for the same
block from the Smgr perspective (this is especially visible on highly indexed
tables) =>
rmgr: Btree   len (rec/tot): 64/64, tx:   17243284, lsn: 
4/E7000108, prev 4/E7A0, desc: INSERT_LEAF off 1, blkref #0: rel 
1663/16384/32794 blk 1
rmgr: Heaplen (rec/tot): 63/63, tx:   17243284, lsn: 
4/E7000148, prev 4/E7000108, desc: INSERT off 2 flags 0x00, blkref #0: rel 
1663/16384/32791 blk 0
rmgr: Btree   len (rec/tot): 64/64, tx:   17243284, lsn: 
4/E7000188, prev 4/E7000148, desc: INSERT_LEAF off 2, blkref #0: rel 
1663/16384/32794 blk 1
rmgr: Heaplen (rec/tot): 63/63, tx:   17243284, lsn: 
4/E70001C8, prev 4/E7000188, desc: INSERT off 3 flags 0x00, blkref #0: rel 
1663/16384/32791 blk 0
rmgr: Btree   len (rec/tot): 64/64, tx:   17243284, lsn: 
4/E7000208, prev 4/E70001C8, desc: INSERT_LEAF off 3, blkref #0: rel 
1663/16384/32794 blk 1
rmgr: Heaplen (rec/tot): 63/63, tx:   17243284, lsn: 
4/E7000248, prev 4/E7000208, desc: INSERT off 4 flags 0x00, blkref #0: rel 
1663/16384/32791 blk 0
rmgr: Btree   len (rec/tot): 64/64, tx:   17243284, lsn: 
4/E7000288, prev 4/E7000248, desc: INSERT_LEAF off 4, blkref #0: rel 
1663/16384/32794 blk 1
rmgr: Heaplen (rec/tot): 63/63, tx:   17243284, lsn: 
4/E70002C8, prev 4/E7000288, desc: INSERT off 5 flags 0x00, blkref #0: rel 
1663/16384/32791 blk 0
[..]
Similar stuff happens for UPDATE. It basically prevents the recent-buffer
optimization that avoids repeated calls to smgropen().

And here's a sample of the already existing table_multi_insert v2 API (Heap2), with
the obvious elimination of unnecessary individual calls to smgropen() via one big
MULTI_INSERT instead (for CTAS/COPY/REFRESH MV):
postgres@test=# create table t (id bigint primary key);
postgres@test=# copy (select generate_series (1, 10)) to '/tmp/t';
postgres@test=# copy t from '/tmp/t';
=>
rmgr: Heap2   len (rec/tot):210/   210, tx:   17243290, lsn: 
4/E928, prev 4/E8004410, desc: MULTI_INSERT+INIT 10 tuples flags 0x02, 
blkref #0: rel 1663/16384/32801 blk 0
rmgr: Btree   len (rec/tot):102/   102, tx:   17243290, lsn: 
4/E9000100, prev 4/E928, desc: NEWROOT lev 0, blkref #0: rel 
1663/16384/32804 blk 1, blkref #2: rel 1663/16384/32804 blk 0
rmgr: Btree   len (rec/tot): 64/64, tx:   17243290, lsn: 
4/E9000168, prev 4/E9000100, desc: INSERT_LEAF off 1, blkref #0: rel 
1663/16384/32804 blk 1
rmgr: Btree   len (rec/tot): 64/64, tx:   17243290, lsn: 
4/E90001A8, prev 4/E9000168, desc: INSERT_LEAF off 2, blkref #0: rel 
1663/16384/32804 blk 1
rmgr: Btree   len (rec/tot): 64/64, tx:   17243290, lsn: 
4/E90001E8, prev 4/E90001A8, desc: INSERT_LEAF off 3, blkref #0: rel 
1663/16384/32804 blk 1
[..]
Here the Btree activity is very localized (at least when concurrent sessions are not
generating WAL) and it enables Thomas's recent-buffer optimization to kick in.

DELETE is much simpler (thanks to not churning out those Btree records) and, also
thanks to Thomas's recent-buffer optimization, should theoretically put much less
stress on smgropen() already:
rmgr: Heaplen (rec/tot): 54/54, tx:   17243

Re: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-05-05 Thread Robert Haas
On Tue, May 4, 2021 at 11:47 PM Greg Nancarrow  wrote:
> Problem is, for built-in functions, the changes are allowed, but for
> some properties (like strict) the allowed changes don't actually take
> effect (this is what Amit was referring to - so why allow those
> changes?).
> It's because some of the function properties are cached in
> FmgrBuiltins[] (for a "fast-path" lookup for built-ins), according to
> their state at build time (from pg_proc.dat), but ALTER FUNCTION is
> just changing it in the system catalogs. Also, with sufficient
> privileges, a built-in function can be redefined, yet the original
> function (whose info is cached in FmgrBuiltins[]) is always invoked,
> not the newly-defined version.

I agree. I think that's not ideal. I think we should consider putting
some more restrictions on updating system catalog changes, and I also
think that if we can get out of having strict need to be part of
FmgrBuiltins[] that would be good. But what I don't agree with is the
idea that since strict already has this problem, it's OK to do the
same thing with parallel-safety. That seems to me to be making a bad
situation worse, and I can't see what problem it actually solves.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: MaxOffsetNumber for Table AMs

2021-05-05 Thread Robert Haas
On Tue, May 4, 2021 at 9:24 PM Peter Geoghegan  wrote:
> Here is my concern: I have an obligation to make it clear that I think
> that you really ought to straighten out this business with
> generalizing TIDs before too long. Not because I say so, but because
> it's holding up progress in general. If you aren't getting cooperation
> from people who work on indexing (could be somebody else), then
> consider the possibility that this business with TIDs and bitmap scans
> has a lot to do with it. Most people are not as outspoken as I am.

It seems to me that we're doing a lot of disagreeing given that, as I
see it, there are only relatively minor differences between the
positions of the various people here. Andres and I are, I think,
relatively convinced that variable-width TIDs would let us do things
that aren't otherwise possible, and that those things are potentially
useful and I would even venture to say cool. I don't believe you
disagree with that, but you think it's going to be too much work to
implement. Fair enough; anyone can try it who is interested and see
how far they get. Anyone who thinks it's going to be impossibly hard
probably will prefer not to try, and that's OK too.

But if we take that off the table, what about a less-ambitious
generalization of the TID mechanism? I can't really see anyone putting
up a serious argument against allowing all 48 bits of space available
in the existing TID format to be used which, as Jeff points out, is
not currently the case. So anyone who wants to try to write that patch
is free to do so. I don't have a clear idea how to make that work, to
be honest, but my limited supply of ideas need not prevent anyone else
from trying theirs.

There might be some slight disagreement about whether it's useful to
generalize TIDs from a 48-bit address space to a 64-bit address space
without making it fully general. Like Andres, I am unconvinced that's
meaningfully easier, and I am convinced that it's meaningfully less
good, but other people can disagree and that's fine. I'm perfectly
willing to change my opinion if somebody shows up with a patch that
demonstrates the value of this approach.

The main point here is one that I think you made a few emails back:
the limitations of the current system are defined by what will
actually work with the code as it exists today, not some mailing list
discussion. It's too early for the project to commit to stability in
this area; we have not managed to get a single AM apart from heapam
into core, and that situation doesn't appear likely to change in the
near future. If and when we have say 5 of those we can probably
articulate some intelligent ideas about what we think the patterns
that need to hold for future AMs are, but it's reckless to extrapolate
from 1 working example, and right now that's all we have.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Is txid_status() actually safe? / What is 011_crash_recovery.pl testing?

2021-05-05 Thread Craig Ringer
On Wed, 10 Feb 2021 at 04:28, Robert Haas  wrote:

> On Mon, Feb 8, 2021 at 4:52 PM Andres Freund  wrote:
> > The 011_crash_recovery.pl test test starts a transaction, creates a
> > table, fetches the transaction's xid. Then shuts down the server in
> > immediate mode. It then asserts that after crash recovery the previously
> > assigned xid is shown as aborted, and that new xids are assigned after
> > the xid.
> >
> > But as far as I can tell there's no guarantee that that is the case.
>
> I think you are right.
>
>
Andres, I missed this mail initially. I'll look into it shortly.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


pg_receivewal makes a bad daemon

2021-05-05 Thread Robert Haas
You might want to use pg_receivewal to save all of your WAL segments
somewhere instead of relying on archive_command. It has, at the least,
the advantage of working on the byte level rather than the segment
level. But it seems to me that it is not entirely suitable as a
substitute for archiving, for a couple of reasons. One is that as soon
as it runs into a problem, it exits, which is not really what you want
out of a daemon that's critical to the future availability of your
system. Another is that you can't monitor it aside from looking at
what it prints out, which is also not really what you want for a piece
of critical infrastructure.

The first problem seems somewhat more straightforward. Suppose we add
a new command-line option, perhaps --daemon but we can bikeshed. If
this option is specified, then it tries to keep going when it hits a
problem, rather than just giving up. There's some fuzziness in my mind
about exactly what this should mean. If the problem we hit is that we
lost the connection to the remote server, then we should try to
reconnect. But if the problem is something like a failure inside
open_walfile() or close_walfile(), like a failed open() or fsync() or
close() or something, it's a little less clear what to do. Maybe one
idea would be to have a parent process and a child process, where the
child process does all the work and the parent process just keeps
re-launching it if it dies. It's not entirely clear that this is a
suitable way of recovering from, say, an fsync() failure, given
previous discussions claiming that - and I might be exaggerating a bit
here - there is essentially no way to recover from a failed fsync()
because the kernel might have already thrown out your data and you
might as well just set the data center on fire - but perhaps an retry
system that can't cope with certain corner cases is better than not
having one at all, and perhaps we could revise the logic here and
there to have the process doing the work take some action other than
exiting when that's an intelligent approach.

The second problem is a bit more complex. If you were transferring WAL
to another PostgreSQL instance rather than to a frontend process, you
could log to some place other than standard output, like for example a
file, and you could periodically rotate that file, or alternatively
you could log to syslog or the Windows event log. Even better, you
could connect to PostgreSQL and run SQL queries against monitoring
views and see what results you get. If the existing monitoring views
don't give users what they need, we can improve them, but the whole
infrastructure needed for this kind of thing is altogether lacking for
any frontend program. It does not seem very appealing to reinvent log
rotation, connection management, and monitoring views inside
pg_receivewal, let alone in every frontend process where similar
monitoring might be useful. But at least for me, without such
capabilities, it is a little hard to take pg_receivewal seriously.

I wonder first of all whether other people agree with these concerns,
and secondly what they think we ought to do about it. One option is -
do nothing. This could be based either on the idea that pg_receivewal
is hopeless, or else on the idea that pg_receivewal can be restarted
by some external system when required and monitored well enough as
things stand. A second option is to start building out capabilities in
pg_receivewal to turn it into something closer to what you'd expect of
a normal daemon, with the addition of a retry capability as probably
the easiest improvement. A third option is to somehow move towards a
world where you can use the server to move WAL around even if you
don't really want to run the server. Imagine a server running with no
data directory and only a minimal set of running processes, just (1) a
postmaster and (2) a walreceiver that writes to an archive directory
and (3) non-database-connected backends that are just smart enough to
handle queries for status information. This has the same problem that
I mentioned on the thread about monitoring the recovery process,
namely that we haven't got pg_authid. But against that, you get a lot
of infrastructure for free: configuration files, process management,
connection management, an existing wire protocol, memory contexts,
rich error reporting, etc.

I am curious to hear what other people think about the usefulness (or
lack thereof) of pg_receivewal as things stand today, as well as ideas
about future direction.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Proposal: per expression intervalstyle

2021-05-05 Thread Michał Wadas
Problem: I have to set IntervalStyle in a separate statement to convert
an interval value to an ISO 8601 string.

This isn't well supported by ORMs or similar tools, requiring us to
set it globally (per role or per database).

Problem #2 (exotic): you can't mix two output styles in a single query.

Proposed solution: special-case the to_char function to accept the values
accepted by
SET intervalstyle TO 'XYZ'

So:

SELECT to_char(INTERVAL '8 minutes', 'iso_8601')

will act similarly to

SET intervalstyle TO 'iso_8601';
SELECT (INTERVAL '8 minutes')::text;
RESET intervalstyle;


Re: Is txid_status() actually safe? / What is 011_crash_recovery.pl testing?

2021-05-05 Thread Craig Ringer
On Tue, 9 Feb 2021 at 05:52, Andres Freund  wrote:

>
> Craig, it kind of looks to me like you assumed it'd be guaranteed that
> the xid at this point would show in-progress?
>

At the time I wrote that code, I don't think I understood that xid
assignment wasn't necessarily durable until either (a) the next checkpoint;
or (b) commit of some txn with a greater xid.

IIRC I expected that after crash and recovery the tx would always be
treated as aborted, because the xid had been assigned but no corresponding
commit was found before end-of-recovery. No explicit abort records are
written to WAL for such txns since we crashed, but the server's oldest
in-progress txn threshold is used to determine that they must be aborted
rather than in-progress even though their clog entries aren't set to
aborted.

Which was fine as far as it went, but I failed to account for the xid
assignment not necessarily being durable when the client calls
txid_status().


> I don't think the use of txid_status() described in the docs added in
> the commit is actually ever safe?
>

I agree. The client can query for its xid with txid_current() but as you
note there's no guarantee that the assigned xid is durable.

The client would have to ensure that an xid was assigned, then ensure that
the WAL was durably flushed past the point of the xid assignment before
relying on the xid.

If we do a txn that performs a small write, calls txid_current(), and sends
a commit that the server crashes before completing, we can't know for sure
that the xid we recorded client-side before the server crash is the same
txn we check the status of after crash recovery. Some other txn could've
re-used the xid after crash so long as no other txn with a greater xid
durably committed before the crash.

That scenario isn't hugely likely, but it's definitely possible on systems
that don't do a lot of concurrent txns or do mostly long, heavyweight txns.

The txid_status() function was originally intended to be paired with a way
to report topxid assignment to the client automatically, NOTIFY or
GUC_REPORT-style. But that would not make this usage safe either, unless we
delayed the report until WAL was flushed past the LSN of the xid assignment
*or* some other txn with a greater xid committed.

This could be made safe with a variant of txid_current() that forced the
xid assignment to be logged immediately if it was not already, and did not
return until WAL flushed past the point of the assignment. If the client
did most of the txn's work before requesting a guaranteed-durable xid, it
would in practice not land up having to wait for a flush. But we'd have to
keep track of when we assigned the xid in every single topxact in order to
be able to promise we'd flushed it without having to immediately force a
flush. That's pointless overhead all the rest of the time, just in case
someone wants to get an xid for later use with txid_status().

The simplest option with no overhead on anything that doesn't care about
txid_status() is to expose a function to force flush of WAL up to the
current insert LSN. Then update the docs to say you have to call it after
txid_current(), and before sending your commit. But at that point you might
as well use 2PC, since you're paying the same double flush and double
round-trip costs. The main point of txid_status() was to avoid the cost of
that double-flush.
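
Put together, the documented client-side sequence would then look something
like this, where pg_flush_wal_upto() stands in for the hypothetical flush
function described above and does not exist today:

BEGIN;
-- ... the transaction's actual work ...
SELECT txid_current();                                  -- record this client-side
SELECT pg_flush_wal_upto(pg_current_wal_insert_lsn());  -- hypothetical
COMMIT;
-- if the server crashes around the commit, the recorded xid can then be
-- passed to txid_status() safely after recovery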

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: Is txid_status() actually safe? / What is 011_crash_recovery.pl testing?

2021-05-05 Thread Craig Ringer
On Wed, 5 May 2021 at 23:15, Craig Ringer  wrote:

> Which was fine as far as it went, but I failed to account for the xid
> assignment not necessarily being durable when the client calls
> txid_status().

Ahem, I meant "when the client calls txid_current()"

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-05 Thread Joel Jacobson
Hi,

From time to time, I need to deal with bizarrely delimited text files,
having to use tools such as awk/sed/perl to reformat the files
so that they can be copied into PostgreSQL.

If such files could be imported to a table with a single text column,
we could then use PostgreSQL's now really fast regex-engine
to do data cleaning/reformatting, instead of having to rely on external 
awk-ward tools.

Furthermore, sometimes you don't want to clean/reformat the data at all,
but simply import the text lines "as is" without modifications,
such as when wanting to import unformatted log files,
where the log lines can contain any characters.

Could it be an idea to exploit the fact that DELIMITER E'\n' is currently an 
error?

ERROR:  COPY delimiter cannot be newline or carriage return

That is, to change E'\n' to be a valid delimiter, which would simply read each 
line
delimited by newlines, as a single column.

The hack I'm currently abusing is to find some one-byte character that is not 
present anywhere in the text file,
and then to use that character as a delimiter. This doesn't work when you need
to deal with a text file whose content is unknown at the time of writing the
code, though, so it's mostly useful for throwaway one-off queries.
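
Concretely, the workaround looks something like this (table and file names are
made up, and E'\x01' stands in for whatever byte happens to be absent from the
file; note that text-format COPY still interprets backslash sequences in the
data):

CREATE TABLE raw_lines (line text);
COPY raw_lines (line) FROM '/tmp/unknown.txt' DELIMITER E'\x01';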

Thoughts?

/Joel

MySQL seems to already support using \n as a delimiter (I haven't verified it 
myself though) [1]

[1] 
https://stackoverflow.com/questions/18394620/postgres-import-file-that-has-columns-separated-by-new-lines



Re: pg_receivewal makes a bad daemon

2021-05-05 Thread Laurenz Albe
On Wed, 2021-05-05 at 11:04 -0400, Robert Haas wrote:
> You might want to use pg_receivewal to save all of your WAL segments
> somewhere instead of relying on archive_command. It has, at the least,
> the advantage of working on the byte level rather than the segment
> level. But it seems to me that it is not entirely suitable as a
> substitute for archiving, for a couple of reasons. One is that as soon
> as it runs into a problem, it exits, which is not really what you want
> out of a daemon that's critical to the future availability of your
> system. Another is that you can't monitor it aside from looking at
> what it prints out, which is also not really what you want for a piece
> of critical infrastructure.
> 
> The first problem seems somewhat more straightforward. Suppose we add
> a new command-line option, perhaps --daemon but we can bikeshed. If
> this option is specified, then it tries to keep going when it hits a
> problem, rather than just giving up.  [...]

That sounds like a good idea.

I don't know what it takes to make that perfect (if such a thing exists),
but simply trying to re-establish database connections and dying when
we hit an I/O problem seems like a clear improvement.

> The second problem is a bit more complex.  [...]

If I wanted to monitor pg_receivewal, I'd have it use a replication
slot and monitor "pg_replication_slots" on the primary.  That way I see
if there is a WAL sender process, and I can measure the lag in bytes.
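
For example, a query along these lines (slot name made up) shows whether the
slot is active and how far it lags, in bytes:

SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_name = 'pg_receivewal_slot';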

What more could you want?

Yours,
Laurenz Albe





Re: MaxOffsetNumber for Table AMs

2021-05-05 Thread Peter Geoghegan
On Wed, May 5, 2021 at 7:27 AM Robert Haas  wrote:
> It seems to me that we're doing a lot of disagreeing given that, as I
> see it, there are only relatively minor differences between the
> positions of the various people here.

I'm being very vocal here because I'm concerned that we're going about
generalizing TIDs in the wrong way. To me it feels like there is a
loss of perspective about what really matters. There just aren't that
many table AM TID designs that could ever work, and even among those
schemes that could ever work there is a pretty clear hierarchy. This
blue sky thinking about generalizing TIDs 2 years in seems *weird* to
me.

Nobody is obligated to reassure me. I felt that this had to be said.

> Andres and I are, I think,
> relatively convinced that variable-width TIDs would let us do things
> that aren't otherwise possible, and that those things are potentially
> useful and I would even venture to say cool. I don't believe you
> disagree with that, but you think it's going to be too much work to
> implement. Fair enough; anyone can try it who is interested and see
> how far they get. Anyone who thinks it's going to be impossibly hard
> probably will prefer not to try, and that's OK too.

I think that's accurate. But it's easy to not disagree with the idea
that variable-width TIDs might lead to something interesting. Talk is
cheap.

No other database system has something like indirect indexes. They
have clustered indexes, but that's rather different. I think that
indirect indexes were a design that was concerned about the issue of
write amplification from non-HOT updates. But do I even remember the
details correctly? We're talking about indirect indexes as if that was
an idea whose high level user-visible goals were clear, but I don't
even think that that much is true. This kind of thing concerns me. It
very much feels like failing to see the forest for the trees.

> But if we take that off the table, what about a less-ambitious
> generalization of the TID mechanism? I can't really see anyone putting
> up a serious argument against allowing all 48 bits of space available
> in the existing TID format to be used which, as Jeff points out, is
> not currently the case. So anyone who wants to try to write that patch
> is free to do so. I don't have a clear idea how to make that work, to
> be honest, but my limited supply of ideas need not prevent anyone else
> from trying theirs.

I agree that we should focus on what we can agree on. It seems as if
we all more or less agree on this much.

> There might be some slight disagreement about whether it's useful to
> generalize TIDs from a 48-bit address space to a 64-bit address space
> without making it fully general. Like Andres, I am unconvinced that's
> meaningfully easier, and I am convinced that it's meaningfully less
> good, but other people can disagree and that's fine. I'm perfectly
> willing to change my opinion if somebody shows up with a patch that
> demonstrates the value of this approach.

It's going to be hard if not impossible to provide empirical evidence
for the proposition that 64-bit wide TIDs (alongside 48-bit TIDs) are
the way to go. Same with any other scheme. We're talking way too much
about TIDs themselves and way too little about table AM use cases, the
way the data structures might work in new table AMs, and so on.

> The main point here is one that I think you made a few emails back:
> the limitations of the current system are defined by what will
> actually work with the code as it exists today, not some mailing list
> discussion.

Right.

> It's too early for the project to commit to stability in
> this area; we have not managed to get a single AM apart from heapam
> into core, and that situation doesn't appear likely to change in the
> near future.

I would be happy if we could commit to committing to stability. I
really don't think that it's *that* hard to move significantly closer
to a design that describes just how close to heapam a table AM should
be. It doesn't commit the table AM to all that many details.

> If and when we have say 5 of those we can probably
> articulate some intelligent ideas about what we think the patterns
> that need to hold for future AMs are, but it's reckless to extrapolate
> from 1 working example, and right now that's all we have.

I don't think that there will be 5 table AMs that are credible to
users at any point in the future. In any case there only needs to be 1
or 2 good ones for the table AM to have been a resounding success.

-- 
Peter Geoghegan




Re: Printing backtrace of postgres processes

2021-05-05 Thread vignesh C
On Wed, Feb 3, 2021 at 3:24 PM Bharath Rupireddy
 wrote:
>
> On Wed, Feb 3, 2021 at 1:49 PM vignesh C  wrote:
> >
> > On Wed, Feb 3, 2021 at 1:00 PM Tom Lane  wrote:
> > >
> > > vignesh C  writes:
> > > > On Mon, Feb 1, 2021 at 11:04 AM Bharath Rupireddy
> > > >  wrote:
> > > >> Are these superuser and permission checks enough from a security
> > > >> standpoint that we don't expose some sensitive information to the
> > > >> user?
> > >
> > > > This will just print the backtrace of the current backend. Users
> > > > cannot get password information from this.
> > >
> > > Really?
> > >
> > > A backtrace normally exposes the text of the current query, for
> > > instance, which could contain very sensitive data (passwords in ALTER
> > > USER, customer credit card numbers in ordinary data, etc etc).  We
> > > don't allow the postmaster log to be seen by any but very privileged
> > > users; it's not sane to think that this data is any less
> > > security-critical than the postmaster log.
> > >
> > > This point is entirely separate from the question of whether
> > > triggering stack traces at inopportune moments could cause system
> > > malfunctions, but that question is also not to be ignored.
> > >
> > > TBH, I'm leaning to the position that this should be superuser
> > > only.  I do NOT agree with the idea that ordinary users should
> > > be able to trigger it, even against backends theoretically
> > > belonging to their own userid.  (Do I need to point out that
> > > some levels of the call stack might be from security-definer
> > > functions with more privilege than the session's nominal user?)
> > >
> >
> > I had seen that the log that will be logged will be something like:
> > postgres: test postgres [local]
> > idle(ProcessClientReadInterrupt+0x3a) [0x9500ec]
> > postgres: test postgres [local] idle(secure_read+0x183) [0x787f43]
> > postgres: test postgres [local] idle() [0x7919de]
> > postgres: test postgres [local] idle(pq_getbyte+0x32) [0x791a8e]
> > postgres: test postgres [local] idle() [0x94fc16]
> > postgres: test postgres [local] idle() [0x950099]
> > postgres: test postgres [local] idle(PostgresMain+0x6d5) [0x954bd5]
> > postgres: test postgres [local] idle() [0x898a09]
> > postgres: test postgres [local] idle() [0x89838f]
> > postgres: test postgres [local] idle() [0x894953]
> > postgres: test postgres [local] idle(PostmasterMain+0x116b) 
> > [0x89422a]
> > postgres: test postgres [local] idle() [0x79725b]
> > /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f6e68d7]
> > postgres: test postgres [local] idle() [0x484249]
> > I was not sure if we would be able to get any secure information from
> > this. I did not notice the function arguments being printed. I felt
> > the function name, offset  and the return address will be logged. I
> > might be missing something here.
> > Thoughts?
>
> First of all, we need to see if the output of pg_print_backtrace shows
> up function parameter addresses or only function start addresses along
> with line and file information when attached to gdb. In either case,
> IMO, it will be easy for experienced hackers(I'm not one though) to
> calculate and fetch the query string or other function parameters or
> the variables inside the functions from the stack by looking at the
> code (which is available openly, of course).
>
> Say, if a backend is in a long running scan or insert operation, then
> pg_print_backtrace is issued from another session, the
> exec_simple_query function shows up query_string. Below is captured
> from attached gdb though, I'm not sure whether the logged backtrace
> will have function address or the function parameters addresses, I
> think we can check that by having a long running query which
> frequently checks interrupts and issue pg_print_backtrace from another
> session to that backend. Now, attach gdb to the backend in which the
> query is running, then take bt, see if the logged backtrace and the
> gdb bt have the same or closer addresses.
>
> #13 0x5644f4320729 in exec_simple_query (
> query_string=0x5644f6771bf0 "select pg_backend_pid();") at postgres.c:1240
> #14 0x5644f4324ff4 in PostgresMain (argc=1, argv=0x7ffd819bd5e0,
> dbname=0x5644f679d2b8 "postgres", username=0x5644f679d298 "bharath")
> at postgres.c:4394
> #15 0x5644f4256f9d in BackendRun (port=0x5644f67935c0) at 
> postmaster.c:4484
> #16 0x5644f4256856 in BackendStartup (port=0x5644f67935c0) at
> postmaster.c:4206
> #17 0x5644f4252a11 in ServerLoop () at postmaster.c:1730
> #18 0x5644f42521aa in PostmasterMain (argc=3, argv=0x5644f676b1f0)
> at postmaster.c:1402
> #19 0x5644f4148789 in main (argc=3, argv=0x5644f676b1f0) at main.c:209
>
> As suggested by Tom, I'm okay if this function is callable only by the
> superusers. In that case, the superusers can fetch the backtrace and
> send it for further analysis in case of 

Re: useless argument of ATAddForeignKeyConstraint

2021-05-05 Thread Alvaro Herrera
On 2019-Feb-14, Amit Langote wrote:

> While reviewing the foreign keys referencing partitioned tables patch, I
> noticed that the parentConstr argument of ATAddForeignConstraint is
> rendered useless by the following commit:

> Maybe remove that argument in HEAD ?  Attached a patch.

Indeed -- two years later this is still valid, so applied, with thanks!

-- 
Álvaro Herrera39°49'30"S 73°17'W
"Investigación es lo que hago cuando no sé lo que estoy haciendo"
(Wernher von Braun)




Re: pg_receivewal makes a bad daemon

2021-05-05 Thread Magnus Hagander
On Wed, May 5, 2021 at 5:04 PM Robert Haas  wrote:
>
> You might want to use pg_receivewal to save all of your WAL segments
> somewhere instead of relying on archive_command. It has, at the least,
> the advantage of working on the byte level rather than the segment
> level. But it seems to me that it is not entirely suitable as a
> substitute for archiving, for a couple of reasons. One is that as soon
> as it runs into a problem, it exits, which is not really what you want
> out of a daemon that's critical to the future availability of your
> system. Another is that you can't monitor it aside from looking at
> what it prints out, which is also not really what you want for a piece
> of critical infrastructure.
>
> The first problem seems somewhat more straightforward. Suppose we add
> a new command-line option, perhaps --daemon but we can bikeshed. If
> this option is specified, then it tries to keep going when it hits a
> problem, rather than just giving up. There's some fuzziness in my mind
> about exactly what this should mean. If the problem we hit is that we
> lost the connection to the remote server, then we should try to
> reconnect. But if the problem is something like a failure inside
> open_walfile() or close_walfile(), like a failed open() or fsync() or
> close() or something, it's a little less clear what to do. Maybe one
> idea would be to have a parent process and a child process, where the
> child process does all the work and the parent process just keeps
> re-launching it if it dies. It's not entirely clear that this is a
> suitable way of recovering from, say, an fsync() failure, given
> previous discussions claiming that - and I might be exaggerating a bit
> here - there is essentially no way to recover from a failed fsync()
> because the kernel might have already thrown out your data and you
> might as well just set the data center on fire - but perhaps an retry
> system that can't cope with certain corner cases is better than not
> having one at all, and perhaps we could revise the logic here and
> there to have the process doing the work take some action other than
> exiting when that's an intelligent approach.

Is this really a problem we should fix ourselves? Most daemon managers
today can be configured to automatically restart a daemon on failure
with a single setting, and have been able to for a long time. E.g. in
systemd (which most Linux distributions use now) you just set
Restart=on-failure (or maybe even Restart=always) and something like
RestartSec=10.
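
For example, a minimal unit file along these lines (the binary path,
user, directory and slot name are of course just placeholders for
whatever a given installation uses) is about all it takes:

[Unit]
Description=WAL streaming with pg_receivewal

[Service]
User=postgres
ExecStart=/usr/lib/postgresql/13/bin/pg_receivewal -D /var/lib/postgresql/wal_archive --slot=receivewal
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target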

That said, it wouldn't cover an fsync() error -- they will always
restart. The way to handle that is for the operator to capture the
error message perhaps, and just "deal with it"?

What could be more interesting there in a "systemd world" would be to
add watchdog support. That'd obviously only be interesting on systemd
platforms, but we already have some of that basic notification support
in the postmaster for those.

> The second problem is a bit more complex. If you were transferring WAL
> to another PostgreSQL instance rather than to a frontend process, you
> could log to some place other than standard output, like for example a
> file, and you could periodically rotate that file, or alternatively
> you could log to syslog or the Windows event log. Even better, you
> could connect to PostgreSQL and run SQL queries against monitoring
> views and see what results you get. If the existing monitoring views
> don't give users what they need, we can improve them, but the whole
> infrastructure needed for this kind of thing is altogether lacking for
> any frontend program. It does not seem very appealing to reinvent log
> rotation, connection management, and monitoring views inside
> pg_receivewal, let alone in every frontend process where similar
> monitoring might be useful. But at least for me, without such
> capabilities, it is a little hard to take pg_receivewal seriously.

Again, isn't this the job of the daemon runner? At least in cases
where it's not Windows :)? That is, taking the output and putting it
in a log, and interfacing with log rotation.

Now, having some sort of statistics *other* than parsing a log would
definitely be useful. But perhaps that could be something as simple as
having a --statsfile=/foo/bar parameter and then updating that file at
regular intervals with "whatever is the current state"?

And of course, the other point to monitor is the replication slot on
the server it's connected to -- but I agree that being able to monitor
both sides there would be good.


> I wonder first of all whether other people agree with these concerns,
> and secondly what they think we ought to do about it. One option is -
> do nothing. This could be based either on the idea that pg_receivewal
> is hopeless, or else on the idea that pg_receivewal can be restarted
> by some external system when required and monitored well enough as
> things stand. A second option is to start building out capabilities in
> pg_receivewal to turn it into something clo

Re: MaxOffsetNumber for Table AMs

2021-05-05 Thread Robert Haas
On Wed, May 5, 2021 at 11:50 AM Peter Geoghegan  wrote:
> I'm being very vocal here because I'm concerned that we're going about
> generalizing TIDs in the wrong way. To me it feels like there is a
> loss of perspective about what really matters.

Well, which things matter is a question of opinion, not fact.

> No other database system has something like indirect indexes. They
> have clustered indexes, but that's rather different.

I don't think this is true at all. If you have a clustered index -
i.e. the table is physically arranged according to the index ordering
- then your secondary indexes all pretty much have to be what we're
calling indirect indexes. They can hardly point to a physical
identifier if rows are being moved around. I believe InnoDB works this
way, and I think Oracle's index-organized tables do too. I suspect
there are other examples.

> > There might be some slight disagreement about whether it's useful to
> > generalize TIDs from a 48-bit address space to a 64-bit address space
> > without making it fully general. Like Andres, I am unconvinced that's
> > meaningfully easier, and I am convinced that it's meaningfully less
> > good, but other people can disagree and that's fine. I'm perfectly
> > willing to change my opinion if somebody shows up with a patch that
> > demonstrates the value of this approach.
>
> It's going to be hard if not impossible to provide empirical evidence
> for the proposition that 64-bit wide TIDs (alongside 48-bit TIDs) are
> the way to go. Same with any other scheme. We're talking way too much
> about TIDs themselves and way too little about table AM use cases, the
> way the data structures might work in new table AMs, and so on.

I didn't mean that it has to be a test result showing that 64-bit TIDs
outperform 56-bit TIDs or something. I just meant there has to be a
reason to believe it's good, which could be based on a discussion of
use cases or whatever. If we *don't* have a reason to believe it's
good, we shouldn't do it.

My point is that so far I am not seeing a whole lot of value of this
proposed approach. For a 64-bit TID to be valuable to you, one of two
things has to be true: you either don't care about having indexes that
store TIDs on your new table type, or the index types you want to use
can store those 64-bit TIDs. Now, I have not yet heard of anyone
working on a table AM who does not want to be able to support adding
btree indexes. There may be someone that I don't know about, and if
so, fine. But otherwise, we need a way to store them. And that
requires changing the page format for btree indexes. But surely we do
not want to make all TIDs everywhere wider in future btree versions,
so at least two TID widths - 6 bytes and 8 bytes - would have to be
supported. And if we're at all going to do that, I think it's
certainly worth asking whether supporting varlena TIDs would really be
all that much harder. You seem to think it is, and you might be right,
but I'm not ready to give up, because I do not see how we are ever
going to get global indexes or indirect indexes without doing it, and
those would be good features to have.

If we can't ever get them, so be it, but you seem to kind of be saying
that things like global indexes and indirect indexes are hard, and
therefore they don't count as reasons why we might want variable-width
TIDs. But one very large reason why those things are hard is that they
require variable-width TIDs, so AFAICS this boils down to saying that
we don't want the feature because it's hard to implement. But we
should not conflate feasibility with desirability. I am quite sure
that lots of people want global indexes. The number of people who want
indirect indexes is in my estimation much smaller, but it's probably
not zero, or else Alvaro wouldn't have tried his hand at writing a
patch. Whether we can *get* those things is in doubt; whether it will
happen in the near future is very much in doubt. But I at least am not
in doubt about whether people want it, because I hear complaints about
the lack of global indexes on an almost-daily basis. If those
complaints are all from people hoping to fake me out into spending
time on something that is worthless to them, my colleagues are very
good actors.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-05 Thread David G. Johnston
On Wed, May 5, 2021 at 8:31 AM Joel Jacobson  wrote:

> Could it be an idea to exploit the fact that DELIMITER E'\n' is currently
> an error?
>
>
Why not just allow: "DELIMITER NONE" to be valid syntax meaning exactly
what it says and does exactly what you desire?

David J.


Re: .ready and .done files considered harmful

2021-05-05 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Tue, May 4, 2021 at 11:54 AM Dilip Kumar  wrote:
> > I agree that if we continue to archive one file using the archive
> > command then Robert's solution of checking the existence of the next
> > WAL segment (N+1) has an advantage.  But, currently, if you notice,
> > pgarch_readyXlog always considers any history file as the oldest file,
> > but that will not be true if we try to predict the next WAL segment
> > name.  For example, if we have archived 00010004 then
> > next we will look for 00010005 but after generating
> > segment 00010005, if there is a timeline switch then
> > we will have the below files in the archive status
> > (00010005.ready, 0002.history file).  Now, the
> > existing archiver will archive 0002.history first whereas our code
> > will archive 00010005 first.  That said, I don't see
> > any problem with that because before archiving any segment file from
> > TL 2 we will definitely archive the 0002.history file because we
> > will not find the 00010006.ready and we will scan the
> > full directory and now we will find 0002.history as oldest file.
> 
> OK, that makes sense and is good to know.

I expect David will chime in on this thread too, but I did want to point
out that when it comes to archiving history files you'd *really* like
that to be done just about as quickly as absolutely possible, to avoid
the case that we saw before that code was added, to wit: two promotions
done too quickly that ended up with conflicting history and possibly
conflicting WAL files trying to be archived, and ensuing madness.

It's not just about making sure that we archive the history file for a
timeline before archiving WAL segments along that timeline but also
about making sure we get that history file into the archive as fast as
we can, and archiving a 16MB WAL first would certainly delay that.

Thanks,

Stephen




Re: pg_receivewal makes a bad daemon

2021-05-05 Thread Robert Haas
On Wed, May 5, 2021 at 12:34 PM Magnus Hagander  wrote:
> Is this really a problem we should fix ourselves? Most daemon managers
> today can be configured to automatically restart a daemon on failure
> with a single setting, and have been able to for a long time. E.g. in
> systemd (which most Linux distributions use now) you just set
> Restart=on-failure (or maybe even Restart=always) and something like
> RestartSec=10.
>
> That said, it wouldn't cover an fsync() error -- they will always
> restart. The way to handle that is for the operator to capture the
> error message perhaps, and just "deal with it"?

Maybe, but if that's really a non-problem, why does postgres itself
restart, and have facilities to write and rotate log files? I feel
like this argument boils down to "a manual transmission ought to be
good enough for anyone, let's not have automatics." But over the years
people have found that automatics are a lot easier to drive. It may be
true that if you know just how to configure your system's daemon
manager, you can make all of this work, but it's not like we document
how to do any of that, and it's probably not the same on every
platform - Windows in particular - and, really, why should people have
to do this much work? If I want to run postgres in the background I
can just type 'pg_ctl start'. I could even put 'pg_ctl start' in my
crontab to make sure it gets restarted within a few minutes even if
the postmaster dies. If I want to keep pg_receivewal running all the
time ... I need a whole pile of extra mechanism to work around its
inherent fragility. Documenting how that's typically done on modern
systems, as you propose further on, would be great, but I can't do it,
because I don't know how to make it work. Hence the thread.

> Also, all the above also apply to pg_recvlogical, right? So if we do
> want to invent our own daemon-init-system, we should probably do one
> more generic that can handle both.

Yeah. And I'm not really 100% convinced that trying to patch this
functionality into pg_receive{wal,logical} is the best way forward ...
but I'm not entirely convinced that it isn't, either. I think one of
the basic problems with trying to deploy PostgreSQL in 2021 is that it
needs so much supporting infrastructure and so much babysitting.
archive_command has to be a complicated, almost magical program we
don't provide, and we don't even tell you in the documentation that
you need it. If you don't want to use that, you can stream with
pg_receivewal instead, but now you need a complicated daemon-runner
mechanism that we don't provide or document the need for. You also
probably need a connection pooler that we don't provide, a failover
manager that we don't provide, and backup management software that we
don't provide. And the interfaces that those tools have to work with
are so awkward and primitive that even the tool authors can't always
get it right. So I'm sort of unimpressed by any arguments that boil
down to "what we have is good enough" or "that's the job of some other
piece of software". Too many things are the job of some piece of
software that doesn't really exist, or is only available on certain
platforms, or that has some other problem that makes it not usable for
everyone. People want to be able to download and use PostgreSQL
without needing a whole library of other bits and pieces from around
the Internet.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: MaxOffsetNumber for Table AMs

2021-05-05 Thread Jeff Davis
On Wed, 2021-05-05 at 10:27 -0400, Robert Haas wrote:
> It's too early for the project to commit to stability in
> this area; we have not managed to get a single AM apart from heapam
> into core

"In core" shouldn't matter. In fact, if it's in core, stability of the
APIs is much less important.

>  If and when we have say 5 of those

That seems like a standard that we won't reach in any reasonable amount
of time.

> we can probably
> articulate some intelligent ideas about what we think the patterns
> that need to hold for future AMs are, but it's reckless to
> extrapolate
> from 1 working example, and right now that's all we have.

We should count columnar as a second example. While it doesn't support
everything that heap does, we are actively working on it and it's
gaining features quickly. It's also showing some impressive real-world
results.

Regards,
Jeff Davis






Re: MaxOffsetNumber for Table AMs

2021-05-05 Thread Peter Geoghegan
On Wed, May 5, 2021 at 9:42 AM Robert Haas  wrote:
> On Wed, May 5, 2021 at 11:50 AM Peter Geoghegan  wrote:
> > I'm being very vocal here because I'm concerned that we're going about
> > generalizing TIDs in the wrong way. To me it feels like there is a
> > loss of perspective about what really matters.
>
> Well, which things matter is a question of opinion, not fact.

I'm not trying to win an argument here. I am giving an opinion in the
hopes that it leads to some kind of useful synthesis, based on all of
our opinions.

> > No other database system has something like indirect indexes. They
> > have clustered indexes, but that's rather different.
>
> I don't think this is true at all. If you have a clustered index -
> i.e. the table is physically arranged according to the index ordering
> - then your secondary indexes all pretty much have to be what we're
> calling indirect indexes. They can hardly point to a physical
> identifier if rows are being moved around. I believe InnoDB works this
> way, and I think Oracle's index-organized tables do too. I suspect
> there are other examples.

But these systems don't have indirect indexes *on a heap table*! Why
would they ever do it that way? They already have rowid/TID as a
stable identifier of logical rows, so having indirect indexes that
point to a heap table's rows would be strictly worse than the generic
approach for indexes on a heap table.

What we call indirect indexes seem to me to be a failed attempt to
solve the "TID is not a stable identifier of logical row" issue that
is baked-in to Postgres. If I thought it was worth solving that
problem then I suppose I'd solve it directly. The "indirection" of
indirect indexes actually buys you nothing! It just moves some of the
problem somewhere else, at the cost of even more complexity. Indirect
indexes (without a clustered index) are a muddled idea.

Of course I accept that clustered indexes make sense in general
(though less and less these days). But the fact that these systems
"use indirect indexes" for secondary indexes is precisely why
clustered indexes don't seem like a great design with modern hardware!
Should we invest a huge amount of work in order to have all of the
disadvantages, and none of the advantages?

> My point is that so far I am not seeing a whole lot of value of this
> proposed approach. For a 64-bit TID to be valuable to you, one of two
> things has to be true: you either don't care about having indexes that
> store TIDs on your new table type, or the index types you want to use
> can store those 64-bit TIDs. Now, I have not yet heard of anyone
> working on a table AM who does not want to be able to support adding
> btree indexes. There may be someone that I don't know about, and if
> so, fine. But otherwise, we need a way to store them. And that
> requires changing the page format for btree indexes. But surely we do
> not want to make all TIDs everywhere wider in future btree versions,
> so at least two TID widths - 6 bytes and 8 bytes - would have to be
> supported.

I agree that we don't want a performance/space overhead for simple
cases that are quite happy with the current format.

> And if we're at all going to do that, I think it's
> certainly worth asking whether supporting varlena TIDs would really be
> all that much harder. You seem to think it is, and you might be right,
> but I'm not ready to give up, because I do not see how we are ever
> going to get global indexes or indirect indexes without doing it, and
> those would be good features to have.

I think that global indexes are well worth having, and should be
solved some completely different way. The partition key can be an
additive thing. I strongly suspect that indirect indexes (without a
clustered index) are 100% useless in both theory and practice, so
naturally I have little to no interest.

The difficulty of supporting (say) 6 byte and 8 byte TIDs together is
vastly lower than variable-width TIDs, for all kinds of reasons. See
my remarks to Andres upthread about deduplication.

> If we can't ever get them, so be it, but you seem to kind of be saying
> that things like global indexes and indirect indexes are hard, and
> therefore they don't count as reasons why we might want variable-width
> TIDs. But one very large reason why those things are hard is that they
> require variable-width TIDs, so AFAICS this boils down to saying that
> we don't want the feature because it's hard to implement.

More like very hard to implement for a very low benefit.

> But we
> should not conflate feasibility with desirability. I am quite sure
> that lots of people want global indexes.

I do too!

-- 
Peter Geoghegan




Re: MaxOffsetNumber for Table AMs

2021-05-05 Thread Jeff Davis
On Wed, 2021-05-05 at 08:50 -0700, Peter Geoghegan wrote:
> There just isn't that
> many table AM TID designs that could ever work, and even among those
> schemes that could ever work there is a pretty clear hierarchy. This
> blue sky thinking about generalizing TIDs 2 years in seems *weird* to
> me.

I am happy to keep table AM discussions concrete, as I have plenty of
concrete problems which I would like to turn into proposals.

Regards,
Jeff Davis






Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-05 Thread Chapman Flack
On 05/05/21 13:02, David G. Johnston wrote:
> Why not just allow: "DELIMITER NONE" to be valid syntax meaning exactly
> what it says and does exactly what you desire?

What would it mean? That you get one column, multiple rows of text
corresponding to "lines" delimited by something, or that you get one
column, one row of text for the entire content of the file?

Regards,
-Chap




Re: MaxOffsetNumber for Table AMs

2021-05-05 Thread Robert Haas
On Wed, May 5, 2021 at 1:13 PM Jeff Davis  wrote:
> "In core" shouldn't matter. In fact, if it's in core, stability of the
> APIs is much less important.

I don't know what to say here. I think it's unrealistic to believe
that a very new API that has only 1 in-core user is going to be fully
stable, or that we can know how it might evolve. I can understand why
you and probably other people want that, but if somebody figures out a
way to make some part of core significantly better and it requires
changing that API, they're going to change the API, not give up on the
idea.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-05 Thread Isaac Morland
On Wed, 5 May 2021 at 13:23, Chapman Flack  wrote:

> On 05/05/21 13:02, David G. Johnston wrote:
> > Why not just allow: "DELIMITER NONE" to be valid syntax meaning exactly
> > what it says and does exactly what you desire?
>
> What would it mean? That you get one column, multiple rows of text
> corresponding to "lines" delimited by something, or that you get one
> column, one row of text for the entire content of the file?
>

It means no column delimiter. In other words, there is no character
which marks the end of a data value, so the entire line is a single data
value.

Would DELIMITER NULL make sense? The existing values are literal strings so
NULL fits with that. Do we already have NONE as a keyword somewhere? It's
listed in the keyword appendix to the documentation but I can't think of
where it is used off the top of my head.
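
For concreteness, a sketch of what the proposal might look like
(hypothetical syntax only -- neither NONE nor NULL is accepted there
today, and a literal newline delimiter is rejected outright):

-- today this fails, with something along the lines of
-- ERROR:  COPY delimiter cannot be newline or carriage return
COPY t (single_column) FROM 'unknown.txt' DELIMITER E'\n';

-- proposed: no column delimiter at all, so each input line is loaded
-- as one row with a single column
COPY t (single_column) FROM 'unknown.txt' (DELIMITER NONE);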


Re: MaxOffsetNumber for Table AMs

2021-05-05 Thread Peter Geoghegan
On Wed, May 5, 2021 at 10:33 AM Robert Haas  wrote:
> I don't know what to say here. I think it's unrealistic to believe
> that a very new API that has only 1 in-core user is going to be fully
> stable, or that we can know how it might evolve. I can understand why
> you and probably other people want that, but if somebody figures out a
> way to make some part of core significantly better and it requires
> changing that API, they're going to change the API, not give up on the
> idea.

I strongly agree. More generally, we need to decide what downsides
we're willing to live with.

What we have right now has little chance of failing. It also has
little chance of succeeding (except for something like zheap, which
can presumably get by with the heapam's idea of TID).

-- 
Peter Geoghegan




Re: MaxOffsetNumber for Table AMs

2021-05-05 Thread Jeff Davis
On Wed, 2021-05-05 at 10:48 -0700, Peter Geoghegan wrote:
> What we have right now has little chance of failing. It also has
> little chance of succeeding (except for something like zheap, which
> can presumably get by with the heapam's idea of TID).

What has little chance of succeeding? Table AMs?

And why isn't columnar an example of something that can "get by with
heapam's idea of TID"? I mean, it's not a perfect fit, but my primary
complaint in this whole thread is that it's undefined, not that it's
completely unworkable.

Regards,
Jeff Davis






Re: MaxOffsetNumber for Table AMs

2021-05-05 Thread Robert Haas
On Wed, May 5, 2021 at 1:15 PM Peter Geoghegan  wrote:
> > I don't think this is true at all. If you have a clustered index -
> > i.e. the table is physically arranged according to the index ordering
> > - then your secondary indexes all pretty much have to be what we're
> > calling indirect indexes. They can hardly point to a physical
> > identifier if rows are being moved around. I believe InnoDB works this
> > way, and I think Oracle's index-organized tables do too. I suspect
> > there are other examples.
>
> But these systems don't have indirect indexes *on a heap table*! Why
> would they ever do it that way? They already have rowid/TID as a
> stable identifier of logical rows, so having indirect indexes that
> point to a heap table's rows would be strictly worse than the generic
> approach for indexes on a heap table.

One advantage of indirect indexes is that you can potentially avoid a
lot of writes to the index. If a non-HOT update is performed, but the
primary key is not updated, the index does not need to be touched. I
think that's a potentially significant savings, even if bottom-up
index deletion would have prevented the page splits. Similarly, you
can mark a dead line pointer unused without having to scan the
indirect index, because the index isn't pointing to that dead line
pointer anyway.

Hmm, but I guess you have another cleanup problem. What prevents
someone from inserting a new row with the same primary key as a
previously-deleted row but different values in some indirectly-indexed
column? Then the old index entries, if still present, could mistakenly
refer to the new row. I don't know whether Alvaro thought of that
problem when he was working on this previously, or whether he solved
it somehow. Possibly that's a big enough problem that the whole idea
is dead in the water, but it's not obvious to me that this is so.

And, anyway, this whole argument is predicated on the fact that the
only table AM we have right now is heapam. If we had a table AM that
organized the data by primary key value, we'd still want to be able to
have secondary indexes, and they'd have to use the primary key value
as the TID.

> I think that global indexes are well worth having, and should be
> solved some completely different way. The partition key can be an
> additive thing.

I agree that the partition identifier should be an additive thing, but
where would we add it? It seems to me that the obvious answer is to
make it a column of the index tuple. And if we can do that, why can't
we put whatever kind of TID-like stuff people want in the index tuple,
too? Maybe part of the problem here is that I don't actually
understand how posting lists are represented...

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-05 Thread David G. Johnston
On Wed, May 5, 2021 at 10:34 AM Isaac Morland 
wrote:

> On Wed, 5 May 2021 at 13:23, Chapman Flack  wrote:
>
>> On 05/05/21 13:02, David G. Johnston wrote:
>> > Why not just allow: "DELIMITER NONE" to be valid syntax meaning exactly
>> > what it says and does exactly what you desire?
>>
>> What would it mean? That you get one column, multiple rows of text
>> corresponding to "lines" delimited by something, or that you get one
>> column, one row of text for the entire content of the file?
>>
>
> It means no column delimiter. In other words, there is no character
> which marks the end of a data value, so the entire line is a single data
> value.
>
>
This.  When dealing with COPY it's expected that each line becomes its own
row.  On the server you can do pg_read_file() if you need the entire file
to be considered a single value.  psql (\I and variables) is a bit more
hackey, but I'd rather see that improved directly anyway if the goal is to
try and make getting the "whole document" easier - copy isn't the right API
for that IMO.

David J.


v14 mechanical code beautification patches

2021-05-05 Thread Tom Lane
It's getting to be time to think about these steps for v14:

* Renumber any manually-assigned OIDs between 8000 and 
  to lower numbers, using renumber_oids.pl (see notes in bki.sgml)

* pgindent, perltidy, reformat-dat-files

* Update config.guess and config.sub
  (from https://savannah.gnu.org/projects/config)

* Update Unicode data: Edit UNICODE_VERSION and CLDR_VERSION in
  src/Makefile.global.in, run make update-unicode, and commit.

It looks like Peter already took care of the last two.  Barring
objections, I'll plan to do the first two next Wednesday or so
(after the back-branch-release dust has settled).

I notice that we also list this as a pre-beta task in
src/tools/RELEASE_CHANGES:

* Update inet/cidr data types with newest Bind patches

However, I can't recall that anyone has ever done any such thing;
and at this point, any attempt to re-sync that code would likely
be a rather major task.  Should we take that off the checklist?

regards, tom lane




Re: MaxOffsetNumber for Table AMs

2021-05-05 Thread Andres Freund
Hi,

On 2021-05-05 13:32:57 -0400, Robert Haas wrote:
> I don't know what to say here. I think it's unrealistic to believe
> that a very new API that has only 1 in-core user is going to be fully
> stable, or that we can know how it might evolve. I can understand why
> you and probably other people want that, but if somebody figures out a
> way to make some part of core significantly better and it requires
> changing that API, they're going to change the API, not give up on the
> idea.

Yea. I think it would be actively *bad* if tableam were too
stable. tableam is at best an 80% solution to the abstraction needs
(those 80% were pretty painful to achieve already, I don't think we
could have gotten much more initially). If we get cornered into not
evolving the API because of 2-3 external users, we're a) going to live
with a leaky abstraction for much longer b) getting more hesitant to
work incrementally. Both would be bad.

Greetings,

Andres Freund




Re: MaxOffsetNumber for Table AMs

2021-05-05 Thread Peter Geoghegan
On Wed, May 5, 2021 at 10:57 AM Robert Haas  wrote:
> One advantage of indirect indexes is that you can potentially avoid a
> lot of writes to the index. If a non-HOT update is performed, but the
> primary key is not updated, the index does not need to be touched. I
> think that's a potentially significant savings, even if bottom-up
> index deletion would have prevented the page splits. Similarly, you
> can mark a dead line pointer unused without having to scan the
> indirect index, because the index isn't pointing to that dead line
> pointer anyway.

As I said, this is equivalent to solving the "TID is a stable
identifier of logical row" issue (an exceptionally hard problem that I
don't think is worth solving), except that you make the secondary
indexes have potentially larger keys for no benefit. Sure, you can
consistently refer to a logical row using its PK value (assuming you
have this whole two-phase locking infrastructure), but why wouldn't
you "just" solve the problem with TID directly instead? What does
involving PK values actually buy you? I am pretty sure that the answer
is "less than nothing".

It is still true that I'm arguing against ever having a clustered
index table AM, which would be somewhat useful to users (that much
I'll own). The main reason for that is because we'd still be required
to solve the "TID is a stable identifier of logical row" issue, except
it's not a physiological TID/rowid (it's a fully logical row
identifier). So everything seems to lead back to that hard problem
seeming insoluble.

> Hmm, but I guess you have another cleanup problem. What prevents
> someone from inserting a new row with the same primary key as a
> previously-deleted row but different values in some indirectly-indexed
> column?

Two-phase locking in indexes stops it. Note that this is pretty much
what happens in Oracle -- it's not just SQL Server.

This is why we have rich extensibility indexing -- indexes are
strictly physical data structures in Postgres.

> And, anyway, this whole argument is predicated on the fact that the
> only table AM we have right now is heapam. If we had a table AM that
> organized the data by primary key value, we'd still want to be able to
> have secondary indexes, and they'd have to use the primary key value
> as the TID.

But Jeff has a design for the columnstore table AM where TIDs are
essentially logical (not physiological like those of heapam), that
nevertheless will work with the design around TIDs that I have in
mind. "Logical identifiers" versus "Logical identifiers that stably
identify logical rows" seems like a subtle but important distinction
here.

Of course I cannot yet rule out the possibility that this approach to
TIDs will always be good enough. But it sure seems like it might be,
and starting with the assumption that it is and working backwards
seems like a good way to attack the problem as a practical matter.

> > I think that global indexes are well worth having, and should be
> > solved some completely different way. The partition key can be an
> > additive thing.
>
> I agree that the partition identifier should be an additive thing, but
> where would we add it? It seems to me that the obvious answer is to
> make it a column of the index tuple.

Right.

> And if we can do that, why can't
> we put whatever kind of TID-like stuff people want in the index tuple,
> too? Maybe part of the problem here is that I don't actually
> understand how posting lists are represented...

You want to use the partition identifier for predicate push-down and
stuff anyway, so making it part of the TID doesn't seem particularly
natural to me.

"Posting list splits" from the nbtree README will give you some idea
of why I care about making TIDs integer-like and equi-sized within any
given index tuple. There will be similar considerations for GIN.
Though I think that nbtree deduplication is important enough on its
own to try to preserve across table AMs.

-- 
Peter Geoghegan




Re: MaxOffsetNumber for Table AMs

2021-05-05 Thread Andres Freund
Hi,

On 2021-05-05 10:56:56 -0700, Jeff Davis wrote:
> On Wed, 2021-05-05 at 10:48 -0700, Peter Geoghegan wrote:
> > What we have right now has little chance of failing. It also has
> > little chance of succeeding (except for something like zheap, which
> > can presumably get by with the heapam's idea of TID).
> 
> What has little chance of succeeding? Table AMs?
> 
> And why isn't columnar an example of something that can "get by with
> heapam's idea of TID"? I mean, it's not a perfect fit, but my primary
> complaint in this whole thread is that it's undefined, not that it's
> completely unworkable.

Agreed. And we can increase the fit a good bit without needing invasive
all-over changes. With those changes often even helping heap.

E.g. tidbitmap.c's hardcoded use of MaxHeapTuplesPerPage is a problem
even for heap - we waste a lot of space that's not commonly used. A
better datastructure would help (something radix-tree-like, I'd say, but
several tree-shaped approaches seem possible).

Greetings,

Andres Freund




Re: RFC: Detailed reorder buffer stats dumps

2021-05-05 Thread Andres Freund
Hi,

On 2021-05-05 18:33:27 +0800, Craig Ringer wrote:
> I'm thinking of piggy-backing on the approach used in the "Get memory
> contexts of an arbitrary backend process" patch in order to provide access
> to detailed reorder buffer content statistics from walsenders on request.
> 
> Right now the reorder buffer is mostly a black-box. I mostly rely on gdb or
> on dynamic probes (perf, systemtap) to see what it's doing. I intend a
> patch soon to add a couple of fields to struct WalSnd to report some very
> coarse reorder buffer stats - at least oldest buffered xid, number of
> buffered txns, total bytes of buffered txns in memory, total bytes of
> buffered txns spilled to disk.
> 
> But sometimes what I really want is details on the txns that're in the
> reorder buffer, and that's not feasible to export via always-enabled
> reporting like struct WalSnd. So I'm thinking that the same approach used
> for the memory context stats patch might work well for asking the walsender
> for a detailed dump of reorder buffer contents. Something like a
> per-buffered-txn table of txn topxid, start-lsn, most recent change lsn,
> number of changes, number of subxids, number of invalidations, number of
> catalog changes, buffer size in memory, buffer size spilled to disk.
> 
> Anyone drastically opposed to the idea?

I am doubtful. The likelihood of ending up with effectively unused code
seems very substantial here.

Greetings,

Andres Freund




Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-05 Thread Joel Jacobson
On Wed, May 5, 2021, at 19:58, David G. Johnston wrote:
> On Wed, May 5, 2021 at 10:34 AM Isaac Morland  wrote:
>> On Wed, 5 May 2021 at 13:23, Chapman Flack  wrote:
>>> On 05/05/21 13:02, David G. Johnston wrote:
>>> > Why not just allow: "DELIMITER NONE" to be valid syntax meaning exactly
>>> > what it says and does exactly what you desire?
>>> 
>>> What would it mean? That you get one column, multiple rows of text
>>> corresponding to "lines" delimited by something, or that you get one
>>> column, one row of text for the entire content of the file?
>> 
>> It means no column delimiter. In other words, there is no character which 
>> marks the end of a data value, so the entire line is a single data value.
>> 
> 
> This.  When dealing with COPY it's expected that each line becomes its own 
> row.  On the server you can do pg_read_file() if you need the entire file to 
> be considered a single value.  psql (\I and variables) is a bit more hacky, 
> but I'd rather see that improved directly anyway if the goal is to try and 
> make getting the "whole document" easier - copy isn't the right API for that 
> IMO.

I think you misunderstood the problem.
I don't want the entire file to be considered a single value.
I want each line to become its own row, just a row with a single column.

So I actually think COPY seems like a perfect match for the job,
since it does precisely that, except there is no delimiter in this case.

I'm currently using the pg_read_file()-hack in a project,
and even though it can read files up to 1GB,
using e.g. regexp_split_to_table() to split on E'\n'
seems to need 4x as much memory, so it only
works with files less than ~256MB.

SELECT COUNT(*) FROM regexp_split_to_table(repeat(E'\n',10),E'\n');
ERROR:  invalid memory alloc request size 44
Time: 4151.374 ms (00:04.151)

/Joel

Re: MaxOffsetNumber for Table AMs

2021-05-05 Thread Peter Geoghegan
On Wed, May 5, 2021 at 10:56 AM Jeff Davis  wrote:
> What has little chance of succeeding? Table AMs?
>
> And why isn't columnar an example of something that can "get by with
> heapam's idea of TID"? I mean, it's not a perfect fit, but my primary
> complaint in this whole thread is that it's undefined, not that it's
> completely unworkable.

I think that it could be fairly workable with moderate effort (maybe
even without that effort, but that doesn't seem so appealing). To do
it well we have to actually generalize TIDs sensibly. And I think that
that means admitting that we'll never solve the "TID is a stable
identifier of a logical row, not a physical version" problem. ISTM
that that's the problem that is at the root of everything here.

-- 
Peter Geoghegan




Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-05 Thread Joel Jacobson
On Wed, May 5, 2021, at 19:34, Isaac Morland wrote:
> Would DELIMITER NULL make sense? The existing values are literal strings so 
> NULL fits with that. Do we already have NONE as a keyword somewhere? It's 
> listed in the keyword appendix to the documentation but I can't think of 
> where it is used off the top of my head.

+1 to using some keyword. NULL or NONE seems fine to me. Or maybe WITHOUT 
DELIMITER?

/Joel

Re: MaxOffsetNumber for Table AMs

2021-05-05 Thread Peter Geoghegan
On Wed, May 5, 2021 at 11:25 AM Andres Freund  wrote:
> Agreed. And we can increase the fit a good bit without needing invasive
> all-over changes. With those changes often even helping heap.
>
> E.g. tidbitmap.c's harcoded use of MaxHeapTuplesPerPage is a problem
> even for heap - we waste a lot of space that's not commonly used. A
> better datastructure (radix tree like I'd say, but several tree shaped
> approaches seem possible).

Agreed -- even if we only cared about heapam we still ought to do
something about tidbitmap.c's use of MaxHeapTuplesPerPage.

-- 
Peter Geoghegan




Re: Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-05 Thread Tom Lane
"Joel Jacobson"  writes:
> I think you misunderstood the problem.
> I don't want the entire file to be considered a single value.
> I want each line to become its own row, just a row with a single column.

> So I actually think COPY seems like a perfect match for the job,
> since it does precisely that, except there is no delimiter in this case.

Well, there's more to it than just the column delimiter.

* What about \N being converted to NULL?
* What about \. being treated as EOF?
* Do you want to turn off the special behavior of backslash (ESCAPE)
  altogether?
* What about newline conversions (\r\n being seen as just \n, etc)?

I'm inclined to think that "use pg_read_file and then split at newlines"
might be a saner answer than delving into all these fine points.
Not least because people yell when you add cycles to the COPY
inner loops.

> I'm currently using the pg_read_file()-hack in a project,
> and even though it can read files up to 1GB,
> using e.g. regexp_split_to_table() to split on E'\n'
> seems to need 4x as much memory, so it only
> works with files less than ~256MB.

Yeah, that's because of the conversion to "chr".  But a regexp
is overkill for that anyway.  Don't we have something that will
split on simple substring matches?

regards, tom lane




Re: MaxOffsetNumber for Table AMs

2021-05-05 Thread Jeff Davis
On Wed, 2021-05-05 at 11:22 -0700, Andres Freund wrote:
> Yea. I think it would be actively *bad* if tableam were too
> stable. tableam is at best an 80% solution to the abstraction needs
> (those 80% were pretty painful to achieve already, I don't think we
> could have gotten much more initially). If we get cornered into not
> evolving the API because of 2-3 external users, we're a) going to
> live
> with a leaky abstraction for much longer b) getting more hesitant to
> work incrementally. Both would be bad.

Like anything, we make the decision at the time we have a reason to
break something. But why are extensions disfavored in this
calculation vs. in-core? Isn't it a lot easier to update in-core code
to new APIs?

Evolving the API is one thing, but we should be more careful about
things that could affect on-disk state like ItemPointer
representations. By "more careful", I don't mean that we reject all
proposals; I mean that we don't casually impose new limits in other
parts of the system that happen to work for heapam but will cause table
AM extensions to break.

Regards,
Jeff Davis






Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-05 Thread Andrew Dunstan


On 5/5/21 2:45 PM, Tom Lane wrote:
> "Joel Jacobson"  writes:
>> I think you misunderstood the problem.
>> I don't want the entire file to be considered a single value.
>> I want each line to become its own row, just a row with a single column.
>> So I actually think COPY seems like a perfect match for the job,
>> since it does precisely that, except there is no delimiter in this case.
> Well, there's more to it than just the column delimiter.
>
> * What about \N being converted to NULL?
> * What about \. being treated as EOF?
> * Do you want to turn off the special behavior of backslash (ESCAPE)
>   altogether?
> * What about newline conversions (\r\n being seen as just \n, etc)?
>
> I'm inclined to think that "use pg_read_file and then split at newlines"
> might be a saner answer than delving into all these fine points.
> Not least because people yell when you add cycles to the COPY
> inner loops.



+1


Also we have generally been resistant to supporting odd formats. FDWs
can help here (e.g. file_text_array), but they can't use STDIN IIRC.


>
>> I'm currently using the pg_read_file()-hack in a project,
>> and even though it can read files up to 1GB,
>> using e.g. regexp_split_to_table() to split on E'\n'
>> seems to need 4x as much memory, so it only
>> works with files less than ~256MB.
> Yeah, that's because of the conversion to "chr".  But a regexp
> is overkill for that anyway.  Don't we have something that will
> split on simple substring matches?
>
>   



Not that I know of. There is split_part but I don't think that's fit for
purpose here. Do we need one, or have I missed something?


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: MaxOffsetNumber for Table AMs

2021-05-05 Thread Peter Geoghegan
On Wed, May 5, 2021 at 12:09 PM Jeff Davis  wrote:
> Like anything, we make the decision at the time we have a reason to
> break something. But why are are exensions disfavored in this
> calculation vs. in-core? Isn't it a lot easier to update in-core code
> to new APIs?

We don't really have an API for how TIDs behave (unless you happen to
want to emulate heapam, which is reasonable and was expected). It's
unspecified because nobody knows what it is (or what it should be)
just yet. AFAICT there is no TID API to break.

-- 
Peter Geoghegan




Re: Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-05 Thread Justin Pryzby
On Wed, May 05, 2021 at 02:45:41PM -0400, Tom Lane wrote:
> > I'm currently using the pg_read_file()-hack in a project,
> > and even though it can read files up to 1GB,
> > using e.g. regexp_split_to_table() to split on E'\n'
> > seems to need 4x as much memory, so it only
> > works with files less than ~256MB.
> 
> Yeah, that's because of the conversion to "chr".  But a regexp
> is overkill for that anyway.  Don't we have something that will
> split on simple substring matches?

For v14

commit 66f163068030b5c5fe792a0daee27822dac43791
Author: Tom Lane 
Date:   Wed Sep 2 18:23:56 2020 -0400

Add string_to_table() function.

-- 
Justin




Re: MaxOffsetNumber for Table AMs

2021-05-05 Thread Matthias van de Meent
On Wed, 5 May 2021 at 19:15, Peter Geoghegan  wrote:
>
> On Wed, May 5, 2021 at 9:42 AM Robert Haas  wrote:
> > On Wed, May 5, 2021 at 11:50 AM Peter Geoghegan  wrote:
> > > I'm being very vocal here because I'm concerned that we're going about
> > > generalizing TIDs in the wrong way. To me it feels like there is a
> > > loss of perspective about what really matters.
> >
> > Well, which things matter is a question of opinion, not fact.
>
> I'm not trying to win an argument here. I am giving an opinion in the
> hopes that it leads to some kind of useful synthesis, based on all of
> our opinions.
>
> > > No other database system has something like indirect indexes. They
> > > have clustered indexes, but that's rather different.
> >
> > I don't think this is true at all. If you have a clustered index -
> > i.e. the table is physically arranged according to the index ordering
> > - then your secondary indexes all pretty much have to be what we're
> > calling indirect indexes. They can hardly point to a physical
> > identifier if rows are being moved around. I believe InnoDB works this
> > way, and I think Oracle's index-organized tables do too. I suspect
> > there are other examples.
>
> But these systems don't have indirect indexes *on a heap table*! Why
> would they ever do it that way? They already have rowid/TID as a
> stable identifier of logical rows, so having indirect indexes that
> point to a heap table's rows would be strictly worse than the generic
> approach for indexes on a heap table.
>
> What we call indirect indexes seem to me to be a failed attempt to
> solve the "TID is not a stable identifier of logical row" issue that
> is baked-in to Postgres. If I thought it was worth solving that
> problem then I suppose I'd solve it directly. The "indirection" of
> indirect indexes actually buys you nothing! It just moves some of the
> problem somewhere else, at the cost of even more complexity. Indirect
> indexes (without a clustered index) are a muddled idea.
>
> Of course I accept that clustered indexes make sense in general
> (though less and less these days). But the fact that these systems
> "use indirect indexes" for secondary indexes is precisely why
> clustered indexes don't seem like a great design with modern hardware!
> Should we invest a huge amount of work in order to have all of the
> disadvantages, and none of the advantages?
>
> > My point is that so far I am not seeing a whole lot of value of this
> > proposed approach. For a 64-bit TID to be valuable to you, one of two
> > things has to be true: you either don't care about having indexes that
> > store TIDs on your new table type, or the index types you want to use
> > can store those 64-bit TIDs. Now, I have not yet heard of anyone
> > working on a table AM who does not want to be able to support adding
> > btree indexes. There may be someone that I don't know about, and if
> > so, fine. But otherwise, we need a way to store them. And that
> > requires changing the page format for btree indexes. But surely we do
> > not want to make all TIDs everywhere wider in future btree versions,
> > so at least two TID widths - 6 bytes and 8 bytes - would have to be
> > supported.
>
> I agree that we don't want a performance/space overhead for simple
> cases that are quite happy with the current format.
>
> > And if we're at all going to do that, I think it's
> > certainly worth asking whether supporting varlena TIDs would really be
> > all that much harder. You seem to think it is, and you might be right,
> > but I'm not ready to give up, because I do not see how we are ever
> > going to get global indexes or indirect indexes without doing it, and
> > those would be good features to have.
>
> I think that global indexes are well worth having, and should be
> solved some completely different way. The partition key can be an
> additive thing.

I believe that it cannot be "just" an additive thing, at least not
through a normal INCLUDEd column, as you'd get duplicate TIDs in the
index, with its related problems. You also cannot add it as a key
column, as this would disable UNIQUE indexes; one of the largest use
cases of global indexes. So, you must create specialized
infrastructure for this identifier.

And when we're already adding specialized infrastructure, then this
should probably be part of a new TID infrastructure.

And if we're going to change TID infrastructure to allow for more
sizes (as we'd need normal TableAM TIDs, and global index
partition-identifying TIDs), I'd argue that it should not be too much
more difficult to create an infrastructure for 'new TID' in which the
table AM supplies type, size and strict ordering information for these
'new TID's.

And if this 'new TID' size is not going to be defined by the index AM
but by the indexed object (be it a table or a 'global' or whatever
we'll build indexes on), I see no reason why this 'new TID'
infrastructure couldn't eventually support variable length TIDs; or
constant sized usertype TIDs (e.g. the 3 int columns of the primary
key of a clustered table).

Dubious assertion in RegisterDynamicBackgroundWorker

2021-05-05 Thread Tom Lane
I noticed this recent crash on lorikeet:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2021-05-05%2009%3A19%3A29

The relevant bits of the log seem to be

2021-05-05 05:36:22.011 EDT [60926716.bb24:1] ERROR:  could not map dynamic 
shared memory segment
...
2021-05-05 05:36:22.013 EDT [609266c5.b793:4] LOG:  background worker "parallel 
worker" (PID 47908) exited with exit code 1
...
TRAP: FailedAssertion("BackgroundWorkerData->parallel_register_count - 
BackgroundWorkerData->parallel_terminate_count <= MAX_PARALLEL_WORKER_LIMIT", 
File: 
"/home/andrew/bf/root/REL_13_STABLE/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c",
 Line: 1016)
*** starting debugger for pid 47743, tid 1264
2021-05-05 05:36:26.629 EDT [609266c5.b793:5] LOG:  server process (PID 47743) 
exited with exit code 127

So we had a parallel worker fail to start, whereupon its leader went down
with an assertion failure.  I know that the parallel-worker code is held
together with chewing gum and baling wire, but that's a bit much.

Looking at the indicated code, we find

/*
 * If this is a parallel worker, check whether there are already too many
 * parallel workers; if so, don't register another one.  Our view of
 * parallel_terminate_count may be slightly stale, but that doesn't really
 * matter: we would have gotten the same result if we'd arrived here
 * slightly earlier anyway.  There's no help for it, either, since the
 * postmaster must not take locks; a memory barrier wouldn't guarantee
 * anything useful.
 */
if (parallel && (BackgroundWorkerData->parallel_register_count -
                 BackgroundWorkerData->parallel_terminate_count) >=
    max_parallel_workers)
{
    Assert(BackgroundWorkerData->parallel_register_count -
           BackgroundWorkerData->parallel_terminate_count <=
           MAX_PARALLEL_WORKER_LIMIT);
    LWLockRelease(BackgroundWorkerLock);
    return false;
}

I would like to know on what grounds that Assert isn't insanity?
We just finished pointing out that we might see an old
parallel_terminate_count, which ISTM says specifically that
parallel_register_count minus parallel_terminate_count might
be larger than expected.

Admittedly, it seems unlikely that the difference could exceed
MAX_PARALLEL_WORKER_LIMIT = 1024 in a regression test run where
the limit on number of parallel workers is only 8.  What I think is
more likely, given that these counters are unsigned, is that the
difference was actually negative -- in which case the unsigned
subtraction wraps around to a huge value that easily passes the
max_parallel_workers test and trips the Assert.  That could be a bug,
or it could be an expectable race condition, or it could just be some
flakiness on lorikeet's part (that machine has had a lot of issues
lately).

I trawled the buildfarm logs going back 180 days, and found no
other instances of this assertion, which seems to be evidence
in favor of the "lorikeet got flaky" theory.  But it's not proof.

In any case, I see zero value in this assertion, so I propose
we remove it.  If we don't remove it, it needs serious revision,
because it seems absolutely obvious to me that it could trigger
when there is nothing wrong.  A system pushing the limit of
number of parallel workers would be at considerable risk.

regards, tom lane




Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-05 Thread Andrew Dunstan


On 5/5/21 3:36 PM, Justin Pryzby wrote:
> On Wed, May 05, 2021 at 02:45:41PM -0400, Tom Lane wrote:
>>> I'm currently using the pg_read_file()-hack in a project,
>>> and even though it can read files up to 1GB,
>>> using e.g. regexp_split_to_table() to split on E'\n'
>>> seems to need 4x as much memory, so it only
>>> works with files less than ~256MB.
>> Yeah, that's because of the conversion to "chr".  But a regexp
>> is overkill for that anyway.  Don't we have something that will
>> split on simple substring matches?
> For v14
>
> commit 66f163068030b5c5fe792a0daee27822dac43791
> Author: Tom Lane 
> Date:   Wed Sep 2 18:23:56 2020 -0400
>
> Add string_to_table() function.
>

Ha! just in time :-)


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

2021-05-05 Thread Tom Lane
Andrew Dunstan  writes:
> On 5/5/21 2:45 PM, Tom Lane wrote:
>> Yeah, that's because of the conversion to "chr".  But a regexp
>> is overkill for that anyway.  Don't we have something that will
>> split on simple substring matches?

> Not that I know of. There is split_part but I don't think that's fit for
> purpose here. Do we need one, or have I missed something?

[ checks manual ... ]

string_to_array or string_to_table would do, I think.
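
E.g., roughly like this for the case at hand, assuming the file is
readable by the server and fits under pg_read_file()'s 1GB limit:

SELECT line
FROM string_to_table(pg_read_file('unknown.txt'), E'\n') AS line;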

regards, tom lane




Re: .ready and .done files considered harmful

2021-05-05 Thread Robert Haas
On Wed, May 5, 2021 at 1:06 PM Stephen Frost  wrote:
> It's not just about making sure that we archive the history file for a
> timeline before archiving WAL segments along that timeline but also
> about making sure we get that history file into the archive as fast as
> we can, and archiving a 16MB WAL first would certainly delay that.

Ooph. That's a rather tough constraint. Could we get around it by
introducing some kind of signalling mechanism, perhaps? Like if
there's a new history file, that must mean the server has switched
timelines -- I think, anyway -- so if we notified the archiver every
time there was a timeline switch it could react accordingly.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: MaxOffsetNumber for Table AMs

2021-05-05 Thread Peter Geoghegan
On Wed, May 5, 2021 at 12:43 PM Matthias van de Meent
 wrote:
> I believe that it cannot be "just" an additive thing, at least not
> through a normal INCLUDEd column, as you'd get duplicate TIDs in the
> index, with its related problems. You also cannot add it as a key
> column, as this would disable UNIQUE indexes; one of the largest use
> cases of global indexes. So, you must create specialized
> infrastructure for this identifier.

You're just quibbling about the precise words that I used. Of course
it is true that there must be some sense in which a global index
partition key attribute will need to be special to the implementation
-- how else could a global index enforce uniqueness? That was clearly
implied.

> And when we're already adding specialized infrastructure, then this
> should probably be part of a new TID infrastructure.

This is a non-sequitur.

> And if we're going to change TID infrastructure to allow for more
> sizes (as we'd need normal TableAM TIDs, and global index
> partition-identifying TIDs), I'd argue that it should not be too much
> more difficult to create an infrastructure for 'new TID' in which the
> table AM supplies type, size and strict ordering information for these
> 'new TID's.
>
> And if this 'new TID' size is not going to be defined by the index AM
> but by the indexed object (be it a table or a 'global' or whatever
> we'll build indexes on), I see no reason why this 'new TID'
> infrastructure couldn't eventually support variable length TIDs; or
> constant sized usertype TIDs (e.g. the 3 int columns of the primary
> key of a clustered table).

You're not considering the big picture. It's not self-evident that
anybody will ever have much use for a variable-width TID in their
table AM, at least beyond some fairly simple scheme -- because of the
fundamental issue of TID not working as a stable identifier of logical
rows in Postgres. If it was very clear that there would be *some*
significant benefit then the costs might start to look reasonable. But
there isn't. "Build it and they will come" is not at all convincing to
me.

-- 
Peter Geoghegan




Re: .ready and .done files considered harmful

2021-05-05 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Wed, May 5, 2021 at 1:06 PM Stephen Frost  wrote:
> > It's not just about making sure that we archive the history file for a
> > timeline before archiving WAL segments along that timeline but also
> > about making sure we get that history file into the archive as fast as
> > we can, and archiving a 16MB WAL first would certainly delay that.
> 
> Ooph. That's a rather tough constraint. Could we get around it by
> introducing some kind of signalling mechanism, perhaps? Like if
> there's a new history file, that must mean the server has switched
> timelines -- I think, anyway -- so if we notified the archiver every
> time there was a timeline switch it could react accordingly.

I would think something like that would be alright and not worse than
what we've got now.

That said, in an ideal world, we'd have a way to get the new timeline to
switch to in a way that doesn't leave open race conditions, so as long as
we're talking about big changes to the way archiving and archive_command
work (or about throwing out the horrible idea that is archive_command in
the first place and replacing it with appropriate hooks such that
someone could install an extension which would handle archiving...), I
would hope we'd have a way of saying "please, atomically, go get me a new
timeline."

Just as a reminder for those following along at home, as I'm sure you're
already aware, the way we figure out what timeline to switch to when a
replica is getting promoted is that we go run the restore command asking
for history files until we get back "nope, there is no file named
123.history", and then we switch to that timeline and then try to
push such a history file into the repo and hope that it works.
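
For illustration, a minimal sketch of that probe loop, modeled loosely on
findNewestTimeLine()/existsTimeLineHistory() in timeline.c (simplified and
not the actual code; on a standby, existsTimeLineHistory() ends up running
restore_command for "<TLI>.history"):

static TimeLineID
choose_new_timeline(TimeLineID startTLI)
{
    TimeLineID  probeTLI;

    for (probeTLI = startTLI + 1;; probeTLI++)
    {
        /* the first TLI whose history file cannot be fetched is "free" */
        if (!existsTimeLineHistory(probeTLI))
            return probeTLI;
    }
}

which is exactly where the race lives: two nodes promoting at the same time
can both conclude that the same TLI is free.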

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: .ready and .done files considered harmful

2021-05-05 Thread Robert Haas
On Wed, May 5, 2021 at 4:13 PM Stephen Frost  wrote:
> I would think something like that would be alright and not worse than
> what we've got now.

OK.

> That said, in an ideal world, we'd have a way to get the new timeline to
> switch to in a way that doesn't leave open race conditions, so as long
> we're talking about big changes to the way archiving and archive_command
> work (or about throwing out the horrible idea that is archive_command in
> the first place and replacing it with appropriate hooks such that
> someone could install an extension which would handle archiving...), I
> would hope we'd have a way of saying "please, atomically, go get me a new
> timeline."
>
> Just as a reminder for those following along at home, as I'm sure you're
> already aware, the way we figure out what timeline to switch to when a
> replica is getting promoted is that we go run the restore command asking
> for history files until we get back "nope, there is no file named
> 123.history", and then we switch to that timeline and then try to
> push such a history file into the repo and hope that it works.

Huh, I had not thought about that problem. So, at the risk of getting
sidetracked, what exactly are you asking for here? Let the extension
pick the timeline using an algorithm of its own devising, rather than
having core do it? Or what?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: .ready and .done files considered harmful

2021-05-05 Thread Andres Freund
Hi,

On 2021-05-05 16:13:08 -0400, Stephen Frost wrote:
> Just as a reminder for those following along at home, as I'm sure you're
> already aware, the way we figure out what timeline to switch to when a
> replica is getting promoted is that we go run the restore command asking
> for history files until we get back "nope, there is no file named
> 123.history", and then we switch to that timeline and then try to
> push such a history file into the repo and hope that it works.

Which is why the whole concept of timelines as we have them right now is
pretty much useless. It is fundamentally impossible to guarantee unique
timeline ids in all cases if they are assigned sequentially at timeline
creation - consider needing to promote a node on both ends of a split
network.  I'm quite doubtful that pretending to tackle this problem via
archiving order is a good idea, given the fundamentally racy nature.

Greetings,

Andres Freund




Re: .ready and .done files considered harmful

2021-05-05 Thread Andres Freund
Hi,

On 2021-05-05 16:22:21 -0400, Robert Haas wrote:
> Huh, I had not thought about that problem. So, at the risk of getting
> sidetracked, what exactly are you asking for here? Let the extension
> pick the timeline using an algorithm of its own devising, rather than
> having core do it? Or what?

Not Stephen, but to me the most reasonable way to address this is to
make the timeline identifier wider and randomly allocated. The
sequential-looking nature of timelines imo is actively unhelpful.
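
A minimal sketch of what "wider and randomly allocated" could look like,
assuming a hypothetical 64-bit timeline identifier and reusing the existing
pg_strong_random() helper (nothing like this exists today; TimeLineID is a
sequentially assigned uint32):

static uint64
choose_random_timeline_id(void)
{
    uint64      tli;

    /* draw until we get a usable value; keep 0 reserved as "invalid" */
    do
    {
        if (!pg_strong_random(&tli, sizeof(tli)))
            elog(ERROR, "could not generate random timeline identifier");
    } while (tli == 0);

    return tli;
}

With 64 random bits, two nodes promoting independently on both ends of a
split network are overwhelmingly unlikely to pick the same identifier.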

Greetings,

Andres Freund




Re: .ready and .done files considered harmful

2021-05-05 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Wed, May 5, 2021 at 4:13 PM Stephen Frost  wrote:
> > That said, in an ideal world, we'd have a way to get the new timeline to
> > switch to in a way that doesn't leave open race conditions, so as long
> > we're talking about big changes to the way archiving and archive_command
> > work (or about throwing out the horrible idea that is archive_command in
> > the first place and replacing it with appropriate hooks such that
> > someone could install an extension which would handle archiving...), I
> > would hope we'd have a way of saying "please, atomically, go get me a new
> > timeline."
> >
> > Just as a reminder for those following along at home, as I'm sure you're
> > already aware, the way we figure out what timeline to switch to when a
> > replica is getting promoted is that we go run the restore command asking
> > for history files until we get back "nope, there is no file named
> > 123.history", and then we switch to that timeline and then try to
> > push such a history file into the repo and hope that it works.
> 
> Huh, I had not thought about that problem. So, at the risk of getting
> sidetracked, what exactly are you asking for here? Let the extension
> pick the timeline using an algorithm of its own devising, rather than
> having core do it? Or what?

Having the extension do it somehow is an interesting idea and one which
might be kind of cool.

The first thought I had was to make it archive_command's job to "pick"
the timeline by just re-trying to push the .history file (the actual
contents of it don't change, as the information in the file is about the
timeline we are switching *from* and at what LSN).  That requires an
archive command which will fail if that file already exists though and,
ideally, would perform the file archival in an atomic fashion (though
this last bit isn't strictly necessary - anything along these lines would
certainly be better than the current state).

Having an entirely independent command/hook that's explicitly for this
case would be another approach, of course, either in a manner that
allows the extension to pick the destination timeline or is defined to
be "return success only if the file is successfully archived, but do
*not* overwrite any existing file of the same name and return an error
instead." and then the same approach as outlined above.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Dubious assertion in RegisterDynamicBackgroundWorker

2021-05-05 Thread Robert Haas
On Wed, May 5, 2021 at 3:46 PM Tom Lane  wrote:
> Admittedly, it seems unlikely that the difference could exceed
> MAX_PARALLEL_WORKER_LIMIT = 1024 in a regression test run where
> the limit on number of parallel workers is only 8.  What I think is
> more likely, given that these counters are unsigned, is that the
> difference was actually negative.  Which could be a bug, or it could
> be an expectable race condition, or it could just be some flakiness
> on lorikeet's part (that machine has had a lot of issues lately).

I think that assertion was added by me, and I think the thought
process was that the value shouldn't go negative and that if it does
it's probably a bug which we might want to fix. But since the values
are unsigned I could hardly check for < 0, so I did it this way
instead.

But since there's no memory barrier between the two loads, I guess
there's no guarantee that they have the expected relationship, even if
there is a memory barrier on the store side. I wonder if it's worth
trying to tighten that up so that the assertion is more meaningful, or
just give up and rip it out. I'm afraid that if we do have (or
develop) bugs in this area, someone will discover that the effective
max_parallel_workers value on their system slowly drifts up or down
from the configured value, and we'll have no clue where things are
going wrong. The assertion was intended to give us a chance of
noticing that sort of problem in the buildfarm or on a developer's
machine before the code gets out into the real world.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: .ready and .done files considered harmful

2021-05-05 Thread Robert Haas
On Wed, May 5, 2021 at 4:31 PM Andres Freund  wrote:
> On 2021-05-05 16:22:21 -0400, Robert Haas wrote:
> > Huh, I had not thought about that problem. So, at the risk of getting
> > sidetracked, what exactly are you asking for here? Let the extension
> > pick the timeline using an algorithm of its own devising, rather than
> > having core do it? Or what?
>
> Not Stephen, but to me the most reasonable way to address this is to
> make timeline identifier wider and randomly allocated. The sequential
> looking natures of timelines imo is actively unhelpful.

Yeah, I always wondered why we didn't assign them randomly.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: .ready and .done files considered harmful

2021-05-05 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Wed, May 5, 2021 at 4:31 PM Andres Freund  wrote:
> > On 2021-05-05 16:22:21 -0400, Robert Haas wrote:
> > > Huh, I had not thought about that problem. So, at the risk of getting
> > > sidetracked, what exactly are you asking for here? Let the extension
> > > pick the timeline using an algorithm of its own devising, rather than
> > > having core do it? Or what?
> >
> > Not Stephen, but to me the most reasonable way to address this is to
> > make timeline identifier wider and randomly allocated. The sequential
> > looking natures of timelines imo is actively unhelpful.
> 
> Yeah, I always wondered why we didn't assign them randomly.

Based on what we do today regarding the info we put into .history files,
trying to figure out which is the "latest" timeline might be a bit
tricky with randomly selected timelines.  Maybe we could find a way to
solve that though.

I do note that this comment in timeline.c is, ahem, perhaps over-stating
things a bit:

 * Note: while this is somewhat heuristic, it does positively guarantee
 * that (result + 1) is not a known timeline, and therefore it should
 * be safe to assign that ID to a new timeline.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Bogus collation version recording in recordMultipleDependencies

2021-05-05 Thread Andrew Dunstan


On 4/21/21 4:28 PM, Andres Freund wrote:
> Hi,
>
> On 2021-04-20 12:05:27 +1200, Thomas Munro wrote:
>> I'll hold off reverting for a few more days to see if anyone has any
>> other thoughts on that, because there doesn't seem to be any advantage
>> in being too hasty about it.
> I'm not really convinced that this is warranted, and that it isn't
> better addressed by reducing the scope of the feature:
>
> When using index collation versions to decide whether to reindex
> individual indexes it is important to not have any false negatives -
> otherwise the feature could trigger corruption.
>
> However, the feature has a second, IMO more crucial, aspect: Preventing
> silent corruption due to collation changes. There are regular reports of
> people corrupting their indexes (and subsequently constraints) due to
> collation changes (or collation differences between primary/replica).
> To be effective detecting such cases it is not required to catch 100% of
> all dangerous cases, just that a high fraction of cases is caught.
>
> And handling the composite type case doesn't seem like it'd impact the
> percentage of detected collation issues all that much. For one, indexes
> on composite types aren't all that common, and adding new columns to
> those composite types is likely even rarer. For another, I'd expect that
> nearly all databases that have indexes on composite types also have
> indexes on non-composite text columns - which'd be likely to catch the
> issue.
>
> Given that this is a regularly occurring source of corruption for users,
> and not one just negligent operators run into (we want people to upgrade
> OS versions), I think we ought to factor that into our decision what to
> do.
>


Hi,


this is an open item for release 14. The discussion seems to have gone
silent for a couple of weeks. Are we in a position to make any
decisions? I hear what Andres says, but is anyone acting on it?


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





cache lookup failed for statistics object 123

2021-05-05 Thread Justin Pryzby
Per sqlsmith.

postgres=# SELECT pg_get_statisticsobjdef_expressions(123);
ERROR:  cache lookup failed for statistics object 123
postgres=# \errverbose 
ERROR:  XX000: cache lookup failed for statistics object 123
LOCATION:  pg_get_statisticsobjdef_expressions, ruleutils.c:1762

The expectation is that sql callable functions should return null rather than
hitting elog().

In the 003 patch, I wonder if this part should be updated, too:

| ... which can greatly improve query plans that use the expression index.

It can improve even queries that don't use the index, right?
Say, if a query has f(x) = 11, and the MCV list for the expression shows that
50% of the table has f(x)=11, then the query might decide to *not* use an index
scan.

-- 
Justin
>From 6dec09300e4ad6cc7977acbfee9db7087420a9b5 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Wed, 5 May 2021 04:29:00 -0500
Subject: [PATCH 1/3] Return NULL rather than elog(ERROR) for sql-callable
 function

---
 src/backend/utils/adt/ruleutils.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 0a4fa93d01..881e8ec03d 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -1759,9 +1759,9 @@ pg_get_statisticsobjdef_expressions(PG_FUNCTION_ARGS)
 	statexttup = SearchSysCache1(STATEXTOID, ObjectIdGetDatum(statextid));
 
 	if (!HeapTupleIsValid(statexttup))
-		elog(ERROR, "cache lookup failed for statistics object %u", statextid);
+		PG_RETURN_NULL();
 
-	/* has the statistics expressions? */
+	/* Does the stats object have expressions? */
 	has_exprs = !heap_attisnull(statexttup, Anum_pg_statistic_ext_stxexprs, NULL);
 
 	/* no expressions? we're done */
-- 
2.17.0

>From ede54e64cf5e2249fe6910d6f2c0de177a0edb9b Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 27 Apr 2021 07:57:50 -0500
Subject: [PATCH 2/3] Comment typos: extended stats a4d75c86b and 518442c7f

---
 src/backend/parser/parse_utilcmd.c  | 2 +-
 src/backend/statistics/extended_stats.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
index 9dd30370da..eb9e63f4a8 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -1943,7 +1943,7 @@ generateClonedExtStatsStmt(RangeVar *heapRel, Oid heapRelid,
 	 * simply append them after simple column references.
 	 *
 	 * XXX Some places during build/estimation treat expressions as if they
-	 * are before atttibutes, but for the CREATE command that's entirely
+	 * are before attributes, but for the CREATE command that's entirely
 	 * irrelevant.
 	 */
 	datum = SysCacheGetAttr(STATEXTOID, ht_stats,
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 7e11cb9d5f..5e53783ea6 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -1796,7 +1796,7 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
 continue;
 
 			/*
-			 * Now we know the clause is compatible (we have either atttnums
+			 * Now we know the clause is compatible (we have either attnums
 			 * or expressions extracted from it), and was not estimated yet.
 			 */
 
-- 
2.17.0

>From e64e4b3d206d76ed8bbd5345798e0d35958759bd Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 23 Apr 2021 09:15:40 -0500
Subject: [PATCH 3/3] Mention statistics objects

Previously mentioned at 20210423025012.gi7...@telsasoft.com
---
 doc/src/sgml/maintenance.sgml | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index ee6113926a..de7fd75e1c 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -330,7 +330,8 @@
 
 
  Also, by default there is limited information available about
- the selectivity of functions.  However, if you create an expression
+ the selectivity of functions.  However, if you create a statistics
+ object or an expression
  index that uses a function call, useful statistics will be
  gathered about the function, which can greatly improve query
  plans that use the expression index.
-- 
2.17.0



Re: Bogus collation version recording in recordMultipleDependencies

2021-05-05 Thread Thomas Munro
On Thu, May 6, 2021 at 8:58 AM Andrew Dunstan  wrote:
> this is an open item for release 14 . The discussion seems to have gone
> silent for a couple of weeks. Are we in a position to make any
> decisions? I hear what Andres says, but is anyone acting on it?

I'm going to revert this and resubmit for 15.  That'll give proper
time to reconsider the question of whether pg_depend is right for
this, and come up with a non-rushed response to the composite type
problem etc.




Re: Dubious assertion in RegisterDynamicBackgroundWorker

2021-05-05 Thread Tom Lane
Robert Haas  writes:
> On Wed, May 5, 2021 at 3:46 PM Tom Lane  wrote:
>> Admittedly, it seems unlikely that the difference could exceed
>> MAX_PARALLEL_WORKER_LIMIT = 1024 in a regression test run where
>> the limit on number of parallel workers is only 8.  What I think is
>> more likely, given that these counters are unsigned, is that the
>> difference was actually negative.  Which could be a bug, or it could
>> be an expectable race condition, or it could just be some flakiness
>> on lorikeet's part (that machine has had a lot of issues lately).

> I think that assertion was added by me, and I think the thought
> process was that the value shouldn't go negative and that if it does
> it's probably a bug which we might want to fix. But since the values
> are unsigned I could hardly check for < 0, so I did it this way
> instead.

> But since there's no memory barrier between the two loads, I guess
> there's no guarantee that they have the expected relationship, even if
> there is a memory barrier on the store side. I wonder if it's worth
> trying to tighten that up so that the assertion is more meaningful, or
> just give up and rip it out. I'm afraid that if we do have (or
> develop) bugs in this area, someone will discover that the effective
> max_parallel_workers value on their system slowly drifts up or down
> from the configured value, and we'll have no clue where things are
> going wrong. The assertion was intended to give us a chance of
> noticing that sort of problem in the buildfarm or on a developer's
> machine before the code gets out into the real world.

I follow your concern, but I'm not convinced that this assertion is
a useful aid; first because the asynchrony involved makes the edge
cases rather squishy, and second because allowing 1024 bogus
increments before complaining will likely mean that developer test
runs will not last long enough to trigger the assertion, and third
because if it does fire it's too far removed from the perpetrator
to be much help in figuring out what went wrong, or even if
anything *is* wrong.

I've not tried to trace the code, but I'm now a bit suspicious
that there is indeed a design bug here.  I gather from the
comments that parallel_register_count is incremented by the
worker processes, which of course implies that a worker that
fails to reattach to shared memory won't do that.  But
parallel_terminate_count is incremented by the postmaster.
If the postmaster will do that even in the case of a worker that
failed at startup, then lorikeet's symptoms are neatly explained.

I'd be more comfortable with this code if the increments and
decrements were handled by the same process.

regards, tom lane




Re: Bogus collation version recording in recordMultipleDependencies

2021-05-05 Thread Andrew Dunstan


On 5/5/21 5:12 PM, Thomas Munro wrote:
> On Thu, May 6, 2021 at 8:58 AM Andrew Dunstan  wrote:
>> this is an open item for release 14 . The discussion seems to have gone
>> silent for a couple of weeks. Are we in a position to make any
>> decisions? I hear what Andres says, but is anyone acting on it?
> I'm going to revert this and resubmit for 15.  That'll give proper
> time to reconsider the question of whether pg_depend is right for
> this, and come up with a non-rushed response to the composite type
> problem etc.


OK, thanks.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: MaxOffsetNumber for Table AMs

2021-05-05 Thread Matthias van de Meent
On Wed, 5 May 2021 at 22:09, Peter Geoghegan  wrote:
>
> On Wed, May 5, 2021 at 12:43 PM Matthias van de Meent
>  wrote:
> > I believe that it cannot be "just" an additive thing, at least not
> > through a normal INCLUDEd column, as you'd get duplicate TIDs in the
> > index, with its related problems. You also cannot add it as a key
> > column, as this would disable UNIQUE indexes; one of the largest use
> > cases of global indexes. So, you must create specialized
> > infrastructure for this identifier.
>
> You're just quibbling about the precise words that I used. Of course
> it is true that there must be some sense in which a global index
> partition key attribute will need to be special to the implementation
> -- how else could a global index enforce uniqueness? That was clearly
> implied.

This implication was not 100% clear to me, and the last thread on
global indexes that implemented it through INCLUDEd columns didn't
mention this. As such, I wanted to explicitly mention that this
partition/table identifier would need to be part of the keyspace.

> > And when we're already adding specialized infrastructure, then this
> > should probably be part of a new TID infrastructure.
>
> This is a non-sequitur.

I may have skipped some reasoning:

I believe that the TID is the unique identifier of that tuple, within context.

For normal indexes, the TID as supplied directly by the TableAM is
sufficient, as the context is that table.
For global indexes, this TID must include enough information to relate
it to the table the tuple originated from.
In the whole database, that would be the OID of the table + the TID as
supplied by the table.

As such, the identifier of the logical row (which can be called the
TID), as stored in index tuples in global indexes, would need to
consist of the TableAM supplied TID + the (local) id of the table
containing the tuple. Assuming we're in agreement on that part, I
would think it would be consistent to put this in TID infrastructure,
such that all indexes that use such new TID infrastructure can be
defined to be global with only minimal effort.
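
As a rough sketch, the identifier stored in a global index's tuples could be
a pair like the following (a hypothetical type, not existing infrastructure):

typedef struct GlobalTupleId
{
    Oid             relid;      /* (local) id of the table containing the tuple */
    ItemPointerData tid;        /* TID as supplied by that table's AM */
} GlobalTupleId;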

> > And if we're going to change TID infrastructure to allow for more
> > sizes (as we'd need normal TableAM TIDs, and global index
> > partition-identifying TIDs), I'd argue that it should not be too much
> > more difficult to create an infrastructure for 'new TID' in which the
> > table AM supplies type, size and strict ordering information for these
> > 'new TID's.
> >
> > And if this 'new TID' size is not going to be defined by the index AM
> > but by the indexed object (be it a table or a 'global' or whatever
> > we'll build indexes on), I see no reason why this 'new TID'
> > infrastructure couldn't eventually support variable length TIDs; or
> > constant sized usertype TIDs (e.g. the 3 int columns of the primary
> > key of a clustered table).
>
> You're not considering the big picture. It's not self-evident that
> anybody will ever have much use for a variable-width TID in their
> table AM, at least beyond some fairly simple scheme -- because of the
> fundamental issue of TID not working as a stable identifier of logical
> rows in Postgres.

ZHeap states that it can implement stable TIDs within limits, as IIRC
it requires retail index deletion support for all indexes on the
updated columns of that table. I fail to see why this same
infrastructure could not be used for supporting clustered tables,
while hard-enforcing the limits that are only soft-enforced in ZHeap
(that is, only allowing index AMs that support retail index tuple
deletion).

> If it was very clear that there would be *some*
> significant benefit then the costs might start to look reasonable. But
> there isn't. "Build it and they will come" is not at all convincing to
> me.

Clustered tables / Index-oriented Tables are very useful for tables of
which most columns are contained in the PK, or otherwise are often
ordered by their PK. I don't know of any way that would allow us to
build a clustered table _without_ including the primary key in some
form into the TID, or otherwise introducing a layer of indirection
that would undo the clustered access implied by the clustered
table.

Additionally, compacting/re-clustering a table would be _much_ cheaper
for clustered tables, as the indexes attached to that table would not
need rebuilding: all TIDs will stay valid across the clustering
operation.

With regards,

Matthias van de Meent




Make some column descriptions easier to distinguish visually

2021-05-05 Thread David Fetter
Folks,

I was writing up a query on pg_constraint, and the columns whose
descriptions I've changed here were pretty hard to puzzle out, as they
were only distinct up to the difference between F and P, which isn't
always easy to see.  Please find attached a patch to disambiguate
them.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
>From 4202229a9319fa2f307b2761696775a2c5905fcd Mon Sep 17 00:00:00 2001
From: David Fetter 
Date: Wed, 5 May 2021 15:48:53 -0700
Subject: [PATCH v1] Clarify some column descriptions in pg_constraint
To: hackers
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="2.31.1"

This is a multi-part message in MIME format.
--2.31.1
Content-Type: text/plain; charset=UTF-8; format=fixed
Content-Transfer-Encoding: 8bit


diff --git doc/src/sgml/catalogs.sgml doc/src/sgml/catalogs.sgml
index 492ed348b3..2c501f21f7 100644
--- doc/src/sgml/catalogs.sgml
+++ doc/src/sgml/catalogs.sgml
@@ -2664,7 +2664,7 @@ SCRAM-SHA-256$:&l
(references pg_operator.oid)
   
   
-   If a foreign key, list of the equality operators for PK = FK comparisons
+   If a foreign key, list of the equality operators for establishing whether primary key columns are equal to the corresponding foreign key columns (PK = FK)
   
  
 
@@ -2674,7 +2674,7 @@ SCRAM-SHA-256$:&l
(references pg_operator.oid)
   
   
-   If a foreign key, list of the equality operators for PK = PK comparisons
+   If a foreign key, list of the equality operators for establishing whether primary key columns are equal (PK = PK)
   
  
 
@@ -2684,7 +2684,7 @@ SCRAM-SHA-256$:&l
(references pg_operator.oid)
   
   
-   If a foreign key, list of the equality operators for FK = FK comparisons
+   If a foreign key, list of the equality operators for establishing whether foreign key columns are equal (FK = FK)
   
  
 

--2.31.1--




Re: v14 mechanical code beautification patches

2021-05-05 Thread Bruce Momjian
On Wed, May  5, 2021 at 02:18:04PM -0400, Tom Lane wrote:
> I notice that we also list this as a pre-beta task in
> src/tools/RELEASE_CHANGES:
> 
> * Update inet/cidr data types with newest Bind patches
> 
> However, I can't recall that anyone has ever done any such thing;
> and at this point, any attempt to re-sync that code would likely
> be a rather major task.  Should we take that off the checklist?

I think it is related to these files:

src/backend/utils/adt/inet_cidr_ntop.c
src/backend/utils/adt/inet_net_pton.c

which have at the top:

 * Copyright (c) 2004 by Internet Systems Consortium, Inc. ("ISC")
 * Copyright (c) 1996,1999 by Internet Software Consortium.

but I am not sure we still need to update those, so I would remove it.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: v14 mechanical code beautification patches

2021-05-05 Thread Tom Lane
Bruce Momjian  writes:
> On Wed, May  5, 2021 at 02:18:04PM -0400, Tom Lane wrote:
>> I notice that we also list this as a pre-beta task in
>> src/tools/RELEASE_CHANGES:
>>  * Update inet/cidr data types with newest Bind patches
>> However, I can't recall that anyone has ever done any such thing;
>> and at this point, any attempt to re-sync that code would likely
>> be a rather major task.  Should we take that off the checklist?

> I think it is related to these files:
>   src/backend/utils/adt/inet_cidr_ntop.c
>   src/backend/utils/adt/inet_net_pton.c
> which have at the top:
>  * Copyright (c) 2004 by Internet Systems Consortium, Inc. ("ISC")
>  * Copyright (c) 1996,1999 by Internet Software Consortium.
> but I am not sure we still need to update those, so I would remove it.

I dug in the archives and found the thread that prompted you to
add that bullet item:

https://www.postgresql.org/message-id/200502021700.j12H05j20872%40candle.pha.pa.us

which made the point that those were moving targets back in 2005.
I doubt they still are, so I don't see much point in keeping this
in the checklist.

(There may or may not be value in doing a one-time check to see
if we've missed anything.)

regards, tom lane




Re: MaxOffsetNumber for Table AMs

2021-05-05 Thread Peter Geoghegan
On Wed, May 5, 2021 at 3:18 PM Matthias van de Meent
 wrote:
> I believe that the TID is the unique identifier of that tuple, within context.
>
> For normal indexes, the TID as supplied directly by the TableAM is
> sufficient, as the context is that table.
> For global indexes, this TID must include enough information to relate
> it to the table the tuple originated from.

Clearly something like a partition identifier column is sometimes just
like a regular user-visible column, though occasionally not like one
-- whichever is useful to the implementation in each context. For
example, we probably want to do predicate pushdown, maybe with real
cataloged operators that access the column like any other user-created
column (the optimizer knows about the column, which even has a
pg_attribute entry). Note that we only ever access the TID column
using an insertion scankey today -- so there are several ways in which
the partition identifier really would be much more like a user column
than tid/scantid ever was.

The TID is a key column for most purposes as of Postgres 12 (at least
internally). That didn't break all unique indexes due to the existence
of non-unique TIDs across duplicates! Insertions that must call
_bt_check_unique() can deal with the issue directly, by temporarily
unsetting scantid.
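
Roughly how nbtree handles that today (paraphrased and heavily trimmed from
_bt_doinsert(); not the verbatim code):

/* build the insertion scan key, then hide the TID while checking uniqueness */
itup_key = _bt_mkscankey(rel, itup);
if (checkingunique)
    itup_key->scantid = NULL;       /* compare on user keys only */

/* ... locate the insert position, then ... */
xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
                         &is_unique, &speculativeToken);

/* uniqueness established: restore the heap TID as the tiebreaker key */
itup_key->scantid = &itup->t_tid;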

We can easily do roughly the same thing here: be slightly creative
about how we interpret whether or not the partition identifier is
"just another key column" across each context. This is also similar to
the way the implementation is slightly creative about NULL values,
which are not equal to any other value to the user, but are
nevertheless just another value from the domain of indexed values to
the nbtree implementation. Cleverly defining the semantics of keys to
get better performance and to avoid the need for special case code is
more or less a standard technique.

> In the whole database, that would be the OID of the table + the TID as
> supplied by the table.
>
> As such, the identifier of the logical row (which can be called the
> TID), as stored in index tuples in global indexes, would need to
> consist of the TableAM supplied TID + the (local) id of the table
> containing the tuple.

2 points:

1. Clearly you need to use the partition identifier with the TID in
order to look up the version in the table -- you need to use both
together in global indexes. But it can still work in much the same way
as it would in a standard index -- it's just that you handle that
extra detail as well. That's what I meant by additive.

2. If a TID points to a version of a row (or whatever you want to call
the generalized version of a HOT chain -- almost the same thing), then
of course you can always map it back to the logical row. That must
always be true. It is equally true within a global index.

Points 1 and 2 above seem obvious to me...so I think we agree on that
much. I just don't know how you go from here to "we need
variable-width TIDs". In all sincerity, I am confused because to me it
just seems as if you're asserting that it must be necessary to have
variable width TIDs again and again, without ever getting around to
justifying it. Or even trying to.

> Assuming we're in agreement on that part, I
> would think it would be consistent to put this in TID infrastructure,
> such that all indexes that use such new TID infrastructure can be
> defined to be global with only minimal effort.

 Abstract definitions can be very useful, but ultimately they're just
tools. They're seldom useful as a starting point in my experience. I
try to start with the reality on the ground, and perhaps arrive at
some kind of abstract model or idea much later.

> ZHeap states that it can implement stable TIDs within limits, as IIRC
> it requires retail index deletion support for all indexes on the
> updated columns of that table.

Whether or not that's true is not at all clear. What is true is that
the prototype version of zheap that we have as of today is notable in
that it more or less allows the moral equivalent of a HOT chain to be
arbitrarily long (or much longer, at least). To the best of my
knowledge there is nothing about retail index tuple deletion in the
design, except perhaps something vague and aspirational.

> I fail to see why this same
> infrastructure could not be used for supporting clustered tables,
> while enforcing these limits only soft enforced in ZHeap (that is,
> only allowing index AMs that support retail index tuple deletion).

You're ignoring an ocean of complexity here. Principally the need to
implement something like two-phase locking (key value locking) in
indexes to make this work, but also the need to account for how
fundamentally redefining TID breaks things. To say nothing of how this
might affect crash recovery.

> > If it was very clear that there would be *some*
> > significant benefit then the costs might start to look reasonable. But
> there isn't. "Build it and they will come" is not at all convincing to
> me.

Re: v14 mechanical code beautification patches

2021-05-05 Thread Bruce Momjian
On Wed, May  5, 2021 at 07:08:35PM -0400, Tom Lane wrote:
> > I think it is related to these files:
> > src/backend/utils/adt/inet_cidr_ntop.c
> > src/backend/utils/adt/inet_net_pton.c
> > which have at the top:
> >  * Copyright (c) 2004 by Internet Systems Consortium, Inc. ("ISC")
> >  * Copyright (c) 1996,1999 by Internet Software Consortium.
> > but I am not sure we still need to update those, so I would remove it.
> 
> I dug in the archives and found the thread that prompted you to
> add that bullet item:
> 
> https://www.postgresql.org/message-id/200502021700.j12H05j20872%40candle.pha.pa.us
> 
> which made the point that those were moving targets back in 2005.
> I doubt they still are, so I don't see much point in keeping this
> in the checklist.
> 
> (There may or may not be value in doing a one-time check to see
> if we've missed anything.)

Thanks.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Small issues with CREATE TABLE COMPRESSION

2021-05-05 Thread Michael Paquier
On Wed, May 05, 2021 at 09:59:41AM -0400, Robert Haas wrote:
> TAP tests have a facility for conditionally skipping tests; see
> perldoc Test::More. That's actually superior to what you can do with
> pg_regress. We'd need to come up with some logic to determine when to
> skip or not, though. Perhaps the easiest method would be to have the
> relevant Perl script try to create a table with an lz4 column. If that
> works, then perform the LZ4-based tests. If it fails, check the error
> message. If it says anything that LZ4 is not supported by this build,
> skip those tests. If it says anything else, die.

There is a simpler and cheaper method to make the execution of TAP
tests conditional.  As in src/test/ssl/t/002_scram.pl for channel
binding, I think that you could use something like
check_pg_config("#define HAVE_LIBLZ4 1") and use its result to decide
which tests to skip or not.
--
Michael


signature.asc
Description: PGP signature


Re: .ready and .done files considered harmful

2021-05-05 Thread Robert Haas
On Wed, May 5, 2021 at 4:53 PM Stephen Frost  wrote:
> I do note that this comment is timeline.c is, ahem, perhaps over-stating
> things a bit:
>
>  * Note: while this is somewhat heuristic, it does positively guarantee
>  * that (result + 1) is not a known timeline, and therefore it should
>  * be safe to assign that ID to a new timeline.

OK, that made me laugh out loud.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Dubious assertion in RegisterDynamicBackgroundWorker

2021-05-05 Thread Robert Haas
On Wed, May 5, 2021 at 5:22 PM Tom Lane  wrote:
> I follow your concern, but I'm not convinced that this assertion is
> a useful aid; first because the asynchrony involved makes the edge
> cases rather squishy, and second because allowing 1024 bogus
> increments before complaining will likely mean that developer test
> runs will not last long enough to trigger the assertion, and third
> because if it does fire it's too far removed from the perpetrator
> to be much help in figuring out what went wrong, or even if
> anything *is* wrong.

Well, it allows 1024 bogus increments in one direction, but a negative
value will trip it PDQ.

> I've not tried to trace the code, but I'm now a bit suspicious
> that there is indeed a design bug here.  I gather from the
> comments that parallel_register_count is incremented by the
> worker processes, which of course implies that a worker that
> fails to reattach to shared memory won't do that.  But
> parallel_terminate_count is incremented by the postmaster.
> If the postmaster will do that even in the case of a worker that
> failed at startup, then lorikeet's symptoms are neatly explained.

parallel_register_count is incremented by
RegisterDynamicBackgroundWorker, i.e. when the worker process is
requested, not after it starts. To try to do it from the worker after
it's launched would, as you suppose, be completely busted. The more
general point here is that we should be adjusting these at the same
point we allocate and free the worker slots. Wherever that happens is
the right place to do this, because the goal of the mechanism is
precisely to limit the number of such slots that can be used by
parallel query.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Replication slot stats misgivings

2021-05-05 Thread Masahiko Sawada
On Tue, May 4, 2021 at 2:34 PM vignesh C  wrote:
>
> On Tue, May 4, 2021 at 9:48 AM Masahiko Sawada  wrote:
> >
> > On Mon, May 3, 2021 at 10:21 PM Amit Kapila  wrote:
> > >
> > > On Mon, May 3, 2021 at 5:48 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Mon, May 3, 2021 at 2:27 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > > > On Thu, Apr 29, 2021 at 10:37 AM Amit Kapila 
> > > > >  wrote:
> > > > > >
> > > > > > On Wed, Apr 28, 2021 at 7:43 PM Masahiko Sawada 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Wed, Apr 28, 2021 at 3:25 PM Amit Kapila 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > >
> > > > > > > > I am not sure if any of these alternatives are a good idea. 
> > > > > > > > What do
> > > > > > > > you think? Do you have any other ideas for this?
> > > > > > >
> > > > > > > I've been considering some ideas but don't come up with a good one
> > > > > > > yet. It’s just an idea and not tested but how about having
> > > > > > > CreateDecodingContext() register before_shmem_exit() callback 
> > > > > > > with the
> > > > > > > decoding context to ensure that we send slot stats even on
> > > > > > > interruption. And FreeDecodingContext() cancels the callback.
> > > > > > >
> > > > > >
> > > > > > Is it a good idea to send stats while exiting and rely on the same? 
> > > > > > I
> > > > > > think before_shmem_exit is mostly used for the cleanup purpose so 
> > > > > > not
> > > > > > sure if we can rely on it for this purpose. I think we can't be sure
> > > > > > that in all cases we will send all stats, so maybe Vignesh's patch 
> > > > > > is
> > > > > > sufficient to cover the cases where we avoid losing it in cases 
> > > > > > where
> > > > > > we would have sent a large amount of data.
> > > > > >
> > > > >
> > > > > Sawada-San, any thoughts on this point?
> > > >
> > > > before_shmem_exit is mostly used to the cleanup purpose but how about
> > > > on_shmem_exit()? pgstats relies on that to send stats at the
> > > > interruption. See pgstat_shutdown_hook().
> > > >
> > >
> > > Yeah, that is worth trying. Would you like to give it a try?
> >
> > Yes.
> >
> > In this approach, I think we will need to have a static pointer in
> > logical.c pointing to LogicalDecodingContext that we’re using. At
> > StartupDecodingContext(), we set the pointer to the just created
> > LogicalDecodingContext and register the callback so that we can refer
> > to the LogicalDecodingContext on that callback. And at
> > FreeDecodingContext(), we reset the pointer to NULL (however, since
> > FreeDecodingContext() is not called when an error happens we would
> > need to ensure resetting it somehow). But, after more thought, if we
> > have the static pointer in logical.c it would rather be better to have
> > a global function that sends slot stats based on the
> > LogicalDecodingContext pointed by the static pointer and can be called
> > by ReplicationSlotRelease(). That way, we don’t need to worry about
> > erroring out cases as well as interruption cases, although we need to
> > have a new static pointer.
> >
> > I've attached a quick-hacked patch. I also incorporated the change
> > that calls UpdateDecodingStats() at FreeDecodingContext() so that we
> > can send slot stats also in the case where we spilled/streamed changes
> > but finished without commit/abort/prepare record.
> >
> > >  I think
> > > it still might not cover the cases where we error out in the backend
> > > while decoding via APIs because at that time we won't exit, maybe for
> > > that we can consider Vignesh's patch.
> >
> > Agreed. It seems to me that the approach of the attached patch is
> > better than the approach using on_shmem_exit(). So if we want to avoid
> > having the new static pointer and function for this purpose we can
> > consider Vignesh’s patch.
> >
>
> I'm ok with using either my patch or Sawada san's patch, Even I had
> the same thought of whether we should have a static variable thought
> pointed out by Sawada san. Apart from that I had one minor comment:
> This comment needs to be corrected "andu sed to sent"
> +/*
> + * Pointing to the currently-used logical decoding context andu sed to sent
> + * slot statistics on releasing slots.
> + */
> +static LogicalDecodingContext *MyLogicalDecodingContext = NULL;
> +

Right, that needs to be fixed.

After more thought, I'm concerned that my patch's approach might be
invasive for PG14. Given that Vignesh’s patch would cover most cases,
I think we can live with a small downside that could miss some slot
stats. If we want to ensure sending slot stats at releasing slot, we
can develop it as an improvement. It would be better for my patch,
including its design, to get reviewed by more people during PG15 development.
Thoughts?
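
For reference, the rough shape of the static-pointer approach described in
the quoted mail above (the helper name is made up; UpdateDecodingStats() is
the existing function in logical.c):

static LogicalDecodingContext *MyLogicalDecodingContext = NULL;

/* hypothetical helper, to be called from ReplicationSlotRelease() */
void
LogicalDecodingReportSlotStats(void)
{
    if (MyLogicalDecodingContext != NULL)
        UpdateDecodingStats(MyLogicalDecodingContext);
}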

Regards,

---
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: MaxOffsetNumber for Table AMs

2021-05-05 Thread Robert Haas
On Wed, May 5, 2021 at 3:43 PM Matthias van de Meent
 wrote:
> I believe that it cannot be "just" an additive thing, at least not
> through a normal INCLUDEd column, as you'd get duplicate TIDs in the
> index, with its related problems. You also cannot add it as a key
> column, as this would disable UNIQUE indexes; one of the largest use
> cases of global indexes. So, you must create specialized
> infrastructure for this identifier.
>
> And when we're already adding specialized infrastructure, then this
> should probably be part of a new TID infrastructure.
>
> And if we're going to change TID infrastructure to allow for more
> sizes (as we'd need normal TableAM TIDs, and global index
> partition-identifying TIDs), I'd argue that it should not be too much
> more difficult to create an infrastructure for 'new TID' in which the
> table AM supplies type, size and strict ordering information for these
> 'new TID's.
>
> And if this 'new TID' size is not going to be defined by the index AM
> but by the indexed object (be it a table or a 'global' or whatever
> we'll build indexes on), I see no reason why this 'new TID'
> infrastructure couldn't eventually support variable length TIDs; or
> constant sized usertype TIDs (e.g. the 3 int columns of the primary
> key of a clustered table).
>
> The only requirements that I believe to be fundamental for any kind of TID are
>
> 1.) Uniqueness during the lifecycle of the tuple, from creation to
> life to dead to fully dereferenced from all indexes;
> 2.) There exists a strict ordering of all TIDs of that type;
>
> And maybe to supply some form of efficiency to the underlying tableAM:
>
> 3.) There should be an equivalent of bitmap for that TID type.
>
> For the nbtree deduplication subsystem, and for gin posting lists to
> be able to work efficiently, the following must also hold:
>
> 4.) The TID type has a fixed size, preferably efficiently packable.
>
> Only the last requirement cannot be met with varlena TID types. But,
> as I also believe that not all indexes can be expected to work (well)
> for all kinds of TableAM, I don't see how this would be a blocking
> issue.

+1 to all of that.

> Storage gains for index-oriented tables can become as large as the
> size of the primary key by not having to store all primary key values
> in both the index and the table; which can thus be around 100% of a
> table in the least efficient cases of having a PK over all columns.
>
> Yes, this might be indeed only a 'small gain' for access latency, but
> not needing to store another copy of your data (and keeping it in
> cache, etc.) is a significant win in my book.

This is a really good point. Also, if the table is ordered by a
synthetic logical TID, range scans on the primary key will be less
efficient than if the primary key is itself the TID. We have the
ability to CLUSTER on an index for good reasons, and "Automatically
maintain clustering on a table" has been on the todo list forever.
It's hard to imagine this will ever be achieved with the current heap,
though: the way to get there is to have a table AM for which this is
an explicit goal.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Dubious assertion in RegisterDynamicBackgroundWorker

2021-05-05 Thread Tom Lane
I wrote:
> I've not tried to trace the code, but I'm now a bit suspicious
> that there is indeed a design bug here.  I gather from the
> comments that parallel_register_count is incremented by the
> worker processes, which of course implies that a worker that
> fails to reattach to shared memory won't do that.  But
> parallel_terminate_count is incremented by the postmaster.
> If the postmaster will do that even in the case of a worker that
> failed at startup, then lorikeet's symptoms are neatly explained.

That theory seems to be nonsense.  After a bit more study of the
code, I see that parallel_register_count is incremented by the *leader*
process, when it reserves a BackgroundWorkerSlot for the worker.
And parallel_terminate_count is incremented by the postmaster when
it releases the slot; so it's darn hard to see how
parallel_terminate_count could get ahead of parallel_register_count.

I noticed that lorikeet's worker didn't fail at shared memory reattach,
as I first thought, anyway.  It failed at
ERROR:  could not map dynamic shared memory segment
which means we ought to be able to reproduce the symptoms by faking
failure of dsm_attach(), as I did in the quick hack attached.
What I get is a lot of "parallel worker failed to initialize" and
"lost connection to parallel worker" errors, but no assertion.
(I also tried this with an EXEC_BACKEND build, just in case that'd
change the behavior, but it didn't.)  So it seems like the "lorikeet
is flaky" theory is looking pretty plausible.

I do see what seems to be a bug-let in ForgetBackgroundWorker.
BackgroundWorkerStateChange is careful to do this when freeing
a slot:

/*
 * We need a memory barrier here to make sure that the load of
 * bgw_notify_pid and the update of parallel_terminate_count
 * complete before the store to in_use.
 */
notify_pid = slot->worker.bgw_notify_pid;
if ((slot->worker.bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
BackgroundWorkerData->parallel_terminate_count++;
pg_memory_barrier();
slot->pid = 0;
slot->in_use = false;

but the mainline case in ForgetBackgroundWorker is a lot less
paranoid:

Assert(rw->rw_shmem_slot < max_worker_processes);
slot = &BackgroundWorkerData->slot[rw->rw_shmem_slot];
if ((rw->rw_worker.bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
BackgroundWorkerData->parallel_terminate_count++;

slot->in_use = false;

One of these functions is mistaken.  However, I can't construct
a theory whereby that explains lorikeet's symptoms, mainly because
Intel chips don't do out-of-order stores so the messing with
parallel_terminate_count should be done before in_use is cleared,
even without an explicit memory barrier.
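
If the mainline path is the one at fault, the fix would presumably be to
make it just as paranoid as BackgroundWorkerStateChange (a sketch, not a
proposed patch):

Assert(rw->rw_shmem_slot < max_worker_processes);
slot = &BackgroundWorkerData->slot[rw->rw_shmem_slot];
if ((rw->rw_worker.bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
    BackgroundWorkerData->parallel_terminate_count++;
pg_memory_barrier();        /* order the counter bump before freeing the slot */
slot->in_use = false;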

regards, tom lane

diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 3550ef13ba..d3b00c2f9e 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -1278,6 +1278,11 @@ ParallelWorkerMain(Datum main_arg)
  "Parallel worker",
  ALLOCSET_DEFAULT_SIZES);
 
+	if (random() < INT_MAX / 64)
+		ereport(ERROR,
+(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("fake failure to map dynamic shared memory segment")));
+
 	/*
 	 * Attach to the dynamic shared memory segment for the parallel query, and
 	 * find its table of contents.


Re: PG in container w/ pid namespace is init, process exits cause restart

2021-05-05 Thread Robert Haas
On Tue, May 4, 2021 at 4:35 PM Tom Lane  wrote:
> I'm still thinking that we're best off refusing to do
> that and making people install one of these shims that's meant
> for the job.

I have to admit that I care less about the specific issue here than
about the general issue of being open to hearing what the user needs
actually are. I honestly have no idea whether it's sensible to want to
run postgres as init. If people who know about container stuff say
that's a dumb idea and you shouldn't do it, then IMHO your conclusion
that we should simply disallow it is 100% correct. But if those people
show up and say, no, it's actually super-convenient for postgres to
run as init and using one of those shim things has significant
downsides that are hard to mitigate, and if further we could do what
they say they need with just a little bit of extra code, then IMHO
your conclusion is 100% wrong. Now so far as I can see right now
neither conclusion is crystal clear - opinions seem to be a bit mixed.
So right now I don't really know what to think. I just don't want to
fall into the trap of thinking that core developers are somehow in a
better place to know the right answer than users.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: MaxOffsetNumber for Table AMs

2021-05-05 Thread Hannu Krosing
How hard would it be to declare TID as current ItemPointerData with
some values prohibited (NULL, SpecTokenOffsetNumber = 0xfffe,
MovedPartitionsOffsetNumber = 0xfffd, presumably also 0x ?).

And then commit to fixing the usages outside access/heap/ which depend
on a small value for MaxHeapTuplesPerPage, currently only
nodeTidscan.c, access/gin/ginpostinglist.c and access/brin/brin_*.c

there is also MaxHeapTuplesPerPage usage in
./backend/storage/page/bufpage.c but it seems to be all in
heapam-dependent functions (PageRepairFragmentation(),
PageGetHeapFreeSpace() and few others) which most likely should be
moved to access/heap/

Doing it this way would leave us with some manageable complexity in
mapping from TID to a 48-bit integer and/or 3 wanted positions in 2^32.
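
The mapping itself would be cheap; a sketch using the existing ItemPointer
accessors from storage/itemptr.h (the helper names are made up):

/* pack a TID into a 48-bit value: 32-bit block number, 16-bit offset */
static inline uint64
tid_to_uint48(ItemPointer tid)
{
    return ((uint64) ItemPointerGetBlockNumber(tid) << 16) |
        ItemPointerGetOffsetNumber(tid);
}

static inline void
uint48_to_tid(uint64 val, ItemPointer tid)
{
    ItemPointerSet(tid,
                   (BlockNumber) (val >> 16),
                   (OffsetNumber) (val & 0xffff));
}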


Hannu Krosing


On Wed, May 5, 2021 at 8:40 PM Peter Geoghegan  wrote:
>
> On Wed, May 5, 2021 at 11:25 AM Andres Freund  wrote:
> > Agreed. And we can increase the fit a good bit without needing invasive
> > all-over changes. With those changes often even helping heap.
> >
> > E.g. tidbitmap.c's harcoded use of MaxHeapTuplesPerPage is a problem
> > even for heap - we waste a lot of space that's not commonly used. A
> > better datastructure (radix tree like I'd say, but several tree shaped
> > approaches seem possible).
>
> Agreed -- even if we only cared about heapam we still ought to do
> something about tidbitmap.c's use of MaxHeapTuplesPerPage.
>
> --
> Peter Geoghegan
>
>




Re: PG in container w/ pid namespace is init, process exits cause restart

2021-05-05 Thread Tom Lane
Robert Haas  writes:
> I have to admit that I care less about the specific issue here than
> about the general issue of being open to hearing what the user needs
> actually are. I honestly have no idea whether it's sensible to want to
> run postgres as init. If people who know about container stuff say
> that's a dumb idea and you shouldn't do it, then IMHO your conclusion
> that we should simply disallow it is 100% correct. But if those people
> show up and say, no, it's actually super-convenient for postgres to
> run as init and using one of those shim things has significant
> downsides that are hard to mitigate, and if further we could do what
> they say they need with just a little bit of extra code, then IMHO
> your conclusion is 100% wrong. Now so far as I can see right now
> neither conclusion is crystal clear - opinions seem to be a bit mixed.
> So right now I don't really know what to think. I just don't want to
> fall into the trap of thinking that core developers are somehow in a
> better place to know the right answer than users.

I don't claim to have an opinion about how convenient it would be
for users to not need an init shim.  I do claim to have a qualified
opinion about how hard it would be for us to support the case.  It'd
hobble our ability to detect child-process bookkeeping errors, and
it'd put constraints on how we manage the postmaster's signal handling.
Maybe those constraints will never matter, but that's a contract I
don't really want to buy into for this seemingly-not-large benefit.

regards, tom lane




Re: MaxOffsetNumber for Table AMs

2021-05-05 Thread Hannu Krosing
On Thu, May 6, 2021 at 3:07 AM Robert Haas  wrote:
>
> On Wed, May 5, 2021 at 3:43 PM Matthias van de Meent
> > Storage gains for index-oriented tables can become as large as the
> > size of the primary key by not having to store all primary key values
> > in both the index and the table; which can thus be around 100% of a
> > table in the least efficient cases of having a PK over all columns.
> >
> > Yes, this might be indeed only a 'small gain' for access latency, but
> > not needing to store another copy of your data (and keeping it in
> > cache, etc.) is a significant win in my book.
>
> This is a really good point. Also, if the table is ordered by a
> synthetic logical TID, range scans on the primary key will be less
> efficient than if the primary key is itself the TID. We have the
> ability to CLUSTER on an index for good reasons, and "Automatically
> maintain clustering on a table" has been on the todo list forever.
> It's hard to imagine this will ever be achieved with the current heap,
> though: the way to get there is to have a table AM for which this is
> an explicit goal.

But would this not have the downside that all the secondary indexes
will blow up, as they now need to have the full table row as the TID?

-
Hannu Krosing




Re: Printing backtrace of postgres processes

2021-05-05 Thread Justin Pryzby
Here's a cleaned-up copy of the doc text.

Send a request to the backend with the specified process ID to log its 
backtrace.
The backtrace will be logged at message level LOG.
It will appear in the server log based on the log configuration set
(See  for more information),
but will not be sent to the client regardless of
.
A backtrace will identify where exactly the backend process is currently
executing. This may be useful for developers to diagnose stuck
processes and other problems. This feature is
not supported for the postmaster, logger, or statistics collector process.
The feature is only available if PostgreSQL was built
with the ability to capture backtraces.  If not available, the function will
return false and show a WARNING.
Only superusers can request backends to log their backtrace.

> - * this and related functions are not inlined.
> + * this and related functions are not inlined. If edata pointer is valid
> + * backtrace information will set in edata.

will *be* set

-- 
Justin




Re: pg_receivewal makes a bad daemon

2021-05-05 Thread David Fetter
On Wed, May 05, 2021 at 01:12:03PM -0400, Robert Haas wrote:
> On Wed, May 5, 2021 at 12:34 PM Magnus Hagander  wrote:
> > Is this really a problem we should fix ourselves? Most daemon-managers
> > today can happily be configured to automatically restart a daemon on
> > failure with a single setting, and have been able to for a long time now.
> > E.g. in systemd (which most Linuxen use now) you just set
> > Restart=on-failure (or maybe even Restart=always) and something like RestartSec=10.
> >
> > That said, it wouldn't cover an fsync() error -- they will always
> > restart. The way to handle that is for the operator to capture the
> > error message perhaps, and just "deal with it"?
> 
> Maybe, but if that's really a non-problem, why does postgres itself
> restart, and have facilities to write and rotate log files? I feel
> like this argument boils down to "a manual transmission ought to be
> good enough for anyone, let's not have automatics." But over the years
> people have found that automatics are a lot easier to drive. It may be
> true that if you know just how to configure your system's daemon
> manager, you can make all of this work, but it's not like we document
> how to do any of that, and it's probably not the same on every
> platform - Windows in particular - and, really, why should people have
> to do this much work? If I want to run postgres in the background I
> can just type 'pg_ctl start'. I could even put 'pg_ctl start' in my
> crontab to make sure it gets restarted within a few minutes even if
> the postmaster dies. If I want to keep pg_receivewal running all the
> time ... I need a whole pile of extra mechanism to work around its
> inherent fragility. Documenting how that's typically done on modern
> systems, as you propose further on, would be great, but I can't do it,
> because I don't know how to make it work. Hence the thread.
> 
> > Also, all the above also apply to pg_recvlogical, right? So if we do
> > want to invent our own daemon-init-system, we should probably do one
> > more generic that can handle both.
> 
> Yeah. And I'm not really 100% convinced that trying to patch this
> functionality into pg_receive{wal,logical} is the best way forward ...
> but I'm not entirely convinced that it isn't, either. I think one of
> the basic problems with trying to deploy PostgreSQL in 2021 is that it
> needs so much supporting infrastructure and so much babysitting.
> archive_command has to be a complicated, almost magical program we
> don't provide, and we don't even tell you in the documentation that
> you need it. If you don't want to use that, you can stream with
> pg_receivewal instead, but now you need a complicated daemon-runner
> mechanism that we don't provide or document the need for. You also
> probably need a connection pooler that we don't provide, a failover
> manager that we don't provide, and backup management software that we
> don't provide. And the interfaces that those tools have to work with
> are so awkward and primitive that even the tool authors can't always
> get it right. So I'm sort of unimpressed by any arguments that boil
> down to "what we have is good enough" or "that's the job of some other
> piece of software". Too many things are the job of some piece of
> software that doesn't really exist, or is only available on certain
> platforms, or that has some other problem that makes it not usable for
> everyone. People want to be able to download and use PostgreSQL
> without needing a whole library of other bits and pieces from around
> the Internet.

We do use at least one bit and piece from around the internet to make
our software usable, namely libreadline, the absence of which makes
psql pretty much unusable.

That out of the way, am I understanding correctly that you're
proposing that we make tools for daemon-izing, logging, connection
management, and failover, and ship them with PostgreSQL? I can see the
appeal for people shipping proprietary forks of PostgreSQL,
especially ones under restrictive licenses, and I guess we could make
a pretty good case for continuing to center those interests as we have
since the Berkeley days.  Rather than, or maybe as a successor to,
wiring such things into each tool we ship that requires them, I'd
picture something along the lines of .sos, like those we provide with
the distribution as it is now, that could then be repurposed, modified,
etc.

Another possibility would be to look around for mature capabilities
that are cross-platform in the sense that they work on all the
platforms we do.  While I don't think it's likely we'd find them for
all the above use cases under compatible licenses, it's probably worth
a look. At worst, we'd get some idea of how (not) to design the APIs
to them.

I'm going to guess that anything with an incompatible license will
upset people who are accustomed to ensuring that we have what legally
amounts to an MIT license clean distribution, but I'm thinking that
option is at least worth discussing, even if the immediate conse

Re: v14 mechanical code beautification patches

2021-05-05 Thread Tom Lane
I wrote:
> I dug in the archives and found the thread that prompted you to
> add that bullet item:
> https://www.postgresql.org/message-id/200502021700.j12H05j20872%40candle.pha.pa.us
> which made the point that those were moving targets back in 2005.
> I doubt they still are, so I don't see much point in keeping this
> in the checklist.
> (There may or may not be value in doing a one-time check to see
> if we've missed anything.)

I located the "current" versions of those files in libbind 6.0.
(I put "current" in quotes because the file dates seem to be
2005-2008, so indeed development came to a stop a long time ago.)

They are *very* different from what we have, though.  Some of it
is visibly cosmetic, but other parts have been rewritten quite a bit,
so it's hard to tell if the functionality is identical.

In the absence of a reason to think we have bugs that we need to
fix, I'm not sure it's worth analyzing the differences in detail.
I definitely wouldn't just adopt all the diffs blindly.

In any case, that RELEASE_CHANGES item is clearly a dead letter
now, so I'll go remove it.

regards, tom lane




Re: MaxOffsetNumber for Table AMs

2021-05-05 Thread Jeff Davis
On Thu, 2021-05-06 at 03:26 +0200, Hannu Krosing wrote:
> How hard would it be to declare TID as current ItemPointerData with
> some values prohibited (NULL, SpecTokenOffsetNumber = 0xfffe,
> MovedPartitionsOffsetNumber = 0xfffd, presumably also 0x ?).

I don't think there's consensus in this thread that we want to do that,
but I'd be fine with it.

It's possible but not trivial. tidbitmap.c would be the biggest
challenge, I think.

Regards,
Jeff Davis







RE: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-05-05 Thread tsunakawa.ta...@fujitsu.com
From: Robert Haas 
> On Tue, May 4, 2021 at 11:47 PM Greg Nancarrow 
> wrote:
> > Problem is, for built-in functions, the changes are allowed, but for
> > some properties (like strict) the allowed changes don't actually take
> > effect (this is what Amit was referring to - so why allow those
> > changes?).
> > It's because some of the function properties are cached in
> > FmgrBuiltins[] (for a "fast-path" lookup for built-ins), according to
> > their state at build time (from pg_proc.dat), but ALTER FUNCTION is
> > just changing it in the system catalogs. Also, with sufficient
> > privileges, a built-in function can be redefined, yet the original
> > function (whose info is cached in FmgrBuiltins[]) is always invoked,
> > not the newly-defined version.
> 
> I agree. I think that's not ideal. I think we should consider putting
> some more restrictions on updating system catalog changes, and I also
> think that if we can get out of having strict need to be part of
> FmgrBuiltins[] that would be good. But what I don't agree with is the
> idea that since strict already has this problem, it's OK to do the
> same thing with parallel-safety. That seems to me to be making a bad
> situation worse, and I can't see what problem it actually solves.

Let me address these as separate points:


(1) Is it better to get hardcoded function properties out of fmgr_builtins[]?
There is little value in doing so, or in spending much thought on it.  Users 
have no business changing system objects, in this case system functions.

Also, hardcoding is a worthwhile strategy when good performance or other 
unavoidable constraints call for it.  Postgres already uses it, as in the 
system catalog relcache below.

[relcache.c]
/*
 *  hardcoded tuple descriptors, contents generated by genbki.pl
 */
static const FormData_pg_attribute Desc_pg_class[Natts_pg_class] = 
{Schema_pg_class};
static const FormData_pg_attribute Desc_pg_attribute[Natts_pg_attribute] = 
{Schema_pg_attribute};
...


(2) Should it be disallowed for users to change system function properties with 
ALTER FUNCTION?
Maybe yes, but it's not an important issue for achieving parallel INSERT SELECT 
at the moment, so I think it can be discussed in a separate thread.

As a reminder, Postgres has safeguards against modifying system objects, as 
follows.

test=# drop table^C
test=# drop function pg_wal_replay_pause();
ERROR:  cannot drop function pg_wal_replay_pause() because it is required by 
the database system
test=# drop table pg_largeobject;
ERROR:  permission denied: "pg_largeobject" is a system catalog

OTOH, Postgres doesn't disallow changing system table column values directly, 
as in UPDATE pg_proc SET ...  But the manual warns that such operations are 
dangerous, so we don't have to care about that case.

Chapter 52. System Catalogs
https://www.postgresql.org/docs/devel/catalogs.html

"You can drop and recreate the tables, add columns, insert and update values, 
and severely mess up your system that way. Normally, one should not change the 
system catalogs by hand, there are normally SQL commands to do that. (For 
example, CREATE DATABASE inserts a row into the pg_database catalog — and 
actually creates the database on disk.) There are some exceptions for 
particularly esoteric operations, but many of those have been made available as 
SQL commands over time, and so the need for direct manipulation of the system 
catalogs is ever decreasing."


(3) Why do we want to have parallel-safety in fmgr_builtins[]?
As proposed in this thread and/or "Parallel INSERT SELECT take 2", we thought 
of detecting parallel-unsafe function execution during SQL statement execution, 
instead of imposing much overhead on query planning to check parallel safety.  
Specifically, we would add a parallel safety check in fmgr_info() and/or 
FunctionCallInvoke().
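
A minimal sketch of such a call-time check, purely illustrative (the
proparallel value would have to be made available at that point, e.g. from
fmgr_builtins[] or the function's pg_proc entry; the helper name is made up):

#include "postgres.h"
#include "access/xact.h"		/* IsInParallelMode() */
#include "catalog/pg_proc.h"	/* PROPARALLEL_UNSAFE */

/* Hypothetical helper, called from fmgr_info() and/or FunctionCallInvoke(). */
static void
check_parallel_safety_at_call(char proparallel)
{
	/*
	 * A real version would also need the stricter check that applies inside
	 * parallel workers, where parallel-restricted functions are not allowed
	 * either.
	 */
	if (IsInParallelMode() && proparallel == PROPARALLEL_UNSAFE)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
				 errmsg("cannot execute parallel-unsafe function during a parallel operation")));
}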

(Alternatively, we could conclude that we simply assume parallel-unsafe built-in 
functions won't be used in parallel DML.  In that case, we don't change 
FmgrBuiltin at all and just skip the parallel safety check for built-in 
functions when the function is called.  Would you buy this?)


Regards
Takayuki Tsunakawa




Re: decoupling table and index vacuum

2021-05-05 Thread Masahiko Sawada
On Fri, Apr 23, 2021 at 5:01 AM Andres Freund  wrote:
>
> Hi,
>
> On 2021-04-22 12:15:27 -0400, Robert Haas wrote:
> > On Wed, Apr 21, 2021 at 5:38 PM Andres Freund  wrote:
> > > I'm not sure that's the only way to deal with this. While some form of
> > > generic "conveyor belt" infrastructure would be a useful building block,
> > > and it'd be sensible to use it here if it existed, it seems feasible to
> > > store dead tids in a different way here. You could e.g. have per-heap-vacuum
> > > files with a header containing LSNs that indicate the age of the
> > > contents.
> >
> > That's true, but I have some reservations about being overly reliant on
> > the filesystem to provide structure here. There are good reasons to be
> > worried about bloating the number of files in the data directory. Hmm,
> > but maybe we could mitigate that. First, we could skip this for small
> > relations. If you can vacuum the table and all of its indexes using
> > the naive algorithm in <10 seconds, you probably shouldn't do anything
> > fancy. That would *greatly* reduce the number of additional files
> > generated. Second, we could forget about treating them as separate
> > relation forks and make them some other kind of thing entirely, in a
> > separate directory
>
> I'm not *too* worried about this issue. IMO the big difference to the
> cost of additional relation forks is that such files would only exist
> when the table is modified to a somewhat meaningful degree. IME the
> practical issues with the number of files due to forks are cases where
> huge number of tables that are practically never modified exist.
>
> That's not to say that I am sure that some form of "conveyor belt"
> storage *wouldn't* be the right thing. How were you thinking of dealing
> with the per-relation aspects of this? One conveyor belt per relation?
>
>
> > especially if we adopted Sawada-san's proposal to skip WAL logging. I
> > don't know if that proposal is actually a good idea, because it
> > effectively adds a performance penalty when you crash or fail over,
> > and that sort of thing can be an unpleasant surprise.  But it's
> > something to think about.
>
> I'm doubtful about skipping WAL logging entirely - I'd have to think
> harder about it, but I think that'd mean we'd restart from scratch after
> crashes / immediate restarts as well, because we couldn't rely on the
> contents of the "dead tid" files to be accurate. In addition to the
> replication issues you mention.

Yeah, not having WAL would have a big negative impact on various other
aspects. Can we piggyback the WAL for the TID fork on
XLOG_HEAP2_PRUNE? That is, we add the buffer for the TID fork to
XLOG_HEAP2_PRUNE and record a single 64-bit number for the first dead TID
in the list, so that we can add dead TIDs to the TID fork while replaying
XLOG_HEAP2_PRUNE.
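
A rough sketch of what that might look like, purely illustrative (the new
field, its name, and its placement are hypothetical; today's xl_heap_prune
carries only the first three fields):

#include "postgres.h"

typedef struct xl_heap_prune
{
	TransactionId latestRemovedXid;
	uint16		nredirected;
	uint16		ndead;
	uint64		first_dead_tid;	/* hypothetical: position in the TID fork of
								 * the first dead TID recorded by this prune,
								 * so replay can append to the fork as well */
	/* OFFSET NUMBERS are in the block reference 0 */
} xl_heap_prune;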

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: MaxOffsetNumber for Table AMs

2021-05-05 Thread Hannu Krosing
-
Hannu Krosing

On Thu, May 6, 2021 at 4:53 AM Jeff Davis  wrote:
>
> On Thu, 2021-05-06 at 03:26 +0200, Hannu Krosing wrote:
> > How hard would it be to declare TID as current ItemPointerData with
> > some values prohibited (NULL, SpecTokenOffsetNumber = 0xfffe,
> > MovedPartitionsOffsetNumber = 0xfffd, presumably also 0x ?).
>
> I don't think there's consensus in this thread that we want to do that,
> but I'd be fine with it.

Sure. I just proposed this as a Minimal Viable Change.

Just hoping that we can agree on an interim solution which addresses
the immediate problem and then continue arguing about the ideal way
for the rest of the v15 cycle (and the v16 and v17 cycles ;) )

>
> It's possible but not trivial. tidbitmap.c would be the biggest
> challenge, I think.

I think all these places (tidbitmap, gin, brin) rely on a "relatively
small" MaxHeapTuplesPerPage as an upper bound for some allocations, and
then still allocate a lot more than needed.

One can get to 291 tids/page only when you create a table with no
columns, or with fewer than 8 columns which are all set to NULL. A
table with a single non-null boolean column can already fit only 226
tuples per page.

It is definitely a non-trivial amount of work to rewrite these three,
but going to the (almost) full 48 bits from the current
theoretical little-over-40 effective bits would expand the number
of addressable tuples about 225 times.
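
For reference, the arithmetic behind those numbers, assuming the default
8 kB BLCKSZ and the usual header sizes:

  MaxHeapTuplesPerPage = (BLCKSZ - SizeOfPageHeaderData)
                         / (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))
                       = (8192 - 24) / (24 + 4) = 291

  single non-null boolean column: MAXALIGN(24-byte aligned header + 1 byte of
                                  data) + 4-byte line pointer = 36 bytes per
                                  tuple, and 8168 / 36 = 226

  2^48 / (2^32 * 291) ≈ 225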

Of course it would be extra nice to also somehow encode the 3 special
ItemPointerData values (NULL, 0xfffe, 0xfffd) "somewhere else" and get
the option of an uninterrupted 48-bit address space for non-heap table
AMs, but this would likely be much more disruptive, if possible at
all.
We could still check whether they are used outside of heapam and maybe
just fix those places.

>
> Regards,
> Jeff Davis
>
>
>




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2021-05-05 Thread Amit Kapila
On Fri, Apr 30, 2021 at 7:43 PM Dilip Kumar  wrote:
>
> So, I
> > used the other approach which led to the attached.
>
> The patch looks fine to me.
>

Thanks, pushed!

-- 
With Regards,
Amit Kapila.




Re: Identify missing publications from publisher while create/alter subscription.

2021-05-05 Thread Japin Li


On Tue, 04 May 2021 at 21:20, vignesh C  wrote:
> On Tue, May 4, 2021 at 2:37 PM Bharath Rupireddy
>  wrote:
>>
>> On Mon, May 3, 2021 at 7:59 PM vignesh C  wrote:
>> > Thanks for the comments, these comments are handle in the v7 patch
>> > posted in my earlier mail.
>>
>> Thanks. Some comments on v7 patch:
>>
>> 1) How about "Add publication names from the list to a string."
>>  instead of
>>  * Append the list of publication to dest string.
>>
>
> Modified.
>
>> 2) How about "Connect to the publisher and see if the given
>> publication(s) is(are) present."
>> instead of
>>  * Connect to the publisher and check if the publication(s) exist.
>>
>
> Modified.
>
>> 3) Below comments are unnecessary as the functions/code following them
>> will tell what the code does.
>> /* Verify specified publication(s) exist in the publisher. */
>> /* We are done with the remote side, close connection. */
>>
>> /* Verify specified publication(s) exist in the publisher. */
>> PG_TRY();
>> {
>> check_publications(wrconn, publications, true);
>> }
>> PG_FINALLY();
>> {
>> /* We are done with the remote side, close connection. */
>> walrcv_disconnect(wrconn);
>> }
>>
>
> Modified.
>
>> 4) And also the comment below that's there before check_publications
>> is unnecessary, as the function name and description would say it all.
>> /* Verify specified publication(s) exist in the publisher. */
>>
>
> Modified.
>
>> 5) A typo -  it is "do not exist"
>> # Multiple publications does not exist.
>>
>
> Modified.
>
>> 6) Should we use "m" specified in all the test cases something like we
>> do for $stderr =~ m/threads are not supported on this platform/ or
>> m/replication slot "test_slot" was not created in this database/?
>> $stderr =~
>>  /ERROR:  publication "non_existent_pub" does not exist in the
>> publisher/,
>
> Modified.
>
> Thanks for the comments, Attached patch has the fixes for the same.

Thanks for updating the patch.  Some comments on v8 patch.

1) How about using appendStringInfoChar() to replace the first and last
calls, since it's faster?  (See the sketch after these comments.)
+   appendStringInfoString(dest, "\"");
+   appendStringInfoString(dest, pubname);
+   appendStringInfoString(dest, "\"");

2) How about using if (!validate_publication) to keep the code style consistent?
+   if (validate_publication == false)
+   return;

3) Should we free the memory when check_publications() finishes?
+   publicationsCopy = list_copy(publications);

4) It would be better to wrap the word "streaming" in quotes.  Also, should we
add a 'no "validate_publication"' comment for the validate_publication parameters?

    NULL, NULL,  /* no "binary" */
-   NULL, NULL); /* no streaming */
+   NULL, NULL,  /* no streaming */
+   NULL, NULL);
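
Regarding (1), that would be roughly:

	appendStringInfoChar(dest, '"');
	appendStringInfoString(dest, pubname);
	appendStringInfoChar(dest, '"');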

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: MaxOffsetNumber for Table AMs

2021-05-05 Thread Robert Haas
On Wed, May 5, 2021 at 10:53 PM Jeff Davis  wrote:
> On Thu, 2021-05-06 at 03:26 +0200, Hannu Krosing wrote:
> > How hard would it be to declare TID as current ItemPointerData with
> > some values prohibited (NULL, SpecTokenOffsetNumber = 0xfffe,
> > MovedPartitionsOffsetNumber = 0xfffd, presumably also 0x ?).
>
> I don't think there's consensus in this thread that we want to do that,
> but I'd be fine with it.
>
> It's possible but not trivial. tidbitmap.c would be the biggest
> challenge, I think.

I think that would be fine, too. I don't think it's the ideal
situation, but it seems like a clear improvement over what we have
now. We might want to reserve a few values for future projects that
might need distinguished values like SpecTokenOffsetNumber or
MovedPartitionsOffsetNumber, though, so we don't completely box
ourselves into a corner.
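
One cheap way to leave that door open might be something along these lines
(the names and the cutoff are entirely hypothetical):

/* reserve the top handful of offset numbers for future special uses */
#define FirstReservedOffsetNumber	((OffsetNumber) 0xfff0)
#define OffsetNumberIsReserved(off)	((off) >= FirstReservedOffsetNumber)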

-- 
Robert Haas
EDB: http://www.enterprisedb.com



