Logical replication stall on box type with REPLICA IDENTITY FULL due to lack of exact equality

2023-12-26 Thread Laurence Parry
Long story short, I tried setting REPLICA IDENTITY FULL on a table with a box 
type and it made replication stall on the receiver because - as I found out 
later - the predefined box type seems to lack a way to definitively determine 
exact equality of coordinates.

I was tempted to post this as a bug because - while discouraged - FULL works 
with other types, and from a user perspective based on the string output the 
box type seems like it could at least have hash-based equality, if not a btree 
total ordering.

The wording of 
https://www.postgresql.org/docs/current/logical-replication-publication.html 
also implies to me that FULL should work in all cases with default types: "If 
there are no such suitable indexes, the search on the subscriber side can be 
very inefficient, therefore replica identity FULL should only be used as a 
fallback if no other solution is possible." - so inefficient, but still working?

I might be missing something obvious, so I figured I'd bring it up here for 
advice. If not... well, it could've led to a production outage if I hadn't 
caught it in time, and this kind of issue seems likely to increase as logical 
replication becomes more popular.

Related reading:
https://www.postgresql.org/message-id/flat/1361F665-56E4-4CE6-9199-592067A656AB%40postgrespro.ru
https://www.postgresql.org/message-id/flat/4CF96953.40505%40enterprisedb.com#06df5ff5c9835e6c05ac3c55d48b2228

--

We have a cascading replica setup in production; the replicas are used to 
offload CPU-intensive recommendation queries and disaster recovery. In the 
PostgreSQL 16 upgrade cycle we switched from physical to logical replication. 
This entailed a few gotchas - we have a decade-and-a-half old, 
rapidly-developed schema, and had to set up some primary keys or use other 
indexes for replica identity. In cases where there was no obvious candidate, 
and the tables weren't busy, we set REPLICA IDENTITY FULL.

One that we missed until now, as it is rarely changed, is content servers areas 
- a set of boxes in ascending priority order defining the geospatial areas 
within which users should be directed to particular content servers. (It was 
great to find that PostgreSQL had support for these - like inet/cidr and uuid - 
out of the box.)

fb=# \d content_servers_areas
 Table "public.content_servers_areas"
  Column   |   Type   | Collation | Nullable | Default
---+--+---+--+-
 content_server_id | smallint |   |  |
 priority  | smallint |   |  |
 area  | box  |   |  |

db=# SELECT * FROM content_servers_areas WHERE content_server_id = 10;
 content_server_id | priority |  area
---+--+
10 |   53 | (90,-96),(-90,-170)
10 |   65 | (90,-27),(-90,-170)
10 |   35 | (48,-81.5),(35,-105.5)
10 |   35 | (90,-81.5),(48,-128)
10 |5 | (48,-84.8),(37,-99)
(5 rows)

The use of logical replication became an issue when we wanted to alter an 
existing area:

db=# UPDATE content_servers_areas SET area = '(48,-89),(37,-99)' WHERE area = 
'(48,-84.8),(37,-99)';
ERROR:  cannot update table "content_servers_areas" because it does not have a 
replica identity and publishes updates
HINT:  To enable updating the table, set REPLICA IDENTITY using ALTER TABLE.
db=# ALTER TABLE content_servers_areas REPLICA IDENTITY FULL;
ALTER TABLE

This seemed to work, right up to the point our Munin monitoring infrastructure 
trigged a critical warning for high WAL levels on the sender a few hours later. 
Looking at the replica logs, the issue became obvious.

2023-12-26 01:55:49.849 UTC,,,2752638,,658a32a5.2a007e,1,,2023-12-26 01:55:49 
UTC,8/31,0,LOG,0,"logical replication apply worker for subscription 
""replica"" has started""InitializeApplyWorker, 
worker.c:4514","","logical replication worker",,0
2023-12-26 01:55:49.930 UTC,,,2752638,,658a32a5.2a007e,2,,2023-12-26 01:55:49 
UTC,8/0,0,ERROR,42883,"could not identify an equality operator for type 
box","processing remote data for replication origin ""pg_18583"" during 
message type ""UPDATE"" for replication target relation 
""public.content_servers_areas"" in transaction 256215511, finished at 
244/5DE85B48",,,"tuples_equal, execReplication.c:309","","logical replication 
worker",,0
2023-12-26 01:55:49.932 UTC,,,1314286,,65819479.140dee,13166,,2023-12-19 
13:02:49 UTC,,0,LOG,0,"background worker ""logical replication worker"" 
(PID 2752638) exited with exit code 1""LogChildExit, 
postmaster.c:3660","","postmaster",,0

ALTER SUBSCRIPTION replica SKIP (lsn = '244/5DE85B48'); worked to clear this - 
I had to do it a few times as we'd made further changes.

Unfortunately I can't do the obvious:
db=# CREATE UNIQUE INDEX content_server_id_area ON content_servers_areas 
(content_server_id

Re: Query crash with 15.5 on debian bookworm/armv8

2023-12-26 Thread Clemens Eisserer
Hi Tom,

> FWIW, since this crash is inside LLVM you could presumably dodge the bug
> by setting "jit" to off.

Thanks, this indeed solved the crash.
Just to make sure this crash doesn't have anything to do with my
setup/config (I'd changed quite a few settings in postgresql.conf),
I gave it a try on a fresh bookworm install and it also crashed immeditaly.

> As for an actual fix, perhaps a newer version of LLVM is needed?
> I don't see a problem testing this query on my RPI with Ubuntu 23.10
> (LLVM 16).

I also gave Ubuntu 23.10 a try (15.4 built with llvm-15) and it worked
as expected, explain analyze even mentioned the JIT was active.

I've filed a debian bug report with a link to this discussion and a
plea to build postgresql against llvm >= 15:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1059476

To be honest I don't know why llvm-14 was chosen, as 15 is also
available in bookworm.

Thanks & best regards, Clemens




Intermittent Issue with WAL Segment Removal in Logical Replication

2023-12-26 Thread Kaushik Iska
Dear PostgreSQL Community,

I am seeking guidance regarding a recurring issue we've encountered with
WAL segment removal during logical replication using pgoutput plugin. We
sporadically encounter an error indicating that a requested WAL segment has
already been removed. This issue arises intermittently when executing
START_REPLICATION. An example error message is as follows:


requested WAL segment 000114600AE has already been removed


Please note that this error is not specific to the segment mentioned above;
it serves as an example of the type of error we are experiencing.

Additional Context:


   -

   max_slot_wal_keep_size is -1, logical_decoding_work_mem is 4 GB.
   -

   The error seems to appear randomly and is not consistent.
   -

   After a couple of retries, the replication process eventually succeeds.
   -

   For one of the users it seems to be happening every 16 hours or so.


Our approach involves starting with START_REPLICATION 0, replicating data
in batches, and then restarting at the last LSN of the previous batch. We
are trying to understand the root cause behind the intermittent removal of
WAL segments during logical replication. Specifically, we are looking for
insights into:


   -

   The potential reasons for the WAL segments being reported as removed.
   -

   Why this error occurs intermittently and why replication succeeds after
   several retries.
   -

   Any advice on troubleshooting and resolving this issue, or insights into
   whether it might be related to our specific replication setup or a
   characteristic of pgoutput, would be highly valuable.


Related Posts


   -

   https://issues.redhat.com/browse/DBZ-590
   -

   Troubleshooting Postgres Sources | Airbyte Documentation
   

   -


   
https://fivetran.com/docs/databases/postgresql/troubleshooting/last-tracked-lsn-error



Thank you very much for your time and assistance.

Thanks,

Kaushik Iska


ERROR: missing chunk number 0 for toast value 1563962066 in pg_toast_2619

2023-12-26 Thread Johnathan Tiamoh
Hello,

Please, I need help to fix this. Can't do anything in the database. This
error keeps popping up.

I have tried to do a pg_dump and the same error keeps coming up.


Thank you
Johnathan Tiamoh


Re: ERROR: missing chunk number 0 for toast value 1563962066 in pg_toast_2619

2023-12-26 Thread Adrian Klaver

On 12/26/23 12:51, Johnathan Tiamoh wrote:

Hello,

Please, I need help to fix this. Can't do anything in the database. This 
error keeps popping up.


I have tried to do a pg_dump and the same error keeps coming up.


You need to provide answers to the questions posed here:

https://www.postgresql.org/message-id/2184af82-8dbd-4c78-9cc1-29f7a577e2f1%40aklaver.com



Thank you
Johnathan Tiamoh


--
Adrian Klaver
adrian.kla...@aklaver.com





Re: PostgreSQL 15.5 stops processing user queries

2023-12-26 Thread Andrey Zhidenkov
Sorry, wrong stack trace. Here is a correct one:

(gdb) bt
#0  0x7f9acca1bdf6 in do_futex_wait.constprop () from
/lib64/libpthread.so.0
#1  0x7f9acca1bee8 in __new_sem_wait_slow.constprop.0 () from
/lib64/libpthread.so.0
#2  0x007815a2 in PGSemaphoreLock ()
#3  0x007fca6c in LWLockAcquire ()
#4  0x7f9ac5c385e6 in pgsm_store () from
/usr/pgsql-15/lib/pg_stat_monitor.so
#5  0x7f9ac5c39a6b in pgsm_ExecutorEnd () from
/usr/pgsql-15/lib/pg_stat_monitor.so
#6  0x00658d91 in PortalCleanup ()
#7  0x0095b264 in PortalDrop ()
#8  0x0080c6df in exec_simple_query ()
#9  0x0080de22 in PostgresMain ()
#10 0x0078db80 in ServerLoop ()
#11 0x0078eb34 in PostmasterMain ()
#12 0x0050474d in main ()

It’s seems to be a problem in pg_stat_monitor extension. The similar issue
is described here (in Chinese):
https://blog.csdn.net/qq_43687755/article/details/117592635

On Thu, 21 Dec 2023, 18:32 Tom Lane  wrote:

> Andrey Zhidenkov  writes:
> > We have encountered an issue with our PostgreSQL 15.5 installation. The
> > problem is that
> > PostgreSQL server periodically falls into a state when it accepts new
> > connections but
> > doesn't execute any queries. The session which runs a query hangs and
> > cannot be terminated
> > via SIGINT (even if it's just "SELECT 1") - a corresponding linux process
> > has state "S" and it's
> > not terminated even if the PostgreSQL master process is stopped. No
> matter
> > how we connect to
> > the database - both TCP and unix socket sessions hang but it seems that
> > existing sessions can
> > still execute queries (once we succeeded to connect using pgBouncer which
> > seemed to use an
> > existing connection to the database).
>
> Try using an existing session to capture pg_stat_activity and pg_locks
> information about the stuck session(s).
>
> > Here is a stack trace from gdb for one of the frozen sessions:
>
> This appears to be a trace of a psql process waiting for a query
> result.  It won't teach you much about what the server is doing.
>
> regards, tom lane
>


Re: Query crash with 15.5 on debian bookworm/armv8

2023-12-26 Thread Thomas Munro
On Wed, Dec 27, 2023 at 5:17 AM Clemens Eisserer  wrote:
> > FWIW, since this crash is inside LLVM you could presumably dodge the bug
> > by setting "jit" to off.
>
> Thanks, this indeed solved the crash.
> Just to make sure this crash doesn't have anything to do with my
> setup/config (I'd changed quite a few settings in postgresql.conf),
> I gave it a try on a fresh bookworm install and it also crashed immeditaly.
>
> > As for an actual fix, perhaps a newer version of LLVM is needed?
> > I don't see a problem testing this query on my RPI with Ubuntu 23.10
> > (LLVM 16).
>
> I also gave Ubuntu 23.10 a try (15.4 built with llvm-15) and it worked
> as expected, explain analyze even mentioned the JIT was active.

I can't reproduce this on LLVM 14 on an aarch64 Mac FWIW (after
setting jit_*_cost to 0, as required since the table is empty).

> I've filed a debian bug report with a link to this discussion and a
> plea to build postgresql against llvm >= 15:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1059476

I doubt they'll change that, and in any case we'll need to get to the
bottom of this.  Perhaps an assertion build of LLVM will fail in some
illuminating internal assertion?  Unfortunately it's a non-trivial
business to get a debug build of LLVM going (it takes oodles of disk
and CPU and a few confusing-to-me steps)...

. o O ( It would be wonderful if assertion-enabled packages were
readily available for a common platform like Debian.  I've finally
been spurred on to reach out to the maintainer of apt.llvm.org to ask
about that.  It'd also be very handy for automated next-version
monitoring. )




looking for mariadb-devel to install mysql_fdw_16 in AlmaLinux 9

2023-12-26 Thread Michael Nolan
I'm trying to install mysql_fdw_16 and it requires the mariadb-devel module.

There is no mariadb-devel module for Almalinux 9.  There is a
MariaDB-devel module, but when I install it, the install for
mysql_fdw_16 still tells me that there is no mariadb-devel module
available.

How do I work around this?

Mike Nolan
htf...@gmail.com




Re: looking for mariadb-devel to install mysql_fdw_16 in AlmaLinux 9

2023-12-26 Thread Adrian Klaver

On 12/26/23 18:02, Michael Nolan wrote:

I'm trying to install mysql_fdw_16 and it requires the mariadb-devel module.


From what I see the latest mysql_fdw version is REL-2_9_1, from here:

https://github.com/EnterpriseDB/mysql_fdw/releases/tag/REL-2_9_1

Are you getting it from some other source?



There is no mariadb-devel module for Almalinux 9.  There is a
MariaDB-devel module, but when I install it, the install for
mysql_fdw_16 still tells me that there is no mariadb-devel module
available.


Provide the command you used as well as the complete error message that 
resulted.




How do I work around this?

Mike Nolan
htf...@gmail.com




--
Adrian Klaver
adrian.kla...@aklaver.com





Re: Changing a schema's name with function1 calling function2

2023-12-26 Thread Kirk Wolak
On Mon, Dec 25, 2023 at 1:47 PM Christophe Pettus  wrote:

>
>
> > On Dec 25, 2023, at 10:44, Adrian Klaver 
> wrote:
> > Functions with same name in different schemas would need to be dealt
> with.
>
> I think that's the primary use-case (at least, it would be for me), and I
> don't see a convenient way of doing that.  Even a "get OID of current
> function" function would be useful here.
>
> And Pavel Already Created it:
GET DIAGNOSTIC PID = PG_ROUTINE_OID ;

If I understand the issue correctly.

Kirk Out!


Re: Read write performance check

2023-12-26 Thread Kirk Wolak
On Thu, Dec 21, 2023 at 8:31 AM veem v  wrote:

> Can someone please guide me, if any standard scripting is available for
> doing such read/write performance test? Or point me to any available docs?
>
>
> ...


Veem, first things first... "Top Posting" is when you reply at the top of
the email...  Notice how I replied at the bottom (and I deleted context,
clearly).
This is the style we prefer here.

Second, since you are new to postgreSQL... Let me recommend some reading.
Cybertec has articles on performance (Tom Kyte style).
Also, read the "Don't Do That" wiki, and finally, have a look at pgbench
and psql documentation.  And specifically look at GENERATE_SERIES(),
but the Cybertec articles will touch on that.  Regardless...  Reading the
docs is insightful.

Links:
https://www.cybertec-postgresql.com/en/postgresql-hash-index-performance/
https://wiki.postgresql.org/wiki/Don't_Do_This
https://www.postgresql.org/docs/current/app-psql.html
https://www.postgresql.org/docs/current/pgbench.html
https://www.postgresql.org/docs/16/functions-srf.html

HTH,

Kirk Out!