[GENERAL] Logical decoding error
Hello, Running Postgres 9.6.5, we're using logical decoding to take changes to the database and propagate them elsewhere in our system. We are using the PGX Go Postgres library, at https://github.com/jackc/pgx, and we are using the test_decoding plugin to format the changes. We are using 6 slots/have 6 processes streaming the changes from our database. This setup works great, except that every 20 hours or so, some or all of the processes encounter a problem, all at the same time. They receive an unexpected message type 'w'. At this point the processes restart, and when they do, they encounter another error: "ERROR: got sequence entry 0 for toast chunk 20559160 instead of seq 6935 (SQLSTATE XX000)" (the chunk number/seq number varies). This causes them to restart again. They will encounter the sequence entry error up to 3 more times, before things magically start to work again. We are also doing standard streaming replication to a slave off this database, and that has never seen a problem. Does this ring a bell for anyone? Do you have any suggestions for how I should go about figuring out what's happening? Thanks, Mark
Re: [GENERAL] Logical decoding error
On Thu, Nov 2, 2017 at 9:59 AM, Steve Atkins wrote: > > Where are the errors coming from - your code or pgx? If it's from pgx, > what's the exact error? ('w' is regular replication payload data, so it'd > be expected as a copydata payload message type, but would be an error for a > replication message). > > The errors are coming from PGX. Best I can tell (and I'm not very familiar with the PGX code, nor with the internals of Postgres replication), it's waiting for a replication message from the backend, so it's expecting things like CopyData/CopyInResponse/CopyOutResponse/CopyBothResponse. It gets a 'w' msg type there, which it doesn't handle, so it errors back to me (the copydata message type it expects there is 'd'). I'm only using PGX to open a replication connection, start replication, read replication messages and send heartbeats. I did open an issue on GitHub about this a couple of days ago, but haven't heard back from the PGX developer, so I thought I should try to help out and debug it myself. So I've been trying to educate myself on replication internals and such. Hence my email here. Do you capture the raw data from the replication connection when the error > happens? > > I was not, but I have added that and some other logging, and will report back when I get more info. Thanks, Mark
[GENERAL] pg_logical/snapshots directory
Hi All, While trying to track down my logical decoding problem, I noticed that my pg_logical/snapshots directory has ~5000 .snap files and is growing at a rate of about 4 files a minute. The earliest file is from yesterday afternoon, dating to the time I took our site down and added a column to a large table. I have not been able to find any docs on this directory. Can someone point me to an explanation of this directory and how to manage the number of .snap files in it (and/or whether I should be worried about the number of files in it)? Thanks, Mark
Re: [GENERAL] pg_logical/snapshots directory
On Sat, Nov 4, 2017 at 10:45 AM, Mark Fletcher wrote: > > While trying to track down my logical decoding problem, I noticed that > my pg_logical/snapshots directory has ~5000 .snap files and is growing at a > rate of about 4 files a minute. The earliest file is from yesterday > afternoon, dating to the time I took our site down and added a column to a > large table. > > This is with Postgres 9.6.5. After the directory got to 8000 files, I decided it wasn't going to stop on its own. Here is my setup, and what I ended up doing: We run 6 logical streaming processes. First, I stopped and restarted those processes. That did not clean up the pg_logical/snapshots directory. Then I restarted Postgres itself. That also had no effect. Finally, I destroyed all 6 logical slots. That was the thing that caused Postgres to clean up the snapshots directory. I then recreated the slots and everything is back to normal (normal meaning around 20 .snap files in that directory that get rotated regularly). Going back through my backups, it's clear that this problem began right after I added a column to a table. So maybe that caused something to be held that shouldn't be held? Thanks, Mark