[GENERAL] Logical decoding error

2017-11-02 Thread Mark Fletcher
Hello,

Running Postgres 9.6.5, we're using logical decoding to take changes to the
database and propagate them elsewhere in our system. We are using the PGX
Go Postgres library, at https://github.com/jackc/pgx, and we are using the
test_decoding plugin to format the changes. We use six replication slots,
with six processes streaming the changes from our database.

This setup works great, except that every 20 hours or so some or all of
the processes encounter a problem, all at the same time: they receive an
unexpected message type 'w'. At this point the processes restart, and when
they do, they encounter another error: "ERROR: got sequence entry 0 for
toast chunk 20559160 instead of seq 6935 (SQLSTATE XX000)" (the chunk
number/sequence number varies). This causes them to restart again. They hit
the sequence entry error up to three more times before things magically
start to work again.

We are also doing standard streaming replication to a slave off this
database, and that has never seen a problem.

Does this ring a bell for anyone? Do you have any suggestions for how I
should go about figuring out what's happening?

Thanks,
Mark


Re: [GENERAL] Logical decoding error

2017-11-02 Thread Mark Fletcher
On Thu, Nov 2, 2017 at 9:59 AM, Steve Atkins  wrote:

>
> Where are the errors coming from - your code or pgx? If it's from pgx,
> what's the exact error? ('w' is regular replication payload data, so it'd
> be expected as a copydata payload message type, but would be an error for a
> replication message).
>
The errors are coming from PGX. Best I can tell (and I'm not very familiar
with the PGX code, nor with the internals of Postgres replication), it's
waiting for a replication message from the backend, so it's expecting
things like CopyData/CopyInResponse/CopyOutResponse/CopyBothResponse. It
gets a 'w' message type at that point, which it doesn't handle, so it
errors back to me (the CopyData message type it expects there is 'd').
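To make the two layers concrete, here is a minimal sketch (my own function names, not pgx code) of how an XLogData message ('w', defined by the streaming-replication sub-protocol) travels inside a CopyData frame ('d', defined by the ordinary frontend/backend protocol). A client that reads the 'w' byte where it expected a top-level protocol message type has lost track of which layer it is parsing:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// buildXLogDataFrame wraps a decoded-change payload in the two layers being
// discussed: an outer CopyData ('d') frame from the frontend/backend
// protocol, and an inner XLogData ('w') message from the replication
// sub-protocol (per the documented formats: 'w', then WAL start, WAL end,
// and server send time, each as a big-endian 8-byte integer).
func buildXLogDataFrame(walStart, walEnd, sendTime uint64, payload []byte) []byte {
	var inner bytes.Buffer
	inner.WriteByte('w') // XLogData
	binary.Write(&inner, binary.BigEndian, walStart)
	binary.Write(&inner, binary.BigEndian, walEnd)
	binary.Write(&inner, binary.BigEndian, sendTime)
	inner.Write(payload)

	var outer bytes.Buffer
	outer.WriteByte('d') // CopyData
	// The 4-byte length field counts itself plus the enclosed data.
	binary.Write(&outer, binary.BigEndian, int32(4+inner.Len()))
	outer.Write(inner.Bytes())
	return outer.Bytes()
}

// parseFrame strips the CopyData envelope and returns the outer and inner
// message types plus the decoded-change payload.
func parseFrame(frame []byte) (outerType, innerType byte, body []byte) {
	outerType = frame[0]
	inner := frame[5:] // skip 'd' and the 4-byte length
	innerType = inner[0]
	body = inner[25:] // skip 'w' and the three 8-byte header fields
	return
}

func main() {
	frame := buildXLogDataFrame(0, 0, 0, []byte("BEGIN 1234"))
	o, i, body := parseFrame(frame)
	fmt.Printf("outer=%c inner=%c payload=%q\n", o, i, body)
	// prints: outer=d inner=w payload="BEGIN 1234"
}
```

So seeing 'w' is not corruption in itself; the question is why the client was reading at the wrong layer at that moment.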

I'm only using PGX to open a replication connection, start replication,
read replication messages and send heartbeats.

I did open an issue on GitHub about this a couple of days ago, but haven't
heard back from the PGX developer, so I thought I should try to help out
and debug it myself. So I've been trying to educate myself on replication
internals and such. Hence my email here.


> Do you capture the raw data from the replication connection when the error
> happens?
>

I was not, but I have added that and some other logging, and will report
back when I get more info.


Thanks,
Mark


[GENERAL] pg_logical/snapshots directory

2017-11-04 Thread Mark Fletcher
Hi All,

While trying to track down my logical decoding problem, I noticed that
my pg_logical/snapshots directory has ~5000 .snap files and is growing at a
rate of about 4 files a minute. The earliest file is from yesterday
afternoon, dating to the time I took our site down and added a column to a
large table.

I have not been able to find any docs on this directory. Can someone point
me to an explanation of this directory and how to manage the number of
.snap files in it (and/or whether I should be worried about the number of
files in it)?

Thanks,
Mark


Re: [GENERAL] pg_logical/snapshots directory

2017-11-05 Thread Mark Fletcher
On Sat, Nov 4, 2017 at 10:45 AM, Mark Fletcher  wrote:

>
> While trying to track down my logical decoding problem, I noticed that
> my pg_logical/snapshots directory has ~5000 .snap files and is growing at a
> rate of about 4 files a minute. The earliest file is from yesterday
> afternoon, dating to the time I took our site down and added a column to a
> large table.
>
This is with Postgres 9.6.5. After the directory got to 8000 files, I
decided it wasn't going to stop on its own. Here is my setup and what I
ended up doing:

We run six logical streaming processes. First, I stopped and restarted
those processes. That did not clean up the pg_logical/snapshots directory.
Then I restarted Postgres itself. That also had no effect. Finally, I
dropped all six logical slots. That is what caused Postgres to clean up the
snapshots directory. I then recreated the slots, and everything is back to
normal (normal meaning around 20 .snap files in that directory that get
rotated regularly).

Going back through my backups, it's clear that this problem began right
after I added a column to a table. So maybe that DDL caused a snapshot to
be retained that shouldn't have been?

Thanks,
Mark