Shared buffer hash table corrupted

2020-02-21 Thread Mark Fletcher
Hi All,

Running 9.6.15, this morning we got a 'shared buffer hash table corrupted'
error on a query. I reran the query a couple hours later, and it completed
without error. This is running in production on a Linode instance which
hasn't seen any config changes in months.

I didn't find much online about this. How concerned should I be? Would you
move the instance to a different physical host?

Thanks,
Mark


Re: Shared buffer hash table corrupted

2020-02-22 Thread Mark Fletcher
On Fri, Feb 21, 2020 at 2:53 PM Tom Lane  wrote:

>
> Personally, I'd restart the postmaster, but not do more than that unless
> the error recurs.
>

Thanks for the response. I did restart the postmaster yesterday. Earlier
this morning, a query that normally completes fine started to error out
with 'invalid memory alloc request size 18446744073709551613'. Needless to
say our database isn't quite that size. This query was against a table in a
different database than the one that had the corruption warning yesterday.
Restarting the postmaster again fixed the problem. For good measure I
restarted the machine as well.

I need to decide what to do next, if anything. We have a hot standby that
we also run queries against, and it hasn't shown any errors. I can switch
over to that as the primary. Or I can move the main database to a different
physical host.

Thoughts appreciated.

Thanks,
Mark


Re: Shared buffer hash table corrupted

2020-02-22 Thread Mark Fletcher
On Sat, Feb 22, 2020 at 9:34 AM Tom Lane  wrote:

>
> Um.  At that point I'd agree with your concern about developing hardware
> problems.  Both of these symptoms could be easily explained by dropped
> bits in PG's shared memory area.  Do you happen to know if the server
> has ECC RAM?
>

Yes, it appears that Linode uses ECC and other server grade hardware for
their machines.

Thanks,
Mark


Logical replication/publication question

2022-10-09 Thread Mark Fletcher
Hi,

We're migrating from 9.6 to 14, using pglogical. We have several logical
slots on the 9.6 instance implementing a change data capture pattern. For
the migration, we plan on recreating the slots on the 14 instance, without
taking a snapshot of the data. When the migration happens, we will simply
start using the slots on the 14 instance (with the understanding that the
LSNs won't match between the 9.6 and 14 instances).

In testing, we have this working, but there was a wrinkle and I'd like to
know if my understanding is correct. On the 9.6 instance, when creating the
replication slots, we would use the START_REPLICATION SLOT command, and
then immediately take a snapshot, and it has worked great for years.

On a 14 instance, receiving changes from the 9.6 instance via pglogical, if
we recreate the logical slot, but not take a snapshot, no changes are
propagated, and there are no errors. However, if we first issue a 'CREATE
PUBLICATION' command, then changes are propagated to the slot as expected.

We never issued a 'CREATE PUBLICATION' command on the 9.6 instance. My
guess is that the act of taking a snapshot twiddled whatever bits were
necessary to propagate changes through the slot. By not taking a snapshot
on the 14 instance, that doesn't happen, hence the need for the 'CREATE
PUBLICATION' command. Alternatively, something was changed between 9.6 and
14 that now requires the new command (I looked through the docs and could
find no mention, however).

Is my understanding correct? Am I missing something? I just want to make
sure I'm not screwing something up.
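
In case it helps anyone reading along, the sequence that worked for us in
testing looks roughly like this (slot and publication names are made up, and
I'm assuming the built-in pgoutput plugin here; pglogical setups will differ):

```sql
-- On the 14 instance: create the publication first, then the slot.
CREATE PUBLICATION cdc_pub FOR ALL TABLES;

-- Recreate the logical slot without taking any snapshot of the data:
SELECT pg_create_logical_replication_slot('cdc_slot', 'pgoutput');
```

With the publication in place first, changes propagated through the slot as
expected; without it, nothing flowed and nothing errored.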

Thanks,
Mark


Re: Oracle vs PG

2018-10-24 Thread Mark Fletcher
A Twitter thread from someone who talked with the reporter (also read
Werner's statement referenced in the first tweet):

https://twitter.com/andy_pavlo/status/1055154039606910976

Mark


ERROR: found multixact XX from before relminmxid YY

2018-12-28 Thread Mark Fletcher
Hi,

Starting yesterday morning, autovacuuming of one of our PostgreSQL 9.6.10
(CentOS 7) tables started failing:

ERROR:  found multixact 370350365 from before relminmxid 765860874
CONTEXT:  automatic vacuum of table "userdb.public.subs"

This is about as plain and simple a table as there is. No triggers or
foreign keys, I'm not using any extensions. It has about 2.8M rows. I have
not done any consistency checks, but nothing strange has manifested in
production.

Reading the various discussions about this error, the only solution I found
was here:

https://www.postgresql.org/message-id/CAGewt-ukbL6WL8cc-G%2BiN9AVvmMQkhA9i2TKP4-6wJr6YOQkzA%40mail.gmail.com

But no other reports of this solving the problem. Can someone verify that
if I do the mentioned fix (and I assume upgrade to 9.6.11) that will fix
the problem? And that it doesn't indicate table corruption?
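
For context, the fix described in that thread boils down to rewriting the
affected rows' xmax by locking them, roughly as follows (a sketch only;
locking every row of a 2.8M-row table is not free, and I'd want confirmation
before running it in production):

```sql
BEGIN;
-- Touch every row so any broken xmax gets overwritten by the row lock:
SELECT 1 FROM subs FOR UPDATE;
COMMIT;

-- Then let vacuum try again:
VACUUM subs;
```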

Thanks,
Mark


Re: ERROR: found multixact XX from before relminmxid YY

2018-12-28 Thread Mark Fletcher
On Fri, Dec 28, 2018 at 4:49 PM Tom Lane  wrote:

>
> Yeah, SELECT FOR UPDATE should overwrite the broken xmax value and thereby
> fix it, I expect.  However, I don't see anything in the release notes
> suggesting that we've fixed any related bugs since 9.6.10, so if this
> just appeared then we've still got a problem :-(.  Did anything
> interesting happen since your last successful autovacuum on that table?
> Database crashes, WAL-related parameter changes, that sort of thing?
>

The last autovacuum of that table was on Dec 8th, the last auto analyze
was Dec 26. There have been no schema changes on that particular table,
database crashes or WAL-related parameter changes since then. We've done
other schema changes during that time, but otherwise the database has been
stable.

Thanks,
Mark


Re: Pulling data from Postgres DB table for every 5 seconds.

2019-01-09 Thread Mark Fletcher
On Wed, Jan 9, 2019 at 9:02 AM github kran  wrote:

>
>> Hi Postgres Team,
>>
>> I have an application using RDS Aurora Postgresql 9.6 version having 4 TB
>> of DB size. In this DB we have a table PRODUCT_INFO with around  1 million
>> rows and table size of 1 GB.
>> We are looking for an implementation where we want to pull the data in
>> real time for every 5 seconds from the DB ( Table mentioned above) and send
>> it to IOT topic whenever an event occurs for a product. ( event is any new
>> product information or change in the existing
>> product information.).
>>
>>
It's unclear whether you want to do table scans or if you're just looking
for changes to the database. If you're looking just for changes, consider
implementing something using logical replication. We have a logical
replication system set up to stream changes from the database into an
elastic search cluster, and it works great.

Mark


Re: Pulling data from Postgres DB table for every 5 seconds.

2019-01-09 Thread Mark Fletcher
On Wed, Jan 9, 2019 at 10:10 AM github kran  wrote:

> Mark - We are currently on 9.6 version of postgres and cant use this
> feature of logical replication.Answering to your question we are looking
> for any changes in the data related to a specific table ( changes like any
> update on a timestamp field
> OR any new inserts happened in the last 5 seconds for a specific product
> entity).
> Any other alternatives ?.
>

The feature was added in 9.4 (I think). We are on 9.6 and it works great.
Not sure about RDS Aurora, however.
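
If logical replication really is off the table, the usual fallback is polling
on an indexed timestamp column and keeping a high-water mark. Something like
this, assuming a hypothetical updated_at column on PRODUCT_INFO maintained by
a trigger or by the application:

```sql
-- Run every 5 seconds; :last_seen is the max updated_at from the previous poll.
SELECT *
  FROM product_info
 WHERE updated_at > :last_seen
 ORDER BY updated_at;
```

Note the caveats: this won't see deletes, and rows committed out of timestamp
order can be missed, which is part of why logical replication is the better
tool when it's available.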

Mark


Re: Pulling data from Postgres DB table for every 5 seconds.

2019-01-09 Thread Mark Fletcher
On Wed, Jan 9, 2019 at 12:58 PM github kran  wrote:

>
> Mark - just curious to know on the logical replication. Do you think I can
> use it for my use case where i need to publish data to a subscriber when
> there is a change in the data updated for a row or any new inserts
> happening on the table. Intention
> is to send this data in Json format by collecting this modified data in
> real time to a subscriber.
>

From what you've said, it's a great use case for that feature. The one
thing to note is that you will have to code up a logical replication
client. If I can do it, pretty much anyone can, but it might take some time
to get things right. I wrote about some of what I found when developing our
client a year ago here:
https://wingedpig.com/2017/09/20/streaming-postgres-changes/

We ended up just using the included test output plugin that comes with the
postgresql distribution. And we didn't end up streaming to Kafka or
anything else first. We just take the data and insert it into our
elasticsearch cluster directly as we get it.
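
To get a flavor of what the test_decoding plugin emits before writing a full
streaming client, you can peek at a slot from plain SQL (slot name here is
hypothetical, and peeking does not consume the changes):

```sql
SELECT lsn, xid, data
  FROM pg_logical_slot_peek_changes('my_slot', NULL, NULL);
-- 'data' comes back as text lines along the lines of (illustrative):
--   table public.subs: UPDATE: id[integer]:42 email[text]:'a@example.com'
```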

Mark


Re: Barman disaster recovery solution

2019-02-27 Thread Mark Fletcher
On Wed, Feb 27, 2019 at 1:39 AM Achilleas Mantzios <
ach...@matrix.gatewaynet.com> wrote:

>
> Hello, as promised here is my blog :
>
> https://severalnines.com/blog/current-state-open-source-backup-management-postgresql
>
>
Nice blog post. If you're aiming for a comprehensive rundown of tools, I
suggest including wal-g. We've been using it since it was released (and its
predecessor wal-e for years before that) and it's been great. We currently
use it to back up a replica to S3, and it has no issues doing that. In more
recent versions, it supports delta backups, which decrease the time/load
required for a backup immensely in our case.

Cheers,
Mark


Non-pausing table scan on 9.6 replica?

2019-03-05 Thread Mark Fletcher
Hi All,

On a 9.6 streaming replica, we do table scans for stats and other things.
During these scans, the replication is paused (the 'recovering' postgres
process has 'waiting' appended to it). We're not using transactions with
these scans. Is there anything we can do to prevent the pausing?

Thanks,
Mark


Re: Non-pausing table scan on 9.6 replica?

2019-03-05 Thread Mark Fletcher
Thank you for responding to my email.

On Tue, Mar 5, 2019 at 9:20 PM Andreas Kretschmer 
wrote:

>
> have you set ``max_standby_streaming_delay``? The default is 30
> seconds, which means that this will be the maximum time allowed for a
> replication lag caused by a conflicting query.
>

Yes, we've bumped that up a lot.


> You can use ``hot_standby_feedback = on``, but the downside will be more
> bloat on the tables.
>

I'm not sure I understand. I'm not worried about the query being cancelled
on the replica. max_standby_streaming_delay fixes that for us. I'm worried
about the streaming replication being paused while this table scan is
running. If the table scan takes several minutes, then the replica becomes
several minutes out of sync with the master. I'd prefer that not to happen
and I'm wondering if there's a way to do that.

Thanks,
Mark


Re: Non-pausing table scan on 9.6 replica?

2019-03-06 Thread Mark Fletcher
Andreas, Sameer,

Thank you for replying. I did not understand the purpose of
hot_standby_feedback, and your explanations helped. I turned it on, and the
pausing stopped.
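
For anyone hitting this later, the two settings discussed in this thread end
up in postgresql.conf looking something like this (values illustrative, not a
recommendation):

```
hot_standby_feedback = on             # replica reports its xmin to the primary;
                                      # stops the recovery pauses, can add bloat
max_standby_streaming_delay = 30min   # max time recovery waits behind a
                                      # conflicting standby query
```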

Thanks,
Mark


Re: schema change tracking

2019-05-16 Thread Mark Fletcher
On Thu, May 16, 2019 at 9:41 AM Benedict Holland <
benedict.m.holl...@gmail.com> wrote:

>
> I need a tool that can track schema changes in a postgesql database, write
> scripts to alter the tables, and store those changes in git. Are there
> tools that exist that can do this?
>

We ended up rolling our own. We do schema dumps and then use
https://www.apgdiff.com/ to diff them. For our relatively simple schemas,
it's worked fine. One thing to note, apgdiff doesn't support `ALTER TABLE
ONLY [a-z\.]+ REPLICA IDENTITY FULL;` lines, which we just remove before
diffing.

Cheers,
Mark


Logical replication blocking alter/drop

2017-12-12 Thread Mark Fletcher
Hi All,

Postgres 9.6.5. We run several logical replication processes off our main
postgres server. What we've noticed is that schema changes seem to block
until we halt the logical replication processes. For example, I just did a
'DROP INDEX CONCURRENTLY' command, and it just sat there until I stopped
the logical replication processes (I did not have to drop the logical
replication slots).

Is this expected or are we perhaps doing something wrong?

Thanks,
Mark