[GENERAL] PG 9.3.12: Replication appears to have worked, but getting error messages in logs
Hello,

We're using streaming replication. Our technique for spinning up a db slave is this:

  1. rsync from master (gross copy)
  2. pg_start_backup() on the master
  3. rsync from master (correct copy)
  4. pg_stop_backup()
  5. drop recovery.conf into the slave data directory
  6. enable hot_standby in the slave conf
  7. start the slave

After starting the slave, I'm getting this error every 5 seconds in the log:

    ERROR: requested WAL segment 00010E220005 has already been removed

But I can connect to the DB and make queries, and new records are definitely streaming in.

I thought I just didn't have enough WAL segments, so I bumped up the number on the master and restarted the process. It just finished the second time, and the exact same error message is in the logs again (same WAL segment number).

When I ran pg_start_backup() and pg_stop_backup(), the output was:

    pg_start_backup - E27/3100A200
    pg_stop_backup  - E28/7D357950

The backup_label file looked like this:

    START WAL LOCATION: E27/3100A200 (file 00010E270031)
    CHECKPOINT LOCATION: E27/31C9C740
    BACKUP METHOD: pg_start_backup
    BACKUP FROM: master
    START TIME: 2016-04-02 12:34:25 PDT
    LABEL: clone

During the rsync it copied pg_xlog/00010E2500F1 through pg_xlog/00010E280071.

So I'm confused: why is the E22 WAL being requested? It seems to predate the backup by a lot. Does the slave really contain all the data? If not, how can I tell what is missing (and why is it accepting streaming data if it's missing something)?

One more piece of the puzzle that may or may not be relevant: the current master used to be a streaming replication slave. The original master had a disk failure, so we switched one of the backup slaves into a master. We've since replaced the disk on the original server and are now trying to make it a streaming replication slave. This is the part that's failing. If I do rough estimates of how fast the Exx number is incrementing and compute backwards, E22 lands at about the time of the original disk failure, give or take.
Thanks,
  David
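As a side note for anyone decoding these names: on 9.3, the built-in pg_xlogfile_name() function maps an LSN to the WAL segment file that contains it, which makes it easy to check whether a segment named in an error predates or postdates a backup. A sketch (run on the server; the result depends on the server's current timeline):

```sql
-- pg_xlogfile_name(location text) returns the 24-character WAL segment
-- name: timeline, high 32 bits of the LSN, and the 16 MB segment number.
SELECT pg_xlogfile_name('E27/3100A200');
-- on timeline 1 this is 0000000100000E2700000031
```

By the same arithmetic, an E22 segment is far older than the E25-E28 files the rsync copied, which would fit the disk-failure timeline guess at the end of the message.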
[GENERAL] postgresql 9.3.10, FIPS mode and DRBG issues.
Hi, there. First, my particulars:

  * Ubuntu Trusty build and runtime environment
  * PostgreSQL 9.3.10 Ubuntu source code
  * Using a FIPS-enabled version of OpenSSL (i.e. the 1.0.1p version of the library and 2.0.9 of the FIPS canister source code)
  * I think this is probably more of a developer question, but the mailing list site said to post it elsewhere first, so here it is
  * I am new to FIPS and postgresql in general (i.e. working with them for a few months)

I've been trying to get the postgresql packages to work in FIPS mode. To accomplish this, I've patched the Ubuntu source code with the patch that is attached to this message.

The main postgresql server runs fine as expected in either FIPS or non-FIPS mode. However, when I try to use the psql command in FIPS mode, I get the following error:

    # psql -h 127.0.0.1 -U postgres -d sslmode=require
    psql: SSL SYSCALL error: EOF detected

I used the gdb debugger to try to find where in the backend the command was failing. The backtrace on the server side suggests that the key exchange is failing:

    (gdb) bt
    #0  0x7f40183e8f20 in __nanosleep_nocancel () at ../sysdeps/unix/syscall-template.S:81
    #1  0x7f40183e8dd4 in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:137
    #2  0x7f40196a95ce in DH_generate_key () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
    #3  0x7f40199e8ba6 in ssl3_send_server_key_exchange () from /usr/lib/x86_64-linux-gnu/libssl.so.1.0.0
    #4  0x7f40199ec18b in ssl3_accept () from /usr/lib/x86_64-linux-gnu/libssl.so.1.0.0
    #5  0x7f40199fb8b3 in ssl23_accept () from /usr/lib/x86_64-linux-gnu/libssl.so.1.0.0
    #6  0x5618082567a4 in open_server_SSL (port=0x561808e05700) at /home/rlott/git/stash/postgresql-fips/postgresql-9.3-9.3.10/build/../src/backend/libpq/be-secure.c:925
    #7  secure_open_server (port=port@entry=0x561808e05700) at /home/rlott/git/stash/postgresql-fips/postgresql-9.3-9.3.10/build/../src/backend/libpq/be-secure.c:221
    #8  0x5618082c7eb8 in ProcessStartupPacket (port=port@entry=0x561808e05700, SSLdone=SSLdone@entry=0 '\000') at /home/rlott/git/stash/postgresql-fips/postgresql-9.3-9.3.10/build/../src/backend/postmaster/postmaster.c:1921
    #9  0x5618081030f9 in BackendInitialize (port=0x561808e05700) at /home/rlott/git/stash/postgresql-fips/postgresql-9.3-9.3.10/build/../src/backend/postmaster/postmaster.c:4036
    #10 BackendStartup (port=0x561808e05700) at /home/rlott/git/stash/postgresql-fips/postgresql-9.3-9.3.10/build/../src/backend/postmaster/postmaster.c:3807
    #11 ServerLoop () at /home/rlott/git/stash/postgresql-fips/postgresql-9.3-9.3.10/build/../src/backend/postmaster/postmaster.c:1690
    #12 0x5618082cace1 in PostmasterMain (argc=5, argv=) at /home/rlott/git/stash/postgresql-fips/postgresql-9.3-9.3.10/build/../src/backend/postmaster/postmaster.c:1315
    #13 0x561808103fb3 in main (argc=5, argv=0x561808db6970) at /home/rlott/git/stash/postgresql-fips/postgresql-9.3-9.3.10/build/../src/backend/main/main.c:227

I tracked it down to the following code in the OpenSSL 2.0.9 canister code:

    int FIPS_drbg_generate(DRBG_CTX *dctx, unsigned char *out, size_t outlen,
                           int prediction_resistance,
                           const unsigned char *adin, size_t adinlen)
    {
        int r = 0;

        if (FIPS_selftest_failed())
        {
            FIPSerr(FIPS_F_FIPS_DRBG_GENERATE, FIPS_R_SELFTEST_FAILED);
            return 0;
        }

        if (!fips_drbg_check(dctx))
            return 0;

        if (dctx->status != DRBG_STATUS_READY
            && dctx->status != DRBG_STATUS_RESEED)
        {
            if (dctx->status == DRBG_STATUS_ERROR)
                r = FIPS_R_IN_ERROR_STATE;
            else if (dctx->status == DRBG_STATUS_UNINITIALISED)
                r = FIPS_R_NOT_INSTANTIATED;
            goto end;
        }
        ...

The place where it fails is where dctx->status == DRBG_STATUS_UNINITIALISED (i.e. 0).

So, my question is this: in FIPS mode, what would cause the random number generation to not initialize? I have put print statements in the postgresql code, so I know that it is in FIPS mode properly. I also know that dctx->status, where dctx points to a "static DRBG_CTX ossl_dctx" structure, is initialized to 1 in the main process. It appears that this initialization doesn't get propagated to the other backends or to the SSL exchange above.

If any of the developers have some insight into this, I would appreciate it.

Thanks,
Rodney Lott

Attachment: 0001-FIPS-enabled-postgresql.patch

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] CORRUPTION on TOAST table
Hello Adrian, thanks for the response.

The master data is also located on the SAN.

Yes, each replica is its own VM with its own virtual disk/volume as served up from the same SAN.

Raw disk mappings are a way for ESX to present a SAN volume directly to a VM instead of creating a virtual disk.

No unexpected messages detected.

On Sun, Apr 3, 2016 at 11:23 PM, Adrian Klaver wrote:
> On 04/02/2016 08:38 PM, Soni M wrote:
>> Hello Everyone,
>>
>> We face TOAST table corruption.
>>
>> One master and two streaming replicas. The corruption happens only on
>> the two streaming replicas.
>>
>> We did find the corrupted rows. Selecting such a row returns (on both
>> replicas): unexpected chunk number 0 (expected 1) for toast value
>> 1100613112 in pg_toast_112517
>> Selecting the same row on the master does not return a corruption error,
>> but returns the correct result instead.
>>
>> Previously, a dump on a replica returned: unexpected chunk number 0
>> (expected 1) for toast value 3234098599 in pg_toast_112517 (please note
>> the toast value is different)
>>
>> This table is 343 GB and contains around 206,179,697 live tuples. We
>> found that the corruption happens on the biggest column (this column and
>> its pkey are around 299 GB total).
>>
>> On both replicas:
>> fsync was NEVER turned off.
>> No unexpected power loss nor OS crash.
>>
>> How can the corruption occur? And how can I resolve it?
>
> Meant to add to previous post.
>
> Do you see anything in the replica Postgres logs that indicates a problem
> with the replication process?
>
> Or any other unexpected messages prior to the point you did the select on
> the replica(s)?
>
>> Thanks so much for the help.
>>
>> Cheers \o/
>>
>> --
>> Regards,
>>
>> Soni Maula Harriz
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com

--
Regards,
Soni Maula Harriz
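For locating which rows are affected, a common approach (not from this thread; the table, key, and column names below are placeholders) is to force detoasting row by row inside a subtransaction and trap the error:

```sql
-- Sketch: scan a large table one row at a time, detoasting the big
-- column; report rows whose TOAST data cannot be read.
-- "big_table", "id", and "payload" are hypothetical names.
DO $$
DECLARE
    r RECORD;
BEGIN
    FOR r IN SELECT id FROM big_table ORDER BY id LOOP
        BEGIN
            PERFORM length(payload) FROM big_table WHERE id = r.id;
        EXCEPTION WHEN OTHERS THEN
            RAISE NOTICE 'unreadable row id = %: %', r.id, SQLERRM;
        END;
    END LOOP;
END $$;
```

Run on each replica; on a 343 GB table this per-row scan is slow, so narrowing to a suspect id range first helps.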
Re: [GENERAL] postgresql 9.3.10, FIPS mode and DRBG issues.
Rodney Lott writes:
> So, my question is this: In FIPS mode, what would cause the random
> number generation to not initialize?

I remember that Red Hat's version of "FIPS mode" involved crypto features (including RNGs) just refusing to work in modes deemed inadequately secure. So my guess is that psql is trying to configure OpenSSL with some inadequately-secure settings. Not sure why it'd be different from the server, though. Are you sure psql and the libpq it's using are the same version as the apparently-working server?

        regards, tom lane
Re: [GENERAL] postgresql 9.3.10, FIPS mode and DRBG issues.
> > So, my question is this: In FIPS mode, what would cause the random
> > number generation to not initialize?
>
> I remember that Red Hat's version of "FIPS mode" involved crypto
> features (including RNGs) just refusing to work in modes deemed
> inadequately secure. So my guess is that psql is trying to configure
> OpenSSL with some inadequately-secure settings. Not sure why it'd be
> different from the server though. Are you sure psql and the libpq it's
> using are same version as the apparently-working server?
>
> regards, tom lane

Hi, Tom. Thanks for the quick reply.

I'll look into the settings and see what I can find.

I double-checked the installed packages, and they seem to be from my same postgresql build (i.e. note my timestamp of 1459281538):

    # dpkg -l | grep postgres
    ii  postgresql-9.3            9.3.10-0ubuntu0.14.04~et1~fips~2.0.9~1459281538  amd64  object-relational SQL database, version 9.3 server
    ii  postgresql-9.3-dbg        9.3.10-0ubuntu0.14.04~et1~fips~2.0.9~1459281538  amd64  debug symbols for postgresql-9.3
    ii  postgresql-client-9.3     9.3.10-0ubuntu0.14.04~et1~fips~2.0.9~1459281538  amd64  front-end programs for PostgreSQL 9.3
    ii  postgresql-client-common  154-et1~fips~2.0.9~1459281538                    all    manager for multiple PostgreSQL client versions
    ii  postgresql-common         154-et1~fips~2.0.9~1459281538                    all    PostgreSQL database-cluster manager
    ii  postgresql-contrib-9.3    9.3.10-0ubuntu0.14.04~et1~fips~2.0.9~1459281538  amd64  additional facilities for PostgreSQL
    ii  postgresql-json-build     1.1.0-et3                                        amd64  json_build extension for postgresql
    ii  postgresql-plpython-9.3   9.3.10-0ubuntu0.14.04~et1~fips~2.0.9~1459281538  amd64  PL/Python procedural language for PostgreSQL 9.3

    # dpkg -l | grep libpq
    ii  libpq5                    9.3.10-0ubuntu0.14.04~et1~fips~2.0.9~1459281538  amd64  PostgreSQL C client library

    # dpkg -S /usr/bin/psql
    postgresql-client-common: /usr/bin/psql

    # dpkg -S /usr/lib/postgresql/9.3/bin/postgres
    postgresql-9.3: /usr/lib/postgresql/9.3/bin/postgres

    # psql -h 127.0.0.1 -U postgres -d sslmode=require
    psql: SSL SYSCALL error (0): EOF detected, err=5

So, I believe that psql and libpq are from the same version as the currently working server.

Regards,
Rodney
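One further cross-check (assuming a local non-SSL connection still works, e.g. over the Unix socket with sslmode=disable, since SSL connections are what fail here) is to compare the server's reported build with the client's:

```sql
SELECT version();     -- server build string; compare with `psql --version` on the client
SHOW server_version;  -- short form of the server version
```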
Re: [GENERAL] How to quote the COALESCE function?
On Tue, Mar 29, 2016 at 12:38 PM, Pavel Stehule wrote:
> The coalesce is one of a few functions implemented by a special rule in
> the PostgreSQL parser.

In the SQL standard, the COALESCE feature is not listed as a function; it is listed as one of the short forms of CASE expression. While it has function-like syntax, thinking of it as a function is semantically incorrect.

COALESCE(a, b) is supposed to be semantically equivalent to:

    CASE WHEN a IS NOT NULL THEN a ELSE b END

Among other things, that means that this statement should not generate a divide-by-zero error:

    SELECT COALESCE(1, 1/0);

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
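The lazy-evaluation point above is easy to verify from psql; a quick sketch:

```sql
SELECT COALESCE(1, 1/0);                             -- returns 1; 1/0 is never evaluated
SELECT CASE WHEN 1 IS NOT NULL THEN 1 ELSE 1/0 END;  -- the equivalent CASE form, also 1
SELECT 1/0;                                          -- ERROR: division by zero
```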
Re: [GENERAL] CORRUPTION on TOAST table
It seems that it was a Postgres bug on the replica; after upgrading the minor version to 9.1.21 on replica1, the corruption went away.

Thanks everyone for the help.

On Tue, Apr 5, 2016 at 1:32 AM, Soni M wrote:
> Hello Adrian, thanks for the response.
>
> The master data is also located on the SAN.
>
> Yes, each replica is its own VM with its own virtual disk/volume as served
> up from the same SAN.
>
> Raw disk mappings are a way for ESX to present a SAN volume directly to a
> VM instead of creating a virtual disk.
>
> No unexpected messages detected.
>
> On Sun, Apr 3, 2016 at 11:23 PM, Adrian Klaver wrote:
>> On 04/02/2016 08:38 PM, Soni M wrote:
>>> Hello Everyone,
>>>
>>> We face TOAST table corruption.
>>>
>>> One master and two streaming replicas. The corruption happens only on
>>> the two streaming replicas.
>>>
>>> We did find the corrupted rows. Selecting such a row returns (on both
>>> replicas): unexpected chunk number 0 (expected 1) for toast value
>>> 1100613112 in pg_toast_112517
>>> Selecting the same row on the master does not return a corruption error,
>>> but returns the correct result instead.
>>>
>>> Previously, a dump on a replica returned: unexpected chunk number 0
>>> (expected 1) for toast value 3234098599 in pg_toast_112517 (please note
>>> the toast value is different)
>>>
>>> This table is 343 GB and contains around 206,179,697 live tuples. We
>>> found that the corruption happens on the biggest column (this column and
>>> its pkey are around 299 GB total).
>>>
>>> On both replicas:
>>> fsync was NEVER turned off.
>>> No unexpected power loss nor OS crash.
>>>
>>> How can the corruption occur? And how can I resolve it?
>>
>> Meant to add to previous post.
>>
>> Do you see anything in the replica Postgres logs that indicates a problem
>> with the replication process?
>>
>> Or any other unexpected messages prior to the point you did the select on
>> the replica(s)?
>>
>>> Thanks so much for the help.
>>>
>>> Cheers \o/
>>>
>>> --
>>> Regards,
>>>
>>> Soni Maula Harriz
>>
>> --
>> Adrian Klaver
>> adrian.kla...@aklaver.com
>
> --
> Regards,
>
> Soni Maula Harriz

--
Regards,
Soni Maula Harriz
[GENERAL] template1 being accessed
Hi there.

I'm currently using postgres 9.2. As you can see below, my "template1" database was being accessed:

[image: Inline images 2]

That server is a 4-day-delayed backup DB: it does a gzipped pg_dump, excluding some tables; it also runs 4-day-delayed replication using the WAL archive (a 345600 s delay); and it also keeps file-level copies (excluding the PG data directory).

Why was the template1 database being accessed, and why has it now stopped?

Cheers
Lucas
Re: [GENERAL] template1 being accessed
On 4/4/2016 9:04 PM, drum.lu...@gmail.com wrote:
> I'm currently using postgres 9.2. As you can see below, my "template1"
> database was being accessed:

500m is 1/2 an access per whatever interval that graph is using.

Typically, template1 is accessed when you do a CREATE DATABASE, if you don't specify a different template. It's also possible some management software uses it as a default place to connect so it can get a list of databases or whatever.

--
john r pierce, recycling bits in santa cruz
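To catch the client in the act, watching pg_stat_activity while the accesses are happening shows who is connected to template1; a quick sketch:

```sql
SELECT datname, usename, application_name, client_addr, state
FROM pg_stat_activity
WHERE datname = 'template1';
```

On 9.2, application_name will often identify a monitoring or management tool directly.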
Re: [GENERAL] PG 9.3.12: Replication appears to have worked, but getting error messages in logs
On Sun, Apr 3, 2016 at 2:50 PM, David Caldwell wrote:
> Hello,
>
> We're using streaming replication. Our technique for spinning up a db
> slave is this:
>
> rsync from master (gross copy)
> pg_start_backup() on server
> rsync from master (correct copy)
> pg_stop_backup()
> drop in recovery.conf into slave directory
> enable hot_standby in slave conf
> start slave
>
> After starting the slave, I'm getting this error every 5 seconds in the log:
>
> ERROR: requested WAL segment 00010E220005 has already been
> removed

Are you sure that this is from the slave and not another node requesting it? If this log entry has just ERROR, it means that the WAL sender generated it. So don't you have a cascading slave connecting to this new slave that requested this WAL segment which has already been removed?

Should the error come from the WAL receiver, you would see something like this:

    FATAL: could not receive data from WAL stream: ERROR: requested WAL
    segment 0001001C has already been removed

--
Michael
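If a standby (or a cascaded one) genuinely fell behind, the usual 9.3-era mitigations are retaining more WAL on the sending node or keeping a WAL archive the standby can restore from (replication slots only arrive in 9.4). A postgresql.conf sketch; the sizes and paths are illustrative assumptions, not from this thread:

```
# postgresql.conf on the node the standby streams from
wal_keep_segments = 1024      # keep ~16 GB of WAL (1024 segments x 16 MB)

# or keep a WAL archive the standby's recovery.conf can restore from:
# archive_mode = on
# archive_command = 'cp %p /path/to/archive/%f'     # illustrative path
```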
Re: [GENERAL] template1 being accessed
(sorry, back to the list)

On Tue, Apr 5, 2016 at 6:11 AM, John R Pierce wrote:
> its also possible some management software might use it as a default place
> to connect so they can get a list of databases or whatever.

This is probably the most common cause of continuous access to template1.

Luca