[GENERAL] PG 9.3.12: Replication appears to have worked, but getting error messages in logs

2016-04-04 Thread David Caldwell
Hello,

We're using streaming replication. Our technique for spinning up a db
slave is this:

rsync from master (gross copy)
pg_start_backup() on server
rsync from master (correct copy)
pg_stop_backup()
drop in recovery.conf into slave directory
enable hot_standby in slave conf
start slave
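
In SQL terms, steps 2 and 4 above amount to roughly this, run against the
master with psql (just a sketch; the 'clone' label matches the backup_label
shown further down):

  SELECT pg_start_backup('clone');   -- before the second (correct) rsync
  SELECT pg_stop_backup();           -- after that rsync finishes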

After starting the slave, I'm getting this error every 5 seconds in the log:

ERROR:  requested WAL segment 00010E220005 has already been
removed

But I can connect to the DB and make queries and new records are
definitely streaming in.

I thought I just didn't have enough retained WAL segments, so I bumped up
the number kept on the master and restarted the process. It just finished
the second time, and the exact same error message is in the logs again
(same WAL segment number).
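
A quick way to sanity-check retention on the master is to compare the
setting with the oldest segment still on disk (a rough diagnostic sketch;
pg_ls_dir needs superuser, and the regexp just matches WAL segment file names):

  SHOW wal_keep_segments;

  SELECT min(f) AS oldest_segment_on_disk
  FROM pg_ls_dir('pg_xlog') AS f
  WHERE f ~ '^[0-9A-F]{24}$';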


When I ran pg_start_backup() and pg_stop_backup(), the output was:

 pg_start_backup
-----------------
 E27/3100A200

 pg_stop_backup
----------------
 E28/7D357950


The backup_label file looked like this:

START WAL LOCATION: E27/3100A200 (file 00010E270031)
CHECKPOINT LOCATION: E27/31C9C740
BACKUP METHOD: pg_start_backup
BACKUP FROM: master
START TIME: 2016-04-02 12:34:25 PDT
LABEL: clone


During the rsync it copied
  pg_xlog/00010E2500F1
to
  pg_xlog/00010E280071


So I'm confused: why is the E22 WAL segment being requested? It seems to
predate the backup by a lot.
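
One way to double-check which segment files the backup actually spans is to
map the start/stop locations above to file names on the master (a sketch;
the result also depends on the current timeline ID):

  SELECT pg_xlogfile_name('E27/3100A200') AS start_segment,
         pg_xlogfile_name('E28/7D357950') AS stop_segment;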

Does the slave really contain all the data? If not, how can I tell what
is missing (and why is it accepting streaming data if it's missing
something)?


One more piece of the puzzle that may or may not be relevant:

The current master used to be a streaming replication slave. The original
master had a disk failure, so we switched one of the backup slaves into a
master. We've replaced the disk on the original server and we're now trying
to make it a streaming replication slave. This is the part that's failing.
If I do rough estimates of how fast the Exx number is incrementing and work
backwards, E22 lands at about the time of the original disk failure, give
or take.


Thanks,
  David






[GENERAL] postgresql 9.3.10, FIPS mode and DRBG issues.

2016-04-04 Thread Rodney Lott
Hi, there.

First, my particulars:

* Ubuntu Trusty build and runtime environment

* PostgreSQL 9.3.10 Ubuntu source code

* Using a FIPS enabled version of OpenSSL (i.e. 1.0.1p version of the 
library and 2.0.9 of the FIPS canister source code)

* I think this is probably more of a developer question, but the 
mailing list site said to post it elsewhere first, so here it is

* I am new to FIPS and postgresql in general (i.e. working with them 
for a few months)

I've been trying to get the postgresql packages to work in FIPS mode. To 
accomplish this, I've patched the Ubuntu source code with the patch that is 
attached to this message.

The main postgresql server runs fine as expected in either FIPS or non-FIPS 
modes. However, when I try to use the psql command in FIPS mode, I get the 
following error:

# psql -h 127.0.0.1 -U postgres -d sslmode=require
psql: SSL SYSCALL error: EOF detected

I used the gdb debugger to try to find where in the backend the command was 
failing. The backtrace on the server side suggests that the problem involves 
the key-exchange failing:

(gdb) bt
#0  0x7f40183e8f20 in __nanosleep_nocancel () at 
../sysdeps/unix/syscall-template.S:81
#1  0x7f40183e8dd4 in __sleep (seconds=0) at 
../sysdeps/unix/sysv/linux/sleep.c:137
#2  0x7f40196a95ce in DH_generate_key () from 
/usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#3  0x7f40199e8ba6 in ssl3_send_server_key_exchange () from 
/usr/lib/x86_64-linux-gnu/libssl.so.1.0.0
#4  0x7f40199ec18b in ssl3_accept () from 
/usr/lib/x86_64-linux-gnu/libssl.so.1.0.0
#5  0x7f40199fb8b3 in ssl23_accept () from 
/usr/lib/x86_64-linux-gnu/libssl.so.1.0.0
#6  0x5618082567a4 in open_server_SSL (port=0x561808e05700) at 
/home/rlott/git/stash/postgresql-fips/postgresql-9.3-9.3.10/build/../src/backend/libpq/be-secure.c:925
#7  secure_open_server (port=port@entry=0x561808e05700) at 
/home/rlott/git/stash/postgresql-fips/postgresql-9.3-9.3.10/build/../src/backend/libpq/be-secure.c:221
#8  0x5618082c7eb8 in ProcessStartupPacket (port=port@entry=0x561808e05700, 
SSLdone=SSLdone@entry=0 '\000') at 
/home/rlott/git/stash/postgresql-fips/postgresql-9.3-9.3.10/build/../src/backend/postmaster/postmaster.c:1921
#9  0x5618081030f9 in BackendInitialize (port=0x561808e05700) at 
/home/rlott/git/stash/postgresql-fips/postgresql-9.3-9.3.10/build/../src/backend/postmaster/postmaster.c:4036
#10 BackendStartup (port=0x561808e05700) at 
/home/rlott/git/stash/postgresql-fips/postgresql-9.3-9.3.10/build/../src/backend/postmaster/postmaster.c:3807
#11 ServerLoop () at 
/home/rlott/git/stash/postgresql-fips/postgresql-9.3-9.3.10/build/../src/backend/postmaster/postmaster.c:1690
#12 0x5618082cace1 in PostmasterMain (argc=5, argv=) at 
/home/rlott/git/stash/postgresql-fips/postgresql-9.3-9.3.10/build/../src/backend/postmaster/postmaster.c:1315
#13 0x561808103fb3 in main (argc=5, argv=0x561808db6970) at 
/home/rlott/git/stash/postgresql-fips/postgresql-9.3-9.3.10/build/../src/backend/main/main.c:227

I tracked it down to the following code in the OpenSSL 2.0.9 canister code:

int FIPS_drbg_generate(DRBG_CTX *dctx, unsigned char *out, size_t outlen,
                       int prediction_resistance,
                       const unsigned char *adin, size_t adinlen)
    {
    int r = 0;

    if (FIPS_selftest_failed())
        {
        FIPSerr(FIPS_F_FIPS_DRBG_GENERATE, FIPS_R_SELFTEST_FAILED);
        return 0;
        }

    if (!fips_drbg_check(dctx))
        return 0;

    if (dctx->status != DRBG_STATUS_READY
        && dctx->status != DRBG_STATUS_RESEED)
        {
        if (dctx->status == DRBG_STATUS_ERROR)
            r = FIPS_R_IN_ERROR_STATE;
        else if (dctx->status == DRBG_STATUS_UNINITIALISED)
            r = FIPS_R_NOT_INSTANTIATED;
        goto end;
        }
    ...

The place where it fails is where dctx->status == DRBG_STATUS_UNINITIALISED
(i.e. 0).

So, my question is this: in FIPS mode, what would cause the random number
generation to not initialize? I have put print statements in the postgresql
code, so I know that it is in FIPS mode properly. I also know that
dctx->status (dctx points to a static "DRBG_CTX ossl_dctx" structure) is
initialized to 1 in the main process. It appears that this initialization
doesn't get propagated to the other backends or to the SSL handshake shown
above.

If any of the developers have some insight into this, I would appreciate it.

Thanks,

Rodney Lott


0001-FIPS-enabled-postgresql.patch
Description: 0001-FIPS-enabled-postgresql.patch



Re: [GENERAL] CORRUPTION on TOAST table

2016-04-04 Thread Soni M
Hello Adrian, thanks for the response.

The master data is also located on the SAN.

Yes, each replica is its own VM with its own virtual disk/volume, as served
up from the same SAN.

Raw disk mappings are a way for ESX to present a SAN volume directly to a
VM instead of creating a virtual disk.

No unexpected messages were detected.
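
For what it's worth, a brute-force way to locate rows whose TOASTed data
cannot be read is a loop like the one below (only a sketch; mytable, id and
big_column are placeholder names, and it will be slow on a table this size):

  DO $$
  DECLARE
      r record;
  BEGIN
      FOR r IN SELECT id FROM mytable LOOP
          BEGIN
              -- force detoasting of the suspect column for this row
              PERFORM length(big_column) FROM mytable WHERE id = r.id;
          EXCEPTION WHEN OTHERS THEN
              RAISE NOTICE 'bad row id=%: %', r.id, SQLERRM;
          END;
      END LOOP;
  END $$;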

On Sun, Apr 3, 2016 at 11:23 PM, Adrian Klaver 
wrote:

> On 04/02/2016 08:38 PM, Soni M wrote:
>
>> Hello Everyone,
>>
>> We are facing TOAST table corruption.
>>
>> One master and two streaming replicas. The corruption happens only on
>> the streaming replicas (on both of them).
>>
>> We did find the corrupted rows. Selecting such a row returns (on both
>> replicas): unexpected chunk number 0 (expected 1) for toast value
>> 1100613112 in pg_toast_112517
>> Selecting the same row on the master does not return a corruption error;
>> it returns the correct result instead.
>>
>> Previously, a dump on a replica returned: unexpected chunk number 0
>> (expected 1) for toast value 3234098599 in pg_toast_112517 (please note
>> the toast value is different)
>>
>> This table is 343 GB and contains around 206,179,697 live tuples. We
>> found that the corruption happens on the biggest column (this column and
>> its pkey are around 299 GB total).
>>
>>
>> On both replicas:
>> fsync was NEVER turned off.
>> No unexpected power loss or OS crash.
>>
>> How can the corruption occur? And how can I resolve it?
>>
>>
>
> Meant to add to previous post.
>
> Do you see anything in the replica Postgres logs that indicate a problem
> with the replication process?
>
> Or any other unexpected messages prior to the point you did the select on
> the replica(s)?
>
>
>
>> Thank so much for the help.
>>
>> Cheers \o/
>>
>> --
>> Regards,
>>
>> Soni Maula Harriz
>>
>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>



-- 
Regards,

Soni Maula Harriz


Re: [GENERAL] postgresql 9.3.10, FIPS mode and DRBG issues.

2016-04-04 Thread Tom Lane
Rodney Lott  writes:
> So, my question is this: In FIPS mode, what would cause the random
> number generation to not initialize?

I remember that Red Hat's version of "FIPS mode" involved crypto features
(including RNGs) just refusing to work in modes deemed inadequately
secure.  So my guess is that psql is trying to configure OpenSSL with some
inadequately-secure settings.  Not sure why it'd be different from the
server though.  Are you sure psql and the libpq it's using are the same
version as the apparently-working server?

regards, tom lane




Re: [GENERAL] postgresql 9.3.10, FIPS mode and DRBG issues.

2016-04-04 Thread Rodney Lott
> > So, my question is this: In FIPS mode, what would cause the random
> > number generation to not initialize?
> 
> I remember that Red Hat's version of "FIPS mode" involved crypto
> features (including RNGs) just refusing to work in modes deemed
> inadequately secure.  So my guess is that psql is trying to configure
> OpenSSL with some inadequately-secure settings.  Not sure why it'd be
> different from the server though.  Are you sure psql and the libpq it's
> using are the same version as the apparently-working server?
> 
>   regards, tom lane

Hi, Tom.

Thanks for the quick reply. I'll look into the settings and see what I can 
find. 
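
As a first pass, these are the server-side SSL settings I plan to look at
(just a sketch of the GUCs that seemed relevant):

  SELECT name, setting
  FROM pg_settings
  WHERE name IN ('ssl', 'ssl_ciphers', 'ssl_renegotiation_limit');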

I double-checked the installed packages and they seem to be from the same
PostgreSQL build (i.e. note my timestamp of 1459281538):

# dpkg -l | grep postgres
ii  postgresql-9.3            9.3.10-0ubuntu0.14.04~et1~fips~2.0.9~1459281538  amd64  object-relational SQL database, version 9.3 server
ii  postgresql-9.3-dbg         9.3.10-0ubuntu0.14.04~et1~fips~2.0.9~1459281538  amd64  debug symbols for postgresql-9.3
ii  postgresql-client-9.3      9.3.10-0ubuntu0.14.04~et1~fips~2.0.9~1459281538  amd64  front-end programs for PostgreSQL 9.3
ii  postgresql-client-common   154-et1~fips~2.0.9~1459281538                    all    manager for multiple PostgreSQL client versions
ii  postgresql-common          154-et1~fips~2.0.9~1459281538                    all    PostgreSQL database-cluster manager
ii  postgresql-contrib-9.3     9.3.10-0ubuntu0.14.04~et1~fips~2.0.9~1459281538  amd64  additional facilities for PostgreSQL
ii  postgresql-json-build      1.1.0-et3                                        amd64  json_build extension for postgresql
ii  postgresql-plpython-9.3    9.3.10-0ubuntu0.14.04~et1~fips~2.0.9~1459281538  amd64  PL/Python procedural language for PostgreSQL 9.3
# dpkg -l | grep libpq
ii  libpq5                     9.3.10-0ubuntu0.14.04~et1~fips~2.0.9~1459281538  amd64  PostgreSQL C client library
# dpkg -S /usr/bin/psql
postgresql-client-common: /usr/bin/psql
# dpkg -S /usr/lib/postgresql/9.3/bin/postgres
postgresql-9.3: /usr/lib/postgresql/9.3/bin/postgres
# psql -h 127.0.0.1 -U postgres -d sslmode=require
psql: SSL SYSCALL error (0): EOF detected, err=5

So, I believe that psql and libpq are from the same version as the currently 
working server. 

Regards, 

Rodney




Re: [GENERAL] How to quote the COALESCE function?

2016-04-04 Thread Kevin Grittner
On Tue, Mar 29, 2016 at 12:38 PM, Pavel Stehule  wrote:

> The coalesce is one of the few functions implemented by a special rule in
> the PostgreSQL parser.

In the SQL standard the COALESCE feature is not listed as a
function; it is listed as one of the short forms of CASE
expression.  While it has function-like syntax, thinking of it as a
function is semantically incorrect.

  COALESCE(a, b)

is supposed to be semantically equivalent to:

  CASE WHEN a is not null THEN a ELSE b END

Among other things, that means that this statement should not
generate a divide by zero error:

  SELECT COALESCE(1, 1/0);
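
For contrast, an ordinary function evaluates its arguments before being
called, so a plain user-defined wrapper with the same arguments does fail
(my_coalesce is only a hypothetical example, not anything built in):

  CREATE FUNCTION my_coalesce(int, int) RETURNS int
    LANGUAGE sql AS 'SELECT CASE WHEN $1 IS NOT NULL THEN $1 ELSE $2 END';

  SELECT COALESCE(1, 1/0);       -- returns 1; the second argument is never evaluated
  SELECT my_coalesce(1, 1/0);    -- ERROR:  division by zero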

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [GENERAL] CORRUPTION on TOAST table

2016-04-04 Thread Soni M
It seems that it was a Postgres bug on the replica; after upgrading the
minor version to 9.1.21 on replica1, the corruption went away.

Thanks everyone for the help




-- 
Regards,

Soni Maula Harriz


[GENERAL] template1 being accessed

2016-04-04 Thread drum.lu...@gmail.com
Hi there.

I'm currently using postgres 9.2.
As you can see below, my "template1" database was being accessed:

[inline image: monitoring graph showing accesses to the template1 database]


That server is a 4-day-old backup DB: it does a gzipped pg_dump (excluding
some tables), plus 4-day-delayed replication using WAL archiving with a
345600s delay, plus file-level copies (excluding the PG DB).

Why was the template1 database being accessed and now it's not anymore?

Cheers
Lucas


Re: [GENERAL] template1 being accessed

2016-04-04 Thread John R Pierce

On 4/4/2016 9:04 PM, drum.lu...@gmail.com wrote:

I'm currently using postgres 9.2.
As you can see below, my "template1" database was being accessed:



500m is half an access per whatever interval that graph is using. Typically,
template1 is accessed when you do a CREATE DATABASE if you don't specify
a different template.
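
For example (a sketch; the database names are placeholders):

  CREATE DATABASE newdb;                        -- clones template1 by default
  CREATE DATABASE otherdb TEMPLATE template0;   -- bypasses template1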


It's also possible some management software uses it as a default place
to connect so it can get a list of databases or whatever.




--
john r pierce, recycling bits in santa cruz





Re: [GENERAL] PG 9.3.12: Replication appears to have worked, but getting error messages in logs

2016-04-04 Thread Michael Paquier
On Sun, Apr 3, 2016 at 2:50 PM, David Caldwell  wrote:
> Hello,
>
> We're using streaming replication. Our technique for spinning up a db
> slave is this:
>
> rsync from master (gross copy)
> pg_start_backup() on server
> rsync from master (correct copy)
> pg_stop_backup()
> drop in recovery.conf into slave directory
> enable hot_standby in slave conf
> start slave
>
> After starting the slave, I'm getting this error every 5 seconds in the log:
>
> ERROR:  requested WAL segment 00010E220005 has already been
> removed

Are you sure that this is from the slave and not from another node
requesting it? If this log entry has just ERROR, it means that the WAL
sender generated it. So don't you have a cascading slave connecting to
this new slave and requesting this WAL segment that has already been
removed?

Should the error come from the WAL receiver, you would see something like this:
FATAL:  could not receive data from WAL stream: ERROR:  requested WAL
segment 0001001C has already been removed
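
An easy way to check for that is to look at pg_stat_replication on the
newly-built slave; any row there is a client streaming from this node, and
walsender errors like the one above would come from serving such a client
(a sketch; column names as of 9.3):

  SELECT pid, application_name, client_addr, state, sent_location
  FROM pg_stat_replication;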
-- 
Michael




Re: [GENERAL] template1 being accessed

2016-04-04 Thread Luca Ferrari
(sorry, back to the list)

On Tue, Apr 5, 2016 at 6:11 AM, John R Pierce  wrote:
> It's also possible some management software might use it as a default place
> to connect so they can get a list of databases or whatever.

This is probably the most common case for continuous access to template1.
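
If you want to see who is doing it, something like this run on that server
should show any sessions sitting in template1 (a sketch; columns as of 9.2):

  SELECT datname, usename, application_name, client_addr
  FROM pg_stat_activity
  WHERE datname = 'template1';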


Luca

