Re: [GENERAL] empty pg_stat_replication when replication works fine?

2016-05-31 Thread Andrej Vanek
  oid  |  rolname  | rolsuper | rolinherit | rolcreaterole | rolcreatedb | rolcatupdate | rolcanlogin | rolreplication | rolconnlimit | rolpassword | rolvaliduntil
-------+-----------+----------+------------+---------------+-------------+--------------+-------------+----------------+--------------+-------------+---------------
    10 | postgres  | t        | t          | t             | t           | t            | t           | t              |           -1 |             |
 29732 | xxusrrole | f        | t          | f             | f           | f            | f           | f              |           -1 |             |
 29733 | xxadmrole | f        | t          | f             | f           | f            | f           | f              |           -1 |             |
 29734 | xxschema  | f        | t          | f             | f           | f            | t           | f              |           -1 |             |
 29735 | xxadm     | f        | t          | f             | f           | f            | t           | f              |           -1 | x           |
 29737 | pgbackup  | f        | t          | f             | f           | f            | t           | f              |           -1 |             |
 29738 | pgmonitor | f        | t          | f             | f           | f            | t           | f              |           -1 | x           |
 29739 | pgreplic  | f        | t          | f             | f           | f            | t           | t              |           -1 |             |
 29736 | CTSYSTEM  | f        | t          | f             | f           | f            | t           | f              |           -1 | xx          |
(9 rows)
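
The listing above is presumably the output of something like this (the exact
query is an assumption on my side):

# psql -U postgres -c "select oid, * from pg_authid;"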





2016-05-25 23:22 GMT+02:00 Andrej Vanek :

> Streaming replication set-up,
>
> one master, 3 slaves connecting to it.
> I expected ps -ef to show all wal-sender processes and the SAME information
> I'd get via select * from pg_stat_replication.
> Instead I observed:
> - pg_stat_replication is empty
> - 3 wal-sender processes up and running
> - each slave has wal-receiver process running
> - replication works (tried to create a table- it appears in all databases)
> Question:
> - why is pg_stat_replication empty?
>
> Andrej
> ---details
> [root@l2bmain ~]# tail /opt/pg_data/postgresql.conf
> max_wal_senders = 5
> hot_standby = on
> wal_keep_segments = 128
> archive_command = '/opt/postgres/dbconf/archive_command.sh %p %f'
> wal_receiver_status_interval = 2
> max_standby_streaming_delay = -1
> max_standby_archive_delay = -1
> restart_after_crash = off
> hot_standby_feedback = on
> wal_sender_timeout = 1min
> [root@l2bmain ~]# ps f -fu postgres
> UID        PID  PPID  C STIME TTY      STAT   TIME CMD
> postgres 10797     1  0 15:53 ?        S      0:20 /usr/pgsql-9.3/bin/postgres -D /opt/pg_data -c config_file=/opt/pg_data//postgresql.conf
> postgres 10820 10797  0 15:53 ?        Ss     0:00  \_ postgres: logger process
> postgres 10823 10797  0 15:53 ?        Ss     0:00  \_ postgres: checkpointer process
> postgres 10824 10797  0 15:53 ?        Ss     0:00  \_ postgres: writer process
> postgres 10825 10797  0 15:53 ?        Ss     0:00  \_ postgres: wal writer process
> postgres 10826 10797  0 15:53 ?        Ss     0:01  \_ postgres: autovacuum launcher process
> postgres 10827 10797  0 15:53 ?        Ss     0:00  \_ postgres: archiver process   last was 000100A3.0028.backup
> postgres 10828 10797  0 15:53 ?        Ss     0:03  \_ postgres: stats collector process
> postgres 11286 10797  0 15:54 ?        Ss     0:08  \_ postgres: wal sender process pgreplic 192.168.204.12(55231) streaming 0/A401BED8
> postgres 11287 10797  0 15:54 ?        Ss     0:06  \_ postgres: wal sender process pgreplic 192.168.204.11(42937) streaming 0/A401BED8
> postgres 19322 10797  0 15:58 ?        Ss     0:08  \_ postgres: wal sender process pgreplic 192.168.101.11(52379) streaming 0/A401BED8
> postgres 28704 10797  0 18:44 ?        Ss     0:00  \_ postgres: CTSYSTEM lidb 192.168.102.13(58245) idle
> postgres  7256 10797  0 18:52 ?        Ss     0:00  \_ postgres: CTSYSTEM lidb 192.168.102.23(55190) idle
> postgres  8667 10797  0 18:53 ?        Ss     0:00  \_ postgres: CTSYSTEM lidb 192.168.102.13(58287) idle
> [root@l2bmain ~]# psql -U postgres -c "select * from pg_stat_replication;"
>  pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
> -----+----------+---------+------------------+-------------+-----------------+-------------+---------------+-------+---------------+----------------+----------------+-----------------+---------------+------------
> (0 rows)
>
> [ro

[GENERAL] pg_basebackup: return value 1: reason?

2016-04-15 Thread Andrej Vanek
Hello,

I tried to run pg_basebackup. Return value is 1.

How do I find out the reason?
(I suspect that some WAL after the backup is missing - but how do I find out
the real reason? How do I fix it?)

thanks, Andrej
--details:
environment: CentOS 6.7, postgres 9.5.1
( PostgreSQL 9.5.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7
20120313 (Red Hat 4.4.7-16), 64-bit)

I tried 2 forms of pg_basebackup (-X fetch and -X stream). Both were issued
from a script:
# su - postgres -c "/usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP} -D
${GEO_STDBY_DATA} -U pgreplic -P -v -X fetch" 2>${LOG_FILE}.stderr  >>
${LOG_FILE}
# echo $?
1 <--pg_basebackup failed!
# cat log.stderr
# cat /var/log/cluster/geo_repair.log.err
transaction log start point: 0/E328 on timeline 1
WARNING:  skipping special file "./pg_hba.conf"
WARNING:  skipping special file "./pg_hba.conf.save"
transaction log end point: 0/E3F8
pg_basebackup: base backup completed   <-- no reason for pg_basebackup failure!
# cp /tmp/pg_hba.conf /tmp/postgresql.conf /pg_data/
# su - postgres -c "/usr/pgsql-9.5/bin/pg_ctl -D /pg_data/ start"
# tail /pg_data/pg_log/postgresql-Fri.log
`pg_xlog/000100E2' ->
`../backups/arc/000100E2'
2016-04-15 23:15:10 CEST:pgreplic@[unknown]:[10667] WARNING:  skipping
special file "./pg_hba.conf"
2016-04-15 23:15:10 CEST:pgreplic@[unknown]:[10667] WARNING:  skipping
special file "./pg_hba.conf.save"   <--- recorded in
pg_log on master node and copied by pg_basebackup (note time difference
between two servers)
2016-04-15 23:15:02 CEST:@:[23321] LOG:  database system was interrupted;
last known up at 2016-04-15 23:15:10 CEST
2016-04-15 23:15:02 CEST:postgres@postgres:[23329] FATAL:  the database
system is starting up
2016-04-15 23:15:03 CEST:@:[23321] LOG:  entering standby mode
2016-04-15 23:15:03 CEST:@:[23321] LOG:  database system was not properly
shut down; automatic recovery in progress   <-- something missing from
pg_basebackup
2016-04-15 23:15:03 CEST:@:[23321] LOG:  redo starts at 0/E328
2016-04-15 23:15:03 CEST:@:[23321] LOG:  consistent recovery state reached
at 0/E400
2016-04-15 23:15:03 CEST:@:[23295] LOG:  database system is ready to accept
read only connections
2016-04-15 23:15:03 CEST:@:[23356] LOG:  started streaming WAL from primary
at 0/E400 on timeline 1
---second trial
# su - postgres -c "/usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP} -D
${GEO_STDBY_DATA} -U pgreplic -P -v -X stream"
# echo $?
1
#  cat /var/log/cluster/geo_repair.log.err
transaction log start point: 0/E528 on timeline 1
pg_basebackup: starting background WAL receiver
WARNING:  skipping special file "./pg_hba.conf"
WARNING:  skipping special file "./pg_hba.conf.save"
transaction log end point: 0/E5F8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: could not wait for child process: No child processes
 
`../backups/arc/000100E4'
2016-04-15 23:35:09 CEST:pgreplic@[unknown]:[29035] WARNING:  skipping
special file "./pg_hba.conf"
2016-04-15 23:35:09 CEST:pgreplic@[unknown]:[29035] WARNING:  skipping
special file "./pg_hba.conf.save"
2016-04-15 23:35:01 CEST:@:[28926] LOG:  database system was interrupted;
last known up at 2016-04-15 23:35:09 CEST
2016-04-15 23:35:01 CEST:postgres@postgres:[28938] FATAL:  the database
system is starting up
2016-04-15 23:35:02 CEST:@:[28926] LOG:  entering standby mode
2016-04-15 23:35:02 CEST:@:[28926] LOG:  database system was not properly
shut down; automatic recovery in progress  
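
For reference, a minimal sketch of the invocation at the top that separates
pg_basebackup's own exit status from that of su and the redirections around
it (/tmp/pgbb.rc is a hypothetical scratch file):

# su - postgres -c "/usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP} \
    -D ${GEO_STDBY_DATA} -U pgreplic -P -v -X fetch; echo \$? > /tmp/pgbb.rc" \
    2>${LOG_FILE}.stderr >> ${LOG_FILE}
# echo "su exit: $?, pg_basebackup exit: $(cat /tmp/pgbb.rc)"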

Re: [GENERAL] pg_basebackup: return value 1: reason?

2016-04-16 Thread Andrej Vanek
Hello,

my setup is:
1 master
1 synchronous slave (up and running)
2 asynchronous slaves (up and running)
Now I'm setting up an asynchronous slave: the first step of this setup is
pg_basebackup connecting to the master.

The indirections are there because the command is encapsulated in a script.
This script is meant to provide automated replication recovery and is
launched by the crm_mon -d daemon.

You are right to suggest checking without indirections: I have now run the
command directly from the command line, without variables - it works fine...

This means: no problem in pg_basebackup itself,

but some problem in the environment of the process launched from the crm_mon
daemon. Thanks for your hint (remove indirections). pg_basebackup works fine
when launched from the command line.
Unfortunately I still have no clue how to solve the issue - su works. I have
to find out why pg_basebackup cannot fork when launched from crm_mon.
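
A sketch of what I plan to compare between the two environments (hypothetical
dump path; the same block once at the top of the script crm_mon launches,
once in an interactive shell, then diff the two dumps):

{ id; env | sort; ulimit -a; grep -E '^Sig(Blk|Ign|Cgt)' /proc/$$/status; } \
    > /tmp/env.$$ 2>&1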

Best Regards, Andrej

2016-04-16 1:17 GMT+02:00 Adrian Klaver :

> On 04/15/2016 03:28 PM, Andrej Vanek wrote:
>
>> Hello,
>>
>> I tried to run pg_basebackup. Return value is 1.
>>
>> How do I find out the reason?
>> (I suspect that some WAL after the backup is missing - but how do I find
>> out the real reason? How do I fix it?)
>>
>
> First it is not clear to me where you are taking the backup from, the
> master or the standby?
>
> Second there is a lot of redirection going on. What happens if you run the
> pg_basebackup directly (without doing  su - postgres ...) and use hardcoded
> values instead of shell variables?
>
>
>
>> thanks, Andrej
>> --details:
>> environment: CentOS 6.7, postgres 9.5.1
>> ( PostgreSQL 9.5.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7
>> 20120313 (Red Hat 4.4.7-16), 64-bit)
>>
>> I tried 2 forms of pg_basebackup (-X fetch and -X stream). Both were
>> issued from a script:
>> # su - postgres -c "/usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP}
>> -D ${GEO_STDBY_DATA} -U pgreplic -P -v -X fetch" 2>${LOG_FILE}.stderr
>>   >> ${LOG_FILE}
>> # echo $?
>> 1 <--pg_basebackup failed!
>> # cat log.stderr
>> # cat /var/log/cluster/geo_repair.log.err
>> transaction log start point: 0/E328 on timeline 1
>> WARNING:  skipping special file "./pg_hba.conf"
>> WARNING:  skipping special file "./pg_hba.conf.save"
>> transaction log end point: 0/E3F8
>> pg_basebackup: base backup completed   <-- no reason for pg_basebackup failure!
>> # cp /tmp/pg_hba.conf /tmp/postgresql.conf /pg_data/
>> # su - postgres -c "/usr/pgsql-9.5/bin/pg_ctl -D /pg_data/ start"
>> # tail /pg_data/pg_log/postgresql-Fri.log
>> `pg_xlog/000100E2' ->
>> `../backups/arc/000100E2'
>> 2016-04-15 23:15:10 CEST:pgreplic@[unknown]:[10667] WARNING:  skipping
>> special file "./pg_hba.conf"
>> 2016-04-15 23:15:10 CEST:pgreplic@[unknown]:[10667] WARNING:  skipping
>> special file "./pg_hba.conf.save"   <--- recorded in
>> pg_log on master node and copied by pg_basebackup (note time difference
>> between two servers)
>> 2016-04-15 23:15:02 CEST:@:[23321] LOG:  database system was
>> interrupted; last known up at 2016-04-15 23:15:10 CEST
>> 2016-04-15 23:15:02 CEST:postgres@postgres:[23329] FATAL:  the database
>> system is starting up
>> 2016-04-15 23:15:03 CEST:@:[23321] LOG:  entering standby mode
>> 2016-04-15 23:15:03 CEST:@:[23321] LOG:  database system was not
>> properly shut down; automatic recovery in progress   <-- something
>> missing from pg_basebackup
>> 2016-04-15 23:15:03 CEST:@:[23321] LOG:  redo starts at 0/E328
>> 2016-04-15 23:15:03 CEST:@:[23321] LOG:  consistent recovery state
>> reached at 0/E400
>> 2016-04-15 23:15:03 CEST:@:[23295] LOG:  database system is ready to
>> accept read only connections
>> 2016-04-15 23:15:03 CEST:@:[23356] LOG:  started streaming WAL from
>> primary at 0/E400 on timeline 1
>> ---second trial
>> # su - postgres -c "/usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP}
>> -D ${GEO_STDBY_DATA} -U pgreplic -P -v -X stream"
>> # echo $?
>> 1
>> #  cat /var/log/cluster/geo_repair.log.err
>> transaction log start point: 0/E528 on timeline 1
>> pg_basebackup: starting background WAL receiver
>> WARNING:  skipping special file "./pg_hba.conf"
>> WARNING:  skipping special file "./pg_hba.conf.save"
>> transaction log end point: 0/E5F8
>> pg_basebackup: waiting for background process to finish streaming ...
>> 

Re: [GENERAL] pg_basebackup: return value 1: reason?

2016-04-16 Thread Andrej Vanek
Hello,

> Are you not getting something informative on your display and/or sending
> stderr to somewhere else?

- in the case of -X fetch: no informative output
- in the case of -X stream: the following:
pg_basebackup: could not wait for child process: No child processes

Now, after a re-test directly from the command line, it works. It fails only
when launched from a script fired by crm_mon -d -E my-script

Regards, Andrej

2016-04-16 1:18 GMT+02:00 Jerry Sievers :

> Andrej Vanek  writes:
>
> > Hello,
> >
> > I tried to run pg_basebackup. Return value is 1.
> >
> > How do I find out the reason?
> > (I suspect that some WAL after the backup is missing - but how do I find
> > out the real reason? How do I fix it?)
>
> Well there are more than 80 cases of exit 1 in a couple .c files and
> most/all are, unsurprisingly, preceded by an fprintf(stderr.
>
> Are you not getting something informative on your display and/or sending
> stderr to somewhere else?
>
> $ echo $*
> ./src/bin/pg_basebackup/pg_basebackup.c
> ./src/backend/replication/basebackup.c
>
> $ grep -h -B 4 -i 'exit.*(1' $* | egrep -hi 'fprintf\(stderr|exit'
> fprintf(stderr, _("%s: directory name too
> long\n"), progname);
> exit(1);
> fprintf(stderr, _("%s: multiple \"=\"
> signs in tablespace mapping\n"), progname);
> exit(1);
> fprintf(stderr,
> exit(1);
> fprintf(stderr, _("%s: old directory is not an absolute
> path in tablespace mapping: %s\n"),
> exit(1);
> fprintf(stderr, _("%s: new directory is not an absolute
> path in tablespace mapping: %s\n"),
> exit(1);
> fprintf(stderr, _("%s: could not read from
> ready pipe: %s\n"),
> exit(1);
> fprintf(stderr,
> exit(1);
> fprintf(stderr,
> disconnect_and_exit(1);
> fprintf(stderr,
> disconnect_and_exit(1);
> exit(1);
> fprintf(stderr,
> disconnect_and_exit(1);
> fprintf(stderr, _("%s: could not create background
> process: %s\n"),
> disconnect_and_exit(1);
> fprintf(stderr, _("%s: could not create background thread:
> %s\n"),
> disconnect_and_exit(1);
> fprintf(stderr,
> disconnect_and_exit(1);
> fprintf(stderr,
> disconnect_and_exit(1);
> fprintf(stderr, _("%s: could not access directory
> \"%s\": %s\n"),
> disconnect_and_exit(1);
> fprintf(stderr,
> exit(1);
> fprintf(stderr,
> exit(1);
> fprintf(stderr, _("%s: transfer rate must be greater than
> zero\n"),
> exit(1);
> fprintf(stderr,
> exit(1);
> fprintf(stderr,
> exit(1);
> fprintf(stderr,
> exit(1);
> fprintf(stderr,
> disconnect_and_exit(1);
> fprintf(stderr, _("%s: could not write to file
> \"%s\": %s\n"),
> disconnect_and_exit(1);
> fprintf(stderr,
> disconnect_and_exit(1);
> fprintf(stderr,
> disconnect_and_exit(1);
> fprintf(stderr,
> disconnect_and_exit(1);
> fprintf(stderr,
> disconnect_and_exit(1);
> fprintf(stderr, _("%s: could not create file
> \"%s\": %s\n"),
> disconnect_and_exit(1);
> fprintf(stderr, _("%s: could not get COPY data stream:
> %s"),
> disconnect_and_exit(1);
> fprintf(stderr,
> disconnect_and_exit(1);
> fprintf(stderr,
> disconnect_and_exit(1);
> fprintf(stderr, _("

Re: [GENERAL] pg_basebackup: return value 1: reason?

2016-04-17 Thread Andrej Vanek
Hello Adrian,

I tried to use -U without "su" - launched directly by root: same behaviour.
Finally I reverted my script to use a standard backup (pg_start_backup;
rsync; pg_stop_backup) - this works - the only downside is possible
collisions with the on-line backup/synchronization of the other two nodes on
the master node...

Back to the pg_basebackup issue: it is clear to me that this is an issue of
the environment that launched pg_basebackup.
Possibly either some privileges or some kernel parameters/limits. Who knows?
Summary: clusterlabs' crm_mon launched a shell script starting
pg_basebackup, which fails to do some of its work (pg_basebackup: could not
wait for child process: No child processes) - probably due to some failing
system call.

How can I report this to clusterlabs: what system call fails in
pg_basebackup?
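
A sketch of how that could be captured (assuming strace is available;
-e trace=process limits the trace to fork/exec/wait and friends):

# strace -f -e trace=process -o /tmp/pgbb.strace \
    /usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP} -D ${GEO_STDBY_DATA} \
    -U pgreplic -P -v -X stream
# grep -E 'wait4|ECHILD|SIGCHLD' /tmp/pgbb.strace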

Best Regards, Andrej



2016-04-17 1:09 GMT+02:00 Adrian Klaver :

>
> Is the su - even necessary?
>
> pg_basebackup is a Postgres client program you can specify the user you
> want it to connect to using -U.
>
> Or do you need the script to run as postgres in order to get permissions
> on wherever you are creating the backup directory?
>
> have to find out why pg_basebackup cannot fork when launched from crm_mon.
>>
>
>
> I assume crm_mon is this:
>
> http://linux.die.net/man/8/crm_mon
>
> from Pacemaker.
>
> I do not use Pacemaker, but I am pretty sure that running what is a
> monitoring program in daemon mode and then shelling out to another program
> is not workable. The docs seem to bear this out:
>
> http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster#Installation
>
>
> https://github.com/smbambling/pgsql_ha_cluster/wiki/Building-A-Highly-Available-Multi-Node-PostgreSQL-Cluster
>
>>
>>


Re: [GENERAL] pg_basebackup: return value 1: reason?

2016-05-23 Thread Andrej Vanek
Hello,
I've given it another try.
Two variants used in my script (launched by crm_mon):
1. /usr/pgsql-9.5/bin/pg_basebackup -U pgreplic -h db-other-site -w -D
/opt/geo_stdby_data -c fast -vvv -X stream &>> /tmp/log
2. strace -o /tmp/pg_basebackup.log /usr/pgsql-9.5/bin/pg_basebackup -U
pgreplic -h db-other-site -w -D /opt/geo_stdby_data -c fast -vvv -X stream
&>> /tmp/log

Result:
variant 2. works fine with return code 0 (with strace)
variant 1. fails with error code 1 (without strace)

Any ideas?

Andrej
--details
Output:
Variant 2:
DEBUG:  CommitTransaction
DEBUG:  name: unnamed; blockState:   STARTED; state: INPROGR,
xid/subid/cid: 0/1/0, nestlvl: 1, children:
DEBUG:  received replication command: IDENTIFY_SYSTEM
DEBUG:  received replication command: BASE_BACKUP LABEL 'pg_basebackup base
backup'   FAST NOWAIT
-- Mon May 23 17:54:31 CEST 2016 [l1abrnch->l1abrnch:3122/27282:GEO]
--INFO-- l1abrnch->l1abrnch (GEO-STDBY-DB / stop: 0): target/returned 0/0
(OK)
transaction log start point: 0/FA28 on timeline 1
pg_basebackup: starting background WAL receiver
DEBUG:  CommitTransaction
DEBUG:  name: unnamed; blockState:   STARTED; state: INPROGR,
xid/subid/cid: 0/1/0, nestlvl: 1, children:
DEBUG:  received replication command: IDENTIFY_SYSTEM
DEBUG:  received replication command: START_REPLICATION 0/FA00 TIMELINE
1
WARNING:  skipping special file "./pg_hba.conf"
DEBUG:  standby "pg_basebackup" has now caught up with primary
DEBUG:  write 0/FA00 flush 0/0 apply 0/0
DEBUG:  removing transaction log backup history file
"000100F8.0028.backup"
transaction log end point: 0/FAF8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: base backup completed
RETVAL=0

Output:
Variant 1:
DEBUG:  CommitTransaction
DEBUG:  name: unnamed; blockState:   STARTED; state: INPROGR,
xid/subid/cid: 0/1/0, nestlvl: 1, children:
DEBUG:  received replication command: IDENTIFY_SYSTEM
DEBUG:  received replication command: BASE_BACKUP LABEL 'pg_basebackup base
backup'   FAST NOWAIT
-- Mon May 23 17:55:32 CEST 2016 [l1abrnch->l1abrnch:3122/28785:GEO]
--INFO-- l1abrnch->l1abrnch (GEO-STDBY-DB / stop: 0): target/returned 0/0
(OK)
transaction log start point: 0/FC28 on timeline 1
pg_basebackup: starting background WAL receiver
DEBUG:  CommitTransaction
DEBUG:  name: unnamed; blockState:   STARTED; state: INPROGR,
xid/subid/cid: 0/1/0, nestlvl: 1, children:
DEBUG:  received replication command: IDENTIFY_SYSTEM
DEBUG:  received replication command: START_REPLICATION 0/FC00 TIMELINE
1
WARNING:  skipping special file "./pg_hba.conf"
DEBUG:  standby "pg_basebackup" has now caught up with primary
DEBUG:  write 0/FC00 flush 0/0 apply 0/0
DEBUG:  removing transaction log backup history file
"000100FA.0028.backup"
transaction log end point: 0/FCF8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: could not wait for child process: No child processes
RETVAL=1
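
One guess at the difference, offered as a hypothesis only: "could not wait
for child process: No child processes" is waitpid() failing with ECHILD, and
a classic cause of that on Linux is SIGCHLD being set to SIG_IGN in the
calling process (children are then reaped automatically), a disposition
pg_basebackup would inherit from crm_mon; running under strace may perturb
exactly this kind of signal/child handling, which would explain the
heisenbug. A bash sketch to check from inside the script (Linux x86_64
numbering assumed: SIGCHLD = 17, i.e. bit 16 of the SigIgn mask):

# read the hex SigIgn mask of the current shell and test the SIGCHLD bit
sigign=$(awk '/^SigIgn/ {print $2}' /proc/$$/status)
if [ $(( 0x$sigign >> 16 & 1 )) -eq 1 ]; then
    echo "SIGCHLD is SIG_IGN in this environment"
fi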




2016-04-18 16:12 GMT+02:00 Adrian Klaver :

> On 04/17/2016 12:13 PM, Andrej Vanek wrote:
>
>> Hello Adrian,
>>
>> I tried to use -U without "su"- launched directly by root: same behaviour.
>> Finally I reverted my script to use standard backup (pg_start_backup;
>> rsync; pg_stop_backup)- this works- the only downside is possible
>> collisions with on-line backup/synchronization of the other two nodes on
>> master node...
>>
>> Back to the pg_basebackup issue: it is clear to me that this is an issue
>> of environment which launched pg_basebackup.
>> Possibly either some privileges or some kernel parameters/limits. Who
>> knows?
>> Summary: clusterlab's crm_mon launched a shell script starting
>> pg_basebackup, which fails to do some of its work (pg_basebackup: could not
>> wait for child process: No child processes)- probably due to some
>> failing system call.
>>
>> How can I report to clusterlabs: What system call fails in pg_basebackup?
>>
>
> All I can to do is point you at:
>
>
> https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD
>
>
>> Best Regards, Andrej
>>
>>
>>
>> 2016-04-17 1:09 GMT+02:00 Adrian Klaver > <mailto:adrian.kla...@aklaver.com>>:
>>
>>
>> Is the su - even necessary?
>>
>> pg_basebackup is a Postgres client program you can specify the user
>> you want it to connect to using -U.
>>
>> Or do you need the script to run as postgres in order to get
>> permissions on wherever you are creating the backup directory?
>>
>> have to find out why pg_basebackup cannot fork w

[GENERAL] empty pg_stat_replication when replication works fine?

2016-05-25 Thread Andrej Vanek
Streaming replication set-up,

one master, 3 slaves connecting to it.
I expected ps -ef to show all wal-sender processes and the SAME information
I'd get via select * from pg_stat_replication.
Instead I observed:
- pg_stat_replication is empty
- 3 wal-sender processes up and running
- each slave has wal-receiver process running
- replication works (tried to create a table- it appears in all databases)
Question:
- why is pg_stat_replication empty?

Andrej
---details
[root@l2bmain ~]# tail /opt/pg_data/postgresql.conf
max_wal_senders = 5
hot_standby = on
wal_keep_segments = 128
archive_command = '/opt/postgres/dbconf/archive_command.sh %p %f'
wal_receiver_status_interval = 2
max_standby_streaming_delay = -1
max_standby_archive_delay = -1
restart_after_crash = off
hot_standby_feedback = on
wal_sender_timeout = 1min
[root@l2bmain ~]# ps f -fu postgres
UID        PID  PPID  C STIME TTY      STAT   TIME CMD
postgres 10797     1  0 15:53 ?        S      0:20 /usr/pgsql-9.3/bin/postgres -D /opt/pg_data -c config_file=/opt/pg_data//postgresql.conf
postgres 10820 10797  0 15:53 ?        Ss     0:00  \_ postgres: logger process
postgres 10823 10797  0 15:53 ?        Ss     0:00  \_ postgres: checkpointer process
postgres 10824 10797  0 15:53 ?        Ss     0:00  \_ postgres: writer process
postgres 10825 10797  0 15:53 ?        Ss     0:00  \_ postgres: wal writer process
postgres 10826 10797  0 15:53 ?        Ss     0:01  \_ postgres: autovacuum launcher process
postgres 10827 10797  0 15:53 ?        Ss     0:00  \_ postgres: archiver process   last was 000100A3.0028.backup
postgres 10828 10797  0 15:53 ?        Ss     0:03  \_ postgres: stats collector process
postgres 11286 10797  0 15:54 ?        Ss     0:08  \_ postgres: wal sender process pgreplic 192.168.204.12(55231) streaming 0/A401BED8
postgres 11287 10797  0 15:54 ?        Ss     0:06  \_ postgres: wal sender process pgreplic 192.168.204.11(42937) streaming 0/A401BED8
postgres 19322 10797  0 15:58 ?        Ss     0:08  \_ postgres: wal sender process pgreplic 192.168.101.11(52379) streaming 0/A401BED8
postgres 28704 10797  0 18:44 ?        Ss     0:00  \_ postgres: CTSYSTEM lidb 192.168.102.13(58245) idle
postgres  7256 10797  0 18:52 ?        Ss     0:00  \_ postgres: CTSYSTEM lidb 192.168.102.23(55190) idle
postgres  8667 10797  0 18:53 ?        Ss     0:00  \_ postgres: CTSYSTEM lidb 192.168.102.13(58287) idle
[root@l2bmain ~]# psql -U postgres -c "select * from pg_stat_replication;"
 pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
-----+----------+---------+------------------+-------------+-----------------+-------------+---------------+-------+---------------+----------------+----------------+-----------------+---------------+------------
(0 rows)

[root@l2bmain ~]# tail /opt/pg_data/pg_log/postgresql-Wed.log
2016-05-25 15:53:56 CEST:@:[8603] LOG:  database system is shut down
2016-05-25 15:53:58 CEST:@:[10821] LOG:  database system was shut down in
recovery at 2016-05-25 15:53:56 CEST
2016-05-25 15:53:58 CEST:@:[10821] LOG:  database system was not properly
shut down; automatic recovery in progress
2016-05-25 15:53:58 CEST:@:[10821] LOG:  consistent recovery state reached
at 0/A290
2016-05-25 15:53:58 CEST:@:[10821] LOG:  record with zero length at
0/A290
2016-05-25 15:53:58 CEST:@:[10821] LOG:  redo is not required
2016-05-25 15:53:58 CEST:@:[10821] LOG:  MultiXact member wraparound
protections are now enabled
2016-05-25 15:53:58 CEST:@:[10797] LOG:  database system is ready to accept
connections
2016-05-25 15:53:58 CEST:@:[10826] LOG:  autovacuum launcher started
[root@l2bmain ~]# ssh 192.168.101.11
Last login: Wed May 25 22:48:18 2016 from 192.168.101.12
[root@l2amain ~]# ps f -fu postgres
UID        PID  PPID  C STIME TTY      STAT   TIME CMD
postgres  5730     1  0 15:58 ?        S      0:04 /usr/pgsql-9.3/bin/postgres -D /opt/pg_data -c config_file=/opt/pg_data//postgresql.conf
postgres  5754  5730  0 15:58 ?        Ss     0:00  \_ postgres: logger process
postgres  5755  5730  0 15:58 ?        Ss     0:00  \_ postgres: startup process   recovering 000100A4
postgres  5773  5730  0 15:58 ?        Ss     0:12  \_ postgres: wal receiver process   streaming 0/A401C030
postgres  5774  5730  0 15:58 ?        Ss     0:00  \_ postgres: checkpointer process
postgres  5775  5730  0 15:58 ?        Ss     0:00  \_ postgres: writer process
postgres  5776  5730  0 15:58 ?        Ss     0:00  \_ postgres: stats collector process
[root@l2amain ~]# psql -U postgres -c "select pg_is_in_recovery();"
 pg_is_in_recovery
---
 t
(1 row)
[root@l2bmain ~]# ssh 192.168.204.11
Warning: Permanently added '192.168.204.11' (RSA) to the list of known
hosts.
Last login: Wed May 25 16:28:49 2016 from 192.168.200.254
[root@l2abrnch ~]# ps f -fu pos

Re: [GENERAL] empty pg_stat_replication when replication works fine?

2016-05-26 Thread Andrej Vanek
The instance is still running; I tried to collect more information from it:

All databases are working as expected;
the only issue is that the monitoring SQL views (pg_stat_activity,
pg_stat_replication) are not working as expected (they do not reflect the
postgres process list seen from the command line).

on Master:
- pg_stat_activity is empty as well (the client sessions can be seen just in
the ps f -fu postgres output: the CTSYSTEM lines)
- psql as postgres: select * from pg_stat_activity sees only its own session
- psql as an unprivileged user (CTSYSTEM): select * from pg_stat_activity is
empty
- replication works fine (created a table, that was created also on all
replicas)
- added the following lines to postgresql.conf + reloaded the configuration
(reload sketch after this list):
client_min_messages = debug5
log_min_messages = debug5
log_min_error_statement = debug5
- activity seen in pg_log, also replication activity (pgreplic user) is
seen, still nothing in pg_stat_replication/pg_stat_activity
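
For reference, the reload step can be done either way (a sketch):

# psql -U postgres -c "select pg_reload_conf();"
# /usr/pgsql-9.3/bin/pg_ctl -D /opt/pg_data reload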

killed one slave postgres instance, restarted it
- "standby "l2abrnch" has now caught up with primary"
- replication works fine
- no entries on Master in pg_stat_replication
- ps -ef shows the new wal-sender process on master and wal-receiver
process streaming on this slave

Version is:
PostgreSQL 9.3.10 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7
20120313 (Red Hat 4.4.7-16), 64-bit

I suspect something happened within the master server
(pg_stat_activity + pg_stat_replication are not working as described: they
do not reflect the ps -ef list of postgres processes and the running SQL
client/replication information).

What may be additionally useful information before restarting the master?
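
One thing that might narrow it down (a sketch): in 9.3 the
pg_stat_replication view is a join of pg_stat_get_activity() and
pg_stat_get_wal_senders(), so if the activity half is broken, the join comes
back empty even though wal senders are running. The two halves can be
queried directly:

# psql -U postgres -c "select * from pg_stat_get_wal_senders();"
# psql -U postgres -c "select * from pg_stat_get_activity(NULL);"

If the wal-senders call returns rows while the activity call returns none,
the problem is in the backend-status (stats) side, not in replication.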

Regards, Andrej

2016-05-25 23:22 GMT+02:00 Andrej Vanek :

> Streaming replication set-up,
>
> one master, 3 slaves connecting to it.
> I expected ps -ef to show all wal-sender processes and the SAME information
> I'd get via select * from pg_stat_replication.
> Instead I observed:
> - pg_stat_replication is empty
> - 3 wal-sender processes up and running
> - each slave has wal-receiver process running
> - replication works (tried to create a table- it appears in all databases)
> Question:
> - why is pg_stat_replication empty?
>
> Andrej
> ---details
> [root@l2bmain ~]# tail /opt/pg_data/postgresql.conf
> max_wal_senders = 5
> hot_standby = on
> wal_keep_segments = 128
> archive_command = '/opt/postgres/dbconf/archive_command.sh %p %f'
> wal_receiver_status_interval = 2
> max_standby_streaming_delay = -1
> max_standby_archive_delay = -1
> restart_after_crash = off
> hot_standby_feedback = on
> wal_sender_timeout = 1min
> [root@l2bmain ~]# ps f -fu postgres
> UID        PID  PPID  C STIME TTY      STAT   TIME CMD
> postgres 10797     1  0 15:53 ?        S      0:20 /usr/pgsql-9.3/bin/postgres -D /opt/pg_data -c config_file=/opt/pg_data//postgresql.conf
> postgres 10820 10797  0 15:53 ?        Ss     0:00  \_ postgres: logger process
> postgres 10823 10797  0 15:53 ?        Ss     0:00  \_ postgres: checkpointer process
> postgres 10824 10797  0 15:53 ?        Ss     0:00  \_ postgres: writer process
> postgres 10825 10797  0 15:53 ?        Ss     0:00  \_ postgres: wal writer process
> postgres 10826 10797  0 15:53 ?        Ss     0:01  \_ postgres: autovacuum launcher process
> postgres 10827 10797  0 15:53 ?        Ss     0:00  \_ postgres: archiver process   last was 000100A3.0028.backup
> postgres 10828 10797  0 15:53 ?        Ss     0:03  \_ postgres: stats collector process
> postgres 11286 10797  0 15:54 ?        Ss     0:08  \_ postgres: wal sender process pgreplic 192.168.204.12(55231) streaming 0/A401BED8
> postgres 11287 10797  0 15:54 ?        Ss     0:06  \_ postgres: wal sender process pgreplic 192.168.204.11(42937) streaming 0/A401BED8
> postgres 19322 10797  0 15:58 ?        Ss     0:08  \_ postgres: wal sender process pgreplic 192.168.101.11(52379) streaming 0/A401BED8
> postgres 28704 10797  0 18:44 ?        Ss     0:00  \_ postgres: CTSYSTEM lidb 192.168.102.13(58245) idle
> postgres  7256 10797  0 18:52 ?        Ss     0:00  \_ postgres: CTSYSTEM lidb 192.168.102.23(55190) idle
> postgres  8667 10797  0 18:53 ?        Ss     0:00  \_ postgres: CTSYSTEM lidb 192.168.102.13(58287) idle
> [root@l2bmain ~]# psql -U postgres -c "select * from pg_stat_replication;"
>  pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
> -----+----------+---------+------------------+-------------+-----------------+-------------+---------------+-------+---------------+----------------+----------------+-----------------+---------------+------------

[GENERAL] hidden junk files in ...data/base/oid/

2014-05-27 Thread Andrej Vanek
Hello,

I ran an upgrade from 9.1 to 9.3 on CentOS using pg_upgrade and the database
cluster size dropped from 77GB to 4.3GB. I wonder how this could happen. No
data was lost. This means I had about 70GB of junk files in my database...

I checked the top 10 biggest database tables: nearly the same size on the
old/new cluster; the biggest table is about 2GB, the next 1GB, all other
tables smaller and smaller...
I checked the files of the biggest database (/var/lib/pgsql/9.1/data/base/27610):

# du  -sm * |awk '{s+=$1} END {print s}'
7107
# du  -sm . |sort -n
75264   .

So the size of the non-hidden database files is about 1/10th of the whole
database size.
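
A sketch of how to list just the hidden leftovers by size (GNU find assumed):

# find /var/lib/pgsql/9.1/data/base/27610 -maxdepth 1 -type f -name '.*' \
      -printf '%s\t%p\n' | sort -rn | head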

Question: Where do the 70GB of hidden files in ...data/base/oid/ come
from? Which postgres process could generate them? Some missed maintenance on
my side? A bug? Has anybody else experienced such an issue?

Thanks, Andrej

-details: ...data/base/oid file listing shortened:

# ls -la
total 77069960
drwx------. 2 postgres postgres      32768 May 12 18:50 .
drwxr-xr-x. 5 postgres postgres       4096 May 12 11:47 ..
-rw-------. 1 postgres postgres  354156544 Feb 25 12:06 .27623.17rLmT
-rw-------. 1 postgres postgres  338952192 Feb 25 14:21 .27623.6dH1b6
-rw-------. 1 postgres postgres    5767168 Mar  7 16:00 ..27623.6dH1b6.6PrU4B
-rw-------. 1 postgres postgres  411041792 Feb 25 15:07 .27623.aN42DG
-rw-------. 1 postgres postgres  342884352 Mar  8 15:16 ..27623.aN42DG.0U5xfj
-rw-------. 1 postgres postgres  343146496 Mar  8 13:13 ..27623.aN42DG.2WFmNo
-rw-------. 1 postgres postgres  343408640 Mar  8 10:43 ..27623.aN42DG.384SXU
-rw-------. 1 postgres postgres  357302272 Mar  8 05:26 ..27623.aN42DG.3hHjZ8
-rw-------. 1 postgres postgres  360185856 Mar  7 18:19 ..27623.aN42DG.5lWta4
-rw-------. 1 postgres postgres  343146496 Mar  8 10:12 ..27623.aN42DG.64lNVQ

...shortened...

-rw-------. 1 postgres postgres 1005322240 Feb 25 15:38 .27731.2.JKYXGW
-rw-------. 1 postgres postgres  359661568 Mar  9 14:52 ..27731.2.JKYXGW.3h8RuF
-rw-------. 1 postgres postgres  331087872 Mar  9 07:37 ..27731.2.JKYXGW.3hK5aF
-rw-------. 1 postgres postgres  359923712 Mar  9 09:29 ..27731.2.JKYXGW.3KA5Cq
-rw-------. 1 postgres postgres  359923712 Mar  9 16:55 ..27731.2.JKYXGW.45nQei
-rw-------. 1 postgres postgres  137363456 Mar  9 04:47 ..27731.2.JKYXGW.4zya2Z

...shortened...

-rw-------. 1 postgres postgres  769916928 Feb 25 15:53 .27902.YboxvS
-rw-------. 1 postgres postgres  671612928 Feb 20 10:01 .27902.YMEtoS
-rw-------. 1 postgres postgres  159645696 Feb 25 16:24 .59866.Lkyxgs
-rw-------. 1 postgres postgres  272629760 Feb 20 18:37 .59866.RTcUkC
-rw-------. 1 postgres postgres  505151488 Feb 25 16:40 .59961.5BcZpK
-rw-------. 1 postgres postgres   91750400 Feb 25 16:55 .60194.gUqSdJ
-rw-------. 1 postgres postgres       8192 Apr  7 05:20 60592
-rw-------. 1 postgres postgres       8192 Jan 31 13:03 60594
-rw-------. 1 postgres postgres       8192 Apr  7 02:01 60596
-rw-------. 1 postgres postgres       8192 Feb 28 14:44 60598
-rw-------. 1 postgres postgres       8192 Apr  7 11:55 60600

...shortened...

-rw-------. 1 postgres postgres     139264 May 12 12:08 702364
-rw-------. 1 postgres postgres      24576 May  9 12:42 702364_fsm
-rw-------. 1 postgres postgres       8192 May  9 12:40 702364_vm
-rw-------. 1 postgres postgres          0 May  9 10:10 702369
-rw-------. 1 postgres postgres     860160 May 12 12:08 702372
-rw-------. 1 postgres postgres      24576 May  9 12:37 702372_fsm
-rw-------. 1 postgres postgres       8192 May  9 12:42 702372_vm
-rw-------. 1 postgres postgres       8192 May  9 10:10 702377
-rw-------. 1 postgres postgres     499712 May 12 12:08 702381

...shortened...

-rw-------. 1 postgres postgres      16384 May  9 14:34 704207
-rw-------. 1 postgres postgres      16384 May  9 14:34 704208
-rw-------. 1 postgres postgres       8192 May  9 14:34 704209
-rw-------. 1 postgres postgres      16384 May  9 14:34 704210
-rw-------. 1 postgres postgres      16384 May  9 14:34 704211
-rw-------. 1 postgres postgres        512 May  9 14:34 pg_filenode.map
-rw-------. 1 postgres postgres     106804 May 12 18:50 pg_internal.init
-rw-------. 1 postgres postgres          4 Jan 28 13:52 PG_VERSION


Re: [GENERAL] hidden junk files in ...data/base/oid/

2014-05-27 Thread Andrej Vanek
Hello,

solved.
This is not a postgres issue.

The system was used in an HA cluster with streaming replication.
The hidden files I asked about were probably created by a broken (killed)
rsync. It uses this file format for temporary files during copying.

This rsync is used for master-to-slave database synchronization (a full
on-line backup of the master database to the slave node) before starting
postgres in hot-standby mode on the slave node...
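
For anyone hitting the same thing, a hedged cleanup sketch - rsync
temporaries are dot-prefixed names with a 6-character random suffix; review
the match list first, and run the delete only with postgres stopped on the
slave:

# find /var/lib/pgsql/9.1/data/base -name '.*.??????' -type f -print
# find /var/lib/pgsql/9.1/data/base -name '.*.??????' -type f -delete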

Best Regards, Andrej


Re: [GENERAL] hidden junk files in ...data/base/oid/

2014-05-28 Thread Andrej Vanek
Hello,

thanks for your answer.

I've identified problems in my cluster agent script. It is a custom-written
script with built-in automated recovery of a failed slave. It was written at
a time when the postgres 9.1 streaming replication feature was just in beta
and no postgres agent for streaming replication was available out there.
The problem was that the failed slave recovery was hardcoded into the start
operation. But this start operation was aborted by pacemaker due to a
startup operation timeout, before the backup from the master to the failed
slave had finished (in the case of a bigger database). This is the point
where rsync could be aborted and leave temporary junk files behind. There
was no cleanup before re-running the backup from the master (using rsync).
This is probably why rsync temporary files were left over.
The second problem identified is what you write: copying stuff in one
direction first, then failing over, then copying in the opposite direction.
This happened because my agent was missing the lock file that the standard
clusterlabs pgsql agent uses to avoid starting a failed master in case of a
double failure followed by a reboot.

Now I'm migrating to the standard pacemaker postgres cluster agent provided
by clusterlabs.org to avoid such issues. It is surely much better tested, by
plenty of installations worldwide, with community feedback.

In addition I need to automate single (master or slave) failure recovery as
much as possible. For this purpose I plan to introduce a new resource on top
of the pgsql resource which would recover a failed pgsql slave (or master)
in case the master is active on the other node (I use only a two-node
cluster). Manual recovery by an operator would still be needed when postgres
on both nodes is down, to avoid accidental data loss.
Do you know whether such a cluster agent is already available?

Best Regards, Andrej



2014-05-27 16:09 GMT+02:00 Alvaro Herrera :

> Andrej Vanek wrote:
> > Hello,
> >
> > solved.
> > This is not a postgres issue.
> >
> > The system was used in HA-cluster with streaming replications.
> > The hidden files I asked for were created probably by broken (killed)
> > rsync. It uses such file-format for temporary files used during copying.
> >
> > This rsync is used by master to slave database synchronization (full
> > on-line backup of master database to slave node) before starting postgres
> > in hot-standby mode on slave the node...
>
> You not only have leftover first-order rsync temp files (.N.uvwxyz)
> -- but also when those temp files were being copied over by another
> rsync run, which created temp files for the first-order temp files,
> leaving you with second-order temp files (..N.uvwxyz.opqrst).  Not
> nice.  I wonder if this is anywhere near sanity -- it looks like you're
> copying stuff from one direction first, then failed over, then copied in
> the opposite direction.  I would have your setup reviewed real closely,
> to avoid data-corrupting configuration mistakes.  I have seen people
> make subtle mistakes in their configuration, causing their whole HA
> setups to be completely broken.
>
> --
> Álvaro Herrera                http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>


[GENERAL] deadlock of lock-waits (on transaction and on tuple) using same update statement

2014-09-23 Thread Andrej Vanek
Hi,

My application runs many concurrent sessions with the same transaction code
starting with an update statement.
I would expect locking and serialization of those transactions. But I get
unexpected deadlocks.
As opposed to http://momjian.us/main/writings/pgsql/locking.pdf page 84
(waits for ShareLock on transaction only)
my case looks different:

ERROR:  deadlock detected
DETAIL:  Process 6973 waits for ShareLock on transaction 318396117; blocked
by process 11039.
  ^^
Process 11039 waits for ExclusiveLock on tuple (4,9) of relation
16416 of database 16417; blocked by process 6973.
  ^^
Process 6973: update AELDATA set LOCK_COLUMN = 1 where  (SBO_GRP =
'15')
Process 11039: update AELDATA set LOCK_COLUMN = 1 where  (SBO_GRP =
'15')

Is this expected behaviour, or maybe a bug?

Thanks, Andrej
---details
The case is reproducible via my application.
I was unable to reproduce it via manual psql sessions (locking worked fine).
I was unable to reproduce it via a shell for-loop with psql sessions running
the same transactions (100 loops in 10 terminal windows; sketch below).
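
The shell re-test looked roughly like this (a sketch; the user and database
names are taken from the log prefix below, everything else is hypothetical):

for w in $(seq 1 10); do
  ( for i in $(seq 1 100); do
      psql -U yy -d xx1 -c \
        "begin; update AELDATA set LOCK_COLUMN = 1 where (SBO_GRP = '15'); commit;" \
        >/dev/null
    done ) &
done
wait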

postgres version: PostgreSQL 9.3.4 on x86_64-unknown-linux-gnu, compiled by
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4), 64-bit

postgres log:
2014-09-22 22:21:54 CEST:yy@xx1:[6973] LOG:  duration: 0.021 ms  execute
S_2: ROLLBACK
2014-09-22 22:21:54 CEST:yy@xx1:[11039] LOG:  duration: 7.965 ms  execute
S_3: COMMIT
2014-09-22 22:21:54 CEST:yy@xx1:[6973] LOG:  duration: 0.015 ms  bind S_1:
BEGIN
2014-09-22 22:21:54 CEST:yy@xx1:[6973] LOG:  duration: 0.008 ms  execute
S_1: BEGIN
2014-09-22 22:21:54 CEST:yy@xx1:[6973] LOG:  duration: 0.122 ms  parse
<unnamed>: select  *  from AELDATA_READ_VIEW where  (SBO_GRP = '15')
2014-09-22 22:21:54 CEST:yy@xx1:[6973] LOG:  duration: 0.317 ms  bind
<unnamed>: select  *  from AELDATA_READ_VIEW where  (SBO_GRP = '15')
2014-09-22 22:21:54 CEST:yy@xx1:[6973] LOG:  duration: 0.041 ms  execute
<unnamed>: select  *  from AELDATA_READ_VIEW where  (SBO_GRP = '15')
2014-09-22 22:21:54 CEST:yy@xx1:[6973] LOG:  duration: 0.017 ms  bind S_2:
ROLLBACK
2014-09-22 22:21:54 CEST:yy@xx1:[6973] LOG:  duration: 0.022 ms  execute
S_2: ROLLBACK

2014-09-22 22:21:54 CEST:yy@xx1:[11039] LOG:  duration: 0.017 ms  bind S_1:
BEGIN
2014-09-22 22:21:54 CEST:yy@xx1:[11039] LOG:  duration: 0.010 ms  execute
S_1: BEGIN
2014-09-22 22:21:54 CEST:yy@xx1:[11039] LOG:  duration: 0.083 ms  parse
<unnamed>: update AELDATA set LOCK_COLUMN = 1 where  (SBO_GRP = '15')
2014-09-22 22:21:54 CEST:yy@xx1:[6973] LOG:  duration: 0.016 ms  bind S_1:
BEGIN
2014-09-22 22:21:54 CEST:yy@xx1:[6973] LOG:  duration: 0.013 ms  execute
S_1: BEGIN
2014-09-22 22:21:54 CEST:yy@xx1:[11039] LOG:  duration: 0.174 ms  bind
<unnamed>: update AELDATA set LOCK_COLUMN = 1 where  (SBO_GRP = '15')
2014-09-22 22:21:54 CEST:yy@xx1:[6973] LOG:  duration: 0.096 ms  parse
<unnamed>: update AELDATA set LOCK_COLUMN = 1 where  (SBO_GRP = '15')
2014-09-22 22:21:54 CEST:yy@xx1:[6973] LOG:  duration: 0.152 ms  bind
<unnamed>: update AELDATA set LOCK_COLUMN = 1 where  (SBO_GRP = '15')
2014-09-22 22:21:55 CEST:yy@xx1:[6973] LOG:  process 6973 detected deadlock
while waiting for ShareLock on transaction 318396117 after 1000.060 ms
2014-09-22 22:21:55 CEST:yy@xx1:[6973] STATEMENT:  update AELDATA set
LOCK_COLUMN = 1 where  (SBO_GRP = '15')
2014-09-22 22:21:55 CEST:yy@xx1:[11039] LOG:  process 11039 still waiting
for ExclusiveLock on tuple (4,9) of relation 16416 of database 16417 after
1000.038 ms
2014-09-22 22:21:55 CEST:yy@xx1:[11039] STATEMENT:  update AELDATA set
LOCK_COLUMN = 1 where  (SBO_GRP = '15')
2014-09-22 22:21:55 CEST:yy@xx1:[6973] ERROR:  deadlock detected
2014-09-22 22:21:55 CEST:yy@xx1:[6973] DETAIL:  Process 6973 waits for
ShareLock on transaction 318396117; blocked by process 11039.
Process 11039 waits for ExclusiveLock on tuple (4,9) of relation
16416 of database 16417; blocked by process 6973.
Process 6973: update AELDATA set LOCK_COLUMN = 1 where  (SBO_GRP =
'15')
Process 11039: update AELDATA set LOCK_COLUMN = 1 where  (SBO_GRP =
'15')
2014-09-22 22:21:55 CEST:yy@xx1:[6973] HINT:  See server log for query
details.
2014-09-22 22:21:55 CEST:yy@xx1:[6973] STATEMENT:  update AELDATA set
LOCK_COLUMN = 1 where  (SBO_GRP = '15')
2014-09-22 22:21:55 CEST:yy@xx1:[11039] LOG:  process 11039 acquired
ExclusiveLock on tuple (4,9) of relation 16416 of database 16417 after
1000.224 ms
2014-09-22 22:21:55 CEST:yy@xx1:[11039] STATEMENT:  update AELDATA set
LOCK_COLUMN = 1 where  (SBO_GRP = '15')
2014-09-22 22:21:55 CEST:yy@xx1:[11039] LOG:  duration: 1004.543 ms
 execute <unnamed>: update AELDATA set LOCK_COLUMN = 1 where  (SBO_GRP =
'15')
2014-09-22 22:21:55 CEST:yy@xx1:[11039] LOG:  duration: 0.084 ms  parse
<unnamed>: select AEL_ID, SBO_GRP, LOCK_KEY from AELDATA where  (SBO_GRP =
'15')
2014-09-22 22:2

Re: [GENERAL] deadlock of lock-waits (on transaction and on tuple) using same update statement

2014-09-25 Thread Andrej Vanek
Hi Bill,

thanks for your answer.

>  most often caused by something earlier in the transactions
>  need all of the statements in each transaction

It would be great if we could reveal an application error.

I've already posted the whole transactions (in the postgres log, with
log_min_duration_statement=0).
Nothing suspicious yet:
- both sessions COMMIT/ROLLBACK before BEGIN
- both sessions run the same SQL statements
- deadlock: FIRST statement of both transactions
- deadlock: update single table, same row, column without any constraints
(WHY?)
- no statements of other sessions executed in between

Any idea?

thanks, Andrej


Re: [GENERAL] deadlock of lock-waits (on transaction and on tuple) using same update statement

2014-09-25 Thread Andrej Vanek
Hi,


now I've checked the release notes of 9.3.5 (my version is 9.3.4) and found
a fix which could be related to my deadlocks:

> Fix race condition when updating a tuple concurrently locked by another
process (Andres Freund,Álvaro Herrera)

How can I make sure I've run into this bug?


Re: [GENERAL] deadlock of lock-waits (on transaction and on tuple) using same update statement

2014-10-03 Thread Andrej Vanek
Hi,

retested:
yes, this is still an issue in 9.3.5; the same deadlock errors occurred.
Do you need me to extract a simplified reproducible test case?

Best Regards, Andrej