Monitor repl slot size

2018-07-13 Thread Nicola Contu
Hello,
we used to monitor the replication slot size on postgres 9.6.6 with the
following query:

SELECT pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn) FROM
pg_replication_slots WHERE slot_name = 'SLOT NAME';

We are moving to postgres 10.4 and we saw that pg_xlog_location_diff is not
there anymore.

I know we can re-create it following this link:
https://github.com/DataDog/integrations-core/issues/907

but is there any better way to do it? Is there a replacement for that
function on postgres 10?
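For reference, PostgreSQL 10 renamed the xlog/location functions to wal/lsn
names, so presumably the same check can be written as (a sketch, not verified
against our monitoring setup):

SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots WHERE slot_name = 'SLOT NAME';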

Thanks a lot,
Nicola


Waiting connections postgres 10

2018-07-23 Thread Nicola Contu
Hello,

we used to monitor waiting connections with postgres 9.6.6 via this query :


select count (*) from pg_stat_activity where wait_event IS NOT NULL


Now with postgres 10 it seems a bit harder to understand when a query from the
application is waiting.


I built this query, but I'm not sure I'm getting the right information.


select count(*) from pg_stat_activity where wait_event_type = 'Client' and
wait_event IN ('ClientRead','ClientWrite') and state = 'idle'
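A variant that would count only backends genuinely blocked on something
(locks, IO, IPC) rather than clients sitting idle in ClientRead might be
(just a sketch, assuming that is the intent of the old 9.6 check):

select count(*) from pg_stat_activity
where state = 'active'
  and wait_event is not null
  and wait_event_type not in ('Client', 'Activity');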


Can anyone help me?


Thanks


Restore from dumps

2018-07-25 Thread Nicola Contu
Hello,
we recently moved from postgres 9.6.6 to 10.4

We perform a pg_dump in production and restore it daily in a preprod env.
This process used to work perfectly, but now we have a tiny problem.

We first restore the data, then we perform a vacuum, and then we restore the
matviews. When restoring the matviews we now get:

pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 23090; 0 1912379424
MATERIALIZED VIEW DATA matview_vrs_request_sla postgres
pg_restore: [archiver (db)] could not execute query: ERROR:  relation
"all_days" does not exist
LINE 3: from all_days
 ^
QUERY:
select count(*)::numeric
from all_days
where (("date" between $2::date and $1::date) or ("date" between $1::date
and $2::date))
and dow not in (0,6)

CONTEXT:  SQL function "bdays" during inlining
Command was: REFRESH MATERIALIZED VIEW public.matview_vrs_request_sla;

The relation is there; in fact, when I get into the office later and run the
same command, it works.

I'm not sure why it does not work here; this seems really strange to me.

Can anyone help?

Thank you,
Nicola


Re: Restore from dumps

2018-07-25 Thread Nicola Contu
Hello
these are the commands we use pretty much:

tar -xf tarname.tar  -C /backupdatadir --strip-components=4
pg_restore -l /backupdatadir | sed '/MATERIALIZED VIEW DATA/d' >
/restore.lst
pg_restore -U postgres  -L /restore.lst -d DBNAME  -j 32 /backupdatadir
vacuumdb --analyze-in-stages  -U postgres --jobs 32 -d  DBNAME
pg_restore -l /backupdatadir  | grep 'MATERIALIZED VIEW DATA' > /refresh.lst
pg_restore -U postgres  -L /refresh.lst -d DBNAME  -j 32 /backupdatadir


Yes, all_days is a table. bdays is a function, and it is included in the
first file.

Both servers have 10.4 for the psql commands; we take the backup with 10.4 and
we restore with 10.4.

We used to have postgres 9.6.6 in production and 10.4 in preprod, and the
restore always went fine. After switching to 10.4 in prod we started having
the problem.


2018-07-25 11:28 GMT+02:00 Alban Hertroys :

>
> > On 25 Jul 2018, at 9:43, Nicola Contu  wrote:
> >
> > Hello,
> > we recently moved from postgres 9.6.6 to 10.4
> >
> > We perform a pg_dump in production to restore daily in a preprod env.
> > This process used to work perfectly, but now we have a tiny problem.
> >
> > We first restore data, we perform a vacuum and then we restore matviews.
>
> What are the commands you used? You don't seem to mention restoring the
> schema?
>
> > Restoring matviews now we have :
> >
> > pg_restore: [archiver (db)] Error while PROCESSING TOC:
> > pg_restore: [archiver (db)] Error from TOC entry 23090; 0 1912379424
> MATERIALIZED VIEW DATA matview_vrs_request_sla postgres
> > pg_restore: [archiver (db)] could not execute query: ERROR:  relation
> "all_days" does not exist
> > LINE 3: from all_days
> >  ^
> > QUERY:
> > select count(*)::numeric
> > from all_days
> > where (("date" between $2::date and $1::date) or ("date" between
> $1::date and $2::date))
> > and dow not in (0,6)
> >
> > CONTEXT:  SQL function "bdays" during inlining
> > Command was: REFRESH MATERIALIZED VIEW public.matview_vrs_request_
> sla;
>
> Is all_days a table? Or is it perhaps another view, possibly materialized
> even?
>
> > The relation is there, in fact if I go there when I get in to the
> office, the same command works.
>
> This sounds to me like you may be using a different version of pg_restore
> in the office. Are both versions 10.4 or newer?
> It can't hurt to check that you used version 10.4 of pg_dump as well.
>
> Alban Hertroys
> --
> If you can't see the forest for the trees,
> cut the trees and you'll find there is no forest.
>
>


Re: Restore from dumps

2018-07-25 Thread Nicola Contu
Thanks Laurenz
I will try that.
Btw, just wondering why it works if I refresh it later, even though the
definition is still without the public qualification.

2018-07-25 12:06 GMT+02:00 Laurenz Albe :

> Nicola Contu wrote:
> > we recently moved from postgres 9.6.6 to 10.4
> >
> > We perform a pg_dump in production to restore daily in a preprod env.
> > This process used to work perfectly, but now we have a tiny problem.
> >
> > We first restore data, we perform a vacuum and then we restore matviews.
> > Restoring matviews now we have :
> >
> > pg_restore: [archiver (db)] Error while PROCESSING TOC:
> > pg_restore: [archiver (db)] Error from TOC entry 23090; 0 1912379424
> MATERIALIZED VIEW DATA matview_vrs_request_sla postgres
> > pg_restore: [archiver (db)] could not execute query: ERROR:  relation
> "all_days" does not exist
> > LINE 3: from all_days
> >  ^
> > QUERY:
> > select count(*)::numeric
> > from all_days
> > where (("date" between $2::date and $1::date) or ("date" between
> $1::date and $2::date))
> > and dow not in (0,6)
> >
> > CONTEXT:  SQL function "bdays" during inlining
> > Command was: REFRESH MATERIALIZED VIEW public.matview_vrs_request_
> sla;
> >
> > The relation is there, in fact if I go there when I get in to the
> office, the same command works.
> >
> > I'm not sure why it does not work here, this seems really strange to me.
>
> I suspect that it has to do with the recent security fixes around the
> "public" schema.
>
> Try to ALTER the materialized view so that it refers to "public.all_days"
> rather than "all_days".
>
> Yours,
> Laurenz Albe
> --
> Cybertec | https://www.cybertec-postgresql.com
>


Re: Restore from dumps

2018-07-25 Thread Nicola Contu
yeah, we updated that function in production to say public.all_days.
I will let you know at the next restore.

Thanks guys, appreciated.

2018-07-25 16:28 GMT+02:00 Tom Lane :

> Laurenz Albe  writes:
> > Nicola Contu wrote:
> >> pg_restore: [archiver (db)] could not execute query: ERROR:  relation
> "all_days" does not exist
> >> LINE 3: from all_days
> >> ^
> >> QUERY:
> >> select count(*)::numeric
> >> from all_days
> >> where (("date" between $2::date and $1::date) or ("date" between
> $1::date and $2::date))
> >> and dow not in (0,6)
> >>
> >> CONTEXT:  SQL function "bdays" during inlining
>
> > Try to ALTER the materialized view so that it refers to "public.all_days"
> > rather than "all_days".
>
> Per the error message, what needs fixing is the SQL function "bdays",
> not the matview as such.
>
> regards, tom lane
>
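In practice that means schema-qualifying the table inside the function body.
The real bdays signature is not shown in the thread, so the following is only
a hypothetical sketch of what the fix looks like:

CREATE OR REPLACE FUNCTION bdays(d1 date, d2 date) RETURNS numeric AS $$
  -- schema-qualified so it still resolves under pg_restore's restricted search_path
  select count(*)::numeric
  from public.all_days
  where (("date" between $2::date and $1::date) or ("date" between $1::date and $2::date))
  and dow not in (0,6)
$$ LANGUAGE sql;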


Re: Restore from dumps

2018-07-26 Thread Nicola Contu
That worked.
Thanks guys.

2018-07-25 16:33 GMT+02:00 Nicola Contu :

> yeah, we updated that function in production to says public.all_days.
> I will let you know at the next restore.
>
> Thanks guys, appreciated.
>
> 2018-07-25 16:28 GMT+02:00 Tom Lane :
>
>> Laurenz Albe  writes:
>> > Nicola Contu wrote:
>> >> pg_restore: [archiver (db)] could not execute query: ERROR:  relation
>> "all_days" does not exist
>> >> LINE 3: from all_days
>> >> ^
>> >> QUERY:
>> >> select count(*)::numeric
>> >> from all_days
>> >> where (("date" between $2::date and $1::date) or ("date" between
>> $1::date and $2::date))
>> >> and dow not in (0,6)
>> >>
>> >> CONTEXT:  SQL function "bdays" during inlining
>>
>> > Try to ALTER the materialized view so that it refers to
>> "public.all_days"
>> > rather than "all_days".
>>
>> Per the error message, what needs fixing is the SQL function "bdays",
>> not the matview as such.
>>
>> regards, tom lane
>>
>
>


Pgbouncer and postgres

2018-09-17 Thread Nicola Contu
Hello,

I'm trying to get pgbouncer working but I'm getting this error :


2018-09-17 12:20:15.304 87772 NOISE safe_accept(12) = Resource temporarily
unavailable
2018-09-17 12:20:36.183 87772 NOISE safe_accept(11) = 13 (127.0.0.1:59100)
2018-09-17 12:20:36.183 87772 NOISE new fd from accept=13
2018-09-17 12:20:36.183 87772 NOISE safe_accept(11) = Resource temporarily
unavailable
2018-09-17 12:21:14.412 88424 DEBUG parse_ini_file: 'logfile' =
'/var/log/pgbouncer.log' ok:1
2018-09-17 12:21:14.412 88424 DEBUG parse_ini_file: 'pidfile' =
'/home/postgres/pgbouncer.pid'
2018-09-17 12:21:14.412 88424 DEBUG parse_ini_file: 'pidfile' =
'/home/postgres/pgbouncer.pid' ok:1
2018-09-17 12:21:14.412 88424 DEBUG parse_ini_file: 'admin_users' = 'admin'
2018-09-17 12:21:14.412 88424 DEBUG parse_ini_file: 'admin_users' = 'admin'
ok:1
2018-09-17 12:21:14.412 88424 DEBUG parse_ini_file: 'user' = 'postgres'
2018-09-17 12:21:14.412 88424 DEBUG parse_ini_file: 'user' = 'postgres' ok:1
2018-09-17 12:21:14.412 88424 DEBUG parse_ini_file: 'max_db_connections' =
'220'
2018-09-17 12:21:14.412 88424 DEBUG parse_ini_file: 'max_db_connections' =
'220' ok:1
2018-09-17 12:21:14.412 88424 DEBUG parse_ini_file: 'log_connections' = '0'
2018-09-17 12:21:14.413 88424 DEBUG parse_ini_file: 'log_connections' = '0'
ok:1
2018-09-17 12:21:14.413 88424 DEBUG parse_ini_file: 'log_disconnections' =
'0'
2018-09-17 12:21:14.413 88424 DEBUG parse_ini_file: 'log_disconnections' =
'0' ok:1
2018-09-17 12:21:14.413 88424 DEBUG parse_ini_file: 'verbose' = '3'
2018-09-17 12:21:14.413 88424 DEBUG parse_ini_file: 'verbose' = '3' ok:1
2018-09-17 12:21:14.413 88424 DEBUG loading auth_file:
"/etc/pgbouncer/users.txt"
2018-09-17 12:21:14.417 88424 NOISE event: 128, SBuf: 192, PgSocket: 400,
IOBuf: 4108
2018-09-17 12:21:14.417 88424 LOG File descriptor limit: 1024 (H:4096),
max_client_conn: 100, max fds possible: 230
2018-09-17 12:21:14.417 88424 DEBUG pktbuf_dynamic(128): 0x25277c0
2018-09-17 12:21:14.417 88424 DEBUG make_room(0x25277c0, 4): realloc
newlen=256
2018-09-17 12:21:14.417 88424 DEBUG pktbuf_dynamic(128): 0x252ebb0
2018-09-17 12:21:14.418 88424 NOISE connect(4, unix:/tmp/.s.PGSQL.6543) =
Connection refused
2018-09-17 12:21:14.418 88424 NOISE safe_close(4) = 0
2018-09-17 12:21:14.418 88424 LOG Stale pidfile, removing
2018-09-17 12:21:14.419 88424 DEBUG adns_create_context: evdns2
2018-09-17 12:21:14.421 88424 DEBUG add_listen: 127.0.0.1:6543
2018-09-17 12:21:14.422 88424 NOISE old TCP_DEFER_ACCEPT on 11 = 0
2018-09-17 12:21:14.422 88424 NOISE install TCP_DEFER_ACCEPT on 11
2018-09-17 12:21:14.422 88424 LOG listening on 127.0.0.1:6543
2018-09-17 12:21:14.422 88424 DEBUG add_listen: unix:/tmp/.s.PGSQL.6543
2018-09-17 12:21:14.422 88424 LOG listening on unix:/tmp/.s.PGSQL.6543
2018-09-17 12:21:14.423 88424 LOG process up: pgbouncer 1.8.1, libevent
2.0.21-stable (epoll), adns: evdns2, tls: OpenSSL 1.0.2k-fips  26 Jan 2017
2018-09-17 12:21:14.756 88424 DEBUG loading auth_file:
"/etc/pgbouncer/users.txt"
2018-09-17 12:21:48.917 88424 NOISE safe_accept(12) = 13 (unix:)
2018-09-17 12:21:48.917 88424 NOISE new fd from accept=13
2018-09-17 12:21:48.917 88424 WARNING tune_socket(13) failed: Operation not
supported
2018-09-17 12:21:48.917 88424 NOISE safe_close(13) = 0
2018-09-17 12:21:48.918 88424 NOISE safe_accept(12) = Resource temporarily
unavailable
2018-09-17 12:22:14.422 88424 LOG Stats: 0 xacts/s, 0 queries/s, in 0 B/s,
out 0 B/s, xact 0 us, query 0 us wait time 0 us
2018-09-17 12:23:14.423 88424 LOG Stats: 0 xacts/s, 0 queries/s, in 0 B/s,
out 0 B/s, xact 0 us, query 0 us wait time 0 us
2018-09-17 12:23:26.561 88424 LOG Got SIGINT, shutting down
2018-09-17 12:23:26.872 88424 LOG server connections dropped, exiting
2018-09-17 12:23:26.872 88424 NOISE safe_close(11) = 0
2018-09-17 12:23:26.872 88424 NOISE safe_close(12) = 0


Postgres version : 10.5

Libevent : 2.0.5

Pgbouncer : 1.8.1

OS : Centos 7


It used to work. After trying to install the latest version of pgbouncer
(1.9.0) it does not work anymore, so I re-compiled v1.8.1.



Content of the pgbouncer.ini

[databases]
DB_NAME = host=IP_ADDD port=5432 dbname=DB_NAME auth_user=pgbouncer
pool_size=120

[pgbouncer]
listen_port = 6543
listen_addr = *
auth_type = md5
auth_file = /etc/pgbouncer/users.txt
auth_query = select uname,phash from user_lookup($1)
logfile = /var/log/pgbouncer.log
pidfile = /home/postgres/pgbouncer.pid
admin_users = admin
user=postgres
max_db_connections = 220
log_connections=0
log_disconnections=0
verbose=3
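For context, the auth_query above presupposes a user_lookup function on the
target database. The pgbouncer documentation's example (a sketch; the exact
definition in use here may differ) is a SECURITY DEFINER function reading
pg_shadow:

CREATE OR REPLACE FUNCTION user_lookup(IN i_username text, OUT uname text, OUT phash text)
RETURNS record AS $$
BEGIN
    -- return the username and password hash for the requested role
    SELECT usename, passwd INTO uname, phash
    FROM pg_catalog.pg_shadow
    WHERE usename = i_username;
    RETURN;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;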


Content of the users.txt

[root@cmd-dev1 pgbouncer]# cat users.txt
"pgbouncer" "***"


I tried to change the listen address to 127.0.0.1 or to the IP of the VM.
Nothing changed.


Can anyone help?


Thanks


Re: Pgbouncer and postgres

2018-09-17 Thread Nicola Contu
I'm not aware of any updates having been installed; there were like 600+ pending updates, in fact.

These are the logs when connecting via telnet :

2018-09-17 14:18:18.933 65617 NOISE safe_accept(11) = 14 (10.151.2.145:39478
)
2018-09-17 14:18:18.933 65617 NOISE new fd from accept=14
2018-09-17 14:18:18.933 65617 NOISE safe_accept(11) = Resource temporarily
unavailable
2018-09-17 14:18:28.385 65617 NOISE safe_accept(12) = 15 (::1/49712)
2018-09-17 14:18:28.385 65617 NOISE new fd from accept=15
2018-09-17 14:18:28.385 65617 NOISE safe_accept(12) = Resource temporarily
unavailable
2018-09-17 14:18:30.703 65617 NOISE safe_accept(11) = 16 (10.151.2.145:39520
)
2018-09-17 14:18:30.703 65617 NOISE new fd from accept=16
2018-09-17 14:18:30.703 65617 NOISE safe_accept(11) = Resource temporarily
unavailable
2018-09-17 14:18:34.416 65617 NOISE safe_accept(11) = 17 (10.151.2.145:39530
)
2018-09-17 14:18:34.416 65617 NOISE new fd from accept=17
2018-09-17 14:18:34.416 65617 NOISE safe_accept(11) = Resource temporarily
unavailable

[root@cmd-dev1 pgbouncer]# telnet localhost 6543
Trying ::1...
Connected to localhost.
Escape character is '^]'.
Connection closed by foreign host.

On Mon, Sep 17, 2018 at 16:11 Adrian Klaver <adrian.kla...@aklaver.com> wrote:

> On 9/17/18 7:05 AM, Nicola Contu wrote:
> > The log is from pgbouncer.
> >
> > I installed updates after this problem, but they did not resolve
> anything.
>
> I was thinking more of updates that where done between the last time it
> worked and the time it started failing.
>
> Along that line, when was the last time it worked?
>
> >
> > I compiled 1.9.0 from source on the same path of the previous one. Then
> > back to 1.8.1 on the same path.
> >
> > I also tried changing the listening port on pgbouncer, but that does not
> > change too much.
>
> Yeah, your log shows the port/socket coming up:
>
> 2018-09-17 12:21:14.422 88424 LOG listening on 127.0.0.1:6543
> 2018-09-17 12:21:14.422 88424 DEBUG add_listen: unix:/tmp/.s.PGSQL.6543
> 2018-09-17 12:21:14.422 88424 LOG listening on unix:/tmp/.s.PGSQL.6543
> 2018-09-17 12:21:14.423 88424 LOG process up: pgbouncer 1.8.1, libevent
> 2.0.21-stable (epoll), adns: evdns2, tls: OpenSSL 1.0.2k-fips  26 Jan 2017
>
> > I can connect via telnet on that port, but we I try with anything else
> > (like Apache) it makes pgbouncer crashing.
>
> Are there log entries from that?
>
> >
> >
> > Il giorno lun 17 set 2018 alle ore 15:44 Adrian Klaver
> > mailto:adrian.kla...@aklaver.com>> ha
> scritto:
> >
> > On 9/17/18 5:33 AM, Nicola Contu wrote:
> >  > Hello,
> >  >
> >  > I'm trying to get pgbouncer working but I'm getting this error :
> >  >
> >  >
> >
> > What is generating the log below?
> >
> > To me it looks like everything is on track until this:
> >
> >  > 2018-09-17 12:21:48.917 88424 WARNING tune_socket(13) failed:
> > Operation
> >  > not supported
> >  > 2018-09-17 12:21:48.917 88424 NOISE safe_close(13) = 0
> >  > 2018-09-17 12:21:48.918 88424 NOISE safe_accept(12) = Resource
> >  > temporarily unavailable
> >  > 2018-09-17 12:22:14.422 88424 LOG Stats: 0 xacts/s, 0 queries/s,
> > in 0
> >  > B/s, out 0 B/s, xact 0 us, query 0 us wait time 0 us
> >  > 2018-09-17 12:23:14.423 88424 LOG Stats: 0 xacts/s, 0 queries/s,
> > in 0
> >  > B/s, out 0 B/s, xact 0 us, query 0 us wait time 0 us
> >  > 2018-09-17 12:23:26.561 88424 LOG Got SIGINT, shutting down
> >  > 2018-09-17 12:23:26.872 88424 LOG server connections dropped,
> exiting
> >  > 2018-09-17 12:23:26.872 88424 NOISE safe_close(11) = 0
> >  > 2018-09-17 12:23:26.872 88424 NOISE safe_close(12) = 0
> >  >
> >  >
> >  > Postgres version : 10.5
> >  >
> >  > Libevent : 2.0.5
> >  >
> >  > Pgbouncer : 1.8.1
> >  >
> >  > OS : Centos 7
> >
> > Have there been updates to CentOS during this?
> >
> >  >
> >  >
> >  > It used to work. Now after trying to install latest version of
> > pgbouncer
> >  > (1.9.0) does not work anymore. Then I re-compiled the v 1.8.1
> >
> > Did you compile 1.9.0 from source or install from package?
> >
> >  >
> >  >
> >  >
> >  > Content of the pgbouncer.ini
> >  >
> >  > [databases]
> >  > DB_NAME = host=IP_ADDD port=5432
> dbname=DB_NAM

Re: Pgbouncer and postgres

2018-09-17 Thread Nicola Contu
[Mon Sep 17 10:24:59.041589 2018] [php7:notice] [pid 63893:tid
140586929567488] [client 10.160.41.3:51764] PHP WARNING: pg_connect():
Unable to connect to PostgreSQL server: could not connect to server:
Connection refused\n\tIs the server running on host "IP_ADDR" and
accepting\n\tTCP/IP connections on port 6543? in
/home/cmd3/adodb5/drivers/adodb-postgres64.inc.php on line 727

So it's basically the same thing I get when connecting via psql on port 6543.

On Mon, Sep 17, 2018 at 16:23 Adrian Klaver <adrian.kla...@aklaver.com> wrote:

> On 9/17/18 7:19 AM, Nicola Contu wrote:
> > I'm not aware of any update installed. There were like 600+ updates in
> fact.
> >
> > These are the logs when connecting via telnet :
>
> How about those from when Apache tries to connect and pgbouncer crashes?
>
> > Connection closed by foreign host.
>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>


Re: Pgbouncer and postgres

2018-09-17 Thread Nicola Contu
The log is from pgbouncer.

I installed updates after this problem, but they did not resolve anything.

I compiled 1.9.0 from source on the same path as the previous one, then went
back to 1.8.1 on the same path.

I also tried changing the listening port on pgbouncer, but that does not
change much.
I can connect via telnet on that port, but when I try with anything else
(like Apache) it makes pgbouncer crash.


On Mon, Sep 17, 2018 at 15:44 Adrian Klaver <adrian.kla...@aklaver.com> wrote:

> On 9/17/18 5:33 AM, Nicola Contu wrote:
> > Hello,
> >
> > I'm trying to get pgbouncer working but I'm getting this error :
> >
> >
>
> What is generating the log below?
>
> To me it looks like everything is on track until this:
>
> > 2018-09-17 12:21:48.917 88424 WARNING tune_socket(13) failed: Operation
> > not supported
> > 2018-09-17 12:21:48.917 88424 NOISE safe_close(13) = 0
> > 2018-09-17 12:21:48.918 88424 NOISE safe_accept(12) = Resource
> > temporarily unavailable
> > 2018-09-17 12:22:14.422 88424 LOG Stats: 0 xacts/s, 0 queries/s, in 0
> > B/s, out 0 B/s, xact 0 us, query 0 us wait time 0 us
> > 2018-09-17 12:23:14.423 88424 LOG Stats: 0 xacts/s, 0 queries/s, in 0
> > B/s, out 0 B/s, xact 0 us, query 0 us wait time 0 us
> > 2018-09-17 12:23:26.561 88424 LOG Got SIGINT, shutting down
> > 2018-09-17 12:23:26.872 88424 LOG server connections dropped, exiting
> > 2018-09-17 12:23:26.872 88424 NOISE safe_close(11) = 0
> > 2018-09-17 12:23:26.872 88424 NOISE safe_close(12) = 0
> >
> >
> > Postgres version : 10.5
> >
> > Libevent : 2.0.5
> >
> > Pgbouncer : 1.8.1
> >
> > OS : Centos 7
>
> Have there been updates to CentOS during this?
>
> >
> >
> > It used to work. Now after trying to install latest version of pgbouncer
> > (1.9.0) does not work anymore. Then I re-compiled the v 1.8.1
>
> Did you compile 1.9.0 from source or install from package?
>
> >
> >
> >
> > Content of the pgbouncer.ini
> >
> > [databases]
> > DB_NAME = host=IP_ADDD port=5432 dbname=DB_NAME auth_user=pgbouncer
> > pool_size=120
> >
> > [pgbouncer]
> > listen_port = 6543
> > listen_addr = *
> > auth_type = md5
> > auth_file = /etc/pgbouncer/users.txt
> > auth_query = select uname,phash from user_lookup($1)
> > logfile = /var/log/pgbouncer.log
> > pidfile = /home/postgres/pgbouncer.pid
> > admin_users = admin
> > user=postgres
> > max_db_connections = 220
> > log_connections=0
> > log_disconnections=0
> > verbose=3
> >
> >
> > Content of the users .txt
> >
> > [root@cmd-dev1 pgbouncer]# cat users.txt
> > "pgbouncer" "***"
> >
> >
> > I tried to change the listen address to 127.0.0.1 or to the IP of the
> > VM. Nothing changed.
> >
> >
> > Can anyone help?
> >
> >
> > Thanks
> >
> >
>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>


Re: Pgbouncer and postgres

2018-09-17 Thread Nicola Contu
Yeah, I replaced the IP of the machine with the word IP_ADDR.

On the pgbouncer log I get this while trying to connect via psql:

2018-09-17 15:56:49.452 67611 WARNING tune_socket(16) failed: Operation not
supported
2018-09-17 15:56:49.452 67611 NOISE safe_close(16) = 0
2018-09-17 15:56:49.452 67611 NOISE safe_accept(13) = Resource temporarily
unavailable
2018-09-17 15:56:50.728 67611 NOISE safe_accept(13) = 16 (unix:)
2018-09-17 15:56:50.729 67611 NOISE new fd from accept=16
2018-09-17 15:56:50.729 67611 WARNING tune_socket(16) failed: Operation not
supported
2018-09-17 15:56:50.729 67611 NOISE safe_close(16) = 0
2018-09-17 15:56:50.729 67611 NOISE safe_accept(13) = Resource temporarily
unavailable
2018-09-17 15:56:56.418 67611 NOISE safe_accept(11) = 16 (10.151.2.145:58696
)
2018-09-17 15:56:56.418 67611 NOISE new fd from accept=16
2018-09-17 15:56:56.418 67611 NOISE safe_accept(11) = Resource temporarily
unavailable


On Mon, Sep 17, 2018 at 17:46 Adrian Klaver <adrian.kla...@aklaver.com> wrote:

> On 9/17/18 8:39 AM, Nicola Contu wrote:
> Please reply to list also.
> Ccing list.
>
> > No wait, IP_ADDR has been modified by me, but there is the IP of the
> > machine 10.151.x.
>
> To be clear in your posts you are replacing the actual IP with IP_ADDR,
> correct?
>
> >
> >
> > Here is the psql command .
> >
> > [root@cmd-dev1 ~]# psql -U ncontu -p 6543 cmd3dev
> > psql: server closed the connection unexpectedly
> > This probably means the server terminated abnormally
> > before or while processing the request.
>
> So what does the pgbouncer log show when you do above?
>
> >
> >
> >
> >
> > Il giorno lun 17 set 2018 alle ore 17:35 Adrian Klaver
> > mailto:adrian.kla...@aklaver.com>> ha
> scritto:
> >
> > On 9/17/18 7:26 AM, Nicola Contu wrote:
> >  > [Mon Sep 17 10:24:59.041589 2018] [php7:notice] [pid 63893:tid
> >  > 140586929567488] [client 10.160.41.3:51764
> > <http://10.160.41.3:51764> <http://10.160.41.3:51764>]
> >  > PHP WARNING: pg_connect(): Unable to connect to PostgreSQL
> > server: could
> >  > not connect to server: Connection refused\n\tIs the server
> > running on
> >  > host "IP_ADDR" and accepting\n\tTCP/IP connections on
> port
> >  > 6543? in /home/cmd3/adodb5/drivers/adodb-postgres64.inc.php on
> > line 727
> >
> > To me it looks like IP_ADDR is not resolving to a host.
> >
> > In your pgbouncer.ini I see:
> >
> > DB_NAME = host=IP_ADDD port=5432 dbname=DB_NAME auth_user=pgbouncer
> > pool_size=120
> >
> > but the error is about connecting to port 6543 which is your
> pgbouncer
> > 'server'.
> >
> > Are you sure your connection string is pointing at the correct thing?
> >
> >  >
> >  > So basically the same I get when connecting via psql on port 6543
> >
> > What is the full psql command?
> >
> >  >
> >  > Il giorno lun 17 set 2018 alle ore 16:23 Adrian Klaver
> >  > mailto:adrian.kla...@aklaver.com>
> > <mailto:adrian.kla...@aklaver.com
> > <mailto:adrian.kla...@aklaver.com>>> ha scritto:
> >  >
> >  > On 9/17/18 7:19 AM, Nicola Contu wrote:
> >  >  > I'm not aware of any update installed. There were like 600+
> >  > updates in fact.
> >  >  >
> >  >  > These are the logs when connecting via telnet :
> >  >
> >  > How about those from when Apache tries to connect and
> > pgbouncer crashes?
> >  >
> >  >  > Connection closed by foreign host.
> >  >
> >  >
> >  > --
> >  > Adrian Klaver
> >  > adrian.kla...@aklaver.com <mailto:adrian.kla...@aklaver.com>
> > <mailto:adrian.kla...@aklaver.com <mailto:adrian.kla...@aklaver.com
> >>
> >  >
> >
> >
> > --
> > Adrian Klaver
> > adrian.kla...@aklaver.com <mailto:adrian.kla...@aklaver.com>
> >
>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>


Re: Pgbouncer and postgres

2018-09-17 Thread Nicola Contu
Yes I can connect to port 5432 without any problem.

The thing is that I am not seeing anything on the postgres log when
connecting to 6543.

Sure, thanks for your help.
I will create an issue there.


On Mon, Sep 17, 2018 at 18:07 Adrian Klaver <adrian.kla...@aklaver.com> wrote:

> On 9/17/18 8:59 AM, Nicola Contu wrote:
> > Yeah, I replaced the IP of the machine with the word IP_ADDR.
> >
> > On the pgbouncer log I get this while trying to connect via psql:
> >
> > 2018-09-17 15:56:49.452 67611 WARNING tune_socket(16) failed: Operation
> > not supported
> > 2018-09-17 15:56:49.452 67611 NOISE safe_close(16) = 0
> > 2018-09-17 15:56:49.452 67611 NOISE safe_accept(13) = Resource
> > temporarily unavailable
> > 2018-09-17 15:56:50.728 67611 NOISE safe_accept(13) = 16 (unix:)
> > 2018-09-17 15:56:50.729 67611 NOISE new fd from accept=16
> > 2018-09-17 15:56:50.729 67611 WARNING tune_socket(16) failed: Operation
> > not supported
> > 2018-09-17 15:56:50.729 67611 NOISE safe_close(16) = 0
> > 2018-09-17 15:56:50.729 67611 NOISE safe_accept(13) = Resource
> > temporarily unavailable
> > 2018-09-17 15:56:56.418 67611 NOISE safe_accept(11) = 16
> > (10.151.2.145:58696 <http://10.151.2.145:58696>)
> > 2018-09-17 15:56:56.418 67611 NOISE new fd from accept=16
> > 2018-09-17 15:56:56.418 67611 NOISE safe_accept(11) = Resource
> > temporarily unavailable
>
> Can you connect to the actual database running on port 5432?
>
> Deciphering the above log is beyond my capabilities. I think it is time
> to file an issue here:
>
> https://github.com/pgbouncer/pgbouncer/issues
>
>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>


Pgbouncer discard all

2018-10-10 Thread Nicola Contu
Hello,
we are running pgbouncer 1.9.1 connected to postgres 10.5

Sometimes we are seeing a lot of waiting connections with this query :

DISCARD ALL

This is our pgbouncer config :

[databases]
dev = host=10.10.10.1 port=5432 dbname=dev auth_user=pgbouncer pool_size=120

[pgbouncer]
listen_port = 6543
listen_addr = *
auth_type = md5
auth_file = /etc/pgbouncer/users.txt
auth_query = select uname,phash from user_lookup($1)
logfile = /var/log/pgbouncer.log
pidfile = /home/postgres/pgbouncer.pid
admin_users = admin
user = postgres
max_db_connections = 220
log_connections = 0
log_disconnections = 0


Do you think this can depend on the server_idle_timeout default config
value?
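For reference, the pooler-side settings that control how long those server
connections (and the DISCARD ALL reset that precedes their reuse) hang around
are the following; the values shown are the pgbouncer defaults, not
necessarily what is in effect here:

server_reset_query = DISCARD ALL   ; run by pgbouncer before a connection is reused by another client
server_idle_timeout = 600          ; drop server connections that sit idle longer than this (seconds)
server_lifetime = 3600             ; drop unused server connections older than this (seconds)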

Thanks a lot,
Nicola


Swap on postgres master server

2018-10-16 Thread Nicola Contu
Hello,
we are running Postgres 10.5 with master slave replication.

These are our custom params

archive_command = 'pgbackrest --stanza=cmdprod archive-push %p' # command to use to archive a logfile segment
archive_mode = on # enables archiving; off, on, or always
checkpoint_completion_target = 0.7 # checkpoint target duration, 0.0 - 1.0
default_text_search_config = 'pg_catalog.english' #
datestyle = 'iso, mdy' #
effective_cache_size = 120GB #
hot_standby = on # "on" allows queries during recovery
lc_messages = 'en_US.UTF-8' # locale for system error message
lc_monetary = 'en_US.UTF-8' # locale for monetary formatting
lc_numeric = 'en_US.UTF-8' # locale for number formatting
lc_time = 'en_US.UTF-8' # locale for time formatting
listen_addresses = '*' # defaults to 'localhost', '*' = all
log_autovacuum_min_duration = 1000ms # -1 disables, 0 logs all actions and
log_checkpoints = on #
log_line_prefix = '%t [%r] [%p]: [%l-1] db=%d,user=%u ' #
log_lock_waits = on # log lock waits >= deadlock_timeout
log_min_duration_statement = 1000ms # -1 is disabled, 0 logs all statements
log_statement = 'ddl' # none, ddl, mod, all
log_temp_files = 1024kB # log temporary files equal or larger
maintenance_work_mem = 2GB #
max_connections = 220 #
max_parallel_workers_per_gather = 8 # taken from max_worker_processes
max_wal_size = 2GB #
min_wal_size = 1GB #
pg_stat_statements.max = 1 #
pg_stat_statements.track = all #
port = 5432 # port number which Postgres listen
shared_buffers = 10GB #
shared_preload_libraries = 'pg_stat_statements' # (change requires restart)
synchronous_standby_names = '1 ( "usnyh2" )' # comment out during upgrade
track_activity_query_size = 16384 # (change requires restart)
track_io_timing = on #
wal_buffers = 16MB #
wal_keep_segments = 100 #
wal_level = replica # minimal, replica, or logical
work_mem = 600MB #

This server is on Centos 7 and the strange thing is that we see a lot of
swap usage :

[root@usnyh-cmd1 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:         257652        7555        5559       12804      244536      236036
Swap:         16383        7326        9057

7GB used.

But we can't see it with any of the usual commands like top.
I am sure it is postgres because it is the only service running on that
machine.

Is there anything we can do?
On the sync slave, the usage is just 400MB.

Any trick?

Thanks a lot,
Nicola


Re: Swap on postgres master server

2018-10-16 Thread Nicola Contu
No, it is probably not actively used, because I can't find it in any way, as I said.

I run your commands :

[root@usnyh-cmd1 ~]# vmstat 1
procs ---memory-- ---swap-- -io -system--
--cpu-
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id
wa st
 3  2 7505332 14545468  13692 2414367840047   11500 11
2 84  2  0
 3  2 7505332 14490408  13692 24143692800 0   248 6153 2013  3
1 93  3  0
 1  2 7505332 14474672  13692 24143680000 0  1124 4905 1454  3
1 93  3  0
 4  2 7505332 14381156  13692 24143683200 096 5322 1782  2
1 94  3  0
^C
[root@usnyh-cmd1 ~]# ps -A --sort -rss -o comm,pmem | awk '
>   NR == 1 { print; next }
>   { a[$1] += $2 }
>   END {
> for (i in a) {
>   printf "%-15s\t%s\n", i, a[i];
> }
>   }
> '
COMMAND %MEM
kworker/42:20
kworker/60:6H  0
kworker/60:2H  0
kdmwork-253:2  0
ksoftirqd/600
postmaster  15.2
kworker/9:0H0

So I'm just asking why it is still there with free -m and if there is any
way to free that up if it is not used.

Thanks

On Tue, Oct 16, 2018 at 11:18 Hans Schou wrote:

> Are you sure that swap is used actively? Maybe it had just been used
> during backup or something.
>
> Look after SwapIn/SwapOut (si/so) it should be '0'
> $ vmstat 1
> procs ---memory-- ---swap-- -io -system--
> --cpu-
>  r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id
> wa st
>  1  0 12  89344  46608 5863840012 8   30   86  0  0
> 99  0  0
>
> If you want to see the amount of ram used by each program with childs run
> this:
> ps -A --sort -rss -o comm,pmem | awk '
>   NR == 1 { print; next }
>   { a[$1] += $2 }
>   END {
>     for (i in a) {
>   printf "%-15s\t%s\n", i, a[i];
> }
>   }
> '
>
>
> On Tue, Oct 16, 2018 at 11:04 AM Nicola Contu 
> wrote:
>
>> Hello,
>> we are running Postgres 10.5 with master slave replication.
>>
>> These are our custom params
>>
>> archive_command = 'pgbackrest --stanza=cmdprod archive-push %p' #
>> command to use to archive a logfile segment
>> archive_mode = on # enables archiving; off, on, or always
>> checkpoint_completion_target = 0.7 # checkpoint target duration, 0.0 -
>> 1.0
>> default_text_search_config = 'pg_catalog.english' #
>> datestyle = 'iso, mdy' #
>> effective_cache_size = 120GB #
>> hot_standby = on# "on" allows queries during
>> recovery
>> lc_messages = 'en_US.UTF-8'  # locale for system
>> error message
>> lc_monetary = 'en_US.UTF-8'  # locale for monetary
>> formatting
>> lc_numeric = 'en_US.UTF-8'  # locale for number
>> formatting
>> lc_time = 'en_US.UTF-8'  # locale for time
>> formatting
>> listen_addresses = '*' # defaults to 'localhost', '*' = all
>> log_autovacuum_min_duration = 1000ms# -1 disables, 0 logs all
>> actions and
>> log_checkpoints = on #
>> log_line_prefix = '%t [%r] [%p]: [%l-1] db=%d,user=%u ' #
>> log_lock_waits = on  # log lock waits >=
>> deadlock_timeout
>> log_min_duration_statement = 1000ms# -1 is disabled, 0 logs all
>> statements
>> log_statement = 'ddl'  # none, ddl, mod, all
>> log_temp_files = 1024kB  # log temporary files equal
>> or larger
>> maintenance_work_mem = 2GB #
>> max_connections = 220 #
>> max_parallel_workers_per_gather = 8# taken from
>> max_worker_processes
>> max_wal_size = 2GB #
>> min_wal_size = 1GB #
>> pg_stat_statements.max = 1 #
>> pg_stat_statements.track = all #
>> port = 5432# port number which Postgres
>> listen
>> shared_buffers = 10GB #
>> shared_preload_libraries = 'pg_stat_statements'  # (change
>> requires restart)
>> synchronous_standby_names = '1 ( "usnyh2" )' # comment out during upgrade
>> track_activity_query_size = 16384# (change requires restart)
>> track_io_timing = on #
>> wal_buffers = 16MB #
>> wal_keep_segments = 100 #
>> wal_level = replica#  minimal, replica, or logical
>> work_mem = 600MB #
>>
>> This server is on Centos 7 and the strange thing is that we see a lot of
>> swap usage :
>>
>> [root@usnyh-cmd1 ~]# free -m
>>   totalusedfree  shared  buff/cache
>>  available
>> Mem: 25765275555559   12804  244536
>> 236036
>> Swap: 1638373269057
>>
>> 7GB used.
>>
>> But can't see it from any of the commands like top etc.
>> I am sure it is postgres because it is the only service running on that
>> machine.
>>
>> Is there anything we can do?
>> On the sync slave, the usage is just 400MB.
>>
>> Any trick?
>>
>> Thanks a lot,
>> Nicola
>>
>


Re: Pgbouncer discard all

2018-10-16 Thread Nicola Contu
Hello,
is this normal? Can anyone help?

Thanks a lot for your help in advance.

Nicola

On Wed, Oct 10, 2018 at 17:03 Nicola Contu <nicola.co...@gmail.com> wrote:

> Hello,
> we are running pgbouncer 1.9.1 connected to postgres 10.5
>
> Sometimes we are seeing a lot of waiting connections with this query :
>
> DISCARD ALL
>
> This is our pgbouncer config :
>
> [databases]
> dev = host=10.10.10.1 port=5432 dbname=dev auth_user=pgbouncer
> pool_size=120
>
> [pgbouncer]
> listen_port = 6543
> listen_addr = *
> auth_type = md5
> auth_file = /etc/pgbouncer/users.txt
> auth_query = select uname,phash from user_lookup($1)
> logfile = /var/log/pgbouncer.log
> pidfile = /home/postgres/pgbouncer.pid
> admin_users = admin
> user = postgres
> max_db_connections = 220
> log_connections = 0
> log_disconnections = 0
>
>
> Do you think this can depend on the server_idle_timeout default config
> value?
>
> Thanks a lot,
> Nicola
>


Re: Swap on postgres master server

2018-10-16 Thread Nicola Contu
Ok, thanks, I will try that.
But do you know if there is any way to avoid this?

vm.swappiness? Or anything in the postgres conf?
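One knob that is often suggested for dedicated database hosts (an assumption
about what may help here, not something verified on this box) is lowering
vm.swappiness so the kernel prefers shrinking the page cache over swapping
process memory:

sysctl vm.swappiness                 # show the current value
sysctl -w vm.swappiness=10           # apply immediately
echo 'vm.swappiness = 10' > /etc/sysctl.d/99-swappiness.conf   # persist across reboots (file name arbitrary)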

On Tue, Oct 16, 2018 at 15:17 Bob Jolliffe wrote:

> I guess you can run swapoff (followed by swapon).  That will free up
> whatever is currently swapped.  Beware if the system is actively
> swapping then swapoff can take some time.  But it seems not in your
> case.
> On Tue, 16 Oct 2018 at 10:48, Nicola Contu  wrote:
> >
> > No it is not probably used, because I can't find it in any way as I said.
> >
> > I run your commands :
> >
> > [root@usnyh-cmd1 ~]# vmstat 1
> > procs ---memory-- ---swap-- -io -system--
> --cpu-
> >  r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy
> id wa st
> >  3  2 7505332 14545468  13692 2414367840047   11500
> 11  2 84  2  0
> >  3  2 7505332 14490408  13692 24143692800 0   248 6153 2013
> 3  1 93  3  0
> >  1  2 7505332 14474672  13692 24143680000 0  1124 4905 1454
> 3  1 93  3  0
> >  4  2 7505332 14381156  13692 24143683200 096 5322 1782
> 2  1 94  3  0
> > ^C
> > [root@usnyh-cmd1 ~]# ps -A --sort -rss -o comm,pmem | awk '
> > >   NR == 1 { print; next }
> > >   { a[$1] += $2 }
> > >   END {
> > > for (i in a) {
> > >   printf "%-15s\t%s\n", i, a[i];
> > > }
> > >   }
> > > '
> > COMMAND %MEM
> > kworker/42:20
> > kworker/60:6H  0
> > kworker/60:2H  0
> > kdmwork-253:2  0
> > ksoftirqd/600
> > postmaster  15.2
> > kworker/9:0H0
> >
> > So I'm just asking why it is still there with free -m and if there is
> any way to free that up if it is not used.
> >
> > Thanks
> >
> > Il giorno mar 16 ott 2018 alle ore 11:18 Hans Schou <
> hans.sc...@gmail.com> ha scritto:
> >>
> >> Are you sure that swap is used actively? Maybe it had just been used
> during backup or something.
> >>
> >> Look after SwapIn/SwapOut (si/so) it should be '0'
> >> $ vmstat 1
> >> procs ---memory-- ---swap-- -io -system--
> --cpu-
> >>  r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy
> id wa st
> >>  1  0 12  89344  46608 5863840012 8   30   86  0  0
> 99  0  0
> >>
> >> If you want to see the amount of ram used by each program with childs
> run this:
> >> ps -A --sort -rss -o comm,pmem | awk '
> >>   NR == 1 { print; next }
> >>   { a[$1] += $2 }
> >>   END {
> >> for (i in a) {
> >>   printf "%-15s\t%s\n", i, a[i];
> >> }
> >>   }
> >> '
> >>
> >>
> >> On Tue, Oct 16, 2018 at 11:04 AM Nicola Contu 
> wrote:
> >>>
> >>> Hello,
> >>> we are running Postgres 10.5 with master slave replication.
> >>>
> >>> These are our custom params
> >>>
> >>> archive_command = 'pgbackrest --stanza=cmdprod archive-push %p' #
> command to use to archive a logfile segment
> >>> archive_mode = on # enables archiving; off, on, or always
> >>> checkpoint_completion_target = 0.7 # checkpoint target duration, 0.0 -
> 1.0
> >>> default_text_search_config = 'pg_catalog.english' #
> >>> datestyle = 'iso, mdy' #
> >>> effective_cache_size = 120GB #
> >>> hot_standby = on# "on" allows queries during
> recovery
> >>> lc_messages = 'en_US.UTF-8'  # locale for system
> error message
> >>> lc_monetary = 'en_US.UTF-8'  # locale for monetary
> formatting
> >>> lc_numeric = 'en_US.UTF-8'  # locale for number
> formatting
> >>> lc_time = 'en_US.UTF-8'  # locale for time
> formatting
> >>> listen_addresses = '*' # defaults to 'localhost', '*' = all
> >>> log_autovacuum_min_duration = 1000ms# -1 disables, 0 logs all
> actions and
> >>> log_checkpoints = on #
> >>> log_line_prefix = '%t [%r] [%p]: [%l-1] db=%d,user=%u ' #
> >>> log_lock_waits = on  # log lock waits >=
> deadlock_timeout
> >>> log_min_duration_statement = 1000ms# -1 is disabled, 0 lo

Re: Pgbouncer discard all

2018-10-16 Thread Nicola Contu
-[ RECORD 1 ]----+------------------------------
datid            | 2492534808
datname          | dev
pid              | 7355
usesysid         | 1833427130
usename          | pgbouncer
application_name |
client_addr      | 10.151.2.145
client_hostname  |
client_port      | 60570
backend_start    | 2018-10-16 14:13:05.151015+00
xact_start       |
query_start      | 2018-10-16 15:10:40.309993+00
state_change     | 2018-10-16 15:10:40.310038+00
wait_event_type  | Client
wait_event       | ClientRead
state            | idle
backend_xid      |
backend_xmin     |
query            | DISCARD ALL

They are idle, actually.
Will they count as client connections against the total amount set in
postgresql.conf?

On Tue, Oct 16, 2018 at 16:22 Martín Marqués <martin.marq...@2ndquadrant.com> wrote:

> El 16/10/18 a las 09:59, Scot Kreienkamp escribió:
> > Are you sure they’re actually waiting?  Don’t forget 10.5 will show the
> > last query executed even if the connection is idle.  I believe discard
> > all would be the last command the pgbouncer would send to the database
> > when the client is done as it resets the connection for the next
> > client.  So what you’re describing would seem to be expected behavior.
>
> He might have been referring to client waiting. That is visible in the
> pgbouncer pseudo-database
>
> OTOH if the the waiting is seen in pg_stat_activity, then pgbouncer has
> nothing to do. The connection has already been assigned to the client
> and the waiting is happening on the database server, not the pooler.
>
> Regards,
>
> --
> Martín Marquéshttp://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>
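As a side note, client-side waiting is easiest to see from pgbouncer's own
admin console rather than from pg_stat_activity. A sketch, connecting to the
special pgbouncer database on the listen port with one of the admin_users
(cl_active/cl_waiting in SHOW POOLS are the clients currently being served
and the clients waiting for a server connection):

psql -h 127.0.0.1 -p 6543 -U admin pgbouncer
SHOW POOLS;
SHOW SERVERS;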


Matview size - space increased on concurrently refresh

2019-07-12 Thread Nicola Contu
Hello,
we noticed that, for a simple matview we have, refreshing it with the
CONCURRENTLY option always increases the space used by about 120MB.
This only happens if I am reading from that matview and refreshing it at the
same time.

cmdv3=# SELECT
pg_size_pretty(pg_relation_size('public.matview_nm_connections'::regclass));
pg_size_pretty

133 MB
(1 row)

cmdv3=# refresh materialized view matview_nm_connections;
REFRESH MATERIALIZED VIEW
cmdv3=# SELECT
pg_size_pretty(pg_relation_size('public.matview_nm_connections'::regclass));
pg_size_pretty

133 MB
(1 row)

cmdv3=# \! date
Fri Jul 12 13:52:51 GMT 2019

cmdv3=# refresh materialized view matview_nm_connections;
REFRESH MATERIALIZED VIEW
cmdv3=# SELECT
pg_size_pretty(pg_relation_size('public.matview_nm_connections'::regclass));
pg_size_pretty

133 MB
(1 row)


Let's try concurrently.

cmdv3=# refresh materialized view CONCURRENTLY matview_nm_connections;
REFRESH MATERIALIZED VIEW
cmdv3=# SELECT
pg_size_pretty(pg_relation_size('public.matview_nm_connections'::regclass));
pg_size_pretty

261 MB
(1 row)


So the matview is not really used and it does not have anything strange, but
that matview grew to 12GB as we refresh it once an hour.
It had the free percent at 97%.
I understand that with CONCURRENTLY it needs to take a copy of the data while
reading, but this seems to be too much on the space side.

Is this a bug? Or can anyone help us understand this?

Thanks a lot,
Nicola


Re: Matview size - space increased on concurrently refresh

2019-07-12 Thread Nicola Contu
P.S.: I am on postgres 11.3

On Fri, Jul 12, 2019 at 16:32 Nicola Contu <nicola.co...@gmail.com> wrote:

> Hello,
> we noticed with a simple matview we have that refreshing it using the
> concurrently item the space always increases of about 120MB .
> This only happens if I am reading from that matview and at the same time I
> am am refreshing it.
>
> cmdv3=# SELECT
> pg_size_pretty(pg_relation_size('public.matview_nm_connections'::regclass));
> pg_size_pretty
> 
> 133 MB
> (1 row)
>
> cmdv3=# refresh materialized view matview_nm_connections;
> REFRESH MATERIALIZED VIEW
> cmdv3=# SELECT
> pg_size_pretty(pg_relation_size('public.matview_nm_connections'::regclass));
> pg_size_pretty
> 
> 133 MB
> (1 row)
>
> cmdv3=# \! date
> Fri Jul 12 13:52:51 GMT 2019
>
> cmdv3=# refresh materialized view matview_nm_connections;
> REFRESH MATERIALIZED VIEW
> cmdv3=# SELECT
> pg_size_pretty(pg_relation_size('public.matview_nm_connections'::regclass));
> pg_size_pretty
> 
> 133 MB
> (1 row)
>
>
> Let's try concurrently.
>
> cmdv3=# refresh materialized view CONCURRENTLY matview_nm_connections;
> REFRESH MATERIALIZED VIEW
> cmdv3=# SELECT
> pg_size_pretty(pg_relation_size('public.matview_nm_connections'::regclass));
> pg_size_pretty
> 
> 261 MB
> (1 row)
>
>
> So the matview is not really used and it does not have anything strange
> but that matview growth to 12GB as we refresh it once an hour.
> It had the free percent at 97%.
> I understand with concurrenlty it needs to take copy of the data while
> reading, but this seems to be too much on the space side.
>
> Is this a bug? Or is there anyone can help us understanding this?
>
> Thanks a lot,
> Nicola
>


Re: Matview size - space increased on concurrently refresh

2019-07-14 Thread Nicola Contu
It does not. That's the issue.
It always increases by 120MB, and it reached 12GB instead of just 180MB.

On Sun, Jul 14, 2019, 21:34 Kaixi Luo wrote:

>
> On Fri, Jul 12, 2019 at 4:34 PM Nicola Contu 
> wrote:
>
>> P.S.: I am on postgres 11.3
>>
>> Il giorno ven 12 lug 2019 alle ore 16:32 Nicola Contu <
>> nicola.co...@gmail.com> ha scritto:
>>
>>> Hello,
>>> we noticed with a simple matview we have that refreshing it using the
>>> concurrently item the space always increases of about 120MB .
>>> This only happens if I am reading from that matview and at the same time
>>> I am am refreshing it.
>>>
>>> cmdv3=# SELECT
>>> pg_size_pretty(pg_relation_size('public.matview_nm_connections'::regclass));
>>> pg_size_pretty
>>> 
>>> 133 MB
>>> (1 row)
>>>
>>> cmdv3=# refresh materialized view matview_nm_connections;
>>> REFRESH MATERIALIZED VIEW
>>> cmdv3=# SELECT
>>> pg_size_pretty(pg_relation_size('public.matview_nm_connections'::regclass));
>>> pg_size_pretty
>>> 
>>> 133 MB
>>> (1 row)
>>>
>>> cmdv3=# \! date
>>> Fri Jul 12 13:52:51 GMT 2019
>>>
>>> cmdv3=# refresh materialized view matview_nm_connections;
>>> REFRESH MATERIALIZED VIEW
>>> cmdv3=# SELECT
>>> pg_size_pretty(pg_relation_size('public.matview_nm_connections'::regclass));
>>> pg_size_pretty
>>> 
>>> 133 MB
>>> (1 row)
>>>
>>>
>>> Let's try concurrently.
>>>
>>> cmdv3=# refresh materialized view CONCURRENTLY matview_nm_connections;
>>> REFRESH MATERIALIZED VIEW
>>> cmdv3=# SELECT
>>> pg_size_pretty(pg_relation_size('public.matview_nm_connections'::regclass));
>>> pg_size_pretty
>>> 
>>> 261 MB
>>> (1 row)
>>>
>>>
>>> So the matview is not really used and it does not have anything strange
>>> but that matview growth to 12GB as we refresh it once an hour.
>>> It had the free percent at 97%.
>>> I understand with concurrenlty it needs to take copy of the data while
>>> reading, but this seems to be too much on the space side.
>>>
>>> Is this a bug? Or is there anyone can help us understanding this?
>>>
>>> Thanks a lot,
>>> Nicola
>>>
>>
> This is normal and something to be expected. When refreshing the
> materialized view, the new data is written to a disk and then the two
> tables are diffed. After the refresh finishes, your view size should go
> back to normal.
>
>


Re: Matview size - space increased on concurrently refresh

2019-07-17 Thread Nicola Contu
On Sun, Jul 14, 2019 at 22:23 Tom Lane wrote:

> [ please do not top-post in your replies, it makes the conversation hard
>   to follow ]
>
> Nicola Contu  writes:
> > Il dom 14 lug 2019, 21:34 Kaixi Luo  ha scritto:
> >> This is normal and something to be expected. When refreshing the
> >> materialized view, the new data is written to a disk and then the two
> >> tables are diffed. After the refresh finishes, your view size should go
> >> back to normal.
>
> > It does not. That's the issue.
> > It always increases of 120mb and it reached 12gb instead of just 180mb.
>
> A concurrent matview refresh will necessarily leave behind two copies
> of any rows it changes, just like any other row-update operation in
> Postgres.  Once there are no concurrent transactions that can "see"
> the old row copies, they should be reclaimable by vacuum.
>
> Since you're not seeing autovacuum reclaim the space automatically,
> I hypothesize that you've got autovacuum turned off or dialed down
> to unrealistically non-aggressive settings.  Or possibly you have
> old open transactions that are preventing reclaiming dead rows
> (because they can still possibly "see" those rows).  Either of those
> explanations should imply that you're getting similar bloat in every
> other table and matview, though.
>
> You might want to look into pg_stat_all_tables to see what it says
> about the last_autovacuum time etc. for that matview.  Another source
> of insight is to do a manual "vacuum verbose" on the matview and see
> what that says about removable and nonremovable rows.
>
> regards, tom lane
>

This matview has nothing strange and nothing custom.
We can reproduce this with this matview, which is not used by anyone.

cmdv3=# vacuum (full,analyze,verbose) public.matview_nm_connections;
INFO:  vacuuming "public.matview_nm_connections"
INFO:  "matview_nm_connections": found 0 removable, 295877 nonremovable row
versions in 33654 pages
DETAIL:  0 dead row versions cannot be removed yet.
CPU: user: 0.45 s, system: 0.43 s, elapsed: 1.24 s.
INFO:  analyzing "public.matview_nm_connections"
INFO:  "matview_nm_connections": scanned 16986 of 16986 pages, containing
295877 live rows and 0 dead rows; 3 rows in sample, 295877 estimated
total rows
VACUUM

This is an example of a full, verbose vacuum. Every time I refresh it, the
size increases.

See stats from the pg_stat_all_tables :

cmdv3=# select * from pg_stat_all_tables where relname = 'matview_nm_connections';
-[ RECORD 1 ]-------+------------------------------
relid               | 3466831733
schemaname          | public
relname             | matview_nm_connections
seq_scan            | 3725
seq_tup_read        | 540992219
idx_scan            | 33235
idx_tup_fetch       | 255113
n_tup_ins           | 96874161
n_tup_upd           | 0
n_tup_del           | 95692276
n_tup_hot_upd       | 0
n_live_tup          | 295877
n_dead_tup          | 0
n_mod_since_analyze | 0
last_vacuum         | 2019-07-12 11:58:39.198049+00
last_autovacuum     | 2019-07-16 11:07:02.765612+00
last_analyze        | 2019-07-17 10:28:08.819679+00
last_autoanalyze    | 2019-07-16 11:03:32.489573+00
vacuum_count        | 5
autovacuum_count    | 29
analyze_count       | 11
autoanalyze_count   | 17


ERROR: too many dynamic shared memory segments

2019-09-11 Thread Nicola Contu
Hello,
We are running postgres 11.5 and in the last two weeks we did :

- upgrade of postgres to 11.5 from 11.4
- increased shared_buffers to 1/3 of the memory
- increased effective_cache_size = 160GB from 120
- increased checkpoint_completion_target = 0.9 from 0.7
- increased checkpoint_timeout = 1h
- increased work_mem = 2GB (this can be set up to 4GB) from 600MB

Since then, in the last two weeks we have seen an increase in this error:

ERROR: too many dynamic shared memory segments

Is there any relation between this error and those parameters, or the pgsql 11.5 version?

Any help can be appreciated.

Thank you,
Nicola


Re: ERROR: too many dynamic shared memory segments

2019-09-11 Thread Nicola Contu
Hello,
We did not see any other error in the logs, just that one.
Unfortunately we had problems installing updates on this machine, and we have
not installed updates for a few months.

Do you think that can be the issue? We are running Centos 7.

I will look into those parameters as well.

Thanks for your feedback


On Wed, Sep 11, 2019 at 09:56 Mickael Franc (mickaelf) <micka...@cisco.com> wrote:

> Hello,
>
> Maybe a change needed in a kernel parameter such "kernel.shmmax" or
> "kernel.shmall" to allow by kernel to provide a huge shared-mem.
> Did you see any other error in logs ? Have you upgrade your kernel since
> last two weeks ?
>
> Best,
> --
> *From:* Nicola Contu 
> *Sent:* Wednesday, September 11, 2019 09:47
> *To:* pgsql-generallists.postgresql.org <
> pgsql-general@lists.postgresql.org>
> *Subject:* ERROR: too many dynamic shared memory segments
> *Objet :* ERROR: too many dynamic shared memory segments
>
> Hello,
> We are running postgres 11.5 and in the last two weeks we did :
>
> - upgrade of postgres to 11.5 from 11.4
> - increased shared_buffer to 1/3 of the memory
> - increased effective_cache_size = 160GB from 120
> - increased checkpoint_completion_target = 0.9 from 0.7
> - increased checkpoint_timeout = 1h
> - increased work_mem = 2GB (this can be set up to 4GB) from 600MB
>
> Since that, in the last two weeks we saw an increment of this error :
>
> ERROR: too many dynamic shared memory segments
>
> Is there any relation between these parameters or the pgsql 11.5 version?
>
> Any help can be appreciated.
>
> Thank you,
> Nicola
>


Re: ERROR: too many dynamic shared memory segments

2019-09-11 Thread Nicola Contu
If the error persists I will try to revert the work_mem change.
Thanks a lot.

On Wed, Sep 11, 2019 at 10:10 Pavel Stehule <pavel.steh...@gmail.com> wrote:

> Hi
>
> st 11. 9. 2019 v 9:48 odesílatel Nicola Contu 
> napsal:
>
>> Hello,
>> We are running postgres 11.5 and in the last two weeks we did :
>>
>> - upgrade of postgres to 11.5 from 11.4
>> - increased shared_buffer to 1/3 of the memory
>> - increased effective_cache_size = 160GB from 120
>> - increased checkpoint_completion_target = 0.9 from 0.7
>> - increased checkpoint_timeout = 1h
>> - increased work_mem = 2GB (this can be set up to 4GB) from 600MB
>>
>> Since that, in the last two weeks we saw an increment of this error :
>>
>> ERROR: too many dynamic shared memory segments
>>
>> Is there any relation between these parameters or the pgsql 11.5 version?
>>
>
> I expect it can be related to increasing work_mem - maybe parallel hash
> join was used
>
> Regards
>
> Pavel
>
>>
>> Any help can be appreciated.
>>
>> Thank you,
>> Nicola
>>
>


Re: ERROR: too many dynamic shared memory segments

2020-01-21 Thread Nicola Contu
Hey Thomas,
after a few months, we started having this issue again.
So we reverted the work_mem parameter to 600MB instead of 2GB.
But the issue is still there. A query hit a segmentation fault, the DB went
into recovery mode, and our app went read-only for a few minutes.

I understand we can increase max_connections so we can have many more
segments.

My question is: is there a way to see how many segments we have reached?
Currently we have max_connections = 220, so by your formula (64 + 2 * 220) we
have about 500 shared memory segment slots.
We would like to increase that number to 300 or 400 connections, but it would
be great to know in advance whether that will solve the issue, since it
requires a restart of the service.
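
(Computed directly, just as a sketch based on that formula, nothing more
authoritative than what is quoted in this thread:)

psql -U postgres -At -c "SELECT 64 + 2 * current_setting('max_connections')::int AS dsm_segment_slots;"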

I know you were also talking about a redesign of this part of PostgreSQL. Do
you know if anything has changed in any of the newer versions after 11.5?

Thanks a lot,
Nicola





Il giorno gio 12 set 2019 alle ore 01:01 Thomas Munro <
thomas.mu...@gmail.com> ha scritto:

> On Wed, Sep 11, 2019 at 11:20 PM Nicola Contu 
> wrote:
> > If the error persist I will try to revert the work_mem.
> > Thanks a lot
>
> Hi Nicola,
>
> It's hard to say exactly what the cause of the problem is in your case
> and how to avoid it, without knowing what your query plans look like.
> PostgreSQL allows 64 + 2 * max_connections segments to exist a time,
> and it needs a number of them that depends on work_mem (in the case of
> Parallel Hash Join and Parallel Bitmap Index Scan), and also depends
> on the number of Gather nodes that appear in the plan, which in some
> unusual cases can result from partitioning.
>
> I've seen people reaching this error by running a lot of parallel
> queries concurrently.  If that's the cause, then you can definitely
> get some relief by turning work_mem down, or by turning
> max_connections up (even though you don't want to allow more
> connections -- because it influences the formula for deciding on the
> DSM segment limit).  We should probably adjust some of the internal
> constants to give us more slots, to avoid that problem, as discussed
> here:
>
>
> https://www.postgresql.org/message-id/flat/CA%2BhUKGL6H2BpGbiF7Lj6QiTjTGyTLW_vLR%3DSn2tEBeTcYXiMKw%40mail.gmail.com
>
> I've also seen people reaching this error by somehow coming up with
> plans that had a very large number of Gather nodes in them,
> corresponding to partitions; that's probably a bad plan (it'd
> presumably be better to terminate parallelism higher up in the plan,
> but these plans do seem to exist in the wild; I don't recall exactly
> why).  I think we need a bit of a redesign so that if there are
> multiple Gather nodes, they share the same main DSM segment, instead
> of blowing through this limit.
>
> --
> Thomas Munro
> https://enterprisedb.com
>


Re: ERROR: too many dynamic shared memory segments

2020-01-21 Thread Nicola Contu
We also reverted this param :

cmdv3=# show max_parallel_workers_per_gather;
 max_parallel_workers_per_gather
---------------------------------
 2
(1 row)

It was set to 8.
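
(For what it is worth, the same cap could also be applied per session instead
of globally; this is a sketch only, reusing the matview name from the log
above, not something we have tested:)

psql -U postgres -d cmdv3 -c "SET max_parallel_workers_per_gather = 0;" -c "REFRESH MATERIALIZED VIEW CONCURRENTLY matview_vrs_request_stats;"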


Il giorno mar 21 gen 2020 alle ore 16:06 Nicola Contu <
nicola.co...@gmail.com> ha scritto:

> Hey Thomas,
> after a few months, we started having this issue again.
> So we revert the work_mem parameter to 600MB instead of 2GB.
> But the issue is still there. A query went to segmentation fault, the DB
> went to recovery mode and our app went to read only for a few minutes.
>
> I understand we can increase max_connections so we can have many more
> segments.
>
> My question is : is there a way to understand the number of segments we
> reached?
> Currently we have 220 max_conn so as your formula is 64 + 2* 220 we have
> about 500 shared segments.
> We would like to increase that number to 300 or 400 but would be great to
> understand if there is a way to make sure we will solve the issue as it
> requires a restart of the service.
>
> I know you were also talking about a redesign this part in PostgreSQL. Do
> you know if anything has changed in any of the newer versions after 11.5?
>
> Thanks a lot,
> Nicola
>
>
>
>
>
> Il giorno gio 12 set 2019 alle ore 01:01 Thomas Munro <
> thomas.mu...@gmail.com> ha scritto:
>
>> On Wed, Sep 11, 2019 at 11:20 PM Nicola Contu 
>> wrote:
>> > If the error persist I will try to revert the work_mem.
>> > Thanks a lot
>>
>> Hi Nicola,
>>
>> It's hard to say exactly what the cause of the problem is in your case
>> and how to avoid it, without knowing what your query plans look like.
>> PostgreSQL allows 64 + 2 * max_connections segments to exist a time,
>> and it needs a number of them that depends on work_mem (in the case of
>> Parallel Hash Join and Parallel Bitmap Index Scan), and also depends
>> on the number of Gather nodes that appear in the plan, which in some
>> unusual cases can result from partitioning.
>>
>> I've seen people reaching this error by running a lot of parallel
>> queries concurrently.  If that's the cause, then you can definitely
>> get some relief by turning work_mem down, or by turning
>> max_connections up (even though you don't want to allow more
>> connections -- because it influences the formula for deciding on the
>> DSM segment limit).  We should probably adjust some of the internal
>> constants to give us more slots, to avoid that problem, as discussed
>> here:
>>
>>
>> https://www.postgresql.org/message-id/flat/CA%2BhUKGL6H2BpGbiF7Lj6QiTjTGyTLW_vLR%3DSn2tEBeTcYXiMKw%40mail.gmail.com
>>
>> I've also seen people reaching this error by somehow coming up with
>> plans that had a very large number of Gather nodes in them,
>> corresponding to partitions; that's probably a bad plan (it'd
>> presumably be better to terminate parallelism higher up in the plan,
>> but these plans do seem to exist in the wild; I don't recall exactly
>> why).  I think we need a bit of a redesign so that if there are
>> multiple Gather nodes, they share the same main DSM segment, instead
>> of blowing through this limit.
>>
>> --
>> Thomas Munro
>> https://enterprisedb.com
>>
>


Re: ERROR: too many dynamic shared memory segments

2020-01-29 Thread Nicola Contu
Hello, may I ask you for a feedback?
Thanks a lot

Il giorno mar 21 gen 2020 alle ore 17:14 Nicola Contu <
nicola.co...@gmail.com> ha scritto:

> We also reverted this param :
>
> cmdv3=# show max_parallel_workers_per_gather;
>  max_parallel_workers_per_gather
> -
>  2
> (1 row)
>
> It was set to 8.
>
>
> Il giorno mar 21 gen 2020 alle ore 16:06 Nicola Contu <
> nicola.co...@gmail.com> ha scritto:
>
>> Hey Thomas,
>> after a few months, we started having this issue again.
>> So we revert the work_mem parameter to 600MB instead of 2GB.
>> But the issue is still there. A query went to segmentation fault, the DB
>> went to recovery mode and our app went to read only for a few minutes.
>>
>> I understand we can increase max_connections so we can have many more
>> segments.
>>
>> My question is : is there a way to understand the number of segments we
>> reached?
>> Currently we have 220 max_conn so as your formula is 64 + 2* 220 we have
>> about 500 shared segments.
>> We would like to increase that number to 300 or 400 but would be great to
>> understand if there is a way to make sure we will solve the issue as it
>> requires a restart of the service.
>>
>> I know you were also talking about a redesign this part in PostgreSQL. Do
>> you know if anything has changed in any of the newer versions after 11.5?
>>
>> Thanks a lot,
>> Nicola
>>
>>
>>
>>
>>
>> Il giorno gio 12 set 2019 alle ore 01:01 Thomas Munro <
>> thomas.mu...@gmail.com> ha scritto:
>>
>>> On Wed, Sep 11, 2019 at 11:20 PM Nicola Contu 
>>> wrote:
>>> > If the error persist I will try to revert the work_mem.
>>> > Thanks a lot
>>>
>>> Hi Nicola,
>>>
>>> It's hard to say exactly what the cause of the problem is in your case
>>> and how to avoid it, without knowing what your query plans look like.
>>> PostgreSQL allows 64 + 2 * max_connections segments to exist a time,
>>> and it needs a number of them that depends on work_mem (in the case of
>>> Parallel Hash Join and Parallel Bitmap Index Scan), and also depends
>>> on the number of Gather nodes that appear in the plan, which in some
>>> unusual cases can result from partitioning.
>>>
>>> I've seen people reaching this error by running a lot of parallel
>>> queries concurrently.  If that's the cause, then you can definitely
>>> get some relief by turning work_mem down, or by turning
>>> max_connections up (even though you don't want to allow more
>>> connections -- because it influences the formula for deciding on the
>>> DSM segment limit).  We should probably adjust some of the internal
>>> constants to give us more slots, to avoid that problem, as discussed
>>> here:
>>>
>>>
>>> https://www.postgresql.org/message-id/flat/CA%2BhUKGL6H2BpGbiF7Lj6QiTjTGyTLW_vLR%3DSn2tEBeTcYXiMKw%40mail.gmail.com
>>>
>>> I've also seen people reaching this error by somehow coming up with
>>> plans that had a very large number of Gather nodes in them,
>>> corresponding to partitions; that's probably a bad plan (it'd
>>> presumably be better to terminate parallelism higher up in the plan,
>>> but these plans do seem to exist in the wild; I don't recall exactly
>>> why).  I think we need a bit of a redesign so that if there are
>>> multiple Gather nodes, they share the same main DSM segment, instead
>>> of blowing through this limit.
>>>
>>> --
>>> Thomas Munro
>>> https://enterprisedb.com
>>>
>>


Re: ERROR: too many dynamic shared memory segments

2020-01-29 Thread Nicola Contu
This is the error on postgres log of the segmentation fault :

2020-01-21 14:20:29 GMT [] [4]: [108-1] db=,user= LOG:  server process
(PID 2042) was terminated by signal 11: Segmentation fault
2020-01-21 14:20:29 GMT [] [4]: [109-1] db=,user= DETAIL:  Failed
process was running: select pid from pg_stat_activity where query ilike
'REFRESH MATERIALIZED VIEW CONCURRENTLY matview_vrs_request_stats'
2020-01-21 14:20:29 GMT [] [4]: [110-1] db=,user= LOG:  terminating any
other active server processes
2020-01-21 14:20:29 GMT [127.0.0.1(34256)] [2724]: [4-1]
db=cmdv3,user=admin WARNING:  terminating connection because of crash of
another server process
2020-01-21 14:20:29 GMT [127.0.0.1(34256)] [2724]: [5-1]
db=cmdv3,user=admin DETAIL:  The postmaster has commanded this server
process to roll back the current transaction and exit, because another
server process exited abnormally and possibly corrupted shared memory.
2020-01-21 14:20:29 GMT [127.0.0.1(34256)] [2724]: [6-1]
db=cmdv3,user=admin HINT:  In a moment you should be able to reconnect to
the database and repeat your command.
2020-01-21 14:20:29 GMT [127.0.0.1(34026)] [2055]: [5-1]
db=cmdv3,user=admin WARNING:  terminating connection because of crash of
another server process

At CentOS level :
Jan 21 14:20:29 usnyh-cmd1.gt-t.net kernel: postmaster[2042]: segfault at 0
ip 0048bef4 sp 7ffdf4955bb0 error 4 in postgres[40+6c5000]
Jan 21 14:20:29 usnyh-cmd1.gt-t.net pgbouncer: 2020-01-21 14:20:29.905 GMT
[33727] WARNING C-0x21526e8: cmdv3/admin@10.151.2.154:39688 pooler error:
server conn crashed?
Jan 21 14:20:29 usnyh-cmd1.gt-t.net pgbouncer: 2020-01-21 14:20:29.905 GMT
[33727] WARNING S-0x1f5ff58: cmdv3/admin@127.0.0.1:5432 got packet 'N' from
server when not linked
Jan 21 14:20:29 usnyh-cmd1.gt-t.net pgbouncer: 2020-01-21 14:20:29.905 GMT
[33727] WARNING S-0x25097f0: cmdv3/admin@127.0.0.1:5432 got packet 'N' from
server when not linked
Jan 21 14:20:29 usnyh-cmd1.gt-t.net pgbouncer: 2020-01-21 14:20:29.906 GMT
[33727] WARNING S-0x2508b60: cmdv3/admin@127.0.0.1:5432 got packet 'N' from
server when not linked
Jan 21 14:20:29 usnyh-cmd1.gt-t.net pgbouncer: 2020-01-21 14:20:29.981 GMT
[33727] ERROR S: login failed: FATAL: the database system is in recovery
mode


> If you're on Linux, you can probably see them with "ls /dev/shm".

I see a lot of files there, and when I cat them they appear empty. What can I
do with them?
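
(They can at least be counted and compared against the roughly 500 slots
mentioned earlier; a sketch, assuming Linux and the default
dynamic_shared_memory_type = posix:)

ls /dev/shm | grep -c '^PostgreSQL\.'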

Those are two different problems I guess, but they are related because
right before the Segmentation Fault I see a lot of shared segment errors in
the postgres log.

Il giorno mer 29 gen 2020 alle ore 10:09 Thomas Munro <
thomas.mu...@gmail.com> ha scritto:

> On Wed, Jan 22, 2020 at 4:06 AM Nicola Contu 
> wrote:
> > after a few months, we started having this issue again.
> > So we revert the work_mem parameter to 600MB instead of 2GB.
> > But the issue is still there. A query went to segmentation fault, the DB
> went to recovery mode and our app went to read only for a few minutes.
>
> Hi Nicola,
> Hmm, a segmentation fault sounds like a different problem.  Can you
> please share the exact error messages from PostgreSQL and OS logs?
>
> > I understand we can increase max_connections so we can have many more
> segments.
> >
> > My question is : is there a way to understand the number of segments we
> reached?
>
> If you're on Linux, you can probably see them with "ls /dev/shm".
>
> > Currently we have 220 max_conn so as your formula is 64 + 2* 220 we have
> about 500 shared segments.
> > We would like to increase that number to 300 or 400 but would be great
> to understand if there is a way to make sure we will solve the issue as it
> requires a restart of the service.
> >
> > I know you were also talking about a redesign this part in PostgreSQL.
> Do you know if anything has changed in any of the newer versions after 11.5?
>
> It's possible that we should increase a couple of constants used the
> formula -- I'll look into that again.  But first I'd like to see if
> we're even investigating the right problem here.
>


Re: ERROR: too many dynamic shared memory segments

2020-01-31 Thread Nicola Contu
Hi Thomas,
unfortunately I can't find any core dump to help you more.
Thanks for the fix. We are in the process of installing 12.1 in production,
so we can hold off on that release and go live with 12.2 instead.

I will let you know if I still get this after installing 12.2, and I will try
to capture a core dump file at that point.

Do you still recommend increasing max_connections?


Il giorno gio 30 gen 2020 alle ore 23:41 Thomas Munro <
thomas.mu...@gmail.com> ha scritto:

> On Wed, Jan 29, 2020 at 11:53 PM Thomas Munro 
> wrote:
> > On Wed, Jan 29, 2020 at 10:37 PM Nicola Contu 
> wrote:
> > > This is the error on postgres log of the segmentation fault :
> > >
> > > 2020-01-21 14:20:29 GMT [] [4]: [108-1] db=,user= LOG:  server
> process (PID 2042) was terminated by signal 11: Segmentation fault
> > > 2020-01-21 14:20:29 GMT [] [4]: [109-1] db=,user= DETAIL:  Failed
> process was running: select pid from pg_stat_activity where query ilike
> 'REFRESH MATERIALIZED VIEW CONCURRENTLY matview_vrs_request_stats'
> > > 2020-01-21 14:20:29 GMT [] [4]: [110-1] db=,user= LOG:
> terminating any other active server processes
>
> > That gave me an idea...  I hacked my copy of PostgreSQL to flip a coin
> > to decide whether to pretend there are no slots free (see below), and
> > I managed to make it crash in the regression tests when doing a
> > parallel index build.  It's late here now, but I'll look into that
> > tomorrow.  It's possible that the parallel index code needs to learn
> > to cope with that.
>
> Hi Nicola,
>
> Without more information I can't know if I found the same bug you
> experienced, but I think it's likely.  I have committed a fix for
> that, which will be available in the next release (mid-February).
>
>
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=74618e77b43cfce670b4725d5b9a300a2afd12d1
>


Streaming replication - 11.5

2020-03-10 Thread Nicola Contu
Hello,
I have two servers connected to the same switch running postgres 11.5

I am trying to re-replicate one of those servers after planned work on the
master, since the replica has been lost. It has always worked, but now I get
this:

pg_basebackup: could not receive data from WAL stream: server closed the
connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

I don't really understand what the issue is.
I had this issue last week as well in another DC, and I had to reboot the
slave to make it work (not sure why that helped)

Do you know what can cause this?

Thank you,
Nicola


Re: Streaming replication - 11.5

2020-03-11 Thread Nicola Contu
user= LOG:  recovery
restart point at 643B/12680A80
2020-03-11 09:47:14 GMT [] [12598]: [4530-1] db=,user= DETAIL:  Last
completed transaction was at log time 2020-03-11 09:47:14.069731+00.
2020-03-11 09:47:44 GMT [] [12598]: [4531-1] db=,user= LOG:  restartpoint
starting: time



Il giorno mer 11 mar 2020 alle ore 01:53 Adrian Klaver <
adrian.kla...@aklaver.com> ha scritto:

> On 3/10/20 8:17 AM, Nicola Contu wrote:
> Please post to list also.
> Ccing list.
>
> What came immediately before the temporary file error?
>
> >2020-03-10 15:10:17 GMT [[local]] [28171]: [1-1]
> > db=postgres,user=postgres LOG:  temporary file: path
> > "base/pgsql_tmp/pgsql_tmp28171.0", size 382474936
> > 2020-03-10 15:10:17 GMT [[local]] [28171]: [4-1]
> > db=postgres,user=postgres LOG:  could not send data to client: Broken
> pipe
> > 2020-03-10 15:10:17 GMT [[local]] [28171]: [5-1]
> > db=postgres,user=postgres FATAL:  connection to client lost
> > 2020-03-10 15:10:26 GMT [] [12598]: [3544-1] db=,user= LOG:
> >   restartpoint complete: wrote 37315 buffers (0.4%); 0 WAL file(s)
> > added, 0 removed, 16 recycled; write=269.943 s, sync=0.039 s,
> > total=269.999 s; sync files=1010, longest=0.001 s, average=0.000 s;
> > distance=175940 kB, estimate=416149 kB
> > 2020-03-10 15:10:26 GMT [] [12598]: [3545-1] db=,user= LOG:  recovery
> > restart point at 6424/1D7DEDE8
> >
> > It is a cascade replication
> >
> > Il giorno mar 10 mar 2020 alle ore 15:58 Adrian Klaver
> > mailto:adrian.kla...@aklaver.com>> ha
> scritto:
> >
> > On 3/10/20 2:26 AM, Nicola Contu wrote:
> >  > Hello,
> >  > I have two servers connected to the same switch running postgres
> 11.5
> >  >
> >  > I am trying to replicate one of those servers after a planned
> > work on
> >  > the master, so the replica has been lost. It has always worked
> > but now I
> >  > get this :
> >  >
> >  > pg_basebackup: could not receive data from WAL stream: server
> > closed the
> >  > connection unexpectedly
> >  >  This probably means the server terminated abnormally
> >  >  before or while processing the request.
> >  >
> >  > I don't really understand what the issue is.
> >
> > I would start with the logs from the Postgres server you are taking
> the
> > backup from.
> >
> >  > I had this issue last week as well in another DC and I had to
> > reboot the
> >  > slave to make it working (not sure why it helped)
> >  >
> >  > Do you know what can cause this?
> >  >
> >  > Thank you,
> >  > Nicola
> >
> >
> > --
> > Adrian Klaver
> > adrian.kla...@aklaver.com <mailto:adrian.kla...@aklaver.com>
> >
>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>


Re: Streaming replication - 11.5

2020-03-11 Thread Nicola Contu
I am actually cascading.
The master is in nyh, the first slave is in Dallas, and the one having
problems is in Dallas as well, on the same switch as the one replicating
from the master.

It always worked; not sure what is wrong now. We just encrypted the disks on
all servers.


Il mer 11 mar 2020, 18:57 Adrian Klaver  ha
scritto:

> On 3/11/20 2:54 AM, Nicola Contu wrote:
> > These are the lines before
> >
> > 2020-03-11 09:05:08 GMT [127.0.0.1(40214)] [43853]: [1-1]
> > db=cmdv3,user=zabbix_check ERROR:  recovery is in progress
> > 2020-03-11 09:05:08 GMT [127.0.0.1(40214)] [43853]: [2-1]
> > db=cmdv3,user=zabbix_check HINT:  WAL control functions cannot be
> > executed during recovery.
> > 2020-03-11 09:05:08 GMT [127.0.0.1(40214)] [43853]: [3-1]
> > db=cmdv3,user=zabbix_check STATEMENT:  select
> > greatest(0,pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) from
> > pg_stat_replication where client_addr ='10.150.20.22'
> >
> > That query is made by Zabbix. So I stopped the zabbix agent and tested
> > again. But still failing, because of this now :
> >
> > pg_basebackup: starting background WAL receiver
> > pg_basebackup: created temporary replication slot "pg_basebackup_51199"
> > *pg_basebackup: could not receive data from WAL stream: SSL SYSCALL
> > error: EOF detected
> > *^C4699810/504983062 kB (70%), 0/1 tablespace
> > (...ql11/data/base/16401/231363544.2)
>
> So you started over with a pg_basebackup?
>
> Also from below:
>
> 2020-03-11 09:43:53 GMT [10.150.20.22(54906)] [51199]: [1-1]
> db=[unknown],user=replicator LOG:  terminating walsender process due to
> replication timeout
>
> Where are the master and standby in relation to each other network wise?
>
> Intervening firewalls, network latency issues?
>
>
>
> >
> >
> > here the full log starting right before the last try :
> >
> > 2020-03-11 09:22:44 GMT [] [12598]: [4508-1] db=,user= LOG:
> >   restartpoint complete: wrote 19565 buffers (0.2%); 0 WAL file(s)
> > added, 0 removed, 7 recycled; write=270.014 s, sync=0.009 s,
> > total=270.036 s; sync files=804, longest=0.001 s, average=0.000 s;
> > distance=131239 kB, estimate=725998 kB
> > 2020-03-11 09:22:44 GMT [] [12598]: [4509-1] db=,user= LOG:  recovery
> > restart point at 643A/D8C05F70
> > 2020-03-11 09:22:44 GMT [] [12598]: [4510-1] db=,user= DETAIL:  Last
> > completed transaction was at log time 2020-03-11 09:22:44.050084+00.
> > 2020-03-11 09:23:14 GMT [] [12598]: [4511-1] db=,user= LOG:
> >   restartpoint starting: time
> > 2020-03-11 09:27:44 GMT [] [12598]: [4512-1] db=,user= LOG:
> >   restartpoint complete: wrote 17069 buffers (0.2%); 0 WAL file(s)
> > added, 0 removed, 17 recycled; write=269.879 s, sync=0.006 s,
> > total=269.902 s; sync files=811, longest=0.001 s, average=0.000 s;
> > distance=120469 kB, estimate=665445 kB
> > 2020-03-11 09:27:44 GMT [] [12598]: [4513-1] db=,user= LOG:  recovery
> > restart point at 643A/E01AB438
> > 2020-03-11 09:27:44 GMT [] [12598]: [4514-1] db=,user= DETAIL:  Last
> > completed transaction was at log time 2020-03-11 09:27:43.945485+00.
> > 2020-03-11 09:27:44 GMT [] [12598]: [4515-1] db=,user= LOG:
> >   restartpoint starting: force wait
> > 2020-03-11 09:29:24 GMT [10.222.8.2(47834)] [50961]: [1-1]
> > db=cmdv3,user=nis LOG:  duration: 1402.004 ms  statement: SELECT id,
> > name, parent_id, parent, short_name, sales_rep_id FROM mmx_clients;
> > 2020-03-11 09:29:34 GMT [10.222.8.2(47834)] [50961]: [2-1]
> > db=cmdv3,user=nis LOG:  duration: 9493.259 ms  statement: SELECT slid,
> > gnid, sof_id, client_id, product FROM mmx_slids;
> > 2020-03-11 09:32:14 GMT [] [12598]: [4516-1] db=,user= LOG:
> >   restartpoint complete: wrote 71260 buffers (0.8%); 0 WAL file(s)
> > added, 0 removed, 13 recycled; write=269.953 s, sync=0.012 s,
> > total=269.979 s; sync files=760, longest=0.002 s, average=0.000 s;
> > distance=123412 kB, estimate=611242 kB
> > 2020-03-11 09:32:14 GMT [] [12598]: [4517-1] db=,user= LOG:  recovery
> > restart point at 643A/E7A30498
> > 2020-03-11 09:32:14 GMT [] [12598]: [4518-1] db=,user= DETAIL:  Last
> > completed transaction was at log time 2020-03-11 09:32:13.916101+00.
> > 2020-03-11 09:32:44 GMT [] [12598]: [4519-1] db=,user= LOG:
> >   restartpoint starting: time
> > 2020-03-11 09:37:14 GMT [] [12598]: [4520-1] db=,user= LOG:
> >   restartpoint complete: wrote 27130 buffers (0.3%); 0 WAL file(s)
> > added, 0 removed, 12 recycled; write=270.026 s, sync=0.007 s,
> > total=270.052 s; sync files=814, longest=0.001 s, average=0.000 s;
> > distance=28059

Re: Streaming replication - 11.5

2020-03-11 Thread Nicola Contu
CPU load on the server to be built? No.
System logs don't show anything relevant unfortunately

Il mer 11 mar 2020, 21:34 Adrian Klaver  ha
scritto:

> On 3/11/20 11:59 AM, Nicola Contu wrote:
> > I am actually cascading.
> > The master is in nyh, the first slave is in Dallas and the one having
> > problems is in Dallas as well on the same switch of the one replicating
> > from the master.
> >
> > It always worked not sure what is wrong now. We just encrypted disks on
> > all servers
>
> Do you have before and after on CPU load, I/O throughput?
>
> Do system logs show anything relevant during replication drop out?
>
> >
> >
> > Il mer 11 mar 2020, 18:57 Adrian Klaver  > <mailto:adrian.kla...@aklaver.com>> ha scritto:
> >
> > On 3/11/20 2:54 AM, Nicola Contu wrote:
> >  > These are the lines before
> >  >
> >  > 2020-03-11 09:05:08 GMT [127.0.0.1(40214)] [43853]: [1-1]
> >  > db=cmdv3,user=zabbix_check ERROR:  recovery is in progress
> >  > 2020-03-11 09:05:08 GMT [127.0.0.1(40214)] [43853]: [2-1]
> >  > db=cmdv3,user=zabbix_check HINT:  WAL control functions cannot be
> >  > executed during recovery.
> >  > 2020-03-11 09:05:08 GMT [127.0.0.1(40214)] [43853]: [3-1]
> >  > db=cmdv3,user=zabbix_check STATEMENT:  select
> >  > greatest(0,pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) from
> >  > pg_stat_replication where client_addr ='10.150.20.22'
> >  >
> >  > That query is made by Zabbix. So I stopped the zabbix agent and
> > tested
> >  > again. But still failing, because of this now :
> >  >
> >  > pg_basebackup: starting background WAL receiver
> >  > pg_basebackup: created temporary replication slot
> > "pg_basebackup_51199"
> >  > *pg_basebackup: could not receive data from WAL stream: SSL
> SYSCALL
> >  > error: EOF detected
> >  > *^C4699810/504983062 kB (70%), 0/1 tablespace
> >  > (...ql11/data/base/16401/231363544.2)
> >
> > So you started over with a pg_basebackup?
> >
> > Also from below:
> >
> > 2020-03-11 09:43:53 GMT [10.150.20.22(54906)] [51199]: [1-1]
> > db=[unknown],user=replicator LOG:  terminating walsender process due
> to
> > replication timeout
> >
> > Where are the master and standby in relation to each other network
> wise?
> >
> > Intervening firewalls, network latency issues?
> >
> >
> >
> >  >
> >  >
> >  > here the full log starting right before the last try :
> >  >
> >  > 2020-03-11 09:22:44 GMT [] [12598]: [4508-1] db=,user= LOG:
> >  >   restartpoint complete: wrote 19565 buffers (0.2%); 0 WAL file(s)
> >  > added, 0 removed, 7 recycled; write=270.014 s, sync=0.009 s,
> >  > total=270.036 s; sync files=804, longest=0.001 s, average=0.000 s;
> >  > distance=131239 kB, estimate=725998 kB
> >  > 2020-03-11 09:22:44 GMT [] [12598]: [4509-1] db=,user= LOG:
> >   recovery
> >  > restart point at 643A/D8C05F70
> >  > 2020-03-11 09:22:44 GMT [] [12598]: [4510-1] db=,user= DETAIL:
>  Last
> >  > completed transaction was at log time 2020-03-11
> 09:22:44.050084+00.
> >  > 2020-03-11 09:23:14 GMT [] [12598]: [4511-1] db=,user= LOG:
> >  >   restartpoint starting: time
> >  > 2020-03-11 09:27:44 GMT [] [12598]: [4512-1] db=,user= LOG:
> >  >   restartpoint complete: wrote 17069 buffers (0.2%); 0 WAL file(s)
> >  > added, 0 removed, 17 recycled; write=269.879 s, sync=0.006 s,
> >  > total=269.902 s; sync files=811, longest=0.001 s, average=0.000 s;
> >  > distance=120469 kB, estimate=665445 kB
> >  > 2020-03-11 09:27:44 GMT [] [12598]: [4513-1] db=,user= LOG:
> >   recovery
> >  > restart point at 643A/E01AB438
> >  > 2020-03-11 09:27:44 GMT [] [12598]: [4514-1] db=,user= DETAIL:
>  Last
> >  > completed transaction was at log time 2020-03-11
> 09:27:43.945485+00.
> >  > 2020-03-11 09:27:44 GMT [] [12598]: [4515-1] db=,user= LOG:
> >  >   restartpoint starting: force wait
> >  > 2020-03-11 09:29:24 GMT [10.222.8.2(47834)] [50961]: [1-1]
> >  > db=cmdv3,user=nis LOG:  duration: 1402.004 ms  statement: SELECT
> id,
> >  > name, parent_id, parent, short_name, sales_rep_id FROM
> mmx_clients;
> >  > 2020-03-11 09:29:34 GM

Re: Streaming replication - 11.5

2020-03-12 Thread Nicola Contu
The replicating server is OK, and the replicated one as well.
%Cpu(s):  0.2 us,  1.0 sy,  0.0 ni, 94.8 id,  4.0 wa,  0.0 hi,  0.0 si,
 0.0 st

CPU is really low on both.

I am re-running pg_basebackup from scratch every time.
Any other suggestions?
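
(Disk throughput on both sides during the copy could be sampled with something
like this; a sketch, assuming the sysstat package is installed:)

iostat -dxm 5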



Il giorno mer 11 mar 2020 alle ore 23:13 Adrian Klaver <
adrian.kla...@aklaver.com> ha scritto:

> On 3/11/20 2:12 PM, Nicola Contu wrote:
> > CPU load on the server to be built? No.
>
> CPU load, I/O load on the servers in the replication chain.
>
> Basically you just recently, it seems, imposed extra overhead to the
> process by encrypting/decrypting. From what I gather from earlier post
> then your replication started failing. Trying to establish whether the
> former is causing the latter.
>
> Not sure what you are referring to in regards to server to be built?
>
> > System logs don't show anything relevant unfortunately
> >
> > Il mer 11 mar 2020, 21:34 Adrian Klaver  > <mailto:adrian.kla...@aklaver.com>> ha scritto:
> >
> > On 3/11/20 11:59 AM, Nicola Contu wrote:
> >  > I am actually cascading.
> >  > The master is in nyh, the first slave is in Dallas and the one
> > having
> >  > problems is in Dallas as well on the same switch of the one
> > replicating
> >  > from the master.
> >  >
> >  > It always worked not sure what is wrong now. We just encrypted
> > disks on
> >  > all servers
> >
> > Do you have before and after on CPU load, I/O throughput?
> >
> > Do system logs show anything relevant during replication drop out?
> >
> >  >
> >  >
> >  > Il mer 11 mar 2020, 18:57 Adrian Klaver
> > mailto:adrian.kla...@aklaver.com>
> >  > <mailto:adrian.kla...@aklaver.com
> > <mailto:adrian.kla...@aklaver.com>>> ha scritto:
> >  >
> >  > On 3/11/20 2:54 AM, Nicola Contu wrote:
> >  >  > These are the lines before
> >  >  >
> >  >  > 2020-03-11 09:05:08 GMT [127.0.0.1(40214)] [43853]: [1-1]
> >  >  > db=cmdv3,user=zabbix_check ERROR:  recovery is in progress
> >  >  > 2020-03-11 09:05:08 GMT [127.0.0.1(40214)] [43853]: [2-1]
> >  >  > db=cmdv3,user=zabbix_check HINT:  WAL control functions
> > cannot be
> >  >  > executed during recovery.
> >  >  > 2020-03-11 09:05:08 GMT [127.0.0.1(40214)] [43853]: [3-1]
> >  >  > db=cmdv3,user=zabbix_check STATEMENT:  select
> >  >  > greatest(0,pg_wal_lsn_diff(pg_current_wal_lsn(),
> > replay_lsn)) from
> >  >  > pg_stat_replication where client_addr ='10.150.20.22'
> >  >  >
> >  >  > That query is made by Zabbix. So I stopped the zabbix
> > agent and
> >  > tested
> >  >  > again. But still failing, because of this now :
> >  >  >
> >  >  > pg_basebackup: starting background WAL receiver
> >  >  > pg_basebackup: created temporary replication slot
> >  > "pg_basebackup_51199"
> >  >  > *pg_basebackup: could not receive data from WAL stream:
> > SSL SYSCALL
> >  >  > error: EOF detected
> >  >  > *^C4699810/504983062 kB (70%), 0/1 tablespace
> >  >  > (...ql11/data/base/16401/231363544.2)
> >  >
> >  > So you started over with a pg_basebackup?
> >  >
> >  > Also from below:
> >  >
> >  > 2020-03-11 09:43:53 GMT [10.150.20.22(54906)] [51199]: [1-1]
> >  > db=[unknown],user=replicator LOG:  terminating walsender
> > process due to
> >  > replication timeout
> >  >
> >  > Where are the master and standby in relation to each other
> > network wise?
> >  >
> >  > Intervening firewalls, network latency issues?
> >  >
> >  >
> >  >
> >  >  >
> >  >  >
> >  >  > here the full log starting right before the last try :
> >  >  >
> >  >  > 2020-03-11 09:22:44 GMT [] [12598]: [4508-1] db=,user= LOG:
> >  >  >   restartpoint complete: wrote 19565 buffers (0.2%); 0 WAL
> > file(s)
> >  >  > added, 0 removed, 7 recycled; write=270.014 s, sync=0.009
> s,
> >  >  >

Re: Streaming replication - 11.5

2020-03-12 Thread Nicola Contu
The encryption is at OS level, so the drives where the DB stores its data are
encrypted with a password.

Il gio 12 mar 2020, 15:51 Adrian Klaver  ha
scritto:

> On 3/12/20 4:31 AM, Nicola Contu wrote:
> > The replicator is ok and the replicated as well.
> > %Cpu(s):  0.2 us,  1.0 sy,  0.0 ni, 94.8 id,  4.0 wa,  0.0 hi,  0.0 si,
> >   0.0 st
> >
> > CPU is really low on both.
> >
> > I am running pg_basebackup again everytime.
> > Any other suggestions?
> >
>
> I have to believe their is a connection between changing to encrypting
> the disks and your issues. Not sure what, but to help how is the
> encryption being done and what program is being used?
>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>


Re: Streaming replication - 11.5

2020-03-13 Thread Nicola Contu
So in the logs I now see this :

2020-03-13 11:03:42 GMT [10.150.20.22(45294)] [27804]: [1-1]
db=[unknown],user=replicator LOG:  terminating walsender process due to
replication timeout

So I tried increasing wal_sender_timeout to 300s, but it did not help.
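
(For reference, the change was applied on the sending side with something like
this; a sketch, and it only needs a reload, not a restart:)

psql -U postgres -c "ALTER SYSTEM SET wal_sender_timeout = '300s';" -c "SELECT pg_reload_conf();"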

Il giorno gio 12 mar 2020 alle ore 15:56 Nicola Contu <
nicola.co...@gmail.com> ha scritto:

> The encryption is at os level. So the drives are encrypted with a password
> where the db saves data
>
> Il gio 12 mar 2020, 15:51 Adrian Klaver  ha
> scritto:
>
>> On 3/12/20 4:31 AM, Nicola Contu wrote:
>> > The replicator is ok and the replicated as well.
>> > %Cpu(s):  0.2 us,  1.0 sy,  0.0 ni, 94.8 id,  4.0 wa,  0.0 hi,  0.0 si,
>> >   0.0 st
>> >
>> > CPU is really low on both.
>> >
>> > I am running pg_basebackup again everytime.
>> > Any other suggestions?
>> >
>>
>> I have to believe their is a connection between changing to encrypting
>> the disks and your issues. Not sure what, but to help how is the
>> encryption being done and what program is being used?
>>
>>
>> --
>> Adrian Klaver
>> adrian.kla...@aklaver.com
>>
>


Re: Streaming replication - 11.5

2020-03-16 Thread Nicola Contu
I was able to make pg_basebackup work by using --max-rate=128M.

I still don't understand why. I guess it is related to the encryption and the
slowness of the disk.
Do you have any idea?
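
(For reference, the invocation that worked was along these lines; a sketch,
the host is a placeholder and only --max-rate=128M is the part that actually
changed:)

pg_basebackup -h <upstream_host> -U replicator -D /db/pgsql11/data -X stream -P --max-rate=128M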

Il giorno ven 13 mar 2020 alle ore 16:15 Adrian Klaver <
adrian.kla...@aklaver.com> ha scritto:

> On 3/13/20 4:11 AM, Nicola Contu wrote:
> > So in the logs I now see this :
> >
> > 2020-03-13 11:03:42 GMT [10.150.20.22(45294)] [27804]: [1-1]
> > db=[unknown],user=replicator LOG:  terminating walsender process due to
> > replication timeout
>
> Yeah that's been showing up the log snippets you have been posting.
>
> To figure this out you will need to:
>
> 1) Make a list of what changed since the last time replication worked
> consistently.
>
> 2) Monitor the changed components, start logging or increase logging.
>
> 3) Monitor the chain of replication as whole, to catch changes that you
> do not know about. Since you seem to be operating across data centers
> that would include verifying the network.
>
> >
> > So I tried increasing the wal_sender_timeout to 300s but it did not help
> >
> > Il giorno gio 12 mar 2020 alle ore 15:56 Nicola Contu
> > mailto:nicola.co...@gmail.com>> ha scritto:
> >
> > The encryption is at os level. So the drives are encrypted with a
> > password where the db saves data
> >
> > Il gio 12 mar 2020, 15:51 Adrian Klaver  > <mailto:adrian.kla...@aklaver.com>> ha scritto:
> >
> > On 3/12/20 4:31 AM, Nicola Contu wrote:
> >  > The replicator is ok and the replicated as well.
> >  > %Cpu(s):  0.2 us,  1.0 sy,  0.0 ni, 94.8 id,  4.0 wa,  0.0
> > hi,  0.0 si,
> >  >   0.0 st
> >  >
> >  > CPU is really low on both.
> >  >
> >  > I am running pg_basebackup again everytime.
> >  > Any other suggestions?
> >  >
> >
> > I have to believe their is a connection between changing to
> > encrypting
> > the disks and your issues. Not sure what, but to help how is the
> > encryption being done and what program is being used?
> >
> >
> > --
> > Adrian Klaver
> > adrian.kla...@aklaver.com <mailto:adrian.kla...@aklaver.com>
> >
>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>


[no subject]

2020-04-01 Thread Nicola Contu
Hello,
I am running postgres 11.5 and we were having issues with shared segments.
So I increased max_connections as suggested by you guys and reduced my
work_mem to 600MB.

Now, however, this is the second time I have seen this error:

ERROR:  could not resize shared memory segment "/PostgreSQL.2137675995" to
33624064 bytes: Interrupted system call

Do you know what it means and how I can solve it?

Thanks a lot,
Nicola


Re: EINTR while resizing dsm segment.

2020-04-07 Thread Nicola Contu
So that seems to be a bug, correct?
Just to confirm, I am not using NFS, it is directly on disk.

Other than that, is there a particular option we can set in the
postgres.conf to mitigate the issue?

Thanks a lot for your help.


Il giorno sab 4 apr 2020 alle ore 02:49 Thomas Munro 
ha scritto:

> On Thu, Apr 2, 2020 at 9:25 PM Kyotaro Horiguchi
>  wrote:
> > I provided the subject, and added -hackers.
> >
> > > Hello,
> > > I am running postgres 11.5 and we were having issues with shared
> segments.
> > > So I increased the max_connection as suggested by you guys and reduced
> my
> > > work_mem to 600M.
> > >
> > > Right now instead, it is the second time I see this error :
> > >
> > > ERROR:  could not resize shared memory segment
> "/PostgreSQL.2137675995" to
> > > 33624064 bytes: Interrupted system call
> >
> > The function posix_fallocate is protected against EINTR.
> >
> > | do
> > | {
> > |   rc = posix_fallocate(fd, 0, size);
> > | } while (rc == EINTR && !(ProcDiePending || QueryCancelPending));
> >
> > But not for ftruncate and write. Don't we need to protect them from
> > ENTRI as the attached?
>
> We don't handle EINTR for write() generally because that's not
> supposed to be necessary on local files (local disks are not "slow
> devices", and we document that if you're using something like NFS you
> should use its "hard" mount option so that it behaves that way too).
> As for ftruncate(), you'd think it'd be similar, and I can't think of
> a more local filesystem than tmpfs (where POSIX shmem lives on Linux),
> but I can't seem to figure that out from reading man pages; maybe I'm
> reading the wrong ones.  Perhaps in low memory situations, an I/O wait
> path reached by ftruncate() can return EINTR here rather than entering
> D state (non-interruptable sleep) or restarting due to our SA_RESTART
> flag... anyone know?
>
> Another thought: is there some way for the posix_fallocate() retry
> loop to exit because (ProcDiePending || QueryCancelPending), but then
> for CHECK_FOR_INTERRUPTS() to do nothing, so that we fall through to
> reporting the EINTR?
>


Re: EINTR while resizing dsm segment.

2020-04-07 Thread Nicola Contu
The only change we made on the disk is the encryption at OS level.
Not sure whether that could be related.

Il giorno mar 7 apr 2020 alle ore 10:58 Nicola Contu 
ha scritto:

> So that seems to be a bug, correct?
> Just to confirm, I am not using NFS, it is directly on disk.
>
> Other than that, is there a particular option we can set in the
> postgres.conf to mitigate the issue?
>
> Thanks a lot for your help.
>
>
> Il giorno sab 4 apr 2020 alle ore 02:49 Thomas Munro <
> thomas.mu...@gmail.com> ha scritto:
>
>> On Thu, Apr 2, 2020 at 9:25 PM Kyotaro Horiguchi
>>  wrote:
>> > I provided the subject, and added -hackers.
>> >
>> > > Hello,
>> > > I am running postgres 11.5 and we were having issues with shared
>> segments.
>> > > So I increased the max_connection as suggested by you guys and
>> reduced my
>> > > work_mem to 600M.
>> > >
>> > > Right now instead, it is the second time I see this error :
>> > >
>> > > ERROR:  could not resize shared memory segment
>> "/PostgreSQL.2137675995" to
>> > > 33624064 bytes: Interrupted system call
>> >
>> > The function posix_fallocate is protected against EINTR.
>> >
>> > | do
>> > | {
>> > |   rc = posix_fallocate(fd, 0, size);
>> > | } while (rc == EINTR && !(ProcDiePending || QueryCancelPending));
>> >
>> > But not for ftruncate and write. Don't we need to protect them from
>> > ENTRI as the attached?
>>
>> We don't handle EINTR for write() generally because that's not
>> supposed to be necessary on local files (local disks are not "slow
>> devices", and we document that if you're using something like NFS you
>> should use its "hard" mount option so that it behaves that way too).
>> As for ftruncate(), you'd think it'd be similar, and I can't think of
>> a more local filesystem than tmpfs (where POSIX shmem lives on Linux),
>> but I can't seem to figure that out from reading man pages; maybe I'm
>> reading the wrong ones.  Perhaps in low memory situations, an I/O wait
>> path reached by ftruncate() can return EINTR here rather than entering
>> D state (non-interruptable sleep) or restarting due to our SA_RESTART
>> flag... anyone know?
>>
>> Another thought: is there some way for the posix_fallocate() retry
>> loop to exit because (ProcDiePending || QueryCancelPending), but then
>> for CHECK_FOR_INTERRUPTS() to do nothing, so that we fall through to
>> reporting the EINTR?
>>
>


pgbouncer

2019-01-17 Thread Nicola Contu
Hello,
I am a bit confused about the settings in pgbouncer

What exactly is pool_size?
If I set it to 3 and try to connect from 4 shells, I am still able to
connect.
Same thing for max_db_connections: I set it to 1 and I am able to connect
from 2 shells.

This is kind of confusing, and it is not really clear to me from the
documentation.

This is my config.

[databases]
cmdwela = host=127.0.0.1 port=5432 dbname=cmdwela auth_user=pgbouncer
pool_size=120
cicciopasticcio = host=127.0.0.1 port=5432
dbname=cicciopasticcio auth_user=pgbouncer pool_size=2

[pgbouncer]
listen_port = 6543
listen_addr = *
auth_type = md5
auth_file = /etc/pgbouncer/users.txt
auth_query = select uname,phash from user_lookup($1)
logfile = /var/log/pgbouncer.log
pidfile = /home/postgres/pgbouncer.pid
admin_users = admin
user=postgres
max_db_connections=1
log_connections=0
log_disconnections=0
max_client_conn=5


I just want to limit connections from the app etc.

Thanks


Re: pgbouncer

2019-01-17 Thread Nicola Contu
> If I set 3, and I tried to connect from 4 shells, I am still able to
connect.

That would be the point - pgbouncer is sharing the 3 connections it
keeps with PostgreSQL between the 4 client connections made to it.

Mmh, my pool_mode is session. The 4 sessions were active, not running any
query, but still connected from the shells.
So that is my doubt.
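
(The live numbers can be checked from pgbouncer's admin console; a sketch,
using the admin user and listen port from my config above. SHOW POOLS reports
client connections (cl_active) separately from server connections (sv_active),
and pool_size only limits the latter:)

psql -h 127.0.0.1 -p 6543 -U admin pgbouncer -c "SHOW POOLS;" -c "SHOW CLIENTS;"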

Il giorno gio 17 gen 2019 alle ore 16:32 David G. Johnston <
david.g.johns...@gmail.com> ha scritto:

> On Thu, Jan 17, 2019 at 8:18 AM Nicola Contu 
> wrote:
> >
> > Hello,
> > I am a bit confused about the settings in pgbouncer
> >
> > What's exactly the pool_size?
>
> Roughly, the number of open connections pgbouncer will keep to PostgreSQL.
>
> > If I set 3, and I tried to connect from 4 shells, I am still able to
> connect.
>
> That would be the point - pgbouncer is sharing the 3 connections it
> keeps with PostgreSQL between the 4 client connections made to it.
>
> > Same thing for max_db_connections. I set this to 1 and I am able to
> connect from 2 shells.
>
> Same as pool_size but basically a fail-safe since pools are
> per-user/per-database while this limit is per-database only.
>
> > This is kind of confusing and I'm not really cleared reading the
> documentation.
>
> For each setting you need to understand whether it limits
> pgbouncer->PostgreSQL or client->pgbouncer
>
> Configurations in [databases] limit the former; [pgbouncer] options
> either provide defaults for the [databases] or limit clients.
>
> > [databases]
> > cmdwela = host=127.0.0.1 port=5432 dbname=cmdwela auth_user=pgbouncer
> pool_size=120
> > cicciopasticcio = host=127.0.0.1 port=5432 dbname=cicciopasticcio
> auth_user=pgbouncer pool_size=2
> >
> > [pgbouncer]
> > listen_port = 6543
> > listen_addr = *
> > auth_type = md5
> > auth_file = /etc/pgbouncer/users.txt
> > auth_query = select uname,phash from user_lookup($1)
> > logfile = /var/log/pgbouncer.log
> > pidfile = /home/postgres/pgbouncer.pid
> > admin_users = admin
> > user=postgres
>
> > max_db_connections=1
> So one open connection max per database/user pool but it is shared -
> i.e., actively executing queries running in parallel are limited to
> this number.
>
> > max_client_conn=5
> > I just want to limit connections from the app etc.
>
> That limit is 5
>
> David J.
>


Pg_restore

2019-03-01 Thread Nicola Contu
Hello,
we are trying to move our postgres 10.5 instance to 11.2

We are trying to restore a DB in a pre-production env but we get an error
on the pg_restore command

[root@STAGING]#  /usr/local/pgsql11.2/bin/pg_restore -v -U postgres -L
/data/cofinder_restore.lst  -d cofinder  -j 16
/home/ncontu/data/dbbackups/C
pg_restore: connecting to database for restore
pg_restore: processing item 3098 ENCODING ENCODING
pg_restore: processing item 3099 STDSTRINGS STDSTRINGS
pg_restore: processing item 3100 SEARCHPATH SEARCHPATH
pg_restore: processing item 3101 DATABASE cofinder
pg_restore: processing item 3 SCHEMA public
pg_restore: creating SCHEMA "public"
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 3; 2615 2200 SCHEMA public
postgres
pg_restore: [archiver (db)] could not execute query: ERROR:  schema
"public" already exists
Command was: CREATE SCHEMA public;

The dump was taken with the pgsql10 binaries.

Can you help with this, please?
We also have another DB, and with that one there are no issues doing the same
thing.

Thanks


Re: Pg_restore

2019-03-01 Thread Nicola Contu
I guess it depends on this
https://www.postgresql.org/message-id/15466-0b90383ff69c6e4b%40postgresql.org
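
If that is the cause, one possible workaround is to drop that entry from the
TOC list before restoring, so pg_restore never tries to re-create the public
schema (a sketch; the exact text of the TOC line may differ):

/usr/local/pgsql11.2/bin/pg_restore -l /home/ncontu/data/dbbackups/C | grep -v 'SCHEMA - public' > /data/cofinder_restore.lst
/usr/local/pgsql11.2/bin/pg_restore -v -U postgres -L /data/cofinder_restore.lst -d cofinder -j 16 /home/ncontu/data/dbbackups/C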

Thanks

Il giorno ven 1 mar 2019 alle ore 09:17 Nicola Contu 
ha scritto:

> Hello,
> we are trying to move our postgres 10.5 instance to 11.2
>
> We are trying to restore a DB in a pre-production env but we get an error
> on the pg_restore command
>
> [root@STAGING]#  /usr/local/pgsql11.2/bin/pg_restore -v -U postgres -L
> /data/cofinder_restore.lst  -d cofinder  -j 16
> /home/ncontu/data/dbbackups/C
> pg_restore: connecting to database for restore
> pg_restore: processing item 3098 ENCODING ENCODING
> pg_restore: processing item 3099 STDSTRINGS STDSTRINGS
> pg_restore: processing item 3100 SEARCHPATH SEARCHPATH
> pg_restore: processing item 3101 DATABASE cofinder
> pg_restore: processing item 3 SCHEMA public
> pg_restore: creating SCHEMA "public"
> pg_restore: [archiver (db)] Error while PROCESSING TOC:
> pg_restore: [archiver (db)] Error from TOC entry 3; 2615 2200 SCHEMA
> public postgres
> pg_restore: [archiver (db)] could not execute query: ERROR:  schema
> "public" already exists
> Command was: CREATE SCHEMA public;
>
> The dump is taken with pgsql10 binaries.
>
> Can you help on this please?
> We also have another DB, but with the other one no issues doing the same
> thing.
>
> Thanks
>
>
>


Performance comparison between Pgsql 10.5 and Pgsql 11.2

2019-03-01 Thread Nicola Contu
Hello,
is there any reason why I am getting worse write results using pgsql 11.2
compared with pgsql 10.6?

I have two instances, both just restored, so no bloat.
Running read queries I get pretty much the same results, a little bit better
on pg11. Running writes, the difference is in favour of 10.

I was expecting pg11 to be better.

Running pgbench :

PG11
[root@STAGING-CMD1 ~]#  /usr/local/pgsql11.2/bin/pgbench -t 1000 -c 20 -C
-f stress_service_order.sql cmdstaging -U admin
transaction type: stress_service_order.sql
scaling factor: 1
query mode: simple
number of clients: 20
number of threads: 1
number of transactions per client: 1000
number of transactions actually processed: 2/2
latency average = 45.322 ms
tps = 441.283336 (including connections establishing)
tps = 463.731537 (excluding connections establishing)

PG10
[root@STAGING-CMD1 ~]#  pgbench -t 1000 -c 20 -C -f
stress_service_order.sql cmdstaging -U admin
transaction type: stress_service_order.sql
scaling factor: 1
query mode: simple
number of clients: 20
number of threads: 1
number of transactions per client: 1000
number of transactions actually processed: 2/2
latency average = 44.686 ms
tps = 447.565403 (including connections establishing)
tps = 470.285561 (excluding connections establishing)

This is making a really big difference with longer queries.
Here I am updating a field in a random record.

With more transactions the difference is bigger

WITH POSTGRES 10

[root@STAGING-CMD1 ~]#  pgbench -t 10 -c 20 -C -f
stress_service_order_read.sql cmdstaging -U postgres
transaction type: stress_service_order_read.sql
scaling factor: 1
query mode: simple
number of clients: 20
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 200/200
latency average = 55.291 ms
tps = 442.1490778 (including connections establishing)
tps = 454.846844 (excluding connections establishing)

WITH POSTGRES 11
[root@STAGING-CMD1 ~]#  pgbench -t 10 -c 20 -C -f
stress_service_order_read.sql cmdstaging -U postgres
transaction type: stress_service_order_read.sql
scaling factor: 1
query mode: simple
number of clients: 20
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 200/200
latency average = 53.291 ms
tps = 375.297748 (including connections establishing)
tps = 392.316057 (excluding connections establishing)


The postgresql.conf files are the same.

max_connections = 220
shared_buffers = 10GB
effective_cache_size = 120GB
work_mem = 600MB
maintenance_work_mem = 2GB
min_wal_size = 1GB
max_wal_size = 2GB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
#default_statistics_target = 100

Using data_sync_retry=on doesn't make any difference.

Has anything else changed in the default values?

Any tricks?
I don't want to go live and lose performance.

Thanks a lot,
Nicola


Re: Performance comparison between Pgsql 10.5 and Pgsql 11.2

2019-03-04 Thread Nicola Contu
I did an analyze in stages on both.
And yes, both are compiled from source.
This is the configure command (with 10.6 in place of 11.2 for PG10):

./configure --prefix=/usr/local/pgsql11.2

See the attached perf report. The difference seems to be all in this line, but
I am not sure:

+   26.80% 0.00%   222  postmaster   [kernel.kallsyms]
  [k] system_call_fastpath



I am using CentOS 7
With CentOS I am using this profile for tuned-adm:
[root@STAGING-CMD1 ~]#  tuned-adm active
Current active profile: latency-performance


Il giorno sab 2 mar 2019 alle ore 20:41 Thomas Munro 
ha scritto:

> On Sat, Mar 2, 2019 at 5:02 AM Ray O'Donnell  wrote:
> > On 01/03/2019 15:01, Nicola Contu wrote:
> > > Hello,
> > > is there any reason why I am getting worse results using pgsql11.2 in
> > > writing comparing it with pgsql 10.6?
> > >
> > > I have two Instances, both just restored, so no bloats.
> > > Running read queries I have pretty much same results, a little bit
> > > better on pg11- Running writes the difference is in favour of 10.
> >
> > Did you run ANALYZE on the databases after restoring?
>
> If you can rule out different query plans, and if you compiled them
> both with the same compiler and optimisation levels and without
> cassert enabled (it's a long shot but I mentioned that because you
> showed a path in /usr/local so perhaps you're hand-compiling 11, but
> 10 came from a package?), then the next step might be to use a
> profiler like "perf" (or something equivalent on your OS) to figure
> out where 11 is spending more time in the write test?
>
> --
> Thomas Munro
> https://enterprisedb.com
>
PGSQL11

+   26.80% 0.00%   222  postmaster   [kernel.kallsyms]  
  [k] system_call_fastpath
+   13.46% 0.04%   751  postmaster   [kernel.kallsyms]  
  [k] sys_futex
+   13.43% 0.00%   782  postmaster   [kernel.kallsyms]  
  [k] do_futex
+   12.89% 0.00% 0  postmaster   [unknown]  
  [k] 
+9.09% 0.00% 0  swapper  [kernel.kallsyms]  
  [k] start_cpu
+9.09% 0.14%  2860  swapper  [kernel.kallsyms]  
  [k] cpu_startup_entry
+9.09% 0.00% 0  swapper  [kernel.kallsyms]  
  [k] start_secondary
+8.97% 0.00%   389  postmaster   [kernel.kallsyms]  
  [k] page_fault
+8.97% 0.00%   687  postmaster   [kernel.kallsyms]  
  [k] do_page_fault
+8.97% 0.73%  6471  postmaster   [kernel.kallsyms]  
  [k] __do_page_fault
+8.66% 0.14%  1648  postmaster   libpthread-2.17.so 
  [.] sem_post@@GLIBC_2.2.5
+8.40% 0.70% 13567  postmaster   [kernel.kallsyms]  
  [k] get_futex_key
+8.08% 0.00% 0  postmaster   [unknown]  
  [k] 0x0080
+8.08% 0.93%   642  postmaster   libpthread-2.17.so 
  [.] do_futex_wait.constprop.1
+7.87% 0.19%  2666  postmaster   [kernel.kallsyms]  
  [k] futex_wake
+7.68% 0.19%  9343  postmaster   [kernel.kallsyms]  
  [k] handle_mm_fault
+7.34% 0.00%   207  swapper  [kernel.kallsyms]  
  [k] arch_cpu_idle
+7.34% 0.00%   979  swapper  [kernel.kallsyms]  
  [k] cpuidle_idle_call
+7.15% 0.00%  2671  postmaster   [kernel.kallsyms]  
  [k] handle_pte_fault
+5.87% 0.00% 6  postmaster   libc-2.17.so   
  [.] __GI___setsid
+5.87% 0.00% 9  postmaster   [kernel.kallsyms]  
  [k] sys_setsid
+5.87% 0.00%15  postmaster   [kernel.kallsyms]  
  [k] sched_autogroup_create_attach
+5.41% 5.41% 48199  postmaster   postgres   
  [.] hash_search_with_hash_value
+5.40% 5.40% 44439  postmaster   postgres   
  [.] LWLockWaitListLock
+5.38% 0.72%  1425  postmaster   [kernel.kallsyms]  
  [k] futex_wait
+5.07% 0.00% 5  postmaster   [kernel.kallsyms]  
  [k] sched_create_group
+4.23% 0.59%  2253  postmaster   [kernel.kallsyms]  
  [k] __slab_alloc
+4.13% 0.00%  2330  postmaster   [kernel.kallsyms]  
  [k] kmem_cache_alloc_node_trace
+3.75% 0.16%  2919  postmaster   [kernel.kallsyms]  
  

Re: Performance comparison between Pgsql 10.5 and Pgsql 11.2

2019-03-04 Thread Nicola Contu
Because I have 10.6 in production :) and I am comparing against what I would
be losing.
And I did read that in the release notes, but as I said in my first email,
even data_sync_retry=on (going back to the previous behaviour) does not make
any difference.

So I am looking for something that will keep my performance but still allow
me to upgrade to 11 in production.
Also, when trying 11.1, the problem still seems to be there.

Il giorno lun 4 mar 2019 alle ore 14:45 Imre Samu  ha
scritto:

> > is there any reason why I am getting worse results using pgsql11.2 in
> writing comparing it with pgsql 10.6?
> >... And Yes both are compiled.
>
> Why 10.6?
>
> according to release notes
> "14th February 2019: PostgreSQL 11.2, 10.7, 9.6.12, 9.5.16, and 9.4.21
> Released!"  https://www.postgresql.org/about/news/1920/
> imho:* it would be better to compare PG11.2  with  PG10.7  *(  similar
> bug Fixes and Improvements + same fsync()  behavior )
>
> *"This release changes the behavior in how PostgreSQL interfaces with
> fsync() and includes fixes for partitioning and over 70 other bugs that
> were reported over the past three months"*
>
> Imre
>
>
>
> Nicola Contu  ezt írta (időpont: 2019. márc. 4.,
> H, 13:14):
>
>> I did a analyze in stages on both.
>> And Yes both are compiled.
>> This is the configure command (change 10.6 for PG10)
>>
>> ./configure --prefix=/usr/local/pgsql11.2
>>
>> See attached perf report. The difference seems to be all in this line,
>> but not sure :
>>
>> +   26.80% 0.00%   222  postmaster   [kernel.kallsyms]
>> [k] system_call_fastpath
>>
>>
>>
>> I am using CentOS 7
>> With Centos I am using this profile for tuned-adm
>> [root@STAGING-CMD1 ~]#  tuned-adm active
>> Current active profile: latency-performance
>>
>>
>> Il giorno sab 2 mar 2019 alle ore 20:41 Thomas Munro <
>> thomas.mu...@gmail.com> ha scritto:
>>
>>> On Sat, Mar 2, 2019 at 5:02 AM Ray O'Donnell  wrote:
>>> > On 01/03/2019 15:01, Nicola Contu wrote:
>>> > > Hello,
>>> > > is there any reason why I am getting worse results using pgsql11.2 in
>>> > > writing comparing it with pgsql 10.6?
>>> > >
>>> > > I have two Instances, both just restored, so no bloats.
>>> > > Running read queries I have pretty much same results, a little bit
>>> > > better on pg11- Running writes the difference is in favour of 10.
>>> >
>>> > Did you run ANALYZE on the databases after restoring?
>>>
>>> If you can rule out different query plans, and if you compiled them
>>> both with the same compiler and optimisation levels and without
>>> cassert enabled (it's a long shot but I mentioned that because you
>>> showed a path in /usr/local so perhaps you're hand-compiling 11, but
>>> 10 came from a package?), then the next step might be to use a
>>> profiler like "perf" (or something equivalent on your OS) to figure
>>> out where 11 is spending more time in the write test?
>>>
>>> --
>>> Thomas Munro
>>> https://enterprisedb.com
>>>
>>


Re: Performance comparison between Pgsql 10.5 and Pgsql 11.2

2019-03-05 Thread Nicola Contu
I am not sure exactly what you are asking for, but here is the strace for the
start of pg_ctl.
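
(If that is not what you meant, I can also attach to the running postmaster
and collect aggregated syscall counts server-side while pgbench runs; a
sketch, assuming the same data directory as in the pg_ctl command below:)

strace -c -f -p $(head -1 /db/pgsql11/data/postmaster.pid)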



On Mon, Mar 4, 2019 at 21:55, Thomas Munro  wrote:

> On Tue, Mar 5, 2019 at 5:01 AM Nicola Contu 
> wrote:
> > Attached a part of the strace running the pgbench command for pg11
> > Also attached strace_10 for pg10.6.
>
> That looks like strace output from pgbench, and I don't see any
> interesting differences between v10 and v11 (though I'm surprised to
> see it using poll() instead of ppoll(), and opening a new connection
> for every transaction).
>
> How about syscalls on the server side?  You could start it with
> something like "strace -f path/to/postgres -D path/to/pgdata" (-f for
> follow children), and perhaps also use -c so that it shows aggregated
> data (up until you ^C it) instead of every syscall?
>
> --
> Thomas Munro
> https://enterprisedb.com
>
[postgres@STAGING-CMD1 ~]$  strace /usr/local/pgsql11.2/bin/pg_ctl -D 
/db/pgsql11/data/ -c start
execve("/usr/local/pgsql11.2/bin/pg_ctl", ["/usr/local/pgsql11.2/bin/pg_ctl", 
"-D", "/db/pgsql11/data/", "-c", "start"], [/* 27 vars */]) = 0
brk(NULL)   = 0x204b000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x7f6878854000
access("/etc/ld.so.preload", R_OK)  = -1 ENOENT (No such file or directory)
open("/usr/local/pgsql11.2/lib/tls/x86_64/libpthread.so.0", O_RDONLY|O_CLOEXEC) 
= -1 ENOENT (No such file or directory)
stat("/usr/local/pgsql11.2/lib/tls/x86_64", 0x7fff444938d0) = -1 ENOENT (No 
such file or directory)
open("/usr/local/pgsql11.2/lib/tls/libpthread.so.0", O_RDONLY|O_CLOEXEC) = -1 
ENOENT (No such file or directory)
stat("/usr/local/pgsql11.2/lib/tls", 0x7fff444938d0) = -1 ENOENT (No such file 
or directory)
open("/usr/local/pgsql11.2/lib/x86_64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 
-1 ENOENT (No such file or directory)
stat("/usr/local/pgsql11.2/lib/x86_64", 0x7fff444938d0) = -1 ENOENT (No such 
file or directory)
open("/usr/local/pgsql11.2/lib/libpthread.so.0", O_RDONLY|O_CLOEXEC) = -1 
ENOENT (No such file or directory)
stat("/usr/local/pgsql11.2/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=48564, ...}) = 0
mmap(NULL, 48564, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f6878848000
close(3)= 0
open("/lib64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0m\0\0\0\0\0\0"..., 
832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=144792, ...}) = 0
mmap(NULL, 2208904, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0x7f6878418000
mprotect(0x7f687842f000, 2093056, PROT_NONE) = 0
mmap(0x7f687862e000, 8192, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x16000) = 0x7f687862e000
mmap(0x7f687863, 13448, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f687863
close(3)= 0
open("/usr/local/pgsql11.2/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No 
such file or directory)
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P%\2\0\0\0\0\0"..., 832) 
= 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2173512, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x7f6878847000
mmap(NULL, 3981792, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0x7f687804b000
mprotect(0x7f687820e000, 2093056, PROT_NONE) = 0
mmap(0x7f687840d000, 24576, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1c2000) = 0x7f687840d000
mmap(0x7f6878413000, 16864, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f6878413000
close(3)= 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x7f6878845000
arch_prctl(ARCH_SET_FS, 0x7f6878845740) = 0
mprotect(0x7f687840d000, 16384, PROT_READ) = 0
mprotect(0x7f687862e000, 4096, PROT_READ) = 0
mprotect(0x60a000, 4096, PROT_READ) = 0
mprotect(0x7f6878855000, 4096, PROT_READ) = 0
munmap(0x7f6878848000, 48564)   = 0
set_tid_address(0x7f6878845a10) = 92441
set_robust_list(0x7f6878845a20, 24) = 0
rt_sigaction(SIGRTMIN, {0x7f687841e7e0, [], SA_RESTORER|SA_SIGINFO, 
0x7f68784276d0}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {0x7f687841e870, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 
0x7f68784276d0}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
brk(NULL)   = 0x204b000
brk(0

Re: Performance comparison between Pgsql 10.5 and Pgsql 11.2

2019-03-06 Thread Nicola Contu
Here is the strace as requested for pg11

Thanks

On Tue, Mar 5, 2019 at 17:47, Thomas Munro  wrote:

> On Wed, Mar 6, 2019 at 4:22 AM Nicola Contu 
> wrote:
> >
> > Not sure what you are requesting exactly but here is the strace for the
> start of the pg_ctl
>
> I meant that you could run the server itself in the foreground under
> strace, like so:
>
> $ strace -f -c /usr/local/pgsql11.2/bin/postgres -D /db/pgsql11/data/
>
> Then perform your testing, and finally stop it with pg_ctl from
> another window (or hit ^C in this window) and strace should spit out a
> table of system calls with some counters.  We might be able to see why
> v11 is spending so much more time executing system calls than v10 for
> your workload, or at least which systems calls they are, assuming you
> run the same transactions against both versions.
>
> --
> Thomas Munro
> https://enterprisedb.com
>
[postgres@STAGING-CMD1 ~]$  strace -f -c /usr/local/pgsql11.2/bin/postgres -D 
/db/pgsql11/data/
2019-03-06 10:07:47 GMT [] [163892]: [1-1] db=,user= LOG:  listening on IPv4 
address "0.0.0.0", port 5432
2019-03-06 10:07:47 GMT [] [163892]: [2-1] db=,user= LOG:  listening on IPv6 
address "::", port 5432
2019-03-06 10:07:47 GMT [] [163892]: [3-1] db=,user= LOG:  listening on Unix 
socket "/tmp/.s.PGSQL.5432"
strace: Process 163894 attached
2019-03-06 10:07:48 GMT [] [163894]: [1-1] db=,user= LOG:  database system was 
shut down at 2019-03-06 10:06:22 GMT
strace: Process 163895 attached
strace: Process 163896 attached
strace: Process 163897 attached
strace: Process 163898 attached
strace: Process 163899 attached
strace: Process 163900 attached
2019-03-06 10:07:48 GMT [] [163892]: [4-1] db=,user= LOG:  database system is 
ready to accept connections
strace: Process 163908 attached
strace: Process 163915 attached
strace: Process 163928 attached
strace: Process 163931 attached
strace: Process 163943 attached
strace: Process 163960 attached
strace: Process 163976 attached
2019-03-06 10:09:08 GMT [] [163895]: [1-1] db=,user= LOG:  checkpoint starting: 
xlog
strace: Process 163989 attached
strace: Process 164002 attached
2019-03-06 10:09:42 GMT [] [163895]: [2-1] db=,user= LOG:  checkpoint complete: 
wrote 201350 buffers (15.4%); 0 WAL file(s) added, 0 removed, 0 recycled; 
write=33.879 s, sync=0.208 s, total=34.093 s; sync files=30, longest=0.142 s, 
average=0.006 s; distance=1228942 kB, estimate=1228942 kB
strace: Process 164015 attached
2019-03-06 10:09:58 GMT [] [163895]: [3-1] db=,user= LOG:  checkpoint starting: 
xlog
strace: Process 164035 attached
strace: Process 164050 attached
2019-03-06 10:10:26 GMT [] [163895]: [4-1] db=,user= LOG:  checkpoint complete: 
wrote 126477 buffers (9.6%); 0 WAL file(s) added, 0 removed, 0 recycled; 
write=27.650 s, sync=0.015 s, total=27.668 s; sync files=30, longest=0.004 s, 
average=0.000 s; distance=1226813 kB, estimate=1228729 kB
strace: Process 164066 attached
2019-03-06 10:10:43 GMT [] [163895]: [5-1] db=,user= LOG:  checkpoint starting: 
xlog
strace: Process 164079 attached
strace: Process 164102 attached
2019-03-06 10:11:08 GMT [] [163895]: [6-1] db=,user= LOG:  checkpoint complete: 
wrote 125741 buffers (9.6%); 1 WAL file(s) added, 0 removed, 0 recycled; 
write=24.804 s, sync=0.077 s, total=24.947 s; sync files=30, longest=0.015 s, 
average=0.002 s; distance=1229772 kB, estimate=1229772 kB
strace: Process 164223 attached
2019-03-06 10:11:22 GMT [] [163895]: [7-1] db=,user= LOG:  checkpoint starting: 
xlog
strace: Process 164347 attached
2019-03-06 10:11:46 GMT [] [163895]: [8-1] db=,user= LOG:  checkpoint complete: 
wrote 133784 buffers (10.2%); 1 WAL file(s) added, 0 removed, 0 recycled; 
write=23.399 s, sync=0.141 s, total=23.595 s; sync files=29, longest=0.027 s, 
average=0.004 s; distance=1227832 kB, estimate=1229578 kB
strace: Process 164460 attached
2019-03-06 10:11:52 GMT [[local]] [163931]: [1-1] db=cmdstaging,user=postgres 
LOG:  duration: 209593.079 ms  statement: update service_order set customer_pon 
= now();
strace: Process 164575 attached
strace: Process 164694 attached
2019-03-06 10:12:28 GMT [] [163928]: [1-1] db=,user= LOG:  automatic vacuum of 
table "cmdstaging.public.service_order": index scans: 1
pages: 0 removed, 171162 remain, 413 skipped due to pins, 0 skipped 
frozen
tuples: 123264 removed, 1300328 remain, 0 are dead but not yet 
removable, oldest xmin: 1172166
buffer usage: 890073 hits, 250961 misses, 270986 dirtied
avg read rate: 7.826 MB/s, avg write rate: 8.451 MB/s
system usage: CPU: user: 12.96 s, system: 10.24 s, elapsed: 250.52 s
^Cstrace: Process 163892 detached
2019-03-06 10:12:29 GMT [] [163892]: [5-1] db=,user= LOG:  received fast 
shutdown request
strace: Process 163895 detached
strace: Process 163896 detached
strace: Process 163897 detached
strace: Process 163898 detached
strace: Process

Re: Performance comparison between Pgsql 10.5 and Pgsql 11.2

2019-03-06 Thread Nicola Contu
This is instead the strace of another server running the same compiled
version, but that one is even slower.



On Wed, Mar 6, 2019 at 11:14, Nicola Contu  wrote:

> Here is the strace as requested for pg11
>
> Thanks
>
> On Tue, Mar 5, 2019 at 17:47, Thomas Munro <thomas.mu...@gmail.com> wrote:
>
>> On Wed, Mar 6, 2019 at 4:22 AM Nicola Contu 
>> wrote:
>> >
>> > Not sure what you are requesting exactly but here is the strace for the
>> start of the pg_ctl
>>
>> I meant that you could run the server itself in the foreground under
>> strace, like so:
>>
>> $ strace -f -c /usr/local/pgsql11.2/bin/postgres -D /db/pgsql11/data/
>>
>> Then perform your testing, and finally stop it with pg_ctl from
>> another window (or hit ^C in this window) and strace should spit out a
>> table of system calls with some counters.  We might be able to see why
>> v11 is spending so much more time executing system calls than v10 for
>> your workload, or at least which systems calls they are, assuming you
>> run the same transactions against both versions.
>>
>> --
>> Thomas Munro
>> https://enterprisedb.com
>>
>
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
2019-03-06 11:04:56 GMT [127.0.0.1(46276)] [14799]: [2-1] 
db=cmdv3preprod,user=cmdpreprodapp FATAL:  terminating connection due to 
administrator command
 32.50  143.010306   7  21044095   lseek
 26.21  115.354045  14   8144577   read
  6.18   27.185578  16   166988910 sendto
  5.29   23.300584  57407528   fdatasync
  4.93   21.709522   9   2313529824174 recvfrom
  3.31   14.547568  19765897   write
  2.73   12.007486  14867088 14494 epoll_wait
  2.189.597460  15659871 84097 futex
  1.858.147759  14567414   close
  1.777.767832  18437656 11319 open
  1.536.749496 161 42009   wait4
  1.396.110869 226 26994   setsid
  1.315.760971  15396402   brk
  1.165.083198  29175022   munmap
  0.903.954892  18224829   epoll_ctl
  0.893.928084  17227227 1 mmap
  0.773.376280  11298018   rt_sigaction
  0.632.778183  16178446   kill
  0.632.763563  17159928   rt_sigprocmask
  0.522.306854  12190988   fstat
  0.421.850528 128 14468   sync_file_range
  0.351.534351  52 29591 18762 select
  0.351.532532  16 96298   epoll_create1
  0.321.4151602331   607   fsync
  0.291.263220  21 61605 35503 stat
  0.281.227967 125  982728 fallocate
  0.271.195613   8153557 38629 rt_sigreturn
  0.261.123353 112 10013   173 unlink
  0.241.060130  20 52112   mprotect
  0.110.487515  18 26994   clone
  0.100.458161  11 41080   fadvise64
  0.100.440217  16 26991 1 access
  0.090.406644   4109169   fcntl
  0.040.188007 173  1088   rename
  0.040.181400  18  9829   ftruncate
  0.030.119387   4 26995   pipe
  0.010.036748   1 26995   set_robust_list
  0.010.023510  24  1000   dup
  0.000.019099  13  1425   setitimer
  0.000.017918  18   981   setsockopt
  0.000.015479  26   595   accept
  0.000.010666  18   602   getsockname
  0.000.010172  50   202   getdents
  0.000.004645  28   165   openat
  0.000.003369  7346   link
  0.000.002171  4350   mremap
  0.000.000213  71 3 2 mkdir
  0.000.000189  32 6   getcwd
  0.000.000179  30 6   bind
  0.000.000153  1015   getrusage
  0.000.000152  30 5   chdir
  0.000.000133  1211   socket
  0.000.000119  24 5   lstat
  0.000.000102  13 8 2 connect
  0.000.83  83 1   rmdir
  0.000.70  23 3   geteuid
  0.000.68  23 3   listen
  0.000.56  28 2

Re: Performance comparison between Pgsql 10.5 and Pgsql 11.2

2019-03-07 Thread Nicola Contu
So the first file is from Postgres 11.2 on a test server (the one where I
compare 10 vs 11).
The second file is from our preprod machine running Postgres 11.2 (different
hardware, it is a VM). I know that may be confusing, but I wanted to compare
that too, because if you look at the two files there is a lot of difference
between the two machines.
They are both running CentOS 7.

So at this point I have two problems: one on the test machine between
Postgres 10 and 11, and another on preprod (similar to prod) with a lot of
lseek() calls.

Sorry if this is confusing; I hope it is clear now.

Regarding partitions/tables: the first file involves just one table. The
second file (with the huge number of lseek() calls) was also a test on a
single table, but meanwhile the server was accessible by the preprod web
application, so it may have been hit by some user on some other table.


Questions:
1) Is it possible that pgbench is just not a good tool for testing
performance? If I use an SQL script with thousands of INSERTs and compare
pg10 and pg11 on the same server, I get pretty much the same result (maybe
even better on pg11).
2) Regarding preprod, is there any way to reduce those lseek() calls? Just to
let you know, running the same insert script on both servers, the first one
takes 2 minutes and the second one takes 5-7 minutes.
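
As Thomas suggests in the quoted message below, prepared statements should
remove some of the planning-related lseek() calls. A minimal sketch, assuming
the standard pgbench_accounts table created by pgbench -i (pgbench itself can
do the same thing when run with -M prepared):

-- Prepared once per session; later executions can reuse the parse/plan work,
-- which is where part of the relation-size probing (lseek) happens.
PREPARE get_balance (int) AS
    SELECT abalance FROM pgbench_accounts WHERE aid = $1;

-- Subsequent calls only bind and execute.
EXECUTE get_balance(1);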

Thanks a lot,





On Thu, Mar 7, 2019 at 04:47, Thomas Munro  wrote:

> On Thu, Mar 7, 2019 at 12:12 AM Nicola Contu 
> wrote:
> > This is instead the strace of another server running the same version
> compiled  but that is even slower.
>
> Huh.  That's a lot of lseek().  Some of these will be for random
> reads/writes and will go way in v12, and some will be for probing the
> size of relations while planning, and some while executing scans.  I
> bet you could make some of them go away by using prepared statements.
> Does the query in your test involve many partitions/tables?
>
> % time seconds  usecs/call callserrors syscall
> -- --- --- - - 
>  32.50  143.010306   7  21044095   lseek
>  26.21  115.354045  14   8144577   read
>   6.18   27.185578  16   166988910 sendto
>   5.29   23.300584  57407528   fdatasync
>   4.93   21.709522   9   2313529824174 recvfrom
>   3.31   14.547568  19765897   write
>   2.73   12.007486  14867088 14494 epoll_wait
>   2.189.597460  15659871 84097 futex
>   1.858.147759  14567414   close
>   1.777.767832  18437656 11319 open
>
> The other results had 1 usec lseek(), and much fewer of them relative
> to the number of reads and writes.  BTW, are you comparing v10 and v11
> on the same hardware, kernel, filesystem?  Just wondering if there
> could be some change in syscall overhead on different kernel patch
> levels or something like that: we see 7 usec vs 1 usec in those two
> files (though I have no idea how reliable these times are) and if
> we're going to call it 21 million times at some point it might
> matter...
>
> --
> Thomas Munro
> https://enterprisedb.com
>


Re: Performance comparison between Pgsql 10.5 and Pgsql 11.2

2019-03-12 Thread Nicola Contu
Hello,
do you have any advice on this?

Thanks a lot in advance

On Thu, Mar 7, 2019 at 09:39, Nicola Contu  wrote:

> So the first file is on Postgres11.2 on a test server (and where I compare
> 10 vs 11)
> The second file, is our preprod machine running Postgres 11.2 (different
> hardware etc, it is a VM). I know that could be confusing, but I just
> wanted to compare that too because if you see the two files there's a lot
> of difference between the two machines.
> And they are both running CentOS 7.
>
> So at this point I have two problems. One inside the machine between
> Postgres 10 and 11 and another problem on the preprod (similar to prod)
> with a lot of lseek.
>
> Sorry if this is confusing, hope it is clear now.
>
> Regarding partitions/tables. The first file involves just one table. The
> second file (with a huge lseek) was running the test on a single table, but
> meanwhile it was accessible by the preprod web application. So it was maybe
> hit by some user and some other table.
>
>
> Question:
> 1) Is it possible that pgbench could not be really a good tool for testing
> the performances? If I use a sql script of thousands of insert records and
> compare on the same server between pg10 and pg11 I get pretty much the same
> result (maybe better on pg11)
> 2) regarding preprod, is there any way to reduce those lseek()?  Just to
> let you know, comparing the same insert script between the first server,
> the first server takes 2m the second one takes 5-7m.
>
> Thanks a lot,
>
>
>
>
>
> On Thu, Mar 7, 2019 at 04:47, Thomas Munro <thomas.mu...@gmail.com> wrote:
>
>> On Thu, Mar 7, 2019 at 12:12 AM Nicola Contu 
>> wrote:
>> > This is instead the strace of another server running the same version
>> compiled  but that is even slower.
>>
>> Huh.  That's a lot of lseek().  Some of these will be for random
>> reads/writes and will go way in v12, and some will be for probing the
>> size of relations while planning, and some while executing scans.  I
>> bet you could make some of them go away by using prepared statements.
>> Does the query in your test involve many partitions/tables?
>>
>> % time seconds  usecs/call callserrors syscall
>> -- --- --- - - 
>>  32.50  143.010306   7  21044095   lseek
>>  26.21  115.354045  14   8144577   read
>>   6.18   27.185578  16   166988910 sendto
>>   5.29   23.300584  57407528   fdatasync
>>   4.93   21.709522   9   2313529824174 recvfrom
>>   3.31   14.547568  19765897   write
>>   2.73   12.007486  14867088 14494 epoll_wait
>>   2.189.597460  15659871 84097 futex
>>   1.858.147759  14567414   close
>>   1.777.767832  18437656 11319 open
>>
>> The other results had 1 usec lseek(), and much fewer of them relative
>> to the number of reads and writes.  BTW, are you comparing v10 and v11
>> on the same hardware, kernel, filesystem?  Just wondering if there
>> could be some change in syscall overhead on different kernel patch
>> levels or something like that: we see 7 usec vs 1 usec in those two
>> files (though I have no idea how reliable these times are) and if
>> we're going to call it 21 million times at some point it might
>> matter...
>>
>> --
>> Thomas Munro
>> https://enterprisedb.com
>>
>


pg_replication_slots

2017-11-29 Thread Nicola Contu
Hello,
I just have a few questions on replication slots:

- is it possible to get the size of a slot?
- if the slave is down, the retained WAL grows; when the slave comes up again,
will it be flushed after the WALs have been pushed?
- will slots impact performance on the master?

I'm just worried about the size.

Thank you


Centos 6.9 and centos 7

2017-12-04 Thread Nicola Contu
Hello,
we recently upgraded the OS, moving from CentOS 6.9 to a new server with CentOS 7.

The CentOS 6.9 server has now become the preproduction server.
We are running postgres 9.6.6 on both servers.


They are both on SSD disks; these are the only differences:

- the DB partition on CentOS 7 is on a RAID 10
- the file system is xfs on CentOS 7 (ext4 on CentOS 6.9)
- more memory on CentOS 7 (so the params in postgresql.conf are higher)
 max_connections = 220
 shared_buffers = 10GB
 effective_cache_size = 120GB
 work_mem = 349525kB
 maintenance_work_mem = 2GB
 min_wal_size = 1GB
 max_wal_size = 2GB
 checkpoint_completion_target = 0.7
 wal_buffers = 16MB
 default_statistics_target = 100
- we have two replicas on the centos 7. One is async one is sync
 synchronous_standby_names = '1 ( "" )'
 synchronous_commit = on

They have the same DB inside, with the same data.
Running the same script on the two servers gives different results.
Even a SELECT query is faster on the CentOS 6.9 server: it takes half the time
on the preprod server.

centos 7:

dbname=# \timing
Timing is on.
cmdv3=# SELECT id FROM client_billing_account WHERE name = 'name';
 id
---
 *
(1 row)
Time: 3.884 ms

centos 6.9:

dbname=# SELECT id FROM client_billing_account WHERE name = 'name';
 id
---
 *
(1 row)
Time: 1.620 ms

This table has 32148 records.
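
One way to check whether the plans themselves differ between the two servers
(Tomas asks for exactly this further down the thread); just a sketch using
the query above:

-- Compare node types (index scan vs. seq scan), shared hit/read buffer
-- counts and actual times between the two servers.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM client_billing_account WHERE name = 'name';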

Do you think we can modify anything to achieve the same performance?

I read about a few kernel params:

kernel.sched_migration_cost_ns = 500
kernel.sched_autogroup_enabled = 0
vm.dirty_background_bytes = 67108864
vm.dirty_bytes = 1073741824
vm.zone_reclaim_mode = 0
vm.swappiness = 1.1

Is there anything you can advise to help solve or identify the problem?

Thanks a lot,
Nicola


Re: Centos 6.9 and centos 7

2017-12-04 Thread Nicola Contu
These are the timings in centos 7 :

Time: 4.248 ms
Time: 2.983 ms
Time: 3.027 ms
Time: 3.298 ms
Time: 4.420 ms
Time: 2.599 ms
Time: 2.555 ms
Time: 3.008 ms
Time: 6.220 ms
Time: 4.275 ms
Time: 2.841 ms
Time: 3.699 ms
Time: 3.387 ms


These are the timings in centos 6:
Time: 1.722 ms
Time: 1.670 ms
Time: 1.843 ms
Time: 1.823 ms
Time: 1.723 ms
Time: 1.724 ms
Time: 1.747 ms
Time: 1.734 ms
Time: 1.764 ms
Time: 1.622 ms


This is top on centos 6 :

[root@]# top
top - 14:33:32 up 577 days, 23:08,  1 user,  load average: 0.16, 0.11, 0.15
Tasks: 1119 total,   1 running, 1118 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.1%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Mem:  132040132k total, 129530504k used,  2509628k free,   108084k buffers
Swap: 11665404k total,   331404k used, 11334000k free, 124508916k cached

This is top on centos 7:

top - 14:35:38 up 73 days, 19:00,  6 users,  load average: 22.46, 20.89,
20.54
Tasks: 821 total,  13 running, 807 sleeping,   0 stopped,   1 zombie
%Cpu(s): 14.2 us,  5.0 sy,  0.0 ni, 77.5 id,  3.1 wa,  0.0 hi,  0.2 si,
0.0 st
KiB Mem : 26383592+total,  4301464 free,  6250384 used, 25328406+buff/cache
KiB Swap: 16777212 total, 11798876 free,  4978336 used. 24497036+avail Mem


The production machine is obviously more heavily accessed. But that does not
seem to be the problem, as running the same query on the replica of the
production machine (same config as the master, but not accessed by anyone)
gives the same bad result:
Time: 6.366 ms


2017-12-04 15:19 GMT+01:00 Chris Mair :

> centos 7 :
>> Time: 3.884 ms
>>
>> centos 6.9
>>
> Time: 1.620 ms
>
>>
>>
>> Is there anything you can advice to solve or identify the problem?
>>
>
> Can you run this query 10 times on each server and note the timings?
>
> I'd like to see the reproducability of this.
>
> Also: both machines are otherwise idle (check with top or uptime)?
>
> Bye,
> Chris.
>
>
>


Re: Centos 6.9 and centos 7

2017-12-04 Thread Nicola Contu
To do better testing, I used a third server.
It is identical to the CentOS 7 machine, and it is not included in the
replica cluster.

Nobody is accessing this machine; this is top:

top - 14:48:36 up 73 days, 17:39,  3 users,  load average: 0.00, 0.01, 0.05
Tasks: 686 total,   1 running, 685 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,
0.0 st
KiB Mem : 26383592+total,  1782196 free,  2731144 used, 25932257+buff/cache
KiB Swap: 16777212 total, 16298536 free,   478676 used. 21693456+avail Mem

These are the timings:
Time: 2.841 ms
Time: 1.980 ms
Time: 2.240 ms
Time: 2.947 ms
Time: 2.828 ms
Time: 2.227 ms
Time: 1.998 ms
Time: 1.990 ms
Time: 2.643 ms
Time: 2.143 ms
Time: 2.919 ms
Time: 2.246 ms

I never got the same results as on the CentOS 6.9 machine.



2017-12-04 15:40 GMT+01:00 Nicola Contu :

> These are the timings in centos 7 :
>
> Time: 4.248 ms
> Time: 2.983 ms
> Time: 3.027 ms
> Time: 3.298 ms
> Time: 4.420 ms
> Time: 2.599 ms
> Time: 2.555 ms
> Time: 3.008 ms
> Time: 6.220 ms
> Time: 4.275 ms
> Time: 2.841 ms
> Time: 3.699 ms
> Time: 3.387 ms
>
>
> These are the timings in centos 6:
> Time: 1.722 ms
> Time: 1.670 ms
> Time: 1.843 ms
> Time: 1.823 ms
> Time: 1.723 ms
> Time: 1.724 ms
> Time: 1.747 ms
> Time: 1.734 ms
> Time: 1.764 ms
> Time: 1.622 ms
>
>
> This is top on centos 6 :
>
> [root@]# top
> top - 14:33:32 up 577 days, 23:08,  1 user,  load average: 0.16, 0.11, 0.15
> Tasks: 1119 total,   1 running, 1118 sleeping,   0 stopped,   0 zombie
> Cpu(s):  0.0%us,  0.1%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,
> 0.0%st
> Mem:  132040132k total, 129530504k used,  2509628k free,   108084k buffers
> Swap: 11665404k total,   331404k used, 11334000k free, 124508916k cached
>
> This is top on centos 7:
>
> top - 14:35:38 up 73 days, 19:00,  6 users,  load average: 22.46, 20.89,
> 20.54
> Tasks: 821 total,  13 running, 807 sleeping,   0 stopped,   1 zombie
> %Cpu(s): 14.2 us,  5.0 sy,  0.0 ni, 77.5 id,  3.1 wa,  0.0 hi,  0.2 si,
> 0.0 st
> KiB Mem : 26383592+total,  4301464 free,  6250384 used, 25328406+buff/cache
> KiB Swap: 16777212 total, 11798876 free,  4978336 used. 24497036+avail Mem
>
>
> The production machine is obviously more accessed. But that does not seem
> to be the problem as running the same query on the replica of the
> production machine (same config of the master but not accessed by anyone)
> gives the same bad result:
> Time: 6.366 ms
>
>
> 2017-12-04 15:19 GMT+01:00 Chris Mair :
>
>> centos 7 :
>>> Time: 3.884 ms
>>>
>>> centos 6.9
>>>
>> Time: 1.620 ms
>>
>>>
>>>
>>> Is there anything you can advice to solve or identify the problem?
>>>
>>
>> Can you run this query 10 times on each server and note the timings?
>>
>> I'd like to see the reproducability of this.
>>
>> Also: both machines are otherwise idle (check with top or uptime)?
>>
>> Bye,
>> Chris.
>>
>>
>>
>


Re: Centos 6.9 and centos 7

2017-12-04 Thread Nicola Contu
No, I did not run a VACUUM ANALYZE. Do you want me to try that first?
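
For reference, a quick way to see when statistics were last gathered for the
table from the test, and to refresh them by hand; just a sketch:

-- When did (auto)analyze last touch the table used in the timing test?
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'client_billing_account';

-- Refresh the statistics manually before re-running the timings.
ANALYZE client_billing_account;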

@Tomas:
Talking about power management, I changed the tuned-adm profile
to latency-performance instead of balanced (which is the default).

That is improving performance for now, and the timings are similar to CentOS 6.9:

Time: 2.121 ms
Time: 2.026 ms
Time: 1.664 ms
Time: 1.749 ms
Time: 1.656 ms
Time: 1.675 ms

Do you think this can be easily done in production as well?

2017-12-04 16:37 GMT+01:00 Alban Hertroys :

> Did  you run ANALYZE on your tables before the test?
>
> On 4 December 2017 at 16:01, Tomas Vondra 
> wrote:
> >
> > On 12/04/2017 02:19 PM, Nicola Contu wrote:
> > ...>
> >> centos 7 :
> >>
> >> dbname=# \timing Timing is on. cmdv3=# SELECT id FROM
> >> client_billing_account WHERE name = 'name'; id --- * (1 row)
> >> Time: 3.884 ms
> >>
> >> centos 6.9
> >>
> >> dbname=# SELECT id FROM client_billing_account WHERE name = 'name'; id
> >> --- * (1 row) Time: 1.620 ms
> >>
> >
> > We need to see EXPLAIN (ANALYZE,BUFFERS) for the queries.
> >
> > Are those VMs or bare metal? What CPUs and RAM are there? Have you
> > checked that power management is disabled / cpufreq uses the same
> > policy? That typically affects short CPU-bound queries.
> >
> > Other than that, I recommend performing basic system benchmarks (CPU,
> > memory, ...) and only if those machines perform equally should you look
> > for issues in PostgreSQL. Chances are the root cause is in hw or OS, in
> > which case you need to address that first.
> >
> > regards
> >
> > --
> > Tomas Vondra  http://www.2ndQuadrant.com
> > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> >
>
>
>
> --
> If you can't see the forest for the trees,
> Cut the trees and you'll see there is no forest.
>


pgstattuple free_percent to high

2017-12-13 Thread Nicola Contu
Hello,
We are running postgres 9.6.6 on centos 7.

We have a large DB (180GB) with about 1200 tables.

We have autovacuum set to its default values, and we are seeing that for some
tables the free percent goes really high (51%), so we need to run a full
vacuum on those tables daily.

dbname=# SELECT * FROM pgstattuple('tablename');

 table_len          | 2119548928
 tuple_count        | 526658
 tuple_len          | 1023569149
 tuple_percent      | 48.29
 dead_tuple_count   | 0
 dead_tuple_len     | 0
 dead_tuple_percent | 0
 free_space         | 1083485292
 free_percent       | 51.12
(1 row)

I guess this is because of long-running queries, but I'm not really sure.
Do you know how to avoid this problem and what can cause it?

Do you think that increasing the autovacuum settings for those tables would
alleviate the issue?
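
For what it's worth, this is the kind of query that can be used to watch
free_percent across the biggest tables; just a sketch, and note that
pgstattuple reads every block of each table it is pointed at, so it is not
cheap:

SELECT t.relname, s.free_percent, s.dead_tuple_percent
FROM (
    -- restrict to the ten largest ordinary tables first, so pgstattuple
    -- only scans those
    SELECT c.oid, c.relname
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relkind = 'r' AND n.nspname = 'public'
    ORDER BY pg_relation_size(c.oid) DESC
    LIMIT 10
) AS t
CROSS JOIN LATERAL pgstattuple(t.oid::regclass) AS s;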

Thanks,
Nicola


Re: pgstattuple free_percent to high

2017-12-14 Thread Nicola Contu
Hi Rene,
thanks for your reply.

I think tuning the autovacuum settings may improve performance and remove
dead tuples, but as far as I know autovacuum runs a VACUUM ANALYZE, and a
VACUUM ANALYZE won't touch the free_percent of the table.

So I'm trying to find a way to reduce the free percent for some tables
without doing a manual full vacuum.
We are now monitoring the free percent, so we may find the part of the code
that increases that value, but I was wondering if there is anything on the
Postgres side to resolve this problem.
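
To make it concrete, the per-table knobs Rene suggests in the message quoted
below can be set like this; a sketch only, using his example values rather
than anything we have already applied:

-- Per-table autovacuum storage parameters (values from Rene's example);
-- they make autovacuum fire more often, but they do not shrink the table.
ALTER TABLE tablename SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_threshold    = 50,
    autovacuum_vacuum_cost_delay   = 5,
    autovacuum_vacuum_cost_limit   = 3000
);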

Thanks,
Nicola


2017-12-14 0:16 GMT+01:00 Rene Romero Benavides :

> Check for long running transactions modifying (update, insert) on those
> tables ,using pg_stat_activity.
>
> Tweak these storage parameters for such tables:
> autovacuum_vacuum_cost_delay : decrease it (the autovacuum daemon goes to
> sleep less often )
> autovacuum_vacuum_threshold : decrease it (to trigger more frequent
> autovacuum activations )
> autovacuum_vacuum_cost_limit : increase it (to allow the autovacuum daemon
> to work for longer periods)
> autovacuum_vacuum_scale_factor : decrease it (to trigger more autovacuum
> activations when this percentage of a table has been modified)
>
> For example I've set these parameters for one table experiencing long
> running transactions, and for its access patterns have worked:
>
> autovacuum_vacuum_cost_delay=5, autovacuum_vacuum_threshold=
> 50,autovacuum_vacuum_cost_limit=3000, autovacuum_vacuum_scale_factor=0.01
> but these settings are very particular for each usage pattern.
>
> Take into account that more activity from autovacuum means more IO, more
> CPU usage, you might also benefit from setting autovacuum_work_mem to a
> higher setting if the available RAM allows it, to give more RAM to the
> autovacuum daemon.
>
>
>
> 2017-12-13 9:49 GMT-06:00 Nicola Contu :
>
>> Hello,
>> We are running postgres 9.6.6 on centos 7.
>>
>> We have a large DB (180GB) with about 1200 tables.
>>
>> We have autovacuum set with default values and we are seeing that for
>> some tables the free percent goes really high (51%) and we need to daily
>> full vacuum those tables.
>>
>> dbanme=# SELECT * FROM pgstattuple('tablename');
>>  table_len  | tuple_count | tuple_len  | tuple_percent | dead_tuple_count
>> | dead_tuple_len | dead_tuple_percent | free_space | free_percent
>> +-++---+
>> --+++---
>> -+--
>>  2119548928 |  526658 | 1023569149 | 48.29 |0
>> |  0 |  0 | 1083485292 |51.12
>> (1 row)
>>
>> I guess this is because of long queries but I'm not really sure.
>> Do you know how to avoid this problem and what can cause it?
>>
>> Do you think that increasing the autovacuum settings for those tables
>> would alleviate the issue?
>>
>> Thanks,
>> Nicola
>>
>
>
>
> --
> Genius is 1% inspiration and 99% perspiration.
> Thomas Alva Edison
> http://pglearn.blogspot.mx/
>
>
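
As a concrete form of Rene's first suggestion above, a sketch of a
pg_stat_activity check for long-running transactions (column names as in 9.6):

-- Oldest open transactions; a transaction that stays open for a long time
-- holds back the xmin horizon and keeps vacuum from reclaiming dead tuples.
SELECT pid,
       usename,
       state,
       now() - xact_start AS xact_age,
       left(query, 60)    AS current_query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;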


Re: pgstattuple free_percent to high

2017-12-19 Thread Nicola Contu
Hello,
can anyone help with the matter Alessandro highlighted?

Thanks a lot,
Nicola


2017-12-16 13:40 GMT+01:00 Alessandro Aste :

> Stephen, Rene - Thanks!
>
> Our experience teaches us that above 20% of free space, performance starts to
> seriously deteriorate.  I'm not sure if this is related to index or table
> fragmentation. We'll do our homework and we'll try to discover more.
>
> However, we have identified a process potentially causing the free space to
> spike, but it's not related to long-running transactions modifying
> (update, insert) those tables. What DEV is currently doing is creating a
> sort of de-normalized cache table, like this:
> * tablename is the table with the free space issue
>
> INSERT INTO cache_table
> SELECT *, table2., ...table.col  FROM tablename
> JOIN table2 on ...
> ...
> ..
> JOIN  table on
>
> So no updates/inserts/deletes to tablename are involved, but that query may
> run for about 15 minutes (that is more or less the average), holding an
> ACCESS SHARE lock (I guess).
>
> We are wondering why we have that spike if we are not modifying the
> relation, and what we can suggest to DEV from the query/DB perspective to
> alleviate/fix the issue.
>
> Again, thanks so much.
>
>
>
> Thanks and kind regards
> Aste - alessandro.a...@gtt.net
>
> -Original Message-
> From: Stephen Frost [mailto:sfr...@snowman.net]
> Sent: 14 December 2017 16:13
> To: Nicola Contu 
> Cc: Rene Romero Benavides ; pgsql-general@lists.
> postgresql.org; Alessandro Aste 
> Subject: Re: pgstattuple free_percent to high
>
> Greetings Nicola,
>
> * Nicola Contu (nicola.co...@gmail.com) wrote:
> > I think tuning the autovacuum settings may increase performances and
> > remove dead_tuples but as far as I know, the autovacuum runs a vacuum
> analyze.
> > The vacuum analyze won't touch the free_percent of the table.
>
> That's not entirely accurate.  If all of the free space is at the *end* of
> the relation then autovacuum will attempt to lock the relation and truncate
> the table to give that free space back to the OS.
>
> On a table where all of the rows are regularly updated, eventually the
> "live" data should end up towards the front of the relation and the end of
> the relation will be all dead tuples, allowing the truncate to happen.  If
> you have tuples at the end of the relation that aren't ever updated but
> they're "live" then we won't be able to truncate.
>
> The pg_freespacemap extension can be useful to see where the free space is
> in the relation.
>
> There are a few tools out there that aren't part of core PostgreSQL that
> you could consider using such as pg_repack and pg_squeeze.
>
> > So I'm trying to find a way to adjust the free percent for some tables
> > without doing a manually full vacuum.
> > We are now monitoring the free percent, so we may find the part of the
> > code that can increase that value, but was wondering if there is
> > anything on the postgres side to resolve this problem.
>
> Having some free space in the relation isn't a 'problem' and is a good
> thing because it means that new rows (from either INSERTs or UPDATEs) have
> a place to go that doesn't require extending the relation (which requires
> an additional lock as well as some additional work).  As for how much free
> space is good to have and how much is too much depends on the specific
> workload.
>
> Thanks!
>
> Stephen
>
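
Following up on Stephen's pointer, a sketch of how the pg_freespacemap
extension can show where the free space sits inside the relation ('tablename'
is the same placeholder as in the pgstattuple output above):

CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

-- Free space the FSM records for each block; if the large "avail" values are
-- concentrated at the highest block numbers, (auto)vacuum should eventually
-- be able to truncate the tail of the table.
SELECT blkno, avail
FROM pg_freespace('tablename')
ORDER BY blkno DESC
LIMIT 20;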