Monitor repl slot size
Hello,
we used to monitor the replication slot size on postgres 9.6.6 with the following query:

SELECT pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn) FROM pg_replication_slots WHERE slot_name = 'SLOT NAME';

We are moving to postgres 10.4 and we saw that pg_xlog_location_diff is not there anymore. I know we can re-create it following this link: https://github.com/DataDog/integrations-core/issues/907 but is there a better way to do it? Is there any replacement for that function on postgres 10?

Thanks a lot,
Nicola
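For reference, PostgreSQL 10 renamed the xlog functions rather than dropping the functionality: pg_current_xlog_location() became pg_current_wal_lsn() and pg_xlog_location_diff() became pg_wal_lsn_diff(). Keeping the same slot-name placeholder as above, the equivalent query should be along the lines of:

SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
FROM pg_replication_slots
WHERE slot_name = 'SLOT NAME';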
Waiting connections postgres 10
Hello,
we used to monitor waiting connections with postgres 9.6.6 via this query:

select count(*) from pg_stat_activity where wait_event IS NOT NULL

Now with postgres 10 it seems a bit harder to understand when a query from the application is waiting. I built this query but I'm not sure I'm getting the right information:

select count(*) from pg_stat_activity where wait_event_type = 'Client' and wait_event IN ('ClientRead','ClientWrite') and state = 'idle'

Can anyone help me?
Thanks
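One caveat with that approach: in PostgreSQL 10, wait_event_type = 'Client' with ClientRead mostly means the backend is simply idle, waiting for the application to send the next command, not that a query is stuck. A sketch of a query that counts backends that are actually blocked while executing (it excludes the client- and housekeeping-related wait classes, which may or may not match what you want to alert on) would be something like:

SELECT count(*)
FROM pg_stat_activity
WHERE state = 'active'
  AND wait_event_type IS NOT NULL
  AND wait_event_type NOT IN ('Client', 'Activity');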
Restore from dumps
Hello,
we recently moved from postgres 9.6.6 to 10.4.

We perform a pg_dump in production to restore daily in a preprod env. This process used to work perfectly, but now we have a small problem. We first restore the data, we perform a vacuum, and then we restore the matviews. When restoring the matviews we now get:

pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 23090; 0 1912379424 MATERIALIZED VIEW DATA matview_vrs_request_sla postgres
pg_restore: [archiver (db)] could not execute query: ERROR: relation "all_days" does not exist
LINE 3: from all_days
             ^
QUERY:
select count(*)::numeric
from all_days
where (("date" between $2::date and $1::date) or ("date" between $1::date and $2::date))
and dow not in (0,6)

CONTEXT: SQL function "bdays" during inlining
Command was: REFRESH MATERIALIZED VIEW public.matview_vrs_request_sla;

The relation is there; in fact if I run the same command later, when I get into the office, it works. I'm not sure why it does not work here, this seems really strange to me.

Can anyone help?

Thank you,
Nicola
Re: Restore from dumps
Hello,
these are pretty much the commands we use:

tar -xf tarname.tar -C /backupdatadir --strip-components=4
pg_restore -l /backupdatadir | sed '/MATERIALIZED VIEW DATA/d' > /restore.lst
pg_restore -U postgres -L /restore.lst -d DBNAME -j 32 /backupdatadir
vacuumdb --analyze-in-stages -U postgres --jobs 32 -d DBNAME
pg_restore -l /backupdatadir | grep 'MATERIALIZED VIEW DATA' > /refresh.lst
pg_restore -U postgres -L /refresh.lst -d DBNAME -j 32 /backupdatadir

Yes, all_days is a table. bdays instead is a function, and it is included in the first restore list. Both servers have 10.4 binaries: we take the backup with 10.4 and we restore with 10.4. We used to have postgres 9.6.6 in production and 10.4 in preprod, and the restore always went fine. After switching to 10.4 in prod we started having the problem.

2018-07-25 11:28 GMT+02:00 Alban Hertroys :
>
>
> On 25 Jul 2018, at 9:43, Nicola Contu wrote:
> >
> > Hello,
> > we recently moved from postgres 9.6.6 to 10.4
> >
> > We perform a pg_dump in production to restore daily in a preprod env.
> > This process used to work perfectly, but now we have a tiny problem.
> >
> > We first restore data, we perform a vacuum and then we restore matviews.
>
> What are the commands you used? You don't seem to mention restoring the
> schema?
>
> > Restoring matviews now we have :
> >
> > pg_restore: [archiver (db)] Error while PROCESSING TOC:
> > pg_restore: [archiver (db)] Error from TOC entry 23090; 0 1912379424
> MATERIALIZED VIEW DATA matview_vrs_request_sla postgres
> > pg_restore: [archiver (db)] could not execute query: ERROR: relation
> "all_days" does not exist
> > LINE 3: from all_days
> > ^
> > QUERY:
> > select count(*)::numeric
> > from all_days
> > where (("date" between $2::date and $1::date) or ("date" between
> $1::date and $2::date))
> > and dow not in (0,6)
> >
> > CONTEXT: SQL function "bdays" during inlining
> > Command was: REFRESH MATERIALIZED VIEW public.matview_vrs_request_sla;
>
> Is all_days a table? Or is it perhaps another view, possibly materialized
> even?
>
> > The relation is there, in fact if I go there when I get in to the
> office, the same command works.
>
> This sounds to me like you may be using a different version of pg_restore
> in the office. Are both versions 10.4 or newer?
> It can't hurt to check that you used version 10.4 of pg_dump as well.
>
> Alban Hertroys
> --
> If you can't see the forest for the trees,
> cut the trees and you'll find there is no forest.
>
>
Re: Restore from dumps
Thanks Laurenz I will try that. Btw, just wondering why it works if I refresh it later, even if the definition is still without public 2018-07-25 12:06 GMT+02:00 Laurenz Albe : > Nicola Contu wrote: > > we recently moved from postgres 9.6.6 to 10.4 > > > > We perform a pg_dump in production to restore daily in a preprod env. > > This process used to work perfectly, but now we have a tiny problem. > > > > We first restore data, we perform a vacuum and then we restore matviews. > > Restoring matviews now we have : > > > > pg_restore: [archiver (db)] Error while PROCESSING TOC: > > pg_restore: [archiver (db)] Error from TOC entry 23090; 0 1912379424 > MATERIALIZED VIEW DATA matview_vrs_request_sla postgres > > pg_restore: [archiver (db)] could not execute query: ERROR: relation > "all_days" does not exist > > LINE 3: from all_days > > ^ > > QUERY: > > select count(*)::numeric > > from all_days > > where (("date" between $2::date and $1::date) or ("date" between > $1::date and $2::date)) > > and dow not in (0,6) > > > > CONTEXT: SQL function "bdays" during inlining > > Command was: REFRESH MATERIALIZED VIEW public.matview_vrs_request_ > sla; > > > > The relation is there, in fact if I go there when I get in to the > office, the same command works. > > > > I'm not sure why it does not work here, this seems really strange to me. > > I suspect that it has to do with the recent security fixes around the > "public" schema. > > Try to ALTER the materialized view so that it refers to "public.all_days" > rather than "all_days". > > Yours, > Laurenz Albe > -- > Cybertec | https://www.cybertec-postgresql.com >
Re: Restore from dumps
Yeah, we updated that function in production to say public.all_days. I will let you know at the next restore.

Thanks guys, appreciated.

2018-07-25 16:28 GMT+02:00 Tom Lane :
> Laurenz Albe writes:
> > Nicola Contu wrote:
> >> pg_restore: [archiver (db)] could not execute query: ERROR: relation
> "all_days" does not exist
> >> LINE 3: from all_days
> >> ^
> >> QUERY:
> >> select count(*)::numeric
> >> from all_days
> >> where (("date" between $2::date and $1::date) or ("date" between
> $1::date and $2::date))
> >> and dow not in (0,6)
> >>
> >> CONTEXT: SQL function "bdays" during inlining
>
> > Try to ALTER the materialized view so that it refers to "public.all_days"
> > rather than "all_days".
>
> Per the error message, what needs fixing is the SQL function "bdays",
> not the matview as such.
>
> regards, tom lane
>
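For anyone hitting the same thing after an upgrade: since the search_path hardening that accompanied CVE-2018-1058, pg_dump/pg_restore run the restore with a restricted search_path, so any object referenced inside a SQL function body has to be schema-qualified (or the function needs its own search_path). A sketch of the two usual fixes, assuming bdays takes two date arguments (the real signature may differ):

-- Option 1: schema-qualify the reference inside the function body,
-- i.e. write "from public.all_days" instead of "from all_days"

-- Option 2: pin a search_path on the function itself
ALTER FUNCTION public.bdays(date, date) SET search_path = public;

Note that attaching a SET clause prevents a SQL function from being inlined, so option 1 (which is what was done here) is usually the better choice for simple SQL functions.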
Re: Restore from dumps
That worked. Thanks guys. 2018-07-25 16:33 GMT+02:00 Nicola Contu : > yeah, we updated that function in production to says public.all_days. > I will let you know at the next restore. > > Thanks guys, appreciated. > > 2018-07-25 16:28 GMT+02:00 Tom Lane : > >> Laurenz Albe writes: >> > Nicola Contu wrote: >> >> pg_restore: [archiver (db)] could not execute query: ERROR: relation >> "all_days" does not exist >> >> LINE 3: from all_days >> >> ^ >> >> QUERY: >> >> select count(*)::numeric >> >> from all_days >> >> where (("date" between $2::date and $1::date) or ("date" between >> $1::date and $2::date)) >> >> and dow not in (0,6) >> >> >> >> CONTEXT: SQL function "bdays" during inlining >> >> > Try to ALTER the materialized view so that it refers to >> "public.all_days" >> > rather than "all_days". >> >> Per the error message, what needs fixing is the SQL function "bdays", >> not the matview as such. >> >> regards, tom lane >> > >
Pgbouncer and postgres
Hello, I'm trying to get pgbouncer working but I'm getting this error : 2018-09-17 12:20:15.304 87772 NOISE safe_accept(12) = Resource temporarily unavailable 2018-09-17 12:20:36.183 87772 NOISE safe_accept(11) = 13 (127.0.0.1:59100) 2018-09-17 12:20:36.183 87772 NOISE new fd from accept=13 2018-09-17 12:20:36.183 87772 NOISE safe_accept(11) = Resource temporarily unavailable 2018-09-17 12:21:14.412 88424 DEBUG parse_ini_file: 'logfile' = '/var/log/pgbouncer.log' ok:1 2018-09-17 12:21:14.412 88424 DEBUG parse_ini_file: 'pidfile' = '/home/postgres/pgbouncer.pid' 2018-09-17 12:21:14.412 88424 DEBUG parse_ini_file: 'pidfile' = '/home/postgres/pgbouncer.pid' ok:1 2018-09-17 12:21:14.412 88424 DEBUG parse_ini_file: 'admin_users' = 'admin' 2018-09-17 12:21:14.412 88424 DEBUG parse_ini_file: 'admin_users' = 'admin' ok:1 2018-09-17 12:21:14.412 88424 DEBUG parse_ini_file: 'user' = 'postgres' 2018-09-17 12:21:14.412 88424 DEBUG parse_ini_file: 'user' = 'postgres' ok:1 2018-09-17 12:21:14.412 88424 DEBUG parse_ini_file: 'max_db_connections' = '220' 2018-09-17 12:21:14.412 88424 DEBUG parse_ini_file: 'max_db_connections' = '220' ok:1 2018-09-17 12:21:14.412 88424 DEBUG parse_ini_file: 'log_connections' = '0' 2018-09-17 12:21:14.413 88424 DEBUG parse_ini_file: 'log_connections' = '0' ok:1 2018-09-17 12:21:14.413 88424 DEBUG parse_ini_file: 'log_disconnections' = '0' 2018-09-17 12:21:14.413 88424 DEBUG parse_ini_file: 'log_disconnections' = '0' ok:1 2018-09-17 12:21:14.413 88424 DEBUG parse_ini_file: 'verbose' = '3' 2018-09-17 12:21:14.413 88424 DEBUG parse_ini_file: 'verbose' = '3' ok:1 2018-09-17 12:21:14.413 88424 DEBUG loading auth_file: "/etc/pgbouncer/users.txt" 2018-09-17 12:21:14.417 88424 NOISE event: 128, SBuf: 192, PgSocket: 400, IOBuf: 4108 2018-09-17 12:21:14.417 88424 LOG File descriptor limit: 1024 (H:4096), max_client_conn: 100, max fds possible: 230 2018-09-17 12:21:14.417 88424 DEBUG pktbuf_dynamic(128): 0x25277c0 2018-09-17 12:21:14.417 88424 DEBUG make_room(0x25277c0, 4): realloc newlen=256 2018-09-17 12:21:14.417 88424 DEBUG pktbuf_dynamic(128): 0x252ebb0 2018-09-17 12:21:14.418 88424 NOISE connect(4, unix:/tmp/.s.PGSQL.6543) = Connection refused 2018-09-17 12:21:14.418 88424 NOISE safe_close(4) = 0 2018-09-17 12:21:14.418 88424 LOG Stale pidfile, removing 2018-09-17 12:21:14.419 88424 DEBUG adns_create_context: evdns2 2018-09-17 12:21:14.421 88424 DEBUG add_listen: 127.0.0.1:6543 2018-09-17 12:21:14.422 88424 NOISE old TCP_DEFER_ACCEPT on 11 = 0 2018-09-17 12:21:14.422 88424 NOISE install TCP_DEFER_ACCEPT on 11 2018-09-17 12:21:14.422 88424 LOG listening on 127.0.0.1:6543 2018-09-17 12:21:14.422 88424 DEBUG add_listen: unix:/tmp/.s.PGSQL.6543 2018-09-17 12:21:14.422 88424 LOG listening on unix:/tmp/.s.PGSQL.6543 2018-09-17 12:21:14.423 88424 LOG process up: pgbouncer 1.8.1, libevent 2.0.21-stable (epoll), adns: evdns2, tls: OpenSSL 1.0.2k-fips 26 Jan 2017 2018-09-17 12:21:14.756 88424 DEBUG loading auth_file: "/etc/pgbouncer/users.txt" 2018-09-17 12:21:48.917 88424 NOISE safe_accept(12) = 13 (unix:) 2018-09-17 12:21:48.917 88424 NOISE new fd from accept=13 2018-09-17 12:21:48.917 88424 WARNING tune_socket(13) failed: Operation not supported 2018-09-17 12:21:48.917 88424 NOISE safe_close(13) = 0 2018-09-17 12:21:48.918 88424 NOISE safe_accept(12) = Resource temporarily unavailable 2018-09-17 12:22:14.422 88424 LOG Stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us wait time 0 us 2018-09-17 12:23:14.423 88424 LOG Stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 
B/s, xact 0 us, query 0 us wait time 0 us
2018-09-17 12:23:26.561 88424 LOG Got SIGINT, shutting down
2018-09-17 12:23:26.872 88424 LOG server connections dropped, exiting
2018-09-17 12:23:26.872 88424 NOISE safe_close(11) = 0
2018-09-17 12:23:26.872 88424 NOISE safe_close(12) = 0

Postgres version : 10.5
Libevent : 2.0.5
Pgbouncer : 1.8.1
OS : Centos 7

It used to work. After trying to install the latest version of pgbouncer (1.9.0) it does not work anymore, so I re-compiled v1.8.1.

Content of the pgbouncer.ini:

[databases]
DB_NAME = host=IP_ADDD port=5432 dbname=DB_NAME auth_user=pgbouncer pool_size=120

[pgbouncer]
listen_port = 6543
listen_addr = *
auth_type = md5
auth_file = /etc/pgbouncer/users.txt
auth_query = select uname,phash from user_lookup($1)
logfile = /var/log/pgbouncer.log
pidfile = /home/postgres/pgbouncer.pid
admin_users = admin
user = postgres
max_db_connections = 220
log_connections = 0
log_disconnections = 0
verbose = 3

Content of the users.txt:

[root@cmd-dev1 pgbouncer]# cat users.txt
"pgbouncer" "***"

I tried to change the listen address to 127.0.0.1 or to the IP of the VM. Nothing changed.

Can anyone help?

Thanks
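A side note on the auth_query setup above: it assumes a user_lookup function exists in the target database for pgbouncer to call. The example from the pgbouncer documentation looks roughly like the following (a sketch; the actual user_lookup in this database may differ):

CREATE OR REPLACE FUNCTION user_lookup(in i_username text, out uname text, out phash text)
RETURNS record AS $$
BEGIN
    SELECT usename, passwd FROM pg_catalog.pg_shadow
    WHERE usename = i_username INTO uname, phash;
    RETURN;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;

REVOKE ALL ON FUNCTION user_lookup(text) FROM public;
GRANT EXECUTE ON FUNCTION user_lookup(text) TO pgbouncer;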
Re: Pgbouncer and postgres
I'm not aware of any update installed. There were like 600+ updates in fact. These are the logs when connecting via telnet : 2018-09-17 14:18:18.933 65617 NOISE safe_accept(11) = 14 (10.151.2.145:39478 ) 2018-09-17 14:18:18.933 65617 NOISE new fd from accept=14 2018-09-17 14:18:18.933 65617 NOISE safe_accept(11) = Resource temporarily unavailable 2018-09-17 14:18:28.385 65617 NOISE safe_accept(12) = 15 (::1/49712) 2018-09-17 14:18:28.385 65617 NOISE new fd from accept=15 2018-09-17 14:18:28.385 65617 NOISE safe_accept(12) = Resource temporarily unavailable 2018-09-17 14:18:30.703 65617 NOISE safe_accept(11) = 16 (10.151.2.145:39520 ) 2018-09-17 14:18:30.703 65617 NOISE new fd from accept=16 2018-09-17 14:18:30.703 65617 NOISE safe_accept(11) = Resource temporarily unavailable 2018-09-17 14:18:34.416 65617 NOISE safe_accept(11) = 17 (10.151.2.145:39530 ) 2018-09-17 14:18:34.416 65617 NOISE new fd from accept=17 2018-09-17 14:18:34.416 65617 NOISE safe_accept(11) = Resource temporarily unavailable [root@cmd-dev1 pgbouncer]# telnet localhost 6543 Trying ::1... Connected to localhost. Escape character is '^]'. Connection closed by foreign host. Il giorno lun 17 set 2018 alle ore 16:11 Adrian Klaver < adrian.kla...@aklaver.com> ha scritto: > On 9/17/18 7:05 AM, Nicola Contu wrote: > > The log is from pgbouncer. > > > > I installed updates after this problem, but they did not resolve > anything. > > I was thinking more of updates that where done between the last time it > worked and the time it started failing. > > Along that line, when was the last time it worked? > > > > > I compiled 1.9.0 from source on the same path of the previous one. Then > > back to 1.8.1 on the same path. > > > > I also tried changing the listening port on pgbouncer, but that does not > > change too much. > > Yeah, your log shows the port/socket coming up: > > 2018-09-17 12:21:14.422 88424 LOG listening on 127.0.0.1:6543 > 2018-09-17 12:21:14.422 88424 DEBUG add_listen: unix:/tmp/.s.PGSQL.6543 > 2018-09-17 12:21:14.422 88424 LOG listening on unix:/tmp/.s.PGSQL.6543 > 2018-09-17 12:21:14.423 88424 LOG process up: pgbouncer 1.8.1, libevent > 2.0.21-stable (epoll), adns: evdns2, tls: OpenSSL 1.0.2k-fips 26 Jan 2017 > > > I can connect via telnet on that port, but we I try with anything else > > (like Apache) it makes pgbouncer crashing. > > Are there log entries from that? > > > > > > > Il giorno lun 17 set 2018 alle ore 15:44 Adrian Klaver > > mailto:adrian.kla...@aklaver.com>> ha > scritto: > > > > On 9/17/18 5:33 AM, Nicola Contu wrote: > > > Hello, > > > > > > I'm trying to get pgbouncer working but I'm getting this error : > > > > > > > > > > What is generating the log below? 
> > > > To me it looks like everything is on track until this: > > > > > 2018-09-17 12:21:48.917 88424 WARNING tune_socket(13) failed: > > Operation > > > not supported > > > 2018-09-17 12:21:48.917 88424 NOISE safe_close(13) = 0 > > > 2018-09-17 12:21:48.918 88424 NOISE safe_accept(12) = Resource > > > temporarily unavailable > > > 2018-09-17 12:22:14.422 88424 LOG Stats: 0 xacts/s, 0 queries/s, > > in 0 > > > B/s, out 0 B/s, xact 0 us, query 0 us wait time 0 us > > > 2018-09-17 12:23:14.423 88424 LOG Stats: 0 xacts/s, 0 queries/s, > > in 0 > > > B/s, out 0 B/s, xact 0 us, query 0 us wait time 0 us > > > 2018-09-17 12:23:26.561 88424 LOG Got SIGINT, shutting down > > > 2018-09-17 12:23:26.872 88424 LOG server connections dropped, > exiting > > > 2018-09-17 12:23:26.872 88424 NOISE safe_close(11) = 0 > > > 2018-09-17 12:23:26.872 88424 NOISE safe_close(12) = 0 > > > > > > > > > Postgres version : 10.5 > > > > > > Libevent : 2.0.5 > > > > > > Pgbouncer : 1.8.1 > > > > > > OS : Centos 7 > > > > Have there been updates to CentOS during this? > > > > > > > > > > > It used to work. Now after trying to install latest version of > > pgbouncer > > > (1.9.0) does not work anymore. Then I re-compiled the v 1.8.1 > > > > Did you compile 1.9.0 from source or install from package? > > > > > > > > > > > > > > Content of the pgbouncer.ini > > > > > > [databases] > > > DB_NAME = host=IP_ADDD port=5432 > dbname=DB_NAM
Re: Pgbouncer and postgres
[Mon Sep 17 10:24:59.041589 2018] [php7:notice] [pid 63893:tid 140586929567488] [client 10.160.41.3:51764] PHP WARNING: pg_connect(): Unable to connect to PostgreSQL server: could not connect to server: Connection refused\n\tIs the server running on host "IP_ADDR" and accepting\n\tTCP/IP connections on port 6543? in /home/cmd3/adodb5/drivers/adodb-postgres64.inc.php on line 727 So basically the same I get when connecting via psql on port 6543 Il giorno lun 17 set 2018 alle ore 16:23 Adrian Klaver < adrian.kla...@aklaver.com> ha scritto: > On 9/17/18 7:19 AM, Nicola Contu wrote: > > I'm not aware of any update installed. There were like 600+ updates in > fact. > > > > These are the logs when connecting via telnet : > > How about those from when Apache tries to connect and pgbouncer crashes? > > > Connection closed by foreign host. > > > -- > Adrian Klaver > adrian.kla...@aklaver.com >
Re: Pgbouncer and postgres
The log is from pgbouncer.

I installed updates after this problem, but they did not resolve anything.

I compiled 1.9.0 from source on the same path of the previous one. Then back to 1.8.1 on the same path.

I also tried changing the listening port on pgbouncer, but that does not change too much. I can connect via telnet on that port, but when I try with anything else (like Apache) it makes pgbouncer crash. Il giorno lun 17 set 2018 alle ore 15:44 Adrian Klaver < adrian.kla...@aklaver.com> ha scritto: > On 9/17/18 5:33 AM, Nicola Contu wrote: > > Hello, > > > > I'm trying to get pgbouncer working but I'm getting this error : > > > > > > What is generating the log below? > > To me it looks like everything is on track until this: > > > 2018-09-17 12:21:48.917 88424 WARNING tune_socket(13) failed: Operation > > not supported > > 2018-09-17 12:21:48.917 88424 NOISE safe_close(13) = 0 > > 2018-09-17 12:21:48.918 88424 NOISE safe_accept(12) = Resource > > temporarily unavailable > > 2018-09-17 12:22:14.422 88424 LOG Stats: 0 xacts/s, 0 queries/s, in 0 > > B/s, out 0 B/s, xact 0 us, query 0 us wait time 0 us > > 2018-09-17 12:23:14.423 88424 LOG Stats: 0 xacts/s, 0 queries/s, in 0 > > B/s, out 0 B/s, xact 0 us, query 0 us wait time 0 us > > 2018-09-17 12:23:26.561 88424 LOG Got SIGINT, shutting down > > 2018-09-17 12:23:26.872 88424 LOG server connections dropped, exiting > > 2018-09-17 12:23:26.872 88424 NOISE safe_close(11) = 0 > > 2018-09-17 12:23:26.872 88424 NOISE safe_close(12) = 0 > > > > > > Postgres version : 10.5 > > > > Libevent : 2.0.5 > > > > Pgbouncer : 1.8.1 > > > > OS : Centos 7 > > Have there been updates to CentOS during this? > > > > > > > It used to work. Now after trying to install latest version of pgbouncer > > (1.9.0) does not work anymore. Then I re-compiled the v 1.8.1 > > Did you compile 1.9.0 from source or install from package? > > > > > > > > > Content of the pgbouncer.ini > > > > [databases] > > DB_NAME = host=IP_ADDD port=5432 dbname=DB_NAME auth_user=pgbouncer > > pool_size=120 > > > > [pgbouncer] > > listen_port = 6543 > > listen_addr = * > > auth_type = md5 > > auth_file = /etc/pgbouncer/users.txt > > auth_query = select uname,phash from user_lookup($1) > > logfile = /var/log/pgbouncer.log > > pidfile = /home/postgres/pgbouncer.pid > > admin_users = admin > > user=postgres > > max_db_connections = 220 > > log_connections=0 > > log_disconnections=0 > > verbose=3 > > > > > > Content of the users .txt > > > > [root@cmd-dev1 pgbouncer]# cat users.txt > > "pgbouncer" "***" > > > > > > I tried to change the listen address to 127.0.0.1 or to the IP of the > > VM. Nothing changed. > > > > > > Can anyone help? > > > > > > Thanks > > > > > > > -- > Adrian Klaver > adrian.kla...@aklaver.com >
Re: Pgbouncer and postgres
Yeah, I replaced the IP of the machine with the word IP_ADDR. On the pgbouncer log I get this while trying to connect via psql: 2018-09-17 15:56:49.452 67611 WARNING tune_socket(16) failed: Operation not supported 2018-09-17 15:56:49.452 67611 NOISE safe_close(16) = 0 2018-09-17 15:56:49.452 67611 NOISE safe_accept(13) = Resource temporarily unavailable 2018-09-17 15:56:50.728 67611 NOISE safe_accept(13) = 16 (unix:) 2018-09-17 15:56:50.729 67611 NOISE new fd from accept=16 2018-09-17 15:56:50.729 67611 WARNING tune_socket(16) failed: Operation not supported 2018-09-17 15:56:50.729 67611 NOISE safe_close(16) = 0 2018-09-17 15:56:50.729 67611 NOISE safe_accept(13) = Resource temporarily unavailable 2018-09-17 15:56:56.418 67611 NOISE safe_accept(11) = 16 (10.151.2.145:58696 ) 2018-09-17 15:56:56.418 67611 NOISE new fd from accept=16 2018-09-17 15:56:56.418 67611 NOISE safe_accept(11) = Resource temporarily unavailable Il giorno lun 17 set 2018 alle ore 17:46 Adrian Klaver < adrian.kla...@aklaver.com> ha scritto: > On 9/17/18 8:39 AM, Nicola Contu wrote: > Please reply to list also. > Ccing list. > > > No wait, IP_ADDR has been modified by me, but there is the IP of the > > machine 10.151.x. > > To be clear in your posts you are replacing the actual IP with IP_ADDR, > correct? > > > > > > > Here is the psql command . > > > > [root@cmd-dev1 ~]# psql -U ncontu -p 6543 cmd3dev > > psql: server closed the connection unexpectedly > > This probably means the server terminated abnormally > > before or while processing the request. > > So what does the pgbouncer log show when you do above? > > > > > > > > > > > Il giorno lun 17 set 2018 alle ore 17:35 Adrian Klaver > > mailto:adrian.kla...@aklaver.com>> ha > scritto: > > > > On 9/17/18 7:26 AM, Nicola Contu wrote: > > > [Mon Sep 17 10:24:59.041589 2018] [php7:notice] [pid 63893:tid > > > 140586929567488] [client 10.160.41.3:51764 > > <http://10.160.41.3:51764> <http://10.160.41.3:51764>] > > > PHP WARNING: pg_connect(): Unable to connect to PostgreSQL > > server: could > > > not connect to server: Connection refused\n\tIs the server > > running on > > > host "IP_ADDR" and accepting\n\tTCP/IP connections on > port > > > 6543? in /home/cmd3/adodb5/drivers/adodb-postgres64.inc.php on > > line 727 > > > > To me it looks like IP_ADDR is not resolving to a host. > > > > In your pgbouncer.ini I see: > > > > DB_NAME = host=IP_ADDD port=5432 dbname=DB_NAME auth_user=pgbouncer > > pool_size=120 > > > > but the error is about connecting to port 6543 which is your > pgbouncer > > 'server'. > > > > Are you sure your connection string is pointing at the correct thing? > > > > > > > > So basically the same I get when connecting via psql on port 6543 > > > > What is the full psql command? > > > > > > > > Il giorno lun 17 set 2018 alle ore 16:23 Adrian Klaver > > > mailto:adrian.kla...@aklaver.com> > > <mailto:adrian.kla...@aklaver.com > > <mailto:adrian.kla...@aklaver.com>>> ha scritto: > > > > > > On 9/17/18 7:19 AM, Nicola Contu wrote: > > > > I'm not aware of any update installed. There were like 600+ > > > updates in fact. > > > > > > > > These are the logs when connecting via telnet : > > > > > > How about those from when Apache tries to connect and > > pgbouncer crashes? > > > > > > > Connection closed by foreign host. 
> > > > > > > > > -- > > > Adrian Klaver > > > adrian.kla...@aklaver.com <mailto:adrian.kla...@aklaver.com> > > <mailto:adrian.kla...@aklaver.com <mailto:adrian.kla...@aklaver.com > >> > > > > > > > > > -- > > Adrian Klaver > > adrian.kla...@aklaver.com <mailto:adrian.kla...@aklaver.com> > > > > > -- > Adrian Klaver > adrian.kla...@aklaver.com >
Re: Pgbouncer and postgres
Yes I can connect to port 5432 without any problem. The thing is that I am not seeing anything on the postgres log when connecting to 6543. Sure, thanks for your help. I will create an issue there. Il giorno lun 17 set 2018 alle ore 18:07 Adrian Klaver < adrian.kla...@aklaver.com> ha scritto: > On 9/17/18 8:59 AM, Nicola Contu wrote: > > Yeah, I replaced the IP of the machine with the word IP_ADDR. > > > > On the pgbouncer log I get this while trying to connect via psql: > > > > 2018-09-17 15:56:49.452 67611 WARNING tune_socket(16) failed: Operation > > not supported > > 2018-09-17 15:56:49.452 67611 NOISE safe_close(16) = 0 > > 2018-09-17 15:56:49.452 67611 NOISE safe_accept(13) = Resource > > temporarily unavailable > > 2018-09-17 15:56:50.728 67611 NOISE safe_accept(13) = 16 (unix:) > > 2018-09-17 15:56:50.729 67611 NOISE new fd from accept=16 > > 2018-09-17 15:56:50.729 67611 WARNING tune_socket(16) failed: Operation > > not supported > > 2018-09-17 15:56:50.729 67611 NOISE safe_close(16) = 0 > > 2018-09-17 15:56:50.729 67611 NOISE safe_accept(13) = Resource > > temporarily unavailable > > 2018-09-17 15:56:56.418 67611 NOISE safe_accept(11) = 16 > > (10.151.2.145:58696 <http://10.151.2.145:58696>) > > 2018-09-17 15:56:56.418 67611 NOISE new fd from accept=16 > > 2018-09-17 15:56:56.418 67611 NOISE safe_accept(11) = Resource > > temporarily unavailable > > Can you connect to the actual database running on port 5432? > > Deciphering the above log is beyond my capabilities. I think it is time > to file an issue here: > > https://github.com/pgbouncer/pgbouncer/issues > > > > -- > Adrian Klaver > adrian.kla...@aklaver.com >
Pgbouncer discard all
Hello,
we are running pgbouncer 1.9.1 connected to postgres 10.5.

Sometimes we are seeing a lot of waiting connections with this query:

DISCARD ALL

This is our pgbouncer config:

[databases]
dev = host=10.10.10.1 port=5432 dbname=dev auth_user=pgbouncer pool_size=120

[pgbouncer]
listen_port = 6543
listen_addr = *
auth_type = md5
auth_file = /etc/pgbouncer/users.txt
auth_query = select uname,phash from user_lookup($1)
logfile = /var/log/pgbouncer.log
pidfile = /home/postgres/pgbouncer.pid
admin_users = admin
user = postgres
max_db_connections = 220
log_connections = 0
log_disconnections = 0

Do you think this can depend on the server_idle_timeout default config value?

Thanks a lot,
Nicola
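Two things may help narrow this down. DISCARD ALL is what pgbouncer issues as its default server_reset_query when a server connection is handed back in session pooling, so seeing it as the "current query" of otherwise idle backends is usually harmless; and the pooler's own counters show whether clients are really queueing. For example, from the admin console (the special "pgbouncer" database, reachable with the admin_users account above):

psql -p 6543 -U admin pgbouncer -c 'SHOW POOLS;'
psql -p 6543 -U admin pgbouncer -c 'SHOW SERVERS;'

cl_waiting in SHOW POOLS is the number of clients actually waiting for a server connection, while sv_idle/sv_used show how many server connections are just sitting around, which is what server_idle_timeout (600 seconds by default) eventually closes.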
Swap on postgres master server
Hello,
we are running Postgres 10.5 with master slave replication.

These are our custom params:

archive_command = 'pgbackrest --stanza=cmdprod archive-push %p'   # command to use to archive a logfile segment
archive_mode = on                      # enables archiving; off, on, or always
checkpoint_completion_target = 0.7     # checkpoint target duration, 0.0 - 1.0
default_text_search_config = 'pg_catalog.english'
datestyle = 'iso, mdy'
effective_cache_size = 120GB
hot_standby = on                       # "on" allows queries during recovery
lc_messages = 'en_US.UTF-8'            # locale for system error message
lc_monetary = 'en_US.UTF-8'            # locale for monetary formatting
lc_numeric = 'en_US.UTF-8'             # locale for number formatting
lc_time = 'en_US.UTF-8'                # locale for time formatting
listen_addresses = '*'                 # defaults to 'localhost', '*' = all
log_autovacuum_min_duration = 1000ms   # -1 disables, 0 logs all actions and
log_checkpoints = on
log_line_prefix = '%t [%r] [%p]: [%l-1] db=%d,user=%u '
log_lock_waits = on                    # log lock waits >= deadlock_timeout
log_min_duration_statement = 1000ms    # -1 is disabled, 0 logs all statements
log_statement = 'ddl'                  # none, ddl, mod, all
log_temp_files = 1024kB                # log temporary files equal or larger
maintenance_work_mem = 2GB
max_connections = 220
max_parallel_workers_per_gather = 8    # taken from max_worker_processes
max_wal_size = 2GB
min_wal_size = 1GB
pg_stat_statements.max = 1
pg_stat_statements.track = all
port = 5432                            # port number which Postgres listen
shared_buffers = 10GB
shared_preload_libraries = 'pg_stat_statements'   # (change requires restart)
synchronous_standby_names = '1 ( "usnyh2" )'      # comment out during upgrade
track_activity_query_size = 16384      # (change requires restart)
track_io_timing = on
wal_buffers = 16MB
wal_keep_segments = 100
wal_level = replica                    # minimal, replica, or logical
work_mem = 600MB

This server is on Centos 7 and the strange thing is that we see a lot of swap usage:

[root@usnyh-cmd1 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:         257652        7555        5559       12804      244536      236036
Swap:         16383        7326        9057

7GB used. But can't see it from any of the commands like top etc. I am sure it is postgres because it is the only service running on that machine.

Is there anything we can do? On the sync slave, the usage is just 400MB.

Any trick?

Thanks a lot,
Nicola
Re: Swap on postgres master server
No it is not probably used, because I can't find it in any way as I said. I run your commands : [root@usnyh-cmd1 ~]# vmstat 1 procs ---memory-- ---swap-- -io -system-- --cpu- r b swpd free buff cache si sobibo in cs us sy id wa st 3 2 7505332 14545468 13692 2414367840047 11500 11 2 84 2 0 3 2 7505332 14490408 13692 24143692800 0 248 6153 2013 3 1 93 3 0 1 2 7505332 14474672 13692 24143680000 0 1124 4905 1454 3 1 93 3 0 4 2 7505332 14381156 13692 24143683200 096 5322 1782 2 1 94 3 0 ^C [root@usnyh-cmd1 ~]# ps -A --sort -rss -o comm,pmem | awk ' > NR == 1 { print; next } > { a[$1] += $2 } > END { > for (i in a) { > printf "%-15s\t%s\n", i, a[i]; > } > } > ' COMMAND %MEM kworker/42:20 kworker/60:6H 0 kworker/60:2H 0 kdmwork-253:2 0 ksoftirqd/600 postmaster 15.2 kworker/9:0H0 So I'm just asking why it is still there with free -m and if there is any way to free that up if it is not used. Thanks Il giorno mar 16 ott 2018 alle ore 11:18 Hans Schou ha scritto: > Are you sure that swap is used actively? Maybe it had just been used > during backup or something. > > Look after SwapIn/SwapOut (si/so) it should be '0' > $ vmstat 1 > procs ---memory-- ---swap-- -io -system-- > --cpu- > r b swpd free buff cache si sobibo in cs us sy id > wa st > 1 0 12 89344 46608 5863840012 8 30 86 0 0 > 99 0 0 > > If you want to see the amount of ram used by each program with childs run > this: > ps -A --sort -rss -o comm,pmem | awk ' > NR == 1 { print; next } > { a[$1] += $2 } > END { > for (i in a) { > printf "%-15s\t%s\n", i, a[i]; > } > } > ' > > > On Tue, Oct 16, 2018 at 11:04 AM Nicola Contu > wrote: > >> Hello, >> we are running Postgres 10.5 with master slave replication. >> >> These are our custom params >> >> archive_command = 'pgbackrest --stanza=cmdprod archive-push %p' # >> command to use to archive a logfile segment >> archive_mode = on # enables archiving; off, on, or always >> checkpoint_completion_target = 0.7 # checkpoint target duration, 0.0 - >> 1.0 >> default_text_search_config = 'pg_catalog.english' # >> datestyle = 'iso, mdy' # >> effective_cache_size = 120GB # >> hot_standby = on# "on" allows queries during >> recovery >> lc_messages = 'en_US.UTF-8' # locale for system >> error message >> lc_monetary = 'en_US.UTF-8' # locale for monetary >> formatting >> lc_numeric = 'en_US.UTF-8' # locale for number >> formatting >> lc_time = 'en_US.UTF-8' # locale for time >> formatting >> listen_addresses = '*' # defaults to 'localhost', '*' = all >> log_autovacuum_min_duration = 1000ms# -1 disables, 0 logs all >> actions and >> log_checkpoints = on # >> log_line_prefix = '%t [%r] [%p]: [%l-1] db=%d,user=%u ' # >> log_lock_waits = on # log lock waits >= >> deadlock_timeout >> log_min_duration_statement = 1000ms# -1 is disabled, 0 logs all >> statements >> log_statement = 'ddl' # none, ddl, mod, all >> log_temp_files = 1024kB # log temporary files equal >> or larger >> maintenance_work_mem = 2GB # >> max_connections = 220 # >> max_parallel_workers_per_gather = 8# taken from >> max_worker_processes >> max_wal_size = 2GB # >> min_wal_size = 1GB # >> pg_stat_statements.max = 1 # >> pg_stat_statements.track = all # >> port = 5432# port number which Postgres >> listen >> shared_buffers = 10GB # >> shared_preload_libraries = 'pg_stat_statements' # (change >> requires restart) >> synchronous_standby_names = '1 ( "usnyh2" )' # comment out during upgrade >> track_activity_query_size = 16384# (change requires restart) >> track_io_timing = on # >> wal_buffers = 16MB # >> wal_keep_segments = 100 # >> wal_level = replica# 
minimal, replica, or logical >> work_mem = 600MB # >> >> This server is on Centos 7 and the strange thing is that we see a lot of >> swap usage : >> >> [root@usnyh-cmd1 ~]# free -m >> totalusedfree shared buff/cache >> available >> Mem: 25765275555559 12804 244536 >> 236036 >> Swap: 1638373269057 >> >> 7GB used. >> >> But can't see it from any of the commands like top etc. >> I am sure it is postgres because it is the only service running on that >> machine. >> >> Is there anything we can do? >> On the sync slave, the usage is just 400MB. >> >> Any trick? >> >> Thanks a lot, >> Nicola >> >
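Since ps only reports resident memory, a more direct way to see which processes actually own swapped pages is to read the VmSwap field the kernel exposes in /proc. A rough sketch (run as root; the field is available on CentOS 7 kernels):

# list the biggest per-process swap users, largest first
grep VmSwap /proc/[0-9]*/status | sort -k 2 -n -r | head -20

If the postgres backends show up there, the 7 GB is really theirs; note, however, that swapped-out shared memory (e.g. parts of shared_buffers) is not attributed to any single process this way, which could explain why nothing obvious shows up in top.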
Re: Pgbouncer discard all
Hello, is this normal? can anyone help? Thanks a lot for your help in advance. Nicola Il giorno mer 10 ott 2018 alle ore 17:03 Nicola Contu < nicola.co...@gmail.com> ha scritto: > Hello, > we are running pgbouncer 1.9.1 connected to postgres 10.5 > > Sometimes we are seeing a lot of waiting connections with this query : > > DISCARD ALL > > This is our pgbouncer config : > > [databases] > dev = host=10.10.10.1 port=5432 dbname=dev auth_user=pgbouncer > pool_size=120 > > [pgbouncer] > listen_port = 6543 > listen_addr = * > auth_type = md5 > auth_file = /etc/pgbouncer/users.txt > auth_query = select uname,phash from user_lookup($1) > logfile = /var/log/pgbouncer.log > pidfile = /home/postgres/pgbouncer.pid > admin_users = admin > user = postgres > max_db_connections = 220 > log_connections = 0 > log_disconnections = 0 > > > Do you think this can depend on the server_idle_timeout default config > value? > > Thanks a lot, > Nicola >
Re: Swap on postgres master server
Ok thanks I will try that. But do you know if there is any way to avoid this? vm.swapiness? or anything on the postgres conf? Il giorno mar 16 ott 2018 alle ore 15:17 Bob Jolliffe ha scritto: > I guess you can run swapoff (followed by swapon). That will free up > whatever is currently swapped. Beware if the system is actively > swapping then swapoff can take some time. But it seems not in your > case. > On Tue, 16 Oct 2018 at 10:48, Nicola Contu wrote: > > > > No it is not probably used, because I can't find it in any way as I said. > > > > I run your commands : > > > > [root@usnyh-cmd1 ~]# vmstat 1 > > procs ---memory-- ---swap-- -io -system-- > --cpu- > > r b swpd free buff cache si sobibo in cs us sy > id wa st > > 3 2 7505332 14545468 13692 2414367840047 11500 > 11 2 84 2 0 > > 3 2 7505332 14490408 13692 24143692800 0 248 6153 2013 > 3 1 93 3 0 > > 1 2 7505332 14474672 13692 24143680000 0 1124 4905 1454 > 3 1 93 3 0 > > 4 2 7505332 14381156 13692 24143683200 096 5322 1782 > 2 1 94 3 0 > > ^C > > [root@usnyh-cmd1 ~]# ps -A --sort -rss -o comm,pmem | awk ' > > > NR == 1 { print; next } > > > { a[$1] += $2 } > > > END { > > > for (i in a) { > > > printf "%-15s\t%s\n", i, a[i]; > > > } > > > } > > > ' > > COMMAND %MEM > > kworker/42:20 > > kworker/60:6H 0 > > kworker/60:2H 0 > > kdmwork-253:2 0 > > ksoftirqd/600 > > postmaster 15.2 > > kworker/9:0H0 > > > > So I'm just asking why it is still there with free -m and if there is > any way to free that up if it is not used. > > > > Thanks > > > > Il giorno mar 16 ott 2018 alle ore 11:18 Hans Schou < > hans.sc...@gmail.com> ha scritto: > >> > >> Are you sure that swap is used actively? Maybe it had just been used > during backup or something. > >> > >> Look after SwapIn/SwapOut (si/so) it should be '0' > >> $ vmstat 1 > >> procs ---memory-- ---swap-- -io -system-- > --cpu- > >> r b swpd free buff cache si sobibo in cs us sy > id wa st > >> 1 0 12 89344 46608 5863840012 8 30 86 0 0 > 99 0 0 > >> > >> If you want to see the amount of ram used by each program with childs > run this: > >> ps -A --sort -rss -o comm,pmem | awk ' > >> NR == 1 { print; next } > >> { a[$1] += $2 } > >> END { > >> for (i in a) { > >> printf "%-15s\t%s\n", i, a[i]; > >> } > >> } > >> ' > >> > >> > >> On Tue, Oct 16, 2018 at 11:04 AM Nicola Contu > wrote: > >>> > >>> Hello, > >>> we are running Postgres 10.5 with master slave replication. > >>> > >>> These are our custom params > >>> > >>> archive_command = 'pgbackrest --stanza=cmdprod archive-push %p' # > command to use to archive a logfile segment > >>> archive_mode = on # enables archiving; off, on, or always > >>> checkpoint_completion_target = 0.7 # checkpoint target duration, 0.0 - > 1.0 > >>> default_text_search_config = 'pg_catalog.english' # > >>> datestyle = 'iso, mdy' # > >>> effective_cache_size = 120GB # > >>> hot_standby = on# "on" allows queries during > recovery > >>> lc_messages = 'en_US.UTF-8' # locale for system > error message > >>> lc_monetary = 'en_US.UTF-8' # locale for monetary > formatting > >>> lc_numeric = 'en_US.UTF-8' # locale for number > formatting > >>> lc_time = 'en_US.UTF-8' # locale for time > formatting > >>> listen_addresses = '*' # defaults to 'localhost', '*' = all > >>> log_autovacuum_min_duration = 1000ms# -1 disables, 0 logs all > actions and > >>> log_checkpoints = on # > >>> log_line_prefix = '%t [%r] [%p]: [%l-1] db=%d,user=%u ' # > >>> log_lock_waits = on # log lock waits >= > deadlock_timeout > >>> log_min_duration_statement = 1000ms# -1 is disabled, 0 lo
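On the kernel side the parameter is vm.swappiness (note the double p); lowering it tells the kernel to prefer reclaiming page cache over swapping out anonymous memory. A sketch of how it is usually adjusted on CentOS 7, with 10 as a commonly used value for database servers (pick whatever suits your workload):

# check the current value
sysctl vm.swappiness
# change it for the running kernel
sysctl -w vm.swappiness=10
# make it persistent across reboots
echo 'vm.swappiness = 10' >> /etc/sysctl.d/99-postgres.conf

On the PostgreSQL side there is no setting that directly controls swapping; it is an OS-level decision.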
Re: Pgbouncer discard all
2492534808 | dev| 7355 | 1833427130 | pgbouncer | | 10.151.2.145 | | 60570 | 2018-10-16 14:13:05.151015+00 | | 2018-10-16 15:10:40.309993+00 | 2018-10-16 15:10:40.310038+00 | Client | ClientRead | idle | | | DISCARD ALL They are idle actually. Will they count as client connection on the total amount set on the postgres.conf? Il giorno mar 16 ott 2018 alle ore 16:22 Martín Marqués < martin.marq...@2ndquadrant.com> ha scritto: > El 16/10/18 a las 09:59, Scot Kreienkamp escribió: > > Are you sure they’re actually waiting? Don’t forget 10.5 will show the > > last query executed even if the connection is idle. I believe discard > > all would be the last command the pgbouncer would send to the database > > when the client is done as it resets the connection for the next > > client. So what you’re describing would seem to be expected behavior. > > He might have been referring to client waiting. That is visible in the > pgbouncer pseudo-database > > OTOH if the the waiting is seen in pg_stat_activity, then pgbouncer has > nothing to do. The connection has already been assigned to the client > and the waiting is happening on the database server, not the pooler. > > Regards, > > -- > Martín Marquéshttp://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services >
Matview size - space increased on concurrently refresh
Hello,
we noticed with a simple matview of ours that refreshing it with the CONCURRENTLY option always increases the space by about 120MB. This only happens if I am reading from that matview while I am refreshing it.

cmdv3=# SELECT pg_size_pretty(pg_relation_size('public.matview_nm_connections'::regclass));
 pg_size_pretty
----------------
 133 MB
(1 row)

cmdv3=# refresh materialized view matview_nm_connections;
REFRESH MATERIALIZED VIEW
cmdv3=# SELECT pg_size_pretty(pg_relation_size('public.matview_nm_connections'::regclass));
 pg_size_pretty
----------------
 133 MB
(1 row)

cmdv3=# \! date
Fri Jul 12 13:52:51 GMT 2019

cmdv3=# refresh materialized view matview_nm_connections;
REFRESH MATERIALIZED VIEW
cmdv3=# SELECT pg_size_pretty(pg_relation_size('public.matview_nm_connections'::regclass));
 pg_size_pretty
----------------
 133 MB
(1 row)

Let's try concurrently:

cmdv3=# refresh materialized view CONCURRENTLY matview_nm_connections;
REFRESH MATERIALIZED VIEW
cmdv3=# SELECT pg_size_pretty(pg_relation_size('public.matview_nm_connections'::regclass));
 pg_size_pretty
----------------
 261 MB
(1 row)

So the matview is not really used and there is nothing strange about it, but it grew to 12GB because we refresh it once an hour. It had the free percent at 97%. I understand that with CONCURRENTLY it needs to keep a copy of the data while reading, but this seems to be too much on the space side.

Is this a bug? Or can anyone help us understand this?

Thanks a lot,
Nicola
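Since the "free percent" figure suggests pgstattuple is already in use here, a quick way to watch what each refresh leaves behind is to check the free and dead space right after refreshing (a sketch, assuming the extension is installed in this database):

CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT pg_size_pretty(pg_relation_size('public.matview_nm_connections'::regclass)) AS size,
       free_percent,
       dead_tuple_percent
FROM pgstattuple('public.matview_nm_connections');

If free_percent keeps climbing while dead_tuple_percent stays near zero, the old row versions have been removed but the space has not been given back to the OS, which is where VACUUM FULL (or a plain non-concurrent refresh) comes in.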
Re: Matview size - space increased on concurrently refresh
P.S.: I am on postgres 11.3 Il giorno ven 12 lug 2019 alle ore 16:32 Nicola Contu < nicola.co...@gmail.com> ha scritto: > Hello, > we noticed with a simple matview we have that refreshing it using the > concurrently item the space always increases of about 120MB . > This only happens if I am reading from that matview and at the same time I > am am refreshing it. > > cmdv3=# SELECT > pg_size_pretty(pg_relation_size('public.matview_nm_connections'::regclass)); > pg_size_pretty > > 133 MB > (1 row) > > cmdv3=# refresh materialized view matview_nm_connections; > REFRESH MATERIALIZED VIEW > cmdv3=# SELECT > pg_size_pretty(pg_relation_size('public.matview_nm_connections'::regclass)); > pg_size_pretty > > 133 MB > (1 row) > > cmdv3=# \! date > Fri Jul 12 13:52:51 GMT 2019 > > cmdv3=# refresh materialized view matview_nm_connections; > REFRESH MATERIALIZED VIEW > cmdv3=# SELECT > pg_size_pretty(pg_relation_size('public.matview_nm_connections'::regclass)); > pg_size_pretty > > 133 MB > (1 row) > > > Let's try concurrently. > > cmdv3=# refresh materialized view CONCURRENTLY matview_nm_connections; > REFRESH MATERIALIZED VIEW > cmdv3=# SELECT > pg_size_pretty(pg_relation_size('public.matview_nm_connections'::regclass)); > pg_size_pretty > > 261 MB > (1 row) > > > So the matview is not really used and it does not have anything strange > but that matview growth to 12GB as we refresh it once an hour. > It had the free percent at 97%. > I understand with concurrenlty it needs to take copy of the data while > reading, but this seems to be too much on the space side. > > Is this a bug? Or is there anyone can help us understanding this? > > Thanks a lot, > Nicola >
Re: Matview size - space increased on concurrently refresh
It does not. That's the issue. It always increases of 120mb and it reached 12gb instead of just 180mb. Il dom 14 lug 2019, 21:34 Kaixi Luo ha scritto: > > On Fri, Jul 12, 2019 at 4:34 PM Nicola Contu > wrote: > >> P.S.: I am on postgres 11.3 >> >> Il giorno ven 12 lug 2019 alle ore 16:32 Nicola Contu < >> nicola.co...@gmail.com> ha scritto: >> >>> Hello, >>> we noticed with a simple matview we have that refreshing it using the >>> concurrently item the space always increases of about 120MB . >>> This only happens if I am reading from that matview and at the same time >>> I am am refreshing it. >>> >>> cmdv3=# SELECT >>> pg_size_pretty(pg_relation_size('public.matview_nm_connections'::regclass)); >>> pg_size_pretty >>> >>> 133 MB >>> (1 row) >>> >>> cmdv3=# refresh materialized view matview_nm_connections; >>> REFRESH MATERIALIZED VIEW >>> cmdv3=# SELECT >>> pg_size_pretty(pg_relation_size('public.matview_nm_connections'::regclass)); >>> pg_size_pretty >>> >>> 133 MB >>> (1 row) >>> >>> cmdv3=# \! date >>> Fri Jul 12 13:52:51 GMT 2019 >>> >>> cmdv3=# refresh materialized view matview_nm_connections; >>> REFRESH MATERIALIZED VIEW >>> cmdv3=# SELECT >>> pg_size_pretty(pg_relation_size('public.matview_nm_connections'::regclass)); >>> pg_size_pretty >>> >>> 133 MB >>> (1 row) >>> >>> >>> Let's try concurrently. >>> >>> cmdv3=# refresh materialized view CONCURRENTLY matview_nm_connections; >>> REFRESH MATERIALIZED VIEW >>> cmdv3=# SELECT >>> pg_size_pretty(pg_relation_size('public.matview_nm_connections'::regclass)); >>> pg_size_pretty >>> >>> 261 MB >>> (1 row) >>> >>> >>> So the matview is not really used and it does not have anything strange >>> but that matview growth to 12GB as we refresh it once an hour. >>> It had the free percent at 97%. >>> I understand with concurrenlty it needs to take copy of the data while >>> reading, but this seems to be too much on the space side. >>> >>> Is this a bug? Or is there anyone can help us understanding this? >>> >>> Thanks a lot, >>> Nicola >>> >> > This is normal and something to be expected. When refreshing the > materialized view, the new data is written to a disk and then the two > tables are diffed. After the refresh finishes, your view size should go > back to normal. > >
Re: Matview size - space increased on concurrently refresh
Il giorno dom 14 lug 2019 alle ore 22:23 Tom Lane ha scritto:

> [ please do not top-post in your replies, it makes the conversation hard
> to follow ]
>
> Nicola Contu writes:
> > Il dom 14 lug 2019, 21:34 Kaixi Luo ha scritto:
> >> This is normal and something to be expected. When refreshing the
> >> materialized view, the new data is written to a disk and then the two
> >> tables are diffed. After the refresh finishes, your view size should go
> >> back to normal.
>
> > It does not. That's the issue.
> > It always increases of 120mb and it reached 12gb instead of just 180mb.
>
> A concurrent matview refresh will necessarily leave behind two copies
> of any rows it changes, just like any other row-update operation in
> Postgres. Once there are no concurrent transactions that can "see"
> the old row copies, they should be reclaimable by vacuum.
>
> Since you're not seeing autovacuum reclaim the space automatically,
> I hypothesize that you've got autovacuum turned off or dialed down
> to unrealistically non-aggressive settings. Or possibly you have
> old open transactions that are preventing reclaiming dead rows
> (because they can still possibly "see" those rows). Either of those
> explanations should imply that you're getting similar bloat in every
> other table and matview, though.
>
> You might want to look into pg_stat_all_tables to see what it says
> about the last_autovacuum time etc. for that matview. Another source
> of insight is to do a manual "vacuum verbose" on the matview and see
> what that says about removable and nonremovable rows.
>
> regards, tom lane
>

This matview has nothing strange and nothing custom. We can reproduce it on this matview, which is not used by anyone.

cmdv3=# vacuum (full,analyze,verbose) public.matview_nm_connections;
INFO:  vacuuming "public.matview_nm_connections"
INFO:  "matview_nm_connections": found 0 removable, 295877 nonremovable row versions in 33654 pages
DETAIL:  0 dead row versions cannot be removed yet.
CPU: user: 0.45 s, system: 0.43 s, elapsed: 1.24 s.
INFO:  analyzing "public.matview_nm_connections"
INFO:  "matview_nm_connections": scanned 16986 of 16986 pages, containing 295877 live rows and 0 dead rows; 3 rows in sample, 295877 estimated total rows
VACUUM

This is an example of a full and verbose vacuum. Every time I refresh it the size increases.

See the stats from pg_stat_all_tables (shown here in expanded form, the row is wide):

cmdv3=# select * from pg_stat_all_tables where relname = 'matview_nm_connections';
relid               | 3466831733
schemaname          | public
relname             | matview_nm_connections
seq_scan            | 3725
seq_tup_read        | 540992219
idx_scan            | 33235
idx_tup_fetch       | 255113
n_tup_ins           | 96874161
n_tup_upd           | 0
n_tup_del           | 95692276
n_tup_hot_upd       | 0
n_live_tup          | 295877
n_dead_tup          | 0
n_mod_since_analyze | 0
last_vacuum         | 2019-07-12 11:58:39.198049+00
last_autovacuum     | 2019-07-16 11:07:02.765612+00
last_analyze        | 2019-07-17 10:28:08.819679+00
last_autoanalyze    | 2019-07-16 11:03:32.489573+00
vacuum_count        | 5
autovacuum_count    | 29
analyze_count       | 11
autoanalyze_count   | 17
(1 row)
ERROR: too many dynamic shared memory segments
Hello,
We are running postgres 11.5 and in the last two weeks we did:

- upgraded postgres to 11.5 from 11.4
- increased shared_buffers to 1/3 of the memory
- increased effective_cache_size = 160GB from 120
- increased checkpoint_completion_target = 0.9 from 0.7
- increased checkpoint_timeout = 1h
- increased work_mem = 2GB (this can be set up to 4GB) from 600MB

Since then, in the last two weeks we have seen an increase of this error:

ERROR: too many dynamic shared memory segments

Is there any relation between these parameters or the pgsql 11.5 version?

Any help is appreciated.

Thank you,
Nicola
Re: ERROR: too many dynamic shared memory segments
Hello, We did not see any error in the logs, just that one. Unfortunately we had problems installing updates in this machine and we are not installing updates since a few months. Do you think that can be the issue? We are running Centos 7. I will look into those parameters as well. Thanks for your feedback Il giorno mer 11 set 2019 alle ore 09:56 Mickael Franc (mickaelf) < micka...@cisco.com> ha scritto: > Hello, > > Maybe a change needed in a kernel parameter such "kernel.shmmax" or > "kernel.shmall" to allow by kernel to provide a huge shared-mem. > Did you see any other error in logs ? Have you upgrade your kernel since > last two weeks ? > > Best, > -- > *De :* Nicola Contu > *Envoyé :* mercredi 11 septembre 2019 09:47 > *À :* pgsql-generallists.postgresql.org < > pgsql-general@lists.postgresql.org> > *Cc :* Alessandro Aste > *Objet :* ERROR: too many dynamic shared memory segments > > Hello, > We are running postgres 11.5 and in the last two weeks we did : > > - upgrade of postgres to 11.5 from 11.4 > - increased shared_buffer to 1/3 of the memory > - increased effective_cache_size = 160GB from 120 > - increased checkpoint_completion_target = 0.9 from 0.7 > - increased checkpoint_timeout = 1h > - increased work_mem = 2GB (this can be set up to 4GB) from 600MB > > Since that, in the last two weeks we saw an increment of this error : > > ERROR: too many dynamic shared memory segments > > Is there any relation between these parameters or the pgsql 11.5 version? > > Any help can be appreciated. > > Thank you, > Nicola >
Re: ERROR: too many dynamic shared memory segments
If the error persist I will try to revert the work_mem. Thanks a lot Il giorno mer 11 set 2019 alle ore 10:10 Pavel Stehule < pavel.steh...@gmail.com> ha scritto: > Hi > > st 11. 9. 2019 v 9:48 odesílatel Nicola Contu > napsal: > >> Hello, >> We are running postgres 11.5 and in the last two weeks we did : >> >> - upgrade of postgres to 11.5 from 11.4 >> - increased shared_buffer to 1/3 of the memory >> - increased effective_cache_size = 160GB from 120 >> - increased checkpoint_completion_target = 0.9 from 0.7 >> - increased checkpoint_timeout = 1h >> - increased work_mem = 2GB (this can be set up to 4GB) from 600MB >> >> Since that, in the last two weeks we saw an increment of this error : >> >> ERROR: too many dynamic shared memory segments >> >> Is there any relation between these parameters or the pgsql 11.5 version? >> > > I expect it can be related to increasing work_mem - maybe parallel hash > join was used > > Regards > > Pavel > >> >> Any help can be appreciated. >> >> Thank you, >> Nicola >> >
Re: ERROR: too many dynamic shared memory segments
Hey Thomas, after a few months, we started having this issue again. So we revert the work_mem parameter to 600MB instead of 2GB. But the issue is still there. A query went to segmentation fault, the DB went to recovery mode and our app went to read only for a few minutes. I understand we can increase max_connections so we can have many more segments. My question is : is there a way to understand the number of segments we reached? Currently we have 220 max_conn so as your formula is 64 + 2* 220 we have about 500 shared segments. We would like to increase that number to 300 or 400 but would be great to understand if there is a way to make sure we will solve the issue as it requires a restart of the service. I know you were also talking about a redesign this part in PostgreSQL. Do you know if anything has changed in any of the newer versions after 11.5? Thanks a lot, Nicola Il giorno gio 12 set 2019 alle ore 01:01 Thomas Munro < thomas.mu...@gmail.com> ha scritto: > On Wed, Sep 11, 2019 at 11:20 PM Nicola Contu > wrote: > > If the error persist I will try to revert the work_mem. > > Thanks a lot > > Hi Nicola, > > It's hard to say exactly what the cause of the problem is in your case > and how to avoid it, without knowing what your query plans look like. > PostgreSQL allows 64 + 2 * max_connections segments to exist a time, > and it needs a number of them that depends on work_mem (in the case of > Parallel Hash Join and Parallel Bitmap Index Scan), and also depends > on the number of Gather nodes that appear in the plan, which in some > unusual cases can result from partitioning. > > I've seen people reaching this error by running a lot of parallel > queries concurrently. If that's the cause, then you can definitely > get some relief by turning work_mem down, or by turning > max_connections up (even though you don't want to allow more > connections -- because it influences the formula for deciding on the > DSM segment limit). We should probably adjust some of the internal > constants to give us more slots, to avoid that problem, as discussed > here: > > > https://www.postgresql.org/message-id/flat/CA%2BhUKGL6H2BpGbiF7Lj6QiTjTGyTLW_vLR%3DSn2tEBeTcYXiMKw%40mail.gmail.com > > I've also seen people reaching this error by somehow coming up with > plans that had a very large number of Gather nodes in them, > corresponding to partitions; that's probably a bad plan (it'd > presumably be better to terminate parallelism higher up in the plan, > but these plans do seem to exist in the wild; I don't recall exactly > why). I think we need a bit of a redesign so that if there are > multiple Gather nodes, they share the same main DSM segment, instead > of blowing through this limit. > > -- > Thomas Munro > https://enterprisedb.com >
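To the question of how to see how many segments are in use at a given moment: each DSM segment is backed by a file that PostgreSQL creates under /dev/shm when dynamic_shared_memory_type = posix (the default on Linux), so a rough count while the workload is running is simply:

ls /dev/shm | grep -c PostgreSQL

Watching that number during the busy periods, and comparing it against the 64 + 2 * max_connections ceiling (504 with max_connections = 220), should show how much headroom a bump to 300 or 400 connections would actually buy.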
Re: ERROR: too many dynamic shared memory segments
We also reverted this param : cmdv3=# show max_parallel_workers_per_gather; max_parallel_workers_per_gather - 2 (1 row) It was set to 8. Il giorno mar 21 gen 2020 alle ore 16:06 Nicola Contu < nicola.co...@gmail.com> ha scritto: > Hey Thomas, > after a few months, we started having this issue again. > So we revert the work_mem parameter to 600MB instead of 2GB. > But the issue is still there. A query went to segmentation fault, the DB > went to recovery mode and our app went to read only for a few minutes. > > I understand we can increase max_connections so we can have many more > segments. > > My question is : is there a way to understand the number of segments we > reached? > Currently we have 220 max_conn so as your formula is 64 + 2* 220 we have > about 500 shared segments. > We would like to increase that number to 300 or 400 but would be great to > understand if there is a way to make sure we will solve the issue as it > requires a restart of the service. > > I know you were also talking about a redesign this part in PostgreSQL. Do > you know if anything has changed in any of the newer versions after 11.5? > > Thanks a lot, > Nicola > > > > > > Il giorno gio 12 set 2019 alle ore 01:01 Thomas Munro < > thomas.mu...@gmail.com> ha scritto: > >> On Wed, Sep 11, 2019 at 11:20 PM Nicola Contu >> wrote: >> > If the error persist I will try to revert the work_mem. >> > Thanks a lot >> >> Hi Nicola, >> >> It's hard to say exactly what the cause of the problem is in your case >> and how to avoid it, without knowing what your query plans look like. >> PostgreSQL allows 64 + 2 * max_connections segments to exist a time, >> and it needs a number of them that depends on work_mem (in the case of >> Parallel Hash Join and Parallel Bitmap Index Scan), and also depends >> on the number of Gather nodes that appear in the plan, which in some >> unusual cases can result from partitioning. >> >> I've seen people reaching this error by running a lot of parallel >> queries concurrently. If that's the cause, then you can definitely >> get some relief by turning work_mem down, or by turning >> max_connections up (even though you don't want to allow more >> connections -- because it influences the formula for deciding on the >> DSM segment limit). We should probably adjust some of the internal >> constants to give us more slots, to avoid that problem, as discussed >> here: >> >> >> https://www.postgresql.org/message-id/flat/CA%2BhUKGL6H2BpGbiF7Lj6QiTjTGyTLW_vLR%3DSn2tEBeTcYXiMKw%40mail.gmail.com >> >> I've also seen people reaching this error by somehow coming up with >> plans that had a very large number of Gather nodes in them, >> corresponding to partitions; that's probably a bad plan (it'd >> presumably be better to terminate parallelism higher up in the plan, >> but these plans do seem to exist in the wild; I don't recall exactly >> why). I think we need a bit of a redesign so that if there are >> multiple Gather nodes, they share the same main DSM segment, instead >> of blowing through this limit. >> >> -- >> Thomas Munro >> https://enterprisedb.com >> >
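Since both work_mem and max_parallel_workers_per_gather only matter when a plan actually goes parallel, one hedged way to check whether a suspect query still produces Gather nodes is to count them in its plan (the query below is only a placeholder):

    psql -U postgres -d cmdv3 \
      -c "EXPLAIN (COSTS OFF) SELECT * FROM some_big_table WHERE some_condition;" \
      | grep -c 'Gather'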
Re: ERROR: too many dynamic shared memory segments
Hello, may I ask you for a feedback? Thanks a lot Il giorno mar 21 gen 2020 alle ore 17:14 Nicola Contu < nicola.co...@gmail.com> ha scritto: > We also reverted this param : > > cmdv3=# show max_parallel_workers_per_gather; > max_parallel_workers_per_gather > - > 2 > (1 row) > > It was set to 8. > > > Il giorno mar 21 gen 2020 alle ore 16:06 Nicola Contu < > nicola.co...@gmail.com> ha scritto: > >> Hey Thomas, >> after a few months, we started having this issue again. >> So we revert the work_mem parameter to 600MB instead of 2GB. >> But the issue is still there. A query went to segmentation fault, the DB >> went to recovery mode and our app went to read only for a few minutes. >> >> I understand we can increase max_connections so we can have many more >> segments. >> >> My question is : is there a way to understand the number of segments we >> reached? >> Currently we have 220 max_conn so as your formula is 64 + 2* 220 we have >> about 500 shared segments. >> We would like to increase that number to 300 or 400 but would be great to >> understand if there is a way to make sure we will solve the issue as it >> requires a restart of the service. >> >> I know you were also talking about a redesign this part in PostgreSQL. Do >> you know if anything has changed in any of the newer versions after 11.5? >> >> Thanks a lot, >> Nicola >> >> >> >> >> >> Il giorno gio 12 set 2019 alle ore 01:01 Thomas Munro < >> thomas.mu...@gmail.com> ha scritto: >> >>> On Wed, Sep 11, 2019 at 11:20 PM Nicola Contu >>> wrote: >>> > If the error persist I will try to revert the work_mem. >>> > Thanks a lot >>> >>> Hi Nicola, >>> >>> It's hard to say exactly what the cause of the problem is in your case >>> and how to avoid it, without knowing what your query plans look like. >>> PostgreSQL allows 64 + 2 * max_connections segments to exist a time, >>> and it needs a number of them that depends on work_mem (in the case of >>> Parallel Hash Join and Parallel Bitmap Index Scan), and also depends >>> on the number of Gather nodes that appear in the plan, which in some >>> unusual cases can result from partitioning. >>> >>> I've seen people reaching this error by running a lot of parallel >>> queries concurrently. If that's the cause, then you can definitely >>> get some relief by turning work_mem down, or by turning >>> max_connections up (even though you don't want to allow more >>> connections -- because it influences the formula for deciding on the >>> DSM segment limit). We should probably adjust some of the internal >>> constants to give us more slots, to avoid that problem, as discussed >>> here: >>> >>> >>> https://www.postgresql.org/message-id/flat/CA%2BhUKGL6H2BpGbiF7Lj6QiTjTGyTLW_vLR%3DSn2tEBeTcYXiMKw%40mail.gmail.com >>> >>> I've also seen people reaching this error by somehow coming up with >>> plans that had a very large number of Gather nodes in them, >>> corresponding to partitions; that's probably a bad plan (it'd >>> presumably be better to terminate parallelism higher up in the plan, >>> but these plans do seem to exist in the wild; I don't recall exactly >>> why). I think we need a bit of a redesign so that if there are >>> multiple Gather nodes, they share the same main DSM segment, instead >>> of blowing through this limit. >>> >>> -- >>> Thomas Munro >>> https://enterprisedb.com >>> >>
Re: ERROR: too many dynamic shared memory segments
This is the error on postgres log of the segmentation fault : 2020-01-21 14:20:29 GMT [] [4]: [108-1] db=,user= LOG: server process (PID 2042) was terminated by signal 11: Segmentation fault 2020-01-21 14:20:29 GMT [] [4]: [109-1] db=,user= DETAIL: Failed process was running: select pid from pg_stat_activity where query ilike 'REFRESH MATERIALIZED VIEW CONCURRENTLY matview_vrs_request_stats' 2020-01-21 14:20:29 GMT [] [4]: [110-1] db=,user= LOG: terminating any other active server processes 2020-01-21 14:20:29 GMT [127.0.0.1(34256)] [2724]: [4-1] db=cmdv3,user=admin WARNING: terminating connection because of crash of another server process 2020-01-21 14:20:29 GMT [127.0.0.1(34256)] [2724]: [5-1] db=cmdv3,user=admin DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2020-01-21 14:20:29 GMT [127.0.0.1(34256)] [2724]: [6-1] db=cmdv3,user=admin HINT: In a moment you should be able to reconnect to the database and repeat your command. 2020-01-21 14:20:29 GMT [127.0.0.1(34026)] [2055]: [5-1] db=cmdv3,user=admin WARNING: terminating connection because of crash of another server process At CentOS level : Jan 21 14:20:29 usnyh-cmd1.gt-t.net kernel: postmaster[2042]: segfault at 0 ip 0048bef4 sp 7ffdf4955bb0 error 4 in postgres[40+6c5000] Jan 21 14:20:29 usnyh-cmd1.gt-t.net pgbouncer: 2020-01-21 14:20:29.905 GMT [33727] WARNING C-0x21526e8: cmdv3/admin@10.151.2.154:39688 pooler error: server conn crashed? Jan 21 14:20:29 usnyh-cmd1.gt-t.net pgbouncer: 2020-01-21 14:20:29.905 GMT [33727] WARNING S-0x1f5ff58: cmdv3/admin@127.0.0.1:5432 got packet 'N' from server when not linked Jan 21 14:20:29 usnyh-cmd1.gt-t.net pgbouncer: 2020-01-21 14:20:29.905 GMT [33727] WARNING S-0x25097f0: cmdv3/admin@127.0.0.1:5432 got packet 'N' from server when not linked Jan 21 14:20:29 usnyh-cmd1.gt-t.net pgbouncer: 2020-01-21 14:20:29.906 GMT [33727] WARNING S-0x2508b60: cmdv3/admin@127.0.0.1:5432 got packet 'N' from server when not linked Jan 21 14:20:29 usnyh-cmd1.gt-t.net pgbouncer: 2020-01-21 14:20:29.981 GMT [33727] ERROR S: login failed: FATAL: the database system is in recovery mode > If you're on Linux, you can probably see them with "ls /dev/shm". I see a lot of files there, and doing a cat they are empty. What can I do with them? Those are two different problems I guess, but they are related because right before the Segmentation Fault I see a lot of shared segment errors in the postgres log. Il giorno mer 29 gen 2020 alle ore 10:09 Thomas Munro < thomas.mu...@gmail.com> ha scritto: > On Wed, Jan 22, 2020 at 4:06 AM Nicola Contu > wrote: > > after a few months, we started having this issue again. > > So we revert the work_mem parameter to 600MB instead of 2GB. > > But the issue is still there. A query went to segmentation fault, the DB > went to recovery mode and our app went to read only for a few minutes. > > Hi Nicola, > Hmm, a segmentation fault sounds like a different problem. Can you > please share the exact error messages from PostgreSQL and OS logs? > > > I understand we can increase max_connections so we can have many more > segments. > > > > My question is : is there a way to understand the number of segments we > reached? > > If you're on Linux, you can probably see them with "ls /dev/shm". > > > Currently we have 220 max_conn so as your formula is 64 + 2* 220 we have > about 500 shared segments. 
> > We would like to increase that number to 300 or 400 but would be great > to understand if there is a way to make sure we will solve the issue as it > requires a restart of the service. > > > > I know you were also talking about a redesign this part in PostgreSQL. > Do you know if anything has changed in any of the newer versions after 11.5? > > It's possible that we should increase a couple of constants used the > formula -- I'll look into that again. But first I'd like to see if > we're even investigating the right problem here. >
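As for counting how many segments are in use at a given moment, the "ls /dev/shm" listing mentioned above can be turned into a number; a sketch, assuming dynamic_shared_memory_type = posix (the Linux default), which is why the segments show up as /dev/shm files (they are binary, so cat shows nothing readable):

    # count live PostgreSQL DSM segments and compare against 64 + 2 * max_connections
    ls /dev/shm | grep -c '^PostgreSQL\.'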
Re: ERROR: too many dynamic shared memory segments
Hi Thomas, unfortunately I can't find any core dump to help you more. Thanks for the fix, we are in the process of installing 12.1 in production, so we can still wait on this release and go live with 12.2 I will let you know at this point if I still get this after installing 12.2 trying to build a core dump file. Do you still recommend to increase max_conn? Il giorno gio 30 gen 2020 alle ore 23:41 Thomas Munro < thomas.mu...@gmail.com> ha scritto: > On Wed, Jan 29, 2020 at 11:53 PM Thomas Munro > wrote: > > On Wed, Jan 29, 2020 at 10:37 PM Nicola Contu > wrote: > > > This is the error on postgres log of the segmentation fault : > > > > > > 2020-01-21 14:20:29 GMT [] [4]: [108-1] db=,user= LOG: server > process (PID 2042) was terminated by signal 11: Segmentation fault > > > 2020-01-21 14:20:29 GMT [] [4]: [109-1] db=,user= DETAIL: Failed > process was running: select pid from pg_stat_activity where query ilike > 'REFRESH MATERIALIZED VIEW CONCURRENTLY matview_vrs_request_stats' > > > 2020-01-21 14:20:29 GMT [] [4]: [110-1] db=,user= LOG: > terminating any other active server processes > > > That gave me an idea... I hacked my copy of PostgreSQL to flip a coin > > to decide whether to pretend there are no slots free (see below), and > > I managed to make it crash in the regression tests when doing a > > parallel index build. It's late here now, but I'll look into that > > tomorrow. It's possible that the parallel index code needs to learn > > to cope with that. > > Hi Nicola, > > Without more information I can't know if I found the same bug you > experienced, but I think it's likely. I have committed a fix for > that, which will be available in the next release (mid-February). > > > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=74618e77b43cfce670b4725d5b9a300a2afd12d1 >
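If another crash happens before the fixed release is in place, a core file would help; a hedged sketch for CentOS 7, assuming the server runs under a systemd unit named postgresql (adjust to however the postmaster is actually started):

    mkdir -p /etc/systemd/system/postgresql.service.d
    printf '[Service]\nLimitCORE=infinity\n' > /etc/systemd/system/postgresql.service.d/core.conf
    systemctl daemon-reload && systemctl restart postgresql
    sysctl kernel.core_pattern   # check that abrt is not intercepting core files
    # if the server is started with pg_ctl instead, run "ulimit -c unlimited" in that shell first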
Streaming replication - 11.5
Hello, I have two servers connected to the same switch running postgres 11.5. I am trying to re-replicate one of those servers after planned work on the master, because the replica has been lost. It has always worked, but now I get this : pg_basebackup: could not receive data from WAL stream: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. I don't really understand what the issue is. I had this issue last week as well in another DC and I had to reboot the slave to get it working (not sure why it helped). Do you know what can cause this? Thank you, Nicola
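For reference, a minimal sketch of the kind of pg_basebackup invocation under discussion (upstream host, user and target directory are placeholders); when it dies like this, the first place to look is the log of the server the backup streams from, around the time of the failure:

    pg_basebackup -h UPSTREAM_HOST -U replicator -D /path/to/new/standby/data \
        -X stream -P -R -v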
Re: Streaming replication - 11.5
user= LOG: recovery restart point at 643B/12680A80 2020-03-11 09:47:14 GMT [] [12598]: [4530-1] db=,user= DETAIL: Last completed transaction was at log time 2020-03-11 09:47:14.069731+00. 2020-03-11 09:47:44 GMT [] [12598]: [4531-1] db=,user= LOG: restartpoint starting: time Il giorno mer 11 mar 2020 alle ore 01:53 Adrian Klaver < adrian.kla...@aklaver.com> ha scritto: > On 3/10/20 8:17 AM, Nicola Contu wrote: > Please post to list also. > Ccing list. > > What came immediately before the temporary file error? > > >2020-03-10 15:10:17 GMT [[local]] [28171]: [1-1] > > db=postgres,user=postgres LOG: temporary file: path > > "base/pgsql_tmp/pgsql_tmp28171.0", size 382474936 > > 2020-03-10 15:10:17 GMT [[local]] [28171]: [4-1] > > db=postgres,user=postgres LOG: could not send data to client: Broken > pipe > > 2020-03-10 15:10:17 GMT [[local]] [28171]: [5-1] > > db=postgres,user=postgres FATAL: connection to client lost > > 2020-03-10 15:10:26 GMT [] [12598]: [3544-1] db=,user= LOG: > > restartpoint complete: wrote 37315 buffers (0.4%); 0 WAL file(s) > > added, 0 removed, 16 recycled; write=269.943 s, sync=0.039 s, > > total=269.999 s; sync files=1010, longest=0.001 s, average=0.000 s; > > distance=175940 kB, estimate=416149 kB > > 2020-03-10 15:10:26 GMT [] [12598]: [3545-1] db=,user= LOG: recovery > > restart point at 6424/1D7DEDE8 > > > > It is a cascade replication > > > > Il giorno mar 10 mar 2020 alle ore 15:58 Adrian Klaver > > mailto:adrian.kla...@aklaver.com>> ha > scritto: > > > > On 3/10/20 2:26 AM, Nicola Contu wrote: > > > Hello, > > > I have two servers connected to the same switch running postgres > 11.5 > > > > > > I am trying to replicate one of those servers after a planned > > work on > > > the master, so the replica has been lost. It has always worked > > but now I > > > get this : > > > > > > pg_basebackup: could not receive data from WAL stream: server > > closed the > > > connection unexpectedly > > > This probably means the server terminated abnormally > > > before or while processing the request. > > > > > > I don't really understand what the issue is. > > > > I would start with the logs from the Postgres server you are taking > the > > backup from. > > > > > I had this issue last week as well in another DC and I had to > > reboot the > > > slave to make it working (not sure why it helped) > > > > > > Do you know what can cause this? > > > > > > Thank you, > > > Nicola > > > > > > -- > > Adrian Klaver > > adrian.kla...@aklaver.com <mailto:adrian.kla...@aklaver.com> > > > > > -- > Adrian Klaver > adrian.kla...@aklaver.com >
Re: Streaming replication - 11.5
I am actually cascading. The master is in nyh, the first slave is in Dallas and the one having problems is in Dallas as well on the same switch of the one replicating from the master. It always worked not sure what is wrong now. We just encrypted disks on all servers Il mer 11 mar 2020, 18:57 Adrian Klaver ha scritto: > On 3/11/20 2:54 AM, Nicola Contu wrote: > > These are the lines before > > > > 2020-03-11 09:05:08 GMT [127.0.0.1(40214)] [43853]: [1-1] > > db=cmdv3,user=zabbix_check ERROR: recovery is in progress > > 2020-03-11 09:05:08 GMT [127.0.0.1(40214)] [43853]: [2-1] > > db=cmdv3,user=zabbix_check HINT: WAL control functions cannot be > > executed during recovery. > > 2020-03-11 09:05:08 GMT [127.0.0.1(40214)] [43853]: [3-1] > > db=cmdv3,user=zabbix_check STATEMENT: select > > greatest(0,pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) from > > pg_stat_replication where client_addr ='10.150.20.22' > > > > That query is made by Zabbix. So I stopped the zabbix agent and tested > > again. But still failing, because of this now : > > > > pg_basebackup: starting background WAL receiver > > pg_basebackup: created temporary replication slot "pg_basebackup_51199" > > *pg_basebackup: could not receive data from WAL stream: SSL SYSCALL > > error: EOF detected > > *^C4699810/504983062 kB (70%), 0/1 tablespace > > (...ql11/data/base/16401/231363544.2) > > So you started over with a pg_basebackup? > > Also from below: > > 2020-03-11 09:43:53 GMT [10.150.20.22(54906)] [51199]: [1-1] > db=[unknown],user=replicator LOG: terminating walsender process due to > replication timeout > > Where are the master and standby in relation to each other network wise? > > Intervening firewalls, network latency issues? > > > > > > > > > here the full log starting right before the last try : > > > > 2020-03-11 09:22:44 GMT [] [12598]: [4508-1] db=,user= LOG: > > restartpoint complete: wrote 19565 buffers (0.2%); 0 WAL file(s) > > added, 0 removed, 7 recycled; write=270.014 s, sync=0.009 s, > > total=270.036 s; sync files=804, longest=0.001 s, average=0.000 s; > > distance=131239 kB, estimate=725998 kB > > 2020-03-11 09:22:44 GMT [] [12598]: [4509-1] db=,user= LOG: recovery > > restart point at 643A/D8C05F70 > > 2020-03-11 09:22:44 GMT [] [12598]: [4510-1] db=,user= DETAIL: Last > > completed transaction was at log time 2020-03-11 09:22:44.050084+00. > > 2020-03-11 09:23:14 GMT [] [12598]: [4511-1] db=,user= LOG: > > restartpoint starting: time > > 2020-03-11 09:27:44 GMT [] [12598]: [4512-1] db=,user= LOG: > > restartpoint complete: wrote 17069 buffers (0.2%); 0 WAL file(s) > > added, 0 removed, 17 recycled; write=269.879 s, sync=0.006 s, > > total=269.902 s; sync files=811, longest=0.001 s, average=0.000 s; > > distance=120469 kB, estimate=665445 kB > > 2020-03-11 09:27:44 GMT [] [12598]: [4513-1] db=,user= LOG: recovery > > restart point at 643A/E01AB438 > > 2020-03-11 09:27:44 GMT [] [12598]: [4514-1] db=,user= DETAIL: Last > > completed transaction was at log time 2020-03-11 09:27:43.945485+00. 
> > 2020-03-11 09:27:44 GMT [] [12598]: [4515-1] db=,user= LOG: > > restartpoint starting: force wait > > 2020-03-11 09:29:24 GMT [10.222.8.2(47834)] [50961]: [1-1] > > db=cmdv3,user=nis LOG: duration: 1402.004 ms statement: SELECT id, > > name, parent_id, parent, short_name, sales_rep_id FROM mmx_clients; > > 2020-03-11 09:29:34 GMT [10.222.8.2(47834)] [50961]: [2-1] > > db=cmdv3,user=nis LOG: duration: 9493.259 ms statement: SELECT slid, > > gnid, sof_id, client_id, product FROM mmx_slids; > > 2020-03-11 09:32:14 GMT [] [12598]: [4516-1] db=,user= LOG: > > restartpoint complete: wrote 71260 buffers (0.8%); 0 WAL file(s) > > added, 0 removed, 13 recycled; write=269.953 s, sync=0.012 s, > > total=269.979 s; sync files=760, longest=0.002 s, average=0.000 s; > > distance=123412 kB, estimate=611242 kB > > 2020-03-11 09:32:14 GMT [] [12598]: [4517-1] db=,user= LOG: recovery > > restart point at 643A/E7A30498 > > 2020-03-11 09:32:14 GMT [] [12598]: [4518-1] db=,user= DETAIL: Last > > completed transaction was at log time 2020-03-11 09:32:13.916101+00. > > 2020-03-11 09:32:44 GMT [] [12598]: [4519-1] db=,user= LOG: > > restartpoint starting: time > > 2020-03-11 09:37:14 GMT [] [12598]: [4520-1] db=,user= LOG: > > restartpoint complete: wrote 27130 buffers (0.3%); 0 WAL file(s) > > added, 0 removed, 12 recycled; write=270.026 s, sync=0.007 s, > > total=270.052 s; sync files=814, longest=0.001 s, average=0.000 s; > > distance=28059
Re: Streaming replication - 11.5
CPU load on the server to be built? No. System logs don't show anything relevant unfortunately Il mer 11 mar 2020, 21:34 Adrian Klaver ha scritto: > On 3/11/20 11:59 AM, Nicola Contu wrote: > > I am actually cascading. > > The master is in nyh, the first slave is in Dallas and the one having > > problems is in Dallas as well on the same switch of the one replicating > > from the master. > > > > It always worked not sure what is wrong now. We just encrypted disks on > > all servers > > Do you have before and after on CPU load, I/O throughput? > > Do system logs show anything relevant during replication drop out? > > > > > > > Il mer 11 mar 2020, 18:57 Adrian Klaver > <mailto:adrian.kla...@aklaver.com>> ha scritto: > > > > On 3/11/20 2:54 AM, Nicola Contu wrote: > > > These are the lines before > > > > > > 2020-03-11 09:05:08 GMT [127.0.0.1(40214)] [43853]: [1-1] > > > db=cmdv3,user=zabbix_check ERROR: recovery is in progress > > > 2020-03-11 09:05:08 GMT [127.0.0.1(40214)] [43853]: [2-1] > > > db=cmdv3,user=zabbix_check HINT: WAL control functions cannot be > > > executed during recovery. > > > 2020-03-11 09:05:08 GMT [127.0.0.1(40214)] [43853]: [3-1] > > > db=cmdv3,user=zabbix_check STATEMENT: select > > > greatest(0,pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) from > > > pg_stat_replication where client_addr ='10.150.20.22' > > > > > > That query is made by Zabbix. So I stopped the zabbix agent and > > tested > > > again. But still failing, because of this now : > > > > > > pg_basebackup: starting background WAL receiver > > > pg_basebackup: created temporary replication slot > > "pg_basebackup_51199" > > > *pg_basebackup: could not receive data from WAL stream: SSL > SYSCALL > > > error: EOF detected > > > *^C4699810/504983062 kB (70%), 0/1 tablespace > > > (...ql11/data/base/16401/231363544.2) > > > > So you started over with a pg_basebackup? > > > > Also from below: > > > > 2020-03-11 09:43:53 GMT [10.150.20.22(54906)] [51199]: [1-1] > > db=[unknown],user=replicator LOG: terminating walsender process due > to > > replication timeout > > > > Where are the master and standby in relation to each other network > wise? > > > > Intervening firewalls, network latency issues? > > > > > > > > > > > > > > > here the full log starting right before the last try : > > > > > > 2020-03-11 09:22:44 GMT [] [12598]: [4508-1] db=,user= LOG: > > > restartpoint complete: wrote 19565 buffers (0.2%); 0 WAL file(s) > > > added, 0 removed, 7 recycled; write=270.014 s, sync=0.009 s, > > > total=270.036 s; sync files=804, longest=0.001 s, average=0.000 s; > > > distance=131239 kB, estimate=725998 kB > > > 2020-03-11 09:22:44 GMT [] [12598]: [4509-1] db=,user= LOG: > > recovery > > > restart point at 643A/D8C05F70 > > > 2020-03-11 09:22:44 GMT [] [12598]: [4510-1] db=,user= DETAIL: > Last > > > completed transaction was at log time 2020-03-11 > 09:22:44.050084+00. 
> > > 2020-03-11 09:23:14 GMT [] [12598]: [4511-1] db=,user= LOG: > > > restartpoint starting: time > > > 2020-03-11 09:27:44 GMT [] [12598]: [4512-1] db=,user= LOG: > > > restartpoint complete: wrote 17069 buffers (0.2%); 0 WAL file(s) > > > added, 0 removed, 17 recycled; write=269.879 s, sync=0.006 s, > > > total=269.902 s; sync files=811, longest=0.001 s, average=0.000 s; > > > distance=120469 kB, estimate=665445 kB > > > 2020-03-11 09:27:44 GMT [] [12598]: [4513-1] db=,user= LOG: > > recovery > > > restart point at 643A/E01AB438 > > > 2020-03-11 09:27:44 GMT [] [12598]: [4514-1] db=,user= DETAIL: > Last > > > completed transaction was at log time 2020-03-11 > 09:27:43.945485+00. > > > 2020-03-11 09:27:44 GMT [] [12598]: [4515-1] db=,user= LOG: > > > restartpoint starting: force wait > > > 2020-03-11 09:29:24 GMT [10.222.8.2(47834)] [50961]: [1-1] > > > db=cmdv3,user=nis LOG: duration: 1402.004 ms statement: SELECT > id, > > > name, parent_id, parent, short_name, sales_rep_id FROM > mmx_clients; > > > 2020-03-11 09:29:34 GM
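One way to answer the CPU load / I-O throughput question is to watch both ends while a pg_basebackup attempt is running; a sketch (on CentOS 7 both tools come from the sysstat package):

    iostat -xm 5   # per-device utilisation and MB/s on sender and receiver
    sar -n DEV 5   # network throughput between the two sites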
Re: Streaming replication - 11.5
The replicator is ok and the replicated as well. %Cpu(s): 0.2 us, 1.0 sy, 0.0 ni, 94.8 id, 4.0 wa, 0.0 hi, 0.0 si, 0.0 st CPU is really low on both. I am running pg_basebackup again everytime. Any other suggestions? Il giorno mer 11 mar 2020 alle ore 23:13 Adrian Klaver < adrian.kla...@aklaver.com> ha scritto: > On 3/11/20 2:12 PM, Nicola Contu wrote: > > CPU load on the server to be built? No. > > CPU load, I/O load on the servers in the replication chain. > > Basically you just recently, it seems, imposed extra overhead to the > process by encrypting/decrypting. From what I gather from earlier post > then your replication started failing. Trying to establish whether the > former is causing the latter. > > Not sure what you are referring to in regards to server to be built? > > > System logs don't show anything relevant unfortunately > > > > Il mer 11 mar 2020, 21:34 Adrian Klaver > <mailto:adrian.kla...@aklaver.com>> ha scritto: > > > > On 3/11/20 11:59 AM, Nicola Contu wrote: > > > I am actually cascading. > > > The master is in nyh, the first slave is in Dallas and the one > > having > > > problems is in Dallas as well on the same switch of the one > > replicating > > > from the master. > > > > > > It always worked not sure what is wrong now. We just encrypted > > disks on > > > all servers > > > > Do you have before and after on CPU load, I/O throughput? > > > > Do system logs show anything relevant during replication drop out? > > > > > > > > > > > Il mer 11 mar 2020, 18:57 Adrian Klaver > > mailto:adrian.kla...@aklaver.com> > > > <mailto:adrian.kla...@aklaver.com > > <mailto:adrian.kla...@aklaver.com>>> ha scritto: > > > > > > On 3/11/20 2:54 AM, Nicola Contu wrote: > > > > These are the lines before > > > > > > > > 2020-03-11 09:05:08 GMT [127.0.0.1(40214)] [43853]: [1-1] > > > > db=cmdv3,user=zabbix_check ERROR: recovery is in progress > > > > 2020-03-11 09:05:08 GMT [127.0.0.1(40214)] [43853]: [2-1] > > > > db=cmdv3,user=zabbix_check HINT: WAL control functions > > cannot be > > > > executed during recovery. > > > > 2020-03-11 09:05:08 GMT [127.0.0.1(40214)] [43853]: [3-1] > > > > db=cmdv3,user=zabbix_check STATEMENT: select > > > > greatest(0,pg_wal_lsn_diff(pg_current_wal_lsn(), > > replay_lsn)) from > > > > pg_stat_replication where client_addr ='10.150.20.22' > > > > > > > > That query is made by Zabbix. So I stopped the zabbix > > agent and > > > tested > > > > again. But still failing, because of this now : > > > > > > > > pg_basebackup: starting background WAL receiver > > > > pg_basebackup: created temporary replication slot > > > "pg_basebackup_51199" > > > > *pg_basebackup: could not receive data from WAL stream: > > SSL SYSCALL > > > > error: EOF detected > > > > *^C4699810/504983062 kB (70%), 0/1 tablespace > > > > (...ql11/data/base/16401/231363544.2) > > > > > > So you started over with a pg_basebackup? > > > > > > Also from below: > > > > > > 2020-03-11 09:43:53 GMT [10.150.20.22(54906)] [51199]: [1-1] > > > db=[unknown],user=replicator LOG: terminating walsender > > process due to > > > replication timeout > > > > > > Where are the master and standby in relation to each other > > network wise? > > > > > > Intervening firewalls, network latency issues? 
> > > > > > > > > > > > > > > > > > > > > here the full log starting right before the last try : > > > > > > > > 2020-03-11 09:22:44 GMT [] [12598]: [4508-1] db=,user= LOG: > > > > restartpoint complete: wrote 19565 buffers (0.2%); 0 WAL > > file(s) > > > > added, 0 removed, 7 recycled; write=270.014 s, sync=0.009 > s, > > > >
Re: Streaming replication - 11.5
The encryption is at the OS level: the drives where the db saves its data are encrypted with a passphrase. Il gio 12 mar 2020, 15:51 Adrian Klaver ha scritto: > On 3/12/20 4:31 AM, Nicola Contu wrote: > > The replicator is ok and the replicated as well. > > %Cpu(s): 0.2 us, 1.0 sy, 0.0 ni, 94.8 id, 4.0 wa, 0.0 hi, 0.0 si, > > 0.0 st > > > > CPU is really low on both. > > > > I am running pg_basebackup again every time. > > Any other suggestions? > > > > I have to believe there is a connection between changing to encrypting > the disks and your issues. Not sure what, but to help, how is the > encryption being done and what program is being used? > > > -- > Adrian Klaver > adrian.kla...@aklaver.com >
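Given that the disk encryption is the only recent change, a hedged way to confirm the dm-crypt layer and get a rough idea of its cost, assuming LUKS/dm-crypt is what "encrypted with a passphrase" means here:

    lsblk -o NAME,TYPE,MOUNTPOINT | grep -i crypt   # is the data volume really behind dm-crypt?
    cryptsetup benchmark                            # raw cipher throughput on this CPU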
Re: Streaming replication - 11.5
So in the logs I now see this : 2020-03-13 11:03:42 GMT [10.150.20.22(45294)] [27804]: [1-1] db=[unknown],user=replicator LOG: terminating walsender process due to replication timeout So I tried increasing the wal_sender_timeout to 300s but it did not help Il giorno gio 12 mar 2020 alle ore 15:56 Nicola Contu < nicola.co...@gmail.com> ha scritto: > The encryption is at os level. So the drives are encrypted with a password > where the db saves data > > Il gio 12 mar 2020, 15:51 Adrian Klaver ha > scritto: > >> On 3/12/20 4:31 AM, Nicola Contu wrote: >> > The replicator is ok and the replicated as well. >> > %Cpu(s): 0.2 us, 1.0 sy, 0.0 ni, 94.8 id, 4.0 wa, 0.0 hi, 0.0 si, >> > 0.0 st >> > >> > CPU is really low on both. >> > >> > I am running pg_basebackup again everytime. >> > Any other suggestions? >> > >> >> I have to believe their is a connection between changing to encrypting >> the disks and your issues. Not sure what, but to help how is the >> encryption being done and what program is being used? >> >> >> -- >> Adrian Klaver >> adrian.kla...@aklaver.com >> >
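For completeness, wal_sender_timeout is reloadable, so the change tried above does not need a restart; a sketch, run on the server that logs the "terminating walsender" message:

    psql -U postgres -c "ALTER SYSTEM SET wal_sender_timeout = '300s';"
    psql -U postgres -c "SELECT pg_reload_conf();"
    psql -U postgres -c "SHOW wal_sender_timeout;"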
Re: Streaming replication - 11.5
I was able to make pg_basebackup working using --max-rate=128M Still don't understand why. I guess it is related to the encryption and slowness of the disk.. Do you have any idea? Il giorno ven 13 mar 2020 alle ore 16:15 Adrian Klaver < adrian.kla...@aklaver.com> ha scritto: > On 3/13/20 4:11 AM, Nicola Contu wrote: > > So in the logs I now see this : > > > > 2020-03-13 11:03:42 GMT [10.150.20.22(45294)] [27804]: [1-1] > > db=[unknown],user=replicator LOG: terminating walsender process due to > > replication timeout > > Yeah that's been showing up the log snippets you have been posting. > > To figure this out you will need to: > > 1) Make a list of what changed since the last time replication worked > consistently. > > 2) Monitor the changed components, start logging or increase logging. > > 3) Monitor the chain of replication as whole, to catch changes that you > do not know about. Since you seem to be operating across data centers > that would include verifying the network. > > > > > So I tried increasing the wal_sender_timeout to 300s but it did not help > > > > Il giorno gio 12 mar 2020 alle ore 15:56 Nicola Contu > > mailto:nicola.co...@gmail.com>> ha scritto: > > > > The encryption is at os level. So the drives are encrypted with a > > password where the db saves data > > > > Il gio 12 mar 2020, 15:51 Adrian Klaver > <mailto:adrian.kla...@aklaver.com>> ha scritto: > > > > On 3/12/20 4:31 AM, Nicola Contu wrote: > > > The replicator is ok and the replicated as well. > > > %Cpu(s): 0.2 us, 1.0 sy, 0.0 ni, 94.8 id, 4.0 wa, 0.0 > > hi, 0.0 si, > > > 0.0 st > > > > > > CPU is really low on both. > > > > > > I am running pg_basebackup again everytime. > > > Any other suggestions? > > > > > > > I have to believe their is a connection between changing to > > encrypting > > the disks and your issues. Not sure what, but to help how is the > > encryption being done and what program is being used? > > > > > > -- > > Adrian Klaver > > adrian.kla...@aklaver.com <mailto:adrian.kla...@aklaver.com> > > > > > -- > Adrian Klaver > adrian.kla...@aklaver.com >
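If the suspicion is that the encrypted disks simply cannot keep up with an unthrottled base backup, a crude sequential-write test on the target volume may confirm it; a sketch (the path is a placeholder, and it writes a 4 GB scratch file):

    dd if=/dev/zero of=/path/to/pgdata/ddtest bs=8M count=512 oflag=direct conv=fsync
    rm -f /path/to/pgdata/ddtest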
[no subject]
Hello, I am running postgres 11.5 and we were having issues with shared segments. So I increased max_connections as suggested by you guys and reduced my work_mem to 600MB. Now, however, this is the second time I see this error : ERROR: could not resize shared memory segment "/PostgreSQL.2137675995" to 33624064 bytes: Interrupted system call Do you know what it means and how I can solve it? Thanks a lot, Nicola
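Since the error names a segment under /PostgreSQL.*, it is worth cross-checking the tmpfs that backs POSIX shared memory at the moment the error fires; a quick sketch:

    df -h /dev/shm                        # is the tmpfs itself running out of space?
    ls -l /dev/shm | grep 'PostgreSQL\.'  # the segment named in the error lives here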
Re: EINTR while resizing dsm segment.
So that seems to be a bug, correct? Just to confirm, I am not using NFS, it is directly on disk. Other than that, is there a particular option we can set in the postgres.conf to mitigate the issue? Thanks a lot for your help. Il giorno sab 4 apr 2020 alle ore 02:49 Thomas Munro ha scritto: > On Thu, Apr 2, 2020 at 9:25 PM Kyotaro Horiguchi > wrote: > > I provided the subject, and added -hackers. > > > > > Hello, > > > I am running postgres 11.5 and we were having issues with shared > segments. > > > So I increased the max_connection as suggested by you guys and reduced > my > > > work_mem to 600M. > > > > > > Right now instead, it is the second time I see this error : > > > > > > ERROR: could not resize shared memory segment > "/PostgreSQL.2137675995" to > > > 33624064 bytes: Interrupted system call > > > > The function posix_fallocate is protected against EINTR. > > > > | do > > | { > > | rc = posix_fallocate(fd, 0, size); > > | } while (rc == EINTR && !(ProcDiePending || QueryCancelPending)); > > > > But not for ftruncate and write. Don't we need to protect them from > > ENTRI as the attached? > > We don't handle EINTR for write() generally because that's not > supposed to be necessary on local files (local disks are not "slow > devices", and we document that if you're using something like NFS you > should use its "hard" mount option so that it behaves that way too). > As for ftruncate(), you'd think it'd be similar, and I can't think of > a more local filesystem than tmpfs (where POSIX shmem lives on Linux), > but I can't seem to figure that out from reading man pages; maybe I'm > reading the wrong ones. Perhaps in low memory situations, an I/O wait > path reached by ftruncate() can return EINTR here rather than entering > D state (non-interruptable sleep) or restarting due to our SA_RESTART > flag... anyone know? > > Another thought: is there some way for the posix_fallocate() retry > loop to exit because (ProcDiePending || QueryCancelPending), but then > for CHECK_FOR_INTERRUPTS() to do nothing, so that we fall through to > reporting the EINTR? >
Re: EINTR while resizing dsm segment.
The only change we made on the disk, is the encryption at OS level. Not sure this can be something related. Il giorno mar 7 apr 2020 alle ore 10:58 Nicola Contu ha scritto: > So that seems to be a bug, correct? > Just to confirm, I am not using NFS, it is directly on disk. > > Other than that, is there a particular option we can set in the > postgres.conf to mitigate the issue? > > Thanks a lot for your help. > > > Il giorno sab 4 apr 2020 alle ore 02:49 Thomas Munro < > thomas.mu...@gmail.com> ha scritto: > >> On Thu, Apr 2, 2020 at 9:25 PM Kyotaro Horiguchi >> wrote: >> > I provided the subject, and added -hackers. >> > >> > > Hello, >> > > I am running postgres 11.5 and we were having issues with shared >> segments. >> > > So I increased the max_connection as suggested by you guys and >> reduced my >> > > work_mem to 600M. >> > > >> > > Right now instead, it is the second time I see this error : >> > > >> > > ERROR: could not resize shared memory segment >> "/PostgreSQL.2137675995" to >> > > 33624064 bytes: Interrupted system call >> > >> > The function posix_fallocate is protected against EINTR. >> > >> > | do >> > | { >> > | rc = posix_fallocate(fd, 0, size); >> > | } while (rc == EINTR && !(ProcDiePending || QueryCancelPending)); >> > >> > But not for ftruncate and write. Don't we need to protect them from >> > ENTRI as the attached? >> >> We don't handle EINTR for write() generally because that's not >> supposed to be necessary on local files (local disks are not "slow >> devices", and we document that if you're using something like NFS you >> should use its "hard" mount option so that it behaves that way too). >> As for ftruncate(), you'd think it'd be similar, and I can't think of >> a more local filesystem than tmpfs (where POSIX shmem lives on Linux), >> but I can't seem to figure that out from reading man pages; maybe I'm >> reading the wrong ones. Perhaps in low memory situations, an I/O wait >> path reached by ftruncate() can return EINTR here rather than entering >> D state (non-interruptable sleep) or restarting due to our SA_RESTART >> flag... anyone know? >> >> Another thought: is there some way for the posix_fallocate() retry >> loop to exit because (ProcDiePending || QueryCancelPending), but then >> for CHECK_FOR_INTERRUPTS() to do nothing, so that we fall through to >> reporting the EINTR? >> >
pgbouncer
Hello, I am a bit confused about the settings in pgbouncer. What exactly is pool_size? If I set it to 3 and I try to connect from 4 shells, I am still able to connect. Same thing for max_db_connections: I set this to 1 and I am able to connect from 2 shells. This is kind of confusing and it's not really clear to me from reading the documentation. This is my config. [databases] cmdwela = host=127.0.0.1 port=5432 dbname=cmdwela auth_user=pgbouncer pool_size=120 cicciopasticcio = host=127.0.0.1 port=5432 dbname=cicciopasticcio auth_user=pgbouncer pool_size=2 [pgbouncer] listen_port = 6543 listen_addr = * auth_type = md5 auth_file = /etc/pgbouncer/users.txt auth_query = select uname,phash from user_lookup($1) logfile = /var/log/pgbouncer.log pidfile = /home/postgres/pgbouncer.pid admin_users = admin user=postgres max_db_connections=1 log_connections=0 log_disconnections=0 max_client_conn=5 I just want to limit connections from the app etc. Thanks
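One way to see what those limits actually do is to watch the pgbouncer admin console while the four shells are connected; a sketch using the listen port and admin user from the config above:

    psql -h 127.0.0.1 -p 6543 -U admin pgbouncer -c "SHOW POOLS;"    # cl_active vs sv_active per pool
    psql -h 127.0.0.1 -p 6543 -U admin pgbouncer -c "SHOW CLIENTS;"  # every client connection and its state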
Re: pgbouncer
> If I set 3, and I tried to connect from 4 shells, I am still able to connect. That would be the point - pgbouncer is sharing the 3 connections it keeps with PostgreSQL between the 4 client connections made to it. Mmh, my pool_mode is per session. The 4 sessions were active, not doing any query, but connected to the shell. So that's what my doubt Il giorno gio 17 gen 2019 alle ore 16:32 David G. Johnston < david.g.johns...@gmail.com> ha scritto: > On Thu, Jan 17, 2019 at 8:18 AM Nicola Contu > wrote: > > > > Hello, > > I am a bit confused about the settings in pgbouncer > > > > What's exactly the pool_size? > > Roughly, the number of open connections pgbouncer will keep to PostgreSQL. > > > If I set 3, and I tried to connect from 4 shells, I am still able to > connect. > > That would be the point - pgbouncer is sharing the 3 connections it > keeps with PostgreSQL between the 4 client connections made to it. > > > Same thing for max_db_connections. I set this to 1 and I am able to > connect from 2 shells. > > Same as pool_size but basically a fail-safe since pools are > per-user/per-database while this limit is per-database only. > > > This is kind of confusing and I'm not really cleared reading the > documentation. > > For each setting you need to understand whether it limits > pgbouncer->PostgreSQL or client->pgbouncer > > Configurations in [databases] limit the former; [pgbouncer] options > either provide defaults for the [databases] or limit clients. > > > [databases] > > cmdwela = host=127.0.0.1 port=5432 dbname=cmdwela auth_user=pgbouncer > pool_size=120 > > cicciopasticcio = host=127.0.0.1 port=5432 dbname=cicciopasticcio > auth_user=pgbouncer pool_size=2 > > > > [pgbouncer] > > listen_port = 6543 > > listen_addr = * > > auth_type = md5 > > auth_file = /etc/pgbouncer/users.txt > > auth_query = select uname,phash from user_lookup($1) > > logfile = /var/log/pgbouncer.log > > pidfile = /home/postgres/pgbouncer.pid > > admin_users = admin > > user=postgres > > > max_db_connections=1 > So one open connection max per database/user pool but it is shared - > i.e., actively executing queries running in parallel are limited to > this number. > > > max_client_conn=5 > > I just want to limit connections from the app etc. > > That limit is 5 > > David J. >
Pg_restore
Hello, we are trying to move our postgres 10.5 instance to 11.2. We are trying to restore a DB in a pre-production env, but we get an error from the pg_restore command [root@STAGING]# /usr/local/pgsql11.2/bin/pg_restore -v -U postgres -L /data/cofinder_restore.lst -d cofinder -j 16 /home/ncontu/data/dbbackups/C pg_restore: connecting to database for restore pg_restore: processing item 3098 ENCODING ENCODING pg_restore: processing item 3099 STDSTRINGS STDSTRINGS pg_restore: processing item 3100 SEARCHPATH SEARCHPATH pg_restore: processing item 3101 DATABASE cofinder pg_restore: processing item 3 SCHEMA public pg_restore: creating SCHEMA "public" pg_restore: [archiver (db)] Error while PROCESSING TOC: pg_restore: [archiver (db)] Error from TOC entry 3; 2615 2200 SCHEMA public postgres pg_restore: [archiver (db)] could not execute query: ERROR: schema "public" already exists Command was: CREATE SCHEMA public; The dump is taken with the pgsql10 binaries. Can you help with this please? We also have another DB, and with that one we have no issues doing the same thing. Thanks
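A common workaround, while staying on these versions, is to drop the SCHEMA public entry from the TOC list that the restore already uses via -L; a hedged sketch (the dump path is a placeholder, and the exact wording of that TOC line should be checked with pg_restore -l first):

    /usr/local/pgsql11.2/bin/pg_restore -l /path/to/dump | grep -v 'SCHEMA - public' > /data/cofinder_restore.lst
    /usr/local/pgsql11.2/bin/pg_restore -v -U postgres -L /data/cofinder_restore.lst -d cofinder -j 16 /path/to/dump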
Re: Pg_restore
I guess it depends on this https://www.postgresql.org/message-id/15466-0b90383ff69c6e4b%40postgresql.org Thanks Il giorno ven 1 mar 2019 alle ore 09:17 Nicola Contu ha scritto: > Hello, > we are trying to move our postgres 10.5 instance to 11.2 > > We are trying to restore a DB in a pre-production env but we get an error > on the pg_restore command > > [root@STAGING]# /usr/local/pgsql11.2/bin/pg_restore -v -U postgres -L > /data/cofinder_restore.lst -d cofinder -j 16 > /home/ncontu/data/dbbackups/C > pg_restore: connecting to database for restore > pg_restore: processing item 3098 ENCODING ENCODING > pg_restore: processing item 3099 STDSTRINGS STDSTRINGS > pg_restore: processing item 3100 SEARCHPATH SEARCHPATH > pg_restore: processing item 3101 DATABASE cofinder > pg_restore: processing item 3 SCHEMA public > pg_restore: creating SCHEMA "public" > pg_restore: [archiver (db)] Error while PROCESSING TOC: > pg_restore: [archiver (db)] Error from TOC entry 3; 2615 2200 SCHEMA > public postgres > pg_restore: [archiver (db)] could not execute query: ERROR: schema > "public" already exists > Command was: CREATE SCHEMA public; > > The dump is taken with pgsql10 binaries. > > Can you help on this please? > We also have another DB, but with the other one no issues doing the same > thing. > > Thanks > > >
Performance comparison between Pgsql 10.5 and Pgsql 11.2
Hello, is there any reason why I am getting worse results in writes using pgsql11.2 compared with pgsql 10.6? I have two instances, both just restored, so no bloat. Running read queries I have pretty much the same results, a little bit better on pg11. Running writes, the difference is in favour of 10. I was expecting pg11 to be better. Running pgbench : PG11 [root@STAGING-CMD1 ~]# /usr/local/pgsql11.2/bin/pgbench -t 1000 -c 20 -C -f stress_service_order.sql cmdstaging -U admin transaction type: stress_service_order.sql scaling factor: 1 query mode: simple number of clients: 20 number of threads: 1 number of transactions per client: 1000 number of transactions actually processed: 2/2 latency average = 45.322 ms tps = 441.283336 (including connections establishing) tps = 463.731537 (excluding connections establishing) PG10 [root@STAGING-CMD1 ~]# pgbench -t 1000 -c 20 -C -f stress_service_order.sql cmdstaging -U admin transaction type: stress_service_order.sql scaling factor: 1 query mode: simple number of clients: 20 number of threads: 1 number of transactions per client: 1000 number of transactions actually processed: 2/2 latency average = 44.686 ms tps = 447.565403 (including connections establishing) tps = 470.285561 (excluding connections establishing) This is making a really big difference with longer queries. Here I am updating a field in a random record. With more transactions the difference is bigger. WITH POSTGRES 10 [root@STAGING-CMD1 ~]# pgbench -t 10 -c 20 -C -f stress_service_order_read.sql cmdstaging -U postgres transaction type: stress_service_order_read.sql scaling factor: 1 query mode: simple number of clients: 20 number of threads: 1 number of transactions per client: 10 number of transactions actually processed: 200/200 latency average = 55.291 ms tps = 442.1490778 (including connections establishing) tps = 454.846844 (excluding connections establishing) WITH POSTGRES 11 [root@STAGING-CMD1 ~]# pgbench -t 10 -c 20 -C -f stress_service_order_read.sql cmdstaging -U postgres transaction type: stress_service_order_read.sql scaling factor: 1 query mode: simple number of clients: 20 number of threads: 1 number of transactions per client: 10 number of transactions actually processed: 200/200 latency average = 53.291 ms tps = 375.297748 (including connections establishing) tps = 392.316057 (excluding connections establishing) The postgresql.conf files are the same. max_connections = 220 shared_buffers = 10GB effective_cache_size = 120GB work_mem = 600MB maintenance_work_mem = 2GB min_wal_size = 1GB max_wal_size = 2GB checkpoint_completion_target = 0.7 wal_buffers = 16MB #default_statistics_target = 100 Using data_sync_retry=on doesn't make any difference. Has anything else changed in the default values? Any trick? I don't want to go live and lose performance. Thanks a lot, Nicola
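For context, stress_service_order.sql itself is not shown in the thread; a purely illustrative guess at a single-row write script of that shape (the table and column names come from a log line later in the thread, while the id column and its range are assumptions):

    printf '\\set rid random(1, 1000000)\nUPDATE service_order SET customer_pon = now() WHERE id = :rid;\n' > stress_service_order.sql
    pgbench -t 1000 -c 20 -C -f stress_service_order.sql cmdstaging -U admin

Note that -C opens a new connection for every transaction, so connection setup cost is part of every measurement here.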
Re: Performance comparison between Pgsql 10.5 and Pgsql 11.2
I did a analyze in stages on both. And Yes both are compiled. This is the configure command (change 10.6 for PG10) ./configure --prefix=/usr/local/pgsql11.2 See attached perf report. The difference seems to be all in this line, but not sure : + 26.80% 0.00% 222 postmaster [kernel.kallsyms] [k] system_call_fastpath I am using CentOS 7 With Centos I am using this profile for tuned-adm [root@STAGING-CMD1 ~]# tuned-adm active Current active profile: latency-performance Il giorno sab 2 mar 2019 alle ore 20:41 Thomas Munro ha scritto: > On Sat, Mar 2, 2019 at 5:02 AM Ray O'Donnell wrote: > > On 01/03/2019 15:01, Nicola Contu wrote: > > > Hello, > > > is there any reason why I am getting worse results using pgsql11.2 in > > > writing comparing it with pgsql 10.6? > > > > > > I have two Instances, both just restored, so no bloats. > > > Running read queries I have pretty much same results, a little bit > > > better on pg11- Running writes the difference is in favour of 10. > > > > Did you run ANALYZE on the databases after restoring? > > If you can rule out different query plans, and if you compiled them > both with the same compiler and optimisation levels and without > cassert enabled (it's a long shot but I mentioned that because you > showed a path in /usr/local so perhaps you're hand-compiling 11, but > 10 came from a package?), then the next step might be to use a > profiler like "perf" (or something equivalent on your OS) to figure > out where 11 is spending more time in the write test? > > -- > Thomas Munro > https://enterprisedb.com > PGSQL11 + 26.80% 0.00% 222 postmaster [kernel.kallsyms] [k] system_call_fastpath + 13.46% 0.04% 751 postmaster [kernel.kallsyms] [k] sys_futex + 13.43% 0.00% 782 postmaster [kernel.kallsyms] [k] do_futex + 12.89% 0.00% 0 postmaster [unknown] [k] +9.09% 0.00% 0 swapper [kernel.kallsyms] [k] start_cpu +9.09% 0.14% 2860 swapper [kernel.kallsyms] [k] cpu_startup_entry +9.09% 0.00% 0 swapper [kernel.kallsyms] [k] start_secondary +8.97% 0.00% 389 postmaster [kernel.kallsyms] [k] page_fault +8.97% 0.00% 687 postmaster [kernel.kallsyms] [k] do_page_fault +8.97% 0.73% 6471 postmaster [kernel.kallsyms] [k] __do_page_fault +8.66% 0.14% 1648 postmaster libpthread-2.17.so [.] sem_post@@GLIBC_2.2.5 +8.40% 0.70% 13567 postmaster [kernel.kallsyms] [k] get_futex_key +8.08% 0.00% 0 postmaster [unknown] [k] 0x0080 +8.08% 0.93% 642 postmaster libpthread-2.17.so [.] do_futex_wait.constprop.1 +7.87% 0.19% 2666 postmaster [kernel.kallsyms] [k] futex_wake +7.68% 0.19% 9343 postmaster [kernel.kallsyms] [k] handle_mm_fault +7.34% 0.00% 207 swapper [kernel.kallsyms] [k] arch_cpu_idle +7.34% 0.00% 979 swapper [kernel.kallsyms] [k] cpuidle_idle_call +7.15% 0.00% 2671 postmaster [kernel.kallsyms] [k] handle_pte_fault +5.87% 0.00% 6 postmaster libc-2.17.so [.] __GI___setsid +5.87% 0.00% 9 postmaster [kernel.kallsyms] [k] sys_setsid +5.87% 0.00%15 postmaster [kernel.kallsyms] [k] sched_autogroup_create_attach +5.41% 5.41% 48199 postmaster postgres [.] hash_search_with_hash_value +5.40% 5.40% 44439 postmaster postgres [.] LWLockWaitListLock +5.38% 0.72% 1425 postmaster [kernel.kallsyms] [k] futex_wait +5.07% 0.00% 5 postmaster [kernel.kallsyms] [k] sched_create_group +4.23% 0.59% 2253 postmaster [kernel.kallsyms] [k] __slab_alloc +4.13% 0.00% 2330 postmaster [kernel.kallsyms] [k] kmem_cache_alloc_node_trace +3.75% 0.16% 2919 postmaster [kernel.kallsyms]
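To make the two profiles directly comparable, the same capture can be run against each server during an identical pgbench run; a sketch:

    perf record -a -g -o pg11.perf.data -- sleep 60   # while pgbench runs against 11.2
    perf report --stdio -i pg11.perf.data | head -40
    # repeat with -o pg10.perf.data against 10.6 and compare the top entries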
Re: Performance comparison between Pgsql 10.5 and Pgsql 11.2
Because I have 10.6 in production :) and I am comparing with what I will be loosing. And I read that in the release notes but as said in my first email, even with data_sync_retry=on (going back to previous behavior) doesn't make any difference. So I am looking for something that will keep my performances but still allows me to upgrade to 11 in production. Also, trying with 11.1, the problem seems still there. Il giorno lun 4 mar 2019 alle ore 14:45 Imre Samu ha scritto: > > is there any reason why I am getting worse results using pgsql11.2 in > writing comparing it with pgsql 10.6? > >... And Yes both are compiled. > > Why 10.6? > > according to release notes > "14th February 2019: PostgreSQL 11.2, 10.7, 9.6.12, 9.5.16, and 9.4.21 > Released!" https://www.postgresql.org/about/news/1920/ > imho:* it would be better to compare PG11.2 with PG10.7 *( similar > bug Fixes and Improvements + same fsync() behavior ) > > *"This release changes the behavior in how PostgreSQL interfaces with > fsync() and includes fixes for partitioning and over 70 other bugs that > were reported over the past three months"* > > Imre > > > > Nicola Contu ezt írta (időpont: 2019. márc. 4., > H, 13:14): > >> I did a analyze in stages on both. >> And Yes both are compiled. >> This is the configure command (change 10.6 for PG10) >> >> ./configure --prefix=/usr/local/pgsql11.2 >> >> See attached perf report. The difference seems to be all in this line, >> but not sure : >> >> + 26.80% 0.00% 222 postmaster [kernel.kallsyms] >> [k] system_call_fastpath >> >> >> >> I am using CentOS 7 >> With Centos I am using this profile for tuned-adm >> [root@STAGING-CMD1 ~]# tuned-adm active >> Current active profile: latency-performance >> >> >> Il giorno sab 2 mar 2019 alle ore 20:41 Thomas Munro < >> thomas.mu...@gmail.com> ha scritto: >> >>> On Sat, Mar 2, 2019 at 5:02 AM Ray O'Donnell wrote: >>> > On 01/03/2019 15:01, Nicola Contu wrote: >>> > > Hello, >>> > > is there any reason why I am getting worse results using pgsql11.2 in >>> > > writing comparing it with pgsql 10.6? >>> > > >>> > > I have two Instances, both just restored, so no bloats. >>> > > Running read queries I have pretty much same results, a little bit >>> > > better on pg11- Running writes the difference is in favour of 10. >>> > >>> > Did you run ANALYZE on the databases after restoring? >>> >>> If you can rule out different query plans, and if you compiled them >>> both with the same compiler and optimisation levels and without >>> cassert enabled (it's a long shot but I mentioned that because you >>> showed a path in /usr/local so perhaps you're hand-compiling 11, but >>> 10 came from a package?), then the next step might be to use a >>> profiler like "perf" (or something equivalent on your OS) to figure >>> out where 11 is spending more time in the write test? >>> >>> -- >>> Thomas Munro >>> https://enterprisedb.com >>> >>
Re: Performance comparison between Pgsql 10.5 and Pgsql 11.2
Not sure what you are requesting exactly but here is the strace for the start of the pg_ctl Il giorno lun 4 mar 2019 alle ore 21:55 Thomas Munro ha scritto: > On Tue, Mar 5, 2019 at 5:01 AM Nicola Contu > wrote: > > Attached a part of the strace running the pgbench command for pg11 > > Also attached strace_10 for pg10.6. > > That looks like strace output from pgbench, and I don't see any > interesting differences between v10 and v11 (though I'm surprised to > see it using poll() instead of ppoll(), and opening a new connection > for every transaction). > > How about syscalls on the server side? You could start it with > something like "strace -f path/to/postgres -D path/to/pgdata" (-f for > follow children), and perhaps also use -c so that it shows aggregated > data (up until you ^C it) instead of every syscall? > > -- > Thomas Munro > https://enterprisedb.com > [postgres@STAGING-CMD1 ~]$ strace /usr/local/pgsql11.2/bin/pg_ctl -D /db/pgsql11/data/ -c start execve("/usr/local/pgsql11.2/bin/pg_ctl", ["/usr/local/pgsql11.2/bin/pg_ctl", "-D", "/db/pgsql11/data/", "-c", "start"], [/* 27 vars */]) = 0 brk(NULL) = 0x204b000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6878854000 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) open("/usr/local/pgsql11.2/lib/tls/x86_64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat("/usr/local/pgsql11.2/lib/tls/x86_64", 0x7fff444938d0) = -1 ENOENT (No such file or directory) open("/usr/local/pgsql11.2/lib/tls/libpthread.so.0", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat("/usr/local/pgsql11.2/lib/tls", 0x7fff444938d0) = -1 ENOENT (No such file or directory) open("/usr/local/pgsql11.2/lib/x86_64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat("/usr/local/pgsql11.2/lib/x86_64", 0x7fff444938d0) = -1 ENOENT (No such file or directory) open("/usr/local/pgsql11.2/lib/libpthread.so.0", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) stat("/usr/local/pgsql11.2/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=48564, ...}) = 0 mmap(NULL, 48564, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f6878848000 close(3)= 0 open("/lib64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0m\0\0\0\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=144792, ...}) = 0 mmap(NULL, 2208904, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f6878418000 mprotect(0x7f687842f000, 2093056, PROT_NONE) = 0 mmap(0x7f687862e000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x16000) = 0x7f687862e000 mmap(0x7f687863, 13448, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f687863 close(3)= 0 open("/usr/local/pgsql11.2/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3 read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P%\2\0\0\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=2173512, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6878847000 mmap(NULL, 3981792, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f687804b000 mprotect(0x7f687820e000, 2093056, PROT_NONE) = 0 mmap(0x7f687840d000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1c2000) = 0x7f687840d000 mmap(0x7f6878413000, 16864, 
PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f6878413000 close(3)= 0 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6878845000 arch_prctl(ARCH_SET_FS, 0x7f6878845740) = 0 mprotect(0x7f687840d000, 16384, PROT_READ) = 0 mprotect(0x7f687862e000, 4096, PROT_READ) = 0 mprotect(0x60a000, 4096, PROT_READ) = 0 mprotect(0x7f6878855000, 4096, PROT_READ) = 0 munmap(0x7f6878848000, 48564) = 0 set_tid_address(0x7f6878845a10) = 92441 set_robust_list(0x7f6878845a20, 24) = 0 rt_sigaction(SIGRTMIN, {0x7f687841e7e0, [], SA_RESTORER|SA_SIGINFO, 0x7f68784276d0}, NULL, 8) = 0 rt_sigaction(SIGRT_1, {0x7f687841e870, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x7f68784276d0}, NULL, 8) = 0 rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0 getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0 brk(NULL) = 0x204b000 brk(0
Re: Performance comparison between Pgsql 10.5 and Pgsql 11.2
Here is the strace as requested for pg11.

Thanks

On Tue, 5 Mar 2019 at 17:47, Thomas Munro wrote:

> On Wed, Mar 6, 2019 at 4:22 AM Nicola Contu wrote:
> >
> > Not sure what you are requesting exactly but here is the strace for the
> > start of the pg_ctl
>
> I meant that you could run the server itself in the foreground under
> strace, like so:
>
> $ strace -f -c /usr/local/pgsql11.2/bin/postgres -D /db/pgsql11/data/
>
> Then perform your testing, and finally stop it with pg_ctl from
> another window (or hit ^C in this window) and strace should spit out a
> table of system calls with some counters. We might be able to see why
> v11 is spending so much more time executing system calls than v10 for
> your workload, or at least which system calls they are, assuming you
> run the same transactions against both versions.
>
> --
> Thomas Munro
> https://enterprisedb.com

[postgres@STAGING-CMD1 ~]$ strace -f -c /usr/local/pgsql11.2/bin/postgres -D /db/pgsql11/data/
2019-03-06 10:07:47 GMT [] [163892]: [1-1] db=,user= LOG: listening on IPv4 address "0.0.0.0", port 5432
2019-03-06 10:07:47 GMT [] [163892]: [2-1] db=,user= LOG: listening on IPv6 address "::", port 5432
2019-03-06 10:07:47 GMT [] [163892]: [3-1] db=,user= LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
strace: Process 163894 attached
2019-03-06 10:07:48 GMT [] [163894]: [1-1] db=,user= LOG: database system was shut down at 2019-03-06 10:06:22 GMT
strace: Process 163895 attached
strace: Process 163896 attached
strace: Process 163897 attached
strace: Process 163898 attached
strace: Process 163899 attached
strace: Process 163900 attached
2019-03-06 10:07:48 GMT [] [163892]: [4-1] db=,user= LOG: database system is ready to accept connections
strace: Process 163908 attached
strace: Process 163915 attached
strace: Process 163928 attached
strace: Process 163931 attached
strace: Process 163943 attached
strace: Process 163960 attached
strace: Process 163976 attached
2019-03-06 10:09:08 GMT [] [163895]: [1-1] db=,user= LOG: checkpoint starting: xlog
strace: Process 163989 attached
strace: Process 164002 attached
2019-03-06 10:09:42 GMT [] [163895]: [2-1] db=,user= LOG: checkpoint complete: wrote 201350 buffers (15.4%); 0 WAL file(s) added, 0 removed, 0 recycled; write=33.879 s, sync=0.208 s, total=34.093 s; sync files=30, longest=0.142 s, average=0.006 s; distance=1228942 kB, estimate=1228942 kB
strace: Process 164015 attached
2019-03-06 10:09:58 GMT [] [163895]: [3-1] db=,user= LOG: checkpoint starting: xlog
strace: Process 164035 attached
strace: Process 164050 attached
2019-03-06 10:10:26 GMT [] [163895]: [4-1] db=,user= LOG: checkpoint complete: wrote 126477 buffers (9.6%); 0 WAL file(s) added, 0 removed, 0 recycled; write=27.650 s, sync=0.015 s, total=27.668 s; sync files=30, longest=0.004 s, average=0.000 s; distance=1226813 kB, estimate=1228729 kB
strace: Process 164066 attached
2019-03-06 10:10:43 GMT [] [163895]: [5-1] db=,user= LOG: checkpoint starting: xlog
strace: Process 164079 attached
strace: Process 164102 attached
2019-03-06 10:11:08 GMT [] [163895]: [6-1] db=,user= LOG: checkpoint complete: wrote 125741 buffers (9.6%); 1 WAL file(s) added, 0 removed, 0 recycled; write=24.804 s, sync=0.077 s, total=24.947 s; sync files=30, longest=0.015 s, average=0.002 s; distance=1229772 kB, estimate=1229772 kB
strace: Process 164223 attached
2019-03-06 10:11:22 GMT [] [163895]: [7-1] db=,user= LOG: checkpoint starting: xlog
strace: Process 164347 attached
2019-03-06 10:11:46 GMT [] [163895]: [8-1] db=,user= LOG: checkpoint complete: wrote 133784 buffers (10.2%); 1 WAL file(s) added, 0 removed, 0 recycled; write=23.399 s, sync=0.141 s, total=23.595 s; sync files=29, longest=0.027 s, average=0.004 s; distance=1227832 kB, estimate=1229578 kB
strace: Process 164460 attached
2019-03-06 10:11:52 GMT [[local]] [163931]: [1-1] db=cmdstaging,user=postgres LOG: duration: 209593.079 ms statement: update service_order set customer_pon = now();
strace: Process 164575 attached
strace: Process 164694 attached
2019-03-06 10:12:28 GMT [] [163928]: [1-1] db=,user= LOG: automatic vacuum of table "cmdstaging.public.service_order": index scans: 1
pages: 0 removed, 171162 remain, 413 skipped due to pins, 0 skipped frozen
tuples: 123264 removed, 1300328 remain, 0 are dead but not yet removable, oldest xmin: 1172166
buffer usage: 890073 hits, 250961 misses, 270986 dirtied
avg read rate: 7.826 MB/s, avg write rate: 8.451 MB/s
system usage: CPU: user: 12.96 s, system: 10.24 s, elapsed: 250.52 s
^Cstrace: Process 163892 detached
2019-03-06 10:12:29 GMT [] [163892]: [5-1] db=,user= LOG: received fast shutdown request
strace: Process 163895 detached
strace: Process 163896 detached
strace: Process 163897 detached
strace: Process 163898 detached
strace: Process
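For reference, while a long statement like the UPDATE in the log above is running, another session can show what the backend is actually waiting on. This is only a sketch using the standard pg_stat_activity columns available in v10 and v11; if v11 spends its extra time in a particular wait event rather than on CPU, it should show up here alongside the strace numbers:

SELECT pid, state, wait_event_type, wait_event,
       now() - query_start AS runtime,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC;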
Re: Performance comparison between Pgsql 10.5 and Pgsql 11.2
This is the strace of another server running the same compiled version, but that one is even slower.

On Wed, 6 Mar 2019 at 11:14, Nicola Contu wrote:

> Here is the strace as requested for pg11
>
> Thanks
>
> On Tue, 5 Mar 2019 at 17:47, Thomas Munro <thomas.mu...@gmail.com> wrote:
>
>> On Wed, Mar 6, 2019 at 4:22 AM Nicola Contu wrote:
>> >
>> > Not sure what you are requesting exactly but here is the strace for the
>> > start of the pg_ctl
>>
>> I meant that you could run the server itself in the foreground under
>> strace, like so:
>>
>> $ strace -f -c /usr/local/pgsql11.2/bin/postgres -D /db/pgsql11/data/
>>
>> Then perform your testing, and finally stop it with pg_ctl from
>> another window (or hit ^C in this window) and strace should spit out a
>> table of system calls with some counters. We might be able to see why
>> v11 is spending so much more time executing system calls than v10 for
>> your workload, or at least which system calls they are, assuming you
>> run the same transactions against both versions.
>>
>> --
>> Thomas Munro
>> https://enterprisedb.com

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
2019-03-06 11:04:56 GMT [127.0.0.1(46276)] [14799]: [2-1] db=cmdv3preprod,user=cmdpreprodapp FATAL: terminating connection due to administrator command
 32.50  143.010306           7  21044095           lseek
 26.21  115.354045          14   8144577           read
  6.18   27.185578          16   1669889        10 sendto
  5.29   23.300584          57    407528           fdatasync
  4.93   21.709522           9   2313529    824174 recvfrom
  3.31   14.547568          19    765897           write
  2.73   12.007486          14    867088     14494 epoll_wait
  2.18    9.597460          15    659871     84097 futex
  1.85    8.147759          14    567414           close
  1.77    7.767832          18    437656     11319 open
  1.53    6.749496         161     42009           wait4
  1.39    6.110869         226     26994           setsid
  1.31    5.760971          15    396402           brk
  1.16    5.083198          29    175022           munmap
  0.90    3.954892          18    224829           epoll_ctl
  0.89    3.928084          17    227227         1 mmap
  0.77    3.376280          11    298018           rt_sigaction
  0.63    2.778183          16    178446           kill
  0.63    2.763563          17    159928           rt_sigprocmask
  0.52    2.306854          12    190988           fstat
  0.42    1.850528         128     14468           sync_file_range
  0.35    1.534351          52     29591     18762 select
  0.35    1.532532          16     96298           epoll_create1
  0.32    1.415160        2331       607           fsync
  0.29    1.263220          21     61605     35503 stat
  0.28    1.227967         125      9827        28 fallocate
  0.27    1.195613           8    153557     38629 rt_sigreturn
  0.26    1.123353         112     10013       173 unlink
  0.24    1.060130          20     52112           mprotect
  0.11    0.487515          18     26994           clone
  0.10    0.458161          11     41080           fadvise64
  0.10    0.440217          16     26991         1 access
  0.09    0.406644           4    109169           fcntl
  0.04    0.188007         173      1088           rename
  0.04    0.181400          18      9829           ftruncate
  0.03    0.119387           4     26995           pipe
  0.01    0.036748           1     26995           set_robust_list
  0.01    0.023510          24      1000           dup
  0.00    0.019099          13      1425           setitimer
  0.00    0.017918          18       981           setsockopt
  0.00    0.015479          26       595           accept
  0.00    0.010666          18       602           getsockname
  0.00    0.010172          50       202           getdents
  0.00    0.004645          28       165           openat
  0.00    0.003369          73        46           link
  0.00    0.002171          43        50           mremap
  0.00    0.000213          71         3         2 mkdir
  0.00    0.000189          32         6           getcwd
  0.00    0.000179          30         6           bind
  0.00    0.000153          10        15           getrusage
  0.00    0.000152          30         5           chdir
  0.00    0.000133          12        11           socket
  0.00    0.000119          24         5           lstat
  0.00    0.000102          13         8         2 connect
  0.00    0.000083          83         1           rmdir
  0.00    0.000070          23         3           geteuid
  0.00    0.000068          23         3           listen
  0.00    0.000056          28         2
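As a rough server-side counterpart to the fdatasync/sync_file_range entries in the table above, the checkpoint and background-writer counters can be sampled before and after a test run. This is a sketch only; it will not explain the lseek() numbers, which come from the backends themselves:

SELECT checkpoints_timed, checkpoints_req,
       checkpoint_write_time, checkpoint_sync_time,
       buffers_checkpoint, buffers_clean, buffers_backend,
       buffers_backend_fsync
FROM pg_stat_bgwriter;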
Re: Performance comparison between Pgsql 10.5 and Pgsql 11.2
So, the first file is from PostgreSQL 11.2 on a test server (the one where I compare 10 vs 11).
The second file is from our preprod machine, also running PostgreSQL 11.2 (different hardware etc., it is a VM). I know that can be confusing, but I wanted to compare that too, because if you look at the two files there is a lot of difference between the two machines. They are both running CentOS 7.

So at this point I have two problems: one on the test machine between Postgres 10 and 11, and another on the preprod machine (similar to prod) with a lot of lseek(). Sorry if this was confusing, I hope it is clear now.

Regarding partitions/tables: the first file involves just one table. The second file (the one with the huge lseek count) was also running the test against a single table, but meanwhile the instance was accessible by the preprod web application, so it may have been hit by some user and some other table.

Questions:
1) Is it possible that pgbench is simply not a good tool for testing this? If I use an SQL script with thousands of INSERTs and compare on the same server between pg10 and pg11, I get pretty much the same result (maybe better on pg11).
2) Regarding preprod, is there any way to reduce those lseek() calls? Just to let you know, running the same insert script on both servers, the first server takes 2m while the second one takes 5-7m.

Thanks a lot,

On Thu, 7 Mar 2019 at 04:47, Thomas Munro wrote:

> On Thu, Mar 7, 2019 at 12:12 AM Nicola Contu wrote:
> > This is instead the strace of another server running the same version
> > compiled but that is even slower.
>
> Huh. That's a lot of lseek(). Some of these will be for random
> reads/writes and will go away in v12, and some will be for probing the
> size of relations while planning, and some while executing scans. I
> bet you could make some of them go away by using prepared statements.
> Does the query in your test involve many partitions/tables?
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  32.50  143.010306           7  21044095           lseek
>  26.21  115.354045          14   8144577           read
>   6.18   27.185578          16   1669889        10 sendto
>   5.29   23.300584          57    407528           fdatasync
>   4.93   21.709522           9   2313529    824174 recvfrom
>   3.31   14.547568          19    765897           write
>   2.73   12.007486          14    867088     14494 epoll_wait
>   2.18    9.597460          15    659871     84097 futex
>   1.85    8.147759          14    567414           close
>   1.77    7.767832          18    437656     11319 open
>
> The other results had 1 usec lseek(), and much fewer of them relative
> to the number of reads and writes. BTW, are you comparing v10 and v11
> on the same hardware, kernel, filesystem? Just wondering if there
> could be some change in syscall overhead on different kernel patch
> levels or something like that: we see 7 usec vs 1 usec in those two
> files (though I have no idea how reliable these times are) and if
> we're going to call it 21 million times at some point it might
> matter...
>
> --
> Thomas Munro
> https://enterprisedb.com
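On the prepared-statement suggestion quoted above: the point is to plan once and execute many times, so the planner does not have to size up the relations on every call. A minimal sketch; the table and the id column here are only placeholders for the real test query:

PREPARE get_order (int) AS
    SELECT * FROM service_order WHERE id = $1;  -- placeholder key column

EXECUTE get_order(42);
EXECUTE get_order(43);

DEALLOCATE get_order;

pgbench can exercise the same path with -M prepared, which may be a fairer v10 vs v11 comparison than simple-protocol runs.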
Re: Performance comparison between Pgsql 10.5 and Pgsql 11.2
Hello, do you have any advice on this? Thanks a lot in advance Il giorno gio 7 mar 2019 alle ore 09:39 Nicola Contu ha scritto: > So the first file is on Postgres11.2 on a test server (and where I compare > 10 vs 11) > The second file, is our preprod machine running Postgres 11.2 (different > hardware etc, it is a VM). I know that could be confusing, but I just > wanted to compare that too because if you see the two files there's a lot > of difference between the two machines. > And they are both running CentOS 7. > > So at this point I have two problems. One inside the machine between > Postgres 10 and 11 and another problem on the preprod (similar to prod) > with a lot of lseek. > > Sorry if this is confusing, hope it is clear now. > > Regarding partitions/tables. The first file involves just one table. The > second file (with a huge lseek) was running the test on a single table, but > meanwhile it was accessible by the preprod web application. So it was maybe > hit by some user and some other table. > > > Question: > 1) Is it possible that pgbench could not be really a good tool for testing > the performances? If I use a sql script of thousands of insert records and > compare on the same server between pg10 and pg11 I get pretty much the same > result (maybe better on pg11) > 2) regarding preprod, is there any way to reduce those lseek()? Just to > let you know, comparing the same insert script between the first server, > the first server takes 2m the second one takes 5-7m. > > Thanks a lot, > > > > > > Il giorno gio 7 mar 2019 alle ore 04:47 Thomas Munro < > thomas.mu...@gmail.com> ha scritto: > >> On Thu, Mar 7, 2019 at 12:12 AM Nicola Contu >> wrote: >> > This is instead the strace of another server running the same version >> compiled but that is even slower. >> >> Huh. That's a lot of lseek(). Some of these will be for random >> reads/writes and will go way in v12, and some will be for probing the >> size of relations while planning, and some while executing scans. I >> bet you could make some of them go away by using prepared statements. >> Does the query in your test involve many partitions/tables? >> >> % time seconds usecs/call callserrors syscall >> -- --- --- - - >> 32.50 143.010306 7 21044095 lseek >> 26.21 115.354045 14 8144577 read >> 6.18 27.185578 16 166988910 sendto >> 5.29 23.300584 57407528 fdatasync >> 4.93 21.709522 9 2313529824174 recvfrom >> 3.31 14.547568 19765897 write >> 2.73 12.007486 14867088 14494 epoll_wait >> 2.189.597460 15659871 84097 futex >> 1.858.147759 14567414 close >> 1.777.767832 18437656 11319 open >> >> The other results had 1 usec lseek(), and much fewer of them relative >> to the number of reads and writes. BTW, are you comparing v10 and v11 >> on the same hardware, kernel, filesystem? Just wondering if there >> could be some change in syscall overhead on different kernel patch >> levels or something like that: we see 7 usec vs 1 usec in those two >> files (though I have no idea how reliable these times are) and if >> we're going to call it 21 million times at some point it might >> matter... >> >> -- >> Thomas Munro >> https://enterprisedb.com >> >
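On Thomas's question about how many partitions/tables are involved: a quick way to count child tables per parent is the pg_inherits catalog, which covers both inheritance children and declarative partitions. A sketch:

SELECT inhparent::regclass AS parent, count(*) AS children
FROM pg_inherits
GROUP BY 1
ORDER BY 2 DESC;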
pg_replication_slots
Hello,
I just have a few questions on replication slots:

- is it possible to get the size of a slot?
- if the slave is down and the table grows, will the table be flushed after the WALs are pushed once the slave comes back up?
- will slots impact performance on the master?

I'm just worried about the size.

Thank you
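On the first question: there is no single "size" of a slot as such, but the amount of WAL a slot is holding back can be computed from its restart_lsn. A sketch for version 10 and later (older releases use different function and column names):

SELECT slot_name, slot_type, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;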
Centos 6.9 and centos 7
Hello,
we recently upgraded the OS, moving from centos 6.9 to a new server with centos 7. The centos 6.9 server has become the preproduction server now.
We are running postgres 9.6.6 on both servers. They are both on SSD disks; these are the only differences:

- the DB partition on centos 7 is on a RAID 10
- the file system is xfs on centos 7 (ext4 on centos 6.9)
- more memory on the centos 7 server (so params in postgresql.conf are higher)

max_connections = 220
shared_buffers = 10GB
effective_cache_size = 120GB
work_mem = 349525kB
maintenance_work_mem = 2GB
min_wal_size = 1GB
max_wal_size = 2GB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
default_statistics_target = 100

- we have two replicas on the centos 7 server, one async and one sync

synchronous_standby_names = '1 ( "" )'
synchronous_commit = on

They have the same db inside, with the same data. Running the same script on the two servers gives different results; even a simple SELECT is faster on the centos 6.9 server, taking roughly half the time compared to centos 7:

centos 7:

dbname=# \timing
Timing is on.
cmdv3=# SELECT id FROM client_billing_account WHERE name = 'name';
 id
---
 *
(1 row)
Time: 3.884 ms

centos 6.9:

dbname=# SELECT id FROM client_billing_account WHERE name = 'name';
 id
---
 *
(1 row)
Time: 1.620 ms

This table has 32148 records.

Do you think we can modify anything to achieve the same performance? I read about a few kernel params:

kernel.sched_migration_cost_ns = 500
kernel.sched_autogroup_enabled = 0
vm.dirty_background_bytes = 67108864
vm.dirty_bytes = 1073741824
vm.zone_reclaim_mode = 0
vm.swappiness = 1

Is there anything you can advise to solve or identify the problem?

Thanks a lot,
Nicola
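For what it's worth, the raw \timing numbers hide where the time goes. Running the statement under EXPLAIN (ANALYZE, BUFFERS) on both servers separates planning time from execution time and shows whether the pages come from shared buffers or the OS; a sketch with the query from above:

EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM client_billing_account WHERE name = 'name';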
Re: Centos 6.9 and centos 7
These are the timings on centos 7:

Time: 4.248 ms
Time: 2.983 ms
Time: 3.027 ms
Time: 3.298 ms
Time: 4.420 ms
Time: 2.599 ms
Time: 2.555 ms
Time: 3.008 ms
Time: 6.220 ms
Time: 4.275 ms
Time: 2.841 ms
Time: 3.699 ms
Time: 3.387 ms

These are the timings on centos 6:

Time: 1.722 ms
Time: 1.670 ms
Time: 1.843 ms
Time: 1.823 ms
Time: 1.723 ms
Time: 1.724 ms
Time: 1.747 ms
Time: 1.734 ms
Time: 1.764 ms
Time: 1.622 ms

This is top on centos 6:

[root@]# top
top - 14:33:32 up 577 days, 23:08, 1 user, load average: 0.16, 0.11, 0.15
Tasks: 1119 total, 1 running, 1118 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.1%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 132040132k total, 129530504k used, 2509628k free, 108084k buffers
Swap: 11665404k total, 331404k used, 11334000k free, 124508916k cached

This is top on centos 7:

top - 14:35:38 up 73 days, 19:00, 6 users, load average: 22.46, 20.89, 20.54
Tasks: 821 total, 13 running, 807 sleeping, 0 stopped, 1 zombie
%Cpu(s): 14.2 us, 5.0 sy, 0.0 ni, 77.5 id, 3.1 wa, 0.0 hi, 0.2 si, 0.0 st
KiB Mem : 26383592+total, 4301464 free, 6250384 used, 25328406+buff/cache
KiB Swap: 16777212 total, 11798876 free, 4978336 used. 24497036+avail Mem

The production machine is obviously accessed much more. But that does not seem to be the problem: running the same query on the replica of the production machine (same config as the master, but not accessed by anyone) gives the same bad result:
Time: 6.366 ms

2017-12-04 15:19 GMT+01:00 Chris Mair:

>> centos 7 :
>> Time: 3.884 ms
>>
>> centos 6.9
>> Time: 1.620 ms
>>
>> Is there anything you can advise to solve or identify the problem?
>
> Can you run this query 10 times on each server and note the timings?
>
> I'd like to see the reproducibility of this.
>
> Also: both machines are otherwise idle (check with top or uptime)?
>
> Bye,
> Chris.
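One way to take the client and the network out of these measurements is to repeat the lookup server-side and time the whole loop. A rough PL/pgSQL sketch, assuming the id column is an integer; the table and filter are the ones from the earlier message:

DO $$
DECLARE
    t0 timestamptz := clock_timestamp();
    v  int;
BEGIN
    FOR i IN 1..1000 LOOP
        SELECT id INTO v FROM client_billing_account WHERE name = 'name';
    END LOOP;
    RAISE NOTICE 'total: %, per call: %',
        clock_timestamp() - t0,
        (clock_timestamp() - t0) / 1000;
END
$$;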
Re: Centos 6.9 and centos 7
To make a better testing, I used a third server. This is identical to the centos 7 machine, and it is not included in the replica cluster. Nobody is accessing this machine, this is top : top - 14:48:36 up 73 days, 17:39, 3 users, load average: 0.00, 0.01, 0.05 Tasks: 686 total, 1 running, 685 sleeping, 0 stopped, 0 zombie %Cpu(s): 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st KiB Mem : 26383592+total, 1782196 free, 2731144 used, 25932257+buff/cache KiB Swap: 16777212 total, 16298536 free, 478676 used. 21693456+avail Mem These are timings : Time: 2.841 ms Time: 1.980 ms Time: 2.240 ms Time: 2.947 ms Time: 2.828 ms Time: 2.227 ms Time: 1.998 ms Time: 1.990 ms Time: 2.643 ms Time: 2.143 ms Time: 2.919 ms Time: 2.246 ms I never got same results of the centos 6.9 machine. 2017-12-04 15:40 GMT+01:00 Nicola Contu : > These are the timings in centos 7 : > > Time: 4.248 ms > Time: 2.983 ms > Time: 3.027 ms > Time: 3.298 ms > Time: 4.420 ms > Time: 2.599 ms > Time: 2.555 ms > Time: 3.008 ms > Time: 6.220 ms > Time: 4.275 ms > Time: 2.841 ms > Time: 3.699 ms > Time: 3.387 ms > > > These are the timings in centos 6: > Time: 1.722 ms > Time: 1.670 ms > Time: 1.843 ms > Time: 1.823 ms > Time: 1.723 ms > Time: 1.724 ms > Time: 1.747 ms > Time: 1.734 ms > Time: 1.764 ms > Time: 1.622 ms > > > This is top on centos 6 : > > [root@]# top > top - 14:33:32 up 577 days, 23:08, 1 user, load average: 0.16, 0.11, 0.15 > Tasks: 1119 total, 1 running, 1118 sleeping, 0 stopped, 0 zombie > Cpu(s): 0.0%us, 0.1%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, > 0.0%st > Mem: 132040132k total, 129530504k used, 2509628k free, 108084k buffers > Swap: 11665404k total, 331404k used, 11334000k free, 124508916k cached > > This is top on centos 7: > > top - 14:35:38 up 73 days, 19:00, 6 users, load average: 22.46, 20.89, > 20.54 > Tasks: 821 total, 13 running, 807 sleeping, 0 stopped, 1 zombie > %Cpu(s): 14.2 us, 5.0 sy, 0.0 ni, 77.5 id, 3.1 wa, 0.0 hi, 0.2 si, > 0.0 st > KiB Mem : 26383592+total, 4301464 free, 6250384 used, 25328406+buff/cache > KiB Swap: 16777212 total, 11798876 free, 4978336 used. 24497036+avail Mem > > > The production machine is obviously more accessed. But that does not seem > to be the problem as running the same query on the replica of the > production machine (same config of the master but not accessed by anyone) > gives the same bad result: > Time: 6.366 ms > > > 2017-12-04 15:19 GMT+01:00 Chris Mair : > >> centos 7 : >>> Time: 3.884 ms >>> >>> centos 6.9 >>> >> Time: 1.620 ms >> >>> >>> >>> Is there anything you can advice to solve or identify the problem? >>> >> >> Can you run this query 10 times on each server and note the timings? >> >> I'd like to see the reproducability of this. >> >> Also: both machines are otherwise idle (check with top or uptime)? >> >> Bye, >> Chris. >> >> >> >
Re: Centos 6.9 and centos 7
No, I did not run a VACUUM ANALYZE. Do you want me to try that first?

@Tomas: talking about power management, I changed the tuned-adm profile to latency-performance instead of balanced (which is the default). That is improving performance for now, and the timings are similar to centos 6.9:

Time: 2.121 ms
Time: 2.026 ms
Time: 1.664 ms
Time: 1.749 ms
Time: 1.656 ms
Time: 1.675 ms

Do you think this can easily be done in production as well?

2017-12-04 16:37 GMT+01:00 Alban Hertroys:

> Did you run ANALYZE on your tables before the test?
>
> On 4 December 2017 at 16:01, Tomas Vondra wrote:
> >
> > On 12/04/2017 02:19 PM, Nicola Contu wrote:
> > ...
> >> centos 7 :
> >>
> >> dbname=# \timing Timing is on. cmdv3=# SELECT id FROM
> >> client_billing_account WHERE name = 'name'; id --- * (1 row)
> >> Time: 3.884 ms
> >>
> >> centos 6.9
> >>
> >> dbname=# SELECT id FROM client_billing_account WHERE name = 'name'; id
> >> --- * (1 row) Time: 1.620 ms
> >>
> >
> > We need to see EXPLAIN (ANALYZE,BUFFERS) for the queries.
> >
> > Are those VMs or bare metal? What CPUs and RAM are there? Have you
> > checked that power management is disabled / cpufreq uses the same
> > policy? That typically affects short CPU-bound queries.
> >
> > Other than that, I recommend performing basic system benchmarks (CPU,
> > memory, ...) and only if those machines perform equally should you look
> > for issues in PostgreSQL. Chances are the root cause is in hw or OS, in
> > which case you need to address that first.
> >
> > regards
> >
> > --
> > Tomas Vondra                  http://www.2ndQuadrant.com
> > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
> --
> If you can't see the forest for the trees,
> Cut the trees and you'll see there is no forest.
pgstattuple free_percent to high
Hello,
we are running postgres 9.6.6 on centos 7.

We have a large DB (180GB) with about 1200 tables.

We have autovacuum set with default values, and we are seeing that for some tables the free percent goes really high (51%), so we need to run a full vacuum on those tables daily.

dbname=# SELECT * FROM pgstattuple('tablename');
 table_len  | tuple_count | tuple_len  | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent
------------+-------------+------------+---------------+------------------+----------------+--------------------+------------+--------------
 2119548928 |      526658 | 1023569149 |         48.29 |                0 |              0 |                  0 | 1083485292 |        51.12
(1 row)

I guess this is because of long queries, but I'm not really sure. Do you know what can cause this and how to avoid it?

Do you think that increasing the autovacuum settings for those tables would alleviate the issue?

Thanks,
Nicola
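On the last question: autovacuum can be made more aggressive for individual tables through storage parameters, without touching the global configuration. This is a sketch only, the values are illustrative and the right ones depend on the workload; note also that a plain (auto)vacuum makes dead space reusable but does not shrink free_percent the way VACUUM FULL does:

ALTER TABLE tablename SET (
    autovacuum_vacuum_scale_factor = 0.01,  -- trigger after ~1% of rows change instead of the 20% default
    autovacuum_vacuum_cost_limit   = 2000   -- let each autovacuum pass do more work before sleeping
);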
Re: pgstattuple free_percent to high
Hi Rene,
thanks for your reply.

I think tuning the autovacuum settings may improve performance and remove dead tuples, but as far as I know autovacuum runs a VACUUM ANALYZE, and a VACUUM ANALYZE won't touch the free_percent of the table.

So I'm trying to find a way to adjust the free percent for some tables without doing a manual full vacuum. We are now monitoring the free percent, so we may find the part of the code that increases that value, but I was wondering if there is anything on the postgres side to resolve this problem.

Thanks,
Nicola

2017-12-14 0:16 GMT+01:00 Rene Romero Benavides:

> Check for long running transactions modifying (update, insert) on those
> tables, using pg_stat_activity.
>
> Tweak these storage parameters for such tables:
> autovacuum_vacuum_cost_delay : decrease it (the autovacuum daemon goes to sleep less often)
> autovacuum_vacuum_threshold : decrease it (to trigger more frequent autovacuum activations)
> autovacuum_vacuum_cost_limit : increase it (to allow the autovacuum daemon to work for longer periods)
> autovacuum_vacuum_scale_factor : decrease it (to trigger more autovacuum activations when this percentage of a table has been modified)
>
> For example I've set these parameters for one table experiencing long
> running transactions, and for its access patterns they have worked:
>
> autovacuum_vacuum_cost_delay=5, autovacuum_vacuum_threshold=50,
> autovacuum_vacuum_cost_limit=3000, autovacuum_vacuum_scale_factor=0.01
> but these settings are very particular for each usage pattern.
>
> Take into account that more activity from autovacuum means more IO and more
> CPU usage; you might also benefit from setting autovacuum_work_mem to a
> higher value if the available RAM allows it, to give more RAM to the
> autovacuum daemon.
>
> 2017-12-13 9:49 GMT-06:00 Nicola Contu:
>
>> Hello,
>> We are running postgres 9.6.6 on centos 7.
>>
>> We have a large DB (180GB) with about 1200 tables.
>>
>> We have autovacuum set with default values and we are seeing that for
>> some tables the free percent goes really high (51%) and we need to daily
>> full vacuum those tables.
>>
>> dbname=# SELECT * FROM pgstattuple('tablename');
>>  table_len  | tuple_count | tuple_len  | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent
>> ------------+-------------+------------+---------------+------------------+----------------+--------------------+------------+--------------
>>  2119548928 |      526658 | 1023569149 |         48.29 |                0 |              0 |                  0 | 1083485292 |        51.12
>> (1 row)
>>
>> I guess this is because of long queries but I'm not really sure.
>> Do you know how to avoid this problem and what can cause it?
>>
>> Do you think that increasing the autovacuum settings for those tables
>> would alleviate the issue?
>>
>> Thanks,
>> Nicola
>
> --
> "Genius is 1% inspiration and 99% perspiration." -- Thomas Alva Edison
> http://pglearn.blogspot.mx/
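Since the free percent is being monitored anyway, a cheaper way to watch many tables than running pgstattuple() on each is the approximate variant shipped in the same extension (available since 9.5). The numbers are estimates; this sketch simply lists the worst offenders in the public schema:

SELECT c.oid::regclass AS table_name,
       (pgstattuple_approx(c.oid)).approx_free_percent AS free_pct
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r' AND n.nspname = 'public'
ORDER BY free_pct DESC
LIMIT 20;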
Re: pgstattuple free_percent to high
Hello, anyone can help on this matter, Alessandro highlighted? Thanks a lot, Nicola 2017-12-16 13:40 GMT+01:00 Alessandro Aste : > Stephen, Rene - Thanks! > > Our experience teach us that above 20% of free space performance start to > seriously deteriorate. I'm not sure if this is related to index or table > fragmentation. We'll do our homework and we'll try to discover more. > > However we have identified a process potentially causing the free space > spike high but it's not related to long running transactions modifying > (update, insert) on those tables. What DEV is currently doing is create a > sort of de-normalized cache table like this: > * tablename is the table with the free space issue > > INSERT INTO cache_table > SELECT *, table2., ...table.col FROM tablename > JOIN table2 on ... > ... > .. > JOIN table on > > So no updates/inserts/deletes to tablename are involved but that query may > run for 15 (this is the AVG more or less) minutes holding an ACCESS SHARE > lock (I guess). > > We are wondering why we have that spike if we are not modifying the > relation and what we can suggest DEV from the query/db perspective to > alleviate/fix the issue. > > Again, thanks so much. > > > > Thanks and kind regards > Aste - alessandro.a...@gtt.net > > -Original Message- > From: Stephen Frost [mailto:sfr...@snowman.net] > Sent: 14 December 2017 16:13 > To: Nicola Contu > Cc: Rene Romero Benavides ; pgsql-general@lists. > postgresql.org; Alessandro Aste > Subject: Re: pgstattuple free_percent to high > > Greetings Nicola, > > * Nicola Contu (nicola.co...@gmail.com) wrote: > > I think tuning the autovacuum settings may increase performances and > > remove dead_tuples but as far as I know, the autovacuum runs a vacuum > analyze. > > The vacuum analyze won't touch the free_percent of the table. > > That's not entirely accurate. If all of the free space is at the *end* of > the relation then autovacuum will attempt to lock the relation and truncate > the table to give that free space back to the OS. > > On a table where all of the rows are regularly updated, eventually the > "live" data should end up towards the front of the relation and the end of > the relation will be all dead tuples, allowing the truncate to happen. If > you have tuples at the end of the relation that aren't ever updated but > they're "live" then we won't be able to truncate. > > The pg_freespacemap extension can be useful to see where the free space is > in the relation. > > There are a few tools out there that aren't part of core PostgreSQL that > you could consider using such as pg_repack and pg_squeeze. > > > So I'm trying to find a way to adjust the free percent for some tables > > without doing a manually full vacuum. > > We are now monitoring the free percent, so we may find the part of the > > code that can increase that value, but was wondering if there is > > anything on the postgres side to resolve this problem. > > Having some free space in the relation isn't a 'problem' and is a good > thing because it means that new rows (from either INSERTs or UPDATEs) have > a place to go that doesn't require extending the relation (which requires > an additional lock as well as some additional work). As for how much free > space is good to have and how much is too much depends on the specific > workload. > > Thanks! > > Stephen >
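To see whether the free space sits at the end of the relation (where autovacuum may be able to truncate it) or is scattered throughout, the pg_freespacemap extension Stephen mentions can be queried per block. A sketch, using the same placeholder table name as earlier in the thread:

CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

-- free space recorded in the FSM per block, highest block numbers (end of the table) first
SELECT blkno, avail
FROM pg_freespace('tablename')
ORDER BY blkno DESC
LIMIT 20;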