[GENERAL] Weirdness with the stats collector process

2016-07-25 Thread Matthew Musgrove
One of our instances has been behaving -- oddly. Most queries are blazing fast. 
It appears to just be some of the stat views that are slow.

Queries against the following views are quick: pg_stat_activity, 
pg_stat_xact_all_tables, pg_stat_xact_sys_tables, pg_stat_xact_user_tables, 
pg_statio_sys_sequences (no system sequences), pg_stat_xact_user_functions, and 
pg_stat_replication.

When I query one of the other statistics views, the queries take roughly 10 
seconds when they used to take milliseconds. Here's an explain of one such 
query: https://explain.depesz.com/s/IVEu

Logs show 'using stale statistics instead of current ones because stats 
collector is not responding' when I query the stats views.

CPU usage is very low. Disk I/O looks okay to me.

-bash-4.1$ iostat
Linux 2.6.32-573.12.1.el6.x86_64 (hostname) 07/25/2016  _x86_64_
(16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   2.800.000.150.010.00   97.04

Device:tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda   0.45 4.85 7.45   33142948   50914838
sdb  31.48  1403.57   713.55 9597023154 4878965216
dm-0  0.14 0.58 0.5639497203847592
dm-1  1.00 4.26 6.83   29155482   46727688

Most of the files in $PGDATA/pg_stat_tmp haven't been updated by PostgreSQL 
since July 7th at 14:16. The one exception is pgss_query_texts.stat. I 
shouldn't have but I touched $PGDATA/pg_stat_tmp/db* on July 12th at 15:14 
(doing so did trigger an autovacuum though).

I set track_activities and track_counts to off and reloaded config (pg_ctl 
reload). The log indicated that it noticed the change. I created a directory 
/dev/shm/pg_stat_tmp, set track_activities and track_counts on, set 
stats_temp_directory to '/dev/shm/pg_stat_tmp' and reloaded config. Again, the 
log indicated that it noticed the change but nothing happened with stats. The 
new directory is empty.

I notified the statistics collector to quit (kill -QUIT). It did and the 
postmaster restarted it. The new stats directory is still empty. The only 
updates to the old stats directory is the pgss_query_texts.stat file.

I tried getting stack traces and the like with gdb and strace but it wasn't 
really helpful.

Most of what I have tried was the result of help by Jim Nasby on 
postgresteam.slack.com. He and Lukas Fittl suggested that I post about my 
problem. There's one thing that we all agree on... It should not be so hard to 
figure out what is going on with the statistics collector.

I tried turning stats off and on again today. Here's what the two pg_stat_tmp 
directories look like today:
 so now the directory looks like:

-bash-4.1$ ls -alh 9.4/data/pg_stat_tmp/
total 4.3M
drwx--  2 postgres postgres 4.0K Jul  7 14:16 .
drwx-- 19 postgres postgres 4.0K Jul 25 11:20 ..
-rw---  1 postgres postgres 3.4K Jul 12 15:14 db_0.stat
-rw---  1 postgres postgres  26K Jul 12 15:14 db_13003.stat
-rw---  1 postgres postgres  13K Jul 12 15:14 db_1.stat
-rw---  1 postgres postgres 140K Jul 12 15:14 db_2473933.stat
-rw---  1 postgres postgres 203K Jul 12 15:14 db_2476698.stat
-rw---  1 postgres postgres  27K Jul 12 15:14 db_2595334.stat
-rw---  1 postgres postgres  28K Jul 12 15:14 db_2631762.stat
-rw---  1 postgres postgres 1.5K Jul  7 14:16 global.stat
-rw---  1 postgres postgres 3.9M Jul 25 13:39 pgss_query_texts.stat
-bash-4.1$ ls -alh /dev/shm/pg_stat_tmp/
total 0
drwxr-xr-x 2 postgres postgres 40 Jul 12 17:19 .
drwxrwxrwt 3 root root 80 Jul 11 17:31 ..

Does anyone have any suggestions on how to:
 - see what the statistics collector is doing?
 - tell the postmaster to give the statistics collector the up to date config?
 - get stats working without restarting the instance? (I am not yet sure when 
that can happen but it would have to be scheduled [in advance] to occur during 
an after hours maintenance window.

Thanks in advance,
Matt

--
Matthew Musgrove
Senior Software Engineer
Assurant Mortgage Solutions
817.788.4482
mmusgr...@emortgagelogic.com<mailto:mmusgr...@emortgagelogic.com>
matthew.musgr...@assurant.com<mailto:matthew.musgr...@assurant.com>


Re: [GENERAL] Weirdness with the stats collector process

2016-07-26 Thread Matthew Musgrove
On 07/25/2016 03:20 PM, Tom Lane wrote:

Matthew Musgrove 
<mailto:mmusgr...@emortgagelogic.com> writes:


One of our instances has been behaving -- oddly. Most queries are blazing fast. 
It appears to just be some of the stat views that are slow.



It sounds like requests for stats updates are not getting through to the
collector.  I wonder if your kernel is blocking those packets for some
reason (probably ill-considered iptables filters?).  There is a test
for packet filtering conducted during postmaster startup, but we don't
repeat it after that.



Today was the first day that I have had any significant amount of time to 
investigate since probably the 12th. I had checked iptables before. When I saw 
this message I checked, double-checked and even re-ordered the iptables rules 
but it didn't make any difference.

I started going through the log files under /var/log again. I didn't see 
anything until I made it back to /var/log/messages for July 7th. I noticed some 
entries for system-config-network at 14:05.

I looked at ifconfig and the files in /etc/sysconfig/network-scripts but 
everything looked fine. I checked what ports the postmaster and stats collector 
were using ('lsof -i -P | grep '). I'm not sure how I missed this before 
but they were using an IPv6 localhost port.

I asked what network changes were made on July 7th around 14:05 and was told 
that they disabled IPv6! I had them turn it back on and stats started working 
immediately.

I have since changed the config from listen_addresses = '*' to listen_addresses 
= '0.0.0.0' but we won't be restarting until our next maintenance window (in 
about a week) when they will disable IPv6 again.

Tom and Adrian, thank you both! :)


--
Matthew Musgrove