Re: Troubleshooting a segfault and instance crash

Blair Boadway Sat, 24 Mar 2018 15:45:45 -0700

Following up on this thread, we removed pgaudit from the system to eliminate 
one variable (removed from postgresql.conf, including shared_preload_libraries), 
but after a couple of weeks of success we hit the segfault again.  Again it 
happened while running some DDL (object grants).  This time we were configured 
to harvest a core file, which gave us a small bit of info:


gdb -q -c core /usr/pgsql-9.6/bin/postgres
Reading symbols from /usr/pgsql-9.6/bin/postgres...(no debugging symbols 
found)...done.
<many more lines such as this with no debugging symbols found>
Core was generated by `postgres: batch_user_account'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000386712868a in __strcmp_sse42 () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install 
postgresql96-server-9.6.5-1PGDG.rhel6.x86_64

That wasn’t really enough information to tell me what the problem is.  I did 
not have success with installing debuginfo:

Could not find debuginfo for main pkg: 
postgresql96-server-9.6.5-1PGDG.rhel6.x86_64

Not sure how useful it would be to dig further on that.  So pgaudit doesn’t 
seem to be the culprit, but I’m not sure what to make of the strcmp error.
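
When gdb loads a core it prints only the innermost frame, which is why the output above stops at __strcmp_sse42. A backtrace lists the callers, and even without debuginfo some frames may resolve to function names through exported symbols. Below is a minimal sketch of a non-interactive invocation. The helper name gdb_backtrace_command is hypothetical, and the paths are the ones from the session above.

```python
import shlex


def gdb_backtrace_command(executable, core_file):
    """Build a non-interactive gdb command line that prints stack information.

    Hypothetical helper for illustration; it only assembles the arguments.
    """
    return [
        "gdb", "-q", "-batch",
        "-ex", "bt",                  # call chain leading to the fault
        "-ex", "info registers",      # register values at the time of the fault
        "-ex", "info sharedlibrary",  # load addresses of libc and any extensions
        executable, core_file,
    ]


cmd = gdb_backtrace_command("/usr/pgsql-9.6/bin/postgres", "core")
# Print a copy-pasteable shell command.
print(" ".join(shlex.quote(part) for part in cmd))
```

The frames just above strcmp are the interesting part, since they would show whether the caller sits in postgres itself or in one of the loaded extension libraries.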

-Blair


From: Jan Bilek <jan.bi...@eftlab.com.au>
Date: Thursday, March 8, 2018 at 2:56 PM
To: "pavel.steh...@gmail.com" <pavel.steh...@gmail.com>, Blair Boadway 
<bboad...@abebooks.com>
Cc: "pgsql-gene...@postgresql.org" <pgsql-gene...@postgresql.org>
Subject: RE: Troubleshooting a segfault and instance crash

Hi Blair, Pavel,

we are using procedure described in https://access.redhat.com/solutions/4896  
to automate crash detail collection for our production systems on RHEL 7.

Perhaps something like this can help on your side.
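
Whatever collection procedure is used, the crashing backend first has to be allowed to write a core at all. A minimal sketch of that check follows, assuming a Linux host where the text comes from /proc/<pid>/limits of the running postmaster. The helper name core_limits is hypothetical and the sample text is illustrative, not taken from the systems in this thread.

```python
def core_limits(limits_text):
    """Return the (soft, hard) core file size limits from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith("Max core file size"):
            fields = line.split()
            # Layout: Max core file size <soft> <hard> bytes
            return fields[4], fields[5]
    raise ValueError("no 'Max core file size' line found")


# Illustrative sample in the /proc/<pid>/limits layout (not real data).
sample = (
    "Limit                     Soft Limit           Hard Limit           Units     \n"
    "Max cpu time              unlimited            unlimited            seconds   \n"
    "Max core file size        0                    unlimited            bytes     \n"
)

soft, hard = core_limits(sample)
print("soft:", soft, "hard:", hard)
```

A soft limit of 0 would mean the process writes no core file no matter how the rest of the collection is configured.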

Kind Regards,
Jan

On 2018-03-09 04:35:05+10:00 Pavel Stehule wrote:


2018-03-08 19:16 GMT+01:00 Blair Boadway 
<bboad...@abebooks.com>:
Hi Pavel,

I don’t have a core yet, the only way I have now is to intentionally crash the 
prod system a couple of times.  Haven’t resorted to that yet.
It is hard to help without a backtrace, and for that you need a core dump.


Interesting that you mentioned pgaudit.  It is installed on this system because 
that is our standard installation, but on this particular system we haven’t yet 
needed audits, so the audit role is ‘empty’.  (And on a different system with 
the same installation and heavy use of audit we’ve seen no segfaults.)
The other extensions are simple, unrelated to DDL, or well known, so 
pgaudit is the best candidate, but the error can be anywhere.

Regards

Pavel
On this system

pgaudit.role = 'auditor'
pgaudit.log_parameter = off
pgaudit.log_catalog = off
pgaudit.log_statement_once = on
pgaudit.log_level = log

select * from information_schema.role_table_grants where grantee = 'auditor';
(0 rows)

thanks, Blair

From: Pavel Stehule <pavel.steh...@gmail.com>
Date: Thursday, March 8, 2018 at 9:49 AM
To: Blair Boadway <bboad...@abebooks.com>
Cc: "pgsql-gene...@postgresql.org" <pgsql-gene...@postgresql.org>
Subject: Re: Troubleshooting a segfault and instance crash
Hi

2018-03-08 18:40 GMT+01:00 Blair Boadway 
<bboad...@abebooks.com>:
Hello,

We’re seeing an occasional segfault on a particular database

Mar  7 14:46:35 pgprod2 kernel:postgres[29351]: segfault at 0 ip 
000000302f32868a sp 00007ffcf1547498 error 4 in 
libc-2.12.so[302f200000+18a000]
Mar  7 14:46:35 pgprod2 POSTGRES[21262]: [5] user=,db=,app=client= LOG:  server 
process (PID 29351) was terminated by signal 11: Segmentation fault
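
The kernel line carries more detail than it first appears to. A minimal sketch of decoding it follows, assuming the standard x86 Linux "segfault at" message format, where the error value is the page-fault error code in hex. The helper name decode_segfault is hypothetical, and the input is the log line above joined onto one line.

```python
import re

# Pattern for the x86 Linux "segfault at ..." kernel message.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Decode the fault address, library offset, and page-fault error bits."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a segfault line")
    err = int(m.group("err"), 16)
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    return {
        "fault_addr": int(m.group("addr"), 16),
        "library": m.group("lib"),
        "offset_in_library": ip - base,                # instruction offset within the mapping
        "page_present": bool(err & 1),                 # bit 0: page was mapped
        "access": "write" if err & 2 else "read",      # bit 1: write access
        "mode": "user" if err & 4 else "kernel",       # bit 2: user-mode access
    }


line = ("Mar  7 14:46:35 pgprod2 kernel:postgres[29351]: segfault at 0 ip "
        "000000302f32868a sp 00007ffcf1547498 error 4 in "
        "libc-2.12.so[302f200000+18a000]")
info = decode_segfault(line)
print(info["library"], hex(info["offset_in_library"]), info["access"], info["mode"])
```

Read this way, the line describes a user-mode read of address 0 at offset 0x12868a inside libc, which is what a NULL pointer handed to a string function would look like. That reading is an inference from the log line, not something the logs confirm on their own.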

It crashes the database, though it starts again on its own without any apparent 
issues.  This has happened 3 times in 2 months and each time the segfault error 
and memory address is the same. We’ve only seen it on one database, though 
we’ve seen it on both hosts of primary/standby setup—we switched over primary 
to other host and got a segfault there, which seems to eliminate a hardware 
issue.  Oddly the database has no issues for normal DML workloads (it is a 
moderately busy prod oltp system) but the segfault has happened very shortly 
after DDL changes are made.  Most recently it happened while running a series 
of grants for new db users we were deploying (i.e. running a sql script from 
psql on the primary host)

grant usage on schema app to app_user1;
grant usage on schema app to app_user2;
...

Our set up is
RHEL 6.9  - 2.6.32-696.16.1.el6.x86_64
PostgreSQL 9.6.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 
(Red Hat 4.4.7-18), 64-bit
Extensions - 
pg_cron,repmgr_funcs,pgaudit,pg_stat_statements,pg_hint_plan,pglogical

So far can’t reproduce on a test system, have just added some OS config to 
collect core from the OS but haven’t collected a core yet.  There isn’t any 
particular config change or extension that we can link to the problem, this is 
a system that has run for months without problems since last config changes.  
Appreciate any ideas.
Can you get a core dump? Maybe it is a pgaudit bug? It is a complex extension.
Regards
Pavel

Regards,
Blair
