| I'm using PG 9.1.9 with a client application using various versions of the | pgsqlODBC driver on Windows. Cursors are used heavily, as well as some pretty | heavy trigger queries on db writes which update several materialized views. | | The server has 48GB RAM installed, PG is configured for 12GB shared buffers, | 8MB max_stack_depth, 32MB temp_buffers, and 2MB work_mem. Most of the other | settings are defaults. | | The server will seg fault from every few days to up to two weeks. Each time | one of the postgres server processes seg faults, the server gets terminated by | signal 11, restarts in recovery for up to 30 seconds, after which time it | accepts connections as if nothing ever happened. Unfortunately all the open | cursors and connections are lost, so the client apps are left in a bad state. | | Seg faults have also occurred with PG 8.4. ... I migrated the database to a | server running PG9.1 with the hopes that the problem would disappear, but it | has not. So now I'm starting to debug. | | # uname -a | Linux [hostname] 2.6.32-358.2.1.el6.x86_64 #1 SMP Tue Mar 12 14:18:09 CDT 2013 | x86_64 x86_64 x86_64 GNU/Linux | # cat /etc/redhat-release | Scientific Linux release 6.3 (Carbon) | | # psql -U jberry | psql (9.1.9) | Type "help" for help. | | jberry=# select version(); | version | ------------------------------------------------------------------------------- | PostgreSQL 9.1.9 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 | 20120313 (Red Hat 4.4.7-3), 64-bit | (1 row)
I've had another postmaster segfault on my production server. It appears to be the same failure as the last one nearly a month ago, but I wanted to post the gdb bt details in case it helps shed light on the issue. Please let me know if anyone would like to drill into the dumped core with greater detail. Both the OS and PG versions remain unchanged. Kind Regards, -Joshua >From the PG log: 2013-05-07 08:48:35 CDT <%@> LOG: server process (PID 12367) was terminated by signal 11: Segmentation fault >From /var/log/messages May 7 08:48:17 active kernel: postmaster[12367]: segfault at 40 ip 0000000000710e2e sp 00007fffdcaeedf0 error 4 in postgres[400000+4ea000] (gdb) bt full #0 ResourceOwnerEnlargeCatCacheRefs (owner=0x0) at resowner.c:605 newmax = <value optimized out> #1 0x00000000006e1382 in SearchCatCache (cache=0x25bf270, v1=<value optimized out>, v2=<value optimized out>, v3=<value optimized out>, v4=<value optimized out>) at catcache.c:1143 res = 1 '\001' cur_skey = {{sk_flags = 0, sk_attno = -2, sk_strategy = 3, sk_subtype = 0, sk_collation = 0, sk_func = {fn_addr = 0x686640 <oideq>, fn_oid = 184, fn_nargs = 2, fn_strict = 1 '\001', fn_retset = 0 '\000', fn_stats = 2 '\002', fn_extra = 0x0, fn_mcxt = 0x252f108, fn_expr = 0x0}, sk_argument = 20}, {sk_flags = 0, sk_attno = 0, sk_strategy = 0, sk_subtype = 0, sk_collation = 0, sk_func = {fn_addr = 0, fn_oid = 0, fn_nargs = 0, fn_strict = 0 '\000', fn_retset = 0 '\000', fn_stats = 0 '\000', fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0}, sk_argument = 0}, {sk_flags = 0, sk_attno = 0, sk_strategy = 0, sk_subtype = 0, sk_collation = 0, sk_func = {fn_addr = 0, fn_oid = 0, fn_nargs = 0, fn_strict = 0 '\000', fn_retset = 0 '\000', fn_stats = 0 '\000', fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0}, sk_argument = 0}, {sk_flags = 0, sk_attno = 0, sk_strategy = 0, sk_subtype = 0, sk_collation = 0, sk_func = {fn_addr = 0, fn_oid = 0, fn_nargs = 0, fn_strict = 0 '\000', fn_retset = 0 '\000', fn_stats = 0 '\000', fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0}, sk_argument = 0}} hashValue = 2280326203 hashIndex = 59 elt = 0x7fe0742cf228 ct = 0x7fe0742cf218 relation = <value optimized out> scandesc = <value optimized out> ntp = <value optimized out> #2 0x00000000006ec69e in getTypeOutputInfo (type=20, typOutput=0x277dea0, typIsVarlena=0x277dea8 "") at lsyscache.c:2438 typeTuple = <value optimized out> pt = <value optimized out> __func__ = "getTypeOutputInfo" #3 0x0000000000459027 in printtup_prepare_info (myState=0x25d53d0, typeinfo=0x2686640, numAttrs=84) at printtup.c:263 thisState = <value optimized out> format = <value optimized out> formats = 0x0 i = <value optimized out> __func__ = "printtup_prepare_info" #4 0x00000000004593c4 in printtup (slot=0x26fd960, self=0x25d53d0) at printtup.c:297 typeinfo = <value optimized out> myState = 0x25d53d0 buf = {data = 0x262cf40 "\320I\\\002", len = 40884576, maxlen = 0, cursor = 40884576} natts = 84 i = <value optimized out> #5 0x00000000006376ca in RunFromStore (portal=0x262cf40, direction=<value optimized out>, count=10, dest=0x25d53d0) at pquery.c:1121 oldcontext = 0x295a470 ok = <value optimized out> current_tuple_count = 0 slot = 0x26fd960 #6 0x00000000006377b2 in PortalRunSelect (portal=0x262cf40, forward=<value optimized out>, count=10, dest=0x25d53d0) at pquery.c:939 queryDesc = 0x0 direction = ForwardScanDirection nprocessed = <value optimized out> __func__ = "PortalRunSelect" #7 0x0000000000638c78 in PortalRun (portal=0x262cf40, count=10, isTopLevel=1 '\001', dest=0x25d53d0, altdest=0x25d53d0, completionTag=0x7fffdcaef470 "") at pquery.c:787 save_exception_stack = 0x7fffdcaef360 save_context_stack = 0x0 local_sigjmp_buf = {{__jmpbuf = {140736895841360, 9079525024474526506, 40030016, 0, 4294967295, 3, -9079456159289380054, 9079524170016813866}, __mask_was_saved = 0, __saved_mask = { __val = {39139104, 140736895840832, 64, 39668656, 0, 0, 80, 38984240, 8640448, 0, 140736895841360, 0, 40030016, 140736895841360, 140736895841360, 0}}}} result = <value optimized out> nprocessed = <value optimized out> saveTopTransactionResourceOwner = 0x25c8170 saveTopTransactionContext = 0x25c8060 saveActivePortal = 0x0 saveResourceOwner = 0x2748af0 savePortalContext = 0x0 saveMemoryContext = 0x295b020 __func__ = "PortalRun" #8 0x000000000063661e in exec_execute_message (argc=<value optimized out>, argv=<value optimized out>, dbname=0x252e150 "[dbname]", username=<value optimized out>) at postgres.c:1965 completed = <value optimized out> execute_is_fetch = 1 '\001' completionTag = "\000ELECT 3\000\000\241\361=\000\000\000\360\364\256\334\377\177\000\000\200/U\002", '\000' <repeats 12 times>, " 5U\002\000\000\000\000\320\364\256\334\377\177\000\000U\000`\377\377\377\377\377" save_log_statement_stats = 0 '\000' was_logged = <value optimized out> dest = DestRemoteExecute receiver = 0x25d53d0 portal = <value optimized out> sourceText = 0x2644f10 "declare \"SQL_CUR02F082B0\" cursor with hold for SELECT job.JOB,job.COMPANYCODE,job.RECDBYCODE,job.PROJECT,job.REMARKS,job.JOBTYPE_ID,job.PRIORITY,job.LABLOC_ID,---Type <return> to continue, or q <return> to quit--- jobsch.JOB,jobsch.ISOFFLINE,jobsch.COMPAN"... prepStmtName = 0x817c44 "<unnamed>" portalParams = 0x0 is_xact_command = <value optimized out> msec_str = '\000' <repeats 16 times>, "59423\000\000\000\230/U\002\000\000\000" #9 PostgresMain (argc=<value optimized out>, argv=<value optimized out>, dbname=0x252e150 "[dbname]", username=<value optimized out>) at postgres.c:4026 portal_name = 0x25d4bb0 "SQL_CUR02F082B0" max_rows = 10 firstchar = <value optimized out> input_message = {data = 0x25d4bb0 "SQL_CUR02F082B0", len = 20, maxlen = 1024, cursor = 20} local_sigjmp_buf = {{__jmpbuf = {140736895841328, 9079528712728408874, 1, 1, -9187201950435737471, 0, -9079456159413112022, 9079524172817429290}, __mask_was_saved = 1, __saved_mask = { __val = {0, 0, 0, 18446462598732840960, 18446744073709551615, 0, 0, 0, 0, 0, 0, 266046903056, 140615307920592, 0, 4294967295, 0}}}} send_ready_for_query = 0 '\000' __func__ = "PostgresMain" #10 0x00000000005f6c61 in BackendRun () at postmaster.c:3612 ac = 1 secs = 421182656 usecs = 495798 i = <value optimized out> av = 0x252e398 maxac = <value optimized out> #11 BackendStartup () at postmaster.c:3302 bn = 0x2552f80 pid = 0 #12 ServerLoop () at postmaster.c:1466 rmask = {fds_bits = {8, 0 <repeats 15 times>}} selres = <value optimized out> readmask = {fds_bits = {56, 0 <repeats 15 times>}} nSockets = 6 now = <value optimized out> last_touch_time = 1367867385 __func__ = "ServerLoop" #13 0x00000000005f9431 in PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1127 opt = <value optimized out> status = <value optimized out> userDoption = <value optimized out> listen_addr_saved = 1 '\001' i = <value optimized out> __func__ = "PostmasterMain" #14 0x000000000059a9b0 in main (argc=5, argv=0x252c480) at main.c:199 No locals. (gdb) Kind Regards, -Joshua On Fri, Apr 12, 2013 at 6:12 AM, Andres Freund <and...@2ndquadrant.com>wrote: > On 2013-04-10 19:06:12 -0400, Tom Lane wrote: > > I wrote: > > > (Wanders away wondering just how much the regression tests exercise > > > holdable cursors.) > > > > And the answer is they're not testing this code path at all, because if > > you do > > DECLARE c CURSOR WITH HOLD FOR ... > > FETCH ALL FROM c; > > then the second query executes with a portal (and resource owner) > > created to execute the FETCH command, not directly on the held portal. > > > > After a little bit of thought I'm not sure it's even possible to > > reproduce this problem with libpq, because it doesn't expose any way to > > issue a bare protocol Execute command against a pre-existing portal. > > (I had thought psqlOBC went through libpq, but maybe it's playing some > > games here.) > > > > Anyway, I'm thinking the appropriate fix might be like this > > > > - CurrentResourceOwner = portal->resowner; > > + if (portal->resowner) > > + CurrentResourceOwner = portal->resowner; > > > > in several places in pquery.c; that is, keep using > > TopTransactionResourceOwner if the portal doesn't have its own. > > > > A more general but probably much more invasive solution would be to fake > > up an intermediate portal when pulling data from a held portal, to > > more closely approximate the explicit-FETCH case. > > We could also allocate a new resowner for the duration of that > transaction. That would get reassigned to the transactions resowner in > PreCommit_Portals (after a slight change there). > That actually seems simple enough? > > Greetings, > > Andres Freund >