On 18.03.2011 21:37, Martin Simmons wrote:
>>>>>> On Fri, 18 Mar 2011 20:47:03 +0100, Christian Manal said:
>>
>> On 18.03.2011 19:26, Martin Simmons wrote:
>>>>>>>> On Fri, 18 Mar 2011 13:36:36 +0100, Christian Manal said:
>>>>
>>>> On 18.03.2011 13:03, Martin Simmons wrote:
>>>>>>>>>> On Fri, 18 Mar 2011 11:37:33 +0100, Christian Manal said:
>>>>>>
>>>>>> On 18.03.2011 10:40, Christian Manal wrote:
>>>>>> On 16.03.2011 09:14, Christian Manal wrote:
>>>>>>>> On 15.03.2011 19:12, Christian Manal wrote:
>>>>>>>> On 15.03.2011 17:49, Kjetil Torgrim Homme wrote:
>>>>>>>>>> Christian Manal <moen...@informatik.uni-bremen.de> writes:
>>>>>>>>>>
>>>>>>>>>> Also, after several accurate jobs running without restarting Bacula,
>>>>>>>>>> the total memory usage of the director and fd didn't go up anymore, so
>>>>>>>>>> I presume it comes down to the behavior of Solaris' free(), as
>>>>>>>>>> described in the above quoted manpage.
>>>>>>>>>>
>>>>>>>>>> libumem may work better -- just set LD_PRELOAD, you don't have to
>>>>>>>>>> recompile. I'd appreciate it if you report back if you try it.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>> Actually, I already did that. Modified the startup script for the
>>>>>>>> affected fd (don't want the director crashing if things go wrong) and
>>>>>>>> restarted. I will report the results tomorrow.
>>>>>>>>
>>>>>>>> Looks good.
>>>>>>>
>>>>>> Maybe I spoke too soon. Last night my director crashed with a segfault,
>>>>>> after switching to libumem. Leading to that was an unusually long
>>>>>> running job (the accurate one) which, going by the size, looked like it
>>>>>> was doing a full instead of incremental for some reason.
>>>>>>>
>>>>>> I have some output from mdb and pstack attached.
>>>>>>
>>>>>> And going by dbx, the dir went kaboom in Jmsg().
>>>>>> ...
>>>>>> =>[1] Jmsg(0xbefe5be0, 0x1, 0x0, 0x0, 0xfee8e25e, 0xf6caddb0), at 0xfee6a580
>>>>>> [2] j_msg(0x80c360e, 0x154, 0xbefe5be0, 0x1, 0x0, 0x0), at 0xfee6a7ad
>>>>>> [3] start_storage_daemon_message_thread(0xbefe5be0, 0x80bc7f5, 0xfdc7f960, 0x0, 0x80bc798, 0xfde8fe6c), at 0x80834bc
>>>>>> [4] do_backup(0xbefe5be0, 0x4, 0x0, 0xfdf91200, 0xfeea26e4, 0xfdf91200), at 0x80658b0
>>>>>> [5] _ZL10job_threadPv(0xbefe5be0, 0x1, 0xfe7c0dc7, 0xfe8422cc, 0xfe8422c0, 0xfdf91200), at 0x807a96e
>>>>>> [6] jobq_server(0x80e5080), at 0x807d127
>>>>>> [7] _thr_setup(0xfdf91200), at 0xfe7c7e66
>>>>>> [8] _lwp_start(0xfee8e708, 0x0, 0x0, 0xfde8ea00, 0x7, 0x0), at 0xfe7c8150
>>>>>
>>>>> It looks like it ran out of memory (the segfault is deliberate, due to failure
>>>>> to create a thread in start_storage_daemon_message_thread).
>>>>
>>>> That's strange. I'm monitoring that box with Nagios + pnp4nagios.
>>>> Neither did Nagios report unusually high memory usage nor do I see a
>>>> spike on the pnp4nagios graphs for memory and swap.
>>>>
>>>>
>>>>> Did it write any info to the Bacula log? It should say "Cannot create message
>>>>> thread:" followed by the error message.
>>>>
>>>> The logfile just cleanly ends after the last finished job. But it seems
>>>> to be in the coredump:
>>>>
>>>> core:msgchan.c:340 Cannot create message thread: Resource temporarily unavailable
>>>
>>> "Resource temporarily unavailable" occurs when Solaris can't allocate the
>>> stack for a new thread, so memory pressure is a likely reason. It may be
>>> invisible to Nagios if the memory is just reserved rather than being in use
>>> (something that malloc implementations will do differently).
>>>
>>
>> Hm.. but this didn't happen until I switched the director to libumem and
>> the server runs several other services which didn't blow up with no
>> memory. So it looks like it has something to do with dir+umem, doesn't it?
>
> Yes, but changing the memory allocator can have far-reaching consequences.
> How large was the core dump?
>

1.8G

>> I think I may set up a test environment, when I have time, to take a
>> closer look at this issue.
>
> You could try running pmap to see how the memory layout changes while it is
> doing the backup.
>
> Also, building Bacula as a 64-bit program might solve it (if you can get all
> of the dependent libraries in 64-bit format).
>

That's a good pointer. I will try that.


Regards,
Christian Manal

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
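
The pmap suggestion above could be scripted roughly as follows. This is only a sketch under stated assumptions: it assumes `pmap -x <pid>` is available on the box and that its output ends with a line beginning with "total" whose first number is the total mapped size in kilobytes. The names `parse_total_kb` and `watch`, the sampling interval, and the sample count are made up for illustration and are not part of Bacula or pmap.

```python
import re
import subprocess
import time
from typing import Optional


def parse_total_kb(pmap_output: str) -> Optional[int]:
    """Return the first number on the last line starting with 'total'.

    Assumption: that number is the total mapped size in kilobytes.
    Returns None if no such line (or no number on it) is found.
    """
    for line in reversed(pmap_output.splitlines()):
        if line.strip().lower().startswith("total"):
            match = re.search(r"\d+", line)
            return int(match.group()) if match else None
    return None


def watch(pid: int, interval_s: float = 60.0, samples: int = 10) -> None:
    """Print a timestamped total for the given PID every interval_s seconds."""
    for _ in range(samples):
        # 'pmap -x <pid>' is assumed to exist and to be permitted for this PID.
        result = subprocess.run(
            ["pmap", "-x", str(pid)], capture_output=True, text=True
        )
        print(time.strftime("%H:%M:%S"), parse_total_kb(result.stdout))
        time.sleep(interval_s)
```

Calling `watch(<director pid>)` while the accurate job runs would show whether the mapped total creeps toward the limit of a 32-bit address space, even if the resident usage that Nagios graphs stays flat.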