Are you sure about this?
We have a job that takes 25 days to complete, and we have never recompiled anything.
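
The JCR dump quoted below can be checked against the 6-day figure. Here is a minimal sketch, not a Bacula tool: it copies the JobId and start_time fields of the four backup jobs from the dump, takes the crash time from the syslog lines (seconds dropped), and computes how long each job had been running. The names DUMP, CRASH, LIMIT, running_jobs and elapsed_at_crash are made up for this example, and the 6-day limit is only the value claimed in this thread.

```python
import re
from datetime import datetime, timedelta

# Month names are mapped by hand so parsing does not depend on the locale.
MONTHS = {name: i for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Excerpt of the JCR dump quoted below (quote markers and other fields removed).
DUMP = """\
threadid=0x7fb495897700 JobId=104686 JobStatus=R jcr=0x7fb48806aea8
sched_time=11-Feb-2020 17:38 start_time=11-Feb-2020 17:38
threadid=0x7fb47e7fc700 JobId=104687 JobStatus=R jcr=0x7fb488068978
sched_time=11-Feb-2020 17:40 start_time=11-Feb-2020 17:40
threadid=0x7fb43f7fe700 JobId=104928 JobStatus=R jcr=0x7fb44805fa88
sched_time=14-Feb-2020 15:46 start_time=14-Feb-2020 15:47
threadid=0x7fb43e7fc700 JobId=105616 JobStatus=R jcr=0x55980a005fe8
sched_time=21-Feb-2020 21:30 start_time=24-Feb-2020 23:36
"""

CRASH = datetime(2020, 2, 28, 9, 56)  # "Feb 28 09:56:31" in syslog, seconds dropped
LIMIT = timedelta(days=6)             # watchdog period as claimed in this thread

START_RE = re.compile(r"start_time=(\d{2})-(\w{3})-(\d{4}) (\d{2}):(\d{2})")


def running_jobs(dump):
    """Yield (job_id, start_time) for every JCR block with a non-zero JobId."""
    job_id = None
    for line in dump.splitlines():
        m = re.search(r"JobId=(\d+)", line)
        if m:
            job_id = int(m.group(1))
        m = START_RE.search(line)
        if m and job_id:  # JobId=0 entries (monitor, console) are skipped
            day, mon, year, hour, minute = m.groups()
            yield job_id, datetime(int(year), MONTHS[mon], int(day),
                                   int(hour), int(minute))


def elapsed_at_crash(dump, crash=CRASH):
    """Map each JobId to the time it had been running when the crash hit."""
    return {jid: crash - start for jid, start in running_jobs(dump)}


for jid, age in elapsed_at_crash(DUMP).items():
    print(f"JobId={jid} running for {age}, over 6 days: {age > LIMIT}")
```

On the quoted data, three of the four backup jobs (104686, 104687 and 104928) had been running for well over 6 days and were still in JobStatus=R at the time of the crash. That suggests, though it does not prove, that no 6-day limit had been applied to them.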
---
Andrea Brancatelli
On 2020-03-02 12:15, Josh Fisher wrote:
> Bacula has a built-in watchdog that kills a job that runs for more than 6
> days. That period can be extended at compile time, so you have to compile
> your own binaries after a change to the source. I don't remember where in the
> source, but this has come up before and should be searchable.
>
> If you have already extended the watchdog timeout, a signal 11 is almost
> always a software bug, and the devs should be able to tell from the traceback
> where in the code this happened. That said, Bacula running for a long time
> using lots of pointers is also a decent test of memory, as well as I/O.
> Hardware errors, such as anything that causes a bit flip in RAM, usually
> result in a signal 11. But it is far more likely to be a software issue, and
> you should file a bug report.
>
> On 3/1/2020 6:22 PM, Chaz Vidal wrote:
>
>> Greetings all,
>> Our Bacula system crashed on Friday with a segmentation violation.
>>
>> The system has been attempting to do a full backup of over 130TB of data
>> over the past few weeks, which we appear to have lost because of the
>> crash.
>>
>> Feb 28 09:56:31 <<servername>> bacula-dir[4211]: Bacula interrupted by
>> signal 11: Segmentation violation
>> Feb 28 09:56:31 <<servername>> bacula-dir[4211]: Kaboom! bacula-dir,
>> bacula-dir got signal 11 - Segmentation violation at 28-Feb-2020 09:56:31.
>> Attempting traceback.
>> Feb 28 09:56:31 <<servername>> bacula-dir[4211]: Kaboom! exepath=/usr/sbin/
>> Feb 28 09:56:31 <<servername>> bacula-dir: Bacula interrupted by signal 11:
>> Segmentation violation
>> Feb 28 09:56:31 <<servername>> bacula-dir[4211]: Calling:
>> /usr/sbin/btraceback /usr/sbin/bacula-dir 4211 /var/lib/bacula
>> Feb 28 09:56:31 <<servername>> postfix/smtpd[59719]: connect from
>> localhost[127.0.0.1]
>> Feb 28 09:56:31 <<servername>> postfix/smtpd[59719]: 71CC36008A:
>> client=localhost[127.0.0.1]
>> Feb 28 09:56:31 <<servername>> postfix/cleanup[59722]: 71CC36008A:
>> message-id=<20200227232631.71CC36008A@<<servername>>.company.com>
>> Feb 28 09:56:31 <<servername>> postfix/qmgr[14399]: 71CC36008A:
>> from=<root@<<servername>>.company.com>, size=593, nrcpt=1 (queue active)
>> Feb 28 09:56:31 <<servername>> postfix/smtpd[59719]: disconnect from
>> localhost[127.0.0.1] helo=1 mail=1 rcpt=1 data=1 quit=1 commands=5
>> Feb 28 09:56:31 <<servername>> bacula-dir[4211]: It looks like the traceback
>> worked...
>> Feb 28 09:56:31 <<servername>> bacula-dir[4211]: LockDump:
>> /var/lib/bacula/bacula.4211.traceback
>> Feb 28 09:56:31 <<servername>> bacula-dir[4211]: bacula-dir:
>> lockmgr.c:1221-0 lockmgr disabled
>>
>> I do not know how to read a traceback file to understand what may have been
>> going on. We are attempting to restart the backup, but unless we
>> understand what happened, the crash may happen again.
>>
>> We are running Bacula Version: 9.4.2
>>
>> I would appreciate it if anyone could share any insight.
>>
>> Attempt to dump current JCRs. njcrs=7
>> threadid=0x7fb497491f40 JobId=0 JobStatus=R jcr=0x55980a04a4f8
>> name=*JobMonitor*.2020-02-11_15.29.48_01
>> use_count=1 killable=0
>> JobType=I JobLevel=
>> sched_time=11-Feb-2020 15:29 start_time=11-Feb-2020 15:29
>> end_time=01-Jan-1970 09:30 wait_time=01-Jan-1970 09:30
>> db=(nil) db_batch=(nil) batch_started=0
>> wstore=0x55980a01ff28 rstore=0x55980a01ff28 wjcr=(nil) client=0x55980a026128
>> reschedule_count=0 SD_msg_chan_started=0
>> threadid=0x7fb495897700 JobId=104686 JobStatus=R jcr=0x7fb48806aea8
>> name=job1.2020-02-11_17.38.56_13
>> use_count=2 killable=1
>> JobType=B JobLevel=F
>> sched_time=11-Feb-2020 17:38 start_time=11-Feb-2020 17:38
>> end_time=01-Jan-1970 09:30 wait_time=21-Feb-2020 16:50
>> db=0x7fb4880059a8 db_batch=(nil) batch_started=0
>> wstore=0x7fb48803fc18 rstore=(nil) wjcr=(nil) client=0x7fb4880481a8
>> reschedule_count=0 SD_msg_chan_started=1
>> BDB=0x7fb4880059a8 db_name=bacula db_user=bacula connected=true
>> cmd="UPDATE Media SET InChanger=0, Slot=0 WHERE Slot=25 AND StorageId IN
>> (10) AND MediaId!=794" changes=1814
>> RWLOCK=0x7fb4880059c0 w_active=0 w_wait=0
>> threadid=0x7fb47e7fc700 JobId=104687 JobStatus=R jcr=0x7fb488068978
>> name=job2.2020-02-11_17.40.43_14
>> use_count=2 killable=1
>> JobType=B JobLevel=F
>> sched_time=11-Feb-2020 17:40 start_time=11-Feb-2020 17:40
>> end_time=01-Jan-1970 09:30 wait_time=01-Jan-1970 09:30
>> db=0x7fb4880059a8 db_batch=(nil) batch_started=0
>> wstore=0x7fb48803fc18 rstore=(nil) wjcr=(nil) client=0x7fb4880481a8
>> reschedule_count=0 SD_msg_chan_started=1
>> BDB=0x7fb4880059a8 db_name=bacula db_user=bacula connected=true
>> cmd="UPDATE Media SET InChanger=0, Slot=0 WHERE Slot=25 AND StorageId IN
>> (10) AND MediaId!=794" changes=1814
>> RWLOCK=0x7fb4880059c0 w_active=0 w_wait=0
>> threadid=0x7fb43f7fe700 JobId=104928 JobStatus=R jcr=0x7fb44805fa88
>> name=job3.2020-02-14_15.47.06_47
>> use_count=2 killable=1
>> JobType=B JobLevel=F
>> sched_time=14-Feb-2020 15:46 start_time=14-Feb-2020 15:47
>> end_time=01-Jan-1970 09:30 wait_time=27-Feb-2020 22:21
>> db=0x7fb4880059a8 db_batch=(nil) batch_started=0
>> wstore=0x7fb448034678 rstore=(nil) wjcr=(nil) client=0x7fb44803c148
>> reschedule_count=0 SD_msg_chan_started=1
>> BDB=0x7fb4880059a8 db_name=bacula db_user=bacula connected=true
>> cmd="UPDATE Media SET InChanger=0, Slot=0 WHERE Slot=25 AND StorageId IN
>> (10) AND MediaId!=794" changes=1814
>> RWLOCK=0x7fb4880059c0 w_active=0 w_wait=0
>> threadid=0x7fb43e7fc700 JobId=105616 JobStatus=R jcr=0x55980a005fe8
>> name=job4.2020-02-21_21.30.01_16
>> use_count=2 killable=1
>> JobType=B JobLevel=F
>> sched_time=21-Feb-2020 21:30 start_time=24-Feb-2020 23:36
>> end_time=01-Jan-1970 09:30 wait_time=01-Jan-1970 09:30
>> db=0x7fb4880059a8 db_batch=(nil) batch_started=0
>> wstore=0x7fb448033b78 rstore=(nil) wjcr=(nil) client=0x7fb44803e9e8
>> reschedule_count=0 SD_msg_chan_started=1
>> BDB=0x7fb4880059a8 db_name=bacula db_user=bacula connected=true
>> cmd="UPDATE Media SET InChanger=0, Slot=0 WHERE Slot=25 AND StorageId IN
>> (10) AND MediaId!=794" changes=1814
>> RWLOCK=0x7fb4880059c0 w_active=0 w_wait=0
>> threadid=0x7fb43effd700 JobId=0 JobStatus=C jcr=0x7fb47800b2e8
>> name=-Console-.2020-02-27_08.39.19_09
>> use_count=1 killable=0
>> JobType=U JobLevel=F
>> sched_time=27-Feb-2020 08:39 start_time=27-Feb-2020 08:39
>> end_time=01-Jan-1970 09:30 wait_time=01-Jan-1970 09:30
>> db=0x7fb4880059a8 db_batch=(nil) batch_started=0
>> wstore=0x7fb448035cd8 rstore=0x7fb448034c18 wjcr=(nil) client=0x7fb44803ae18
>> reschedule_count=0 SD_msg_chan_started=0
>> BDB=0x7fb4880059a8 db_name=bacula db_user=bacula connected=true
>> cmd="UPDATE Media SET InChanger=0, Slot=0 WHERE Slot=25 AND StorageId IN
>> (10) AND MediaId!=794" changes=1814
>> RWLOCK=0x7fb4880059c0 w_active=0 w_wait=0
>> threadid=0x7fb45f7fe700 JobId=0 JobStatus=C jcr=0x7fb3f400e148
>> name=-Console-.2020-02-28_09.00.13_35
>> use_count=1 killable=0
>> JobType=U JobLevel=F
>> sched_time=28-Feb-2020 09:00 start_time=28-Feb-2020 09:00
>> end_time=01-Jan-1970 09:30 wait_time=01-Jan-1970 09:30
>> db=(nil) db_batch=(nil) batch_started=0
>> wstore=0x7fb448035cd8 rstore=0x7fb448034c18 wjcr=(nil) client=0x7fb44803ae18
>> reschedule_count=0 SD_msg_chan_started=0
>> List plugins. Hook count=0
>>
>> _______________________________________________
>> Bacula-users mailing list
>> Bacula-users@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/bacula-users
>