Hello again,

I just thought I'd update this post with more information in hopes of
getting some explanation for the deadlocks.

I ran with Accurate backup on our test VMs (RHEL) for a couple of days and
got the same errors, both on VMs that were running Accurate and on VMs that
were not.  These hosts were running concurrently.  I would say 90% of the
hosts that were configured to use Accurate finished successfully.  However,
a few failed with the deadlock error: some that were configured to use
Accurate and some that were not.  Also, for each of the affected hosts, a
second job started right after Bacula detected the deadlock, even though
Bacula said a reschedule would happen 3600 seconds later (the 3600 seconds
is correct).

Tonight, I disabled accurate on all hosts and the deadlocks did not
happen.  No errors were detected and all the backups finished successfully.

Some questions...
1.  Can I back up multiple hosts concurrently with some hosts configured to
use accurate and some configured not to use accurate?  Or, is it an all or
none thing, meaning all hosts that run concurrently must either be using
accurate backup or not using accurate backup (cannot mix the two)?

2. It seems like the hosts that get out of the starting gate first are the
ones affected.  I am configured to run 50 jobs concurrently.  Again, no
problems with accurate turned off on all hosts for months now.

3. Why is Bacula spinning off a new job right away after it detects the
deadlock for each affected job instead of waiting until the rescheduled job
runs?  I verified that there were no duplicate jobs in the queue before the
backups started running, no jobs were running before the start of the
backups, and I did not start any of these backups manually to cause a
second job to appear.
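
For anyone who wants to look for duplicates in the catalog after the fact,
here is a rough sketch of the kind of check I have in mind.  It assumes the
catalog's Job table has JobId, Name and SchedTime columns (please verify
against your own schema).  The Python wrapper, the in-memory SQLite stand-in,
the host names and JobId 123990 are made up for illustration; against a real
catalog only the SELECT itself would be needed.

```python
import sqlite3

# Job names that have more than one catalog record with the same scheduled
# time.  The column names (JobId, Name, SchedTime) are an assumption about
# the catalog's Job table, not something confirmed in this thread.
DUPLICATE_JOBS_SQL = """
SELECT Name, SchedTime, COUNT(*) AS Copies, GROUP_CONCAT(JobId) AS JobIds
FROM Job
GROUP BY Name, SchedTime
HAVING COUNT(*) > 1
ORDER BY SchedTime, Name
"""


def find_duplicate_jobs(conn):
    """Return (Name, SchedTime, Copies, JobIds) rows for duplicated jobs."""
    return conn.execute(DUPLICATE_JOBS_SQL).fetchall()


def demo_connection():
    """Build an in-memory stand-in for the catalog with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Job (JobId INTEGER PRIMARY KEY, Name TEXT, SchedTime TEXT)"
    )
    conn.executemany(
        "INSERT INTO Job VALUES (?, ?, ?)",
        [
            (123984, "hostA-backup", "2015-08-03 18:00:00"),
            (123990, "hostA-backup", "2015-08-03 18:00:00"),  # hypothetical duplicate
            (123985, "hostB-backup", "2015-08-03 18:00:00"),
        ],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    for row in find_duplicate_jobs(demo_connection()):
        print(row)
```

A job that appears in the result was recorded more than once for the same
scheduled time.  Whether a job rescheduled after a deadlock keeps the same
SchedTime is something I have not verified.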

From the InnoDB Monitor output:

TRANSACTION:
TRANSACTION 208788977, ACTIVE 1 sec setting auto-inc lock
mysql tables in use 4, locked 4
9 lock struct(s), heap size 1184, 5 row lock(s)
MySQL thread id 50808, OS thread handle 0x7f8f2c3b4700, query id 29558637
<host> 192.168.10.99 bacula Sending data
INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5,
DeltaSeq) SELECT batch.FileIndex, batch.JobId, Path.PathId,
Filename.FilenameId,batch.LStat, batch.MD5, batch.DeltaSeq FROM batch JOIN
Path ON (batch.Path = Path.Path) JOIN Filename ON (batch.Name =
Filename.Name)
WAITING FOR THIS LOCK TO BE GRANTED:
TABLE LOCK table `bacula`.`File` trx id 208788977 lock mode AUTO-INC waiting
WE ROLL BACK TRANSACTION (2)
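
For context, the "try restarting transaction" hint in the MySQL deadlock
error describes a client-side retry: the transaction chosen as the deadlock
victim is rolled back and can be issued again.  Below is a minimal sketch of
that pattern in Python.  It is not Bacula's code (Bacula reports that it will
reschedule the job instead); DeadlockError and run_with_retry are
hypothetical names, and 1213 is MySQL's ER_LOCK_DEADLOCK error number.

```python
import time

MYSQL_ER_LOCK_DEADLOCK = 1213  # "Deadlock found when trying to get lock"


class DeadlockError(Exception):
    """Hypothetical stand-in for a database driver error carrying an errno."""

    def __init__(self, errno, msg):
        super().__init__(msg)
        self.errno = errno


def run_with_retry(transaction, max_attempts=3, base_delay=0.0):
    """Call transaction(); re-run it if it was rolled back as a deadlock victim.

    Errors other than a deadlock, and a deadlock on the final attempt,
    are re-raised to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except DeadlockError as exc:
            if exc.errno != MYSQL_ER_LOCK_DEADLOCK or attempt == max_attempts:
                raise
            # Wait a little longer before each new attempt.
            time.sleep(base_delay * attempt)
```

Here transaction would be a callable that runs the batch INSERT and commits;
the sketch only shows the retry logic around it.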

I am running Bacula 7.0.5 on RHEL 6.6 x64 with Director, Storage and
Catalog running on separate RHEL 6.6 hosts.  Our clients run RHEL 6, RHEL 5,
and Windows Server 2008 and 2012 R2.

Any help would be much appreciated.

Warmest regards,
-craig

On Tue, Aug 4, 2015 at 1:56 PM, Craig Shiroma <shiroma.crai...@gmail.com>
wrote:

> BTW, I suppose there could have been two jobs for the host(s) in the
> scheduling queue.  If that was the case, is there a way to find out after
> the fact?  If this did actually happen, what could cause duplicate jobs to
> be scheduled on the same day at the same time?  I know no one manually ran
> the jobs in question.  Again, this was only a problem for a few of the jobs
> that ran last night, not all of them, and the affected jobs included some
> set to use Accurate backup and some not.
>
> Regards,
> -craig
>
> On Tue, Aug 4, 2015 at 9:27 AM, Craig Shiroma <shiroma.crai...@gmail.com>
> wrote:
>
>> Hello,
>>
>> I had a few backups fail last night with the following error:
>>
>> 2015-08-03 18:02:46 bacula-dir JobId 123984: b INTO File (FileIndex,
>> JobId, PathId, FilenameId, LStat, MD5, DeltaSeq) SELECT batch.FileIndex,
>> batch.JobId, Path.PathId, Filename.FilenameId,batch.LStat, batch.MD5,
>> batch.DeltaSeq FROM batch JOIN Path ON (batch.Path = Path.Path) JOIN
>> Filename ON (batch.Name = Filename.Name): ERR=Deadlock found when trying to
>> get lock; try restarting transaction
>>
>> The only thing I did yesterday was switch a bunch of backups to use
>> Accurate backup and restart bacula-dir and bacula-sd after that.  However,
>> the above problem also occurred on some hosts that were not set to use
>> Accurate backup.  From the log, it seems that two jobs for this host were
>> scheduled to run at 18:00, because the second job started, found a
>> duplicate job (job 123984), and canceled the backup.  I know there were no
>> jobs running before 18:00, so 123984 was not an old job still running.  The
>> same goes for the other jobs that were canceled because of the above
>> situation.
>>
>> Anyway, does anyone have an idea what would cause this, especially how
>> the second job got into the system?  After the deadlock error, Bacula
>> said it would reschedule the job.  However, the second job started right
>> after the deadlock error instead of one hour later, which makes me think
>> that there were two jobs for this host scheduled to run at 18:00.
>>
>> Thank you in advance,
>> -craig
>>
>
>
------------------------------------------------------------------------------
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
