To recap:

My setup uses Bacula in a disk-to-disk-to-removable-disk
configuration (for about the past year, since I gave up on replacing
failed LTO drives every other year), with its catalog DB on MariaDB
Galera Cluster and DB connections round-robinned via HAProxy (it's been
on the cluster for about three years, and on HAProxy for about the past
year).  This configuration has Just Worked as long as I've been running
it, with the one caveat that Bacula must be built with attribute
spooling disabled, because the attribute spooling code for the MySQL
driver just does not work in a useful way.

As soon as I updated to 9.6.5, jobs started hanging.  Typically about
one job in three would just hang mid-file, for no reason I was ever
able to determine.  It was sometimes possible to cancel a hung job, but
it would take a very long time.  If I mistakenly tried to cancel a
second hung job while the Director was still working on cancelling the
first, the Director would almost invariably crash.  Changing the
configuration to connect directly to the local node of the cluster and
treat it as a standalone MySQL instance slightly mitigated the problem,
but did not fix it.

I eventually discovered that I could turn the hung-job problem on and
off simply by changing the Director version *ONLY*.  With a 9.6.5
Director, jobs hung; with a 9.6.3 Director, they didn't, with no other
changes.  No matter how deeply I dug into it I was unable to ever
isolate the specific cause of the problem.


I've now been running on 9.6.6 for a week and have not seen a single
hung job.

I *THINK* I can safely state that whatever the root cause of the problem
in 9.6.5, it is fixed in 9.6.6.  Since it is already reported that a TLS
issue in 9.6.5 is fixed in 9.6.6, I'm going to *speculate* (with the
caveat that I have no definite proof) that both issues were actually
networking problems, that jobs were hanging because communication with
the clients silently failed, and that the fix for the TLS problem ALSO
fixed the hung-jobs problem by fixing the client communication failures.



-- 
  Phil Stracchino
  Babylon Communications
  ph...@caerllewys.net
  p...@co.ordinate.org
  Landline: +1.603.293.8485
  Mobile:   +1.603.998.6958

