On 8/6/2014 1:52 AM, Kern Sibbald wrote:
On 08/04/2014 06:43 PM, Josh Fisher wrote:
...
Have you set PreferMountedVolumes=no in the Job resource in
bacula-dir.conf? If 3 jobs start and want to write to volumes in
the same pool, then all three can be assigned the same volume.
In fact, if PreferMountedVolumes=yes (the default), then all
three WILL be assigned the same volume unless the pool restricts
the max number of jobs that the volume may contain. However,
your device (drive) restricts the max concurrent jobs to 2.
Therefore one of those three jobs will not be able to select the
drive where the volume is mounted and will be forced to select
another unused drive. That third job will nevertheless select
the same volume as the other two and attempt to move the volume
from the drive it is in into the drive that it has been assigned
to. The configuration has a built-in race condition.
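For concreteness, such a setup would look roughly like this (an
illustrative excerpt; the resource names are examples, and only the
two directives shown come from the discussion above):

    # bacula-sd.conf -- Device (drive) resource, illustrative only
    Device {
      Name = Drive-0
      Maximum Concurrent Jobs = 2   # at most 2 jobs may use this drive
      ...
    }

    # bacula-dir.conf -- Job resource, illustrative only
    Job {
      Name = BackupClient1
      Prefer Mounted Volumes = yes  # the default; jobs favor the mounted volume
      ...
    }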
This is the first time that I have heard this explained so
clearly. I am going to try to duplicate this problem now that you
have so clearly explained it. By the way, I am not really sure I
would classify this as a race condition, because theoretically the
SD is not blocked; the third job just waits until the Volume is
free (at least that is what I programmed). However, this is
clearly very inefficient.
I agree. It is not a race condition in the code at all. Nothing gets
stuck. It is really a misconfiguration, though the config file is
syntactically correct. I'm not sure what to call that. I suppose I
should have said the configuration has a built-in "resource
contention problem", rather than race condition. Sorry for the
confusion.
I would like to fix this, but one must keep in mind an important
difficulty with Bacula. The SD knows what is going on with
Volumes, but the Dir does not, and it is the Dir that proposes
Volumes to the SD. Currently there is no good atomic way to pass
the information from the SD back to the Dir so that it can make
better decisions.
So, with the (current) constraint that the solution must involve
changing only the SD algorithm, how could one prevent this from
happening? I have some ideas, but wonder what you think.
I think that it in fact MUST be changed only in the SD. The issue is
that the volume selection for a job needs to be atomic. Whether the
volume info is acquired from the Dir, an array in the SD, or
anywhere else, the SD must access it in a critical section in order
to serialize volume selection. I believe that ANYTHING that
changes the status of a volume or device should be handled in SD as
an atomic operation. Consider a single mutex that must be held in
order to make any changes to either a volume or a device. The status
of devices and volumes is transmitted back to Dir as part of the
mutex release. Dir then always has accurate info, because only one
job at a time can change anything. (I also consider Dir commands to
the SD to be "jobs" in this context).
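As a rough sketch of what I mean (hypothetical code, not Bacula's
actual SD source; the names sd_state_lock, VolumeState, and
DeviceState are invented for illustration):

    // Hypothetical sketch of a single global SD lock; not Bacula code.
    #include <mutex>

    static std::mutex sd_state_lock;  // guards ALL volume and device state

    struct VolumeState { /* which drive holds it, writing jobs, ... */ };
    struct DeviceState { /* mounted volume, concurrent job count, ... */ };

    // Every status change to a volume or a device goes through one
    // critical section, so volume selection is serialized across jobs.
    void update_volume_and_device(VolumeState *vol, DeviceState *dev)
    {
        std::lock_guard<std::mutex> guard(sd_state_lock);
        // ... select/assign/swap the volume, update the device ...
        // Before the lock is released, the new status would be sent
        // back to the Dir, so the Dir always sees a consistent view.
    }

The point is only the single lock; the actual state and the Dir
protocol are elided.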
I believe the current per-device locking is too fine-grained. Due to
volume selection, one device can affect another, even if only
indirectly, as in the swapping required when the same volume is
needed on two devices. A global lock simplifies
concurrency and imho makes the whole system more robust. The biggest
con is that multiple devices cannot mount/umount volumes at the same
time. As far as I know, most tape robots cannot load/unload multiple
drives simultaneously anyway, and for disk the mount/umount is only
a few ms at most, so I don't view that as a problem.
I think concurrent programming is just hard, period. :) Therefore I
prefer simplifying the serialization over squeezing out the utmost
performance. And I think a global acquisition lock in SD is the way
to do that.
Setting PreferMountedVolumes=no causes the three jobs to select
a drive that is NOT already mounted with a volume from the pool.
This allows jobs writing to the same pool to select different
volumes from the pool, rather than all selecting the same next
available volume. This has its own caveats. In some cases it does
not prevent two jobs from selecting the same volume, meaning that
they will want to swap the volume back and forth between drives,
which is another type of race condition. I have used this method
successfully for a pool containing only full backups by setting
PreferMountedVolumes=no in the job resource and
MaximumVolumeJobs=1 in the pool resource. Since Bacula selects the
volume for a job in an atomic
manner, this forces an exclusive set of volumes for each job,
thus preventing the race condition. This means that concurrency
is limited only by the number of drives, but at the "expense" of
creating a greater number of smaller volume files. I quote
"expense" because on a disk vchanger it isn't usually a big
issue to have more volume files. Doing this with a tape
autochanger would use a lot more tapes and be truly more
expensive. Of course unlimited concurrency is theoretical, since
the hardware limits the USEFUL concurrency.
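Roughly, that combination looks like this in bacula-dir.conf (an
illustrative excerpt; the resource names are examples and other
required directives are omitted):

    # bacula-dir.conf -- illustrative excerpt only
    Job {
      Name = FullBackup
      Prefer Mounted Volumes = no   # select an unused drive instead
      ...
    }

    Pool {
      Name = FullPool
      Maximum Volume Jobs = 1       # one job per volume => exclusive volumes
      ...
    }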
I really do not like the PreferMountedVolumes = No option (I have
probably said this many times), but I find your use of it very
well explained and very interesting.
Best regards,
Kern