I've been thinking about this, and the conclusion I've reached is that it's understandable that bacula doesn't fail backup or copy jobs that have failed part uploads. It's also understandable that bacula wouldn't 'fail' an admin job when that job's script runs into trouble. After all, how should bacula know what the script is doing, or whether its results are good?
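The one signal bacula does act on, as far as I can tell, is the exit status of a RunScript command: non-zero means the job 'fails'. If you want to verify that behavior on your own director before trusting my script, point a throwaway admin job's RunScript at something that always exits non-zero (the filename here is hypothetical, purely for illustration):

#!/bin/bash
# always-fail.sh (hypothetical) - wire an admin job's RunScript Command
# to this, and the job should terminate in error, turning red in
# bacularis/baculum.
echo "Simulating a detected problem."
exit 1

That exit-status behavior is the hook everything below hangs on.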
So, I've spent the weekend working on a script that uploads bacula cloud volume parts to the defined cloud resources. It tees all output from bconsole into a logfile, then, once the upload jobs are done, it parses that logfile and exits 1 with suitable error messages if it finds something bad. This makes the admin job 'fail' (after all the work is done), so it will turn red in bacularis/baculum, and presumably in Bill A's baculabackupreport script. I'm not an experienced shell scripter, so please forgive me if I've made any embarrassing errors here. I've been testing this as thoroughly as I can, and I think it all does what it's supposed to, but I'm very much open to critique and suggestions. The first draft is pasted below, and attached to this email. Please tell me what you think and how you think it can be improved.

#!/bin/bash
# ----------------------------------------------------------------------
#
# This script will upload any bacula cloud volume parts that haven't yet
# been uploaded to the cloud storage resources defined in this script's
# storage_resources variable. It is intended to be run from a bacula
# admin job.
#
# You must customize the storage_resources variable in this script to
# match your environment.
#
# In bacula, it is very easy to upload cloud volume parts.
# It is much harder to automatically raise an alert if a cloud volume
# part has failed to upload.
# Jobs with failed cloud volume part uploads terminate with status
# 'Backup OK'. I can understand why this is, but it still leaves the
# chilling possibility that a bacula admin might be completely unaware
# that their cloud backups aren't actually making it to the cloud,
# unless the admin happens to see the upload failures in the joblogs.
#
# An admin job that comes along behind a backup job and triggers an
# upload with a script like this one will also terminate with status
# 'OK', even if the upload process ran into issues and didn't complete
# successfully. The bconsole output explaining the issues will be in
# the joblog, but the script will still terminate 'OK', so as far as
# bacula is concerned, nothing is wrong.
#
# How could bacula know what a script is doing, or whether its results
# are good? It is the job of the script to indicate that something went
# wrong.
#
# That responsibility for interpreting the results of the upload task
# is where this script steps in.
#
# The vast majority of this script's code attempts to detect failed
# uploads. If any parts fail to upload, the script will exit with a
# non-zero status AFTER all upload attempts have completed. This will
# cause the bacula admin job that ran this script to 'fail', helping
# to ensure that administrators are promptly alerted. The script only
# does this after all upload tasks have completed, so the failure
# won't hinder the uploads any more than whatever issues the script
# detected already have. What it should do is put a big red angry
# checkbox in your bacularis dashboard, or in your job status report
# email: a sure sign that you should investigate the cause.
#
# Because the error detection functionality is a core feature, this
# script will refuse to continue, before the upload process even
# begins, if it is unable to write to its logfile. Without the
# script's logfile, it cannot do any error detection.
#
# I recommend that you pair this script with some other monitoring
# system, like Bill A's baculabackupreport script.
# https://github.com/waa/baculabackupreport
#
# Many thanks to Bill A, Marcin Haba, and the folks on the Bacula-users
# mailing list for all their help with this and other challenges!
#
# ----------------------------------------------------------------------
#
# Sample admin job (modify and put in your bacula-dir.conf)
#
#Job {
#  Name = "cloud-volume-part-sweeper-job"
#  Type = admin
#  Level = Full                     # Level doesn't mean anything in this context, but must be defined for the config parser.
#  Schedule = "aScheduleThatExists" # I recommend running this job nightly. Any schedule specified must exist.
#  # I am using Bill A's dummy resources, so I actually have 'dummy' storage, fileset, and pool resources defined with Name = "None".
#  # Such resources are convenient for quickly seeing at a glance whether a job is really using a specific resource type.
#  Storage = "None"                 # This job doesn't use this, but the config parser requires that it be defined, and it must exist.
#  Fileset = "None"                 # This job doesn't use this, but the config parser requires that it be defined, and it must exist.
#  Pool = "None"                    # This job doesn't use this, but the config parser requires that it be defined, and it must exist.
#  Runscript {
#    RunsWhen = before
#    RunsOnClient = no              # Default is yes, but there is no client in an admin job; admin job RunScripts *only* run on the Director :)
#    Command = "/opt/bacula/etc/cloud-volume-part-sweeper.sh" # Path to this script
#  }
#  Priority = 30
#}
#
# ----------------------------------------------------------------------
#
# This top section contains the variables you might need to edit.
# Most users will only need to fill in their storage_resources.
#
# ----------------------------------------------------------------------

# List your cloud Storage{} resources, as defined in bacula-dir.conf.
# They should be double-quoted, and space-separated if you have more
# than one cloud resource. Example:
# storage_resources=("my-resource-1" "my-resource-2" "my-resource-3")
storage_resources=("my-resource-1")

# Change these if you want the log files to have different names, or to
# go somewhere different. Make sure that bacula_user has permission to
# read and write files wherever you put them.
logfilename="cloud-volume-part-sweeper.log" # format must be "filename" with no path separator like "/"
logdir="/opt/bacula/log/"                   # format must be "/full/path/like/this/"

# Change this if bacula-dir runs under a different username than 'bacula'.
bacula_user="bacula"

# bconsole binary and config locations
bcbin="/opt/bacula/bin/bconsole"
bccfg="/opt/bacula/etc/bconsole.conf"

# ----------------------------------------------------------------------
#
# You shouldn't need to edit anything past this point.
#
# ----------------------------------------------------------------------
#
# Initial non-modifiable variables, basic housekeeping, and sanity checks
#
# ----------------------------------------------------------------------

logfile="$logdir$logfilename"

# Check that the script is being run by the bacula-dir user.
# (A plain string comparison is safer than grepping the output of
# whoami, which would also match usernames that merely contain
# $bacula_user as a substring.)
if [ "$(whoami)" != "$bacula_user" ]; then
    echo "This script must not be run by any user other than the one that runs bacula-dir!"
    echo "Please run as the user that runs bacula-dir. This is usually 'bacula'."
    echo "Edit the bacula_user variable in this script if you run bacula-dir as a user other than the default 'bacula'."
    exit 1
fi

# Cycle this script's logs. Maximum 10 logs kept at any one time.
# One logfile per run of the script.
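# For illustration: after ten or more runs, $logdir will hold the
# current logfile plus rotated copies .1 (newest) through .9 (oldest).
# The oldest copy, .9, is deleted below before everything shifts down
# by one.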
rm -f "$logfile.9"
for i in {8..1}; do
    mv "$logfile.$i" "$logfile.$((i+1))" 2>/dev/null
done
mv "$logfile" "$logfile.1" 2>/dev/null

# This script refuses to do its job rather than proceed with broken
# error checking. Safety first.
touch "$logfile"
if [ ! -w "$logfile" ]; then
    echo "Write permission denied for log $logfile."
    echo "Without this logfile, this script cannot do any error checking."
    echo "Please fix this issue. No work has been done."
    exit 1
fi

# ----------------------------------------------------------------------
#
# Check all volumes for the bacula cloud storage resources defined in
# bacula-dir.conf. Upload any volume parts that haven't yet been
# uploaded to the cloud. Volumes already uploaded will not be uploaded
# again.
#
# ----------------------------------------------------------------------

echo "Starting the process to upload any pending cloud volume parts."

# Loop through each storage resource.
for storage in "${storage_resources[@]}"; do
    echo "Checking storage resource: $storage. Uploading any volume parts that have not yet been transferred to the cloud."
    echo -e "cloud allpools storage=$storage upload\nquit\n" | "$bcbin" -c "$bccfg" | tee -a "$logfile"
done

echo "Cloud volume upload process completed."

# ----------------------------------------------------------------------
#
# Now we parse the log for various things that could have gone wrong.
#
# All of the error checks below return true (0) if an error is found,
# and false (1) if no error is found.
#
# ----------------------------------------------------------------------

echo "Checking for issues with the upload process."

# Check for special error conditions related to the logfile itself.

# Does the logfile exist, and is it non-empty?
if [ ! -s "$logfile" ]; then
    echo "Logfile $logfile is empty or doesn't exist. Without it, we cannot do any error checking."
    exit 1
fi

# Can we read the logfile?
if [ ! -r "$logfile" ]; then
    echo "No permission to read logfile $logfile. Without it, we cannot do any error checking."
    exit 1
fi

# Check the logfile text for keywords indicating various failure modes.

# Were any of the entries in our storage_resources variable invalid?
if grep -q 'Storage resource ".*": not found' "$logfile"; then
    echo "There is an invalid cloud storage resource name in this script's storage_resources variable."
    echo "Please correct the storage_resources variable in $0 to match the cloud storage resources in bacula-dir.conf."
    bad_resource=$(grep 'Storage resource ".*": not found' "$logfile")
    echo "bconsole complained about: $bad_resource"
    exit 1
fi

# Did we accidentally end up in an interactive bconsole prompt?
if grep -q "Expected a positive integer" "$logfile"; then
    echo "bconsole was given the wrong kind of input. Something is wrong with one of the commands we sent to bconsole."
    exit 1
fi

# Did any parts fail to upload?
# Bacula might output 'state=error' for other failure modes besides
# volume part upload failures.
if grep -q "state=error" "$logfile"; then
    echo "Some volume parts failed to upload, or some other error occurred."
    echo "Check this admin joblog or the logs in $logdir for details."
    exit 1
else
    echo "No issues detected."
    exit 0
fi
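If you want to dry-run it before wiring up the admin job, you can execute it by hand as the director's user and inspect the exit status (assuming you've installed it at the path used in the sample job above):

sudo -u bacula /opt/bacula/etc/cloud-volume-part-sweeper.sh
echo $?    # 0 = no issues detected, 1 = something needs investigating

A deliberately bogus entry in storage_resources is an easy way to confirm that the failure path lights up.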
Regards,

Robert Gerber
402-237-8692
r...@craeon.net