replies inline below.
> A lot of things can affect reliability, and who knows, maybe there are
> intermittent issues during backup jobs and cloud volume part uploads
> which are out of your control causing some parts to fail to be uploaded
> during the backup job itself, but those issues do not exist when you run
> the cloud upload command manually. Hard to tell.

Yeah, I figured there could be a lot of things going wrong. For all I know, Backblaze's servers are periodically misbehaving. Could also be that the S3 driver is handling some conditions ungracefully. There is a reason it's being replaced with the Amazon driver, after all.

> > I was considering setting up an admin job to periodically run 'cloud
> > allpools upload'.
>
> This is exactly what is recommended. I like to refer to it as a "cloud
> volume part sweeper job", and would run it once/day after your backups
> are complete.

I've taken the suggestions and script you provided and set up the following (I had already adopted your dummy resources :).

bacula-dir.conf:

Job {
  Name = "cloud-volume-part-sweeper-job"
  Type = admin
  Level = Full
  Schedule = "Copy-End-Of-Day"
  Storage = "None"
  Fileset = "None"
  Pool = "None"
  JobDefs = "Synology-Local"
  Runscript {
    RunsWhen = before
    RunsOnClient = no    # Default yes; there is no client in an Admin job & Admin Job RunScripts *only* run on the Director :)
    Command = "/opt/bacula/etc/cloud-volume-part-sweeper.sh"    # This can be `Console` if you wish to send console commands, but don't use this*
  }
  Priority = 30
}

cloud-volume-part-sweeper.sh:

#!/bin/bash
bcbin="/opt/bacula/bin/bconsole"
bccfg="/opt/bacula/etc/bconsole.conf"

echo "Starting the process to upload any pending cloud volume parts."

# Define an array of storage values
storage_resources=("B2-TGU-Diff" "B2-TGU-Full" "B2-TGU-Inc")

# Loop through each storage value
for storage in "${storage_resources[@]}"; do
  echo "Checking storage resource: $storage. Uploading any volume parts that have not yet been transferred to the cloud."
  echo -e "cloud allpools storage=$storage upload\nquit\n" | ${bcbin} -c ${bccfg}
done

echo "Cloud volume upload process completed."

(ChatGPT helped me with the for loop.) This all seems to work. Contrary to my first email, we actually do have to specify which storage resource to use with the command, either on the command line or interactively, in the form "cloud allpools storage=$storage_name upload".

I think it would be good to automatically populate the storage resource list using output from bconsole, but I didn't see a clear way to differentiate a local device from a cloud device, and while the bconsole command 'cloud allpools storage=$storage upload' does work with a local storage resource, I felt it was probably wasteful to spend time iterating over local volumes (a rough sketch of a workaround is below). It does seem to iterate over every volume under that storage resource, taking about 15-30 seconds minimum for each, even if all parts have been uploaded previously.

Edit: Maybe the S3 driver was slower than the Amazon driver, because when I removed the comment in front of BlobEndpoint, the new admin script swept through all 25 of my cloud volumes within 2 minutes, and that includes taking time out to upload about a gigabyte of incremental backups. Doing it this way, you can periodically check for failed uploads.
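As for populating that list automatically: the closest I could come up with is leaning on a naming convention rather than a true cloud/local distinction. A rough, untested sketch (the ".storage" dot command lists the Director's Storage resources; the "B2-" prefix is just an artifact of how my resources happen to be named):

----8<----
#!/bin/bash
# Sketch: build storage_resources from bconsole output instead of hard-coding it.
# Assumes the cloud storage resources can be recognized by name (here: "B2-*").
bcbin="/opt/bacula/bin/bconsole"
bccfg="/opt/bacula/etc/bconsole.conf"

# ".storage" prints one Storage resource name per line; keep only the cloud ones.
mapfile -t storage_resources < <(
  echo -e ".storage\nquit\n" | ${bcbin} -c ${bccfg} | grep '^B2-'
)

echo "Cloud storage resources found: ${storage_resources[*]}"
----8<----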
Interestingly (or perhaps terrifyingly), what you said about the status with which Bacula closes out a job appears to be absolutely correct.

To simulate a failure to upload, I queued a local incremental backup, commented out the BlobEndpoint directive for the cloud incremental resource, reloaded the SD, ran the copy control job, and then the part sweeper job. The result was that the copy control job, the actual copy job, and the part sweeper job all terminated with normal status T, even though the job logs for the copy job and the admin job showed the failure to upload.

On the copy job failure, Bacularis turned the messages icon red to indicate that something was wrong, and also highlighted the failure-to-upload messages in red:

22-Feb 11:56 td-bacula-sd JobId 102: Cloud Upload transfers:
22-Feb 11:57 td-bacula-sd JobId 102: Error: B2-TGU-Inc-0050/part.1 state=error retry=10/10 size=263 B duration=56s msg=/opt/bacula/plugins/aws_cloud_driver upload B2-TGU-Inc-0050 part.1. Invalid endpoint: https://s3..amazonaws.com Child exited with code 1

However, I truncated the messages icon log in Bacularis and found that the failure-to-upload errors shown in the admin job did not turn the messages icon red, and the error lines from the admin script were not highlighted. This makes sense, I suppose, since the admin job is running a script, and processing the text from the script is outside Bacularis' scope. The error lines from the admin script that were not highlighted red looked like this:

td-bacula-dir JobId 103: BeforeJob: 3000 Cloud Upload: B2-TGU-Inc-0050/part.1 state=error retry=10/10 size=263 B duration=53s msg=/opt/bacula/plugins/aws_cloud_driver upload B2-TGU-Inc-0050 part.1.
td-bacula-dir JobId 103: BeforeJob: Invalid endpoint: https://s3..amazonaws.com Child exited with code 1
td-bacula-dir JobId 103: BeforeJob: 3000 Cloud Upload: B2-TGU-Inc-0050/part.2 state=error retry=10/10 size=447.4 MB duration=55s msg=/opt/bacula/plugins/aws_cloud_driver upload B2-TGU-Inc-0050 part.2.
td-bacula-dir JobId 103: BeforeJob: Invalid endpoint: https://s3..amazonaws.com Unknown error during program execvp
td-bacula-dir JobId 103: BeforeJob: The volume "B2-TGU-Inc-0050" has been uploaded

Interestingly, during a manual upload like we're doing with this script, bconsole seems to declare that the volume that just failed to upload has successfully uploaded, regardless of whether that is true. So I doubt there is any sort of job or process status that could help us here. I did browse through the bucket with Cyberduck and have confirmed that no part of volume B2-TGU-Inc-0050 has been uploaded.

So in short, I will need to find a way to monitor these part uploads for failures. I had hoped that the admin job would at least throw an error, but it seems it does not. bconsole itself doesn't throw an error and even declares that a volume was successfully uploaded when that isn't and can't be true.

> Manually, in bconsole:
> ----8<----
> * ll joblog jobid=<idOfAdminJoblog>
> ----8<----
>
> Could even be automated and scripted sed/grep/awk from the job log output
> as above, or from the bacula.log files, and send you an email or similar...

This will probably be necessary, since the joblogs and the Bacularis messages icon are the ONLY places where these errors are flagged for review, and the joblogs for the admin jobs aren't highlighted red, meaning they don't jump out when scrolling over 1000+ lines of text. The Bacularis messages eventually scroll out of the buffer, and I think the messages icon then turns green again. So if a copy job fails to upload to the cloud, and then that part fails to upload again and again forever, once the copy job in which the failure occurred scrolls out of the buffer, there will be no obvious visual indicator that something is wrong.
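A first pass at that sed/grep/awk idea might look something like this (untested sketch; you'd pass it the jobid of the part sweeper admin job, and the recipient address is just a placeholder):

----8<----
#!/bin/bash
# Sketch: grep one job's joblog for failed cloud part uploads and mail them.
# Usage: check-cloud-uploads.sh <jobid-of-part-sweeper-admin-job>
bcbin="/opt/bacula/bin/bconsole"
bccfg="/opt/bacula/etc/bconsole.conf"
jobid="$1"
recipient="you@example.com"   # placeholder - change to your alert address

errors=$(echo -e "ll joblog jobid=${jobid}\nquit\n" | ${bcbin} -c ${bccfg} \
  | grep -E 'state=error|Invalid endpoint')

if [ -n "${errors}" ]; then
  echo "${errors}" | mail -s "Bacula: cloud part upload errors in jobid ${jobid}" "${recipient}"
fi
----8<----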
> > The bigger problem seems to be that I don't know how to monitor whether
> > a part subsequently uploaded successfully.
> >
> > I am working on deploying Bill A's job report script, and that's likely
> > going to be my monitoring solution.
>
> See above... But, this gets me thinking about a new feature for my
> reporting script. :)
>
> Currently it does nothing with regards to cloud part uploads, so I may
> take a look at adding an option to warn when parts fail to upload.
> Hmmm, I really like this idea. :)

I would very much appreciate it if you could look into this, because right now this seems to be a big hole in the whole monitoring chain.

> > Maybe a bad exit status for an admin job tasked with uploading any
> > leftover part files could send up a flag that something is wrong and a
> > part file hasn't been uploaded?
>
> Yeah, this is really not possible from what I can see.
>
> If a cloud backup job backs up everything, the Job will terminate "Backup
> OK" - even if there are cloud part upload errors. I am pretty sure it does
> not even terminate "Backup OK -- with warnings" to give you any indication.
> I will need to double-check, or you already can since you have some of
> these jobs where this happened. :)

From the testing I did today, I think you are 100% right.

> The thought here is along the lines of "the backup succeeded, and your
> data is backed up and is safely under Bacula's control", so there is no
> consideration that the parts may not have made it to the cloud.

I honestly think this behavior from Bacula is somewhat shortsighted. I can easily imagine someone trusting this to 'just work' and then being horrifically surprised when it turns out something went wrong with cloud uploads three years ago and somehow no one saw it. Bacularis does the best it can to draw attention to these failures, but outside of that Bacula doesn't seem to really mind. Obviously people need to monitor their setups, but I would like to see that be easier to set up.

On the other hand, this behavior is somewhat good, since jobs routinely failing to upload parts could generate many, many alert emails even if subsequent part sweeper admin jobs had later resolved the issue. So this isn't as bad as I'm making it out to be, unless we totally fail to notice the issue in the admin jobs. Some users will choose to upload at a later time, so it will be normal for them not to upload anything at all during the initial jobs.

I think the least invasive thing with the fewest false alarms could be to monitor the output of a cloud part sweeper script. So if a job fails to upload a part, nothing turns red unless it also fails to upload in the part sweeper script. However, this assumes that people HAVE a part sweeper job, which they might not (even though they very much should).

In fact, I wonder if a part sweeper script could monitor its own output (maybe with tee?) and then exit non-zero if it found something bad? Obviously it would be bad for such a script to refuse to do its job because one thing went wrong; that could arguably be worse than a few failing part uploads going unnoticed. Perhaps right at the end, after iterating through every upload task for each storage loop, a grep of bacula.log (or of the script's own captured output) for a handful of keywords could trigger a non-zero exit, just to make a point? "Turn me red, daddy."
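Something along these lines, maybe (an untested sketch that greps the captured bconsole output rather than bacula.log; whether a non-zero exit from a "RunsWhen = before" script is actually enough to flip the Admin job to an error status is exactly the part I'd still need to verify):

----8<----
#!/bin/bash
# Sketch: self-monitoring variant of cloud-volume-part-sweeper.sh.
# It still sweeps every storage resource first, and only at the very end
# exits non-zero if any upload reported an error.
bcbin="/opt/bacula/bin/bconsole"
bccfg="/opt/bacula/etc/bconsole.conf"
storage_resources=("B2-TGU-Diff" "B2-TGU-Full" "B2-TGU-Inc")
failures=0

for storage in "${storage_resources[@]}"; do
  echo "Checking storage resource: $storage."
  output=$(echo -e "cloud allpools storage=$storage upload\nquit\n" | ${bcbin} -c ${bccfg})
  echo "${output}"   # keep the full bconsole output visible in the joblog
  if echo "${output}" | grep -qE 'state=error|Invalid endpoint'; then
    failures=$((failures + 1))
  fi
done

echo "Cloud volume upload process completed."

if [ "${failures}" -gt 0 ]; then
  echo "WARNING: part upload errors detected while sweeping ${failures} storage resource(s)."
  exit 1   # "Turn me red, daddy."
fi
----8<----

No tee actually needed if the output is captured and re-echoed like that, though tee would work just as well.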
> Hope this helps,

It very much did, thank you!

Regards,
Robert Gerber
402-237-8692
r...@craeon.net
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users