replies inline below.



> A lot of things can affect reliability, and who knows, maybe there are
> intermittent issues during backup jobs and cloud
> volume part uploads which are out of your control causing some parts to
> fail to be uploaded during the backup job itself, but
> those issues do not exist when you run the cloud upload command manually.
> Hard to tell.
>
Yeah, I figured there could be a lot of things going wrong. For all I know,
Backblaze's servers are periodically misbehaving. It could also be that the
S3 driver handles some conditions ungracefully; there is a reason it's
being replaced with the Amazon driver, after all.

>
>
>  > I was considering setting up an admin job to periodically run 'cloud
> allpools upload'.
>
> This is exactly what is recommended. I like to refer to it as a "cloud
> volume part sweeper job", and would run it once/day after your backups
> are complete.
>
>  I've taken the suggestions and script you provided and set up the
following (I had already adopted your dummy resources :).
bacula-dir.conf:
Job {
  Name = "cloud-volume-part-sweeper-job"
  Type = Admin
  Level = Full
  Schedule = "Copy-End-Of-Day"
  Storage = "None"
  Fileset = "None"
  Pool = "None"
  JobDefs = "Synology-Local"
  Runscript {
     RunsWhen = before
     RunsOnClient = no  # Default yes, there is no client in an Admin job & Admin Job RunScripts *only* run on the Director :)
     Command = "/opt/bacula/etc/cloud-volume-part-sweeper.sh"  # This can be `Console` if you wish to send console commands, but don't use this
  }
  Priority = 30
}

cloud-volume-part-sweeper.sh:
#!/bin/bash

bcbin="/opt/bacula/bin/bconsole"
bccfg="/opt/bacula/etc/bconsole.conf"

echo "Starting the process to upload any pending cloud volume parts."

# Define an array of storage values
storage_resources=("B2-TGU-Diff" "B2-TGU-Full" "B2-TGU-Inc")

# Loop through each storage value
for storage in "${storage_resources[@]}"; do
    echo "Checking storage resource: $storage. Uploading any volume parts that have not yet been transferred to the cloud."
    echo -e "cloud allpools storage=$storage upload\nquit\n" | "${bcbin}" -c "${bccfg}"
done

echo "Cloud volume upload process completed."
(ChatGPT helped me with the for loop.)
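
For a quick one-off test, the admin job can also be kicked off by hand
from bconsole:

----8<----
* run job=cloud-volume-part-sweeper-job yes
----8<----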

This all seems to work. Contrary to my first email, we actually have to
specify which storage resource to use with the command, either on the
command line or interactively, in the form "cloud allpools
storage=$storage_name upload".
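
For example, running it interactively in bconsole against one of my
storage resources looks like:

----8<----
* cloud allpools storage=B2-TGU-Inc upload
----8<----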

I think it would be good to automatically populate the storage resource
list using output from bconsole, but I didn't see a clear way to
differentiate a local device from a cloud device, and while the bconsole
command 'cloud allpools storage=$storage upload' does work with a local
storage resource, I felt it was probably wasteful to spend time iterating
over local volumes. It does seem to iterate over every volume under that
storage resource, taking a minimum of about 15-30 seconds per volume, even
if all parts have been uploaded previously.
Edit: Maybe the S3 driver was slower than the Amazon driver, because when
I removed the comment in front of BlobEndpoint, the new admin script swept
through all 25 of my cloud volumes within 2 minutes, and that includes
taking time out to upload about a gigabyte of incremental backups.
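
If I do revisit the auto-populate idea, something like this rough sketch
might work. It assumes the cloud storage resources share a naming prefix
("B2-" in my case), since I don't see anything in the bconsole output that
marks a resource as cloud rather than local:

# bcbin / bccfg as defined in the sweeper script above.
# List every defined storage resource with bconsole's .storage dot command,
# then keep only the ones matching the (assumed) cloud naming prefix.
mapfile -t storage_resources < <(
    echo -e ".storage\nquit\n" | "${bcbin}" -c "${bccfg}" | grep '^B2-'
)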

> Doing it this way, you can periodically check for failed uploads.
>
Interestingly (or perhaps terrifyingly), what you said about the status
with which Bacula closes out a job appears to be absolutely correct. To
simulate a failure to upload, I queued a local incremental backup,
commented out the BlobEndpoint directive for the cloud incremental
resource, reloaded the SD, ran the copy control job, and then the part
sweeper job. The result was that the copy control job, the actual copy
job, and the part sweeper job all terminated with normal status T, even
though the job logs for the copy job and the admin job showed the failure
to upload. On the copy job failure, Bacularis turned the messages icon red
to indicate that something was wrong, and also highlighted the
failure-to-upload messages in red:

22-Feb 11:56 td-bacula-sd JobId 102: Cloud Upload transfers:
22-Feb 11:57 td-bacula-sd JobId 102: Error: B2-TGU-Inc-0050/part.1
   state=error   retry=10/10 size=263 B duration=56s
   msg=/opt/bacula/plugins/aws_cloud_driver upload B2-TGU-Inc-0050 part.1.
   Invalid endpoint: https://s3..amazonaws.com Child exited with code 1


However, I truncated the messages icon log in Bacularis and found that the
failure-to-upload errors shown in the admin job did not turn the messages
icon red, and the error lines from the admin script were not highlighted.
This makes sense, I suppose, since the admin job is running a script, and
processing the text from the script is outside Bacularis' scope. The error
lines from the admin script that were not highlighted red looked like this:

td-bacula-dir JobId 103: BeforeJob: 3000 Cloud Upload:
B2-TGU-Inc-0050/part.1     state=error   retry=10/10 size=263 B
duration=53s msg=/opt/bacula/plugins/aws_cloud_driver upload
B2-TGU-Inc-0050 part.1.
td-bacula-dir JobId 103: BeforeJob: Invalid endpoint:
https://s3..amazonaws.com Child exited with code 1
td-bacula-dir JobId 103: BeforeJob: 3000 Cloud Upload:
B2-TGU-Inc-0050/part.2     state=error   retry=10/10 size=447.4 MB
duration=55s msg=/opt/bacula/plugins/aws_cloud_driver upload
B2-TGU-Inc-0050 part.2.
td-bacula-dir JobId 103: BeforeJob: Invalid endpoint:
https://s3..amazonaws.com Unknown error during program execvp
td-bacula-dir JobId 103: BeforeJob: The volume "B2-TGU-Inc-0050" has
been uploaded


Interestingly, during a manual upload like the one this script performs,
bconsole seems to declare that the volume that just failed to upload has
been uploaded successfully, regardless of whether that is true. So I doubt
there is any sort of job or process status that could help us here.

I did browse through the bucket with Cyberduck and have confirmed that no
part of volume B2-TGU-Inc-0050 has been uploaded.

So, in short, I will need to find a way to monitor these part uploads for
failures. I had hoped that the admin job would at least throw an error,
but it seems it does not. bconsole itself doesn't throw an error either,
and even declares that a volume was successfully uploaded when that isn't,
and can't be, true.

>
> Manually, In bconsole:
> ----8<----
> * ll joblog jobid=<idOfAdminJoblog>
> ----8<----
>
> Could even be automated and scripted sed/grep/awk from the job log output
> as above, or from the bacula.log files, and send
> you an email or similar...
>
This will probably be necessary, since the job logs and the Bacularis
messages icon are the ONLY places where these errors are flagged for
review, and the job logs for the admin jobs aren't highlighted red,
meaning they don't jump out when scrolling through 1000+ lines of text.
The Bacularis messages eventually scroll out of the buffer, and I think
the messages icon then turns green again. So if a copy job fails to upload
to the cloud, and that part then fails to upload again and again forever,
once the copy job in which the failure occurred scrolls out of the buffer,
there will be no obvious visual indicator that something is wrong.
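
Scripting something like what you describe is probably where I'll land. A
rough sketch (the log path and the `mail` command are assumptions about my
setup, and the 'state=error' string comes from the job log excerpts above):

#!/bin/bash
# Grep the Director's log for failed cloud part uploads and mail any hits.
logfile="/opt/bacula/log/bacula.log"   # assumed location of the Messages log
recipient="you@example.com"            # placeholder address

errors=$(grep 'state=error' "$logfile")
if [ -n "$errors" ]; then
    printf '%s\n' "$errors" | mail -s "Bacula: cloud part upload failures" "$recipient"
fi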

>
>
> > The bigger problem seems to be that I don't know how to monitor whether
> a part subsequently uploaded successfully.
>  > I am working on deploying Bill A's job report script, and that's likely
> going to be my monitoring solution.
>
> See above... But, this gets me
>  thinking about a new feature for my reporting script. :)
>
> Currently it does nothing with regards to cloud part uploads, so I may
> take a look at adding an option to warn when parts
> fail to upload. Hmmm, I really like this idea.  :)
>
I would very much appreciate it if you could look into this, because right
now this seems to be a big hole in the whole monitoring chain.

>
>
> > Maybe a bad exit
> > status for an admin job tasked with uploading any leftover part files
> could send up a flag that something is wrong and a part
> > file hasn't been uploaded?
>
> Yeah, this is really not possible from what I can see.
>
> If a cloud backup job backs up everything, the Job will terminate "Backup
> OK" - even if there are cloud part upload errors. I
> am pretty sure it does not even terminate "Backup OK -- with warnings" to
> give you any indication. I will need to
> double-check, or you already can since you have some of these jobs where
> this happened. :)
>
From the testing I did today, I think you are 100% right.

>
> The thought here is along the lines of "the backup succeeded, and your
> data is backed up and is safely under Bacula's control", so there is no
> consideration that the parts may not have made it to the cloud.
>
I honestly think this behavior from Bacula is somewhat shortsighted. I can
easily imagine someone trusting this to 'just work' and then being
horrifically surprised when it turns out something went wrong with cloud
uploads three years ago and somehow no one saw it. Bacularis does the best
it can to draw attention to these failures, but beyond that, Bacula
doesn't seem to really mind. Obviously people need to monitor their
setups, but I would like to see that be easier to set up.

On the other hand, this behavior is somewhat reasonable, since jobs
routinely failing to upload parts could generate many, many alert emails
even if subsequent part sweeper admin jobs later resolved the issue. So
this isn't as bad as I'm making it out to be, unless we totally fail to
notice the issue in the admin jobs. Some users will choose to upload at a
later time, so for them it will be normal not to upload anything at all
during the initial jobs.

I think the least invasive approach, with the fewest false alarms, would
be to monitor the output of a cloud part sweeper script. That way, if a
job fails to upload a part, nothing turns red unless the part also fails
to upload in the part sweeper script. However, this assumes that people
HAVE a part sweeper job, which they might not (even though they very much
should).

In fact, I wonder if a part sweeper script could monitor its own output
(maybe with tee?) and then exit non-zero if it found something bad.
Obviously it would be bad for such a script to refuse to finish its job
because one thing went wrong; that could arguably be worse than a few
failing part uploads going unnoticed. Perhaps right at the end, after the
upload command has been run for every storage resource, it could grep
bacula.log (or its own output) for a few keywords and exit non-zero just
to make a point? A rough sketch of the idea is below.
"Turn me red, daddy."

>
>
> Hope this helps,
>
 It very much did, thank you!

Regards,
Robert Gerber
402-237-8692
r...@craeon.net
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
