>>>>> On Fri, 14 Nov 2025 13:25:38 +0100, Arno Lehmann via Bacula-users said:
>
> Hi Chris,
>
> On 14.11.2025 at 11:42, Chris Wright wrote:
> <snip detailed description>
> > all seems well for a time, then throughput drops to essentially zero
>
> The time things go well -- can you more or less reproduce that, or can
> you even identify a time or a number of bytes transferred beyond which
> things slow down?
>
> > - SD1 will have a single CPU pegged at 100%, with minimal IO traffic
> > (both ops and bandwidth) from the open volume file. We get spikes of
> > good speed, but average throughput after leaving a job running for a
> > week is <1 MiB/sec.
> > - SD2 is quiet, happily handling normal backup jobs from other
> > clients with normal performance.
> >
> > If we start a second, parallel copy job, we get similar initially
> > good throughput and then peg a second CPU on SD1 at 100%, but there
> > isn't exactly a big jump in performance.
>
> You could try to identify potential problem points by experimenting with
> different job sizes, different directions, and different storage
> targets. I like using FIFO storage backed by /dev/null plus a pool where
> I disable cataloging of files. Just make sure you never send actual
> backups there, or migration jobs...
>
> > There are no warnings/errors being logged and everything appears to
> > be "working", just glacially slow and apparently totally bottlenecked
> > on whatever that single CPU thread is doing, with minimal reads from
> > the volumes.
> >
> > Any suggestions on where to look for the root cause here?
>
> Not quite a root cause, but I'd start with tracing the SD activities
> and/or system calls. strace with timestamps in my experience can help a
> lot in identifying underlying issues, but will probably create rather
> unwieldy output.
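A minimal sketch of the strace approach Arno describes, assuming strace is installed and this is run as root on SD1; the process name, duration, and output path are examples, not fixed values:

```shell
# Trace a busy bacula-sd for 30 seconds with per-call timestamps.
# Hypothetical paths; adjust the process name if your package differs.
SD_PID=$(pidof bacula-sd || true)
if [ -n "$SD_PID" ]; then
    # -f follows all threads, -tt adds microsecond timestamps, and -T
    # shows time spent inside each syscall -- useful to see whether the
    # pegged thread is stuck in syscalls or barely making any at all.
    timeout 30 strace -f -tt -T -p "$SD_PID" -o /tmp/bacula-sd.strace
else
    echo "bacula-sd is not running on this host"
fi
```

If the output for the pegged thread turns out to be sparse, that is itself a clue: the time is being spent in userspace, which points toward the gdb backtrace approach suggested below rather than an I/O problem.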
You could also attach gdb to the bacula-sd process when it is using 100%
CPU and use the gdb command "thread apply all bt" to get backtraces from
all the threads, then use the "detach" command to let the process run
again. That will work best if bacula-sd has debug symbols available
(either built in or from a separate package). Doing that a few times
might reveal a pattern in what it is using the CPU for.

The output of the "status storage" bconsole command, run a few times
while the SD is at 100%, might be interesting as well. I think that will
show the position in the volume being read, which will show its
progress.

__Martin

_______________________________________________
Bacula-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-users
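Martin's gdb procedure can be sketched as a small script, assuming gdb and the bacula-sd debug symbols are installed; the process name and output path are illustrative:

```shell
# Snapshot backtraces of every bacula-sd thread, then let it run again.
# gdb -batch detaches automatically at the end; the explicit "detach"
# just makes the intent obvious.
SD_PID=$(pidof bacula-sd || true)
if [ -n "$SD_PID" ]; then
    gdb -batch -p "$SD_PID" \
        -ex "thread apply all bt" \
        -ex "detach" > "/tmp/sd-bt-$(date +%s).txt" 2>&1
else
    echo "bacula-sd is not running on this host"
fi
```

Running this a few times ten or so seconds apart and comparing the snapshots should show which frames the busy thread keeps returning to; those recurring frames are the hot code path.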
