Re: [Bacula-users] Solution for a strange network error affecting long running backups

kern Sat, 08 Aug 2020 02:34:22 -0700

Hello Josh,

You did a really nice job tracking this down.

Bottom line for me: these kinds of network software errors are really
scary :-(

Best regards,
Kern

PS: the new code that allows Bacula to reconnect is now complete. However,
it will be some time before it is released, because we do lots of
testing -- especially stress testing.

Sent from my Samsung Galaxy smartphone.

-------- Original message --------
From: Josh Fisher <jfis...@jaybus.com>
Date: 8/5/20 18:58 (GMT+01:00)
To: bacula-users@lists.sourceforge.net
Subject: [Bacula-users] Solution for a strange network error affecting long running backups

Thought I would share my experience debugging a strange
problem affecting long-running backups. A particular job was failing with a
"network send error to SD" somewhere between 90 and 120 minutes in, after
hundreds of GB had been written. Enabling heartbeat on Dir, SD, and client had
no effect and the problem persisted.

Some background. The client was a KVM VM running Centos 7 and
Bacula 9.6.5. Bacula SD and Dir run together on one node of a
Pacemaker-Corosync cluster, also Centos 7 and Bacula 9.6.5. Bacula server
daemons can failover successfully for incremental backups, but not for full (no
dual-port backup devices). Cluster uses a mix of DRBD volumes and iSCSI LUNs.
There are three networks involved; one dedicated to DRBD, one dedicated to
iSCSI, and a LAN connecting everything else. There were no obvious problems
with any other VMs or cluster nodes. There didn't appear to be any networking
issues. In both VM and cluster nodes, OS is Centos 7.8 with stock Centos kernel
3.10.0-1127.13.1 and qemu-kvm 1.5.3-173.

I have had issues with Bacula jobs
failing due to intermittent network issues in the past and they turned out to
be either hardware errors or buggy NIC drivers. Therefore, the first thing I
tried was moving the client VM to run on the same cluster node that the Bacula
daemons were running on. This way the VM's virtual NIC and the cluster node's
physical NIC are attached to the same Linux bridge interface, so traffic
between the two should never go on the wire, eliminating the possibility of
switch, wiring, and other external hardware problems. No luck. Exactly the same
problem.

Next I turned on debugging for the SD. This produced a tremendous
amount of logging with no errors or warnings until after several hundred GB of
data was received from the client and suddenly there was a bad packet received,
causing the connection to be dropped. The Bacula log didn't lie. There was
indeed a network send error. But why?

Not having any knowledge of the internals
of the Linux bridge device code, I thought that perhaps the host's physical NIC,
also attached to the bridge that bacula-sd is listening on, might somehow cause
a problem. To eliminate that, I swapped the NIC in that host. I didn't have a
different type of NIC to try, so it was replaced with another Intel i350 and of
course used the same igb driver. Didn't work, but shows that it's not likely a
NIC hardware error. Could a bug in the igb driver cause this? Maybe, but the
NIC appeared to work flawlessly for everything else on the cluster node,
including a web server VM connected to it through the same bridge. Or could it
be the virtio_net driver? Again, it appears to work fine for the web server VM,
but let's start with the virtio_net driver, since virtio_net (the client VM) is
the sender and igb (bacula-sd listening on the cluster node's physical NIC) is
the receiver.

So I searched for virtio-net and/or qemu-kvm network problems. I
didn't find anything like this, exactly, but I did find that people reported VM
network performance problems and latency issues and that, several qemu-kvm
versions ago, the solution was to disable some TCP offloading features. I
didn't have high expectations, but I disabled segmentation offload (TCP and
UDP), as well as generic receive offload, on all involved NICs, started the job
again, and SURPRISE, it worked! Ran for almost 3 hours, backing up 700 GB
compressed and had no errors.

Conclusion: There is a bug somewhere! I think
maybe the checksum calculation is failing when segmentation offload is enabled.
It seems that checksum offload works so long as segmentation offload is
disabled. I didn't try disabling checksum offload and re-enabling segmentation
offload, nor did I try re-enabling generic receive offload.

To disable segmentation offload and generic receive offload I used:

    /sbin/ethtool -K ethX tso off gso off gro off

I disabled those on all interfaces involved. It may only be necessary to do this
on one of the involved interfaces. I don't know. I just don't have time to try
all permutations, and this seems to work with little or no performance
degradation, at least in my case.

Once again, Bacula shows itself to be the most
demanding network app that I know of, and so able to trigger all of the most
obscure and intermittent networking
problems.
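
For anyone wanting to apply the ethtool command above to several interfaces at
once, here is a minimal sketch. It is not part of the original report. The
function name disable_offloads, the DRY_RUN variable, and the interface names
in the usage note are assumptions made up for illustration; substitute the
interfaces actually involved (physical NIC, bridge, VM tap device, and the
NIC inside the guest).

```shell
#!/bin/sh
# Sketch: turn off TSO, GSO and GRO on each interface named on the
# command line, using the same ethtool invocation quoted in the message.
# DRY_RUN is a variable invented for this sketch: when set to 1 the
# commands are only printed, so they can be reviewed before running
# them for real as root.

disable_offloads() {
    for ifname in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Print the command instead of executing it.
            echo "/sbin/ethtool -K $ifname tso off gso off gro off"
        else
            /sbin/ethtool -K "$ifname" tso off gso off gro off
        fi
    done
}
```

With the function loaded, "DRY_RUN=1 disable_offloads eth0 br0" (hypothetical
interface names) prints one ethtool command per interface without changing
anything. The current state of an interface can be checked with
"/sbin/ethtool -k ethX" (lowercase k), which lists the offload features and
whether each is on or off. Settings made with ethtool -K are not kept across a
reboot unless the distribution's network configuration is set up to reapply
them.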

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
