Thought I would share my experience debugging a strange problem
affecting long-running backups. A particular job was failing with a
"network send error to SD" somewhere between 90 and 120 minutes in,
after hundreds of GB had been written. Enabling heartbeat on the Dir,
SD, and client had no effect; the problem persisted.
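(In case it's useful to anyone: heartbeat is enabled with the Heartbeat
Interval directive in each daemon's config. The 60 seconds below is just
an example value, not necessarily what I used.)

# bacula-dir.conf (Director resource)
Heartbeat Interval = 60
# bacula-sd.conf (Storage resource)
Heartbeat Interval = 60
# bacula-fd.conf (FileDaemon resource)
Heartbeat Interval = 60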
Some background: the client was a KVM VM running CentOS 7 and Bacula
9.6.5. The Bacula SD and Dir run together on one node of a
Pacemaker/Corosync cluster, also CentOS 7 and Bacula 9.6.5. The Bacula
server daemons can fail over successfully for incremental backups, but
not for fulls (no dual-port backup devices). The cluster uses a mix of
DRBD volumes and iSCSI LUNs. There are three networks involved: one
dedicated to DRBD, one dedicated to iSCSI, and a LAN connecting
everything else. There were no obvious problems with any other VMs or
cluster nodes, and there didn't appear to be any networking issues. On
both the VM and the cluster nodes, the OS is CentOS 7.8 with the stock
CentOS kernel 3.10.0-1127.13.1 and qemu-kvm 1.5.3-173.
I have had Bacula jobs fail due to intermittent network problems in the
past, and they turned out to be either hardware errors or buggy NIC
drivers. Therefore, the first thing I tried was moving the client VM to
run on the same cluster node that the Bacula daemons were running on.
This way the VM's virtual NIC and the cluster node's physical NIC are
attached to the same Linux bridge interface, so traffic between the two
should never go on the wire, eliminating the possibility of switch,
wiring, and other external hardware problems. No luck. Exactly the same
problem.
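(To confirm the two really were on the same bridge, something like the
following works; br0 and the domain name "client" are placeholders for
my actual names:

virsh domiflist client    # shows which bridge the VM's vnetX tap is attached to
brctl show br0            # or: bridge link show

Both the physical NIC and the VM's vnetX tap should be listed under the
same bridge.)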
Next I turned on debugging for the SD. This produced a tremendous amount
of logging with no errors or warnings until, after several hundred GB of
data had been received from the client, suddenly a bad packet was
received, causing the connection to be dropped. The Bacula log didn't
lie: there was indeed a network send error. But why?
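(For anyone who wants to try the same thing, SD debugging can be turned
on from bconsole with the setdebug command; the level and the storage
name below are only examples, not necessarily what I used:

echo "setdebug level=200 trace=1 storage=File1" | bconsole

With trace=1 the debug output goes to a .trace file in the SD's working
directory instead of the console.)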
Not having any knowledge of the internals of the Linux bridge device
code, I thought that perhaps the host's physical NIC, also attached to
the bridge that bacula-sd is listening on, might somehow cause a
problem. To eliminate that, I swapped the NIC in that host. I didn't
have a different type of NIC to try, so it was replaced with another
Intel i350, which of course uses the same igb driver. That didn't help,
but it does show that a NIC hardware error is unlikely. Could a bug in
the igb driver cause this? Maybe, but the NIC appeared to work
flawlessly for everything else on the cluster node, including a web
server VM connected to it through the same bridge. Or could it be the
virtio_net driver? Again, it appears to work fine for the web server VM,
but let's start with the virtio_net driver, since virtio_net (the client
VM) is the sender and igb (bacula-sd listening on the cluster node's
physical NIC) is the receiver.
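(The driver behind each interface is easy to confirm with ethtool; the
interface names here are just examples:

ethtool -i eth0    # on the cluster node: reports driver: igb for the i350 port
ethtool -i eth0    # inside the client VM: reports driver: virtio_net)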
So I searched for virtio-net and/or qemu-kvm network problems. I didn't
find anything exactly like this, but I did find reports of VM network
performance and latency problems for which, several qemu-kvm versions
ago, the solution was to disable some TCP offloading features. I didn't
have high expectations, but I disabled segmentation offload (TSO and
GSO), as well as generic receive offload, on all involved NICs, started
the job again, and SURPRISE, it worked! It ran for almost 3 hours,
backed up 700 GB compressed, and had no errors.
Conclusion: there is a bug somewhere! I think the checksum calculation
may be failing when segmentation offload is enabled; it seems that
checksum offload works as long as segmentation offload is disabled. I
didn't try disabling checksum offload and re-enabling segmentation
offload, nor did I try re-enabling generic receive offload.
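(If anyone wants to compare settings, the relevant offload features can
be inspected with ethtool -k; the interface name is again just an
example:

ethtool -k eth0 | egrep 'tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload|checksumming')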
To disable those offload features I used:
/sbin/ethtool -K ethX tso off gso off gro off
I disabled those on all interfaces involved. It may only be necessary to
do this on one of the involved interfaces; I don't know, and I just
don't have time to try all the permutations. This seems to work with
little or no performance degradation, at least in my case.
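One caveat: ethtool settings don't survive a reboot. On CentOS 7 with
the legacy network service I believe they can be made persistent with
ETHTOOL_OPTS in the interface's ifcfg file (eth0 is a placeholder);
interfaces managed by NetworkManager would need a dispatcher script
instead:

# /etc/sysconfig/network-scripts/ifcfg-eth0
ETHTOOL_OPTS="-K ${DEVICE} tso off gso off gro off"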
Once again, Bacula shows itself to be the most demanding network
application I know of, and thus able to trigger the most obscure and
intermittent networking problems.