Thought I would share my experience debugging a strange problem affecting long-running backups. A particular job was failing with a "network send error to SD" between 90 and 120 minutes in, after hundreds of GB had been written. Enabling heartbeat on the Dir, SD, and client had no effect; the problem persisted.
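
For anyone wanting to try the same thing, this is roughly what "enabling heartbeat" means in the configs (a minimal sketch; the resource names and the 60-second interval are placeholders for your own setup):

# bacula-dir.conf -- Director resource
Director {
  Name = bacula-dir              # placeholder
  # ... other directives ...
  Heartbeat Interval = 60        # send a keepalive every 60 seconds
}

# bacula-sd.conf -- Storage resource
Storage {
  Name = bacula-sd               # placeholder
  # ... other directives ...
  Heartbeat Interval = 60
}

# bacula-fd.conf -- FileDaemon resource on the client
FileDaemon {
  Name = client-fd               # placeholder
  # ... other directives ...
  Heartbeat Interval = 60
}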

Some background: the client was a KVM VM running CentOS 7 and Bacula 9.6.5. The Bacula SD and Dir run together on one node of a Pacemaker/Corosync cluster, also CentOS 7 and Bacula 9.6.5. The Bacula server daemons can fail over successfully for incremental backups, but not for full backups (no dual-port backup devices). The cluster uses a mix of DRBD volumes and iSCSI LUNs. There are three networks involved: one dedicated to DRBD, one dedicated to iSCSI, and a LAN connecting everything else. There were no obvious problems with any other VMs or cluster nodes, and there didn't appear to be any networking issues. On both the VM and the cluster nodes, the OS is CentOS 7.8 with the stock CentOS kernel 3.10.0-1127.13.1 and qemu-kvm 1.5.3-173.

In the past I have had Bacula jobs fail due to intermittent network problems that turned out to be either hardware errors or buggy NIC drivers. So the first thing I tried was moving the client VM to run on the same cluster node as the Bacula daemons. That way the VM's virtual NIC and the cluster node's physical NIC are attached to the same Linux bridge, so traffic between the two should never go out on the wire, eliminating the possibility of switch, wiring, and other external hardware problems. No luck. Exactly the same failure.
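
For what it's worth, the move and the check can be done along these lines (a sketch; the guest name, host name, bridge, and interface names are placeholders):

# Live-migrate the client VM onto the node running the Bacula daemons
virsh migrate --live clientvm qemu+ssh://bacula-node/system

# Confirm the VM's tap device and the node's physical NIC sit on the same bridge
brctl show br0
# or, with iproute2:
bridge link show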

Next I turned on debugging for the SD. This produced a tremendous amount of logging, with no errors or warnings until several hundred GB of data had been received from the client; then suddenly a bad packet was received, causing the connection to be dropped. So the Bacula log didn't lie. There was indeed a network send error. But why?
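
In case it helps anyone reproduce this, SD debugging can be raised along these lines (a sketch; the storage name and level 200 are just what I would reach for, not necessarily what you need):

# From bconsole, raise the SD debug level at runtime:
setdebug level=200 trace=1 storage=YourStorage

# Or run the SD in the foreground with debugging enabled from the start:
bacula-sd -f -d 200 -c /etc/bacula/bacula-sd.conf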

Not having any knowledge of the internals of the Linux bridge code, I thought that perhaps the host's physical NIC, which is also attached to the bridge that bacula-sd listens on, might somehow be causing a problem. To eliminate that, I swapped the NIC in that host. I didn't have a different type of NIC to try, so it was replaced with another Intel i350, which of course uses the same igb driver. That didn't help either, but it does show that a NIC hardware error is unlikely. Could a bug in the igb driver cause this? Maybe, but the NIC appeared to work flawlessly for everything else on the cluster node, including a web server VM connected through the same bridge. Or could it be the virtio_net driver? Again, it appears to work fine for the web server VM, but it made sense to start with virtio_net, since virtio_net (the client VM) is the sender and igb (bacula-sd listening on the cluster node's physical NIC) is the receiver.
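
If you want to confirm which drivers are in play on your own setup, ethtool will tell you (interface names are placeholders):

# On the cluster node (expect "driver: igb" for the i350):
/sbin/ethtool -i eth0

# Inside the client VM (expect "driver: virtio_net"):
/sbin/ethtool -i eth0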

So I searched for virtio-net and/or qemu-kvm network problems. I didn't find anything quite like this, but I did find reports of VM network performance and latency problems for which, several qemu-kvm versions ago, the fix was to disable some TCP offloading features. I didn't have high expectations, but I disabled TCP segmentation offload and generic segmentation offload, as well as generic receive offload, on all involved NICs, started the job again, and SURPRISE, it worked! It ran for almost three hours, backed up 700 GB compressed, and had no errors.
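
Before changing anything, you can see how these features are currently set (interface name is a placeholder):

# Show the current offload settings for an interface:
/sbin/ethtool -k ethX | egrep 'segmentation-offload|receive-offload'
# which prints lines such as:
#   tcp-segmentation-offload: on
#   generic-segmentation-offload: on
#   generic-receive-offload: on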

Conclusion: there is a bug somewhere! My guess is that the checksum calculation fails when segmentation offload is enabled; checksum offload seems to work as long as segmentation offload is disabled. I didn't try disabling checksum offload and re-enabling segmentation offload, nor did I try re-enabling generic receive offload.

To disable the offload features I used:

/sbin/ethtool -K ethX tso off gso off gro off

I disabled those on all interfaces involved. It may be necessary only on one of them; I don't know. I just don't have time to try all the permutations, and this seems to work with little or no performance degradation, at least in my case.
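
Note that ethtool settings don't survive a reboot. On CentOS 7 with the network initscripts, my understanding is that they can be made persistent with ETHTOOL_OPTS in the interface config file (a sketch; ethX is a placeholder):

# /etc/sysconfig/network-scripts/ifcfg-ethX
ETHTOOL_OPTS="-K ethX tso off gso off gro off"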

Once again, Bacula shows itself to be the most demanding network app I know of, and thus able to trigger all of the most obscure and intermittent networking problems.



