Dear Linux kernel network experts,

I work at DESY (Hamburg, Germany) and am responsible for Data Acquisition 
(DAQ) from different accelerators and experiments.


Every DAQ system collects data over the network; UDP multicast is used to 
transfer the data. Every data source has a multicast sender (~200 instances). 
A Dell PowerEdge R730xd server (DAQ server: 256 GB RAM, 40 cores) is used for 
receiving the data. The DAQ server has several 10 Gb network adapters in 
different subnets to receive multicast from the senders sitting in the 
corresponding subnets.

Every sender pushes data via a UDP socket bound to a multicast address. 
Sending takes place every 100 ms (10 Hz).

The size of the data can vary from a few bytes up to several MB.
The data is split into 32 KB messages sent via the UDP socket.
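
For illustration, each sender does roughly the following (a minimal sketch, 
not our real code; the group address, port, and payload are made up):

import socket
import time

GROUP = "239.1.2.3"     # hypothetical multicast group
PORT = 5000             # hypothetical port
MSG_SIZE = 32 * 1024    # 32 KB datagrams; the kernel fragments each one
                        # into ~MTU-sized IP fragments on the wire

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)

payload = bytes(MSG_SIZE)
while True:
    # one event every 100 ms (10 Hz); events larger than 32 KB are split
    # into several such datagrams
    sock.sendto(payload, (GROUP, PORT))
    time.sleep(0.1)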

A fast multi-threaded collector runs on the DAQ server to receive the data.
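
Per multicast group, the collector side is essentially the following (again 
only a sketch with made-up addresses; the real collector is multi-threaded):

import socket

GROUP = "239.1.2.3"          # hypothetical multicast group
PORT = 5000                  # hypothetical port
IFACE_IP = "192.168.1.10"    # hypothetical address of the receiving 10Gb NIC

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
# ask for a large socket receive buffer (capped by net.core.rmem_max)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 64 * 1024 * 1024)
sock.bind((GROUP, PORT))

# join the group on the interface that faces the senders' subnet
mreq = socket.inet_aton(GROUP) + socket.inet_aton(IFACE_IP)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

while True:
    data, addr = sock.recvfrom(64 * 1024)   # one 32 KB message per recvfrom
    # hand the message off to a worker thread here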


We had found kernel parameter values that successfully minimized packet 
losses on all network stack layers up to kernel 4.4.0-128.


For about a year (after trying to switch to other kernels: 4.4.0-xxx [sorry, 
I cannot say what xxx was, but after 128], 4.6, ...) we have had a problem 
that looks like losses on the application layer.


In my current test, two network interfaces are used. The multicast input rate 
is ~140 MB/s on each interface.



I'm testing kernel 5.6.0-1032-oem.
Previously, kernel 5.4.0-52-generic was tested, with the same results.

The signature is the following:
1) no Rx losses in the adapters
2) no InErrors, RcvbufErrors, or InCsumErrors counted in /proc/net/snmp
3) no counts in any column of /proc/net/softnet_stat other than the first one
4) dropwatch shows "xxx drops in at ip_defrag+171 ..."

The losses show up in bursts from time to time; the sketch below shows how I 
watch the reassembly counters while that happens.
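
Since ip_defrag points at IP fragment reassembly, the watcher is roughly this 
(a sketch; it reads the standard "Ip:" counters from /proc/net/snmp):

import time

def reasm_counters():
    # /proc/net/snmp has two "Ip:" lines: field names, then values
    with open("/proc/net/snmp") as f:
        ip_lines = [line.split() for line in f if line.startswith("Ip:")]
    stats = dict(zip(ip_lines[0][1:], map(int, ip_lines[1][1:])))
    return stats["ReasmReqds"], stats["ReasmFails"]

prev = reasm_counters()
while True:
    time.sleep(1)
    cur = reasm_counters()
    print("ReasmReqds/s: %d   ReasmFails/s: %d"
          % (cur[0] - prev[0], cur[1] - prev[1]))
    prev = cur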


Putting it in one line:

Multicast packets are seen by the network adapters, but from time to time the 
application layer does not get them, from all senders simultaneously.

Here are the currently used values of the sysctl parameters that, as far as I 
am aware, could influence the loss level:

net.core.optmem_max = 40960
net.core.rmem_default = 16777216
net.core.rmem_max = 67108864
net.core.wmem_default = 212992
net.core.wmem_max = 212992
net.ipv4.igmp_max_memberships = 512
net.ipv4.udp_mem = 262144    327680    393216
net.ipv4.udp_rmem_min = 8192
net.ipv4.udp_wmem_min = 4096

net.core.netdev_budget = 100000
net.core.netdev_max_backlog = 100000
net.ipv4.ipfrag_high_thresh = 33554432
net.ipv4.ipfrag_low_thresh = 16777216
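
To make sure the values above are really in effect at runtime, I read them 
back from /proc/sys (sysctl names map to /proc/sys paths with the dots 
replaced by slashes):

for name in ("net.core.rmem_max",
             "net.core.netdev_max_backlog",
             "net.ipv4.ipfrag_high_thresh",
             "net.ipv4.ipfrag_low_thresh"):
    path = "/proc/sys/" + name.replace(".", "/")
    with open(path) as f:
        print(name, "=", f.read().strip())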

All other parameters are left unchanged, as they come with the kernel 
distribution.

We plan to switch to Ubuntu 20.04 next year, and therefore kernel 5.4 (or 
5.6) is going to be used.


I hope that this problem is solvable at the kernel level.


Many thanks in advance and best regards,

Vladimir

--
/*********************************************************************\
* Dr. Vladimir Rybnikov      Phone : [49] (40) 8998 4846              *
* FLA/MCS4                   Fax   : [49] (40) 8998 4448              *
* Geb. 55a/35                e-mail: vladimir.rybni...@desy.de        *
* WWW : http://www.desy.de/~rybnikov/                                 *
* Notkestr.85, DESY                                                   *
* D-22607 Hamburg, Germany                                            *
\*********************************************************************/
