I recently did some work on 40Gb and 100Gb Ethernet interfaces, and these are a 
few of the things that helped me during lnet_selftest:


  *   On lnet: set credits higher than the default (e.g. 1024 or more) and 
peer_credits to at least 128 for network testing (the default is just 8, which 
may be fine for a big cluster but not for lnet_selftest with 2 clients).
  *   On ksocklnd module options: more schedulers (I used 10; the default of 6 
was not enough for my server). I also changed some of the buffers 
(tx_buffer_size and rx_buffer_size set to 1073741824), but you need to be very 
careful with these.
  *   sysctl.conf: increase buffers (tcp_rmem, tcp_wmem, the net.core max and 
default values), check window_scaling, and consider disabling timestamps if 
you can afford it.
  *   Other: cpupower governor (set to performance, at least for testing) and 
BIOS settings (e.g. on my AMD routers it was better to disable HT, disable a 
few virtualization-oriented features and set the PCI config to performance). 
Basically, be aware that Lustre over Ethernet consumes CPU resources, so 
optimize for that.
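
As a rough sketch, the credits and scheduler settings above could be written 
as a single modprobe options line. The helper name ksocklnd_options is 
hypothetical, and I am assuming that credits, peer_credits and nscheds are all 
accepted as ksocklnd module parameters on your Lustre version. The values are 
the ones mentioned above, not recommendations.

```shell
#!/bin/sh
# Print a modprobe "options" line for ksocklnd.
# Assumption: credits, peer_credits and nscheds are ksocklnd module parameters.
# Arguments (all optional): credits, peer_credits, nscheds.
ksocklnd_options() {
    credits=${1:-1024}
    peer_credits=${2:-128}
    nscheds=${3:-10}
    printf 'options ksocklnd credits=%s peer_credits=%s nscheds=%s\n' \
        "$credits" "$peer_credits" "$nscheds"
}

ksocklnd_options 1024 128 10
```

The output would typically go into a file under /etc/modprobe.d/ (the exact 
file name is up to you), and the Lustre modules have to be reloaded before it 
takes effect.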

Last but not least, be aware that Lustre's Ethernet driver (ksocklnd) does not 
load balance as well as InfiniBand's (ko2iblnd). I have sometimes seen several 
Lustre peers using the same socklnd thread on the destination while the other 
socklnd threads were idle, which means that your entire load depends on just 
one core. The best way to check is to test with more clients and look at the 
CPU load per thread on your node with top. 2 clients do not seem enough to me. 
With the proper configuration you should be perfectly able to saturate a 25Gb 
link in lnet_selftest.
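
One possible way to do the per-thread check is to run top in batch mode with 
threads shown and keep only the socklnd scheduler threads. This is a sketch 
only: the helper name is hypothetical, and it assumes the scheduler threads 
have names starting with socknal_sd, which you should confirm on your node.

```shell
#!/bin/sh
# Keep only the lines of per-thread top output that belong to socklnd
# scheduler threads. Reads from stdin.
# Assumption: those threads are named socknal_sd*; adjust the pattern if not.
socklnd_threads() {
    awk '/socknal_sd/'
}

# On a live node (not run here):
#   top -H -b -n 1 | socklnd_threads
```

If only one of the listed threads shows significant CPU usage while the test 
is running, the load is not being spread across cores.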

Regards,

Diego


From: lustre-discuss <[email protected]> on behalf of 
Pinkesh Valdria <[email protected]>
Date: Thursday, 5 December 2019 at 06:14
To: Jongwoo Han <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: [lustre-discuss] Lnet Self Test

Thanks Jongwoo.

I have the MTU set to 9000 and the ring buffer sizes set to the maximum.


ip link set dev $primaryNICInterface mtu 9000

ethtool -G $primaryNICInterface rx 2047 tx 2047 rx-jumbo 8191

I read about changing interrupt coalescing, but I was unable to find which 
values should be changed or whether it really helps.
# Several packets in a rapid sequence can be coalesced into one interrupt 
passed up to the CPU, providing more CPU time for application processing.

Thanks,
Pinkesh Valdria
Oracle Cloud



From: Jongwoo Han <[email protected]>
Date: Wednesday, December 4, 2019 at 8:07 PM
To: Pinkesh Valdria <[email protected]>
Cc: Andreas Dilger <[email protected]>, "[email protected]" 
<[email protected]>
Subject: Re: [lustre-discuss] Lnet Self Test

Have you tried an MTU >= 9000 bytes (AKA jumbo frames) on the 25G Ethernet and 
the switch?
If it is set to 1500 bytes, the Ethernet + IP + TCP headers take up a 
noticeable share of each packet, reducing the bandwidth available for data.
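
To put a rough number on the header overhead, the sketch below computes the 
payload share of a full-size TCP segment for a given MTU. The function name is 
hypothetical, and the figures assume IPv4 and TCP headers of 20 bytes each 
with no options, plus 38 bytes of per-frame Ethernet overhead (preamble, 
header, FCS and inter-frame gap).

```shell
#!/bin/sh
# Percentage of on-wire bytes that carry TCP payload for a given MTU.
# Assumptions: 40 B of IPv4+TCP headers (no options) inside the MTU, and
# 38 B of Ethernet overhead per frame (preamble 8, header 14, FCS 4, gap 12).
tcp_efficiency() {
    awk -v mtu="$1" 'BEGIN { printf "%.2f\n", 100 * (mtu - 40) / (mtu + 38) }'
}

tcp_efficiency 1500   # prints 94.93
tcp_efficiency 9000   # prints 99.14
```

Under those assumptions the MTU alone accounts for a difference of about four 
percentage points.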

Jongwoo Han

On Thu, 28 Nov 2019 at 3:44 AM, Pinkesh Valdria 
<[email protected]<mailto:[email protected]>> wrote:
Thanks Andreas for your response.

I ran another Lnet Self Test with 48 concurrent processes, since the nodes 
have 52 physical cores, and I achieved the same throughput (2052.71 MiB/s = 
2152 MB/s).

Is it expected to lose almost 600 MB/s (2750 - 2150 = 600) due to overheads on 
Ethernet with Lnet?


Thanks,
Pinkesh Valdria
Oracle Cloud Infrastructure




From: Andreas Dilger <[email protected]<mailto:[email protected]>>
Date: Wednesday, November 27, 2019 at 1:25 AM
To: Pinkesh Valdria 
<[email protected]<mailto:[email protected]>>
Cc: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Subject: Re: [lustre-discuss] Lnet Self Test

The first thing to note is that lst reports results in binary units
(MiB/s) while iperf reports results in decimal units (Gbps).  If you do the
conversion you get 2055.31 MiB/s = 2155 MB/s.
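
The conversion can be reproduced with a couple of small helpers. This is just 
a sketch with hypothetical function names: 1 MiB is 1048576 bytes, 1 MB is 
1000000 bytes, and 1 Gbps is 10^9 bits per second.

```shell
#!/bin/sh
# Convert MiB/s (as reported by lst) to MB/s.
mib_to_mb() {
    awk -v x="$1" 'BEGIN { printf "%.2f\n", x * 1048576 / 1000000 }'
}

# Convert MiB/s to Gbps (as reported by iperf).
mib_to_gbps() {
    awk -v x="$1" 'BEGIN { printf "%.2f\n", x * 1048576 * 8 / 1000000000 }'
}

mib_to_mb 2055.31     # prints 2155.15
mib_to_gbps 2055.31   # prints 17.24
```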

The other thing to check is the CPU usage. For TCP the CPU usage can
be high. You should try RoCE+o2iblnd instead.

Cheers, Andreas

On Nov 26, 2019, at 21:26, Pinkesh Valdria 
<[email protected]<mailto:[email protected]>> wrote:
Hello All,

I created a new Lustre cluster on CentOS 7.6 and I am running 
lnet_selftest_wrapper.sh to measure throughput on the network.  The nodes are 
connected to each other using 25 Gbps Ethernet, so the theoretical maximum is 
25 Gbps * 125 MB/s per Gbps = 3125 MB/s.  Using iperf3, I get 22 Gbps 
(2750 MB/s) between the nodes.


[root@lustre-client-2 ~]# for c in 1 2 4 8 12 16 20 24; do echo $c; \
    ST=lst-output-$(date +%Y-%m-%d-%H:%M:%S) CN=$c SZ=1M TM=30 BRW=write \
    CKSUM=simple LFROM="10.0.3.7@tcp1" LTO="10.0.3.6@tcp1" \
    /root/lnet_selftest_wrapper.sh; done

When I run lnet_selftest_wrapper.sh (from the Lustre wiki, 
http://wiki.lustre.org/LNET_Selftest) between 2 nodes, I get a maximum of 
2055.31 MiB/s. Is that expected at the Lnet level? Or can I further tune the 
network and OS kernel (the tuning I applied is below) to get better 
throughput?



Result Snippet from lnet_selftest_wrapper.sh

[LNet Rates of lfrom]
[R] Avg: 4112     RPC/s Min: 4112     RPC/s Max: 4112     RPC/s
[W] Avg: 4112     RPC/s Min: 4112     RPC/s Max: 4112     RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 0.31     MiB/s Min: 0.31     MiB/s Max: 0.31     MiB/s
[W] Avg: 2055.30  MiB/s Min: 2055.30  MiB/s Max: 2055.30  MiB/s
[LNet Rates of lto]
[R] Avg: 4136     RPC/s Min: 4136     RPC/s Max: 4136     RPC/s
[W] Avg: 4136     RPC/s Min: 4136     RPC/s Max: 4136     RPC/s
[LNet Bandwidth of lto]
[R] Avg: 2055.31  MiB/s Min: 2055.31  MiB/s Max: 2055.31  MiB/s
[W] Avg: 0.32     MiB/s Min: 0.32     MiB/s Max: 0.32     MiB/s


Tuning applied:
Ethernet NICs:

ip link set dev ens3 mtu 9000

ethtool -G ens3 rx 2047 tx 2047 rx-jumbo 8191


less /etc/sysctl.conf
net.core.wmem_max=16777216
net.core.rmem_max=16777216
net.core.wmem_default=16777216
net.core.rmem_default=16777216
net.core.optmem_max=16777216
net.core.netdev_max_backlog=27000
kernel.sysrq=1
kernel.shmmax=18446744073692774399
net.core.somaxconn=8192
net.ipv4.tcp_adv_win_scale=2
net.ipv4.tcp_low_latency=1
net.ipv4.tcp_rmem = 212992 87380 16777216
net.ipv4.tcp_sack = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_wmem = 212992 65536 16777216
vm.min_free_kbytes = 65536
net.ipv4.tcp_congestion_control = cubic
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_congestion_control = htcp
net.ipv4.tcp_no_metrics_save = 0
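
Note that the listing above sets net.ipv4.tcp_timestamps and 
net.ipv4.tcp_congestion_control twice. When the file is applied top to bottom, 
the later entry is the one that ends up in effect. A small sketch for spotting 
such repeats follows; the function name is hypothetical and it only handles 
plain key = value lines.

```shell
#!/bin/sh
# Print every sysctl key that is assigned more than once.
# Reads the named files, or stdin when no file is given.
dup_sysctl_keys() {
    awk -F= '
        /^[ \t]*[#;]/ || /^[ \t]*$/ { next }   # skip comments and blank lines
        {
            key = $1
            gsub(/[ \t]/, "", key)             # strip whitespace around the key
            if (++seen[key] == 2) print key    # report each repeated key once
        }
    ' "$@"
}

# Typical use (not run here):
#   dup_sysctl_keys /etc/sysctl.conf
```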



echo "#
# tuned configuration
#
[main]
summary=Broadly applicable tuning that provides excellent performance across a variety of common server workloads

[disk]
devices=!dm-*, !sda1, !sda2, !sda3
readahead=>4096

[cpu]
force_latency=1
governor=performance
energy_perf_bias=performance
min_perf_pct=100
[vm]
transparent_huge_pages=never
[sysctl]
kernel.sched_min_granularity_ns = 10000000
kernel.sched_wakeup_granularity_ns = 15000000
vm.dirty_ratio = 30
vm.dirty_background_ratio = 10
vm.swappiness=30
" > lustre-performance/tuned.conf

tuned-adm profile lustre-performance


Thanks,
Pinkesh Valdria

_______________________________________________
lustre-discuss mailing list
[email protected]<mailto:[email protected]>
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


--
Jongwoo Han
+82-505-227-6108
