Hi Hillf,

> Let's see if TCQ_F_NOLOCK is making fq_codel different in your testing.
I assume you meant disabling NOLOCK for pfifo_fast. Here is the modification:

--- ./net/sched/sch_generic.c.orig	2020-08-24 22:02:04.589830751 +0800
+++ ./net/sched/sch_generic.c	2020-08-27 10:17:10.148977195 +0800
@@ -792,7 +792,7 @@
 	.dump			=	pfifo_fast_dump,
 	.change_tx_queue_len	=	pfifo_fast_change_tx_queue_len,
 	.owner			=	THIS_MODULE,
-	.static_flags		=	TCQ_F_NOLOCK | TCQ_F_CPUSTATS,
+	.static_flags		=	TCQ_F_CPUSTATS,

With this change the issue has not happened again over three hours of stress testing. I restarted the test twice, with no surprises either time. Quite stable.

Hillf Danton <hdan...@sina.com> wrote on Wed, Aug 26, 2020 at 2:37 PM:
>
> Hi Feng,
>
> On Wed, 26 Aug 2020 11:12:38 +0800 Fengkehuan Feng wrote:
> > Hi Hillf,
> >
> > I just gave the patch more tries, and it is not as good as I said in my
> > last email. I can see more packets getting stuck now...
>
> We have more to learn here :P
>
> > Let me explain what I am facing in detail, in case we are not aligned on
> > fixing the same problem.
> >
> > Our application is in a deep learning scenario and is based on NVIDIA
> > NCCL for collective communication intra-node or inter-node (to be more
> > specific, it is an all-reduce of data across two servers with 8 GPU
> > nodes each).
> > NCCL can transmit data over TCP, RDMA, or GDR. Normally it takes about
> > 1000 us for TCP, or less for RDMA/GDR, to transmit a 512KB packet, but
> > sometimes it takes hundreds of milliseconds or even several seconds to
> > complete.
> >
> > When we change the default qdisc from pfifo_fast to fq_codel, the issue
> > never happens, so we suspect something is wrong within the networking
> > stack (though it is a bit strange that RDMA and GDR show the same
> > problem).
>
> Let's see if TCQ_F_NOLOCK is making fq_codel different in your testing.
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -791,7 +791,7 @@ struct Qdisc_ops pfifo_fast_ops __read_m
>  	.dump			=	pfifo_fast_dump,
>  	.change_tx_queue_len	=	pfifo_fast_change_tx_queue_len,
>  	.owner			=	THIS_MODULE,
> -	.static_flags		=	TCQ_F_NOLOCK | TCQ_F_CPUSTATS,
> +	.static_flags		=	TCQ_F_CPUSTATS,
>  };
>  EXPORT_SYMBOL(pfifo_fast_ops);
> --
>
> > Here is the log print from our test application:
> >
> > size: 512KB, use_time: 1118us, speed: 0.436745GB/s
> > size: 512KB, use_time: 912us, speed: 0.535396GB/s
> > size: 512KB, use_time: 1023us, speed: 0.477303GB/s
> > size: 512KB, use_time: 919us, speed: 0.531318GB/s
> > size: 512KB, use_time: 1129us, speed: 0.432490GB/s
> > size: 512KB, use_time: 2098748us, speed: 0.000233GB/s
> > size: 512KB, use_time: 1018us, speed: 0.479648GB/s
> > size: 512KB, use_time: 1120us, speed: 0.435965GB/s
> > size: 512KB, use_time: 1071us, speed: 0.455912GB/s
>
> JFYI, I failed to find this message at lore.kernel.org, perhaps
> because of the plain-text mail.
>
> Thanks
> Hillf
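P.S. As a sanity check on the log above (this is my reading of the units, not something stated in the thread): the speed column is reproducible from size and use_time if 512KB means 512 * 1024 bytes and GB/s means 2^30 bytes per second. A minimal sketch:

```python
def speed_gbps(size_bytes: int, use_time_us: int) -> float:
    """Throughput in GB/s, assuming GB = 2**30 bytes."""
    return size_bytes / (use_time_us * 1e-6) / 2**30

size = 512 * 1024  # 512KB, assuming KB = 1024 bytes

# First log line: use_time: 1118us -> speed: 0.436745GB/s
print(f"{speed_gbps(size, 1118):.6f}")

# The multi-second outlier: use_time: 2098748us -> speed: 0.000233GB/s
print(f"{speed_gbps(size, 2098748):.6f}")
```

So the stalled transfer really is the same 512KB payload taking roughly 2000x longer, not a larger message.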