Answers inline too.
2) Is the absence of btl_openib_have_driver_gdr an indicator of something missing?
Yes, that means GPU Direct RDMA is somehow not installed correctly.
All that check does is verify that the file
/sys/kernel/mm/memory_peers/nv_mem/version exists.  Does that exist?

It does not. There is no /sys/kernel/mm/memory_peers/ directory.
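
For reference, a minimal set of shell checks for that, assuming (as I understand it) that the file is created by the nv_peer_mem kernel module shipped with the GPUDirect RDMA / Mellanox OFED packages:

    $ lsmod | grep nv_peer_mem                          # is the peer-memory module loaded at all?
    $ ls /sys/kernel/mm/memory_peers/                   # should contain an nv_mem directory when it is
    $ cat /sys/kernel/mm/memory_peers/nv_mem/version    # the file the Open MPI check looks for

If lsmod shows nothing, the module is simply not installed or not loaded on that node.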

3) Are the default parameters, especially the RDMA limits and such, optimal for our configuration?
That is hard to say.  GPU Direct RDMA does not work well when the GPU and IB card are not 
"close" on the system. Can you run "nvidia-smi topo -m" on your system?
Running "nvidia-smi topo -m" gives me the error:
[mboisson@login-gpu01 ~]$ nvidia-smi topo -m
Invalid combination of input arguments. Please run 'nvidia-smi -h' for help.

I could not find anything related to topology in the help. However, I can tell you the following, which I believe to be true:
- GPU0 and GPU1 are on PCIe bus 0, socket 0
- GPU2 and GPU3 are on PCIe bus 1, socket 0
- GPU4 and GPU5 are on PCIe bus 2, socket 1
- GPU6 and GPU7 are on PCIe bus 3, socket 1

There is one IB card which I believe is on socket 0.
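
In case it helps, this is how I would try to confirm that layout from the shell when "nvidia-smi topo -m" is not available (the device name mlx4_0 and the PCI address below are only illustrative guesses, not your actual ones):

    $ lspci -tv | less                                    # walk the PCI tree: which root bridge each GPU/HCA hangs off
    $ lspci -D | grep -i -E 'nvidia|mellanox'             # GPUs and the IB card with their full PCI addresses
    $ cat /sys/class/infiniband/mlx4_0/device/numa_node   # NUMA node (socket) of the HCA; mlx4_0 is a guessed name
    $ cat /sys/bus/pci/devices/0000:02:00.0/numa_node     # same for a GPU, using an address taken from lspci above

If a GPU and the HCA report different numa_node values, they sit on different sockets and so are not "close" in the sense above.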


I know that we do not have the Mellanox OFED; we use the Linux RDMA stack from CentOS 6.5. However, should that completely disable GDR within a single node? That is, does GDR _have_ to go through IB? I would assume that our lack of Mellanox OFED would result in no GDR inter-node, but GDR intra-node.
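
On the intra-node side, in case it is useful: the CUDA-related MCA parameters that a given build actually exposes can be listed with ompi_info, which is also where btl_openib_have_driver_gdr appears when GDR support is detected. Which parameters show up depends on how Open MPI was configured, so treat the grep patterns below as a starting point rather than an exhaustive list:

    $ ompi_info --all | grep -i gdr      # GDR-related flags such as btl_openib_have_driver_gdr, when present
    $ ompi_info --all | grep -i cuda     # everything CUDA-related; if the build has the smcuda BTL, its parameters appear here too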


Thanks


--
---------------------------------
Maxime Boissonneault
Computing Analyst - Calcul Québec, Université Laval
Ph.D. in Physics
