Answers inline too.
2) Is the absence of btl_openib_have_driver_gdr an indicator of something
missing?
Yes, that means that somehow GPU Direct RDMA support is not installed correctly.
All that check does is make sure that the file
/sys/kernel/mm/memory_peers/nv_mem/version exists. Does that exist?
It does not. There is no
/sys/kernel/mm/memory_peers/
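For what it is worth, that sysfs entry is, as far as I understand,
created by the nv_peer_mem kernel module from Mellanox's
nvidia-peer-memory package, so a quick check on a compute node would be
something like:

  # is the GPU Direct RDMA peer-memory module loaded?
  lsmod | grep nv_peer_mem

  # does the file the openib BTL checks for exist?
  ls /sys/kernel/mm/memory_peers/nv_mem/version

If the module is not loaded, btl_openib_have_driver_gdr will not be set.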
3) Are the default parameters, especially the RDMA limits and such,
optimal for our configuration?
That is hard to say. GPU Direct RDMA does not work well when the GPU and
the IB card are not "close" in the PCIe topology (ideally they share a
PCIe root complex or switch). Can you run "nvidia-smi topo -m" on your
system?
"nvidia-smi topo -m" gives me the error:
[mboisson@login-gpu01 ~]$ nvidia-smi topo -m
Invalid combination of input arguments. Please run 'nvidia-smi -h' for help.
I could not find anything related to topology in the help. However, I
can tell you the following, which I believe to be true (a sysfs
cross-check is sketched after the list):
- GPU0 and GPU1 are on PCIe bus 0, socket 0
- GPU2 and GPU3 are on PCIe bus 1, socket 0
- GPU4 and GPU5 are on PCIe bus 2, socket 1
- GPU6 and GPU7 are on PCIe bus 3, socket 1
There is one IB card, which I believe is on socket 0.
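In case it is useful, one way to cross-check this without
"nvidia-smi topo -m", assuming the usual sysfs layout (the bus ID below
is only a placeholder), is to read the NUMA node of each device
directly:

  # NUMA node the IB HCA is attached to
  cat /sys/class/infiniband/*/device/numa_node

  # PCI bus IDs of the GPUs, then the NUMA node of each one
  nvidia-smi -q | grep -i "Bus Id"
  cat /sys/bus/pci/devices/0000:02:00.0/numa_node   # substitute each GPU bus ID

  # PCIe tree, to see which root complex each device hangs off
  lspci -tv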
I know that we do not have the Mellanox OFED; we use the Linux RDMA
stack from CentOS 6.5. However, should that completely disable GDR
within a single node? That is, does GDR _have_ to go through IB? I
would assume that our lack of Mellanox OFED would result in no GDR
inter-node, but GDR intra-node.
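For reference, this is how I would check what the Open MPI build itself
reports about CUDA support (the parameter names below are the ones I
would expect from the 1.7/1.8-series documentation, so treat them as
assumptions):

  # was Open MPI built with CUDA-aware support?
  ompi_info --parsable --all | grep mpi_built_with_cuda_support:value

  # are the smcuda BTL (intra-node CUDA IPC) and the GDR knobs present?
  ompi_info --all | grep -E "btl_smcuda|gdr"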
Thanks
--
---------------------------------
Maxime Boissonneault
Computing analyst - Calcul Québec, Université Laval
Ph.D. in Physics