There are relevant entries in both /dev and /sys.

On my system:

$ ls -al /dev/infiniband/
drwxr-xr-x  2 root root      120 Nov 10 17:08 .
drwxr-xr-x 21 root root     3980 Dec 13 03:09 ..
crw-rw----  1 root root 231,  64 Nov 10 17:08 issm0
crw-rw-rw-  1 root root  10,  56 Nov 10 17:08 rdma_cm
crw-rw----  1 root root 231,   0 Nov 10 17:08 umad0
crw-rw-rw-  1 root root 231, 192 Nov 10 17:08 uverbs0


Here is what you can do to find out what is going wrong on your system.
/* Note: if you are running SELinux, that might also be causing issues. */
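A quick way to rule SELinux in or out, something like this (just a sketch; `getenforce` and `ls -Z` only exist where the SELinux tooling is installed):

```shell
# Sketch: quick SELinux sanity check. If getenforce reports "Enforcing",
# denials on the device nodes would show up in the audit log.
if command -v getenforce >/dev/null 2>&1; then
    getenforce                          # prints Enforcing / Permissive / Disabled
    ls -Z /dev/infiniband/ 2>/dev/null  # security contexts on the device nodes
else
    echo "SELinux tools not installed"
fi
```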


$ mpirun -np 1 strace -e open,stat -o /tmp/hello.strace -- ./hello_c
Hello, world, I am 0 of 1, (Open MPI v3.0.0a1, package: Open MPI gilles@xxx Distribution, ident: 3.0.0a1, repo rev: dev-3197-g4323016, Unreleased developer copy, 160)

$ grep -v ENOENT /tmp/hello.strace  | grep /dev/
open("/dev/shm/open_mpi.0000", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW|O_CLOEXEC, 0600) = 6
open("/dev/infiniband/uverbs0", O_RDWR) = 17
open("/dev/infiniband/uverbs0", O_RDWR) = 19
open("/dev/infiniband/rdma_cm", O_RDWR) = 21
open("/dev/infiniband/rdma_cm", O_RDWR) = 21
open("/dev/infiniband/rdma_cm", O_RDWR) = 21
open("/dev/infiniband/rdma_cm", O_RDWR) = 21
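On a broken node the interesting strace lines are the *failed* opens: EACCES points at a permission problem, ENOENT at a missing device node or kernel module. A sketch, demonstrated here on a fake trace file (point the grep at your real /tmp/hello.strace):

```shell
# Create a fake trace for illustration; on a real system, grep the
# /tmp/hello.strace produced by the mpirun/strace command above.
cat > /tmp/fake.strace <<'EOF'
open("/dev/infiniband/uverbs0", O_RDWR) = -1 EACCES (Permission denied)
open("/dev/infiniband/rdma_cm", O_RDWR) = 21
EOF

# EACCES/EPERM lines are the ones that indicate a permission problem.
grep -E 'EACCES|EPERM' /tmp/fake.strace
```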

$ grep -v ENOENT /tmp/hello.strace  | grep /sys/
open("/sys/devices/system/cpu/possible", O_RDONLY) = 18
stat("/sys/class/infiniband", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
open("/sys/class/infiniband", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 17
open("/sys/class/infiniband_verbs/abi_version", O_RDONLY) = 17
open("/sys/class/infiniband_verbs", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 17
stat("/sys/class/infiniband_verbs/abi_version", {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0
stat("/sys/class/infiniband_verbs/uverbs0", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
open("/sys/class/infiniband_verbs/uverbs0/ibdev", O_RDONLY) = 18
open("/sys/class/infiniband_verbs/uverbs0/abi_version", O_RDONLY) = 18
open("/sys/class/infiniband_verbs/uverbs0/device/vendor", O_RDONLY) = 17
open("/sys/class/infiniband_verbs/uverbs0/device/device", O_RDONLY) = 17
open("/sys/class/infiniband_verbs/uverbs0/device/vendor", O_RDONLY) = 17
open("/sys/class/infiniband_verbs/uverbs0/device/device", O_RDONLY) = 17
open("/sys/class/infiniband_verbs/uverbs0/device/vendor", O_RDONLY) = 17
open("/sys/class/infiniband_verbs/uverbs0/device/device", O_RDONLY) = 17
open("/sys/class/infiniband_verbs/uverbs0/device/vendor", O_RDONLY) = 17
open("/sys/class/infiniband_verbs/uverbs0/device/device", O_RDONLY) = 17
open("/sys/class/infiniband_verbs/uverbs0/device/vendor", O_RDONLY) = 17
open("/sys/class/infiniband_verbs/uverbs0/device/device", O_RDONLY) = 17
open("/sys/class/infiniband_verbs/uverbs0/device/vendor", O_RDONLY) = 17
open("/sys/class/infiniband_verbs/uverbs0/device/device", O_RDONLY) = 17
open("/sys/class/infiniband/mlx4_0/node_type", O_RDONLY) = 17
open("/sys/class/infiniband/mlx4_0/device/local_cpus", O_RDONLY) = 17
open("/sys/class/infiniband/mlx4_0/ports/1/gids/0", O_RDONLY) = 19
open("/sys/class/misc/rdma_cm/abi_version", O_RDONLY) = 19
open("/sys/class/infiniband/mlx4_0/node_guid", O_RDONLY) = 19
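The verbs library probes those sysfs entries at startup; if they are missing entirely, the RDMA kernel modules are probably not loaded. A quick check (sketch only; module names vary by hardware, mlx4_ib is for the Mellanox adapter shown above):

```shell
# If these sysfs directories are absent, no uverbs device was registered,
# which usually means ib_uverbs / the HCA driver (e.g. mlx4_ib) is not loaded.
ls /sys/class/infiniband /sys/class/infiniband_verbs 2>/dev/null \
    || echo "no RDMA sysfs entries -- check 'lsmod | grep ib_'"
```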

Cheers,

Gilles

On 12/18/2015 9:11 AM, Ralph Castain wrote:
To be honest, it’s been a very long time since I had an IB machine. Howard, Nathan, or someone who has one - can you answer?



On Dec 17, 2015, at 3:53 PM, Bathke, Chuck <bat...@lanl.gov <mailto:bat...@lanl.gov>> wrote:

Ralph,
Where would these be, in /dev?
Chuck

*From*: Ralph Castain [mailto:r...@open-mpi.org]
*Sent*: Thursday, December 17, 2015 04:13 PM
*To*: Open MPI Users <us...@open-mpi.org <mailto:us...@open-mpi.org>>
*Subject*: Re: [OMPI users] Need help resolving "error obtaining device context for mlx4_0"

You might want to check the permissions on the MLX device directory - according to that error message, the permissions are preventing you from accessing the device. Without getting access, we don’t have a way to communicate across nodes - you can run on one node using shared memory, but not multiple nodes.

So it looks like there is some device-level permissions issue in play.
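A quick way to test that from the user account that runs the job, something like this (a sketch; it mimics the read-write open the MPI library performs, substitute your own uverbs device path):

```shell
# Sketch: report whether the current user can open a device node read-write,
# the same access the library needs on /dev/infiniband/uverbs0.
check_dev() {
    dev="$1"
    if [ -r "$dev" ] && [ -w "$dev" ]; then
        echo "$dev: OK"
    else
        echo "$dev: no read/write access (check owner/group/mode with ls -l)"
    fi
}

check_dev /dev/null                   # always accessible, prints "OK"
check_dev /dev/infiniband/uverbs0     # the device in question on this system
```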


On Dec 17, 2015, at 2:39 PM, Bathke, Chuck <bat...@lanl.gov <mailto:bat...@lanl.gov>> wrote:

Hi,
I have a system of AMD blades that I am trying to run MCNP6 on using Open MPI. I installed openmpi-1.6.5, and I also have the Intel Fortran and C compilers installed. I compiled MCNP6 using FC="mpif90" CC="mpicc" … It runs just fine when I run a 1-hour test case on a single blade, but when I run it on several blades it issues an error and crashes. I have sought help here, but no one seems to know how to fix it. At their suggestion I mounted /opt and /home from bud and bud6 on the corresponding /opt and /home on bud4; that did not fix anything. Please look at the attached file (created with bud4> tar -zcf info.tgz mpihT3), which holds the data requested at https://www.open-mpi.org/community/help/ and in bullet 13 on https://www.open-mpi.org/community/help/. Can you look at it and suggest a solution? I suspect that it is something trivial that does not stand out and say, “look here, you idiot.” Thanks.
Charles "Chuck" Bathke
MS-C921
Los Alamos National Laboratory
P.O. Box 1663
Los Alamos, NM 87545
Phone:(505)667-7214
Cell:(505)695-5709
Fax: 505-665-2897
Location: TA-16, Building 0200, Room 125
NEN-5 Group Office: 505-667-0914
<info.tgz>
_______________________________________________
users mailing list
us...@open-mpi.org <mailto:us...@open-mpi.org>
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/12/28178.php



