I'm not sure if this is a Torque issue or an MPI issue. If I log in to a compute node and run the standard MPI broadcast test it returns no error, but if I run it through PBS/Torque I get an error (see below). The nodes that return the error are fairly random; even the same set of nodes will run a test once and then fail the next time. In case it matters, these nodes have dual interfaces: 1GigE and 10GigE. All tests were run on the same group of 32 nodes.
If I log in to the node (just as a regular user, not as root) then the test runs fine. No errors at all. Is there a timeout somewhere? Or some such issue? I'm not at all sure why this is happening.

Things I've verified: ulimit seems OK. I have explicitly set the ulimit within the pbs init script as well as in the sshd init script that spawns the remote processes:

[root@eu013 ~]# grep ulimit /etc/init.d/pbs
ulimit -l unlimited
[root@eu013 ~]# grep ulimit /etc/init.d/sshd
ulimit -l unlimited
[root@eu013 ~]# ssh eu013 ulimit -l
unlimited

Even if I put a "ulimit -l" in a PBS job it does return unlimited (see the per-node check sketched at the end of this message). "cat /sys/class/infiniband/cxgb3_0/proto_stats/tcpRetransSegs" returns zero on all of the nodes concerned, and ifconfig does not report any error packets either.

-- Rahul

#############################################################
PBS command:

mpirun -mca btl openib,sm,self -mca orte_base_help_aggregate 0 /opt/src/mpitests/imb/src/IMB-MPI1 bcast

-----------------------------through PBS---------------------------------------------
The RDMA CM returned an event error while attempting to make a connection.
This type of error usually indicates a network configuration error.

  Local host:    eu013
  Local device:  cxgb3_0
  Error name:    RDMA_CM_EVENT_UNREACHABLE
  Peer:          eu010

Your MPI job will now abort, sorry.
-------------------------------------------------------------------------

#######################################
Run directly from a compute node:

mpirun -host eu001,eu002,eu003,eu004,eu005,eu006,eu007,eu008,eu009,eu010,eu011,eu012,eu013,eu014,eu015,eu016,eu017,eu018,eu019,eu010,eu011,eu012,eu013,eu014,eu015,eu016,eu017,eu018,eu019,eu020,eu021,eu022,eu023,eu024,eu025,eu026,eu027,eu028,eu029,eu030,eu031,eu032 -mca btl openib,sm,self -mca orte_base_help_aggregate 0 /opt/src/mpitests/imb/src/IMB-MPI1 bcast

#----------------------------------------------------------------
# Benchmarking Bcast
# #processes = 42
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.02         0.03         0.02
            1         1000       170.70       170.76       170.74
            2         1000       171.04       171.10       171.08
            4         1000       171.09       171.15       171.13
            8         1000       171.05       171.13       171.10
           16         1000       171.03       171.10       171.07
           32         1000        31.93        32.00        31.98
           64         1000        28.86        29.02        28.99
          128         1000        29.34        29.40        29.38
          256         1000        29.90        29.98        29.95
          512         1000        30.39        30.47        30.44
         1024         1000        31.59        31.67        31.64
         2048         1000        38.15        38.26        38.23
         4096         1000       187.59       187.75       187.68
         8192         1000       208.26       208.41       208.37
        16384         1000       395.47       395.71       395.61
        32768         1000      9360.99      9441.36      9416.47
        65536          400     10522.09     11003.08     10781.73
       131072          299     16971.71     17647.29     17329.27
       262144          160     15404.01     17131.36     15816.46
       524288           80      2659.56      4258.90      3002.04
      1048576           40      4305.72      5305.33      5219.00
      2097152           20      2472.34     10711.80      8599.28
      4194304           10      6275.51     20791.20     13687.10

# All processes entering MPI_Finalize
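P.S. In case it helps, the per-node ulimit check mentioned above is roughly the following sketch. The resource request is illustrative, and it assumes passwordless ssh between nodes; $PBS_NODEFILE is the node list Torque provides to the job.

#!/bin/bash
#PBS -l nodes=32:ppn=1
# Print the locked-memory limit as seen on each node of the allocation.
# sort -u collapses repeated entries so each host is checked once.
for node in $(sort -u "$PBS_NODEFILE"); do
    echo -n "$node: "
    ssh "$node" 'ulimit -l'
done

Every node reports "unlimited" when I run something like this, which is why I don't think the locked-memory limit itself is the problem.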