I'm not sure whether this is a Torque issue or an MPI issue. If I log
in to a compute node and run the standard MPI broadcast test, it
returns no error, but if I run it through PBS/Torque I get an error
(see below). The nodes that return the error are fairly random: even
the same set of nodes will pass a test once and then fail the next
time. In case it matters, these nodes have dual interfaces: 1GigE and
10GigE. I ran all tests on the same group of 32 nodes.

If I log in to the node (just as a regular user, not as root) then the
test runs fine. No errors at all.

Is there a timeout somewhere? Or some such issue? Not at all sure why
this is happening....

Things I've verified: ulimit seems OK. I have explicitly set the
ulimit in the pbs init script as well as in the sshd init script.

[root@eu013 ~]# grep ulimit /etc/init.d/pbs
ulimit -l unlimited
[root@eu013 ~]# grep ulimit /etc/init.d/sshd
ulimit -l unlimited


ssh eu013 ulimit -l
unlimited

Even a "ulimit -l" placed inside a PBS job returns unlimited.

"cat /sys/class/infiniband/cxgb3_0/proto_stats/tcpRetransSegs" returns
zero on all nodes concerned, and ifconfig does not report any error
packets either.
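
For anyone who wants to repeat the locked-memory check, here is a
minimal sketch that could be run both inside a PBS job and over ssh so
the two outputs can be compared. The function name report_memlock is
made up for illustration; it relies only on ulimit and uname.

```shell
#!/bin/bash
# Hypothetical helper: print the locked-memory limit seen by this shell,
# tagged with the node name, so PBS and ssh runs can be compared.
report_memlock() {
    local limit node
    limit=$(ulimit -l)   # locked-memory limit of the current shell
    node=$(uname -n)     # node name, to tell the outputs apart
    if [ "$limit" = "unlimited" ]; then
        echo "$node: memlock unlimited"
    else
        echo "$node: memlock limited to $limit"
    fi
}

report_memlock
```

A node that prints "limited to" under PBS but "unlimited" over ssh
would point at the limits inherited by the job, not at the network.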

-- 
Rahul
#############################################################


PBS command:

mpirun -mca btl openib,sm,self -mca orte_base_help_aggregate 0 \
    /opt/src/mpitests/imb/src/IMB-MPI1 bcast
----------------------------- through PBS -----------------------------
The RDMA CM returned an event error while attempting to make a
connection.  This type of error usually indicates a network
configuration error.

  Local host:   eu013
  Local device: cxgb3_0
  Error name:   RDMA_CM_EVENT_UNREACHABLE
  Peer:         eu010

Your MPI job will now abort, sorry.
-------------------------------------------------------------------------
#######################################
Run directly from a compute node:

mpirun -host eu001,eu002,eu003,eu004,eu005,eu006,eu007,eu008,eu009,eu010,eu011,eu012,eu013,eu014,eu015,eu016,eu017,eu018,eu019,eu010,eu011,eu012,eu013,eu014,eu015,eu016,eu017,eu018,eu019,eu020,eu021,eu022,eu023,eu024,eu025,eu026,eu027,eu028,eu029,eu030,eu031,eu032 \
    -mca btl openib,sm,self -mca orte_base_help_aggregate 0 \
    /opt/src/mpitests/imb/src/IMB-MPI1 bcast

#----------------------------------------------------------------
# Benchmarking Bcast
# #processes = 42
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.02         0.03         0.02
            1         1000       170.70       170.76       170.74
            2         1000       171.04       171.10       171.08
            4         1000       171.09       171.15       171.13
            8         1000       171.05       171.13       171.10
           16         1000       171.03       171.10       171.07
           32         1000        31.93        32.00        31.98
           64         1000        28.86        29.02        28.99
          128         1000        29.34        29.40        29.38
          256         1000        29.90        29.98        29.95
          512         1000        30.39        30.47        30.44
         1024         1000        31.59        31.67        31.64
         2048         1000        38.15        38.26        38.23
         4096         1000       187.59       187.75       187.68
         8192         1000       208.26       208.41       208.37
        16384         1000       395.47       395.71       395.61
        32768         1000      9360.99      9441.36      9416.47
        65536          400     10522.09     11003.08     10781.73
       131072          299     16971.71     17647.29     17329.27
       262144          160     15404.01     17131.36     15816.46
       524288           80      2659.56      4258.90      3002.04
      1048576           40      4305.72      5305.33      5219.00
      2097152           20      2472.34     10711.80      8599.28
      4194304           10      6275.51     20791.20     13687.10


# All processes entering MPI_Finalize
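
One thing worth checking in the -host list above: the benchmark header
reports 42 processes, although the group has 32 nodes. A minimal sketch
for spotting repeated names in a comma-separated host list follows.
The function name dup_hosts is hypothetical, and only tr, sort and uniq
are used.

```shell
#!/bin/bash
# Hypothetical helper: list host names that occur more than once in a
# comma-separated list such as the argument to mpirun -host.
dup_hosts() {
    printf '%s\n' "$1" | tr ',' '\n' | sort | uniq -d
}

# Shortened example list (not the full list from the run above).
dup_hosts "eu009,eu010,eu011,eu010,eu011,eu012"
```

Fed the full list from the run above, this reports eu010 through
eu019, which accounts for the 10 extra processes.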
