On Wed, 2009-07-08 at 15:43 -0400, Michael Di Domenico wrote:
> On Wed, Jul 8, 2009 at 3:33 PM, Ashley Pittman<ash...@pittman.co.uk> wrote:
> >> When I run tping I get:
> >> ELAN_EXCEPTION @ --: 6 (Initialization error)
> >> elan_init: Can't get capability from environment
> >>
> >> I am not using slurm or RMS at all, just trying to get openmpi to run
> >> between two nodes.
> >
> > To attach to the Elan a process has to have a "capability", which is a
> > kernel attribute describing the size (number of nodes/ranks) of the job.
> > Without this you'll get errors like the one from tping.  The only way to
> > generate these capabilities is by using RMS, Slurm or, I believe, pdsh,
> > each of which can generate one and push it into the kernel before calling
> > fork() to create the user application.
> 
> I didn't realize it was an MPI type program, so I ran it using the
> QSNet version of mpirun and OpenMPI.  The process does start and runs
> through 0: and 2:, which I assume are packet sizes, but freezes at
> that point.
> 
> We have an existing XC cluster from HP, that we're trying to convert
> from XC to standard RHEL5.3 w/ Slurm and OpenMPI.  All I want to be
> able to do is load RHEL5 and the Quadrics NIC drivers, and run OpenMPI
> jobs between these two nodes I yanked from the cluster before we
> switch the whole thing over.

My advice would be to try OpenMPI on the (presumably functional) XC
cluster first and then migrate from there to RHEL5.3.  I don't recall
Slurm being hard to get working, but problems will be a lot easier to
diagnose if you get OpenMPI and the resource manager working separately
before putting them together.
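
To illustrate the capability point quoted above, here is a rough Python
sketch of the "set it up in the launcher, then fork()" pattern.  It is
only an analogy: the real capability is a kernel attribute installed by
RMS, Slurm or pdsh, and the DEMO_CAPABILITY variable and the function
names below are made up for the example, not part of any Quadrics API.

```python
import os

# Hypothetical name, for illustration only; not the real Elan mechanism.
CAP_VAR = "DEMO_CAPABILITY"


def child_can_attach():
    """Stand-in for elan_init(): succeed only if a capability was inherited."""
    return CAP_VAR in os.environ


def launch(nranks, install_capability):
    """Fork a child, optionally installing the capability first.

    Returns the child's exit status: 0 on success, 6 on "initialization error".
    """
    os.environ.pop(CAP_VAR, None)
    if install_capability:
        # The launcher describes the job size *before* creating the process.
        os.environ[CAP_VAR] = "ranks=%d" % nranks
    pid = os.fork()
    if pid == 0:
        # Child: it only sees what the parent set up before fork().
        os._exit(0 if child_can_attach() else 6)
    _, status = os.waitpid(pid, 0)
    os.environ.pop(CAP_VAR, None)
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    print("with launcher:   ", launch(2, True))
    print("without launcher:", launch(2, False))
```

The second call is roughly the situation tping is in when it is started
by hand: nothing created the capability before the process existed, so
initialization fails.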

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
