On Wed, 2009-07-08 at 15:43 -0400, Michael Di Domenico wrote:
> On Wed, Jul 8, 2009 at 3:33 PM, Ashley Pittman<ash...@pittman.co.uk> wrote:
> >> When i run tping i get:
> >> ELAN_EXCEOPTIOn @ --: 6 (Initialization error)
> >> elan_init: Can't get capability from environment
> >>
> >> I am not using slurm or RMS at all, just trying to get openmpi to run
> >> between two nodes.
> >
> > To attach to the elan a process has to have a "capability" which is a
> > kernel attribute describing the size (number of nodes/ranks) of the job,
> > without this you'll get errors like the one from tping. The only way to
> > generate these capabilities is by using RMS, Slurm or I believe pdsh
> > which can generate one and push it into the kernel before calling fork()
> > to create the user application.
>
> I didn't realize it was an MPI type program, so I ran is using the
> QSNet version of mpirun and OpenMPI. The process does start and runs
> through 0: and 2:, which i assume are packet sizes, but freezes at
> that point.
>
> We have an existing XC cluster from HP, that we're trying to convert
> from XC to standard RHEL5.3 w/ Slurm and OpenMPI. All i want to be
> able to do is load RHEL5 and the Quadrics NIC drivers, and run OpenMPI
> jobs between these two nodes I yanked from the cluster before we
> switch the whole thing over.
My advice would be to try OpenMPI on the (presumably functional) XC
cluster first and then migrate from there to RHEL5.3. I don't recall
Slurm being hard to get working, but problems will be a lot easier to
diagnose if you get OpenMPI and the resource manager working separately
before putting them together.

Ashley,

-- 
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk