Hello,

I found the solution, thanks to QLogic support.

The "can't open /dev/ipath, network down (err=26)" message from the ipath
driver is really misleading.

This is actually a hardware context problem in QLogic PSM. PSM can't
allocate a hardware context for the job because other MPI jobs have
already taken all the available contexts. To avoid this, every MPI job
has to set the PSM_SHAREDCONTEXTS_MAX variable to a suitable value,
according to the number of processes that will run on the node. If the
variable is not set, PSM will "greedily" give all contexts to the first
MPI job spawned on the node.
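
For anyone who wants to script this, here is a rough sketch of how a job
wrapper could pick a value and pass it to mpirun. It is only an
illustration, not something from QLogic: the helper names, the executable
name and the total of 16 hardware contexts are assumptions of mine (check
the real number for your HCA), and the proportional split is just one
possible policy. The only things taken as given are the
PSM_SHAREDCONTEXTS_MAX variable and Open MPI's "-x VAR=value" option.

```python
import shlex


def contexts_for_job(total_contexts, job_procs, node_procs):
    """Hypothetical policy: give a job a share of the node's hardware
    contexts proportional to its share of the processes on that node."""
    if not 0 < job_procs <= node_procs:
        raise ValueError("job_procs must be between 1 and node_procs")
    share = total_contexts * job_procs // node_procs
    # Never ask for more contexts than processes, and never for fewer than one.
    return max(1, min(job_procs, share))


def mpirun_command(executable, job_procs, max_contexts):
    """Build an mpirun command line that exports the limit to every rank."""
    return [
        "mpirun",
        "-np", str(job_procs),
        "-x", f"PSM_SHAREDCONTEXTS_MAX={max_contexts}",
        executable,
    ]


if __name__ == "__main__":
    TOTAL_CONTEXTS = 16  # ASSUMPTION: not stated in this thread, check your HCA
    limit = contexts_for_job(TOTAL_CONTEXTS, job_procs=12, node_procs=24)
    # "./my_mpi_app" is a placeholder executable name.
    print(shlex.join(mpirun_command("./my_mpi_app", 12, limit)))
```

Under SGE the same value could instead be exported in the job script
before calling mpirun; the point is only that each job on the node gets
its own bounded limit instead of letting the first job take everything.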

Regards,

Arnaud


On Tue, Nov 29, 2011 at 6:44 PM, Jeff Squyres <jsquy...@cisco.com> wrote:

> On Nov 28, 2011, at 11:53 PM, arnaud Heritier wrote:
>
> > I do have a contract and I tried to open a case, but their support is
> ......
>
> What happens if you put a delay between the two jobs?  E.g., if you just
> delay a few seconds before the 2nd job starts?  Perhaps the ipath device
> just needs a little time before it will be available...?  (that's a total
> guess)
>
> I suggest this because the PSM device will definitely give you better
> overall performance than the QLogic verbs support.  Their verbs support
> basically barely works -- PSM is their primary device and the one that we
> always recommend.
>
> > Anyway. I'm still working on the strange error message from mpirun saying
> it can't allocate memory when at the same time it also reports that the
> memory is unlimited ...
> >
> >
> > Arnaud
> >
> > On Tue, Nov 29, 2011 at 4:23 AM, Jeff Squyres <jsquy...@cisco.com>
> wrote:
> > I'm afraid we don't have any contacts left at QLogic to ask them any
> more... do you have a support contract, perchance?
> >
> > On Nov 27, 2011, at 3:11 PM, Arnaud Heritier wrote:
> >
> > > Hello,
> > >
> > > I ran into a strange problem with QLogic OFED and Open MPI. When I
> submit (through SGE) 2 jobs on the same node, the second job ends up with:
> > >
> > > (ipath/PSM)[10292]: can't open /dev/ipath, network down (err=26)
> > >
> > > I'm pretty sure the infiniband is working well as the other job runs
> fine.
> > >
> > > Here are the details of the configuration:
> > >
> > > Qlogic HCA: InfiniPath_QMH7342 (2 ports but only one connected to a
> switch)
> > > qlogic_ofed-1.5.3-7.0.0.0.35 (rocks cluster roll)
> > > openmpi 1.5.4 (./configure --with-psm --with-openib --with-sge)
> > >
> > > -------------
> > >
> > > In order to fix this problem I recompiled Open MPI without PSM support,
> but I faced another problem:
> > >
> > > The OpenFabrics (openib) BTL failed to initialize while trying to
> > > allocate some locked memory.  This typically can indicate that the
> > > memlock limits are set too low.  For most HPC installations, the
> > > memlock limits should be set to "unlimited".  The failure occured
> > > here:
> > >
> > >   Local host:    compute-0-6.local
> > >   OMPI source:   btl_openib.c:329
> > >   Function:      ibv_create_srq()
> > >   Device:        qib0
> > >   Memlock limit: unlimited
> > >
> > >
> > > _______________________________________________
> > > users mailing list
> > > us...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
