On Dec 5, 2011, at 5:49 AM, arnaud Heritier wrote:

> Hello,
> 
> I found the solution, thanks to Qlogic support.
> 
> The "can't open /dev/ipath, network down (err=26)" message from the ipath 
> driver is really misleading.
> 
> Actually, this is a hardware context problem in the Qlogic PSM. PSM can't 
> allocate any hardware context for the job because other MPI jobs have 
> already used all available contexts. To avoid this problem, every MPI job 
> has to set the PSM_SHAREDCONTEXTS_MAX variable to a good value, according 
> to the number of processes that will run on the node. If we don't set this 
> variable, PSM will "greedily" use all contexts for the first MPI job 
> spawned on the node.

Sounds like we should be setting this value when starting the process - yes? If 
so, what is the "good" value, and how do we compute it?
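
As a starting point for that question, here is a rough sketch of one way to 
pick a value and pass it at launch. It is not confirmed by QLogic or by 
anything in this thread. The proportional-share rule, the helper names, the 
16-context and 12-slot figures, and ./my_mpi_app are all assumptions made up 
for illustration; only the PSM_SHAREDCONTEXTS_MAX variable and mpirun's 
"-x VAR=value" option are real. It also assumes the processes of a job can 
share the contexts the job is allowed to take.

```python
def contexts_for_job(job_procs, node_procs, node_contexts):
    """Proportional share of a node's hardware contexts for one job.

    Hypothetical heuristic, not a documented PSM formula: a job that uses
    job_procs of the node's node_procs slots gets the same fraction of the
    node's hardware contexts, rounded down, but never fewer than one.
    """
    if job_procs < 1 or node_contexts < 1 or node_procs < job_procs:
        raise ValueError("need 1 <= job_procs <= node_procs and node_contexts >= 1")
    return max(1, node_contexts * job_procs // node_procs)


def mpirun_command(job_procs, max_contexts, app):
    """Build an mpirun argument list that exports the cap to every rank."""
    return [
        "mpirun",
        "-np", str(job_procs),
        # -x exports an environment variable to the launched processes.
        "-x", f"PSM_SHAREDCONTEXTS_MAX={max_contexts}",
        app,
    ]


if __name__ == "__main__":
    # Assumed numbers: 16 contexts on the HCA, 12 slots on the node,
    # and a job that uses 6 of those slots.
    cap = contexts_for_job(job_procs=6, node_procs=12, node_contexts=16)
    print(" ".join(mpirun_command(6, cap, "./my_mpi_app")))
```

Rounding down keeps the sum of the shares at or below the node total as long 
as no job falls back to the one-context minimum. The real context count for 
the HCA would have to come from QLogic's documentation or support.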

> 
> Regards,
> 
> Arnaud
> 
> 
> On Tue, Nov 29, 2011 at 6:44 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
> On Nov 28, 2011, at 11:53 PM, arnaud Heritier wrote:
> 
> > I do have a contract and I tried to open a case, but their support is ......
> 
> What happens if you put a delay between the two jobs?  E.g., if you just 
> delay a few seconds before the 2nd job starts?  Perhaps the ipath device just 
> needs a little time before it will be available...?  (that's a total guess)
> 
> I suggest this because the PSM device will definitely give you better overall 
> performance than the QLogic verbs support.  Their verbs support basically 
> barely works -- PSM is their primary device and the one that we always 
> recommend.
> 
> > Anyway. I'm still working on the strange error message from mpirun saying it 
> > can't allocate memory when at the same time it also reports that the memory 
> > is unlimited ...
> >
> >
> > Arnaud
> >
> > On Tue, Nov 29, 2011 at 4:23 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
> > I'm afraid we don't have any contacts left at QLogic to ask them any 
> > more... do you have a support contract, perchance?
> >
> > On Nov 27, 2011, at 3:11 PM, Arnaud Heritier wrote:
> >
> > > Hello,
> > >
> > > I ran into a strange problem with Qlogic OFED and Open MPI. When I submit 
> > > (through SGE) 2 jobs on the same node, the second job ends up with:
> > >
> > > (ipath/PSM)[10292]: can't open /dev/ipath, network down (err=26)
> > >
> > > I'm pretty sure InfiniBand is working well, as the other job runs fine.
> > >
> > > Here are details about the configuration:
> > >
> > > Qlogic HCA: InfiniPath_QMH7342 (2 ports but only one connected to a 
> > > switch)
> > > qlogic_ofed-1.5.3-7.0.0.0.35 (rocks cluster roll)
> > > openmpi 1.5.4 (./configure --with-psm --with-openib --with-sge)
> > >
> > > -------------
> > >
> > > To fix this problem I recompiled Open MPI without PSM support, 
> > > but I faced another problem:
> > >
> > > The OpenFabrics (openib) BTL failed to initialize while trying to
> > > allocate some locked memory.  This typically can indicate that the
> > > memlock limits are set too low.  For most HPC installations, the
> > > memlock limits should be set to "unlimited".  The failure occured
> > > here:
> > >
> > >   Local host:    compute-0-6.local
> > >   OMPI source:   btl_openib.c:329
> > >   Function:      ibv_create_srq()
> > >   Device:        qib0
> > >   Memlock limit: unlimited
> > >
> > >
> > > _______________________________________________
> > > users mailing list
> > > us...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> >
> 
> 
> 
> 
