On Dec 5, 2011, at 5:49 AM, arnaud Heritier wrote:

> Hello,
>
> I found the solution, thanks to QLogic support.
>
> The "can't open /dev/ipath, network down (err=26)" message from the ipath
> driver is really misleading.
>
> Actually, this is a hardware context problem in the QLogic PSM. PSM can't
> allocate any hardware context for the job because other MPI jobs have
> already used all available contexts. To avoid this problem, every MPI job
> has to set the PSM_SHAREDCONTEXTS_MAX variable to the right value,
> according to the number of processes that will run on the node. If we
> don't use this variable, PSM will "greedily" use all contexts for the first
> MPI job spawned on the node.
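
A minimal sketch of what that could look like when launching through Open MPI's
mpirun (the rank count and application name below are placeholders, and the
assumption that the cap should roughly match the number of processes the job
runs on the node is taken from the explanation above -- QLogic's PSM
documentation should be the authority on the exact value):

    # Hypothetical example: a job running 8 ranks on one node, capped so that
    # a second job's ranks can still get hardware contexts.  mpirun's -x
    # option exports the variable to every rank of the job.
    mpirun -np 8 -x PSM_SHAREDCONTEXTS_MAX=8 ./my_mpi_app
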
Sounds like we should be setting this value when starting the process - yes?  If so, what is the "good" value, and how do we compute it?

>
> Regards,
>
> Arnaud
>
>
> On Tue, Nov 29, 2011 at 6:44 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
> On Nov 28, 2011, at 11:53 PM, arnaud Heritier wrote:
>
> > I do have a contract and I tried to open a case, but their support is ......
>
> What happens if you put a delay between the two jobs?  E.g., if you just
> delay a few seconds before the 2nd job starts?  Perhaps the ipath device just
> needs a little time before it will be available...?  (that's a total guess)
>
> I suggest this because the PSM device will definitely give you better overall
> performance than the QLogic verbs support.  Their verbs support basically
> barely works -- PSM is their primary device and the one that we always
> recommend.
>
> > Anyway, I'm still working on the strange error message from mpirun saying it
> > can't allocate memory when at the same time it also reports that the memory
> > is unlimited ...
> >
> >
> > Arnaud
> >
> > On Tue, Nov 29, 2011 at 4:23 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
> > I'm afraid we don't have any contacts left at QLogic to ask them any
> > more...  do you have a support contract, perchance?
> >
> > On Nov 27, 2011, at 3:11 PM, Arnaud Heritier wrote:
> >
> > > Hello,
> > >
> > > I ran into a strange problem with QLogic OFED and Open MPI. When I submit
> > > (through SGE) 2 jobs on the same node, the second job ends up with:
> > >
> > > (ipath/PSM)[10292]: can't open /dev/ipath, network down (err=26)
> > >
> > > I'm pretty sure the InfiniBand fabric is working well, as the other job
> > > runs fine.
> > >
> > > Here are the details of the configuration:
> > >
> > > QLogic HCA: InfiniPath_QMH7342 (2 ports, but only one connected to a switch)
> > > qlogic_ofed-1.5.3-7.0.0.0.35 (Rocks cluster roll)
> > > Open MPI 1.5.4 (./configure --with-psm --with-openib --with-sge)
> > >
> > > -------------
> > >
> > > In order to fix this problem I recompiled Open MPI without PSM support,
> > > but I faced another problem:
> > >
> > > The OpenFabrics (openib) BTL failed to initialize while trying to
> > > allocate some locked memory. This typically can indicate that the
> > > memlock limits are set too low. For most HPC installations, the
> > > memlock limits should be set to "unlimited".
> > > The failure occured here:
> > >
> > >   Local host:    compute-0-6.local
> > >   OMPI source:   btl_openib.c:329
> > >   Function:      ibv_create_srq()
> > >   Device:        qib0
> > >   Memlock limit: unlimited
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
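
For reference, the memlock limits that the openib help text refers to are
usually checked and raised along the lines below. This is only a generic
sketch: the output above already reports the limit as unlimited, so the
ibv_create_srq() failure in this particular case may well have another cause
(for example, the limit seen by the daemon that SGE uses to spawn the job can
differ from the one in an interactive shell).

    # Check the locked-memory limit in the environment where the MPI
    # processes actually run (e.g. from inside a batch job), not just in a
    # login shell:
    ulimit -l

    # Typical way to raise it for all users, via /etc/security/limits.conf:
    #   * soft memlock unlimited
    #   * hard memlock unlimited
    # Daemons started before the change (e.g. the SGE execd) generally need
    # to be restarted so that the jobs they spawn inherit the new limit.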