Re: [OMPI users] Qlogic & openmpi
Hello,

I found the solution, thanks to QLogic support.

The "can't open /dev/ipath, network down (err=26)" message from the ipath driver is really misleading. Actually, this is a hardware context problem on the QLogic PSM. PSM can't allocate any hardware context for the job because other MPI jobs have already used all available contexts. In order to avoid this problem, every MPI job has to set the PSM_SHAREDCONTEXTS_MAX variable to the correct value, according to the number of processes that will run on the node. If this variable is not set, PSM will "greedily" use all contexts for the first MPI job spawned on the node.

Regards,

Arnaud

On Tue, Nov 29, 2011 at 6:44 PM, Jeff Squyres wrote:
> On Nov 28, 2011, at 11:53 PM, arnaud Heritier wrote:
>
> > I do have a contract and i tried to open a case, but their support is ..
>
> What happens if you put a delay between the two jobs? E.g., if you just delay
> a few seconds before the 2nd job starts? Perhaps the ipath device just needs
> a little time before it will be available...? (that's a total guess)
>
> I suggest this because the PSM device will definitely give you better overall
> performance than the QLogic verbs support. Their verbs support basically
> barely works -- PSM is their primary device and the one that we always
> recommend.
>
> > Anyway, I'm still working on the strange error message from mpirun saying
> > it can't allocate memory when at the same time it also reports that the
> > memory is unlimited ...
> >
> > Arnaud
> >
> > On Tue, Nov 29, 2011 at 4:23 AM, Jeff Squyres wrote:
> > I'm afraid we don't have any contacts left at QLogic to ask them any
> > more... do you have a support contract, perchance?
> >
> > On Nov 27, 2011, at 3:11 PM, Arnaud Heritier wrote:
> >
> > > Hello,
> > >
> > > I ran into a strange problem with QLogic OFED and Open MPI. When I
> > > submit (through SGE) 2 jobs on the same node, the second job ends up with:
> > >
> > > (ipath/PSM)[10292]: can't open /dev/ipath, network down (err=26)
> > >
> > > I'm pretty sure the InfiniBand is working well, as the other job runs fine.
> > >
> > > Here are details about the configuration:
> > >
> > > QLogic HCA: InfiniPath_QMH7342 (2 ports, but only one connected to a switch)
> > > qlogic_ofed-1.5.3-7.0.0.0.35 (Rocks cluster roll)
> > > Open MPI 1.5.4 (./configure --with-psm --with-openib --with-sge)
> > >
> > > In order to fix this problem I recompiled Open MPI without PSM support,
> > > but I faced another problem:
> > >
> > > The OpenFabrics (openib) BTL failed to initialize while trying to
> > > allocate some locked memory. This typically can indicate that the
> > > memlock limits are set too low. For most HPC installations, the
> > > memlock limits should be set to "unlimited". The failure occurred
> > > here:
> > >
> > > Local host: compute-0-6.local
> > > OMPI source: btl_openib.c:329
> > > Function: ibv_create_srq()
> > > Device: qib0
> > > Memlock limit: unlimited
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Qlogic & openmpi
On Dec 5, 2011, at 5:49 AM, arnaud Heritier wrote:

> Actually, this is a hardware context problem on the QLogic PSM. PSM can't
> allocate any hardware context for the job because other MPI jobs have
> already used all available contexts. In order to avoid this problem, every
> MPI job has to set the PSM_SHAREDCONTEXTS_MAX variable to the correct value,
> according to the number of processes that will run on the node. If this
> variable is not set, PSM will "greedily" use all contexts for the first MPI
> job spawned on the node.

Sounds like we should be setting this value when starting the process - yes? If so, what is the "good" value, and how do we compute it?
Re: [OMPI users] Qlogic & openmpi
On Mon, Dec 5, 2011 at 6:12 PM, Ralph Castain wrote:

> Sounds like we should be setting this value when starting the process -
> yes? If so, what is the "good" value, and how do we compute it?

The good value is roundup($OMPI_COMM_WORLD_LOCAL_SIZE / context_sharing_ratio), with a maximum ratio of 4 on my HCA.

QLogic provided me with a simple script to compute this value. I just changed my mpirun script to call this script, set PSM_SHAREDCONTEXTS_MAX to the returned value, and then call the MPI binary. Script attached.

Arnaud

--
Arnaud HERITIER
Meteo France International
+33 561432940
arnaud.herit...@mfi.fr
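[Editor's sketch - the attached script itself is not preserved in this archive. A minimal reconstruction of the round-up computation described above, assuming the 1:4 maximum sharing ratio reported for this HCA (the function name is an invention for illustration):]

```shell
# Sketch, NOT the actual QLogic-provided script: compute a value for
# PSM_SHAREDCONTEXTS_MAX as roundup(local ranks / sharing ratio).
compute_psm_contexts() {
    local_size=$1
    ratio=${2:-4}                                   # max sharing ratio (1:4 on this HCA)
    echo $(( (local_size + ratio - 1) / ratio ))    # integer round-up
}

# e.g. inside an mpirun wrapper script, before launching the MPI binary;
# OMPI_COMM_WORLD_LOCAL_SIZE is set by Open MPI for each local process.
PSM_SHAREDCONTEXTS_MAX=$(compute_psm_contexts "${OMPI_COMM_WORLD_LOCAL_SIZE:-1}")
export PSM_SHAREDCONTEXTS_MAX
```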
Re: [OMPI users] Open MPI and DAPL 2.0.34 are incompatible?
We've never recommended the use of uDAPL on Linux. I think it might have worked at one time, but I don't think anyone bothered to maintain it. On Linux, you should probably use native verbs support instead.

On Dec 2, 2011, at 1:21 PM, Paul Kapinos wrote:

> Dear Open MPI developers,
>
> OFED 1.5.4 will contain DAPL 2.0.34.
>
> I tried to compile the newest release of Open MPI (1.5.4) with this DAPL
> release, and I was not successful.
>
> Configuring with --with-udapl=/path/to/2.0.34/dapl got the error
> "/path/to/2.0.34/dapl/include/dat/udat.h not found". Looking into the
> include dir: there is no 'dat' subdir, but there is 'dat2'.
>
> Just for fun I also tried to rename 'dat2' back to 'dat' (a dirty hack, I
> know :-) - the configure stage was then successful, but the compilation
> failed. The headers seem to have really changed, not just moved.
>
> The question: are the Open MPI developers aware of these changes, and when
> will a version of Open MPI be available with support for DAPL 2.0.34?
>
> (Background: we have some trouble with Intel MPI and the current DAPL which
> we do not have with DAPL 2.0.34, so our dream is to update as soon as
> possible.)
>
> Best wishes and a nice weekend,
>
> Paul
>
> http://www.openfabrics.org/downloads/OFED/release_notes/OFED_1.5.4_release_notes
>
> --
> Dipl.-Inform. Paul Kapinos - High Performance Computing,
> RWTH Aachen University, Center for Computing and Communication
> Seffenter Weg 23, D 52074 Aachen (Germany)
> Tel: +49 241/80-24915

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] Qlogic & openmpi
On Mon, Dec 5, 2011 at 16:12, Ralph Castain wrote:

> Sounds like we should be setting this value when starting the process -
> yes? If so, what is the "good" value, and how do we compute it?

I've also been looking at this for the past few days. What I came up with is a small script, psm_shctx, which sets the env var and then execs the MPI binary; it is inserted between mpirun and the MPI binary:

mpirun psm_shctx my_mpi_app

Of course, the same effect could be obtained if the orted set the env var before starting the process.

There is, however, a problem: deciding how many contexts to use. For maximum performance, one should use a 1:1 ratio between MPI ranks and contexts; the highest ratio possible (but with the lowest performance) is 4 MPI ranks per context; another restriction is that each job needs at least 1 context.

E.g., on AMD cluster nodes with 4 CPUs of 12 cores each (48 cores total), one gets 16 contexts; assigning all 16 contexts to 48 ranks would mean a ratio of 1:3, but this can only apply if allocation of cores is done in multiples of 4; with a less advantageous allocation strategy, more contexts are lost due to rounding up. At the extreme, if there is only one rank per job, there can be at most 16 jobs - using all 16 contexts - and the remaining 32 cores have to stay idle or be used for jobs that don't communicate over InfiniPath.

There is a further issue: MPI-2 dynamic process creation - if it's not known in advance how many ranks there will be, I guess one should use the highest context-sharing ratio (1:4) to be on the safe side.

I've found a mention of this env var being handled in the changelog for MVAPICH2 1.4.1 - maybe that can serve as a source of inspiration? (but I haven't looked at it...)

Hope this helps,
Bogdan
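[Editor's sketch - the body of psm_shctx is an assumption; only its name and its placement between mpirun and the binary are described above. Using the safe worst-case 1:4 sharing ratio, such a wrapper might look like:]

```shell
#!/bin/sh
# psm_shctx (sketch): invoked as "mpirun psm_shctx my_mpi_app".
# Exports PSM_SHAREDCONTEXTS_MAX for this job's local ranks, then
# replaces itself with the real MPI binary via exec.
ranks=${OMPI_COMM_WORLD_LOCAL_SIZE:-1}    # local ranks, set by Open MPI
ratio=4                                   # safe worst case: 4 ranks per context
PSM_SHAREDCONTEXTS_MAX=$(( (ranks + ratio - 1) / ratio ))
export PSM_SHAREDCONTEXTS_MAX
exec "$@"                                 # run the actual MPI application
```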
[OMPI users] MPI_Comm_spawn problem
Hi,

I'm working with MPI_Comm_spawn and I get some error messages. The code is relatively simple:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, child_rank;
    char nomehost[20];
    MPI_Comm parent, intercomm1, intercomm2;
    int erro;
    int level, curr_level;

    MPI_Init(&argc, &argv);
    level = atoi(argv[1]);

    MPI_Comm_get_parent(&parent);
    if (parent == MPI_COMM_NULL) {
        rank = 0;
    } else {
        MPI_Recv(&rank, 1, MPI_INT, 0, 0, parent, MPI_STATUS_IGNORE);
    }

    curr_level = (int) log2(rank + 1);
    printf(" --> rank: %d and curr_level: %d\n", rank, curr_level);

    /* Node propagation */
    if (curr_level < level) {
        /* 2^(curr_level+1) - 1 + 2*(rank - 2^curr_level - 1) = 2*rank + 1 */
        child_rank = 2 * rank + 1;
        printf("(%d) Before create rank %d\n", rank, child_rank);
        MPI_Comm_spawn(argv[0], &argv[1], 1, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &intercomm1, &erro);
        printf("(%d) After create rank %d\n", rank, child_rank);
        MPI_Send(&child_rank, 1, MPI_INT, 0, 0, intercomm1);
        /* sleep(1); */

        child_rank = child_rank + 1;
        printf("(%d) Before create rank %d\n", rank, child_rank);
        MPI_Comm_spawn(argv[0], &argv[1], 1, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &intercomm2, &erro);
        printf("(%d) After create rank %d\n", rank, child_rank);
        MPI_Send(&child_rank, 1, MPI_INT, 0, 0, intercomm2);
    }

    gethostname(nomehost, 20);
    printf("(%d) in %s\n", rank, nomehost);
    MPI_Finalize();
    return 0;
}

The program creates a binary tree of processes down to a specific level, determined by the variable "level". If the level is 2, the tree will be:

          (0)
         /   \
      (1)     (2)
     /  \    /  \
   (3)  (4) (5)  (6)

The error messages are (when I use 1 host):

Compiling: mpicc test.c -o test -lm
Running: mpirun -np 1 ./test 3

 --> rank: 0 and curr_level: 0
(0) Before create rank 1
(0) After create rank 1
(0) Before create rank 2
 --> rank: 1 and curr_level: 1
(1) Before create rank 3
[cacau.ic.uff.br:17892] [[31928,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 75

When I use 2 hosts, the error is worse.
The code is similar to the one written here (with 2 hosts I have to set the hosts via MPI_Info_set before the spawn). Using LAM/MPI, the program runs normally.

I think something goes wrong when I use 2 consecutive MPI_Comm_spawn calls and the child processes spawn further processes themselves. It seems to be a race condition, because the error does not always happen (when the level is 2, for example). With 3 levels or more, the error is recurrent.

A similar error was previously reported in another thread:

http://www.open-mpi.org/community/lists/users/2009/12/11601.php

However, I used the stable version 1.4.4 and this problem still happens. Do the developers plan to fix it?

Thanks,
Fernanda