Re: [OMPI users] Running a hybrid MPI+openMP program
Reuti and Oscar,

I'm a Torque user and I myself have never used SGE, so I hesitated to join the discussion. From my experience with Torque, the openmpi 1.8 series has already resolved the issue you pointed out in combining MPI with OpenMP.

Please try to add the --map-by slot:pe=8 option if you want to use 8 threads. Then openmpi 1.8 should allocate processes properly without any modification of the hostfile provided by Torque.

In your case (8 threads and 10 procs):

# you have to request 80 slots using an SGE command before mpirun
mpirun --map-by slot:pe=8 -np 10 ./inverse.exe

where you can omit the --bind-to option because --bind-to core is assumed as the default when pe=N is provided by the user.

Regards,
Tetsuya

>Hi, > >Am 19.08.2014 um 19:06 schrieb Oscar Mojica: > >> I discovered what was the error. I forgot include the '-fopenmp' when I >> compiled the objects in the Makefile, so the program worked but it didn't >> divide the job in threads. Now the program is working and I can use until 15 cores for machine in the queue one.q. >> >> Anyway i would like to try implement your advice. Well I'm not alone in the >> cluster so i must implement your second suggestion. The steps are >> >> a) Use '$ qconf -mp orte' to change the allocation rule to 8 > >The number of slots defined in your used one.q was also increased to 8 (`qconf >-sq one.q`)? > > >> b) Set '#$ -pe orte 80' in the script > >Fine. > > >> c) I'm not sure how to do this step. I'd appreciate your help here. I can >> add some lines to the script to determine the PE_HOSTFILE path and contents, >> but i don't know how alter it > >For now you can put in your jobscript (just after OMP_NUM_THREAD is exported): > >awk -v omp_num_threads=$OMP_NUM_THREADS '{ $2/=omp_num_threads; print }' >$PE_HOSTFILE > $TMPDIR/machines >export PE_HOSTFILE=$TMPDIR/machines > >= > >Unfortunately noone stepped into this discussion, as in my opinion it's a much >broader issue which targets all users who want to combine MPI with OpenMP. The queuingsystem should get a proper request for the overall amount of slots the user needs. For now this will be forwarded to Open MPI and it will use this information to start the appropriate number of processes (which was an achievement for the Tight Integration out-of-the-box of course) and ignores any setting of OMP_NUM_THREADS. So, where should the generated list of machines be adjusted; there are several options: > >a) The PE of the queuingsystem should do it: > >+ a one time setup for the admin >+ in SGE the "start_proc_args" of the PE could alter the $PE_HOSTFILE >- the "start_proc_args" would need to know the number of threads, i.e. >OMP_NUM_THREADS must be defined by "qsub -v ..."
outside of the jobscript >(tricky scanning of the submitted jobscript for OMP_NUM_THREADS would be too nasty) >- limits to use inside the jobscript calls to libraries behaving in the same >way as Open MPI only > > >b) The particular queue should do it in a queue prolog: > >same as a) I think > > >c) The user should do it > >+ no change in the SGE installation >- each and every user must include it in all the jobscripts to adjust the list >and export the pointer to the $PE_HOSTFILE, but he could change it forth and >back for different steps of the jobscript though > > >d) Open MPI should do it > >+ no change in the SGE installation >+ no change to the jobscript >+ OMP_NUM_THREADS can be altered for different steps of the jobscript while >staying inside the granted allocation automatically >o should MKL_NUM_THREADS be covered too (does it use OMP_NUM_THREADS already)? > >-- Reuti > > >> echo "PE_HOSTFILE:" >> echo $PE_HOSTFILE >> echo >> echo "cat PE_HOSTFILE:" >> cat $PE_HOSTFILE >> >> Thanks for take a time for answer this emails, your advices had been very >> useful >> >> PS: The version of SGE is OGS/GE 2011.11p1 >> >> >> Oscar Fabian Mojica Ladino >> Geologist M.S. in Geophysics >> >> >> > From: re...@staff.uni-marburg.de >> > Date: Fri, 15 Aug 2014 20:38:12 +0200 >> > To: us...@open-mpi.org >> > Subject: Re: [OMPI users] Running a hybrid MPI+openMP program >> > >> > Hi, >> > >> > Am 15.08.2014 um 19:56 schrieb Oscar Mojica: >> > >> > > Yes, my installation of Open MPI is SGE-aware. I got the following >> > > >> > > [oscar@compute-1-2 ~]$ ompi_info | grep grid >> > > MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.2) >> > >> > Fine. >> > >> > >> > > I'm a bit slow and I didn't understand the las part of your message. So >> > > i made a test trying to solve my doubts. >> > > This is the cluster configuration: There are some machines turned off >> > > but that is no problem >> > > >> > > [oscar@aguia free-noise]$ qhost >> > > HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS >> > > --- >> > > global - - - - - - - >> > > compute-1-10 linux-x64 16 0.97 23.6G 558.6M 996.2M 0.0 >> > > compute-1-11
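For reference, a minimal hybrid MPI+OpenMP program of the kind discussed in this thread could look like the sketch below. It is only an illustration (the printf output and source layout are made up); the key points are that it must be compiled with mpicc -fopenmp (forgetting -fopenmp is exactly the mistake Oscar describes above: the code still runs, but with a single thread per process) and that, following Tetsuya's suggestion, it would be launched after requesting 80 slots with something like "OMP_NUM_THREADS=8 mpirun --map-by slot:pe=8 -np 10 ./hello_hybrid".

---
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nprocs;

    /* MPI_THREAD_FUNNELED is sufficient as long as only the master
     * thread of each process makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each MPI process spawns OMP_NUM_THREADS OpenMP threads. */
    #pragma omp parallel
    printf("rank %d of %d, thread %d of %d\n",
           rank, nprocs, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}
---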
[OMPI users] Does multiple Irecv means concurrent receiving ?
I have a performance problem with receiving. In a single master thread, I made several Irecv calls:

Irecv(buf1, ..., tag, ANY_SOURCE, COMM_WORLD)
Irecv(buf2, ..., tag, ANY_SOURCE, COMM_WORLD)
...
Irecv(bufn, ..., tag, ANY_SOURCE, COMM_WORLD)

all of which try to receive messages with the same tag from any node. Then, whenever one of the Irecvs completes (detected with Testany), a separate thread is dispatched to work on the received message. In my program, many nodes send to this master thread. However, I noticed that the receive speed is almost unaffected no matter how many Irecv calls are posted. It seems that posting multiple Irecvs does not result in concurrent receiving from many nodes. Profiling the node running the master thread shows that its network input bandwidth is quite low. Is my understanding correct? And how can I maximize the receive throughput of the master thread? Thanks! Zhang Lei @ Baidu, Inc.
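For readers who want to reproduce the pattern described above, here is a minimal, self-contained sketch (the buffer count, message length, tag value, and the assumption of exactly one fixed-size message per worker rank are all illustrative; it only reproduces the pre-posted Irecv / Testany loop, it is not a fix for the low bandwidth):

---
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NBUF   8        /* number of pre-posted receives (illustrative) */
#define MSGLEN 65536    /* message length in chars (illustrative) */
#define TAG    42

int main(int argc, char **argv)
{
    int rank, size, i, received = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                      /* the "master" */
        char (*buf)[MSGLEN] = malloc(NBUF * sizeof *buf);
        MPI_Request req[NBUF];

        /* Pre-post several receives for the same tag from any source. */
        for (i = 0; i < NBUF; i++)
            MPI_Irecv(buf[i], MSGLEN, MPI_CHAR, MPI_ANY_SOURCE, TAG,
                      MPI_COMM_WORLD, &req[i]);

        /* Expect exactly one message from every other rank. */
        while (received < size - 1) {
            int idx, flag;
            MPI_Status st;
            MPI_Testany(NBUF, req, &idx, &flag, &st);
            if (flag && idx != MPI_UNDEFINED) {
                received++;
                /* ... hand buf[idx] to a worker thread here ... */
                /* Re-post this slot so it can match another message. */
                MPI_Irecv(buf[idx], MSGLEN, MPI_CHAR, MPI_ANY_SOURCE, TAG,
                          MPI_COMM_WORLD, &req[idx]);
            }
        }

        /* All expected messages arrived; cancel the receives that will
         * never be matched before shutting down. */
        for (i = 0; i < NBUF; i++) {
            MPI_Cancel(&req[i]);
            MPI_Wait(&req[i], MPI_STATUS_IGNORE);
        }
        free(buf);
    } else {                              /* the "workers" */
        char *msg = calloc(MSGLEN, 1);
        MPI_Send(msg, MSGLEN, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
        free(msg);
    }

    MPI_Finalize();
    return 0;
}
---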
Re: [OMPI users] Running a hybrid MPI+openMP program
Hi, Am 20.08.2014 um 06:26 schrieb Tetsuya Mishima: > Reuti and Oscar, > > I'm a Torque user and I myself have never used SGE, so I hesitated to join > the discussion. > > From my experience with the Torque, the openmpi 1.8 series has already > resolved the issue you pointed out in combining MPI with OpenMP. > > Please try to add --map-by slot:pe=8 option, if you want to use 8 threads. > Then, then openmpi 1.8 should allocate processes properly without any > modification > of the hostfile provided by the Torque. > > In your case(8 threads and 10 procs): > > # you have to request 80 slots using SGE command before mpirun > mpirun --map-by slot:pe=8 -np 10 ./inverse.exe Thx for pointing me to this option, for now I can't get it working though (in fact, I want to use it without binding essentially). This allows to tell Open MPI to bind more cores to each of the MPI processes - ok, but does it lower the slot count granted by Torque too? I mean, was your submission command like: $ qsub -l nodes=10:ppn=8 ... so that Torque knows, that it should grant and remember this slot count of a total of 80 for the correct accounting? -- Reuti > where you can omit --bind-to option because --bind-to core is assumed > as default when pe=N is provided by the user. > Regards, > Tetsuya > >> Hi, >> >> Am 19.08.2014 um 19:06 schrieb Oscar Mojica: >> >>> I discovered what was the error. I forgot include the '-fopenmp' when I >>> compiled the objects in the Makefile, so the program worked but it didn't >>> divide the job > in threads. Now the program is working and I can use until 15 cores for > machine in the queue one.q. >>> >>> Anyway i would like to try implement your advice. Well I'm not alone in the >>> cluster so i must implement your second suggestion. The steps are >>> >>> a) Use '$ qconf -mp orte' to change the allocation rule to 8 >> >> The number of slots defined in your used one.q was also increased to 8 >> (`qconf -sq one.q`)? >> >> >>> b) Set '#$ -pe orte 80' in the script >> >> Fine. >> >> >>> c) I'm not sure how to do this step. I'd appreciate your help here. I can >>> add some lines to the script to determine the PE_HOSTFILE path and >>> contents, but i > don't know how alter it >> >> For now you can put in your jobscript (just after OMP_NUM_THREAD is >> exported): >> >> awk -v omp_num_threads=$OMP_NUM_THREADS '{ $2/=omp_num_threads; print }' >> $PE_HOSTFILE > $TMPDIR/machines >> export PE_HOSTFILE=$TMPDIR/machines >> >> = >> >> Unfortunately noone stepped into this discussion, as in my opinion it's a >> much broader issue which targets all users who want to combine MPI with >> OpenMP. The > queuingsystem should get a proper request for the overall amount of slots the > user needs. For now this will be forwarded to Open MPI and it will use this > information to start the appropriate number of processes (which was an > achievement for the Tight Integration out-of-the-box of course) and ignores > any setting of > OMP_NUM_THREADS. So, where should the generated list of machines be adjusted; > there are several options: >> >> a) The PE of the queuingsystem should do it: >> >> + a one time setup for the admin >> + in SGE the "start_proc_args" of the PE could alter the $PE_HOSTFILE >> - the "start_proc_args" would need to know the number of threads, i.e. >> OMP_NUM_THREADS must be defined by "qsub -v ..." 
outside of the jobscript >> (tricky scanning > of the submitted jobscript for OMP_NUM_THREADS would be too nasty) >> - limits to use inside the jobscript calls to libraries behaving in the same >> way as Open MPI only >> >> >> b) The particular queue should do it in a queue prolog: >> >> same as a) I think >> >> >> c) The user should do it >> >> + no change in the SGE installation >> - each and every user must include it in all the jobscripts to adjust the >> list and export the pointer to the $PE_HOSTFILE, but he could change it >> forth and back > for different steps of the jobscript though >> >> >> d) Open MPI should do it >> >> + no change in the SGE installation >> + no change to the jobscript >> + OMP_NUM_THREADS can be altered for different steps of the jobscript while >> staying inside the granted allocation automatically >> o should MKL_NUM_THREADS be covered too (does it use OMP_NUM_THREADS >> already)? >> >> -- Reuti >> >> >>> echo "PE_HOSTFILE:" >>> echo $PE_HOSTFILE >>> echo >>> echo "cat PE_HOSTFILE:" >>> cat $PE_HOSTFILE >>> >>> Thanks for take a time for answer this emails, your advices had been very >>> useful >>> >>> PS: The version of SGE is OGS/GE 2011.11p1 >>> >>> >>> Oscar Fabian Mojica Ladino >>> Geologist M.S. in Geophysics >>> >>> From: re...@staff.uni-marburg.de Date: Fri, 15 Aug 2014 20:38:12 +0200 To: us...@open-mpi.org Subject: Re: [OMPI users] Running a hybrid MPI+openMP program Hi, Am 15.08.2014 um 19:56 schrieb Oscar Mojica: > Yes, m
Re: [OMPI users] Running a hybrid MPI+openMP program
Reuti, If you want to allocate 10 procs with N threads, the Torque script below should work for you: qsub -l nodes=10:ppn=N mpirun -map-by slot:pe=N -np 10 -x OMP_NUM_THREADS=N ./inverse.exe Then, the openmpi automatically reduces the logical slot count to 10 by dividing real slot count 10N by binding width of N. I don't know why you want to use pe=N without binding, but unfortunately the openmpi allocates successive cores to each process so far when you use pe option - it forcibly bind_to core. Tetsuya > Hi, > > Am 20.08.2014 um 06:26 schrieb Tetsuya Mishima: > > > Reuti and Oscar, > > > > I'm a Torque user and I myself have never used SGE, so I hesitated to join > > the discussion. > > > > From my experience with the Torque, the openmpi 1.8 series has already > > resolved the issue you pointed out in combining MPI with OpenMP. > > > > Please try to add --map-by slot:pe=8 option, if you want to use 8 threads. > > Then, then openmpi 1.8 should allocate processes properly without any modification > > of the hostfile provided by the Torque. > > > > In your case(8 threads and 10 procs): > > > > # you have to request 80 slots using SGE command before mpirun > > mpirun --map-by slot:pe=8 -np 10 ./inverse.exe > > Thx for pointing me to this option, for now I can't get it working though (in fact, I want to use it without binding essentially). This allows to tell Open MPI to bind more cores to each of the MPI > processes - ok, but does it lower the slot count granted by Torque too? I mean, was your submission command like: > > $ qsub -l nodes=10:ppn=8 ... > > so that Torque knows, that it should grant and remember this slot count of a total of 80 for the correct accounting? > > -- Reuti > > > > where you can omit --bind-to option because --bind-to core is assumed > > as default when pe=N is provided by the user. > > Regards, > > Tetsuya > > > >> Hi, > >> > >> Am 19.08.2014 um 19:06 schrieb Oscar Mojica: > >> > >>> I discovered what was the error. I forgot include the '-fopenmp' when I compiled the objects in the Makefile, so the program worked but it didn't divide the job > > in threads. Now the program is working and I can use until 15 cores for machine in the queue one.q. > >>> > >>> Anyway i would like to try implement your advice. Well I'm not alone in the cluster so i must implement your second suggestion. The steps are > >>> > >>> a) Use '$ qconf -mp orte' to change the allocation rule to 8 > >> > >> The number of slots defined in your used one.q was also increased to 8 (`qconf -sq one.q`)? > >> > >> > >>> b) Set '#$ -pe orte 80' in the script > >> > >> Fine. > >> > >> > >>> c) I'm not sure how to do this step. I'd appreciate your help here. I can add some lines to the script to determine the PE_HOSTFILE path and contents, but i > > don't know how alter it > >> > >> For now you can put in your jobscript (just after OMP_NUM_THREAD is exported): > >> > >> awk -v omp_num_threads=$OMP_NUM_THREADS '{ $2/=omp_num_threads; print }' $PE_HOSTFILE > $TMPDIR/machines > >> export PE_HOSTFILE=$TMPDIR/machines > >> > >> = > >> > >> Unfortunately noone stepped into this discussion, as in my opinion it's a much broader issue which targets all users who want to combine MPI with OpenMP. The > > queuingsystem should get a proper request for the overall amount of slots the user needs. 
For now this will be forwarded to Open MPI and it will use this > > information to start the appropriate number of processes (which was an achievement for the Tight Integration out-of-the-box of course) and ignores any setting of > > OMP_NUM_THREADS. So, where should the generated list of machines be adjusted; there are several options: > >> > >> a) The PE of the queuingsystem should do it: > >> > >> + a one time setup for the admin > >> + in SGE the "start_proc_args" of the PE could alter the $PE_HOSTFILE > >> - the "start_proc_args" would need to know the number of threads, i.e. OMP_NUM_THREADS must be defined by "qsub -v ..." outside of the jobscript (tricky scanning > > of the submitted jobscript for OMP_NUM_THREADS would be too nasty) > >> - limits to use inside the jobscript calls to libraries behaving in the same way as Open MPI only > >> > >> > >> b) The particular queue should do it in a queue prolog: > >> > >> same as a) I think > >> > >> > >> c) The user should do it > >> > >> + no change in the SGE installation > >> - each and every user must include it in all the jobscripts to adjust the list and export the pointer to the $PE_HOSTFILE, but he could change it forth and back > > for different steps of the jobscript though > >> > >> > >> d) Open MPI should do it > >> > >> + no change in the SGE installation > >> + no change to the jobscript > >> + OMP_NUM_THREADS can be altered for different steps of the jobscript while staying inside the granted allocation automatically > >> o should MKL_NUM_THREADS be covered too (does it use OMP_NUM_THREADS already)? > >> > >> -- Reuti > >> > >> > >>> echo "PE_HOSTFILE:" > >>> echo
Re: [OMPI users] Running a hybrid MPI+openMP program
Just to clarify: OMPI will bind the process to *all* N cores, not just to one. On Aug 20, 2014, at 4:26 AM, tmish...@jcity.maeda.co.jp wrote: > Reuti, > > If you want to allocate 10 procs with N threads, the Torque > script below should work for you: > > qsub -l nodes=10:ppn=N > mpirun -map-by slot:pe=N -np 10 -x OMP_NUM_THREADS=N ./inverse.exe > > Then, the openmpi automatically reduces the logical slot count to 10 > by dividing real slot count 10N by binding width of N. > > I don't know why you want to use pe=N without binding, but unfortunately > the openmpi allocates successive cores to each process so far when you > use pe option - it forcibly bind_to core. > > Tetsuya > > >> Hi, >> >> Am 20.08.2014 um 06:26 schrieb Tetsuya Mishima: >> >>> Reuti and Oscar, >>> >>> I'm a Torque user and I myself have never used SGE, so I hesitated to > join >>> the discussion. >>> >>> From my experience with the Torque, the openmpi 1.8 series has already >>> resolved the issue you pointed out in combining MPI with OpenMP. >>> >>> Please try to add --map-by slot:pe=8 option, if you want to use 8 > threads. >>> Then, then openmpi 1.8 should allocate processes properly without any > modification >>> of the hostfile provided by the Torque. >>> >>> In your case(8 threads and 10 procs): >>> >>> # you have to request 80 slots using SGE command before mpirun >>> mpirun --map-by slot:pe=8 -np 10 ./inverse.exe >> >> Thx for pointing me to this option, for now I can't get it working though > (in fact, I want to use it without binding essentially). This allows to > tell Open MPI to bind more cores to each of the MPI >> processes - ok, but does it lower the slot count granted by Torque too? I > mean, was your submission command like: >> >> $ qsub -l nodes=10:ppn=8 ... >> >> so that Torque knows, that it should grant and remember this slot count > of a total of 80 for the correct accounting? >> >> -- Reuti >> >> >>> where you can omit --bind-to option because --bind-to core is assumed >>> as default when pe=N is provided by the user. >>> Regards, >>> Tetsuya >>> Hi, Am 19.08.2014 um 19:06 schrieb Oscar Mojica: > I discovered what was the error. I forgot include the '-fopenmp' when > I compiled the objects in the Makefile, so the program worked but it didn't > divide the job >>> in threads. Now the program is working and I can use until 15 cores for > machine in the queue one.q. > > Anyway i would like to try implement your advice. Well I'm not alone > in the cluster so i must implement your second suggestion. The steps are > > a) Use '$ qconf -mp orte' to change the allocation rule to 8 The number of slots defined in your used one.q was also increased to 8 > (`qconf -sq one.q`)? > b) Set '#$ -pe orte 80' in the script Fine. > c) I'm not sure how to do this step. I'd appreciate your help here. I > can add some lines to the script to determine the PE_HOSTFILE path and > contents, but i >>> don't know how alter it For now you can put in your jobscript (just after OMP_NUM_THREAD is > exported): awk -v omp_num_threads=$OMP_NUM_THREADS '{ $2/=omp_num_threads; > print }' $PE_HOSTFILE > $TMPDIR/machines export PE_HOSTFILE=$TMPDIR/machines = Unfortunately noone stepped into this discussion, as in my opinion > it's a much broader issue which targets all users who want to combine MPI > with OpenMP. The >>> queuingsystem should get a proper request for the overall amount of > slots the user needs. 
For now this will be forwarded to Open MPI and it > will use this >>> information to start the appropriate number of processes (which was an > achievement for the Tight Integration out-of-the-box of course) and ignores > any setting of >>> OMP_NUM_THREADS. So, where should the generated list of machines be > adjusted; there are several options: a) The PE of the queuingsystem should do it: + a one time setup for the admin + in SGE the "start_proc_args" of the PE could alter the $PE_HOSTFILE - the "start_proc_args" would need to know the number of threads, i.e. > OMP_NUM_THREADS must be defined by "qsub -v ..." outside of the jobscript > (tricky scanning >>> of the submitted jobscript for OMP_NUM_THREADS would be too nasty) - limits to use inside the jobscript calls to libraries behaving in > the same way as Open MPI only b) The particular queue should do it in a queue prolog: same as a) I think c) The user should do it + no change in the SGE installation - each and every user must include it in all the jobscripts to adjust > the list and export the pointer to the $PE_HOSTFILE, but he could change it > forth and back >>> for different steps of the jobscript though d) Open MPI should do it + no change in the SGE installation + no change to the jo
Re: [OMPI users] ORTE daemon has unexpectedly failed after launch
Hello! As i can see, the bug is fixed, but in Open MPI v1.9a1r32516 i still have the problem a) $ mpirun -np 1 ./hello_c -- An ORTE daemon has unexpectedly failed after launch and before communicating back to mpirun. This could be caused by a number of factors, including an inability to create a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements). -- b) $ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c -- An ORTE daemon has unexpectedly failed after launch and before communicating back to mpirun. This could be caused by a number of factors, including an inability to create a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements). -- c) $ mpirun --mca oob_tcp_if_include ib0 -debug-daemons --mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 1 ./hello_c [compiler-2:14673] mca:base:select:( plm) Querying component [isolated] [compiler-2:14673] mca:base:select:( plm) Query of component [isolated] set priority to 0 [compiler-2:14673] mca:base:select:( plm) Querying component [rsh] [compiler-2:14673] mca:base:select:( plm) Query of component [rsh] set priority to 10 [compiler-2:14673] mca:base:select:( plm) Querying component [slurm] [compiler-2:14673] mca:base:select:( plm) Query of component [slurm] set priority to 75 [compiler-2:14673] mca:base:select:( plm) Selected component [slurm] [compiler-2:14673] mca: base: components_register: registering oob components [compiler-2:14673] mca: base: components_register: found loaded component tcp [compiler-2:14673] mca: base: components_register: component tcp register function successful [compiler-2:14673] mca: base: components_open: opening oob components [compiler-2:14673] mca: base: components_open: found loaded component tcp [compiler-2:14673] mca: base: components_open: component tcp open function successful [compiler-2:14673] mca:oob:select: checking available component tcp [compiler-2:14673] mca:oob:select: Querying component [tcp] [compiler-2:14673] oob:tcp: component_available called [compiler-2:14673] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 [compiler-2:14673] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4 [compiler-2:14673] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4 [compiler-2:14673] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4 [compiler-2:14673] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4 [compiler-2:14673] [[49095,0],0] oob:tcp:init adding 10.128.0.4 to our list of V4 connections [compiler-2:14673] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4 [compiler-2:14673] [[49095,0],0] TCP STARTUP [compiler-2:14673] [[49095,0],0] attempting to bind to IPv4 port 0 [compiler-2:14673] [[49095,0],0] assigned IPv4 port 59460 [compiler-2:14673] mca:oob:select: Adding component to end [compiler-2:14673] mca:oob:select: Found 1 active transports [compiler-2:14673] mca: base: components_register: registering rml components [compiler-2:14673] mca: base: components_register: found loaded component oob [compiler-2:14673] mca: base: components_register: component oob has no register or open function [compiler-2:14673] mca: base: components_open: opening rml components [compiler-2:14673] mca: base: components_open: found loaded component oob [compiler-2:14673] mca: base: components_open: component oob open function successful [compiler-2:14673] 
orte_rml_base_select: initializing rml component oob [compiler-2:14673] [[49095,0],0] posting recv [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 30 for peer [[WILDCARD],WILDCARD] [compiler-2:14673] [[49095,0],0] posting recv [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 15 for peer [[WILDCARD],WILDCARD] [compiler-2:14673] [[49095,0],0] posting recv [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 32 for peer [[WILDCARD],WILDCARD] [compiler-2:14673] [[49095,0],0] posting recv [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 33 for peer [[WILDCARD],WILDCARD] [compiler-2:14673] [[49095,0],0] posting recv [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 5 for peer [[WILDCARD],WILDCARD] [compiler-2:14673] [[49095,0],0] posting recv [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 10 for peer [[WILDCARD],WILDCARD] [compiler-2:14673] [[49095,0],0] posting recv [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 12 for peer [[WILDCARD],WILDCARD] [compiler-2:14673] [[49095,0],0] posting recv [compiler-2:14673] [[49095,
Re: [OMPI users] Running a hybrid MPI+openMP program
Hi, Am 20.08.2014 um 13:26 schrieb tmish...@jcity.maeda.co.jp: > Reuti, > > If you want to allocate 10 procs with N threads, the Torque > script below should work for you: > > qsub -l nodes=10:ppn=N > mpirun -map-by slot:pe=N -np 10 -x OMP_NUM_THREADS=N ./inverse.exe I played around with giving -np 10 in addition to a Tight Integration. The slot count is not really divided I think, but only 10 out of the granted maximum is used (while on each of the listed machines an `orted` is started). Due to the fixed allocation this is of course the result we want to achieve as it subtracts bunches of 8 from the given list of machines resp. slots. In SGE it's sufficient to use and AFAICS it works (without touching the $PE_HOSTFILE any longer): === export OMP_NUM_THREADS=8 mpirun -map-by slot:pe=$OMP_NUM_THREADS -np $(bc <<<"$NSLOTS / $OMP_NUM_THREADS") ./inverse.exe === and submit with: $ qsub -pe orte 80 job.sh as the variables are distributed to the slave nodes by SGE already. Nevertheless, using -np in addition to the Tight Integration gives a taste of a kind of half-tight integration in some way. And would not work for us because "--bind-to none" can't be used in such a command (see below) and throws an error. > Then, the openmpi automatically reduces the logical slot count to 10 > by dividing real slot count 10N by binding width of N. > > I don't know why you want to use pe=N without binding, but unfortunately > the openmpi allocates successive cores to each process so far when you > use pe option - it forcibly bind_to core. In a shared cluster with many users and different MPI libraries in use, only the queuingsystem could know which job got which cores granted. This avoids any oversubscription of cores, while others are idle. -- Reuti > Tetsuya > > >> Hi, >> >> Am 20.08.2014 um 06:26 schrieb Tetsuya Mishima: >> >>> Reuti and Oscar, >>> >>> I'm a Torque user and I myself have never used SGE, so I hesitated to > join >>> the discussion. >>> >>> From my experience with the Torque, the openmpi 1.8 series has already >>> resolved the issue you pointed out in combining MPI with OpenMP. >>> >>> Please try to add --map-by slot:pe=8 option, if you want to use 8 > threads. >>> Then, then openmpi 1.8 should allocate processes properly without any > modification >>> of the hostfile provided by the Torque. >>> >>> In your case(8 threads and 10 procs): >>> >>> # you have to request 80 slots using SGE command before mpirun >>> mpirun --map-by slot:pe=8 -np 10 ./inverse.exe >> >> Thx for pointing me to this option, for now I can't get it working though > (in fact, I want to use it without binding essentially). This allows to > tell Open MPI to bind more cores to each of the MPI >> processes - ok, but does it lower the slot count granted by Torque too? I > mean, was your submission command like: >> >> $ qsub -l nodes=10:ppn=8 ... >> >> so that Torque knows, that it should grant and remember this slot count > of a total of 80 for the correct accounting? >> >> -- Reuti >> >> >>> where you can omit --bind-to option because --bind-to core is assumed >>> as default when pe=N is provided by the user. >>> Regards, >>> Tetsuya >>> Hi, Am 19.08.2014 um 19:06 schrieb Oscar Mojica: > I discovered what was the error. I forgot include the '-fopenmp' when > I compiled the objects in the Makefile, so the program worked but it didn't > divide the job >>> in threads. Now the program is working and I can use until 15 cores for > machine in the queue one.q. > > Anyway i would like to try implement your advice. 
Well I'm not alone > in the cluster so i must implement your second suggestion. The steps are > > a) Use '$ qconf -mp orte' to change the allocation rule to 8 The number of slots defined in your used one.q was also increased to 8 > (`qconf -sq one.q`)? > b) Set '#$ -pe orte 80' in the script Fine. > c) I'm not sure how to do this step. I'd appreciate your help here. I > can add some lines to the script to determine the PE_HOSTFILE path and > contents, but i >>> don't know how alter it For now you can put in your jobscript (just after OMP_NUM_THREAD is > exported): awk -v omp_num_threads=$OMP_NUM_THREADS '{ $2/=omp_num_threads; > print }' $PE_HOSTFILE > $TMPDIR/machines export PE_HOSTFILE=$TMPDIR/machines = Unfortunately noone stepped into this discussion, as in my opinion > it's a much broader issue which targets all users who want to combine MPI > with OpenMP. The >>> queuingsystem should get a proper request for the overall amount of > slots the user needs. For now this will be forwarded to Open MPI and it > will use this >>> information to start the appropriate number of processes (which was an > achievement for the Tight Integration out-of-the-box of course) and ignores > any setting of >>> OMP_NUM_THREADS. So, where should the generate
Re: [OMPI users] Running a hybrid MPI+openMP program
On Aug 20, 2014, at 6:58 AM, Reuti wrote: > Hi, > > Am 20.08.2014 um 13:26 schrieb tmish...@jcity.maeda.co.jp: > >> Reuti, >> >> If you want to allocate 10 procs with N threads, the Torque >> script below should work for you: >> >> qsub -l nodes=10:ppn=N >> mpirun -map-by slot:pe=N -np 10 -x OMP_NUM_THREADS=N ./inverse.exe > > I played around with giving -np 10 in addition to a Tight Integration. The > slot count is not really divided I think, but only 10 out of the granted > maximum is used (while on each of the listed machines an `orted` is started). > Due to the fixed allocation this is of course the result we want to achieve > as it subtracts bunches of 8 from the given list of machines resp. slots. In > SGE it's sufficient to use and AFAICS it works (without touching the > $PE_HOSTFILE any longer): > > === > export OMP_NUM_THREADS=8 > mpirun -map-by slot:pe=$OMP_NUM_THREADS -np $(bc <<<"$NSLOTS / > $OMP_NUM_THREADS") ./inverse.exe > === > > and submit with: > > $ qsub -pe orte 80 job.sh > > as the variables are distributed to the slave nodes by SGE already. > > Nevertheless, using -np in addition to the Tight Integration gives a taste of > a kind of half-tight integration in some way. And would not work for us > because "--bind-to none" can't be used in such a command (see below) and > throws an error. > > >> Then, the openmpi automatically reduces the logical slot count to 10 >> by dividing real slot count 10N by binding width of N. >> >> I don't know why you want to use pe=N without binding, but unfortunately >> the openmpi allocates successive cores to each process so far when you >> use pe option - it forcibly bind_to core. > > In a shared cluster with many users and different MPI libraries in use, only > the queuingsystem could know which job got which cores granted. This avoids > any oversubscription of cores, while others are idle. FWIW: we detect the exterior binding constraint and work within it > > -- Reuti > > >> Tetsuya >> >> >>> Hi, >>> >>> Am 20.08.2014 um 06:26 schrieb Tetsuya Mishima: >>> Reuti and Oscar, I'm a Torque user and I myself have never used SGE, so I hesitated to >> join the discussion. From my experience with the Torque, the openmpi 1.8 series has already resolved the issue you pointed out in combining MPI with OpenMP. Please try to add --map-by slot:pe=8 option, if you want to use 8 >> threads. Then, then openmpi 1.8 should allocate processes properly without any >> modification of the hostfile provided by the Torque. In your case(8 threads and 10 procs): # you have to request 80 slots using SGE command before mpirun mpirun --map-by slot:pe=8 -np 10 ./inverse.exe >>> >>> Thx for pointing me to this option, for now I can't get it working though >> (in fact, I want to use it without binding essentially). This allows to >> tell Open MPI to bind more cores to each of the MPI >>> processes - ok, but does it lower the slot count granted by Torque too? I >> mean, was your submission command like: >>> >>> $ qsub -l nodes=10:ppn=8 ... >>> >>> so that Torque knows, that it should grant and remember this slot count >> of a total of 80 for the correct accounting? >>> >>> -- Reuti >>> >>> where you can omit --bind-to option because --bind-to core is assumed as default when pe=N is provided by the user. Regards, Tetsuya > Hi, > > Am 19.08.2014 um 19:06 schrieb Oscar Mojica: > >> I discovered what was the error. I forgot include the '-fopenmp' when >> I compiled the objects in the Makefile, so the program worked but it didn't >> divide the job in threads. 
Now the program is working and I can use until 15 cores for >> machine in the queue one.q. >> >> Anyway i would like to try implement your advice. Well I'm not alone >> in the cluster so i must implement your second suggestion. The steps are >> >> a) Use '$ qconf -mp orte' to change the allocation rule to 8 > > The number of slots defined in your used one.q was also increased to 8 >> (`qconf -sq one.q`)? > > >> b) Set '#$ -pe orte 80' in the script > > Fine. > > >> c) I'm not sure how to do this step. I'd appreciate your help here. I >> can add some lines to the script to determine the PE_HOSTFILE path and >> contents, but i don't know how alter it > > For now you can put in your jobscript (just after OMP_NUM_THREAD is >> exported): > > awk -v omp_num_threads=$OMP_NUM_THREADS '{ $2/=omp_num_threads; >> print }' $PE_HOSTFILE > $TMPDIR/machines > export PE_HOSTFILE=$TMPDIR/machines > > = > > Unfortunately noone stepped into this discussion, as in my opinion >> it's a much broader issue which targets all users who want to combine MPI >> with OpenMP. The queuingsystem should get a proper request for the overall amount of >> slots the use
Re: [OMPI users] Running a hybrid MPI+openMP program
Am 20.08.2014 um 16:26 schrieb Ralph Castain: > On Aug 20, 2014, at 6:58 AM, Reuti wrote: > >> Hi, >> >> Am 20.08.2014 um 13:26 schrieb tmish...@jcity.maeda.co.jp: >> >>> Reuti, >>> >>> If you want to allocate 10 procs with N threads, the Torque >>> script below should work for you: >>> >>> qsub -l nodes=10:ppn=N >>> mpirun -map-by slot:pe=N -np 10 -x OMP_NUM_THREADS=N ./inverse.exe >> >> I played around with giving -np 10 in addition to a Tight Integration. The >> slot count is not really divided I think, but only 10 out of the granted >> maximum is used (while on each of the listed machines an `orted` is >> started). Due to the fixed allocation this is of course the result we want >> to achieve as it subtracts bunches of 8 from the given list of machines >> resp. slots. In SGE it's sufficient to use and AFAICS it works (without >> touching the $PE_HOSTFILE any longer): >> >> === >> export OMP_NUM_THREADS=8 >> mpirun -map-by slot:pe=$OMP_NUM_THREADS -np $(bc <<<"$NSLOTS / >> $OMP_NUM_THREADS") ./inverse.exe >> === >> >> and submit with: >> >> $ qsub -pe orte 80 job.sh >> >> as the variables are distributed to the slave nodes by SGE already. >> >> Nevertheless, using -np in addition to the Tight Integration gives a taste >> of a kind of half-tight integration in some way. And would not work for us >> because "--bind-to none" can't be used in such a command (see below) and >> throws an error. >> >> >>> Then, the openmpi automatically reduces the logical slot count to 10 >>> by dividing real slot count 10N by binding width of N. >>> >>> I don't know why you want to use pe=N without binding, but unfortunately >>> the openmpi allocates successive cores to each process so far when you >>> use pe option - it forcibly bind_to core. >> >> In a shared cluster with many users and different MPI libraries in use, only >> the queuingsystem could know which job got which cores granted. This avoids >> any oversubscription of cores, while others are idle. > > FWIW: we detect the exterior binding constraint and work within it Aha, this is quite interesting - how do you do this: scanning the /proc//status or alike? What happens if you don't find enough free cores as they are used up by other applications already? -- Reuti >> -- Reuti >> >> >>> Tetsuya >>> >>> Hi, Am 20.08.2014 um 06:26 schrieb Tetsuya Mishima: > Reuti and Oscar, > > I'm a Torque user and I myself have never used SGE, so I hesitated to >>> join > the discussion. > > From my experience with the Torque, the openmpi 1.8 series has already > resolved the issue you pointed out in combining MPI with OpenMP. > > Please try to add --map-by slot:pe=8 option, if you want to use 8 >>> threads. > Then, then openmpi 1.8 should allocate processes properly without any >>> modification > of the hostfile provided by the Torque. > > In your case(8 threads and 10 procs): > > # you have to request 80 slots using SGE command before mpirun > mpirun --map-by slot:pe=8 -np 10 ./inverse.exe Thx for pointing me to this option, for now I can't get it working though >>> (in fact, I want to use it without binding essentially). This allows to >>> tell Open MPI to bind more cores to each of the MPI processes - ok, but does it lower the slot count granted by Torque too? I >>> mean, was your submission command like: $ qsub -l nodes=10:ppn=8 ... so that Torque knows, that it should grant and remember this slot count >>> of a total of 80 for the correct accounting? 
-- Reuti > where you can omit --bind-to option because --bind-to core is assumed > as default when pe=N is provided by the user. > Regards, > Tetsuya > >> Hi, >> >> Am 19.08.2014 um 19:06 schrieb Oscar Mojica: >> >>> I discovered what was the error. I forgot include the '-fopenmp' when >>> I compiled the objects in the Makefile, so the program worked but it didn't >>> divide the job > in threads. Now the program is working and I can use until 15 cores for >>> machine in the queue one.q. >>> >>> Anyway i would like to try implement your advice. Well I'm not alone >>> in the cluster so i must implement your second suggestion. The steps are >>> >>> a) Use '$ qconf -mp orte' to change the allocation rule to 8 >> >> The number of slots defined in your used one.q was also increased to 8 >>> (`qconf -sq one.q`)? >> >> >>> b) Set '#$ -pe orte 80' in the script >> >> Fine. >> >> >>> c) I'm not sure how to do this step. I'd appreciate your help here. I >>> can add some lines to the script to determine the PE_HOSTFILE path and >>> contents, but i > don't know how alter it >> >> For now you can put in your jobscript (just after OMP_NUM_THREAD is >>> exported): >> >> awk -v omp_num_threads=$OMP_NUM_THREADS '{ $2/=omp_num_thre
Re: [OMPI users] No log_num_mtt in Ubuntu 14.04
> "Mike" == Mike Dubman writes:

Mike> so, it seems you have old ofed w/o this parameter. Can you
Mike> install latest Mellanox ofed? or check which community ofed
Mike> has it?

Rio is using the kernel.org drivers that are part of Ubuntu/3.13.x, and log_num_mtt is not a parameter in those drivers. In fact log_num_mtt has never been a parameter in the kernel.org sources (just checked the git commit history). And it's not needed anymore either, since the following commit (which is also part of OFED 3.12 btw; Mike, it seems Mellanox OFED is behind in this respect):

---
commit db5a7a65c05867cb6ff5cb6d556a0edfce631d2d
Author: Roland Dreier
Date: Mon Mar 5 10:05:28 2012 -0800

    mlx4_core: Scale size of MTT table with system RAM

    The current driver defaults to 1M MTT segments, where each segment holds
    8 MTT entries. This limits the total memory registered to 8M * PAGE_SIZE,
    which is 32GB with 4K pages. Since systems that have much more memory
    are pretty common now (at least among systems with InfiniBand hardware),
    this limit ends up getting hit in practice quite a bit.

    Handle this by having the driver allocate at least enough MTT entries to
    cover 2 * totalram pages.

    Signed-off-by: Roland Dreier
---

The relevant code segment (drivers/net/ethernet/mellanox/mlx4/profile.c):

---
	/*
	 * We want to scale the number of MTTs with the size of the
	 * system memory, since it makes sense to register a lot of
	 * memory on a system with a lot of memory. As a heuristic,
	 * make sure we have enough MTTs to cover twice the system
	 * memory (with PAGE_SIZE entries).
	 *
	 * This number has to be a power of two and fit into 32 bits
	 * due to device limitations, so cap this at 2^31 as well.
	 * That limits us to 8TB of memory registration per HCA with
	 * 4KB pages, which is probably OK for the next few months.
	 */
	si_meminfo(&si);
	request->num_mtt =
		roundup_pow_of_two(max_t(unsigned, request->num_mtt,
					 min(1UL << (31 - log_mtts_per_seg),
					     si.totalram >> (log_mtts_per_seg - 1))));
---

So the point here is that OpenMPI should check the mlx4 driver version and not output false warnings when newer drivers are used. Didn't check whether this is fixed in the OpenMPI code repositories yet. It's not fixed in 1.8.2rc4 anyway (static uint64_t calculate_max_reg in ompi/mca/btl/openib/btl_openib.c). Also, the OpenMPI FAQ should be corrected accordingly.

Rio, as a note for you: you can safely ignore the warning.
Cheers, Roland --- http://www.q-leap.com / http://qlustar.com --- HPC / Storage / Cloud Linux Cluster OS --- Mike> On Tue, Aug 19, 2014 at 9:34 AM, Rio Yokota Mike> wrote: >> Here is what "modinfo mlx4_core" gives >> >> filename: >> /lib/modules/3.13.0-34-generic/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko >> version: 2.2-1 license: Dual BSD/GPL description: Mellanox >> ConnectX HCA low-level driver author: Roland Dreier srcversion: >> 3AE29A0A6538EBBE9227361 alias: >> pci:v15B3d1010sv*sd*bc*sc*i* alias: >> pci:v15B3d100Fsv*sd*bc*sc*i* alias: >> pci:v15B3d100Esv*sd*bc*sc*i* alias: >> pci:v15B3d100Dsv*sd*bc*sc*i* alias: >> pci:v15B3d100Csv*sd*bc*sc*i* alias: >> pci:v15B3d100Bsv*sd*bc*sc*i* alias: >> pci:v15B3d100Asv*sd*bc*sc*i* alias: >> pci:v15B3d1009sv*sd*bc*sc*i* alias: >> pci:v15B3d1008sv*sd*bc*sc*i* alias: >> pci:v15B3d1007sv*sd*bc*sc*i* alias: >> pci:v15B3d1006sv*sd*bc*sc*i* alias: >> pci:v15B3d1005sv*sd*bc*sc*i* alias: >> pci:v15B3d1004sv*sd*bc*sc*i* alias: >> pci:v15B3d1003sv*sd*bc*sc*i* alias: >> pci:v15B3d1002sv*sd*bc*sc*i* alias: >> pci:v15B3d676Esv*sd*bc*sc*i* alias: >> pci:v15B3d6746sv*sd*bc*sc*i* alias: >> pci:v15B3d6764sv*sd*bc*sc*i* alias: >> pci:v15B3d675Asv*sd*bc*sc*i* alias: >> pci:v15B3d6372sv*sd*bc*sc*i* alias: >> pci:v15B3d6750sv*sd*bc*sc*i* alias: >> pci:v15B3d6368sv*sd*bc*sc*i* alias: >> pci:v15B3d673Csv*sd*bc*sc*i* alias: >> pci:v15B3d6732sv*sd*bc*sc*i* alias: >> pci:v15B3d6354sv*sd*bc*sc*i* alias: >> pci:v15B3d634Asv*sd*bc*sc*i* alias: >> pci:v15B3d6340sv*sd*bc*sc*i* depends: intree: Y vermagic: >> 3.13.0-34-generic SMP mod_unload modversions signer: Magrathea: >> Glacier signing key sig_key: >
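As a back-of-the-envelope illustration of the limits quoted above (the old 1M-segment default versus the RAM-scaled heuristic and its 2^31-entry cap), the arithmetic can be sketched as follows. The 128 GiB node size is an assumed example value, and this is not Open MPI's actual calculate_max_reg code:

---
#include <stdio.h>

int main(void)
{
    const unsigned long long page_size        = 4096;  /* 4 KiB pages */
    const int                log_num_mtt      = 20;    /* 1M segments (old mlx4 default) */
    const int                log_mtts_per_seg = 3;     /* 8 MTT entries per segment */

    /* Old model: registerable memory = segments * entries/segment * page size. */
    unsigned long long max_reg_old =
        (1ULL << log_num_mtt) * (1ULL << log_mtts_per_seg) * page_size;
    printf("old default limit : %llu GiB\n", max_reg_old >> 30);   /* 32 GiB */

    /* New heuristic: enough MTT entries to cover twice the installed RAM,
     * capped at 2^31 entries, i.e. 8 TiB with 4 KiB pages. */
    const unsigned long long totalram = 128ULL << 30;  /* assume a 128 GiB node */
    printf("new heuristic     : covers %llu GiB (2 x RAM)\n", (2 * totalram) >> 30);
    printf("hard cap          : %llu TiB\n", ((1ULL << 31) * page_size) >> 40);
    return 0;
}
---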
Re: [OMPI users] No log_num_mtt in Ubuntu 14.04
Dear Roland, Thank you so much. This was very helpful. Best, Rio >> "Mike" == Mike Dubman writes: > >Mike> so, it seems you have old ofed w/o this parameter. Can you >Mike> install latest Mellanox ofed? or check which community ofed >Mike> has it? > > Rio is using the kernel.org drivers that are part of Ubuntu/3.13.x and > log_num_mtt is not a parameter in those drivers. In fact log_num_mtt > has never been a parameter in the kernel.org sources (just checked the > git commit history). And it's not needed anymore either, since the > following commit (which is also part of OFED 3.12 btw; Mike, seems > Mellanox OFED is behind with this respect): > --- > commit db5a7a65c05867cb6ff5cb6d556a0edfce631d2d > Author: Roland Dreier > Date: Mon Mar 5 10:05:28 2012 -0800 > >mlx4_core: Scale size of MTT table with system RAM > >The current driver defaults to 1M MTT segments, where each segment holds >8 MTT entries. This limits the total memory registered to 8M * PAGE_SIZE >which is 32GB with 4K pages. Since systems that have much more memory >are pretty common now (at least among systems with InfiniBand hardware), >this limit ends up getting hit in practice quite a bit. > >Handle this by having the driver allocate at least enough MTT entries to >cover 2 * totalram pages. > >Signed-off-by: Roland Dreier > --- > > The relevant code segment (drivers/net/ethernet/mellanox/mlx4/profile.c): > > --- >/* > * We want to scale the number of MTTs with the size of the > * system memory, since it makes sense to register a lot of > * memory on a system with a lot of memory. As a heuristic, > * make sure we have enough MTTs to cover twice the system > * memory (with PAGE_SIZE entries). > * > * This number has to be a power of two and fit into 32 bits > * due to device limitations, so cap this at 2^31 as well. > * That limits us to 8TB of memory registration per HCA with > * 4KB pages, which is probably OK for the next few months. > */ >si_meminfo(&si); >request->num_mtt = >roundup_pow_of_two(max_t(unsigned, request->num_mtt, > min(1UL << (31 - log_mtts_per_seg), > si.totalram >> (log_mtts_per_seg > - 1; > --- > > So the point here is that OpenMPI should check the mlx4 driver versions > and not output false warnings when newer drivers are used. Didn't check > whether this is fixed in the OpenMPI code repositories yet. It's not > fixed in 1.8.2rc4 anyway (static uint64_t calculate_max_reg in > ompi/mca/btl/openib/btl_openib.c). Also, the OpenMPI FAQ should be > corrected accordingly. > > Rio as a note for you: You can safely ignore the warning. 
> > Cheers, > > Roland > > --- > http://www.q-leap.com / http://qlustar.com > --- HPC / Storage / Cloud Linux Cluster OS --- > >Mike> On Tue, Aug 19, 2014 at 9:34 AM, Rio Yokota >Mike> wrote: > >>> Here is what "modinfo mlx4_core" gives >>> >>> filename: >>> /lib/modules/3.13.0-34-generic/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko >>> version: 2.2-1 license: Dual BSD/GPL description: Mellanox >>> ConnectX HCA low-level driver author: Roland Dreier srcversion: >>> 3AE29A0A6538EBBE9227361 alias: >>> pci:v15B3d1010sv*sd*bc*sc*i* alias: >>> pci:v15B3d100Fsv*sd*bc*sc*i* alias: >>> pci:v15B3d100Esv*sd*bc*sc*i* alias: >>> pci:v15B3d100Dsv*sd*bc*sc*i* alias: >>> pci:v15B3d100Csv*sd*bc*sc*i* alias: >>> pci:v15B3d100Bsv*sd*bc*sc*i* alias: >>> pci:v15B3d100Asv*sd*bc*sc*i* alias: >>> pci:v15B3d1009sv*sd*bc*sc*i* alias: >>> pci:v15B3d1008sv*sd*bc*sc*i* alias: >>> pci:v15B3d1007sv*sd*bc*sc*i* alias: >>> pci:v15B3d1006sv*sd*bc*sc*i* alias: >>> pci:v15B3d1005sv*sd*bc*sc*i* alias: >>> pci:v15B3d1004sv*sd*bc*sc*i* alias: >>> pci:v15B3d1003sv*sd*bc*sc*i* alias: >>> pci:v15B3d1002sv*sd*bc*sc*i* alias: >>> pci:v15B3d676Esv*sd*bc*sc*i* alias: >>> pci:v15B3d6746sv*sd*bc*sc*i* alias: >>> pci:v15B3d6764sv*sd*bc*sc*i* alias: >>> pci:v15B3d675Asv*sd*bc*sc*i* alias: >>> pci:v15B3d6372sv*sd*bc*sc*i* alias: >>> pci:v15B3d6750sv*sd*bc*sc*i* alias: >>> pci:v15B3d6368sv*sd*bc*sc*i* alias: >>> pci:v15B3d673Csv*sd*bc*sc*i* alias: >>> pci:v15B3d6732sv*sd*bc*sc*i* alias: >>> pci:v15B3d6354sv*sd*bc*sc*i* alias: >>> pci:v15B3d634Asv*sd*bc*sc*i* alias: >>> pci:v15B3d6340sv*sd*bc*sc*i* depends: intree: Y vermagic: >>> 3.13.0-34-generic SMP mod_unload modversions signer: Magrathea:
Re: [OMPI users] Running a hybrid MPI+openMP program
On Aug 20, 2014, at 9:04 AM, Reuti wrote: > Am 20.08.2014 um 16:26 schrieb Ralph Castain: > >> On Aug 20, 2014, at 6:58 AM, Reuti wrote: >> >>> Hi, >>> >>> Am 20.08.2014 um 13:26 schrieb tmish...@jcity.maeda.co.jp: >>> Reuti, If you want to allocate 10 procs with N threads, the Torque script below should work for you: qsub -l nodes=10:ppn=N mpirun -map-by slot:pe=N -np 10 -x OMP_NUM_THREADS=N ./inverse.exe >>> >>> I played around with giving -np 10 in addition to a Tight Integration. The >>> slot count is not really divided I think, but only 10 out of the granted >>> maximum is used (while on each of the listed machines an `orted` is >>> started). Due to the fixed allocation this is of course the result we want >>> to achieve as it subtracts bunches of 8 from the given list of machines >>> resp. slots. In SGE it's sufficient to use and AFAICS it works (without >>> touching the $PE_HOSTFILE any longer): >>> >>> === >>> export OMP_NUM_THREADS=8 >>> mpirun -map-by slot:pe=$OMP_NUM_THREADS -np $(bc <<<"$NSLOTS / >>> $OMP_NUM_THREADS") ./inverse.exe >>> === >>> >>> and submit with: >>> >>> $ qsub -pe orte 80 job.sh >>> >>> as the variables are distributed to the slave nodes by SGE already. >>> >>> Nevertheless, using -np in addition to the Tight Integration gives a taste >>> of a kind of half-tight integration in some way. And would not work for us >>> because "--bind-to none" can't be used in such a command (see below) and >>> throws an error. >>> >>> Then, the openmpi automatically reduces the logical slot count to 10 by dividing real slot count 10N by binding width of N. I don't know why you want to use pe=N without binding, but unfortunately the openmpi allocates successive cores to each process so far when you use pe option - it forcibly bind_to core. >>> >>> In a shared cluster with many users and different MPI libraries in use, >>> only the queuingsystem could know which job got which cores granted. This >>> avoids any oversubscription of cores, while others are idle. >> >> FWIW: we detect the exterior binding constraint and work within it > > Aha, this is quite interesting - how do you do this: scanning the > /proc//status or alike? What happens if you don't find enough free cores > as they are used up by other applications already? > Remember, when you use mpirun to launch, we launch our own daemons using the native launcher (e.g., qsub). So the external RM will bind our daemons to the specified cores on each node. We use hwloc to determine what cores our daemons are bound to, and then bind our own child processes to cores within that range. If the cores we are bound to are the same on each node, then we will do this with no further instruction. However, if the cores are different on the individual nodes, then you need to add --hetero-nodes to your command line (as the nodes appear to be heterogeneous to us). So it is up to the RM to set the constraint - we just live within it. > -- Reuti > > >>> -- Reuti >>> >>> Tetsuya > Hi, > > Am 20.08.2014 um 06:26 schrieb Tetsuya Mishima: > >> Reuti and Oscar, >> >> I'm a Torque user and I myself have never used SGE, so I hesitated to join >> the discussion. >> >> From my experience with the Torque, the openmpi 1.8 series has already >> resolved the issue you pointed out in combining MPI with OpenMP. >> >> Please try to add --map-by slot:pe=8 option, if you want to use 8 threads. >> Then, then openmpi 1.8 should allocate processes properly without any modification >> of the hostfile provided by the Torque. 
>> >> In your case(8 threads and 10 procs): >> >> # you have to request 80 slots using SGE command before mpirun >> mpirun --map-by slot:pe=8 -np 10 ./inverse.exe > > Thx for pointing me to this option, for now I can't get it working though (in fact, I want to use it without binding essentially). This allows to tell Open MPI to bind more cores to each of the MPI > processes - ok, but does it lower the slot count granted by Torque too? I mean, was your submission command like: > > $ qsub -l nodes=10:ppn=8 ... > > so that Torque knows, that it should grant and remember this slot count of a total of 80 for the correct accounting? > > -- Reuti > > >> where you can omit --bind-to option because --bind-to core is assumed >> as default when pe=N is provided by the user. >> Regards, >> Tetsuya >> >>> Hi, >>> >>> Am 19.08.2014 um 19:06 schrieb Oscar Mojica: >>> I discovered what was the error. I forgot include the '-fopenmp' when I compiled the objects in the Makefile, so the program worked but it didn't divide the job >> in threads. Now the program is working and I can use until 15 cores for machine in
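For anyone wondering what "detect the exterior binding constraint" looks like in practice: the sketch below shows the basic hwloc call a process can use to inspect the cpuset that the resource manager (or the daemon that forked it) bound it to. It is only an illustration of the mechanism Ralph describes, not Open MPI's actual code; an unbound process will typically just report the cpuset of the whole machine.

---
#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_bitmap_t   set = hwloc_bitmap_alloc();
    char            *str = NULL;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Ask the OS (via hwloc) which PUs this process is currently bound to,
     * e.g. the cores Torque/SGE granted to our parent daemon. */
    if (hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS) == 0) {
        hwloc_bitmap_asprintf(&str, set);
        printf("externally bound to cpuset %s (%d PUs)\n",
               str, hwloc_bitmap_weight(set));
        free(str);
    }

    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);
    return 0;
}
---

Child processes launched afterwards can then be bound to subsets of exactly this cpuset, which is how, as Ralph notes, the launcher stays within whatever the queuing system granted.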
Re: [OMPI users] ORTE daemon has unexpectedly failed after launch
It was not yet fixed - but should be now. On Aug 20, 2014, at 6:39 AM, Timur Ismagilov wrote: > Hello! > > As i can see, the bug is fixed, but in Open MPI v1.9a1r32516 i still have > the problem > > a) > $ mpirun -np 1 ./hello_c > > -- > An ORTE daemon has unexpectedly failed after launch and before > communicating back to mpirun. This could be caused by a number > of factors, including an inability to create a connection back > to mpirun due to a lack of common network interfaces and/or no > route found between them. Please check network connectivity > (including firewalls and network routing requirements). > -- > > b) > $ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c > -- > An ORTE daemon has unexpectedly failed after launch and before > communicating back to mpirun. This could be caused by a number > of factors, including an inability to create a connection back > to mpirun due to a lack of common network interfaces and/or no > route found between them. Please check network connectivity > (including firewalls and network routing requirements). > -- > > c) > > $ mpirun --mca oob_tcp_if_include ib0 -debug-daemons --mca plm_base_verbose 5 > -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 1 ./hello_c > > [compiler-2:14673] mca:base:select:( plm) Querying component [isolated] > [compiler-2:14673] mca:base:select:( plm) Query of component [isolated] set > priority to 0 > [compiler-2:14673] mca:base:select:( plm) Querying component [rsh] > [compiler-2:14673] mca:base:select:( plm) Query of component [rsh] set > priority to 10 > [compiler-2:14673] mca:base:select:( plm) Querying component [slurm] > [compiler-2:14673] mca:base:select:( plm) Query of component [slurm] set > priority to 75 > [compiler-2:14673] mca:base:select:( plm) Selected component [slurm] > [compiler-2:14673] mca: base: components_register: registering oob components > [compiler-2:14673] mca: base: components_register: found loaded component tcp > [compiler-2:14673] mca: base: components_register: component tcp register > function successful > [compiler-2:14673] mca: base: components_open: opening oob components > [compiler-2:14673] mca: base: components_open: found loaded component tcp > [compiler-2:14673] mca: base: components_open: component tcp open function > successful > [compiler-2:14673] mca:oob:select: checking available component tcp > [compiler-2:14673] mca:oob:select: Querying component [tcp] > [compiler-2:14673] oob:tcp: component_available called > [compiler-2:14673] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 > [compiler-2:14673] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4 > [compiler-2:14673] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4 > [compiler-2:14673] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4 > [compiler-2:14673] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4 > [compiler-2:14673] [[49095,0],0] oob:tcp:init adding 10.128.0.4 to our list > of V4 connections > [compiler-2:14673] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4 > [compiler-2:14673] [[49095,0],0] TCP STARTUP > [compiler-2:14673] [[49095,0],0] attempting to bind to IPv4 port 0 > [compiler-2:14673] [[49095,0],0] assigned IPv4 port 59460 > [compiler-2:14673] mca:oob:select: Adding component to end > [compiler-2:14673] mca:oob:select: Found 1 active transports > [compiler-2:14673] mca: base: components_register: registering rml components > [compiler-2:14673] mca: base: components_register: found loaded component oob > [compiler-2:14673] mca: base: components_register: component oob has no > register or open function > 
[compiler-2:14673] mca: base: components_open: opening rml components > [compiler-2:14673] mca: base: components_open: found loaded component oob > [compiler-2:14673] mca: base: components_open: component oob open function > successful > [compiler-2:14673] orte_rml_base_select: initializing rml component oob > [compiler-2:14673] [[49095,0],0] posting recv > [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 30 for peer > [[WILDCARD],WILDCARD] > [compiler-2:14673] [[49095,0],0] posting recv > [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 15 for peer > [[WILDCARD],WILDCARD] > [compiler-2:14673] [[49095,0],0] posting recv > [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 32 for peer > [[WILDCARD],WILDCARD] > [compiler-2:14673] [[49095,0],0] posting recv > [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 33 for peer > [[WILDCARD],WILDCARD] > [compiler-2:14673] [[49095,0],0] posting recv > [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 5 for peer > [[WILDCARD],WILDCARD] > [compiler-2:14673] [[49095,0],0] posting recv > [compiler-2:14673] [[49095,0],0]
Re: [OMPI users] ORTE daemon has unexpectedly failed after launch
btw, we get same error in v1.8 branch as well. On Wed, Aug 20, 2014 at 8:06 PM, Ralph Castain wrote: > It was not yet fixed - but should be now. > > On Aug 20, 2014, at 6:39 AM, Timur Ismagilov wrote: > > Hello! > > As i can see, the bug is fixed, but in Open MPI v1.9a1r32516 i still have > the problem > > a) > $ mpirun -np 1 ./hello_c > > -- > An ORTE daemon has unexpectedly failed after launch and before > communicating back to mpirun. This could be caused by a number > of factors, including an inability to create a connection back > to mpirun due to a lack of common network interfaces and/or no > route found between them. Please check network connectivity > (including firewalls and network routing requirements). > -- > > b) > $ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c > -- > An ORTE daemon has unexpectedly failed after launch and before > communicating back to mpirun. This could be caused by a number > of factors, including an inability to create a connection back > to mpirun due to a lack of common network interfaces and/or no > route found between them. Please check network connectivity > (including firewalls and network routing requirements). > -- > > c) > > $ mpirun --mca oob_tcp_if_include ib0 -debug-daemons --mca > plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 1 > ./hello_c > > [compiler-2:14673] mca:base:select:( plm) Querying component [isolated] > [compiler-2:14673] mca:base:select:( plm) Query of component [isolated] > set priority to 0 > [compiler-2:14673] mca:base:select:( plm) Querying component [rsh] > [compiler-2:14673] mca:base:select:( plm) Query of component [rsh] set > priority to 10 > [compiler-2:14673] mca:base:select:( plm) Querying component [slurm] > [compiler-2:14673] mca:base:select:( plm) Query of component [slurm] set > priority to 75 > [compiler-2:14673] mca:base:select:( plm) Selected component [slurm] > [compiler-2:14673] mca: base: components_register: registering oob > components > [compiler-2:14673] mca: base: components_register: found loaded component > tcp > [compiler-2:14673] mca: base: components_register: component tcp register > function successful > [compiler-2:14673] mca: base: components_open: opening oob components > [compiler-2:14673] mca: base: components_open: found loaded component tcp > [compiler-2:14673] mca: base: components_open: component tcp open function > successful > [compiler-2:14673] mca:oob:select: checking available component tcp > [compiler-2:14673] mca:oob:select: Querying component [tcp] > [compiler-2:14673] oob:tcp: component_available called > [compiler-2:14673] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 > [compiler-2:14673] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4 > [compiler-2:14673] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4 > [compiler-2:14673] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4 > [compiler-2:14673] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4 > [compiler-2:14673] [[49095,0],0] oob:tcp:init adding 10.128.0.4 to our > list of V4 connections > [compiler-2:14673] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4 > [compiler-2:14673] [[49095,0],0] TCP STARTUP > [compiler-2:14673] [[49095,0],0] attempting to bind to IPv4 port 0 > [compiler-2:14673] [[49095,0],0] assigned IPv4 port 59460 > [compiler-2:14673] mca:oob:select: Adding component to end > [compiler-2:14673] mca:oob:select: Found 1 active transports > [compiler-2:14673] mca: base: components_register: registering rml > components > [compiler-2:14673] mca: base: components_register: found loaded component > 
oob > [compiler-2:14673] mca: base: components_register: component oob has no > register or open function > [compiler-2:14673] mca: base: components_open: opening rml components > [compiler-2:14673] mca: base: components_open: found loaded component oob > [compiler-2:14673] mca: base: components_open: component oob open function > successful > [compiler-2:14673] orte_rml_base_select: initializing rml component oob > [compiler-2:14673] [[49095,0],0] posting recv > [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 30 for > peer [[WILDCARD],WILDCARD] > [compiler-2:14673] [[49095,0],0] posting recv > [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 15 for > peer [[WILDCARD],WILDCARD] > [compiler-2:14673] [[49095,0],0] posting recv > [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 32 for > peer [[WILDCARD],WILDCARD] > [compiler-2:14673] [[49095,0],0] posting recv > [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 33 for > peer [[WILDCARD],WILDCARD] > [compiler-2:14673] [[49095,0],0] posting recv > [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 5 for peer > [[WIL
Re: [OMPI users] ORTE daemon has unexpectedly failed after launch
yes, i know - it is cmr'd On Aug 20, 2014, at 10:26 AM, Mike Dubman wrote: > btw, we get same error in v1.8 branch as well. > > > On Wed, Aug 20, 2014 at 8:06 PM, Ralph Castain wrote: > It was not yet fixed - but should be now. > > On Aug 20, 2014, at 6:39 AM, Timur Ismagilov wrote: > >> Hello! >> >> As i can see, the bug is fixed, but in Open MPI v1.9a1r32516 i still have >> the problem >> >> a) >> $ mpirun -np 1 ./hello_c >> >> -- >> An ORTE daemon has unexpectedly failed after launch and before >> communicating back to mpirun. This could be caused by a number >> of factors, including an inability to create a connection back >> to mpirun due to a lack of common network interfaces and/or no >> route found between them. Please check network connectivity >> (including firewalls and network routing requirements). >> -- >> >> b) >> $ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c >> -- >> An ORTE daemon has unexpectedly failed after launch and before >> communicating back to mpirun. This could be caused by a number >> of factors, including an inability to create a connection back >> to mpirun due to a lack of common network interfaces and/or no >> route found between them. Please check network connectivity >> (including firewalls and network routing requirements). >> -- >> >> c) >> >> $ mpirun --mca oob_tcp_if_include ib0 -debug-daemons --mca plm_base_verbose >> 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 1 ./hello_c >> >> [compiler-2:14673] mca:base:select:( plm) Querying component [isolated] >> [compiler-2:14673] mca:base:select:( plm) Query of component [isolated] set >> priority to 0 >> [compiler-2:14673] mca:base:select:( plm) Querying component [rsh] >> [compiler-2:14673] mca:base:select:( plm) Query of component [rsh] set >> priority to 10 >> [compiler-2:14673] mca:base:select:( plm) Querying component [slurm] >> [compiler-2:14673] mca:base:select:( plm) Query of component [slurm] set >> priority to 75 >> [compiler-2:14673] mca:base:select:( plm) Selected component [slurm] >> [compiler-2:14673] mca: base: components_register: registering oob components >> [compiler-2:14673] mca: base: components_register: found loaded component tcp >> [compiler-2:14673] mca: base: components_register: component tcp register >> function successful >> [compiler-2:14673] mca: base: components_open: opening oob components >> [compiler-2:14673] mca: base: components_open: found loaded component tcp >> [compiler-2:14673] mca: base: components_open: component tcp open function >> successful >> [compiler-2:14673] mca:oob:select: checking available component tcp >> [compiler-2:14673] mca:oob:select: Querying component [tcp] >> [compiler-2:14673] oob:tcp: component_available called >> [compiler-2:14673] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 >> [compiler-2:14673] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4 >> [compiler-2:14673] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4 >> [compiler-2:14673] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4 >> [compiler-2:14673] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4 >> [compiler-2:14673] [[49095,0],0] oob:tcp:init adding 10.128.0.4 to our list >> of V4 connections >> [compiler-2:14673] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4 >> [compiler-2:14673] [[49095,0],0] TCP STARTUP >> [compiler-2:14673] [[49095,0],0] attempting to bind to IPv4 port 0 >> [compiler-2:14673] [[49095,0],0] assigned IPv4 port 59460 >> [compiler-2:14673] mca:oob:select: Adding component to end >> [compiler-2:14673] mca:oob:select: Found 1 active transports >> 
[compiler-2:14673] mca: base: components_register: registering rml components >> [compiler-2:14673] mca: base: components_register: found loaded component oob >> [compiler-2:14673] mca: base: components_register: component oob has no >> register or open function >> [compiler-2:14673] mca: base: components_open: opening rml components >> [compiler-2:14673] mca: base: components_open: found loaded component oob >> [compiler-2:14673] mca: base: components_open: component oob open function >> successful >> [compiler-2:14673] orte_rml_base_select: initializing rml component oob >> [compiler-2:14673] [[49095,0],0] posting recv >> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 30 for peer >> [[WILDCARD],WILDCARD] >> [compiler-2:14673] [[49095,0],0] posting recv >> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 15 for peer >> [[WILDCARD],WILDCARD] >> [compiler-2:14673] [[49095,0],0] posting recv >> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 32 for peer >> [[WILDCARD],WILDCARD] >> [compiler-2:14673] [[49095,0],0] posting recv >> [compiler-2:14673] [[49095,0],0] posting persist
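For reference, the interface restriction attempted in b) above can also be given as a subnet instead of an interface name, and the same restriction can be applied to the MPI-level TCP traffic. A minimal sketch; the 10.128.0.0/16 subnet is only an assumption taken from the 10.128.0.4 address in the verbose output and has to be adapted to the actual IPoIB network:

===
# Sketch: keep both the runtime (OOB) and the TCP BTL on one reachable network.
# The subnet below is an assumption based on the address shown in the log.
mpirun --mca oob_tcp_if_include 10.128.0.0/16 \
       --mca btl_tcp_if_include 10.128.0.0/16 \
       -np 1 ./hello_c
===

As the follow-ups show, the failure reported here was a regression in the nightly build rather than an interface problem, so this is only a general debugging aid.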
Re: [OMPI users] Running a hybrid MPI+openMP program
Hi

Well, with qconf -sq one.q I got the following:

[oscar@aguia free-noise]$ qconf -sq one.q
qname                 one.q
hostlist              compute-1-30.local compute-1-2.local compute-1-3.local \
                      compute-1-4.local compute-1-5.local compute-1-6.local \
                      compute-1-7.local compute-1-8.local compute-1-9.local \
                      compute-1-10.local compute-1-11.local compute-1-12.local \
                      compute-1-13.local compute-1-14.local compute-1-15.local
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make mpich mpi orte
rerun                 FALSE
slots                 1,[compute-1-30.local=1],[compute-1-2.local=1], \
                      [compute-1-3.local=1],[compute-1-5.local=1], \
                      [compute-1-8.local=1],[compute-1-6.local=1], \
                      [compute-1-4.local=1],[compute-1-9.local=1], \
                      [compute-1-11.local=1],[compute-1-7.local=1], \
                      [compute-1-13.local=1],[compute-1-10.local=1], \
                      [compute-1-15.local=1],[compute-1-12.local=1], \
                      [compute-1-14.local=1]

The admin was the one who created this queue, so I have to speak to him about changing the number of slots to the number of threads that I wish to use. Then I could make use of:

===
export OMP_NUM_THREADS=N
mpirun -map-by slot:pe=$OMP_NUM_THREADS -np $(bc <<<"$NSLOTS / $OMP_NUM_THREADS") ./inverse.exe
===

For now, in my case this command line would only work for 10 processes and the work wouldn't be divided into threads, is that right? Can I set a maximum number of threads in the queue one.q (e.g. 15) and change the number in the 'export' as it suits me?

I feel like a child hearing the adults speaking. Thanks, I'm learning a lot.

Oscar Fabian Mojica Ladino
Geologist M.S. in Geophysics

> From: re...@staff.uni-marburg.de
> Date: Tue, 19 Aug 2014 19:51:46 +0200
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Running a hybrid MPI+openMP program
>
> Hi,
>
> Am 19.08.2014 um 19:06 schrieb Oscar Mojica:
>
> > I discovered what was the error. I forgot include the '-fopenmp' when I compiled the objects in the Makefile, so the program worked but it didn't divide the job in threads. Now the program is working and I can use until 15 cores for machine in the queue one.q.
> >
> > Anyway i would like to try implement your advice. Well I'm not alone in the cluster so i must implement your second suggestion. The steps are
> >
> > a) Use '$ qconf -mp orte' to change the allocation rule to 8
>
> The number of slots defined in your used one.q was also increased to 8 (`qconf -sq one.q`)?
>
> > b) Set '#$ -pe orte 80' in the script
>
> Fine.
>
> > c) I'm not sure how to do this step. I'd appreciate your help here. I can add some lines to the script to determine the PE_HOSTFILE path and contents, but i don't know how alter it
>
> For now you can put in your jobscript (just after OMP_NUM_THREAD is exported):
>
> awk -v omp_num_threads=$OMP_NUM_THREADS '{ $2/=omp_num_threads; print }' $PE_HOSTFILE > $TMPDIR/machines
> export PE_HOSTFILE=$TMPDIR/machines
>
> =
>
> Unfortunately noone stepped into this discussion, as in my opinion it's a much broader issue which targets all users who want to combine MPI with OpenMP. The queuingsystem should get a proper request for the overall amount of slots the user needs. For now this will be forwarded to Open MPI and it will use this information to start the appropriate number of processes (which was an achievement for the Tight Integration out-of-the-box of course) and ignores any setting of OMP_NUM_THREADS.
So, where should the generated list > of machines be adjusted; there are several options: > > a) The PE of the queuingsystem should do it: > > + a one time setup for the admin > + in SGE the "start_proc_args" of the PE could alter the $PE_HOSTFILE > - the "start_proc_args" would need to know the number of threads, i.e. > OMP_NUM_THREADS must be defined by "qsub -v ..." outside of the jobscript > (tricky scanning of the submitted jobscript for OMP_NUM_THREADS would be too > nasty) > - limits to use inside the jobscript calls to libraries behaving in the same > way as Open MPI only > > > b) The particular queue should do it in a queue prolog: > > same as a) I think > > > c) The user should do it > > + no change in the SGE installation > - each and every user must include it in all the jobscripts to adjust the > list and export the pointer to the $PE_HOSTFILE, but he could change it forth > and back for different steps of the jobscript though > > > d) Open MPI should do
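Putting option c) together with the awk snippet quoted above, a complete jobscript might look like the following sketch. It assumes the orte PE, 80 requested slots and 8 threads per MPI rank, as discussed in this thread; names and counts have to be adapted:

===
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -pe orte 80                 # total slots = MPI ranks x OpenMP threads
export OMP_NUM_THREADS=8

# divide the slot count of every host in the PE hostfile by the thread count,
# so the tight integration starts one MPI rank per group of 8 slots
awk -v omp_num_threads=$OMP_NUM_THREADS '{ $2/=omp_num_threads; print }' \
    $PE_HOSTFILE > $TMPDIR/machines
export PE_HOSTFILE=$TMPDIR/machines

mpirun ./inverse.exe
===

With the adjusted hostfile Open MPI starts 80/8 = 10 ranks and leaves the placement of the threads to the operating system.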
Re: [OMPI users] Running a hybrid MPI+openMP program
Am 20.08.2014 um 19:05 schrieb Ralph Castain:

>> Aha, this is quite interesting - how do you do this: scanning the /proc//status or alike? What happens if you don't find enough free cores as they are used up by other applications already?
>
> Remember, when you use mpirun to launch, we launch our own daemons using the native launcher (e.g., qsub). So the external RM will bind our daemons to the specified cores on each node. We use hwloc to determine what cores our daemons are bound to, and then bind our own child processes to cores within that range.

Thanks for reminding me of this. Indeed, I mixed up two different aspects in this discussion.

a) What will happen in case no binding was done by the RM (hence Open MPI could use all cores) and two Open MPI jobs (or something completely different besides one Open MPI job) are running on the same node (due to the Tight Integration with two different Open MPI directories in /tmp and two `orted`, unique for each job)? Will the second Open MPI job know what the first Open MPI job used up already? Or will both use the same set of cores, since "-bind-to none" can't be set on the given `mpiexec` command line once "-map-by slot:pe=$OMP_NUM_THREADS" is used - which makes "-bind-to core" mandatory and can't be switched off? I see the same cores being used for both jobs.

Altering the machinefile instead: the processes are not bound to any core, and the OS takes care of a proper assignment.

> If the cores we are bound to are the same on each node, then we will do this with no further instruction. However, if the cores are different on the individual nodes, then you need to add --hetero-nodes to your command line (as the nodes appear to be heterogeneous to us).

b) Aha, so it's not only about different CPU types, but also the same CPU type with different core allocations between the nodes? It's not in the `mpiexec` man-page of 1.8.1 though. I'll have a look at it.

> So it is up to the RM to set the constraint - we just live within it.

Fine.

-- Reuti
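Whether two concurrent jobs really end up on the same cores can be checked directly with Open MPI's --report-bindings option; a small sketch (the pe value and executable are just the examples used in this thread):

===
# print each rank's core binding to stderr; running this from two jobs that
# share a node shows immediately whether the bindings overlap
mpirun -map-by slot:pe=8 --report-bindings ./inverse.exe
===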
Re: [OMPI users] Running a hybrid MPI+openMP program
On Aug 20, 2014, at 11:16 AM, Reuti wrote: > Am 20.08.2014 um 19:05 schrieb Ralph Castain: > >>> >>> Aha, this is quite interesting - how do you do this: scanning the >>> /proc//status or alike? What happens if you don't find enough free >>> cores as they are used up by other applications already? >>> >> >> Remember, when you use mpirun to launch, we launch our own daemons using the >> native launcher (e.g., qsub). So the external RM will bind our daemons to >> the specified cores on each node. We use hwloc to determine what cores our >> daemons are bound to, and then bind our own child processes to cores within >> that range. > > Thx for reminding me of this. Indeed, I mixed up two different aspects in > this discussion. > > a) What will happen in case no binding was done by the RM (hence Open MPI > could use all cores) and two Open MPI jobs (or something completely different > besides one Open MPI job) are running on the same node (due to the Tight > Integration with two different Open MPI directories in /tmp and two `orted`, > unique for each job)? Will the second Open MPI job know what the first Open > MPI job used up already? Or will both use the same set of cores as "-bind-to > none" can't be set in the given `mpiexec` command because of "-map-by > slot:pe=$OMP_NUM_THREADS" was used - which triggers "-bind-to core" > indispensable and can't be switched off? I see the same cores being used for > both jobs. Yeah, each mpirun executes completely independently of the other, so they have no idea what the other is doing. So the cores will be overloaded. Multi-pe's requires bind-to-core otherwise there is no way to implement the request > > Altering the machinefile instead: the processes are not bound to any core, > and the OS takes care of a proper assignment. > > >> If the cores we are bound to are the same on each node, then we will do this >> with no further instruction. However, if the cores are different on the >> individual nodes, then you need to add --hetero-nodes to your command line >> (as the nodes appear to be heterogeneous to us). > > b) Aha, it's not about different type CPU types, but also same CPU type but > different allocations between the nodes? It's not in the `mpiexec` man-page > of 1.8.1 though. I'll have a look at it. The man page is probably a little out-of-date in this area - but yes, --hetero-nodes is required for *any* difference in the way the nodes appear to us (cpus, slot assignments, etc.). The 1.9 series may remove that requirement - still looking at it. > > >> So it is up to the RM to set the constraint - we just live within it. > > Fine. > > -- Reuti > ___ > users mailing list > us...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/users/2014/08/25097.php
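A sketch of the resulting command line with --hetero-nodes added, for allocations where the granted cores differ from node to node (thread count and executable as used earlier in the thread):

===
export OMP_NUM_THREADS=8
# --hetero-nodes: do not assume that every node presents the same
# cpu/slot layout as the first one
mpirun --hetero-nodes -map-by slot:pe=$OMP_NUM_THREADS ./inverse.exe
===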
[OMPI users] Clarification about OpenMPI, slurm and PMI interface
Dear Open MPI experts, I have a problem that is related to the integration of OpenMPI, slurm and PMI interface. I spent some time today with a colleague of mine trying to figure out why we were not able to obtain all H5 profile files (generated by acct_gather_profile) using Open MPI. When I say "all" I mean if I run using 8 nodes (e.g. tesla[121-128]) then I always systematically miss the file related to the first one (the first node in the allocation list, in this case tesla121). By comparing which processes are spawn on the compute nodes, I discovered that mpirun running on tesla121 calls srun only to spawn remotely new MPI processes to the other 7 nodes (maybe this is obvious, for me it was not)... fs395 617 0.0 0.0 106200 1504 ?S22:41 0:00 /bin/bash /var/spool/slurm-test/slurmd/job390044/slurm_script fs395 629 0.1 0.0 194552 5288 ?Sl 22:41 0:00 \_ mpirun -bind-to socket --map-by ppr:1:socket --host tesla121,tesla122,tesla123,tesla124,tesla125,tesla126,tes fs395 632 0.0 0.0 659740 9148 ?Sl 22:41 0:00 | \_ srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=7 --nodelist=tesla122,tesla123,tesla1 fs395 633 0.0 0.0 55544 1072 ?S22:41 0:00 | | \_ srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=7 --nodelist=tesla122,tesla123,te fs395 651 0.0 0.0 106072 1392 ?S22:41 0:00 | \_ /bin/bash ./run_linpack ./xhpl fs395 654 295 35.5 120113412 23289280 ? RLl 22:41 3:12 | | \_ ./xhpl fs395 652 0.0 0.0 106072 1396 ?S22:41 0:00 | \_ /bin/bash ./run_linpack ./xhpl fs395 656 307 35.5 120070332 23267728 ? RLl 22:41 3:19 | \_ ./xhpl The "xhpl" processes allocated on the first node of a job are not called by srun and because of this the SLURM profile plugin is not activated on the node!!! As result I always miss the first node profile information. Intel MPI does not have this behavior, mpiexec.hydra uses srun on the first node. I got to the conclusion that SLURM is configured properly, something is wrong in the way I lunch Open MPI using mpirun. If I disable SLURM support and I revert back to rsh (--mca plm rsh) everything work but there is not profiling because the SLURM plug-in is not activated. During the configure step, Open MPI 1.8.1 detects slurm and libmpi/libpmi2 correctly. Honestly, I would prefer to avoid to use srun as job luncher if possible... Any suggestion to get this sorted out is really appreciated! Best Regards, Filippo -- Mr. Filippo SPIGA, M.Sc. http://filippospiga.info ~ skype: filippo.spiga «Nobody will drive us out of Cantor's paradise.» ~ David Hilbert * Disclaimer: "Please note this message and any attachments are CONFIDENTIAL and may be privileged or otherwise protected from disclosure. The contents are not to be disclosed to anyone other than the addressee. Unauthorized recipients are requested to preserve this confidentiality and to advise the sender immediately of any error in transmission."
Re: [OMPI users] Clarification about OpenMPI, slurm and PMI interface
Hi, Filippo When launching with mpirun in a SLURM environment, srun is only being used to launch the ORTE daemons (orteds.) Since the daemon will already exist on the node from which you invoked mpirun, this node will not be included in the list of nodes. SLURM's PMI library is not involved (that functionality is only necessary if you directly launch your MPI application with srun, in which case it is used to exchanged wireup info amongst slurmds.) This is the expected behavior. ~/ompi-top-level/orte/mca/plm/plm_slurm_module.c +294 /* if the daemon already exists on this node, then * don't include it */ if (node->daemon_launched) { continue; } Do you have a frontend node that you can launch from? What happens if you set "-np X" where X = 8*ppn. The alternative is to do a direct launch of the MPI application with srun. Best, Josh On Wed, Aug 20, 2014 at 6:48 PM, Filippo Spiga wrote: > Dear Open MPI experts, > > I have a problem that is related to the integration of OpenMPI, slurm and > PMI interface. I spent some time today with a colleague of mine trying to > figure out why we were not able to obtain all H5 profile files (generated > by acct_gather_profile) using Open MPI. When I say "all" I mean if I run > using 8 nodes (e.g. tesla[121-128]) then I always systematically miss the > file related to the first one (the first node in the allocation list, in > this case tesla121). > > By comparing which processes are spawn on the compute nodes, I discovered > that mpirun running on tesla121 calls srun only to spawn remotely new MPI > processes to the other 7 nodes (maybe this is obvious, for me it was not)... > > fs395 617 0.0 0.0 106200 1504 ?S22:41 0:00 /bin/bash > /var/spool/slurm-test/slurmd/job390044/slurm_script > fs395 629 0.1 0.0 194552 5288 ?Sl 22:41 0:00 \_ > mpirun -bind-to socket --map-by ppr:1:socket --host > tesla121,tesla122,tesla123,tesla124,tesla125,tesla126,tes > fs395 632 0.0 0.0 659740 9148 ?Sl 22:41 0:00 | \_ > srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=7 > --nodelist=tesla122,tesla123,tesla1 > fs395 633 0.0 0.0 55544 1072 ?S22:41 0:00 | | > \_ srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=7 > --nodelist=tesla122,tesla123,te > fs395 651 0.0 0.0 106072 1392 ?S22:41 0:00 | \_ > /bin/bash ./run_linpack ./xhpl > fs395 654 295 35.5 120113412 23289280 ? RLl 22:41 3:12 | | > \_ ./xhpl > fs395 652 0.0 0.0 106072 1396 ?S22:41 0:00 | \_ > /bin/bash ./run_linpack ./xhpl > fs395 656 307 35.5 120070332 23267728 ? RLl 22:41 3:19 | > \_ ./xhpl > > > The "xhpl" processes allocated on the first node of a job are not called > by srun and because of this the SLURM profile plugin is not activated on > the node!!! As result I always miss the first node profile information. > Intel MPI does not have this behavior, mpiexec.hydra uses srun on the first > node. > > I got to the conclusion that SLURM is configured properly, something is > wrong in the way I lunch Open MPI using mpirun. If I disable SLURM support > and I revert back to rsh (--mca plm rsh) everything work but there is not > profiling because the SLURM plug-in is not activated. During the configure > step, Open MPI 1.8.1 detects slurm and libmpi/libpmi2 correctly. Honestly, > I would prefer to avoid to use srun as job luncher if possible... > > Any suggestion to get this sorted out is really appreciated! > > Best Regards, > Filippo > > -- > Mr. Filippo SPIGA, M.Sc. 
> http://filippospiga.info ~ skype: filippo.spiga > > «Nobody will drive us out of Cantor's paradise.» ~ David Hilbert > > * > Disclaimer: "Please note this message and any attachments are CONFIDENTIAL > and may be privileged or otherwise protected from disclosure. The contents > are not to be disclosed to anyone other than the addressee. Unauthorized > recipients are requested to preserve this confidentiality and to advise the > sender immediately of any error in transmission." > > > > ___ > users mailing list > us...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/users/2014/08/25099.php >
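The direct launch Josh mentions would look roughly like the sketch below. It assumes Open MPI was configured with PMI support (e.g. --with-pmi pointing to the SLURM installation) and that srun's pmi2 plugin is available - both need to be verified on the actual system; the two tasks per node mirror the ppr:1:socket layout of the original job.

===
#!/bin/bash
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=2

# start the MPI ranks directly with srun (no mpirun/orted in between), so the
# SLURM profiling plugin sees the tasks on every node, including the first one
srun --mpi=pmi2 ./run_linpack ./xhpl
===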
Re: [OMPI users] Clarification about OpenMPI, slurm and PMI interface
Or you can add -nolocal|--nolocalDo not run any MPI applications on the local node to your mpirun command line and we won't run any application procs on the node where mpirun is executing On Aug 20, 2014, at 4:28 PM, Joshua Ladd wrote: > Hi, Filippo > > When launching with mpirun in a SLURM environment, srun is only being used to > launch the ORTE daemons (orteds.) Since the daemon will already exist on the > node from which you invoked mpirun, this node will not be included in the > list of nodes. SLURM's PMI library is not involved (that functionality is > only necessary if you directly launch your MPI application with srun, in > which case it is used to exchanged wireup info amongst slurmds.) This is the > expected behavior. > > ~/ompi-top-level/orte/mca/plm/plm_slurm_module.c +294 > /* if the daemon already exists on this node, then > * don't include it > */ > if (node->daemon_launched) { > continue; > } > > Do you have a frontend node that you can launch from? What happens if you set > "-np X" where X = 8*ppn. The alternative is to do a direct launch of the MPI > application with srun. > > > Best, > > Josh > > > > On Wed, Aug 20, 2014 at 6:48 PM, Filippo Spiga > wrote: > Dear Open MPI experts, > > I have a problem that is related to the integration of OpenMPI, slurm and PMI > interface. I spent some time today with a colleague of mine trying to figure > out why we were not able to obtain all H5 profile files (generated by > acct_gather_profile) using Open MPI. When I say "all" I mean if I run using 8 > nodes (e.g. tesla[121-128]) then I always systematically miss the file > related to the first one (the first node in the allocation list, in this case > tesla121). > > By comparing which processes are spawn on the compute nodes, I discovered > that mpirun running on tesla121 calls srun only to spawn remotely new MPI > processes to the other 7 nodes (maybe this is obvious, for me it was not)... > > fs395 617 0.0 0.0 106200 1504 ?S22:41 0:00 /bin/bash > /var/spool/slurm-test/slurmd/job390044/slurm_script > fs395 629 0.1 0.0 194552 5288 ?Sl 22:41 0:00 \_ mpirun > -bind-to socket --map-by ppr:1:socket --host > tesla121,tesla122,tesla123,tesla124,tesla125,tesla126,tes > fs395 632 0.0 0.0 659740 9148 ?Sl 22:41 0:00 | \_ srun > --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=7 > --nodelist=tesla122,tesla123,tesla1 > fs395 633 0.0 0.0 55544 1072 ?S22:41 0:00 | | \_ > srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=7 > --nodelist=tesla122,tesla123,te > fs395 651 0.0 0.0 106072 1392 ?S22:41 0:00 | \_ > /bin/bash ./run_linpack ./xhpl > fs395 654 295 35.5 120113412 23289280 ? RLl 22:41 3:12 | | \_ > ./xhpl > fs395 652 0.0 0.0 106072 1396 ?S22:41 0:00 | \_ > /bin/bash ./run_linpack ./xhpl > fs395 656 307 35.5 120070332 23267728 ? RLl 22:41 3:19 | \_ > ./xhpl > > > The "xhpl" processes allocated on the first node of a job are not called by > srun and because of this the SLURM profile plugin is not activated on the > node!!! As result I always miss the first node profile information. Intel MPI > does not have this behavior, mpiexec.hydra uses srun on the first node. > > I got to the conclusion that SLURM is configured properly, something is wrong > in the way I lunch Open MPI using mpirun. If I disable SLURM support and I > revert back to rsh (--mca plm rsh) everything work but there is not profiling > because the SLURM plug-in is not activated. During the configure step, Open > MPI 1.8.1 detects slurm and libmpi/libpmi2 correctly. 
Honestly, I would > prefer to avoid to use srun as job luncher if possible... > > Any suggestion to get this sorted out is really appreciated! > > Best Regards, > Filippo > > -- > Mr. Filippo SPIGA, M.Sc. > http://filippospiga.info ~ skype: filippo.spiga > > «Nobody will drive us out of Cantor's paradise.» ~ David Hilbert > > * > Disclaimer: "Please note this message and any attachments are CONFIDENTIAL > and may be privileged or otherwise protected from disclosure. The contents > are not to be disclosed to anyone other than the addressee. Unauthorized > recipients are requested to preserve this confidentiality and to advise the > sender immediately of any error in transmission." > > > > ___ > users mailing list > us...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/users/2014/08/25099.php > > ___ > users mailing list > us...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/user
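A sketch of the job's mpirun call with -nolocal added; the binding and mapping options are copied from the ps output above, and the host list is left to the SLURM allocation:

===
# do not place any application processes on the node where mpirun itself runs;
# all xhpl ranks are then started remotely through srun
mpirun -nolocal -bind-to socket --map-by ppr:1:socket ./run_linpack ./xhpl
===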
Re: [OMPI users] Running a hybrid MPI+openMP program
Reuti, Sorry for confusing you. Under the managed condition, actually -np option is not necessary. So, this cmd line also works for me with Torque. $ qsub -l nodes=10:ppn=N $ mpirun -map-by slot:pe=N ./inverse.exe At least, Ralph confirmed it worked with Slurm and I comfirmed with Torque as shown below: [mishima@manage ~]$ qsub -I -l nodes=4:ppn=8 qsub: waiting for job 8798.manage.cluster to start qsub: job 8798.manage.cluster ready [mishima@node09 ~]$ cat $PBS_NODEFILE node09 node09 node09 node09 node09 node09 node09 node09 node10 node10 node10 node10 node10 node10 node10 node10 node11 node11 node11 node11 node11 node11 node11 node11 node12 node12 node12 node12 node12 node12 node12 node12 [mishima@node09 ~]$ mpirun -map-by slot:pe=8 -display-map ~/mis/openmpi/demos/myprog Data for JOB [8050,1] offset 0 JOB MAP Data for node: node09 Num slots: 8Max slots: 0Num procs: 1 Process OMPI jobid: [8050,1] App: 0 Process rank: 0 Data for node: node10 Num slots: 8Max slots: 0Num procs: 1 Process OMPI jobid: [8050,1] App: 0 Process rank: 1 Data for node: node11 Num slots: 8Max slots: 0Num procs: 1 Process OMPI jobid: [8050,1] App: 0 Process rank: 2 Data for node: node12 Num slots: 8Max slots: 0Num procs: 1 Process OMPI jobid: [8050,1] App: 0 Process rank: 3 = Hello world from process 0 of 4 Hello world from process 2 of 4 Hello world from process 3 of 4 Hello world from process 1 of 4 [mishima@node09 ~]$ mpirun -map-by slot:pe=4 -display-map ~/mis/openmpi/demos/myprog Data for JOB [8056,1] offset 0 JOB MAP Data for node: node09 Num slots: 8Max slots: 0Num procs: 2 Process OMPI jobid: [8056,1] App: 0 Process rank: 0 Process OMPI jobid: [8056,1] App: 0 Process rank: 1 Data for node: node10 Num slots: 8Max slots: 0Num procs: 2 Process OMPI jobid: [8056,1] App: 0 Process rank: 2 Process OMPI jobid: [8056,1] App: 0 Process rank: 3 Data for node: node11 Num slots: 8Max slots: 0Num procs: 2 Process OMPI jobid: [8056,1] App: 0 Process rank: 4 Process OMPI jobid: [8056,1] App: 0 Process rank: 5 Data for node: node12 Num slots: 8Max slots: 0Num procs: 2 Process OMPI jobid: [8056,1] App: 0 Process rank: 6 Process OMPI jobid: [8056,1] App: 0 Process rank: 7 = Hello world from process 1 of 8 Hello world from process 0 of 8 Hello world from process 2 of 8 Hello world from process 3 of 8 Hello world from process 4 of 8 Hello world from process 5 of 8 Hello world from process 6 of 8 Hello world from process 7 of 8 I don't know why it dosen't work with SGE. Could you show me your output adding -display-map and -mca rmaps_base_verbose 5 options? By the way, the option -map-by ppr:N:node or ppr:N:socket might be useful for your purpose. The ppr can reduce the slot counts given by RM without binding and allocate N procs by the specified resource. 
[mishima@node09 ~]$ mpirun -map-by ppr:1:node -display-map ~/mis/openmpi/demos/myprog Data for JOB [7913,1] offset 0 JOB MAP Data for node: node09 Num slots: 8Max slots: 0Num procs: 1 Process OMPI jobid: [7913,1] App: 0 Process rank: 0 Data for node: node10 Num slots: 8Max slots: 0Num procs: 1 Process OMPI jobid: [7913,1] App: 0 Process rank: 1 Data for node: node11 Num slots: 8Max slots: 0Num procs: 1 Process OMPI jobid: [7913,1] App: 0 Process rank: 2 Data for node: node12 Num slots: 8Max slots: 0Num procs: 1 Process OMPI jobid: [7913,1] App: 0 Process rank: 3 = Hello world from process 0 of 4 Hello world from process 2 of 4 Hello world from process 1 of 4 Hello world from process 3 of 4 Tetsuya > Hi, > > Am 20.08.2014 um 13:26 schrieb tmish...@jcity.maeda.co.jp: > > > Reuti, > > > > If you want to allocate 10 procs with N threads, the Torque > > script below should work for you: > > > > qsub -l nodes=10:ppn=N > > mpirun -map-by slot:pe=N -np 10 -x OMP_NUM_THREADS=N ./inverse.exe > > I played around with giving -np 10 in addition to a Tight Integration. The slot count is not really divided I think, but only 10 out of the granted maximum is used (while on each of the listed > machines an `orted` is started). Due to the fixed allocation this is of course the result we want to achieve as it subtracts bunches of 8 from the given list of machines resp. slots. In SGE it's > sufficient to use and AFAICS it works (without touching the $PE_HOSTFILE any longer): > > === > export OMP_NUM_THREADS=8 > mpirun -map-by slot:pe=$OMP_NUM_THREADS -np $(bc <<<"$NSLOTS / $OMP_NUM_THREADS") ./inver
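For the hybrid case discussed in this thread, the ppr mapping could be combined with OMP_NUM_THREADS roughly as follows; one rank per node with 8 threads is only an example, and the resulting binding should be checked with --report-bindings:

===
# one MPI rank per node, each rank free to run 8 OpenMP threads; the ppr
# mapping derives the rank count from the allocation, so no -np is needed
mpirun -map-by ppr:1:node -x OMP_NUM_THREADS=8 ./inverse.exe
===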
Re: [OMPI users] Running a hybrid MPI+openMP program
Oscar, As I mentioned before, I've never used SGE. So please ask for Reuti's advise. Only thing I can tell is that you have to use the openmpi 1.8 series to use -map-by slot:pe=N option. Tetsuya > Hi > > Well, with qconf -sq one.q I got the following: > > [oscar@aguia free-noise]$ qconf -sq one.q > qname one.q > hostlist compute-1-30.local compute-1-2.local compute-1-3.local \ > compute-1-4.local compute-1-5.local compute-1-6.local \ > compute-1-7.local compute-1-8.local compute-1-9.local \ > compute-1-10.local compute-1-11.local compute-1-12.local \ > compute-1-13.local compute-1-14.local compute-1-15.local > seq_no 0 > load_thresholds np_load_avg=1.75 > suspend_thresholds NONE > nsuspend 1 > suspend_interval 00:05:00 > priority 0 > min_cpu_interval 00:05:00 > processors UNDEFINED > qtype BATCH INTERACTIVE > ckpt_list NONE > pe_list make mpich mpi orte > rerun FALSE > slots 1,[compute-1-30.local=1],[compute-1-2.local=1], \ > [compute-1-3.local=1],[compute-1-5.local=1], \ > [compute-1-8.local=1],[compute-1-6.local=1], \ > [compute-1-4.local=1],[compute-1-9.local=1], \ > [compute-1-11.local=1],[compute-1-7.local=1], \ > [compute-1-13.local=1],[compute-1-10.local=1], \ > [compute-1-15.local=1],[compute-1-12.local=1], \ > [compute-1-14.local=1] > > the admin was who created this queue, so I have to speak to him to change the number of slots to number of threads that i wish to use. > > Then I could make use of: > === > export OMP_NUM_THREADS=N > mpirun -map-by slot:pe=$OMP_NUM_THREADS -np $(bc <<<"$NSLOTS / $OMP_NUM_THREADS") ./inverse.exe > === > > For now in my case this command line just would work for 10 processes and the work wouldn't be divided in threads, is it right? > > can I set a maximum number of threads in the queue one.q (e.g. 15 ) and change the number in the 'export' for my convenience > > I feel like a child hearing the adults speaking > Thanks I'm learning a lot > > > Oscar Fabian Mojica Ladino > Geologist M.S. in Geophysics > > > > From: re...@staff.uni-marburg.de > > Date: Tue, 19 Aug 2014 19:51:46 +0200 > > To: us...@open-mpi.org > > Subject: Re: [OMPI users] Running a hybrid MPI+openMP program > > > > Hi, > > > > Am 19.08.2014 um 19:06 schrieb Oscar Mojica: > > > > > I discovered what was the error. I forgot include the '-fopenmp' when I compiled the objects in the Makefile, so the program worked but it didn't divide the job in threads. Now the program is > working and I can use until 15 cores for machine in the queue one.q. > > > > > > Anyway i would like to try implement your advice. Well I'm not alone in the cluster so i must implement your second suggestion. The steps are > > > > > > a) Use '$ qconf -mp orte' to change the allocation rule to 8 > > > > The number of slots defined in your used one.q was also increased to 8 (`qconf -sq one.q`)? > > > > > > > b) Set '#$ -pe orte 80' in the script > > > > Fine. > > > > > > > c) I'm not sure how to do this step. I'd appreciate your help here. I can add some lines to the script to determine the PE_HOSTFILE path and contents, but i don't know how alter it > > > > For now you can put in your jobscript (just after OMP_NUM_THREAD is exported): > > > > awk -v omp_num_threads=$OMP_NUM_THREADS '{ $2/=omp_num_threads; print }' $PE_HOSTFILE > $TMPDIR/machines > > export PE_HOSTFILE=$TMPDIR/machines > > > > = > > > > Unfortunately noone stepped into this discussion, as in my opinion it's a much broader issue which targets all users who want to combine MPI with OpenMP. 
The queuingsystem should get a proper > request for the overall amount of slots the user needs. For now this will be forwarded to Open MPI and it will use this information to start the appropriate number of processes (which was an > achievement for the Tight Integration out-of-the-box of course) and ignores any setting of OMP_NUM_THREADS. So, where should the generated list of machines be adjusted; there are several options: > > > > a) The PE of the queuingsystem should do it: > > > > + a one time setup for the admin > > + in SGE the "start_proc_args" of the PE could alter the $PE_HOSTFILE > > - the "start_proc_args" would need to know the number of threads, i.e. OMP_NUM_THREADS must be defined by "qsub -v ..." outside of the jobscript (tricky scanning of the submitted jobscript for > OMP_NUM_THREADS would be too nasty) > > - limits to use inside the jobscript calls to libraries behaving in the same way as Open MPI only > > > > > > b) The particular queue should do it in a queue prolog: > > > > same as a) I think > > > > > > c) The user should
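A quick way to check whether the installed Open MPI is recent enough for the pe=N modifier (it requires the 1.8 series) is to query the version, e.g.:

===
# both commands report the Open MPI version of the installation in $PATH
ompi_info | grep "Open MPI:"
mpirun --version
===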