[OMPI users] MPI_File_write hangs on NFS-mounted filesystem
The simple C program attached below hangs on MPI_File_write when I am using an NFS-mounted filesystem. Is MPI-IO supported in OpenMPI for NFS filesystems?

I'm using OpenMPI 1.4.5 on Debian stable (wheezy), 64-bit Opteron CPU, Linux 3.2.51. I was surprised by this because the problems only started occurring recently when I upgraded my Debian system to wheezy; with OpenMPI in the previous Debian release, output to NFS-mounted filesystems worked fine.

Is there any easy way to get this working? Any tips are appreciated.

Regards,
Steven G. Johnson

---
#include <stdio.h>
#include <string.h>
#include <mpi.h>

void perr(const char *label, int err)
{
    char s[MPI_MAX_ERROR_STRING];
    int len;
    MPI_Error_string(err, s, &len);
    printf("%s: %d = %s\n", label, err, s);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_File fh;
    int err;
    err = MPI_File_open(MPI_COMM_WORLD, "tstmpiio.dat",
                        MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    perr("open", err);

    const char s[] = "Hello world!\n";
    MPI_Status status;
    err = MPI_File_write(fh, (void*) s, strlen(s), MPI_CHAR, &status);
    perr("write", err);

    err = MPI_File_close(&fh);
    perr("close", err);

    MPI_Finalize();
    return 0;
}
Re: [OMPI users] MPI_File_write hangs on NFS-mounted filesystem
Not sure if this is related, but I've seen a case of performance degradation on NFS and Lustre when writing NetCDF files. The reason was that the file was filled by a loop writing one 4-byte record at a time. Performance became close to that of a local hard drive when I simply buffered the records and wrote them to the file one row at a time.

- D.

2013/11/7 Steven G Johnson:
> The simple C program attached below hangs on MPI_File_write when I am using
> an NFS-mounted filesystem. Is MPI-IO supported in OpenMPI for NFS
> filesystems?
>
> I'm using OpenMPI 1.4.5 on Debian stable (wheezy), 64-bit Opteron CPU, Linux
> 3.2.51. I was surprised by this because the problems only started occurring
> recently when I upgraded my Debian system to wheezy; with OpenMPI in the
> previous Debian release, output to NFS-mounted filesystems worked fine.
>
> Is there any easy way to get this working? Any tips are appreciated.
>
> Regards,
> Steven G. Johnson
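[Editor's note] To make the buffering point concrete, here is a minimal hypothetical sketch (not code from this thread; the record type, row length, and file name are invented) that accumulates a row of 4-byte records in memory and issues one large MPI_File_write instead of one write per record:

/* Hypothetical sketch of the buffering described above: instead of one
 * MPI_File_write per 4-byte record, accumulate a row of records and
 * write it with a single call. */
#include <mpi.h>

#define RECORDS_PER_ROW 1024

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_File fh;
    MPI_File_open(MPI_COMM_SELF, "buffered.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    float row[RECORDS_PER_ROW];           /* one 4-byte record per element */
    for (int i = 0; i < RECORDS_PER_ROW; i++)
        row[i] = (float) i;

    /* One large write instead of RECORDS_PER_ROW tiny ones. */
    MPI_File_write(fh, row, RECORDS_PER_ROW, MPI_FLOAT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}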
Re: [OMPI users] MPI_File_write hangs on NFS-mounted filesystem
That's a relatively old version of OMPI. Maybe try the latest release? That's always the safe bet, since the issue might have been fixed already.

I recall that OMPI uses ROMIO, so you might try to reproduce with MPICH so you can report it to the people who wrote the MPI-IO code. Of course, this might not be an issue with ROMIO itself; trying with MPICH is a good way to verify that.

Best,
Jeff

Sent from my iPhone

On Nov 7, 2013, at 10:55 AM, Steven G Johnson wrote:

> The simple C program attached below hangs on MPI_File_write when I am using
> an NFS-mounted filesystem. Is MPI-IO supported in OpenMPI for NFS
> filesystems?
>
> I'm using OpenMPI 1.4.5 on Debian stable (wheezy), 64-bit Opteron CPU, Linux
> 3.2.51. I was surprised by this because the problems only started occurring
> recently when I upgraded my Debian system to wheezy; with OpenMPI in the
> previous Debian release, output to NFS-mounted filesystems worked fine.
>
> Is there any easy way to get this working? Any tips are appreciated.
>
> Regards,
> Steven G. Johnson
Re: [OMPI users] MPI_File_write hangs on NFS-mounted filesystem
Hi Steven, Dmitry,

Not sure if this web page is still valid or totally out of date, but here it goes anyway, in the hope that it may help:
http://www.mcs.anl.gov/research/projects/mpi/mpich1-old/docs/install/node38.htm

On the other hand, one expert seems to dismiss NFS for parallel IO:
http://www.open-mpi.org/community/lists/users/2008/07/6125.php

I must say that this has been a gray area for me too. It would be nice if the documentation for MPI - or for the various MPIs - told us a bit more clearly which types of underlying file system support MPI parallel IO: local disks (ext?, xfs, etc.), NFS mounts, and the various parallel file systems (PVFS/OrangeFS, Lustre, GlusterFS, etc.), and perhaps provided some setup information plus functionality and performance comparisons.

My two cents,
Gus Correa

On 11/07/2013 12:21 PM, Dmitry N. Mikushin wrote:
> Not sure if this is related, but I've seen a case of performance degradation
> on NFS and Lustre when writing NetCDF files. The reason was that the file
> was filled by a loop writing one 4-byte record at a time. Performance became
> close to that of a local hard drive when I simply buffered the records and
> wrote them to the file one row at a time.
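[Editor's note] For completeness, below is a hedged C sketch (not from this thread) of two knobs that the ROMIO layer underneath OpenMPI's MPI-IO exposes: selecting its NFS driver explicitly by prefixing the filename with "nfs:", and passing tuning hints through MPI_Info. The hint shown, and whether any of this avoids the reported hang on a given build, are assumptions to verify rather than a known fix. ROMIO's documentation has also historically recommended mounting NFS exports with attribute caching disabled (noac) for correct MPI-IO behavior.

/* Hedged sketch: open the test file through ROMIO's NFS code path and pass
 * an optional ROMIO hint.  Assumes a ROMIO-based MPI-IO build; the file
 * name and hint choice are illustrative, not a recommendation. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    /* ROMIO hint: disable data sieving for writes (optional). */
    MPI_Info_set(info, "romio_ds_write", "disable");

    MPI_File fh;
    int err = MPI_File_open(MPI_COMM_WORLD, "nfs:tstmpiio.dat",
                            MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    if (err != MPI_SUCCESS)
        fprintf(stderr, "open failed: %d\n", err);
    else
        MPI_File_close(&fh);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}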
Re: [OMPI users] proper use of MPI_Abort
Jeff,

Good to know. Thanks!

It really seems like MPI_ABORT should only be used within error traps after MPI functions have been started. Code-wise, the sample I got was not the best; usage should be checked before MPI_Init, I think :)

It seems the expectation is that MPI_ABORT is only called when the user should be notified that something went haywire.

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238

> -Original Message-
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
> Squyres (jsquyres)
> Sent: Wednesday, November 06, 2013 11:30 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] proper use of MPI_Abort
>
> I just checked the v1.7 series -- it looks like we have cleaned up this
> message a bit. With your code snippet:
>
> -
> ❯❯❯ mpicc abort.c -o abort && mpirun -np 4 abort
>
> *# Usage: mpicpy -input
>
> --
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 1.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on exactly
> when Open MPI kills them.
> --
>
> Notice the lack of the 2nd message.
>
> So I think the answer here is: it's fixed in the 1.7.x series. It is
> unlikely to be fixed in the 1.6.x series.
>
> On Nov 5, 2013, at 3:16 PM, "Andrus, Brian Contractor" wrote:
>
> > Jeff,
> >
> > We are using the latest version: 1.6.5
> >
> > Brian Andrus
> > ITACS/Research Computing
> > Naval Postgraduate School
> > Monterey, California
> > voice: 831-656-6238
> >
> >> -Original Message-
> >> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
> >> Squyres (jsquyres)
> >> Sent: Tuesday, November 05, 2013 5:11 AM
> >> To: Open MPI Users
> >> Subject: Re: [OMPI users] proper use of MPI_Abort
> >>
> >> You're correct -- you don't need to call MPI_Finalize after MPI_Abort.
> >>
> >> Can you cite what version of Open MPI you are using?
> >>
> >> On Nov 4, 2013, at 9:01 AM, "Andrus, Brian Contractor" wrote:
> >>
> >>> All,
> >>>
> >>> I have some sample code that has a syntax message and then an
> >>> MPI_Abort call if the program is run without the required parameters.
> >>>
> >>> --snip---
> >>>     if (!rank) {
> >>>         i = 1;
> >>>         while ((i < argc) && strcmp("-input", *argv)) {
> >>>             i++;
> >>>             argv++;
> >>>         }
> >>>         if (i >= argc) {
> >>>             fprintf(stderr, "\n*# Usage: mpicpy -input \n\n");
> >>>             MPI_Abort(MPI_COMM_WORLD, 1);
> >>>         }
> >>> --snip---
> >>>
> >>> This is all well and good and it does provide the usage line, but it
> >>> also throws quite a message in addition:
> >>>
> >>> --
> >>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> >>> with errorcode 1.
> >>>
> >>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> >>> You may or may not see output from other processes, depending on
> >>> exactly when Open MPI kills them.
> >>> --
> >>>
> >>> --
> >>> mpirun has exited due to process rank 0 with PID 40209 on node
> >>> compute-3-3 exiting improperly. There are two reasons this could occur:
> >>>
> >>> 1. this process did not call "init" before exiting, but others in
> >>> the job did. This can cause a job to hang indefinitely while it
> >>> waits for all processes to call "init". By rule, if one process
> >>> calls "init", then ALL processes must call "init" prior to termination.
> >>>
> >>> 2. this process called "init", but exited without calling "finalize".
> >>> By rule, all processes that call "init" MUST call "finalize" prior
> >>> to exiting or it will be considered an "abnormal termination"
> >>>
> >>> This may have caused other processes in the application to be
> >>> terminated by signals sent by mpirun (as reported here).
> >>> --
> >>>
> >>> Is there a proper way to use MPI_Abort such that it will not trigger
> >>> such a message?
> >>>
> >>> It almost seems that MPI_Abort should be calling MPI_Finalize as a
> >>> rule, or openmpi should recognize MPI_Abort is the exception to
> >>> requiring MPI_Finalize.
> >>>
> >>> Brian Andrus
> >>> ITACS/Research Computing
> >>> Naval Postgraduate School
> >>> Monterey, California
> >>> voice: 831-656-6238
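[Editor's note] As a hedged illustration of the pattern discussed in this thread (not the original mpicpy code; names and messages are invented), the usage check can be done before MPI_Init with a plain exit, reserving MPI_Abort for real failures that occur after MPI has been initialized:

/* Sketch only: validate arguments before MPI_Init and exit() normally, so
 * mpirun never sees an aborted MPI job; use MPI_Abort only for failures
 * that happen after initialization.  Note that before MPI_Init there is no
 * rank, so every process performs the usage check and may print it. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    /* Usage check before MPI_Init: a plain exit, no MPI_Abort needed. */
    if (argc < 3 || strcmp(argv[1], "-input") != 0) {
        fprintf(stderr, "Usage: %s -input FILE\n", argv[0]);
        return EXIT_FAILURE;
    }

    MPI_Init(&argc, &argv);

    FILE *fp = fopen(argv[2], "r");
    if (fp == NULL) {
        /* A genuine runtime failure after init: MPI_Abort is appropriate. */
        fprintf(stderr, "cannot open %s\n", argv[2]);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    fclose(fp);

    MPI_Finalize();
    return 0;
}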
[OMPI users] LAMA of openmpi-1.7.3 is unstable
Dear openmpi developers,

I tried the new LAMA feature of openmpi-1.7.3 and unfortunately it is not stable under my environment, which is built with torque.

(1) I used 4 scripts as shown below to clarify the problem:

(COMMON PART)
#!/bin/sh
#PBS -l nodes=node03:ppn=8 / nodes=node08:ppn=8
export OMP_NUM_THREADS=1
cd $PBS_O_WORKDIR
cp $PBS_NODEFILE pbs_hosts
NPROCS=`wc -l < pbs_hosts`

(SCRIPT1)
mpirun -report-bindings -mca rmaps lama -mca rmaps_lama_bind 1c Myprog

(SCRIPT2)
mpirun -oversubscribe -report-bindings -mca rmaps lama \
       -mca rmaps_lama_bind 1c Myprog

(SCRIPT3)
mpirun -machinefile pbs_hosts -np ${NPROCS} -oversubscribe \
       -report-bindings -mca rmaps lama -mca rmaps_lama_bind 1c Myprog

(SCRIPT4)
mpirun -machinefile pbs_hosts -np ${NPROCS} -oversubscribe \
       -report-bindings -mca rmaps lama -mca rmaps_lama_bind 1c \
       -mca rmaps_lama_map Nsbnch \
       -mca ess ^tm -mca plm ^tm -mca ras ^tm Myprog

(2) The results are as follows:

          NODE03 (32 cores)   NODE08 (8 cores)
SCRIPT1   *ERROR1             *ERROR1
SCRIPT2   OK                  OK
SCRIPT3   **ABORT             OK
SCRIPT4   **ABORT             **ABORT

(*)ERROR1 means:
--
RMaps LAMA detected oversubscription after mapping 1 of 8 processes.
Since you have asked not to oversubscribe the resources the job will not
be launched. If you would instead like to oversubscribe the resources
try using the --oversubscribe option to mpirun.
--
[node08.cluster:28849] [[50428,0],0] ORTE_ERROR_LOG: Error in file
rmaps_lama_module.c at line 310
[node08.cluster:28849] [[50428,0],0] ORTE_ERROR_LOG: Error in file
base/rmaps_base_map_job.c at line 166

(**)ABORT means "stuck and no answer" until forced termination.

(3) openmpi-1.7.3 configuration (with PGI compiler):

./configure \
  --with-tm \
  --with-verbs \
  --disable-ipv6 \
  CC=pgcc CFLAGS="-fast -tp k8-64e" \
  CXX=pgCC CXXFLAGS="-fast -tp k8-64e" \
  F77=pgfortran FFLAGS="-fast -tp k8-64e" \
  FC=pgfortran FCFLAGS="-fast -tp k8-64e"

(4) Cluster information:

32-core AMD based node (node03):
Machine (126GB)
  Socket L#0 (32GB)
    NUMANode L#0 (P#0 16GB) + L3 L#0 (5118KB)
      L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
      L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2)
      L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3)
    NUMANode L#1 (P#1 16GB) + L3 L#1 (5118KB)
      L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 + PU L#4 (P#4)
      L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 + PU L#5 (P#5)
      L2 L#6 (512KB) + L1d L#6 (64KB) + L1i L#6 (64KB) + Core L#6 + PU L#6 (P#6)
      L2 L#7 (512KB) + L1d L#7 (64KB) + L1i L#7 (64KB) + Core L#7 + PU L#7 (P#7)
  Socket L#1 (32GB)
    NUMANode L#2 (P#6 16GB) + L3 L#2 (5118KB)
      L2 L#8 (512KB) + L1d L#8 (64KB) + L1i L#8 (64KB) + Core L#8 + PU L#8 (P#8)
      L2 L#9 (512KB) + L1d L#9 (64KB) + L1i L#9 (64KB) + Core L#9 + PU L#9 (P#9)
      L2 L#10 (512KB) + L1d L#10 (64KB) + L1i L#10 (64KB) + Core L#10 + PU L#10 (P#10)
      L2 L#11 (512KB) + L1d L#11 (64KB) + L1i L#11 (64KB) + Core L#11 + PU L#11 (P#11)
    NUMANode L#3 (P#7 16GB) + L3 L#3 (5118KB)
      L2 L#12 (512KB) + L1d L#12 (64KB) + L1i L#12 (64KB) + Core L#12 + PU L#12 (P#12)
      L2 L#13 (512KB) + L1d L#13 (64KB) + L1i L#13 (64KB) + Core L#13 + PU L#13 (P#13)
      L2 L#14 (512KB) + L1d L#14 (64KB) + L1i L#14 (64KB) + Core L#14 + PU L#14 (P#14)
      L2 L#15 (512KB) + L1d L#15 (64KB) + L1i L#15 (64KB) + Core L#15 + PU L#15 (P#15)
  Socket L#2 (32GB)
    NUMANode L#4 (P#4 16GB) + L3 L#4 (5118KB)
      L2 L#16 (512KB) + L1d L#16 (64KB) + L1i L#16 (64KB) + Core L#16 + PU L#16 (P#16)
      L2 L#17 (512KB) + L1d L#17 (64KB) + L1i L#17 (64KB) + Core L#17 + PU L#17 (P#17)
      L2 L#18 (512KB) + L1d L#18 (64KB) + L1i L#18 (64KB) + Core L#18 + PU L#18 (P#18)
      L2 L#19 (512KB) + L1d L#19 (64KB) + L1i L#19 (64KB) + Core L#19 + PU L#19 (P#19)
    NUMANode L#5 (P#5 16GB) + L3 L#5 (5118KB)
      L2 L#20 (512KB) + L1d L#20 (64KB) + L1i L#20 (64KB) + Core L#20 + PU L#20 (P#20)
      L2 L#21 (512KB) + L1d L#21 (64KB) + L1i L#21 (64KB) + Core L#21 + PU L#21 (P#21)
      L2 L#22 (512KB) + L1d L#22 (64KB) + L1i L#22 (64KB) + Core L#22 + PU L#22 (P#22)
      L2 L#23 (512KB) + L1d L#23 (64KB) + L1i L#23 (64KB) + Core L#23 + PU L#23 (P#23)
  Socket L#3 (32GB)
    NUMANode L#6 (P#2 16GB) + L3 L#6 (5118KB)
      L2 L#24 (512KB) + L1d L#24 (64KB) + L1i L#24 (64KB) + Core L#24 + PU L#24 (P#24)
      L2 L#25 (512KB) + L1d L#25 (64KB) + L1i L#25 (64KB) + Core L#25 + PU L#25 (P#25)
      L2 L#26 (512KB) + L1d L#26 (64KB) + L1i L#26 (64KB) + Core L#26 + PU L#26 (P#26)
      L2 L#27 (512KB) + L1d L#27 (64KB) + L1i L#27 (64KB) + Core L#27 + PU L#27 (P#27)
    NUMANode L#
Re: [OMPI users] LAMA of openmpi-1.7.3 is unstable
What happens if you drop the LAMA request and instead run

mpirun -report-bindings -bind-to core Myprog

This would do the same thing - does it work? If so, then we know it is a problem in the LAMA mapper. If not, then it is likely a problem in a different section of the code.

On Nov 7, 2013, at 3:43 PM, tmish...@jcity.maeda.co.jp wrote:

> Dear openmpi developers,
>
> I tried the new LAMA feature of openmpi-1.7.3 and unfortunately it is not
> stable under my environment, which is built with torque.
Re: [OMPI users] LAMA of openmpi-1.7.3 is unstable
Hi Ralph,

I quickly tried 2 runs:

mpirun -report-bindings -bind-to core Myprog
mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog

It works fine in both cases on node03 and node08.

Regards,
Tetsuya Mishima

> What happens if you drop the LAMA request and instead run
>
> mpirun -report-bindings -bind-to core Myprog
>
> This would do the same thing - does it work? If so, then we know it is a
> problem in the LAMA mapper. If not, then it is likely a problem in a
> different section of the code.
Re: [OMPI users] LAMA of openmpi-1.7.3 is unstable
Okay, so the problem is a bug in LAMA itself. I'll file a ticket and let the LAMA folks look into it.

On Nov 7, 2013, at 4:18 PM, tmish...@jcity.maeda.co.jp wrote:

> Hi Ralph,
>
> I quickly tried 2 runs:
>
> mpirun -report-bindings -bind-to core Myprog
> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
>
> It works fine in both cases on node03 and node08.
>
> Regards,
> Tetsuya Mishima
Re: [OMPI users] LAMA of openmpi-1.7.3 is unstable
Thanks, Ralph.

Here is some additional information. If I just execute directly on the node without Torque:

mpirun -np 8 -report-bindings -mca rmaps lama -mca rmaps_lama_bind 1c Myprog

then it also works, which means the combination of LAMA and Torque would cause the problem.

Tetsuya Mishima

> Okay, so the problem is a bug in LAMA itself. I'll file a ticket and let
> the LAMA folks look into it.
Re: [OMPI users] LAMA of openmpi-1.7.3 is unstable
I suspect something else is going on there - I can't imagine how the LAMA mapper could be interacting with the Torque launcher. The check for adequate resources (per the error message) is done long before we get to the launcher. I'll have to let the LAMA supporters chase it down.

Thanks
Ralph

On Nov 7, 2013, at 4:37 PM, tmish...@jcity.maeda.co.jp wrote:

> Thanks, Ralph.
>
> Here is some additional information. If I just execute directly on the node
> without Torque:
>
> mpirun -np 8 -report-bindings -mca rmaps lama -mca rmaps_lama_bind 1c Myprog
>
> then it also works, which means the combination of LAMA and Torque would
> cause the problem.
>
> Tetsuya Mishima