[OMPI users] MPI_File_write hangs on NFS-mounted filesystem
The simple C program attached below hangs on MPI_File_write when I am using an NFS-mounted filesystem. Is MPI-IO supported in OpenMPI for NFS filesystems?

I'm using OpenMPI 1.4.5 on Debian stable (wheezy), 64-bit Opteron CPU, Linux 3.2.51. I was surprised by this because the problems only started occurring recently when I upgraded my Debian system to wheezy; with OpenMPI in the previous Debian release, output to NFS-mounted filesystems worked fine.

Is there any easy way to get this working? Any tips are appreciated.

Regards,
Steven G. Johnson

---
#include <stdio.h>
#include <string.h>
#include <mpi.h>

void perr(const char *label, int err)
{
    char s[MPI_MAX_ERROR_STRING];
    int len;
    MPI_Error_string(err, s, &len);
    printf("%s: %d = %s\n", label, err, s);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_File fh;
    int err;
    err = MPI_File_open(MPI_COMM_WORLD, "tstmpiio.dat",
                        MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    perr("open", err);

    const char s[] = "Hello world!\n";
    MPI_Status status;
    err = MPI_File_write(fh, (void*) s, strlen(s), MPI_CHAR, &status);
    perr("write", err);

    err = MPI_File_close(&fh);
    perr("close", err);

    MPI_Finalize();
    return 0;
}
Re: [OMPI users] MPI_File_write hangs on NFS-mounted filesystem
Not sure if this is related, but I've seen a case of performance degradation on NFS and Lustre when writing NetCDF files. The reason was that the file was filled by a loop writing one 4-byte record at a time. Performance became close to that of a local hard drive when I simply buffered the records and wrote them to the file one row at a time.

- D.

2013/11/7 Steven G Johnson:
> The simple C program attached below hangs on MPI_File_write when I am using
> an NFS-mounted filesystem. Is MPI-IO supported in OpenMPI for NFS
> filesystems?
>
> I'm using OpenMPI 1.4.5 on Debian stable (wheezy), 64-bit Opteron CPU, Linux
> 3.2.51. I was surprised by this because the problems only started occurring
> recently when I upgraded my Debian system to wheezy; with OpenMPI in the
> previous Debian release, output to NFS-mounted filesystems worked fine.
>
> Is there any easy way to get this working? Any tips are appreciated.
>
> Regards,
> Steven G. Johnson
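[Editor's note] To make the buffering point concrete, here is a minimal hypothetical sketch (not code from this thread; the record type, row length, and file name are invented) that accumulates a row of 4-byte records in memory and issues one large MPI_File_write instead of one write per record:

/* Hypothetical sketch of the buffering described above: instead of one
 * MPI_File_write per 4-byte record, accumulate a row of records and
 * write it with a single call. */
#include <mpi.h>

#define RECORDS_PER_ROW 1024

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_File fh;
    MPI_File_open(MPI_COMM_SELF, "buffered.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    float row[RECORDS_PER_ROW];           /* one 4-byte record per element */
    for (int i = 0; i < RECORDS_PER_ROW; i++)
        row[i] = (float) i;

    /* One large write instead of RECORDS_PER_ROW tiny ones. */
    MPI_File_write(fh, row, RECORDS_PER_ROW, MPI_FLOAT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}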
Re: [OMPI users] MPI_File_write hangs on NFS-mounted filesystem
That's a relatively old version of OMPI. Maybe try the latest release? That's always the safe bet, since the issue might have been fixed already.

I recall that OMPI uses ROMIO, so you might try to reproduce with MPICH so you can report it to the people who wrote the MPI-IO code. Of course, this might not be an issue with ROMIO itself; trying with MPICH is a good way to verify that.

Best,
Jeff

Sent from my iPhone

On Nov 7, 2013, at 10:55 AM, Steven G Johnson wrote:

> The simple C program attached below hangs on MPI_File_write when I am using
> an NFS-mounted filesystem. Is MPI-IO supported in OpenMPI for NFS
> filesystems?
>
> I'm using OpenMPI 1.4.5 on Debian stable (wheezy), 64-bit Opteron CPU, Linux
> 3.2.51. I was surprised by this because the problems only started occurring
> recently when I upgraded my Debian system to wheezy; with OpenMPI in the
> previous Debian release, output to NFS-mounted filesystems worked fine.
>
> Is there any easy way to get this working? Any tips are appreciated.
>
> Regards,
> Steven G. Johnson
Re: [OMPI users] MPI_File_write hangs on NFS-mounted filesystem
Hi Steven, Dmitry,

Not sure if this web page is still valid or totally out of date, but here it goes anyway, in the hope that it may help:
http://www.mcs.anl.gov/research/projects/mpi/mpich1-old/docs/install/node38.htm

On the other hand, one expert seems to dismiss NFS for parallel IO:
http://www.open-mpi.org/community/lists/users/2008/07/6125.php

I must say that this has been a gray area for me too. It would be nice if the documentation for MPI - or for the various MPIs - told us a bit more clearly which types of underlying file system support MPI parallel IO: local disks (ext?, xfs, etc.), NFS mounts, and the various parallel file systems (PVFS/OrangeFS, Lustre, GlusterFS, etc.), and perhaps provided some setup information plus functionality and performance comparisons.

My two cents,
Gus Correa

On 11/07/2013 12:21 PM, Dmitry N. Mikushin wrote:
> Not sure if this is related, but I've seen a case of performance degradation
> on NFS and Lustre when writing NetCDF files. The reason was that the file
> was filled by a loop writing one 4-byte record at a time. Performance became
> close to that of a local hard drive when I simply buffered the records and
> wrote them to the file one row at a time.
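[Editor's note] For completeness, below is a hedged C sketch (not from this thread) of two knobs that the ROMIO layer underneath OpenMPI's MPI-IO exposes: selecting its NFS driver explicitly by prefixing the filename with "nfs:", and passing tuning hints through MPI_Info. The hint shown, and whether any of this avoids the reported hang on a given build, are assumptions to verify rather than a known fix. ROMIO's documentation has also historically recommended mounting NFS exports with attribute caching disabled (noac) for correct MPI-IO behavior.

/* Hedged sketch: open the test file through ROMIO's NFS code path and pass
 * an optional ROMIO hint.  Assumes a ROMIO-based MPI-IO build; the file
 * name and hint choice are illustrative, not a recommendation. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    /* ROMIO hint: disable data sieving for writes (optional). */
    MPI_Info_set(info, "romio_ds_write", "disable");

    MPI_File fh;
    int err = MPI_File_open(MPI_COMM_WORLD, "nfs:tstmpiio.dat",
                            MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    if (err != MPI_SUCCESS)
        fprintf(stderr, "open failed: %d\n", err);
    else
        MPI_File_close(&fh);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}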
Re: [OMPI users] proper use of MPI_Abort
Jeff,

Good to know. Thanks!

It really seems like MPI_ABORT should only be used within error traps after MPI functions have been started. Code-wise, the sample I got was not the best; usage should be checked before MPI_Init, I think :)

It seems the expectation is that MPI_ABORT is only called when the user should be notified that something went haywire.

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238

> -Original Message-
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
> Squyres (jsquyres)
> Sent: Wednesday, November 06, 2013 11:30 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] proper use of MPI_Abort
>
> I just checked the v1.7 series -- it looks like we have cleaned up this
> message a bit. With your code snippet:
>
> -
> ❯❯❯ mpicc abort.c -o abort && mpirun -np 4 abort
>
> *# Usage: mpicpy -input
>
> --
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 1.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on exactly
> when Open MPI kills them.
> --
>
> Notice the lack of the 2nd message.
>
> So I think the answer here is: it's fixed in the 1.7.x series. It is
> unlikely to be fixed in the 1.6.x series.
>
> On Nov 5, 2013, at 3:16 PM, "Andrus, Brian Contractor" wrote:
>
> > Jeff,
> >
> > We are using the latest version: 1.6.5
> >
> > Brian Andrus
> > ITACS/Research Computing
> > Naval Postgraduate School
> > Monterey, California
> > voice: 831-656-6238
> >
> >> -Original Message-
> >> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
> >> Squyres (jsquyres)
> >> Sent: Tuesday, November 05, 2013 5:11 AM
> >> To: Open MPI Users
> >> Subject: Re: [OMPI users] proper use of MPI_Abort
> >>
> >> You're correct -- you don't need to call MPI_Finalize after MPI_Abort.
> >>
> >> Can you cite what version of Open MPI you are using?
> >>
> >> On Nov 4, 2013, at 9:01 AM, "Andrus, Brian Contractor" wrote:
> >>
> >>> All,
> >>>
> >>> I have some sample code that has a syntax message and then an
> >>> MPI_Abort call if the program is run without the required parameters.
> >>>
> >>> --snip---
> >>>     if (!rank) {
> >>>         i = 1;
> >>>         while ((i < argc) && strcmp("-input", *argv)) {
> >>>             i++;
> >>>             argv++;
> >>>         }
> >>>         if (i >= argc) {
> >>>             fprintf(stderr, "\n*# Usage: mpicpy -input \n\n");
> >>>             MPI_Abort(MPI_COMM_WORLD, 1);
> >>>         }
> >>> --snip---
> >>>
> >>> This is all well and good and it does provide the usage line, but it
> >>> also throws quite a message in addition:
> >>>
> >>> --
> >>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> >>> with errorcode 1.
> >>>
> >>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> >>> You may or may not see output from other processes, depending on
> >>> exactly when Open MPI kills them.
> >>> --
> >>>
> >>> --
> >>> mpirun has exited due to process rank 0 with PID 40209 on node
> >>> compute-3-3 exiting improperly. There are two reasons this could occur:
> >>>
> >>> 1. this process did not call "init" before exiting, but others in
> >>> the job did. This can cause a job to hang indefinitely while it
> >>> waits for all processes to call "init". By rule, if one process
> >>> calls "init", then ALL processes must call "init" prior to termination.
> >>>
> >>> 2. this process called "init", but exited without calling "finalize".
> >>> By rule, all processes that call "init" MUST call "finalize" prior
> >>> to exiting or it will be considered an "abnormal termination"
> >>>
> >>> This may have caused other processes in the application to be
> >>> terminated by signals sent by mpirun (as reported here).
> >>> --
> >>>
> >>> Is there a proper way to use MPI_Abort such that it will not trigger
> >>> such a message?
> >>>
> >>> It almost seems that MPI_Abort should be calling MPI_Finalize as a
> >>> rule, or openmpi should recognize MPI_Abort is the exception to
> >>> requiring MPI_Finalize.
> >>>
> >>> Brian Andrus
> >>> ITACS/Research Computing
> >>> Naval Postgraduate School
> >>> Monterey, California
> >>> voice: 831-656-6238
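[Editor's note] As a hedged illustration of the pattern discussed in this thread (not the original mpicpy code; names and messages are invented), the usage check can be done before MPI_Init with a plain exit, reserving MPI_Abort for real failures that occur after MPI has been initialized:

/* Sketch only: validate arguments before MPI_Init and exit() normally, so
 * mpirun never sees an aborted MPI job; use MPI_Abort only for failures
 * that happen after initialization.  Note that before MPI_Init there is no
 * rank, so every process performs the usage check and may print it. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    /* Usage check before MPI_Init: a plain exit, no MPI_Abort needed. */
    if (argc < 3 || strcmp(argv[1], "-input") != 0) {
        fprintf(stderr, "Usage: %s -input FILE\n", argv[0]);
        return EXIT_FAILURE;
    }

    MPI_Init(&argc, &argv);

    FILE *fp = fopen(argv[2], "r");
    if (fp == NULL) {
        /* A genuine runtime failure after init: MPI_Abort is appropriate. */
        fprintf(stderr, "cannot open %s\n", argv[2]);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    fclose(fp);

    MPI_Finalize();
    return 0;
}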
[OMPI users] LAMA of openmpi-1.7.3 is unstable
Dear openmpi developers,

I tried the new LAMA feature of openmpi-1.7.3 and unfortunately it is not stable under my environment, which is built with torque.

(1) I used 4 scripts as shown below to clarify the problem:

(COMMON PART)
#!/bin/sh
#PBS -l nodes=node03:ppn=8 / nodes=node08:ppn=8
export OMP_NUM_THREADS=1
cd $PBS_O_WORKDIR
cp $PBS_NODEFILE pbs_hosts
NPROCS=`wc -l < pbs_hosts`

(SCRIPT1)
mpirun -report-bindings -mca rmaps lama -mca rmaps_lama_bind 1c Myprog

(SCRIPT2)
mpirun -oversubscribe -report-bindings -mca rmaps lama \
       -mca rmaps_lama_bind 1c Myprog

(SCRIPT3)
mpirun -machinefile pbs_hosts -np ${NPROCS} -oversubscribe \
       -report-bindings -mca rmaps lama -mca rmaps_lama_bind 1c Myprog

(SCRIPT4)
mpirun -machinefile pbs_hosts -np ${NPROCS} -oversubscribe \
       -report-bindings -mca rmaps lama -mca rmaps_lama_bind 1c \
       -mca rmaps_lama_map Nsbnch \
       -mca ess ^tm -mca plm ^tm -mca ras ^tm Myprog

(2) The results are as follows:

          NODE03 (32 cores)   NODE08 (8 cores)
SCRIPT1   *ERROR1             *ERROR1
SCRIPT2   OK                  OK
SCRIPT3   **ABORT             OK
SCRIPT4   **ABORT             **ABORT

(*)ERROR1 means:
--
RMaps LAMA detected oversubscription after mapping 1 of 8 processes.
Since you have asked not to oversubscribe the resources the job will not
be launched. If you would instead like to oversubscribe the resources
try using the --oversubscribe option to mpirun.
--
[node08.cluster:28849] [[50428,0],0] ORTE_ERROR_LOG: Error in file
rmaps_lama_module.c at line 310
[node08.cluster:28849] [[50428,0],0] ORTE_ERROR_LOG: Error in file
base/rmaps_base_map_job.c at line 166

(**)ABORT means "stuck and no answer" until forced termination.

(3) openmpi-1.7.3 configuration (with PGI compiler):

./configure \
  --with-tm \
  --with-verbs \
  --disable-ipv6 \
  CC=pgcc CFLAGS="-fast -tp k8-64e" \
  CXX=pgCC CXXFLAGS="-fast -tp k8-64e" \
  F77=pgfortran FFLAGS="-fast -tp k8-64e" \
  FC=pgfortran FCFLAGS="-fast -tp k8-64e"

(4) Cluster information:

32-core AMD based node (node03):
Machine (126GB)
  Socket L#0 (32GB)
    NUMANode L#0 (P#0 16GB) + L3 L#0 (5118KB)
      L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
      L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2)
      L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3)
    NUMANode L#1 (P#1 16GB) + L3 L#1 (5118KB)
      L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 + PU L#4 (P#4)
      L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 + PU L#5 (P#5)
      L2 L#6 (512KB) + L1d L#6 (64KB) + L1i L#6 (64KB) + Core L#6 + PU L#6 (P#6)
      L2 L#7 (512KB) + L1d L#7 (64KB) + L1i L#7 (64KB) + Core L#7 + PU L#7 (P#7)
  Socket L#1 (32GB)
    NUMANode L#2 (P#6 16GB) + L3 L#2 (5118KB)
      L2 L#8 (512KB) + L1d L#8 (64KB) + L1i L#8 (64KB) + Core L#8 + PU L#8 (P#8)
      L2 L#9 (512KB) + L1d L#9 (64KB) + L1i L#9 (64KB) + Core L#9 + PU L#9 (P#9)
      L2 L#10 (512KB) + L1d L#10 (64KB) + L1i L#10 (64KB) + Core L#10 + PU L#10 (P#10)
      L2 L#11 (512KB) + L1d L#11 (64KB) + L1i L#11 (64KB) + Core L#11 + PU L#11 (P#11)
    NUMANode L#3 (P#7 16GB) + L3 L#3 (5118KB)
      L2 L#12 (512KB) + L1d L#12 (64KB) + L1i L#12 (64KB) + Core L#12 + PU L#12 (P#12)
      L2 L#13 (512KB) + L1d L#13 (64KB) + L1i L#13 (64KB) + Core L#13 + PU L#13 (P#13)
      L2 L#14 (512KB) + L1d L#14 (64KB) + L1i L#14 (64KB) + Core L#14 + PU L#14 (P#14)
      L2 L#15 (512KB) + L1d L#15 (64KB) + L1i L#15 (64KB) + Core L#15 + PU L#15 (P#15)
  Socket L#2 (32GB)
    NUMANode L#4 (P#4 16GB) + L3 L#4 (5118KB)
      L2 L#16 (512KB) + L1d L#16 (64KB) + L1i L#16 (64KB) + Core L#16 + PU L#16 (P#16)
      L2 L#17 (512KB) + L1d L#17 (64KB) + L1i L#17 (64KB) + Core L#17 + PU L#17 (P#17)
      L2 L#18 (512KB) + L1d L#18 (64KB) + L1i L#18 (64KB) + Core L#18 + PU L#18 (P#18)
      L2 L#19 (512KB) + L1d L#19 (64KB) + L1i L#19 (64KB) + Core L#19 + PU L#19 (P#19)
    NUMANode L#5 (P#5 16GB) + L3 L#5 (5118KB)
      L2 L#20 (512KB) + L1d L#20 (64KB) + L1i L#20 (64KB) + Core L#20 + PU L#20 (P#20)
      L2 L#21 (512KB) + L1d L#21 (64KB) + L1i L#21 (64KB) + Core L#21 + PU L#21 (P#21)
      L2 L#22 (512KB) + L1d L#22 (64KB) + L1i L#22 (64KB) + Core L#22 + PU L#22 (P#22)
      L2 L#23 (512KB) + L1d L#23 (64KB) + L1i L#23 (64KB) + Core L#23 + PU L#23 (P#23)
  Socket L#3 (32GB)
    NUMANode L#6 (P#2 16GB) + L3 L#6 (5118KB)
      L2 L#24 (512KB) + L1d L#24 (64KB) + L1i L#24 (64KB) + Core L#24 + PU L#24 (P#24)
      L2 L#25 (512KB) + L1d L#25 (64KB) + L1i L#25 (64KB) + Core L#25 + PU L#25 (P#25)
      L2 L#26 (512KB) + L1d L#26 (64KB) + L1i L#26 (64KB) + Core L#26 + PU L#26 (P#26)
      L2 L#27 (512KB) + L1d L#27 (64KB) + L1i L#27 (64KB) + Core L#27 + PU L#27 (P#27)
    NUMANode L#
Re: [OMPI users] LAMA of openmpi-1.7.3 is unstable
What happens if you drop the LAMA request and instead run

mpirun -report-bindings -bind-to core Myprog

This would do the same thing - does it work? If so, then we know it is a problem in the LAMA mapper. If not, then it is likely a problem in a different section of the code.

On Nov 7, 2013, at 3:43 PM, tmish...@jcity.maeda.co.jp wrote:

> Dear openmpi developers,
>
> I tried the new LAMA feature of openmpi-1.7.3 and unfortunately it is not
> stable under my environment, which is built with torque.
Re: [OMPI users] LAMA of openmpi-1.7.3 is unstable
Hi Ralph,

I quickly tried 2 runs:

mpirun -report-bindings -bind-to core Myprog
mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog

It works fine in both cases on node03 and node08.

Regards,
Tetsuya Mishima

> What happens if you drop the LAMA request and instead run
>
> mpirun -report-bindings -bind-to core Myprog
>
> This would do the same thing - does it work? If so, then we know it is a
> problem in the LAMA mapper. If not, then it is likely a problem in a
> different section of the code.
Re: [OMPI users] LAMA of openmpi-1.7.3 is unstable
Okay, so the problem is a bug in LAMA itself. I'll file a ticket and let the LAMA folks look into it.

On Nov 7, 2013, at 4:18 PM, tmish...@jcity.maeda.co.jp wrote:

> Hi Ralph,
>
> I quickly tried 2 runs:
>
> mpirun -report-bindings -bind-to core Myprog
> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
>
> It works fine in both cases on node03 and node08.
>
> Regards,
> Tetsuya Mishima
Re: [OMPI users] LAMA of openmpi-1.7.3 is unstable
Thanks, Ralph.

Here is some additional information. If I just execute directly on the node without Torque:

mpirun -np 8 -report-bindings -mca rmaps lama -mca rmaps_lama_bind 1c Myprog

then it also works, which means the combination of LAMA and Torque would cause the problem.

Tetsuya Mishima

> Okay, so the problem is a bug in LAMA itself. I'll file a ticket and let
> the LAMA folks look into it.
Re: [OMPI users] LAMA of openmpi-1.7.3 is unstable
I suspect something else is going on there - I can't imagine how the LAMA mapper could be interacting with the Torque launcher. The check for adequate resources (per the error message) is done long before we get to the launcher. I'll have to let the LAMA supporters chase it down.

Thanks
Ralph

On Nov 7, 2013, at 4:37 PM, tmish...@jcity.maeda.co.jp wrote:

> Thanks, Ralph.
>
> Here is some additional information. If I just execute directly on the node
> without Torque:
>
> mpirun -np 8 -report-bindings -mca rmaps lama -mca rmaps_lama_bind 1c Myprog
>
> then it also works, which means the combination of LAMA and Torque would
> cause the problem.
>
> Tetsuya Mishima