Re: [OMPI users] Could not execute the executable "/home/MET/hrm/bin/hostlist": Exec format error

2012-02-29 Thread Syed Ahsan Ali
After creating a new hostlist and rebuilding the scripts it is working now
and picking up the hostlist, as you can see:

${MPIRUN} -np ${NPROC} -hostfile ${ABSDIR}/hostlist -mca btl
sm,openib,self --mca btl_openib_use_srq 1 ./hrm >> ${OUTFILE}_hrm 2>&1
(The above command is used to submit the job)

[pmdtest@pmd02 d00_dayfiles]$ tail -f *_hrm
mpirun -np 32 /home/MET/hrm/bin/hrm

but it just stays on this command and the model simulation doesn't start
any further. I can't understand this behavior, because the simulation works
fine when the hostlist is not given, as follows:

${MPIRUN} -np ${NPROC} ./hrm >> ${OUTFILE}_hrm 2>&1

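For what it's worth, a minimal guard in the submission script (a sketch only; the
variable names are taken from the command above) would catch an unset NPROC
before mpirun ever misreads -hostfile as the process count:

# Sketch: abort early if NPROC is unset or not a positive integer,
# so that "-np" can never swallow "-hostfile" as its argument.
case "${NPROC}" in
  ''|*[!0-9]*) echo "NPROC is not a number: '${NPROC}'" >&2; exit 1 ;;
esac
${MPIRUN} -np ${NPROC} -hostfile ${ABSDIR}/hostlist ./hrm >> ${OUTFILE}_hrm 2>&1
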
On Tue, Feb 28, 2012 at 3:49 PM, Jeffrey Squyres  wrote:

> Yes, this is known behavior for our CLI parser.  We could probably improve
> that a bit...
>
> On Feb 28, 2012, at 4:55 AM, Ralph Castain wrote:
>
> >
> > On Feb 28, 2012, at 2:52 AM, Reuti wrote:
> >
> >> Am 28.02.2012 um 10:21 schrieb Ralph Castain:
> >>
> >>> Afraid I have to agree with the prior reply - sounds like NPROC isn't
> getting defined, which causes your cmd line to look like your original
> posting.
> >>
> >> Maybe the best to investigate this is to `echo` $MPIRUN and $NPROC.
> >>
> >> But: is this the intended behavior of mpirun? It looks like -np is
> eating -hostlist as a numeric argument? Shouldn't it complain about:
> argument for -np missing or argument not being numeric?
> >
> > Probably - I'm sure that the atol is returning zero, which should cause
> an error output. I'll check.
> >
> >
> >>
> >> -- Reuti
> >>
> >>
> >>>
> >>> On Feb 27, 2012, at 10:29 PM, Syed Ahsan Ali wrote:
> >>>
>  The following command in used in script for job submission
> 
>  ${MPIRUN} -np ${NPROC} -hostfile ${ABSDIR}/hostlist -mca btl
> sm,openib,self --mca btl_openib_use_srq 1 ./hrm >> ${OUTFILE}_hrm 2>&1
>  where NPROC in defined in someother file. The same application is
> running on the other system with same configuration.
> 
>  On Tue, Feb 28, 2012 at 10:12 AM, PukkiMonkey 
> wrote:
>  No of processes missing after -np
>  Should be something like:
>  mpirun -np 256 ./exec
> 
> 
> 
>  Sent from my iPhone
> 
>  On Feb 27, 2012, at 8:47 PM, Syed Ahsan Ali 
> wrote:
> 
> > Dear All,
> >
> > I am running an application with mpirun but it gives following
> error, it is not picking up hostlist, there are other applications which
> run well with hostlist but it just gives following error with
> >
> >
> > [pmdtest@pmd02 d00_dayfiles]$ tail -f *_hrm
> > mpirun -np  /home/MET/hrm/bin/hrm
> >
> --
> > Could not execute the executable "/home/MET/hrm/bin/hostlist": Exec
> format error
> >
> > This could mean that your PATH or executable name is wrong, or that
> you do not
> > have the necessary permissions.  Please ensure that the executable
> is able to be
> > found and executed.
> >
> >
> --
> >
> > Following the permission of the hostlist directory. Please help me
> to remove this error.
> >
> > [pmdtest@pmd02 bin]$ ll
> > total 7570
> > -rwxrwxrwx 1 pmdtest pmdtest 2517815 Feb 16  2012 gme2hrm
> > -rwxrwxrwx 1 pmdtest pmdtest   0 Feb 16  2012 gme2hrm.map
> > -rwxrwxrwx 1 pmdtest pmdtest 473 Jan 30  2012 hostlist
> > -rwxrwxrwx 1 pmdtest pmdtest 5197698 Feb 16  2012 hrm
> > -rwxrwxrwx 1 pmdtest pmdtest   0 Dec 31  2010 hrm.map
> > -rwxrwxrwx 1 pmdtest pmdtest1680 Dec 31  2010 mpd.hosts
> >
> >
> > Thank you and Regards
> > Ahsan
> >
> >
> >
> >
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
>  ___
>  users mailing list
>  us...@open-mpi.org
>  http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> 
>  --
>  Syed Ahsan Ali Bokhari
>  Electronic Engineer (EE)
> 
>  Research & Development Division
>  Pakistan Meteorological Department H-8/4, Islamabad.
>  Phone # off  +92518358714
>  Cell # +923155145014
> 
>  ___
>  users mailing list
>  us...@open-mpi.org
>  http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>
> >>> ___
> >>> users mailing list
> >>> us...@open-mpi.org
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>
> >>
> >> ___
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/l

Re: [OMPI users] Could not execute the executable "/home/MET/hrm/bin/hostlist": Exec format error

2012-02-29 Thread Jingcha Joba
Just to be sure, can you try
echo "${MPIRUN} -np ${NPROC} -hostfile ${ABSDIR}/hostlist -mca btl
sm,openib,self --mca btl_openib_use_srq 1 ./hrm >> ${OUTFILE}_hrm 2>&1"
and check whether you are indeed getting the correct arguments?

If that looks fine, can you add --mca btl_openib_verbose 1 to the mpirun
argument list and see what it says?
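
Another quick way to spot an empty variable (a sketch along the lines of Reuti's
earlier suggestion, not from the original mail) is to print each one separately:

# Sketch: an empty line in this output points to the variable that is unset.
echo "MPIRUN=${MPIRUN}"
echo "NPROC=${NPROC}"
echo "ABSDIR=${ABSDIR}"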



On Tue, Feb 28, 2012 at 10:15 PM, Syed Ahsan Ali wrote:

> After creating new hostlist and making the scripts again it is working now
> and picking up the hostlist as u can see :
>
> *
> ${MPIRUN} -np ${NPROC} -hostfile ${ABSDIR}/hostlist -mca btl
> sm,openib,self --mca btl_openib_use_srq 1 ./hrm >> ${OUTFILE}_hrm 2>&1
> (The above command is used to submit job)*
>
> *
> [pmdtest@pmd02 d00_dayfiles]$ tail -f *_hrm
> mpirun -np 32 /home/MET/hrm/bin/hrm
> *
> but it just stays on this command and the model simulation don't start
> further. I can't understand this behavior because the simulation works
> fine when hostlist is not given as follows:
>
> *${MPIRUN} -np ${NPROC} ./hrm >> ${OUTFILE}_hrm 2>&1*
>
>
> On Tue, Feb 28, 2012 at 3:49 PM, Jeffrey Squyres wrote:
>
>> Yes, this is known behavior for our CLI parser.  We could probably
>> improve that a bit...
>>
>> On Feb 28, 2012, at 4:55 AM, Ralph Castain wrote:
>>
>> >
>> > On Feb 28, 2012, at 2:52 AM, Reuti wrote:
>> >
>> >> Am 28.02.2012 um 10:21 schrieb Ralph Castain:
>> >>
>> >>> Afraid I have to agree with the prior reply - sounds like NPROC isn't
>> getting defined, which causes your cmd line to look like your original
>> posting.
>> >>
>> >> Maybe the best to investigate this is to `echo` $MPIRUN and $NPROC.
>> >>
>> >> But: is this the intended behavior of mpirun? It looks like -np is
>> eating -hostlist as a numeric argument? Shouldn't it complain about:
>> argument for -np missing or argument not being numeric?
>> >
>> > Probably - I'm sure that the atol is returning zero, which should cause
>> an error output. I'll check.
>> >
>> >
>> >>
>> >> -- Reuti
>> >>
>> >>
>> >>>
>> >>> On Feb 27, 2012, at 10:29 PM, Syed Ahsan Ali wrote:
>> >>>
>>  The following command in used in script for job submission
>> 
>>  ${MPIRUN} -np ${NPROC} -hostfile ${ABSDIR}/hostlist -mca btl
>> sm,openib,self --mca btl_openib_use_srq 1 ./hrm >> ${OUTFILE}_hrm 2>&1
>>  where NPROC in defined in someother file. The same application is
>> running on the other system with same configuration.
>> 
>>  On Tue, Feb 28, 2012 at 10:12 AM, PukkiMonkey 
>> wrote:
>>  No of processes missing after -np
>>  Should be something like:
>>  mpirun -np 256 ./exec
>> 
>> 
>> 
>>  Sent from my iPhone
>> 
>>  On Feb 27, 2012, at 8:47 PM, Syed Ahsan Ali 
>> wrote:
>> 
>> > Dear All,
>> >
>> > I am running an application with mpirun but it gives following
>> error, it is not picking up hostlist, there are other applications which
>> run well with hostlist but it just gives following error with
>> >
>> >
>> > [pmdtest@pmd02 d00_dayfiles]$ tail -f *_hrm
>> > mpirun -np  /home/MET/hrm/bin/hrm
>> >
>> --
>> > Could not execute the executable "/home/MET/hrm/bin/hostlist": Exec
>> format error
>> >
>> > This could mean that your PATH or executable name is wrong, or that
>> you do not
>> > have the necessary permissions.  Please ensure that the executable
>> is able to be
>> > found and executed.
>> >
>> >
>> --
>> >
>> > Following the permission of the hostlist directory. Please help me
>> to remove this error.
>> >
>> > [pmdtest@pmd02 bin]$ ll
>> > total 7570
>> > -rwxrwxrwx 1 pmdtest pmdtest 2517815 Feb 16  2012 gme2hrm
>> > -rwxrwxrwx 1 pmdtest pmdtest   0 Feb 16  2012 gme2hrm.map
>> > -rwxrwxrwx 1 pmdtest pmdtest 473 Jan 30  2012 hostlist
>> > -rwxrwxrwx 1 pmdtest pmdtest 5197698 Feb 16  2012 hrm
>> > -rwxrwxrwx 1 pmdtest pmdtest   0 Dec 31  2010 hrm.map
>> > -rwxrwxrwx 1 pmdtest pmdtest1680 Dec 31  2010 mpd.hosts
>> >
>> >
>> > Thank you and Regards
>> > Ahsan
>> >
>> >
>> >
>> >
>> >
>> > ___
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>>  ___
>>  users mailing list
>>  us...@open-mpi.org
>>  http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> 
>>  --
>>  Syed Ahsan Ali Bokhari
>>  Electronic Engineer (EE)
>> 
>>  Research & Development Division
>>  Pakistan Meteorological Department H-8/4, Islamabad.
>>  Phone # off  +92518358714
>>  Cell # +923155145014
>> 
>>  ___
>>  users mailing list
>>  

Re: [OMPI users] Could not execute the executable "/home/MET/hrm/bin/hostlist": Exec format error

2012-02-29 Thread Syed Ahsan Ali
I tried to echo but it returns nothing.

[pmdtest@pmd02 d00_dayfiles]$ echo ${MPIRUN} -np ${NPROC} -hostfile
$i{ABSDIR}/hostlist -mca btl sm,openib,self --mca btl_openib_use_srq 1
./hrm >> ${OUTFILE}_hrm 2>&1
[pmdtest@pmd02 d00_dayfiles]$


On Wed, Feb 29, 2012 at 12:01 PM, Jingcha Joba wrote:

> Just to be sure, can u try
> echo "${MPIRUN} -np ${NPROC} -hostfile ${ABSDIR}/hostlist -mca btl
> sm,openib,self --mca btl_openib_use_srq 1 ./hrm >> ${OUTFILE}_hrm 2>&1"
> and check if you are indeed getting the correct argument.
>
> If that looks fine, can u add --mca btl_openib_verbose 1 to the mpirun
> argument list, and see what it says?
>
>
>
> On Tue, Feb 28, 2012 at 10:15 PM, Syed Ahsan Ali wrote:
>
>> After creating new hostlist and making the scripts again it is working
>> now and picking up the hostlist as u can see :
>>
>> *
>> ${MPIRUN} -np ${NPROC} -hostfile ${ABSDIR}/hostlist -mca btl
>> sm,openib,self --mca btl_openib_use_srq 1 ./hrm >> ${OUTFILE}_hrm 2>&1
>> (The above command is used to submit job)*
>>
>> *
>> [pmdtest@pmd02 d00_dayfiles]$ tail -f *_hrm
>> mpirun -np 32 /home/MET/hrm/bin/hrm
>> *
>> but it just stays on this command and the model simulation don't start
>> further. I can't understand this behavior because the simulation works
>> fine when hostlist is not given as follows:
>>
>> *${MPIRUN} -np ${NPROC} ./hrm >> ${OUTFILE}_hrm 2>&1*
>>
>> **
>> **
>> * *
>>
>> On Tue, Feb 28, 2012 at 3:49 PM, Jeffrey Squyres wrote:
>>
>>> Yes, this is known behavior for our CLI parser.  We could probably
>>> improve that a bit...
>>>
>>> On Feb 28, 2012, at 4:55 AM, Ralph Castain wrote:
>>>
>>> >
>>> > On Feb 28, 2012, at 2:52 AM, Reuti wrote:
>>> >
>>> >> Am 28.02.2012 um 10:21 schrieb Ralph Castain:
>>> >>
>>> >>> Afraid I have to agree with the prior reply - sounds like NPROC
>>> isn't getting defined, which causes your cmd line to look like your
>>> original posting.
>>> >>
>>> >> Maybe the best to investigate this is to `echo` $MPIRUN and $NPROC.
>>> >>
>>> >> But: is this the intended behavior of mpirun? It looks like -np is
>>> eating -hostlist as a numeric argument? Shouldn't it complain about:
>>> argument for -np missing or argument not being numeric?
>>> >
>>> > Probably - I'm sure that the atol is returning zero, which should
>>> cause an error output. I'll check.
>>> >
>>> >
>>> >>
>>> >> -- Reuti
>>> >>
>>> >>
>>> >>>
>>> >>> On Feb 27, 2012, at 10:29 PM, Syed Ahsan Ali wrote:
>>> >>>
>>>  The following command in used in script for job submission
>>> 
>>>  ${MPIRUN} -np ${NPROC} -hostfile ${ABSDIR}/hostlist -mca btl
>>> sm,openib,self --mca btl_openib_use_srq 1 ./hrm >> ${OUTFILE}_hrm 2>&1
>>>  where NPROC in defined in someother file. The same application is
>>> running on the other system with same configuration.
>>> 
>>>  On Tue, Feb 28, 2012 at 10:12 AM, PukkiMonkey <
>>> pukkimon...@gmail.com> wrote:
>>>  No of processes missing after -np
>>>  Should be something like:
>>>  mpirun -np 256 ./exec
>>> 
>>> 
>>> 
>>>  Sent from my iPhone
>>> 
>>>  On Feb 27, 2012, at 8:47 PM, Syed Ahsan Ali 
>>> wrote:
>>> 
>>> > Dear All,
>>> >
>>> > I am running an application with mpirun but it gives following
>>> error, it is not picking up hostlist, there are other applications which
>>> run well with hostlist but it just gives following error with
>>> >
>>> >
>>> > [pmdtest@pmd02 d00_dayfiles]$ tail -f *_hrm
>>> > mpirun -np  /home/MET/hrm/bin/hrm
>>> >
>>> --
>>> > Could not execute the executable "/home/MET/hrm/bin/hostlist":
>>> Exec format error
>>> >
>>> > This could mean that your PATH or executable name is wrong, or
>>> that you do not
>>> > have the necessary permissions.  Please ensure that the executable
>>> is able to be
>>> > found and executed.
>>> >
>>> >
>>> --
>>> >
>>> > Following the permission of the hostlist directory. Please help me
>>> to remove this error.
>>> >
>>> > [pmdtest@pmd02 bin]$ ll
>>> > total 7570
>>> > -rwxrwxrwx 1 pmdtest pmdtest 2517815 Feb 16  2012 gme2hrm
>>> > -rwxrwxrwx 1 pmdtest pmdtest   0 Feb 16  2012 gme2hrm.map
>>> > -rwxrwxrwx 1 pmdtest pmdtest 473 Jan 30  2012 hostlist
>>> > -rwxrwxrwx 1 pmdtest pmdtest 5197698 Feb 16  2012 hrm
>>> > -rwxrwxrwx 1 pmdtest pmdtest   0 Dec 31  2010 hrm.map
>>> > -rwxrwxrwx 1 pmdtest pmdtest1680 Dec 31  2010 mpd.hosts
>>> >
>>> >
>>> > Thank you and Regards
>>> > Ahsan
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > ___
>>> > users mailing list
>>> > us...@open-mpi.org
>>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>>  ___
>>>  users maili

Re: [OMPI users] Could not execute the executable "/home/MET/hrm/bin/hostlist": Exec format error

2012-02-29 Thread Jingcha Joba
Well, it should be
echo "mpirun ..." (with the whole command in quotes).
I just noticed that you have $i{ABSDIR}. I think it should be ${ABSDIR}.
On Tue, Feb 28, 2012 at 11:17 PM, Syed Ahsan Ali wrote:

> I tried to echo but it returns nothing.
>
> [pmdtest@pmd02 d00_dayfiles]$ echo ${MPIRUN} -np ${NPROC} -hostfile
> $i{ABSDIR}/hostlist -mca btl sm,openib,self --mca btl_openib_use_srq 1
> ./hrm >> ${OUTFILE}_hrm 2>&1
> [pmdtest@pmd02 d00_dayfiles]$
>
>
> On Wed, Feb 29, 2012 at 12:01 PM, Jingcha Joba wrote:
>
>> Just to be sure, can u try
>> echo "${MPIRUN} -np ${NPROC} -hostfile ${ABSDIR}/hostlist -mca btl
>> sm,openib,self --mca btl_openib_use_srq 1 ./hrm >> ${OUTFILE}_hrm 2>&1"
>> and check if you are indeed getting the correct argument.
>>
>> If that looks fine, can u add --mca btl_openib_verbose 1 to the mpirun
>> argument list, and see what it says?
>>
>>
>>
>> On Tue, Feb 28, 2012 at 10:15 PM, Syed Ahsan Ali 
>> wrote:
>>
>>> After creating new hostlist and making the scripts again it is working
>>> now and picking up the hostlist as u can see :
>>>
>>> *
>>> ${MPIRUN} -np ${NPROC} -hostfile ${ABSDIR}/hostlist -mca btl
>>> sm,openib,self --mca btl_openib_use_srq 1 ./hrm >> ${OUTFILE}_hrm 2>&1
>>> (The above command is used to submit job)*
>>>
>>> *
>>> [pmdtest@pmd02 d00_dayfiles]$ tail -f *_hrm
>>> mpirun -np 32 /home/MET/hrm/bin/hrm
>>> *
>>> but it just stays on this command and the model simulation don't start
>>> further. I can't understand this behavior because the simulation works
>>> fine when hostlist is not given as follows:
>>>
>>> *${MPIRUN} -np ${NPROC} ./hrm >> ${OUTFILE}_hrm 2>&1*
>>>
>>>
>>> On Tue, Feb 28, 2012 at 3:49 PM, Jeffrey Squyres wrote:
>>>
 Yes, this is known behavior for our CLI parser.  We could probably
 improve that a bit...

 On Feb 28, 2012, at 4:55 AM, Ralph Castain wrote:

 >
 > On Feb 28, 2012, at 2:52 AM, Reuti wrote:
 >
 >> Am 28.02.2012 um 10:21 schrieb Ralph Castain:
 >>
 >>> Afraid I have to agree with the prior reply - sounds like NPROC
 isn't getting defined, which causes your cmd line to look like your
 original posting.
 >>
 >> Maybe the best to investigate this is to `echo` $MPIRUN and $NPROC.
 >>
 >> But: is this the intended behavior of mpirun? It looks like -np is
 eating -hostlist as a numeric argument? Shouldn't it complain about:
 argument for -np missing or argument not being numeric?
 >
 > Probably - I'm sure that the atol is returning zero, which should
 cause an error output. I'll check.
 >
 >
 >>
 >> -- Reuti
 >>
 >>
 >>>
 >>> On Feb 27, 2012, at 10:29 PM, Syed Ahsan Ali wrote:
 >>>
  The following command in used in script for job submission
 
  ${MPIRUN} -np ${NPROC} -hostfile ${ABSDIR}/hostlist -mca btl
 sm,openib,self --mca btl_openib_use_srq 1 ./hrm >> ${OUTFILE}_hrm 2>&1
  where NPROC in defined in someother file. The same application is
 running on the other system with same configuration.
 
  On Tue, Feb 28, 2012 at 10:12 AM, PukkiMonkey <
 pukkimon...@gmail.com> wrote:
  No of processes missing after -np
  Should be something like:
  mpirun -np 256 ./exec
 
 
 
  Sent from my iPhone
 
  On Feb 27, 2012, at 8:47 PM, Syed Ahsan Ali 
 wrote:
 
 > Dear All,
 >
 > I am running an application with mpirun but it gives following
 error, it is not picking up hostlist, there are other applications which
 run well with hostlist but it just gives following error with
 >
 >
 > [pmdtest@pmd02 d00_dayfiles]$ tail -f *_hrm
 > mpirun -np  /home/MET/hrm/bin/hrm
 >
 --
 > Could not execute the executable "/home/MET/hrm/bin/hostlist":
 Exec format error
 >
 > This could mean that your PATH or executable name is wrong, or
 that you do not
 > have the necessary permissions.  Please ensure that the
 executable is able to be
 > found and executed.
 >
 >
 --
 >
 > Following the permission of the hostlist directory. Please help
 me to remove this error.
 >
 > [pmdtest@pmd02 bin]$ ll
 > total 7570
 > -rwxrwxrwx 1 pmdtest pmdtest 2517815 Feb 16  2012 gme2hrm
 > -rwxrwxrwx 1 pmdtest pmdtest   0 Feb 16  2012 gme2hrm.map
 > -rwxrwxrwx 1 pmdtest pmdtest 473 Jan 30  2012 hostlist
 > -rwxrwxrwx 1 pmdtest pmdtest 5197698 Feb 16  2012 hrm
 > -rwxrwxrwx 1 pmdtest pmdtest   0 Dec 31  2010 hrm.map
 > -rwxrwxrwx 1 pmdtest pmdtest1680 Dec 31  2010 mpd.hosts
 >
 >
 > Thank you and Regards
 >

Re: [OMPI users] mpirun fails with no allocated resources

2012-02-29 Thread Muhammad Wahaj Sethi


Snapshot of my hosts file is present below. localhost is present here.

127.0.0.1   localhost
127.0.1.1   wahaj-ThinkPad-T510
10.42.43.1  node0
10.42.43.2  node1

Everything works fine if I don't specify host names.

This problem is specific to Open MPI version 1.7.

Open MPI version 1.5.5 doesn't produce this error message.
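
A workaround sketch until the fix (assuming the machine's real name resolves
locally): either drop -H entirely, or let the shell substitute the actual node
name instead of the literal string localhost:

# Sketch: the local host is included automatically when -H is omitted.
mpirun -np 2 /bin/hostname
# Sketch: or pass the real hostname explicitly.
mpirun -np 2 -H $(hostname),$(hostname) /bin/hostname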

- Original Message -
From: "Ralph Castain" 
To: "Open MPI Users" 
Sent: Tuesday, February 28, 2012 5:55:43 PM
Subject: Re: [OMPI users] mpirun fails with no allocated resources

Try leaving off the -H localhost,localhost from the cmd line - the local host 
will automatically be included, so that shouldn't be required.

I believe the problem is that "localhost" isn't the name of your machine, and 
so we look and don't see that machine anywhere.

On Feb 28, 2012, at 9:42 AM, Muhammad Wahaj Sethi wrote:

> Hello,
>I have installed newer version but problem still persists.
> 
> Package: Open MPI wahaj@wahaj-ThinkPad-T510 Distribution
>Open MPI: 1.7a1r26065
>  Open MPI repo revision: r26065
>   Open MPI release date: Unreleased developer copy
>Open RTE: 1.7a1r26065
>  Open RTE repo revision: r26065
>   Open RTE release date: Unreleased developer copy
>OPAL: 1.7a1r26065
>  OPAL repo revision: r26065
>   OPAL release date: Unreleased developer copy
> MPI API: 2.1
>Ident string: 1.7a1r26065
>  Prefix: /home/wahaj/openmpi-install
> Configured architecture: x86_64-unknown-linux-gnu
> 
> Sequence of steps I followed is mention below.
> 
> svn update
> make distclean
> ./autogen.pl
> ./configure --prefix=$HOME/openmpi-install
> make all install
> 
> 
> wahaj@wahaj-ThinkPad-T510:~$ mpirun -np 2 -H localhost,localhost /bin/hostname
> --
> There are no allocated resources for the application 
>  /bin/hostname
> that match the requested mapping:
> 
> 
> Verify that you have mapped the allocated resources properly using the 
> --host or --hostfile specification.
> --
> 
> regards,
> Wahaj
> 
> 
> - Original Message -
> From: "Ralph Castain" 
> To: "Open MPI Users" 
> Sent: Tuesday, February 28, 2012 3:30:47 PM
> Subject: Re: [OMPI users] mpirun fails with no allocated resources
> 
> 
> On Feb 28, 2012, at 7:24 AM, Muhammad Wahaj Sethi wrote:
> 
>> Hello!
>>I am trying run following command using trunk version 1.7a1r25984.
>> 
>> mpirun -np 2 -H localhost,localhost /bin/hostname
>> 
>> It fails with following error message.
>> 
>> --
>> There are no allocated resources for the application 
>> /bin/hostname
>> that match the requested mapping:
>> 
>> 
>> Verify that you have mapped the allocated resources properly using the 
>> --host or --hostfile specification.
>> --
>> 
>> Every thing works fine if I use trunk version 1.5.5rc3r26063.
>> 
>> Any ideas, how it can be fixed?
> 
> Sure - update your trunk version. It's been fixed for awhile.
> 
> 
>> 
>> regards,
>> Wahaj
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


[OMPI users] Hybrid OpenMPI / OpenMP programming

2012-02-29 Thread Auclair Francis

Dear Open-MPI users,

Our code currently runs Open MPI (1.5.4) with SLURM on a NUMA 
machine (2 sockets per node and 4 cores per socket) with basically two
levels of implementation for Open MPI:
- at the lower level, n "Master" MPI processes (one per socket) run
simultaneously, dividing the physical domain into n sub-domains in the
classical way
- while at the higher level, 4n MPI processes are spawned to run a sparse 
Poisson solver.
At each time step, the code thus goes back and forth between these 
two levels of implementation using two MPI communicators. This also 
means that during about half of the computation time, 3n cores are at 
best sleeping (if not 'waiting' at a barrier) when not inside the solver 
routines. We consequently decided to add OpenMP functionality to 
our code for the phases when the solver is not running (we declare one single 
"parallel" region and use the omp "master" directive when the OpenMP threads 
are not active). We however face several difficulties:


a) It seems that both the 3n MPI processes and the OpenMP threads 
consume processor cycles while waiting. We consequently tried: mpirun
-mpi_yield_when_idle 1 ..., export OMP_WAIT_POLICY=passive, or export
KMP_BLOCKTIME=0 ... The last of these finally leads to an interesting
reduction in computing time, but it worsens the second problem we face (see
below).

b) We managed to get a "correct" (?) placement of our MPI processes
on our sockets by using: mpirun -bind-to-socket -bysocket -np 4n ...
However, although the OpenMP threads initially seem to scatter across each
socket (one thread per core), they slowly migrate to the same core as their
'Master MPI process' or gather on one or two cores per socket. We played 
around with the environment variable KMP_AFFINITY, but the best we could 
obtain was a pinning of the OpenMP threads to their own cores, which at the 
same time disorganized the placement of the 4n level-2 MPI processes. 
When added, neither a rankfile specification nor the mpirun 
option -x IPATH_NO_CPUAFFINITY=1 seems to change the situation significantly.
This behavior looks rather inefficient, but so far we have not managed 
to prevent the migration of the 4 threads to at most a couple of cores!


Is there something wrong in our "Hybrid" implementation?
Do you have any advice?
Thanks for your help,
Francis
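
For reference, a launch line along these lines (a sketch only, built from the
options already mentioned in this thread; OMP_NUM_THREADS=4 matches the
4-cores-per-socket layout, n is the placeholder number of Master ranks, and
./code stands for the application) keeps one MPI process per socket and hands
the OpenMP settings to the remote environments:

# Sketch: one "Master" rank per socket, OpenMP settings exported to all ranks.
export OMP_NUM_THREADS=4
export OMP_WAIT_POLICY=passive
mpirun -np n -bysocket -bind-to-socket \
       -x OMP_NUM_THREADS -x OMP_WAIT_POLICY ./code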



[OMPI users] Drastic OpenMPI performance reduction when message exeeds 128 KB

2012-02-29 Thread adrian sabou
Hi all,
 
I am experiencing a rather unpleasant issue with a simple Open MPI app. I have 4 
nodes communicating with a central node. Performance is good and the 
application behaves as it should (i.e., performance steadily decreases as I 
increase the work size). My problem is that as soon as the messages passed 
between nodes become larger than 128 KB, performance drops suddenly in an 
unexpected way. I have done some research and tried to modify various eager 
limits, without any success. I am a beginner with Open MPI and I can't seem to 
figure out this issue. I am hoping that one of you might shed some light on 
this situation. My Open MPI version is 1.5.4 on Ubuntu Server 10.04 64 bit. Any 
help is welcome. Thanks.
 
Adrian 

Re: [OMPI users] Could not execute the executable"/home/MET/hrm/bin/hostlist": Exec format error

2012-02-29 Thread Jeffrey Squyres
FWIW: Ralph committed a change to mpirun the other day that will now check if 
you're missing integer command line arguments.  This will appear in Open MPI 
v1.7. It'll look something like this:

% mpirun -np hostname
---
Open MPI has detected that a parameter given to a cmd line
option does not match the expected format:

  Option: np
  Param:  hostname

This is frequently caused by omitting to provide the parameter
to an option that requires one. Please check the cmd line and try again.
---
%




On Feb 28, 2012, at 5:49 AM, Jeff Squyres (jsquyres) wrote:

> Yes, this is known behavior for our CLI parser.  We could probably improve 
> that a bit...
> 
> On Feb 28, 2012, at 4:55 AM, Ralph Castain wrote:
> 
> >
> > On Feb 28, 2012, at 2:52 AM, Reuti wrote:
> >
> >> Am 28.02.2012 um 10:21 schrieb Ralph Castain:
> >>
> >>> Afraid I have to agree with the prior reply - sounds like NPROC isn't 
> >>> getting defined, which causes your cmd line to look like your original 
> >>> posting.
> >>
> >> Maybe the best to investigate this is to `echo` $MPIRUN and $NPROC.
> >>
> >> But: is this the intended behavior of mpirun? It looks like -np is eating 
> >> -hostlist as a numeric argument? Shouldn't it complain about: argument for 
> >> -np missing or argument not being numeric?
> >
> > Probably - I'm sure that the atol is returning zero, which should cause an 
> > error output. I'll check.
> >
> >
> >>
> >> -- Reuti
> >>
> >>
> >>>
> >>> On Feb 27, 2012, at 10:29 PM, Syed Ahsan Ali wrote:
> >>>
>  The following command in used in script for job submission
> 
>  ${MPIRUN} -np ${NPROC} -hostfile ${ABSDIR}/hostlist -mca btl 
>  sm,openib,self --mca btl_openib_use_srq 1 ./hrm >> ${OUTFILE}_hrm 2>&1
>  where NPROC in defined in someother file. The same application is 
>  running on the other system with same configuration.
> 
>  On Tue, Feb 28, 2012 at 10:12 AM, PukkiMonkey  
>  wrote:
>  No of processes missing after -np
>  Should be something like:
>  mpirun -np 256 ./exec
> 
> 
> 
>  Sent from my iPhone
> 
>  On Feb 27, 2012, at 8:47 PM, Syed Ahsan Ali  
>  wrote:
> 
> > Dear All,
> >
> > I am running an application with mpirun but it gives following error, 
> > it is not picking up hostlist, there are other applications which run 
> > well with hostlist but it just gives following error with
> >
> >
> > [pmdtest@pmd02 d00_dayfiles]$ tail -f *_hrm
> > mpirun -np  /home/MET/hrm/bin/hrm
> > --
> > Could not execute the executable "/home/MET/hrm/bin/hostlist": Exec 
> > format error
> >
> > This could mean that your PATH or executable name is wrong, or that you 
> > do not
> > have the necessary permissions.  Please ensure that the executable is 
> > able to be
> > found and executed.
> >
> > --
> >
> > Following the permission of the hostlist directory. Please help me to 
> > remove this error.
> >
> > [pmdtest@pmd02 bin]$ ll
> > total 7570
> > -rwxrwxrwx 1 pmdtest pmdtest 2517815 Feb 16  2012 gme2hrm
> > -rwxrwxrwx 1 pmdtest pmdtest   0 Feb 16  2012 gme2hrm.map
> > -rwxrwxrwx 1 pmdtest pmdtest 473 Jan 30  2012 hostlist
> > -rwxrwxrwx 1 pmdtest pmdtest 5197698 Feb 16  2012 hrm
> > -rwxrwxrwx 1 pmdtest pmdtest   0 Dec 31  2010 hrm.map
> > -rwxrwxrwx 1 pmdtest pmdtest1680 Dec 31  2010 mpd.hosts
> >
> >
> > Thank you and Regards
> > Ahsan
> >
> >
> >
> >
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
>  ___
>  users mailing list
>  us...@open-mpi.org
>  http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> 
>  --
>  Syed Ahsan Ali Bokhari
>  Electronic Engineer (EE)
> 
>  Research & Development Division
>  Pakistan Meteorological Department H-8/4, Islamabad.
>  Phone # off  +92518358714
>  Cell # +923155145014
> 
>  ___
>  users mailing list
>  us...@open-mpi.org
>  http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>
> >>> ___
> >>> users mailing list
> >>> us...@open-mpi.org
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>
> >>
> >> ___
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> > ___
> > users mail

Re: [OMPI users] Could not execute the executable "/home/MET/hrm/bin/hostlist": Exec format error

2012-02-29 Thread Jeffrey Squyres
On Feb 29, 2012, at 2:17 AM, Syed Ahsan Ali wrote:

> [pmdtest@pmd02 d00_dayfiles]$ echo ${MPIRUN} -np ${NPROC} -hostfile 
> $i{ABSDIR}/hostlist -mca btl sm,openib,self --mca btl_openib_use_srq 1 ./hrm 
> >> ${OUTFILE}_hrm 2>&1
> [pmdtest@pmd02 d00_dayfiles]$ 

Because you used >> and 2>&1, the output went to your ${OUTFILE}_hrm file, not 
to stdout.
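
In other words (a sketch, not taken from the original mail), compare:

# Unquoted: the shell applies >> and 2>&1 to the echo itself,
# so the text lands in ${OUTFILE}_hrm and nothing appears on the terminal.
echo ${MPIRUN} -np ${NPROC} ./hrm >> ${OUTFILE}_hrm 2>&1

# Quoted: >> and 2>&1 are part of the string, so the whole command prints to stdout.
echo "${MPIRUN} -np ${NPROC} ./hrm >> ${OUTFILE}_hrm 2>&1"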

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Drastic OpenMPI performance reduction when message exeeds 128 KB

2012-02-29 Thread Jeffrey Squyres
On Feb 29, 2012, at 5:39 AM, adrian sabou wrote:

> I am experiencing a rather unpleasant issue with a simple OpenMPI app. I have 
> 4 nodes communicating with a central node. Performance is good and the 
> application behaves as it should. (i.e. performance steadily decreases as I 
> increase the work size). My problem is that immediately after messages passed 
> between nodes become larger that 128 KB performance drops suddenly in an 
> unexpected way. I have done some research and tried to modify various eager 
> limits, without any success. I am a beginner in OpenMPI and I can't seem to 
> figure out this issue. I am hopping that one of you might shed some light on 
> this situation. My OpenMPI version is 1.5.4 on Ubuntu Server 10.04 64 bit. 
> Any help is welcome. Thanks.

Lots of things can be a factor here (I assume you're using TCP over Ethernet?):

- are you using a network switch or hub?
- what kind of switch/hub is it? (switch quality can have a *lot* to do with 
network performance, and I don't say that just because of my employer :-) )
- is this a point-to-point pattern, or are multiple nodes communicating 
simultaneously?  (I'm asking about network contention)
- how many procs are you running on each node?  Are they all communicating 
simultaneously from each node?
- is the performance degradation only when communicating over TCP?  Or does it 
happen when communicating over shared memory?  Or both?

I think you probably want to test what happens with a simple point-to-point 
benchmark between two peers on different nodes, and observe the performance 
there.  If you have a problem on your network or setup, you'll see it there.  
Then expand your testing to include multiple procs simultaneously (e.g., 
running the same 2-proc point-to-point benchmark multiple times simultaneously) 
and see what happens.
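
For example (a sketch, assuming the OSU micro-benchmarks or a similar
point-to-point test is installed on the nodes; node1 and node2 are placeholder
host names):

# Point-to-point bandwidth between one process on each of two nodes over TCP.
mpirun -np 2 -H node1,node2 --mca btl tcp,self ./osu_bw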

If all that looks good, then start looking hard at your application 
communication pattern.  When you hit 128 KB message size, are you exhausting 
cache sizes, or creating some other kind of algorithmic congestion?  Look for 
things like this.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] IMB-OpenMPI on Centos 6

2012-02-29 Thread Jeffrey Squyres
I haven't followed OFED development for a long time, so I don't know if there 
is a buggy OFED in RHEL 5.4.

If you're doing development with the internals of Open MPI (or if it'll be 
necessary to dive into the internals for debugging a custom device/driver), you 
might want to move this discussion to the devel list, not the users list.

Open MPI does have a few open tickets about what happens when registered memory 
is exhausted.  We just recently committed some improvements to this (although 
the problem is not fully solved) on the v1.4 and v1.5 branches.  Open MPI 
v1.4.3 is pretty old, actually.  Could you try upgrading to Open MPI v1.4.5, or 
the latest v1.5.5rc?
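
If it helps, upgrading is usually just a rebuild from the release tarball (a
sketch; the install prefix below is an assumption, pick whatever fits your
system):

# Sketch: build and install a newer Open MPI alongside the existing one.
tar xjf openmpi-1.4.5.tar.bz2
cd openmpi-1.4.5
./configure --prefix=/opt/openmpi-1.4.5
make all install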


On Feb 27, 2012, at 2:10 AM, Venkateswara Rao Dokku wrote:

> Hi,
> 
> We are facing a problem while running the IMB [Intel MPI Benchmark] tests on 
> Centos 6.0.
> All the tests [PingPong, Exchange.. etc] stalls after some time with no 
> errors.
> 
> Introduction:
> Our's is a customized OFED stack[Our own Driver specific library and Kernel 
> drivers for the h/w], we use IMB tests for testing the same.
> We have already tested the same stack on RHEL5.4 and it was fine.  
> 
> Observation:
> Tests sends few packets and it is observed that acknowledgement for all those 
> packets are received. But no more Send Work Queue entries added for the 
> driver to process.
> Test does not return at all, just stalls there after sending few packets.
> Observed only in Centos 6/RHEL 6.
> 
> Versions of packages installed :
> OpenMPI - 1.4.3
> LibIbVerbs   - 1.1.4 
> LibIbUmad   - 1.3.6
> IMB - 3.2.2
> 
> Please confirm if the versions are compatible with RHEL6. If not, Please 
> suggest the appropriate packages.
> 
> Please respond ASAP. Any help will be appreciated.
> 
> 
> 
> -- 
> Thanks & Regards,
> D.Venkateswara Rao,
> Software Engineer,One Convergence Devices Pvt Ltd.,
> Jubille Hills,Hyderabad.
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] mpirun fails with no allocated resources

2012-02-29 Thread Jeffrey Squyres
Just to put this up front: using the trunk means you are subject to these kinds 
of problems.  It is the head of development, after all -- things sometimes 
break. :-)

Ralph: FWIW, I can replicate this problem on my Mac (OS X Lion) with the SVN 
trunk HEAD (svnversion tells me I have 26070M):

-
[6:46] jsquyres-mac:~/svn/ompi % mpirun -np 1 --host localhost uptime
--
There are no allocated resources for the application 
  uptime
that match the requested mapping:


Verify that you have mapped the allocated resources properly using the 
--host or --hostfile specification.
--
[6:46] jsquyres-mac:~/svn/ompi % cat /etc/hosts
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting.  Do not change this entry.
##
127.0.0.1   localhost
255.255.255.255 broadcasthost
::1 localhost 
fe80::1%lo0 localhost
[6:46] jsquyres-mac:~/svn/ompi % 
-


On Feb 29, 2012, at 3:36 AM, Muhammad Wahaj Sethi wrote:

> 
> 
> Snapshot of my hosts file is present below. localhost is present here.
> 
> 127.0.0.1 localhost
> 127.0.1.1 wahaj-ThinkPad-T510
> 10.42.43.1node0
> 10.42.43.2node1
> 
> Every thing works fine if I don't specify host names. 
> 
> This problem only specific to Open MPI version 1.7. 
> 
> Open MPI version 1.5.5 doesn't produces this error message.
> 
> - Original Message -
> From: "Ralph Castain" 
> To: "Open MPI Users" 
> Sent: Tuesday, February 28, 2012 5:55:43 PM
> Subject: Re: [OMPI users] mpirun fails with no allocated resources
> 
> Try leaving off the -H localhost,localhost front he cmd line - the local host 
> will automatically be included, so that shouldn't be required.
> 
> I believe the problem is that "localhost" isn't the name of your machine, and 
> so we look and don't see that machine anywhere.
> 
> On Feb 28, 2012, at 9:42 AM, Muhammad Wahaj Sethi wrote:
> 
>> Hello,
>>   I have installed newer version but problem still persists.
>> 
>> Package: Open MPI wahaj@wahaj-ThinkPad-T510 Distribution
>>   Open MPI: 1.7a1r26065
>> Open MPI repo revision: r26065
>>  Open MPI release date: Unreleased developer copy
>>   Open RTE: 1.7a1r26065
>> Open RTE repo revision: r26065
>>  Open RTE release date: Unreleased developer copy
>>   OPAL: 1.7a1r26065
>> OPAL repo revision: r26065
>>  OPAL release date: Unreleased developer copy
>>MPI API: 2.1
>>   Ident string: 1.7a1r26065
>> Prefix: /home/wahaj/openmpi-install
>> Configured architecture: x86_64-unknown-linux-gnu
>> 
>> Sequence of steps I followed is mention below.
>> 
>> svn update
>> make distclean
>> ./autogen.pl
>> ./configure --prefix=$HOME/openmpi-install
>> make all install
>> 
>> 
>> wahaj@wahaj-ThinkPad-T510:~$ mpirun -np 2 -H localhost,localhost 
>> /bin/hostname
>> --
>> There are no allocated resources for the application 
>> /bin/hostname
>> that match the requested mapping:
>> 
>> 
>> Verify that you have mapped the allocated resources properly using the 
>> --host or --hostfile specification.
>> --
>> 
>> regards,
>> Wahaj
>> 
>> 
>> - Original Message -
>> From: "Ralph Castain" 
>> To: "Open MPI Users" 
>> Sent: Tuesday, February 28, 2012 3:30:47 PM
>> Subject: Re: [OMPI users] mpirun fails with no allocated resources
>> 
>> 
>> On Feb 28, 2012, at 7:24 AM, Muhammad Wahaj Sethi wrote:
>> 
>>> Hello!
>>>   I am trying run following command using trunk version 1.7a1r25984.
>>> 
>>> mpirun -np 2 -H localhost,localhost /bin/hostname
>>> 
>>> It fails with following error message.
>>> 
>>> --
>>> There are no allocated resources for the application 
>>> /bin/hostname
>>> that match the requested mapping:
>>> 
>>> 
>>> Verify that you have mapped the allocated resources properly using the 
>>> --host or --hostfile specification.
>>> --
>>> 
>>> Every thing works fine if I use trunk version 1.5.5rc3r26063.
>>> 
>>> Any ideas, how it can be fixed?
>> 
>> Sure - update your trunk version. It's been fixed for awhile.
>> 
>> 
>>> 
>>> regards,
>>> Wahaj
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> _

[OMPI users] archlinux segmentation fault error

2012-02-29 Thread Stefano Dal Pont
Hi,

I'm a newbie with Open MPI, so the problem is probably me :)
I'm using a Fortran 90 code developed under Ubuntu 10.04. I've recently
installed the same code on my Arch Linux machine, but I have some issues
concerning Open MPI.
A simple example code works fine on both machines, while the "big" code gives
a segmentation fault error on Arch Linux.
On Ubuntu gcc 4.3 is used, while on Arch the gcc version is 4.6. Is there a way
to make Open MPI use gcc 4.3?

thanks


Re: [OMPI users] orted daemon no found! --- environment not passed to slave nodes

2012-02-29 Thread Yiguang Yan
Hi Jeff,

Thanks.

I tried what you suggested. Here is the output:

>>>
[yiguang@gulftown testdmp]$ ./test.bash
[gulftown:25052] mca: base: components_open: Looking for plm 
components
[gulftown:25052] mca: base: components_open: opening plm 
components
[gulftown:25052] mca: base: components_open: found loaded 
component rsh
[gulftown:25052] mca: base: components_open: component rsh 
has no register function
[gulftown:25052] mca: base: components_open: component rsh 
open function successful
[gulftown:25052] mca: base: components_open: found loaded 
component slurm
[gulftown:25052] mca: base: components_open: component slurm 
has no register function
[gulftown:25052] mca: base: components_open: component slurm 
open function successful
[gulftown:25052] mca: base: components_open: found loaded 
component tm
[gulftown:25052] mca: base: components_open: component tm 
has no register function
[gulftown:25052] mca: base: components_open: component tm 
open function successful
[gulftown:25052] mca:base:select: Auto-selecting plm components
[gulftown:25052] mca:base:select:(  plm) Querying component [rsh]
[gulftown:25052] mca:base:select:(  plm) Query of component [rsh] 
set priority to 10
[gulftown:25052] mca:base:select:(  plm) Querying component 
[slurm]
[gulftown:25052] mca:base:select:(  plm) Skipping component 
[slurm]. Query failed to return a module
[gulftown:25052] mca:base:select:(  plm) Querying component [tm]
[gulftown:25052] mca:base:select:(  plm) Skipping component [tm]. 
Query failed to return a module
[gulftown:25052] mca:base:select:(  plm) Selected component [rsh]
[gulftown:25052] mca: base: close: component slurm closed
[gulftown:25052] mca: base: close: unloading component slurm
[gulftown:25052] mca: base: close: component tm closed
[gulftown:25052] mca: base: close: unloading component tm
bash: orted: command not found
bash: orted: command not found
bash: orted: command not found
<<<


The following is the content of test.bash:
>>>
#!/bin/sh -f
#nohup
#
# >-------------------------------------------------------------------<
adinahome=/usr/adina/system8.8dmp
mpirunfile=$adinahome/bin/mpirun
#
# Set envars for mpirun and orted
#
export PATH=$adinahome/bin:$adinahome/tools:$PATH
export LD_LIBRARY_PATH=$adinahome/lib:$LD_LIBRARY_PATH
#
#
# run DMP problem
#
mcaprefix="--prefix $adinahome"
mcarshagent="--mca plm_rsh_agent rsh:ssh"
mcatmpdir="--mca orte_tmpdir_base /tmp"
mcaopenibmsg="--mca btl_openib_warn_default_gid_prefix 0"
mcaenvars="-x PATH -x LD_LIBRARY_PATH"
mcabtlconn="--mca btl openib,sm,self"
mcaplmbase="--mca plm_base_verbose 100"

mcaparams="$mcaprefix $mcaenvars $mcarshagent 
$mcaopenibmsg $mcabtlconn $mcatmpdir $mcaplmbase"

$mpirunfile $mcaparams --app addmpw-hostname
<<<

While the content of addmpw-hostname is:
>>>
-n 1 -host gulftown hostname
-n 1 -host ibnode001 hostname
-n 1 -host ibnode002 hostname
-n 1 -host ibnode003 thostname
<<<

After this, I also tried to specify the orted through:

--mca orte_launch_agent $adinahome/bin/orted

Then orted could be found on the slave nodes, but now the shared libs 
in $adinahome/lib are not on the LD_LIBRARY_PATH.

Any comments?

Thanks,
Yiguang
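
A quick check of what a non-interactive remote shell actually sees (a sketch;
ibnode001 is taken from the app file above) often explains this kind of
failure:

# Sketch: a non-interactive shell on the slave node may not source the
# startup files that put Open MPI's bin/lib directories on PATH/LD_LIBRARY_PATH.
ssh ibnode001 'echo $PATH; echo $LD_LIBRARY_PATH; which orted'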





[OMPI users] Very slow MPI_GATHER

2012-02-29 Thread Pinero, Pedro_jose
Hi,

 

I am using OMPI v1.5.5 to communicate between 200 processes in a 2-computer
cluster connected through Ethernet, and I am getting very poor performance. I
have measured the time of each operation and have realised that the
MPI_Gather operation takes about 1 second in each synchronization (only
an integer is sent in each case). Is this time range normal, or do I have a
synchronization problem?  Is there any way to improve this performance?

 

Thank you for your help in advance

 

Pedro
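
If the two computers have far fewer than 100 cores each, the 200 ranks are
heavily oversubscribed, and letting idle ranks yield the CPU is worth a try
before anything else (a sketch only, reusing the MCA option already mentioned
elsewhere in this digest; "hosts" and ./my_app are placeholders):

# Sketch: tell oversubscribed ranks to yield the processor while waiting.
mpirun -np 200 -hostfile hosts --mca mpi_yield_when_idle 1 ./my_app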



Re: [OMPI users] InfiniBand path migration not working

2012-02-29 Thread Shamis, Pavel
> 
>> On Tue, Feb 28, 2012 at 11:34 AM, Shamis, Pavel  wrote:
>> I reviewed the code and it seems to be ok :) The error should be reported if 
>> the port migration is already happened once (port 1 to port 2), and now you 
>> are trying to shutdown port 2 and MPI reports that it can't migrate anymore. 
>> It assumes that port 1 is still down and it can't go back to from port 2 to 
>> port 1.
> 
> In my test case I never try to shutdown port 2.
> I start with both ports cabled up.
> Then I start the MPI test
> Then I unplug the Port 1 cable.
> I leave Port 2 alone.  I expect the application to just keep using Port 2.
> 
> So I expect the migration from Port 1 to Port 2 when I unplug the
> cable.  But I don't expect any more migration after that.

Then we have some bug there :-)
> 
>> 
>> Can you please build open mpi in debug mode and try to run it in verbose 
>> mode. It will help to understand better the scenario.
> 
> I've recompiled with debug mode(configure --enable-debug).  The
> resulting output (mpirun --mca btl_base_verbose 1) is too large to
> send (28 MB).  Are there specific lines you are looking for? Or do you
> have a preferred method for sending you a text file?

I would like to see the whole file.
Is 28 MB the size after compression?

I think Gmail supports up to 25 MB.
You may try to create a gzip file and then slice it using the "split" command.

Regards,
Pasha




[OMPI users] Question regarding osu-benchamarks 3.1.1

2012-02-29 Thread Venkateswara Rao Dokku
Hi,
I tried executing the osu_benchmarks-3.1.1 suite with openmpi-1.4.3... I
could run 10 benchmark tests (all except osu_put_bibw, osu_put_bw,
osu_get_bw, and osu_latency_mt) out of the 14 tests in the benchmark suite,
and the remaining tests hang at some message size. The output is shown below.

[root@test2 ~]# mpirun --prefix /usr/local/ -np 2 --mca btl openib,self,sm
-H 192.168.0.175,192.168.0.174 --mca orte_base_help_aggregate 0
/root/ramu/ofed_pkgs/osu_benchmarks-3.1.1/osu_put_bibw
failed to create doorbell file /dev/plx2_char_dev
--
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:test1
  Device name:   plx2_0
  Device vendor ID:  0x10b5
  Device vendor part ID: 4277

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
  btl_openib_warn_no_device_params_found to 0.
--
failed to create doorbell file /dev/plx2_char_dev
--
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:test2
  Device name:   plx2_0
  Device vendor ID:  0x10b5
  Device vendor part ID: 4277

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
  btl_openib_warn_no_device_params_found to 0.
--
alloc_srq max: 512 wqe_shift: 5
alloc_srq max: 512 wqe_shift: 5
alloc_srq max: 512 wqe_shift: 5
alloc_srq max: 512 wqe_shift: 5
alloc_srq max: 512 wqe_shift: 5
alloc_srq max: 512 wqe_shift: 5
# OSU One Sided MPI_Put Bi-directional Bandwidth Test v3.1.1
# Size Bi-Bandwidth (MB/s)
plx2_create_qp line: 415
plx2_create_qp line: 415
plx2_create_qp line: 415
 plx2_create_qp line: 415
1                       0.00
2                       0.00
4                       0.01
8                       0.03
16                      0.07
32                      0.15
64                      0.11
128                     0.21
256                     0.43
512                     0.88
1024                    2.10
2048                    4.21
4096                    8.10
8192                   16.19
16384                   8.46
32768                  20.34
65536                  39.85
131072                 84.22
262144                142.23
524288                234.83
mpirun: killing job...

--
mpirun noticed that process rank 0 with PID 7305 on node test2 exited on
signal 0 (Unknown signal 0).
--
2 total processes killed (some possibly by mpirun during cleanup)
mpirun: clean termination accomplished

[root@test2 ~]# mpirun --prefix /usr/local/ -np 2 --mca btl openib,self,sm
-H 192.168.0.175,192.168.0.174 --mca orte_base_help_aggregate 0
/root/ramu/ofed_pkgs/osu_benchmarks-3.1.1/osu_put_bw
failed to create doorbell file /dev/plx2_char_dev
--
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:test1
  Device name:   plx2_0
  Device vendor ID:  0x10b5
  Device vendor part ID: 4277

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
  btl_openib_warn_no_device_params_found to 0.
--
failed to create doorbell file /dev/plx2_char_dev
--
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:test2
  Device name:   plx2_0
  Device vendor ID:  0x10b5
  Device vendor part ID: 4277

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
  btl_openib_warn_no_device_params_found to 0.
--
alloc_srq max: 512 wqe_

Re: [OMPI users] Question regarding osu-benchamarks 3.1.1

2012-02-29 Thread Venkateswara Rao Dokku
Sorry, I forgot to introduce the system. Ours is a customized OFED stack
implemented to work on specific hardware. We tested the stack with
qperf and the Intel MPI Benchmarks (IMB-3.2.2), and they ran fine. We want to
execute the osu_benchmarks-3.1.1 suite on our OFED.

On Wed, Feb 29, 2012 at 9:57 PM, Venkateswara Rao Dokku  wrote:

> Hiii,
> I tried executing osu_benchamarks-3.1.1 suite with the openmpi-1.4.3... I
> could run 10 bench-mark tests (except osu_put_bibw,osu_put_bw,osu_
> get_bw,osu_latency_mt) out of 14 tests in the bench-mark suite... and the
> remaining tests are hanging at some message size.. the output is shown below
>
> [root@test2 ~]# mpirun --prefix /usr/local/ -np 2 --mca btl
> openib,self,sm -H 192.168.0.175,192.168.0.174 --mca
> orte_base_help_aggregate 0
> /root/ramu/ofed_pkgs/osu_benchmarks-3.1.1/osu_put_bibw
> failed to create doorbell file /dev/plx2_char_dev
> --
> WARNING: No preset parameters were found for the device that Open MPI
> detected:
>
>   Local host:test1
>   Device name:   plx2_0
>   Device vendor ID:  0x10b5
>   Device vendor part ID: 4277
>
> Default device parameters will be used, which may result in lower
> performance.  You can edit any of the files specified by the
> btl_openib_device_param_files MCA parameter to set values for your
> device.
>
> NOTE: You can turn off this warning by setting the MCA parameter
>   btl_openib_warn_no_device_params_found to 0.
> --
> failed to create doorbell file /dev/plx2_char_dev
> --
> WARNING: No preset parameters were found for the device that Open MPI
> detected:
>
>   Local host:test2
>   Device name:   plx2_0
>   Device vendor ID:  0x10b5
>   Device vendor part ID: 4277
>
> Default device parameters will be used, which may result in lower
> performance.  You can edit any of the files specified by the
> btl_openib_device_param_files MCA parameter to set values for your
> device.
>
> NOTE: You can turn off this warning by setting the MCA parameter
>   btl_openib_warn_no_device_params_found to 0.
> --
> alloc_srq max: 512 wqe_shift: 5
> alloc_srq max: 512 wqe_shift: 5
> alloc_srq max: 512 wqe_shift: 5
> alloc_srq max: 512 wqe_shift: 5
> alloc_srq max: 512 wqe_shift: 5
> alloc_srq max: 512 wqe_shift: 5
> # OSU One Sided MPI_Put Bi-directional Bandwidth Test v3.1.1
> # Size Bi-Bandwidth (MB/s)
> plx2_create_qp line: 415
> plx2_create_qp line: 415
> plx2_create_qp line: 415
>  plx2_create_qp line: 415
> 1                       0.00
> 2                       0.00
> 4                       0.01
> 8                       0.03
> 16                      0.07
> 32                      0.15
> 64                      0.11
> 128                     0.21
> 256                     0.43
> 512                     0.88
> 1024                    2.10
> 2048                    4.21
> 4096                    8.10
> 8192                   16.19
> 16384                   8.46
> 32768                  20.34
> 65536                  39.85
> 131072                 84.22
> 262144                142.23
> 524288                234.83
> mpirun: killing job...
>
> --
> mpirun noticed that process rank 0 with PID 7305 on node test2 exited on
> signal 0 (Unknown signal 0).
> --
> 2 total processes killed (some possibly by mpirun during cleanup)
> mpirun: clean termination accomplished
>
> [root@test2 ~]# mpirun --prefix /usr/local/ -np 2 --mca btl
> openib,self,sm -H 192.168.0.175,192.168.0.174 --mca
> orte_base_help_aggregate 0
> /root/ramu/ofed_pkgs/osu_benchmarks-3.1.1/osu_put_bw
> failed to create doorbell file /dev/plx2_char_dev
> --
> WARNING: No preset parameters were found for the device that Open MPI
> detected:
>
>   Local host:test1
>   Device name:   plx2_0
>   Device vendor ID:  0x10b5
>   Device vendor part ID: 4277
>
> Default device parameters will be used, which may result in lower
> performance.  You can edit any of the files specified by the
> btl_openib_device_param_files MCA parameter to set values for your
> device.
>
> NOTE: You can turn off this warning by setting the MCA parameter
>   btl_openib_warn_no_device_params_found to 0.
> --
> failed to create doorbell file /dev/plx2_char_dev
> --
> WARNING: No preset parameters were found for the device that Open MPI
> d

Re: [OMPI users] mpirun fails with no allocated resources

2012-02-29 Thread Ralph Castain
Fixed with r26071

On Feb 29, 2012, at 4:55 AM, Jeffrey Squyres wrote:

> Just to put this up front: using the trunk is subject to have these kinds of 
> problems.  It is the head of development, after all -- things sometimes 
> break. :-)
> 
> Ralph: FWIW, I can replicate this problem on my Mac (OS X Lion) with the SVN 
> trunk HEAD (svnversion tells me I have 26070M):
> 
> -
> [6:46] jsquyres-mac:~/svn/ompi % mpirun -np 1 --host localhost uptime
> --
> There are no allocated resources for the application 
>  uptime
> that match the requested mapping:
> 
> 
> Verify that you have mapped the allocated resources properly using the 
> --host or --hostfile specification.
> --
> [6:46] jsquyres-mac:~/svn/ompi % cat /etc/hosts
> ##
> # Host Database
> #
> # localhost is used to configure the loopback interface
> # when the system is booting.  Do not change this entry.
> ##
> 127.0.0.1 localhost
> 255.255.255.255   broadcasthost
> ::1 localhost 
> fe80::1%lo0   localhost
> [6:46] jsquyres-mac:~/svn/ompi % 
> -
> 
> 
> On Feb 29, 2012, at 3:36 AM, Muhammad Wahaj Sethi wrote:
> 
>> 
>> 
>> Snapshot of my hosts file is present below. localhost is present here.
>> 
>> 127.0.0.1localhost
>> 127.0.1.1wahaj-ThinkPad-T510
>> 10.42.43.1   node0
>> 10.42.43.2   node1
>> 
>> Every thing works fine if I don't specify host names. 
>> 
>> This problem only specific to Open MPI version 1.7. 
>> 
>> Open MPI version 1.5.5 doesn't produces this error message.
>> 
>> - Original Message -
>> From: "Ralph Castain" 
>> To: "Open MPI Users" 
>> Sent: Tuesday, February 28, 2012 5:55:43 PM
>> Subject: Re: [OMPI users] mpirun fails with no allocated resources
>> 
>> Try leaving off the -H localhost,localhost front he cmd line - the local 
>> host will automatically be included, so that shouldn't be required.
>> 
>> I believe the problem is that "localhost" isn't the name of your machine, 
>> and so we look and don't see that machine anywhere.
>> 
>> On Feb 28, 2012, at 9:42 AM, Muhammad Wahaj Sethi wrote:
>> 
>>> Hello,
>>>  I have installed newer version but problem still persists.
>>> 
>>> Package: Open MPI wahaj@wahaj-ThinkPad-T510 Distribution
>>>  Open MPI: 1.7a1r26065
>>> Open MPI repo revision: r26065
>>> Open MPI release date: Unreleased developer copy
>>>  Open RTE: 1.7a1r26065
>>> Open RTE repo revision: r26065
>>> Open RTE release date: Unreleased developer copy
>>>  OPAL: 1.7a1r26065
>>>OPAL repo revision: r26065
>>> OPAL release date: Unreleased developer copy
>>>   MPI API: 2.1
>>>  Ident string: 1.7a1r26065
>>>Prefix: /home/wahaj/openmpi-install
>>> Configured architecture: x86_64-unknown-linux-gnu
>>> 
>>> Sequence of steps I followed is mention below.
>>> 
>>> svn update
>>> make distclean
>>> ./autogen.pl
>>> ./configure --prefix=$HOME/openmpi-install
>>> make all install
>>> 
>>> 
>>> wahaj@wahaj-ThinkPad-T510:~$ mpirun -np 2 -H localhost,localhost 
>>> /bin/hostname
>>> --
>>> There are no allocated resources for the application 
>>> /bin/hostname
>>> that match the requested mapping:
>>> 
>>> 
>>> Verify that you have mapped the allocated resources properly using the 
>>> --host or --hostfile specification.
>>> --
>>> 
>>> regards,
>>> Wahaj
>>> 
>>> 
>>> - Original Message -
>>> From: "Ralph Castain" 
>>> To: "Open MPI Users" 
>>> Sent: Tuesday, February 28, 2012 3:30:47 PM
>>> Subject: Re: [OMPI users] mpirun fails with no allocated resources
>>> 
>>> 
>>> On Feb 28, 2012, at 7:24 AM, Muhammad Wahaj Sethi wrote:
>>> 
 Hello!
  I am trying run following command using trunk version 1.7a1r25984.
 
 mpirun -np 2 -H localhost,localhost /bin/hostname
 
 It fails with following error message.
 
 --
 There are no allocated resources for the application 
 /bin/hostname
 that match the requested mapping:
 
 
 Verify that you have mapped the allocated resources properly using the 
 --host or --hostfile specification.
 --
 
 Every thing works fine if I use trunk version 1.5.5rc3r26063.
 
 Any ideas, how it can be fixed?
>>> 
>>> Sure - update your trunk version. It's been fixed for awhile.
>>> 
>>> 
 
 regards,
 Wahaj
 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> h

Re: [OMPI users] mpirun fails with no allocated resources

2012-02-29 Thread Muhammad Wahaj Sethi
Thanx alot. 

- Original Message -
From: "Ralph Castain" 
To: "Open MPI Users" 
Sent: Wednesday, February 29, 2012 5:56:23 PM
Subject: Re: [OMPI users] mpirun fails with no allocated resources

Fixed with r26071

On Feb 29, 2012, at 4:55 AM, Jeffrey Squyres wrote:

> Just to put this up front: using the trunk is subject to have these kinds of 
> problems.  It is the head of development, after all -- things sometimes 
> break. :-)
> 
> Ralph: FWIW, I can replicate this problem on my Mac (OS X Lion) with the SVN 
> trunk HEAD (svnversion tells me I have 26070M):
> 
> -
> [6:46] jsquyres-mac:~/svn/ompi % mpirun -np 1 --host localhost uptime
> --
> There are no allocated resources for the application 
>  uptime
> that match the requested mapping:
> 
> 
> Verify that you have mapped the allocated resources properly using the 
> --host or --hostfile specification.
> --
> [6:46] jsquyres-mac:~/svn/ompi % cat /etc/hosts
> ##
> # Host Database
> #
> # localhost is used to configure the loopback interface
> # when the system is booting.  Do not change this entry.
> ##
> 127.0.0.1 localhost
> 255.255.255.255   broadcasthost
> ::1 localhost 
> fe80::1%lo0   localhost
> [6:46] jsquyres-mac:~/svn/ompi % 
> -
> 
> 
> On Feb 29, 2012, at 3:36 AM, Muhammad Wahaj Sethi wrote:
> 
>> 
>> 
>> Snapshot of my hosts file is present below. localhost is present here.
>> 
>> 127.0.0.1localhost
>> 127.0.1.1wahaj-ThinkPad-T510
>> 10.42.43.1   node0
>> 10.42.43.2   node1
>> 
>> Everything works fine if I don't specify host names. 
>> 
>> This problem is specific only to Open MPI version 1.7. 
>> 
>> Open MPI version 1.5.5 doesn't produce this error message.
>> 
>> - Original Message -
>> From: "Ralph Castain" 
>> To: "Open MPI Users" 
>> Sent: Tuesday, February 28, 2012 5:55:43 PM
>> Subject: Re: [OMPI users] mpirun fails with no allocated resources
>> 
>> Try leaving off the -H localhost,localhost from the cmd line - the local 
>> host will automatically be included, so that shouldn't be required.
>> 
>> I believe the problem is that "localhost" isn't the name of your machine, 
>> and so we look and don't see that machine anywhere.
>> 
>> On Feb 28, 2012, at 9:42 AM, Muhammad Wahaj Sethi wrote:
>> 
>>> Hello,
>>>  I have installed the newer version but the problem still persists.
>>> 
>>> Package: Open MPI wahaj@wahaj-ThinkPad-T510 Distribution
>>>  Open MPI: 1.7a1r26065
>>> Open MPI repo revision: r26065
>>> Open MPI release date: Unreleased developer copy
>>>  Open RTE: 1.7a1r26065
>>> Open RTE repo revision: r26065
>>> Open RTE release date: Unreleased developer copy
>>>  OPAL: 1.7a1r26065
>>>OPAL repo revision: r26065
>>> OPAL release date: Unreleased developer copy
>>>   MPI API: 2.1
>>>  Ident string: 1.7a1r26065
>>>Prefix: /home/wahaj/openmpi-install
>>> Configured architecture: x86_64-unknown-linux-gnu
>>> 
>>> The sequence of steps I followed is mentioned below.
>>> 
>>> svn update
>>> make distclean
>>> ./autogen.pl
>>> ./configure --prefix=$HOME/openmpi-install
>>> make all install
>>> 
>>> 
>>> wahaj@wahaj-ThinkPad-T510:~$ mpirun -np 2 -H localhost,localhost 
>>> /bin/hostname
>>> --
>>> There are no allocated resources for the application 
>>> /bin/hostname
>>> that match the requested mapping:
>>> 
>>> 
>>> Verify that you have mapped the allocated resources properly using the 
>>> --host or --hostfile specification.
>>> --
>>> 
>>> regards,
>>> Wahaj
>>> 
>>> 
>>> - Original Message -
>>> From: "Ralph Castain" 
>>> To: "Open MPI Users" 
>>> Sent: Tuesday, February 28, 2012 3:30:47 PM
>>> Subject: Re: [OMPI users] mpirun fails with no allocated resources
>>> 
>>> 
>>> On Feb 28, 2012, at 7:24 AM, Muhammad Wahaj Sethi wrote:
>>> 
 Hello!
 I am trying to run the following command using trunk version 1.7a1r25984.
 
 mpirun -np 2 -H localhost,localhost /bin/hostname
 
 It fails with the following error message.
 
 --
 There are no allocated resources for the application 
 /bin/hostname
 that match the requested mapping:
 
 
 Verify that you have mapped the allocated resources properly using the 
 --host or --hostfile specification.
 --
 
 Everything works fine if I use trunk version 1.5.5rc3r26063.
 
 Any ideas, how it can be fixed?
>>> 
>>> Sure - update your trunk version. It's been fixed for awhile.
>>> 
>>> 
 
 regards,
 Wahaj
 ___
 users

Re: [OMPI users] Question regarding osu-benchamarks 3.1.1

2012-02-29 Thread Jeffrey Squyres
FWIW, I'm immediately suspicious of *any* MPI application that uses the MPI 
one-sided operations (i.e., MPI_PUT and MPI_GET).  It looks like these two OSU 
benchmarks are using those operations.

Is it known that these two benchmarks are correct?



On Feb 29, 2012, at 11:33 AM, Venkateswara Rao Dokku wrote:

> Sorry, I forgot to introduce the system. Ours is a customized OFED stack 
> implemented to work on specific hardware. We tested the stack with 
> qperf and the Intel Benchmarks (IMB-3.2.2), and they went fine. We want to execute 
> the osu_benchmarks-3.1.1 suite on our OFED.
> 
> On Wed, Feb 29, 2012 at 9:57 PM, Venkateswara Rao Dokku  
> wrote:
> Hiii,
> I tried executing the osu_benchmarks-3.1.1 suite with openmpi-1.4.3. I 
> could run 10 benchmark tests (except osu_put_bibw, osu_put_bw, 
> osu_get_bw, osu_latency_mt) out of the 14 tests in the benchmark suite, and the 
> remaining tests hang at some message size. The output is shown below:
> 
> [root@test2 ~]# mpirun --prefix /usr/local/ -np 2 --mca btl openib,self,sm -H 
> 192.168.0.175,192.168.0.174 --mca orte_base_help_aggregate 0 
> /root/ramu/ofed_pkgs/osu_benchmarks-3.1.1/osu_put_bibw
> failed to create doorbell file /dev/plx2_char_dev 
> --
> WARNING: No preset parameters were found for the device that Open MPI
> detected:
> 
>   Local host:test1
>   Device name:   plx2_0
>   Device vendor ID:  0x10b5
>   Device vendor part ID: 4277
> 
> Default device parameters will be used, which may result in lower
> performance.  You can edit any of the files specified by the
> btl_openib_device_param_files MCA parameter to set values for your
> device.
> 
> NOTE: You can turn off this warning by setting the MCA parameter
>   btl_openib_warn_no_device_params_found to 0.
> --
> failed to create doorbell file /dev/plx2_char_dev 
> --
> WARNING: No preset parameters were found for the device that Open MPI
> detected:
> 
>   Local host:test2
>   Device name:   plx2_0
>   Device vendor ID:  0x10b5
>   Device vendor part ID: 4277
> 
> Default device parameters will be used, which may result in lower
> performance.  You can edit any of the files specified by the
> btl_openib_device_param_files MCA parameter to set values for your
> device.
> 
> NOTE: You can turn off this warning by setting the MCA parameter
>   btl_openib_warn_no_device_params_found to 0.
> --
> alloc_srq max: 512 wqe_shift: 5
> alloc_srq max: 512 wqe_shift: 5
> alloc_srq max: 512 wqe_shift: 5
> alloc_srq max: 512 wqe_shift: 5
> alloc_srq max: 512 wqe_shift: 5
> alloc_srq max: 512 wqe_shift: 5
> # OSU One Sided MPI_Put Bi-directional Bandwidth Test v3.1.1
> # Size Bi-Bandwidth (MB/s)
> plx2_create_qp line: 415 
> plx2_create_qp line: 415 
> plx2_create_qp line: 415 
> plx2_create_qp line: 415 
> 1            0.00
> 2            0.00
> 4            0.01
> 8            0.03
> 16           0.07
> 32           0.15
> 64           0.11
> 128          0.21
> 256          0.43
> 512          0.88
> 1024         2.10
> 2048         4.21
> 4096         8.10
> 8192        16.19
> 16384        8.46
> 32768       20.34
> 65536       39.85
> 131072      84.22
> 262144     142.23
> 524288     234.83
> mpirun: killing job...
> 
> --
> mpirun noticed that process rank 0 with PID 7305 on node test2 exited on 
> signal 0 (Unknown signal 0).
> --
> 2 total processes killed (some possibly by mpirun during cleanup)
> mpirun: clean termination accomplished
> 
> [root@test2 ~]# mpirun --prefix /usr/local/ -np 2 --mca btl openib,self,sm -H 
> 192.168.0.175,192.168.0.174 --mca orte_base_help_aggregate 0 
> /root/ramu/ofed_pkgs/osu_benchmarks-3.1.1/osu_put_bw
> failed to create doorbell file /dev/plx2_char_dev 
> --
> WARNING: No preset parameters were found for the device that Open MPI
> detected:
> 
>   Local host:test1
>   Device name:   plx2_0
>   Device vendor ID:  0x10b5
>   Device vendor part ID: 4277
> 
> Default device parameters will be used, which may result in lower
> performance.  You can edit any of the files specified by the
> btl_openib_device_param_files MCA parameter to set values for your
> device.
> 
> NOTE: You can turn off this warning by setting th

Re: [OMPI users] Very slow MPI_GATHER

2012-02-29 Thread Jeffrey Squyres
On Feb 29, 2012, at 11:01 AM, Pinero, Pedro_jose wrote:

> I am using OMPI v.1.5.5 to communicate 200 processes in a 2-computer cluster 
> connected through Ethernet, obtaining very poor performance.

Let me make sure I'm parsing this statement properly: are you launching 200 
MPI processes on 2 computers?  If so, do those computers each have 100 cores?

I ask because oversubscribing MPI processes (i.e., putting more than 1 process 
per core) will be disastrous to performance.

> I have measured each operation time and I have realised that the MPI_Gather 
> operation takes about 1 second in each synchronization (only an integer is 
> sent in each case). Is this time range normal, or do I have a synchronization 
> problem?  Is there any way to improve this performance?

I'm afraid I can't say more without more information about your hardware and 
software setup.  Is this a dedicated HPC cluster?  Are you oversubscribing the 
cores?  What kind of Ethernet switching gear do you have?  ...etc.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] [EXTERNAL] Re: Question regarding osu-benchamarks 3.1.1

2012-02-29 Thread Barrett, Brian W
I'm pretty sure that they are correct.  Our one-sided implementation is
buggier than I'd like (indeed, I'm in the process of rewriting most of it
as part of Open MPI's support for MPI-3's revised RDMA), so it's likely
that the bugs are in Open MPI's onesided support.  Can you try a more
recent release (something from the 1.5 tree) and see if the problem
persists?

Thanks,

Brian

On 2/29/12 10:56 AM, "Jeffrey Squyres"  wrote:

>FWIW, I'm immediately suspicious of *any* MPI application that uses the
>MPI one-sided operations (i.e., MPI_PUT and MPI_GET).  It looks like
>these two OSU benchmarks are using those operations.
>
>Is it known that these two benchmarks are correct?
>
>
>
>On Feb 29, 2012, at 11:33 AM, Venkateswara Rao Dokku wrote:
>
>> Sorry, i forgot to introduce the system.. Ours is the customized OFED
>>stack implemented to work on the specific hardware.. We tested the stack
>>with the q-perf and Intel Benchmarks(IMB-3.2.2).. they went fine.. We
>>want to execute the osu_benchamark3.1.1 suite on our OFED..
>> 
>> On Wed, Feb 29, 2012 at 9:57 PM, Venkateswara Rao Dokku
>> wrote:
>> Hiii,
>> I tried executing osu_benchamarks-3.1.1 suite with the openmpi-1.4.3...
>>I could run 10 bench-mark tests (except osu_put_bibw,osu_put_bw,osu_
>> get_bw,osu_latency_mt) out of 14 tests in the bench-mark suite... and
>>the remaining tests are hanging at some message size.. the output is
>>shown below
>> 
>> [root@test2 ~]# mpirun --prefix /usr/local/ -np 2 --mca btl
>>openib,self,sm -H 192.168.0.175,192.168.0.174 --mca
>>orte_base_help_aggregate 0
>>/root/ramu/ofed_pkgs/osu_benchmarks-3.1.1/osu_put_bibw
>> failed to create doorbell file /dev/plx2_char_dev
>> 
>>-
>>-
>> WARNING: No preset parameters were found for the device that Open MPI
>> detected:
>> 
>>   Local host:test1
>>   Device name:   plx2_0
>>   Device vendor ID:  0x10b5
>>   Device vendor part ID: 4277
>> 
>> Default device parameters will be used, which may result in lower
>> performance.  You can edit any of the files specified by the
>> btl_openib_device_param_files MCA parameter to set values for your
>> device.
>> 
>> NOTE: You can turn off this warning by setting the MCA parameter
>>   btl_openib_warn_no_device_params_found to 0.
>> 
>>-
>>-
>> failed to create doorbell file /dev/plx2_char_dev
>> 
>>-
>>-
>> WARNING: No preset parameters were found for the device that Open MPI
>> detected:
>> 
>>   Local host:test2
>>   Device name:   plx2_0
>>   Device vendor ID:  0x10b5
>>   Device vendor part ID: 4277
>> 
>> Default device parameters will be used, which may result in lower
>> performance.  You can edit any of the files specified by the
>> btl_openib_device_param_files MCA parameter to set values for your
>> device.
>> 
>> NOTE: You can turn off this warning by setting the MCA parameter
>>   btl_openib_warn_no_device_params_found to 0.
>> 
>>-
>>-
>> alloc_srq max: 512 wqe_shift: 5
>> alloc_srq max: 512 wqe_shift: 5
>> alloc_srq max: 512 wqe_shift: 5
>> alloc_srq max: 512 wqe_shift: 5
>> alloc_srq max: 512 wqe_shift: 5
>> alloc_srq max: 512 wqe_shift: 5
>> # OSU One Sided MPI_Put Bi-directional Bandwidth Test v3.1.1
>> # Size Bi-Bandwidth (MB/s)
>> plx2_create_qp line: 415
>> plx2_create_qp line: 415
>> plx2_create_qp line: 415
>> plx2_create_qp line: 415
>> 1            0.00
>> 2            0.00
>> 4            0.01
>> 8            0.03
>> 16           0.07
>> 32           0.15
>> 64           0.11
>> 128          0.21
>> 256          0.43
>> 512          0.88
>> 1024         2.10
>> 2048         4.21
>> 4096         8.10
>> 8192        16.19
>> 16384        8.46
>> 32768       20.34
>> 65536       39.85
>> 131072      84.22
>> 262144     142.23
>> 524288     234.83
>> mpirun: killing job...
>> 
>> 
>>-
>>-
>> mpirun noticed that process rank 0 with PID 7305 on node test2 exited
>>on signal 0 (Unknown signal 0).
>> 
>>-
>>-
>> 2 total processes killed (some possibly by mpirun during cleanup)
>> mpirun: clean termination accomplished
>> 
>> [root@test2 ~]# mpirun --prefix /usr/local/ -np 2 --mca btl
>>openib,self,sm -H 192.168.0.175,192.168.0.174 --mca
>>orte_base_help_aggregate 0
>>/root/ramu/ofed_pkgs/osu_benchmarks-3.1.1/osu_put_bw
>> failed to create doorbell file /dev/plx2_char_dev
>> 
>>-

Re: [OMPI users] archlinux segmentation fault error

2012-02-29 Thread Jeffrey Squyres
On Feb 29, 2012, at 9:39 AM, Stefano Dal Pont wrote:

> I'm a newbie with Open MPI, so the problem is probably me :)
> I'm using a Fortran 90 code developed under Ubuntu 10.04. I've recently 
> installed the same code on my Arch Linux machine but I have some issues 
> concerning Open MPI. 
> A simple example code works fine on both machines, while the "big" code gives a 
> segmentation fault error on Arch Linux. 
> On Ubuntu gcc 4.3 is used, while on Arch the gcc version is 4.6. Is there a way to 
> make Open MPI use gcc 4.3? 


This is a local configuration issue, not really an Open MPI issue.

Open MPI will compile itself with whichever compiler you tell it to; if you 
have both gcc 4.3 and 4.6 installed correctly on your machine, you can probably 
configure Open MPI with:

./configure CC=/path/to/gcc4.3/bin/gcc CXX=/path/to/gcc4.3/bin/g++ \
   F77=/path/to/gcc4.3/bin/gfortran FC=/path/to/gcc4.3/bin/gfortran ...

Make sense?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI users] Newbi question about MPI_wait vs MPI_wait any

2012-02-29 Thread Eric Chamberland

Hi,

I would like to know which of "waitone" vs. "waitany" is optimal and, of 
course, which will never produce deadlocks.


Let's say we have "lNp" processes and they want to send an array of int 
of length "lNbInt" to process "0" in a non-blocking MPI_Isend (instead 
of MPI_Gather).  Let's say the order for receiving is unimportant and we 
want to start using data as soon as possible.


I have attached wait.cc, which one can compile in two ways:

mpicxx -o waitone wait.cc

mpicxx -DMPI_WAIT_ANY_VERSION -o waitany wait.cc

Then launch using 1 parameter to the executable: the length "lNbInt".

The waitone version:
mpirun -display-map -H host1,host2,host3 -n 24 waitone 1

The waitany version:
mpirun -display-map -H host1,host2,host3 -n 24 waitany 1

After executing several times, with different numbers of processes and 
different numbers of nodes and almost always a large value of "lNbInt", I 
*think* these could be good conclusions:


#1- Both versions take almost the same wall clock time to complete.
#2- Both versions do *not* produce deadlocks.
#3- MPI_WAIT_ANY_VERSION could do better if some work was really done 
with the received data.
#4- MPI_WAIT_ANY_VERSION always received the data from processes on the 
same host.


I haven't been able to reproduce a deadlock even while varying the array 
length, the number of processes and the number of hosts.  How can I conclude 
there is no problem with this code?  Any reading suggestions?


Thanks!

Eric
#include "mpi.h"
#include <iostream>
#include <cstdlib>

//Use the following for the MPI_Waitany version
//#define MPI_WAIT_ANY_VERSION

int main(int pArgc, char *pArgv[])
{
  int lRank = -1;
  int lNp   = -1;
  int lTag = 1;
  int lRet = 0;

  if (pArgc != 2) {
std::cerr << "Please specify the number of int to send!" << std::endl;
return 1;
  }
  int lNbInt = std::atoi(pArgv[1]);

  MPI_Request lSendRequest;
  MPI_Status  lStatus;
  lStatus.MPI_ERROR  = MPI_SUCCESS;

  MPI_Init(&pArgc,&pArgv);

  MPI_Comm lComm = MPI_COMM_WORLD;

  MPI_Comm_size(lComm, &lNp);

  MPI_Comm_rank(lComm, &lRank);

  int * lPtrToArrayOfInt = 0;
  int * lVecInt = 0;

  if (lRank != 0 ) {
lPtrToArrayOfInt = new int[lNbInt];
for (int i = 0; i< lNbInt; ++i) {
  lPtrToArrayOfInt[i] = rand();
}
MPI_Isend(lPtrToArrayOfInt, lNbInt, MPI_INT, 0, lTag, lComm, &lSendRequest);
  }
  else {
MPI_Request* lVecRequest = new MPI_Request[lNp-1];
lVecInt = new int[lNbInt*lNp-1];
if (0 == lVecInt) {
  std::cerr<< "Unable to allocate array!" <

Re: [OMPI users] [EXTERNAL] Re: Question regarding osu-benchamarks 3.1.1

2012-02-29 Thread Jingcha Joba
When I ran my OSU tests, I was able to get the numbers out of all the
tests except latency_mt (which was obvious, as I didn't compile Open MPI
with multithreaded support).
A good way to know whether the problem is with Open MPI or with your custom OFED
stack would be to use some other device, like tcp instead of ib, and rerun
these one-sided comm tests.
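For example, reusing the hosts and benchmark path from the earlier post, the only
change is the btl list:

mpirun --prefix /usr/local/ -np 2 --mca btl tcp,sm,self \
    -H 192.168.0.175,192.168.0.174 --mca orte_base_help_aggregate 0 \
    /root/ramu/ofed_pkgs/osu_benchmarks-3.1.1/osu_put_bw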
On Wed, Feb 29, 2012 at 10:04 AM, Barrett, Brian W wrote:

> I'm pretty sure that they are correct.  Our one-sided implementation is
> buggier than I'd like (indeed, I'm in the process of rewriting most of it
> as part of Open MPI's support for MPI-3's revised RDMA), so it's likely
> that the bugs are in Open MPI's onesided support.  Can you try a more
> recent release (something from the 1.5 tree) and see if the problem
> persists?
>
> Thanks,
>
> Brian
>
> On 2/29/12 10:56 AM, "Jeffrey Squyres"  wrote:
>
> >FWIW, I'm immediately suspicious of *any* MPI application that uses the
> >MPI one-sided operations (i.e., MPI_PUT and MPI_GET).  It looks like
> >these two OSU benchmarks are using those operations.
> >
> >Is it known that these two benchmarks are correct?
> >
> >
> >
> >On Feb 29, 2012, at 11:33 AM, Venkateswara Rao Dokku wrote:
> >
> >> Sorry, i forgot to introduce the system.. Ours is the customized OFED
> >>stack implemented to work on the specific hardware.. We tested the stack
> >>with the q-perf and Intel Benchmarks(IMB-3.2.2).. they went fine.. We
> >>want to execute the osu_benchamark3.1.1 suite on our OFED..
> >>
> >> On Wed, Feb 29, 2012 at 9:57 PM, Venkateswara Rao Dokku
> >> wrote:
> >> Hiii,
> >> I tried executing osu_benchamarks-3.1.1 suite with the openmpi-1.4.3...
> >>I could run 10 bench-mark tests (except osu_put_bibw,osu_put_bw,osu_
> >> get_bw,osu_latency_mt) out of 14 tests in the bench-mark suite... and
> >>the remaining tests are hanging at some message size.. the output is
> >>shown below
> >>
> >> [root@test2 ~]# mpirun --prefix /usr/local/ -np 2 --mca btl
> >>openib,self,sm -H 192.168.0.175,192.168.0.174 --mca
> >>orte_base_help_aggregate 0
> >>/root/ramu/ofed_pkgs/osu_benchmarks-3.1.1/osu_put_bibw
> >> failed to create doorbell file /dev/plx2_char_dev
> >>
> >>-
> >>-
> >> WARNING: No preset parameters were found for the device that Open MPI
> >> detected:
> >>
> >>   Local host:test1
> >>   Device name:   plx2_0
> >>   Device vendor ID:  0x10b5
> >>   Device vendor part ID: 4277
> >>
> >> Default device parameters will be used, which may result in lower
> >> performance.  You can edit any of the files specified by the
> >> btl_openib_device_param_files MCA parameter to set values for your
> >> device.
> >>
> >> NOTE: You can turn off this warning by setting the MCA parameter
> >>   btl_openib_warn_no_device_params_found to 0.
> >>
> >>-
> >>-
> >> failed to create doorbell file /dev/plx2_char_dev
> >>
> >>-
> >>-
> >> WARNING: No preset parameters were found for the device that Open MPI
> >> detected:
> >>
> >>   Local host:test2
> >>   Device name:   plx2_0
> >>   Device vendor ID:  0x10b5
> >>   Device vendor part ID: 4277
> >>
> >> Default device parameters will be used, which may result in lower
> >> performance.  You can edit any of the files specified by the
> >> btl_openib_device_param_files MCA parameter to set values for your
> >> device.
> >>
> >> NOTE: You can turn off this warning by setting the MCA parameter
> >>   btl_openib_warn_no_device_params_found to 0.
> >>
> >>-
> >>-
> >> alloc_srq max: 512 wqe_shift: 5
> >> alloc_srq max: 512 wqe_shift: 5
> >> alloc_srq max: 512 wqe_shift: 5
> >> alloc_srq max: 512 wqe_shift: 5
> >> alloc_srq max: 512 wqe_shift: 5
> >> alloc_srq max: 512 wqe_shift: 5
> >> # OSU One Sided MPI_Put Bi-directional Bandwidth Test v3.1.1
> >> # Size Bi-Bandwidth (MB/s)
> >> plx2_create_qp line: 415
> >> plx2_create_qp line: 415
> >> plx2_create_qp line: 415
> >> plx2_create_qp line: 415
> >> 1            0.00
> >> 2            0.00
> >> 4            0.01
> >> 8            0.03
> >> 16           0.07
> >> 32           0.15
> >> 64           0.11
> >> 128          0.21
> >> 256          0.43
> >> 512          0.88
> >> 1024         2.10
> >> 2048         4.21
> >> 4096         8.10
> >> 8192        16.19
> >> 16384        8.46
> >> 32768       20.34
> >> 65536       39.85
> >> 131072      84.22
> >> 262144     142.23
> >> 524288     234.83
> >> mpirun: killing job...
> >>
> >>
> >>-

Re: [OMPI users] Very slow MPI_GATHER

2012-02-29 Thread Jingcha Joba
Two things:
1. Too many MPI processes on one node, leading to processes pre-empting each
other.
2. Contention in your network.

On Wed, Feb 29, 2012 at 8:01 AM, Pinero, Pedro_jose <
pedro_jose.pin...@atmel.com> wrote:

> Hi,
>
> ** **
>
> I am using OMPI v.1.5.5 to communicate 200 processes in a 2-computer
> cluster connected through Ethernet, obtaining very poor performance. I
> have measured each operation time and I have realised that the MPI_Gather
> operation takes about 1 second in each synchronization (only an integer is
> sent in each case). Is this time range normal, or do I have a synchronization
> problem?  Is there any way to improve this performance?
>
> ** **
>
> Thank you for your help in advance
>
> ** **
>
> Pedro
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] [EXTERNAL] Re: Question regarding osu-benchamarks 3.1.1

2012-02-29 Thread Jeffrey Squyres
FWIW, if Brian says that our one-sided stuff is a bit buggy, I believe him 
(because he wrote it).  :-)

The fact is that the MPI-2 one-sided stuff is extremely complicated and 
somewhat open to interpretation.  In practice, I haven't seen the MPI-2 
one-sided stuff used much in the wild.  The MPI-3 working group just revamped 
the one-sided support and generally made it much mo'betta.  Brian is 
re-implementing that stuff, and I believe it'll also be much mo'betta.

My point: I wouldn't worry if not all one-sided benchmarks run with OMPI.  No 
one uses them (yet) anyway.
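
For anyone curious what the pattern under test looks like, here is a minimal, generic MPI-2 one-sided sketch (a fence-synchronized MPI_Put into a one-integer window; the program is my own illustration, not taken from the OSU suite or from Open MPI):

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  int window_buf = -1;                    // memory exposed by every rank
  MPI_Win win;
  MPI_Win_create(&window_buf, sizeof(int), sizeof(int),
                 MPI_INFO_NULL, MPI_COMM_WORLD, &win);

  MPI_Win_fence(0, win);                  // open the access epoch
  if (rank == 1) {                        // needs at least 2 ranks to do anything
    int value = 42;
    // deposit 'value' into rank 0's window at displacement 0
    MPI_Put(&value, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
  }
  MPI_Win_fence(0, win);                  // close the epoch; the data is now visible

  if (rank == 0)
    std::printf("window_buf = %d\n", window_buf);

  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}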


On Feb 29, 2012, at 1:42 PM, Jingcha Joba wrote:

> When I ran my osu tests , I was able to get the numbers out of all the tests 
> except latency_mt (which was obvious, as I didnt compile open-mpi with multi 
> threaded support).
> A good way to know if the problem is with openmpi or with your custom OFED 
> stack would be to use some other device like tcp instead of ib and rerun 
> these one sided comm tests.
> On Wed, Feb 29, 2012 at 10:04 AM, Barrett, Brian W  wrote:
> I'm pretty sure that they are correct.  Our one-sided implementation is
> buggier than I'd like (indeed, I'm in the process of rewriting most of it
> as part of Open MPI's support for MPI-3's revised RDMA), so it's likely
> that the bugs are in Open MPI's onesided support.  Can you try a more
> recent release (something from the 1.5 tree) and see if the problem
> persists?
> 
> Thanks,
> 
> Brian
> 
> On 2/29/12 10:56 AM, "Jeffrey Squyres"  wrote:
> 
> >FWIW, I'm immediately suspicious of *any* MPI application that uses the
> >MPI one-sided operations (i.e., MPI_PUT and MPI_GET).  It looks like
> >these two OSU benchmarks are using those operations.
> >
> >Is it known that these two benchmarks are correct?
> >
> >
> >
> >On Feb 29, 2012, at 11:33 AM, Venkateswara Rao Dokku wrote:
> >
> >> Sorry, i forgot to introduce the system.. Ours is the customized OFED
> >>stack implemented to work on the specific hardware.. We tested the stack
> >>with the q-perf and Intel Benchmarks(IMB-3.2.2).. they went fine.. We
> >>want to execute the osu_benchamark3.1.1 suite on our OFED..
> >>
> >> On Wed, Feb 29, 2012 at 9:57 PM, Venkateswara Rao Dokku
> >> wrote:
> >> Hiii,
> >> I tried executing osu_benchamarks-3.1.1 suite with the openmpi-1.4.3...
> >>I could run 10 bench-mark tests (except osu_put_bibw,osu_put_bw,osu_
> >> get_bw,osu_latency_mt) out of 14 tests in the bench-mark suite... and
> >>the remaining tests are hanging at some message size.. the output is
> >>shown below
> >>
> >> [root@test2 ~]# mpirun --prefix /usr/local/ -np 2 --mca btl
> >>openib,self,sm -H 192.168.0.175,192.168.0.174 --mca
> >>orte_base_help_aggregate 0
> >>/root/ramu/ofed_pkgs/osu_benchmarks-3.1.1/osu_put_bibw
> >> failed to create doorbell file /dev/plx2_char_dev
> >>
> >>-
> >>-
> >> WARNING: No preset parameters were found for the device that Open MPI
> >> detected:
> >>
> >>   Local host:test1
> >>   Device name:   plx2_0
> >>   Device vendor ID:  0x10b5
> >>   Device vendor part ID: 4277
> >>
> >> Default device parameters will be used, which may result in lower
> >> performance.  You can edit any of the files specified by the
> >> btl_openib_device_param_files MCA parameter to set values for your
> >> device.
> >>
> >> NOTE: You can turn off this warning by setting the MCA parameter
> >>   btl_openib_warn_no_device_params_found to 0.
> >>
> >>-
> >>-
> >> failed to create doorbell file /dev/plx2_char_dev
> >>
> >>-
> >>-
> >> WARNING: No preset parameters were found for the device that Open MPI
> >> detected:
> >>
> >>   Local host:test2
> >>   Device name:   plx2_0
> >>   Device vendor ID:  0x10b5
> >>   Device vendor part ID: 4277
> >>
> >> Default device parameters will be used, which may result in lower
> >> performance.  You can edit any of the files specified by the
> >> btl_openib_device_param_files MCA parameter to set values for your
> >> device.
> >>
> >> NOTE: You can turn off this warning by setting the MCA parameter
> >>   btl_openib_warn_no_device_params_found to 0.
> >>
> >>-
> >>-
> >> alloc_srq max: 512 wqe_shift: 5
> >> alloc_srq max: 512 wqe_shift: 5
> >> alloc_srq max: 512 wqe_shift: 5
> >> alloc_srq max: 512 wqe_shift: 5
> >> alloc_srq max: 512 wqe_shift: 5
> >> alloc_srq max: 512 wqe_shift: 5
> >> # OSU One Sided MPI_Put Bi-directional Bandwidth Test v3.1.1
> >> # Size Bi-Bandwidth (MB/s)
> >> plx2_create_qp line: 415
> >> plx2_create_qp line: 415
> >> plx2_create_qp line: 415
> >> plx2_create_qp line: 415
> >> 1 0.00
> >> 2 0.00
> >> 4 0.01
> >> 8

Re: [OMPI users] [EXTERNAL] Re: Question regarding osu-benchamarks 3.1.1

2012-02-29 Thread Jingcha Joba
Squyres,
I thought RDMA read and write are implemented as one-sided communication
using get and put, respectively.
Is that not so?

On Wed, Feb 29, 2012 at 10:49 AM, Jeffrey Squyres wrote:

> FWIW, if Brian says that our one-sided stuff is a bit buggy, I believe him
> (because he wrote it).  :-)
>
> The fact is that the MPI-2 one-sided stuff is extremely complicated and
> somewhat open to interpretation.  In practice, I haven't seen the MPI-2
> one-sided stuff used much in the wild.  The MPI-3 working group just
> revamped the one-sided support and generally made it much mo'betta.  Brian
> is re-implementing that stuff, and I believe it'll also be much mo'betta.
>
> My point: I wouldn't worry if not all one-sided benchmarks run with OMPI.
>  No one uses them (yet) anyway.
>
>
> On Feb 29, 2012, at 1:42 PM, Jingcha Joba wrote:
>
> > When I ran my osu tests , I was able to get the numbers out of all the
> tests except latency_mt (which was obvious, as I didnt compile open-mpi
> with multi threaded support).
> > A good way to know if the problem is with openmpi or with your custom
> OFED stack would be to use some other device like tcp instead of ib and
> rerun these one sided comm tests.
> > On Wed, Feb 29, 2012 at 10:04 AM, Barrett, Brian W 
> wrote:
> > I'm pretty sure that they are correct.  Our one-sided implementation is
> > buggier than I'd like (indeed, I'm in the process of rewriting most of it
> > as part of Open MPI's support for MPI-3's revised RDMA), so it's likely
> > that the bugs are in Open MPI's onesided support.  Can you try a more
> > recent release (something from the 1.5 tree) and see if the problem
> > persists?
> >
> > Thanks,
> >
> > Brian
> >
> > On 2/29/12 10:56 AM, "Jeffrey Squyres"  wrote:
> >
> > >FWIW, I'm immediately suspicious of *any* MPI application that uses the
> > >MPI one-sided operations (i.e., MPI_PUT and MPI_GET).  It looks like
> > >these two OSU benchmarks are using those operations.
> > >
> > >Is it known that these two benchmarks are correct?
> > >
> > >
> > >
> > >On Feb 29, 2012, at 11:33 AM, Venkateswara Rao Dokku wrote:
> > >
> > >> Sorry, i forgot to introduce the system.. Ours is the customized OFED
> > >>stack implemented to work on the specific hardware.. We tested the
> stack
> > >>with the q-perf and Intel Benchmarks(IMB-3.2.2).. they went fine.. We
> > >>want to execute the osu_benchamark3.1.1 suite on our OFED..
> > >>
> > >> On Wed, Feb 29, 2012 at 9:57 PM, Venkateswara Rao Dokku
> > >> wrote:
> > >> Hiii,
> > >> I tried executing osu_benchamarks-3.1.1 suite with the
> openmpi-1.4.3...
> > >>I could run 10 bench-mark tests (except osu_put_bibw,osu_put_bw,osu_
> > >> get_bw,osu_latency_mt) out of 14 tests in the bench-mark suite... and
> > >>the remaining tests are hanging at some message size.. the output is
> > >>shown below
> > >>
> > >> [root@test2 ~]# mpirun --prefix /usr/local/ -np 2 --mca btl
> > >>openib,self,sm -H 192.168.0.175,192.168.0.174 --mca
> > >>orte_base_help_aggregate 0
> > >>/root/ramu/ofed_pkgs/osu_benchmarks-3.1.1/osu_put_bibw
> > >> failed to create doorbell file /dev/plx2_char_dev
> > >>
> >
> >>-
> > >>-
> > >> WARNING: No preset parameters were found for the device that Open MPI
> > >> detected:
> > >>
> > >>   Local host:test1
> > >>   Device name:   plx2_0
> > >>   Device vendor ID:  0x10b5
> > >>   Device vendor part ID: 4277
> > >>
> > >> Default device parameters will be used, which may result in lower
> > >> performance.  You can edit any of the files specified by the
> > >> btl_openib_device_param_files MCA parameter to set values for your
> > >> device.
> > >>
> > >> NOTE: You can turn off this warning by setting the MCA parameter
> > >>   btl_openib_warn_no_device_params_found to 0.
> > >>
> >
> >>-
> > >>-
> > >> failed to create doorbell file /dev/plx2_char_dev
> > >>
> >
> >>-
> > >>-
> > >> WARNING: No preset parameters were found for the device that Open MPI
> > >> detected:
> > >>
> > >>   Local host:test2
> > >>   Device name:   plx2_0
> > >>   Device vendor ID:  0x10b5
> > >>   Device vendor part ID: 4277
> > >>
> > >> Default device parameters will be used, which may result in lower
> > >> performance.  You can edit any of the files specified by the
> > >> btl_openib_device_param_files MCA parameter to set values for your
> > >> device.
> > >>
> > >> NOTE: You can turn off this warning by setting the MCA parameter
> > >>   btl_openib_warn_no_device_params_found to 0.
> > >>
> >
> >>-
> > >>-
> > >> alloc_srq max: 512 wqe_shift: 5
> > >> alloc_srq max: 512 wqe_shift: 5
> > >> alloc_srq max: 512 wqe_shift: 5
> > >> alloc_srq max: 512 wqe_shift: 5
> > >> alloc_srq max: 512 wqe_

Re: [OMPI users] orted daemon no found! --- environment not passed to slave nodes

2012-02-29 Thread Jeffrey Squyres
Gah.  I didn't realize that my 1.4.x build was a *developer* build.  
*Developer* builds give a *lot* more detail with plm_base_verbose=100 
(including the specific rsh command being used).  You obviously didn't get that 
output because you don't have a developer build.  :-\

Just for reference, here's what plm_base_verbose=100 tells me for running an 
orted on a remote node, when I use the --prefix option to mpirun (I'm a tcsh 
user, so the syntax below will be a little different than what is running in 
your environment):

-
[svbu-mpi:28527] [[20181,0],0] plm:rsh: executing: (//usr/bin/ssh) 
[/usr/bin/ssh svbu-mpi001  set path = ( /home/jsquyres/bogus/bin $path ) ; if ( 
$?LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH == 0 ) 
setenv LD_LIBRARY_PATH /home/jsquyres/bogus/lib ; if ( $?OMPI_have_llp == 1 ) 
setenv LD_LIBRARY_PATH /home/jsquyres/bogus/lib:$LD_LIBRARY_PATH ;  
/home/jsquyres/bogus/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 
1322582016 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri 
"1322582016.0;tcp://172.29.218.140:34815;tcp://10.148.255.1:34815" --mca 
plm_base_verbose 100]
-

Ok, a few options here:

1. You can get a developer build if you use the --enable-debug option to 
configure.  Then plm_base_verbose=100 will give a lot more info.  Remember, the 
goal here is to see what's going wrong -- not to depend on having a developer 
build around.

2. If that isn't workable, make an "orted" in your default path somewhere 
that's a short script:

-
:
echo ===environment===
env | sort
echo ===environment end===
sleep 1000
-

Then when you "mpirun", do a "ps" to see exactly what was executed on the node 
where mpirun was invoked and the node where orted is supposed to be running.  
It's not quite as descriptive as seeing the plm_base_verbose output because we 
run multiple shell commands, but it's something.  You'll also see the stdout 
from the local node.  You'll need to use the --leave-session-attached option to 
mpirun to see the output from the remote nodes.
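
To make that concrete, the sequence might look like the following (the install prefix, node name, and rank count are placeholders, not taken from your setup):

# option 1: a developer build, which makes plm_base_verbose=100 much more verbose
./configure --prefix=$HOME/openmpi-debug --enable-debug
make all install

# option 2: with the dummy "orted" script first in the PATH on every node
mpirun --leave-session-attached --mca plm_base_verbose 100 -np 2 -H node1 hostname
# ...then, from another shell, on both the local node and node1:
ps -ef | grep orted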


On Feb 29, 2012, at 9:43 AM, Yiguang Yan wrote:

> Hi Jeff,
> 
> Thanks.
> 
> I tried what you suggested. Here is the output:
> 
 
> yiguang@gulftown testdmp]$ ./test.bash
> [gulftown:25052] mca: base: components_open: Looking for plm 
> components
> [gulftown:25052] mca: base: components_open: opening plm 
> components
> [gulftown:25052] mca: base: components_open: found loaded 
> component rsh
> [gulftown:25052] mca: base: components_open: component rsh 
> has no register function
> [gulftown:25052] mca: base: components_open: component rsh 
> open function successful
> [gulftown:25052] mca: base: components_open: found loaded 
> component slurm
> [gulftown:25052] mca: base: components_open: component slurm 
> has no register function
> [gulftown:25052] mca: base: components_open: component slurm 
> open function successful
> [gulftown:25052] mca: base: components_open: found loaded 
> component tm
> [gulftown:25052] mca: base: components_open: component tm 
> has no register function
> [gulftown:25052] mca: base: components_open: component tm 
> open function successful
> [gulftown:25052] mca:base:select: Auto-selecting plm components
> [gulftown:25052] mca:base:select:(  plm) Querying component [rsh]
> [gulftown:25052] mca:base:select:(  plm) Query of component [rsh] 
> set priority to 10
> [gulftown:25052] mca:base:select:(  plm) Querying component 
> [slurm]
> [gulftown:25052] mca:base:select:(  plm) Skipping component 
> [slurm]. Query failed to return a module
> [gulftown:25052] mca:base:select:(  plm) Querying component [tm]
> [gulftown:25052] mca:base:select:(  plm) Skipping component [tm]. 
> Query failed to return a module
> [gulftown:25052] mca:base:select:(  plm) Selected component [rsh]
> [gulftown:25052] mca: base: close: component slurm closed
> [gulftown:25052] mca: base: close: unloading component slurm
> [gulftown:25052] mca: base: close: component tm closed
> [gulftown:25052] mca: base: close: unloading component tm
> bash: orted: command not found
> bash: orted: command not found
> bash: orted: command not found
> <<<
> 
> 
> The following is the content of test.bash:
 
> yiguang@gulftown testdmp]$ ./test.bash
> #!/bin/sh -f
> #nohup
> #
> # 
> >---
> <
> adinahome=/usr/adina/system8.8dmp
> mpirunfile=$adinahome/bin/mpirun
> #
> # Set envars for mpirun and orted
> #
> export PATH=$adinahome/bin:$adinahome/tools:$PATH
> export LD_LIBRARY_PATH=$adinahome/lib:$LD_LIBRARY_PATH
> #
> #
> # run DMP problem
> #
> mcaprefix="--prefix $adinahome"
> mcarshagent="--mca plm_rsh_agent rsh:ssh"
> mcatmpdir="--mca orte_tmpdir_base /tmp"
> mcaopenibmsg="--mca btl_openib_warn_default_gid_prefix 0"
> mcaenvars="-x PATH -x LD_LIBRARY_PATH"
> mcabtlconn="--mca btl openib,sm,self"
> mcaplmbase="--mca plm_base_verbose 100"
> 
> mcaparams="$mcapre

Re: [OMPI users] [EXTERNAL] Re: Question regarding osu-benchamarks 3.1.1

2012-02-29 Thread Jeffrey Squyres
On Feb 29, 2012, at 2:30 PM, Jingcha Joba wrote:

> Squyres,
> I thought RDMA read and write are implemented as one side communication using 
> get and put respectively..
> Is it not so? 

Yes and no.

Keep in mind the difference between two things here:

- An an underlying transport's one-sided capabilities (e.g., using InfiniBand 
RDMA reads/writes)
- MPI one-sided and/or two-sided message passing

Most OpenFabrics-capable MPI's use OF RDMA reads and writes for sending large 
messages (both one and two sided).  But it's not always the case.  For example, 
it may not be worth it to use RDMA for short messages because of the cost of 
registering memory, negotiating the target address for the RDMA read/write 
(which may require a round-tip ACK), etc.

So OF-capable MPI's basically divorce the two issues.  The underlying transport 
will choose the "best" method (whether it's a send/recv style exchange, an 
RDMA-style exchange, or a mixture of the two).
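
As an aside, the crossover points the openib BTL uses show up as MCA parameters; on a 1.4/1.5-era build, something like the following should list the relevant ones (the grep pattern is only a convenience):

ompi_info --param btl openib | grep -i -e eager -e rdma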

Make sense?


> On Wed, Feb 29, 2012 at 10:49 AM, Jeffrey Squyres  wrote:
> FWIW, if Brian says that our one-sided stuff is a bit buggy, I believe him 
> (because he wrote it).  :-)
> 
> The fact is that the MPI-2 one-sided stuff is extremely complicated and 
> somewhat open to interpretation.  In practice, I haven't seen the MPI-2 
> one-sided stuff used much in the wild.  The MPI-3 working group just revamped 
> the one-sided support and generally made it much mo'betta.  Brian is 
> re-implementing that stuff, and I believe it'll also be much mo'betta.
> 
> My point: I wouldn't worry if not all one-sided benchmarks run with OMPI.  No 
> one uses them (yet) anyway.
> 
> 
> On Feb 29, 2012, at 1:42 PM, Jingcha Joba wrote:
> 
> > When I ran my osu tests , I was able to get the numbers out of all the 
> > tests except latency_mt (which was obvious, as I didnt compile open-mpi 
> > with multi threaded support).
> > A good way to know if the problem is with openmpi or with your custom OFED 
> > stack would be to use some other device like tcp instead of ib and rerun 
> > these one sided comm tests.
> > On Wed, Feb 29, 2012 at 10:04 AM, Barrett, Brian W  
> > wrote:
> > I'm pretty sure that they are correct.  Our one-sided implementation is
> > buggier than I'd like (indeed, I'm in the process of rewriting most of it
> > as part of Open MPI's support for MPI-3's revised RDMA), so it's likely
> > that the bugs are in Open MPI's onesided support.  Can you try a more
> > recent release (something from the 1.5 tree) and see if the problem
> > persists?
> >
> > Thanks,
> >
> > Brian
> >
> > On 2/29/12 10:56 AM, "Jeffrey Squyres"  wrote:
> >
> > >FWIW, I'm immediately suspicious of *any* MPI application that uses the
> > >MPI one-sided operations (i.e., MPI_PUT and MPI_GET).  It looks like
> > >these two OSU benchmarks are using those operations.
> > >
> > >Is it known that these two benchmarks are correct?
> > >
> > >
> > >
> > >On Feb 29, 2012, at 11:33 AM, Venkateswara Rao Dokku wrote:
> > >
> > >> Sorry, i forgot to introduce the system.. Ours is the customized OFED
> > >>stack implemented to work on the specific hardware.. We tested the stack
> > >>with the q-perf and Intel Benchmarks(IMB-3.2.2).. they went fine.. We
> > >>want to execute the osu_benchamark3.1.1 suite on our OFED..
> > >>
> > >> On Wed, Feb 29, 2012 at 9:57 PM, Venkateswara Rao Dokku
> > >> wrote:
> > >> Hiii,
> > >> I tried executing osu_benchamarks-3.1.1 suite with the openmpi-1.4.3...
> > >>I could run 10 bench-mark tests (except osu_put_bibw,osu_put_bw,osu_
> > >> get_bw,osu_latency_mt) out of 14 tests in the bench-mark suite... and
> > >>the remaining tests are hanging at some message size.. the output is
> > >>shown below
> > >>
> > >> [root@test2 ~]# mpirun --prefix /usr/local/ -np 2 --mca btl
> > >>openib,self,sm -H 192.168.0.175,192.168.0.174 --mca
> > >>orte_base_help_aggregate 0
> > >>/root/ramu/ofed_pkgs/osu_benchmarks-3.1.1/osu_put_bibw
> > >> failed to create doorbell file /dev/plx2_char_dev
> > >>
> > >>-
> > >>-
> > >> WARNING: No preset parameters were found for the device that Open MPI
> > >> detected:
> > >>
> > >>   Local host:test1
> > >>   Device name:   plx2_0
> > >>   Device vendor ID:  0x10b5
> > >>   Device vendor part ID: 4277
> > >>
> > >> Default device parameters will be used, which may result in lower
> > >> performance.  You can edit any of the files specified by the
> > >> btl_openib_device_param_files MCA parameter to set values for your
> > >> device.
> > >>
> > >> NOTE: You can turn off this warning by setting the MCA parameter
> > >>   btl_openib_warn_no_device_params_found to 0.
> > >>
> > >>-
> > >>-
> > >> failed to create doorbell file /dev/plx2_char_dev
> > >>
> > >>-
> > >>-
> > >> WARNING: No preset parameters were f

Re: [OMPI users] [EXTERNAL] Re: Question regarding osu-benchamarks 3.1.1

2012-02-29 Thread Jingcha Joba
So if I understand correctly, if a message size is smaller than some threshold it
will use the MPI way (non-RDMA, two-way communication), and if it is larger, then
it would use OpenFabrics, using ibverbs (and the OFED stack) instead of
MPI's stack?

If so, could that be the reason why the MPI_Put "hangs" when sending a
message larger than 512KB (or maybe 1MB)?
Also, is there a way to know whether, for a particular MPI call, OF uses a
send/recv or an RDMA exchange?
On Wed, Feb 29, 2012 at 11:36 AM, Jeffrey Squyres wrote:

> On Feb 29, 2012, at 2:30 PM, Jingcha Joba wrote:
>
> > Squyres,
> > I thought RDMA read and write are implemented as one side communication
> using get and put respectively..
> > Is it not so?
>
> Yes and no.
>
> Keep in mind the difference between two things here:
>
> - An an underlying transport's one-sided capabilities (e.g., using
> InfiniBand RDMA reads/writes)
> - MPI one-sided and/or two-sided message passing
>
> Most OpenFabrics-capable MPI's use OF RDMA reads and writes for sending
> large messages (both one and two sided).  But it's not always the case.
>  For example, it may not be worth it to use RDMA for short messages because
> of the cost of registering memory, negotiating the target address for the
> RDMA read/write (which may require a round-tip ACK), etc.
>
> So OF-capable MPI's basically divorce the two issues.  The underlying
> transport will choose the "best" method (whether it's a send/recv style
> exchange, an RDMA-stle exchange, or a mixture of the two).
>
> Make sense?
>
>
> > On Wed, Feb 29, 2012 at 10:49 AM, Jeffrey Squyres 
> wrote:
> > FWIW, if Brian says that our one-sided stuff is a bit buggy, I believe
> him (because he wrote it).  :-)
> >
> > The fact is that the MPI-2 one-sided stuff is extremely complicated and
> somewhat open to interpretation.  In practice, I haven't seen the MPI-2
> one-sided stuff used much in the wild.  The MPI-3 working group just
> revamped the one-sided support and generally made it much mo'betta.  Brian
> is re-implementing that stuff, and I believe it'll also be much mo'betta.
> >
> > My point: I wouldn't worry if not all one-sided benchmarks run with
> OMPI.  No one uses them (yet) anyway.
> >
> >
> > On Feb 29, 2012, at 1:42 PM, Jingcha Joba wrote:
> >
> > > When I ran my osu tests , I was able to get the numbers out of all the
> tests except latency_mt (which was obvious, as I didnt compile open-mpi
> with multi threaded support).
> > > A good way to know if the problem is with openmpi or with your custom
> OFED stack would be to use some other device like tcp instead of ib and
> rerun these one sided comm tests.
> > > On Wed, Feb 29, 2012 at 10:04 AM, Barrett, Brian W 
> wrote:
> > > I'm pretty sure that they are correct.  Our one-sided implementation is
> > > buggier than I'd like (indeed, I'm in the process of rewriting most of
> it
> > > as part of Open MPI's support for MPI-3's revised RDMA), so it's likely
> > > that the bugs are in Open MPI's onesided support.  Can you try a more
> > > recent release (something from the 1.5 tree) and see if the problem
> > > persists?
> > >
> > > Thanks,
> > >
> > > Brian
> > >
> > > On 2/29/12 10:56 AM, "Jeffrey Squyres"  wrote:
> > >
> > > >FWIW, I'm immediately suspicious of *any* MPI application that uses
> the
> > > >MPI one-sided operations (i.e., MPI_PUT and MPI_GET).  It looks like
> > > >these two OSU benchmarks are using those operations.
> > > >
> > > >Is it known that these two benchmarks are correct?
> > > >
> > > >
> > > >
> > > >On Feb 29, 2012, at 11:33 AM, Venkateswara Rao Dokku wrote:
> > > >
> > > >> Sorry, i forgot to introduce the system.. Ours is the customized
> OFED
> > > >>stack implemented to work on the specific hardware.. We tested the
> stack
> > > >>with the q-perf and Intel Benchmarks(IMB-3.2.2).. they went fine.. We
> > > >>want to execute the osu_benchamark3.1.1 suite on our OFED..
> > > >>
> > > >> On Wed, Feb 29, 2012 at 9:57 PM, Venkateswara Rao Dokku
> > > >> wrote:
> > > >> Hiii,
> > > >> I tried executing osu_benchamarks-3.1.1 suite with the
> openmpi-1.4.3...
> > > >>I could run 10 bench-mark tests (except osu_put_bibw,osu_put_bw,osu_
> > > >> get_bw,osu_latency_mt) out of 14 tests in the bench-mark suite...
> and
> > > >>the remaining tests are hanging at some message size.. the output is
> > > >>shown below
> > > >>
> > > >> [root@test2 ~]# mpirun --prefix /usr/local/ -np 2 --mca btl
> > > >>openib,self,sm -H 192.168.0.175,192.168.0.174 --mca
> > > >>orte_base_help_aggregate 0
> > > >>/root/ramu/ofed_pkgs/osu_benchmarks-3.1.1/osu_put_bibw
> > > >> failed to create doorbell file /dev/plx2_char_dev
> > > >>
> > >
> >>-
> > > >>-
> > > >> WARNING: No preset parameters were found for the device that Open
> MPI
> > > >> detected:
> > > >>
> > > >>   Local host:test1
> > > >>   Device name:   plx2_0
> > > >>   Device vendor ID:  0x10b5
> > > >>   Device vendor part ID: 42

Re: [OMPI users] [EXTERNAL] Re: Question regarding osu-benchamarks 3.1.1

2012-02-29 Thread Jeffrey Squyres
On Feb 29, 2012, at 2:57 PM, Jingcha Joba wrote:

> So if I understand correctly, if a message size is smaller than it will use 
> the MPI way (non-RDMA, 2 way communication), if its larger, then it would use 
> the Open Fabrics, by using the ibverbs (and ofed stack) instead of using the 
> MPI's stack?

Er... no.

So let's talk MPI-over-OpenFabrics-verbs specifically.

All MPI communication calls will use verbs under the covers.  They may use 
verbs send/receive semantics in some cases, and RDMA semantics in other cases.  
"It depends" -- on a lot of things, actually.  It's hard to come up with a good 
rule of thumb for when it uses one or the other; this is one of the reasons 
that the openib BTL code is so complex.  :-)

The main points here are:

1. you can trust the openib BTL to do the Best thing possible to get the 
message to the other side.  Regardless of whether that message is an MPI_SEND 
or an MPI_PUT (for example).

2. MPI_PUT does not necessarily == verbs RDMA write (and likewise, MPI_GET does 
not necessarily == verbs RDMA read).

> If so, could that be the reason why the MPI_Put "hangs" when sending a 
> message more than 512KB (or may be 1MB)?

No.  I'm guessing that there's some kind of bug in the MPI_PUT implementation.

> Also is there a way to know if for a particular MPI call, OF uses send/recv 
> or RDMA exchange?

Not really.

More specifically: all things being equal, you don't care which is used.  You 
just want your message to get to the receiver/target as fast as possible.  One 
of the main ideas of MPI is to hide those kinds of details from the user.  
I.e., you call MPI_SEND.  A miracle occurs.  The message is received on the 
other side.

:-)

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI users] ssh between nodes

2012-02-29 Thread Denver Smith
Hello,

On my cluster running Moab and Torque, I cannot ssh without a password between 
compute nodes. I can, however, request multiple-node jobs fine. I was wondering 
if passwordless ssh keys need to be set up between compute nodes in order for 
MPI applications to run correctly.

Thanks


Re: [OMPI users] ssh between nodes

2012-02-29 Thread Randall Svancara
Depends on which launcher you are using.  My understanding is that you can
use torque to launch the MPI processes on remote nodes, but you must
compile this support into OpenMPI.  Please, someone correct me if I am
wrong.

For most clusters I work with and manage, we use passwordless keys.  The
reason is that sometimes MPI implementations, like those provided by many
vendors, do not supply the requisite functionality to integrate with
Torque, such as Intel's OpenMPI tools or Comsol's bundled MPI
implementation, for example.

So really, it boils down to your needs.

Thanks

Randall

On Wed, Feb 29, 2012 at 1:09 PM, Denver Smith  wrote:

>  Hello,
>
>  On my cluster running moab and torque, I cannot ssh without a password
> between compute nodes. I can however request multiple node jobs fine. I was
> wondering if passwordless ssh keys need to be set up between compute nodes
> in order for mpi applications to run correctly.
>
>  Thanks
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Randall Svancara
Know Your Linux? 


Re: [OMPI users] ssh between nodes

2012-02-29 Thread Lloyd Brown
It really depends.  You certainly CAN have mpirun/mpiexec use ssh to
launch the remote processes.  If you're using Torque, though, I strongly
recommend using the hooks in Open MPI into the Torque TM API (see
http://www.open-mpi.org/faq/?category=building#build-rte-tm).  That will
use the pbs_moms themselves to launch all the processes, which has
several advantages.

Using the TM-API for job launch means that remote processes will be
children of the Torque pbs_mom process, not the sshd process, which
means that Torque will be able to do a better job at killing rogue
processes, reporting resources utilized, etc.
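
In practice that just means pointing Open MPI's configure at your Torque installation
when you build it, roughly like this (the Torque prefix below is only a placeholder):

./configure --with-tm=/usr/local/torque --prefix=/usr/local/openmpi ...
make all install

After that, an mpirun started inside a Torque job launches the remote ranks through
the pbs_moms, with no ssh setup needed.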

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 02/29/2012 02:09 PM, Denver Smith wrote:
> Hello,
> 
> On my cluster running moab and torque, I cannot ssh without a password
> between compute nodes. I can however request multiple node jobs fine. I
> was wondering if passwordless ssh keys need to be set up between compute
> nodes in order for mpi applications to run correctly.
> 
> Thanks
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] ssh between nodes

2012-02-29 Thread Martin Siegert
Hi,

On Wed, Feb 29, 2012 at 09:09:27PM +, Denver Smith wrote:
> 
>Hello,
>On my cluster running moab and torque, I cannot ssh without a password
>between compute nodes. I can however request multiple node jobs fine. I
>was wondering if passwordless ssh keys need to be set up between
>compute nodes in order for mpi applications to run correctly.
>Thanks

No, passwordless ssh keys are not needed. In fact, I strongly advise
against using them (teaching users how to generate passwordless
ssh keys creates security problems: they start using them not just
for connecting to compute nodes). There are several alternatives:

1) use openmpi's hooks into torque (use the --with-tm configure option);
2) use ssh hostbased authentication (and set IgnoreUserKnownHosts to yes);
3) use rsh (works if your cluster is sufficiently small).

I prefer any of these (in decreasing order) over passwordless ssh keys.

Cheers,
Martin

-- 
Martin Siegert
Simon Fraser University
Burnaby, British Columbia


Re: [OMPI users] InfiniBand path migration not working

2012-02-29 Thread Jeremy
Hi Pasha,

>On Wed, Feb 29, 2012 at 11:02 AM, Shamis, Pavel  wrote:
>
> I would like to see all the file.
> 28MB is it the size after compression ?
>
> I think gmail supports up to 25Mb.
> You may try to create gzip file and then slice it using "split" command.

See attached. At about line 151311 is when I unplugged the cable from
Port 1. Then I see the APM error message at about line 178905.

Thanks,

-Jeremy


debug.txt.bz2
Description: BZip2 compressed data


Re: [OMPI users] Hybrid OpenMPI / OpenMP programming

2012-02-29 Thread Ralph Castain
It sounds like you are running into an issue with the Linux scheduler. I have 
an item to add an API "bind-this-thread-to-", but that won't be 
available until sometime in the future.

A couple of things you could try in the meantime. First, use the --cpus-per-rank 
option to separate the ranks from each other. In other words, instead of 
--bind-to-socket -bysocket, you do:

-bind-to-core -cpus-per-rank N

This will take each rank and bind it to a unique set of N cores, thereby 
cleanly separating them on the node.

Second, the Linux scheduler tends to become jealous of the way MPI procs "hog" 
the resources. The scheduler needs room to run all those daemons and other 
processes too. So it tends to squeeze you aside a little, just to create some 
room for the rest of the stuff.

What you can do is "entice" it away from your processes by leaving 1-2 cores 
for its own use. For example:

-npernode 2 -bind-to-core -cpus-per-rank 3

would run two MPI ranks on each node, each rank exclusively bound to 3 cores. 
This leaves 2 cores on each node for Linux. When the scheduler sees the 6 cores 
of your MPI/MP procs working hard, and 2 cores sitting idle, it will tend to 
use those 2 cores for everything else - and not be tempted to push you aside to 
gain access to "your" cores.
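
Putting that together for the nodes you describe (two sockets of four cores each), a
complete launch line might look something like this; the rank count, executable name,
and OMP_NUM_THREADS value are only illustrations:

mpirun -np 16 -npernode 2 -bind-to-core -cpus-per-rank 3 \
    -x OMP_NUM_THREADS=3 ./your_hybrid_code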

HTH
Ralph

On Feb 29, 2012, at 3:08 AM, Auclair Francis wrote:

> Dear Open-MPI users,
> 
> Our code is currently running Open MPI (1.5.4) with SLURM on a NUMA machine 
> (2 sockets per node and 4 cores per socket) with basically two
> levels of implementation for Open-MPI:
> - at the lower level, n "Master" MPI processes (one per socket) are
> run simultaneously by classically dividing the physical domain into n
> sub-domains,
> - while at the higher level, 4n MPI processes are spawned to run a sparse Poisson 
> solver.
> At each time step, the code is thus going back and forth between these two 
> levels of implementation using two MPI communicators. This also means that 
> during about half of the computation time, 3n cores are at best sleeping (if 
> not 'waiting' at a barrier) when not inside "Solver routines". We 
> consequently decided to implement OpenMP functionality in our code when 
> solver was not running (we declare one single "parallel" region and use the 
> omp "master" command when OpenMP threads are not active). We however face 
> several difficulties:
> 
> a) It seems that both the 3n MPI processes and the OpenMP threads 'consume 
> processor cycles while waiting'. We consequently tried: mpirun
> -mpi_yield_when_idle 1, export OMP_WAIT_POLICY=passive or export
> KMP_BLOCKTIME=0 ... The last of these finally leads to an interesting reduction
> of computing time but worsens the second problem we have to face (see
> below).
> 
> b) We managed to get a "correct" (?) placement of our MPI processes
> on our sockets by using: mpirun -bind-to-socket -bysocket -np 4n 
> However, while the OpenMP threads initially seem to scatter across each socket (one
> thread per core), they slowly migrate to the same core as their 'Master MPI 
> process' or gather on one or two cores per socket.
> We played around with the environment variable KMP_AFFINITY, but the best we 
> could obtain was a pinning of the OpenMP threads to their own core, 
> disorganizing at the same time the placement of the 4n Level-2 MPI 
> processes. In addition, neither the specification of a rankfile nor the mpirun 
> option -x IPATH_NO_CPUAFFINITY=1 seems to change the situation significantly.
> This behaviour looks rather inefficient, but so far we have not managed to 
> prevent the migration of the 4 threads onto at most a couple of cores!
> 
> Is there something wrong in our "Hybrid" implementation?
> Do you have any advice?
> Thanks for your help,
> Francis
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Could not execute the executable "/home/MET/hrm/bin/hostlist": Exec format error

2012-02-29 Thread Syed Ahsan Ali
Sorry Jeff, I couldn't get your point.

On Wed, Feb 29, 2012 at 4:27 PM, Jeffrey Squyres  wrote:

> On Feb 29, 2012, at 2:17 AM, Syed Ahsan Ali wrote:
>
> > [pmdtest@pmd02 d00_dayfiles]$ echo ${MPIRUN} -np ${NPROC} -hostfile
> $i{ABSDIR}/hostlist -mca btl sm,openib,self --mca btl_openib_use_srq 1
> ./hrm >> ${OUTFILE}_hrm 2>&1
> > [pmdtest@pmd02 d00_dayfiles]$
>
> Because you used >> and 2>&1, the output went to your ${OUTFILE}_hrm file,
> not stdout.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>


-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014