Re: [OMPI users] Threading models with openib

2010-06-18 Thread Brad Benton
On Wed, Jun 9, 2010 at 7:58 AM, Jeff Squyres  wrote:

> On Jun 8, 2010, at 12:33 PM, David Turner wrote:
>
> > Please verify:  if using openib BTL, the only threading model is
> > MPI_THREAD_SINGLE?
>
> Up to MPI_THREAD_SERIALIZED.
>
> > Is there a timeline for full support of MPI_THREAD_MULTIPLE in Open MPI's
> > openib BTL?
>
> IBM has been making some good strides in this direction recently, but I
> don't know what their specific timeframe is.
>


Yes, we (mainly Chris Yeoh) have been making improvements to
MPI_THREAD_MULTIPLE support for the openib BTL.

Here is how things currently stand:
 - The trunk and 1.5rc1 have threading modifications that improve the use
   of MPI_THREAD_MULTIPLE with the openib BTL.
 - It is currently functional, but with some restrictions, and is still a
   work in progress.  Consequently, the default behavior of the openib BTL
   is still to not support MPI_THREAD_MULTIPLE; however, this can be
   overridden with a command-line parameter.

In order to use MPI_THREAD_MULTIPLE with the openib BTL:
  - set the MCA parameter "btl_base_thread_multiple_override" to 1
  - set the MCA parameter "mpi_leave_pinned" to 0

The latter is needed because there are still some known issues with
threading and the memory registration cache.
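
For example, a launch enabling this support might look like the following
(./my_threaded_app is a placeholder for your own binary, and the process
count is arbitrary):

  mpirun --mca btl_base_thread_multiple_override 1 \
         --mca mpi_leave_pinned 0 \
         -np 4 ./my_threaded_app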

As for timelines, we are already moving some of these fixes into the 1.4.x
branch as well.  However, certain changes, such as a thread-safe memory
registration cache, are targeted for the 1.5/1.6 series.  Our goal is to
have a stable implementation of the openib BTL that fully supports
MPI_THREAD_MULTIPLE by the time the 1.5 feature branch transitions to the
1.6 super-stable branch.

If you have threaded applications that you can run today over the openib
BTL, please give it a try and let us know of any problems that you
encounter.
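
When testing, it is worth verifying at startup that the library actually
granted the thread level you asked for.  A minimal C sketch (generic MPI,
nothing Open MPI-specific):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int provided;

      /* Request full multi-threaded support from the MPI library. */
      MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

      /* The library may grant less than requested; check before
         spawning threads that make MPI calls. */
      if (provided < MPI_THREAD_MULTIPLE) {
          printf("warning: asked for MPI_THREAD_MULTIPLE, got level %d\n",
                 provided);
      }

      MPI_Finalize();
      return 0;
  }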

Thanks,
--Brad

Brad Benton
brad.ben...@us.ibm.com



>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/


Re: [OMPI users] Segmentation fault in MPI_Finalize with IB hardware and memory manager.

2010-06-18 Thread guillaume ranquet

Hello,

Sorry for the very long delay; I didn't understand that you were waiting
for an answer from my side on this (the debate seemed to be between the
maintainers).  Do not hesitate to bug me if I haven't answered after a
few days.

To answer briefly:
- yes, I've tested the patch submitted in this thread by Scott, and it
solved my issues.
- no, I haven't tested the patch submitted by George; I can give it a
quick try if needed.

As for "which one wins", I'm quite sure you have more clues than I do on
the subject :)


On 06/07/2010 09:49 PM, Jeff Squyres wrote:
> George --
> 
> Scott's patch was different from the one you applied.  Apparently, his
> patch fixes this user's problem (I don't know whether Guillaume tested
> yours).
> 
> Which one wins?
> 
> 
> 
> On Jun 3, 2010, at 9:49 AM, Scott Atchley wrote:
> 
>> On Jun 3, 2010, at 8:54 AM, guillaume ranquet wrote:
>>
>>> granquet@bordeplage-15 ~ $ mpirun --mca btl mx,openib,sm,self --mca pml
>>> ^cm --mca mpi_leave_pinned 0 ~/bwlat/mpi_helloworld
>>> [bordeplage-15.bordeaux.grid5000.fr:02707] Error in mx_init (error No MX
>>> device entry in /dev.)
>>> Hello world from process 0 of 1
>>>
>>> it works :)
>>
>> Jeff, you may want to change this message to opal_output_verbose(). It is in 
>> $OMPI/ompi/mca/common/common_mx.c.
>>
>>>> Ok. I think that OMPI is trying to open the MX MTL first. It fails at
>>>> mx_init() (the first error message) but it had already created some
>>>> mpool resources. It then tries to open the MX BTL and it skips the MX
>>>> initialization and returns SUCCESS. The MX BTL then tries to call
>>>> mx_get_info() which fails and prints the second message.
>>>>
>>>> Try the attached patch. It tries to clean up if mx_init() fails and
>>>> does not return SUCCESS on subsequent attempts to initialize MX.
>>>>
>>>> Scott
>>>
>>> I tried your patch and it seems to correct the issue:
>>>
>>> configured with: --prefix=$HOME/openmpi-1.4.2-nomx-bin/
>>> --with-openib=/usr --with-mx=/usr
>>>
>>> $ ~/openmpi-1.4.2-nomx-bin/bin/mpirun ~/bwlat/mpi_helloworld
>>> [bordeplage-15.bordeaux.grid5000.fr:22406] Error in mx_init (error No MX
>>> device entry in /dev.)
>>> Hello world from process 0 of 1
>>
>> Excellent.
>>
>>> don't hesitate if you need further testing :)
>>
>> Thanks for all your assistance!
>>
>>> do you plan on applying this patch in the next release (1.4.3)?
>>
>> Jeff, I leave this up to you and George.
>>
>> Scott
>>
> 
> 
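
The failure mode Scott describes above (a shared initializer that keeps
reporting success to later callers even though the first attempt failed)
and the shape of his fix can be sketched as follows.  This is an
illustrative C pattern with made-up names, not the actual common_mx.c code:

  #include <stdio.h>
  #include <stdbool.h>

  static int  refcount = 0;              /* callers that opened the module */
  static bool first_attempt_failed = false;

  /* Stand-in for mx_init(); fails, e.g., when /dev has no MX entry. */
  static bool try_device_init(void) { return false; }

  int shared_init(void)
  {
      if (refcount++ == 0) {
          if (!try_device_init()) {
              first_attempt_failed = true;   /* clean up partial state here */
          }
      }
      /* Before the fix, a second caller (the MX BTL) fell through and got
         success even though mx_init() had failed for the first caller
         (the MX MTL).  Caching and re-reporting the failure fixes that. */
      return first_attempt_failed ? -1 : 0;
  }

  int main(void)
  {
      printf("MTL init: %d\n", shared_init());  /* -1: device init failed */
      printf("BTL init: %d\n", shared_init());  /* -1 again, after the fix */
      return 0;
  }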






Re: [OMPI users] Segmentation fault in MPI_Finalize with IB hardware and memory manager.

2010-06-18 Thread Jeff Squyres
Sorry for the confusion; I was asking George which one wins.  I'm not active in 
the MX portion of the OMPI code base, so I don't know which one is better / 
should be used.


On Jun 18, 2010, at 8:19 AM, guillaume ranquet wrote:

> Hello,
>
> Sorry for the very long delay; I didn't understand that you were
> waiting for an answer from my side on this (the debate seemed to be
> between the maintainers).  Do not hesitate to bug me if I haven't
> answered after a few days.
>
> To answer briefly:
> - yes, I've tested the patch submitted in this thread by Scott, and it
> solved my issues.
> - no, I haven't tested the patch submitted by George; I can give it a
> quick try if needed.
>
> As for "which one wins", I'm quite sure you have more clues than I do
> on the subject :)


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] problem with -npernode

2010-06-18 Thread David Turner

Hi,

On 06/17/2010 03:34 PM, Ralph Castain wrote:

> No more info required - it's a bug.  Fixed and awaiting release of 1.4.3.


I downloaded openmpi-1.4.3a1r23261.tar.gz, dated June 9.  It behaves the
same as 1.4.2.  Is there a newer version available for testing?


> On Jun 17, 2010, at 3:50 PM, David Turner wrote:
>
>> Hi,
>>
>> Recently, Christopher Maestas reported a problem with -npernode in
>> Open MPI 1.4.2 ("running a ompi 1.4.2 job with -np versus -npernode").
>> I have also encountered this problem, with a simple "hello, world"
>> program:
>>
>> % mpirun -np 16 ./a.out
>> myrank, icount = 0   16
>> myrank, icount = 2   16
>> myrank, icount = 5   16
>> myrank, icount = 7   16
>> myrank, icount = 1   16
>> myrank, icount = 4   16
>> myrank, icount = 6   16
>> myrank, icount = 3   16
>> myrank, icount = 8   16
>> myrank, icount = 9   16
>> myrank, icount =10   16
>> myrank, icount =12   16
>> myrank, icount =13   16
>> myrank, icount =15   16
>> myrank, icount =11   16
>> myrank, icount =14   16
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>>
>> % mpirun -np 16 -npernode 8 ./a.out
>> [c1146:15313] *** Process received signal ***
>> [c1146:15313] Signal: Segmentation fault (11)
>> [c1146:15313] Signal code: Address not mapped (1)
>> [c1146:15313] Failing at address: 0x50
>> [c1146:15313] *** End of error message ***
>> Segmentation fault
>> [c1138:26571] [[62315,0],1] routed:binomial: Connection to lifeline
>> [[62315,0],0] lost
>>
>> % module swap openmpi openmpi/1.4.1
>> % mpirun -np 16 -npernode 8 ./a.out
>> myrank, icount = 8   16
>> myrank, icount =13   16
>> myrank, icount =10   16
>> myrank, icount =11   16
>> myrank, icount =15   16
>> myrank, icount =14   16
>> myrank, icount =12   16
>> myrank, icount = 5   16
>> myrank, icount = 2   16
>> myrank, icount = 3   16
>> myrank, icount = 1   16
>> myrank, icount = 0   16
>> myrank, icount = 9   16
>> myrank, icount = 6   16
>> myrank, icount = 7   16
>> myrank, icount = 4   16
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>> FORTRAN STOP
>>
>> Compilers are PGI/10.5, OS is Scientific Linux 5.4, resource manager is
>> torque 2.4.5.  Please let me know if you need more information.  Thanks!
>>
>> --
>> Best regards,
>>
>> David Turner
>> User Services Group    email: dptur...@lbl.gov
>> NERSC Division         phone: (510) 486-4027
>> Lawrence Berkeley Lab  fax:   (510) 486-4316




--
Best regards,

David Turner
User Services Group    email: dptur...@lbl.gov
NERSC Division         phone: (510) 486-4027
Lawrence Berkeley Lab  fax:   (510) 486-4316


Re: [OMPI users] problem with -npernode

2010-06-18 Thread Jeff Squyres
On Jun 18, 2010, at 9:57 AM, David Turner wrote:

> I downloaded openmpi-1.4.3a1r23261.tar.gz, dated June 9.  It behaves the
> same as 1.4.2.  Is there a newer version available for testing?

Not yet.  Ticket 2431, which contains the fix, was just approved to be put
back to the 1.4 branch; it hasn't actually been put back yet.  See:

https://svn.open-mpi.org/trac/ompi/ticket/2431

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/