Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-16 Thread Ralph Castain
Just to wrap this up for the user list: this has now been fixed and added to 
1.8.2 in the nightly tarball. The problem proved to be an edge case in which a 
partial allocation was combined with the presence of coprocessors, which hit a 
slightly different code path.
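
For anyone who wants to verify the fix, a minimal sketch of building and testing 
a nightly snapshot (assuming it was downloaded as openmpi-1.8.2a1r31981.tar.bz2, 
the snapshot name mentioned below; the exact filename and the install prefix are 
placeholders, and the last line reuses Dan's test command from later in the thread):

  tar xjf openmpi-1.8.2a1r31981.tar.bz2
  cd openmpi-1.8.2a1r31981
  ./configure --prefix=$HOME/ompi-1.8.2-nightly
  make -j8 install
  $HOME/ompi-1.8.2-nightly/bin/mpirun -np 4 -machinefile ./nodes ./hello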


On Jun 12, 2014, at 9:04 AM, Dan Dietz  wrote:

> That shouldn't be a problem. Let me figure out the process and I'll
> get back to you.
> 
> Dan
> 
> On Thu, Jun 12, 2014 at 11:50 AM, Ralph Castain  wrote:
>> Arggh - is there any way I can get access to this beast so I can debug this? 
>> I can't figure out what in the world is going on, but it seems to be 
>> something triggered by your specific setup.
>> 
>> 
>> On Jun 12, 2014, at 8:48 AM, Dan Dietz  wrote:
>> 
>>> Unfortunately, the nightly tarball appears to be crashing in a similar
>>> fashion. :-( I used the latest snapshot 1.8.2a1r31981.
>>> 
>>> Dan
>>> 
>>> On Thu, Jun 12, 2014 at 10:56 AM, Ralph Castain  wrote:
 I've poked and prodded, and the 1.8.2 tarball seems to be handling this 
 situation just fine. I don't have access to a Torque machine, but I did 
 set everything to follow the same code path, added faux coprocessors, etc. 
 - and it ran just fine.
 
 Can you try the 1.8.2 tarball and see if it solves the problem?
 
 
 On Jun 11, 2014, at 2:15 PM, Ralph Castain  wrote:
 
> Okay, let me poke around some more. It is clearly tied to the 
> coprocessors, but I'm not yet sure just why.
> 
> One thing you might do is try the nightly 1.8.2 tarball - there have been 
> a number of fixes, and this may well have been caught there. Worth taking 
> a look.
> 
> 
> On Jun 11, 2014, at 6:44 AM, Dan Dietz  wrote:
> 
>> Sorry - it crashes with both the Torque and rsh launchers. The output from
>> a gdb backtrace on the core files looks identical.
>> 
>> Dan
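
A minimal sketch of pulling such a backtrace from one of those core files (the 
core file name follows the common core.<pid> pattern and is a placeholder; the 
mpirun path is an assumption based on the library paths shown in the crash 
output below):

  gdb /apps/rhel6/openmpi/1.8.1/intel-14.0.2.144/bin/mpirun core.51113
  (gdb) bt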
>> 
>> On Wed, Jun 11, 2014 at 9:37 AM, Ralph Castain  wrote:
>>> Afraid I'm a little confused now - are you saying it works fine under 
>>> Torque, but segfaults under rsh? Could you please clarify your current 
>>> situation?
>>> 
>>> 
>>> On Jun 11, 2014, at 6:27 AM, Dan Dietz  wrote:
>>> 
 It looks like it is still segfaulting with the rsh launcher:
 
 ddietz@conte-a084:/scratch/conte/d/ddietz/hello$ mpirun -mca plm rsh
 -np 4 -machinefile ./nodes ./hello
 [conte-a084:51113] *** Process received signal ***
 [conte-a084:51113] Signal: Segmentation fault (11)
 [conte-a084:51113] Signal code: Address not mapped (1)
 [conte-a084:51113] Failing at address: 0x2c
 [conte-a084:51113] [ 0] /lib64/libpthread.so.0[0x36ddc0f710]
 [conte-a084:51113] [ 1]
 /apps/rhel6/openmpi/1.8.1/intel-14.0.2.144/lib/libopen-rte.so.7(orte_plm_base_complete_setup+0x615)[0x2b857e203015]
 [conte-a084:51113] [ 2]
 /apps/rhel6/openmpi/1.8.1/intel-14.0.2.144/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0xa05)[0x2b857ee10715]
 [conte-a084:51113] [ 3] mpirun(orterun+0x1b45)[0x40684f]
 [conte-a084:51113] [ 4] mpirun(main+0x20)[0x4047f4]
 [conte-a084:51113] [ 5] 
 /lib64/libc.so.6(__libc_start_main+0xfd)[0x36dd41ed1d]
 [conte-a084:51113] [ 6] mpirun[0x404719]
 [conte-a084:51113] *** End of error message ***
 Segmentation fault (core dumped)
 
 On Sun, Jun 8, 2014 at 4:54 PM, Ralph Castain  
 wrote:
> I'm having no luck poking at this segfault issue. For some strange 
> reason, we seem to think there are coprocessors on those remote nodes 
> - e.g., a Phi card. Yet your lstopo output doesn't seem to show it.
> 
> Out of curiosity, can you try running this with "-mca plm rsh"? This 
> will substitute the rsh/ssh launcher in place of Torque - assuming 
> your system will allow it, this will let me see if the problem is 
> somewhere in the Torque launcher or elsewhere in OMPI.
> 
> Thanks
> Ralph
> 
> On Jun 6, 2014, at 12:48 PM, Dan Dietz  wrote:
> 
>> No problem -
>> 
>> These are Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz chips:
>> 2 per node, 8 cores each. No hyperthreading enabled.
>> 
>> $ lstopo
>> Machine (64GB)
>> NUMANode L#0 (P#0 32GB)
>> Socket L#0 + L3 L#0 (20MB)
>> L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 
>> (P#0)
>> L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 
>> (P#1)
>> L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 
>> (P#2)
>> L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 
>> (P#3)
>> L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 
>> (P#4)
>> L2 L#5 

[OMPI users] deprecated cuptiActivityEnqueueBuffer

2014-06-16 Thread jcabello
Hi all:

I'm having trouble compiling OMPI from the SVN trunk with the new NVIDIA CUDA 6.0
SDK because cuptiActivityEnqueueBuffer is deprecated.

This is the problem:

  CC   libvt_la-vt_cupti_activity.lo
  CC   libvt_la-vt_iowrap_helper.lo
  CC   libvt_la-vt_libwrap.lo
  CC   libvt_la-vt_mallocwrap.lo
vt_cupti_activity.c: In function 'vt_cuptiact_queueNewBuffer':
vt_cupti_activity.c:203:3: error: implicit declaration of function
'cuptiActivityEnqueueBuffer' [-Werror=implicit-function-declaration]
   VT_CUPTI_CALL(cuptiActivityEnqueueBuffer(cuCtx, 0, ALIGN_BUFFER(buffer,
8),

Does anybody know of a patch?


Re: [OMPI users] deprecated cuptiActivityEnqueueBuffer

2014-06-16 Thread Rolf vandeVaart
Do you need the VampirTrace (VT) support in your build?  If not, you could add this to 
configure:
--disable-vt
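
For example, a minimal sketch of such a configure line (the CUDA and install 
paths are placeholders, and --with-cuda is only needed if you want CUDA support 
at all):

  ./configure --disable-vt --with-cuda=/usr/local/cuda-6.0 --prefix=$HOME/ompi-trunk
  make -j8 install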
  
>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of
>jcabe...@computacion.cs.cinvestav.mx
>Sent: Monday, June 16, 2014 1:40 PM
>To: us...@open-mpi.org
>Subject: [OMPI users] deprecated cuptiActivityEnqueueBuffer
>
>Hi all:
>
>I'm having trouble compiling OMPI from the SVN trunk with the new NVIDIA CUDA 6.0
>SDK because cuptiActivityEnqueueBuffer is deprecated.
>
>This is the problem:
>
>  CC   libvt_la-vt_cupti_activity.lo
>  CC   libvt_la-vt_iowrap_helper.lo
>  CC   libvt_la-vt_libwrap.lo
>  CC   libvt_la-vt_mallocwrap.lo
>vt_cupti_activity.c: In function 'vt_cuptiact_queueNewBuffer':
>vt_cupti_activity.c:203:3: error: implicit declaration of function
>'cuptiActivityEnqueueBuffer' [-Werror=implicit-function-declaration]
>   VT_CUPTI_CALL(cuptiActivityEnqueueBuffer(cuCtx, 0,
>ALIGN_BUFFER(buffer, 8),
>
>Does anybody know of a patch?
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-
>mpi.org/community/lists/users/2014/06/24652.php


[OMPI users] how to get mpirun to scale from 16 to 64 cores

2014-06-16 Thread Yuping Sun
Dear All:

I bought a 64-core workstation and installed NASA FUN3D with Open MPI 1.6.5. 
Then I started test runs of FUN3D using 16, 32, and 48 cores. However, the 
performance of the FUN3D runs is poor. I got the data below.

The run command is (for 32 cores, as an example):

mpiexec -np 32 --bysocket --bind-to-socket 
~ysun/Codes/NASA/fun3d-12.3-66687/Mpi/FUN3D_90/nodet_mpi --time_timestep_loop 
--animation_freq -1 > screen.dump_bs30


CPUs    time    iterations    time/it
60      678s    30            22.61s
48      702s    30            23.40s
32      734s    30            24.50s
16      894s    30            29.80s

Using 60 cores, the 30 iterations complete in 678 seconds, roughly 22.61 
seconds per iteration.

Using 16 cores, the same 30 iterations take 894 seconds, roughly 29.8 seconds 
per iteration.

The data above show that the FUN3D run using mpirun does not scale at all! I 
used to run FUN3D with mpirun on an 8-core workstation, and it scaled well.
The same job also scales well on a Linux cluster.

Could you give me some advice on how to reduce the performance loss as I add 
more cores, or on how to run mpirun with the proper options to get closer to 
linear scaling when going from 16 to 32 to 48 cores?

Thank you.

Yuping

Re: [OMPI users] how to get mpirun to scale from 16 to 64 cores

2014-06-16 Thread Ralph Castain
Well, for one, there is never any guarantee of linear scaling with the number 
of procs - that is very application dependent. You can actually see performance 
decrease with the number of procs if the application doesn't know how to exploit 
them.

One thing that stands out is your mapping and binding options. Mapping bysocket 
means that you are putting neighboring ranks (i.e., ranks that differ by 1) on 
different sockets, which usually means different NUMA regions. This makes shared 
memory between those procs run poorly. If the application does a lot of 
messaging between ranks that differ by 1, then you would see poor scaling.

So one thing you could do is change --bysocket to --bycore. Then, if your 
application isn't threaded, you could add --bind-to-core for better performance.
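
For example, a minimal sketch of the suggested invocation, adapted from the 
command in the original post (the output file name is just a placeholder):

  mpiexec -np 32 --bycore --bind-to-core \
      ~ysun/Codes/NASA/fun3d-12.3-66687/Mpi/FUN3D_90/nodet_mpi \
      --time_timestep_loop --animation_freq -1 > screen.dump_bc30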


On Jun 16, 2014, at 3:19 PM, Yuping Sun  wrote:

> Dear All:
> 
> I bought a 64 core workstation and installed NASA fun3d with open mpi 1.6.5. 
> Then I started to test run fun3d using 16, 32, 48 cores. However the 
> performance of the fun3d run is bad. I got data below:
> 
> the run command is (it is for 32 core as an example)
> mpiexec -np 32 --bysocket --bind-to-socket 
> ~ysun/Codes/NASA/fun3d-12.3-66687/Mpi/FUN3D_90/nodet_mpi --time_timestep_loop 
> --animation_freq -1 > screen.dump_bs30
> 
> CPUs    time    iterations    time/it
> 60      678s    30            22.61s
> 48      702s    30            23.40s
> 32      734s    30            24.50s
> 16      894s    30            29.80s
> 
> You can see using 60 cores, to run 30 iteration, FUN3D will complete in 678 
> seconds, roughly 22.61 second per iteration.
> 
> Using 16 cores, to run 30 iteration, FUN3D will complete in 894 seconds, 
> roughly 29.8 seconds per iteration.
> 
> the data above shows FUN3D run using mpirun does not scale at all! I used to 
> run fun3d with mpirun on a 8 core WS, and it scales well.
> The same job to run on a linux cluster scales well.
> 
> Would you all give me some advice to improve the performance loss when I 
> increase the use of more cores, or how to run mpirun with proper options to 
> get a linear scaling when using 16 to 32 to 48 cores?
> 
> Thank you.
> 
> Yuping
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/06/24654.php



Re: [OMPI users] deprecated cuptiActivityEnqueueBuffer

2014-06-16 Thread jcabello

OK, that works.

Thanks!!
> Do you need the VampirTrace (VT) support in your build?  If not, you could add this
> to configure:
> --disable-vt
>
>>-Original Message-
>>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of
>>jcabe...@computacion.cs.cinvestav.mx
>>Sent: Monday, June 16, 2014 1:40 PM
>>To: us...@open-mpi.org
>>Subject: [OMPI users] deprecated cuptiActivityEnqueueBuffer
>>
>>Hi all:
>>
>>I'm having trouble compiling OMPI from the SVN trunk with the new NVIDIA CUDA 6.0
>>SDK because cuptiActivityEnqueueBuffer is deprecated.
>>
>>This is the problem:
>>
>>  CC   libvt_la-vt_cupti_activity.lo
>>  CC   libvt_la-vt_iowrap_helper.lo
>>  CC   libvt_la-vt_libwrap.lo
>>  CC   libvt_la-vt_mallocwrap.lo
>>vt_cupti_activity.c: In function 'vt_cuptiact_queueNewBuffer':
>>vt_cupti_activity.c:203:3: error: implicit declaration of function
>>'cuptiActivityEnqueueBuffer' [-Werror=implicit-function-declaration]
>>   VT_CUPTI_CALL(cuptiActivityEnqueueBuffer(cuCtx, 0,
>>ALIGN_BUFFER(buffer, 8),
>>
>>Does anybody know of a patch?
>>___
>>users mailing list
>>us...@open-mpi.org
>>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>Link to this post: http://www.open-
>>mpi.org/community/lists/users/2014/06/24652.php
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/06/24653.php
>



Re: [OMPI users] how to get mpirun to scale from 16 to 64 cores

2014-06-16 Thread Yuping Sun
Hi Ralph:

Is the following the correct command, in your view:

mpirun -np 32 --bysocket --bycore  
~ysun/Codes/NASA/fun3d-12.3-66687/Mpi/FUN3D_90/nodet_mpi
 --time_timestep_loop --animation_freq -1 

I ran the above command, but performance still does not improve. Would you give me 
a detailed command with options?
Thank you.

Best regards,

Yuping



On Tue, 6/17/14, Ralph Castain  wrote:

 Subject: Re: [OMPI users] how to get mpirun to scale from 16 to 64 cores
 To: "Yuping Sun" , "Open MPI Users" 
 Date: Tuesday, June 17, 2014, 1:59 AM

 Well, for one, there is never any guarantee of linear scaling with the
 number of procs - that is very application dependent. You can actually
 see performance decrease with the number of procs if the application
 doesn't know how to exploit them.

 One thing that stands out is your mapping and binding options. Mapping
 bysocket means that you are putting neighboring ranks (i.e., ranks that
 differ by 1) on different sockets, which usually means different NUMA
 regions. This makes shared memory between those procs run poorly. If the
 application does a lot of messaging between ranks that differ by 1, then
 you would see poor scaling.

 So one thing you could do is change --bysocket to --bycore. Then, if your
 application isn't threaded, you could add --bind-to-core for better
 performance.

 On Jun 16, 2014, at 3:19 PM, Yuping Sun  wrote:

 Dear All:

 I bought a 64-core workstation and installed NASA FUN3D with Open MPI
 1.6.5. Then I started test runs of FUN3D using 16, 32, and 48 cores.
 However, the performance of the FUN3D runs is poor. I got the data below.

 The run command is (for 32 cores, as an example):

 mpiexec -np 32 --bysocket --bind-to-socket
 ~ysun/Codes/NASA/fun3d-12.3-66687/Mpi/FUN3D_90/nodet_mpi
 --time_timestep_loop --animation_freq -1 > screen.dump_bs30

 CPUs    time    iterations    time/it
 60      678s    30            22.61s
 48      702s    30            23.40s
 32      734s    30            24.50s
 16      894s    30            29.80s

 Using 60 cores, the 30 iterations complete in 678 seconds, roughly 22.61
 seconds per iteration. Using 16 cores, the same 30 iterations take 894
 seconds, roughly 29.8 seconds per iteration.

 The data above show that the FUN3D run using mpirun does not scale at
 all! I used to run FUN3D with mpirun on an 8-core workstation, and it
 scaled well. The same job also scales well on a Linux cluster.

 Could you give me some advice on how to reduce the performance loss as I
 add more cores, or on how to run mpirun with the proper options to get
 closer to linear scaling when going from 16 to 32 to 48 cores?

 Thank you.
 Yuping

 ___
 users mailing list
 us...@open-mpi.org
 Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
 Link to this post:
 http://www.open-mpi.org/community/lists/users/2014/06/24654.php



Re: [OMPI users] how to get mpirun to scale from 16 to 64 cores

2014-06-16 Thread Zehan Cui
Hi Yuping,

Maybe using multiple threads inside each socket, and MPI among the sockets, is a
better choice for such a NUMA platform.

Multi-threading can exploit the benefit of shared memory, and MPI can
alleviate the cost of non-uniform memory access.
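
For example, a minimal sketch of such a hybrid launch (this assumes the 64-core
box has 4 sockets of 16 cores each and that nodet_mpi was built with OpenMP
support - both are assumptions, so adjust -np, --npersocket, and
OMP_NUM_THREADS to the real topology and code):

  mpiexec -np 4 --npersocket 1 --bind-to-socket -x OMP_NUM_THREADS=16 \
      ~ysun/Codes/NASA/fun3d-12.3-66687/Mpi/FUN3D_90/nodet_mpi \
      --time_timestep_loop --animation_freq -1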


regards,
Zehan




On Tue, Jun 17, 2014 at 6:19 AM, Yuping Sun  wrote:

> Dear All:
>
> I bought a 64 core workstation and installed NASA fun3d with open mpi
> 1.6.5. Then I started to test run fun3d using 16, 32, 48 cores. However the
> performance of the fun3d run is bad. I got data below:
>
> the run command is (it is for 32 core as an example)
> mpiexec -np 32 --bysocket --bind-to-socket
> ~ysun/Codes/NASA/fun3d-12.3-66687/Mpi/FUN3D_90/nodet_mpi
> --time_timestep_loop --animation_freq -1 > screen.dump_bs30
>
> CPUs    time    iterations    time/it
> 60      678s    30            22.61s
> 48      702s    30            23.40s
> 32      734s    30            24.50s
> 16      894s    30            29.80s
>
> You can see using 60 cores, to run 30 iteration, FUN3D will complete in
> 678 seconds, roughly 22.61 second per iteration.
>
> Using 16 cores, to run 30 iteration, FUN3D will complete in 894 seconds,
> roughly 29.8 seconds per iteration.
>
> the data above shows FUN3D run using mpirun does not scale at all! I used
> to run fun3d with mpirun on a 8 core WS, and it scales well.
> The same job to run on a linux cluster scales well.
>
> Would you all give me some advice to improve the performance loss when I
> increase the use of more cores, or how to run mpirun with proper options to
> get a linear scaling when using 16 to 32 to 48 cores?
>
> Thank you.
>
> Yuping
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/06/24654.php
>