[OMPI users] bug in CUDA support for dual-processor systems?
Hi, I wrote a simple program to check whether Open MPI can really handle CUDA device pointers as promised in the FAQ, and how efficiently it does so. The program (see below) breaks if MPI communication is performed between two devices that are on the same node but attached to different IOHs in a dual-processor Intel machine. Note that cudaMemcpy works for such device pairs, although not as efficiently as for devices on the same IOH with GPUDirect enabled.

Here's the output from my program:

===
> mpirun -n 6 ./a.out
Init
Init
Init
Init
Init
Init
rank: 1, size: 6
rank: 2, size: 6
rank: 3, size: 6
rank: 4, size: 6
rank: 5, size: 6
rank: 0, size: 6
device 3 is set
Process 3 is on typhoon1
Using regular memory
device 0 is set
Process 0 is on typhoon1
Using regular memory
device 4 is set
Process 4 is on typhoon1
Using regular memory
device 1 is set
Process 1 is on typhoon1
Using regular memory
device 5 is set
Process 5 is on typhoon1
Using regular memory
device 2 is set
Process 2 is on typhoon1
Using regular memory
^C^[[A^C
zkoza@typhoon1:~/multigpu$
zkoza@typhoon1:~/multigpu$ vim cudamussings.c
zkoza@typhoon1:~/multigpu$ mpicc cudamussings.c -lcuda -lcudart -L/usr/local/cuda/lib64 -I/usr/local/cuda/include
zkoza@typhoon1:~/multigpu$ vim cudamussings.c
zkoza@typhoon1:~/multigpu$ mpicc cudamussings.c -lcuda -lcudart -L/usr/local/cuda/lib64 -I/usr/local/cuda/include
zkoza@typhoon1:~/multigpu$ mpirun -n 6 ./a.out
Process 1 of 6 is on typhoon1
Process 2 of 6 is on typhoon1
Process 0 of 6 is on typhoon1
Process 4 of 6 is on typhoon1
Process 5 of 6 is on typhoon1
Process 3 of 6 is on typhoon1
device 2 is set
device 1 is set
device 0 is set
Using regular memory
device 5 is set
device 3 is set
device 4 is set
Host->device bandwidth for processor 1: 1587.993499 MB/sec
Host->device bandwidth for processor 2: 1570.275316 MB/sec
Host->device bandwidth for processor 3: 1569.890751 MB/sec
Host->device bandwidth for processor 5: 1483.637702 MB/sec
Host->device bandwidth for processor 0: 1480.888029 MB/sec
Host->device bandwidth for processor 4: 1476.241371 MB/sec
MPI_Send/MPI_Receive, Host [0] -> Host [1] bandwidth: 3338.57 MB/sec
MPI_Send/MPI_Receive, Device[0] -> Host [1] bandwidth: 420.85 MB/sec
MPI_Send/MPI_Receive, Host [0] -> Device[1] bandwidth: 362.13 MB/sec
MPI_Send/MPI_Receive, Device[0] -> Device[1] bandwidth: 6552.35 MB/sec
MPI_Send/MPI_Receive, Host [0] -> Host [2] bandwidth: 3238.88 MB/sec
MPI_Send/MPI_Receive, Device[0] -> Host [2] bandwidth: 418.18 MB/sec
MPI_Send/MPI_Receive, Host [0] -> Device[2] bandwidth: 362.06 MB/sec
MPI_Send/MPI_Receive, Device[0] -> Device[2] bandwidth: 5022.82 MB/sec
MPI_Send/MPI_Receive, Host [0] -> Host [3] bandwidth: 3295.32 MB/sec
MPI_Send/MPI_Receive, Device[0] -> Host [3] bandwidth: 418.90 MB/sec
MPI_Send/MPI_Receive, Host [0] -> Device[3] bandwidth: 359.16 MB/sec
MPI_Send/MPI_Receive, Device[0] -> Device[3] bandwidth: 5019.89 MB/sec
MPI_Send/MPI_Receive, Host [0] -> Host [4] bandwidth: 4619.55 MB/sec
MPI_Send/MPI_Receive, Device[0] -> Host [4] bandwidth: 419.24 MB/sec
MPI_Send/MPI_Receive, Host [0] -> Device[4] bandwidth: 364.52 MB/sec
--
The call to cuIpcOpenMemHandle failed. This is an unrecoverable error
and will cause the program to abort.
  cuIpcOpenMemHandle return value:   205
  address: 0x20020
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
--
[typhoon1:06098] Failed to register remote memory, rc=-1
[typhoon1:06098] [[33788,1],4] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 465

Comment: my machine has two six-core Intel processors with HT on, yielding 24 virtual processors, and 6 Tesla C2070s. The devices are grouped in two groups, one with 4 and the other with 2 devices. Devices in the same group can talk to each other via GPUDirect at approx. 6 GB/s; devices in different groups can use cudaMemcpy and UVA at somewhat lower transfer rates.

My Open MPI is openmpi-1.9a1r26904, compiled from sources with:

./configure -prefix=/home/zkoza/openmpi.1.9.cuda --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/lib

> nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Thu_Apr__5_00:24:31_PDT_2012
Cuda compilation tools, release 4.2, V0.2.1221

gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)
Ubuntu 12.04 64-bit
Nvidia Driver Version: 295.41

The program was compiled with:

> mpicc prog.c -lcuda -lcudart -L/usr/local/cuda/lib64 -I/usr/local/cuda/include

SOURCE CODE:

#include
#include
#include
#include
#include
#include

#define NREPEAT 20
#define NBYTES 1
#define CALL(x)\
{
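Since the listing above is cut off, here is a minimal sketch (not the original program) of the kind of CUDA-aware MPI_Send/MPI_Recv device-to-device bandwidth test described in this post. It assumes a CUDA-aware Open MPI build, one GPU per rank, and illustrative values for the message size and repeat count:

/* Hypothetical sketch: times NREPEAT sends of a device buffer from rank 0
 * to rank 1, passing the CUDA device pointer directly to MPI_Send/MPI_Recv
 * (requires a CUDA-aware Open MPI).  Run with at least 2 ranks. */
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

#define NBYTES  (1 << 22)   /* 4 MiB per message (assumed value) */
#define NREPEAT 20

int main(int argc, char **argv)
{
    int rank, size;
    void *dbuf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    cudaSetDevice(rank);                 /* assumes one GPU per rank */
    cudaMalloc(&dbuf, NBYTES);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < NREPEAT; i++) {
        if (rank == 0)
            MPI_Send(dbuf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(dbuf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
    double t1 = MPI_Wtime();

    if (rank == 1)
        printf("Device[0] -> Device[1] bandwidth: %.2f MB/sec\n",
               (double)NBYTES * NREPEAT / (t1 - t0) / 1.0e6);

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}

Such a sketch would be compiled the same way as the program above, e.g. mpicc sketch.c -lcudart -L/usr/local/cuda/lib64 -I/usr/local/cuda/include, and run with mpirun -n 2 ./a.out.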
Re: [OMPI users] bug in CUDA support for dual-processor systems?
Thanks for the quick reply.

I do not know much about low-level CUDA and IPC, but there is no problem using the high-level CUDA API to determine whether device A can talk to device B via GPUDirect (cudaDeviceCanAccessPeer). Then, for such connections, one only needs to call cudaDeviceEnablePeerAccess and essentially "sit back and laugh": given the correct current device and stream, functions like cudaMemcpyPeer work irrespective of whether GPUDirect is on or off for a given pair of devices, the only difference being the speed. So I hope it should be possible to implement device-IOH-IOH-device communication using low-level CUDA. Such functionality would be an important step in the "CPU-GPU high-performance war" :-), as 8-GPU systems with fast MPI links bring a new meaning to a "GPU node" in GPU clusters... (A minimal sketch of this peer-access pattern is appended after this message.)

Here is the output of my test program, which was aimed at determining (a) the aggregate, best-case transfer rate between 6 GPUs running in parallel and (b) whether devices on different IOHs can talk to each other:

3 [GB] in 78.6952 [ms] = 38.1218 GB/s (aggregate)
sending 6 bytes from device 0:
0 -> 0: 11.3454 [ms] 52.8848 GB/s
0 -> 1: 90.3628 [ms] 6.6399 GB/s
0 -> 2: 113.396 [ms] 5.29117 GB/s
0 -> 3: 113.415 [ms] 5.29032 GB/s
0 -> 4: 170.307 [ms] 3.52305 GB/s
0 -> 5: 169.613 [ms] 3.53747 GB/s

This shows that even devices on different IOHs, like 0 and 4, can talk to each other at the fantastic speed of 3.5 GB/s, and it would be a pity if Open MPI did not use this opportunity.

I also have 2 questions:

a) I noticed that on my 6-GPU, 2-CPU platform the initialization of CUDA 4.2 takes a long time, approx. 10 seconds. Do you think I should report this as a bug to NVIDIA?

b) Is there any info on running Open MPI + CUDA? For example, what are the dependencies of transfer rates and latencies on transfer size? A dedicated www page, blog or whatever? How can I know when the current problem has been solved?

Many thanks for making CUDA available in Open MPI.

Regards
Z Koza

On 31.07.2012 19:39, Rolf vandeVaart wrote:
> The current implementation does assume that the GPUs are on the same IOH and therefore can use the IPC features of the CUDA library for communication. One of the initial motivations for this was that to be able to detect whether GPUs can talk to one another, the CUDA library has to be initialized and the GPUs have to be selected by each rank. It is at that point that we can determine whether the IPC will work between the GPUs. However, this means that the GPUs need to be selected by each rank prior to the call to MPI_Init, as that is where we determine whether IPC is possible, and we were trying to avoid that requirement.
> I will submit a ticket against this and see if we can improve this.
> Rolf
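For reference, a hedged sketch of the high-level peer-access pattern mentioned above: query cudaDeviceCanAccessPeer for every pair of devices, enable peer access where it is supported, and copy with cudaMemcpyPeer, which also works (more slowly, staged through the host) when the two devices hang off different IOHs. The buffer size and the 16-device cap are arbitrary assumptions:

/* Sketch of the high-level peer-access check/enable/copy pattern. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > 16) ndev = 16;                /* keep the static table small */

    const size_t nbytes = 1 << 20;           /* 1 MiB test buffer (assumed) */
    void *buf[16];

    for (int d = 0; d < ndev; d++) {
        cudaSetDevice(d);
        cudaMalloc(&buf[d], nbytes);
    }

    for (int src = 0; src < ndev; src++) {
        for (int dst = 0; dst < ndev; dst++) {
            if (src == dst) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, src, dst);
            if (can) {                       /* same IOH: GPUDirect P2P */
                cudaSetDevice(src);
                cudaDeviceEnablePeerAccess(dst, 0);
            }
            /* Works in both cases; falls back to staging through the host
             * when peer access is not available. */
            cudaMemcpyPeer(buf[dst], dst, buf[src], src, nbytes);
            printf("%d -> %d: peer access %s\n", src, dst,
                   can ? "enabled" : "not available");
        }
    }

    for (int d = 0; d < ndev; d++) {
        cudaSetDevice(d);
        cudaFree(buf[d]);
    }
    return 0;
}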
Re: [OMPI users] 1D and 2D arrays, allocating memory with malloc(), and an MPI_Send/MPI_Recv problem
Look at this declaration:

int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

Here "count" is the number of elements (not bytes!) in the send buffer (a nonnegative integer). Your "count" was defined as

count = rows*matrix_size*sizeof(double);

and seems to be erroneous; the variable "count" must not depend on the size of the matrix element!

Z Koza

On Aug 7, 2012, at 10:33, Paweł Jaromin wrote:
> Hello all
>
> Sorry, maybe this is a stupid question, but I have a big problem with malloc() and matrix arrays. I want to make a program that does a very simple thing like matrixA * matrixB = matrixC. Because I need matrices larger than 100x100 (5000x5000), I have to use malloc() for the memory allocation.
>
> First I tried the typical form for dynamically allocating an NxM array of type T:
>
> T **a = malloc(sizeof *a * N);
> if (a) {
>     for (i = 0; i < N; i++) {
>         a[i] = malloc(sizeof *a[i] * M);
>     }
> }
> // the arrays are created before the split to nodes
>
> There was no problem creating and filling the arrays, but the problems started when I sent and received them. Of course, before sending I calculated "count" for MPI_Send. To be sure that the count for MPI_Send and MPI_Recv is the same, I also send "count".
>
> count = rows*matrix_size*sizeof (double); // part of matrix
> MPI_Send(&count, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
> MPI_Send(&matrixA[offset][0], count, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
>
> On the worker side the code looks like:
>
> MPI_Recv(&countA, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
> MPI_Recv(&matrixA[0][0], countA, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status);
>
> The error looks like:
>
> [pawcioj-VirtualBox:01700] *** Process received signal ***
> [pawcioj-VirtualBox:01700] Signal: Segmentation fault (11)
> [pawcioj-VirtualBox:01700] Signal code: Address not mapped (1)
> [pawcioj-VirtualBox:01700] Failing at address: 0x88fa000
> [pawcioj-VirtualBox:01700] [ 0] [0xc2740c]
> [pawcioj-VirtualBox:01700] [ 1] /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x906c) [0x17606c]
> [pawcioj-VirtualBox:01700] [ 2] /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x6a1b) [0x173a1b]
> [pawcioj-VirtualBox:01700] [ 3] /usr/lib/openmpi/lib/openmpi/mca_btl_sm.so(+0x3ae6) [0x7b7ae6]
> [pawcioj-VirtualBox:01700] [ 4] /usr/lib/libopen-pal.so.0(opal_progress+0x81) [0x406fa1]
> [pawcioj-VirtualBox:01700] [ 5] /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x48e5) [0x1718e5]
> [pawcioj-VirtualBox:01700] [ 6] /usr/lib/libmpi.so.0(MPI_Recv+0x165) [0x1ef9d5]
> [pawcioj-VirtualBox:01700] [ 7] macierz_V.02(main+0x927) [0x8049870]
> [pawcioj-VirtualBox:01700] [ 8] /lib/libc.so.6(__libc_start_main+0xe7) [0xddfce7]
> [pawcioj-VirtualBox:01700] [ 9] macierz_V.02() [0x8048b71]
> [pawcioj-VirtualBox:01700] *** End of error message ***
> --
> mpirun noticed that process rank 1 with PID 1700 on node pawcioj-VirtualBox exited on signal 11 (Segmentation fault).
>
> Because I got no result, I also tried to do it with a 1D array, but the problem seems similar. Probably I am doing something wrong, so I would like to ask you for advice on how to do this properly, or maybe a link to a useful tutorial. I have spent two weeks trying to find out how to do it, but unfortunately without result :(.
>
> --
> regards
> Paweł Jaromin
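A related pitfall, not mentioned in the reply above: with a separate malloc() per row the rows are generally not contiguous in memory, so a single MPI_Send of several rows through &matrixA[offset][0] can read past the end of that row's buffer even when the count is correct. A minimal sketch (hypothetical names, not the poster's program) of the usual fix: allocate each matrix as one contiguous block and pass the count in elements:

/* Minimal sketch (hypothetical names): one contiguous allocation per matrix,
 * with a row-pointer table on top, so that a run of consecutive rows can be
 * sent with a single MPI_Send whose count is given in ELEMENTS, not bytes. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

double **alloc_matrix(int n)                     /* n x n matrix of double */
{
    double *data = malloc((size_t)n * n * sizeof *data);  /* one block */
    double **rows = malloc((size_t)n * sizeof *rows);
    for (int i = 0; i < n; i++)
        rows[i] = data + (size_t)i * n;
    return rows;
}

void send_rows(double **A, int offset, int rows, int n, int dest, int tag)
{
    int count = rows * n;                        /* number of doubles */
    MPI_Send(&count, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);
    MPI_Send(&A[offset][0], count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
}

void recv_rows(double **A, int source, int tag)
{
    int count;
    MPI_Recv(&count, 1, MPI_INT, source, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* works only because the receiving matrix is contiguous as well */
    MPI_Recv(&A[0][0], count, MPI_DOUBLE, source, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

int main(int argc, char **argv)
{
    int rank, n = 8;                             /* tiny size, demo only */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double **A = alloc_matrix(n);
    if (rank == 0) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                A[i][j] = i + 0.01 * j;
        send_rows(A, 2, 3, n, 1, 0);             /* rows 2..4 go to rank 1 */
    } else if (rank == 1) {
        recv_rows(A, 0, 0);
        printf("rank 1 received A[0][0] = %g\n", A[0][0]);  /* prints 2 */
    }
    MPI_Finalize();
    return 0;
}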
[OMPI users] RDMA GPUDirect CUDA...
Hi,

I've just found this information on NVIDIA's plans regarding enhanced support for MPI in their CUDA toolkit:

http://developer.nvidia.com/cuda/nvidia-gpudirect

The idea that two GPUs can talk to each other via network cards without the CPU as a middleman looks very promising. This technology is supposed to be revealed and released in September.

My questions:

1. Will Open MPI include RDMA support in its CUDA interface?
2. Any idea how much this technology can reduce the CUDA Send/Recv latency?
3. Any idea whether this technology will be available for Fermi-class Tesla devices or only for Keplers?

Regards,
Z Koza
[OMPI users] what is a "node"?
Hi,

consider this specification:

"Curie fat consists in 360 nodes which contains 4 eight cores CPU Nehalem-EX clocked at 2.27 GHz, let 32 cores / node and 11520 cores for the full fat configuration"

Suppose I would like to run some performance tests on just a single processor rather than all 4 of them. Is there a way to do this? I'm afraid that specifying that I need 1 cluster node with 8 MPI processes will result in the OS distributing these 8 processes among the 4 processors forming the node, and this is not what I'm after.

Z Koza
Re: [OMPI users] what is a "node"?
Thanks a lot!

Z Koza

2012/8/30 Gus Correa
> Hi Zbigniew
>
> Besides the OpenMPI processor affinity capability that Jeff mentioned:
>
> If your Curie cluster has a resource manager [Torque, SGE, etc.], your job submission script to the resource manager / queue system should specifically request a single node for the test that you have in mind.
>
> For instance, on Torque/PBS, this would be done by adding this directive to the top of the job script:
>
> #PBS -l nodes=1:ppn=8
> ...
> mpiexec -np 8 ...
>
> meaning that you want the 8 processors [i.e. cores] to be in a single node.
>
> On top of this, you need to add the appropriate process binding keywords to the mpiexec command line, as Jeff suggested. 'man mpiexec' will tell you a lot about the OpenMPI process binding capability, especially in the 1.6 and 1.4 series.
>
> In the best of worlds, your resource manager also has the ability to assign a group of cores exclusively to each of the jobs that may be sharing the node. Say, job1 requests 4 cores and gets cores 0-3 and cannot use any other cores, job2 requests 8 cores and gets cores 4-11 and cannot use any other cores, and so on.
>
> However, not all resource managers / queue systems are built this way [particularly the older versions], and they may let the various job processes drift across all cores in the node.
>
> If the resource manager is old and doesn't have that hardware locality capability, and if you don't want your performance test to risk being polluted by other jobs running on the same node, perhaps sharing the same cores with your job, then you can request all 32 cores in the node for your job but use only 8 of them to run your MPI program. It is wasteful, but it may be the only way to go. For instance, on Torque:
>
> #PBS -l nodes=1:ppn=32
> ...
> mpiexec -np 8 ...
>
> Again, add the OpenMPI process binding keywords to the mpiexec command line to ensure the use of a fixed group of 8 cores.
>
> With SGE and Slurm the syntax is different from the above, but I would guess that there is an equivalent setup.
>
> I hope this helps,
> Gus Correa
>
>
> On 08/30/2012 08:07 AM, Jeff Squyres wrote:
>> In the OMPI v1.6 series, you can use the processor affinity options. And you can use --report-bindings to show exactly where processes were bound. For example:
>>
>> % mpirun -np 4 --bind-to-core --report-bindings -bycore uptime
>> [svbu-mpi056:18904] MCW rank 0 bound to socket 0[core 0]: [B . . .][. . . .]
>> [svbu-mpi056:18904] MCW rank 1 bound to socket 0[core 1]: [. B . .][. . . .]
>> [svbu-mpi056:18904] MCW rank 2 bound to socket 0[core 2]: [. . B .][. . . .]
>> [svbu-mpi056:18904] MCW rank 3 bound to socket 0[core 3]: [. . . B][. . . .]
>> 05:06:13 up 7 days, 6:57, 1 user, load average: 0.29, 0.10, 0.03
>> 05:06:13 up 7 days, 6:57, 1 user, load average: 0.29, 0.10, 0.03
>> 05:06:13 up 7 days, 6:57, 1 user, load average: 0.29, 0.10, 0.03
>> 05:06:13 up 7 days, 6:57, 1 user, load average: 0.29, 0.10, 0.03
>> %
>>
>> I bound each process to a single core, and mapped them on a round-robin basis by core. Hence, all 4 processes ended up on their own cores on a single processor socket.
>>
>> The --report-bindings output shows that this particular machine has 2 sockets, each with 4 cores.
Re: [OMPI users] what is a "node"?
Hi,

I have one more question. I wanted to experiment with the processor affinity command-line options on my Ubuntu PC. When I use an Open MPI compiled from sources a few weeks ago, mpirun returns error messages. However, the "official" Open MPI installation on the same machine causes no problems. Does this mean there is a bug in the current Open MPI and I should report it?

=== 1. OpenMPI version:

mpirun -V
mpirun (Open MPI) 1.9a1r26880

Report bugs to http://www.open-mpi.org/community/help/

=== 2. mpirun "offending" command and error report:

zkoza@zbyszek:~$ mpirun -np 2 --bind-to-core -bycore --report-bindings uptime
--
WARNING: a request was made to bind a process. While the system
supports binding the process itself, at least one node does NOT
support binding memory to the process location.

Node: zbyszek

This is a warning only; your job will continue, though performance may
be degraded.
--
--
mpirun was unable to find the specified executable file, and therefore
did not launch the job. This error was first reported for process
rank 0; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command
line parameter option (remember that mpirun interprets the first
unrecognized command line token as the executable).

Node: zbyszek
Executable: -bycore
--
2 total processes failed to start

=== 3. the same mpirun command using the standard MPI installation:

LD_LIBRARY_PATH=/usr/lib/openmpi /usr/bin/mpirun --path /usr/lib/openmpi -np 2 --bind-to-core -bycore --report-bindings uptime
[zbyszek:03104] [[7637,0],0] odls:default:fork binding child [[7637,1],0] to cpus 0001
[zbyszek:03104] [[7637,0],0] odls:default:fork binding child [[7637,1],1] to cpus 0002
12:25:51 up 21:27, 1 user, load average: 0.00, 0.01, 0.05
12:25:51 up 21:27, 1 user, load average: 0.00, 0.01, 0.05

=== 4. version of the standard OpenMPI:

zkoza@zbyszek:~$ LD_LIBRARY_PATH=/usr/lib/openmpi /usr/bin/mpirun --path /usr/lib/openmpi --version
mpirun (Open MPI) 1.4.3

Z Koza
Re: [OMPI users] what is a "node"?
Thanks, Ralph, the new syntax works well (I used "man mpirun", which displayed the old syntax). Also, the report displayed by --report-bindings is far more human-readable than in previous versions of Open MPI.

Out of curiosity, and also to suppress the warning, I installed the libnuma-dev package with the libnuma.so and libnuma.a libraries, but the warning remains. Does this mean I should recompile Open MPI to get rid of the warning?

Z Koza

2012/9/1 Ralph Castain
> You are using cmd line options that no longer exist in the 1.9 release - look at "mpirun -h" for the current list of options.
>
> FWIW: in your example, the correct cmd line would be:
>
> mpirun -np 2 --bind-to core -map-by core --report-bindings uptime
>
> Note the space in "--bind-to core" and the "-map-by" option syntax. The warning means that we didn't find libnuma installed on your machine, so we cannot bind memory allocations (but we can bind processes).
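For completeness, a small sanity check one can run alongside --report-bindings: a hypothetical (Linux-only) MPI program, not part of Open MPI, in which each rank prints the set of cores it is allowed to run on via sched_getaffinity, so the effect of --bind-to / --map-by can be verified from inside the application:

/* Hypothetical helper: each rank prints the cores it may run on (Linux). */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    cpu_set_t mask;
    char buf[1024];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);   /* 0 = calling process */

    int pos = 0;
    for (int c = 0; c < CPU_SETSIZE && pos < (int)sizeof(buf) - 8; c++)
        if (CPU_ISSET(c, &mask))
            pos += snprintf(buf + pos, sizeof(buf) - pos, "%d ", c);

    printf("rank %d allowed on cores: %s\n", rank, buf);
    MPI_Finalize();
    return 0;
}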
Re: [OMPI users] What is the default install library for PATH and LD_LIBRARY_PATH
./configure does not compile anything; it only generates the Makefile. Did you run

make
make install

after running ./configure?

Notice also that Open MPI is very likely already installed on your system from the Ubuntu packages; anyway, I suggest you use the Ubuntu packages rather than compiling from sources unless you have a very good reason not to use the packaged version.

You can also quite safely re-run "make install" to see where the libraries are going. If you're unsure which version of Open MPI you have, you can start with

which mpicc
mpicc --showme

Z Koza

2012/11/13 huaibao zhang:
> Hi Reuti,
>
> Thanks for your answer. I really appreciate it.
> I am using an old version, 1.4.3, for my code. If I only type $ ./configure, it will compile, but I have no idea where it is installed. I typed $ find /lib -name "libopen-pal.so.0", but it shows nothing. Do you think this is because I am not a root user, or because of the old version?
>
> Thanks,
> Paul
>
> --
> Huaibao (Paul) Zhang
> Gas Surface Interactions Lab
> Department of Mechanical Engineering
> University of Kentucky,
> Lexington, KY, 40506-0503
> Office: 216 Ralph G. Anderson Building
> Web: gsil.engineering.uky.edu
>
> On Nov 13, 2012, at 12:24 PM, Reuti wrote:
>
> On 13.11.2012 at 15:44, huaibao zhang wrote:
>
> I installed OpenMPI on my Ubuntu 64 bit desktop. At first, I did not specify "prefix", so even though I've installed it, I could not find where it is. Since the "PATH" and "LD" have to be given, the mpicc can find the "libopen-pal.so.0" file.
>
> You mean "...can't find..."? If you use the default location, it should have the correct settings already, even without adding any path to PATH or LD_LIBRARY_PATH.
>
> You can use:
>
> $ find /lib -name "libopen-pal.so.0"
>
> to spot the location. But I wonder about the version. The actual one seems to be libopen-pal.so.4 -> libopen-pal.so.4.0.3 - which version are you using?
>
> -- Reuti