[OMPI users] Openmpi 1.8.5 on Linux with threading support

2015-05-19 Thread Nilo Menezes

Hello,

I'm trying to run Open MPI with multithreading support enabled.

I'm getting these error messages before init finishes:
[node011:61627] PSM returned unhandled/unknown connect error: Operation 
timed out

[node011:61627] PSM EP connect error (unknown connect error):

*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[node005:51948] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all 
other processes were killed!

*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[node039:57062] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all 
other processes were killed!

*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[node012:64036] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all 
other processes were killed!

*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[node008:14098] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all 
other processes were killed!

*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[node011:61627] Local abort before MPI_INIT completed successfully; not 
able to aggregate error messages, and not able to guarantee that all 
other processes were killed!
[node005:51887] 1 more process has sent help message help-mpi-runtime / 
mpi_init:startup:internal-failure
[node005:51887] Set MCA parameter "orte_base_help_aggregate" to 0 to see 
all help / error messages


The library was configured with:
./configure \
--prefix=/home/opt \
--enable-static \
--enable-mpi-thread-multiple \
--with-threads

gcc 4.8.2

On Linux:
Linux node001 2.6.32-279.14.1.el6.x86_64 #1 SMP Mon Oct 15 13:44:51 EDT 
2012 x86_64 x86_64 x86_64 GNU/Linux


The job was started with:
sbatch --nodes=6 --ntasks=30 --mem=4096  -o result/TOn6t30.txt -e 
result/TEn6t30.txt job.sh



job.sh contains:
mpirun --mca btl tcp,self \
   --mca btl_tcp_if_include 172.24.38.0/24 \
   --mca oob_tcp_if_include eth0 \
/home/umons/info/menezes/drsim/build/NameResolution/gameoflife_mpi2 
--columns=1000 --rows=1000


I call MPI_Init_thread with:
int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
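
A minimal standalone version of that init sequence, with an added check that
the requested level was actually granted (the check on "provided" is only
illustrative and may differ from what the real program does), looks roughly
like:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int provided;

    /* Ask for full thread support; the library reports what it can give. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "requested MPI_THREAD_MULTIPLE, got level %d\n", provided);
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }

    MPI_Finalize();
    return EXIT_SUCCESS;
}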

The program is a simple Game of Life simulation. It runs fine on a single
node (with one or many tasks) but fails on random nodes when distributed.


Any hint would help.

Best Regards,

Nilo Menezes


Re: [OMPI users] Openmpi 1.8.5 on Linux with threading support

2015-05-19 Thread Ralph Castain
It looks like you have PSM-enabled cards in your system as well as Ethernet,
and we are picking that up. Try adding "-mca pml ob1" to your command line and
see if that helps.
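
With the mpirun invocation from your job.sh, that would look something like
(everything except the added "-mca pml ob1" is taken from your original post):

mpirun -mca pml ob1 --mca btl tcp,self \
   --mca btl_tcp_if_include 172.24.38.0/24 \
   --mca oob_tcp_if_include eth0 \
   /home/umons/info/menezes/drsim/build/NameResolution/gameoflife_mpi2 \
   --columns=1000 --rows=1000

Forcing the ob1 PML keeps Open MPI from selecting the PSM MTL, which is what
is timing out during the endpoint connect.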


On Tue, May 19, 2015 at 5:04 AM, Nilo Menezes  wrote:

> Hello,
>
> I'm trying to run openmpi with multithread support enabled.
>
> I'm getting this error messages before init finishes:
> [node011:61627] PSM returned unhandled/unknown connect error: Operation
> timed out
> [node011:61627] PSM EP connect error (unknown connect error):
>
> *** An error occurred in MPI_Init_thread
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***and potentially your MPI job)
> [node005:51948] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all other
> processes were killed!
> *** An error occurred in MPI_Init_thread
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***and potentially your MPI job)
> [node039:57062] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all other
> processes were killed!
> *** An error occurred in MPI_Init_thread
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***and potentially your MPI job)
> [node012:64036] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all other
> processes were killed!
> *** An error occurred in MPI_Init_thread
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***and potentially your MPI job)
> [node008:14098] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all other
> processes were killed!
> *** An error occurred in MPI_Init_thread
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***and potentially your MPI job)
> [node011:61627] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all other
> processes were killed!
> [node005:51887] 1 more process has sent help message help-mpi-runtime /
> mpi_init:startup:internal-failure
> [node005:51887] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
>
> The library was configured with:
> ./configure \
> --prefix=/home/opt \
> --enable-static \
> --enable-mpi-thread-multiple \
> --with-threads
>
> gcc 4.8.2
>
> On Linux:
> Linux node001 2.6.32-279.14.1.el6.x86_64 #1 SMP Mon Oct 15 13:44:51 EDT
> 2012 x86_64 x86_64 x86_64 GNU/Linux
>
> The job was started with:
> sbatch --nodes=6 --ntasks=30 --mem=4096  -o result/TOn6t30.txt -e
> result/TEn6t30.txt job.sh
>
>
> job.sh contains:
> mpirun --mca btl tcp,self \
>--mca btl_tcp_if_include 172.24.38.0/24 \
>--mca oob_tcp_if_include eth0 \
> /home/umons/info/menezes/drsim/build/NameResolution/gameoflife_mpi2
> --columns=1000 --rows=1000
>
> I call MPI_INIT with:
> int provided;
> MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>
> The program is a simple game of life simulation. It runs fine in a single
> node (with one or many tasks). But fails at random nodes when distributed.
>
> Any hint may help.
>
> Best Regards,
>
> Nilo Menezes
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/05/26879.php
>


[OMPI users] cuIpcOpenMemHandle failure when using OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service

2015-05-19 Thread Lev Givon
I'm encountering intermittent errors while trying to use the Multi-Process
Service with CUDA 7.0 for improving concurrent access to a Kepler K20Xm GPU by
multiple MPI processes that perform GPU-to-GPU communication with each other
(i.e., GPU pointers are passed to the MPI transmission primitives). I'm using
GitHub revision 41676a1 of mpi4py built against OpenMPI 1.8.5, which is in turn
built against CUDA 7.0. In my current configuration, I have 4 MPS server daemons
running, each of which controls access to one of 4 GPUs; the MPI processes
spawned by my program are partitioned into 4 groups (which might contain
different numbers of processes) that each talk to a separate daemon. For certain
transmission patterns between these processes, the program runs without any
problems. For others (e.g., 16 processes partitioned into 4 groups), however, it
dies with the following error:

[node05:20562] Failed to register remote memory, rc=-1
--
The call to cuIpcOpenMemHandle failed. This is an unrecoverable error
and will cause the program to abort.
  cuIpcOpenMemHandle return value:   21199360
  address: 0x1
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
--
[node05:20562] [[58522,2],4] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at 
line 477
---
Child job 2 terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
---
[node05][[58522,2],5][btl_tcp_frag.c:142:mca_btl_tcp_frag_send]
mca_btl_tcp_frag_send: writev failed: Connection reset by peer (104)
[node05:20564] Failed to register remote memory, rc=-1
[node05:20564] [[58522,2],6] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at 
line 477
[node05:20566] Failed to register remote memory, rc=-1
[node05:20566] [[58522,2],8] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at 
line 477
[node05:20567] Failed to register remote memory, rc=-1
[node05:20567] [[58522,2],9] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at 
line 477
[node05][[58522,2],11][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node05:20569] Failed to register remote memory, rc=-1
[node05:20569] [[58522,2],11] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c 
at line 477
[node05:20571] Failed to register remote memory, rc=-1
[node05:20571] [[58522,2],13] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c 
at line 477
[node05:20572] Failed to register remote memory, rc=-1
[node05:20572] [[58522,2],14] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c 
at line 477

After the above error occurs, I notice that /dev/shm/ is littered with
cuda.shm.* files. I tried cleaning up /dev/shm before running my program, but
that doesn't seem to have any effect upon the problem. Rebooting the machine
also doesn't have any effect. I should also add that my program runs without any
error if the groups of MPI processes talk directly to the GPUs instead of via
MPS.
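
For what it's worth, the transfers in question are conceptually just device
pointers handed straight to the MPI calls. A minimal C sketch of that pattern
(names and sizes are made up, and it assumes a CUDA-aware Open MPI build; my
actual code drives this through mpi4py):

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* The buffer lives in GPU memory; the device pointer itself is handed to MPI. */
    const size_t n = 1 << 20;
    float *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    if (rank == 0) {
        /* CUDA-aware Open MPI detects the device pointer; for on-node peers it
           uses the CUDA IPC path, which is where cuIpcOpenMemHandle is called. */
        MPI_Send(d_buf, (int)n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, (int)n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return EXIT_SUCCESS;
}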

Does anyone have any ideas as to what could be going on?
-- 
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/



[OMPI users] Open MPI collectives algorithm selection

2015-05-19 Thread Khalid Hasanov
Hello,

I am trying to use coll_tuned_dynamic_rules_filename option.

I am not sure if I am doing everything right, but my impression is that the
config file feature does not work as expected.

For example, if I specify the config file as in the attached
ompi_tuned_file.conf and execute the attached simple broadcast example as:


> mpirun -n 16 --mca coll_tuned_use_dynamic_rules 1  --mca
> coll_tuned_dynamic_rules_filename ompi_tuned_file.conf   -mca
> coll_base_verbose 1  bcast_example
>
>
> 
> I would expect the config file to be ignored at run time, as it does not
> contain any configuration for communicator size 16. However, it uses the
> configuration for the last communicator, whose size is 5. I have attached
> the tuned_output file for more information.
>
> A similar problem exists even if the configuration file contains a config
> for communicator size 16. For example, I added communicator size 16 first
> and then communicator size 5 to the configuration file, but it still used
> the configuration for communicator size 5.
>
> Another interesting thing is that if the second communicator size is
> greater than the first one in the config file, then it seems to work
> correctly. At least I tested the case where the first communicator had
> size 16 and the second had size 55.
>
>
> I used a development version of Open MPI (1.9.0a1). I forked it into my
> own GitHub account (https://github.com/khalid-hasanov/ompi), and I have
> attached the ompi_info output as well.
>
> I have added some printf calls to coll_tuned_decision_dynamic.c to double
> check it:
>
> if (alg) {
>     printf("Men burdayam: alg=%d\n", alg);
>     /* we have found a valid choice from the file based rules
>        for this message size */
>     return ompi_coll_tuned_bcast_intra_do_this (buff, count, datatype, root,
>                                                 comm, module,
>                                                 alg, faninout, segsize);
> } /* found a method */
>
> Best regards,
> Khalid
>
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char** argv) {

    MPI_Init(&argc, &argv);

    int array[1024];
    int root = 0;

    MPI_Bcast(array, 1024, MPI_CHAR, root, MPI_COMM_WORLD);

    MPI_Finalize();

    return EXIT_SUCCESS;
}


ompi_tuned_file.conf
Description: Binary data


ompi_info_output
Description: Binary data


tuned_output
Description: Binary data


Re: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service

2015-05-19 Thread Rolf vandeVaart
I am not sure why you are seeing this.  One thing that is clear is that you 
have found a bug in the error reporting.  The error message is a little garbled 
and I see a bug in what we are reporting. I will fix that.

If possible, could you try running with --mca btl_smcuda_use_cuda_ipc 0?  My
expectation is that you will not see any errors, but you may lose some performance.
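
For example (a sketch only -- substitute your own launcher arguments and
program; the script name here is made up):

mpirun --mca btl_smcuda_use_cuda_ipc 0 -n 16 python your_program.py

That parameter disables the CUDA IPC path in the smcuda BTL, so on-node
GPU-to-GPU transfers fall back to staging through host memory.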

What does your hardware configuration look like?  Can you send me the output
of "nvidia-smi topo -m"?

Thanks,
Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
>Sent: Tuesday, May 19, 2015 6:30 PM
>To: us...@open-mpi.org
>Subject: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI
>1.8.5 with CUDA 7.0 and Multi-Process Service
>
>I'm encountering intermittent errors while trying to use the Multi-Process
>Service with CUDA 7.0 for improving concurrent access to a Kepler K20Xm GPU
>by multiple MPI processes that perform GPU-to-GPU communication with
>each other (i.e., GPU pointers are passed to the MPI transmission primitives).
>I'm using GitHub revision 41676a1 of mpi4py built against OpenMPI 1.8.5,
>which is in turn built against CUDA 7.0. In my current configuration, I have 4
>MPS server daemons running, each of which controls access to one of 4 GPUs;
>the MPI processes spawned by my program are partitioned into 4 groups
>(which might contain different numbers of processes) that each talk to a
>separate daemon. For certain transmission patterns between these
>processes, the program runs without any problems. For others (e.g., 16
>processes partitioned into 4 groups), however, it dies with the following 
>error:
>
>[node05:20562] Failed to register remote memory, rc=-1
>--
>The call to cuIpcOpenMemHandle failed. This is an unrecoverable error and
>will cause the program to abort.
>  cuIpcOpenMemHandle return value:   21199360
>  address: 0x1
>Check the cuda.h file for what the return value means. Perhaps a reboot of
>the node will clear the problem.
>--
>[node05:20562] [[58522,2],4] ORTE_ERROR_LOG: Error in file
>pml_ob1_recvreq.c at line 477
>---
>Child job 2 terminated normally, but 1 process returned a non-zero exit code..
>Per user-direction, the job has been aborted.
>---
>[node05][[58522,2],5][btl_tcp_frag.c:142:mca_btl_tcp_frag_send]
>mca_btl_tcp_frag_send: writev failed: Connection reset by peer (104)
>[node05:20564] Failed to register remote memory, rc=-1 [node05:20564]
>[[58522,2],6] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>[node05:20566] Failed to register remote memory, rc=-1 [node05:20566]
>[[58522,2],8] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>[node05:20567] Failed to register remote memory, rc=-1 [node05:20567]
>[[58522,2],9] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>[node05][[58522,2],11][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv]
>mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>[node05:20569] Failed to register remote memory, rc=-1 [node05:20569]
>[[58522,2],11] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>[node05:20571] Failed to register remote memory, rc=-1 [node05:20571]
>[[58522,2],13] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>[node05:20572] Failed to register remote memory, rc=-1 [node05:20572]
>[[58522,2],14] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>
>After the above error occurs, I notice that /dev/shm/ is littered with
>cuda.shm.* files. I tried cleaning up /dev/shm before running my program,
>but that doesn't seem to have any effect upon the problem. Rebooting the
>machine also doesn't have any effect. I should also add that my program runs
>without any error if the groups of MPI processes talk directly to the GPUs
>instead of via MPS.
>
>Does anyone have any ideas as to what could be going on?
>--
>Lev Givon
>Bionet Group | Neurokernel Project
>http://www.columbia.edu/~lev/
>http://lebedov.github.io/
>http://neurokernel.github.io/
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-
>mpi.org/community/lists/users/2015/05/26881.php


Re: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service

2015-05-19 Thread Lev Givon
Received from Rolf vandeVaart on Tue, May 19, 2015 at 08:28:46PM EDT:
> 
> >-Original Message-
> >From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
> >Sent: Tuesday, May 19, 2015 6:30 PM
> >To: us...@open-mpi.org
> >Subject: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI
> >1.8.5 with CUDA 7.0 and Multi-Process Service
> >
> >I'm encountering intermittent errors while trying to use the Multi-Process
> >Service with CUDA 7.0 for improving concurrent access to a Kepler K20Xm GPU
> >by multiple MPI processes that perform GPU-to-GPU communication with
> >each other (i.e., GPU pointers are passed to the MPI transmission 
> >primitives).
> >I'm using GitHub revision 41676a1 of mpi4py built against OpenMPI 1.8.5,
> >which is in turn built against CUDA 7.0. In my current configuration, I have 
> >4
> >MPS server daemons running, each of which controls access to one of 4 GPUs;
> >the MPI processes spawned by my program are partitioned into 4 groups
> >(which might contain different numbers of processes) that each talk to a
> >separate daemon. For certain transmission patterns between these
> >processes, the program runs without any problems. For others (e.g., 16
> >processes partitioned into 4 groups), however, it dies with the following 
> >error:
> >
> >[node05:20562] Failed to register remote memory, rc=-1
> >--
> >The call to cuIpcOpenMemHandle failed. This is an unrecoverable error and
> >will cause the program to abort.
> >  cuIpcOpenMemHandle return value:   21199360
> >  address: 0x1
> >Check the cuda.h file for what the return value means. Perhaps a reboot of
> >the node will clear the problem.

(snip)

> >After the above error occurs, I notice that /dev/shm/ is littered with
> >cuda.shm.* files. I tried cleaning up /dev/shm before running my program,
> >but that doesn't seem to have any effect upon the problem. Rebooting the
> >machine also doesn't have any effect. I should also add that my program runs
> >without any error if the groups of MPI processes talk directly to the GPUs
> >instead of via MPS.
> >
> >Does anyone have any ideas as to what could be going on?
>
> I am not sure why you are seeing this.  One thing that is clear is that you
> have found a bug in the error reporting.  The error message is a little
> garbled and I see a bug in what we are reporting. I will fix that.
> 
> If possible, could you try running with --mca btl_smcuda_use_cuda_ipc 0.  My
> expectation is that you will not see any errors, but may lose some
> performance.
> 
> What does your hardware configuration look like?  Can you send me output from
> "nvidia-smi topo -m"

        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      PHB     SOC     SOC     0-23
GPU1    PHB      X      SOC     SOC     0-23
GPU2    SOC     SOC      X      PHB     0-23
GPU3    SOC     SOC     PHB      X      0-23

Legend:

  X   = Self
  SOC = Path traverses a socket-level link (e.g. QPI)
  PHB = Path traverses a PCIe host bridge
  PXB = Path traverses multiple PCIe internal switches
  PIX = Path traverses a PCIe internal switch
-- 
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/



Re: [OMPI users] Open MPI collectives algorithm selection

2015-05-19 Thread Gilles Gouaillardet

Hi Khalid,

I checked the source code, and it turns out the rules must be ordered:
- first by communicator size
- second by message size

Attached is an updated version of the ompi_tuned_file.conf that you should use.


Cheers,

Gilles

On 5/20/2015 8:39 AM, Khalid Hasanov wrote:

Hello,
I am trying to use coll_tuned_dynamic_rules_filename option.

I am not sure if I do everything right or not. But my impression is 
that config file feature does not work as expected.
For example, if I specify config file as in the attached 
ompi_tuned_file.conf and execute the attached simple broadcast example 
as :


mpirun -n 16 --mca coll_tuned_use_dynamic_rules 1  --mca
coll_tuned_dynamic_rules_filename ompi_tuned_file.conf   -mca
coll_base_verbose 1 bcast_example




I would expect that during run time the config file should be
ignored as it does not contain any configuration for communicator
size 16. However, it uses configuration for the last communicator
for which the size is 5. I have attached tuned_output file for
more information.

Similar problem exists even if the configuration file contains
config for communicator size 16. For example , I added to the
configuration file first communicator size 16 then communicator
size 5. But it used configuration for communicator size 5.

Another interesting thing is that if the second communicator size
is greater than the first communicator in the config file then it
seems to work correctly. At least I tested it for the case where
communicator one had size 16 and second had 55.


I used a development version of Open MPI (1.9.0a1). I forked it
into my own github (https://github.com/khalid-hasanov/ompi) and I
have attached ompi_info outputs as well.

I have added some printfs into coll_tuned_decision_dynamic.c file
to double check it:

if (alg) {

printf("Men burdayam: alg=%d\n", alg);

/* we have found a valid choice from the file based rules for this
message size */

return ompi_coll_tuned_bcast_intra_do_this (buff, count, datatype,
root,

  comm, module,

  alg, faninout, segsize);

} /* found a method */





Best regards,
Khalid



___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/05/26882.php


1   # num of collectives
7   # ID = 7 Bcast collective (ID in coll_base_functions.h)
2   # number of comm sizes 2
5   # comm size 5
7   # number of msg sizes 7
0 1 0 0 # for message size 0, linear 1, topo 0, segmentation 0
1024 5 0 0  # for message size 1024, binary tree 5, topo 0, 0 segmentation
8192 6 0 0  # message size 8k, binomial tree 6, topo 0, 0 segmentation
16384 5 0 0 # message size 16k, binary tree 5, topo 0, 0 segmentation
32768 6 0 0 # 32k, binomial tree 6, no topo or segmentation
262144 3 0 0  # 256k, pipeline 3, no topo or segmentation
524288 4 0 0  # message size 512k+, split-binary 4, topo 0, 0 segmentation
8   # comm size 16
7   # number of msg sizes 7
0 1 0 0 # for message size 0, linear 1, topo 0, segmentation 0
1024 6 0 0  # for message size 1024, binomial tree 6, topo 0, 0 segmentation
8192 6 0 0  # message size 8k, binomial tree 6, topo 0, 0 segmentation
16384 5 0 0 # message size 16k, binary tree 5, topo 0, 0 segmentation
32768 2 0 0 # 32k, chain 2, no topo or segmentation
262144 3 0 0  # 256k, pipeline 3, no topo or segmentation
524288 4 0 0  # message size 512k+, split-binary 4, topo 0, 0 segmentation
# end of first collective


Re: [OMPI users] Open MPI collectives algorithm selection

2015-05-19 Thread Khalid Hasanov
Hi Gilles,

Thank you a lot, it works now.

Just one minor thing I have noticed now: if I use a communicator size that
does not exist in the configuration file, it will still use the configuration
file. For example, if I use the previous config file with mpirun -n 4, it will
use the config for comm size 5 (the first one). The same happens whenever n is
less than 16. If n > 16, it will use the config for communicator size 16 (the
second one). I am writing this just in case it is not the expected behaviour.

Thanks again.

Best regards,
Khalid


On Wed, May 20, 2015 at 2:12 AM, Gilles Gouaillardet 
wrote:

>  Hi Khalid,
>
> i checked the source code and it turns out rules must be ordered :
> - first by communicator size
> - second by message size
>
> Here is attached an updated version of the ompi_tuned_file.conf you should
> use
>
> Cheers,
>
> Gilles
>
>
> On 5/20/2015 8:39 AM, Khalid Hasanov wrote:
>
>  Hello,
>
> I am trying to use coll_tuned_dynamic_rules_filename option.
>
>  I am not sure if I do everything right or not. But my impression is that
> config file feature does not work as expected.
>
> For example, if I specify config file as in the attached
> ompi_tuned_file.conf and execute the attached simple broadcast example as :
>
>
>>   mpirun -n 16 --mca coll_tuned_use_dynamic_rules 1  --mca
>> coll_tuned_dynamic_rules_filename ompi_tuned_file.conf   -mca
>> coll_base_verbose 1  bcast_example
>>
>>
>> 
>> I would expect that during run time the config file should be ignored as
>> it does not contain any configuration for communicator size 16. However, it
>> uses configuration for the last communicator for which the size is 5. I
>> have attached tuned_output file for more information.
>>
>>  Similar problem exists even if the configuration file contains config
>> for communicator size 16. For example , I added to the configuration file
>> first communicator size 16 then communicator size 5. But it used
>> configuration for communicator size 5.
>>
>>  Another interesting thing is that if the second communicator size is
>> greater than the first communicator in the config file then it seems to
>> work correctly. At least I tested it for the case where communicator one
>> had size 16 and second had 55.
>>
>>
>>  I used a development version of Open MPI (1.9.0a1). I forked it into my
>> own github (https://github.com/khalid-hasanov/ompi) and I have attached
>> ompi_info outputs as well.
>>
>>  I have added some printfs into coll_tuned_decision_dynamic.c file to
>> double check it:
>>
>>  if (alg) {
>>
>> printf("Men burdayam: alg=%d\n", alg);
>>
>> /* we have found a valid choice from the file based rules
>> for this message size */
>>
>> return ompi_coll_tuned_bcast_intra_do_this (buff, count,
>> datatype, root,
>>
>> comm, module,
>>
>> alg, faninout,
>> segsize);
>>
>> } /* found a method */
>>
>>
>>
>>
>>  Best regards,
>> Khalid
>>
>
>
> ___
> users mailing listus...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/05/26882.php
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/05/26885.php
>


Re: [OMPI users] Open MPI collectives algorithm selection

2015-05-19 Thread Gilles Gouaillardet

Khalid,

This is probably not the intended behavior; I will follow up on the devel
mailing list.


Thanks for reporting this

Cheers,

Gilles

On 5/20/2015 10:30 AM, Khalid Hasanov wrote:

Hi Gilles,

Thank you a lot, it works now.

Just one minor thing I have seen now. If I use some communicator size 
which does not exist in the configuration file, it will still use the 
configuration file. For example, if I use the previous config file 
with mpirun -n 4 it will use the config for the comm size 5 (the first 
one). The same happens if n is less than 16. If n > 16 it will use the 
config for the communicator size 16 (the second one). I am writing 
this just in case it is not expected behaviour.


Thanks again.

Best regards,
Khalid


On Wed, May 20, 2015 at 2:12 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:


Hi Khalid,

i checked the source code and it turns out rules must be ordered :
- first by communicator size
- second by message size

Here is attached an updated version of the ompi_tuned_file.conf
you should use

Cheers,

Gilles


On 5/20/2015 8:39 AM, Khalid Hasanov wrote:

Hello,
I am trying to use coll_tuned_dynamic_rules_filename option.

I am not sure if I do everything right or not. But my impression
is that config file feature does not work as expected.
For example, if I specify config file as in the attached
ompi_tuned_file.conf and execute the attached simple broadcast
example as :

mpirun -n 16 --mca coll_tuned_use_dynamic_rules 1  --mca
coll_tuned_dynamic_rules_filename ompi_tuned_file.conf   -mca
coll_base_verbose 1  bcast_example




I would expect that during run time the config file should be
ignored as it does not contain any configuration for
communicator size 16. However, it uses configuration for the
last communicator for which the size is 5. I have attached
tuned_output file for more information.

Similar problem exists even if the configuration file
contains config for communicator size 16. For example , I
added to the configuration file first communicator size 16
then communicator size 5. But it used configuration for
communicator size 5.

Another interesting thing is that if the second communicator
size is greater than the first communicator in the config
file then it seems to work correctly. At least I tested it
for the case where communicator one had size 16 and second
had 55.


I used a development version of Open MPI (1.9.0a1). I forked
it into my own github
(https://github.com/khalid-hasanov/ompi) and I have attached
ompi_info outputs as well.

I have added some printfs into coll_tuned_decision_dynamic.c
file to double check it:

if (alg) {

printf("Men burdayam: alg=%d\n", alg);

/* we have found a valid choice from the file based rules for
this message size */

return ompi_coll_tuned_bcast_intra_do_this (buff, count,
datatype, root,

comm, module,

alg, faninout, segsize);

} /* found a method */





Best regards,
Khalid



___
users mailing list
us...@open-mpi.org  
Subscription:http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this 
post:http://www.open-mpi.org/community/lists/users/2015/05/26882.php



___
users mailing list
us...@open-mpi.org 
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/05/26885.php




___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/05/26886.php




Re: [OMPI users] Open MPI collectives algorithm selection

2015-05-19 Thread Khalid Hasanov
Thanks a lot, Gilles.

On Wed, May 20, 2015 at 2:47 AM, Gilles Gouaillardet 
wrote:

>  Khalid,
>
> this is probably not the intended behavior, i will followup on the devel
> mailing list.
>
> Thanks for reporting this
>
> Cheers,
>
> Gilles
>
>
> On 5/20/2015 10:30 AM, Khalid Hasanov wrote:
>
> Hi Gilles,
>
>  Thank you a lot, it works now.
>
>  Just one minor thing I have seen now. If I use some communicator size
> which does not exist in the configuration file, it will still use the
> configuration file. For example, if I use the previous config file with
> mpirun -n 4 it will use the config for the comm size 5 (the first one). The
> same happens if n is less than 16. If n > 16 it will use the config for the
> communicator size 16 (the second one). I am writing this just in case it is
> not expected behaviour.
>
>  Thanks again.
>
>  Best regards,
> Khalid
>
>
> On Wed, May 20, 2015 at 2:12 AM, Gilles Gouaillardet 
> wrote:
>
>>  Hi Khalid,
>>
>> i checked the source code and it turns out rules must be ordered :
>> - first by communicator size
>> - second by message size
>>
>> Here is attached an updated version of the ompi_tuned_file.conf you
>> should use
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On 5/20/2015 8:39 AM, Khalid Hasanov wrote:
>>
>>   Hello,
>>
>> I am trying to use coll_tuned_dynamic_rules_filename option.
>>
>>  I am not sure if I do everything right or not. But my impression is
>> that config file feature does not work as expected.
>>
>> For example, if I specify config file as in the attached
>> ompi_tuned_file.conf and execute the attached simple broadcast example as :
>>
>>
>>>   mpirun -n 16 --mca coll_tuned_use_dynamic_rules 1  --mca
>>> coll_tuned_dynamic_rules_filename ompi_tuned_file.conf   -mca
>>> coll_base_verbose 1  bcast_example
>>>
>>>
>>> 
>>> I would expect that during run time the config file should be ignored as
>>> it does not contain any configuration for communicator size 16. However, it
>>> uses configuration for the last communicator for which the size is 5. I
>>> have attached tuned_output file for more information.
>>>
>>>  Similar problem exists even if the configuration file contains config
>>> for communicator size 16. For example , I added to the configuration file
>>> first communicator size 16 then communicator size 5. But it used
>>> configuration for communicator size 5.
>>>
>>>  Another interesting thing is that if the second communicator size is
>>> greater than the first communicator in the config file then it seems to
>>> work correctly. At least I tested it for the case where communicator one
>>> had size 16 and second had 55.
>>>
>>>
>>>  I used a development version of Open MPI (1.9.0a1). I forked it into
>>> my own github (https://github.com/khalid-hasanov/ompi) and I have
>>> attached ompi_info outputs as well.
>>>
>>>  I have added some printfs into coll_tuned_decision_dynamic.c file to
>>> double check it:
>>>
>>>  if (alg) {
>>>
>>> printf("Men burdayam: alg=%d\n", alg);
>>>
>>> /* we have found a valid choice from the file based rules
>>> for this message size */
>>>
>>> return ompi_coll_tuned_bcast_intra_do_this (buff, count,
>>> datatype, root,
>>>
>>> comm, module,
>>>
>>> alg, faninout,
>>> segsize);
>>>
>>> } /* found a method */
>>>
>>>
>>>
>>>
>>>  Best regards,
>>> Khalid
>>>
>>
>>
>>  ___
>> users mailing listus...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2015/05/26882.php
>>
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/05/26885.php
>>
>
>
>
> ___
> users mailing listus...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/05/26886.php
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/05/26887.php
>


Re: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service

2015-05-19 Thread Lev Givon
Received from Rolf vandeVaart on Tue, May 19, 2015 at 08:28:46PM EDT:
> >-Original Message-
> >From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
> >Sent: Tuesday, May 19, 2015 6:30 PM
> >To: us...@open-mpi.org
> >Subject: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI
> >1.8.5 with CUDA 7.0 and Multi-Process Service
> >
> >I'm encountering intermittent errors while trying to use the Multi-Process
> >Service with CUDA 7.0 for improving concurrent access to a Kepler K20Xm GPU
> >by multiple MPI processes that perform GPU-to-GPU communication with
> >each other (i.e., GPU pointers are passed to the MPI transmission 
> >primitives).
> >I'm using GitHub revision 41676a1 of mpi4py built against OpenMPI 1.8.5,
> >which is in turn built against CUDA 7.0. In my current configuration, I have 
> >4
> >MPS server daemons running, each of which controls access to one of 4 GPUs;
> >the MPI processes spawned by my program are partitioned into 4 groups
> >(which might contain different numbers of processes) that each talk to a
> >separate daemon. For certain transmission patterns between these
> >processes, the program runs without any problems. For others (e.g., 16
> >processes partitioned into 4 groups), however, it dies with the following 
> >error:
> >
> >[node05:20562] Failed to register remote memory, rc=-1
> >--
> >The call to cuIpcOpenMemHandle failed. This is an unrecoverable error and
> >will cause the program to abort.
> >  cuIpcOpenMemHandle return value:   21199360
> >  address: 0x1
> >Check the cuda.h file for what the return value means. Perhaps a reboot of
> >the node will clear the problem.

(snip)

> >After the above error occurs, I notice that /dev/shm/ is littered with
> >cuda.shm.* files. I tried cleaning up /dev/shm before running my program,
> >but that doesn't seem to have any effect upon the problem. Rebooting the
> >machine also doesn't have any effect. I should also add that my program runs
> >without any error if the groups of MPI processes talk directly to the GPUs
> >instead of via MPS.
> >
> >Does anyone have any ideas as to what could be going on?
>
> I am not sure why you are seeing this.  One thing that is clear is that you
> have found a bug in the error reporting.  The error message is a little
> garbled and I see a bug in what we are reporting. I will fix that.
> 
> If possible, could you try running with --mca btl_smcuda_use_cuda_ipc 0.  My
> expectation is that you will not see any errors, but may lose some
> performance.

The error does indeed go away when IPC is disabled, although I do want to
avoid degrading the performance of data transfers between GPU memory locations.

> What does your hardware configuration look like?  Can you send me output from
> "nvidia-smi topo -m"
-- 
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/