Re: [OMPI users] problem with exceptions in Java interface

2016-08-29 Thread Gilles Gouaillardet

Siegmar and all,


I am puzzled by this error.

On the one hand, it is caused by an invalid buffer

(e.g. the buffer size is 1, but the user claims it is 2),

so I am fine with the current behavior
(i.e. java.lang.ArrayIndexOutOfBoundsException is thrown).


/* if this were a C program, it would very likely SIGSEGV; Open MPI
does not catch this kind of error when checking params */
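
For illustration, here is a rough C analogue of that failure mode (this is
not the actual Exception_2_Main test, which is Java and not shown in this
thread; the buffer size and count below are assumed):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int buf[1];                      /* room for exactly 1 element... */
        MPI_Init(&argc, &argv);
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
        /* ...but the count claims 2: Open MPI's parameter checking cannot
         * see the real allocation size, so MPI_ERRORS_RETURN does not help;
         * the call can corrupt memory or SIGSEGV instead of returning an
         * error code. */
        int rc = MPI_Bcast(buf, 2, MPI_INT, 0, MPI_COMM_WORLD);
        if (rc != MPI_SUCCESS) {
            fprintf(stderr, "MPI_Bcast returned error %d\n", rc);
        }
        MPI_Finalize();
        return 0;
    }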



On the other hand, Open MPI could be enhanced to check the buffer size
and throw an MPIException in this case.



As far as I am concerned, this is a feature request and not a bug.


Thoughts, anyone?


Cheers,


Gilles

On 8/29/2016 3:48 PM, Siegmar Gross wrote:

Hi,

I have installed v1.10.3-31-g35ba6a1, openmpi-v2.0.0-233-gb5f0a4f,
and openmpi-dev-4691-g277c319 on my "SUSE Linux Enterprise Server
12 (x86_64)" with Sun C 5.14 beta and gcc-6.1.0. In May I had
reported a problem with Java exceptions (PR 1698) which had
been solved in June (PR 1803).

https://github.com/open-mpi/ompi/issues/1698
https://github.com/open-mpi/ompi/pull/1803

Unfortunately the problem still exists or exists once more
in all three branches.


loki fd1026 112 ompi_info | grep -e "Open MPI repo revision" -e "C compiler absolute"
  Open MPI repo revision: dev-4691-g277c319
 C compiler absolute: /opt/solstudio12.5b/bin/cc
loki fd1026 112 mpijavac Exception_2_Main.java
warning: [path] bad path element "/usr/local/openmpi-master_64_cc/lib64/shmem.jar": no such file or directory
1 warning
loki fd1026 113 mpiexec -np 1 java Exception_2_Main
Set error handler for MPI.COMM_WORLD to MPI.ERRORS_RETURN.
Call "bcast" with index out-of bounds.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
at mpi.Comm.bcast(Native Method)
at mpi.Comm.bcast(Comm.java:1252)
at Exception_2_Main.main(Exception_2_Main.java:22)
---
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
---
--
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[58548,1],0]
  Exit code:1
--


loki fd1026 114 exit



loki fd1026 116 ompi_info | grep -e "Open MPI repo revision" -e "C compiler absolute"
  Open MPI repo revision: v2.0.0-233-gb5f0a4f
 C compiler absolute: /opt/solstudio12.5b/bin/cc
loki fd1026 117 mpijavac Exception_2_Main.java
warning: [path] bad path element "/usr/local/openmpi-2.0.1_64_cc/lib64/shmem.jar": no such file or directory
1 warning
loki fd1026 118 mpiexec -np 1 java Exception_2_Main
Set error handler for MPI.COMM_WORLD to MPI.ERRORS_RETURN.
Call "bcast" with index out-of bounds.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
at mpi.Comm.bcast(Native Method)
at mpi.Comm.bcast(Comm.java:1252)
at Exception_2_Main.main(Exception_2_Main.java:22)
---
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
---
--
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[58485,1],0]
  Exit code:1
--


loki fd1026 119 exit



loki fd1026 107 ompi_info | grep -e "Open MPI repo revision" -e "C compiler absolute"
  Open MPI repo revision: v1.10.3-31-g35ba6a1
 C compiler absolute: /opt/solstudio12.5b/bin/cc
loki fd1026 107 mpijavac Exception_2_Main.java
loki fd1026 108 mpiexec -np 1 java Exception_2_Main
Set error handler for MPI.COMM_WORLD to MPI.ERRORS_RETURN.
Call "bcast" with index out-of bounds.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
at mpi.Comm.bcast(Native Method)
at mpi.Comm.bcast(Comm.java:1231)
at Exception_2_Main.main(Exception_2_Main.java:22)
---
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
---
--
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[34400,1],0]
  Exit code:1
--


loki fd1026 109 exit




I would be grateful, if

Re: [OMPI users] problem with exceptions in Java interface

2016-08-29 Thread Siegmar Gross

Hi Gilles,

isn't it possible to pass all exceptions from the Java interface
to the calling method? I can live with the current handling of
exceptions as well, although some exceptions can be handled
within my program and some will break my program even if I want
to handle exceptions myself. I understood PR 1698 to mean that
all exceptions can be processed in the user program if the user
chooses MPI.ERRORS_RETURN (otherwise this change request wouldn't
have been necessary). Nevertheless, if you decide to leave things
as they are, I'm happy with your decision as well.


Kind regards

Siegmar


Am 29.08.2016 um 10:30 schrieb Gilles Gouaillardet:

Siegmar and all,


I am puzzled by this error.

On the one hand, it is caused by an invalid buffer

(e.g. the buffer size is 1, but the user claims it is 2),

so I am fine with the current behavior
(i.e. java.lang.ArrayIndexOutOfBoundsException is thrown).

/* if this were a C program, it would very likely SIGSEGV; Open MPI does
not catch this kind of error when checking params */


On the other hand, Open MPI could be enhanced to check the buffer size and
throw an MPIException in this case.


As far as I am concerned, this is a feature request and not a bug.


Thoughts, anyone?


Cheers,


Gilles

On 8/29/2016 3:48 PM, Siegmar Gross wrote:

Hi,

I have installed v1.10.3-31-g35ba6a1, openmpi-v2.0.0-233-gb5f0a4f,
and openmpi-dev-4691-g277c319 on my "SUSE Linux Enterprise Server
12 (x86_64)" with Sun C 5.14 beta and gcc-6.1.0. In May I had
reported a problem with Java exceptions (PR 1698) which had
been solved in June (PR 1803).

https://github.com/open-mpi/ompi/issues/1698
https://github.com/open-mpi/ompi/pull/1803

Unfortunately the problem still exists or exists once more
in all three branches.


loki fd1026 112 ompi_info | grep -e "Open MPI repo revision" -e "C compiler
absolute"
  Open MPI repo revision: dev-4691-g277c319
 C compiler absolute: /opt/solstudio12.5b/bin/cc
loki fd1026 112 mpijavac Exception_2_Main.java
warning: [path] bad path element
"/usr/local/openmpi-master_64_cc/lib64/shmem.jar": no such file or directory
1 warning
loki fd1026 113 mpiexec -np 1 java Exception_2_Main
Set error handler for MPI.COMM_WORLD to MPI.ERRORS_RETURN.
Call "bcast" with index out-of bounds.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
at mpi.Comm.bcast(Native Method)
at mpi.Comm.bcast(Comm.java:1252)
at Exception_2_Main.main(Exception_2_Main.java:22)
---
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
---
--
mpiexec detected that one or more processes exited with non-zero status,
thus causing
the job to be terminated. The first process to do so was:

  Process name: [[58548,1],0]
  Exit code:1
--
loki fd1026 114 exit



loki fd1026 116 ompi_info | grep -e "Open MPI repo revision" -e "C compiler
absolute"
  Open MPI repo revision: v2.0.0-233-gb5f0a4f
 C compiler absolute: /opt/solstudio12.5b/bin/cc
loki fd1026 117 mpijavac Exception_2_Main.java
warning: [path] bad path element
"/usr/local/openmpi-2.0.1_64_cc/lib64/shmem.jar": no such file or directory
1 warning
loki fd1026 118 mpiexec -np 1 java Exception_2_Main
Set error handler for MPI.COMM_WORLD to MPI.ERRORS_RETURN.
Call "bcast" with index out-of bounds.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
at mpi.Comm.bcast(Native Method)
at mpi.Comm.bcast(Comm.java:1252)
at Exception_2_Main.main(Exception_2_Main.java:22)
---
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
---
--
mpiexec detected that one or more processes exited with non-zero status,
thus causing
the job to be terminated. The first process to do so was:

  Process name: [[58485,1],0]
  Exit code:1
--
loki fd1026 119 exit



loki fd1026 107 ompi_info | grep -e "Open MPI repo revision" -e "C compiler
absolute"
  Open MPI repo revision: v1.10.3-31-g35ba6a1
 C compiler absolute: /opt/solstudio12.5b/bin/cc
loki fd1026 107 mpijavac Exception_2_Main.java
loki fd1026 108 mpiexec -np 1 java Exception_2_Main
Set error handler for MPI.COMM_WORLD to MPI.ERRORS_RETURN.
Call "bcast" with index out-of bounds.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
at mpi.Comm.bcast(Native Method)
at mpi.Comm.bcast(Comm.java:1231)
at Exception_2_Main.main(Exception_2_Main.java:22)
---

Re: [OMPI users] problem with exceptions in Java interface

2016-08-29 Thread Gilles Gouaillardet
Hi Siegmar,

I will review PR 1698 and wait for some more feedback from the developers; they
might have different views than mine.
Assuming PR 1698 does what you expect, it still does not catch all user errors.
For example, if you MPI_Send a buffer that is too short, the exception
might be thrown at any time.
In the worst case, it will occur in the progress thread and outside of any
MPI call, which means it cannot be "converted" into an MPIException.

FWIW, we have a way to check buffers, but it requires that
1. Open MPI is configure'd with --enable-memchecker
and
2. the MPI tasks are run under valgrind.
IIRC, valgrind will issue an error message if the buffer is invalid, and
the app will crash afterwards
(i.e. the MPI subroutine will not return with an error code the end user
can "trap").

Such checks might be easier to make in Java, and the resulting errors might be
easily made "trappable", but as far as I am concerned:
1. this has a runtime overhead, and
2. this is a new development.

Let's follow up at https://github.com/open-mpi/ompi/issues/1698 from now on.

Cheers,

Gilles

On Monday, August 29, 2016, Siegmar Gross <
siegmar.gr...@informatik.hs-fulda.de> wrote:

> Hi Gilles,
>
> isn't it possible to pass all exceptions from the Java interface
> to the calling method? I can live with the current handling of
> exceptions as well, although some exceptions can be handled
> within my program and some will break my program even if I want
> to handle exceptions myself. I understood PR 1698 to mean that
> all exceptions can be processed in the user program if the user
> chooses MPI.ERRORS_RETURN (otherwise this change request wouldn't
> have been necessary). Nevertheless, if you decide to leave things
> as they are, I'm happy with your decision as well.
>
>
> Kind regards
>
> Siegmar
>
>
> Am 29.08.2016 um 10:30 schrieb Gilles Gouaillardet:
>
>> Siegmar and all,
>>
>>
>> I am puzzled by this error.
>>
>> On the one hand, it is caused by an invalid buffer
>>
>> (e.g. the buffer size is 1, but the user claims it is 2),
>>
>> so I am fine with the current behavior
>> (i.e. java.lang.ArrayIndexOutOfBoundsException is thrown).
>>
>> /* if this were a C program, it would very likely SIGSEGV; Open MPI
>> does not catch this kind of error when checking params */
>>
>>
>> On the other hand, Open MPI could be enhanced to check the buffer size
>> and throw an MPIException in this case.
>>
>>
>> As far as I am concerned, this is a feature request and not a bug.
>>
>>
>> Thoughts, anyone?
>>
>>
>> Cheers,
>>
>>
>> Gilles
>>
>> On 8/29/2016 3:48 PM, Siegmar Gross wrote:
>>
>>> Hi,
>>>
>>> I have installed v1.10.3-31-g35ba6a1, openmpi-v2.0.0-233-gb5f0a4f,
>>> and openmpi-dev-4691-g277c319 on my "SUSE Linux Enterprise Server
>>> 12 (x86_64)" with Sun C 5.14 beta and gcc-6.1.0. In May I had
>>> reported a problem with Java exceptions (PR 1698) which had
>>> been solved in June (PR 1803).
>>>
>>> https://github.com/open-mpi/ompi/issues/1698
>>> https://github.com/open-mpi/ompi/pull/1803
>>>
>>> Unfortunately the problem still exists or exists once more
>>> in all three branches.
>>>
>>>
>>> loki fd1026 112 ompi_info | grep -e "Open MPI repo revision" -e "C
>>> compiler
>>> absolute"
>>>   Open MPI repo revision: dev-4691-g277c319
>>>  C compiler absolute: /opt/solstudio12.5b/bin/cc
>>> loki fd1026 112 mpijavac Exception_2_Main.java
>>> warning: [path] bad path element
>>> "/usr/local/openmpi-master_64_cc/lib64/shmem.jar": no such file or
>>> directory
>>> 1 warning
>>> loki fd1026 113 mpiexec -np 1 java Exception_2_Main
>>> Set error handler for MPI.COMM_WORLD to MPI.ERRORS_RETURN.
>>> Call "bcast" with index out-of bounds.
>>> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
>>> at mpi.Comm.bcast(Native Method)
>>> at mpi.Comm.bcast(Comm.java:1252)
>>> at Exception_2_Main.main(Exception_2_Main.java:22)
>>> ---
>>> Primary job  terminated normally, but 1 process returned
>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>> ---
>>> 
>>> --
>>> mpiexec detected that one or more processes exited with non-zero status,
>>> thus causing
>>> the job to be terminated. The first process to do so was:
>>>
>>>   Process name: [[58548,1],0]
>>>   Exit code:1
>>> 
>>> --
>>> loki fd1026 114 exit
>>>
>>>
>>>
>>> loki fd1026 116 ompi_info | grep -e "Open MPI repo revision" -e "C
>>> compiler
>>> absolute"
>>>   Open MPI repo revision: v2.0.0-233-gb5f0a4f
>>>  C compiler absolute: /opt/solstudio12.5b/bin/cc
>>> loki fd1026 117 mpijavac Exception_2_Main.java
>>> warning: [path] bad path element
>>> "/usr/local/openmpi-2.0.1_64_cc/lib64/shmem.jar": no such file or
>>> directory
>>> 1 warning
>>> loki fd1026 118 mpiexec -np 1 java Exception_2_Main
>>> Set erro

[OMPI users] Fwd: Help - Client / server - app hangs in connect/accept by the second or next client that wants to connect to server

2016-08-29 Thread M. D.
Hi,

I would like to ask - are there any new solutions or investigations into this
problem?

Cheers,

Matus Dobrotka

2016-07-19 15:23 GMT+02:00 Gilles Gouaillardet <
gilles.gouaillar...@gmail.com>:

> my bad for the confusion,
>
> I misread you and miswrote my reply.
>
> I will investigate this again.
>
> strictly speaking, the clients can only start after the server has first
> written the port info to a file.
> if you start the clients right after the server starts, they might use
> incorrect/outdated info and cause the whole test to hang.
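
A server-side sketch of the port hand-off being described; the file name and
structure here are illustrative, not the actual singleton_client_server
source:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        char port_name[MPI_MAX_PORT_NAME];
        MPI_Init(&argc, &argv);
        MPI_Open_port(MPI_INFO_NULL, port_name);
        /* clients must read this file only after it has been written,
         * otherwise they connect with stale or garbage port info */
        FILE *fp = fopen("server_port.txt", "w");
        fprintf(fp, "%s\n", port_name);
        fclose(fp);
        /* ... then MPI_Comm_accept(port_name, ...) once per expected client ... */
        MPI_Finalize();
        return 0;
    }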
>
> I will start reproducing the hang
>
> Cheers,
>
> Gilles
>
>
> On Tuesday, July 19, 2016, M. D.  wrote:
>
>> Yes I understand it, but I think, this is exactly that situation you are
>> talking about. In my opinion, the test is doing exactly what you said -
>> when a new player is willing to join, other players must invoke 
>> MPI_Comm_accept().
>> All *other* players must invoke MPI_Comm_accept(). Only the last client
>> (in this case last player which wants to join) does not
>> invoke MPI_Comm_accept(), because this client invokes only
>> MPI_Comm_connect(). He is connecting to the communicator in which all the other
>> players are already involved, and therefore this last client doesn't have to
>> invoke MPI_Comm_accept().
>>
>> Am I still missing something in my reasoning?
>>
>> Matus
>>
>> 2016-07-19 10:55 GMT+02:00 Gilles Gouaillardet :
>>
>>> here is what the client is doing
>>>
>>> printf("CLIENT: after merging, new comm: size=%d rank=%d\n", size,
>>> rank) ;
>>>
>>> for (i = rank ; i < num_clients ; i++)
>>> {
>>>   /* client performs a collective accept */
>>>   CHK(MPI_Comm_accept(server_port_name, MPI_INFO_NULL, 0,
>>> intracomm, &intercomm)) ;
>>>
>>>   printf("CLIENT: connected to server on port\n") ;
>>>   [...]
>>>
>>> }
>>>
>>> 2) has rank 1
>>>
>>> /* and 3) has rank 2) */
>>>
>>> so unless you run 2) with num_clients=2, MPI_Comm_accept() is never
>>> called, hence my analysis of the crash/hang
>>>
>>>
>>> I understand what you are trying to achieve; keep in mind that
>>> MPI_Comm_accept() is a collective call, so when a new player
>>> is willing to join, the other players must invoke MPI_Comm_accept(),
>>> and it is up to you to make sure that happens.
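
A sketch of that collective pattern (the names here are illustrative, not the
actual test code): every task already in the local intracomm -- the server and
all previously joined clients -- calls MPI_Comm_accept, while only the newcomer
calls MPI_Comm_connect; both sides then merge, so the next accept is collective
over the enlarged group.

    #include <mpi.h>

    /* returns the enlarged intracomm to use for the next accept */
    static MPI_Comm add_one_player(MPI_Comm intracomm, const char *port_name,
                                   int joining /* nonzero on the newcomer */)
    {
        MPI_Comm intercomm, merged;
        if (joining) {
            MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
        } else {
            MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, intracomm, &intercomm);
        }
        MPI_Intercomm_merge(intercomm, joining ? 1 : 0, &merged);
        MPI_Comm_disconnect(&intercomm);
        return merged;
    }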
>>>
>>>
>>> Cheers,
>>>
>>>
>>> Gilles
>>>
>>> On 7/19/2016 5:48 PM, M. D. wrote:
>>>
>>>
>>>
>>> 2016-07-19 10:06 GMT+02:00 Gilles Gouaillardet :
>>>
 MPI_Comm_accept must be called by all the tasks of the local
 communicator.

>>> Yes, that's how I understand it. In the source code of the test, all the
>>> tasks call  MPI_Comm_accept - server and also relevant clients.
>>>
 so if you

 1) mpirun -np 1 ./singleton_client_server 2 1

 2) mpirun -np 1 ./singleton_client_server 2 0

 3) mpirun -np 1 ./singleton_client_server 2 0

 then 3) starts after 2) has exited, so on 1), intracomm is made of 1)
 and an exited task (2)

>>> This is not true in my opinion - because of the above-mentioned fact that
>>> MPI_Comm_accept is called by all the tasks of the local communicator.
>>>
 /*

 strictly speaking, there is a race condition, if 2) has exited, then
 MPI_Comm_accept will crash when 1) informs 2) that 3) has joined.

 if 2) has not yet exited, then the test will hang because 2) does not
 invoke MPI_Comm_accept

 */

>>> Task 2) does not exit, because of blocking call of MPI_Comm_accept.
>>>


>>>
 there are different ways of seeing things :

 1) this is an incorrect usage of the test, the number of clients should
 be the same everywhere

 2) task 2) should not exit (because it did not call
 MPI_Comm_disconnect()) and the test should hang when

 starting task 3) because task 2) does not call MPI_Comm_accept()


 ad 1) I am sorry, but maybe I do not understand what you think - In my
>>> previous post I wrote that the number of clients is the same in every
>>> mpirun instance.
>>> ad 2) it is the same as above
>>>
 I do not know how you want to spawn your tasks.

 if 2) and 3) do not need to communicate with each other (they only
 communicate with 1)), then

 you can simply MPI_Comm_accept(MPI_COMM_WORLD) in 1).

 if 2) and 3) need to communicate with each other, it would be much
 easier to MPI_Comm_spawn or MPI_Comm_spawn_multiple only once in 1),

 so there is only one intercommunicator with all the tasks.
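
A minimal sketch of that spawn-based alternative (the command name and client
count are assumed): the first task launches all clients at once, which yields a
single intercommunicator containing every task, and merging it gives one
intracommunicator the whole game can use.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm clients, everyone;
        int num_clients = 2;                   /* assumed number of players */
        MPI_Init(&argc, &argv);
        MPI_Comm_spawn("./client", MPI_ARGV_NULL, num_clients, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &clients, MPI_ERRCODES_IGNORE);
        /* the spawned clients obtain the same intercommunicator via
         * MPI_Comm_get_parent() and can merge it the same way */
        MPI_Intercomm_merge(clients, 0, &everyone);
        MPI_Finalize();
        return 0;
    }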

>>> My aim is that all the tasks need to communicate with each other. I am
>>> implementing a distributed application - game with more players
>>> communicating with each other via MPI. It should work as follows - First
>>> player creates a game and waits for other players to connect to this game.
>>> On different computers (in the same network) the other players can join
>>> this game. When they are connected, they should be able to play this game
>>> together.
>>> I hope, it is clear what my

Re: [OMPI users] stdin issue with openmpi/2.0.0

2016-08-29 Thread Jingchao Zhang
Hi Ralph,


I used the tarball from Aug 26 and added the patch. Tested with 2 nodes, 10 
cores/node. Please see the results below:


$ mpirun ./a.out < test.in
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 35 for 
process [[43954,1],0]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 41 for 
process [[43954,1],0]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 43 for 
process [[43954,1],0]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 37 for 
process [[43954,1],1]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 46 for 
process [[43954,1],1]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 49 for 
process [[43954,1],1]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 38 for 
process [[43954,1],2]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 50 for 
process [[43954,1],2]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 52 for 
process [[43954,1],2]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 42 for 
process [[43954,1],3]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 53 for 
process [[43954,1],3]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 55 for 
process [[43954,1],3]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 45 for 
process [[43954,1],4]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 56 for 
process [[43954,1],4]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 58 for 
process [[43954,1],4]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 47 for 
process [[43954,1],5]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 59 for 
process [[43954,1],5]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 61 for 
process [[43954,1],5]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 57 for 
process [[43954,1],6]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 64 for 
process [[43954,1],6]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 66 for 
process [[43954,1],6]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 62 for 
process [[43954,1],7]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 68 for 
process [[43954,1],7]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 70 for 
process [[43954,1],7]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 65 for 
process [[43954,1],8]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 72 for 
process [[43954,1],8]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 74 for 
process [[43954,1],8]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 75 for 
process [[43954,1],9]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 79 for 
process [[43954,1],9]
[c1725.crane.hcc.unl.edu:170750] [[43954,0],0] iof:hnp pushing fd 81 for 
process [[43954,1],9]
Rank 5 has cleared MPI_Init
Rank 9 has cleared MPI_Init
Rank 1 has cleared MPI_Init
Rank 2 has cleared MPI_Init
Rank 3 has cleared MPI_Init
Rank 4 has cleared MPI_Init
Rank 8 has cleared MPI_Init
Rank 0 has cleared MPI_Init
Rank 6 has cleared MPI_Init
Rank 7 has cleared MPI_Init
Rank 14 has cleared MPI_Init
Rank 15 has cleared MPI_Init
Rank 16 has cleared MPI_Init
Rank 18 has cleared MPI_Init
Rank 10 has cleared MPI_Init
Rank 11 has cleared MPI_Init
Rank 12 has cleared MPI_Init
Rank 13 has cleared MPI_Init
Rank 17 has cleared MPI_Init
Rank 19 has cleared MPI_Init


Thanks,


Dr. Jingchao Zhang
Holland Computing Center
University of Nebraska-Lincoln
402-472-6400

From: users  on behalf of r...@open-mpi.org 

Sent: Saturday, August 27, 2016 12:31:53 PM
To: Open MPI Users
Subject: Re: [OMPI users] stdin issue with openmpi/2.0.0

I am finding this impossible to replicate, so something odd must be going on. 
Can you please (a) pull down the latest v2.0.1 nightly tarball, and (b) add 
this patch to it?

diff --git a/orte/mca/iof/hnp/iof_hnp.c b/orte/mca/iof/hnp/iof_hnp.c
old mode 100644
new mode 100755
index 512fcdb..362ff46
--- a/orte/mca/iof/hnp/iof_hnp.c
+++ b/orte/mca/iof/hnp/iof_hnp.c
@@ -143,16 +143,17 @@ static int hnp_push(const orte_process_name_t* dst_name, 
orte_iof_tag_t src_tag,
 int np, numdigs;
 orte_ns_cmp_bitmask_t mask;



+opal_output(0,
+ "%s iof:hnp pushing fd %d for process %s",
+ ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
+ fd, ORTE_NAME_PRINT(dst_name));
+
 /* don't do this if the dst vpid is invalid or the fd is negative! */
 if (ORTE_VPID_INVALID == dst_name->vpid || fd < 0) {
 return ORTE_SUCCESS;
 }



-OPAL_OUTPUT_VERBOSE((1, orte_iof_base_framework.framework_output,
- "%s iof:hnp pushing fd %d for process %s",
- ORTE_NAME_PR

[OMPI users] Multi-Threaded Performance Question

2016-08-29 Thread Stephen Ibanez

Hi All,

I am trying to use the MPI_THREAD_MULTIPLE support in Open MPI 2.0. I
know the documentation states that multi-threaded support in Open MPI is
only lightly tested and will likely result in poor performance. I have
noticed that when many threads of a particular process are all calling
MPI_Recv, the receive calls appear to interfere with each other and the
overall performance is worse than with a single thread.
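
A minimal sketch of this kind of setup (assumed, not the actual benchmark
code): request MPI_THREAD_MULTIPLE explicitly and check the level actually
provided before starting the receiver threads.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE) {
            fprintf(stderr, "only thread level %d provided\n", provided);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        /* ... start worker threads that each call MPI_Recv ... */
        MPI_Finalize();
        return 0;
    }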


To clarify, the experiment that I am running consists of process 1 on a 
node in an infiniband cluster generating requests which are sent to 
process 2 on a different node in an infiniband cluster. Process 2 simply 
receives the request, does a little bit of processing, and replies back 
to process 1. I noticed that as I add more threads running on different 
cores to process 2, the total number of requests/sec completed 
decreases. I wouldn't expect that adding more threads would decrease 
throughput, unless the receive calls from the different threads are 
interfering with each other.


After looking a little bit into the OpenMPI implementation of the 
MPI_Recv function, I noticed that it looks like each call to MPI_Recv 
requires each thread to obtain a mutex so that only one thread can 
receive at a time. I thought that this would explain the decrease in 
performance caused by many threads calling MPI_Recv. To try and get 
around this issue, I tried to wrap the MPI_Recv call in a
pthread_spinlock_t. The idea here being that the threads would try to
lock the spinlock rather than the mutex, which should eliminate most of 
the contention between threads. However, I am still seeing that 
increasing the number of threads for process 2 causes the throughput to 
decrease.
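
A rough sketch of that spinlock wrapper (assumed, not the actual code); note
that it still serializes the receives themselves, it only changes how the
waiting threads wait:

    #include <mpi.h>
    #include <pthread.h>

    /* initialize once with pthread_spin_init(&recv_lock, PTHREAD_PROCESS_PRIVATE) */
    static pthread_spinlock_t recv_lock;

    static void locked_recv(void *buf, int count, MPI_Datatype type,
                            int src, int tag, MPI_Comm comm, MPI_Status *status)
    {
        pthread_spin_lock(&recv_lock);          /* spin instead of the library mutex */
        MPI_Recv(buf, count, type, src, tag, comm, status);
        pthread_spin_unlock(&recv_lock);
    }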


So my question is, are there any other sources of interference between 
threads in OpenMPI that would cause the number of requests completed/sec 
to decrease as I increase the number of threads in process 2?


Thanks,
-Steve
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] stdin issue with openmpi/2.0.0

2016-08-29 Thread r...@open-mpi.org
I’m sorry, but something is simply very wrong here. Are you sure you are 
pointed at the correct LD_LIBRARY_PATH? Perhaps add a “BOO” or something at the 
front of the output message to ensure we are using the correct plugin?

This looks to me like you must be picking up a stale library somewhere.

> On Aug 29, 2016, at 10:29 AM, Jingchao Zhang  wrote:
> 
> Hi Ralph,
> 
> I used the tarball from Aug 26 and added the patch. Tested with 2 nodes, 10 
> cores/node. Please see the results below:
> 
> $ mpirun ./a.out < test.in
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 35 for process [[43954,1],0]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 41 for process [[43954,1],0]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 43 for process [[43954,1],0]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 37 for process [[43954,1],1]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 46 for process [[43954,1],1]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 49 for process [[43954,1],1]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 38 for process [[43954,1],2]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 50 for process [[43954,1],2]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 52 for process [[43954,1],2]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 42 for process [[43954,1],3]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 53 for process [[43954,1],3]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 55 for process [[43954,1],3]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 45 for process [[43954,1],4]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 56 for process [[43954,1],4]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 58 for process [[43954,1],4]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 47 for process [[43954,1],5]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 59 for process [[43954,1],5]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 61 for process [[43954,1],5]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 57 for process [[43954,1],6]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 64 for process [[43954,1],6]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 66 for process [[43954,1],6]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 62 for process [[43954,1],7]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 68 for process [[43954,1],7]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 70 for process [[43954,1],7]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 65 for process [[43954,1],8]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 72 for process [[43954,1],8]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 74 for process [[43954,1],8]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 75 for process [[43954,1],9]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 79 for process [[43954,1],9]
> [c1725.crane.hcc.unl.edu :170750] 
> [[43954,0],0] iof:hnp pushing fd 81 for process [[43954,1],9]
> Rank 5 has cleared MPI_Init
> Rank 9 has cleared MPI_Init
> Rank 1 has cleared MPI_Init
> Rank 2 has cleared MPI_Init
> Rank 3 has cleared MPI_Init
> Rank 4 has cleared MPI_Init
> Rank 8 has cleared MPI_Init
> Rank 0 has cleared MPI_Init
> Rank 6 has cleared MPI_Init
> Rank 7 has cleared MPI_Init
> Rank 14 has cleared MPI_Init
> Rank 15 has cleared MPI_Init
> Rank 16 has cleared MPI_Init
> Rank 18 has cleared MPI_Init
> Rank 10 has cleared MPI_I

Re: [OMPI users] problem with exceptions in Java interface

2016-08-29 Thread Graham, Nathaniel Richard
​Hello Siegmar and Gilles,


I made a reply where Gilles suggested, but figured I'd leave a note here in case
the other one was missed.


-Nathan


--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory

From: users  on behalf of Gilles Gouaillardet 

Sent: Monday, August 29, 2016 6:16 AM
To: Open MPI Users
Subject: Re: [OMPI users] problem with exceptions in Java interface

Hi Siegmar,

I will review PR 1698 and wait for some more feedback from the developers; they
might have different views than mine.
Assuming PR 1698 does what you expect, it still does not catch all user errors.
For example, if you MPI_Send a buffer that is too short, the exception might be
thrown at any time.
In the worst case, it will occur in the progress thread and outside of any MPI
call, which means it cannot be "converted" into an MPIException.

FWIW, we have a way to check buffers, but it requires that
1. Open MPI is configure'd with --enable-memchecker
and
2. the MPI tasks are run under valgrind.
IIRC, valgrind will issue an error message if the buffer is invalid, and the
app will crash afterwards
(i.e. the MPI subroutine will not return with an error code the end user can
"trap").

Such checks might be easier to make in Java, and the resulting errors might be
easily made "trappable", but as far as I am concerned:
1. this has a runtime overhead, and
2. this is a new development.

Let's follow up at https://github.com/open-mpi/ompi/issues/1698 from now on.

Cheers,

Gilles

On Monday, August 29, 2016, Siegmar Gross 
mailto:siegmar.gr...@informatik.hs-fulda.de>>
 wrote:
Hi Gilles,

isn't it possible to pass all exceptions from the Java interface
to the calling method? I can live with the current handling of
exceptions as well, although some exceptions can be handled
within my program and some will break my program even if I want
to handle exceptions myself. I understood PR 1698 to mean that
all exceptions can be processed in the user program if the user
chooses MPI.ERRORS_RETURN (otherwise this change request wouldn't
have been necessary). Nevertheless, if you decide to leave things
as they are, I'm happy with your decision as well.


Kind regards

Siegmar


Am 29.08.2016 um 10:30 schrieb Gilles Gouaillardet:
Siegmar and all,


I am puzzled by this error.

On the one hand, it is caused by an invalid buffer

(e.g. the buffer size is 1, but the user claims it is 2),

so I am fine with the current behavior
(i.e. java.lang.ArrayIndexOutOfBoundsException is thrown).

/* if this were a C program, it would very likely SIGSEGV; Open MPI does
not catch this kind of error when checking params */


On the other hand, Open MPI could be enhanced to check the buffer size and
throw an MPIException in this case.


As far as I am concerned, this is a feature request and not a bug.


Thoughts, anyone?


Cheers,


Gilles

On 8/29/2016 3:48 PM, Siegmar Gross wrote:
Hi,

I have installed v1.10.3-31-g35ba6a1, openmpi-v2.0.0-233-gb5f0a4f,
and openmpi-dev-4691-g277c319 on my "SUSE Linux Enterprise Server
12 (x86_64)" with Sun C 5.14 beta and gcc-6.1.0. In May I had
reported a problem with Java exceptions (PR 1698) which had
been solved in June (PR 1803).

https://github.com/open-mpi/ompi/issues/1698
https://github.com/open-mpi/ompi/pull/1803

Unfortunately the problem still exists or exists once more
in all three branches.


loki fd1026 112 ompi_info | grep -e "Open MPI repo revision" -e "C compiler
absolute"
  Open MPI repo revision: dev-4691-g277c319
 C compiler absolute: /opt/solstudio12.5b/bin/cc
loki fd1026 112 mpijavac Exception_2_Main.java
warning: [path] bad path element
"/usr/local/openmpi-master_64_cc/lib64/shmem.jar": no such file or directory
1 warning
loki fd1026 113 mpiexec -np 1 java Exception_2_Main
Set error handler for MPI.COMM_WORLD to MPI.ERRORS_RETURN.
Call "bcast" with index out-of bounds.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
at mpi.Comm.bcast(Native Method)
at mpi.Comm.bcast(Comm.java:1252)
at Exception_2_Main.main(Exception_2_Main.java:22)
---
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
---
--
mpiexec detected that one or more processes exited with non-zero status,
thus causing
the job to be terminated. The first process to do so was:

  Process name: [[58548,1],0]
  Exit code:1
--
loki fd1026 114 exit



loki fd1026 116 ompi_info | grep -e "Open MPI repo revision" -e "C compiler
absolute"
  Open MPI repo revision: v2.0.0-233-gb5f0a4f
 C compiler absolute: /opt/solstudio12.5b/bin/cc
loki fd1026 117 mpijavac Exception_2_Main.java
warning: [path] bad path element
"/usr/local/openmpi-2.0.1_64_cc/lib64/shmem.jar": no such file or directory
1 warning
l