[OMPI users] running problem on Dell blade server, confirm 2d21ce3ce8be64d8104b3ad71b8c59e2514a72eb

2009-04-24 Thread jan
Dear Sir,

 

I’m running a cluster with OpenMPI. 

 

$ mpirun --mca mpi_show_mpi_alloc_mem_leaks 8 --mca mpi_show_handle_leaks 1 $HOME/test/cpi

 

When the job failed, I got the following error messages:

 

Process 15 on node2
Process 6 on node1
Process 14 on node2
… … …
Process 0 on node1
Process 10 on node2
[node2][[9340,1],13][btl_openib_component.c:3002:poll_device] error polling HP CQ with -2 errno says Success
[node2][[9340,1],9][btl_openib_component.c:3002:poll_device] error polling HP CQ with -2 errno says Success
[node2][[9340,1],10][btl_openib_component.c:3002:poll_device] error polling HP CQ with -2 errno says Success
[node2][[9340,1],11][btl_openib_component.c:3002:poll_device] error polling HP CQ with -2 errno says Success
[node2][[9340,1],8][btl_openib_component.c:3002:poll_device] error polling HP CQ with -2 errno says Success
[node2][[9340,1],15][btl_openib_component.c:3002:poll_device] [node2][[9340,1],12][btl_openib_component.c:3002:poll_device] error polling HP CQ with -2 errno says Success
error polling HP CQ with -2 errno says Success
[node2][[9340,1],14][btl_openib_component.c:3002:poll_device] error polling HP CQ with -2 errno says Success
mpirun: killing job...

 

--
mpirun noticed that process rank 0 with PID 28438 on node node1 exited on signal 0 (Unknown signal 0).
--
mpirun: clean termination accomplished

 

and when the job succeeded, I got the following messages:

 

Process 1 on node1
Process 2 on node1
… … …
Process 13 on node2
Process 14 on node2
--
The following memory locations were allocated via MPI_ALLOC_MEM but
not freed via MPI_FREE_MEM before invoking MPI_FINALIZE:

Process ID: [[13692,1],12]
Hostname:   node2
PID:        30183

(null)
--
[node1:32276] 15 more processes have sent help message help-mpool-base.txt / all mem leaks
[node1:32276] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

 



The failure occurred periodically, i.e. twice success, then twice failed, twice success,
then twice failed, and so on. I downloaded OFED-1.4.1-rc3 from the OpenFabrics
Alliance and installed it on a Dell PowerEdge M600 blade server. The InfiniBand
mezzanine cards are Mellanox ConnectX QDR & DDR, and the InfiniBand switch module is
a Mellanox M2401G. The OS is CentOS 5.3, kernel 2.6.18-128.1.6.el5, with the PGI V7.2-5
compiler. The OpenSM subnet manager is running.



Best Regards,



Gloria Jan



Wavelink Technology Inc.

 

The output of the "ompi_info --all" command is:



 Package: Open MPI root@vortex Distribution
Open MPI: 1.3.1
   Open MPI SVN revision: r20826
   Open MPI release date: Mar 18, 2009
Open RTE: 1.3.1
   Open RTE SVN revision: r20826
   Open RTE release date: Mar 18, 2009
OPAL: 1.3.1
   OPAL SVN revision: r20826
   OPAL release date: Mar 18, 2009
Ident string: 1.3.1
   MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.3.1)
  MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.3.1)
   MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.3.1)
   MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.3.1)
   MCA carto: file (MCA v2.0, API v2.0, Component v1.3.1)
   MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.3.1)
   MCA timer: linux (MCA v2.0, API v2.0, Component v1.3.1)
 MCA installdirs: env (MCA v2.0, API v2.0, Component v1.3.1)
 MCA installdirs: config (MCA v2.0, API v2.0, Component v1.3.1)
 MCA dpm: orte (MCA v2.0, API v2.0, Component v1.3.1)
  MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.3.1)
   MCA allocator: basic (MCA v2.0, API v2.0, Component v1.3.1)
   MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.3.1)
MCA coll: basic (MCA v2.0, API v2.0, Component v1.3.1)
MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.3.1)
MCA coll: inter (MCA v2.0, API v2.0, Component v1.3.1)
MCA coll: self (MCA v2.0, API v2.0, Component v1.3.1)
MCA coll: sm (MCA v2.0, API v2.0, Component v1.3.1)
MCA coll: sync (MCA v2.0, API v2.0, Component v1.3.1)
MCA coll: tuned (MCA v2.0, API v2.0, Component v1.3.1)
  MCA io: romio (MCA v2.0, API v2.0, Component v1.3.1)
   MCA mpool: fake (MCA v2.0, API v2.0, Component v1.3.1)
   MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.3.1)
   MCA mpool: sm (MCA v2.0, API v2.0, Component v1.3.1)
 MCA pml: cm (MCA v2.0, API v2.0, Component v1.3.1)
 MCA pml:

Re: [OMPI users] users Digest, Vol 1212, Issue 3

2009-04-26 Thread jan

Dear Jeff,

Thank you for your help. I tried Open MPI v1.3.2 on Sunday, but the
problem still occurred.


Regards, Gloria
Wavelink Technology Inc.





Per http://www.open-mpi.org/community/lists/announce/2009/03/0029.php,
can you try upgrading to Open MPI v1.3.2?


On Apr 24, 2009, at 5:21 AM, jan wrote:


Dear Sir,

I'm running a cluster with OpenMPI.

$mpirun --mca mpi_show_mpi_alloc_mem_leaks 8 --mca
mpi_show_handle_leaks 1 $HOME/test/cpi

I got the error message as job failed:

Process 15 on node2
Process 6 on node1
Process 14 on node2
… … …
Process 0 on node1
Process 10 on node2
[node2][[9340,1],13][btl_openib_component.c:3002:poll_device] error
polling HP C
Q with -2 errno says Success
[node2][[9340,1],9][btl_openib_component.c:3002:poll_device] error
polling HP CQ
 with -2 errno says Success
[node2][[9340,1],10][btl_openib_component.c:3002:poll_device] error
polling HP C
Q with -2 errno says Success
[node2][[9340,1],11][btl_openib_component.c:3002:poll_device] error
polling HP C
Q with -2 errno says Success
[node2][[9340,1],8][btl_openib_component.c:3002:poll_device] error
polling HP CQ
 with -2 errno says Success
[node2][[9340,1],15][btl_openib_component.c:3002:poll_device] [node2]
[[9340,1],1
2][btl_openib_component.c:3002:poll_device] error polling HP CQ with
-2 errno sa
ys Success
error polling HP CQ with -2 errno says Success
[node2][[9340,1],14][btl_openib_component.c:3002:poll_device] error
polling HP C
Q with -2 errno says Success
mpirun: killing job...

--
mpirun noticed that process rank 0 with PID 28438 on node node1
exited on signal
 0 (Unknown signal 0).
--
mpirun: clean termination accomplished

and got the message as job success

Process 1 on node1
Process 2 on node1
… … …
Process 13 on node2
Process 14 on node2
--
The following memory locations were allocated via MPI_ALLOC_MEM but
not freed via MPI_FREE_MEM before invoking MPI_FINALIZE:

Process ID: [[13692,1],12]
Hostname:   node2
PID:30183

(null)
--
[node1:32276] 15 more processes have sent help message help-mpool-
base.txt / all
 mem leaks
[node1:32276] Set MCA parameter "orte_base_help_aggregate" to 0 to
see all help
/ error messages


It  occurred periodic, ie. twice success, then twice failed, twice
success, then twice failed … . I download the OFED-1.4.1-rc3 from
The OpenFabrics Alliance and installed on Dell PowerEdge M600 Blade
Server. The infiniband Mezzanine Cards is Mellanox ConnectX QDR &
DDR. And infiniband switch module is Mellanox M2401G. OS is CentOS
5.3, kernel 2.6.18-128.1.6.el5, with PGI V7.2-5 compiler. It's
running OpenSM subnet manager.

Best Regards,

Gloria Jan

Wavelink Technology Inc.

The output of the "ompi_info --all" command as:

 Package: Open MPI root@vortex Distribution
Open MPI: 1.3.1
   Open MPI SVN revision: r20826
   Open MPI release date: Mar 18, 2009
Open RTE: 1.3.1
   Open RTE SVN revision: r20826
   Open RTE release date: Mar 18, 2009
OPAL: 1.3.1
   OPAL SVN revision: r20826
   OPAL release date: Mar 18, 2009
Ident string: 1.3.1
   MCA backtrace: execinfo (MCA v2.0, API v2.0, Component
v1.3.1)
  MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component
v1.3.1)
   MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.3.1)
   MCA carto: auto_detect (MCA v2.0, API v2.0, Component
v1.3.1)
   MCA carto: file (MCA v2.0, API v2.0, Component v1.3.1)
   MCA maffinity: first_use (MCA v2.0, API v2.0, Component
v1.3.1)
   MCA timer: linux (MCA v2.0, API v2.0, Component v1.3.1)
 MCA installdirs: env (MCA v2.0, API v2.0, Component v1.3.1)
 MCA installdirs: config (MCA v2.0, API v2.0, Component
v1.3.1)
 MCA dpm: orte (MCA v2.0, API v2.0, Component v1.3.1)
  MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.3.1)
   MCA allocator: basic (MCA v2.0, API v2.0, Component v1.3.1)
   MCA allocator: bucket (MCA v2.0, API v2.0, Component
v1.3.1)
MCA coll: basic (MCA v2.0, API v2.0, Component v1.3.1)
MCA coll: hierarch (MCA v2.0, API v2.0, Component
v1.3.1)
MCA coll: inter (MCA v2.0, API v2.0, Component v1.3.1)
MCA coll: self (MCA v2.0, API v2.0, Component v1.3.1)
MCA coll: sm (MCA v2.0, API v2.0, Component v1.3.1)
MCA coll: sync (MCA v2.0, API v2.0, Component v1.3.1)
MCA coll: tuned (MCA v2.0, API v2.0, Component v1.3.1)
  MCA io: romio (MCA v2.0, API v2.0, Component v1.3.1)
   MCA mpool: fake (MCA v2.0, API v2.0, Component v1.3.1)
  

Re: [OMPI users] users Digest, Vol 1212, Issue 3, Message: 2

2009-04-27 Thread jan

Thank you, Jeff Squyres.

I have checked the web page
http://www.open-mpi.org/community/lists/announce/2009/03/0029.php and then the
page https://svn.open-mpi.org/trac/ompi/ticket/1853, but the site
svn.open-mpi.org seems to be down.


Then I tried Open MPI v1.3.2 again with many different configurations, but
found the problem still occurred periodically, i.e. twice success, then twice
failed, twice success, then twice failed, and so on. Do you have any suggestions
for this issue?

Thank you again.

Best Regards,

Gloria Jan
Wavelink Technology Inc.




Per http://www.open-mpi.org/community/lists/announce/2009/03/0029.php,
can you try upgrading to Open MPI v1.3.2?


On Apr 24, 2009, at 5:21 AM, jan wrote:


Dear Sir,

I'm running a cluster with OpenMPI.

$mpirun --mca mpi_show_mpi_alloc_mem_leaks 8 --mca
mpi_show_handle_leaks 1 $HOME/test/cpi

I got the error message as job failed:

Process 15 on node2
Process 6 on node1
Process 14 on node2
… … …
Process 0 on node1
Process 10 on node2
[node2][[9340,1],13][btl_openib_component.c:3002:poll_device] error
polling HP C
Q with -2 errno says Success
[node2][[9340,1],9][btl_openib_component.c:3002:poll_device] error
polling HP CQ
 with -2 errno says Success
[node2][[9340,1],10][btl_openib_component.c:3002:poll_device] error
polling HP C
Q with -2 errno says Success
[node2][[9340,1],11][btl_openib_component.c:3002:poll_device] error
polling HP C
Q with -2 errno says Success
[node2][[9340,1],8][btl_openib_component.c:3002:poll_device] error
polling HP CQ
 with -2 errno says Success
[node2][[9340,1],15][btl_openib_component.c:3002:poll_device] [node2]
[[9340,1],1
2][btl_openib_component.c:3002:poll_device] error polling HP CQ with
-2 errno sa
ys Success
error polling HP CQ with -2 errno says Success
[node2][[9340,1],14][btl_openib_component.c:3002:poll_device] error
polling HP C
Q with -2 errno says Success
mpirun: killing job...

--
mpirun noticed that process rank 0 with PID 28438 on node node1
exited on signal
 0 (Unknown signal 0).
--
mpirun: clean termination accomplished

and got the message as job success

Process 1 on node1
Process 2 on node1
… … …
Process 13 on node2
Process 14 on node2
--
The following memory locations were allocated via MPI_ALLOC_MEM but
not freed via MPI_FREE_MEM before invoking MPI_FINALIZE:

Process ID: [[13692,1],12]
Hostname:   node2
PID:30183

(null)
--
[node1:32276] 15 more processes have sent help message help-mpool-
base.txt / all
 mem leaks
[node1:32276] Set MCA parameter "orte_base_help_aggregate" to 0 to
see all help
/ error messages


It  occurred periodic, ie. twice success, then twice failed, twice
success, then twice failed … . I download the OFED-1.4.1-rc3 from
The OpenFabrics Alliance and installed on Dell PowerEdge M600 Blade
Server. The infiniband Mezzanine Cards is Mellanox ConnectX QDR &
DDR. And infiniband switch module is Mellanox M2401G. OS is CentOS
5.3, kernel 2.6.18-128.1.6.el5, with PGI V7.2-5 compiler. It's
running OpenSM subnet manager.

Best Regards,

Gloria Jan

Wavelink Technology Inc.

The output of the "ompi_info --all" command as:

 Package: Open MPI root@vortex Distribution
Open MPI: 1.3.1
   Open MPI SVN revision: r20826
   Open MPI release date: Mar 18, 2009
Open RTE: 1.3.1
   Open RTE SVN revision: r20826
   Open RTE release date: Mar 18, 2009
OPAL: 1.3.1
   OPAL SVN revision: r20826
   OPAL release date: Mar 18, 2009
Ident string: 1.3.1
   MCA backtrace: execinfo (MCA v2.0, API v2.0, Component
v1.3.1)
  MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component
v1.3.1)
   MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.3.1)
   MCA carto: auto_detect (MCA v2.0, API v2.0, Component
v1.3.1)
   MCA carto: file (MCA v2.0, API v2.0, Component v1.3.1)
   MCA maffinity: first_use (MCA v2.0, API v2.0, Component
v1.3.1)
   MCA timer: linux (MCA v2.0, API v2.0, Component v1.3.1)
 MCA installdirs: env (MCA v2.0, API v2.0, Component v1.3.1)
 MCA installdirs: config (MCA v2.0, API v2.0, Component
v1.3.1)
 MCA dpm: orte (MCA v2.0, API v2.0, Component v1.3.1)
  MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.3.1)
   MCA allocator: basic (MCA v2.0, API v2.0, Component v1.3.1)
   MCA allocator: bucket (MCA v2.0, API v2.0, Component
v1.3.1)
MCA coll: basic (MCA v2.0, API v2.0, Component v1.3.1)
MCA coll: hierarch (MCA v2.0, API v2.0, Component
v1.3.1)
MCA coll: inter (MCA v2.0, API v2.0, Component v1.3.1)
MCA coll: self (MCA v

Re: [OMPI users] users Digest, Vol 1217, Issue 2, Message3

2009-04-30 Thread jan
Thank you, Jeff Squyres. Could you suggest how to run layer 0
diagnostics to check whether the fabric is clean? I have contacted Dell
locally (Taiwan), but I don't think they are familiar with Open MPI or even the
InfiniBand module. Does anyone else have the IB stack hang problem with Mellanox
ConnectX products?


Thank you again.

Best Regards,

Gloria Jan
Wavelink Technology Inc



I can confirm that I have exactly the same problem, also on a Dell
system, even with the latest Open MPI.

Our system is:

Dell M905
OpenSUSE 11.1
kernel: 2.6.27.21-0.1-default
ofed-1.4-21.12 from SUSE repositories.
OpenMPI-1.3.2


What I can also add: it does not only affect Open MPI. If these messages
are triggered after mpirun:
[node032][[9340,1],11][btl_openib_component.c:3002:poll_device] error
polling HP CQ with -2 errno says Success

then the IB stack hangs. You cannot even reload it; you have to reboot the node.




Something that severe should not be able to be caused by Open MPI.
Specifically: Open MPI should not be able to hang the OFED stack.
Have you run layer 0 diagnostics to know that your fabric is clean?
You might want to contact your IB vendor to find out how to do that.

--
Jeff Squyres
Cisco Systems








[OMPI users] Fw: users Digest, Vol 1217, Issue 2, Message3

2009-05-04 Thread jan

Hi Jeff,

I have updated the firmware of the InfiniBand module on the Dell M600, but the
problem still occurred.


===

$ mpirun -hostfile clusternode -np 16 --byslot --mca btl openib,sm,self $HOME/test/cpi

Process 1 on node1
Process 11 on node2
Process 8 on node2
Process 6 on node1
Process 4 on node1
Process 14 on node2
Process 3 on node1
Process 2 on node1
Process 9 on node2
Process 5 on node1
Process 0 on node1
Process 7 on node1
Process 10 on node2
Process 15 on node2
Process 13 on node2
Process 12 on node2
[node1][[3175,1],0][btl_openib_component.c:3029:poll_device] error polling HP CQ with -2 errno says Success

=

Is this problem unsolvable?
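One quick way to tell whether the openib BTL itself is the trigger (a diagnostic sketch based on the command above, not something tried in this thread) is to rerun the same case with the InfiniBand path excluded:

$ mpirun -hostfile clusternode -np 16 --byslot --mca btl tcp,sm,self $HOME/test/cpi

If that run is stable every time, the problem lies in the openib/OFED path rather than in the application itself.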


Best Regards,

Gloria Jan
Wavelink Technology Inc



I can confirm that I have exactly the same problem, also on a Dell
system, even with the latest Open MPI.

Our system is:

Dell M905
OpenSUSE 11.1
kernel: 2.6.27.21-0.1-default
ofed-1.4-21.12 from SUSE repositories.
OpenMPI-1.3.2


What I can also add: it does not only affect Open MPI. If these messages
are triggered after mpirun:
[node032][[9340,1],11][btl_openib_component.c:3002:poll_device] error
polling HP CQ with -2 errno says Success

then the IB stack hangs. You cannot even reload it; you have to reboot the node.




Something that severe should not be able to be caused by Open MPI.
Specifically: Open MPI should not be able to hang the OFED stack.
Have you run layer 0 diagnostics to know that your fabric is clean?
You might want to contact your IB vendor to find out how to do that.

--
Jeff Squyres
Cisco Systems





On Apr 24, 2009, at 5:21 AM, jan wrote:


Dear Sir,

I'm running a cluster with OpenMPI.

$mpirun --mca mpi_show_mpi_alloc_mem_leaks 8 --mca
mpi_show_handle_leaks 1 $HOME/test/cpi

I got the error message as job failed:

Process 15 on node2
Process 6 on node1
Process 14 on node2
… … …
Process 0 on node1
Process 10 on node2
[node2][[9340,1],13][btl_openib_component.c:3002:poll_device] error
polling HP C
Q with -2 errno says Success
[node2][[9340,1],9][btl_openib_component.c:3002:poll_device] error
polling HP CQ
 with -2 errno says Success
[node2][[9340,1],10][btl_openib_component.c:3002:poll_device] error
polling HP C
Q with -2 errno says Success
[node2][[9340,1],11][btl_openib_component.c:3002:poll_device] error
polling HP C
Q with -2 errno says Success
[node2][[9340,1],8][btl_openib_component.c:3002:poll_device] error
polling HP CQ
 with -2 errno says Success
[node2][[9340,1],15][btl_openib_component.c:3002:poll_device] [node2]
[[9340,1],1
2][btl_openib_component.c:3002:poll_device] error polling HP CQ with
-2 errno sa
ys Success
error polling HP CQ with -2 errno says Success
[node2][[9340,1],14][btl_openib_component.c:3002:poll_device] error
polling HP C
Q with -2 errno says Success
mpirun: killing job...

--
mpirun noticed that process rank 0 with PID 28438 on node node1
exited on signal
 0 (Unknown signal 0).
--
mpirun: clean termination accomplished

and got the message as job success

Process 1 on node1
Process 2 on node1
… … …
Process 13 on node2
Process 14 on node2
--
The following memory locations were allocated via MPI_ALLOC_MEM but
not freed via MPI_FREE_MEM before invoking MPI_FINALIZE:

Process ID: [[13692,1],12]
Hostname:   node2
PID:30183

(null)
--
[node1:32276] 15 more processes have sent help message help-mpool-
base.txt / all
 mem leaks
[node1:32276] Set MCA parameter "orte_base_help_aggregate" to 0 to
see all help
/ error messages


It  occurred periodic, ie. twice success, then twice failed, twice
success, then twice failed … . I download the OFED-1.4.1-rc3 from
The OpenFabrics Alliance and installed on Dell PowerEdge M600 Blade
Server. The infiniband Mezzanine Cards is Mellanox ConnectX QDR &
DDR. And infiniband switch module is Mellanox M2401G. OS is CentOS
5.3, kernel 2.6.18-128.1.6.el5, with PGI V7.2-5 compiler. It's
running OpenSM subnet manager.

Best Regards,

Gloria Jan

Wavelink Technology Inc.

The output of the "ompi_info --all" command as:

 Package: Open MPI root@vortex Distribution
Open MPI: 1.3.1
   Open MPI SVN revision: r20826
   Open MPI release date: Mar 18, 2009
Open RTE: 1.3.1
   Open RTE SVN revision: r20826
   Open RTE release date: Mar 18, 2009
OPAL: 1.3.1
   OPAL SVN revision: r20826
   OPAL release date: Mar 18, 2009
Ident string: 1.3.1
   MCA backtrace: execinfo (MCA v2.0, API v2.0, Component
v1.3.1)
  MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component
v1.3.1)
   MCA paffinity: l

Re: [OMPI users] users Digest, Vol 1217, Issue 2, Message3

2009-05-04 Thread jan
Thank you, Jeff. I have passed the mail to the IB vendor, Dell (the
blade was ordered from Dell Taiwan), but they told me that they didn't
understand "layer 0 diagnostics". Could you help us get more
information about "layer 0 diagnostics"? Thanks again.
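For reference, "layer 0 diagnostics" here means checking the physical fabric itself rather than MPI. Assuming the standard OFED diagnostic utilities (infiniband-diags and ibutils) were installed along with the stack, a first pass usually looks something like:

# ibstat                (local HCA and port state, link width and speed)
# ibdiagnet             (sweeps the whole fabric and reports bad links, ports and counters)
# ibcheckerrors         (checks the error counters of every port against thresholds)
# perfquery             (per-port error and traffic counters on the local HCA)

The exact tool names and output depend on the OFED release, so treat the above as a sketch rather than a recipe.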

Regards,

Gloria Jan
Wavelink Technology Inc.


> As I've indicated a few times in this thread:
>
> >> Have you run layer 0 diagnostics to know that your fabric is clean?
> >> You might want to contact your IB vendor to find out how to do that.
>
>
>
> On May 4, 2009, at 4:34 AM, jan wrote:
>
>> Hi Jeff,
>>
>> I have updated the firmware of Infiniband module on Dell M600, but the
>> problem still occured.
>>
>> ==============================
>>
>> $ mpirun -hostfile clusternode -np 16 --byslot --mca btl  openib,sm,self
>> $HOME/test/cpi
>> Process 1 on node1
>> Process 11 on node2
>> Process 8 on node2
>> Process 6 on node1
>> Process 4 on node1
>> Process 14 on node2
>> Process 3 on node1
>> Process 2 on node1
>> Process 9 on node2
>> Process 5 on node1
>> Process 0 on node1
>> Process 7 on node1
>> Process 10 on node2
>> Process 15 on node2
>> Process 13 on node2
>> Process 12 on node2
>> [node1][[3175,1],0][btl_openib_component.c:3029:poll_device] error 
>> polling
>> HP CQ with -2 errno says Success
>> ==============================
>>
>> Is this problem unsolvable?
>>
>>
>> Best Regards,
>>
>>  Gloria Jan
>> Wavelink Technology Inc
>>



[OMPI users] users Digest, Vol 1217, Issue 2, Message3

2009-05-07 Thread jan
Can anyone help me find the problem or bug in my cluster? The output of
"ibv_devinfo -v" from the Dell blade with the InfiniBand module looks very strange. The
phys_port_cnt is 2, with one port active and the other down. The active port shows 20x speed,
the down port 10x speed. We are using Dell PowerEdge M600 blade servers with
Mellanox ConnectX DDR InfiniBand mezzanine cards and a Cisco M2401G InfiniBand
switch. The OS is CentOS 5.3, kernel 2.6.18-128.1.6.el5, with the PGI V7.2-5 compiler,
and OFED-1.4.1-rc4 with openmpi-1.3.2:

# ibv_devinfo -v
hca_id: mlx4_0
fw_ver: 2.5.000
node_guid:  0018:8b90:97fe:73cd
sys_image_guid: 0018:8b90:97fe:73d0
vendor_id:  0x02c9
vendor_part_id: 25418
hw_ver: 0xA0
board_id:   DEL08C002
phys_port_cnt:  2
max_mr_size:0x
page_size_cap:  0xf000
max_qp: 131008
max_qp_wr:  16351
device_cap_flags:   0x000c1c66
max_sge:32
max_sge_rd: 0
max_cq: 65408
max_cqe:4194303
max_mr: 131056
max_pd: 32764
max_qp_rd_atom: 16
max_ee_rd_atom: 0
max_res_rd_atom:2096128
max_qp_init_rd_atom:128
max_ee_init_rd_atom:0
atomic_cap: ATOMIC_HCA (1)
max_ee: 0
max_rdd:0
max_mw: 0
max_raw_ipv6_qp:0
max_raw_ethy_qp:0
max_mcast_grp:  8192
max_mcast_qp_attach:56
max_total_mcast_qp_attach:  458752
max_ah: 0
max_fmr:0
max_srq:65472
max_srq_wr: 16383
max_srq_sge:31
max_pkeys:  128
local_ca_ack_delay: 15
port:   1
state:  PORT_ACTIVE (4)
max_mtu:2048 (4)
active_mtu: 2048 (4)
sm_lid: 4
port_lid:   16
port_lmc:   0x00
max_msg_sz: 0x4000
port_cap_flags: 0x02510868
max_vl_num: 8 (4)
bad_pkey_cntr:  0x0
qkey_viol_cntr: 0x0
sm_sl:  0
pkey_tbl_len:   128
gid_tbl_len:128
subnet_timeout: 18
init_type_reply:0
active_width:   4X (2)
active_speed:   5.0 Gbps (2)
phys_state: LINK_UP (5)
GID[  0]:   
fe80::::0018:8b90:97fe:73ce


Best Regards, 

Gloria Jan
Wavelink Technology Inc.

[OMPI users] users Digest, Vol 1217, Issue 2, Message3

2009-05-21 Thread jan
Thanks, Jeff. We finally solved this problem: we downloaded the newest
OFED-1.4.1-rc6.tgz and reinstalled the InfiniBand drivers and utilities on every
node. Everything looks good, and I have my own coffee time now. Thanks
again.
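(A quick sanity check after that kind of reinstall, assuming the OFED meta-package installed its ofed_info script, is:

$ ofed_info | head -1

which should report the same OFED-1.4.1-rc6 release on every node.)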


Best Regards,

Gloria Jan
Wavelink Technology Inc

I don't think the speed of the down port matters; port_down means that 
there's no cable connected, so the values are probably fairly random.
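For reference, the active port's figures are self-consistent: the link rate is active_width x active_speed, i.e. 4X x 5.0 Gbps = 20 Gb/s signalling (DDR), which matches the ConnectX DDR mezzanine card reported earlier in the thread.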



On May 7, 2009, at 10:38 PM, jan wrote:

Can anyone help me find the problem or bug in my cluster? The output
of "ibv_devinfo -v" from the Dell blade with the InfiniBand module looks very
strange. The phys_port_cnt is 2, with one port active and the other down. The
active port shows 20x speed, the down port 10x speed. We are using Dell
PowerEdge M600 blade servers with Mellanox ConnectX DDR InfiniBand
mezzanine cards and a Cisco M2401G InfiniBand switch. The OS is CentOS 5.3,
kernel 2.6.18-128.1.6.el5, with the PGI V7.2-5 compiler, and OFED-1.4.1-rc4
with openmpi-1.3.2:


# ibv_devinfo -v
hca_id: mlx4_0
fw_ver: 2.5.000
node_guid:  0018:8b90:97fe:73cd
sys_image_guid: 0018:8b90:97fe:73d0
vendor_id:  0x02c9
vendor_part_id: 25418
hw_ver: 0xA0
board_id:   DEL08C002
phys_port_cnt:  2
max_mr_size:0x
page_size_cap:  0xf000
max_qp: 131008
max_qp_wr:  16351
device_cap_flags:   0x000c1c66
max_sge:32
max_sge_rd: 0
max_cq: 65408
max_cqe:4194303
max_mr: 131056
max_pd: 32764
max_qp_rd_atom: 16
max_ee_rd_atom: 0
max_res_rd_atom:2096128
max_qp_init_rd_atom:128
max_ee_init_rd_atom:0
atomic_cap: ATOMIC_HCA (1)
max_ee: 0
max_rdd:0
max_mw: 0
max_raw_ipv6_qp:0
max_raw_ethy_qp:0
max_mcast_grp:  8192
max_mcast_qp_attach:56
max_total_mcast_qp_attach:  458752
max_ah: 0
max_fmr:0
max_srq:65472
max_srq_wr: 16383
max_srq_sge:31
max_pkeys:  128
local_ca_ack_delay: 15
port:   1
state:  PORT_ACTIVE (4)
max_mtu:2048 (4)
active_mtu: 2048 (4)
sm_lid: 4
port_lid:   16
port_lmc:   0x00
max_msg_sz: 0x4000
port_cap_flags: 0x02510868
max_vl_num: 8 (4)
bad_pkey_cntr:  0x0
qkey_viol_cntr: 0x0
sm_sl:  0
pkey_tbl_len:   128
gid_tbl_len:128
subnet_timeout: 18
init_type_reply:0
active_width:   4X (2)
active_speed:   5.0 Gbps (2)
phys_state: LINK_UP (5)
GID[  0]: 
fe80::::0018:8b90:97fe:73ce


Best Regards,

Gloria Jan
Wavelink Technology Inc.



--
Jeff Squyres
Cisco Systems





Re: [OMPI users] OpenMPI providing rank?

2010-07-28 Thread Nysal Jan
OMPI_COMM_WORLD_RANK can be used to get the MPI rank. For other environment
variables -
http://www.open-mpi.org/faq/?category=running#mpi-environmental-variables
For processor affinity see this FAQ entry -
http://www.open-mpi.org/faq/?category=all#using-paffinity
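For the numactl use case described in the question below, a minimal wrapper sketch (hypothetical script name and NUMA layout; OMPI_COMM_WORLD_LOCAL_RANK is provided by recent Open MPI releases alongside OMPI_COMM_WORLD_RANK, so check the FAQ above for your version) would be:

#!/bin/sh
# numa-wrap.sh -- hypothetical wrapper: bind each local rank before exec'ing the real binary
lrank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}   # this rank's index among the ranks on the same node
node=$((lrank % 2))                      # assumes 2 NUMA nodes per host -- adjust to the real topology
exec numactl --cpunodebind=$node --membind=$node "$@"

launched as "mpiexec -np 64 ./numa-wrap.sh ./your_app".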

--Nysal

On Wed, Jul 28, 2010 at 9:04 AM, Yves Caniou wrote:

> Hi,
>
> I have some performance issue on a parallel machine composed of nodes of 16
> procs each. The application is launched on multiple of 16 procs for given
> numbers of nodes.
> I was told by people using MX MPI with this machine to attach a script to
> mpiexec, which 'numactl' things, in order to make the execution performance
> stable.
>
> Looking on the faq (the oldest one is for OpenMPI v1.3?), I saw that maybe
> the
> solution would be for me to use the --mca mpi_paffinity_alone 1
> Is that correct? -- BTW, I have both memory and processor affinity:
> >ompi_info | grep affinity
>   MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.4.2)
>   MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.4.2)
>   MCA maffinity: libnuma (MCA v2.0, API v2.0, Component v1.4.2)
> Does it handle memory too, or do I have to use another option like
> --mca mpi_maffinity 1?
>
> Still, I would like to test the numactl solution. Does OpenMPI provide an
> equivalent to $MXMPI_ID which at least gives the NODE on which a
> process is launched by OpenMPI, so that I can adapt the script that was
> given
> to me?
>
> Tkx.
>
> .Yves.
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Implementing a new BTL module in MCA

2010-08-03 Thread Nysal Jan
You can find the template for a BTL in ompi/mca/btl/template (You will find
this on the subversion trunk). Copy and rename the folder/files. Use this as
a starting point.
For details on creating a new component (such as a new BTL) look here -
https://svn.open-mpi.org/trac/ompi/wiki/devel/CreateComponent
The following document might also be useful -
http://www.open-mpi.org/papers/trinity-btl-2009/xenmpi_report.pdf
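As a rough sketch of the mechanics (run inside a developer checkout of the Open MPI trunk; "mydev" is a placeholder component name):

$ cp -r ompi/mca/btl/template ompi/mca/btl/mydev
  ... rename the template_* files and btl_template_* symbols to mydev ...
$ ./autogen.sh && ./configure --prefix=$HOME/ompi-mydev && make install

autogen.sh regenerates the build system so that the new component directory is picked up; the wiki page above walks through the details.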

Regards
--Nysal


On Tue, Aug 3, 2010 at 5:45 PM, Simone Pellegrini <
spellegr...@dps.uibk.ac.at> wrote:

> Dear all,
> I need to implement an MPI layer on top of a message passing library which
> is currently used in a particular device where I have to run MPI programs (
> very vague, I know :) ).
>
> Instead of reinventing the wheel, my idea was to reuse most of the Open MPI
> implementation and just add a new layer to support my custom device. I guess
> that extending the Byte Transfer Layer of the Modular Component Architecture
> should make the job. Right?
>
> Anyway, before starting wasting my time looking for documentation I wanted
> to have some pointers to documentation regarding extension of Open MPI.
> Which are the interfaces I have to extend? Is there any "hello world"
> example on how to do it?
>
> many thanks, Simone
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Memory allocation error when linking with MPI libraries

2010-08-08 Thread Nysal Jan
What interconnect are you using? Infiniband? Use  "--without-memory-manager"
option while building ompi in order to disable ptmalloc.
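For example (illustrative install prefix):

$ ./configure --prefix=$HOME/openmpi-1.4.2-nomm --without-memory-manager
$ make all install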

Regards
--Nysal

On Sun, Aug 8, 2010 at 7:49 PM, Nicolas Deladerriere <
nicolas.deladerri...@gmail.com> wrote:

> Yes, I'm using a 24 GB machine on a 64-bit Linux OS.
> If I compile without wrapper, I did not get any problems.
>
> It seems that when I link with Open MPI, my program uses a kind of
> Open MPI-implemented malloc. Is it possible to switch it off in order to only
> use malloc from libc?
>
> Nicolas
>
> 2010/8/8 Terry Frankcombe 
>
> You're trying to do a 6GB allocate.  Can your underlying system handle
>> that?  IF you compile without the wrapper, does it work?
>>
>> I see your executable is using the OMPI memory stuff.  IIRC there are
>> switches to turn that off.
>>
>>
>> On Fri, 2010-08-06 at 15:05 +0200, Nicolas Deladerriere wrote:
>> > Hello,
>> >
>> > I'am having an sigsegv error when using simple program compiled and
>> > link with openmpi.
>> > I have reproduce the problem using really simple fortran code. It
>> > actually does not even use MPI, but just link with mpi shared
>> > libraries. (problem does not appear when I do not link with mpi
>> > libraries)
>> >% cat allocate.F90
>> >program test
>> >implicit none
>> >integer, dimension(:), allocatable :: z
>> >integer(kind=8) :: l
>> >
>> >write(*,*) "l ?"
>> >read(*,*) l
>> >
>> >ALLOCATE(z(l))
>> >z(1) = 111
>> >z(l) = 222
>> >DEALLOCATE(z)
>> >
>> >end program test
>> >
>> > I am using openmpi 1.4.2 and gfortran for my tests. Here is the
>> > compilation :
>> >
>> >% ./openmpi-1.4.2/build/bin/mpif90 --showme -g -o testallocate
>> > allocate.F90
>> >gfortran -g -o testallocate allocate.F90
>> > -I/s0/scr1/TOMOT_19311_HAL_/openmpi-1.4.2/build/include -pthread
>> > -I/s0/scr1/TOMOT_19311_HAL_/openmpi-1.4.2/build/lib
>> > -L/s0/scr1/TOMOT_19311_HAL_/openmpi-1.4.2/build/lib -lmpi_f90
>> > -lmpi_f77 -lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl
>> > -lutil -lm -ldl -pthread
>> >
>> > When I am running that test with different length, I sometimes get a
>> > "Segmentation fault" error. Here are two examples using two specific
>> > values, but error happens for many other values of length (I did not
>> > manage to find which values of lenght gives that error)
>> >
>> >%  ./testallocate
>> > l ?
>> >16
>> >Segmentation fault
>> >% ./testallocate
>> > l ?
>> >20
>> >
>> > I used debugger with re-compiled version of openmpi using debug flag.
>> > I got the folowing error in function sYSMALLOc
>> >
>> >Program received signal SIGSEGV, Segmentation fault.
>> >0x2b70b3b3 in sYSMALLOc (nb=640016, av=0x2b930200)
>> > at malloc.c:3239
>> >3239set_head(remainder, remainder_size | PREV_INUSE);
>> >Current language:  auto; currently c
>> >(gdb) bt
>> >#0  0x2b70b3b3 in sYSMALLOc (nb=640016,
>> > av=0x2b930200) at malloc.c:3239
>> >#1  0x2b70d0db in opal_memory_ptmalloc2_int_malloc
>> > (av=0x2b930200, bytes=64) at malloc.c:4322
>> >#2  0x2b70b773 in opal_memory_ptmalloc2_malloc
>> > (bytes=64) at malloc.c:3435
>> >#3  0x2b70a665 in opal_memory_ptmalloc2_malloc_hook
>> > (sz=64, caller=0x2bf8534d) at hooks.c:667
>> >#4  0x2bf8534d in _gfortran_internal_free ()
>> > from /usr/lib64/libgfortran.so.1
>> >#5  0x00400bcc in MAIN__ () at allocate.F90:11
>> >#6  0x00400c4e in main ()
>> >(gdb) display
>> >(gdb) list
>> >3234  if ((unsigned long)(size) >= (unsigned long)(nb +
>> > MINSIZE)) {
>> >3235remainder_size = size - nb;
>> >3236remainder = chunk_at_offset(p, nb);
>> >3237av->top = remainder;
>> >3238set_head(p, nb | PREV_INUSE | (av != &main_arena ?
>> > NON_MAIN_ARENA : 0));
>> >3239set_head(remainder, remainder_size | PREV_INUSE);
>> >3240check_malloced_chunk(av, p, nb);
>> >3241return chunk2mem(p);
>> >3242  }
>> >3243
>> >
>> >
>> > I also did the same test in C and I got the same problem.
>> >
>> > Does someone has any idea that could help me understand what's going
>> > on ?
>> >
>> > Regards
>> > Nicolas
>> >
>> > ___
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Bug in POWERPC32.asm?

2010-08-09 Thread Nysal Jan
Thanks for reporting this Matthew. Fixed in r23576 (
https://svn.open-mpi.org/trac/ompi/changeset/23576)

Regards
--Nysal

On Fri, Aug 6, 2010 at 10:38 PM, Matthew Clark wrote:

> I was looking in my copy of openmpi-1.4.1 opal/asm/base/POWERPC32.asm
> and saw the following:
>
> START_FUNC(opal_sys_timer_get_cycles)
>LSYM(15)
>mftbu r0
>mftb r11
>mftbu r2
>cmpw cr7,r2,r0
>bne+ cr7,REFLSYM(14)
>li r4,0
>li r9,0
>or r3,r2,r9
>or r4,r4,r11
>blr
> END_FUNC(opal_sys_timer_get_cycles)
>
> I'll readily admit at my lack of ppc assembly smartness, but shouldn't
> the loop back at bne+ be to REFLSYM(15) instead of (14)?
>
> Matt
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Memory allocation error when linking with MPI libraries

2010-08-15 Thread Nysal Jan
>What exactly does compiling with this option imply?
Open MPI's internal malloc library (ptmalloc) will not be built/used. If you
are using an RDMA capable interconnect such as Infiniband, you will not be
able to use the "mpi_leave_pinned" feature. mpi_leave_pinned might improve
performance for applications that reuse/repeatedly send from the same
buffer. If you are not using such interconnects then there is no impact on
performance. For more details see the FAQ entries (24-28) -
http://www.open-mpi.org/faq/?category=openfabrics#large-message-leave-pinned
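For comparison, with a default (memory-manager-enabled) build the feature is switched via an MCA parameter, e.g.:

$ mpirun --mca mpi_leave_pinned 1 -np 16 ./app     (./app is just a placeholder executable)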

--Nysal


On Thu, Aug 12, 2010 at 6:30 PM, Nicolas Deladerriere <
nicolas.deladerri...@gmail.com> wrote:

> building openmpi with the option "--without-memory-manager" fixed my problem.
>
> What exactly does compiling with this option imply?
> I guess all mallocs use functions from libc instead of Open MPI's, but does
> it have an effect on performance or something else?
>
> Nicolas
>
> 2010/8/8 Nysal Jan 
>
> What interconnect are you using? Infiniband? Use
>> "--without-memory-manager" option while building ompi in order to disable
>> ptmalloc.
>>
>> Regards
>> --Nysal
>>
>>
>> On Sun, Aug 8, 2010 at 7:49 PM, Nicolas Deladerriere <
>> nicolas.deladerri...@gmail.com> wrote:
>>
>>> Yes, I'am using 24G machine on 64 bit Linux OS.
>>> If I compile without wrapper, I did not get any problems.
>>>
>>> It seems that when I am linking with openmpi, my program use a kind of
>>> openmpi implemented malloc. Is it possible to switch it off in order ot only
>>> use malloc from libc ?
>>>
>>> Nicolas
>>>
>>> 2010/8/8 Terry Frankcombe 
>>>
>>> You're trying to do a 6GB allocate.  Can your underlying system handle
>>>> that?  IF you compile without the wrapper, does it work?
>>>>
>>>> I see your executable is using the OMPI memory stuff.  IIRC there are
>>>> switches to turn that off.
>>>>
>>>>
>>>> On Fri, 2010-08-06 at 15:05 +0200, Nicolas Deladerriere wrote:
>>>> > Hello,
>>>> >
>>>> > I'am having an sigsegv error when using simple program compiled and
>>>> > link with openmpi.
>>>> > I have reproduce the problem using really simple fortran code. It
>>>> > actually does not even use MPI, but just link with mpi shared
>>>> > libraries. (problem does not appear when I do not link with mpi
>>>> > libraries)
>>>> >% cat allocate.F90
>>>> >program test
>>>> >implicit none
>>>> >integer, dimension(:), allocatable :: z
>>>> >integer(kind=8) :: l
>>>> >
>>>> >write(*,*) "l ?"
>>>> >read(*,*) l
>>>> >
>>>> >ALLOCATE(z(l))
>>>> >z(1) = 111
>>>> >z(l) = 222
>>>> >DEALLOCATE(z)
>>>> >
>>>> >end program test
>>>> >
>>>> > I am using openmpi 1.4.2 and gfortran for my tests. Here is the
>>>> > compilation :
>>>> >
>>>> >% ./openmpi-1.4.2/build/bin/mpif90 --showme -g -o testallocate
>>>> > allocate.F90
>>>> >gfortran -g -o testallocate allocate.F90
>>>> > -I/s0/scr1/TOMOT_19311_HAL_/openmpi-1.4.2/build/include -pthread
>>>> > -I/s0/scr1/TOMOT_19311_HAL_/openmpi-1.4.2/build/lib
>>>> > -L/s0/scr1/TOMOT_19311_HAL_/openmpi-1.4.2/build/lib -lmpi_f90
>>>> > -lmpi_f77 -lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl
>>>> > -lutil -lm -ldl -pthread
>>>> >
>>>> > When I am running that test with different length, I sometimes get a
>>>> > "Segmentation fault" error. Here are two examples using two specific
>>>> > values, but error happens for many other values of length (I did not
>>>> > manage to find which values of lenght gives that error)
>>>> >
>>>> >%  ./testallocate
>>>> > l ?
>>>> >16
>>>> >Segmentation fault
>>>> >% ./testallocate
>>>> > l ?
>>>> >20
>>>> >
>>>> > I used debugger with re-compiled version of openmpi using debug flag.
>>>> > I got the folowing error in function sYSMALLOc
>>>> >
>>&

Re: [OMPI users] [openib] segfault when using openib btl

2010-08-16 Thread Nysal Jan
The value of hdr->tag seems wrong.

In ompi/mca/pml/ob1/pml_ob1_hdr.h
#define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1)
#define MCA_PML_OB1_HDR_TYPE_RNDV  (MCA_BTL_TAG_PML + 2)
#define MCA_PML_OB1_HDR_TYPE_RGET  (MCA_BTL_TAG_PML + 3)
#define MCA_PML_OB1_HDR_TYPE_ACK   (MCA_BTL_TAG_PML + 4)
#define MCA_PML_OB1_HDR_TYPE_NACK  (MCA_BTL_TAG_PML + 5)
#define MCA_PML_OB1_HDR_TYPE_FRAG  (MCA_BTL_TAG_PML + 6)
#define MCA_PML_OB1_HDR_TYPE_GET   (MCA_BTL_TAG_PML + 7)
#define MCA_PML_OB1_HDR_TYPE_PUT   (MCA_BTL_TAG_PML + 8)
#define MCA_PML_OB1_HDR_TYPE_FIN   (MCA_BTL_TAG_PML + 9)

and in ompi/mca/btl/btl.h
#define MCA_BTL_TAG_PML 0x40

So hdr->tag should be a value >= 65
Since the tag is incorrect you are not getting the proper callback function
pointer and hence the SEGV.
I'm not sure at this point why you are getting an invalid/corrupt
message header.

--Nysal

On Tue, Aug 10, 2010 at 7:45 PM, Eloi Gaudry  wrote:

> Hi,
>
> sorry, i just forgot to add the values of the function parameters:
> (gdb) print reg->cbdata
> $1 = (void *) 0x0
> (gdb) print openib_btl->super
> $2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288,
> btl_rndv_eager_limit = 12288, btl_max_send_size = 65536,
> btl_rdma_pipeline_send_length = 1048576,
>  btl_rdma_pipeline_frag_size = 1048576, btl_min_rdma_pipeline_size =
> 1060864, btl_exclusivity = 1024, btl_latency = 10, btl_bandwidth = 800,
> btl_flags = 310,
>  btl_add_procs = 0x2b341eb8ee47 , btl_del_procs =
> 0x2b341eb90156 , btl_register = 0, btl_finalize =
> 0x2b341eb93186 ,
>  btl_alloc = 0x2b341eb90a3e , btl_free =
> 0x2b341eb91400 , btl_prepare_src = 0x2b341eb91813
> ,
>  btl_prepare_dst = 0x2b341eb91f2e , btl_send =
> 0x2b341eb94517 , btl_sendi = 0x2b341eb9340d
> ,
>  btl_put = 0x2b341eb94660 , btl_get = 0x2b341eb94c4e
> , btl_dump = 0x2b341acd45cb ,
> btl_mpool = 0xf3f4110,
>  btl_register_error = 0x2b341eb90565 ,
> btl_ft_event = 0x2b341eb952e7 }
> (gdb) print hdr->tag
> $3 = 0 '\0'
> (gdb) print des
> $4 = (mca_btl_base_descriptor_t *) 0xf4a6700
> (gdb) print reg->cbfunc
> $5 = (mca_btl_base_module_recv_cb_fn_t) 0
>
> Eloi
>
> On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote:
> > Hi,
> >
> > Here is the output of a core file generated during a segmentation fault
> > observed during a collective call (using openib):
> >
> > #0  0x in ?? ()
> > (gdb) where
> > #0  0x in ?? ()
> > #1  0x2aedbc4e05f4 in btl_openib_handle_incoming
> > (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at
> > btl_openib_component.c:2881 #2  0x2aedbc4e25e2 in handle_wc
> > (device=0x19024ac0, cq=0, wc=0x7279ce90) at
> > btl_openib_component.c:3178 #3  0x2aedbc4e2e9d in poll_device
> > (device=0x19024ac0, count=2) at btl_openib_component.c:3318 #4
> > 0x2aedbc4e34b8 in progress_one_device (device=0x19024ac0) at
> > btl_openib_component.c:3426 #5  0x2aedbc4e3561 in
> > btl_openib_component_progress () at btl_openib_component.c:3451 #6
> > 0x2aedb8b22ab8 in opal_progress () at runtime/opal_progress.c:207 #7
> > 0x2aedb859f497 in opal_condition_wait (c=0x2aedb888ccc0,
> > m=0x2aedb888cd20) at ../opal/threads/condition.h:99 #8
>  0x2aedb859fa31
> > in ompi_request_default_wait_all (count=2, requests=0x7279d0e0,
> > statuses=0x0) at request/req_wait.c:262 #9  0x2aedbd7559ad in
> > ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x7279d444,
> > rbuf=0x7279d440, count=1, dtype=0x6788220, op=0x6787a20,
> > comm=0x19d81ff0, module=0x19d82b20) at coll_tuned_allreduce.c:223
> > #10 0x2aedbd7514f7 in ompi_coll_tuned_allreduce_intra_dec_fixed
> > (sbuf=0x7279d444, rbuf=0x7279d440, count=1, dtype=0x6788220,
> > op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20) at
> > coll_tuned_decision_fixed.c:63
> > #11 0x2aedb85c7792 in PMPI_Allreduce (sendbuf=0x7279d444,
> > recvbuf=0x7279d440, count=1, datatype=0x6788220, op=0x6787a20,
> > comm=0x19d81ff0) at pallreduce.c:102 #12 0x04387dbf in
> > FEMTown::MPI::Allreduce (sendbuf=0x7279d444, recvbuf=0x7279d440,
> > count=1, datatype=0x6788220, op=0x6787a20, comm=0x19d81ff0) at
> > stubs.cpp:626 #13 0x04058be8 in FEMTown::Domain::align (itf=
> >
> {>
> > = {_vptr.shared_base_ptr = 0x7279d620, ptr_ = {px = 0x199942a4, pn =
> > {pi_ = 0x6}}}, }) at interface.cpp:371
> > #14 0x040cb858 in
> FEMTown::Field::detail::align_itfs_and_neighbhors
> > (dim=2, set={px = 0x7279d780, pn = {pi_ = 0x2f279d640}},
> > check_info=@0x7279d7f0) at check.cpp:63 #15 0x040cbfa8 in
> > FEMTown::Field::align_elements (set={px = 0x7279d950, pn = {pi_ =
> > 0x66e08d0}}, check_info=@0x7279d7f0) at check.cpp:159 #16
> > 0x039acdd4 in PyField_align_elements (self=0x0,
> > args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:31 #17
> > 0x01fbf76d in FEMTown::Main::ExErrCatch<_object* (*)(_object*,
> > _object*, _object*)>::exec<_object

Re: [OMPI users] [openib] segfault when using openib btl

2010-08-17 Thread Nysal Jan
Hi Eloi,
>Do you think that a thread race condition could explain the hdr->tag value?
Are there multiple threads invoking MPI functions in your application? The
openib BTL is not yet thread safe in the 1.4 release series. There have been
improvements to openib BTL thread safety in 1.5, but it is still not
officially supported.

--Nysal

On Tue, Aug 17, 2010 at 1:06 PM, Eloi Gaudry  wrote:

> Hi Nysal,
>
> This is what I was wondering, it hdr->tag was expected to be null or not.
> I'll soon send a valgrind output to the list, hoping this could help to
> locate an invalid
> memory access allowing to understand why reg->cbfunc / hdr->tag are null.
>
> Do you think that a thread race condition could explain the hdr->tag value
> ?
>
> Thanks for your help,
> Eloi
>
> On Monday 16 August 2010 20:46:39 Nysal Jan wrote:
> > The value of hdr->tag seems wrong.
> >
> > In ompi/mca/pml/ob1/pml_ob1_hdr.h
> > #define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1)
> > #define MCA_PML_OB1_HDR_TYPE_RNDV  (MCA_BTL_TAG_PML + 2)
> > #define MCA_PML_OB1_HDR_TYPE_RGET  (MCA_BTL_TAG_PML + 3)
> > #define MCA_PML_OB1_HDR_TYPE_ACK   (MCA_BTL_TAG_PML + 4)
> > #define MCA_PML_OB1_HDR_TYPE_NACK  (MCA_BTL_TAG_PML + 5)
> > #define MCA_PML_OB1_HDR_TYPE_FRAG  (MCA_BTL_TAG_PML + 6)
> > #define MCA_PML_OB1_HDR_TYPE_GET   (MCA_BTL_TAG_PML + 7)
> > #define MCA_PML_OB1_HDR_TYPE_PUT   (MCA_BTL_TAG_PML + 8)
> > #define MCA_PML_OB1_HDR_TYPE_FIN   (MCA_BTL_TAG_PML + 9)
> >
> > and in ompi/mca/btl/btl.h
> > #define MCA_BTL_TAG_PML 0x40
> >
> > So hdr->tag should be a value >= 65
> > Since the tag is incorrect you are not getting the proper callback
> function
> > pointer and hence the SEGV.
> > I'm not sure at this point as to why you are getting an invalid/corrupt
> > message header ?
> >
> > --Nysal
> >
> > On Tue, Aug 10, 2010 at 7:45 PM, Eloi Gaudry  wrote:
> > > Hi,
> > >
> > > sorry, i just forgot to add the values of the function parameters:
> > > (gdb) print reg->cbdata
> > > $1 = (void *) 0x0
> > > (gdb) print openib_btl->super
> > > $2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288,
> > > btl_rndv_eager_limit = 12288, btl_max_send_size = 65536,
> > > btl_rdma_pipeline_send_length = 1048576,
> > >
> > >  btl_rdma_pipeline_frag_size = 1048576, btl_min_rdma_pipeline_size =
> > >
> > > 1060864, btl_exclusivity = 1024, btl_latency = 10, btl_bandwidth = 800,
> > > btl_flags = 310,
> > >
> > >  btl_add_procs = 0x2b341eb8ee47 ,
> btl_del_procs
> > >  =
> > >
> > > 0x2b341eb90156 , btl_register = 0,
> btl_finalize
> > > = 0x2b341eb93186 ,
> > >
> > >  btl_alloc = 0x2b341eb90a3e , btl_free =
> > >
> > > 0x2b341eb91400 , btl_prepare_src = 0x2b341eb91813
> > > ,
> > >
> > >  btl_prepare_dst = 0x2b341eb91f2e ,
> btl_send
> > >  =
> > >
> > > 0x2b341eb94517 , btl_sendi = 0x2b341eb9340d
> > > ,
> > >
> > >  btl_put = 0x2b341eb94660 , btl_get =
> 0x2b341eb94c4e
> > >
> > > , btl_dump = 0x2b341acd45cb ,
> > > btl_mpool = 0xf3f4110,
> > >
> > >  btl_register_error = 0x2b341eb90565
> ,
> > >
> > > btl_ft_event = 0x2b341eb952e7 }
> > > (gdb) print hdr->tag
> > > $3 = 0 '\0'
> > > (gdb) print des
> > > $4 = (mca_btl_base_descriptor_t *) 0xf4a6700
> > > (gdb) print reg->cbfunc
> > > $5 = (mca_btl_base_module_recv_cb_fn_t) 0
> > >
> > > Eloi
> > >
> > > On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote:
> > > > Hi,
> > > >
> > > > Here is the output of a core file generated during a segmentation
> fault
> > > > observed during a collective call (using openib):
> > > >
> > > > #0  0x in ?? ()
> > > > (gdb) where
> > > > #0  0x in ?? ()
> > > > #1  0x2aedbc4e05f4 in btl_openib_handle_incoming
> > > > (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18)
> at
> > > > btl_openib_component.c:2881 #2  0x2aedbc4e25e2 in handle_wc
> > > > (device=0x19024ac0, cq=0, wc=0x7279ce90) at
> > > > btl_openib_component.c:3178 #3  0x2aedbc4e2e9d in poll_device
> > > > (device=0x19024ac0, count=2) at btl_openib_component.c:3318 #4
> > > > 0x2aedbc

Re: [OMPI users] Checksuming in openmpi 1.4.1

2010-09-01 Thread Nysal Jan
Hi Gilbert,
Checksums are turned off by default. If you need checksums to be activated
add "-mca pml csum" to the mpirun command line.
Checksums are enabled only for inter-node communication. Intra-node
communication is typically over shared memory and hence checksum is disabled
for this case.
If you have built a debug version of Open MPI (--enable-debug), you can see
the checksum output by appending "-mca pml_base_verbose 5" to your command
line.
If you are interested in the relevant code it is located here -
ompi/mca/pml/csum

--Nysal

On Tue, Aug 31, 2010 at 1:22 PM, Gilbert Grosdidier wrote:

> Hello,
>>
>> I'm not sure I understand how to trigger CHECKSUM use
>> inside of OpenMPI 1.4.1 (after digging in the FAQs, I got not
>> explanations, sorry):
>>
>> - Is checksuming activated by default and embedded automatically
>> within the Send/Recv pair mechanism, please ?
>> - If not, which MCA param(S) should I set to activate it ?
>> - Is there a time penalty for using it, please ?
>>
>> Thanks in advance for any help.
>>
>> --
>> Regards, Gilbert.
>>
>>
>>
> --
> *-*
>  Gilbert Grosdidier gilbert.grosdid...@in2p3.fr
>  LAL / IN2P3 / CNRS Phone : +33 1 6446 8909
 Faculté des Sciences, Bat. 200 Fax   : +33 1 6446 8546
>  B.P. 34, F-91898 Orsay Cedex (FRANCE)
>  -
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] [openib] segfault when using openib btl

2010-09-17 Thread Nysal Jan
Hi Eloi,
Sorry for the delay in response. I haven't read the entire email thread, but
do you have a test case which can reproduce this error? Without that it will
be difficult to nail down the cause. Just to clarify, I do not work for an
iwarp vendor. I can certainly try to reproduce it on an IB system. There is
also a PML called csum, you can use it via "-mca pml csum", which will
checksum the MPI messages and verify it at the receiver side for any data
corruption. You can try using it to see if it is able to catch anything.

Regards
--Nysal

On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry  wrote:

> Hi Nysal,
>
> I'm sorry to intrrupt, but I was wondering if you had a chance to look at
> this error.
>
> Regards,
> Eloi
>
>
>
> --
>
>
> Eloi Gaudry
>
> Free Field Technologies
> Company Website: http://www.fft.be
> Company Phone:   +32 10 487 959
>
>
> -- Forwarded message --
> From: Eloi Gaudry 
> To: Open MPI Users 
> Date: Wed, 15 Sep 2010 16:27:43 +0200
> Subject: Re: [OMPI users] [openib] segfault when using openib btl
> Hi,
>
> I was wondering if anybody got a chance to have a look at this issue.
>
> Regards,
> Eloi
>
>
> On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote:
> > Hi Jeff,
> >
> > Please find enclosed the output (valgrind.out.gz) from
> > /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host pbn11,pbn10 --mca btl
> > openib,self --display-map --verbose --mca mpi_warn_on_fork 0 --mca
> > btl_openib_want_fork_support 0 -tag-output
> > /opt/valgrind-3.5.0/bin/valgrind --tool=memcheck
> > --suppressions=/opt/openmpi-debug-1.4.2/share/openmpi/openmpi-
> > valgrind.supp --suppressions=./suppressions.python.supp
> > /opt/actran/bin/actranpy_mp ...
> >
> > Thanks,
> > Eloi
> >
> > On Tuesday 17 August 2010 09:32:53 Eloi Gaudry wrote:
> > > On Monday 16 August 2010 19:14:47 Jeff Squyres wrote:
> > > > On Aug 16, 2010, at 10:05 AM, Eloi Gaudry wrote:
> > > > > I did run our application through valgrind but it couldn't find any
> > > > > "Invalid write": there is a bunch of "Invalid read" (I'm using
> 1.4.2
> > > > > with the suppression file), "Use of uninitialized bytes" and
> > > > > "Conditional jump depending on uninitialized bytes" in different
> ompi
> > > > > routines. Some of them are located in btl_openib_component.c. I'll
> > > > > send you an output of valgrind shortly.
> > > >
> > > > A lot of them in btl_openib_* are to be expected -- OpenFabrics uses
> > > > OS-bypass methods for some of its memory, and therefore valgrind is
> > > > unaware of them (and therefore incorrectly marks them as
> > > > uninitialized).
> > >
> > > would it  help if i use the upcoming 1.5 version of openmpi ? i read
> that
> > > a huge effort has been done to clean-up the valgrind output ? but maybe
> > > that this doesn't concern this btl (for the reasons you mentionned).
> > >
> > > > > Another question, you said that the callback function pointer
> should
> > > > > never be 0. But can the tag be null (hdr->tag) ?
> > > >
> > > > The tag is not a pointer -- it's just an integer.
> > >
> > > I was worrying that its value could not be null.
> > >
> > > I'll send a valgrind output soon (i need to build libpython without
> > > pymalloc first).
> > >
> > > Thanks,
> > > Eloi
> > >
> > > > > Thanks for your help,
> > > > > Eloi
> > > > >
> > > > > On 16/08/2010 18:22, Jeff Squyres wrote:
> > > > >> Sorry for the delay in replying.
> > > > >>
> > > > >> Odd; the values of the callback function pointer should never be
> 0.
> > > > >> This seems to suggest some kind of memory corruption is occurring.
> > > > >>
> > > > >> I don't know if it's possible, because the stack trace looks like
> > > > >> you're calling through python, but can you run this application
> > > > >> through valgrind, or some other memory-checking debugger?
> > > > >>
> > > > >> On Aug 10, 2010, at 7:15 AM, Eloi Gaudry wrote:
> > > > >>> Hi,
> > > > >>>
> > > > >>> sorry, i just forgot to add the values of the function
> parameters:
> > > > >>> (gdb) print reg->cbdata
> > > > >>> $1 = (void *) 0x0
> > > > >>> (gdb) print openib_btl->super
> > > > >>> $2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288,
> > > > >>> btl_rndv_eager_limit = 12288, btl_max_send_size = 65536,
> > > > >>> btl_rdma_pipeline_send_length = 1048576,
> > > > >>>
> > > > >>>   btl_rdma_pipeline_frag_size = 1048576,
> btl_min_rdma_pipeline_size
> > > > >>>   = 1060864, btl_exclusivity = 1024, btl_latency = 10,
> > > > >>>   btl_bandwidth = 800, btl_flags = 310, btl_add_procs =
> > > > >>>   0x2b341eb8ee47, btl_del_procs =
> > > > >>>   0x2b341eb90156, btl_register = 0,
> > > > >>>   btl_finalize = 0x2b341eb93186,
> btl_alloc
> > > > >>>   = 0x2b341eb90a3e, btl_free =
> > > > >>>   0x2b341eb91400, btl_prepare_src =
> > > > >>>   0x2b341eb91813, btl_prepare_dst =
> > > > >>>   0x2b341eb91f2e, btl_send =
> > > > >>>   0x2b341eb94517, btl_sendi =
> > > > >>>   0x2b341eb9340d, btl_put =
> > > > >>>   0x2b341eb94660, btl_get =
> > > > >>>   0x2b341eb94c4e, b

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-29 Thread Nysal Jan
> > >>>>>>>> certain sizes. At least that is one gut feel I have.
> > >>>>>>>>
> > >>>>>>>> In my mind, the tag being 0 means either something below OMPI is
> > >>>>>>>> polluting the data fragment or OMPI's internal protocol is somehow
> > >>>>>>>> getting messed up.  I can imagine (no empirical data here)
> > >>>>>>>> the queue sizes could change how the OMPI protocol sets things
> > >>>>>>>> up. Another thing may be the coalescing feature in the openib
> BTL
> > >>>>>>>> which tries to gang multiple messages into one packet when
> > >>>>>>>> resources are running low.   I can see where changing the queue
> > >>>>>>>> sizes might affect the coalescing. So, it might be interesting
> to
> > >>>>>>>> turn off the coalescing.  You can do that by setting "--mca
> > >>>>>>>> btl_openib_use_message_coalescing 0" in your mpirun line.
> > >>>>>>>>
> > >>>>>>>> If that doesn't solve the issue then obviously there must be
> > >>>>>>>> something else going on :-).
> > >>>>>>>>
> > >>>>>>>> Note, the reason I am interested in this is I am seeing a
> similar
> > >>>>>>>> error condition (hdr->tag == 0) on a development system.  Though
> > >>>>>>>> my failing case fails with np=8 using the connectivity test
> > >>>>>>>> program which is mainly point to point and there are not a
> > >>>>>>>> significant amount of data transfers going on either.
> > >>>>>>>>
> > >>>>>>>> --td
> > >>>>>>>>
> > >>>>>>>>> Eloi
> > >>>>>>>>>
> > >>>>>>>>> On Friday 24 September 2010 14:27:07 you wrote:
> > >>>>>>>>>> That is interesting.  So does the number of processes affect
> > >>>>>>>>>> your runs at all?  The times I've seen hdr->tag be 0, it has
> > >>>>>>>>>> usually been due to protocol issues.  The tag should never be 0.
> > >>>>>>>>>> Have you tried receive_queue settings other than the default and
> > >>>>>>>>>> the one you mention?
> > >>>>>>>>>>
> > >>>>>>>>>> I wonder if you did a combination of the two receive queues
> > >>>>>>>>>> causes a failure or not.  Something like
> > >>>>>>>>>>
> > >>>>>>>>>> P,128,256,192,128:P,65536,256,192,128
> > >>>>>>>>>>
> > >>>>>>>>>> I am wondering if it is the first queuing definition causing
> the
> > >>>>>>>>>> issue or possibly the SRQ defined in the default.
> > >>>>>>>>>>
> > >>>>>>>>>> --td
> > >>>>>>>>>>
> > >>>>>>>>>> Eloi Gaudry wrote:
> > >>>>>>>>>>> Hi Terry,
> > >>>>>>>>>>>
> > >>>>>>>>>>> The messages being sent/received can be of any size, but the
> > >>>>>>>>>>> error seems to happen more often with small messages (such as
> > >>>>>>>>>>> an int being broadcast or allreduced). The failing communication
> > >>>>>>>>>>> differs from one run to another, but some spots are more likely
> > >>>>>>>>>>> to fail than others. And as far as I know, they are always
> > >>>>>>>>>>> located next to a small-message communication (an int being
> > >>>>>>>>>>> broadcast, for instance). Other typical message sizes are
> > >>>>>>>>>>> >10k but they can be much larger.
> > >>>>>>>>>>>

Re: [OMPI users] Creating 64-bit objects?

2010-11-10 Thread Nysal Jan
Hi Brian,
This problem was first reported by Paul H. Hargrove in the developer mailing
list. It is a bug in libtool and has been fixed in the latest release
(2.2.8). More details are available here -
http://www.open-mpi.org/community/lists/devel/2010/10/8606.php

Regards
--Nysal

On Wed, Nov 10, 2010 at 1:04 AM, Price, Brian M (N-KCI) <
brian.m.pr...@lmco.com> wrote:

>  OpenMPI version: 1.3.3 & 1.4.3
>
> Platform: IBM P5
>
> Issue:  I want OpenMPI to support some existing 64-bit FORTRAN software,
> but I can’t seem to get 64-bit objects from OpenMPI without some
> modification to the Makefile in ompi/mpi/f90.
>
> I can configure, build, and install just fine with the following compilers:
>
> -  CC = xlC_r
>
> -  CXX = xlC_r
>
> -  F77 = xlf95_r
>
> -  FC = xlf95_r
>
> But, this configuration produces 32-bit objects for all languages.
>
> So, to produce 64-bit objects for all languages, I supply the following
> flags:
>
> -  CFLAGS = -q64
>
> -  CXXFLAGS = -q64
>
> -  FFLAGS = -q64
>
> -  FCFLAGS = -q64
>
> This configuration results in the following error during the build (more
> specifically, link) phase:
>
> -  When creating libmpi_f90.la in ompi/mpi/f90
>
> -  COMMANDS:
>
> o   /bin/sh ../../../libtool  --mode=link xlf95_r -I../../../ompi/include
> -I../../../ompi/include -I. -I. -I../../../ompi/mpi/f90  -q64 -version-info
> 0:1:0  -export-dynamic  -o libmpi_f90.la -rpath /lib mpi.lo
> mpi_sizeof.lo mpi_comm_spawn_multiple_f90.lo mpi_testall_f90.lo
> mpi_testsome_f90.lo mpi_waitall_f90.lo mpi_waitsome_f90.lo mpi_wtick_f90.lo
> mpi_wtime_f90.lo  ../../../ompi/libmpi.la -lnsl -lutil
>
> o   libtool: link: /usr/bin/ld -m elf64ppc -shared  .libs/mpi.o
> .libs/mpi_sizeof.o .libs/mpi_comm_spawn_multiple_f90.o
> .libs/mpi_testall_f90.o .libs/mpi_testsome_f90.o .libs/mpi_waitall_f90.o
> .libs/mpi_waitsome_f90.o .libs/mpi_wtick_f90.o .libs/mpi_wtime_f90.lo
> -L/orte/.libs -L/opal/.libs ../../../ompi/.libs/libmpi.so
> /orte/.libs/libopen-rte.so /opal/.libs/libopen-pal.so -ldl
> -lnsl -lutil  -q64  -soname libmpi_f90.so.0 -o .libs/libmpi_f90.so.0.0.1
>
> -  OUTPUT:
>
> /usr/bin/ld: unrecognized option ‘-q64’
>
> /usr/bin/ld: use the --help option for usage information
>
> make[4]: *** [libmpi_f90.la] Error 1
>
> make[4]: Leaving directory `/ompi/mpi/f90`
>
> make[3]: *** [all-recursive] Error 1
>
> make[3]: Leaving directory `/ompi/mpi/f90`
>
> make[2]: *** [all] Error 2
>
> make[2]: Leaving directory `/ompi/mpi/f90`
>
> make[1]: *** [all-recursive] Error 1
>
> make[1]: Leaving directory `/ompi`
>
> make: *** [all-recursive] Error 1
>
>
>
> The -q64 option, while valid for the xlf95_r compiler, is not a valid
> option for /usr/bin/ld.  So, I’m wondering why this option got passed to
> /usr/bin/ld.  After looking at /ompi/mpi/f90/Makefile, I see that
> FCFLAGS shows up in link lines (“libmpi_f90_la_LINK” and “FCLINK”).  This
> direction seems to come from Makefile.in.
>
> If I remove these FCFLAGS references from the Makefile, I am able to
> complete the build and install of OpenMPI, and it seems to correctly support
> my existing software.
>
> So,  now for my question:
>
> Should FCFLAGS show up on these links lines and, if so, how would I get
> 64-bit objects?
>
> Thanks,
>
> Brian Price
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Creating 64-bit objects?

2010-11-11 Thread Nysal Jan
Jeff,
Yes the issue was reported to exist on both 1.4.3 and 1.5. I have created a
ticket to get this fixed - https://svn.open-mpi.org/trac/ompi/ticket/2626
We can patch libtool locally as you suggested.

--Nysal

On Wed, Nov 10, 2010 at 7:21 PM, Jeff Squyres  wrote:

> Nysal --
>
> Does the same issue occur in OMPI 1.5?
>
> Should we put in a local patch for OMPI 1.4.x and/or OMPI 1.5?  (we've done
> this before while waiting for upstream Libtool patches to be released, etc.)
>
>
>
> On Nov 10, 2010, at 2:19 AM, Nysal Jan wrote:
>
> > Hi Brian,
> > This problem was first reported by Paul H. Hargrove in the developer
> mailing list. It is a bug in libtool and has been fixed in the latest
> release (2.2.8). More details are available here -
> http://www.open-mpi.org/community/lists/devel/2010/10/8606.php
> >
> > Regards
> > --Nysal
> >
> > On Wed, Nov 10, 2010 at 1:04 AM, Price, Brian M (N-KCI) <
> brian.m.pr...@lmco.com> wrote:
> > OpenMPI version: 1.3.3 & 1.4.3
> >
> > Platform: IBM P5
> >
> > Issue:  I want OpenMPI to support some existing 64-bit FORTRAN software,
> but I can’t seem to get 64-bit objects from OpenMPI without some
> modification to the Makefile in ompi/mpi/f90.
> >
> > I can configure, build, and install just fine with the following
> compilers:
> >
> > -  CC = xlC_r
> >
> > -  CXX = xlC_r
> >
> > -  F77 = xlf95_r
> >
> > -  FC = xlf95_r
> >
> > But, this configuration produces 32-bit objects for all languages.
> >
> > So, to produce 64-bit objects for all languages, I supply the following
> flags:
> >
> > -  CFLAGS = -q64
> >
> > -  CXXFLAGS = -q64
> >
> > -  FFLAGS = -q64
> >
> > -  FCFLAGS = -q64
> >
> > This configuration results in the following error during the build (more
> specifically, link) phase:
> >
> > -  When creating libmpi_f90.la in ompi/mpi/f90
> >
> > -  COMMANDS:
> >
> > o   /bin/sh ../../../libtool  --mode=link xlf95_r -I../../../ompi/include
> -I../../../ompi/include -I. -I. -I../../../ompi/mpi/f90  -q64 -version-info
> 0:1:0  -export-dynamic  -o libmpi_f90.la -rpath /lib mpi.lo
> mpi_sizeof.lo mpi_comm_spawn_multiple_f90.lo mpi_testall_f90.lo
> mpi_testsome_f90.lo mpi_waitall_f90.lo mpi_waitsome_f90.lo mpi_wtick_f90.lo
> mpi_wtime_f90.lo  ../../../ompi/libmpi.la -lnsl -lutil
> >
> > o   libtool: link: /usr/bin/ld -m elf64ppc -shared  .libs/mpi.o
> .libs/mpi_sizeof.o .libs/mpi_comm_spawn_multiple_f90.o
> .libs/mpi_testall_f90.o .libs/mpi_testsome_f90.o .libs/mpi_waitall_f90.o
> .libs/mpi_waitsome_f90.o .libs/mpi_wtick_f90.o .libs/mpi_wtime_f90.lo
>  -L/orte/.libs -L/opal/.libs ../../../ompi/.libs/libmpi.so
> /orte/.libs/libopen-rte.so /opal/.libs/libopen-pal.so -ldl
> -lnsl -lutil  -q64  -soname libmpi_f90.so.0 -o .libs/libmpi_f90.so.0.0.1
> >
> > -  OUTPUT:
> >
> > /usr/bin/ld: unrecognized option ‘-q64’
> > /usr/bin/ld: use the --help option for usage information
> > make[4]: *** [libmpi_f90.la] Error 1
> > make[4]: Leaving directory `/ompi/mpi/f90`
> > make[3]: *** [all-recursive] Error 1
> > make[3]: Leaving directory `/ompi/mpi/f90`
> > make[2]: *** [all] Error 2
> > make[2]: Leaving directory `/ompi/mpi/f90`
> > make[1]: *** [all-recursive] Error 1
> > make[1]: Leaving directory `/ompi`
> > make: *** [all-recursive] Error 1
> >
> > The -q64 option, while valid for the xlf95_r compiler, is not a valid
> option for /usr/bin/ld.  So, I’m wondering why this option got passed to
> /usr/bin/ld.  After looking at /ompi/mpi/f90/Makefile, I see that
> FCFLAGS shows up in link lines (“libmpi_f90_la_LINK” and “FCLINK”).  This
> direction seems to come from Makefile.in.
> >
> > If I remove these FCFLAGS references from the Makefile, I am able to
> complete the build and install of OpenMPI, and it seems to correctly support
> my existing software.
> >
> > So,  now for my question:
> >
> > Should FCFLAGS show up on these links lines and, if so, how would I get
> 64-bit objects?
> >
> > Thanks,
> >
> > Brian Price
> >
> >
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] EXTERNAL: Re: Creating 64-bit objects?

2010-11-11 Thread Nysal Jan
I believe the libtool version (<2.2.8) used to make the 1.4.3 and 1.5
tarballs does not have this fix. I have opened a ticket to get this fixed -
https://svn.open-mpi.org/trac/ompi/ticket/2626

--Nysal

On Wed, Nov 10, 2010 at 7:08 PM, Price, Brian M (N-KCI) <
brian.m.pr...@lmco.com> wrote:

>  Thanks, Nysal.
>
>
>
> The only problem I’m having now is connecting a libtool version (e.g.
> 2.2.8) with an OpenMPI version.  I’m sorry if it’s a silly question, but can
> you tell me in which version of OpenMPI this problem will go away?
>
>
>
> Thanks, again.
>
>
>
> Brian
>
>
>
>
>
> *From:* users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] *On
> Behalf Of *Nysal Jan
> *Sent:* Wednesday, November 10, 2010 12:19 AM
> *To:* Open MPI Users
> *Subject:* EXTERNAL: Re: [OMPI users] Creating 64-bit objects?
>
>
>
> Hi Brian,
> This problem was first reported by Paul H. Hargrove in the developer
> mailing list. It is a bug in libtool and has been fixed in the latest
> release (2.2.8). More details are available here -
> http://www.open-mpi.org/community/lists/devel/2010/10/8606.php
>
> Regards
> --Nysal
>
> On Wed, Nov 10, 2010 at 1:04 AM, Price, Brian M (N-KCI) <
> brian.m.pr...@lmco.com> wrote:
>
> OpenMPI version: 1.3.3 & 1.4.3
>
> Platform: IBM P5
>
> Issue:  I want OpenMPI to support some existing 64-bit FORTRAN software,
> but I can’t seem to get 64-bit objects from OpenMPI without some
> modification to the Makefile in ompi/mpi/f90.
>
> I can configure, build, and install just fine with the following compilers:
>
> -  CC = xlC_r
>
> -  CXX = xlC_r
>
> -  F77 = xlf95_r
>
> -  FC = xlf95_r
>
> But, this configuration produces 32-bit objects for all languages.
>
> So, to produce 64-bit objects for all languages, I supply the following
> flags:
>
> -  CFLAGS = -q64
>
> -  CXXFLAGS = -q64
>
> -  FFLAGS = -q64
>
> -  FCFLAGS = -q64
>
> This configuration results in the following error during the build (more
> specifically, link) phase:
>
> -  When creating libmpi_f90.la in ompi/mpi/f90
>
> -  COMMANDS:
>
> o   /bin/sh ../../../libtool  --mode=link xlf95_r -I../../../ompi/include
> -I../../../ompi/include -I. -I. -I../../../ompi/mpi/f90  -q64 -version-info
> 0:1:0  -export-dynamic  -o libmpi_f90.la -rpath /lib mpi.lo
> mpi_sizeof.lo mpi_comm_spawn_multiple_f90.lo mpi_testall_f90.lo
> mpi_testsome_f90.lo mpi_waitall_f90.lo mpi_waitsome_f90.lo mpi_wtick_f90.lo
> mpi_wtime_f90.lo  ../../../ompi/libmpi.la -lnsl -lutil
>
> o   libtool: link: /usr/bin/ld -m elf64ppc -shared  .libs/mpi.o
> .libs/mpi_sizeof.o .libs/mpi_comm_spawn_multiple_f90.o
> .libs/mpi_testall_f90.o .libs/mpi_testsome_f90.o .libs/mpi_waitall_f90.o
> .libs/mpi_waitsome_f90.o .libs/mpi_wtick_f90.o .libs/mpi_wtime_f90.lo
> -L/orte/.libs -L/opal/.libs ../../../ompi/.libs/libmpi.so
> /orte/.libs/libopen-rte.so /opal/.libs/libopen-pal.so -ldl
> -lnsl -lutil  -q64  -soname libmpi_f90.so.0 -o .libs/libmpi_f90.so.0.0.1
>
> -  OUTPUT:
>
> /usr/bin/ld: unrecognized option ‘-q64’
>
> /usr/bin/ld: use the --help option for usage information
>
> make[4]: *** [libmpi_f90.la] Error 1
>
> make[4]: Leaving directory `/ompi/mpi/f90`
>
> make[3]: *** [all-recursive] Error 1
>
> make[3]: Leaving directory `/ompi/mpi/f90`
>
> make[2]: *** [all] Error 2
>
> make[2]: Leaving directory `/ompi/mpi/f90`
>
> make[1]: *** [all-recursive] Error 1
>
> make[1]: Leaving directory `/ompi`
>
> make: *** [all-recursive] Error 1
>
>
>
> The -q64 option, while valid for the xlf95_r compiler, is not a valid
> option for /usr/bin/ld.  So, I’m wondering why this option got passed to
> /usr/bin/ld.  After looking at /ompi/mpi/f90/Makefile, I see that
> FCFLAGS shows up in link lines (“libmpi_f90_la_LINK” and “FCLINK”).  This
> direction seems to come from Makefile.in.
>
> If I remove these FCFLAGS references from the Makefile, I am able to
> complete the build and install of OpenMPI, and it seems to correctly support
> my existing software.
>
> So,  now for my question:
>
> Should FCFLAGS show up on these links lines and, if so, how would I get
> 64-bit objects?
>
> Thanks,
>
> Brian Price
>
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Open MPI vs IBM MPI performance help

2010-12-03 Thread Nysal Jan
Collecting MPI profiling information might help narrow down the issue. You
could use some of the tools mentioned here -
http://www.open-mpi.org/faq/?category=perftools
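
As a lightweight first step (an illustrative sketch, not part of the original
reply; do_communication_phase is a placeholder name for the real
MPI_Put/MPI_Get/MPI_Barrier code), per-rank timing with MPI_Wtime can already
show whether the extra time is spent in the transfers or in the barriers:

#include <mpi.h>
#include <stdio.h>

/* Placeholder for the application's real communication phase. */
static void do_communication_phase(void) { MPI_Barrier(MPI_COMM_WORLD); }

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double t0 = MPI_Wtime();
    do_communication_phase();
    double t = MPI_Wtime() - t0;

    /* Report min/max/average time across ranks on rank 0. */
    double tmin, tmax, tsum;
    MPI_Reduce(&t, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("comm phase: min %.3fs max %.3fs avg %.3fs\n",
               tmin, tmax, tsum / size);

    MPI_Finalize();
    return 0;
}

Comparing these numbers under both MPI implementations would indicate which
operations account for the 3-5x difference before reaching for a full tracer.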

--Nysal

On Wed, Dec 1, 2010 at 11:59 PM, Price, Brian M (N-KCI) <
brian.m.pr...@lmco.com> wrote:

>  OpenMPI version: 1.4.3
>
> Platform: IBM P5, 32 processors, 256 GB memory, Symmetric Multi-Threading
> (SMT) enabled
>
> Application: starts up 48 processes and does MPI using MPI_Barrier,
> MPI_Get, MPI_Put (lots of transfers, large amounts of data)
>
> Issue:  When implemented using Open MPI vs. IBM’s MPI (‘poe’ from HPC
> Toolkit), the application runs 3-5 times slower.
>
> I suspect that IBM’s MPI implementation must take advantage of some
> knowledge that it has about data transfers that Open MPI is not taking
> advantage of.
>
> Any suggestions?
>
> Thanks,
>
> Brian Price
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Problem with AlltoAll routine

2008-05-17 Thread Nysal Jan
Gabriele,
Can you try with Open MPI 1.2.6? It has a parameter to disable early
completion; set it to zero (-mca pml_ob1_use_early_completion 0).
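
For reference, a stand-alone reproducer of the failing pattern could look like
the sketch below (hypothetical, not from the original thread; it simply
repeats MPI_Alltoall at the count=2048 reported below), so the parameter above
can be tested without running the full SkaMPI suite:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    const int count = 2048;            /* the count reported to hang */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *sendbuf = malloc((size_t)count * size * sizeof(int));
    int *recvbuf = malloc((size_t)count * size * sizeof(int));
    for (i = 0; i < count * size; i++) sendbuf[i] = rank;

    /* Repeat the all-to-all a number of times at the problematic count. */
    for (i = 0; i < 100; i++)
        MPI_Alltoall(sendbuf, count, MPI_INT, recvbuf, count, MPI_INT,
                     MPI_COMM_WORLD);

    if (rank == 0) printf("alltoall count=%d completed\n", count);
    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}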

--Nysal

On Wed, May 7, 2008 at 9:29 PM, Gabriele FATIGATI 
wrote:

> I have attached informations requested about Infiniband net and OpenMPi
> enviroment. There is also LSF script used to launch the application.
>
> On Tue, 6 May 2008 21:30:17 -0500, Brad Benton said:
> >
> > Hello Gabriele,
> >
> > To help track down this problem, could I ask you to take a look at the
> Open
> > MPI "Getting Help" page?
> >   http://www.open-mpi.org/community/help/
> >
> > In particular, if you could collect and send the information requested on
> > that page to the list, it will help us to better understand your
> > configuration and how others might reproduce the problem.
> >
> > Thanks & Regards,
> > --Brad
> >
> > Brad Benton
> > IBM
> >
> >
> > On Tue, May 6, 2008 at 10:35 AM, Gabriele FATIGATI  >
> > wrote:
> >
> > > Hi,
> > > I tried to run the SkaMPI 5.0.4 benchmark on an IBM BladeCenter LS21
> > > system with 256 processors over 5 Gb/s Infiniband, but the test stopped
> > > in the "AlltoAll-length" routine, with count=2048, for some reason.
> > >
> > > I have launched test with:
> > > --mca btl_openib_eager_limit 1024
> > >
> > > The same tests with 128 processors or fewer finished successfully.
> > >
> > > Different values of the eager limit don't solve the problem. The version
> > > of OpenMPI involved is 1.2.5. Has anyone else seen this kind of problem
> > > over Infiniband?
> > > Thanks in advance.
> > > --
> > > Gabriele Fatigati
> > >
> > > CINECA Systems & Tecnologies Department
> > >
> > > Supercomputing  Group
> > >
> > > Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
> > >
> > > www.cineca.itTel:39 051 6171722
> > >
> > > g.fatig...@cineca.it
> > > ___
> > > users mailing list
> > > us...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > >
> >
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


[OMPI users] Heap profiling with OpenMPI

2008-08-05 Thread Jan Ploski
Hi,

I wanted to determine the peak heap memory usage of each MPI process in my 
application. Using MVAPICH it can be done by simply substituting a wrapper 
shell script for the MPI executable and from that wrapper script starting 
"valgrind --tool=massif ./prog.exe". However, when I tried the same 
approach with OpenMPI, I got no output from massif (empty output file), 
even though the MPI process ran and finished as expected. Can someone 
explain this or provide an alternative way of obtaining the same 
information (preferably one that does not require source code 
instrumentation and recompiling)?
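
If a small amount of source instrumentation were acceptable (which the
question above explicitly tries to avoid), a rough per-process peak could also
be read from the OS without valgrind; the following is only an illustrative
sketch, and note that getrusage() reports peak resident set size (ru_maxrss,
in kilobytes on Linux), not heap usage proper:

#include <mpi.h>
#include <stdio.h>
#include <sys/resource.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ... the application's real work would go here ... */

    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    /* ru_maxrss is the peak resident set size in kB on Linux */
    printf("rank %d: peak RSS %ld kB\n", rank, ru.ru_maxrss);

    MPI_Finalize();
    return 0;
}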

Best regards,
Jan Ploski

--
Dipl.-Inform. (FH) Jan Ploski
OFFIS
FuE Bereich Energie | R&D Division Energy
Escherweg 2  - 26121 Oldenburg - Germany
Phone/Fax: +49 441 9722 - 184 / 202
E-Mail: jan.plo...@offis.de
URL: http://www.offis.de


Re: [OMPI users] Heap profiling with OpenMPI

2008-08-06 Thread Jan Ploski
users-boun...@open-mpi.org wrote on 08/05/2008 05:51:51 PM:

> Jan,
> 
> I'm using valgrind with Open MPI on a [very] regular basis and I never 
> had any problems. I usually want to know the execution path on the MPI 
> applications. For this I use:
> mpirun -np XX valgrind --tool=callgrind -q --log-file=some_file ./my_app
> 
> I just run your example:
>  mpirun -np 2 -bynode --mca btl tcp,self valgrind --tool=massif - 
> q ./NPmpi -u 4
> and I got 2 non empty files in the current directory:
>  bosilca@dancer:~/NetPIPE_3.6.2$ ls -l massif.out.*
>  -rw--- 1 bosilca bosilca 140451 2008-08-05 11:57 massif.out. 
> 21197
>  -rw--- 1 bosilca bosilca 131471 2008-08-05 11:57 massif.out. 
> 21210

George,

Thanks for the info - which version of OpenMPI, compiler and valgrind did 
you try with? I checked in two different clusters with OpenMPI 1.2.4 
compiled with two different versions of the PGI compiler and valgrind 
3.3.1, with the same bad result. I also noticed that the MPI processes,
despite producing the expected output, do not terminate cleanly. I can 
see in the stderr log (for each process):

==7909== Warning: client syscall munmap tried to modify addresses 
0xD1968F92A19A72D1-0x34324E6F
==7909== 
==7909== Process terminating with default action of signal 11 (SIGSEGV)
==7909==  Access not within mapped region at address 0x8053D8000
==7909==at 0x5284996: _int_free (in 
/opt/openmpi-1.2.4/lib/libopen-pal.so.0.0.0)
==7909==by 0x52837A7: free (in 
/opt/openmpi-1.2.4/lib/libopen-pal.so.0.0.0)
==7909==by 0x593C76A: free_mem (in /lib64/libc-2.4.so)
==7909==by 0x593C3E1: __libc_freeres (in /lib64/libc-2.4.so)
==7909==by 0x491D31C: _vgnU_freeres (vg_preloaded.c:60)
==7909==by 0x587D1C4: exit (in /lib64/libc-2.4.so)
==7909==by 0x586815A: (below main) (in /lib64/libc-2.4.so)

That probably explains why my massif.out.* are empty (<200 bytes long), 
but why do the processes crash? The same program runs ok with 
valgrind+MVAPICH or with OpenMPI without valgrind in their respective 
clusters. I experience this both with a simple test program and with a 
real application (WRF).

Regards,
Jan Ploski


Re: [OMPI users] Heap profiling with OpenMPI

2008-08-06 Thread Jan Ploski

George Bosilca wrote:

Jan,

I'm using the latest of Open MPI compiled with debug turned on, and 
valgrind 3.3.0. From your trace it looks like there is a conflict 
between two memory managers. I'm not having the same problem as I 
disable the Open MPI memory manager on my builds (configure option 
--without-memory-manager).


Thanks for the tip! I confirm that the problem goes away after 
rebuilding with --without-memory-manager.


As I also have the same problem in another cluster, I'd like to know 
what side effects using this configuration option might have before 
suggesting it as a solution to that cluster's admin. I didn't find an 
explanation of what it does in the FAQ (beyond a recommendation to use 
it for static builds). Could you please explain this option, especially 
why one might want to *not* use it?


Regards,
Jan Ploski


Re: [OMPI users] Heap profiling with OpenMPI

2008-08-07 Thread Jan Ploski
users-boun...@open-mpi.org wrote on 08/06/2008 07:44:03 PM:

> On Aug 6, 2008, at 12:37 PM, Jan Ploski wrote:
> 
> >> I'm using the latest of Open MPI compiled with debug turned on, and 
> >> valgrind 3.3.0. From your trace it looks like there is a conflict 
> >> between two memory managers. I'm not having the same problem as I 
> >> disable the Open MPI memory manager on my builds (configure option 
> >> --without-memory-manager).
> >
> > Thanks for the tip! I confirm that the problem goes away after 
> > rebuilding --without-memory-manager.
> >
> > As I also have the same problem in another cluster, I'd like to know 
> > what side effects using this configuration option might have before 
> > suggesting it as a solution to that cluster's admin. I didn't find 
> > an explanation of what it does in the FAQ (beyond a recommendation 
> > to use it for static builds). Could you please explain this option, 
> > especially why one might want to *not* use it?
> 
> This is on my to-do list (add this to the FAQ); sorry it isn't done yet.
> 
> Here's a recent post where I've explained it a bit more:
> 
>  http://www.open-mpi.org/community/lists/users/2008/07/6161.php
> 
> Let me know if you'd like to know more.

Jeff,

Thanks for this explanation. According to what you wrote, 
--without-memory-manager can make my and other applications run 
significantly slower. While I can find out just how much for my app, I can 
hardly do that for other (unknown) users. So it would be nice if my heap 
profiling problem could be resolved in another way in the future. Is the 
planned mpi_leave_pinned change in v1.3 going to correct it?

Regards,
Jan Ploski

--
Dipl.-Inform. (FH) Jan Ploski
OFFIS
FuE Bereich Energie | R&D Division Energy
Escherweg 2  - 26121 Oldenburg - Germany
Phone/Fax: +49 441 9722 - 184 / 202
E-Mail: jan.plo...@offis.de
URL: http://www.offis.de


Re: [OMPI users] Heap profiling with OpenMPI

2008-08-07 Thread Jan Ploski
users-boun...@open-mpi.org wrote on 08/07/2008 09:27:39 AM:

> I can't speak for Jeff, but my understanding of the changes for 1.3
> should allow you to switch off the memory manager when running your
> checks.
> 
> It seems to me an obvious interim solution would be to have two versions
> of OpenMPI installed, one with and one without the memory manager.  Use
> one for debugging, and (if desired) the pinning-capable version for
> production.

This solution is good for our local cluster (where I am administrator), 
but it gets complicated for other Grid clusters where OpenMPI is installed 
by their respective admins from RPM and the Globus middleware has no idea 
about my private installations without hacking adapters. What I wanted to 
do specifically is to compare the variation in memory use of my app (WRF) 
in different clusters and with different MPI implementations to validate 
the prediction model I constructed in my local cluster.

Regards,
Jan Ploski


Re: [OMPI users] Buffer size limit and memory consumption problem on heterogeneous (32 bit / 64 bit) machines

2010-05-20 Thread Nysal Jan
This probably got fixed in https://svn.open-mpi.org/trac/ompi/ticket/2386
Can you try 1.4.2, the fix should be in there.

Regards
--Nysal


On Thu, May 20, 2010 at 2:02 PM, Olivier Riff wrote:

> Hello,
>
> I assume this question has already been discussed many times, but I cannot
> find a solution to my problem on the Internet.
> It is about the buffer size limit of MPI_Send and MPI_Recv on a heterogeneous
> system (32-bit laptop / 64-bit cluster).
> My configuration is:
> open mpi 1.4, configured with: --without-openib --enable-heterogeneous
> --enable-mpi-threads
> The program is launched on a laptop (32-bit Mandriva 2008) which distributes
> tasks to a cluster of 70 processors (64-bit RedHat Enterprise distribution):
> I have to send various buffer sizes, from a few bytes up to 30 MB.
>
> I tested the following commands:
> 1) mpirun -v -machinefile machinefile.txt MyMPIProgram
> -> crash on the client side (64-bit RedHat Enterprise) when the sent buffer
> size is > 65536.
> 2) mpirun --mca btl_tcp_eager_limit 3000 -v -machinefile
> machinefile.txt MyMPIProgram
> -> works, but has the effect of generating gigantic memory consumption on
> the 32-bit machine side after MPI_Recv. Memory consumption goes from 800 MB
> to 2.1 GB after receiving about 20 kB from each of the 70 clients (a total
> of about 1.4 MB). This makes my program crash later because I have no more
> memory to allocate new structures. I read in an Open MPI forum thread that
> setting btl_tcp_eager_limit to a huge value explains this huge memory
> consumption when a sent message does not have a preposted ready recv. Also,
> after all messages have been received and there is no more traffic activity,
> the memory consumed remains at 2.1 GB... and I do not understand why.
>
> What is the best way to get a working program that also has small memory
> consumption (the speed can be lower)?
> I tried to play with the mca parameters btl_tcp_sndbuf and btl_tcp_rcvbuf,
> but without success.
>
> Thanks in advance for you help.
>
> Best regards,
>
> Olivier
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
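
As an aside (an illustrative sketch only, not something proposed in this
thread, and the message size is an assumption): the unexpected-message
buildup described in the quoted message can be kept bounded at the
application level by pre-posting receives before the senders start, so that
incoming data lands directly in user buffers instead of being queued inside
the MPI library:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Rank 0 pre-posts one receive per peer before the peers send, so the
 * incoming data is delivered into user buffers rather than being held
 * as unexpected messages inside the MPI library. */
int main(int argc, char **argv)
{
    int rank, size, i;
    const int count = 5000;   /* roughly 20 kB of ints per peer (assumed) */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        int *bufs = malloc((size_t)(size - 1) * count * sizeof(int));
        MPI_Request *reqs = malloc((size_t)(size - 1) * sizeof(MPI_Request));
        for (i = 1; i < size; i++)
            MPI_Irecv(bufs + (size_t)(i - 1) * count, count, MPI_INT,
                      i, 0, MPI_COMM_WORLD, &reqs[i - 1]);
        MPI_Barrier(MPI_COMM_WORLD);        /* peers send only after this */
        MPI_Waitall(size - 1, reqs, MPI_STATUSES_IGNORE);
        free(reqs);
        free(bufs);
        printf("all receives completed\n");
    } else {
        int *data = calloc(count, sizeof(int));
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Send(data, count, MPI_INT, 0, 0, MPI_COMM_WORLD);
        free(data);
    }

    MPI_Finalize();
    return 0;
}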


Re: [OMPI users] Problem with compilation : statically linked applications

2010-06-14 Thread Nysal Jan
__cxa_get_exception_ptr should be defined in the libstdc++ shared library.

--Nysal

On Mon, Jun 14, 2010 at 5:51 AM, HeeJin Kim  wrote:

> Dear all,
>
> I had built openmpi-1.4.2 with:
> configure CC=icc CXX=icpc F77=ifort FC=ifort
> --prefix=/home/biduri/program/openmpi --enable-mpi-threads --enable-static
>
> And I have a problem during compilation of q-chem software which uses
> openmpi.
>
>
> /home/biduri/program/openmpi/lib/libopen-pal.a(dlopen.o): In function
> `vm_open':
> loaders/dlopen.c:(.text+0xad): warning: Using 'dlopen' in statically linked
> applications requires at runtime the shared libraries from the glibc version
> used for linking
> /home/biduri/qchem/par_qchem_op/ccman/ccman.a(properties.o): In function
> `CalcNonRelaxedTransDipole(Spin, int, int, Spin, int, int, OPDM&, OPDM&,
> double, int, signed char, _IO_FILE*, signed char, signed char)':
> properties.C:(.text+0x3df8): undefined reference to
> `__cxa_get_exception_ptr'
> /home/biduri/qchem/par_qchem_op/ccman/ccman.a(properties.o): In function
> `CalcSOCs(AlphaBetaMatr&, BlockTensor&, KMatrix&)':
> properties.C:(.text+0x52fc): undefined reference to
> `__cxa_get_exception_ptr'
> /home/biduri/qchem/par_qchem_op/ccman/ccman.a(ccsd_calc.o): In function
> `CCSD_Calc::CalculateT(BlockTensor&, BlockTensor&, MutableBlockTensor&,
> MutableBlockTensor&, signed char)':
> ccsd_calc.C:(.text+0x2957): undefined reference to
> `__cxa_get_exception_ptr'
> /home/biduri/qchem/par_qchem_op/ccman/ccman.a(ccsd_calc.o): In function
> `CCSD_Calc::CalcLambdaIntermed()':
> ccsd_calc.C:(.text+0x4409): undefined reference to
> `__cxa_get_exception_ptr'
> 
>
> Best,
> Heejin
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] lammps MD code fails with Open MPI 1.3

2009-02-20 Thread Nysal Jan
It could be the same bug reported here
http://www.open-mpi.org/community/lists/users/2009/02/8010.php

Can you try a recent snapshot of 1.3.1
(http://www.open-mpi.org/nightly/v1.3/) to verify if this has been fixed.

--Nysal

On Thu, 2009-02-19 at 16:09 -0600, Jeff Pummill wrote:
> I built a fresh version of lammps v29Jan09 against Open MPI 1.3 which
> in turn was built with Gnu compilers v4.2.4 on an Ubuntu 8.04 x86_64
> box. This Open MPI build was able to generate usable binaries such as
> XHPL and NPB, but the lammps binary it generated was not usable.
> 
> I tried it with a couple of different versions of the lammps source,
> but to no avail. No errors during the builds and a binary was created,
> but when executing the job it quickly exits with no messages other
> than:
> 
> jpummil@stealth:~$ mpirun -np 4 -hostfile
> hosts /home/jpummil/lmp_Stealth-OMPI < in.testbench_small
> LAMMPS (22 Jan 2008)
> 
> Interestingly, I downloaded Open MPI 1.2.8, built it with the same
> configure options I had used with 1.3, and it worked.
> 
> I'm getting by fine with 1.2.8. I just wanted to file a possible bug
> report on 1.3 and see if others have seen this behavior.
> 
> Cheers!
> 
> -- 
> Jeff F. Pummill
> Senior Linux Cluster Administrator
> TeraGrid Campus Champion - UofA
> University of Arkansas
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Problems in 1.3 loading shared libs when usingVampirServer

2009-02-25 Thread Nysal Jan
On Tue, 2009-02-24 at 13:30 -0500, Jeff Squyres wrote:
> - Get Python to give you the possibility of opening dependent  
> libraries in the global scope.  This may be somewhat controversial;  
> there are good reasons to open plugins in private scopes.  But I have  
> to imagine that OMPI is not the only python extension out there that  
> wants to open plugins of its own; other such projects should be  
> running into similar issues.
> 
Can you check if the following works:
import dl
import sys
# Add RTLD_GLOBAL to the dlopen flags Python uses for extension modules,
# so that libmpi's symbols are visible to the components Open MPI dlopens
# later on.
flags = sys.getdlopenflags()
sys.setdlopenflags(flags | dl.RTLD_GLOBAL)
import minimpi


--Nysal



[OMPI users] RETRY EXCEEDED ERROR

2009-03-04 Thread Jan Lindheim
I found several reports on the Open MPI users mailing list from users
who needed to bump up the default value for btl_openib_ib_timeout.
We also have some applications on our cluster that have problems
unless we raise this value from the default of 10 to 15:

[24426,1],122][btl_openib_component.c:2905:handle_wc] from shc174 to: shc175
error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for 
wr_id 250450816 opcode 11048 qp_idx 3

This is seen with OpenMPI 1.3 and OpenFabrics 1.4.

Is this normal or is it an indicator of other problems, maybe related to
hardware?
Are there other parameters that need to be looked at too?

Thanks for any insight on this!

Regards,
Jan Lindheim


Re: [OMPI users] RETRY EXCEEDED ERROR

2009-03-04 Thread Jan Lindheim
On Wed, Mar 04, 2009 at 04:02:06PM -0500, Jeff Squyres wrote:
> This *usually* indicates a physical / layer 0 problem in your IB  
> fabric.  You should do a diagnostic on your HCAs, cables, and switches.
> 
> Increasing the timeout value should only be necessary on very large IB  
> fabrics and/or very congested networks.

Thanks Jeff!
What is considered to be very large IB fabrics?
I assume that with just over 180 compute nodes,
our cluster does not fall into this category.

Jan

> 
> 
> On Mar 4, 2009, at 3:28 PM, Jan Lindheim wrote:
> 
> >I found several reports on the openmpi users mailing list from users,
> >who need to bump up the default value for btl_openib_ib_timeout.
> >We also have some applications on our cluster, that have problems,
> >unless we set this value from the default 10 to 15:
> >
> >[24426,1],122][btl_openib_component.c:2905:handle_wc] from shc174  
> >to: shc175
> >error polling LP CQ with status RETRY EXCEEDED ERROR status number  
> >12 for
> >wr_id 250450816 opcode 11048 qp_idx 3
> >
> >This is seen with OpenMPI 1.3 and OpenFabrics 1.4.
> >
> >Is this normal or is it an indicator of other problems, maybe  
> >related to
> >hardware?
> >Are there other parameters that need to be looked at too?
> >
> >Thanks for any insight on this!
> >
> >Regards,
> >Jan Lindheim
> >___
> >users mailing list
> >us...@open-mpi.org
> >http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> -- 
> Jeff Squyres
> Cisco Systems
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 



Re: [OMPI users] RETRY EXCEEDED ERROR

2009-03-04 Thread Jan Lindheim
On Wed, Mar 04, 2009 at 04:34:49PM -0500, Jeff Squyres wrote:
> On Mar 4, 2009, at 4:16 PM, Jan Lindheim wrote:
> 
> >On Wed, Mar 04, 2009 at 04:02:06PM -0500, Jeff Squyres wrote:
> >> This *usually* indicates a physical / layer 0 problem in your IB
> >> fabric.  You should do a diagnostic on your HCAs, cables, and  
> >switches.
> >>
> >> Increasing the timeout value should only be necessary on very  
> >large IB
> >> fabrics and/or very congested networks.
> >
> >Thanks Jeff!
> >What is considered to be very large IB fabrics?
> >I assume that with just over 180 compute nodes,
> >our cluster does not fall into this category.
> >
> 
> I was a little misleading in my note -- I should clarify.  It's really  
> congestion that matters, not the size of the fabric.  Congestion is  
> potentially more likely to happen in larger fabrics, since packets may  
> have to flow through more switches, there's likely more apps running  
> on the cluster, etc.  But it's all very application/cluster-specific;  
> only you can know if your fabric is heavily congested based on what  
> you run on it, etc.
> 
> -- 
> Jeff Squyres
> Cisco Systems
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 

Thanks again Jeff!
Time to dig up diagnostics tools and look at port statistics.

Jan


Re: [OMPI users] RETRY EXCEEDED ERROR

2009-03-05 Thread Jan Lindheim
On Thu, Mar 05, 2009 at 10:27:27AM +0200, Pavel Shamis (Pasha) wrote:
> 
> >Time to dig up diagnostics tools and look at port statistics.
> >  
> You may use ibdiagnet tool for the network debug - 
> *http://linux.die.net/man/1/ibdiagnet. *This tool is part of OFED.
> 
> Pasha.
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 

Thanks Pasha!
ibdiagnet reports the following:

-I---
-I- IPoIB Subnets Check
-I---
-I- Subnet: IPv4 PKey:0x7fff QKey:0x0b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Port localhost/P1 lid=0x00e2 guid=0x001e0b4ced75 dev=25218 can not join
due to rate:2.5Gbps < group:10Gbps

I guess this may indicate a bad adapter.  Now, I just need to find what
system this maps to.

I also ran ibcheckerrors and it reports a lot of problems with buffer
overruns.  Here's the tail end of the output, with only some of the last
ports reported:

#warn: counter SymbolErrors = 36905 (threshold 10) lid 193 port 14
#warn: counter LinkDowned = 23  (threshold 10) lid 193 port 14
#warn: counter RcvErrors = 15641(threshold 10) lid 193 port 14
#warn: counter RcvSwRelayErrors = 225   (threshold 100) lid 193 port 14
#warn: counter ExcBufOverrunErrors = 10 (threshold 10) lid 193 port 14
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 14:  FAILED 
#warn: counter LinkRecovers = 181   (threshold 10) lid 193 port 1
#warn: counter RcvSwRelayErrors = 2417  (threshold 100) lid 193 port 1
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 1:  FAILED 
#warn: counter LinkRecovers = 103   (threshold 10) lid 193 port 3
#warn: counter RcvErrors = 9035 (threshold 10) lid 193 port 3
#warn: counter RcvSwRelayErrors = 64670 (threshold 100) lid 193 port 3
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 3:  FAILED 
#warn: counter SymbolErrors = 13151 (threshold 10) lid 193 port 4
#warn: counter RcvErrors = 109  (threshold 10) lid 193 port 4
#warn: counter RcvSwRelayErrors = 507   (threshold 100) lid 193 port 4
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 4:  FAILED 

## Summary: 209 nodes checked, 0 bad nodes found
##  716 ports checked, 103 ports have errors beyond threshold


I wonder if this is something that needs to be tuned in the Infiniband
switch or if there is something in OpenMPI/OpenIB that can be tuned.

Thanks,
Jan Lindheim


Re: [OMPI users] selected pml cm, but peer [[2469, 1], 0] on compute-0-0 selected pml ob1

2009-03-19 Thread Nysal Jan
fs1 is selecting the "cm" PML whereas other nodes are selecting the
"ob1" PML component. You can force ob1 to be used via "--mca pml ob1".

What kind of hardware/NIC does fs1 have?

--Nysal

On Wed, 2009-03-18 at 17:17 -0400, Gary Draving wrote:
> Hi all,
> 
> Has anyone ever seen an error like this? It seems like I have some setting 
> wrong in openmpi.  I thought I had it set up like the other machines, but it 
> seems as though I have missed something. I only get the error when 
> adding machine "fs1" to the hostfile list.  The other 40+ machines seem 
> fine.
> 
> [fs1.calvin.edu:01750] [[2469,1],6] selected pml cm, but peer 
> [[2469,1],0] on compute-0-0 selected pml ob1
> 
> When I use ompi_info the output looks like my other machines:
> 
> [root@fs1 openmpi-1.3]# ompi_info | grep btl
>  MCA btl: ofud (MCA v2.0, API v2.0, Component v1.3)
>  MCA btl: openib (MCA v2.0, API v2.0, Component v1.3)
>  MCA btl: self (MCA v2.0, API v2.0, Component v1.3)
>  MCA btl: sm (MCA v2.0, API v2.0, Component v1.3)
> 
> The whole error is below, any help would be greatly appreciated.
> 
> Gary
> 
> [admin@dahl 00.greetings]$ /usr/local/bin/mpirun --mca btl ^tcp 
> --hostfile machines -np 7 greetings
> [fs1.calvin.edu:01959] [[2212,1],6] selected pml cm, but peer 
> [[2212,1],0] on compute-0-0 selected pml ob1
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
> 
>   PML add procs failed
>   --> Returned "Unreachable" (-12) instead of "Success" (0)
> --
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [fs1.calvin.edu:1959] Abort before MPI_INIT completed successfully; not 
> able to guarantee that all other processes were killed!
> --
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
> 
>   Process 1 ([[2212,1],3]) is on host: dahl.calvin.edu
>   Process 2 ([[2212,1],0]) is on host: compute-0-0
>   BTLs attempted: openib self sm
> 
> Your MPI job is now going to abort; sorry.
> --
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [dahl.calvin.edu:16884] Abort before MPI_INIT completed successfully; 
> not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [compute-0-0.local:1591] Abort before MPI_INIT completed successfully; 
> not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [fs2.calvin.edu:8826] Abort before MPI_INIT completed successfully; not 
> able to guarantee that all other processes were killed!
> --
> mpirun has exited due to process rank 3 with PID 16884 on
> node dahl.calvin.edu exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --
> [dahl.calvin.edu:16879] 3 more processes have sent help message 
> help-mpi-runtime / mpi_init:startup:internal-failure
> [dahl.calvin.edu:16879] Set MCA parameter "orte_base_help_aggregate" to 
> 0 to see all help / error messages
> [dahl.calvin.edu:16879] 2 more processes have sent help message 
> help-mca-bml-r2.txt / unreachable proc
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] XLF and 1.3.1

2009-04-14 Thread Nysal Jan
Can you try adding --disable-dlopen to the configure command line

--Nysal

On Tue, 2009-04-14 at 10:19 +0200, Jean-Michel Beuken wrote:
> Hello,
> 
> I'm trying to build 1.3.1 under  IBM Power5 + SLES 9.1 + XLF 9.1...
> 
> after some searches on FAQ and Google, my configure :
> 
> export CC="/opt/ibmcmp/vac/7.0/bin/xlc"
> export CXX="/opt/ibmcmp/vacpp/7.0/bin/xlc++"
> export CFLAGS="-O2 -q64 -qmaxmem=-1"
> #
> export F77="/opt/ibmcmp/xlf/9.1/bin/xlf"
> export FFLAGS="-O2 -q64 -qmaxmem=-1"
> export FC="/opt/ibmcmp/xlf/9.1/bin/xlf90"
> export FCFLAGS="-O2 -q64 -qmaxmem=-1"
> #
> export LDFLAGS="-q64"
> #
> ./configure --prefix=/usr/local/openmpi_1.3.1 \
>--disable-ipv6 \
>--enable-mpi-f77 --enable-mpi-f90 \
>--disable-mpi-profile \
>--without-xgrid \
>--enable-static --disable-shared \
>--disable-heterogeneous \
>--enable-contrib-no-build=libnbc,vt \
>--enable-mca-no-build=maffinity,btl-portals \
>--disable-mpi-cxx --disable-mpi-cxx-seek
> 
> 
> 
> there is a problem of "multiple definition"...
> 
> any advices ?
> 
> thanks
> 
> jmb
> 
> --
> make[2]: Entering directory 
> `/usr/local/src/openmpi-1.3.1/opal/tools/wrappers'
> /bin/sh ../../../libtool --tag=CC   --mode=link 
> /opt/ibmcmp/vac/7.0/bin/xlc  -DNDEBUG -O2 -q64 -qmaxmem=-1   
> -export-dynamic -q64  -o opal_wrapper opal_wrapper.o 
> ../../../opal/libopen-pal.la -lnsl -lutil  -lpthread
> libtool: link: /opt/ibmcmp/vac/7.0/bin/xlc -DNDEBUG -O2 -q64 -qmaxmem=-1 
> -q64 -o opal_wrapper opal_wrapper.o -Wl,--export-dynamic  
> ../../../opal/.libs/libopen-pal.a -ldl -lnsl -lutil -lpthread
> ../../../opal/.libs/libopen-pal.a(libltdlc_la-lt__alloc.o)(.opd+0x18): 
> In function `argz_next':
> : multiple definition of `argz_next'
> ../../../opal/.libs/libopen-pal.a(libltdlc_la-ltdl.o)(.opd+0x528): first 
> defined here
> ../../../opal/.libs/libopen-pal.a(libltdlc_la-lt__alloc.o)(.text+0x60): 
> In function `.argz_next':
> : multiple definition of `.argz_next'
> ../../../opal/.libs/libopen-pal.a(libltdlc_la-ltdl.o)(.text+0x4760): 
> first defined here
> ../../../opal/.libs/libopen-pal.a(libltdlc_la-lt__alloc.o)(.opd+0x30): 
> In function `__argz_next':
> : multiple definition of `__argz_next'
> ../../../opal/.libs/libopen-pal.a(libltdlc_la-ltdl.o)(.opd+0x540): first 
> defined here
> ../../../opal/.libs/libopen-pal.a(libltdlc_la-lt__alloc.o)(.text+0x80): 
> In function `.__argz_next':
> : multiple definition of `.__argz_next'
> ../../../opal/.libs/libopen-pal.a(libltdlc_la-ltdl.o)(.text+0x4780): 
> first defined here
> ../../../opal/.libs/libopen-pal.a(libltdlc_la-preopen.o)(.opd+0x108): In 
> function `argz_next':
> : multiple definition of `argz_next'
> ../../../opal/.libs/libopen-pal.a(libltdlc_la-ltdl.o)(.opd+0x528): first 
> defined here
> ../../../opal/.libs/libopen-pal.a(libltdlc_la-preopen.o)(.text+0x860): 
> In function `.argz_next':
> : multiple definition of `.argz_next'
> ../../../opal/.libs/libopen-pal.a(libltdlc_la-ltdl.o)(.text+0x4760): 
> first defined here
> /usr/bin/ld: Warning: size of symbol `.argz_next' changed from 20 in 
> ../../../opal/.libs/libopen-pal.a(libltdlc_la-preopen.o) to 60 in 
> ../../../opal/.libs/libopen-pal.a(libltdlc_la-preopen.o)
> ../../../opal/.libs/libopen-pal.a(libltdlc_la-preopen.o)(.opd+0x120): In 
> function `__argz_next':
> : multiple definition of `__argz_next'
> ../../../opal/.libs/libopen-pal.a(libltdlc_la-ltdl.o)(.opd+0x540): first 
> defined here
> ../../../opal/.libs/libopen-pal.a(libltdlc_la-preopen.o)(.text+0x8a0): 
> In function `.__argz_next':
> : multiple definition of `.__argz_next'
> ../../../opal/.libs/libopen-pal.a(libltdlc_la-ltdl.o)(.text+0x4780): 
> first defined here
> ../../../opal/.libs/libopen-pal.a(dlopen.o)(.opd+0x78): In function 
> `argz_next':
> : multiple definition of `argz_next'
> ../../../opal/.libs/libopen-pal.a(libltdlc_la-ltdl.o)(.opd+0x528): first 
> defined here
> ../../../opal/.libs/libopen-pal.a(dlopen.o)(.text+0x240): In function 
> `.argz_next':
> : multiple definition of `.argz_next'
> ../../../opal/.libs/libopen-pal.a(libltdlc_la-ltdl.o)(.text+0x4760): 
> first defined here
> ../../../opal/.libs/libopen-pal.a(dlopen.o)(.opd+0x90): In function 
> `__argz_next':
> : multiple definition of `__argz_next'
> ../../../opal/.libs/libopen-pal.a(libltdlc_la-ltdl.o)(.opd+0x540): first 
> defined here
> ../../../opal/.libs/libopen-pal.a(dlopen.o)(.text+0x280): In function 
> `.__argz_next':
> : multiple definition of `.__argz_next'
> ../../../opal/.libs/libopen-pal.a(libltdlc_la-ltdl.o)(.text+0x4780): 
> first defined here
> ../../../opal/.libs/libopen-pal.a(libltdlc_la-lt_error.o)(.opd+0x78): In 
> function `argz_next':
> : multiple definition of `argz_next'
> ../../../opal/.libs/libopen-pal.a(libltdlc_la-ltdl.o)(.opd+0x528): first 
> defined here
> ../../../opal/.libs/libopen-pal.a(libltdlc_la-lt_error.o)(.text+0x260): 
> In function `.argz_next':
> : multiple defin

Re: [OMPI users] libnuma issue

2009-04-15 Thread Nysal Jan
You could try statically linking the Intel-provided libraries. Use
LDFLAGS=-static-intel 

--Nysal

On Wed, 2009-04-15 at 21:03 +0200, Francesco Pietra wrote:
> On Wed, Apr 15, 2009 at 8:39 PM, Prentice Bisbal  wrote:
> > Francesco Pietra wrote:
> >> I used --with-libnuma=/usr since Prentice Bisbal's suggestion and it
> >> worked. Unfortunately, I found no way to fix the failure in finding
> >> libimf.so when compiling openmpi-1.3.1 with intels, as you have seen
> >> in other e-mail from me. And gnu compilers (which work well with both
> >> openmpi and the slower code of my application) are defeated by the
> >> faster code of my application. With limited hardware resources, I must
> >> rely on that 40% speeding up.
> >>
> >
> > To fix the libimf.so problem you need to include the path to Intel's
> > libimf.so in your LD_LIBRARY_PATH environment variable. On my system, I
> > installed v11.074 of the Intel compilers in /usr/local/intel, so my
> > libimf.so file is located here:
> >
> > /usr/local/intel/Compiler/11.0/074/lib/intel64/libimf.so
> >
> > So I just add that to my LD_LIBRARY_PATH:
> >
> > LD_LIBRARY_PATH=/usr/local/intel/Compiler/11.0/074/lib/intel64:$LD_LIBRARY_PATH
> > export LD_LIBRARY_PATH
> 
> Just a clarification: With my system I use the latest intels version
> 10, 10.1.2.024, and mkl 10.1.2.024 because it proved difficult to make
> a debian package with version 11. At
> 
> echo $LD_LIBRARY_PATH
> 
> /opt/intel/mkl/10.1.2.024/lib/em64t:/opt/intel/cce/10.1.022/lib:opt/intel/fce/10.1.022/lib:/usr/local/lib
> 
> (that /lib contains libimf.so)
> 
> That results from sourcing in my .bashrc:
> 
> . /opt/intel/fce/10.1.022/bin/ifortvars.sh
> . /opt/intel/cce/10.1.022/bin/iccvars.sh
> 
>  Did you suppress that sourcing before exporting the LD_LIBRARY_PATH to
> the library at issue? Having turned the problem around so much, it is
> not unlikely that I am confusing myself.
> 
> thanks
> francesco
> 
> 
> >
> > Now I can run whatever programs need libimf.so without any problems. In
> > your case, you'll want to that before your make command.
> >
> > Here's exactly what I use to compile OpenMPI with the Intel Compilers:
> >
> > export PATH=/usr/local/intel/Compiler/11.0/074/bin/intel64:$PATH
> >
> > export
> > LD_LIBRARY_PATH=/usr/local/intel/Compiler/11.0/074/lib/intel64:$LD_LIBRARY_PATH
> >
> > ../configure CC=icc CXX=icpc F77=ifort FC=ifort
> > --prefix=/usr/local/openmpi-1.2.8/intel-11/x86_64 --disable-ipv6
> > --with-sge --with-openib --enable-static
> >
> > --
> > Prentice
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] openmpi-1.1a9r10157 Fails to build with Nag f95

2006-06-02 Thread Jan De Laet

Hi,

openmpi-1.1a9r10157's fortran bindings also fail to build with the 
Pathscale 2.1 pathf90 compiler. At the same spot but with different 
error messages (see below), which perhaps helps to clarify things. Any 
help greatly appreciated as well.


Best regards,

Jan De Laet

=
make[4]: Leaving directory 
`/data/home/jan/openmpi-1.1a9r10157/ompi/mpi/f90/scripts'
make[4]: Entering directory 
`/data/home/jan/openmpi-1.1a9r10157/ompi/mpi/f90'

***
* Compiling the mpi.f90 file may take a few minutes.
* This is quite normal -- do not be alarmed if the compile
* process seems to 'hang' at this point for several minutes.
***
pathf90 -I../../../ompi/include -I../../../ompi/include -I. -I. -O3 
-march=opteron -fPIC -DPIC  -c -I. -o mpi.o  mpi.f90


module mpi
  ^
pathf90-855 pathf90: ERROR MPI, File = mpi.f90, Line = 20, Column = 8
 The compiler has detected errors in module "MPI".  No module 
information file will be created for this module.


 subroutine mpi_type_null_delete_fn( type, type_keyval, 
attribute_val_out, &

^
pathf90-608 pathf90: ERROR MPI, File = attr_fn-f90-interfaces.h, Line = 
21, Column = 14
 "MPI_TYPE_NULL_DELETE_FN" has the EXTERNAL attribute, therefore it 
must not be a procedure in an interface block.


 subroutine mpi_type_null_copy_fn( type, type_keyval, extra_state, &
^
pathf90-608 pathf90: ERROR MPI, File = attr_fn-f90-interfaces.h, Line = 
26, Column = 14
 "MPI_TYPE_NULL_COPY_FN" has the EXTERNAL attribute, therefore it must 
not be a procedure in an interface block.


 subroutine mpi_type_dup_fn( type, type_keyval, extra_state, &
^
pathf90-608 pathf90: ERROR MPI, File = attr_fn-f90-interfaces.h, Line = 
34, Column = 14
 "MPI_TYPE_DUP_FN" has the EXTERNAL attribute, therefore it must not be 
a procedure in an interface block.


 subroutine mpi_comm_null_delete_fn(comm, comm_keyval, attribute_val_out, &
^
pathf90-608 pathf90: ERROR MPI, File = attr_fn-f90-interfaces.h, Line = 
42, Column = 14
 "MPI_COMM_NULL_DELETE_FN" has the EXTERNAL attribute, therefore it 
must not be a procedure in an interface block.


 subroutine mpi_comm_null_copy_fn( comm, comm_keyval, extra_state, &
^
pathf90-608 pathf90: ERROR MPI, File = attr_fn-f90-interfaces.h, Line = 
47, Column = 14
 "MPI_COMM_NULL_COPY_FN" has the EXTERNAL attribute, therefore it must 
not be a procedure in an interface block.


 subroutine mpi_comm_dup_fn( comm, comm_keyval, extra_state, &
^
pathf90-608 pathf90: ERROR MPI, File = attr_fn-f90-interfaces.h, Line = 
55, Column = 14
 "MPI_COMM_DUP_FN" has the EXTERNAL attribute, therefore it must not be 
a procedure in an interface block.


 subroutine mpi_null_delete_fn( comm, comm_keyval, attribute_val_out, &
^
pathf90-608 pathf90: ERROR MPI, File = attr_fn-f90-interfaces.h, Line = 
63, Column = 14
 "MPI_NULL_DELETE_FN" has the EXTERNAL attribute, therefore it must not 
be a procedure in an interface block.


 subroutine mpi_null_copy_fn( comm, comm_keyval, extra_state, &
^
pathf90-608 pathf90: ERROR MPI, File = attr_fn-f90-interfaces.h, Line = 
68, Column = 14
 "MPI_NULL_COPY_FN" has the EXTERNAL attribute, therefore it must not 
be a procedure in an interface block.


 subroutine mpi_dup_fn( comm, comm_keyval, extra_state, &
^
pathf90-608 pathf90: ERROR MPI, File = attr_fn-f90-interfaces.h, Line = 
76, Column = 14
 "MPI_DUP_FN" has the EXTERNAL attribute, therefore it must not be a 
procedure in an interface block.


 subroutine mpi_win_null_delete_fn( window, win_keyval, 
attribute_val_out, &

^
pathf90-608 pathf90: ERROR MPI, File = attr_fn-f90-interfaces.h, Line = 
84, Column = 14
 "MPI_WIN_NULL_DELETE_FN" has the EXTERNAL attribute, therefore it must 
not be a procedure in an interface block.


 subroutine mpi_win_null_copy_fn( window, win_keyval, extra_state, &
^
pathf90-608 pathf90: ERROR MPI, File = attr_fn-f90-interfaces.h, Line = 
89, Column = 14
 "MPI_WIN_NULL_COPY_FN" has the EXTERNAL attribute, therefore it must 
not be a procedure in an interface block.


 subroutine mpi_win_dup_fn( window, win_keyval, extra_state, &
^
pathf90-608 pathf90: ERROR MPI, File = attr_fn-f90-interfaces.h, Line = 
97, Column = 14
 "MPI_WIN_DUP_FN" has the EXTERNAL attribute, therefore it must not be 
a procedure in an interface block.


pathf90: PathScale(TM) Fortran 90 Version 2.1 (f14) Fri Jun  2, 2006  
09:43:24




=

Pathscale's error message pathf90-608:

Error : "%s" has the %s attribute, therefore it must not be a procedure 
in an

interf

Re: [OMPI users] openmpi-1.1a9r10157 Fails to build with Nag, f95Compiler // and Pathscale

2006-06-02 Thread Jan De Laet

Jeff,
Ok, this solved the problem with the Pathscale compiler.
Thanks
  -- Jan



Message: 2
Date: Thu, 1 Jun 2006 17:37:36 -0400
From: "Jeff Squyres \(jsquyres\)" 
Subject: Re: [OMPI users] openmpi-1.1a9r10157 Fails to build with Nag
f95Compiler
To: "Open MPI Users" 
Message-ID:

Content-Type: text/plain;   charset="us-ascii"

Greetings.

This was actually reported earlier today (off list).  It was the result
of a botched merge from the trunk to the v1.1 branch.  I have fixed the
issue as of r10171 (it was a one-line mistake); the fix should show up
in the snapshot tarballs tonight. 
 







Re: [OMPI users] IBM Spectrum MPI problem

2017-05-19 Thread Nysal Jan K A
Hi Gabriele,
You can check some of the available options here -
https://www.ibm.com/support/knowledgecenter/en/SSZTET_10.1.0/smpi02/smpi02_interconnect.html
The "-pami_noib" option might be of help in this scenario. Alternatively,
on a single node, the vader BTL can also be used.
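As a rough sketch (option spellings assumed from the linked documentation and standard Open MPI MCA syntax; ./a.out stands in for the real application), the two alternatives would look something like:

  mpirun -pami_noib -np 4 ./a.out
  mpirun --mca pml ob1 --mca btl vader,self -np 4 ./a.out

The second form skips PAMI entirely and runs over the shared-memory (vader) BTL, which is only suitable on a single node.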

Regards
--Nysal

On Fri, May 19, 2017 at 12:52 PM, Gabriele Fatigati 
wrote:

> Hi Gilles,
>
> using your command with one MPI proc I get:
>
> findActiveDevices Error
> We found no active IB device ports
> Hello world from rank 0  out of 1 processors
>
> So it seems to work, apart from the error message.
>
>
> 2017-05-19 9:10 GMT+02:00 Gilles Gouaillardet :
>
>> Gabriele,
>>
>>
>> So it seems pml/pami assumes there is an InfiniBand card available (!)
>>
>> I guess IBM folks will comment on that shortly.
>>
>>
>> meanwhile, you do not need pami since you are running on a single node
>>
>> mpirun --mca pml ^pami ...
>>
>> should do the trick
>>
>> (if it does not work, can you run and post the logs?)
>>
>> mpirun --mca pml ^pami --mca pml_base_verbose 100 ...
>>
>>
>> Cheers,
>>
>>
>> Gilles
>>
>>
>> On 5/19/2017 4:01 PM, Gabriele Fatigati wrote:
>>
>>> Hi John,
>>> Infiniband is not used, there is a single node on this machine.
>>>
>>> 2017-05-19 8:50 GMT+02:00 John Hearns via users <
>>> users@lists.open-mpi.org >:
>>>
>>> Gabriele, please run 'ibv_devinfo'
>>> It looks to me like you may have the physical interface cards in
>>> these systems, but you do not have the correct drivers or
>>> libraries loaded.
>>>
>>> I have had similar messages when using Infiniband on x86 systems -
>>> which did not have libibverbs installed.
>>>
>>>
>>> On 19 May 2017 at 08:41, Gabriele Fatigati wrote:
>>>
>>> Hi Gilles, using your command:
>>>
>>> [openpower:88536] mca: base: components_register: registering
>>> framework pml components
>>> [openpower:88536] mca: base: components_register: found loaded
>>> component pami
>>> [openpower:88536] mca: base: components_register: component
>>> pami register function successful
>>> [openpower:88536] mca: base: components_open: opening pml
>>> components
>>> [openpower:88536] mca: base: components_open: found loaded
>>> component pami
>>> [openpower:88536] mca: base: components_open: component pami
>>> open function successful
>>> [openpower:88536] select: initializing pml component pami
>>> findActiveDevices Error
>>> We found no active IB device ports
>>> [openpower:88536] select: init returned failure for component
>>> pami
>>> [openpower:88536] PML pami cannot be selected
>>> 
>>> --
>>> No components were able to be opened in the pml framework.
>>>
>>> This typically means that either no components of this type were
>>> installed, or none of the installed componnets can be loaded.
>>> Sometimes this means that shared libraries required by these
>>> components are unable to be found/loaded.
>>>
>>>   Host:  openpower
>>>   Framework: pml
>>> 
>>> --
>>>
>>>
>>> 2017-05-19 7:03 GMT+02:00 Gilles Gouaillardet :
>>>
>>> Gabriele,
>>>
>>>
>>> pml/pami is here, at least according to ompi_info
>>>
>>>
>>> can you update your mpirun command like this
>>>
>>> mpirun --mca pml_base_verbose 100 ..
>>>
>>>
>>> and post the output ?
>>>
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 5/18/2017 10:41 PM, Gabriele Fatigati wrote:
>>>
>>> Hi Gilles, attached the requested info
>>>
>>> 2017-05-18 15:04 GMT+02:00 Gilles Gouaillardet :
>>>
>>> Gabriele,
>>>
>>> can you
>>> ompi_info --all | grep pml
>>>
>>> also, make sure there is nothing in your
>>> environment pointing to
>>> an other Open MPI install
>>> for example
>>> ldd a.out
>>> should only point to IBM libraries
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>>
>>> On Thursday, May 18, 2017, Gabriele Fatigati wrote:
>>>
>>> Dear OpenMPI users and developer

Re: [OMPI users] [EXTERNAL] Re: Errors on POWER8 Ubuntu 14.04u2

2015-03-30 Thread Nysal Jan K A
If this is POWER8 in LE mode, it's most likely a libtool issue. You need
libtool >= 2.4.3, which has the LE patches, and you need to run autogen.pl
again. I have an issue open for this -
https://github.com/open-mpi/ompi/issues/396
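A rough sketch of that check-and-rebuild sequence (paths and prefix are placeholders; adjust for your tree):

  $ libtool --version        # should report 2.4.3 or newer
  $ cd openmpi-1.8.4
  $ ./autogen.pl             # regenerate the build system against the newer libtool
  $ ./configure --prefix=/your/prefix --enable-mpi-thread-multiple && make && make install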

Regards
--Nysal

On Sat, Mar 28, 2015 at 12:41 AM, Hammond, Simon David (-EXP) <
sdha...@sandia.gov> wrote:

> Thanks guys,
>
> I have tried two configure lines:
>
> (1) ./configure
> --prefix=/home/projects/power8/openmpi/1.8.4/gnu/4.8.2/cuda/none
> --enable-mpi-thread-multiple CC=/usr/bin/gcc CXX=/usr/bin/g++
> FC=/usr/bin/gfortran
>
> (2) ./configure
> --prefix=/home/projects/power8/openmpi/1.8.4/gnu/4.8.2/cuda/none
> --enable-mpi-thread-multiple CC=/usr/bin/gcc CXX=/usr/bin/g++
> FC=/usr/bin/gfortran --enable-shared --disable-static
>
> The second was just to try and force the generation of shared libraries (I
> notice they are not in
> /home/projects/power8/openmpi/1.8.4/gnu/4.8.2/cuda/none/lib).
>
> I also attached the config.log from (2) bzip2'd as requested on the help
> page.
>
> Thanks for all of your help,
>
>
> S.
>
>
> --
> Simon Hammond
> Center for Computing Research (Scalable Computer Architectures)
> Sandia National Laboratories, NM
> [Sent from remote connection, please excuse typing errors]
>
> 
> From: users  on behalf of Jeff Squyres
> (jsquyres) 
> Sent: Friday, March 27, 2015 11:15 AM
> To: Open MPI User's List
> Subject: [EXTERNAL] Re: [OMPI users] Errors on POWER8 Ubuntu 14.04u2
>
> It might be helpful to send all the information listed here:
>
> http://www.open-mpi.org/community/help/
>
>
> > On Mar 26, 2015, at 10:55 PM, Ralph Castain 
> wrote:
> >
> > Could you please send us your configure line?
> >
> >> On Mar 26, 2015, at 4:47 PM, Hammond, Simon David (-EXP) <
> sdha...@sandia.gov> wrote:
> >>
> >> Hi everyone,
> >>
> >> We are trying to compile custom installs of OpenMPI 1.8.4 on our POWER8
> Ubuntu system. We can configure and build correctly but when running
> ompi_info we see many errors like those listed below. It appears that all
> of the libraries in the ./lib are static (.a) files. It appears that we are
> unable to get our IB system working as a result.
> >>
> >> Can you recommend what we should be doing to ensure this works
> correctly?
> >>
> >> [node11:104711] mca: base: component_find: unable to open
> /home/projects/power8/openmpi/1.8.4/gnu/4.8.2/cuda/none/lib/openmpi/mca_compress_bzip:
> lt_dlerror() returned NULL! (ignored)
> >>
> >> Thanks for your help,
> >>
> >>
> >> --
> >> Simon Hammond
> >> Center for Computing Research (Scalable Computer Architectures)
> >> Sandia National Laboratories, NM
> >> [Sent from remote connection, please excuse typing errors]
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>


[OMPI users] MPI_Init() "num local peers failed" - bug?

2023-05-16 Thread Jan Florian Wagner via users
Hi all,

there seems to be a host-order-dependent timing issue. The issue occurs
when a set of processes is placed on the same node; mpirun then exits the job
at MPI_Init() with:

  num local peers failed
  --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS

Non-MPI applications launch just fine, such as:

$ mpirun -np 36 --hostfile job_4.machines --mca rmaps seq --map-by node
--bind-to none --mca btl_openib_allow_ib 1  /usr/bin/hostname

The error with "num local peers failed" happens with already a simple MPI
program that simply invokes MPI_Init(), e.g.,

$ mpirun -np 36 --hostfile job_4.machines --mca rmaps seq --map-by node
--bind-to none --mca btl_openib_allow_ib 1  /cluster/testing/mpihelloworld

  num local peers failed
  --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS

There is zero documentation for this error and I don't know how to work
around it. The error occurs in OpenMPI 4.1.1 and 4.0.4; we have not
tried other versions yet.
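For reference, a minimal reproducer along those lines might look like the sketch below (the actual /cluster/testing/mpihelloworld source may differ; this one uses the Fortran mpi module):

  program mpihelloworld
    use mpi
    implicit none
    integer :: ierr, rank, nprocs
    call MPI_Init(ierr)                              ! the reported failure happens here
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
    print *, 'Hello from rank', rank, 'of', nprocs
    call MPI_Finalize(ierr)
  end program mpihelloworld

Built with, e.g., mpifort mpihelloworld.f90 -o mpihelloworld.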

Curiously, if the hostfile is sorted first, MPI_Init() will always succeed,
e.g.,

$ sort job_4.machines  > job_4.machines-sorted
$ mpirun -np 36 --hostfile job_4.machines-sorted --mca rmaps seq --map-by
node --bind-to none --mca btl_openib_allow_ib 1
 /cluster/difx/DiFX-trunk_64/bin/mpihelloworld
 (--> success)

My guess is that when two instances of the same node are in the hostfile, and
the instances are too far apart (too many other nodes listed in between),
then MPI_Init() of one of the instances might be checking much too soon for
the other instance?

Alas, we have a heterogeneous cluster where rank-to-node mapping is critical.

Does OpenMPI 4.1 have any "grace time" parameter or similar, which would
allow processes to wait a bit longer for the expected other instance(s) to
eventually come up on the same node?

many thanks,
Jan