Re: [OMPI users] openmpi / mpirun problem on aix: poll failed with errno=25, opal_event_loop: ompi_evesel->dispatch() failed.

2007-01-09 Thread Michael Marti

Thanks Jeff for the hint.

Unfortunately, neither openmpi-1.2b3r12956 nor openmpi-1.2b2 compiles
on aix-5.3/power5, so I was not able to check whether the poll
issue is gone in these versions. Both (beta2 and beta3) fail for the
same reason:


"pls_poe_module.c", line 640.2: 1506-204 (S) Unexpected end of file.
make: 1254-004 The error code from the last command is 1.

I presume there is a missing bracket or similar, probably inside some
#ifdef. As soon as I have a little more time I will have a look into
it; any suggestions as to where to start are welcome...
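
For what it's worth, the pattern that usually produces that message looks
something like the sketch below (illustrative only -- the macro and function
names are invented, not taken from pls_poe_module.c): if a closing brace or
#endif sits inside a conditional branch that is skipped on a given platform,
the braces never balance and the compiler reports the error at the last line
of the file instead of at the real culprit. The correct shape keeps the brace
outside the conditional:

/* Illustration only -- HAVE_SOME_FEATURE and the functions here are
 * invented names, not the real pls_poe_module.c code. */
static int do_something_extra(void) { return 0; }

static int launch_job(void)
{
    int rc = 0;
#ifdef HAVE_SOME_FEATURE     /* hypothetical feature macro */
    rc = do_something_extra();
#endif
    return rc;
}                            /* '}' outside the #ifdef, so it is seen
                                on every platform; hide it inside a
                                skipped branch and xlc runs off the end
                                of the file looking for it */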


Thanks again, Michael.

On Jan 2, 2007, at 3:50 PM, Jeff Squyres wrote:


Yikes - that's not a good error.  :-(

We don't regularly build / test on AIX, so I don't have much
immediate guidance for you.  My best suggestion at this point would
be to try the latest 1.2 beta or nightly snapshot.  We did an update
of the event engine (the portion of the code that you're seeing the
error issue from) that *may* alleviate the problem...?  (I have no
idea, actually -- I'm just kinda hoping that the new version of the
event engine will fix your problem :-\ )


On Dec 27, 2006, at 10:29 AM, Michael Marti wrote:


Dear All

I am trying to get openmpi-1.1.2 to work on AIX 5.3 / power5.

:: Compilation seems to have worked with the following sequence:

setenv OBJECT_MODE 64

setenv CC xlc
setenv CXX xlC
setenv F77 xlf
setenv FC xlf90

setenv CFLAGS "-qthreaded -O3 -qmaxmem=-1 -qarch=pwr5x -qtune=pwr5 -q64"
setenv CXXFLAGS "-qthreaded -O3 -qmaxmem=-1 -qarch=pwr5x -qtune=pwr5 -q64"
setenv FFLAGS "-qthreaded -O3 -qmaxmem=-1 -qarch=pwr5x -qtune=pwr5 -q64"
setenv FCFLAGS "-qthreaded -O3 -qmaxmem=-1 -qarch=pwr5x -qtune=pwr5 -q64"
setenv LDFLAGS "-Wl,-brtl"

./configure --prefix=/ist/openmpi-1.1.2 \
  --disable-mpi-cxx \
  --disable-mpi-cxx-seek \
  --enable-mpi-threads \
  --enable-progress-threads \
  --enable-static \
  --disable-shared \
  --disable-io-romio


:: After the compilation I ran make check and all 11 tests passed
successfully.

:: Now I'm trying to run the following command just as a test:
# mpirun -hostfile /gpfs/MICHAEL/MPI_hostfiles/mpinodes_b41-b44_1.asc -np 2 /usr/bin/hostname
- The file /gpfs/MICHAEL/MPI_hostfiles/mpinodes_b41-b44_1.asc
contains 4 hosts:
r1blade041 slots=1
r1blade042 slots=1
r1blade043 slots=1
r1blade044 slots=1
- The mpirun command eventually hangs with the following message:
[r1blade041:418014] poll failed with errno=25
[r1blade041:418014] opal_event_loop: ompi_evesel->dispatch()
failed.
- In this state mpirun cannot be killed from the keyboard; only a
kill -9 will do the trick.
- While the mpirun still hangs I can see that the "orted" has been
launched on both requested hosts.
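
(For reference, the numeric value in "poll failed with errno=25" is
platform-specific; a tiny sketch like the one below, compiled and run on the
blade that printed the message, translates it into the local system's own
wording. The value 25 is simply copied from the mpirun output above.)

/* Minimal sketch: print this host's text for the errno that Open MPI
 * reported. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    int reported = 25;   /* value taken from the mpirun output */
    printf("errno %d on this host means: %s\n", reported, strerror(reported));
    return 0;
}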

:: I turned on all debug options in openmpi-mca-params.conf. The
output for the same call of mpirun is in the file mpirun-debug.txt.gz.



:: As suggested in the mailing list rules, I include config.log
(config.log.gz) and the output of ompi_info (ompi_info.txt.gz).





:: As I am completely new to Open MPI (I have some experience with
LAM), I am lost at this stage. I would really appreciate it if someone
could give me some hints as to what is going wrong and where I
can get more information.

Best regards,

Michael Marti.



--
Michael Marti
Centro de Fisica dos Plasmas
Instituto Superior Tecnico
Av. Rovisco Pais
1049-001 Lisboa
Portugal

Tel:   +351 218 419 379
Fax:  +351 218 464 455
Mobile:  +351 968 434 327


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] openmpi / mpirun problem on aix: poll failed with errno=25, opal_event_loop: ompi_evesel->dispatch() failed.

2007-01-09 Thread Ralph H Castain
Hi Michael

I would suggest using the nightly snapshot off of the trunk - the poe module
compiles correctly there. I suspect we need an update to bring that fix over
to the 1.2 branch.

Ralph



On 1/9/07 7:55 AM, "Michael Marti"  wrote:

> Thanks Jeff for the hint.
> 
> Unfortunately, neither openmpi-1.2b3r12956 nor openmpi-1.2b2 compiles
> on aix-5.3/power5, so I was not able to check whether the poll
> issue is gone in these versions. Both (beta2 and beta3) fail for the
> same reason:
> 
> "pls_poe_module.c", line 640.2: 1506-204 (S) Unexpected end of file.
> make: 1254-004 The error code from the last command is 1.
> 
> I presume there is a missing bracket or similar, probably inside some
> #ifdef. As soon as I have a little more time I will have a look into
> it; any suggestions as to where to start are welcome...
> 
> Thanks again, Michael.
> 
> On Jan 2, 2007, at 3:50 PM, Jeff Squyres wrote:
> 
>> Yikes - that's not a good error.  :-(
>> 
>> We don't regularly build / test on AIX, so I don't have much
>> immediate guidance for you.  My best suggestion at this point would
>> be to try the latest 1.2 beta or nightly snapshot.  We did an update
>> of the event engine (the portion of the code that you're seeing the
>> error issue from) that *may* alleviate the problem...?  (I have no
>> idea, actually -- I'm just kinda hoping that the new version of the
>> event engine will fix your problem :-\ )
>> 
>> 
>> On Dec 27, 2006, at 10:29 AM, Michael Marti wrote:
>> 
>>> Dear All
>>> 
>>> I am trying to get openmpi-1.1.2 to work on AIX 5.3 / power5.
>>> 
>>> :: Compilation seems to have worked with the following sequence:
>>> 
>>> setenv OBJECT_MODE 64
>>> 
>>> setenv CC xlc
>>> setenv CXX xlC
>>> setenv F77 xlf
>>> setenv FC xlf90
>>> 
>>> setenv CFLAGS "-qthreaded -O3 -qmaxmem=-1 -qarch=pwr5x -qtune=pwr5 -q64"
>>> setenv CXXFLAGS "-qthreaded -O3 -qmaxmem=-1 -qarch=pwr5x -qtune=pwr5 -q64"
>>> setenv FFLAGS "-qthreaded -O3 -qmaxmem=-1 -qarch=pwr5x -qtune=pwr5 -q64"
>>> setenv FCFLAGS "-qthreaded -O3 -qmaxmem=-1 -qarch=pwr5x -qtune=pwr5 -q64"
>>> setenv LDFLAGS "-Wl,-brtl"
>>> 
>>> ./configure --prefix=/ist/openmpi-1.1.2 \
>>>   --disable-mpi-cxx \
>>>   --disable-mpi-cxx-seek \
>>>   --enable-mpi-threads \
>>>   --enable-progress-threads \
>>>   --enable-static \
>>>   --disable-shared \
>>>   --disable-io-romio
>>> 
>>> 
>>> :: After the compilation I ran make check and all 11 tests passed
>>> successfully.
>>> 
>>> :: Now I'm trying to run the following command just as a test:
>>> # mpirun -hostfile /gpfs/MICHAEL/MPI_hostfiles/mpinodes_b41-b44_1.asc -np 2 /usr/bin/hostname
>>> - The file /gpfs/MICHAEL/MPI_hostfiles/mpinodes_b41-b44_1.asc
>>> contains 4 hosts:
>>> r1blade041 slots=1
>>> r1blade042 slots=1
>>> r1blade043 slots=1
>>> r1blade044 slots=1
>>> - The mpirun command eventually hangs with the following message:
>>> [r1blade041:418014] poll failed with errno=25
>>> [r1blade041:418014] opal_event_loop: ompi_evesel->dispatch()
>>> failed.
>>> - In this state mpirun cannot be killed from the keyboard; only a
>>> kill -9 will do the trick.
>>> - While the mpirun still hangs I can see that the "orted" has been
>>> launched on both requested hosts.
>>> 
>>> :: I turned on all debug options in openmpi-mca-params.conf. The
>>> output for the same call of mpirun is in the file mpirun-debug.txt.gz.
>>> 
>>> 
>>> :: As suggested in the mailing list rules, I include config.log
>>> (config.log.gz) and the output of ompi_info (ompi_info.txt.gz).
>>> 
>>> 
>>> 
>>> 
>>> 
>>> :: As I am completely new to Open MPI (I have some experience with
>>> LAM), I am lost at this stage. I would really appreciate it if someone
>>> could give me some hints as to what is going wrong and where I
>>> can get more information.
>>> 
>>> Best regards,
>>> 
>>> Michael Marti.
>>> 
>>> 
>>> --
>>> Michael Marti
>>> Centro de Fisica dos Plasmas
>>> Instituto Superior Tecnico
>>> Av. Rovisco Pais
>>> 1049-001 Lisboa
>>> Portugal
>>> 
>>> Tel:   +351 218 419 379
>>> Fax:  +351 218 464 455
>>> Mobile:  +351 968 434 327
>>> 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> -- 
>> Jeff Squyres
>> Server Virtualization Business Unit
>> Cisco Systems
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Ompi failing on mx only

2007-01-09 Thread Grobe, Gary L. (JSC-EV)[ESCG]
> What I need is the backtrace of the process which generates the
> segfault. Second, in order to understand the backtrace, it's
> better to have run a debug version of Open MPI. Without the
> debug version we only see the address where the fault occurs,
> without having access to the line number ...

How about this: this is the section I was stepping through in order
to get to the first error I usually run into ... "mx_connect fail for
node-1:0 with key  (error Endpoint closed or not connectable!)"

// gdb output

Breakpoint 1, 0x2ac856bd92e0 in opal_progress ()
   from /usr/local/openmpi-1.2b3r13030/lib/libopen-pal.so.0
(gdb) s
Single stepping until exit from function opal_progress, 
which has no line number information.
0x2ac857361540 in sched_yield () from /lib/libc.so.6
(gdb) s
Single stepping until exit from function sched_yield, 
which has no line number information.
opal_condition_wait (c=0x5098e0, m=0x5098a0) at condition.h:80
80  while (c->c_signaled == 0) {
(gdb) s
81  opal_progress();
(gdb) s

Breakpoint 1, 0x2ac856bd92e0 in opal_progress ()
   from /usr/local/openmpi-1.2b3r13030/lib/libopen-pal.so.0
(gdb) s
Single stepping until exit from function opal_progress, 
which has no line number information.
0x2ac857361540 in sched_yield () from /lib/libc.so.6
(gdb) backtrace
#0  0x2ac857361540 in sched_yield () from /lib/libc.so.6
#1  0x00402f60 in opal_condition_wait (c=0x5098e0, m=0x5098a0)
at condition.h:81
#2  0x00402b3c in orterun (argc=17, argv=0x7fff54151088)
at orterun.c:427
#3  0x00402713 in main (argc=17, argv=0x7fff54151088) at
main.c:13
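
(For context: condition.h:80-81 above is orterun's normal wait loop -- it keeps
driving the progress engine until the job signals completion, and the progress
engine yields when it has nothing to do, which is why single-stepping keeps
landing in opal_progress()/sched_yield(). As a rough paraphrase of those two
lines, not the actual Open MPI source:

/* Paraphrase of the wait loop shown at condition.h:80-81 in the gdb
 * session above.  The names are stand-ins so the sketch compiles on
 * its own. */
#include <sched.h>

struct condition_sketch { volatile int c_signaled; };

static void progress_engine_poll(void)
{
    /* opal_progress() plays this role; when idle it ends up yielding
     * the CPU, hence the sched_yield() frames in the trace. */
    sched_yield();
}

static void condition_wait_sketch(struct condition_sketch *c)
{
    while (c->c_signaled == 0) {   /* cf. condition.h:80 */
        progress_engine_poll();    /* cf. condition.h:81 */
    }
}

So the backtrace requested earlier -- the one from the cpi process that
actually segfaults -- is the more informative one; the mpirun trace above
mostly shows it waiting.)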

--- This is the mpirun output as I was stepping through it. At the end
of this is the error that the backtrace above shows.

[node-2:11909] top: openmpi-sessions-ggrobe@node-2_0
[node-2:11909] tmp: /tmp
[node-1:10719] procdir:
/tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414/1/0
[node-1:10719] jobdir:
/tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414/1
[node-1:10719] unidir:
/tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414
[node-1:10719] top: openmpi-sessions-ggrobe@node-1_0
[node-1:10719] tmp: /tmp
[juggernaut:17414] spawn: in job_state_callback(jobid = 1, state = 0x4)
[juggernaut:17414] Info: Setting up debugger process table for
applications
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 6
  MPIR_proctable:
(i, host, exe, pid) = (0, node-1,
/home/ggrobe/Projects/ompi/cpi/./cpi, 10719)
(i, host, exe, pid) = (1, node-1,
/home/ggrobe/Projects/ompi/cpi/./cpi, 10720)
(i, host, exe, pid) = (2, node-1,
/home/ggrobe/Projects/ompi/cpi/./cpi, 10721)
(i, host, exe, pid) = (3, node-1,
/home/ggrobe/Projects/ompi/cpi/./cpi, 10722)
(i, host, exe, pid) = (4, node-2,
/home/ggrobe/Projects/ompi/cpi/./cpi, 11908)
(i, host, exe, pid) = (5, node-2,
/home/ggrobe/Projects/ompi/cpi/./cpi, 11909)
[node-1:10718] sess_dir_finalize: proc session dir not empty - leaving
[node-1:10718] sess_dir_finalize: proc session dir not empty - leaving
[node-1:10721] procdir:
/tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414/1/2
[node-1:10721] jobdir:
/tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414/1
[node-1:10721] unidir:
/tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414
[node-1:10721] top: openmpi-sessions-ggrobe@node-1_0
[node-1:10721] tmp: /tmp
[node-1:10720] mx_connect fail for node-1:0 with key  (error
Endpoint closed or not connectable!)



Re: [OMPI users] external32 i/o not implemented?

2007-01-09 Thread Robert Latham
On Mon, Jan 08, 2007 at 02:32:14PM -0700, Tom Lund wrote:
> Rainer,
>    Thank you for taking time to reply to my query.  Do I understand 
> correctly that external32 data representation for i/o is not 
> implemented?  I am puzzled since the MPI-2 standard clearly indicates 
> the existence of external32 and has lots of words regarding how nice 
> this feature is for file interoperability.  So do both Open MPI and 
> MPIch2 not adhere to the standard in this regard?  If this is really the 
> case, how difficult is it to define a custom data representation that is 
> 32-bit big endian on all platforms?  Do you know of any documentation 
> that explains how to do this?
>Thanks again.

Hi Tom

You do understand correctly.  I do not know of an MPI-IO
implementation that supports external32.  

When you say "custom data representation" do you mean an MPI-IO
user-defined data representation?  
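
If so, the rough shape is something like the sketch below. It is illustrative
only: the "int32be" representation name and the file name are made up, it
handles just contiguous MPI_INT data, it assumes a little-endian host, it
ignores the 'position' argument that a real conversion function must honour
per the MPI-2 standard, and not every MPI-IO implementation will accept the
registration in the first place.

/* Rough sketch of a user-defined MPI-IO data representation storing
 * ints as 32-bit big-endian words.  No error checking. */
#include <mpi.h>
#include <stdint.h>

static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u)
         | ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Each int occupies 4 bytes in the file. */
static int int32be_extent(MPI_Datatype dt, MPI_Aint *file_extent, void *extra)
{
    *file_extent = 4;
    return MPI_SUCCESS;
}

/* read conversion: file (big endian) -> memory */
static int int32be_read(void *userbuf, MPI_Datatype dt, int count,
                        void *filebuf, MPI_Offset position, void *extra)
{
    const uint32_t *in = (const uint32_t *) filebuf;
    uint32_t *out = (uint32_t *) userbuf;
    int i;
    for (i = 0; i < count; i++) out[i] = swap32(in[i]);
    return MPI_SUCCESS;
}

/* write conversion: memory -> file (big endian) */
static int int32be_write(void *userbuf, MPI_Datatype dt, int count,
                         void *filebuf, MPI_Offset position, void *extra)
{
    const uint32_t *in = (const uint32_t *) userbuf;
    uint32_t *out = (uint32_t *) filebuf;
    int i;
    for (i = 0; i < count; i++) out[i] = swap32(in[i]);
    return MPI_SUCCESS;
}

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Init(&argc, &argv);
    /* May be refused by implementations without user-defined
     * representation support. */
    MPI_Register_datarep("int32be", int32be_read, int32be_write,
                         int32be_extent, NULL);
    MPI_File_open(MPI_COMM_WORLD, "data.int32be",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "int32be", MPI_INFO_NULL);
    /* ... MPI_File_write_all / MPI_File_read_all with MPI_INT ... */
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}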

An alternate approach would be to use a higher-level library like
parallel-netcdf or HDF5 (configured for parallel i/o).  Those
libraries already define a file format and implement all the necessary
data conversion routines, and they have a wealth of ancillary tools and
programs to work with their respective file formats.  Additionally,
those higher-level libraries will offer you more features than MPI-IO,
such as the ability to define attributes on variables and datafiles.
Even better, there is the potential that these libraries might offer
some clever optimizations for your workload, saving you the effort.
Further, you can use those higher-level libraries on top of any MPI-IO
implementation, not just OpenMPI or MPICH2.

This is a little bit of a diversion from your original question, but
to sum it up, I'd say one potential answer to the lack of external32
is to use a higher level library and sidestep the issue of MPI-IO data
representations altogether. 

==rob

-- 
Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Lab, IL USA                 B29D F333 664A 4280 315B


Re: [OMPI users] external32 i/o not implemented?

2007-01-09 Thread Tom Lund

Rob,
  Thank you for your informative reply.  I had no luck finding the 
external32 data representation in any of several MPI implementations, and 
thus I do need to devise an alternative strategy.  Do you know of a good 
reference explaining how to combine HDF5 with MPI?


  ---Tom

Robert Latham wrote:

> On Mon, Jan 08, 2007 at 02:32:14PM -0700, Tom Lund wrote:
>> Rainer,
>>    Thank you for taking time to reply to my query.  Do I understand
>> correctly that external32 data representation for i/o is not
>> implemented?  I am puzzled since the MPI-2 standard clearly indicates
>> the existence of external32 and has lots of words regarding how nice
>> this feature is for file interoperability.  So do both Open MPI and
>> MPIch2 not adhere to the standard in this regard?  If this is really the
>> case, how difficult is it to define a custom data representation that is
>> 32-bit big endian on all platforms?  Do you know of any documentation
>> that explains how to do this?
>>    Thanks again.
>
> Hi Tom
>
> You do understand correctly.  I do not know of an MPI-IO
> implementation that supports external32.
>
> When you say "custom data representation" do you mean an MPI-IO
> user-defined data representation?
>
> An alternate approach would be to use a higher-level library like
> parallel-netcdf or HDF5 (configured for parallel i/o).  Those
> libraries already define a file format and implement all the necessary
> data conversion routines, and they have a wealth of ancillary tools and
> programs to work with their respective file formats.  Additionally,
> those higher-level libraries will offer you more features than MPI-IO,
> such as the ability to define attributes on variables and datafiles.
> Even better, there is the potential that these libraries might offer
> some clever optimizations for your workload, saving you the effort.
> Further, you can use those higher-level libraries on top of any MPI-IO
> implementation, not just OpenMPI or MPICH2.
>
> This is a little bit of a diversion from your original question, but
> to sum it up, I'd say one potential answer to the lack of external32
> is to use a higher-level library and sidestep the issue of MPI-IO data
> representations altogether.
>
> ==rob



--
===
  Thomas S. Lund
  Sr. Research Scientist
  Colorado Research Associates, a division of
  NorthWest Research Associates
  3380 Mitchell Ln.
  Boulder, CO 80301
  (303) 415-9701 X 209 (voice)
  (303) 415-9702   (fax)
  l...@cora.nwra.com
===



Re: [OMPI users] external32 i/o not implemented?

2007-01-09 Thread Robert Latham
On Tue, Jan 09, 2007 at 02:53:24PM -0700, Tom Lund wrote:
> Rob,
>Thank you for your informative reply.  I had no luck finding the 
> external32 data representation in any of several mpi implementations and 
> thus I do need to devise an alternative strategy.  Do you know of a good 
> reference explaining how to combine HDF5 with mpi?

Sure.  Start here: http://hdf.ncsa.uiuc.edu/HDF5/PHDF5/ 

You might also find the Parallel-NetCDF project (disclaimer: I work on
that project) interesting:
http://www.mcs.anl.gov/parallel-netcdf/
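
As a bare-bones starting point, the basic wiring looks something like the
sketch below (file name illustrative, error checking omitted; the PHDF5 pages
above have complete examples):

/* Minimal sketch: every rank collectively creates one HDF5 file
 * through the MPI-IO driver.  Compile with h5pcc (or mpicc plus the
 * HDF5 flags) from a parallel-enabled HDF5 build. */
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    hid_t fapl, file;

    MPI_Init(&argc, &argv);

    /* File-access property list that routes HDF5 through MPI-IO. */
    fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

    /* Collective call: all ranks create/open the same file. */
    file = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* ... define datasets and write them, typically with a transfer
     * property list set via H5Pset_dxpl_mpio(..., H5FD_MPIO_COLLECTIVE) ... */

    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}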

==rob

-- 
Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Lab, IL USA                 B29D F333 664A 4280 315B