Re: [OMPI users] openmpi / mpirun problem on aix: poll failed with errno=25, opal_event_loop: ompi_evesel->dispatch() failed.
Thanks Jeff for the hint.

Unfortunately neither openmpi-1.2b3r12956 nor openmpi-1.2b2 compiles on aix-5.3/power5, so I was not able to check whether the poll issue is gone in these versions. Both (beta2 and beta3) fail for the same reason:

"pls_poe_module.c", line 640.2: 1506-204 (S) Unexpected end of file.
make: 1254-004 The error code from the last command is 1.

I presume there is a missing bracket or similar, probably inside some ifdef. As soon as I have a little more time I will have a look into it - any suggestions as to where to start are welcome...

Thanks again, Michael.

On Jan 2, 2007, at 3:50 PM, Jeff Squyres wrote:

> Yikes - that's not a good error. :-(
>
> We don't regularly build / test on AIX, so I don't have much immediate guidance for you. My best suggestion at this point would be to try the latest 1.2 beta or nightly snapshot. We did an update of the event engine (the portion of the code that you're seeing the error issue from) that *may* alleviate the problem...? (I have no idea, actually -- I'm just kinda hoping that the new version of the event engine will fix your problem :-\ )
>
> On Dec 27, 2006, at 10:29 AM, Michael Marti wrote:
>
>> Dear All
>>
>> I am trying to get openmpi-1.1.2 to work on AIX 5.3 / power5.
>>
>> :: Compilation seems to have worked with the following sequence:
>>
>> setenv OBJECT_MODE 64
>>
>> setenv CC xlc
>> setenv CXX xlC
>> setenv F77 xlf
>> setenv FC xlf90
>>
>> setenv CFLAGS "-qthreaded -O3 -qmaxmem=-1 -qarch=pwr5x -qtune=pwr5 -q64"
>> setenv CXXFLAGS "-qthreaded -O3 -qmaxmem=-1 -qarch=pwr5x -qtune=pwr5 -q64"
>> setenv FFLAGS "-qthreaded -O3 -qmaxmem=-1 -qarch=pwr5x -qtune=pwr5 -q64"
>> setenv FCFLAGS "-qthreaded -O3 -qmaxmem=-1 -qarch=pwr5x -qtune=pwr5 -q64"
>> setenv LDFLAGS "-Wl,-brtl"
>>
>> ./configure --prefix=/ist/openmpi-1.1.2 \
>>   --disable-mpi-cxx \
>>   --disable-mpi-cxx-seek \
>>   --enable-mpi-threads \
>>   --enable-progress-threads \
>>   --enable-static \
>>   --disable-shared \
>>   --disable-io-romio
>>
>> :: After the compilation I ran make check and all 11 tests passed successfully.
>>
>> :: Now I'm trying to run the following command just as a test:
>> # mpirun -hostfile /gpfs/MICHAEL/MPI_hostfiles/mpinodes_b41-b44_1.asc -np 2 /usr/bin/hostname
>> - The file /gpfs/MICHAEL/MPI_hostfiles/mpinodes_b41-b44_1.asc contains 4 hosts:
>>   r1blade041 slots=1
>>   r1blade042 slots=1
>>   r1blade043 slots=1
>>   r1blade044 slots=1
>> - The mpirun command eventually hangs with the following message:
>>   [r1blade041:418014] poll failed with errno=25
>>   [r1blade041:418014] opal_event_loop: ompi_evesel->dispatch() failed.
>> - In this state mpirun cannot be killed by hitting Ctrl-C; only a kill -9 will do the trick.
>> - While mpirun still hangs I can see that "orted" has been launched on both requested hosts.
>>
>> :: I turned on all debug options in openmpi-mca-params.conf. The output for the same call of mpirun is in the file mpirun-debug.txt.gz.
>>
>> :: As suggested in the mailing list rules I include config.log (config.log.gz) and the output of ompi_info (ompi_info.txt.gz).
>>
>> :: As I am completely new to openmpi (I have some experience with lam) I am lost at this stage. I would really appreciate it if someone could give me some hints as to what is going wrong and where I could get more info.
>>
>> Best regards,
>>
>> Michael Marti.
>>
>> --
>> Michael Marti
>> Centro de Fisica dos Plasmas
>> Instituto Superior Tecnico
>> Av. Rovisco Pais
>> 1049-001 Lisboa
>> Portugal
>>
>> Tel: +351 218 419 379
>> Fax: +351 218 464 455
>> Mobile: +351 968 434 327
>> --
>
> --
> Jeff Squyres
> Server Virtualization Business Unit
> Cisco Systems
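If the guess about an unbalanced conditional is right, the failure reduces to a pattern like the hypothetical fragment below. This is illustrative only - the guard macro HAVE_POE and the function are made up, not the actual pls_poe_module.c contents. A closing brace or #endif that exists in only one branch of a preprocessor conditional lets the compiler reach the end of the file while a block is still open, and the diagnostic then points at the last line of the file rather than at the real culprit.

/* Hypothetical sketch -- not the real pls_poe_module.c. */
#include <stdio.h>

#ifdef HAVE_POE
static int launch_via_poe(void)
{
    printf("launching via POE\n");
    return 0;
}                 /* if this '}' or the '#endif' below were missing,   */
#endif            /* the compiler would run off the end of the file    */
                  /* and report an "unexpected end of file" there.     */

int main(void)
{
#ifdef HAVE_POE
    return launch_via_poe();
#else
    return 0;
#endif
}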
Re: [OMPI users] openmpi / mpirun problem on aix: poll failed with errno=25, opal_event_loop: ompi_evesel->dispatch() failed.
Hi Michael

I would suggest using the nightly snapshot off of the trunk - the poe module compiles correctly there. I suspect we need an update to bring that fix over to the 1.2 branch.

Ralph

On 1/9/07 7:55 AM, "Michael Marti" wrote:

> Thanks Jeff for the hint.
>
> Unfortunately neither openmpi-1.2b3r12956 nor openmpi-1.2b2 compiles
> on aix-5.3/power5, so I was not able to check whether the poll
> issue is gone in these versions. Both (beta2 and beta3) fail for the
> same reason:
>
> "pls_poe_module.c", line 640.2: 1506-204 (S) Unexpected end of file.
> make: 1254-004 The error code from the last command is 1.
>
> I presume there is a missing bracket or similar, probably inside some
> ifdef. As soon as I have a little more time I will have a look into
> it - any suggestions as to where to start are welcome...
>
> Thanks again, Michael.
>
> On Jan 2, 2007, at 3:50 PM, Jeff Squyres wrote:
>
>> Yikes - that's not a good error. :-(
>>
>> We don't regularly build / test on AIX, so I don't have much
>> immediate guidance for you. My best suggestion at this point would
>> be to try the latest 1.2 beta or nightly snapshot. We did an update
>> of the event engine (the portion of the code that you're seeing the
>> error issue from) that *may* alleviate the problem...? (I have no
>> idea, actually -- I'm just kinda hoping that the new version of the
>> event engine will fix your problem :-\ )
>>
>>
>> On Dec 27, 2006, at 10:29 AM, Michael Marti wrote:
>>
>>> Dear All
>>>
>>> I am trying to get openmpi-1.1.2 to work on AIX 5.3 / power5.
>>>
>>> :: Compilation seems to have worked with the following sequence:
>>>
>>> setenv OBJECT_MODE 64
>>>
>>> setenv CC xlc
>>> setenv CXX xlC
>>> setenv F77 xlf
>>> setenv FC xlf90
>>>
>>> setenv CFLAGS "-qthreaded -O3 -qmaxmem=-1 -qarch=pwr5x -qtune=pwr5 -q64"
>>> setenv CXXFLAGS "-qthreaded -O3 -qmaxmem=-1 -qarch=pwr5x -qtune=pwr5 -q64"
>>> setenv FFLAGS "-qthreaded -O3 -qmaxmem=-1 -qarch=pwr5x -qtune=pwr5 -q64"
>>> setenv FCFLAGS "-qthreaded -O3 -qmaxmem=-1 -qarch=pwr5x -qtune=pwr5 -q64"
>>> setenv LDFLAGS "-Wl,-brtl"
>>>
>>> ./configure --prefix=/ist/openmpi-1.1.2 \
>>>   --disable-mpi-cxx \
>>>   --disable-mpi-cxx-seek \
>>>   --enable-mpi-threads \
>>>   --enable-progress-threads \
>>>   --enable-static \
>>>   --disable-shared \
>>>   --disable-io-romio
>>>
>>>
>>> :: After the compilation I ran make check and all 11 tests passed
>>> successfully.
>>>
>>> :: Now I'm trying to run the following command just as a test:
>>> # mpirun -hostfile /gpfs/MICHAEL/MPI_hostfiles/mpinodes_b41-b44_1.asc -np 2 /usr/bin/hostname
>>> - The file /gpfs/MICHAEL/MPI_hostfiles/mpinodes_b41-b44_1.asc
>>> contains 4 hosts:
>>>   r1blade041 slots=1
>>>   r1blade042 slots=1
>>>   r1blade043 slots=1
>>>   r1blade044 slots=1
>>> - The mpirun command eventually hangs with the following message:
>>>   [r1blade041:418014] poll failed with errno=25
>>>   [r1blade041:418014] opal_event_loop: ompi_evesel->dispatch() failed.
>>> - In this state mpirun cannot be killed by hitting Ctrl-C; only a
>>> kill -9 will do the trick.
>>> - While mpirun still hangs I can see that "orted" has been
>>> launched on both requested hosts.
>>>
>>> :: I turned on all debug options in openmpi-mca-params.conf. The
>>> output for the same call of mpirun is in the file mpirun-debug.txt.gz.
>>>
>>> :: As suggested in the mailing list rules I include config.log
>>> (config.log.gz) and the output of ompi_info (ompi_info.txt.gz).
>>>
>>> :: As I am completely new to openmpi (I have some experience with
>>> lam) I am lost at this stage. I would really appreciate it if someone
>>> could give me some hints as to what is going wrong and where I
>>> could get more info.
>>>
>>> Best regards,
>>>
>>> Michael Marti.
>>>
>>> --
>>> Michael Marti
>>> Centro de Fisica dos Plasmas
>>> Instituto Superior Tecnico
>>> Av. Rovisco Pais
>>> 1049-001 Lisboa
>>> Portugal
>>>
>>> Tel: +351 218 419 379
>>> Fax: +351 218 464 455
>>> Mobile: +351 968 434 327
>>> --
>>
>> --
>> Jeff Squyres
>> Server Virtualization Business Unit
>> Cisco Systems
Re: [OMPI users] Ompi failing on mx only
> What I need is the backtrace on the process which generates the
> segfault. Second, in order to understand the backtrace, it's
> better to have run a debug version of Open MPI. Without the
> debug version we only see the address where the fault occurs,
> without having access to the line number ...

How about this - this is the section I was stepping through in order to get the first error I usually run into: "mx_connect fail for node-1:0 with key (error Endpoint closed or not connectable!)"

// gdb output

Breakpoint 1, 0x2ac856bd92e0 in opal_progress () from /usr/local/openmpi-1.2b3r13030/lib/libopen-pal.so.0
(gdb) s
Single stepping until exit from function opal_progress, which has no line number information.
0x2ac857361540 in sched_yield () from /lib/libc.so.6
(gdb) s
Single stepping until exit from function sched_yield, which has no line number information.
opal_condition_wait (c=0x5098e0, m=0x5098a0) at condition.h:80
80          while (c->c_signaled == 0) {
(gdb) s
81              opal_progress();
(gdb) s
Breakpoint 1, 0x2ac856bd92e0 in opal_progress () from /usr/local/openmpi-1.2b3r13030/lib/libopen-pal.so.0
(gdb) s
Single stepping until exit from function opal_progress, which has no line number information.
0x2ac857361540 in sched_yield () from /lib/libc.so.6
(gdb) backtrace
#0  0x2ac857361540 in sched_yield () from /lib/libc.so.6
#1  0x00402f60 in opal_condition_wait (c=0x5098e0, m=0x5098a0) at condition.h:81
#2  0x00402b3c in orterun (argc=17, argv=0x7fff54151088) at orterun.c:427
#3  0x00402713 in main (argc=17, argv=0x7fff54151088) at main.c:13

---

This is the mpirun output as I was stepping through it. At the end of this is the error that the backtrace above shows.

[node-2:11909] top: openmpi-sessions-ggrobe@node-2_0
[node-2:11909] tmp: /tmp
[node-1:10719] procdir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414/1/0
[node-1:10719] jobdir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414/1
[node-1:10719] unidir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414
[node-1:10719] top: openmpi-sessions-ggrobe@node-1_0
[node-1:10719] tmp: /tmp
[juggernaut:17414] spawn: in job_state_callback(jobid = 1, state = 0x4)
[juggernaut:17414] Info: Setting up debugger process table for applications
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 6
  MPIR_proctable:
    (i, host, exe, pid) = (0, node-1, /home/ggrobe/Projects/ompi/cpi/./cpi, 10719)
    (i, host, exe, pid) = (1, node-1, /home/ggrobe/Projects/ompi/cpi/./cpi, 10720)
    (i, host, exe, pid) = (2, node-1, /home/ggrobe/Projects/ompi/cpi/./cpi, 10721)
    (i, host, exe, pid) = (3, node-1, /home/ggrobe/Projects/ompi/cpi/./cpi, 10722)
    (i, host, exe, pid) = (4, node-2, /home/ggrobe/Projects/ompi/cpi/./cpi, 11908)
    (i, host, exe, pid) = (5, node-2, /home/ggrobe/Projects/ompi/cpi/./cpi, 11909)
[node-1:10718] sess_dir_finalize: proc session dir not empty - leaving
[node-1:10718] sess_dir_finalize: proc session dir not empty - leaving
[node-1:10721] procdir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414/1/2
[node-1:10721] jobdir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414/1
[node-1:10721] unidir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414
[node-1:10721] top: openmpi-sessions-ggrobe@node-1_0
[node-1:10721] tmp: /tmp
[node-1:10720] mx_connect fail for node-1:0 with key (error Endpoint closed or not connectable!)
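A note on why this particular backtrace is not very revealing: it was taken from mpirun (orterun), which appears to be simply parked in opal_condition_wait, polling the progress engine until the job signals completion - hence the trace always ends in opal_progress() or sched_yield() rather than anywhere near the MX failure, which happens in the cpi processes on node-1. The following self-contained sketch mirrors only the two lines gdb shows at condition.h:80-81; the stub progress function and the flag are illustrative, not Open MPI source.

#include <sched.h>
#include <stdio.h>

/* Sketch of the polling loop the backtrace lands in. */
static volatile int c_signaled = 0;      /* set when the awaited event arrives */

static void opal_progress_stub(void)
{
    /* stand-in for Open MPI's progress engine; here it just flips the
     * flag so the example terminates */
    c_signaled = 1;
}

int main(void)
{
    while (c_signaled == 0) {   /* condition.h:80 in the gdb session   */
        opal_progress_stub();   /* condition.h:81                      */
        sched_yield();          /* mpirun yields between polls, so the
                                   backtrace bottoms out in sched_yield */
    }
    printf("condition signaled, mpirun would now return\n");
    return 0;
}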
Re: [OMPI users] external32 i/o not implemented?
On Mon, Jan 08, 2007 at 02:32:14PM -0700, Tom Lund wrote:
> Rainer,
>    Thank you for taking the time to reply to my query. Do I understand
> correctly that the external32 data representation for I/O is not
> implemented? I am puzzled, since the MPI-2 standard clearly indicates
> the existence of external32 and has lots of words regarding how nice
> this feature is for file interoperability. So do both Open MPI and
> MPICH2 not adhere to the standard in this regard? If this is really the
> case, how difficult is it to define a custom data representation that is
> 32-bit big endian on all platforms? Do you know of any documentation
> that explains how to do this?
>    Thanks again.

Hi Tom

You do understand correctly. I do not know of an MPI-IO implementation that supports external32.

When you say "custom data representation", do you mean an MPI-IO user-defined data representation?

An alternate approach would be to use a higher-level library like Parallel-NetCDF or HDF5 (configured for parallel I/O). Those libraries already define a file format and implement all the necessary data conversion routines, and they have a wealth of ancillary tools and programs to work with their respective file formats. Additionally, those higher-level libraries will offer you more features than MPI-IO, such as the ability to define attributes on variables and datafiles. Even better, there is the potential that these libraries might offer some clever optimizations for your workload, saving you the effort. Further, you can use those higher-level libraries on top of any MPI-IO implementation, not just Open MPI or MPICH2.

This is a little bit of a diversion from your original question, but to sum it up, I'd say one potential answer to the lack of external32 is to use a higher-level library and sidestep the issue of MPI-IO data representations altogether.

==rob

--
Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Lab, IL USA                 B29D F333 664A 4280 315B
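For reference, the MPI-2 "user-defined data representation" interface mentioned above looks roughly like the sketch below (MPI_Register_datarep plus read/write conversion and extent callbacks). The representation name "custom32be", the do-nothing conversion bodies, and the fixed 4-byte file extent are placeholders, and whether a given MPI-IO implementation actually accepts user-defined representations is exactly the open question in this thread.

#include <mpi.h>
#include <string.h>

/* file -> user: a real implementation would convert from the 32-bit
 * big-endian file form here; the memcpy is a placeholder that only
 * makes sense for MPI_INT on a matching platform. */
static int read_conv(void *userbuf, MPI_Datatype datatype, int count,
                     void *filebuf, MPI_Offset position, void *extra_state)
{
    (void)datatype; (void)position; (void)extra_state;
    memcpy(userbuf, filebuf, (size_t)count * sizeof(int));
    return MPI_SUCCESS;
}

/* user -> file: the mirror image of read_conv; placeholder as above. */
static int write_conv(void *userbuf, MPI_Datatype datatype, int count,
                      void *filebuf, MPI_Offset position, void *extra_state)
{
    (void)datatype; (void)position; (void)extra_state;
    memcpy(filebuf, userbuf, (size_t)count * sizeof(int));
    return MPI_SUCCESS;
}

/* Every element occupies 4 bytes in the file representation -- the
 * "32-bit on all platforms" idea from the question (placeholder). */
static int extent_fn(MPI_Datatype datatype, MPI_Aint *file_extent,
                     void *extra_state)
{
    (void)datatype; (void)extra_state;
    *file_extent = 4;
    return MPI_SUCCESS;
}

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Init(&argc, &argv);

    /* An implementation without user-defined datarep support will
     * return an error from this call. */
    MPI_Register_datarep("custom32be", read_conv, write_conv, extent_fn, NULL);

    MPI_File_open(MPI_COMM_WORLD, "data.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "custom32be", MPI_INFO_NULL);
    /* ...MPI_File_write_all() would now go through write_conv()... */
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}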
Re: [OMPI users] external32 i/o not implemented?
Rob,
   Thank you for your informative reply. I had no luck finding the external32 data representation in any of several MPI implementations, and thus I do need to devise an alternative strategy. Do you know of a good reference explaining how to combine HDF5 with MPI?

---Tom

Robert Latham wrote:
> On Mon, Jan 08, 2007 at 02:32:14PM -0700, Tom Lund wrote:
>> Rainer,
>>    Thank you for taking the time to reply to my query. Do I understand
>> correctly that the external32 data representation for I/O is not
>> implemented? I am puzzled, since the MPI-2 standard clearly indicates
>> the existence of external32 and has lots of words regarding how nice
>> this feature is for file interoperability. So do both Open MPI and
>> MPICH2 not adhere to the standard in this regard? If this is really the
>> case, how difficult is it to define a custom data representation that is
>> 32-bit big endian on all platforms? Do you know of any documentation
>> that explains how to do this?
>>    Thanks again.
>
> Hi Tom
>
> You do understand correctly. I do not know of an MPI-IO implementation
> that supports external32.
>
> When you say "custom data representation", do you mean an MPI-IO
> user-defined data representation?
>
> An alternate approach would be to use a higher-level library like
> Parallel-NetCDF or HDF5 (configured for parallel I/O). Those libraries
> already define a file format and implement all the necessary data
> conversion routines, and they have a wealth of ancillary tools and
> programs to work with their respective file formats. Additionally, those
> higher-level libraries will offer you more features than MPI-IO, such as
> the ability to define attributes on variables and datafiles. Even better,
> there is the potential that these libraries might offer some clever
> optimizations for your workload, saving you the effort. Further, you can
> use those higher-level libraries on top of any MPI-IO implementation,
> not just Open MPI or MPICH2.
>
> This is a little bit of a diversion from your original question, but to
> sum it up, I'd say one potential answer to the lack of external32 is to
> use a higher-level library and sidestep the issue of MPI-IO data
> representations altogether.
>
> ==rob

--
===
Thomas S. Lund
Sr. Research Scientist
Colorado Research Associates, a division of NorthWest Research Associates
3380 Mitchell Ln.
Boulder, CO 80301
(303) 415-9701 X 209 (voice)
(303) 415-9702 (fax)
l...@cora.nwra.com
===
Re: [OMPI users] external32 i/o not implemented?
On Tue, Jan 09, 2007 at 02:53:24PM -0700, Tom Lund wrote:
> Rob,
>    Thank you for your informative reply. I had no luck finding the
> external32 data representation in any of several MPI implementations,
> and thus I do need to devise an alternative strategy. Do you know of a
> good reference explaining how to combine HDF5 with MPI?

Sure. Start here: http://hdf.ncsa.uiuc.edu/HDF5/PHDF5/

You might also find the Parallel-NetCDF project (disclaimer: I work on that project) interesting: http://www.mcs.anl.gov/parallel-netcdf/

==rob

--
Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Lab, IL USA                 B29D F333 664A 4280 315B
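To give a flavour of what "combining HDF5 with MPI" looks like in practice, here is a minimal parallel-HDF5 sketch in which each rank writes one element of a shared dataset through the MPI-IO driver. It assumes an HDF5 build configured with --enable-parallel and the 1.8-style API; the file name, dataset name, and layout are placeholders, and the PHDF5 tutorial linked above covers the details.

#include <hdf5.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Tell HDF5 to do its I/O through MPI-IO on this communicator. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("out.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One element per rank; stored as big-endian 32-bit integers, so the
     * on-disk format is portable -- the role external32 would have played. */
    hsize_t dims[1] = { (hsize_t)nprocs };
    hid_t filespace = H5Screate_simple(1, dims, NULL);
    hid_t dset = H5Dcreate(file, "ranks", H5T_STD_I32BE, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Each rank selects its own element and writes it collectively. */
    hsize_t offset[1] = { (hsize_t)rank }, count[1] = { 1 };
    hid_t memspace = H5Screate_simple(1, count, NULL);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    int value = rank;
    H5Dwrite(dset, H5T_NATIVE_INT, memspace, filespace, dxpl, &value);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Pclose(fapl); H5Fclose(file);
    MPI_Finalize();
    return 0;
}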