(I am bringing the OMPI users list back to CC.) I reproduced the problem with OMPI 1.6.1 and found the cause: mx_finalize() is called before this error occurs, so the error is expected, since calling mx_connect() after mx_finalize() is invalid. It looks like the MX BTL component changed significantly between OMPI 1.5 and 1.6.1, and I am pretty sure it worked fine with early 1.5.x versions. Can somebody comment on what was changed in the MX BTL component in late 1.5 versions?
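To illustrate the ordering problem, here is a minimal standalone sketch (not the actual MX BTL code) of the invalid sequence, assuming the standard MX 1.x prototypes from myriexpress.h as shipped by Open-MX's MX compatibility layer; the board number, endpoint id, key, NIC id and timeout values are placeholders, so double-check the prototypes against your installed header:

/* Illustrative only: reproduces the call ordering described above
 * (mx_connect() issued after mx_finalize()), not the real OMPI MX BTL code.
 * Board/endpoint/key/NIC-id/timeout values are placeholders. */
#include <stdio.h>
#include <myriexpress.h>

int main(void)
{
    mx_endpoint_t ep;
    mx_endpoint_addr_t peer_addr;
    mx_return_t ret;

    if (mx_init() != MX_SUCCESS) {
        fprintf(stderr, "mx_init failed\n");
        return 1;
    }

    /* Open a local endpoint on board 0 with a placeholder key. */
    if (mx_open_endpoint(0, 0, 0, NULL, 0, &ep) != MX_SUCCESS) {
        fprintf(stderr, "mx_open_endpoint failed\n");
        return 1;
    }

    /* Shut the library down while the endpoint handle is still around... */
    mx_finalize();

    /* ...then attempt a connect with the stale handle, as the 1.6.1 MX BTL
     * apparently does.  This is invalid use of the API: the peer lookup in
     * the driver can no longer succeed, which is consistent with the
     * "Failed to lookup peer by addr" fatal error in the backtrace below. */
    ret = mx_connect(ep, 0 /* placeholder NIC id */, 0, 0, 1000 /* ms */, &peer_addr);
    fprintf(stderr, "mx_connect after mx_finalize returned %d\n", (int)ret);

    return 0;
}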
Brice


On 12/09/2012 15:48, Douglas Eadline wrote:
>> Resending this mail again, with another SMTP. Please re-add
>> open-mx-devel to CC when you reply.
>>
>> Brice
>>
>>
>> On 07/09/2012 00:10, Brice Goglin wrote:
>>> Hello Doug,
>>>
>>> Did you use the same Open-MX version when it worked fine? Same kernel too?
>>> Any chance you built OMPI over an old OMX that would not be compatible
>>> with 1.5.2?
> I double checked, and even rebuilt both Open MPI and MPICH2
> with 1.5.2.
>
> Running on a 4-node cluster with Warewulf provisioning. See below:
>
>>> The error below tells that the OMX driver and lib don't speak the same
>>> language. EBADF is almost never returned from the OMX driver. The only
>>> case is when talking to /dev/open-mx-raw, but normal applications don't
>>> do this. That's why I am thinking about OMPI using an old library that
>>> cannot talk to a new driver. We have checks to prevent this, but we never
>>> know.
>>>
>>> Do you see anything in dmesg?
> no
>>> Is omx_info OK?
> yes, it shows:
>
> Open-MX version 1.5.2
>  build: deadline@limulus:/raid1/home/deadline/rpms-sl6/BUILD/open-mx-1.5.2 Mon Sep 10 08:44:16 EDT 2012
>
> Found 1 boards (32 max) supporting 32 endpoints each:
>  n0:0 (board #0 name eth0 addr 00:1a:4d:4a:bf:85)
>    managed by driver 'r8169'
>
> Peer table is ready, mapper is 00:00:00:00:00:00
> ================================================
>   0) 00:1a:4d:4a:bf:85 n0:0
>   1) 00:1c:c0:9b:66:d0 n1:0
>   2) 00:1a:4d:4a:bf:83 n2:0
>   3) e0:69:95:35:d7:71 limulus:0
>
>>> Does a basic omx_perf work? (see
>>> http://open-mx.gforge.inria.fr/FAQ/index-1.5.html#perf-omxperf)
> yes, checked host -> each node, it works. And MPICH2 compiled
> with the same libraries works. What else can I check?
>
> --
> Doug
>
>>> Brice
>>>
>>>
>>> On 06/09/2012 23:04, Douglas Eadline wrote:
>>>> I built Open MPI 1.6.1 using the Open-MX libraries.
>>>> This worked previously and now I get the following
>>>> error. Here is my system:
>>>>
>>>> kernel: 2.6.32-279.5.1.el6.x86_64
>>>> open-mx: 1.5.2
>>>>
>>>> BTW, Open-MX worked previously with Open MPI, and the current
>>>> version works with MPICH2.
>>>>
>>>> $ mpiexec -np 8 -machinefile machines cpi
>>>> Process 0 on limulus
>>>> FatalError: Failed to lookup peer by addr, driver replied Bad file descriptor
>>>> cpi: ../omx_misc.c:89: omx__ioctl_errno_to_return_checked: Assertion `0' failed.
>>>> [limulus:04448] *** Process received signal ***
>>>> [limulus:04448] Signal: Aborted (6)
>>>> [limulus:04448] Signal code: (-6)
>>>> [limulus:04448] [ 0] /lib64/libpthread.so.0() [0x3324e0f500]
>>>> [limulus:04448] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x33246328a5]
>>>> [limulus:04448] [ 2] /lib64/libc.so.6(abort+0x175) [0x3324634085]
>>>> [limulus:04448] [ 3] /lib64/libc.so.6() [0x332462ba1e]
>>>> [limulus:04448] [ 4] /lib64/libc.so.6(__assert_perror_fail+0) [0x332462bae0]
>>>> [limulus:04448] [ 5] /usr/open-mx/lib/libopen-mx.so.0(omx__ioctl_errno_to_return_checked+0x197) [0x7fb587418b37]
>>>> [limulus:04448] [ 6] /usr/open-mx/lib/libopen-mx.so.0(omx__peer_addr_to_index+0x55) [0x7fb58741a5d5]
>>>> [limulus:04448] [ 7] /usr/open-mx/lib/libopen-mx.so.0(+0xdc7a) [0x7fb587419c7a]
>>>> [limulus:04448] [ 8] /usr/open-mx/lib/libopen-mx.so.0(omx_connect+0x8c) [0x7fb58741a27c]
>>>> [limulus:04448] [ 9] /usr/open-mx/lib/libopen-mx.so.0(mx_connect+0x15) [0x7fb587425865]
>>>> [limulus:04448] [10] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_btl_mx_proc_connect+0x5e) [0x7fb5876fe40e]
>>>> [limulus:04448] [11] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_btl_mx_send+0x2d4) [0x7fb5876fbd94]
>>>> [limulus:04448] [12] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_pml_ob1_send_request_start_prepare+0xcb) [0x7fb58777d6fb]
>>>> [limulus:04448] [13] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_pml_ob1_isend+0x4cb) [0x7fb58777509b]
>>>> [limulus:04448] [14] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(ompi_coll_tuned_bcast_intra_generic+0x37b) [0x7fb58770b55b]
>>>> [limulus:04448] [15] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(ompi_coll_tuned_bcast_intra_binomial+0xd8) [0x7fb58770b8b8]
>>>> [limulus:04448] [16] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(ompi_coll_tuned_bcast_intra_dec_fixed+0xcc) [0x7fb587702d8c]
>>>> [limulus:04448] [17] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_coll_sync_bcast+0x78) [0x7fb587712e88]
>>>> [limulus:04448] [18] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(MPI_Bcast+0x130) [0x7fb5876ce1b0]
>>>> [limulus:04448] [19] cpi(main+0x10b) [0x400cc4]
>>>> [limulus:04448] [20] /lib64/libc.so.6(__libc_start_main+0xfd) [0x332461ecdd]
>>>> [limulus:04448] [21] cpi() [0x400ac9]
>>>> [limulus:04448] *** End of error message ***
>>>> Process 2 on limulus
>>>> Process 4 on limulus
>>>> Process 6 on limulus
>>>> Process 1 on n0
>>>> Process 7 on n0
>>>> Process 3 on n0
>>>> Process 5 on n0
>>>> --------------------------------------------------------------------------
>>>> mpiexec noticed that process rank 0 with PID 4448 on node limulus exited
>>>> on signal 6 (Aborted).
>>>> --------------------------------------------------------------------------