Here are the r numbers with notable MX changes recently:

https://svn.open-mpi.org/trac/ompi/changeset/26760
https://svn.open-mpi.org/trac/ompi/changeset/26759
https://svn.open-mpi.org/trac/ompi/changeset/26698
https://svn.open-mpi.org/trac/ompi/changeset/26626
https://svn.open-mpi.org/trac/ompi/changeset/26194
https://svn.open-mpi.org/trac/ompi/changeset/26180
https://svn.open-mpi.org/trac/ompi/changeset/25445
https://svn.open-mpi.org/trac/ompi/changeset/25043
https://svn.open-mpi.org/trac/ompi/changeset/24858
https://svn.open-mpi.org/trac/ompi/changeset/24460
https://svn.open-mpi.org/trac/ompi/changeset/23996
https://svn.open-mpi.org/trac/ompi/changeset/23925
https://svn.open-mpi.org/trac/ompi/changeset/23801
https://svn.open-mpi.org/trac/ompi/changeset/23764
https://svn.open-mpi.org/trac/ompi/changeset/23714
https://svn.open-mpi.org/trac/ompi/changeset/23713
https://svn.open-mpi.org/trac/ompi/changeset/23712
That goes back over a year.

On Sep 12, 2012, at 11:39 AM, Brice Goglin wrote:

> (I am bringing back OMPI users to CC)
>
> I reproduced the problem with OMPI 1.6.1 and found the cause.
> mx_finalize() is called before this error occurs, so the error is
> expected: calling mx_connect() after mx_finalize() is invalid.
> It looks like the MX component changed significantly between OMPI 1.5
> and 1.6.1, and I am pretty sure it worked fine with early 1.5.x versions.
> Can somebody comment on what was changed in the MX BTL component in
> late 1.5 versions?
>
> Brice
>
> On 12/09/2012 15:48, Douglas Eadline wrote:
>>> Resending this mail again, with another SMTP. Please re-add
>>> open-mx-devel to CC when you reply.
>>>
>>> Brice
>>>
>>> On 07/09/2012 00:10, Brice Goglin wrote:
>>>> Hello Doug,
>>>>
>>>> Did you use the same Open-MX version when it worked fine? Same kernel too?
>>>> Any chance you built OMPI over an old OMX that would not be compatible
>>>> with 1.5.2?
>> I double-checked, and even rebuilt both Open MPI and MPICH2
>> with 1.5.2.
>>
>> Running on a 4-node cluster with Warewulf provisioning. See below:
>>
>>>> The error below tells us that the OMX driver and library don't speak the
>>>> same language. EBADF is almost never returned from the OMX driver; the
>>>> only case is when talking to /dev/open-mx-raw, but normal applications
>>>> don't do this. That's why I am thinking about OMPI using an old library
>>>> that cannot talk to a new driver. We have checks to prevent this, but we
>>>> never know.
>>>>
>>>> Do you see anything in dmesg?
>> no
>>>> Is omx_info OK?
>> yes, it shows:
>>
>> Open-MX version 1.5.2
>> build: deadline@limulus:/raid1/home/deadline/rpms-sl6/BUILD/open-mx-1.5.2 Mon Sep 10 08:44:16 EDT 2012
>>
>> Found 1 boards (32 max) supporting 32 endpoints each:
>> n0:0 (board #0 name eth0 addr 00:1a:4d:4a:bf:85)
>> managed by driver 'r8169'
>>
>> Peer table is ready, mapper is 00:00:00:00:00:00
>> ================================================
>> 0) 00:1a:4d:4a:bf:85 n0:0
>> 1) 00:1c:c0:9b:66:d0 n1:0
>> 2) 00:1a:4d:4a:bf:83 n2:0
>> 3) e0:69:95:35:d7:71 limulus:0
>>
>>>> Does a basic omx_perf work? (see
>>>> http://open-mx.gforge.inria.fr/FAQ/index-1.5.html#perf-omxperf)
>> yes, I checked host -> each node and it works. And MPICH2 compiled
>> with the same libraries works. What else can I check?
>>
>> --
>> Doug
>>
>>>> Brice
>>>>
>>>> On 06/09/2012 23:04, Douglas Eadline wrote:
>>>>> I built open-mpi 1.6.1 using the open-mx libraries.
>>>>> This worked previously and now I get the following
>>>>> error. Here is my system:
>>>>>
>>>>> kernel: 2.6.32-279.5.1.el6.x86_64
>>>>> open-mx: 1.5.2
>>>>>
>>>>> BTW, open-mx worked previously with open-mpi, and the current
>>>>> version works with mpich2.
>>>>>
>>>>> $ mpiexec -np 8 -machinefile machines cpi
>>>>> Process 0 on limulus
>>>>> FatalError: Failed to lookup peer by addr, driver replied Bad file descriptor
>>>>> cpi: ../omx_misc.c:89: omx__ioctl_errno_to_return_checked: Assertion `0' failed.
>>>>> [limulus:04448] *** Process received signal ***
>>>>> [limulus:04448] Signal: Aborted (6)
>>>>> [limulus:04448] Signal code: (-6)
>>>>> [limulus:04448] [ 0] /lib64/libpthread.so.0() [0x3324e0f500]
>>>>> [limulus:04448] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x33246328a5]
>>>>> [limulus:04448] [ 2] /lib64/libc.so.6(abort+0x175) [0x3324634085]
>>>>> [limulus:04448] [ 3] /lib64/libc.so.6() [0x332462ba1e]
>>>>> [limulus:04448] [ 4] /lib64/libc.so.6(__assert_perror_fail+0) [0x332462bae0]
>>>>> [limulus:04448] [ 5] /usr/open-mx/lib/libopen-mx.so.0(omx__ioctl_errno_to_return_checked+0x197) [0x7fb587418b37]
>>>>> [limulus:04448] [ 6] /usr/open-mx/lib/libopen-mx.so.0(omx__peer_addr_to_index+0x55) [0x7fb58741a5d5]
>>>>> [limulus:04448] [ 7] /usr/open-mx/lib/libopen-mx.so.0(+0xdc7a) [0x7fb587419c7a]
>>>>> [limulus:04448] [ 8] /usr/open-mx/lib/libopen-mx.so.0(omx_connect+0x8c) [0x7fb58741a27c]
>>>>> [limulus:04448] [ 9] /usr/open-mx/lib/libopen-mx.so.0(mx_connect+0x15) [0x7fb587425865]
>>>>> [limulus:04448] [10] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_btl_mx_proc_connect+0x5e) [0x7fb5876fe40e]
>>>>> [limulus:04448] [11] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_btl_mx_send+0x2d4) [0x7fb5876fbd94]
>>>>> [limulus:04448] [12] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_pml_ob1_send_request_start_prepare+0xcb) [0x7fb58777d6fb]
>>>>> [limulus:04448] [13] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_pml_ob1_isend+0x4cb) [0x7fb58777509b]
>>>>> [limulus:04448] [14] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(ompi_coll_tuned_bcast_intra_generic+0x37b) [0x7fb58770b55b]
>>>>> [limulus:04448] [15] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(ompi_coll_tuned_bcast_intra_binomial+0xd8) [0x7fb58770b8b8]
>>>>> [limulus:04448] [16] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(ompi_coll_tuned_bcast_intra_dec_fixed+0xcc) [0x7fb587702d8c]
>>>>> [limulus:04448] [17] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_coll_sync_bcast+0x78) [0x7fb587712e88]
>>>>> [limulus:04448] [18] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(MPI_Bcast+0x130) [0x7fb5876ce1b0]
>>>>> [limulus:04448] [19] cpi(main+0x10b) [0x400cc4]
>>>>> [limulus:04448] [20] /lib64/libc.so.6(__libc_start_main+0xfd) [0x332461ecdd]
>>>>> [limulus:04448] [21] cpi() [0x400ac9]
>>>>> [limulus:04448] *** End of error message ***
>>>>> Process 2 on limulus
>>>>> Process 4 on limulus
>>>>> Process 6 on limulus
>>>>> Process 1 on n0
>>>>> Process 7 on n0
>>>>> Process 3 on n0
>>>>> Process 5 on n0
>>>>> --------------------------------------------------------------------------
>>>>> mpiexec noticed that process rank 0 with PID 4448 on node limulus
>>>>> exited on signal 6 (Aborted).
>>>>> --------------------------------------------------------------------------

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
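One way to read Brice's diagnosis: once mx_finalize() has torn the library down (presumably closing the descriptors it keeps on the Open-MX device), a later mx_connect() does its peer lookup through a descriptor that is no longer open, the ioctl() fails with EBADF, and omx__ioctl_errno_to_return_checked() asserts because it never expects that errno from the driver. The short standalone C sketch below reproduces only that descriptor-level behaviour; it is an illustration, not Open-MX code, and /dev/null plus FIONREAD merely stand in for the real device and its peer-lookup ioctl.

/* Hedged illustration of "use after finalize" turning into EBADF.
 * NOT Open-MX source: the device and ioctl request are stand-ins. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    /* Stand-in for the descriptor the communication library keeps open. */
    int fd = open("/dev/null", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Roughly what finalization does: the descriptor goes away. */
    close(fd);

    /* Roughly what a post-finalize connect boils down to: an ioctl() on a
     * descriptor that is no longer open.  FIONREAD is used only because it
     * is a convenient request number; the real peer-lookup ioctl differs. */
    int nbytes = 0;
    if (ioctl(fd, FIONREAD, &nbytes) < 0)
        printf("driver replied: %s\n", strerror(errno)); /* Bad file descriptor */

    return 0;
}

If this reading is right, the bug is an ordering problem in the MX BTL's shutdown path (a lazy connect attempted after the component has already finalized MX) rather than in Open-MX itself, which is consistent with Brice's question about what changed in that component between 1.5 and 1.6.1.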