Re: [OMPI users] Ensuring use of real cores
I do want in fact to bind first to one HT of each core before binding to two HTs of one core. So that will be possible in 1.7?

Thx,
John

On 9/11/12 11:19 PM, Ralph Castain wrote:
> Not entirely sure I know what you mean. If you are talking about running without specifying binding, then it makes no difference - we'll run wherever the OS puts us, so you would need to tell the OS not to use the virtual cores (i.e., disable HT).
>
> If you are talking about binding, then pre-1.7 releases all bind to core at the lowest level. On a hyperthread-enabled machine, that binds you to both HTs of a core. Starting with the upcoming 1.7 release, you can bind to the separate HTs, but that doesn't sound like something you want to do.
>
> HTH
> Ralph
>
> On Sep 11, 2012, at 6:34 PM, John R. Cary wrote:
>
>> Our code gets little benefit from using virtual cores (hyperthreading), so when we run with mpiexec on an 8 real plus 8 virtual machine, we would like to be certain that it uses only the 8 real cores.
>>
>> Is there a way to do this with openmpi?
>>
>> Thx,
>> John
Re: [OMPI users] Ensuring use of real cores
On Sep 12, 2012, at 4:57 AM, "John R. Cary" wrote:

> I do want in fact to bind first to one HT of each core before binding to two HTs of one core. So that will be possible in 1.7?

Yes - you can get a copy of the 1.7 nightly tarball and experiment with it in advance, if you like. You'll want

  mpirun --map-by core --bind-to hwthread

Add --report-bindings to see what happens, but I believe that will do what you want. You'll map one process to each core, and bind it to only the first hwthread on that core.

Let me know either way - if it doesn't, we have time to adjust it.
Re: [OMPI users] Ensuring use of real cores
Thanks!

John

On 9/12/12 8:05 AM, Ralph Castain wrote:
> Yes - you can get a copy of the 1.7 nightly tarball and experiment with it in advance, if you like. You'll want
>
>   mpirun --map-by core --bind-to hwthread
>
> Add --report-bindings to see what happens, but I believe that will do what you want. You'll map one process to each core, and bind it to only the first hwthread on that core.
Re: [OMPI users] Ensuring use of real cores
A little more on this (since affinity is one of my favorite topics of late :-) ). See my blog entries about what we just did in the 1.7 branch (and SVN trunk):

  http://blogs.cisco.com/performance/taking-mpi-process-affinity-to-the-next-level/
  http://blogs.cisco.com/performance/process-affinity-in-ompi-v1-7-part-1/
  http://blogs.cisco.com/performance/process-affinity-in-ompi-v1-7-part-2/

As Ralph said, the v1.6 series will allow you to bind processes to the entire core (i.e., all hyperthreads in a core), or to an entire socket (i.e., all hyperthreads in a socket).

The v1.7 series will be quite a bit more flexible in its affinity options (note that the "Expert" mode described in my blog posting will be coming in v1.7.1 -- if you want to try that now, you'll need to use the SVN trunk).

For example:

  mpirun --report-bindings --map-by core --bind-to hwthread ...

should give you the pattern you want. Note that it looks like we have a bug in this pattern at the moment, however -- you'll need to use the SVN trunk and the "lama" mapper to get the patterns you want. The following example is running on a Sandy Bridge server with 2 sockets, each with 8 cores, each with 2 hyperthreads.

One hyperthread per core:

-----
% mpirun --mca rmaps lama -np 4 --host svbu-mpi058 --report-bindings --map-by core --bind-to hwthread uptime
[svbu-mpi058:23916] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B./../../../../../../..][../../../../../../../..]
[svbu-mpi058:23916] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [../B./../../../../../..][../../../../../../../..]
[svbu-mpi058:23916] MCW rank 2 bound to socket 0[core 2[hwt 0]]: [../../B./../../../../..][../../../../../../../..]
[svbu-mpi058:23916] MCW rank 3 bound to socket 0[core 3[hwt 0]]: [../../../B./../../../..][../../../../../../../..]
 06:48:51 up 1 day, 12:08, 0 users, load average: 0.00, 0.01, 0.05
 06:48:51 up 1 day, 12:08, 0 users, load average: 0.00, 0.01, 0.05
 06:48:51 up 1 day, 12:08, 0 users, load average: 0.00, 0.01, 0.05
 06:48:51 up 1 day, 12:08, 0 users, load average: 0.00, 0.01, 0.05
-----

Both hyperthreads per core:

-----
% mpirun --mca rmaps lama -np 4 --host svbu-mpi058 --report-bindings --map-by core --bind-to core uptime
[svbu-mpi058:23951] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
[svbu-mpi058:23951] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
[svbu-mpi058:23951] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../..][../../../../../../../..]
[svbu-mpi058:23951] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../..][../../../../../../../..]
 06:48:57 up 1 day, 12:09, 0 users, load average: 0.00, 0.01, 0.05
 06:48:57 up 1 day, 12:09, 0 users, load average: 0.00, 0.01, 0.05
 06:48:57 up 1 day, 12:09, 0 users, load average: 0.00, 0.01, 0.05
 06:48:57 up 1 day, 12:09, 0 users, load average: 0.00, 0.01, 0.05
-----

On Sep 12, 2012, at 8:10 AM, John R. Cary wrote:

> Thanks!
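A complementary way to double-check these bindings is to query the affinity mask from inside the application itself and compare it against the --report-bindings output. The following is a minimal, Linux-specific sketch (not from this thread; the use of sched_getaffinity() and the printed format are illustrative assumptions), printing the logical CPUs each rank is allowed to run on:

-----
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, cpu;
    cpu_set_t mask;
    char host[256];
    char cpus[1024] = "";
    size_t off = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    gethostname(host, sizeof(host));

    /* Ask the kernel which logical CPUs (hwthreads) this rank may use. */
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        for (cpu = 0; cpu < CPU_SETSIZE && off < sizeof(cpus) - 8; cpu++) {
            if (CPU_ISSET(cpu, &mask)) {
                off += snprintf(cpus + off, sizeof(cpus) - off, "%d ", cpu);
            }
        }
    }
    printf("rank %d on %s is allowed on CPUs: %s\n", rank, host, cpus);

    MPI_Finalize();
    return 0;
}
-----

Compile it with mpicc and launch it with the mpirun command lines shown above to confirm that each rank sees only the hwthreads it was bound to.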
Re: [OMPI users] Setting RPATH for Open MPI libraries
We have a long-standing philosophy that OMPI should add the bare minimum number of preprocessor/compiler/linker flags to its wrapper compilers, and let the user/administrator customize from there.

That being said, a looong time ago, I started a patch to add a --with-rpath option to configure, but never finished it. :-\ I can try to get it back on my to-do list.

For the moment, you might want to try the configure --enable-mpirun-prefix-by-default option, too.

On Sep 11, 2012, at 2:31 PM, Jed Brown wrote:

> On Tue, Sep 11, 2012 at 2:29 PM, Reuti wrote:
>> With "user" you mean someone compiling Open MPI?
>
> Yes

--
Jeff Squyres
jsquy...@cisco.com
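Until a --with-rpath option exists, the rpath can be baked in by hand. A minimal sketch, assuming an installation prefix of /opt/openmpi (a placeholder, not a path from this thread): either pass the flag through the wrapper compilers at Open MPI configure time, or add it per application at link time.

-----
# Bake the rpath into everything the wrapper compilers link:
./configure --prefix=/opt/openmpi \
    --with-wrapper-ldflags="-Wl,-rpath,/opt/openmpi/lib"

# ...or add it for a single application at link time:
mpicc app.c -o app -Wl,-rpath,/opt/openmpi/lib
-----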
Re: [OMPI users] [omx-devel] Open-mx issue with ompi 1.6.1
(I am bringing back OMPI users to CC)

I reproduced the problem with OMPI 1.6.1 and found the cause: mx_finalize() is called before this error occurs, so the error is expected, because calling mx_connect() after mx_finalize() is invalid. It looks like the MX component changed significantly between OMPI 1.5 and 1.6.1, and I am pretty sure it worked fine with early 1.5.x versions. Can somebody comment on what was changed in the MX BTL component in late 1.5 versions?

Brice

On 12/09/2012 15:48, Douglas Eadline wrote:
>>> Did you use the same Open-MX version when it worked fine? Same kernel too?
>>> Any chance you built OMPI over an old OMX that would not be compatible with 1.5.2?
>
> I double checked, and even rebuilt both Open MPI and MPICH2 with 1.5.2.
>
> Running on a 4 node cluster with Warewulf provisioning. See below:
>
>>> The error below tells that the OMX driver and lib don't speak the same language. EBADF is almost never returned from the OMX driver. The only case is when talking to /dev/open-mx-raw, but normal applications don't do this. That's why I am thinking about OMPI using an old library that cannot talk to a new driver. We have checks to prevent this, but we never know.
>>>
>>> Do you see anything in dmesg?
>
> no
>
>>> Is omx_info OK?
>
> yes, shows:
>
>   Open-MX version 1.5.2
>    build: deadline@limulus:/raid1/home/deadline/rpms-sl6/BUILD/open-mx-1.5.2 Mon Sep 10 08:44:16 EDT 2012
>
>   Found 1 boards (32 max) supporting 32 endpoints each:
>    n0:0 (board #0 name eth0 addr 00:1a:4d:4a:bf:85)
>      managed by driver 'r8169'
>
>   Peer table is ready, mapper is 00:00:00:00:00:00
>    0) 00:1a:4d:4a:bf:85 n0:0
>    1) 00:1c:c0:9b:66:d0 n1:0
>    2) 00:1a:4d:4a:bf:83 n2:0
>    3) e0:69:95:35:d7:71 limulus:0
>
>>> Does a basic omx_perf work? (see http://open-mx.gforge.inria.fr/FAQ/index-1.5.html#perf-omxperf)
>
> yes, checked host -> each node it works. And mpich2 compiled with same libraries works. What else can I check?
>
> --
> Doug
>
>>> On 06/09/2012 23:04, Douglas Eadline wrote:
>>>> I built open-mpi 1.6.1 using the open-mx libraries. This worked previously and now I get the following error.
>>>>
>>>> Here is my system:
>>>>
>>>>   kernel:  2.6.32-279.5.1.el6.x86_64
>>>>   open-mx: 1.5.2
>>>>
>>>> BTW, open-mx worked previously with open-mpi and the current version works with mpich2
>>>>
>>>> $ mpiexec -np 8 -machinefile machines cpi
>>>> Process 0 on limulus
>>>> FatalError: Failed to lookup peer by addr, driver replied Bad file descriptor
>>>> cpi: ../omx_misc.c:89: omx__ioctl_errno_to_return_checked: Assertion `0' failed.
>>>> [limulus:04448] *** Process received signal ***
>>>> [limulus:04448] Signal: Aborted (6)
>>>> [limulus:04448] Signal code: (-6)
>>>> [limulus:04448] [ 0] /lib64/libpthread.so.0() [0x3324e0f500]
>>>> [limulus:04448] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x33246328a5]
>>>> [limulus:04448] [ 2] /lib64/libc.so.6(abort+0x175) [0x3324634085]
>>>> [limulus:04448] [ 3] /lib64/libc.so.6() [0x332462ba1e]
>>>> [limulus:04448] [ 4] /lib64/libc.so.6(__assert_perror_fail+0) [0x332462bae0]
>>>> [limulus:04448] [ 5] /usr/open-mx/lib/libopen-mx.so.0(omx__ioctl_errno_to_return_checked+0x197) [0x7fb587418b37]
>>>> [limulus:04448] [ 6] /usr/open-mx/lib/libopen-mx.so.0(omx__peer_addr_to_index+0x55) [0x7fb58741a5d5]
>>>> [limulus:04448] [ 7] /usr/open-mx/lib/libopen-mx.so.0(+0xdc7a) [0x7fb587419c7a]
>>>> [limulus:04448] [ 8] /usr/open-mx/lib/libopen-mx.so.0(omx_connect+0x8c) [0x7fb58741a27c]
>>>> [limulus:04448] [ 9] /usr/open-mx/lib/libopen-mx.so.0(mx_connect+0x15) [0x7fb587425865]
>>>> [limulus:04448] [10] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_btl_mx_proc_connect+0x5e) [0x7fb5876fe40e]
>>>> [limulus:04448] [11] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_btl_mx_send+0x2d4) [0x7fb5876fbd94]
>>>> [limulus:04448] [12] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_pml_ob1_send_request_start_prepare+0xcb) [0x7fb58777d6fb]
>>>> [limulus:04448] [13] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_pml_ob1_isend+0x4cb) [0x7fb58777509b]
>>>> [limulus:04448] [14] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(ompi_coll_tuned_bcast_intra_generic+0x37b) [0x7fb58770b55b]
>>>> [limulus:04448] [15] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(ompi_coll_tuned_bcast_intra_binomial+0xd8) [0x7fb58770b8b8]
>>>> [limulus:04448] [16] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(ompi_coll_tuned_bcast_intra_dec_fixed+0xcc) [0x7fb587702d8c]
>>>> [limulus:04448] [17] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_coll_sync_bcast+0x78) [0x7fb587712e88]
>>>> [limulus:04448] [18] /opt/mpi/openmpi-gnu4/lib64/li
Re: [OMPI users] [omx-devel] Open-mx issue with ompi 1.6.1
Here's the r numbers with notable MX changes recently:

  https://svn.open-mpi.org/trac/ompi/changeset/26760
  https://svn.open-mpi.org/trac/ompi/changeset/26759
  https://svn.open-mpi.org/trac/ompi/changeset/26698
  https://svn.open-mpi.org/trac/ompi/changeset/26626
  https://svn.open-mpi.org/trac/ompi/changeset/26194
  https://svn.open-mpi.org/trac/ompi/changeset/26180
  https://svn.open-mpi.org/trac/ompi/changeset/25445
  https://svn.open-mpi.org/trac/ompi/changeset/25043
  https://svn.open-mpi.org/trac/ompi/changeset/24858
  https://svn.open-mpi.org/trac/ompi/changeset/24460
  https://svn.open-mpi.org/trac/ompi/changeset/23996
  https://svn.open-mpi.org/trac/ompi/changeset/23925
  https://svn.open-mpi.org/trac/ompi/changeset/23801
  https://svn.open-mpi.org/trac/ompi/changeset/23764
  https://svn.open-mpi.org/trac/ompi/changeset/23714
  https://svn.open-mpi.org/trac/ompi/changeset/23713
  https://svn.open-mpi.org/trac/ompi/changeset/23712

That goes back over a year.

On Sep 12, 2012, at 11:39 AM, Brice Goglin wrote:

> I reproduced the problem with OMPI 1.6.1 and found the cause: mx_finalize() is called before this error occurs, so the error is expected, because calling mx_connect() after mx_finalize() is invalid. It looks like the MX component changed significantly between OMPI 1.5 and 1.6.1, and I am pretty sure it worked fine with early 1.5.x versions. Can somebody comment on what was changed in the MX BTL component in late 1.5 versions?
Re: [OMPI users] [omx-devel] Open-mx issue with ompi 1.6.1
I don't recall any major modification in the MX BTL for the 1.6 series, with the exception of the rework of the initialization part. Those patches dealt with avoiding the double initialization (BTL and MTL), so we might want to start looking at those.

mx_finalize is called only from ompi_common_mx_finalize, which in turn is called from both the MTL and the BTL modules. As we do manage the reference counts in the init/finalize of the MX, I can only picture two possible scenarios:

1. Somehow there are pending communications and they are activated after MPI_Finalize (very unlikely).
2. One of the MTL or BTL initializations calls ompi_common_mx_finalize without a successful corresponding call to ompi_common_mx_init.

As the BTL is the one segfaulting, I guess the culprit is the MTL. I'll take a look at this in more detail.

  george.

On Sep 12, 2012, at 17:39, Brice Goglin wrote:

> I reproduced the problem with OMPI 1.6.1 and found the cause: mx_finalize() is called before this error occurs, so the error is expected, because calling mx_connect() after mx_finalize() is invalid. It looks like the MX component changed significantly between OMPI 1.5 and 1.6.1, and I am pretty sure it worked fine with early 1.5.x versions. Can somebody comment on what was changed in the MX BTL component in late 1.5 versions?
Re: [OMPI users] [omx-devel] Open-mx issue with ompi 1.6.1
On 12/09/2012 17:57, Jeff Squyres wrote:
> Here's the r numbers with notable MX changes recently:
>
>   https://svn.open-mpi.org/trac/ompi/changeset/26698

Reverting this one fixes the problem. And adding --mca mtl ^mx to the command works too (Doug, can you try that?)

The problem is that the MTL component calls ompi_common_mx_initialize() only once in component_init(), but it calls finalize() twice: once in component_close() and once in ompi_mtl_mx_finalize(). The attached patch seems to work.

Signed-off-by: Brice Goglin

Brice

diff --git a/ompi/mca/mtl/mx/mtl_mx.c b/ompi/mca/mtl/mx/mtl_mx.c
index c3e92f3..2c07cc9 100644
--- a/ompi/mca/mtl/mx/mtl_mx.c
+++ b/ompi/mca/mtl/mx/mtl_mx.c
@@ -151,7 +151,7 @@ ompi_mtl_mx_finalize(struct mca_mtl_base_module_t* mtl) {
         return OMPI_ERROR;
     }
 
-    return ompi_common_mx_finalize();
+    return OMPI_SUCCESS;
 }
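For completeness, the workaround mentioned above can be applied directly to the failing command from Doug's report; this sketch simply combines the two (the executable and machinefile are the ones from the original report):

-----
$ mpiexec --mca mtl ^mx -np 8 -machinefile machines cpi
-----

This keeps the MX MTL from being selected and, per the diagnosis above, from calling ompi_common_mx_finalize() a second time before the MX BTL is done with the library.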
Re: [OMPI users] MPI_Spawn and process allocation policy
On Wed, Aug 17, 2011 at 12:05 AM, Simone Pellegrini wrote:

> On 08/16/2011 11:15 PM, Ralph Castain wrote:
>> I'm not finding a bug - the code looks clean. If I send you a patch, could you apply it, rebuild, and send me the resulting debug output?
>
> yes, I could do that. No problem.
>
> thanks again, Simone

Hi all -

Did this ever get resolved? I've been having trouble specifying the hosts to run my spawned processes on. Is there an example of how to do this? Also, I'm assuming that any host given must also be listed in the hostfile?

Thanks,
  Brian
Re: [OMPI users] Ensuring use of real cores
Okay, we've dug a bit and talked a lot :-)

If you want to do this with the default mapping system, just add "--use-hwthreads-cpus" so we use the hwthreads as independent processors. Otherwise, you can only bind down to the core level, since we aren't treating the HTs inside each core as separate entities.

On Sep 12, 2012, at 6:50 AM, Jeff Squyres wrote:

> A little more on this (since affinity is one of my favorite topics of late :-) ). See my blog entries about what we just did in the 1.7 branch (and SVN trunk):
>
>   http://blogs.cisco.com/performance/taking-mpi-process-affinity-to-the-next-level/
>   http://blogs.cisco.com/performance/process-affinity-in-ompi-v1-7-part-1/
>   http://blogs.cisco.com/performance/process-affinity-in-ompi-v1-7-part-2/
Re: [OMPI users] MPI_Spawn and process allocation policy
On Sep 12, 2012, at 9:55 AM, Brian Budge wrote:

> Did this ever get resolved?

Not sure to what you are referring - this thread ran all over the place, and I confess I've lost track.

> I've been having trouble specifying the hosts to run my spawned processes on. Is there an example of how to do this?

Did you look at "man MPI_Comm_spawn"? It lists all the info keys we recognize and what they do.

> Also, I'm assuming that any host given must also be listed in the hostfile?

Not in the updated 1.6 branch and above - to include the upcoming 1.6.2 release.
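For anyone looking for a worked example of those info keys, here is a minimal sketch (not from this thread: the worker executable, host names, and process count are placeholders) of a parent using the "host" key described in the man page to place the spawned processes:

-----
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm intercomm;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    /* "host" is one of the info keys listed in "man MPI_Comm_spawn";
       the host names here are placeholders for real node names. */
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", "node1,node2");

    /* Spawn 2 copies of ./worker (a placeholder executable) on those hosts.
       The workers obtain this intercommunicator via MPI_Comm_get_parent(). */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 2, info,
                   0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

    MPI_Info_free(&info);
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}
-----

Per Ralph's note above, with the updated 1.6 branch and later releases the hosts named in the info key no longer have to appear in the hostfile.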
Re: [OMPI users] Setting RPATH for Open MPI libraries
On Wed, Sep 12, 2012 at 10:20 AM, Jeff Squyres wrote:

> We have a long-standing philosophy that OMPI should add the bare minimum number of preprocessor/compiler/linker flags to its wrapper compilers, and let the user/administrator customize from there.

In general, I agree with that philosophy.

> That being said, a looong time ago, I started a patch to add a --with-rpath option to configure, but never finished it. :-\ I can try to get it back on my to-do list.

That would be perfect.

> For the moment, you might want to try the configure --enable-mpirun-prefix-by-default option, too.

The downside is that we tend not to bother with the mpirun for configure and it's a little annoying to "mpirun ldd" when hunting for other problems (e.g. a missing shared lib unrelated to Open MPI).
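For readers unfamiliar with the trick Jed mentions, "mpirun ldd" just means launching ldd under mpirun so library resolution is checked in the same remote environment the MPI processes will get; a minimal sketch (the host and binary names are placeholders):

-----
% mpirun -np 1 --host node1 ldd ./app
-----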
Re: [OMPI users] [omx-devel] Open-mx issue with ompi 1.6.1
George -- you're the owner of this part of the code. What do you want to do here?

On Sep 12, 2012, at 12:37 PM, Brice Goglin wrote:

> Reverting this one fixes the problem. And adding --mca mtl ^mx to the command works too (Doug, can you try that?)
>
> The problem is that the MTL component calls ompi_common_mx_initialize() only once in component_init(), but it calls finalize() twice: once in component_close() and once in ompi_mtl_mx_finalize(). The attached patch seems to work.

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI users] test for sctp on FreeBSD too narrow
Thank you! I'll put this on the trunk and the upcoming releases. This is too late for v1.6.2, but it can be in 1.6.3.

On Sep 10, 2012, at 4:22 PM, Brooks Davis wrote:

> The test for SCTP support in libc on FreeBSD only allows it to work on FreeBSD 7 (or I suppose 70 :). The attached patch expands the test to 7 through 19, which should be enough for a while. Hopefully by the time FreeBSD 19 is out, everything will have sctp support in libc or have dropped it. :)
>
> -- Brooks

--
Jeff Squyres
jsquy...@cisco.com