Re: [OMPI users] Building vs packaging
Hey Rob,

I don't know if this is what is going on, but in general, when a package is installed via a distro's package management system, it ends up in system locations such as /usr/bin and /usr/lib that are automatically searched when looking for executables and libraries. So, it isn't necessarily that the package maintainers did much of anything different when putting together the package; instead, they may have put files in locations that are more accessible from a system-tool point of view. For example, the runtime linker knows to search in several system-defined directories such as /usr/lib.

This might explain why everything worked after installing openmpi-bin: the binaries and libraries all ended up in system locations that are automatically a part of the environment on the remote node, so remote execution worked as it could find everything.

Thanks,
David

On 05/14/2016 05:37 AM, Rob Malpass wrote:
Hi all

I posted about a fortnight ago to this list as I was having some trouble getting my nodes to be controlled by my master node. Perceived wisdom at the time was to compile with --enable-orterun-prefix-by-default. For some time I'd been getting

cannot open libopen-rte.so.7

which points to a problem with LD_LIBRARY_PATH. I had been able to run it on nodes 3 and 4 even though (from the head node)

ssh node4 'echo $LD_LIBRARY_PATH'

returns a blank line. However, as I say, it's working on nodes 3 and 4. I had been hacking for ages on nodes 1 and 2, getting the same error but still with LD_LIBRARY_PATH apparently not set for an interactive login. Almost in desperation, I cheated:

sudo apt-get install openmpi-bin

and hey presto. I can now do (from the head node)

mpirun -H node2,node3,node4 -n 10 foo

and it works fine. So clearly apt-get install has set something that I'd not done (and it's seemingly not LD_LIBRARY_PATH), as

ssh node2 'echo $LD_LIBRARY_PATH'

still returns a blank line. Can anyone tell me what might be in the install script so I can get a clue?

Thanks

___
users mailing list
us...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/05/29201.php

--
David Shrader
HPC-ENV High Performance Computer Systems
Los Alamos National Lab
Email: dshrader lanl.gov
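A quick way to confirm this explanation from the head node is to ask the remote linker directly. A minimal sketch, assuming the library name from the error above (libopen-rte.so.7) and that orterun is on the remote PATH:

$> ssh node2 'ldconfig -p | grep libopen-rte'            # is the library in the system linker cache?
$> ssh node2 'ldd $(which orterun) | grep "not found"'   # any libraries the loader cannot resolve?

If the first command prints a match, the library lives in a system-searched directory and no LD_LIBRARY_PATH is needed on that node.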
[OMPI users] what was the rationale behind rank mapping by socket?
Hello All,

Would anyone know why the default mapping scheme is socket for jobs with more than 2 ranks? Could someone please take some time to explain the reasoning? Please note I am not railing against the decision, but rather trying to gather as much information about it as I can so as to better work with my users, who are just now starting to ask questions about it. The FAQ pretty much pushes folks to the man pages, and the mpirun man page doesn't go into the reasoning.

Thank you for your time,
David

--
David Shrader
HPC-ENV High Performance Computer Systems
Los Alamos National Lab
Email: dshrader lanl.gov

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
[OMPI users] how to tell if pmi or pmi2 is being used?
Hello All,

I'm using Open MPI 1.10.3 with Slurm and would like to ask: how do I find out whether pmi1 or pmi2 was used for process launching? The Slurm installation is supposed to support both pmi1 and pmi2, but I would really like to know which one I fall into. I tried using '-mca plm_base_verbose 100' on the mpirun line, but it didn't mention pmi specifically. Instead, all I could really find was that it was using the slurm component. Is there something else I can look at in the output that would have that detail?

Thank you for your time,
David

--
David Shrader
HPC-ENV High Performance Computer Systems
Los Alamos National Lab
Email: dshrader lanl.gov

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] how to tell if pmi or pmi2 is being used?
That is really good to know. Thanks!
David

On 10/13/2016 12:27 PM, r...@open-mpi.org wrote:
If you are using mpirun, then neither PMI1 nor PMI2 is involved at all. ORTE has its own internal mechanism for handling wireup.

On Oct 13, 2016, at 10:43 AM, David Shrader wrote:
Hello All,
I'm using Open MPI 1.10.3 with Slurm and would like to ask how to find out whether pmi1 or pmi2 was used for process launching. ...original question quoted in full; snipped...

--
David Shrader
HPC-ENV High Performance Computer Systems
Los Alamos National Lab
Email: dshrader lanl.gov

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
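For jobs launched with srun rather than mpirun, Slurm itself can report which PMI plugins it was built with. A minimal check, assuming a reasonably recent Slurm ('./a.out' stands in for your own binary):

$> srun --mpi=list              # lists the MPI/PMI plugin types this Slurm supports (e.g., pmi2)
$> srun --mpi=pmi2 -n 2 ./a.out # explicitly request pmi2 at launch

Note this shows what Slurm supports; as the reply above explains, mpirun bypasses PMI entirely.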
[OMPI users] question about "--rank-by slot" behavior
Hello All,

The man page for mpirun says that the default ranking procedure is round-robin by slot. It doesn't seem to be that straightforward to me, though, and I wanted to ask about the behavior. To help illustrate my confusion, here are a few examples where the ranking behavior changed based on the mapping behavior, which doesn't make sense to me yet. First, here is a simple map by core (using 4 nodes of 32 cpu cores each):

$> mpirun -n 128 --map-by core --report-bindings true
[gr0649.localdomain:119614] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././././.][./././././././././././././././././.]
[gr0649.localdomain:119614] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././././././.][./././././././././././././././././.]
[gr0649.localdomain:119614] MCW rank 2 bound to socket 0[core 2[hwt 0]]: [././B/././././././././././././././.][./././././././././././././././././.]
...output snipped...

Things look as I would expect: ranking happens round-robin through the cpu cores. Now, here's a map by socket example:

$> mpirun -n 128 --map-by socket --report-bindings true
[gr0649.localdomain:119926] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././././.][./././././././././././././././././.]
[gr0649.localdomain:119926] MCW rank 1 bound to socket 1[core 18[hwt 0]]: [./././././././././././././././././.][B/././././././././././././././././.]
[gr0649.localdomain:119926] MCW rank 2 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././././././.][./././././././././././././././././.]
...output snipped...

Why is rank 1 on a different socket? I know I am mapping by socket in this example, but, fundamentally, nothing should really be different in terms of ranking, correct? The same number of processes are available on each host as in the first example, and available in the same locations. How is "slot" different in this case? If I use "--rank-by core," I recover the output from the first example.

I thought that maybe "--rank-by slot" might be following something laid down by "--map-by", but the following example shows that isn't completely correct, either:

$> mpirun -n 128 --map-by socket:span --report-bindings true
[gr0649.localdomain:119319] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././././.][./././././././././././././././././.]
[gr0649.localdomain:119319] MCW rank 1 bound to socket 1[core 18[hwt 0]]: [./././././././././././././././././.][B/././././././././././././././././.]
[gr0649.localdomain:119319] MCW rank 2 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././././././.][./././././././././././././././././.]
...output snipped...

If ranking by slot were somehow following something left over by mapping, I would have expected rank 2 to end up on a different host. So, now I don't know what to expect from using "--rank-by slot." Does anyone have any pointers?

Thank you for the help!
David

--
David Shrader
HPC-ENV High Performance Computer Systems
Los Alamos National Lab
Email: dshrader lanl.gov

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] question about "--rank-by slot" behavior
Hello Ralph,

I do understand that "slot" is an abstract term and isn't tied down to any particular piece of hardware. What I am trying to understand is how "slot" came to be equivalent to "socket" in my second and third examples, but "core" in my first example. As far as I can tell, MPI ranks should have been assigned the same in all three examples. Why weren't they? You mentioned that, when using "--rank-by slot", the ranks are assigned round-robin by scheduler entry; does this mean that the scheduler entries change based on the mapping algorithm (the only thing I changed in my examples), and this results in ranks being assigned differently?

Thanks again,
David

On 11/30/2016 01:23 PM, r...@open-mpi.org wrote:
I think you have confused "slot" with a physical "core". The two have absolutely nothing to do with each other. A "slot" is nothing more than a scheduling entry in which a process can be placed. So when you --rank-by slot, the ranks are assigned round-robin by scheduler entry - i.e., you assign all the ranks on the first node, then assign all the ranks on the next node, etc. It doesn't matter where those ranks are placed, or what core or socket they are running on. We just blindly go thru and assign numbers.

If you rank-by core, then we cycle across the procs by looking at the core number they are bound to, assigning all the procs on a node before moving to the next node. If you rank-by socket, then you cycle across the procs on a node by round-robin of sockets, assigning all procs on the node before moving to the next node. If you then added "span" to that directive, we'd round-robin by socket across all nodes before circling around to the next proc on this node.

HTH
Ralph

On Nov 30, 2016, at 11:26 AM, David Shrader wrote:
Hello All,
The man page for mpirun says that the default ranking procedure is round-robin by slot. ...original question with examples quoted in full; snipped...

--
David Shrader
HPC-ENV High Performance Computer Systems
Los Alamos National Lab
Email: dshrader lanl.gov
Re: [OMPI users] question about "--rank-by slot" behavior
Thank you for the explanation! I understand what is going on now: there is a process list for each node whose order is dependent on the mapping policy, and the ranker, when using "slot," walks through that list. Makes sense. Thank you again!
David

On 11/30/2016 04:46 PM, r...@open-mpi.org wrote:
"slot" never became equivalent to "socket", or to "core". Here is what happened:

* your first example: the mapper assigns the first process to the first node because there is a free core there, and you said to map-by core. It goes on to assign the second process to the second core, and the third process to the third core, etc., until we reach the defined #procs for that node (i.e., the number of assigned "slots" for that node). When it goes to rank the procs, the ranker starts with the first process assigned on the first node - this process occupies the first "slot", and so it gets rank 0. The ranker then assigns rank 1 to the second process it assigned to the first node, as that process occupies the second "slot". Etc.

* your 2nd example: the mapper assigns the first process to the first socket of the first node, the second process to the second socket of the first node, and the third process to the first socket of the first node, until all the "slots" for that node have been filled. The ranker then starts with the first process that was assigned to the first node, and gives it rank 0. The ranker then assigns rank 1 to the second process that was assigned to the node - that would be the first proc mapped to the second socket. The ranker then assigns rank 2 to the third proc assigned to the node - that would be the 2nd proc assigned to the first socket.

* your 3rd example: the mapper assigns the first process to the first socket of the first node, the second process to the second socket of the first node, and the third process to the first socket of the second node, continuing around until all procs have been mapped. The ranker then starts with the first proc assigned to the first node, and gives it rank 0. The ranker then assigns rank 1 to the second process assigned to the first node (because we are ranking by slot!), which corresponds to the first proc mapped to the second socket. The ranker then assigns rank 2 to the third process assigned to the first node, which corresponds to the second proc mapped to the first socket of that node.

So you can see that you will indeed get the same relative ranking, even though the mapping was done using a different algorithm.

HTH
Ralph

On Nov 30, 2016, at 2:16 PM, David Shrader wrote:
Hello Ralph,
I do understand that "slot" is an abstract term and isn't tied down to any particular piece of hardware. ...earlier messages quoted in full; snipped...
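To make the walkthrough above concrete, here is a hypothetical illustration (not output from any tool) for a single 2-socket node with cores c0-c3, where socket S0 = {c0, c1} and S1 = {c2, c3}, launching 4 procs:

--map-by core                  : mapped (slot) order c0, c1, c2, c3 -> rank-by slot puts ranks 0,1,2,3 on c0,c1,c2,c3
--map-by socket                : mapped (slot) order c0, c2, c1, c3 -> rank-by slot puts rank 1 on socket S1 (core c2)
--map-by socket --rank-by core : ranks follow core order again      -> ranks 0,1,2,3 land on c0,c1,c2,c3

The mapped order is the per-node process list the ranker walks; changing the mapping policy reorders that list, which is why "slot" ranking looks different across the three examples.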
[OMPI users] what was ompi configured with?
Hello,

Is there a way to tell what configure line was used in building Open MPI from the installation itself? That is, not from config.log, but from issuing some command like 'mpicc --version'. I'm wondering if a particular installation of Open MPI has anything that "remembers" how it was configured.

Thank you very much for your time,
David

--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader lanl.gov
Re: [OMPI users] what was ompi configured with?
That is pretty much what I am looking for. Thank you!
David

On 05/05/2015 12:58 PM, Jeff Squyres (jsquyres) wrote:
We can't capture the exact configure command line, but you can look at the output from ompi_info to check specific characteristics of your Open MPI installation. ompi_info with no CLI options tells you a bunch of stuff; "ompi_info --all" tells you (a lot) more.

On May 5, 2015, at 2:54 PM, David Shrader wrote:
Hello,
Is there a way to tell what configure line was used in building Open MPI from the installation itself? ...original question quoted in full; snipped...

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/05/26838.php

--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader lanl.gov
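A few ompi_info invocations that recover much of what a configure line would tell you. This is a sketch; the exact field names and --param syntax vary across Open MPI versions:

$> ompi_info | grep -i -e compiler -e built   # compilers used, build host and user
$> ompi_info --all | grep -i wrapper          # extra flags baked into the mpicc/mpifort wrappers
$> ompi_info --param all all                  # MCA parameters and their defaults

(Some later Open MPI releases do record the full configure command line in ompi_info output; as Jeff notes, that was not available at the time of this thread.)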
[OMPI users] 1.8.5, mxm, and a spurious '-L' flag
Hello,

I'm getting a spurious '-L' flag when I have mxm installed in system-space (/usr/lib64/libmxm.so), which is causing an error at link time during make:

...output snipped...
/bin/sh ../../../../libtool --tag=CC --mode=link gcc -std=gnu99 -O3 -DNDEBUG -I/opt/panfs/include -finline-functions -fno-strict-aliasing -pthread -module -avoid-version -o libmca_mtl_mxm.la mtl_mxm.lo mtl_mxm_cancel.lo mtl_mxm_component.lo mtl_mxm_endpoint.lo mtl_mxm_probe.lo mtl_mxm_recv.lo mtl_mxm_send.lo -lmxm -L -lrt -lm -lutil
libtool: link: require no space between `-L' and `-lrt'
make[2]: *** [libmca_mtl_mxm.la] Error 1
make[2]: Leaving directory `/turquoise/usr/projects/hpctools/dshrader/hpcsoft/openmpi/1.8.5/openmpi-1.8.5/ompi/mca/mtl/mxm'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/turquoise/usr/projects/hpctools/dshrader/hpcsoft/openmpi/1.8.5/openmpi-1.8.5/ompi'
make: *** [all-recursive] Error 1

If I use --with-mxm=no, then this error doesn't occur (as expected, since the mxm component isn't touched). Has anyone run into this before? Here is my configure line:

./configure --disable-silent-rules --with-platform=contrib/platform/lanl/toss/optimized-panasas --prefix=...

I wonder if there is an empty variable somewhere in configure that should contain the directory libmxm is in, since no directory is passed to --with-mxm, which is then paired with a "-L". I think I'll go through the configure script while waiting to see if anyone else has run into this.

Thank you for any and all help,
David

--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader lanl.gov
Re: [OMPI users] 1.8.5, mxm, and a spurious '-L' flag
Hello Mike,

This particular instance of mxm was installed using rpms that were re-rolled by our admins. I'm not 100% sure where they got them (HPCx or somewhere else). I myself am not using HPCx. Is there any particular reason why mxm shouldn't be in system space? If there is, I'll share it with our admins and try to get the install location corrected.

As for what is causing the extra -L, it does look like an empty variable is used without checking that it is empty in configure. Line 246117 in the configure script provided by the openmpi-1.8.5.tar.bz2 tarball has this:

ompi_check_mxm_extra_libs="-L$ompi_check_mxm_libdir"

By invoking configure with '/bin/sh -x ./configure ...' and changing PS4 to output line numbers, I saw that line 246117 was setting ompi_check_mxm_extra_libs to just "-L". It turns out that configure does this in three separate locations. I put a check around all three instances like this:

if test ! -z "$ompi_check_mxm_libdir"; then
    ompi_check_mxm_extra_libs="-L$ompi_check_mxm_libdir"
fi

And the spurious '-L' disappeared from the linking commands and make completed fine.

So, it looks like there are two solutions: move the install location of mxm to not be in system-space, or modify configure. Which one would be the better one for me to pursue?

Thanks,
David

On 05/23/2015 12:05 AM, Mike Dubman wrote:
Hi,
How was mxm installed? By copying? The rpm-based installation places mxm into /opt/mellanox/mxm and not into /usr/lib64/libmxm.so. Do you use HPCx (pack of OMPI and MXM and FCA)? You can download HPCX, extract it anywhere, and compile OMPI pointing to the mxm location under HPCX. Also, HPCx contains rpms for mxm and fca.
M

On Sat, May 23, 2015 at 1:07 AM, David Shrader <dshra...@lanl.gov> wrote:
Hello,
I'm getting a spurious '-L' flag when I have mxm installed in system-space (/usr/lib64/libmxm.so) which is causing an error at link time during make: ...original message quoted in full; snipped...

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/05/26905.php

--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader lanl.gov
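The PS4 trick David describes is useful for chasing any configure-generated flag. A minimal sketch (bash-specific; 'trace.log' is just an example name):

$> PS4='+ ${LINENO}: ' bash -x ./configure --disable-silent-rules ... 2> trace.log
$> grep -n 'ompi_check_mxm_extra_libs=' trace.log   # see exactly where the bare -L appears

Each executed shell line is echoed to stderr prefixed with its line number, so empty-variable expansions like the bare "-L" show up immediately.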
Re: [OMPI users] 1.8.5, mxm, and a spurious '-L' flag
Hello Mike,

I'm glad that I could be of help. Just as an FYI, right now our admins are still hosting the fca libraries in /opt, but they would like to have them in system-space just as they have done with mxm. I haven't worked my way through all of the fca-related logic in configure yet, so I don't know if putting the fca libraries in system-space will expose any issues. It might be a good idea to check out that logic while looking at the mxm-related logic.

Thank you again!
David

On 05/26/2015 09:41 AM, Mike Dubman wrote:
Hello David,
Thanks for the info and patch - will fix the ompi configure logic with your patch. mxm can be installed in system and user spaces - both are valid and supported.
M

On Tue, May 26, 2015 at 5:50 PM, David Shrader <dshra...@lanl.gov> wrote:
Hello Mike,
This particular instance of mxm was installed using rpms that were re-rolled by our admins. ...earlier messages quoted in full; snipped...
Re: [OMPI users] 1.8.5, mxm, and a spurious '-L' flag
Hello Mike,

I'm still working on getting you my config.log, but I thought I would chime in about that line 36. In my case, that code path is not executed because with_mxm is empty (I don't use --with-mxm on the configure line since libmxm.so is in system space and configure picks up on it automatically). Thus, ompi_check_mxm_libdir never gets assigned, which results in just "-L" getting used on line 41. The same behavior could be found by using '--with-mxm=yes'.

Thanks,
David

On 05/26/2015 11:28 AM, Mike Dubman wrote:
Thanks Jeff! But in this line:
https://github.com/open-mpi/ompi/blob/master/config/ompi_check_mxm.m4#L36
ompi_check_mxm_libdir gets a value if with_mxm was passed

On Tue, May 26, 2015 at 6:59 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
This line:
https://github.com/open-mpi/ompi/blob/master/config/ompi_check_mxm.m4#L41
doesn't check to see if $ompi_check_mxm_libdir is empty.

> On May 26, 2015, at 11:50 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
> David,
> Could you please send me your config.log file?
> Looking into the config/ompi_check_mxm.m4 macro I don't understand how it could happen.
> Thanks a lot.
...earlier messages in the thread quoted in full; snipped...
Re: [OMPI users] Openmpi compilation errors
Looking at the config.log, I see this:

pgi-cc-lin64: LICENSE MANAGER PROBLEM: No such feature exists.
Feature: pgi-cc-lin64

It looks like there is a problem with the PGI license. Does it work with a regular file (e.g., hello_world)? If it does, how do you get it to work (env variables, license file, etc.)?

Thanks,
David

On 05/27/2015 10:25 AM, Bruno Queiros wrote:
Hello
I'm trying to compile openmpi-1.8.5 with portland fortran 10.4 64bits on a CentOS7 64bits. This is the output i get:

./configure CC=pgcc CXX=pgCC FC=pgf90 F77=pgf77 F90=pgf90 --prefix=/opt/openmpi-1.8.5_pgf90

== Configuring Open MPI

*** Startup tests
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking target system type... x86_64-unknown-linux-gnu
checking for gcc... pgcc
checking whether the C compiler works... no
configure: error: in `/root/TransferArea/openmpi-1.8.5':
configure: error: C compiler cannot create executables
See `config.log' for more details

The config.log goes as an attachment

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/05/26954.php

--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader lanl.gov
Re: [OMPI users] Openmpi compilation errors
Yes, exactly like that. Given your configure line, all of the Portland Group's compilers need to work:

$> pgf90 hello.f90
$> pgcc hello.c
$> pgCC hello.cpp

Which of those commands work for you?

Thanks,
David

On 05/27/2015 11:01 AM, Bruno Queiros wrote:
David
Do you mean if the Portland Fortran compiler works? Like pgf90 hello.f?
Bruno

On Wed, May 27, 2015 at 17:40, David Shrader <dshra...@lanl.gov> wrote:
Looking at the config.log, I see this:
pgi-cc-lin64: LICENSE MANAGER PROBLEM: No such feature exists.
...earlier messages quoted in full; snipped...

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/05/26957.php

--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader lanl.gov
[OMPI users] orte-clean hang in 1.8.5
Hello All,

I had a user report that orte-clean is hanging on him with Open MPI 1.8.5. Here are the steps I used to reproduce what he reported:

%> which orte-clean
/usr/projects/hpcsoft/toss2/moonlight/openmpi/1.8.5-gcc-4.4/bin/orte-clean
%> mpirun -n 1 /usr/projects/hpcsoft/toss2/moonlight/openmpi/1.8.5-gcc-4.4/bin/orte-clean
Reported: 1 (out of 1) daemons - 1 (out of 1) procs
[hangs]

I have found that the same behavior does not happen using 1.6.5. That is, I get a command prompt after running orte-clean. Is this behavior expected? I am not familiar with orte-clean, so I am not sure if its hanging when used in this fashion is an actual problem with orte-clean. If it is unexpected behavior, I'll dig some more.

Thank you very much for your time,
David

--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader lanl.gov
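For context on what orte-clean does while it hangs: it removes leftover ORTE session directories and kills stray daemons from previous jobs. A quick way to see the state it is trying to clean up, assuming the default temporary directory (the exact naming pattern varies by Open MPI version):

%> ls -d /tmp/openmpi-sessions-* 2>/dev/null   # leftover session directories, if any
%> ps -u $USER | grep orted                    # stray ORTE daemons still running

These are the artifacts a hung orte-clean is most likely stuck on.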
[OMPI users] Open MPI 1.8.8 and hcoll in system space
Hello All,

I'm having some trouble getting Open MPI 1.8.8 to configure correctly when hcoll is installed in system space. That is, hcoll is installed to /usr/lib64 and /usr/include/hcoll. I get an error during configure:

$> ./configure --with-hcoll
...output snipped...
configure:219976: checking for MCA component coll:hcoll compile mode
configure:219982: result: static
configure:220039: checking --with-hcoll value
configure:220042: result: simple ok (unspecified)
configure:220840: error: HCOLL support requested but not found. Aborting

I have also tried using "--with-hcoll=yes" and gotten the same behavior. Has anyone else gotten the hcoll component to build when hcoll itself is in system space? I am using hcoll-3.2.748.

I did take a look at configure, and it looks like there is a test on "with_hcoll" to see if it is not empty and not yes on line 220072. In my case, this test fails, so the else clause gets invoked. The else clause is several hundred lines below on line 220822 and simply sets ompi_check_hcoll_happy="no". Configure doesn't try to do anything to figure out if hcoll is usable, but it does quit soon after with the above error because ompi_check_hcoll_happy isn't "yes."

In case it helps, here is the output from config.log for that area:

...output snipped...
configure:219976: checking for MCA component coll:hcoll compile mode
configure:219982: result: dso
configure:220039: checking --with-hcoll value
configure:220042: result: simple ok (unspecified)
configure:220840: error: HCOLL support requested but not found. Aborting

## ---------------- ##
## Cache variables. ##
## ---------------- ##

...output snipped...

Have I missed something in specifying --with-hcoll? I would prefer not to use "--with-hcoll=/usr" as I am pretty sure that spurious linker flags to that area will work their way in when they shouldn't.

Thanks,
David

--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader lanl.gov
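To paraphrase the logic described above in shell form (a simplification of what the generated configure does, not the verbatim code):

if test ! -z "$with_hcoll" && test "$with_hcoll" != "yes"; then
    # search the user-supplied prefix for hcoll headers and libs ...
    :
else
    ompi_check_hcoll_happy="no"   # plain --with-hcoll (or =yes) never triggers a search
fi

So any invocation that doesn't pass an explicit directory falls straight into the failing branch.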
Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space
I have cloned Gilles' topic/hcoll_config branch and, after running autogen.pl, have found that './configure --with-hcoll' does indeed work now. I used Gilles' branch as I wasn't sure how best to get the pull request changes into my own clone of master. It looks like the proper checks are happening, too:

--- MCA component coll:hcoll (m4 configuration macro)
checking for MCA component coll:hcoll compile mode... dso
checking --with-hcoll value... simple ok (unspecified)
checking hcoll/api/hcoll_api.h usability... yes
checking hcoll/api/hcoll_api.h presence... yes
checking for hcoll/api/hcoll_api.h... yes
looking for library without search path
checking for library containing hcoll_get_version... -lhcoll
checking if MCA component coll:hcoll can compile... yes

I haven't checked whether or not Open MPI builds successfully as I don't have much experience running off of the latest source. For now, I think I will try to generate a patch to the 1.8.8 configure script and see if that works as expected.

Thanks,
David

On 08/11/2015 06:34 AM, Jeff Squyres (jsquyres) wrote:
On Aug 11, 2015, at 1:39 AM, Åke Sandgren wrote:
Please fix the hcoll test (and code) to be correct. Any configure test that adds /usr/lib and/or /usr/include to any compile flags is broken.

+1

Gilles filed https://github.com/open-mpi/ompi/pull/796; I just added some comments to it.

--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader lanl.gov
Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space
Hello Gilles,

Thank you very much for the patch! It is much more complete than mine. Using that patch and re-running autogen.pl, I am able to build 1.8.8 with './configure --with-hcoll' without errors.

I do have issues when it comes to running 1.8.8 with hcoll built in, however. In my quick sanity test of running a basic parallel hello world C program, I get the following:

[dshrader@zo-fe1 tests]$ mpirun -n 2 ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439390789.039197] [zo-fe1:31354:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1439390789.040265] [zo-fe1:31353:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[zo-fe1:31353:0] Caught signal 11 (Segmentation fault)
[zo-fe1:31354:0] Caught signal 11 (Segmentation fault)
 backtrace
 2 0x00056cdc mxm_handle_error() /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
 3 0x00056e4c mxm_error_signal_handler() /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
 4 0x000326a0 killpg() ??:0
 5 0x000b91eb base_bcol_basesmuma_setup_library_buffers() ??:0
 6 0x000969e3 hmca_bcol_basesmuma_comm_query() ??:0
 7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery() coll_ml_module.c:0
 8 0x0002fda2 hmca_coll_ml_comm_query() ??:0
 9 0x0006ace9 hcoll_create_context() ??:0
10 0x000fa626 mca_coll_hcoll_comm_query() ??:0
11 0x000f776e mca_coll_base_comm_select() ??:0
12 0x00074ee4 ompi_mpi_init() ??:0
13 0x00093dc0 PMPI_Init() ??:0
14 0x004009b6 main() ??:0
15 0x0001ed5d __libc_start_main() ??:0
16 0x004008c9 _start() ??:0
===
...identical backtrace from the second rank snipped...
===
--
mpirun noticed that process rank 0 with PID 31353 on node zo-fe1 exited on signal 11 (Segmentation fault).
--

I do not get this message with only 1 process. I am using hcoll 3.2.748. Could this be an issue with hcoll itself or something with my ompi build?

Thanks,
David

On 08/12/2015 12:26 AM, Gilles Gouaillardet wrote:
Thanks David,
i made a PR for the v1.8 branch at https://github.com/open-mpi/ompi-release/pull/492
the patch is attached (it required some back-porting)
Cheers,
Gilles

On 8/12/2015 4:01 AM, David Shrader wrote:
I have cloned Gilles' topic/hcoll_config branch and, after running autogen.pl, have found that './configure --with-hcoll' does indeed work now. ...earlier message quoted in full; snipped...
Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space
Hey Devendar,

It looks like I still get the error:

[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
[1439397957.351764] [zo-fe1:14678:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[1439397957.352704] [zo-fe1:14677:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
[zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
[zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
 backtrace
 2 0x00056cdc mxm_handle_error() /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
 3 0x00056e4c mxm_error_signal_handler() /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
 4 0x000326a0 killpg() ??:0
 5 0x000b82cb base_bcol_basesmuma_setup_library_buffers() ??:0
 6 0x000969e3 hmca_bcol_basesmuma_comm_query() ??:0
 7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery() coll_ml_module.c:0
 8 0x0002fda2 hmca_coll_ml_comm_query() ??:0
 9 0x0006ace9 hcoll_create_context() ??:0
10 0x000f9706 mca_coll_hcoll_comm_query() ??:0
11 0x000f684e mca_coll_base_comm_select() ??:0
12 0x00073fc4 ompi_mpi_init() ??:0
13 0x00092ea0 PMPI_Init() ??:0
14 0x004009b6 main() ??:0
15 0x0001ed5d __libc_start_main() ??:0
16 0x004008c9 _start() ??:0
===
...identical backtrace from the second rank snipped...
===
--
mpirun noticed that process rank 1 with PID 14678 on node zo-fe1 exited on signal 11 (Segmentation fault).
--

Thanks,
David

On 08/12/2015 10:42 AM, Deva wrote:
Hi David,
This issue is from the hcoll library. This could be because of a symbol conflict with the ml module. This is fixed recently in HCOLL. Can you try with "-mca coll ^ml" and see if this workaround works in your setup?
-Devendar

On Wed, Aug 12, 2015 at 9:30 AM, David Shrader <dshra...@lanl.gov> wrote:
Hello Gilles,
Thank you very much for the patch! It is much more complete than mine. ...earlier message and backtrace quoted in full; snipped...
Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space
The admin that rolled the hcoll rpm that we're using (and got it in system space) said that she got it from hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.

Thanks,
David

On 08/12/2015 10:51 AM, Deva wrote:
From where did you grab this HCOLL lib? MOFED or HPCX? What version?

On Wed, Aug 12, 2015 at 9:47 AM, David Shrader <dshra...@lanl.gov> wrote:
Hey Devendar,
It looks like I still get the error:
[dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
...backtrace quoted earlier in the thread; snipped...
Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space
I remember seeing those, but forgot about them. I am curious, though, why using '-mca coll ^ml' wouldn't work for me. We'll watch for the next HPCX release. Is there an ETA on when that release may happen? Thank you for the help! David On 08/12/2015 04:04 PM, Deva wrote: David, This is because of hcoll symbols conflict with ml coll module inside OMPI. HCOLL is derived from ml module. This issue is fixed in hcoll library and will be available in next HPCX release. Some earlier discussion on this issue: http://www.open-mpi.org/community/lists/users/2015/06/27154.php http://www.open-mpi.org/community/lists/devel/2015/06/17562.php -Devendar On Wed, Aug 12, 2015 at 2:52 PM, David Shrader <mailto:dshra...@lanl.gov>> wrote: Interesting... the seg faults went away: [dshrader@zo-fe1 tests]$ export LD_PRELOAD=/usr/lib64/libhcoll.so [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs [1439416182.732720] [zo-fe1:14690:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or direc tory. Won't use knem. [1439416182.733640] [zo-fe1:14689:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or direc tory. Won't use knem. 0: Running on host zo-fe1.lanl.gov <http://zo-fe1.lanl.gov> 0: We have 2 processors 0: Hello 1! Processor 1 on host zo-fe1.lanl.gov <http://zo-fe1.lanl.gov> reporting for duty This implies to me that some other library is being used instead of /usr/lib64/libhcoll.so, but I am not sure how that could be... Thanks, David On 08/12/2015 03:30 PM, Deva wrote: Hi David, I tried same tarball on OFED-1.5.4.1 and I could not reproduce the issue. Can you do one more quick test with seeing LD_PRELOAD to hcoll lib? $LD_PRELOAD= mpirun -n 2 -mca coll ^ml ./a.out -Devendar On Wed, Aug 12, 2015 at 12:52 PM, David Shrader mailto:dshra...@lanl.gov>> wrote: The admin that rolled the hcoll rpm that we're using (and got it in system space) said that she got it from hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar. Thanks, David On 08/12/2015 10:51 AM, Deva wrote: From where did you grab this HCOLL lib? MOFED or HPCX? what version? On Wed, Aug 12, 2015 at 9:47 AM, David Shrader mailto:dshra...@lanl.gov>> wrote: Hey Devendar, It looks like I still get the error: [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs [1439397957.351764] [zo-fe1:14678:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or direc tory. Won't use knem. [1439397957.352704] [zo-fe1:14677:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or direc tory. Won't use knem. 
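Devendar's diagnosis above (hcoll carrying symbols that collide with the built-in coll/ml component) can be checked directly with nm, assuming both libraries export their symbols dynamically. The hcoll path is the one from this thread; the plugin path under $OMPI_PREFIX is an example and depends on the local install:

# exported text symbols of the system hcoll library
nm -D /usr/lib64/libhcoll.so | awk '$2 == "T" {print $3}' | sort > /tmp/hcoll.syms
# exported text symbols of Open MPI's coll/ml plugin
nm -D $OMPI_PREFIX/lib/openmpi/mca_coll_ml.so | awk '$2 == "T" {print $3}' | sort > /tmp/coll_ml.syms
# any names printed here are candidates for the clash
comm -12 /tmp/hcoll.syms /tmp/coll_ml.syms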
Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space
I don't have that option on the configure command line, but my platform file is using "enable_dlopen=no." I imagine that has the same effect. Thank you for the pointer!

Thanks,
David

On 08/12/2015 05:04 PM, Deva wrote:
Do you have "--disable-dlopen" in your configure options? This might force coll_ml to be loaded first even with -mca coll ^ml. The next HPCX is expected to release by the end of August.
-Devendar
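For anyone checking an existing installation rather than a configure line or platform file: Jeff's point later in this thread is that a --disable-dlopen (or enable_dlopen=no) build has no individual plugin DSOs, so the plugin directory itself is a quick tell. A sketch, with $OMPI_PREFIX standing in for the install prefix:

# many mca_*.so files => dlopen build; only libompi_dbg_msgq* => static (no-dlopen) build
ls $OMPI_PREFIX/lib/openmpi/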
Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space
Hey Jeff,
I'm actually not able to find coll_ml related files at that location. All I see are the following files:

[dshrader@zo-fe1 openmpi]$ ls /usr/projects/hpcsoft/toss2/zorrillo/openmpi/1.8.8-gcc-4.4/lib/openmpi/
libompi_dbg_msgq.a  libompi_dbg_msgq.la  libompi_dbg_msgq.so

In this particular build, I am using platform files instead of the stripped-down debug builds I was doing before. Could something in the platform files have moved the coll_ml related files or combined them with something else?

Thanks,
David

On 08/13/2015 04:02 AM, Jeff Squyres (jsquyres) wrote:
Note that this will require you to have fairly recent GNU Autotools installed.

Another workaround for avoiding the coll ml module would be to install Open MPI as normal, and then rm the following files after installation:

rm $prefix/lib/openmpi/mca_coll_ml*

This will physically remove the coll ml plugin from the Open MPI installation, and therefore it won't/can't be used (or interfere with the hcoll plugin).

On Aug 13, 2015, at 2:03 AM, Gilles Gouaillardet wrote:
David,
I guess you do not want to use the ml coll module at all in Open MPI 1.8.8. You can simply do:

touch ompi/mca/coll/ml/.ompi_ignore
./autogen.pl
./configure ...
make && make install

so the ml component is not even built.
Cheers,
Gilles
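Gilles's .ompi_ignore approach, gathered into one runnable sequence (run from the top of an Open MPI 1.8.8 source tree; as Jeff notes, autogen.pl needs fairly recent GNU Autotools, and the configure options beyond those shown are whatever your site normally uses):

cd openmpi-1.8.8
# tell the build system to skip the coll/ml component entirely
touch ompi/mca/coll/ml/.ompi_ignore
./autogen.pl
./configure --with-hcoll ...
make && make install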
Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space
Interestingly enough, I have found that using --disable-dlopen causes the seg fault whether or not --enable-mca-no-build=coll-ml is used. That is, the following configure line generates a build of Open MPI that will *not* seg fault when running a simple hello world program:

./configure --prefix=/tmp/dshrader-ompi-1.8.8-install --enable-mca-no-build=coll-ml --with-mxm=no --with-hcoll

While the following configure line will produce a build of Open MPI that *will* seg fault with the same error I mentioned before:

./configure --prefix=/tmp/dshrader-ompi-1.8.8-install --enable-mca-no-build=coll-ml --with-mxm=no --with-hcoll --disable-dlopen

I'm not sure why this would be.

Thanks,
David

On 08/13/2015 11:19 AM, Jeff Squyres (jsquyres) wrote:
Ah, if you're disable-dlopen, then you won't find individual plugin DSOs. Instead, you can configure this way:

./configure --enable-mca-no-build=coll-ml ...

This will disable the build of the coll/ml component altogether.
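Whichever removal route is taken (--enable-mca-no-build=coll-ml, .ompi_ignore, or deleting mca_coll_ml*), it is worth confirming the component is really gone before rerunning the hello-world test. ompi_info from the new installation lists every coll component it can find, one "MCA coll:" line per component on stock builds:

# ml should be absent; hcoll and the basic components should remain
ompi_info | grep "MCA coll"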
[OMPI users] hcoll dependency on mxm configure error
Hello All,
I'm currently trying to install 1.10.0 with hcoll and mxm, and am getting an error during configure:

--- MCA component coll:hcoll (m4 configuration macro)
checking for MCA component coll:hcoll compile mode... static
checking hcoll/api/hcoll_api.h usability... yes
checking hcoll/api/hcoll_api.h presence... yes
checking for hcoll/api/hcoll_api.h... yes
looking for library in lib
checking for library containing hcoll_get_version... no
looking for library in lib64
checking for library containing hcoll_get_version... no
configure: error: HCOLL support requested but not found. Aborting

The configure line I used:

./configure --with-mxm=/opt/mellanox/mxm --with-hcoll=/opt/mellanox/hcoll --with-platform=contrib/platform/lanl/toss/optimized-panasas

Here are the corresponding lines from config.log:

configure:217014: gcc -std=gnu99 -o conftest -O3 -DNDEBUG -I/opt/panfs/include -finline-functions -fno-strict-aliasing -pthread -I/usr/projects/hpctools/dshrader/hpcsoft/openmpi/1.10.0/openmpi-1.10.0/opal/mca/hwloc/hwloc191/hwloc/include -I/usr/projects/hpctools/dshrader/hpcsoft/openmpi/1.10.0/openmpi-1.10.0/opal/mca/event/libevent2021/libevent -I/usr/projects/hpctools/dshrader/hpcsoft/openmpi/1.10.0/openmpi-1.10.0/opal/mca/event/libevent2021/libevent/include -I/opt/mellanox/hcoll/include -L/opt/mellanox/hcoll/lib conftest.c -lhcoll -lrt -lm -lutil >&5
/usr/bin/ld: warning: libmxm.so.2, needed by /opt/mellanox/hcoll/lib/libhcoll.so, not found (try using -rpath or -rpath-link)
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_req_recv'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_ep_create'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_config_free_context_opts'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_ep_destroy'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_config_free_ep_opts'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_progress'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_config_read_opts'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_ep_disconnect'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_mq_destroy'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_mq_create'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_cleanup'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_req_send'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_ep_connect'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_init'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_ep_get_address'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_error_string'
/opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_mem_unmap'
collect2: ld returned 1 exit status

An ldd on /opt/mellanox/hcoll/lib/libhcoll.so shows a dependency on libmxm.so, so the above error makes sense. I am using hcoll version 3.3.768 and mxm version 3.4.3065 (as reported by rpm).

So, my question: is there a way to take care of this other than putting '-L/opt/mellanox/lib -lmxm' into LDFLAGS/LIBS? Using LDFLAGS/LIBS will link mxm into everything, which I would prefer not to do.

Thanks in advance!
David
--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader lanl.gov
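The linker's own hint in the warning above (-rpath or -rpath-link) can be tried in isolation before resorting to global LDFLAGS/LIBS. -Wl,-rpath-link only tells ld where to find libhcoll.so's own dependency (libmxm.so.2) while checking the link; unlike -rpath, it records nothing in the produced binary. A sketch against the paths in this report, using any small test program that calls hcoll_get_version:

gcc conftest.c -I/opt/mellanox/hcoll/include \
    -L/opt/mellanox/hcoll/lib -lhcoll \
    -Wl,-rpath-link,/opt/mellanox/mxm/lib -o conftest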
Re: [OMPI users] hcoll dependency on mxm configure error
I should probably point out that libhcoll.so does not know where libmxm.so is:

[dshrader@zo-fe1 ~]$ ldd /opt/mellanox/hcoll/lib/libhcoll.so
	linux-vdso.so.1 =>  (0x7fffb2f1f000)
	libibnetdisc.so.5 => /usr/lib64/libibnetdisc.so.5 (0x7fe31bd0b000)
	libmxm.so.2 => not found
	libz.so.1 => /lib64/libz.so.1 (0x7fe31baf4000)
	libdl.so.2 => /lib64/libdl.so.2 (0x7fe31b8f)
	libosmcomp.so.3 => /usr/lib64/libosmcomp.so.3 (0x7fe31b6e2000)
	libocoms.so.0 => /opt/mellanox/hcoll/lib/libocoms.so.0 (0x7fe31b499000)
	libm.so.6 => /lib64/libm.so.6 (0x7fe31b215000)
	libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x7fe31b009000)
	libalog.so.0 => /opt/mellanox/hcoll/lib/libalog.so.0 (0x7fe31adfe000)
	librt.so.1 => /lib64/librt.so.1 (0x7fe31abf6000)
	libibumad.so.3 => /usr/lib64/libibumad.so.3 (0x7fe31a9ee000)
	librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x7fe31a7d9000)
	libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x7fe31a5c7000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x7fe31a3a9000)
	libc.so.6 => /lib64/libc.so.6 (0x7fe31a015000)
	libglib-2.0.so.0 => /lib64/libglib-2.0.so.0 (0x7fe319cfe000)
	libibmad.so.5 => /usr/lib64/libibmad.so.5 (0x7fe319ae3000)
	/lib64/ld-linux-x86-64.so.2 (0x7fe31c2d3000)
	libwrap.so.0 => /lib64/libwrap.so.0 (0x7fe3198d8000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x7fe3196c2000)
	libnsl.so.1 => /lib64/libnsl.so.1 (0x7fe3194a8000)
	libutil.so.1 => /lib64/libutil.so.1 (0x7fe3192a5000)
	libnl.so.1 => /lib64/libnl.so.1 (0x7fe319052000)

Both hcoll and mxm were installed using the rpms provided by Mellanox.

Thanks again,
David
Re: [OMPI users] hcoll dependency on mxm configure error
We're using TOSS, which is based on Red Hat. The current version we're running is based on Red Hat 6.6. I'm actually not sure what MOFED version we're using right now based on what I can find on the system, and the admins over there are out. I'll get back to you on that as soon as I know.

Using LD_LIBRARY_PATH before configure got it to work, which I didn't expect. Thanks for the tip! I didn't realize that resolving the dependencies of a shared library being linked on the compile line fell under the runtime portion of linking and could therefore be affected by LD_LIBRARY_PATH.

Thanks!
David

On 10/21/2015 09:59 AM, Mike Dubman wrote:
Hi David,
What Linux distro do you use (and MOFED version)? Do you have an /etc/ld.so.conf.d/mxm.conf file?
Can you please try adding LD_LIBRARY_PATH=/opt/mellanox/mxm/lib before ./configure?
Thanks
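The system-wide counterpart to Mike's LD_LIBRARY_PATH tip is the ld.so.conf.d mechanism that comes up next in this thread: register the mxm library directory with the runtime linker once (root required), and both ldd and configure resolve libmxm.so.2 without any environment changes. A sketch, assuming the standard Mellanox install location:

# as root: make /opt/mellanox/mxm/lib a system-searched library directory
echo /opt/mellanox/mxm/lib > /etc/ld.so.conf.d/mxm.conf
ldconfig
# verify: the 'not found' entry should be gone
ldd /opt/mellanox/hcoll/lib/libhcoll.so | grep mxm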
Re: [OMPI users] hcoll dependency on mxm configure error
I'm sorry I missed reporting on that. I do not have /etc/ld.so.conf.d/mxm.conf. Interestingly enough, the rpm reports that it does include that file, but it isn't there:

[dshrader@zo-fe1 serial]$ rpm -qa | grep mxm
mxm-3.4.3065-1.x86_64
[dshrader@zo-fe1 serial]$ rpm -ql mxm-3.4.3065-1.x86_64
/etc/ld.so.conf.d/mxm.conf
...output snipped...
[dshrader@zo-fe1 serial]$ ll /etc/ld.so.conf.d/mxm.conf
ls: cannot access /etc/ld.so.conf.d/mxm.conf: No such file or directory

I'll follow up with the admin who installed the rpm.

Thanks,
David

On 10/21/2015 11:37 AM, Mike Dubman wrote:
Could you please check if you have the file /etc/ld.so.conf.d/mxm.conf on your system? It will help us understand why hcoll did not detect libmxm.so at the first attempt.
Thanks
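rpm can report this package-versus-filesystem mismatch directly: rpm -V compares installed files against the package database and prints a "missing" line for anything the metadata says should exist. For the package above, the output would look roughly like this (the exact column format varies slightly across rpm versions):

rpm -V mxm-3.4.3065-1.x86_64
# roughly expected here: missing   c /etc/ld.so.conf.d/mxm.conf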
Re: [OMPI users] hcoll dependency on mxm configure error
It turns out that the stuff in /etc is in RAM, so mxm.conf wasn't there because that area hadn't been refreshed yet, either by the admin manually pushing it out or by rebooting. The admins pushed it out, and now ldd on libhcoll.so resolves the libmxm.so dependency. And configure works without having to specify LD_LIBRARY_PATH.

So, not an Open MPI issue, but I am very grateful for all the help!
David
[OMPI users] a single build of Open MPI that can be used with multiple GCC versions
Hello,
Is it possible to use a single build of Open MPI with multiple versions of GCC if the versions of GCC are from the same release series? I was under the assumption that, as long as a binary-compatible compiler was used, it was possible to "swap out" the compiler from underneath Open MPI.

That is the general question I have, but here is the specific scenario that prompted it:

* built Open MPI 1.10.1 against GCC 5.2.0 with a directory name of openmpi-1.10.1-gcc-5
* installed GCC 5.3.0
* removed GCC 5.2.0

I now have users who are getting errors like the following when using mpicxx:

/bin/grep: /usr/projects/hpcsoft/toss2/common/gcc/5.2.0/lib/../lib64/libstdc++.la: No such file or directory

I can see several references to my previous GCC 5.2.0 installation in the installation's lib/*.la files, including a reference to /usr/projects/hpcsoft/toss2/common/gcc/5.2.0/lib/../lib64/libstdc++.la. This is all disconcerting, as users of GCC 5.3.0 were using 5.3.0's binaries but were getting some 5.2.0 library configs before I removed 5.2.0, and no one knew it.

If it should be possible to use a single build of Open MPI with multiple binary-compatible compilers, is there a way to fix my above situation or prevent it from happening at build time?

Thanks,
David
--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader lanl.gov
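The references in question live in libtool archive files: each .la file records a dependency_libs line with absolute paths captured at build time, which is how a long-removed compiler can keep surfacing. Finding every affected file is a one-liner; the prefix below is a stand-in for the actual openmpi-1.10.1-gcc-5 install path:

# list .la files that still point at the removed GCC 5.2.0 tree
grep -l '/gcc/5\.2\.0/' $PREFIX/lib/*.la
# inspect what one of them actually recorded
grep dependency_libs $PREFIX/lib/libmpi.la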
Re: [OMPI users] a single build of Open MPI that can be used with multiple GCC versions
A bit of an update: I was mistaken when I said users were reporting that 1.10.1 was throwing an error. The error occurs with 1.6.5 (which I still have to keep on my production systems). Users report that they do not see the error with 1.10.1.

That being said, I do see references to my GCC 5.2.0 installation in the lib/*.la files of the 1.10.1 installation and would like to ask whether I need to worry at all. It seems the way files were named and organized in lib/ changed in 1.7, which may be why 1.10.1 is working.

Thank you very much for your time,
David
--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader lanl.gov
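Whether stale .la paths actually reach user builds can be checked from the wrapper compilers themselves: Open MPI's wrappers accept -showme to print the underlying compile/link line without executing it, so any leftover 5.2.0 paths would be visible there:

# what mpicxx would actually run, including all -L/-l flags
mpicxx -showme
mpicxx -showme:link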
Re: [OMPI users] Question on OpenMPI backwards compatibility
Hey Edwin,
The versioning scheme changed with 2.x. Prior to 2.x, the "minor" version had a different definition and did not mention backwards compatibility at all (at least in my 1.6.x tarballs). As it turned out for 1.8.x and 1.6.x, 1.8.x was not backwards compatible with 1.6.x, so the behavior you saw in your test of 1.6.x-compiled code running against 1.8.x is expected. In practice, 1.x was never backwards compatible with 1.y where x > y, even though the versioning documentation at the time didn't specifically mention it.

There is a note in the versioning documentation (https://www.open-mpi.org/software/ompi/versions/) that does warn of this change in the versioning scheme:

NOTE: The version numbering conventions were changed with the release of v1.10.0. Most notably, Open MPI no longer uses an "odd/even" release schedule to indicate feature development vs. stable releases. See the README in releases prior to v1.10.0 for more information (e.g., https://github.com/open-mpi/ompi-release/blob/v1.8/README#L1392-L1475).

There is also a CAVEAT underneath the "Major" section of the versioning documentation that says that 1.10.x is not backwards compatible with other 1.x releases, and that the same rule applies to anything before 1.10.0. Perhaps another CAVEAT could be placed after the "Minor" section, since the information on backwards compatibility in the "Minor" section only applies to 2.x and beyond. The developers are still in the midst of the version scheme transition (developing on both 1.10.x and 2.x), so the FAQ entries might be a bit out of date for the new numbering scheme for a while.

Thanks,
David

On 02/26/2016 09:20 AM, Blosch, Edwin L wrote:
I am confused about backwards compatibility. FAQ #111 says:

"Open MPI reserves the right to break ABI compatibility at new feature release series. ... MPI applications compiled/linked against Open MPI 1.6.x will not be ABI compatible with Open MPI 1.7.x."

But the versioning documentation says:

* Minor: The minor number is the second integer in the version string. Backwards compatibility will still be preserved with prior releases that have the same major version number (e.g., v2.5.3 is backwards compatible with v2.3.1).

These two examples and statements appear inconsistent to me:

Can I use the Open MPI 1.7.x run-time and options to execute codes built with Open MPI 1.6.x? No (FAQ #111).
Can I use the Open MPI 2.5.x run-time and options to execute codes built with Open MPI 2.3.x? Yes (software versioning documentation).
Can I use the Open MPI 1.8.x run-time and options to execute codes built with Open MPI 1.6.x? Who knows?!

I tested this once, and it failed. I made the assumption that 1.8.x wouldn't run a 1.6.x code, and I moved on. But I realize now that I could have made a mistake. The test I performed could have failed for some other reason. Can anyone shed some light?

--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader lanl.gov
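For cross-version experiments like Edwin's, it helps to pin down which runtime the binary really bound to before interpreting a failure, since a stray LD_LIBRARY_PATH can silently change the answer. Both ends can be queried directly (./app is a placeholder for the MPI executable under test):

# which libmpi the application resolves at run time
ldd ./app | grep libmpi
# which Open MPI release the mpirun in PATH belongs to
mpirun --version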
Re: [OMPI users] Question on OpenMPI backwards compatibility
I forgot to include a link to the official announcement of the change; that info might be helpful in navigating the different versions and backwards compatibility:

https://www.open-mpi.org/community/lists/announce/2015/06/0069.php

Thanks,
David
--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader lanl.gov