[OMPI users] Using PLFS with Open MPI 1.8
Has anyone found the magic to apply the traditional PLFS ompi-1.7.x-plfs-prep.patch to the current version of Open MPI? It looks like it shouldn't take too much effort to update the patch, but it would be even better to learn that someone else has already made that available! Andy -- Andy Riebs Hewlett-Packard Company High Performance Computing +1 404 648 9024 My opinions are not necessarily those of HP
Re: [OMPI users] Open MPI does not work when MPICH or intel MPI are installed
Hi, The short answer: Environment module files are probably the best solution for your problem. The long answer: See , which pretty much addresses your question. Andy On 05/23/2016 07:40 AM, Megdich Islem wrote: Hi, I am using two software packages, one called OpenFOAM and the other called EMPIRE, that need to run together at the same time. OpenFOAM uses the Open MPI implementation and EMPIRE uses either MPICH or Intel MPI. The version of Open MPI that comes with OpenFOAM is 1.6.5. I am using the Intel(R) MPI Library for Linux* OS, version 5.1.3, and MPICH 3.0.4. My problem is that when I have the environment variables of either MPICH or Intel MPI sourced in bashrc, I fail to run an OpenFOAM case with parallel processing (you will find attached a picture of the error I got). This is an example of a command line I use to run OpenFOAM: mpirun -np 4 interFoam -parallel Once I keep only the OpenFOAM environment variables, parallel processing works without any problem, but then I can't run EMPIRE. I am sourcing the environment variables in this way: For OpenFOAM: source /opt/openfoam30/etc/bashrc For MPICH 3.0.4: export PATH=/home/islem/Desktop/mpich/bin:$PATH export LD_LIBRARY_PATH="/home/islem/Desktop/mpich/lib/:$LD_LIBRARY_PATH" export MPICH_F90=gfortran export MPICH_CC=/opt/intel/bin/icc export MPICH_CXX=/opt/intel/bin/icpc export MPICH-LINK_CXX="-L/home/islem/Desktop/mpich/lib/ -Wl,-rpath -Wl,/home/islem/Desktop/mpich/lib -lmpichcxx -lmpich -lopa -lmpl -lrt -lpthread" For Intel MPI: export PATH=$PATH:/opt/intel/bin/ LD_LIBRARY_PATH="/opt/intel/lib/intel64:$LD_LIBRARY_PATH" export LD_LIBRARY_PATH source /opt/intel/compilers_and_libraries_2016.3.210/linux/mpi/intel64/bin/mpivars.sh intel64 If only OpenFOAM is sourced, mpirun --version gives Open MPI (1.6.5). If OpenFOAM and MPICH are sourced, mpirun --version gives mpich 3.0.1. If OpenFOAM and Intel MPI are sourced, mpirun --version gives Intel(R) MPI Library for Linux, version 5.1.3. My question is why I can't have two MPI implementations installed and sourced together. How can I solve the problem? Regards, Islem Megdiche ___ users mailing list us...@open-mpi.org Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2016/05/29279.php
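A minimal sketch of the module-file idea in plain bash, for systems where the environment-modules package is not available: one shell function per MPI stack, each starting from a clean baseline PATH/LD_LIBRARY_PATH so that only one implementation is ever visible. The install paths are the ones quoted in the message above and will differ on other systems; the EMPIRE launch line is purely illustrative.

#!/bin/bash
# Select exactly one MPI stack per shell, instead of layering several
# stacks on top of each other in ~/.bashrc.
BASE_PATH="$PATH"                          # remember the MPI-free baseline
BASE_LD_LIBRARY_PATH="$LD_LIBRARY_PATH"

use_openfoam_openmpi() {
    PATH="$BASE_PATH"; LD_LIBRARY_PATH="$BASE_LD_LIBRARY_PATH"
    source /opt/openfoam30/etc/bashrc      # brings in Open MPI 1.6.5
    export PATH LD_LIBRARY_PATH
}

use_mpich() {
    PATH="/home/islem/Desktop/mpich/bin:$BASE_PATH"
    LD_LIBRARY_PATH="/home/islem/Desktop/mpich/lib:$BASE_LD_LIBRARY_PATH"
    export PATH LD_LIBRARY_PATH
}

use_intel_mpi() {
    PATH="$BASE_PATH"; LD_LIBRARY_PATH="$BASE_LD_LIBRARY_PATH"
    source /opt/intel/compilers_and_libraries_2016.3.210/linux/mpi/intel64/bin/mpivars.sh intel64
    export PATH LD_LIBRARY_PATH
}

# Usage, one stack per shell or job script:
#   use_openfoam_openmpi && mpirun -np 4 interFoam -parallel
#   use_mpich            && mpirun -np 4 ./empire_solver   # illustrative name

Proper environment modules give the same isolation with a simple module load/unload; the module names depend on how the site set them up.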
[OMPI users] Using Open MPI with PBS Pro
I gleaned from the web that I need to comment out "opal_event_include=epoll" in /etc/openmpi-mca-params.conf in order to use Open MPI with PBS Pro. Can we also disable that in other cases, like Slurm, or is this something specific to PBS Pro? Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE May the source be with you! ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] Using Open MPI with PBS Pro
Hi Ralph, I think I found that information at <https://github.com/open-mpi/ompi/issues/341> :-) In any case, thanks for the information about the default params file -- I won't worry too much about modifying it then. Andy I On 08/23/2016 08:08 PM, r...@open-mpi.org wrote: I’ve never heard of that, and cannot imagine what it has to do with the resource manager. Can you point to where you heard that one? FWIW: we don’t ship OMPI with anything in the default mca params file, so somebody must have put it in there for you. On Aug 23, 2016, at 4:48 PM, Andy Riebs wrote: I gleaned from the web that I need to comment out "opal_event_include=epoll" in /etc/openmpi-mca-params.conf in order to use Open MPI with PBS Pro. Can we also disable that in other cases, like Slurm, or is this something specific to PBS Pro? Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE May the source be with you! ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
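A quick way to see whether opal_event_include is set anywhere on a node, and what value is in effect, is sketched below; the exact ompi_info output format varies between Open MPI releases, and the second file path assumes a default installation prefix.

# Look for the parameter in the usual MCA parameter files:
grep -Hn opal_event_include /etc/openmpi-mca-params.conf \
    /usr/local/etc/openmpi-mca-params.conf \
    ~/.openmpi/mca-params.conf 2>/dev/null
# Ask Open MPI what it thinks the current value is:
ompi_info --all | grep -i opal_event_include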
[OMPI users] Slurm binding not propagated to MPI jobs
Hi All, We are running Open MPI version 1.10.2, built with support for Slurm version 16.05.0. When a user specifies "--cpu_bind=none", MPI tries to bind by core, which segv's if there are more processes than cores. The user reports: What I found is that % srun --ntasks-per-node=8 --cpu_bind=none \ env SHMEM_SYMMETRIC_HEAP_SIZE=1024M bin/all2all.shmem.exe 0 will have the problem, but: % srun --ntasks-per-node=8 --cpu_bind=none \ env SHMEM_SYMMETRIC_HEAP_SIZE=1024M ./bindit.sh bin/all2all.shmem.exe 0 Will run as expected and print out the usage message because I didn’t provide the right arguments to the code. So, it appears that the binding has something to do with the issue. My binding script is as follows: % cat bindit.sh #!/bin/bash #echo SLURM_LOCALID=$SLURM_LOCALID stride=1 if [ ! -z "$SLURM_LOCALID" ]; then let bindCPU=$SLURM_LOCALID*$stride exec numactl --membind=0 --physcpubind=$bindCPU $* fi $* % -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE May the source be with you! ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
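When chasing this kind of mismatch it helps to print the CPU set each task actually receives, independent of what was requested. A small sketch (any executable can stand in for the user's all2all.shmem.exe):

# Show the cpuset Slurm hands to each task under --cpu_bind=none ...
srun --ntasks-per-node=8 --cpu_bind=none \
    bash -c 'echo "rank=$SLURM_PROCID $(grep Cpus_allowed_list /proc/self/status)"'
# ... and compare with an explicit binding, or with the bindit.sh wrapper above:
srun --ntasks-per-node=8 --cpu_bind=core \
    bash -c 'echo "rank=$SLURM_PROCID $(grep Cpus_allowed_list /proc/self/status)"'

Because plain bash never calls MPI_Init or shmem_init, this shows only what Slurm itself applied; any further narrowing seen inside the MPI/SHMEM application comes from Open MPI's own binding step, which is exactly what this thread is about.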
Re: [OMPI users] Slurm binding not propagated to MPI jobs
Hi Ralph, I think I've found the magic keys... $ srun --ntasks-per-node=2 -N1 --cpu_bind=none env | grep BIND SLURM_CPU_BIND_VERBOSE=quiet SLURM_CPU_BIND_TYPE=none SLURM_CPU_BIND_LIST= SLURM_CPU_BIND=quiet,none SLURM_CPU_BIND_VERBOSE=quiet SLURM_CPU_BIND_TYPE=none SLURM_CPU_BIND_LIST= SLURM_CPU_BIND=quiet,none $ srun --ntasks-per-node=2 -N1 --cpu_bind=core env | grep BIND SLURM_CPU_BIND_VERBOSE=quiet SLURM_CPU_BIND_TYPE=mask_cpu: SLURM_CPU_BIND_LIST=0x,0x SLURM_CPU_BIND=quiet,mask_cpu:0x,0x SLURM_CPU_BIND_VERBOSE=quiet SLURM_CPU_BIND_TYPE=mask_cpu: SLURM_CPU_BIND_LIST=0x,0x SLURM_CPU_BIND=quiet,mask_cpu:0x,0x Andy On 10/27/2016 11:57 AM, r...@open-mpi.org wrote: Hey Andy Is there a SLURM envar that would tell us the binding option from the srun cmd line? We automatically bind when direct launched due to user complaints of poor performance if we don’t. If the user specifies a binding option, then we detect that we were already bound and don’t do it. However, if the user specifies that they not be bound, then we think they simply didn’t specify anything - and that isn’t the case. If we can see something that tells us “they explicitly said not to do it”, then we can avoid the situation. Ralph
Re: [OMPI users] Slurm binding not propagated to MPI jobs
Yes, they still exist: $ srun --ntasks-per-node=2 -N1 env | grep BIND | sort -u SLURM_CPU_BIND_LIST=0x SLURM_CPU_BIND=quiet,mask_cpu:0x SLURM_CPU_BIND_TYPE=mask_cpu: SLURM_CPU_BIND_VERBOSE=quiet Here are the relevant Slurm configuration options that could conceivably change the behavior from system to system: SelectType = select/cons_res SelectTypeParameters = CR_CPU On 10/27/2016 01:17 PM, r...@open-mpi.org wrote: And if there is no --cpu_bind on the cmd line? Do these not exist?
Re: [OMPI users] Slurm binding not propagated to MPI jobs
Hi Ralph, I haven't played around in this code, so I'll flip the question over to the Slurm list, and report back here when I learn anything. Cheers Andy On 10/27/2016 01:44 PM, r...@open-mpi.org wrote: Sigh - of course it wouldn’t be simple :-( All right, let’s suppose we look for SLURM_CPU_BIND: * if it includes the word “none”, then we know the user specified that they don’t want us to bind * if it includes the word mask_cpu, then we have to check the value of that option. * If it is all F’s, then they didn’t specify a binding and we should do our thing. * If it is anything else, then we assume they _did_ specify a binding, and we leave it alone Would that make sense? Is there anything else that could be in that envar which would trip us up?
Re: [OMPI users] Slurm binding not propagated to MPI jobs
Getting that support into 2.1 would be terrific -- and might save us from having to write some Slurm prolog scripts to effect that. Thanks Ralph! On 11/01/2016 11:36 PM, r...@open-mpi.org wrote: Ah crumby!! We already solved this on master, but it cannot be backported to the 1.10 series without considerable pain. For some reason, the support for it has been removed from the 2.x series as well. I’ll try to resolve that issue and get the support reinstated there (probably not until 2.1). Can you manage until then? I think the v2 RM’s are thinking Dec/Jan for 2.1. Ralph On Nov 1, 2016, at 11:38 AM, Riebs, Andy <andy.ri...@hpe.com> wrote: To close the thread here… I got the following information: Looking at SLURM_CPU_BIND is the right idea, but there are quite a few more options. It misses map_cpu, rank, plus the NUMA-based options: rank_ldom, map_ldom, and mask_ldom. See the srun man pages for documentation.
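For reference, here is a shell sketch of the detection logic discussed above, roughly what a Slurm prolog or wrapper script could do before Open MPI applies its own default binding. It only restates the rules quoted in this thread (none, an all-F mask_cpu mask, and the extra map_cpu/rank/*_ldom options Andy relayed); treat it as an illustration, not the code that eventually went into Open MPI.

#!/bin/bash
# Classify the srun binding request using only the SLURM_CPU_BIND
# variables shown earlier in this thread.
bind="${SLURM_CPU_BIND:-}"
case "$bind" in
    "")
        echo "no SLURM_CPU_BIND set - user requested nothing" ;;
    *none*)
        echo "user explicitly asked for NO binding - leave ranks unbound" ;;
    *mask_cpu:*)
        masks="${bind#*mask_cpu:}"
        allf=yes
        for m in ${masks//,/ }; do
            case "${m#0x}" in *[!fF]*) allf=no ;; esac   # any non-F digit means a real mask
        done
        if [ "$allf" = yes ]; then
            echo "default all-F mask - OMPI may apply its own binding"
        else
            echo "user-specified CPU mask - leave it alone"
        fi ;;
    *map_cpu*|*mask_ldom*|*map_ldom*|*rank_ldom*|*rank*)
        echo "user-specified mapping - leave it alone" ;;
    *)
        echo "unrecognized binding spec: $bind" ;;
esac

A real implementation would live inside Open MPI's Slurm support rather than in a script; the point is only which envar values distinguish "explicitly unbound" from "nothing requested".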
Re: [OMPI users] Compiler error with PGI: pgcc-Error-Unknown switch: -pthread
Interestingly, I participated in the discussion that led to that workaround, stating that I had no problem compiling Open MPI with PGI v9. I'm assuming the problem now is that I'm specifying --enable-mpi-thread-multiple, which I'm doing because a user requested that feature. It's been exactly 8 years and 2 days since that workaround was posted to the list. Please tell me a better way of dealing with this issue than writing a 'fakepgf90' script. Any suggestions? -- Prentice ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
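The workaround referenced above is typically some variant of a wrapper script that filters the -pthread flag out of the command line before invoking the PGI driver, since older pgcc/pgf90 releases reject it. A sketch, with an arbitrary wrapper name (fakepgcc); the analogous Fortran wrapper is what the thread calls 'fakepgf90':

#!/bin/bash
# fakepgcc: forward everything to pgcc except -pthread, which older PGI
# drivers do not understand; substitute the library it implies instead.
args=()
for a in "$@"; do
    case "$a" in
        -pthread) args+=("-lpthread") ;;
        *)        args+=("$a") ;;
    esac
done
exec pgcc "${args[@]}"

Configuring Open MPI with CC pointing at such a wrapper is the long-standing workaround; newer PGI releases reportedly accept -pthread directly, so it is worth checking the compiler version before resorting to this.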
[OMPI users] Build problem
Hi, I'm trying to build OMPI on RHEL 7.2 with MOFED on an x86_64 system, and I'm seeing = Open MPI gitclone: test/datatype/test-suite.log = # TOTAL: 9 # PASS: 8 # SKIP: 0 # XFAIL: 0 # FAIL: 1 # XPASS: 0 # ERROR: 0 .. contents:: :depth: 2 FAIL: external32 /data/swstack/packages/shmem-mellanox/openmpi-gitclone/test/datatype/.libs/lt-external32: symbol lookup error: /data/swstack/packages/shmem-mellanox/openmpi-gitclone/test/datatype/.libs/lt-external32: undefined symbol: ompi_datatype_pack_external_size FAIL external32 (exit status: 127) I'm probably missing an obvious library or package, but libc++-devel.i686 and glibc-devel.i686 didn't cover this for me. Alex, I'd like to buy a clue, please? Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE May the source be with you! ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] Build problem
Exactly the hint that I needed -- thanks Gilles! Andy On 05/24/2017 10:33 PM, Gilles Gouaillardet wrote: Andy, it looks like some MPI libraries are being mixed in your environment. From the test/datatype directory, what if you run "ldd .libs/lt-external32"? Does it resolve to the libmpi.so you expect? Cheers, Gilles
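Gilles' hint generalizes: when a freshly built test binary dies with an undefined Open MPI symbol, the usual culprit is that the runtime loader resolved libmpi.so from a different (often system-installed) Open MPI. A quick sketch of the check; the install prefix in the second command is a placeholder for wherever this build was configured to install:

cd test/datatype
# Which libmpi.so does the libtool-generated test binary actually load?
ldd .libs/lt-external32 | grep libmpi
# Confirm the missing symbol really is exported by the library you built:
nm -D /path/to/this/build/lib/libmpi.so | grep ompi_datatype_pack_external_size

If the ldd output points at a stray older installation, cleaning LD_LIBRARY_PATH (or removing the stray copy) usually makes the test pass.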
[OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC
...Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted. --- -- shmemrun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was: Process name: [[30881,1],0] Exit code: 255 -- Any thoughts about where to go from here? Andy -- Andy Riebs Hewlett-Packard Company High Performance Computing +1 404 648 9024 My opinions are not necessarily those of HP
Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC
,0],0] plm:base:orted_cmd sending orted_exit commands -- shmemrun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was: Process name: [[32419,1],1] Exit code: 255 -- [atl1-01-mic0:189895] 1 more process has sent help message help-shmem-runtime.txt / shmem_init:startup:internal-failure [atl1-01-mic0:189895] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages [atl1-01-mic0:189895] 1 more process has sent help message help-shmem-api.txt / shmem-abort [atl1-01-mic0:189895] 1 more process has sent help message help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed [atl1-01-mic0:189895] [[32419,0],0] plm:base:receive stop comm On 04/10/2015 06:37 PM, Ralph Castain wrote: Andy - could you please try the current 1.8.5 nightly tarball and see if it helps? The error log indicates that it is failing to get the topology from some daemon, I’m assuming the one on the Phi? You might also add —enable-debug to that configure line and then put -mca plm_base_verbose on the shmemrun cmd to get more help On Apr 10, 2015, at 11:55 AM, Andy Riebs <andy.ri...@hp.com> wrote: Summary: MPI jobs work fine, SHMEM jobs work just often enough to be tantalizing, on an Intel Xeon Phi/MIC system. Longer version Thanks to the excellent write-up last June (), I have been able to build a version of Open MPI for the Xeon Phi coprocessor that runs MPI jobs on the Phi coprocessor with no problem, but not SHMEM jobs. Just at the point where I was about to document the problems I was having with SHMEM, my trivial SHMEM job worked. And then failed when I tried to run it again, immediately afterwards. I have a feeling I may be in uncharted territory here. Environment RHEL 6.5 Intel Composer XE 2015 Xeon Phi/MIC Configuration $ export PATH=/usr/linux-k1om-4.7/bin/:$PATH $ source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64 $ ./configure --prefix=/home/ariebs/mic/mpi \ CC="icc -mmic" CXX="icpc -mmic" \ --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \ AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib \ LD=x86_64-k1om-linux-ld \ --enable-mpirun-prefix-by-default --disable-io-romio \ --disable-vt --disable-mpi-fortran \ --enable-mca-no-build=btl-usnic,btl-openib,common-verbs $ make $ make install Test program #include #include #include int main(int argc, char **argv) { int me, num_pe; shmem_init(); num_pe = num_pes(); me = my_pe(); printf("Hello World from process %ld of %ld\n", me, num_pe); exit(0); } Building the program export PATH=/home/ariebs/mic/mpi/bin:$PATH export PATH=/usr/linux-k1om-4.7/bin/:$PATH source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64 export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH icc -mmic -std=gnu99 -I/home/ariebs/mic/mpi/include -pthread \ -Wl,-rpath -Wl,/home/ariebs/mic/mpi/lib -Wl,--enable-new-dtags \ -L/home/ariebs/mic/mpi/lib -loshmem -lmpi -lopen-rte -lopen-pal \ -lm -ldl -lutil \
Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC
Everything is built on the Xeon side, with the icc "-mmic" switch. I then ssh into one of the PHIs, and run shmemrun from there. On 04/11/2015 12:00 PM, Ralph Castain wrote: Let me try to understand the setup a little better. Are you running shmemrun on the PHI itself? Or is it running on the host processor, and you are trying to spawn a process onto the Phi? On Apr 11, 2015, at 7:55 AM, Andy Riebs <andy.ri...@hp.com> wrote: Hi Ralph, Yes, this is attempting to get OSHMEM to run on the Phi. I grabbed openmpi-dev-1484-g033418f.tar.bz2 and configured it with $ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC=icc -mmic CXX=icpc -mmic \ --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \ AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib LD=x86_64-k1om-linux-ld \ --enable-mpirun-prefix-by-default --disable-io-romio --disable-mpi-fortran \ --enable-debug --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud (Note that I had to add "oob-ud" to the "--enable-mca-no-build" option, as the build complained that mca oob/ud needed mca common-verbs.) With that configuration, here is what I am seeing now... $ export SHMEM_SYMMETRIC_HEAP_SIZE=1G $ shmemrun -H localhost -N 2 --mca sshmem mmap --mca plm_base_verbose 5 $PWD/mic.out [atl1-01-mic0:189895] mca:base:select:( plm) Querying component [rsh] [atl1-01-mic0:189895] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL [atl1-01-mic0:189895] mca:base:select:( plm) Query of component [rsh] set priority to 10 [atl1-01-mic0:189895] mca:base:select:( plm) Querying component [isolated] [atl1-01-mic0:189895] mca:base:select:( plm) Query of component [isolated] set priority to 0 [atl1-01-mic0:189895] mca:base:select:( plm) Querying component [slurm] [atl1-01-mic0:189895] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module [atl1-01-mic0:189895] mca:base:select:( plm) Selected component [rsh] [atl1-01-mic0:189895] plm:base:set_hnp_name: initial bias 189895 nodename hash 4121194178 [atl1-01-mic0:189895] plm:base:set_hnp_name: final jobfam 32419 [atl1-01-mic0:189895] [[32419,0],0] plm:rsh_setup on agent ssh : rsh path NULL [atl1-01-mic0:189895] [[32419,0],0] plm:base:receive start comm [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_job [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm creating map [atl1-01-mic0:189895] [[32419,0],0] setup:vm: working unmanaged allocation [atl1-01-mic0:189895] [[32419,0],0] using dash_host [atl1-01-mic0:189895] [[32419,0],0] checking node atl1-01-mic0 [atl1-01-mic0:189895] [[32419,0],0] ignoring myself [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm only HNP in allocation [atl1-01-mic0:189895] [[32419,0],0] complete_setup on job [32419,1] [atl1-01-mic0:189895] [[32419,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 440 [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch_apps for job [32419,1] [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch wiring up iof for job [32419,1] [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch [32419,1] registered [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch job [32419,1] is not a dynamic spawn [atl1-01-mic0:189899] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting [atl1-01-mic0:189898] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting ---
Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC
rocess has sent help message help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed [atl1-01-mic0:190189] [[32137,0],0] plm:base:receive stop comm On 04/11/2015 07:41 PM, Ralph Castain wrote: Got it - thanks. I fixed that ERROR_LOG issue (I think- please verify). I suspect the memheap issue relates to something else, but I probably need to let the OSHMEM folks comment on it On Apr 11, 2015, at 9:52 AM, Andy Riebs <andy.ri...@hp.com> wrote: Everything is built on the Xeon side, with the icc "-mmic" switch. I then ssh into one of the PHIs, and run shmemrun from there. On 04/11/2015 12:00 PM, Ralph Castain wrote: Let me try to understand the setup a little better. Are you running shmemrun on the PHI itself? Or is it running on the host processor, and you are trying to spawn a process onto the Phi? On Apr 11, 2015, at 7:55 AM, Andy Riebs <andy.ri...@hp.com> wrote: Hi Ralph, Yes, this is attempting to get OSHMEM to run on the Phi. I grabbed openmpi-dev-1484-g033418f.tar.bz2 and configured it with $ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC=icc -mmic CXX=icpc -mmic \ --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \ AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib LD=x86_64-k1om-linux-ld \ --enable-mpirun-prefix-by-default --disable-io-romio --disable-mpi-fortran \ --enable-debug --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud (Note that I had to add "oob-ud" to the "--enable-mca-no-build" option, as the build complained that mca oob/ud needed mca common-verbs.) With that configuration, here is what I am seeing now... $ export SHMEM_SYMMETRIC_HEAP_SIZE=1G $ shmemrun -H localhost -N 2 --mca sshmem mmap --mca plm_base_verbose 5 $PWD/mic.out [atl1-01-mic0:189895] mca:base:select:( plm) Querying component [rsh] [atl1-01-mic0:189895] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL [atl1-01-mic0:189895] mca:base:select:( plm) Query of component [rsh] set priority to 10 [atl1-01-mic0:189895] mca:base:select:( plm) Querying component [isolated] [atl1-01-mic0:189895] mca:base:select:( plm) Query of component [isolated] set priority to 0 [atl1-01-mic0:189895] mca:base:select:( plm) Querying component [slurm] [atl1-01-mic0:189895] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module [atl1-01-mic0:189895] mca:base:select:( plm) Selected component [rsh] [atl1-01-mic0:189895] plm:base:set_hnp_name: initial bias 189895 nodename hash 4121194178 [atl1-01-mic0:189895] plm:base:set_hnp_name: final jobfam 32419 [atl1-01-mic0:189895] [[32419,0],0] plm:rsh_setup on agent ssh : rsh path NULL [atl1-
Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC
riebs/bench/hello/mic.out [atl1-01-mic0:190441] base/memheap_base_static.c:205 - _load_segments() add: 0060-00601000 rw-p 00:11 6029314 /home/ariebs/bench/hello/mic.out [atl1-01-mic0:190442] base/memheap_base_static.c:75 - mca_memheap_base_static_init() Memheap static memory: 3824 byte(s), 2 segments [atl1-01-mic0:190442] base/memheap_base_register.c:39 - mca_memheap_base_reg() register seg#00: 0x0xff00 - 0x0x10f20 270532608 bytes type=0x1 id=0x [atl1-01-mic0:190441] base/memheap_base_static.c:75 - mca_memheap_base_static_init() Memheap static memory: 3824 byte(s), 2 segments [atl1-01-mic0:190441] base/memheap_base_register.c:39 - mca_memheap_base_reg() register seg#00: 0x0xff00 - 0x0x10f20 270532608 bytes type=0x1 id=0x [atl1-01-mic0:190442] Error base/memheap_base_register.c:130 - _reg_segment() Failed to register segment [atl1-01-mic0:190441] Error base/memheap_base_register.c:130 - _reg_segment() Failed to register segment [atl1-01-mic0:190442] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting [atl1-01-mic0:190441] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting -- It looks like SHMEM_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during SHMEM_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open SHMEM developer): mca_memheap_base_select() failed --> Returned "Error" (-1) instead of "Success" (0) -- -- SHMEM_ABORT was invoked on rank 0 (pid 190441, host=atl1-01-mic0) with errorcode -1. -- -- A SHMEM process is aborting at a time when it cannot guarantee that all of its peer processes in the job will be killed properly. You should double check that everything has shut down cleanly. Local host: atl1-01-mic0 PID: 190441 -- --- Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted. --- [atl1-01-mic0:190439] [[31875,0],0] plm:base:orted_cmd sending orted_exit commands -- shmemrun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was: Process name: [[31875,1],0] Exit code: 255 -- [atl1-01-mic0:190439] 1 more process has sent help message help-shmem-runtime.txt / shmem_init:startup:internal-failure [atl1-01-mic0:190439] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages [atl1-01-mic0:190439] 1 more process has sent help message help-shmem-api.txt / shmem-abort [atl1-01-mic0:190439] 1 more process has sent help message help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed [atl1-01-mic0:190439] [[31875,0],0] plm:base:receive stop comm On 04/12/2015 03:09 PM, Ralph Castain wrote: Sorry about that - I hadn’t brought it over to the 1.8 branch yet. I’ve done so now, which means the ERROR_LOG shouldn’t show up any more. It won’t fix the memheap problem, though. You might try adding “--mca memheap_base_verbose 100” to your cmd line so we can see why none of the memheap components are being selected. 
On Apr 12, 2015, at 11:30 AM, Andy Riebs <andy.ri...@hp.com> wrote: Hi Ralph, Here's the output with openmpi-v1.8.4-202-gc2da6a5.tar.bz2: $ shmemrun -H localhost -N 2 --mca sshmem mmap --mca plm_base_verbose 5 $PWD/mic.out [atl1-01-mic0:190189] mca:base:select:( plm) Querying component [rsh] [atl1-01-mic0:190189] [[INVALID],INVALID] plm:rsh_lookup
Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC
s/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /home/ariebs/mic/mpi-nightly/bin/orted --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca orte_ess_jobid "1901330432" -mca orte_ess_vpid "" -mca orte_ess_num_procs "2" -mca orte_hnp_uri "1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1" --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca plm_base_verbose "5" --mca memheap_base_verbose "100" -mca plm "rsh" -mca rmaps_ppr_n_pernode "2" [atl1-01-mic0:191024] [[29012,0],0] plm:rsh:launch daemon 0 not a child of mine [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: adding node mic1 to launch list [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: activating launch event [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: recording launch of daemon [[29012,0],1] [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: executing: (/usr/bin/ssh) [/usr/bin/ssh mic1 PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /home/ariebs/mic/mpi-nightly/bin/orted --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca orte_ess_jobid "1901330432" -mca orte_ess_vpid 1 -mca orte_ess_num_procs "2" -mca orte_hnp_uri "1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1" --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca plm_base_verbose "5" --mca memheap_base_verbose "100" -mca plm "rsh" -mca rmaps_ppr_n_pernode "2"] /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory [atl1-01-mic0:191024] [[29012,0],0] daemon 1 failed with status 127 [atl1-01-mic0:191024] [[29012,0],0] plm:base:orted_cmd sending orted_exit commands -- ORTE was unable to reliably start one or more daemons. This usually is caused by: * not finding the required libraries and/or binaries on one or more nodes. Please check your PATH and LD_LIBRARY_PATH settings, or configure OMPI with --enable-orterun-prefix-by-default * lack of authority to execute on one or more specified nodes. Please verify your allocation and authorities. * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base). Please check with your sys admin to determine the correct location to use. * compilation of the orted with dynamic libraries when static are required (e.g., on Cray). Please check your configure cmd line and consider using one of the contrib/platform definitions for your system type. * an inability to create a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements). 
-- [atl1-01-mic0:191024] [[29012,0],0] plm:base:receive stop comm On 04/13/2015 08:50 AM, Andy Riebs wrote: Hi Ralph, Here are the results with last night's "master" nightly, openmpi-dev-1487-g9c6d452.tar.bz2, and adding the memheap_base_verbose option (yes, it looks like the "ERROR_LOG" problem has gone away): $ cat /proc/sys/kernel/shmmax 33554432 $ cat /proc/sys/kernel/shmall 2097152 $ cat /proc/sys/kernel/shmmni 4096 $ export SHMEM_SYMMETRIC_HEAP=1M $ shmemrun -H localhost -N 2 --mca sshmem mmap --mca plm_base_verbose 5 --mca memheap_base_verbose 100 $PWD/mic.out [atl1-01-mic0:190439] mca:base:select:( plm) Querying component [rsh] [atl1-01-mic0:190439] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL [atl1-01-mic0:190439] mca:base:select:( plm) Query of component [rsh] set priority to 10 [atl1-01-mic0:190439] mca:base:select:( plm) Querying component [isolated] [atl1-01-mic0:190439] mca:base:select:( plm) Query of component [isolated] set priority to 0 [atl1-01-mic0:190439] mca:base:select:( plm) Querying component [slurm] [atl1-01-mic0:190439] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module [atl1-01-mic0:190439] mca:base:select:( plm) Selected component
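One thing worth ruling out when memheap segment registration fails is the kernel's SysV shared-memory limits: the shmmax quoted at the top of that run is only 32 MB, far below the roughly 256 MB segment the memheap tried to register. This only matters if the sysv sshmem component ends up being used (the run above forced mmap), so treat the following as a diagnostic sketch rather than the fix:

# Current SysV limits (values from the run above):
cat /proc/sys/kernel/shmmax     # 33554432 bytes = 32 MB per segment
cat /proc/sys/kernel/shmall     # total pages of SysV shm allowed
cat /proc/sys/kernel/shmmni
# Temporarily raise the per-segment limit (as root) to rule it out:
sysctl -w kernel.shmmax=$((1 << 30))    # 1 GB per segment, for testing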
Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC
Ralph and Nathan, The problem may be something trivial, as I don't typically use "shmemrun" to start jobs. With the following, I *think* I've demonstrated that the problem library is where it belongs on the remote system: $ ldd mic.out linux-vdso.so.1 => (0x7fffb83ff000) liboshmem.so.0 => /home/ariebs/mic/mpi-nightly/lib/liboshmem.so.0 (0x2b059cfbb000) libmpi.so.0 => /home/ariebs/mic/mpi-nightly/lib/libmpi.so.0 (0x2b059d35a000) libopen-rte.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-rte.so.0 (0x2b059d7e3000) libopen-pal.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-pal.so.0 (0x2b059db53000) libm.so.6 => /lib64/libm.so.6 (0x2b059df3d000) libdl.so.2 => /lib64/libdl.so.2 (0x2b059e16c000) libutil.so.1 => /lib64/libutil.so.1 (0x2b059e371000) libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x2b059e574000) libpthread.so.0 => /lib64/libpthread.so.0 (0x2b059e786000) libc.so.6 => /lib64/libc.so.6 (0x2b059e9a4000) librt.so.1 => /lib64/librt.so.1 (0x2b059ecfc000) libimf.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so (0x2b059ef04000) libsvml.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libsvml.so (0x2b059f356000) libirng.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libirng.so (0x2b059fbef000) libintlc.so.5 => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libintlc.so.5 (0x2b059fe02000) /lib64/ld-linux-k1om.so.2 (0x2b059cd9a000) $ echo $LD_LIBRARY_PATH /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/../compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/tools/intel64/perfsys:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/intel64/gcc4.1:/opt/intel/15.0/composer_xe_2015.2.164/debugger/ipt/ia32/lib $ ssh mic1 file /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so: ELF 64-bit LSB shared object, Intel Xeon Phi coprocessor (k1om), version 1 (SYSV), dynamically linked, not stripped $ shmemrun -H mic1 -N 2 --mca btl scif,self $PWD/mic.out /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory ... On 04/13/2015 04:25 PM, Nathan Hjelm wrote: For talking between PHIs on the same system I recommend using the scif BTL NOT tcp. That said, it looks like the LD_LIBRARY_PATH is wrong on the remote system. It looks like it can't find the intel compiler libraries. -Nathan Hjelm HPC-5, LANL On Mon, Apr 13, 2015 at 04:06:21PM -0400, Andy Riebs wrote: Progress! I can run my trivial program on the local PHI, but not the other PHI, on the system. 
Here are the interesting parts: A pretty good recipe with last night's nightly master: $ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic" CXX="icpc -mmic" \ --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \ AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib LD=x86_64-k1om-linux-ld \ --enable-mpirun-prefix-by-default --disable-io-romio --disable-mpi-fortran \ --enable-orterun-prefix-by-default \ --enable-debug $ make && make install $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out Hello World from process 0 of 2 Hello World from process 1 of 2 $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml yoda --mca btl openib,sm,self $PWD/mic.out Hello World from process 0 of 2 Hello World from process 1 of 2 $ However, I can't seem to cross the fabric. I can ssh freely back and forth between mic0 and mic1. However, running the next 2 tests from mic0, it certainly seems like the second one should work, too: $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic0 -N 2 --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out Hello World from process 0 of 2 Hello World from process 1 of 2 $ shmemrun -x
Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC
/usr/bin/ssh PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /home/ariebs/mic/mpi-nightly/bin/orted -mca orte_leave_session_attached "1" --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca orte_ess_jobid "2203975680" -mca orte_ess_vpid "" -mca orte_ess_num_procs "2" -mca orte_hnp_uri "2203975680.0;usock;tcp://16.113.180.127,192.0.0.122:34640;ud://2883658.78.1" --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca plm_base_verbose "5" --mca memheap_base_verbose "100" --mca mca_component_show_load_errors "1" -mca plm "rsh" -mca rmaps_ppr_n_pernode "2" [atl1-02-mic0:16183] [[33630,0],0] plm:rsh:launch daemon 0 not a child of mine [atl1-02-mic0:16183] [[33630,0],0] plm:rsh: adding node mic1 to launch list [atl1-02-mic0:16183] [[33630,0],0] plm:rsh: activating launch event [atl1-02-mic0:16183] [[33630,0],0] plm:rsh: recording launch of daemon [[33630,0],1] [atl1-02-mic0:16183] [[33630,0],0] plm:rsh: executing: (/usr/bin/ssh) [/usr/bin/ssh mic1 PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /home/ariebs/mic/mpi-nightly/bin/orted -mca orte_leave_session_attached "1" --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca orte_ess_jobid "2203975680" -mca orte_ess_vpid 1 -mca orte_ess_num_procs "2" -mca orte_hnp_uri "2203975680.0;usock;tcp://16.113.180.127,192.0.0.122:34640;ud://2883658.78.1" --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca plm_base_verbose "5" --mca memheap_base_verbose "100" --mca mca_component_show_load_errors "1" -mca plm "rsh" -mca rmaps_ppr_n_pernode "2"] /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory [atl1-02-mic0:16183] [[33630,0],0] daemon 1 failed with status 127 [atl1-02-mic0:16183] [[33630,0],0] plm:base:orted_cmd sending orted_exit commands -- ORTE was unable to reliably start one or more daemons. This usually is caused by: * not finding the required libraries and/or binaries on one or more nodes. Please check your PATH and LD_LIBRARY_PATH settings, or configure OMPI with --enable-orterun-prefix-by-default * lack of authority to execute on one or more specified nodes. Please verify your allocation and authorities. * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base). Please check with your sys admin to determine the correct location to use. * compilation of the orted with dynamic libraries when static are required (e.g., on Cray). Please check your configure cmd line and consider using one of the contrib/platform definitions for your system type. * an inability to create a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements). -- [atl1-02-mic0:16183] [[33630,0],0] plm:base:receive stop comm On 04/13/2015 07:47 PM, Ralph Castain wrote: Weird. I’m not sure what to try at that point - IIRC, building static won’t resolve this problem (but you could try and see). 
You could add the following to the cmd line and see if it tells us anything useful: —leave-session-attached —mca mca_component_show_load_errors 1 You might also do an ldd on /home/ariebs/mic/mpi-nightly/bin/orted and see where it is looking for libimf since it (and not mic.out) is the one complaining On Apr 13, 2015, at 1:58 PM, Andy Riebs <andy.ri...@hp.com> wrote: Ralph and Nathan, The problem may be something trivial, as I don't typically use "shmemrun" to start jobs. With the following, I *think* I
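Ralph's ldd suggestion can be run in one shot from the launching card; the point is to see what the remote loader, with the environment a non-interactive ssh shell gets, thinks is missing. A sketch using the host name and install prefix from this thread:

# Which shared libraries can the remote side not resolve for orted?
ssh mic1 'ldd /home/ariebs/mic/mpi-nightly/bin/orted | grep "not found"'
# And what LD_LIBRARY_PATH does a non-interactive shell on mic1 actually have?
ssh mic1 'echo LD_LIBRARY_PATH=$LD_LIBRARY_PATH'

If libimf.so shows up as "not found" there while the interactive ldd above resolves it, the difference is the environment rather than the installation, which is where the rpath, -static-intel, and wrapper suggestions later in the thread come in.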
Re: [OMPI users] One-sided communication, a missing/non-existing API call
Nick, You may have more luck looking into the OSHMEM layer of Open MPI; SHMEM is designed for one-sided communications. BR, Andy On 04/14/2015 02:36 PM, Nick Papior Andersen wrote: Dear all, I am trying to implement some features using a one-sided communication scheme. The problem is that I understand the different one-sided communication schemes as this (basic words): MPI_Get) fetches remote window memory to a local memory space MPI_Get_Accumulate) 1. fetches remote window memory to a local memory space 2. sends a local memory space (different from that used in 1.) to the remote window and does OP on those two quantities MPI_Put) sends local memory space to remote window memory MPI_Accumulate) sends a local memory space to the remote window and does OP on those two quantities (surprisingly the documentation says that this only works with windows within the same node, note that MPI_Get_Accumulate does not say this constraint) ?) Where is the function that fetches remotely and does operation in a local memory space? Do I really have to do MPI_Get to local memory, then do operation manually? (no it is not difficult, but... ;) ) I would like this to exist: MPI_Get_Reduce(origin,...,target,...,MPI_OP,...) When I just looked at the API names I thought Get_Accumulate did this, but to my surprise that was not the case at all. :) -- Kind regards Nick ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2015/04/26723.php
Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC
Gilles and Ralph, thanks! $ shmemrun -H mic0,mic1 -n 2 -x SHMEM_SYMMETRIC_HEAP_SIZE=1M $PWD/mic.out [atl1-01-mic0:192474] [[29886,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 440 Hello World from process 0 of 2 Hello World from process 1 of 2 $ This was built with the openmpi-dev-1487-g9c6d452.tar.bz2 nightly master. Oddly, -static-intel didn't work. Fortunately, -rpath did. I'll follow-up in the next day or so with the winning build recipes for both MPI and the user app to wrap up this note and, one hopes, save others from some frustration in the future. Andy On 04/14/2015 11:10 PM, Ralph Castain wrote: I think Gilles may be correct here. In reviewing the code, it appears we have never (going back to the 1.6 series, at least) forwarded the local LD_LIBRARY_PATH to the remote node when exec’ing the orted. The only thing we have done is to set the PATH and LD_LIBRARY_PATH to support the OMPI prefix - not any supporting libs. What we have required, therefore, is that your path be setup properly in the remote .bashrc (or pick your shell) to handle the libraries. As I indicated, the -x option only forwards envars to the application procs themselves, not the orted. I could try to add another cmd line option to forward things for the orted, but the concern we’ve had in the past (and still harbor) is that the ssh cmd line is limited in length. Thus, adding some potentially long paths to support this option could overwhelm it and cause failures. I’d try the static method first, or perhaps the LDFLAGS Gilles suggested. On Apr 14, 2015, at 5:11 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote: Andy, what about reconfiguring Open MPI with LDFLAGS="-Wl,-rpath,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic" ? IIRC, an other option is : LDFLAGS="-static-intel" last but not least, you can always replace orted with a simple script that sets the LD_LIBRARY_PATH and exec the original orted do you have the same behaviour on non MIC hardware when Open MPI is compiled with intel compilers ? if it works on non MIC hardware, the root cause could be in the sshd_config of the MIC that does not accept to receive LD_LIBRARY_PATH my 0.02 US$ Gilles On 4/14/2015 11:20 PM, Ralph Castain wrote: Hmmm…certainly looks that way. I’ll investigate. On Apr 14, 2015, at 6:06 AM, Andy Riebs <andy.ri...@hp.com> wrote: Hi Ralph, Still no happiness... It looks like my LD_LIBRARY_PATH just isn't getting propagated? $ ldd /home/ariebs/mic/mpi-nightly/bin/orted linux-vdso.so.1 => (0x7fffa1d3b000) libopen-rte.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-rte.so.0 (0x2ab6ce464000) libopen-pal.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-pal.so.0 (0x2ab6ce7d3000) libm.so.6 => /lib64/libm.so.6 (0x2ab6cebbd000) libdl.so.2 => /lib64/libdl.so.2 (0x2ab6ceded000) librt.so.1 => /lib64/librt.so.1 (0x2ab6ceff1000) libutil.so.1 => /lib64/libutil.so.1 (0x2ab6cf1f9000) libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x2ab6cf3fc000) libpthread.so.0 => /lib64/libpthre
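Of the three options Gilles lists, the orted wrapper is the easiest to show in a few lines. A sketch, assuming the real daemon has first been renamed (for example to orted.real) and using the Intel library path quoted earlier in the thread:

#!/bin/bash
# Stand-in for orted: make the Intel runtime visible before exec'ing the
# real daemon, since non-interactive ssh shells on the MIC do not source
# compilervars.sh and therefore cannot find libimf.so.
export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH
exec /home/ariebs/mic/mpi-nightly/bin/orted.real "$@"

The rpath and -static-intel routes avoid the extra indirection, and the plm_rsh_pass_path/plm_rsh_pass_libpath MCA parameters discussed later in this thread eventually made the workaround unnecessary.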
Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC
Hi Ralph, If I did this right (NEVER a good bet :-) ), it didn't work... Using last night's master nightly, openmpi-dev-1515-gc869490.tar.bz2, I built with the same script as yesterday, but removing the LDFLAGS=-Wl, stuff: $ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic" CXX="icpc -mmic" \ --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \ AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib LD=x86_64-k1om-linux-ld \ --enable-mpirun-prefix-by-default --disable-io-romio --disable-mpi-fortran \ --enable-debug --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud $ make $ make install ... make[1]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490/test' make[1]: Entering directory `/home/ariebs/mic/openmpi-dev-1515-gc869490' make[2]: Entering directory `/home/ariebs/mic/openmpi-dev-1515-gc869490' make install-exec-hook make[3]: Entering directory `/home/ariebs/mic/openmpi-dev-1515-gc869490' make[3]: ./config/find_common_syms: Command not found make[3]: [install-exec-hook] Error 127 (ignored) make[3]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490' make[2]: Nothing to be done for `install-data-am'. make[2]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490' make[1]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490' $ But it seems to finish the install. I then tried to run, adding the new mca arguments: $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -mca plm_rsh_pass_path $PATH -mca plm_rsh_pass_libpath $MIC_LD_LIBRARY_PATH -H mic0,mic1 -n 2 ./mic.out /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory ... $ echo $MIC_LD_LIBRARY_PATH /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/mic:/opt/intel/mic/coi/device-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/lib/mic:/opt/intel/mic/coi/device-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/mic $ ls /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.* /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.a /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so $ On 04/16/2015 07:22 AM, Ralph Castain wrote: FWIW: I just added (last night) a pair of new MCA params for this purpose: plm_rsh_pass_path prepends the designated path to the remote shell's PATH prior to executing orted plm_rsh_pass_libpath same thing for LD_LIBRARY_PATH I believe that will resolve the problem for Andy regardless of compiler used. In the master now, waiting for someone to verify it before adding to 1.8.5. Sadly, I am away from any cluster for the rest of this week, so I'd welcome anyone having a chance to test it. On Thu, Apr 16, 2015 at 2:57 AM, Thomas Jahnswrote: Hello, On Apr 15, 2015, at 02:11 , Gilles Gouaillardet wrote: what about reconfiguring Open MPI with LDFLAGS="-Wl,-rpath,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic" ? IIRC, an other option is : LDFLAGS="-static-intel" let me first state that I have no experience developing for MIC. But regarding the Intel runtime libraries, the only sane option in my opinion is to use the icc.cfg/ifort.cfg/icpc.cfg files that get put in the same directory as the corresponding compiler binaries and add a line like -Wl,-rpath,/path/to/composerxe/lib/intel?? to that file. 
Regards, Thomas -- Thomas Jahns DKRZ GmbH, Department: Application s
Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC
Hi Ralph, Did you solve this problem in a more general way? I finally sat down this morning to try this with the openmpi-dev-1567-g11e8c20.tar.bz2 nightly kit from last week, and can't reproduce the problem at all. Andy On 04/16/2015 12:15 PM, Ralph Castain wrote: Sorry - I had to revert the commit due to a reported MTT problem. I'll reinsert it after I get home and can debug the problem this weekend. On Thu, Apr 16, 2015 at 9:41 AM, Andy Riebs <andy.ri...@hp.com> wrote: Hi Ralph, If I did this right (NEVER a good bet :-) ), it didn't work... Using last night's master nightly, openmpi-dev-1515-gc869490.tar.bz2, I built with the same script as yesterday, but removing the LDFLAGS=-Wl, stuff: $ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic" CXX="icpc -mmic" \ --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \ AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib LD=x86_64-k1om-linux-ld \ --enable-mpirun-prefix-by-default --disable-io-romio --disable-mpi-fortran \ --enable-debug --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud $ make $ make install ... make[1]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490/test' make[1]: Entering directory `/home/ariebs/mic/openmpi-dev-1515-gc869490' make[2]: Entering directory `/home/ariebs/mic/openmpi-dev-1515-gc869490' make install-exec-hook make[3]: Entering directory `/home/ariebs/mic/openmpi-dev-1515-gc869490' make[3]: ./config/find_common_syms: Command not found make[3]: [install-exec-hook] Error 127 (ignored) make[3]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490' make[2]: Nothing to be done for `install-data-am'. make[2]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490' make[1]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490' $ But it seems to finish the install. I then tried to run, adding the new mca arguments: $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -mca plm_rsh_pass_path $PATH -mca plm_rsh_pass_libpath $MIC_LD_LIBRARY_PATH -H mic0,mic1 -n 2 ./mic.out /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory ... $ echo $MIC_LD_LIBRARY_PATH /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/mic:/opt/intel/mic/coi/device-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/lib/mic:/opt/intel/mic/coi/device-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/mic $ ls /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.* /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.a /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so $ On 04/16/2015 07:22 AM, Ralph Castain wrote: FWIW: I just added (last night) a pair of new MCA params for this purpose: plm_rsh_pass_path prepends the designated path to the remote shell's PATH prior to executing orted plm_rsh_pass_libpath same thing for LD_LIBRARY_PATH I believe that will resolve the problem for Andy regardless of compiler used. In the master now, waiting for someone to verify it before adding to 1.8.5. Sadly, I am away from any cluster for t
Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC
Yes, it just worked -- I took the old command line, just to ensure that I was testing the correct problem, and it worked. Then I remembered that I had set OMPI_MCA_plm_rsh_pass_path and OMPI_MCA_plm_rsh_pass_libpath in my test setup, so I removed those from my environment, ran again, and it still worked! Whatever it is that you're doing Ralph, keep it up :-) Regardless of the cause or result, thanks $$ for poking at this! Andy On 04/26/2015 10:35 AM, Ralph Castain wrote: Not intentionally - I did add that new MCA param as we discussed, but don’t recall making any other changes in this area. There have been some other build system changes made as a result of more extensive testing of the 1.8 release candidate - it is possible that something in that area had an impact here. Are you saying it just works, even without passing the new param? On Apr 26, 2015, at 6:39 AM, Andy Riebs <andy.ri...@hp.com> wrote: Hi Ralph, Did you solve this problem in a more general way? I finally sat down this morning to try this with the openmpi-dev-1567-g11e8c20.tar.bz2 nightly kit from last week, and can't reproduce the problem at all. Andy On 04/16/2015 12:15 PM, Ralph Castain wrote: Sorry - I had to revert the commit due to a reported MTT problem. I'll reinsert it after I get home and can debug the problem this weekend. On Thu, Apr 16, 2015 at 9:41 AM, Andy Riebs <andy.ri...@hp.com> wrote: Hi Ralph, If I did this right (NEVER a good bet :-) ), it didn't work... Using last night's master nightly, openmpi-dev-1515-gc869490.tar.bz2, I built with the same script as yesterday, but removing the LDFLAGS=-Wl, stuff: $ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic" CXX="icpc -mmic" \ --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \ AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib LD=x86_64-k1om-linux-ld \ --enable-mpirun-prefix-by-default --disable-io-romio --disable-mpi-fortran \ --enable-debug --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud $ make $ make install ... make[1]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490/test' make[1]: Entering directory `/home/ariebs/mic/openmpi-dev-1515-gc869490' make[2]: Entering directory `/home/ariebs/mic/openmpi-dev-1515-gc869490' make install-exec-hook make[3]: Entering directory `/home/ariebs/mic/openmpi-dev-1515-gc869490' make[3]: ./config/find_common_syms: Command not found make[3]: [install-exec-hook] Error 127 (ignored) make[3]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490' make[2]: Nothing to be done for `install-data-am'. make[2]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490' make[1]: Leaving directory `/home/ariebs/mic/openmpi-dev-1515-gc869490' $ But it seems to finish the install. I then tried to run, adding the new mca arguments:
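For anyone who finds this thread later: the two ways of passing those parameters are interchangeable. A quick sketch, reusing the paths and hosts from the earlier messages:

# export them as environment variables that mpirun/shmemrun pick up...
$ export OMPI_MCA_plm_rsh_pass_path=$PATH
$ export OMPI_MCA_plm_rsh_pass_libpath=$MIC_LD_LIBRARY_PATH
$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic0,mic1 -n 2 ./mic.out

# ...or pass them on the command line
$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M \
    -mca plm_rsh_pass_path $PATH -mca plm_rsh_pass_libpath $MIC_LD_LIBRARY_PATH \
    -H mic0,mic1 -n 2 ./mic.out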
Re: [OMPI users] MPIRUN SEGMENTATION FAULT
The challenge for the MPI experts here (of which I am NOT one!) is that the problem appears to be in your program; MPI is simply reporting that your program failed. If you got the program from someone else, you will need to solicit their help. If you wrote it, well, it is never a bad time to learn to use gdb! Best regards Andy On 04/23/2016 10:41 AM, Elio Physics wrote: I am not really an expert with gdb. What is the core file? and how to use gdb? I have got three files as an output when the executable is used. One is the actual output which stops and the other two are error files (from which I knew about the segmentation fault). thanks From: users on behalf of Ralph Castain Sent: Saturday, April 23, 2016 11:39 AM To: Open MPI Users Subject: Re: [OMPI users] MPIRUN SEGMENTATION FAULT valgrind isn’t going to help here - there are multiple reasons why your application could be segfaulting. Take a look at the core file with gdb and find out where it is failing. On Apr 22, 2016, at 10:20 PM, Elio Physics wrote: One more thing I forgot to mention in my previous e-mail. In the output file I get the following message: 2 total processes killed (some possibly by mpirun during cleanup) Thanks From: users on behalf of Elio Physics Sent: Saturday, April 23, 2016 3:07 AM To: Open MPI Users Subject: Re: [OMPI users] MPIRUN SEGMENTATION FAULT I have used valgrind and this is what I got: valgrind mpirun ~/Elie/SPRKKR/bin/kkrscf6.3MPI Fe_SCF.inp > scf-51551.jlborges.fisica.ufmg.br.out ==8135== Memcheck, a memory error detector ==8135== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al. ==8135== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info ==8135== Command: mpirun /home/emoujaes/Elie/SPRKKR/bin/kkrscf6.3MPI Fe_SCF.inp ==8135== -- mpirun noticed that process rank 0 with PID 8147 on node jlborges.fisica.ufmg.br exited on signal 11 (Segmentation fault). -- ==8135== ==8135== HEAP SUMMARY: ==8135== in use at exit: 485,683 bytes in 1,899 blocks ==8135== total heap usage: 7,723 allocs, 5,824 frees, 12,185,660 bytes allocated ==8135== ==8135== LEAK SUMMARY: ==8135== definitely lost: 34,944 bytes in 34 blocks ==8135== indirectly lost: 26,613 bytes in 58 blocks ==8135== possibly lost: 0 bytes in 0 blocks ==8135== still reachable: 424,126 bytes in 1,807 blocks ==8135== suppressed: 0 bytes in 0 blocks
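In case it helps, the usual recipe looks something like the sketch below; the core file name is only an example (naming varies by system), core dumps may have to be enabled first, and recompiling with -g gives readable line numbers:

$ ulimit -c unlimited          # allow core files to be written, then re-run the failing job
$ gdb ~/Elie/SPRKKR/bin/kkrscf6.3MPI core.8147   # open the executable together with the core it left behind
(gdb) bt                       # print the stack trace at the point of the segfault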
[OMPI users] Problems using 1.10.2 with MOFED 3.1-1.1.0.1
I've built 1.10.2 with all my favorite configuration options, but I get messages such as this (one for each rank with orte_base_help_aggregate=0) when I try to run on a MOFED system: $ shmemrun -H hades02,hades03 $PWD/shmem.out -- No OpenFabrics connection schemes reported that they were able to be used on a specific port. As such, the openib BTL (OpenFabrics support) will be disabled for this port. Local host: hades03 Local device: mlx4_0 Local port: 2 CPCs attempted: rdmacm, udcm -- My configure options: config_opts="--prefix=${INSTALL_DIR} \ --without-mpi-param-check \ --with-knem=/opt/mellanox/hpcx/knem \ --with-mxm=/opt/mellanox/mxm \ --with-mxm-libdir=/opt/mellanox/mxm/lib \ --with-fca=/opt/mellanox/fca \ --with-pmi=${INSTALL_ROOT}/slurm \ --without-psm --disable-dlopen \ --disable-vt \ --enable-orterun-prefix-by-default \ --enable-debug-symbols" There aren't any obvious error messages in the build log -- what am I missing? Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE
Re: [OMPI users] Problems using 1.10.2 with MOFED 3.1-1.1.0.1
For anyone like me who happens to google this in the future, the solution was to set OMPI_MCA_pml=yalla Many thanks Josh! On 05/05/2016 12:52 PM, Joshua Ladd wrote: We are working with Andy offline. Josh On Thu, May 5, 2016 at 7:32 AM, Andy Riebs <andy.ri...@hpe.com> wrote: I've built 1.10.2 with all my favorite configuration options, but I get messages such as this (one for each rank with orte_base_help_aggregate=0) when I try to run on a MOFED system: $ shmemrun -H hades02,hades03 $PWD/shmem.out -- No OpenFabrics connection schemes reported that they were able to be used on a specific port. As such, the openib BTL (OpenFabrics support) will be disabled for this port. Local host: hades03 Local device: mlx4_0 Local port: 2 CPCs attempted: rdmacm, udcm -- My configure options: config_opts="--prefix=${INSTALL_DIR} \ --without-mpi-param-check \ --with-knem=/opt/mellanox/hpcx/knem \ --with-mxm=/opt/mellanox/mxm \ --with-mxm-libdir=/opt/mellanox/mxm/lib \ --with-fca=/opt/mellanox/fca \ --with-pmi=${INSTALL_ROOT}/slurm \ --without-psm --disable-dlopen \ --disable-vt \ --enable-orterun-prefix-by-default \ --enable-debug-symbols" There aren't any obvious error messages in the build log -- what am I missing? Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE ___ users mailing list us...@open-mpi.org Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2016/05/29094.php ___ users mailing list us...@open-mpi.org Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2016/05/29100.php
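Spelled out, with the same hosts as in the original report, either of these does it:

# export the PML selection...
$ export OMPI_MCA_pml=yalla
$ shmemrun -H hades02,hades03 $PWD/shmem.out

# ...or pass it as an MCA parameter on the command line
$ shmemrun -mca pml yalla -H hades02,hades03 $PWD/shmem.out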
Re: [OMPI users] Problems using 1.10.2 with MOFED 3.1-1.1.0.1
Sorry, my output listing was incomplete -- the program did run after the "No OpenFabrics" message, but (I presume) ran over Ethernet rather than InfiniBand. So I can't really say what was causing it to fail. Andy On 05/05/2016 06:09 PM, Nathan Hjelm wrote: It should work fine with ob1 (the default). Did you determine what was causing it to fail? -Nathan On Thu, May 05, 2016 at 06:04:55PM -0400, Andy Riebs wrote: For anyone like me who happens to google this in the future, the solution was to set OMPI_MCA_pml=yalla Many thanks Josh! On 05/05/2016 12:52 PM, Joshua Ladd wrote: We are working with Andy offline. Josh On Thu, May 5, 2016 at 7:32 AM, Andy Riebs wrote: I've built 1.10.2 with all my favorite configuration options, but I get messages such as this (one for each rank with orte_base_help_aggregate=0) when I try to run on a MOFED system: $ shmemrun -H hades02,hades03 $PWD/shmem.out -- No OpenFabrics connection schemes reported that they were able to be used on a specific port. As such, the openib BTL (OpenFabrics support) will be disabled for this port. Local host: hades03 Local device: mlx4_0 Local port: 2 CPCs attempted: rdmacm, udcm -- My configure options: config_opts="--prefix=${INSTALL_DIR} \ --without-mpi-param-check \ --with-knem=/opt/mellanox/hpcx/knem \ --with-mxm=/opt/mellanox/mxm \ --with-mxm-libdir=/opt/mellanox/mxm/lib \ --with-fca=/opt/mellanox/fca \ --with-pmi=${INSTALL_ROOT}/slurm \ --without-psm --disable-dlopen \ --disable-vt \ --enable-orterun-prefix-by-default \ --enable-debug-symbols" There aren't any obvious error messages in the build log -- what am I missing? Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE ___ users mailing list us...@open-mpi.org Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2016/05/29094.php ___ users mailing list us...@open-mpi.org Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2016/05/29100.php ___ users mailing list us...@open-mpi.org Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2016/05/29101.php ___ users mailing list us...@open-mpi.org Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2016/05/29102.php
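A note for the archives: if you need to confirm which PML/BTL a run actually selected (e.g. yalla/MXM over InfiniBand versus ob1 over TCP), raising the framework verbosity usually shows it. These are standard MCA verbosity knobs; the rest of the command is just the example from this thread:

$ shmemrun -H hades02,hades03 \
    -mca pml_base_verbose 10 -mca btl_base_verbose 10 \
    $PWD/shmem.out 2>&1 | grep -i select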
[OMPI users] Problem with MPI jobs terminating when using OMPI 3.0.x
We have built a version of Open MPI 3.0.x that works with Slurm (our primary use case), but it fails when executed without Slurm. If I srun an MPI "hello world" program, it works just fine. Likewise, if I salloc a couple of nodes and use mpirun from there, life is good. But if I just try to mpirun the program without Slurm support, the program appears to run to completion, and then segv's. A bit of good news is that this can be reproduced with a single process. Sample output and configuration information below: [tests]$ cat gdb.cmd set follow-fork-mode child r [tests]$ mpirun -host node04 -np 1 gdb -x gdb.cmd ./mpi_hello GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /home/riebs/tests/mpi_hello...(no debugging symbols found)...done. [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". [New Thread 0x74be8700 (LWP 21386)] [New Thread 0x73f70700 (LWP 21387)] [New Thread 0x7fffeacac700 (LWP 21393)] [Thread 0x7fffeacac700 (LWP 21393) exited] [New Thread 0x7fffeacac700 (LWP 21394)] Hello world! I'm 0 of 1 on node04 [Thread 0x7fffeacac700 (LWP 21394) exited] [Thread 0x73f70700 (LWP 21387) exited] [Thread 0x74be8700 (LWP 21386) exited] [Inferior 1 (process 21382) exited normally] Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7.x86_64 libevent-2.0.21-4.el7.x86_64 libgcc-4.8.5-11.el7.x86_64 libibcm-1.0.5mlnx2-OFED.3.4.0.0.4.34100.x86_64 libibumad-1.3.10.2.MLNX20150406.966500d-0.1.34100.x86_64 libibverbs-1.2.1mlnx1-OFED.3.4.2.1.4.34218.x86_64 libmlx4-1.2.1mlnx1-OFED.3.4.0.0.4.34218.x86_64 libmlx5-1.2.1mlnx1-OFED.3.4.2.1.4.34218.x86_64 libnl-1.1.4-3.el7.x86_64
librdmacm-1.1.0mlnx-OFED.3.4.0.0.4.34218.x86_64 libtool-ltdl-2.4.2-21.el7_2.x86_64 numactl-libs-2.0.9-6.el7_2.x86_64 opensm-libs-4.8.0.MLNX20161013.9b1a49b-0.1.34218.x86_64 zlib-1.2.7-17.el7.x86_64 (gdb) q [node04:21373] *** Process received signal *** [node04:21373] Signal: Segmentation fault (11) [node04:21373] Signal code: (128) [node04:21373] Failing at address: (nil) [node04:21373] [ 0] /lib64/libpthread.so.0(+0xf370)[0x760c4370] [node04:21373] [ 1] /opt/local/pmix/1.2.1/lib/libpmix.so.2(+0x3a04b)[0x7365104b] [node04:21373] [ 2] /lib64/libevent-2.0.so.5(event_base_loop+0x774)[0x764e4a14] [node04:21373] [ 3] /opt/local/pmix/1.2.1/lib/libpmix.so.2(+0x285cd)[0x7363f5cd] [node04:21373] [ 4] /lib64/libpthread.so.0(+0x7dc5)[0x760bcdc5] [node04:21373] [ 5] /lib64/libc.so.6(clone+0x6d)[0x75deb73d] [node04:21373] *** End of error message *** bash: line 1: 21373 Segmentation fault /opt/local/shmem/3.0.x.4ca1c4d/bin/orted -mca ess "env" -mca ess_base_jobid "399966208" -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex "node[2:73],node[4:0]04@0(2)" -mca orte_hnp_uri "399966208.0;tcp://16.95.253.128,10.4.0.6:52307" -mca plm "rsh" -mca coll_tuned_use_dynamic_rules "1" -mca scoll "^mpi" -mca pml "ucx" -mca coll_tuned_allgatherv_algorithm "2" -mca atomic "ucx" -mca sshmem "mmap" -mca spml_ucx_heap_reg_nb "1" -mca coll_tuned_allgather_algorithm "2" -mca spml "ucx" -mca coll "^hcoll" -mca pmix "^s1,s2,cray,isolated" [tests]$ env | grep -E -e MPI -e UCX -e SLURM | sort OMPI_MCA_atomic=ucx OMPI_MCA_coll=^hcoll OMPI_MCA_coll_tuned_allgather_algorithm=2 OMPI_MCA_coll_tuned_allgatherv_algorithm=2 OMPI_MCA_coll_tuned_use_dynamic_rules=1 OMPI_MCA_pml=ucx OMPI_MCA_scoll=^mpi OMPI_MCA_spml=ucx OMPI_MCA_spml_ucx_heap_reg_nb=1 OMPI_MCA_sshmem=mmap OPENMPI_PATH=/opt/local/shmem/3.0.x.4ca1c4d OPENMPI_VER=3.0.x.4ca1c4d SLURM_DISTRIBUTION=block:block SLURM_HINT=nomultithread SLURM_SRUN_REDUCE_TASK_EXIT=1 SLURM_TEST_EXEC=1 SLURM_UNBUFFEREDIO=1 SLURM_VER=17.11.0-0pre2 UCX_TLS=dc_x UCX_ZCOPY_THRESH=131072 [tests]$ OS: CentOS 7.3 HW: x86_64 (KNL) OMPI version: 3.0.x.4ca1c4d Configuration options: --prefix=/opt/local/shmem/3.0.x.4ca1c4d --with-hcoll=/opt/mellanox/hpcx-v2.0.0-gcc-MLNX_OFED_LINUX-3.4-2.1.8.0-redhat7.3-x86_64/hcoll --with-hwloc=/opt/local/hwloc/1.11.4 --with-knem=/opt/mellanox/hpcx-v2.0.0-gcc-MLNX_OFED_LINUX-3.4-2.1.8.0-redhat7.3-x86_64/knem --with-libevent=/usr --with-mxm=/opt/mellanox/hpcx-v2.0.0-gcc-MLNX_OFED_LINUX-3.4-2.1.8.0-redhat7.3-x86_64/mxm --with-platform=cont
Re: [OMPI users] Problem with MPI jobs terminating when using OMPI 3.0.x
As always, thanks for your help Ralph! Cutting over to PMIx 1.2.4 solved the problem for me. (Slurm wasn't happy building with PMIx v2.) And yes, I had ssh access to node04. (And Gilles, thanks for your note, as well.) Andy On 10/27/2017 04:31 PM, r...@open-mpi.org wrote: Two questions: 1. are you running this on node04? Or do you have ssh access to node04? 2. I note you are building this against an old version of PMIx for some reason. Does it work okay if you build it with the embedded PMIx (which is 2.0)? Does it work okay if you use PMIx v1.2.4, the latest release in that series? On Oct 27, 2017, at 1:24 PM, Andy Riebs wrote: We have built a version of Open MPI 3.0.x that works with Slurm (our primary use case), but it fails when executed without Slurm. If I srun an MPI "hello world" program, it works just fine. Likewise, if I salloc a couple of nodes and use mpirun from there, life is good. But if I just try to mpirun the program without Slurm support, the program appears to run to completion, and then segv's. A bit of good news is that this can be reproduced with a single process. Sample output and configuration information below: [tests]$ cat gdb.cmd set follow-fork-mode child r [tests]$ mpirun -host node04 -np 1 gdb -x gdb.cmd ./mpi_hello GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /home/riebs/tests/mpi_hello...(no debugging symbols found)...done. [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". [New Thread 0x74be8700 (LWP 21386)] [New Thread 0x73f70700 (LWP 21387)] [New Thread 0x7fffeacac700 (LWP 21393)] [Thread 0x7fffeacac700 (LWP 21393) exited] [New Thread 0x7fffeacac700 (LWP 21394)] Hello world! I'm 0 of 1 on node04 [Thread 0x7fffeacac700 (LWP 21394) exited] [Thread 0x73f70700 (LWP 21387) exited] [Thread 0x74be8700 (LWP 21386) exited] [Inferior 1 (process 21382) exited normally] Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7.x86_64 libevent-2.0.21-4.el7.x86_64 libgcc-4.8.5-11.el7.x86_6 4 libibcm-1.0.5mlnx2-OFED.3.4.0.0.4.34100.x86_64 libibumad-1.3.10.2.MLNX20150406.966500d-0.1.34100.x86_64 libibverbs-1.2.1mlnx1-OFED .3.4.2.1.4.34218.x86_64 libmlx4-1.2.1mlnx1-OFED.3.4.0.0.4.34218.x86_64 libmlx5-1.2.1mlnx1-OFED.3.4.2.1.4.34218.x86_64 libnl-1.1.4-3. 
el7.x86_64 librdmacm-1.1.0mlnx-OFED.3.4.0.0.4.34218.x86_64 libtool-ltdl-2.4.2-21.el7_2.x86_64 numactl-libs-2.0.9-6.el7_2.x86_64 open sm-libs-4.8.0.MLNX20161013.9b1a49b-0.1.34218.x86_64 zlib-1.2.7-17.el7.x86_64 (gdb) q [node04:21373] *** Process received signal *** [node04:21373] Signal: Segmentation fault (11) [node04:21373] Signal code: (128) [node04:21373] Failing at address: (nil) [node04:21373] [ 0] /lib64/libpthread.so.0(+0xf370)[0x760c4370] [node04:21373] [ 1] /opt/local/pmix/1.2.1/lib/libpmix.so.2(+0x3a04b)[0x7365104b] [node04:21373] [ 2] /lib64/libevent-2.0.so.5(event_base_loop+0x774)[0x764e4a14] [node04:21373] [ 3] /opt/local/pmix/1.2.1/lib/libpmix.so.2(+0x285cd)[0x7363f5cd] [node04:21373] [ 4] /lib64/libpthread.so.0(+0x7dc5)[0x760bcdc5] [node04:21373] [ 5] /lib64/libc.so.6(clone+0x6d)[0x75deb73d] [node04:21373] *** End of error message *** bash: line 1: 21373 Segmentation fault /opt/local/shmem/3.0.x.4ca1c4d/bin/orted -mca ess "env" -mca ess_base_jobid "399966208" -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex "node[2:73],node[4:0]04@0(2)" -mca orte_hnp_uri "3 99966208.0;tcp://16.95.253.128,10.4.0.6:52307" -mca plm "rsh" -mca coll_tuned_use_dynamic_rules "1" -mca scoll "^mpi" -mca pml "ucx" -mca coll_tuned_allgatherv_algorithm "2" -mca atomic "ucx" -mca sshmem "mmap" -mca spml_ucx_heap_reg_nb "1" -mca coll_tuned_allgath er_algorithm "2" -mca spml "ucx" -mca coll "^hcoll" -mca pmix "^s1,s2,cray,isolated" [tests]$ env | grep -E -e MPI -e UCX -e SLURM | sort OMPI_MCA_atomic=ucx OMPI_MCA_coll=^hcoll OMPI_MCA_coll_tuned_allgather_algorithm=2 OMPI_MCA_coll_tuned_allgatherv_algorithm=2 OMPI_MCA_coll_tuned_use_dynamic_rules=1 OMPI_MCA_pml=ucx OMPI_MCA_scoll=^mpi OMPI_MCA_spml=ucx OMPI_MCA_spml_ucx_heap_reg_nb=1 OMPI_MCA_sshmem=mmap OPENMPI_PATH=/opt/local/shmem/3.0.x.4ca1c4d OPENMPI_VER=3.0.x.4ca1c4d SLURM_DISTRIBUTION=block:block SLURM_HINT=nomultithread SLUR
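For completeness, the fix boiled down to pointing Open MPI at a PMIx 1.2.4 install instead of the 1.2.1 seen in the backtrace; the PMIx install path below is just a placeholder, and the remaining options stay as in the original configure line:

$ ./configure --prefix=/opt/local/shmem/3.0.x.4ca1c4d \
    --with-pmix=/opt/local/pmix/1.2.4 \
    ...         # other --with-* options unchanged
$ make && make install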
Re: [OMPI users] no openmpi over IB on new CentOS 7 system
Noam, Start with the FAQ, etc., under "Getting Help/Support" in the left-column menu at https://www.open-mpi.org/ Andy From: Noam Bernstein Sent: Tuesday, October 09, 2018 2:26PM To: Open Mpi Users Cc: Subject: [OMPI users] no openmpi over IB on new CentOS 7 system Hi - I’m trying to get OpenMPI working on a newly configured CentOS 7 system, and I’m not even sure what information would be useful to provide. I’m using the CentOS built in libibverbs and/or libfabric, and I configure openmpi with just --with-verbs --with-ofi --prefix=$DEST also tried --without-ofi, no change. Basically, I can run with “--mca btl self,vader”, but if I try “--mca btl,openib” I get an error from each process: [compute-0-0][[24658,1],5][connect/btl_openib_connect_udcm.c:1245:udcm_rc_qp_to_rtr] error modifing QP to RTR errno says Invalid argument If I don’t specify the btl it appears to try to set up openib with the same errors, then crashes on some free() related segfault, presumably when it tries to actually use vader. The machine seems to be able to see its IB interface, as reported by things like ibstatus or ibv_devinfo. I’m not sure what else to look for. I also confirmed that “ulimit -l” reports unlimited. Does anyone have any suggestions as to how to diagnose this issue? thanks, Noam ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
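Beyond the FAQ, a couple of quick checks usually narrow this kind of thing down; nothing below is specific to your machine, btl_base_verbose is just the standard diagnostic knob, and the program name is a placeholder:

$ ibv_devinfo | grep -e hca_id -e "state:"     # the port you expect to use should show PORT_ACTIVE
$ mpirun -np 2 --mca btl self,vader,openib --mca btl_base_verbose 100 ./a.out 2>&1 | less
# the verbose output states exactly why the openib BTL rejects (or accepts) each port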
[OMPI users] Experience with SHMEM with OpenMP
The web suggests that OpenMP should work just fine with OpenMPI/MPI -- does this also work with OpenMPI/SHMEM? Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE May the source be with you! ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
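For context, the build-and-run side is the same as for hybrid MPI+OpenMP; a minimal sketch, assuming gcc-style OpenMP flags and a made-up program name:

$ oshcc -fopenmp -o hybrid hybrid.c      # oshcc is Open MPI's OpenSHMEM compiler wrapper
$ shmemrun -n 2 -x OMP_NUM_THREADS=4 ./hybrid
# each PE then runs 4 OpenMP threads; unless you know the library's level of thread support,
# it is safest to keep OpenSHMEM calls outside of (or serialized within) parallel regions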
Re: [OMPI users] Building PMIx and Slurm support
Daniel, I think you need to have "--with-pmix=" point to a specific directory; either "/usr" if you installed it in /usr/lib and /usr/include, or the specific directory, like "--with-pmix=/usr/local/pmix-3.0.2" Andy From: Daniel Letai Sent: Sunday, March 03, 2019 8:54AM To: Users Cc: Subject: Re: [OMPI users] Building PMIx and Slurm support Hello, I have built the following stack : 1. centos 7.5 (gcc 4.8.5-28, libevent 2.0.21-4) 2. MLNX_OFED_LINUX-4.5-1.0.1.0-rhel7.5-x86_64.tgz built with --all --without-32bit (this includes ucx 1.5.0) 3. hwloc from centos 7.5 : 1.11.8-4.el7 4. pmix 3.1.2 5. slurm 18.08.5-2 built --with-ucx --with-pmix 6. openmpi 4.0.0 : configure --with-slurm --with-pmix=external --with-pmi --with-libevent=external --with-hwloc=external --with-knem=/opt/knem-1.1.3.90mlnx1 --with-hcoll=/opt/mellanox/hcoll The configure part succeeds, however 'make' errors out with: ext3x.c: In function 'ext3x_value_unload': ext3x.c:1109:10: error: 'PMIX_MODEX' undeclared (first use in this function) And same for 'PMIX_INFO_ARRAY' However, both are declared in the opal/mca/pmix/pmix3x/pmix/include/pmix_common.h file. opal/mca/pmix/ext3x/ext3x.c does include pmix_common.h but as a system include #include <pmix_common.h>, while ext3x.h includes it as a local include #include "pmix_common". Neither seems to pull from the correct path. Regards, Dani_L. On 2/24/19 3:09 AM, Gilles Gouaillardet wrote: Passant, you have to manually download and apply https://github.com/pmix/pmix/commit/2e2f4445b45eac5a3fcbd409c81efe318876e659.patch to PMIx 2.2.1 that should likely fix your problem. As a side note, it is a bad practice to configure --with-FOO=/usr since it might have some unexpected side effects. Instead, you can replace configure --with-slurm --with-pmix=/usr --with-pmi=/usr --with-libevent=/usr with configure --with-slurm --with-pmix=external --with-pmi --with-libevent=external to be on the safe side I also invite you to pass --with-hwloc=external to the configure command line Cheers, Gilles On Sun, Feb 24, 2019 at 1:54 AM Passant A. Hafez wrote: Hello Gilles, Here are some details: Slurm 18.08.4 PMIx 2.2.1 (as shown in /usr/include/pmix_version.h) Libevent 2.0.21 srun --mpi=list srun: MPI types are... srun: none srun: openmpi srun: pmi2 srun: pmix srun: pmix_v2 Open MPI versions tested: 4.0.0 and 3.1.2 For each installation to be mentioned a different MPI Hello World program was compiled. Jobs were submitted by sbatch, 2 node * 2 tasks per node then srun --mpi=pmix program File 400ext_2x2.out (attached) is for OMPI 4.0.0 installation with configure options: --with-slurm --with-pmix=/usr --with-pmi=/usr --with-libevent=/usr and configure log: Libevent support: external PMIx support: External (2x) File 400int_2x2.out (attached) is for OMPI 4.0.0 installation with configure options: --with-slurm --with-pmix and configure log: Libevent support: internal (external libevent version is less than internal version 2.0.22) PMIx support: Internal Tested also different installations for 3.1.2 and got errors similar to 400ext_2x2.out (NOT-SUPPORTED in file event/pmix_event_registration.c at line 101) All the best, -- Passant A. 
Hafez | HPC Applications Specialist KAUST Supercomputing Core Laboratory (KSL) King Abdullah University of Science and Technology Building 1, Al-Khawarizmi, Room 0123 Mobile : +966 (0) 55-247-9568 Mobile : +20 (0) 106-146-9644 Office : +966 (0) 12-808-0367 From: users on behalf of Gilles Gouaillardet Sent: Saturday, February 23, 2019 5:17 PM To: Open MPI Users Subject: Re: [OMPI users] Building PMIx and Slurm support Hi, PMIx has cross-version compatibility, so as long as the PMIx library used by SLURM is compatible with the one (internal or external) used by Open MPI, you should be fine. If you want to minimize the risk of cross-version incompatibility, then I encourage you to use the same (and hence external) PMIx that was used to build SLURM with Open MPI. Can you tell a bit more than "it didn't work" ? (Open MPI version, PMIx version used by SLURM, PMIx version used by Open MPI, error message, ...) Cheers, Gilles On Sat, Feb 23, 2019 at 9:46 PM Passant A. Hafez wrote: Good day everyone, I've trying to build and use the PMIx support for Open MPI but I tried many things that I can list if needed, but with no luck. I was able to test the PMIx client but when I used OMPI specifying srun --mpi=pmix it didn't work. So if you please advise me with the versions of each PMIx and Open MPI that should be working well with Slurm 18.08, it'd be great. Also, what is the difference between using internal vs external PMIx installations? All the best, -- Passant A. Hafez | HPC Applications Specialist KAUST Supercomputing Core Laboratory (KSL) King Abdullah University of Science and Technology Build
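For reference, Gilles's suggestion above translates into a configure line along these lines (the prefix is a placeholder; the point is using the "external" keywords rather than absolute /usr paths, together with a PMIx build that carries the patch mentioned above):

$ ./configure --prefix=$HOME/openmpi-4.0.0 \
    --with-slurm \
    --with-pmix=external --with-pmi \
    --with-libevent=external \
    --with-hwloc=external
$ make -j && make install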