Re: [OMPI users] Building vs packaging
I don't know about .deb packages, but at least in the rpms there is a post-install scriptlet that re-runs ldconfig to ensure the new libs are in the ldconfig cache.

On 16/05/2016 18:04, Dave Love wrote:

"Rob Malpass" writes:
> Almost in desperation, I cheated:

Why is that cheating? Unless you specifically want a different version, it seems sensible to me, especially as you then have access to packaged versions of at least some MPI programs. Likewise with rpm-based systems, which I'm afraid I know more about. Also, the package system ensures that things don't break by inadvertently removing their dependencies; the hwloc libraries might be an example.

> sudo apt-get install openmpi-bin
>
> and hey presto. I can now do (from the head node)
>
>   mpirun -H node2,node3,node4 -n 10 foo
>
> and it works fine. So clearly apt-get install has set something that I'd not done (and it's seemingly not LD_LIBRARY_PATH), as ssh node2 'echo $LD_LIBRARY_PATH' still returns a blank line.

No. As I said recently, Debian installs a default MPI (via the alternatives system) with libraries in the system search path. Check the library contents.
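As a quick check (a sketch, not from the original thread; the alternatives group name "mpi" and the node name "node2" are assumptions), the following shows whether the packaged libraries are already on the runtime linker path, which is what makes LD_LIBRARY_PATH unnecessary here:

   ldconfig -p | grep libmpi               # is libmpi in the linker cache?
   update-alternatives --display mpi       # which MPI the Debian alternatives system points at
   ssh node2 'ldconfig -p | grep libmpi'   # same check on a compute node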
Re: [OMPI users] Docker Cluster Queue Manager
For provisioning, I personally use xCAT, which just started supporting Docker:
http://xcat-docs.readthedocs.io/en/stable/advanced/docker/lifecycle_management.html

Together with Slurm's elastic computing feature
http://xcat-docs.readthedocs.io/en/stable/advanced/docker/lifecycle_management.html
this could be a poor man's solution. I'm not sure how convenient xCAT would be for maintaining non-Unix IDs. Frankly, I'm still not sure of your requirements in that regard.

On 04/06/2016 16:01, Rob Nagler wrote:

Hi Daniel,

Thanks. Shifter is also interesting. However, it assumes our users map to a Unix user id, and therefore that access to the shared file system can be controlled by normal Unix permissions. That's not scalable, and it makes for quite a bit of complexity: each node must know about each user, so you have to run LDAP or something similar. This adds complexity to dynamic cluster creation.

Shifter runs in a chroot context, not a cgroup context. For a supercomputer center with an application process to get an account, this works fine. For a web application with no "background check", it's riskier. At NERSC, you don't have the bad-actor problem. Web apps do, and all it takes is one local exploit to escape a chroot. Docker/cgroups is safer, and the focus on improving Linux security these days is on cgroups, not chroot "jails".

Shifter also does not solve the problem of queuing dynamic clusters. SLURM/Torque, which Shifter relies on, does not either. This is probably the most difficult item. StarCluster does solve this problem, but it doesn't work on bare metal, and it's not clear whether it is still being maintained.

Rob
Re: [OMPI users] PMIx + OpenMPI
This is just a guess, but is it possible that after building Slurm against PMIx you should build Open MPI against Slurm's PMIx rather than pointing directly at the external one? I would assume Slurm's PMI server now "knows" PMIx.

On 06/08/2017 16:14, Charles A Taylor wrote:

Hi Gilles,

I tried both "--with-pmix=/opt/pmix" and "--with-pmix=internal" and got the same "UNREACHABLE" error both ways. I tried the "external" first since that is what SLURM was built against. I'm missing something simple/basic - just not sure what it is.

Thanks,

Charlie

On Aug 6, 2017, at 7:43 AM, Gilles Gouaillardet wrote:

Charles,

did you build Open MPI with the external PMIx? iirc, Open MPI 2.0.x does not support cross-version PMIx.

Cheers,

Gilles

On Sun, Aug 6, 2017 at 7:59 PM, Charles A Taylor wrote:

On Aug 6, 2017, at 6:53 AM, Charles A Taylor wrote:

Anyone successfully using PMIx with OpenMPI and SLURM? I have,

1. Installed an "external" version (1.1.5) of PMIx.
2. Patched SLURM 15.08.13 with the SchedMD-provided PMIx patch (results in an mpi_pmix plugin along the lines of mpi_pmi2).
3. Built OpenMPI 2.0.1 (tried 2.0.3 as well).

However, when attempting to launch MPI apps (LAMMPS in this case), I get

[c9a-s2.ufhpc:08914] PMIX ERROR: UNREACHABLE in file src/client/pmix_client.c at line 199

I should have mentioned that I'm launching with srun --mpi=pmix ... If I launch with srun --mpi=pmi2 ... the app starts and runs without issue.
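For reference, a hedged sketch of the combination being discussed here; the /opt paths, the install prefix, and the LAMMPS binary name are assumptions, not a verified recipe:

   # build Open MPI against the same external PMIx that slurm was patched with
   ./configure --prefix=/opt/openmpi-2.0.1 --with-pmix=/opt/pmix --with-slurm
   make && make install

   # ask the local slurm which MPI plugin types it actually offers
   srun --mpi=list

   # launch through slurm's pmix plugin (pmi2 remains the working fallback per the report above)
   srun --mpi=pmix ./lmp_mpi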
[OMPI users] Compiling openmpi 1.6.4 without CUDA
Hi List,

I've encountered an issue today: building openmpi 1.6.4 from the source rpm, on a machine which has cuda-5 (latest) installed, resulted in openmpi always using the cuda headers and libs. I should mention that I have added the cuda libs dir to ldconfig, and the bin dir to the path (nvcc is in the path).

When building openmpi 1.6.4 (rpmbuild --rebuild openmpi.src.rpm) the package is automatically built with cuda. I have tried to define --without-cuda, --disable-cuda, --disable-cudawrapers, but the rpm is always built with cuda, and fails to install as the required libs are not in the rpmdb. If installing with --disable-vt, cuda is not looked for or installed.

So I guess my question is two-fold:

1. Is this by design? From the FAQ (http://www.open-mpi.org/faq/?category=building#build-cuda) I was sure cuda is not built by default.
2. Is there a way to keep VampirTrace without cuda?

The reason I don't want cuda in mpi is the target cluster's characteristics: except for one node, it will have no GPUs, so I saw no reason to deploy cuda to it. Unfortunately, I had to use the single node with cuda as the compilation node, as it was the only node with complete development packages. I can always mv the cuda dirs during the build phase, but I'm wondering whether this is how the openmpi build is supposed to behave.
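For what it's worth, a hedged sketch of passing configure options through the src.rpm rebuild via the configure_options define; whether the 1.6.4 spec file actually honors --without-cuda is exactly the open question here, so treat the option list as an assumption rather than a fix:

   rpmbuild --rebuild openmpi-1.6.4-1.src.rpm \
       --define 'configure_options --disable-vt --without-cuda'

Note that --disable-vt drops VampirTrace entirely, so it sidesteps question 2 rather than answering it.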
[OMPI users] errors trying to run a simple mpi task
Hi,

I've encountered strange issues when trying to run a simple mpi job on a single host which has IB. The complete errors:

-> mpirun -n 1 hello
--
WARNING: Failed to open "ofa-v2-mlx4_0-1" [DAT_PROVIDER_NOT_FOUND:DAT_NAME_NOT_REGISTERED].

This may be a real error or it may be an invalid entry in the uDAPL Registry which is contained in the dat.conf file. Contact your local System Administrator to confirm the availability of the interfaces in the dat.conf file.
--
--
[[53031,1],0]: A high-performance Open MPI point-to-point messaging module was unable to find any relevant network interfaces:

Module: uDAPL
Host: n01

Another transport will be used instead, although this may result in lower performance.
--
--
WARNING: It appears that your OpenFabrics subsystem is configured to only allow registering part of your physical memory. This can cause MPI jobs to run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of physical memory that can be registered. You should investigate the relevant Linux kernel module parameters that control how much physical memory can be registered, and increase them to allow registering all physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel module parameters:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Local host:          n01
Registerable memory: 32768 MiB
Total memory:        65503 MiB

Your MPI job will continue, but may be behave poorly and/or hang.
--
Process 0 on n01 out of 1
[n01:13534] 7 more processes have sent help message help-mpi-btl-udapl.txt / dat_ia_open fail
[n01:13534] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Following is my setup and other info:

OS: CentOS 6.3 x86_64
Installed OFED 3.5 from source (./install.pl --all)
Installed openmpi 1.6.4 with the following build parameters:

rpmbuild --rebuild openmpi-1.6.4-1.src.rpm --define '_prefix /opt/openmpi/1.6.4/gcc' --define '_defaultdocdir /opt/openmpi/1.6.4/gcc' --define '_mandir %{_prefix}/share/man' --define '_datadir %{_prefix}/share' --define 'configure_options --with-openib=/usr --with-openib-libdir=/usr/lib64 CC=gcc CXX=g++ F77=gfortran FC=gfortran --enable-mpirun-prefix-by-default --target=x86_64-unknown-linux-gnu --with-hwloc=/usr/local --with-libltdl --enable-branch-probabilities --with-udapl --with-sge --disable-vt' --define 'use_default_rpm_opt_flags 1' --define '_name openmpi-1.6.4_gcc' --define 'install_shell_scripts 1' --define 'shell_scripts_basename mpivars' --define '_usr /usr' --define 'ofed 0' 2>&1 | tee openmpi.build.sge

(--disable-vt was used due to the presence of cuda, which is automatically linked by vt and becomes a dependency with no matching rpm.)
memorylocked is unlimited:

-> ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 515028
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

IB devices are present:

-> ibv_devinfo
hca_id: mlx4_0
        transport:              InfiniBand (0)
        fw_ver:                 2.9.1000
        node_guid:              0002:c903:004d:b0e2
        sys_image_guid:         0002:c903:004d:b0e5
        vendor_id:              0x02c9
        vendor_part_id:         26428
        hw_ver:                 0xB0
        board_id:               MT_0D90110009
        phys_port_cnt:
Re: [OMPI users] errors trying to run a simple mpi task
Thanks, that actually solved one of the errors:

-> mpirun -n 1 hello
--
WARNING: It appears that your OpenFabrics subsystem is configured to only allow registering part of your physical memory. This can cause MPI jobs to run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of physical memory that can be registered. You should investigate the relevant Linux kernel module parameters that control how much physical memory can be registered, and increase them to allow registering all physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel module parameters:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Local host:          n01
Registerable memory: 32768 MiB
Total memory:        65503 MiB

Your MPI job will continue, but may be behave poorly and/or hang.
--
Process 0 on n01 out of 1

BTW, the node has 64GB total ram. Is it possible openmpi is limited to only 32GB? Or possibly the ofed installation has such a limit?

On 23/06/2013 17:58, Ralph Castain wrote:

Don't include udapl - that code may well be stale

Sent from my iPhone

On Jun 23, 2013, at 3:42 AM, dani <d...@letai.org.il> wrote:

Hi, I've encountered strange issues when trying to run a simple mpi job on a single host which has IB.
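For anyone hitting the same uDAPL warnings, a hedged sketch of the two usual ways to act on that advice (the program name "hello" is just the example binary from above):

   # at run time, exclude the udapl BTL via an MCA parameter
   mpirun --mca btl ^udapl -n 1 hello

   # or rebuild without it, by dropping --with-udapl from the
   # configure_options define shown in the original post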
Re: [OMPI users] errors trying to run a simple mpi task
OK, I see the part I missed - changing the MTT parameters for mlx4_core.

My only issue is that currently there is no num_mtt or log_num_mtt, but there is log_mtts_per_seg in /sys/module/mlx4_core/parameters. Have the parameter names changed? I know I can simply set log_mtts_per_seg to 5 (it's currently 3), but the discussion on the mailing list (http://www.open-mpi.org/community/lists/devel/2012/08/11417.php) seems to indicate it's better to use numerous small chunks to avoid fragmentation.

Can you, or anyone else, provide documentation for the current parameters of mlx4_core? I can't locate it on the Mellanox or OFED sites.

On 24/06/2013 18:02, Jeff Squyres (jsquyres) wrote:

On Jun 23, 2013, at 3:21 PM, dani wrote:

> See this Open MPI FAQ item for more information on these Linux kernel module parameters:
>
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>
> BTW, the node has 64GB total ram. Is it possible openmpi is limited to only 32GB? or possibly the ofed installation has such a limit?

Yes. See the FAQ item cited in the help message for more detail.
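If it helps, a hedged sketch of how such a module option is typically set on a RHEL/CentOS 6 style system; whether your driver version still exposes log_num_mtt is exactly the open question above, and the value 22 is only an illustration:

   # /etc/modprobe.d/mlx4_core.conf
   options mlx4_core log_num_mtt=22 log_mtts_per_seg=3

   # reload the module (or reboot), then verify what the driver accepted
   cat /sys/module/mlx4_core/parameters/log_mtts_per_seg

Per the FAQ item linked above, registerable memory is roughly 2^log_num_mtt * 2^log_mtts_per_seg * PAGE_SIZE, so with 4 KiB pages the values sketched here would allow about 128 GiB, i.e. roughly twice the 64 GB in this node.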
Re: [OMPI users] requirement on ssh when run openmpi
But that is not a requirement on ssh. That is a requirement on the install base on the second node - both must have the same environment variables set, using the same paths on each machine.

Either install openmpi on each node and set up /etc/profile.d/openmpi.{c,}sh and /etc/ld.so.conf.d/openmpi.conf files on both (preferred), or install to a common file system (e.g. an nfs mount) and still use profile and ldconfig to set up the environment.

On 29/07/2013 05:00, meng wrote:

Dear Reuti,

Thank you for the reply. In the examples directory on the computer c1, the command "mpiexec -np 4 hello_c" is correctly executed. If I run "mpiexec -machinefile hosts hello_c" on computer c1, where hosts contains two lines:

node3
localhost

the screen displays as follows:

bash: orted: command not found
--
A daemon (pid 5247) died unexpectedly with status 127 while attempting to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
--
--
mpiexec noticed that the job aborted, but has no info as to the process that caused that situation.
--

I don't know what's wrong with it. Thank you.

Regards,
Meng

At 2013-07-27 16:44:56, Reuti wrote:

>Hi,
>
>Am 27.07.2013 um 08:48 schrieb meng:
>
>> what's the requirement, especially on ssh, to run openmpi? I have two computers, say c1 and c2. Through ssh, c1 can access c2 without a password, but c2 can't access c1. Under this environment, can I use openmpi to compute in parallel?
>
>as long as you execute `mpiexec` only on c1, it should work. But you can't start a job on c2.
>
>-- Reuti
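A minimal sketch of the two files mentioned above, assuming an install prefix of /opt/openmpi (substitute the real prefix; the path is an assumption):

   # /etc/profile.d/openmpi.sh
   export PATH=/opt/openmpi/bin:$PATH

   # /etc/ld.so.conf.d/openmpi.conf
   /opt/openmpi/lib

   # then refresh the linker cache on each node
   ldconfig

Whether a non-interactive ssh shell actually sources profile.d depends on the shell and distro; if it does not, building Open MPI with --enable-mpirun-prefix-by-default is the usual fallback for the "orted: command not found" failure shown above.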
Re: [OMPI users] OMPI 2.1.2 and SLURM compatibility
I might be off base here, but I think what was implied is that you've built openmpi with --with-pmi without supplying the path that holds the pmi2 libs. First build slurm with pmi, then build openmpi with the path to the pmi .so in slurm. That might not provide pmix, but it will work with pmi2.

--with-pmi= or --with-pmi-libdir=

On 18/11/2017 19:03, Bennet Fauber wrote:

Howard,

Thanks for the reply. I think, based on a previous reply, that we may not have the right combinations of pmi and slurm lined up.

I will have to coordinate with our admin who compiles and installs slurm, and once we think we have slurm with pmix, I'll try again and post the files/information you suggest. Thanks for telling me which files and output are most useful here.

-- bennet

On Fri, Nov 17, 2017 at 11:45 PM, Howard Pritchard wrote:

Hello Bennet,

What you are trying to do using srun as the job launcher should work. Could you post the contents of /etc/slurm/slurm.conf for your system? Could you also post the output of the following command:

ompi_info --all | grep pmix

to the mail list. The config.log from your build would also be useful.

Howard

2017-11-16 9:30 GMT-07:00 r...@open-mpi.org:

What Charles said was true but not quite complete. We still support the older PMI libraries but you likely have to point us to wherever slurm put them. However, we definitely recommend using PMIx as you will get a faster launch.

Sent from my iPad

> On Nov 16, 2017, at 9:11 AM, Bennet Fauber wrote:
>
> Charlie,
>
> Thanks a ton! Yes, we are missing two of the three steps.
>
> Will report back after we get pmix installed and after we rebuild Slurm. We do have a new enough version of it, at least, so we might have missed the target, but we did at least hit the barn. ;-)
>
>> On Thu, Nov 16, 2017 at 10:54 AM, Charles A Taylor wrote:
>> Hi Bennet,
>>
>> Three things...
>>
>> 1. OpenMPI 2.x requires PMIx in lieu of pmi1/pmi2.
>>
>> 2. You will need slurm 16.05 or greater built with --with-pmix
>>
>> 2a. You will need pmix 1.1.5, which you can get from github (https://github.com/pmix/tarballs).
>>
>> 3. Then, to launch your mpi tasks on the allocated resources,
>>
>>    srun --mpi=pmix ./hello-mpi
>>
>> I'm replying to the list because,
>>
>> a) this information is harder to find than you might think.
>> b) someone/anyone can correct me if I'm giving a bum steer.
>>
>> Hope this helps,
>>
>> Charlie Taylor
>> University of Florida
>>
>> On Nov 16, 2017, at 10:34 AM, Bennet Fauber wrote:
>>
>> I think that OpenMPI is supposed to support SLURM integration such that
>>
>>    srun ./hello-mpi
>>
>> should work? I built OMPI 2.1.2 with
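To illustrate the --with-pmi route suggested at the top of this reply, a hedged sketch (the install prefix and the /usr, /usr/lib64 locations are assumptions; point them at wherever your slurm build actually installed its PMI libraries):

   ./configure --prefix=/opt/openmpi-2.1.2 --with-slurm \
       --with-pmi=/usr --with-pmi-libdir=/usr/lib64
   make && make install

   # then launch under slurm's pmi2 plugin
   srun --mpi=pmi2 ./hello-mpi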