Re: [OMPI users] Building vs packaging

2016-05-19 Thread dani
I don't know about .deb packages, but at least in the RPMs there is a 
post-install scriptlet that re-runs ldconfig to ensure the new libs are 
in the ldconfig cache.
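
If anyone wants to check this for themselves, the scriptlets can be listed straight from rpm (the package name below is just an example - use whatever your distro calls its Open MPI package):

  # scriptlets of an already-installed package
  rpm -q --scripts openmpi

  # or inspect a package file before installing it
  rpm -qp --scripts openmpi-1.6.4-1.x86_64.rpm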



On 16/05/2016 18:04, Dave Love wrote:

"Rob Malpass"  writes:


Almost in desperation, I cheated:

Why is that cheating?  Unless you specifically want a different version,
it seems sensible to me, especially as you then have access to packaged
versions of at least some MPI programs.  Likewise with rpm-based
systems, which I'm afraid I know more about.

Also the package system ensures that things don't break by inadvertently
removing their dependencies; the hwloc libraries might be an example.


sudo apt-get install openmpi-bin

and hey presto.   I can now do (from head node)

  mpirun -H node2,node3,node4 -n 10 foo

and it works fine.   So clearly apt-get install has set something that I'd
not done (and it's seemingly not LD_LIBRARY_PATH) as ssh node2 'echo
$LD_LIBRARY_PATH' still returns a blank line.

No.  As I said recently, Debian installs a default MPI (via the
alternatives system) with libraries in the system search path.  Check
the library contents.
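
For example, something like this should show it (on a Debian-ish box; the exact alternative and package names may differ):

  # which MPI implementation the alternatives system currently points at
  update-alternatives --display mpirun
  update-alternatives --display mpi

  # confirm libmpi is already in the default linker search path,
  # which is why LD_LIBRARY_PATH can stay empty
  ldconfig -p | grep libmpi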


Re: [OMPI users] Docker Cluster Queue Manager

2016-06-04 Thread dani

  
  
For provisioning, I personally use xCAT, which just started supporting Docker:
http://xcat-docs.readthedocs.io/en/stable/advanced/docker/lifecycle_management.html
Together with the SLURM elastic computing feature
http://xcat-docs.readthedocs.io/en/stable/advanced/docker/lifecycle_management.html
this could be a poor man's solution.
I'm not sure how convenient xCAT would be for maintaining non-Unix ids. Frankly, I'm still not sure of your requirements in that regard.
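
A rough sketch of the slurm.conf side of that (the program paths and node names here are made up - see the SLURM power-saving/elastic-computing docs for the real details):

  # ResumeProgram/SuspendProgram could call xCAT to start/stop docker nodes
  ResumeProgram=/usr/local/sbin/start_docker_node.sh
  SuspendProgram=/usr/local/sbin/stop_docker_node.sh
  ResumeTimeout=300
  SuspendTime=600
  # nodes that only exist on demand are declared with State=CLOUD
  NodeName=dock[01-16] State=CLOUD
  PartitionName=docker Nodes=dock[01-16] Default=YES MaxTime=INFINITE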




On 04/06/2016 16:01, Rob Nagler wrote:

Hi Daniel,

Thanks.

Shifter is also interesting. However, it assumes our users map to a Unix user id, and therefore access to the shared file system can be controlled by normal Unix permissions. That's not scalable, and makes for quite a bit of complexity. Each node must know about each user, so you have to run LDAP or something similar. This adds complexity to dynamic cluster creation.

Shifter runs in a chroot context, not a cgroup context. For a supercomputer center with an application process to get an account, this works fine. For a web application with no "background check", it's more risky. At NERSC, you don't have the bad actor problem. Web apps do, and all it takes is one local exploit to escape a chroot. Docker/cgroups is safer, and the focus on improving Linux security these days is on cgroups, not chroot "jails".

Shifter also does not solve the problem of queuing dynamic clusters. SLURM/Torque, which Shifter relies on, does not either. This is probably the most difficult item. StarCluster does solve this problem, but doesn't work on bare metal, and it's not clear whether it is still being maintained.

Rob



  
  
  
  


Re: [OMPI users] PMIx + OpenMPI

2017-08-06 Thread dani

  
  
This is just a guess, but - is it possible that after building slurm against PMIx, you should build openmpi against SLURM's PMIx instead of pointing directly at the external one? I would assume slurm's PMI server now "knows" PMIx.
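
In configure terms I mean something like this (the /opt/pmix and install prefixes are only examples - point them at whatever SLURM was actually built against):

  # build Open MPI against the same PMIx installation SLURM uses
  ./configure --prefix=/opt/openmpi/2.0.1 \
              --with-slurm \
              --with-pmix=/opt/pmix
  make all install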


On 06/08/2017 16:14, Charles A Taylor wrote:


Hi Gilles,

I tried both "--with-pmix=/opt/pmix" and "--with-pmix=internal" and got the same "UNREACHABLE" error both ways.  I tried the "external" one first since that is what SLURM was built against.

I'm missing something simple/basic - just not sure what it is.

Thanks,

Charlie


  
On Aug 6, 2017, at 7:43 AM, Gilles Gouaillardet  wrote:

Charles,

did you build Open MPI with the external PMIx?
IIRC, Open MPI 2.0.x does not support cross-version PMIx.

Cheers,

Gilles

On Sun, Aug 6, 2017 at 7:59 PM, Charles A Taylor  wrote:


  

  
On Aug 6, 2017, at 6:53 AM, Charles A Taylor  wrote:


Anyone successfully using PMIx with OpenMPI and SLURM?  I have,

1. Installed an “external” version (1.1.5) of PMIx.
2. Patched SLURM 15.08.13 with the SchedMD-provided PMIx patch (results in an mpi_pmix plugin along the lines of mpi_pmi2).
3. Built OpenMPI 2.0.1 (tried 2.0.3 as well).

However, when attempting to launch MPI apps (LAMMPS in this case), I get

  [c9a-s2.ufhpc:08914] PMIX ERROR: UNREACHABLE in file src/client/pmix_client.c at line 199


  
I should have mentioned that I'm launching with

  srun --mpi=pmix ...

If I launch with

  srun --mpi=pmi2 ...

the app starts and runs without issue.
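
One quick thing worth checking (assuming a reasonably recent SLURM) is which MPI plugin types the local SLURM actually advertises:

  # lists the plugin types srun accepts, e.g. pmi2, pmix, none
  srun --mpi=list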



[OMPI users] Compiling openmpi 1.6.4 without CUDA

2013-05-20 Thread dani

  
  
Hi List,

I've encountered an issue today - building openmpi 1.6.4 from the source rpm, on a machine which has cuda-5 (latest) installed, resulted in openmpi always picking up the cuda headers and libs.
I should mention that I have added the cuda libs dir to ldconfig, and the bin dir to the path (nvcc is in the path).
When building openmpi 1.6.4 (rpmbuild --rebuild openmpi.src.rpm) the package is automatically built with cuda.
I have tried to define --without-cuda, --disable-cuda, --disable-cudawrapers, but the rpm is always built with cuda, and then fails to install as the required libs are not in the rpmdb.
If installing with --disable-vt, cuda is not looked for or installed.
So I guess my question is two-fold:
1. Is this by design? From the FAQ (http://www.open-mpi.org/faq/?category=building#build-cuda) I was sure cuda is not built by default.
2. Is there a way to keep VampirTrace without cuda?

The reason I don't want cuda in mpi is the target cluster's characteristics: except for 1 node, it will have no gpus, so I saw no reason to deploy cuda to it. Unfortunately, I had to use the single node with cuda as the compilation node, as it was the only node with complete development packages.

I can always mv the cuda dirs away during the build phase, but I'm wondering if this is how the openmpi build is supposed to behave.
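
In case it helps anyone searching later, this is roughly how I check whether a rebuilt package actually picked cuda up (the package file name below is just an example):

  # did the rebuilt rpm grow a cuda dependency?
  rpm -qp --requires openmpi-1.6.4_gcc-1.x86_64.rpm | grep -i cuda

  # after installing, does the build itself report cuda support?
  ompi_info | grep -i cuda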
  



[OMPI users] errors trying to run a simple mpi task

2013-06-23 Thread dani

  
  
Hi,

I've encountered strange issues when trying to run a simple mpi job
on a single host which has IB.
The complete errors:

-> mpirun -n 1 hello
--------------------------------------------------------------------------
WARNING: Failed to open "ofa-v2-mlx4_0-1"
[DAT_PROVIDER_NOT_FOUND:DAT_NAME_NOT_REGISTERED].
This may be a real error or it may be an invalid entry in the uDAPL
Registry which is contained in the dat.conf file. Contact your local
System Administrator to confirm the availability of the interfaces in
the dat.conf file.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[[53031,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: uDAPL
  Host: n01

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered.  You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel
module parameters:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:          n01
  Registerable memory: 32768 MiB
  Total memory:        65503 MiB

Your MPI job will continue, but may be behave poorly and/or hang.
--------------------------------------------------------------------------
Process 0 on n01 out of 1
[n01:13534] 7 more processes have sent help message help-mpi-btl-udapl.txt / dat_ia_open fail
[n01:13534] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Following is my setup and other info:
OS: CentOS 6.3 x86_64
installed ofed 3.5 from source (./install.pl --all)
installed openmpi 1.6.4 with the following build parameters:

rpmbuild --rebuild openmpi-1.6.4-1.src.rpm \
  --define '_prefix /opt/openmpi/1.6.4/gcc' \
  --define '_defaultdocdir /opt/openmpi/1.6.4/gcc' \
  --define '_mandir %{_prefix}/share/man' \
  --define '_datadir %{_prefix}/share' \
  --define 'configure_options --with-openib=/usr --with-openib-libdir=/usr/lib64 CC=gcc CXX=g++ F77=gfortran FC=gfortran --enable-mpirun-prefix-by-default --target=x86_64-unknown-linux-gnu --with-hwloc=/usr/local --with-libltdl --enable-branch-probabilities --with-udapl --with-sge --disable-vt' \
  --define 'use_default_rpm_opt_flags 1' \
  --define '_name openmpi-1.6.4_gcc' \
  --define 'install_shell_scripts 1' \
  --define 'shell_scripts_basename mpivars' \
  --define '_usr /usr' \
  --define 'ofed 0' 2>&1 | tee openmpi.build.sge

(--disable-vt was used due to the presence of cuda, which is automatically linked by vt and becomes a dependency with no matching rpm).

memorylocked is unlimited:
->ulimit -a
  core file size  (blocks, -c) 0
  data seg size   (kbytes, -d) unlimited
  scheduling priority (-e) 0
  file size   (blocks, -f) unlimited
  pending signals (-i) 515028
  max locked memory   (kbytes, -l) unlimited
  max memory size (kbytes, -m) unlimited
  open files  (-n) 1024
  pipe size    (512 bytes, -p) 8
  POSIX message queues (bytes, -q) 819200
  real-time priority  (-r) 0
  stack size  (kbytes, -s) 10240
  cpu time   (seconds, -t) unlimited
  max user processes  (-u) 1024
  virtual memory  (kbytes, -v) unlimited
  file locks  (-x) unlimited
IB devices are present:
->ibv_devinfo
  hca_id:    mlx4_0
      transport:            InfiniBand (0)
      fw_ver:                2.9.1000
      node_guid:            0002:c903:004d:b0e2
      sys_image_guid:            0002:c903:004d:b0e5
      vendor_id:            0x02c9
      vendor_part_id:            26428
      hw_ver:                0xB0
      board_id:            MT_0D90110009
      phys_port_cnt:            

Re: [OMPI users] errors trying to run a simple mpi task

2013-06-23 Thread dani

  
  
Thanks, that actually solved one of the errors:

-> mpirun -n 1 hello
--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered.  You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel
module parameters:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:          n01
  Registerable memory: 32768 MiB
  Total memory:        65503 MiB

Your MPI job will continue, but may be behave poorly and/or hang.
--------------------------------------------------------------------------
Process 0 on n01 out of 1

BTW, the node has 64GB total RAM. Is it possible openmpi is limited to only 32GB? Or possibly the ofed installation has such a limit?


On 23/06/2013 17:58, Ralph Castain wrote:

  Don't include udapl - that code may well be stale

  Sent from my iPhone
  

Re: [OMPI users] errors trying to run a simple mpi task

2013-06-24 Thread dani

  
  
OK, I see the part I missed - changing the MTT parameters for mlx4_core.
My only issue is that currently there is no num_mtt or log_num_mtt, but there is log_mtts_per_seg in /sys/module/mlx4_core/parameters.
Have the parameter names changed? I know I can simply set log_mtts_per_seg to 5 (it's currently 3), but the discussion on the mailing list (http://www.open-mpi.org/community/lists/devel/2012/08/11417.php) seems to indicate it's better to use numerous small chunks to avoid fragmentation.

Can you, or anyone else, point me to documentation for the current mlx4_core parameters? I can't locate it on the Mellanox or OFED sites.
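
For reference, my understanding of the arithmetic from the FAQ item (please correct me if I've got it wrong), plus how I'd persist a change on a driver that still exposes log_num_mtt:

  # registerable memory ~= (2^log_num_mtt) * (2^log_mtts_per_seg) * page_size
  # e.g. 2^20 * 2^3 * 4096 bytes = 32 GiB, which matches the warning above

  # see what the running module actually exposes
  ls /sys/module/mlx4_core/parameters
  cat /sys/module/mlx4_core/parameters/log_mtts_per_seg

  # if log_num_mtt exists, raise it so the product covers all the RAM
  # (the value below is only an example, not a recommendation)
  echo "options mlx4_core log_num_mtt=24" > /etc/modprobe.d/mlx4_core.conf
  # then reload mlx4_core (or reboot) for it to take effect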


On 24/06/2013 18:02, Jeff Squyres (jsquyres) wrote:

  On Jun 23, 2013, at 3:21 PM, dani wrote:


  

  See this Open MPI FAQ item for more information on these Linux kernel module
parameters:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages


BTW, the node has 64GB total ram. Is it possible openmpi is limited to only 32GB? or possibly the ofed installation has such a limit?

  
  
Yes.

See the FAQ item cited in the help message for more detail.




  



Re: [OMPI users] requirement on ssh when run openmpi

2013-07-28 Thread dani

  
  
But that is not a requirement on ssh.
It is a requirement on the install base of the second node - both nodes must have the same environment variables set, using the same paths on each machine.
Either install openmpi on each node and set up the /etc/profile.d/openmpi.{c,}sh and /etc/ld.so.conf.d/openmpi.conf files on both (preferred), or install to a common file system (e.g. an nfs mount) and still use profile and ldconfig to set up the environment.
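
Roughly what goes in those two files (the prefix below is just an example - use wherever openmpi is actually installed):

  # /etc/profile.d/openmpi.sh
  export PATH=/opt/openmpi/1.6.4/gcc/bin:$PATH
  export MANPATH=/opt/openmpi/1.6.4/gcc/share/man:$MANPATH

  # /etc/ld.so.conf.d/openmpi.conf
  /opt/openmpi/1.6.4/gcc/lib

  # then refresh the linker cache on every node
  ldconfig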

On 29/07/2013 05:00, meng wrote:


Dear Reuti,
Thank you for the reply.
In the examples directory on the computer c1, the command "mpiexec -np 4 hello_c" is correctly executed.
If I run "mpiexec -machinefile hosts hello_c" on computer c1, where hosts contains two lines:
  node3
  localhost
the screen displays as follows:

bash: orted: command not found
--------------------------------------------------------------------------
A daemon (pid 5247) died unexpectedly with status 127 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that the job aborted, but has no info as to the process
that caused that situation.

I don't know what's wrong with it.
Thank you.
Regards,
Meng
  
  
  
  
At 2013-07-27 16:44:56, Reuti wrote:
>Hi,
>
>On 27.07.2013 at 08:48, meng wrote:
>
>>What's the requirement, especially on ssh, to run openmpi? I have two computers, say c1 and c2. Through ssh, c1 can access c2 without password, but c2 can't access c1. Under this environment, can I use openmpi to compute in parallel?
>
>as long as you execute `mpiexec` only on c1, it should work. But you can't start a job on c2.
>
>-- Reuti
>
>
>>  Regards,
>> meng
>> 
>> 


  
  
  
  
  
  


  






Re: [OMPI users] OMPI 2.1.2 and SLURM compatibility

2017-11-18 Thread dani

  
  
I might be off base here, but I think what was implied is that you've built openmpi with --with-pmi without supplying the path that holds the pmi2 libs.
First build slurm with PMI support, then build openmpi with the path to the PMI libs that slurm installed.
That might not provide pmix, but it will work with pmi2.
--with-pmi= or --with-pmi-libdir=
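
Concretely, something like this is what I mean (the /usr prefix and lib64 dir are just examples of where slurm's pmi2 often lands - adjust to your install):

  # point Open MPI's configure at SLURM's PMI/PMI2 headers and libraries
  ./configure --with-slurm \
              --with-pmi=/usr \
              --with-pmi-libdir=/usr/lib64
  # then launch with:  srun --mpi=pmi2 ./a.out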



On 18/11/2017 19:03, Bennet Fauber wrote:


  

Howard,

Thanks for the reply.

I think, based on a previous reply, that we may not have the right combination of pmi and slurm lined up.  I will have to coordinate with our admin who compiles and installs slurm, and once we think we have slurm with pmix, I'll try again and post the files/information you suggest.

Thanks for telling me which files and output are most useful here.

-- bennet


  
  
On Fri, Nov 17, 2017 at 11:45 PM, Howard Pritchard wrote:

  Hello Bennet,

  What you are trying to do using srun as the job launcher should work.  Could you post the contents of /etc/slurm/slurm.conf for your system?

  Could you also post the output of the following command:

    ompi_info --all | grep pmix

  to the mail list.  The config.log from your build would also be useful.

  Howard
  


2017-11-16 9:30 GMT-07:00 r...@open-mpi.org:

  What Charles said was true but not quite complete. We still support the older PMI libraries but you likely have to point us to wherever slurm put them.

  However, we definitely recommend using PMIx as you will get a faster launch.

  Sent from my iPad
  

> On Nov 16, 2017, at 9:11 AM, Bennet Fauber wrote:
>
> Charlie,
>
> Thanks a ton!  Yes, we are missing two of the three steps.
>
> Will report back after we get pmix installed and after we rebuild
> Slurm.  We do have a new enough version of it, at least, so we might
> have missed the target, but we did at least hit the barn.  ;-)
>
>
>> On Thu, Nov 16, 2017 at 10:54 AM, Charles A Taylor wrote:
>> Hi Bennet,
>>
>> Three things...
>>
>> 1. OpenMPI 2.x requires PMIx in lieu of pmi1/pmi2.
>>
>> 2. You will need slurm 16.05 or greater built with --with-pmix
>>
>> 2a. You will need pmix 1.1.5 which you can get from github
>> (https://github.com/pmix/tarballs).
>>
>> 3. then, to launch your mpi tasks on the allocated resources,
>>
>>   srun --mpi=pmix ./hello-mpi
>>
>> I'm replying to the list because,
>>
>> a) this information is harder to find than you might think.
>> b) someone/anyone can correct me if I'm giving a bum steer.
>>
>> Hope this helps,
>>
>> Charlie Taylor
>> University of Florida
>>
>> On Nov 16, 2017, at 10:34 AM, Bennet Fauber wrote:
>>
>> I think that OpenMPI is supposed to support SLURM integration such that
>>
>>   srun ./hello-mpi
>>
>> should work?  I built OMPI 2.1.2 with