Rob,
have you looked at Singularity
https://github.com/gmkurtzer/singularity/releases/tag/2.0
It is a new containerisation framework aimed squarely at HPC.
Also you mention Jupyter. I am learning Julia at the moment, and I looked
at the parallel facilities yesterday
https://github.com/JuliaParal
Rob, I really think you should look at the FAQ
http://singularity.lbl.gov/#faq
Also I don't understand what you mean by 'Our users don't have Unix user
IDs'
That is no problem of course - I have worked with Centrify and Samba, where
you can define mappings between Windows users and Unix IDs or groups.
Rob, I am not familiar with wakari.io
However what you say about the Unix userid problem is very relevant to many
'shared infrastructure' projects and is a topic which comes up in
discussions about them.
The concern there is, as you say, if the managers of the system have a
global filesystem, with
Please can someone point me towards the affinity settings for:
OpenMPI 1.10 used with Slurm version 15
I have some nodes with 2630-v4 processors.
So 10 cores per socket / 20 hyperthreads
Hyperthreading is enabled.
I would like to set affinity for 20 processes per node,
so that the processes are
> The binding defaults to *core*. Supported options include slot, hwthread,
> core, l1cache, l2cache, l3cache, socket, numa, board, and none.
>
> https://www.open-mpi.org/doc/current/man1/mpirun.1.php#sect9
>
> On Jul 17, 2016, at 11:25 PM, John Hearns wrote:
>
> Please can someone
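For reference, a sketch of the kind of binding line being discussed, assuming
OpenMPI 1.10 under Slurm and a hypothetical executable ./my_app (option
spellings vary between versions):

# map 10 processes per socket, binding each rank to a physical core
mpirun -np 20 --map-by ppr:10:socket --bind-to core ./my_app
# or launch through Slurm directly, binding tasks to cores
srun --ntasks-per-node=20 --cpu_bind=cores ./my_app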
On 24 August 2010 18:58, Rahul Nabar wrote:
> There are a few unusual things about the cluster. We are using a
> 10GigE ethernet fabric. Each node has dual eth adapters. One 1GigE and
> the other 10GigE. These are on seperate subnets although the order of
> the eth interfaces is variable. i.e. 10G
On 24 September 2010 08:46, Andrei Fokau wrote:
> We use a C-program which consumes a lot of memory per process (up to few
> GB), 99% of the data being the same for each process. So for us it would be
> quite reasonable to put that part of data in a shared memory.
http://www.emsl.pnl.gov/docs/glo
On 20 November 2010 16:31, Gilbert Grosdidier wrote:
> Bonjour,
Bonjour Gilbert.
I manage ICE clusters also.
Please could you have look at /etc/init.d/pbs on the compute blades?
Do you have something like:
if [ "${PBS_START_MOM}" -gt 0 ] ; then
if check_prog "mom" ; then
e
On 14 December 2010 17:32, Lydia Heck wrote:
>
> I have experimented a bit more and found that if I set
>
> OMPI_MCA_plm_rsh_num_concurrent=1024
>
> a job with more than 2,500 processes will start and run.
>
> However when I searched the open-mpi web site for the the variable I could
> not find an
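For anyone searching later: plm_rsh_num_concurrent is an ordinary MCA
parameter, so it can be set in the environment or on the mpirun command line.
A sketch, with a hypothetical process count and executable:

export OMPI_MCA_plm_rsh_num_concurrent=1024
mpirun -np 2500 ./my_app
# equivalently:
mpirun --mca plm_rsh_num_concurrent 1024 -np 2500 ./my_app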
On 17 December 2010 14:45, Gilbert Grosdidier
wrote:
> Bonjour,
> About this issue, for which I got NO feedback ;-)
Gilbert, as you have an SGI cluster, have you filed a support request to SGI?
Also, which firmware do you have installed?
I have Firmware version: 2.5.0
http://www.open
On 17 December 2010 14:45, Gilbert Grosdidier
wrote:
> Bonjour,
> About this issue, for which I got NO feedback ;-) I recently spotted
> into btl_openib.c code, that this error message could come from
On the cluster admin node, run firmware_revs and look for the
Infiniband firmware
On 17 December 2010 15:47, Gilbert Grosdidier
wrote:
>>
> gg= I don't know, and firmware_revs does not seem to be available.
> Only thing I got on a worker node was with lspci :
If you log into a compute node the command is /usr/sbin/ibstat
The firmware_revs command is on the cluster admin
On 6 January 2011 21:10, Gilbert Grosdidier wrote:
> Hi Jeff,
>
> Where's located lstopo command on SuseLinux, please ?
> And/or hwloc-bind, which seems related to it ?
I was able to get hwloc to install quite easily on SuSE -
download/configure/make
Configure it to install to /usr/local/bin
A
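A minimal sketch of that build, assuming the hwloc tarball has already been
downloaded and unpacked:

./configure --prefix=/usr/local
make
make install          # as root
lstopo                # installed by hwloc; prints the machine topology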
On 20 January 2011 06:59, Zhigang Wei wrote:
> Dear all,
>
> I want to use infiniband, I am from a University in the US, my University’s
> high performance center don’t have Gcc compiled openmpi that support
> infiniband, so I want to compile myself.
That is a surprise - you must have som
On 20 January 2011 16:50, Olivier SANNIER wrote:
> >
>
> So there is no dynamic discovery of nodes available on the network. Unless,
> of course, if I was to write a tool that would do it before the actual run
> is started.
That is in essence what a batch scheduler does.
OK, to be honest it has
On 20 January 2011 16:50, Olivier SANNIER wrote:
>> I’ve started looking at beowulf clusters, and that lead me to PBS. Am I
> right in assuming that PBS (PBSPro or TORQUE) could be used to do the
> monitoring and the load balancing I thought of?
Yes, that is correct. An alternative is Gridengine.
Mohd,
the Clustermonkey site is a good resource for you
http://www.clustermonkey.net/
On 2 April 2011 04:16, Ahsan Ali wrote:
> Hello,
> I want to run WRF on multiple nodes in a linux cluster using openmpi,
> giving the command mpirun -np 4 ./wrf.exe just submit it to the single node
> . I don't know how to run it on other nodes as well. Help needed.
Ahsan,
you have a Dell clu
On 30 August 2011 02:55, Ralph Castain wrote:
> Instead, all used dynamic requests - i.e., the job that was doing a
> comm_spawn would request resources at the time of the comm_spawn call. I
> would pass the request to Torque, and if resources were available,
> immediately process them into OMP
Andre,
you should not need the OpenMPI sources.
Install the openmpi-devel package from the same source
(zypper install openmpi-devel if you have that science repository enabled)
This will give you the mpi.h file and other include files, libraries
and manual pages.
That is a convention in Suse-sty
On 03/02/2012, Tom Rosmond wrote:
> Recently the organization I work for bought a modest sized Linux cluster
> for running large atmospheric data assimilation systems. In my
> experience a glaring problem with systems of this kind is poor IO
> performance. Typically they have 2 types of network:
Harini,
you can install OpenMPI which is packaged for your distribution of Linux,
for example on SuSE use zypper install openmpi
or the equivalent on Redhat/Ubuntu
You probably will not get the most up to date Openmpi version,
but you will get the library paths set up in /etc/ld.so.conf.d/ and
t
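A sketch of the distro-package route (package names differ slightly between
distributions):

zypper install openmpi openmpi-devel      # SuSE
yum install openmpi openmpi-devel         # Red Hat / CentOS equivalent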
Have you checked the system logs on the machines where this is running?
Is it perhaps that the processes use lots of memory and the Out Of
Memory (OOM) killer is killing them?
Also check all nodes for left-over 'orphan' processes which are still
running after a job finishes - these should be killed
> ...on the nodes. The failure
> happend on Friday and after that tens of similar jobs completed
> successfully.
>
> Regards,
> Grzegorz Maj
>
> 2012/3/27 John Hearns :
>> Have you checked the system logs on the machines where this is running?
>> Is it perhaps that the proc
It is well worth installing 'htop' to help diagnose situations like this.
On 02/08/2012, Syed Ahsan Ali wrote:
> Yes the issue has been diagnosed. I can ssh them but they are asking for
> passwords
You need to configure 'passwordless ssh'
Can we assume that your home directory is shared across all cluster nodes?
That means when you log into a cluster node the director
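A sketch of the usual passwordless-ssh setup, assuming the home directory
really is shared across all nodes:

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa        # empty passphrase
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys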
Apologies, I have not taken the time to read your comprehensive diagnostics!
As Gus says, this sounds like a memory problem.
My suspicion would be the kernel Out Of Memory (OOM) killer.
Log into those nodes (or ask your systems manager to do this). Look
closely at /var/log/messages where there wil
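A couple of quick checks for OOM killer activity (log file names vary by
distribution):

dmesg | grep -i 'out of memory'
grep -i 'killed process' /var/log/messages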
You need to either copy the data to storage which the cluster nodes have
mounted. Surely your cluster vendor included local storage?
Or you can configure the cluster head node to export the SAN volume by NFS
The data is large and cannot be copied to the local drives of the compute
nodes - I understand that.
I think that you have storage attached to your cluster head node - the
'SAN storage' you refer to.
Let's call that volume /data
All you need to do is edit the /etc/exports file o
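A sketch of what that export might look like - the volume name, subnet and
head node name here are only examples:

# on the head node, in /etc/exports:
/data    192.168.1.0/24(rw,sync,no_root_squash)
# re-read the exports table, then mount on each compute node:
exportfs -ra
mount headnode:/data /data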
If I may ask, which company installed this cluster for you?
Surely they will advise on how to NFS mount the storage on the compute nodes?
Jeff, this is very good advice.
I have had many, many hours of deep joy getting to know the OOM killer
and all of his wily ways.
Respect the OOM Killer!
On a cluster I manage, the OOM killer is working, however there is a
strict policy that if the OOM killer kicks in on a cluster node it is
excluded f
Short answer. Run ibstat or ibstatus.
Look also at the logs of your subnet manager.
Those diagnostics are from Openfabrics.
What type of infiniband card do you have?
What drivers are you using?
2 percent?
Have you logged into a compute node and run a simple top when the job is
running?
Are all the processes distributed across the CPU cores?
Are the processes being pinned properly to a core? Or are they hopping from
core to core?
Also make SURE all nodes have booted with all cores online
Have you run ibstat on every single node and made sure all links are up at
the correct speed?
Have you checked the output to make sure that you are not somehow running
over ethernet?
LART your users. It's the only way.
They will thank you for it, eventually.
www.catb.org/jargon/html/L/LART.html
ldd rca.x
Try logging in to each node and run this command.
Even better use pdsh
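A sketch with pdsh, assuming hypothetical node names and that rca.x sits on a
shared filesystem:

pdsh -w node[01-16] 'ldd /shared/path/rca.x | grep "not found"'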
Backing up what Matthieu said, can you run a simple Hello world mpi
application first?
Then something like a Pallas run - just to make sure you can run
applications in parallel.
For information, if you use a batch system such as PbsPro or Torque it can
be configured to set up the cpuset for a job and start the job within the
cpuset. It will also destroy the cpuset at the end of a job.
Highly useful for job CPU binding as you say, and also if you have a machine
running many
On a bug system you can boot the system into a 'boot cpuset'.
So all system processes run in a small number of low numbered cores. Plus
any login sessions. The batch system then creates cpusets in the higher
numbered cores - free from OS interference.
Bug system?
Big system!
You really should install a job scheduler.
There are free versions.
I'm not sure about cpuset support in Gridengine. Anyone?
Agree with what you say Dave.
Regarding not wanting jobs to use certain cores, i.e. reserving low-numbered
cores for OS processes then surely a good way forward is to use a 'boot
cpuset' of one or two cores and let your jobs run on the rest of the cores.
You're right about cpusets being helpful wit
On 23 August 2013 12:36, Dave Love wrote:
> John Hearns writes:
>
> > > cpuset' of one or two cores and let your jobs run on the rest of the
> cores.
>
> Maybe, if you make sure the resource manager knows about it, and users
> don't mind losing the cor
I agree with what Ralph says.
I have a lot of experience in running SLES 10 and 11 systems and many
flavours of Opensuse.
I am not sure if rpms for Openmpi are available for Sles - I will check.
Installing Openfoam is a pig I agree.
You could be better off with Opensuse from a point of view of Openm
I just checked on the Opensuse Build Service.
There are OpenMPI RPMs available for SLES 11 SP2 - but not SLES 11 SP1
http://software.opensuse.org/package/openmpi (then click on Show Other
Versions )
I have got openmpi installed on my SLES 11 SP1 system, version 1.3.2
Zypper says it is provided
Also for info an Opensuse 12.2 system has openmpi 1.5.4 packaged with it.
Serves me right for not reading your original mail - you have SLES 11 SP2
The openmpi RPMs are provided by the SuSE Software Development Kit DVD.
You can download ISO images of this from the SUSE website.
For SP1 it is named SLE-SP1-SDK-DVD-x86_64-GM-DVD1.iso
So you should copy this .iso file
Also you should be able to define $TMPDIR with your batch system.
This can be on a much bigger disk.
Not a good answer to your question but you could look for the child
processes and look at /proc/$pid/cmdline and cwd
Or just use pgrep -P $pidofmpirun
This is not a good answer. I'm sitting at lunch - so an expert will be
along in a minute with a good answer.
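A sketch of that inspection, taking the PID of a running mpirun as an example:

pid=$(pgrep -o mpirun)              # oldest matching mpirun process
pgrep -P "$pid"                     # list its children
for c in $(pgrep -P "$pid"); do
    ls -l /proc/$c/cwd              # working directory of each child
    tr '\0' ' ' < /proc/$c/cmdline; echo
done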
Do you have a filesystem which is full? 'df' will tell you.
Or maybe mounted read only.
Good to hear that!
OpenMPI aprons. Nice! Good to wear when cooking up those Chef recipes. (Did
I really just say that...)
'Htop' is a very good tool for looking at where processes are running.
I got NIST Fire and Smoke installed and running for a customer at my last
job.
The burning sofa demo is pretty nifty!
Ps. 'htop' is a good tool for looking at where processes are running.
Khadije - you need to give a list of compute hosts to mpirun.
And probably have to set up passwordless ssh to each host.
Noam, cpusets are a very good idea.
Not only for CPU binding but for isolating 'badly behaved' applications.
If an application starts using huge amounts of memory - kill it, collapse
the cpuset and it is gone - nice clean way to manage jobs.
On Fri, 2008-06-06 at 17:56 +0100, SLIM H.A. wrote:
> Hi
>
> I want to use SGE to run jobs on a cluster with mx and infiniband nodes.
> By dividing the nodes into two host groups SGE will submit to either
> interconnect.
>
> The interconnect can be specified in the mpirun command with the --mca
>
loop
several times and you will see what I mean.
If you have other machines on the network, you have to configure them
such that you can start remote processes on them.
When you use "mpirun" to launch your MPI code you need to give the names
of those machines as a parameter to mpirun - it is known as a "machines
file".
John Hearns
2008/9/18 Alex Wolfe
> Hello,
>
> I am trying to run the HPL benchmarking software on a new 1024 core cluster
> that we have set up. Unfortunately I'm hitting the "mca_oob_tcp_accept:
> accept() failed: Too many open files (24)" error known in verson 1.2 of
> openmpi. No matter what I set the fil
2008/9/19 Alex Wolfe
> I'm just running it using mpirun from the command line. Thanks for the
> reply.
>
>>
>
Have you checked what ulimit -a
returns on all the nodes of your cluster, i.e. when you ssh into them what
does ulimit -a give you?
I may be on the wrong track here.
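A sketch of the usual check and fix for the open-files limit, with hypothetical
node names and limits:

pdsh -w node[001-128] 'ulimit -n'
# raise it in /etc/security/limits.conf on every node, for example:
*    soft    nofile    65536
*    hard    nofile    65536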
Regarding hyperthreading, and finding out information about your CPUs
in detail, there is the excellent hwloc project from OpenMPI
http://www.open-mpi.org/projects/hwloc/
I downloaded the 1.0 release candidate, and it compiled and ran first
time on Nehalem systems. Gives a superb and helpful view
Gus,
I'm not using OpenMPI, however OpenSUSE 11.2 with current updates
seems to work fine on Nehalem.
I'm curious that you say the Nvidia graphics driver does not install -
have you tried running the install script manually, rather than
downloading an RPM etc?
I'm using version 195.36.15 and it
On 7 May 2010 03:17, Jeff Squyres wrote:
>
> Indeed. I have seen some people have HT enabled in the bios just so that
> they can have the software option of turning them off via linux -- then you
> can run with HT and without it and see what it does to your specific codes.
I may have missed t
If you have a system with two IB cards, can you choose using a command line
switch which card to use with Openmpi?
Also a more general question - can you change (or throttle back) the speed
at which an Infiniband card works at?
For example, to use an FDR card at QDR speeds.
Thanks for any insight
You say that you can run the code OK 'by hand' with an mpirun.
Are you assuming somehow that the Gridengine jobs will inherit your
environment variables, paths etc?
If I remember correctly, you should submit with the -V option to pass
over environment settings.
Even better, make sure that the jo
As an aside, with Slurm you can use:
sbatch --ntasks-per-socket=
I would hazard a guess that this uses the OpenMPI syntax as above to
perform the binding to core!
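A sketch of that Slurm route, assuming a dual-socket node and a hypothetical
job script:

sbatch --ntasks=20 --ntasks-per-socket=10 job.sh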
On 27 July 2015 at 09:47, Ralph Castain wrote:
> As you say, it all depends on your kernel :-)
>
> If the numactl libraries are a
Hi Steve.
Regarding Step 3, have you thought of using some shared storage?
NFS shared drive perhaps, or there are many alternatives!
On 23 January 2016 at 20:47, Steve O'Hara
wrote:
> Hi,
>
>
>
> I’m afraid I’m pretty new to both OpenFOAM and openMPI so please excuse me
> if my questions are eit
Rob,
I agree with what Dave Love says. The distro packaged OpenMPI packages
should set things up OK for you.
I guess that is true on the head node, but from what you say maybe the
cluster compute nodes are being installed some other way.
On HPC clusters, when you are managing alternate packages
2008/11/19 Ray Muno
> Thought I would revisit this one.
>
> We are still having issues with this. It is not clear to me what is leaving
> the user files behind in /dev/shm.
>
> This is not something users are doing directly, they are just compiling
> their code directly with mpif90 (from OpenMPI)
2008/11/20 Ray Muno
> J
>>
> OK, what should I be seeing when I run "ipcs -p"?
>
>
Looks like I don't know my System V from my POSIX.
I know what to do.
2009/2/4 Hana Milani
>
> Is there a local system administrator that you can talk to about this?
>
> Not a very good one,
I'm sure he or she is just gonna LOVE you.
I would seriously advise a big box of doughnuts on http://www.sysadminday.com/
And please cut the HTML formatting with the bold tex
2009/3/30 Kevin McManus :
> >
> I can find psm libs at...
>
> /usr/lib/libpsm_infinipath.so.1.0
> /usr/lib/libpsm_infinipath.so.1
> /usr/lib64/libpsm_infinipath.so.1.0
> /usr/lib64/libpsm_infinipath.so.1
On x86_64 type systems /usr/lib64 are the 64 bit libraries, /usr/lib
are the 32 bit ones
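If in doubt, the 'file' command will confirm which is which:

file /usr/lib/libpsm_infinipath.so.1        # should report a 32-bit ELF shared object
file /usr/lib64/libpsm_infinipath.so.1      # should report a 64-bit ELF shared object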
2009/4/3 Francesco Pietra :
> > "expected file /usr/lib/include/numa.h was not found"
>
> In debian amd64 lenny numa.h has a different location
> "/usr/include/numa.h". Attached is the config.log.
>
> I would appreciate help in circumventing the problem.
It is /usr/include/numa.h on SuSE also (SLE
2009/4/3 Francesco Pietra :
> I was not sure whether that is a technically correct procedure. It works.
> Thanks
>
It most certainly is not. But I have been a Unix system admin for many
years. I have done things which I am
not proud of
If I ever offer to let you use my keyboard, wash your
2009/4/6 Ankush Kaul :
>> Also how do i come to know that the program is using resources of both the
> nodes?
Log into the second node before you start the program.
Run 'top'
Seriously - top is a very, very useful utility.
2009/5/14 Valmor de Almeida :
>
> Hello,
>
> I am wondering whether light oversubscription could lead to a clobbered
> program.
Apologies if this is a stupid reply.
Have you checked if the OOM killer (out of memory killer) is being
triggered when you run the program on the laptop?
Open a separate w
Hello Lachlan. I think Jeff Squyres will be along in a short while! He is
of course the expert on Cisco.
In the meantime a quick Google turns up:
http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/usnic/c/deployment/2_0_X/b_Cisco_usNIC_Deployment_Guide_For_Standalone_C-SeriesServers.html
Mahmood, as Gilles says start by looking at how that application is compiled
and linked.
Run 'ldd' on the executable and look closely at the libraries. Do this on
a compute node if you can.
There was a discussion on another mailing list recently about how to
fingerpritn executables and see which a
Mahmood,
are you compiling and linking this application?
Or are you using an executable which someone else has prepared?
It would be very useful if we could know the application.
On 2 September 2016 at 16:35, Mahmood Naderan wrote:
> >Did you ran
> >ulimit -c unlimited
> >before invoking mpi
Thank you. That is helpful.
Could you run an 'ldd' on your executable, on one of the compute nodes if
possible?
I will not be able to solve your problem, but at least we now know what the
application is,
and can look at the libraries it is using.
On 2 September 2016 at 17:19, Mahmood Naderan w
Sergei, what does the command "ibv_devinfo" return please?
I had a recent case like this, but on Qlogic hardware.
Sorry if I am mixing things up.
On 28 October 2016 at 10:48, Sergei Hrushev wrote:
> Hello, All !
>
> We have a problem with OpenMPI version 1.10.2 on a cluster with newly
> inst
Sorry - shoot down my idea. Over to someone else (me hides head in shame)
On 28 October 2016 at 11:28, Sergei Hrushev wrote:
> Sergei, what does the command "ibv_devinfo" return please?
>>
>> I had a recent case like this, but on Qlogic hardware.
>> Sorry if I am mixing things up.
>>
>>
> An
Sergei,
can you run :
ibhosts
ibstat
ibdiagnet
Lord help me for being so naive, but do you have a subnet manager running?
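Two quick checks, both from the standard infiniband-diags tools:

ibstat | grep -i state      # ports should show State: Active, not Initializing
sminfo                      # reports the subnet manager, if one is answering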
On 1 November 2016 at 06:40, Sergei Hrushev wrote:
> Hi Jeff !
>
> What does "ompi_info | grep openib" show?
>>
>>
> $ ompi_info | grep openib
> MCA bt
Mahmoud, you should look at the OpenHPC project.
http://www.openhpc.community/
On 15 December 2016 at 19:50, Mahmoud MIRZAEI wrote:
> Dears,
>
> May you please let me know if there is any procedure to install OpenMPI on
> CentOS in HPC?
>
> Thanks.
> Mahmoud
>
Jordi,
this is not an answer to your question. However have you looked at
Singularity:
http://singularity.lbl.gov/
On 24 March 2017 at 08:54, Jordi Guitart wrote:
> Hello,
>
> Docker allows several containers running in the same host to share the
> same IPC namespace, thus they can share mem
Ray, probably a stupid question but do you have the hwloc-devel package
installed?
And also the libxml2-devel package?
On 27 April 2017 at 21:54, Ray Sheppard wrote:
> Hi All,
> I have searched the mail archives because I think this issue was
> addressed earlier, but I can not find anything
Gabriele, as this is based on OpenMPI can you run ompi_info
then look for the btl which are available and the mtl which are available?
On 18 May 2017 at 14:10, Reuti wrote:
> Hi,
>
> > Am 18.05.2017 um 14:02 schrieb Gabriele Fatigati :
> >
> > Dear OpenMPI users and developers, I'm using IBM
> MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v10.1.0)
>
>
> about mtl no information retrieve ompi_info
>
>
> 2017-05-18 14:13 GMT+02:00 John Hearns via users >:
>
>> Gabriele, as this is based on OpenMPI can you run ompi_info
>> then look for the btl which a
Gabriele, please run 'ibv_devinfo'
It looks to me like you may have the physical interface cards in these
systems, but you do not have the correct drivers or libraries loaded.
I have had similar messages when using Infiniband on x86 systems - which
did not have libibverbs installed.
On 19 May
it does not work, can run and post the logs)
>
> mpirun --mca pml ^pami --mca pml_base_verbose 100 ...
>
>
> Cheers,
>
>
> Gilles
>
>
> On 5/19/2017 4:01 PM, Gabriele Fatigati wrote:
>
>> Hi John,
>> Infiniband is not used, there is a single node on
Gilles, Allan,
if the host 'smd' is acting as a cluster head node it is not a must for it
to have an Infiniband card.
So you should be able to run jobs across the other nodes, which have Qlogic
cards.
I may have something mixed up here, if so I am sorry.
If you want also to run jobs on the smd hos
Allan,
remember that Infiniband is not Ethernet. You don't NEED to set up IPoIB
interfaces.
Two diagnostics please for you to run:
ibnetdiscover
ibdiagnet
Let us please have the results of ibnetdiscover
On 19 May 2017 at 09:25, John Hearns wrote:
> Giles, Allan,
>
> if the
folks will comment on that shortly.
>>>
>>>
>>> meanwhile, you do not need pami since you are running on a single node
>>>
>>> mpirun --mca pml ^pami ...
>>>
>>> should do the trick
>>>
>>> (if it does not w
>>> findActiveDevices Error
>>> We found no active IB device ports
>>> Hello world from rank 0 out of 1 processors
>>>
>>> So it seems to work apart the error message.
>>>
>>>
>>> 2017-05-19 9:10 GMT+02:00 Gilles Gouaillard
neral case.
Supercomputer clusters running over high-performance fabrics are complicated
beasts. It is not sufficient to plug in cards and cables.
On 19 May 2017 at 11:12, John Hearns wrote:
> I am not sure I agree with that.
> (a) the original error message from Gabriele was quite
Michael, try
--mca plm_rsh_agent ssh
I've been fooling with this myself recently, in the context of a PBS cluster
On 22 June 2017 at 16:16, Michael Di Domenico
wrote:
> is it possible to disable slurm/munge/psm/pmi(x) from the mpirun
> command line or (better) using environment variables?
>
>
> You can add "OMPI_MCA_plm=rsh OMPI_MCA_sec=^munge" to your environment
>
>
> On Jun 22, 2017, at 7:28 AM, John Hearns via users <
> users@lists.open-mpi.org> wrote:
>
> Michael, try
> --mca plm_rsh_agent ssh
>
> I've been fooling with this myself recentl
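Putting the two suggestions in this thread together, a sketch using
environment variables (the executable is just a placeholder):

export OMPI_MCA_plm=rsh
export OMPI_MCA_plm_rsh_agent=ssh
export OMPI_MCA_sec=^munge
mpirun -np 4 ./a.out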
I may have asked this recently (if so sorry).
If anyone has worked with QoS settings with OpenMPI please ping me off list,
eg
mpirun --mca btl_openib_ib_service_level N