[slurm-users] Re: Running SLURM in a laptop

2025-02-19 Thread John Hearns via slurm-users
How about using cpusets? Create a boot cpuset with the E-cores and start Slurm on the P-cores. Yeah, showing my age by talking about cpusets. On Wed, Feb 19, 2025, 6:05 PM Timo Rothenpieler via slurm-users < slurm-users@lists.schedmd.com> wrote: > On 19.02.2025 14:06, Luke Sudbery via slurm-users w

[slurm-users] Run only one time on a node

2025-02-18 Thread John Hearns via slurm-users
I am running single node tests on a cluster. I can select named nodes using the -w flag with sbatch. However - if I want to submit perhaps 20 test jobs is there any smart way to run only one time on a node? I know I could touch a file with the hostname and test for that file. I am just wondering i
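The marker-file idea mentioned above can be sketched as a small shell function. This is only an illustration, not a Slurm feature: the function name run_once_per_node, the "tested-" filename prefix and the marker directory are all assumptions made up for the example.

```shell
#!/usr/bin/env bash
# Run a command at most once per node, using a marker file named after the host.
# run_once_per_node and the "tested-<host>" naming are hypothetical choices.
run_once_per_node() {
    local marker_dir=$1; shift
    local marker="$marker_dir/tested-$(hostname)"
    # With noclobber set, the redirect fails if the marker already exists,
    # so two jobs landing on the same node cannot both claim it.
    if ( set -o noclobber; : > "$marker" ) 2>/dev/null; then
        "$@"
    else
        echo "skipping: $(hostname) already tested" >&2
    fi
}
```

Inside a batch script this might be called as run_once_per_node /shared/markers ./node_test.sh, where both the directory and the script name are placeholders.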

[slurm-users] Create filenames based on slurm hosts

2025-02-14 Thread John Hearns via slurm-users
I am working on power logging of a GPU cluster I am working with. I am running jobs on multiple hosts. I want to create a file, one for each host, which has a unique filename containing the host name. Something like clush -w $SLURM_JOB_NODELIST "touch file$(hostname)" My foo is weak today. Help
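One likely snag in the command as quoted: inside double quotes, the local shell expands $(hostname) before clush ever runs, so every node would be asked to touch the same filename. A minimal sketch of a fix, assuming clush is installed and the script runs inside a Slurm allocation (the "power-<host>.log" name is purely illustrative):

```shell
#!/usr/bin/env bash
# Single quotes keep $(hostname) literal here, so it is expanded on each
# remote node rather than by the shell that launches clush.
remote_cmd='touch "power-$(hostname).log"'   # filename pattern is an assumption

# Only attempt the fan-out inside an allocation where clush is available.
if [[ -n "${SLURM_JOB_NODELIST:-}" ]] && command -v clush >/dev/null 2>&1; then
    clush -w "$SLURM_JOB_NODELIST" "$remote_cmd"
fi
```

The same single-quoting applies if the command is launched another way, for example through srun with one task per node.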

[slurm-users] Re: jobs getting stuck in CG

2025-02-10 Thread John Hearns via slurm-users
ps -eaf --forest is your friend with Slurm On Mon, Feb 10, 2025, 12:08 PM Michał Kadlof via slurm-users < slurm-users@lists.schedmd.com> wrote: > I observed similar symptoms when we had issues with the shared Lustre file > system. When the file system couldn't complete an I/O operation, the > pro

[slurm-users] Re: jobs getting stuck in CG

2025-02-10 Thread John Hearns via slurm-users
Belay that reply. Different issue. In that case salloc works OK but srun says user has no job on the node. On Mon, Feb 10, 2025, 9:24 AM John Hearns wrote: > I have had something similar. > The fix was to run a > scontrol reconfig > Which causes a reread of the Slurmd config >

[slurm-users] Re: jobs getting stuck in CG

2025-02-10 Thread John Hearns via slurm-users
I have had something similar. The fix was to run a scontrol reconfig, which causes a reread of the slurmd config. Give that a try. It might be scontrol reread. Use the manual. On Mon, Feb 10, 2025, 8:32 AM Ricardo Román-Brenes via slurm-users < slurm-users@lists.schedmd.com> wrote: > Hello everyone.

[slurm-users] Re: Installing slurm*

2025-02-04 Thread John Hearns via slurm-users
Steven, one tip if you are just starting with Slurm: "Use the logs Luke, Use the logs" By this I mean tail -f /var/log/slurmctl and restart the slurmctld service. On a compute node, tail -f /var/log/slurmd. Oh, and you probably are going to set up Munge also - which is easy. On Tue, 4 Feb 2025

[slurm-users] Re: RHEL8.10 V slurmctld

2025-01-30 Thread John Hearns via slurm-users
Have you run id on a compute node? On Wed, Jan 29, 2025, 6:47 PM Steven Jones via slurm-users < slurm-users@lists.schedmd.com> wrote: > I am using Redhat's IdM/IPA for users > > Slurmctld is failing to run jobs and it is getting "invalid user id". > > "2025-01-28T21:48:50.271] sched: Allocate J

[slurm-users] Re: need help with seff script on ubuntu (slurm 21.08)

2025-01-09 Thread John Hearns via slurm-users
To debug shell scripts try running with the -x flag ??? On Thu, Jan 9, 2025, 10:51 AM Gérard Henry (AMU) via slurm-users < slurm-users@lists.schedmd.com> wrote: > Hello all and happy new year, > > i have installed slurm 21.08 on ubuntu 22 LTS, and database for > accounting on a remote machine run

[slurm-users] Re: launch failed requeued held

2025-01-08 Thread John Hearns via slurm-users
Generally, the troubleshooting steps which you should take for Slurm are: squeue to look at the list of running/queued or held jobs sinfo to show which nodes are idle, busy or down scontrol show node to get more detailed information on a node For problem nodes - indeed just log into any node t
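The steps listed above can be gathered into one helper. This is a sketch only: slurm_triage is a hypothetical name, and sinfo -R (which lists the reason recorded for down or drained nodes) is an addition not mentioned in the message.

```shell
#!/usr/bin/env bash
# First-pass Slurm triage; pass a node name to get its details as well.
slurm_triage() {
    local node=${1:-}
    squeue      # running, queued and held jobs
    sinfo       # which nodes are idle, busy or down
    sinfo -R    # reasons recorded for down/drained nodes (added for illustration)
    if [[ -n "$node" ]]; then
        scontrol show node "$node"   # detailed state of one node
    fi
}
```

Calling slurm_triage with a suspect node name prints the queue, the partition summary and that node's full record in one go.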

[slurm-users] Re: launch failed requeued held

2025-01-07 Thread John Hearns via slurm-users
You need to find the node which the job started on. Then look at the slurmd log on that node. You may find an indication of the reason for the failure. On Tue, 7 Jan 2025 at 11:30, sportlecon sportlecon via slurm-users < slurm-users@lists.schedmd.com> wrote: > slurm 24.11 - squeue displays reaso

[slurm-users] Re: formatting node names

2025-01-07 Thread John Hearns via slurm-users
Davide, the 'nodeset' command can be used here nodeset -e -S '\n' node[03-04,12-22,27-32,36] On Mon, 6 Jan 2025 at 19:58, Davide DelVento via slurm-users < slurm-users@lists.schedmd.com> wrote: > Hi all, > I remember seeing on this list a slurm command to change a slurm-friendly > list suc
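nodeset comes from ClusterShell; where that is not installed, scontrol show hostnames performs the same expansion using Slurm alone. A small sketch that uses whichever tool is present (expand_nodes is a hypothetical helper name):

```shell
#!/usr/bin/env bash
# Expand a bracketed node list into one hostname per line.
expand_nodes() {
    local list=$1
    if command -v scontrol >/dev/null 2>&1; then
        scontrol show hostnames "$list"
    elif command -v nodeset >/dev/null 2>&1; then
        nodeset -e -S '\n' "$list"
    else
        echo "neither scontrol nor nodeset found" >&2
        return 127
    fi
}
```

For example, expand_nodes 'node[03-04,12-22,27-32,36]' should print one node name per line, ready to feed into a loop.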

[slurm-users] Re: Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions

2025-01-04 Thread John Hearns via slurm-users
Output of sinfo and squeue Look at slurmd log in an example node also Tail -f is your friend On Sat, Jan 4, 2025, 8:13 AM sportlecon sportlecon via slurm-users < slurm-users@lists.schedmd.com> wrote: > JOBID PARTITION NAME USER ST TIME NODES > NODELIST(REASON) >

[slurm-users] Re: Slurm plugin for custom hardware allocation

2024-12-23 Thread John Hearns via slurm-users
I think this was discussed here recently. On Mon, Dec 23, 2024, 12:18 PM Laura Zharmukhametova via slurm-users < slurm-users@lists.schedmd.com> wrote: > Hello, > > Is there an existing Slurm plugin for FPGA allocation? If not, can someone > please point me in the right direction for how to approa

[slurm-users] Re: getting slurm going

2024-12-08 Thread John Hearns via slurm-users
Is your slurm.conf identical on all nodes? On Sun, Dec 8, 2024, 7:42 PM John Hearns wrote: > Tail -f on the slurm controller logs > > Log into a compute node and tail -f on slurmd log as the slurmd log is > started > Or start slurmd in the foreground and set debug flag > >

[slurm-users] Re: getting slurm going

2024-12-08 Thread John Hearns via slurm-users
Tail -f on the slurm controller logs Log into a compute node and tail -f on slurmd log as the slurmd log is started Or start slurmd in the foreground and set debug flag On Sun, Dec 8, 2024, 7:37 PM Steven Jones via slurm-users < slurm-users@lists.schedmd.com> wrote: > Hi, > > I did that, > > [r

[slurm-users] Re: Non-Standard Mail Notification in Job

2024-12-05 Thread John Hearns via slurm-users
I used to configure Postfix on the head node. All compute nodes are then configured to use the head node as a relay. On Thu, Dec 5, 2024, 1:14 AM Kevin Buckley via slurm-users < slurm-users@lists.schedmd.com> wrote: > On 2024/12/05 05:37, Daniel Miliate via slurm-users wrote: > > > > I'm trying t

[slurm-users] Re: Access denied by pam_slurm_adopt

2024-11-10 Thread John Hearns via slurm-users
Forget what I just said. slurmctld had not been restarted in a month of Sundays and it was logging mismatches in the slurm.conf. Slurm reconfig and a restart of all slurmd and problem looks fixed. On Sun, 10 Nov 2024 at 14:50, John Hearns wrote: > I have a cluster which uses Slurm 23.1

[slurm-users] Access denied by pam_slurm_adopt

2024-11-10 Thread John Hearns via slurm-users
I have a cluster which uses Slurm 23.11.6. When I submit a multi-node job and run something like clush -b -w $SLURM_JOB_NODELIST "date" very often the ssh command fails with: Access denied by pam_slurm_adopt: you have no active jobs on this node This will happen maybe on 50% of the nodes. There is t

Re: [slurm-users] Running pyMPI on several nodes

2019-07-16 Thread John Hearns
srun: error: Application launch failed: Invalid node name specified Hearns Law. All batch system problems are DNS problems. Seriously though - check out your name resolution both on the head node and the compute nodes. On Tue, 16 Jul 2019 at 08:49, Pär Lundö wrote: > Hi, > > I have now had th
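A quick way to act on that advice is to check that every node name resolves, running the same check on the head node and on a compute node. A minimal sketch using getent; check_names is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Report which of the given hostnames resolve through the system resolver.
check_names() {
    local rc=0 host
    for host in "$@"; do
        if getent hosts "$host" >/dev/null 2>&1; then
            echo "ok      $host"
        else
            echo "FAILED  $host"
            rc=1
        fi
    done
    return "$rc"
}
```

Inside an allocation, check_names $(scontrol show hostnames "$SLURM_JOB_NODELIST") covers every node of the job; a name reported as FAILED on one host but ok on another points at inconsistent /etc/hosts or DNS configuration.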

Re: [slurm-users] Running pyMPI on several nodes

2019-07-12 Thread John Hearns
Par, by 'poking around' Chris means to use tools such as netstat and lsof. Also I would look at ps -eaf --forest to make sure there are no 'orphaned' jobs sitting on that compute node. Having said that though, I have a dim memory of a classic PBSPro error message which says something about a netw

Re: [slurm-users] Running pyMPI on several nodes

2019-07-11 Thread John Hearns
Please try something very simple such as a hello world program or srun -N2 -n8 hostname What is the error message which you have ? On Fri, 12 Jul 2019 at 07:07, Pär Lundö wrote: > > Hi there Slurm-experts! > I am trouble using or running a python-mpi program involving more than > one node. The

Re: [slurm-users] Running pyMPI on several nodes

2019-07-11 Thread John Hearns
My apologies. You do say that the Python program simply prints the rank - so is a hello world program. On Fri, 12 Jul 2019 at 07:45, John Hearns wrote: > Please try something very simple such as a hello world program or > srun -N2 -n8 hostname > > What is the error message which you h

Re: [slurm-users] Proposal for new TRES - "Processor Performance Units"....

2019-06-20 Thread John Hearns
Paul, you refer to banking resources. Which leads me to ask are schemes such as Gold used these days in Slurm? Gold was a utility where groups could top up with a virtual amount of money which would be spent as they consume resources. Altair also wrote a similar system for PBS, which they offered t

Re: [slurm-users] Configure Slurm 17.11.9 in Ubuntu 18.10 with use of PMI

2019-06-20 Thread John Hearns
Palle, you will get a more up to date version of Slurm by using the GitHub repository https://github.com/SchedMD/slurm You do not necessarily have to use the Linux distribution version of packages, which are often out of date. However - please tell us a bit more about your environment. Specificall

Re: [slurm-users] Proposal for new TRES - "Processor Performance Units"....

2019-06-20 Thread John Hearns
Janne, thank you. That FGCI benchmark in a container is pretty smart. I always say that real application benchmarks beat synthetic benchmarks. Taking a small mix of applications like that and taking a geometric mean is great. Note: *"a reference result run on a Dell PowerEdge C4130"* In the old da

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-13 Thread John Hearns
I agree with Christopher Coffey - look at the sssd caching. I have had experience with sssd and can help a bit. Also if you are seeing long waits could you have nested groups? sssd is notorious for not handling these well, and there are settings in the configuration file which you can experiment wi

Re: [slurm-users] Slurm Install on Remote System

2019-05-26 Thread John Hearns
Think of system administrators like grumpy bears in their caves. They will growl at you and make fierce noises. But bring them cookies and they will roll over and let their tummies be tickled. On Sun, 26 May 2019 at 05:25, Raymond Wan wrote: > > > On 25/5/2019 7:37 PM, John Hea

Re: [slurm-users] Slurm Install on Remote System

2019-05-26 Thread John Hearns
Priya, you could set up a cluster on Amazon or another cloud for testing. Please have a look at this https://elasticluster.readthedocs.io/en/latest/ If you want to set up some virtual machines on your own laptop or server, Google for vagrant slurm There are several vagrant recipes on the net.

Re: [slurm-users] Slurm Install on Remote System

2019-05-25 Thread John Hearns
specific instructions >>> for >>> installing on Remote server without root access ? >>> ------ next part -- >>> An HTML attachment was scrubbed... >>> URL: < >>> http://lists.schedmd.com/pipermail/slurm-users/attachments/20190525/684d07e9/attachment-0001.html >>> > &

Re: [slurm-users] Slurm Install on Remote System

2019-05-25 Thread John Hearns
OK, I am going to stick my neck out here. You say a 'remote system' - is this a single server? If it is, for what purpose do you need Slurm? If you want to schedule some tasks to run one after the other, simply start a screen session then put the tasks into a script. I am sorry if I sound rude her

Re: [slurm-users] Using cgroups to hide GPUs on a shared controller/node

2019-05-21 Thread John Hearns
tml On Tue, 21 May 2019 at 01:28, Dave Evans wrote: > Do you have that resource handy? I looked into the cgroups documentation > but I see very little on tutorials for modifying the permissions. > > On Mon, May 20, 2019 at 2:45 AM John Hearns > wrote: > >> Two repli

Re: [slurm-users] Access/permission denied

2019-05-20 Thread John Hearns
Why are you sshing into the compute node compute-0-2 ??? On the head node named rocks7: srun -c 1 --partition RUBY --account y8 --mem=1G xclock On Mon, 20 May 2019 at 16:07, Mahmood Naderan wrote: > Hi > Although proper configuration has been defined as below > > [root@rocks7 software]# grep R

Re: [slurm-users] Using cgroups to hide GPUs on a shared controller/node

2019-05-20 Thread John Hearns
Two replies here. First off for normal user logins you can direct them into a cgroup - I looked into this about a year ago and it was actually quite easy. As I remember there is a service or utility available which does just that. Of course the user cgroup would not have Expanding on my theme, it

Re: [slurm-users] job startup timeouts?

2019-04-26 Thread John Hearns
It's a DNS problem, isn't it? Seriously though - how long does srun hostname take for a single system? On Fri, 26 Apr 2019 at 15:49, Douglas Jacobsen wrote: > We have 12,000 nodes in our system, 9,600 of which are KNL. We can > start a parallel application within a few seconds in most cases

Re: [slurm-users] Job dispatching policy

2019-04-24 Thread John Hearns
I would suggest that if those applications really are not possible with Slurm - then reserve a set of nodes for interactive use and disable the Slurm daemon on them. Direct users to those nodes. More constructively - maybe the list can help you get the X11 applications to run using Slurm. Could yo

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-23 Thread John Hearns
Will, there are some excellent responses here. I agree that moving data to local fast storage on a node is a great idea. Regarding the NFS storage, I would look at implementing BeeGFS if you can get some new hardware or free up existing hardware. BeeGFS is a skoosh case to set up. (*) Scottish sl

Re: [slurm-users] New Bright Cluster Slurm issue for AD users

2019-02-13 Thread John Hearns
ght so no answer to your specific > question but I hope you can get some support with it. We dumped our BC > PoC, the sysadmin working on the PoC still has nightmares. > > On 2/13/19, 6:54 AM, "slurm-users on behalf of John Hearns" < > slurm-users-boun...@lists.schedmd.co

Re: [slurm-users] New Bright Cluster Slurm issue for AD users

2019-02-13 Thread John Hearns
please have a look at section 6.3 of the Bright Admin Manual You have run updateprovisioners then rebooted the nodes? Configuring The Cluster To Authenticate Against An External LDAP Server The cluster can be configured in different ways to authenticate against an external LDAP server. For smaller

Re: [slurm-users] New Bright Cluster Slurm issue for AD users

2019-02-13 Thread John Hearns
Yugendra, the Bright support guys are excellent. Slurm is their default choice. I would ask again. Yes, Slurm is technically out of scope for them, but they should help a bit. By the way, I think your problem is that you have configured authentication using AD on your head node. BUT you have not

Re: [slurm-users] Visualisation -- Slurm and (Turbo)VNC

2019-01-03 Thread John Hearns
Hi David. I set up DCV on a cluster of workstations at a facility not far from you a few years ago (in Woking...). I'm not sure what the relevance of having multiple GPUs is - I thought the DCV documentation dealt with that ?? One thing you should do is introduce MobaXterm to your users if they

Re: [slurm-users] Wedged nodes from cgroups, OOM killer, and D state process

2018-11-30 Thread John Hearns
Chris, I have delved deep into the OOM killer code and interaction with cpusets in the past (*). That experience is not really relevant! However I always recommend looking at this sysctl parameter min_free_kbytes https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/perform

Re: [slurm-users] An observation on SLURM's logging

2018-11-27 Thread John Hearns
https://twitter.com/systemdsucks Is that my coat? Why thankyou. On Tue, 27 Nov 2018 at 11:21, Kevin Buckley wrote: > Here are our usual Slurm log-related config settings > > (where we've increased the default "Syslog" level as we're trying > to get the Crays to funnel the slurmd log messages a

Re: [slurm-users] About x11 support

2018-11-27 Thread John Hearns
Going off topic, if you want an ssh client and an X-server on a Windows workstation or laptop, I highly recommend MobaXterm. You can open a remote desktop easily. Session types are ssh, VNC, RDP, Telnet(!) , Mosh and anything else you can think of. Including a serial terminal for those times when y

Re: [slurm-users] AWS SLURM Burst Cluster, fill configuring nodes

2018-10-26 Thread John Hearns
Hi Jordan. Regarding filling up the nodes look at https://slurm.schedmd.com/elastic_computing.html *SelectType* Generally must be "select/linear". If Slurm is configured to allocate individual CPUs to jobs rather than whole nodes (e.g. SelectType=select/cons_res rather than SelectType=select/linea

Re: [slurm-users] Cgroups and swap with 18.08.1?

2018-10-19 Thread John Hearns
After doing some Googling https://jvns.ca/blog/2017/02/17/mystery-swap/ Swapping is weird and confusing (Amen to that!) https://jvns.ca/blog/2016/12/03/how-much-memory-is-my-process-using-/ (interesting article) From the Docker documentation, below. Bill - this is what you are seeing. Twice as

Re: [slurm-users] Socket timed out on send/recv operation

2018-10-18 Thread John Hearns
Kirk, MailProg=/usr/bin/sendmail MailProg should be the program used to SEND mail ie. /bin/mail not sendmail If I am not wrong in the jargon MailProg is a MUA not an MTA (sendmail is an MTA) On Thu, 18 Oct 2018 at 19:01, Kirk Main wrote: > Hi all, > > I'm a new administrator to Slurm a

Re: [slurm-users] Cgroups and swap with 18.08.1?

2018-10-16 Thread John Hearns
what you think is happening - remember that log messages take effort to put in the code, well at least some keystrokes, so they usually mean something! On Tue, 16 Oct 2018 at 10:04, John Hearns wrote: > Rather dumb question from me - you have checked those processes are > running within a

Re: [slurm-users] Cgroups and swap with 18.08.1?

2018-10-16 Thread John Hearns
Rather dumb question from me - you have checked those processes are running within a cgroup? I have no experience in constraining the swap usage using cgroups, so sorry if I am adding nothing to the debate here. On Tue, 16 Oct 2018 at 04:49, Bill Broadley wrote: > > Greetings, > > I'm using ubun

[slurm-users] Tiered RAM and Swap Space

2018-09-25 Thread John Hearns
We recently had a very good discussion on swap space and job suspension. I had a look at the Intel pages on Optane memory. https://www.intel.com/content/www/us/en/architecture-and-technology/intel-optane-technology.html It is definitely being positioned as a fast file cache, ie for block oriented d

Re: [slurm-users] swap size

2018-09-22 Thread John Hearns
I would say that, yes, you have a good workflow here with Slurm. As another aside - is anyone working with suspending and resuming containers? I see on the Singularity site that suspend/resume is on the roadmap (I am not talking about checkpointing here). Also it is worth saying that these days on

Re: [slurm-users] swap size

2018-09-21 Thread John Hearns
Ashton, on a compute node with 256Gbytes of RAM I would not configure any swap at all. None. I managed an SGI UV1 machine at an F1 team which had 1Tbyte of RAM - and no swap. Also our ICE clusters were diskless - SGI very smartly configured swap over ISCSI - but we disabled this, the reason being

Re: [slurm-users] Create users

2018-09-15 Thread John Hearns
Loris said: Until now I had thought that the most elegant way of setting up Slurm users would be via a PAM module analogous to pam_mkhomedir, the simplest option being to use pam_script. When in Denmark this year (hello Ole!) I looked at pam_mkhomedir quite closely. The object was to automaticall

Re: [slurm-users] can't create memory group (cgroup)

2018-09-08 Thread John Hearns
Not an answer to your question - a good diagnostic for cgroups is the utility 'lscgroup' On Sat, 8 Sep 2018 at 10:10, Gennaro Oliva wrote: > > Hi Mike, > > On Fri, Sep 07, 2018 at 03:53:44PM +, Mike Cammilleri wrote: > > I'm getting this error lately for everyone's jobs, which results in > >

Re: [slurm-users] Configuration issue on Ubuntu

2018-09-05 Thread John Hearns
Following on from what Chris Samuel says /root/sl/sl2 kinda suggests Scientific Linux to me (SL - Redhat alike distribution used by Fermilab and CERN) Or it could just be sl = slurm I would run ldd `which slurmctld` and let us know which libraries it is linked to On Wed, 5 Sep 2018 at 08:51, Ge

Re: [slurm-users] [External] Re: serious bug about CUDA_VISBLE_DEVICES in the slurm 17.11.7

2018-08-30 Thread John Hearns
I also remember there being write-only permissions involved when working with cgroups and devices .. which bent my head slightly.. On Thu, 30 Aug 2018 at 17:02, John Hearns wrote: > Chaofeng, I agree with what Chris says. You should be using cgroups. > > I did a lot of work with cg

Re: [slurm-users] [External] Re: serious bug about CUDA_VISBLE_DEVICES in the slurm 17.11.7

2018-08-30 Thread John Hearns
Chaofeng, I agree with what Chris says. You should be using cgroups. I did a lot of work with cgroups and GPUs in PBSPro (yes I know... splitter!) With cgroups you only get access to the devices which are allocated to that cgroup, and you get CUDA_VISIBLE_DEVICES set for you. Remember also to lo

Re: [slurm-users] siesta jobs with slurm, an issue

2018-07-22 Thread John Hearns
Are you very sure that the filesystem with the input file is mounted on the compute nodes? Try to cat the file. On 22 July 2018 at 19:11, Mahmood Naderan wrote: > I am able to directly run the command on the node. Please note in the > following output that I have pressed ^C after some minutes. S

Re: [slurm-users] Power save doesn't start nodes

2018-07-18 Thread John Hearns
If it is any help, https://slurm.schedmd.com/sinfo.html NODE STATE CODES Node state codes are shortened as required for the field size. These node states may be followed by a special character to identify state flags associated with the node. The following node suffixes and states are used: ***

Re: [slurm-users] 'srun hostname' hangs on the command line

2018-07-17 Thread John Hearns
mpute nodes to the login nodes may be block > by the firewall. That will prevent srun from running properly > > Sent from my iPhone > > > On 17 Jul 2018, at 10:16, John Hearns wrote: > > Ronan, as far as I can see this means that you cannot launch a job. > > > >

Re: [slurm-users] 'srun hostname' hangs on the command line

2018-07-17 Thread John Hearns
/var/log ?? > > > > *From:* slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] *On > Behalf Of *John Hearns > *Sent:* Tuesday, July 17, 2018 8:57 AM > > *To:* Slurm User Community List > *Subject:* Re: [slurm-users] 'srun hostname' hangs on the command line > &g

Re: [slurm-users] 'srun hostname' hangs on the command line

2018-07-17 Thread John Hearns
Ronan, sorry to ask but this is a bit unclear. Are you unable to launch ANY sessions with srun? In which case you need to look at the logs to see why the job is not being scheduled. Is it only the hostname command which fails? I would guess very much you have already run an ssh into a node and r

Re: [slurm-users] cpu limit issue

2018-07-11 Thread John Hearns
(s): 2 > NUMA node(s): 4 > Vendor ID: AuthenticAMD > CPU family:21 > Model: 1 > Model name:AMD Opteron(tm) Processor 6282 SE > Stepping: 2 > > > Regards, > Mahmood > > > > On We

Re: [slurm-users] cpu limit issue

2018-07-11 Thread John Hearns
Mahmood, please please forgive me for saying this. A quick Google shows that Opteron 61xx have eight or twelve cores. Have you checked that all the servers have 12 cores? I realise I am appearing stupid here. On 11 July 2018 at 10:39, Mahmood Naderan wrote: > >Try running ps -eaf --fores

Re: [slurm-users] cpu limit issue

2018-07-11 Thread John Hearns
Another thought - are we getting mixed up between hyperthreaded and physical cores here? I don't see how 12 hyperthreaded cores translates to 8 though - it would be 6! On 11 July 2018 at 10:30, John Hearns wrote: > Mahmood, > I am sure you have checked this. Try running

Re: [slurm-users] cpu limit issue

2018-07-11 Thread John Hearns
Mahmood, I am sure you have checked this. Try running ps -eaf --forest while a job is running. I often find the --forest option helps to understand how batch jobs are being run. On 11 July 2018 at 09:12, Mahmood Naderan wrote: > >Check the Gaussian log file for mention of its using just

Re: [slurm-users] All user's jobs killed at the same time on all nodes

2018-07-02 Thread John Hearns
mpich-discuss/2008-May/003605.html > > My users primarily use OpenMPI, and so do not have much recent experience > with this issue. IIRC, this issue only impacted other MPI jobs running by > the same user on the same node, so a bit different than the symptoms as you > describe them (imp

Re: [slurm-users] All user's jobs killed at the same time on all nodes

2018-06-29 Thread John Hearns
em is killed, why > would all others go down as well? > > > That would make sense if a single mpirun is running 36 tasks... but the > user is not doing this. > > > From: slurm-users on behalf of > John Hearns > Sent: Friday, June 29,

Re: [slurm-users] All user's jobs killed at the same time on all nodes

2018-06-29 Thread John Hearns
Matteo, a stupid question but if these are single CPU jobs why is mpirun being used? Is your user using these 36 jobs to construct a parallel job to run charmm? If the mpirun is killed, yes all the other processes which are started by it on the other compute nodes will be killed. I suspect your u

Re: [slurm-users] SLURM PAM support?

2018-06-18 Thread John Hearns
ian, but the basics should be the same. I've placed the > script in github, if you want to try it: > https://github.com/irush-cs/slurm-scripts > > Yair. > > > On Mon, Jun 18, 2018 at 3:33 PM, John Hearns > wrote: > > Your problem is that you are listening to Lenna

Re: [slurm-users] SLURM PAM support?

2018-06-18 Thread John Hearns
Your problem is that you are listening to Lennart Poettering... I cannot answer your question directly. However I am doing work at the moment with PAM and sssd. Have a look at the directory which contains the unit files. Go on /lib/systemd/system See that nice file named -.slice Yes that file is

Re: [slurm-users] When I start slurmctld, there are some errors in log.

2018-06-15 Thread John Hearns
n't have the directory /var/spool/slurmctld/. And then I mkdir the > directory, and "chown slurm:slurm /var/spool/slurmctld". > But there is also the errors. > > 2018-06-15 16:00 GMT+08:00 John Hearns : > >> And your permissions on the directory /var/spool/slurmctld

Re: [slurm-users] When I start slurmctld, there are some errors in log.

2018-06-15 Thread John Hearns
And your permissions on the directory /var/spool/slurmctld/ are On 15 June 2018 at 09:11, UGI wrote: > When I start slurmctld, there are some errors in log. And the job running > information doesn't store to mysql via slurmdbd. > > I set > > AccountingStoragePass=/usr/local/munge-munge-0.5

Re: [slurm-users] Memory prioritization?

2018-06-13 Thread John Hearns
Matt, I back up what Loris said regarding interactive jobs. I am sorry to sound ranty here, but my experience teaches me that in cases like this you must ask why this is being desired. Hey - you are the systems expert. If you get the user to explain why they desire this functionality, it actually

Re: [slurm-users] run bash script in spank plugin

2018-06-04 Thread John Hearns
t; > during the job, I would like to run a program on the machine running the > job > but I'd like the program to keep running even after the job ends. > > 2018-06-04 15:30 GMT+02:00 John Hearns : > >> Tueur what are you trying to achieve here? The example you give is &

Re: [slurm-users] run bash script in spank plugin

2018-06-04 Thread John Hearns
Tueur, what are you trying to achieve here? The example you give is 'touch /tmp/newfile.txt' I think you are trying to send a signal to another process. Could this be 'Hey - the job has finished and there is a new file for you to process' If that is so, there may be better ways to do this. If you ha

Re: [slurm-users] Upgrade woes

2018-06-01 Thread John Hearns
Lachlan, I note that you have dropped the slurmdm and started again with an empty database. This sounds serious! The only thing I would suggest is an strace of the slurmctld I often run straces when I have a problem. They never usually tell me much but a lot of pretty text flies past on the screen.

Re: [slurm-users] Using free memory available when allocating a node to a job

2018-05-29 Thread John Hearns
nk you for your inputs. > > > > > > *De :* slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] *De la > part de* John Hearns > *Envoyé :* mardi 29 mai 2018 12:39 > *À :* Slurm User Community List > *Objet :* Re: [slurm-users] Using free memory available when all

Re: [slurm-users] Using free memory available when allocating a node to a job

2018-05-29 Thread John Hearns
=7002775 This of course is very dependent on what your environment and applications are. Would you be able to say please what the problems you are having with memory? On 29 May 2018 at 12:26, John Hearns wrote: > Alexandre, it would be helpful if you could say why this behaviour

Re: [slurm-users] Using free memory available when allocating a node to a job

2018-05-29 Thread John Hearns
Alexandre, it would be helpful if you could say why this behaviour is desirable. For instance, do you have codes which need a large amount of memory and your users are seeing that these codes are crashing because other codes running on the same nodes are using memory. I have two thoughts: A) en

Re: [slurm-users] Force run a waiting job

2018-05-28 Thread John Hearns
Alfonso, the way I would do this is to put all the user's jobs on hold, using scontrol hold Then change the limits, then release the job you want to run. This is probably not the smartest way to achieve this. On 28 May 2018 at 09:53, Pardo Diaz Alfonso wrote: > Hi, > > > > I've a list of use

Re: [slurm-users] Controller / backup controller q's

2018-05-25 Thread John Hearns
Will, I know I will regret chiming in here. Are you able to say what cluster manager or framework you are using? I don't see a problem in running two different distributions. But as Per says look at your development environment. For my part, I would ask have you thought about containerisation? ie

Re: [slurm-users] Requested partition configuration not available now

2018-05-16 Thread John Hearns
Mahmood, you should check that the slurm.conf files are identical on the head node and the compute nodes after you run the rocks sync. On 16 May 2018 at 11:07, Mahmood Naderan wrote: > Yes I did that prior to my first email. However, I thought that is > similar to the service restart bug in

Re: [slurm-users] Python and R installation in a SLURM cluster

2018-05-12 Thread John Hearns
e the home directory too! On 12 May 2018 at 22:02, John Hearns wrote: > Well I DID say that you need 'what looks like a home directory'. > So yes indeed you prove, correctly, that this works just fine! > > On 12 May 2018 at 20:17, Eric F. Alemany wrote: > >> >&g

Re: [slurm-users] Python and R installation in a SLURM cluster

2018-05-12 Thread John Hearns
, California 94305 > > Tel:1-650-498-7969 No Texting > Fax:1-650-723-7382 > > On May 12, 2018, at 00:08, John Hearns wrote: > > Eric, I'm sorry to be a little prickly here. > Each node has an independent home directory for the user? > How then do applications update

Re: [slurm-users] Python and R installation in a SLURM cluster

2018-05-12 Thread John Hearns
4305 > > Tel:1-650-498-7969 No Texting > Fax:1-650-723-7382 > > > > On May 11, 2018, at 12:56 AM, Chris Samuel wrote: > > On Friday, 11 May 2018 5:11:38 PM AEST John Hearns wrote: > > Eric, my advice would be to definitely learn the Modules system and > implement mo

Re: [slurm-users] Python and R installation in a SLURM cluster

2018-05-11 Thread John Hearns
Regarding NFS shares and Python, and plenty of other packages too, pay attention to where the NFS server is located on your network. The NFS server should be part of your cluster, or at least have a network interface on your cluster fabric. If you perhaps have a home directory server which is a ca

Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread John Hearns
"Otherwise a user can have a single job that takes the entire cluster, and inside split it up the way he wants to." Yair, I agree. That is what I was referring to regarding interactive jobs. Perhaps not a user reserving the entire cluster, but a user reserving a lot of compute nodes and not making s

Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread John Hearns
> Eventually the job aging makes the jobs so high-priority, Guess I should look in the manual, but could you increase the job ageing time parameters? I guess it is also worth saying that this is the scheduler doing its job - it is supposed to keep jobs ready and waiting to go, to keep the cluster

Re: [slurm-users] wckey specification error

2018-05-02 Thread John Hearns
xample accounting > and other things. > > > Regards, > Mahmood > > > > > > On Tue, May 1, 2018 at 9:35 PM, Cooper, Trevor wrote: > > > >> On May 1, 2018, at 2:58 AM, John Hearns wrote: > >> > >> Rocks 7 is now available, which is based on CentO

Re: [slurm-users] wckey specification error

2018-05-01 Thread John Hearns
I quickly downloaded that roll and unpacked the RPMs. I cannot quite see how Slurm is configured, so to my shame I gave up (I did say that Rocks was not my thing). On 1 May 2018 at 11:58, John Hearns wrote: > Rocks 7 is now available, which is based on CentOS 7.4 > I hate to be uncharitabl

Re: [slurm-users] wckey specification error

2018-05-01 Thread John Hearns
Rocks 7 is now available, which is based on CentOS 7.4. I hate to be uncharitable, but I am not a fan of Rocks. I speak from experience, having installed my share of Rocks clusters. The philosophy just does not fit in with the way I look at the world. Anyway, to install extra software on Rocks you

Re: [slurm-users] Slurm overhead

2018-04-26 Thread John Hearns
Mahmood, do you have Hyperthreading enabled? That may be the root cause of your problem. If you have hyperthreading, then when you start to run more than the number of PHYSICAL cores you will get over-subscription. Now, with certain workloads that is fine - that is what hyperthreading is all abou

Re: [slurm-users] Job still running after process completed

2018-04-23 Thread John Hearns
*Caedite eos. Novit enim Dominus qui sunt eius* https://en.wikipedia.org/wiki/Caedite_eos._Novit_enim_Dominus_qui_sunt_eius. I have been wanting to use that line in the context of batch systems and users for ages. At least now I can make it a play on killing processes. Rather than being put on a

Re: [slurm-users] Job still running after process completed

2018-04-23 Thread John Hearns
Nicolo, I cannot say what your problem is. However in the past with problems like this I would a) look at ps -eaf --forest. Try to see what the parent processes of these job processes are. Clearly if the parent PID is 1 then --forest is not much help. But the --forest option is my 'goto' option
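
The parent-process check described above can be sketched in Python. This is a rough illustration under stated assumptions, not a Slurm tool: the function name `reparented_processes` is hypothetical, and it assumes the input is the text produced by `ps -eo pid,ppid,user,comm` on Linux. It lists the processes owned by one user whose parent PID is 1, which is the case where `--forest` no longer shows a useful tree.

```python
def reparented_processes(ps_output, user):
    """Return (pid, command) pairs for `user` whose parent PID is 1.

    Hypothetical helper; expects the columns PID, PPID, USER, COMMAND
    as printed by `ps -eo pid,ppid,user,comm`.
    """
    found = []
    lines = ps_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue  # ignore malformed or empty rows
        pid, ppid, owner, command = fields
        if owner == user and ppid == "1":
            found.append((int(pid), command.strip()))
    return found
```

On a compute node the input could come from `subprocess.run(["ps", "-eo", "pid,ppid,user,comm"], capture_output=True, text=True).stdout`.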

Re: [slurm-users] Python code for munging hostfiles

2018-04-17 Thread John Hearns
Loris, Ole, thank you so much. That is the Python script I was thinking of. On 17 April 2018 at 11:15, Ole Holm Nielsen wrote: > On 04/17/2018 10:56 AM, John Hearns wrote: > >> Please can some kind soul remind me what the Python code for mangling >> Slurm and PBS machinefile

[slurm-users] Python code for munging hostfiles

2018-04-17 Thread John Hearns
Please can some kind soul remind me what the Python code for mangling Slurm and PBS machinefiles is called please? We discussed it here about a year ago, in the context of running Ansys. I have a Cunning Plan (TM) to recode it in Julia, for no real reason other than curiosity.

Re: [slurm-users] SLURM_JOB_NODELIST not available in prolog / epilog scripts

2018-03-05 Thread John Hearns
all) of the compute nodes > (200-202), not on the machine I launched the srun command from (001). > > John -- Yes we are heavily invested in the Trick framework and use their > Monte-Carlo feature quite extensively, in the past we've used PBS to manage > our compute nod

Re: [slurm-users] SLURM_JOB_NODELIST not available in prolog / epilog scripts

2018-03-04 Thread John Hearns
Dan, completely off topic here. May I ask what type of simulations you are running? Clearly you probably have a large investment in time in Trick. However as a fan of the Julia language let me leave this link here: https://juliaobserver.com/packages/RigidBodyDynamics On 5 March 2018 at 07:31, John

Re: [slurm-users] SLURM_JOB_NODELIST not available in prolog / epilog scripts

2018-03-04 Thread John Hearns
I completely agree with what Chris says regarding cgroups. Implement them, and you will not regret it. I have worked with other simulation frameworks, which work in a similar fashion to Trick, ie a master process which spawns off independent worker processes on compute nodes. I am thinking on an
