[slurm-users] Re: Slurm "showpartitions" tool has been updated

2025-01-17 Thread Robert Kudyba via slurm-users
The showpartitions tool compacts the list of nodenames using "scontrol show hostlistsorted". Can you please check your scontrol manual page if "hostlistsorted" is documented? > Does the command work without -N? > Thanks, > Ole > On 17-01-2025 19:54, R

[slurm-users] Re: errors compiling Slurm 18 on RHEL 9: [Makefile:577: scancel] Error 1 & It's not recommended to have unversioned Obsoletes

2024-09-27 Thread Robert Kudyba via slurm-users
27, 2024 at 9:41 AM Robert Kudyba via slurm-users < > slurm-users@lists.schedmd.com> wrote: > >> We're in the process of upgrading but first we're moving to RHEL 9. My >> attempt to compile using rpmbuild -v -ta --define "_lto_cflags %{nil}"

[slurm-users] errors compiling Slurm 18 on RHEL 9: [Makefile:577: scancel] Error 1 & It's not recommended to have unversioned Obsoletes

2024-09-27 Thread Robert Kudyba via slurm-users
We're in the process of upgrading but first we're moving to RHEL 9. My attempt to compile using rpmbuild -v -ta --define "_lto_cflags %{nil}" slurm-18.08.9.tar.bz2 (H/T to Brian for this flag). I've stumped Google and the Slurm
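A minimal sketch of the build command from this thread; dropping the LTO flags is a common workaround for link-time failures with the newer GCC on RHEL 9, and the "unversioned Obsoletes" messages are rpmbuild warnings about the bundled slurm.spec rather than fatal errors:

    # Rebuild the Slurm RPMs from the release tarball with LTO disabled:
    rpmbuild -v -ta --define "_lto_cflags %{nil}" slurm-18.08.9.tar.bz2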

[slurm-users] sreport syntax for TRES/GPU usage

2024-08-16 Thread Robert Kudyba via slurm-users
In a 25 node heterogeneous cluster with 4 different types of GPUs, to get granular to see which GPUs were used most over a time period we have to set AccountingStorageTRES to something like: AccountingStorageTRES=gres/gpu,gres/gpu:rtx8000,gres/gpu:v100s,gres/gpu:a40,gres/gpu:a100 Unfortunately it'
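For reference, a sketch of how the pieces fit together: the per-type GPU TRES must be listed in AccountingStorageTRES (as above) before usage is recorded, and sreport can then be filtered on those TRES with -T/--tres. The report type, TRES list, and date range below are only examples.

    # slurm.conf
    AccountingStorageTRES=gres/gpu,gres/gpu:rtx8000,gres/gpu:v100s,gres/gpu:a40,gres/gpu:a100

    # GPU-hours per account/user, broken out by GPU type:
    sreport -t Hours -T gres/gpu,gres/gpu:a100,gres/gpu:v100s \
        cluster AccountUtilizationByUser Start=2024-07-01 End=2024-08-01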

[slurm-users] Re: Slurm commands fail when run in Singularity container with the error "Invalid user for SlurmUser slurm, SINGULARITYENV_SLURM_CONF

2024-07-03 Thread Robert Kudyba via slurm-users
Thanks Ben but there's no mention of SINGULARITYENV_SLURM_CONF in that page. Slurm is not in the container either so we're trying to get mpirun from the host to run inside the container. On Wed, Jul 3, 2024, 11:30 AM Benjamin Smith wrote: > On 03/07/2024 16:03, Robert Kudyba vi

[slurm-users] Re: Slurm commands fail when run in Singularity container with the error "Invalid user for SlurmUser slurm, SINGULARITYENV_SLURM_CONF

2024-07-03 Thread Robert Kudyba via slurm-users
In https://support.schedmd.com/show_bug.cgi?id=9282#c6 Tim mentioned this env variable SINGULARITYENV_SLURM_CONF, what is the usage/syntax for it? I can't find any reference to this. I'm running into the same issue mentioned there. Thanks in advance! -- slurm-users mailing list -- slurm-users@li
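For what it's worth, the SINGULARITYENV_ prefix is Singularity/Apptainer's generic mechanism for passing environment variables into a container (the prefix is stripped inside), so the variable is presumably used along the lines of the sketch below; the paths and image name are placeholders.

    # On the host: make SLURM_CONF inside the container point at a bind-mounted copy
    export SINGULARITYENV_SLURM_CONF=/etc/slurm/slurm.conf
    singularity exec --bind /etc/slurm my_image.sif env | grep SLURM_CONF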

[slurm-users] Re: diagnosing why interactive/non-interactive job waits are so long with State=MIXED

2024-06-05 Thread Robert Kudyba via slurm-users
> -- > > #BlackLivesMatter > > > > || \\UTGERS, > |---*O*--- > > ||_// the State | Ryan Novosielski - novos...@rutgers.edu > > || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS > Campus >

[slurm-users] Re: diagnosing why interactive/non-interactive job waits are so long with State=MIXED

2024-06-04 Thread Robert Kudyba via slurm-users
i - novos...@rutgers.edu > || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus > || \\of NJ | Office of Advanced Research Computing - MSB > A555B, Newark > `' > > On Jun 4, 2024, at 16:03, Robert Kudyba via slurm-users < > slurm-users@lists.

[slurm-users] diagnosing why interactive/non-interactive job waits are so long with State=MIXED

2024-06-04 Thread Robert Kudyba via slurm-users
At the moment we have 2 nodes that are having long wait times. Generally this is when the nodes are fully allocated. What would be the other reasons if there is still enough available memory and CPU available, that a job would take so long? Slurm version is 23.02.4 via Bright Computing. Note the c
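A few commands that usually narrow this down (the node name and format string are just examples): the Reason column on pending jobs, their relative priorities, and the allocation counters on the suspect nodes.

    # Why are jobs pending? (Resources, Priority, Dependency, QOS limits, ...)
    squeue -t PENDING -o "%.10i %.9P %.8u %.2t %.10M %.6D %R"
    # Relative priorities of the pending jobs:
    sprio -l
    # What is actually allocated vs. configured on a suspect node:
    scontrol show node node001 | grep -E "State|CPUAlloc|CPUTot|RealMemory|AllocMem"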

[slurm-users] Re: any way to allow interactive jobs or ssh in Slurm 23.02 when node is draining?

2024-05-13 Thread Robert Kudyba via slurm-users
> > Cheers, > > > > Luke > > > > -- > > Luke Sudbery > > Principal Engineer (HPC and Storage). > > Architecture, Infrastructure and Systems > > Advanced Research Computing, IT Services > > Room 132, Computer Centre G5, Elms Road > > > > *Pleas

[slurm-users] any way to allow interactive jobs or ssh in Slurm 23.02 when node is draining?

2024-04-19 Thread Robert Kudyba via slurm-users
We use Bright Cluster Manager with Slurm 23.02 on RHEL9. I know about pam_slurm_adopt https://slurm.schedmd.com/pam_slurm_adopt.html which does not appear to come by default with the Bright 'cm' package of Slurm. Currently ssh to a node gets: Login not allowed: no running jobs and no WLM allocatio
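If the module has to be built by hand (Bright's package apparently does not ship it), a rough sketch is below; the PAM placement and build steps are assumptions, so treat the pam_slurm_adopt page as the authoritative reference for the full stack.

    # From an already-configured Slurm source tree:
    cd contribs/pam_slurm_adopt && make && make install

    # /etc/pam.d/sshd on the compute nodes -- only users with a job on that node get in:
    account    required    pam_slurm_adopt.so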

[slurm-users] Re: Integrating Slurm with WekaIO

2024-04-19 Thread Robert Kudyba via slurm-users
On Bright it's set in a few places: grep -r -i SLURM_CONF /etc /etc/systemd/system/slurmctld.service.d/99-cmd.conf:Environment=SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf /etc/systemd/system/slurmdbd.service.d/99-cmd.conf:Environment=SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slur

[slurm-users] Re: Integrating Slurm with WekaIO

2024-04-19 Thread Robert Kudyba via slurm-users
> > Simple solution: put a copy of slurm.conf in /etc/slurm/ on the node(s). > For Bright slurm.conf is in /cm/shared/apps/slurm/var/etc/slurm including on all nodes. Make sure on the compute nodes $SLURM_CONF resolves to the correct path. > On 4/19/2024 9:56 AM, Jeffrey Layton via slurm-users w
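For the Bright layout mentioned above, that check looks something like this on a compute node:

    echo $SLURM_CONF
    # If unset or pointing elsewhere, aim clients and slurmd at the shared copy:
    export SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf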

[slurm-users] Re: slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of Jo

2024-02-24 Thread Robert Kudyba via slurm-users
already completing or completed [465.extern] error: common_file_write_content: unable to open '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_463/step_extern/user/cgroup.freeze' for writing: Permission denied On Sat, Feb 24, 2024 at 12:09 PM Robert Kudyba wrote: > << &g

[slurm-users] Re: slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of Jo

2024-02-24 Thread Robert Kudyba via slurm-users
<< wrote: > On 24/2/24 06:14, Robert Kudyba via slurm-users wrote: > > > For now I just set it to chmod 777 on /tmp and that fixed the errors. Is > > there a better option? > > Traditionally /tmp and /var/tmp have been 1777 (that "1" being the > stick

[slurm-users] Re: slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of Jo

2024-02-24 Thread Robert Kudyba via slurm-users
<< wrote: > Hi Robert, > > On 2/23/24 17:38, Robert Kudyba via slurm-users wrote: > > > We switched over from using systemctl for tmp.mount and change to zram, > > e.g., > > modprobe zram > > echo 20GB > /sys/block/zram0/disksize > > mkfs.

[slurm-users] slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of JobId=

2024-02-23 Thread Robert Kudyba via slurm-users
We switched over from using systemctl for tmp.mount and changed to zram, e.g., modprobe zram echo 20GB > /sys/block/zram0/disksize mkfs.xfs /dev/zram0 mount -o discard /dev/zram0 /tmp srun with --x11 was working before changing this. We're on RHEL 9. slurmctld logs show this whenever --x11 is used
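A sketch of the zram-backed /tmp setup from this thread, with the permission fix the follow-ups converge on (sticky-bit 1777 rather than a blanket 777); the freshly created filesystem presumably mounts with a root-owned 0755 root directory, which is what breaks X11/xauth for ordinary users.

    modprobe zram
    echo 20G > /sys/block/zram0/disksize
    mkfs.xfs /dev/zram0
    mount -o discard /dev/zram0 /tmp
    # mkfs leaves the new root dir 0755/root:root; restore world-writable + sticky bit:
    chmod 1777 /tmp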

[slurm-users] Why is Slurm 20 the latest RPM in RHEL 8/Fedora repo?

2024-01-29 Thread Robert Kudyba
According to these links: https://rpmfind.net/linux/rpm2html/search.php?query=slurm https://src.fedoraproject.org/rpms/slurm Why doesn't RHEL 8 get a newer version? Can someone update the repo maintainer Philip Kovacs < pk...@fedoraproject.org>? There was

[slurm-users] JobState of RaisedSignal:53 Real-time_signal_19; slurm 23.02.4

2023-11-10 Thread Robert Kudyba
The user is launching a Singularity container for RStudio and the final option for --rsession-path does not exist. scontrol show job 420719 JobId=420719 JobName=r2.sbatch UserId=ouruser(552199) GroupId=user(500) MCS_label=N/A Priority=1428 Nice=0 Account=ouracct QOS=xxx JobState=FAILED Reaso

[slurm-users] Slurm 20.11.3, Suspended new connections while processing backlog filled /

2021-03-10 Thread Robert Kudyba
I see there is this exact issue https://githubmemory.com/repo/dun/munge/issues/94. We are on Slurm 20.11.3 on Bright Cluster 8.1 on Centos 7.9 I found hundreds of these logs in slurmctld: error: slurm_accept_msg_conn: Too many open files in system Then in munged.log: Suspended new connections whi
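"Too many open files in system" points at the system-wide fs.file-max limit rather than a per-process one, so both knobs are worth checking; the numbers below are arbitrary examples, not recommendations.

    # System-wide limit (persist it in /etc/sysctl.d/ once a value is chosen):
    sysctl fs.file-max
    sysctl -w fs.file-max=1000000

    # Per-daemon limit for munge/slurmctld via a systemd drop-in, e.g.
    # /etc/systemd/system/munge.service.d/limits.conf:
    #   [Service]
    #   LimitNOFILE=16384
    systemctl daemon-reload && systemctl restart munge slurmctld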

[slurm-users] Slurm upgrade to 20.11.3, slurmdbd still trying to start old version 20.02.3

2021-03-03 Thread Robert Kudyba
Slurmdbd has an issue and from the logs is still trying to load the old version: [2021-01-22T14:17:18.430] MySQL server version is: 5.5.68-MariaDB [2021-01-22T14:17:18.433] error: Database settings not recommended values: innodb_buffer_pool_size innodb_log_file_size innodb_lock_wait_timeout [2021-0
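The "Database settings not recommended values" line refers to the InnoDB tuning the Slurm accounting docs ask for; something along these lines in the MariaDB config (the values are the commonly cited starting points, adjust to the host) silences it after a database restart.

    # /etc/my.cnf.d/innodb.cnf
    [mysqld]
    innodb_buffer_pool_size=4096M
    innodb_log_file_size=64M
    innodb_lock_wait_timeout=900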

Re: [slurm-users] exempting a node from Gres Autodetect

2021-02-19 Thread Robert Kudyba
have you seen this? https://bugs.schedmd.com/show_bug.cgi?id=7919#c7, fixed in 20.06.1 On Fri, Feb 19, 2021 at 11:34 AM Paul Brunk wrote: > Hi all: > > (I hope plague and weather are being visibly less than maximally cruel > to you all.) > > In short, I was trying to exempt a node from NVML Auto

Re: [slurm-users] Building Slurm RPMs with NVIDIA GPU support?

2021-01-26 Thread Robert Kudyba
You all might be interested in a patch to the SPEC file, to not make the slurm RPMs depend on libnvidia-ml.so, even if it's been enabled at configure time. See https://bugs.schedmd.com/show_bug.cgi?id=7919#c3 On Tue, Jan 26, 2021 at 3:17 PM Paul Raines wrote: > > You should check your jobs that

Re: [slurm-users] Exclude Slurm packages from the EPEL yum repository

2021-01-26 Thread Robert Kudyba
On Mon, Jan 25, 2021 at 6:36 PM Brian Andrus wrote: > Also, a plug for support contracts. I have been doing slurm for a very > long while, but always encourage my clients to get a support contract. > That is how SchedMD stays alive and we are able to have such a good > piece of software. I see th
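The usual way to keep the EPEL packages from shadowing a site build is an exclude in the repo definition, roughly:

    # /etc/yum.repos.d/epel.repo
    [epel]
    ...
    excludepkgs=slurm*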

Re: [slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120

2020-12-02 Thread Robert Kudyba
> > been having the same issue with BCM, CentOS 8.2 BCM 9.0 Slurm 20.02.3. It > seems to have started to occur when I enabled proctrack/cgroup and changed > select/linear to select/cons_tres. > Our slurm.conf has the same setting: SelectType=select/cons_tres SelectTypeParameters=CR_CPU SchedulerTime

Re: [slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120

2020-11-30 Thread Robert Kudyba
This happens due to laggy storage the job is using taking time flushing the job's data. So making sure that your storage is up, responsive, and stable will also cut these down. > -Paul Edmon- > On 11/30/2020 12:52 PM, Robert Kudyba wrote: > > I've

[slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120

2020-11-30 Thread Robert Kudyba
I've seen where this was a bug that was fixed https://bugs.schedmd.com/show_bug.cgi?id=3941 but this happens occasionally still. A user cancels his/her job and a node gets drained. UnkillableStepTimeout=120 is set in slurm.conf Slurm 20.02.3 on Centos 7.9 running on Bright Cluster 8.2 Slurm Job_i
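The relevant slurm.conf knobs look roughly like this; 180 seconds is only an example value and the script path is a placeholder:

    # slurm.conf
    UnkillableStepTimeout=180
    # Optionally capture state (stuck I/O, D-state processes, ...) when it still fires:
    UnkillableStepProgram=/usr/local/sbin/report_unkillable.sh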

[slurm-users] MPS Count option clarification and TensorFlow 2/PyTorch greediness causing out of memory OOMs

2020-08-25 Thread Robert Kudyba
Comparing the Slurm MPS configuration example here, our gres.conf has this: NodeName=node[001-003] Name=mps Count=400 What does "Count" really mean and how do you use this number? From that web page
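As a sketch of how the pieces fit together (per the gres.conf/MPS documentation the Count is divided evenly across the node's GPUs, so with a single GPU per node the whole 400 belongs to that GPU, and jobs consume it in those units; the job script name below is a placeholder):

    # gres.conf
    NodeName=node[001-003] Name=mps Count=400

    # Request a quarter of a 400-share GPU:
    sbatch --gres=mps:100 my_mps_job.sbatch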

[slurm-users] configure Slurm when disk quota exceeded

2020-08-04 Thread Robert Kudyba
Is there a way for Slurm to detect when a user quota has been exceeded? We use XFS and when users are over the quota they will get a "Disk quota exceeded" message, e.g., when trying to scp or create a new file. However if they are not aware of this and try using a sbatch file, they don't receive an

[slurm-users] TensorRT script runs with srun but not from a sbatch file

2020-04-29 Thread Robert Kudyba
I'm using this TensorRT tutorial with MPS on Slurm 20.02 on Bright Cluster 8.2. Here are the contents of my mpsmovietest sbatch file: #!/bin/bash #SBATCH --nodes=1 #SBATCH --job-name=MPSMovieTest #SBATCH --gres=g
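The preview cuts the file off; a plausible reconstruction is below, where the gres line, module name, and test binary are guesses rather than the original contents.

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --job-name=MPSMovieTest
    #SBATCH --gres=gpu:1            # or --gres=mps:<shares> when sharing via MPS
    #SBATCH --output=mpsmovietest.%j.out
    module load cuda                # module name is an assumption
    srun ./movie_test               # placeholder for the TensorRT sample binary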

Re: [slurm-users] [External] slurmd: error: Node configuration differs from hardware: CPUs=24:48(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw)

2020-04-23 Thread Robert Kudyba
On Thu, Apr 23, 2020 at 1:43 PM Michael Robbert wrote: > It looks like you have hyper-threading turned on, but haven’t defined the > ThreadsPerCore=2. You either need to turn off Hyper-threading in the BIOS > or changed the definition of ThreadsPerCore in slurm.conf. > Nice find. node003 has hyp
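With 2 sockets x 12 cores x 2 threads, the node definition would look something like the sketch below; slurmd -C prints the detected layout in slurm.conf form if in doubt.

    # Print the hardware as slurmd sees it (paste-able into slurm.conf):
    slurmd -C
    # slurm.conf -- describe the hyper-threaded layout so it matches the probe:
    NodeName=node003 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2
    # ...or keep ThreadsPerCore=1 and disable hyper-threading in the BIOS instead.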

[slurm-users] slurmd: error: Node configuration differs from hardware: CPUs=24:48(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw)

2020-04-23 Thread Robert Kudyba
Running Slurm 20.02 on Centos 7.7 on Bright Cluster 8.2. slurm.conf is on the head node. I don't see these errors on the other 2 nodes. After restarting slurmd on node003 I see this: slurmd[400766]: error: Node configuration differs from hardware: CPUs=24:48(hw) Boards=1:1(hw) SocketsPerBoard=2:2(

[slurm-users] srun always uses node002 even using --nodelist=node001

2020-04-16 Thread Robert Kudyba
I'm using this TensorRT tutorial with MPS on Slurm 20.02 on Bright Cluster 8.2. I'm trying to use srun to test this but it always fails as it appears to be trying all nodes. We only have 3 compute nodes. As I'm w

Re: [slurm-users] Header lengths are longer than data received after changing SelectType & GresTypes to use MPS

2020-04-08 Thread Robert Kudyba
> > > use yum install slurm20, here they show Slurm 19 but it's the same for 20 > > In that case you'll need to open a bug with Bright to get them to > rebuild Slurm with nvml support. They told me they don't officially support MPS nor Slurm and to come here to get support (or pay SchedMD). The

Re: [slurm-users] Header lengths are longer than data received after changing SelectType & GresTypes to use MPS

2020-04-08 Thread Robert Kudyba
> > > and the NVIDIA Management Library (NVML) is installed on the node and >> > was found during Slurm configuration >> >> That's the key phrase - when whoever compiled Slurm ran ./configure >> *before* compilation it was on a system without the nvidia libraries and >> headers present, so Slurm co

Re: [slurm-users] Header lengths are longer than data received after changing SelectType & GresTypes to use MPS

2020-04-08 Thread Robert Kudyba
On Wed, Apr 8, 2020 at 9:34 AM wrote: > I believe in order to compile for nvml you'll have to compile on a system > with an Nvidia gpu installed otherwise the Nvidia driver and libraries > won't install on that system. > Yes our 3 compute nodes have 1 V100 each. So I can run: ssh node001 Last lo

Re: [slurm-users] Header lengths are longer than data received after changing SelectType & GresTypes to use MPS

2020-04-08 Thread Robert Kudyba
On Wed, Apr 8, 2020 at 10:23 AM Eric Berquist wrote: > I just ran into this issue. Specifically, SLURM looks for the NVML header > file, which comes with CUDA or DCGM, in addition to the library that comes > with the drivers. The check is at > https://github.com/SchedMD/slurm/blob/a763a008b770032

Re: [slurm-users] Header lengths are longer than data received after changing SelectType & GresTypes to use MPS

2020-04-07 Thread Robert Kudyba
> Apr 07 16:52:33 node001 slurmd[299181]: fatal: We were configured to > autodetect nvml functionality, but we weren't able to find that lib when > Slurm was configured. > > > > Apparently the Slurm build you are using has not be compiled against NVML > and as such it cannot use the autodetect func

Re: [slurm-users] Header lengths are longer than data received after changing SelectType & GresTypes to use MPS

2020-04-07 Thread Robert Kudyba
> *Computer Scientist* > > BioHPC – Lyda Hill Dept. of Bioinformatics > > UT Southwestern Medical Center > > > > *From:* slurm-users *On Behalf Of > *Robert Kudyba > *Sent:* Tuesday, April 7, 2020 3:26 PM > *To:* Slurm User Community List > *Subject:* [slu

[slurm-users] Header lengths are longer than data received after changing SelectType & GresTypes to use MPS

2020-04-07 Thread Robert Kudyba
Using Slurm 20.02 on CentOS 7.7 with Bright Cluster. We changed the following options to enable MPS: SelectType=select/cons_tres GresTypes=gpu,mic,mps I restarted slurmctld and ran scontrol reconfigure, however all jobs get the below error: [2020-04-07T15:29:00.741] debug: backfill: no jobs to b

[slurm-users] PyTorch with Slurm and MPS work-around --gres=gpu:1?

2020-04-03 Thread Robert Kudyba
Running Slurm 20.02 on Centos 7.7 with Bright Cluster 8.2. I'm wondering how the below sbatch file is sharing a GPU. MPS is running on the head node: ps -auwx|grep mps root 108581 0.0 0.0 12780 812 ?Ssl Mar23 0:27 /cm/local/apps/cuda-driver/libs/440.33.01/bin/nvidia-cuda-mps-co

[slurm-users] Fwd: gres/gpu: count changed for node node002 from 0 to 1

2020-03-14 Thread Robert Kudyba
labels = labels.to(device) outputs = net(images) _, predicted = torch.max(outputs.data, 1) total += labels.size(0)

[slurm-users] gres/gpu: count changed for node node002 from 0 to 1

2020-03-13 Thread Robert Kudyba
We're running slurm-17.11.12 on Bright Cluster 8.1 and our node002 keeps going into a draining state: sinfo -a PARTITION AVAIL TIMELIMIT NODES STATE NODELIST defq* up infinite 1 drng node002 sinfo -N -o "%.20N %.15C %.10t %.10m %.15P %.15G %.35E" NODELIST CPUS(A/I/

Re: [slurm-users] Slurm 17.11 and configuring backfill and oversubscribe to allow concurrent processes

2020-02-27 Thread Robert Kudyba
Since our GPU nodes are over-provisioned in terms of both RAM and CPU, we end up using the excess resources for non-GPU jobs. > If that 32 GB is GPU RAM, then I have no experience with that, but I suspect MPS would be required. > On Feb 27, 2020, at 11:14 AM, Robert K

Re: [slurm-users] Slurm 17.11 and configuring backfill and oversubscribe to allow concurrent processes

2020-02-27 Thread Robert Kudyba
t share nodes between jobs". This parameter conflicts with the "OverSubscribe=FORCE:12" parameter. According to the slurm documentation, the Shared parameter has been replaced by the OverSubscribe parameter. But, I suppose it still works. > Regards, > Ahmet M. >

[slurm-users] Slurm 17.11 and configuring backfill and oversubscribe to allow concurrent processes

2020-02-26 Thread Robert Kudyba
We run Bright 8.1 and Slurm 17.11. We are trying to allow for multiple concurrent jobs to run on our small 4 node cluster. Based on https://community.brightcomputing.com/question/5d6614ba08e8e81e885f1991?action=artikel&cat=14&id=410&artlang=en&highlight=slurm+%2526%252334%253Bgang+scheduling%2526%

Re: [slurm-users] Slurm version 20.02.0 is now available

2020-02-25 Thread Robert Kudyba
I suppose I can ask Bright Computing but does anyone know what version of Bright is needed? I would guess 8.2 or 9.0. Definitely want to dive into this.

Re: [slurm-users] Node appears to have a different slurm.conf than the slurmctld; update_node: node reason set to: Kill task failed

2020-02-11 Thread Robert Kudyba
NodeCnt=1 done On Tue, Feb 11, 2020 at 11:54 AM Robert Kudyba wrote: > Usually means you updated the slurm.conf but have not done "scontrol >> reconfigure" yet. >> > Well it turns out it was something else related to a Bright Computing > setting. In case anyone

Re: [slurm-users] Node appears to have a different slurm.conf than the slurmctld; update_node: node reason set to: Kill task failed

2020-02-11 Thread Robert Kudyba
category % use gpucategory % roles % use slurmclient % set realmemory 191846 % commit The value in /etc/slurm/slurm.conf was conflicting with this especially when restarting slurmctld. On 2/10/2020 8:55 AM, Robert Kudyba wrote: > > We are using Bright Cluster 8.1 with and just upgraded to slurm-17.11

[slurm-users] Node appears to have a different slurm.conf than the slurmctld; update_node: node reason set to: Kill task failed

2020-02-10 Thread Robert Kudyba
We are using Bright Cluster 8.1 with and just upgraded to slurm-17.11.12. We're getting the below errors when I restart the slurmctld service. The file appears to be the same on the head node and compute nodes: [root@node001 ~]# ls -l /cm/shared/apps/slurm/var/etc/slurm.conf -rw-r--r-- 1 root roo

Re: [slurm-users] Node node00x has low real_memory size & slurm_rpc_node_registration node=node003: Invalid argument

2020-01-21 Thread Robert Kudyba
> > > are you sure, your 24 core nodes have 187 TERABYTES memory? > > As you yourself cited: > > Size of real memory on the node in megabytes > > The settings in your slurm.conf: > > NodeName=node[001-003] CoresPerSocket=12 RealMemory=196489092 Sockets=2 > Gres=gpu:1 > > so, your machines should h
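In other words, RealMemory is in megabytes, so the value should come from free -m (elsewhere in these threads the nodes settle on roughly 191846-191879 MB), e.g.:

    free -m
    # slurm.conf
    NodeName=node[001-003] CoresPerSocket=12 Sockets=2 Gres=gpu:1 RealMemory=191846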

Re: [slurm-users] Node node00x has low real_memory size & slurm_rpc_node_registration node=node003: Invalid argument

2020-01-20 Thread Robert Kudyba
hed to the specific info in your main > config > > Brian Andrus > > > On 1/20/2020 10:37 AM, Robert Kudyba wrote: > > I've posted about this previously here > <https://urldefense.proofpoint.com/v2/url?u=https-3A__groups.google.com_forum_-23

[slurm-users] Node node00x has low real_memory size & slurm_rpc_node_registration node=node003: Invalid argument

2020-01-20 Thread Robert Kudyba
I've posted about this previously here , and here so I'm trying to get to

Re: [slurm-users] sbatch tasks stuck in queue when a job is hung

2019-08-30 Thread Robert Kudyba
> Your Bright manual may have a similar process for updating SLURM config "the Bright way". > On Thu, Aug 29, 2019 at 12:20 PM Robert Kudyba wrote: >> I thought I had taken care of this a while back

Re: [slurm-users] sbatch tasks stuck in queue when a job is hung

2019-08-29 Thread Robert Kudyba
I thought I had taken care of this a while back but it appears the issue has returned. A very simple sbatch slurmhello.sh: cat slurmhello.sh #!/bin/sh #SBATCH -o my.stdout #SBATCH -N 3 #SBATCH --ntasks=16 module add shared openmpi/gcc/64/1.10.7 slurm mpirun hello sbatch slurmhello.sh Submitted b
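Reassembled from the flattened preview, the batch script is essentially:

    #!/bin/sh
    #SBATCH -o my.stdout
    #SBATCH -N 3                 # asks for all 3 compute nodes...
    #SBATCH --ntasks=16          # ...and 16 MPI ranks across them
    module add shared openmpi/gcc/64/1.10.7 slurm
    mpirun hello

Note that -N 3 on a 3-node cluster can only start once every node has free resources, which by itself makes such a test job queue behind any hung or busy node.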

[slurm-users] JobState=FAILED Reason=NonZeroExitCode Dependency=(null) ExitCode=1:0

2019-07-09 Thread Robert Kudyba
From this tutorial https://www.brightcomputing.com/blog/bid/174099/slurm-101-basic-slurm-usage-for-linux-clusters I am trying to run the below and it always fails. I've made sure to run 'module load slurm'. What could be wrong? Logs from slurmctld show ok: [2019-07-09T10:19:44.183] prolog_running_

Re: [slurm-users] sbatch tasks stuck in queue when a job is hung

2019-07-08 Thread Robert Kudyba
suggest RealMemory=191879, where I suspect you have RealMemory=196489092 > Brian Andrus > On 7/8/2019 11:59 AM, Robert Kudyba wrote: >> I'm new to Slurm and we have a 3 node + head node cluster running Centos 7 and Bright Cluster 8.1. Their support sent me here

[slurm-users] sbatch tasks stuck in queue when a job is hung

2019-07-08 Thread Robert Kudyba
I’m new to Slurm and we have a 3 node + head node cluster running Centos 7 and Bright Cluster 8.1. Their support sent me here as they say Slurm is configured optimally to allow multiple tasks to run. However at times a job will hold up new jobs. Are there any other logs I can look at and/or sett

[slurm-users] Where to adjust the memory limit from sinfo vs free command?

2019-05-16 Thread Robert Kudyba
The MEMORY limit here shows 1, which I believe is 1 MB? But the results of the free command clearly show we have more than that. Where is this configured? sinfo -lNe Thu May 16 16:41:23 2019 NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON

[slurm-users] Myload script from Slurm Gang Scheduling tutorial

2019-05-16 Thread Robert Kudyba
Hello, Can anyone share the myload script referenced in https://slurm.schedmd.com/gang_scheduling.html Would like to test this on our Bright Cluster running Slurm now as the workload manager and allowing multiple jobs to run concurrently. Than
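The script itself does not appear to ship with Slurm; a minimal stand-in for its role in the tutorial (keep the allocated CPUs busy for a given number of seconds so time-slicing becomes visible) might look like:

    #!/bin/bash
    # myload <seconds> -- spin one busy loop per CPU in the allocation, then exit.
    SECS=${1:-300}
    for ((i = 0; i < ${SLURM_CPUS_ON_NODE:-1}; i++)); do
        timeout "${SECS}s" sh -c 'while :; do :; done' &
    done
    wait

Submitted repeatedly (e.g. sbatch -n16 myload 300) it should reproduce the oversubscription behaviour the gang scheduling page describes.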