[slurm-users] Re: Slurm "showpartitions" tool has been updated

2025-01-17 Thread Robert Kudyba via slurm-users
Ancient, 18.08, aiming to upgrade in a couple of months. Looks like it's here: show ENTITY=ID Display the state of the specified entity with the specified identification. ENTITY may be aliases, assoc_mgr, bbstat, burstbuffer, config, daemons, dwstat, federation,
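
For reference, the quoted text is scontrol's generic "show" subcommand. A minimal sketch of using it to inspect partition state (the partition name "gpu" is only an example):
  scontrol show partition gpu   # state of one named partition
  scontrol show partition       # all partitions in the cluster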

[slurm-users] Re: errors compiling Slurm 18 on RHEL 9: [Makefile:577: scancel] Error 1 & It's not recommended to have unversioned Obsoletes

2024-09-27 Thread Robert Kudyba via slurm-users
27, 2024 at 9:41 AM Robert Kudyba via slurm-users <slurm-users@lists.schedmd.com> wrote: >> We're in the process of upgrading but first we're moving to RHEL 9. My attempt to compile using rpmbuild -v -ta --define "_lto_cflags %{nil}"

[slurm-users] errors compiling Slurm 18 on RHEL 9: [Makefile:577: scancel] Error 1 & It's not recommended to have unversioned Obsoletes

2024-09-27 Thread Robert Kudyba via slurm-users
We're in the process of upgrading, but first we're moving to RHEL 9. My attempt to compile using rpmbuild -v -ta --define "_lto_cflags %{nil}" slurm-18.08.9.tar.bz2 (H/T to Brian for this flag). I've stumped Google and the Slurm
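
For anyone following along, the full invocation being described is roughly the following; the macro override disables LTO, which is the workaround credited to Brian:
  rpmbuild -v -ta --define "_lto_cflags %{nil}" slurm-18.08.9.tar.bz2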

[slurm-users] sreport syntax for TRES/GPU usage

2024-08-16 Thread Robert Kudyba via slurm-users
In a 25-node heterogeneous cluster with 4 different types of GPUs, to get granular enough to see which GPUs were used most over a time period, we have to set AccountingStorageTRES to something like: AccountingStorageTRES=gres/gpu,gres/gpu:rtx8000,gres/gpu:v100s,gres/gpu:a40,gres/gpu:a100 Unfortunately it'
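
Once per-type GPU TRES records are being collected, a query along these lines is one way to break usage down by GPU type (the date range and the TRES names shown are illustrative):
  sreport cluster AccountUtilizationByUser start=2024-07-01 end=2024-08-01 -t Hours --tres=gres/gpu:a100,gres/gpu:v100s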

[slurm-users] Re: Slurm commands fail when run in Singularity container with the error "Invalid user for SlurmUser slurm, SINGULARITYENV_SLURM_CONF

2024-07-03 Thread Robert Kudyba via slurm-users
Thanks Ben, but there's no mention of SINGULARITYENV_SLURM_CONF on that page. Slurm is not in the container either, so we're trying to get mpirun from the host to run inside the container. On Wed, Jul 3, 2024, 11:30 AM Benjamin Smith wrote: > On 03/07/2024 16:03, Robert Kudyba vi

[slurm-users] Re: Slurm commands fail when run in Singularity container with the error "Invalid user for SlurmUser slurm, SINGULARITYENV_SLURM_CONF

2024-07-03 Thread Robert Kudyba via slurm-users
In https://support.schedmd.com/show_bug.cgi?id=9282#c6 Tim mentioned this env variable SINGULARITYENV_SLURM_CONF; what is the usage/syntax for it? I can't find any reference to it. I'm running into the same issue mentioned there. Thanks in advance!
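
For context, Singularity/Apptainer passes any host variable prefixed with SINGULARITYENV_ into the container with the prefix stripped, so the likely usage is simply exporting it before launch; the slurm.conf path and image name below are placeholders:
  export SINGULARITYENV_SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf
  singularity exec myimage.sif env | grep SLURM_CONF   # should now show that path inside the container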

[slurm-users] Re: diagnosing why interactive/non-interactive job waits are so long with State=MIXED

2024-06-05 Thread Robert Kudyba via slurm-users
> Office of Advanced Research Computing - MSB A555B, Newark > On Jun 4, 2024, at 16:03, Robert Kudyba via slurm-users <slurm-users@lists.schedmd.com> wrote: > > At the moment we have 2 nodes that are ha

[slurm-users] Re: diagnosing why interactive/non-interactive job waits are so long with State=MIXED

2024-06-04 Thread Robert Kudyba via slurm-users
i - novos...@rutgers.edu > Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus > Office of Advanced Research Computing - MSB A555B, Newark > On Jun 4, 2024, at 16:03, Robert Kudyba via slurm-users <slurm-users@lists.

[slurm-users] diagnosing why interactive/non-interactive job waits are so long with State=MIXED

2024-06-04 Thread Robert Kudyba via slurm-users
At the moment we have 2 nodes that are having long wait times. Generally this happens when the nodes are fully allocated. What other reasons would there be, if there is still enough memory and CPU available, for a job to wait so long? Slurm version is 23.02.4 via Bright Computing. Note the c
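
As a first diagnostic sketch, it helps to check what reason the scheduler itself reports for the pending job and what the node looks like in detail (the job and node IDs here are placeholders):
  squeue -j 123456 -O JobID,Partition,StateCompact,Reason,SubmitTime
  scontrol show job 123456
  scontrol show node node001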

[slurm-users] Re: any way to allow interactive jobs or ssh in Slurm 23.02 when node is draining?

2024-05-13 Thread Robert Kudyba via slurm-users
> Cheers, > Luke > -- > Luke Sudbery > Principal Engineer (HPC and Storage). > Architecture, Infrastructure and Systems > Advanced Research Computing, IT Services > Room 132, Computer Centre G5, Elms Road > *Pleas

[slurm-users] any way to allow interactive jobs or ssh in Slurm 23.02 when node is draining?

2024-04-19 Thread Robert Kudyba via slurm-users
We use Bright Cluster Manager with Slurm 23.02 on RHEL 9. I know about pam_slurm_adopt https://slurm.schedmd.com/pam_slurm_adopt.html which does not appear to come by default with the Bright 'cm' package of Slurm. Currently ssh to a node gets: Login not allowed: no running jobs and no WLM allocatio
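
For reference, once the pam_slurm_adopt module is actually built and installed, the documented wiring is a single PAM account entry; a minimal sketch (the exact file and placement depend on the distro's PAM stack):
  # /etc/pam.d/sshd, after the other account lines
  account    required    pam_slurm_adopt.so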

[slurm-users] Re: Integrating Slurm with WekaIO

2024-04-19 Thread Robert Kudyba via slurm-users
On Bright it's set in a few places: grep -r -i SLURM_CONF /etc
/etc/systemd/system/slurmctld.service.d/99-cmd.conf:Environment=SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf
/etc/systemd/system/slurmdbd.service.d/99-cmd.conf:Environment=SLURM_CONF=/cm/shared/apps/slurm/var/etc/slur

[slurm-users] Re: Integrating Slurm with WekaIO

2024-04-19 Thread Robert Kudyba via slurm-users
> Simple solution: put a copy of slurm.conf in /etc/slurm/ on the node(s). For Bright, slurm.conf is in /cm/shared/apps/slurm/var/etc/slurm, including on all nodes. Make sure $SLURM_CONF on the compute nodes resolves to the correct path. > On 4/19/2024 9:56 AM, Jeffrey Layton via slurm-users w
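
A quick sanity check on a compute node, as a sketch:
  echo $SLURM_CONF                            # what the environment points at
  scontrol show config | grep -i SLURM_CONF   # the config path scontrol reports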

[slurm-users] Re: slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of Jo

2024-02-24 Thread Robert Kudyba via slurm-users
Ah yes, thanks for pointing that out. Hope this helps someone down the line... perhaps the error detection could be more explicit in slurmctld? On Sat, Feb 24, 2024, 12:07 PM Chris Samuel via slurm-users <slurm-users@lists.schedmd.com> wrote: >> On 24/2/24 06:

[slurm-users] Re: slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of Jo

2024-02-24 Thread Robert Kudyba via slurm-users
> On 24/2/24 06:14, Robert Kudyba via slurm-users wrote: >> For now I just set it to chmod 777 on /tmp and that fixed the errors. Is there a better option? > Traditionally /tmp and /var/tmp have been 1777 (that "1" being the stick

[slurm-users] Re: slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of Jo

2024-02-24 Thread Robert Kudyba via slurm-users
> Hi Robert, > On 2/23/24 17:38, Robert Kudyba via slurm-users wrote: >> We switched over from using systemctl for tmp.mount and change to zram, e.g., modprobe zram echo 20GB > /sys/block/zram0/disksize mkfs.

[slurm-users] slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of JobId=

2024-02-23 Thread Robert Kudyba via slurm-users
We switched over from using systemctl for tmp.mount and changed to zram, e.g.:
modprobe zram
echo 20GB > /sys/block/zram0/disksize
mkfs.xfs /dev/zram0
mount -o discard /dev/zram0 /tmp
srun with --x11 was working before changing this. We're on RHEL 9. slurmctld logs show this whenever --x11 is used
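
Given where the replies end up (standard /tmp permissions are 1777, with the sticky bit), the step most likely missing after formatting the zram device is restoring that mode, since a freshly made filesystem mounts as a root-owned 0755 directory; a hedged version of the setup:
  modprobe zram
  echo 20GB > /sys/block/zram0/disksize
  mkfs.xfs /dev/zram0
  mount -o discard /dev/zram0 /tmp
  chmod 1777 /tmp   # restore world-writable-with-sticky-bit so slurmstepd/xauth can write under /tmp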