[slurm-users] Shard conf weirdness

2025-02-24 Thread Reed Dier via slurm-users
Hoping someone can help me pin down the weirdness I’m experiencing. There are actually two issues I’ve run into: the root issue, and then something odd when trying to work around it. v23.11.10 - Ubuntu 22.04 - slurm-smd debs built

[slurm-users] Re: Using sharding

2024-07-05 Thread Reed Dier via slurm-users
I would try specifying cpus and mem just to be sure it’s not requesting 0/all. Also, I was running into a weird issue in my lab cluster when I had oversubscribe=yes:2 while playing with shards, where jobs would go pending on resources despite no alloc of gpu/shards. Once I reverted

[slurm-users] Re: File-less NVIDIA GeForce 4070 Ti being removed from GRES list

2024-04-02 Thread Reed Dier via slurm-users
Assuming that you have the cuda drivers installed correctly (nvidia-smi for instance), you should create a gres.conf with just this line: > AutoDetect=nvml If that doesn’t automagically begin working, you can increase the verbosity of slurmd with > SlurmdDebug=debug2 It should then print a bu

[slurm-users] Re: GPU shards not exclusive

2024-02-29 Thread Reed Dier via slurm-users
Hi Will, I appreciate your corroboration. After we upgraded to 23.02.$latest, it seemed to make it easier to reproduce than before. However, the issue appears to have subsided, and the only change I can potentially attribute it to was after turning on > SlurmctldParameters=rl_enable in slurm.c

[slurm-users] GPU shards not exclusive

2024-02-14 Thread Reed Dier via slurm-users
I seem to have run into an edge case where I’m able to oversubscribe a specific subset of GPUs on one host in particular. Slurm 22.05.8 Ubuntu 20.04 cgroups v1 (ProctrackType=proctrack/cgroup) It seems to be partly a corner case with a couple of caveats. This host has 2 different GPU types in th

[slurm-users] sinfo gresused shard count wrong/incomplete

2024-02-07 Thread Reed Dier via slurm-users
I have a bash script that grabs current statistics from sinfo to ship into a time series database to use for Grafana dashboards. We recently began using shards with our gpus, and I’m seeing some unexpected behavior with the data reported from sinfo. > $ sinfo -h -O "NodeHost:5 ,GresUsed:100 ,Gr
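For scripts like the one described in this post, the GresUsed column has to be split carefully, because the per-device detail in parentheses can itself contain commas. A minimal Python sketch, assuming a field shaped like `gpu:t4:2(IDX:0-1),shard:t4:4(0/4,4/4)`; that sample string and the function names are illustrative assumptions, not output taken from the thread:

```python
import re


def split_gres(field: str) -> list[str]:
    """Split a GRES field on commas that are not inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in field:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def gres_counts(field: str) -> dict[str, int]:
    """Sum counts per GRES name, ignoring the type and the (...) detail."""
    totals: dict[str, int] = {}
    for item in split_gres(field.strip()):
        # Drop a trailing "(...)" block, then read name[:type]:count.
        head = re.sub(r"\(.*\)$", "", item)
        pieces = head.split(":")
        if len(pieces) < 2 or not pieces[-1].isdigit():
            continue  # e.g. "(null)" or a shape this sketch does not handle
        totals[pieces[0]] = totals.get(pieces[0], 0) + int(pieces[-1])
    return totals


# Hypothetical sample field, assumed for illustration only.
sample = "gpu:t4:2(IDX:0-1),shard:t4:4(0/4,4/4)"
counts = gres_counts(sample)
```

Comparing the summed `shard` count against what jobs actually hold is one way to spot the kind of mismatch reported here.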

[slurm-users] Socket timed out - tuning

2024-01-29 Thread Reed Dier
Hoping someone can help point me towards some tweaks to help prevent denial-of-service issues. > sbatch: error: Batch job submission failed: Socket timed out on send/recv operation Root cause is understood: shared storage for the slurmctlds was impacted, leading to an increase in

Re: [slurm-users] Site factor plugin example?

2023-10-16 Thread Reed Dier
Hi Angel and Loris, I hope this will be of at least some help, as I was tasked with trying to get site factor implemented in our cluster for the sake of making conformant, predictable priority values that were “pretty” and round, and I was not able to find any good documentation for it either.

Re: [slurm-users] Backfill Scheduling

2023-06-27 Thread Reed Dier
> On Jun 27, 2023, at 1:10 AM, Loris Bennett wrote: > > Hi Reed, > > Reed Dier <reed.d...@focusvq.com> writes: > >> Is this an issue with the relative FIFO nature of the priority scheduling >> currently with all of the other factors disabled, >

[slurm-users] Backfill Scheduling

2023-06-26 Thread Reed Dier
Hoping this will be an easy one for the community. The priority schema was recently reworked for our cluster, with only PriorityWeightQOS and PriorityWeightAge contributing to the priority value, while PriorityWeightAssoc, PriorityWeightFairshare, PriorityWeightJobSize, and PriorityWeightPartit
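With only the QOS and age weights non-zero, the multifactor priority reduces to two terms. A rough sketch of that arithmetic, assuming the usual weight-times-factor sum with each factor normalised to the range 0.0 to 1.0; the weights, the 7-day maximum age, and the function name are made-up illustration values, not the poster's configuration:

```python
def job_priority(weight_qos: float, weight_age: float,
                 qos_factor: float, age_days: float,
                 max_age_days: float) -> float:
    """Two-term priority: QOS and age only, all other weights assumed zero."""
    # The age factor grows linearly with queue time and is capped at 1.0.
    age_factor = min(age_days / max_age_days, 1.0)
    return weight_qos * qos_factor + weight_age * age_factor


# Hypothetical numbers: a job halfway to the maximum age in a mid-priority QOS.
halfway = job_priority(1000, 100, qos_factor=0.5, age_days=3.5, max_age_days=7.0)
# The same job after the age factor has saturated.
saturated = job_priority(1000, 100, qos_factor=0.5, age_days=30.0, max_age_days=7.0)
```

Within a single QOS the QOS term is constant, so ordering is driven by age alone until the age factor saturates, which is why such a setup can behave much like FIFO.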

Re: [slurm-users] task/cgroup plugin causes "srun: error: task 0 launch failed: Plugin initialization failed" error on Ubuntu 22.04

2023-06-15 Thread Reed Dier
I don’t have any direct advice off-hand, but I figure I will try to help steer the conversation in the right direction for figuring it out. I’m going to assume that since you mention 21.08.5, that this means you are using the slurm-wlm packages from the ubuntu repos, and not building yourself?

Re: [slurm-users] monitoring and accounting

2023-06-12 Thread Reed Dier
Hey Andrew, I don’t have any specific examples I can share right this second, I’ll look into making it shareable, but my solution was to throw some basic bash scripts into cron to scrape and ship into influx. I have one script that looks at sinfo, parsing out AIOT state for nodes and CPUs, and
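The approach described here (parse sinfo's allocated/idle/other/total counts, then ship them to influx) can be sketched in Python rather than bash. This is only an illustration: the measurement name `slurm_cpus`, the `partition` tag, and both function names are assumptions, and the sketch assumes tag values contain no spaces or commas that would need escaping:

```python
def parse_aiot(value: str) -> dict[str, int]:
    """Turn an 'allocated/idle/other/total' string into named integer counts."""
    allocated, idle, other, total = (int(n) for n in value.strip().split("/"))
    return {"allocated": allocated, "idle": idle, "other": other, "total": total}


def to_line_protocol(measurement: str, tags: dict[str, str],
                     fields: dict[str, int], ts_ns: int) -> str:
    """Format one InfluxDB line-protocol record with integer fields."""
    head = measurement
    if tags:
        head += "," + ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    # Integer field values carry an "i" suffix in line protocol.
    field_part = ",".join(f"{k}={v}i" for k, v in fields.items())
    return f"{head} {field_part} {ts_ns}"


# Hypothetical sample: 12 allocated, 20 idle, 0 other, 32 total CPUs.
line = to_line_protocol(
    "slurm_cpus",
    {"partition": "gpu"},
    parse_aiot("12/20/0/32"),
    1700000000000000000,
)
```

The resulting string can be written to an influx write endpoint by whatever client the site already uses.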

Re: [slurm-users] hi-priority partition and preemption

2023-05-25 Thread Reed Dier
After trying to approach this with preempt/partition_prio, we ended up moving to QOS based preemption due to some issues with suspend/requeue, and also wanting to use QOS for quicker/easier tweaks than changing partitions as a whole. > PreemptType=preempt/qos > PreemptMode=SUSPEND,GANG > Partit

Re: [slurm-users] How to install a newer slurm version on Ubuntu 18.04

2023-04-14 Thread Reed Dier
> On the third node (with Ubuntu 18.04), I tried to add the following line to > /etc/apt/sources.list: > deb http://za.archive.ubuntu.com/ubuntu > jammy main universe You definitely can’t just add the jammy repos to a bionic system. This will more or less br

[slurm-users] Shard accounting in sreport

2023-02-14 Thread Reed Dier
Hoping someone can tell me if I’m just thinking about this wrong, or if maybe this is somewhere with room for improvement. I recently upgraded my cluster to 22.05.8 and am testing out gpu sharding on a subset of GPUs, specifically my T4’s. > -

Re: [slurm-users] Job continuing to use cpu minutes after completion

2023-02-03 Thread Reed Dier
This sounds similar to something I recently experienced and finally figured out in 21.08. https://lists.schedmd.com/pipermail/slurm-users/2023-January/009594.html The long and short of it is that I had jobs with the clo

Re: [slurm-users] Job cancelled into the future

2023-01-19 Thread Reed Dier
AM, Reed Dier wrote: > > So I was going to take a stab at trying to rectify this after taking care of > post-holiday matters. > > Paste of the $CLUSTER_job_table table where I think I see the issue, and now > I just want to sanity check my steps to remediate. > https:/

Re: [slurm-users] Job cancelled into the future

2023-01-17 Thread Reed Dier
So I was going to take a stab at trying to rectify this after taking care of post-holiday matters. Paste of the $CLUSTER_job_table table where I think I see the issue, and now I just want to sanity check my steps to remediate. https://rentry.co/qhw6mg (pastebin alterna

Re: [slurm-users] Job cancelled into the future

2022-12-20 Thread Reed Dier
eps I’ve tried to flush those ideas. Thanks, Reed > On Dec 20, 2022, at 10:08 AM, Reed Dier wrote: > > 2 votes for runawayjobs is a strong vote (and also something I’m glad to > learn exists for the future), however, it does not appear to be the case. > >> # sacctmgr show runa

Re: [slurm-users] Job cancelled into the future

2022-12-20 Thread Reed Dier
ulprit. Appreciate the responses. Reed > On Dec 20, 2022, at 10:03 AM, Brian Andrus wrote: > > Try: > > sacctmgr list runawayjobs > > Brian Andrus > > On 12/20/2022 7:54 AM, Reed Dier wrote: >> Hoping this is a fairly simple one. >> >> This is

[slurm-users] Job cancelled into the future

2022-12-20 Thread Reed Dier
Hoping this is a fairly simple one. This is a small internal cluster that we’ve been using for about 6 months now, and we’ve had some infrastructure instability in that time, which I think may be the root culprit behind this weirdness, but hopefully someone can point me in the direction to solv

Re: [slurm-users] job_time_limit: inactivity time limit reached ...

2022-09-19 Thread Reed Dier
I’m not sure if this might be helpful, but my logrotate.d for slurm looks a bit different, namely instead of a systemctl reload, I am sending a specific SIGUSR2 signal, which is supposedly for the specific purpose of logrotation in slurm. > postrotate > pkill -x --signal SIGUS

Re: [slurm-users] Suspend without gang scheduling

2022-08-08 Thread Reed Dier
, this gets me to where I wanted to be in the first place, which is tier3 not gang scheduling, while still allowing tier1/tier2 to be requeued/suspended. So I answered my own question, and hopefully someone will benefit from this. Reed > On Aug 8, 2022, at 11:27 AM, Reed Dier wrote: > >

[slurm-users] Suspend without gang scheduling

2022-08-08 Thread Reed Dier
I’ve got essentially 3 “tiers” of jobs. tier1 are stateless and can be requeued tier2 are stateful and can be suspended tier3 are “high priority” and can preempt tier1 and tier2 with the requisite preemption modes. > $ sacctmgr show qos format=name%10,priority%10,preempt%12,preemptmode%10 >

Re: [slurm-users] Using "srun" on compute nodes -- Ray cluster

2022-07-15 Thread Reed Dier
I have some users that are using ray on slurm. I will preface by saying we are new slurm users, so may not be doing everything exactly correctly. The only issue we have come across so far was something somewhat ray-specific. Specifically, and pardon my lack of specificity,

Re: [slurm-users] DBD Reset

2022-06-15 Thread Reed Dier
cally, > not pulled from a config file exactly. > > I ran into this exact thing years ago, but can’t remember where the firewall > was the issue. > > Sent from my iPhone > >> On Jun 15, 2022, at 20:12, Reed Dier wrote: >> >> Hoping this is an easy answ

[slurm-users] DBD Reset

2022-06-15 Thread Reed Dier
Hoping this is an easy answer. My mysql instance somehow corrupted itself, and I’m having to purge and start over. This is ok, because the data in there isn’t too valuable, and we aren’t making use of associations or anything like that yet (no AccountingStorageEnforce). That said, I’ve decided