[slurm-users] Re: /etc/passwd sync?

2025-02-11 Thread Feng Zhang via slurm-users
Keeping /etc/passwd and /etc/group synced across all the nodes should work. You will also need to set up SSH keys for MPI. Best, Feng On Mon, Feb 10, 2025 at 10:29 PM mark.w.moorcroft--- via slurm-users <slurm-users@lists.schedmd.com> wrote: > If you set up slurm elastic cloud in EC2 without LDAP, wha
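
One way to check the sync suggested above is to compare each account's UID/GID between two nodes' passwd files. This is a minimal Python sketch, assuming the file contents have already been collected as text; the function names `parse_passwd` and `passwd_mismatches` are hypothetical and not part of Slurm or any tool mentioned in the thread:

```python
def parse_passwd(text):
    """Map user name -> (uid, gid) from /etc/passwd-style text."""
    accounts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 4:
            continue  # skip malformed lines
        accounts[fields[0]] = (int(fields[2]), int(fields[3]))
    return accounts


def passwd_mismatches(head_text, node_text):
    """Return users that are missing on the node or whose uid/gid differ."""
    head = parse_passwd(head_text)
    node = parse_passwd(node_text)
    return sorted(user for user, ids in head.items() if node.get(user) != ids)
```

An empty result means every account on the head node exists on the compute node with the same UID and GID; the same comparison could be applied to /etc/group with a similar parser.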

[slurm-users] Re: Print Slurm Stats on Login

2024-08-28 Thread Feng Zhang via slurm-users
You can also check https://github.com/prod-feng/slurm_tools (slurm_job_perf_show.py may be helpful). I used to use slurm_job_perf_show_email.py to email users a summary of their usage, e.g. monthly, but some users seemed to get confused, so I stopped. Best, Feng On Fri, Aug 9, 20

[slurm-users] Re: Can Not Use A Single GPU for Multiple Jobs

2024-06-21 Thread Feng Zhang via slurm-users
Yes, the algorithm should work like that: 1 CPU (core) per job (task). As someone already mentioned, you need to set --oversubscribe=10 on the CPU cores, meaning 10 jobs on each core in your case (in slurm.conf). Best, Feng On Fri, Jun 21, 2024 at 6:52 AM Arnuld via slurm-users wrote: > > > Every job will need

[slurm-users] maxrss reported by sacct is wrong

2024-06-07 Thread Feng Zhang via slurm-users
Hi All, I am having trouble calculating the real RSS memory usage of certain kinds of users' jobs, for which sacct returns wrong numbers. Rocky Linux release 8.5, Slurm 21.08 (slurm.conf) ProctrackType=proctrack/cgroup JobAcctGatherType=jobacct_gather/linux The problematic jobs are like: 1. python

[slurm-users] Re: srun weirdness

2024-05-14 Thread Feng Zhang via slurm-users
Do you have any container settings? On Tue, May 14, 2024 at 3:57 PM Feng Zhang wrote: > > Not sure, very strange, though the two linux-vdso.so.1 entries look different: > > [deej@moose66 ~]$ ldd /mnt/local/ollama/ollama > linux-vdso.so.1 (0x7ffde81ee000) > > > [deej@mo

[slurm-users] Re: srun weirdness

2024-05-14 Thread Feng Zhang via slurm-users
4/libdl.so.2 (0x14a9d82c9000) > libm.so.6 => /lib64/libm.so.6 (0x14a9d7f25000) > libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x14a9d82ae000) > libc.so.6 => /lib64/libc.so.6 (0x14a9d7c0) > /lib64/ld-linux-x86-64.so.2 (0x14a9d8306000) > &g

[slurm-users] Re: srun weirdness

2024-05-14 Thread Feng Zhang via slurm-users
Looks more like a runtime environment issue. Check the binaries: running ldd /mnt/local/ollama/ollama on both clusters and comparing the output may give some hints. Best, Feng On Tue, May 14, 2024 at 2:41 PM Dj Merrill via slurm-users wrote: > > I'm running into a strange issue and I'm hoping anoth
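
The ldd comparison suggested above can be done by eye, but load addresses differ on every run and clutter a plain diff. This is a minimal Python sketch, assuming the ldd output from each cluster has been captured as text; the helper names `normalize_ldd` and `ldd_diff` are hypothetical:

```python
import re

# Trailing load address such as "(0x00007ffde81ee000)"; it varies between runs.
_ADDR = re.compile(r"\s*\(0x[0-9a-fA-F]+\)\s*$")


def normalize_ldd(output):
    """Strip load addresses and return the set of remaining ldd lines."""
    lines = set()
    for line in output.splitlines():
        line = _ADDR.sub("", line.strip())
        if line:
            lines.add(line)
    return lines


def ldd_diff(output_a, output_b):
    """Return (lines only in A, lines only in B) after normalization."""
    a, b = normalize_ldd(output_a), normalize_ldd(output_b)
    return sorted(a - b), sorted(b - a)
```

Any line that appears on only one side points to a library that is missing or resolved from a different path on that cluster.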

Re: [slurm-users] Two gpu types on one node: gres/gpu count reported lower than configured (1 < 5)

2023-10-16 Thread Feng Zhang
Try scontrol update NodeName=heimdall state=DOWN Reason="gpu issue" and then scontrol update NodeName=heimdall state=RESUME to see if it will work. Probably just the Slurm daemon having a hiccup after you made changes. Best, Feng On Mon, Oct 16, 2023 at 10:43 AM Gregor Hagelueken wrote: > > Hi,

Re: [slurm-users] Weirdness with partitions

2023-09-21 Thread Feng Zhang
Reading the pasted slurm.conf info again, it includes "AllowAccounts, AllowGroups", so it seems Slurm actually takes this into account. So I think it should work... Best, Feng On Thu, Sep 21, 2023 at 2:33 PM Feng Zhang wrote: > > As I said I am not sure, but it depends

Re: [slurm-users] Weirdness with partitions

2023-09-21 Thread Feng Zhang
threshold." Best, Feng On Thu, Sep 21, 2023 at 11:48 AM Bernstein, Noam CIV USN NRL (6393) Washington DC (USA) wrote: > > On Sep 21, 2023, at 11:37 AM, Feng Zhang wrote: > > Set slurm.conf parameter: EnforcePartLimits=ANY or NO may help this, not sure. > > > Hmm, in

Re: [slurm-users] Weirdness with partitions

2023-09-21 Thread Feng Zhang
Setting the slurm.conf parameter EnforcePartLimits=ANY or NO may help with this; not sure. Best, Feng On Thu, Sep 21, 2023 at 11:27 AM Jason Simms wrote: > > I personally don't think that we should assume users will always know which > partitions are available to them. Ideally, of course, th

Re: [slurm-users] help with canceling or deleting a job

2023-09-20 Thread Feng Zhang
drain reason=stuck; scontrol > update nodename= state=resume > > > Best > Marcus > > On 20.09.2023 at 09:11, Ole Holm Nielsen wrote: > > On 9/20/23 01:39, Feng Zhang wrote: > >> Restarting the slurmd daemon of the compute node should work, if the > >> node is

Re: [slurm-users] help with canceling or deleting a job

2023-09-19 Thread Feng Zhang
Restarting the slurmd daemon on the compute node should work, if the node is still online and healthy. Best, Feng On Tue, Sep 19, 2023 at 8:03 AM Felix wrote: > > Hello > > I have a job on my system which is running more than its time, more than > 4 days. > > 1808851 debug gridjob atlas01

Re: [slurm-users] Granular or dynamic control of partitions?

2023-08-04 Thread Feng Zhang
You can try a command like: scontrol update PartitionName=mypart Nodes=node[1-90],ab,ac #exclude the one you want to remove "Changing the Nodes in a partition has no effect upon jobs that have already begun execution." Best, Feng On Fri, Aug 4, 2023 at 10:47 AM Pacey, Mike wrote: > > Hi folks, > >

Re: [slurm-users] slurm sinfo format memory

2023-07-20 Thread Feng Zhang
Looks like Slurm itself only supports that format (in MB). The output format of Slurm commands is not very user-friendly to me. It would help to have some easy options; for example, for the sinfo output in this email thread, how about adding support for lazy options, like sinfo -ABC, etc. For the des
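
Since the memory column is printed in MB, one workaround is to post-process the numbers outside Slurm. This is a minimal Python sketch, assuming the MB values have already been extracted from the sinfo output and that a factor of 1024 between units is wanted; `format_mb` is a hypothetical helper, not a Slurm option:

```python
def format_mb(mb):
    """Convert a memory size in MB to a short human-readable string."""
    units = ["M", "G", "T", "P"]
    value = float(mb)
    index = 0
    # Step up one unit at a time while the value is still 1024 or more.
    while value >= 1024 and index < len(units) - 1:
        value /= 1024
        index += 1
    return f"{value:.1f}{units[index]}"
```

For example, a node reported with 192000 MB would be rendered as 187.5G under the 1024-based assumption.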

Re: [slurm-users] Unconfigured GPUs being allocated

2023-07-14 Thread Feng Zhang
Very interesting issue. I am guessing there might be a workaround: since oryx actually has 2 GPUs, could you define both of them but disable the GT 710? Does Slurm support this? Best, Feng On Tue, Jun 27, 2023 at 9:54 AM Wilson, Steven M wrote: > > Hi, > > I manually configure the

Re: [slurm-users] Problem with Cuda program in multi-cluster

2023-07-05 Thread Feng Zhang
Mohamad, it seems you need to upgrade GCC on the GPU nodes of clusters A and C. The error message says that srun needs newer GCC libs. Alternatively, you can downgrade your Slurm (e.g. recompile it using GCC 2.27 or older) on clusters A/C. Best, Feng On Tue, Jul 4, 2023 at 2:46 PM mohammed shambakey

[slurm-users] run issue

2022-11-30 Thread Feng Zhang
Hello all, I am doing some tests using Slurm. I just found that when I run the srun command with the -n and -c options set to odd numbers, the srun job hangs and no shell is given to me. When I check using "squeue", it reports that this job is actually running. When -C = even number

Re: [slurm-users] SLURM Array Job BASH scripting within python subprocess

2022-11-28 Thread Feng Zhang
Not sure if it works, but you can try using "\${SLURM_ARRAY_JOB_ID}". The "\" escapes the "$" and prevents early evaluation of the env variables. On Thu, Nov 10, 2022 at 6:53 PM Chase Schuette wrote: > > Due to needing to support existing HPC workflows. I have a need to pass a > bash script within a python subp
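
The effect of the backslash can be illustrated with a small Python sketch. The original poster's script is not shown in full here, so this assumes the job script text passes through one layer of shell expansion (an unquoted here-document) before it is submitted; `render_heredoc` and the sample value are hypothetical:

```python
import os
import subprocess


def render_heredoc(body, env):
    """Pass `body` through an unquoted shell here-document and return the output."""
    script = "cat <<EOF\n" + body + "\nEOF\n"
    result = subprocess.run(
        ["sh", "-c", script], env=env, capture_output=True, text=True, check=True
    )
    return result.stdout


# Simulate a variable that is already set when the outer shell runs.
env = {
    "PATH": os.environ.get("PATH", "/usr/bin:/bin"),
    "SLURM_ARRAY_JOB_ID": "expanded-early",
}

# Unescaped: the outer shell substitutes the value immediately.
unescaped = render_heredoc("id=${SLURM_ARRAY_JOB_ID}", env)

# Escaped: "\\$" in Python source is a literal "\$", which the shell turns
# into a plain "$", so the reference survives for the job to evaluate later.
escaped = render_heredoc("id=\\${SLURM_ARRAY_JOB_ID}", env)
```

Note that the backslash itself must be doubled (or a raw string used) in Python source so that a literal "\$" reaches the shell.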