[slurm-users] Re: problem with squeue --json with version 24.05.1

2024-08-05 Thread Markus Köberl via slurm-users
For me the problem is now fixed with SLURM 24.05.2
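For clusters that cannot upgrade yet, pinning an explicit data parser version (as discussed below) can serve as a stopgap. A minimal sketch, assuming a Bourne-style shell, to check which of the versions mentioned in this thread a given squeue build handles (output is discarded, since we only care whether the command survives):

  for v in v0.0.39 v0.0.40 v0.0.41; do
      printf '%s: ' "$v"
      squeue --json="$v" > /dev/null 2>&1 && echo OK || echo FAILED
  done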


regards
Markus Köberl

On Wednesday, 3 July 2024 15:34:37 CEST Ümit Seren wrote:
> We experience the same issue.
> 
> SLURM 24.05.1 segfaults with `squeue --json` and `squeue --json=v0.0.41`, but
> works with `squeue --json=v0.0.40`.
> 
> 
> From: Markus Köberl via slurm-users 
> Date: Wednesday, 3. July 2024 at 15:15
> To: Joshua Randall 
> Cc: slurm-users@lists.schedmd.com 
> Subject: [slurm-users] Re: problem with squeue --json with version 24.05.1
> 
> On Wednesday, 3 July 2024 13:26:25 CEST Joshua Randall wrote:
> > Markus,
> > 
> > I had a similar problem after upgrading from v23 to v24, but found that
> > specifying _any_ valid data version worked for me; only specifying
> > `--json` without a version triggered an error (which in my case was, I
> > believe, a segfault from sinfo rather than a malloc error from squeue -
> > but as both are memory errors, it seems plausible they arise from the
> > same underlying library issue presenting differently in different CLI
> > tools). So the underlying issue _may_ lie in the logic that determines
> > the latest data version and loads it, whereas specifying any valid
> > version explicitly may work.
> > 
> > Are you able to run `squeue --json=v0.0.41` successfully?
> 
> It seems to be a problem only with squeue and data parser version v0.0.41;
> it also affects 24.05.0 in the same way.
> 
> $ squeue --json=v0.0.41
> malloc(): corrupted top size
> Aborted
> 
> Data parser version v0.0.40 works; v0.0.39 does not return anything.
> 
> 
> regards
> Markus
> --
> Markus Koeberl
> Graz University of Technology
> Signal Processing and Speech Communication Laboratory
> E-mail: markus.koeb...@tugraz.at


-- 
Markus Koeberl
Graz University of Technology
Signal Processing and Speech Communication Laboratory
E-mail: markus.koeb...@tugraz.at


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: With slurm, how to allocate a whole node for a single multi-threaded process?

2024-08-05 Thread Daniel Letai via slurm-users

I think the issue is more severe than you describe.


Slurm juggles the needs of many jobs. Just because some resources are
available at the exact second a job starts doesn't mean they aren't already
earmarked for a future job that is waiting on even more resources. And
consider the case where the opportunistic job is a backfill job: by grabbing
extra resources at the last minute it could prevent a higher-priority job
from starting, or push it further back.


The request, while understandable from a user's point of view, is a
non-starter for a shared cluster.


Just my 2 cents.


On 02/08/2024 17:34, Laura Hild via slurm-users wrote:


  My read is that Henrique wants to specify a job to require a variable number of CPUs on one node, so that when the job is at the front of the queue, it will run opportunistically on however many happen to be available on a single node as long as there are at least five.

I don't personally know of a way to specify such a job, and wouldn't be surprised if there isn't one, since as other posters have suggested, usually there's a core-count sweet spot that should be used, achieving a performance goal while making efficient use of resources.  A cluster administrator may in fact not want you using extra cores, even if there's a bit more speed-up to be had, when those cores could be used more efficiently by another job.  I'm also not sure how one would set a judicious TimeLimit on a job that would have such a variable wall-time.

So there is the question of whether it is possible, and whether it is advisable.
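
For the narrower goal in the thread's subject line (one whole node for a single multi-threaded process), a minimal batch-script sketch could look like the following, assuming the site allows exclusive allocations; the time limit and binary name are placeholders, and, as noted above, the wall time still has to be guessed conservatively:

  #!/bin/bash
  #SBATCH --nodes=1
  #SBATCH --exclusive          # take every CPU on whichever node Slurm picks
  #SBATCH --ntasks=1           # a single (multi-threaded) process
  #SBATCH --time=04:00:00      # placeholder; choose a conservative limit

  # Size the thread pool to whatever was actually allocated on this node.
  export OMP_NUM_THREADS="${SLURM_CPUS_ON_NODE:-1}"

  ./my_threaded_program        # hypothetical binary

This does not give the "at least five CPUs, opportunistically more" semantics discussed above; it simply takes the whole node and adapts the thread count to whatever it provides.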



  


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] when running `salloc --gres=gpu:1` should I see all gpus in nvidia-smi ?

2024-08-05 Thread Oren via slurm-users
Hello,
When I run this command:
`salloc --nodelist=gpu03 -p A4500_Features --gres=gpu:1`
and then automatically ssh into the job, what should I see when I run
nvidia-smi? All the GPUs on the host, or just a single one?
Thanks

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] ODP: Re: _refresh_assoc_mgr_qos_list: no new list given back keeping cached one

2024-08-05 Thread Rafał Lalik via slurm-users
I had the same issue. After upgrading to slurm-24.05.2 the problem is solved.
Try it.

R.

From: andreas.wiedholz--- via slurm-users 
Sent: Monday, 15 July 2024 14:32
To: slurm-users@lists.schedmd.com 
Subject: [slurm-users] Re: _refresh_assoc_mgr_qos_list: no new list given back
keeping cached one



Hi João,

did you get this problem solved? I have the exact same problem and would be 
very interested.

Help would be greatly appreciated!

Thank you and best regards,
Andi

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: when running `salloc --gres=gpu:1` should I see all gpus in nvidia-smi ?

2024-08-05 Thread Roberto Polverelli Monti via slurm-users

Hello Oren,

On 8/5/24 3:20 PM, Oren via slurm-users wrote:

When I am running this command:
`salloc --nodelist=gpu03 -p A4500_Features  --gres=gpu:1`
and then automatically ssh to the job, what should I see when I run 
nvidia-smi? All the GPUs in the host or just a single one?


That should depend on the ConstrainDevices parameter. [1]  You can 
quickly verify this with:


$ scontrol show conf | grep Constr

1. https://slurm.schedmd.com/cgroup.conf.html#OPT_ConstrainDevices
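
If constraining devices is what the site wants, a minimal cgroup.conf sketch (option names are from the cgroup.conf man page linked above; whether to also constrain cores and memory is a local policy decision) could look like:

  # cgroup.conf (sketch)
  ConstrainDevices=yes     # jobs only see the GPUs they were allocated
  ConstrainCores=yes       # optional: confine jobs to their allocated cores
  ConstrainRAMSpace=yes    # optional: enforce the job's memory allocation

With ConstrainDevices set to no, nvidia-smi inside the allocation will typically show every GPU on the host, even though only the requested ones are assigned to the job.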

Best,

--
Roberto Polverelli Monti
HPC Engineer
Do IT Now | https://doit-now.tech/

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: when running `salloc --gres=gpu:1` should I see all gpus in nvidia-smi ?

2024-08-05 Thread Oren via slurm-users
Hi James, I am sort of the admin here and am trying to understand what the
goal should be.
Thanks Roberto, I'll have a look at ConstrainDevices.


On Mon, 5 Aug 2024 at 18:14, Roberto Polverelli Monti via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> Hello Oren,
>
> On 8/5/24 3:20 PM, Oren via slurm-users wrote:
> > When I am running this command:
> > `salloc --nodelist=gpu03 -p A4500_Features  --gres=gpu:1`
> > and then automatically ssh to the job, what should I see when I run
> > nvidia-smi? All the GPUs in the host or just a single one?
>
> That should depend on the ConstrainDevices parameter. [1]  You can
> quickly verify this with:
>
> $ scontrol show conf | grep Constr
>
> 1. https://slurm.schedmd.com/cgroup.conf.html#OPT_ConstrainDevices
>
> Best,
>
> --
> Roberto Polverelli Monti
> HPC Engineer
> Do IT Now | https://doit-now.tech/
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com