from:"Ümit Seren"

[slurm-users] Re: problem with squeue --json with version 24.05.1

2024-07-03 Thread Ümit Seren via slurm-users

We experience the same issue. SLURM 24.05.1 segfaults with squeue --json and squeue --json=v0.0.41 but works with squeue --json=v0.0.40. From: Markus Köberl via slurm-users Date: Wednesday, 3. July 2024 at 15:15 To: Joshua Randall Cc: slurm-users@lists.schedmd.com Subject: [slurm-users] Re: pr
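
The workaround reported above can be sketched as a small wrapper: try plain `squeue --json` first and fall back to the `v0.0.40` data parser if the command fails. This is only a sketch, assuming the crash shows up as a non-zero or signal exit status; the function name `squeue_json` and the injectable `run` parameter are illustrative and not part of Slurm.

```python
import json
import subprocess

# Data-parser versions to try in order; "" means plain `squeue --json`.
# v0.0.40 is the version reported to work in the message above.
PARSERS = ["", "v0.0.40"]


def squeue_json(run=subprocess.run):
    """Return parsed `squeue --json` output, falling back to an older parser.

    `run` is injectable (hypothetical design choice) so the fallback logic
    can be exercised without access to a cluster.
    """
    last_error = None
    for parser in PARSERS:
        flag = "--json" if not parser else f"--json={parser}"
        proc = run(["squeue", flag], capture_output=True, text=True)
        # A process killed by a signal (e.g. a segfault) has a negative
        # return code in Python's subprocess module.
        if proc.returncode == 0:
            return json.loads(proc.stdout)
        last_error = f"squeue {flag} exited with {proc.returncode}"
    raise RuntimeError(last_error)
```

Pinning the parser version explicitly in scripts is a stopgap; the default can be restored once a fixed release is installed.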

Re: [slurm-users] slurmctld/slurmdbd (code=exited, status=217/USER)

2024-01-19 Thread Ümit Seren

Looks like the slurm user does not exist on the system. Did you run slurmctld and slurmdbd as root before? If you remove the two lines (User, Group), the services will start. But it is recommended to create a dedicated slurm user for that: https://slurm.schedmd.com/quickstart_admin.html#daemon
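
A quick way to confirm the diagnosis above is to check whether the account named in the unit file's User= line resolves on the host. The following is a minimal sketch; the helper name `user_exists` is illustrative, and "slurm" is assumed to be the account name in use.

```python
import pwd


def user_exists(name):
    """Return True if `name` resolves through the system's passwd database."""
    try:
        pwd.getpwnam(name)
    except KeyError:
        return False
    return True


if __name__ == "__main__":
    # "slurm" is an assumption; use the value of User= from your unit files.
    print("slurm user present:", user_exists("slurm"))
```

If this reports that the user is missing, creating the dedicated account as described in the linked guide is the recommended route rather than running the daemons as root.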

Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)

2024-01-19 Thread Ümit Seren

Maybe also post the output of scontrol show job to check the other resources allocated for the job. On Thu, Jan 18, 2024, 19:22 Kherfani, Hafedh (Professional Services, TC) < hafedh.kherf...@hpe.com> wrote: > Hi Ümit, Troy, > > > > I removed the line “#SBATCH --gres=gpu:1”, and changed the sba

Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)

2024-01-18 Thread Ümit Seren

This line also has to be changed: #SBATCH --gpus-per-node=4 → #SBATCH --gpus-per-node=1 --gpus-per-node seems to be the new parameter that is replacing the --gres= one, so you can remove the --gres line completely. Best Ümit From: slurm-users on behalf of Kherfani, Hafedh (Professional Servi

Re: [slurm-users] External Authentication Integration with JWKS and RS256 Tokens

2023-10-05 Thread Ümit Seren

sed but > this is not in the version we have deployed. > > Cheers, > > Laurence > On 24.03.23 16:51, Ümit Seren wrote: > > Looks like you are missing the username field in the JWT token: > https://github.com/SchedMD/slurm/blob/slurm-22-05-8-1/src/plugins/auth/jwt/aut

Re: [slurm-users] External Authentication Integration with JWKS and RS256 Tokens

2023-03-24 Thread Ümit Seren

oes contain > this parameter. I will continue to debug but any suggestions would be > greatly appreciated. > > Cheers, > > Laurence > On 23.03.23 11:42, Ümit Seren wrote: > > If you use AzureAD as your identity provider beware that their JWKS json > doesn't contai

Re: [slurm-users] External Authentication Integration with JWKS and RS256 Tokens

2023-03-23 Thread Ümit Seren

If you use AzureAD as your identity provider beware that their JWKS json doesn't contain the alg parameter. We opened an issue: https://bugs.schedmd.com/show_bug.cgi?id=16168 and it is confirmed. As a workaround you can use this jq query to add the alg to the jwks json that you get from AzureAD: cu
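
The message above uses a jq query for this workaround, which is cut off in this listing. As a rough Python equivalent, the sketch below adds an alg field to every key that lacks one. The function name `add_missing_alg` is illustrative, and the default of RS256 is an assumption taken from the thread subject.

```python
def add_missing_alg(jwks, alg="RS256"):
    """Return a copy of a JWKS dict in which every key carries an "alg" field.

    Keys that already declare an algorithm are left unchanged; the input
    dictionary is not modified.
    """
    keys = [{**key, "alg": key.get("alg", alg)} for key in jwks.get("keys", [])]
    return {**jwks, "keys": keys}
```

The result can be serialized with `json.dumps` and written to the JWKS file that Slurm is configured to read.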

Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-27 Thread Ümit Seren

As a side note: In Slurm 23.x a new rate limiting feature for client RPC calls was added: (see this commit: https://github.com/SchedMD/slurm/commit/674f118140e171d10c2501444a0040e1492f4eab#diff-b4e84d09d9b1d817a964fb78baba0a2ea6316bfc10c1405329a95ad0353ca33e ) This would give operators the ability

Re: [slurm-users] job_container/tmpfs and autofs

2023-01-12 Thread Ümit Seren

We had the same issue when we switched to the job_container plugin. We ended up running cvmfs_config probe as part of the health check tool so that the cvmfs repos stay mounted. However, after we switched on power saving we ran into some race conditions (job landed on a node before the cvmfs was mounted

Re: [slurm-users] GPU-node not waking up after power-save

2022-10-13 Thread Ümit Seren

We use power saving with our GPU nodes and they power up fine. They take a bit longer to boot but that’s it. What do you mean by not waking up? Is the power-on script not called? Best Ümit From: slurm-users on behalf of Loris Bennett Date: Thursday, 13. October 2022 at 08:14 To: Slurm User

Re: [slurm-users] Providing users with info on wait time vs. run time

2022-09-16 Thread Ümit Seren

On Fri, Sep 16, 2022 at 3:43 PM Sebastian Potthoff < s.potth...@uni-muenster.de> wrote: > Hi Hermann, > > So you both are happily(?) ignoring this warning the "Prolog and Epilog > Guide", > right? :-) > > "Prolog and Epilog scripts [...] should not call Slurm commands (e.g. > squeue, > scontrol, s

Re: [slurm-users] Rolling upgrade of compute nodes

2022-05-30 Thread Ümit Seren

We did a couple of major and minor SLURM upgrades without draining the compute nodes. Once slurmdbd and slurmctld were updated to the new major version, we did a package update on the compute nodes and restarted slurmd on them. The existing running jobs continued to run fine and new jobs on the s

12 matches
