[slurm-users] propose environment variables SLURM_STDOUT, SLURM_STDERR, SLURM_STDIN

2024-01-19 Thread urbanjost
RE: Placing the full pathname of the job stdout in an environment variable. Would others find it useful if new variables were added that contained the full pathnames of the standard input, output and error files of batch jobs? ## SYNOPSIS Proposed new environment variables SLURM_STDOUT,SLURM_STDE
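
A minimal sketch, in Python, of how a job step might consume such variables if they existed. SLURM_STDIN, SLURM_STDOUT and SLURM_STDERR are the names proposed in this thread, not variables that Slurm is known to set; the helper name is hypothetical.

```python
import os
from typing import Dict, Mapping, Optional

# Names taken from the proposal's subject line; they are NOT existing
# Slurm variables, so every lookup may legitimately return None.
PROPOSED_VARS = ("SLURM_STDIN", "SLURM_STDOUT", "SLURM_STDERR")


def proposed_stream_paths(
    env: Mapping[str, str] = os.environ,
) -> Dict[str, Optional[str]]:
    """Return the proposed variables found in env, or None where unset."""
    return {name: env.get(name) for name in PROPOSED_VARS}


if __name__ == "__main__":
    for name, path in proposed_stream_paths().items():
        print(f"{name} = {path if path is not None else '(not set)'}")
```

Returning None for unset names lets the same script run unchanged on clusters where the proposal is not implemented.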

Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)

2024-01-19 Thread Jason Macklin
u are trying to achieve: >> https://slurm.schedmd.com/gres.html#MPS_Management >> >> >> >> I agree with the first paragraph. How many GPUs are you expecting each >> job to use? I'd have assumed, based on the original text, that each job is >> supposed to use 1

Re: [slurm-users] slurmctld/slurmdbd (code=exited, status=217/USER)

2024-01-19 Thread Ümit Seren
Looks like the slurm user does not exist on the system. Did you run the slurmctld and slurmdbd before as root? If you remove the two lines (User, Group), the services will start. But it is recommended to create a dedicated slurm user for that: https://slurm.schedmd.com/quickstart_admin.html#daemon
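
A minimal sketch of that check, in Python. systemd exits with status=217/USER when it cannot determine or switch to the credentials named in a unit's User= or Group= line, so looking up the account is a quick first test. The account and group name "slurm" is an assumption here; use whatever your unit files name.

```python
import grp
import pwd


def user_exists(name: str) -> bool:
    """True if the system's user database can resolve this user name."""
    try:
        pwd.getpwnam(name)
        return True
    except KeyError:
        return False


def group_exists(name: str) -> bool:
    """True if the system's group database can resolve this group name."""
    try:
        grp.getgrnam(name)
        return True
    except KeyError:
        return False


if __name__ == "__main__":
    account = "slurm"  # assumed name; match the User=/Group= lines in your units
    print(f"user  {account!r} exists: {user_exists(account)}")
    print(f"group {account!r} exists: {group_exists(account)}")
```

If either lookup fails, creating the account (or removing the User= and Group= lines, as the reply notes) should let the services start.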

[slurm-users] slurmctld/slurmdbd (code=exited, status=217/USER)

2024-01-19 Thread Miriam Olmi
Hi all, I am having an issue with the new version of slurm, 23.11.0-1. I had already installed and configured slurm 23.02.3-1 on my cluster and all the services were active and running properly. After installing the new version of slurm with the same procedure, I find that the slurmctld and slurm

Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)

2024-01-19 Thread Ümit Seren
Maybe also post the output of scontrol show job to check the other resources allocated for the job. On Thu, Jan 18, 2024, 19:22 Kherfani, Hafedh (Professional Services, TC) < hafedh.kherf...@hpe.com> wrote: > Hi Ümit, Troy, > > > > I removed the line “#SBATCH --gres=gpu:1”, and changed the sba

Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)

2024-01-19 Thread Marko Markoc
+1 on checking the memory allocation. Or add/check if you have any DefMemPerX set in your slurm.conf. On Fri, Jan 19, 2024 at 12:33 AM mohammed shambakey wrote: > Hi > > I'm not an expert, but is it possible that the currently running job is > consuming the whole node because it is allocated the

[slurm-users] Jobs exiting together

2024-01-19 Thread Alexander Silva
Recently, I built an HPC cluster with Slurm as the workload manager. The test jobs with quantum chemistry codes have worked fine. However, production jobs with LAMMPS have shown unexpected behavior: when the first job completes, normally or not, it causes the termination of the others in the same compute

Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)

2024-01-19 Thread mohammed shambakey
Hi, I'm not an expert, but is it possible that the currently running job is consuming the whole node because it is allocated the whole memory of the node (so the other 2 jobs had to wait until it finishes)? Maybe if you try to restrict the required memory for each job? Regards On Thu, Jan 18, 20
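
The memory hypothesis can be illustrated with simple arithmetic. This is a sketch in Python, assuming memory is tracked as a consumable resource on the node; the node size and per-job requests below are made-up numbers, not values from the thread, and the function name is hypothetical.

```python
def concurrent_jobs(node_mem_mb: int, job_mem_mb: int) -> int:
    """How many jobs fit on one node if memory is the only limiting resource."""
    if node_mem_mb <= 0 or job_mem_mb <= 0:
        raise ValueError("memory sizes must be positive")
    return node_mem_mb // job_mem_mb


if __name__ == "__main__":
    node_mem_mb = 512_000  # assumed node memory, illustrative only

    # A job that ends up holding the whole node's memory leaves no room
    # for a second job, so later jobs wait in the queue.
    print("whole-node request:", concurrent_jobs(node_mem_mb, node_mem_mb))

    # A job restricted to 64 GiB (65536 MB) lets several run side by side.
    print("64 GiB request:    ", concurrent_jobs(node_mem_mb, 64 * 1024))
```

In a batch script the per-job limit would be set with a line such as `#SBATCH --mem=64G`; the right value depends on what each job actually needs.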