[slurm-users] Unexpected negative NICE values
Hello all,

I am encountering some unexpected behavior where the jobs (queued & running) of one specific user have negative NICE values and therefore an increased priority. The user is not privileged in any way and cannot explicitly set a negative nice value, e.g. by adding "--nice=-INT". There is also no QOS which would allow this (is this even possible?). The cluster uses the "priority/multifactor" plugin with weights set for Age, FairShare and JobSize.

This is the only user on the whole cluster for whom this occurs. From what I can tell, he/she is not doing anything out of the ordinary. However, in the job scripts the user does set a nice value of "0". The user also uses a "strategy" of submitting the same job to multiple partitions and, as soon as one of these jobs starts, putting all other jobs (with the same job name) on "hold".

Does anyone have an idea how this could happen? Does Slurm internally adjust the NICE values in certain situations? (I searched the sources but couldn't find anything that would suggest this.)

Slurm version is 23.02.1

Example squeue output:

    [root@mgmt ~]# squeue -u USERID -O JobID,Nice
    JOBID               NICE
    14846760            -5202
    14846766            -8988
    14913146            -13758
    14917361            -15103

Any hints are appreciated.

Kind regards
Sebastian
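For reference, one quick way to inspect where the priority is coming from and to reset a job's nice value from the controller (a sketch, assuming admin access; the job ID is taken from the squeue output above):

    # show all priority components per job, including the nice factor
    sprio -u USERID -l

    # reset a single job's nice value back to 0 (requires operator/admin privileges)
    scontrol update JobId=14846760 Nice=0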
Re: [slurm-users] slurm-users Digest, Vol 65, Issue 38
Hi Mike,

Thanks for the suggestion. I think something else may be missing here on my end. With `sacct` I can actually get the usage of individual jobs with TRES information, but there must be something else causing GPU not to be included in the information I get. When I include the "--allocations" option, the TRES information disappears from my output. In any case, I think this way I would kind of be re-implementing the job of `sreport`, so I will look further into making `sreport` work for me.

Best regards,
Thomas

On 27.03.2023 at 11:07, slurm-users-requ...@lists.schedmd.com wrote:

> Date: Sun, 26 Mar 2023 10:13:09 -0400
> From: Mike Mikailov
> To: Slurm User Community List
> Cc: t...@its.aau.dk
> Subject: Re: [slurm-users] Getting usage reporting from sacct/sreport
>
> Hi Thomas et al,
>
> I have just written a Linux shell script which does exactly what you are
> asking for. Please use the '--allocations' option in the sacct command to
> generate aggregated resource usage per user. You may also use the awk Linux
> command to summarize all CPU usage. A more advanced awk command may also
> summarize all GPU usage.
>
> I have also placed the script on GitHub, but it is private for now until we
> clear it for public release.
>
> Trackable resources normalization, along with trackable resources weights,
> is needed for fairer usage reports. In this case the 'billing' value
> represents the combined (max or sum of the individual trackable resources)
> billing unit. Note that by default this value equals the number of CPUs used.
>
> Thanks,
> -Mike
> USA
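A minimal sketch of the kind of sacct + awk aggregation Mike describes (the date range is a placeholder, and it assumes GPU counts appear in AllocTRES as "gres/gpu=N"):

    # Sum CPU-hours and GPU-hours per user from job-level (-X) records.
    sacct -a -X -P --noheader \
          -S 2023-03-01 -E 2023-03-31 \
          -o User,ElapsedRaw,AllocCPUS,AllocTRES |
    awk -F'|' '{
        cpu[$1] += $2 * $3                        # seconds * allocated CPUs
        if (match($4, /gres\/gpu=[0-9]+/))        # untyped GPU count, if present
            gpu[$1] += $2 * substr($4, RSTART + 9, RLENGTH - 9)
    }
    END {
        for (u in cpu)
            printf "%s cpu-hours=%.1f gpu-hours=%.1f\n", u, cpu[u]/3600, gpu[u]/3600
    }'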
Re: [slurm-users] slurm-users Digest, Vol 65, Issue 38
Hi Jürgen,

Thanks for your feedback. I think you are right that I should probably be using `sreport` for this. There must be some other reason that `sreport` is not showing me any actual output. Perhaps the explanation could be that we currently do not have users organised in accounts; we just have one big pile of users. I will look further into this.

Best regards,
Thomas

On 27.03.2023 at 11:07, slurm-users-requ...@lists.schedmd.com wrote:

> Date: Sun, 26 Mar 2023 17:49:06 +0200
> From: Juergen Salk
> To: Slurm User Community List
> Subject: Re: [slurm-users] Getting usage reporting from sacct/sreport
>
> Hi Thomas,
>
> I think sreport should actually do what you want out of the box, if you have
> permissions to retrieve that information for users other than yourself. In my
> understanding, sacct is meant for individual job and job step accounting,
> while sreport is more suitable for aggregated cluster usage accounting. Thus,
> sreport also accounts for reservation hours, which sacct does not.
>
> sreport should also be able to report on consumed GRES-hours, such as GPU
> hours in your case, but you'll probably have to use the '-T' option in order
> to include that information in the report.
>
> In case it matters, our AccountingStorageTRES looks like this:
>
>     AccountingStorageTRES=gres/scratch,gres/gpu
>
> (We also account for local scratch space allocations as a GRES.)
>
> These are the commands that we usually point our users to when they ask for
> their historical resource utilization:
>
> https://wiki.bwhpc.de/e/BwForCluster_JUSTUS_2_Slurm_HOWTO#How_to_retrieve_historical_resource_usage_for_a_specific_user_or_account.3F
>
> (But omit 'user=' or 'account=' for a report on all users or accounts.)
>
> Hope that helps.
>
> Best regards
> Jürgen
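For the record, the kind of sreport invocation Jürgen refers to might look roughly like this (a sketch; the time window and user name are placeholders, and it assumes gres/gpu is listed in AccountingStorageTRES):

    # Aggregated CPU- and GPU-hours per account/user over a time window
    sreport cluster AccountUtilizationByUser -T cpu,gres/gpu \
            start=2023-03-01 end=2023-04-01 -t hours

    # Same report, restricted to a single user
    sreport cluster AccountUtilizationByUser -T cpu,gres/gpu \
            user=someuser start=2023-03-01 end=2023-04-01 -t hours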
Re: [slurm-users] Unexpected negative NICE values
Hi Sebastian,

maybe it's a silly thought on my part, but do you have the `enable_user_top` option included in your SchedulerParameters configuration? This would allow regular users to use `scontrol top <job_list>` to push some of their jobs ahead of other jobs owned by them, and this works internally by adjusting the nice values of the specified jobs.

I may be totally wrong, but if I remember correctly it is not recommended to configure SchedulerParameters=enable_user_top in general, because regular use of `scontrol top` is (or was?) supposed to introduce bad side effects in certain scenarios: it would allow users to push pending jobs ahead of normal jobs (including other users' jobs) in the queue, if only one of their jobs already has a negative nice value assigned, e.g. by an administrator.

Best regards
Jürgen

* Sebastian Potthoff [230503 10:36]:
> Hello all,
>
> I am encountering some unexpected behavior where the jobs (queued & running)
> of one specific user have negative NICE values and therefore an increased
> priority. The user is not privileged in any way and cannot explicitly set
> the nice value to a negative value by e.g. adding "--nice=-INT". [...]
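A quick way to check for this and to see the effect described above (a sketch; the job IDs are just the ones from the earlier squeue output, used as placeholders):

    # Is enable_user_top set on this cluster?
    scontrol show config | grep -i SchedulerParameters

    # As a regular user, reorder two of your own pending jobs; with
    # enable_user_top this succeeds and adjusts their nice values.
    scontrol top 14913146,14917361
    squeue -u $USER -O JobID,Nice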
Re: [slurm-users] unable to kill namd3 process
Hi,

As an update, we tried one approach; please find it below.

We tried adding the script below to our epilog script to kill any remaining namd3 processes of the job:

    #
    # Kill remaining processes of the job.
    #
    if [ "$SLURM_UID" = "1234" ] ; then
        STUCK_PID=$(${SLURM_BIN}scontrol listpids "$SLURM_JOB_ID" | awk '{print $1}' | grep -v PID)
        for kpid in $STUCK_PID
        do
            kill -9 "$kpid"
        done
    fi

but it didn't work out, as it is unable to fetch the required PIDs with the "scontrol listpids" command. It looks like the slurmd had a problem with a job step that didn't end correctly, and the slurmd wasn't able to kill it after the timeout was reached.

Any help would be much appreciated.

Thanks,
Shaghuf Rahman

On Tue, Apr 25, 2023 at 8:32 PM Shaghuf Rahman wrote:

> Hi,
>
> I also forgot to mention: the process is still running when the user does
> scancel, and the epilog does not clean up if one job finishes while multiple
> jobs are submitted. We tried the unkillable option, but it did not work. The
> process remains until we kill it manually.
>
> On Tue, 25 Apr 2023 at 19:57, Shaghuf Rahman wrote:
>
>> Hi,
>>
>> We are facing an issue in my environment and the behaviour looks strange
>> to me. It is specifically associated with the namd3 application. The issue
>> is described below, along with the cases I have observed.
>>
>> I am trying to understand how to kill the processes of a namd3 job
>> submitted through sbatch without the node going into the drain state.
>>
>> What I observed is that when a user submits a single job on a node and then
>> does scancel of the namd3 job, the job is killed, the node returns to the
>> idle state and everything looks as expected. But when the user submits
>> multiple jobs on a single node and does scancel on one of them, the node is
>> put into the drain state. The other jobs, however, keep running without an
>> issue.
>>
>> Due to this, multiple nodes end up in the drain state when users scancel
>> their namd3 jobs.
>>
>> Note: When the user does not perform scancel, all jobs run successfully and
>> the node states are also fine.
>>
>> This does not cause issues with any other application, so we suspect the
>> issue could be with the namd3 application. Kindly suggest a solution or any
>> ideas on how to fix this issue.
>>
>> Thanks in advance,
>> Shaghuf Rahman
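This is not a fix for the underlying slurmd/job-step problem, but as a fallback an epilog can also sweep leftover processes by the job owner's UID instead of via `scontrol listpids`. A sketch, assuming SLURM_JOB_UID is available in the epilog environment and that the same user cannot have another job running on the node:

    #!/bin/bash
    # Epilog fragment (sketch): kill any leftover namd3 processes owned by the
    # job's user. Only safe for regular (non-system) UIDs and only if that user
    # cannot have a second job still running on this node.
    if [ -n "$SLURM_JOB_UID" ] && [ "$SLURM_JOB_UID" -ge 1000 ]; then
        pkill -9 -U "$SLURM_JOB_UID" namd3 || true
    fi
    exit 0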
Re: [slurm-users] Unexpected negative NICE values
Hi Jürgen,

This was it! Thank you so much for the hint! I did not know about the "top" command and was also not aware that this option was enabled in our slurm.conf.

Thanks for the help!
Sebastian

On 03.05.23 12:10, Juergen Salk wrote:
> Hi Sebastian,
>
> maybe it's a silly thought on my part, but do you have the `enable_user_top`
> option included in your SchedulerParameters configuration? This would allow
> regular users to use `scontrol top <job_list>` to push some of their jobs
> ahead of other jobs owned by them, and this works internally by adjusting the
> nice values of the specified jobs. [...]

--
Westfälische Wilhelms-Universität (WWU) Münster
WWU IT
Sebastian Potthoff, M.Sc. (eScience/HPC)
Röntgenstraße 7-13, R.207/208
48149 Münster
Tel. +49 251 83-31640
E-Mail: s.potth...@uni-muenster.de
Website: www.uni-muenster.de/it
[slurm-users] Best way to accurately calculate the CPU usage of an account when using fairshare?
Good morning,

We have at least one billed account right now whose associated researchers are able to submit jobs that run against our normal queue with fairshare, but not for an academic research purpose. So we'd like to accurately calculate their CPU hours. We are currently using a script that queries the database with sacct and sums up ElapsedRaw * AllocCPUS for all jobs.

But this seems limited, because requeueing will create what the sacct man page calls duplicates. By default jobs normally get requeued only when something outside of the user's control occurs, like a NODE_FAIL or an scontrol command to requeue the job manually. Users can requeue jobs themselves, but it's not a feature we've seen our researchers use. However, with the new scrontab feature, whenever the cron entry is executed more than once, sacct reports that the previous jobs are "requeued" and they are only visible by looking up duplicates.

I haven't seen any billed account use requeueing or scrontab yet, but it's clear to me that it could become significant once researchers start using scrontab more. Scrontab has existed since one of the releases from 2020, I believe, but we enabled it this year and see it as much more powerful than the traditional Linux crontab.

What would be the best way to more thoroughly calculate ElapsedRaw * AllocCPUS, accounting for duplicates but optionally ignoring unintentional requeueing such as from a NODE_FAIL?

Here's the main loop of the simple bash script I have now:

    while IFS='|' read -r end elapsed cpus; do
        # If a job crosses the month boundary,
        # the entire bill is put under the 2nd month.
        year_month="${end:0:7}"
        if [[ ! "$elapsed" =~ ^[0-9]+$ ]] || [[ ! "$cpus" =~ ^[0-9]+$ ]]; then
            continue
        fi
        core_seconds["$year_month"]=$(( core_seconds["$year_month"] + (elapsed * cpus) ))
    done < <(sacct -a -A "$SLURM_ACCOUNT" \
                   -S "$START_DATE" \
                   -E "$END_DATE" \
                   -o End,ElapsedRaw,AllocCPUS -X -P --noheader)

Our slurmdbd is configured to keep 6 months of data. It makes sense to loop through the job IDs instead, using sacct's -D/--duplicates option each time to reveal the hidden duplicates in the REQUEUED state, but I'm interested in whether there are alternatives or whether I'm missing anything here.

Thanks,
Joseph

--
Joseph F. Guzman - ITS (Advanced Research Computing)
Northern Arizona University
joseph.f.guz...@nau.edu
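A sketch of the per-job-ID approach described above, using -D/--duplicates and skipping records that ended in NODE_FAIL (treating those as unintentional requeues; whether to skip them is a policy choice, and the variables are the same ones used in the script above):

    # For each job billed to the account, pull all requeued instances with -D
    # and sum ElapsedRaw * AllocCPUS, skipping NODE_FAIL records.
    total=0
    for jobid in $(sacct -a -A "$SLURM_ACCOUNT" -S "$START_DATE" -E "$END_DATE" \
                         -X -P --noheader -o JobIDRaw); do
        while IFS='|' read -r elapsed cpus state; do
            [[ "$state" == NODE_FAIL* ]] && continue
            [[ "$elapsed" =~ ^[0-9]+$ && "$cpus" =~ ^[0-9]+$ ]] || continue
            total=$(( total + elapsed * cpus ))
        done < <(sacct -j "$jobid" -D -X -P --noheader -o ElapsedRaw,AllocCPUS,State)
    done
    echo "core-hours: $(( total / 3600 ))"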
Re: [slurm-users] Several slurmdbds against one mysql server?
Hello,

Angel de Vicente writes:

> And hence my question... because as I was saying in a previous mail, reading
> the documentation I understand that this is the standard way to do it, but
> right now I got it working the other way: in each cluster I have one slurmdbd
> daemon that connects with a single mysqld daemon in a third machine (option 2
> from my question).

Just to wrap up in case somebody gets here in the future: both options I was considering are really easy to set up, but in the end I went for the standard way of having only one slurmdbd daemon (and so far not the Federated configuration, just the Multi-Cluster one) for, at least, the following reasons (I'm sure there are plenty more):

+ it is the default way in the Slurm documentation

+ it is probably more secure, because I don't have to open the MySQL server to remote connections

+ I found at least one problem with multiple slurmdbd daemons: while adding a new cluster and then, for example, adding a new user was properly reflected in the database, the clusters already running would not see the new accounts until their slurmdbd was restarted (this is not an issue when there is only one slurmdbd)

Cheers,
--
Ángel de Vicente
Research Software Engineer (Supercomputing and BigData)
Tel.: +34 922-605-747
Web.: http://research.iac.es/proyecto/polmag/
GPG: 0x8BDC390B69033F52
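For reference, the single-slurmdbd multi-cluster layout boils down to configuration along these lines (a sketch; host and cluster names are placeholders):

    # slurmdbd.conf on the accounting host ("dbhost")
    DbdHost=dbhost
    StorageType=accounting_storage/mysql
    StorageHost=localhost            # mysqld only needs to listen locally

    # slurm.conf on each cluster's slurmctld host
    ClusterName=cluster1             # unique name per cluster
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=dbhost

Each cluster is then registered once with `sacctmgr add cluster cluster1`, after which accounts and users added via sacctmgr are visible to all clusters through the one slurmdbd.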
Re: [slurm-users] Problem with cgroup plugin in Ubuntu22.04 and slurm 21.08.5
Hello,

Angel de Vicente writes:

> ,----
> | slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are:
> | 5:freezer:/
> | 3:cpuacct:/
> `----

In the end I learnt that, despite Ubuntu 22.04 reporting to be using only cgroup v2, it was also using v1 and creating those mount points, and then Slurm 23.02.01 complained that it could not work with cgroups in hybrid mode. So the "solution" (as long as you don't need v1 for some reason) was to add "cgroup_no_v1=all" to the kernel parameters and reboot: no more v1 mount points, and Slurm was happy with that.

[In case somebody is interested in the future: I needed this so that I could limit the resources given to users not using Slurm. We have some shared workstations with many cores, and users were oversubscribing the CPUs, so I have installed Slurm to put some order in the executions there. But these machines are not an actual cluster with a login node: the login node is the same as the executing node! So with cgroups I ensure that users connecting via ssh only get the resources equivalent to 3/4 of a core (enough to edit files, etc.) until they submit their jobs via Slurm, at which point they get the full allocation they requested.]

Cheers,
--
Ángel de Vicente
Research Software Engineer (Supercomputing and BigData)
Tel.: +34 922-605-747
Web.: http://research.iac.es/proyecto/polmag/
GPG: 0x8BDC390B69033F52
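On Ubuntu the kernel parameter is typically added via GRUB; a sketch, assuming a stock GRUB setup:

    # In /etc/default/grub, append cgroup_no_v1=all to the existing options, e.g.:
    #   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash cgroup_no_v1=all"

    # Regenerate the GRUB config and reboot
    sudo update-grub
    sudo reboot

    # After the reboot, only the unified (v2) hierarchy should be mounted
    mount | grep cgroup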
[slurm-users] Reservations and groups
Hello all.

I'm trying to define a reservation that only allows users in a group, but it seems I'm missing something:

    [root@slurmctl ~]# scontrol update res reservationname=prj-test groups=res-TEST
    Error updating the reservation: Invalid group id
    slurm_update error: Invalid group id

    [root@slurmctl ~]# getent group res-TEST
    res-TEST:*:1180406822:testuser

The group comes from AD via sssd. What am I missing?

TIA

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
Re: [slurm-users] Reservations and groups
Ok, PEBKAC :)

When creating the reservation, I had set account=root. Just adding "account=" to the update fixed both errors.

Sorry for the noise.

Diego

On 04/05/2023 07:51, Diego Zuccato wrote:
> Hello all.
>
> I'm trying to define a reservation that only allows users in a group, but it
> seems I'm missing something:
>
> [root@slurmctl ~]# scontrol update res reservationname=prj-test groups=res-TEST
> Error updating the reservation: Invalid group id
> slurm_update error: Invalid group id
> [...]

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
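For anyone hitting the same thing, the working sequence might look roughly like this (a sketch; reservation parameters such as duration and node count are placeholders, and it assumes the group restriction should not be combined with an accounts restriction):

    # Create a reservation restricted to a group, with no accounts= restriction
    scontrol create reservation reservationname=prj-test groups=res-TEST \
             starttime=now duration=120 nodecnt=1

    # Or, if it was created with accounts=root, clear the accounts restriction
    # in the same update that adds the group
    scontrol update reservationname=prj-test accounts= groups=res-TEST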