[slurm-users] Unexpected negative NICE values

2023-05-03 Thread Sebastian Potthoff

Hello all,

I am encountering some unexpected behavior where the jobs (queued &
running) of one specific user have negative NICE values and therefore an
increased priority. The user is not privileged in any way and cannot
explicitly set the nice value to a negative value, e.g. by adding
"--nice=-INT". There are also no QoS which would allow this (is this
even possible?). The cluster is using the "priority/multifactor" plugin
with weights set for Age, FairShare and JobSize.
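
For context, the relevant part of our slurm.conf looks roughly like this (the
weight values below are placeholders, not our actual settings):

PriorityType=priority/multifactor
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000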


This is the only user on the whole cluster where this occurs. From what
I can tell, he/she is not doing anything out of the ordinary. However,
in the job scripts the user does set a nice value of "0". The user also
uses a "strategy" where he/she submits the same job to multiple
partitions and, as soon as one of these jobs starts, all other jobs
(with the same job name) are put on hold.


Does anyone have an idea how this could happen? Does Slurm internally 
adjust the NICE values in certain situations? (I searched the sources 
but couldn't find anything that would suggest this).


Slurm version is 23.02.1

Example squeue output:

[root@mgmt ~]# squeue -u USERID -O JobID,Nice
JOBID   NICE
14846760    -5202
14846766    -8988
14913146    -13758
14917361    -15103


Any hints are appreciated.

Kind regards
Sebastian





Re: [slurm-users] slurm-users Digest, Vol 65, Issue 38

2023-05-03 Thread Thomas Arildsen

Hi Mike

Thanks for the suggestion. I think something else may be missing here on 
my end. With `sacct` I can actually get the usage of individual jobs with 
TRES information, but something else must be causing GPU usage not to be 
included in the information I get.
When I include the "--allocations" option, the TRES information 
disappears from my output.
In any case, I think this approach would essentially re-implement what 
`sreport` does, so I will look further into making `sreport` work 
for me.


Best regards,

Thomas

On 27.03.2023 at 11:07, slurm-users-requ...@lists.schedmd.com wrote:

Date: Sun, 26 Mar 2023 10:13:09 -0400
From: Mike Mikailov
To: Slurm User Community List
Cc:t...@its.aau.dk
Subject: Re: [slurm-users] Getting usage reporting from sacct/sreport
Message-ID:<06fe0d12-9ce0-46b0-9a07-d8f8a0435...@gmail.com>
Content-Type: text/plain; charset=utf-8

Hi Thomas et al,

I have just written a Linux shell script which does exactly what you are asking 
for.

Please use the '--allocations' option of the sacct command to generate 
aggregated resource usage per user.

You may also use the awk command to summarize all CPU usage.

A more advanced awk command can also summarize all GPU usage.
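
A minimal sketch of that kind of aggregation (the dates and the choice of
fields are just an example):

sacct -a --allocations -S 2023-03-01 -E 2023-04-01 \
      -o User,ElapsedRaw,AllocCPUS -P --noheader |
awk -F'|' '$2 != "" && $3 != "" { sec[$1] += $2 * $3 }
           END { for (u in sec) printf "%s %.1f CPU-hours\n", u, sec[u]/3600 }'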

I have also placed the script on GitHub, but it is private for now until we 
clear it for public release.

Trackable resources (TRES) normalization, along with trackable resource 
weights, is needed for fairer usage reports. In this case the 'billing' value 
represents a combined billing unit (the max or sum of the individual trackable 
resources). Note that by default this value equals the number of CPUs used.

Thanks,
-Mike
USA




Re: [slurm-users] slurm-users Digest, Vol 65, Issue 38

2023-05-03 Thread Thomas Arildsen

Hi Jürgen

Thanks for your feedback. I think you are right that I should probably 
be using `sreport` for this. I think there must be some other reason 
that `sreport` is not showing me any actual output. Perhaps the 
explanation could be that we currently do not have users organised in 
accounts. We just have one big pile of users. I will look further into this.


Best regards,

Thomas

On 27.03.2023 at 11:07, slurm-users-requ...@lists.schedmd.com wrote:

Date: Sun, 26 Mar 2023 17:49:06 +0200
From: Juergen Salk
To: Slurm User Community List
Subject: Re: [slurm-users] Getting usage reporting from sacct/sreport
Message-ID:<20230326154906.gd80...@qualle.rz.uni-ulm.de>
Content-Type: text/plain; charset="iso-8859-1"

Hi Thomas,

I think sreport should actually do what you want out of the box, provided
you have permission to retrieve that information for users other than
yourself.

In my understanding, sacct is meant for individual job and job step
accounting, while sreport is more suitable for aggregated cluster usage
accounting. Thus, sreport also accounts for reservation hours, which
sacct does not.

sreport should also be able to report on consumed GRES-hours, such as
GPU hours in your case, but you'll probably have to use the '-T' option in
order to include that information in the report.
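
For example, a query of that kind might look roughly like this (the dates,
user name and TRES list are placeholders):

sreport -t Hours -T cpu,gres/gpu cluster AccountUtilizationByUser \
        start=2023-03-01 end=2023-04-01 users=someuser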

In case it matters, our AccountingStorageTRES looks like this:

AccountingStorageTRES=gres/scratch,gres/gpu

(We also account for local scratch space allocations as a GRES.)

These are the commands that we usually point our users to when
they ask for their historical resource utilization:

   
https://wiki.bwhpc.de/e/BwForCluster_JUSTUS_2_Slurm_HOWTO#How_to_retrieve_historical_resource_usage_for_a_specific_user_or_account.3F

(But omit 'user=' or 'account=' for a report on all
users or accounts.)

Hope that helps.

Best regards
Jürgen




Re: [slurm-users] Unexpected negative NICE values

2023-05-03 Thread Juergen Salk
Hi Sebastian,

maybe it's a silly thought on my part, but do you have the
`enable_user_top` option included in your SchedulerParameters
configuration?

This would allow regular users to use `scontrol top <job_list>` to
push some of their jobs ahead of other jobs owned by them; internally
this works by adjusting the nice values of the specified jobs.

I may be totally wrong, but if I remember correctly it is generally not
recommended to configure SchedulerParameters=enable_user_top, because
regular use of `scontrol top` by users is (or was?) known to introduce
bad side effects in certain scenarios: it would allow users to push
pending jobs ahead of normal jobs in the queue (including other users'
jobs) if just one of their jobs already has a negative nice value
assigned, e.g. by an administrator.
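
For reference, a quick way to check this on your cluster (the output format
may differ slightly between versions):

# Show the configured scheduler parameters:
scontrol show config | grep -i SchedulerParameters

# What a user would run to reorder their own pending jobs
# (job ID taken from your squeue output, purely as an example):
scontrol top 14846760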

Best regards
Jürgen


* Sebastian Potthoff  [230503 10:36]:
> Hello all,
> 
> I am encountering some unexpected behavior where the jobs (queued & running)
> of one specific user have negative NICE values and therefore an increased
> priority. The user is not privileged in any way and cannot explicitly set
> the nice value to a negative value by e.g. adding "--nice=-INT" . There are
> also no QoS which would allow this (is this even possible?). The cluster is
> using the "priority/multifactor" plugin with weights set for Age, FaireShare
> and JobSize.
> 
> This is the only user on the whole cluster where this occurs. From what I
> can tell, he/she is not doing anything out of the ordinary. However, in the
> job scripts the user does set a nice value of "0". The user also uses some
> "strategy" where he/she submits the same job to multiple partitions and, as
> soon as one of these jobs starts, all other jobs (with the same jobname)
> will be set on "hold".
> 
> Does anyone have an idea how this could happen? Does Slurm internally adjust
> the NICE values in certain situations? (I searched the sources but couldn't
> find anything that would suggest this).
> 
> Slurm version is 23.02.1
> 
> Example squeue output:
> 
> [root@mgmt ~]# squeue -u USERID -O JobID,Nice
> JOBID   NICE
> 14846760    -5202
> 14846766    -8988
> 14913146    -13758
> 14917361    -15103
> 
> 
> Any hints are appreciated.
> 
> Kind regards
> Sebastian
> 




Re: [slurm-users] unable to kill namd3 process

2023-05-03 Thread Shaghuf Rahman
Hi,

As an update, we tried one approach; please find it below:

We tried adding the script below to our epilog script to kill the remaining
namd3 processes.

# Kill any processes remaining from the job (only for the affected user).
if [ "$SLURM_UID" = "1234" ]; then
    # Collect the PIDs that slurmd still associates with this job.
    STUCK_PIDS=$("${SLURM_BIN}scontrol" listpids "$SLURM_JOB_ID" | awk '{print $1}' | grep -v PID)
    for kpid in $STUCK_PIDS; do
        kill -9 "$kpid"
    done
fi

but it didn't work out, as it was unable to fetch the required PIDs with the
"scontrol listpids" command.

It looks like the slurmd had a problem with a job step that didn't end
correctly, and the slurmd wasn't able to kill it after the timeout was
reached.
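
For reference, the "unkillable" handling we tried refers to slurm.conf
settings along these lines (the values and the program path are only
examples, not our actual configuration):

# Give slurmd more time before declaring a job step unkillable, and run a
# helper program when that happens:
UnkillableStepTimeout=120
UnkillableStepProgram=/usr/local/sbin/report_unkillable.sh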

Any help would be much appreciated.

Thanks,
Shaghuf Rahman


On Tue, Apr 25, 2023 at 8:32 PM Shaghuf Rahman  wrote:

> Hi,
>
> Also, I forgot to mention that the process is still running after the user does
> scancel, and the epilog does not clean up when one job finishes during a
> multiple-job submission.
> We tried to use the unkillable option but it did not work. The process still
> remains until we kill it manually.
>
>
>
> On Tue, 25 Apr 2023 at 19:57, Shaghuf Rahman  wrote:
>
>> Hi,
>>
>> We are facing an issue in my environment and the behaviour looks strange
>> to me. It is specifically associated with the namd3 application.
>> The issue is described below, along with the cases I have observed.
>>
>> I am trying to understand how to kill the processes of the namd3
>> application submitted through sbatch without putting the node into the drain state.
>>
>> What I observed is that when a user submits a single job on a node and then
>> does scancel on the namd3 job, the job is killed, the node returns to the idle
>> state and everything looks as expected.
>> But when the user submits multiple jobs on a single node and does scancel on
>> one of them, the node is put into the drain state. However, the other jobs keep
>> running fine without an issue.
>>
>> Due to this issue, multiple nodes end up in the drain state whenever a user
>> does scancel on a namd3 job.
>>
>> Note: when the user does not perform scancel, all jobs run successfully
>> and the node states are also fine.
>>
>> This does not happen with any other application, so we suspect the issue
>> could be with the namd3 application.
>> Kindly suggest a solution or any ideas on how to fix this issue.
>>
>> Thanks in advance,
>> Shaghuf Rahman
>>
>>


Re: [slurm-users] Unexpected negative NICE values

2023-05-03 Thread Sebastian Potthoff

Hi Jürgen,

This was it! Thank you so much for the hint! I did not know about the 
"top" command and was also not aware that this option was enabled in our 
slurm.conf.


Thanks for the help!

Sebastian


On 03.05.23 12:10, Juergen Salk wrote:

Hi Sebastian,

maybe it's a silly thought on my part, but do you have the
`enable_user_top` option included in your SchedulerParameters
configuration?

This would allow regular users to use `scontrol top <job_list>` to
push some of their jobs ahead of other jobs owned by them; internally
this works by adjusting the nice values of the specified jobs.

I may be totally wrong, but if I remember correctly it is generally not
recommended to configure SchedulerParameters=enable_user_top, because
regular use of `scontrol top` by users is (or was?) known to introduce
bad side effects in certain scenarios: it would allow users to push
pending jobs ahead of normal jobs in the queue (including other users'
jobs) if just one of their jobs already has a negative nice value
assigned, e.g. by an administrator.

Best regards
Jürgen


* Sebastian Potthoff  [230503 10:36]:

Hello all,

I am encountering some unexpected behavior where the jobs (queued & running)
of one specific user have negative NICE values and therefore an increased
priority. The user is not privileged in any way and cannot explicitly set
the nice value to a negative value by e.g. adding "--nice=-INT" . There are
also no QoS which would allow this (is this even possible?). The cluster is
using the "priority/multifactor" plugin with weights set for Age, FaireShare
and JobSize.

This is the only user on the whole cluster where this occurs. From what I
can tell, he/she is not doing anything out of the ordinary. However, in the
job scripts the user does set a nice value of "0". The user also uses some
"strategy" where he/she submits the same job to multiple partitions and, as
soon as one of these jobs starts, all other jobs (with the same jobname)
will be set on "hold".

Does anyone have an idea how this could happen? Does Slurm internally adjust
the NICE values in certain situations? (I searched the sources but couldn't
find anything that would suggest this).

Slurm version is 23.02.1

Example squeue output:

[root@mgmt ~]# squeue -u USERID -O JobID,Nice
JOBID   NICE
14846760    -5202
14846766    -8988
14913146    -13758
14917361    -15103


Any hints are appreciated.

Kind regards
Sebastian


--
Westfälische Wilhelms-Universität (WWU) Münster
WWU IT
Sebastian Potthoff, M.Sc. (eScience/HPC)
Röntgenstraße 7-13, R.207/208
48149 Münster
Tel. +49 251 83-31640
E-Mail: s.potth...@uni-muenster.de
Website: www.uni-muenster.de/it





[slurm-users] Best way to accurately calculate the CPU usage of an account when using fairshare?

2023-05-03 Thread Joseph Francisco Guzman
Good morning,

We have at least one billed account right now, where the associated researchers 
are able to submit jobs that run against our normal queue with fairshare, but 
not for an academic research purpose. So we'd like to accurately calculate 
their CPU hours. We are currently using a script to query the db with sacct and 
sum up the value of ElapsedRaw * AllocCPUS for all jobs. But this seems 
limited, because requeueing will create what the sacct man page calls 
duplicates. By default, jobs normally get requeued only when something outside 
of the user's control happens, like a NODE_FAIL, or when an scontrol command 
requeues them manually; and although I think users can requeue jobs themselves, 
it's not a feature we've seen our researchers use.

However, with the new scrontab feature, whenever a cron job is executed more than 
once, sacct reports that the previous jobs were "requeued" and they are only visible 
by looking up duplicates. I haven't seen any billed account use requeueing or 
scrontab yet, but it's clear to me that it could become significant once 
researchers start using scrontab more. Scrontab has existed since one of the 
releases from 2020, I believe, but we enabled it this year and see it as much 
more powerful than the traditional Linux crontab.

What would be the best way to more thoroughly calculate ElapsedRaw * AllocCPUS, 
to account for duplicates, but optionally ignore unintentional requeueing like 
from a NODE_FAIL?

Here's the main loop of the simple bash script I have now:

while IFS='|' read -r end elapsed cpus; do
    # If a job crosses the month boundary,
    # the entire bill is put under the second month.
    year_month="${end:0:7}"
    if [[ ! "$elapsed" =~ ^[0-9]+$ ]] || [[ ! "$cpus" =~ ^[0-9]+$ ]]; then
        continue
    fi
    core_seconds["$year_month"]=$(( core_seconds["$year_month"] + (elapsed * cpus) ))
done < <(sacct -a -A "$SLURM_ACCOUNT" \
               -S "$START_DATE" \
               -E "$END_DATE" \
               -o End,ElapsedRaw,AllocCPUS -X -P --noheader)

Our slurmdbd is configured to keep 6 months of data.

It makes sense to loop through the job IDs instead, using sacct's 
-D/--duplicates option each time to reveal the hidden duplicates in the 
REQUEUED state, but I'm interested in whether there are alternatives or 
whether I'm missing anything here.
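
A one-pass variation of that idea might look like this (untested, and the
NODE_FAIL filtering may well need adjustment):

declare -A core_seconds
while IFS='|' read -r end elapsed cpus state; do
    # Skip records from unintentional requeues such as node failures.
    [[ "$state" == NODE_FAIL* ]] && continue
    [[ "$elapsed" =~ ^[0-9]+$ && "$cpus" =~ ^[0-9]+$ ]] || continue
    year_month="${end:0:7}"
    core_seconds["$year_month"]=$(( core_seconds["$year_month"] + elapsed * cpus ))
done < <(sacct -a -A "$SLURM_ACCOUNT" -D \
               -S "$START_DATE" -E "$END_DATE" \
               -o End,ElapsedRaw,AllocCPUS,State -X -P --noheader)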

Thanks,

Joseph


--
Joseph F. Guzman - ITS (Advanced Research Computing)

Northern Arizona University

joseph.f.guz...@nau.edu


Re: [slurm-users] Several slurmdbds against one mysql server?

2023-05-03 Thread Angel de Vicente
Hello,

Angel de Vicente  writes:

> And hence my question.. because as I was saying in a previous mail,
> reading the documentation I understand that this is the standard way to
> do it, but right now I got it working the other way: in each cluster I
> have one slurmdbd daemon that connects with a single mysqld daemon in a
> third machine (option 2 from my question).

just to wrap up in case somebody gets here in the future...

Both options I was considering are really easy to set up, but in the end
I went for the standard way of having only one slurmdbd daemon (and so
far not the Federated configuration, just the Multi-Cluster one) for, at
least, the following reasons (I'm sure there are plenty more):

+ it is the default way in the Slurm documentation
+ it is probably more secure, because I don't have to open the mysql
  server to remote connections. 
+ I found at least one problem with multiple slurmdbd daemons: adding a
  new cluster and then, for example, adding a new user was properly
  reflected in the database, but the clusters already running would not
  see the new accounts until their slurmdbd was restarted (this is not
  an issue when there is only one slurmdbd).
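
For future readers, the single-slurmdbd layout described above boils down to
something like this (hostnames and the cluster name are placeholders):

# slurmdbd.conf on the accounting host:
DbdHost=dbhost
StorageType=accounting_storage/mysql
StorageHost=localhost

# slurm.conf on each cluster, all pointing at the same slurmdbd:
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbhost
ClusterName=cluster1        # must be unique per cluster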

Cheers,
-- 
Ángel de Vicente
 Research Software Engineer (Supercomputing and BigData)
 Tel.: +34 922-605-747
 Web.: http://research.iac.es/proyecto/polmag/

 GPG: 0x8BDC390B69033F52




Re: [slurm-users] Problem with cgroup plugin in Ubuntu22.04 and slurm 21.08.5

2023-05-03 Thread Angel de Vicente
Hello,

Angel de Vicente  writes:

> | slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are:
> | 5:freezer:/
> | 3:cpuacct:/

in the end I learnt that despite Ubuntu 22.04 reporting that it uses only
cgroup v2, it was also using v1 and creating those mount points, and
Slurm 23.02.1 was then complaining that it could not work with cgroups
in hybrid mode.

So the "solution" (as long as you don't need v1 for some reason) was to
add "cgroup_no_v1=all" to the kernel parameters and reboot: no more v1
mount points, and Slurm was happy with that.
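
In practice that means roughly the following (assuming a standard GRUB setup;
adjust for your bootloader):

# /etc/default/grub -- append the parameter to the existing kernel command line:
GRUB_CMDLINE_LINUX_DEFAULT="... cgroup_no_v1=all"

# Then regenerate the GRUB configuration and reboot:
sudo update-grub
sudo reboot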

[In case somebody is interested in the future: I needed this so that I
could limit the resources given to users not using Slurm. We have some
shared workstations with many cores, and users were oversubscribing the
CPUs, so I installed Slurm to bring some order to the executions
there. But these machines are not an actual cluster with a login node:
the login node is the same as the executing node! So with cgroups I
ensure that users connecting via ssh only get resources equivalent
to 3/4 of a core (enough to edit files, etc.) until they submit their
jobs via Slurm, at which point they get the full allocation they requested.]

Cheers,
-- 
Ángel de Vicente
 Research Software Engineer (Supercomputing and BigData)
 Tel.: +34 922-605-747
 Web.: http://research.iac.es/proyecto/polmag/

 GPG: 0x8BDC390B69033F52




[slurm-users] Reservations and groups

2023-05-03 Thread Diego Zuccato

Hello all.

I'm trying to define a reservation that only allows users in a group, 
but it seems I'm missing something:


[root@slurmctl ~]# scontrol update res reservationname=prj-test groups=res-TEST

Error updating the reservation: Invalid group id
slurm_update error: Invalid group id
[root@slurmctl ~]# getent group res-TEST
res-TEST:*:1180406822:testuser

The group comes from AD via sssd.

What am I missing?
TIA

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786



Re: [slurm-users] Reservations and groups

2023-05-03 Thread Diego Zuccato

Ok, PEBKAC :)

When creating the reservation, I had set account=root. Just adding
"account=" to the update fixed both errors.


Sorry for the noise.

Diego

Il 04/05/2023 07:51, Diego Zuccato ha scritto:

Hello all.

I'm trying to define a reservation that only allows users in a group, 
but it seems I'm missing something:


[root@slurmctl ~]# scontrol update res reservationname=prj-test groups=res-TEST

Error updating the reservation: Invalid group id
slurm_update error: Invalid group id
[root@slurmctl ~]# getent group res-TEST
res-TEST:*:1180406822:testuser

The group comes from AD via sssd.

What am I missing?
TIA



--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786