Hi all:
When designing restrictions in job_submit.lua, I found there is no member of
the job_desc struct that can directly be used to determine the number of nodes
finally allocated to a job. job_desc.min_nodes seems to be a close answer, but
it is 0xFFFE when the user does not specify the --nodes option. The
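A minimal sketch of the kind of check we are attempting in slurm_job_submit()
(the sentinel constant is only the value we observed when --nodes is omitted;
treat it as an assumption that may differ on other versions):

function slurm_job_submit(job_desc, part_list, submit_uid)
    -- value observed when the user omits --nodes; treated as "unset" (assumption)
    local NODES_UNSET = 0xFFFE
    if job_desc.min_nodes ~= nil and job_desc.min_nodes < NODES_UNSET then
        slurm.log_info("slurm_job_submit: uid %s asked for at least %s node(s)",
                       tostring(submit_uid), tostring(job_desc.min_nodes))
    else
        slurm.log_info("slurm_job_submit: uid %s gave no explicit node count",
                       tostring(submit_uid))
    end
    return slurm.SUCCESS
end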
Hi all,
Recently we found a problem caused by too many CG (completing) jobs. When a
user continuously submits small jobs which complete quickly, the number of
RUNNING and PENDING jobs is restricted by MaxJobs and MaxSubmitJobs in the
user's association, but Slurm does not count the CG jobs. Because we set an
epilog to collect s
Hello, Kamil Wilczek:
Well, I agree that the non-responding case may be caused by network
instability, since our Slurm cluster has two groups of nodes at geographically
distant sites linked only by Ethernet. The reported nodes are all in one
building while the slurmctld node is in another building.
But
Hi, all:
Recently we found some strange log messages in slurmctld.log about nodes not
responding, such as:
[2022-07-09T03:23:10.692] error: Nodes node[128-168,170-178] not responding
[2022-07-09T03:23:58.098] Node node171 now responding
[2022-07-09T03:23:58.099] Node node165 now responding
[2022-07-0
Hi, all:
We noticed that slurmdbd provides the conf option DbdBackupHost for setting up
a secondary slurmdbd node. Since slurmdbd is closely tied to the database, we
wonder whether multiple slurmdbd instances introduce the danger of split-brain,
which is a common topic in database high-availability discussions. Wi
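For reference, the kind of primary/backup pair being described would be
configured roughly like this (hostnames are placeholders):

# slurmdbd.conf on both slurmdbd nodes
DbdHost=dbd1
DbdBackupHost=dbd2
StorageType=accounting_storage/mysql
StorageHost=dbhost

# slurm.conf, so that slurmctld knows about both daemons
AccountingStorageHost=dbd1
AccountingStorageBackupHost=dbd2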
Hi all:
We found that slurmctld keeps logging error messages such as:
[2022-06-16T04:01:20.219] error: _remove_accrue_time_internal: QOS normal
accrue_cnt underflow
[2022-06-16T04:01:20.219] error: _remove_accrue_time_internal: QOS normal
acct acct-ioomj accrue_cnt underflow
[2022-06-16T04:01:20.219] err
Well, after increasing the slurmctld log level to debug, we did find some
errors related to munge, like:
[2022-06-04T15:17:21.258] debug: auth/munge: _decode_cred: Munge decode
failed: Failed to connect to "/run/munge/munge.socket.2": Resource temporarily
unavailable (retrying ...)
But when testing m
Hi, all:
Our cluster is set up with 2 Slurm control nodes, and scontrol show config is as below:
> scontrol show config
.
SlurmctldHost[0]= slurm1
SlurmctldHost[1]= slurm2
StateSaveLocation = /etc/slurm/state
.
Of course we have made sure both nodes have the same slurm conf and mo
Hi, all:
We need to detect some problems at the job-end point, so we wrote a detection
script in the Slurm epilog, which should drain the node if the check does not
pass.
I know that exiting the epilog with a non-zero code will make Slurm
automatically drain the node, but in that case the drain reason will all be mar
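One possible approach (a sketch only, with a hypothetical site-local check
script) is to drain the node from the epilog itself with an explicit reason and
still exit 0:

#!/bin/bash
# slurm epilog sketch: drain this node with a descriptive reason when the
# site check fails, instead of relying on a non-zero epilog exit code
if ! /usr/local/sbin/check_node_health; then    # hypothetical check script
    scontrol update NodeName="$SLURMD_NODENAME" State=DRAIN \
        Reason="epilog check failed after job $SLURM_JOB_ID"
fi
exit 0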
Well, this is indeed the point. We didn't set ConstrainDevices=yes in
cgroup.conf. After adding it, the GPU restriction works as expected.
But what is the relation between GPU restriction and cgroups? I have never
heard that cgroups can limit GPU card usage. Isn't that a feature of CUDA or
the NVIDIA driver?
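As far as I understand, with TaskPlugin=task/cgroup in slurm.conf and
ConstrainDevices=yes in cgroup.conf, Slurm uses the cgroup devices controller
to deny a job access to device files (such as /dev/nvidiaN) that were not
allocated to it through GRES, which is why the restriction works independently
of CUDA or the NVIDIA driver. A minimal cgroup.conf sketch for this (the lines
besides ConstrainDevices are only illustrative):

ConstrainDevices=yes
ConstrainCores=yes
ConstrainRAMSpace=yes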
Hi, all:
We found a problem where a Slurm job submitted with an argument such as
--gres gpu:1 is not restricted in its GPU usage; the user can still see all
GPU cards on the allocated nodes.
Our GPU nodes have 4 cards and their gres.conf is:
> cat /etc/slurm/gres.conf
Name=gpu Type=NVlink_A100_40GB File=/dev/nvid
Hi all:
We encountered a strange bug when querying job history using sacct. As shown
below, we try to list user hpczbzt's jobs, and sacct does filter the right jobs
belonging to this user, but their username is displayed as phywht.
> sacct -X --user=hpczbzt
--format=jobid%16,jobidraw,user,uid,partiti
Well, 'sacctmgr modify cluster name=***' is exactly what we want, and
inspired by this command, we found that 'sacctmgr show cluster' can
clearly list all the cluster associations.
But during testing we found another problem. When a limit is defined at both
the cluster level and the user level, the sma
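For the archive, the commands in question look roughly like this (the cluster
name and limit values are placeholders, and the available format fields may
differ between Slurm versions):

> sacctmgr show cluster format=Cluster,MaxJobs,MaxSubmitJobs,GrpTRES
> sacctmgr modify cluster name=mycluster set MaxJobs=500 MaxSubmitJobs=1000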
Hi all,
According to the Resource Limits page (
https://slurm.schedmd.com/resource_limits.html ), there is a Root/Cluster
association level under the account level to provide default limits. But how
can we check or modify this "cluster association"? Using the command sacctmgr
show association, I can only lis
Well, you got the point. We didn't configure LDAP on the Slurm database node.
After configuring LDAP authorization, the PrivateData option finally worked as
expected.
Thanks for the assistance.
From: Brian Andrus
Sent: July 1, 2021 21:57
To: taleinterve...@sjtu.edu.cn
Cc: slurm-users@lists.schedmd
I can confirm the test job is running (of course within the default time
window) when doing the sacct query, and here is a new test record which
describes it more clearly:
[2021-07-01T16:02:42+0800][hpczty@cas013] ~/downloads> sbatch testjob.sh
Submitted batch job 6955371
[2021-07-01T16:02:48+0
Hello,
We found strange behavior with sacct and the PrivateData option of slurmdbd.
Our original configuration sets "PrivateData =
accounts,jobs,usage,users,reservations" in slurm.conf and does not set
"PrivateData" in slurmdbd.conf. At this point, a common user can see all
other users' job informat
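As far as I understand, sacct queries go through slurmdbd, so hiding job
records also requires PrivateData in slurmdbd.conf; a sketch of the line we
would expect there (values illustrative):

# slurmdbd.conf
PrivateData=accounts,jobs,usage,users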
Thanks for the help. We tried reducing sched_interval and the pending time
decreased as expected.
But the influence of 'sched_interval' is global; setting it too small may
put pressure on the slurmctld server. Since we only want a quick response on
the debug partition (which is designed to let users fre
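For reference, lowering sched_interval is done through SchedulerParameters in
slurm.conf; a sketch with an illustrative value:

# run the main scheduling loop more often than the 60-second default
SchedulerParameters=sched_interval=10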
Hello,
Recently we noticed a strange delay from job submission to job start even
though the partition is sure to have enough idle nodes to meet the job's
demand. To avoid interference, we used the 4-node debug partition for the
test, which did not have any other jobs to run. And the test job script is also
Hello,
Because I'm not sure about the relationship between the fields of the
job_desc structure and the sbatch parameters, I want to print all the fields
and their values in job_desc when testing job_submit.lua. But the following
code added to job_submit.lua failed to iterate through job_desc; the for loop print
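As far as I understand, job_desc is exposed to Lua as userdata, so a generic
pairs() loop cannot enumerate its fields. A workaround sketch that prints an
explicit list of field names (the names listed are assumptions taken from
common job_submit.lua examples; adjust them for your version):

function slurm_job_submit(job_desc, part_list, submit_uid)
    -- pairs()/ipairs() will not walk a userdata object, so list fields by hand
    local fields = { "name", "partition", "account", "qos",
                     "min_nodes", "max_nodes", "num_tasks", "time_limit" }
    for _, f in ipairs(fields) do
        slurm.log_info("job_desc.%s = %s", f, tostring(job_desc[f]))
    end
    return slurm.SUCCESS
end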
Well, maybe the example in my first mail caused some misunderstanding. We just
use sacct to check some job records manually during the maintenance process
after the system fault. Our accounting and billing system is a commercial
product which unfortunately also does not provide the ability to adjust billing ra
Thanks for the help. The doc page is useful and we can get the actual job id
now.
The reason we need to delete job records from the database is that our billing
system calculates user cost from these historical records. But after a Slurm
system fault there will be some specific jobs which should not
Hello,
The question background is:
From a query command such as 'sacct -j 123456' I can see a series of jobs
named 123456_1, 123456_2, etc., and I need to delete these job records from
the MySQL database for some reason.
But in the job_table of slurmdb, there is only one record with id_job=123456.
n
Hello,
Our Slurm cluster manages about 600+ nodes and I tested setting
HealthCheckNodeState=CYCLE in slurm.conf. According to the conf manual, setting
this to CYCLE should cause Slurm to "cycle through running on all compute
nodes through the course of the HealthCheckInterval". So I set
"HealthCheckI