Good Morning;
This is not a Slurm issue; it is standard shell-script behavior. If
you want the script to wait until all background processes finish, you
should add a wait command after them.
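As a minimal sketch of the pattern (the background commands here are
placeholders):

```shell
#!/bin/bash
# Launch several background tasks; without "wait", the script would
# exit as soon as the loop ends, while the tasks are still running.
for i in 1 2 3; do
    (sleep 1; echo "task $i done") &
done
wait    # blocks until every background child has exited
echo "all tasks finished"
```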
Regards;
C. Ahmet Mercan
On 26.07.2024 10:23, Steffen Grunewald via slurm-users wrote:
Good morning
C. Ahmet Mercan
On 30.05.2024 16:53, Radhouane Aniba via slurm-users wrote:
Yes I can connect to my database using mysql --user=slurm
--password=slurmdbpass slurm_acct_db and there is no firewall
blocking mysql after checking the firewall question
Also h
Did you try to connect to the database using the mysql command?
mysql --user=slurm --password=slurmdbpass slurm_acct_db
C. Ahmet Mercan
On 30.05.2024 14:48, Radhouane Aniba via slurm-users wrote:
Thank you Ahmet,
I don't have an active firewall.
And because slurmdbd cannot connect to the database, I am
C. Ahmet Mercan
On 30.05.2024 00:05, Radhouane Aniba via slurm-users wrote:
Hi everyone
I am trying to get slurmdbd to run on my local home server but I am
really struggling.
Note: I am a novice Slurm user.
My slurmdbd always times out even though all the details in the conf
file are correct.
My log
Why not use a specific queue instead of the specific feature? A queue
is an object for waiting on resources; it is ready-made for this purpose.
When the required resources become available, the jobs will start.
Regards;
Ahmet M.
On 29.09.2022 22:27, Groner, Rob wrote:
I'm trying to se
Hi;
The Epilog script is invoked by the slurm user on the job's node. Who is
your slurm user? Does the slurm user have the right to read and execute
your epilog script? Did you check the slurmctld logs?
Also, instead of using the /tmp directory, if you can use a shared
directory, you can look for the
Hi;
You can look at the Slurm code for information:
https://github.com/SchedMD/slurm/blob/master/src/common/slurm_protocol_defs.c#L3838
"ALLOCATED + DRAIN" and "MIX + DRAIN" are the same; the others are
different. There are also some other flags which can change the status keywords.
Regards;
Ahmet M.
Hi;
We don't modify or use the SuspendExcNodes parameter, or the Slurm
power-saving feature at all. Because of this, we don't need to
reconfigure Slurm; we use our script as a separate solution.
You can find the script on my GitHub page:
https://github.com/mercanca/powerSave
But I did not add enoug
Hi;
For the same reasons you mentioned, I don't use the Slurm power-saving
features. I want to keep a certain number of nodes always powered on and
ready to run. The Slurm settings are very limited; only the SuspendExcNodes
and SuspendExcParts parameters exist. But SuspendExcNodes is totally
usel
Instead of using the epilog script, you can use the -d
(--dependency) feature of sbatch:
https://slurm.schedmd.com/sbatch.html
It supports running a job only after multiple other jobs have finished.
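A sketch of chaining jobs this way (step1.sh, step2.sh, and post.sh are
hypothetical job scripts):

```
# --parsable makes sbatch print only the job ID
jid1=$(sbatch --parsable step1.sh)
jid2=$(sbatch --parsable step2.sh)
# post.sh starts only after BOTH jobs finish successfully
sbatch --dependency=afterok:${jid1}:${jid2} post.sh
```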
Regards;
Ahmet M.
On 10.05.2022 00:50, David Henkemeyer wrote:
Prologue is a feature whereby
Hi;
The Slurm log says that your prolog did not finish within 300 seconds.
The only possible cause I see is the line starting with "sudo
/usr/bin/beeond start -F -P -b /usr/bin/pdsh".
You can put a timeout command at the beginning of the sudo line to test:
timeout 150 sudo /usr/bin/beeond start -
Hi;
You can use spart to list only the partitions a user has access to that
are in the 'UP' state (along with other limiting factors such as partition
limits, allow/deny groups, QOS, etc.):
https://github.com/mercanca/spart
It is a user-oriented partition info command for Slurm. Also, it gi
Hi;
The EnforcePartLimits parameter in slurm.conf should be set to ALL or
ANY to enforce the partition's time limit.
Regards.
Ahmet M.
On 2.12.2021 16:18, Gestió Servidors wrote:
Hello,
I'm going to report a problem I have detected in my SLURM cluster. If I
configure a partition with a "Tim
Hi;
Slurm selects nodes according to the nodes' weight parameter. I don't
know of any setting to change the node-selection behavior other than
changing the weight values, but that is not suitable for selecting
nodes randomly.
Fortunately, absolutely there is not a
The Partitions= option is only valid for "sreport job".
Ref:
https://slurm.schedmd.com/sreport.html
Ahmet M.
On 10.11.2021 18:56, Bill Wichser wrote:
I can't seem to figure out how to do a query against a partition.
sreport cluster AccountUtilizationByUser user=bill cluster=della, no
Hi;
Please check the StateSaveLocation directory: it should be readable and
writable by both slurmctld nodes, and it should be a single shared
directory, not two local directories.
The explanation below is taken from the Slurm web site:
"The backup controller recovers state information from the
StateSa
Hi;
Did you check the slurmctld log for a complaint about the host line? If
slurmctld cannot recognize a parameter, it may give up processing the
whole host line.
Ahmet M.
On 20.07.2021 13:49, Diego Zuccato wrote:
Hello all.
It's been since yesterday that I'm facing this issue.
I'm co
Hi;
We use a bash script to watch and kill users' processes if they exceed
our CPU and memory limits. This solution also ensures that the total CPU
or memory usage cannot be exceeded, whether by many well-behaved users
or by one bad user:
https://github.com/mercanca/kill_for_loginnode.sh
A
Hi;
Maybe the database no longer fits in the InnoDB buffer. If there is
enough room to increase this value (innodb_buffer_pool_size), you can
try increasing it to find out.
Ahmet M.
On 23.02.2021 17:03, Luke Sudbery wrote:
That great, thanks. We were thinking about staging it lik
Hi;
Prolog and TaskProlog are different parameters and scripts. You should
use the TaskProlog script to set environment variables.
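A minimal TaskProlog sketch: lines printed as "export NAME=value" are
injected into the task's environment, and "print ..." lines go to the
job's stdout (SCRATCH_DIR is a hypothetical example variable):

```shell
#!/bin/bash
# TaskProlog runs once per task, just before the user's task starts.
echo "export SCRATCH_DIR=/tmp/job_${SLURM_JOB_ID:-0}"
echo "print task prolog ran on $(hostname)"
```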
Regards;
Ahmet M.
On 13.02.2021 00:12, Herc Silverstein wrote:
Hi,
I have a prolog script that is being run via the slurm.conf Prolog=
setting. I've verifie
Hi;
We are using the versionlock plugin of yum to protect some packages from
unwanted upgrades. Also, Ole mentioned the "exclude=slurm" option of the
repo file. It is not a problem without a solution. But a package
maintainer is a valuable resource that is hard to find.
Regards,
Ahmet M.
25.01.202
Hi;
I don't know the best way, but if you do not put a login node's name
into any partition, sinfo will not show the node and no job will run on
it just because it has a running slurmd.
Ahmet M.
On 6.01.2021 19:45, Steve Brasier wrote:
Hi all,
For a cluster i
that
variable defined as a hostname, not localhost.
Thanks,
Avery
On Tue, Dec 15, 2020, 1:51 PM mercan wrote:
Hi;
I don't know if this is the problem, but I think setting
"ControlMachine=localhost" and not setting a hostname for
Hi;
I don't know if this is the problem, but I think setting
"ControlMachine=localhost" and not setting a hostname for the Slurm
master node are not good decisions. How can the compute nodes determine
the IP address of the Slurm master node from "localhost"? Also, I
suggest not using capital letters for any
Hi;
There is an explanation at https://slurm.schedmd.com/quickstart_admin.html:
"The configure script in the top-level directory of this distribution
will determine which authentication plugins may be built."
If you have munge, maybe the configure script decided not to compile
the auth/none
Hi;
Did you test the munge connection? If not, would you test it like this:
munge -n | ssh SRVGRIDSLURM02 unmunge
Ahmet M.
On 30.11.2020 14:43, Steve Bland wrote:
Thanks Diego
actually, nothing at all in the hosts file, did not seem to need to
modify it to see the nodes.
the differe
debug: Waiting for job 110's prolog to
complete
[2020-11-18T10:21:10.121] debug: Finished wait for job 110's prolog
to complete
[2020-11-18T10:21:10.121] debug: [job 110] attempting to run epilog
[/cm/local/apps/cmd/scripts/epilog]
[2020-11-18T10:21:10.124] debug: completed epilog fo
Hi;
Check the epilog return value, which comes from the return value of the
last line of the epilog script. You can also add an "exit 0" line at the
end of the epilog script to guarantee a zero return value for testing
purposes.
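A minimal epilog sketch along those lines (the cleanup path is an
example):

```shell
#!/bin/bash
# Epilog: a non-zero exit status here puts the node into DRAIN state,
# so finish with an explicit "exit 0" while testing.
rm -rf "/tmp/job_${SLURM_JOB_ID:-0}"
exit 0
```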
Ahmet M.
On 18.11.2020 20:00, William Markuske wrote:
Hi;
You can submit each pimpleFoam run as a separate job. Or, if you really
want to submit them as a single job, you can use a program such as GNU
parallel to run as many of them at a time as you have CPUs:
https://www.gnu.org/software/parallel/
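A sketch of such a batch script, assuming one case directory per
pimpleFoam run (the case_* directory names and the CPU count are
examples):

```
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
# Run at most $SLURM_CPUS_PER_TASK pimpleFoam cases concurrently
parallel -j "$SLURM_CPUS_PER_TASK" 'cd {} && pimpleFoam > log.pimpleFoam' ::: case_*
```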
Regards;
Ahmet M.
On 10.10.2020 14:05, Max Quast wrote:
Dear sl
duplicate records
- direct insert is working and case sensitive, but scontrol doesn't see change
until slurmctld restart
Regards
Alexey
-Original Message-
From: mercan
Sent: Friday, September 25, 2020 11:16 AM
To: Slurm User Community List ; Tager, Alexey
Subject: RE: [EXTERNAL] [
Hi;
You don't need to modify slurm.conf and reconfigure. There is a
remote/dynamic licenses feature:
https://slurm.schedmd.com/licenses.html
You can add licenses using the sacctmgr command, for example:
sacctmgr add resource name=matlab count=50 server=rlm_host \
servertype=rlm type=license
Regard
Hi;
The Slurm license feature is just a simple counter, nothing more.
It cannot connect to the license server to read or update licenses.
Slurm only counts the licenses in use and subtracts them from the
configured license count; if the result is zero, it does not start new
jobs. The license feature names and
sl
Hi;
If you want, you can use our script; it is very simple, but it works:
https://github.com/mercanca/slurmmail
Regards;
Ahmet M.
On 27.08.2020 08:02, Andrew Elwell wrote:
Hi folks,
I'm getting fed up receiving out-of-office replies to slurm job state mails.
Given that by default slurmctl
esult using:
sacctmgr show assoc where user=foo
Ahmet M.
On 21.08.2020 23:51, mercan wrote:
Hi;
I think you cannot update a user's partition:
https://slurm-dev.schedmd.narkive.com/2UnWaNQJ/setting-a-users-s-partition-with-sacctmgr
It is part of the association, and it can be set a
Hi;
I think you cannot update a user's partition:
https://slurm-dev.schedmd.narkive.com/2UnWaNQJ/setting-a-users-s-partition-with-sacctmgr
It is part of the association, and it can be set as an option when
creating the user:
https://slurm.schedmd.com/accounting.html#database-configuration
'Ac
Hi;
Are you sure this is a job-task-completion issue? When the epilog script
fails, Slurm will set the node to the DRAIN state:
"If the Epilog fails (returns a non-zero exit code), this will result in
the node being set to a DRAIN state"
https://slurm.schedmd.com/prolog_epilog.html
You can test th
Hi;
I think you can use a Pacemaker cluster for a virtual slurmdbd server: a
virtual server which runs both the slurmdbd and mysql services on the
active slurmctld server. When the active slurmctld server dies, you can
try to start them on the passive one.
Regards;
Ahmet M.
On 23.07.2020 19:12
Hi Janna;
It sounds like an ARP cache table problem to me. If your Slurm head node
can reach ~1000 or more network devices (all connected network
cards, switches, etc., even if they are only reachable through different
ports of the server), you need to increase some network settings on the
head node and serve
But don't forget: if there is no script, you cannot retrieve a running
script, for example for salloc jobs.
Ahmet M.
On 19.06.2020 12:39, Adrian Sevcenco wrote:
On 6/19/20 12:35 PM, mercan wrote:
Hi;
For running jobs, you can get the running script with using:
scontrol write ba
Hi;
For running jobs, you can get the running script using the
scontrol write batch_script "$SLURM_JOBID" -
command. The - parameter is required for output to the screen.
Ahmet M.
On 19.06.2020 12:25, Adrian Sevcenco wrote:
On 6/18/20 9:35 AM, Loris Bennett wrote:
Hi Adrain,
Hi
Adrian Sevcenco
Hi;
Did you check the /var/log/messages file for errors? Systemd logs to
this file instead of the slurmctld log file.
Ahmet M.
On 16.06.2020 11:12, Ole Holm Nielsen wrote:
Today we upgraded the controller node from 19.05 to 20.02.3, and
immediately all Slurm commands (on the controller nod
Sorry, I wrongly cropped out the "mkdir" line below:
mkdir -p $JDIR
It should come after the "JDIR=/okyanus/..." line.
Regards;
Ahmet M.
On 23.04.2020 12:31, mercan wrote:
Hi;
I prefer to use epilog script to store the job information to a top
directory owned by the sl
Hi;
I prefer to use an epilog script to store the job information in a top
directory owned by the slurm user. To avoid a directory with too many
files, it creates a sub-directory per thousand job files. For a job
whose jobid is 230988, it creates a directory named 230XXX. Also
the SLURM_
Hi;
Did you restart slurmctld after changing
"PriorityType=priority/multifactor"?
Also, your nice values are too small. It is not the Unix nice; its range
is +/-2147483645, and it competes with the other priority factors in the
priority formula. See the priority formula at
https://slurm.schedmd.c
Hi;
If you have a working job_submit.lua script, you can add a block for new
jobs of a specific user:

if job_desc.user_name == "baduser" then
    return 2045
end

That's all!
Regards;
Ahmet M.
On 1.04.2020 16:22, Mark Dixon wrote:
Hi David,
Thanks for this, it sounds like I'
Hi;
The spart command version 1.0.0 is available:
https://github.com/mercanca/spart
spart is a user-oriented info command for Slurm. It shows user-specific,
brief partition info with the core counts of available nodes and
pending jobs.
It hides information unnecessary for users in the out
Hi;
In your partition definition there is "Shared=NO", which means "do not
share nodes between jobs". This parameter conflicts with the
"OverSubscribe=FORCE:12" parameter. According to the Slurm
documentation, the Shared parameter has been replaced by the
OverSubscribe parameter. But I suppose
Hi;
From the slurm.conf documentation web page:
Note: The filetxt plugin records only a limited subset of accounting
information and will prevent some sacct options from proper operation.
Regards;
Ahmet M.
On 29.01.2020 21:47, Dr. Thomas Orgis wrote:
Hi,
I happen to run a small cl
Hi;
Your MPI and NAMD use your second network because your applications were
not compiled for InfiniBand. There are many pre-built NAMD versions;
the verbs and ibverbs builds are for InfiniBand. Also, when you
compile the MPI source, you should check that the configure script
detects the infin
f they are exceeded),
then I could extract what I need from that.
Again, thanks for the assistance.
Mike
On Thu, Oct 24, 2019 at 11:27 PM mercan wrote:
Hi;
You should set
SelectType=select/cons_res
and plus one of these:
S
Hi;
You should set
SelectType=select/cons_res
plus one of these:
SelectTypeParameters=CR_Memory
SelectTypeParameters=CR_Core_Memory
SelectTypeParameters=CR_CPU_Memory
SelectTypeParameters=CR_Socket_Memory
to enable memory-allocation tracking, according to the documentation:
https://slurm.schedm
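As a sketch, the relevant slurm.conf fragment might look like this (the
node names, CPU count, and memory size are examples):

```
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
NodeName=node[01-04] CPUs=16 RealMemory=64000
```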
Hi;
You can use the "--dependency=afterok:jobid:jobid ..." parameter of
sbatch to ensure that the newly submitted job waits until all the older
jobs are finished. You can simply submit the new job even while the
older jobs are running; it will not start before the old jobs finish.
Hi;
Starttime and Endtime apply to jobs in any state, including PENDING. If
you want to restrict the output to only running jobs between the start
and end times, you should specify which states you want using the -s
parameter.
Ahmet M.
On 16.10.2019 20:31, Brian Andrus wrote:
All,
When running a report to try and get j
Hi;
I think you should set
SelectType=select/cons_res
plus one of these:
SelectTypeParameters=CR_Memory
SelectTypeParameters=CR_Core_Memory
SelectTypeParameters=CR_CPU_Memory
SelectTypeParameters=CR_Socket_Memory
to enable memory-allocation tracking, according to the documentation:
https://slur
Hi;
If you want to use threads as CPUs, you should set CR_CPU instead
of CR_Core.
Regards;
Ahmet M.
On 12.07.2019 21:29, mercan wrote:
Hi;
You can find the Definitions of Socket, Core, & Thread at:
https://slurm.schedmd.com/mc_support.html
Your status:
CPUs=COREs=Soc
Hi;
You can find the definitions of Socket, Core, and Thread at:
https://slurm.schedmd.com/mc_support.html
Your status:
CPUs=COREs=Sockets*CoresPerSocket=1*4=4
Threads=COREs*ThreadsPerCore=4*2=8
Regards;
Ahmet M.
On 12.07.2019 20:15, Hanu Pathuri wrote:
Hi,
Here is my node informa
Hi;
There is an official page which gives many links to third-party
solutions you can use:
https://slurm.schedmd.com/download.html
In my opinion, the best Slurm page for system administration is:
https://wiki.fysik.dtu.dk/niflheim/SLURM
On this page, you can find a lot of links and inf
Hi;
As far as I know, Slurm is not able to work (communicate) with
the Reprise License Manager or any other license manager. Slurm just
sums the licenses in use according to the jobs' -L parameters and
subtracts this sum from the total license count given via
"sacctmgr add/modi
2019 at 12:24 PM mercan wrote:
Hi;
Sorry, as you can see, I made a mistake again. I wrote two different
directories:
"The owner of the /var/run/slurm-llnl directory and the
slurmctld.pid and slurmd.pid files should be "
chown -R noki:root /var/run/slurm-llnl
Regards;
Ahmet M.
On 19.06.2019 05:55, Noki Lee wrote:
Hi, slurm-users and mercan.
I tried what you said.
noki@noki-System-Product-Name:~$ sudo chown -R noki:root /var/spool/slurm-llnl/
noki@noki-System-Product-Name:/var/spool/slurm-llnl$ ls -l
total 92
-r
Hi;
I did not notice the
SlurmUser=noki
line. The owner of the /var/run/slurm-llnl directory and of the
slurmctld.pid and slurmd.pid files should be the "noki" user.
chown -R noki:root /var/spool/slurm-llnl
Regards;
Ahmet M.
On 18.06.2019 15:15, mercan wrote:
Hi;
The owner of the /var
Hi;
The owner of the /var/run/slurm-llnl directory and of the slurmctld.pid
and slurmd.pid files should be the "slurm" user. Your files are owned by
root and noki.
chown -R slurm:slurm /var/spool/slurm-llnl
Regards;
Ahmet M.
On 18.06.2019 15:03, Noki Lee wrote:
Though SLURM works fine for job su
Hi;
Try:
salloc ./run_qemu.sh
Regards;
Ahmet M.
On 17.06.2019 20:28, Mahmood Naderan wrote:
Hi,
May I know why the user is not able to run a qemu interactive job?
According to the configuration which I made, everything should be
fine. Isn't that?
[valipour@rocks7 ~]$ salloc run_qe
Hi;
If you are not already using an epilog script, you can set one up to
clean up all residue from finished jobs:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-prolog-and-epilog-scripts
Ahmet M.
On 28.05.2019 19:03, Matthew BETTINGER wrote:
We use triggers f
Hi;
Do not think of "the number of devices" as "the number of servers". Any
device which has a MAC address and is connected to your nodes' local
networks counts as a device. For example, if your BMC ports
(iLO, iDRAC, etc.) are connected to one of your nodes' networks, that
doubles the number
Hi;
I am trying to use the slurm_load_partitions2 function from the Slurm API.
It is defined as:
extern int slurm_load_partitions2(time_t update_time,
partition_info_msg_t **resp,
uint16_t show_flags,
Hi;
For a summary of the partitions, you can use spart:
https://github.com/mercanca/spart
Regards,
Ahmet M.
On 30.04.2019 15:47, Jean-mathieu CHANTREIN wrote:
Hello,
Do you know a command to get a summary of the use of compute nodes
and/or partition of a cluster in real time ? Something wi
Hi;
We use the node weight parameter to do that. If you set the high-mem
nodes with a high weight and the low-mem nodes with a low weight, Slurm
will select the lowest-weight nodes which have enough memory for the
job. So, if there are free low-mem nodes, the high-mem nodes will stay
free. At our cluster,
low mem
multiple partitions, the program will work fine.
On Mar 27, 2019, at 5:51 AM, mercan wrote:
Hi;
Except for the sjstat script, Slurm does not contain a command to show
user-oriented partition info. I wrote a command. I hope you wil
Hi;
Except for the sjstat script, Slurm does not contain a command to show
user-oriented partition info. I wrote one, and I hope you will find it
useful.
https://github.com/mercanca/spart
Regards,
Ahmet M.
Hi;
I think dirty printf-style debugging (slurm.log_user) is required,
because the Lua of our Slurm installation returns a lot of variables as
nil. You can limit the output to a specific user as below:
if job_desc.user_name == "mercan" then
    slurm.log_user("j
Hi;
You can use a job submit plugin for logging. We use the Lua job_submit
plugin. The slurm.log_info() function writes a string to the slurmctld
log file, but we use a separate file as a user-activity log. The
logging Lua code is something like:
dt = os.date()
jaccount = job_des
Hi;
We upgraded from 18.08.3 to 18.08.4, and there is also a job_submit.lua
script. Nearly the same issue occurred on our cluster:
$ sbatch batch
sbatch: error: Batch job submission failed: Unspecified error
$ mv batch nobatchy
$ sbatch nobatchy
Submitted batch job 172174
I hope this helps.
Ahmet M.
Hi;
As far as I know, exit code 141 and signal 13 are the same thing: signal
number + 128 gives the exit code (13 + 128 = 141):
https://slurm-dev.schedmd.narkive.com/MYGH56EW/job-exit-codes
Ahmet M.
On 23.11.2018 14:36, Matthew Goulden wrote:
A confirmation re-run yielded the same outcome but the correct outcome
was available
Hi;
Are there some typos, or are these really different paths?
/opt/exp_soft/slurm/bin/srun
vs.
which srun
/opt/exp_soft/bin/srun
Ahmet Mercan
On 13.11.2018 11:24, Scott Hazelhurst wrote:
Dear all
I still haven’t found the cause to the problem I raised last week where srun -w