>
>
>
> You can find the documentation here:
>
> https://slurm.schedmd.com/cgroup.conf.html
>
>
>
> If you want to share GPUs you can use CUDA MPS or MIG if your GPU supports
> it.
>
>
>
> Regards,
>
> Jesse Chintanadilok
>
>
>
> *From:*
Hi,
I am facing an issue in my environment where a batch job and an
interactive job use the same GPU.
Each server has 2 GPUs. When 2 batch jobs are running it works fine and they
use the 2 different GPUs, but if one batch job is running and another job is
submitted interactively, then it uses the same GPU
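For reference, a minimal sketch of the cgroup-based device constraint the
reply above points to (node and device names are placeholders for your site;
both jobs must actually request a GPU with --gres for Slurm to bind them to
different devices):

    # slurm.conf
    GresTypes=gpu
    TaskPlugin=task/cgroup,task/affinity
    NodeName=node01 Gres=gpu:2 ...

    # gres.conf on the node
    NodeName=node01 Name=gpu File=/dev/nvidia[0-1]

    # cgroup.conf
    ConstrainDevices=yes

    # interactive jobs need the request too, e.g.
    srun --gres=gpu:1 --pty bash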
Hi,
I have a question about MariaDB vs. Slurm version compatibility.
Is there any compatibility matrix available?
We are running Slurm version 20.02 in our environment on SLES15 SP3 with
MariaDB 10.5.x. We are upgrading the OS from SLES15 SP3 to SP4, and with
this we see the MariaDB version is 1
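Not a compatibility matrix, but a common precaution before letting the SLES
upgrade pull in a newer MariaDB is to back up the accounting database first
(a sketch; slurm_acct_db is Slurm's default database name, adjust the user
and name to your site):

    systemctl stop slurmdbd
    mysqldump --single-transaction -u slurm -p slurm_acct_db > slurm_acct_db.sql
    # ... perform the OS / MariaDB upgrade ...
    mysql_upgrade -u root -p
    systemctl start slurmdbd    # watch slurmdbd.log for schema conversion errors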
clusters & login nodes that allow access to both.
> That
> > do? I don't think a third would make any difference in setup.
> >
> > They need to share a database. As long as they share a database, the
> > clusters have 'knowledge' of each other.
>
> So if you set up one database server (running slurmdbd), and then a
> SLURM controller for each cluster (running slurmctld) using that one
> central database, the '-M' option should work.
>
> Tina
>
> On 28/10/2021 10:54, navin sri
Hi,
I am looking for a stepwise guide to set up a multi-cluster implementation.
We want to set up 3 clusters and one login node to run jobs using the -M
cluster option.
Does anybody have such a setup and can share some insight into how it works,
and whether it is really a stable solution?
Regards
Navin.
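A rough sketch of the pieces Tina describes above (hostnames and cluster
names are placeholders):

    # slurm.conf on each cluster, each with its own ClusterName
    ClusterName=cluster1
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=dbhost.example.com

    # register every cluster once against the shared slurmdbd
    sacctmgr add cluster cluster1

    # then from the login node
    sbatch -M cluster2 job.sh
    squeue -M cluster1,cluster2,cluster3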
Dear slurm community users,
We are using slurm version 20.02.x.
We see the below message appearing many times in the slurmctld log,
and we found that whenever this message appears the sinfo/squeue output
gets slow.
There is no timeout, as I kept the value at 100.
Warning: Note very large processing time from
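One way to see which RPCs are consuming slurmctld time when sinfo/squeue
slow down (a diagnostic sketch, not a fix):

    sdiag            # scheduler stats plus per-RPC counts and processing times
    sdiag --reset    # zero the counters, then run sdiag again after a slow period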
Are you using federated clusters? If not, check slurm.conf -- do you
> > have FirstJobId set?
> >
> > Andy
> >
> > On 11/18/2020 8:42 AM, navin srivastava wrote:
> >> While running the sacct we found that some jobid are not listing.
> >>
While running sacct we found that some job IDs are not listed.
5535566    SYNTHLIBT+  stdg_defq  stdg_acc  1  COMPLETED  0:0
5535567    SYNTHLIBT+  stdg_defq  stdg_acc  1  COMPLETED  0:0
11016496   jupyter-s+  stdg_defq  stdg_acc  1  RUNNING    0:0
Is there a way to find the utilization per node?
Regards
Navin.
On Wed, Nov 18, 2020 at 10:37 AM navin srivastava
wrote:
> Dear All,
>
> Good Day!
>
> i am seeing one strange behaviour in my environment.
>
> we have 2 clusters in our environment one acting as a datab
Dear All,
Good Day!
I am seeing one strange behaviour in my environment.
We have 2 clusters in our environment, one acting as a database server, and
have pointed the 2nd cluster to the same database.
hpc1   155.250.126.30   6817   8192   1
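If the confusion is about where the jobs show up, sacct can be pointed at
all clusters registered in the shared slurmdbd; keeping the job-ID ranges
disjoint (the FirstJobId hint above) is optional, and the value below is
only an example:

    sacct --allclusters --starttime=2020-11-01 \
          --format=JobID,Cluster,JobName,State,ExitCode

    # slurm.conf on the second cluster, if you want non-overlapping IDs
    FirstJobId=20000001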
by 18.x and 19.x, or can I uninstall Slurm
17.11.8 and install 20.02 on all compute nodes?
Regards
Navin.
On Tue, Nov 3, 2020 at 12:31 PM Ole Holm Nielsen
wrote:
> On 11/2/20 2:25 PM, navin srivastava wrote:
> > Currently we are running slurm version 17.11.x and wanted to mov
Dear All,
Currently we are running Slurm version 17.11.x and want to move to 20.x.
We are building the new server with Slurm 20.02 and planning to upgrade
the client nodes from 17.x to 20.x.
I wanted to check whether we can upgrade the clients from 17.x to 20.x
directly, or whether we need to go through the intermediate versions.
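For what it is worth, Slurm only supports upgrading from the two previous
major releases, so 17.11 cannot jump straight to 20.02 on the slurmdbd
side; a sketch of the usual order for each hop (17.11 -> 19.05 -> 20.02):

    # 1. back up and upgrade the accounting database daemon first
    systemctl stop slurmdbd
    mysqldump --single-transaction -u slurm -p slurm_acct_db > backup.sql
    # install the new slurmdbd packages, then let it convert the schema
    systemctl start slurmdbd

    # 2. then slurmctld on the controller, after its package upgrade
    systemctl restart slurmctld

    # 3. finally slurmd on the compute nodes (can be done in batches)
    systemctl restart slurmd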
Hi team,
I have extracted the %utilization report and found that the idle time is at
the higher end, so I wanted to check whether there is any way we can find
node-based utilization.
It will help us figure out which nodes are underutilized.
Regards
navin.
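As far as I know there is no built-in per-node sreport, but one rough
approach (an approximation, post-processed by hand or script) is to dump
the allocation records and add up CPU-seconds per node afterwards:

    sacct -a -X --parsable2 --starttime=2020-11-01 --endtime=2020-12-01 \
          --format=JobID,NodeList,AllocCPUS,ElapsedRaw > jobs.txt
    # expand NodeList ranges afterwards with, e.g.:
    scontrol show hostnames node[01-04]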
Dear all,
I read about the concept of federated clusters in Slurm. Is it really
helpful for maximizing cluster usage?
We have 4 independent Slurm clusters which work with local storage, and we
want to build a federated cluster where we can utilize the free available
compute
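If the clusters already share (or can share) one slurmdbd, the federation
itself is mostly a one-liner; a sketch with placeholder names:

    sacctmgr add federation myfed clusters=clusterA,clusterB,clusterC,clusterD
    scontrol show federation    # verify membership on any member
    squeue --federation         # federation-wide view of jobs

Whether it maximizes usage depends a lot on storage: without a shared
filesystem, jobs routed to a remote cluster still need their data staged
there somehow.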
Hi Team,
I am facing one issue. Several users are submitting jobs in a single batch
submission which are very short (say 1-2 sec), so while submitting more
jobs slurmctld becomes unresponsive and starts giving the message
ending job 6e508a88155d9bec40d752c8331d7ae8 to queue.
sbatch: error: Batch job submiss
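One common way to take load off slurmctld with 1-2 second tasks (a sketch;
short_task.sh is a placeholder) is to pack them into a job array or a
single job that loops, so the controller sees far fewer submissions:

    #!/bin/bash
    #SBATCH --array=1-1000
    #SBATCH --time=00:05:00
    ./short_task.sh "${SLURM_ARRAY_TASK_ID}"

SchedulerParameters options such as defer or max_rpc_cnt can also help the
controller survive submission bursts.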
> can complete without delaying the estimated start time of higher priority
> jobs.
>
> On Jul 13, 2020, at 4:18 AM, navin srivastava
> wrote:
>
> Hi Team,
>
> We have separate partitions for the GPU nodes and only CPU nodes .
>
> scenario: the jobs submitted in our
Hi Team,
We have separate partitions for the GPU nodes and the CPU-only nodes.
Scenario: the jobs submitted in our environment are 4 CPU + 1 GPU as well as
4 CPU only, in nodeGPUsmall and nodeGPUbig. So when all the GPUs are
exhausted, the rest of the jobs are in the queue waiting for the
availability of GPU resources.
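One sketch of keeping CPU-only work from starving GPU jobs on shared nodes
(partition and node names are placeholders; MaxCPUsPerNode caps what the
CPU-only partition may take on each node, leaving cores for GPU jobs):

    PartitionName=gpu     Nodes=nodegpu[01-10] State=UP
    PartitionName=cpuonly Nodes=nodegpu[01-10] MaxCPUsPerNode=16 State=UP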
> If you run slurmd -C on the compute node, it should tell you what
> > slurm thinks the RealMemory number is.
> >
> > Jeff
> >
> > ----
> > *From:* slurm-users on behalf
> of
> >
ish, then remove it.
>
> Brian Andrus
>
> On 7/8/2020 10:57 PM, navin srivastava wrote:
> > Hi Team,
> >
> > i have 2 small query.because of the lack of testing environment i am
> > unable to test the scenario. working on to set up a test environment.
> >
Hi Team,
I have 2 small queries. Because of the lack of a testing environment I am
unable to test the scenario; I am working on setting up a test environment.
1. In my environment I am unable to pass the #SBATCH --mem=2GB option.
I found the reason is that there is no RealMemory entry in the node
definition.
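A sketch of wiring up --mem (the values below are only examples; copy the
real numbers reported by your nodes):

    # on the compute node, see what slurmd detects
    slurmd -C
    # NodeName=node01 CPUs=20 ... RealMemory=128825 ...

    # slurm.conf: add RealMemory to the node line and make memory schedulable
    NodeName=node01 CPUs=20 RealMemory=128825 State=UNKNOWN
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory

    # job script (note the '=' in the option)
    #SBATCH --mem=2G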
Hi Team,
I have separated the CPU nodes and the GPU nodes into two different queues.
Now I have 20 nodes having CPUs (20 cores) only, with no GPU.
Another set of nodes has GPU+CPU: some nodes with 2 GPUs and 20 CPUs, and
some with 8 GPUs and 48 CPUs, assigned to the GPU queue.
Users are facing issues when
Thanks Ole.
Regards
Navin
On Thu, Jun 18, 2020 at 11:56 AM Ole Holm Nielsen <
ole.h.niel...@fysik.dtu.dk> wrote:
> The scontrol command to set the nice level is on the list here:
> https://wiki.fysik.dtu.dk/niflheim/SLURM#useful-commands
>
> /Ole
>
> On 6/18/20 8:05 AM
modify the order of execution.
>
> On Wed, 17 Jun 2020 at 12:31, navin srivastava (<
> navin.alt...@gmail.com>) wrote:
>
>> Hi Team,
>>
>> Is their a way to change the job order in slurm.similar to sorder in PBS.
>>
>> I want to swap my job from the other top job.
>>
>> Regards
>> Navin
>>
>>
Hi Team,
Is there a way to change the job order in Slurm, similar to sorder in PBS?
I want to swap my job with the other top job.
Regards
Navin
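Two sketches (the job ID is an example): scontrol top lets a user move one
of their own pending jobs ahead of their other jobs, and the nice value can
reorder jobs more generally (negative values need operator/admin rights):

    scontrol top 123456
    scontrol update jobid=123456 nice=-1000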
partition would
> have to (a) not require a GPU, (b) require a limited number of CPUs per
> node, so that you'd have some CPUs available for GPU jobs on the nodes
> containing GPUs.
>
> --
> *From:* slurm-users on behalf of
> navin srivastava
> *Sent:
Hi,
One query about how the nice value is decided by the scheduler.
Our scheduling policy is FIFO + Fair Tree.
One user submitted 100s of jobs on different dates. What I see is that the
old jobs are in the queue but a few of the latest jobs went for execution.
When I see the nice value of the latest running jo
Yes, we have separate partitions. Some are specific to GPU, having 2 nodes
with 8 GPUs, and other partitions are a mix of both: nodes with 2 GPUs and
very few nodes without any GPU.
Regards
Navin
On Sat, Jun 13, 2020, 21:11 navin srivastava wrote:
> Thanks Renfro.
>
> Yes we have both
d non-GPU jobs? Do you
> have nodes without GPUs?
>
> On Jun 13, 2020, at 12:28 AM, navin srivastava
> wrote:
>
> Hi All,
>
> In our environment we have GPU. so what i found is if the user having high
> priority and his job is in queue and waiting for the GPU resou
Hi All,
In our environment we have GPUs. What I found is that if a user with high
priority has a job in the queue waiting for GPU resources, which are almost
full and not available, then jobs submitted by other users that do not
require GPU resources stay in the queue even though lots of
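A sketch of the usual cure hinted at in the replies: with backfill
scheduling and sane time limits, the CPU-only jobs can start ahead of the
waiting GPU job as long as they finish before its predicted start (values
are placeholders):

    SchedulerType=sched/backfill
    PartitionName=cpu DefaultTime=02:00:00 MaxTime=3-00:00:00 State=UP ...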
o:slurm-users-boun...@lists.schedmd.com] *On
> Behalf Of *navin srivastava
> *Sent:* Thursday, June 11, 2020 11:31 AM
> *To:* Slurm User Community List
> *Subject:* Re: [slurm-users] unable to start slurmd process.
>
>
>
> i am able to get the output scontrol show node oled3
>
shown for “NodeAddr=”
>
>
>
> *From:* slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] *On
> Behalf Of *navin srivastava
> *Sent:* Thursday, June 11, 2020 10:40 AM
> *To:* Slurm User Community List
> *Subject:* Re: [slurm-users] unable to start slurmd process.
or the like is messed up?
>
>
>
> If that’s not the case, I think my next step would be to follow up on
> someone else’s suggestion, and scan the slurmctld.log file for the problem
> node name.
>
>
>
> *From:* slurm-users [mailto:slurm-users-boun...@lists.schedmd.c
ig | grep -I log” if you’re not
> sure where the logs are stored).
>
>
>
> *From:* slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] *On
> Behalf Of *navin srivastava
> *Sent:* Thursday, June 11, 2020 9:01 AM
> *To:* Slurm User Community List
> *Subject:* Re:
> For example,
>
>
>
> # /usr/local/slurm/sbin/slurmd -D
>
>
>
> Just hit ^C when you’re done, if necessary. Of course, if it doesn’t fail
> when you run it this way, it’s time to look elsewhere.
>
>
>
> Andy
>
>
>
> *From:* slurm-users [mailt
Hi Team,
When I try to start the slurmd process I am getting the below error.
2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node
daemon...
2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start
operation timed out. Terminating.
2020-06-11T13:13:28.68447
Was this working earlier, or is this the first time you are trying?
> Are you using pam module ? if yes, try disabling the pam module and see
> if it works.
>
> Thanks
> Sathish
>
> On Thu, Jun 4, 2020 at 10:47 PM navin srivastava
> wrote:
>
>> Hi Team,
>>
>>
Hi Team,
I am seeing a weird issue in my environment.
One of the Gaussian jobs is failing under Slurm within a minute after it
goes for execution, without writing anything, and I am unable to figure out
the reason.
The same job works fine without Slurm on the same node.
slurmctld.log
[2020-06-03T19
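Some common first checks for a job that dies within a minute (the job ID is
an example; memory limits are a frequent culprit, but that is only a guess
here):

    sacct -j 1234567 --format=JobID,State,ExitCode,DerivedExitCode,Elapsed,MaxRSS
    # run the same environment interactively under Slurm and compare limits
    srun --pty bash -c 'ulimit -a; env | grep -i slurm'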
node job on an available node being used by JOBID. Add
> other parameters as required for cpus-per-task, time limits, or whatever
> else is needed. If you start the larger jobs first, and let the later jobs
> fill in on idle CPUs on those nodes, it should work.
>
> > On May 6, 2020,
server=flex_host servertype=flexlm type=license
>
> and submit jobs with a '-L software_name:N’ flag where N is the number of
> nodes you want to run on.
>
> > On May 6, 2020, at 5:33 AM, navin srivastava
> wrote:
> >
> > Thanks Micheal.
> >
> > Actua
On May 5, 2020, at 8:37 AM, navin srivastava
> wrote:
> >
> > External Email Warning
> > This email originated from outside the university. Please use caution
> when opening attachments, clicking links, or responding to requests.
> > Thanks Michael,
> >
run from 1-4 nodes.
>
> There are also options to query a FlexLM or RLM server for license
> management.
>
> --
> Mike Renfro, PhD / HPC Systems Administrator, Information Technology
> Services
> 931 372-3601 / Tennessee Tech University
>
> > On May 5, 2020, at
Hi Team,
We have an application whose licenses are limited; it scales up to 4
nodes (~80 cores).
So if 4 nodes are full, the job on a 5th node used to fail.
We want to put a restriction so that the application can't go for execution
beyond the 4 nodes; instead of failing, the job should stay in the queue
state.
I do not
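A sketch of the local license pool mentioned later in the thread (license
name and counts are examples; jobs that would exceed the pool stay pending
instead of failing):

    # slurm.conf
    Licenses=app_lic:80

    # job script: request as many tokens as the run will consume
    #SBATCH -L app_lic:20

For FlexLM/RLM-managed licenses, the sacctmgr add resource variant with
servertype=flexlm (quoted above) is the remote-tracking alternative.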
Thanks Daniel for the detailed description.
Regards
Navin
On Sun, May 3, 2020, 13:35 Daniel Letai wrote:
>
> On 29/04/2020 12:00:13, navin srivastava wrote:
>
> Thanks Daniel.
>
> All jobs went into run state so unable to provide the details but
> definitely will reach out la
> It would really help if you pasted the results of:
>
> squeue
>
> sinfo
>
>
> As well as the exact sbatch line, so we can see how many resources per
> node are requested.
>
>
> On 26/04/2020 12:00:06, navin srivastava wrote:
>
> Thanks Brian,
>
> As su
us to get through but reading
> through it multiple times opens many doors.
>
> DefaultTime is listed in there as a Partition option.
> If you are scheduling gres/gpu resources, it's quite possible there are
> cores available with no corresponding gpus avail.
>
> -b
>
>
If users are not
> specifying a reasonable timelimit to their jobs, this won't help either.
>
>
> -b
>
>
> On 4/24/20 1:52 PM, navin srivastava wrote:
>
> In addition to the above when i see the sprio of both the jobs it says :-
>
> for normal queue jobs all jobs showing t
                      PRIORITY   FAIRSHARE
1291339   GPUsmall       21052       21053
On Fri, Apr 24, 2020 at 11:14 PM navin srivastava
wrote:
> Hi Team,
>
> we are facing some issue in our environment. The resources are free but
> job is going into the QUEUE state but not running.
>
> i have attached t
Hi Team,
We are facing some issue in our environment. The resources are free but
jobs are going into the QUEUE state and not running.
I have attached the slurm.conf file here.
Scenario:
There are jobs only in the 2 partitions:
344 jobs are in PD state in the normal partition and the node belongs fro
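The Reason column usually explains a pending job; a quick diagnostic sketch
(the job ID is one from this thread, the format string is just an example):

    squeue -p normal -t PD -o '%.10i %.9P %.8u %.2t %.10M %R' | head
    scontrol show job 1291339 | grep -E 'Reason|Priority|TRES'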
> Erik Ellestad
> Wynton Cluster SysAdmin
> UCSF
> --
> *From:* slurm-users on behalf of
> navin srivastava
> *Sent:* Wednesday, April 15, 2020 10:37 PM
> *To:* Slurm User Community List
> *Subject:* Re: [slurm-users] How to request for the alloca
es unless the SchedulerParameters
> configuration parameter includes the "default_gbytes" option for gigabytes.
> Different units can be specified using the suffix [K|M|G|T].
> https://slurm.schedmd.com/sbatch.html
>
>
>
> ---
> Erik Ellestad
> Wynton Cluster
ion of local scratch globally via TmpFS.
>
> And then the amount per host is defined via TmpDisk=xxx.
>
> Then the request for srun/sbatch via --tmp=X
>
>
>
> ---
> Erik Ellestad
> Wynton Cluster SysAdmin
> UCSF
> --
> *From:* slurm-user
Any suggestion on the above query? I need help to understand it.
If TmpFS=/scratch and the request is #SBATCH --tmp=500GB, will it reserve
the 500GB from scratch?
Let me know if my assumption is correct.
Regards
Navin.
On Mon, Apr 13, 2020 at 11:10 AM navin srivastava
wrote:
> Hi T
Hi Team,
I wanted to define a mechanism to request local disk space while submitting
a job.
We have a dedicated /scratch file system of 1.2 TB for job execution on
each of the compute nodes, in addition to / and the other file systems.
I have defined TmpFS=/scratch in slurm.conf and then
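A sketch of how the pieces fit together (sizes are examples; TmpDisk is in
MB). As far as I can tell, --tmp is only a node-selection constraint: a job
is placed on a node whose TmpDisk is large enough, but nothing is physically
reserved or enforced in /scratch unless you add your own prolog/epilog
cleanup:

    # slurm.conf
    TmpFS=/scratch
    NodeName=node[01-20] ... TmpDisk=1200000

    # job script
    #SBATCH --tmp=500G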
PriorityUsageResetPeriod=DAILY
PriorityWeightFairshare=50
PriorityFlags=FAIR_TREE
Could you please also suggest: if the scheduling policy is fairshare, will
it still consider the priority of the partition?
Regards
Navin.
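Under priority/multifactor with FAIR_TREE, the partition's priority is just
another weighted term in the same sum, so both are considered; a sketch
(the weights are examples only):

    PriorityType=priority/multifactor
    PriorityFlags=FAIR_TREE
    PriorityWeightFairshare=50
    PriorityWeightPartition=1000
    # a partition's PriorityTier, however, orders scheduling regardless of weights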
On Sat, Apr 4, 2020 at 8:34 PM navin srivastava
wrote:
> Hi Team,
>
> I
Hi Team,
I am facing one issue in my environment. Our Slurm version is 17.11.x.
My question is: I have 2 partitions:
Queue A with node1 and node2, Priority=1000, Shared=YES
Queue B with node1 and node2, Priority=100, Shared=YES
The problem is that when a job from the A partition is running, then the j
from a different partition.
On Tue, Mar 31, 2020 at 4:34 PM navin srivastava
wrote:
> Hi ,
>
> have an issue with the resource allocation.
>
> In the environment have partition like below:
>
> PartitionName=small_jobs Nodes=Node[17,20] Default=NO MaxTime=INFINITE
> Stat
Hi,
I have an issue with resource allocation.
In the environment I have partitions like below:
PartitionName=small_jobs Nodes=Node[17,20] Default=NO MaxTime=INFINITE
State=UP Shared=YES Priority=8000
PartitionName=large_jobs Nodes=Node[17,20] Default=NO MaxTime=INFINITE
State=UP Shared=YES Pri
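If the intent is for small_jobs to push large_jobs aside on the shared
nodes, partition-priority preemption is the usual approach; a sketch (modes
and tier values below are examples, not the poster's actual config):

    PreemptType=preempt/partition_prio
    PreemptMode=REQUEUE
    PartitionName=small_jobs Nodes=Node[17,20] PriorityTier=8000 PreemptMode=off ...
    PartitionName=large_jobs Nodes=Node[17,20] PriorityTier=100  PreemptMode=requeue ...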
Hi,
I wanted to understand how log rotation of slurmctld works.
In my environment I don't have any log rotation for slurmctld.log, and now
the log file size has reached 125GB.
Can I move the log file to some other location and then restart/reload the
slurm service, and will it start a new log file? I thi
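slurmctld re-opens its log file on SIGUSR2, so a standard logrotate stanza
works without restarting the daemon (the path is an example; match your
SlurmctldLogFile):

    /var/log/slurm/slurmctld.log {
        weekly
        rotate 8
        compress
        missingok
        postrotate
            systemctl kill -s SIGUSR2 slurmctld
        endscript
    }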
the explanation for each are found on the
> Resource Limits document.
>
> /Ole
>
> On 2/17/20 12:20 PM, navin srivastava wrote:
> > Hi ole,
> >
> > i am submitting 100 of jobs are i see all jobs starting at the same time
> > and all job is going into the run s
> Why do you think the limit is not working? MaxJobs limits the number
> of running jobs to 3, but you can still submit as many jobs as you like!
>
> See "man sacctmgr" for definitions of the limits MaxJobs as well as
> MaxSubmitJobs.
>
> /Ole
>
> On 2/17/
Hi,
Thanks for your script.
With this I am able to show the limit that I set, but the limit is not
being enforced:
MaxJobs = 3, current value = 0
Regards
Navin.
On Mon, Feb 17, 2020 at 4:13 PM Ole Holm Nielsen
wrote:
> On 2/17/20 11:16 AM, navin srivastava wrote:
> > i have an issue
Hi Team,
I have an issue with the Slurm job limit. I applied the MaxJobs limit on a
user using
sacctmgr modify user navin1 set maxjobs=3
but I still see this is not getting applied; I am still able to submit more
jobs.
The Slurm version is 17.11.x.
Let me know what setting is required to implement th
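The usual reason a MaxJobs association limit is ignored is that limit
enforcement is not switched on; a sketch of what to check (restart
slurmctld after changing it):

    # slurm.conf
    AccountingStorageEnforce=associations,limits,qos

    # verify what is actually set and stored
    scontrol show config | grep AccountingStorageEnforce
    sacctmgr show assoc user=navin1 format=User,Account,MaxJobs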