Thank you David! Let me try it.
Thinking about our case, I'll try dumping the debug info somewhere like
syslog. Anyway, the idea should be useful for improving our system monitoring.
Much appreciated.
Best,
Kota
露崎 浩太 (Kota Tsuyuzaki)
kota.tsuyuzaki
We have several users submitting single-GPU jobs to our cluster. We expected
the jobs to fill each node and fully utilize the available GPUs, but instead we
find that only 2 out of the 4 GPUs in each node get allocated.
If we request 2 GPUs in the job and start two jobs, both jobs will start
Short of getting on the system and kicking the tires myself, I’m fresh out of
ideas. Does “sinfo -R” offer any hints?
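For reference, a couple of quick checks along these lines, assuming a reasonably recent Slurm (the node name is a placeholder; "%C" prints allocated/idle/other/total CPUs and "%G" prints the configured GRES):

  sinfo -R                              # nodes that are down or drained, with the reason
  sinfo -N -o "%N %T %C %G"             # per-node state, CPU counts and GRES (e.g. gpu:4)
  scontrol show node <nodename> | grep -E "Gres|CfgTRES|AllocTRES"

The last command shows whether the GPUs are actually configured on the node and how many are currently allocated.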
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of
navin srivastava
Sent: Thursday, June 11, 2020 11:31 AM
To: Slurm User Community List
Subject: Re:
I am able to get the output of scontrol show node oled3.
Also, oled3 is pinging fine,
and the scontrol ping output shows:
Slurmctld(primary/backup) at deda1x1466/(NULL) are UP/DOWN
so all looks OK to me.
Regards,
Navin.
On Thu, Jun 11, 2020 at 8:38 PM Riebs, Andy wrote:
So there seems to be a failure to communicate between slurmctld and the oled3
slurmd.
From oled3, try “scontrol ping” to confirm that it can see the slurmctld daemon.
From the head node, try “scontrol show node oled3”, and then ping the address
that is shown for “NodeAddr=”
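In shell form, the sequence above looks roughly like this (the address is whatever the NodeAddr= field reports):

  # on oled3
  scontrol ping

  # on the head node
  scontrol show node oled3 | grep NodeAddr
  ping <address shown in NodeAddr=>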
From: slurm-users [
I collected the log from slurmctld and it shows the following:
[2020-06-10T20:10:38.501] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-10T20:14:38.901] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-10T20:18:38.255] Resending TERMINATE_JOB request JobId=1252284
Hi,
I have some trouble fully understanding the "OverSubscribe" setting. What
I would like is to oversubscribe nodes to increase overall throughput.
- Is there a way to oversubscribe by a certain fraction, e.g. +20% or +50%?
- Is there a way to stop if a node reaches 100% "Load"?
Is there
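For context, the setting itself is expressed as a whole-number factor per resource rather than a percentage; a minimal slurm.conf sketch (partition and node names are made up):

  # each CPU/core on this partition may be allocated to up to 2 jobs
  PartitionName=throughput Nodes=node[01-10] OverSubscribe=FORCE:2 State=UP

Whether the factor applies per CPU, core, or socket depends on the configured SelectType/SelectTypeParameters.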
Thanks, all, for your replies. I think I can figure out something that makes
sense from here...
--
Dr. Manuel Holtgrewe, Dipl.-Inform.
Bioinformatician
Core Unit Bioinformatics – CUBI
Berlin Institute of Health / Max Delbrück Center for Molecular Medicine in the
Helmholtz Association / Charité –
Weird. “slurmd -Dvvv” ought to report a whole lot of data; I can’t guess how to
interpret it reporting nothing but the “log file” and “munge” messages.
When you have it running attached to your window, is there any chance that
sinfo or scontrol suggests that the node is actually all right?
Spare capacity is critical. At our scale, the few dozen cores that are
typically left idle in our GPU nodes handle the vast majority of interactive
work.
> On Jun 11, 2020, at 8:38 AM, Paul Edmon wrote:
>
That's pretty slick. We just have test, gpu_test, and remotedesktop
partitions set up for those purposes.
The real trick is making sure you have sufficient spare capacity
that you can deliberately idle for these purposes. If we were a smaller
shop with less hardware I wouldn't be able
That’s close to what we’re doing, but without dedicated nodes. We have three
back-end partitions (interactive, any-interactive, and gpu-interactive), but
the users typically don’t have to consider that, due to our job_submit.lua
plugin.
All three partitions have a default of 2 hours, 1 core, 2
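A rough slurm.conf sketch of partition defaults along those lines (the partition names follow the ones mentioned above; the node lists and default values are illustrative only, not the poster's actual settings):

  PartitionName=interactive     Nodes=cpu[001-064]            DefaultTime=02:00:00 State=UP
  PartitionName=any-interactive Nodes=cpu[001-064],gpu[01-08] DefaultTime=02:00:00 State=UP
  PartitionName=gpu-interactive Nodes=gpu[01-08]              DefaultTime=02:00:00 State=UP

The per-job core and memory defaults would normally come from Slurm's usual 1-CPU submission default and a DefMemPerCPU setting.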
Sorry Andy, I forgot to add:
First I tried slurmd -Dvvv and it did not write anything beyond
slurmd: debug: Log file re-opened
slurmd: debug: Munge authentication plugin loaded
After that I waited for 10-20 minutes, but there was no output, and finally I pressed
Ctrl-C.
My doubt is about the slurm.conf file:
Control
Generally the way we've solved this is to set aside a specific set of
nodes in a partition for interactive sessions. We deliberately scale
the size of the resources so that users will always run immediately and
we also set a QoS on the partition to make it so that no one user can
dominate the
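A hedged sketch of that kind of setup, assuming the accounting database (slurmdbd) is in place; the partition name, node list, and limits below are made up:

  # slurm.conf: dedicated interactive partition tied to a QOS
  PartitionName=interactive Nodes=int[01-04] QOS=interactive MaxTime=08:00:00 State=UP

  # create the QOS and cap what any single user can hold at once
  sacctmgr add qos interactive
  sacctmgr modify qos interactive set MaxTRESPerUser=cpu=16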
Is the time on that node too far out-of-sync w.r.t. the slurmctld server?
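A quick check, assuming ssh access to the node (host name taken from the thread):

  date -u; ssh oled3 date -u      # the two timestamps should agree to within a few seconds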
> On Jun 11, 2020, at 09:01 , navin srivastava wrote:
>
> I tried running it in debug mode, but it is not writing anything there either.
>
> I waited for about 5-10 minutes.
>
> deda1x1452:/etc/sysconfig # /usr/sbin/slu
If you omitted the “-D” that I suggested, then the daemon would have detached
and logged nothing on the screen. In this case, you can still go to the slurmd
log (use “scontrol show config | grep -i log” if you’re not sure where the logs
are stored).
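For example (the keys are standard scontrol output; the paths shown are just common defaults and vary by site):

  scontrol show config | grep -i log
  # expect lines such as:
  #   SlurmctldLogFile        = /var/log/slurm/slurmctld.log
  #   SlurmdLogFile           = /var/log/slurm/slurmd.log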
From: slurm-users [mailto:slurm-users-boun...
I tried running it in debug mode, but it is not writing anything
there either.
I waited for about 5-10 minutes.
deda1x1452:/etc/sysconfig # /usr/sbin/slurmd -v -v
There was no output on the terminal.
The OS is SLES12-SP4. All firewall services are disabled.
The recent change is the local hostname; earlier it
Hi Manuel,
"Holtgrewe, Manuel" writes:
> Hi,
>
> is there a way to make interactive logins where users will use almost no
> resources "always succeed"?
>
> In most of these interactive sessions, users will have mostly idle shells
> running and do some batch job submissions. Is there a way to a
Hi, please send us the output of:
cat /etc/redhat-release
OR
cat /etc/lsb_release
Also, please let us know the detailed log reports that are probably
available at /var/log/slurm/slurmctld.log,
and the status of:
ps -ef | grep slurmctld
Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Sc
Hi Navin,
try running slurmd in the foreground with increased verbosity:
slurmd -D -v (add as many v as you deem necessary)
Hopefully it'll tell you more about why it times out.
Best,
Marcus
On 6/11/20 2:24 PM, navin srivastava wrote:
> Hi Team,
>
> When I try to start the slurmd process
On 11-06-2020 14:24, navin srivastava wrote:
Hi Team,
When I try to start the slurmd process, I get the error below.
2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node
daemon...
2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start
operation
Navin,
As you can see, systemd provides very little service-specific information. For
slurm, you really need to go to the slurm logs to find out what happened.
Hint: A quick way to identify problems like this with slurmd and slurmctld is
to run them with the “-Dvvv” option, causing them to log
Hi Team,
When I try to start the slurmd process, I get the error below.
2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node
daemon...
2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start
operation timed out. Terminating.
2020-06-11T13:13:28.68447
Hi,
is there a way to make interactive logins where users will use almost no
resources "always succeed"?
In most of these interactive sessions, users will have mostly idle shells
running and do some batch job submissions. Is there a way to allocate "infinite
virtual cpus" on each node that can