[slurm-users] Re: Slurm sacct ResvCPURAW invalid field in version 24.12.5

2024-07-29 Thread Bjørn-Helge Mevik via slurm-users
Perhaps PlannedCPURAW?
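For example (a sketch; `sacct --helpformat` lists the field names your
version actually supports, so check there first):

  sacct --helpformat | grep -i planned
  sacct -X --format=JobID,PlannedCPURAW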

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo




-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Slurm fails before nvidia-smi command

2024-07-29 Thread Aziz Ogutlu via slurm-users

Hi there all,

We have a Dell server with 2 x Nvidia H100 GPUs running Slurm. After the
server restarts, Slurm fails unless we first run the nvidia-smi command. When
we run nvidia-smi && systemctl restart slurmd && systemctl restart slurmctld,
the Slurm queue starts working again. Do you have any idea about this error
and what we can do about it?


--
Best regards,
Aziz Öğütlü

Eduline Bilişim Sanayi ve Ticaret Ltd. Şti.  www.eduline.com.tr
Merkez Mah. Ayazma Cad. No:37 Papirus Plaza
Kat:6 Ofis No:118 Kağıthane -  İstanbul - Türkiye 34406
Tel : +90 212 324 60 61 Cep: +90 541 350 40 72


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Slurm fails before nvidia-smi command

2024-07-29 Thread Steffen Grunewald via slurm-users
On Mon, 2024-07-29 at 11:23:12 +0300, Slurm users wrote:
> Hi there all,
> 
> We have a Dell server with 2 x Nvidia H100 GPUs running Slurm. After the
> server restarts, Slurm fails unless we first run the nvidia-smi command. When
> we run nvidia-smi && systemctl restart slurmd && systemctl restart slurmctld,
> the Slurm queue starts working again. Do you have any idea about this error
> and what we can do about it?

Apparently the nvidia driver doesn't get loaded on reboot?
There are multiple ways to fix that - add it to /etc/modules, run modprobe
nvidia via a @reboot crontab entry (or even run nvidia-smi that way)...
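For example (a rough sketch, assuming a systemd-based distro; the file name
under /etc/modules-load.d/ is illustrative):

  # load the module at every boot via systemd-modules-load
  echo nvidia > /etc/modules-load.d/nvidia.conf

  # or, from root's crontab, poke the driver once per boot
  @reboot /usr/bin/nvidia-smi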

Best,
 Steffen


-- 
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Slurm fails before nvidia-smi command

2024-07-29 Thread Sarlo, Jeffrey S via slurm-users
nvidia-persistenced is something that gets installed by the nvidia driver.
Setting it to start at boot time helps slurmd find the GPUs when it tries to
start. This is just one web page with some information about it:

https://download.nvidia.com/XFree86/Linux-x86_64/396.51/README/nvidia-persistenced.html
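A minimal sketch of doing that on a systemd-based system (assuming the driver
packages installed a unit named nvidia-persistenced.service; verify with
systemctl list-unit-files):

  systemctl enable --now nvidia-persistenced
  systemctl restart slurmd    # let slurmd re-enumerate the GPUs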

Jeff


From: Aziz Ogutlu via slurm-users 
Sent: Monday, July 29, 2024 3:23 AM
To: slurm-us...@schedmd.com 
Subject: [slurm-users] Slurm fails before nvidia-smi command

Hi there all,

We have a Dell server with 2 x Nvidia H100 GPUs running Slurm. After the
server restarts, Slurm fails unless we first run the nvidia-smi command. When
we run nvidia-smi && systemctl restart slurmd && systemctl restart slurmctld,
the Slurm queue starts working again. Do you have any idea about this error
and what we can do about it?

--
Best regards,
Aziz Öğütlü

Eduline Bilişim Sanayi ve Ticaret Ltd. Şti.  
www.eduline.com.tr
Merkez Mah. Ayazma Cad. No:37 Papirus Plaza
Kat:6 Ofis No:118 Kağıthane -  İstanbul - Türkiye 34406
Tel : +90 212 324 60 61 Cep: +90 541 350 40 72


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com



[slurm-users] Re: Slurm fails before nvidia-smi command

2024-07-29 Thread Aziz Ogutlu via slurm-users

After we enabled the nvidia-persistenced service, Slurm no longer failed.

Thanks for your help.

On 7/29/24 13:00, Sarlo, Jeffrey S wrote:
nvidia-persistenced is something that gets installed by the nvidia driver.
Setting it to start at boot time helps slurmd find the GPUs when it tries to
start. This is just one web page with some information about it:


https://download.nvidia.com/XFree86/Linux-x86_64/396.51/README/nvidia-persistenced.html 



Jeff


*From:* Aziz Ogutlu via slurm-users 
*Sent:* Monday, July 29, 2024 3:23 AM
*To:* slurm-us...@schedmd.com 
*Subject:* [slurm-users] Slurm fails before nvidia-smi command
Hi there all,

We have a Dell server with 2 x Nvidia H100 GPUs running Slurm. After the
server restarts, Slurm fails unless we first run the nvidia-smi command. When
we run nvidia-smi && systemctl restart slurmd && systemctl restart slurmctld,
the Slurm queue starts working again. Do you have any idea about this error
and what we can do about it?

--
Best regards,
Aziz Öğütlü

Eduline Bilişim Sanayi ve Ticaret Ltd. Şti. 
www.eduline.com.tr


Merkez Mah. Ayazma Cad. No:37 Papirus Plaza
Kat:6 Ofis No:118 Kağıthane -  İstanbul - Türkiye 34406
Tel : +90 212 324 60 61 Cep: +90 541 350 40 72


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


--
Best regards,
Aziz Öğütlü

Eduline Bilişim Sanayi ve Ticaret Ltd. Şti.  www.eduline.com.tr
Merkez Mah. Ayazma Cad. No:37 Papirus Plaza
Kat:6 Ofis No:118 Kağıthane -  İstanbul - Türkiye 34406
Tel : +90 212 324 60 61 Cep: +90 541 350 40 72

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Slurm fails before nvidia-smi command

2024-07-29 Thread Cutts, Tim via slurm-users
It sounds to me as though your systemd units are starting in the wrong order,
or don't have the appropriate dependencies set between them?
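For example, a drop-in for slurmd could make that ordering explicit (a
sketch, assuming nvidia-persistenced.service is the unit that brings the GPUs
up; the drop-in path is illustrative):

  # /etc/systemd/system/slurmd.service.d/gpu-order.conf
  [Unit]
  Wants=nvidia-persistenced.service
  After=nvidia-persistenced.service

followed by systemctl daemon-reload && systemctl restart slurmd.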

Tim

--
Tim Cutts
Scientific Computing Platform Lead
AstraZeneca

Find out more about R&D IT Data, Analytics & AI and how we can support you by
visiting our Service Catalogue.


From: Aziz Ogutlu via slurm-users 
Date: Monday, 29 July 2024 at 9:25 AM
To: slurm-us...@schedmd.com 
Subject: [slurm-users] Slurm fails before nvidia-smi command
Hi there all,

We have a Dell server with 2 x Nvidia H100 GPUs running Slurm. After the
server restarts, Slurm fails unless we first run the nvidia-smi command. When
we run nvidia-smi && systemctl restart slurmd && systemctl restart slurmctld,
the Slurm queue starts working again. Do you have any idea about this error
and what we can do about it?

--
Best regards,
Aziz Öğütlü

Eduline Bilişim Sanayi ve Ticaret Ltd. Şti.  
www.eduline.com.tr
Merkez Mah. Ayazma Cad. No:37 Papirus Plaza
Kat:6 Ofis No:118 Kağıthane -  İstanbul - Türkiye 34406
Tel : +90 212 324 60 61 Cep: +90 541 350 40 72


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


AstraZeneca UK Limited is a company incorporated in England and Wales with 
registered number: 03674842 and its registered office at 1 Francis Crick Avenue,
Cambridge Biomedical Campus, Cambridge, CB2 0AA.

This e-mail and its attachments are intended for the above named recipient only 
and may contain confidential and privileged information. If they have come to 
you in error, you must not copy or show them to anyone; instead, please reply 
to this e-mail, highlighting the error to the sender and then immediately 
delete the message. For information about how AstraZeneca UK Limited and its 
affiliates may process information, personal data and monitor communications, 
please see our privacy notice at 
www.astrazeneca.com

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Convergence of Kube and Slurm?

2024-07-29 Thread wdennis--- via slurm-users
Can I ask if this replaces the previously announced "SUNK" work? (It was
never released as open source on GitHub as planned; it looks like it is only
available on CoreWeave Cloud.)

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Final Call for SLUG Standard Registration

2024-07-29 Thread Victoria Hobson via slurm-users
Slurm User Group (SLUG) 2024 is set for September 12-13 at the
University of Oslo in Oslo, Norway.

Registration information, abstracts, and travel recommendations can be
found here: https://slug24.splashthat.com/

The last day to register with standard pricing ($900) is this Friday,
August 2nd. After this, final registration will run until August 30th
at a price of $1100.

SLUG is the best way to interact with the Slurm community and to engage
directly with the SchedMD Support & Training staff.

Don't forget to register. We can't wait to see you in Oslo!

--
Victoria Hobson
SchedMD LLC
Vice President of Marketing

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: slurmctld hourly: Unexpected missing socket error

2024-07-29 Thread Jason Ellul via slurm-users
Thanks again, Patryk, for your insights. We have implemented many of the same
things, but the socket errors are still occurring regularly.

If we find a solution that works I will be sure to add it to this thread.

Many thanks

Jason


Jason Ellul
Head - Research Computing Facility
Office of Cancer Research
My onsite days are Mon, alt Wed and Friday.


Phone +61 3 8559 6546
Email jason.el...@petermac.org
305 Grattan Street
Melbourne, Victoria
3000 Australia

www.petermac.org


From: Patryk Bełzak via slurm-users 
Date: Wednesday, 24 July 2024 at 8:03 PM
To: Jason Ellul via slurm-users 
Subject: [slurm-users] Re: slurmctld hourly: Unexpected missing socket error

Hi,

we're on 389 Directory Server (aka 389ds), which is a pretty large instance.
One of the optimizations was to create proper ACIs on the server side, which
significantly improved lookup times on the Slurm controller and worker nodes.
The second thing was to move the sssd cache to tmpfs - instructions from Red
Hat:
https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/tuning_performance_in_identity_management/assembly_tuning-sssd-performance-for-large-idm-ad-trust-deployments_tuning-performance-in-idm#mounting-the-sssd-cache-in-tmpfs_assembly_tuning-sssd-performance-for-large-idm-ad-trust-deployments
The entire chapter 9 may be helpful.
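The core of that change is an fstab entry roughly along these lines (a sketch
based on the linked guide; size, ownership, and SELinux context will vary per
site):

  tmpfs /var/lib/sss/db tmpfs size=300M,mode=0700,uid=sssd,gid=sssd,rootcontext=system_u:object_r:sssd_var_lib_t:s0 0 0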

I also remembered that I recently modified a kernel setting to match the
slurmd port range from slurm.conf (6-63001) by creating the file
/etc/sysctl.d/91-slurm.conf with the following content:
# set the ipv4 local port range according to SlurmdPortRange in slurm.conf
net.ipv4.ip_local_port_range = 32768 63001
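It can be applied without a reboot with the usual sysctl reload:

  sysctl --system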
Unfortunately it hasn't stopped the error from occurring.

Best regards,
Patryk.

On 24/07/23 12:08, Jason Ellul via slurm-users wrote:
> Hi Patryk,
>
> Thanks so much for your email.
>
> There are a couple of things you list that we have not tried yet, so we
> will definitely look at them. You mention optimizing SSSD, which has me
> curious: are you using Red Hat Identity Management (FreeIPA)? We are, and
> after going through our logs it appears the errors became more consistent
> after upgrading our instance and replica to RHEL9.
>
> May I ask what optimizations you put in place for SSSD?
>
> Many thanks
>
> Jason
>
>
> Jason Ellul
> Head - Research Computing Facility
> Office of Cancer Research
> My onsite days are Mon, alt Wed and Friday.
>
> Phone +61 3 8559 6546
> Email jason.el...@petermac.org
> 305 Grattan Street
> Melbourne, Victoria
> 3000 Australia
>
> www.petermac.org
>
> From: Patryk Bełzak via slurm-users 
> Date: Monday, 22 July 2024 at 6:03 PM
> To: Jason Ellul via slurm-users 
> Subject: [slurm-users] Re: slurmctld hourly: Unexpected missing socket error
>
> Hi,
> we've been facing the same issue for some time. At the beginning the
> missing socket error happened every 20 minutes, later once per hour; now it
> happens a few times a day.
> The only downside was that the controller was unresponsive for those few
> seconds - up to 60, if I remember correctly.
> We tried to debug it in many ways, but we found no straightforward solution
> or source of the problem.
>
> Things we've changed since the problem came up:
> * RPC user limit: 
> `SlurmctldParameters=rl_enable,rl_bucket_size=50,rl_refill_period=1,rl_refill_rate=2,rl_table_size=16384`
> * made sure that the VM slurm runs on has the "network-latency" profile in
> `tuned`, and applied the same profile on the worker nodes
> * implemented some of the recommendations from
> https://slurm.schedmd.com/high_throughput.html on the controllers
> * largely optimized slurmdb by some housekeeping and cleaning up inactive 
> accounts, associations etc.
> * optimized SSSD configuration (this one I believe had the biggest impact) 
> both on controllers and on worker nodes
> plus plenty of other (probably unrelated) changes.
>
> I'm not really sure if any of the above helped us significantly in that matter.
>
> Best reg