Hi,
We have a slightly different script that does the same thing. It relies only on /sys:
# Search for InfiniBand devices and wait until
# at least one reports that it is ACTIVE
if [[ ! -d /sys/class/infiniband ]]
then
logger "No infiniband found"
exit 0
fi
ports=$(ls /sys/class/infiniba
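The script above is cut off in this archive view, so the rest of it is not reproduced here. As a separate, minimal sketch of the same idea (not the poster's actual code), the functions below poll the per-port state files that the kernel exposes under /sys/class/infiniband/<device>/ports/<port>/state, which read for example "4: ACTIVE" or "1: DOWN". The function names ib_any_port_active and wait_for_ib, the 60-second default timeout, and the one-second polling interval are all assumptions made for illustration.

```shell
# Hypothetical helper: succeed if any port under the given sysfs root is ACTIVE.
# The root is a parameter (default /sys/class/infiniband) so a fake tree can be used.
ib_any_port_active() {
    local root="${1:-/sys/class/infiniband}"
    local state_file
    for state_file in "$root"/*/ports/*/state; do
        # An unmatched glob stays literal, so skip paths that do not exist.
        [ -r "$state_file" ] || continue
        # Match "N: ACTIVE" exactly, so that "5: ACTIVE_DEFER" does not count.
        if grep -q ': ACTIVE$' "$state_file"; then
            return 0
        fi
    done
    return 1
}

# Hypothetical helper: poll once per second until a port is ACTIVE.
# Returns 0 on success or when no InfiniBand directory exists, 1 on timeout.
wait_for_ib() {
    local root="${1:-/sys/class/infiniband}"
    local timeout="${2:-60}"   # assumed default, in seconds
    local waited=0
    if [ ! -d "$root" ]; then
        # No InfiniBand on this host: nothing to wait for.
        return 0
    fi
    until ib_any_port_active "$root"; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A caller could run something like `wait_for_ib /sys/class/infiniband 120 || logger "InfiniBand not ACTIVE after 120s"` before slurmd starts. Whether a timeout should block slurmd or only log a warning is a site decision.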
Not sure if it's the largest, but LUMI is a very large one
https://www.top500.org/system/180048/
https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/partitions/
On Sun, Oct 29, 2023 at 4:16 AM John Joseph wrote:
> Dear All,
> I would like to know what the maximum scaled-up instance of SL
I would like to report how the Infiniband/OPA network device starts up
step by step as reported by Max's Systemd service from
https://github.com/maxlxl/network.target_wait-for-interfaces
This is the sequence of events during boot:
$ grep wait-for-interfaces.sh /var/log/messages
Nov 1 16:13:39
Could this apply in your case:
https://slurm.schedmd.com/faq.html#opencl_pmix ?
On Wed, Nov 1, 2023 at 5:24 AM Paulo Jose Braga Estrela <
paulo.estr...@petrobras.com.br> wrote:
> Yeah, you are right. I don’t know why but it seems that my email client
> messed with message formatting putting all s
Ole,
Look at the NetworkManager-wait-online.service man page below (from RHEL 8.8).
Maybe your IB interfaces aren't properly configured in NetworkManager. The ***
were added by me.
" NetworkManager-wait-online.service blocks until NetworkManager logs "startup
complete" and announces startup
Yeah, you are right. I don’t know why but it seems that my email client messed
with message formatting putting all srun commands in one line.
PUBLIC
-Original message-
From: slurm-users On behalf of Bjørn-Helge
Mevik
Sent: Wednesday, November 1, 2023 04:55
To: slurm-us..
Hello Gérard,
> On 30/10/2023 15:46, Gérard Henry (AMU) wrote:
>> Hello all,
>> …
>> when it fails, sacct gives the following information:
>> JobID JobName Elapsed NCPUS TotalCPU CPUTime
>> ReqMem MaxRSS MaxDiskRead MaxDiskWrite State ExitCode
>> --
Hi Rémi,
Thanks for the feedback! The patch revert[1] explains SchedMD's reason:
The reasoning is that sysadmins who see nodes with Reason "Not Responding"
but they can manually ping/access the node end up confused. That reason
should only be set if the node is truly not responding, but not i
Hi Ole,
On 30/10/2023 at 13:50, Ole Holm Nielsen wrote:
> I'm fighting this strange scenario where slurmd is started before the
> Infiniband/OPA network is fully up. The Node Health Check (NHC) executed
> by slurmd then fails the node (as it should). This happens only on EL8
> Linux (AlmaLinux
Hi Paulo,
On 11/1/23 01:12, Paulo Jose Braga Estrela wrote:
I think that you should use NetworkManager-wait-online.service in RHEL 8. Take
a look at its man page. It only allows the system to reach network-online after
all network interfaces are online. So, if your OP interfaces are managed by
N
Paulo Jose Braga Estrela writes:
> Hi,
>
> I think that you have a syntax error in your bash script. The "&"
> means that you want to send a process to background not that you want
> to run many commands in parallel. To run commands in a serial fashion
> you should use cmd && cmd2, then the cmd2
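For readers following the "&" versus "&&" point, here is a minimal sketch of the difference, using echo and sleep as stand-ins for the srun commands in the original job script (which is not shown in this preview). The function names run_serial and run_concurrent are hypothetical. Note that commands started with "&" do run concurrently with whatever follows them, which is why a batch script that backgrounds several steps usually ends with wait.

```shell
# '&&' sequences: the second command starts only after the first
# has exited successfully (exit status 0).
run_serial() {
    echo "first" && echo "second"
}

# '&' backgrounds: the shell does not wait for the command, so it runs
# concurrently with what follows. 'wait' then blocks until every
# background job started by this shell has finished.
run_concurrent() {
    { sleep 1; echo "slow"; } &
    echo "fast"
    wait
}
```

Calling run_serial prints "first" and then "second". Calling run_concurrent prints "fast" first, because the backgrounded group is still sleeping, and then "slow" once it finishes.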