Re: [slurm-users] Issue with x11

2019-05-15 Thread Stijn De Weirdt
hi all, we are currently also going through the painful process of making x11 support user-friendly, so i'm also in favour of making this work from e.g. vnc or nx/x2go. however, we now run 17.11.8, and we already noticed that 17.11.11 has very different x11-related code. is the 19.05 x11 even more d
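
A minimal sketch of how Slurm's built-in X11 forwarding (the code that changed between these releases) is wired up; exact behaviour differs per version, so treat the settings as illustrative rather than as what the poster runs:

    # slurm.conf: enable the internal X11 forwarding code
    PrologFlags=X11

    # then, from a session with a valid $DISPLAY (e.g. inside vnc or x2go)
    srun --x11 --pty xterm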

Re: [slurm-users] MPI jobs via mpirun vs. srun through PMIx.

2019-09-17 Thread Stijn De Weirdt
hi jurgen, > For our next cluster we will switch from Moab/Torque to Slurm and have > to adapt the documentation and example batch scripts for the users. heh, we did that a year ago, and we made (well, fixed the slurm one) a qsub wrapper to avoid having to document this and retrain our users. (
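
For illustration only, a hypothetical bare-bones qsub-to-sbatch shim; the wrapper referred to above is Slurm's contributed Torque qsub script, which handles far more options than this:

    #!/bin/bash
    # hypothetical minimal shim: map a few common qsub flags to sbatch
    sbatch_args=()
    while getopts "N:q:o:e:" opt; do
      case "$opt" in
        N) sbatch_args+=(--job-name="$OPTARG") ;;   # job name
        q) sbatch_args+=(--partition="$OPTARG") ;;  # queue -> partition
        o) sbatch_args+=(--output="$OPTARG") ;;
        e) sbatch_args+=(--error="$OPTARG") ;;
      esac
    done
    shift $((OPTIND - 1))
    exec sbatch "${sbatch_args[@]}" "$@"   # remaining args: the job script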

Re: [slurm-users] Heterogeneous HPC

2019-09-20 Thread Stijn De Weirdt
hi michael, very interesting feedback! have you ever tried/looked at https://github.com/eth-cscs/sarus? stijn On 9/20/19 9:11 AM, Mahmood Naderan wrote: > I appreciate the replies. > I will try to test Charliecloud to see what is what... > > > On Fri, Sep 20, 2019, 10:37 Fulcomer, Samuel > wr
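
For context, the kind of invocation Sarus targets (image and node count purely illustrative):

    # pull an image once, then run it across nodes under slurm
    sarus pull ubuntu:22.04
    srun -N2 sarus run ubuntu:22.04 hostname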

Re: [slurm-users] [External] Re: openmpi / UCX / srun

2020-08-12 Thread Stijn De Weirdt
hi max, are you using rdma-core with mellanox ofed? and do you have any uverbs_write error messages in dmesg on the hosts? there is an issue with rdma vs tcp in ucx+pmix when rdma-core is not used. the workaround for the issue is to start slurmd on the nodes with environment 'UCX_TLS=tcp,self,sm'
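
One way to apply that workaround, assuming slurmd runs under systemd (a sketch, not taken from the original mail):

    # /etc/systemd/system/slurmd.service.d/ucx.conf
    [Service]
    Environment=UCX_TLS=tcp,self,sm

    # then, on each node:
    systemctl daemon-reload
    systemctl restart slurmd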

Re: [slurm-users] [External] Re: openmpi / UCX / srun

2020-08-14 Thread Stijn De Weirdt
for details and link to youtube recording) stijn > > Thanks for helping me! > -max > > -Original Message- > From: Stijn De Weirdt > Sent: Wednesday, 12 August 2020 22:30 > To: slurm-users@lists.schedmd.com > Subject: Re: [slurm-users] [External] Re:

Re: [slurm-users] [External] Re: openmpi / UCX / srun

2020-08-16 Thread Stijn De Weirdt
hi max, >> you let pmix do its job and thus simply start the mpi parts with srun > instead of mpirun > > In this case, the srun command works fine for 'srun -N2 -n2 --mpi=pmix > pingpong 100 100', but the IB connection is not used for the > communication, only the tcp connection. h, this od
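
A couple of ways to see which transports UCX actually picks (assumes the UCX command-line tools are installed; the pingpong binary is the one from the thread above):

    # list the transports/devices UCX detects on a node
    ucx_info -d | grep -i transport

    # force RDMA transports, to test whether IB works at all
    UCX_TLS=rc,self,sm srun -N2 -n2 --mpi=pmix ./pingpong 100 100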

[slurm-users] Re: Do I have to hold back RAM for worker nodes?

2025-05-12 Thread Stijn De Weirdt via slurm-users
hi all, we are currently going through the process of reviewing our limits after subtle OOM issues that had nothing to do with jobs. we found out that idle (just rebooted) nodes were not representative of nodes that had been running for a while: gpfs mmfsd was using up to 2.5GB extra, rsyslogd w
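
One common way to hold memory back for system daemons is Slurm's MemSpecLimit node parameter; a sketch with illustrative numbers, not the poster's actual config:

    # slurm.conf: of this node's memory, keep 8192 MB for slurmd and
    # system daemons; jobs can allocate RealMemory - MemSpecLimit
    NodeName=node[001-100] RealMemory=257000 MemSpecLimit=8192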

[slurm-users] Re: Wrong MaxRSS Behavior with cgroup v2 in Slurm

2025-05-22 Thread Stijn De Weirdt via slurm-users
salut guillaume, nothing else is different between the v1 and v2 setups? (/tmp is tmpfs on the v2 setup perhaps?) stijn On 5/22/25 11:10, Guillaume COCHARD via slurm-users wrote: Hello, We've noticed a recent change in how MaxRSS is reported on our cluster. Specifically, the MaxRSS value for m
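
Two quick checks relevant to that question (standard tools, nothing slurm-specific):

    # 'cgroup2fs' here means the node runs cgroup v2
    stat -fc %T /sys/fs/cgroup

    # tmpfs for /tmp means files written there are backed by memory
    findmnt -no FSTYPE /tmp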