Re: [slurm-users] Nodes not responding... how does slurm track it?

2019-05-15 Thread Barbara Krašovec
It could be a problem with ARP cache. If the number of devices approaches 512, there is a kernel limitation in dynamic ARP-cache size and it can result in the loss of connectivity between nodes. The garbage collector will not run if the number of entries in the cache is less than 128, by default: *g
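As a rough illustration of the limits described in this message, the sketch below compares the number of IPv4 neighbour (ARP) entries on a node with the kernel's three garbage-collection thresholds. The function name `arp_status` and its status labels are hypothetical, chosen for this example only. The `gc_thresh1`, `gc_thresh2` and `gc_thresh3` files are the standard Linux sysctl knobs, commonly defaulting to 128, 512 and 1024, which are the values assumed in the usage note.

```shell
#!/bin/sh
# Hypothetical helper: classify a neighbour-table size against the three
# kernel GC thresholds. Arguments: entry count, gc_thresh1, gc_thresh2, gc_thresh3.
arp_status() {
    count=$1; t1=$2; t2=$3; t3=$4
    if [ "$count" -lt "$t1" ]; then
        echo "ok"                    # below gc_thresh1: garbage collector does not run
    elif [ "$count" -lt "$t2" ]; then
        echo "gc-active"             # between gc_thresh1 and gc_thresh2
    elif [ "$count" -lt "$t3" ]; then
        echo "soft-limit-exceeded"   # above the soft maximum (gc_thresh2)
    else
        echo "hard-limit-reached"    # at or above the hard maximum (gc_thresh3)
    fi
}

# Report on the live system only where the sysctl files and `ip` are available.
neigh=/proc/sys/net/ipv4/neigh/default
if [ -r "$neigh/gc_thresh1" ] && command -v ip >/dev/null 2>&1; then
    count=$(ip -4 neigh show | wc -l)
    arp_status "$count" "$(cat "$neigh/gc_thresh1")" \
        "$(cat "$neigh/gc_thresh2")" "$(cat "$neigh/gc_thresh3")"
fi
```

With the assumed defaults, `arp_status 500 128 512 1024` reports `gc-active` and `arp_status 600 128 512 1024` reports `soft-limit-exceeded`. If a node regularly lands in the upper ranges, the thresholds can be raised through the `net.ipv4.neigh.default.gc_thresh*` sysctl keys; suitable values depend on the site.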

Re: [slurm-users] Issue with x11

2019-05-15 Thread Sean Crosby
Hi Mahmood, I've never tried using the native X11 of SLURM without being ssh'ed into the submit node. Can you try ssh'ing with X11 forwarding to rocks7 (i.e. ssh -X user@rocks7) from a different machine, and then try your srun --x11 command? Sean -- Sean Crosby Senior DevOpsHPC Engineer and H

Re: [slurm-users] Issue with x11

2019-05-15 Thread Marcus Wagner
Dear Mahmood, please open a console in the VNC session, do a ssh -Y rocks7 in the console (yes, relogin to the console) and try it again. SLURM does not want to use local displays, and a VNC session is a "local" display, as far as it concerns linux and the X11 subsystem. So, you need to relogin
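To make the "local display" distinction in this message concrete, here is a minimal sketch that classifies a `DISPLAY` value by its form. The function name `display_kind` and its rule are assumptions for illustration and not Slurm's actual check. A value such as `:1` (typical inside a VNC session) is treated as local, while `localhost:10.0` (typical after `ssh -X` or `ssh -Y`) is treated as a network display.

```shell
#!/bin/sh
# Hypothetical helper: classify a DISPLAY value by its form.
# This is an illustrative heuristic, not the test Slurm itself performs.
display_kind() {
    case "$1" in
        "")        echo "unset" ;;    # no DISPLAY at all
        :*|unix:*) echo "local" ;;    # e.g. ":1" inside a VNC session
        *:*)       echo "network" ;;  # e.g. "localhost:10.0" after ssh -Y
        *)         echo "invalid" ;;  # no display number present
    esac
}

# Report on the current session's DISPLAY.
display_kind "${DISPLAY:-}"
```

Running this in the VNC console before and after the `ssh -Y rocks7` step should show the value change from a local form to a network form, if the relogin worked as described.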

Re: [slurm-users] Nodes not responding... how does slurm track it?

2019-05-15 Thread Bill Broadley
On 5/15/19 12:34 AM, Barbara Krašovec wrote: > It could be a problem with ARP cache. > > If the number of devices approaches 512, there is a kernel limitation in > dynamic > ARP-cache size and it can result in the loss of connectivity between nodes. We have 162 compute nodes, a dozen or so file

Re: [slurm-users] Issue with x11

2019-05-15 Thread Tina Friedrich
Indeed - am I the only person that finds that quite a bit annoying? A lot of interactive software works a lot better over things like NX, so why this limitation? Tina (I realise I'm not adding much to the discussion, probably :) ) On 15/05/2019 08:36, Marcus Wagner wrote: > Dear Mahmood, > > ple

Re: [slurm-users] Nodes not responding... how does slurm track it?

2019-05-15 Thread mercan
Hi; Do not think of "the number of devices" as "the number of servers". Any device that has a MAC address and is connected to your nodes' local networks counts as a device. For example, if your BMC ports (iLO, iDRAC etc.) are connected to one of the networks of your nodes, it doubles the number

Re: [slurm-users] Nodes not responding... how does slurm track it?

2019-05-15 Thread Ole Holm Nielsen
On 15-05-2019 09:34, Barbara Krašovec wrote: It could be a problem with ARP cache. If the number of devices approaches 512, there is a kernel limitation in dynamic ARP-cache size and it can result in the loss of connectivity between nodes. This is something every cluster owner should be awar

[slurm-users] Accounting on group hierarchy

2019-05-15 Thread Alain O' Miniussi
Hi, I created an account with a Parent with: $ sudo sacctmgr create account Name=dsi [..] Parent=galilee Then submitted some jobs in both accounts: [alainm@gemini ~]$ sacct -n -X -S 01.01.19 -E 05.16.19 -o CPUTimeRAW,Account -A dsi,galilee | wc -l 18300 [alainm@gemini ~]$ sacct -n -X -S 01.01.1

Re: [slurm-users] Issue with x11

2019-05-15 Thread Chris Samuel
On 15/5/19 3:01 am, Tina Friedrich wrote: Indeed - am I the only person that finds that quite a bit annoying? A lot of interactive software works a lot better over things like NX, so why this limitation? It might be a limitation around the plumbing they use to do this, and the whole X11 forwa

Re: [slurm-users] Issue with x11

2019-05-15 Thread Tina Friedrich
Hadn't yet read that far - I plan to test 19.05 soon anyway. Will report. (I thought the plumbing was - basically - libssh; and, well, ssh itself is capable of dealing with local displays?) Tina On 15/05/2019 15:06, Chris Samuel wrote: > On 15/5/19 3:01 am, Tina Friedrich wrote: > >> Indeed -

Re: [slurm-users] Issue with x11

2019-05-15 Thread Stijn De Weirdt
hi all, we are currently also going through the painful process of making x11 support userfriendly, so i'm also in favour of making this work from eg vnc or nx/x2go. however, we now run 17.11.8, and we already noticed that 17.11.11 has very different x11 related code. is the 19.05 x11 even more d

Re: [slurm-users] Issue with x11

2019-05-15 Thread Christopher Samuel
On 5/15/19 7:32 AM, Tina Friedrich wrote: Hadn't yet read that far - I plan to test 19.05 soon anyway. Will report. Cool, Tim has ripped out all the libssh code (which caused me issues at ${JOB-1} because it didn't play nicely with SSH keep alive messages) and replaced it with native handling

[slurm-users] account hiearchy question

2019-05-15 Thread Alain O' Miniussi
Hi, I am trying to make sense of the following session: [alainm@gemini ~]$ sacctmgr list account name=child1 Account Descr Org -- child1 child1 parent1 [alainm@gemini ~]$ sacctmg

Re: [slurm-users] account hiearchy question

2019-05-15 Thread Alain O' Miniussi
- On 15 Mai 19, at 19:52, Alain O' Miniussi alain.miniu...@oca.eu wrote: > Hi, > > I am trying to make sense of the following session: > > > [alainm@gemini ~]$ sacctmgr list account name=child1 > AccountDescr Org > -- -

Re: [slurm-users] Issue with x11

2019-05-15 Thread Mahmood Naderan
>please open a console in the VNC session, do a ssh -Y rocks7 in the console (yes, relogin to the console) and try it again. >SLURM does not want to use local displays, and a VNC session is a "local" display, as far as it concerns linux and the X11 >subsystem. >So, you need to relogin or login to a

Re: [slurm-users] Issue with x11

2019-05-15 Thread Mahmood Naderan
>Can you try ssh'ing with X11 forwarding to rocks7 (i.e. ssh -X user@rocks7) from a different machine, and then try your srun >--x11 command? No... This doesn't work either. The error is X11 forwarding not available. Please see the picture at https://pasteboard.co/IeQGNOx.png Regards, Mahmood

Re: [slurm-users] Issue with x11

2019-05-15 Thread Mahmood Naderan
>A >lot of interactive software works a lot better over things like NX, so >why this limitation? Agreed... Slurm is a very powerful job manager and I really appreciate its capabilities. However, I don't know why x11 has always been such a pain with it. spank-x11 was good but that was not a builtin fe

Re: [slurm-users] Issue with x11

2019-05-15 Thread Christopher Samuel
On 5/15/19 11:36 AM, Mahmood Naderan wrote: I really like to know why x11 is not so friendly? For example, slurm works with MPI. Why not with X11?! Because MPI support is fundamental, X11 support is nice to have. I suspect 19.05 will make your life an awful lot easier! All the best, Chris --