Re: [slurm-users] Issue with x11

2019-05-17 Thread Alan Orth
Dear Christopher, I tried as you suggested and increased UnkillableStepTimeout from 60 to 120 seconds, but a few hours later three of my nodes were drained with reason "Kill task failed" again. We're not using cgroups. There is a bug¹ on SchedMD's tracker describing attempts to understand this err

Re: [slurm-users] Issue with x11

2019-05-16 Thread Christopher Samuel
On 5/16/19 1:04 AM, Alan Orth wrote: but now we get a handful of nodes drained every day with reason "Kill task failed". In ten years of using SLURM I've never had so many problems as I'm having now. :\ We see "kill task failed" issues but as Marcus says that's not related to X11 support, wh

Re: [slurm-users] Issue with x11

2019-05-16 Thread Christopher Samuel
On 5/16/19 8:53 AM, Mahmood Naderan wrote: Can I ask what is the expected release date for 19? It seems that rc1 has been released in May? Sometime in May hopefully! -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Re: [slurm-users] Issue with x11

2019-05-16 Thread Mahmood Naderan
Can I ask what is the expected release date for 19? It seems that rc1 has been released in May? Regards, Mahmood On Thu, May 16, 2019 at 4:48 PM Marcus Wagner wrote: > Hi Alan, > > we are also seeing this, but that has nothing to do with X11 support, > since we compile atm. SLURM without

Re: [slurm-users] Issue with x11

2019-05-16 Thread Marcus Wagner
Hi Alan, we are also seeing this, but that has nothing to do with X11 support, since we currently compile SLURM without X11 support. We also sometimes see jobs keep running even if, e.g., MPI rank one got killed by oom while rank zero is stuck in mpi_finalize. SLURM seems to not detect every time, if oom ki

Re: [slurm-users] Issue with x11

2019-05-16 Thread Alan Orth
Yes I'm also looking forward to SLURM 19.05. We have had lots of issues with X11 since we upgraded to 18.08 and started using its built-in X11 support. Part of this was resolved by setting "X11Parameters=local_xauthority" in slurm.conf to reduce locking contention on the Xauthority file, but now we
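The slurm.conf settings mentioned in this thread (X11Parameters, UnkillableStepTimeout) can be read back from the file to confirm what a node is actually configured with. This is a minimal sketch, assuming one `Key=Value` pair per line; the function name `slurm_conf_get` is hypothetical, and it prints only the first match rather than reproducing how Slurm itself resolves the file.

```shell
# Print the value of a parameter from a slurm.conf-style file.
# $1 = parameter name (matched case-insensitively), $2 = path to the file.
# Assumption: one Key=Value pair per line; only the first match is printed.
slurm_conf_get() {
    awk -v key="$1" '
        /^[ \t]*#/ { next }                      # skip comment lines
        index($0, "=") > 0 {
            name = substr($0, 1, index($0, "=") - 1)
            gsub(/[ \t]/, "", name)
            if (tolower(name) == tolower(key)) {
                value = substr($0, index($0, "=") + 1)
                sub(/[ \t]*#.*$/, "", value)     # drop a trailing comment
                sub(/^[ \t]+/, "", value)        # trim leading blanks
                sub(/[ \t]+$/, "", value)        # trim trailing blanks
                print value
                exit
            }
        }' "$2"
}
```

For example, `slurm_conf_get X11Parameters /etc/slurm/slurm.conf` (the path is an assumption and varies between installations) would be expected to print `local_xauthority` on a system configured as described above.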

Re: [slurm-users] Issue with x11

2019-05-15 Thread Christopher Samuel
On 5/15/19 11:36 AM, Mahmood Naderan wrote: I really like to know why x11 is not so friendly? For example, slurm works with MPI. Why not with X11?! Because MPI support is fundamental, X11 support is nice to have. I suspect 19.05 will make your life an awful lot easier! All the best, Chris --

Re: [slurm-users] Issue with x11

2019-05-15 Thread Mahmood Naderan
>A >lot of interactive software works a lot better over things like NX, so >why this limitation? Agreed... Slurm is a very powerful job manager and I really appreciate its capabilities. However, I don't know why x11 has been always a pain for that? spank-x11 was good but that was not a builtin fe

Re: [slurm-users] Issue with x11

2019-05-15 Thread Mahmood Naderan
>Can you try ssh'ing with X11 forwarding to rocks7 (i.e. ssh -X user@rocks7) from a different machine, and then try your srun >--x11 command? No... This doesn't work either. The error is X11 forwarding not available. Please see the picture at https://pasteboard.co/IeQGNOx.png Regards, Mahmood

Re: [slurm-users] Issue with x11

2019-05-15 Thread Mahmood Naderan
>please open a console in the VNC session, do a ssh -Y rocks7 in the console (yes, relogin to the console) and try it again. >SLURM does not want to use local displays, and a VNC session is a "local" display, as far as it concerns linux and the X11 >subsystem. >So, you need to relogin or login to a

Re: [slurm-users] Issue with x11

2019-05-15 Thread Christopher Samuel
On 5/15/19 7:32 AM, Tina Friedrich wrote: Hadn't yet read that far - I plan to test 19.05 soon anyway. Will report. Cool, Tim has ripped out all the libssh code (which caused me issues at ${JOB-1} because it didn't play nicely with SSH keep alive messages) and replaced it with native handling

Re: [slurm-users] Issue with x11

2019-05-15 Thread Stijn De Weirdt
hi all, we are currently also going through the painful process of making x11 support userfriendly, so i'm also in favour of making this work from eg vnc or nx/x2go. however, we now run 17.11.8, and we already noticed that 17.11.11 has very different x11 related code. is the 19.05 x11 even more d

Re: [slurm-users] Issue with x11

2019-05-15 Thread Tina Friedrich
Hadn't yet read that far - I plan to test 19.05 soon anyway. Will report. (I thought the plumbing was - basically - libssh; and, well, ssh itself is capable of dealing with local displays?) Tina On 15/05/2019 15:06, Chris Samuel wrote: > On 15/5/19 3:01 am, Tina Friedrich wrote: > >> Indeed -

Re: [slurm-users] Issue with x11

2019-05-15 Thread Chris Samuel
On 15/5/19 3:01 am, Tina Friedrich wrote: Indeed - am I the only person that finds that quite a bit annoying? A lot of interactive software works a lot better over things like NX, so why this limitation? It might be a limitation around the plumbing they use to do this, and the whole X11 forwa

Re: [slurm-users] Issue with x11

2019-05-15 Thread Tina Friedrich
Indeed - am I the only person that finds that quite a bit annoying? A lot of interactive software works a lot better over things like NX, so why this limitation? Tina (I realise I'm not adding much to the discussion, probably :) ) On 15/05/2019 08:36, Marcus Wagner wrote: > Dear Mahmood, > > ple

Re: [slurm-users] Issue with x11

2019-05-15 Thread Marcus Wagner
Dear Mahmood, please open a console in the VNC session, do a ssh -Y rocks7 in the console (yes, relogin to the console) and try it again. SLURM does not want to use local displays, and a VNC session is a "local" display, as far as it concerns linux and the X11 subsystem. So, you need to relogin

Re: [slurm-users] Issue with x11

2019-05-15 Thread Sean Crosby
Hi Mahmood, I've never tried using the native X11 of SLURM without being ssh'ed into the submit node. Can you try ssh'ing with X11 forwarding to rocks7 (i.e. ssh -X user@rocks7) from a different machine, and then try your srun --x11 command? Sean -- Sean Crosby Senior DevOpsHPC Engineer and H

Re: [slurm-users] Issue with x11

2019-05-14 Thread Mahmood Naderan
>No, but you'll need to logout of rocks7 and ssh back into it. >Are you physically logged into rocks7? Or are you connecting via SSH? $DISPLAY = :1 kind of means that you are physically logged into the machine I am connecting through a vnc session. Right now, I have access to the desktop of the f

Re: [slurm-users] Issue with x11

2019-05-14 Thread Sean Crosby
Hi Mahmood, Are you physically logged into rocks7? Or are you connecting via SSH? $DISPLAY = :1 kind of means that you are physically logged into the machine Sean -- Sean Crosby Senior DevOpsHPC Engineer and HPC Team Lead | Research Platform Services Research Computing | CoEPP | School of Physi

Re: [slurm-users] Issue with x11

2019-05-14 Thread Christopher Samuel
On 5/14/19 5:09 PM, Mahmood Naderan wrote: Should I modify that parameter on compute-0-0 too? No, but you'll need to logout of rocks7 and ssh back into it. Or are you on the console of the system itself? -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Re: [slurm-users] Issue with x11

2019-05-14 Thread Mahmood Naderan
>What does this say? >echo $DISPLAY On frontend of compute-0-0? [mahmood@rocks7 ~]$ echo $DISPLAY :1 >To get native X11 working with SLURM, we had to add this config to sshd_config on the login node (your rocks7 host) >X11UseLocalhost no >You'll then need to restart sshd I checked that and it

Re: [slurm-users] Issue with x11

2019-05-14 Thread Sean Crosby
Hi Mahmood, To get native X11 working with SLURM, we had to add this config to sshd_config on the login node (your rocks7 host) X11UseLocalhost no You'll then need to restart sshd Sean -- Sean Crosby Senior DevOpsHPC Engineer and HPC Team Lead | Research Platform Services Research Computing |
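The sshd_config change described above can be checked before and after editing the file. This is a minimal sketch, assuming whitespace-separated `Keyword value` lines and no Match blocks; the function name `x11_uselocalhost` is made up for illustration, and the fallback of `yes` reflects the documented OpenSSH default when the keyword is absent.

```shell
# Report the X11UseLocalhost setting found in an sshd_config-style file.
# $1 = path to the file. Keywords are matched case-insensitively, and the
# first uncommented occurrence wins. Prints "yes" (the OpenSSH default)
# if the keyword does not appear.
# Assumption: no Match blocks and no "Keyword=value" syntax in the file.
x11_uselocalhost() {
    awk 'tolower($1) == "x11uselocalhost" { print tolower($2); found = 1; exit }
         END { if (!found) print "yes" }' "$1"
}
```

Running `x11_uselocalhost /etc/ssh/sshd_config` on the login node should print `no` once the change is in place. sshd still has to be restarted for it to take effect, and existing SSH sessions keep their old DISPLAY until you log out and back in.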

Re: [slurm-users] Issue with x11

2019-05-14 Thread Christopher Samuel
On 5/14/19 4:00 PM, Mahmood Naderan wrote: srun: error: Cannot forward to local display. Can only use X11 forwarding with network displays. What does this say? echo $DISPLAY All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
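The srun error distinguishes local from network displays, and the value of $DISPLAY shows which one a session has. The sketch below only illustrates that distinction and is not the check srun itself performs; the function name `classify_display` is hypothetical.

```shell
# Classify an X11 display string as local or network.
# A value such as ":1" or "unix:0" names a display on the same machine
# (console, VNC); a value such as "localhost:10.0" or "host:0" carries a
# host part, which is what SSH X11 forwarding normally sets.
classify_display() {
    case "$1" in
        "")        echo "unset" ;;
        :*|unix:*) echo "local" ;;
        *:*)       echo "network" ;;
        *)         echo "invalid" ;;
    esac
}
```

Calling `classify_display "$DISPLAY"` inside a VNC desktop where DISPLAY is `:1` prints `local`, which matches the situation reported in this thread; after logging in again with `ssh -Y` the value would typically have a host part and be classified as `network`.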

[slurm-users] Issue with x11

2019-05-14 Thread Mahmood Naderan
Hi, I think I have asked this question before, but wasn't able to fix it. While the "xclock" command works via "ssh -Y", srun with the x11 option fails to open xclock. [mahmood@rocks7 ~]$ srun --x11 --nodelist=compute-0-0 --account y4 --partition RUBY -n 1 -c 4 --mem=1GB xclock srun: error: Cannot forwa