Re: [slurm-users] srun problem -- Can't find an address, check slurm.conf

2018-11-13 Thread Scott Hazelhurst

Dear all

I still haven’t found the cause of the problem I raised last week where srun -w
xx runs for some nodes but not for others — thanks for the ideas.

One intriguing result I’ve had while pursuing this, which I thought I’d share
in case it sparks some ideas: if I give the full path for srun, then it works.


# show path
[scott@cream-ce ~]$ which srun
/opt/exp_soft/bin/srun


# Node n37 is good (as are most of our nodes)
[scott@cream-ce ~]$ srun  -w n37 --pty bash
[scott@n37 ~]$ 


# Node n38 is not (and a few others)
[scott@cream-ce ~]$ srun  -w n38 --pty bash
srun: error: fwd_tree_thread: can't find address for host n38, check slurm.conf
srun: error: Task launch for 20094.0 failed on node n38: Can't find an address, 
check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

But if I give the full path name — it works!

[scott@cream-ce ~]$ /opt/exp_soft/slurm/bin/srun  -w n38 --pty bash
[scott@n38 ~]$ 
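
The next thing I plan to do is compare the two binaries directly. A rough
sketch of the checks I have in mind (commands only, and the paths are specific
to our install):

# is more than one srun visible on the PATH, and do the two versions differ?
[scott@cream-ce ~]$ type -a srun
[scott@cream-ce ~]$ srun --version
[scott@cream-ce ~]$ /opt/exp_soft/slurm/bin/srun --version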


Scott







Re: [slurm-users] srun problem -- Can't find an address, check slurm.conf

2018-11-13 Thread mercan

Hi;

Are there some typos, or are they really different paths:

/opt/exp_soft/slurm/bin/srun

vs.

which srun
/opt/exp_soft/bin/srun

Ahmet Mercan



On 13.11.2018 11:24, Scott Hazelhurst wrote:

Dear all

I still haven’t found the cause of the problem I raised last week where srun -w
xx runs for some nodes but not for others — thanks for the ideas.

One intriguing result I’ve had while pursuing this, which I thought I’d share
in case it sparks some ideas: if I give the full path for srun, then it works.


# show path
[scott@cream-ce ~]$ which srun
/opt/exp_soft/bin/srun


# Node n37 is good (as are most of our nodes)
[scott@cream-ce ~]$ srun  -w n37 --pty bash
[scott@n37 ~]$


# Node n38 is not (and a few others)
[scott@cream-ce ~]$ srun  -w n38 --pty bash
srun: error: fwd_tree_thread: can't find address for host n38, check slurm.conf
srun: error: Task launch for 20094.0 failed on node n38: Can't find an address, 
check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

But if I give the full path name — it works!

[scott@cream-ce ~]$ /opt/exp_soft/slurm/bin/srun  -w n38 --pty bash
[scott@n38 ~]$


Scott










Re: [slurm-users] Slurm missing non primary group memberships

2018-11-13 Thread Joerg Sassmannshausen
Dear all,

I am wondering if that is the same issue we are having here as well.
When I add users to a secondary group some time *after* the initial
user creation, the user cannot access the Slurm partition they are
supposed to. We found two remedies here, more or less by chance:
- rebooting both the Slurm server and the Slurm DB server
- being patient and waiting long enough

Obviously, neither remedy is suitable if you are running a large
research environment. The reboot only happened because we physically
had to move the servers, and the waiting was simply because we did not
have an answer to the question.
As already mentioned in a different posting, we have deleted the user in
Slurm and re-created it, and updated sssd on the Slurm server, all in
vain.

However, reading the thread, the latter case points to a caching
problem, similar to the one described here. We are also using FreeIPA
and hence sssd for the ID lookup.

Poking the list a bit further on this subject: does anybody have similar
experiences when the lookup is done directly against AD? We are planning
to move to AD, and if that is also an issue there we are at least
forewarned.

All the best

Jörg

On 10/11/18 11:17, Douglas Jacobsen wrote:
> We've had issues getting sssd to work reliably on compute nodes (at
> least at scale), the reason is not fully understood, but basically if
> the connection times out with sssd it'll black list the server for 60s,
> which then causes those kinds of issues.
>
> Setting LaunchParameters=send_gids will sidestep this issue by doing the
> lookups exclusively on the controller node, where more frequent
> connections can prevent time decay disconnections and reduce the
> likelihood of cache misses.
>
> On Fri, Nov 9, 2018 at 11:16 PM Chris Samuel wrote:
>
> On Friday, 9 November 2018 2:47:51 AM AEDT Aravindh Sampathkumar wrote:
>
> > navtp@console2:~> ssh c07b07 id
> > uid=29865(navtp) gid=510(finland)
> groups=510(finland),508(nav),5001(ghpc)
> > context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
>
> Do you have SElinux configured by some chance?
>
> If so you might want to check if it works with it disabled first..
>
> All the best,
> Chris
> --
>  Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
>
>
>
>
> --
> Sent from Gmail Mobile

--
Dr. Jörg Saßmannshausen, MRSC
HPC & Research Data System Engineer
Scientific Computing
The Francis Crick Institute
1 Midland Way
London, NW1 1AT
email: joerg.sassmannshau...@crick.ac.uk
phone: 020 379 65139
The Francis Crick Institute Limited is a registered charity in England and 
Wales no. 1140062 and a company registered in England and Wales no. 06885462, 
with its registered office at 1 Midland Road London NW1 1AT


Re: [slurm-users] Slurm missing non primary group memberships

2018-11-13 Thread Antony Cleave
Are you sure this isn't working as designed?

I remember there is something annoying about groups in the manual.  Here it
is. This is why I prefer accounts.

*NOTE:* For performance reasons, Slurm maintains a list of user IDs allowed
to use each partition and this is checked at job submission time. This list
of user IDs is updated when the *slurmctld* daemon is restarted,
reconfigured (e.g. "scontrol reconfig") or the partition's *AllowGroups* value
is reset, even if its value is unchanged (e.g. "scontrol update
PartitionName=name AllowGroups=group"). For a user's access to a partition
to change, both his group membership must change and Slurm's internal user
ID list must change using one of the methods described above.
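
In practice that means nudging slurmctld after the group change. A minimal
sketch (the partition and group names below are placeholders, substitute your
own):

# what slurmctld currently has cached for the partition
scontrol show partition compute | grep AllowGroups
# resetting AllowGroups, even to the same value, rebuilds the cached UID list
scontrol update PartitionName=compute AllowGroups=finland,nav
# a full re-read of slurm.conf rebuilds it as well
scontrol reconfig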

Are you adding the groups after submission too? Does changing AllowGroups on
the partition fix it?

Antony

On Tue, 13 Nov 2018, 09:13 Joerg Sassmannshausen
<joerg.sassmannshau...@crick.ac.uk> wrote:

> Dear all,
>
> I am wondering if that is the same issue we are having here as well.
> When I add users to a secondary group some time *after* the initial
> user creation, the user cannot access the Slurm partition they are
> supposed to. We found two remedies here, more or less by chance:
> - rebooting both the Slurm server and the Slurm DB server
> - being patient and waiting long enough
>
> Obviously, neither remedy is suitable if you are running a large
> research environment. The reboot only happened because we physically
> had to move the servers, and the waiting was simply because we did not
> have an answer to the question.
> As already mentioned in a different posting, we have deleted the user in
> Slurm and re-created it, and updated sssd on the Slurm server, all in
> vain.
>
> However, reading the thread, the latter case points to a caching
> problem, similar to the one described here. We are also using FreeIPA
> and hence sssd for the ID lookup.
>
> Poking the list a bit further on this subject: does anybody have similar
> experiences when the lookup is done directly against AD? We are planning
> to move to AD, and if that is also an issue there we are at least
> forewarned.
>
> All the best
>
> Jörg
>
> On 10/11/18 11:17, Douglas Jacobsen wrote:
> > We've had issues getting sssd to work reliably on compute nodes (at
> > least at scale), the reason is not fully understood, but basically if
> > the connection times out with sssd it'll black list the server for 60s,
> > which then causes those kinds of issues.
> >
> > Setting LaunchParameters=send_gids will sidestep this issue by doing the
> > lookups exclusively on the controller node, where more frequent
> > connections can prevent time decay disconnections and reduce the
> > likelihood of cache misses.
> >
> > On Fri, Nov 9, 2018 at 11:16 PM Chris Samuel wrote:
> >
> > On Friday, 9 November 2018 2:47:51 AM AEDT Aravindh Sampathkumar
> wrote:
> >
> > > navtp@console2:~> ssh c07b07 id
> > > uid=29865(navtp) gid=510(finland)
> > groups=510(finland),508(nav),5001(ghpc)
> > > context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> >
> > Do you have SElinux configured by some chance?
> >
> > If so you might want to check if it works with it disabled first..
> >
> > All the best,
> > Chris
> > --
> >  Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
> >
> >
> >
> >
> > --
> > Sent from Gmail Mobile
>
> --
> Dr. Jörg Saßmannshausen, MRSC
> HPC & Research Data System Engineer
> Scientific Computing
> The Francis Crick Institute
> 1 Midland Way
> London, NW1 1AT
> email: joerg.sassmannshau...@crick.ac.uk
> phone: 020 379 65139
> The Francis Crick Institute Limited is a registered charity in England and
> Wales no. 1140062 and a company registered in England and Wales no.
> 06885462, with its registered office at 1 Midland Road London NW1 1AT
>


Re: [slurm-users] srun problem -- Can't find an address, check slurm.conf

2018-11-13 Thread Scott Hazelhurst

Dear Mercan

Thank you! — yes, different paths, hence the different behaviour. Amazing how
you can spend so much time looking at something and not see it.

On Sunday I did an upgrade from 17.11.10 to 17.11.12 to try to fix the problem,
but I had left old binaries in a directory I should not have, so I kept getting
the same behaviour.
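
To stop this biting us again I will tidy up so that only one set of binaries is
reachable. A rough sketch of the clean-up check (the paths are specific to our
site):

# confirm only one srun remains visible on the PATH and it is the new one
[scott@cream-ce ~]$ type -a srun
[scott@cream-ce ~]$ srun --version
# the stale copy lived under /opt/exp_soft/bin, so make sure it is gone
[scott@cream-ce ~]$ ls -l /opt/exp_soft/bin/srun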


I can’t be sure, but I think the problem I reported last week was in 17.11.10
and has gone away in 17.11.12.

All good now.


Again, thanks for the help

Scott



> Are there some typos, or are they really different paths:
> 
> /opt/exp_soft/slurm/bin/srun
> 
> vs.
> 
> which srun
> /opt/exp_soft/bin/srun


[slurm-users] heterogeneous jobs using packjob

2018-11-13 Thread Jing Gong
Hi,

I can submit heterogeneous jobs using packjob, like:


#SBATCH -p high_mem
#SBATCH -N 1
#SBATCH --exclusive
#SBATCH packjob
#SBATCH -p log_mem
#SBATCH  -N 2
#SBATCH --exclusive

i.e. specify one fat node and two thin nodes for one job.

If I use "squeue/scontrol" to check the job, it is indeed allocated 3 nodes.
However, if I look at the SLURM environment variables in the job script, I
always get the information for the first partition, "high_mem".

echo $SLURM_NNODES   # 1
echo $SLURM_JOB_NODELIST   # p01k30

How can I obtain the information for both types of nodes in a job script, so
that I can run different applications on different nodes?

Thanks a lot. /Jing 
 




Re: [slurm-users] heterogeneous jobs using packjob

2018-11-13 Thread Jeffrey Frey
See the documentation at


https://slurm.schedmd.com/heterogeneous_jobs.html#env_var


There are *_PACK_* environment variables in the job env that describe the 
heterogeneous allocation.  The batch step of the job (corresponding to your 
script) executes on the first node of the first part of the allocation, which 
in your case is a node from the "high_mem" partition (hence the Slurm variables 
you're seeing).  When using srun inside your batch script, the remote command's 
standard Slurm environment should be set (e.g. SLURM_NNODES, 
SLURM_JOB_NODELIST, SLURM_STEP_NUM_TASKS, etc.).
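
A rough sketch of how the pieces fit together in a script like yours (the
--pack-group option and the *_PACK_GROUP_* variables are the ones described on
the page above; the application names are placeholders):

#!/bin/bash
#SBATCH -p high_mem
#SBATCH -N 1
#SBATCH --exclusive
#SBATCH packjob
#SBATCH -p log_mem
#SBATCH -N 2
#SBATCH --exclusive

# the batch step only sees the first component's SLURM_* variables, but the
# per-component values are exposed with a _PACK_GROUP_<n> suffix
echo "components:   $SLURM_PACK_SIZE"
echo "fat node(s):  $SLURM_JOB_NODELIST_PACK_GROUP_0"
echo "thin node(s): $SLURM_JOB_NODELIST_PACK_GROUP_1"

# run a different application on each component of the allocation
srun --pack-group=0 ./fat_node_app &
srun --pack-group=1 ./thin_node_app &
wait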





> On Nov 13, 2018, at 7:06 AM, Jing Gong  wrote:
> 
> Hi,
> 
> I can submit heterogeneous jobs using packjob, like:
> 
> 
> #SBATCH -p high_mem
> #SBATCH -N 1
> #SBATCH --exclusive
> #SBATCH packjob
> #SBATCH -p log_mem
> #SBATCH  -N 2
> #SBATCH --exclusive
> 
> i.e. specify one fat node and two thin nodes for one job.
> 
> If I use "squeue/scontrol" to check the job, it is indeed allocated 3 nodes.
> However, if I look at the SLURM environment variables in the job script, I
> always get the information for the first partition, "high_mem".
> 
> echo $SLURM_NNODES   # 1
> echo $SLURM_JOB_NODELIST   # p01k30
> 
> How can I obtain the information for both types of nodes in a job script, so
> that I can run different applications on different nodes?
> 
> Thanks a lot. /Jing 
> 
> 
> 


::
Jeffrey T. Frey, Ph.D.
Systems Programmer V / HPC Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE  19716
Office: (302) 831-6034  Mobile: (302) 419-4976
::







[slurm-users] Slurmctld 18.08.1 and 18.08.3 segfault

2018-11-13 Thread Bill Broadley


After being up since the second week of October or so, our Slurm controller
started segfaulting yesterday. It was compiled and run on Ubuntu 16.04.1.


Nov 12 14:31:48 nas-11-1 kernel: [2838306.311552] srvcn[9111]: segfault at 58 ip
004b51fa sp 7fbe270efb70 error 4 in slurmctld[40+eb000]
Nov 12 14:32:48 nas-11-1 kernel: [2838366.586784] srvcn[11217]: segfault at 58
ip 004b51fa sp 7f8f7cc41b70 error 4 in slurmctld[40+eb000]
Nov 12 14:33:48 nas-11-1 kernel: [2838426.761784] srvcn[13231]: segfault at 58
ip 004b51fa sp 7fb78a7e6b70 error 4 in slurmctld[40+eb000]
Nov 12 14:34:48 nas-11-1 kernel: [2838486.976987] srvcn[15228]: segfault at 58
ip 004b51fa sp 7ffb8e9e8b70 error 4 in slurmctld[40+eb000]

I compiled 18.08.3 on 18.04 and it hits the same problem.

Now slurmctld segfaults shortly after boot:
slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 1
slurmctld: debug2: Tree head got back 1
Segmentation fault (core dumped)

If I look at the core dump:

# gdb ./slurmctld
GNU gdb (Ubuntu 8.1-0ubuntu3) 8.1.0.20180409-git
Reading symbols from ./slurmctld...done.
(gdb) core ./core
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./slurmctld -D -v -v -v'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  _step_dealloc_lps (step_ptr=0x555787af0f70) at step_mgr.c:2092
2092            i_first = bit_ffs(job_resrcs_ptr->node_bitmap);
[Current thread is 1 (Thread 0x7f06a93d3700 (LWP 25825))]
(gdb) bt
#0  _step_dealloc_lps (step_ptr=0x555787af0f70) at step_mgr.c:2092
#1  post_job_step (step_ptr=step_ptr@entry=0x555787af0f70) at step_mgr.c:4720
#2  0x55578571d1d8 in _post_job_step (step_ptr=0x555787af0f70) at 
step_mgr.c:270
#3  _internal_step_complete (job_ptr=job_ptr@entry=0x555787af04a0,
step_ptr=step_ptr@entry=0x555787af0f70) at step_mgr.c:311
#4  0x55578571d35c in job_step_complete (job_id=7035546, step_id=4294967295,
uid=uid@entry=0, requeue=requeue@entry=false,
job_return_code=) at step_mgr.c:878
#5  0x5557856f0522 in _slurm_rpc_step_complete (msg=0x7f06a93d2e20,
running_composite=) at proc_req.c:3863
#6  0x5557856fde0b in slurmctld_req (msg=0x7f06a93d2e20, arg=0x7f067c001370)
at proc_req.c:512
#7  0x5557856897e2 in _service_connection (arg=) at
controller.c:1274
#8  0x7f06be41a6db in start_thread (arg=0x7f06a93d3700) at 
pthread_create.c:463
#9  0x7f06be14388f in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb)
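
The faulting line dereferences job_resrcs_ptr->node_bitmap, so my guess is that
job_resrcs_ptr itself is NULL and the fault address 0x58 is just the offset of
node_bitmap. A quick way to confirm from the same core (just a sketch; the
variable names are the ones shown in the backtrace, and job_resrcs as the
job_record field is my assumption from reading the source):

(gdb) frame 0
(gdb) print job_resrcs_ptr
(gdb) print step_ptr->job_ptr->job_resrcs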

Has anyone seen anything like this before?




Re: [slurm-users] Slurmctld 18.08.1 and 18.08.3 segfault

2018-11-13 Thread Kilian Cavalotti
Hi Bill,

On Tue, Nov 13, 2018 at 5:35 PM Bill Broadley  wrote:
> (gdb) bt
> #0  _step_dealloc_lps (step_ptr=0x555787af0f70) at step_mgr.c:2092
> #1  post_job_step (step_ptr=step_ptr@entry=0x555787af0f70) at step_mgr.c:4720
> #2  0x55578571d1d8 in _post_job_step (step_ptr=0x555787af0f70) at 
> step_mgr.c:270
> #3  _internal_step_complete (job_ptr=job_ptr@entry=0x555787af04a0,
> step_ptr=step_ptr@entry=0x555787af0f70) at step_mgr.c:311
> #4  0x55578571d35c in job_step_complete (job_id=7035546, 
> step_id=4294967295,
> uid=uid@entry=0, requeue=requeue@entry=false,
> job_return_code=) at step_mgr.c:878
> #5  0x5557856f0522 in _slurm_rpc_step_complete (msg=0x7f06a93d2e20,
> running_composite=) at proc_req.c:3863

There are a couple of mentions of the same backtrace on the bug tracker,
but those were a long time ago (namely
https://bugs.schedmd.com/show_bug.cgi?id=1557 and
https://bugs.schedmd.com/show_bug.cgi?id=1660, for Slurm 14.11). Weird
to see that popping up again in 18.08.

Cheers,
-- 
Kilian