[slurm-users] Submitting from an untrusted node
Hi,

If I understand it correctly, the MUNGE and SACK authentication modules naturally require that no one can get access to the key. This means that we should not use our normal workstations, to which our users have physical access, to run any jobs, nor could our users use those workstations to submit jobs to the compute nodes. They would have to ssh to a specific submit node and only then could they schedule their jobs.

Is there an elegant way to enable job submission from any computer (possibly requiring that users type their password for the submit node – or for their ssh key – at some point)? (All computers/users use the same LDAP server for logins.)

Best
/rike

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
[slurm-users] Re: Submitting from an untrusted node
Rike,

Assuming the data, scripts and other dependencies are already on the cluster, you could just ssh and execute the sbatch command in a single shot:

    ssh submitnode sbatch some_script.sh

It will ask for a password if appropriate, and ssh keys can be used to bypass that need.

Brian Andrus

On 5/14/2024 5:10 AM, Rike-Benjamin Schuppner via slurm-users wrote:
> Is there an elegant way to enable job submission from any computer
> (possibly requiring that users type their password for the submit node
> – or for their ssh key – at some point)?
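The single-shot pattern above can be wrapped in a tiny client-side helper so users on untrusted workstations never need an interactive login on the submit node. This is only a sketch: the function name `rsbatch`, the host name `submitnode`, and the `DRY_RUN` switch are all assumptions for illustration, not anything Slurm ships.

```shell
# Hypothetical wrapper: forward an sbatch invocation to a trusted submit
# node over ssh. SUBMIT_HOST and the DRY_RUN escape hatch are made up
# for this sketch; adjust to your site.
SUBMIT_HOST="${SUBMIT_HOST:-submitnode}"

rsbatch() {
    # Single-quote each argument so spaces survive the remote shell.
    remote_cmd="sbatch"
    for arg in "$@"; do
        remote_cmd="$remote_cmd '$arg'"
    done
    if [ -n "$DRY_RUN" ]; then
        # Show what would be executed instead of connecting.
        echo "ssh $SUBMIT_HOST $remote_cmd"
    else
        ssh "$SUBMIT_HOST" "$remote_cmd"
    fi
}

# Dry-run demo (nothing actually connects):
DRY_RUN=1 rsbatch --mem=4G some_script.sh
# prints: ssh submitnode sbatch '--mem=4G' 'some_script.sh'
```

With an ssh agent or key loaded, the real (non-dry-run) call submits without a password prompt; otherwise ssh asks for the submit node password, matching the behaviour Brian describes.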
[slurm-users] srun weirdness
I'm running into a strange issue and I'm hoping another set of brains looking at this might help. I would appreciate any feedback.

I have two Slurm clusters. The first cluster is running Slurm 21.08.8 on Rocky Linux 8.9 machines. The second cluster is running Slurm 23.11.6 on Rocky Linux 9.4 machines.

This works perfectly fine on the first cluster:

$ srun --mem=32G --pty /bin/bash
srun: job 93911 queued and waiting for resources
srun: job 93911 has been allocated resources

and on the resulting shell on the compute node:

$ /mnt/local/ollama/ollama help

and the ollama help message appears as expected.

However, on the second cluster:

$ srun --mem=32G --pty /bin/bash
srun: job 3 queued and waiting for resources
srun: job 3 has been allocated resources

and on the resulting shell on the compute node:

$ /mnt/local/ollama/ollama help
fatal error: failed to reserve page summary memory

runtime stack:
runtime.throw({0x1240c66?, 0x154fa39a1008?})
        runtime/panic.go:1023 +0x5c fp=0x7ffe6be32648 sp=0x7ffe6be32618 pc=0x4605dc
runtime.(*pageAlloc).sysInit(0x127b47e8, 0xf8?)
        runtime/mpagealloc_64bit.go:81 +0x11c fp=0x7ffe6be326b8 sp=0x7ffe6be32648 pc=0x456b7c
runtime.(*pageAlloc).init(0x127b47e8, 0x127b47e0, 0x128d88f8, 0x0)
        runtime/mpagealloc.go:320 +0x85 fp=0x7ffe6be326e8 sp=0x7ffe6be326b8 pc=0x454565
runtime.(*mheap).init(0x127b47e0)
        runtime/mheap.go:769 +0x165 fp=0x7ffe6be32720 sp=0x7ffe6be326e8 pc=0x451885
runtime.mallocinit()
        runtime/malloc.go:454 +0xd7 fp=0x7ffe6be32758 sp=0x7ffe6be32720 pc=0x434f97
runtime.schedinit()
        runtime/proc.go:785 +0xb7 fp=0x7ffe6be327d0 sp=0x7ffe6be32758 pc=0x464397
runtime.rt0_go()
        runtime/asm_amd64.s:349 +0x11c fp=0x7ffe6be327d8 sp=0x7ffe6be327d0 pc=0x49421c

If I ssh directly to the same node on that second cluster (skipping Slurm entirely) and run the same "/mnt/local/ollama/ollama help" command, it works perfectly fine.

My first thought was that it might be related to cgroups. I switched the second cluster from cgroup v2 to v1 and tried again, no difference. I tried disabling cgroups on the second cluster by removing all cgroup references in the slurm.conf file, but that also made no difference.

My guess is that something changed with regard to srun between these two Slurm versions, but I'm not sure what.

Any thoughts on what might be happening and/or a way to get this to work on the second cluster? Essentially I need a way to request an interactive shell through Slurm that is associated with the requested resources. Should we be using something other than srun for this?

Thank you,

-Dj
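The "failed to reserve page summary memory" fatal error is raised by the Go runtime when it cannot reserve virtual address space at startup. One thing worth comparing between the working ssh shell and the failing srun shell (a diagnostic sketch, not a confirmed cause) is the address-space resource limit each shell inherits:

```shell
# Run in both the ssh shell and the srun shell on the same node and
# compare. A low "max address space" in the srun shell would be one
# plausible explanation for the Go runtime failing to reserve memory.
ulimit -v                                  # virtual address space limit
grep -i 'address space' /proc/self/limits  # the same limit, as the kernel sees it
```

If the values differ, the srun side may be inheriting a propagated limit (for example via Slurm's PropagateResourceLimits behaviour) that the ssh side does not.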
[slurm-users] Re: srun weirdness
Looks more like a runtime environment issue.

Check the binaries:

ldd /mnt/local/ollama/ollama

on both clusters; comparing the output may give some hints.

Best,

Feng

On Tue, May 14, 2024 at 2:41 PM Dj Merrill via slurm-users wrote:
> I'm running into a strange issue and I'm hoping another set of brains
> looking at this might help. I would appreciate any feedback.
[slurm-users] Re: srun weirdness
Hi Feng,

Thank you for replying.

It is the same binary on the same machine that fails.

If I ssh to a compute node on the second cluster, it works fine. It fails when running in an interactive shell obtained with srun on that same compute node.

I agree that it seems like a runtime environment difference between the SSH shell and the srun-obtained shell.

This is the ldd from within the srun-obtained shell (and it gives the error when run):

[deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
        linux-vdso.so.1 (0x7ffde81ee000)
        libresolv.so.2 => /lib64/libresolv.so.2 (0x154f732cc000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x154f732c7000)
        libstdc++.so.6 => /lib64/libstdc++.so.6 (0x154f7300)
        librt.so.1 => /lib64/librt.so.1 (0x154f732c2000)
        libdl.so.2 => /lib64/libdl.so.2 (0x154f732bb000)
        libm.so.6 => /lib64/libm.so.6 (0x154f72f25000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x154f732a)
        libc.so.6 => /lib64/libc.so.6 (0x154f72c0)
        /lib64/ld-linux-x86-64.so.2 (0x154f732f8000)

This is the ldd from the same exact node within an SSH shell, which runs fine:

[deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
        linux-vdso.so.1 (0x7fffa66ff000)
        libresolv.so.2 => /lib64/libresolv.so.2 (0x14a9d82da000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x14a9d82d5000)
        libstdc++.so.6 => /lib64/libstdc++.so.6 (0x14a9d800)
        librt.so.1 => /lib64/librt.so.1 (0x14a9d82d)
        libdl.so.2 => /lib64/libdl.so.2 (0x14a9d82c9000)
        libm.so.6 => /lib64/libm.so.6 (0x14a9d7f25000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x14a9d82ae000)
        libc.so.6 => /lib64/libc.so.6 (0x14a9d7c0)
        /lib64/ld-linux-x86-64.so.2 (0x14a9d8306000)

-Dj

On 5/14/24 15:25, Feng Zhang via slurm-users wrote:
> Looks more like a runtime environment issue.
> Check the binaries:
> ldd /mnt/local/ollama/ollama
> on both clusters and comparing the output may give some hints.
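Since the library resolution is identical in both shells, a rough next step (file names here are arbitrary, not anything from the thread) is to capture the full environment and resource limits in each shell type and diff them:

```shell
# Run once inside the ssh shell and once inside the srun shell on the
# same node, then diff the captured files. $$ just keeps the two
# captures from colliding; the paths are arbitrary.
env | sort > /tmp/env.$$       # environment variables, sorted for diffing
ulimit -a > /tmp/limits.$$     # all resource limits in this shell
echo "captured to /tmp/env.$$ and /tmp/limits.$$"

# Afterwards, from either shell:
#   diff /tmp/env.<ssh_pid>    /tmp/env.<srun_pid>
#   diff /tmp/limits.<ssh_pid> /tmp/limits.<srun_pid>
```

Any line that differs in the limits diff is a stronger lead than the environment variables, given that the failure is in the Go runtime's startup memory reservation.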
[slurm-users] Re: srun weirdness
Not sure, very strange, though the two linux-vdso.so.1 addresses look different:

[deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
        linux-vdso.so.1 (0x7ffde81ee000)

[deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
        linux-vdso.so.1 (0x7fffa66ff000)

Best,

Feng

On Tue, May 14, 2024 at 3:43 PM Dj Merrill via slurm-users wrote:
> It is the same binary on the same machine that fails.
> If I ssh to a compute node on the second cluster, it works fine.
> It fails when running in an interactive shell obtained with srun on
> that same compute node.
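As an aside on the vdso observation above: differing linux-vdso.so.1 load addresses between two processes are expected on Linux, because the kernel randomizes the vdso mapping per process under ASLR, so this difference is not by itself a fault indicator. A quick check (assumes a Linux /proc filesystem):

```shell
# Each command substitution forks a new grep process, so /proc/self/maps
# shows that grep's own mappings; under ASLR the [vdso] base address
# typically differs between the two runs.
a=$(grep vdso /proc/self/maps)   # vdso mapping of the first process
b=$(grep vdso /proc/self/maps)   # vdso mapping of a second, new process
echo "$a"
echo "$b"
```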
[slurm-users] Slurm release candidate version 24.05.0rc1 available for testing
We are pleased to announce the availability of Slurm release candidate 24.05.0rc1.

To highlight some new features coming in 24.05:

- (Optional) isolated job step management. Enabled on a job-by-job basis with the --stepmgr option, or globally through SlurmctldParameters=enable_stepmgr.
- Federation - Allow for client command operation while SlurmDBD is unavailable.
- New MaxTRESRunMinsPerAccount and MaxTRESRunMinsPerUser QOS limits.
- New USER_DELETE reservation flag.
- New Flags=rebootless option on Features for node_features/helpers, which indicates the given feature can be enabled without rebooting the node.
- Cloud power management options: a new "max_powered_nodes=" option in SlurmctldParameters, and new SuspendExcNodes syntax allowing for a number of nodes out of a given node list to be excluded.
- StdIn/StdOut/StdErr now stored in SlurmDBD accounting records for batch jobs.
- New switch/nvidia_imex plugin for IMEX channel management on NVIDIA systems.
- New RestrictedCoresPerGPU option at the node level, designed to ensure GPU workloads always have access to a certain number of CPUs, even when nodes are running non-GPU workloads concurrently.

This is the first release candidate of the upcoming 24.05 release series; it represents the end of development for this release and a finalization of the RPC and state file formats.

If any issues are identified with this release candidate, please report them through https://bugs.schedmd.com against the 24.05.x version and we will address them before the first production 24.05.0 release is made. Please note that the release candidates are not intended for production use.

A preview of the updated documentation can be found at https://slurm.schedmd.com/archive/slurm-master/ .

Slurm can be downloaded from https://www.schedmd.com/downloads.php .
--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
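For the step-management item in the announcement, a minimal sketch of the two ways to enable it, using only the option names given above (the surrounding configuration and the job script name are hypothetical):

```
# slurm.conf fragment (hypothetical context): enable isolated step
# management globally, per the 24.05 release notes.
SlurmctldParameters=enable_stepmgr

# Or enable it for a single job at submission time:
#   sbatch --stepmgr my_job.sh
```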
[slurm-users] Re: srun weirdness
Do you have any container settings configured?

On Tue, May 14, 2024 at 3:57 PM Feng Zhang wrote:
> Not sure, very strange, while the two linux-vdso.so.1 looks different:
>
> [deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
> linux-vdso.so.1 (0x7ffde81ee000)
>
> [deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
> linux-vdso.so.1 (0x7fffa66ff000)