Re: [slurm-users] Slurm Multi-cluster implementation

2021-11-01 Thread Yair Yarom
The CPU limit using ulimit is pretty straightforward with pam_limits and
/etc/security/limits.conf. On some of the login nodes we have a CPU-time limit
of 10 minutes, so heavy processes will fail.
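
A minimal limits.conf sketch for that (the wildcard and exact placement are
assumptions; the 10-minute value is the one mentioned above):

# /etc/security/limits.conf (or a drop-in under /etc/security/limits.d/),
# picked up by pam_limits.so in the PAM session stack.
# The "cpu" item is a hard CPU-time limit, in minutes.
*    hard    cpu    10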

The memory limit was a bit more complicated (i.e. not pretty). We wanted a
user not to be able to use more than, e.g., 1G for all of their processes
combined. Using systemd, we added the file
/etc/systemd/system/user-.slice.d/20-memory.conf, which contains:
[Slice]
MemoryLimit=1024M
MemoryAccounting=true
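
To make the drop-in take effect for new sessions and to check the resulting
limit, something along these lines should work (the UID is just an example):

# reload systemd so the user-.slice.d drop-in applies to new sessions
systemctl daemon-reload
# show the effective limit on a logged-in user's slice (example UID 1234,
# MemoryLimit is the cgroup v1 property)
systemctl show -p MemoryLimit user-1234.slice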

But we also wanted to restrict swap usage, and we're still on cgroup v1, so
systemd didn't help there. The ugly part is a pam_exec call to a script that
updates the memsw limit of the cgroup behind the above slice. The script
does more things, but the swap section is more or less:

if [ "x$PAM_TYPE" = 'xopen_session' ]; then
_id=`id -u $PAM_USER`
if [ -z "$_id" ]; then
exit 1
fi
if [[ -e
/sys/fs/cgroup/memory/user.slice/user-${_id}.slice/memory.memsw.limit_in_bytes
]]; then
swap=$((1126 * 1024 * 1024))
echo $swap >
/sys/fs/cgroup/memory/user.slice/user-${_id}.slice/memory.memsw.limit_in_bytes
fi
fi
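
For reference, wiring such a script in via pam_exec is typically one line in
the relevant PAM session stack; the service file and script path below are
assumptions, not our actual setup:

# e.g. in /etc/pam.d/sshd (hypothetical script name and location)
session    optional    pam_exec.so /usr/local/sbin/login-memsw-limit.sh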


On Sun, Oct 31, 2021 at 6:36 PM Brian Andrus  wrote:

> That is interesting to me.
>
> How do you use ulimit and systemd to limit user usage on the login nodes?
> This sounds like something very useful.
>
> Brian Andrus
> On 10/31/2021 1:08 AM, Yair Yarom wrote:
>
> Hi,
>
> If it helps, this is our setup:
> 6 clusters (actually a bit more)
> 1 mysql + slurmdbd on the same host
> 6 primary slurmctld on 3 hosts (need to make sure each has a distinct
> SlurmctldPort)
> 6 secondary slurmctld on an arbitrary node on the clusters themselves.
> 1 login node per cluster (this is a very small VM, and the users are
> limited both to cpu time (with ulimit) and memory (with systemd))
> The slurm.conf's are shared over NFS to everyone, in /path/to/nfs/<cluster name>/slurm.conf, with a symlink in /etc pointing to the relevant cluster's file on each node.
>
> The -M option generally works; we can submit/query jobs from the login node
> of one cluster to another cluster. But there's a caveat to note when
> upgrading: slurmdbd must be upgraded first, and usually we have a fairly
> long gap between upgrading the different clusters. This causes -M to stop
> working, because binaries of one version won't work with the other (I don't
> remember in which direction).
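>
> For reference, the cross-cluster usage looks roughly like this (the cluster
> and job names are just placeholders):
>
> # query another cluster's queue from this login node
> squeue -M clusterB
> # or all clusters known to slurmdbd
> squeue -M all
> # submit a job to a specific cluster
> sbatch -M clusterB job.sh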
> We solved this by using an lmod module per cluster, which sets both the
> SLURM_CONF environment variable and the PATH to the correct slurm binaries
> (which we install in /usr/local/slurm// so that they co-exist). So when
> -M won't work, users can use:
> module load slurm/clusterA
> squeue
> module load slurm/clusterB
> squeue
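>
> In shell terms, loading such a module amounts roughly to the following (the
> install paths here are assumptions about the layout, not the exact ones):
>
> # approximate effect of "module load slurm/clusterA"
> export SLURM_CONF=/path/to/nfs/clusterA/slurm.conf
> export PATH=/usr/local/slurm/clusterA/bin:$PATH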
>
> BR,
>
>
>
>
>
>
>
> On Thu, Oct 28, 2021 at 7:39 PM navin srivastava 
> wrote:
>
>> Thank you Tina.
>> It will really help
>>
>> Regards
>> Navin
>>
>> On Thu, Oct 28, 2021, 22:01 Tina Friedrich 
>> wrote:
>>
>>> Hello,
>>>
>>> I have the database on a separate server (it runs the database and the
>>> database only). The login nodes run nothing SLURM-related; they simply
>>> have the binaries installed & a SLURM config.
>>>
>>> I've never looked into having multiple databases & using
>>> AccountingStorageExternalHost (in fact I'd forgotten you could do that),
>>> so I can't comment on that (maybe someone else can); I think it works,
>>> yes, but as I said I never tested it (didn't see much point in running
>>> multiple databases if one would do the job).
>>>
>>> I actually have specific login nodes for both of my clusters, to make it
>>> easier for users (especially those with not much experience using the
>>> HPC environment); so I have one login node connecting to cluster 1 and
>>> one connecting to cluster 2.
>>>
>>> I think the relevant config entries (if I'm not mistaken) on the login
>>> nodes are probably these.
>>>
>>> The differences in the slurm config files (that haven't got to do with
>>> topology & nodes & scheduler tuning) are:
>>>
>>> ClusterName=cluster1
>>> ControlMachine=cluster1-slurm
>>> ControlAddr=/IP_OF_SLURM_CONTROLLER/
>>>
>>> ClusterName=cluster2
>>> ControlMachine=cluster2-slurm
>>> ControlAddr=/IP_OF_SLURM_CONTROLLER/
>>>
>>> (where IP_OF_SLURM_CONTROLLER is the IP address of host cluster1-slurm,
>>> same for cluster2)
>>>
>>> And then they have common entries for the AccountingStorageHost:
>>>
>>> AccountingStorageHost=slurm-db-prod
>>> AccountingStorageBackupHost=slurm-db-prod
>>> AccountingStoragePort=7030
>>> AccountingStorageType=accounting_storage/slurmdbd
>>>
>>> (slurm-db-prod is simply the hostname of the SLURM database server)
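>>>
>>> A quick sanity check (not part of the setup above) that both clusters are
>>> registered with the same slurmdbd, from either login node:
>>>
>>> # should list both cluster1 and cluster2 with their controller host/port
>>> sacctmgr list cluster format=Cluster,ControlHost,ControlPort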
>>>
>>> Does that help?
>>>
>>> Tina
>>>
>>> On 28/10/2021 14:59, navin srivastava wrote:
>>> > Thank you Tina.
>>> >
>>> > So if I understood correctly: the database is global to both clusters
>>> > and running on the login node?
>>> > Or is the database running on one of the master nodes and shared with
>>> > the other master node?
>>> >
>>> > but as far I have read that the slurm database can a

[slurm-users] What happened to accounting_storage/filetxt?

2021-11-01 Thread Stuart Barkley
I think this is my first time posting here.  Like some others I
recognize from the old Grid Engine mailing list, we are planning to
move from Grid Engine to Slurm.

We had an old minimal test environment that I have been recently
updating and I notice that AccountingStorageType=accounting_storage/filetxt
has disappeared.

It is documented in
https://slurm.schedmd.com/archive/slurm-20.02.7/slurm.conf.html
but seems to have disappeared from
https://slurm.schedmd.com/archive/slurm-20.11.0/slurm.conf.html .

I can't find any mention in this mailing list about the change.  The
release notes at https://slurm.schedmd.com/archive/slurm-20.11.0/news.html
are for an older version of Slurm.

Was there a reason for the removal of the file-based accounting?  We
are just starting to look at how Slurm can handle our accounting needs
but have some initial concerns about the database accounting storage:

* Is the record format for the database storage documented anywhere?
I don't seem to be able to find it.

* What size do people find suitable for the database?  We often get
500K to over 1M accounting records per week in Grid Engine.

* What cleans data out of the accounting database?

* Can the data be archived for long term storage?  We sometimes find
it useful to revisit accounting from 12 or more years ago.

I will probably be coming back with additional questions.  Previous
work has mostly been experimental but we now seem serious about
converting to Slurm.

Thanks,
Stuart Barkley
-- 
I've never been lost; I was once bewildered for three days, but never lost!
--  Daniel Boone