On 10/05/2017 08:38 AM, Blomqvist Janne wrote:
> What we do is, roughly, a combination of your options #2 and #3. To start
> with, however, I'd like to point out that we're using Lmod instead of the
> old Tcl environment-modules. I'd really recommend that you do the same.
>
> So basically, we have our modules available on NFS, both the module files
> themselves and the software that the modules make available. Then we use
> configuration management (Ansible, in our case) to ensure that Lmod is
> installed on all nodes, and that we have a suitable configuration file in
> /etc/profile.d that adds our NFS location to $MODULEPATH so that Lmod can
> find it.
>
> We also use EasyBuild to build (most) software and module files; you might
> want to look into that as well.

We use the same approach.
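
For illustration, a minimal sketch of the kind of /etc/profile.d file described
above. The file name and the NFS path /nfs/apps/modulefiles are assumptions,
not anything from either site; substitute your own location.

```shell
# /etc/profile.d/z00_site_modules.sh  (hypothetical file name)
# Assumption: the site's module files live under this NFS path.
SITE_MODULES="${SITE_MODULES:-/nfs/apps/modulefiles}"

# Prepend the NFS location to MODULEPATH only if it is not already present,
# so that sourcing this file again in a nested shell does not duplicate it.
case ":${MODULEPATH:-}:" in
    *":${SITE_MODULES}:"*) ;;
    *) MODULEPATH="${SITE_MODULES}${MODULEPATH:+:${MODULEPATH}}" ;;
esac
export MODULEPATH
```

Since the file only touches $MODULEPATH, the same file can be pushed unchanged
to every node by whatever configuration management is in use.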

> And yes, we tell our users to load the appropriate modules in the Slurm
> batch scripts rather than relying on Slurm to transfer the environment
> correctly.

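A batch script along those lines might look like the sketch below. The module
names and ./my_program are hypothetical placeholders, and the resource
requests are arbitrary; only the pattern of loading modules inside the job
matters here.

```shell
#!/bin/bash -l
# The -l (login shell) flag makes bash read /etc/profile, so the "module"
# command is defined even if the submit-time environment is not propagated.
#SBATCH --job-name=example
#SBATCH --nodes=1
#SBATCH --time=00:10:00

# Start from a clean module state instead of whatever was loaded at submit time.
module purge

# Hypothetical module names; use whatever your site's "module avail" lists.
module load GCC OpenMPI

# Hypothetical executable built against the modules loaded above.
srun ./my_program
```

Loading the modules in the script makes the job reproducible regardless of
what the user happened to have loaded in the shell they submitted from.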
> As to whether this is preferred, well, it works, but provisioning with
> Kickstart + config management gets tedious at scale (say, hundreds of nodes
> or more). If we were to rebuild everything from scratch, I think we'd take a
> long hard look at image-based deployment, e.g. OpenHPC/Warewulf.

We use Kickstart including some post-install scripts to automatically
install compute nodes with CentOS. At 800 nodes currently, it's not at
all tedious to perform installation and config management, IMHO.
In the distant past, we used the image-based approach with SystemImager,
but I think this was no simpler than the Kickstart-based approach.
/Ole