On 10/05/2017 08:38 AM, Blomqvist Janne wrote:
What we do is, roughly, a combination of your options #2 and #3. To start with, 
however, I'd like to point out that we're using Lmod instead of the old Tcl 
environment-modules. I'd really recommend that you do the same.

So basically, we have our modules available on NFS, both the module files 
themselves and the software that the modules make available. Then we use 
configuration management (ansible, in our case) to ensure that Lmod is 
installed on all nodes, and that we have a suitable configuration file in 
/etc/profile.d that adds our NFS location to $MODULEPATH so that Lmod can find 
it.
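
A minimal sketch of what such a profile.d snippet could look like. The file
name, the SITE_MODULE_ROOT variable, the prepend_modulepath helper and the NFS
path are all hypothetical placeholders, not the actual setup described above:

```shell
# /etc/profile.d/z00_site_modules.sh  (hypothetical file name)
# Assumption: the site module tree is NFS-mounted at this placeholder path.
SITE_MODULE_ROOT="${SITE_MODULE_ROOT:-/nfs/apps/modules/all}"

# Prepend a directory to MODULEPATH unless it is already listed,
# so that sourcing this file twice does not create duplicate entries.
prepend_modulepath() {
    case ":${MODULEPATH:-}:" in
        *":$1:"*) ;;  # already present, nothing to do
        *) MODULEPATH="$1${MODULEPATH:+:$MODULEPATH}" ;;
    esac
    export MODULEPATH
}

prepend_modulepath "$SITE_MODULE_ROOT"
```

Once Lmod itself has been initialized, `module use <dir>` is another way to add
a directory to the search path; the plain variable approach above does not
depend on the order in which the profile.d files are sourced.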

We also use EasyBuild to build (most) software and module files; you might want 
to look into that as well.

We use the same approach.

And yes, we tell our users to load the appropriate modules in their Slurm batch 
scripts rather than relying on Slurm to transfer the environment correctly.
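
As an illustration only, a job script following that advice might look roughly
like this. The module name, program name and resource requests are
hypothetical placeholders, and the script needs Slurm and Lmod on the node to
do anything useful:

```shell
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --ntasks=1
#SBATCH --time=00:10:00

# Start from a clean module environment instead of trusting whatever
# was loaded in the shell that ran sbatch.
module purge

# Load exactly what the job needs (placeholder module name).
module load MyApp/1.0

# Placeholder executable.
srun ./my_program
```

Loading the modules inside the script keeps the job reproducible regardless of
the state of the submitting shell.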

As to whether this is preferred, well, it works, but provisioning with 
kickstart + config management gets tedious at scale (say, hundreds of nodes or 
more). If we were to rebuild everything from scratch, I think we'd take a long 
hard look at image-based deployment, e.g. OpenHPC/Warewulf.

We use Kickstart, including some post-install scripts, to automatically install compute nodes with CentOS. At 800 nodes currently, it's not at all tedious to perform installation and config management, IMHO.

In the distant past, we used the image-based approach with SystemImager, but I think this was no simpler than the Kickstart-based approach.

/Ole
