But rsync -a will only help you if people are using identical, or at least overlapping, data sets, won't it? And you don't need rsync to prune out old files.
On 2/26/19 1:53 AM, Janne Blomqvist wrote:
> On 22/02/2019 18.50, Will Dennis wrote:
>> Hi folks,
>>
>> Not directly Slurm-related, but... We have a couple of research groups
>> that have large data sets they are processing via Slurm jobs
>> (deep-learning applications) and are presently consuming the data via
>> NFS mounts (both groups have 10G ethernet interconnects between the
>> Slurm nodes and the NFS servers.) They are both now complaining of
>> “too-long loading times” for the data, and are casting about for a way
>> to bring the needed data onto the processing node, onto fast SSD
>> single drives (or even SSD arrays.) These local drives would be
>> considered “scratch space”, not for long-term data storage, but for
>> use over the lifetime of a job, or perhaps a few sequential jobs
>> (given the nature of the work.) “Permanent” storage would remain the
>> existing NFS servers. We don’t really have the funding for 25-100G
>> networks and/or all-flash commercial data storage appliances (NetApp,
>> Pure, etc.)
>>
>> Any good patterns that I might be able to learn about implementing
>> here? We have a few ideas floating about, but I figured this already
>> may be a solved problem in this community...
>
> We have a similar problem, many ML users with big datasets. We use
> Lustre over IB, but the problem isn't IO bandwidth per se but rather
> that the datasets tend to be very suboptimal for any kind of network fs
> (lots and lots of small files). We do have node-local disks which we
> currently have configured so that a per-job /tmp is mounted on the local
> disk and then cleaned up at job exit. But this isn't really good for ML
> type workflows: even if they use the local disk, a large fraction of
> the job runtime is then spent copying the data from Lustre to the local
> disk, only for the data to be blown away when the job ends.
>
> Adding quotas to node-local disks doesn't really work either, as we have
> lots of users/groups sharing our resources, and thus if we'd allocate
> the disk space using quotas each one would be getting a uselessly small
> amount.
>
> One idea I've been toying with is to write some duct tape around rsync;
> here are my notes about it:
>
> ## datasync tool
>
> Essentially a small wrapper around 'rsync -a'. The difference is that it
> creates SRC/.datasync and DEST/.datasync directories containing special
> metadata:
>
> - .datasync/TIMESTAMP: The mtime of this empty file is used to check
>   whether the SRC dataset is newer than the DEST dataset; in that case
>   run 'rsync -a', otherwise rsync can be skipped.
>
> - DEST/.datasync/LAST_SYNCED: The mtime of this empty file tells the
>   last time this dataset was synced, whether any rsync was run or not.
>
> - DEST/.datasync/SLURM_JOB_IDS: Contains the Slurm job IDs (if
>   applicable) of the jobs that ran datasync with this DEST directory.
>
> So the idea would be that a user in the job script could do something
> like
>
> #SBATCH blahblah
> srun datasync /scratch/my_group/dataset_big /l/my_group/dataset_big
> srun --gres=gpu:1 my_ML_job.py /l/my_group/dataset_big
>
> ## datasync-reaper
>
> Admin tool that can be run from cron on every compute node to reap
> unused datasets based on policy, e.g. the /l partition must have at
> least 50GB free (or be at most 70% full, or whatever).
> When reaping, it searches for these special .datasync directories (up to
> a configurable recursion depth, say 2 by default) and, based on the
> LAST_SYNCED timestamps, deletes entire datasets starting with the oldest
> LAST_SYNCED, until the policy goal has been met. Directory trees without
> .datasync directories are deleted first. .datasync/SLURM_JOB_IDS is used
> as an extra safety check to not delete a dataset used by a running job.
>
> But nothing concrete has been done yet. Anyway, I'm open to suggestions
> about better ideas, or existing tools that already solve this problem.
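
FWIW, the datasync wrapper sketched in the quoted notes could be little
more than a shell script around rsync -a. Something along these lines
(untested; it assumes that whoever updates the dataset on /scratch also
touches SRC/.datasync/TIMESTAMP, and the --exclude of the metadata
directory is my own addition):

    #!/bin/bash
    # datasync SRC DEST -- untested sketch of the wrapper described above.
    set -eu
    src=$1
    dest=$2

    mkdir -p "$dest/.datasync"

    # Run rsync only on the first sync, or when the source dataset's
    # TIMESTAMP is newer than the last sync of this copy. (If SRC has no
    # TIMESTAMP file, an existing copy is assumed to be up to date.)
    if [ ! -e "$dest/.datasync/LAST_SYNCED" ] ||
       [ "$src/.datasync/TIMESTAMP" -nt "$dest/.datasync/LAST_SYNCED" ]; then
        rsync -a --exclude='.datasync' "$src/" "$dest/"
    fi

    # Record the (re)use time and, when run inside a job, the job ID.
    touch "$dest/.datasync/LAST_SYNCED"
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "$SLURM_JOB_ID" >> "$dest/.datasync/SLURM_JOB_IDS"
    fi

Touching LAST_SYNCED even when the rsync is skipped is what lets the
reaper treat a recently reused copy as fresh.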
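
The reaper could be a similarly small cron script. The /l mount point,
the 50GB free-space target, the oldest-LAST_SYNCED-first order and the
SLURM_JOB_IDS safety check come from the notes above; MIN_FREE_GB, the
depth limit and the squeue-based liveness test are just placeholder
choices, and this sketch only considers datasets that carry .datasync
metadata (it doesn't implement deleting untagged trees first):

    #!/bin/bash
    # datasync-reaper -- untested sketch: free space on the local scratch
    # partition by deleting the least-recently-synced datasets first.
    # SCRATCH and MIN_FREE_GB are illustrative policy knobs.
    set -eu
    SCRATCH=${1:-/l}
    MIN_FREE_GB=${MIN_FREE_GB:-50}

    free_gb() {
        df --output=avail --block-size=1G "$SCRATCH" | tail -n 1 | tr -d ' '
    }

    # Extra safety check: is any job recorded in SLURM_JOB_IDS still
    # queued or running?
    in_use() {
        local id
        [ -f "$1/.datasync/SLURM_JOB_IDS" ] || return 1
        while read -r id; do
            if squeue -h -j "$id" 2>/dev/null | grep -q .; then
                return 0
            fi
        done < "$1/.datasync/SLURM_JOB_IDS"
        return 1
    }

    # Walk datasets in oldest-LAST_SYNCED-first order (the -maxdepth
    # stands in for the configurable recursion depth in the notes).
    find "$SCRATCH" -maxdepth 4 -path '*/.datasync/LAST_SYNCED' \
         -printf '%T@ %h\n' | sort -n |
    while read -r _ meta; do
        if [ "$(free_gb)" -ge "$MIN_FREE_GB" ]; then
            break       # policy goal met, stop deleting
        fi
        dataset=$(dirname "$meta")
        if in_use "$dataset"; then
            continue    # skip datasets referenced by a running job
        fi
        echo "reaping $dataset"
        rm -rf -- "$dataset"
    done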