Hi Will,

You have bumped into the old adage: "HPC is just about moving the bottlenecks around".
If your bottleneck is now your network, you may want to upgrade the network. Then the disks will become your bottleneck :)

For GPU training-type jobs that load the same set of data over and over again, local node SSD is a good solution, especially with dropping SSD prices.

For an example architecture, take a look at the DDN "AI" or IBM "AI" solutions. I think they generally take a storage box with lots of flash storage and connect it via 2 or 4 100Gb links to something like an NVIDIA DGX (a compute node with 8 GPUs). Presumably they are doing mostly small-file reads.

In my case, I have whitebox compute nodes with GPUs and SSDs, and whitebox ZFS servers connected at 40GbE. A fraction of the performance at a fraction of the price.

Regards,
Alex

On Fri, Feb 22, 2019 at 9:52 AM Will Dennis <wden...@nec-labs.com> wrote:
> Thanks for the reply, Ray.
>
> For one of my groups, on the GPU servers in their cluster, I have provided a RAID-0 md array of multi-TB SSDs (for I/O speed) mounted on a given path ("/mnt/local" for historical reasons) that they can use for local scratch space. Their other servers in the cluster have a single multi-TB spinning disk mounted at that same path. We do not manage the data at all on this path; it's currently up to the researchers to put needed data there, and to remove it when it is no longer needed. (They wanted us to auto-manage the removal, but we aren't in a position to know what data they still need, and "delete data if atime/mtime is older than [...]" via cron is a bit too simplistic.) They can use that local-disk path in any way they want, with the caveats that it's not to be used as "permanent storage", there are no backups, and if we suffer a disk failure, etc., we just replace the disk and the old data is gone.
>
> The other group has (at this moment) no local disk at all on their worker nodes. They actually work with even bigger data sets than the first group, and they are the ones that really need a solution. I figured that if I solve the one group's problem, I can also implement the solution for the other (and perhaps even on future Slurm clusters we spin up.)
>
> A few other questions I have:
> - is it possible in Slurm to define more than one filesystem path (i.e., other than "/tmp") as "TmpDisk"?
> - any way to allocate storage on a node via GRES or another method?
>
> On Friday, February 22, 2019 12:06 PM, Raymond Wan wrote:
>
> > Hi Will,
> >
> > I'm not a system administrator, but on the cluster that I have access to, indeed that is what we are given. Everything, including our home directories, is NFS-mounted. Each node has a very large scratch space (i.e., /tmp), which is periodically deleted; I think the sysadmins have a cron job that wipes it occasionally.
> >
> > We also only have a 10G network, and sure, people will complain about how everything should be faster, but our sysadmins are doing the best they can with the budget allocated to them. If they want 100G speed, then they need to give the money to the sysadmins to play with. :-)
> >
> > Each research group is given a disk array (or more, depending on their budget), and thus disk quota isn't managed by the sysadmins. If disk space is exhausted, it's up to the head of the research group to either buy more disk space or get their team members to share.
> >
> > I suppose if some of this data is needed across jobs, you can maybe allocate a fixed amount of quota on each node's scratch space to each lab.
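If you do carve up node-local scratch per lab like that, per-directory quotas can enforce it on the node itself, with nothing for the scheduler to track. A rough sketch, assuming the scratch filesystem is XFS mounted with the prjquota option; the "lab1" name, the project ID, and the 2 TB cap are made-up examples:

    # register the lab's directory as an XFS project (id 42, name "lab1")
    echo "42:/mnt/local/lab1" >> /etc/projects
    echo "lab1:42" >> /etc/projid
    xfs_quota -x -c 'project -s lab1' /mnt/local
    # hard-cap the lab at 2 TB of local scratch
    xfs_quota -x -c 'limit -p bhard=2t lab1' /mnt/local

(ext4 has grown a similar project-quota feature, so the same idea should work there too.)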
> > Then, you would have to teach them to write SLURM scripts that check if the file is there and, if not, to make a copy of it. Of course, you'd want to make sure they are careful not to have concurrent jobs... The type of data analysis I do involves an index. If jobs #1 and #2 run on the same node, both will see (in, let's say, /tmp/rwan/myindex/) that the index is absent and do a copy. I guess this is the tricky bit... But this kind of management is left for us users to worry about; the sysadmins just give us the scratch space and it's up to us to find a way to make good use of it.
> >
> > I hope this helps. I'm not sure if this is the kind of information you were looking for?
> >
> > Ray
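One more thought on the concurrent-copy problem Ray describes: we wrap the stage-in on each node in flock(1), so the first job does the copy and any overlapping jobs just block until it is done. An untested sketch; the paths are made-up examples:

    #!/bin/bash
    # stage a shared index to node-local scratch, at most once per node
    SRC=/nfs/data/myindex           # master copy on the file server (example path)
    DST=/mnt/local/$USER/myindex    # node-local scratch (example path)
    mkdir -p "$(dirname "$DST")"
    (
        flock -x 9                  # only one job per node holds the lock
        if [ ! -d "$DST" ]; then
            rsync -a "$SRC/" "$DST/"    # first job pays the copy cost
        fi
    ) 9>"${DST}.lock"
    # ... run the job against $DST ...

As for the GRES question: I haven't done this myself, but I believe you can declare local scratch as a countable generic resource (no plugin needed) and have jobs reserve it, along the lines of (the "localtmp" name and the counts are just illustrative):

    GresTypes=gpu,localtmp
    NodeName=gpu01 Gres=gpu:8,localtmp:3500

in slurm.conf, and then "sbatch --gres=localtmp:100 ...". Be aware that Slurm only does the bookkeeping; nothing enforces that a job actually stays within the space it reserved.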