Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

Will Dennis Fri, 22 Feb 2019 10:26:14 -0800

(replies inline)

On Friday, February 22, 2019 1:03 PM, Alex Chekholko said:


>Hi Will,
>
>If your bottleneck is now your network, you may want to upgrade the network.  
>Then the disks will become your bottleneck :)
>

Network bandwidth analysis shows it's not really the network that's the 
problem... It's the NFS/disk I/O...

>For GPU training-type jobs that load the same set of data over and over again, 
>local node SSD is a good solution.  Especially with the dropping SSD prices.
>

Good to hear :)
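
For anyone wanting to try the local-SSD route, here is a rough sketch of the 
"copy once, reuse on later jobs" idea. It assumes the dataset is a directory 
on the NFS share and that each node has a scratch directory on its SSD. The 
function name and the ".complete" marker file are made up for illustration, 
and it does no locking, so two jobs starting on the same node at the same 
moment could still race.

```python
import shutil
from pathlib import Path


def stage_to_local(src: Path, cache_root: Path) -> Path:
    """Copy src (e.g. a dataset dir on NFS) under cache_root once, then reuse it."""
    dest = cache_root / src.name
    # Hypothetical marker file written only after a copy has fully finished.
    marker = cache_root / (src.name + ".complete")
    if marker.exists():
        return dest  # already staged by an earlier job on this node
    if dest.exists():
        shutil.rmtree(dest)  # leftover from an interrupted copy
    cache_root.mkdir(parents=True, exist_ok=True)
    shutil.copytree(src, dest)
    marker.touch()
    return dest
```

A job would call this at startup with the NFS path and the node's scratch 
path, then point the training code at the returned directory. Note that a 
cached copy is never refreshed, so a changed dataset needs a new name or a 
manual cleanup.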

>For an example architecture, take a look at the DDN "AI" or IBM "AI" 
>solutions. I think they generally take a storage box with lots of flash 
>storage and connect it via 2 or 4 100Gb links to something like an nvidia DGX 
>(compute node with 8 GPU).  Presumably they are doing mostly small file reads.
>
>In my case, I have whitebox compute nodes with GPUs and SSDs and whitebox ZFS 
>servers connected at 40GbE.  A fraction of the performance at a fraction of 
>the price.
>

Same here, but connected at only 10G... Again, no budget (as of yet, anyhow) 
for 25/40/50/100G networking or all-flash storage :(

>Regards,
>Alex
