(replies inline)

On Friday, February 22, 2019 1:03 PM, Alex Chekholko said:
>Hi Will,
>
>If your bottleneck is now your network, you may want to upgrade the network.
>Then the disks will become your bottleneck :)
>

Via network bandwidth analysis, it's not really network that's the problem... It’s the NFS/disk I/O...

>For GPU training-type jobs that load the same set of data over and over again,
>local node SSD is a good solution. Especially with the dropping SSD prices.
>

Good to hear :)

>For an example architecture, take a look at the DDN "AI" or IBM "AI"
>solutions. I think they generally take a storage box with lots of flash
>storage and connect it via 2 or 4 100Gb links to something like an nvidia DGX
>(compute node with 8 GPU). Presumably they are doing mostly small file reads.
>
>In my case, I have whitebox compute nodes with GPUs and SSDs and whitebox ZFS
>servers connected at 40GbE. A fraction of the performance at a fraction of
>the price.
>

Same here, but connected at only 10G... Again, no budget (as of yet, anyhow) to do 25/40/50/100G network or all-flash storage :(

>Regards,
>Alex