On Fri, Apr 4, 2014 at 5:12 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> As long as the filesystem is mounted at the same path on every node, you
> should be able to just run Spark and use a file:// URL for your files.
>
> The only downside with running it this way is that Lustre won't expose
> data locality info to Spark, the way HDFS does. That may not matter if it's
> a network-mounted file system though.

Is the locality querying mechanism specific to HDFS, or is it possible to
implement plugins in Spark that query location in other ways on other
filesystems? I ask because glusterfs can expose the data location of a file
through virtual extended attributes, and I would be interested in making
Spark exploit that locality when the file location is specified as
glusterfs:// (or in querying the xattr blindly for file://).

How much of a difference does data locality make for Spark use cases anyway
(since most of the computation happens in memory)? Any sort of numbers?

Thanks!
Avati

> Matei
>
> On Apr 4, 2014, at 4:56 PM, Venkat Krishnamurthy <ven...@yarcdata.com>
> wrote:
>
> All
>
> Are there any drawbacks or technical challenges (or any information,
> really) related to using Spark directly on a global parallel filesystem
> like Lustre/GPFS?
>
> Any idea of what would be involved in doing a minimal proof of concept?
> Is it possible to just run Spark unmodified (without the HDFS substrate)
> for a start, or will that not work at all? I do know that it's possible
> to implement Tachyon on Lustre and get the HDFS interface - just looking
> at other options.
>
> Venkat
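For anyone curious what the xattr approach above might look like: below is a
minimal Python sketch, not Spark code. It assumes the virtual xattr name
trusted.glusterfs.pathinfo (from the GlusterFS docs) and an illustrative
pathinfo string format; the function names are mine. In Spark itself the
natural hook would be a Hadoop FileSystem implementation whose
getFileBlockLocations() returns these hosts, since Spark derives an RDD
partition's preferred locations from the underlying InputSplit's locations.

```python
import os
import re

# Assumption: GlusterFS exposes file placement through this virtual
# extended attribute (name taken from GlusterFS documentation).
PATHINFO_XATTR = "trusted.glusterfs.pathinfo"


def parse_pathinfo(raw):
    """Extract brick hostnames from a pathinfo string, e.g.
    (<DISTRIBUTE:vol-dht> <POSIX(/brick):node1:/brick/data/f>).
    The real format varies with the volume layout (replicate,
    distribute, ...); this only handles simple POSIX-brick entries."""
    return re.findall(r"POSIX\([^)]*\):([^:]+):", raw)


def gluster_block_hosts(path):
    """Best-effort locality lookup. Returns [] when the file is not on
    glusterfs or xattrs are unavailable (os.getxattr is Linux-only)."""
    getx = getattr(os, "getxattr", None)
    if getx is None:
        return []
    try:
        raw = getx(path, PATHINFO_XATTR).decode()
    except OSError:
        # Not a gluster mount, xattr missing, or no permission.
        return []
    return parse_pathinfo(raw)
```

On any non-gluster filesystem gluster_block_hosts() simply returns an empty
list, which matches the "query the xattr blindly for file://" idea: no
locality info just means no preferred hosts, the same situation Spark is in
on Lustre today.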