On Fri, Apr 4, 2014 at 5:12 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> As long as the filesystem is mounted at the same path on every node, you
> should be able to just run Spark and use a file:// URL for your files.
>
> The only downside with running it this way is that Lustre won't expose
> data locality info to Spark, the way HDFS does. That may not matter if it's
> a network-mounted file system though.

Is the locality querying mechanism specific to HDFS, or is it possible to
implement plugins in Spark that query location in other ways on other
filesystems? I ask because glusterfs can expose the data location of a file
through virtual extended attributes, and I would be interested in making
Spark exploit that locality when the file location is specified as
glusterfs:// (or in querying the xattr blindly for file://).

How much of a difference does data locality make for Spark use cases anyway
(since most of the computation happens in memory)? Any sort of numbers?

Thanks!
Avati

> Matei
>
> On Apr 4, 2014, at 4:56 PM, Venkat Krishnamurthy <ven...@yarcdata.com>
> wrote:
>
> All
>
> Are there any drawbacks or technical challenges (or any information,
> really) related to using Spark directly on a global parallel filesystem
> like Lustre/GPFS?
>
> Any idea of what would be involved in doing a minimal proof of concept?
> Is it possible to just run Spark unmodified (without the HDFS substrate)
> for a start, or will that not work at all? I do know that it's possible
> to implement Tachyon on Lustre and get the HDFS interface - just looking
> at other options.
>
> Venkat
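For anyone curious what the xattr approach above might look like: below is a
minimal Python sketch, not Spark code. It assumes the virtual xattr name
trusted.glusterfs.pathinfo (from the GlusterFS docs) and an illustrative
pathinfo string format; the function names are mine. In Spark itself the
natural hook would be a Hadoop FileSystem implementation whose
getFileBlockLocations() returns these hosts, since Spark derives an RDD
partition's preferred locations from the underlying InputSplit's locations.

```python
import os
import re

# Assumption: GlusterFS exposes file placement through this virtual
# extended attribute (name taken from GlusterFS documentation).
PATHINFO_XATTR = "trusted.glusterfs.pathinfo"


def parse_pathinfo(raw):
    """Extract brick hostnames from a pathinfo string, e.g.
    (<DISTRIBUTE:vol-dht> <POSIX(/brick):node1:/brick/data/f>).
    The real format varies with the volume layout (replicate,
    distribute, ...); this only handles simple POSIX-brick entries."""
    return re.findall(r"POSIX\([^)]*\):([^:]+):", raw)


def gluster_block_hosts(path):
    """Best-effort locality lookup. Returns [] when the file is not on
    glusterfs or xattrs are unavailable (os.getxattr is Linux-only)."""
    getx = getattr(os, "getxattr", None)
    if getx is None:
        return []
    try:
        raw = getx(path, PATHINFO_XATTR).decode()
    except OSError:
        # Not a gluster mount, xattr missing, or no permission.
        return []
    return parse_pathinfo(raw)
```

On any non-gluster filesystem gluster_block_hosts() simply returns an empty
list, which matches the "query the xattr blindly for file://" idea: no
locality info just means no preferred hosts, the same situation Spark is in
on Lustre today.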