OK, so it looks like Tachyon is a cluster memory plugin that's marked as
"experimental" in Spark.

In any case, we've got a few requirements for the system we're working on
which may drive the decision on how to implement management of large
resource files.

The system is a framework of N data analyzers which take incoming documents
as input and either transform them or extract some data from them.
These analyzers can be chained together, which makes this a great fit for
processing with RDDs and a set of map/filter-style Spark functions.
There's already an established framework API which we want to preserve.
That means we'll most likely create a relatively thin "binding" layer
that exposes these analyzers as well-documented functions to the end-users
who want to use them in a Spark-based distributed computing environment.
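To make that a bit more concrete, here's a rough sketch of what such a
binding layer might look like. All of the names (Document, DocumentAnalyzer,
analyzeWith) are made-up placeholders, not our actual API:

    // Hypothetical sketch of the thin "binding" layer; all names are placeholders.
    import org.apache.spark.rdd.RDD

    case class Document(id: String, text: String)      // stand-in for our real document type

    trait DocumentAnalyzer extends Serializable {
      def analyze(doc: Document): Document             // simplified; the real API is richer
    }

    object SparkBindings {
      // Expose an analyzer as an ordinary RDD transformation so end-users can chain them.
      def analyzeWith(analyzer: DocumentAnalyzer)(docs: RDD[Document]): RDD[Document] =
        docs.map(d => analyzer.analyze(d))
    }

    // End-user job, chaining analyzers with plain map/filter-style calls:
    //   val out = SparkBindings.analyzeWith(entityExtractor)(
    //               SparkBindings.analyzeWith(tokenizer)(docs))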

We also want, ideally, to shield the end-users who will be writing the
actual Spark jobs from the complexity of how these resources get loaded;
they should only have to deal with the Spark "binding" functions we provide.

So, for managing large numbers of small, medium, or large resource files,
we're considering the options below, each with its own pros and cons,
looked at from the following perspectives:

a) persistence - where the resources reside initially;
b) loading - the mechanics of loading these resources;
c) caching and sharing across worker nodes.

Possible options:

1. Load each resource into a broadcast variable. Considering that we have
scores if not hundreds of these resource files, maintaining that many
broadcast variables seems like a complexity that would be hard to manage.
We'd also need a translation layer between the broadcast variables and the
internal API, which wants to "speak" InputStreams rather than broadcast
variables.
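For what it's worth, a minimal sketch of that translation layer, assuming
each resource is small enough to be read fully into memory on the driver
(broadcastResource and the path handling are made up):

    import java.io.{ByteArrayInputStream, InputStream}
    import java.nio.file.{Files, Paths}
    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    // Read a resource file on the driver and broadcast its bytes to the executors.
    def broadcastResource(sc: SparkContext, path: String): Broadcast[Array[Byte]] =
      sc.broadcast(Files.readAllBytes(Paths.get(path)))

    // Inside a task, adapt the broadcast bytes back to the InputStream-based internal API:
    //   val in: InputStream = new ByteArrayInputStream(resourceBc.value)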

2. Load resources into RDDs and perform joins against them from our
incoming document-data RDDs, thus achieving the effect of a value lookup
from the resources. While this seems like a very Spark'y way of doing
things, the lookup mechanics seem quite non-trivial, especially because
some of the resources aren't going to be pure dictionaries; they may be
statistical models. Additionally, this forces us to adopt Spark's semantics
for handling these resources, which means a potential rewrite of our
internal product API. That would be a hard option to go with.
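To illustrate the lookup-by-join idea anyway, here's a sketch for a
dictionary-style resource only (extractTerm, enrich, and the file layout
are hypothetical; statistical models wouldn't fit this pattern, which is
part of the concern above):

    import org.apache.spark.rdd.RDD

    // Dictionary resource as a pair RDD: term -> dictionary value.
    val dictionary: RDD[(String, String)] =
      sc.textFile("hdfs:///resources/dictionary.tsv")   // tab-separated, for example
        .map(_.split("\t"))
        .map(cols => (cols(0), cols(1)))

    // Key the documents by the term we want to look up, then join.
    val docsByTerm: RDD[(String, Document)] = docs.map(d => (extractTerm(d), d))
    val enriched: RDD[Document] =
      docsByTerm.join(dictionary).map { case (_, (doc, value)) => enrich(doc, value) }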

3. Pre-install all the needed resources on each of the worker nodes;
retrieve them from the local file system and load them into memory as
needed. Ideally, the resources would only be installed once, on the Spark
driver side; we'd like to avoid having to pre-install all these files on
every node. That said, we've tried this as an exercise and the approach
works OK.
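A sketch of what that lazy, per-executor loading could look like (the
install path and ModelLoader are placeholders; the point is a lazily
populated cache in front of the local file system, one per executor JVM):

    import java.io.{ByteArrayInputStream, InputStream}
    import java.nio.file.{Files, Paths}
    import scala.collection.concurrent.TrieMap

    object LocalResources {
      private val installDir = "/opt/analyzers/resources"   // hypothetical pre-install location
      private val cache = TrieMap.empty[String, Array[Byte]]

      // Load the file once per executor JVM, on first use; reuse thereafter.
      def stream(name: String): InputStream = {
        val bytes = cache.getOrElseUpdate(name,
          Files.readAllBytes(Paths.get(installDir, name)))
        new ByteArrayInputStream(bytes)
      }
    }

    // In a task:  val model = ModelLoader.load(LocalResources.stream("sentiment.model"))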

4. Pre-load all the resources into HDFS or S3, i.e. into some distributed
persistent store, and load them into cluster memory from there as necessary.
Presumably this could be a pluggable store with a common API exposed.
Since our framework is an OEM'able product, we could plug and play with a
variety of such persistent stores via Java's FileSystem/URL scheme handler
APIs.
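A sketch of the loading side of this option, using the Hadoop FileSystem
API so the actual store stays pluggable (the URI is just an example; any
scheme with a registered FileSystem implementation, e.g. hdfs:// or s3a://,
should work the same way):

    import java.io.InputStream
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Open a resource from whatever store the URI scheme points at.
    def openResource(uri: String): InputStream = {
      val path = new Path(uri)                            // e.g. "hdfs:///resources/dictionary.tsv"
      val fs   = path.getFileSystem(new Configuration())  // resolves the scheme's FileSystem impl
      fs.open(path)                                       // FSDataInputStream extends InputStream
    }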

5. Implement a Resource management server, with a RESTful interface on top.
Under the covers, this could be a wrapper on top of #4.  Potentially
unnecessary if we have a solid persistent store API as per #4.

6. Beyond persistence, caching also has to be considered for these
resources. We've considered Tachyon (especially since it's pluggable into
Spark), Redis, and the like. Ideally, I would think we'd want resources to
be loaded into cluster memory as needed and paged in/out on demand in an
LRU fashion. From this perspective, it's not yet clear to me what the best
option(s) would be. Any thoughts / recommendations would be appreciated.
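In case it helps frame the discussion, here's the kind of per-executor LRU
cache we're picturing (count-capped for brevity; a real one would likely cap
by bytes and could sit in front of Tachyon/Redis rather than the local heap):

    import java.util.{LinkedHashMap => JLinkedHashMap}
    import java.util.Map.{Entry => JEntry}

    // Access-ordered LinkedHashMap gives simple LRU eviction semantics.
    class LruResourceCache(maxEntries: Int) {
      private val map = new JLinkedHashMap[String, Array[Byte]](16, 0.75f, true) {
        override def removeEldestEntry(eldest: JEntry[String, Array[Byte]]): Boolean =
          size() > maxEntries
      }

      // Return the cached bytes, or load and cache them on a miss.
      def getOrLoad(name: String)(load: String => Array[Byte]): Array[Byte] = synchronized {
        val cached = map.get(name)
        if (cached != null) cached
        else { val loaded = load(name); map.put(name, loaded); loaded }
      }
    }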





On Tue, Jan 12, 2016 at 3:04 PM, Dmitry Goldenberg <dgoldenberg...@gmail.com
> wrote:

> Thanks, Gene.
>
> Does Spark use Tachyon under the covers anyway for implementing its
> "cluster memory" support?
>
> It seems that the practice I hear the most about is the idea of loading
> resources as RDD's and then doing join's against them to achieve the lookup
> effect.
>
> The other approach would be to load the resources into broadcast variables
> but I've heard concerns about memory.  Could we run out of memory if we
> load too much into broadcast vars?  Is there any memory_to_disk/spill to
> disk capability for broadcast variables in Spark?
>
>
> On Tue, Jan 12, 2016 at 11:19 AM, Gene Pang <gene.p...@gmail.com> wrote:
>
>> Hi Dmitry,
>>
>> Yes, Tachyon can help with your use case. You can read and write to
>> Tachyon via the filesystem api (
>> http://tachyon-project.org/documentation/File-System-API.html). There is
>> a native Java API as well as a Hadoop-compatible API. Spark is also able to
>> interact with Tachyon via the Hadoop-compatible API, so Spark jobs can read
>> input files from Tachyon and write output files to Tachyon.
>>
>> I hope that helps,
>> Gene
>>
>> On Tue, Jan 12, 2016 at 4:26 AM, Dmitry Goldenberg <
>> dgoldenberg...@gmail.com> wrote:
>>
>>> I'd guess that if the resources are broadcast Spark would put them into
>>> Tachyon...
>>>
>>> On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com>
>>> wrote:
>>>
>>> Would it make sense to load them into Tachyon and read and broadcast
>>> them from there since Tachyon is already a part of the Spark stack?
>>>
>>> If so I wonder if I could do that Tachyon read/write via a Spark API?
>>>
>>>
>>> On Jan 12, 2016, at 2:21 AM, Sabarish Sasidharan <
>>> sabarish.sasidha...@manthan.com> wrote:
>>>
>>> One option could be to store them as blobs in a cache like Redis and
>>> then read + broadcast them from the driver. Or you could store them in HDFS
>>> and read + broadcast from the driver.
>>>
>>> Regards
>>> Sab
>>>
>>> On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg <
>>> dgoldenberg...@gmail.com> wrote:
>>>
>>>> We have a bunch of Spark jobs deployed and a few large resource files
>>>> such as e.g. a dictionary for lookups or a statistical model.
>>>>
>>>> Right now, these are deployed as part of the Spark jobs which will
>>>> eventually make the mongo-jars too bloated for deployments.
>>>>
>>>> What are some of the best practices to consider for maintaining and
>>>> sharing large resource files like these?
>>>>
>>>> Thanks.
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> Architect - Big Data
>>> Ph: +91 99805 99458
>>>
>>> Manthan Systems | *Company of the year - Analytics (2014 Frost and
>>> Sullivan India ICT)*
>>> +++
>>>
>>>
>>
>
