Keeping everything inside the same program/SparkContext is the most
performant solution, since you can avoid serialization and deserialization
altogether. In-memory persistence between jobs involves a memory copy, uses
a lot of RAM, and still invokes serialization and deserialization.
Technologies that can help you do that easily are Ignite (as mentioned),
but also Alluxio, Cassandra with in-memory tables, and a memory-backed
HDFS directory (see tiered storage).
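
As a rough illustration of that hand-off through a shared in-memory store,
here is a minimal Alluxio-flavored sketch in Scala; the master address,
port, and paths are hypothetical, and it assumes the Alluxio client jar is
on Spark's classpath:

import org.apache.spark.sql.SparkSession

// Program A: write its result to a memory-backed Alluxio path
// (host, port, and path below are placeholders)
val sparkA = SparkSession.builder().appName("ProgramA").getOrCreate()
val resultA = sparkA.range(1000).toDF("id") // stand-in for A's real output
resultA.write.mode("overwrite")
  .parquet("alluxio://alluxio-master:19998/shared/resultA")

// Program C, in a separate SparkContext: read it back, at memory speed
// if the data fits in Alluxio's MEM tier
val sparkC = SparkSession.builder().appName("ProgramC").getOrCreate()
val a = sparkC.read.parquet("alluxio://alluxio-master:19998/shared/resultA")
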
Although Livy and spark-jobserver provide a way to share a single
SparkContext across multiple programs, I would recommend building your own
framework for integrating different jobs, since many features you may need
aren't there yet, and others may cause issues due to their lack of
maturity. Artificially splitting jobs is in general a bad idea, since it
breaks the DAG and thus prevents some potential push-down optimizations.
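
For completeness, a minimal sketch of the single-context approach; the job
logic is a placeholder, but the cache-and-combine pattern is the point:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ABC").getOrCreate()
import spark.implicits._

// "Job" A and "job" B run in the same SparkContext and cache their results
val a = spark.range(0, 1000).toDF("id").cache()   // stand-in for A's logic
val b = spark.range(500, 1500).toDF("id").cache() // stand-in for B's logic

// "Job" C combines both without writing intermediates to external storage;
// since everything is one program, C's plan builds directly on A's and B's
val c = a.join(b, "id").filter($"id" % 2 === 0)
c.show()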

On Tue, Jun 20, 2017 at 10:17 PM, Jean Georges Perrin <j...@jgp.net> wrote:

> Thanks Vadim & Jörn... I will look into those.
>
> jg
>
> On Jun 20, 2017, at 2:12 PM, Vadim Semenov <vadim.seme...@datadoghq.com>
> wrote:
>
> You can launch one permanent Spark context and then execute your jobs
> within that context. Since they'll run in the same context, they can
> share data easily.
>
> These two projects provide the functionality that you need:
> https://github.com/spark-jobserver/spark-jobserver#persistent-context-mode---faster--required-for-related-jobs
> https://github.com/cloudera/livy#post-sessions
>
> On Tue, Jun 20, 2017 at 1:46 PM, Jean Georges Perrin <j...@jgp.net> wrote:
>
>> Hey,
>>
>> Here is my need: program A does something on a set of data and produces
>> results, program B does that on another set, and finally, program C
>> combines the data of A and B. Of course, the easy way is to dump all on
>> disk after A and B are done, but I wanted to avoid this.
>>
>> I was thinking of creating a temp view, but I do not really like the
>> temp aspect of it ;). Any ideas? (They are all worth sharing.)
>>
>> jg
>>
>
>
