Hi Andrew,

Since you mentioned the alternative solution with Alluxio <http://alluxio.org>, here is a more comprehensive tutorial on caching Spark DataFrames on Alluxio: https://www.alluxio.com/blog/effective-spark-dataframes-with-alluxio

Namely, caching your DataFrame is as simple as running df.write.parquet(alluxioFilePath); the DataFrame is stored in Alluxio as Parquet files and can be shared with other users. One advantage of Alluxio here is that you can manually free the cached data from the memory tier, or set a TTL on it, if you'd like more control over the data.
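To make that concrete, here is a minimal sketch of the write/read pattern, assuming the Alluxio client library is on Spark's classpath; the SparkSession setup, the alluxio:// host/port, and the source path below are hypothetical placeholders for illustration:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("alluxio-cache-sketch").getOrCreate()

  # Hypothetical Alluxio URI; substitute your Alluxio master host/port and directory.
  alluxio_file_path = "alluxio://alluxio-master:19998/cache/events.parquet"

  # "Cache" the DataFrame by writing it into Alluxio as Parquet.
  df = spark.read.parquet("hdfs:///data/events.parquet")  # hypothetical source path
  df.write.mode("overwrite").parquet(alluxio_file_path)

  # Any other Spark job or user can now read the shared copy back from Alluxio.
  cached_df = spark.read.parquet(alluxio_file_path)
  cached_df.count()

  # Freeing the cached copy from Alluxio's memory tier, or setting a TTL on it,
  # is done with Alluxio's own tooling (e.g. the "alluxio fs" shell), not from Spark.

Because the data lives in Alluxio rather than in a single application's block manager, it outlives the Spark session that wrote it and is visible to other users' jobs.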
- Bin

On Tue, Dec 11, 2018 at 9:13 AM Andrew Melo <andrew.m...@gmail.com> wrote:

> Greetings, Spark Aficionados-
>
> I'm working on a project to (ab)use PySpark to do particle physics
> analysis, which involves iterating over a lot of transformations (to
> apply weights and select candidate events) and reductions (to produce
> histograms of relevant physics objects). We have a basic version
> working, but I'm looking to exploit some of Spark's caching behavior
> to speed up the interactive computation portion of the analysis,
> probably by writing a thin convenience wrapper. I have a couple of
> questions I've been unable to find definitive answers to, which would
> help me design this wrapper in an efficient way:
>
> 1) When cache()-ing a dataframe where only a subset of the columns is
> used, is the entire dataframe placed into the cache, or only the used
> columns? E.g., does "df2" end up caching only "a", or all three
> columns?
>
> df1 = spark.read.load('test.parquet')  # has columns a, b, c
> df2 = df1.cache()
> df2.select('a').collect()
>
> 2) Are caches reference-based, or is there some sort of de-duplication
> based on the logical/physical plans? For instance, does Spark take
> advantage of the fact that these two dataframes should have the same
> content:
>
> df1 = spark.read.load('test.parquet').cache()
> df2 = spark.read.load('test.parquet').cache()
>
> ...or are df1 and df2 totally independent WRT caching behavior?
>
> 2a) If the cache is reference-based, is it sufficient to hold a
> weakref to the Python object to keep the cache in scope?
>
> 3) Finally, the spark.externalBlockStore.blockManager setting is
> intriguing in our environment, where we have multiple users
> concurrently analyzing mostly the same input datasets. We have enough
> RAM in our clusters to cache a high percentage of the very common
> datasets, but only if users could somehow share their caches (which,
> conveniently, are the larger datasets). We also have very large edge
> SSD cache servers that we use to cache trans-oceanic I/O, which we
> could throw at this as well.
>
> It looks, however, like that API was removed in 2.0.0 without a
> replacement. There are products like Alluxio, but they aren't
> transparent: the user has to manually cache their dataframes by doing
> saves/loads to external files using "alluxio://" URIs. Is there no way
> around this behavior now?
>
> Sorry for the long email, and thanks!
> Andrew
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org