Hi Andrew,

1) df2 will cache all of the columns, not just 'a'; the select('a') only prunes columns when reading back out of the cache.
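A quick way to convince yourself of this is to look at the plan once the cache has been materialized. A minimal, untested sketch, assuming a Spark 2.x session named spark and the same test.parquet file with columns a, b, c:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df1 = spark.read.load('test.parquet')  # has columns a, b, c
    df2 = df1.cache()
    df2.select('a').count()                # action that materializes the cache

    # The extended plan should show an InMemoryTableScan on top of an
    # InMemoryRelation carrying the full schema (a, b, c), even though
    # only column 'a' is projected out of it; the Storage tab of the UI
    # should likewise show the size of the full cached relation.
    df2.select('a').explain(True)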
2) In Spark 2 you will receive a warning like:

    WARN execution.CacheManager: Asked to cache already cached data.

I don't recall whether it is the same in 1.6; it seems you are not using Spark 2. (There is a small repro of this in the P.S. below the quoted mail.)

2a) I'm not sure whether you are suggesting a new feature for Spark. Maybe someone else with more PySpark experience can respond to that one.

3) Have you considered using an external block store? Also, if this is an academic environment, I suspect there are much simpler ways to handle this.

Best,
Reza

On Tue, Dec 11, 2018 at 12:13 PM Andrew Melo <andrew.m...@gmail.com> wrote:
> Greetings, Spark Aficionados-
>
> I'm working on a project to (ab-)use PySpark to do particle physics
> analysis, which involves iterating with a lot of transformations (to
> apply weights and select candidate events) and reductions (to produce
> histograms of relevant physics objects). We have a basic version
> working, but I'm looking to exploit some of Spark's caching behavior
> to speed up the interactive computation portion of the analysis,
> probably by writing a thin convenience wrapper. I have a couple of
> questions I've been unable to find definitive answers to, which would
> help me design this wrapper in an efficient way:
>
> 1) When cache()-ing a dataframe where only a subset of the columns is
> used, is the entire dataframe placed into the cache, or only the used
> columns? E.g., does "df2" end up caching only "a", or all three
> columns?
>
> df1 = spark.read.load('test.parquet')  # Has columns a, b, c
> df2 = df1.cache()
> df2.select('a').collect()
>
> 2) Are caches reference-based, or is there some sort of de-duplication
> based on the logical/physical plans? So, for instance, does Spark take
> advantage of the fact that these two dataframes should have the same
> content:
>
> df1 = spark.read.load('test.parquet').cache()
> df2 = spark.read.load('test.parquet').cache()
>
> ...or are df1 and df2 totally independent WRT caching behavior?
>
> 2a) If the cache is reference-based, is it sufficient to hold a
> weakref to the Python object to keep the cache in scope?
>
> 3) Finally, spark.externalBlockStore.blockManager is intriguing in
> our environment, where we have multiple users concurrently analyzing
> mostly the same input datasets. We have enough RAM in our clusters to
> cache a high percentage of the very common datasets, but only if users
> could somehow share their caches (which, conveniently, are the larger
> datasets). We also have very large edge SSD cache servers we use to
> cache trans-oceanic I/O that we could throw at this as well.
>
> It looks, however, like that API was removed in 2.0.0 and there wasn't
> a replacement. There are products like Alluxio, but they aren't
> transparent, requiring the user to manually cache their dataframes by
> doing save/loads to external files using "alluxio://" URIs. Is there
> no way around this behavior now?
>
> Sorry for the long email, and thanks!
> Andrew
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
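P.S. Regarding 2): a minimal, untested sketch of what triggers that CacheManager warning, again assuming a Spark 2.x session named spark:

    df1 = spark.read.load('test.parquet').cache()

    # Same source, hence the same analyzed plan: instead of registering a
    # second cache entry, the CacheManager logs
    #   WARN execution.CacheManager: Asked to cache already cached data.
    df2 = spark.read.load('test.parquet').cache()

    df2.count()  # materializes (or reuses) the single shared cache entry

So, at least in 2.x, the de-duplication is effectively based on the plan rather than on which Python object you happen to hold.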