People also store data off-heap by putting Parquet data into Tachyon.
The optimization in 1.2 is that .cache() now uses the in-memory columnar
format instead of keeping row objects (and their boxed contents) around.
This significantly reduces the number of live objects.
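For concreteness, here is a minimal sketch of what that looks like against the
1.2 API; the case class, table name, and data below are made up for
illustration, not from this thread:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Event(id: Long, name: String)

object ColumnarCacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("columnar-cache-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // createSchemaRDD turns an RDD of case classes into a SchemaRDD.
    val events = sqlContext.createSchemaRDD(
      sc.parallelize(1 to 100000).map(i => Event(i.toLong, "event_" + i)))
    events.registerTempTable("events")

    // In 1.2, .cache() on a SchemaRDD stores the data in the in-memory
    // columnar format rather than as boxed row objects, leaving far fewer
    // live objects for the GC to trace.
    events.cache()

    println(sqlContext.sql("SELECT COUNT(*) FROM events").collect().mkString(","))
    sc.stop()
  }
}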
Michael,
I should probably look more closely at the design of 1.2 vs 1.1 myself, but
I've been curious: why does Spark's in-memory data use the heap instead of
being stored off-heap? Was this the optimization that was done in 1.2 to
alleviate GC pressure?
On Mon, Nov 3, 2014 at 8:52 PM, Shailesh Birari wrote:
Yes, I am using Spark 1.1.0 and have used rdd.registerTempTable().
I tried adding sqlContext.cacheTable(), but it took 59 seconds (more than
before).
I also tried changing the schema to use the Long data type for some fields,
but the conversion seems to take more time.
Is there any way to specify an index?
If you are running on Spark 1.1 or earlier you'll want to use
rdd.registerTempTable() followed by sqlContext.cacheTable(), and then query
that table. rdd.cache() does not use the optimized in-memory format and thus
puts a lot of pressure on the GC. This is fixed in Spark 1.2, where .cache()
uses the optimized in-memory columnar format as well.