> That being said, all systems are evolving. Hive supports tez+llap which is basically the in-memory support.
There is a big difference between LLAP & SparkSQL, which has to do with access-pattern needs. The first difference is the lifetime of the cache - the Spark RDD cache is per-user-session, which allows further operations in that session to be optimized. LLAP is designed to be hammered by multiple user sessions running different queries, and it automates the cache eviction & selection process. There's no user-visible explicit .cache() to remember - it's automatic and concurrent. My team works with both engines, trying to improve ORC performance in each, but the goals of the two are different. I will probably have to write a proper academic paper & get it edited/reviewed instead of sending my ramblings to the user lists like this.

Still, this needs an example to talk about. To give a qualified example, let's leave the world of single-use clusters and take the use-case detailed here: http://hortonworks.com/blog/impala-vs-hive-performance-benchmark/

There are two distinct problems there. One is that a single day sees up to 100k independent user sessions running queries, and most queries cover the last hour (& possibly join/compare against a similar hour-aggregate from the past); the trouble with 100k independent user-sessions from different connections is that the SparkSQL layer drops the RDD lineage & cache whenever a user ends a session. The other, the scale problem for Impala in general, was that even though the data size was in multiple terabytes, the actual hot data was approx <20 GB, which resided on <10 machines with locality. The same problem applies when you use RDD caching with something un-replicated like Tachyon/Alluxio, since the same RDD will be so exceedingly popular that the machines holding those blocks run extra hot. A per-user-session cache model is entirely wasteful, and a common cache + MPP model effectively overloads 2-3% of the cluster while leaving the other machines idle.

LLAP was designed specifically to prevent that hotspotting while maintaining the common cache model - within a few minutes after an hour ticks over, the whole cluster develops temporal popularity for the hot data, and nearly every rack has at least one cached copy of the same data for availability/performance. Since these data streams tend to be extremely wide tables (Omniture comes to mind), the cache does not hold all columns in a table; and since Zipf distributions are extremely common in these real data sets, the cache does not hold all rows either.

Take select count(clicks) from table where zipcode = 695506; - with ORC data bucketed + *sorted* by zipcode, the row-groups in the cache will hold only the 2 columns touched (clicks & zipcode), all bloom-filter indexes for all files will be loaded into memory, and rows that miss on the bloom filter will not even feature in the cache. A subsequent query, select count(clicks) from table where zipcode = 695586; will run against the collected indexes before deciding which files need to be loaded into cache. Then again, select count(clicks)/count(impressions) from table where zipcode = 695586; will load only the impressions column out of the table into cache, adding it to the columnar cache without producing another complete copy (RDDs are immutable, but the LLAP cache is additive). The column-split cache & index-cache separation allows this to be cheaper than a full rematerialization - both are evicted as they fill up, with different priorities.
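To make the cache-lifetime contrast concrete, here is a rough client-side sketch of my own (not part of the benchmark above) - the table name web_logs, the host/port in the JDBC URL, and the Scala wrapper are illustrative assumptions, not real deployment details:

    import java.sql.DriverManager
    import org.apache.spark.sql.SparkSession

    object CacheModels {

      // SparkSQL: the cache is explicit and lives only as long as this session.
      def sparkSessionCache(): Unit = {
        val spark = SparkSession.builder().appName("per-session-cache").getOrCreate()
        val hot = spark.table("web_logs").where("zipcode = 695506")
        hot.cache()   // the user has to remember this call
        hot.count()   // first action materializes the cached data
        spark.stop()  // session ends -> lineage & cache are gone for the next user
      }

      // LLAP: no cache call anywhere; the daemons decide what to keep, and the
      // cache is shared across all concurrent sessions hitting the same daemons.
      def llapSharedCache(): Unit = {
        Class.forName("org.apache.hive.jdbc.HiveDriver")
        val conn = DriverManager.getConnection("jdbc:hive2://llap-host:10500/default")
        val stmt = conn.createStatement()
        // 1st query: only the clicks & zipcode row-groups plus the bloom-filter
        // indexes end up in the cache
        stmt.executeQuery("select count(clicks) from web_logs where zipcode = 695506")
        // 2nd query: consults the already-cached indexes before deciding which
        // files actually need to be read
        stmt.executeQuery("select count(clicks) from web_logs where zipcode = 695586")
        // 3rd query: adds only the impressions column to the existing columnar
        // cache - additive, no second complete copy
        stmt.executeQuery(
          "select count(clicks)/count(impressions) from web_logs where zipcode = 695586")
        conn.close()  // closing this session evicts nothing for other users
      }
    }

The only point of the sketch is the difference in lifetime and visibility: the Spark cache is tied to the session object that created it, while the LLAP cache lives in the daemons and is populated additively by whatever the concurrent workload happens to touch.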
In the same vein, LLAP can do a bit of clairvoyant pre-processing, with a bit of input from the UX patterns observed from Tableau/Microstrategy users, to give the impression of being much faster than the engine really can be. An illusion of performance is likely to be indistinguishable from the real thing - I'm actually looking for subjects for that experiment :)

Cheers,
Gopal