Thanks, very interesting explanation. Looking forward to testing it.
> On 31 May 2016, at 07:51, Gopal Vijayaraghavan <gop...@apache.org> wrote:
>
>> That being said, all systems are evolving. Hive supports Tez+LLAP, which
>> is basically the in-memory support.
>
> There is a big difference between LLAP & SparkSQL, which has to do with
> access-pattern needs.
>
> The first difference is the lifetime of the cache - the Spark RDD cache
> is per-user-session, which allows further operations in that session to
> be optimized.
>
> LLAP is designed to be hammered by multiple user sessions running
> different queries, and to automate the cache eviction & selection
> process. There's no user-visible explicit .cache() to remember - it's
> automatic and concurrent.
>
> My team works with both engines, trying to improve them for ORC, but the
> goals of the two are different.
>
> I will probably have to write a proper academic paper & get it
> edited/reviewed instead of sending my ramblings to the user lists like
> this. Still, this needs an example to talk about.
>
> To give a qualified example, let's leave the world of single-use
> clusters and take the use-case detailed here:
>
> http://hortonworks.com/blog/impala-vs-hive-performance-benchmark/
>
> There are two distinct problems there - one is that a single day sees up
> to 100k independent user sessions running queries, and most queries
> cover the last hour (& possibly join/compare against a similar hourly
> aggregate from the past).
>
> The problem with having 100k independent user sessions from different
> connections was that the SparkSQL layer drops the RDD lineage & cache
> whenever a user ends a session.
>
> The scale problem in general for Impala was that even though the data
> size was multiple terabytes, the actual hot data was approximately
> <20GB, which resided on <10 machines with locality.
>
> The same problem applies when you use RDD caching with something
> un-replicated like Tachyon/Alluxio, since the same RDD will be so
> exceedingly popular that the machines which hold those blocks run
> extra hot.
>
> A per-user-session cache model is entirely wasteful, and a common cache
> + MPP model effectively overloads 2-3% of the cluster while leaving the
> other machines idle.
>
> LLAP was designed specifically to prevent that hotspotting while
> maintaining the common cache model - within a few minutes after an hour
> ticks over, the whole cluster develops temporal popularity for the hot
> data, and nearly every rack has at least one cached copy of the same
> data for availability/performance.
>
> Since these data streams tend to be extremely wide tables (Omniture
> comes to mind), the cache does not actually hold all columns in a table;
> and since Zipf distributions are extremely common in these real data
> sets, the cache does not hold all rows either.
>
> select count(clicks) from table where zipcode = 695506;
>
> With ORC data bucketed + *sorted* by zipcode, the row-groups in the
> cache will hold only the 2 columns touched (clicks & zipcode); all
> bloom-filter indexes for all files will be loaded into memory, and
> anything that misses the bloom filter will not even feature in the
> cache.
>
> A subsequent query for
>
> select count(clicks) from table where zipcode = 695586;
>
> will run against the collected indexes, before deciding which files
> need to be loaded into cache.
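As a side note for anyone wanting to try this: here is a minimal sketch of
the table layout the example seems to assume. The table name, column types
and bucket count below are my guesses, not from Gopal's mail:

-- hypothetical ORC layout: bucketed + *sorted* by zipcode, with bloom filters
CREATE TABLE clickstream (
  zipcode     INT,
  clicks      BIGINT,
  impressions BIGINT
  -- ...plus the rest of an Omniture-style wide table
)
CLUSTERED BY (zipcode) SORTED BY (zipcode) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES ('orc.bloom.filter.columns'='zipcode');

so that

select count(clicks) from clickstream where zipcode = 695506;

only needs the zipcode & clicks streams of the row-groups whose bloom
filter matches.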
> Then again,
>
> select count(clicks)/count(impressions) from table where zipcode = 695586;
>
> will load only impressions out of the table into cache, adding it to the
> columnar cache without producing another complete copy (RDDs are
> immutable, but the LLAP cache is additive).
>
> The column-split cache & index-cache separation allows this to be
> cheaper than a full rematerialization - both are evicted as they fill
> up, with different priorities.
>
> In the same vein, LLAP can do a bit of clairvoyant pre-processing, with
> some input from the UX patterns observed in Tableau/Microstrategy users,
> to give the impression of being much faster than the engine really can
> be.
>
> The illusion of performance is likely to be indistinguishable from the
> real thing - I'm actually looking for subjects for that experiment :)
>
> Cheers,
> Gopal
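For contrast, if I understand the per-session model correctly, the SparkSQL
counterpart would be an explicit, session-scoped cache along these lines
(a sketch against the same hypothetical table as above):

-- SparkSQL: the cache is explicit and belongs to this session only
CACHE TABLE hot_zip AS
SELECT zipcode, clicks, impressions FROM clickstream;

SELECT count(clicks)/count(impressions) FROM hot_zip WHERE zipcode = 695586;

-- gone when the session ends; each of the ~100k daily sessions in the
-- benchmark would have to rebuild it from scratch
UNCACHE TABLE hot_zip;

which is exactly the per-session rebuild cost that the shared LLAP cache
avoids.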