Thanks, very interesting explanation. Looking forward to testing it.
> On 31 May 2016, at 07:51, Gopal Vijayaraghavan <gop...@apache.org> wrote:
>
>> That being said, all systems are evolving. Hive supports Tez+LLAP, which
>> is basically the in-memory support.
>
> There is a big difference between LLAP & SparkSQL, which has to do with
> access-pattern needs.
>
> The first difference is the lifetime of the cache - the Spark RDD cache
> is per-user-session, which allows further operations in that session to
> be optimized.
>
> LLAP is designed to be hammered by multiple user sessions running
> different queries, and to automate the cache eviction & selection
> process. There's no user-visible explicit .cache() to remember - it's
> automatic and concurrent.
>
> My team works with both engines, trying to improve them for ORC, but the
> goals of the two are different.
>
> I will probably have to write a proper academic paper & get it
> edited/reviewed instead of sending my ramblings to the user lists like
> this. Still, this needs an example to talk about.
>
> To give a qualified example, let's leave the world of single-use
> clusters and take the use-case detailed here:
>
> http://hortonworks.com/blog/impala-vs-hive-performance-benchmark/
>
> There are two distinct problems there - one is that a single day sees up
> to 100k independent user sessions running queries, and most queries
> cover the last hour (& possibly join/compare against a similar hourly
> aggregate from the past).
>
> The problem with having 100k independent user sessions from different
> connections was that the SparkSQL layer drops the RDD lineage & cache
> whenever a user ends a session.
>
> The scale problem in general for Impala was that even though the data
> size was multiple terabytes, the actual hot data was approximately
> <20GB, which resided on <10 machines with locality.
>
> The same problem applies when you use RDD caching with something
> un-replicated like Tachyon/Alluxio, since the same RDD will be so
> exceedingly popular that the machines which hold those blocks run
> extra hot.
>
> A per-user-session cache model is entirely wasteful, and a common cache
> + MPP model effectively overloads 2-3% of the cluster while leaving the
> other machines idle.
>
> LLAP was designed specifically to prevent that hotspotting while
> maintaining the common cache model - within a few minutes after an hour
> ticks over, the whole cluster develops temporal popularity for the hot
> data, and nearly every rack has at least one cached copy of the same
> data for availability/performance.
>
> Since these data streams tend to be extremely wide tables (Omniture
> comes to mind), the cache does not actually hold all columns in a table;
> and since Zipf distributions are extremely common in these real data
> sets, the cache does not hold all rows either.
>
> select count(clicks) from table where zipcode = 695506;
>
> With ORC data bucketed + *sorted* by zipcode, the row-groups in the
> cache will hold only the 2 columns touched (clicks & zipcode); all
> bloom-filter indexes for all files will be loaded into memory, and
> anything that misses the bloom filter will not even feature in the
> cache.
>
> A subsequent query for
>
> select count(clicks) from table where zipcode = 695586;
>
> will run against the collected indexes, before deciding which files
> need to be loaded into cache.
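As a side note for anyone wanting to try this: here is a minimal sketch of
the table layout the example seems to assume. The table name, column types
and bucket count below are my guesses, not from Gopal's mail:

-- hypothetical ORC layout: bucketed + *sorted* by zipcode, with bloom filters
CREATE TABLE clickstream (
  zipcode     INT,
  clicks      BIGINT,
  impressions BIGINT
  -- ...plus the rest of an Omniture-style wide table
)
CLUSTERED BY (zipcode) SORTED BY (zipcode) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES ('orc.bloom.filter.columns'='zipcode');

so that

select count(clicks) from clickstream where zipcode = 695506;

only needs the zipcode & clicks streams of the row-groups whose bloom
filter matches.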
> Then again,
>
> select count(clicks)/count(impressions) from table where zipcode = 695586;
>
> will load only impressions out of the table into cache, adding it to the
> columnar cache without producing another complete copy (RDDs are
> immutable, but the LLAP cache is additive).
>
> The column-split cache & index-cache separation allows this to be
> cheaper than a full rematerialization - both are evicted as they fill
> up, with different priorities.
>
> In the same vein, LLAP can do a bit of clairvoyant pre-processing, with
> some input from the UX patterns observed in Tableau/Microstrategy users,
> to give the impression of being much faster than the engine really can
> be.
>
> The illusion of performance is likely to be indistinguishable from the
> real thing - I'm actually looking for subjects for that experiment :)
>
> Cheers,
> Gopal
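For contrast, if I understand the per-session model correctly, the SparkSQL
counterpart would be an explicit, session-scoped cache along these lines
(a sketch against the same hypothetical table as above):

-- SparkSQL: the cache is explicit and belongs to this session only
CACHE TABLE hot_zip AS
SELECT zipcode, clicks, impressions FROM clickstream;

SELECT count(clicks)/count(impressions) FROM hot_zip WHERE zipcode = 695586;

-- gone when the session ends; each of the ~100k daily sessions in the
-- benchmark would have to rebuild it from scratch
UNCACHE TABLE hot_zip;

which is exactly the per-session rebuild cost that the shared LLAP cache
avoids.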