> That being said all systems are evolving. Hive supports tez+llap which
> is basically the in-memory support.

There is a big difference between LLAP & SparkSQL caching, which has to
do with access-pattern needs.

The first difference is the lifetime of the cache - the Spark RDD cache
is per-user-session, which allows further operations in that session to
be optimized.
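
To make the explicit, per-session model concrete, here is a minimal
sketch in Spark SQL (the table name is made up for illustration):

-- the cache lives only for this Spark session; a new connection starts cold
CACHE TABLE web_logs;
SELECT count(clicks) FROM web_logs WHERE zipcode = 695506;
-- once the session ends (or after UNCACHE TABLE web_logs), the cached data is gone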

LLAP is designed to be hammered by multiple user sessions running
different queries, and it automates the cache eviction & selection
process. There's no user-visible explicit .cache() to remember - it's
automatic and concurrent.
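
Roughly, turning that on looks like the settings below - a minimal
sketch, and the exact property names can vary between Hive releases:

SET hive.execution.mode=llap;   -- run query fragments inside the LLAP daemons
SET hive.llap.io.enabled=true;  -- enable the shared column & index cache
-- no per-query cache calls; admission & eviction are handled by the daemons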

My team works with both engines, trying to improve them for ORC, but the
goals of the two are different.

I will probably have to write a proper academic paper & get it
edited/reviewed instead of sending my ramblings to the user lists like this.
Still, this needs an example to talk about.

To give a qualified example, let's leave the world of single-use clusters
and take the use-case detailed here

http://hortonworks.com/blog/impala-vs-hive-performance-benchmark/


There are two distinct problems there - one is that a single day sees up to
100k independent user sessions running queries, and most of those queries
cover the last hour (& possibly join/compare against a similar hourly
aggregate from the past).

The problem with having 100k independent user-sessions from different
connections was that the SparkSQL layer drops the RDD lineage & cache
whenever a user ends a session.

The scale problem in general for Impala was that even though the data size
was in the multiple terabytes, the actual hot data was approximately
<20 GB, which resided on <10 machines with locality.

The same problem applies when you use RDD caching with something
un-replicated like Tachyon/Alluxio, since the same RDD becomes so
exceedingly popular that the machines which hold those blocks run extra hot.

A per-user-session cache model is entirely wasteful, and a common cache +
MPP model effectively overloads 2-3% of the cluster while leaving the other
machines idle.

LLAP was designed specifically to prevent that hotspotting, while
maintaining the common cache model - within a few minutes after an hour
ticks over, the whole cluster develops temporal popularity for the hot
data and nearly every rack has at least one cached copy of the same data
for availability/performance.

Since the data streams tend to be extremely wide tables (Omniture comes to
mind), the cache does not actually hold all columns in a table; and since
Zipf distributions are extremely common in these real data sets, the cache
does not hold all rows either.

select count(clicks) from table where zipcode = 695506;

with ORC data bucketed + *sorted* by zipcode, the row-groups which land in
the cache will hold only the 2 columns (clicks & zipcode), all bloom filter
indexes for all files will be loaded into memory, and files which miss on
the bloom filter will not even feature in the cache.
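
For reference, a layout like that would be declared roughly as below - a
sketch with a made-up table name and bucket count, and only the relevant
columns shown:

CREATE TABLE clickstream (
  clicks BIGINT,
  impressions BIGINT,
  zipcode INT
  -- ... plus the rest of the (very wide) table
)
CLUSTERED BY (zipcode) SORTED BY (zipcode) INTO 64 BUCKETS
STORED AS ORC
TBLPROPERTIES ('orc.bloom.filter.columns'='zipcode');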

A subsequent query for

select count(clicks) from table where zipcode = 695586;

will run against the already-collected indexes, before deciding which files
need to be loaded into the cache.


Then again, 

select count(clicks)/count(impressions) from table where zipcode = 695586;

will load only the impressions column out of the table, adding it to the
columnar cache without producing another complete copy (RDDs are immutable,
but the LLAP cache is additive).

The column-split cache & index-cache separation allows this to be
cheaper than a full rematerialization - both are evicted as they fill up,
with different priorities.

In the same vein, LLAP can do a bit of clairvoyant pre-processing,
with a bit of input from the UX patterns observed from Tableau/Microstrategy
users, to give the impression of being much faster than the engine
really can be.

An illusion of performance is likely to be indistinguishable from the real
thing - I'm actually looking for subjects for that experiment :)

Cheers,
Gopal
