RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database

Mich Talebzadeh Thu, 31 Dec 2015 10:35:25 -0800

Ok guys.


I have not succeeded in installing TEZ. Yet so I can try the query on TEZ as 
well.

 

Just to remind that the query is used is pretty common. Get the total amount 
sold for each calendar month from sales (I billion rows) and times 

 

SELECT t.calendar_month_desc, SUM(s.amount_sold)

FROM sales s, times t WHERE s.time_id = t.time_id

GROUP BY t.calendar_month_desc;

 

In total 48 rows are returned back

Now having thought about It, granted TEZ is going to be faster than MR as it is 
basically MR with DAG thrown at it. On the other Spark will have both DAG and 
in-memory calculation. 

 

 

The results are as follow:

 

 

Optimiser             Engine               Timing               Compression     
      Total Table size      

Hive                 MapReduce             4673.035 seconds      Snappy         
       totalSize=2678882153 = 2.5GB

Hive                 Spark 1.3.1           1578.817 seconds      Snappy

Columnar              Sybase IQ              30.000 seconds      Native         
       5GB

 

 

It is pretty obvious that Spark outperforms MapReduce more than twice even 
taking into account the number of rows on the FACT table and frankly I would 
not have thought that TEZ is going to beat Spark (to be seen). Having said that 
Hive storage is twice more efficient but I am not sure what one can do to 
improve the performance. Table in Hive is stored as ORC table and it has 
crossed my mind that maybe we should think about storing every column of an ORC 
table as an index. That may improve the performance further.

 

HTH

 

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any 
responsibility.

 

From: Marcin Tustin [mailto:mtus...@handybook.com] 
Sent: 30 December 2015 19:27
To: user@hive.apache.org
Subject: Re: Running the same query on 1 billion rows fact table in Hive on 
Spark compared to Sybase IQ columnar database

 

I'm using TEZ 0.7.0.2.3 with hive 1.2.1.2.3. I can confirm that TEZ is much 
faster than MR in pretty much all cases. Also, with hive, you'll make sure 
you've performed optimizations like aligning ORC stripe sizes with HDFS block 
sizes, and concatenated your tables (not so much an optimization as a must for 
avoiding the small files problem).

 

On Wed, Dec 30, 2015 at 2:19 PM, Mich Talebzadeh <m...@peridale.co.uk 
<mailto:m...@peridale.co.uk> > wrote:

Thanks again Jorn.

 

 

Both Hive and Sybase IQ are running on the same host. Yes for Sybase IQ I have 
compression enabled. The FACT table in IQ (sales) has LF (read bitmap) indexes 
on the time_id column. For the dimension table (times) I have time_id defined 
as primary key. Also Sybase IQ creates FP (fast projection) indexes on every 
column by default.

 

Anyway I am trying to download and build TEZ. Do we know which version of TEZ 
works with Hive 1.2.1 please? 0.8 seems to be in alpha

 

Thanks

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any 
responsibility.

RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database

Reply via email to