RE: Hive on Spark Engine versus Spark using Hive metastore

Mich Talebzadeh Thu, 04 Feb 2016 10:04:15 -0800

Hi Edward,

There is another angle to it as well. Fit for purpose.

We are currently migrating from a proprietary DW on SAN to Hive on JBOD. It is 
going smoothly. It will save us $$ in licensing fees at a time when 
technology and storage dollars are at a premium.

Our DBAs who look after Oracle, SAP ASE and others are comfortable with Hive. 
They can look after the metastore (on Oracle) and are working with me on HA for 
the metastore and HiveServer2, in line with the standard for other databases.

I am sure that if we had started with Spark, that would have worked too, but 
what the heck. We have MongoDB as well, independent of HDFS.

These arguments about what is better or worse are the same ones we have had for 
years about Oracle, Sybase, MSSQL etc. I believe Hive is better for us because I 
think in Hive. If I were more familiar with Spark, I am sure it would have been 
the opposite.

We can go in circles. Religious arguments really.

HTH,

Dr Mich Talebzadeh

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Technology Ltd, its subsidiaries nor their employees 
accept any responsibility.

From: Edward Capriolo [mailto:edlinuxg...@gmail.com] 
Sent: 04 February 2016 17:41
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

Hive is not the correct tool for every problem. Use the tool that makes the 
most sense for your problem and your experience. 

Many people like Hive because it is generally applicable. In my case study for 
the Hive book I highlighted that many smart, capable organizations use Hive. 

Your argument is totally valid. You like X better because X works for you. You 
don't need to 'preach' here; we all know Hive has its limits. 

On Thu, Feb 4, 2016 at 10:55 AM, Koert Kuipers <ko...@tresata.com> wrote:

Is the sky the limit? I know UDFs can be used inside Hive, like lambdas 
basically I assume, and I will assume you have something similar for 
aggregations. But those are just abstractions inside a single map or reduce 
phase, pretty low-level stuff. What you really need is abstractions around many 
map and reduce phases, because that is the level an algo is expressed at.

For example when doing logistic regression you want to be able to do something 
like:
read("somefile").train(settings).write("model")
Here train is an externally defined method that is well tested and could do many 
map and reduce steps internally (or even be defined at a higher level and 
compile into those steps). What is the equivalent in Hive? Copy-pasting crucial 
parts of the algo around while using UDFs is just not the same thing in terms 
of reusability and abstraction. It's the opposite of keeping it DRY.

On Feb 3, 2016 1:06 AM, "Ryan Harris" <ryan.har...@zionsbancorp.com> wrote:

https://github.com/myui/hivemall

As long as you are comfortable with Java UDFs, the sky is really the 
limit... it's not for everyone and Spark does have many advantages, but they are 
two tools that can complement each other in numerous ways.

I don't know that there is necessarily a universal "better" for how to use 
spark as an execution engine (or if spark is necessarily the *best* execution 
engine for any given hive job).

The reality is that once you start factoring in the numerous tuning parameters 
of the systems and jobs there probably isn't a clear answer.  For some queries, 
the Catalyst optimizer may do a better job...is it going to do a better job 
with ORC based data? less likely IMO. 

From: Koert Kuipers [mailto:ko...@tresata.com] 
Sent: Tuesday, February 02, 2016 9:50 PM
To: user@hive.apache.org 
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

yeah but have you ever seen someone write a real analytical program in hive? 
how? where are the basic abstractions to wrap up a large number of operations 
(joins, groupBys) into a single function call? where are the tools to write 
nice unit tests for that? 

for example in spark i can write a DataFrame => DataFrame that internally does 
many joins, groupBys and complex operations. all unit tested and perfectly 
re-usable. and in hive? copy-paste sql queries around? that's just dangerous.
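
To make the point about wrapping joins and group-bys in one tested function concrete, here is a minimal sketch in plain Python. It is not Spark code: a list of dicts stands in for a DataFrame so that the example runs without a cluster, and the helper names (join_on, sum_by, sales_by_month_and_channel) are hypothetical. The table and column names are borrowed from the sales query quoted further down this thread.

```python
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, object]
Table = List[Row]  # stand-in for a DataFrame in this sketch (assumption)


def join_on(left: Table, right: Table, key: str) -> Table:
    """Inner-join two tables on a shared key column."""
    index = defaultdict(list)
    for right_row in right:
        index[right_row[key]].append(right_row)
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in index.get(left_row[key], [])
    ]


def sum_by(table: Table, group_cols: List[str], value_col: str, out_col: str) -> Table:
    """Group by the given columns and sum one numeric column."""
    totals = defaultdict(float)
    for row in table:
        totals[tuple(row[c] for c in group_cols)] += row[value_col]
    return [dict(zip(group_cols, key), **{out_col: total})
            for key, total in sorted(totals.items())]


def sales_by_month_and_channel(sales: Table, times: Table, channels: Table) -> Table:
    """One reusable call that hides two joins and a group-by."""
    joined = join_on(join_on(sales, times, "time_id"), channels, "channel_id")
    return sum_by(joined, ["calendar_month_desc", "channel_desc"],
                  "amount_sold", "total_sales")


def test_sales_by_month_and_channel() -> None:
    """Unit test on tiny, made-up input rows."""
    sales = [
        {"time_id": 1, "channel_id": 10, "amount_sold": 5.0},
        {"time_id": 1, "channel_id": 10, "amount_sold": 7.5},
        {"time_id": 2, "channel_id": 20, "amount_sold": 1.0},
    ]
    times = [
        {"time_id": 1, "calendar_month_desc": "2016-01"},
        {"time_id": 2, "calendar_month_desc": "2016-02"},
    ]
    channels = [
        {"channel_id": 10, "channel_desc": "Direct"},
        {"channel_id": 20, "channel_desc": "Internet"},
    ]
    result = sales_by_month_and_channel(sales, times, channels)
    assert [r["total_sales"] for r in result] == [12.5, 1.0]


test_sales_by_month_and_channel()
```

The same shape would carry over to a function from DataFrame to DataFrame: callers see one name, and the test pins down the behaviour of the joins and the aggregation together.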

On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

Hive has numerous extension points, you are not boxed in by a long shot.

On Tuesday, February 2, 2016, Koert Kuipers <ko...@tresata.com> wrote:

uuuhm with spark using Hive metastore you actually have a real programming 
environment and you can write real functions, versus just being boxed into some 
version of sql and limited udfs?

On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:

When comparing performance, you need to compare apples to apples. In another 
thread, you mentioned that Hive on Spark is much slower than Spark SQL. 
However, you configured Hive such that only two tasks can run in parallel, 
and you didn't provide information on how many resources Spark SQL is using. 
Thus, it's hard to tell whether it's just a configuration problem in your Hive 
setup or whether Spark SQL is indeed faster. You should be able to see the 
resource usage in the YARN ResourceManager web UI.
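
As a rough way to check whether two setups are comparable, one can compute how many tasks each can run at once: executors multiplied by the task slots per executor (executor cores divided by spark.task.cpus, which defaults to 1). The sketch below is only an illustration. The class and function names are hypothetical, the Hive on Spark figures are chosen to match the "two tasks in parallel" case mentioned above, and the Spark SQL figures are invented because the thread does not state them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineResources:
    executors: int           # number of Spark executors granted by YARN
    cores_per_executor: int  # cores available to each executor
    task_cpus: int = 1       # cores reserved per task (spark.task.cpus)

    def parallel_tasks(self) -> int:
        """Upper bound on tasks that can run at the same time."""
        return self.executors * (self.cores_per_executor // self.task_cpus)


def comparable(a: EngineResources, b: EngineResources) -> bool:
    """True when both setups can run the same number of tasks at once."""
    return a.parallel_tasks() == b.parallel_tasks()


# Assumed layout that yields the two parallel tasks mentioned in the thread.
hive_on_spark = EngineResources(executors=2, cores_per_executor=1)
# Invented numbers: the thread does not say what Spark SQL was given.
spark_sql = EngineResources(executors=4, cores_per_executor=2)

print(hive_on_spark.parallel_tasks(), spark_sql.parallel_tasks())
print("like-for-like:", comparable(hive_on_spark, spark_sql))
```

If the two counts differ, a difference in elapsed time says more about the configuration than about the engines.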

--Xuefu

On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

Thanks Jeff.

Obviously Hive is much more feature rich compared to Spark. Having said that, in 
certain areas, for example where the SQL feature is available in Spark, Spark 
seems to deliver results faster.

This may be because:

1.    Spark does both the optimisation and execution seamlessly

2.    Hive on Spark has to invoke YARN, which adds another layer to the process

Now I did some simple tests on a 100-million-row ORC table available to both 
through Hive.

Spark 1.5.2 on Hive 1.2.1 Metastore

spark-sql> select * from dummy where id in (1, 5, 100000);
1       0       0       63      rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi               1      xxxxxxxxxx
5       0       4       31      vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA               5      xxxxxxxxxx
100000  99      999     188     abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000      xxxxxxxxxx
Time taken: 50.805 seconds, Fetched 3 row(s)

spark-sql> select * from dummy where id in (1, 5, 100000);
1       0       0       63      rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi               1      xxxxxxxxxx
5       0       4       31      vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA               5      xxxxxxxxxx
100000  99      999     188     abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000      xxxxxxxxxx
Time taken: 50.358 seconds, Fetched 3 row(s)

spark-sql> select * from dummy where id in (1, 5, 100000);
1       0       0       63      rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi               1      xxxxxxxxxx
5       0       4       31      vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA               5      xxxxxxxxxx
100000  99      999     188     abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000      xxxxxxxxxx
Time taken: 50.563 seconds, Fetched 3 row(s)

So three runs, each returning three rows in just over 50 seconds.

Hive 1.2.1 on spark 1.3.1 execution engine

0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
INFO  :
Query Hive on Spark job[4] stages:
INFO  : 4
INFO  :
Status: Running (Hive on Spark job[4])
INFO  : Status: Finished successfully in 82.49 seconds
+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
| dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |
+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
| 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      | xxxxxxxxxx     |
| 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      | xxxxxxxxxx     |
| 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      | xxxxxxxxxx     |
+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
3 rows selected (82.66 seconds)

0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
INFO  : Status: Finished successfully in 76.67 seconds
+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
| dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |
+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
| 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      | xxxxxxxxxx     |
| 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      | xxxxxxxxxx     |
| 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      | xxxxxxxxxx     |
+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
3 rows selected (76.835 seconds)

0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
INFO  : Status: Finished successfully in 80.54 seconds
+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
| dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |
+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
| 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      | xxxxxxxxxx     |
| 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      | xxxxxxxxxx     |
| 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      | xxxxxxxxxx     |
+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
3 rows selected (80.718 seconds)

Three runs returning the same rows, each in roughly 80 seconds (between 76.8 and 82.7 seconds). 
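
For a quick summary of the timings quoted in this message, the small Python sketch below averages the three client-side elapsed times reported for each setup and computes their ratio. The variable names are only illustrative, and three runs per engine is a small sample, so the ratio is indicative at best.

```python
from statistics import mean

# Elapsed times in seconds, copied from the console output in this message.
spark_sql_seconds = [50.805, 50.358, 50.563]      # "Time taken" lines
hive_on_spark_seconds = [82.66, 76.835, 80.718]   # "3 rows selected" lines

spark_sql_mean = mean(spark_sql_seconds)
hive_on_spark_mean = mean(hive_on_spark_seconds)

# How many times longer the Hive on Spark runs took, on average.
slowdown = hive_on_spark_mean / spark_sql_mean

print(f"Spark SQL mean:     {spark_sql_mean:.2f} s")
print(f"Hive on Spark mean: {hive_on_spark_mean:.2f} s")
print(f"Ratio:              {slowdown:.2f}x")
```

On these figures the Hive on Spark runs took a little under 1.6 times as long, without accounting for any difference in the resources given to each engine.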

It is possible that the Spark engine I use with Hive, version 1.3.1, is out of 
date and that this causes the lag. 

There are certain queries that one cannot run with Spark. Besides, it does not 
recognize CHAR fields, which is a pain.

spark-sql> CREATE TEMPORARY TABLE tmp AS
         > SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
         > FROM sales s, times t, channels c
         > WHERE s.time_id = t.time_id
         > AND   s.channel_id = c.channel_id
         > GROUP BY t.calendar_month_desc, c.channel_desc
         > ;
Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7
.
You are likely trying to use an unsupported Hive feature.";

Dr Mich Talebzadeh

From: Xuefu Zhang [mailto:xzh...@cloudera.com] 
Sent: 02 February 2016 23:12
To: user@hive.apache.org 
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

I think the difference is not only about which one does the optimization but 
more about feature parity. Hive on Spark offers all the functional features that 
Hive offers, and these features play out faster. However, Spark SQL is far from 
offering this parity as far as I know.

On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

Hi,

My understanding is that with Hive on the Spark engine, one gets the Hive 
optimizer and the Spark query engine.

With Spark using the Hive metastore, Spark does both the optimization and the 
query execution. The only value added is that one can access the underlying Hive 
tables from spark-sql etc.

Is this assessment correct?

Thanks

Dr Mich Talebzadeh

-- 
Sorry this was sent from mobile. Will do less grammar and spell check than 
usual.

  _____  

THIS ELECTRONIC MESSAGE, INCLUDING ANY ACCOMPANYING DOCUMENTS, IS CONFIDENTIAL 
and may contain information that is privileged and exempt from disclosure under 
applicable law. If you are neither the intended recipient nor responsible for 
delivering the message to the intended recipient, please note that any 
dissemination, distribution, copying or the taking of any action in reliance 
upon the message is strictly prohibited. If you have received this 
communication in error, please notify the sender immediately. Thank you.
