RE: Hive on Spark Engine versus Spark using Hive metastore

Mich Talebzadeh Wed, 03 Feb 2016 08:21:25 -0800

OK, thanks. These are my new ENV settings, based on the resources available:

export SPARK_EXECUTOR_CORES=12   ## Number of cores for the workers (Default: 1)

export SPARK_EXECUTOR_MEMORY=5G  ## Memory per worker (e.g. 1000M, 2G) (Default: 1G)

export SPARK_DRIVER_MEMORY=2G    ## Memory for master (e.g. 1000M, 2G) (Default: 512 MB)

These are the new runs after these settings:

Spark on Hive (3 consecutive runs)

spark-sql> select * from dummy where id in (1, 5, 100000);

1       0       0       63      rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi               1      xxxxxxxxxx

5       0       4       31      vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA               5      xxxxxxxxxx

100000  99      999     188     abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000      xxxxxxxxxx

Time taken: 47.987 seconds, Fetched 3 row(s)

Around 48 seconds

Hive on Spark 1.3.1

0: jdbc:hive2://rhes564:10010/default>  select * from dummy where id in (1, 5, 100000);

INFO  :

Query Hive on Spark job[2] stages:

INFO  : 2

INFO  :

Status: Running (Hive on Spark job[2])

INFO  : Job Progress Format

CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]

INFO  : 2016-02-03 16:20:50,315 Stage-2_0: 0(+18)/18

INFO  : 2016-02-03 16:20:53,369 Stage-2_0: 0(+18)/18

INFO  : 2016-02-03 16:20:56,478 Stage-2_0: 0(+18)/18

INFO  : 2016-02-03 16:20:58,530 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:01,570 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:04,680 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:07,767 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:10,877 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:13,941 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:17,019 Stage-2_0: 1(+17)/18

INFO  : 2016-02-03 16:21:20,090 Stage-2_0: 3(+15)/18

INFO  : 2016-02-03 16:21:21,138 Stage-2_0: 6(+12)/18

INFO  : 2016-02-03 16:21:22,145 Stage-2_0: 10(+8)/18

INFO  : 2016-02-03 16:21:23,150 Stage-2_0: 14(+4)/18

INFO  : 2016-02-03 16:21:24,154 Stage-2_0: 17(+1)/18

INFO  : 2016-02-03 16:21:26,161 Stage-2_0: 18/18 Finished

INFO  : Status: Finished successfully in 36.88 seconds

+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+

| dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |

+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+

| 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      | xxxxxxxxxx     |

| 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      | xxxxxxxxxx     |

| 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      | xxxxxxxxxx     |

+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+

3 rows selected (37.161 seconds)

Around 37 seconds
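
The Job Progress Format printed in the output above
(SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount) can
be read mechanically to see when tasks actually completed. The following is
only a minimal Python sketch, assuming the lines look exactly like the ones in
this run; PROGRESS and parse_progress are illustrative names and not part of
Hive.

```python
import re
from datetime import datetime

# Hypothetical pattern (not part of Hive) for one progress line, e.g.
# "INFO  : 2016-02-03 16:20:58,530 Stage-2_0: 1(+17)/18"
PROGRESS = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"Stage-(?P<stage>\d+)_(?P<attempt>\d+):\s+"
    r"(?P<done>\d+)"
    r"(?:\(\+(?P<running>\d+)(?:-(?P<failed>\d+))?\))?"
    r"/(?P<total>\d+)"
)


def parse_progress(line):
    """Return the counters from one progress line, or None if it is not one."""
    m = PROGRESS.search(line)
    if m is None:
        return None
    return {
        "time": datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f"),
        "stage": int(m.group("stage")),
        "done": int(m.group("done")),
        "running": int(m.group("running") or 0),  # absent once finished
        "failed": int(m.group("failed") or 0),    # absent when nothing failed
        "total": int(m.group("total")),
    }


# A few lines copied from the run above.
SAMPLE = [
    "INFO  : 2016-02-03 16:20:50,315 Stage-2_0: 0(+18)/18",
    "INFO  : 2016-02-03 16:20:58,530 Stage-2_0: 1(+17)/18",
    "INFO  : 2016-02-03 16:21:20,090 Stage-2_0: 3(+15)/18",
    "INFO  : 2016-02-03 16:21:26,161 Stage-2_0: 18/18 Finished",
]

records = [parse_progress(line) for line in SAMPLE]
# Seconds between the first and the last progress report.
elapsed = (records[-1]["time"] - records[0]["time"]).total_seconds()

for r in records:
    offset = (r["time"] - records[0]["time"]).total_seconds()
    print(f"+{offset:6.2f}s  done={r['done']:2d}  running={r['running']:2d}")
```

Read this way, the log shows all 18 tasks running concurrently from the first
report, with most of them completing in the final few seconds of the stage.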

Interesting results
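
To put numbers on the change, the new timings can be set against the averages
of the three earlier runs quoted further down this thread. This is only a
rough sketch: the new figures are single runs, and the speedup helper is an
illustrative name.

```python
def speedup(before_s, after_s):
    """How many times faster the 'after' timing is than the 'before' timing."""
    return before_s / after_s


# Averages of the three earlier runs reported in this thread (seconds).
hive_before = (82.66 + 76.835 + 80.718) / 3
spark_before = (50.805 + 50.358 + 50.563) / 3

# Single runs with the new settings (seconds).
hive_after = 37.161
spark_after = 47.987

print(f"Hive on Spark: {speedup(hive_before, hive_after):.2f}x faster")
print(f"Spark SQL:     {speedup(spark_before, spark_after):.2f}x faster")
```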

Dr Mich Talebzadeh

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Technology Ltd, its subsidiaries nor their employees 
accept any responsibility.

From: Xuefu Zhang [mailto:xzh...@cloudera.com] 
Sent: 03 February 2016 12:47
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

In YARN or standalone mode, you can set spark.executor.cores to utilize all 
cores on the node. You can also set spark.executor.memory to allocate memory 
for Spark to use. Once you do this, you may only have two executors to run your 
map tasks, but each core in each executor can take up one task, increasing 
parallelism. With this, the eventual limit may come down to the bandwidth of 
your disks in the cluster.
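
The arithmetic behind "each core in each executor can take up one task" can be
sketched as follows. This is a minimal illustration, not a statement about how
any particular cluster is scheduled: it assumes two executors (one per node),
one CPU per task (the default of spark.task.cpus), the 12 cores per executor
and the 18-task stage reported elsewhere in this thread, and it ignores memory
and disk limits. The function names are illustrative.

```python
import math


def task_slots(executors, cores_per_executor, cpus_per_task=1):
    """Tasks that can run at the same time across all executors."""
    return executors * (cores_per_executor // cpus_per_task)


def waves(total_tasks, executors, cores_per_executor, cpus_per_task=1):
    """Rounds of scheduling needed to run every task of a stage."""
    slots = task_slots(executors, cores_per_executor, cpus_per_task)
    return math.ceil(total_tasks / slots)


# Assumed: two executors with 12 cores each, and a stage of 18 tasks.
print(task_slots(2, 12), "slots,", waves(18, 2, 12), "wave(s)")

# Assumed: the same stage when only two tasks can run in parallel.
print(task_slots(2, 1), "slots,", waves(18, 2, 1), "wave(s)")
```

With enough slots the whole stage fits in a single wave, at which point disk
bandwidth rather than task scheduling is the more likely limit.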

Having said that, a two-node cluster isn't really big enough for performance 
benchmarking. Nevertheless, you still need to configure it properly to make 
full use of the cluster.

--Xuefu

On Wed, Feb 3, 2016 at 1:25 AM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

Hi Jeff,

I only have a two-node cluster. Is there any way one can simulate additional 
parallel runs in such an environment, thus having more than two maps?

thanks

Dr Mich Talebzadeh


From: Xuefu Zhang [mailto:xzh...@cloudera.com]
Sent: 03 February 2016 02:39
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

Yes, regardless of which Spark mode you're running in, you should be able to 
see from the Spark AM web UI how many tasks are running concurrently. I'm a 
little surprised to see that your Hive configuration only allows 2 map tasks to 
run in parallel. If your cluster has the capacity, you should parallelize all 
the tasks to achieve optimal performance. Since I don't know your Spark SQL 
configuration, I cannot tell how much parallelism you have over there. Thus, 
I'm not sure if your comparison is valid.

--Xuefu

On Tue, Feb 2, 2016 at 5:08 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

Hi Jeff,

Regarding the point below:

…. You should be able to see the resource usage in YARN resource manager URL.

Just to be clear, are we talking about port 8088/cluster?

Dr Mich Talebzadeh


From: Koert Kuipers [mailto:ko...@tresata.com]
Sent: 03 February 2016 00:09
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

uuuhm with spark using Hive metastore you actually have a real programming 
environment and you can write real functions, versus just being boxed into some 
version of sql and limited udfs?

On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:

When comparing performance, you need to compare apples to apples. In another 
thread, you mentioned that Hive on Spark is much slower than Spark SQL. 
However, you configured Hive such that only two tasks can run in parallel, and 
you didn't provide information on how many resources Spark SQL is utilizing. 
Thus, it's hard to tell whether it's just a configuration problem in your Hive 
or whether Spark SQL is indeed faster. You should be able to see the resource 
usage in the YARN resource manager URL.

--Xuefu

On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

Thanks Jeff.

Obviously Hive is much more feature-rich than Spark. Having said that, in 
certain areas, for example where the SQL feature is available in Spark, Spark 
seems to deliver faster.

This may be because:

1.    Spark does both the optimisation and execution seamlessly

2.    Hive on Spark has to invoke YARN, which adds another layer to the process

Now I did some simple tests on a 100 million row ORC table, available through 
Hive to both.

Spark 1.5.2 on Hive 1.2.1 Metastore

spark-sql> select * from dummy where id in (1, 5, 100000);

1       0       0       63      rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi               1      xxxxxxxxxx

5       0       4       31      vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA               5      xxxxxxxxxx

100000  99      999     188     abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000      xxxxxxxxxx

Time taken: 50.805 seconds, Fetched 3 row(s)

spark-sql> select * from dummy where id in (1, 5, 100000);

1       0       0       63      rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi               1      xxxxxxxxxx

5       0       4       31      vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA               5      xxxxxxxxxx

100000  99      999     188     abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000      xxxxxxxxxx

Time taken: 50.358 seconds, Fetched 3 row(s)

spark-sql> select * from dummy where id in (1, 5, 100000);

1       0       0       63      rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi               1      xxxxxxxxxx

5       0       4       31      vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA               5      xxxxxxxxxx

100000  99      999     188     abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000      xxxxxxxxxx

Time taken: 50.563 seconds, Fetched 3 row(s)

So three runs, each returning three rows in just over 50 seconds.

Hive 1.2.1 on spark 1.3.1 execution engine

0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);

INFO  :

Query Hive on Spark job[4] stages:

INFO  : 4

INFO  :

Status: Running (Hive on Spark job[4])

INFO  : Status: Finished successfully in 82.49 seconds

+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+

| dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |

+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+

| 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      | xxxxxxxxxx     |

| 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      | xxxxxxxxxx     |

| 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      | xxxxxxxxxx     |

+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+

3 rows selected (82.66 seconds)

0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);

INFO  : Status: Finished successfully in 76.67 seconds

+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+

| dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |

+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+

| 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      | xxxxxxxxxx     |

| 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      | xxxxxxxxxx     |

| 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      | xxxxxxxxxx     |

+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+

3 rows selected (76.835 seconds)

0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);

INFO  : Status: Finished successfully in 80.54 seconds

+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+

| dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |

+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+

| 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      | xxxxxxxxxx     |

| 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      | xxxxxxxxxx     |

| 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      | xxxxxxxxxx     |

+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+

3 rows selected (80.718 seconds)

Three runs returning the same rows in 80 seconds. 
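
The two sets of three runs above can be averaged to put a rough number on the
gap. This is a minimal Python sketch using only the timings reported in this
message; the variable names are illustrative, and three runs on a two-node
cluster are too few to treat the ratio as more than indicative.

```python
from statistics import mean

# "Time taken" values from the three spark-sql runs (seconds).
spark_sql_runs = [50.805, 50.358, 50.563]

# "rows selected" timings from the three Hive on Spark runs (seconds).
hive_on_spark_runs = [82.66, 76.835, 80.718]

spark_sql_mean = mean(spark_sql_runs)
hive_on_spark_mean = mean(hive_on_spark_runs)

# How many times longer Hive on Spark took than Spark SQL, on average.
ratio = hive_on_spark_mean / spark_sql_mean

print(f"Spark SQL mean:     {spark_sql_mean:.2f} s")
print(f"Hive on Spark mean: {hive_on_spark_mean:.2f} s")
print(f"Ratio:              {ratio:.2f}")
```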

It is possible that this lag is caused by my Spark engine for Hive being 
version 1.3.1, which is out of date.

There are certain queries that one cannot run with Spark. Besides, it does not 
recognize CHAR fields, which is a pain.

spark-sql> CREATE TEMPORARY TABLE tmp AS

         > SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales

         > FROM sales s, times t, channels c

         > WHERE s.time_id = t.time_id

         > AND   s.channel_id = c.channel_id

         > GROUP BY t.calendar_month_desc, c.channel_desc

         > ;

Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7

.

You are likely trying to use an unsupported Hive feature.";

Dr Mich Talebzadeh


From: Xuefu Zhang [mailto:xzh...@cloudera.com]
Sent: 02 February 2016 23:12
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

I think the difference is not only about which one does the optimization but 
more about feature parity. Hive on Spark offers all the functional features 
that Hive offers, and these features play out faster. Spark SQL, however, is 
far from offering this parity, as far as I know.

On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

Hi,

My understanding is that with the Hive on Spark engine, one gets the Hive 
optimizer and the Spark query engine.

With Spark using the Hive metastore, Spark does both the optimization and the 
query execution. The only added value is that one can access the underlying 
Hive tables from spark-sql etc.

Is this assessment correct?

Thanks

Dr Mich Talebzadeh

