Check out HiveMall (a machine-learning library implemented as Hive UDFs/UDTFs).

> On 03 Feb 2016, at 05:49, Koert Kuipers <ko...@tresata.com> wrote:
> 
> yeah, but have you ever seen someone write a real analytical program in hive? 
> how? where are the basic abstractions to wrap up a large number of operations 
> (joins, groupBys) into a single function call? where are the tools to write 
> nice unit tests for that? 
> 
> for example, in spark i can write a DataFrame => DataFrame function that 
> internally does many joins, groupBys and complex operations, all unit tested 
> and perfectly re-usable. and in hive? copy-pasting SQL queries around? that's 
> just dangerous.
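> 
> as a minimal sketch (table and column names made up purely for illustration), 
> such a reusable step might look like this in scala:
> 
>     import org.apache.spark.sql.DataFrame
> 
>     // a reusable, unit-testable pipeline step: DataFrame => DataFrame
>     def salesByRegion(customers: DataFrame, regions: DataFrame)(sales: DataFrame): DataFrame =
>       sales
>         .join(customers, "customer_id")   // join in customer attributes
>         .join(regions, "region_id")       // join in region attributes
>         .groupBy("region_name")           // aggregate per region
>         .sum("amount")
> 
> a unit test feeds it small hand-built DataFrames and asserts on the output; 
> production applies the exact same function to the real tables.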
> 
>> On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo <edlinuxg...@gmail.com> 
>> wrote:
>> Hive has numerous extension points; you are not boxed in by a long shot.
>> 
>> 
>>> On Tuesday, February 2, 2016, Koert Kuipers <ko...@tresata.com> wrote:
>>> uuuhm, with spark using the Hive metastore you actually have a real 
>>> programming environment and can write real functions, versus being boxed 
>>> into some version of sql and limited udfs?
>>> 
>>>> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:
>>>> When comparing performance, you need to compare apples to apples. In 
>>>> another thread, you mentioned that Hive on Spark is much slower than Spark 
>>>> SQL. However, you had configured Hive such that only two tasks could run 
>>>> in parallel, and you didn't provide information on how many resources 
>>>> Spark SQL was utilizing. Thus, it's hard to tell whether it's just a 
>>>> configuration problem on the Hive side or whether Spark SQL is indeed 
>>>> faster. You should be able to see the resource usage in the YARN 
>>>> ResourceManager web UI.
>>>> 
>>>> --Xuefu
>>>> 
>>>>> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@peridale.co.uk> 
>>>>> wrote:
>>>>> Thanks Jeff.
>>>>> 
>>>>>  
>>>>> 
>>>>> Obviously Hive is much more feature-rich than Spark. Having said that, in 
>>>>> certain areas, for example where the same SQL feature is available in 
>>>>> both, Spark seems to deliver results faster.
>>>>> 
>>>>>  
>>>>> 
>>>>> This may be because:
>>>>> 
>>>>> 1. Spark does both the optimisation and the execution seamlessly.
>>>>> 
>>>>> 2. Hive on Spark has to invoke YARN, which adds another layer to the 
>>>>> process.
>>>>> 
>>>>>  
>>>>> 
>>>>> Now I ran some simple tests, against both, on a 100-million-row ORC table 
>>>>> available through Hive.
>>>>> 
>>>>>  
>>>>> 
>>>>> Spark 1.5.2 using the Hive 1.2.1 metastore:
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>>>> 
>>>>> 1       0       0       63      rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi               1      xxxxxxxxxx
>>>>> 5       0       4       31      vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA               5      xxxxxxxxxx
>>>>> 100000  99      999     188     abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000      xxxxxxxxxx
>>>>> 
>>>>> Time taken: 50.805 seconds, Fetched 3 row(s)
>>>>> 
>>>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>>>> 
>>>>> 1       0       0       63      rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi               1      xxxxxxxxxx
>>>>> 5       0       4       31      vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA               5      xxxxxxxxxx
>>>>> 100000  99      999     188     abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000      xxxxxxxxxx
>>>>> 
>>>>> Time taken: 50.358 seconds, Fetched 3 row(s)
>>>>> 
>>>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>>>> 
>>>>> 1       0       0       63      rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi               1      xxxxxxxxxx
>>>>> 5       0       4       31      vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA               5      xxxxxxxxxx
>>>>> 100000  99      999     188     abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000      xxxxxxxxxx
>>>>> 
>>>>> Time taken: 50.563 seconds, Fetched 3 row(s)
>>>>> 
>>>>>  
>>>>> 
>>>>> So, three runs each returning the same three rows in just over 50 seconds.
>>>>> 
>>>>>  
>>>>> 
>>>>> Hive 1.2.1 using Spark 1.3.1 as the execution engine:
>>>>> 
>>>>>  
>>>>> 
>>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
>>>>> 
>>>>> INFO  : Query Hive on Spark job[4] stages:
>>>>> INFO  : 4
>>>>> INFO  : Status: Running (Hive on Spark job[4])
>>>>> INFO  : Status: Finished successfully in 82.49 seconds
>>>>> 
>>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |
>>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>>> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      | xxxxxxxxxx     |
>>>>> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      | xxxxxxxxxx     |
>>>>> | 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      | xxxxxxxxxx     |
>>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>>> 
>>>>> 3 rows selected (82.66 seconds)
>>>>> 
>>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
>>>>> 
>>>>> INFO  : Status: Finished successfully in 76.67 seconds
>>>>> 
>>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |
>>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>>> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      | xxxxxxxxxx     |
>>>>> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      | xxxxxxxxxx     |
>>>>> | 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      | xxxxxxxxxx     |
>>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>>> 
>>>>> 3 rows selected (76.835 seconds)
>>>>> 
>>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
>>>>> 
>>>>> INFO  : Status: Finished successfully in 80.54 seconds
>>>>> 
>>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |
>>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>>> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      | xxxxxxxxxx     |
>>>>> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      | xxxxxxxxxx     |
>>>>> | 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      | xxxxxxxxxx     |
>>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>>> 
>>>>> 3 rows selected (80.718 seconds)
>>>>> 
>>>>>  
>>>>> 
>>>>> Three runs returning the same rows in around 80 seconds each.
>>>>> 
>>>>>  
>>>>> 
>>>>> It is possible that this lag is because the Spark engine I use with Hive 
>>>>> is 1.3.1, which is out of date.
>>>>> 
>>>>>  
>>>>> 
>>>>> There are also certain queries that one cannot run with Spark SQL. 
>>>>> Besides, it does not recognize CHAR fields, which is a pain.
>>>>> 
>>>>>  
>>>>> 
>>>>> spark-sql> CREATE TEMPORARY TABLE tmp AS
>>>>>          > SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
>>>>>          > FROM sales s, times t, channels c
>>>>>          > WHERE s.time_id = t.time_id
>>>>>          > AND   s.channel_id = c.channel_id
>>>>>          > GROUP BY t.calendar_month_desc, c.channel_desc;
>>>>> 
>>>>> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7.
>>>>> You are likely trying to use an unsupported Hive feature.
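>>>>> 
>>>>> A possible workaround (a minimal sketch only, run from spark-shell with a 
>>>>> HiveContext rather than from spark-sql, against the same tables) is to 
>>>>> build the result through the API and register it as a temporary table:
>>>>> 
>>>>>     import org.apache.spark.sql.hive.HiveContext
>>>>> 
>>>>>     val hc = new HiveContext(sc)   // sc is the SparkContext spark-shell provides
>>>>>     val tmp = hc.sql(
>>>>>       """SELECT t.calendar_month_desc, c.channel_desc,
>>>>>         |       SUM(s.amount_sold) AS TotalSales
>>>>>         |FROM sales s, times t, channels c
>>>>>         |WHERE s.time_id = t.time_id
>>>>>         |AND   s.channel_id = c.channel_id
>>>>>         |GROUP BY t.calendar_month_desc, c.channel_desc""".stripMargin)
>>>>>     tmp.registerTempTable("tmp")   // visible to subsequent hc.sql(...) queries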
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>> Dr Mich Talebzadeh
>>>>> 
>>>>>  
>>>>> 
>>>>> LinkedIn  
>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> 
>>>>>  
>>>>> 
>>>>> 
>>>>>  
>>>>> 
>>>>> http://talebzadehmich.wordpress.com
>>>>> 
>>>>>  
>>>>> 
>>>>> 
>>>>>  
>>>>> 
>>>>> From: Xuefu Zhang [mailto:xzh...@cloudera.com] 
>>>>> Sent: 02 February 2016 23:12
>>>>> To: user@hive.apache.org
>>>>> Subject: Re: Hive on Spark Engine versus Spark using Hive metastore
>>>>> 
>>>>>  
>>>>> 
>>>>> I think the difference is not only about which layer does the 
>>>>> optimization but more about feature parity. Hive on Spark offers all the 
>>>>> functional features that Hive offers, and those features now run faster. 
>>>>> However, Spark SQL is far from offering this parity, as far as I know.
>>>>> 
>>>>>  
>>>>> 
>>>>> On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <m...@peridale.co.uk> 
>>>>> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>>  
>>>>> 
>>>>> My understanding is that with the Hive on Spark engine, one gets the Hive 
>>>>> optimizer and the Spark query engine.
>>>>> 
>>>>>  
>>>>> 
>>>>> With Spark using the Hive metastore, Spark does both the optimization and 
>>>>> the query execution. The only value-add is that one can access the 
>>>>> underlying Hive tables from spark-sql etc.
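>>>>> 
>>>>> (For illustration only, a minimal sketch of what I mean by the latter, 
>>>>> assuming a spark-shell built with Hive support and hive-site.xml on the 
>>>>> classpath:)
>>>>> 
>>>>>     import org.apache.spark.sql.hive.HiveContext
>>>>> 
>>>>>     val hc = new HiveContext(sc)   // connects to the Hive metastore
>>>>>     val df = hc.table("dummy")     // the Hive table, as a DataFrame
>>>>>     df.filter("id = 1").show()     // optimised and executed by Spark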
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>> Is this assessment correct?
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>> Thanks
>>>>> 
>>>>>  
>>>>> 
>>>>> 
>> 
>> 
>> -- 
>> Sorry this was sent from mobile. Will do less grammar and spell check than 
>> usual.
> 
