Check HiveMall
> On 03 Feb 2016, at 05:49, Koert Kuipers <ko...@tresata.com> wrote:
>
> Yeah, but have you ever seen somebody write a real analytical program in
> Hive? How? Where are the basic abstractions to wrap up a large number of
> operations (joins, groupBys) into a single function call? Where are the
> tools to write nice unit tests for that?
>
> For example, in Spark I can write a DataFrame => DataFrame function that
> internally does many joins, groupBys and complex operations, all unit
> tested and perfectly re-usable. And in Hive? Copy and paste SQL queries
> around? That's just dangerous.
>
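The kind of re-usable transformation Koert describes might look roughly like
the Scala sketch below. It is only an illustration, not code from this
thread: the object and function names are made up, and the sales/times/
channels column names are borrowed from Mich's query further down rather
than from any verified schema.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.sum

    // A DataFrame => DataFrame transformation: two joins and a groupBy
    // wrapped in one ordinary function, re-usable wherever DataFrames exist.
    object SalesTransforms {
      def monthlyChannelSales(sales: DataFrame,
                              times: DataFrame,
                              channels: DataFrame): DataFrame =
        sales
          .join(times, "time_id")
          .join(channels, "channel_id")
          .groupBy("calendar_month_desc", "channel_desc")
          .agg(sum("amount_sold").as("total_sales"))
    }

A unit test can then feed three tiny hand-built DataFrames into
monthlyChannelSales and assert on the collected rows, which is the
testability point being made above.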
>> On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>
>> Hive has numerous extension points; you are not boxed in by a long shot.
>>
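One concrete example of those extension points is a user-defined function.
The sketch below is hypothetical (the class name, jar path and function name
are all invented) and uses the classic org.apache.hadoop.hive.ql.exec.UDF
style supported by Hive 1.x.

    import org.apache.hadoop.hive.ql.exec.UDF
    import org.apache.hadoop.io.Text

    // A trivial Hive UDF that trims and lower-cases a string column.
    // Once packaged into a jar it is registered from beeline or the Hive CLI:
    //   ADD JAR /path/to/my-udfs.jar;          -- hypothetical path
    //   CREATE TEMPORARY FUNCTION clean_str AS 'com.example.CleanStr';
    class CleanStr extends UDF {
      def evaluate(input: Text): Text =
        if (input == null) null else new Text(input.toString.trim.toLowerCase)
    }

Generic UDFs, UDAFs, UDTFs, SerDes and storage handlers are further
extension points of the same kind.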
>>> On Tuesday, February 2, 2016, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>> Uhm, with Spark using the Hive metastore you actually have a real
>>> programming environment and you can write real functions, versus just
>>> being boxed into some version of SQL and limited UDFs?
>>>
>>>> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:
>>>>
>>>> When comparing the performance, you need to compare apples to apples. In
>>>> another thread, you mentioned that Hive on Spark is much slower than
>>>> Spark SQL. However, you configured Hive such that only two tasks can run
>>>> in parallel, and you didn't provide information on how much resource
>>>> Spark SQL is utilizing. Thus, it's hard to tell whether it's just a
>>>> configuration problem in your Hive or whether Spark SQL is indeed
>>>> faster. You should be able to see the resource usage in the YARN
>>>> resource manager URL.
>>>>
>>>> --Xuefu
>>>>
>>>>> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>>>>>
>>>>> Thanks Jeff.
>>>>>
>>>>> Obviously Hive is much more feature-rich compared to Spark. Having said
>>>>> that, in certain areas, for example where the SQL feature is available
>>>>> in Spark, Spark seems to deliver results faster.
>>>>>
>>>>> This may be because:
>>>>>
>>>>> 1. Spark does both the optimisation and the execution seamlessly
>>>>> 2. Hive on Spark has to invoke YARN, which adds another layer to the
>>>>>    process
>>>>>
>>>>> Now I did some simple tests on a 100-million-row ORC table available
>>>>> through Hive to both.
>>>>>
>>>>> Spark 1.5.2 on Hive 1.2.1 metastore:
>>>>>
>>>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>>>> 1       0    0    63    rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi    1       xxxxxxxxxx
>>>>> 5       0    4    31    vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA    5       xxxxxxxxxx
>>>>> 100000  99   999  188   abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe    100000  xxxxxxxxxx
>>>>> Time taken: 50.805 seconds, Fetched 3 row(s)
>>>>>
>>>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>>>> 1       0    0    63    rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi    1       xxxxxxxxxx
>>>>> 5       0    4    31    vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA    5       xxxxxxxxxx
>>>>> 100000  99   999  188   abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe    100000  xxxxxxxxxx
>>>>> Time taken: 50.358 seconds, Fetched 3 row(s)
>>>>>
>>>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>>>> 1       0    0    63    rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi    1       xxxxxxxxxx
>>>>> 5       0    4    31    vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA    5       xxxxxxxxxx
>>>>> 100000  99   999  188   abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe    100000  xxxxxxxxxx
>>>>> Time taken: 50.563 seconds, Fetched 3 row(s)
>>>>>
>>>>> So three runs returning three rows, each in just over 50 seconds.
>>>>>
>>>>> Hive 1.2.1 on Spark 1.3.1 execution engine:
>>>>>
>>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
>>>>> INFO  : Query Hive on Spark job[4] stages:
>>>>> INFO  : 4
>>>>> INFO  : Status: Running (Hive on Spark job[4])
>>>>> INFO  : Status: Finished successfully in 82.49 seconds
>>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+
>>>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |
>>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+
>>>>> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  | 1               | xxxxxxxxxx     |
>>>>> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  | 5               | xxxxxxxxxx     |
>>>>> | 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 100000          | xxxxxxxxxx     |
>>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+
>>>>> 3 rows selected (82.66 seconds)
>>>>>
>>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
>>>>> INFO  : Status: Finished successfully in 76.67 seconds
>>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+
>>>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |
>>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+
>>>>> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  | 1               | xxxxxxxxxx     |
>>>>> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  | 5               | xxxxxxxxxx     |
>>>>> | 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 100000          | xxxxxxxxxx     |
>>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+
>>>>> 3 rows selected (76.835 seconds)
>>>>>
>>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
>>>>> INFO  : Status: Finished successfully in 80.54 seconds
>>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+
>>>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |                 dummy.random_string                 | dummy.small_vc  | dummy.padding  |
>>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+
>>>>> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  | 1               | xxxxxxxxxx     |
>>>>> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  | 5               | xxxxxxxxxx     |
>>>>> | 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 100000          | xxxxxxxxxx     |
>>>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+
>>>>> 3 rows selected (80.718 seconds)
>>>>>
>>>>> Three runs returning the same rows in around 80 seconds.
>>>>>
>>>>> It is possible that my Spark engine with Hive is 1.3.1, which is out of
>>>>> date, and that this causes the lag.
>>>>>
>>>>> There are certain queries that one cannot do with Spark. Besides, it
>>>>> does not recognise CHAR fields, which is a pain.
>>>>>
>>>>> spark-sql> CREATE TEMPORARY TABLE tmp AS
>>>>>          > SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
>>>>>          > FROM sales s, times t, channels c
>>>>>          > WHERE s.time_id = t.time_id
>>>>>          > AND s.channel_id = c.channel_id
>>>>>          > GROUP BY t.calendar_month_desc, c.channel_desc
>>>>>          > ;
>>>>> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7.
>>>>> You are likely trying to use an unsupported Hive feature.
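A common workaround for the unsupported CREATE TEMPORARY TABLE ... AS SELECT
at that Spark version is to run the SELECT through a HiveContext and register
the result as a temporary table. The sketch below assumes a spark-shell built
with Hive support, where sqlContext is already a HiveContext pointing at the
same metastore; the table and column names are taken from Mich's query above.

    // In spark-shell with Hive support, sqlContext is already a HiveContext
    // backed by the same Hive metastore the spark-sql CLI uses.
    val totals = sqlContext.sql(
      """SELECT t.calendar_month_desc, c.channel_desc,
        |       SUM(s.amount_sold) AS TotalSales
        |FROM   sales s, times t, channels c
        |WHERE  s.time_id = t.time_id
        |AND    s.channel_id = c.channel_id
        |GROUP BY t.calendar_month_desc, c.channel_desc""".stripMargin)

    totals.registerTempTable("tmp")   // queryable as "tmp" for the session

    sqlContext.sql("SELECT * FROM tmp LIMIT 10").show()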
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>> From: Xuefu Zhang [mailto:xzh...@cloudera.com]
>>>>> Sent: 02 February 2016 23:12
>>>>> To: user@hive.apache.org
>>>>> Subject: Re: Hive on Spark Engine versus Spark using Hive metastore
>>>>>
>>>>> I think the difference is not only about which layer does the
>>>>> optimization but more about feature parity. Hive on Spark offers all the
>>>>> functional features that Hive offers, and those features run faster on
>>>>> the Spark engine. However, Spark SQL is far from offering this parity,
>>>>> as far as I know.
>>>>>
>>>>> On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> My understanding is that with Hive on the Spark engine, one gets the
>>>>> Hive optimizer and the Spark query engine.
>>>>>
>>>>> With Spark using the Hive metastore, Spark does both the optimization
>>>>> and the query execution. The only value-add is that one can access the
>>>>> underlying Hive tables from spark-sql etc.
>>>>>
>>>>> Is this assessment correct?
>>>>>
>>>>> Thanks
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
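The division of labour Mich is asking about can also be checked directly. In
a spark-shell with Hive support (again assuming the dummy table from the
benchmark above exists in the metastore), the table definition comes from
Hive, but the plan printed by explain() is built by Spark's Catalyst
optimiser and executed by Spark:

    // Hive supplies only the table metadata; Catalyst plans, Spark executes.
    val dummy = sqlContext.table("dummy")                 // Hive metastore lookup
    dummy.filter("id in (1, 5, 100000)").explain(true)    // extended Catalyst plan

With Hive on Spark, by contrast, Hive itself parses and optimises the query
and hands the resulting work to Spark for execution.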
>>
>> --
>> Sorry, this was sent from mobile. Will do less grammar and spell check
>> than usual.