ok i am sure there is some way to do it. i am going to guess: snippets of hive code stuck together with oozie jobs or whatever, where the oozie jobs become the re-usable pieces, perhaps? now you've got sql and xml, completely lacking any benefit of a compiler to catch errors. unit tests will be slow, if available at all. so yeah, i am sure it can be made to *work*, just like you can get a nail into a wall with a screwdriver if you really want to.
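to make the point about re-usable pieces concrete, here is a minimal sketch of the kind of single, unit-testable function i mean: one named function wrapping two joins plus a group-by aggregation. in spark this would be a DataFrame => DataFrame function; this sketch uses plain python dicts as a stand-in so it stays self-contained, and all table and column names are made up for illustration:

```python
# Sketch: wrap several relational steps (two joins + a group-by sum) into
# one named function, so it can be unit tested and re-used as a whole.
# Rows are plain dicts standing in for DataFrame rows; names are hypothetical.
from collections import defaultdict

def monthly_sales(sales, times, channels):
    """Join sales to times and channels, then sum amount_sold
    per (calendar_month_desc, channel_desc)."""
    time_by_id = {t["time_id"]: t for t in times}          # build join index
    channel_by_id = {c["channel_id"]: c for c in channels}  # build join index
    totals = defaultdict(float)
    for s in sales:  # join each sale to its month and channel, accumulate
        month = time_by_id[s["time_id"]]["calendar_month_desc"]
        channel = channel_by_id[s["channel_id"]]["channel_desc"]
        totals[(month, channel)] += s["amount_sold"]
    return [
        {"calendar_month_desc": m, "channel_desc": c, "total_sales": v}
        for (m, c), v in sorted(totals.items())
    ]

# Because the whole pipeline is one function, a unit test is one call:
sales = [
    {"time_id": 1, "channel_id": 10, "amount_sold": 100.0},
    {"time_id": 1, "channel_id": 10, "amount_sold": 50.0},
    {"time_id": 2, "channel_id": 20, "amount_sold": 25.0},
]
times = [{"time_id": 1, "calendar_month_desc": "2016-01"},
         {"time_id": 2, "calendar_month_desc": "2016-02"}]
channels = [{"channel_id": 10, "channel_desc": "Internet"},
            {"channel_id": 20, "channel_desc": "Direct"}]

result = monthly_sales(sales, times, channels)
assert result == [
    {"calendar_month_desc": "2016-01", "channel_desc": "Internet", "total_sales": 150.0},
    {"calendar_month_desc": "2016-02", "channel_desc": "Direct", "total_sales": 25.0},
]
```

the point is not the toy join implementation, it is that the whole composite operation has a name, a signature, and a fast test — which is what gets lost when the pipeline lives as copy-pasted sql inside xml workflow definitions.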
On Tue, Feb 2, 2016 at 11:49 PM, Koert Kuipers <ko...@tresata.com> wrote:

> yeah but have you ever seen someone write a real analytical program in
> hive? how? where are the basic abstractions to wrap up a large number of
> operations (joins, groupBys) into a single function call? where are the
> tools to write nice unit tests for that?
>
> for example in spark i can write a DataFrame => DataFrame function that
> internally does many joins, groupBys and complex operations, all unit
> tested and perfectly re-usable. and in hive? copy-paste sql queries
> around? that's just dangerous.
>
> On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
>> Hive has numerous extension points; you are not boxed in by a long shot.
>>
>> On Tuesday, February 2, 2016, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> uuuhm with spark using the Hive metastore you actually have a real
>>> programming environment and you can write real functions, versus just
>>> being boxed into some version of sql and limited udfs?
>>>
>>> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:
>>>
>>>> When comparing performance, you need to compare apples to apples. In
>>>> another thread, you mentioned that Hive on Spark is much slower than
>>>> Spark SQL. However, you had configured Hive such that only two tasks
>>>> could run in parallel, and you didn't provide information on how much
>>>> Spark SQL was utilizing. Thus, it's hard to tell whether it's just a
>>>> configuration problem in your Hive setup or whether Spark SQL is
>>>> indeed faster. You should be able to see the resource usage in the
>>>> YARN resource manager URL.
>>>>
>>>> --Xuefu
>>>>
>>>> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>>>>
>>>>> Thanks Jeff.
>>>>>
>>>>> Obviously Hive is much more feature-rich compared to Spark. Having
>>>>> said that, in certain areas, for example where the same SQL feature
>>>>> is available in Spark, Spark seems to deliver results faster.
>>>>> This may be because:
>>>>>
>>>>> 1. Spark does both the optimisation and the execution seamlessly.
>>>>> 2. Hive on Spark has to invoke YARN, which adds another layer to
>>>>>    the process.
>>>>>
>>>>> I did some simple tests against a 100-million-row ORC table,
>>>>> available through Hive to both.
>>>>>
>>>>> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>>>>>
>>>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>>>> 1       0   0    63   rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  1       xxxxxxxxxx
>>>>> 5       0   4    31   vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  5       xxxxxxxxxx
>>>>> 100000  99  999  188  abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  100000  xxxxxxxxxx
>>>>> Time taken: 50.805 seconds, Fetched 3 row(s)
>>>>>
>>>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>>>> 1       0   0    63   rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  1       xxxxxxxxxx
>>>>> 5       0   4    31   vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  5       xxxxxxxxxx
>>>>> 100000  99  999  188  abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  100000  xxxxxxxxxx
>>>>> Time taken: 50.358 seconds, Fetched 3 row(s)
>>>>>
>>>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>>>> 1       0   0    63   rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  1       xxxxxxxxxx
>>>>> 5       0   4    31   vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  5       xxxxxxxxxx
>>>>> 100000  99  999  188  abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  100000  xxxxxxxxxx
>>>>> Time taken: 50.563 seconds, Fetched 3 row(s)
>>>>>
>>>>> So three runs, each returning the three rows in just over 50 seconds.
>>>>>
>>>>> *Hive 1.2.1 on Spark 1.3.1 execution engine*
>>>>>
>>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id
>>>>> in (1, 5, 100000);
>>>>> INFO  : Query Hive on Spark job[4] stages:
>>>>> INFO  : 4
>>>>> INFO  : Status: Running (Hive on Spark job[4])
>>>>> INFO  : Status: Finished successfully in 82.49 seconds
>>>>> | dummy.id | dummy.clustered | dummy.scattered | dummy.randomised | dummy.random_string                                | dummy.small_vc | dummy.padding |
>>>>> | 1        | 0               | 0               | 63               | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi | 1              | xxxxxxxxxx    |
>>>>> | 5        | 0               | 4               | 31               | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA | 5              | xxxxxxxxxx    |
>>>>> | 100000   | 99              | 999             | 188              | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe | 100000         | xxxxxxxxxx    |
>>>>> 3 rows selected (82.66 seconds)
>>>>>
>>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id
>>>>> in (1, 5, 100000);
>>>>> INFO  : Status: Finished successfully in 76.67 seconds
>>>>> | dummy.id | dummy.clustered | dummy.scattered | dummy.randomised | dummy.random_string                                | dummy.small_vc | dummy.padding |
>>>>> | 1        | 0               | 0               | 63               | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi | 1              | xxxxxxxxxx    |
>>>>> | 5        | 0               | 4               | 31               | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA | 5              | xxxxxxxxxx    |
>>>>> | 100000   | 99              | 999             | 188              | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe | 100000         | xxxxxxxxxx    |
>>>>> 3 rows selected (76.835 seconds)
>>>>>
>>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id
>>>>> in (1, 5, 100000);
>>>>> INFO  : Status: Finished successfully in 80.54 seconds
>>>>> | dummy.id | dummy.clustered | dummy.scattered | dummy.randomised | dummy.random_string                                | dummy.small_vc | dummy.padding |
>>>>> | 1        | 0               | 0               | 63               | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi | 1              | xxxxxxxxxx    |
>>>>> | 5        | 0               | 4               | 31               | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA | 5              | xxxxxxxxxx    |
>>>>> | 100000   | 99              | 999             | 188              | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe | 100000         | xxxxxxxxxx    |
>>>>> 3 rows selected (80.718 seconds)
>>>>>
>>>>> Three runs returning the same three rows in around 80 seconds.
>>>>> It is possible that my Spark engine with Hive, at 1.3.1, is out of
>>>>> date and that this causes the lag.
>>>>>
>>>>> There are certain queries that one cannot run with Spark. Besides,
>>>>> it does not recognise CHAR fields, which is a pain.
>>>>>
>>>>> spark-sql> CREATE TEMPORARY TABLE tmp AS
>>>>>          > SELECT t.calendar_month_desc, c.channel_desc,
>>>>>          >        SUM(s.amount_sold) AS TotalSales
>>>>>          > FROM sales s, times t, channels c
>>>>>          > WHERE s.time_id = t.time_id
>>>>>          > AND   s.channel_id = c.channel_id
>>>>>          > GROUP BY t.calendar_month_desc, c.channel_desc;
>>>>> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7.
>>>>> You are likely trying to use an unsupported Hive feature.
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>> *Sybase ASE 15 Gold Medal Award 2008*
>>>>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>>>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>>>> Author of the book *"A Practitioner’s Guide to Upgrading to Sybase
>>>>> ASE 15"*, ISBN 978-0-9563693-0-7.
>>>>> Co-author of *"Sybase Transact SQL Guidelines Best Practices"*, ISBN
>>>>> 978-0-9759693-0-4
>>>>> *Publications due shortly:*
>>>>> *Complex Event Processing in Heterogeneous Environments*, ISBN
>>>>> 978-0-9563693-3-8
>>>>> *Oracle and Sybase, Concepts and Contrasts*, ISBN 978-0-9563693-1-4,
>>>>> volume one out shortly
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>> NOTE: The information in this email is proprietary and confidential.
>>>>> This message is for the designated recipient only; if you are not
>>>>> the intended recipient, you should destroy it immediately. Any
>>>>> information in this message shall not be understood as given or
>>>>> endorsed by Peridale Technology Ltd, its subsidiaries or their
>>>>> employees, unless expressly so stated. It is the responsibility of
>>>>> the recipient to ensure that this email is virus free; neither
>>>>> Peridale Technology Ltd, its subsidiaries nor their employees accept
>>>>> any responsibility.
>>>>>
>>>>> *From:* Xuefu Zhang [mailto:xzh...@cloudera.com]
>>>>> *Sent:* 02 February 2016 23:12
>>>>> *To:* user@hive.apache.org
>>>>> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>>>>>
>>>>> I think the difference is not only about which layer does the
>>>>> optimization, but more about feature parity. Hive on Spark offers
>>>>> all the functional features that Hive offers, and these features play
>>>>> out faster. However, Spark SQL is far from offering this parity, as
>>>>> far as I know.
>>>>>
>>>>> On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> My understanding is that with the Hive on Spark engine, one gets the
>>>>> Hive optimizer and the Spark query engine.
>>>>>
>>>>> With Spark using the Hive metastore, Spark does both the optimization
>>>>> and the query execution. The only value-add is that one can access
>>>>> the underlying Hive tables from spark-sql etc.
>>>>>
>>>>> Is this assessment correct?
>>>>> Thanks
>>>>>
>>>>> Dr Mich Talebzadeh
>>
>> --
>> Sorry this was sent from mobile. Will do less grammar and spell check
>> than usual.
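[Editorial footnote on the `CREATE TEMPORARY TABLE ... AS` failure quoted above: in Spark 1.x the usual workaround is to run the SELECT as a DataFrame and call `registerTempTable` on the result, rather than using the Hive CTAS syntax. The semantics of the query itself can be checked with the stdlib `sqlite3` module and hypothetical toy data; this is a sketch of the query's meaning, not of any Spark API:]

```python
# Sketch: the same join + GROUP BY as the failing CTAS, minus the
# TEMPORARY TABLE wrapper, run against an in-memory sqlite3 database.
# All data below is made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE sales (time_id INT, channel_id INT, amount_sold REAL);
CREATE TABLE times (time_id INT, calendar_month_desc TEXT);
CREATE TABLE channels (channel_id INT, channel_desc TEXT);
INSERT INTO sales VALUES (1, 10, 100.0), (1, 10, 50.0), (2, 20, 25.0);
INSERT INTO times VALUES (1, '2016-01'), (2, '2016-02');
INSERT INTO channels VALUES (10, 'Internet'), (20, 'Direct');
""")

cur.execute("""
SELECT t.calendar_month_desc, c.channel_desc,
       SUM(s.amount_sold) AS TotalSales
FROM sales s, times t, channels c
WHERE s.time_id = t.time_id
  AND s.channel_id = c.channel_id
GROUP BY t.calendar_month_desc, c.channel_desc
ORDER BY 1, 2
""")
rows = cur.fetchall()
assert rows == [('2016-01', 'Internet', 150.0), ('2016-02', 'Direct', 25.0)]
```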