fair enough

On Thu, Feb 4, 2016 at 12:41 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> Hive is not the correct tool for every problem. Use the tool that makes
> the most sense for your problem and your experience.
>
> Many people like Hive because it is generally applicable. In my case study
> for the Hive book, I highlighted that many smart, capable organizations use
> Hive.
>
> Your argument is totally valid. You like X better because X works for you.
> You don't need to 'preach' here; we all know Hive has its limits.
>
> On Thu, Feb 4, 2016 at 10:55 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> Is the sky the limit? I know UDFs can be used inside Hive, basically like
>> lambdas I assume, and I will assume you have something similar for
>> aggregations. But that's just abstraction inside a single map or reduce
>> phase, pretty low-level stuff. What you really need is abstraction around
>> many map and reduce phases, because that is the level an algorithm is
>> expressed at.
>>
>> For example, when doing logistic regression you want to be able to do
>> something like:
>>
>>   read("somefile").train(settings).write("model")
>>
>> Here train is an externally defined method that is well tested and could
>> do many map and reduce steps internally (or even be defined at a higher
>> level and compile into those steps). What is the equivalent in Hive?
>> Copy-pasting crucial parts of the algorithm around while using UDFs is
>> just not the same thing in terms of reusability and abstraction. It's the
>> opposite of keeping it DRY.
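For illustration, a minimal Scala sketch of the kind of externally defined, reusable train step Koert describes, written against the Spark 1.5-era spark.ml API. The parquet path, the parameter values, and the assumption that the input already carries the Spark ML default "features"/"label" columns are illustrative, not from the thread:

```scala
// A sketch of a reusable train() step (Spark 1.5-era spark.ml API).
// One well-tested call that expands into many map/reduce stages internally.
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.sql.DataFrame

def train(df: DataFrame, maxIter: Int, regParam: Double): LogisticRegressionModel =
  new LogisticRegression()
    .setMaxIter(maxIter)   // illustrative settings
    .setRegParam(regParam)
    .fit(df)               // expects "features" (Vector) and "label" (Double) columns

// Usage, mirroring read("somefile").train(settings).write("model"):
// val model = train(sqlContext.read.parquet("somefile"), maxIter = 100, regParam = 0.01)
```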
>> On Feb 3, 2016 1:06 AM, "Ryan Harris" <ryan.har...@zionsbancorp.com> wrote:
>>
>>> https://github.com/myui/hivemall
>>>
>>> As long as you are comfortable with Java UDFs, the sky is really the
>>> limit... it's not for everyone, and Spark does have many advantages, but
>>> they are two tools that can complement each other in numerous ways.
>>>
>>> I don't know that there is necessarily a universal "better" for how to
>>> use Spark as an execution engine (or if Spark is necessarily the *best*
>>> execution engine for any given Hive job).
>>>
>>> The reality is that once you start factoring in the numerous tuning
>>> parameters of the systems and jobs, there probably isn't a clear answer.
>>> For some queries the Catalyst optimizer may do a better job... is it
>>> going to do a better job with ORC-based data? Less likely, IMO.
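To make the "Java UDFs" extension point concrete, a minimal sketch of a classic one-argument Hive UDF. It is written in Scala here (it compiles to the same JVM bytecode Hive loads; plain Java works identically), and the class name, behavior, and jar path are illustrative:

```scala
// A sketch of a one-argument Hive UDF. Hive resolves evaluate()
// reflectively at query time; nulls must be handled explicitly.
// Registered from the Hive CLI along the lines of (paths illustrative):
//   ADD JAR /path/to/udfs.jar;
//   CREATE TEMPORARY FUNCTION normalize_text AS 'NormalizeText';
import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

class NormalizeText extends UDF {
  def evaluate(input: Text): Text =
    if (input == null) null
    else new Text(input.toString.trim.toLowerCase)
}
```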
>>> *From:* Koert Kuipers [mailto:ko...@tresata.com]
>>> *Sent:* Tuesday, February 02, 2016 9:50 PM
>>> *To:* user@hive.apache.org
>>> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>>>
>>> Yeah, but have you ever seen someone write a real analytical program in
>>> Hive? How? Where are the basic abstractions to wrap up a large number of
>>> operations (joins, group-bys) into a single function call? Where are the
>>> tools to write nice unit tests for that?
>>>
>>> For example, in Spark I can write a DataFrame => DataFrame function that
>>> internally does many joins, groupBys and complex operations, all unit
>>> tested and perfectly reusable [see the sketch below]. And in Hive?
>>> Copy-pasting SQL queries around? That's just dangerous.
>>>
>>> On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>
>>> Hive has numerous extension points; you are not boxed in by a long shot.
>>>
>>> On Tuesday, February 2, 2016, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>> Uhm, with Spark using the Hive metastore you actually have a real
>>> programming environment and you can write real functions, versus just
>>> being boxed into some version of SQL and limited UDFs?
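Picking up Koert's DataFrame => DataFrame point, a minimal sketch of such a reusable transformation, borrowing the sales/times/channels schema that appears later in the thread; the function name and everything else about it are illustrative assumptions:

```scala
// A sketch of a reusable, unit-testable DataFrame => DataFrame transform
// (Spark 1.5-era API). It hides two joins and an aggregation behind one
// function that can be tested against small hand-built inputs.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.sum

def salesByChannelAndMonth(sales: DataFrame, times: DataFrame,
                           channels: DataFrame): DataFrame =
  sales
    .join(times, "time_id")
    .join(channels, "channel_id")
    .groupBy("calendar_month_desc", "channel_desc")
    .agg(sum("amount_sold").as("TotalSales"))
```

Because it is just a function over DataFrames, it can be exercised in a unit test with a local SparkContext and tiny fabricated inputs, which is the reusability Koert is after.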
>>> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:
>>>
>>> When comparing performance, you need to compare apples to apples. In
>>> another thread, you mentioned that Hive on Spark is much slower than
>>> Spark SQL, but you had configured Hive such that only two tasks could run
>>> in parallel, and you didn't provide information on how many resources
>>> Spark SQL was utilizing. Thus, it's hard to tell whether it's just a
>>> configuration problem in your Hive or whether Spark SQL is indeed faster.
>>> You should be able to see the resource usage in the YARN resource manager
>>> URL.
>>>
>>> --Xuefu
>>>
>>> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>>>
>>> Thanks Jeff.
>>>
>>> Obviously Hive is much more feature-rich compared to Spark. Having said
>>> that, in certain areas, for example where the same SQL feature is
>>> available in Spark, Spark seems to deliver results faster. This may be
>>> because:
>>>
>>> 1. Spark does both the optimisation and the execution seamlessly
>>> 2. Hive on Spark has to invoke YARN, which adds another layer to the
>>>    process
>>>
>>> Now I did some simple tests on a 100-million-row ORC table available
>>> through Hive to both.
>>>
>>> *Spark 1.5.2 on Hive 1.2.1 metastore*
>>>
>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>> 1       0   0    63   rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi       1  xxxxxxxxxx
>>> 5       0   4    31   vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA       5  xxxxxxxxxx
>>> 100000  99  999  188  abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  100000  xxxxxxxxxx
>>> Time taken: 50.805 seconds, Fetched 3 row(s)
>>>
>>> The same query run twice more returned the same rows in 50.358 and 50.563
>>> seconds. So three runs, each returning three rows in just over 50 seconds.
>>>
>>> *Hive 1.2.1 on Spark 1.3.1 execution engine*
>>>
>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
>>> INFO  : Query Hive on Spark job[4] stages:
>>> INFO  : 4
>>> INFO  : Status: Running (Hive on Spark job[4])
>>> INFO  : Status: Finished successfully in 82.49 seconds
>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+
>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  | dummy.random_string                                 | dummy.small_vc  | dummy.padding  |
>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+
>>> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  | 1               | xxxxxxxxxx     |
>>> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  | 5               | xxxxxxxxxx     |
>>> | 100000    | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 100000          | xxxxxxxxxx     |
>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+
>>> 3 rows selected (82.66 seconds)
>>>
>>> The second and third runs finished in 76.67 seconds (76.835 end to end)
>>> and 80.54 seconds (80.718 end to end). So three runs returning the same
>>> rows in roughly 80 seconds.
>>>
>>> It is possible that this lag is because the Spark engine under my Hive is
>>> 1.3.1, which is out of date.
>>>
>>> There are also certain queries that one cannot run in Spark SQL. Besides,
>>> it does not recognize CHAR fields, which is a pain:
>>>
>>> spark-sql> CREATE TEMPORARY TABLE tmp AS
>>>          > SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
>>>          > FROM sales s, times t, channels c
>>>          > WHERE s.time_id = t.time_id
>>>          > AND s.channel_id = c.channel_id
>>>          > GROUP BY t.calendar_month_desc, c.channel_desc;
>>> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7.
>>> You are likely trying to use an unsupported Hive feature.
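A hedged workaround for the unsupported CREATE TEMPORARY TABLE ... AS syntax above: assuming the session is driven from a Spark shell whose `sqlContext` is a HiveContext bound to the same metastore (an assumption about the setup), the same result can be built as a DataFrame and registered as a temporary table:

```scala
// A sketch of a CTAS workaround (Spark 1.5-era API): run the SELECT as a
// DataFrame, then register it so later SQL in this session can query `tmp`.
val totals = sqlContext.sql(
  """SELECT t.calendar_month_desc, c.channel_desc,
    |       SUM(s.amount_sold) AS TotalSales
    |FROM sales s, times t, channels c
    |WHERE s.time_id = t.time_id
    |  AND s.channel_id = c.channel_id
    |GROUP BY t.calendar_month_desc, c.channel_desc""".stripMargin)

totals.registerTempTable("tmp") // session-scoped, like a temporary table
```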
>>> *From:* Xuefu Zhang [mailto:xzh...@cloudera.com]
>>> *Sent:* 02 February 2016 23:12
>>> *To:* user@hive.apache.org
>>> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>>>
>>> I think the difference is not only about which layer does the
>>> optimization but more about feature parity. Hive on Spark offers all the
>>> functional features that Hive offers, and these features play out faster.
>>> However, Spark SQL is far from offering this parity, as far as I know.
>>>
>>> On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>>>
>>> Hi,
>>>
>>> My understanding is that with Hive on the Spark engine, one gets the Hive
>>> optimizer and the Spark query engine.
>>>
>>> With Spark using the Hive metastore, Spark does both the optimization and
>>> the query execution. The only added value is that one can access the
>>> underlying Hive tables from spark-sql etc.
>>>
>>> Is this assessment correct?
>>>
>>> Thanks,
>>>
>>> Dr Mich Talebzadeh
>>>
>>> http://talebzadehmich.wordpress.com
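To ground the second setup Mich describes ("Spark using the Hive metastore"), a minimal sketch using the Spark 1.5-era API. It assumes a hive-site.xml on the classpath pointing at the shared metastore; the object and app names are illustrative, and the query is the one benchmarked above:

```scala
// A sketch of Spark reading Hive-defined tables. Only the table metadata
// comes from the Hive metastore; Spark (Catalyst) plans and executes.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveMetastoreDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hive-metastore-demo"))
    val sqlContext = new HiveContext(sc)
    sqlContext.sql("SELECT * FROM dummy WHERE id IN (1, 5, 100000)").show()
  }
}
```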