OK two questions here please:
1. Which version of Hive are you running?
2. Have you tried Hive on Spark, which does both DAG and in-memory
   calculation?

Query Hive on Spark job[1] stages:
INFO : 2
INFO : 3

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 2 March 2016 at 18:14, Dayong <will...@gmail.com> wrote:

> Tez is kind of outdated and ORC is so dedicated to Hive. In addition,
> the Hive metadata store can be decoupled from Hive as well. In reality,
> we do suffer from Hive's performance even for ETL jobs. As a result,
> we'll switch to Impala + Spark/Flink.
>
> Thanks,
> Dayong
>
> On Mar 2, 2016, at 10:35 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> I forgot: besides LLAP you are going to have Hive Hybrid Procedural SQL
> On Hadoop (HPL/SQL), which is going to add another dimension to Hive.
>
> Dr Mich Talebzadeh
>
> On 2 March 2016 at 15:30, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> SQL plays an increasingly important role on Hadoop. As of today Hive,
>> IMO, provides the best and most robust solution to anything resembling
>> a Data Warehouse "solution" on Hadoop, chiefly by means of its powerful
>> metastore, which can be hosted on a variety of mission-critical
>> databases, plus Hive's ever-increasing support for a variety of file
>> types on HDFS, from humble textfile to ORC. The remaining tools are
>> little more than query tools that crucially rely on the Hive Metastore
>> for their needs. Take away the Hive component and they are more or less
>> lame ducks.
>>
>> Hive on MR speed was perceived to be slow, but what the heck, we are
>> talking about a Data Warehouse here, which for the most part should be
>> batch oriented and not user-facing. In Hive 0.14 and 2.0 you can use
>> Spark and Tez as the execution engine, and if you are well into
>> functional programming, you can deploy Spark on Hive. If you look
>> around, from Impala to Spark the architecture is essentially a query
>> tool.
>>
>> Dr Mich Talebzadeh
>>
>> On 2 March 2016 at 13:52, Dayong <will...@gmail.com> wrote:
>>
>>> As I remember from the Hadoop weekly news feed a few weeks ago,
>>> Cloudera has a benchmark showing Impala is a little better than Spark
>>> SQL and Hive with Tez. You can check that. From my experience, Hive is
>>> still the leading tool for regular ETL jobs since it is stable. The
>>> other tools are better for ad hoc and interactive query use cases.
>>> Cloudera is betting on Impala, especially with its new Kudu project.
>>>
>>> Thanks,
>>> Dayong
>>>
>>> On Mar 1, 2016, at 5:14 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>
>>> My knocks on Impala (not intended to be a post knocking Impala):
>>>
>>> Impala really has not delivered on the complex types that Hive has
>>> (after promising them for quite a while). Also, it only works with the
>>> 'blessed' input formats: Parquet, Avro, text.
>>>
>>> It is very annoying to work with Impala. In my version, if you create
>>> a partition in Hive, Impala does not see it. You have to run
>>> "refresh".
>>>
>>> In Impala I do not have all the UDFs that Hive has, like percentile,
>>> etc.
>>>
>>> Impala is fast. Many data-analyst / data-scientist types can't wait 10
>>> seconds for a query, so when I need to produce something for them I
>>> make sure the data has no complex types and uses a table type that
>>> Impala understands.
>>>
>>> But for my work I still work primarily in Hive, because I do not want
>>> to deal with all the things that Impala does not have or might have,
>>> and when I need something special like my own UDFs it is easier to
>>> whip up the solution in Hive.
>>>
>>> Having worked with M$ SQL Server and Vertica, Impala is on par with
>>> them, but I don't think of it like I think of Hive. To me it just
>>> feels like a Vertica that I can cheat loading sometimes because it is
>>> backed by HDFS.
>>>
>>> Hive is something different: I am making pipelines, I am transforming
>>> data, doing streaming, writing custom UDFs, querying JSON directly.
>>> It's != Impala.
>>>
>>> ::random message of the day::
>>>
>>> On Tue, Mar 1, 2016 at 4:38 PM, Ashok Kumar <ashok34...@yahoo.com> wrote:
>>>
>>>> Dr Mich,
>>>>
>>>> My two cents here.
>>>>
>>>> I don't have direct experience of Impala, but in my humble opinion I
>>>> share your view that Hive provides the best metastore of all Big Data
>>>> systems. Looking around, almost every product in one shape or form
>>>> uses Hive code somewhere. My colleagues inform me that Hive is one of
>>>> the most stable Big Data products.
>>>>
>>>> With the capabilities of Spark on Hive and Hive on Spark or Tez, plus
>>>> of course MR, there is really little need for many other products in
>>>> the same space. It is good to keep things simple.
>>>>
>>>> Warmest
>>>>
>>>> On Tuesday, 1 March 2016, 11:33, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>> I do not hear much about Impala anymore. I saw an article on LinkedIn
>>>> titled
>>>>
>>>> "Apache Hive Or Cloudera Impala? What is Best for me?"
>>>>
>>>> "We can access all objects from Hive data warehouse with HiveQL which
>>>> leverages the map-reduce architecture in background for data
>>>> retrieval and transformation and this results in latency."
>>>>
>>>> My response was:
>>>>
>>>> This statement is no longer valid, as you now have a choice of three
>>>> engines: MR, Spark and Tez. I have not used Impala myself, as I don't
>>>> think there is a need for it with Hive on Spark, or Spark using the
>>>> Hive metastore, providing whatever is needed. Hive is for Data
>>>> Warehouse and provides what it says on the tin. Please also bear in
>>>> mind that Hive offers ORC storage files that provide store index
>>>> capabilities, further optimizing queries with additional stats at
>>>> file, stripe and row group levels.
>>>>
>>>> Anyway, the question is: with Hive on Spark, or Spark using the Hive
>>>> metastore, what can we not achieve that we can achieve with Impala?
>>>>
>>>> Dr Mich Talebzadeh