Mich - it sounds like maybe you should try these benchmarks with Alluxio abstracting the storage layer, and see how much of a difference it makes. Alluxio should (if I understand it right) provide a lot of the optimisation you're looking for with in-memory work.
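Concretely, Alluxio presents itself as a Hadoop-compatible filesystem, so Hive (and therefore Hive on Spark) can point a table straight at it. A minimal hedged sketch - the table, columns and the Alluxio master address are assumptions, not from this thread, and it presumes the Alluxio client jar is on Hive's classpath:

    -- hypothetical table on an Alluxio-backed location
    CREATE EXTERNAL TABLE sales_alluxio (
      id     INT,
      amount DOUBLE
    )
    STORED AS ORC
    LOCATION 'alluxio://master:19998/sales';

    -- reads now go through Alluxio's memory-backed storage tier
    SELECT COUNT(*) FROM sales_alluxio;

The same alluxio:// URI should work from Spark as well, which is what makes the storage layer swappable under the benchmarks.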
I've never used it, but I would love to hear the experiences of people who have.

On Mon, May 30, 2016 at 5:32 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> I think we are going to move to a model where the computation stack will be separate from the storage stack, and moreover something like Hive that provides the means for persistent storage (well, HDFS is the one that stores all the data) will have an in-memory capability much like what Oracle TimesTen IMDB does with its big brother Oracle. Now TimesTen is effectively designed to provide in-memory analytics capability for Oracle 12c. The two work like an index or materialized view: you write queries against tables, and the optimizer figures out whether to use row-oriented, indexed storage to answer (Oracle classic) or column-oriented, non-indexed storage (TimesTen). Just one optimizer.
>
> I gather Hive will be like that eventually. It will decide, based on the frequency of access, where to look for data. Yes, we may have 10 TB of data on disk, but how much of it is frequently accessed (hot data)? The 80-20 rule? In reality maybe just 2 TB, or the most recent partitions, etc. The rest is cold data.
>
> cheers
>
> Dr Mich Talebzadeh
>
> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 30 May 2016 at 21:59, Michael Segel <msegel_had...@hotmail.com> wrote:
>
>> And you have MapR supporting Apache Drill.
>>
>> So these are all alternatives to Spark, and it's not necessarily an either/or scenario. You can have both.
>>
>> On May 30, 2016, at 12:49 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>> Yep, Hortonworks supports Tez for one reason or another, and I am hopefully going to test it as the query engine for Hive. Though I think Spark will be faster because of its in-memory support.
>>
>> Also, if you are independent then you are better off dealing with Spark and Hive without the need to support another stack like Tez.
>>
>> Cloudera supports Impala instead of Hive, but it is not something I have used.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> On 30 May 2016 at 20:19, Michael Segel <msegel_had...@hotmail.com> wrote:
>>
>>> Mich,
>>>
>>> Most people use vendor releases because they need to have the support. Hortonworks is the vendor with the most skin in the game when it comes to Tez.
>>>
>>> If memory serves, Tez isn't going to be M/R but a local execution engine? Then LLAP is the in-memory piece to speed up Tez?
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>> On May 29, 2016, at 1:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>> Thanks. I think the problem is that the TEZ user group is exceptionally quiet. I have just sent an email to the Hive user group to see if anyone has managed to build a vendor-independent version.
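For anyone who does get a vendor-independent Tez built: once the installation is wired up, pointing Hive at it is just a session setting. A hedged sketch, assuming a working Tez install on the cluster (the table is the one used in the benchmarks below):

    set hive.execution.engine=tez;
    -- any subsequent query in the session then runs as a Tez DAG rather than MapReduce
    select count(1) from oraclehadoop.dummy;

This mirrors the set hive.execution.engine=mr switch shown in the transcript further down.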
>>> Dr Mich Talebzadeh
>>>
>>> On 29 May 2016 at 21:23, Jörn Franke <jornfra...@gmail.com> wrote:
>>>
>>>> Well, I think it is different from MR. It has some optimizations which you do not find in MR. Especially the LLAP option in Hive 2 makes it interesting.
>>>>
>>>> I think Hive 1.2 works with Tez 0.7, and Hive 2.0 with Tez 0.8. At least for 1.2 it is integrated in the Hortonworks distribution.
>>>>
>>>> On 29 May 2016, at 21:43, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>> Hi Jörn,
>>>>
>>>> I started building apache-tez-0.8.2 but got a few errors. A couple of guys from the TEZ user group kindly gave a hand, but I could not get very far (or maybe I did not make enough effort) making it work.
>>>>
>>>> That TEZ user group is very quiet as well.
>>>>
>>>> My understanding is that TEZ is MR with DAG, but of course Spark has both, plus in-memory capability.
>>>>
>>>> It would be interesting to see what version of TEZ works as the execution engine with Hive.
>>>>
>>>> Vendors are divided on this (use Hive with TEZ, or use Impala instead of Hive, etc.), as I am sure you already know.
>>>>
>>>> Cheers,
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>> On 29 May 2016 at 20:19, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>
>>>>> Very interesting. Do you also plan a test with TEZ?
>>>>>
>>>>> On 29 May 2016, at 13:40, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I did another study of Hive using the Spark engine compared to Hive with MR.
>>>>>
>>>>> Basically, I took the original table imported using Sqoop and created and populated a new ORC table partitioned by year and month into 48 partitions, as follows (a hedged sketch of this DDL is given further below):
>>>>>
>>>>> <sales_partition.PNG>
>>>>>
>>>>> Connections use JDBC via beeline. Using MR it takes an average of 17 minutes for each PARTITION, as seen below. And that is just an individual partition; there are 48 partitions.
>>>>>
>>>>> In contrast, doing the same operation with the Spark engine took 10 minutes all inclusive. I just gave up on MR. You can see the StartTime and FinishTime below:
>>>>>
>>>>> <image.png>
>>>>>
>>>>> This by no means indicates that Spark is much better than MR, but it shows that some very good results can be achieved using the Spark engine.
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> On 24 May 2016 at 08:03, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We use Hive as the database and use Spark as an all-purpose query tool.
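The DDL in <sales_partition.PNG> above only survives as an image. As a hedged reconstruction of the shape described (ORC, partitioned by year and month; the column names, types and the staging table name are assumptions - only the oraclehadoop database is from the thread):

    CREATE TABLE oraclehadoop.sales_orc (
      id     INT,
      amount DOUBLE
    )
    PARTITIONED BY (year INT, month INT)
    STORED AS ORC;

    -- populate all 48 partitions from the Sqoop-imported staging table
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    INSERT OVERWRITE TABLE oraclehadoop.sales_orc PARTITION (year, month)
    SELECT id, amount, year, month
    FROM oraclehadoop.sales_staging;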
>>>>>> Whether Hive is the right database for the purpose, or whether one is better off with something like Phoenix on HBase, well, the answer is it depends, and your mileage varies.
>>>>>>
>>>>>> So, fit for purpose.
>>>>>>
>>>>>> Ideally what one wants is to use the fastest method to get the results. How fast is confined to our SLA agreements in production, and that keeps us from unnecessary further work, as we technologists like to play around.
>>>>>>
>>>>>> So in short, we use Spark most of the time and use Hive as the backend engine for data storage, mainly ORC tables.
>>>>>>
>>>>>> We use Hive on Spark, and with Hive 2 on Spark 1.3.1 we have a combination that works for now. Granted, it would help to use Hive 2 on Spark 1.6.1, but at the moment that is one of my projects.
>>>>>>
>>>>>> We do not use any vendor's products, as that enables us to avoid being tied down to yet another vendor after years of SAP, Oracle and MS dependency. Besides, there is some politics going on, with one vendor promoting Tez and another Spark as a backend. That is fine, but obviously we prefer an independent assessment ourselves.
>>>>>>
>>>>>> My gut feeling is that one needs to look at the use case. Recently we had to import a very large table from Oracle to Hive and decided to use Spark 1.6.1 with Hive 2 on Spark 1.3.1, and that worked fine. We just used a JDBC connection with a temp table and it was good. We could have used Sqoop but decided to settle for Spark, so it all depends on the use case.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>> On 24 May 2016 at 03:11, ayan guha <guha.a...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> Thanks for the very useful stats.
>>>>>>>
>>>>>>> Did you have any benchmark comparing Spark as the backend engine for Hive vs the Spark thrift server (running Spark code for Hive queries)? We are using the latter, but it would be very useful if we could remove the thrift server.
>>>>>>>
>>>>>>> On Tue, May 24, 2016 at 9:51 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Mich,
>>>>>>>>
>>>>>>>> I think these comparisons are useful. One interesting aspect could be hardware scalability in this context, and additionally different types of computations. Furthermore, one could compare Spark and Tez+LLAP as execution engines. I have the gut feeling that each one can be justified by different use cases.
>>>>>>>> Nevertheless, there should always be a disclaimer for such comparisons, because Spark and Hive are not good for a lot of concurrent lookups of single rows, nor for frequently writing small amounts of data (e.g. sensor data). Here HBase could be more interesting. Other use cases can justify graph databases, such as Titan, or text analytics / data matching using Solr on Hadoop.
>>>>>>>> Finally, even if you have a lot of data, you need to think about whether you always have to process all of it.
>>>>>>>> For instance, I have found valid use cases in practice where we decided to evaluate 10 machine learning models in parallel on only a sample of the data, and then evaluate only the "winning" model on the full data.
>>>>>>>>
>>>>>>>> As always, it depends :)
>>>>>>>>
>>>>>>>> Best regards
>>>>>>>>
>>>>>>>> P.S.: At least Hortonworks has in their distribution Spark 1.5 with Hive 1.2, and Spark 1.6 with Hive 1.2. Maybe they have described somewhere how they manage bringing the two together. You may also check Apache Bigtop (a vendor-neutral distribution) for how they managed to bring both together.
>>>>>>>>
>>>>>>>> On 23 May 2016, at 01:42, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have done a number of extensive tests using Spark-shell with a Hive DB and ORC tables.
>>>>>>>>
>>>>>>>> Now, one issue that we typically face is, and I quote:
>>>>>>>>
>>>>>>>> "Spark is fast as it uses memory and DAGs. Great, but when we save data it is not fast enough."
>>>>>>>>
>>>>>>>> OK, but there is a solution now. If you use Spark with Hive and you are on a decent version of Hive (>= 0.14), then you can also deploy Spark as the execution engine for Hive. That will make your application run pretty fast, as you no longer rely on the old Map-Reduce engine for Hive. In a nutshell, you gain speed in both querying and storage.
>>>>>>>>
>>>>>>>> I have made some comparisons on this set-up and I am sure some of you will find them useful.
>>>>>>>>
>>>>>>>> The version of Spark I use for Spark queries (Spark as a query tool) is 1.6.
>>>>>>>> The version of Hive I use is Hive 2.
>>>>>>>> The version of Spark I use as the Hive execution engine is 1.3.1. It works, and frankly Spark 1.3.1 as an execution engine is adequate (until we sort out the Hadoop libraries mismatch).
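Enabling this is mostly configuration on the Hive side. A minimal sketch of the standard Hive on Spark session settings - the values are illustrative, not from this thread, and you also need a Spark assembly built without the Hive jars visible to Hive, which is the "Hadoop libraries mismatch" class of problem mentioned above:

    set hive.execution.engine=spark;
    -- illustrative values below; tune per cluster
    set spark.master=yarn-client;
    set spark.executor.memory=2g;
    set spark.executor.cores=2;

After that, queries submitted through beeline run as "Hive on Spark" jobs, as in the transcript below.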
>>>>>>>> As an example, I am using Hive on the Spark engine to find the min and max of IDs for a table with 1 billion rows:
>>>>>>>>
>>>>>>>> 0: jdbc:hive2://rhes564:10010/default> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy;
>>>>>>>> Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>>>>>>>> Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151
>>>>>>>> INFO : Completed compiling command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); Time taken: 1.911 seconds
>>>>>>>> INFO : Executing command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006): select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>>>>>>>> INFO : Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>>>>>>>> INFO : Total jobs = 1
>>>>>>>> INFO : Launching Job 1 out of 1
>>>>>>>> INFO : Starting task [Stage-1:MAPRED] in serial mode
>>>>>>>> Query Hive on Spark job[0] stages:
>>>>>>>> 0
>>>>>>>> 1
>>>>>>>> Status: Running (Hive on Spark job[0])
>>>>>>>> Job Progress Format
>>>>>>>> CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
>>>>>>>> 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
>>>>>>>> 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22 Stage-1_0: 0/1
>>>>>>>> 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22 Stage-1_0: 0/1
>>>>>>>> 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22 Stage-1_0: 0/1
>>>>>>>> INFO : Query Hive on Spark job[0] stages:
>>>>>>>> INFO : 0
>>>>>>>> INFO : 1
>>>>>>>> INFO : Status: Running (Hive on Spark job[0])
>>>>>>>> INFO : Job Progress Format
>>>>>>>> CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
>>>>>>>> INFO : 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
>>>>>>>> INFO : 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22 Stage-1_0: 0/1
>>>>>>>> INFO : 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22 Stage-1_0: 0/1
>>>>>>>> INFO : 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22 Stage-1_0: 0/1
>>>>>>>> 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished Stage-1_0: 0(+1)/1
>>>>>>>> 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished Stage-1_0: 1/1 Finished
>>>>>>>> Status: Finished successfully in 53.25 seconds
>>>>>>>> OK
>>>>>>>> INFO : 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished Stage-1_0: 0(+1)/1
>>>>>>>> INFO : 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished Stage-1_0: 1/1 Finished
>>>>>>>> INFO : Status: Finished successfully in 53.25 seconds
>>>>>>>> INFO : Completed executing command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); Time taken: 56.337 seconds
>>>>>>>> INFO : OK
>>>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>>>> | c0  | c1         | c2            | c3                    |
>>>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>>>> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>>>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>>>> 1 row selected (58.529 seconds)
>>>>>>>>
>>>>>>>> 58 seconds for a first run with a cold cache is pretty good.
>>>>>>>>
>>>>>>>> And let us compare it with running the same query on the map-reduce engine:
>>>>>>>> 0: jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr;
>>>>>>>> Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>>>>>>>> No rows affected (0.007 seconds)
>>>>>>>> 0: jdbc:hive2://rhes564:10010/default> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy;
>>>>>>>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>>>>>>>> Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
>>>>>>>> Total jobs = 1
>>>>>>>> Launching Job 1 out of 1
>>>>>>>> Number of reduce tasks determined at compile time: 1
>>>>>>>> In order to change the average load for a reducer (in bytes):
>>>>>>>>   set hive.exec.reducers.bytes.per.reducer=<number>
>>>>>>>> In order to limit the maximum number of reducers:
>>>>>>>>   set hive.exec.reducers.max=<number>
>>>>>>>> In order to set a constant number of reducers:
>>>>>>>>   set mapreduce.job.reduces=<number>
>>>>>>>> Starting Job = job_1463956731753_0005, Tracking URL = http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>>>>>> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill job_1463956731753_0005
>>>>>>>> Hadoop job information for Stage-1: number of mappers: 22; number of reducers: 1
>>>>>>>> 2016-05-23 00:26:38,127 Stage-1 map = 0%, reduce = 0%
>>>>>>>> INFO : Compiling command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>>>>>>>> INFO : Semantic Analysis Completed
>>>>>>>> INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:c0, type:int, comment:null), FieldSchema(name:c1, type:int, comment:null), FieldSchema(name:c2, type:double, comment:null), FieldSchema(name:c3, type:double, comment:null)], properties:null)
>>>>>>>> INFO : Completed compiling command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); Time taken: 0.144 seconds
>>>>>>>> INFO : Executing command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>>>>>>>> WARN : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>>>>>>>> INFO : WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>>>>>>>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>>>>>>>> INFO : Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
>>>>>>>> INFO : Total jobs = 1
>>>>>>>> INFO : Launching Job 1 out of 1
>>>>>>>> INFO : Starting task [Stage-1:MAPRED] in serial mode
>>>>>>>> INFO : Number of reduce tasks determined at compile time: 1
>>>>>>>> INFO : In order to change the average load for a reducer (in bytes):
>>>>>>>> INFO :   set hive.exec.reducers.bytes.per.reducer=<number>
>>>>>>>> INFO : In order to limit the maximum number of reducers:
>>>>>>>> INFO :   set hive.exec.reducers.max=<number>
>>>>>>>> INFO : In order to set a constant number of reducers:
>>>>>>>> INFO :   set mapreduce.job.reduces=<number>
>>>>>>>> WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
>>>>>>>> INFO : number of splits:22
>>>>>>>> INFO : Submitting tokens for job: job_1463956731753_0005
>>>>>>>> INFO : The url to track the job: http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>>>>>> INFO : Starting Job = job_1463956731753_0005, Tracking URL = http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>>>>>> INFO : Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill job_1463956731753_0005
>>>>>>>> INFO : Hadoop job information for Stage-1: number of mappers: 22; number of reducers: 1
>>>>>>>> INFO : 2016-05-23 00:26:38,127 Stage-1 map = 0%, reduce = 0%
>>>>>>>> 2016-05-23 00:26:44,367 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 4.56 sec
>>>>>>>> INFO : 2016-05-23 00:26:44,367 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 4.56 sec
>>>>>>>> 2016-05-23 00:26:50,558 Stage-1 map = 9%, reduce = 0%, Cumulative CPU 9.17 sec
>>>>>>>> INFO : 2016-05-23 00:26:50,558 Stage-1 map = 9%, reduce = 0%, Cumulative CPU 9.17 sec
>>>>>>>> 2016-05-23 00:26:56,747 Stage-1 map = 14%, reduce = 0%, Cumulative CPU 14.04 sec
>>>>>>>> INFO : 2016-05-23 00:26:56,747 Stage-1 map = 14%, reduce = 0%, Cumulative CPU 14.04 sec
>>>>>>>> 2016-05-23 00:27:02,944 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 18.64 sec
>>>>>>>> INFO : 2016-05-23 00:27:02,944 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 18.64 sec
>>>>>>>> 2016-05-23 00:27:08,105 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 23.25 sec
>>>>>>>> INFO : 2016-05-23 00:27:08,105 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 23.25 sec
>>>>>>>> 2016-05-23 00:27:14,298 Stage-1 map = 27%, reduce = 0%, Cumulative CPU 27.84 sec
>>>>>>>> INFO : 2016-05-23 00:27:14,298 Stage-1 map = 27%, reduce = 0%, Cumulative CPU 27.84 sec
>>>>>>>> 2016-05-23 00:27:20,484 Stage-1 map = 32%, reduce = 0%, Cumulative CPU 32.56 sec
>>>>>>>> INFO : 2016-05-23 00:27:20,484 Stage-1 map = 32%, reduce = 0%, Cumulative CPU 32.56 sec
>>>>>>>> 2016-05-23 00:27:26,659 Stage-1 map = 36%, reduce = 0%, Cumulative CPU 37.1 sec
>>>>>>>> INFO : 2016-05-23 00:27:26,659 Stage-1 map = 36%, reduce = 0%, Cumulative CPU 37.1 sec
>>>>>>>> 2016-05-23 00:27:32,839 Stage-1 map = 41%, reduce = 0%, Cumulative CPU 41.74 sec
>>>>>>>> INFO : 2016-05-23 00:27:32,839 Stage-1 map = 41%, reduce = 0%, Cumulative CPU 41.74 sec
>>>>>>>> 2016-05-23 00:27:39,003 Stage-1 map = 45%, reduce = 0%, Cumulative CPU 46.32 sec
>>>>>>>> INFO : 2016-05-23 00:27:39,003 Stage-1 map = 45%, reduce = 0%, Cumulative CPU 46.32 sec
>>>>>>>> 2016-05-23 00:27:45,173 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 50.93 sec
>>>>>>>> 2016-05-23 00:27:50,316 Stage-1 map = 55%, reduce = 0%, Cumulative CPU 55.55 sec
>>>>>>>> INFO : 2016-05-23 00:27:45,173 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 50.93 sec
>>>>>>>> INFO : 2016-05-23 00:27:50,316 Stage-1 map = 55%, reduce = 0%, Cumulative CPU 55.55 sec
>>>>>>>> 2016-05-23 00:27:56,482 Stage-1 map = 59%, reduce = 0%, Cumulative CPU 60.25 sec
>>>>>>>> INFO : 2016-05-23 00:27:56,482 Stage-1 map = 59%, reduce = 0%, Cumulative CPU 60.25 sec
>>>>>>>> 2016-05-23 00:28:02,642 Stage-1 map = 64%, reduce = 0%, Cumulative CPU 64.86 sec
>>>>>>>> INFO : 2016-05-23 00:28:02,642 Stage-1 map = 64%, reduce = 0%, Cumulative CPU 64.86 sec
>>>>>>>> 2016-05-23 00:28:08,814 Stage-1 map = 68%, reduce = 0%, Cumulative CPU 69.41 sec
>>>>>>>> INFO : 2016-05-23 00:28:08,814 Stage-1 map = 68%, reduce = 0%, Cumulative CPU 69.41 sec
>>>>>>>> 2016-05-23 00:28:14,977 Stage-1 map = 73%, reduce = 0%, Cumulative CPU 74.06 sec
>>>>>>>> INFO : 2016-05-23 00:28:14,977 Stage-1 map = 73%, reduce = 0%, Cumulative CPU 74.06 sec
>>>>>>>> 2016-05-23 00:28:21,134 Stage-1 map = 77%, reduce = 0%, Cumulative CPU 78.72 sec
>>>>>>>> INFO : 2016-05-23 00:28:21,134 Stage-1 map = 77%, reduce = 0%, Cumulative CPU 78.72 sec
>>>>>>>> 2016-05-23 00:28:27,282 Stage-1 map = 82%, reduce = 0%, Cumulative CPU 83.32 sec
>>>>>>>> INFO : 2016-05-23 00:28:27,282 Stage-1 map = 82%, reduce = 0%, Cumulative CPU 83.32 sec
>>>>>>>> 2016-05-23 00:28:33,437 Stage-1 map = 86%, reduce = 0%, Cumulative CPU 87.9 sec
>>>>>>>> INFO : 2016-05-23 00:28:33,437 Stage-1 map = 86%, reduce = 0%, Cumulative CPU 87.9 sec
>>>>>>>> 2016-05-23 00:28:38,579 Stage-1 map = 91%, reduce = 0%, Cumulative CPU 92.52 sec
>>>>>>>> INFO : 2016-05-23 00:28:38,579 Stage-1 map = 91%, reduce = 0%, Cumulative CPU 92.52 sec
>>>>>>>> 2016-05-23 00:28:44,759 Stage-1 map = 95%, reduce = 0%, Cumulative CPU 97.35 sec
>>>>>>>> INFO : 2016-05-23 00:28:44,759 Stage-1 map = 95%, reduce = 0%, Cumulative CPU 97.35 sec
>>>>>>>> 2016-05-23 00:28:49,915 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 99.6 sec
>>>>>>>> INFO : 2016-05-23 00:28:49,915 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 99.6 sec
>>>>>>>> 2016-05-23 00:28:54,043 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 101.4 sec
>>>>>>>> MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
>>>>>>>> Ended Job = job_1463956731753_0005
>>>>>>>> MapReduce Jobs Launched:
>>>>>>>> Stage-Stage-1: Map: 22 Reduce: 1 Cumulative CPU: 101.4 sec HDFS Read: 5318569 HDFS Write: 46 SUCCESS
>>>>>>>> Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
>>>>>>>> OK
>>>>>>>> INFO : 2016-05-23 00:28:54,043 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 101.4 sec
>>>>>>>> INFO : MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
>>>>>>>> INFO : Ended Job = job_1463956731753_0005
>>>>>>>> INFO : MapReduce Jobs Launched:
>>>>>>>> INFO : Stage-Stage-1: Map: 22 Reduce: 1 Cumulative CPU: 101.4 sec HDFS Read: 5318569 HDFS Write: 46 SUCCESS
>>>>>>>> INFO : Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
>>>>>>>> INFO : Completed executing command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); Time taken: 142.525 seconds
>>>>>>>> INFO : OK
>>>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>>>> | c0  | c1         | c2            | c3                    |
>>>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>>>> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>>>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>>>> 1 row selected (142.744 seconds)
>>>>>>>>
>>>>>>>> OK, Hive on the map-reduce engine took 142 seconds compared to 58 seconds with Hive on Spark. So you can obviously gain quite a bit by using Hive on Spark.
>>>>>>>>
>>>>>>>> Please also note that I did not use any vendor's build for this purpose. I compiled Spark 1.3.1 myself.
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>> --
>>>>>>> Best Regards,
>>>>>>> Ayan Guha