Mich - it sounds like maybe you should try these benchmarks with Alluxio abstracting the storage layer, and see how much of a difference it makes. Alluxio should (if I understand it right) provide a lot of the optimisation you're looking for with in-memory work.
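Concretely, Alluxio presents itself as a Hadoop-compatible filesystem, so Hive (and therefore Hive on Spark) can point a table straight at it. A minimal hedged sketch - the table, columns and the Alluxio master address are assumptions, not from this thread, and it presumes the Alluxio client jar is on Hive's classpath:

    -- hypothetical table on an Alluxio-backed location
    CREATE EXTERNAL TABLE sales_alluxio (
      id     INT,
      amount DOUBLE
    )
    STORED AS ORC
    LOCATION 'alluxio://master:19998/sales';

    -- reads now go through Alluxio's memory-backed storage tier
    SELECT COUNT(*) FROM sales_alluxio;

The same alluxio:// URI should work from Spark as well, which is what makes the storage layer swappable under the benchmarks.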
I've never used it, but I would love to hear the experiences of people who have.

On Mon, May 30, 2016 at 5:32 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> I think we are going to move to a model where the computation stack will be separate from the storage stack, and moreover something like Hive that provides the means for persistent storage (well, HDFS is the one that stores all the data) will have an in-memory capability much like what Oracle TimesTen IMDB does with its big brother Oracle. Now TimesTen is effectively designed to provide in-memory analytics capability for Oracle 12c. The two work like an index or materialized view: you write queries against tables, and the optimizer figures out whether to use row-oriented, indexed storage to answer (Oracle classic) or column-oriented, non-indexed storage (TimesTen). Just one optimizer.
>
> I gather Hive will be like that eventually. It will decide, based on the frequency of access, where to look for data. Yes, we may have 10 TB of data on disk, but how much of it is frequently accessed (hot data)? The 80-20 rule? In reality maybe just 2 TB, or the most recent partitions, etc. The rest is cold data.
>
> cheers
>
> Dr Mich Talebzadeh
>
> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 30 May 2016 at 21:59, Michael Segel <msegel_had...@hotmail.com> wrote:
>
>> And you have MapR supporting Apache Drill.
>>
>> So these are all alternatives to Spark, and it's not necessarily an either/or scenario. You can have both.
>>
>> On May 30, 2016, at 12:49 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>> Yep, Hortonworks supports Tez for one reason or another, and I am hopefully going to test it as the query engine for Hive. Though I think Spark will be faster because of its in-memory support.
>>
>> Also, if you are independent then you are better off dealing with Spark and Hive without the need to support another stack like Tez.
>>
>> Cloudera supports Impala instead of Hive, but it is not something I have used.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> On 30 May 2016 at 20:19, Michael Segel <msegel_had...@hotmail.com> wrote:
>>
>>> Mich,
>>>
>>> Most people use vendor releases because they need to have the support. Hortonworks is the vendor with the most skin in the game when it comes to Tez.
>>>
>>> If memory serves, Tez isn't going to be M/R but a local execution engine? Then LLAP is the in-memory piece to speed up Tez?
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>> On May 29, 2016, at 1:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>> Thanks. I think the problem is that the TEZ user group is exceptionally quiet. I have just sent an email to the Hive user group to see if anyone has managed to build a vendor-independent version.
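For anyone who does get a vendor-independent Tez built: once the installation is wired up, pointing Hive at it is just a session setting. A hedged sketch, assuming a working Tez install on the cluster (the table is the one used in the benchmarks below):

    set hive.execution.engine=tez;
    -- any subsequent query in the session then runs as a Tez DAG rather than MapReduce
    select count(1) from oraclehadoop.dummy;

This mirrors the set hive.execution.engine=mr switch shown in the transcript further down.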
>>> Dr Mich Talebzadeh
>>>
>>> On 29 May 2016 at 21:23, Jörn Franke <jornfra...@gmail.com> wrote:
>>>
>>>> Well, I think it is different from MR. It has some optimizations which you do not find in MR. Especially the LLAP option in Hive 2 makes it interesting.
>>>>
>>>> I think Hive 1.2 works with Tez 0.7, and Hive 2.0 with Tez 0.8. At least for 1.2 it is integrated in the Hortonworks distribution.
>>>>
>>>> On 29 May 2016, at 21:43, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>> Hi Jörn,
>>>>
>>>> I started building apache-tez-0.8.2 but got a few errors. A couple of guys from the TEZ user group kindly gave a hand, but I could not get very far (or maybe I did not make enough effort) making it work.
>>>>
>>>> That TEZ user group is very quiet as well.
>>>>
>>>> My understanding is that TEZ is MR with DAG, but of course Spark has both, plus in-memory capability.
>>>>
>>>> It would be interesting to see what version of TEZ works as the execution engine with Hive.
>>>>
>>>> Vendors are divided on this (use Hive with TEZ, or use Impala instead of Hive, etc.), as I am sure you already know.
>>>>
>>>> Cheers,
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>> On 29 May 2016 at 20:19, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>
>>>>> Very interesting. Do you also plan a test with TEZ?
>>>>>
>>>>> On 29 May 2016, at 13:40, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I did another study of Hive using the Spark engine compared to Hive with MR.
>>>>>
>>>>> Basically, I took the original table imported using Sqoop and created and populated a new ORC table partitioned by year and month into 48 partitions, as follows (a hedged sketch of this DDL is given further below):
>>>>>
>>>>> <sales_partition.PNG>
>>>>>
>>>>> Connections use JDBC via beeline. Using MR it takes an average of 17 minutes for each PARTITION, as seen below. And that is just an individual partition; there are 48 partitions.
>>>>>
>>>>> In contrast, doing the same operation with the Spark engine took 10 minutes all inclusive. I just gave up on MR. You can see the StartTime and FinishTime below:
>>>>>
>>>>> <image.png>
>>>>>
>>>>> This by no means indicates that Spark is much better than MR, but it shows that some very good results can be achieved using the Spark engine.
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> On 24 May 2016 at 08:03, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We use Hive as the database and use Spark as an all-purpose query tool.
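The DDL in <sales_partition.PNG> above only survives as an image. As a hedged reconstruction of the shape described (ORC, partitioned by year and month; the column names, types and the staging table name are assumptions - only the oraclehadoop database is from the thread):

    CREATE TABLE oraclehadoop.sales_orc (
      id     INT,
      amount DOUBLE
    )
    PARTITIONED BY (year INT, month INT)
    STORED AS ORC;

    -- populate all 48 partitions from the Sqoop-imported staging table
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    INSERT OVERWRITE TABLE oraclehadoop.sales_orc PARTITION (year, month)
    SELECT id, amount, year, month
    FROM oraclehadoop.sales_staging;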
>>>>>> Whether Hive is the right database for the purpose, or whether one is better off with something like Phoenix on HBase, well, the answer is it depends, and your mileage varies.
>>>>>>
>>>>>> So, fit for purpose.
>>>>>>
>>>>>> Ideally what one wants is to use the fastest method to get the results. How fast is confined to our SLA agreements in production, and that keeps us from unnecessary further work, as we technologists like to play around.
>>>>>>
>>>>>> So in short, we use Spark most of the time and use Hive as the backend engine for data storage, mainly ORC tables.
>>>>>>
>>>>>> We use Hive on Spark, and with Hive 2 on Spark 1.3.1 we have a combination that works for now. Granted, it would help to use Hive 2 on Spark 1.6.1, but at the moment that is one of my projects.
>>>>>>
>>>>>> We do not use any vendor's products, as that enables us to avoid being tied down to yet another vendor after years of SAP, Oracle and MS dependency. Besides, there is some politics going on, with one vendor promoting Tez and another Spark as a backend. That is fine, but obviously we prefer an independent assessment ourselves.
>>>>>>
>>>>>> My gut feeling is that one needs to look at the use case. Recently we had to import a very large table from Oracle to Hive and decided to use Spark 1.6.1 with Hive 2 on Spark 1.3.1, and that worked fine. We just used a JDBC connection with a temp table and it was good. We could have used Sqoop but decided to settle for Spark, so it all depends on the use case.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>> On 24 May 2016 at 03:11, ayan guha <guha.a...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> Thanks for the very useful stats.
>>>>>>>
>>>>>>> Did you have any benchmark comparing Spark as the backend engine for Hive vs the Spark thrift server (running Spark code for Hive queries)? We are using the latter, but it would be very useful if we could remove the thrift server.
>>>>>>>
>>>>>>> On Tue, May 24, 2016 at 9:51 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Mich,
>>>>>>>>
>>>>>>>> I think these comparisons are useful. One interesting aspect could be hardware scalability in this context, and additionally different types of computations. Furthermore, one could compare Spark and Tez+LLAP as execution engines. I have the gut feeling that each one can be justified by different use cases.
>>>>>>>> Nevertheless, there should always be a disclaimer for such comparisons, because Spark and Hive are not good for a lot of concurrent lookups of single rows, nor for frequently writing small amounts of data (e.g. sensor data). Here HBase could be more interesting. Other use cases can justify graph databases, such as Titan, or text analytics / data matching using Solr on Hadoop.
>>>>>>>> Finally, even if you have a lot of data, you need to think about whether you always have to process all of it.
>>>>>>>> For instance, I have found valid use cases in practice where we decided to evaluate 10 machine learning models in parallel on only a sample of the data, and then evaluate only the "winning" model on the full data.
>>>>>>>>
>>>>>>>> As always, it depends :)
>>>>>>>>
>>>>>>>> Best regards
>>>>>>>>
>>>>>>>> P.S.: At least Hortonworks has in their distribution Spark 1.5 with Hive 1.2, and Spark 1.6 with Hive 1.2. Maybe they have described somewhere how they manage bringing the two together. You may also check Apache Bigtop (a vendor-neutral distribution) for how they managed to bring both together.
>>>>>>>>
>>>>>>>> On 23 May 2016, at 01:42, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have done a number of extensive tests using Spark-shell with a Hive DB and ORC tables.
>>>>>>>>
>>>>>>>> Now, one issue that we typically face is, and I quote:
>>>>>>>>
>>>>>>>> "Spark is fast as it uses memory and DAGs. Great, but when we save data it is not fast enough."
>>>>>>>>
>>>>>>>> OK, but there is a solution now. If you use Spark with Hive and you are on a decent version of Hive (>= 0.14), then you can also deploy Spark as the execution engine for Hive. That will make your application run pretty fast, as you no longer rely on the old Map-Reduce engine for Hive. In a nutshell, you gain speed in both querying and storage.
>>>>>>>>
>>>>>>>> I have made some comparisons on this set-up and I am sure some of you will find them useful.
>>>>>>>>
>>>>>>>> The version of Spark I use for Spark queries (Spark as a query tool) is 1.6.
>>>>>>>> The version of Hive I use is Hive 2.
>>>>>>>> The version of Spark I use as the Hive execution engine is 1.3.1. It works, and frankly Spark 1.3.1 as an execution engine is adequate (until we sort out the Hadoop libraries mismatch).
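Enabling this is mostly configuration on the Hive side. A minimal sketch of the standard Hive on Spark session settings - the values are illustrative, not from this thread, and you also need a Spark assembly built without the Hive jars visible to Hive, which is the "Hadoop libraries mismatch" class of problem mentioned above:

    set hive.execution.engine=spark;
    -- illustrative values below; tune per cluster
    set spark.master=yarn-client;
    set spark.executor.memory=2g;
    set spark.executor.cores=2;

After that, queries submitted through beeline run as "Hive on Spark" jobs, as in the transcript below.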
>>>>>>>> As an example, I am using Hive on the Spark engine to find the min and max of IDs for a table with 1 billion rows:
>>>>>>>>
>>>>>>>> 0: jdbc:hive2://rhes564:10010/default> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy;
>>>>>>>> Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>>>>>>>> Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151
>>>>>>>> INFO : Completed compiling command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); Time taken: 1.911 seconds
>>>>>>>> INFO : Executing command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006): select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>>>>>>>> INFO : Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>>>>>>>> INFO : Total jobs = 1
>>>>>>>> INFO : Launching Job 1 out of 1
>>>>>>>> INFO : Starting task [Stage-1:MAPRED] in serial mode
>>>>>>>> Query Hive on Spark job[0] stages:
>>>>>>>> 0
>>>>>>>> 1
>>>>>>>> Status: Running (Hive on Spark job[0])
>>>>>>>> Job Progress Format
>>>>>>>> CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
>>>>>>>> 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
>>>>>>>> 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22 Stage-1_0: 0/1
>>>>>>>> 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22 Stage-1_0: 0/1
>>>>>>>> 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22 Stage-1_0: 0/1
>>>>>>>> INFO : Query Hive on Spark job[0] stages:
>>>>>>>> INFO : 0
>>>>>>>> INFO : 1
>>>>>>>> INFO : Status: Running (Hive on Spark job[0])
>>>>>>>> INFO : Job Progress Format
>>>>>>>> CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
>>>>>>>> INFO : 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
>>>>>>>> INFO : 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22 Stage-1_0: 0/1
>>>>>>>> INFO : 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22 Stage-1_0: 0/1
>>>>>>>> INFO : 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22 Stage-1_0: 0/1
>>>>>>>> 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished Stage-1_0: 0(+1)/1
>>>>>>>> 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished Stage-1_0: 1/1 Finished
>>>>>>>> Status: Finished successfully in 53.25 seconds
>>>>>>>> OK
>>>>>>>> INFO : 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished Stage-1_0: 0(+1)/1
>>>>>>>> INFO : 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished Stage-1_0: 1/1 Finished
>>>>>>>> INFO : Status: Finished successfully in 53.25 seconds
>>>>>>>> INFO : Completed executing command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); Time taken: 56.337 seconds
>>>>>>>> INFO : OK
>>>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>>>> | c0  | c1         | c2            | c3                    |
>>>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>>>> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>>>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>>>> 1 row selected (58.529 seconds)
>>>>>>>>
>>>>>>>> 58 seconds for a first run with a cold cache is pretty good.
>>>>>>>>
>>>>>>>> And let us compare it with running the same query on the map-reduce engine:
>>>>>>>> 0: jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr;
>>>>>>>> Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>>>>>>>> No rows affected (0.007 seconds)
>>>>>>>> 0: jdbc:hive2://rhes564:10010/default> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy;
>>>>>>>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>>>>>>>> Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
>>>>>>>> Total jobs = 1
>>>>>>>> Launching Job 1 out of 1
>>>>>>>> Number of reduce tasks determined at compile time: 1
>>>>>>>> In order to change the average load for a reducer (in bytes):
>>>>>>>>   set hive.exec.reducers.bytes.per.reducer=<number>
>>>>>>>> In order to limit the maximum number of reducers:
>>>>>>>>   set hive.exec.reducers.max=<number>
>>>>>>>> In order to set a constant number of reducers:
>>>>>>>>   set mapreduce.job.reduces=<number>
>>>>>>>> Starting Job = job_1463956731753_0005, Tracking URL = http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>>>>>> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill job_1463956731753_0005
>>>>>>>> Hadoop job information for Stage-1: number of mappers: 22; number of reducers: 1
>>>>>>>> 2016-05-23 00:26:38,127 Stage-1 map = 0%, reduce = 0%
>>>>>>>> INFO : Compiling command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>>>>>>>> INFO : Semantic Analysis Completed
>>>>>>>> INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:c0, type:int, comment:null), FieldSchema(name:c1, type:int, comment:null), FieldSchema(name:c2, type:double, comment:null), FieldSchema(name:c3, type:double, comment:null)], properties:null)
>>>>>>>> INFO : Completed compiling command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); Time taken: 0.144 seconds
>>>>>>>> INFO : Executing command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>>>>>>>> WARN : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>>>>>>>> INFO : WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>>>>>>>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
>>>>>>>> INFO : Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
>>>>>>>> INFO : Total jobs = 1
>>>>>>>> INFO : Launching Job 1 out of 1
>>>>>>>> INFO : Starting task [Stage-1:MAPRED] in serial mode
>>>>>>>> INFO : Number of reduce tasks determined at compile time: 1
>>>>>>>> INFO : In order to change the average load for a reducer (in bytes):
>>>>>>>> INFO :   set hive.exec.reducers.bytes.per.reducer=<number>
>>>>>>>> INFO : In order to limit the maximum number of reducers:
>>>>>>>> INFO :   set hive.exec.reducers.max=<number>
>>>>>>>> INFO : In order to set a constant number of reducers:
>>>>>>>> INFO :   set mapreduce.job.reduces=<number>
>>>>>>>> WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
>>>>>>>> INFO : number of splits:22
>>>>>>>> INFO : Submitting tokens for job: job_1463956731753_0005
>>>>>>>> INFO : The url to track the job: http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>>>>>> INFO : Starting Job = job_1463956731753_0005, Tracking URL = http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>>>>>> INFO : Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill job_1463956731753_0005
>>>>>>>> INFO : Hadoop job information for Stage-1: number of mappers: 22; number of reducers: 1
>>>>>>>> INFO : 2016-05-23 00:26:38,127 Stage-1 map = 0%, reduce = 0%
>>>>>>>> 2016-05-23 00:26:44,367 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 4.56 sec
>>>>>>>> INFO : 2016-05-23 00:26:44,367 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 4.56 sec
>>>>>>>> 2016-05-23 00:26:50,558 Stage-1 map = 9%, reduce = 0%, Cumulative CPU 9.17 sec
>>>>>>>> INFO : 2016-05-23 00:26:50,558 Stage-1 map = 9%, reduce = 0%, Cumulative CPU 9.17 sec
>>>>>>>> 2016-05-23 00:26:56,747 Stage-1 map = 14%, reduce = 0%, Cumulative CPU 14.04 sec
>>>>>>>> INFO : 2016-05-23 00:26:56,747 Stage-1 map = 14%, reduce = 0%, Cumulative CPU 14.04 sec
>>>>>>>> 2016-05-23 00:27:02,944 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 18.64 sec
>>>>>>>> INFO : 2016-05-23 00:27:02,944 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 18.64 sec
>>>>>>>> 2016-05-23 00:27:08,105 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 23.25 sec
>>>>>>>> INFO : 2016-05-23 00:27:08,105 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 23.25 sec
>>>>>>>> 2016-05-23 00:27:14,298 Stage-1 map = 27%, reduce = 0%, Cumulative CPU 27.84 sec
>>>>>>>> INFO : 2016-05-23 00:27:14,298 Stage-1 map = 27%, reduce = 0%, Cumulative CPU 27.84 sec
>>>>>>>> 2016-05-23 00:27:20,484 Stage-1 map = 32%, reduce = 0%, Cumulative CPU 32.56 sec
>>>>>>>> INFO : 2016-05-23 00:27:20,484 Stage-1 map = 32%, reduce = 0%, Cumulative CPU 32.56 sec
>>>>>>>> 2016-05-23 00:27:26,659 Stage-1 map = 36%, reduce = 0%, Cumulative CPU 37.1 sec
>>>>>>>> INFO : 2016-05-23 00:27:26,659 Stage-1 map = 36%, reduce = 0%, Cumulative CPU 37.1 sec
>>>>>>>> 2016-05-23 00:27:32,839 Stage-1 map = 41%, reduce = 0%, Cumulative CPU 41.74 sec
>>>>>>>> INFO : 2016-05-23 00:27:32,839 Stage-1 map = 41%, reduce = 0%, Cumulative CPU 41.74 sec
>>>>>>>> 2016-05-23 00:27:39,003 Stage-1 map = 45%, reduce = 0%, Cumulative CPU 46.32 sec
>>>>>>>> INFO : 2016-05-23 00:27:39,003 Stage-1 map = 45%, reduce = 0%, Cumulative CPU 46.32 sec
>>>>>>>> 2016-05-23 00:27:45,173 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 50.93 sec
>>>>>>>> 2016-05-23 00:27:50,316 Stage-1 map = 55%, reduce = 0%, Cumulative CPU 55.55 sec
>>>>>>>> INFO : 2016-05-23 00:27:45,173 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 50.93 sec
>>>>>>>> INFO : 2016-05-23 00:27:50,316 Stage-1 map = 55%, reduce = 0%, Cumulative CPU 55.55 sec
>>>>>>>> 2016-05-23 00:27:56,482 Stage-1 map = 59%, reduce = 0%, Cumulative CPU 60.25 sec
>>>>>>>> INFO : 2016-05-23 00:27:56,482 Stage-1 map = 59%, reduce = 0%, Cumulative CPU 60.25 sec
>>>>>>>> 2016-05-23 00:28:02,642 Stage-1 map = 64%, reduce = 0%, Cumulative CPU 64.86 sec
>>>>>>>> INFO : 2016-05-23 00:28:02,642 Stage-1 map = 64%, reduce = 0%, Cumulative CPU 64.86 sec
>>>>>>>> 2016-05-23 00:28:08,814 Stage-1 map = 68%, reduce = 0%, Cumulative CPU 69.41 sec
>>>>>>>> INFO : 2016-05-23 00:28:08,814 Stage-1 map = 68%, reduce = 0%, Cumulative CPU 69.41 sec
>>>>>>>> 2016-05-23 00:28:14,977 Stage-1 map = 73%, reduce = 0%, Cumulative CPU 74.06 sec
>>>>>>>> INFO : 2016-05-23 00:28:14,977 Stage-1 map = 73%, reduce = 0%, Cumulative CPU 74.06 sec
>>>>>>>> 2016-05-23 00:28:21,134 Stage-1 map = 77%, reduce = 0%, Cumulative CPU 78.72 sec
>>>>>>>> INFO : 2016-05-23 00:28:21,134 Stage-1 map = 77%, reduce = 0%, Cumulative CPU 78.72 sec
>>>>>>>> 2016-05-23 00:28:27,282 Stage-1 map = 82%, reduce = 0%, Cumulative CPU 83.32 sec
>>>>>>>> INFO : 2016-05-23 00:28:27,282 Stage-1 map = 82%, reduce = 0%, Cumulative CPU 83.32 sec
>>>>>>>> 2016-05-23 00:28:33,437 Stage-1 map = 86%, reduce = 0%, Cumulative CPU 87.9 sec
>>>>>>>> INFO : 2016-05-23 00:28:33,437 Stage-1 map = 86%, reduce = 0%, Cumulative CPU 87.9 sec
>>>>>>>> 2016-05-23 00:28:38,579 Stage-1 map = 91%, reduce = 0%, Cumulative CPU 92.52 sec
>>>>>>>> INFO : 2016-05-23 00:28:38,579 Stage-1 map = 91%, reduce = 0%, Cumulative CPU 92.52 sec
>>>>>>>> 2016-05-23 00:28:44,759 Stage-1 map = 95%, reduce = 0%, Cumulative CPU 97.35 sec
>>>>>>>> INFO : 2016-05-23 00:28:44,759 Stage-1 map = 95%, reduce = 0%, Cumulative CPU 97.35 sec
>>>>>>>> 2016-05-23 00:28:49,915 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 99.6 sec
>>>>>>>> INFO : 2016-05-23 00:28:49,915 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 99.6 sec
>>>>>>>> 2016-05-23 00:28:54,043 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 101.4 sec
>>>>>>>> MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
>>>>>>>> Ended Job = job_1463956731753_0005
>>>>>>>> MapReduce Jobs Launched:
>>>>>>>> Stage-Stage-1: Map: 22 Reduce: 1 Cumulative CPU: 101.4 sec HDFS Read: 5318569 HDFS Write: 46 SUCCESS
>>>>>>>> Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
>>>>>>>> OK
>>>>>>>> INFO : 2016-05-23 00:28:54,043 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 101.4 sec
>>>>>>>> INFO : MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
>>>>>>>> INFO : Ended Job = job_1463956731753_0005
>>>>>>>> INFO : MapReduce Jobs Launched:
>>>>>>>> INFO : Stage-Stage-1: Map: 22 Reduce: 1 Cumulative CPU: 101.4 sec HDFS Read: 5318569 HDFS Write: 46 SUCCESS
>>>>>>>> INFO : Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
>>>>>>>> INFO : Completed executing command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); Time taken: 142.525 seconds
>>>>>>>> INFO : OK
>>>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>>>> | c0  | c1         | c2            | c3                    |
>>>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>>>> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>>>>>>>> +-----+------------+---------------+-----------------------+--+
>>>>>>>> 1 row selected (142.744 seconds)
>>>>>>>>
>>>>>>>> OK, Hive on the map-reduce engine took 142 seconds compared to 58 seconds with Hive on Spark. So you can obviously gain quite a bit by using Hive on Spark.
>>>>>>>>
>>>>>>>> Please also note that I did not use any vendor's build for this purpose. I compiled Spark 1.3.1 myself.
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>> --
>>>>>>> Best Regards,
>>>>>>> Ayan Guha