More like 2x than 10x as I recall.

On Tue, Jul 12, 2016 at 9:39 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
Thanks Marcin.

What is your guesstimate on the order of "faster", please?

Cheers,

Dr Mich Talebzadeh

On 12 July 2016 at 14:35, Marcin Tustin <mtus...@handybook.com> wrote:

Quick note - my experience (no benchmarks) is that Tez without LLAP (we're
still not on Hive 2) is faster than MR by some way. I haven't dug into why
that might be.

On Tue, Jul 12, 2016 at 9:19 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Sorry, I completely missed your points.

I was NOT talking about Exadata. I was comparing Oracle 12c caching with
that of Oracle TimesTen. No one mentioned Exadata here, nor storage indexes
etc.

So if Tez is not MR with DAG, could you give me an example of how it works?
No opinions, please - just what is relevant to this point. I do not know
much about Tez, as I stated before.

Case in point: if Tez could do the job on its own, why is Tez used in
conjunction with LLAP, as Marcin also alluded to in this thread?

Having said that, I would be interested if you could provide a working
example of Hive on Tez compared to Hive on MR.

One experiment is worth hundreds of opinions.

Dr Mich Talebzadeh

On 12 July 2016 at 13:31, Jörn Franke <jornfra...@gmail.com> wrote:

I think the comparison with the Oracle RDBMS and Oracle TimesTen is not a
good one. There are times when Oracle's in-memory database is slower than
the RDBMS (especially in the case of Exadata), because in-memory - as in
Spark - means everything is in memory and everything is always processed
(no storage indexes, no bloom filters, etc.), which explains this behavior
quite well.

Hence, I do not agree with the statement that Tez is basically MR with a
DAG (or that LLAP is basically in-memory, which is also not correct). This
is a wrong oversimplification and I do not think it is useful for the
community; it is better to understand when something can be used and when
not. In-memory is also not the solution to everything, and if you look, for
example, behind SAP HANA or NoSQL there is much more around this, which is
not even on the roadmap of Spark.
Anyway, discovering good use-case patterns should be done on standardized
benchmarks that go beyond the "select count" kind of query.

On 12 Jul 2016, at 11:16, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

That is only a plan, not what the execution engine is doing.

As I stated before, Spark uses DAG + in-memory computing. MR is serial, on
disk. The key here is the execution, or rather the execution engine.

In general:

The standard MapReduce, as I know it, reads the data from HDFS, applies the
map-reduce algorithm and writes back to HDFS. If there are many iterations
of map-reduce, there will be many intermediate writes to HDFS. These are
all serial writes to disk. Each map-reduce step is completely independent
of the other steps, and the executing engine does not have any global
knowledge of which map-reduce steps are going to come after each one. For
many iterative algorithms this is inefficient, as the data between each
map-reduce pair gets written to and read from the file system.

The way to get this parallelism in Big Data is to deploy what is known as a
Directed Acyclic Graph (DAG
<https://en.wikipedia.org/wiki/Directed_acyclic_graph>) of the work. In a
nutshell, deploying a DAG gives the engine a fuller picture for global
optimisation: it can run steps in parallel, pipeline consecutive map steps
into one, and avoid writing intermediate data to HDFS. In short, this
prevents writing data back and forth after every reduce step, which for me
is a significant improvement compared to the classical MapReduce algorithm.

Now, Tez is basically MR with a DAG. With Spark you get DAG + in-memory
computing. Think of it as a comparison between a classic RDBMS like Oracle
and an in-memory database (IMDB) like Oracle TimesTen with in-memory
processing.

The outcome is that Hive using Spark as its execution engine is pretty
impressive. You have the advantage of Hive's CBO + in-memory computing. If
you use Spark for all of this (say Spark SQL) without Hive, Spark uses its
own optimizer, called Catalyst, which does not have a CBO yet, plus
in-memory computing.

As usual, your mileage varies.

HTH

Dr Mich Talebzadeh

On 12 July 2016 at 09:33, Markovitz, Dudu <dmarkov...@paypal.com> wrote:

I don't see how this explains the time differences.

Dudu

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Tuesday, July 12, 2016 10:56 AM
To: user <user@hive.apache.org>
Cc: user @spark <u...@spark.apache.org>
Subject: Re: Using Spark on Hive with Hive also using Spark as its execution engine

This is the whole idea.
Spark uses DAG + in-memory computing; MR is the classic engine.

This is for Hive on Spark:

hive> explain select max(id) from dummy_parquet;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (GROUP, 1)
      DagName: hduser_20160712083219_632c2749-7387-478f-972d-9eaadd9932c6:1
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: dummy_parquet
                  Statistics: Num rows: 100000000 Data size: 700000000 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: id (type: int)
                    outputColumnNames: id
                    Statistics: Num rows: 100000000 Data size: 700000000 Basic stats: COMPLETE Column stats: NONE
                    Group By Operator
                      aggregations: max(id)
                      mode: hash
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        sort order:
                        Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                        value expressions: _col0 (type: int)
        Reducer 2
            Reduce Operator Tree:
              Group By Operator
                aggregations: max(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 2.801 seconds, Fetched: 50 row(s)

And this is with the execution engine set to MR:

hive> set hive.execution.engine=mr;
Hive-on-MR is deprecated in Hive 2 and may not be available in the future
versions. Consider using a different execution engine (i.e. spark, tez) or
using Hive 1.X releases.
hive> explain select max(id) from dummy_parquet;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: dummy_parquet
            Statistics: Num rows: 100000000 Data size: 700000000 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: int)
              outputColumnNames: id
              Statistics: Num rows: 100000000 Data size: 700000000 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: max(id)
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  sort order:
                  Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col0 (type: int)
      Reduce Operator Tree:
        Group By Operator
          aggregations: max(VALUE._col0)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.1 seconds, Fetched: 44 row(s)

HTH

Dr Mich Talebzadeh

On 12 July 2016 at 08:16, Markovitz, Dudu <dmarkov...@paypal.com> wrote:

This is a simple task: read the files, find the local max value and combine
the results (find the global max value).

How do you explain the differences in the results? Spark reads the files
and finds a local max 10x (or more) faster than MR?

Can you please attach the execution plan?

Thanks

Dudu

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Monday, July 11, 2016 11:55 PM
To: user <user@hive.apache.org>; user @spark <u...@spark.apache.org>
Subject: Re: Using Spark on Hive with Hive also using Spark as its execution engine

In my test I compared like for like, keeping the methodology the same, namely:

   1. The table was a Parquet table of 100 million rows.
   2. The same set-up was used for both Hive on Spark and Hive on MR.
   3. Spark was very impressive compared to MR on this particular test.

Just to see whether there were any issues, I created an ORC table in the
image of the Parquet one (insert/select from Parquet to ORC), with stats
updated for columns etc.

These were the results of the same run using the ORC table this time:

hive> select max(id) from oraclehadoop.dummy;

Starting Spark Job = b886b869-5500-4ef7-aab9-ae6fb4dad22b

Query Hive on Spark job[1] stages:
2
3

Status: Running (Hive on Spark job[1])
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
2016-07-11 21:35:45,020 Stage-2_0: 0(+8)/23     Stage-3_0: 0/1
2016-07-11 21:35:48,033 Stage-2_0: 0(+8)/23     Stage-3_0: 0/1
2016-07-11 21:35:51,046 Stage-2_0: 1(+8)/23     Stage-3_0: 0/1
2016-07-11 21:35:52,050 Stage-2_0: 3(+8)/23     Stage-3_0: 0/1
2016-07-11 21:35:53,055 Stage-2_0: 8(+4)/23     Stage-3_0: 0/1
2016-07-11 21:35:54,060 Stage-2_0: 11(+1)/23    Stage-3_0: 0/1
2016-07-11 21:35:55,065 Stage-2_0: 12(+0)/23    Stage-3_0: 0/1
2016-07-11 21:35:56,071 Stage-2_0: 12(+8)/23    Stage-3_0: 0/1
2016-07-11 21:35:57,076 Stage-2_0: 13(+8)/23    Stage-3_0: 0/1
2016-07-11 21:35:58,081 Stage-2_0: 20(+3)/23    Stage-3_0: 0/1
2016-07-11 21:35:59,085 Stage-2_0: 23/23 Finished       Stage-3_0: 0(+1)/1
2016-07-11 21:36:00,089 Stage-2_0: 23/23 Finished       Stage-3_0: 1/1 Finished
Status: Finished successfully in 16.08 seconds
OK
100000000
Time taken: 17.775 seconds, Fetched: 1 row(s)

Repeat with the MR engine:

hive> set hive.execution.engine=mr;
Hive-on-MR is deprecated in Hive 2 and may not be available in the future
versions. Consider using a different execution engine (i.e. spark, tez) or
using Hive 1.X releases.

hive> select max(id) from oraclehadoop.dummy;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the
future versions. Consider using a different execution engine (i.e. spark,
tez) or using Hive 1.X releases.
Query ID = hduser_20160711213100_8dc2afae-8644-4097-ba33-c7bd3c304bf8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1468226887011_0008, Tracking URL = http://rhes564:8088/proxy/application_1468226887011_0008/
Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill job_1468226887011_0008
Hadoop job information for Stage-1: number of mappers: 23; number of reducers: 1
2016-07-11 21:37:00,061 Stage-1 map = 0%,  reduce = 0%
2016-07-11 21:37:06,440 Stage-1 map = 4%,  reduce = 0%, Cumulative CPU 16.48 sec
2016-07-11 21:37:14,751 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU 40.63 sec
2016-07-11 21:37:22,048 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU 58.88 sec
2016-07-11 21:37:30,412 Stage-1 map = 17%,  reduce = 0%, Cumulative CPU 80.72 sec
2016-07-11 21:37:37,707 Stage-1 map = 22%,  reduce = 0%, Cumulative CPU 103.43 sec
2016-07-11 21:37:45,999 Stage-1 map = 26%,  reduce = 0%, Cumulative CPU 125.93 sec
2016-07-11 21:37:54,300 Stage-1 map = 30%,  reduce = 0%, Cumulative CPU 147.17 sec
2016-07-11 21:38:01,538 Stage-1 map = 35%,  reduce = 0%, Cumulative CPU 166.56 sec
2016-07-11 21:38:08,807 Stage-1 map = 39%,  reduce = 0%, Cumulative CPU 189.29 sec
2016-07-11 21:38:17,115 Stage-1 map = 43%,  reduce = 0%, Cumulative CPU 211.03 sec
2016-07-11 21:38:24,363 Stage-1 map = 48%,  reduce = 0%, Cumulative CPU 235.68 sec
2016-07-11 21:38:32,638 Stage-1 map = 52%,  reduce = 0%, Cumulative CPU 258.27 sec
2016-07-11 21:38:40,916 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU 278.44 sec
2016-07-11 21:38:49,206 Stage-1 map = 61%,  reduce = 0%, Cumulative CPU 300.35 sec
2016-07-11 21:38:58,524 Stage-1 map = 65%,  reduce = 0%, Cumulative CPU 322.89 sec
2016-07-11 21:39:07,889 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU 344.8 sec
2016-07-11 21:39:16,151 Stage-1 map = 74%,  reduce = 0%, Cumulative CPU 367.77 sec
2016-07-11 21:39:25,456 Stage-1 map = 78%,  reduce = 0%, Cumulative CPU 391.82 sec
2016-07-11 21:39:33,725 Stage-1 map = 83%,  reduce = 0%, Cumulative CPU 415.48 sec
2016-07-11 21:39:43,037 Stage-1 map = 87%,  reduce = 0%, Cumulative CPU 436.09 sec
2016-07-11 21:39:51,292 Stage-1 map = 91%,  reduce = 0%, Cumulative CPU 459.4 sec
2016-07-11 21:39:59,563 Stage-1 map = 96%,  reduce = 0%, Cumulative CPU 477.92 sec
2016-07-11 21:40:05,760 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 491.72 sec
2016-07-11 21:40:10,921 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 499.37 sec
MapReduce Total cumulative CPU time: 8 minutes 19 seconds 370 msec
Ended Job = job_1468226887011_0008
MapReduce Jobs Launched:
Stage-Stage-1: Map: 23  Reduce: 1   Cumulative CPU: 499.37 sec   HDFS Read: 403754774 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 8 minutes 19 seconds 370 msec
OK
100000000
Time taken: 202.333 seconds, Fetched: 1 row(s)
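Both runs above can be repeated with a short script; a minimal sketch is
below. The table and query are the ones used in this thread, while the
ANALYZE step and the assumption that Hive on Spark is already configured
for the cluster (Spark master, executor memory, etc.) are additions that
may need adjusting.

-- Sketch of the like-for-like test above: same table, same query, only the
-- engine changes. Assumes Hive on Spark is already configured for this metastore.

-- Optional: refresh column statistics so the CBO sees the same numbers in both runs.
ANALYZE TABLE oraclehadoop.dummy COMPUTE STATISTICS FOR COLUMNS;

-- Run 1: Hive on Spark
SET hive.execution.engine=spark;
SELECT MAX(id) FROM oraclehadoop.dummy;

-- Run 2: Hive on MapReduce (deprecated in Hive 2, but still usable for comparison)
SET hive.execution.engine=mr;
SELECT MAX(id) FROM oraclehadoop.dummy;

-- Compare the "Time taken" lines that the CLI prints for the two runs.

Note that the first query under each engine also pays warm-up costs (Spark
executor start-up, YARN container allocation), so running each query more
than once gives a fairer picture.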
So in summary:

Table        MR/sec     Spark/sec
Parquet      239.532    14.38
ORC          202.333    17.77

Still, I would use Spark if I had a choice, and I agree that on VLTs (very
large tables) the limitation in available memory may be the overriding
factor in using Spark.

HTH

Dr Mich Talebzadeh

On 11 July 2016 at 19:25, Gopal Vijayaraghavan <gop...@apache.org> wrote:

> Status: Finished successfully in 14.12 seconds
> OK
> 100000000
> Time taken: 14.38 seconds, Fetched: 1 row(s)

That might be an improvement over MR, but that still feels far too slow.

Parquet numbers are in general bad in Hive, but that's because the Parquet
reader gets no actual love from the devs. The community, if it wants to
keep using Parquet heavily, needs a Hive dev to go over to parquet-mr and
cut a significant number of memory copies out of the reader.

The Spark 2.0 build, for instance, has a custom Parquet reader for SparkSQL
which does this. SPARK-12854 does for Spark+Parquet what Hive 2.0 does for
ORC (actually, it looks more like Hive's VectorizedRowBatch than Tungsten's
flat layouts).

But that reader cannot be used in Hive-on-Spark, because it is not a public
reader impl.

Not to pick an arbitrary dataset, my workhorse example is a TPC-H lineitem
at 10Gb scale on a single 16-core box.

hive(tpch_flat_orc_10)> select max(l_discount) from lineitem;
Query ID = gopal_20160711175917_f96371aa-2721-49c8-99a0-f7c4a1eacfda
Total jobs = 1
Launching Job 1 out of 1

Status: Running (Executing on YARN cluster with App id application_1466700718395_0256)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 ..........      llap     SUCCEEDED     13         13        0        0       0       0
Reducer 2 ......      llap     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 0.71 s
----------------------------------------------------------------------------------------------
Status: DAG finished successfully in 0.71 seconds

Query Execution Summary
----------------------------------------------------------------------------------------------
OPERATION                            DURATION
----------------------------------------------------------------------------------------------
Compile Query                           0.21s
Prepare Plan                            0.13s
Submit Plan                             0.34s
Start DAG                               0.23s
Run DAG                                 0.71s
----------------------------------------------------------------------------------------------

Task Execution Summary
----------------------------------------------------------------------------------------------
  VERTICES  DURATION(ms)  CPU_TIME(ms)  GC_TIME(ms)  INPUT_RECORDS  OUTPUT_RECORDS
----------------------------------------------------------------------------------------------
     Map 1        604.00             0            0     59,957,438              13
 Reducer 2        105.00             0            0             13               0
----------------------------------------------------------------------------------------------

LLAP IO Summary
----------------------------------------------------------------------------------------------
  VERTICES  ROWGROUPS  META_HIT  META_MISS  DATA_HIT  DATA_MISS  ALLOCATION      USED  TOTAL_IO
----------------------------------------------------------------------------------------------
     Map 1       6036         0        146        0B    68.86MB    491.00MB  479.89MB     7.94s
----------------------------------------------------------------------------------------------

OK
0.1
Time taken: 1.669 seconds, Fetched: 1 row(s)
hive(tpch_flat_orc_10)>

This is running against a single 16-core box, and I would assume it would
take <1.4s to read twice as much (13 tasks is barely touching the load
factors).

It would probably be a bit faster if the cache had hits, but in general 14s
to read 100M rows is nearly an order of magnitude off where Hive 2.2.0 is.

Cheers,
Gopal
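For context on the run above: it uses the Tez engine with LLAP daemons
serving the scans. A minimal sketch of the session-level settings such a
run typically implies is below; the property values are assumptions (they
depend on how LLAP was deployed) and are not taken from this thread.

-- Sketch only: session settings for a Tez + LLAP run along the lines of the one above.
-- The values are assumptions; they depend on the cluster's LLAP deployment.
SET hive.execution.engine=tez;                -- run on Tez rather than MR or Spark
SET hive.llap.execution.mode=all;             -- push work to the LLAP daemons where possible
SET hive.vectorized.execution.enabled=true;   -- vectorized ORC reads, as in Hive 2.x

SELECT MAX(l_discount) FROM lineitem;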