Quick note - in my experience (no benchmarks), Tez without LLAP (we're still not on Hive 2) is faster than MR by some way. I haven't dug into why that might be.
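If anyone wants to put rough numbers on that, a minimal sketch is to flip the engine within a single session and compare the "Time taken" lines for the same statement. The table here is only borrowed from Mich's dummy_parquet example further down the thread; substitute any reasonably large table of your own:

-- engine names and the set syntax are standard Hive; the table is an assumption
set hive.execution.engine=tez;
select max(id) from dummy_parquet;
-- same statement again, same session, same data, only the engine changes
set hive.execution.engine=mr;
select max(id) from dummy_parquet;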
On Tue, Jul 12, 2016 at 9:19 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > sorry I completely miss your points > > I was NOT talking about Exadata. I was comparing Oracle 12c caching with > that of Oracle TimesTen. no one mentioned Exadata here and neither > storeindex etc.. > > > so if Tez is not MR with DAG could you give me an example of how it works. > No opinions but relevant to this point. I do not know much about Tez as I > stated it before > > Case in point if Tez could do the job on its own why Tez is used in > conjunction with LLAP as Martin alluded to as well in this thread. > > > Having said that , I would be interested if you provide a working example > of Hive on Tez, compared to Hive on MR. > > One experiment is worth hundreds of opinions > > > > > > Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > On 12 July 2016 at 13:31, Jörn Franke <jornfra...@gmail.com> wrote: > >> >> I think the comparison with Oracle rdbms and oracle times ten is not so >> good. There are times when the in-memory database of Oracle is slower than >> the rdbms (especially in case of Exadata) due to the issue that in-memory - >> as in Spark - means everything is in memory and everything is always >> processed (no storage indexes , no bloom filters etc) which explains this >> behavior quiet well. >> >> Hence, I do not agree with the statement that tez is basically mr with >> dag (or that llap is basically in-memory which is also not correct). This >> is a wrong oversimplification and I do not think this is useful for the >> community, but better is to understand when something can be used and when >> not. In-memory is also not the solution to everything and if you look for >> example behind SAP Hana or NoSql there is much more around this, which is >> not even on the roadmap of Spark. >> >> Anyway, discovering good use case patterns should be done on standardized >> benchmarks going beyond the select count etc >> >> On 12 Jul 2016, at 11:16, Mich Talebzadeh <mich.talebza...@gmail.com> >> wrote: >> >> That is only a plan not what execution engine is doing. >> >> As I stated before Spark uses DAG + in-memory computing. MR is serial on >> disk. >> >> The key is the execution here or rather the execution engine. >> >> In general >> >> The standard MapReduce as I know reads the data from HDFS, apply >> map-reduce algorithm and writes back to HDFS. If there are many iterations >> of map-reduce then, there will be many intermediate writes to HDFS. This is >> all serial writes to disk. Each map-reduce step is completely independent >> of other steps, and the executing engine does not have any global knowledge >> of what map-reduce steps are going to come after each map-reduce step. For >> many iterative algorithms this is inefficient as the data between each >> map-reduce pair gets written and read from the file system. 
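A quick way to see that serial write/read pattern for yourself is to EXPLAIN a query that needs more than one shuffle and compare the plans engine by engine; the tables t1 and t2 below are hypothetical, not from this thread:

-- hypothetical tables t1(k, v) and t2(k)
set hive.execution.engine=mr;
explain
select t1.k, count(*) as cnt
from t1 join t2 on t1.k = t2.k
group by t1.k
order by cnt desc;
-- under MR the join, group by and order by typically plan as a chain of
-- Stage-N map-reduce stages, each materialising its output to HDFS before
-- the next one starts; under spark (or tez) the same query plans as a
-- single DAG of vertices instead
set hive.execution.engine=spark;
explain
select t1.k, count(*) as cnt
from t1 join t2 on t1.k = t2.k
group by t1.k
order by cnt desc;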
>> >> The equivalent to parallelism in Big Data is deploying what is known as >> Directed Acyclic Graph (DAG >> <https://en.wikipedia.org/wiki/Directed_acyclic_graph>) algorithm. In a >> nutshell deploying DAG results in a fuller picture of global optimisation >> by deploying parallelism, pipelining consecutive map steps into one and not >> writing intermediate data to HDFS. So in short this prevents writing data >> back and forth after every reduce step which for me is a significant >> improvement, compared to the classical MapReduce algorithm. >> >> Now Tez is basically MR with DAG. With Spark you get DAG + in-memory >> computing. Think of it as a comparison between a classic RDBMS like Oracle >> and IMDB like Oracle TimesTen with in-memory processing. >> >> The outcome is that Hive using Spark as execution engine is pretty >> impressive. You have the advantage of Hive CBO + In-memory computing. If >> you use Spark for all this (say Spark SQL) but no Hive, Spark uses its own >> optimizer called Catalyst that does not have CBO yet plus in memory >> computing. >> >> As usual your mileage varies. >> >> HTH >> >> >> Dr Mich Talebzadeh >> >> >> >> LinkedIn * >> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >> >> >> >> http://talebzadehmich.wordpress.com >> >> >> *Disclaimer:* Use it at your own risk. Any and all responsibility for >> any loss, damage or destruction of data or any other property which may >> arise from relying on this email's technical content is explicitly >> disclaimed. The author will in no case be liable for any monetary damages >> arising from such loss, damage or destruction. >> >> >> >> On 12 July 2016 at 09:33, Markovitz, Dudu <dmarkov...@paypal.com> wrote: >> >>> I don’t see how this explains the time differences. >>> >>> >>> >>> Dudu >>> >>> >>> >>> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com] >>> *Sent:* Tuesday, July 12, 2016 10:56 AM >>> *To:* user <user@hive.apache.org> >>> *Cc:* user @spark <u...@spark.apache.org> >>> >>> *Subject:* Re: Using Spark on Hive with Hive also using Spark as its >>> execution engine >>> >>> >>> >>> This the whole idea. 
Spark uses DAG + IM, MR is classic >>> >>> >>> >>> >>> >>> This is for Hive on Spark >>> >>> >>> >>> hive> explain select max(id) from dummy_parquet; >>> OK >>> STAGE DEPENDENCIES: >>> Stage-1 is a root stage >>> Stage-0 depends on stages: Stage-1 >>> >>> STAGE PLANS: >>> Stage: Stage-1 >>> Spark >>> Edges: >>> Reducer 2 <- Map 1 (GROUP, 1) >>> * DagName: >>> hduser_20160712083219_632c2749-7387-478f-972d-9eaadd9932c6:1* >>> Vertices: >>> Map 1 >>> Map Operator Tree: >>> TableScan >>> alias: dummy_parquet >>> Statistics: Num rows: 100000000 Data size: 700000000 >>> Basic stats: COMPLETE Column stats: NONE >>> Select Operator >>> expressions: id (type: int) >>> outputColumnNames: id >>> Statistics: Num rows: 100000000 Data size: 700000000 >>> Basic stats: COMPLETE Column stats: NONE >>> Group By Operator >>> aggregations: max(id) >>> mode: hash >>> outputColumnNames: _col0 >>> Statistics: Num rows: 1 Data size: 4 Basic stats: >>> COMPLETE Column stats: NONE >>> Reduce Output Operator >>> sort order: >>> Statistics: Num rows: 1 Data size: 4 Basic >>> stats: COMPLETE Column stats: NONE >>> value expressions: _col0 (type: int) >>> Reducer 2 >>> Reduce Operator Tree: >>> Group By Operator >>> aggregations: max(VALUE._col0) >>> mode: mergepartial >>> outputColumnNames: _col0 >>> Statistics: Num rows: 1 Data size: 4 Basic stats: >>> COMPLETE Column stats: NONE >>> File Output Operator >>> compressed: false >>> Statistics: Num rows: 1 Data size: 4 Basic stats: >>> COMPLETE Column stats: NONE >>> table: >>> input format: >>> org.apache.hadoop.mapred.TextInputFormat >>> output format: >>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat >>> serde: >>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe >>> >>> Stage: Stage-0 >>> Fetch Operator >>> limit: -1 >>> Processor Tree: >>> ListSink >>> >>> Time taken: 2.801 seconds, Fetched: 50 row(s) >>> >>> >>> >>> And this is with setting the execution engine to MR >>> >>> >>> >>> hive> set hive.execution.engine=mr; >>> Hive-on-MR is deprecated in Hive 2 and may not be available in the >>> future versions. Consider using a different execution engine (i.e. spark, >>> tez) or using Hive 1.X releases. 
>>> >>> >>> >>> hive> explain select max(id) from dummy_parquet; >>> OK >>> STAGE DEPENDENCIES: >>> Stage-1 is a root stage >>> Stage-0 depends on stages: Stage-1 >>> >>> STAGE PLANS: >>> Stage: Stage-1 >>> Map Reduce >>> Map Operator Tree: >>> TableScan >>> alias: dummy_parquet >>> Statistics: Num rows: 100000000 Data size: 700000000 Basic >>> stats: COMPLETE Column stats: NONE >>> Select Operator >>> expressions: id (type: int) >>> outputColumnNames: id >>> Statistics: Num rows: 100000000 Data size: 700000000 Basic >>> stats: COMPLETE Column stats: NONE >>> Group By Operator >>> aggregations: max(id) >>> mode: hash >>> outputColumnNames: _col0 >>> Statistics: Num rows: 1 Data size: 4 Basic stats: >>> COMPLETE Column stats: NONE >>> Reduce Output Operator >>> sort order: >>> Statistics: Num rows: 1 Data size: 4 Basic stats: >>> COMPLETE Column stats: NONE >>> value expressions: _col0 (type: int) >>> Reduce Operator Tree: >>> Group By Operator >>> aggregations: max(VALUE._col0) >>> mode: mergepartial >>> outputColumnNames: _col0 >>> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE >>> Column stats: NONE >>> File Output Operator >>> compressed: false >>> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE >>> Column stats: NONE >>> table: >>> input format: org.apache.hadoop.mapred.TextInputFormat >>> output format: >>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat >>> serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe >>> >>> Stage: Stage-0 >>> Fetch Operator >>> limit: -1 >>> Processor Tree: >>> ListSink >>> >>> Time taken: 0.1 seconds, Fetched: 44 row(s) >>> >>> >>> >>> >>> >>> HTH >>> >>> >>> >>> >>> Dr Mich Talebzadeh >>> >>> >>> >>> LinkedIn >>> *https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >>> >>> >>> >>> http://talebzadehmich.wordpress.com >>> >>> >>> >>> *Disclaimer:* Use it at your own risk. Any and all responsibility for >>> any loss, damage or destruction of data or any other property which may >>> arise from relying on this email's technical content is explicitly >>> disclaimed. The author will in no case be liable for any monetary damages >>> arising from such loss, damage or destruction. >>> >>> >>> >>> >>> >>> On 12 July 2016 at 08:16, Markovitz, Dudu <dmarkov...@paypal.com> wrote: >>> >>> This is a simple task – >>> >>> Read the files, find the local max value and combine the results (find >>> the global max value). >>> >>> How do you explain the differences in the results? Spark reads the files >>> and finds a local max 10X (+) faster than MR? >>> >>> Can you please attach the execution plan? >>> >>> >>> >>> Thanks >>> >>> >>> >>> Dudu >>> >>> >>> >>> >>> >>> >>> >>> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com] >>> *Sent:* Monday, July 11, 2016 11:55 PM >>> *To:* user <user@hive.apache.org>; user @spark <u...@spark.apache.org> >>> *Subject:* Re: Using Spark on Hive with Hive also using Spark as its >>> execution engine >>> >>> >>> >>> In my test I did like for like keeping the systematic the same namely: >>> >>> >>> >>> 1. Table was a parquet table of 100 Million rows >>> 2. The same set up was used for both Hive on Spark and Hive on MR >>> 3. Spark was very impressive compared to MR on this particular test. 
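For the ORC variant described just below (an insert/select copy of the Parquet table, with column statistics gathered afterwards), a rough sketch could look like this; the single int column id is an assumption, and only the two table names come from the thread itself:

-- assumes the table really holds just one int column "id"
create table oraclehadoop.dummy (id int) stored as orc;
insert into table oraclehadoop.dummy select id from dummy_parquet;
-- update basic and column statistics, as mentioned in the thread
analyze table oraclehadoop.dummy compute statistics;
analyze table oraclehadoop.dummy compute statistics for columns;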
>>> >>> >>> >>> Just to see any issues I created an ORC table in in the image of Parquet >>> (insert/select from Parquet to ORC) with stats updated for columns etc >>> >>> >>> >>> These were the results of the same run using ORC table this time: >>> >>> >>> >>> hive> select max(id) from oraclehadoop.dummy; >>> >>> Starting Spark Job = b886b869-5500-4ef7-aab9-ae6fb4dad22b >>> >>> Query Hive on Spark job[1] stages: >>> 2 >>> 3 >>> >>> Status: Running (Hive on Spark job[1]) >>> Job Progress Format >>> CurrentTime StageId_StageAttemptId: >>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount >>> [StageCost] >>> 2016-07-11 21:35:45,020 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1 >>> 2016-07-11 21:35:48,033 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1 >>> 2016-07-11 21:35:51,046 Stage-2_0: 1(+8)/23 Stage-3_0: 0/1 >>> 2016-07-11 21:35:52,050 Stage-2_0: 3(+8)/23 Stage-3_0: 0/1 >>> 2016-07-11 21:35:53,055 Stage-2_0: 8(+4)/23 Stage-3_0: 0/1 >>> 2016-07-11 21:35:54,060 Stage-2_0: 11(+1)/23 Stage-3_0: 0/1 >>> 2016-07-11 21:35:55,065 Stage-2_0: 12(+0)/23 Stage-3_0: 0/1 >>> 2016-07-11 21:35:56,071 Stage-2_0: 12(+8)/23 Stage-3_0: 0/1 >>> 2016-07-11 21:35:57,076 Stage-2_0: 13(+8)/23 Stage-3_0: 0/1 >>> 2016-07-11 21:35:58,081 Stage-2_0: 20(+3)/23 Stage-3_0: 0/1 >>> 2016-07-11 21:35:59,085 Stage-2_0: 23/23 Finished Stage-3_0: >>> 0(+1)/1 >>> 2016-07-11 21:36:00,089 Stage-2_0: 23/23 Finished Stage-3_0: 1/1 >>> Finished >>> Status: Finished successfully in 16.08 seconds >>> OK >>> 100000000 >>> Time taken: 17.775 seconds, Fetched: 1 row(s) >>> >>> >>> >>> Repeat with MR engine >>> >>> >>> >>> hive> set hive.execution.engine=mr; >>> Hive-on-MR is deprecated in Hive 2 and may not be available in the >>> future versions. Consider using a different execution engine (i.e. spark, >>> tez) or using Hive 1.X releases. >>> >>> >>> >>> hive> select max(id) from oraclehadoop.dummy; >>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in >>> the future versions. Consider using a different execution engine (i.e. >>> spark, tez) or using Hive 1.X releases. 
>>> Query ID = hduser_20160711213100_8dc2afae-8644-4097-ba33-c7bd3c304bf8 >>> Total jobs = 1 >>> Launching Job 1 out of 1 >>> Number of reduce tasks determined at compile time: 1 >>> In order to change the average load for a reducer (in bytes): >>> set hive.exec.reducers.bytes.per.reducer=<number> >>> In order to limit the maximum number of reducers: >>> set hive.exec.reducers.max=<number> >>> In order to set a constant number of reducers: >>> set mapreduce.job.reduces=<number> >>> Starting Job = job_1468226887011_0008, Tracking URL = >>> http://rhes564:8088/proxy/application_1468226887011_0008/ >>> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill >>> job_1468226887011_0008 >>> Hadoop job information for Stage-1: number of mappers: 23; number of >>> reducers: 1 >>> 2016-07-11 21:37:00,061 Stage-1 map = 0%, reduce = 0% >>> 2016-07-11 21:37:06,440 Stage-1 map = 4%, reduce = 0%, Cumulative CPU >>> 16.48 sec >>> 2016-07-11 21:37:14,751 Stage-1 map = 9%, reduce = 0%, Cumulative CPU >>> 40.63 sec >>> 2016-07-11 21:37:22,048 Stage-1 map = 13%, reduce = 0%, Cumulative CPU >>> 58.88 sec >>> 2016-07-11 21:37:30,412 Stage-1 map = 17%, reduce = 0%, Cumulative CPU >>> 80.72 sec >>> 2016-07-11 21:37:37,707 Stage-1 map = 22%, reduce = 0%, Cumulative CPU >>> 103.43 sec >>> 2016-07-11 21:37:45,999 Stage-1 map = 26%, reduce = 0%, Cumulative CPU >>> 125.93 sec >>> 2016-07-11 21:37:54,300 Stage-1 map = 30%, reduce = 0%, Cumulative CPU >>> 147.17 sec >>> 2016-07-11 21:38:01,538 Stage-1 map = 35%, reduce = 0%, Cumulative CPU >>> 166.56 sec >>> 2016-07-11 21:38:08,807 Stage-1 map = 39%, reduce = 0%, Cumulative CPU >>> 189.29 sec >>> 2016-07-11 21:38:17,115 Stage-1 map = 43%, reduce = 0%, Cumulative CPU >>> 211.03 sec >>> 2016-07-11 21:38:24,363 Stage-1 map = 48%, reduce = 0%, Cumulative CPU >>> 235.68 sec >>> 2016-07-11 21:38:32,638 Stage-1 map = 52%, reduce = 0%, Cumulative CPU >>> 258.27 sec >>> 2016-07-11 21:38:40,916 Stage-1 map = 57%, reduce = 0%, Cumulative CPU >>> 278.44 sec >>> 2016-07-11 21:38:49,206 Stage-1 map = 61%, reduce = 0%, Cumulative CPU >>> 300.35 sec >>> 2016-07-11 21:38:58,524 Stage-1 map = 65%, reduce = 0%, Cumulative CPU >>> 322.89 sec >>> 2016-07-11 21:39:07,889 Stage-1 map = 70%, reduce = 0%, Cumulative CPU >>> 344.8 sec >>> 2016-07-11 21:39:16,151 Stage-1 map = 74%, reduce = 0%, Cumulative CPU >>> 367.77 sec >>> 2016-07-11 21:39:25,456 Stage-1 map = 78%, reduce = 0%, Cumulative CPU >>> 391.82 sec >>> 2016-07-11 21:39:33,725 Stage-1 map = 83%, reduce = 0%, Cumulative CPU >>> 415.48 sec >>> 2016-07-11 21:39:43,037 Stage-1 map = 87%, reduce = 0%, Cumulative CPU >>> 436.09 sec >>> 2016-07-11 21:39:51,292 Stage-1 map = 91%, reduce = 0%, Cumulative CPU >>> 459.4 sec >>> 2016-07-11 21:39:59,563 Stage-1 map = 96%, reduce = 0%, Cumulative CPU >>> 477.92 sec >>> 2016-07-11 21:40:05,760 Stage-1 map = 100%, reduce = 0%, Cumulative CPU >>> 491.72 sec >>> 2016-07-11 21:40:10,921 Stage-1 map = 100%, reduce = 100%, Cumulative >>> CPU 499.37 sec >>> MapReduce Total cumulative CPU time: 8 minutes 19 seconds 370 msec >>> Ended Job = job_1468226887011_0008 >>> MapReduce Jobs Launched: >>> Stage-Stage-1: Map: 23 Reduce: 1 Cumulative CPU: 499.37 sec HDFS >>> Read: 403754774 HDFS Write: 10 SUCCESS >>> Total MapReduce CPU Time Spent: 8 minutes 19 seconds 370 msec >>> OK >>> 100000000 >>> Time taken: 202.333 seconds, Fetched: 1 row(s) >>> >>> >>> >>> So in summary >>> >>> >>> >>> Table MR/sec Spark/sec >>> >>> Parquet 239.532 14.38 >>> >>> ORC 202.333 17.77 >>> >>> >>> >>> Still I would use Spark 
if I had a choice and I agree that on VLT (very >>> large tables), the limitation in available memory may be the overriding >>> factor in using Spark. >>> >>> >>> >>> HTH >>> >>> >>> >>> >>> Dr Mich Talebzadeh >>> >>> >>> >>> LinkedIn >>> *https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >>> >>> >>> >>> http://talebzadehmich.wordpress.com >>> >>> >>> >>> *Disclaimer:* Use it at your own risk. Any and all responsibility for >>> any loss, damage or destruction of data or any other property which may >>> arise from relying on this email's technical content is explicitly >>> disclaimed. The author will in no case be liable for any monetary damages >>> arising from such loss, damage or destruction. >>> >>> >>> >>> >>> >>> On 11 July 2016 at 19:25, Gopal Vijayaraghavan <gop...@apache.org> >>> wrote: >>> >>> >>> > Status: Finished successfully in 14.12 seconds >>> > OK >>> > 100000000 >>> > Time taken: 14.38 seconds, Fetched: 1 row(s) >>> >>> That might be an improvement over MR, but that still feels far too slow. >>> >>> >>> Parquet numbers are in general bad in Hive, but that's because the >>> Parquet >>> reader gets no actual love from the devs. The community, if it wants to >>> keep using Parquet heavily needs a Hive dev to go over to Parquet-mr and >>> cut a significant number of memory copies out of the reader. >>> >>> The Spark 2.0 build for instance, has a custom Parquet reader for >>> SparkSQL >>> which does this. SPARK-12854 does for Spark+Parquet what Hive 2.0 does >>> for >>> ORC (actually, it looks more like hive's VectorizedRowBatch than >>> Tungsten's flat layouts). >>> >>> But that reader cannot be used in Hive-on-Spark, because it is not a >>> public reader impl. >>> >>> >>> Not to pick an arbitrary dataset, my workhorse example is a TPC-H >>> lineitem >>> at 10Gb scale with a single 16 box. >>> >>> hive(tpch_flat_orc_10)> select max(l_discount) from lineitem; >>> Query ID = gopal_20160711175917_f96371aa-2721-49c8-99a0-f7c4a1eacfda >>> Total jobs = 1 >>> Launching Job 1 out of 1 >>> >>> >>> Status: Running (Executing on YARN cluster with App id >>> application_1466700718395_0256) >>> >>> >>> --------------------------------------------------------------------------- >>> ------------------- >>> VERTICES MODE STATUS TOTAL COMPLETED RUNNING >>> PENDING FAILED KILLED >>> >>> --------------------------------------------------------------------------- >>> ------------------- >>> Map 1 .......... llap SUCCEEDED 13 13 0 >>> 0 0 0 >>> Reducer 2 ...... 
llap SUCCEEDED 1 1 0 >>> 0 0 0 >>> >>> --------------------------------------------------------------------------- >>> ------------------- >>> VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 0.71 >>> s >>> >>> >>> --------------------------------------------------------------------------- >>> ------------------- >>> Status: DAG finished successfully in 0.71 seconds >>> >>> Query Execution Summary >>> >>> --------------------------------------------------------------------------- >>> ------------------- >>> OPERATION DURATION >>> >>> --------------------------------------------------------------------------- >>> ------------------- >>> Compile Query 0.21s >>> Prepare Plan 0.13s >>> Submit Plan 0.34s >>> Start DAG 0.23s >>> Run DAG 0.71s >>> >>> --------------------------------------------------------------------------- >>> ------------------- >>> >>> Task Execution Summary >>> >>> --------------------------------------------------------------------------- >>> ------------------- >>> VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms) INPUT_RECORDS >>> OUTPUT_RECORDS >>> >>> --------------------------------------------------------------------------- >>> ------------------- >>> Map 1 604.00 0 0 59,957,438 >>> 13 >>> Reducer 2 105.00 0 0 13 >>> 0 >>> >>> --------------------------------------------------------------------------- >>> ------------------- >>> >>> LLAP IO Summary >>> >>> --------------------------------------------------------------------------- >>> ------------------- >>> VERTICES ROWGROUPS META_HIT META_MISS DATA_HIT DATA_MISS >>> ALLOCATION >>> USED TOTAL_IO >>> >>> --------------------------------------------------------------------------- >>> ------------------- >>> Map 1 6036 0 146 0B 68.86MB >>> 491.00MB >>> 479.89MB 7.94s >>> >>> --------------------------------------------------------------------------- >>> ------------------- >>> >>> OK >>> 0.1 >>> Time taken: 1.669 seconds, Fetched: 1 row(s) >>> hive(tpch_flat_orc_10)> >>> >>> >>> This is running against a single 16 core box & I would assume it would >>> take <1.4s to read twice as much (13 tasks is barely touching the load >>> factors). >>> >>> It would probably be a bit faster if the cache had hits, but in general >>> 14s to read a 100M rows is nearly a magnitude off where Hive 2.2.0 is. >>> >>> Cheers, >>> Gopal >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >> > -- Want to work at Handy? Check out our culture deck and open roles <http://www.handy.com/careers> Latest news <http://www.handy.com/press> at Handy Handy just raised $50m <http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/> led by Fidelity