> Status: Finished successfully in 14.12 seconds
> OK
> 100000000
> Time taken: 14.38 seconds, Fetched: 1 row(s)
That might be an improvement over MR, but that still feels far too slow.

Parquet numbers are in general bad in Hive, but that's because the Parquet reader gets no actual love from the devs. The community, if it wants to keep using Parquet heavily, needs a Hive dev to go over to Parquet-mr and cut a significant number of memory copies out of the reader.

The Spark 2.0 build, for instance, has a custom Parquet reader for SparkSQL which does this. SPARK-12854 does for Spark+Parquet what Hive 2.0 does for ORC (actually, it looks more like Hive's VectorizedRowBatch than Tungsten's flat layouts). There's a rough sketch of what that batch layout looks like at the bottom of this mail.

But that reader cannot be used in Hive-on-Spark, because it is not a public reader impl.

Not to pick an arbitrary dataset: my workhorse example is the TPC-H lineitem table at 10Gb scale on a single 16 core box.

hive(tpch_flat_orc_10)> select max(l_discount) from lineitem;
Query ID = gopal_20160711175917_f96371aa-2721-49c8-99a0-f7c4a1eacfda
Total jobs = 1
Launching Job 1 out of 1

Status: Running (Executing on YARN cluster with App id application_1466700718395_0256)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 ..........      llap     SUCCEEDED     13         13        0        0       0       0
Reducer 2 ......      llap     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 0.71 s
----------------------------------------------------------------------------------------------
Status: DAG finished successfully in 0.71 seconds

Query Execution Summary
----------------------------------------------------------------------------------------------
  OPERATION                          DURATION
----------------------------------------------------------------------------------------------
  Compile Query                         0.21s
  Prepare Plan                          0.13s
  Submit Plan                           0.34s
  Start DAG                             0.23s
  Run DAG                               0.71s
----------------------------------------------------------------------------------------------

Task Execution Summary
----------------------------------------------------------------------------------------------
  VERTICES   DURATION(ms)  CPU_TIME(ms)  GC_TIME(ms)  INPUT_RECORDS  OUTPUT_RECORDS
----------------------------------------------------------------------------------------------
     Map 1         604.00             0            0     59,957,438              13
 Reducer 2         105.00             0            0             13               0
----------------------------------------------------------------------------------------------

LLAP IO Summary
----------------------------------------------------------------------------------------------
  VERTICES  ROWGROUPS  META_HIT  META_MISS  DATA_HIT  DATA_MISS  ALLOCATION      USED  TOTAL_IO
----------------------------------------------------------------------------------------------
     Map 1       6036         0        146        0B    68.86MB    491.00MB  479.89MB     7.94s
----------------------------------------------------------------------------------------------

OK
0.1
Time taken: 1.669 seconds, Fetched: 1 row(s)
hive(tpch_flat_orc_10)>

This is running against a single 16 core box & I would assume it would take <1.4s to read twice as much (13 tasks is barely touching the load factors). It would probably be a bit faster if the cache had hits, but in general 14s to read 100M rows is nearly an order of magnitude off from where Hive 2.2.0 is.

Cheers,
Gopal
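
P.S. For anyone curious what that batch layout buys you, here is a minimal, self-contained Java sketch of the two reader shapes. The class and method names are invented for illustration, not Hive's or Spark's actual reader API; the point is that the vectorized path hands the operator one primitive array per column and computes max(l_discount) in a tight loop, instead of materializing (and copying) a row object per record.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

/**
 * Illustrative only: contrasts a row-at-a-time reader with a
 * VectorizedRowBatch-style columnar batch for a max(l_discount) scan.
 */
public class ReaderShapes {

    /** Row-at-a-time: every record becomes a fresh object (copies + GC churn). */
    static final class Row {
        final double lDiscount;
        Row(double lDiscount) { this.lDiscount = lDiscount; }
    }

    static double maxRowAtATime(Iterator<Row> reader) {
        double max = Double.NEGATIVE_INFINITY;
        while (reader.hasNext()) {
            max = Math.max(max, reader.next().lDiscount); // one object per row
        }
        return max;
    }

    /** Batched: ~1024 values of one column arrive as a single primitive array. */
    static final class DoubleColumnBatch {
        final double[] vector;   // contiguous column values
        final boolean[] isNull;  // per-slot null flags
        int size;                // valid slots in this batch
        DoubleColumnBatch(int capacity) {
            vector = new double[capacity];
            isNull = new boolean[capacity];
        }
    }

    static double maxVectorized(Iterable<DoubleColumnBatch> batches) {
        double max = Double.NEGATIVE_INFINITY;
        for (DoubleColumnBatch b : batches) {
            for (int i = 0; i < b.size; i++) {   // tight, JIT-friendly loop,
                if (!b.isNull[i]) {              // no per-row allocation
                    max = Math.max(max, b.vector[i]);
                }
            }
        }
        return max;
    }

    public static void main(String[] args) {
        // Fake data standing in for l_discount values.
        double[] discounts = {0.04, 0.09, 0.10, 0.02, 0.07};

        List<Row> rows = new ArrayList<>();
        for (double d : discounts) {
            rows.add(new Row(d));
        }

        DoubleColumnBatch batch = new DoubleColumnBatch(1024);
        for (int i = 0; i < discounts.length; i++) {
            batch.vector[i] = discounts[i];
        }
        batch.size = discounts.length;

        System.out.println(maxRowAtATime(rows.iterator()));                  // 0.1
        System.out.println(maxVectorized(Collections.singletonList(batch))); // 0.1
    }
}

The real VectorizedRowBatch also carries a selection vector and one ColumnVector per projected column (each with its own null/isRepeating flags), but the shape is the same: a handful of primitive arrays per batch instead of one materialized row per record.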