> Status: Finished successfully in 14.12 seconds
> OK
> 100000000
> Time taken: 14.38 seconds, Fetched: 1 row(s)

That might be an improvement over MR, but that still feels far too slow.


Parquet numbers are in general bad in Hive, but that's because the Parquet
reader gets no actual love from the devs. If the community wants to keep
using Parquet heavily, it needs a Hive dev to go over to Parquet-mr and
cut a significant number of memory copies out of the reader.

The Spark 2.0 build, for instance, has a custom Parquet reader for SparkSQL
which does this. SPARK-12854 does for Spark+Parquet what Hive 2.0 does for
ORC (actually, it looks more like Hive's VectorizedRowBatch than
Tungsten's flat layouts).
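
To make the layout concrete, here's a toy sketch of what a
VectorizedRowBatch-style reader hands to the operators. This is not the
actual Hive or Spark API -- every class and field name below is made up --
but it shows the idea: decode a column chunk into reusable primitive arrays
once, then let the aggregate run as a tight loop over the batch instead of
boxing and copying one row object at a time.

// Toy illustration only -- not the real VectorizedRowBatch or the
// SPARK-12854 reader. A vectorized reader decodes a whole column chunk
// straight into primitive arrays and reuses the batch, instead of
// materializing (and copying) one row object at a time.
public class ToyColumnarBatch {

    static final int BATCH_SIZE = 1024;

    // Column-major storage: one primitive array per projected column.
    final long[] orderKeys = new long[BATCH_SIZE];     // e.g. l_orderkey
    final double[] discounts = new double[BATCH_SIZE]; // e.g. l_discount
    int size; // number of valid rows in this batch

    // The aggregate (max) is a tight loop over a primitive array per batch,
    // so there is no per-row object allocation or copying in the hot path.
    static double maxDiscount(Iterable<ToyColumnarBatch> batches) {
        double max = Double.NEGATIVE_INFINITY;
        for (ToyColumnarBatch b : batches) {
            for (int i = 0; i < b.size; i++) {
                if (b.discounts[i] > max) {
                    max = b.discounts[i];
                }
            }
        }
        return max;
    }

    public static void main(String[] args) {
        // Fill one batch with made-up values just to show the shape of it.
        ToyColumnarBatch b = new ToyColumnarBatch();
        b.size = 3;
        b.discounts[0] = 0.04;
        b.discounts[1] = 0.10;
        b.discounts[2] = 0.01;
        System.out.println(maxDiscount(java.util.Collections.singletonList(b))); // prints 0.1
    }
}

Reusing one pre-allocated batch per reader like that is the kind of
copy-cutting the Parquet-mr comment above is about.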

But that reader cannot be used in Hive-on-Spark, because it is not a
public reader impl.


Not to pick an arbitrary dataset, my workhorse example is a TPC-H lineitem
at 10Gb scale on a single 16-core box.

hive(tpch_flat_orc_10)> select max(l_discount) from lineitem;
Query ID = gopal_20160711175917_f96371aa-2721-49c8-99a0-f7c4a1eacfda
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id application_1466700718395_0256)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 ..........      llap     SUCCEEDED     13         13        0        0       0       0
Reducer 2 ......      llap     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 0.71 s
----------------------------------------------------------------------------------------------
Status: DAG finished successfully in 0.71 seconds

Query Execution Summary
----------------------------------------------------------------------------------------------
OPERATION                            DURATION
----------------------------------------------------------------------------------------------
Compile Query                           0.21s
Prepare Plan                            0.13s
Submit Plan                             0.34s
Start DAG                               0.23s
Run DAG                                 0.71s
----------------------------------------------------------------------------------------------

Task Execution Summary
----------------------------------------------------------------------------------------------
  VERTICES   DURATION(ms)  CPU_TIME(ms)  GC_TIME(ms)  INPUT_RECORDS  OUTPUT_RECORDS
----------------------------------------------------------------------------------------------
     Map 1         604.00             0            0     59,957,438              13
 Reducer 2         105.00             0            0             13               0
----------------------------------------------------------------------------------------------

LLAP IO Summary
----------------------------------------------------------------------------------------------
  VERTICES ROWGROUPS  META_HIT  META_MISS  DATA_HIT  DATA_MISS  ALLOCATION      USED  TOTAL_IO
----------------------------------------------------------------------------------------------
     Map 1      6036         0        146        0B    68.86MB    491.00MB  479.89MB     7.94s
----------------------------------------------------------------------------------------------

OK
0.1
Time taken: 1.669 seconds, Fetched: 1 row(s)
hive(tpch_flat_orc_10)>


This is running against a single 16-core box, and I would assume it would
take <1.4s to read twice as much data (13 tasks is barely touching the load
factors).

It would probably be a bit faster if the cache had hits, but in general,
14s to read 100M rows is nearly an order of magnitude off from where Hive
2.2.0 is.
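
Back-of-the-envelope, from the numbers above:

  Parquet run:     100,000,000 rows / ~14s   ->  ~7M rows/s
  ORC + LLAP run:   59,957,438 rows / 0.71s  -> ~84M rows/s (scan DAG time, cold cache)

which is where the "nearly an order of magnitude" comes from.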

Cheers,
Gopal










