> I only need to query 3 columns,
...
> The source table is about 1PB.


Format of this table is extremely critical.

A columnar data format like ORC is recommended to avoid reading any other
columns when reading 3 out of 1000.

> Will it be advised to do a subquery first, and then send it to the
>aggregation of group by, so that we have smaller files sending to
>groupby? Not sure it Hive automatically takes care of this.

Hive does column projection after the first scan, so this should not be
necessary - if you do explain logical <query>, you will see

hive> explain logical select l_shipmode, l_shipdate, sum(l_quantity) from
lineitem group by l_shipmode, l_shipdate;



LOGICAL PLAN:
lineitem 
  TableScan (TS_0)
    alias: lineitem
    Select Operator (SEL_1)
      expressions: l_shipdate (type: string), l_shipmode (type: string),
l_quantity (type: double)
      outputColumnNames: l_shipdate, l_shipmode, l_quantity
      Group By Operator (GBY_2)
        aggregations: sum(l_quantity)
        keys: l_shipdate (type: string), l_shipmode (type: string)
        mode: hash
        outputColumnNames: _col0, _col1, _col2


The SEL_1 showing the projection of the 3 columns out of all cols in
lineitem.

Cheers,
Gopal




Reply via email to