> I only need to query 3 columns, ... > The source table is about 1PB.
Format of this table is extremely critical. A columnar data format like ORC is recommended to avoid reading any other columns when reading 3 out of 1000. > Will it be advised to do a subquery first, and then send it to the >aggregation of group by, so that we have smaller files sending to >groupby? Not sure it Hive automatically takes care of this. Hive does column projection after the first scan, so this should not be necessary - if you do explain logical <query>, you will see hive> explain logical select l_shipmode, l_shipdate, sum(l_quantity) from lineitem group by l_shipmode, l_shipdate; LOGICAL PLAN: lineitem TableScan (TS_0) alias: lineitem Select Operator (SEL_1) expressions: l_shipdate (type: string), l_shipmode (type: string), l_quantity (type: double) outputColumnNames: l_shipdate, l_shipmode, l_quantity Group By Operator (GBY_2) aggregations: sum(l_quantity) keys: l_shipdate (type: string), l_shipmode (type: string) mode: hash outputColumnNames: _col0, _col1, _col2 The SEL_1 showing the projection of the 3 columns out of all cols in lineitem. Cheers, Gopal