Hi, I am working with an hourly table and a daily table in Hive 1.0.1. Both tables have the same schema except that the hourly table is partitioned by dt and hour, but the daily table is partitioned only by dt. At the end of each day, the records from the hourly table are aggregated into the daily table, but we do a group by to remove some duplicate records.
There are 1,363,106,822 total rows in the hourly table on 2016-06-20, and 1,300,287,508 rows in the daily table since some are dropped in the group by. However, the total size of all files in the hourly table for that date is 135.9 GB, but the total size of files in the daily table is 158.8 GB. I'm wondering why the daily table would be significantly larger, since it has fewer records but all the same information for the records it does have. Both tables are stored as RCFile and use com.hadoop.compression.lzo.LzoCodec for compression. However, the hourly table contains about 66000 small files, whereas the daily one contains 494. If I remove the group by and max functions from the query and just insert all the records for 2016-06-20 from the hourly table into the daily table (so the job becomes map-only), the daily table turns out to be smaller than the hourly. It seems like the introduction of the group by/reduce phase is causing the output to be larger somehow. I have also tried storing the daily table as ORC rather than RCFile, since ORC is more space-efficient. The total size is 147.2 GB, so smaller than the RCFile version but still larger than the hourly table. I've attached the query inserting into the daily table, as well as the explain output of the query. Thanks for any help, Matt
OK STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 Stage-2 depends on stages: Stage-0 STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan alias: events_hourly Statistics: Num rows: 20115823 Data size: 132443345481 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: session (type: string), title (type: string), ttype (type: string), en (type: string), et (type: string), es (type: int), pid (type: string), ptype (type: string), rtype (type: string), platform (type: string), lvl (type: string), dob (type: string), ip (type: string), country (type: string), locale (type: string), sdk (type: string), tax_version (type: string), build_version (type: string), cpt (type: timestamp), srt (type: timestamp), ept (type: timestamp), mac (type: string), e_params (type: string), d_map (type: map<string,string>), a_map (type: map<string,string>), e_list (type: array<string>), s_params (type: string), o_params (type: string), city (type: string), state (type: string), hashcode (type: string), stype (type: string), plvl (type: string), plvl_name (type: string), gsi (type: string) outputColumnNames: session, title, ttype, en, et, es, pid, ptype, rtype, platform, lvl, dob, ip, country, locale, sdk, tax_version, build_version, cpt, srt, ept, mac, e_params, d_map, a_map, e_list, s_params, o_params, city, state, hashcode, stype, plvl, plvl_name, gsi Statistics: Num rows: 20115823 Data size: 132443345481 Basic stats: COMPLETE Column stats: NONE Group By Operator aggregations: max(lvl), max(dob), max(ip), max(country), max(locale), max(sdk), max(tax_version), max(build_version), max(cpt), max(srt), max(ept), max(mac), max(e_params), max(to_json(d_map)), max(to_json(a_map)), max(e_list), max(s_params), max(o_params), max(city), max(state), max(hashcode), max(stype), max(plvl), max(plvl_name), max(gsi) keys: session (type: string), title (type: string), ttype (type: string), en (type: string), et (type: string), es (type: int), pid (type: string), ptype (type: string), rtype (type: string), platform (type: string) mode: hash outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27, _col28, _col29, _col30, _col31, _col32, _col33, _col34 Statistics: Num rows: 20115823 Data size: 132443345481 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: int), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string) sort order: ++++++++++ Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: int), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string) Statistics: Num rows: 20115823 Data size: 132443345481 Basic stats: COMPLETE Column stats: NONE value expressions: _col10 (type: string), _col11 (type: string), _col12 (type: string), _col13 (type: string), _col14 (type: string), _col15 (type: string), _col16 (type: string), _col17 (type: string), _col18 (type: timestamp), _col19 (type: timestamp), _col20 (type: timestamp), _col21 (type: string), _col22 (type: string), _col23 (type: string), _col24 (type: string), _col25 (type: array<string>), _col26 (type: string), _col27 (type: string), _col28 (type: string), _col29 (type: string), _col30 (type: string), _col31 (type: string), _col32 (type: string), _col33 (type: string), _col34 (type: string) Reduce Operator Tree: Group By Operator aggregations: max(VALUE._col0), max(VALUE._col1), max(VALUE._col2), max(VALUE._col3), max(VALUE._col4), max(VALUE._col5), max(VALUE._col6), max(VALUE._col7), max(VALUE._col8), max(VALUE._col9), max(VALUE._col10), max(VALUE._col11), max(VALUE._col12), max(VALUE._col13), max(VALUE._col14), max(VALUE._col15), max(VALUE._col16), max(VALUE._col17), max(VALUE._col18), max(VALUE._col19), max(VALUE._col20), max(VALUE._col21), max(VALUE._col22), max(VALUE._col23), max(VALUE._col24) keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: string), KEY._col5 (type: int), KEY._col6 (type: string), KEY._col7 (type: string), KEY._col8 (type: string), KEY._col9 (type: string) mode: mergepartial outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27, _col28, _col29, _col30, _col31, _col32, _col33, _col34 Statistics: Num rows: 10057911 Data size: 66221669448 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: int), _col10 (type: string), _col6 (type: string), _col7 (type: string), _col11 (type: string), _col12 (type: string), _col13 (type: string), _col14 (type: string), _col15 (type: string), _col16 (type: string), _col17 (type: string), _col8 (type: string), _col18 (type: timestamp), _col19 (type: timestamp), _col20 (type: timestamp), _col21 (type: string), _col22 (type: string), json_map( _col23 ) (type: map<string,string>), json_map( _col24 ) (type: map<string,string>), _col25 (type: array<string>), _col26 (type: string), _col27 (type: string), _col28 (type: string), _col29 (type: string), _col30 (type: string), _col31 (type: string), _col32 (type: string), _col33 (type: string), _col34 (type: string), _col9 (type: string) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27, _col28, _col29, _col30, _col31, _col32, _col33, _col34 Statistics: Num rows: 10057911 Data size: 66221669448 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: true Statistics: Num rows: 10057911 Data size: 66221669448 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.hive.ql.io.RCFileInputFormat output format: org.apache.hadoop.hive.ql.io.RCFileOutputFormat serde: org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe name: events_daily Stage: Stage-0 Move Operator tables: partition: dt 2016-06-20 replace: true table: input format: org.apache.hadoop.hive.ql.io.RCFileInputFormat output format: org.apache.hadoop.hive.ql.io.RCFileOutputFormat serde: org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe name: events_daily Stage: Stage-2 Stats-Aggr Operator
daily.q
Description: Binary data