Aggregated table larger than expected

Matt Olson Thu, 23 Jun 2016 14:38:20 -0700

Hi,

I am working with an hourly table and a daily table in Hive 1.0.1. Both
tables have the same schema except that the hourly table is partitioned by
dt and hour, but the daily table is partitioned only by dt. At the end of
each day, the records from the hourly table are aggregated into the daily
table, but we do a group by to remove some duplicate records.


There are 1,363,106,822 total rows in the hourly table on 2016-06-20, and
1,300,287,508 rows in the daily table since some are dropped in the group
by. However, the total size of all files in the hourly table for that date
is 135.9 GB, but the total size of files in the daily table is 158.8 GB.
I'm wondering why the daily table would be significantly larger, since it
has fewer records but all the same information for the records it does
have. Both tables are stored as RCFile and
use com.hadoop.compression.lzo.LzoCodec for compression. However, the
hourly table contains about 66000 small files, whereas the daily one
contains 494.

If I remove the group by and max functions from the query and just insert
all the records for 2016-06-20 from the hourly table into the daily table
(so the job becomes map-only), the daily table turns out to be smaller than
the hourly. It seems like the introduction of the group by/reduce phase is
causing the output to be larger somehow. I have also tried storing the
daily table as ORC rather than RCFile, since ORC is more space-efficient.
The total size is 147.2 GB, so smaller than the RCFile version but still
larger than the hourly table.

I've attached the query inserting into the daily table, as well as the
explain output of the query.

Thanks for any help,
Matt

OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
  Stage-2 depends on stages: Stage-0

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: events_hourly
            Statistics: Num rows: 20115823 Data size: 132443345481 Basic stats: 
COMPLETE Column stats: NONE
            Select Operator
              expressions: session (type: string), title (type: string), ttype 
(type: string), en (type: string), et (type: string), es (type: int), pid 
(type: string), ptype (type: string), rtype (type: string), platform (type: 
string), lvl (type: string), dob (type: string), ip (type: string), country 
(type: string), locale (type: string), sdk (type: string), tax_version (type: 
string), build_version (type: string), cpt (type: timestamp), srt (type: 
timestamp), ept (type: timestamp), mac (type: string), e_params (type: string), 
d_map (type: map<string,string>), a_map (type: map<string,string>), e_list 
(type: array<string>), s_params (type: string), o_params (type: string), city 
(type: string), state (type: string), hashcode (type: string), stype (type: 
string), plvl (type: string), plvl_name (type: string), gsi (type: string)
              outputColumnNames: session, title, ttype, en, et, es, pid, ptype, 
rtype, platform, lvl, dob, ip, country, locale, sdk, tax_version, 
build_version, cpt, srt, ept, mac, e_params, d_map, a_map, e_list, s_params, 
o_params, city, state, hashcode, stype, plvl, plvl_name, gsi
              Statistics: Num rows: 20115823 Data size: 132443345481 Basic 
stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: max(lvl), max(dob), max(ip), max(country), 
max(locale), max(sdk), max(tax_version), max(build_version), max(cpt), 
max(srt), max(ept), max(mac), max(e_params), max(to_json(d_map)), 
max(to_json(a_map)), max(e_list), max(s_params), max(o_params), max(city), 
max(state), max(hashcode), max(stype), max(plvl), max(plvl_name), max(gsi)
                keys: session (type: string), title (type: string), ttype 
(type: string), en (type: string), et (type: string), es (type: int), pid 
(type: string), ptype (type: string), rtype (type: string), platform (type: 
string)
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, 
_col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, 
_col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, 
_col26, _col27, _col28, _col29, _col30, _col31, _col32, _col33, _col34
                Statistics: Num rows: 20115823 Data size: 132443345481 Basic 
stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string), _col1 (type: string), 
_col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: 
int), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 
(type: string)
                  sort order: ++++++++++
                  Map-reduce partition columns: _col0 (type: string), _col1 
(type: string), _col2 (type: string), _col3 (type: string), _col4 (type: 
string), _col5 (type: int), _col6 (type: string), _col7 (type: string), _col8 
(type: string), _col9 (type: string)
                  Statistics: Num rows: 20115823 Data size: 132443345481 Basic 
stats: COMPLETE Column stats: NONE
                  value expressions: _col10 (type: string), _col11 (type: 
string), _col12 (type: string), _col13 (type: string), _col14 (type: string), 
_col15 (type: string), _col16 (type: string), _col17 (type: string), _col18 
(type: timestamp), _col19 (type: timestamp), _col20 (type: timestamp), _col21 
(type: string), _col22 (type: string), _col23 (type: string), _col24 (type: 
string), _col25 (type: array<string>), _col26 (type: string), _col27 (type: 
string), _col28 (type: string), _col29 (type: string), _col30 (type: string), 
_col31 (type: string), _col32 (type: string), _col33 (type: string), _col34 
(type: string)
      Reduce Operator Tree:
        Group By Operator
          aggregations: max(VALUE._col0), max(VALUE._col1), max(VALUE._col2), 
max(VALUE._col3), max(VALUE._col4), max(VALUE._col5), max(VALUE._col6), 
max(VALUE._col7), max(VALUE._col8), max(VALUE._col9), max(VALUE._col10), 
max(VALUE._col11), max(VALUE._col12), max(VALUE._col13), max(VALUE._col14), 
max(VALUE._col15), max(VALUE._col16), max(VALUE._col17), max(VALUE._col18), 
max(VALUE._col19), max(VALUE._col20), max(VALUE._col21), max(VALUE._col22), 
max(VALUE._col23), max(VALUE._col24)
          keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 
(type: string), KEY._col3 (type: string), KEY._col4 (type: string), KEY._col5 
(type: int), KEY._col6 (type: string), KEY._col7 (type: string), KEY._col8 
(type: string), KEY._col9 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, 
_col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, 
_col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, 
_col27, _col28, _col29, _col30, _col31, _col32, _col33, _col34
          Statistics: Num rows: 10057911 Data size: 66221669448 Basic stats: 
COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: string), _col1 (type: string), _col2 
(type: string), _col3 (type: string), _col4 (type: string), _col5 (type: int), 
_col10 (type: string), _col6 (type: string), _col7 (type: string), _col11 
(type: string), _col12 (type: string), _col13 (type: string), _col14 (type: 
string), _col15 (type: string), _col16 (type: string), _col17 (type: string), 
_col8 (type: string), _col18 (type: timestamp), _col19 (type: timestamp), 
_col20 (type: timestamp), _col21 (type: string), _col22 (type: string), 
json_map( _col23 ) (type: map<string,string>), json_map( _col24 ) (type: 
map<string,string>), _col25 (type: array<string>), _col26 (type: string), 
_col27 (type: string), _col28 (type: string), _col29 (type: string), _col30 
(type: string), _col31 (type: string), _col32 (type: string), _col33 (type: 
string), _col34 (type: string), _col9 (type: string)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, 
_col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, 
_col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, 
_col27, _col28, _col29, _col30, _col31, _col32, _col33, _col34
            Statistics: Num rows: 10057911 Data size: 66221669448 Basic stats: 
COMPLETE Column stats: NONE
            File Output Operator
              compressed: true
              Statistics: Num rows: 10057911 Data size: 66221669448 Basic 
stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.hive.ql.io.RCFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.RCFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe
                  name: events_daily

  Stage: Stage-0
    Move Operator
      tables:
          partition:
            dt 2016-06-20
          replace: true
          table:
              input format: org.apache.hadoop.hive.ql.io.RCFileInputFormat
              output format: org.apache.hadoop.hive.ql.io.RCFileOutputFormat
              serde: org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe
              name: events_daily

  Stage: Stage-2
    Stats-Aggr Operator

daily.q
Description: Binary data

Aggregated table larger than expected

Reply via email to