In case someone else encounters this issue, it looks like this was due to encoding differences between the hourly and daily table. The hourly table often had the same values stored consecutively for certain columns, but the group by on multiple columns caused them to be shuffled around to different reducers and therefore written to different files when loaded to the daily table.
This meant that for some columns the daily table was more likely to have a larger percentage of distinct values stored consecutively, whereas identical values in the hourly table tended to be clumped together more. The latter case is more conducive to storing compactly. On Thu, Jun 23, 2016 at 2:37 PM, Matt Olson <maolso...@gmail.com> wrote: > Hi, > > I am working with an hourly table and a daily table in Hive 1.0.1. Both > tables have the same schema except that the hourly table is partitioned by > dt and hour, but the daily table is partitioned only by dt. At the end of > each day, the records from the hourly table are aggregated into the daily > table, but we do a group by to remove some duplicate records. > > There are 1,363,106,822 total rows in the hourly table on 2016-06-20, and > 1,300,287,508 rows in the daily table since some are dropped in the group > by. However, the total size of all files in the hourly table for that date > is 135.9 GB, but the total size of files in the daily table is 158.8 GB. > I'm wondering why the daily table would be significantly larger, since it > has fewer records but all the same information for the records it does > have. Both tables are stored as RCFile and > use com.hadoop.compression.lzo.LzoCodec for compression. However, the > hourly table contains about 66000 small files, whereas the daily one > contains 494. > > If I remove the group by and max functions from the query and just insert > all the records for 2016-06-20 from the hourly table into the daily table > (so the job becomes map-only), the daily table turns out to be smaller than > the hourly. It seems like the introduction of the group by/reduce phase is > causing the output to be larger somehow. I have also tried storing the > daily table as ORC rather than RCFile, since ORC is more space-efficient. > The total size is 147.2 GB, so smaller than the RCFile version but still > larger than the hourly table. > > I've attached the query inserting into the daily table, as well as the > explain output of the query. > > Thanks for any help, > Matt >