Great info, thanks. Makes sense on the partition, since those files can be shipped by themselves. These are "reference" tables, but one happens to be pretty large.
Thanks,
Abe

On Thu, Apr 30, 2015 at 12:54 PM, Gopal Vijayaraghavan <gop...@apache.org> wrote:
> Hi,
>
> > Using CDH 5.3 - Hive 0.13. Does a view help here? Does how I format
> > the table help in reducing size?
>
> No, a view does not help - they are not materialized and you need hive-1.0
> to have temporary table support.
>
> The only way out is if you only have 1 filter column in the system.
>
> I assume your data is not in ORC (because of CDH), which prevents any
> speedups due to the ORC row-index filters.
>
> In ORC, you can reorganize data to get min-max pruning IO savings by
> reinserting data via "insert overwrite table <x> select * from <x> sort by
> <filter-col>".
>
> In my point lookup queries, when looking for ~3 rows in 6 billion unique
> values, ORC ends up reading only 54,000 rows into memory thanks to ORC
> indexes, even in MapReduce.
>
> But if your case is much simpler & you have a low-cardinality column (i.e.
> <100 unique items), you can use that column as a partition column for your
> table, so that it is pre-filtered during planning time.
>
> Outside of those scenarios, a scalable distributed solution exists in
> Tez's broadcast JOIN - you can test Tez using the open-source Apache
> hive-0.13.1 (the one Yahoo uses), because the CDH version had Tez removed
> from the package.
>
> Cheers,
> Gopal
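For anyone finding this in the archives: the sort-by reinsert Gopal describes looks roughly like the following. The table and column names here are hypothetical; substitute your own. Sorting on the filter column clusters similar values into the same ORC stripes, which tightens the min/max ranges the ORC indexes use for pruning.

    -- A minimal sketch, assuming a hypothetical ORC table `ref_table`
    -- that is usually filtered on `filter_col`. Rewriting the table
    -- sorted on that column lets point lookups skip most stripes via
    -- the per-stripe min/max statistics.
    INSERT OVERWRITE TABLE ref_table
    SELECT * FROM ref_table
    SORT BY filter_col;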
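The partitioning suggestion would look something like this (again, names are made up for illustration). Each distinct value of the partition column becomes its own directory, so a query filtering on it is pruned at planning time and never reads the other partitions.

    -- A sketch, assuming a hypothetical low-cardinality column
    -- `category` (<100 distinct values) on a reference table.
    CREATE TABLE ref_table_part (id BIGINT, val STRING)
    PARTITIONED BY (category STRING);

    -- Dynamic partitioning fans rows out into one directory per value;
    -- the partition column must come last in the SELECT list.
    SET hive.exec.dynamic.partition.mode=nonstrict;
    INSERT OVERWRITE TABLE ref_table_part PARTITION (category)
    SELECT id, val, category FROM ref_table;

    -- This predicate is resolved during planning, so only one
    -- partition directory is scanned.
    SELECT * FROM ref_table_part WHERE category = 'abc';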
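And for the Tez route, the broadcast (map-side) join is driven by a handful of settings; the size value below is illustrative, not a tuned recommendation.

    -- A sketch, assuming a Hive build that ships with Tez (e.g. Apache
    -- hive-0.13.1, per Gopal's note) rather than the CDH package.
    SET hive.execution.engine=tez;
    SET hive.auto.convert.join=true;
    SET hive.auto.convert.join.noconditionaltask=true;
    -- Tables under this size (in bytes) are broadcast to every task
    -- instead of shuffled; raise it to cover your larger reference table.
    SET hive.auto.convert.join.noconditionaltask.size=100000000;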