What's the best way to enforce a single output file per partition?

    INSERT OVERWRITE TABLE <table> PARTITION (x,y,z) SELECT ... FROM ... WHERE ...
I tried adding CLUSTER BY x,y,z at the end, thinking that the sort would force a single reducer per partition, but that didn't work: I still got multiple files per partition.

Do I have to use a single reduce task? With a few TB of data that's probably not a good idea.

My current idea is to create a temp table with the same partitioning structure, insert into that table first, and then select * from it into the output table. With combineinputformat=true that should work, right?

Or should I make Hive merge the output files instead (using hive.merge.mapfiles)? Will that work with a partitioned table?

Thanks!
igor
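
P.S. To make the merge option concrete, here is a rough sketch of what I have in mind. The two dynamic-partition settings are my assumption about what the insert needs, and the size thresholds are illustrative values, not tuned ones. Whether the merge step actually runs for a partitioned insert like this, and whether it leaves exactly one file per partition, is what I'm asking.

```sql
-- Assumed: all three partition columns are dynamic.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Merge small output files at the end of map-only jobs.
SET hive.merge.mapfiles=true;
-- Merge small output files at the end of map-reduce jobs.
SET hive.merge.mapredfiles=true;

-- Illustrative thresholds (bytes): trigger a merge when the average
-- output file is smaller than avgsize; aim for size.per.task per merged file.
SET hive.merge.smallfiles.avgsize=128000000;
SET hive.merge.size.per.task=256000000;

-- The statement from above, unchanged.
INSERT OVERWRITE TABLE <table> PARTITION (x,y,z) SELECT ... FROM ... WHERE ...
```

If a partition is larger than hive.merge.size.per.task, I would expect the merge to still leave more than one file there, so the threshold would have to exceed the largest partition.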