Hi,

We have a requirement to store a large data set (more than 5 TB) mapped to a Hive table. This table would be populated, and appended to periodically, by a Hive query against another Hive table. In addition to Hive queries, we need to be able to run Java MapReduce jobs, and preferably Pig jobs as well, on top of this data.
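For concreteness, this is roughly the setup I mean (table and column names below are made up; the SET options are standard Hive 0.9 settings):

```sql
-- Hypothetical table; SEQUENCEFILE is one of the formats in question.
CREATE TABLE events_archive (id BIGINT, payload STRING)
STORED AS SEQUENCEFILE;

-- Enable compressed output before the periodic append.
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;

-- Periodic append from the (hypothetical) source table.
INSERT INTO TABLE events_archive
SELECT id, payload FROM events_staging;
```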
I'm wondering what the best storage format for this Hive table would be. How easy is it to run Java MapReduce over Hive-generated sequence files (i.e. a table stored as SequenceFile)? How easy is it to run Java MapReduce over RCFiles? Any pointers to examples of these would be really great. Does using compressed text files (deflate) sound like the best option for this use case?

BTW, we are stuck with Hive 0.9 for the foreseeable future, so ORC is not an option.

Thanks,
Thilina

--
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org
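P.S. To make the text-file option concrete: Hive's default text serialization separates columns with Ctrl-A (\u0001), so a Java mapper mostly just has to split on that. A minimal sketch (class and method names here are illustrative, not from any library; the actual Hadoop mapper wiring is only shown in comments):

```java
// Sketch: parsing Hive's default delimited text rows in plain Java.
// Hive's default field delimiter is Ctrl-A (\u0001). The class and method
// names (HiveTextRowParser.parse) are illustrative, not a real API.
public class HiveTextRowParser {

    /** Splits one Hive-generated text row into its columns.
     *  The -1 limit keeps trailing empty (NULL-ish) fields. */
    public static String[] parse(String row) {
        return row.split("\u0001", -1);
    }

    // Inside an actual Hadoop mapper this would be used roughly as
    // (Hadoop classes omitted so this file compiles standalone):
    //
    //   public void map(LongWritable key, Text value, Context ctx) {
    //       String[] cols = HiveTextRowParser.parse(value.toString());
    //       ...
    //   }

    public static void main(String[] args) {
        String[] cols = parse("id42\u0001alice\u0001");
        System.out.println(cols.length + " columns, first=" + cols[0]);
        // → 3 columns, first=id42
    }
}
```

With deflate-compressed text files, the mapper-side parsing is the same; the input format handles decompression.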