In general, use SequenceFiles with GZip or Snappy compression.
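
A rough sketch of that setup in Hive, in case it helps. The table names (`events_seq`, `events_staging`) and columns are hypothetical and not taken from this thread; only the date partitioning and the fact that the table is populated from another Hive query come from the original question. The property names are the older `mapred.*` ones that I would expect on a Hive 0.9 era cluster, so treat them as an assumption and check them against your Hadoop version.

```sql
-- Hypothetical target table, partitioned by date and stored as SequenceFile.
CREATE TABLE events_seq (
  user_id BIGINT,
  event_type STRING,
  amount DOUBLE
)
PARTITIONED BY (dt STRING)
STORED AS SEQUENCEFILE;

-- Compress the output of the query that populates the table.
SET hive.exec.compress.output=true;
-- BLOCK compresses batches of records together rather than one record at a time.
SET mapred.output.compression.type=BLOCK;
-- Use org.apache.hadoop.io.compress.GzipCodec here for GZip instead of Snappy.
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Populate one date partition from a hypothetical staging table.
INSERT OVERWRITE TABLE events_seq PARTITION (dt='2014-01-27')
SELECT user_id, event_type, amount
FROM events_staging
WHERE dt='2014-01-27';
```

With block compression the SequenceFiles stay splittable whichever codec is chosen, which is not true of gzip-compressed text files. Snappy needs the native libraries installed on the cluster nodes.
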

On Mon, Jan 27, 2014 at 2:44 PM, Thilina Gunarathne <cset...@gmail.com> wrote:

> Thanks Eric and Sharath for the pointers to ORC. Unfortunately ORC would
> not be an option for us as our cluster still runs Hive 0.9 and we won't be
> migrating any time soon.
>
> thanks,
> Thilina
>
> On Mon, Jan 27, 2014 at 2:35 PM, Sharath Punreddy <srpunre...@gmail.com> wrote:
>
>> Quick insights:
>>
>> http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/
>>
>> On Mon, Jan 27, 2014 at 1:29 PM, Eric Hanson (BIG DATA) <eric.n.han...@microsoft.com> wrote:
>>
>>> It sounds like ORC would be best.
>>>
>>> -Eric
>>>
>>> *From:* Thilina Gunarathne [mailto:cset...@gmail.com]
>>> *Sent:* Monday, January 27, 2014 11:05 AM
>>> *To:* user@hive.apache.org
>>> *Subject:* RCFile vs SequenceFile vs text files
>>>
>>> Dear all,
>>>
>>> We are trying to pick the right data storage format for the Hive table
>>> with the following requirement and would really appreciate any insights you
>>> can provide to help our decision.
>>>
>>> 1. ~50Billion records per month. ~14 columns per record and each record
>>> is ~100 bytes. Table is partitioned by the date. Table gets populated
>>> periodically from another Hive query.
>>>
>>> 2. The columns are dense, so I'm not sure whether we'll get any space
>>> savings by using RCFiles.
>>>
>>> 3. Data needs to be compressed.
>>>
>>> 4. We will be doing lot of aggregation queries for selected columns.
>>> There will be ad-hoc queries for whole records as well.
>>>
>>> 5. We need the ability to run Java MapReduce programs on the underlying
>>> data. We have existing programs which use custom inputformats with
>>> compressed textfiles as input and we are willing to port them to use other
>>> formats. (how easy to use Java MapReduce with RCFiles vs SequenceFiles?)
>>>
>>> 6. Ability to use hive indexing.
>>>
>>> thanks a ton in advance,
>>>
>>> Thilina
>>>
>>> --
>>> https://www.cs.indiana.edu/~tgunarat/
>>> http://www.linkedin.com/in/thilina
>>> http://thilina.gunarathne.org
>>
>> --
>> Thank you
>>
>> Sharath Punreddy
>> 1201 Golden gate Dr,
>> Southlake TX 76092.
>> Phone:626-470-7867
>
> --
> https://www.cs.indiana.edu/~tgunarat/
> http://www.linkedin.com/in/thilina
> http://thilina.gunarathne.org