Quick insights: http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/
On Mon, Jan 27, 2014 at 1:29 PM, Eric Hanson (BIG DATA) <eric.n.han...@microsoft.com> wrote:

> It sounds like ORC would be best.
>
> -Eric
>
> *From:* Thilina Gunarathne [mailto:cset...@gmail.com]
> *Sent:* Monday, January 27, 2014 11:05 AM
> *To:* user@hive.apache.org
> *Subject:* RCFile vs SequenceFile vs text files
>
> Dear all,
>
> We are trying to pick the right data storage format for a Hive table
> with the following requirements, and would really appreciate any insights
> you can provide to help our decision.
>
> 1. ~50 billion records per month, ~14 columns per record, and each record
> is ~100 bytes. The table is partitioned by date and gets populated
> periodically from another Hive query.
>
> 2. The columns are dense, so I'm not sure whether we'll get any space
> savings by using RCFiles.
>
> 3. Data needs to be compressed.
>
> 4. We will be doing a lot of aggregation queries over selected columns.
> There will be ad-hoc queries over whole records as well.
>
> 5. We need the ability to run Java MapReduce programs on the underlying
> data. We have existing programs that use custom input formats with
> compressed text files as input, and we are willing to port them to other
> formats. (How easy is it to use Java MapReduce with RCFiles vs SequenceFiles?)
>
> 6. Ability to use Hive indexing.
>
> Thanks a ton in advance,
>
> Thilina
>
> --
> https://www.cs.indiana.edu/~tgunarat/
> http://www.linkedin.com/in/thilina
> http://thilina.gunarathne.org

--
Thank you
Sharath Punreddy
1201 Golden Gate Dr, Southlake, TX 76092
Phone: 626-470-7867
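
Building on Eric's ORC suggestion, here is a minimal sketch of what such a table definition could look like. The column names and the ds partition column are hypothetical (only the ~14-column, date-partitioned, compressed shape comes from the thread); orc.compress accepts ZLIB or SNAPPY:

  -- ORC-backed, date-partitioned table with compression (illustrative columns)
  CREATE TABLE event_log (
    user_id    BIGINT,
    event_type STRING,
    payload    STRING
  )
  PARTITIONED BY (ds STRING)
  STORED AS ORC
  TBLPROPERTIES ("orc.compress"="ZLIB");

It could then be populated from the existing Hive query with INSERT OVERWRITE TABLE event_log PARTITION (ds=...) SELECT ..., and Java MapReduce jobs should be able to read the same files through Hive's OrcInputFormat rather than the current custom text input formats.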