Many thanks, Sir. Very useful.

Could you kindly elaborate on why RC files do not have these capabilities? As I see them, they are Row Columnar files. Am I correct to assume that an ORC file is basically an RC file with more optimisation? Are RC and ORC files designed for a columnar format in a way similar to how a columnar data warehouse is built?

Regards

On Monday, 21 December 2015, 18:58, Alan Gates <alanfga...@gmail.com> wrote:

ORC offers a number of features not available in RC files:

* Better encoding of data. Integer values are run length encoded. Strings and dates are stored in a dictionary (and the resulting pointers are then run length encoded).
* Internal indexes and statistics on the data. This allows for more efficient reading of the data as well as skipping of sections of the data not relevant to a given query. These indexes can also be used by the Hive optimizer to help plan query execution.
* Predicate push down for some predicates. For example, in the query "select * from user where state = 'ca'", ORC could look at a collection of rows and use the indexes to see that no rows in that group have that value, and thus skip the group altogether.
* Tight integration with Hive's vectorized execution, which produces much faster processing of rows.
* Support for new ACID features in Hive (transactional insert, update, and delete).
* It has a much faster read time than RCFile and compresses much more efficiently.

Whether ORC is the best format for what you're doing depends on the data you're storing and how you are querying it. If you are storing data where you know the schema and you are doing analytic-type queries, it's the best choice (in fairness, some would dispute this and choose Parquet, though much of what I said above about ORC vs RC applies to Parquet as well). If you are doing queries that select the whole row each time, columnar formats like ORC won't be your friend. Also, if you are storing self-structured data such as JSON or Avro, you may find text or Avro storage to be a better format.

Alan.

Ashok Kumar
December 21, 2015 at 9:45

Hi Gurus,

I am trying to understand the advantages that the ORC file format offers over RC. I have read the existing documents but I still don't seem to grasp the main differences. Can someone explain to me, as a user, where ORC scores when compared to RC?
What I would like to know about is mainly the performance. I am also aware that ORC does some smart compression. Finally, is the ORC file format the best choice in Hive?

Thank you
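
To make the encoding and index points in Alan's reply a little more concrete, here is a minimal Python sketch. It is a toy model under stated assumptions, not ORC's actual on-disk format or reader: the function names (`dictionary_rle_encode`, `build_row_group_stats`, `groups_to_read`) and the group size of 4 rows are hypothetical and chosen only for illustration. It dictionary-encodes a string column, run length encodes the resulting ids, and keeps min/max statistics per group of rows so that an equality predicate such as state = 'ca' can skip groups that cannot contain a match.

```python
from itertools import groupby


def dictionary_rle_encode(values):
    """Dictionary-encode strings, then run length encode the ids (toy model)."""
    dictionary = sorted(set(values))
    ids = {value: i for i, value in enumerate(dictionary)}
    # Each run is (dictionary id, number of consecutive repeats).
    runs = [(key, len(list(group)))
            for key, group in groupby(ids[v] for v in values)]
    return dictionary, runs


def dictionary_rle_decode(dictionary, runs):
    """Rebuild the original column from the dictionary and the runs."""
    out = []
    for dict_id, count in runs:
        out.extend([dictionary[dict_id]] * count)
    return out


def build_row_group_stats(values, group_size):
    """Record (start row, min, max) for each group of rows.

    group_size is a hypothetical parameter for this sketch.
    """
    stats = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        stats.append((start, min(group), max(group)))
    return stats


def groups_to_read(stats, wanted):
    """Return start rows of groups whose [min, max] range could hold `wanted`."""
    return [start for start, lo, hi in stats if lo <= wanted <= hi]


# Hypothetical sample column, standing in for user.state.
states = ["ca", "ca", "ca", "ny", "ny", "tx",
          "tx", "tx", "wa", "wa", "wa", "wa"]

dictionary, runs = dictionary_rle_encode(states)
stats = build_row_group_stats(states, group_size=4)

print(dictionary)                   # ['ca', 'ny', 'tx', 'wa']
print(runs)                         # [(0, 3), (1, 2), (2, 3), (3, 4)]
print(groups_to_read(stats, "ca"))  # [0]
```

In this sketch, the predicate state = 'ca' only requires reading the first group; the other two are ruled out by their min/max values alone. Note that such statistics can only exclude groups: a value like 'or' falls inside the range of the second group ('ny' to 'tx'), so that group would still be read even though it holds no matching row.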