Many thanks, Sir. Very useful.
Kindly elaborate on why RC files do not have these capabilities. As I see them,
they are Row Columnar files. Am I correct to assume that an ORC file is basically
an RC file with more optimisation?
Are RC and ORC files designed for a columnar format, similar to the way a columnar
data warehouse is built?
Regards

    On Monday, 21 December 2015, 18:58, Alan Gates <alanfga...@gmail.com> wrote:
 

 ORC offers a number of features not available in RC files:
* Better encoding of data.  Integer values are run length encoded.  Strings and 
dates are stored in a dictionary (and the resulting pointers then run length 
encoded).
* Internal indexes and statistics on the data.  This allows for more efficient 
reading of the data as well as skipping of sections of the data not relevant to 
a given query.  These indexes can also be used by the Hive optimizer to help 
plan query execution.
* Predicate push down for some predicates.  For example, in the query "select * 
from user where state = 'ca'", ORC could look at a collection of rows and use 
the indexes to see that no rows in that group have that value, and thus skip 
the group altogether.
* Tight integration with Hive's vectorized execution, which produces much 
faster processing of rows.
* Support for new ACID features in Hive (transactional insert, update, and 
delete).
* It has a much faster read time than RCFile and compresses much more 
efficiently.
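
To make the encoding and skipping points above concrete, here is a minimal Python sketch. It is a toy model only, not ORC's actual on-disk format or reader: the function names are made up for illustration, the min/max values are computed on the fly (ORC stores such statistics when the file is written), and real ORC uses more elaborate run-length variants.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def dictionary_encode(strings):
    """Replace each string with a pointer into a sorted dictionary."""
    dictionary = sorted(set(strings))
    index = {s: i for i, s in enumerate(dictionary)}
    return dictionary, [index[s] for s in strings]


def groups_to_read(row_groups, wanted):
    """Return positions of row groups whose [min, max] range could hold `wanted`.

    Hypothetical helper: a real reader would consult stored statistics
    instead of scanning each group to find its min and max.
    """
    return [i for i, group in enumerate(row_groups)
            if min(group) <= wanted <= max(group)]


# Strings go into a dictionary, then the pointers are run-length encoded.
states = ["ca", "ca", "ca", "ny", "ny", "tx"]
dictionary, pointers = dictionary_encode(states)
encoded = run_length_encode(pointers)

# For "where state = 'ca'", only groups whose range covers 'ca' are read.
row_groups = [["al", "az"], ["ca", "co", "ct"], ["ny", "tx"]]
survivors = groups_to_read(row_groups, "ca")
```

In this toy run the first and third row groups are skipped without looking at their rows, which is the effect predicate push down aims for.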

Whether ORC is the best format for what you're doing depends on the data you're 
storing and how you are querying it.  If you are storing data where you know 
the schema and you are doing analytic type queries it's the best choice (in 
fairness, some would dispute this and choose Parquet, though much of what I 
said above about ORC vs RC applies to Parquet as well).  If you are doing 
queries that select the whole row each time, columnar formats like ORC won't be 
your friend.  Also, if you are storing self-describing data such as JSON or 
Avro, you may find text or Avro storage to be a better format.
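
As a rough illustration of why the query shape matters, the following Python sketch pivots a few rows into columns and counts how many values a reader would have to decode. The table, the column names, and the `values_touched` cost measure are all assumptions made up for this example; it says nothing about how ORC actually lays out or reads data.

```python
# A tiny, made-up table in row form.
rows = [
    {"id": 1, "name": "ann", "state": "ca"},
    {"id": 2, "name": "bob", "state": "ny"},
    {"id": 3, "name": "cid", "state": "ca"},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    return {key: [row[key] for row in rows] for key in rows[0]}


def values_touched(columns, wanted_columns):
    """Count the values a columnar reader decodes for the requested columns."""
    return sum(len(columns[name]) for name in wanted_columns)


columns = to_columns(rows)

# An analytic query on one column only touches that column's values.
analytic_cost = values_touched(columns, ["state"])

# "select *" touches every column, so the columnar layout saves nothing
# and the rows still have to be stitched back together.
whole_row_cost = values_touched(columns, list(columns))
```

With three columns, the single-column query decodes a third of the values that the whole-row query does; the gap grows with the number of columns in the table.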

Alan.




    On December 21, 2015 at 9:45, Ashok Kumar wrote:

Hi Gurus,
I am trying to understand the advantages that ORC file format offers over RC.
I have read the existing documents but I still don't seem to grasp the main 
differences.
Can someone explain to me, as a user, where ORC scores when compared to RC? What 
I would like to know is mainly the performance. I am also aware that ORC does some 
smart compression as well.
Finally, is the ORC file format the best choice in Hive?
Thank you




  
