[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13550379#comment-13550379 ]
He Yongqiang commented on HIVE-3874:
------------------------------------

I want to list a few thoughts on why I think the ORC solution is a much more appealing one.

1. For a BIG data warehouse that stores more than 90% of its existing data in RCFile (like FB's >100PB warehouse), converting data from one format to another is something that definitely should be avoided. It is possible to convert some tables if there is a big space-saving advantage. But managing two distinct formats that have no compatibility or interoperability, and that even live in two different code repositories, is another big headache that we would rather avoid in the first place.
2. Developing the new ORC format in the hive/hcatalog codebase will make Hive development/operations much easier.
3. Letting the new ORC format have some backward compatibility with RCFile will save a lot of trouble.

> Create a new Optimized Row Columnar file format for Hive
> --------------------------------------------------------
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
> Issue Type: Improvement
> Components: Serializers/Deserializers
> Reporter: Owen O'Malley
> Assignee: Owen O'Malley
> Attachments: OrcFileIntro.pptx
>
>
> There are several limitations of the current RC File format that I'd like to
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per file or row group
> * there is no mechanism for seeking to a particular row number, which is
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to
> enable push-down filters to skip entire row groups.
> * the type of the rows isn't stored in the file
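
To illustrate the quoted point about light weight indexes and push-down filters, here is a minimal sketch of row-group skipping. It assumes each row group keeps a min/max statistic for a single integer column. The names RowGroup, make_row_group, and scan_greater_than are hypothetical and are not taken from the Hive or ORC code; this only shows the general idea, not how the new format would actually implement it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical row group: one integer column plus min/max statistics."""
    values: List[int]
    min_value: int
    max_value: int


def make_row_group(values: List[int]) -> RowGroup:
    # The statistics are computed once, when the row group is written.
    return RowGroup(values=values, min_value=min(values), max_value=max(values))


def scan_greater_than(groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return the values > threshold and the number of row groups actually read."""
    matches: List[int] = []
    groups_read = 0
    for group in groups:
        # If even the largest value in the group fails the predicate,
        # the whole group is skipped without touching its values.
        if group.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


groups = [
    make_row_group([1, 5, 9]),
    make_row_group([20, 25, 30]),
    make_row_group([7, 8, 40]),
]
matches, groups_read = scan_greater_than(groups, 10)
print(matches, groups_read)
```

With the filter "value > 10", the first row group is skipped entirely because its stored maximum is 9, so only two of the three groups are read. Without per-group statistics, every group would have to be read and deserialized before the filter could be applied.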