[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557341#comment-13557341 ]
Owen O'Malley commented on HIVE-3874:
-------------------------------------

Joydeep, I've used a two-level strategy:
* large stripes (default 250MB) to enable large, efficient reads
* relatively frequent row index entries (default every 10k rows) to enable skipping within a stripe

The row index entries record, for each column, the positions needed to seek to the right compression block and to the right byte within the decompressed block (see the sketch at the end of this message).

I obviously did consider HFile, although from a practical point of view it is fairly embedded within HBase. Additionally, since it treats each column as opaque bytes, it can't apply any type-specific encodings or compression and can't interpret the column values, which is critical for performance.

Once you have the ability to skip large sets of rows based on the filter predicates, you can sort the table on secondary keys and achieve a large speedup. For example, if your primary partition is transaction date, you might want to sort the table on state, zip, and last name. Then, if you are looking for just the records in CA, the reader won't need to read the records for the other states.

> Create a new Optimized Row Columnar file format for Hive
> --------------------------------------------------------
>
>                 Key: HIVE-3874
>                 URL: https://issues.apache.org/jira/browse/HIVE-3874
>             Project: Hive
>          Issue Type: Improvement
>          Components: Serializers/Deserializers
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: OrcFileIntro.pptx
>
>
> There are several limitations of the current RC File format that I'd like to
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push-down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per file or row group
> * there is no mechanism for seeking to a particular row number, which is
> required for external indexes
> * there is no mechanism for storing lightweight indexes within the file to
> enable push-down filters to skip entire row groups
> * the types of the rows aren't stored in the file
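To make the two-level skip strategy from the comment above concrete, here is a minimal sketch of a per-column row index entry and the lookup a reader might perform when seeking to a row number. All names here (RowIndexEntry, seekEntryFor, and so on) are hypothetical illustrations of the idea, not the actual ORC reader API.

{code:java}
import java.util.List;

/**
 * Minimal sketch of the two-level skip strategy: one index entry is
 * recorded per column every ROW_INDEX_STRIDE rows, holding the positions
 * needed to resume decoding mid-stripe. Hypothetical names throughout.
 */
class RowIndexSketch {
    /** Default row index frequency from the comment above: every 10k rows. */
    static final int ROW_INDEX_STRIDE = 10_000;

    /** Positions needed to resume decoding one column mid-stripe. */
    static class RowIndexEntry {
        final long firstRow;              // first row covered by this entry
        final long compressedBlockStart;  // byte offset of the compression block in the stream
        final long uncompressedOffset;    // byte offset within the decompressed block

        RowIndexEntry(long firstRow, long compressedBlockStart, long uncompressedOffset) {
            this.firstRow = firstRow;
            this.compressedBlockStart = compressedBlockStart;
            this.uncompressedOffset = uncompressedOffset;
        }
    }

    /**
     * Find the last index entry at or before targetRow. The reader would then
     * (1) seek the file to entry.compressedBlockStart,
     * (2) decompress that single block,
     * (3) skip entry.uncompressedOffset bytes into the decompressed data, and
     * (4) decode and discard (targetRow - entry.firstRow) values,
     * so at most one compression block is decompressed and fewer than
     * ROW_INDEX_STRIDE rows are decoded unnecessarily.
     */
    static RowIndexEntry seekEntryFor(List<RowIndexEntry> index, long targetRow) {
        int lo = 0, hi = index.size() - 1, best = 0;
        while (lo <= hi) {                        // binary search on firstRow
            int mid = (lo + hi) >>> 1;
            if (index.get(mid).firstRow <= targetRow) {
                best = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return index.get(best);
    }
}
{code}

Recording both the compressed block offset and the offset within the decompressed data is what lets a reader land in the middle of a compression block without decompressing everything that precedes it, which is the mechanism that makes predicate-driven skipping within a stripe cheap.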