[
https://issues.apache.org/jira/browse/HIVE-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157705#comment-13157705
]
alex gemini commented on HIVE-2097:
-----------------------------------
A few suggestions:
Columnar databases typically order columns so that the most selective columns
come first, and then store the values within each column in sorted order.
Case 1: we already know the value pattern of each column in a large dataset,
for example because we calculated a sample column distribution in the database.
What we need to know is the number of distinct values in each column (given
here as a sampled ratio), which can be supplied in the create table statement:
create table a
(col1,col2,col3,col4,col5,col6,xxx)
TBLPROPERTIES
(col1_sample=0.001,col2_sample=0.01,col3_sample=0.5,col4_sample=0.02,col5_sample=0.002,col6_sample=0.005)
When we organize the column groups, we then know which columns have the highest
selectivity. In this example, the selectivity order of table a is
col3 > col4 > col2 > col6 > col5 > col1, so we can organize the column groups as
(col3,col4,col2), (col6,col5), col1
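
As a rough illustration of case 1, the ordering and grouping could be derived
from the sample ratios as below. This is only a sketch: the function names and
the group sizes (3, 2, 1) are assumptions chosen to reproduce the example
above, not part of Hive or RCFile.

```python
from typing import Dict, List


def order_by_selectivity(ratios: Dict[str, float]) -> List[str]:
    """Return column names sorted from highest to lowest sample ratio."""
    return sorted(ratios, key=lambda col: ratios[col], reverse=True)


def group_columns(ordered: List[str], sizes: List[int]) -> List[List[str]]:
    """Split an ordered column list into consecutive groups of the given sizes.

    Any columns left over after the last size go into one final group.
    """
    groups = []
    start = 0
    for size in sizes:
        groups.append(ordered[start:start + size])
        start += size
    if start < len(ordered):
        groups.append(ordered[start:])
    return groups


# Sample ratios taken from the TBLPROPERTIES example above.
sample_ratios = {
    "col1": 0.001,
    "col2": 0.01,
    "col3": 0.5,
    "col4": 0.02,
    "col5": 0.002,
    "col6": 0.005,
}

ordered_columns = order_by_selectivity(sample_ratios)
# Group sizes are hypothetical; the comment does not say how they are chosen.
column_groups = group_columns(ordered_columns, [3, 2, 1])
print(ordered_columns)
print(column_groups)
```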
Case 2: we do not know the table properties when we create the table. We can
store the data as usual and then provide a utility such as hive --service
rcfile_reorder 'some_hive_table_here'. When this command is executed, it submits
several MapReduce jobs to calculate the selectivity of each column and stores
the results in the metastore. It then decompresses each RCFile and reorganizes
it into more space-efficient column groups.
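
The per-column calculation in case 2 might look roughly like the following,
shown here in memory rather than as MapReduce jobs. It assumes that
"selectivity" means the ratio of distinct values to rows; the function name and
the toy rows are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence


def distinct_ratios(rows: Iterable[Sequence], columns: List[str]) -> Dict[str, float]:
    """Return distinct-value count divided by row count for each column."""
    seen = {col: set() for col in columns}
    row_count = 0
    for row in rows:
        row_count += 1
        for col, value in zip(columns, row):
            seen[col].add(value)
    if row_count == 0:
        # No data: report zero rather than dividing by zero.
        return {col: 0.0 for col in columns}
    return {col: len(seen[col]) / row_count for col in columns}


# Toy data: col_a is constant, col_b is unique per row, col_c has two values.
toy_rows = [
    (1, 10, "x"),
    (1, 20, "x"),
    (1, 30, "y"),
    (1, 40, "y"),
]
toy_ratios = distinct_ratios(toy_rows, ["col_a", "col_b", "col_c"])
print(toy_ratios)
```

On a real table the distinct counts would have to be computed in a distributed
way, and probably approximately, since exact sets do not fit in memory for
large columns.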
Hope this helps.
> Explore mechanisms for better compression with RC Files
> -------------------------------------------------------
>
> Key: HIVE-2097
> URL: https://issues.apache.org/jira/browse/HIVE-2097
> Project: Hive
> Issue Type: Improvement
> Components: Query Processor, Serializers/Deserializers
> Reporter: Krishna Kumar
> Assignee: Krishna Kumar
> Priority: Minor
>
> Optimization of the compression mechanisms used by RC File to be explored.
> Some initial ideas
>
> 1. More efficient serialization/deserialization based on type-specific and
> storage-specific knowledge.
>
> For instance, storing sorted numeric values efficiently using some delta
> coding techniques
> 2. More efficient compression based on type-specific and storage-specific
> knowledge
> Enable compression codecs to be specified based on types or individual
> columns
> 3. Reordering the on-disk storage for better compression efficiency.