[
https://issues.apache.org/jira/browse/HIVE-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160726#comment-13160726
]
alex gemini commented on HIVE-2097:
-----------------------------------
Another issue is efficient serialization/deserialization. For the same example
above, assume every gender/age/region combination has 100 messages, stored
evenly in one DFS block. In the gender column we would store values like this:
{'male'}[1 - 60k] {'female'}[60k+1 - 120k]; the age column would look like
this: {21}[1 - 3k] {22}[3k+1 - 6k] {23}[6k+1 - 9k]; and the region column like
this: {'LA'}[1 - 300] {'NY'}[301 - 600].
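
To make that layout concrete, here is a minimal sketch of a run-based column
representation; the class names (ColumnRun, RunEncodedColumn) are hypothetical
and not part of the existing RCFile code:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical run-based column representation: each distinct value is stored
// once, together with the (inclusive) row range it covers inside the DFS block.
class ColumnRun<T> {
    final T value;
    final long firstRow;   // inclusive
    final long lastRow;    // inclusive

    ColumnRun(T value, long firstRow, long lastRow) {
        this.value = value;
        this.firstRow = firstRow;
        this.lastRow = lastRow;
    }
}

class RunEncodedColumn<T> {
    final List<ColumnRun<T>> runs = new ArrayList<>();

    void addRun(T value, long firstRow, long lastRow) {
        runs.add(new ColumnRun<>(value, firstRow, lastRow));
    }

    // Return every run whose value equals 'wanted'.
    List<ColumnRun<T>> runsFor(T wanted) {
        List<ColumnRun<T>> result = new ArrayList<>();
        for (ColumnRun<T> run : runs) {
            if (run.value.equals(wanted)) {
                result.add(run);
            }
        }
        return result;
    }
}

public class RunLayoutExample {
    public static void main(String[] args) {
        // The example layout from the comment above.
        RunEncodedColumn<String> gender = new RunEncodedColumn<>();
        gender.addRun("male", 1, 60_000);
        gender.addRun("female", 60_001, 120_000);

        RunEncodedColumn<Integer> age = new RunEncodedColumn<>();
        age.addRun(21, 1, 3_000);
        age.addRun(22, 3_001, 6_000);
        age.addRun(23, 6_001, 9_000);

        RunEncodedColumn<String> region = new RunEncodedColumn<>();
        region.addRun("LA", 1, 300);
        region.addRun("NY", 301, 600);

        System.out.println("'LA' runs: " + region.runsFor("LA").size());
    }
}
{code}
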
When we issue a query on a single table, such as: select sum(age) from logs
where region='LA' and age=30, we look at every column referenced in the
select, where, and group clauses; since the columns are sorted by selectivity,
the last such column has the lowest selectivity (in this example, region). We
then find the row ranges for each region value:
{'LA'} = [(1 - 300), (30k+1 - 30k+300), (60k+1 - 60k+300), ...] and
{'NY'} = [(301 - 600), (30k+301 - 30k+600), (60k+301 - 60k+600), ...].
We only need to deserialize that column; we do not need to decompress it,
because we already know it is the lowest-selectivity column. We then organize
the InputSplit's key as {[age='21'][region='LA']} and its value as
{(1 - 300), (30k+1 - 30k+300), (60k+1 - 60k+300), ...}. This InputSplit
key/value pair is unique per DFS block: because the columns are already sorted
by selectivity, the lowest-selectivity column referenced in the select, where,
and group clauses must be unique.
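
A rough sketch of that per-block lookup follows; the names here (RowRange, the
key format, and the driver class) are illustrative only, not an existing Hive
API. It reads only the runs of the lowest-selectivity column, collects the row
ranges matching the predicate, and keys each surviving range by the enclosing
higher-level run:

{code:java}
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of building per-block split key/values from row ranges.
public class RangeSplitSketch {

    static class RowRange {
        final long first, last;  // inclusive row ids inside one DFS block
        RowRange(long first, long last) { this.first = first; this.last = last; }
        boolean overlaps(RowRange o) { return first <= o.last && o.first <= last; }
        RowRange intersect(RowRange o) {
            return new RowRange(Math.max(first, o.first), Math.min(last, o.last));
        }
        @Override public String toString() { return "(" + first + "-" + last + ")"; }
    }

    public static void main(String[] args) {
        // Row ranges for the lowest-selectivity column in the WHERE clause:
        // region = 'LA' (taken from the example layout above).
        List<RowRange> laRanges = new ArrayList<>();
        laRanges.add(new RowRange(1, 300));
        laRanges.add(new RowRange(30_001, 30_300));
        laRanges.add(new RowRange(60_001, 60_300));

        // Row ranges for the higher-level age runs.
        Map<Integer, RowRange> ageRuns = new LinkedHashMap<>();
        ageRuns.put(21, new RowRange(1, 3_000));
        ageRuns.put(22, new RowRange(3_001, 6_000));
        ageRuns.put(23, new RowRange(6_001, 9_000));

        // Build one key/value per (age run, region value) pair: the key names
        // the enclosing run, the value lists the row ranges satisfying region='LA'.
        Map<String, List<RowRange>> splitValues = new LinkedHashMap<>();
        for (Map.Entry<Integer, RowRange> age : ageRuns.entrySet()) {
            for (RowRange la : laRanges) {
                if (age.getValue().overlaps(la)) {
                    String key = "[age='" + age.getKey() + "'][region='LA']";
                    splitValues.computeIfAbsent(key, k -> new ArrayList<>())
                               .add(age.getValue().intersect(la));
                }
            }
        }

        splitValues.forEach((key, ranges) ->
                System.out.println(key + " -> " + ranges));
        // e.g. [age='21'][region='LA'] -> [(1-300)]
    }
}
{code}
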
> Explore mechanisms for better compression with RC Files
> -------------------------------------------------------
>
> Key: HIVE-2097
> URL: https://issues.apache.org/jira/browse/HIVE-2097
> Project: Hive
> Issue Type: Improvement
> Components: Query Processor, Serializers/Deserializers
> Reporter: Krishna Kumar
> Assignee: Krishna Kumar
> Priority: Minor
>
> Optimization of the compression mechanisms used by RC File to be explored.
> Some initial ideas:
>
> 1. More efficient serialization/deserialization based on type-specific and
> storage-specific knowledge.
>
> For instance, storing sorted numeric values efficiently using some delta
> coding techniques (see the sketch after this quoted description).
> 2. More efficient compression based on type-specific and storage-specific
> knowledge.
> Enable compression codecs to be specified based on types or individual
> columns.
> 3. Reordering the on-disk storage for better compression efficiency.
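
Regarding idea 1 in the quoted description (delta coding for sorted numeric
values), here is a minimal, self-contained sketch; the class and method names
are mine and not tied to any existing Hive serde. It encodes a sorted long
column as a first value followed by varint-encoded deltas, which is where the
size win comes from:

{code:java}
import java.io.ByteArrayOutputStream;

// Minimal delta + varint encoder for a sorted long column (illustrative only).
public class DeltaCodingSketch {

    // Write a non-negative long as a base-128 varint (7 bits per byte).
    static void writeVarLong(ByteArrayOutputStream out, long v) {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    // Encode sorted values as: first value, then successive deltas.
    static byte[] encodeSorted(long[] sorted) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long prev = 0;
        for (long v : sorted) {
            writeVarLong(out, v - prev);  // deltas are small for sorted data
            prev = v;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // e.g. sorted timestamps or row ids with small gaps
        long[] sorted = new long[10_000];
        long v = 1_300_000_000_000L;
        for (int i = 0; i < sorted.length; i++) {
            v += 1 + (i % 5);            // small, positive gaps
            sorted[i] = v;
        }
        byte[] encoded = encodeSorted(sorted);
        System.out.println("fixed-width bytes: " + sorted.length * 8L);  // 80000
        System.out.println("delta+varint bytes: " + encoded.length);
    }
}
{code}

For monotonically increasing values with small gaps, each delta fits in one or
two bytes instead of eight, before any general-purpose codec is even applied.
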
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira