[ 
https://issues.apache.org/jira/browse/HIVE-4244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616657#comment-13616657
 ] 

Owen O'Malley commented on HIVE-4244:
-------------------------------------

We should experiment with different values, but my guess is that the right 
cutover point for the heuristic is a loading of 2 to 3, i.e. an average of 2 
to 3 occurrences per distinct value (50% to 33% distinct values).

We aren't really going to know whether the heuristic is right or wrong unless 
we compare both encodings, which is much too expensive. By taking a good guess 
after looking at the start of the stripe, we can get good performance most of 
the time.
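
To make the heuristic concrete, here is a minimal sketch in Python rather than 
the ORC writer's actual Java code. The names loading and choose_encoding are 
hypothetical, the 100,000-value sample comes from the issue description, and 
the default cutover of 2.0 is just one end of the guessed range, not a tuned 
value.

```python
def loading(sample):
    """Average number of occurrences per distinct value in the sample."""
    distinct = len(set(sample))
    return len(sample) / distinct if distinct else 0.0


def choose_encoding(values, sample_size=100_000, min_loading=2.0):
    """Pick an encoding from the first sample_size values of a column.

    Hypothetical helper: min_loading is an assumed cutover, to be tuned.
    """
    sample = values[:sample_size]
    # An empty sample has loading 0.0, so it falls through to direct encoding.
    return "dictionary" if loading(sample) >= min_loading else "direct"


if __name__ == "__main__":
    repetitive = ["us", "uk", "de"] * 1000      # few distinct values
    unique = [str(i) for i in range(3000)]      # every value distinct
    print(choose_encoding(repetitive))
    print(choose_encoding(unique))
```

Only the start of the stripe is inspected, so a column whose distribution 
shifts later in the stripe can still be misjudged; that is the trade-off 
against the cost of comparing both encodings.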
                
> Make string dictionaries adaptive in ORC
> ----------------------------------------
>
>                 Key: HIVE-4244
>                 URL: https://issues.apache.org/jira/browse/HIVE-4244
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>            Reporter: Owen O'Malley
>            Assignee: Kevin Wilfong
>
> The ORC writer should adaptively switch between dictionary and direct 
> encoding. I'd propose looking at the first 100,000 values in each column and 
> deciding whether there is sufficient loading in the dictionary to use 
> dictionary encoding.
