[ https://issues.apache.org/jira/browse/HIVE-4244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616657#comment-13616657 ]
Owen O'Malley commented on HIVE-4244:
-------------------------------------

We should play with different values, but I was guessing the right cutover point for the heuristic was at a loading of 2 to 3 (50% to 33% distinct values). We aren't really going to know whether the heuristic is right or wrong unless we compare both encodings, which is much too expensive. By taking a good guess after looking at the start of the stripe, we can get good performance most of the time.

> Make string dictionaries adaptive in ORC
> ----------------------------------------
>
>                 Key: HIVE-4244
>                 URL: https://issues.apache.org/jira/browse/HIVE-4244
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>            Reporter: Owen O'Malley
>            Assignee: Kevin Wilfong
>
> The ORC writer should adaptively switch between dictionary and direct
> encoding. I'd propose looking at the first 100,000 values in each column and
> deciding whether there is sufficient loading in the dictionary to use
> dictionary encoding.
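
For illustration, the heuristic discussed in this thread could be sketched roughly as follows. This is not the ORC writer's code. The function name `choose_encoding`, its parameters, and the default cutover of 2.0 are assumptions made for this sketch; the 100,000-value sample and the 2-to-3 loading range come from the issue description and the comment. "Loading" is taken here to mean sampled values per distinct value, so a loading of 2 corresponds to 50% distinct values.

```python
from itertools import islice


def choose_encoding(values, sample_size=100_000, min_loading=2.0):
    """Guess an encoding for one string column from the start of a stripe.

    Hypothetical helper: returns "dictionary" when the sampled values repeat
    often enough (loading >= min_loading), otherwise "direct".
    """
    # Look only at the first sample_size values, as proposed in the issue.
    sample = list(islice(values, sample_size))
    if not sample:
        # Nothing to measure; fall back to direct encoding (an assumption).
        return "direct"

    # Loading = sampled values per distinct dictionary entry.
    distinct = len(set(sample))
    loading = len(sample) / distinct

    return "dictionary" if loading >= min_loading else "direct"
```

A column of mostly unique strings gives a loading near 1 and falls back to direct encoding, while a low-cardinality column gives a high loading and keeps the dictionary. The guess is made once from the sample, which avoids the cost of writing both encodings and comparing them.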