Shayan Monshizadeh created ARROW-1407:
-----------------------------------------

             Summary: Dictionaries can only hold 4096 indices
                 Key: ARROW-1407
                 URL: https://issues.apache.org/jira/browse/ARROW-1407
             Project: Apache Arrow
          Issue Type: Bug
          Components: Java - Vectors
    Affects Versions: 0.6.0
            Reporter: Shayan Monshizadeh
            Priority: Minor
         Attachments: Screen Shot 2017-08-22 at 7.14.07 PM.png

Dictionaries seem to be able to hold only 4096 indices, meaning that only 
vectors with at most 4096 values can be dictionary-encoded. The attached image 
is a stack trace of what happens when I try to encode a vector containing 4097 
strings against a dictionary containing two distinct values. 

The error can be traced to line 95 of DictionaryEncoder.java 
(`setter.invoke(mutator, i, encoded);`). The indices array, which holds the 
encoded values, is allocated on line 84 with `indices.allocateNew()`, and 
`allocateNew()` seems to allocate only 4096 bytes of data initially. The code 
runs if there are 4096 rows of data or fewer. Any more, and the same error is 
given.
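
A minimal sketch of the suspected mechanism, assuming the indices buffer starts 
with room for 4096 entries and the write on line 95 does not grow it. The class 
`IndexBufferSketch` and its methods are hypothetical stand-ins written for this 
report, not the Arrow API; `setSafe` here only illustrates the idea of a write 
that reallocates on demand.

```java
import java.util.Arrays;

// Toy stand-in for a fixed-width index vector. NOT the Arrow API.
class IndexBufferSketch {
    // Assumed default capacity, matching the limit observed in this report.
    static final int INITIAL_CAPACITY = 4096;

    private int[] data;

    // Allocates the default capacity, regardless of how many rows will follow.
    void allocateNew() {
        data = new int[INITIAL_CAPACITY];
    }

    int capacity() {
        return data.length;
    }

    // Non-growing write: fails once index reaches the allocated capacity.
    void set(int index, int value) {
        data[index] = value;
    }

    // Growing write: doubles the backing array until the index fits.
    void setSafe(int index, int value) {
        while (index >= data.length) {
            data = Arrays.copyOf(data, data.length * 2);
        }
        data[index] = value;
    }

    // Returns true if writing `rows` indices with set() runs out of space.
    static boolean failsWithSet(int rows) {
        IndexBufferSketch buffer = new IndexBufferSketch();
        buffer.allocateNew();
        try {
            for (int i = 0; i < rows; i++) {
                buffer.set(i, i % 2); // two distinct dictionary values
            }
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    // Returns the capacity after writing `rows` indices with setSafe().
    static int capacityAfterSetSafe(int rows) {
        IndexBufferSketch buffer = new IndexBufferSketch();
        buffer.allocateNew();
        for (int i = 0; i < rows; i++) {
            buffer.setSafe(i, i % 2);
        }
        return buffer.capacity();
    }

    public static void main(String[] args) {
        System.out.println(failsWithSet(4096));         // false
        System.out.println(failsWithSet(4097));         // true
        System.out.println(capacityAfterSetSafe(4097)); // 8192
    }
}
```

If this is what is happening, either sizing the allocation to the input 
vector's value count or using a write that grows the buffer should avoid the 
error.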



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
