[ https://issues.apache.org/jira/browse/ARROW-255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15418301#comment-15418301 ]
Wes McKinney commented on ARROW-255:
------------------------------------

This makes sense, as any level of a nested type subtree could be hypothetically dictionary encoded. Are there many benefits to using unsigned integers for the dictionary indices (that reference elements in the dictionary)? If it makes things more difficult for JVM users, then regular int32 seems acceptable (similar in that we are doing that for variable length collection offsets).

> Finalize Dictionary representation
> ----------------------------------
>
>                 Key: ARROW-255
>                 URL: https://issues.apache.org/jira/browse/ARROW-255
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Format
>            Reporter: Julien Le Dem
>
> format/Messages.fbs mentions DictionaryBatches with an id but does not
> specify where they are referenced.
> We should add a {{dictionary: long}} in Field that references the dictionary
> id:
> Field:
> https://github.com/apache/arrow/blob/34e7f48cb71428c4d78cf00d8fdf0045532d6607/format/Message.fbs#L86
> Dictionary id:
> https://github.com/apache/arrow/blob/34e7f48cb71428c4d78cf00d8fdf0045532d6607/format/Message.fbs#L165
> We need a spec in format/Layout.md that describes the dictionary layout.
> When dictionary encoded the value vector is an array of unsigned int32.
> The dictionary vector is a Vector of the type of the value, indexed by their
> id in the dictionary.
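
For readers following the thread, the layout under discussion can be illustrated with a small sketch: a dictionary vector holding each distinct value once, plus an index vector whose entries point into it. This is plain Python, not Arrow code. The function names `dictionary_encode` and `dictionary_decode` are hypothetical, nulls are not handled, and the signed `array("i")` type (32 bits on common platforms) stands in for the int32 indices suggested in the comment rather than for whatever the spec finally adopts.

```python
from array import array


def dictionary_encode(values):
    """Split `values` into (dictionary, indices).

    dictionary: the distinct values, in order of first appearance.
    indices:    one signed integer per input value, giving that
                value's position in `dictionary`.
    """
    dictionary = []        # the "dictionary vector"
    positions = {}         # value -> its index in `dictionary`
    indices = array("i")   # signed C int; assumed 32-bit here
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original values by looking up each index."""
    return [dictionary[i] for i in indices]


values = ["a", "b", "a", "c", "b", "a"]
dictionary, indices = dictionary_encode(values)
decoded = dictionary_decode(dictionary, indices)
```

With signed 32-bit indices a single dictionary can address at most 2^31 distinct entries instead of 2^32, which is the practical cost of the JVM-friendly choice raised above.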