[jira] [Commented] (ARROW-255) Finalize Dictionary representation

Wes McKinney (JIRA) Fri, 12 Aug 2016 02:11:49 -0700

    [ https://issues.apache.org/jira/browse/ARROW-255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15418301#comment-15418301 ]


Wes McKinney commented on ARROW-255:
------------------------------------

This makes sense, as any level of a nested type subtree could hypothetically 
be dictionary encoded. 

Are there many benefits to using unsigned integers for the dictionary indices 
(the values that reference elements in the dictionary)? If unsigned indices 
make things more difficult for JVM users, then regular int32 seems acceptable 
(we already use int32 for variable-length collection offsets). 
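
To make the trade-off concrete, here is a minimal sketch of dictionary 
encoding with signed int32 indices. The function names and the little-endian 
packing are assumptions made for illustration; they are not taken from the 
Arrow format or any Arrow library. The only cost of signed indices shown here 
is that a dictionary is limited to 2**31 entries rather than 2**32.

```python
import struct

INT32_MAX = 2**31 - 1  # largest index a signed int32 can hold


def dictionary_encode(values):
    """Return (dictionary, indices); every index fits in a signed int32.

    Hypothetical helper for illustration, not an Arrow API.
    """
    dictionary = []   # distinct values, in order of first appearance
    positions = {}    # value -> index into `dictionary`
    indices = []      # one index per input value
    for value in values:
        if value not in positions:
            # The new entry would get index len(dictionary).
            if len(dictionary) > INT32_MAX:
                raise OverflowError("dictionary too large for int32 indices")
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original values by looking up each index."""
    return [dictionary[i] for i in indices]


def pack_int32_indices(indices):
    """Pack indices as little-endian signed 32-bit integers (assumed layout)."""
    return struct.pack("<%di" % len(indices), *indices)


dictionary, indices = dictionary_encode(["a", "b", "a", "c", "b"])
packed = pack_int32_indices(indices)
```

Since the JVM has no unsigned 32-bit integer type, a signed index buffer like 
the one packed above can be read there directly as an int array.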

> Finalize Dictionary representation
> ----------------------------------
>
>                 Key: ARROW-255
>                 URL: https://issues.apache.org/jira/browse/ARROW-255
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Format
>            Reporter: Julien Le Dem
>
> format/Message.fbs mentions DictionaryBatches with an id but does not 
> specify where they are referenced.
> We should add a {{dictionary: long}} in Field that references the dictionary 
> id:
> Field: 
> https://github.com/apache/arrow/blob/34e7f48cb71428c4d78cf00d8fdf0045532d6607/format/Message.fbs#L86
> Dictionary id: 
> https://github.com/apache/arrow/blob/34e7f48cb71428c4d78cf00d8fdf0045532d6607/format/Message.fbs#L165
> We need a spec in format/Layout.md that describes the dictionary layout.
> When dictionary encoded, the value vector is an array of unsigned int32.
> The dictionary vector is a Vector of the type of the value, with entries 
> indexed by their id in the dictionary.
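
The proposal in the description can be sketched as follows: a Field carries 
an optional dictionary id, and a reader uses that id to find the matching 
DictionaryBatch. The class and function names below are hypothetical 
stand-ins chosen for this sketch; they do not reflect the Flatbuffers schema 
in format/Message.fbs or any Arrow implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Field:
    """Hypothetical stand-in for a schema Field."""
    name: str
    type_name: str
    dictionary: Optional[int] = None  # id of the DictionaryBatch, if encoded


@dataclass
class DictionaryBatch:
    """Hypothetical stand-in for a DictionaryBatch message."""
    id: int
    values: List[str]  # the dictionary vector, of the field's value type


def decode_column(field, column, batches_by_id):
    """Return the logical values of `column` for `field`.

    If the field references a dictionary, `column` holds integer indices
    into that dictionary; otherwise it already holds the values.
    """
    if field.dictionary is None:
        return list(column)
    dictionary = batches_by_id[field.dictionary].values
    return [dictionary[i] for i in column]


batches_by_id = {7: DictionaryBatch(id=7, values=["red", "green", "blue"])}
color = Field(name="color", type_name="utf8", dictionary=7)
plain = Field(name="label", type_name="utf8")

decoded = decode_column(color, [2, 0, 0, 1], batches_by_id)
untouched = decode_column(plain, ["x", "y"], batches_by_id)
```

Keeping the id on the Field means several fields could, in principle, share 
one dictionary by referencing the same id.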



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
