[ 
https://issues.apache.org/jira/browse/ARROW-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15425026#comment-15425026
 ] 

Wes McKinney commented on ARROW-81:
-----------------------------------

A Category logical type (or equivalent metadata) is clearly necessary for many 
use cases, because it is semantically distinct from dictionary-encoded data 
even though the physical representation is the same. For example, statistics 
and machine learning users from many communities would not be able to 
faithfully round-trip data to Arrow metadata without it. If you would like 
other perspectives on this, I can ask others to weigh in. 

The open question is the implementation (physical representation) of Category. 
I propose that it be a dictionary-encoded struct with a single child. 
For example:

{{Category[string] -> Struct<levels: String>}}

The one additional piece of metadata required is orderedness. It needs to be 
stored in the schema because it must be part of schema negotiation, rather 
than only being observed when the dictionary data is realized. 
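
To make the proposal concrete, here is a rough sketch in plain Python (standard 
library only, not the Arrow API). The names CategoryType, CategoryArray, and 
encode are hypothetical and exist only for illustration. The point is that the 
ordered flag lives on the type, so two schemas can be compared without looking 
at any dictionary data.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class CategoryType:
    """Hypothetical schema-level type: level value type plus the ordered flag."""
    value_type: str
    ordered: bool = False


@dataclass
class CategoryArray:
    """Hypothetical array: integer codes pointing into a child array of levels."""
    type: CategoryType
    codes: List[Optional[int]]  # index into levels; None marks a null slot
    levels: List[str]           # the dictionary ("categories" / "levels")

    def decode(self) -> List[Optional[str]]:
        # Map each code back to its level, preserving nulls.
        return [None if c is None else self.levels[c] for c in self.codes]


def encode(values: Sequence[Optional[str]],
           levels: Sequence[str],
           ordered: bool = False) -> CategoryArray:
    # Raises KeyError if a value is not one of the declared levels.
    index = {level: i for i, level in enumerate(levels)}
    codes = [None if v is None else index[v] for v in values]
    return CategoryArray(CategoryType("string", ordered), codes, list(levels))


arr = encode(["low", "high", None, "medium", "low"],
             ["low", "medium", "high"], ordered=True)
```

Under this sketch, CategoryType("string", ordered=True) and 
CategoryType("string", ordered=False) compare unequal, which is the kind of 
check schema negotiation would need to make before any data is exchanged.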

By using dictionary encoding for the implementation, one can also easily share 
dictionaries used by multiple fields (having the same category/factor levels). 
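
As an illustration of dictionary sharing, here is a small sketch in plain 
Python (standard library only, not the Arrow API). SharedDictionary and the 
dictionary_id field are hypothetical names; the assumption is simply that each 
field stores its own codes plus a reference to one common set of levels.

```python
from typing import Dict, List, Optional, Sequence, Tuple


class SharedDictionary:
    """Hypothetical dictionary of levels that several fields can reference."""

    def __init__(self, dictionary_id: int, levels: Sequence[str]) -> None:
        self.dictionary_id = dictionary_id
        self.levels = list(levels)
        self._index = {level: i for i, level in enumerate(self.levels)}

    def encode(self, values: Sequence[Optional[str]]) -> List[Optional[int]]:
        # Raises KeyError if a value is not one of the levels.
        return [None if v is None else self._index[v] for v in values]

    def decode(self, codes: Sequence[Optional[int]]) -> List[Optional[str]]:
        return [None if c is None else self.levels[c] for c in codes]


severity = SharedDictionary(0, ["low", "medium", "high"])

# Each field keeps only (dictionary id, codes); the levels are stored once.
columns: Dict[str, Tuple[int, List[Optional[int]]]] = {
    "before": (severity.dictionary_id, severity.encode(["low", "high", "medium"])),
    "after": (severity.dictionary_id, severity.encode(["low", "low", None])),
}
```

Because both fields carry the same dictionary id, a consumer can tell that 
their codes are directly comparable without inspecting the levels themselves.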

> C++: Add a Category nested type
> -------------------------------
>
>                 Key: ARROW-81
>                 URL: https://issues.apache.org/jira/browse/ARROW-81
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>
> A Category (or "factor") is a dictionary-encoded array whose dictionary has 
> semantic meaning. The data consists of
> - An array of integer "codes"
> - A child array of some other type, known as the "categories" or "levels" of 
> the array.
> Typically there is an "ordered" boolean flag indicating whether the order of 
> the categories is meaningful.
> Category/factor types are used in a number of common statistical analyses. 
> See, for example, 
> http://www.voteview.com/R_Ordered_Logistic_or_Probit_Regression.htm. It is a 
> basic requirement for Python and R, at least, as Arrow C++ consumers, to have 
> this type. Separately, we should consider what is necessary to be able to 
> transmit category data in IPCs -- possibly an expansion of the Arrow format. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
