[ https://issues.apache.org/jira/browse/ARROW-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15425026#comment-15425026 ]
Wes McKinney commented on ARROW-81:
-----------------------------------

There is no doubt that a Category logical type / metadata is necessary for many use cases (because it is semantically distinct from dictionary-encoded data, even though the physical representation is the same). For example: statistics and machine learning users from many communities would not be able to faithfully round trip data to Arrow metadata without it. I will ask others to give their perspective on this if you would like to hear from others.

The implementation (physical representation) of Category is the open question. I would propose for it to be a dictionary-encoded struct with a single child. For example:

{{Category[string] -> Struct<levels: String>}}

The additional metadata requirement is orderedness. This needs to be stored in the schema as it needs to be a part of schema negotiation (rather than only observed in the realization of the data in the dictionary). By using dictionary encoding for the implementation, one can also easily share dictionaries used by multiple fields (having the same category/factor levels).

> C++: Add a Category nested type
> -------------------------------
>
>                 Key: ARROW-81
>                 URL: https://issues.apache.org/jira/browse/ARROW-81
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>
> A Category (or "factor") is a dictionary-encoded array whose dictionary has
> semantic meaning. The data consists of
> - An array of integer "codes"
> - A child array of some other type, known as the "categories" or "levels" of
> the array. Typically there is an "ordered" boolean flag indicating whether
> the order of the categories is meaningful.
> Category/factor types are used in a number of common statistical analyses.
> See, for example,
> http://www.voteview.com/R_Ordered_Logistic_or_Probit_Regression.htm. It is a
> basic requirement for Python and R, at least, as Arrow C++ consumers, to have
> this type.
> Separately, we should consider what is necessary to be able to
> transmit category data in IPCs -- possibly an expansion of the Arrow format.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
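
The representation proposed in the comment above (integer codes, a dictionary of levels, and an "ordered" flag kept in the schema) can be sketched in plain Python. This is only an illustration under stated assumptions, not the Arrow C++ API: the names CategoryType, CategoryArray and encode are hypothetical, and the levels are assumed to be strings.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class CategoryType:
    """Schema-level description of a category field (hypothetical name)."""
    value_type: str  # type of the levels, e.g. "string"
    ordered: bool    # orderedness is part of the schema, not the data


@dataclass
class CategoryArray:
    """Integer codes plus a dictionary of levels (hypothetical name)."""
    dtype: CategoryType
    codes: List[int]         # one index into `levels` per row
    levels: Tuple[str, ...]  # the dictionary; may be shared across fields

    def to_values(self) -> List[str]:
        # Decode by looking each code up in the dictionary.
        return [self.levels[c] for c in self.codes]


def encode(values: Sequence[str], levels: Tuple[str, ...],
           ordered: bool) -> CategoryArray:
    """Dictionary-encode `values` against a fixed, caller-supplied set of levels."""
    index = {level: i for i, level in enumerate(levels)}
    codes = [index[v] for v in values]  # raises KeyError on an unknown level
    return CategoryArray(CategoryType("string", ordered), codes, levels)


# Two fields with the same levels share one dictionary object.
sizes = ("small", "medium", "large")
a = encode(["medium", "small", "large", "small"], sizes, ordered=True)
b = encode(["large", "large"], sizes, ordered=True)
```

Because `ordered` sits on the type rather than on the array, two fields can be compared at the schema level (`a.dtype == b.dtype`) without inspecting any data, which is the point made about schema negotiation.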