[ https://issues.apache.org/jira/browse/ARROW-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15428957#comment-15428957 ]
Wes McKinney commented on ARROW-81:
-----------------------------------

We're running into what "first class type" means again. I'm going to change the JIRA title to "Add a Category logical type" to be more clear (I don't think any changes to Layout.md are necessary).

My preferred representation of Category would be as a dictionary-encoded Struct. This has the benefit of allowing systems that don't know about what a Category is to still manipulate / examine the data normally. In other words, if we had the data:

{code}
Category[String]
  codes: [0, 0, 0, 0, 1, 1, 1, 1]
  categories: ['foo', 'bar']
{code}

Then the Arrow representation would be as

{code}
Struct<levels: String> dictionary-encoded
  dictionary indices: [0, 0, 0, 0, 1, 1, 1, 1]
  dictionary_id: i

dictionary i:
  type=Struct<levels: String>
  fields:
    levels (type=[String]) : ['foo', 'bar']
{code}

Any other ideas for this?

I could have been more clear about my point about the schema -- if the categories are embedded in the metadata, then generating a new Schema after a transformation could be arbitrarily expensive. In theory the size in-memory of the Schema should be small, so that modifications (yielding new schemas, due to schema object immutability) are cheap.

> C++: Add a Category nested type
> -------------------------------
>
>                 Key: ARROW-81
>                 URL: https://issues.apache.org/jira/browse/ARROW-81
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>
> A Category (or "factor") is a dictionary-encoded array whose dictionary has
> semantic meaning. The data consists of
> - An array of integer "codes"
> - A child array of some other type, known as the "categories" or "levels" of
> the array. Typically there is an "ordered" boolean flag indicating whether
> the order of the categories is meaningful.
> Category/factor types are used in a number of common statistical analyses.
> See, for example,
> http://www.voteview.com/R_Ordered_Logistic_or_Probit_Regression.htm. It is a
> basic requirement for Python and R, at least, as Arrow C++ consumers, to have
> this type. Separately, we should consider what is necessary to be able to
> transmit category data in IPCs -- possibly an expansion of the Arrow format.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
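
For readers less familiar with the codes/categories layout described in the comment above, here is a minimal sketch in plain Python. It is only an illustration: the `Category` class and the `encode` helper are hypothetical names invented for this example and are not part of Arrow or any Arrow API. The sketch assumes codes are assigned in order of first appearance, which the thread does not specify.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Category:
    """Hypothetical stand-in for a dictionary-encoded category array."""
    codes: List[int]        # one integer index into `categories` per element
    categories: List[str]   # the "levels", stored as data rather than in a schema
    ordered: bool = False   # whether the order of the categories is meaningful

    def decode(self) -> List[str]:
        # Look each code up in the dictionary to recover the original values.
        return [self.categories[c] for c in self.codes]


def encode(values: Sequence[str]) -> Category:
    """Dictionary-encode values (assumption: codes follow first appearance)."""
    index: Dict[str, int] = {}
    codes: List[int] = []
    for v in values:
        if v not in index:
            index[v] = len(index)
        codes.append(index[v])
    # dicts preserve insertion order, so the keys line up with the codes
    return Category(codes=codes, categories=list(index))


cat = encode(["foo"] * 4 + ["bar"] * 4)
print(cat.codes)
print(cat.categories)
```

Because the levels live next to the codes as ordinary data, a consumer that knows nothing about categories can still read both arrays, and a schema describing the column stays small regardless of how many levels there are.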