[ 
https://issues.apache.org/jira/browse/ARROW-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15428957#comment-15428957
 ] 

Wes McKinney commented on ARROW-81:
-----------------------------------

We're running into what "first class type" means again. I'm going to change the 
JIRA title to "Add a Category logical type" to be more clear (I don't think any 
changes to Layout.md are necessary). 

My preferred representation of Category would be as a dictionary-encoded 
Struct. This has the benefit of allowing systems that don't know what a 
Category is to still manipulate / examine the data normally. In other words, if 
we had the data:

{code}
Category[String]

codes: [0, 0, 0, 0, 1, 1, 1, 1]
categories: ['foo', 'bar']
{code}

Then the Arrow representation would be:

{code}
Struct<levels: String> dictionary-encoded
dictionary indices: [0, 0, 0, 0, 1, 1, 1, 1]
dictionary_id: i

dictionary i: 
type=Struct<levels: String>
fields: 
  levels (type=[String]) : ['foo', 'bar']
{code}
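
As a rough illustration of the layout above (not Arrow's actual API), here is a 
minimal plain-Python sketch. The names StructDictionary, DictionaryEncodedArray 
and decode are hypothetical and exist only for this example; the point is that a 
consumer can resolve the indices against the dictionary without knowing that the 
data is a Category.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class StructDictionary:
    """Hypothetical stand-in for a dictionary of type Struct<levels: String>."""
    fields: Dict[str, List[str]]  # field name -> child array of values


@dataclass(frozen=True)
class DictionaryEncodedArray:
    """Hypothetical stand-in for a dictionary-encoded array."""
    indices: List[int]   # the Category "codes"
    dictionary_id: int   # reference to a dictionary stored elsewhere


def decode(arr: DictionaryEncodedArray,
           dictionaries: Dict[int, StructDictionary]) -> List[Dict[str, str]]:
    """Materialize struct rows by looking each index up in the dictionary."""
    dictionary = dictionaries[arr.dictionary_id]
    return [
        {name: values[i] for name, values in dictionary.fields.items()}
        for i in arr.indices
    ]


# The example data from this comment: codes plus a one-field struct dictionary.
dictionaries = {0: StructDictionary(fields={"levels": ["foo", "bar"]})}
arr = DictionaryEncodedArray(indices=[0, 0, 0, 0, 1, 1, 1, 1], dictionary_id=0)
rows = decode(arr, dictionaries)
```

Here rows holds eight struct values, the first four with levels 'foo' and the 
last four with levels 'bar'.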

Any other ideas for this?

I could have been clearer about my point regarding the schema -- if the 
categories are embedded in the metadata, then generating a new Schema after a 
transformation could be arbitrarily expensive, since the cost would grow with 
the number of categories. In theory the in-memory size of the Schema should be 
small, so that modifications (which yield new schemas, because schema objects 
are immutable) are cheap.
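
To make the cost argument concrete, here is a small plain-Python sketch under 
the assumption that a field carries only a dictionary id and the categories 
live in a separate table. The Field and Schema classes and the rename method 
are hypothetical, not Arrow's API.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Field:
    """Hypothetical field: holds a dictionary id, never the categories."""
    name: str
    type: str
    dictionary_id: Optional[int] = None


@dataclass(frozen=True)
class Schema:
    """Hypothetical immutable schema: every change builds a new object."""
    fields: Tuple[Field, ...]

    def rename(self, old: str, new: str) -> "Schema":
        # Only the small Field records are copied; no category values are touched.
        return Schema(tuple(
            replace(f, name=new) if f.name == old else f
            for f in self.fields
        ))


# Categories are stored once, outside the schema, keyed by dictionary id.
dictionaries: Dict[int, List[str]] = {7: ["foo", "bar"]}

schema = Schema((Field("x", "Struct<levels: String>", dictionary_id=7),))
renamed = schema.rename("x", "y")
```

Both schema and renamed refer to dictionary 7, so deriving the new schema costs 
the same whether that dictionary holds two categories or two million.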

> C++: Add a Category nested type
> -------------------------------
>
>                 Key: ARROW-81
>                 URL: https://issues.apache.org/jira/browse/ARROW-81
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>
> A Category (or "factor") is a dictionary-encoded array whose dictionary has 
> semantic meaning. The data consists of
> - An array of integer "codes"
> - A child array of some other type, known as the "categories" or "levels" of 
> the array. Typically there is an "ordered" boolean flag indicating whether 
> the order of the categories is meaningful.
> Category/factor types are used in a number of common statistical analyses. 
> See, for example, 
> http://www.voteview.com/R_Ordered_Logistic_or_Probit_Regression.htm. It is a 
> basic requirement for Python and R, at least, as Arrow C++ consumers, to have 
> this type. Separately, we should consider what is necessary to be able to 
> transmit category data in IPCs -- possibly an expansion of the Arrow format. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
