[ https://issues.apache.org/jira/browse/ARROW-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15428834#comment-15428834 ]
Micah Kornfield commented on ARROW-81:
--------------------------------------

Thanks @wesm, I was going to post that I think dictionary encoding is the way to go for these types of columns; everything else just adds complications. I apologize for adding noise in that respect.

On that note, I would like to take a step back and ask the question implicit in my last post: should Category be its own first-class type, or should it be communicated via a metadata side channel? I lean towards the latter because it allows compliant implementations to still handle Categorical values without having to write additional code to handle them explicitly (i.e. if it's a Category[UTF16], I need to know how to convert this to something my implementation supports). However, if it's just dictionary-encoded Utf16 that happens to be a categorical variable, then the implementation can handle it fine, and systems that care about explicit categorical values can inspect the metadata and treat it accordingly.

One other question regarding one of your points: I assumed Schemas were immutable; does that match your understanding as well?

> C++: Add a Category nested type
> -------------------------------
>
>                 Key: ARROW-81
>                 URL: https://issues.apache.org/jira/browse/ARROW-81
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>
> A Category (or "factor") is a dictionary-encoded array whose dictionary has
> semantic meaning. The data consists of
> - An array of integer "codes"
> - A child array of some other type, known as the "categories" or "levels" of
>   the array.
> Typically there is an "ordered" boolean flag indicating whether the order of
> the categories is meaningful.
> Category/factor types are used in a number of common statistical analyses.
> See, for example,
> http://www.voteview.com/R_Ordered_Logistic_or_Probit_Regression.htm. It is a
> basic requirement for Python and R, at least, as Arrow C++ consumers, to have
> this type. Separately, we should consider what is necessary to be able to
> transmit category data in IPCs -- possibly an expansion of the Arrow format.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
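
For readers following the thread, here is a minimal Python sketch of the "metadata side channel" option discussed above: a plain dictionary-encoded column (integer codes plus a dictionary of levels) with the categorical semantics carried only in key/value metadata. Everything in it is an assumption made for illustration: `DictionaryEncoded`, `dictionary_encode`, `is_categorical`, and the metadata keys `"categorical"` and `"ordered"` are hypothetical names, not part of the Arrow format or of any Arrow library, and null handling is ignored.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, List, Optional, Sequence


@dataclass(frozen=True)
class DictionaryEncoded:
    """Hypothetical container for a dictionary-encoded column (not an Arrow API)."""
    codes: List[int]            # one integer code per row
    dictionary: List[Hashable]  # distinct values ("categories" / "levels")
    metadata: Dict[str, str] = field(default_factory=dict)  # side channel

    def decode(self) -> List[Hashable]:
        """Expand the codes back into the original values."""
        return [self.dictionary[code] for code in self.codes]


def dictionary_encode(values: Sequence[Hashable],
                      metadata: Optional[Dict[str, str]] = None) -> DictionaryEncoded:
    """Assign codes in order of first appearance; attach optional metadata."""
    index: Dict[Hashable, int] = {}
    codes = [index.setdefault(value, len(index)) for value in values]
    return DictionaryEncoded(codes, list(index), dict(metadata or {}))


def is_categorical(column: DictionaryEncoded) -> bool:
    """A consumer that cares about categories inspects the (assumed) metadata key."""
    return column.metadata.get("categorical") == "true"


# A generic consumer only needs decode(); a statistics-aware consumer can
# additionally check is_categorical() and the "ordered" flag.
column = dictionary_encode(
    ["low", "high", "low", "mid"],
    metadata={"categorical": "true", "ordered": "false"},
)
```

Under this sketch, an implementation with no notion of categories still reads the column as ordinary dictionary-encoded strings, which is the compatibility argument made in the comment; the trade-off is that the categorical meaning is only a convention on metadata keys rather than something the type system enforces.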