[ https://issues.apache.org/jira/browse/ARROW-374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uwe L. Korn resolved ARROW-374.
-------------------------------
    Resolution: Fixed

Issue resolved by pull request 249
[https://github.com/apache/arrow/pull/249]

> Python: clarify unicode vs. binary in API
> -----------------------------------------
>
>                 Key: ARROW-374
>                 URL: https://issues.apache.org/jira/browse/ARROW-374
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 0.1.0
>            Reporter: Jochen Ott
>            Assignee: Wes McKinney
>            Priority: Minor
>
> pyarrow supports Arrow's String type, which Arrow represents internally as
> BINARY with a UTF8 annotation.
> In Python 2, the pyarrow API accepts both {{unicode}} and binary strings
> ({{str}}), where the latter are assumed to be UTF-8 encoded. I find this
> approach problematic, because:
> * there is an implicit assumption that a binary {{str}} contains valid UTF-8
> data. This assumption can be wrong, however, and it's not clear what the
> consequences of passing such "invalid data" to the API are.
> * the UTF-8 assumption is not clearly documented or otherwise visible from
> the API
> * if pyarrow wants to support pure binary data in the future, a natural
> choice would be to use {{str}} as the Python 2 type. However, this would
> conflict with the current interpretation of binary {{str}} as BINARY+UTF8
>
> *Proposed solution*
> I propose to change the API so that it only accepts or returns unicode
> strings, i.e. Python 2's {{unicode}} and Python 3's {{str}}. Passing a
> Python 2 {{str}} should raise an exception, and so should passing Python 3's
> {{bytes}}.
> If raw BINARY is also supported at some point in the future, use Python 3's
> {{bytes}} and Python 2's {{str}} for it.
> As a convenience feature for API users, the API may also allow passing UTF-8
> encoded binary data as Arrow's String, but that should be an explicit, opt-in
> choice, so that API users are aware of the encoding assumptions being made.
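
The behavior proposed in the issue description can be illustrated with a minimal Python 3 sketch. The helper name coerce_to_text and its allow_utf8_bytes flag are hypothetical, chosen only for illustration; they are not part of the pyarrow API, and this is not necessarily what pull request 249 implements. The sketch accepts text by default, rejects bytes unless the caller opts in, and surfaces invalid UTF-8 as an error instead of passing it through silently.

```python
def coerce_to_text(value, allow_utf8_bytes=False):
    """Return `value` as text (Python 3 `str`).

    Hypothetical helper for illustration only; not a pyarrow function.
    """
    if isinstance(value, str):
        # Unicode text is always accepted unchanged.
        return value
    if isinstance(value, bytes):
        if not allow_utf8_bytes:
            # Binary input is rejected unless the caller opts in explicitly.
            raise TypeError(
                "expected str, got bytes; pass allow_utf8_bytes=True "
                "to decode the input as UTF-8"
            )
        # Strict decoding raises UnicodeDecodeError on invalid UTF-8.
        return value.decode("utf-8")
    raise TypeError("expected str, got %s" % type(value).__name__)


if __name__ == "__main__":
    print(coerce_to_text("caf\u00e9"))
    print(coerce_to_text(b"caf\xc3\xa9", allow_utf8_bytes=True))
    try:
        coerce_to_text(b"caf\xc3\xa9")
    except TypeError as exc:
        print("rejected:", exc)
```

With this shape, the encoding assumption is visible at the call site: a caller who passes bytes has to state that they are UTF-8, and a wrong claim fails loudly at decode time.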