Bryan Cutler created ARROW-5682:
-----------------------------------

             Summary: [Python] from_pandas conversion casts values to string inconsistently
                 Key: ARROW-5682
                 URL: https://issues.apache.org/jira/browse/ARROW-5682
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
    Affects Versions: 0.13.0
            Reporter: Bryan Cutler


When calling {{pa.Array.from_pandas}} with primitive data as input and casting 
to string with {{type=pa.string()}}, the resulting pyarrow Array can have 
inconsistent values. For most input types the result is an array of empty 
strings; however, for some types (int32, int64) the values are '\x01', etc.

{noformat}
In [8]: s = pd.Series([1, 2, 3], dtype=np.uint8)

In [9]: pa.Array.from_pandas(s, type=pa.string())
Out[9]:
<pyarrow.lib.StringArray object at 0x7f90b6091a48>
[
  "",
  "",
  ""
]

In [10]: s = pd.Series([1, 2, 3], dtype=np.uint32)

In [11]: pa.Array.from_pandas(s, type=pa.string())
Out[11]:
<pyarrow.lib.StringArray object at 0x7f9097efca48>
[
  "",
  "",
  ""
]
{noformat}

This came up in the Spark discussion at 
https://github.com/apache/spark/pull/24930/files#r296187903. Type casting this 
way is not supported in Spark, but it would be good to make the behavior 
consistent. Would it be better to raise an UnsupportedOperation error?




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
