Wes McKinney created ARROW-1658:
-----------------------------------

             Summary: [Python] Out-of-bounds dictionary indices cause segfault 
after converting to pandas
                 Key: ARROW-1658
                 URL: https://issues.apache.org/jira/browse/ARROW-1658
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.7.1
            Reporter: Wes McKinney
             Fix For: 0.8.0


Minimal reproduction:

{code}
import numpy as np
import pandas as pd
import pyarrow as pa
 
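# One dictionary value but indices 0..99: every index except 0 is out of bounds.
num = 100
arr = pa.DictionaryArray.from_arrays(
    np.arange(0, num),
    np.array(['a'], dtype=object),
    np.zeros(num, dtype=bool),  # mask: all False, so every index is non-null
    True)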

print(arr.to_pandas())
{code}

Nowhere in the Arrow codebase do we validate that dictionary indices are in 
bounds, and pandas appears to trust the indices it is given without checking 
them. In the reproduction above the dictionary has one entry but the indices 
run from 0 to 99, so 99 of the 100 indices point past the end of the 
dictionary. We should add a method somewhere to validate that the non-null 
dictionary indices are not out of bounds (perhaps in 
{{CategoricalBlock::WriteIndices}}).
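
To make the proposed check concrete, here is a minimal sketch in NumPy of what 
such a validation could look like. The function name 
check_dictionary_indices is hypothetical and is not part of pyarrow; the 
sketch assumes the same mask convention as the reproduction (True marks a 
null slot) and says nothing about how the real check would be written in the 
C++ conversion code.

```python
import numpy as np


def check_dictionary_indices(indices, dictionary_length, mask=None):
    """Raise IndexError if a non-null index is outside [0, dictionary_length).

    Hypothetical helper for illustration only; not a pyarrow API.
    `mask` is assumed to use True for null slots, as in the reproduction.
    """
    indices = np.asarray(indices)
    if mask is None:
        valid = np.ones(len(indices), dtype=bool)
    else:
        valid = ~np.asarray(mask, dtype=bool)

    # Only non-null slots are checked; null slots may hold any value.
    out_of_bounds = valid & ((indices < 0) | (indices >= dictionary_length))
    if out_of_bounds.any():
        first = int(np.flatnonzero(out_of_bounds)[0])
        raise IndexError(
            "dictionary index {} at position {} is out of bounds for a "
            "dictionary of length {}".format(
                int(indices[first]), first, dictionary_length))
```

Applied to the reproduction's inputs, 
check_dictionary_indices(np.arange(0, 100), 1, np.zeros(100, dtype=bool)) 
would raise at position 1, the first index that does not refer to the single 
dictionary entry.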

As an aside: when doing analytics on categorical data, external data may at 
times contain out-of-bounds index values. We should plan for these cases and 
decide whether to raise an exception or treat them as null.
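
The two options can be compared side by side in a small sketch. The function 
to_category_codes and its on_invalid parameter are hypothetical names chosen 
for illustration, not an existing or planned pyarrow interface. Null slots are 
emitted as -1, which is the code pandas uses for a missing value in a 
Categorical.

```python
import numpy as np


def to_category_codes(indices, dictionary_length, mask=None,
                      on_invalid="raise"):
    """Return pandas-style category codes, with -1 for null slots.

    Hypothetical sketch: `on_invalid` selects what happens to non-null
    indices outside [0, dictionary_length): "raise" or "null".
    """
    if on_invalid not in ("raise", "null"):
        raise ValueError("on_invalid must be 'raise' or 'null'")

    indices = np.asarray(indices, dtype=np.int64)
    if mask is None:
        null = np.zeros(len(indices), dtype=bool)
    else:
        null = np.asarray(mask, dtype=bool).copy()

    bad = ~null & ((indices < 0) | (indices >= dictionary_length))
    if bad.any():
        if on_invalid == "raise":
            raise IndexError(
                "{} dictionary indices are out of bounds".format(
                    int(bad.sum())))
        # on_invalid == "null": demote the bad slots to nulls.
        null |= bad

    return np.where(null, -1, indices)
```

With on_invalid="null", the indices [0, 1, 2, 3] against a dictionary of 
length 1 become [0, -1, -1, -1]; with the default they raise. Raising is the 
safer default because it surfaces corrupt input, while the null policy 
silently discards values.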


