Wes McKinney created ARROW-1658:
-----------------------------------
Summary: [Python] Out of bounds dictionary indices causes segfault
after converting to pandas
Key: ARROW-1658
URL: https://issues.apache.org/jira/browse/ARROW-1658
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.7.1
Reporter: Wes McKinney
Fix For: 0.8.0
Minimal reproduction:
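{code}
import numpy as np
import pandas as pd
import pyarrow as pa

num = 100

# Indices 0..99 against a one-element dictionary: every index except 0
# is out of bounds. The all-False mask marks no entries as null.
arr = pa.DictionaryArray.from_arrays(
    np.arange(0, num),
    np.array(['a'], object),
    np.zeros(num, bool),
    True)

# Segfaults on 0.7.1 during the conversion
print(arr.to_pandas())
{code}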
The Arrow codebase never validates that dictionary indices are in bounds, and
it seems that pandas trusts the indices it is given without checking them. We
should add a method to validate that the non-null dictionary indices are
within bounds (perhaps in {{CategoricalBlock::WriteIndices}}).
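For illustration only, the kind of check being proposed could look roughly like the sketch below. It is plain Python rather than the C++ that would actually be needed, the function name {{find_out_of_bounds}} is hypothetical, and it assumes the mask convention that a True entry marks a null slot.

```python
def find_out_of_bounds(indices, dictionary_size, mask=None):
    """Return the positions of non-null indices outside [0, dictionary_size).

    Hypothetical helper, not part of pyarrow. `mask` is assumed to use
    True for null entries; null slots are skipped.
    """
    bad_positions = []
    for pos, idx in enumerate(indices):
        if mask is not None and mask[pos]:
            continue  # null slot: its index value is never dereferenced
        if not 0 <= idx < dictionary_size:
            bad_positions.append(pos)
    return bad_positions


# Same shape as the reproduction: 100 indices, dictionary of length 1
bad = find_out_of_bounds(range(100), 1)
print(len(bad))  # 99
```

An empty result would mean the indices are safe to hand to pandas; anything else would need to be rejected or handled before the conversion.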
As an aside: when doing analytics on categorical data, externally produced
data may contain out-of-bounds index values in other places as well. We should
plan for these cases and decide whether to raise an exception or treat them
as null.
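To make the two options concrete, here is a rough sketch of what each policy would mean for the null mask. The function name {{apply_bounds_policy}} and the {{on_invalid}} parameter are hypothetical, the mask is again assumed to use True for null, and nothing here reflects an existing Arrow API.

```python
def apply_bounds_policy(indices, dictionary_size, mask=None, on_invalid="raise"):
    """Return a new null mask after handling out-of-bounds indices.

    Hypothetical helper. on_invalid="raise" raises IndexError on the first
    bad index; on_invalid="null" marks each bad slot as null instead.
    """
    if on_invalid not in ("raise", "null"):
        raise ValueError("on_invalid must be 'raise' or 'null'")

    # Copy the mask so the caller's data is left untouched
    if mask is None:
        new_mask = [False] * len(indices)
    else:
        new_mask = [bool(m) for m in mask]

    for pos, idx in enumerate(indices):
        if new_mask[pos]:
            continue  # already null, nothing to check
        if not 0 <= idx < dictionary_size:
            if on_invalid == "raise":
                raise IndexError(
                    f"index {idx} at position {pos} is out of bounds "
                    f"for a dictionary of size {dictionary_size}"
                )
            new_mask[pos] = True  # treat the out-of-bounds entry as null
    return new_mask


print(apply_bounds_policy([0, 3, 0], 1, on_invalid="null"))
```

Raising surfaces corrupt input early, while nulling keeps a pipeline running at the cost of silently dropping values, so the choice probably depends on where the data comes from.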