Tom Augspurger created ARROW-1897:
-------------------------------------

             Summary: Incorrect numpy_type for pandas metadata of Categoricals
                 Key: ARROW-1897
                 URL: https://issues.apache.org/jira/browse/ARROW-1897
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.8.0
            Reporter: Tom Augspurger
             Fix For: 0.9.0

If I'm reading http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format correctly, the "numpy_type" field of a `Categorical` should be the storage type used for the *codes*. It looks like pyarrow always uses 'object'.

{{
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: import pyarrow.parquet as pq

In [4]: import io

In [5]: import json

In [6]: df = pd.DataFrame({"A": [1, 2]},
   ...:                   index=pd.CategoricalIndex(['one', 'two'], name='idx'))
   ...:

In [8]: sink = io.BytesIO()
   ...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
   ...: json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
   ...:
Out[8]:
{'field_name': '__index_level_0__',
 'metadata': {'num_categories': 2, 'ordered': False},
 'name': 'idx',
 'numpy_type': 'object',
 'pandas_type': 'categorical'}
}}

From the spec:

> The numpy_type is the physical storage type of the column, which is the
> result of str(dtype) for the underlying NumPy array that holds the data. So
> for datetimetz this is datetime64[ns] and for categorical, it may be any of
> the supported integer categorical types.

So the 'numpy_type' field should be something like `'int8'` instead of `'object'`.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
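
For reference, a minimal sketch of what the quoted spec wording seems to imply for the index in the report, using only pandas. The `expected` dict is a hand-built illustration of the metadata entry, not output from pyarrow, and it omits the 'field_name' key.

```python
import pandas as pd

# Same categorical index as in the report.
idx = pd.CategoricalIndex(["one", "two"], name="idx")

# A categorical stores its labels once and keeps one small integer code per row.
# Per the spec text, numpy_type would be str(dtype) of that codes array.
codes_dtype = str(idx.codes.dtype)

# Hand-built illustration of the metadata entry the spec appears to describe.
expected = {
    "name": idx.name,
    "pandas_type": "categorical",
    "numpy_type": codes_dtype,
    "metadata": {
        "num_categories": len(idx.categories),
        "ordered": idx.ordered,
    },
}

print(expected["numpy_type"])
```

With only two categories pandas picks the smallest signed integer type for the codes, so this prints `int8`; a categorical with more categories may use a wider integer type.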