Jonas Nelle created ARROW-8812:
----------------------------------

             Summary: Columns of type CategoricalIndex fails to be read back
                 Key: ARROW-8812
                 URL: https://issues.apache.org/jira/browse/ARROW-8812
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.15.1
         Environment: Python 3.7.7
                      MacOS (Darwin-19.4.0-x86_64-i386-64bit)
                      Pandas 1.0.3
                      Pyarrow 0.15.1
            Reporter: Jonas Nelle

When the columns of a data frame are a {{CategoricalIndex}}, writing the table and reading it back raises {{TypeError: data type "categorical" not understood}}:

{code:python}
import pandas as pd
from pyarrow import parquet, Table

base_df = pd.DataFrame([['foo', 'j', "1"],
                        ['bar', 'j', "1"],
                        ['foo', 'j', "1"],
                        ['foobar', 'j', "1"]],
                       columns=['my_cat', 'var', 'for_count'])

base_df['my_cat'] = base_df['my_cat'].astype('category')

df = (
    base_df
    .groupby(["my_cat", "var"], observed=True)
    .agg({"for_count": "count"})
    .rename(columns={"for_count": "my_cat_counts"})
    .unstack(level="my_cat", fill_value=0)
)
print(df)
{code}

The resulting data frame looks something like this:

|| ||my_cat_counts|| || ||
|my_cat|foo|bar|foobar|
|var| | | |
|j|2|1|1|

Then, writing and reading causes the {{TypeError}}:

{code:python}
parquet.write_table(Table.from_pandas(df), "test.pqt")
parquet.read_table("test.pqt").to_pandas()
# TypeError: data type "categorical" not understood
{code}

In the example, the columns are also a MultiIndex, but that isn't the problem. The error persists after reducing the columns to a single level:

{code:python}
df.columns = df.columns.get_level_values(1)
parquet.write_table(Table.from_pandas(df), "test.pqt")
parquet.read_table("test.pqt").to_pandas()
# TypeError: data type "categorical" not understood
{code}

This is the workaround [suggested on stackoverflow|https://stackoverflow.com/questions/55749399/how-to-fix-the-issue-of-categoricalindex-column-in-pandas]:

{code:python}
df.columns = pd.Index(list(df.columns))  # suggested fix for the time being
parquet.write_table(Table.from_pandas(df), "test.pqt")
parquet.read_table("test.pqt").to_pandas()
# no error
{code}

Are there any plans to support the pattern described here in the future?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)