Karl Dunkle Werner created ARROW-5480:
-----------------------------------------

             Summary: [Python] Pandas categorical type doesn't survive a round-trip through parquet
                 Key: ARROW-5480
                 URL: https://issues.apache.org/jira/browse/ARROW-5480
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
    Affects Versions: 0.13.0, 0.11.1
         Environment: python: 3.7.3.final.0
                      python-bits: 64
                      OS: Linux
                      OS-release: 5.0.0-15-generic
                      machine: x86_64
                      processor: x86_64
                      byteorder: little
                      pandas: 0.24.2
                      numpy: 1.16.4
                      pyarrow: 0.13.0
            Reporter: Karl Dunkle Werner


A string categorical column written from pandas to parquet is read back as string (object dtype). I expected it to be read back as category.
The same thing happens if the category is numeric -- a numeric category is read back as int64.

In the code below, I tried out an in-memory arrow Table, which successfully translates categories back to pandas. However, when I write to a parquet file and read it back, the category dtype is lost.

In the scheme of things, this isn't a big deal, but it's a small surprise.

{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b', 'b'])})
df.dtypes  # category

# This works: the in-memory round-trip keeps the dtype
pa.Table.from_pandas(df).to_pandas().dtypes  # category

df.to_parquet("categories.parquet")

# This reads back object, but I expected category
pd.read_parquet("categories.parquet").dtypes  # object


# Numeric categories have the same issue:
df_num = pd.DataFrame({'x': pd.Categorical([1, 1, 2, 2])})
df_num.dtypes  # category

pa.Table.from_pandas(df_num).to_pandas().dtypes  # category

df_num.to_parquet("categories_num.parquet")

# This reads back int64, but I expected category
pd.read_parquet("categories_num.parquet").dtypes  # int64
{code}
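One possible stopgap, sketched here as an assumption rather than anything confirmed in the report, is to remember the categorical dtypes before writing and re-apply them after reading. The names `saved_dtypes`, `df_read` and `restored` are hypothetical, and the parquet round-trip is simulated with a plain cast to object so the snippet needs only pandas.

```python
import pandas as pd

# Frame with a categorical column, as in the report.
df = pd.DataFrame({"x": pd.Categorical(["a", "a", "b", "b"])})

# Remember the exact dtype of every categorical column before writing.
saved_dtypes = {
    col: df[col].dtype
    for col in df.select_dtypes(include="category").columns
}

# Stand-in (assumption) for the frame returned by pd.read_parquet,
# where the categorical column has come back as object dtype.
df_read = df.astype({"x": object})

# Re-apply the saved dtypes; categories and their order are preserved.
restored = df_read.astype(saved_dtypes)
print(restored.dtypes)
```

This only helps when the original dtypes are still available to the reader, for example within the same script; it does not address the parquet reader itself.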