Florian Jetter created ARROW-5138:
-------------------------------------
Summary: [Python/C++] Row group retrieval doesn't restore index
properly
Key: ARROW-5138
URL: https://issues.apache.org/jira/browse/ARROW-5138
Project: Apache Arrow
Issue Type: Bug
Reporter: Florian Jetter
When retrieving individual row groups, the index is no longer restored to its
original values; it is always set to a range index starting at zero.
Version 0.12.1 restored an int64 index with the correct index values.
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

print(pa.__version__)

df = pd.DataFrame({"a": [1, 2, 3, 4]})
print("total DF")
print(df.index)

# Write the 4-row table as two row groups of 2 rows each.
table = pa.Table.from_pandas(df)
buf = pa.BufferOutputStream()
pq.write_table(table, buf, chunk_size=2)

# Read back only the second row group (original rows 2 and 3).
reader = pa.BufferReader(buf.getvalue().to_pybytes())
parquet_file = pq.ParquetFile(reader)
rg = parquet_file.read_row_group(1)
df_restored = rg.to_pandas()
print("Row group")
print(df_restored.index)
{code}
Previous behavior
{code:python}
0.12.1
total DF
RangeIndex(start=0, stop=4, step=1)
Row group
Int64Index([2, 3], dtype='int64')
{code}
Behavior now
{code:python}
0.13.0
total DF
RangeIndex(start=0, stop=4, step=1)
Row group
RangeIndex(start=0, stop=2, step=1)
{code}
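To make the expected result concrete, the index values that 0.12.1 returned follow from simple arithmetic on the row group number. The sketch below is only an illustration: `expected_row_group_index` is a hypothetical helper, not a pyarrow API, and it assumes the original frame has a default zero-based RangeIndex and that every row group except possibly the last holds exactly `chunk_size` rows.

```python
def expected_row_group_index(num_rows, chunk_size, row_group):
    """Return the original index positions covered by one row group.

    Hypothetical helper for illustration only; not part of pyarrow.
    Assumes a default zero-based index and fixed-size row groups.
    """
    start = row_group * chunk_size
    if chunk_size <= 0 or row_group < 0 or start >= num_rows:
        raise IndexError("row group out of range")
    # The last row group may be shorter than chunk_size.
    stop = min(start + chunk_size, num_rows)
    return range(start, stop)


# Values from the report: 4 rows, chunk_size=2, row group 1.
print(list(expected_row_group_index(4, 2, 1)))  # prints [2, 3]
```

With the values from the reproduction, this gives the positions 2 and 3, matching the index reported for 0.12.1 rather than the zero-based range reported for 0.13.0.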