[ https://issues.apache.org/jira/browse/ARROW-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17662160#comment-17662160 ]
Rok Mihevc commented on ARROW-5138:
-----------------------------------

This issue has been migrated to [issue #21620|https://github.com/apache/arrow/issues/21620] on GitHub. Please see the [migration documentation|https://github.com/apache/arrow/issues/14542] for further details.

> [Python/C++] Row group retrieval doesn't restore index properly
> ---------------------------------------------------------------
>
>                 Key: ARROW-5138
>                 URL: https://issues.apache.org/jira/browse/ARROW-5138
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>   Affects Versions: 0.13.0
>            Reporter: Florian Jetter
>            Assignee: Wes McKinney
>            Priority: Minor
>              Labels: parquet, pull-request-available
>             Fix For: 0.14.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> When retrieving row groups, the index is no longer properly restored to its
> initial value and is set to a range index starting at zero no matter what.
> Version 0.12.1 restored an int64 index with the correct index values.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> print(pa.__version__)
> df = pd.DataFrame(
>     {"a": [1, 2, 3, 4]}
> )
> print("total DF")
> print(df.index)
> table = pa.Table.from_pandas(df)
> buf = pa.BufferOutputStream()
> pq.write_table(table, buf, chunk_size=2)
> reader = pa.BufferReader(buf.getvalue().to_pybytes())
> parquet_file = pq.ParquetFile(reader)
> rg = parquet_file.read_row_group(1)
> df_restored = rg.to_pandas()
> print("Row group")
> print(df_restored.index)
> {code}
> Previous behavior
> {code:python}
> 0.12.1
> total DF
> RangeIndex(start=0, stop=4, step=1)
> Row group
> Int64Index([2, 3], dtype='int64')
> {code}
> Behavior now
> {code:python}
> 0.13.0
> total DF
> RangeIndex(start=0, stop=4, step=1)
> Row group
> RangeIndex(start=0, stop=2, step=1)
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)