Hi, I'm experiencing a problem reading Parquet files written with the `use_dictionary=[...]` (per-column) option in pyarrow 2.0.0. If I write a Parquet file with 2.0.0, reading it with 8.0.0 gives:
```
>>> pd.read_parquet('dataset.parq')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/io/parquet.py", line 493, in read_parquet
    return impl.read(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/io/parquet.py", line 240, in read
    result = self.api.parquet.read_table(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet/__init__.py", line 2780, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet/__init__.py", line 2443, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 304, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2549, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Unexpected end of stream
```

It's easy to replicate (sample parquet: https://www.dropbox.com/s/portxgch3fpovnz/test2.parq?dl=0, or a gist to create your own: https://gist.github.com/bivald/f93448eaf25808284c4029c691a58a6a) with the following schema:

```python
schema = pa.schema([
    ("col1", pa.int8()),
    ("col2", pa.string()),
    ("col3", pa.float64()),
    ("col4", pa.dictionary(pa.int32(), pa.string(), ordered=False))
])
```

Opening the file as a ParquetFile does work (as long as I don't read the row group):

```
<pyarrow._parquet.FileMetaData object at 0x7f79c134c360>
  created_by: parquet-cpp version 1.5.1-SNAPSHOT
  num_columns: 4
  num_rows: 5
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 858
```

Is there any way to make pyarrow==8.0.0 read these Parquet files? Or, failing that, is there a way to convert them from 2 to 8?
Not using `use_dictionary` works, but unfortunately I already have hundreds of gigabytes of these Parquet files across a lot of environments. If I write the file with pyarrow==3.0.0 I can read it all the way from 3.0.0 to 8.0.0, but not with 2.0.0.

Regards,
Niklas

Full sample code:

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    ("col1", pa.int8()),
    ("col2", pa.string()),
    ("col3", pa.float64()),
    ("col4", pa.dictionary(pa.int32(), pa.string(), ordered=False))
])

table = pa.table([
    [1, 2, 3, 4, 5],
    ["a", "b", "c", "d", "e"],
    [1.0, 2.0, 3.0, 4.0, 5.0],
    ["a", "a", "a", "b", "b"]
], schema=schema)

output_file = 'test2.parq'

with pq.ParquetWriter(
    output_file,
    schema,
    compression='snappy',
    allow_truncated_timestamps=True,
    version='2.0',            # Highest available format version
    data_page_version='2.0',  # Highest available data page version
    # Convert these columns to categorical values; must be bytes keys, as seen on
    # https://stackoverflow.com/questions/56377848/writing-stream-of-big-data-to-parquet-with-python
    use_dictionary=[category.encode('utf-8') for category in ['col4']],
) as writer:
    writer.write_table(
        table,
        row_group_size=10000
    )
```