Hi,

I’m experiencing a problem reading parquet files written with the
`use_dictionary=[]` option in pyarrow 2.0.0. If I write a parquet file
with 2.0.0, reading it with 8.0.0 gives:

>>> pd.read_parquet('dataset.parq')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/io/parquet.py", line 493, in read_parquet
    return impl.read(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/io/parquet.py", line 240, in read
    result = self.api.parquet.read_table(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet/__init__.py", line 2780, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet/__init__.py", line 2443, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 304, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2549, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Unexpected end of stream

It’s easy to replicate (link to a sample parquet file:
https://www.dropbox.com/s/portxgch3fpovnz/test2.parq?dl=0, or a gist to
create your own: https://gist.github.com/bivald/f93448eaf25808284c4029c691a58a6a)
with the following schema:

schema = pa.schema([
    ("col1", pa.int8()),
    ("col2", pa.string()),
    ("col3", pa.float64()),
    ("col4", pa.dictionary(pa.int32(), pa.string(), ordered=False))
])
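Since pandas just calls pyarrow.parquet.read_table under the hood (as in the
traceback above), the error can also be reproduced with pyarrow directly; a
minimal sketch, assuming the sample file above:

    import pyarrow.parquet as pq

    # Fails with "OSError: Unexpected end of stream" under pyarrow 8.0.0
    table = pq.read_table('test2.parq')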

Actually, opening the file as a ParquetFile works (as long as I don’t read
the row group). Something like the following sketch (assuming the sample
file name) prints the metadata shown below:
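
    import pyarrow.parquet as pq

    pf = pq.ParquetFile('test2.parq')
    print(pf.metadata)       # works fine
    # pf.read_row_group(0)   # reading the row group raises the same OSError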

<pyarrow._parquet.FileMetaData object at 0x7f79c134c360>
  created_by: parquet-cpp version 1.5.1-SNAPSHOT
  num_columns: 4
  num_rows: 5
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 858

Is there any way to make pyarrow==8.0.0 read these parquet files? Or at
least a way to convert them from 2 to 8? Not using use_dictionary works,
but unfortunately I already have hundreds of gigabytes of these parquet
files across a lot of environments.
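
For the conversion route, the best I can come up with is a one-off rewrite
from an environment that can still read the files; a rough sketch, assuming
pyarrow 2.0.0 (or 3.0.0) is installed there and that default dictionary
encoding is acceptable for the rewritten files:

    # Run under a pyarrow version that can still read the file (e.g. 2.0.0)
    import pyarrow.parquet as pq

    table = pq.read_table('test2.parq')
    # Rewrite without the explicit use_dictionary list; the result should
    # then be readable by newer pyarrow versions
    pq.write_table(table, 'test2-converted.parq', compression='snappy')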

If I write the file using pyarrow==3.0.0, I can read it all the way from
3.0.0 to 8.0.0, but not with 2.0.0.

Regards,
Niklas

Full sample code:

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    ("col1", pa.int8()),
    ("col2", pa.string()),
    ("col3", pa.float64()),
    ("col4", pa.dictionary(pa.int32(), pa.string(), ordered=False))
])

table = pa.table([
    [1, 2, 3, 4, 5],
    ["a", "b", "c", "d", "e"],
    [1.0, 2.0, 3.0, 4.0, 5.0],
    ["a", "a", "a", "b", "b"]
], schema=schema)

output_file = 'test2.parq'

with pq.ParquetWriter(
        output_file,
        schema,
        compression='snappy',
        allow_truncated_timestamps=True,
        version='2.0',  # Highest available format version
        data_page_version='2.0',  # Highest available data page version
        # Convert these columns to categorical values, must be bytes keys
        # as seen on
        # https://stackoverflow.com/questions/56377848/writing-stream-of-big-data-to-parquet-with-python
        use_dictionary=[category.encode('utf-8') for category in ['col4']],
) as writer:
    writer.write_table(
        table,
        row_group_size=10000
    )
