My mistake. The max is 16MB, not 16k.

So, if deserialisation fails, the reader keeps retrying until it hits the
max. That works, but it is not efficient. It looks like the custom page
header is not deserialisable. Will keep digging.
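
To make the retry behaviour concrete, here is a rough sketch in Python (not
the reader's actual C++ code). The 16MB cap is the figure above. The 16 KiB
starting size and the doubling step are my assumptions, and read_header,
toy_deserialize and HeaderError are hypothetical names used only for
illustration.

```python
import struct

INITIAL_HEADER_SIZE = 16 * 1024        # assumed first guess (16 KiB)
MAX_HEADER_SIZE = 16 * 1024 * 1024     # the 16MB cap mentioned above


class HeaderError(Exception):
    """Raised when no attempt within the cap yields a valid header."""


def read_header(buf, try_deserialize,
                initial=INITIAL_HEADER_SIZE, maximum=MAX_HEADER_SIZE):
    """Retry deserialisation with a growing window; return (header, attempts)."""
    allowed = initial
    attempts = 0
    while True:
        attempts += 1
        try:
            # Only the first `allowed` bytes are handed to the deserialiser.
            return try_deserialize(buf[:allowed]), attempts
        except ValueError:
            # Assumed policy: double the window and give up past the cap.
            allowed *= 2
            if allowed > maximum:
                raise HeaderError("Deserializing page header failed.")


def toy_deserialize(chunk):
    """Toy stand-in for thrift: 4-byte little-endian length, then payload."""
    if len(chunk) < 4:
        raise ValueError("truncated length prefix")
    (n,) = struct.unpack("<I", chunk[:4])
    if len(chunk) < 4 + n:
        raise ValueError("truncated payload")
    return chunk[4:4 + n]


if __name__ == "__main__":
    payload = b"x" * 40000
    buf = struct.pack("<I", len(payload)) + payload
    header, attempts = read_header(buf, toy_deserialize)
    print(len(header), attempts)
```

Under these assumptions, a header that can never be deserialised costs
eleven attempts (16 KiB doubling up to 16 MiB) before the error is raised,
which is why the approach works but is wasteful for a malformed header.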

Thanks
Shyam

On Wed, Apr 17, 2019 at 11:56 AM shyam narayan singh <
shyambits2...@gmail.com> wrote:

> Hi
>
> pyarrow fails to read a custom Parquet file that has extra information
> embedded (some custom stats).
>
>
> Traceback (most recent call last):
>   File "/tmp/pytest.py", line 19, in <module>
>     table = dataset.read()
>   File "/usr/local/lib/python3.7/site-packages/pyarrow/parquet.py", line 214, in read
>     use_threads=use_threads)
>   File "pyarrow/_parquet.pyx", line 737, in pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Couldn't deserialize thrift: TProtocolException: Invalid data
> Deserializing page header failed.
>
>
>
> Looking at the code, I realised that SerializedPageReader throws an
> exception if the page header size goes beyond 16k (default max). There is a
> setter method for the max page header size, but it is used only in tests.
>
>
> Is there a way to get around the problem?
>
>
> Regards
>
> Shyam
>
