Hi Shyam,

Well, "Invalid data. Deserializing page header failed." is not a very
good error message. Can you open a JIRA issue and provide a way to
reproduce the problem (e.g. code to generate a file, or a sample
file)? From what you say it seems to be an atypical usage of Parquet,
but there might be a configurable option we can add to help. IIRC the
large header limit is there to prevent runaway behavior in malformed
Parquet files. I believe we used other Parquet implementations to
guide the choice.

Thanks

On Wed, Apr 17, 2019 at 6:09 AM shyam narayan singh
<shyambits2...@gmail.com> wrote:
>
> My mistake. The max is 16MB.
>
> So, if deserialisation fails, we keep retrying until we hit the max.
> That works, but it is not efficient. It looks like the custom page
> header is not deserialisable. Will keep digging.
>
> Thanks
> Shyam
>
> On Wed, Apr 17, 2019 at 11:56 AM shyam narayan singh <
> shyambits2...@gmail.com> wrote:
>
> > Hi
> >
> > While reading a custom parquet file that has extra information embedded
> > (some custom stats), pyarrow is failing to read it.
> >
> >
> > Traceback (most recent call last):
> >   File "/tmp/pytest.py", line 19, in <module>
> >     table = dataset.read()
> >   File "/usr/local/lib/python3.7/site-packages/pyarrow/parquet.py", line 214, in read
> >     use_threads=use_threads)
> >   File "pyarrow/_parquet.pyx", line 737, in pyarrow._parquet.ParquetReader.read_all
> >   File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> > pyarrow.lib.ArrowIOError: Couldn't deserialize thrift: TProtocolException: Invalid data
> > Deserializing page header failed.
> >
> >
> >
> > Looking at the code, I realised that SerializedPageReader throws an
> > exception if the page header size goes beyond 16k (the default max).
> > There is a setter method for the max page header size, but it is used
> > only in tests.
> >
> >
> > Is there a way to get around the problem?
> >
> >
> > Regards
> >
> > Shyam
> >
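
To make the retry behaviour Shyam describes concrete, here is a rough
stdlib-Python sketch of the approach (not the actual parquet-cpp code;
the function names, the initial guess, and the 16 MB cap are assumptions
based on this thread): the reader tries to deserialize the page header
with a small size limit and doubles it on failure, giving up with the
error above once a configured maximum is reached.

```python
# Hedged sketch of the "keep trying until we hit the max" behaviour.
# All names here are illustrative, not parquet-cpp identifiers.
ASSUMED_MAX_PAGE_HEADER_SIZE = 16 * 1024 * 1024  # 16 MB, per the thread

def read_page_header(try_deserialize,
                     max_size=ASSUMED_MAX_PAGE_HEADER_SIZE,
                     initial_size=1024):
    """Call try_deserialize(size_limit); on failure, double the limit.

    try_deserialize raises ValueError when the limit is too small.
    """
    size = initial_size
    while True:
        try:
            return try_deserialize(size)
        except ValueError:
            if size >= max_size:
                # Mirrors the observed error once the cap is exhausted.
                raise IOError("Deserializing page header failed.")
            size = min(size * 2, max_size)

# Usage: a fake deserializer that only succeeds once the limit is large
# enough, standing in for the Thrift decoding step.
attempts = []
def fake_deserialize(size_limit):
    attempts.append(size_limit)
    if size_limit < 4096:
        raise ValueError("header larger than limit")
    return "header"

header = read_page_header(fake_deserialize)
```

This also shows why the retry loop is inefficient: each failed attempt
repeats the deserialization work before the limit finally fits the header.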