Hi Shyam,

https://github.com/apache/parquet-testing contains standalone test files, so you can check your file in there. https://github.com/apache/arrow/blob/master/cpp/src/parquet/bloom_filter-test.cc is an example of how those files are used in a test (search for get_data_dir), and https://github.com/apache/parquet-cpp/blob/master/README.md#testing describes how to set up your environment to use them.
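As a minimal sketch (not the exact Arrow test code), a test against a checked-in file could look like the following. Two assumptions here: the data directory is resolved from the PARQUET_TEST_DATA environment variable, as the linked test's get_data_dir() helper does, and "custom_stats.parquet" is a hypothetical name standing in for whatever file you add to parquet-testing.

#include <cstdlib>
#include <memory>
#include <string>

#include <gtest/gtest.h>

#include "parquet/file_reader.h"

namespace {

// Resolve the local checkout of apache/parquet-testing.
// Assumption: tests find it through PARQUET_TEST_DATA.
std::string GetDataDir() {
  const char* dir = std::getenv("PARQUET_TEST_DATA");
  return dir == nullptr ? std::string() : std::string(dir);
}

TEST(FileDeserializeTest, ReadsCheckedInFile) {
  // "custom_stats.parquet" is a placeholder for the checked-in file.
  const std::string path = GetDataDir() + "/custom_stats.parquet";
  // OpenFile parses the footer and throws ParquetException on a
  // malformed file, so opening plus a metadata check exercises the reader.
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile(path);
  ASSERT_GT(reader->metadata()->num_rows(), 0);
}

}  // namespace

That way the test consumes an existing file instead of generating one, which is exactly what your case needs.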
Thanks,
Micah

On Monday, May 20, 2019, shyam narayan singh <shyambits2...@gmail.com> wrote:
> Hi Wes,
>
> Sorry, this fell off my radar. I went ahead and dug into the problem and
> filed the issue <https://issues.apache.org/jira/browse/ARROW-5322>. Can we
> track the error message as part of a different bug?
>
> Now I have a parquet file that can be read by the Java reader but not by
> pyarrow. I have a fix for the issue, but I do not know how to add a test
> case, because the existing test cases generate the files and then test the
> readers. Is there a way to add an existing parquet file as a test case to
> the current set of tests?
>
> Regards,
> Shyam
>
> On Tue, Apr 23, 2019 at 9:20 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > hi Shyam,
> >
> > Well, "Invalid data. Deserializing page header failed." is not a very
> > good error message. Can you open a JIRA issue and provide a way to
> > reproduce the problem (e.g. code to generate a file, or a sample file)?
> > From what you say it seems to be an atypical usage of Parquet, but there
> > might be a configurable option we can add to help. IIRC the large header
> > limit is there to prevent runaway behavior on malformed Parquet files. I
> > believe we used other Parquet implementations to guide the choice.
> >
> > Thanks
> >
> > On Wed, Apr 17, 2019 at 6:09 AM shyam narayan singh
> > <shyambits2...@gmail.com> wrote:
> > >
> > > My mistake. The max is 16MB.
> > >
> > > So, if deserialisation fails, we keep retrying until we hit the max;
> > > that works, but it is not efficient. It looks like the custom page
> > > header is not deserialisable. Will keep digging.
> > >
> > > Thanks,
> > > Shyam
> > >
> > > On Wed, Apr 17, 2019 at 11:56 AM shyam narayan singh <
> > > shyambits2...@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > While reading a custom parquet file that has extra information
> > > > embedded (some custom stats), pyarrow fails to read it:
> > > >
> > > > Traceback (most recent call last):
> > > >   File "/tmp/pytest.py", line 19, in <module>
> > > >     table = dataset.read()
> > > >   File "/usr/local/lib/python3.7/site-packages/pyarrow/parquet.py", line 214, in read
> > > >     use_threads=use_threads)
> > > >   File "pyarrow/_parquet.pyx", line 737, in pyarrow._parquet.ParquetReader.read_all
> > > >   File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> > > > pyarrow.lib.ArrowIOError: Couldn't deserialize thrift: TProtocolException: Invalid data
> > > > Deserializing page header failed.
> > > >
> > > > Looking at the code, I realised that SerializedPageReader throws an
> > > > exception if the page header size goes beyond 16k (the default max).
> > > > There is a setter method for the max page header size, but it is used
> > > > only in tests.
> > > >
> > > > Is there a way to get around the problem?
> > > >
> > > > Regards,
> > > > Shyam
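For reference, the retry behaviour described above boils down to something like the sketch below. It is a simplified, self-contained illustration rather than the actual reader code; the function and constant names are made up, and only the 16 KB initial guess and 16 MB cap come from the thread.

#include <algorithm>
#include <cstdint>
#include <functional>
#include <stdexcept>

constexpr uint32_t kDefaultPageHeaderSize = 16 * 1024;            // 16 KB first guess
constexpr uint32_t kDefaultMaxPageHeaderSize = 16 * 1024 * 1024;  // 16 MB hard cap

// try_parse attempts to decode a page header from at most `allowed` bytes
// and throws on failure; it stands in for the thrift deserialization call.
// Returns the budget that finally succeeded.
uint32_t ParseHeaderWithRetry(const std::function<void(uint32_t)>& try_parse,
                              uint32_t max_size = kDefaultMaxPageHeaderSize) {
  uint32_t allowed = std::min(kDefaultPageHeaderSize, max_size);
  while (true) {
    try {
      try_parse(allowed);
      return allowed;  // decoded within this budget
    } catch (const std::exception&) {
      if (allowed >= max_size) throw;  // budget exhausted: surface the error
      allowed = std::min(allowed * 2, max_size);  // grow the limit and retry
    }
  }
}

Because the loop re-attempts deserialization at every step, a header that only parses near the 16 MB cap costs several failed passes, which is the inefficiency noted above; a larger initial limit (via the setter mentioned in the thread) avoids the retries.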