Re: [pyarrow] Parquet page header size limit

shyam narayan singh Tue, 21 May 2019 04:23:58 -0700

Hi

I have submitted the parent PR <https://github.com/apache/arrow/pull/4359> and
the submodule PR <https://github.com/apache/parquet-testing/pull/5>.


Regards
Shyam
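
For readers who want to add a similar file-based test: a test that uses a
file from the parquet-testing repository has to locate the data directory at
run time (the get_data_dir helper mentioned in the quoted messages below does
this on the C++ side). A minimal Python sketch of that idea, assuming the
directory is passed in through an environment variable. The variable name
PARQUET_TEST_DATA, the helper name get_data_file, and the file name in the
usage comment are assumptions for illustration, not taken from this thread:

```python
import os
from pathlib import Path


def get_data_file(name: str, env_var: str = "PARQUET_TEST_DATA") -> Path:
    """Return the path of a checked-in test file.

    The data directory is read from an environment variable so the test
    does not depend on where the repository was cloned. The variable
    name is an assumption made for this sketch.
    """
    root = os.environ.get(env_var)
    if not root:
        raise RuntimeError(
            f"{env_var} is not set; point it at the parquet-testing data directory"
        )
    return Path(root) / name


# Hypothetical usage inside a test:
#   path = get_data_file("some_existing_file.parquet")
#   table = pyarrow.parquet.read_table(path)
```

Failing loudly when the variable is unset makes a missing submodule checkout
show up as a clear setup error rather than as a confusing read failure.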

On Tue, May 21, 2019 at 12:09 PM shyam narayan singh <
shyambits2...@gmail.com> wrote:

> Thanks Micah and Wes. Will try to submit a PR in a day or two.
>
> Regards
> Shyam
>
> On Mon, May 20, 2019 at 10:46 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> Those instructions are a bit out of date after the monorepo merge, see
>>
>>
>> https://github.com/apache/arrow/blob/master/docs/source/developers/cpp.rst#apache-parquet-development
>>
>> On Mon, May 20, 2019 at 8:33 AM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>> >
>> > Hi Shyam,
>> > https://github.com/apache/parquet-testing contains standalone test files.
>> >
>> >
>> >
>> https://github.com/apache/arrow/blob/master/cpp/src/parquet/bloom_filter-test.cc
>> > is an example of how this is used (search for get_data_dir).
>> >
>> >
>> > https://github.com/apache/parquet-cpp/blob/master/README.md#testing
>> > describes how to set up your environment to use it.
>> >
>> > Thanks,
>> > Micah
>> >
>> >
>> >
>> > On Monday, May 20, 2019, shyam narayan singh <shyambits2...@gmail.com>
>> wrote:
>> >
>> > > Hi Wes
>> > >
>> > > Sorry, this fell off my radar. I dug into the problem and filed
>> > > the issue <https://issues.apache.org/jira/browse/ARROW-5322>. Can we
>> > > track the error message as part of a different bug?
>> > >
>> > > Now, I have a Parquet file that can be read by the Java reader but not
>> > > by pyarrow. I have a fix for the issue, but I do not know how to add a
>> > > test case, because the existing tests generate their files and then
>> > > test the readers. Is there a way to add an existing Parquet file as a
>> > > test case to the current set of tests?
>> > >
>> > > Regards
>> > > Shyam
>> > >
>> > > On Tue, Apr 23, 2019 at 9:20 AM Wes McKinney <wesmck...@gmail.com>
>> wrote:
>> > >
>> > > > hi Shyam,
>> > > >
>> > > > Well "Invalid data. Deserializing page header failed." is not a very
>> > > > good error message. Can you open a JIRA issue and provide a way to
>> > > > reproduce the problem (e.g. code to generate a file, or a sample
>> > > > file)? From what you say it seems to be an atypical usage of Parquet,
>> > > > but there might be a configurable option we can add to help. IIRC the
>> > > > large header limit is there to prevent runaway behavior in malformed
>> > > > Parquet files. I believe we used other Parquet implementations to
>> > > > guide the choice.
>> > > >
>> > > > Thanks
>> > > >
>> > > > On Wed, Apr 17, 2019 at 6:09 AM shyam narayan singh
>> > > > <shyambits2...@gmail.com> wrote:
>> > > > >
>> > > > > My mistake. The max is 16MB.
>> > > > >
>> > > > > So, if deserialisation fails, we keep trying until we hit the max.
>> > > > > That works but is not efficient. It looks like the custom page
>> > > > > header is not deserialisable. Will keep digging.
>> > > > >
>> > > > > Thanks
>> > > > > Shyam
>> > > > >
>> > > > > On Wed, Apr 17, 2019 at 11:56 AM shyam narayan singh <
>> > > > > shyambits2...@gmail.com> wrote:
>> > > > >
>> > > > > > Hi
>> > > > > >
>> > > > > > While reading a custom Parquet file that has extra information
>> > > > > > embedded (some custom stats), pyarrow fails with the following
>> > > > > > error:
>> > > > > >
>> > > > > >
>> > > > > > Traceback (most recent call last):
>> > > > > >   File "/tmp/pytest.py", line 19, in <module>
>> > > > > >     table = dataset.read()
>> > > > > >   File "/usr/local/lib/python3.7/site-packages/pyarrow/parquet.py", line 214, in read
>> > > > > >     use_threads=use_threads)
>> > > > > >   File "pyarrow/_parquet.pyx", line 737, in pyarrow._parquet.ParquetReader.read_all
>> > > > > >   File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
>> > > > > > pyarrow.lib.ArrowIOError: Couldn't deserialize thrift: TProtocolException: Invalid data
>> > > > > > Deserializing page header failed.
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > Looking at the code, I realised that SerializedPageReader throws an
>> > > > > > exception if the page header size goes beyond 16k (default max).
>> > > > > > There is a setter method for the max page header size, but it is
>> > > > > > used only in tests.
>> > > > > >
>> > > > > >
>> > > > > > Is there a way to get around the problem?
>> > > > > >
>> > > > > >
>> > > > > > Regards
>> > > > > >
>> > > > > > Shyam
>> > > > > >
>> > > >
>> > >
>>
>
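
For reference, the retry behaviour described in the thread above (keep trying
to deserialize the page header until the maximum is hit) can be sketched
roughly as follows. This is a minimal Python illustration, not the actual C++
SerializedPageReader code. The 16k and 16MB figures come from the messages
above; treating 16k as the starting window, doubling the window on each
failure, and the names read_header_size and toy_parse are assumptions made
for the sketch.

```python
from typing import Callable, Optional

INITIAL_HEADER_BYTES = 16 * 1024         # "16k" mentioned in the thread
MAX_HEADER_BYTES = 16 * 1024 * 1024      # "16MB" mentioned in the thread


def read_header_size(
    data: bytes,
    try_parse: Callable[[bytes], Optional[int]],
    initial: int = INITIAL_HEADER_BYTES,
    maximum: int = MAX_HEADER_BYTES,
) -> int:
    """Return the header length reported by try_parse.

    try_parse is a hypothetical parser: it receives the first `allowed`
    bytes and returns the header length, or None if the header does not
    fit in (or cannot be parsed from) that window.
    """
    allowed = initial
    while True:
        size = try_parse(data[:allowed])
        if size is not None:
            return size
        # Assumption: grow the window by doubling until the cap is exceeded.
        allowed *= 2
        if allowed > maximum:
            raise ValueError("Deserializing page header failed.")


def toy_parse(window: bytes) -> Optional[int]:
    """Toy stand-in for a Thrift parser: the header ends at the first NUL."""
    end = window.find(b"\x00")
    return None if end < 0 else end + 1


# A 40000-byte header does not fit in the first 16k window, so the loop
# retries with larger windows until the terminator is visible.
header_len = read_header_size(b"x" * 40000 + b"\x00" + b"payload", toy_parse)
```

The cap bounds the work done on a malformed file, which matches the
"prevent runaway behavior" rationale given earlier in the thread. The cost is
that a header that can never be parsed is retried at every window size before
the error is raised, which is the inefficiency noted above.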
