Re: [pyarrow] Parquet page header size limit

2019-05-21 Thread shyam narayan singh
Hi, I have submitted the parent PR and the submodule PR. Regards, Shyam On Tue, May 21, 2019 at 12:09 PM shyam narayan singh < shyambits2...@gmail.com> wrote: > Thanks Micah and Wes. Will try to submit a PR

Re: [pyarrow] Parquet page header size limit

2019-05-20 Thread shyam narayan singh
Thanks Micah and Wes. Will try to submit a PR in a day or two. Regards Shyam On Mon, May 20, 2019 at 10:46 PM Wes McKinney wrote: > Those instructions are a bit out of date after the monorepo merge, see > > > https://github.com/apache/arrow/blob/master/docs/source/developers/cpp.rst#apache-parq

Re: [pyarrow] Parquet page header size limit

2019-05-20 Thread Wes McKinney
Those instructions are a bit out of date after the monorepo merge, see https://github.com/apache/arrow/blob/master/docs/source/developers/cpp.rst#apache-parquet-development On Mon, May 20, 2019 at 8:33 AM Micah Kornfield wrote: > > Hi Shyam, > https://github.com/apache/parquet-testing contains s

Re: [pyarrow] Parquet page header size limit

2019-05-20 Thread Micah Kornfield
Hi Shyam, https://github.com/apache/parquet-testing contains standalone test files. https://github.com/apache/arrow/blob/master/cpp/src/parquet/bloom_filter-test.cc is an example of how this is used (search for get_data_dir). https://github.com/apache/parquet-cpp/blob/master/README.md#testing

Re: [pyarrow] Parquet page header size limit

2019-05-20 Thread shyam narayan singh
Hi Wes, Sorry, this dropped off my radar. I dug into the problem and filed the issue. Can we track the error message as part of a different bug? Now, I have a Parquet file that can be read by the Java reader but not by pyarrow. I have the fix f

Re: [pyarrow] Parquet page header size limit

2019-04-22 Thread Wes McKinney
hi Shyam, Well "Invalid data. Deserializing page header failed." is not a very good error message. Can you open a JIRA issue and provide a way to reproduce the problem (e.g. code to generate a file, or a sample file)? From what you say it seems to be an atypical usage of Parquet, but there might b

Re: [pyarrow] Parquet page header size limit

2019-04-17 Thread shyam narayan singh
My mistake. The max is 16MB. So, if deserialisation fails, we keep retrying until we hit the max, which works but is not efficient. It looks like the custom page header is not deserialisable. Will keep digging. Thanks, Shyam On Wed, Apr 17, 2019 at 11:56 AM shyam narayan singh < shyambits2...@gmail.com> w
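
The retry behaviour described in the message above (keep trying to deserialise the header until the 16MB max is hit) can be sketched roughly as follows. This is a toy illustration, not the actual parquet-cpp code: the length-prefixed header format, the 16KB starting size, the doubling step, and the names parse_toy_header and read_header are all assumptions made for the example. Only the 16MB limit and the error text come from the thread.

```python
import struct

DEFAULT_HEADER_SIZE = 16 * 1024      # assumed initial guess, not from the thread
MAX_HEADER_SIZE = 16 * 1024 * 1024   # the 16MB max mentioned in the thread


class HeaderTooShort(Exception):
    """Raised when the buffer does not yet hold a complete header."""


def parse_toy_header(buf):
    """Toy stand-in for Thrift: 4-byte little-endian length, then payload."""
    if len(buf) < 4:
        raise HeaderTooShort
    (length,) = struct.unpack_from("<I", buf, 0)
    if len(buf) < 4 + length:
        raise HeaderTooShort
    return buf[4:4 + length], 4 + length


def read_header(data, start=DEFAULT_HEADER_SIZE, limit=MAX_HEADER_SIZE):
    """Retry parsing with a doubling window until success or the limit."""
    allowed = start
    attempts = 0
    while True:
        attempts += 1
        try:
            payload, used = parse_toy_header(data[:allowed])
            return payload, used, attempts
        except HeaderTooShort:
            allowed *= 2  # header did not fit: widen the window and retry
            if allowed > limit:
                raise ValueError("Deserializing page header failed.")
```

With this sketch, a header that can never be parsed costs a full series of attempts at ever larger window sizes before the error is raised, which is the inefficiency noted above.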

[pyarrow] Parquet page header size limit

2019-04-16 Thread shyam narayan singh
Hi, pyarrow fails to read a custom Parquet file that has extra information embedded (some custom stats). Traceback (most recent call last): File "/tmp/pytest.py", line 19, in <module> table = dataset.read() File "/usr/local/lib/python3.7/site-packages/pyarrow/parquet.py"