Those instructions are a bit out of date after the monorepo merge; see
https://github.com/apache/arrow/blob/master/docs/source/developers/cpp.rst#apache-parquet-development
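The short version is unchanged, though: the C++ tests resolve the checked-in
files from apache/parquet-testing through the PARQUET_TEST_DATA environment
variable. A minimal sketch of the pattern, with an illustrative helper and
file name (see bloom_filter-test.cc for the real code):

    // Sketch: resolve a test file the way the parquet-cpp tests do, via
    // the PARQUET_TEST_DATA environment variable. Names here are
    // illustrative, not the exact ones in the tree.
    #include <cstdlib>
    #include <iostream>
    #include <stdexcept>
    #include <string>

    std::string get_data_dir() {
      const char* dir = std::getenv("PARQUET_TEST_DATA");
      if (dir == nullptr) {
        throw std::runtime_error("PARQUET_TEST_DATA is not set");
      }
      return std::string(dir);
    }

    int main() {
      // alltypes_plain.parquet ships in apache/parquet-testing; a new
      // test input would be committed there and referenced by name the
      // same way.
      std::string path = get_data_dir() + "/alltypes_plain.parquet";
      std::cout << path << std::endl;
      return 0;
    }

So the usual way to add an existing .parquet file as a test case is to
contribute it to apache/parquet-testing and open it from the data dir like
this.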
On Mon, May 20, 2019 at 8:33 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Hi Shyam,
>
> https://github.com/apache/parquet-testing contains standalone test files.
>
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/bloom_filter-test.cc
> is an example of how this is used (search for get_data_dir).
>
> https://github.com/apache/parquet-cpp/blob/master/README.md#testing
> describes how to set up your environment to use it.
>
> Thanks,
> Micah
>
> On Monday, May 20, 2019, shyam narayan singh <shyambits2...@gmail.com> wrote:
> >
> > Hi Wes
> >
> > Sorry, this dropped off my radar. I went ahead and dug into the problem,
> > and filed the issue <https://issues.apache.org/jira/browse/ARROW-5322>.
> > Can we track the error message as part of a different bug?
> >
> > Now, I have a parquet file that can be read by the Java reader but not by
> > pyarrow. I have a fix for the issue, but I do not know how to add a test
> > case, because the existing test cases generate their files and then test
> > the readers. Is there a way to add an existing parquet file as a test
> > case to the current set of tests?
> >
> > Regards
> > Shyam
> >
> > On Tue, Apr 23, 2019 at 9:20 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > hi Shyam,
> > >
> > > Well, "Invalid data. Deserializing page header failed." is not a very
> > > good error message. Can you open a JIRA issue and provide a way to
> > > reproduce the problem (e.g. code to generate a file, or a sample
> > > file)? From what you say it seems to be an atypical usage of Parquet,
> > > but there might be a configurable option we can add to help. IIRC the
> > > large header limit is there to prevent runaway behavior on malformed
> > > Parquet files. I believe we used other Parquet implementations to
> > > guide the choice.
> > >
> > > Thanks
> > >
> > > On Wed, Apr 17, 2019 at 6:09 AM shyam narayan singh
> > > <shyambits2...@gmail.com> wrote:
> > > >
> > > > My mistake. The max is 16MB.
> > > >
> > > > So, if deserialisation fails, we keep trying until we hit the max.
> > > > That works, but it is not efficient. It looks like the custom page
> > > > header is not deserialisable. Will keep digging.
> > > >
> > > > Thanks
> > > > Shyam
> > > >
> > > > On Wed, Apr 17, 2019 at 11:56 AM shyam narayan singh <
> > > > shyambits2...@gmail.com> wrote:
> > > > >
> > > > > Hi
> > > > >
> > > > > While reading a custom parquet file that has extra information
> > > > > embedded (some custom stats), pyarrow fails to read it:
> > > > >
> > > > > Traceback (most recent call last):
> > > > >   File "/tmp/pytest.py", line 19, in <module>
> > > > >     table = dataset.read()
> > > > >   File "/usr/local/lib/python3.7/site-packages/pyarrow/parquet.py",
> > > > >     line 214, in read
> > > > >     use_threads=use_threads)
> > > > >   File "pyarrow/_parquet.pyx", line 737, in
> > > > >     pyarrow._parquet.ParquetReader.read_all
> > > > >   File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> > > > > pyarrow.lib.ArrowIOError: Couldn't deserialize thrift:
> > > > > TProtocolException: Invalid data
> > > > > Deserializing page header failed.
> > > > >
> > > > > Looking at the code, I realised that SerializedPageReader throws an
> > > > > exception if the page header size goes beyond 16k (the default max).
> > > > > There is a setter method for the max page header size, but it is
> > > > > used only in tests.
> > > > >
> > > > > Is there a way to get around the problem?
> > > > >
> > > > > Regards
> > > > > Shyam
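PS: on the header limit discussed above, the behavior Shyam describes is a
retry loop. A rough self-contained sketch of that policy, with hypothetical
names since the real loop lives inside SerializedPageReader in parquet-cpp:

    // Hypothetical sketch of the policy described in the thread: attempt
    // to deserialize the thrift page header from a small buffer, doubling
    // the allowed size on failure until the configured maximum is hit.
    #include <cstdint>
    #include <functional>
    #include <stdexcept>

    bool ReadPageHeaderWithRetry(
        // Stand-in for the real thrift deserialization call; returns
        // true if the header parsed within the given size budget.
        const std::function<bool(uint32_t)>& try_deserialize,
        uint32_t initial_size = 16 * 1024,        // 16 KB starting guess
        uint32_t max_size = 16 * 1024 * 1024) {   // 16 MB default cap
      for (uint32_t size = initial_size; size <= max_size; size *= 2) {
        if (try_deserialize(size)) return true;  // header parsed
      }
      throw std::runtime_error("Deserializing page header failed.");
    }

This is why a header that never deserializes (e.g. one carrying
unrecognized custom fields) pays for every doubling before failing, and why
raising the cap via the test-only set_max_page_header_size setter only
helps when the header is valid but larger than the default maximum.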