On Wed, Oct 9, 2019 at 10:16 PM Renjie Liu <liurenjie2...@gmail.com> wrote:
>
> 1. There already exists a low-level parquet writer which can produce
> parquet files, so unit tests should be fine. But a writer from arrow to
> parquet doesn't exist yet, and it may take some time to finish.
> 2. In fact my data are randomly generated and it's definitely
> reproducible. However, I don't think it would be a good idea to randomly
> generate data every time we run CI, because it would be difficult to
> debug. For example, if PR A introduced a bug that is triggered in
> another PR's build, that would be confusing for contributors.

Presumably any random data generation would use a fixed seed precisely
to be reproducible.
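For instance (a minimal sketch, assuming the rand crate's 0.8-style API;
the helper name and seed value are illustrative):

    // Sketch only: deterministic "random" test data via a fixed seed.
    // Every run -- locally and in CI -- sees the same values, so a
    // failure triggered in one PR's build is reproducible in any other.
    use rand::rngs::StdRng;
    use rand::{Rng, SeedableRng};

    fn generate_test_values(n: usize) -> Vec<i64> {
        let mut rng = StdRng::seed_from_u64(42); // hard-coded, illustrative seed
        (0..n).map(|_| rng.gen()).collect()
    }

A contributor who hits a failure can then regenerate exactly the same
input locally instead of chasing data that differs between builds.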
> 3. I think it would be a good idea to spend effort on integration tests
> with parquet because it's an important use case for arrow. A similar
> approach could also be extended to other languages and other file
> formats (Avro, ORC).
>
> On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > There are a number of issues worth discussing.
> >
> > 1. What is the timeline/plan for Rust implementing a Parquet _writer_?
> > It's OK to be reliant on other libraries in the short term to produce
> > files to test against, but that does not strike me as a sustainable
> > long-term plan. Fixing bugs can be a lot more difficult than it needs
> > to be if you can't write targeted "endogenous" unit tests.
> >
> > 2. Reproducible data generation
> >
> > I think if you're going to test against a pre-generated corpus, you
> > should make sure that generating the corpus is reproducible for other
> > developers (i.e. with a Dockerfile), and can be extended by adding new
> > files or random data generation.
> >
> > I additionally would prefer generating the test corpus at test time
> > rather than checking in binary files. If this isn't viable right now
> > we can create an "arrow-rust-crutch" git repository for you to stash
> > binary files until some of these testing scalability issues are
> > addressed.
> >
> > If we're going to spend energy on Parquet integration testing with
> > Java, this would be a good opportunity to do the work in a way where
> > the C++ Parquet library can also participate (since we ought to be
> > doing integration tests with Java, and we can also read JSON files to
> > Arrow).
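(For what it's worth, generating a file at test time with the low-level
writer mentioned above might look like the sketch below. This assumes the
Rust parquet crate's current writer API, which postdates this thread; the
schema and values are illustrative.)

    // Sketch: write a one-column Parquet file from within a test, so the
    // corpus is generated at test time instead of checked in as a binary.
    use std::fs::File;
    use std::sync::Arc;
    use parquet::data_type::Int64Type;
    use parquet::file::properties::WriterProperties;
    use parquet::file::writer::SerializedFileWriter;
    use parquet::schema::parser::parse_message_type;

    fn write_test_file(path: &str, values: &[i64]) -> parquet::errors::Result<()> {
        let schema = Arc::new(parse_message_type(
            "message test { REQUIRED INT64 id; }",
        )?);
        let props = Arc::new(WriterProperties::builder().build());
        let file = File::create(path).expect("create test file");
        let mut writer = SerializedFileWriter::new(file, schema, props)?;
        let mut row_group = writer.next_row_group()?;
        if let Some(mut col) = row_group.next_column()? {
            // Required column: no definition/repetition levels needed.
            col.typed::<Int64Type>().write_batch(values, None, None)?;
            col.close()?;
        }
        row_group.close()?;
        writer.close()?;
        Ok(())
    }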
> > On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu <liurenjie2...@gmail.com>
> > wrote:
> > >
> > > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove <andygrov...@gmail.com>
> > > wrote:
> > > >
> > > > I'm very interested in helping to find a solution to this because
> > > > we really do need integration tests for Rust to make sure we're
> > > > compatible with other implementations... there is also the ongoing
> > > > CI dockerization work that I feel is related.
> > > >
> > > > I haven't looked at the current integration tests yet and would
> > > > appreciate some pointers on how all of this works (do we have
> > > > docs?) or where to start looking.
> > >
> > > I have a test in my latest PR: https://github.com/apache/arrow/pull/5523
> > > And here is the generated data:
> > > https://github.com/apache/arrow-testing/pull/11
> > > As for the program that generates these data, it's just a simple
> > > Java program. I'm not sure whether we need to integrate it into
> > > arrow.
> > >
> > > > I imagine the integration test could follow the approach that
> > > > Renjie is outlining, where we call Java to generate some files and
> > > > then call Rust to parse them?
> > > >
> > > > Thanks,
> > > >
> > > > Andy.
> > > >
> > > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu <liurenjie2...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi:
> > > > >
> > > > > I'm developing the Rust version of a reader which reads parquet
> > > > > into arrow arrays. To verify the correctness of this reader, I
> > > > > use the following approach:
> > > > >
> > > > > 1. Define the schema with protobuf.
> > > > > 2. Generate json data for this schema using another language
> > > > > with a more sophisticated implementation (e.g. Java).
> > > > > 3. Generate parquet data for this schema using another language
> > > > > with a more sophisticated implementation (e.g. Java).
> > > > > 4. Write tests to read the json file and the parquet file into
> > > > > memory (arrow arrays), then compare the json data with the arrow
> > > > > data (a sketch of this comparison follows the thread below).
> > > > >
> > > > > I think with this method we can guarantee the correctness of the
> > > > > arrow reader, because the json format is ubiquitous and its
> > > > > implementations are more stable.
> > > > >
> > > > > Any comment is appreciated.
> > >
> > > --
> > > Renjie Liu
> > > Software Engineer, MVAD
>
> --
> Renjie Liu
> Software Engineer, MVAD
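As a postscript on step 4 above: reading the same data from json and from
parquet into Arrow and asserting equality might look like the sketch
below. This assumes the Rust arrow and parquet crates' current readers,
which postdate this thread; the file names and schema are illustrative.

    // Sketch: read one record batch from a line-delimited JSON file and
    // one from a Parquet file, then compare them as Arrow data.
    use std::fs::File;
    use std::io::BufReader;
    use std::sync::Arc;
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::json::ReaderBuilder;
    use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Illustrative schema matching the generated test data.
        let schema = Arc::new(Schema::new(vec![
            Field::new("id", DataType::Int64, false),
        ]));

        // JSON side: the reference produced by a mature implementation.
        let json_file = BufReader::new(File::open("data.json")?);
        let mut json_reader = ReaderBuilder::new(schema).build(json_file)?;
        let json_batch = json_reader.next().expect("one batch")?;

        // Parquet side: the reader under test.
        let parquet_file = File::open("data.parquet")?;
        let mut parquet_reader =
            ParquetRecordBatchReaderBuilder::try_new(parquet_file)?.build()?;
        let parquet_batch = parquet_reader.next().expect("one batch")?;

        // RecordBatch implements PartialEq, so the comparison is direct.
        assert_eq!(json_batch, parquet_batch);
        Ok(())
    }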