Currently, for the Rust version of the Parquet reader alone, some static
files covering a few types would be enough.
However, I agree with Wes that we should not rely on static binary files
for functional tests, because they are hard to maintain as Arrow evolves.
For example, the Parquet reader in C++ doesn't currently support nested
types, so Parquet files without nested types may be easier to test against;
but once it supports nested types, we would need to change the files again.
I think we should set up a program that can generate binary test files on
demand in every test run, and that can be used by the Arrow libraries
written in different languages. I'll write a proposal about this in a
Google doc.
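As a rough illustration of what such an on-demand generator could look like, here is a minimal stdlib-only Python sketch. The schema, type names, and JSON output are placeholders I made up; a real version would drive each language's Parquet/Arrow writer from a shared spec:

```python
import json
import random

# Hypothetical schema spec: column name -> logical type. A real generator
# would read a shared, language-agnostic spec file instead.
SCHEMA = {"id": "int64", "score": "double", "name": "utf8", "active": "bool"}

def generate_rows(schema, n, seed=42):
    rng = random.Random(seed)  # fixed seed => deterministic output
    generators = {
        "int64": lambda: rng.randint(-2**63, 2**63 - 1),
        "double": lambda: rng.uniform(-1e6, 1e6),
        "utf8": lambda: "".join(rng.choice("abcdef") for _ in range(8)),
        "bool": lambda: rng.random() < 0.5,
    }
    return [{col: generators[typ]() for col, typ in schema.items()}
            for _ in range(n)]

# Each language implementation would consume the same spec and seed and
# produce an identical corpus; JSON here is just a stand-in for the real
# binary (Parquet) output.
rows = generate_rows(SCHEMA, 100)
assert rows == generate_rows(SCHEMA, 100)  # reproducible across runs
print(json.dumps(rows[0], sort_keys=True))
```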

On Sun, Oct 13, 2019 at 4:25 AM Wes McKinney <wesmck...@gmail.com> wrote:

> I think the ideal scenario is to have a mix of "endogenous" unit
> testing and functional testing against real files to test for
> regressions or cross-compatibility. To criticize the work we've done
> in the C++ project, we have not done enough systematic integration
> testing IMHO, but we do test against some "bad files" that have
> accumulated.
>
> In any case, I think it's bad practice for a file format reader to
> rely exclusively on functional testing against static binary files.
>
> This could be a good opportunity to devise a language-agnostic Parquet
> integration testing strategy. Given that we're looking to add nested
> data support in C++ hopefully by the end of 2020, it would be good
> timing.
>
> On Sat, Oct 12, 2019 at 11:12 AM Andy Grove <andygrov...@gmail.com> wrote:
> >
> > I also think that there are valid use cases for checking in binary files,
> > but we have to be careful not to abuse this. For example, we might want to
> > check in a Parquet file created by a particular version of Apache Spark to
> > ensure that Arrow implementations can read it successfully (hypothetical
> > example).
> >
> > It would also be good to have a small set of Parquet files using every
> > possible data type that all implementations can use in their tests. I
> > suppose we might want one set per Arrow format version as well.
> >
> > The problem we have now, in my opinion, is that we're proposing adding
> > files on a pretty ad-hoc basis, driven by the needs of individual
> > contributors in one language implementation, and this is perhaps happening
> > because we don't already have a good set of standard test files.
> >
> > Renjie - perhaps you could comment on this. If we had these standard files
> > covering all data types, would that have worked for you in this instance?
> >
> > Thanks,
> >
> > Andy.
> >
> > On Sat, Oct 12, 2019 at 12:03 AM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> > > Hi Wes,
> > > >
> > > > I additionally would prefer generating the test corpus at test time
> > > > rather than checking in binary files.
> > >
> > >
> > > Can you elaborate on this? I think both generated-on-the-fly and example
> > > files are useful.
> > >
> > > The checked-in files catch regressions even when readers/writers can read
> > > their own data but have either incorrect or undefined behavior with
> > > regard to the specification (for example, I would imagine checking in a
> > > file as part of the fix for ARROW-6844
> > > <https://issues.apache.org/jira/browse/ARROW-6844>).
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Thu, Oct 10, 2019 at 5:30 PM Renjie Liu <liurenjie2...@gmail.com>
> > > wrote:
> > >
> > > > Thanks, Wes. Sure, I'll fix it.
> > > >
> > > > Wes McKinney <wesmck...@gmail.com> wrote on Fri, Oct 11, 2019 at 6:10 AM:
> > > >
> > > > > I just merged the PR:
> > > > > https://github.com/apache/arrow-testing/pull/11
> > > > >
> > > > > Various aspects of this make me uncomfortable, so I hope they can be
> > > > > addressed in follow-up work.
> > > > >
> > > > > On Thu, Oct 10, 2019 at 5:41 AM Renjie Liu <liurenjie2...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > I've created a ticket to track this:
> > > > > > https://issues.apache.org/jira/browse/ARROW-6845
> > > > > >
> > > > > > For the moment, can we check in the pregenerated data to unblock the
> > > > > > Rust version's Arrow reader?
> > > > > >
> > > > > > On Thu, Oct 10, 2019 at 1:20 PM Renjie Liu <liurenjie2...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > It would be fine in that case.
> > > > > > >
> > > > > > > Wes McKinney <wesmck...@gmail.com> wrote on Thu, Oct 10, 2019 at 12:58 PM:
> > > > > > >
> > > > > > >> On Wed, Oct 9, 2019 at 10:16 PM Renjie Liu <liurenjie2...@gmail.com>
> > > > > > >> wrote:
> > > > > > >> >
> > > > > > >> > 1. There already exists a low-level Parquet writer that can
> > > > > > >> > produce Parquet files, so unit tests should be fine. But the
> > > > > > >> > writer from Arrow to Parquet doesn't exist yet, and it may take
> > > > > > >> > some time to finish it.
> > > > > > >> > 2. In fact, my data are randomly generated and it's definitely
> > > > > > >> > reproducible. However, I don't think it would be a good idea to
> > > > > > >> > randomly generate data every time we run CI, because it would
> > > > > > >> > be difficult to debug. For example, if PR A introduced a bug
> > > > > > >> > that is triggered in another PR's build, it would be confusing
> > > > > > >> > for contributors.
> > > > > > >>
> > > > > > >> Presumably any random data generation would use a fixed seed,
> > > > > > >> precisely to be reproducible.
> > > > > > >>
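As a small illustration of the fixed-seed point (the function name and seed value below are made up for the sketch): pick a seed per CI run, log it, and any failing build can be replayed with identical data.

```python
import random

# Sketch: seeding the generator makes random test data deterministic,
# so a failure seen in one CI run can be reproduced locally.
def make_test_values(seed, n=5):
    rng = random.Random(seed)  # fixed seed => same sequence every time
    return [rng.randint(0, 1000) for _ in range(n)]

seed = 20191010  # in CI this could vary per run, but it must be logged
print(f"running with seed={seed}")
first = make_test_values(seed)
replay = make_test_values(seed)  # re-running with the logged seed
assert first == replay  # identical data, so the failure is reproducible
```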
> > > > > > >> > 3. I think it would be a good idea to spend effort on
> > > > > > >> > integration tests with Parquet, because it's an important use
> > > > > > >> > case of Arrow. A similar approach could also be extended to
> > > > > > >> > other languages and other file formats (Avro, ORC).
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney <wesmck...@gmail.com>
> > > > > > >> > wrote:
> > > > > > >> >
> > > > > > >> > > There are a number of issues worth discussing.
> > > > > > >> > >
> > > > > > >> > > 1. What is the timeline/plan for Rust implementing a Parquet
> > > > > > >> > > _writer_? It's OK to rely on other libraries in the short
> > > > > > >> > > term to produce files to test against, but that does not
> > > > > > >> > > strike me as a sustainable long-term plan. Fixing bugs can be
> > > > > > >> > > a lot more difficult than it needs to be if you can't write
> > > > > > >> > > targeted "endogenous" unit tests.
> > > > > > >> > >
> > > > > > >> > > 2. Reproducible data generation
> > > > > > >> > >
> > > > > > >> > > I think if you're going to test against a pre-generated
> > > > > > >> > > corpus, you should make sure that generating the corpus is
> > > > > > >> > > reproducible for other developers (i.e. with a Dockerfile),
> > > > > > >> > > and can be extended by adding new files or random data
> > > > > > >> > > generation.
> > > > > > >> > >
> > > > > > >> > > I additionally would prefer generating the test corpus at
> > > > > > >> > > test time rather than checking in binary files. If this isn't
> > > > > > >> > > viable right now, we can create an "arrow-rust-crutch" git
> > > > > > >> > > repository for you to stash binary files until some of these
> > > > > > >> > > testing scalability issues are addressed.
> > > > > > >> > >
> > > > > > >> > > If we're going to spend energy on Parquet integration
> > > > > > >> > > testing with Java, this would be a good opportunity to do the
> > > > > > >> > > work in a way where the C++ Parquet library can also
> > > > > > >> > > participate (since we ought to be doing integration tests
> > > > > > >> > > with Java, and we can also read JSON files to Arrow).
> > > > > > >> > >
> > > > > > >> > > On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu <liurenjie2...@gmail.com>
> > > > > > >> > > wrote:
> > > > > > >> > > >
> > > > > > >> > > > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove <andygrov...@gmail.com>
> > > > > > >> > > > wrote:
> > > > > > >> > > >
> > > > > > >> > > > > I'm very interested in helping to find a solution to
> > > > > > >> > > > > this, because we really do need integration tests for
> > > > > > >> > > > > Rust to make sure we're compatible with other
> > > > > > >> > > > > implementations... there is also the ongoing CI
> > > > > > >> > > > > dockerization work that I feel is related.
> > > > > > >> > > > >
> > > > > > >> > > > > I haven't looked at the current integration tests yet
> > > > > > >> > > > > and would appreciate some pointers on how all of this
> > > > > > >> > > > > works (do we have docs?) or where to start looking.
> > > > > > >> > > > >
> > > > > > >> > > > I have a test in my latest PR:
> > > > > > >> > > > https://github.com/apache/arrow/pull/5523
> > > > > > >> > > > And here is the generated data:
> > > > > > >> > > > https://github.com/apache/arrow-testing/pull/11
> > > > > > >> > > > As for the program that generates these data, it's just a
> > > > > > >> > > > simple Java program. I'm not sure whether we need to
> > > > > > >> > > > integrate it into Arrow.
> > > > > > >> > > >
> > > > > > >> > > > >
> > > > > > >> > > > > I imagine the integration test could follow the approach
> > > > > > >> > > > > that Renjie is outlining, where we call Java to generate
> > > > > > >> > > > > some files and then call Rust to parse them?
> > > > > > >> > > > >
> > > > > > >> > > > > Thanks,
> > > > > > >> > > > >
> > > > > > >> > > > > Andy.
> > > > > > >> > > > >
> > > > > > >> > > > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu <liurenjie2...@gmail.com>
> > > > > > >> > > > > wrote:
> > > > > > >> > > > >
> > > > > > >> > > > > > Hi:
> > > > > > >> > > > > >
> > > > > > >> > > > > > I'm developing the Rust version of a reader that reads
> > > > > > >> > > > > > Parquet into Arrow arrays. To verify the correctness of
> > > > > > >> > > > > > this reader, I use the following approach:
> > > > > > >> > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > > >    1. Define the schema with protobuf.
> > > > > > >> > > > > >    2. Generate JSON data for this schema using another
> > > > > > >> > > > > >    language with a more sophisticated implementation
> > > > > > >> > > > > >    (e.g. Java).
> > > > > > >> > > > > >    3. Generate Parquet data for this schema using another
> > > > > > >> > > > > >    language with a more sophisticated implementation
> > > > > > >> > > > > >    (e.g. Java).
> > > > > > >> > > > > >    4. Write tests that read the JSON file and the Parquet
> > > > > > >> > > > > >    file into memory (Arrow arrays), then compare the JSON
> > > > > > >> > > > > >    data with the Arrow data.
> > > > > > >> > > > > >
> > > > > > >> > > > > > I think with this method we can guarantee the
> > > > > > >> > > > > > correctness of the Arrow reader, because the JSON format
> > > > > > >> > > > > > is ubiquitous and its implementations are more stable.
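A hypothetical sketch of the comparison step described above (the `read_parquet_columns` stub and the tiny JSON payload are placeholders, not the real reader): the JSON file acts as the reference, and the reader's output is checked column by column.

```python
import json

# The JSON file is the reference produced by a mature implementation;
# `read_parquet_columns` is a made-up stub standing in for the
# Parquet-to-Arrow reader under test.
reference_json = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'

def columns_from_json(text):
    # pivot row-oriented JSON records into column-oriented lists,
    # matching the shape of Arrow arrays
    rows = json.loads(text)
    return {key: [row[key] for row in rows] for key in rows[0]}

def read_parquet_columns():
    # placeholder: a real test would read the Parquet file generated
    # from the same schema and data as the JSON reference
    return {"id": [1, 2], "name": ["a", "b"]}

expected = columns_from_json(reference_json)
actual = read_parquet_columns()
assert expected == actual  # column-wise comparison against the reference
print("reader output matches JSON reference")
```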
> > > > > > >> > > > > >
> > > > > > >> > > > > > Any comment is appreciated.
> > > > > > >> > > > > >
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > > > --
> > > > > > >> > > > Renjie Liu
> > > > > > >> > > > Software Engineer, MVAD
> > > > > > >> > >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > --
> > > > > > >> > Renjie Liu
> > > > > > >> > Software Engineer, MVAD
> > > > > > >>
> > > > > > >
> > > > > >
> > > > > > --
> > > > > > Renjie Liu
> > > > > > Software Engineer, MVAD
> > > > >
> > > >
> > >
>


-- 
Renjie Liu
Software Engineer, MVAD
