Re: [DISCUSS] Proposal about integration test of arrow parquet reader

2019-10-13 Thread Renjie Liu
Currently, for the Rust Parquet reader alone, a few static files covering some types would be enough. However, I agree with Wes that we should not rely on static binary files for functional tests, because they are hard to maintain as Arrow evolves. For example, currently parquet reader in

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

2019-10-12 Thread Wes McKinney
I think the ideal scenario is to have a mix of "endogenous" unit testing and functional testing against real files to test for regressions or cross-compatibility. To criticize the work we've done in the C++ project, we have not done enough systematic integration testing IMHO, but we do test against

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

2019-10-12 Thread Andy Grove
I also think that there are valid use cases for checking in binary files, but we have to be careful not to abuse this. For example, we might want to check in a Parquet file created by a particular version of Apache Spark to ensure that Arrow implementations can read it successfully (hypothetical ex

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

2019-10-11 Thread Micah Kornfield
Hi Wes, > > I additionally would prefer generating the test corpus at test time > rather than checking in binary files. Can you elaborate on this? I think both files generated on the fly and checked-in example files are useful. The checked-in files catch regressions even when readers/writers can read their own d

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

2019-10-10 Thread Renjie Liu
Thanks, Wes. Sure, I'll fix it. On Fri, Oct 11, 2019 at 6:10 AM Wes McKinney wrote: > I just merged the PR https://github.com/apache/arrow-testing/pull/11 > > Various aspects of this make me uncomfortable so I hope they can be > addressed in follow up work > > On Thu, Oct 10, 2019 at 5:41 AM Renjie Liu > wrot

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

2019-10-10 Thread Wes McKinney
I just merged the PR https://github.com/apache/arrow-testing/pull/11 Various aspects of this make me uncomfortable, so I hope they can be addressed in follow-up work. On Thu, Oct 10, 2019 at 5:41 AM Renjie Liu wrote: > > I've create ticket to track here: > https://issues.apache.org/jira/browse/ARR

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

2019-10-10 Thread Renjie Liu
I've created a ticket to track this here: https://issues.apache.org/jira/browse/ARROW-6845 For now, can we check in the pregenerated data to unblock the Rust version's Arrow reader? On Thu, Oct 10, 2019 at 1:20 PM Renjie Liu wrote: > It would be fine in that case. > > On Oct 10, 2019, Wes McKinney

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

2019-10-09 Thread Renjie Liu
It would be fine in that case. On Thu, Oct 10, 2019 at 12:58 PM Wes McKinney wrote: > On Wed, Oct 9, 2019 at 10:16 PM Renjie Liu > wrote: > > > > 1. There already exists a low level parquet writer which can produce > > parquet file, so unit test should be fine. But writer from arrow to > parquet > > doesn't

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

2019-10-09 Thread Wes McKinney
On Wed, Oct 9, 2019 at 10:16 PM Renjie Liu wrote: > > 1. There already exists a low level parquet writer which can produce > parquet file, so unit test should be fine. But writer from arrow to parquet > doesn't exist yet, and it may take some period of time to finish it. > 2. In fact my data are r

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

2019-10-09 Thread Renjie Liu
1. A low-level Parquet writer that can produce Parquet files already exists, so unit tests should be fine. But a writer from Arrow to Parquet doesn't exist yet, and it may take some time to finish. 2. In fact, my data are randomly generated and definitely reproducible. However,

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

2019-10-09 Thread Wes McKinney
There are a number of issues worth discussing. 1. What is the timeline/plan for Rust implementing a Parquet _writer_? It's OK to be reliant on other libraries in the short term to produce files to test against, but it does not strike me as a sustainable long-term plan. Fixing bugs can be a lot more d

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

2019-10-08 Thread Renjie Liu
On Wed, Oct 9, 2019 at 12:11 PM Andy Grove wrote: > I'm very interested in helping to find a solution to this because we really > do need integration tests for Rust to make sure we're compatible with other > implementations... there is also the ongoing CI dockerization work that I > feel is relat

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

2019-10-08 Thread Andy Grove
I'm very interested in helping to find a solution to this because we really do need integration tests for Rust to make sure we're compatible with other implementations... there is also the ongoing CI dockerization work that I feel is related. I haven't looked at the current integration tests yet a

[DISCUSS] Proposal about integration test of arrow parquet reader

2019-10-08 Thread Renjie Liu
Hi, I'm developing a Rust reader that reads Parquet into Arrow arrays. To verify the correctness of this reader, I use the following approach: 1. Define the schema with protobuf. 2. Generate JSON data for this schema using another language with a more sophisticated implementation (e.g. Java