Hello,
I'm also in favour of switching the dependency direction between Parquet
and Arrow, as it would avoid a lot of duplicated code in both projects
and let parquet-cpp profit from functionality that is already available
in Arrow.
@wesm: go ahead with the JIRAs and I'll add comments or pick some of
them up.
Cheers
Uwe
On 07.09.16 04:41, Wes McKinney wrote:
hi Julien,
It makes sense to move the Parquet support for Arrow into Parquet
itself and invert the dependency. I had thought that the coupling to
Arrow C++'s IO subsystem might be tighter, but the connection between
memory allocators and file abstractions is fairly simple:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/parquet/io.h
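To illustrate how thin that connection is, the coupling can be reduced to something like the sketch below. All names here (MemoryAllocator, BufferSource, etc.) are hypothetical stand-ins, not the actual Arrow API; the point is only that a file abstraction needs nothing from the rest of the library beyond an allocator:

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical allocator interface -- the single point where the IO
// layer touches memory management (not the actual Arrow API).
class MemoryAllocator {
 public:
  virtual ~MemoryAllocator() = default;
  virtual uint8_t* Allocate(size_t size) = 0;
  virtual void Free(uint8_t* ptr) = 0;
};

// A trivial allocator backed by malloc/free.
class DefaultAllocator : public MemoryAllocator {
 public:
  uint8_t* Allocate(size_t size) override {
    return static_cast<uint8_t*>(std::malloc(size));
  }
  void Free(uint8_t* ptr) override { std::free(ptr); }
};

// Hypothetical random-access source. Reads hand back buffers obtained
// from the allocator, so a Parquet reader layered on top never needs to
// know where its memory comes from.
class BufferSource {
 public:
  BufferSource(const uint8_t* data, size_t size, MemoryAllocator* allocator)
      : data_(data), size_(size), allocator_(allocator) {}

  // Copies nbytes starting at position into a freshly allocated buffer,
  // or returns nullptr on an out-of-bounds read. The caller releases
  // the buffer through the same allocator.
  uint8_t* ReadAt(size_t position, size_t nbytes) {
    if (position + nbytes > size_) return nullptr;
    uint8_t* out = allocator_->Allocate(nbytes);
    std::memcpy(out, data_ + position, nbytes);
    return out;
  }

 private:
  const uint8_t* data_;
  size_t size_;
  MemoryAllocator* allocator_;
};
```

Swapping in a different allocator (tracking, pooled, etc.) requires no change to the file abstraction, which is why moving the integration across project boundaries looks feasible.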
I'll open appropriate JIRAs and Uwe and I can coordinate on the refactoring.
The exposure of the Parquet functionality in Python should stay inside
Arrow for now, mainly because splitting things up at this point would
make developing the Python side of things much more difficult.
- Wes
On Tue, Sep 6, 2016 at 8:27 PM, Brian Bowman <brian.bow...@sas.com> wrote:
Forgive me if interposing my first post for the Apache Arrow project on this
thread is incorrect procedure.
What Julien proposes with each storage layer producing Arrow Record Batches is
exactly how I envision it working and would certainly make Arrow integration
with SAS much more palatable. This is likely true for other storage layer
providers as well.
Brian Bowman (SAS)
On Sep 6, 2016, at 7:52 PM, Julien Le Dem <jul...@dremio.com> wrote:
Thanks Wes,
No worries, I know you are on top of those things.
On a side note, I was wondering if the arrow-parquet integration should be
in Parquet instead.
Parquet would depend on Arrow and not the other way around.
Arrow provides the API and each storage layer (Parquet, Kudu, Cassandra,
...) provides a way to produce Arrow Record Batches.
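That split could be sketched roughly as follows. Everything below is hypothetical (RecordBatchReader, FakeParquetReader, and the bare-bones RecordBatch are illustrative names, not the real Arrow or Parquet APIs): Arrow owns the interface, and each storage layer ships an implementation that depends on Arrow rather than the reverse:

```cpp
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an Arrow record batch (columns elided).
struct RecordBatch {
  int64_t num_rows;
};

// The interface Arrow would own: anything that can stream record batches.
class RecordBatchReader {
 public:
  virtual ~RecordBatchReader() = default;
  // Returns nullptr once the stream is exhausted.
  virtual std::shared_ptr<RecordBatch> Next() = 0;
};

// A storage layer's implementation (here a fake Parquet-like reader)
// would live in that project and depend on Arrow, not the other way
// around.
class FakeParquetReader : public RecordBatchReader {
 public:
  explicit FakeParquetReader(std::vector<int64_t> batch_sizes)
      : sizes_(std::move(batch_sizes)) {}

  std::shared_ptr<RecordBatch> Next() override {
    if (pos_ >= sizes_.size()) return nullptr;
    return std::make_shared<RecordBatch>(RecordBatch{sizes_[pos_++]});
  }

 private:
  std::vector<int64_t> sizes_;
  size_t pos_ = 0;
};
```

A consumer like SAS, Kudu, or Cassandra tooling would only ever see the RecordBatchReader interface, never the storage format behind it.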
Thoughts?
On Tue, Sep 6, 2016 at 3:37 PM, Wes McKinney <wesmck...@gmail.com> wrote:
hi Julien,
I'm very sorry about the inconvenience with this and the delay in
getting it sorted out. I will triage this evening by disabling the
Parquet tests in Arrow until we get the current problems under
control. When we re-enable the Parquet tests in Travis CI I agree we
should pin the version SHA.
- Wes
On Tue, Sep 6, 2016 at 5:30 PM, Julien Le Dem <jul...@dremio.com> wrote:
The Arrow cpp travis-ci build is broken right now because it depends on
parquet-cpp, which has changed in an incompatible way [1] [2] (or so it
looks to me).
Since parquet-cpp is not released yet it is totally fine to make
incompatible API changes.
However, we may want to pin Arrow's dependency on parquet-cpp (to a git
SHA?) to prevent cross-project changes from breaking the master build.
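One way that pinning might look in the Travis build script is sketched below; the SHA is a placeholder, not a real commit, and the build steps are elided:

```shell
# Sketch for ci/travis_before_script_cpp.sh: build parquet-cpp at a
# fixed commit instead of whatever master happens to be, and bump the
# pin deliberately when both projects are ready.
PARQUET_CPP_SHA=0000000000000000000000000000000000000000  # placeholder
git clone https://github.com/apache/parquet-cpp.git
pushd parquet-cpp
git checkout "$PARQUET_CPP_SHA"
# ... configure and build parquet-cpp as before ...
popd
```

The trade-off is that cross-project changes then require an explicit pin update in Arrow, which is exactly what keeps master green.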
Since I'm not one of the core cpp devs on those projects, I mainly want
to start that conversation rather than prescribe a solution. Feel free
to take this as a straw man and suggest something else.
[1] https://travis-ci.org/apache/arrow/jobs/156080555
[2]
https://github.com/apache/arrow/blob/2d8ec789365f3c0f82b1f22d76160d5af150dd31/ci/travis_before_script_cpp.sh
--
Julien