Hello,
I'm also in favour of switching the dependency direction between Parquet
and Arrow, as it would avoid a lot of duplicated code in both projects
and let parquet-cpp profit from functionality that is already available
in Arrow.
@wesm: go ahead with the JIRAs and I'll add comments or pick some of
them up.
Cheers
Uwe
On 07.09.16 04:41, Wes McKinney wrote:
hi Julien,
It makes sense to move the Parquet support for Arrow into Parquet
itself and invert the dependency. I had thought that the coupling to
Arrow C++'s IO subsystem might be tighter, but the connection between
memory allocators and file abstractions is fairly simple:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/parquet/io.h
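To illustrate how thin that connection is, the coupling can be reduced to something like the sketch below. All names here (MemoryAllocator, BufferSource, etc.) are hypothetical stand-ins, not the actual Arrow API; the point is only that a file abstraction needs nothing from the rest of the library beyond an allocator:

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical allocator interface -- the single point where the IO
// layer touches memory management (not the actual Arrow API).
class MemoryAllocator {
 public:
  virtual ~MemoryAllocator() = default;
  virtual uint8_t* Allocate(size_t size) = 0;
  virtual void Free(uint8_t* ptr) = 0;
};

// A trivial allocator backed by malloc/free.
class DefaultAllocator : public MemoryAllocator {
 public:
  uint8_t* Allocate(size_t size) override {
    return static_cast<uint8_t*>(std::malloc(size));
  }
  void Free(uint8_t* ptr) override { std::free(ptr); }
};

// Hypothetical random-access source. Reads hand back buffers obtained
// from the allocator, so a Parquet reader layered on top never needs to
// know where its memory comes from.
class BufferSource {
 public:
  BufferSource(const uint8_t* data, size_t size, MemoryAllocator* allocator)
      : data_(data), size_(size), allocator_(allocator) {}

  // Copies nbytes starting at position into a freshly allocated buffer,
  // or returns nullptr on an out-of-bounds read. The caller releases
  // the buffer through the same allocator.
  uint8_t* ReadAt(size_t position, size_t nbytes) {
    if (position + nbytes > size_) return nullptr;
    uint8_t* out = allocator_->Allocate(nbytes);
    std::memcpy(out, data_ + position, nbytes);
    return out;
  }

 private:
  const uint8_t* data_;
  size_t size_;
  MemoryAllocator* allocator_;
};
```

Swapping in a different allocator (tracking, pooled, etc.) requires no change to the file abstraction, which is why moving the integration across project boundaries looks feasible.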
I'll open appropriate JIRAs and Uwe and I can coordinate on the refactoring.
The exposure of the Parquet functionality in Python should stay inside
Arrow for now, mainly because splitting things up at this point would
make developing the Python side of things much more difficult.
- Wes
On Tue, Sep 6, 2016 at 8:27 PM, Brian Bowman <brian.bow...@sas.com> wrote:
Forgive me if interposing my first post for the Apache Arrow project on this
thread is incorrect procedure.
What Julien proposes with each storage layer producing Arrow Record Batches is
exactly how I envision it working and would certainly make Arrow integration
with SAS much more palatable. This is likely true for other storage layer
providers as well.
Brian Bowman (SAS)
On Sep 6, 2016, at 7:52 PM, Julien Le Dem <jul...@dremio.com> wrote:
Thanks Wes,
No worries, I know you are on top of those things.
On a side note, I was wondering if the arrow-parquet integration should be
in Parquet instead.
Parquet would depend on Arrow and not the other way around.
Arrow provides the API and each storage layer (Parquet, Kudu, Cassandra,
...) provides a way to produce Arrow Record Batches.
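That split could be sketched roughly as follows. Everything below is hypothetical (RecordBatchReader, FakeParquetReader, and the bare-bones RecordBatch are illustrative names, not the real Arrow or Parquet APIs): Arrow owns the interface, and each storage layer ships an implementation that depends on Arrow rather than the reverse:

```cpp
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an Arrow record batch (columns elided).
struct RecordBatch {
  int64_t num_rows;
};

// The interface Arrow would own: anything that can stream record batches.
class RecordBatchReader {
 public:
  virtual ~RecordBatchReader() = default;
  // Returns nullptr once the stream is exhausted.
  virtual std::shared_ptr<RecordBatch> Next() = 0;
};

// A storage layer's implementation (here a fake Parquet-like reader)
// would live in that project and depend on Arrow, not the other way
// around.
class FakeParquetReader : public RecordBatchReader {
 public:
  explicit FakeParquetReader(std::vector<int64_t> batch_sizes)
      : sizes_(std::move(batch_sizes)) {}

  std::shared_ptr<RecordBatch> Next() override {
    if (pos_ >= sizes_.size()) return nullptr;
    return std::make_shared<RecordBatch>(RecordBatch{sizes_[pos_++]});
  }

 private:
  std::vector<int64_t> sizes_;
  size_t pos_ = 0;
};
```

A consumer like SAS, Kudu, or Cassandra tooling would only ever see the RecordBatchReader interface, never the storage format behind it.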
Thoughts?
On Tue, Sep 6, 2016 at 3:37 PM, Wes McKinney <wesmck...@gmail.com> wrote:
hi Julien,
I'm very sorry about the inconvenience with this and the delay in
getting it sorted out. I will triage this evening by disabling the
Parquet tests in Arrow until we get the current problems under
control. When we re-enable the Parquet tests in Travis CI I agree we
should pin the version SHA.
- Wes
On Tue, Sep 6, 2016 at 5:30 PM, Julien Le Dem <jul...@dremio.com> wrote:
The Arrow cpp travis-ci build is broken right now because it depends on
parquet-cpp, which has changed in an incompatible way [1] [2] (or so it
looks to me).
Since parquet-cpp is not released yet it is totally fine to make
incompatible API changes.
However, we may want to pin Arrow's dependency on parquet-cpp (to a git
SHA?) to prevent cross-project changes from breaking the master build.
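One way that pinning might look in the Travis build script is sketched below; the SHA is a placeholder, not a real commit, and the build steps are elided:

```shell
# Sketch for ci/travis_before_script_cpp.sh: build parquet-cpp at a
# fixed commit instead of whatever master happens to be, and bump the
# pin deliberately when both projects are ready.
PARQUET_CPP_SHA=0000000000000000000000000000000000000000  # placeholder
git clone https://github.com/apache/parquet-cpp.git
pushd parquet-cpp
git checkout "$PARQUET_CPP_SHA"
# ... configure and build parquet-cpp as before ...
popd
```

The trade-off is that cross-project changes then require an explicit pin update in Arrow, which is exactly what keeps master green.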
Since I'm not one of the core cpp devs on those projects, I mainly want
to start that conversation rather than prescribe a solution. Feel free
to take this as a straw man and suggest something else.
[1] https://travis-ci.org/apache/arrow/jobs/156080555
[2]
https://github.com/apache/arrow/blob/2d8ec789365f3c0f82b1f22d76160d5af150dd31/ci/travis_before_script_cpp.sh
--
Julien