Re: Arrow-Parquet integration location (Was: Arrow cpp travis-ci build broken)

Wes McKinney Tue, 06 Sep 2016 19:43:34 -0700

hi Julien,

It makes sense to move the Parquet support for Arrow into Parquet
itself and invert the dependency. I had thought that the coupling to
Arrow C++'s IO subsystem might be tighter, but the connection between
memory allocators and file abstractions is fairly simple:


https://github.com/apache/arrow/blob/master/cpp/src/arrow/parquet/io.h

I'll open appropriate JIRAs and Uwe and I can coordinate on the refactoring.

The exposure of the Parquet functionality in Python should stay inside
Arrow for now, mainly because splitting things up right now would make
developing the Python side of things much more difficult.

- Wes

On Tue, Sep 6, 2016 at 8:27 PM, Brian Bowman <[email protected]> wrote:
> Forgive me if interposing my first post for the Apache Arrow project on this 
> thread is incorrect procedure.
>
> What Julien proposes with each storage layer producing Arrow Record Batches 
> is exactly how I envision it working and would certainly make Arrow 
> integration with SAS much more palatable.  This is likely true for other 
> storage layer providers as well.
>
> Brian Bowman (SAS)
>
>> On Sep 6, 2016, at 7:52 PM, Julien Le Dem <[email protected]> wrote:
>>
>> Thanks Wes,
>> No worries, I know you are on top of those things.
>> On a side note, I was wondering if the arrow-parquet integration should be
>> in Parquet instead.
>> Parquet would depend on Arrow and not the other way around.
>> Arrow provides the API and each storage layer (Parquet, Kudu, Cassandra,
>> ...) provides a way to produce Arrow Record Batches.
>> thoughts?
>>
>>> On Tue, Sep 6, 2016 at 3:37 PM, Wes McKinney <[email protected]> wrote:
>>>
>>> hi Julien,
>>>
>>> I'm very sorry about the inconvenience with this and the delay in
>>> getting it sorted out. I will triage this evening by disabling the
>>> Parquet tests in Arrow until we get the current problems under
>>> control. When we re-enable the Parquet tests in Travis CI I agree we
>>> should pin the version SHA.
>>>
>>> - Wes
>>>
>>>> On Tue, Sep 6, 2016 at 5:30 PM, Julien Le Dem <[email protected]> wrote:
>>>> The Arrow cpp travis-ci build is broken right now because it depends on
>>>> parquet-cpp which has changed in an incompatible way. [1] [2] (or so it
>>>> looks to me)
>>>> Since parquet-cpp is not released yet it is totally fine to make
>>>> incompatible API changes.
>>>> However, we may want to pin the Arrow to Parquet dependency (on a git
>>>> sha?) to prevent cross-project changes from breaking the master build.
>>>> Since I'm not one of the core cpp devs on those projects I mainly want to
>>>> start that conversation rather than prescribe a solution. Feel free to
>>>> take this as a straw man and suggest something else.
>>>>
>>>> [1] https://travis-ci.org/apache/arrow/jobs/156080555
>>>> [2] https://github.com/apache/arrow/blob/2d8ec789365f3c0f82b1f22d76160d5af150dd31/ci/travis_before_script_cpp.sh
>>>>
>>>>
>>>> --
>>>> Julien
>>
>>
>>
>> --
>> Julien
