A couple more points to make re: Uwe's comments:
> An important point that we should keep in (and why I was a bit concerned in
> the previous times this discussion was raised) is that we have to be careful
> to not pull everything that touches Arrow into the Arrow repository.
An important disti
hi Uwe,
I agree with your points. Currently we have 3 software artifacts:
1. Arrow C++ libraries
2. Parquet C++ libraries with Arrow columnar integration
3. C++ interop layer for Python + Cython bindings
Changes in #1 prompt an awkward workflow involving multiple PRs; as a
result of this we just
Back from vacation, I also want to finally raise my voice.
With the current state of the Parquet<->Arrow development, I see a benefit in
merging the code base for now, but not necessarily forever.
Parquet C++ is the main code base of an artefact for which an Arrow C++ adapter
is built and that
Thanks Ryan, will do. The people I'd still like to hear from are:
* Phillip Cloud
* Uwe Korn
As ASF contributors we are responsible for both being pragmatic and
acting in the best interests of the community's health and productivity.
On Tue, Aug 7, 2018 at 12:12 PM, Ryan Blue wrote:
> I don't
I don't have an opinion here, but could someone send a summary of what is
decided to the dev list once there is consensus? This is a long thread for
parts of the project I don't work on, so I haven't followed it very closely.
On Tue, Aug 7, 2018 at 8:22 AM Wes McKinney wrote:
> > It will be diff
> It will be difficult to track parquet-cpp changes if they get mixed with
> Arrow changes. Will we establish some guidelines for filing Parquet JIRAs?
> Can we enforce that parquet-cpp changes will not be committed without a
> corresponding Parquet JIRA?
I think we would use the following poli
I have a few more logistical questions to add.
It will be difficult to track parquet-cpp changes if they get mixed with
Arrow changes. Will we establish some guidelines for filing Parquet JIRAs?
Can we enforce that parquet-cpp changes will not be committed without a
corresponding Parquet JIRA?
I
Do other people have opinions? I would like to undertake this work in
the near future (the next 8-10 weeks); I would be OK with taking
responsibility for the primary codebase surgery.
Some logistical questions:
* We have a handful of pull requests in flight in parquet-cpp that
would need to be re
Thanks Tim.
Indeed, it's not very simple. Just today Antoine cleaned up some
platform code intending to improve the performance of bit-packing in
Parquet writes, and we ended up with 2 interdependent PRs
* https://github.com/apache/parquet-cpp/pull/483
* https://github.com/apache/arrow/pull/2355
hi,
On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti wrote:
> I think the circular dependency can be broken if we build a new library for
> the platform code. This will also make it easy for other projects such as
> ORC to use it.
> I also remember your proposal a while ago of having a separate pro
I think the circular dependency can be broken if we build a new library for
the platform code. This will also make it easy for other projects such as
ORC to use it.
I also remember your proposal a while ago of having a separate project for
the platform code. That project can live in the arrow repo
> The current Arrow adaptor code for parquet should live in the arrow repo.
> That will remove a majority of the dependency issues. Joshua's work would not
> have been blocked in parquet-cpp if that adapter was in the arrow repo. This
> will be similar to the ORC adaptor.
This has been suggest
A controlled fork doesn’t sound like a terrible option. Copy the code from
parquet into arrow, and for a limited period of time it would be the primary.
When that period is over, the code in parquet becomes the primary.
While arrow holds the primary, the parquet release
m
> If you still strongly feel that the only way forward is to clone the
> parquet-cpp repo and part ways, I will withdraw my concern. Having two
> parquet-cpp repos is in no way a better approach.
Yes, indeed. In my view, the next best option after a monorepo is to
fork. That would obviously be a ba
Wes,
Unfortunately, I cannot show you any practical fact-based problems of a
non-existent Arrow-Parquet mono-repo.
Bringing in related Apache community experiences is more meaningful than
how mono-repos work at Google and other big organizations.
We solely depend on volunteers and cannot hire ful
@Antoine
> By the way, one concern with the monorepo approach: it would slightly
> increase Arrow CI times (which are already too large).
A typical CI run in Arrow is taking about 45 minutes:
https://travis-ci.org/apache/arrow/builds/410119750
Parquet run takes about 28
https://travis-ci.org/ap
> I would like to point out that arrow's use of orc is a great example of how
> it would be possible to manage parquet-cpp as a separate codebase. That gives
> me hope that the projects could be managed separately some day.
Well, I don't know that ORC is the best example. The ORC C++ codebase
fe
Your point about the constraints of the ASF release process is well
taken and as a developer who's trying to work in the current environment I
would be much happier if the codebases were merged. The main issues I worry
about when you put codebases like these together are:
1. The delineation of
hi Josh,
> I can imagine use cases for parquet that don't involve arrow and tying them
> together seems like the wrong choice.
Apache is "Community over Code"; right now it's the same people
building these projects -- my argument (which I think you agree with?)
is that we should work more closel
I recently worked on an issue that had to be implemented in parquet-cpp
(ARROW-1644, ARROW-1599) but required changes in arrow (ARROW-2585,
ARROW-2586). I found the circular dependencies confusing and hard to work
with. For example, I still have a PR open in parquet-cpp (created on May
10) because
On Mon, Jul 30, 2018 at 8:50 PM, Ted Dunning wrote:
> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney wrote:
>
>>
>> > The community will be less willing to accept large
>> > changes that require multiple rounds of patches for stability and API
>> > convergence. Our contributions to Libhdfs++ in the
On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney wrote:
>
> > The community will be less willing to accept large
> > changes that require multiple rounds of patches for stability and API
> > convergence. Our contributions to Libhdfs++ in the HDFS community took a
> > significantly long time for the v
hi Antoine,
Thanks for chiming in.
On Mon, Jul 30, 2018 at 4:50 AM, Antoine Pitrou wrote:
>
> Hi Wes,
>
> Le 29/07/2018 à 01:44, Wes McKinney a écrit :
>> I believe the best way to remedy the situation is to adopt a
>> "Community over Code" approach and find a way for the Parquet and
>> Arrow C+
hi,
On Mon, Jul 30, 2018 at 6:52 PM, Deepak Majeti wrote:
> Wes,
>
> I definitely appreciate and do see the impact of contributions made by
> everyone. I made this statement not to rate any contributions but solely to
> support my concern.
> The contribution barrier is higher simply because of th
I'm not going to comment on the design of the parquet-cpp module and whether it
is “closer” to parquet or arrow.
But I do think Wes’s proposal is consistent with Apache policy. PMCs make
releases and govern communities; they don’t exist to manage code bases, except
as a means to the end of crea
Wes,
I definitely appreciate and do see the impact of contributions made by
everyone. I made this statement not to rate any contributions but solely to
support my concern.
The contribution barrier is higher simply because of the increased code,
build, and test dependencies. If the community has le
hi Deepak
On Mon, Jul 30, 2018 at 5:18 PM, Deepak Majeti wrote:
> @Wes
> My observation is that most of the parquet-cpp contributors you listed that
> overlap with the Arrow community mainly contribute to the Arrow
> bindings (parquet::arrow layer)/platform API changes in the parquet-cpp
> repo. V
@Wes
My observation is that most of the parquet-cpp contributors you listed that
overlap with the Arrow community mainly contribute to the Arrow
bindings (parquet::arrow layer)/platform API changes in the parquet-cpp
repo. Very few of them review/contribute patches to the parquet-cpp core.
I believ
Le 30/07/2018 à 10:50, Antoine Pitrou a écrit :
>
> Hi Wes,
>
> Le 29/07/2018 à 01:44, Wes McKinney a écrit :
>> I believe the best way to remedy the situation is to adopt a
>> "Community over Code" approach and find a way for the Parquet and
>> Arrow C++ development communities to operate out
Hi Wes,
Le 29/07/2018 à 01:44, Wes McKinney a écrit :
> I believe the best way to remedy the situation is to adopt a
> "Community over Code" approach and find a way for the Parquet and
> Arrow C++ development communities to operate out of the same code
> repository, i.e. the apache/arrow git rep
I do not claim to have insight into parquet-cpp development. However, from
our experience developing Ray, I can say that the monorepo approach (for
Ray) has improved things a lot. Before we tried various schemes to split
the project into multiple repos, but the build system and test
infrastructure
hi Donald,
This would make things worse, not better. Code changes routinely
involve changes to the build system, and so you could be talking about
having to make changes to 2 or 3 git repositories as the result of a
single new feature or bug fix. There isn't really a cross-repo CI
solution avail
hi Deepak,
responses inline
On Sun, Jul 29, 2018 at 10:44 PM, Deepak Majeti wrote:
> I dislike the current build system complications as well.
>
> However, in my opinion, combining the code bases will severely impact the
> progress of the parquet-cpp project and implicitly the progress of the
>
Could this work if each module were configured as a git sub-repo? The top-level
build tool would go into each sub-repo and pick the correct release version to test.
Tests in Python would depend on the cpp sub-repo to ensure the API tests still pass.
This should be the best of both worlds, if sub-repos are a supported option
I dislike the current build system complications as well.
However, in my opinion, combining the code bases will severely impact the
progress of the parquet-cpp project and implicitly the progress of the
entire parquet project.
Combining would have made much more sense if parquet-cpp were a mature
pr
hi folks,
We've been struggling for quite some time with the development
workflow between the Arrow and Parquet C++ (and Python) codebases.
To explain the root issues:
* parquet-cpp depends on "platform code" in Apache Arrow; this
includes file interfaces, memory management, miscellaneous algori