Re: Confronting Arrow packaging problems

Phillip Cloud Sat, 24 Mar 2018 13:14:15 -0700

I think we need to use a tool that can perform every single step of the
deployment process, end-to-end. Right now, CMake isn't cutting it IMO
because it lends itself quite heavily to copy-pasting and oodles of bash
scripts that are indecipherable to anyone except the original author.

With that in mind, here's what I think the requirements are for the next
generation of Arrow's build and deployment system for native code (C, C++,
Python):

For every platform (Linux, OS X, and Windows):

1. Build sources (these languages are currently the most cumbersome)
2. Run all tests for each language (ideally integration tests as well, but
I'm not sure if that should be a hard requirement at the moment)
3. Build the API documentation for each platform
4. Build installable packages that we support:
  * conda
  * pip wheels
  * deb
  * rpm

Additional requirements:
1. Be able to run all of these steps on any platform with minimal
environment setup.
2. The output of the doc build and package build steps **should be in a
release-ready state at all times**. If they are not, then we should fail the
build.
3. Ideally, these run on every PR so that we can find out if a commit would
introduce a change that would break the release-ready status of Arrow.

These requirements indicate to me that a single, extensible, cross-platform
tool--as opposed to many tools that are tied together by a shell script--is
what we need.
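
As a rough sketch of what "one tool, every step" means in practice, here is
a minimal Python driver that runs the four stages listed above in order and
treats any failure as a loss of release-ready status. This is only an
illustration, not Bazel, Buck, or anything that exists in Arrow today: the
names Step, run_pipeline, is_release_ready, and STEPS are hypothetical, and
the commands are placeholders that just print a message.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Step:
    """One stage of the pipeline (hypothetical structure)."""
    name: str
    command: Sequence[str]


def placeholder(label: str) -> List[str]:
    # Stand-in command that only prints a label; a real setup would invoke
    # the actual build, test, doc, or packaging command here.
    return [sys.executable, "-c", f"print({label!r})"]


# The four stages from the requirements list, in order (assumed layout).
STEPS = [
    Step("build", placeholder("build C, C++, and Python sources")),
    Step("test", placeholder("run tests for each language")),
    Step("docs", placeholder("build API documentation")),
    Step("package", placeholder("build conda, wheel, deb, and rpm packages")),
]


def run_pipeline(
    steps: Sequence[Step],
    runner: Callable[[List[str]], int] = subprocess.call,
) -> List[Tuple[str, int]]:
    """Run steps in order, stopping at the first non-zero exit code."""
    results = []
    for step in steps:
        code = runner(list(step.command))
        results.append((step.name, code))
        if code != 0:
            break  # later steps are skipped once something fails
    return results


def is_release_ready(
    steps: Sequence[Step], results: Sequence[Tuple[str, int]]
) -> bool:
    """Release-ready only if every step ran and every step succeeded."""
    return len(results) == len(steps) and all(code == 0 for _, code in results)


def main() -> int:
    results = run_pipeline(STEPS)
    return 0 if is_release_ready(STEPS, results) else 1
```

A CI job on each platform could then call sys.exit(main()) on every PR, so
a broken doc or package build fails the PR instead of surfacing after a
release vote.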

There are a few tools in this space that I'm aware of:

1. Bazel (out of Google)
2. Buck (out of Facebook)

I'm not sure what others are out there, but I'm sure there must be some.

I don't really have a strong opinion on either Bazel or Buck, but I suspect
that since we follow Google's conventions in a few places, integrating Bazel
into the Arrow codebase would be less work.

The main risk I see here is that it's possible that Bazel isn't the right
tool. I'm not sure how to mitigate this risk other than by scouring the
Bazel docs to make sure that it can meet our requirements.

I do think that the fact that Bazel is extensible mitigates some risk here.
For example, we'd likely have to add rules for building conda packages and
pip wheels.

I guess CMake is extensible too, but I don't think I've ever seen the
extensibility features of CMake as anything but a burden. Bazel's extension
language is a subset of Python and I would therefore expect it to be a lot
easier to use.

I'm interested to hear others' experiences and opinions on similar
problems. Also, if I've missed anything in the requirements list, please
don't hesitate to respond.

Let's fix our packaging!

-Phillip

On Fri, Mar 23, 2018 at 11:21 PM Holden Karau <hol...@pigscanfly.ca> wrote:

> I know in Spark we’ve benefited from having some of the different language
> devs act as RMs, and each time that language dev has ended up improving a
> bunch of how their component's packaging has been done. Not to suggest we
> should just do what other projects do, but maybe an idea to consider?
>
> On Fri, Mar 23, 2018 at 12:59 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi folks,
> >
> > So, I want to bring light to the problems we are having delivering
> > binary artifacts after Arrow releases.
> >
> > We have some amount of packaging automation implemented in
> > https://github.com/apache/arrow-dist using Travis CI and Appveyor to
> > upload packages to Bintray, a package hosting service.
> >
> > Unfortunately, we discovered a bunch of problems with these packaging
> > scripts after the release vote closed on Monday, and now 4 days later,
> > we still have been unable to post binaries to
> > https://pypi.python.org/pypi/pyarrow
> >
> > This is no one's fault, but it highlights structural problems with our
> > development process:
> >
> > * Why does producing packages after a release require error-prone manual
> > labor?
> >
> > * Why are we only finding out about packaging problems after a release
> > vote closes?
> >
> > * Why is setting up nightly binary builds a brittle and bespoke process?
> >
> > I hope all agree that:
> >
> > * Packaging should not be a hardship or require a lot of manual labor
> >
> > * Packaging problems on the master branch should be made known within
> > ~24 hours, so they can be remedied immediately
> >
> > * It should be straightforward to produce binary artifacts for all
> > supported platforms and programming languages
> >
> > Eventually, we should include some binary artifacts in our release
> > votes, but we are pretty far away from suitable automation to make
> > this possible.
> >
> > I don't know any easy solutions, but Apache Arrow has grown widely
> > used enough that I think it's worth our taking the time to plan and
> > execute some solutions to these problems, which I expect to pay
> > dividends in our community's productivity over time.
> >
> > Thanks,
> > Wes
> >
> --
> Twitter: https://twitter.com/holdenkarau
>
