Re: [C++] The quest for zero-dependency builds

Antoine Pitrou Wed, 16 Oct 2019 06:27:14 -0700


Perhaps meson is also worth exploring?


On 15/10/2019 at 23:06, Micah Kornfield wrote:

Hi Wes,
I agree on both counts: it won't be done in the short term, and it
makes sense to tackle it incrementally.  Like I said, I don't have much
bandwidth at the moment but might be able to rearrange a few things on my
plate.  Some people have asked on the mailing list how they might
be able to help; this might be one area that doesn't require a lot of
in-depth knowledge of C++, at least for a proof of concept.  I'll try to
open some JIRAs soon.

Thanks,
Micah

On Tue, Oct 15, 2019 at 10:33 AM Wes McKinney <wesmck...@gmail.com> wrote:

hi Micah,

Definitely Bazel is worth exploring, but we must be realistic about
the amount of energy (several hundred hours or more) that's been
invested in the build system we have now. So a new build system will
be a large endeavor, but hopefully can make things simpler.

Aside from the requirements gathering process, if it is felt that
Bazel is a possible path forward in the future, it may be good to try
to break up the work into more tractable pieces. For example, a first
step would be to set up Bazel configurations to build the project's
thirdparty toolchain. Since we're reliant on ExternalProject in CMake
to do a lot of heavy lifting there for us, I imagine this (taking care
of what ThirdpartyToolchain.cmake does not) will take up a lot of the
energy.

- Wes

On Sun, Oct 13, 2019 at 1:06 PM Micah Kornfield <emkornfi...@gmail.com> wrote:



This might be taking the thread on more of a tangent, but maybe we should
start collecting requirements for the C++ build system in general and see
if there might be a better solution that can address some of these concerns?
In particular, Bazel at least on the surface seems like it might be a
better fit for some of the use cases discussed here.  I know this is a big
project (and I currently don't have much bandwidth for it) but I think if
CMake is lacking in these areas it might be worth at least exploring
instead of going down the path of building our own meta-build system on top
of CMake.

Requirements that I think we are targeting:
1.  Be able to provide an out-of-the-box build system that requires as close to
zero dependencies as possible beyond a standard C++ toolchain (e.g. "$BUILD minimal"
works on any C++ developer's desktop without additional requirements)
2.  The build system should limit configuration knobs in favor of implied
dependencies (e.g. "$BUILD python" automatically builds "compute",
"filesystem", "ipc")
3.  The build system should be configurable to use (and have the user
specify) one of "System packages", "Conda packages" or source packages for
providing dependencies (and fallback options between the three).
4.  The build system should be able to treat some dependencies as optional
(e.g. different compression libraries or allocators).
5.  Easily allow developers to avoid building code that is unnecessary for
their particular task at hand.
6.  The build system must work across the following toolchains/platforms:
     - Linux:  g++ and clang.  x86 and ARM
     - Mac
     - Windows (msys2 and MSVC)
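
A minimal sketch of the fallback behaviour in requirements 3 and 4, assuming each dependency is looked up in a preference-ordered list of sources. The function name `pick_source`, the source labels, and the availability table are hypothetical and only illustrate the idea; they are not part of any existing build system.

```python
# Source names and the example availability table are illustrative
# assumptions, not part of any existing build system.
DEFAULT_ORDER = ("system", "conda", "source")


def pick_source(dependency, available, order=DEFAULT_ORDER, optional=False):
    """Return the first source in `order` that provides `dependency`.

    `available` maps a source name to the set of dependencies it can
    provide. An optional dependency that nothing provides yields None;
    a required one raises LookupError.
    """
    for source in order:
        if dependency in available.get(source, set()):
            return source
    if optional:
        return None
    raise LookupError(f"no source provides required dependency {dependency!r}")


available = {
    "system": {"zlib"},
    "conda": {"zlib", "snappy"},
    "source": {"zlib", "snappy", "lz4"},
}
print(pick_source("snappy", available))               # conda
print(pick_source("zstd", available, optional=True))  # None
```

A user who wants source builds only could pass `order=("source",)`, which keeps the preference a single setting rather than one knob per dependency.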

Thanks,
Micah



On Thu, Oct 10, 2019 at 6:09 AM Antoine Pitrou <anto...@python.org> wrote:


Yes, we could express dependencies in a Python script and have it
generate a CMake module of if/else chains in cmake_modules (which we
would check in git to avoid having people depend on a Python install,
perhaps).

Still, that is an additional maintenance burden.
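
A rough sketch of what such a generator could look like, assuming a dependency table shaped like the `component_dependencies` dict quoted further down in this thread. The function names are hypothetical, and mapping a component name to an `ARROW_<NAME>` variable is an assumption based on the option names mentioned in the thread. Each `if` block sets the full transitive closure, so the generated module does not rely on the order of the blocks.

```python
def transitive_deps(component, dependencies, seen=None):
    """Collect every component reachable from `component` (hypothetical helper)."""
    seen = set() if seen is None else seen
    for dep in dependencies.get(component, []):
        if dep not in seen:
            seen.add(dep)
            transitive_deps(dep, dependencies, seen)
    return seen


def render_cmake_module(dependencies):
    """Render if()/endif() chains that switch on each component's dependencies."""
    lines = ["# Generated from the component dependency table; do not edit by hand."]
    for component in sorted(dependencies):
        deps = sorted(transitive_deps(component, dependencies))
        if not deps:
            continue
        # Assumption: component 'python' maps to the CMake variable ARROW_PYTHON.
        lines.append(f"if(ARROW_{component.upper()})")
        lines.extend(f"  set(ARROW_{dep.upper()} ON)" for dep in deps)
        lines.append("endif()")
    return "\n".join(lines) + "\n"


example = {
    "python": ["compute", "filesystem", "ipc"],
    "compute": [],
    "filesystem": [],
    "ipc": [],
}
print(render_cmake_module(example))
```

The output would be written into cmake_modules and committed, so only people who change the dependency table need Python.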

Regards

Antoine.


On 10/10/2019 at 14:50, Wes McKinney wrote:

I guess one question we should first discuss is: who is the C++ build
system for?

The users who are most sensitive to benchmark-driven decision making
will generally be consuming the project through pre-built binaries,
like our Python or R packages. If C++ developers build the project
from source and don't do a minimal read of the documentation to see
what a "recommended configuration" looks like, I would say that is
more their fault than ours. In the case of the ARROW_JEMALLOC option,
I think it's important for C++ system integrators to be aware of the
impact of the choice of memory allocator.

The concern I have with the current "out of the box" experience is
that people are getting the impression that "I have to build $X, $Y,
and $Z -- which I don't necessarily need -- to have $CORE_FEATURE_1".
They can, of course, read the documentation and learn that those
things can be toggled off, but I think the user that reaches for a
self-built source install is much different in general than someone
who uses the project through the Linux binary packages, for example.

On the subject of managing intraproject dependencies and
relationships, I think we should develop a better way to express
relationships between components than we have now.

As an example, building the Python library assumes that various
components are enabled

- ARROW_COMPUTE=ON
- ARROW_FILESYSTEM=ON
- ARROW_IPC=ON

Somewhere in the code we might have some code like

if (ARROW_PYTHON)
   set(ARROW_COMPUTE ON)
   ...
endif()

This doesn't strike me as that scalable. I would rather see a
dependency file like

component_dependencies = {
     ...
     'python': ['compute', 'filesystem', 'ipc'],
     ...
}

A helper Python script as part of the build could be used to give
CMake (because CMake is a bit poor as a programming language) the list
of required components based on what the user has indicated to CMake.
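
One way such a helper could work, as a minimal sketch: walk the dependency table from the components the user asked for and return everything reachable. Only the 'python' entry comes from the example above; the other entries, the function names, and the choice to hand the result to CMake as a semicolon-separated list are assumptions made for illustration.

```python
# Hypothetical dependency file: only the 'python' entry comes from the
# thread; the empty entries are illustrative assumptions.
COMPONENT_DEPENDENCIES = {
    "python": ["compute", "filesystem", "ipc"],
    "compute": [],
    "filesystem": [],
    "ipc": [],
}


def required_components(requested, dependencies=COMPONENT_DEPENDENCIES):
    """Return the requested components plus everything they depend on, sorted."""
    resolved = set()
    stack = list(requested)
    while stack:
        name = stack.pop()
        if name in resolved:
            continue
        if name not in dependencies:
            raise KeyError(f"unknown component: {name}")
        resolved.add(name)
        stack.extend(dependencies[name])
    return sorted(resolved)


def to_cmake_list(components):
    """Join names with ';', the separator CMake uses for list values."""
    return ";".join(components)


print(to_cmake_list(required_components(["python"])))
# compute;filesystem;ipc;python
```

Because already-resolved names are skipped, the walk terminates even if the table accidentally contains a cycle.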

On Thu, Oct 10, 2019 at 7:36 AM Francois Saint-Jacques
<fsaintjacq...@gmail.com> wrote:


There's always the route of vendoring some libraries and not exposing
external CMake options. This would achieve the goal of
compiling out of the box and enable important features in the basic
build. It would also simplify dependency requirements (which benefits CI
and developers). The downsides are having to follow security patches and
grumpy reactions from package maintainers. I think we should explore this
route for dependencies that match the following criteria:

- libarrow*.so doesn't export any of the dependency's symbols, and the
dependency is not referenced in any public headers
- dependency is lightweight, e.g. excludes boost, openssl, grpc, llvm,
thrift, protobuf
- dependency is not ubiquitous on major platforms and has a stable
API, e.g. excludes libz and openssl

A small list of candidates:
- RapidJSON (enables JSON)
- DoubleConversion (enables CSV)

There's a precedent: Arrow already vendors small C++ libraries
(datetime, utf8cpp, variant, xxhash).

François


On Thu, Oct 10, 2019 at 6:03 AM Antoine Pitrou <anto...@python.org> wrote:



Hi all,

I'm a bit concerned that we're planning to add many additional build
options in the quest to have a core zero-dependency build in C++.
See for example https://issues.apache.org/jira/browse/ARROW-6633 or
https://issues.apache.org/jira/browse/ARROW-6612.

The problem is that this is creating many possible configurations and we
will only be testing a tiny subset of them.  Inevitably, users will try
other option combinations and they'll fail building for some random
reason.  It will not be a very good user experience.
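
To make the scale concrete, here is a small sketch that enumerates every ON/OFF combination for a handful of options. The option names are ones mentioned in this thread; treating them as fully independent switches is a simplifying assumption, and the function name is hypothetical. With n such options there are 2^n configurations.

```python
from itertools import product

# Option names taken from this thread; treating them as independent
# ON/OFF switches is a simplifying assumption.
OPTIONS = ["ARROW_COMPUTE", "ARROW_FILESYSTEM", "ARROW_IPC", "ARROW_JEMALLOC"]


def all_configurations(options):
    """Enumerate every ON/OFF assignment for the given options."""
    return [
        dict(zip(options, values))
        for values in product(("OFF", "ON"), repeat=len(options))
    ]


configs = all_configurations(OPTIONS)
print(len(configs))  # 16, i.e. 2 ** 4
```

Each additional independent option doubles the count, so a CI matrix that covers a fixed handful of builds exercises a shrinking fraction of what users can select.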

Another related issue is user perception when doing a default build.
For example https://issues.apache.org/jira/browse/ARROW-6638 proposes to
build with jemalloc disabled by default.  Inevitably, people will be
doing benchmarks with this (publicly or not) and they'll conclude Arrow
is not as performant as it claims to be.

Perhaps we should look for another approach instead?

For example we could have a single ARROW_BARE_CORE (whatever the name)
option that when enabled (not by default) builds the tiniest minimal
subset of Arrow.  It's more inflexible, but at least it's something that
we can reasonably test.

Regards

Antoine.
