One question we should discuss first: who is the C++ build system
for?

The users who are most sensitive to benchmark-driven decision making
will generally be consuming the project through pre-built binaries,
like our Python or R packages. If C++ developers build the project
from source and don't do a minimal read of the documentation to see
what a "recommended configuration" looks like, I would say that is
more their fault than ours. In the case of the ARROW_JEMALLOC option,
I think it's important for C++ system integrators to be aware of the
impact of the choice of memory allocator.

The concern I have with the current "out of the box" experience is
that people are getting the impression that "I have to build $X, $Y,
and $Z -- which I don't necessarily need -- to have $CORE_FEATURE_1".
They can, of course, read the documentation and learn that those
things can be toggled off, but I think the user who reaches for a
self-built source install is generally quite different from someone
who uses the project through the Linux binary packages, for example.

On the subject of managing intraproject dependencies and
relationships, I think we should develop a better way to express
relationships between components than we have now.

As an example, building the Python library assumes that various
components are enabled:

- ARROW_COMPUTE=ON
- ARROW_FILESYSTEM=ON
- ARROW_IPC=ON

Somewhere in the code we might have something like

if (ARROW_PYTHON)
  set(ARROW_COMPUTE ON)
  ...
endif()

This doesn't strike me as very scalable. I would rather see a
dependency file like

component_dependencies = {
    ...
    'python': ['compute', 'filesystem', 'ipc'],
    ...
}

A helper Python script as part of the build could be used to give
CMake (because CMake is a bit poor as a programming language) the list
of required components based on what the user has indicated to CMake.
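
To make the idea concrete, here is a rough sketch of what such a helper
could look like. Everything in it is an assumption for illustration:
the function names, the leaf entries for compute/filesystem/ipc, and
the ARROW_<NAME> output format are hypothetical, not something that
exists in the build today.

```python
# Hypothetical sketch of the helper script; names and layout are
# illustrative only. Only the 'python' entry comes from the example
# above; the empty leaf entries are assumed.
COMPONENT_DEPENDENCIES = {
    "python": ["compute", "filesystem", "ipc"],
    "compute": [],
    "filesystem": [],
    "ipc": [],
}


def resolve_components(requested, dependencies=COMPONENT_DEPENDENCIES):
    """Return the requested components plus everything they depend on,
    directly or transitively, as a sorted list."""
    resolved = set()
    stack = list(requested)
    while stack:
        name = stack.pop()
        if name in resolved:
            continue  # already handled; also guards against cycles
        if name not in dependencies:
            raise KeyError(f"unknown component: {name}")
        resolved.add(name)
        stack.extend(dependencies[name])
    return sorted(resolved)


def to_cmake_list(components):
    """Format components as a semicolon-separated CMake list of
    ARROW_<NAME> option names."""
    return ";".join(f"ARROW_{name.upper()}" for name in components)
```

CMake could then run a script like this once at configure time (for
instance through execute_process) and switch on each option in the
returned list, instead of hard-coding the relationships in if() blocks.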

On Thu, Oct 10, 2019 at 7:36 AM Francois Saint-Jacques
<fsaintjacq...@gmail.com> wrote:
>
> There's always the route of vendoring some library and not exposing
> external CMake options. This would achieve the goal of
> compile-out-of-the-box and enable important feature in the basic
> build. We also simplify dependencies requirements (benefits CI or
> developer). The downside is following security patches and grumpy
> reaction from package maintainers. I think we should explore this
> route for dependencies that match the following criteria:
>
> - libarrow*.so don't export any of the symbols of the dependency and
> not referenced in any public headers
> - dependency is lightweight, e.g. excludes boost, openssl, grpc, llvm,
> thrift, protobuf
> - dependency is not-ubiquitous on major platform and have a stable
> API, e.g. excludes libz and openssl
>
> A small list of candidates:
> - RapidJSON (enables JSON)
> - DoubleConversion (enables CSV)
>
> There's a precedent, arrow already vendors small C++ libraries
> (datetime, utf8cpp, variant, xxhash).
>
> François
>
>
> On Thu, Oct 10, 2019 at 6:03 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> >
> > Hi all,
> >
> > I'm a bit concerned that we're planning to add many additional build
> > options in the quest to have a core zero-dependency build in C++.
> > See for example https://issues.apache.org/jira/browse/ARROW-6633 or
> > https://issues.apache.org/jira/browse/ARROW-6612.
> >
> > The problem is that this is creating many possible configurations and we
> > will only be testing a tiny subset of them.  Inevitably, users will try
> > other option combinations and they'll fail building for some random
> > reason.  It will not be a very good user experience.
> >
> > Another related issue is user perception when doing a default build.
> > For example https://issues.apache.org/jira/browse/ARROW-6638 proposes to
> > build with jemalloc disabled by default.  Inevitably, people will be
> > doing benchmarks with this (publicly or not) and they'll conclude Arrow
> > is not as performant as it claims to be.
> >
> > Perhaps we should look for another approach instead?
> >
> > For example we could have a single ARROW_BARE_CORE (whatever the name)
> > option that when enabled (not by default) builds the tiniest minimal
> > subset of Arrow.  It's more inflexible, but at least it's something that
> > we can reasonably test.
> >
> > Regards
> >
> > Antoine.
