I guess one question we should first discuss is: who is the C++ build system for?

The users who are most sensitive to benchmark-driven decision making will generally be consuming the project through pre-built binaries, like our Python or R packages. If C++ developers build the project from source and don't do a minimal read of the documentation to see what a "recommended configuration" looks like, I would say that is more their fault than ours. In the case of the ARROW_JEMALLOC option, I think it's important for C++ system integrators to be aware of the impact of the choice of memory allocator.

The concern I have with the current "out of the box" experience is that people are getting the impression that "I have to build $X, $Y, and $Z -- which I don't necessarily need -- to have $CORE_FEATURE_1". They can, of course, read the documentation and learn that those things can be toggled off, but I think the user that reaches for a self-built source install is much different in general than someone who uses the project through the Linux binary packages, for example.

On the subject of managing intraproject dependencies and relationships, I think we should develop a better way to express relationships between components than we have now.

As an example, building the Python library assumes that various components are enabled

- ARROW_COMPUTE=ON
- ARROW_FILESYSTEM=ON
- ARROW_IPC=ON

Somewhere in the code we might have some code like

    if (ARROW_PYTHON)
      set(ARROW_COMPUTE ON)
      ...
    endif()

This doesn't strike me as that scalable. I would rather see a dependency file like

    component_dependencies = {
        ...
        'python': ['compute', 'filesystem', 'ipc'],
        ...
    }

A helper Python script as part of the build could be used to give CMake (because CMake is a bit poor as a programming language) the list of required components based on what the user has indicated to CMake.

On Thu, Oct 10, 2019 at 7:36 AM Francois Saint-Jacques <fsaintjacq...@gmail.com> wrote:
>
> There's always the route of vendoring some library and not exposing
> external CMake options.
> This would achieve the goal of
> compile-out-of-the-box and enable important feature in the basic
> build. We also simplify dependencies requirements (benefits CI or
> developer). The downside is following security patches and grumpy
> reaction from package maintainers. I think we should explore this
> route for dependencies that match the following criteria:
>
> - libarrow*.so don't export any of the symbols of the dependency and
>   not referenced in any public headers
> - dependency is lightweight, e.g. excludes boost, openssl, grpc, llvm,
>   thrift, protobuf
> - dependency is not-ubiquitous on major platform and have a stable
>   API, e.g. excludes libz and openssl
>
> A small list of candidates:
> - RapidJSON (enables JSON)
> - DoubleConversion (enables CSV)
>
> There's a precedent, arrow already vendors small C++ libraries
> (datetime, utf8cpp, variant, xxhash).
>
> François
>
>
> On Thu, Oct 10, 2019 at 6:03 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> >
> > Hi all,
> >
> > I'm a bit concerned that we're planning to add many additional build
> > options in the quest to have a core zero-dependency build in C++.
> > See for example https://issues.apache.org/jira/browse/ARROW-6633 or
> > https://issues.apache.org/jira/browse/ARROW-6612.
> >
> > The problem is that this is creating many possible configurations and we
> > will only be testing a tiny subset of them. Inevitably, users will try
> > other option combinations and they'll fail building for some random
> > reason. It will not be a very good user experience.
> >
> > Another related issue is user perception when doing a default build.
> > For example https://issues.apache.org/jira/browse/ARROW-6638 proposes to
> > build with jemalloc disabled by default. Inevitably, people will be
> > doing benchmarks with this (publicly or not) and they'll conclude Arrow
> > is not as performant as it claims to be.
> >
> > Perhaps we should look for another approach instead?
> >
> > For example we could have a single ARROW_BARE_CORE (whatever the name)
> > option that when enabled (not by default) builds the tiniest minimal
> > subset of Arrow. It's more inflexible, but at least it's something that
> > we can reasonably test.
> >
> > Regards
> >
> > Antoine.
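
P.S. To make the dependency-file idea from the top of this message a bit more concrete, here is a minimal sketch of what such a helper script could look like. Only the 'python' entry comes from the example above. The 'dataset' entry, the function names resolve_components and to_cmake_options, and the ARROW_<NAME>=ON output format are assumptions for illustration, not existing Arrow build code.

```python
# Hypothetical dependency table. Only the 'python' entry is taken from the
# discussion; the 'dataset' entry is an illustrative assumption and does not
# claim to describe the real Arrow dependency graph.
component_dependencies = {
    'compute': [],
    'filesystem': [],
    'ipc': [],
    'dataset': ['compute', 'filesystem'],
    'python': ['compute', 'filesystem', 'ipc'],
}


def resolve_components(requested, dependencies=component_dependencies):
    """Return the requested components plus everything they transitively
    require, as a sorted list."""
    resolved = set()
    stack = list(requested)
    while stack:
        name = stack.pop()
        if name in resolved:
            continue
        if name not in dependencies:
            raise KeyError(f"unknown component: {name}")
        resolved.add(name)
        # Queue the direct dependencies so that indirect ones get picked up too.
        stack.extend(dependencies[name])
    return sorted(resolved)


def to_cmake_options(components):
    """Format component names as ARROW_<NAME>=ON strings (assumed naming)."""
    return [f"ARROW_{name.upper()}=ON" for name in components]
```

Joining the result of to_cmake_options(resolve_components(['python'])) with semicolons would give a string that CMake can treat as a list, so the CMake side would only need to run the script and iterate over its output rather than encode the relationships in nested if() blocks.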