> > > This might be taking the thread on more of a tangent, but maybe we should start collecting requirements for the C++ build system in general and see if there might be a better solution that can address some of these concerns? In particular, Bazel, at least on the surface, seems like it might be a better fit for some of the use cases discussed here. I know this is a big project (and I currently don't have much bandwidth for it), but I think if CMake is lacking in these areas it might be worth at least exploring instead of going down the path of building our own meta-build system on top of CMake.
Requirements that I think we are targeting:

1. Be able to provide an out-of-the-box build system that requires as close to zero dependencies as possible beyond a standard C++ toolchain (e.g. "$BUILD minimal" works on any C++ developer's desktop without additional requirements).
2. The build system should limit configuration knobs in favor of implied dependencies (e.g. "$BUILD python" automatically builds "compute", "filesystem", "ipc").
3. The build system should be configurable to use (and have the user specify) one of "System packages", "Conda packages" or source packages for providing dependencies (and fallback options between the three).
4. The build system should be able to treat some dependencies as optional (e.g. different compression libraries or allocators).
5. Easily allow developers to limit building unnecessary code for their particular task at hand.
6. The build system must work across the following toolchains/platforms:
   - Linux: g++ and clang, x86 and ARM
   - Mac
   - Windows (msys2 and MSVC)

Thanks,
Micah

On Thu, Oct 10, 2019 at 6:09 AM Antoine Pitrou <anto...@python.org> wrote:

>
> Yes, we could express dependencies in a Python script and have it
> generate a CMake module of if/else chains in cmake_modules (which we
> would check in git to avoid having people depend on a Python install,
> perhaps).
>
> Still, that is an additional maintenance burden.
>
> Regards
>
> Antoine.
>
>
> On 10/10/2019 at 14:50, Wes McKinney wrote:
> > I guess one question we should first discuss is: who is the C++ build
> > system for?
> >
> > The users who are most sensitive to benchmark-driven decision making
> > will generally be consuming the project through pre-built binaries,
> > like our Python or R packages. If C++ developers build the project
> > from source and don't do a minimal read of the documentation to see
> > what a "recommended configuration" looks like, I would say that is
> > more their fault than ours.
> > In the case of the ARROW_JEMALLOC option,
> > I think it's important for C++ system integrators to be aware of the
> > impact of the choice of memory allocator.
> >
> > The concern I have with the current "out of the box" experience is
> > that people are getting the impression that "I have to build $X, $Y,
> > and $Z -- which I don't necessarily need -- to have $CORE_FEATURE_1".
> > They can, of course, read the documentation and learn that those
> > things can be toggled off, but I think the user that reaches for a
> > self-built source install is much different in general than someone
> > who uses the project through the Linux binary packages, for example.
> >
> > On the subject of managing intraproject dependencies and
> > relationships, I think we should develop a better way to express
> > relationships between components than we have now.
> >
> > As an example, building the Python library assumes that various
> > components are enabled
> >
> > - ARROW_COMPUTE=ON
> > - ARROW_FILESYSTEM=ON
> > - ARROW_IPC=ON
> >
> > Somewhere in the code we might have some code like
> >
> > if (ARROW_PYTHON)
> >   set(ARROW_COMPUTE ON)
> >   ...
> > endif()
> >
> > This doesn't strike me as that scalable. I would rather see a
> > dependency file like
> >
> > component_dependencies = {
> >     ...
> >     'python': ['compute', 'filesystem', 'ipc'],
> >     ...
> > }
> >
> > A helper Python script as part of the build could be used to give
> > CMake (because CMake is a bit poor as a programming language) the list
> > of required components based on what the user has indicated to CMake.
> >
> > On Thu, Oct 10, 2019 at 7:36 AM Francois Saint-Jacques
> > <fsaintjacq...@gmail.com> wrote:
> >>
> >> There's always the route of vendoring some library and not exposing
> >> external CMake options. This would achieve the goal of
> >> compile-out-of-the-box and enable important features in the basic
> >> build. We also simplify dependency requirements (benefits CI or
> >> developer).
> >> The downside is following security patches and grumpy
> >> reactions from package maintainers. I think we should explore this
> >> route for dependencies that match the following criteria:
> >>
> >> - libarrow*.so doesn't export any of the dependency's symbols, and the
> >>   dependency is not referenced in any public headers
> >> - dependency is lightweight, e.g. excludes boost, openssl, grpc, llvm,
> >>   thrift, protobuf
> >> - dependency is not ubiquitous on major platforms and has a stable
> >>   API, e.g. excludes libz and openssl
> >>
> >> A small list of candidates:
> >> - RapidJSON (enables JSON)
> >> - DoubleConversion (enables CSV)
> >>
> >> There's a precedent: Arrow already vendors small C++ libraries
> >> (datetime, utf8cpp, variant, xxhash).
> >>
> >> François
> >>
> >>
> >> On Thu, Oct 10, 2019 at 6:03 AM Antoine Pitrou <anto...@python.org> wrote:
> >>>
> >>>
> >>> Hi all,
> >>>
> >>> I'm a bit concerned that we're planning to add many additional build
> >>> options in the quest to have a core zero-dependency build in C++.
> >>> See for example https://issues.apache.org/jira/browse/ARROW-6633 or
> >>> https://issues.apache.org/jira/browse/ARROW-6612.
> >>>
> >>> The problem is that this is creating many possible configurations and we
> >>> will only be testing a tiny subset of them. Inevitably, users will try
> >>> other option combinations and they'll fail building for some random
> >>> reason. It will not be a very good user experience.
> >>>
> >>> Another related issue is user perception when doing a default build.
> >>> For example https://issues.apache.org/jira/browse/ARROW-6638 proposes to
> >>> build with jemalloc disabled by default. Inevitably, people will be
> >>> doing benchmarks with this (publicly or not) and they'll conclude Arrow
> >>> is not as performant as it claims to be.
> >>>
> >>> Perhaps we should look for another approach instead?
> >>>
> >>> For example we could have a single ARROW_BARE_CORE (whatever the name)
> >>> option that when enabled (not by default) builds the tiniest minimal
> >>> subset of Arrow. It's more inflexible, but at least it's something that
> >>> we can reasonably test.
> >>>
> >>> Regards
> >>>
> >>> Antoine.
>
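
A minimal sketch of the dependency-file idea quoted above, assuming a Python helper that expands the components a user asks for and prints CMake set() lines. Only the 'python' entry of the table comes from the thread; the other entries, the function names, and the ARROW_<NAME> naming for every component are illustrative assumptions, not an existing Arrow script.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical dependency table. Only the 'python' entry is taken from the
# thread; the empty entries are placeholders for illustration.
COMPONENT_DEPENDENCIES: Dict[str, List[str]] = {
    "python": ["compute", "filesystem", "ipc"],
    "compute": [],
    "filesystem": [],
    "ipc": [],
}


def resolve_components(
    requested: Iterable[str],
    table: Dict[str, List[str]] = COMPONENT_DEPENDENCIES,
) -> Set[str]:
    """Return the requested components plus everything they transitively require."""
    resolved: Set[str] = set()
    stack = list(requested)
    while stack:
        name = stack.pop()
        if name in resolved:
            continue  # already handled; also guards against dependency cycles
        if name not in table:
            raise KeyError(f"unknown component: {name}")
        resolved.add(name)
        stack.extend(table[name])
    return resolved


def to_cmake(components: Iterable[str]) -> str:
    """Render one set(ARROW_<NAME> ON) line per component, sorted for stable output."""
    return "\n".join(f"set(ARROW_{name.upper()} ON)" for name in sorted(components))


if __name__ == "__main__":
    # Asking for 'python' alone pulls in compute, filesystem and ipc.
    print(to_cmake(resolve_components(["python"])))
```

Whether the generated lines are written into a file checked in under cmake_modules or produced at configure time is the maintenance trade-off raised earlier in the thread; the sketch takes no position on it.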