Hi Wes,

I agree on both counts: it won't be done in the short term, and it makes sense to tackle it incrementally. Like I said, I don't have much bandwidth at the moment, but I might be able to rearrange a few things on my plate. Some people have asked on the mailing list how they might be able to help; this might be one area that doesn't require a lot of in-depth knowledge of C++, at least for a proof of concept. I'll try to open some JIRAs soon.
Thanks,
Micah

On Tue, Oct 15, 2019 at 10:33 AM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Micah,
>
> Definitely Bazel is worth exploring, but we must be realistic about
> the amount of energy (several hundred hours or more) that's been
> invested in the build system we have now. So a new build system will
> be a large endeavor, but hopefully can make things simpler.
>
> Aside from the requirements gathering process, if it is felt that
> Bazel is a possible path forward in the future, it may be good to try
> to break up the work into more tractable pieces. For example, a first
> step would be to set up Bazel configurations to build the project's
> thirdparty toolchain. Since we're reliant on ExternalProject in CMake
> to do a lot of heavy lifting there for us, I imagine this (taking care
> of what ThirdpartyToolchain.cmake does not) will take up a lot of the
> energy.
>
> - Wes
>
> On Sun, Oct 13, 2019 at 1:06 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > This might be taking the thread on more of a tangent, but maybe we
> > should start collecting requirements for the C++ build system in
> > general and see if there might be a better solution that can address
> > some of these concerns? In particular, Bazel at least on the surface
> > seems like it might be a better fit for some of the use cases
> > discussed here. I know this is a big project (and I currently don't
> > have much bandwidth for it) but I think if CMake is lacking in these
> > areas it might be worth at least exploring instead of going down the
> > path of building our own meta-build system on top of CMake.
> >
> > Requirements that I think we are targeting:
> > 1. Be able to provide an out of box build system that requires as
> >    close to zero dependencies as possible beyond a standard C++
> >    toolchain (e.g. "$BUILD minimal" works on any C++ developer's
> >    desktop without additional requirements)
> > 2. The build system should limit configuration knobs in favor of
> >    implied dependencies (e.g. "$BUILD python" automatically builds
> >    "compute", "filesystem", "ipc")
> > 3. The build system should be configurable to use (and have the user
> >    specify) one of "System packages", "Conda packages" or source
> >    packages for providing dependencies (and fallback options between
> >    the three).
> > 4. The build system should be able to treat some dependencies as
> >    optional (e.g. different compression libraries or allocators).
> > 5. Easily allow developers to limit building unnecessary code for
> >    their particular task at hand.
> > 6. The build system must work across the following
> >    toolchains/platforms:
> >    - Linux: g++ and clang. x86 and ARM
> >    - Mac
> >    - Windows (msys2 and MSVC)
> >
> > Thanks,
> > Micah
> >
> > On Thu, Oct 10, 2019 at 6:09 AM Antoine Pitrou <anto...@python.org>
> > wrote:
> >
> > > Yes, we could express dependencies in a Python script and have it
> > > generate a CMake module of if/else chains in cmake_modules (which we
> > > would check in git to avoid having people depend on a Python install,
> > > perhaps).
> > >
> > > Still, that is an additional maintenance burden.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On 10/10/2019 at 14:50, Wes McKinney wrote:
> > > > I guess one question we should first discuss is: who is the C++
> > > > build system for?
> > > >
> > > > The users who are most sensitive to benchmark-driven decision making
> > > > will generally be consuming the project through pre-built binaries,
> > > > like our Python or R packages. If C++ developers build the project
> > > > from source and don't do a minimal read of the documentation to see
> > > > what a "recommended configuration" looks like, I would say that is
> > > > more their fault than ours. In the case of the ARROW_JEMALLOC option,
> > > > I think it's important for C++ system integrators to be aware of the
> > > > impact of the choice of memory allocator.
> > > >
> > > > The concern I have with the current "out of the box" experience is
> > > > that people are getting the impression that "I have to build $X, $Y,
> > > > and $Z -- which I don't necessarily need -- to have $CORE_FEATURE_1".
> > > > They can, of course, read the documentation and learn that those
> > > > things can be toggled off, but I think the user that reaches for a
> > > > self-built source install is much different in general than someone
> > > > who uses the project through the Linux binary packages, for example.
> > > >
> > > > On the subject of managing intraproject dependencies and
> > > > relationships, I think we should develop a better way to express
> > > > relationships between components than we have now.
> > > >
> > > > As an example, building the Python library assumes that various
> > > > components are enabled
> > > >
> > > > - ARROW_COMPUTE=ON
> > > > - ARROW_FILESYSTEM=ON
> > > > - ARROW_IPC=ON
> > > >
> > > > Somewhere in the code we might have some code like
> > > >
> > > > if (ARROW_PYTHON)
> > > >   set(ARROW_COMPUTE ON)
> > > >   ...
> > > > endif()
> > > >
> > > > This doesn't strike me as that scalable. I would rather see a
> > > > dependency file like
> > > >
> > > > component_dependencies = {
> > > >     ...
> > > >     'python': ['compute', 'filesystem', 'ipc'],
> > > >     ...
> > > > }
> > > >
> > > > A helper Python script as part of the build could be used to give
> > > > CMake (because CMake is a bit poor as a programming language) the
> > > > list of required components based on what the user has indicated to
> > > > CMake.
> > > >
> > > > On Thu, Oct 10, 2019 at 7:36 AM Francois Saint-Jacques
> > > > <fsaintjacq...@gmail.com> wrote:
> > > >>
> > > >> There's always the route of vendoring some library and not exposing
> > > >> external CMake options. This would achieve the goal of
> > > >> compile-out-of-the-box and enable important features in the basic
> > > >> build. We also simplify dependency requirements (benefits CI or
> > > >> developer). The downside is following security patches and grumpy
> > > >> reactions from package maintainers. I think we should explore this
> > > >> route for dependencies that match the following criteria:
> > > >>
> > > >> - libarrow*.so doesn't export any of the symbols of the dependency,
> > > >>   and the dependency is not referenced in any public headers
> > > >> - dependency is lightweight, e.g. excludes boost, openssl, grpc,
> > > >>   llvm, thrift, protobuf
> > > >> - dependency is not ubiquitous on major platforms and has a stable
> > > >>   API, e.g. excludes libz and openssl
> > > >>
> > > >> A small list of candidates:
> > > >> - RapidJSON (enables JSON)
> > > >> - DoubleConversion (enables CSV)
> > > >>
> > > >> There's a precedent, arrow already vendors small C++ libraries
> > > >> (datetime, utf8cpp, variant, xxhash).
> > > >>
> > > >> François
> > > >>
> > > >>
> > > >> On Thu, Oct 10, 2019 at 6:03 AM Antoine Pitrou <anto...@python.org>
> > > >> wrote:
> > > >>>
> > > >>> Hi all,
> > > >>>
> > > >>> I'm a bit concerned that we're planning to add many additional
> > > >>> build options in the quest to have a core zero-dependency build in
> > > >>> C++. See for example
> > > >>> https://issues.apache.org/jira/browse/ARROW-6633 or
> > > >>> https://issues.apache.org/jira/browse/ARROW-6612.
> > > >>>
> > > >>> The problem is that this is creating many possible configurations
> > > >>> and we will only be testing a tiny subset of them. Inevitably,
> > > >>> users will try other option combinations and they'll fail building
> > > >>> for some random reason. It will not be a very good user experience.
> > > >>>
> > > >>> Another related issue is user perception when doing a default
> > > >>> build. For example
> > > >>> https://issues.apache.org/jira/browse/ARROW-6638 proposes to build
> > > >>> with jemalloc disabled by default. Inevitably, people will be
> > > >>> doing benchmarks with this (publicly or not) and they'll conclude
> > > >>> Arrow is not as performant as it claims to be.
> > > >>>
> > > >>> Perhaps we should look for another approach instead?
> > > >>>
> > > >>> For example we could have a single ARROW_BARE_CORE (whatever the
> > > >>> name) option that when enabled (not by default) builds the tiniest
> > > >>> minimal subset of Arrow. It's more inflexible, but at least it's
> > > >>> something that we can reasonably test.
> > > >>>
> > > >>> Regards
> > > >>>
> > > >>> Antoine.
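
For anyone picking this up as a proof of concept, here is a rough sketch of the dependency-file idea quoted above: a map of components, a helper that expands what the user asked for into everything it implies, and a renderer that emits the corresponding CMake set() lines. Nothing here is from the Arrow tree. The names COMPONENT_DEPENDENCIES, resolve_components and to_cmake are made up for illustration, and only the 'python' entry of the map comes from the thread; the empty entries for the other components are an assumption.

```python
# Hypothetical component map. Only the 'python' entry is taken from the
# thread; the leaf entries are assumed to have no further dependencies.
COMPONENT_DEPENDENCIES = {
    "python": ["compute", "filesystem", "ipc"],
    "compute": [],
    "filesystem": [],
    "ipc": [],
}


def resolve_components(requested, dependencies=COMPONENT_DEPENDENCIES):
    """Return the requested components plus everything they imply, sorted."""
    resolved = set()
    stack = list(requested)
    while stack:
        name = stack.pop()
        if name in resolved:
            continue  # already handled; this also guards against cycles
        if name not in dependencies:
            raise KeyError(f"unknown component: {name}")
        resolved.add(name)
        stack.extend(dependencies[name])
    return sorted(resolved)


def to_cmake(components):
    """Render one set(ARROW_<NAME> ON) line per component."""
    return "\n".join(f"set(ARROW_{name.upper()} ON)" for name in components)


if __name__ == "__main__":
    # Expand "python" into its implied components and print the CMake lines.
    print(to_cmake(resolve_components(["python"])))
```

The output could either be read by CMake at configure time or checked into cmake_modules as a generated file, which is the trade-off Antoine raises about depending on a Python install.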