Hi Wes,
I agree on both counts: it won't be done in the short term, and it
makes sense to tackle it incrementally.  Like I said, I don't have much
bandwidth at the moment, but I might be able to re-arrange a few things on my
plate.  Some people have asked on the mailing list how they might
be able to help; this might be one area that doesn't require a lot of
in-depth knowledge of C++, at least for a proof of concept.  I'll try to
open some JIRAs soon.

Thanks,
Micah

On Tue, Oct 15, 2019 at 10:33 AM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Micah,
>
> Definitely Bazel is worth exploring, but we must be realistic about
> the amount of energy (several hundred hours or more) that's been
> invested in the build system we have now. So a new build system will
> be a large endeavor, but hopefully can make things simpler.
>
> Aside from the requirements gathering process, if it is felt that
> Bazel is a possible path forward in the future, it may be good to try
> to break up the work into more tractable pieces. For example, a first
> step would be to set up Bazel configurations to build the project's
> thirdparty toolchain. Since we're reliant on ExternalProject in CMake
> to do a lot of heavy lifting there for us, I imagine this (taking care
> of what ThirdpartyToolchain.cmake does not) will take up a lot of the
> energy.
>
> - Wes
>
> On Sun, Oct 13, 2019 at 1:06 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> >
> > This might be taking the thread on more of a tangent, but maybe we should
> > start collecting requirements for the C++ build system in general and see
> > if there might be a better solution that can address some of these concerns?
> > In particular, Bazel at least on the surface seems like it might be a
> > better fit for some of the use cases discussed here.  I know this is a big
> > project (and I currently don't have much bandwidth for it) but I think if
> > CMake is lacking in these areas it might be worth at least exploring
> > instead of going down the path of building our own meta-build system on top
> > of CMake.
> >
> > Requirements that I think we are targeting:
> > 1.  Provide an out-of-the-box build system that requires as close to
> > zero dependencies as possible beyond a standard C++ toolchain (e.g.
> > "$BUILD minimal" works on any C++ developer's desktop without additional
> > requirements).
> > 2.  The build system should limit configuration knobs in favor of implied
> > dependencies (e.g. "$BUILD python" automatically builds "compute",
> > "filesystem", "ipc").
> > 3.  The build system should be configurable to use (and have the user
> > specify) one of "System packages", "Conda packages" or source packages for
> > providing dependencies (with fallback options between the three).
> > 4.  The build system should be able to treat some dependencies as optional
> > (e.g. different compression libraries or allocators).
> > 5.  Easily allow developers to avoid building code that is unnecessary for
> > their particular task at hand.
> > 6.  The build system must work across the following toolchains/platforms:
> >     - Linux:  g++ and clang.  x86 and ARM
> >     - Mac
> >     - Windows (msys2 and MSVC)
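Requirement 3 above (a preferred dependency source with fallbacks) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pick_source`, the source labels, and the sample library names are hypothetical and are not taken from the existing build system.

```python
def pick_source(dependency, preference, available):
    """Return the first source in `preference` that can provide `dependency`.

    `available` maps a source name to the set of dependencies it offers.
    All names used here are hypothetical and for illustration only.
    """
    for source in preference:
        if dependency in available.get(source, ()):
            return source
    raise LookupError(f"no source provides {dependency!r}")


# Hypothetical sample data: what each kind of source could provide.
AVAILABLE = {
    "system": {"zlib"},
    "conda": {"zlib", "snappy"},
    "source": {"zlib", "snappy", "lz4"},
}

# Prefer system packages, then Conda, then building from source.
PREFERENCE = ["system", "conda", "source"]
```

With this data, `pick_source("snappy", PREFERENCE, AVAILABLE)` falls through "system" and settles on "conda", while a dependency that no source offers raises `LookupError` instead of silently being skipped.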
> >
> > Thanks,
> > Micah
> >
> >
> >
> > On Thu, Oct 10, 2019 at 6:09 AM Antoine Pitrou <anto...@python.org>
> wrote:
> >
> > >
> > > Yes, we could express dependencies in a Python script and have it
> > > generate a CMake module of if/else chains in cmake_modules (which we
> > > would check in git to avoid having people depend on a Python install,
> > > perhaps).
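A minimal sketch of the generator idea described above, assuming option names follow the ARROW_<COMPONENT> pattern used elsewhere in this thread. The function name `render_cmake_module` is hypothetical, and the sketch only emits direct dependencies; transitive chains would need the blocks to be ordered or the closure to be computed first.

```python
def render_cmake_module(dependencies):
    """Render if()/set() chains that switch on each component's dependencies.

    `dependencies` maps a component name to the list of components it needs.
    Option names are assumed to be ARROW_<COMPONENT> in upper case.
    """
    lines = ["# Generated file, do not edit by hand."]
    for component in sorted(dependencies):
        required = dependencies[component]
        if not required:
            continue  # nothing to enable for leaf components
        lines.append(f"if(ARROW_{component.upper()})")
        for dep in required:
            lines.append(f"  set(ARROW_{dep.upper()} ON)")
        lines.append("endif()")
    return "\n".join(lines) + "\n"


# The 'python' entry mirrors the example from earlier in the thread.
EXAMPLE = {"python": ["compute", "filesystem", "ipc"]}
```

Writing `render_cmake_module(EXAMPLE)` to a file under cmake_modules and committing that file would keep a Python install out of the requirements for a normal build, at the cost of regenerating it whenever the table changes.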
> > >
> > > Still, that is an additional maintenance burden.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On 10/10/2019 at 14:50, Wes McKinney wrote:
> > > > I guess one question we should first discuss is: who is the C++ build
> > > > system for?
> > > >
> > > > The users who are most sensitive to benchmark-driven decision making
> > > > will generally be consuming the project through pre-built binaries,
> > > > like our Python or R packages. If C++ developers build the project
> > > > from source and don't do a minimal read of the documentation to see
> > > > what a "recommended configuration" looks like, I would say that is
> > > > more their fault than ours. In the case of the ARROW_JEMALLOC option,
> > > > I think it's important for C++ system integrators to be aware of the
> > > > impact of the choice of memory allocator.
> > > >
> > > > The concern I have with the current "out of the box" experience is
> > > > that people are getting the impression that "I have to build $X, $Y,
> > > > and $Z -- which I don't necessarily need -- to have $CORE_FEATURE_1".
> > > > They can, of course, read the documentation and learn that those
> > > > things can be toggled off, but I think the user that reaches for a
> > > > self-built source install is much different in general than someone
> > > > who uses the project through the Linux binary packages, for example.
> > > >
> > > > On the subject of managing intraproject dependencies and
> > > > relationships, I think we should develop a better way to express
> > > > relationships between components than we have now.
> > > >
> > > > As an example, building the Python library assumes that various
> > > > components are enabled
> > > >
> > > > - ARROW_COMPUTE=ON
> > > > - ARROW_FILESYSTEM=ON
> > > > - ARROW_IPC=ON
> > > >
> > > > Somewhere in the code we might have some code like
> > > >
> > > > if (ARROW_PYTHON)
> > > >   set(ARROW_COMPUTE ON)
> > > >   ...
> > > > endif()
> > > >
> > > > This doesn't strike me as that scalable. I would rather see a
> > > > dependency file like
> > > >
> > > > component_dependencies = {
> > > >     ...
> > > >     'python': ['compute', 'filesystem', 'ipc'],
> > > >     ...
> > > > }
> > > >
> > > > A helper Python script as part of the build could be used to give
> > > > CMake (because CMake is a bit poor as a programming language) the list
> > > > of required components based on what the user has indicated to CMake.
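One way such a helper could look, as a sketch rather than a proposal for the actual script: it walks a dependency table and returns every component that must be enabled for the ones the user asked for. Only the 'python' entry comes from the message above; the empty entries, the function names, and the ';'-joined output format are assumptions made for illustration.

```python
# Hypothetical table; only the 'python' entry is taken from the thread.
COMPONENT_DEPENDENCIES = {
    "python": ["compute", "filesystem", "ipc"],
    "compute": [],
    "filesystem": [],
    "ipc": [],
}


def required_components(requested, dependencies=COMPONENT_DEPENDENCIES):
    """Return the requested components plus all transitive dependencies, sorted."""
    resolved = set()
    stack = list(requested)
    while stack:
        name = stack.pop()
        if name in resolved:
            continue  # already handled, also guards against cycles
        if name not in dependencies:
            raise KeyError(f"unknown component: {name}")
        resolved.add(name)
        stack.extend(dependencies[name])
    return sorted(resolved)


def as_cmake_list(components):
    """Join names with ';', the separator CMake uses for list values."""
    return ";".join(components)
```

CMake could then run a script like this with `execute_process` and read the printed list back through `OUTPUT_VARIABLE`; how the result is mapped onto the ARROW_* options is left open here.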
> > > >
> > > > On Thu, Oct 10, 2019 at 7:36 AM Francois Saint-Jacques
> > > > <fsaintjacq...@gmail.com> wrote:
> > > >>
> > > >> There's always the route of vendoring some libraries and not exposing
> > > >> external CMake options. This would achieve the goal of
> > > >> compile-out-of-the-box and enable important features in the basic
> > > >> build. It would also simplify dependency requirements (which benefits
> > > >> CI and developers). The downsides are keeping up with security patches
> > > >> and grumpy reactions from package maintainers. I think we should explore
> > > >> this route for dependencies that match the following criteria:
> > > >>
> > > >> - libarrow*.so doesn't export any of the dependency's symbols, and the
> > > >> dependency is not referenced in any public headers
> > > >> - the dependency is lightweight, e.g. excludes boost, openssl, grpc, llvm,
> > > >> thrift, protobuf
> > > >> - the dependency is not ubiquitous on major platforms and has a stable
> > > >> API, e.g. excludes libz and openssl
> > > >>
> > > >> A small list of candidates:
> > > >> - RapidJSON (enables JSON)
> > > >> - DoubleConversion (enables CSV)
> > > >>
> > > >> There's precedent for this: Arrow already vendors small C++ libraries
> > > >> (datetime, utf8cpp, variant, xxhash).
> > > >>
> > > >> François
> > > >>
> > > >>
> > > >> On Thu, Oct 10, 2019 at 6:03 AM Antoine Pitrou <anto...@python.org>
> > > wrote:
> > > >>>
> > > >>>
> > > >>> Hi all,
> > > >>>
> > > >>> I'm a bit concerned that we're planning to add many additional build
> > > >>> options in the quest to have a core zero-dependency build in C++.
> > > >>> See for example https://issues.apache.org/jira/browse/ARROW-6633 or
> > > >>> https://issues.apache.org/jira/browse/ARROW-6612.
> > > >>>
> > > >>> The problem is that this is creating many possible configurations and
> > > >>> we will only be testing a tiny subset of them.  Inevitably, users will
> > > >>> try other option combinations and they'll fail building for some random
> > > >>> reason.  It will not be a very good user experience.
> > > >>>
> > > >>> Another related issue is user perception when doing a default build.
> > > >>> For example https://issues.apache.org/jira/browse/ARROW-6638 proposes
> > > >>> to build with jemalloc disabled by default.  Inevitably, people will be
> > > >>> doing benchmarks with this (publicly or not) and they'll conclude Arrow
> > > >>> is not as performant as it claims to be.
> > > >>>
> > > >>> Perhaps we should look for another approach instead?
> > > >>>
> > > >>> For example we could have a single ARROW_BARE_CORE (whatever the name)
> > > >>> option that when enabled (not by default) builds the tiniest minimal
> > > >>> subset of Arrow.  It's more inflexible, but at least it's something
> > > >>> that we can reasonably test.
> > > >>>
> > > >>> Regards
> > > >>>
> > > >>> Antoine.
> > >
>
