Hello Micah,

I don't think we have explored using Bazel yet. I could see it as a possible
modular alternative, but as you mention it would be a lot of work, and we would
probably need a mentor who is familiar with Bazel. Otherwise we will probably
end up spending too much time on this and get a non-typical Bazel setup.

Uwe

On Wed, Sep 18, 2019, at 8:44 AM, Micah Kornfield wrote:
> It has come up in the past, but I wonder if exploring Bazel as a build
> system with its very explicit dependency graph might help (I'm not sure
> if something similar is available in CMake).
> 
> This is also a lot of work, but could also potentially benefit the
> developer experience because we can make unit tests depend on individual
> compilable units instead of all of libarrow.  There are trade-offs here as
> well in terms of public API coverage.
> 
> On Tue, Sep 17, 2019 at 11:14 PM Uwe L. Korn <uw...@xhochy.com> wrote:
> 
> > Hello,
> >
> > I can think of two other alternatives that make it more visible what Arrow
> > core is and what are the optional components:
> >
> > * Error out when no component is selected instead of building just the
> > core Arrow. Here we could add an explanatory message that lists all
> > components and, for each component, 2-3 words on what it does and what it
> > requires. This would make the first-time experience much better.
> > * Split the CMake project into several subprojects. By correctly
> > structuring the CMakefiles, we should be able to separate out the Arrow
> > components into separate CMake projects that can be built independently if
> > needed while all using the same third-party toolchain. We would still have
> > a top-level CMakeLists.txt that is invoked just like the current one, but
> > with subprojects you would no longer be bound to using the
> > single top-level one. This would also have some benefit for packagers that
> > could separate out the build of individual Arrow modules. Furthermore, it
> > would also make it easier for PoC/academic projects to just take the Arrow
> > Core sources and drop it in as a CMake subproject; while this is not a good
> > solution for production-grade software, it is quite common practice to do
> > this in research.
> > I really like this approach and I think this is something we should have
> > as a long-term target. I'm also happy to implement it given the time, but I
> > think one CMake refactor per year is the maximum I can do, and that was
> > already eaten up by the dependency detection. Also, I'm unsure about how
> > much this would block us at the moment vs. the marketing benefit of having a
> > more modular Arrow; currently I'm leaning towards the view that the
> > marketing/adoption benefit would be much larger, but we lack someone
> > frustration-tolerant to do the refactoring.
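> >
> > A minimal sketch of the first alternative above (erroring out when no
> > component is selected), assuming plain CMake option() calls. The project
> > name, the component list, and the short descriptions are illustrative
> > assumptions, not Arrow's actual build code:
> >
> > ```cmake
> > cmake_minimum_required(VERSION 3.5)
> > # Hypothetical project name; NONE skips compiler detection for this sketch.
> > project(arrow_component_check NONE)
> >
> > # Assumed subset of components; the flag names are taken from this thread.
> > set(ARROW_OPTIONAL_COMPONENTS COMPUTE DATASET JSON)
> >
> > option(ARROW_COMPUTE "Compute kernels" OFF)
> > option(ARROW_DATASET "Dataset API" OFF)
> > option(ARROW_JSON "JSON support (uses RapidJSON)" OFF)
> >
> > # Check whether the user switched on at least one component.
> > set(ARROW_ANY_COMPONENT_SELECTED FALSE)
> > foreach(component ${ARROW_OPTIONAL_COMPONENTS})
> >   if(ARROW_${component})
> >     set(ARROW_ANY_COMPONENT_SELECTED TRUE)
> >   endif()
> > endforeach()
> >
> > # Stop configuration with a short overview instead of building only the core.
> > if(NOT ARROW_ANY_COMPONENT_SELECTED)
> >   message(FATAL_ERROR
> >     "No Arrow component selected. Enable at least one of:\n"
> >     "  -DARROW_COMPUTE=ON   compute kernels\n"
> >     "  -DARROW_DATASET=ON   dataset API\n"
> >     "  -DARROW_JSON=ON      JSON support, uses RapidJSON\n")
> > endif()
> > ```
> >
> > Configuring without any -DARROW_<COMPONENT>=ON flag would then stop with
> > the overview above rather than silently producing a core-only build.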
> >
> > Uwe
> >
> > On Wed, Sep 18, 2019, at 12:18 AM, Wes McKinney wrote:
> > > hi folks,
> > >
> > > Lately there seem to be more and more people suggesting that the
> > > optional components in the Arrow C++ project are getting in the way of
> > > using the "core" which implements the columnar format and IPC
> > > protocol. I am not sure I agree with this argument, but in general I
> > > think it would be a good idea to make all optional components in the
> > > project "opt in" rather than "opt out"
> > >
> > > To demonstrate where things currently stand, I created a Dockerfile to
> > > try to make the smallest possible and most dependency-free build
> > >
> > > https://github.com/wesm/arrow/tree/cpp-minimal-dockerfile/dev/cpp_minimal
> > >
> > > Here is the output of this build
> > >
> > > https://gist.github.com/wesm/02328fbb463033ed486721b8265f755f
> > >
> > > First, let's look at the CMake invocation
> > >
> > > cmake .. -DBOOST_SOURCE=BUNDLED \
> > > -DARROW_BOOST_USE_SHARED=OFF \
> > > -DARROW_COMPUTE=OFF \
> > > -DARROW_DATASET=OFF \
> > > -DARROW_JEMALLOC=OFF \
> > > -DARROW_JSON=ON \
> > > -DARROW_USE_GLOG=OFF \
> > > -DARROW_WITH_BZ2=OFF \
> > > -DARROW_WITH_ZLIB=OFF \
> > > -DARROW_WITH_ZSTD=OFF \
> > > -DARROW_WITH_LZ4=OFF \
> > > -DARROW_WITH_SNAPPY=OFF \
> > > -DARROW_WITH_BROTLI=OFF \
> > > -DARROW_BUILD_UTILITIES=OFF
> > >
> > > Aside from the issue of how to obtain and link Boost, here are a couple of
> > > things:
> > >
> > > * COMPUTE and DATASET IMHO should be off by default
> > > * All compression libraries should be turned off
> > > * GLOG should be off by default
> > > * Utilities should be off (they are used for integration testing)
> > > * Jemalloc should probably be off, but we should make it clear that
> > > opting in will yield better performance
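> > >
> > > As a rough sketch, opt-in defaults for the flags listed above might look
> > > like the following. It assumes plain CMake option() declarations, which
> > > may differ from how the Arrow build actually defines its options, and the
> > > description strings are placeholders:
> > >
> > > ```cmake
> > > # Sketch only: every optional piece defaults to OFF and must be enabled
> > > # explicitly, e.g. with -DARROW_COMPUTE=ON on the cmake command line.
> > > option(ARROW_COMPUTE "Build the compute module" OFF)
> > > option(ARROW_DATASET "Build the dataset module" OFF)
> > > option(ARROW_USE_GLOG "Use glog for logging" OFF)
> > > option(ARROW_BUILD_UTILITIES "Build utilities (used for integration testing)" OFF)
> > > option(ARROW_JEMALLOC "Use jemalloc (opting in should yield better performance)" OFF)
> > >
> > > # One switch per compression library, all off unless requested.
> > > foreach(codec BZ2 ZLIB ZSTD LZ4 SNAPPY BROTLI)
> > >   option(ARROW_WITH_${codec} "Build with ${codec} compression support" OFF)
> > > endforeach()
> > > ```
> > >
> > > With defaults like these, the long list of =OFF flags in the invocation
> > > above would no longer be needed for a minimal build.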
> > >
> > > I found that it wasn't possible to set ARROW_JSON=OFF without breaking
> > > the build. I opened ARROW-6590 to fix this
> > >
> > > Aside from potentially changing these defaults, there are some things in
> > > the build that we might want to turn into optional pieces:
> > >
> > > * We should see if we can make boost::filesystem not mandatory in the
> > > barebones build, if only to satisfy the peanut gallery
> > > * double-conversion is used in the CSV module. I think that
> > > double-conversion_ep and the CSV module should both be made opt-in
> > > * rapidjson_ep should be made optional. JSON support is only needed
> > > for integration testing
> > >
> > > We could also discuss vendoring flatbuffers.h so that flatbuffers_ep
> > > is not mandatory.
> > >
> > > In general, enabling optional components is primarily relevant for
> > > packagers. If we implement these changes, a number of package build
> > > scripts will have to change.
> > >
> > > Thanks,
> > > Wes
> > >
> >
>
