Hello Micah,

I don't think we have explored using Bazel yet. I would see it as a possible modular alternative, but as you mention it will be a lot of work, and we would probably need a mentor who is familiar with Bazel; otherwise we will probably end up spending too much time on this and get a non-typical Bazel setup.

Uwe

On Wed, Sep 18, 2019, at 8:44 AM, Micah Kornfield wrote:
> It has come up in the past, but I wonder if exploring Bazel as a build
> system with its very explicit dependency graph might help (I'm not sure
> if something similar is available in CMake).
>
> This is also a lot of work, but could also potentially benefit the
> developer experience because we can make unit tests depend on individual
> compilable units instead of all of libarrow. There are trade-offs here as
> well in terms of public API coverage.
>
> On Tue, Sep 17, 2019 at 11:14 PM Uwe L. Korn <uw...@xhochy.com> wrote:
>
> > Hello,
> >
> > I can think of two other alternatives that make it more visible what Arrow
> > core is and what are the optional components:
> >
> > * Error out when no component is selected instead of building just the
> >   core Arrow. Here we could add an explanatory message that lists all
> >   components and for each component 2-3 words what it does and what it
> >   requires. This would make the first-time experience much better.
> > * Split the CMake project into several subprojects. By correctly
> >   structuring the CMakefiles, we should be able to separate out the Arrow
> >   components into separate CMake projects that can be built independently if
> >   needed while all using the same third-party toolchain. We would still have
> >   a top-level CMakeLists.txt that is invoked just like the current one but
> >   through having subprojects, you would not anymore be bound to use the
> >   single top-level one. This would also have some benefit for packagers that
> >   could separate out the build of individual Arrow modules. Furthermore, it
> >   would also make it easier for PoC/academic projects to just take the Arrow
> >   Core sources and drop it in as a CMake subproject; while this is not a good
> >   solution for production-grade software, it is quite common practice to do
> >   this in research.
> >
> > I really like this approach and I think this is something we should have
> > as a long-term target, I'm also happy to implement given the time but I
> > think one CMake refactor per year is the maximum I can do and that was
> > already eaten up by the dependency detection. Also, I'm unsure about how
> > much this would block us at the moment vs the marketing benefit of having a
> > more modular Arrow; currently I'm leaning on the side that the
> > marketing/adoption benefit would be much larger but we lack someone
> > frustration-tolerant to do the refactoring.
> >
> > Uwe
> >
> > On Wed, Sep 18, 2019, at 12:18 AM, Wes McKinney wrote:
> > > hi folks,
> > >
> > > Lately there seem to be more and more people suggesting that the
> > > optional components in the Arrow C++ project are getting in the way of
> > > using the "core" which implements the columnar format and IPC
> > > protocol. I am not sure I agree with this argument, but in general I
> > > think it would be a good idea to make all optional components in the
> > > project "opt in" rather than "opt out"
> > >
> > > To demonstrate where things currently stand, I created a Dockerfile to
> > > try to make the smallest possible and most dependency-free build
> > >
> > > https://github.com/wesm/arrow/tree/cpp-minimal-dockerfile/dev/cpp_minimal
> > >
> > > Here is the output of this build
> > >
> > > https://gist.github.com/wesm/02328fbb463033ed486721b8265f755f
> > >
> > > First, let's look at the CMake invocation
> > >
> > > cmake .. -DBOOST_SOURCE=BUNDLED \
> > >     -DARROW_BOOST_USE_SHARED=OFF \
> > >     -DARROW_COMPUTE=OFF \
> > >     -DARROW_DATASET=OFF \
> > >     -DARROW_JEMALLOC=OFF \
> > >     -DARROW_JSON=ON \
> > >     -DARROW_USE_GLOG=OFF \
> > >     -DARROW_WITH_BZ2=OFF \
> > >     -DARROW_WITH_ZLIB=OFF \
> > >     -DARROW_WITH_ZSTD=OFF \
> > >     -DARROW_WITH_LZ4=OFF \
> > >     -DARROW_WITH_SNAPPY=OFF \
> > >     -DARROW_WITH_BROTLI=OFF \
> > >     -DARROW_BUILD_UTILITIES=OFF
> > >
> > > Aside from the issue of how to obtain and link Boost, here's a couple of
> > > things:
> > >
> > > * COMPUTE and DATASET IMHO should be off by default
> > > * All compression libraries should be turned off
> > > * GLOG should be off by default
> > > * Utilities should be off (they are used for integration testing)
> > > * Jemalloc should probably be off, but we should make it clear that
> > >   opting in will yield better performance
> > >
> > > I found that it wasn't possible to set ARROW_JSON=OFF without breaking
> > > the build. I opened ARROW-6590 to fix this
> > >
> > > Aside from potentially changing these defaults, there's some things in
> > > the build that we might want to turn into optional pieces:
> > >
> > > * We should see if we can make boost::filesystem not mandatory in the
> > >   barebones build, if only to satisfy the peanut gallery
> > > * double-conversion is used in the CSV module. I think that
> > >   double-conversion_ep and the CSV module should both be made opt-in
> > > * rapidjson_ep should be made optional. JSON support is only needed
> > >   for integration testing
> > >
> > > We could also discuss vendoring flatbuffers.h so that flatbuffers_ep
> > > is not mandatory.
> > >
> > > In general, enabling optional components is primarily relevant for
> > > packagers. If we implement these changes, a number of package build
> > > scripts will have to change.
> > >
> > > Thanks,
> > > Wes
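
For illustration only, the "opt in" idea from the quoted thread could be sketched as a small shell helper that turns every optional component OFF unless it is named explicitly. This is a rough sketch, not an existing Arrow script: the function name build_cmake_args is hypothetical, the option names are simply the ones from the invocation quoted above, and whether each can really be switched off on its own is an assumption (see ARROW-6590 for ARROW_JSON). The Boost flags are left out because they concern how a dependency is obtained, not whether a component is built. The helper only prints the command so it can be reviewed before running.

```shell
#!/bin/sh
# Sketch: build an "opt-in" CMake command line.
# Every optional component is OFF unless passed as an argument.
# Option names come from the quoted invocation; the helper itself is hypothetical.

optional="ARROW_COMPUTE ARROW_DATASET ARROW_JEMALLOC ARROW_JSON ARROW_USE_GLOG ARROW_WITH_BZ2 ARROW_WITH_ZLIB ARROW_WITH_ZSTD ARROW_WITH_LZ4 ARROW_WITH_SNAPPY ARROW_WITH_BROTLI ARROW_BUILD_UTILITIES"

build_cmake_args() {
  # "$@" holds the components the caller opted in to.
  args=""
  for opt in $optional; do
    value=OFF
    for wanted in "$@"; do
      if [ "$wanted" = "$opt" ]; then
        value=ON
      fi
    done
    args="$args -D$opt=$value"
  done
  # Print the command instead of executing it.
  printf 'cmake ..%s\n' "$args"
}

# Example: opt in to JSON only, which the quoted build currently requires.
build_cmake_args ARROW_JSON
```

Calling build_cmake_args with no arguments prints a command with every listed component OFF, which corresponds to the barebones build the thread is aiming for.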