Hi folks,

I wanted to share some concerns that I have about our current
trajectory with regard to producing shared libraries from the Arrow
build system.

Currently, a comprehensive build produces many shared libraries:

* libarrow
* libarrow_dataset
* libarrow_flight
* libarrow_python
* libgandiva
* libparquet
* libplasma

There are some others as well. The current approach has a number of problems:

* Each DLL needs its own set of "visibility" macros to control the use
of __declspec(dllimport/dllexport), which Windows requires in order to
import or export symbols between DLLs. See e.g.
https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/visibility.h

* Templates instantiated in one DLL may cause a violation of the One
Definition Rule during linking (collectively we lost at least a day of
work to issues around this in ARROW-6244). In general, it is good to
be able to share common template interfaces.

* Statically-linked dependencies in one shared lib may need to be
statically linked into another library. For example, libgandiva
statically links parts of LLVM, but we will likely have some other
code that makes use of LLVM for other purposes (this has been
discussed in the context of Avro parsing).
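
To make the first point concrete, here is a minimal sketch of the kind
of boilerplate each DLL ends up carrying. The names (mylib_a, mylib_b,
MYLIB_A_EXPORT and so on) are hypothetical and are not Arrow's actual
macros; see the visibility.h link above for the real thing. Both
"libraries" are squeezed into one file here only so that the sketch
compiles on its own.

```cpp
// Normally the build system defines these only while compiling the
// corresponding library; they are set here so that the single-file
// sketch builds. (Hypothetical names, for illustration only.)
#define MYLIB_A_EXPORTING
#define MYLIB_B_EXPORTING

// --- what would live in mylib_a/visibility.h ---
#if defined(_WIN32) || defined(__CYGWIN__)
#  if defined(MYLIB_A_STATIC)
#    define MYLIB_A_EXPORT
#  elif defined(MYLIB_A_EXPORTING)
#    define MYLIB_A_EXPORT __declspec(dllexport)
#  else
#    define MYLIB_A_EXPORT __declspec(dllimport)
#  endif
#else
#  define MYLIB_A_EXPORT __attribute__((visibility("default")))
#endif

// --- what would live in mylib_b/visibility.h: the same block again,
// because mylib_b must export its own symbols while importing mylib_a's ---
#if defined(_WIN32) || defined(__CYGWIN__)
#  if defined(MYLIB_B_STATIC)
#    define MYLIB_B_EXPORT
#  elif defined(MYLIB_B_EXPORTING)
#    define MYLIB_B_EXPORT __declspec(dllexport)
#  else
#    define MYLIB_B_EXPORT __declspec(dllimport)
#  endif
#else
#  define MYLIB_B_EXPORT __attribute__((visibility("default")))
#endif

// A symbol exported from the first library.
MYLIB_A_EXPORT int mylib_a_answer() { return 42; }

// A symbol exported from the second library that uses the first.
MYLIB_B_EXPORT int mylib_b_double(int x) { return 2 * x; }
```

Every additional DLL repeats this block under a new prefix, and every
exported declaration has to be tagged with the macro of the DLL it
belongs to.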

Overall, my preferred solution to these issues is to move to an
approach similar to what the LLVM project does. To illustrate, first
look at the libraries that ship in the llvm-7-dev package on Ubuntu.

That package contains a collection of static "module" libraries that
implement different parts of the LLVM platform. On top of these, a
_single_ shared library, libLLVM-7.so, is produced.

I think we should do the same thing in Apache Arrow, so that the build
only ever produces a single shared library. We can additionally make
the "name" of this shared library configurable to suit different
needs. For example, the default name could simply be "libarrow.so".
But if someone wants to produce a barebones Parquet shared library,
they can override the name to create a "libparquet.so" that contains
only the "libarrow_core.a" and "libarrow_io.a" symbols needed for
reading Parquet files.

This would have additional benefits:

* Use the same visibility macros for all exported C++ symbols, rather
than having to define DLL-specific visibility

* Improved modularization of builds and linking for third party users,
similar to LLVM's modular linking. See how Gandiva requests specific
LLVM components for static linking:
https://github.com/apache/arrow/blob/master/cpp/cmake_modules/FindLLVM.cmake#L53

* Simpler linking and deployment overall, with only one shared library
to deal with.

There are some drawbacks, however:

* Our C++ Linux packaging approach would need to be changed to be more
LLVM-like (a single deb/rpm package containing the C++ platform
rather than the many packages we have now).

Interested to hear from other C++ developers.

Thanks
Wes
