Hi all,
I would like to request your feedback on including Windows binaries in the Maven packages that have native Arrow dependencies, and to draw your attention to the likely impact on jar size.

Five of the 23 Arrow packages on Maven Central have native dependencies. Four of those five bundle their native libraries in the Maven package jar itself. (The exception is the plasma package.) For those four, both .so (Linux shared-object) and .dylib (OSX dynamic library) files are provided in the same jar. Windows native libraries are not included. The packages in question are:

- arrow-dataset
- arrow-orc
- arrow-c
- arrow-gandiva

For developers using Arrow on OSX or Linux, using the arrow-dataset jar with its bundled native library is the same experience as using a pure Java library. Including Windows binaries in the jars would expand the community of developers who could use Arrow features like datasets “out of the box.”

Moreover, it is not trivial for developers on Windows to create their own solution. To the best of my knowledge, pre-compiled JNI DLLs are not available for download, and there are no Windows build scripts or instructions comparable to those for Linux and Mac users (see https://arrow.apache.org/docs/dev/developers/java/building.html#building-arrow-jni-modules ).

Effort

To produce the JNI DLLs, the main effort will be to create new Windows-focused build scripts similar to ci/scripts/java_jni_macos_build.sh in the Arrow repository (https://github.com/apache/arrow/tree/master/ci/scripts), and to incorporate them into the larger build process. Creating these build scripts is a prerequisite for the suggested packaging changes, but it is also desirable in its own right, even if the proposed packaging change is not implemented.

File size concern

The downside of including Windows binaries is that these files are large.
In the 7.0.0 release, the two native library files included in the dataset jar total 78 MB on disk, which is roughly 100% of the total size of the jar. See the table below for more details.

module     .dylib (size in MB)   .so (size in MB)   Combined
dataset    34.6                  43.7               78.3
ORC        29.3                  37.9               67.2
Gandiva    77.4                  87.1               164.5
c-data     <1.0                  <1.0               <1.0
Total      141.3                 167.7

It’s estimated that DLLs would be slightly larger than the .dylib files, so the proposed change would increase the size of the dataset jar from 78.3 MB to about 114 MB.

For reference, here are the native Arrow libraries (.so) in a PyArrow x86-64 wheel:

library          .so (size in MB)
Dataset          2.3
Flight           13.0
Python           2.1
Python-flight    0.1
Plasma           0.2
Parquet          4.3
Arrow            49.0
Total            71.0

Note that this isn't an apples-to-apples comparison: the PyArrow libraries do not include Gandiva, while the Java libraries do not include Flight, Plasma, Parquet, or (presumably) some amount of the code in the Arrow file.

As more C++ functionality is used by Java code, the number of modules with native dependencies may rise, and the size of the individual libraries may increase.

For the sake of simplicity, it is preferable to produce a single jar for each module that contains binaries for the three platforms: Windows, OSX, and Linux. If file size is a significant concern, there are several options:

- Stripping some symbols (`strip -x`) on the Linux dataset JNI library brings it down from 43 to 34 MB, at the cost of debug information. It may be worth considering this option for release builds.
- It may be possible to combine modules to reduce the amount of duplicated code for projects that need more than one module with native dependencies.
- OS-specific Maven packages could be built.

Thank you for your feedback,

larry
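
P.S. To make the single-jar option a bit more concrete, here is a minimal sketch of how a loader might choose among three bundled binaries at runtime. This is an illustration only, not how the Arrow jars load their libraries today: the class name NativeLibNames, the method fileNameFor, and the base name "arrow_dataset_jni" are all assumptions made for the example.

```java
import java.util.Locale;

/** Hypothetical helper: picks the bundled native library file for the current OS. */
public final class NativeLibNames {
    private NativeLibNames() {}

    /**
     * Maps an os.name value to a platform-specific library file name.
     * The naming scheme (lib prefix on Linux/OSX, none on Windows) follows
     * the usual platform conventions; the base name is an assumption.
     */
    public static String fileNameFor(String osName, String baseName) {
        String os = osName.toLowerCase(Locale.ROOT);
        // Check mac/darwin before Windows: "darwin" contains the substring "win".
        if (os.contains("mac") || os.contains("darwin")) {
            return "lib" + baseName + ".dylib";
        }
        if (os.contains("linux")) {
            return "lib" + baseName + ".so";
        }
        if (os.startsWith("windows")) {
            return baseName + ".dll";
        }
        throw new UnsupportedOperationException(
                "No bundled native library for OS: " + osName);
    }

    public static void main(String[] args) {
        // "arrow_dataset_jni" is a placeholder base name for illustration.
        String name = fileNameFor(System.getProperty("os.name"), "arrow_dataset_jni");
        System.out.println("Would load bundled resource: " + name);
    }
}
```

With all three files in one jar, only the matching one is ever loaded on a given machine, which is why the jar size grows while the user experience stays the same on every platform.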