Hi all,

I would like to request your feedback on incorporating Windows binaries in
those Maven packages that have native Arrow dependencies, while drawing
your attention to the likely impact on jar size.


Five of the 23 Arrow packages on Maven Central have native dependencies.
Four of those five bundle their native libraries in the Maven package jar
itself. (The exception is the plasma package.) In those four, both .so
(Linux shared-object) and .dylib (OSX dynamic library) files are provided
in the same jar. Windows native libraries are not included.
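To make the bundling concrete, here is a minimal sketch of how a jar that carries several native libraries might pick and load the right one at runtime. This is not Arrow's actual loader: the class name, the method names, and the assumption that the library sits at the root of the jar are all hypothetical and used only for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Locale;

public class NativeLibraryLoader {

    /** Maps an os.name value to the shared-library file name used on that platform. */
    static String libraryFileName(String osName, String baseName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.contains("mac") || os.contains("darwin")) {
            return "lib" + baseName + ".dylib";
        }
        if (os.startsWith("windows")) {
            return baseName + ".dll";
        }
        // Assumption: everything else is treated as Linux.
        return "lib" + baseName + ".so";
    }

    /** Copies the bundled library (assumed to be at the jar root) to a temp file and loads it. */
    static void loadFromJar(String baseName) throws IOException {
        String fileName = libraryFileName(System.getProperty("os.name"), baseName);
        ClassLoader loader = NativeLibraryLoader.class.getClassLoader();
        try (InputStream in = loader.getResourceAsStream(fileName)) {
            if (in == null) {
                // This is the situation on Windows when no DLL is bundled in the jar.
                throw new IOException("Native library not bundled for this platform: " + fileName);
            }
            Path tmp = Files.createTempFile("jni-", "-" + fileName);
            tmp.toFile().deleteOnExit();
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            System.load(tmp.toAbsolutePath().toString());
        }
    }

    public static void main(String[] args) {
        // "example_jni" is a hypothetical library base name.
        System.out.println(libraryFileName(System.getProperty("os.name"), "example_jni"));
    }
}
```

With a loader along these lines, supporting Windows is a matter of adding the .dll to the jar; callers do not need to change.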


The packages in question are:

   - arrow-dataset
   - arrow-orc
   - arrow-c
   - arrow-gandiva


For developers using Arrow on OSX or Linux, the experience using the
arrow-dataset jar with its bundled native library is the same as using a
pure Java library. Including Windows binaries in the jars would expand the
community of developers who could use Arrow features like datasets “out of
the box.”


Moreover, it is not trivial for devs on Windows to create their own
solution. To the best of my knowledge, pre-compiled JNI DLLs are not
available for download, and there are no build scripts or instructions, as
there are for Linux and Mac users (see
https://arrow.apache.org/docs/dev/developers/java/building.html#building-arrow-jni-modules
).

Effort

To produce the JNI DLLs, the main effort will be to create new
Windows-focused build scripts similar to
ci/scripts/java_jni_macos_build.sh in the arrow repository
(https://github.com/apache/arrow/tree/master/ci/scripts), and to
incorporate them into the larger build process.


Creating these build files is a prerequisite for the suggested packaging
changes, but is also desirable in its own right, even if the proposed
packaging change is not implemented.

File size concern

The downside of including Windows binaries is that these files are large.
In the 7.0.0 release, the two native library files included in the dataset
jar total 78 MB on disk, which is roughly 100% of the total size of the
jar. See table below for more details.

module     .dylib (size in MB)   .so (size in MB)   Combined
dataset    34.6                  43.7               78.3
ORC        29.3                  37.9               67.2
Gandiva    77.4                  87.1               164.5
c-data     <1.0                  <1.0               <1.0
Total      141.3                 167.7



It’s estimated that DLLs would be slightly larger than the dylib files, so
that the proposed change would increase the size of the dataset jar from
78.3 MB to about 114 MB.
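The arithmetic behind that estimate can be sketched as follows. The dylib and .so sizes come from the table above and the 114 MB target from the estimate; the class and method names are hypothetical, and the DLL size is only what those figures imply, not a measurement.

```java
public class JarSizeEstimate {
    // Sizes in MB for the dataset module, taken from the table above.
    static final double DYLIB_MB = 34.6;
    static final double SO_MB = 43.7;

    /** Projected size of the native content if a DLL of the given size is added. */
    static double projectedJarMb(double dllMb) {
        return DYLIB_MB + SO_MB + dllMb;
    }

    /** DLL size implied by a projected total (an inference, not a measured value). */
    static double impliedDllMb(double projectedMb) {
        return projectedMb - (DYLIB_MB + SO_MB);
    }

    public static void main(String[] args) {
        double dll = impliedDllMb(114.0);
        double current = DYLIB_MB + SO_MB;
        System.out.printf("implied DLL size: %.1f MB%n", dll);
        System.out.printf("DLL relative to dylib: %.2f%n", dll / DYLIB_MB);
        System.out.printf("jar growth factor: %.2f%n", projectedJarMb(dll) / current);
    }
}
```

The same calculation could be repeated for the ORC and Gandiva modules once real DLL sizes are available.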

For reference, here are the native Arrow libraries (.so) in a PyArrow
x86-64 wheel:

library          size (MB)
Dataset          2.3
Flight           13.0
Python           2.1
Python-flight    0.1
Plasma           0.2
Parquet          4.3
Arrow            49.0
Total            71.0

Note that this isn't an apples-to-apples comparison: the PyArrow libraries
do not include Gandiva, while the Java libraries do not include Flight,
Plasma, Parquet, or (presumably) some amount of the code in the Arrow file.

As more C++ functionality is used by Java code, the number of modules with
native dependencies may rise, and the size of the individual libraries may
increase.

For the sake of simplicity, it is preferable to produce a single jar for
each module that contains binaries for the three platforms: Windows, OSX,
and Linux. If file size is a significant concern, there are several options:



   - Stripping some symbols (`strip -x`) on the Linux dataset JNI library
     brings it down from 43 to 34 MB, at the cost of debug information. It may
     be worth considering this option for release builds.
   - It may be possible to combine modules to reduce the amount of duplicated
     code for projects that need more than one module with native dependencies.
   - OS-specific Maven packages could be built.


Thank you for your feedback,

larry
