Hi all,

Please see
https://docs.google.com/document/d/1y25kRrXlORnUD9p7wTMOWjC6wONEyI9rU-Pv4q1udZ8/edit?usp=sharing
for a copy of this email with proper formatting.

thanks.

On Mon, May 2, 2022 at 4:23 PM Larry White <ljw1...@gmail.com> wrote:

> Hi all,
>
>
> I would like to request your feedback on incorporating Windows binaries in
> those Maven packages that have native Arrow dependencies, while drawing
> your attention to the likely impact on jar size.
>
>
> Five of the 23 Arrow packages on Maven Central have native dependencies.
> Four of those five bundle the native libraries in the Maven package jar
> itself. (The exception is the plasma package.) For those four, both .so
> (Linux shared-object) and .dylib (OSX dynamic library) files are provided
> in the same jar. Windows native libraries are not included.
>
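To make the bundling described above concrete, here is a minimal sketch of how a jar that carries several platform libraries could pick and load the right one at runtime. This is not Arrow's actual loader: the class name `BundledNativeLoader`, the assumption that the libraries sit at the jar root, and the base name used in the usage note are all hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Locale;

/** Hypothetical helper for illustration; not Arrow's actual loader. */
final class BundledNativeLoader {

  /** Maps an os.name value and a base name to a platform library file name. */
  static String libraryFileName(String osName, String baseName) {
    String os = osName.toLowerCase(Locale.ROOT);
    // Check macOS first: "darwin" also contains the substring "win".
    if (os.contains("mac") || os.contains("darwin")) {
      return "lib" + baseName + ".dylib";
    }
    if (os.contains("win")) {
      return baseName + ".dll";
    }
    // Assumption: everything else uses Linux-style naming.
    return "lib" + baseName + ".so";
  }

  /** Copies the bundled library out of the jar to a temp file and loads it. */
  static void load(String baseName) throws IOException {
    String fileName = libraryFileName(System.getProperty("os.name"), baseName);
    // Assumption: the library is stored at the root of the jar.
    try (InputStream in =
        BundledNativeLoader.class.getClassLoader().getResourceAsStream(fileName)) {
      if (in == null) {
        throw new IOException("No bundled native library for this platform: " + fileName);
      }
      Path tmp = Files.createTempFile("jni-", "-" + fileName);
      tmp.toFile().deleteOnExit();
      Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
      System.load(tmp.toAbsolutePath().toString());
    }
  }
}
```

With a jar that only ships .so and .dylib files, a call such as `BundledNativeLoader.load("example_jni")` would fail on Windows with the "No bundled native library" error, which is the gap the proposal is about.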
>
> The packages in question are:
>
>    - arrow-dataset
>    - arrow-orc
>    - arrow-c
>    - arrow-gandiva
>
>
> For developers using Arrow on OSX or Linux, the experience using the
> arrow-dataset jar with its bundled native library is the same as using a
> pure Java library. Including Windows binaries in the jars would expand the
> community of developers who could use Arrow features like datasets “out of
> the box.”
>
>
> Moreover, it is not trivial for devs on Windows to create their own
> solution. To the best of my knowledge, pre-compiled JNI DLLs are not
> available for download, and there are no build scripts or instructions,
> as there are for Linux and Mac users (see
> https://arrow.apache.org/docs/dev/developers/java/building.html#building-arrow-jni-modules).
>
> Effort
>
> To produce the JNI DLLs, the main effort will be to create new
> Windows-focused build scripts similar to
> ci/scripts/java_jni_macos_build.sh in the Arrow repository
> (https://github.com/apache/arrow/tree/master/ci/scripts), and to
> incorporate them into the larger build process.
>
>
> Creating these build scripts is a prerequisite for the proposed packaging
> change, but is also desirable in its own right, even if the packaging
> change is not implemented.
>
> File size concern
>
> The downside of including Windows binaries is that these files are large.
> In the 7.0.0 release, the two native library files included in the dataset
> jar total 78 MB on disk, which is roughly 100% of the total size of the
> jar. See table below for more details.
>
> module     .dylib (MB)   .so (MB)   Combined (MB)
> dataset        34.6        43.7         78.3
> ORC            29.3        37.9         67.2
> Gandiva        77.4        87.1        164.5
> c-data         <1.0        <1.0         <1.0
> Total         141.3       167.7
>
>
> It’s estimated that DLLs would be slightly larger than the dylib files, so
> that the proposed change would increase the size of the dataset jar from
> 78.3 MB to about 114 MB.
>
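The estimate above is simple arithmetic, sketched here so the assumption behind it is explicit. The 1.03 factor is not stated in the email: it is an assumed stand-in for "slightly larger than the dylib", chosen because it reproduces the "about 114 MB" figure. The class and method names are hypothetical.

```java
/** Hypothetical helper for illustration only. */
final class JarSizeEstimate {

  /**
   * Estimated jar size in MB after adding a Windows DLL, assuming the DLL is
   * dllFactor times the size of the existing .dylib.
   */
  static double withWindowsDll(double currentJarMb, double dylibMb, double dllFactor) {
    return currentJarMb + dylibMb * dllFactor;
  }

  public static void main(String[] args) {
    // 78.3 MB (current dataset jar) and 34.6 MB (dataset .dylib) come from the
    // table above; 1.03 is an assumed factor, not a measured one.
    double estimate = withWindowsDll(78.3, 34.6, 1.03);
    System.out.printf("Estimated dataset jar size: %.1f MB%n", estimate);
  }
}
```

Measuring an actual Windows build would replace the assumed factor with a real number.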
> For reference, here are the native Arrow libraries (.so) in a PyArrow
> x86-64 wheel:
>
> Library          Size (MB)
> Dataset             2.3
> Flight             13.0
> Python              2.1
> Python-flight       0.1
> Plasma              0.2
> Parquet             4.3
> Arrow              49.0
> Total              71.0
>
>
> Note that this isn't an apples-to-apples comparison: the PyArrow libraries
> do not include Gandiva, while the Java libraries do not include Flight,
> Plasma, Parquet, or (presumably) some amount of the code in the Arrow file.
>
> As more C++ functionality is used by Java code, the number of modules with
> native dependencies may rise, and the size of the individual libraries may
> increase.
>
> For the sake of simplicity, it is preferable to produce a single jar for
> each module that contains binaries for all three platforms: Windows, OSX,
> and Linux. If file size is a significant concern, there are several options:
>
>
>
>    - Stripping some symbols (`strip -x`) on the Linux dataset JNI library
>      brings it down from 43 to 34 MB, at the cost of debug information. It
>      may be worth considering this option for release builds.
>    - It may be possible to combine modules to reduce the amount of
>      duplicated code for projects that need more than one module with
>      native dependencies.
>    - OS-specific Maven packages could be built.
>
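For the OS-specific packages option, each build would need a stable per-platform label, for example as a Maven classifier. The sketch below derives such a label from the `os.name` and `os.arch` system properties. The proposal does not specify any naming scheme, so the labels here (`linux-x86_64`, `osx-x86_64`, `windows-x86_64`) and the class name are assumptions for illustration.

```java
import java.util.Locale;

/** Hypothetical helper; the classifier naming scheme is an assumption. */
final class NativeClassifier {

  /** Builds a label such as "linux-x86_64" from os.name and os.arch values. */
  static String classifier(String osName, String osArch) {
    String os = osName.toLowerCase(Locale.ROOT);
    String family;
    // Check macOS first: "darwin" also contains the substring "win".
    if (os.contains("mac") || os.contains("darwin")) {
      family = "osx";
    } else if (os.contains("win")) {
      family = "windows";
    } else if (os.contains("linux")) {
      family = "linux";
    } else {
      throw new IllegalArgumentException("Unsupported OS: " + osName);
    }
    String arch = osArch.toLowerCase(Locale.ROOT);
    // The JVM reports "amd64" on Linux and Windows and "x86_64" on macOS.
    if (arch.equals("amd64") || arch.equals("x86_64")) {
      arch = "x86_64";
    }
    return family + "-" + arch;
  }

  public static void main(String[] args) {
    System.out.println(
        classifier(System.getProperty("os.name"), System.getProperty("os.arch")));
  }
}
```

The trade-off against the single-jar approach is that users (or their build tool) must then select the artifact matching their platform, instead of getting one jar that works everywhere.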
>
> Thank you for your feedback,
>
> larry
>
