Hi all,

Please see https://docs.google.com/document/d/1y25kRrXlORnUD9p7wTMOWjC6wONEyI9rU-Pv4q1udZ8/edit?usp=sharing for a copy of this email with proper formatting.
Thanks.

On Mon, May 2, 2022 at 4:23 PM Larry White <ljw1...@gmail.com> wrote:

> Hi all,
>
> I would like to request your feedback on incorporating Windows binaries in
> those Maven packages that have native Arrow dependencies, while drawing
> your attention to the likely impact on jar size.
>
> Five of the 23 Arrow packages on Maven Central have native dependencies.
> Four of those five have bundled native libraries included in the Maven
> package jar itself. (The exception is the plasma package.) For those four,
> both .so (Linux shared-object) and .dylib (OSX dynamic library) files are
> provided in the same jar. Windows native libraries are not included.
>
> The packages in question are:
>
> - arrow-dataset
> - arrow-orc
> - arrow-c
> - arrow-gandiva
>
> For developers using Arrow on OSX or Linux, the experience using the
> arrow-dataset jar with its bundled native library is the same as using a
> pure Java library. Including Windows binaries in the jars would expand the
> community of developers who could use Arrow features like datasets “out of
> the box.”
>
> Moreover, it is not trivial for devs on Windows to create their own
> solution. To the best of my knowledge, pre-compiled JNI DLLs are not
> available for download, and there are no build scripts or instructions,
> as there are for Linux and Mac users (see
> https://arrow.apache.org/docs/dev/developers/java/building.html#building-arrow-jni-modules
> ).
>
> Effort
>
> To produce the JNI DLLs, the main effort will be to create new
> Windows-focused build scripts similar to
> arrow/ci/scripts/java_jni_macos_build.sh
> (see https://github.com/apache/arrow/tree/master/ci/scripts)
> and incorporate them into the larger build process.
>
> Creating these build files is a prerequisite for the suggested packaging
> changes, but is also desirable in its own right, even if the proposed
> packaging change is not implemented.
>
> File size concern
>
> The downside of including Windows binaries is that these files are large.
> In the 7.0.0 release, the two native library files included in the dataset
> jar total 78 MB on disk, which is roughly 100% of the total size of the
> jar. See table below for more details.
>
> module     .dylib (size in MB)   .so (size in MB)   Combined
> dataset    34.6                  43.7               78.3
> ORC        29.3                  37.9               67.2
> Gandiva    77.4                  87.1               164.5
> c-data     <1.0                  <1.0               <1.0
> Total      141.3                 167.7
>
> It’s estimated that DLLs would be slightly larger than the dylib files, so
> that the proposed change would increase the size of the dataset jar from
> 78.3 MB to about 114 MB.
>
> For reference, here are the native Arrow libraries (.so) in a PyArrow
> x86-64 wheel:
>
> Dataset          2.3
> Flight           13.0
> Python           2.1
> Python-flight    0.1
> Plasma           0.2
> Parquet          4.3
> Arrow            49.0
> Total            71.0
>
> Note that this isn't an apples-to-apples comparison: the PyArrow libraries
> do not include Gandiva, while the Java libraries do not include Flight,
> Plasma, Parquet, or (presumably) some amount of the code in the Arrow file.
>
> As more C++ functionality is used by Java code, the number of modules with
> native dependencies may rise, and the size of the individual libraries may
> increase.
>
> For the sake of simplicity, it is preferable to produce a single jar for
> each module that contains binaries for the three platforms: Windows, OSX,
> and Linux. If file size is a significant concern, there are several options:
>
> - Stripping some symbols (`strip -x`) on the Linux dataset JNI library
>   brings it down from 43 to 34 MB, at the cost of debug information.
>   It may be worth considering this option for release builds.
> - It may be possible to combine modules to reduce the amount of
>   duplicated code for projects that need more than one module with native
>   dependencies.
> - OS-specific Maven packages could be built.
>
> Thank you for your feedback,
>
> larry
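
As a rough illustration of the size estimate in the quoted message, the sketch below applies the same reasoning to the other modules. It assumes a DLL is about 3% larger than the corresponding .dylib. That ratio is inferred from the 78.3 MB to roughly 114 MB dataset figure and is not a measurement, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class JarSizeEstimate {
    // Assumption: a Windows DLL is about 3% larger than the matching .dylib.
    // Inferred from the dataset estimate (78.3 MB -> about 114 MB); not measured.
    static final double ASSUMED_DLL_TO_DYLIB_RATIO = 1.032;

    /** Estimated native payload in MB once a DLL sits alongside the .dylib and .so. */
    static double estimateWithDll(double dylibMb, double soMb, double dllRatio) {
        return dylibMb + soMb + dylibMb * dllRatio;
    }

    public static void main(String[] args) {
        // Sizes in MB for the 7.0.0 release, copied from the table: {dylib, so}.
        Map<String, double[]> modules = new LinkedHashMap<>();
        modules.put("dataset", new double[] {34.6, 43.7});
        modules.put("ORC", new double[] {29.3, 37.9});
        modules.put("Gandiva", new double[] {77.4, 87.1});

        for (Map.Entry<String, double[]> e : modules.entrySet()) {
            double[] s = e.getValue();
            double current = s[0] + s[1];
            double estimated = estimateWithDll(s[0], s[1], ASSUMED_DLL_TO_DYLIB_RATIO);
            System.out.printf("%-8s %6.1f MB -> about %6.1f MB%n",
                    e.getKey(), current, estimated);
        }
    }
}
```

Changing ASSUMED_DLL_TO_DYLIB_RATIO shows how sensitive the per-module totals are to the actual DLL sizes, which will only be known once the Windows build scripts exist.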