Netty (one of the Arrow dependencies) already has per-OS JARs (though those
deps are optional). I would also be slightly in favor of doing this, so long as
the way to use them is well documented.
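
To make the "well documented" part concrete, here is a minimal Java sketch of
how a user or loader might work out which per-OS artifact applies to the
running JVM. It assumes classifier names of the form os-arch, like the ones
Netty publishes (for example linux-x86_64). NativeClassifier is a hypothetical
helper for illustration, not an existing Arrow or Netty class, and the set of
supported platforms is an assumption.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of Arrow or Netty.
final class NativeClassifier {
    private NativeClassifier() {}

    /** Maps raw os.name / os.arch values to a Netty-style classifier such as "linux-x86_64". */
    static String classifierFor(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);

        String osPart;
        if (os.contains("linux")) {
            osPart = "linux";
        } else if (os.contains("mac") || os.contains("darwin")) {
            osPart = "osx";
        } else if (os.contains("windows")) {
            osPart = "windows";
        } else {
            throw new IllegalArgumentException("Unsupported OS: " + osName);
        }

        String archPart;
        if (arch.equals("amd64") || arch.equals("x86_64")) {
            archPart = "x86_64";
        } else if (arch.equals("aarch64") || arch.equals("arm64")) {
            archPart = "aarch_64";
        } else {
            throw new IllegalArgumentException("Unsupported architecture: " + osArch);
        }
        return osPart + "-" + archPart;
    }

    /** Classifier for the JVM this code is running on. */
    static String current() {
        return classifierFor(System.getProperty("os.name"), System.getProperty("os.arch"));
    }

    public static void main(String[] args) {
        System.out.println(current());
    }
}
```

A user on Windows would then depend on the artifact whose classifier matches
that string, which is the step that would need to be spelled out in the docs.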

Netty also splits its native code among different dependencies, since there is
some common code. Is that possibly a viable option for us? It depends on how
much is actually shared between these modules. (There may very well not be much
shared, since Gandiva probably gets most of its weight from LLVM, and Dataset
from Parquet/compression libraries/etc.)
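
For a rough sense of what per-OS JARs would buy us, here is a small Java
sketch using only the dataset figures from Larry's message quoted below (34.6
MB .dylib, 43.7 MB .so, about 114 MB with a DLL added). DatasetJarSizes is a
hypothetical name, the DLL size is only what those numbers imply, and jar
compression and non-native contents are ignored.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical back-of-the-envelope helper. Sizes in MB are copied from the
// 7.0.0 figures in the quoted proposal; jar compression is ignored.
final class DatasetJarSizes {
    static final double DYLIB_MB = 34.6;          // macOS library
    static final double SO_MB = 43.7;             // Linux library
    static final double PROPOSED_TOTAL_MB = 114.0; // "about 114 MB" estimate

    private DatasetJarSizes() {}

    /** Native payload of today's single jar (.dylib + .so). */
    static double currentJarMb() {
        return DYLIB_MB + SO_MB;
    }

    /** DLL size implied by the proposal's estimate for the three-platform jar. */
    static double impliedDllMb() {
        return PROPOSED_TOTAL_MB - currentJarMb();
    }

    /** Approximate native payload if each OS got its own jar. */
    static Map<String, Double> perOsJarMb() {
        Map<String, Double> sizes = new LinkedHashMap<>();
        sizes.put("linux", SO_MB);
        sizes.put("osx", DYLIB_MB);
        sizes.put("windows", impliedDllMb());
        return sizes;
    }

    public static void main(String[] args) {
        System.out.printf("single jar today: %.1f MB%n", currentJarMb());
        System.out.printf("implied DLL size: %.1f MB%n", impliedDllMb());
        perOsJarMb().forEach((os, mb) -> System.out.printf("%s-only jar: %.1f MB%n", os, mb));
    }
}
```

Under these assumptions a Linux-only dataset jar would carry about 43.7 MB of
native code, against roughly 114 MB for the all-in-one jar.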

On Wed, May 4, 2022, at 11:23, Antoine Pitrou wrote:
> Le 04/05/2022 à 17:21, Alessandro Molina a écrit :
>> The proposal seems reasonable to me, we should do our best at providing
>> users the same experience on the various systems whenever possible.
>> 
>> As long as we don't receive complaints about the package size, I think we
>> can live with it. If it becomes a problem for our users, we can always make
>> per-system binaries in the future.
>
> Hmm, I think it wouldn't hurt to be proactive wrt. package sizes. 
> Negative feedback doesn't always get propagated to us, and instead we 
> may lose users due to the bad first impression.
>
> Regards
>
> Antoine.
>
>
>
>> 
>> PS: I think you forgot to enable comments on the google docs, that's
>> something you usually want to allow as it eases providing feedback.
>> 
>> On Tue, May 3, 2022 at 4:19 PM Larry White <ljw1...@gmail.com> wrote:
>> 
>>> Hi all,
>>>
>>> Please see
>>>
>>> https://docs.google.com/document/d/1y25kRrXlORnUD9p7wTMOWjC6wONEyI9rU-Pv4q1udZ8/edit?usp=sharing
>>> for a copy of this email with proper formatting.
>>>
>>> thanks.
>>>
>>> On Mon, May 2, 2022 at 4:23 PM Larry White <ljw1...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>>
>>>> I would like to request your feedback on incorporating Windows binaries
>>>> in those Maven packages that have native Arrow dependencies, while drawing
>>>> your attention to the likely impact on jar size.
>>>>
>>>>
>>>> Five of the 23 Arrow packages on Maven Central have native dependencies.
>>>> Four of those five have bundled native libraries included in the Maven
>>>> package jar itself. (The exception is the plasma package.) For the
>>>> others, both .so (Linux shared-object) and .dylib (OSX dynamic library)
>>>> files are provided in the same jar. Windows native libraries are not
>>>> included.
>>>>
>>>>
>>>> The packages in question are:
>>>>
>>>>     - arrow-dataset
>>>>     - arrow-orc
>>>>     - arrow-c
>>>>     - arrow-gandiva
>>>>
>>>>
>>>> For developers using Arrow on OSX or Linux, the experience using the
>>>> arrow-dataset jar with its bundled native library is the same as using a
>>>> pure Java library. Including Windows binaries in the jars would expand
>>>> the community of developers who could use Arrow features like datasets
>>>> “out of the box.”
>>>>
>>>>
>>>> Moreover, it is not trivial for devs on Windows to create their own
>>>> solution. To the best of my knowledge, pre-compiled JNI DLLs are not
>>>> available for download, and there are no build scripts or instructions,
>>>> as there are for Linux and Mac users (see
>>>> https://arrow.apache.org/docs/dev/developers/java/building.html#building-arrow-jni-modules
>>>> ).
>>>>
>>>> Effort
>>>>
>>>> To produce the JNI DLLs, the main effort will be to create new
>>>> Windows-focused build scripts similar to java_jni_macos_build.sh in
>>>> arrow/ci/scripts <https://github.com/apache/arrow/tree/master/ci/scripts>,
>>>> and incorporate them into the larger build process.
>>>>
>>>>
>>>> Creating these build files is a prerequisite for the suggested packaging
>>>> changes, but is also desirable in its own right, even if the proposed
>>>> packaging change is not implemented.
>>>>
>>>> File size concern
>>>>
>>>> The downside of including Windows binaries is that these files are large.
>>>> In the 7.0.0 release, the two native library files included in the
>>>> dataset jar total 78 MB on disk, which is roughly 100% of the total size
>>>> of the jar. See table below for more details.
>>>>
>>>> module     .dylib (size in MB)   .so (size in MB)   Combined
>>>> dataset    34.6                  43.7               78.3
>>>> ORC        29.3                  37.9               67.2
>>>> Gandiva    77.4                  87.1               164.5
>>>> c-data     <1.0                  <1.0               <1.0
>>>> Total      141.3                 167.7
>>>>
>>>>
>>>>
>>>> It’s estimated that DLLs would be slightly larger than the dylib files,
>>>> so that the proposed change would increase the size of the dataset jar
>>>> from 78.3 MB to about 114 MB.
>>>>
>>>> For reference, here are the native Arrow libraries (.so) in a PyArrow
>>>> x86-64 wheel:
>>>>
>>>> library          .so (size in MB)
>>>> Dataset          2.3
>>>> Flight           13.0
>>>> Python           2.1
>>>> Python-flight    0.1
>>>> Plasma           0.2
>>>> Parquet          4.3
>>>> Arrow            49.0
>>>> Total            71.0
>>>>
>>>> Note that this isn't an apples-to-apples comparison: the PyArrow
>>>> libraries do not include Gandiva, while the Java libraries do not include
>>>> Flight, Plasma, Parquet, or (presumably) some amount of the code in the
>>>> Arrow file.
>>>>
>>>> As more C++ functionality is used by Java code, the number of modules with
>>>> native dependencies may rise, and the size of the individual libraries
>>>> may increase.
>>>>
>>>> For the sake of simplicity, it is preferable to produce a single jar for
>>>> each module that contains binaries for the three platforms: Windows, OSX,
>>>> and Linux. If file size is a significant concern, there are several
>>>> options:
>>>>
>>>>
>>>>
>>>>     - Stripping some symbols (`strip -x`) on the Linux dataset JNI library
>>>>       brings it down from 43 to 34 MB, at the cost of debug information.
>>>>       It may be worth considering this option for release builds.
>>>>     - It may be possible to combine modules to reduce the amount of
>>>>       duplicated code for projects that need more than one module with
>>>>       native dependencies.
>>>>     - OS-specific Maven packages could be built.
>>>>
>>>>
>>>> Thank you for your feedback,
>>>>
>>>> larry
>>>>
>>>
>>
