On Wed, Apr 3, 2019 at 2:24 AM Wes McKinney <wesmck...@gmail.com> wrote:

> hi folks,
>
> I noticed that the arrow-cpp conda packages for Windows have ballooned in size
> to nearly 140 megabytes for RC4
>
>
> https://bintray.com/apache/arrow/python-rc/0.13.0-rc4#files/python-rc/0.13.0-rc4
>
> Looking at one of these packages it seems the Windows static libraries
> are huge -- I'm not sure why they are so big but we should probably
> investigate
>
> $ ll Library/lib/
> total 741796
> -rw-r--r-- 1 wesm wesm   1507048 Mar 27 23:34 arrow.lib
> -rw-r--r-- 1 wesm wesm     76184 Mar 27 23:35 arrow_python.lib
> -rw-r--r-- 1 wesm wesm  61322082 Mar 27 23:36 arrow_python_static.lib
> -rw-r--r-- 1 wesm wesm 328090044 Mar 27 23:37 arrow_static.lib
> drwxr-xr-x 3 wesm wesm      4096 Apr  2 19:12 cmake/
> -rw-r--r-- 1 wesm wesm    302496 Mar 27 23:38 gandiva.lib
> -rw-r--r-- 1 wesm wesm 239314018 Mar 27 23:40 gandiva_static.lib
> -rw-r--r-- 1 wesm wesm    491292 Mar 27 23:41 parquet.lib
> -rw-r--r-- 1 wesm wesm 128473780 Mar 27 23:42 parquet_static.lib
> drwxr-xr-x 2 wesm wesm      4096 Apr  2 19:12 pkgconfig/
>
> As a mitigating measure in the meantime, I would suggest that we stop
> bundling the static libraries in the arrow-cpp conda package, since
> we're just hurting release managers and users with a large package
> download when they `conda install pyarrow`. Can someone open a JIRA
> issue about this? If packaging the static libraries in conda is
> something that people need then we could create a separate
> arrow-cpp-static artifact
>
Agreed, but I'm not sure what the conda-forge policy for static libraries is.
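
To quantify how much the static libraries contribute, here is a rough sketch. It assumes an unpacked package directory laid out like the Library/lib/ listing quoted above. The function name and the "_static.lib" suffix filter are illustrative only and are not part of any Arrow or conda tooling.

```python
import os


def static_lib_bytes(lib_dir, suffix="_static.lib"):
    """Return (static_bytes, total_bytes) for regular files directly in lib_dir.

    Assumption: static libraries are identified purely by the filename
    suffix, as in the quoted listing (arrow_static.lib, gandiva_static.lib, ...).
    Subdirectories such as cmake/ and pkgconfig/ are skipped.
    """
    static_bytes = 0
    total_bytes = 0
    for entry in os.scandir(lib_dir):
        if not entry.is_file():
            continue
        size = entry.stat().st_size
        total_bytes += size
        if entry.name.endswith(suffix):
            static_bytes += size
    return static_bytes, total_bytes
```

Adding up the sizes in the quoted listing by hand, the four *_static.lib files come to 757,199,924 bytes out of 759,576,944 bytes for all files in that directory, so the static libraries account for over 99% of the uncompressed size there.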

>
> The production packages in conda-forge are a bit smaller (less than
> 100 MB), but still quite large.
>
> https://anaconda.org/conda-forge/arrow-cpp/files
>
> I noticed also that the wheel Python packages on Linux have become
> quite large. The Python 3.7 wheel is 48.5 megabytes for example. The
> expected culprit is libgandiva.so, where I see
>
> -rwxr-xr-x 1 wesm wesm   131047 Apr  2 19:18 libarrow_boost_filesystem.so*
> -rwxr-xr-x 1 wesm wesm   131047 Apr  2 19:18 libarrow_boost_filesystem.so.1.66.0*
> -rwxr-xr-x 1 wesm wesm  1253641 Apr  2 19:18 libarrow_boost_regex.so*
> -rwxr-xr-x 1 wesm wesm  1253641 Apr  2 19:18 libarrow_boost_regex.so.1.66.0*
> -rwxr-xr-x 1 wesm wesm    30081 Apr  2 19:18 libarrow_boost_system.so*
> -rwxr-xr-x 1 wesm wesm    30081 Apr  2 19:18 libarrow_boost_system.so.1.66.0*
> -rwxr-xr-x 1 wesm wesm  1613712 Apr  2 19:18 libarrow_python.so*
> -rwxr-xr-x 1 wesm wesm  1400561 Apr  2 19:18 libarrow_python.so.13*
> -rwxr-xr-x 1 wesm wesm 12543416 Apr  2 19:18 libarrow.so*
> -rwxr-xr-x 1 wesm wesm 11540172 Apr  2 19:18 libarrow.so.13*
> -rw-r--r-- 1 wesm wesm  6393593 Apr  2 19:18 lib.cpp
> -rwxr-xr-x 1 wesm wesm  2558504 Apr  2 19:18 lib.cpython-37m-x86_64-linux-gnu.so*
> -rwxr-xr-x 1 wesm wesm 61260912 Apr  2 19:18 libgandiva.so*
> -rwxr-xr-x 1 wesm wesm 57342916 Apr  2 19:18 libgandiva.so.13*
> -rwxr-xr-x 1 wesm wesm  3567224 Apr  2 19:18 libparquet.so*
> -rwxr-xr-x 1 wesm wesm  3035367 Apr  2 19:18 libparquet.so.13*
> -rwxr-xr-x 1 wesm wesm   352440 Apr  2 19:18 libplasma.so*
> -rwxr-xr-x 1 wesm wesm   315802 Apr  2 19:18 libplasma.so.13*
>
> There's something very odd here, though, which is that libgandiva.so
> and libgandiva.so.13 appear to be distinct. They have different
> checksums, for example

This is true for libarrow, libparquet and libplasma as well. I've just
checked that the previous wheel ships its shared libraries the same way.

>
> (pyarrow-0.13.0-py37-test) 19:19 ~/Downloads/arrow-cpp-py36-vc14 $ sha256sum ~/miniconda/envs/pyarrow-0.13.0-py37-test/lib/python3.7/site-packages/pyarrow/libgandiva.so
> 8f1026d7bf476b90a0cac8239947ad334ee91cd31a944102aff6e8a67ae973e8  /home/wesm/miniconda/envs/pyarrow-0.13.0-py37-test/lib/python3.7/site-packages/pyarrow/libgandiva.so
> (pyarrow-0.13.0-py37-test) 19:21 ~/Downloads/arrow-cpp-py36-vc14 $ sha256sum ~/miniconda/envs/pyarrow-0.13.0-py37-test/lib/python3.7/site-packages/pyarrow/libgandiva.so.13
> 9969a50787f8e0411115c0bfffccd3a350fde5f8c2f319acd72f3cf8097365dc  /home/wesm/miniconda/envs/pyarrow-0.13.0-py37-test/lib/python3.7/site-packages/pyarrow/libgandiva.so.13
>
In the case of the OSX wheels the checksums are equal, so I suspect auditwheel [1]
does some magic behind the curtain.

[1] https://github.com/apache/arrow/blob/master/python/manylinux1/build_arrow.sh#L122
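
To check every library in an installed wheel at once rather than one sha256sum call at a time, something like the following sketch could be used. It assumes the versioned copy sits next to the unversioned one and differs only by a trailing ".13", as in the listing above. The function names are hypothetical helpers, not part of pyarrow or auditwheel.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_versioned_libs(lib_dir, abi_suffix=".13"):
    """Map each lib*.so name to True if its versioned sibling has identical content.

    Assumption: the sibling is named <name> + abi_suffix (e.g. libgandiva.so
    and libgandiva.so.13). Libraries without such a sibling are skipped.
    """
    results = {}
    for unversioned in sorted(Path(lib_dir).glob("lib*.so")):
        versioned = unversioned.with_name(unversioned.name + abi_suffix)
        if versioned.is_file():
            results[unversioned.name] = sha256_of(unversioned) == sha256_of(versioned)
    return results
```

Calling compare_versioned_libs() on the site-packages/pyarrow directory of a Linux wheel and of an OSX wheel (with the suffix adjusted to that platform's naming) would show which pairs are byte-identical and which are not.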

>
> That seems buggy to me. We might also investigate if there's a way to
> trim the binary sizes in some way.
>
> Thanks
> Wes
>
