On Wed, Apr 3, 2019 at 2:24 AM Wes McKinney <wesmck...@gmail.com> wrote:
> hi folks,
>
> I noticed that the arrow-cpp conda packages for Windows have ballooned
> in size to nearly 140 megabytes for RC4
>
> https://bintray.com/apache/arrow/python-rc/0.13.0-rc4#files/python-rc/0.13.0-rc4
>
> Looking at one of these packages it seems the Windows static libraries
> are huge -- I'm not sure why they are so big but we should probably
> investigate
>
> $ ll Library/lib/
> total 741796
> -rw-r--r-- 1 wesm wesm   1507048 Mar 27 23:34 arrow.lib
> -rw-r--r-- 1 wesm wesm     76184 Mar 27 23:35 arrow_python.lib
> -rw-r--r-- 1 wesm wesm  61322082 Mar 27 23:36 arrow_python_static.lib
> -rw-r--r-- 1 wesm wesm 328090044 Mar 27 23:37 arrow_static.lib
> drwxr-xr-x 3 wesm wesm      4096 Apr  2 19:12 cmake/
> -rw-r--r-- 1 wesm wesm    302496 Mar 27 23:38 gandiva.lib
> -rw-r--r-- 1 wesm wesm 239314018 Mar 27 23:40 gandiva_static.lib
> -rw-r--r-- 1 wesm wesm    491292 Mar 27 23:41 parquet.lib
> -rw-r--r-- 1 wesm wesm 128473780 Mar 27 23:42 parquet_static.lib
> drwxr-xr-x 2 wesm wesm      4096 Apr  2 19:12 pkgconfig/
>
> As a mitigating measure in the meantime, I would suggest that we stop
> bundling the static libraries in the arrow-cpp conda package, since
> we're just hurting release managers and users with a large package
> download when they `conda install pyarrow`. Can someone open a JIRA
> issue about this? If packaging the static libraries in conda is
> something that people need then we could create a separate
> arrow-cpp-static artifact
>

Agreed, but I'm not sure what the conda-forge policy is for static libraries.

>
> The production packages in conda-forge are a bit smaller (less than
> 100 MB), but still quite large.
>
> https://anaconda.org/conda-forge/arrow-cpp/files
>
> I noticed also that the wheel Python packages on Linux have become
> quite large. The Python 3.7 wheel is 48.5 megabytes for example.
> The expected culprit is libgandiva.so, where I see
>
> -rwxr-xr-x 1 wesm wesm   131047 Apr  2 19:18 libarrow_boost_filesystem.so*
> -rwxr-xr-x 1 wesm wesm   131047 Apr  2 19:18 libarrow_boost_filesystem.so.1.66.0*
> -rwxr-xr-x 1 wesm wesm  1253641 Apr  2 19:18 libarrow_boost_regex.so*
> -rwxr-xr-x 1 wesm wesm  1253641 Apr  2 19:18 libarrow_boost_regex.so.1.66.0*
> -rwxr-xr-x 1 wesm wesm    30081 Apr  2 19:18 libarrow_boost_system.so*
> -rwxr-xr-x 1 wesm wesm    30081 Apr  2 19:18 libarrow_boost_system.so.1.66.0*
> -rwxr-xr-x 1 wesm wesm  1613712 Apr  2 19:18 libarrow_python.so*
> -rwxr-xr-x 1 wesm wesm  1400561 Apr  2 19:18 libarrow_python.so.13*
> -rwxr-xr-x 1 wesm wesm 12543416 Apr  2 19:18 libarrow.so*
> -rwxr-xr-x 1 wesm wesm 11540172 Apr  2 19:18 libarrow.so.13*
> -rw-r--r-- 1 wesm wesm  6393593 Apr  2 19:18 lib.cpp
> -rwxr-xr-x 1 wesm wesm  2558504 Apr  2 19:18 lib.cpython-37m-x86_64-linux-gnu.so*
> -rwxr-xr-x 1 wesm wesm 61260912 Apr  2 19:18 libgandiva.so*
> -rwxr-xr-x 1 wesm wesm 57342916 Apr  2 19:18 libgandiva.so.13*
> -rwxr-xr-x 1 wesm wesm  3567224 Apr  2 19:18 libparquet.so*
> -rwxr-xr-x 1 wesm wesm  3035367 Apr  2 19:18 libparquet.so.13*
> -rwxr-xr-x 1 wesm wesm   352440 Apr  2 19:18 libplasma.so*
> -rwxr-xr-x 1 wesm wesm   315802 Apr  2 19:18 libplasma.so.13*
>
> There's something very odd here, though, which is that libgandiva.so
> and libgandiva.so.13 appear to be distinct. They have different
> checksums, for example
>

This is true for libarrow, libparquet and libplasma as well. I've just
checked, and the previous wheel ships its shared libraries the same way.

>
> (pyarrow-0.13.0-py37-test) 19:19 ~/Downloads/arrow-cpp-py36-vc14 $ sha256sum ~/miniconda/envs/pyarrow-0.13.0-py37-test/lib/python3.7/site-packages/pyarrow/libgandiva.so
> 8f1026d7bf476b90a0cac8239947ad334ee91cd31a944102aff6e8a67ae973e8  /home/wesm/miniconda/envs/pyarrow-0.13.0-py37-test/lib/python3.7/site-packages/pyarrow/libgandiva.so
> (pyarrow-0.13.0-py37-test) 19:21 ~/Downloads/arrow-cpp-py36-vc14 $ sha256sum ~/miniconda/envs/pyarrow-0.13.0-py37-test/lib/python3.7/site-packages/pyarrow/libgandiva.so.13
> 9969a50787f8e0411115c0bfffccd3a350fde5f8c2f319acd72f3cf8097365dc  /home/wesm/miniconda/envs/pyarrow-0.13.0-py37-test/lib/python3.7/site-packages/pyarrow/libgandiva.so.13
>

In the case of the OSX wheels the checksums are equal, so I suspect
auditwheel [1] does some magic behind the curtain.

[1] https://github.com/apache/arrow/blob/master/python/manylinux1/build_arrow.sh#L122

>
> That seems buggy to me. We might also investigate if there's a way to
> trim the binary sizes in some way.
>
> Thanks
> Wes
>
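
To run the same checksum comparison over every bundled library at once
rather than one sha256sum call at a time, something like the following
could be used. This is only a minimal sketch: the helper names
(sha256_of, find_mismatched_copies) are made up for illustration and are
not part of Arrow or auditwheel, and it assumes the versioned copies
follow the libfoo.so / libfoo.so.<version> naming seen in the listing
above.

```python
import hashlib
import re
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatched_copies(lib_dir):
    """Yield (unversioned, versioned) file names whose contents differ.

    Pairs are matched by name only (hypothetical convention):
    libfoo.so is compared with libfoo.so.<digits>[.<digits>...].
    Symlinks are skipped, since a symlink cannot differ from its target.
    """
    lib_dir = Path(lib_dir)
    versioned = re.compile(r"^(lib.+\.so)\.\d+(\.\d+)*$")
    for path in sorted(lib_dir.iterdir()):
        match = versioned.match(path.name)
        if match is None or path.is_symlink():
            continue
        base = lib_dir / match.group(1)
        if base.is_file() and not base.is_symlink():
            if sha256_of(base) != sha256_of(path):
                yield base.name, path.name
```

Pointing find_mismatched_copies() at the site-packages/pyarrow directory
of an installed wheel and printing each yielded pair should list the
libraries that are shipped twice with different contents. It says
nothing about why they differ, only which ones do.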