Hi,

As part of the effort to reduce the footprint of pyarrow
installations, we have been working on splitting pyarrow into separate
packages for conda [1]. Each package will pull in a different set of
C++ dependencies and will therefore provide different capabilities.

This PR [1] will provide 3 packages for pyarrow:
pyarrow-core < pyarrow < pyarrow-all

- pyarrow-core: will pull the libarrow.so (~40MB) dependency.
- pyarrow: in addition to libarrow.so, will also pull libarrow_acero,
  libarrow_dataset, libarrow_substrait and libparquet (~78MB)
  dependencies.
- pyarrow-all: in addition to everything in pyarrow, will also pull
  libarrow_flight, libarrow_flight_sql and libarrow_gandiva (~97MB).

This means that if you are using conda and installing pyarrow today,
with 16.0.0 you will see a reduction in the size of the C++
dependencies and you will no longer have access to flight, flight_sql
or gandiva. If you want to keep using those, you will have to install
pyarrow-all.
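
To see which of the optional components a given environment actually
provides, something like the following sketch may help. The mapping of
Python submodules to conda packages is an assumption derived from the
list above (it is not something defined by the PR), and
probe_optional_modules is a hypothetical helper name. flight_sql is
left out because, as far as I know, it has no separate Python module.

```python
import importlib

# Assumed mapping from the conda packages described above to the Python
# submodules backed by each optional C++ library (illustrative only).
OPTIONAL_MODULES = {
    "pyarrow": [
        "pyarrow.acero",
        "pyarrow.dataset",
        "pyarrow.substrait",
        "pyarrow.parquet",
    ],
    "pyarrow-all": [
        "pyarrow.flight",
        "pyarrow.gandiva",
    ],
}


def probe_optional_modules():
    """Return {module_name: bool}, True when the module can be imported."""
    available = {}
    for modules in OPTIONAL_MODULES.values():
        for name in modules:
            try:
                importlib.import_module(name)
                available[name] = True
            except ImportError:
                # Raised when pyarrow itself or the backing C++ library
                # is not installed in this environment.
                available[name] = False
    return available


if __name__ == "__main__":
    for name, ok in probe_optional_modules().items():
        print(f"{name}: {'available' if ok else 'missing'}")
```

If pyarrow.flight or pyarrow.gandiva show up as missing after the
upgrade, that would suggest the environment needs pyarrow-all.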

If you want to use a minimal pyarrow version without access to acero,
dataset, parquet or substrait, you can use pyarrow-core and get a
further reduction in size. Bear in mind that the Arrow team is working
on moving the filesystems out of libarrow, so they will be pulled out
of pyarrow-core in the future. This means that, probably, as of 17.0.0
pyarrow-core will not support the S3, GCS or Azure filesystems.
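
Code that relies on the cloud filesystems could guard against that
change with a check along these lines. This is only a sketch:
probe_cloud_filesystems is a hypothetical helper, and it assumes the
filesystem classes keep their current names in pyarrow.fs.

```python
import importlib

# Class names currently exposed by pyarrow.fs for the cloud filesystems
# mentioned above (assumed to stay the same in future releases).
CLOUD_FILESYSTEMS = ("S3FileSystem", "GcsFileSystem", "AzureFileSystem")


def probe_cloud_filesystems():
    """Return {class_name: bool}, True when pyarrow.fs provides the class."""
    try:
        fs = importlib.import_module("pyarrow.fs")
    except ImportError:
        # pyarrow (or its filesystem module) is not installed at all.
        return {name: False for name in CLOUD_FILESYSTEMS}

    available = {}
    for name in CLOUD_FILESYSTEMS:
        try:
            getattr(fs, name)
            available[name] = True
        except (ImportError, AttributeError):
            # pyarrow.fs raises ImportError for filesystems that were not
            # built; AttributeError covers versions lacking the name.
            available[name] = False
    return available


if __name__ == "__main__":
    for name, ok in probe_cloud_filesystems().items():
        print(f"{name}: {'available' if ok else 'missing'}")
```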

The idea is to keep working on these efforts to further reduce the size of pyarrow.

Thanks everyone,
Raúl

[1] https://github.com/conda-forge/arrow-cpp-feedstock/pull/1255
