Hi,

As part of the effort to reduce the footprint of pyarrow installations, we have been working on splitting pyarrow into separate packages for conda [1]. Each package will pull different C++ dependencies, which will provide different capabilities.
This PR [1] will provide 3 packages for pyarrow, each one a superset of the previous: pyarrow-core < pyarrow < pyarrow-all

- pyarrow-core: will pull the libarrow.so (~40MB) dependency.
- pyarrow: in addition to libarrow.so, will also pull the libarrow_acero, libarrow_dataset, libarrow_substrait and libparquet (~78MB) dependencies.
- pyarrow-all: in addition to everything in pyarrow, will also pull libarrow_flight, libarrow_flight_sql and libarrow_gandiva (~97MB).

This means that if you are using conda and installing pyarrow today, with 16.0.0 you will see a reduction in the size of the C++ dependencies and you will no longer have access to flight, flight_sql or gandiva. If you want to keep using those, you will have to install pyarrow-all.

If you want a minimal pyarrow installation without access to acero, dataset, parquet or substrait, you can use pyarrow-core and get a further reduction in size.

Bear in mind that the Arrow team is working on moving the filesystems out of libarrow, and they will be pulled out of pyarrow-core in the future. This means that, probably from 17.0.0, pyarrow-core will not support the S3, GCS or Azure filesystems.

The idea is to keep working on these efforts to further reduce the size of pyarrow.

Thanks everyone,
Raúl

[1] https://github.com/conda-forge/arrow-cpp-feedstock/pull/1255
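
P.S. For anyone who wants to check which of the optional components an existing environment actually provides, here is a minimal sketch. It is not part of the PR. It assumes the Python submodule names (pyarrow.acero, pyarrow.dataset, pyarrow.parquet, pyarrow.substrait, pyarrow.flight, pyarrow.gandiva) match the C++ libraries listed above, and the grouping in OPTIONAL_MODULES and the helper probe_modules are hypothetical names used only for illustration. It simply tries each import and reports whether it succeeds.

```python
import importlib

# Optional pyarrow submodules, grouped by the conda package expected to
# provide them after the split. This grouping is an assumption based on the
# package description in this email, not something read from the packages.
OPTIONAL_MODULES = {
    "pyarrow": ["acero", "dataset", "parquet", "substrait"],
    "pyarrow-all": ["flight", "gandiva"],
}


def probe_modules(package, submodules):
    """Return {submodule: bool}, True when `package.submodule` imports."""
    available = {}
    for name in submodules:
        try:
            importlib.import_module(f"{package}.{name}")
            available[name] = True
        except ImportError:
            # Covers both a missing package and a missing shared library.
            available[name] = False
    return available


if __name__ == "__main__":
    for conda_package, submodules in OPTIONAL_MODULES.items():
        for name, ok in probe_modules("pyarrow", submodules).items():
            status = "available" if ok else "missing"
            print(f"pyarrow.{name}: {status} (expected from {conda_package})")
```

If a module you rely on is reported as missing after upgrading, installing the larger package (pyarrow or pyarrow-all) should bring it back.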