We've discussed this in the past, I think. In addition to having many optional components enabled, the pyarrow wheel also includes the unit tests directory, which keeps growing in size. I think if we made a pyarrow-slim wheel with support only for core Arrow (IPC, etc.) and Parquet file reading, it might be possible to trim the wheel size by a significant percentage.
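
To get a rough idea of how much of a given wheel the tests directory (or any other directory) accounts for, something like the sketch below might help. It relies only on the fact that a wheel is a zip archive; the function name `wheel_size_breakdown` and the `depth` parameter are made up for illustration and are not part of any Arrow tooling.

```python
import zipfile
from collections import Counter


def wheel_size_breakdown(wheel_path, depth=2):
    """Sum uncompressed file sizes per directory prefix inside a wheel.

    Hypothetical helper: each file is grouped under its containing
    directory, truncated to at most `depth` path components.
    """
    sizes = Counter()
    with zipfile.ZipFile(wheel_path) as wheel:
        for info in wheel.infolist():
            if info.is_dir():
                continue
            # Directory components of the file, e.g. ["pyarrow", "tests"].
            dirs = info.filename.split("/")[:-1]
            prefix = "/".join(dirs[:depth]) or "."
            sizes[prefix] += info.file_size  # uncompressed size in bytes
    return sizes
```

Calling it on a locally downloaded wheel and looking at the entry for `pyarrow/tests` next to the entry for `pyarrow` should show how much removing the tests would save before compression.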
Rusty -- if you would like to push this forward, I would suggest creating an alternative to the wheel build script that we use, modifying flags and adding other customizations (e.g. trimming unit tests) to produce a wheel that we could build and possibly upload as "pyarrow-slim" on PyPI.

On Mon, Oct 3, 2022 at 8:55 AM Antoine Pitrou <anto...@python.org> wrote:
>
>
> Hi Rusty,
>
> On 02/10/2022 at 22:51, Rusty Conover wrote:
> > Hi Arrow Team,
> >
> > I'm using Apache Arrow with AWS Lambda Functions.
> >
> > The primary motivation is AWS Athena's user-defined functions[1]. Those
> > functions process and return Arrow IPC segments.
> >
> > * The published Python wheels for Apache Arrow include almost every feature
> > of Arrow. (Gandiva, Plasma, Flight)
>
> Gandiva isn't compiled in the Python wheels. Plasma is reasonably small
> (but is also being deprecated soon). Flight is more sizable. However,
> most of the size seems to be in Arrow itself and Parquet. A large part
> of the size is probably attributable to the Arrow compute engine and
> functions, and also perhaps to filesystem implementations such as S3 and
> GCS (due to the large third-party dependencies that they bundle).
>
> > Would it be possible to create a new Python package (i.e., "pyarrow-slim")
> > that would disable some of the functionality but result in smaller python
> > wheels?
>
> Perhaps. The first step would be to allow disabling more components in
> PyArrow, though. Otherwise I'm afraid the size reduction wouldn't be
> terrific.
>
> Regards
>
> Antoine.