I've thought about this in the past and would like to make NumPy an optional dependency. What kept me from trying is how extensively NumPy arrays are supported as inputs (or elements of inputs) to pyarrow.array: the implementation in python_to_arrow.cc is heavily intertwined with NumPy's C API. Making NumPy optional might require maintaining two altogether different internal implementations of pyarrow.array: a complicated one that deals with all the NumPy oddities (including NumPy array scalars) and a much simpler one that does not. pyarrow might then have to check at runtime whether numpy is in sys.modules to decide whether to import and invoke the more complicated function.
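
To make the runtime check concrete, here is a rough pure-Python sketch of the dispatch I have in mind. The names to_array, _convert_plain and _convert_with_numpy are hypothetical and are not pyarrow functions; the real conversion lives in C++, so this only illustrates the control flow, not an actual implementation.

```python
import sys


def _convert_plain(obj):
    # Hypothetical simple path: knows nothing about NumPy types.
    return ("plain", list(obj))


def _convert_with_numpy(obj, np):
    # Hypothetical complicated path: handles ndarrays and NumPy array scalars.
    if isinstance(obj, np.ndarray):
        return ("numpy", obj.tolist())
    # Unwrap NumPy scalars (np.generic instances) found inside a sequence.
    return ("numpy", [x.item() if isinstance(x, np.generic) else x for x in obj])


def to_array(obj):
    # If the caller never imported numpy, no input can be a NumPy object,
    # so the simple path is sufficient and numpy is never imported here.
    np = sys.modules.get("numpy")
    if np is None:
        return _convert_plain(obj)
    return _convert_with_numpy(obj, np)
```

The point of looking in sys.modules rather than attempting an import is that a NumPy array can only reach the function if the user's process has already imported numpy, so users who never touch NumPy never pay for it.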

On Mon, Aug 16, 2021 at 5:59 PM Alessandro Molina <alessan...@ursacomputing.com> wrote:
>
> As Arrow/PyArrow grows more compute functions and features we might move
> toward a world where the number of users relying on PyArrow without going
> through Pandas or NumPy might grow.
>
> NumPy is a compile time dependency for PyArrow as it's required to compile
> the C++ code needed to implement the pandas/numpy integration, but there
> has been some discussion regard the fact that we could make NumPy optional
> at runtime (remove it from required dependencies in the Python
> distribution). You would have to install numpy only if you need to invoke
> to_numpy or to_pandas methods or similar integration features. For all the
> other use cases, that rely on Arrow alone, you would be able to pip install
> pyarrow without involving any other dependency and be ready to go.
>
> Technically it seems a bit complicated, Python/Cython can always work
> around missing libraries, but we would have to find ways to deal with lazy
> involvement of numpy from C++. I don't know if this is something that was
> already discussed in the past and thus someone already has solutions for
> this part of the problem, but before investing time and effort in research
> I think it made sense to make sure it's a goal that the development team
> agrees with.