On Tue, 17 Aug 2021 at 16:20, Alessandro Molina <alessan...@ursacomputing.com> wrote:
> ...
> There are by the way some interesting points, like the fact that the mask
> for a pyarrow array can only be a numpy array, how could I create a masked
> array without numpy? I guess that accepting arrow arrays as mask is
> actually something we should allow anyway.
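
To make the mask question concrete, here is a minimal sketch of building a validity bitmap by hand, without constructing a numpy mask. It assumes the Arrow columnar layout (least-significant-bit numbering, a set bit means "valid") and uses the existing `pa.Array.from_buffers` and `pa.py_buffer` calls. `pack_validity_bitmap` is a hypothetical helper written for this illustration, not a pyarrow function.

```python
import struct


def pack_validity_bitmap(valid):
    """Pack booleans into an Arrow-style validity bitmap (hypothetical helper).

    Arrow uses least-significant-bit numbering: element i lives in
    byte i // 8, bit i % 8, and a set bit means "valid" (non-null).
    """
    valid = list(valid)
    bitmap = bytearray((len(valid) + 7) // 8)
    for i, is_valid in enumerate(valid):
        if is_valid:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


values = [1, 2, 3, 4]
# True = valid here. This is the inverse of the mask= argument of
# pa.array, where True marks a null.
valid = [True, False, True, True]

validity = pack_validity_bitmap(valid)
# Little-endian int64 data buffer, built with the stdlib only.
data = struct.pack("<%dq" % len(values), *values)

try:
    import pyarrow as pa
except ImportError:
    pa = None  # the packing above needs nothing beyond the stdlib

if pa is not None:
    # For a primitive type the buffers are [validity bitmap, data].
    arr = pa.Array.from_buffers(
        pa.int64(), len(values), [pa.py_buffer(validity), pa.py_buffer(data)]
    )
    print(arr.to_pylist())
```

This works at the buffer level, so it is more low-level than a `mask=` argument accepting Arrow boolean arrays would be. It only shows that the information a mask carries does not inherently need numpy.
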

A somewhat related JIRA about how to create a new array with a given validity bitmap:
https://issues.apache.org/jira/browse/ARROW-7071

>
> On Mon, Aug 16, 2021 at 6:53 PM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > I agree that "what happens when Numpy is not available at runtime" is a
> > rather annoying problem. I'm not sure what happens when you call one
> > of the Numpy C API functions and Numpy is not found (crash? error
> > return?). It can probably be detected, but needs to be done
> > consistently at the start of each PyArrow core function, which requires
> > some care.
> >
> > At the end of the day, it looks like this would be a significant amount
> > of work for a relatively minor benefit (did people complain about
> > this?), so I'm not sure it's worth spending some time on it.

Personally, I follow this sentiment of Antoine: yes, in an ideal world we
wouldn't have a hard dependency on numpy. But as long as there is not a
clear demand/use case for it, I am not sure it's worth the effort.

Joris

> >
> > Regards
> >
> > Antoine.
> >
> >
> >
> > On Mon, 16 Aug 2021 18:09:54 +0200
> > Wes McKinney <wesmck...@gmail.com> wrote:
> > > I've thought about this in the past, and I would like to make NumPy an
> > > optional dependency, but one of the things that kept me from trying
> > > was the extent to which NumPy arrays are supported as inputs (or
> > > elements of inputs) to pyarrow.array. The implementation in
> > > python_to_arrow.cc is significantly intertwined with NumPy's C API. It
> > > might require maintaining two altogether different internal
> > > implementations of pyarrow.array, a complicated one which deals with
> > > all the NumPy oddities (including NumPy array scalars) and a much
> > > simpler one that does not. pyarrow may have to detect at runtime
> > > whether numpy is in sys.modules to decide whether to import and invoke
> > > the more complicated function.
> > >
> > > On Mon, Aug 16, 2021 at 5:59 PM Alessandro Molina
> > > <alessan...@ursacomputing.com> wrote:
> > >
> > > > As Arrow/PyArrow grows more compute functions and features we might move
> > > > toward a world where the number of users relying on PyArrow without going
> > > > through Pandas or NumPy might grow.
> > > >
> > > > NumPy is a compile time dependency for PyArrow as it's required to compile
> > > > the C++ code needed to implement the pandas/numpy integration, but there
> > > > has been some discussion regarding the fact that we could make NumPy optional
> > > > at runtime (remove it from required dependencies in the Python
> > > > distribution). You would have to install numpy only if you need to invoke
> > > > to_numpy or to_pandas methods or similar integration features. For all the
> > > > other use cases, that rely on Arrow alone, you would be able to pip install
> > > > pyarrow without involving any other dependency and be ready to go.
> > > >
> > > > Technically it seems a bit complicated, Python/Cython can always work
> > > > around missing libraries, but we would have to find ways to deal with lazy
> > > > involvement of numpy from C++. I don't know if this is something that was
> > > > already discussed in the past and thus someone already has solutions for
> > > > this part of the problem, but before investing time and effort in research
> > > > I think it made sense to make sure it's a goal that the development team
> > > > agrees with.