While working on https://github.com/apache/arrow/pull/10162 the concern
was raised that it's hard to change Cython code because doing so might
break third-party libraries and projects relying on pyarrow through Cython.

The problem mostly comes from the fact that the documentation suggests
pyarrow.lib.* ( https://arrow.apache.org/docs/python/extending.html#example
) as what should be used to import features from pyarrow in Cython.
Given that most of pyarrow is implemented by including pxi files into the
lib.pyx module (
https://github.com/apache/arrow/blob/master/python/pyarrow/lib.pyx#L118-L163
), this means that we are exposing the majority of the internals as our
public API.
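
To make the effect concrete, here is a rough pure-Python analogy (not how
Cython itself works: Cython's include is textual and happens at compile
time). The fragment names and functions below are hypothetical and only
stand in for the pxi files; the point is that every name from every
included fragment ends up in one module namespace, helpers included.

```python
import types

# Hypothetical fragments standing in for .pxi files (illustrative names only).
FRAGMENTS = {
    "array.pxi": (
        "def array(values):\n"
        "    return _wrap(list(values))\n"
        "\n"
        "def _wrap(values):\n"
        "    return tuple(values)\n"
    ),
    "table.pxi": (
        "def table(columns):\n"
        "    return {name: array(col) for name, col in columns.items()}\n"
    ),
}


def build_module(name, fragments):
    """Execute every fragment in one shared namespace, like a textual include."""
    module = types.ModuleType(name)
    for filename, source in fragments.items():
        exec(compile(source, filename, "exec"), module.__dict__)
    return module


lib = build_module("lib", FRAGMENTS)

# Everything defined in any fragment is now reachable as lib.<name>,
# including the helper that was only meant for internal use.
exposed = sorted(n for n in vars(lib) if not n.startswith("__"))
print(exposed)
```

Anyone importing from such a module can reach the internal helper just as
easily as the intended entry points, which is the situation described
above for pyarrow.lib.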

The consequence is that, in practice, we are preventing ourselves from
touching anything that exists in those included files: another project
might be using it, so it can't be moved and its signature can't change.

We could argue that only what was documented explicitly should be
considered "public" and everything else can be changed, but our
documentation is unclear on this point. It lists some functions
that should be considered our explicit API (
https://arrow.apache.org/docs/python/extending.html#cython-api ) but then
uses CArray in the example (
https://arrow.apache.org/docs/python/extending.html#example ), which isn't
listed as public.

I think it would be helpful to come to an agreement about what we should
consider publicly exposed from Cython so that we can properly update the
documentation and unblock possible refactoring.

Personally, even at the risk of breaking third-party code, I think it would
be wise to aim for the minimum exposed surface. I'd consider Cython mostly
an implementation detail and promote using libarrow from C/C++ directly
for those who need to work on high-performance Python extensions.
