PyArrow is nominally a pure Cython codebase, but in reality it relies on a number of classes and functions that are implemented in C++ within the src/arrow/python directory ( https://github.com/apache/arrow/tree/master/cpp/src/arrow/python ). This is especially true of the NumPy/pandas conversion code, which has to interface with NumPy array data at a low level.
When working on PyArrow, it's not uncommon to end up jumping back and forth between the Python-specific part of the Arrow C++ codebase and PyArrow itself. You can also run into integration issues, sometimes hard to catch, if you forget to recompile libarrow, even when you are working on what looks like a Python-only change.

I'm wondering whether it would make life easier for contributors if the src/arrow/python directory were moved into pyarrow and PyArrow were made able to build it. That would probably reduce the risk of integration issues, since rebuilding pyarrow alone would probably be enough for most Python-specific changes (it would also rebuild the Python-specific C++). I think that moving src/arrow/python into pyarrow would also make the codebase more cohesive, which would lower the barrier for new contributors trying to work out how to fix a pyarrow-specific issue.

Unless I'm missing a major side effect (other than making the pyarrow build scripts somewhat more complex, though the build is already CMake-based, so building some C++ shouldn't be a big deal), it seems that the benefits of having all Python-related code in a single place would outweigh the costs.

Also, I'm not sure how widespread the requirement for Python from C++ is, but it seems to me that if we moved all Python-specific code into pyarrow, we could decouple libarrow from Python. That might make it easier to deal with virtualenvs or debug versions of Python, as you wouldn't have to deal with Python3_EXECUTABLE etc. when building libarrow.

Any thoughts?
