Hi Zhuo,

On Thu, Jun 20, 2019 at 5:48 PM Zhuo Peng <[email protected]> wrote:
>
> Dear Arrow maintainers,
>
> I work on several TFX (TensorFlow eXtended) [1] projects (e.g. TensorFlow
> Data Validation [2]) and am trying to use Arrow in them. These projects are
> mostly written in Python but have C++ code in the form of Python extension
> modules, so we use both Arrow’s C++ and Python APIs. Our projects are
> distributed through PyPI as binary packages.
>
> The Python extension modules are compiled against the headers shipped in
> the pyarrow PyPI binary package and linked against the libarrow.so and
> libarrow_python.so in that same package. So far we’ve seen two major
> problems:
>
> * There are STL container definitions in public headers.
>

I think this should be regarded as a bug (exporting compiled STL
symbols). It seems like you agree, but we have let some symbols leak,
in large part because the scope of the project is large and we need
more contributors (who understand the issue and the solutions) to help
look after these issues.

> It causes problems because the binary code for template classes is
> generated at compile time, and the definition of those template classes
> might differ from compiler to compiler. This can happen even if we merely
> use a different GCC version than the one that compiled pyarrow (for
> example, the layout of std::unordered_map<> changed in GCC 5.2 [3], and
> arrow::Schema used to contain an std::unordered_map<> member [4]).
>
> One might argue that everyone releasing manylinux1 packages should use
> exactly the same compiler, as provided by the pypa docker image; however,
> the standard only specifies the maximum versions of the corresponding
> fundamental libraries [5], and newer GCC versions can be backported to
> work with older libraries [6].
>
> A recent change in Arrow [7] has removed most (but not all [8]) of the STL
> members in publicly accessible class declarations and will resolve our
> immediate problem, but I wonder if there is, or should be, an explicit
> policy on ABI compatibility, especially regarding the use of template
> functions / classes in public interfaces?
>
> * Our wheel cannot pass “auditwheel repair”
>
> I don’t think it’s correct to pull libarrow.so and libarrow_python.so into
> our wheel and have the user’s Python load both our copy of libarrow.so and
> pyarrow’s, but that’s what “auditwheel repair” attempts to do. Yet if we
> don’t let auditwheel do so, it refuses to stamp our wheel because it has
> “external” dependencies.
>
> This doesn’t seem to be an Arrow problem per se, but I wonder if others in
> the community have had to deal with similar issues and what the resolution
> was. Our current workaround is to stamp the wheel manually.
>

You aren't vendoring libarrow, right? (If so, that's a bigger issue.)
I'm not an expert on how to appease auditwheel, but this seems like
something we should sort out so that other projects' wheels can depend
on the pyarrow wheels. For the record, the whole wheel infrastructure
is poorly adapted to this scenario, which conda handles much more
gracefully.

>
> Thanks,
> Zhuo
>
>
> References:
>
> [1] https://github.com/tensorflow/tfx
> [2] https://github.com/tensorflow/data-validation
> [3]
> https://github.com/gcc-mirror/gcc/commit/54b755d349d17bb197511529746cd7cf8ea761c1#diff-f82d3b9fa19961eed132b10c9a73903e
> [4]
> https://github.com/apache/arrow/blob/b22848952f09d6f9487feaff80ee358ca41b1562/cpp/src/arrow/type.h#L532
> [5] https://www.python.org/dev/peps/pep-0513/#id40
> [6] https://github.com/pypa/auditwheel/issues/125#issuecomment-438513357
> [7]
> https://github.com/apache/arrow/commit/7a5562174cffb21b16f990f64d114c1a94a30556
> [8]
> https://github.com/apache/arrow/blob/a0e1fbb9ef51d05a3f28e221cf8c5d4031a50c93/cpp/src/arrow/ipc/dictionary.h#L91
