I was able to reduce this to a much simpler reproduction. The problem involves
pyarrow==0.16.0, r-arrow==0.16.0, and rpy2: once pyarrow has been imported,
rpy2 can no longer load r-arrow. The imports below fail now but worked in
0.14.1. Is it possible that the shared objects pyarrow loads conflict with the
shared objects r-arrow tries to load afterwards?

# Fails
import rpy2.robjects as ro
import pyarrow
ro.r("library(arrow)")

# Succeeds
import rpy2.robjects as ro
ro.r("library(arrow)")

# Also fails
import rpy2.robjects as ro
import pyarrow
import pyarrow.parquet
import pyarrow.dataset
ro.r("library(arrow)")
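
One way to test the shared-object theory is to look at which Arrow libraries
are already mapped into the Python process before R tries to load its own. The
sketch below is only an illustration: the helper names arrow_libs and
current_process_arrow_libs are mine, it relies on the Linux-only
/proc/self/maps file, and it matches only libarrow*/libparquet* file names.

```python
import re


def arrow_libs(maps_text):
    """Return the distinct libarrow*/libparquet* paths found in maps-style text."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address range, perms, offset, device, inode, optional pathname.
        fields = line.split(None, 5)
        if len(fields) == 6 and re.search(r"lib(arrow|parquet)[^/]*\.so", fields[5]):
            paths.add(fields[5].strip())
    return sorted(paths)


def current_process_arrow_libs():
    """Inspect the running process (Linux only: reads /proc/self/maps)."""
    with open("/proc/self/maps") as fh:
        return arrow_libs(fh.read())
```

Calling current_process_arrow_libs() right after "import pyarrow", and again
after the failing ro.r("library(arrow)"), should show whether both sides
resolve to the same libarrow.so.16 and libarrow_dataset.so.16 files.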

On Sat, Apr 25, 2020 at 12:19 PM Jeffrey Wong <jeffr...@netflix.com> wrote:

> Hello, I am using Arrow Tables to facilitate fast data transfer between
> Python and R. The strategy below worked with arrow==0.14.1, but no longer
> works with arrow==0.16.0.
>
> Using pyarrow, I convert a pandas dataframe to a pyarrow Table, then get
> the memory address of the underlying Arrow Table. Something like this:
>
> unsigned long get_arrow_table_memory_address(py::object pyarrow_table) {
>     arrow::py::import_pyarrow();
>     std::shared_ptr<arrow::Table> table;
>     arrow::py::unwrap_table(pyarrow_table.ptr(), &table);
>     return (unsigned long) table.get();
> }
>
> Using rpy2, I can create an R process inside the Python process. The Arrow
> table is still in memory. In the R process, I receive the memory address
> (as a string, which is then converted to unsigned long in Rcpp) and return
> a shared_ptr to R:
>
> SEXP arrow_table_from_memory_address(std::string memory_address) {
>   std::shared_ptr<arrow::Table> table((arrow::Table *)
> std::stoul(memory_address));
>   Rcpp::XPtr<std::shared_ptr<arrow::Table>> output(new
> std::shared_ptr<arrow::Table>(table), false);
>   return output;
> }
>
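
The address hand-off described above can be sketched in pure Python with
ctypes. This is only an analogy: a ctypes.c_int64 stands in for the
arrow::Table, and the helper names address_as_string and read_from_address are
hypothetical. It shows the same two steps, exporting an address as a decimal
string and viewing that memory again without copying.

```python
import ctypes


def address_as_string(obj):
    """Producer side: export the object's address as a decimal string."""
    return str(ctypes.addressof(obj))


def read_from_address(address):
    """Consumer side: parse the string and view the same memory, no copy."""
    ptr = ctypes.cast(int(address), ctypes.POINTER(ctypes.c_int64))
    return ptr.contents.value


# The producer must keep `payload` alive for as long as the consumer reads it;
# the consumer only borrows the memory and does not own it.
payload = ctypes.c_int64(42)
addr = address_as_string(payload)
value = read_from_address(addr)
```

The same lifetime caveat applies to the real hand-off: the address is only
valid while the Python side still holds the table.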
> Finally, I can create an r-arrow Table object using arrow::Table$new(xp).
> My ultimate goal is to then call as.data.frame, materializing in R the same
> dataframe as the original one in pandas.
>
> With arrow==0.16.0, loading r-arrow's arrow.so fails because
> libarrow_dataset.so.16 references a symbol that cannot be resolved:
>
> 10: dyn.load(file, DLLpath = DLLpath, ...)
> 9: library.dynam(lib, package, package.lib)
> 8: loadNamespace(name)
> 7: getNamespace(ns)
> 6: asNamespace(pkg)
> 5: get(name, envir = asNamespace(pkg), inherits = FALSE)
> 4: arrow:::shared_ptr at core_ArrowTablePointer.R#35
> 3: ArrowTablePointer$new("94637300534352")$to_table(as_tibble = FALSE)
> 2: (function (expr, envir = parent.frame(), enclos = if (is.list(envir) ||
>        is.pairlist(envir)) parent.frame() else baseenv())
>    .Internal(eval(expr, envir, enclos)))(expression(mydata =
> ArrowTablePointer$new("94637300534352")$to_table(as_tibble = FALSE)))
> 1: (function (expr, envir = parent.frame(), enclos = if (is.list(envir) ||
>        is.pairlist(envir)) parent.frame() else baseenv())
>    .Internal(eval(expr, envir, enclos)))(expression(mydata =
> ArrowTablePointer$new("94637300534352")$to_table(as_tibble = FALSE)))
> Traceback (most recent call last):
>   File "/root/nflx_causal_models/causal_models/r/rpy2_patches.py", line
> 30, in wrapped
>     return f(self, *args, **kwargs)
>   File
> "/opt/conda/lib/python3.7/site-packages/rpy2/rinterface_lib/conversion.py",
> line 28, in _
>     cdata = function(*args, **kwargs)
>   File "/opt/conda/lib/python3.7/site-packages/rpy2/rinterface.py", line
> 785, in __call__
>     raise embedded.RRuntimeError(_rinterface._geterrmessage())
> rpy2.rinterface_lib.embedded.RRuntimeError: Error in dyn.load(file,
> DLLpath = DLLpath, ...) :
>   unable to load shared object
> '/opt/conda/lib/R/library/arrow/libs/arrow.so':
>   /opt/conda/lib/R/library/arrow/libs/../../../../libarrow_dataset.so.16:
> undefined symbol:
> _ZN5arrow2fs8internal17SplitAbstractPathERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
>
> Running ldd on r-arrow's arrow.so shows that it is linked against
> libarrow_dataset.so:
>
> ldd /opt/conda/lib/R/library/arrow/libs/arrow.so
> linux-vdso.so.1 =>  (0x00007ffc046d2000)
> libarrow_dataset.so.16 =>
> /opt/conda/lib/R/library/arrow/libs/../../../../libarrow_dataset.so.16
> (0x00007ffb76a5f000)
> libparquet.so.16 =>
> /opt/conda/lib/R/library/arrow/libs/../../../../libparquet.so.16
> (0x00007ffb76757000)
> libarrow.so.16 =>
> /opt/conda/lib/R/library/arrow/libs/../../../../libarrow.so.16
> (0x00007ffb757c7000)
> libR.so => /opt/conda/lib/R/library/arrow/libs/../../../lib/libR.so
> (0x00007ffb7532a000)
>
>
> I think the symbol is hashed, so I can't tell what function in
> libarrow_dataset.so it is looking for
>
> _ZN5arrow2fs8internal17SplitAbstractPathERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
>
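
Inline note on the symbol above: it looks like an Itanium C++ ABI mangled name
rather than a hash, so the qualified function name can be read back out of it.
Running it through c++filt (from binutils) should give the full signature. The
sketch below uses a hypothetical helper, nested_name, that decodes only the
leading length-prefixed name components and ignores the parameter list.

```python
import re


def nested_name(mangled):
    """Decode the leading nested name (_ZN<len><id>...) of a mangled symbol."""
    if not mangled.startswith("_ZN"):
        raise ValueError("not a nested Itanium-mangled name")
    pos = 3
    parts = []
    # Each component is a decimal length followed by that many characters.
    while pos < len(mangled) and mangled[pos].isdigit():
        digits = re.match(r"\d+", mangled[pos:]).group()
        pos += len(digits)
        parts.append(mangled[pos:pos + int(digits)])
        pos += int(digits)
    return "::".join(parts)


symbol = (
    "_ZN5arrow2fs8internal17SplitAbstractPathERKNSt7__cxx11"
    "12basic_stringIcSt11char_traitsIcESaIcEEE"
)
name = nested_name(symbol)  # "arrow::fs::internal::SplitAbstractPath"
```

So the loader is looking for arrow::fs::internal::SplitAbstractPath, taking a
const std::string reference.
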
> Do I need to compile Arrow with a particular flag for this symbol to be
> visible? I currently get arrow-cpp, pyarrow, and r-arrow all from
> conda-forge.
>
> Thank you so much for all the amazing development in arrow. This exchange
> of pandas dataframe to R dataframe via arrow table is amazingly fast.
> --
> Jeffrey Wong
> Computational Causal Inference
>


-- 
Jeffrey Wong
Computational Causal Inference
