I was able to reduce this to a much simpler reproduction. The problem involves pyarrow==0.16.0, r-arrow==0.16.0, and rpy2: once pyarrow has been imported, rpy2 can no longer load r-arrow. The imports below fail now but worked in 0.14.1. Could the shared objects that pyarrow loads conflict with the ones r-arrow tries to load afterwards?

# Fails
import rpy2.robjects as ro
import pyarrow
ro.r("library(arrow)")

# Succeeds
import rpy2.robjects as ro
ro.r("library(arrow)")

# Also fails
import rpy2.robjects as ro
import pyarrow
import pyarrow.parquet
import pyarrow.dataset
ro.r("library(arrow)")

On Sat, Apr 25, 2020 at 12:19 PM Jeffrey Wong <jeffr...@netflix.com> wrote:

> Hello, I am using Arrow Table's to facilitate fast data transfer between
> python and R. The below strategy worked with arrow==0.14.1, but is no
> longer working in arrow == 0.16.0.
>
> Using pyarrow, I convert a pandas dataframe to a pyarrow Table, then get
> the memory address to the underlying Arrow Table. Something like this:
>
> unsigned long get_arrow_table_memory_address(py::object pyarrow_table) {
>   arrow::py::import_pyarrow();
>   std::shared_ptr<arrow::Table> table;
>   arrow::py::unwrap_table(pyarrow_table.ptr(), &table);
>   return (unsigned long) table.get();
> }
>
> Using rpy2 I can create an R process inside the python process. The arrow
> table is still in memory. In the R process, I receive the memory address
> (as a string, which is then converted to unsigned int in Rcpp), and return
> a shared_ptr for R
>
> SEXP arrow_table_from_memory_address(std::string memory_address) {
>   std::shared_ptr<arrow::Table> table((arrow::Table *) std::stoul(memory_address));
>   Rcpp::XPtr<std::shared_ptr<arrow::Table>> output(new std::shared_ptr<arrow::Table>(table), false);
>   return output;
> }
>
> Finally, I can create a r-arrow Table object, using arrow::Table$new(xp).
> My ultimate goal is to then do as.data.frame, materializing the exact same
> dataframe in R as the original one in pandas.
>
> In arrow == 0.16.0, I get an error concerning the r-arrow.so not being
> able to see a symbol in libarrow_dataset.so.
>
> 10: dyn.load(file, DLLpath = DLLpath, ...)
> 9: library.dynam(lib, package, package.lib)
> 8: loadNamespace(name)
> 7: getNamespace(ns)
> 6: asNamespace(pkg)
> 5: get(name, envir = asNamespace(pkg), inherits = FALSE)
> 4: arrow:::shared_ptr at core_ArrowTablePointer.R#35
> 3: ArrowTablePointer$new("94637300534352")$to_table(as_tibble = FALSE)
> 2: (function (expr, envir = parent.frame(), enclos = if (is.list(envir) || is.pairlist(envir)) parent.frame() else baseenv()) .Internal(eval(expr, envir, enclos)))(expression(mydata = ArrowTablePointer$new("94637300534352")$to_table(as_tibble = FALSE)))
> 1: (function (expr, envir = parent.frame(), enclos = if (is.list(envir) || is.pairlist(envir)) parent.frame() else baseenv()) .Internal(eval(expr, envir, enclos)))(expression(mydata = ArrowTablePointer$new("94637300534352")$to_table(as_tibble = FALSE)))
> Traceback (most recent call last):
>   File "/root/nflx_causal_models/causal_models/r/rpy2_patches.py", line 30, in wrapped
>     return f(self, *args, **kwargs)
>   File "/opt/conda/lib/python3.7/site-packages/rpy2/rinterface_lib/conversion.py", line 28, in _
>     cdata = function(*args, **kwargs)
>   File "/opt/conda/lib/python3.7/site-packages/rpy2/rinterface.py", line 785, in __call__
>     raise embedded.RRuntimeError(_rinterface._geterrmessage())
> rpy2.rinterface_lib.embedded.RRuntimeError: Error in dyn.load(file, DLLpath = DLLpath, ...) :
>   unable to load shared object '/opt/conda/lib/R/library/arrow/libs/arrow.so':
>   /opt/conda/lib/R/library/arrow/libs/../../../../libarrow_dataset.so.16: undefined symbol: _ZN5arrow2fs8internal17SplitAbstractPathERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
>
> Running ldd on the r-arrow.so, I do see that it is properly linked against
> the arrow_dataset.so
>
> ldd /opt/conda/lib/R/library/arrow/libs/arrow.so
>   linux-vdso.so.1 => (0x00007ffc046d2000)
>   libarrow_dataset.so.16 => /opt/conda/lib/R/library/arrow/libs/../../../../libarrow_dataset.so.16 (0x00007ffb76a5f000)
>   libparquet.so.16 => /opt/conda/lib/R/library/arrow/libs/../../../../libparquet.so.16 (0x00007ffb76757000)
>   libarrow.so.16 => /opt/conda/lib/R/library/arrow/libs/../../../../libarrow.so.16 (0x00007ffb757c7000)
>   libR.so => /opt/conda/lib/R/library/arrow/libs/../../../lib/libR.so (0x00007ffb7532a000)
>
> I think the symbol is hashed, so I can't tell what function in
> libarrow_dataset.so it is looking for
>
> _ZN5arrow2fs8internal17SplitAbstractPathERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
>
> Did I need to compile a version of Arrow with some kind of flag in order
> to see this symbol? I currently get arrow-cpp, pyarrow, and r-arrow all
> from conda-forge.
>
> Thank you so much for all the amazing development in arrow. This exchange
> of pandas dataframe to R dataframe via arrow table is amazingly fast.
> --
> Jeffrey Wong
> Computational Causal Inference
>

--
Jeffrey Wong
Computational Causal Inference
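
One way to look into the shared-object question raised at the top of this message is to list which Arrow libraries are actually mapped into the Python process, once after importing pyarrow and again after attempting library(arrow). The following is a minimal sketch, assuming Linux (it reads /proc/self/maps). The helper names parse_mapped_libraries and mapped_libraries are hypothetical and not part of pyarrow or rpy2.

```python
from pathlib import Path


def parse_mapped_libraries(maps_text, needle="arrow"):
    """Return sorted, de-duplicated file paths from /proc/<pid>/maps text
    whose file name contains `needle`."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping with no backing file
        path = fields[5].strip()
        # Skip pseudo-entries such as [vdso] or [heap]
        if path.startswith("/") and needle in Path(path).name:
            paths.add(path)
    return sorted(paths)


def mapped_libraries(needle="arrow"):
    """List files mapped into the current process (Linux only).

    Returns an empty list where /proc/self/maps does not exist.
    """
    maps = Path("/proc/self/maps")
    if not maps.exists():
        return []
    return parse_mapped_libraries(maps.read_text(), needle)


if __name__ == "__main__":
    # Intended use: call once after `import pyarrow` and once after
    # ro.r("library(arrow)"), then compare the two listings.
    for path in mapped_libraries("arrow"):
        print(path)
```

If libarrow.so.16 or libarrow_dataset.so.16 shows up under more than one directory, that would suggest two copies of the library are in play. This is only a diagnostic aid, not a confirmed explanation of the failure.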