It would be very useful to be able to pass objects both directions in memory "by pointer". I had opened https://issues.apache.org/jira/browse/ARROW-3750 previously. It may be that what you've done is already the right way, but feel free to open a pull request and we can have a look
cheers Wes On Fri, Jan 11, 2019 at 2:33 AM Romain Francois <rom...@purrple.cat> wrote: > > This is exactly (IMO) the kind of things this project is about. Thanks for > looking into that, I'll try to incorporate the idea in the R package asap. > > Did you look at the reverse operation, i.e. promote an arrow object from R to > python ? > > Romain > > > Le 11 janv. 2019 à 06:54, Jeffrey Wong <jeffr...@netflix.com.INVALID> a > > écrit : > > > > Hello, I wanted to share a great experience I had with arrow, Python and R, > > and possible contribute a package I wrote. > > > > I work with many colleagues that build engineering systems in python. As a > > data scientist I work almost entirely in R and Rcpp. To bridge the > > engineering work with data science work my team uses rpy2. Our workflow is > > usually this: > > > > 1. Python is used to coordinate with other engineering systems > > 2. Python ultimately gets a dataset that needs to be analyzed. This can > > take the form of parquet or arrow formatted data > > 3. Using rpy2, python can pass data to R to be analyzed, and receive > > output from R at the end. Python can then coordinate with downstream > > systems > > > > What we have found is that passing data from python to R can be quite slow. > > Recently, I found that if python already has a pyarrow Table object in the > > session, it can be passed to Rcpp as a SEXP through rpy2. Rcpp can use the > > C++ arrow library to extract the underlying arrow Table object from the > > pyarrow Table object, and materialize a data frame out of that. I have > > found that I can transfer 20 million row datasets from python to R in ~10 > > seconds. This is particularly powerful when Python is already the driver of > > the engineering systems, and the compute is pushed into R. > > > > This is a huge advantage. Performance wise, this is the fastest way to > > transfer data to R that we have seen. Culture wise, it means my engineering > > and data science teams can collaborate much better as both teams operate on > > arrow types. > > > > I wrote an R package containing an Rcpp function > > RcppReceiveArrowTableFromPython (link > > <https://github.com/jeffwong-nflx/RcppPyArrow>) which receives the SEXP > > from rpy2, unwraps the underlying arrow table, and then produces an R > > dataframe. I am interested in contributing this package back to the Arrow > > community, as I believe part of the spirit of Arrow is to facilitate > > seamless data transfer across languages. Is the Arrow codebase a proper > > home for such a package? This package has a dependency on having Python > > headers and being able to link to libpython.so, will that complicate this > > contribution? > > > > -- > > Jeffrey Wong > > Computational Causal Inference >