Dear R list,

What is the best way to efficiently marry an R dataset with a very large 
(Oracle) database table?  

The goal is to only return Oracle table rows that match IDs present in the R 
dataset.  
I have an R data frame with 2000 user IDs analogous to: r = 
data.frame(userid=round(runif(2000)*100000,0))

...and I need to pull data from an Oracle table only for these 2000 IDs.  The 
Oracle table is quite large. Additionally, the sql query may need to join to 
other tables to bring in ancillary fields.

I currently connect to Oracle via odbc: 

library(RODBC)
connection <- odbcConnect("****", uid="****", pwd="****")
d = sqlQuery(connection, "select userid, x, y, z from largetable where 
timestamp > sysdate -7")

...allowing me to pull data from the database table into the R object "d" and 
then use the R merge function.  The problem however is that if "d" is too large 
it may fail due to memory limitations or be inefficient.  I would like to push 
the merge portion to the database and it would be very convenient if it were 
possible to request that the query look to the R object for the ID's to which 
it should restrict the output.  

Is there a way to do this?
Something like the following fictional code:
d = sqlQuery(connection, "select t.userid, x, y, z from largetable t where 
r$userid=t.userid")

Would sqldf (http://code.google.com/p/sqldf/) help me out here? If so, how?   
This would be convenient and help me avoid needing to create a temporary table 
to store the R data, join via sql, then return the data back to R.

I am using R version 2.7.2 (2008-08-25) / i386-pc-mingw32 . 
Thanks for your comments, ideas, recommendations.


-Avram

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to