Thanks for the advice Gabor, I was indeed not starting and finishing with sqldf(). Which was why it was not working for me. Please forgive a blatantly obvious mistake.
I have tried what U suggested and unfortunately R is still having problems doing the join. The problem seems to be one of memory since I am receiving the below error message when I run the natural join using sqldf. Error: cannot allocate vector of size 250.0 Mb Timing stopped at: 32.8 0.48 34.79 I have tried it on a subset of the data and it works. So I think it's a memory issue, being caused by my very large dataset (11 million rows and 2 columns). I think I may have to admit that R cannot do this (on my machine). And try doing it in a full blown database such as postgre. Unless U (or anyone else) have any other suggestions??? Thanks again for your help. For anyone who's interested here's all my code and R log. ## # Info on input data ## > class(A) [1] "data.frame" > class(B) [1] "data.frame" > names(A) [1] "POINTID" "alistair" > names(B) [1] "POINTID" "alistair_range" > dim(A) [1] 11048592 2 > dim(B) [1] 11048592 2 ## # Tried the join with an index on the entire data set ## > sqldf() <SQLiteConnection:(3852,0)> > system.time(sqldf("create index ai1 on A(POINTID, alistair)")) user system elapsed 76.85 0.34 79.67 > system.time(sqldf("create index ai2 on B(POINTID, alistair_range)")) user system elapsed 75.43 0.43 77.16 > system.time(sqldf("select * from main.A natural join main.B")) Error: cannot allocate vector of size 250.0 Mb Timing stopped at: 32.8 0.48 34.79 > sqldf() Error in sqliteCloseConnection(conn, ...) : RS-DBI driver: (close the pending result sets before closing this connection) ## # Also tried the join with an index built from only the variable I intend to merge on, since I wasn't exactly sure which index was correct. ## > sqldf() <SQLiteConnection:(3852,1)> > system.time(sqldf("create index ai1 on A(POINTID)")) user system elapsed 66.67 0.44 69.28 > system.time(sqldf("create index ai2 on B(POINTID)")) user system elapsed 68.18 0.31 68.73 > system.time(sqldf("select * from main.A natural join main.B")) Error: cannot allocate vector of size 31.2 Mb Timing stopped at: 10.56 0.04 10.87 > sqldf() Error in sqliteCloseConnection(conn, ...) : RS-DBI driver: (close the pending result sets before closing this connection) ## # and some memory info ## > memory.size() [1] 412.6 > memory.size(NA) [1] 4095 Chris Howden Founding Partner Tricky Solutions Tricky Solutions 4 Tricky Problems Evidence Based Strategic Development, IP development, Data Analysis, Modelling, and Training (mobile) 0410 689 945 (fax / office) (+618) 8952 7878 ch...@trickysolutions.com.au -----Original Message----- From: Gabor Grothendieck [mailto:ggrothendi...@gmail.com] Sent: Friday, 15 October 2010 1:03 PM To: Chris Howden Cc: r-help@r-project.org Subject: Re: [R] merging and working with BIG data sets. Is sqldf the best way?? On Thu, Oct 14, 2010 at 10:56 PM, Chris Howden <ch...@trickysolutions.com.au> wrote: > Thanks for the suggestion and code Gabor, > > I've tried creating 2 indices: > > 1) just for the variable I intend to merge on > 2) on the entire data set I am merging (which I think is the one I should > be using??) > > However neither seemed to work. The first was still going after 2 hours, > and the second after 12 hours, so I stopped the join. > > If it's not too much bother I was wondering if U could let me know which > index I should be using? > > > Or in other words since I plan to merge using POINTID do I create an index > on > > system.time(sqldf("create index ai1 on A(POINTID)")) > system.time(sqldf("create index ai2 on B(POINTID)")) > > or > > system.time(sqldf("create index ai1 on A(POINTID,alistair)")) > system.time(sqldf("create index ai2 on B(POINTID, alistair_range)") > > > > I'm now using the following join statement > system.time(data2 <- sqldf("select * from A natural join B")) > If you only ran the three sqldf statements you mentioned in your post (thereby omitting 2 of the 5 sqldf calls in example 4i): sqldf("create...") sqldf("create...") sqldf("select...") then what you are doing is to create a database, upload your data to it, create an index on it, destroy the database, then create a second database, upload the data to this second database, create an second index and then destroy that database too and then finally create a third database, upload the data to it and then do a join without any indexes. You must bracket all this with empty sqldf calls as shown in 4i to force persistence: sqldf() sqldf("create...") sqldf("create...") sqldf("select...") sqldf() or else put all of the sql statements in a vector to one sqldf call: sqldf(c("create...", "create...", "select...")) Also you can replace "select ..." with "explain query plan select ...", and it will let you know which indexes its actually using. e.g. in example 4i if we do that: > sqldf("explain query plan select * from main.DF1 natural join main.DF2") order from detail 1 0 0 TABLE DF1 2 1 1 TABLE DF2 WITH INDEX ai2 we see that it really only used index ai2 and not index ai1 at all so we could have saved the 19 seconds that it took to create ai1 as it was never used. -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.