Hi Matthew and Jim,

Thanks for all the suggestions as always. Matthew's post was very informative in showing how things can be done much more efficiently with data.table. I haven't had a chance to finish the reshaping because my group was in a rush, and someone else decided to do it in Perl. However, I did get a chance to use the data.table package for the first time. In some preliminary steps, I had to do some subsetting and recoding, and this was superfast with data.table. The tutorials were helpful in getting me up to speed. Over the next few days I plan to carry out the reshaping as a learning exercise so I'll be ready next time. I'll post my results afterwards.
Thanks,
Juliet

On Mon, Jul 12, 2010 at 11:50 AM, Matthew Dowle <mdo...@mdowle.plus.com> wrote:
> Juliet,
>
> I've been corrected off list. I did not read properly that you are on 64bit.
>
> The calculation should be :
>     53860858 * 4 * 8 /1024^3 = 1.6GB
> since pointers are 8 bytes on 64bit.
>
> Also, data.table is an add-on package so I should have included :
>
>     install.packages("data.table")
>     require(data.table)
>
> data.table is available on all platforms both 32bit and 64bit.
>
> Please forgive mistakes: 'someoone' should be 'someone', 'percieved' should
> be 'perceived' and 'testDate' should be 'testData' at the end.
>
> The rest still applies, and you might have a much easier time than I thought
> since you are on 64bit. I was working on the basis of squeezing into 32bit.
>
> Matthew
>
>
> "Matthew Dowle" <mdo...@mdowle.plus.com> wrote in message
> news:i1faj2$lv...@dough.gmane.org...
>>
>> Hi Juliet,
>>
>> Thanks for the info.
>>
>> It is very slow because of the == in testData[testData$V2==one_ind,]
>>
>> Why? Imagine someoone looks for 10 people in the phone directory. Would
>> they search the entire phone directory for the first person's phone
>> number, starting on page 1, looking at every single name, even continuing
>> to the end of the book after they had found them ? Then would they start
>> again from page 1 for the 2nd person, and then the 3rd, searching the
>> entire phone directory from start to finish for each and every person ?
>> That code using == does that. Some of us call that a 'vector scan' and is
>> a common reason for R being percieved as slow.
>>
>> To do that more efficiently try this :
>>
>>     testData = as.data.table(testData)
>>     setkey(testData,V2) # sorts data by V2
>>     for (one_ind in mysamples) {
>>         one_sample <- testData[one_ind,]
>>         reshape(one_sample)
>>     }
>>
>> or just this :
>>
>>     testData = as.data.table(testData)
>>     setkey(testDate,V2)
>>     testData[,reshape(.SD,...), by=V2]
>>
>> That should solve the vector scanning problem, and get you on to the
>> memory problems which will need to be tackled. Since the 4 columns are
>> character, then the object size should be roughly :
>>
>>     53860858 * 4 * 4 /1024^3 = 0.8GB
>>
>> That is more promising to work with in 32bit so there is hope. [ That
>> 0.8GB ignores the (likely small) size of the unique strings in global
>> string hash (depending on your data). ]
>>
>> It's likely that the as.data.table() fails with out of memory. That is not
>> data.table but unique. There is a change in unique.c in R 2.12 which makes
>> unique more efficient and since factor calls unique, it may be necessary
>> to use R 2.12.
>>
>> If that still doesn't work, then there are several more tricks (and we
>> will need further information), and there may be some tweaks needed to
>> that code as I didn't test it, but I think it should be possible in 32bit
>> using R 2.12.
>>
>> Is it an option to just keep it in long format and use a data.table ?
>>
>>     testDate[, somecomplexrfunction(onecolumn, anothercolumn), by=list(V2) ]
>>
>> Why do you need to reshape from long to wide ?
>>
>> HTH,
>> Matthew
>>
>>
>>
>> "Juliet Hannah" <juliet.han...@gmail.com> wrote in message
>> news:aanlktinyvgmrvdp0svc-fylgogn2ro0omnugqbxx_...@mail.gmail.com...
>> Hi Jim,
>>
>> Thanks for responding. Here is the info I should have included before.
>> I should be able to access 4 GB.
>>
>>> str(myData)
>> 'data.frame': 53860857 obs. of 4 variables:
>> $ V1: chr "200003" "200006" "200047" "200050" ...
>> $ V2: chr "cv0001" "cv0001" "cv0001" "cv0001" ...
>> $ V3: chr "A" "A" "A" "B" ...
>> $ V4: chr "B" "B" "A" "B" ...
>>> sessionInfo()
>> R version 2.11.0 (2010-04-22)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> On Mon, Jul 12, 2010 at 7:54 AM, jim holtman <jholt...@gmail.com> wrote:
>>> What is the configuration you are running on (OS, memory, etc.)? What
>>> does your object consist of? Is it numeric, factors, etc.? Provide a
>>> 'str' of it. If it is numeric, then the size of the object is
>>> probably about 1.8GB. Doing the long to wide you will probably need
>>> at least that much additional memory to hold the copy, if not more.
>>> This would be impossible on a 32-bit version of R.
>>>
>>> On Mon, Jul 12, 2010 at 1:25 AM, Juliet Hannah <juliet.han...@gmail.com>
>>> wrote:
>>>> I have a data set that has 4 columns and 53860858 rows. I was able to
>>>> read this into R with:
>>>>
>>>> cc <- rep("character",4)
>>>> myData <- read.table("myData.csv",header=FALSE,skip=1,colClasses=cc,nrow=53860858,sep=",")
>>>>
>>>>
>>>> I need to reshape this data from long to wide. On a small data set the
>>>> following lines work. But on the real data set, it didn't finish even
>>>> when I took a sample of two (rows in new data). I didn't receive an
>>>> error. I just stopped it because it was taking too long. Any
>>>> suggestions for improvements? Thanks.
>>>>
>>>> # start example
>>>> # i have commented out the write.table statement below
>>>>
>>>> testData <- read.table(textConnection("rs9999853,cv0084,A,A
>>>> rs999986,cv0084,C,B
>>>> rs9999883,cv0084,E,F
>>>> rs9999853,cv0085,G,H
>>>> rs999986,cv0085,I,J
>>>> rs9999883,cv0085,K,L"),header=FALSE,sep=",")
>>>> closeAllConnections()
>>>>
>>>> mysamples <- unique(testData$V2)
>>>>
>>>> for (one_ind in mysamples) {
>>>>     one_sample <- testData[testData$V2==one_ind,]
>>>>     mywide <- reshape(one_sample, timevar = "V1", idvar = "V2",direction = "wide")
>>>>     # write.table(mywide,file ="newdata.txt",append=TRUE,row.names=FALSE,col.names=FALSE,quote=FALSE)
>>>> }
>>>>
>>>> ______________________________________________
>>>> R-help@r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>>
>>>
>>> --
>>> Jim Holtman
>>> Cincinnati, OH
>>> +1 513 646 9390
>>>
>>> What is the problem that you are trying to solve?
>>>
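
For anyone following the thread, here is a minimal sketch of two points discussed above, written in base R so that it runs without add-on packages. It is not Matthew's data.table code: split() is swapped in to group the rows in one pass instead of scanning V2 with == once per sample, and the last lines redo the rows * columns * pointer-size estimate. The toy data is from the original post; the names by_sample, wide_list, n_rows, gb_32bit and gb_64bit are made up for illustration, and the rbind() step assumes every sample has the same markers in the same order.

```r
# Toy data from the original post (2 samples, 3 markers each)
testData <- read.table(text = "rs9999853,cv0084,A,A
rs999986,cv0084,C,B
rs9999883,cv0084,E,F
rs9999853,cv0085,G,H
rs999986,cv0085,I,J
rs9999883,cv0085,K,L", header = FALSE, sep = ",", stringsAsFactors = FALSE)

# Group the rows by sample once, rather than testing V2 == one_ind
# against the whole column for every sample
by_sample <- split(testData, testData$V2)

# Reshape each sample's rows from long to wide (one row per sample)
wide_list <- lapply(by_sample, reshape,
                    timevar = "V1", idvar = "V2", direction = "wide")

# Stack the one-row results; assumes identical markers across samples
mywide <- do.call(rbind, wide_list)

# Rough size estimates from the thread: rows * columns * pointer size
n_rows   <- 53860858
gb_32bit <- n_rows * 4 * 4 / 1024^3  # 4-byte pointers
gb_64bit <- n_rows * 4 * 8 / 1024^3  # 8-byte pointers
```

On the real data the memory needed for the wide copy still applies, so this only illustrates the grouping idea, not a complete solution.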