[R] remove a "corrupted file" after using download.file() with R on Windows 7

2016-09-28 Thread Fabien Tarrade
Hi there, Sometimes download.file() fails to download the file and I would like to remove the corresponding file. The issue is that I am not able to do it: Windows complains that the file is in use by another application. I tried closeAllConnections() and unlink() before removing the file, but without success …
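
A minimal sketch of the cleanup being attempted, assuming the lock comes from a connection R itself still holds after the failed download; 'url' and 'destfile' are placeholders, not from the original post:

  url      <- "http://example.com/data.csv"     # placeholder URL
  destfile <- file.path(tempdir(), "data.csv")

  ok <- tryCatch({
    download.file(url, destfile, mode = "wb")
    TRUE
  }, warning = function(w) FALSE, error = function(e) FALSE)

  if (!ok && file.exists(destfile)) {
    closeAllConnections()        # release any handle R still holds
    Sys.sleep(1)                 # give Windows a moment to drop the lock
    unlink(destfile, force = TRUE)
  }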

Re: [R] Faster Subsetting

2016-09-28 Thread Dénes Tóth
Hi Harold, Generally: you cannot beat data.table, unless you can represent your data in a matrix (or array or vector). For some specific cases, Hervé's suggestion might also be competitive. Your problem is that you did not put any effort into reading at least part of the very extensive documentation …
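
For reference, a minimal data.table sketch of the grouped step discussed in this thread; the sizes are illustrative and mean() stands in for the real per-id analysis:

  library(data.table)

  tmp <- data.table(id = rep(1:2000, each = 100), foo = rnorm(200000))
  setkey(tmp, id)                  # sort once; later subsets use binary search
  res <- tmp[, .(m = mean(foo)), by = id]
  head(res)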

Re: [R] How to test a difference in ratios of count data in R

2016-09-28 Thread David Winsemius
> On Sep 28, 2016, at 9:49 AM, Greg Snow <538...@gmail.com> wrote: > > There are multiple ways of doing this, but here are a couple. > > To just test the fixed effect of treatment you can use the glm function: > > test <- read.table(text=" > replicate treatment n X > 1 A 32 4 > 1 B 33 18 > 2 A …

Re: [R] Add annotation text outside of an xyplot (lattice package)

2016-09-28 Thread Kevin Wright
You can find an example of annotating lattice graphics with text anywhere on the graphics device using the pagenum package. See the vignette here: https://cran.r-project.org/web/packages/pagenum/vignettes/pagenum.html The pagenum package uses the grid package to add viewports for the annotation.
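
For a sense of the mechanism, here is a minimal sketch (not taken from the pagenum vignette) that uses grid directly to place text at the edge of the device after a lattice plot is drawn:

  library(lattice)
  library(grid)

  print(xyplot(mpg ~ wt, data = mtcars))
  ## after lattice finishes drawing, npc coordinates address the whole device
  grid.text("Draft -- page 1",
            x = unit(0.99, "npc"), y = unit(0.01, "npc"),
            just = c("right", "bottom"), gp = gpar(cex = 0.7))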

Re: [R] Faster Subsetting

2016-09-28 Thread Martin Morgan
On 09/28/2016 02:53 PM, Hervé Pagès wrote: Hi, I'm surprised nobody suggested split(). Splitting the data.frame upfront is faster than repeatedly subsetting it: tmp <- data.frame(id = rep(1:2, each = 10), foo = rnorm(20)); idList <- unique(tmp$id); system.time(for (i in idList) tmp …

Re: [R] Error in gam() object 'scat' not found

2016-09-28 Thread David Winsemius
> On Sep 27, 2016, at 8:11 PM, Karl Neergaard wrote: > > Thank you David for taking time to answer my not-so-helpful question. I thought your question had sufficient detail for at least a reasonable guess at an answer. When I first started using R, I also thought that the gam function would …
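
A guess at the underlying issue, for the record: scat() is the scaled-t family shipped with mgcv, so this error usually means gam() is coming from the 'gam' package (or an mgcv too old to have scat). A sketch with an explicit namespace and made-up data:

  library(mgcv)

  set.seed(1)
  dat <- data.frame(x = runif(200))
  dat$y <- sin(3 * dat$x) + 0.2 * rt(200, df = 4)   # heavy-tailed noise

  fit <- mgcv::gam(y ~ s(x), family = scat(), data = dat)
  summary(fit)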

Re: [R] Faster Subsetting

2016-09-28 Thread Bert Gunter
"I'm surprised nobody suggested split()." I did: by() is a data-frame-oriented version of tapply(), which uses split(). Cheers, Bert Bert Gunter "The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)

Re: [R] Faster Subsetting

2016-09-28 Thread Hervé Pagès
Hi, I'm surprised nobody suggested split(). Splitting the data.frame upfront is faster than repeatedly subsetting it: tmp <- data.frame(id = rep(1:2, each = 10), foo = rnorm(20)); idList <- unique(tmp$id); system.time(for (i in idList) tmp[which(tmp$id == i),]) # user system elapsed …
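
To complete the comparison the preview cuts off, a sketch with a larger toy size (my numbers, not Hervé's): split once, then iterate over the pieces.

  tmp <- data.frame(id = rep(1:2000, each = 100), foo = rnorm(200000))
  idList <- unique(tmp$id)

  ## repeated subsetting: scans the whole data.frame once per id
  system.time(for (i in idList) tmp[which(tmp$id == i), ])

  ## split upfront: one pass over the data, then cheap iteration
  system.time({
    pieces <- split(tmp, tmp$id)
    for (p in pieces) mean(p$foo)    # mean() stands in for the real analysis
  })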

Re: [R] Faster Subsetting

2016-09-28 Thread Weiser, Dr. Constantin
I just modified the reproducible example a bit, so it's a bit more realistic. The function "mean" could be "easily" replaced by your analysis. And here are some possible solutions: tmp <- data.frame(id = rep(1:2000, each = 100), foo = rnorm(200000)); tmp <- tmp[sample(dim(tmp)[1]),] # re-sampling the rows …
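
One plausible reading of the "possible solutions" the preview truncates: grouped means without an explicit loop, on the shuffled data above (mean() again stands in for the real analysis):

  tmp <- data.frame(id = rep(1:2000, each = 100), foo = rnorm(200000))
  tmp <- tmp[sample(nrow(tmp)), ]                     # shuffled, as in the message

  res <- vapply(split(tmp$foo, tmp$id), mean, numeric(1))
  head(res)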

Re: [R] Faster Subsetting

2016-09-28 Thread Dominik Schneider
I regularly crunch through this amount of data with the tidyverse. You can also try the data.table package. Both are optimized for speed, as long as you have the memory. Dominik On Wed, Sep 28, 2016 at 10:09 AM, Doran, Harold wrote: > I have an extremely large data frame (~13 million rows) that resembles …
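
A minimal tidyverse sketch of the grouped step (illustrative sizes; mean() stands in for the per-id analysis):

  library(dplyr)

  tmp <- data.frame(id = rep(1:2000, each = 100), foo = rnorm(200000))

  res <- tmp %>%
    group_by(id) %>%
    summarise(m = mean(foo))
  head(res)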

Re: [R] Faster Subsetting

2016-09-28 Thread Enrico Schumann
On Wed, 28 Sep 2016, "Doran, Harold" writes: > I have an extremely large data frame (~13 million rows) that resembles > the structure of the object tmp below in the reproducible code. In my > real data, the variable 'id' may or may not be ordered, but I think > that is irrelevant. > > I have a process …

Re: [R] Putting a bunch of Excel files as data.frames into a list fails

2016-09-28 Thread jeremiah rounds
Try changing: v_list_of_files[v_file] to: v_list_of_files[[v_file]]. Also, are you sure you are not generating warnings? For example, l <- list(); l["iris"] <- iris generates one. You can also change it to lapply(v_files, function(v_file){...}). Have a good one, Jeremiah On Wed, Sep 28, 2016 at 8:02 AM, wrote: …
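
The difference in a self-contained form (the warning is exactly the symptom the reply asks about):

  l <- list()

  l["iris"] <- iris      # warning: only the first column lands in l$iris
  str(l$iris)

  l[["iris"]] <- iris    # the whole data.frame is stored as one element
  str(l[["iris"]])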

Re: [R] Faster Subsetting

2016-09-28 Thread Bert Gunter
Note that for base R, by() is considerably faster than aggregate() (both of which are *much* faster than the sapply() stuff); tapply() is what is more appropriate here. (For Constantin's example:) > system.time({ + res4 <- aggregate(tmp$foo, by = list(id=tmp$id), FUN = mean) + }) # user system elapsed …
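
The by() and tapply() counterparts to the aggregate() timing above, as a sketch (timings are machine-dependent):

  tmp <- data.frame(id = rep(1:2000, each = 100), foo = rnorm(200000))

  system.time(res5 <- by(tmp$foo, tmp$id, FUN = mean))    # vs. aggregate() above
  system.time(res6 <- tapply(tmp$foo, tmp$id, mean))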

Re: [R] Faster Subsetting

2016-09-28 Thread Doran, Harold
Many thanks. I also tried the filter function in dplyr and it was also much slower than simply indexing in the original way the code had: system.time(replicate(500, filter(tmp, id == idList[1]))). I did this on the toy example as well as the real data, finding the same (slower) result each time …

Re: [R] Faster Subsetting

2016-09-28 Thread Bert Gunter
Don't do it this way; you are reinventing wheels. 1. Look at package dplyr, which has optimized functions to do exactly this (break into subframes, calculate on subframes, reassemble; see the base-R sketch of the same pattern below). Note also that dplyr is part of the tidyverse. I use base R functionality for this because I know it and it does what I need …
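
The base-R split/apply/combine pattern alluded to, as a sketch (the variable names are mine):

  tmp <- data.frame(id = rep(1:2000, each = 100), foo = rnorm(200000))

  pieces <- split(tmp, tmp$id)                       # break into subframes
  stats  <- lapply(pieces, function(d)
                     data.frame(id = d$id[1], m = mean(d$foo)))
  out    <- do.call(rbind, stats)                    # reassemble
  head(out)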

Re: [R] Faster Subsetting

2016-09-28 Thread ruipbarradas
Hello, If you work with a matrix instead of a data.frame, it usually runs faster, but your column vectors must all be numeric. ### Fast, but not fast enough: system.time(replicate(500, tmp[which(tmp$id == idList[1]),])) # user system elapsed: 0.05 0.00 0.04 ### Not fast at all, a…
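
A sketch of the matrix route being suggested (illustrative sizes; all columns numeric, as required):

  m <- cbind(id = rep(1:2000, each = 100), foo = rnorm(200000))
  idList <- unique(m[, "id"])

  system.time(replicate(500, m[m[, "id"] == idList[1], , drop = FALSE]))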

Re: [R] How to test a difference in ratios of count data in R

2016-09-28 Thread Greg Snow
There are multiple ways of doing this, but here are a couple. To just test the fixed effect of treatment you can use the glm function:

test <- read.table(text="
replicate treatment  n  X
1         A         32  4
1         B         33 18
2         A         20  6
2         B         21 18
3         A          7  0
3         B          8  4
", header=TRUE)

fit1 <- glm(cbind(X, n-X) ~ treatment, data = test, family = binomial) …
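
A hedged continuation of where the preview cuts off: testing the treatment effect from fit1, plus a sketch of one "other way" with replicate as a random effect (my reading of the thread, not text from the original message; assumes test and fit1 from above):

  ## likelihood-ratio test of the treatment effect
  fit0 <- glm(cbind(X, n - X) ~ 1, data = test, family = binomial)
  anova(fit0, fit1, test = "LRT")

  ## sketch: mixed model with replicate as a random effect (lme4 assumed)
  library(lme4)
  fit2 <- glmer(cbind(X, n - X) ~ treatment + (1 | replicate),
                data = test, family = binomial)
  summary(fit2)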

Re: [R] Faster Subsetting

2016-09-28 Thread Doran, Harold
Thank you very much. I don’t know tidyverse, I’ll look at that now. I did some tests with the data.table package, but it was much slower on my machine; see the examples below: tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000)); idList <- unique(tmp$id); system.time(replicate(500, tmp[which(tmp$id == idList[1]),])) …
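
For what it's worth, the slowdown reported here is what unkeyed data.table use looks like; a sketch of the keyed version (my guess at the missing piece, not from the original message; assumes tmp and idList from above):

  library(data.table)

  DT <- as.data.table(tmp)
  setkey(DT, id)                                   # sort once up front

  system.time(replicate(500, DT[.(idList[1])]))    # binary search, not a scan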

[R] Faster Subsetting

2016-09-28 Thread Doran, Harold
I have an extremely large data frame (~13 million rows) that resembles the structure of the object tmp below in the reproducible code. In my real data, the variable, 'id' may or may not be ordered, but I think that is irrelevant. I have a process that requires subsetting the data by id and then …
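
The reproducible code the post refers to is cut off in this preview; reassembled from the replies (the 200-id toy version):

  tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
  idList <- unique(tmp$id)

  ## the per-id subsetting step that dominates the runtime
  system.time(replicate(500, tmp[which(tmp$id == idList[1]), ]))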

[R] Putting a bunch of Excel files as data.frames into a list fails

2016-09-28 Thread G . Maubach
Hi All, I need to read a bunch of Excel files and store them in R. I decided to store the different Excel files as data.frames in a named list, where the names are the file names of each file (and that is different from the sources as far as I can see): -- cut -- # Sources: # - http://stackove…
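
A minimal sketch of the pattern being attempted, assuming the readxl package and placeholder paths (the post's own code is cut off); the variable names follow the reply elsewhere in this thread:

  library(readxl)

  v_files <- list.files("data", pattern = "\\.xlsx$", full.names = TRUE)

  v_list_of_files <- lapply(v_files, read_excel)
  names(v_list_of_files) <- basename(v_files)    # name each element by its file

  str(v_list_of_files, max.level = 1)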

Re: [R] Help with PCA data file prep and R code

2016-09-28 Thread Albin Blaschka
Hi, maybe the package vegan with its tutorials is a good starting point, too... http://cc.oulu.fi/~jarioksa/opetus/metodi/vegantutor.pdf http://cc.oulu.fi/~jarioksa/opetus/metodi/sessio2.pdf All the best, Albin On 22.09.2016 10:23 PM, David L Carlson wrote: Looking at your data there are …
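
For the base-R side of the question, a minimal PCA sketch on a built-in dataset (a stand-in, since the poster's data is not shown):

  pca <- prcomp(USArrests, scale. = TRUE)   # standardize variables first
  summary(pca)                              # variance explained per component
  biplot(pca)                               # scores and loadings at a glance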