Yes, that is likely the source of the difference. I'm happy to help fix it (it won't be hard), but first I want to clarify exactly how you want the data split:
Say we have 20 values, x = 1:20. If there's a split we go to 1:10, 11:20; then 1:5, 6:10, 11:15, 16:20, etc. But what about situations where the two data sets have very different sizes:

x = cbind(1:20, 1:7)

Where exactly does one split take us:

cbind(c(1:10, 11:20), c(1:3, 4:7))

or

cbind(c(1:10, 11:20), c(1:4, 5:7))

and then what of the next iteration?

More generally, what exactly are you comparing? It seems odd to have two different categories/samples, compare their means, and then switch gears entirely to compare subsamples of the categories independently. These are simply different inferences: comparing the average of cats vs dogs, and then comparing boy cats vs girl cats and boy dogs vs girl dogs. That winds up highlighting different independent variables. (Iteration one: species --> iteration two: gender)

If you could say a little more about your data, it'd be easier to do the splits in a meaningful way. As currently implemented, my code takes a 2d data frame and simply divides it into the top and bottom halves, which in most applications would correspond to doing a mean-comparison calculation for different statistics of the same observation. The subsetting then keeps "corresponding" data together -- I put corresponding in quotation marks because we aren't doing paired t-tests.

Looking forward to your reply,

Michael

PS -- I did the splits basically the same way (other than the direction) but I just used floor() instead of round().

On Fri, Aug 12, 2011 at 3:45 AM, Marina de Wolff <marinadewo...@hotmail.com> wrote:

> Thank you for your reply,
>
> I used this code on my test data, but did not get the same p-values.
>
> I think I know where the difference lies: when the data is split in 4 parts
> I want to compare the two left groups (group 1 and 2) with each other and
> the two right groups (group 3 and 4) with each other. It seems that with
> this code group 1 and 3 are compared with each other, and group 2 and 4. I
> have not yet succeeded in changing this.
>
> About the unequal data sizes, I thought I could 'correct' this by using
> round. For example, when my data consists of 17 data points I would use
>
> m <- length(data)/2
> x <- data[1:round(m)]
> y <- data[(round(m)+1):length(data)]
>
> x has size 9 and y has size 8.
>
> Sincerely,
>
> Marina de Wolff
>
> ------------------------------
> From: michael.weyla...@gmail.com
> Date: Thu, 11 Aug 2011 11:54:11 -0400
> Subject: Re: [R] Splitting data
> To: marinadewo...@hotmail.com
> CC: r-help@r-project.org
>
> This sounds very much like a recursive problem: something like this seems
> to get the gist of what you want.
>
> DataSplits <- function(Data, alpha = 0.05) {
>     DataSplitsCore <- function(Data, alpha, level) {
>         # compare the two columns of the current block of rows
>         tt <- t.test(Data[,1], Data[,2])
>         print(tt)
>         if (tt$p.value > alpha) {
>             print(paste("Stopped at level", level))
>             return(invisible(TRUE))
>         } else {
>             nr = floor(NROW(Data)/2)
>             if (nr == 1) {
>                 # halves of a single row cannot be t-tested: end the recursion
>                 print("Reached Samples of Size 1")
>                 return(invisible(TRUE))
>             }
>             # recurse into the top half, then the bottom half
>             d1 = DataSplitsCore(Data[(1:nr),], alpha = alpha, level = level + 1)
>             if (d1) return(invisible(TRUE))
>             d2 = DataSplitsCore(Data[-(1:nr),], alpha = alpha, level = level + 1)
>             if (d2) return(invisible(TRUE))
>             return(invisible(FALSE))
>         }
>     }
>     DataSplitsCore(Data, alpha = alpha, level = 1)
> }
>
> Your description wasn't the clearest about what to do when the data sizes
> didn't match, but this should give you a start. Let me know if this doesn't
> do as desired and I can help tweak it.
>
> Hope this can be of help,
>
> Michael Weylandt
>
> PS -- You might as well use R's built-in t.test function.
>
> On Thu, Aug 11, 2011 at 5:17 AM, Marina de Wolff
> <marinadewo...@hotmail.com> wrote:
>
> > I want to implement the following algorithm in R:
> > I want to split my data, use a t test to compare both means of the groups
> > to see if they significantly differ from each other.
> > If this is a yes (p < alpha) I want to split again (into 4 groups) and do
> > the same procedure twice, and stop otherwise (here the problem arises).
> > As a final result I would have different groups of data.
> >
> > I made some code where the data is split until no further splitting is
> > possible. So for 16 data points, we can split 4 times with a final result
> > of 16 groups (p is NA for the 4th split since sd cannot be calculated).
> >
> > The code calculates all p-values, but I don't want this. I want it to stop
> > when p > alpha. I tried while, but didn't succeed.
> >
> > I hope someone can help me to achieve my goal.
> >
> > This is what I tried so far with test data:
> >
> > a = rnorm(9,0,0.1)
> > b = rnorm(7,1,0.1)
> > data = c(a,b)
> > plot(data)
> >
> > # Want to calculate max of groups/split for the data
> > d = seq(1,100,1)
> > n = 2^d
> > m <- which(n <= length(data))
> > n = n[m[1]:m[length(m)]]
> >
> > # All groups
> > i=0
> > j=0
> > dx = 0
> > dy = 0
> > for (i in 1:length(n)){
> >   split <- length(data)/(n[i])
> >   for (j in 1:(n[i]/2)){
> >     x = data[(1 + (j-1)*(2*split)):(round(split) + (j-1)*(2*split))]
> >     dx = cbind(dx,x)
> >     y = data[((round(split)+1) + (j-1)*(2*split)):(2*j*split)]
> >     dy = cbind(dy,y)
> > }}
> >
> > # drop the placeholder first column
> > dx = dx[,2:dim(dx)[2]]
> > dy = dy[,2:dim(dy)[2]]
> >
> > k=0
> > meanx=0
> > meany=0
> > sdx=0
> > sdy=0
> > nx=0
> > ny=0
> > # unique() drops the values that cbind() recycled into shorter columns
> > for (k in 1:dim(dx)[2]) {
> >   meanx[k] = mean(unique(dx[,k]))
> >   meany[k] = mean(unique(dy[,k]))
> >   sdx[k] = sd(unique(dx[,k]))
> >   sdy[k] = sd(unique(dy[,k]))
> >   nx[k] = length(unique(dx[,k]))
> >   ny[k] = length(unique(dy[,k]))
> > }
> >
> > # Welch t statistic, degrees of freedom and two-sided p-value
> > t = (meanx-meany)/sqrt((sdx^2/nx) + (sdy^2/ny))
> > df = ((sdx^2/nx) + (sdy^2/ny))^2/((sdx^2/nx)^2/(nx-1) + (sdy^2/ny)^2/(ny-1))
> > p = 2*pt(-abs(t),df=df)
> > alpha = 0.05
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
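
For what it's worth, here is a minimal sketch of one possible reading of the request in this thread: a single numeric vector is split in half, the left half is t-tested against the right half, and each half is split further only while the two halves differ significantly. The function name split_groups and the choice to return a list of the final groups are assumptions made for illustration, not code posted by anyone in the thread, and it only covers the one-vector case from the original test data.

```r
# Hypothetical helper (not from the thread): recursively split a numeric
# vector in half, t-test the left half against the right half, and keep
# splitting each half only while the two halves differ significantly.
split_groups <- function(x, alpha = 0.05) {
  n <- length(x)
  half <- floor(n / 2)  # for odd n the left half is the smaller one

  # t.test() needs at least two observations in each group
  if (half < 2) return(list(x))

  left  <- x[seq_len(half)]
  right <- x[(half + 1):n]

  # Welch two-sample t-test (the default of t.test)
  p <- t.test(left, right)$p.value

  # no significant difference: this block stays together as one group
  if (p > alpha) return(list(x))

  # significant difference: treat each half as its own problem
  c(split_groups(left, alpha), split_groups(right, alpha))
}

# test data in the style of the original post
set.seed(1)
data <- c(rnorm(9, 0, 0.1), rnorm(7, 1, 0.1))
groups <- split_groups(data)
lengths(groups)  # sizes of the final groups
```

The sketch uses floor() rather than round() for the split point because R's round() rounds halves to the even digit, so round(8.5) is 8, not 9. With 17 points, floor(17/2) gives a left half of 8 and a right half of 9; use ceiling() instead if the left half should be the larger one.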