By the way, I indeed want to get the data returned.
From: marinadewo...@hotmail.com To: michael.weyla...@gmail.com CC: r-help@r-project.org Subject: RE: [R] Splitting data Date: Mon, 15 Aug 2011 10:32:04 +0200 Dear Michael, I adjusted the code a bit; d1 = DataSplitsCore(Data1[(1:nr1)], Data1[-(1:nr1)], alpha = alpha, level = level + 1) d2 = DataSplitsCore(Data2[(1:nr2)], Data2[-(1:nr2)], alpha = alpha, level = level +1) The code stops when p>alpha, which is what I wanted. Only problem is that it does not calculates the p values for other groups. To make myself clear a short example: Data = c(rnorm(9,0,0.1),rnorm(7,1,0.1)). Data1 = Data[1:(floor(Data)/2)] Data2 = Data[((floor(Data)/2)+1):length(Data)] t.test(Data1,Data2) is significant --> split into 4 groups Data11 = Data1[1:floor(Data1)/2] Data12 = Data1[((floor(Data1)/2)+1):length(Data1)] t.test(Data11,Data12) is not significant --> Stop for this 'branch' Data21 = Data2[1:floor(Data2)/2] Data22 = Data2[((floor(Data2)/2)+1):length(Data2)] t.test(Data21,Data22) is not significant -> Stop for this 'branch' At this moment, the code stops running after the first time is reaches p>alpha, and does not calculate p values for other 'branches' which still needs to be examined. I shall try some of the ideas you suggested, and will come to that back later. Sincerely, Marina de Wolff From: michael.weyla...@gmail.com Date: Fri, 12 Aug 2011 11:50:30 -0400 Subject: Re: [R] Splitting data To: marinadewo...@hotmail.com CC: r-help@r-project.org Sorry -- missed a tweak. The call to DataSplitsCore should read nr = floor(NROW(Data)/2); Data1 = Data[(1:nr)]; Data2 = Data[-(1:nr)]; DataSplitsCore(Data1,Data2,alpha,level) On Fri, Aug 12, 2011 at 11:46 AM, R. Michael Weylandt <michael.weyla...@gmail.com> wrote: Hmmm, interesting problem. I now see your motivation for not using a data frame set up in the split code: try this -- DataSplits <- function(Data, alpha = 0.05) { DataSplitsCore <- function(Data1,Data2, alpha, level) { tt <- t.test(Data1,Data2) print(tt) if (tt$p.value > alpha) { print(paste("Stopped at level", level)) return(invisible(TRUE)) } else { nr1 = floor(NROW(Data1)/2); nr2 = floor(NROW(Data2)/2) if (nr == 1) {print(paste("Reached Samples of Size 1")); stop} d1 = DataSplitsCore(Data1[(1:nr1)], Data2[(1:nr2)], alpha = alpha, level = level + 1) if (d1) return(invisible(TRUE)) d2 = DataSplitsCore(Data1[-(1:nr1)], Data2[-(1:nr2)], alpha = alpha, level = level +1) if (d2) return(invisible(TRUE)) return(invisible(FALSE)) } } DataSplitsCore(Data, alpha = alpha, level = 1) } By the way, would you rather this returned the data rather than the depth the search got to? I don't know much about the subject (gas pressures or outlier identification), so I'll just spit-ball some other filtering techniques: 1) Since it's a time series that changes over time, do some sort of rolling z-score filter: i.e., if you are more than 2 sigma out, toss the data until it stabilizes. Can also do this non-parametrically with means and MADs. (Not wild about this one, but like I said: spitballing) 2) Subsample at random and throw out outliers until you have a fairly stable data set. 3) Use a trimmed mean or median instead of a mean for robustness in your analysis. 4) Filter between some a priori bands: 2e5 to 5e5? 5) Check ROC between points and make sure that's small. Would these work for this sort of reading? None would be hard in R, and most would probably be faster than the splitting technique. Looking at the data, I'm assuming that we more or less hope to isolate 5 stable levels: the first one with the widest variance, 2 and 3 where it gets higher and then comes back down, the long 4 and then 5 after that. I.e., just to confirm those little "drops" should be discarded in their entirety. Michael Weylandt On Fri, Aug 12, 2011 at 11:19 AM, Marina de Wolff <marinadewo...@hotmail.com> wrote: I understand the confusion, I hope I can clearify this. The problem you are refering to will not apply for my data I think. My data consists of 40.000 points, of pressure at a gaswell (vs time). I included a picture. The problem with this data set is that only datapoints in a 'stable' situation are reliable. Therefore the dataset needs to be filtered before it is useable. I'm trying different ideas to fulfill that goal. I already used breakpoints and some sort of steady state detection with moving variance. An other idea would be to split the data in half, compare with each other and if both groups (first half of the data and second half of the data) significantly differ split again and compare both left groups with each other and both right groups with each other etc. As a final result I would have different groups with different lengths (I hope), and only use the groups with a minimum size of m. Many thanks in advance for your assitance in this. Sincerely, Marina de Wolff From: michael.weyla...@gmail.com Date: Fri, 12 Aug 2011 09:35:00 -0400 Subject: Re: [R] Splitting data To: marinadewo...@hotmail.com CC: r-help@r-project.org Yes, that likely is the source of the difference: I'm happy to help fix it up (won't be hard), but I want to clarify exactly how you want the data done: say we have 20 variables x = 1:20 if there's a split we go to 1:10, 11:20; then 1:5, 6:10, 11:15,16:20 etc but what about situations with very different data sets: x = cbind(1:20, 1:7) one split takes us to where exactly: cbind( c(1:10, 11:20), c(1:3,1:4)) or cbind( c(1:10,11:20), c(1:4,5:7)) and then what of the next iteration? More generally, what exactly are you comparing? It seems odd to have two different categories/samples and to compare their means and then to switch gears entirely to compare subsamples of the categories independently. It seems that they are just different inferences: comparing the average of cats vs dogs and then comparing boy cats vs girl cats and boy dogs vs girl dogs. That winds up highlighting different independent variables. (Iteration one: species --> iteration two: gender) If you could speak a little more about your data, it'd be easier to do the splits in a meaningful way. As currently implemented, my code takes a 2d data frame and simply divides it into the top and bottom halves, which in most applications would corresponding to doing a mean-comparison calculation for different statistics of the same observation. The subsetting then keeps "corresponding" data together -- I put corresponding in parentheses because we aren't doing paired t-tests. Looking forward to your reply, Michael PS -- I did the splits basically the same way (other than the direction) but I just used floor() instead of round(). On Fri, Aug 12, 2011 at 3:45 AM, Marina de Wolff <marinadewo...@hotmail.com> wrote: Thank you for your reply, I used this code on my test data, but did not get the same p-values. I think I know were the difference lies; when the data is split in 4 parts I want to compare the two left groups (group 1 and 2) with each other and the two right groups (group 3 and 4) with each other. It seems that with this code group 1 and 3 are compared with each other and group 2 and 4, I did not yet succeeded in changing this. About the unequal data sizes, I thought I could 'correct' this by using round. For example, when my data consists of 17 data points I would use m <- length(data)/2 x <- data[1:round(m)] y <- data[(round(m)+1):length(data)] x has size 9 and y has size 8. Sincerely, Marina de Wolff From: michael.weyla...@gmail.com Date: Thu, 11 Aug 2011 11:54:11 -0400 Subject: Re: [R] Splitting data To: marinadewo...@hotmail.com CC: r-help@r-project.org This sounds very much like a recursive problem: something like this seems to get the gist of what you want. DataSplits <- function(Data, alpha = 0.05) { DataSplitsCore <- function(Data, alpha, level) { tt <- t.test(Data[,1],Data[,2]) print(tt) if (tt$p.value > alpha) { print(paste("Stopped at level", level)) return(invisible(TRUE)) } else { nr = floor(NROW(Data)/2) if (nr == 1) {print(paste("Reached Samples of Size 1")); stop} d1 = DataSplitsCore(Data[(1:nr),], alpha = alpha, level = level + 1) if (d1) return(invisible(TRUE)) d2 = DataSplitsCore(Data[-(1:nr),], alpha = alpha, level = level +1) if (d2) return(invisible(TRUE)) return(invisible(FALSE)) } } DataSplitsCore(Data, alpha = alpha, level = 1) } Your description wasn't the clearest about what to do when the data sizes didn't match, but this should give you a start. Let me know if this doesn't do as desired and I can help tweak it. Hope this can be of help, Michael Weylandt PS -- You might as well use R's built in t.test function. On Thu, Aug 11, 2011 at 5:17 AM, Marina de Wolff <marinadewo...@hotmail.com> wrote: I want to implement the following algorithm in R: I want to split my data, use a t test to compare both means of the groups to see if they significantly differ from each other. If this is a yes (p < alpha) I want to split again (into 4 groups) and do the same procedure twice, and stop otherwise (here the problem arises). As a final result I would have different groups of data. I made some code where the data is splitted, until no splitting is possible. So for 16 datapoints, we can split 4 times with a final result of 16 groups (p is NA for the 4th split since sd cannot be calculated..). The code calculated all p values, but I don't want this. I want it to stop when p > alpha. I tried while, but didn't succeed. I hope someone can help me to acchieve my goal. This is what I tried so far with test data: a = rnorm(9,0,0.1) b = rnorm(7,1,0.1) data = c(a,b) plot(data) # Want to calculate max of groups/split for the data d = seq(1,100,1) n = 2^d m <- which(n <=length(data)) n = n[m[1]:m[length(m)]] # All groups i=0 j=0 dx = 0 dy = for (i in 1:length(n)){ split <- length(data)/(n[i]) for (j in 1:(n[i]/2)){ x = data[(1 + (j-1)*(2*split)):(round(split) + (j-1)*(2*split))] dx = cbind(dx,x) y = data[((round(split)+1) + (j-1)*(2*split)):(2*j*split)] dy = cbind(dy,y) }} dx = dx[,2:dim(dx)[2]] dy = dy[,2:dim(dy)[2]] k=0 meanx=0 meany=0 sdx=0 sdy=0 nx=0 ny=0 for (k in 1:dim(dx)[2]) { meanx[k] = mean(unique(dx[,k])) meany[k] = mean(unique(dy[,k])) sdx[k] = sd(unique(dx[,k])) sdy[k] = sd(unique(dy[,k])) nx[k] = length(unique(dx[,k])) ny[k] = length(unique(dy[,k])) } t = (meanx-meany)/sqrt((sdx^2/nx) + (sdy^2/ny)) df = ((sdx^2/nx) + (sdy^2/ny))^2/((sdx^2/nx)^2/(nx-1) + (sdy^2/ny)^2/(ny-1)) p = 2*pt(-abs(t),df=df) alpha = 0.05 [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.