Sorry -- missed a tweak. The call to DataSplitsCore should read nr = floor(NROW(Data)/2); Data1 = Data[(1:nr)]; Data2 = Data[-(1:nr)]; DataSplitsCore(Data1,Data2,alpha,level)
On Fri, Aug 12, 2011 at 11:46 AM, R. Michael Weylandt < michael.weyla...@gmail.com> wrote: > Hmmm, interesting problem. I now see your motivation for not using a data > frame set up in the split code: try this -- > > DataSplits <- function(Data, alpha = 0.05) { > DataSplitsCore <- function(Data1,Data2, alpha, level) { > tt <- t.test(Data1,Data2) > > print(tt) > if (tt$p.value > alpha) { > print(paste("Stopped at level", level)) > return(invisible(TRUE)) > } else { > nr1 = floor(NROW(Data1)/2); nr2 = floor(NROW(Data2)/2) > > if (nr == 1) {print(paste("Reached Samples of Size 1")); stop} > d1 = DataSplitsCore(Data1[(1:nr1)], Data2[(1:nr2)], alpha = > alpha, level = level + 1) > > if (d1) return(invisible(TRUE)) > d2 = DataSplitsCore(Data1[-(1:nr1)], Data2[-(1:nr2)], alpha = > alpha, level = level +1) > > if (d2) return(invisible(TRUE)) > return(invisible(FALSE)) > } > } > DataSplitsCore(Data, alpha = alpha, level = 1) > } > > By the way, would you rather this returned the data rather than the depth > the search got to? > > I don't know much about the subject (gas pressures or outlier > identification), so I'll just spit-ball some other filtering techniques: > > 1) Since it's a time series that changes over time, do some sort of rolling > z-score filter: i.e., if you are more than 2 sigma out, toss the data until > it stabilizes. Can also do this non-parametrically with means and MADs. (Not > wild about this one, but like I said: spitballing) > > 2) Subsample at random and throw out outliers until you have a fairly > stable data set. > > 3) Use a trimmed mean or median instead of a mean for robustness in your > analysis. > > 4) Filter between some a priori bands: 2e5 to 5e5? > > 5) Check ROC between points and make sure that's small. > > Would these work for this sort of reading? None would be hard in R, and > most would probably be faster than the splitting technique. > > Looking at the data, I'm assuming that we more or less hope to isolate 5 > stable levels: the first one with the widest variance, 2 and 3 where it gets > higher and then comes back down, the long 4 and then 5 after that. I.e., > just to confirm those little "drops" should be discarded in their entirety. > > Michael Weylandt > > > > On Fri, Aug 12, 2011 at 11:19 AM, Marina de Wolff < > marinadewo...@hotmail.com> wrote: > >> I understand the confusion, I hope I can clearify this. The problem you >> are refering to will not apply for my data I think. >> >> My data consists of 40.000 points, of pressure at a gaswell (vs time). I >> included a picture. >> The problem with this data set is that only datapoints in a 'stable' >> situation are reliable. >> Therefore the dataset needs to be filtered before it is useable. I'm >> trying different ideas to fulfill that goal. >> >> I already used breakpoints and some sort of steady state detection with >> moving variance. >> An other idea would be to split the data in half, compare with each other >> and if both groups (first half of the data and second half of the data) >> significantly differ split again and compare both left groups with each >> other and both right groups with each other etc. >> As a final result I would have different groups with different lengths (I >> hope), and only use the groups with a minimum size of m. >> >> Many thanks in advance for your assitance in this. >> >> Sincerely, >> >> Marina de Wolff >> >> >> ------------------------------ >> From: michael.weyla...@gmail.com >> Date: Fri, 12 Aug 2011 09:35:00 -0400 >> >> Subject: Re: [R] Splitting data >> To: marinadewo...@hotmail.com >> CC: r-help@r-project.org >> >> Yes, that likely is the source of the difference: I'm happy to help fix it >> up (won't be hard), but I want to clarify exactly how you want the data >> done: >> >> say we have 20 variables x = 1:20 if there's a split we go to 1:10, 11:20; >> then 1:5, 6:10, 11:15,16:20 etc >> >> but what about situations with very different data sets: >> >> x = cbind(1:20, 1:7) >> one split takes us to where exactly: cbind( c(1:10, 11:20), c(1:3,1:4)) or >> cbind( c(1:10,11:20), c(1:4,5:7)) and then what of the next iteration? >> >> More generally, what exactly are you comparing? It seems odd to have two >> different categories/samples and to compare their means and then to switch >> gears entirely to compare subsamples of the categories independently. It >> seems that they are just different inferences: comparing the average of cats >> vs dogs and then comparing boy cats vs girl cats and boy dogs vs girl dogs. >> That winds up highlighting different independent variables. (Iteration one: >> species --> iteration two: gender) >> >> If you could speak a little more about your data, it'd be easier to do the >> splits in a meaningful way. >> >> As currently implemented, my code takes a 2d data frame and simply divides >> it into the top and bottom halves, which in most applications would >> corresponding to doing a mean-comparison calculation for different >> statistics of the same observation. The subsetting then keeps >> "corresponding" data together -- I put corresponding in parentheses because >> we aren't doing paired t-tests. >> >> Looking forward to your reply, >> >> Michael >> >> PS -- I did the splits basically the same way (other than the direction) >> but I just used floor() instead of round(). >> >> >> On Fri, Aug 12, 2011 at 3:45 AM, Marina de Wolff < >> marinadewo...@hotmail.com> wrote: >> >> Thank you for your reply, >> >> I used this code on my test data, but did not get the same p-values. >> >> I think I know were the difference lies; when the data is split in 4 parts >> I want to compare the two left groups (group 1 and 2) with each other and >> the two right groups (group 3 and 4) with each other. It seems that with >> this code group 1 and 3 are compared with each other and group 2 and 4, I >> did not yet succeeded in changing this. >> >> About the unequal data sizes, I thought I could 'correct' this by using >> round. For example, when my data consists of 17 data points I would use >> >> m <- length(data)/2 >> x <- data[1:round(m)] >> y <- data[(round(m)+1):length(data)] >> >> x has size 9 and y has size 8. >> >> Sincerely, >> >> Marina de Wolff >> >> ------------------------------ >> From: michael.weyla...@gmail.com >> Date: Thu, 11 Aug 2011 11:54:11 -0400 >> Subject: Re: [R] Splitting data >> To: marinadewo...@hotmail.com >> CC: r-help@r-project.org >> >> >> This sounds very much like a recursive problem: something like this seems >> to get the gist of what you want. >> >> DataSplits <- function(Data, alpha = 0.05) { >> DataSplitsCore <- function(Data, alpha, level) { >> tt <- t.test(Data[,1],Data[,2]) >> print(tt) >> if (tt$p.value > alpha) { >> print(paste("Stopped at level", level)) >> return(invisible(TRUE)) >> } else { >> nr = floor(NROW(Data)/2) >> if (nr == 1) {print(paste("Reached Samples of Size 1")); stop} >> d1 = DataSplitsCore(Data[(1:nr),], alpha = alpha, level = >> level + 1) >> if (d1) return(invisible(TRUE)) >> d2 = DataSplitsCore(Data[-(1:nr),], alpha = alpha, level = >> level +1) >> if (d2) return(invisible(TRUE)) >> return(invisible(FALSE)) >> } >> } >> DataSplitsCore(Data, alpha = alpha, level = 1) >> } >> >> Your description wasn't the clearest about what to do when the data sizes >> didn't match, but this should give you a start. Let me know if this doesn't >> do as desired and I can help tweak it. >> >> Hope this can be of help, >> >> Michael Weylandt >> >> PS -- You might as well use R's built in t.test function. >> >> On Thu, Aug 11, 2011 at 5:17 AM, Marina de Wolff < >> marinadewo...@hotmail.com> wrote: >> >> >> I want to implement the following algorithm in R: >> >> I want to split my data, use a t test to compare both means of the groups >> to see if they significantly differ from each other. If this is a yes (p < >> alpha) I want to split again (into 4 groups) and do the same procedure >> twice, and stop otherwise (here the problem arises). As a final result I >> would have different groups of data. >> >> I made some code where the data is splitted, until no splitting is >> possible. So for 16 datapoints, we can split 4 times with a final result of >> 16 groups (p is NA for the 4th split since sd cannot be calculated..). >> >> The code calculated all p values, but I don't want this. I want it to stop >> when p > alpha. I tried while, but didn't succeed. >> >> I hope someone can help me to acchieve my goal. >> >> This is what I tried so far with test data: >> >> a = rnorm(9,0,0.1) >> b = rnorm(7,1,0.1) >> data = c(a,b) >> plot(data) >> >> # Want to calculate max of groups/split for the data >> d = seq(1,100,1) >> n = 2^d >> m <- which(n <=length(data)) >> n = n[m[1]:m[length(m)]] >> >> # All groups >> i=0 >> j=0 >> dx = 0 >> dy = >> for (i in 1:length(n)){ >> split <- length(data)/(n[i]) >> for (j in 1:(n[i]/2)){ >> x = data[(1 + (j-1)*(2*split)):(round(split) + (j-1)*(2*split))] >> dx = cbind(dx,x) >> y = data[((round(split)+1) + (j-1)*(2*split)):(2*j*split)] >> dy = cbind(dy,y) >> }} >> >> dx = dx[,2:dim(dx)[2]] >> dy = dy[,2:dim(dy)[2]] >> >> k=0 >> meanx=0 >> meany=0 >> sdx=0 >> sdy=0 >> nx=0 >> ny=0 >> for (k in 1:dim(dx)[2]) { >> meanx[k] = mean(unique(dx[,k])) >> meany[k] = mean(unique(dy[,k])) >> sdx[k] = sd(unique(dx[,k])) >> sdy[k] = sd(unique(dy[,k])) >> nx[k] = length(unique(dx[,k])) >> ny[k] = length(unique(dy[,k])) >> } >> >> t = (meanx-meany)/sqrt((sdx^2/nx) + (sdy^2/ny)) >> df = ((sdx^2/nx) + (sdy^2/ny))^2/((sdx^2/nx)^2/(nx-1) + >> (sdy^2/ny)^2/(ny-1)) >> p = 2*pt(-abs(t),df=df) >> alpha = 0.05 >> [[alternative HTML version deleted]] >> >> ______________________________________________ >> R-help@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html<http://www.r-project.org/posting-guide.html> >> and provide commented, minimal, self-contained, reproducible code. >> >> >> >> > [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.