Re: [R] Splitting data

R. Michael Weylandt Fri, 12 Aug 2011 08:53:45 -0700

Sorry -- missed a tweak. The call to DataSplitsCore should read

nr = floor(NROW(Data)/2); Data1 = Data[(1:nr)]; Data2 = Data[-(1:nr)];
DataSplitsCore(Data1,Data2,alpha,level)


On Fri, Aug 12, 2011 at 11:46 AM, R. Michael Weylandt <
michael.weyla...@gmail.com> wrote:

> Hmmm, interesting problem. I now see your motivation for not using a data
> frame set up in the split code: try this --
>
> DataSplits <- function(Data, alpha = 0.05) {
>     DataSplitsCore <- function(Data1,Data2, alpha, level) {
>         tt <- t.test(Data1,Data2)
>
>         print(tt)
>         if (tt$p.value > alpha) {
>             print(paste("Stopped at level", level))
>             return(invisible(TRUE))
>         } else {
>             nr1 = floor(NROW(Data1)/2); nr2 = floor(NROW(Data2)/2)
>
>             if (nr == 1) {print(paste("Reached Samples of Size 1")); stop}
>             d1 = DataSplitsCore(Data1[(1:nr1)], Data2[(1:nr2)], alpha =
> alpha, level = level + 1)
>
>             if (d1) return(invisible(TRUE))
>             d2 = DataSplitsCore(Data1[-(1:nr1)], Data2[-(1:nr2)], alpha =
> alpha, level = level +1)
>
>             if (d2) return(invisible(TRUE))
>             return(invisible(FALSE))
>         }
>     }
>     DataSplitsCore(Data, alpha = alpha, level = 1)
> }
>
> By the way, would you rather this returned the data rather than the depth
> the search got to?
>
> I don't know much about the subject (gas pressures or outlier
> identification), so I'll just spit-ball some other filtering techniques:
>
> 1) Since it's a time series that changes over time, do some sort of rolling
> z-score filter: i.e., if you are more than 2 sigma out, toss the data until
> it stabilizes. Can also do this non-parametrically with means and MADs. (Not
> wild about this one, but like I said: spitballing)
>
> 2) Subsample at random and throw out outliers until you have a fairly
> stable data set.
>
> 3) Use a trimmed mean or median instead of a mean for robustness in your
> analysis.
>
> 4) Filter between some a priori bands: 2e5 to 5e5?
>
> 5) Check ROC between points and make sure that's small.
>
> Would these work for this sort of reading? None would be hard in R, and
> most would probably be faster than the splitting technique.
>
> Looking at the data, I'm assuming that we more or less hope to isolate 5
> stable levels: the first one with the widest variance, 2 and 3 where it gets
> higher and then comes back down, the long 4 and then 5 after that. I.e.,
> just to confirm those little "drops" should be discarded in their entirety.
>
> Michael Weylandt
>
>
>
> On Fri, Aug 12, 2011 at 11:19 AM, Marina de Wolff <
> marinadewo...@hotmail.com> wrote:
>
>>  I understand the confusion, I hope I can clearify this. The problem you
>> are refering to will not apply for my data I think.
>>
>> My data consists of 40.000 points, of pressure at a gaswell (vs time). I
>> included a picture.
>> The problem with this data set is that only datapoints in a 'stable'
>> situation are reliable.
>> Therefore the dataset needs to be filtered before it is useable. I'm
>> trying different ideas to fulfill that goal.
>>
>> I already used breakpoints and some sort of steady state detection with
>> moving variance.
>> An other idea would be to split the data in half, compare with each other
>> and if both groups (first half of the data and second half of the data)
>> significantly differ split again and compare both left groups with each
>> other and both right groups with each other etc.
>> As a final result I would have different groups with different lengths (I
>> hope), and only use the groups with a minimum size of m.
>>
>> Many thanks in advance for your assitance in this.
>>
>> Sincerely,
>>
>> Marina de Wolff
>>
>>
>>  ------------------------------
>> From: michael.weyla...@gmail.com
>> Date: Fri, 12 Aug 2011 09:35:00 -0400
>>
>> Subject: Re: [R] Splitting data
>> To: marinadewo...@hotmail.com
>> CC: r-help@r-project.org
>>
>> Yes, that likely is the source of the difference: I'm happy to help fix it
>> up (won't be hard), but I want to clarify exactly how you want the data
>> done:
>>
>> say we have 20 variables x = 1:20 if there's a split we go to 1:10, 11:20;
>> then 1:5, 6:10, 11:15,16:20 etc
>>
>> but what about situations with very different data sets:
>>
>> x = cbind(1:20, 1:7)
>> one split takes us to where exactly: cbind( c(1:10, 11:20), c(1:3,1:4)) or
>> cbind( c(1:10,11:20), c(1:4,5:7)) and then what of the next iteration?
>>
>> More generally, what exactly are you comparing? It seems odd to have two
>> different categories/samples and to compare their means and then to switch
>> gears entirely to compare subsamples of the categories independently. It
>> seems that they are just different inferences: comparing the average of cats
>> vs dogs and then comparing boy cats vs girl cats and boy dogs vs girl dogs.
>> That winds up highlighting different independent variables. (Iteration one:
>> species --> iteration two: gender)
>>
>> If you could speak a little more about your data, it'd be easier to do the
>> splits in a meaningful way.
>>
>> As currently implemented, my code takes a 2d data frame and simply divides
>> it into the top and bottom halves, which in most applications would
>> corresponding to doing a mean-comparison calculation for different
>> statistics of the same observation. The subsetting then keeps
>> "corresponding" data together -- I put corresponding in parentheses because
>> we aren't doing paired t-tests.
>>
>> Looking forward to your reply,
>>
>> Michael
>>
>> PS -- I did the splits basically the same way (other than the direction)
>> but I just used floor() instead of round().
>>
>>
>> On Fri, Aug 12, 2011 at 3:45 AM, Marina de Wolff <
>> marinadewo...@hotmail.com> wrote:
>>
>>  Thank you for your reply,
>>
>> I used this code on my test data, but did not get the same p-values.
>>
>> I think I know were the difference lies; when the data is split in 4 parts
>> I want to compare the two left groups (group 1 and 2) with each other and
>> the two right groups (group 3 and 4) with each other. It seems that with
>> this code group 1 and 3 are compared with each other and group 2 and 4, I
>> did not yet succeeded in changing this.
>>
>> About the unequal data sizes, I thought I could 'correct' this by using
>> round. For example, when my data consists of 17 data points I would use
>>
>> m <- length(data)/2
>> x <- data[1:round(m)]
>> y <- data[(round(m)+1):length(data)]
>>
>> x has size 9 and y has size 8.
>>
>> Sincerely,
>>
>> Marina de Wolff
>>
>>  ------------------------------
>> From: michael.weyla...@gmail.com
>> Date: Thu, 11 Aug 2011 11:54:11 -0400
>> Subject: Re: [R] Splitting data
>> To: marinadewo...@hotmail.com
>> CC: r-help@r-project.org
>>
>>
>> This sounds very much like a recursive problem: something like this seems
>> to get the gist of what you want.
>>
>> DataSplits <- function(Data, alpha = 0.05) {
>>     DataSplitsCore <- function(Data, alpha, level) {
>>         tt <- t.test(Data[,1],Data[,2])
>>         print(tt)
>>         if (tt$p.value > alpha) {
>>             print(paste("Stopped at level", level))
>>             return(invisible(TRUE))
>>         } else {
>>             nr = floor(NROW(Data)/2)
>>             if (nr == 1) {print(paste("Reached Samples of Size 1")); stop}
>>             d1 = DataSplitsCore(Data[(1:nr),], alpha = alpha, level =
>> level + 1)
>>             if (d1) return(invisible(TRUE))
>>             d2 = DataSplitsCore(Data[-(1:nr),], alpha = alpha, level =
>> level +1)
>>             if (d2) return(invisible(TRUE))
>>             return(invisible(FALSE))
>>         }
>>     }
>>     DataSplitsCore(Data, alpha = alpha, level = 1)
>> }
>>
>> Your description wasn't the clearest about what to do when the data sizes
>> didn't match, but this should give you a start. Let me know if this doesn't
>> do as desired and I can help tweak it.
>>
>> Hope this can be of help,
>>
>> Michael Weylandt
>>
>> PS -- You might as well use R's built in t.test function.
>>
>> On Thu, Aug 11, 2011 at 5:17 AM, Marina de Wolff <
>> marinadewo...@hotmail.com> wrote:
>>
>>
>> I want to implement the following algorithm in R:
>>
>> I want to split my data, use a t test to compare both means of the groups
>> to see if they significantly differ from each other. If this is a yes (p <
>> alpha) I want to split again (into 4 groups) and do the same procedure
>> twice,  and stop otherwise (here the problem arises). As a final result I
>> would have different groups of data.
>>
>> I made some code where the data is splitted, until no splitting is
>> possible. So for 16 datapoints, we can split 4 times with a final result of
>> 16 groups (p is NA for the 4th split since sd cannot be calculated..).
>>
>> The code calculated all p values, but I don't want this. I want it to stop
>> when p > alpha. I tried while, but didn't succeed.
>>
>> I hope someone can help me to acchieve my goal.
>>
>> This is what I tried so far with test data:
>>
>> a = rnorm(9,0,0.1)
>> b = rnorm(7,1,0.1)
>> data = c(a,b)
>> plot(data)
>>
>> # Want to calculate max of groups/split for the data
>> d = seq(1,100,1)
>> n = 2^d
>> m <- which(n <=length(data))
>> n = n[m[1]:m[length(m)]]
>>
>> # All groups
>> i=0
>> j=0
>> dx = 0
>> dy =
>> for (i in 1:length(n)){
>> split <- length(data)/(n[i])
>> for (j in 1:(n[i]/2)){
>> x = data[(1 + (j-1)*(2*split)):(round(split) + (j-1)*(2*split))]
>> dx = cbind(dx,x)
>> y = data[((round(split)+1) + (j-1)*(2*split)):(2*j*split)]
>> dy = cbind(dy,y)
>> }}
>>
>> dx = dx[,2:dim(dx)[2]]
>> dy = dy[,2:dim(dy)[2]]
>>
>> k=0
>> meanx=0
>> meany=0
>> sdx=0
>> sdy=0
>> nx=0
>> ny=0
>> for (k in 1:dim(dx)[2]) {
>> meanx[k] = mean(unique(dx[,k]))
>> meany[k] = mean(unique(dy[,k]))
>> sdx[k] = sd(unique(dx[,k]))
>> sdy[k] = sd(unique(dy[,k]))
>> nx[k] = length(unique(dx[,k]))
>> ny[k] = length(unique(dy[,k]))
>> }
>>
>> t = (meanx-meany)/sqrt((sdx^2/nx) + (sdy^2/ny))
>> df = ((sdx^2/nx) + (sdy^2/ny))^2/((sdx^2/nx)^2/(nx-1) +
>> (sdy^2/ny)^2/(ny-1))
>> p = 2*pt(-abs(t),df=df)
>> alpha = 0.05
>>        [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html<http://www.r-project.org/posting-guide.html>
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>>
>>
>

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Splitting data

Reply via email to