Re: [R] Splitting data

R. Michael Weylandt Fri, 12 Aug 2011 08:47:33 -0700

Hmmm, interesting problem. I now see your motivation for not using a data
frame set up in the split code: try this --


DataSplits <- function(Data, alpha = 0.05) {
    DataSplitsCore <- function(Data1,Data2, alpha, level) {
        tt <- t.test(Data1,Data2)
        print(tt)
        if (tt$p.value > alpha) {
            print(paste("Stopped at level", level))
            return(invisible(TRUE))
        } else {
            nr1 = floor(NROW(Data1)/2); nr2 = floor(NROW(Data2)/2)
            if (nr == 1) {print(paste("Reached Samples of Size 1")); stop}
            d1 = DataSplitsCore(Data1[(1:nr1)], Data2[(1:nr2)], alpha =
alpha, level = level + 1)
            if (d1) return(invisible(TRUE))
            d2 = DataSplitsCore(Data1[-(1:nr1)], Data2[-(1:nr2)], alpha =
alpha, level = level +1)
            if (d2) return(invisible(TRUE))
            return(invisible(FALSE))
        }
    }
    DataSplitsCore(Data, alpha = alpha, level = 1)
}

By the way, would you rather this returned the data rather than the depth
the search got to?

I don't know much about the subject (gas pressures or outlier
identification), so I'll just spit-ball some other filtering techniques:

1) Since it's a time series that changes over time, do some sort of rolling
z-score filter: i.e., if you are more than 2 sigma out, toss the data until
it stabilizes. Can also do this non-parametrically with means and MADs. (Not
wild about this one, but like I said: spitballing)

2) Subsample at random and throw out outliers until you have a fairly stable
data set.

3) Use a trimmed mean or median instead of a mean for robustness in your
analysis.

4) Filter between some a priori bands: 2e5 to 5e5?

5) Check ROC between points and make sure that's small.

Would these work for this sort of reading? None would be hard in R, and most
would probably be faster than the splitting technique.

Looking at the data, I'm assuming that we more or less hope to isolate 5
stable levels: the first one with the widest variance, 2 and 3 where it gets
higher and then comes back down, the long 4 and then 5 after that. I.e.,
just to confirm those little "drops" should be discarded in their entirety.

Michael Weylandt


On Fri, Aug 12, 2011 at 11:19 AM, Marina de Wolff <marinadewo...@hotmail.com
> wrote:

>  I understand the confusion, I hope I can clearify this. The problem you
> are refering to will not apply for my data I think.
>
> My data consists of 40.000 points, of pressure at a gaswell (vs time). I
> included a picture.
> The problem with this data set is that only datapoints in a 'stable'
> situation are reliable.
> Therefore the dataset needs to be filtered before it is useable. I'm trying
> different ideas to fulfill that goal.
>
> I already used breakpoints and some sort of steady state detection with
> moving variance.
> An other idea would be to split the data in half, compare with each other
> and if both groups (first half of the data and second half of the data)
> significantly differ split again and compare both left groups with each
> other and both right groups with each other etc.
> As a final result I would have different groups with different lengths (I
> hope), and only use the groups with a minimum size of m.
>
> Many thanks in advance for your assitance in this.
>
> Sincerely,
>
> Marina de Wolff
>
>
>  ------------------------------
> From: michael.weyla...@gmail.com
> Date: Fri, 12 Aug 2011 09:35:00 -0400
>
> Subject: Re: [R] Splitting data
> To: marinadewo...@hotmail.com
> CC: r-help@r-project.org
>
> Yes, that likely is the source of the difference: I'm happy to help fix it
> up (won't be hard), but I want to clarify exactly how you want the data
> done:
>
> say we have 20 variables x = 1:20 if there's a split we go to 1:10, 11:20;
> then 1:5, 6:10, 11:15,16:20 etc
>
> but what about situations with very different data sets:
>
> x = cbind(1:20, 1:7)
> one split takes us to where exactly: cbind( c(1:10, 11:20), c(1:3,1:4)) or
> cbind( c(1:10,11:20), c(1:4,5:7)) and then what of the next iteration?
>
> More generally, what exactly are you comparing? It seems odd to have two
> different categories/samples and to compare their means and then to switch
> gears entirely to compare subsamples of the categories independently. It
> seems that they are just different inferences: comparing the average of cats
> vs dogs and then comparing boy cats vs girl cats and boy dogs vs girl dogs.
> That winds up highlighting different independent variables. (Iteration one:
> species --> iteration two: gender)
>
> If you could speak a little more about your data, it'd be easier to do the
> splits in a meaningful way.
>
> As currently implemented, my code takes a 2d data frame and simply divides
> it into the top and bottom halves, which in most applications would
> corresponding to doing a mean-comparison calculation for different
> statistics of the same observation. The subsetting then keeps
> "corresponding" data together -- I put corresponding in parentheses because
> we aren't doing paired t-tests.
>
> Looking forward to your reply,
>
> Michael
>
> PS -- I did the splits basically the same way (other than the direction)
> but I just used floor() instead of round().
>
>
> On Fri, Aug 12, 2011 at 3:45 AM, Marina de Wolff <
> marinadewo...@hotmail.com> wrote:
>
>  Thank you for your reply,
>
> I used this code on my test data, but did not get the same p-values.
>
> I think I know were the difference lies; when the data is split in 4 parts
> I want to compare the two left groups (group 1 and 2) with each other and
> the two right groups (group 3 and 4) with each other. It seems that with
> this code group 1 and 3 are compared with each other and group 2 and 4, I
> did not yet succeeded in changing this.
>
> About the unequal data sizes, I thought I could 'correct' this by using
> round. For example, when my data consists of 17 data points I would use
>
> m <- length(data)/2
> x <- data[1:round(m)]
> y <- data[(round(m)+1):length(data)]
>
> x has size 9 and y has size 8.
>
> Sincerely,
>
> Marina de Wolff
>
>  ------------------------------
> From: michael.weyla...@gmail.com
> Date: Thu, 11 Aug 2011 11:54:11 -0400
> Subject: Re: [R] Splitting data
> To: marinadewo...@hotmail.com
> CC: r-help@r-project.org
>
>
> This sounds very much like a recursive problem: something like this seems
> to get the gist of what you want.
>
> DataSplits <- function(Data, alpha = 0.05) {
>     DataSplitsCore <- function(Data, alpha, level) {
>         tt <- t.test(Data[,1],Data[,2])
>         print(tt)
>         if (tt$p.value > alpha) {
>             print(paste("Stopped at level", level))
>             return(invisible(TRUE))
>         } else {
>             nr = floor(NROW(Data)/2)
>             if (nr == 1) {print(paste("Reached Samples of Size 1")); stop}
>             d1 = DataSplitsCore(Data[(1:nr),], alpha = alpha, level = level
> + 1)
>             if (d1) return(invisible(TRUE))
>             d2 = DataSplitsCore(Data[-(1:nr),], alpha = alpha, level =
> level +1)
>             if (d2) return(invisible(TRUE))
>             return(invisible(FALSE))
>         }
>     }
>     DataSplitsCore(Data, alpha = alpha, level = 1)
> }
>
> Your description wasn't the clearest about what to do when the data sizes
> didn't match, but this should give you a start. Let me know if this doesn't
> do as desired and I can help tweak it.
>
> Hope this can be of help,
>
> Michael Weylandt
>
> PS -- You might as well use R's built in t.test function.
>
> On Thu, Aug 11, 2011 at 5:17 AM, Marina de Wolff <
> marinadewo...@hotmail.com> wrote:
>
>
> I want to implement the following algorithm in R:
>
> I want to split my data, use a t test to compare both means of the groups
> to see if they significantly differ from each other. If this is a yes (p <
> alpha) I want to split again (into 4 groups) and do the same procedure
> twice,  and stop otherwise (here the problem arises). As a final result I
> would have different groups of data.
>
> I made some code where the data is splitted, until no splitting is
> possible. So for 16 datapoints, we can split 4 times with a final result of
> 16 groups (p is NA for the 4th split since sd cannot be calculated..).
>
> The code calculated all p values, but I don't want this. I want it to stop
> when p > alpha. I tried while, but didn't succeed.
>
> I hope someone can help me to acchieve my goal.
>
> This is what I tried so far with test data:
>
> a = rnorm(9,0,0.1)
> b = rnorm(7,1,0.1)
> data = c(a,b)
> plot(data)
>
> # Want to calculate max of groups/split for the data
> d = seq(1,100,1)
> n = 2^d
> m <- which(n <=length(data))
> n = n[m[1]:m[length(m)]]
>
> # All groups
> i=0
> j=0
> dx = 0
> dy =
> for (i in 1:length(n)){
> split <- length(data)/(n[i])
> for (j in 1:(n[i]/2)){
> x = data[(1 + (j-1)*(2*split)):(round(split) + (j-1)*(2*split))]
> dx = cbind(dx,x)
> y = data[((round(split)+1) + (j-1)*(2*split)):(2*j*split)]
> dy = cbind(dy,y)
> }}
>
> dx = dx[,2:dim(dx)[2]]
> dy = dy[,2:dim(dy)[2]]
>
> k=0
> meanx=0
> meany=0
> sdx=0
> sdy=0
> nx=0
> ny=0
> for (k in 1:dim(dx)[2]) {
> meanx[k] = mean(unique(dx[,k]))
> meany[k] = mean(unique(dy[,k]))
> sdx[k] = sd(unique(dx[,k]))
> sdy[k] = sd(unique(dy[,k]))
> nx[k] = length(unique(dx[,k]))
> ny[k] = length(unique(dy[,k]))
> }
>
> t = (meanx-meany)/sqrt((sdx^2/nx) + (sdy^2/ny))
> df = ((sdx^2/nx) + (sdy^2/ny))^2/((sdx^2/nx)^2/(nx-1) +
> (sdy^2/ny)^2/(ny-1))
> p = 2*pt(-abs(t),df=df)
> alpha = 0.05
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html<http://www.r-project.org/posting-guide.html>
> and provide commented, minimal, self-contained, reproducible code.
>
>
>
>

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Splitting data

Reply via email to