Re: [R] Splitting data

Marina de Wolff Mon, 15 Aug 2011 01:35:21 -0700

By the way, I indeed want to get the data returned.

From: marinadewo...@hotmail.com
To: michael.weyla...@gmail.com
CC: r-help@r-project.org
Subject: RE: [R] Splitting data
Date: Mon, 15 Aug 2011 10:32:04 +0200

Dear Michael,

I adjusted the code a bit; 

d1 = DataSplitsCore(Data1[(1:nr1)], Data1[-(1:nr1)], alpha = alpha, level = 
level + 1)
d2 = DataSplitsCore(Data2[(1:nr2)], Data2[-(1:nr2)], alpha = alpha, level = 
level +1)

The code stops when p>alpha, which is what I wanted. Only problem is that it 
does not calculates the p values for other groups. To make myself clear a short 
example:

Data = c(rnorm(9,0,0.1),rnorm(7,1,0.1)).
Data1 = Data[1:(floor(Data)/2)]
Data2 = Data[((floor(Data)/2)+1):length(Data)]
t.test(Data1,Data2) is significant --> split into 4 groups

Data11 = Data1[1:floor(Data1)/2]
Data12 = Data1[((floor(Data1)/2)+1):length(Data1)]

t.test(Data11,Data12) is not significant --> Stop for this 'branch'

Data21 = Data2[1:floor(Data2)/2]
Data22 = Data2[((floor(Data2)/2)+1):length(Data2)]

t.test(Data21,Data22) is not significant -> Stop for this 'branch'

At this moment, the code stops running after the first time is reaches p>alpha, 
and does not calculate p values for other 'branches'  which still needs to be 
examined.

I shall try some of the ideas you suggested, and will come to that back later.

Sincerely,

Marina de Wolff

From: michael.weyla...@gmail.com
Date: Fri, 12 Aug 2011 11:50:30 -0400
Subject: Re: [R] Splitting data
To: marinadewo...@hotmail.com
CC: r-help@r-project.org

Sorry -- missed a tweak. The call to DataSplitsCore should read

nr = floor(NROW(Data)/2); Data1 = Data[(1:nr)]; Data2 = Data[-(1:nr)];
DataSplitsCore(Data1,Data2,alpha,level)

On Fri, Aug 12, 2011 at 11:46 AM, R. Michael Weylandt 
<michael.weyla...@gmail.com> wrote:

Hmmm, interesting problem. I now see your motivation for not using a data frame 
set up in the split code: try this -- 

DataSplits <- function(Data, alpha = 0.05) {
    DataSplitsCore <- function(Data1,Data2, alpha, level) {
        tt <- t.test(Data1,Data2) 

        print(tt)
        if (tt$p.value > alpha) {
            print(paste("Stopped at level", level))
            return(invisible(TRUE))
        } else {
            nr1 = floor(NROW(Data1)/2); nr2 = floor(NROW(Data2)/2) 

            if (nr == 1) {print(paste("Reached Samples of Size 1")); stop}
            d1 = DataSplitsCore(Data1[(1:nr1)], Data2[(1:nr2)], alpha = alpha, 
level = level + 1) 

            if (d1) return(invisible(TRUE))
            d2 = DataSplitsCore(Data1[-(1:nr1)], Data2[-(1:nr2)], alpha = 
alpha, level = level +1) 

            if (d2) return(invisible(TRUE))
            return(invisible(FALSE))
        }
    }
    DataSplitsCore(Data, alpha = alpha, level = 1)
}

By the way, would you rather this returned the data rather than the depth the 
search got to?

I don't know much about the subject (gas pressures or outlier identification), 
so I'll just spit-ball some other filtering techniques:

1) Since it's a time series that changes over time, do some sort of rolling 
z-score filter: i.e., if you are more than 2 sigma out, toss the data until it 
stabilizes. Can also do this non-parametrically with means and MADs. (Not wild 
about this one, but like I said: spitballing)

2) Subsample at random and throw out outliers until you have a fairly stable 
data set. 

3) Use a trimmed mean or median instead of a mean for robustness in your 
analysis.

4) Filter between some a priori bands: 2e5 to 5e5? 

5) Check ROC between points and make sure that's small. 

Would these work for this sort of reading? None would be hard in R, and most 
would probably be faster than the splitting technique. 

Looking at the data, I'm assuming that we more or less hope to isolate 5 stable 
levels: the first one with the widest variance, 2 and 3 where it gets higher 
and then comes back down, the long 4 and then 5 after that. I.e., just to 
confirm those little "drops" should be discarded in their entirety. 

Michael Weylandt 

On Fri, Aug 12, 2011 at 11:19 AM, Marina de Wolff <marinadewo...@hotmail.com> 
wrote:

I understand the confusion, I hope I can clearify this. The problem you are 
refering to will not apply for my data I think.

My data consists of 40.000 points, of pressure at a gaswell (vs time). I 
included a picture. 
The problem with this data set is that only datapoints in a 'stable' situation 
are reliable.
Therefore the dataset needs to be filtered before it is useable. I'm trying 
different ideas to fulfill that goal.

I already used breakpoints and some sort of steady state detection with moving 
variance.
An other idea would be to split the data in half, compare with each other and 
if both groups (first half of the data and second half of the data)
significantly differ split again and compare both left groups with each other 
and both right groups with each other etc. 
As a final result I would have different groups with different lengths (I 
hope), and only use the groups with a minimum size of m.

Many thanks in advance for your assitance in this.

Sincerely,

Marina de Wolff

From: michael.weyla...@gmail.com
Date: Fri, 12 Aug 2011 09:35:00 -0400 

Subject: Re: [R] Splitting data
To: marinadewo...@hotmail.com
CC: r-help@r-project.org

Yes, that likely is the source of the difference: I'm happy to help fix it up 
(won't be hard), but I want to clarify exactly how you want the data done: 

say we have 20 variables x = 1:20 if there's a split we go to 1:10, 11:20; then 
1:5, 6:10, 11:15,16:20 etc

but what about situations with very different data sets: 

x = cbind(1:20, 1:7)
one split takes us to where exactly: cbind( c(1:10, 11:20), c(1:3,1:4)) or 
cbind( c(1:10,11:20), c(1:4,5:7)) and then what of the next iteration? 

More generally, what exactly are you comparing? It seems odd to have two 
different categories/samples and to compare their means and then to switch 
gears entirely to compare subsamples of the categories independently. It seems 
that they are just different inferences: comparing the average of cats vs dogs 
and then comparing boy cats vs girl cats and boy dogs vs girl dogs. That winds 
up highlighting different independent variables. (Iteration one: species --> 
iteration two: gender)

If you could speak a little more about your data, it'd be easier to do the 
splits in a meaningful way. 

As currently implemented, my code takes a 2d data frame and simply divides it 
into the top and bottom halves, which in most applications would corresponding 
to doing a mean-comparison calculation for different statistics of the same 
observation. The subsetting then keeps "corresponding" data together -- I put 
corresponding in parentheses because we aren't doing paired t-tests. 

Looking forward to your reply,

Michael

PS -- I did the splits basically the same way (other than the direction) but I 
just used floor() instead of round(). 

On Fri, Aug 12, 2011 at 3:45 AM, Marina de Wolff <marinadewo...@hotmail.com> 
wrote:

Thank you for your reply,

I used this code on my test data, but did not get the same p-values. 

I think I know were the difference lies; when the data is split in 4 parts I 
want to compare the two left groups (group 1 and 2) with each other and the two 
right groups (group 3 and 4) with each other. It seems that with this code 
group 1 and 3 are compared with each other and group 2 and 4, I did not yet 
succeeded in changing this. 

About the unequal data sizes, I thought I could 'correct' this by using round. 
For example, when my data consists of 17 data points I would use

m <- length(data)/2
x <- data[1:round(m)]
y <- data[(round(m)+1):length(data)]

x has size 9 and y has size 8. 

Sincerely,

Marina de Wolff

From: michael.weyla...@gmail.com
Date: Thu, 11 Aug 2011 11:54:11 -0400
Subject: Re: [R] Splitting data
To: marinadewo...@hotmail.com
CC: r-help@r-project.org

This sounds very much like a recursive problem: something like this seems to 
get the gist of what you want. 

DataSplits <- function(Data, alpha = 0.05) {
    DataSplitsCore <- function(Data, alpha, level) {
        tt <- t.test(Data[,1],Data[,2])
        print(tt)
        if (tt$p.value > alpha) {
            print(paste("Stopped at level", level))
            return(invisible(TRUE))
        } else {
            nr = floor(NROW(Data)/2)
            if (nr == 1) {print(paste("Reached Samples of Size 1")); stop}
            d1 = DataSplitsCore(Data[(1:nr),], alpha = alpha, level = level + 1)
            if (d1) return(invisible(TRUE))
            d2 = DataSplitsCore(Data[-(1:nr),], alpha = alpha, level = level +1)
            if (d2) return(invisible(TRUE))
            return(invisible(FALSE))
        }
    }
    DataSplitsCore(Data, alpha = alpha, level = 1)
}

Your description wasn't the clearest about what to do when the data sizes 
didn't match, but this should give you a start. Let me know if this doesn't do 
as desired and I can help tweak it. 

Hope this can be of help, 

Michael Weylandt 

PS -- You might as well use R's built in t.test function. 

On Thu, Aug 11, 2011 at 5:17 AM, Marina de Wolff <marinadewo...@hotmail.com> 
wrote:

I want to implement the following algorithm in R:

I want to split my data, use a t test to compare both means of the groups to 
see if they significantly differ from each other. If this is a yes (p < alpha) 
I want to split again (into 4 groups) and do the same procedure twice,  and 
stop otherwise (here the problem arises). As a final result I would have 
different groups of data.

I made some code where the data is splitted, until no splitting is possible. So 
for 16 datapoints, we can split 4 times with a final result of 16 groups (p is 
NA for the 4th split since sd cannot be calculated..).

The code calculated all p values, but I don't want this. I want it to stop when 
p > alpha. I tried while, but didn't succeed.

I hope someone can help me to acchieve my goal.

This is what I tried so far with test data:

a = rnorm(9,0,0.1)
b = rnorm(7,1,0.1)
data = c(a,b)
plot(data)

# Want to calculate max of groups/split for the data
d = seq(1,100,1)
n = 2^d
m <- which(n <=length(data))
n = n[m[1]:m[length(m)]]

# All groups
i=0
j=0
dx = 0
dy =
for (i in 1:length(n)){
split <- length(data)/(n[i])
for (j in 1:(n[i]/2)){
x = data[(1 + (j-1)*(2*split)):(round(split) + (j-1)*(2*split))]
dx = cbind(dx,x)
y = data[((round(split)+1) + (j-1)*(2*split)):(2*j*split)]
dy = cbind(dy,y)
}}

dx = dx[,2:dim(dx)[2]]
dy = dy[,2:dim(dy)[2]]

k=0
meanx=0
meany=0
sdx=0
sdy=0
nx=0
ny=0
for (k in 1:dim(dx)[2]) {
meanx[k] = mean(unique(dx[,k]))
meany[k] = mean(unique(dy[,k]))
sdx[k] = sd(unique(dx[,k]))
sdy[k] = sd(unique(dy[,k]))
nx[k] = length(unique(dx[,k]))
ny[k] = length(unique(dy[,k]))
}

t = (meanx-meany)/sqrt((sdx^2/nx) + (sdy^2/ny))
df = ((sdx^2/nx) + (sdy^2/ny))^2/((sdx^2/nx)^2/(nx-1) + (sdy^2/ny)^2/(ny-1))
p = 2*pt(-abs(t),df=df)
alpha = 0.05
       [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Splitting data

Reply via email to