Re: [R] Splitting data

R. Michael Weylandt Fri, 12 Aug 2011 06:38:00 -0700

Yes, that is likely the source of the difference. I'm happy to help fix it
up (it won't be hard), but I want to clarify exactly how you want the data
split:


Say we have 20 values, x = 1:20. After one split we go to 1:10, 11:20;
then to 1:5, 6:10, 11:15, 16:20, etc.
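
A minimal sketch of that halving scheme, for illustration only. The helper
name halve_ranges and its depth argument are made up here and are not part
of the code posted earlier in the thread; the first half is sized with
floor(), as in the PS below.

```r
# Hypothetical helper: list the index ranges produced by repeated halving.
halve_ranges <- function(idx, depth) {
  # Stop when no more splits are requested or the piece cannot be halved.
  if (depth == 0 || length(idx) < 2) return(list(idx))
  nr <- floor(length(idx) / 2)            # size of the first half
  c(halve_ranges(idx[1:nr], depth - 1),
    halve_ranges(idx[-(1:nr)], depth - 1))
}

x <- 1:20
one_split  <- halve_ranges(x, 1)   # 1:10 and 11:20
two_splits <- halve_ranges(x, 2)   # 1:5, 6:10, 11:15, 16:20
```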

but what about situations with very different data sets:

x = cbind(1:20, 1:7)
one split takes us to where exactly: cbind( c(1:10, 11:20), c(1:3,1:4)) or
cbind( c(1:10,11:20), c(1:4,5:7)) and then what of the next iteration?
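
One possible reading, sketched under assumptions: each sample is halved on
its own, with floor() sizing the first half. The list layout and the helper
name halve are hypothetical, chosen only to show the question concretely.
Note that cbind(1:20, 1:7) itself recycles the shorter vector (with a
warning), so the unequal samples are kept in a list here.

```r
# Unequal samples kept in a list rather than a matrix (assumption).
samples <- list(a = 1:20, b = 1:7)

# Hypothetical helper: split one vector into a first and a second half.
halve <- function(v) {
  nr <- floor(length(v) / 2)
  list(first = v[1:nr], second = v[-(1:nr)])
}

halves <- lapply(samples, halve)

# Sizes of each half per sample: rows are first/second, columns are a/b.
sizes <- sapply(halves, function(h) sapply(h, length))
```

With this choice the 7-point sample splits into 1:3 and 4:7; using
ceiling() instead would give 1:4 and 5:7.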

More generally, what exactly are you comparing? It seems odd to have two
different categories/samples and to compare their means and then to switch
gears entirely to compare subsamples of the categories independently. It
seems that they are just different inferences: comparing the average of cats
vs dogs and then comparing boy cats vs girl cats and boy dogs vs girl dogs.
That winds up highlighting different independent variables. (Iteration one:
species --> iteration two: gender)

If you could speak a little more about your data, it'd be easier to do the
splits in a meaningful way.

As currently implemented, my code takes a 2d data frame and simply divides
it into the top and bottom halves, which in most applications would
correspond to doing a mean-comparison calculation for different
statistics of the same observation. The subsetting then keeps
"corresponding" data together -- I put corresponding in quotes because
we aren't doing paired t-tests.
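
A rough illustration of that top/bottom split on made-up data (the column
names u and v and the simulated values are assumptions, not the poster's
data):

```r
set.seed(1)  # made-up example data
Data <- data.frame(u = rnorm(16, 0, 0.1), v = rnorm(16, 1, 0.1))

nr <- floor(NROW(Data) / 2)
top    <- Data[1:nr, , drop = FALSE]      # rows 1 to 8
bottom <- Data[-(1:nr), , drop = FALSE]   # rows 9 to 16

# Unpaired (Welch) t-test of column 1 against column 2 within each half
p_top    <- t.test(top[, 1], top[, 2])$p.value
p_bottom <- t.test(bottom[, 1], bottom[, 2])$p.value
```

Rows stay together when subsetting, so each half still compares the two
columns against each other rather than one half against the other.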

Looking forward to your reply,

Michael

PS -- I did the splits basically the same way (other than the direction) but
I just used floor() instead of round().
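
For what it's worth, a small sketch of how the two choices behave for 17
points. On current R versions round() sends an exact .5 to the even
neighbour, so round(8.5) and floor(8.5) agree here; ceiling() is what
would make the first half the larger one. The variable names are only for
illustration.

```r
n <- 17
half_floor   <- floor(n / 2)     # 8
half_round   <- round(n / 2)     # also 8, since 8.5 rounds to even
half_ceiling <- ceiling(n / 2)   # 9

data <- seq_len(n)               # stand-in for the real data
m <- half_ceiling
x <- data[1:m]                   # first 9 values
y <- data[(m + 1):n]             # remaining 8 values
```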


On Fri, Aug 12, 2011 at 3:45 AM, Marina de Wolff
<marinadewo...@hotmail.com> wrote:

>  Thank you for your reply,
>
> I used this code on my test data, but did not get the same p-values.
>
> I think I know where the difference lies: when the data is split into 4
> parts, I want to compare the two left groups (groups 1 and 2) with each
> other and the two right groups (groups 3 and 4) with each other. It seems
> that this code compares group 1 with group 3 and group 2 with group 4; I
> have not yet succeeded in changing this.
>
> About the unequal data sizes, I thought I could 'correct' this by using
> round. For example, when my data consists of 17 data points I would use
>
> m <- length(data)/2
> x <- data[1:round(m)]
> y <- data[(round(m)+1):length(data)]
>
> x has size 8 and y has size 9 (R rounds 8.5 to the even value 8).
>
>
> Sincerely,
>
> Marina de Wolff
>
>  ------------------------------
> From: michael.weyla...@gmail.com
> Date: Thu, 11 Aug 2011 11:54:11 -0400
> Subject: Re: [R] Splitting data
> To: marinadewo...@hotmail.com
> CC: r-help@r-project.org
>
>
> This sounds very much like a recursive problem: something like this seems
> to get the gist of what you want.
>
> DataSplits <- function(Data, alpha = 0.05) {
>     # Recursive worker; 'level' tracks the depth of the split.
>     DataSplitsCore <- function(Data, alpha, level) {
>         # Welch t-test comparing column 1 with column 2
>         tt <- t.test(Data[, 1], Data[, 2])
>         print(tt)
>         if (tt$p.value > alpha) {
>             print(paste("Stopped at level", level))
>             return(invisible(TRUE))
>         } else {
>             nr <- floor(NROW(Data) / 2)
>             # Halves of size 1 cannot be t-tested, so stop here.
>             if (nr <= 1) {
>                 print("Reached Samples of Size 1")
>                 return(invisible(TRUE))
>             }
>             # Top half first; a TRUE result ends the whole search.
>             d1 <- DataSplitsCore(Data[1:nr, , drop = FALSE], alpha = alpha,
>                                  level = level + 1)
>             if (d1) return(invisible(TRUE))
>             d2 <- DataSplitsCore(Data[-(1:nr), , drop = FALSE], alpha = alpha,
>                                  level = level + 1)
>             if (d2) return(invisible(TRUE))
>             return(invisible(FALSE))
>         }
>     }
>     DataSplitsCore(Data, alpha = alpha, level = 1)
> }
>
> Your description wasn't the clearest about what to do when the data sizes
> didn't match, but this should give you a start. Let me know if this doesn't
> do as desired and I can help tweak it.
>
> Hope this can be of help,
>
> Michael Weylandt
>
> PS -- You might as well use R's built-in t.test function.
>
> On Thu, Aug 11, 2011 at 5:17 AM, Marina de Wolff <
> marinadewo...@hotmail.com> wrote:
>
>
> I want to implement the following algorithm in R:
>
> I want to split my data, use a t test to compare both means of the groups
> to see if they significantly differ from each other. If this is a yes (p <
> alpha) I want to split again (into 4 groups) and do the same procedure
> twice,  and stop otherwise (here the problem arises). As a final result I
> would have different groups of data.
>
> I wrote some code that splits the data until no further splitting is
> possible. So for 16 data points, we can split 4 times with a final result
> of 16 groups (p is NA for the 4th split since sd cannot be calculated).
>
> The code calculated all p values, but I don't want this. I want it to stop
> when p > alpha. I tried while, but didn't succeed.
>
> I hope someone can help me to achieve my goal.
>
> This is what I tried so far with test data:
>
> a = rnorm(9,0,0.1)
> b = rnorm(7,1,0.1)
> data = c(a,b)
> plot(data)
>
> # Want to calculate max of groups/split for the data
> d = seq(1,100,1)
> n = 2^d
> m <- which(n <=length(data))
> n = n[m[1]:m[length(m)]]
>
> # All groups
> i=0
> j=0
> dx = 0
> dy = 0
> for (i in 1:length(n)){
> split <- length(data)/(n[i])
> for (j in 1:(n[i]/2)){
> x = data[(1 + (j-1)*(2*split)):(round(split) + (j-1)*(2*split))]
> dx = cbind(dx,x)
> y = data[((round(split)+1) + (j-1)*(2*split)):(2*j*split)]
> dy = cbind(dy,y)
> }}
>
> dx = dx[,2:dim(dx)[2]]
> dy = dy[,2:dim(dy)[2]]
>
> k=0
> meanx=0
> meany=0
> sdx=0
> sdy=0
> nx=0
> ny=0
> for (k in 1:dim(dx)[2]) {
> meanx[k] = mean(unique(dx[,k]))
> meany[k] = mean(unique(dy[,k]))
> sdx[k] = sd(unique(dx[,k]))
> sdy[k] = sd(unique(dy[,k]))
> nx[k] = length(unique(dx[,k]))
> ny[k] = length(unique(dy[,k]))
> }
>
> t = (meanx-meany)/sqrt((sdx^2/nx) + (sdy^2/ny))
> df = ((sdx^2/nx) + (sdy^2/ny))^2/((sdx^2/nx)^2/(nx-1) +
> (sdy^2/ny)^2/(ny-1))
> p = 2*pt(-abs(t),df=df)
> alpha = 0.05
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.