Re: [R] Averaging within a range of values

Jeff Newmiller Fri, 13 Jan 2012 22:23:37 -0800

I don't think my advice to use cut is in fact working, because the ranges
are overlapping. Following is a reproducible example... as the posting
guide indicates, you should provide self-contained examples like this in
future questions posted to the list.
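
As a side note on why cut does not fit overlapping ranges, here is a minimal
sketch. The positions and ranges are made up for illustration and are not the
data from this thread. cut sorts its breaks, so two overlapping ranges become
three disjoint intervals, and each position can land in only one of them.

```r
# Made-up positions and ranges, for illustration only.
pos <- c( 200, 500, 800, 1000 )

# Two overlapping ranges, 200-700 and 500-1000, passed as one set of breaks.
# cut() sorts the breaks, so the result has three disjoint intervals.
bins <- cut( pos, breaks=c( 200, 700, 500, 1000 ), include.lowest=TRUE )
table( bins )

# Plain logical tests can represent the overlap: position 500 belongs
# to both ranges, which a single factor from cut() cannot express.
in_first  <- pos >= 200 & pos <= 700
in_second <- pos >= 500 & pos <= 1000
pos[ in_first & in_second ]
```

This is why the solutions below test each range separately instead of
binning all positions at once.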


# begin example
tc <- textConnection(
"Group       Start         End
G1             200         700
G2             500        1000
G3            2000        3000
G4            4000        6000
G5            7000        8000
" )
d1 <- read.table( tc, header=TRUE )
close(tc)


tc <- textConnection(
"Pos    C0    C1
 200   0.9   0.6
 500   0.8   0.8
 800   0.9   0.7
1000   0.7   0.6
2000   0.6   0.4
2500   1.2   0.8
3000   0.6   1.5
3500   0.7   0.7
4000   0.8   0.8
4500   0.6   0.6
5000   0.9   0.9
5500   0.7   0.8
6000   0.8   0.7
6500   0.4   0.4
7000   0.5   0.8
7500   0.7   0.9
8000   0.9   0.5
8500   0.8   0.6
9000   0.9   0.8
" )
d2 <- read.table( tc, header=TRUE )
close(tc)

library(plyr)

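# get speed by using more memory
# cross join (Cartesian product): pair every group with every position;
# by=NULL requests this explicitly (merge has no by.all argument)
d3 <- merge( d1, d2, by=NULL )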
# remove combinations that do not fit
d3 <- d3[ ( d3$Start <= d3$Pos ) & ( d3$Pos <= d3$End ), ]
d4a <- ddply( d3
            , "Group"
            , function( df ) {
                c( C0mean=mean(df$C0), C1mean=mean(df$C1) )
              }
            )

# if you work with a large dataset, you may not be able to afford a
# cross join, so use a slower calculation that conserves memory
d4b <- ldply( seq_along( d1$Group )
             , function( idx, gpdf, dta ) {
                 group <- gpdf$Group[ idx ]
                 start <- gpdf$Start[ idx ]
                 end <- gpdf$End[ idx ]
                 subdta <- dta[ ( start <= dta$Pos ) & ( dta$Pos <= end ), ]
                 data.frame( Group=group
                           , C0mean=mean( subdta$C0 )
                           , C1mean=mean( subdta$C1 ) )
               }
             , gpdf = d1
             , dta = d2
             )

# end of suggested solutions

# there are other ways as well, such as using the aggregate function or
# the sqldf package
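
One more of those other ways, sketched with base R only (no plyr). This is
not one of the solutions posted above, just an illustration of the same
range test applied with mapply. The data frames are rebuilt inline so the
block runs on its own, and the result name d4c is made up for this sketch.

```r
# Same toy data as in the example above, rebuilt so this block runs alone.
d1 <- data.frame( Group = c( "G1", "G2", "G3", "G4", "G5" )
                , Start = c( 200, 500, 2000, 4000, 7000 )
                , End   = c( 700, 1000, 3000, 6000, 8000 ) )
d2 <- data.frame( Pos = c( 200, 500, 800, 1000, 2000, 2500, 3000, 3500
                         , 4000, 4500, 5000, 5500, 6000, 6500, 7000
                         , 7500, 8000, 8500, 9000 )
                , C0  = c( 0.9, 0.8, 0.9, 0.7, 0.6, 1.2, 0.6, 0.7, 0.8, 0.6
                         , 0.9, 0.7, 0.8, 0.4, 0.5, 0.7, 0.9, 0.8, 0.9 )
                , C1  = c( 0.6, 0.8, 0.7, 0.6, 0.4, 0.8, 1.5, 0.7, 0.8, 0.6
                         , 0.9, 0.8, 0.7, 0.4, 0.8, 0.9, 0.5, 0.6, 0.8 ) )

# For each group, average the rows of d2 whose Pos lies in [Start, End].
# mapply returns one column per group, so transpose to one row per group.
means <- t( mapply( function( start, end ) {
                      keep <- ( start <= d2$Pos ) & ( d2$Pos <= end )
                      colMeans( d2[ keep, c( "C0", "C1" ) ] )
                    }
                  , d1$Start
                  , d1$End ) )

d4c <- data.frame( Group  = d1$Group
                 , C0mean = means[ , "C0" ]
                 , C1mean = means[ , "C1" ] )
d4c
```

Like the ldply version, this works through one group at a time and never
builds the full cross join, so it favors memory over speed.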


On Fri, 13 Jan 2012, doggysaywhat wrote:

My apologies for the context problem.  I'll explain.

df1 is a matrix of genes labeled g1 through g5 with start positions in the
START column and end positions in the END column.

df2 is a matrix of chromatin modification values at positions along the DNA.

I want to average chromatin modification values for each gene from the start
to the end position.  So this would involve pulling out all values for
column C0 that are between pos 200 and 700 for the first gene and averaging
them.  Then, I would pull all values from 500 to 1000, and continue for each
gene.

The example I gave previously was a short one, but I will be doing this for
around 1000 genes with different positions.  This is why just removing one
group.

This was something I tried to come up with that allowed me to use start and
end positions.  Your advice to use the cut is working.

start<-df1[,2]
end<-df1[,3]

while(i<length(start)){
         i<-i+1
          print(cut(df2[,1],c(start[i],end[i])))
}

These were the results

[1] <NA>      (200,700] <NA>      <NA>      <NA>      <NA>      <NA>
[8] <NA>      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>
[15] <NA>      <NA>      <NA>      <NA>      <NA>
Levels: (200,700]
[1] <NA>        <NA>        (500,1e+03] (500,1e+03] <NA>        <NA>
[7] <NA>        <NA>        <NA>        <NA>        <NA>        <NA>
[13] <NA>        <NA>        <NA>        <NA>        <NA>        <NA>
[19] <NA>
Levels: (500,1e+03]
[1] <NA>          <NA>          <NA>          <NA>          <NA>
[6] (2e+03,3e+03] (2e+03,3e+03] <NA>          <NA>          <NA>
[11] <NA>          <NA>          <NA>          <NA>          <NA>
[16] <NA>          <NA>          <NA>          <NA>
Levels: (2e+03,3e+03]
[1] <NA>          <NA>          <NA>          <NA>          <NA>
[6] <NA>          <NA>          <NA>          <NA>          (4e+03,6e+03]
[11] (4e+03,6e+03] (4e+03,6e+03] (4e+03,6e+03] <NA>          <NA>
[16] <NA>          <NA>          <NA>          <NA>
Levels: (4e+03,6e+03]
[1] <NA>          <NA>          <NA>          <NA>          <NA>
[6] <NA>          <NA>          <NA>          <NA>          <NA>
[11] <NA>          <NA>          <NA>          <NA>          <NA>
[16] (7e+03,8e+03] (7e+03,8e+03] <NA>          <NA>
Levels: (7e+03,8e+03]


This is producing the right bins for each of the results, but I'm not sure
how to put this into a data frame.  When I did this.


start<-df1[,2]
end<-df1[,3]

while(i<length(start)){
         i<-i+1
          bins<-(cut(df2[,1],c(start[i],end[i])))
}

the bins variable was the last level.
Is there a way to assign the results of the while statement to a
dataframe?

Many thanks

--
View this message in context: 
http://r.789695.n4.nabble.com/Averaging-within-a-range-of-values-tp4291958p4294061.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
