Hi there, First off, despite this being my first post here, I have scanned the R help forums a lot in the past few months to help with some questions, so a big thank you to the community as a whole for being so helpful!
I'm somewhat of an R newbie, and have run up against a problem that I can't seem to solve. If anyone is able to help I would really appreciate it! I'm looking at a number of events across a chromosome, and have written a program that collects them into different bins, based on a specified binsize. The events are directional, either forward or reverse, and a chromosome can either be fwd/fwd (all the events fall into the fwd bins), rev/rev (all the events fall into the rev bins) or fwd/rev (events are evenly split). In some cases, chromosomes switch from one state to another (eg fwd/fwd to fwd/rev). There are a number of rules that dictate my data. First, while there is stochastic variation, the sum of fwd and rev in each bin should have approximately the same value. If I were to take the total number of events and divide them by the number of bins to get an average count per bin, I would expect approximately that value in each bin; in the case of fwd/fwd it would be about average number in the fwd column and close to zero in the rev column, in rev/rev it would be about the average number in the rev column and close to zero in the fwd column, and in fwd/rev it would be about half the average number in both. Hopefully my png attachment worked and you can see an example. The top plot shows fwd reads, the 2nd shows rev reads and the third shows fwd minus rev reads. What I would like to be able to do is to automatically assign regions in which the chromosome switches from one state to another. From the graphs (and from the read.table output below) you can see that this particular chromosome is fwd/fwd from bin 1 to 59, fwd/rev from bin 61 to 73, and rev/rev for the remainder of the chromosomes. These are generated from a read.table that looks like this: bin fwd rev 50 484 2 51 366 4 52 527 6 53 635 2 54 573 6 55 506 4 56 600 6 57 560 2 58 504 2 59 545 0 60 501 68 61 419 223 62 252 109 63 259 138 64 355 189 65 218 125 66 140 57 67 45 31 68 276 144 69 263 152 70 330 193 71 439 204 72 347 207 73 10 611 74 6 619 75 2 578 76 7 372 77 6 436 78 4 373 79 8 417 80 2 276 My question is this: 1. Is there an obvious way to automatically identify these regions? I am not sure how I can go about scanning previous lines within a read.table to find a point at which the values change. In the above example, I would like the program to identify that the fwd graph shifts from ~1x the average to ~0.5x the average between bin 61 and 62, and from ~0.5x the average to ~0x the average between bin 72 and 73. Conversely I'd like to identify the rev graph shifting from ~0x average to ~0.5x average between bins 59 and 60, and from 0.5x average to 1x average from bin 72 to 73. Finally, I'd like to cross-reference the output from fwd and rev to only pull out reciprocal switches (ie those that occur within 3 bins of each other in both fwd and rev data sets). What I've been trying to gt to work is to generate values based on 0, 0.5 and 1x the average events, and trying to pull out the range of bins that fall into each of those categories (possibly 1 SD higher or lower to account for the stochastic variation), but I'm not really sure how to go about that. 2. If I can find a way to identify a shift between bins, is there any way to then look in smaller bin sizes across those regions. The bins shown above are for 200,000 bases of DNA. If my program automatically found an event between bin 72 and 73 (14,400,000 bases to 14,600,000), is it possible to feed that region into say 50,000 base bins to narrow down the location of the event? I'm not sure if either of these questions is a simple task or incredibly time consuming and inappropriate to ask a message board. If you can offer any advice though, I would really appreciate it! Thanks, Mark
<<attachment: Chr13.png>>
______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.