Thanks for the intrusion! I have options(stringsAsFactors=FALSE) and Paul probably doesn't, so he saw factors where I saw characters.
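For anyone following along, here is a minimal sketch of why the two of us saw different results. The object names (chrom_chr, chrom_fac, bin_fixed and so on) are made up for this illustration, and the data are the six example rows from this thread. It assumes that c(NA, <factor>) returns the factor's integer codes rather than its labels, which is what I understand base R to do because c() dispatches on its first argument.

```r
# Hypothetical mini example; object names are made up for illustration.
chrom_chr  <- c("chr1", "chr1", "chr2", "chr2", "chr2", "chr2")
chrom_fac  <- factor(chrom_chr)
chromStart <- c(10089, 10132, 10133, 10148, 210382, 216132)
L <- length(chrom_chr)
delta.start <- c(NA, chromStart[-1] - chromStart[-L])

# Character column: c() keeps the strings, so != compares like with like.
prev_chr <- c(NA, chrom_chr[-L])
bin_chr  <- cumsum(is.na(prev_chr) | chrom_chr != prev_chr | delta.start >= 115341)

# Factor column: c(NA, factor) drops the factor class and leaves integer codes,
# so the label "chr1" is compared with the code 1 and every row looks like a change.
prev_fac <- c(NA, chrom_fac[-L])
bin_fac  <- cumsum(is.na(prev_fac) | chrom_fac != prev_fac | delta.start >= 115341)

# Converting to character first sidesteps the problem for either column type.
chrom_safe <- as.character(chrom_fac)
prev_safe  <- c(NA, chrom_safe[-L])
bin_fixed  <- cumsum(is.na(prev_safe) | chrom_safe != prev_safe | delta.start >= 115341)

print(rbind(bin_chr, bin_fac, bin_fixed))
```

On these six rows the character and as.character() versions should give bins 1,1,2,2,3,3, while the factor version should give 1,2,3,4,5,6. Converting with as.character() is only one possible workaround; indexing levels() by the codes is another.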
Paul, I saw your other note ... try this code

L <- nrow(df)
# assign a new bin every time chrom changes
# (levels(df$chrom)[prev.chrom] assumes df$chrom is a factor;
#  for a character column, compare df$chrom with prev.chrom directly)
prev.chrom <- c(NA, df$chrom[-L])
bin1 <- cumsum(is.na(prev.chrom) | df$chrom != levels(df$chrom)[prev.chrom])
# subtract the minimum chromStart from each bin
min.start <- tapply(df$chromStart, bin1, min, na.rm=TRUE)[bin1]
# split bins further if chromStart >= 115341 + min.start
bin2 <- floor((df$chromStart - min.start) / 115341)
# combine the two bins into one
df$bin <- interaction(bin1, bin2)
df

Jean

Rui Barradas <ruipbarra...@sapo.pt> wrote on 07/02/2012 02:24:43 PM:

> Hello,
>
> Sorry to intrude, but I think it's a factor issue.
> Try changing the disjunction to (in multiline edit):
>
> new.bin <- is.na(prev.chrom) |
>    df$chrom != levels(df$chrom)[prev.chrom] |
>    delta.start >= 115341
>
> It should work now.
>
> Hope this helps,
>
> Rui Barradas
>
> On 02-07-2012 20:03, pguilha wrote:
> > Jean,
> > It's crazy, I'm still getting 1,2,3,4,5,6 in the bin column.....
> > Also (this is an unrelated problem I think), unless I've misunderstood
> > it, I think your code will only create a new bin if the difference
> > between chromStart at i and i-1 position is >=115341....What I want is
> > for a new bin to be created each time the difference between
> > chromStart at i and i-j is >=115341, where 'i-j' corresponds to the
> > first row of the last bin....I'm not sure if I'm being
> > clear...chromStart values correspond to coordinates along a chromosome
> > so I want to basically cut up each chromosome into sections/bins of
> > approximately 115341...
> >
> > thanks again for all your efforts with this, they're much appreciated!
> > Paul
> >
> > On 2 July 2012 19:36, Jean V Adams [via R]
> > <ml-node+s789695n4635185...@n4.nabble.com> wrote:
> >> Paul,
> >>
> >> Try this (I changed some of the object names, but the meat of the code is
> >> the same):
> >>
> >> df <- data.frame(
> >>    chrom = c("chr1", "chr1", "chr2", "chr2", "chr2", "chr2"),
> >>    chromStart = c(10089, 10132, 10133, 10148, 210382, 216132),
> >>    chromEnd = c(10309, 10536, 10362, 10418, 210578, 216352),
> >>    name = c("ZBTB33", "TAF7_(SQ-8)", "Pol2-4H8", "MafF_(M8194)",
> >>       "ZBTB33", "CTCF"),
> >>    cumsum = c(10089, 20221, 30354, 40502, 50884, 67016)
> >> )
> >>
> >> # assign a new bin every time chrom changes and every time chromStart
> >> # changes by 115341 or more
> >> L <- nrow(df)
> >> prev.chrom <- c(NA, df$chrom[-L])
> >> delta.start <- c(NA, df$chromStart[-1] - df$chromStart[-L])
> >> new.bin <- is.na(prev.chrom) | df$chrom != prev.chrom |
> >>    delta.start >= 115341
> >> df$bin <- cumsum(new.bin)
> >> df
> >>
> >> pguilha <[hidden email]> wrote on 07/02/2012 10:23:36 AM:
> >>
> >>> Jean, that's exactly what it should be, but yes I copied and pasted
> >>> from your email so I don't see how I could have introduced an error in
> >>> there....
> >>> paul
> >>>
> >>> On 2 July 2012 15:57, Jean V Adams [via R]
> >>> <[hidden email]> wrote:
> >>>> Paul,
> >>>>
> >>>> Are you submitting the exact code that I included in my previous e-mail?
> >>>> When I submit that code, I get this ...
> >>>>
> >>>>   chrom chromStart chromEnd         name cumsum bin
> >>>> 1  chr1      10089    10309       ZBTB33  10089   1
> >>>> 2  chr1      10132    10536  TAF7_(SQ-8)  20221   1
> >>>> 3  chr2      10133    10362     Pol2-4H8  30354   2
> >>>> 4  chr2      10148    10418 MafF_(M8194)  40502   2
> >>>> 5  chr2     210382   210578       ZBTB33  50884   3
> >>>> 6  chr2     216132   216352         CTCF  67016   3
> >>>>
> >>>> Jean
> >>>>
> >>>> Paul Guilhamon <[hidden email]> wrote on 07/02/2012 08:59:00 AM:
> >>>>
> >>>>> Thanks for your reply Jean,
> >>>>>
> >>>>> I think your interpretation is correct but when I run your code I end
> >>>>> up with the below dataframe and obviously the bins created there don't
> >>>>> correspond to a chromStart change of 115341:
> >>>>>
> >>>>>   chrom chromStart chromEnd         name cumsum bin
> >>>>> 1  chr1      10089    10309       ZBTB33  10089   1
> >>>>> 2  chr1      10132    10536  TAF7_(SQ-8)  20221   2
> >>>>> 3  chr2      10133    10362     Pol2-4H8  30354   3
> >>>>> 4  chr2      10148    10418 MafF_(M8194)  40502   4
> >>>>> 5  chr2     210382   210578       ZBTB33  50884   5
> >>>>> 6  chr2     216132   216352         CTCF  67016   6
> >>>>>
> >>>>> the first two rows should have the same bin number (same chrom,
> >>>>> <115341 diff), then rows 3&4 should be in another bin (different chrom
> >>>>> from rows 1&2, <115341 diff), and rows 5&6 in another one (same chrom
> >>>>> but >115341 difference between row 4 and row 5).
> >>>>>
> >>>>> it seems the new.bin line of your code isn't quite doing what it
> >>>>> should but I can't pinpoint the error there...
> >>>>> Paul
> >>>>>
> >>>>> On 2 July 2012 14:19, Jean V Adams <[hidden email]> wrote:
> >>>>>> Paul,
> >>>>>>
> >>>>>> My interpretation is that you are trying to assign a new bin number to
> >>>>>> a row every time the variable chrom changes and every time the variable
> >>>>>> chromStart changes by 115341 or more. Is that right? If so, you don't
> >>>>>> need a loop at all. Check out the code below.
> >>>>>> I made a couple changes to the all.tf7
> >>>>>> example data frame so that it would have two changes in bin number, one
> >>>>>> based on the chrom variable and one based on the chromStart variable.
> >>>>>>
> >>>>>> Jean
> >>>>>>
> >>>>>> all.tf7 <- data.frame(
> >>>>>>    chrom = c("chr1", "chr1", "chr2", "chr2", "chr2", "chr2"),
> >>>>>>    chromStart = c(10089, 10132, 10133, 10148, 210382, 216132),
> >>>>>>    chromEnd = c(10309, 10536, 10362, 10418, 210578, 216352),
> >>>>>>    name = c("ZBTB33", "TAF7_(SQ-8)", "Pol2-4H8", "MafF_(M8194)",
> >>>>>>       "ZBTB33", "CTCF"),
> >>>>>>    cumsum = c(10089, 20221, 30354, 40502, 50884, 67016),
> >>>>>>    bin = rep(NA, 6)
> >>>>>> )
> >>>>>>
> >>>>>> # assign a new bin every time chrom changes and every time chromStart
> >>>>>> # changes by 115341 or more
> >>>>>> L <- nrow(all.tf7)
> >>>>>> prev.chrom <- c(NA, all.tf7$chrom[-L])
> >>>>>> delta.start <- c(NA, all.tf7$chromStart[-1] - all.tf7$chromStart[-L])
> >>>>>> new.bin <- is.na(prev.chrom) | all.tf7$chrom != prev.chrom |
> >>>>>>    delta.start >= 115341
> >>>>>> all.tf7$bin <- cumsum(new.bin)
> >>>>>> all.tf7
> >>>>>>
> >>>>>> pguilha <[hidden email]> wrote on 07/02/2012 06:25:13 AM:
> >>>>>>
> >>>>>>> Hello all,
> >>>>>>>
> >>>>>>> I have written a for loop to act on a dataframe with close to 3 million
> >>>>>>> rows and 6 columns and I would like to pass it to apply() to speed the
> >>>>>>> process up (I let the loop run for 2 days before stopping it and it had
> >>>>>>> only gone through 200,000 rows) but I am really struggling to find a way
> >>>>>>> to pass the arguments. Below are the loop and the head of the dataframe
> >>>>>>> I am working on.
> >>>>>>> Any hints would be much appreciated, thank you!
> >>>>>>> (I have searched for this
> >>>>>>> but could not find any other posts doing quite what I want)
> >>>>>>> Paul
> >>>>>>>
> >>>>>>> x<-as.numeric(all.tf7[1,2])
> >>>>>>> for (i in 2:nrow(all.tf7)) {
> >>>>>>>   if (all.tf7[i,1]==all.tf7[i-1,1] & (all.tf7[i,2]-x)<115341)
> >>>>>>>     all.tf7[i,6]<-all.tf7[i-1,6]
> >>>>>>>   else if (all.tf7[i,1]==all.tf7[i-1,1] & (all.tf7[i,2]-x)>=115341) {
> >>>>>>>     all.tf7[i,6]<-(all.tf7[i-1,6]+1)
> >>>>>>>     x<-as.numeric(all.tf7[i,2]) }
> >>>>>>>   else if (all.tf7[i,1]!=all.tf7[i-1,1]) {
> >>>>>>>     all.tf7[i,6]<-(all.tf7[i-1,6]+1)
> >>>>>>>     x<-as.numeric(all.tf7[i,2]) }
> >>>>>>> }
> >>>>>>>
> >>>>>>> #the aim here is to attribute a bin number to each row so that I can
> >>>>>>> #then split the dataframe according to those bins.
> >>>>>>>
> >>>>>>> chrom chromStart chromEnd name         cumsum bin
> >>>>>>> chr1  10089      10309    ZBTB33       10089  1
> >>>>>>> chr1  10132      10536    TAF7_(SQ-8)  20221  1
> >>>>>>> chr1  10133      10362    Pol2-4H8     30354  1
> >>>>>>> chr1  10148      10418    MafF_(M8194) 40502  1
> >>>>>>> chr1  10382      10578    ZBTB33       50884  1
> >>>>>>> chr1  16132      16352    CTCF         67016  1

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.