Thanks for the intrusion!  I have 
        options(stringsAsFactors=FALSE)
and Paul probably doesn't, so he saw factors where I saw characters.
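
To illustrate the difference, here is a minimal sketch. The vector names are made up for illustration and are not from Paul's data. The factor is built explicitly, so the result should not depend on anyone's stringsAsFactors setting.

```r
chrom.chr <- c("chr1", "chr1", "chr2")   # character, as in my session
chrom.fac <- factor(chrom.chr)           # factor, as Paul probably has
L <- length(chrom.chr)

# with NA as the first argument, c() drops the factor attributes,
# so prev.fac holds integer level codes rather than labels
prev.chr <- c(NA, chrom.chr[-L])
prev.fac <- c(NA, chrom.fac[-L])

# character vs character: TRUE only where chrom really changes
new.chr <- is.na(prev.chr) | chrom.chr != prev.chr
# factor vs integer codes: a label like "chr1" never equals a code like 1
new.fac <- is.na(prev.fac) | chrom.fac != prev.fac
# mapping the codes back to labels restores the intended comparison
new.fix <- is.na(prev.fac) | chrom.fac != levels(chrom.fac)[prev.fac]

cumsum(new.chr)   # 1 1 2
cumsum(new.fac)   # 1 2 3  (a new bin on every row)
cumsum(new.fix)   # 1 1 2
```

The middle result is the same pattern as the 1,2,3,4,5,6 bins Paul reported.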

Paul,

I saw your other note ... try this code

L <- nrow(df)
# assign a new bin every time chrom changes
# (as.character() makes this work whether chrom is a factor or a character)
chrom <- as.character(df$chrom)
prev.chrom <- c(NA, chrom[-L])
bin1 <- cumsum(is.na(prev.chrom) | chrom != prev.chrom)

# subtract the minimum chromStart from each bin
min.start <- tapply(df$chromStart, bin1, min, na.rm=TRUE)[bin1]

# split bins further if chromStart >= 115341 + min.start
bin2 <- floor((df$chromStart - min.start) / 115341)

# combine the two bins into one
df$bin <- interaction(bin1, bin2, drop=TRUE)

df
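
For reference, a self-contained sketch of the same steps on the six-row example from earlier in the thread. The names `ex` and `width` are made up for illustration, and the chromEnd, name, and cumsum columns are left out for brevity. The as.character() call and drop=TRUE are assumptions added here, so that the code does not depend on the stringsAsFactors setting and so that split() returns no empty groups.

```r
# six-row example from this thread, chrom forced to character
ex <- data.frame(
        chrom = c("chr1", "chr1", "chr2", "chr2", "chr2", "chr2"),
        chromStart = c(10089, 10132, 10133, 10148, 210382, 216132),
        stringsAsFactors = FALSE
        )
width <- 115341

# step 1: a new bin every time chrom changes
L <- nrow(ex)
chrom <- as.character(ex$chrom)
prev.chrom <- c(NA, chrom[-L])
bin1 <- cumsum(is.na(prev.chrom) | chrom != prev.chrom)

# step 2: minimum chromStart within each chrom bin, repeated per row
min.start <- as.vector(tapply(ex$chromStart, bin1, min, na.rm=TRUE)[bin1])

# step 3: fixed-width windows measured from that minimum
bin2 <- floor((ex$chromStart - min.start) / width)

# step 4: one label per (chrom bin, window) pair
ex$bin <- interaction(bin1, bin2, drop=TRUE)
ex

# the pieces can then be pulled apart with split()
pieces <- split(ex, ex$bin)
length(pieces)
```

Note that these are fixed windows of 115341 counted from the minimum chromStart of each chromosome, which is not quite the same as restarting the count at the first row of each new bin.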


Jean



Rui Barradas <ruipbarra...@sapo.pt> wrote on 07/02/2012 02:24:43 PM:

> Hello,
> 
> Sorry to intrude, but I think it's a factor issue.
> Try changing the disjunction to (multiline edit):
> 
> 
> new.bin <- is.na(prev.chrom) |
>       df$chrom != levels(df$chrom)[prev.chrom] |
>       delta.start >= 115341
> 
> It should work, now.
> 
> Hope this helps,
> 
> Rui Barradas
> 
> On 02-07-2012 20:03, pguilha wrote:
> > Jean,
> > It's crazy, I'm still getting 1,2,3,4,5,6 in the bin column.....
> > Also (this is an unrelated problem I think), unless I've misunderstood
> > it, I think your code will only create a new bin if the difference
> > between chromStart at i and i-1 position is >=115341....What I want is
> > for a new bin to be created each time the difference between
> > chromStart at i and i-j is >=115341, where 'i-j' corresponds to the
> > first row of the last bin....I'm not sure if I'm being
> > clear...chromStart values correspond to coordinates along a chromosome
> > so I want to basically cut up each chromosome into sections/bins of
> > approximately 115341...
> >
> > thanks again for all your efforts with this, they're much appreciated!
> > Paul
> >
> > On 2 July 2012 19:36, Jean V Adams [via R]
> > <ml-node+s789695n4635185...@n4.nabble.com> wrote:
> >> Paul,
> >>
> >> Try this (I changed some of the object names, but the meat of the
> >> code is the same):
> >>
> >> df <- data.frame(
> >>          chrom = c("chr1", "chr1", "chr2", "chr2", "chr2", "chr2"),
> >>          chromStart = c(10089, 10132, 10133, 10148, 210382, 216132),
> >>          chromEnd = c(10309, 10536, 10362, 10418, 210578, 216352),
> >>          name = c("ZBTB33", "TAF7_(SQ-8)", "Pol2-4H8", "MafF_(M8194)",
> >>                   "ZBTB33", "CTCF"),
> >>          cumsum = c(10089, 20221, 30354, 40502, 50884, 67016)
> >>          )
> >>
> >> # assign a new bin every time chrom changes and every time chromStart
> >> # changes by 115341 or more
> >> L <- nrow(df)
> >> prev.chrom <- c(NA, df$chrom[-L])
> >> delta.start <- c(NA, df$chromStart[-1] - df$chromStart[-L])
> >> new.bin <- is.na(prev.chrom) | df$chrom != prev.chrom |
> >>            delta.start >= 115341
> >> df$bin <- cumsum(new.bin)
> >> df
> >>
> >>
> >> pguilha <[hidden email]> wrote on 07/02/2012 10:23:36 AM:
> >>
> >>> Jean, that's exactly what it should be, but yes I copied and pasted
> >>> from your email so I don't see how I could have introduced an error
> >>> in there....
> >>> paul
> >>>
> >>> On 2 July 2012 15:57, Jean V Adams [via R]
> >>> <[hidden email]> wrote:
> >>>> Paul,
> >>>>
> >>>> Are you submitting the exact code that I included in my previous
> >>>> e-mail?
> >>>> When I submit that code, I get this ...
> >>>>
> >>>>    chrom chromStart chromEnd         name cumsum bin
> >>>> 1  chr1      10089    10309       ZBTB33  10089   1
> >>>> 2  chr1      10132    10536  TAF7_(SQ-8)  20221   1
> >>>> 3  chr2      10133    10362     Pol2-4H8  30354   2
> >>>> 4  chr2      10148    10418 MafF_(M8194)  40502   2
> >>>> 5  chr2     210382   210578       ZBTB33  50884   3
> >>>> 6  chr2     216132   216352         CTCF  67016   3
> >>>>
> >>>> Jean
> >>>>
> >>>>
> >>>> Paul Guilhamon <[hidden email]> wrote on 07/02/2012 08:59:00 AM:
> >>>>
> >>>>> Thanks for your reply Jean,
> >>>>>
> >>>>> I think your interpretation is correct but when I run your code I
> >>>>> end up with the below dataframe and obviously the bins created
> >>>>> there don't correspond to a chromStart change of 115341:
> >>>>>
> >>>>>    chrom chromStart chromEnd         name cumsum bin
> >>>>> 1  chr1      10089    10309       ZBTB33  10089   1
> >>>>> 2  chr1      10132    10536  TAF7_(SQ-8)  20221   2
> >>>>> 3  chr2      10133    10362     Pol2-4H8  30354   3
> >>>>> 4  chr2      10148    10418 MafF_(M8194)  40502   4
> >>>>> 5  chr2     210382   210578       ZBTB33  50884   5
> >>>>> 6  chr2     216132   216352         CTCF  67016   6
> >>>>>
> >>>>> the first two rows should have the same bin number (same chrom,
> >>>>> <115341 diff), then rows 3&4 should be in another bin (different
> >>>>> chrom from rows 1&2, <115341 diff), and rows 5&6 in another one
> >>>>> (same chrom but >115341 difference between row 4 and row 5).
> >>>>>
> >>>>> it seems the new.bin line of your code isn't quite doing what it
> >>>>> should but I can't pinpoint the error there...
> >>>>> Paul
> >>>>>
> >>>>>
> >>>>> On 2 July 2012 14:19, Jean V Adams <[hidden email]> wrote:
> >>>>>> Paul,
> >>>>>>
> >>>>>> My interpretation is that you are trying to assign a new bin number
> >>>>>> to a row every time the variable chrom changes and every time the
> >>>>>> variable chromStart changes by 115341 or more.  Is that right?  If
> >>>>>> so, you don't need a loop at all.  Check out the code below.  I made
> >>>>>> a couple changes to the all.tf7 example data frame so that it would
> >>>>>> have two changes in bin number, one based on the chrom variable and
> >>>>>> one based on the chromStart variable.
> >>>>>>
> >>>>>> Jean
> >>>>>>
> >>>>>> all.tf7 <- data.frame(
> >>>>>>          chrom = c("chr1", "chr1", "chr2", "chr2", "chr2", "chr2"),
> >>>>>>          chromStart = c(10089, 10132, 10133, 10148, 210382, 216132),
> >>>>>>          chromEnd = c(10309, 10536, 10362, 10418, 210578, 216352),
> >>>>>>          name = c("ZBTB33", "TAF7_(SQ-8)", "Pol2-4H8", "MafF_(M8194)",
> >>>>>>                   "ZBTB33", "CTCF"),
> >>>>>>          cumsum = c(10089, 20221, 30354, 40502, 50884, 67016),
> >>>>>>          bin = rep(NA, 6)
> >>>>>>          )
> >>>>>>
> >>>>>> # assign a new bin every time chrom changes and every time chromStart
> >>>>>> # changes by 115341 or more
> >>>>>> L <- nrow(all.tf7)
> >>>>>> prev.chrom <- c(NA, all.tf7$chrom[-L])
> >>>>>> delta.start <- c(NA, all.tf7$chromStart[-1] - all.tf7$chromStart[-L])
> >>>>>> new.bin <- is.na(prev.chrom) | all.tf7$chrom != prev.chrom |
> >>>>>>            delta.start >= 115341
> >>>>>> all.tf7$bin <- cumsum(new.bin)
> >>>>>> all.tf7
> >>>>>>
> >>>>>>
> >>>>>> pguilha <[hidden email]> wrote on 07/02/2012 06:25:13 AM:
> >>>>>>
> >>>>>>> Hello all,
> >>>>>>>
> >>>>>>> I have written a for loop to act on a dataframe with close to
> >>>>>>> 3 million rows and 6 columns and I would like to pass it to apply()
> >>>>>>> to speed the process up (I let the loop run for 2 days before
> >>>>>>> stopping it and it had only gone through 200,000 rows) but I am
> >>>>>>> really struggling to find a way to pass the arguments. Below are
> >>>>>>> the loop and the head of the dataframe I am working on.
> >>>>>>> Any hints would be much appreciated, thank you! (I have searched
> >>>>>>> for this but could not find any other posts doing quite what I
> >>>>>>> want)
> >>>>>>> Paul
> >>>>>>>
> >>>>>>> x<-as.numeric(all.tf7[1,2])
> >>>>>>> for (i in 2:nrow(all.tf7)) {
> >>>>>>>    if (all.tf7[i,1]==all.tf7[i-1,1] & (all.tf7[i,2]-x)<115341)
> >>>>>>>      all.tf7[i,6]<-all.tf7[i-1,6]
> >>>>>>>    else if (all.tf7[i,1]==all.tf7[i-1,1] & (all.tf7[i,2]-x)>=115341) {
> >>>>>>>      all.tf7[i,6]<-(all.tf7[i-1,6]+1)
> >>>>>>>      x<-as.numeric(all.tf7[i,2]) }
> >>>>>>>    else if (all.tf7[i,1]!=all.tf7[i-1,1])  {
> >>>>>>>      all.tf7[i,6]<-(all.tf7[i-1,6]+1)
> >>>>>>>      x<-as.numeric(all.tf7[i,2]) }
> >>>>>>> }
> >>>>>>>
> >>>>>>> #the aim here is to attribute a bin number to each row so that I
> >>>>>>> #can then split the dataframe according to those bins.
> >>>>>>>
> >>>>>>>
> >>>>>>> chrom chromStart chromEnd         name cumsum bin
> >>>>>>> chr1       10089    10309       ZBTB33  10089   1
> >>>>>>> chr1       10132    10536  TAF7_(SQ-8)  20221   1
> >>>>>>> chr1       10133    10362     Pol2-4H8  30354   1
> >>>>>>> chr1       10148    10418 MafF_(M8194)  40502   1
> >>>>>>> chr1       10382    10578       ZBTB33  50884   1
> >>>>>>> chr1       16132    16352         CTCF  67016   1

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
