Hi Lyndon, I'm sorry, I just saw that you provided your dataset. I checked my program and it reproduced the same error.
con <- url("http://sites.google.com/site/ldemisc/r_general/tpfpdat.Rdata") load(con) close(con) bins<-quantile(x,0:200/200) ctch<-t(sapply(bins,function(b)c(b,sum(x>b),sum(y>b)))) colnames(ctch)<-c("bin","tp","fp") head(ctch,20) so it all comes down to quantile not producing strictly ordered values bins[11]>bins[12] #TRUE this is due to how "quantile" handles ties in conjunction with floating point arithmetic. There are 9 different way in R to calculate quantiles, see ?quantile The last 6 produce non-monotone quantiles: sapply(1:9,function(i)any(diff(quantile(x,0:200/200,type=i))<0)) while the first three are ok. So bins<-quantile(x,0:200/200,type=1) or rounding bins up to 4 digits will suffice. cheers. Am 13.07.2011 19:03, schrieb Eik Vettorazzi: > to cut a long story short, how about that: > > bins<-quantile(x,0:200/200) #maybe replacing x by c(x,y) > ctch<-t(sapply(bins,function(b)c(b,sum(x>b),sum(y>b)))) > colnames(ctch)<-c("bin","tp","fp") > > Cheers. > > Am 13.07.2011 18:49, schrieb Eik Vettorazzi: >> Hello Lyndon, >> I'm still puzzled what x and y actually are and how they are related. >> Your "bins" are calculated from x only, so you might missing some range >> of y and since you use a fixed number of bins of equal size, some bins >> might be empty (and therefore reproducing the result from the bin >> before), and some might contain more than one distinct value (and so are >> ignoring an intermediate step). But this might be intentional and the >> reason for you to build up your own function. >> If not, have a look at pROC, there is a clever way to calculate the >> required bins, in particular look at >> >> getAnywhere("roc.utils.thresholds") >> >> >> Some tweaking of your code is possible anyway: >> >>> for(i in 1:length(threshold)) { >>> tp[i] <- length(x2) - length(x2[x2 <= bins[i]]) >>> fp[i] <- length(y2) - length(y2[y2 <= bins[i]]) >>> } >> >> calculates length(x2) and length(y2) in every single loop, but this is a >> constant, so it is wasting time. Anyway, you get away even without >> calculating this once, using >> >> tp[i]<-sum(x2>bins[i]) #indexing is not necessary, >> #and 1-x2[j]<=tr == x2[j]>tr >> >> and on top one that, the fanciest or at least R'ish way, by common >> understanding, is avoiding loops and using *apply, but that's a matter >> of taste, when it comes to that ;) >> >> tp<-sapply(bins,function(b)sum(x>b)) >> fp<-sapply(bins,function(b)sum(y>b)) >> >> #or all at once >> ctch<-t(sapply(bins,function(b)c(b,sum(x>b),sum(y>b)))) >> >> A last minor remark: sorting x and y is not necessary in the new code >> fragment. >> >> hth. >> >> >> >> Am 13.07.2011 16:46, schrieb Lyndon Estes: >>> Hello Eik, >>> >>> Thanks very much for your response and for directing me to a useful >>> explanation. >>> >>> To make sure I am grasping your answer correctly, the two problems I >>> was experiencing are related to the fact that the floating point >>> numbers I was calculating and using in subsequent indices were not >>> entirely equivalent, even if they were calculated using the same >>> function (quantile). >>> >>> I confirmed this as follows: >>> >>> j <- 0.055 # bins value for 5.5% threshold >>> bin.ct <- as.numeric(as.character(quantile(x, j, na.rm = TRUE))) >>> bin.ct >>> #430.29 >>> bin.ct2 <- quantile(x, j, na.rm = TRUE) >>> bin.ct2 >>> # 5.5% >>> #430.29 >>> bin.ct - bin.ct2 >>> # 5.5% >>> #5.684342e-14 >>> length(x) - length(x[x <= bin.ct]) >>> length(x) - length(x[x <= bin.ct]) # 2875 >>> length(x) - length(x[x <= bin.ct2]) # 3029 >>> >>> Testing the unname() option does not fix the result, I should however note: >>> bin.ct <- as.numeric(as.character(quantile(x, j, na.rm = TRUE))) >>> bin.ct2 <- unname(quantile(x, j, na.rm = TRUE)) >>> bin.ct - bin.ct2 # 5.684342e-14 >>> >>> But rounding, as an alternative approach, works: >>> bin.ct2 <- round(quantile(x, j, na.rm = TRUE), 2) >>> bin.ct - bin.ct2 >>> #5.5% >>> # 0 >>> length(x) - length(x[x <= bin.ct]) # 2875 >>> length(x) - length(x[x <= bin.ct2]) # 2875 >>> >>> As to my code, it is part of a custom ROC function I created a while >>> back and started using again recently (the data are rainfall >>> values). I can't remember why I did this rather than using one of the >>> existing ROC functions, but I thought (probably incorrectly) that I >>> had some compelling reason. In >>> any case, it is quite unwieldy, so I will explore those other >>> packages, or try revise this to be more efficient >>> >>> (e.g. maybe this is a better approach, although the answers are fairly >>> different?). >>> >>> x2 <- x[order(x)] >>> y2 <- y[order(y)] >>> bins <- round(seq(min(x2), max(x2), by = diff(range(x2)) / 200), 2) >>> threshold <- seq(0, 100, by = 0.5) >>> tp <- rep(0, length(bins)) >>> fp <- rep(0, length(bins)) >>> for(i in 1:length(threshold)) { >>> tp[i] <- length(x2) - length(x2[x2 <= bins[i]]) >>> fp[i] <- length(y2) - length(y2[y2 <= bins[i]]) >>> } >>> ctch <- cbind(threshold, bins, tp, fp) >>> ctch[1:20, ] >>> >>> Thanks again for your help. >>> >>> Cheers, Lyndon >>> >>> >>> On Tue, Jul 12, 2011 at 5:09 AM, Eik Vettorazzi >>> <e.vettora...@uke.uni-hamburg.de> wrote: >>>> Hi, >>>> >>>> Am 11.07.2011 22:57, schrieb Lyndon Estes: >>>>> ctch[ctch$threshold == 3.5, ] >>>>> # [1] threshold val tp fp tn fn tpr >>>>> fpr tnr fnr >>>>> #<0 rows> (or 0-length row.names) >>>> >>>> this is the very effective FAQ 7.31 trap. >>>> http://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f >>>> >>>> Welcome to the first circle of Patrick Burns' R Inferno! >>>> >>>> Also, unname() is a more intuitive way of removing names. >>>> >>>> And I think your code is quite inefficient, because you calculate >>>> quantiles many times, which involves repeated ordering of x, and you may >>>> use a inefficient size of bin (either to small and therefore calculating >>>> the same split many times or to large and then missing some splits). >>>> I'm a bit puzzled what is x and y in your code, so any further advise is >>>> vague but you might have a look at any package that calculates >>>> ROC-curves such as ROCR or pROC (and many more). >>>> >>>> Hth >>>> >>>> -- >>>> Eik Vettorazzi >>>> >>>> Department of Medical Biometry and Epidemiology >>>> University Medical Center Hamburg-Eppendorf >>>> >>>> Martinistr. 52 >>>> 20246 Hamburg >>>> >>>> T ++49/40/7410-58243 >>>> F ++49/40/7410-57790 >>>> >>> >>> >>> >> > -- Eik Vettorazzi Department of Medical Biometry and Epidemiology University Medical Center Hamburg-Eppendorf Martinistr. 52 20246 Hamburg T ++49/40/7410-58243 F ++49/40/7410-57790 ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.