Hi Roger,

Thank you for your reply. As I understand it, changing the regression method only speeds up the computation; it does not necessarily solve the problem at the 99th percentile, where the p-values for all the factors are 1.0. How should I interpret the result for the 99th percentile, given that the results for the other percentiles seem to work fine?

Correct me if I’m wrong.

Thank you!
Yunqi

On Nov 16, 2014, at 8:42 AM, Roger <rkoen...@illinois.edu> wrote:

> You could try method = "pin".
>
> Sent from my iPhone
>
>> On Nov 16, 2014, at 1:40 AM, Yunqi Zhang <yqzh...@ucsd.edu> wrote:
>>
>> Hi William,
>>
>> Thank you very much for your reply.
>>
>> I did a subsampling to reduce the number of samples to ~1.8 million. It
>> seems to work fine except for 99th percentile (p-values for all the
>> features are 1.0). Does this mean I’m subsampling too much? How should I
>> interpret the result?
>>
>> tau: [1] 0.25
>>
>> Coefficients:
>>                  Value Std. Error     t value Pr(>|t|)
>> (Intercept)   72.15700    0.03651  1976.10513  0.00000
>> f1            -0.51000    0.04906   -10.39508  0.00000
>> f2           -20.44200    0.03933  -519.78766  0.00000
>> f3            -2.37000    0.04871   -48.65117  0.00000
>> f1:f2         -2.52500    0.05315   -47.50361  0.00000
>> f1:f3          1.03600    0.06573    15.76193  0.00000
>> f2:f3          3.41300    0.05247    65.05075  0.00000
>> f1:f2:f3      -0.83800    0.07120   -11.77002  0.00000
>>
>> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
>>     f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
>>     0.75, 0.9, 0.95, 0.99), data = data_stats)
>>
>> tau: [1] 0.5
>>
>> Coefficients:
>>                  Value Std. Error     t value Pr(>|t|)
>> (Intercept)   83.80900    0.05626  1489.61222  0.00000
>> f1            -0.92200    0.07528   -12.24692  0.00000
>> f2           -27.90700    0.05937  -470.07189  0.00000
>> f3            -6.45000    0.07204   -89.53909  0.00000
>> f1:f2         -2.66500    0.07933   -33.59275  0.00000
>> f1:f3          1.99000    0.09869    20.16440  0.00000
>> f2:f3          7.09600    0.07611    93.23813  0.00000
>> f1:f2:f3      -1.71200    0.10390   -16.47660  0.00000
>>
>> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
>>     f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
>>     0.75, 0.9, 0.95, 0.99), data = data_stats)
>>
>> tau: [1] 0.75
>>
>> Coefficients:
>>                  Value Std. Error     t value Pr(>|t|)
>> (Intercept)  102.71700    0.10175  1009.45946  0.00000
>> f1            -1.59300    0.13241   -12.03125  0.00000
>> f2           -40.64200    0.10623  -382.58456  0.00000
>> f3           -14.40900    0.12096  -119.11988  0.00000
>> f1:f2         -2.97600    0.13867   -21.46071  0.00000
>> f1:f3          3.74600    0.16335    22.93165  0.00000
>> f2:f3         14.14800    0.12692   111.47217  0.00000
>> f1:f2:f3      -3.16400    0.17159   -18.43899  0.00000
>>
>> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
>>     f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
>>     0.75, 0.9, 0.95, 0.99), data = data_stats)
>>
>> tau: [1] 0.9
>>
>> Coefficients:
>>                  Value Std. Error     t value Pr(>|t|)
>> (Intercept)  130.89400    0.20609   635.12464  0.00000
>> f1            -2.55500    0.28139    -9.07995  0.00000
>> f2           -60.90500    0.21322  -285.64558  0.00000
>> f3           -29.42300    0.23409  -125.69092  0.00000
>> f1:f2         -2.77700    0.29052    -9.55870  0.00000
>> f1:f3          7.89700    0.33308    23.70870  0.00000
>> f2:f3         27.78100    0.24338   114.14722  0.00000
>> f1:f2:f3      -6.95800    0.34491   -20.17327  0.00000
>>
>> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
>>     f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
>>     0.75, 0.9, 0.95, 0.99), data = data_stats)
>>
>> tau: [1] 0.95
>>
>> Coefficients:
>>                  Value Std. Error     t value Pr(>|t|)
>> (Intercept)  157.45900    0.42733   368.47413  0.00000
>> f1            -4.10200    0.55834    -7.34678  0.00000
>> f2           -81.24000    0.44012  -184.58697  0.00000
>> f3           -46.17500    0.46235   -99.87033  0.00000
>> f1:f2         -2.01700    0.57651    -3.49866  0.00047
>> f1:f3         15.67000    0.67409    23.24600  0.00000
>> f2:f3         43.00100    0.47973    89.63500  0.00000
>> f1:f2:f3     -14.05100    0.69737   -20.14843  0.00000
>>
>> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
>>     f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
>>     0.75, 0.9, 0.95, 0.99), data = data_stats)
>>
>> tau: [1] 0.99
>>
>> Coefficients:
>>                      Value    Std. Error       t value      Pr(>|t|)
>> (Intercept)   2.544860e+02  3.878303e+07  1.000000e-05  9.999900e-01
>> f1           -1.420000e+01  5.917548e+11  0.000000e+00  1.000000e+00
>> f2           -1.582920e+02  3.450261e+07  0.000000e+00  1.000000e+00
>> f3           -1.139210e+02  4.763057e+07  0.000000e+00  1.000000e+00
>> f1:f2         5.725000e+00  1.324283e+12  0.000000e+00  1.000000e+00
>> f1:f3         6.811780e+02  1.153645e+13  0.000000e+00  1.000000e+00
>> f2:f3         1.042510e+02  2.299953e+24  0.000000e+00  1.000000e+00
>> f1:f2:f3     -6.763210e+02  2.299953e+24  0.000000e+00  1.000000e+00
>>
>> Warning message:
>> In summary.rq(xi, ...) : 288000 non-positive fis
>>
>>> On Sat, Nov 15, 2014 at 8:19 PM, William Dunlap <wdun...@tibco.com> wrote:
>>>
>>> You can time it yourself on increasingly large subsets of your data. E.g.,
>>>
>>> > dat <- data.frame(x1=rnorm(1e6), x2=rnorm(1e6),
>>>       x3=sample(c("A","B","C"),size=1e6,replace=TRUE))
>>> > dat$y <- with(dat, x1 + 2*(x3=="B")*x2 + rnorm(1e6))
>>> > t <- vapply(n<-4^(3:10),FUN=function(n){d<-dat[seq_len(n),];
>>>       print(system.time(rq(data=d, y ~ x1 + x2*x3,
>>>       tau=0.9)))},FUN.VALUE=numeric(5))
>>>    user  system elapsed
>>>       0       0       0
>>>    user  system elapsed
>>>       0       0       0
>>>    user  system elapsed
>>>    0.02    0.00    0.01
>>>    user  system elapsed
>>>    0.01    0.00    0.02
>>>    user  system elapsed
>>>    0.10    0.00    0.11
>>>    user  system elapsed
>>>    1.09    0.00    1.10
>>>    user  system elapsed
>>>   13.05    0.02   13.07
>>>    user  system elapsed
>>>  273.30    0.11  273.74
>>> > t
>>>            [,1] [,2] [,3] [,4] [,5] [,6]  [,7]   [,8]
>>> user.self     0    0 0.02 0.01 0.10 1.09 13.05 273.30
>>> sys.self      0    0 0.00 0.00 0.00 0.00  0.02   0.11
>>> elapsed       0    0 0.01 0.02 0.11 1.10 13.07 273.74
>>> user.child   NA   NA   NA   NA   NA   NA    NA     NA
>>> sys.child    NA   NA   NA   NA   NA   NA    NA     NA
>>>
>>> Do some regressions on t["elapsed",] as a function of n and predict up to
>>> n=10^7. E.g.,
>>>
>>> > summary(lm(t["elapsed",] ~ poly(n,4)))
>>>
>>> Call:
>>> lm(formula = t["elapsed", ] ~ poly(n, 4))
>>>
>>> Residuals:
>>>          1          2          3          4          5          6          7          8
>>> -2.375e-03 -2.970e-03  4.484e-03  1.674e-03 -8.723e-04  6.096e-05 -9.199e-07  2.715e-09
>>>
>>> Coefficients:
>>>              Estimate Std. Error  t value Pr(>|t|)
>>> (Intercept) 3.601e+01  1.261e-03 28564.33 9.46e-14 ***
>>> poly(n, 4)1 2.493e+02  3.565e-03 69917.04 6.45e-15 ***
>>> poly(n, 4)2 5.093e+01  3.565e-03 14284.61 7.57e-13 ***
>>> poly(n, 4)3 1.158e+00  3.565e-03   324.83 6.43e-08 ***
>>> poly(n, 4)4 4.392e-02  3.565e-03    12.32  0.00115 **
>>> ---
>>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>>
>>> Residual standard error: 0.003565 on 3 degrees of freedom
>>> Multiple R-squared:  1, Adjusted R-squared:  1
>>> F-statistic: 1.273e+09 on 4 and 3 DF,  p-value: 3.575e-14
>>>
>>> It does not look good for n=10^7.
>>>
>>> Bill Dunlap
>>> TIBCO Software
>>> wdunlap tibco.com
>>>
>>>> On Sat, Nov 15, 2014 at 12:12 PM, Yunqi Zhang <yqzh...@ucsd.edu> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I'm using quantreg rq() to perform quantile regression on a large data
>>>> set. Each record has 4 fields and there are about 18 million records in
>>>> total. I wonder if anyone has tried rq() on a large dataset and how long
>>>> I should expect it to finish. Or it is simply too large and I should
>>>> subsample the data. I would like to have an idea before I start to run
>>>> and wait forever.
>>>>
>>>> In addition, I will appreciate if anyone could give me an idea how long it
>>>> takes for rq() to run approximately for certain dataset size.
>>>>
>>>> Yunqi
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> R-help@r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
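The suggestion in Bill Dunlap's message to regress the elapsed time on n and predict up to n=10^7 can also be sketched as a log-log fit. This is only a rough sketch: the power-law form elapsed ~ a * n^b is an assumption and is not stated anywhere in the thread. The names `n_new`, `b` and `pred_sec` are made up for illustration. The timings are the "elapsed" values printed in the thread for the four largest subsets; the smaller subsets are left out because their timings are at or near the resolution of system.time().

```r
# Elapsed timings (seconds) copied from the vapply() output quoted in the thread,
# for the subset sizes n = 4^7, 4^8, 4^9, 4^10.
n       <- 4^(7:10)
elapsed <- c(0.11, 1.10, 13.07, 273.74)

# Assumed model (not from the thread): elapsed ~ a * n^b,
# which is a straight line on the log-log scale.
fit <- lm(log(elapsed) ~ log(n))
b   <- unname(coef(fit)[2])   # estimated growth exponent

# Extrapolate to the data set sizes mentioned in the thread
# (about 1.8 million, 10 million and 18 million records).
n_new    <- c(1.8e6, 1e7, 1.8e7)
pred_sec <- exp(predict(fit, newdata = data.frame(n = n_new)))

print(b)
print(data.frame(n = n_new, hours = pred_sec / 3600))
```

The observed timings grow by factors of about 10, 12 and 21 for each fourfold increase in n, so the growth rate is itself increasing. A single power law therefore probably underestimates the run time at 10^7 records, and the predictions are better read as a rough lower bound than as an estimate.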