Hi William,

Thank you very much for your reply.
I subsampled the data to reduce the number of records to ~1.8 million. It seems to work fine except for the 99th percentile, where the p-values for all the features are 1.0. Does this mean I'm subsampling too much? How should I interpret the result?

tau: [1] 0.25

Coefficients:
                 Value Std. Error     t value Pr(>|t|)
(Intercept)   72.15700    0.03651  1976.10513  0.00000
f1            -0.51000    0.04906   -10.39508  0.00000
f2           -20.44200    0.03933  -519.78766  0.00000
f3            -2.37000    0.04871   -48.65117  0.00000
f1:f2         -2.52500    0.05315   -47.50361  0.00000
f1:f3          1.03600    0.06573    15.76193  0.00000
f2:f3          3.41300    0.05247    65.05075  0.00000
f1:f2:f3      -0.83800    0.07120   -11.77002  0.00000

Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 + f2 * f3 +
    f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95, 0.99), data = data_stats)

tau: [1] 0.5

Coefficients:
                 Value Std. Error     t value Pr(>|t|)
(Intercept)   83.80900    0.05626  1489.61222  0.00000
f1            -0.92200    0.07528   -12.24692  0.00000
f2           -27.90700    0.05937  -470.07189  0.00000
f3            -6.45000    0.07204   -89.53909  0.00000
f1:f2         -2.66500    0.07933   -33.59275  0.00000
f1:f3          1.99000    0.09869    20.16440  0.00000
f2:f3          7.09600    0.07611    93.23813  0.00000
f1:f2:f3      -1.71200    0.10390   -16.47660  0.00000

Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 + f2 * f3 +
    f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95, 0.99), data = data_stats)

tau: [1] 0.75

Coefficients:
                 Value Std. Error     t value Pr(>|t|)
(Intercept)  102.71700    0.10175  1009.45946  0.00000
f1            -1.59300    0.13241   -12.03125  0.00000
f2           -40.64200    0.10623  -382.58456  0.00000
f3           -14.40900    0.12096  -119.11988  0.00000
f1:f2         -2.97600    0.13867   -21.46071  0.00000
f1:f3          3.74600    0.16335    22.93165  0.00000
f2:f3         14.14800    0.12692   111.47217  0.00000
f1:f2:f3      -3.16400    0.17159   -18.43899  0.00000

Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 + f2 * f3 +
    f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95, 0.99), data = data_stats)

tau: [1] 0.9

Coefficients:
                 Value Std. Error     t value Pr(>|t|)
(Intercept)  130.89400    0.20609   635.12464  0.00000
f1            -2.55500    0.28139    -9.07995  0.00000
f2           -60.90500    0.21322  -285.64558  0.00000
f3           -29.42300    0.23409  -125.69092  0.00000
f1:f2         -2.77700    0.29052    -9.55870  0.00000
f1:f3          7.89700    0.33308    23.70870  0.00000
f2:f3         27.78100    0.24338   114.14722  0.00000
f1:f2:f3      -6.95800    0.34491   -20.17327  0.00000

Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 + f2 * f3 +
    f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95, 0.99), data = data_stats)

tau: [1] 0.95

Coefficients:
                 Value Std. Error     t value Pr(>|t|)
(Intercept)  157.45900    0.42733   368.47413  0.00000
f1            -4.10200    0.55834    -7.34678  0.00000
f2           -81.24000    0.44012  -184.58697  0.00000
f3           -46.17500    0.46235   -99.87033  0.00000
f1:f2         -2.01700    0.57651    -3.49866  0.00047
f1:f3         15.67000    0.67409    23.24600  0.00000
f2:f3         43.00100    0.47973    89.63500  0.00000
f1:f2:f3     -14.05100    0.69737   -20.14843  0.00000

Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 + f2 * f3 +
    f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95, 0.99), data = data_stats)

tau: [1] 0.99

Coefficients:
                    Value    Std. Error       t value      Pr(>|t|)
(Intercept)  2.544860e+02  3.878303e+07  1.000000e-05  9.999900e-01
f1          -1.420000e+01  5.917548e+11  0.000000e+00  1.000000e+00
f2          -1.582920e+02  3.450261e+07  0.000000e+00  1.000000e+00
f3          -1.139210e+02  4.763057e+07  0.000000e+00  1.000000e+00
f1:f2        5.725000e+00  1.324283e+12  0.000000e+00  1.000000e+00
f1:f3        6.811780e+02  1.153645e+13  0.000000e+00  1.000000e+00
f2:f3        1.042510e+02  2.299953e+24  0.000000e+00  1.000000e+00
f1:f2:f3    -6.763210e+02  2.299953e+24  0.000000e+00  1.000000e+00

Warning message:
In summary.rq(xi, ...) : 288000 non-positive fis

On Sat, Nov 15, 2014 at 8:19 PM, William Dunlap <wdun...@tibco.com> wrote:

> You can time it yourself on increasingly large subsets of your data.
> E.g.,
>
> > dat <- data.frame(x1=rnorm(1e6), x2=rnorm(1e6),
>       x3=sample(c("A","B","C"),size=1e6,replace=TRUE))
> > dat$y <- with(dat, x1 + 2*(x3=="B")*x2 + rnorm(1e6))
> > t <- vapply(n<-4^(3:10),FUN=function(n){d<-dat[seq_len(n),];
>       print(system.time(rq(data=d, y ~ x1 + x2*x3,
>       tau=0.9)))},FUN.VALUE=numeric(5))
>    user  system elapsed
>       0       0       0
>    user  system elapsed
>       0       0       0
>    user  system elapsed
>    0.02    0.00    0.01
>    user  system elapsed
>    0.01    0.00    0.02
>    user  system elapsed
>    0.10    0.00    0.11
>    user  system elapsed
>    1.09    0.00    1.10
>    user  system elapsed
>   13.05    0.02   13.07
>    user  system elapsed
>  273.30    0.11  273.74
> > t
>            [,1] [,2] [,3] [,4] [,5] [,6]  [,7]   [,8]
> user.self     0    0 0.02 0.01 0.10 1.09 13.05 273.30
> sys.self      0    0 0.00 0.00 0.00 0.00  0.02   0.11
> elapsed       0    0 0.01 0.02 0.11 1.10 13.07 273.74
> user.child   NA   NA   NA   NA   NA   NA    NA     NA
> sys.child    NA   NA   NA   NA   NA   NA    NA     NA
>
> Do some regressions on t["elapsed",] as a function of n and predict up to
> n=10^7. E.g.,
>
> > summary(lm(t["elapsed",] ~ poly(n,4)))
>
> Call:
> lm(formula = t["elapsed", ] ~ poly(n, 4))
>
> Residuals:
>          1          2          3          4          5          6          7          8
> -2.375e-03 -2.970e-03  4.484e-03  1.674e-03 -8.723e-04  6.096e-05 -9.199e-07  2.715e-09
>
> Coefficients:
>              Estimate Std. Error  t value Pr(>|t|)
> (Intercept) 3.601e+01  1.261e-03 28564.33 9.46e-14 ***
> poly(n, 4)1 2.493e+02  3.565e-03 69917.04 6.45e-15 ***
> poly(n, 4)2 5.093e+01  3.565e-03 14284.61 7.57e-13 ***
> poly(n, 4)3 1.158e+00  3.565e-03   324.83 6.43e-08 ***
> poly(n, 4)4 4.392e-02  3.565e-03    12.32  0.00115 **
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Residual standard error: 0.003565 on 3 degrees of freedom
> Multiple R-squared: 1, Adjusted R-squared: 1
> F-statistic: 1.273e+09 on 4 and 3 DF, p-value: 3.575e-14
>
> It does not look good for n=10^7.
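The advice above, to regress elapsed time on n and extrapolate, can also be sketched with a log-log (power-law) fit. This is a swapped-in alternative to the degree-4 polynomial in the quoted message, not Bill's model. It uses only the timings reported above; the 0.1-second cutoff and the names `timings`, `usable`, and `pred_seconds` are assumptions made for this illustration.

```r
# Elapsed seconds reported in the thread for rq() on the first n rows, n = 4^(3:10).
timings <- data.frame(
  n       = 4^(3:10),
  elapsed = c(0, 0, 0.01, 0.02, 0.11, 1.10, 13.07, 273.74)
)

# Assumption for this sketch: runs under 0.1 s are too close to the timer
# resolution to be informative (and log(0) is undefined), so drop them.
usable <- subset(timings, elapsed >= 0.1)

# Power-law model: elapsed = c * n^k, i.e. log(elapsed) = log(c) + k * log(n).
fit <- lm(log(elapsed) ~ log(n), data = usable)
k <- coef(fit)[["log(n)"]]

# Extrapolate to the sizes discussed in the thread: 10^7 and about 18 million rows.
new_n <- data.frame(n = c(1e7, 1.8e7))
pred_seconds <- exp(predict(fit, newdata = new_n))

print(k)
print(data.frame(n = new_n$n, predicted_hours = pred_seconds / 3600))
```

With only four usable points, and the largest run sitting above the fitted line, the extrapolated times are probably optimistic and are best read as a rough lower bound.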
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Sat, Nov 15, 2014 at 12:12 PM, Yunqi Zhang <yqzh...@ucsd.edu> wrote:
>
>> Hi all,
>>
>> I'm using quantreg rq() to perform quantile regression on a large data
>> set. Each record has 4 fields and there are about 18 million records in
>> total. I wonder if anyone has tried rq() on a large dataset and how long
>> I should expect it to finish. Or it is simply too large and I should
>> subsample the data. I would like to have an idea before I start to run
>> and wait forever.
>>
>> In addition, I will appreciate if anyone could give me an idea how long
>> it takes for rq() to run approximately for certain dataset size.
>>
>> Yunqi
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
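
On the tau = 0.99 summary at the top of this message: the warning "288000 non-positive fis" comes from the standard-error method that summary.rq uses by default for samples larger than 1000 (se = "nid"), which relies on a local density estimate at each observation. When many of those estimates are non-positive, the standard errors can blow up, as the 1e+07 to 1e+24 values in that table suggest, so the p-values of 1.0 probably say more about the standard-error computation than about the features or the subsample size. A minimal sketch of one alternative, bootstrap standard errors via se = "boot", follows. The synthetic data frame `d`, its distribution, the sample size, and R = 100 replications are all assumptions for illustration and are not the poster's data.

```r
library(quantreg)

set.seed(1)
n <- 2000

# Synthetic stand-in (assumption): three binary factors and a right-skewed response.
d <- data.frame(
  f1 = rbinom(n, 1, 0.5),
  f2 = rbinom(n, 1, 0.5),
  f3 = rbinom(n, 1, 0.5)
)
d$output <- 80 - 20 * d$f2 + rexp(n, rate = 1 / (10 + 5 * d$f3))

# Same model structure as in the thread: all main effects and interactions.
fit <- rq(output ~ f1 * f2 * f3, tau = 0.99, data = d)

# Bootstrap standard errors instead of the default "nid" sandwich estimate.
# R is the number of bootstrap replications and is passed through to boot.rq().
s_boot <- summary(fit, se = "boot", R = 100)
print(s_boot)
```

On 1.8 million rows the bootstrap refits the model R times, so it may be worth trying on a smaller subsample first to gauge the running time.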