[R] Maximum length for a -e argument to Rscript?
Hi! I’m attempting to use Rscript to do some automated plotting. It is working well, except that I seem to be running into a maximum line length issue, and I’m wondering if it is a bug on your end. Here’s an example of the command I’m trying to run: /usr/local/bin/Rscript -e '{x <- c(-1.31171, -0.686165, 1.62771, 0.320195, -0.322011, 1.66518, -0.271971, -0.665367, 0.516482, -0.716343, -0.317471, 0.068046, -0.100371, -1.15907, 0.263329, -0.936049, -0.852444, 0.358817, -0.233959, 0.209891, -0.831575, -0.952987, -0.0420206, -1.78527, -0.280584, -0.62353, 1.42597, 0.127994, 0.0751232, 0.896835, -0.319488, 0.897876, 0.18457, 0.779571, -0.0543194, 0.226722, -0.769983, -0.723463, 0.144386, -0.468544, -0.349417, 0.336786, 0.749212, -1.62397, 0.683075, -0.746449, 0.300921, -0.365468, 0.548271, 1.13169, -1.34042, -0.0740572, 1.34986, 0.531771, -0.147157, 0.824894, -1.05816, 1.58867, -0.885764, 1.11912, 0.361512, 1.77985, 0.585099, -1.205, 2.44134, -0.331372, -0.346322, 0.0535267, -1.75089, 0.0773243, -1.07846, -1.29632, 1.0622, 1.34867, 0.199777, 0.197516, 0.574185, 1.06555, -0.885166, -0.788576, -1.46061, -1.54026, 0.690576, -0.88821, 0.343747, -0.100751, -0.865596, -0.128504, 0.222334, -1.18932, -0.555258, -0.557368, 0.219272, 0.298858, 0.848022, 0.142608, 1.10082, -0.348039, 0.0566489, 0.662136, 0.50451, -0.909399, 1.02446, 1.40592, -0.114786, -1.10718, 2.02549, 0.0818607, -1.037, 1.18961, -0.204, 2.83165, -0.959653, -0.393082, -0.463351, 0.914054, 1.14472, -1.32927, 1.25416, 0.372267, 0.410832, 1.04187, 1.22288, 1.27131, 0.0949385, 0.194053, -0.226184, -0.502155, -1.36834, -0.000591861, -0.565903, 1.14099, 1.67811, 0.331709, -0.756879, 0.889596, 0.718098, 0.740242, -0.861818, 0.0332746, 1.01745, 0.584536, -1.14245, -0.85, -1.34237, 0.660603, 1.16048, -0.898828, 0.965746, -1.16953, -2.33417, 0.591078, -0.364892, 0.0719267, -1.21676, 1.12646, 1.37534, 0.0712832, 1.22889, -0.0110024, 0.248213, -1.12013, -0.525197, -0.352156, -0.317182, -0.89552, 1.53422, -1.36777, 1.52099, 1.18789, -3.15251, 1.24008, -0.564289, -0.515629, -0.0920464, 2.94027, 0.895481, -0.157643, -0.347874, -0.290823, -0.771436, 1.29285, 0.216689, -1.86856, 2.24075, 0.888635, 0.430417, -0.585856, 1.13119, -0.243977, 0.544491, 0.921995, 0.815365, 1.2584, -1.29347, 0.0574579, 0.990557, -1.58657, -0.264373, 0.865893, 0.599298, -0.417531, 0.132897, 1.88597, 1.33112, -0.880904, 0.0762161, 0.0567852, 0.593295, -0.632135, 0.885625, 0.365863, -0.17885, 0.420185, -0.508275, 0.974357, 0.628085, 0.710578, 1.72447, 1.38488, 1.01301, 1.30015, 0.260501, 0.808981, 0.440228, 0.416344, -1.66273, -0.397224, -0.512086, -0.175854, -0.663143, 0.369712, -1.01654, 0.660465, 0.124851, -1.51101, -0.95725, 2.09893, 1.26819, 1.08086, 0.493204, 0.79073, 1.49191, 0.563689, 0.414473, 2.27361, 0.871923, 0.193703, -0.185039, -0.312859, -1.42267, -2.11561, 0.311996, -0.0906527, 1.19139, 1.57502, 1.10587, 0.416333, 2.35374, -1.0531, 0.0450512, 0.979958, 0.398269, 0.0897618, -0.855305, -1.59337, -0.084904, 0.245872, 1.27115, 1.3512, -0.166962, 1.01098, -1.19854, -2.05932, -0.98, 0.704973, 0.694688, 1.20408, -1.12553, 0.770056, 1.01602, 0.295223, -1.18801, 1.51707, 1.1418, -0.148787, 1.28886, 1.23981, 1.67984, 0.0185941, -0.877581, 0.495042, -0.368668, 1.59972, -2.20849, -1.36852, -0.972566, -1.01848, -0.366674, -2.60273, -0.540706, -0.475797, 0.227651, -1.11476, 1.73452, -0.212185, 3.04994, -0.251225, -0.0443482, -0.489125, 0.557936, -0.246389, -0.954287, 0.388668, 0.759049, -0.501683, -1.98722, 0.158712, -0.897082, -1.17644, 0.192465, -1.49901, -0.289129, 
-0.0460198, -0.520331, 0.432488, -0.471775, 1.21482, 0.998091, -0.794933, -0.36989, 0.937091, 1.27297, 1.06108, -0.784307, 0.70919, -0.309221, -0.516031, 0.479702, -0.669637, 1.60288, 0.657474, -0.666286, -1.01816, -0.452818, -0.283749, 1.05511, -1.2002, 0.112269, -1.37403, 1.00512, 0.664936, 0.600516, -1.08099, -0.705522, 0.103051, 0.0461179, 1.74404, 0.727475, 2.41175, 1.20414, 1.71095, 0.0957544, 0.610755, 0.545194, -0.936381, 0.190467, 0.485791, 0.0855697, 0.142676, 0.721754, -1.84506, 2.1227, -1.1271, -1.11228, -1.2807, 0.13385, 0.228727, -0.34391, 1.09837, -0.37053, 0.832574, 0.673463, 0.717732, -0.307687, 1.12486, 0.159803, -1.51251, 1.403, 2.0215, 0.010914, -0.543749, 0.137119, 0.200364, -0.104029, -0.930966, -1.56781, -0.526978, -0.537582, 1.11872, -0.99061, -0.501421, 1.21982, 0.813201, -0.539109, 0.433523, -0.0615188, 2.04298, 0.697692, 1.34515, 1.7298, 0.515137, 2.08119, 0.550985, 1.49876, 1.31187, 0.920405, 0.597678, 0.884353, -0.732892, -0.143545, -0.236836, -0.330872, 1.55577, -1.74473, -0.493322, 1.46375, 1.14347, 1.76164, 1.73099, -0.234701, -0.0546848, 0.346991, -0.393301, 1.34267, -1.58519, -0.381789, 0.622675, 1.34655, 2.84895, -0.371, -0.519666, -1.64944, 0.573592, 1.06484, -0.0239894, -0.604563, 0.0680641, -0.881325, 1.07265, 0.182585, 0.373288, 2.20228, -0.763593, -0.25659, 1.9397, -0.166943, -0.672522, -1.35983, 0.227406, 0.49471, -1
Re: [R] Maximum length for a -e argument to Rscript?
OK, thanks for the information, Ben and Henrik. It sounds like even if the Rscript stall were fixed, there would still be a 10k limit on -e expressions, plus a 4K maximum line length when using -e or stdin, if I understand the section Ben cited. So I guess I need to switch to using a script file instead, to get past these various limits. Thanks for your help. Cheers, -B.

Benjamin C. Haller
Messer Lab
Cornell University

> On Apr 22, 2017, at 6:59 AM, Henrik Bengtsson wrote:
>
> That Rscript stalls sounds like a bug, but not sure it's R or the
> terminal that needs to be fixed:
>
> Rscript -e "$long_expression"
> WARNING: '-e
> {x~+~<-~+~c(-1.31171,~+~-0.686165,~+~1.62771,~+~0.320195,~+~-0.322011,~+~1.66518,~+~-0.271971,~+~-0.665367,~+~0.516482,~+~-0.716343,~+~-0.317471,~+~0.068046,~+~-0.100371,~+~-1.15907,~+~0.263329,~+~-0.936049,~+~-0.852444,~+~0.358817,~+~-0.233959,~+~0.209891,~+~-0.831575,~+~-0.952987,~+~-0.0420206,~+~-1.78527,~+~-0.280584,~+~-0.62353,~+~1.42597,~+~0.127994,~+~0.0751232,~+~0.896835,~+~-0.319488,~+~0.897876,~+~0.18457,~+~0.779571,~+~-0.0543194,~+~0.226722,~+~-0.769983,~+~-0.723463,~+~0.144386,~+~-0.468544,~+~-0.349417,~+~0.336786,~+~0.749212,~+~-1.62397,~+~0.683075,~+~-0.746449,~+~0.300921,~+~-0.365468,~+~0.548271,~+~1.13169,~+~-1.34042,~+~-0.0740572,~+~1.34986,~+~0.531771,~+~-0.147157,~+~0.824894,~+~-1.05816,~+~1.58867,~+~-0.885764,~+~1.11912,~+~0.361512,~+~1.77985,~+~0.585099,~+~-1.205,~+~2.44134,~+~-0.331372,~+~-0.346322,~+~0.0535267,~+~-1.75089,~+~0.0773243,~+~-1.07846,~+~-1.29632,~+~1.0622,~+~1.34867,~+~0.199777,~+~0.197516,~+~0.574185,~+~1.06555,~+~-0.885166,~+~-0.788576,~+~-1.46061,~+~-1.5402
> ^C^C
> ^C
> ^C^C^C^C^C^C
> ^\Quit (core dumped)
>
> On my default Ubuntu 16.04 terminal, R 3.3.3 hangs and does not
> respond to user interrupts (SIGINT), but it does respond to Ctrl-\
> (SIGQUIT).
>
> A workaround is to pass the expression via standard input to R, e.g.
>
> $ echo "$long_expression" | R --no-save
>
> /Henrik
>
> On Fri, Apr 21, 2017 at 11:07 AM, Ben Tupper wrote:
>> Hi,
>>
>> I suspect you are over the 10kb limit for the expression. See
>>
>> https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Invoking-R-from-the-command-line
>>
>> Cheers,
>> Ben
>>
>>> On Apr 21, 2017, at 3:44 AM, Ben Haller wrote:
>>>
>>> Hi! I'm attempting to use Rscript to do some automated plotting. It is
>>> working well, except that I seem to be running into a maximum line length
>>> issue, and I'm wondering if it is a bug on your end.
Here’s an example of >>> the command I’m trying to run: >>> >>> /usr/local/bin/Rscript -e '{x <- c(-1.31171, -0.686165, 1.62771, 0.320195, >>> -0.322011, 1.66518, -0.271971, -0.665367, 0.516482, -0.716343, -0.317471, >>> 0.068046, -0.100371, -1.15907, 0.263329, -0.936049, -0.852444, 0.358817, >>> -0.233959, 0.209891, -0.831575, -0.952987, -0.0420206, -1.78527, -0.280584, >>> -0.62353, 1.42597, 0.127994, 0.0751232, 0.896835, -0.319488, 0.897876, >>> 0.18457, 0.779571, -0.0543194, 0.226722, -0.769983, -0.723463, 0.144386, >>> -0.468544, -0.349417, 0.336786, 0.749212, -1.62397, 0.683075, -0.746449, >>> 0.300921, -0.365468, 0.548271, 1.13169, -1.34042, -0.0740572, 1.34986, >>> 0.531771, -0.147157, 0.824894, -1.05816, 1.58867, -0.885764, 1.11912, >>> 0.361512, 1.77985, 0.585099, -1.205, 2.44134, -0.331372, -0.346322, >>> 0.0535267, -1.75089, 0.0773243, -1.07846, -1.29632, 1.0622, 1.34867, >>> 0.199777, 0.197516, 0.574185, 1.06555, -0.885166, -0.788576, -1.46061, >>> -1.54026, 0.690576, -0.88821, 0.343747, -0.100751, -0.865596, -0.128504, >>> 0.222334, -1.18932, -0.555258, -0.557368, 0.219272, 0.298858, 0.848022, >>> 0.142608, 1.10082, -0.348039, 0.0566489, 0.662136, 0.50451, -0.909399, >>> 1.02446, 1.40592, -0.114786, -1.10718, 2.02549, 0.0818607, -1.037, 1.18961, >>> -0.204, 2.83165, -0.959653, -0.393082, -0.463351, 0.914054, 1.14472, >>> -1.32927, 1.25416, 0.372267, 0.410832, 1.04187, 1.22288, 1.27131, >>> 0.0949385, 0.194053, -0.226184, -0.502155, -1.36834, -0.000591861, >>> -0.565903, 1.14099, 1.67811, 0.331709, -0.756879, 0.889596, 0.718098, >>> 0.740242, -0.861818, 0.0332746, 1.01745, 0.584536, -1.14245, -0.85, >>> -1.34237, 0.660603, 1.16048, -0.898828, 0.965746, -1.16953, -2.33417, >>> 0.591078, -0.364892, 0.0719267, -1.21676, 1.12646, 1.37534, 0.0712832, >>> 1.22889, -0.0110024, 0.248213, -1.12013, -0.525197, -0.352156, -0.317182, >>> -0.89552, 1.53422, -1.36777, 1.52099, 1.18789, -3.15251, 1.24008, >>> -0.564289,
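[Editor's note: a minimal sketch of the script-file workaround discussed in the reply above, with the calling side also done from R; the file names, the stand-in data, and the plotting command are invented for illustration.]

# write the generated R commands to a temporary script and run it with
# Rscript, sidestepping the -e length limit entirely
x <- rnorm(300)                       # stand-in for the real data vector
script <- tempfile(fileext = ".R")
writeLines(c(
  sprintf("x <- c(%s)", paste(x, collapse = ", ")),
  "png('plot.png'); plot(density(x)); dev.off()"
), script)
system2("/usr/local/bin/Rscript", script)

# alternatively, keep a fixed plotting script that reads its data from a file
# (e.g. x <- scan('data.txt')) and just write data.txt from the calling program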
[R] Confidence intervals and polynomial fits
Hi all! I'm getting a model fit from glm() (a binary logistic regression fit, but I don't think that's important) for a formula that contains powers of the explanatory variable up to the fourth. So the fit looks something like this (typing into mail; the actual fit code is complicated because it involves step-down and so forth):

x_sq <- x * x
x_cb <- x * x * x
x_qt <- x * x * x * x
model <- glm(outcome ~ x + x_sq + x_cb + x_qt, binomial)

This works great, and I get a nice fit. My question regards the confidence intervals on that fit. I am calling:

cis <- confint.default(model, level=0.95)

to get the confidence intervals for each coefficient. This confint.default() call gives me a table like this (where "dispersal" is the real name of what I'm calling "x" above):

                    2.5 %       97.5 %
(Intercept)     8.7214297    8.9842533
dispersal     -37.5621412  -35.6629981
dispersal_sq   66.8077669   74.4490123
dispersal_cb  -74.5963861  -62.8347057
dispersal_qt   22.6590159   28.6506186

A side note: calling confint() instead of confint.default() is much, much slower -- my model has many other terms I'm not discussing here, and is fit to 300,000 data points -- but a check indicates that the intervals returned for my model by confint() are virtually identical, so that is not the source of my problem:

                    2.5 %       97.5 %
(Intercept)     8.7216693    8.9844958
dispersal     -37.5643922  -35.6652276
dispersal_sq   66.8121519   74.4534753
dispersal_cb  -74.5995820  -62.8377766
dispersal_qt   22.6592724   28.6509494

These tables show the problem: the 95% confidence limits for the quartic term are every bit as wide as the limits for the other terms. Since the quartic coefficient gets multiplied by the fourth power of x, the width of the confidence band starts out nice and narrow (when x is close to zero, where the band width is pretty much determined by the confidence limits on the intercept) but balloons out to be tremendously wide for larger values of x. That width looks quite unreasonable to me, given that my fit is constrained by 300,000 data points. Intuitively, I would have expected the confidence limits for the quartic term to be much narrower than those for, say, the linear term, because the effect of variation in the quartic term is so much larger. But the way confidence intervals are calculated in R, at least, does not seem to follow my instincts here. In other words, predicted values using the 2.5% value of the linear coefficient versus its 97.5% value are really not very different, but predicted values using the 2.5% value of the quartic coefficient versus its 97.5% value are enormously, wildly divergent -- far beyond the range of variability in the original data, in fact. I feel quite confident, in a seat-of-the-pants sense, that the actual 95% limits of that quartic coefficient are much, much narrower than what R is giving me.

Looking at it another way: if I just exclude the quartic term from the glm() fit, I get a dramatically narrower confidence band, even though the fit to the data is not nearly as good (I'm keeping the fourth power for good reasons). This again seems to me to indicate that the confidence band with the quartic term included is artificially, and incorrectly, wide.
If a fit with reasonably narrow confidence limits can be produced for the model with terms only up to cubic (or, taking this reductio even further, for a quadratic or a linear model), why does adding the quartic term, which improves the quality of the fit to the data, nevertheless cause the confidence limits to balloon outwards? That seems fundamentally illogical to me, but maybe my instincts are bad. So what's going on here? Is this just a limitation of the way confint() computes confidence intervals? Is there a better function to use that is compatible with binomial glm() fits? Or am I misinterpreting the meaning of the confidence limits in some way? Or what? Thanks for any advice. I've had trouble finding examples of polynomial fits like this; people sometimes fit the square of the explanatory variable, of course, but going up to fourth powers seems to be quite uncommon, so I've had trouble finding out how others have dealt with the sorts of issues that crop up only with higher powers.

Ben Haller
McGill University
http://biology.mcgill.ca/grad/ben/
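[Editor's note: the intervals above are per-coefficient and ignore the strong covariance among the polynomial coefficients, so combining them into a band overstates its width enormously. A pointwise band on the fitted curve, which does account for that covariance, is usually what is wanted; a sketch with made-up data, not the poster's dataset:]

set.seed(1)
x <- runif(5000)
outcome <- rbinom(5000, 1, plogis(1 - 4 * x + 3 * x^2))
model <- glm(outcome ~ x + I(x^2) + I(x^3) + I(x^4), family = binomial)

newd <- data.frame(x = seq(0, 1, length.out = 200))
pr <- predict(model, newd, type = "link", se.fit = TRUE)   # on the logit scale
fit   <- plogis(pr$fit)
lower <- plogis(pr$fit - 1.96 * pr$se.fit)                 # pointwise 95% band
upper <- plogis(pr$fit + 1.96 * pr$se.fit)
# plot(newd$x, fit, type = "l"); lines(newd$x, lower, lty = 2); lines(newd$x, upper, lty = 2)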
Re: [R] Confidence intervals and polynomial fits
> From what you have written, I am not exactly sure what your
> seat-of-the-pants sense is coming from. My pantseat typically does not
> tell me much; however, quartic trends tend to be less stable than linear,
> so I am not terribly surprised.

My pantseat is not normally very informative either, but if you saw the width of the confidence limits I'm getting for the quartic coefficient, I think your pantseat would agree with mine. :-> The confidence band is staggeringly wide, many times the variation in the data itself; and with 300,000 data points to fit to, that just should not be the case. With that many data points, it seems quite obvious that one can be "95% confident" that the true data lie within a band reasonably close to the band the sample data actually fall into, not a band many times wider.

> As two side notes:
>
> x_qt <- x^4 # shorter code-wise
> and you can certainly just enter a quartic term if the data are just
> quartic and you are not really interested in the lower trends.

Yep, for sure. Thanks!

Ben Haller
McGill University
http://biology.mcgill.ca/grad/ben/
Re: [R] Confidence intervals and polynomial fits
On May 6, 2011, at 12:31 PM, David Winsemius wrote:

> On May 6, 2011, at 11:35 AM, Ben Haller wrote:
>
>> Hi all! I'm getting a model fit from glm() (a binary logistic regression
>> fit, but I don't think that's important) for a formula that contains powers
>> of the explanatory variable up to fourth. So the fit looks something like
>> this (typing into mail; the actual fit code is complicated because it
>> involves step-down and so forth):
>>
>> x_sq <- x * x
>> x_cb <- x * x * x
>> x_qt <- x * x * x * x
>> model <- glm(outcome ~ x + x_sq + x_cb + x_qt, binomial)
>
> It might have been easier with:
>
> model <- glm(outcome ~ poly(x, 4), binomial)

Interesting. The actual model I'm fitting has lots more terms, and needs to be able to be stepped down; sometimes some of the terms of the polynomials will get dropped while others get retained, for example. But more to the point, poly() seems to be doing something tricky involving "orthogonal polynomials" that I don't understand. I don't think I want whatever that is; I want my original variable x, plus its powers. For example, if I do:

x <- runif(10)
poly(x, 3)

the columns I get are not x, x^2, x^3; they are something else. So the poly() fit is not equivalent to my fit.

> Since the authors of confint might have been expecting a poly() formulation
> the results might be more reliable. I'm just using lm() but I think the
> conclusions are more general:
>
> ...
>
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)
> (Intercept)  0.45737    0.52499   0.871   0.3893
> x           -0.75989    1.15080  -0.660   0.5131
> x2           1.30987    0.67330   1.945   0.0594 .
> x3          -0.03559    0.11058  -0.322   0.7494
>
> ...
>
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)
> (Intercept)   5.4271     0.1434  37.839  < 2e-16 ***
> poly(x, 3)1  30.0235     0.9184  32.692  < 2e-16 ***
> poly(x, 3)2   8.7823     0.9184   9.563 1.53e-11 ***
> poly(x, 3)3  -0.2956     0.9184  -0.322    0.749

Here, in your illustration, is underscored what I mean. Whatever these orthogonal polynomial terms are that you're using, they are clearly not the original x, x^2 and x^3, and they're giving you a different fit than those do. I probably ought to learn about this technique, since it looks interesting; but for my purposes I need the fit to actually be in terms of x, since x is my explanatory variable. And the fit I'm getting is highly significant (all terms < 2e-16), so the lack-of-fit problem you're illustrating does not seem to apply to my case. Or maybe I'm totally misunderstanding your point...? :->

Thanks!

Ben Haller
McGill University
http://biology.mcgill.ca/grad/ben/
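[Editor's note: poly(x, 4, raw = TRUE) gives the plain powers x, x^2, x^3, x^4, so it reproduces the hand-built fit, while the default orthogonal version is a different parameterization of the same fitted curve with decorrelated coefficients. A quick sketch with simulated data:]

set.seed(1)
x <- runif(200)
y <- rbinom(200, 1, plogis(1 - 3 * x + 2 * x^2))

m_raw  <- glm(y ~ poly(x, 4, raw = TRUE), family = binomial)  # same as x + I(x^2) + I(x^3) + I(x^4)
m_orth <- glm(y ~ poly(x, 4), family = binomial)              # orthogonal polynomials

all.equal(fitted(m_raw), fitted(m_orth))  # the fitted curves agree (up to numerical error)
round(cor(poly(x, 4, raw = TRUE)), 2)     # raw power columns: strongly correlated
round(cor(poly(x, 4)), 2)                 # orthogonal columns: ~zero correlation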
Re: [R] Confidence intervals and polynomial fits
On May 6, 2011, at 1:58 PM, Prof Brian Ripley wrote:

> On Fri, 6 May 2011, Bert Gunter wrote:
>
>> FWIW:
>>
>> Fitting higher order polynomials (say > 2) is almost always a bad idea.
>>
>> See e.g. the Hastie, Tibshirani, et al. book on "Statistical
>> Learning" for a detailed explanation why. The Wikipedia entry on
>> "smoothing splines" also contains a brief explanation, I believe.
>>
>> Your ~0 P values for the coefficients also suggest problems/confusion
>> (!) -- perhaps you need to consider something along the lines of
>> "functional data analysis" for your analysis.
>>
>> Having no knowledge of your issues, these remarks are entirely
>> speculative and may well be wrong. So feel free to dismiss.
>> Nevertheless, you may find it useful to consult your local
>> statistician for help.
>
> That is the main piece of advice I would have given. But if you must DIY,
> consider the merits of orthogonal polynomials. Computing individual
> confidence intervals for highly correlated coefficients is very dubious
> practice. Without the example the posting guide asked for, we can only guess
> if that is what is happening.

Thanks to both of you. Yes, I am admittedly out of my depth; the statistician I would normally ask is on sabbatical, and I'm a bit at sea. Of course McGill has a whole department of mathematics and statistics; I guess I ought to try to make a friend over there (I'm in the biology department). Anyhow, I've just downloaded the Hastie et al. book and will try to figure out whether my use of higher-order polynomials is incorrect in my situation. Eliminating those would certainly solve my problem with the confidence intervals. :-> I was figuring that the ~0 P-values for the coefficients were just the result of my having 300,000 data points; I figured the regression procedure was thus able to pin down very accurate estimates of them. I'll look into "functional data analysis" as you recommend, though; I'm entirely unfamiliar with it.

As for correlated coefficients: x, x^2, x^3, etc. would obviously be highly correlated, for values close to zero. Is this what you mean, as a potential source of problems? Or if you mean that the various other terms in my model might be correlated with x, that is not the case; each independent variable is completely uncorrelated with the others (this data comes from simulations, so the independent variables for each data point were in fact chosen by random drawing).

It didn't seem easy to post an example, since my dataset is so large, but if either of you would be willing to look at this further, I could upload my dataset to a web server somewhere and post a link to it. In any case, thanks very much for your help; I'll look into the things you mentioned.

Ben Haller
McGill University
http://biology.mcgill.ca/grad/ben/
Re: [R] Confidence intervals and polynomial fits
On May 6, 2011, at 4:27 PM, David Winsemius wrote:

> On May 6, 2011, at 4:16 PM, Ben Haller wrote:
>
>> As for correlated coefficients: x, x^2, x^3 etc. would obviously be highly
>> correlated, for values close to zero.
>
> Not just for x close to zero:
>
> > cor( (10:20)^2, (10:20)^3 )
> [1] 0.9961938
> > cor( (100:200)^2, (100:200)^3 )
> [1] 0.9966219

Wow, that's very interesting. Quite unexpected, for me. Food for thought. Thanks!

Ben Haller
McGill University
http://biology.mcgill.ca/grad/ben/
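[Editor's note: the correlation among power terms also drops sharply if x is centered before the powers are formed, and vanishes entirely with poly(); a quick illustration using the same numbers:]

x <- 10:20
cor(x^2, x^3)              # ~0.996: raw powers are nearly collinear
xc <- x - mean(x)
cor(xc^2, xc^3)            # essentially zero once x is centered
round(cor(poly(x, 3)), 3)  # orthogonal polynomial columns: zero correlation by construction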
[R] Questions regarding the lasso and glmnet
the product of the Cx column and the Cy column, and so it is not centered/scaled. I don't know if this is correct or not. So question 3:

3. Is my model matrix correct, or do I have a problem with the scale of the interaction variables?

Having run glmnet(), I now have lots of coefficients for the different values of lambda. One thing that seems problematic is that the coefficient for an interaction (Cx:Cy, say) will become non-zero while the coefficients for the underlying terms Cx and Cy are still zero. Of course glmnet() has no way of preventing this, since it doesn't know the formula or which matrix columns are interactions, etc. From my previous work on model selection via step-down, my understanding is that this state of affairs should not be allowed. Using step-down, Cx and Cy will not be dropped as long as Cx:Cy is retained, for example. As I understand it, this is for essentially the same conceptual reasons that one ought to retain the intercept in a regression even if it is non-significant; the slope of the regression depends upon it, and so dropping it is not correct. However, I don't see any way that I can tell glmnet() about these relationships, since it doesn't have a functional interface. So:

4. Is it a problem that the lasso fit gives non-zero coefficients for interactions whose underlying terms have zero coefficients?

And then comes the actual model selection problem. The best lambda value given by cv.glmnet() is for a model that contains most of the 116 terms; I suppose that, as with the BIC step-down, this is because 300,000 rows is a lot of data, and so most of those terms are in fact well supported by the data. But I want a simple explanatory model with only the terms of large effect, as described before. I'm thinking of finding the prediction rate for each lambda value used by glmnet, and using the same subjective "simplest model with a pretty good prediction rate" criterion that I was using before. But I'm wondering:

5. Is there any way to choose a simple explanatory model, smaller than the best predictive model supported by the data, that is less arbitrary / subjective?

And finally, I've got sort of a philosophical question that I'm struggling with, one that makes me wonder whether the lasso is really the right tool for my job in the first place. As lambda decreases, the coefficients of terms diverge from zero, one by one, until all terms have non-zero coefficients and you've arrived at the full model. The funny thing is, for most values of lambda, some term or another in my formula will have only just recently diverged from zero, and so that term will have a very small coefficient -- the shrinkage aspect of the lasso will have shrunk it almost, but not quite, to zero. This is odd, given that my goal is a simple explanatory model; having terms with very small but non-zero coefficients is contrary to the goals of my statistical analysis. The lasso is considered to be better for producing explanatory models than some other methods, because coefficients do eventually go to zero, instead of just getting smaller and smaller; but still, I wonder if I ought to be using some more extreme method of model selection that, like the step-down I was using before, either keeps a term in or drops it, with no in-between. On the other hand, I gather that the shrinkage that the lasso does is considered to be a good thing. I don't really understand enough about all this to know what to do. So:
6. Should I be looking at a method other than the lasso, so that I would not get some terms with very small coefficients? Or should I use the lasso, and just ignore, in my explanation of the model, terms with tiny coefficients? Or is there some other way of handling this problem?

Thanks to anybody who has even read this far, and many many thanks for any responses!

Ben Haller
McGill University
http://biology.mcgill.ca/grad/ben/
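[Editor's note: two pointers that bear on questions 4 and 5 above. For question 5, cv.glmnet() already reports a "lambda.1se" value -- the largest lambda whose cross-validated error is within one standard error of the minimum -- which is a standard, less arbitrary way to pick a sparser model than "lambda.min". For question 4, plain glmnet() cannot enforce the main-effects-before-interactions hierarchy, but dedicated packages exist that do (e.g. hierNet or glinternet). A sketch, where X and y stand for the model matrix and outcome described above:]

library(glmnet)
# X: the standardized model matrix (main-effect and interaction columns); y: the binary outcome
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)  # alpha = 1 is the lasso
coef(cvfit, s = "lambda.min")   # best cross-validated prediction
coef(cvfit, s = "lambda.1se")   # sparser model within one SE of the best error
# plot(cvfit) draws the CV curve with both lambda values marked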
[R] gam() (in mgcv) with multiple interactions
Hi! I'm learning mgcv, and reading Simon Wood's book on GAMs, as recommended to me earlier by some folks on this list. I've run into a question to which I can't find the answer in his book, so I'm hoping somebody here knows.

My outcome variable is binary, so I'm doing a binomial fit with gam(). I have five independent variables, all continuous, all uniformly distributed in [0, 1]. (This dataset is the result of a simulation model.) Let's call them a, b, c, d, e for simplicity. I'm interested in interactions such as a*b, so I'm using tensor product smooths such as te(a,b). So far so good. But I'm also interested in, let's say, a*d. So ok, I put te(a,d) in as well. Both of these have a as a marginal basis (if I'm using the right terminology; all I mean is, both interactions involve a), and I would have expected them to share that basis; I would have expected them to be constrained such that the effect of a when b=0, for one, would be the same as the effect of a when d=0, for the other. This would be just as, in a GLM, the formula a*b + a*d expands to a + b + d + a:b + a:d, and there is only one "a"; a doesn't get to be different for the a*b interaction than it is for the a*d interaction. But with tensor product smooths in gam(), that does not seem to be the case. I'm still just getting to know mgcv and experimenting with things, so I may be doing something wrong; but the plots I have done of fits of this type appear to show different marginal effects.

I tried explicitly including terms for the marginal basis; in my example, I tried a formula like te(a) + te(b) + te(d) + te(a, b) + te(a, d). No dice; in this case, the main effect of a is different between all three places where it occurs in the model. I.e. te(a) shows a different effect of a than te(a, b) shows at b=0, which is again different from the effect shown by te(a, d) at d=0. I don't even know what that could possibly mean; it seems wrong to me that this could even be the case, but what do I know. :->

I could move up to a higher-order tensor like te(a,b,d), but there are three problems with that. One, the b:d interaction (in my simplified example) is then also part of the model, and I'm not interested in it. Two, given the set of interactions that I *am* interested in, I would actually be forced to do the full five-way te(a,b,c,d,e), and with a 300,000-row dataset, I shudder to think how long that would take to run, since it would have something like 5^5 free parameters to fit; that doesn't seem worth pursuing. And three, interpretation of a five-way interaction would be unpleasant, to say the least; I'd much rather be able to stay with just the two-way (and one three-way) interactions that I know are of interest (I know this from previous logistic regression modelling of the dataset).

For those who like to see the actual R code, here are two fits I've tried:

gam(outcome ~ te(acl, dispersal) + te(amplitude, dispersal) + te(slope, curvature, amplitude), family=binomial, data=rla, method="REML")

gam(outcome ~ te(slope) + te(curvature) + te(amplitude) + te(acl) + te(dispersal) + te(slope, curvature) + te(slope, amplitude) + te(curvature, amplitude) + te(acl, dispersal) + te(amplitude, dispersal) + te(slope, curvature, amplitude), family=binomial, data=rla, method="REML")

So. Any advice? How can I correctly do a gam() fit involving multiple interactions that involve the same independent variable? Thanks!
Ben Haller
McGill University
http://biology.mcgill.ca/grad/ben/
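[Editor's note for later readers: versions of mgcv released after this exchange added ti() terms, which build tensor-product interactions with the marginal main-effect smooths excluded, so a shared main effect can be specified once and the interactions added separately. A sketch in the notation of the second fit above:]

gam(outcome ~ s(slope) + s(curvature) + s(amplitude) + s(acl) + s(dispersal) +
      ti(slope, curvature) + ti(slope, amplitude) + ti(curvature, amplitude) +
      ti(acl, dispersal) + ti(amplitude, dispersal) +
      ti(slope, curvature, amplitude),
    family = binomial, data = rla, method = "REML")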
Re: [R] gam() (in mgcv) with multiple interactions
On Jun 9, 2011, at 11:35 AM, Simon Wood wrote:

> I think that the main problem here is that smooths are not constrained to
> pass through the origin, so the covariate taking the value zero doesn't
> correspond to no effect in the way that you would like it to. Another way of
> putting this is that smooths are translation invariant, you get essentially
> the same inference from the model y_i = f(x_i) + e_i as from y_i = f(x_i + k)
> + e_i (which implies that x_i=0 can have no special status).

OK, I understand the translation invariance, I think, and why x_i=0 has no special status. I don't understand the consequence that you are saying follows from that, though. Onward...

> All mgcv does in the case of te(a) + te(b) + te(d) + te(a, b) +
> te(a, d) is to remove the bases for te(a), te(b) and te(d) from the basis of
> te(a,b) and te(a,d). Further constraining te(a,b) and te(a,d) so that
> te(0,b) = te(a,0) = 0 etc wouldn't make much sense (in general 0 might not
> even be in the range of a and b).

Hmm. Perhaps my question would be better phrased as a question of model interpretation. Can I think of the smooth found for te(a) as the "main effect" of a? If so, should I just not be bothered by the fact that te(a,b) at b=0 has a different shape from te(a)? Or is the "main effect" of a really, here, te(a) + te(a,b | b=0) + te(a,d | d=0) (if my notation makes any sense; I'm not much at math)? Or is the whole concept of "main effect" meaningless for these kinds of models -- in which case, how do I interpret what te(a) means? Or perhaps I should not be trying to interpret a by itself; perhaps I can only interpret the interactions, not the main effects. In that case, do I interpret te(a,b) by itself, or do I need to conceptually "add in" te(a) to understand what te(a,b) is telling me? My head is spinning. Clearly I just don't understand what these GAM models even mean, on a fundamental conceptual level.

> In general I find functional ANOVA not entirely intuitive to think about, but
> there is a very good book on it by Chong Gu (Smoothing spline ANOVA, 2002,
> Springer), and the associated package gss is on CRAN.

Is what I'm doing functional ANOVA? I realize ANOVA and regression are fundamentally related; but I think of ANOVA as involving discrete levels (factors) in the independent variables, like treatment groups. My independent variables are all continuous, so I would not have thought of this as ANOVA. Anyhow, OK. I will go get that book today, and see if I can figure all this out. Thanks for your help!

Ben Haller
McGill University
http://biology.mcgill.ca/grad/ben/
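[Editor's note: one concrete way to explore the interpretation question above is to extract each smooth's contribution to the linear predictor with predict(..., type = "terms") and sum the columns that involve a given variable; the combined curve is then the "effect" of that variable at the chosen values of the other covariates. A sketch against the second fit from the original post, with the fitted object name assumed:]

# fit2: the gam() fit with te() main effects plus te() interactions shown earlier
newd <- data.frame(acl = seq(0, 1, length.out = 100),
                   dispersal = 0.5, amplitude = 0.5, slope = 0.5, curvature = 0.5)
tm <- predict(fit2, newd, type = "terms")      # one column per smooth term
acl_cols <- grep("acl", colnames(tm))          # te(acl) plus te(acl,dispersal)
plot(newd$acl, rowSums(tm[, acl_cols, drop = FALSE]), type = "l",
     ylab = "combined contribution of acl (logit scale)")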
[R] error in optim, within polr(): "initial value in 'vmmin' is not finite"
Hi all. I'm just starting to explore ordinal multinomial regression. My dataset is 300,000 rows, with an outcome (an ordinal factor from 1 to 9) and five independent variables (all continuous). My first stab at it was this:

pomod <- polr(Npf ~ o_stddev + o_skewness + o_kurtosis + o_acl_1e + dispersal, rlc, Hess=TRUE)

And that worked; I got a good model fit. However, a variety of other things that I've tried give me this error:

Error in optim(s0, fmin, gmin, method = "BFGS", ...) :
  initial value in 'vmmin' is not finite

This occurs, for example, when I try to use the method="probit" option of polr(). It also occurs when I try a regression involving interactions, such as:

pomod <- polr(Npf ~ o_stddev * o_skewness * o_kurtosis * o_acl_1e * dispersal, rlc, Hess=TRUE)

I have good reason to believe that interactions are important here, so I'd very much like to be able to fit such models. I have been doing that successfully with logistic regression (considering my outcome variable to be binary, either "1" or "2-9") using glm(), but with polr() I get this error. I've searched Google and the R lists for information about this error, and while I did find a couple of other people asking about it, I didn't find any advice about what to do about it that I can apply to my situation.

I'd be happy to share my dataset with anyone willing to help me on this, but 300,000 rows is a bit large to include in this email. :->

Thanks!

Ben Haller
McGill University
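[Editor's note: two workarounds that often help with this particular optim() error, offered as hedged suggestions rather than a tested fix for this dataset: put the continuous predictors on comparable scales, and/or give polr() explicit starting values via its start argument (slope coefficients followed by the ordered cutpoints).]

library(MASS)
# 1) rescale the predictors; wildly different scales can make the initial deviance non-finite
vars <- c("o_stddev", "o_skewness", "o_kurtosis", "o_acl_1e", "dispersal")
rlc_s <- rlc
rlc_s[vars] <- scale(rlc_s[vars])
pomod <- polr(Npf ~ o_stddev * o_skewness * o_kurtosis * o_acl_1e * dispersal,
              data = rlc_s, Hess = TRUE)

# 2) or supply starting values: zeros for the n_coef slope terms plus an
#    increasing sequence for the 8 cutpoints of a 9-level outcome (n_coef is
#    hypothetical -- the number of model-matrix columns, minus the intercept)
# pomod <- polr(..., start = c(rep(0, n_coef), seq(-2, 2, length.out = 8)))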
[R] Evaluating model fits for ordinal multinomial regressions with polr()
Hi all. I'm just starting to explore ordinal multinomial regression. I've been doing logistic regression in the past, and I've gotten used to various functions for getting pseudo-R2 metrics and other metrics to evaluate the quality of a model. For example:

val.prob()
NagelkerkeR2()
ClassLog()
LogRegR2()

But these don't seem to work for the models generated by polr(). Does anyone have any recommendations for ways to evaluate the goodness-of-fit and such things for ordinal multinomial regression models? I've searched Google and looked at the docs for the package (MASS) that defines polr(), but haven't discovered anything. Thanks!

Ben Haller
McGill University
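[Editor's note: one simple option that does work with polr() objects is to compute a McFadden-style pseudo-R^2 by hand from the log-likelihoods of the fitted model and an intercept-only model; a sketch using the pomod fit and rlc data from the earlier message. The pscl package's pR2() may also accept polr fits.]

library(MASS)
null_mod <- polr(Npf ~ 1, data = rlc)               # intercept-only (cutpoints only) model
mcfadden_R2 <- as.numeric(1 - logLik(pomod) / logLik(null_mod))
AIC(pomod, null_mod)                                # another quick comparison of the two fits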
Re: [R] multi process support in R
On Feb 17, 2011, at 11:40 AM, Alaios wrote:

> ...Is it possible to split work in many cores in R and if yes how is this
> library called?

I'd recommend the "mclapply" function in the "multicore" package. The only drawback is that you can't run your code in a GUI any more.

Ben Haller
McGill University
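[Editor's note for later readers: mclapply() now ships in the base "parallel" package (since R 2.14), which superseded "multicore"; a minimal sketch with a made-up task:]

library(parallel)
res <- mclapply(1:8, function(i) {
  mean(rnorm(1e6, mean = i))   # some expensive, independent piece of work per task
}, mc.cores = 4)               # forked workers; on Windows mc.cores must be 1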
Re: [R] error in optim, within polr(): "initial value in 'vmmin' is not finite"
An update for the benefit of the list/posterity: I resolved this issue by switching over to using the lrm() function of package rms. It seems to pick better starts, or something; in any case, it has been able to converge on a solution for every model I've tried, although for the most complex ones I needed to raise maxit (maximum iterations) slightly above the default of 12. The lrm() function does not support interactions higher than third-order, and it does only logistic regressions, not probit or other types, so it does have its drawbacks; but it has solved my difficulties quite nicely. Just in case anybody cares. :->

Ben Haller
McGill University

On Feb 16, 2011, at 5:41 PM, Ben Haller wrote:

> Hi all. I'm just starting to explore ordinal multinomial regression. My
> dataset is 300,000 rows, with an outcome (ordinal factor from 1 to 9) and
> five independent variables (all continuous). My first stab at it was this:
>
> pomod <- polr(Npf ~ o_stddev + o_skewness + o_kurtosis + o_acl_1e + dispersal, rlc, Hess=TRUE)
>
> And that worked; I got a good model fit. However, a variety of other things
> that I've tried give me this error:
>
> Error in optim(s0, fmin, gmin, method = "BFGS", ...) :
>   initial value in 'vmmin' is not finite
>
> This occurs, for example, when I try to use the method="probit" option of
> polr(). It also occurs when I try a regression involving interactions, such as:
>
> pomod <- polr(Npf ~ o_stddev * o_skewness * o_kurtosis * o_acl_1e * dispersal, rlc, Hess=TRUE)
>
> I have good reason to believe that interactions are important here, so I'd
> very much like to be able to fit such models. I have been doing that
> successfully with logistic regression (considering my outcome variable to be
> binary, either "1" or "2-9") using glm(), but now using polr() it gives me
> this error. I've searched Google and the R lists for information about this
> error, and while I did find a couple of other people asking about it, I
> didn't find any advice about what to do about it that I can apply to my
> situation.
>
> I'd be happy to share my dataset with anyone willing to help me on this, but
> 300,000 rows is a bit large to include in this email. :->
>
> Thanks!
>
> Ben Haller
> McGill University
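[Editor's note: for concreteness, a sketch of the lrm() route described above, using the variable names from the quoted polr() call; the maxit value is illustrative, and the ^3 expansion keeps interactions to third order, the limit mentioned above.]

library(rms)
pomod <- lrm(Npf ~ (o_stddev + o_skewness + o_kurtosis + o_acl_1e + dispersal)^3,
             data = rlc, maxit = 25)   # maxit raised above its default of 12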
Re: [R] SPDEP Package, neighbours list for Moran's I on large grid dataset
Laurent Jégou wrote:

> Hello list members, I'd like to calculate the Moran's I on a large dataset,
> a 8640x3432 grid of values.
>
> When I try to create the neighbours list with the cell2nb function, on
> such a scale, R works for several hours and seems to crash. I'm using the last
> version (2.13), 64 bits, on a Mac Pro with 4 GB of memory.
>
> I think that I'm doing it wrong :-)
>
> I'd like to use a "moving window" weight matrix instead of a full scale
> one, but I was not lucky enough to find a solution in the docs.

I confronted this problem recently, and decided to roll my own. For a large dataset, an exhaustive computation of Moran's I is very time-consuming, as you say; but an estimate of it, based on a subset of pairs chosen randomly, seemed (for my data, at least) to converge quite nicely.

Here's my code. It assumes a square grid, so you'll need to adapt it for your non-square grid, but that should be trivial. To determine the right number of pairs for a good estimate in my application, I called this function with various numbers of pairs, plotted the estimate produced as a function of the number of pairs used, and observed good convergence by 200,000 pairs. You will probably want to do this yourself, for your dataset. 200,000 pairs took just a second or two to run, on my machine, so this is much, much faster than an exhaustive calculation. As a bonus, this code also gives you Geary's C. Oh, and the code below assumes a wraparound lattice (i.e. a torus, i.e. periodic boundaries), which may not be what you want; just get rid of the d_wraparound stuff and the following pmax() call if you want non-wraparound boundaries, and that should work fine, I think.

Not sure what you mean by the moving window thing, so I leave that as an exercise for the reader. :->

I've never made an R package, and I'm not sure there's much demand for this code (is there?), so I'm presently unlikely to package it up. However, if someone who owns a package related to this problem would like to adopt this code, generalize it to non-square lattices, add a flag to choose periodic or non-periodic boundaries, etc., feel free to do so; I hereby place it in the public domain. I'd just like to hear about it if someone does this, and receive credit somewhere, that's all. I'm not super-good in R, either, so if there's a better way to do some things, like the somewhat hacked random index generation (trying to avoid floating point issues when generating a random integer), please let me know. I'm always interested in learning how to do things better. And of course this code is provided without warranty, may have bugs, etc. Enjoy!

Ben Haller
McGill University

correlation_stats <- function(p, n_pairs=20) {
    # Compute Moran's I and Geary's C for the landscape p. This is tricky because the exact calculation
    # would be far too time-consuming, as it involves comparison of every point to every other point.
    # So I am trying to estimate it here with a small subset of all possible point comparisons.
    p_size <- NROW(p)       # 512
    p_length <- length(p)   # 262144
    mean_p <- mean(p)
    pv <- as.vector(p)

    # select points and look up their values; for speed, points are selected by their vector index
    p1ix <- floor(runif(n_pairs, min=0, max=p_length - 0.001))
    p1x <- (p1ix %/% p_size) + 1
    p1y <- (p1ix %% p_size) + 1
    p1ix <- p1ix + 1

    p2ix <- floor(runif(n_pairs, min=0, max=p_length - 0.001))
    p2x <- (p2ix %/% p_size) + 1
    p2y <- (p2ix %% p_size) + 1
    p2ix <- p2ix + 1

    v1 <- pv[p1ix] - mean_p
    v2 <- pv[p2ix] - mean_p

    # Calculate distances between point pairs, both directly and wrapping around.
    # The average direct distance is much smaller than the average wraparound distance,
    # because points near the center vertically have a maximum direct distance near 256,
    # but a minimum wraparound distance near 256. The Moran's I estimate from wraparound
    # distances is different, as a result. Rupert recommends taking the shorter of the
    # two distances, whichever it is, because that keeps us within the meaningful scale
    # of our space, before it just starts repeating due to periodicity.
    d_direct <- 1 / sqrt(((p1x - p2x) ^ 2) + ((p1y - p2y) ^ 2))
    d_direct[is.infinite(d_direct)] <- 0

    d_wraparound <- 1 / sqrt(((p1x - p2x) ^ 2) + ((p_size - abs(p1y - p2y)) ^ 2))
    d_wraparound[is.infinite(d_wraparound)] <- 0

    d <- pmax(d_direct, d_wraparound)   # max because we want the min distance, and these are 1/distances
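    # [Editor's note: the archived message is cut off here. The remainder below is a
    # hedged reconstruction sketch, not the author's original code: it turns the sampled,
    # inverse-distance-weighted pairs into the usual Moran's I and Geary's C estimates.]
    w <- d                                  # inverse-distance weights for the sampled pairs
    s2 <- mean((pv - mean_p) ^ 2)           # variance of the landscape values
    moran_I <- mean(w * v1 * v2) / (mean(w) * s2)
    geary_C <- mean(w * (v1 - v2) ^ 2) / (2 * mean(w) * s2)
    c(moran_I = moran_I, geary_C = geary_C)
}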
Re: [R] SPDEP Package, neighbours list for Moran's I on large grid dataset
On Apr 22, 2011, at 2:07 AM, Laurent Jégou wrote:

> Hello, thank you very much for this solution, I'll try to adapt it to my
> problem, I think I'll be able to.

Great!

> By the way, I wrote about a "sliding-window" weight matrix; it's an
> expression used in satellite image processing, esp. convolution filters,
> meaning that the main matrix is not confronted with another one of the same
> dimensions, but with a much smaller one. The computation is executed value by
> value (pixels on an image), centering the small matrix on each one, hence the
> name sliding window.

Yeah, I'm familiar with the concept in general, but I don't see how you can really apply it to Moran's I calculations; the Moran's I value is, by definition, based on comparing every point to every other point, and looking only at points within a smaller window risks getting a value that is not representative of the larger context. But if it makes sense to you in your application, then more power to you. :->

Ben Haller
McGill University