[R] independent censoring
Dear All,

Independent censoring is one of the fundamental assumptions in survival analysis. However, I cannot find any test for it, or any paper which discusses how realistic that assumption is. I would be grateful if anybody could point me to some useful references. I have found the following paper to be an interesting reference, but it is not freely available:

Leung, Kwan-Moon, Robert M. Elashoff, and Abdelmonem A. Afifi. "Censoring issues in survival analysis." Annual Review of Public Health 18.1 (1997): 83-104.

Any feedback would be much appreciated.

Kind regards
DK

__ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
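[Editorial note: since the question above is how realistic the independence assumption is, a small simulation may be useful. The sketch below is illustrative (not from the original post) and assumes only base R plus the recommended survival package; dependence between failure and censoring times is induced through a shared gamma frailty, and the Kaplan-Meier estimate is compared with the true marginal survival.]

```r
library(survival)

set.seed(1)
n <- 5000
z <- rgamma(n, shape = 2, rate = 2)   # shared frailty
t.event <- rexp(n, rate = z)          # failure time depends on z
t.cens  <- rexp(n, rate = z)          # censoring time depends on the SAME z
obs.time <- pmin(t.event, t.cens)
status   <- as.numeric(t.event <= t.cens)

km <- survfit(Surv(obs.time, status) ~ 1)
km.at.1 <- summary(km, times = 1)$surv
# true marginal survival at t = 1 is the Laplace transform of the frailty:
# S(1) = E[exp(-z)] = (2 / (2 + 1))^2
true.at.1 <- (2 / 3)^2
c(KM = km.at.1, truth = true.at.1)    # KM is typically biased upward here
```

With positively dependent censoring of this kind, the subjects censored early are exactly the frail ones, so the Kaplan-Meier estimator overestimates survival; no test on the observed data can detect this without extra assumptions.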
[R] removing dropouts from survival analysis
Dear All,

Apologies for posting a question regarding survival analysis, and not R, to the R-help list. In the past I have received the best advice from the R community.

The random censorship model (the censoring times independent of the failure times, and vice versa) is one of the fundamental assumptions in survival analysis. In medical studies we have random entry to the study and a study end, which is a censoring mechanism independent of the failure times. However, in reality we may have dropout subjects, lost to follow-up, who are censored by a different mechanism which may not be independent of the failure times. Including dropout subjects in the survival analysis may break the random censorship model and introduce bias into our KM estimates of survival. I have studied papers on this subject (e.g. double sampling, the copula approach for dependent censoring), but I have not found any research paper which examines the removal of dropout subjects from the survival analysis. I am alone in my research and would be grateful to hear thoughts on this subject.

Thank you in advance, and apologies for using the R-help list for my research question.

DK
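[Editorial note: the concern above can be made concrete with a toy simulation, sketched below under purely illustrative assumptions. Dropouts whose dropout time is tied to their failure time are handled two ways, (i) censored or (ii) deleted outright, and both Kaplan-Meier estimates are compared with the truth. In this particular construction, censoring the dropouts produces a clear upward bias, while deletion is less biased, but neither fix is automatically safe in general.]

```r
library(survival)

set.seed(1)
n <- 10000
t.event <- rexp(n)                        # true failure times, S(t) = exp(-t)
t.admin <- runif(n, 1, 3)                 # independent end-of-study censoring
# 30% of subjects drop out, and their dropout time is tied to their failure time
is.dropper <- runif(n) < 0.3
t.drop <- ifelse(is.dropper, t.event * runif(n), Inf)

obs.time <- pmin(t.event, t.admin, t.drop)
status  <- as.numeric(obs.time == t.event)
dropped <- obs.time == t.drop

km.cens <- survfit(Surv(obs.time, status) ~ 1)                      # dropouts censored
km.del  <- survfit(Surv(obs.time[!dropped], status[!dropped]) ~ 1)  # dropouts deleted
s.cens <- summary(km.cens, times = 1)$surv
s.del  <- summary(km.del,  times = 1)$surv
c(censored = s.cens, deleted = s.del, truth = exp(-1))
```

Deleting observed dropouts is itself a form of outcome-dependent selection (here the retained droppers are disproportionately late failures), which is presumably why the literature treats it with caution.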
[R] coxph (survival) paper from '82
Dear All,

I would like to understand better the coxph() function from the survival R package. There is one paper referenced there which I have not been able to get hold of:

Andersen, P.K. and Gill, R.D. (1982). Cox's regression model for counting processes: a large sample study. Annals of Statistics 10, 1100-1120.

I would be really grateful if someone could help me with an electronic copy of the paper. Writing to the R-help mailing list was my last resort. Thanks in advance.

With kind regards
DK

Damjan Krstajic
Director
Research Centre for Cheminformatics
Belgrade, Serbia
http://www.rcc.org.rs
[R] R survival package error message - bug?!
Dear all,

I have encountered weird behaviour in the R survival package which seems to me to be a bug. It happens when I use 100 variables in the ridge() function when calling coxph() with the following formula:

Surv(time = futime, event = fustat, type = "right") ~ ridge(X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25, X26, X27, X28, X29, X30, X31, X32, X33, X34, X35, X36, X37, X38, X39, X40, X41, X42, X43, X44, X45, X46, X47, X48, X49, X50, X51, X52, X53, X54, X55, X56, X57, X58, X59, X60, X61, X62, X63, X64, X65, X66, X67, X68, X69, X70, X71, X72, X73, X74, X75, X76, X77, X78, X79, X80, X81, X82, X83, X84, X85, X86, X87, X88, X89, X90, X91, X92, X93, X94, X95, X96, X97, X98, X99, X100, theta = lambda, scale = FALSE)

I then get the error message:

Error in model.frame.default(formula = form, data = sdf) : invalid variable names
Calls: coxph -> eval -> eval -> model.frame -> model.frame.default
Execution halted

I am using R 2.8.0 and the latest version of the survival package, 2.35-4.
Here is how you can re-create the error message:

> library(survival)
Loading required package: splines
> x <- as.data.frame(matrix(rnorm(100*500), ncol=100))
> x$y <- 1:500
> x$status <- 1
> lambda <- 1.0
> ff <- as.formula(paste("Surv(y,status)~ridge(", paste(names(x)[1:100], collapse=","), ",theta = lambda, scale = FALSE)"))
> coxph(ff, x)
Error in model.frame.default(formula = ff, data = x) : invalid variable names
> print(ff)
Surv(y, status) ~ ridge(V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12, V13, V14, V15, V16, V17, V18, V19, V20, V21, V22, V23, V24, V25, V26, V27, V28, V29, V30, V31, V32, V33, V34, V35, V36, V37, V38, V39, V40, V41, V42, V43, V44, V45, V46, V47, V48, V49, V50, V51, V52, V53, V54, V55, V56, V57, V58, V59, V60, V61, V62, V63, V64, V65, V66, V67, V68, V69, V70, V71, V72, V73, V74, V75, V76, V77, V78, V79, V80, V81, V82, V83, V84, V85, V86, V87, V88, V89, V90, V91, V92, V93, V94, V95, V96, V97, V98, V99, V100, theta = lambda, scale = FALSE)

I have also found that the code breaks with more than 97 variables; with 97 or fewer it works fine. It breaks at the following line of the coxph function:

if (is.R()) m <- eval(temp, parent.frame())

I have been trying to understand why it breaks there and have not progressed much so far.

With kind regards
DK
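[Editorial note: one possible workaround, avoiding the very long formula whose deparsed form seems to trigger the error, is to pass the covariates as a single matrix argument. This is a sketch under the assumption that ridge() simply cbind()s its arguments and therefore accepts a matrix; it has not been verified against the R 2.8.0 / survival 2.35-4 combination reported above.]

```r
library(survival)

set.seed(1)
x <- as.data.frame(matrix(rnorm(100 * 500), ncol = 100))
x$y <- 1:500
x$status <- 1
lambda <- 1.0

# one matrix argument instead of 100 named variables in the formula
M <- as.matrix(x[, 1:100])
fit <- coxph(Surv(y, status) ~ ridge(M, theta = lambda, scale = FALSE),
             data = x)
length(coef(fit))   # 100 coefficients if the matrix form is accepted
```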
[R] survival - likelihood ratio test for ridge coxph()
It seems to me that R returns the unpenalized log-likelihood for the likelihood ratio test when a ridge Cox proportional hazards model is fitted. Is this as expected? In the example below, if I am not mistaken, fit$loglik[2] is the unpenalized log-likelihood for the final coefficient estimates; I would expect to get the penalized log-likelihood. I would like to check if this is as expected. If not, I am prepared to give more details.

Cheers, DK

For example, if we have the model:

> fit <- coxph(Surv(futime, fustat) ~ ridge(rx, age, ecog.ps, theta=1), data=ovarian)
> fit$loglik
[1] -34.98494 -27.17558
> fit
Call:
coxph(formula = Surv(futime, fustat) ~ ridge(rx, age, ecog.ps, theta = 1), data = ovarian)

               coef   se(coef) se2    Chisq DF p
ridge(rx)      -0.780 0.5862   0.5589  1.77 1  0.1800
ridge(age)      0.123 0.0387   0.0356 10.15 1  0.0014
ridge(ecog.ps)  0.104 0.5729   0.5478  0.03 1  0.8600

Iterations: 1 outer, 4 Newton-Raphson
Degrees of freedom for terms= 2.7
Likelihood ratio test=15.6 on 2.67 df, p=0.000941 n= 26
> fit$loglik[2]
[1] -27.17558
[R] scientific (statistical) foundation for Y-RANDOMIZATION in regression analysis
Dear all,

I am a statistician doing research in QSAR, building regression models where the dependent variable is a numerical expression of some chemical activity and the input variables are chemical descriptors, e.g. molecular weight, number of carbon atoms, etc. I am confronted with a widely used technique called Y-RANDOMIZATION, for which I have difficulty finding references in the general statistical literature on regression analysis. I would be grateful if someone could point me to papers/literature in statistical regression analysis which give a scientific (statistical) foundation for using Y-RANDOMIZATION.

Y-RANDOMIZATION is a technique widely used in the QSAR community to ensure the robustness of a QSPR (regression) model. It is used after the "best" regression model is selected, to make sure that there are no chance correlations. Here is a short description. The dependent variable vector (Y-vector) is randomly shuffled and a new QSPR (regression) model is fitted using the original independent variable matrix. By repeating this a number of times, say 100 times, one gets a hundred R2 and q2 (leave-one-out cross-validation R2) values based on the hundred shuffled Ys. It is expected that the resulting regression models should generally have low R2 and low q2 values. However, if the majority of the hundred regression models obtained in the Y-randomization have relatively high R2 and high q2, then it implies that an acceptable regression model cannot be obtained for the given data set by the current modelling method.

I cannot find any references to Y-randomization or Y-scrambling anywhere in the literature outside chemometrics/QSAR. Any links or references would be much appreciated. Thanks in advance.
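[Editorial note: for readers unfamiliar with the procedure, the description above can be sketched in a few lines of R. This is an illustration on simulated data (all names and settings are arbitrary), using ordinary least squares as the modelling method.]

```r
set.seed(1)
n <- 50; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% rnorm(p)) + rnorm(n)         # data with a real signal

r2 <- function(y, X) summary(lm(y ~ X))$r.squared
obs.r2  <- r2(y, X)                          # R2 of the "real" model
perm.r2 <- replicate(100, r2(sample(y), X))  # 100 Y-randomizations

# the fitted model is suspect unless obs.r2 clearly exceeds the shuffled R2s
mean(perm.r2 >= obs.r2)                      # a permutation-style p-value
```

Viewed this way, Y-randomization is essentially a permutation test on the model-building procedure, which may explain why the statistical literature discusses it under "permutation tests" rather than under the QSAR name.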
DK
------
Damjan Krstajic
Director
Research Centre for Cheminformatics
Belgrade, Serbia
[R] confidence intervals for non-linear regression
Dear all,

I would like to calculate confidence intervals for fitted values of non-linear regressions in general. Let's say we have y = f(x1, x2, ..., xN), where f() is a non-linear regression. I would like to calculate a confidence interval for a new prediction f(a1, ..., aN). I am aware of techniques for calculating confidence intervals for the coefficients of specific non-linear regressions, and of then using them to calculate a confidence interval for the predicted value. However, I am interested in whether there is any generic approach, maybe using the bootstrap, which can be applied to f() where f() is a kind of black box. Any references to the literature or R packages would be very welcome. Thanks in advance.

DK
--
Damjan Krstajic
Director
Research Centre for Cheminformatics
Belgrade, Serbia
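[Editorial note: one generic option along those lines is a case-resampling bootstrap around the black-box fit. Below is an illustrative sketch, not an established package interface: the "black box" here is nls() on a toy exponential model, with tryCatch() guarding against the occasional non-converging resample, and the percentile interval is the simplest (and crudest) choice.]

```r
set.seed(1)
n <- 100
d <- data.frame(x = runif(n, 0, 10))
d$y <- 2 * exp(-0.3 * d$x) + rnorm(n, sd = 0.1)

fit.fun <- function(dat)                  # the "black box": any fit/predict pair
  nls(y ~ a * exp(b * x), data = dat, start = list(a = 1, b = -0.1))

newx <- data.frame(x = 5)
boot.pred <- replicate(200, {
  dstar <- d[sample(n, replace = TRUE), ]        # resample cases
  tryCatch(predict(fit.fun(dstar), newdata = newx),
           error = function(e) NA)               # skip non-converging resamples
})
quantile(boot.pred, c(0.025, 0.975), na.rm = TRUE)  # percentile interval for f(5)
```

Note this gives an interval for the fitted mean f(a1, ..., aN), not a prediction interval for a new observation; for the latter the residual noise has to be added back in.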
[R] statistics and R package for election results
Dear all,

Is there any R package which would help in analysing and comparing the results of two elections? Does anybody know any good papers related to this field? I am a statistician and my main research area so far has been regression and classification modelling; the analysis of election results is new to me. Thanks in advance.

Kind regards
DK
[R] Introducing R to statisticians
Dear all,

I will be presenting the R language and the R software environment to the Statistical Society of Serbia. As I will be doing it for professional statisticians, it seems unnecessary to present how the R language works in detail. I am more interested in presenting the latest facts regarding R (approximate number of users, number of add-on packages, etc.) and, in general, why they should start using it. I would be grateful if anyone could point me to a place where I can find this information. Also, any comparisons with other statistics packages and languages would be much appreciated. Thanks in advance.

With kind regards
DK
[R] survival - summary and score test for ridge coxph()
It seems to me that summary() for a ridge coxph() fit prints a summary but returns NULL. It is not a big issue, because one can calculate the statistics directly from the coxph object. However, for some reason the score test is not calculated for ridge coxph(), i.e. neither the score nor the rscore component is included in the coxph object when ridge is specified. Please find the code below. I am using R 2.9.2 with version 2.35-4 of the survival package. Thanks, DK

> fit <- coxph(Surv(futime, fustat) ~ ridge(rx, age, ecog.ps, theta=1), data=ovarian)
> a <- summary(fit)
Call:
coxph(formula = Surv(futime, fustat) ~ ridge(rx, age, ecog.ps, theta = 1), data = ovarian)

n= 26
               coef   se(coef) se2    Chisq DF p
ridge(rx)      -0.780 0.5862   0.5589  1.77 1  0.1800
ridge(age)      0.123 0.0387   0.0356 10.15 1  0.0014
ridge(ecog.ps)  0.104 0.5729   0.5478  0.03 1  0.8600

               exp(coef) exp(-coef) lower .95 upper .95
ridge(rx)      0.459     2.181      0.145     1.45
ridge(age)     1.131     0.884      1.049     1.22
ridge(ecog.ps) 1.110     0.901      0.361     3.41

Iterations: 1 outer, 4 Newton-Raphson
Degrees of freedom for terms= 2.7
Rsquare= 0.452 (max possible= 0.932 )
Likelihood ratio test= 15.6 on 2.67 df, p=0.000941
Wald test= 12.9 on 2.67 df, p=0.00354
> a
NULL
> names(fit)
 [1] "coefficients" "var" "var2" "loglik" "iter" "linear.predictors" "residuals"
 [8] "means" "method" "df" "df2" "penalty" "pterms" "assign2"
[15] "history" "coxlist2" "printfun" "n" "terms" "assign" "wald.test"
[22] "y" "formula" "call"
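[Editorial note: as the post says, some of the printed statistics can be pulled straight from the coxph object even though summary() returns NULL. A sketch, with component names taken from the names(fit) listing above; the numeric values in the comments are the ones printed in the post.]

```r
library(survival)
fit <- coxph(Surv(futime, fustat) ~ ridge(rx, age, ecog.ps, theta = 1),
             data = ovarian)

fit$wald.test            # the Wald statistic (printed above as 12.9)
2 * diff(fit$loglik)     # the printed "likelihood ratio" statistic (15.6)
sum(fit$df)              # the generalized degrees of freedom (about 2.67)
```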
[R] survival: ridge log-likelihood workaround
Dear all,

I need to calculate a likelihood ratio test for ridge regression. In February I reported a bug whereby coxph returns the unpenalized log-likelihood for the final beta estimates of a ridge coxph regression. In high-dimensional settings, ridge regression models usually fail for lower values of lambda. As a result, in such settings the ridge regressions have higher values of lambda (e.g. over 100), which means that the difference between the unpenalized and the penalized log-likelihood is not negligible. I would be grateful if someone could confirm that the code below is a correct workaround.

ridge.penalty <- function(theta, beta) {
  (theta/2) * sum(beta^2)
}
a.theta <- 100
fit1 <- coxph(Surv(futime, fustat) ~ ridge(rx, age, ecog.ps, theta=a.theta),
              data=ovarian)
# correct log-likelihood for the final beta estimate
ridge.loglikelihood.beta <- fit1$loglik[2] - ridge.penalty(a.theta, fit1$coef)

Thanks in advance
DK
Re: [R] survival: ridge log-likelihood workaround
I would like to clarify the statistics that ridge coxph returns. Here is my understanding; please correct me where I am wrong.

1) In his paper, Gray [JASA 1992] suggests a Wald-type statistic with a formula for the degrees of freedom. The summary function for ridge coxph returns degrees of freedom and a Wald test which are equivalent to what Gray wrote.

2) summary for ridge coxph prints a likelihood ratio test which is not, if I may say, a proper likelihood ratio test. First, it is based on unpenalized log-likelihoods, and the above-defined degrees of freedom are used. I accept that there is nothing which suggests that one should use penalized log-likelihoods. However, there is also nothing published which suggests that unpenalized log-likelihoods should be used with the above-defined degrees of freedom. I have found that Therneau & Grambsch in their excellent book discuss this in a paragraph and mention that the p-value thus returned is somewhat conservative (p too large). Therefore, the likelihood ratio test that ridge coxph returns is not a true one, and the statistic returned (i.e. 2*(loglik(beta)-loglik(0))) has a distribution which is somewhat more compact than the chi-square. I like conservative p-values and like to be on the safe side. However, in my work, Wald test p-values for ridge regression are much higher than the "likelihood ratio test" p-values, and I do not get the impression that the latter are conservative.

3) There is no efficient score test for ridge regression, as there is no penalized efficient score test. That is OK.

4) "Rsquare" and "max possible" are returned, and I have so far failed to find exact references for them.

I would like to add a note here that the coxph algorithm works really fast with high-dimensional covariates, and all compliments to the people who developed it. There was evidently a lot of effort put into making it all work fast and correctly, and, in my opinion, it is a shame that the last bit - summary for ridge coxph - is a bit, if I may say, shaky.
In my opinion, only the Wald test and the degrees of freedom as defined by Gray deserve to be part of summary for ridge coxph, and I look forward to being corrected. For my part, I am prepared, in my spare time, to write the code so that summary for ridge coxph does not return NULL and so that the statistics printed in summary for ridge coxph are based on published papers.

Damjan Krstajic

> Subject: Re: [R] survival: ridge log-likelihood workaround
> From: thern...@mayo.edu
> To: r-help@r-project.org; dkrsta...@hotmail.com
> Date: Fri, 10 Dec 2010 09:07:42 -0600
>
> -- begin inclusion -
> Dear all,
>
> I need to calculate likelihood ratio test for ridge regression. In
> February I have reported a bug where coxph returns unpenalized
> log-likelihood for final beta estimates for ridge coxph regression. In
> high-dimensional settings ridge regression models usually fail for lower
> values of lambda. As the result of it, in such settings the ridge
> regressions have higher values of lambda (e.g. over 100) which means
> that the difference between unpenalized log-likelihood and penalized
> log-likelihood is not insignificant. I would be grateful if someone can
> confirm that the below code is correct workaround.
>
> --- end included message
>
> First, the "bug" you report is not a bug. The log partial likelihood
> from a Cox model LPL(beta) is well defined for any vector of
> coefficients beta, whether they are result of a maximization or taken
> from your daily horoscope. The loglik component of coxph is the LPL for
> the reported coefficients.
>
> For a ridge regression the coxph function maximizes LPL(beta) -
> penalty(beta) = penalized partial likelihood = PPL(beta). You have
> correctly recreated the PPL.
>
> Second: how do you do formal tests on such a model? This is hard. The
> difference LPL1 - LPL2 is a chi-square when each is the result of
> maximizing the Cox LPL over a set of coefficients; when using a PPL we
> are maximizing over something else. The distribution of the difference
> of constrained LPL values can be argued to be a weighted sum of squared
> normals where the weights are in (0,1), which is something more complex
> than a chisq distribution. In a world with infinite free time I'd have
> pursued this, worked it all out, and added appropriate code to coxph.
>
> What about the difference in PPL values, which is the test you propose?
> I'm not aware of any theory showing that these have any relation to a
> chi-square distribution. (Said theory may well exist, and I'd be happy
> for pointers.)
>
> Terry Therneau
[R] bug in glmnet 1.7.1 for multinomial when alpha=0?
Dear all,

If I am not mistaken, I think that I have found a bug in glmnet 1.7.1 (the latest version) for multinomial models when alpha=0. Here is the code:

> library(glmnet)
Loading required package: Matrix
Loading required package: lattice
Loaded glmnet 1.7.1
> x=matrix(rnorm(40*500),40,500)
> g4=sample(1:7,40,replace=TRUE)
> fit=glmnet(x,g4,family="multinomial",alpha=0)
> fit$lambda
[1] NA 0

In my opinion the lambda values for alpha=0 (ridge) should be different from that. Below is an example of the lambdas for alpha=0.03. Thanks, DK

> fit=glmnet(x,g4,family="multinomial",alpha=0.03)
> fit$lambda
  [1] 6.87012630 6.55786846 6.25980322 5.97528550 5.70369955 5.5761
  [7] 5.19699861 4.96078700 4.73531157 4.52008435 4.31463954 4.11853252
 [13] 3.93133886 3.75265344 3.58208955 3.41927805 3.26386659 3.11551881
 [19] 2.97391367 2.83874471 2.70971938 2.58655845 2.46899538 2.35677573
 [25] 2.24965663 2.14740627 2.04980334 1.95663661 1.86770446 1.78281441
 [31] 1.70178274 1.62443409 1.55060105 1.48012384 1.41284993 1.34863372
 [37] 1.28733624 1.22882482 1.17297283 1.11965941 1.06876916 1.02019195
 [43] 0.97382265 0.92956091 0.88731093 0.84698128 0.80848468 0.77173780
 [49] 0.73666112 0.70317874 0.67121818 0.64071028 0.61158901 0.58379134
 [55] 0.55725713 0.53192893 0.50775194 0.48467383 0.46264466 0.44161674
 [61] 0.42154458 0.40238473 0.38409572 0.36663798 0.34997372 0.33406687
 [67] 0.31888302 0.30438929 0.29055433 0.27734818 0.26474228 0.25270934
 [73] 0.24122331 0.23025934 0.21979369 0.20980373 0.20026783 0.19116535
 [79] 0.18247659 0.17418274 0.16626587 0.15870883 0.15149527 0.14460957
 [85] 0.13803685 0.13176286 0.12577403 0.12005741 0.11460061 0.10939184
 [91] 0.10441981 0.09967377 0.09514344 0.09081903 0.08669116 0.08275091
 [97] 0.07898976 0.07539955 0.07197253 0.06870126
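[Editorial note: if it helps others hitting the same behaviour, the automatic lambda path cannot be derived in the usual way when alpha = 0, because the lambda at which all coefficients are zero is not finite for pure ridge. One workaround, sketched below with arbitrary values, is to supply an explicit lambda sequence.]

```r
library(glmnet)

set.seed(1)
x <- matrix(rnorm(40 * 500), 40, 500)
g4 <- sample(1:7, 40, replace = TRUE)

# explicit decreasing lambda path instead of the automatic one
lams <- exp(seq(log(10), log(0.01), length.out = 100))
fit <- glmnet(x, g4, family = "multinomial", alpha = 0, lambda = lams)
range(fit$lambda)   # a proper path rather than c(NA, 0)
```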
[R] differences between 1.7 and 1.7.1 glmnet versions
Dear All,

I have found differences between glmnet versions 1.7 and 1.7.1 which, in my opinion, are not cosmetic and do not appear in the ChangeLog. If I am not mistaken, glmnet appears to return a different number of selected input variables, i.e. nonzeroCoef(fit$beta[[1]]) differs between versions. The code below is the same for 1.7.1 and 1.7, but you can see that the outputs differ. I would automatically use the latest version, but looking at the ChangeLog I wonder if this is a bug or expected behaviour, as the change is not documented. Thanks in advance.

DK

> # glmnet 1.7.1
> library(glmnet)
Loading required package: Matrix
Loading required package: lattice
Loaded glmnet 1.7.1
> set.seed(1)
> x=matrix(rnorm(40*500),40,500)
> g4=sample(1:7,40,replace=TRUE)
> fit=glmnet(x,g4,family="multinomial",alpha=0.1)
> dgcBeta <- fit$beta[[1]]
> which=nonzeroCoef(dgcBeta)
> which
  [1]   1  12  15  17  19  20  34  39  42  58  60  62  63  65  71  72  73  77
 [19]  80  82  85  86  95  97  98  99 106 110 113 114 119 120 123 124 128 130
 [37] 136 138 139 143 148 149 151 160 161 162 173 174 175 176 177 183 186 187
 [55] 188 190 193 194 195 198 199 204 206 218 224 238 239 240 241 245 247 250
 [73] 252 255 256 258 265 266 270 277 278 281 287 293 294 296 297 300 306 308
 [91] 311 316 317 321 326 329 336 337 341 349 354 356 363 365 368 374 376 377
[109] 379 384 385 389 397 398 400 403 404 407 415 417 418 423 424 430 432 437
[127] 440 442 446 450 451 454 456 459 463 467 470 472 474 478 481 488 496 497
[145] 498 500
> # just to check that inputs to glmnet are the same
> g4
 [1] 5 4 5 3 2 6 1 6 6 1 3 6 1 2 6 3 7 2 6 7 6 7 5 1 3 2 2 3 2 3 3 1 5 6 7 4 6 3
[39] 2 7
> x[,1]
 [1] -0.62645381  0.18364332 -0.83562861  1.59528080  0.32950777 -0.82046838
 [7]  0.48742905  0.73832471  0.57578135 -0.30538839  1.51178117  0.38984324
[13] -0.62124058 -2.21469989  1.12493092 -0.04493361 -0.01619026  0.94383621
[19]  0.82122120  0.59390132  0.91897737  0.78213630  0.07456498 -1.98935170
[25]  0.61982575 -0.05612874 -0.15579551 -1.47075238 -0.47815006
0.41794156 [31] 1.35867955 -0.10278773 0.38767161 -0.05380504 -1.37705956 -0.41499456 [37] -0.39428995 -0.05931340 1.10002537 0.76317575 > > glmnet 1.7 > library(glmnet) Loading required package: Matrix Loading required package: lattice Loaded glmnet 1.7 > set.seed(1) > x=matrix(rnorm(40*500),40,500) > g4=sample(1:7,40,replace=TRUE) > fit=glmnet(x,g4,family="multinomial",alpha=0.1) > dgcBeta<- fit$beta[[1]] > which=nonzeroCoef(dgcBeta) > which [1] 1 2 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 [19] 20 21 22 23 24 25 26 27 28 30 31 32 33 34 35 36 37 38 [37] 39 41 42 43 44 45 46 47 48 50 51 52 53 54 55 56 57 58 [55] 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 [73] 77 78 79 80 81 82 83 84 85 86 87 88 89 91 93 94 95 97 [91] 98 99 100 101 102 104 105 106 107 109 110 111 112 113 114 115 116 119 [109] 120 121 122 123 124 126 127 128 130 131 132 133 134 135 136 137 138 139 [127] 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 156 157 159 [145] 160 161 162 163 164 165 167 168 170 171 172 173 174 175 176 177 178 179 [163] 180 181 182 183 184 185 186 187 188 189 190 191 193 194 195 196 197 198 [181] 199 200 203 204 205 206 207 208 209 211 212 213 215 216 217 218 219 220 [199] 221 222 223 224 225 226 227 228 229 231 232 233 234 235 236 237 238 239 [217] 240 241 242 243 244 245 246 247 248 249 250 251 252 253 255 256 257 258 [235] 259 261 262 263 264 265 266 268 269 270 271 272 273 274 275 276 277 278 [253] 279 280 281 282 283 285 286 287 288 289 290 291 292 293 294 295 296 297 [271] 298 299 300 301 302 304 305 306 307 308 309 310 311 312 313 314 315 316 [289] 317 318 319 321 323 324 325 326 327 328 329 330 331 332 333 334 336 337 [307] 338 339 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 [325] 357 358 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 [343] 377 378 379 380 381 382 384 385 386 388 389 390 393 394 395 396 397 398 [361] 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 417 [379] 418 420 421 
422 423 424 425 426 427 428 429 430 432 433 434 436 437 438 [397] 439 440 441 442 443 444 445 446 448 450 451 452 453 454 455 456 457 458 [415] 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 [433] 477 478 479 480 481 482 483 484 486 488 489 490 491 493 494 495 496 497 [451] 498 499 500 > # just to check that inputs to glmnet are the same > g4 [1] 5 4 5 3 2 6 1 6 6 1 3 6 1 2 6 3 7 2 6 7 6 7 5 1 3 2 2 3 2 3 3 1 5 6 7 4 6 3 [39] 2 7 > x[,1] [1] -0.62645381 0.18364332 -0.83562861 1.59528080 0.32950777 -0.82046838 [7] 0.48742905 0.73832471 0.57578135 -0.30538839 1.51178117 0.38984324 [13] -0.62124058 -2.21469989 1.12493092 -0.04493361 -0.01619026 0.94383621 [19] 0.82122120 0.59390132 0.91897737 0.78213630 0.07456498 -1.98935170 [25] 0.6198257
Re: [R] differences between 1.7 and 1.7.1 glmnet versions
Thank you very much. I will use version 1.7.1. Have you had time to look at the issue regarding the weird behaviour of multinomial glmnet when alpha=0? I posted it to r-help more than two weeks ago and maybe you missed it.

Damjan Krstajic

Subject: Re: differences between 1.7 and 1.7.1 glmnet versions
From: has...@stanford.edu
Date: Thu, 29 Dec 2011 20:54:44 -0800
CC: r-help@r-project.org
To: dkrsta...@hotmail.com

I have just started using changelogs, and am clearly not disciplined enough at it. The big change that occurred was the convergence criterion, which would account for the difference. At some point will put up details of this.

Trevor Hastie

On Dec 26, 2011, at 11:55 PM, Damjan Krstajic wrote:

Dear All, I have found differences between glmnet versions 1.7 and 1.7.1 which, in my opinion, are not cosmetic and do not appear in the ChangeLog. If I am not mistaken, glmnet appears to return different number of selected input variables, i.e. nonzeroCoef(fit$beta[[1]]) differes between versions. The code below is the same for 1.7.1 and 1.7, but you can see that outputs differ. I would automatically use the latest version, but by looking at the ChangeLog I wonder if this is a bug or expected behaviour, as this change is not documented. Thanks in advance.
DK
[R] estimating survival times with glmnet and coxph
Dear all,

I am using glmnet (Coxnet) to build a Cox model and to make an actual prediction, i.e. to estimate the survival function S(t,Xn) for a new subject Xn. If I am not mistaken, glmnet (Coxnet) returns beta, beta*X and exp(beta*X), which on their own cannot generate S(t,Xn); we are missing the baseline survival function S0(t). Below is my code, which takes the beta coefficients from glmnet and creates a coxph object (iter=0, init=beta) in order to later calculate survival estimates with survfit. I would be grateful if someone could confirm that my solution with the coxph object (iter=0, init=beta) is correct, and that the warning 'X matrix deemed to be singular' which I get when creating the coxph object is of no concern. The code uses the example from "Coxnet: Regularized Cox Regression". Thanks in advance.

DK

library(survival)
library(glmnet)
load(system.file("doc","VignetteExample.rdata",package="glmnet"))
attach(patient.data)
# leave the first patient out for testing
# and train glmnet on all other patients
trainX <- x[-1,]
trainTime <- time[-1]
trainStatus <- status[-1]
# fit Coxnet
fit <- glmnet(trainX,Surv(trainTime,trainStatus),family="cox",alpha=0.5,maxit=1)
# find lambda for which dev.ratio is max
max.dev.index <- which.max(fit$dev.ratio)
optimal.lambda <- fit$lambda[max.dev.index]
# take beta for optimal lambda
optimal.beta <- fit$beta[,max.dev.index]
# find non-zero beta coefficients
nonzero.coef <- abs(optimal.beta) > 0
selectedBeta <- optimal.beta[nonzero.coef]
# keep only covariates for which beta is non-zero
selectedTrainX <- trainX[,nonzero.coef]
# create coxph object with pre-defined coefficients
coxph.model <- coxph(Surv(trainTime,trainStatus) ~ selectedTrainX, init=selectedBeta, iter=0)
Warning message:
In coxph(Surv(trainTime, trainStatus) ~ selectedTrainX, init = selectedBeta, :
  X matrix deemed to be singular; variable 99 100 101 ... 338 [... truncated]
# take test covariates for which beta is non-zero
selectedTestX <- x[1,nonzero.coef]
# find survival curve for test subject
sfit <- survfit(coxph.model, newdata=selectedTestX)
cat("\ntime ")
cat(sfit$time)
cat("\nsurvival ")
cat(sfit$surv)

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
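For context, the piece that coxph/survfit adds on top of glmnet's linear predictor is the baseline cumulative hazard H0(t), from which S(t|x) = exp(-H0(t) * exp(x'beta)). A minimal base-R sketch of the Breslow estimator with fixed coefficients; the function name, toy data, and coefficients here are my own illustration, not from either package:

```r
# Breslow estimator of the baseline cumulative hazard H0(t),
# given coefficients beta held fixed (e.g. taken from glmnet).
breslow_basehaz <- function(time, status, X, beta) {
  lp <- as.vector(X %*% beta)               # linear predictor per subject
  dtimes <- sort(unique(time[status == 1])) # distinct event times
  h0 <- sapply(dtimes, function(t) {
    d <- sum(time == t & status == 1)       # events at time t
    risk <- sum(exp(lp[time >= t]))         # risk set at time t
    d / risk
  })
  data.frame(time = dtimes, hazard = cumsum(h0))
}

# toy data (illustrative only)
set.seed(1)
n <- 20
X <- matrix(rnorm(n * 2), n, 2)
beta <- c(0.5, -0.3)
time <- rexp(n, rate = exp(X %*% beta))
status <- rbinom(n, 1, 0.8)

H0 <- breslow_basehaz(time, status, X, beta)
# survival curve for a new subject xnew:
xnew <- c(0.1, 0.2)
S <- exp(-H0$hazard * exp(sum(xnew * beta)))
```

The resulting S is a step function over the event times, which is exactly the S(t,Xn) that glmnet alone does not provide.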
[R] (bug?) in survfit in R-2-15-0
Dear all,

I am using glmnet + survival and, due to the latest release of glmnet 1.7.4, I was forced to use the latest version of R, 2.15.0. My previous version of R was 2.10.1. I changed both the glmnet version and the R version, and when I started to get weird results I was not sure where the bug was. After putting glmnet back to the previous version, I found that the bug, or weird behaviour, happens in survival's survfit. My script is below; it uses glmnet 1.7.3. I get two completely different answers in the two versions of R, and in my opinion the older version of the survival package returns the correct value.

in R 2.10.1 I get
> cat(dim(sfit$surv))
> cat(length(sfit$surv))
32

in R 2.15.0 I get
> cat(dim(sfit$surv))
62 99
> cat(length(sfit$surv))
6138

Kind regards
DK

library(survival)
library(glmnet)
load(system.file("doc","VignetteExample.rdata",package="glmnet"))
attach(patient.data)
trainX <- x[-1,]
trainTime <- time[-1]
trainStatus <- status[-1]
fit <- glmnet(trainX,Surv(trainTime,trainStatus),family="cox",alpha=0.5,maxit=1)
max.dev.index <- which.max(fit$dev.ratio)
optimal.lambda <- fit$lambda[max.dev.index]
optimal.beta <- fit$beta[,max.dev.index]
nonzero.coef <- abs(optimal.beta) > 0
selectedBeta <- optimal.beta[nonzero.coef]
selectedTrainX <- trainX[,nonzero.coef]
coxph.model <- coxph(Surv(trainTime,trainStatus) ~ selectedTrainX, init=selectedBeta, iter=0)
selectedTestX <- x[1,nonzero.coef]
sfit <- survfit(coxph.model, newdata=selectedTestX)
cat("\ndim(sfit$surv) ")
cat(dim(sfit$surv))
cat("\nlength(sfit$surv) ")
cat(length(sfit$surv))
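The two results differ in shape, not only in length: in base R a vector has no dim attribute, and a matrix's length is the product of its dimensions, so the two R versions are returning a plain vector of 32 values versus a 62 x 99 matrix (62 * 99 = 6138). A minimal illustration with toy objects of the same shapes, nothing from the packages:

```r
v <- 1:32              # a plain vector, shaped like the R 2.10.1 result
dim(v)                 # NULL: vectors carry no dim attribute
length(v)              # 32

m <- matrix(0, 62, 99) # a matrix, shaped like the R 2.15.0 result
dim(m)                 # 62 99
length(m)              # 6138 = 62 * 99
m[, 1]                 # a single column, i.e. one curve of 62 values
```

So the question reduces to why the newer survfit returns one curve per row of its (defaulted) newdata instead of a single curve.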
[R] survival survfit with newdata
Dear all,

I am confused by the behaviour of survfit with the newdata option. I am using the latest version, R 2.15.0. In the simple example below I build a coxph model on 90 patients and try to predict 10 patients. Unfortunately, the survival curves at the end are for the 90 training patients. Could somebody from the survival package please confirm whether or not this behaviour is expected, because I cannot find a way of using 'newdata' with genuinely new data. Thanks in advance.

DK

> x <- matrix(rnorm(100*20),100,20)
> time <- runif(100,min=0,max=7)
> status <- sample(c(0,1), 100, replace = TRUE)
> trainX <- x[11:100,]
> trainTime <- time[11:100]
> trainStatus <- status[11:100]
> testX <- x[1:10,]
> coxph.model <- coxph(Surv(trainTime,trainStatus) ~ trainX)
> sfit <- survfit(coxph.model, newdata=data.frame(testX))
> dim(sfit$surv)
[1] 90 90
Re: [R] survival survfit with newdata
Thanks David for the prompt reply. I agree with you. However, I still fail to get the survfit function to work with newdata. Continuing my previous example, I changed the column names of the testX matrix and it still fails.

> colnames(testX) <- names(coxph.model$coefficients)
> sfit <- survfit(coxph.model, newdata=data.frame(testX))
Error in model.frame.default(formula = Surv(trainTime, trainStatus) ~ :
  variable lengths differ (found for 'trainX')

What would be the solution in my simple example to get the survival curves for testX? Thanks in advance.

DK

> CC: r-help@r-project.org
> From: dwinsem...@comcast.net
> To: dkrsta...@hotmail.com
> Subject: Re: [R] survival survfit with newdata
> Date: Thu, 17 May 2012 00:52:55 -0400
>
> On May 16, 2012, at 5:08 PM, Damjan Krstajic wrote:
>
> > Dear all,
> >
> > I am confused with the behaviour of survfit with newdata option.
>
> Yes. It has the same behavior as any other newdata/predict from
> regression. You need to supply a dataframe with the same names as in
> the original formula. Doesn't look as though that strategy is being
> followed. The name of the column needs to be 'trainX' since that was
> what was its name on the RHS of the formula, and you may want to
> specify times. If you fail to follow those rules, the function falls
> back on offering estimates from the original data.
>
> > I am using the latest version R-2-15-0. In the simple example below
> > I am building a coxph model on 90 patients and trying to predict 10
> > patients. Unfortunately the survival curve at the end is for 90
> > patients.
>
> As is proper with a malformed newdata argument.
>
> > Could somebody please from the survival package confirm that this
> > behaviour is as expected or not - because I cannot find a way of
> > using 'newdata' with really new data. Thanks in advance.
> > DK
> >
> >> x<-matrix(rnorm(100*20),100,20)
> >> time<-runif(100,min=0,max=7)
> >> status<-sample(c(0,1), 100, replace = TRUE)
> >> trainX<-x[11:100,]
> >> trainTime<-time[11:100]
> >> trainStatus<-status[11:100]
> >> testX<-x[1:10,]
> >> coxph.model<- coxph(Surv(trainTime,trainStatus)~ trainX)
> >> sfit<- survfit(coxph.model,newdata=data.frame(testX))
> >> dim(sfit$surv)
> > [1] 90 90
>
> David Winsemius, MD
> West Hartford, CT
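The error above comes from fitting the model on a bare matrix (`~ trainX`), which leaves nothing that newdata can be matched against by name. A sketch of the usual workaround, assuming the recommended survival package is installed; the column names x1..x3 and variable names are my own illustration:

```r
library(survival)  # recommended package shipped with R

set.seed(1)
x <- matrix(rnorm(100 * 3), 100, 3)
colnames(x) <- paste0("x", 1:3)          # give every covariate a name
time <- runif(100, min = 0, max = 7)
status <- sample(c(0, 1), 100, replace = TRUE)

# keep everything in data frames so survfit can match newdata by name
train <- data.frame(time, status, x)[11:100, ]
test  <- data.frame(x)[1:10, ]

fit  <- coxph(Surv(time, status) ~ x1 + x2 + x3, data = train)
sfit <- survfit(fit, newdata = test)
dim(sfit$surv)   # rows = time points, columns = the 10 new subjects
```

With named columns on both sides, survfit produces one curve per row of newdata instead of falling back on the training data.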