[R] independent censoring

2016-11-28 Thread Damjan Krstajic
Dear All,


Independent censoring is one of the fundamental assumptions in survival 
analysis. However, I cannot find any test for it, nor any paper that discusses 
how realistic the assumption is in practice.


I would be grateful if anybody could point me to some useful references. I have 
found the following paper, which looks like an interesting reference, but it is 
not freely available.


Leung, Kwan-Moon, Robert M. Elashoff, and Abdelmonem A. Afifi. "Censoring 
issues in survival analysis." Annual review of public health 18.1 (1997): 
83-104.


Any feedback would be much appreciated.


Kind regards

DK



__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] removing dropouts from survival analysis

2017-01-23 Thread Damjan Krstajic
Dear All,


Apologies for posting a question about survival analysis, rather than R, to the 
R-help list. In the past I have received the best advice from the R community.


The random censorship model (censoring times independent of the failure times, 
and vice versa) is one of the fundamental assumptions in survival analysis. In 
medical studies we have random entry into the study and a study end, which is a 
censoring mechanism independent of the failure times. In reality, however, we 
may have dropout subjects (lost to follow-up) who are censored by a different 
mechanism that may not be independent of the failure times. Including dropout 
subjects in the survival analysis may break the random censorship model and 
introduce bias into our KM estimates of survival. I have studied papers on this 
subject (e.g. double sampling, copula approaches for dependent censoring), but 
I have not found any research paper that examines removing dropout subjects 
from the survival analysis altogether.
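The concern can be made concrete with a small simulation (a sketch under assumed distributions, not from the original post): generate failure times with a shared frailty, let dropout be more likely for frail subjects, and compare the KM fit with and without the dropout-censored subjects.

```r
library(survival)
set.seed(42)

n       <- 500
frailty <- rexp(n)                         # subject-level frailty
fail    <- rexp(n, rate = 0.1 * frailty)   # frail subjects fail sooner
admin   <- runif(n, 5, 15)                 # administrative censoring (independent)
drop    <- rexp(n, rate = 0.05 * frailty)  # informative dropout: frail drop out earlier

time1   <- pmin(fail, admin, drop)
status1 <- as.numeric(fail <= pmin(admin, drop))
km.all  <- survfit(Surv(time1, status1) ~ 1)

# "Removing" the dropouts, as the question asks about:
keep    <- drop > pmin(fail, admin)
km.sub  <- survfit(Surv(time1[keep], status1[keep]) ~ 1)
```

In setups like this, both estimates can be biased for the marginal survival: dropping the dropout-censored subjects removes a non-random subsample, so it does not by itself restore the random censorship model.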


I am working alone on this research and would be grateful to hear any thoughts 
on the subject. Thank you in advance, and apologies for using the R-help list 
for a research question.


DK




[R] coxph (survival) paper from '82

2009-08-18 Thread Damjan Krstajic

Dear All,

I would like to better understand the coxph() 
function from the survival R package. One paper 
referenced there, which I have not been able to 
get hold of, is:
Andersen, P. K. and Gill, R. D. (1982).
Cox's regression model for counting processes:
a large sample study.
Annals of Statistics 10, 1100-1120.

I would be really grateful if someone could help 
me with an electronic copy of the paper. Writing 
to the R-help mailing list was my last resort.

Thanks in advance.

With kind regards
DK 

Damjan Krstajic
Director
Research Centre for Cheminformatics
Belgrade, Serbia
http://www.rcc.org.rs





[R] R survival package error message - bug?!

2009-08-24 Thread Damjan Krstajic

Dear all,

I have encountered weird behaviour in the R 
survival package which seems to me to be a bug.

The weird behaviour happens when I use 100 variables in the ridge() term 
when calling coxph() with the following formula:

Surv(time = futime, event = fustat, type = "right") ~
    ridge(X1, X2, ..., X100, theta = lambda, scale = FALSE)

and get the error message:

Error in model.frame.default(formula = form, data = sdf) : invalid variable names
Calls: coxph -> eval -> eval -> model.frame -> model.frame.default
Execution halted

I am using R 2.8.0 and the latest version of 
the survival package, 2.35-4.

Here is how you can re-create the error message:

> library(survival)
Loading required package: splines
> x<-as.data.frame(matrix(rnorm(100*500),ncol=100))
> x$y<-1:500
> x$status<-1
> lambda<-1.0
> ff <- as.formula(paste("Surv(y,status)~ridge(",
+   paste(names(x)[1:100], collapse = ","),
+   ",theta = lambda, scale = FALSE)"))
> coxph(ff,x)
Error in model.frame.default(formula = ff, 
data = x) : invalid variable names
> print (ff)
Surv(y, status) ~ ridge(V1, V2, V3, V4, V5, V6, V7, V8, V9, V10,
V11, V12, V13, V14, V15, V16, V17, V18, V19, V20, V21, V22,
V23, V24, V25, V26, V27, V28, V29, V30, V31, V32, V33, V34,
V35, V36, V37, V38, V39, V40, V41, V42, V43, V44, V45, V46,
V47, V48, V49, V50, V51, V52, V53, V54, V55, V56, V57, V58,
V59, V60, V61, V62, V63, V64, V65, V66, V67, V68, V69, V70,
V71, V72, V73, V74, V75, V76, V77, V78, V79, V80, V81, V82,
V83, V84, V85, V86, V87, V88, V89, V90, V91, V92, V93, V94,
V95, V96, V97, V98, V99, V100, theta = lambda, scale = FALSE)

I have also found that the code breaks when the number of variables is greater
than 97; with 97 or fewer it works fine. The failure occurs at the following
line of the coxph() function:

if (is.R())
    m <- eval(temp, parent.frame())

I have been trying to understand why it breaks there and have not
progressed much so far.
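To pin down the threshold programmatically, one can refit with an increasing number of ridge() variables and catch the error; a diagnostic sketch using the same simulated data as above (the helper `works` is my own, not part of any package):

```r
library(survival)
set.seed(1)
x <- as.data.frame(matrix(rnorm(100 * 500), ncol = 100))
x$y <- 1:500
x$status <- 1
lambda <- 1.0

# TRUE if coxph() with p ridge() variables fits without error
works <- function(p) {
  ff <- as.formula(paste("Surv(y,status)~ridge(",
                         paste(names(x)[1:p], collapse = ","),
                         ",theta=lambda,scale=FALSE)"))
  !inherits(tryCatch(coxph(ff, x), error = identity), "error")
}

sapply(95:100, works)  # reportedly TRUE up to p = 97 in survival 2.35-4
```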

With kind regards
DK





[R] survival - ratio likelihood for ridge coxph()

2010-02-15 Thread Damjan Krstajic

It seems to me that R returns the unpenalized log-likelihood for the likelihood 
ratio test when a ridge Cox proportional hazards model is fitted. Is this as 
expected?

In the example below, if I am not mistaken, fit$loglik[2] is the unpenalized 
log-likelihood at the final coefficient estimates. I would expect to get the 
penalized log-likelihood. I would like to check whether this is as expected; if 
not, I am prepared to give more details. Cheers, DK.

For example, if we have a model
> fit <- coxph(Surv(futime, fustat) ~ ridge(rx, age, ecog.ps,
+              theta = 1), data = ovarian)
> fit$loglik
[1] -34.98494 -27.17558
> fit
Call:
coxph(formula = Surv(futime, fustat) ~ ridge(rx, age, ecog.ps, 
theta = 1), data = ovarian)

                 coef  se(coef)  se2     Chisq  DF  p
ridge(rx)      -0.780  0.5862    0.5589   1.77  1   0.1800
ridge(age)      0.123  0.0387    0.0356  10.15  1   0.0014
ridge(ecog.ps)  0.104  0.5729    0.5478   0.03  1   0.8600

Iterations: 1 outer, 4 Newton-Raphson
Degrees of freedom for terms= 2.7 
Likelihood ratio test=15.6  on 2.67 df, p=0.000941  n= 26 
> fit$loglik[2]
[1] -27.17558


  



[R] scientific (statistical) foundation for Y-RANDOMIZATION in regression analysis

2010-03-05 Thread Damjan Krstajic

Dear all,

I am a statistician doing research in QSAR, building regression models where 
the dependent variable is a numerical expression of some chemical activity and 
the input variables are chemical descriptors (e.g. molecular weight, number of 
carbon atoms).

I am building regression models and am confronted with a widely used technique 
called Y-RANDOMIZATION, for which I have difficulty finding references in the 
general statistical literature on regression analysis. I would be grateful if 
someone could point me to papers/literature in statistical regression analysis 
which give a scientific (statistical) foundation for using Y-RANDOMIZATION.

Y-RANDOMIZATION is a technique widely used in the QSAR community to ensure the 
robustness of a QSPR (regression) model. It is applied after the "best" 
regression model has been selected, to make sure that there are no chance 
correlations. Here is a short description. The dependent variable vector 
(Y-vector) is randomly shuffled and a new QSPR (regression) model is fitted 
using the original independent variable matrix. Repeating this a number of 
times, say 100 times, yields one hundred R2 and q2 (leave-one-out 
cross-validation R2) values based on the shuffled Y. The resulting regression 
models are expected to have generally low R2 and low q2 values. However, if the 
majority of the one hundred regression models obtained by Y-randomization have 
relatively high R2 and high q2, this implies that an acceptable regression 
model cannot be obtained for the given data set by the current modelling method.
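For concreteness, the procedure described above can be sketched in R with a plain linear model; the data, and the helpers `loo.q2` and `r2.of`, are illustrative assumptions of mine, not a standard implementation:

```r
set.seed(1)
n <- 50; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- X %*% rnorm(p) + rnorm(n)

# Leave-one-out cross-validated R2 (q2) via the PRESS/hat-matrix shortcut for lm
loo.q2 <- function(X, y) {
  fit   <- lm(y ~ X)
  press <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)
  1 - press / sum((y - mean(y))^2)
}

r2.of <- function(y) summary(lm(y ~ X))$r.squared

# Y-randomization: shuffle y, refit on the original X, record R2 and q2
perm <- replicate(100, {
  ys <- sample(y)
  c(R2 = r2.of(ys), q2 = loo.q2(X, ys))
})

rowMeans(perm)  # expected to be low if the original fit is not a chance correlation
```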

I cannot find any references to Y-randomization or Y-scrambling anywhere in the 
literature outside chemometrics/QSAR. Any links or references would be much 
appreciated.

Thanks in advance.

DK
------
Damjan Krstajic
Director
Research Centre for Cheminformatics
Belgrade, Serbia




[R] confidence intervals for non-linear regression

2010-03-14 Thread Damjan Krstajic

Dear all,

I am interested in calculating confidence intervals for fitted values of 
non-linear regressions in general. Say we have y = f(x1, x2, ..., xN), where 
f() is a non-linear regression model. I would like to calculate a confidence 
interval for a new prediction f(a1, ..., aN). I am aware of techniques for 
calculating confidence intervals for the coefficients of specific non-linear 
regressions, and of using those to compute a confidence interval for the 
predicted value. However, I would like to know whether there is a generic 
approach, perhaps using the bootstrap, that can be applied when f() is 
essentially a black box.
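One generic (if approximate) approach is the nonparametric bootstrap: resample the data, refit the black-box f(), and take percentile intervals of the predictions. A sketch using nls() as a stand-in for f(); the model, data, and starting values are illustrative assumptions:

```r
set.seed(1)
x   <- runif(100, 0, 10)
y   <- 2 * exp(-0.3 * x) + rnorm(100, sd = 0.1)
dat <- data.frame(x, y)
newx <- data.frame(x = 5)   # point at which we want a CI for f(x)

# Resample rows, refit, predict; failed fits are dropped
boot.pred <- replicate(999, {
  d   <- dat[sample(nrow(dat), replace = TRUE), ]
  fit <- try(nls(y ~ a * exp(b * x), data = d,
                 start = list(a = 2, b = -0.3)), silent = TRUE)
  if (inherits(fit, "try-error")) NA else predict(fit, newx)
})

quantile(boot.pred, c(0.025, 0.975), na.rm = TRUE)  # percentile interval for f(5)
```

This treats the refitting routine as a black box, so the same loop works for any model with a `predict` method; the percentile interval is only first-order accurate, and bias-corrected (BCa) variants may be preferable.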

Any references to the literature or R packages would be very welcome.

Thanks in advance.
DK
--
Damjan Krstajic
Director
Research Centre for Cheminformatics
Belgrade, Serbia




[R] statistics and R package for election results

2009-10-08 Thread Damjan Krstajic

Dear all,

Is there an R package that would help in analysing election results between 
two elections? Does anybody know of good papers related to this field? I am a 
statistician and my main research area so far has been regression and 
classification modelling; the analysis of two election results is new to me.

Thanks in advance.

Kind regards
DK

  



[R] Introducing R to statisticians

2009-11-11 Thread Damjan Krstajic

Dear all,

I will be presenting the R language and the R software environment to the 
Statistical Society of Serbia.

Since I will be presenting to professional statisticians, it seems unnecessary 
to cover how the R language works in detail. I am more interested in presenting 
the latest facts about R (approximate number of users, number of add-on 
packages, etc.) and, in general, why they should start using it.

I would be grateful if anyone could point me to a place where I can find this 
information. Also, any comparisons with other statistical packages and 
languages would be much appreciated.

Thanks in advance.

With kind regards
DK



[R] survival - summary and score test for ridge coxph()

2010-12-02 Thread Damjan Krstajic

It seems to me that summary() for a ridge coxph() fit prints a summary but 
returns NULL. This is not a big issue, because one can calculate the statistics 
directly from the coxph object. However, for some reason the score test is not 
calculated for ridge coxph(): neither the score nor the rscore component is 
included in the coxph object when ridge() is specified. Please find the code 
below. I am using R 2.9.2 with version 2.35-4 of the survival package. 
Thanks, DK.

> fit <- coxph(Surv(futime, fustat) ~ ridge(rx, age, ecog.ps,
+              theta = 1), data = ovarian)
> a<-summary(fit)
Call:
coxph(formula = Surv(futime, fustat) ~ ridge(rx, age, ecog.ps, 
theta = 1), data = ovarian)

  n= 26 
                 coef  se(coef)  se2     Chisq  DF  p
ridge(rx)      -0.780  0.5862    0.5589   1.77  1   0.1800
ridge(age)      0.123  0.0387    0.0356  10.15  1   0.0014
ridge(ecog.ps)  0.104  0.5729    0.5478   0.03  1   0.8600

   exp(coef) exp(-coef) lower .95 upper .95
ridge(rx)  0.459  2.181 0.145  1.45
ridge(age) 1.131  0.884 1.049  1.22
ridge(ecog.ps) 1.110  0.901 0.361  3.41

Iterations: 1 outer, 4 Newton-Raphson
Degrees of freedom for terms= 2.7 
Rsquare= 0.452   (max possible= 0.932 )
Likelihood ratio test= 15.6  on 2.67 df,   p=0.000941
Wald test= 12.9  on 2.67 df,   p=0.00354
> a
NULL
> names(fit)
 [1] "coefficients"  "var"   "var2"  "loglik"   
 "iter"  "linear.predictors" "residuals"
 [8] "means" "method""df""df2"  
 "penalty"   "pterms""assign2"  
[15] "history"   "coxlist2"  "printfun"  "n"
 "terms" "assign""wald.test"
[22] "y" "formula"   "call" 
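Until summary() is fixed, the printed statistics can be recomputed directly from the coxph object components listed above; a sketch, assuming `fit$df` holds the per-term generalized degrees of freedom for the penalized fit:

```r
# Likelihood ratio statistic and p-value, as printed by coxph
lr <- 2 * (fit$loglik[2] - fit$loglik[1])
df <- sum(fit$df)                       # generalized (Gray-type) degrees of freedom
pchisq(lr, df = df, lower.tail = FALSE)

# The Wald statistic is already stored in the object
fit$wald.test
```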

  



[R] survival: ridge log-likelihood workaround

2010-12-09 Thread Damjan Krstajic

Dear all,

I need to calculate a likelihood ratio test for ridge regression. In February I 
reported an issue where coxph() returns the unpenalized log-likelihood at the 
final beta estimates for a ridge coxph regression. In high-dimensional settings 
ridge regression models usually fail for lower values of lambda; as a result, 
in such settings ridge regressions use higher values of lambda (e.g. over 100), 
which means that the difference between the unpenalized and the penalized 
log-likelihood is not insignificant. I would be grateful if someone could 
confirm that the code below is a correct workaround.


ridge.penalty <- function(theta, beta)
{
    (theta/2) * sum(beta^2)
}

a.theta <- 100

fit1 <- coxph(Surv(futime, fustat) ~ ridge(rx, age, ecog.ps, theta = a.theta), 
              data = ovarian)

# penalized log-likelihood at the final beta estimates
ridge.loglikelihood.beta <- fit1$loglik[2] - ridge.penalty(a.theta, fit1$coef)

Thanks in advance
DK



Re: [R] survival: ridge log-likelihood workaround

2010-12-14 Thread Damjan Krstajic

I would like to clarify the statistics that ridge coxph returns. Here is my 
understanding; please correct me where I am wrong.

1) In his paper, Gray [JASA 1992] suggests a Wald-type statistic together with 
a formula for the degrees of freedom. The summary function for ridge coxph 
returns degrees of freedom and a Wald test equivalent to what Gray wrote.
2) Summary for ridge coxph prints a likelihood ratio test which is not, if I 
may say, a proper likelihood ratio test. First, it is based on unpenalized 
log-likelihoods, and the degrees of freedom defined above are used. I accept 
that there is nothing which suggests that one should use penalized 
log-likelihoods. However, there is also nothing published which suggests that 
unpenalized log-likelihoods should be used with the degrees of freedom defined 
above. I have found that Therneau & Grambsch, in their excellent book, discuss 
this in a paragraph and mention that the p-value thus returned is somewhat 
conservative (p too large). Therefore, the likelihood ratio test that ridge 
coxph returns is not a true one, and the statistic returned (i.e. 
2*(loglik(beta)-loglik(0))) has a distribution somewhat more compact than the 
chi-square. I like conservative p-values and like to be on the safe side. 
However, in my work the Wald test p-values for ridge regression are much higher 
than the "likelihood ratio test" p-values, and I do not get the impression that 
they are conservative.
3) There is no efficient score test for ridge regression, as there is no 
penalized efficient score test. That is OK.
4) "Rsquare" and "max possible" are returned, and I have so far failed to find 
exact references for them.

I would like to add a note here that the coxph algorithm works really fast with 
high-dimensional covariates, and all compliments to the people who developed 
it. A lot of effort has evidently been put into making it all work fast and 
correctly, and, in my opinion, it is a shame that the last bit, the summary for 
ridge coxph, is a little, if I may say, shaky. In my opinion, only the Wald 
test and the degrees of freedom as defined by Gray deserve to be part of the 
summary for ridge coxph, and I look forward to being corrected. For my part, I 
am prepared in my spare time to write the code so that the summary for ridge 
coxph does not return NULL and so that the statistics printed in the summary 
for ridge coxph are based on published papers.

Damjan Krstajic


> Subject: Re: [R] survival: ridge log-likelihood workaround
> From: thern...@mayo.edu
> To: r-help@r-project.org; dkrsta...@hotmail.com
> Date: Fri, 10 Dec 2010 09:07:42 -0600
> 
> -- begin inclusion -
> Dear all,
> 
> I need to calculate likelihood ratio test for ridge regression. In
> February I have reported a bug where coxph returns unpenalized
> log-likelihood for final beta estimates for ridge coxph regression. In
> high-dimensional settings ridge regression models usually fail for lower
> values of lambda. As the result of it, in such settings the ridge
> regressions have higher values of lambda (e.g. over 100) which means
> that the difference between unpenalized log-likelihood and penalized
> log-likelihood is not insignificant. I would be grateful if someone can
> confirm that the below code is correct workaround.
> 
> --- end included message 
> 
> First, the "bug" you report is not a bug.  The log partial likelihood
> from a Cox model LPL(beta) is well defined for any vector of
> coefficients beta, whether they are result of a maximization or taken
> from your daily horoscope.  The loglik component of coxph is the LPL for
> the reported coefficients.
> 
> For a ridge regression the coxph function maximizes LPL(beta) -
> penalty(beta) = penalized partial likelihood = PPL(beta).  You have
> correctly recreated the PPL.
> 
> Second: how do you do formal tests on such a model?  This is hard.  The
> difference LPL1- LPL2 is a chi-square when each is the result of
> maximizing the Cox LPL over a set of coefficients; when using a PPL we
> are maximizing over something else.  The distribution of the difference
> of constrained LPL values can be argued to be a weighted sum of squared
> normals where the weights are in (0,1), which is something more complex
> than a chisq distribution.  In a world with infinite free time I'd have
> pursued this, worked it all out, and added appropriate code to coxph.
> 
> What about the difference in PPL values, which is the test you propose?
> I'm not aware of any theory showing that these have any relation to a
> chi-square distribution.  (Said theory may well exist, and I'd be happy
> for pointers.)
> 
> Terry Therneau
> 
  



[R] bug in glmnet 1.7.1 for multinomal when alpha=0?

2011-12-13 Thread Damjan Krstajic

Dear all,

If I am not mistaken, I think I have found a bug in glmnet 1.7.1 (the latest 
version) for multinomial models when alpha=0. Here is the code:
> library(glmnet)
Loading required package: Matrix
Loading required package: lattice
Loaded glmnet 1.7.1

> x=matrix(rnorm(40*500),40,500)
> g4=sample(1:7,40,replace=TRUE)
> fit=glmnet(x,g4,family="multinomial",alpha=0)
> fit$lambda
[1] NA  0

In my opinion, the lambda sequence for alpha=0 (ridge) should be different from 
that. Below is an example of the lambda sequence for alpha=0.03. Thanks, DK.

> fit=glmnet(x,g4,family="multinomial",alpha=0.03)
> fit$lambda
  [1] 6.87012630 6.55786846 6.25980322 5.97528550 5.70369955 5.5761
  [7] 5.19699861 4.96078700 4.73531157 4.52008435 4.31463954 4.11853252
 [13] 3.93133886 3.75265344 3.58208955 3.41927805 3.26386659 3.11551881
 [19] 2.97391367 2.83874471 2.70971938 2.58655845 2.46899538 2.35677573
 [25] 2.24965663 2.14740627 2.04980334 1.95663661 1.86770446 1.78281441
 [31] 1.70178274 1.62443409 1.55060105 1.48012384 1.41284993 1.34863372
 [37] 1.28733624 1.22882482 1.17297283 1.11965941 1.06876916 1.02019195
 [43] 0.97382265 0.92956091 0.88731093 0.84698128 0.80848468 0.77173780
 [49] 0.73666112 0.70317874 0.67121818 0.64071028 0.61158901 0.58379134
 [55] 0.55725713 0.53192893 0.50775194 0.48467383 0.46264466 0.44161674
 [61] 0.42154458 0.40238473 0.38409572 0.36663798 0.34997372 0.33406687
 [67] 0.31888302 0.30438929 0.29055433 0.27734818 0.26474228 0.25270934
 [73] 0.24122331 0.23025934 0.21979369 0.20980373 0.20026783 0.19116535
 [79] 0.18247659 0.17418274 0.16626587 0.15870883 0.15149527 0.14460957
 [85] 0.13803685 0.13176286 0.12577403 0.12005741 0.11460061 0.10939184
 [91] 0.10441981 0.09967377 0.09514344 0.09081903 0.08669116 0.08275091
 [97] 0.07898976 0.07539955 0.07197253 0.06870126
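A possible workaround, assuming it is the automatic lambda sequence that fails for alpha = 0 (lambda_max is infinite for pure ridge, so the default grid cannot be anchored the usual way), is to supply the sequence explicitly; the grid below is an illustrative assumption:

```r
# Hand-picked decreasing lambda grid for the ridge (alpha = 0) fit
lambdas <- exp(seq(log(10), log(0.001), length.out = 100))
fit0 <- glmnet(x, g4, family = "multinomial", alpha = 0, lambda = lambdas)
length(fit0$lambda)  # should match the supplied grid rather than being NA
```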

  



[R] differences between 1.7 and 1.7.1 glmnet versions

2011-12-26 Thread Damjan Krstajic

Dear All,
 
I have found differences between glmnet versions 1.7 and 1.7.1 which, in 
my opinion, are not cosmetic and do not appear in the ChangeLog. If I am 
not mistaken, glmnet returns a different number of selected input 
variables, i.e. nonzeroCoef(fit$beta[[1]]) differs between versions. The 
code below is the same for 1.7.1 and 1.7, but you can see that the 
outputs differ. I would automatically use the latest version, but looking 
at the ChangeLog I wonder whether this is a bug or expected behaviour, 
since the change is not documented.

Thanks in advance.
DK

> # glmnet 1.7.1
> library(glmnet)
Loading required package: Matrix
Loading required package: lattice
Loaded glmnet 1.7.1
> set.seed(1)
> x=matrix(rnorm(40*500),40,500)
> g4=sample(1:7,40,replace=TRUE)
> fit=glmnet(x,g4,family="multinomial",alpha=0.1)
> dgcBeta<- fit$beta[[1]]
> which=nonzeroCoef(dgcBeta)
> which
[1]   1  12  15  17  19  20  34  39  42  58  60  62  63  65  71  72 
73  77
[19]  80  82  85  86  95  97  98  99 106 110 113 114 119 120 123 124 
128 130
[37] 136 138 139 143 148 149 151 160 161 162 173 174 175 176 177 183 
186 187
[55] 188 190 193 194 195 198 199 204 206 218 224 238 239 240 241 245 
247 250
[73] 252 255 256 258 265 266 270 277 278 281 287 293 294 296 297 300 
306 308
[91] 311 316 317 321 326 329 336 337 341 349 354 356 363 365 368 374 
376 377
[109] 379 384 385 389 397 398 400 403 404 407 415 417 418 423 424 430 
432 437
[127] 440 442 446 450 451 454 456 459 463 467 470 472 474 478 481 488 
496 497
[145] 498 500
> # just to check that inputs to glmnet are the same
> g4
[1] 5 4 5 3 2 6 1 6 6 1 3 6 1 2 6 3 7 2 6 7 6 7 5 1 3 2 2 3 2 3 3 1 5 
6 7 4 6 3
[39] 2 7
> x[,1]
[1] -0.62645381  0.18364332 -0.83562861  1.59528080  0.32950777 
-0.82046838
[7]  0.48742905  0.73832471  0.57578135 -0.30538839  1.51178117 
0.38984324
[13] -0.62124058 -2.21469989  1.12493092 -0.04493361 -0.01619026  0.94383621
[19]  0.82122120  0.59390132  0.91897737  0.78213630  0.07456498 -1.98935170
[25]  0.61982575 -0.05612874 -0.15579551 -1.47075238 -0.47815006  0.41794156
[31]  1.35867955 -0.10278773  0.38767161 -0.05380504 -1.37705956 -0.41499456
[37] -0.39428995 -0.05931340  1.10002537  0.76317575
>

> # glmnet 1.7
> library(glmnet)
Loading required package: Matrix
Loading required package: lattice
Loaded glmnet 1.7
> set.seed(1)
> x=matrix(rnorm(40*500),40,500)
> g4=sample(1:7,40,replace=TRUE)
> fit=glmnet(x,g4,family="multinomial",alpha=0.1)
> dgcBeta<- fit$beta[[1]]
> which=nonzeroCoef(dgcBeta)
> which
[1]   1   2   3   4   6   7   8   9  10  11  12  13  14  15  16  17 
18  19
[19]  20  21  22  23  24  25  26  27  28  30  31  32  33  34  35  36 
37  38
[37]  39  41  42  43  44  45  46  47  48  50  51  52  53  54  55  56 
57  58
[55]  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74 
75  76
[73]  77  78  79  80  81  82  83  84  85  86  87  88  89  91  93  94 
95  97
[91]  98  99 100 101 102 104 105 106 107 109 110 111 112 113 114 115 
116 119
[109] 120 121 122 123 124 126 127 128 130 131 132 133 134 135 136 137 
138 139
[127] 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 156 
157 159
[145] 160 161 162 163 164 165 167 168 170 171 172 173 174 175 176 177 
178 179
[163] 180 181 182 183 184 185 186 187 188 189 190 191 193 194 195 196 
197 198
[181] 199 200 203 204 205 206 207 208 209 211 212 213 215 216 217 218 
219 220
[199] 221 222 223 224 225 226 227 228 229 231 232 233 234 235 236 237 
238 239
[217] 240 241 242 243 244 245 246 247 248 249 250 251 252 253 255 256 
257 258
[235] 259 261 262 263 264 265 266 268 269 270 271 272 273 274 275 276 
277 278
[253] 279 280 281 282 283 285 286 287 288 289 290 291 292 293 294 295 
296 297
[271] 298 299 300 301 302 304 305 306 307 308 309 310 311 312 313 314 
315 316
[289] 317 318 319 321 323 324 325 326 327 328 329 330 331 332 333 334 
336 337
[307] 338 339 341 342 343 344 345 346 347 348 349 350 351 352 353 354 
355 356
[325] 357 358 361 362 363 364 365 366 367 368 369 370 371 372 373 374 
375 376
[343] 377 378 379 380 381 382 384 385 386 388 389 390 393 394 395 396 
397 398
[361] 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 
415 417
[379] 418 420 421 422 423 424 425 426 427 428 429 430 432 433 434 436 
437 438
[397] 439 440 441 442 443 444 445 446 448 450 451 452 453 454 455 456 
457 458
[415] 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 
475 476
[433] 477 478 479 480 481 482 483 484 486 488 489 490 491 493 494 495 
496 497
[451] 498 499 500
> # just to check that inputs to glmnet are the same
> g4
[1] 5 4 5 3 2 6 1 6 6 1 3 6 1 2 6 3 7 2 6 7 6 7 5 1 3 2 2 3 2 3 3 1 5 
6 7 4 6 3
[39] 2 7
> x[,1]
[1] -0.62645381  0.18364332 -0.83562861  1.59528080  0.32950777 
-0.82046838
[7]  0.48742905  0.73832471  0.57578135 -0.30538839  1.51178117 
0.38984324
[13] -0.62124058 -2.21469989  1.12493092 -0.04493361 -0.01619026  0.94383621
[19]  0.82122120  0.59390132  0.91897737  0.78213630  0.07456498 -1.98935170
[25]  0.6198257

Re: [R] differences between 1.7 and 1.7.1 glmnet versions

2011-12-31 Thread Damjan Krstajic

Thank you very much; I will use version 1.7.1. Have you had time to look at 
the issue regarding the weird behaviour of multinomial glmnet when alpha=0? I 
posted it to R-help more than two weeks ago, and maybe you missed it. Damjan 
Krstajic
> Subject: Re: differences between 1.7 and 1.7.1 glmnet versions
> From: has...@stanford.edu
> Date: Thu, 29 Dec 2011 20:54:44 -0800
> CC: r-help@r-project.org
> To: dkrsta...@hotmail.com



I have just started using changelogs, and am clearly not disciplined enough at 
it. The big change that occurred was the convergence criterion, which would 
account for the difference. At some point I will put up details of this.
Trevor Hastie
On Dec 26, 2011, at 11:55 PM, Damjan Krstajic wrote:
[quoted original message elided; identical to the post above]

[R] estimating survival times with glmnet and coxph

2012-05-07 Thread Damjan Krstajic

Dear all,

I am using glmnet (Coxnet) to build a Cox model and to make actual predictions,
i.e. to estimate the survival function S(t,Xn) for a new subject Xn. If I am
not mistaken, glmnet (coxnet) returns beta, beta*X and exp(beta*X), which on
their own cannot generate S(t,Xn). We are missing the baseline survival
function So(t).
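
[Editor's note] The missing piece is only the baseline: given the cumulative baseline hazard H0(t), the curve for a new subject follows from S(t|Xn) = exp(-H0(t) * exp(beta'Xn)). A minimal sketch of that identity with made-up data (an ordinary unpenalized coxph stands in for the glmnet fit; all names are hypothetical):

```r
library(survival)

set.seed(7)
# toy data: follow-up time, event indicator, one covariate
dat <- data.frame(tt = rexp(30), ss = rbinom(30, 1, 0.7), z = rnorm(30))
fit <- coxph(Surv(tt, ss) ~ z, data = dat)

# cumulative baseline hazard H0(t), i.e. at covariate value 0
H0 <- basehaz(fit, centered = FALSE)

# survival curve for a new subject with z = 0.3:
# S(t | z) = exp(-H0(t) * exp(beta * z))
z.new <- 0.3
S.new <- exp(-H0$hazard * exp(coef(fit) * z.new))
# S.new[i] is the estimated survival probability at time H0$time[i]
```

The same arithmetic applies to a penalized beta, provided H0(t) is estimated with the coefficients held at the penalized values, which is what the fixed-coefficient coxph below tries to do.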

Below is my code, which takes the beta coefficients from glmnet and creates a
coxph object (iter=0, init=beta) in order to later calculate survival
estimates with survfit.

I would be grateful if someone could confirm that my solution with the coxph
object (iter=0, init=beta) is correct, and that the warning I get, 'X matrix
deemed to be singular', when creating the coxph object is of no concern. My
code below uses the example from "Coxnet: Regularized Cox Regression".

Thanks in advance.
DK

library(survival)
library(glmnet)
load(system.file("doc","VignetteExample.rdata",package="glmnet"))
attach(patient.data)

# leave the first patient for testing
# and train glmnet on all other patients
trainX  <-x[-1,]
trainTime   <-time[-1]
trainStatus <- status[-1]

# fit Coxnet
fit <- glmnet(trainX,Surv(trainTime,trainStatus),family="cox",alpha=0.5,maxit=1)

# find lambda for which dev.ratio is max
max.dev.index <- which.max(fit$dev.ratio)
optimal.lambda <- fit$lambda[max.dev.index]

# take beta for optimal lambda
optimal.beta  <- fit$beta[,max.dev.index]

# find non zero beta coef
nonzero.coef <- abs(optimal.beta)>0
selectedBeta <- optimal.beta[nonzero.coef]

# take only covariates for which beta is not zero
selectedTrainX   <- trainX[,nonzero.coef]

# create coxph object with pre-defined coefficients
coxph.model <- coxph(Surv(trainTime,trainStatus)~selectedTrainX,init=selectedBeta,iter=0)

Warning message:
In coxph(Surv(trainTime, trainStatus) ~ selectedTrainX,
init = selectedBeta,  :
X matrix deemed to be singular; variable 99 100 101 102 103 104 105
106 107 108 109 110 111 112 113 114 115 116 117 118 119
120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137
138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155
156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173
174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192
193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209
210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227
228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245
246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263
264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281
282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299
300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317
318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335
336 337 338 [... truncated]

# take test covariates for which beta is not zero
selectedTestX <- x[1,nonzero.coef]

# find survival curve for test subject
sfit<- survfit(coxph.model,newdata=selectedTestX)
cat("\ntime ")
cat(sfit$time)
cat("\nsurvival ")
cat(sfit$surv)
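
[Editor's note] The fixed-coefficient trick itself can be checked on a small synthetic data set where n > p, so the singularity warning does not arise (a sketch with made-up data and hypothetical names, not the vignette data):

```r
library(survival)

set.seed(42)
# toy data: 50 subjects, 3 covariates (n > p, unlike the vignette data)
dat <- data.frame(tt = rexp(50), ss = sample(0:1, 50, replace = TRUE),
                  z1 = rnorm(50), z2 = rnorm(50), z3 = rnorm(50))

# pretend these coefficients came from a penalized (glmnet) fit
beta <- c(0.5, -0.2, 0.1)

# iter=0 freezes coxph at the supplied coefficients; the fitted object
# still carries the baseline hazard information that survfit needs
fit0 <- coxph(Surv(tt, ss) ~ z1 + z2 + z3, data = dat, init = beta, iter = 0)

# the coefficients are left exactly at init
stopifnot(isTRUE(all.equal(unname(coef(fit0)), beta)))

# survival curve for a new subject
sfit <- survfit(fit0, newdata = data.frame(z1 = 0.1, z2 = -0.4, z3 = 0))
```

With n > p the approach runs cleanly; the warning in the post reflects that the vignette data has more covariates than subjects, so the X matrix cannot have full rank.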
  
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] (bug?) in survfit in R-2-15-0

2012-05-15 Thread Damjan Krstajic



Dear all,

 

I am using glmnet + survival, and due to the latest release of glmnet 1.7.4 I
was forced to use the latest version of R, 2.15.0. My previous version of R
was 2.10.1. I changed the glmnet version and the R version at the same time,
so when I started to get weird results I was not sure where the bug was.

After putting glmnet back to the previous version, I have found that the bug,
or weird behaviour, happens in survival's survfit. My script is below in this
email and it uses glmnet 1.7.3. I get two completely different answers in the
two versions of R, and in my opinion the older version of the survival package
returns the correct value.

 

in R-2-10-1 I get

> cat(dim(sfit$surv))
> cat(length(sfit$surv))
32

in R-2-15-0 I get

> cat(dim(sfit$surv))
62 99
> cat(length(sfit$surv))
6138

Kind regards
DK


library(survival)
library(glmnet)
load(system.file("doc","VignetteExample.rdata",package="glmnet"))
attach(patient.data)

trainX <- x[-1,]
trainTime <- time[-1]
trainStatus <- status[-1]

fit <- glmnet(trainX,Surv(trainTime,trainStatus),family="cox",alpha=0.5,maxit=1)

max.dev.index <- which.max(fit$dev.ratio)
optimal.lambda <- fit$lambda[max.dev.index]
optimal.beta <- fit$beta[,max.dev.index]
nonzero.coef <- abs(optimal.beta)>0
selectedBeta <- optimal.beta[nonzero.coef]
selectedTrainX <- trainX[,nonzero.coef]

coxph.model <- coxph(Surv(trainTime,trainStatus)~selectedTrainX,init=selectedBeta,iter=0)

selectedTestX <- x[1,nonzero.coef]
sfit <- survfit(coxph.model,newdata=selectedTestX)

cat("\ndim(sfit$surv)")
cat(dim(sfit$surv))
cat("\nlength(sfit$surv)")
cat(length(sfit$surv))

  


[R] survival survfit with newdata

2012-05-16 Thread Damjan Krstajic

Dear all,

I am confused with the behaviour of survfit with newdata option.

I am using the latest version, R-2-15-0. In the simple example below I am
building a coxph model on 90 patients and trying to predict 10 new patients.
Unfortunately the survival curve at the end is for the 90 training patients.
Could somebody from the survival package please confirm whether this behaviour
is expected, because I cannot find a way of using 'newdata' with really new
data. Thanks in advance. DK

> x <- matrix(rnorm(100*20),100,20)
> time <- runif(100,min=0,max=7)
> status <- sample(c(0,1), 100, replace = TRUE)
> trainX <- x[11:100,]
> trainTime <- time[11:100]
> trainStatus <- status[11:100]
> testX <- x[1:10,]
> coxph.model <- coxph(Surv(trainTime,trainStatus)~ trainX)
> sfit <- survfit(coxph.model,newdata=data.frame(testX))
> dim(sfit$surv)
[1] 90 90


  



Re: [R] survival survfit with newdata

2012-05-16 Thread Damjan Krstajic

Thanks David for the prompt reply. I agree with you. However, I still fail to
get the survfit function to work with newdata. Continuing my previous example,
I changed the column names of the testX matrix and it still fails.

> colnames(testX)<-names(coxph.model$coefficients)
> sfit<- survfit(coxph.model,newdata=data.frame(testX))
Error in model.frame.default(formula = Surv(trainTime, trainStatus) ~  :
  variable lengths differ (found for 'trainX')

What would be solution in my simple example to get the survival curves for 
testX? Thanks in advance. DK

> CC: r-help@r-project.org
> From: dwinsem...@comcast.net
> To: dkrsta...@hotmail.com
> Subject: Re: [R] survival survfit with newdata
> Date: Thu, 17 May 2012 00:52:55 -0400
> 
> 
> On May 16, 2012, at 5:08 PM, Damjan Krstajic wrote:
> 
> >
> > Dear all,
> >
> > I am confused with the behaviour of survfit with newdata option.
> 
> Yes. It has the same behavior as any other newdata/predict from
> regression. You need to supply a dataframe with the same names as in
> the original formula, and it doesn't look as though that strategy is
> being followed. The name of the column needs to be 'trainX', since that
> was its name on the RHS of the formula, and you may want to specify
> times. If you fail to follow those rules, the function falls back on
> offering estimates from the original data.
> 
> >
> > I am using the latest version R-2-15-0. In the simple example below  
> > I am building a coxph model on 90 patients and trying to predict 10  
> > patients. Unfortunately the survival curve at the end is for 90  
> > patients.
> 
> As is proper with a malformed newdata argument.
> 
> > Could somebody please from the survival package confirm that this  
> > behaviour is as expected or not - because I cannot find a way of  
> > using 'newdata' with really new data. Thanks in advance. DK
> >
> >> x<-matrix(rnorm(100*20),100,20)
> >> time<-runif(100,min=0,max=7)
> >> status<-sample(c(0,1), 100, replace = TRUE)
> >> trainX<-x[11:100,]
> >> trainTime<-time[11:100]
> >> trainStatus<-status[11:100]
> >> testX<-x[1:10,]
> >> coxph.model<- coxph(Surv(trainTime,trainStatus)~ trainX)
> >> sfit<- survfit(coxph.model,newdata=data.frame(testX))
> >> dim(sfit$surv)
> > [1] 90 90
> >
> >
> > 
> >
> 
> David Winsemius, MD
> West Hartford, CT
> 
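
[Editor's note] For the archive: the example does work once the model is fitted through a data frame, so each covariate has its own name in the formula and survfit can match newdata columns by name. A sketch along the lines of David's advice, with made-up data:

```r
library(survival)

set.seed(1)
x <- matrix(rnorm(100 * 20), 100, 20)
colnames(x) <- paste0("x", 1:20)
dat <- data.frame(time = runif(100, 0, 7),
                  status = sample(0:1, 100, replace = TRUE),
                  x)
train <- dat[11:100, ]
test  <- dat[1:10, colnames(x)]   # covariate columns only

# fitting via a data frame records x1 ... x20 as separate terms,
# so survfit can line the newdata columns up by name
fit  <- coxph(Surv(time, status) ~ ., data = train)
sfit <- survfit(fit, newdata = test)

# one survival curve per new subject, not per training subject
dim(sfit$surv)
```

Here `dim(sfit$surv)` has 10 columns, one per test subject, instead of the 90x90 result obtained when the model was fitted with a matrix term on the right-hand side.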
  