[R] two questions about regression models and clustering routines

Maura E Monville Thu, 05 Jun 2008 13:17:08 -0700

 I managed to use an example (see attachment) of clever regression
routines. I customized it to suit my needs.
The initial model I try to fit consists of the first 10 powers of time
(time the observation was recorded) and the first 10 powers of the
phase. In fact my files record patients' breathing signals as a
sequence of breathing cycles. Every cycle sampled phase (inhale -
exhale) is mapped to an angle in the range [0,2PI]
I have two questions,


1. Surprisingly (for me) for some files the summary of the regular lm
command shows a number of non significant coefficients (those for
which the column  "Pr(>|t|)" value is > 0.05)
    But after running the step command on the model output from lm I
see that all the 20 coefficients have become significant, which makes
me feel astonished because  I have  always thought that step  would
prune  the model stripping it off the  non significant coefficients.
    So I was thinking to submit the model output from "step" to the Cp
test anyway. As it is implemented right now the Cp stage is run only
if the model output from "step" still has some non significant
coefficients.
    Your thoughts .....

2. The regression model coefficients, stored in the first 20  columns
of matrix rg, are used to calculate a distance matrix that is then
input to clustering routines.
    I am writing a more sophisticated clustering algorithm that uses PAM.
    The 21st column of matrix rg stores the file ID, which is
obviously not used in the distance evaluation.
    I would like to be able to attach the file ID as labels visible in
each cluster. The dist command description mentions some "Labels". But
it fails to explain clearly  how the observations labels can be saved
in the distance matrix and then displayed in the cluster plot.
    Can you please help me with that ?

Thank you in advance
Best regards,
--
Maura E.M

###################################################################################
##                                                  Fit  Regression Model To 
Signal                                                            ##
###################################################################################
         
## function:  -
###################################################################################
## input:  phase-unfolded text file signal                                      
                                                             ##
##                                                                              
                                                                                
##
## output:  regression coefficients                                             
                                                                              ##
##                                                                              
                                                                                
                  ##
###################################################################################
 

##....................  function FitRegModel  ........................##

FitRegModel <- function (filname,InDir,OutDir,signum) {

##set returned flag
 if(!file.exists(filname)) {
    cat("\n\n ***  File: ",flname," not found! ! Exit current Script ! *** 
\n\n")
    cat ("\n FitRegModel: success = ",success,"\n")
    return(NULL)
 }

##read pure file 
 xx <- read.table(filname,skip = 1, sep = " ") 
 names(xx) <- c("samptime","sampamp","samphase","cycle","patient")

##predicted variable
 y <- xx[,"sampamp"]

##generate regressors
 t1 <-    xx[,"samptime"]
 t2 <-    I(t1^2)
 t3 <-    I(t1^3)
 t4 <-    I(t1^4)
 t5 <-    I(t1^5) 
 t6 <-    I(t1^6)
 t7 <-    I(t1^7)
 t8 <-    I(t1^8)
 t9 <-    I(t1^9)
 t10 <-  I(t1^10)
 p1 <-   xx[,"samphase"]
 p2 <-   I(p1^2)
 p3 <-   I(p1^3)
 p4 <-   I(p1^4)
 p5 <-   I(p1^5) 
 p6 <-   I(p1^6)
 p7 <-   I(p1^7)
 p8 <-   I(p1^8)
 p9 <-   I(p1^9)
 p10 <- I(p1^10)

##place the data into a data frame
 my.df <- data.frame(y,t1,t2,t3,t4,t5,t6,t7,t8,t9,t10, 
p1,p2,p3,p4,p5,p6,p7,p8,p9,p10 )

##fit the full model
 big.lm <- lm(y ~ ., data = my.df)
 summary(big.lm)

##step wise procedure
 step.lm <- step(big.lm)  
 step.lm.sum <- summary(step.lm)
 
##get  number of terms still not significant excluding intercept 
 step.lm.sum$coefficients <- step.lm.sum$coefficients[-1,]        #drop the 
intercept    
 nleft <- which(step.lm.sum$coefficients[,"Pr(>|t|)"] >= 0.05)

##if no unsignificant terms present  return step-output coefficients
 if(length(nleft) == 0)  {
    cat("\n\n All Step Output Coefficients Significant for Signal: 
",signum,"\n\n") 
    return (step.lm.sum$coefficients)
 }

##if not significant terms still present create a new data frame with only the 
variables selected by step
 selected.variables <- names(step.lm$coefficients)[-1]
 selected.variables
 my.df.smaller <- my.df[,c("y",selected.variables)]

##  leaps
## Based on Faraway, Linear Models with R, page 127 et seq.
 library(leaps) 
 b <- regsubsets(y ~ . ,nvmax =  length(selected.variables), data = 
my.df.smaller)
 summary(b)
 rs <- summary(b) 
 no.parameters <-  seq_along(rs$cp) + 1

##set plot window to accommodate 3 graphs (1 row, 3 columns)
# x11(width = 14, height = 10)
# par(cex.axis=2, cex.lab=2,mfrow = c(1,3))

##plot adjusted R^2
 #plot(no.parameters, rs$adjr2, xlab = "Number of parameters", ylab = "Adjusted 
R-square",font.lab=2,font.axis=2)

##work with Cp (unlike Faraway's example)
##we desire models with Cp around or less than the number of parameters
# plot(no.parameters  , rs$cp, xlab = "Number of parameters", ylab = "Cp 
Statistics",font.lab=2,font.axis=2)
# abline(0, 1)
# plot(no.parameters  , rs$cp - no.parameters, xlab = "Number of parameters p", 
ylab = "Cp Statistics - p",font.lab=2,font.axis=2) 
# abline(h = 0)

##wait for mouse input (consent) to remove drawing
# locator()
# graphics.off()

##find the smallest model with Cp <= number-parameters
 chosen.model <- min(which(rs$cp <= no.parameters))
 chosen.model

##see which variables are included
 reduced.variables <- rs$which[chosen.model, ]
 which.variables <- which(reduced.variables)

##remove intercept
 which.variables <- which.variables[which.variables > 1]
 which.variables

##now reduce the data frame keeping the y variable (which is in column 1)
 reduced.df <- my.df.smaller[,c(1,which.variables)]

## check-point
 names(my.df.smaller)      #previous data frame
 names(reduced.df)           #new reduced data frame

##refit using the reduced data frame
 reduced.lm <- lm(y ~ ., data = reduced.df)
 reduced.lm.sum <- summary(reduced.lm)

##get  number of terms still not significant excluding intercept
 reduced.lm.sum$coefficients <- reduced.lm.sum$coefficients[-1,]        #drop 
the intercept   
 nleft <- which(reduced.lm.sum$coefficients[,"Pr(>|t|)"] >= 0.05)

##if no unsignificant terms present  return step-output coefficients
 if(length(nleft) == 0)  {
    cat("\n\n All Regsubsets Output Coefficients Significant for Signal: 
",signum,"\n\n") 
    return (reduced.lm.sum$coefficients)
 }

##remove unsignificant coefficients 
 while (length(nleft) > 0) { 
     names(reduced.df)  
     nn <- nleft[[1]]  +1       # account for the "y" in position 1
     reduced.df <- reduced.df[,-nn]
     names(reduced.df)  
     reduced.lm <- lm(y ~ ., data = reduced.df)
     reduced.lm.sum <- summary(reduced.lm)
     reduced.lm.sum$coefficients <- reduced.lm.sum$coefficients[-1,]        
#drop the intercept
     nleft <- which(reduced.lm.sum$coefficients[,"Pr(>|t|)"] >= 0.05)
 }
 
##return coefficients
 return (reduced.lm.sum$coefficients) 
}


##.............................  Main  .................................##

 OutDir <- 
"/Users/mauede/Breathing-Curves-Dir/Modeling-Dir/Regression-Models-Dir"
 InDir <- 
"/Users/mauede/Breathing-Curves-Dir/Modeling-Dir/Pure-Signals-Dir/Mixed"  
 setwd(InDir) 
 cat("\n")
 print(list.files(pattern = "[0-1,a-z,A-Z, ,_]*.txt"))
 cat("\n\n")
 nms <- list.files(pattern="[0-1,a-z,A-Z, ,_]*.txt") 
 if(is.null(nms)) {
    stop("\n\n *** Pure SIgnalss list empty. Exit current process. *** \n\n") 
 }

##set up matrix to save regression model coefficients: 
##columns(1:10) contain the coefficients of the time powers
##columns(11:20) contain the coefficients of the phase powers  
##column(21) contain the patient' s ID
 rg <- matrix(nrow=length(nms),ncol=21)
 colnames(rg) = 
c("t1","t2","t3","t4","t5","t6","t7","t8","t9","t10","p1","p2","p3","p4","p5","p6","p7","p8","p9","p10","Id")

##initialize regression matrix to zero
 for (m in 1:ncol(rg)) {
     rg[,m] <- rep(0,nrow(rg))
 }

 ns <- 0       #initialize index

 for(i in 1:length(nms)) {
    cat("\n i:",i, " nms[i]:",nms[i], "\n")
    insig <- NULL
    insig = strsplit(nms[i]," ")[[1]][1]
    result <- FitRegModel(nms[i],InDir,OutDir,insig)
    cat("\n result:  ",result,"\n\n") 
    if(length(result) == 0) {
        cat("\n\n Signal: ", insig, " not processed ! \n\n")
       next
    } 
     cat("\n\n Signal: ", insig, " SUCCESS ! \n\n")
     ns <- ns +1
     for (k in 1: nrow(result)) {
         for (j in 1:ncol(rg)) {
             if (rownames(result)[k] == colnames(rg)[j]) {
                  rg[ns,j] <- result[k,"Estimate"]
             }
          }
     }     
     rg [ns,"Id"] <- insig 
 }

##save regression matrix 
 setwd(OutDir)
 write.table(rg,file="RegressionCoeffMatrix",quote=FALSE,sep = " 
",row.names=FALSE,col.names=colnames(rg))

 ##create distance matrix between regression coefficients (ignore patient's ID)
 rgdist <- dist(rg[,1:20])
 
##generate clusters from distance matrix
 h <- hclust(rgdist)
 plot(h)

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] two questions about regression models and clustering routines

Reply via email to