Hi Peter,

Thank you so much! I will use complete linkage clustering because the Mendelian Randomization function (https://cran.r-project.org/web/packages/MendelianRandomization/vignettes/Vignette_MR.pdf) I plan to use allows for correlated SNPs, but not for correlations as high as 0.9 or more. I got 40 SNPs out of 246, so that is an improvement!

Regards,
Ana

On Fri, Nov 15, 2019 at 8:01 PM Peter Langfelder <peter.langfel...@gmail.com> wrote:
>
> Try hclust(as.dist(1-calc.rho), method = "average").
>
> Peter
>
> On Fri, Nov 15, 2019 at 10:02 AM Ana Marija <sokovic.anamar...@gmail.com>
> wrote:
> >
> > Hi Peter,
> >
> > Thank you for getting back to me and shedding light on this. I see
> > your point, doing Jim's method:
> >
> > > keeprows<-apply(calc.rho,1,function(x) return(sum(x>0.8)<3))
> > > ro246.lt.8<-calc.rho[keeprows,keeprows]
> > > ro246.lt.8[ro246.lt.8 == 1] <- NA
> > > (mmax <- max(abs(ro246.lt.8), na.rm=TRUE))
> > [1] 0.566
> >
> > Which is good in general, correlations in my matrix should not be
> > exceeding 0.8. I need to run Mendelian Randomization on it later on, so
> > I cannot have highly correlated SNPs in there. But with Jim's
> > method I am only left with 17 SNPs (out of 246), which means that
> > both SNPs of each highly correlated pair are removed, and it would be
> > good to keep one of those highly correlated ones.
> >
> > I tried to do your code:
> > > tree = hclust(1-calc.rho, method = "average")
> > Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor
> > exceed 65536") :
> > missing value where TRUE/FALSE needed
> >
> > Please advise.
> >
> > Thanks
> > Ana
> >
> > On Thu, Nov 14, 2019 at 7:37 PM Peter Langfelder
> > <peter.langfel...@gmail.com> wrote:
> > >
> > > I suspect that you want to identify which variables are highly
> > > correlated, and then keep only "representative" variables, i.e.,
> > > remove redundant ones. This is a bit of a risky procedure but I have
> > > done such things before as well sometimes to simplify large sets of
> > > highly related variables.
> > > If your threshold of 0.8 is approximate, you
> > > could simply use average linkage hierarchical clustering with
> > > dissimilarity = 1-correlation, cut the tree at the appropriate height
> > > (1-0.8=0.2), and from each cluster keep a single representative (e.g.,
> > > the one with the highest mean correlation with other members of the
> > > cluster). Something along these lines (untested)
> > >
> > > tree = hclust(1-calc.rho, method = "average")
> > > clusts = cutree(tree, h = 0.2)
> > > clustLevels = sort(unique(clusts))
> > > representatives = unlist(lapply(clustLevels, function(cl)
> > > {
> > >   inClust = which(clusts==cl);
> > >   rho1 = calc.rho[inClust, inClust, drop = FALSE];
> > >   repr = inClust[ which.max(colSums(rho1)) ]
> > >   repr
> > > }))
> > >
> > > the variable representatives now contains indices of the variables you
> > > want to retain, so you could subset the calc.rho matrix as
> > > rho.retained = calc.rho[representatives, representatives]
> > >
> > > I haven't tested the code and it may contain bugs, but something along
> > > these lines should get you where you want to be.
> > >
> > > Oh, and depending on how strict you want to be with the remaining
> > > correlations, you could use complete linkage clustering (will retain
> > > more variables, some correlations will be above 0.8).
> > >
> > > Peter
> > >
> > > On Thu, Nov 14, 2019 at 10:50 AM Ana Marija <sokovic.anamar...@gmail.com>
> > > wrote:
> > > >
> > > > Hello,
> > > >
> > > > I have a data frame like this (a matrix):
> > > > head(calc.rho)
> > > >             rs9900318 rs8069906 rs9908521 rs9908336 rs9908870 rs9895995
> > > > rs56192520      0.903     0.268     0.327     0.327     0.327     0.582
> > > > rs3764410       0.928     0.276     0.336     0.336     0.336     0.598
> > > > rs145984817     0.975     0.309     0.371     0.371     0.371     0.638
> > > > rs1807401       0.975     0.309     0.371     0.371     0.371     0.638
> > > > rs1807402       0.975     0.309     0.371     0.371     0.371     0.638
> > > > rs35350506      0.975     0.309     0.371     0.371     0.371     0.638
> > > >
> > > > > dim(calc.rho)
> > > > [1] 246 246
> > > >
> > > > I would like to remove from this data all highly correlated variables,
> > > > with correlation more than 0.8
> > > >
> > > > I tried this:
> > > >
> > > > > data<- calc.rho[,!apply(calc.rho,2,function(x) any(abs(x) > 0.80))]
> > > > > dim(data)
> > > > [1] 246 0
> > > >
> > > > Can you please advise,
> > > >
> > > > Thanks
> > > > Ana
> > > >
> > > > But this removes everything.
> > > >
> > > > ______________________________________________
> > > > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > PLEASE do read the posting guide
> > > > http://www.R-project.org/posting-guide.html
> > > > and provide commented, minimal, self-contained, reproducible code.
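
For reference, the approach discussed in this thread can be sketched end to end roughly as follows. This is only an illustration under stated assumptions: the matrix here is a small synthetic stand-in for calc.rho (the simulated data, the block sizes and the names snp1..snp10 are made up), and the clustering uses as.dist() as suggested in the follow-up, together with complete linkage and a cut height of 0.2 (1 - 0.8). It also shows why the first attempt dropped every column: the diagonal of a correlation matrix is 1, so every column contains a value above 0.8.

```r
# Hypothetical stand-in for calc.rho: 10 variables in three tightly
# correlated blocks (4 + 3 + 3), simulated only for illustration.
set.seed(1)
n <- 200
base <- matrix(rnorm(n * 3), n, 3)
x <- cbind(
  base[, 1] + matrix(rnorm(n * 4, sd = 0.2), n, 4),
  base[, 2] + matrix(rnorm(n * 3, sd = 0.2), n, 3),
  base[, 3] + matrix(rnorm(n * 3, sd = 0.2), n, 3)
)
colnames(x) <- paste0("snp", 1:10)
calc.rho <- cor(x)

# The first attempt flags every column, because each column includes
# its own diagonal entry (correlation of a variable with itself = 1).
allFlagged <- all(apply(calc.rho, 2, function(v) any(abs(v) > 0.8)))

# hclust() expects a "dist" object, so the dissimilarity matrix is
# converted with as.dist() before clustering.
tree <- hclust(as.dist(1 - calc.rho), method = "complete")

# Cut at height 1 - 0.8 = 0.2; with complete linkage, every pair inside
# a cluster then has correlation of at least 0.8.
clusts <- cutree(tree, h = 0.2)

# Keep one representative per cluster: the member with the highest
# summed correlation to the other members of its cluster.
representatives <- unlist(lapply(sort(unique(clusts)), function(cl) {
  inClust <- which(clusts == cl)
  rho1 <- calc.rho[inClust, inClust, drop = FALSE]
  inClust[which.max(colSums(rho1))]
}))

rho.retained <- calc.rho[representatives, representatives]

# Check the largest remaining off-diagonal correlation.
offdiag <- rho.retained[upper.tri(rho.retained)]
maxRemaining <- max(abs(offdiag))
```

Note that 1 - correlation treats strong negative correlations as very dissimilar; if those should count as redundant too, 1 - abs(calc.rho) may be the more suitable dissimilarity. It is also worth checking maxRemaining on real data, since representatives from different clusters can still be correlated above the threshold.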