That clears up a great deal. Each row of your data represents the observation of a particular species on a particular image. You are actually clustering localities (images) and you want to know what species are commonly found in localities with similar temp/sal/depth/subs.
Your current approach clusters multiple rows for the same image, which clutters the cluster analysis. A more productive approach would be to create two data tables (or one very wide one), each with one row per image, as you indicated at the end of your message:

  1. Image Name (or a numeric ID)
  2. st_x
  3. st_y
  4. Temp
  5. Sal
  6. Depth_M
  7. Subs
  8. Count_id_1
  9. Count_id_2
  . . . .
  N+7. Count_id_n

This would also allow you to compute species diversity and density for each image, which could be added to the table. To get there from your data, you need to create a table of images:

> spmat <- read.csv("http://epi.whoi.edu/ipython/results/mdistefano/pg_site1_sp.csv", header=T)
> dd <- na.omit(spmat)
> dd.images <- unique(dd[,3:9])
> nrow(dd.images)
[1] 1763
> length(levels(dd$imagename))
[1] 1710

So dd.images contains 53 more rows than the number of images! I've spot-checked this and it seems to be cases where two different "subs" values were assigned to the same image. To match the image table with the species table (below), each image needs to be included only once.

To get a table of species composition:

> dd.species <- xtabs(count~imagename+idcode, dd)
> str(dd.species)
 xtabs [1:1710, 1:20] 0 0 0 0 0 1 0 1 0 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ imagename: chr [1:1710] "UNQ.20080414.150557936.90579.jpg" "UNQ.20080414.150600152.90589.jpg" "UNQ.20080414.150602167.90599.jpg" "UNQ.20080414.150604182.90609.jpg" ...
  ..$ idcode   : chr [1:20] "10008" "10022" "10024" "11010" ...
 - attr(*, "class")= chr [1:2] "xtabs" "table"
 - attr(*, "call")= language xtabs(formula = count ~ imagename + idcode, data = dd)

With this approach you could use more of the original 100 species in the analysis, or even all of them. Now you can cluster the images into similar groups and look at the distribution of species in each cluster. Then use cutree() to produce a vector of cluster memberships to see which images fall into the same cluster.
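The workflow described above can be sketched end to end on a small invented data frame. This is only an illustration under stated assumptions: the values are made up, conflicting "subs" values are resolved by simply keeping the first record per image, and Euclidean distance on standardized variables stands in for whatever dissimilarity suits the real data.

```r
# Toy stand-in for the real file: one row per species observation per image.
# Column names follow the thread; the values are invented for illustration.
dd <- data.frame(
  imagename = c("img1", "img1", "img2", "img2", "img3", "img4", "img4", "img5"),
  temp      = c(4.31, 4.31, 4.30, 4.30, 4.32, 5.10, 5.10, 5.12),
  sal       = c(32.83, 32.83, 32.82, 32.82, 32.83, 33.40, 33.40, 33.41),
  depth_m   = c(63.5, 63.5, 62.5, 62.5, 63.1, 80.2, 80.2, 81.0),
  subs      = c(47, 47, 47, 49, 48, 37, 37, 37),  # img2 has two "subs" values
  idcode    = c(16001, 16002, 16001, 16002, 16001, 10008, 10022, 10008),
  count     = c(3, 1, 2, 4, 5, 7, 1, 6)
)

# One row per image: keep the first environmental record for each image name
# (an arbitrary, assumed way to resolve conflicting "subs" values).
env_cols  <- c("temp", "sal", "depth_m", "subs")
dd.images <- dd[!duplicated(dd$imagename), c("imagename", env_cols)]
rownames(dd.images) <- dd.images$imagename

# Image-by-species count table, as in the xtabs() call above.
dd.species <- xtabs(count ~ imagename + idcode, dd)

# Cluster images on standardized environmental variables
# (Euclidean distance is an assumption made for this sketch).
env.dist <- dist(scale(dd.images[, env_cols]))
env.clus <- hclust(env.dist, method = "average")

# cutree() returns one cluster number per image, named by image.
groups <- cutree(env.clus, k = 2)

# Total count of each species within each cluster of images.
species.mat <- unclass(dd.species)[names(groups), , drop = FALSE]
species.by.cluster <- rowsum(species.mat, groups)
print(species.by.cluster)
```

Each row of species.by.cluster is a cluster of images and each column a species, so reading across a row shows which species dominate localities with similar conditions.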
-------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77840-4352

From: epi [mailto:massimodisa...@gmail.com]
Sent: Wednesday, May 1, 2013 9:32 PM
To: dcarl...@tamu.edu
Cc: r-help@r-project.org
Subject: Re: [R] help understanding hierarchical clustering

Hi David,

thank you so much for helping me!

On 1 May 2013, at 10:16, David Carlson <dcarl...@tamu.edu> wrote:

> You need to clarify what you are trying to achieve and fix some errors in
> your code. First, thanks for giving us reproducible data.

I tried to fix the errors, thanks for your advice!

> Once you have read the file, you seem to be attempting to remove cases with
> missing values, but you check for missing values of "count" twice and you
> never check "depth." The whole line can be replaced with
>
> dd <- na.omit(mat)
>
> Now you have data with complete cases. In your next step you create a
> distance matrix that includes "idcode" as a variable! Although it is
> numeric, it is really a categorical variable. That suggests you need to
> read up on R and cluster analysis. It is very likely that you want to
> exclude this variable from the distance matrix and possibly the "count"
> variable as well.

I excluded idcode and count from the distance matrix.

> What does one row of data represent? You have 8036 complete cases
> representing data on 100 species. There are great differences in the number
> of rows for each species (idcode) ranging from 1 to 1066.

I'm trying to clean up the data: I removed all the records where the species "idcode" is found fewer than 100 times.

I uploaded a new link to the new data and code [1]. Is this correct? Can I go further and try to understand which species are assigned to each branch of the dendrogram at a specified "cut-level"?

Thanks all for any further help!

Massimo.
[1] http://nbviewer.ipython.org/5499800

> -------------------------------------
> David L Carlson
> Associate Professor of Anthropology
> Texas A&M University
> College Station, TX 77840-4352

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of epi
Sent: Tuesday, April 30, 2013 8:06 PM
To: r-help@r-project.org
Subject: [R] help understanding hierarchical clustering

Hi All,

I'm having trouble understanding how to use R to generate a hierarchical clustering. My data are in a CSV file and look like this:

idcode,count,temp,sal,depth_m,subs
16001,136,4.308,32.828,63.46,47
16001,109,4.31,32.829,63.09,49
16001,107,4.302,32.822,62.54,47
16001,87,4.318,32.834,62.54,48
16002,82,4.312,32.832,63.28,49
16002,77,4.325,32.828,65.65,46
16002,77,4.302,32.821,62.36,47
16002,71,4.299,32.832,65.84,37
16002,70,4.302,32.821,62.54,49

where idcode is a species identification number and the other fields are environmental parameters.

library(vegan)
mat <- read.csv("http://epi.whoi.edu/ipython/results/mdistefano/pg_site1.csv", header=T)
dd <- mat[!is.na(mat$idcode) & !is.na(mat$temp) & !is.na(mat$sal) & !is.na(mat$count) & !is.na(mat$count) & !is.na(mat$subs),]
distmat <- vegdist(dd)
clusa <- hclust(distmat,"average")
print(clusa)

Call:
hclust(d = distmat, method = "average")

Cluster method   : average
Distance         : bray
Number of objects: 8036

print(dend1 <- as.dendrogram(clusa))
'dendrogram' with 2 branches and 8036 members total, at height 0.3194225

dend2 <- cut(dend1, h=0.07)

A complete run with plots is available here: http://nbviewer.ipython.org/5492912

I'm trying to group together the species (idcodes) that share similar environmental parameters. Looking at the plots, I should be able to retrieve the list of idcodes for each branch at "cut-level" X. In the example, X = 0.07:

branch1 : [idcodeA, .. .. ,idcodeJ]
.. ..
branch6 : [idcodeB, .. .. , idcodeK]

Many thanks for your precious help!!!

Massimo.
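One way to retrieve the idcodes that fall in each branch at a chosen cut height is sketched below. It is a rough illustration, not the poster's method: the data frame is invented, the distance is computed with base R's dist() on standardized environmental columns rather than vegdist(), idcode is left out of the distance matrix, and the cut height of 1 is a hypothetical value picked to suit the toy data.

```r
# Invented observations: idcode plus environmental variables only.
obs <- data.frame(
  idcode  = c(16001, 16001, 16002, 16002, 16003, 16003),
  temp    = c(4.30, 4.31, 4.30, 6.00, 6.02, 6.01),
  sal     = c(32.82, 32.83, 32.82, 33.50, 33.52, 33.51),
  depth_m = c(62.5, 63.0, 62.4, 90.0, 91.0, 90.5)
)
obs <- na.omit(obs)  # keep complete cases only

# Distance on the environmental columns; idcode is a label, not a variable.
env  <- scale(obs[, c("temp", "sal", "depth_m")])
clus <- hclust(dist(env), method = "average")

# Cut the tree at a chosen height (hypothetical value) and list the
# distinct idcodes whose observations fall in each branch.
branch <- cutree(clus, h = 1)
idcodes.by.branch <- lapply(split(obs$idcode, branch), unique)
print(idcodes.by.branch)
```

Because each row is an observation rather than a species, the same idcode can show up in more than one branch, as 16002 does in this toy data.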
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.