Hello, I am working with a matrix of multilocus genotypes for ~180 individual snail samples, with substantial missing data. I am trying to calculate the pairwise genetic distance between individuals using the stats package 'dist' function, using euclidean distance. I took a subset of this dataset (3 samples x 3 loci) to test how euclidean distance is calculated:
3x3 subset used Locus1 Locus2 Locus3 Samp1 GG <NA> GG Samp2 AG CA GA Samp3 AG CA GG The euclidean distance function is defined as: sqrt(sum((x_i - y_i)^2)) My assumption was that the difference between x_i and y_i would be the number of allelic differences at each base pair site between samples. For example, the euclidean distance between Samp1 and Samp2 would be: euclidean distance = sqrt( S1_L1 - S2_L1)^2 + (S1_L2 - S2_L2)^2 + (S1_L3 - S2_L3)^2 ) at locus 1: GG - AG --> one basepair difference --> (1)^2 = 1 at locus 2: <NA> - CA --> two basepair differences --> (2)^2 = 4 at locus 3: GG - GA --> one basepair difference --> (1)^2 = 1 euclidean distance = sqrt( 1 + 4 + 1 ) = sqrt( 6 ) = 2.44940 Calculating euclidean distances this way, the distance matrix should be: # Samp1 Samp2 Samp3 # Samp1 0.000000 2.449400 2.236068 # Samp2 2.449400 0.000000 1.000000 # Samp3 2.236068 1.000000 0.000000 However, this distance matrix differs from the one calculated by the R stats package 'dist' function: # Samp1 Samp2 Samp3 # Samp1 0.000000 3.478652 2.659285 # Samp2 3.478652 0.000000 2.124787 # Samp3 2.659285 2.124787 0.000000 I used the following code (with intermediate output) to achieve the latter distance matrix: >>> setwd("~/Desktop/R_stuff") ### Data Prep: load and collect info from matrix file infile<-'~/Desktop/R_stuff/good_conch_mplex_03052018.txt' Mydata <- read.table(infile, header = TRUE, check.names = FALSE) dim(Mydata) # dimensions of data.frame ind <- as.character(Mydata$sample) # individual labels population <- as.character(Mydata$region) # population labels location <- Mydata$location ### Section 1: Convert data to usable format # removes non-genotype data from matrix (i.e. lines 1-4) # subset 3 samples, 3 loci for testing SAMPS<-3 locus <- Mydata[c(1:SAMPS), -c(1, 2, 3, 4, 5+SAMPS:ncol(Mydata))] locus # Locus1 Locus2 Locus3 # Samp1 GG <NA> GG # Samp2 AG CA GA # Samp3 AG CA GG # converts geno matrix to genind object (adegenet) Mydata1 <- df2genind(locus, ploidy = 2, ind.names = ind[1:SAMPS], pop = population[1:SAMPS], sep="") Mydata1$tab # get stats on newly created genind object # Locus1.G Locus1.A Locus2.C Locus2.A Locus3.G Locus3.A # Samp1 2 0 NA NA 2 0 # Samp2 1 1 1 1 1 1 # Samp3 1 1 1 1 2 0 ### Section 2: Individual genetic distance: euclidean distance (dist {adegenet}) # Test dist() function # collect euclidean genetic distances distgenEUCL <- dist(Mydata1, method = "euclidean", diag = FALSE, upper = FALSE, p = 2) distgenEUCL # Samp1 Samp2 # Samp2 2.449490 # Samp3 1.732051 1.414214 x<-as.matrix(dist(distgenEUCL)) x # Samp1 Samp2 Samp3 # Samp1 0.000000 3.478652 2.659285 # Samp2 3.478652 0.000000 2.124787 # Samp3 2.659285 2.124787 0.000000 >>> I have checked several forums and have been unable to resolve this discrepancy. Any and all help will be much appreciated! Thank you! Cheyenne Cheyenne Payne Bioinformatics Technician Palumbi Lab | Hopkins Marine Station (714) 200-5897 [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.