Hi Tim,

Yes, you are right, this is an issue: BC (and other distance metrics) are sensitive to sampling intensity, which is often an artefact of the sampling technique. Transformation is not a great solution to the problem: it works imperfectly, and its effects depend on the properties of your data. There are lots of different types of datasets out there, each with different properties and different behaviours under different transformation/standardisation strategies, so there is no one-transformation-suits-all solution. An illustration of this (in the case of row standardisation) is in the paper below:

https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.12843
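[Editorial note: a small numerical sketch of the two points above, in Python purely for illustration; the arithmetic is language-agnostic and none of this code comes from the linked paper. It shows that Bray-Curtis reacts to sampling intensity alone, and that row standardisation removes that artefact only by construction: it would equally erase a genuine doubling of abundance, which is one way a transformation's effect depends on the properties of the data.]

```python
import numpy as np

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum|x - y| / sum(x + y)."""
    return np.abs(x - y).sum() / (x + y).sum()

site = np.array([10.0, 20.0, 30.0])
doubled = 2 * site  # identical composition, twice the counts

# Raw counts: a distance of 1/3 appears even though the composition
# is identical, i.e. a pure sampling-intensity artefact.
print(bray_curtis(site, doubled))  # 0.333...

# Row standardisation (relative abundances) removes the artefact...
rel = lambda x: x / x.sum()
print(bray_curtis(rel(site), rel(doubled)))  # 0.0

# ...but it does so blindly: if the doubling were instead a genuine
# doubling of abundance at the second site, standardisation would
# erase that real biological signal just the same.
```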
The strategy I would advise here is to go a very different route and build a statistical model for the data. You can then include row effects in the model to handle variation in sampling intensity across rows of data (along the lines of equation 2 of the above paper). Or, if the magnitude of the variation in sampling intensity is known (e.g. it is due to changes in the sizes of quadrats used for sampling, and quadrat size has been recorded), then the standard approach is to add an offset to the model.

There is plenty of software out there that can fit suitable statistical models with row effects (and offsets) for this sort of data, including the mvabund, HMSC, boral, and gllvm packages in R. Importantly, these packages come with diagnostic tools to check that the analysis approach adequately captures key properties of your data - an essential step in any analysis.

All the best
David

Professor David Warton
School of Mathematics and Statistics,
Evolution & Ecology Research Centre,
Centre for Ecosystem Science
UNSW Sydney NSW 2052 AUSTRALIA
phone +61(2) 9385 7031
fax +61(2) 9385 7123
http://www.eco-stats.unsw.edu.au

----------------------------------------------------------------------

Date: Tue, 2 Apr 2019 17:15:45 +0200
From: Tim Richter-Heitmann <trich...@uni-bremen.de>
To: r-sig-ecology@r-project.org
Subject: [R-sig-eco] interpreting ecological distance approaches (Bray Curtis after various data transformation)
Message-ID: <3834fea1-040a-12b5-c3a3-633e68dc6...@uni-bremen.de>
Content-Type: text/plain; charset="utf-8"; Format="flowed"

Dear list,

I am not an ecologist by training, so please bear with me. It is my understanding that Bray-Curtis distances seem to be sensitive to different community sizes. Thus, they seem to deliver inadequate results when the different community sizes are the result of technical artifacts rather than biology (see e.g. Weiss et al., 2017, on microbiome data).
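[Editorial note: the offset approach David suggests in the reply above can be sketched numerically. This is a hypothetical minimal example in plain NumPy, not the interface of mvabund/gllvm or any of the packages he names: a Poisson log-linear model in which log(sampling effort) enters as a known offset, fitted by iteratively reweighted least squares. For an intercept-only model the maximum-likelihood per-unit-effort rate reduces to sum(y)/sum(effort), which the fit recovers.]

```python
import numpy as np

def poisson_glm_offset(X, y, offset, n_iter=50):
    """Fit a Poisson GLM with log link and a known offset by IRLS.

    Model: log(mu_i) = X_i @ beta + offset_i, so expected counts scale
    proportionally with effort when offset = log(effort)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta + offset
        mu = np.exp(eta)
        # Working response and weights for the log link (Fisher scoring)
        z = eta - offset + (y - mu) / mu
        W = mu
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return beta

# Counts collected with unequal sampling effort (e.g. quadrat areas 1, 2, 4),
# all generated by the same underlying rate of 3 per unit of effort:
effort = np.array([1.0, 2.0, 4.0])
y = np.array([3.0, 6.0, 12.0])
X = np.ones((3, 1))  # intercept-only design matrix

beta = poisson_glm_offset(X, y, np.log(effort))
print(np.exp(beta[0]))  # ~3.0, the per-unit-effort rate, despite unequal effort
```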
Therefore, I often see BC distances computed on relative data (which seems to be equivalent to the Manhattan distance) or on data which has been subsampled to even sizes (e.g. rarefying). Sometimes I also see Bray-Curtis distances calculated on Hellinger-transformed data, i.e. the square root of relative data. This again makes sample sizes unequal (but only to a small degree), so I wondered whether this is a valid approach, especially considering that the "natural" distance choice for Hellinger-transformed data is Euclidean (to obtain, well, the Hellinger distance).

Another question is what the different sizes (i.e. the sums) of Hellinger-transformed communities represent. I tested some datasets and couldn't find a correlation between original sample sizes and their Hellinger-transformed counterparts.

Any advice is very much welcome. Thank you.

--
Dr. Tim Richter-Heitmann
University of Bremen
Microbial Ecophysiology Group (AG Friedrich)
FB02 - Biologie/Chemie
Leobener Straße (NW2 A2130)
D-28359 Bremen
Tel.: 0049(0)421 218-63062
Fax: 0049(0)421 218-63069
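[Editorial note: the equivalences raised in the question can be checked numerically; a Python sketch for illustration, assuming each vector is one sample's counts. Bray-Curtis on row-standardised data is exactly half the Manhattan distance (so the two are rank-equivalent, not identical), and Euclidean distance on square-root-transformed relative abundances is exactly the Hellinger distance. The row sums of Hellinger-transformed data depend only on the composition (its evenness), not on the original sample total, which matches the observation that they do not correlate with sample size.]

```python
import numpy as np

def bray_curtis(x, y):
    return np.abs(x - y).sum() / (x + y).sum()

def manhattan(x, y):
    return np.abs(x - y).sum()

def hellinger(x, y):
    """Hellinger distance between two count vectors (Legendre & Gallagher sense)."""
    p, q = x / x.sum(), y / y.sum()
    return np.sqrt(((np.sqrt(p) - np.sqrt(q)) ** 2).sum())

a = np.array([10.0, 20.0, 30.0, 0.0])
b = np.array([5.0, 5.0, 60.0, 30.0])
pa, pb = a / a.sum(), b / b.sum()

# BC on relative abundances is exactly half the Manhattan distance,
# because the denominator sum(pa + pb) is always 2:
print(bray_curtis(pa, pb), manhattan(pa, pb) / 2)

# Euclidean distance on the Hellinger transform (sqrt of relative data)
# is exactly the Hellinger distance:
eucl = np.sqrt(((np.sqrt(pa) - np.sqrt(pb)) ** 2).sum())
print(eucl, hellinger(a, b))

# Row sums of Hellinger-transformed data are unchanged by scaling the
# raw counts, so they carry no information about the original sample size:
print(np.sqrt(pa).sum(), np.sqrt(10 * a / (10 * a).sum()).sum())
```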