[Rd] Canberra distance
Hi list,

According to what I know, the Canberra distance between X and Y is:

    sum[ |x_i - y_i| / (|x_i| + |y_i|) ]

(with | | denoting the absolute value).

In the source code of the Canberra distance in the file distance.c, we find:

    sum = fabs(x[i1] + x[i2]);
    diff = fabs(x[i1] - x[i2]);
    dev = diff/sum;

which corresponds to the formula:

    sum[ |x_i - y_i| / |x_i + y_i| ]

(Note that this does not define a distance. It is correct when x_i and y_i are positive, but not when a value is negative.)

Is it on purpose or is it a bug?

Christophe

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Canberra distance
On 06/02/2010 10:39 AM, Christophe Genolini wrote:
> Hi list,
>
> According to what I know, the Canberra distance between X and Y is:
>
>     sum[ |x_i - y_i| / (|x_i| + |y_i|) ]
>
> (with | | denoting the absolute value).
>
> In the source code of the Canberra distance in the file distance.c, we find:
>
>     sum = fabs(x[i1] + x[i2]);
>     diff = fabs(x[i1] - x[i2]);
>     dev = diff/sum;
>
> which corresponds to the formula:
>
>     sum[ |x_i - y_i| / |x_i + y_i| ]
>
> (Note that this does not define a distance. It is correct when x_i and y_i are positive, but not when a value is negative.)
>
> Is it on purpose or is it a bug?

It matches the documentation in ?dist, so it's not just a coding error. It will give the same value as your definition if the two items have the same sign (not only both positive), but different values if the signs differ.

The first three links I found searching Google Scholar for "Canberra distance" all define it only for non-negative data. One of them gave exactly the R formula (even though the absolute value in the denominator is redundant for such data); the others just put x_i + y_i in the denominator.

None of the 3 papers cited the origin of the definition, so I can't tell you who is wrong.

Duncan Murdoch
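To make the same-sign versus mixed-sign point concrete, here is a minimal C sketch of the two per-coordinate formulas discussed above. The function names are hypothetical and chosen for illustration; this is not the code in distance.c, and it makes no attempt to handle coordinates where both values are zero (0/0).

```c
#include <math.h>
#include <stddef.h>

/* Formula read off the quoted snippet: |x_i - y_i| / |x_i + y_i|, summed.
 * Hypothetical name; illustration only. */
double canberra_abs_of_sum(const double *x, const double *y, size_t n)
{
    double dist = 0.0;
    for (size_t i = 0; i < n; i++)
        dist += fabs(x[i] - y[i]) / fabs(x[i] + y[i]);
    return dist;
}

/* Formula from the original post: |x_i - y_i| / (|x_i| + |y_i|), summed.
 * Hypothetical name; illustration only. */
double canberra_sum_of_abs(const double *x, const double *y, size_t n)
{
    double dist = 0.0;
    for (size_t i = 0; i < n; i++)
        dist += fabs(x[i] - y[i]) / (fabs(x[i]) + fabs(y[i]));
    return dist;
}
```

With x = (1, 2) and y = (3, 6) both functions return 1, and the same holds when all four values are negated. With x = (1, 2) and y = (-3, 6) the first returns 2.5 and the second 1.5. For x = 1, y = -1 the first formula divides by zero, which is one way to see why it is problematic as a distance on signed data.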
Re: [Rd] Canberra distance
The definition I use is the one found in the book "Cluster Analysis" by Brian Everitt, Sabine Landau and Morven Leese. They cite, as the defining paper for the Canberra distance, an article by Lance and Williams, "Computer programs for hierarchical polythetic classification", Computer Journal, 1966. I do not have access, but here is the link:
http://comjnl.oxfordjournals.org/cgi/content/abstract/9/1/60

Hope this helps.

Christophe
Re: [Rd] Canberra distance and binary distance
I guess there is also a problem in the binary distance, since

    x <- y <- rep(0,10)
    dist(rbind(x,y),method="binary")

gives 0 whereas it is supposed to be undefined. (The binary distance, aka asymmetric binary, is not supposed to take the (off, off) pairs into account in its calculation.)

Christophe
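The point about (off, off) pairs can be sketched in C as follows. This is a hypothetical helper written for illustration, not R's implementation: it treats non-zero values as "on", ignores coordinates where both vectors are "off", and returns NaN when no coordinate is "on" in either vector, which is the "undefined" reading argued for in the message.

```c
#include <math.h>
#include <stddef.h>

/* Asymmetric binary dissimilarity: the proportion of coordinates where
 * exactly one vector is "on", among those where at least one is "on".
 * Hypothetical helper; returns NAN if every pair is (off, off). */
double binary_dist(const double *x, const double *y, size_t n)
{
    size_t either = 0;  /* coordinates with at least one "on" */
    size_t differ = 0;  /* coordinates with exactly one "on"  */
    for (size_t i = 0; i < n; i++) {
        int xon = (x[i] != 0.0);
        int yon = (y[i] != 0.0);
        if (xon || yon) {
            either++;
            if (xon != yon)
                differ++;
        }
    }
    if (either == 0)
        return NAN;  /* 0/0: nothing left to compare */
    return (double)differ / (double)either;
}
```

Returning 0 instead of NaN in the all-off case is the other possible convention; which one is appropriate is exactly the question raised above.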
Re: [Rd] Canberra distance
On 06/02/2010 11:31 AM, Christophe Genolini wrote:
> The definition I use is the one found in the book "Cluster Analysis" by Brian Everitt, Sabine Landau and Morven Leese. They cite, as the defining paper for the Canberra distance, an article by Lance and Williams, "Computer programs for hierarchical polythetic classification", Computer Journal, 1966. I do not have access, but here is the link:
> http://comjnl.oxfordjournals.org/cgi/content/abstract/9/1/60
>
> Hope this helps.

I do have access to that journal, and that paper gives the definition

    sum(|x_i - y_i|) / sum(x_i + y_i)

and suggests the variation

    sum[ |x_i - y_i| / (x_i + y_i) ]

It doesn't call either one the Canberra distance; it calls the first one the "non-metric coefficient" and doesn't name the second. (I imagine the Canberra name came from the fact that the authors were at CSIRO in Canberra.)

So I'd agree your definition is better, but I don't know if it can really be called the "Canberra distance".

Duncan Murdoch
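The difference between the two Lance and Williams formulas quoted above is whether the division happens once, after summing, or per coordinate. A minimal C sketch, with hypothetical function names and assuming non-negative data with positive sums:

```c
#include <math.h>
#include <stddef.h>

/* Ratio of sums: sum(|x_i - y_i|) / sum(x_i + y_i).
 * Hypothetical name; assumes non-negative data. */
double ratio_of_sums(const double *x, const double *y, size_t n)
{
    double num = 0.0, den = 0.0;
    for (size_t i = 0; i < n; i++) {
        num += fabs(x[i] - y[i]);
        den += x[i] + y[i];
    }
    return num / den;
}

/* Sum of ratios: sum[ |x_i - y_i| / (x_i + y_i) ].
 * Hypothetical name; assumes non-negative data. */
double sum_of_ratios(const double *x, const double *y, size_t n)
{
    double dist = 0.0;
    for (size_t i = 0; i < n; i++)
        dist += fabs(x[i] - y[i]) / (x[i] + y[i]);
    return dist;
}
```

For x = (1, 10) and y = (3, 10) the ratio of sums is 2/24, while the sum of ratios is 0.5. The first is dominated by the large coordinate; the second weights each coordinate's relative difference equally.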
[Rd] win-builder offline
Dear package developers, the win-builder service provided at http://win-builder.r-project.org/ will be offline from roughly 7pm (CET) today until 5pm (CET) tomorrow. Best wishes, Uwe Ligges
Re: [Rd] Canberra distance
On 06/02/2010 18:10, "Duncan Murdoch" wrote:
> It matches the documentation in ?dist, so it's not just a coding error.
> It will give the same value as your definition if the two items have
> the same sign (not only both positive), but different values if the
> signs differ.
>
> The first three links I found searching Google Scholar for "Canberra
> distance" all define it only for non-negative data. One of them gave
> exactly the R formula (even though the absolute value in the denominator
> is redundant), the others just put x_i + y_i in the denominator.

G'day cobbers,

Without checking the original sources (which I can't do before Monday), I'd say that the "Canberra distance" was originally suggested only for non-negative data (abundances of organisms, which are non-negative if observed directly). The fabs(x-y) notation was used just as a convenient tool to get rid of the original pmin(x,y) for non-negative data, which is nice in R but not so natural in C. Extending the "Canberra distance" to negative data probably makes a new distance, perhaps deserving a new name (Eureka distance?).

If you ever go to Canberra and drive around you'll see that it's all going through a roundabout after a roundabout, and going straight somewhere means goin' 'round 'n' 'round. That may make you skeptical about the "Canberra distance".

Cheers, Jazza Oksanen
Re: [Rd] Canberra distance
That is interesting. The first of these, namely

    sum(|x_i - y_i|) / sum(x_i + y_i)

is now better known in ecology as the Bray-Curtis distance. Even more interesting is the typo in Henry & Stevens "A Primer of Ecology in R", where the Bray-Curtis distance formula is actually the Canberra distance (Eq. 10.2, p. 289). There seems to be a certain slipperiness of definition in this field.

What surprises me most is why ecologists still cling to this way of doing things. It is one of the few places I know of where the analysis is justified purely heuristically and not from any kind of explicit model for the ecological processes under study.

Bill Venables.
Re: [Rd] Canberra distance
This is certainly ancient R history. The essence of the formula was last changed

    - dist += fabs(x[i1] - x[i2])/(x[i1] + x[i2]);
    + dist += fabs(x[i1] - x[i2])/fabs(x[i1] + x[i2]);

in October 1998. The help page description came later. The

    dist += fabs(x[i1] - x[i2])/(x[i1] + x[i2]);

form was there as 'canberra' in the first CVS archive in September 1997 (as src/library/mva/src/dist.c), so it looks like one of R&R was the original author and this could be called pre-history.

--
Brian D. Ripley, rip...@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
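The effect of the change quoted above shows up only when x + y is negative. A small C sketch of the two per-coordinate terms, with hypothetical function names standing in for the two lines of the diff:

```c
#include <math.h>

/* Term as in the '-' line of the diff: signed denominator.
 * Hypothetical name; illustration only. */
double term_signed_denominator(double a, double b)
{
    return fabs(a - b) / (a + b);
}

/* Term as in the '+' line of the diff: absolute value of the sum.
 * Hypothetical name; illustration only. */
double term_abs_denominator(double a, double b)
{
    return fabs(a - b) / fabs(a + b);
}
```

For a = -1 and b = -3 the older form gives -0.5, a negative contribution to a "distance", while the newer form gives 0.5. For non-negative inputs such as a = 1 and b = 3 the two agree.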