[Rd] Canberra distance

2010-02-06 Thread Christophe Genolini

Hi list,

As far as I know, the Canberra distance between X and Y is:
sum[ |x_i - y_i| / (|x_i| + |y_i|) ] (with | | denoting the absolute
value).
In the source code of the Canberra distance in the file distance.c, we
find:


   sum = fabs(x[i1] + x[i2]);
   diff = fabs(x[i1] - x[i2]);
   dev = diff/sum;

which corresponds to the formula: sum[ |x_i - y_i| / |x_i + y_i| ]
(note that this does not define a distance. It is correct when x_i
and y_i are positive, but not when a value is negative.)
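
To make the difference between the two formulas concrete, here is a minimal sketch in C. The function names canberra_r_form and canberra_abs_form are hypothetical and are not taken from distance.c; only the per-coordinate arithmetic of the first one follows the snippet quoted above. Coordinates where the denominator is zero are not handled here.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: sums |x_i - y_i| / |x_i + y_i|,
   the form quoted from distance.c above. */
double canberra_r_form(const double *x, const double *y, size_t n)
{
    double dist = 0.0;
    for (size_t i = 0; i < n; i++) {
        double sum = fabs(x[i] + y[i]);
        double diff = fabs(x[i] - y[i]);
        dist += diff / sum;   /* zero denominators are not handled */
    }
    return dist;
}

/* Hypothetical helper: sums |x_i - y_i| / (|x_i| + |y_i|),
   the form given at the top of this message. */
double canberra_abs_form(const double *x, const double *y, size_t n)
{
    double dist = 0.0;
    for (size_t i = 0; i < n; i++) {
        double sum = fabs(x[i]) + fabs(y[i]);
        double diff = fabs(x[i] - y[i]);
        dist += diff / sum;   /* zero denominators are not handled */
    }
    return dist;
}
```

With x = (1, 2) and y = (3, -1), the first function returns 0.5 + 3 = 3.5 and the second 0.5 + 1 = 1.5. With x = (-1, -2) and y = (-3, -4), both return the same value.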


Is it on purpose or is it a bug?

Christophe

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Canberra distance

2010-02-06 Thread Duncan Murdoch

It matches the documentation in ?dist, so it's not just a coding error. 
 It will give the same value as your definition if the two items have 
the same sign (not only both positive), but different values if the 
signs differ.


The first three links I found searching Google Scholar for "Canberra 
distance" all define it only for non-negative data.  One of them gave 
exactly the R formula (even though the absolute value in the denominator 
is redundant), the others just put x_i + y_i in the denominator.


None of the 3 papers cited the origin of the definition, so I can't tell 
you who is wrong.


Duncan Murdoch



Re: [Rd] Canberra distance

2010-02-06 Thread Christophe Genolini
The definition I use is the one found in the book "Cluster Analysis" by 
Brian Everitt, Sabine Landau and Morven Leese.
As the defining paper for the Canberra distance, they cite an article by 
Lance and Williams, "Computer programs for hierarchical polythetic 
classification", Computer Journal, 1966.
I do not have access, but here is the link: 
http://comjnl.oxfordjournals.org/cgi/content/abstract/9/1/60

Hope this helps.

Christophe



Re: [Rd] Canberra distance and binary distance

2010-02-06 Thread Christophe Genolini

I guess there is also a problem with the binary distance, since

x <- y <- rep(0,10)
dist(rbind(x,y),method="binary")

gives 0 whereas it is supposed to be undefined (the asymmetric binary 
distance is not supposed to take the (off, off) pairs into account in its 
calculation).
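
As an illustration of the behaviour argued for here, the following is a minimal sketch in C of an asymmetric binary distance that ignores (off, off) pairs and reports an undefined result when no coordinate is "on" in either vector. The function name binary_distance_strict is hypothetical, and returning NaN for the all-off case is an assumption of this sketch, not what R's dist() does.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: proportion of coordinates where exactly one
   vector is "on" (non-zero) among those where at least one is "on".
   (off, off) pairs are skipped entirely. */
double binary_distance_strict(const double *x, const double *y, size_t n)
{
    size_t at_least_one_on = 0;
    size_t exactly_one_on = 0;

    for (size_t i = 0; i < n; i++) {
        int x_on = (x[i] != 0.0);
        int y_on = (y[i] != 0.0);
        if (x_on || y_on) {
            at_least_one_on++;
            if (x_on != y_on)
                exactly_one_on++;
        }
    }

    /* Assumption of this sketch: with no informative coordinate the
       ratio is 0/0, so report it as undefined rather than as 0. */
    if (at_least_one_on == 0)
        return NAN;

    return (double)exactly_one_on / (double)at_least_one_on;
}
```

Called on two vectors of ten zeros, this sketch returns NaN instead of 0.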


Christophe



Re: [Rd] Canberra distance

2010-02-06 Thread Duncan Murdoch

I do have access to the Computer Journal, and the Lance and Williams 
paper gives the definition

sum(|x_i - y_i|) / sum(x_i + y_i)

and suggests the variation

sum( |x_i - y_i| / (x_i + y_i) )

It doesn't call either one the Canberra distance; it calls the first one 
the "non-metric coefficient" and doesn't name the second.  (I imagine 
the Canberra name came from the fact that the authors were at CSIRO in 
Canberra.)


So I'd agree your definition is better, but I don't know if it can 
really be called the "Canberra distance".
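
The two coefficients from the paper differ only in where the sum is taken, which a short C sketch can make explicit. The function names ratio_of_sums and sum_of_ratios are hypothetical, and the sketch assumes non-negative data with x_i + y_i > 0 for every coordinate.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: sum(|x_i - y_i|) / sum(x_i + y_i),
   the paper's "non-metric coefficient". */
double ratio_of_sums(const double *x, const double *y, size_t n)
{
    double num = 0.0;
    double den = 0.0;
    for (size_t i = 0; i < n; i++) {
        num += fabs(x[i] - y[i]);
        den += x[i] + y[i];
    }
    return num / den;
}

/* Hypothetical helper: sum(|x_i - y_i| / (x_i + y_i)),
   the unnamed variation. Assumes x_i + y_i > 0 throughout. */
double sum_of_ratios(const double *x, const double *y, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += fabs(x[i] - y[i]) / (x[i] + y[i]);
    return total;
}
```

For x = (1, 2, 3) and y = (3, 2, 1), the first gives 4/12 and the second gives 0.5 + 0 + 0.5 = 1.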


Duncan Murdoch




[Rd] win-builder offline

2010-02-06 Thread Uwe Ligges

Dear package developers,

the win-builder service provided at http://win-builder.r-project.org/ 
will be offline from roughly 7pm (CET) today until 5pm (CET) tomorrow.


Best wishes,
Uwe Ligges



Re: [Rd] Canberra distance

2010-02-06 Thread Jari Oksanen



G'day cobbers, 

Without checking the original sources (which I can't do before Monday), I'd
say that the "Canberra distance" was originally suggested only for
non-negative data (abundances of organisms, which are non-negative if
observed directly). The fabs(x-y) notation was used just as a convenient
tool to get rid of the original pmin(x,y) for non-negative data -- which is
nice in R, but not so natural in C. Extending the "Canberra distance" to
negative data probably makes a new distance, perhaps deserving a new name
(Eureka distance?).

If you ever go to Canberra and drive around you'll see that it's all going
through a roundabout after a roundabout, and going straight somewhere means
goin' 'round 'n' 'round. That may make you skeptical about the "Canberra
distance". 

Cheers, Jazza Oksanen



Re: [Rd] Canberra distance

2010-02-06 Thread Bill.Venables
That is interesting.  The first of these, namely

sum(|x_i - y_i|) / sum(x_i + y_i)

is now better known in ecology as the Bray-Curtis distance.  Even more 
interesting is the typo in Henry & Stevens "A Primer of Ecology in R" where the 
Bray-Curtis distance formula is actually the Canberra distance (Eq. 10.2, p. 
289).  There seems to be a certain slipperiness of definition in this field.

What surprises me most is that ecologists still cling to this way of doing 
things.  It is one of the few places I know of where the analysis is justified 
purely heuristically and not from any kind of explicit model for the ecological 
processes under study.

Bill Venables.






Re: [Rd] Canberra distance

2010-02-06 Thread Prof Brian Ripley
This is certainly ancient R history.  The essence of the formula was 
last changed

-   dist += fabs(x[i1] - x[i2])/(x[i1] + x[i2]);
+   dist += fabs(x[i1] - x[i2])/fabs(x[i1] + x[i2]);

in October 1998.  The help page description came later.

The
   dist += fabs(x[i1] - x[i2])/(x[i1] + x[i2]);
form was there as 'canberra' in the first CVS archive in September 
1997 (as src/library/mva/src/dist.c) so it looks like one of R&R was 
the original author and this could be called pre-history.




--
Brian D. Ripley,  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK  Fax:  +44 1865 272595
