[GitHub] aherbert commented on issue #103: TEXT-126: Adding Sorensen-Dice similarity algoritham

GitBox Wed, 27 Feb 2019 03:07:32 -0800

aherbert commented on issue #103: TEXT-126: Adding Sorensen-Dice similarity 
algoritham
URL: https://github.com/apache/commons-text/pull/103#issuecomment-467821107
 
 
   Hi @ameyjadiye. Thanks for the contribution. I've had a look at this 
similarity as it is essentially a binary scoring metric:
   
   [Sørensen–Dice 
coefficient](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient)
   
   Thus you must compute the matches (true-positives) between the two sets `X` 
and `Y` (the union) and then include a penalty for items in either set that are 
not matched (false-positives and false negatives).
   
   `(2 * TP) / (2 * TP + FP + FN)`
   
   Or simply:
   
   `2 * |X union Y| / (|X| + |Y|)`
   
   with `|N|` the cardinality, or count.
   
   The score is related to the `Jaccard` score and can be computed directly 
from it (`S = 2J / (1 + J)`). However the `JaccardSimilarity` class in the same 
package uses single characters from each `CharSequence` and not character pairs 
(bigrams). So this algorithm using pairs is new.
   
   The way you have defined the score is to take each unique character pair 
(bigram) as the set from each string. Thus:
   
   ```
   sim.apply("aa", "aa")      = [aa] vs [aa] = 1.0
   sim.apply("aaa", "aa")     = [aa] vs [aa] = 1.0
   sim.apply("aaaa", "aa")    = [aa] vs [aa] = 1.0
   sim.apply("aaaa", "aaa")   = [aa] vs [aa] = 1.0
   ```
   
   The original article you 
[referenced](http://www.catalysoft.com/articles/StrikeAMatch.html) includes 
duplicate observations of each unique character pair (bigram). Thus:
   
   ```
   sim.apply("aa", "aa")      = [aa]         vs [aa]     = (2 * 1) / (1 + 1) = 
1.0
   sim.apply("aaa", "aa")     = [aa][aa]     vs [aa]     = (2 * 1) / (2 + 1) = 
0.667
   sim.apply("aaaa", "aa")    = [aa][aa][aa] vs [aa]     = (2 * 1) / (3 + 1) = 
0.5
   sim.apply("aaaa", "aaa")   = [aa][aa][aa] vs [aa][aa] = (2 * 2) / (3 + 2) = 
0.8
   ```
   
   This requires a different algorithm without using `Set`. The author provides 
one which could be used for testing although it could be improved for 
efficiency.
   
   Q. Do you know if there is a preference in the field for parsing unique 
bigrams or including duplicates? Either way it should be clear in the Javadoc 
what the intension is.
   
   I also note that splitting the `CharSequence` into bigrams can be done 
efficiently using a 32-bit integer to store each pair of 16-bit chars:
   
   ```
   /**
    * Creates the bigrams packed as two 16-bit chars into a 32-bit int. This 
does not
    * support extended character sets.
    *
    * @param cs the char sequence
    * @return the bigrams
    */
   private int[] createBigrams(CharSequence cs) {
       if (cs.length() < 2) {
           return new int[0];
       }
       final int[] bigrams = new int[cs.length() - 1];
       int ch2 = cs.charAt(0);
       for (int i = 0; i < bigrams.length; i++) {
           final int ch1 = ch2;
           ch2 = cs.charAt(i + 1);
           bigrams[i] = (ch1 << 16) | ch2;
       }
       return bigrams;
   }
   ```
   
   You can then do what you want with these including writing an algorithm that 
sorts the bigrams for efficient comparison, or use them in your current 
algorithm with `Set<Integer>`.
   
   Also be careful of overflow in the result computation. I acknowledge that 
with the `Set` method you would have to input two `CharSequence`s with close to 
all `2^16 * 2^16` bigrams to get two set sizes above half `Integer.MAX_VALUE`. 
With the duplicates method you just have to input two `CharSequence` that have 
a combined length above `Integer.MAX_VALUE` and then adding the sizes will 
overflow the `int`. For example if you were to compare two books!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

[GitHub] aherbert commented on issue #103: TEXT-126: Adding Sorensen-Dice similarity algoritham

Reply via email to