aherbert commented on issue #103: TEXT-126: Adding Sorensen-Dice similarity algoritham URL: https://github.com/apache/commons-text/pull/103#issuecomment-467821107 Hi @ameyjadiye. Thanks for the contribution. I've had a look at this similarity as it is essentially a binary scoring metric: [Sørensen–Dice coefficient](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient) Thus you must compute the matches (true-positives) between the two sets `X` and `Y` (the union) and then include a penalty for items in either set that are not matched (false-positives and false negatives). `(2 * TP) / (2 * TP + FP + FN)` Or simply: `2 * |X union Y| / (|X| + |Y|)` with `|N|` the cardinality, or count. The score is related to the `Jaccard` score and can be computed directly from it (`S = 2J / (1 + J)`). However the `JaccardSimilarity` class in the same package uses single characters from each `CharSequence` and not character pairs (bigrams). So this algorithm using pairs is new. The way you have defined the score is to take each unique character pair (bigram) as the set from each string. Thus: ``` sim.apply("aa", "aa") = [aa] vs [aa] = 1.0 sim.apply("aaa", "aa") = [aa] vs [aa] = 1.0 sim.apply("aaaa", "aa") = [aa] vs [aa] = 1.0 sim.apply("aaaa", "aaa") = [aa] vs [aa] = 1.0 ``` The original article you [referenced](http://www.catalysoft.com/articles/StrikeAMatch.html) includes duplicate observations of each unique character pair (bigram). Thus: ``` sim.apply("aa", "aa") = [aa] vs [aa] = (2 * 1) / (1 + 1) = 1.0 sim.apply("aaa", "aa") = [aa][aa] vs [aa] = (2 * 1) / (2 + 1) = 0.667 sim.apply("aaaa", "aa") = [aa][aa][aa] vs [aa] = (2 * 1) / (3 + 1) = 0.5 sim.apply("aaaa", "aaa") = [aa][aa][aa] vs [aa][aa] = (2 * 2) / (3 + 2) = 0.8 ``` This requires a different algorithm without using `Set`. The author provides one which could be used for testing although it could be improved for efficiency. Q. Do you know if there is a preference in the field for parsing unique bigrams or including duplicates? Either way it should be clear in the Javadoc what the intension is. I also note that splitting the `CharSequence` into bigrams can be done efficiently using a 32-bit integer to store each pair of 16-bit chars: ``` /** * Creates the bigrams packed as two 16-bit chars into a 32-bit int. This does not * support extended character sets. * * @param cs the char sequence * @return the bigrams */ private int[] createBigrams(CharSequence cs) { if (cs.length() < 2) { return new int[0]; } final int[] bigrams = new int[cs.length() - 1]; int ch2 = cs.charAt(0); for (int i = 0; i < bigrams.length; i++) { final int ch1 = ch2; ch2 = cs.charAt(i + 1); bigrams[i] = (ch1 << 16) | ch2; } return bigrams; } ``` You can then do what you want with these including writing an algorithm that sorts the bigrams for efficient comparison, or use them in your current algorithm with `Set<Integer>`. Also be careful of overflow in the result computation. I acknowledge that with the `Set` method you would have to input two `CharSequence`s with close to all `2^16 * 2^16` bigrams to get two set sizes above half `Integer.MAX_VALUE`. With the duplicates method you just have to input two `CharSequence` that have a combined length above `Integer.MAX_VALUE` and then adding the sizes will overflow the `int`. For example if you were to compare two books!
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org For additional commands, e-mail: dev-h...@commons.apache.org