Michael Ohlrogge added the comment:
This is my first time posting here, so apologies if I'm breaking rules.
I'd like to put in a vote in favor of this patch to get the matching scores.
I am a researcher at Stanford University using this tool to match about
100,000 company/entity names across two datasets. The names refer to the
same underlying entities, but because they come from different datasets, the
spellings, abbreviations, etc. differ.
It would be helpful to be able to run the get_scored_close_matches() function
and then sort the results by how close the matches were. For instance, if I
could determine, based on spot checking / sampling of the results, that
everything scoring above a certain threshold is almost certainly correct,
whereas everything below a certain threshold needs to be reviewed by hand,
that would save me a lot of manual work.
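
To make the workflow above concrete, here is a minimal sketch. It assumes the
matching is built on difflib.SequenceMatcher from the standard library. The
helper name scored_close_matches is hypothetical and is not the patch's actual
implementation; it only illustrates returning scores alongside matches so they
can be sorted and thresholded afterwards.

```python
import difflib


def scored_close_matches(word, possibilities, n=3, cutoff=0.6):
    """Return up to n (score, candidate) pairs with score >= cutoff, best first.

    Hypothetical helper for illustration only; it is not the function
    proposed in the patch.
    """
    matcher = difflib.SequenceMatcher()
    matcher.set_seq2(word)
    scored = []
    for candidate in possibilities:
        matcher.set_seq1(candidate)
        score = matcher.ratio()  # similarity in [0, 1]
        if score >= cutoff:
            scored.append((score, candidate))
    # Highest score first; ties are broken by the candidate string.
    scored.sort(reverse=True)
    return scored[:n]
```

With the scores in hand, the results for all names can be collected once and
then sorted or split at whatever threshold the spot checks suggest, without
rerunning the matching.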
I suppose I can accomplish something similar by rerunning the matching with
the threshold set at different levels. Nevertheless, with as many candidate
matches as I have, the algorithm takes a decent amount of time to run, and I
don't have a good way to know in advance what a reasonable threshold would be.
Just in general, I think it can be useful information for users to know how
much confidence to have in the matches produced by the algorithm. Users could
choose to formulate this confidence either as a direct function of the score or
perhaps based on some other factors, such as a statistical analysis procedure
that takes the score into account.
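
As one simple example of turning scores into confidence, the sketch below
estimates the fraction of correct matches within score bands from a
hand-checked sample. The function name precision_by_band and the band edges
are assumptions chosen for illustration, not anything provided by the package.

```python
import bisect


def precision_by_band(checked, edges=(0.6, 0.8, 0.9)):
    """Estimate the share of correct matches in each score band.

    checked: iterable of (score, is_correct) pairs from a hand-reviewed sample.
    edges:   ascending band boundaries (illustrative defaults).
    Returns one value per band, lowest band first; None for empty bands.
    """
    totals = [0] * (len(edges) + 1)
    correct = [0] * (len(edges) + 1)
    for score, is_correct in checked:
        band = bisect.bisect_right(edges, score)  # index of the band holding score
        totals[band] += 1
        if is_correct:
            correct[band] += 1
    return [float(c) / t if t else None for c, t in zip(correct, totals)]
```

Bands whose estimated precision is high enough could be accepted as-is, and
the rest queued for manual review.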
Thanks to everyone who put this package together and who suggested the patch.
--
nosy: +michaelohlrogge
versions: +Python 2.7 -Python 3.5
___
Python tracker
<http://bugs.python.org/issue21344>
___