Nicolas George <geo...@nsup.org> wrote:
> Emanuel Berg (12024-07-10):
> > Okay, this is gonna be a challenge to most guys who have been
> > processing text for a long time.
> >
> > So, I would like a command, function or script, 'original',
> > that takes a string STR and a text file TXT and outputs
> > a score, from 0 to 100, how _original_ STR is, compared to
> > what is already in TXT.
> >
> > So if I do
> >
> > $ original "This isn't just another party" comments.txt
> >
> > this will score 0 if that exact phrase to the letter already
> > exists in comments.txt.
> >
> > But it will score 100 if not a single of those words exists in
> > the file! Because that would be 100% original.
> >
> > Those endpoints are easy. But how to make it score - say - 62%
> > if some of the words are present, mostly spelled like that and
> > combined in ways that are not completely different?
> >
> > Note: The above examples are examples, other definitions of
> > originality are okay. That is not the important part now - but
> > can be as interesting a part, later.
>
> You can use that:
>
> https://en.wikipedia.org/wiki/Levenshtein_distance
Levenshtein distance isn't suited to the problem. It compares the entirety of two strings. Emanuel is interested in comparing one string against substrings of a potentially much larger string, or even substrings of the first string in random order against portions of the second string!
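
For what it's worth, here is a rough Python sketch of one way to score against substrings rather than whole strings. It is only one possible definition of originality, not anything Emanuel specified: the function name `original` is taken from his example, but the `cutoff` parameter, the word-splitting regex, and the 50/50 weighting between "are the words known?" and "are they combined the same way?" are all my own assumptions. Fuzzy word matching uses difflib from the standard library, not Levenshtein distance.

```python
import difflib
import re


def words(s):
    # Lowercase and split into words; keeps apostrophes so "isn't" stays whole.
    return re.findall(r"[a-z0-9']+", s.lower())


def original(phrase, text, cutoff=0.8):
    """Return 0..100: how original `phrase` is relative to `text`.

    Assumed definition: average of
      - word familiarity: best fuzzy match of each word in the text
      - order familiarity: share of adjacent word pairs also adjacent in text
    """
    p = words(phrase)
    t = words(text)
    if not p:
        return 0  # nothing to judge; arbitrary choice

    vocab = sorted(set(t))
    known = set(vocab)

    def best(word):
        # 1.0 for an exact hit, a similarity ratio for a near miss, else 0.0.
        if word in known:
            return 1.0
        close = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
        if not close:
            return 0.0
        return difflib.SequenceMatcher(None, word, close[0]).ratio()

    word_fam = sum(best(w) for w in p) / len(p)

    pairs = list(zip(p, p[1:]))
    if pairs:
        seen = set(zip(t, t[1:]))
        order_fam = sum(pair in seen for pair in pairs) / len(pairs)
    else:
        order_fam = word_fam  # single-word phrase: no ordering to compare

    familiarity = (word_fam + order_fam) / 2
    return round(100 * (1 - familiarity))


if __name__ == "__main__":
    import sys
    with open(sys.argv[2], encoding="utf-8") as f:
        print(original(sys.argv[1], f.read()))
```

Saved as a script (the file name is up to you), it would be run as `python3 original.py "This isn't just another party" comments.txt`. The exact phrase scores 0, a phrase sharing no words with the file scores 100, and anything in between depends on how many words are present, how closely they are spelled, and how many neighbouring pairs are reused.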