Emanuel Berg (12024-07-10):
> Okay, this is gonna be a challenge to most guys who have been
> processing text for a long time.
>
> So, I would like a command, function or script, 'original',
> that takes a string STR and a text file TXT and outputs
> a score, from 0 to 100, how _original_ STR is, compared to
> what is already in TXT.
>
> So if I do
>
> $ original "This isn't just another party" comments.txt
>
> this will score 0 if that exact phrase to the letter already
> exists in comments.txt.
>
> But it will score 100 if not a single of those words exists in
> the file! Because that would be 100% original.
>
> Those endpoints are easy. But how to make it score - say - 62%
> if some of the words are present, mostly spelled like that and
> combined in ways that are not completely different?
>
> Note: The above examples are examples, other definitions of
> originality are okay. That is not the important part now - but
> can be as interesting a part, later.

You can use this:

https://en.wikipedia.org/wiki/Levenshtein_distance

But you also need to define what you want with more precision:

- How do you count the replacement of a word by a synonym?
- How do you count a change in the order of the words?
- How do you count a transparent spelling mistake?
- How do you count a spelling mistake that turns a word into another
  existing word?

Not related to Debian, putting “[OT]” in the subject.

Regards,

-- 
Nicolas George
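
To make the Levenshtein suggestion above concrete, here is a minimal Python sketch of one possible definition of the score. It is not a definitive answer to the question: the function names `levenshtein` and `originality` are made up for illustration, and it assumes that TXT holds one comment per line and that STR is compared against each whole line, character by character. Under those assumptions an exact match scores 0 and a string sharing no characters in alignment with any line scores 100. It does not address synonyms, word order or spelling mistakes, which would need their own rules.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def originality(phrase: str, lines: list[str]) -> int:
    """Hypothetical score from 0 to 100 (assumption: one comment per line).

    For each line, the edit distance is divided by the longer of the two
    lengths; the smallest ratio found is scaled to 0..100.
    """
    phrase = phrase.strip()
    best = 1.0  # an empty file leaves the phrase fully original
    for line in lines:
        line = line.strip()
        longest = max(len(phrase), len(line))
        if longest == 0:
            return 0  # empty phrase matches an empty line exactly
        best = min(best, levenshtein(phrase, line) / longest)
    return round(best * 100)
```

It could be called as `originality("This isn't just another party", open("comments.txt").read().splitlines())`. Dividing by the longer length is only one way to normalize; a word-level distance would be another reasonable choice and would give different scores.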