Nicolas George <geo...@nsup.org> wrote:
> Emanuel Berg (12024-07-10):
> > Okay, this is gonna be a challenge to most guys who have been
> > processing text for a long time.
> > 
> > So, I would like a command, function or script, 'original',
> > that takes a string STR and a text file TXT and outputs
> > a score, from 0 to 100, how _original_ STR is, compared to
> > what is already in TXT.
> > 
> > So if I do
> > 
> > $ original "This isn't just another party" comments.txt
> > 
> > this will score 0 if that exact phrase to the letter already
> > exists in comments.txt.
> > 
> > But it will score 100 if not a single of those words exists in
> > the file! Because that would be 100% original.
> > 
> > Those endpoints are easy. But how to make it score - say - 62%
> > if some of the words are present, mostly spelled like that and
> > combined in ways that are not completely different?
> > 
> > Note: The above examples are examples, other definitions of
> > originality are okay. That is not the important part now - but
> > can be as interesting a part, later.  
> 
> You can use that:
> 
> https://en.wikipedia.org/wiki/Levenshtein_distance

Levenshtein distance isn't suited to the problem. It compares the
entirety of two strings. Emanuel is interested in comparing one string
against substrings of a potentially much larger string, or even
substrings of the first string, in arbitrary order, against portions of
the second string!
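
One way to compare pieces of STR against pieces of TXT is to count
shared word n-grams. What follows is only a rough sketch under my own
assumptions, not a definitive measure of originality: the names
words, ngrams and original are made up for illustration, the
tokenizer is a crude lowercase regex, and near-miss spellings are not
handled at all. It scores 0 when the whole word sequence of the phrase
occurs in the text and 100 when none of its words do.

```python
import re

def words(text):
    # Crude tokenizer (an assumption): lowercase runs of letters,
    # digits and apostrophes; punctuation and case are ignored.
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    # Set of all runs of n consecutive words.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def original(phrase, corpus):
    # Score from 0 (phrase already present) to 100 (no word in common).
    p = words(phrase)
    c = words(corpus)
    if not p:
        return 100.0
    known = 0.0
    for n in range(1, len(p) + 1):
        grams = ngrams(p, n)
        # Fraction of the phrase's n-grams that also occur in the corpus.
        known += len(grams & ngrams(c, n)) / len(grams)
    # Average the fractions over all n-gram lengths, then invert.
    return round(100 * (1 - known / len(p)), 1)

if __name__ == "__main__":
    import sys
    with open(sys.argv[2], encoding="utf-8") as f:
        print(original(sys.argv[1], f.read()))
```

Saved as a script, it would be called the way Emanuel describes, with
the phrase first and the file second. Weighting longer n-grams more
heavily, or matching words fuzzily, are obvious knobs to turn.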
