Nicolas George wrote:

> You can use that:
>
> https://en.wikipedia.org/wiki/Levenshtein_distance
>
> But you also need to define what you want with more
> precision:
>
> How do you count the replacement of a word by a synonym?
>
> How do you count a change in the order of the words?
>
> How do you count a transparent spelling mistake?
>
> How do you count a spelling mistake that turns a word into
> another existing word?
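For concreteness, here is one basic mechanical measure along those lines, sketched in Python with only the standard library. The `levenshtein` function is the classic dynamic-programming edit distance; the `originality` function is my own ad hoc illustration (not anything standard): score a new entry by its edit distance to the closest existing entry, normalized by length.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance: the minimum number
    # of single-character insertions, deletions, and substitutions
    # needed to turn string a into string b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def originality(entry, corpus):
    # Ad hoc "originality" score for a new entry: distance to the
    # closest existing entry, divided by the entry's length so that
    # long lines are not automatically "more original".
    best = min(levenshtein(entry, old) for old in corpus)
    return best / max(len(entry), 1)
```

This only captures character-level novelty, so it misses the synonym, word-order, and spelling-mistake cases raised above; those would need word-level rules on top.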
Indeed, one can have a bunch of such rules, apply them, and award points accordingly. But maybe one could also have a more general, basic mechanical/math/stats-inspired algorithm? Or a combination!

I forgot to mention the file with the data. One can either see it as just a bunch of data: how original is the new data compared to the old? Or one can see the data as a bunch of entries, say one for each line: to what extent is the new entry unlike all the others? That sounds easier, but not necessarily so, because the first view can apply here as well: "This entry was unlike all the others. However, bits and pieces of it appear all over the place." See?

How can we define what is original in a way that 1. makes sense and 2. is useful for this application?

> Not related to Debian, putting "[OT]" in the subject.

I forgot to say that one is expected to use only software from the Debian repos or from other sources readily available on FOSS Unix-like systems.

--
underground experts united
https://dataswamp.org/~incal