Nicolas George wrote:

> You can use that:
>
> https://en.wikipedia.org/wiki/Levenshtein_distance
>
> But you also need to define what you want with more
> precision:
>
> How do you count the replacement of a word by a synonym?
>
> How do you count a change in the order of the words?
>
> How do you count a transparent spelling mistake?
>
> How do you count a spelling mistake that turns a word into
> another existing word?

Indeed, one can have a bunch of such rules, apply them, and
award points accordingly.

But maybe one could also have a more general, basic
mechanical/math/stats-inspired algorithm?

Or a combination!
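As a starting point for the mechanical approach, here is a
minimal sketch of the Levenshtein distance from the link above,
in pure Python (stdlib only, so nothing beyond what any Debian
or FOSS Unix-like system already ships):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a
    # prev[j] = distance between the first i-1 chars of a
    # and the first j chars of b (classic two-row DP).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

This only gives the raw edit count; weighting synonyms, word
order, and "transparent" misspellings differently would mean
tuning the three costs, which is exactly the precision question
raised above.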

I forgot to mention the file with the data. One can see it
simply as a blob of data: how original is the new data compared
to the old?

One can also see the data as a bunch of entries, say one for
each line. To what extent is the new entry unlike all others?
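One hedged way to put a number on that question: score a new
entry by one minus its best similarity to any existing line,
here using stdlib difflib (the corpus lines are made up for the
example):

```python
import difflib

def novelty(entry: str, corpus: list[str]) -> float:
    """0.0 = identical to some existing entry, 1.0 = nothing alike."""
    if not corpus:
        return 1.0
    best = max(difflib.SequenceMatcher(None, entry, old).ratio()
               for old in corpus)
    return 1.0 - best

corpus = ["the quick brown fox", "jumps over the lazy dog"]
print(novelty("the quick brown fox", corpus))  # 0.0
print(novelty("completely different words here", corpus))
```

Whether `ratio()` is the right similarity is of course up for
debate; it is just one cheap, readily available choice.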

That sounds easier, but not necessarily so, because the same
question can be asked at the general level as well.

"This entry was unlike all others. However bits and pieces of
it appear all over the place."
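That "bits and pieces" case can be made concrete: an entry can
be unlike every single old entry while almost all of its small
pieces already occur somewhere. A sketch using character
trigrams as the pieces (the choice of n = 3 is arbitrary):

```python
def ngrams(s: str, n: int = 3) -> set[str]:
    """All length-n substrings of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def piece_overlap(entry: str, corpus: list[str], n: int = 3) -> float:
    """Fraction of the entry's n-grams found anywhere in the old data."""
    pieces = ngrams(entry, n)
    if not pieces:
        return 0.0
    # Union of every n-gram that occurs in any old entry.
    seen = set().union(*(ngrams(old, n) for old in corpus))
    return len(pieces & seen) / len(pieces)
```

A high piece_overlap combined with a high whole-entry novelty
score is exactly the "unlike all others, yet assembled from
familiar bits" situation.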

See? How can we define what is original in a way that 1.
makes sense and 2. is useful for this application?

> Not related to Debian, putting "[OT]" in the subject.

I forgot to say: one is expected to use only software from the
Debian repos or from other sources readily available on
FOSS Unix-like systems.

-- 
underground experts united
https://dataswamp.org/~incal
