On 09/04/2022 13:14, Christian Gollwitzer wrote:
On 08.04.22 at 09:21, Antoon Pardon wrote:
The first is really hard. Not only may information be missing, no
single piece of information is unique or immutable. Two people may have
the same name (I know about several other "Peter Holzer"s), a single
person might change their name (when I was younger I went by my middle
name - how would you know that "Peter Holzer" and "Hansi Holzer" are the
same person?), they will move (= change their address), change jobs,
etc. Unless you have a unique immutable identifier that's enforced by
some authority (like a social security number[1]), I don't think there
is a chance to do that reliably in a program (although with enough data,
a heuristic may be good enough).
Yes I know all that. That is why I keep a bucket of possible duplicates
per "identifying" field that is examined and use some heuristics at the
end of all the comparing instead of starting to weed out the duplicates
at the moment something differs.
The problem is that when an identifying field is judged to be unusable,
the bucket to be associated with it should conceptually contain all other
records (which in this case are the indexes into the population list).
But that will eat a lot of memory. So I want some object that behaves as
if it is an (immutable) list of all these indexes without actually
containing them. A range object almost works; the only problem is that it
is not comparable with a list.
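
One way such an object might look is sketched below. This is only an
illustration, not code from the thread: the class name AllIndexes is made
up, and it assumes the indexes are simply 0..n-1 and that "comparable"
means list-style equality and lexicographic ordering.

```python
from collections.abc import Sequence
from functools import total_ordering


@total_ordering
class AllIndexes(Sequence):
    """Read-only view of the indexes 0..n-1 (hypothetical helper).

    Backed by a range, so it uses constant memory however large n is.
    """

    def __init__(self, n):
        self._r = range(n)

    def __len__(self):
        return len(self._r)

    def __getitem__(self, i):
        return self._r[i]

    def __contains__(self, x):
        return x in self._r  # constant time for ints

    def __eq__(self, other):
        # Element-wise equality, the way two lists would compare.
        if isinstance(other, (list, tuple, range, AllIndexes)):
            return len(self) == len(other) and all(
                a == b for a, b in zip(self, other)
            )
        return NotImplemented

    def __lt__(self, other):
        # Lexicographic ordering, the way two lists would compare.
        if isinstance(other, (list, tuple, range, AllIndexes)):
            for a, b in zip(self, other):
                if a != b:
                    return a < b
            return len(self) < len(other)
        return NotImplemented
```

Because list.__eq__ and list.__lt__ return NotImplemented for unknown
types, Python falls back to the reflected methods here, so comparisons
work with the list on either side.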
Then write your own comparator function?
Also, if the only case where this actually works is the index of all
other records, then a simple boolean flag "all" vs. "these items in the
index list" would suffice, wouldn't it?
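
A minimal sketch of that flag idea, under assumptions not stated in the
thread: buckets are sets of indexes, the final heuristic combines them by
intersection, and the names ALL and intersect are purely illustrative.

```python
ALL = None  # hypothetical sentinel: field unusable, bucket means "every index"


def intersect(bucket_a, bucket_b):
    """Intersect two buckets of indexes, where ALL stands for everything."""
    if bucket_a is ALL:
        return bucket_b
    if bucket_b is ALL:
        return bucket_a
    return bucket_a & bucket_b
```

With this, an unusable field costs no memory at all, and combining it with
any real bucket just returns the real bucket unchanged.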
Christian
Writing a comparator function is only possible for a given key. So my
approach would be:
1) Write a comparator function that takes params X and Y, such that:
     if key data is missing from X, return 1
     if key data is missing from Y, return -1
     if X > Y, return 1
     if X < Y, return -1
     return 0 # They are equal and key data for both is present
2) Sort the data using the comparator function.
3) Run through the data with a trailing enumeration loop, merging
matching records together.
4) If there are no records copied out with missing key data, then you
are done, so exit.
5) Choose a new key and repeat from step 1).
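
Steps 1) to 3) might be sketched roughly as follows. This is only an
illustration under several assumptions: records are dicts, a missing key
is represented by None, and "merging" means filling empty fields of the
first record from later matches. The names make_cmp and merge_pass are
made up. The sketch also returns 0 when the key is missing from both
records, a case the outline above does not cover, so that the comparator
stays consistent.

```python
from functools import cmp_to_key


def make_cmp(key):
    """Build a comparator for one key; records lacking it sort to the end."""
    def cmp(x, y):
        kx, ky = x.get(key), y.get(key)
        if kx is None and ky is None:
            return 0   # assumption: both missing compare equal
        if kx is None:
            return 1
        if ky is None:
            return -1
        if kx > ky:
            return 1
        if kx < ky:
            return -1
        return 0       # equal, and key data for both is present
    return cmp


def merge_pass(records, key):
    """Sort by key, then merge neighbouring records whose keys match.

    Returns (merged, missing): the merged records, and the records that
    lacked the key and still need a pass with another key.
    """
    ordered = sorted(records, key=cmp_to_key(make_cmp(key)))
    merged, missing = [], []
    for rec in ordered:
        if rec.get(key) is None:
            missing.append(rec)
        elif merged and merged[-1][key] == rec[key]:
            # Same key as the trailing record: fill in its empty fields.
            for field, value in rec.items():
                if merged[-1].get(field) is None:
                    merged[-1][field] = value
        else:
            merged.append(dict(rec))
    return merged, missing
```

Steps 4) and 5) would then be a loop that calls merge_pass with the next
key for as long as the missing list is not empty.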
Regards
Ian
--
Ian Hobson
Tel (+66) 626 544 695