BrJohan wrote:
> On 11/06/2014 14:23, BrJohan wrote:
>> For some genealogical purposes I consider using Python's re module.
>>
>> Rather many names can be spelled in a number of similar ways, and in
>> order to match names even if they are spelled differently, I will build
>> regular expressions, each of which is supposed to match a number of
>> similar names.
>>
>> I guess that there will be a few hundred such regular expressions
>> covering most popular names.
>>
>> Now, my problem: Is there a way to decide whether any two - or more - of
>> those regular expressions will match the same string?
>>
>> Or, stated a little differently:
>>
>> Can it, for a pair of regular expressions be decided whether at least
>> one string matching both of those regular expressions, can be
>> constructed?
>>
>> If it is possible to make such a decision, then how? Anyone aware of an
>> algorithm for this?
>
> Thank you all for valuable input and interesting thoughts.
>
> After having reconsidered my problem, it might be better to approach it
> a little differently.
>
> Either to state the regexps simply like:
> "(Kristina)|(Christina)|(Cristine)|(Kristine)"
> instead of "((K|(Ch))ristina)|([CK]ristine)"
>
> Or to put the namevariants in some sequence of sets having elements like:
> ("Kristina", "Christina", "Cristine", "Kristine")
> Matching is then just applying the 'in' operator.
>
> I see two distinct advantages.
> 1. Readability and maintainability
> 2. Any namevariant occurring in just one regexp or set means no risk of
> erroneous matching.
>
> Comments?

I like the simple variant

    kristinas = ("Kristina", "Christina", "Cristine", "Kristine")

But instead of matching with "in" you could build a dict that maps the
name variants to a normalised name

    normalized_names = {
        "Kristina": "Kristina",
        "Christina": "Kristina",
        ...
        "John": "John",
        "Johann": "John",
        ...
    }

    def normalized(name):
        return normalized_names.get(name, name)

If you put persons in another dict or a database indexed by the
normalised name

    lookup = {
        "Kristina": ["Kristina Smith", "Christina Miller"],
        ...
    }

you can find all Kristinas with two look-ups:

    >>> lookup[normalized("Kristine")]
    ['Kristina Smith', 'Christina Miller']

PS: A problem with this approach might be that

    (name in nameset_A) and (name in nameset_B)

implies

    nameset_A == nameset_B

i.e. a name variant can belong to only one set, because each dict key
maps to exactly one normalised name.
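
A minimal sketch of how the dict could be generated from the tuples of variants instead of being typed by hand, while also reporting any variant that turns up in more than one group (the situation described in the PS). The names name_groups and build_normalized_names are made up for this illustration, and treating the first spelling in each tuple as the normalised name is an assumption, not something settled in this thread.

```python
from collections import defaultdict

# Hypothetical sample data. By assumption, the first spelling in each
# group is used as the normalised name.
name_groups = [
    ("Kristina", "Christina", "Cristine", "Kristine"),
    ("John", "Johann"),
]


def build_normalized_names(groups):
    """Map every variant to the first name of its group.

    Raises ValueError if a variant occurs in more than one group.
    """
    # For each variant, collect the normalised names of all groups
    # that contain it.
    owners = defaultdict(set)
    for group in groups:
        for variant in group:
            owners[variant].add(group[0])

    conflicts = {v: sorted(c) for v, c in owners.items() if len(c) > 1}
    if conflicts:
        raise ValueError(f"variants in more than one group: {conflicts}")

    # Every remaining set holds exactly one normalised name.
    return {variant: next(iter(c)) for variant, c in owners.items()}


normalized_names = build_normalized_names(name_groups)


def normalized(name):
    # Unknown names are returned unchanged, as with dict.get() above.
    return normalized_names.get(name, name)
```

With plain tuples the overlap question from the original post becomes a simple membership count, so a conflict is caught when the table is built rather than at matching time.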