On Sat, 28 Dec 2019, Richard Earnshaw (lists) wrote:

> My suggestion would be that we try to canonicalize all the author
> entries to UTF-8 as that avoids the limitations of ISO-8859-1, but that
> would probably need further fixups to detect the additional names that
> need rewriting.

What I've implemented in bugdb.py already includes converting ISO-8859-1 to UTF-8 (in any case where the author name is not valid UTF-8 - a general property of text encodings is that if something is valid UTF-8, it almost certainly is encoded in ASCII or UTF-8 already), with special handling of NBSP and with fixups for all the cases where the results of converting ISO-8859-1 to UTF-8 looked wrong (i.e. where it looked like the name in the original ChangeLog was not in fact ISO-8859-1).

I've also now made bugdb.py check the list of fixups both before and after recoding. This may help in some cases where e.g. a fixup is putting a name in canonical form: such a fixup then doesn't need to be given in both UTF-8 and ISO-8859-1 forms, even if the name appears with both those encodings in the history.

Because the author extraction is based on the ChangeLog entry included in the original commit, any subsequent commits that (wrongly or correctly) recoded ChangeLog entries are not relevant.

-- 
Joseph S. Myers
j...@polyomino.org.uk
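
For illustration only, the recoding approach described in the message above could be sketched roughly as follows. This is not the actual bugdb.py code: the names `FIXUPS`, `recode` and `canonical_author`, the sample author name, and the way NBSP is handled (replaced by an ordinary space) are all assumptions made for the example.

```python
# Hypothetical sketch; none of these names come from bugdb.py.

# Hypothetical fixup table: author name as found (bytes) -> canonical name.
# Only the UTF-8 spelling is listed; the ISO-8859-1 spelling is caught by
# the second lookup, after recoding.
FIXUPS = {
    "J. Doé".encode("utf-8"): "Jane Doé",
}


def recode(raw: bytes) -> str:
    """Decode as UTF-8 if valid, otherwise fall back to ISO-8859-1."""
    try:
        name = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte sequence decodes as ISO-8859-1, so this cannot fail.
        name = raw.decode("iso-8859-1")
    # Assumption: "special handling of NBSP" is modelled here as replacing
    # U+00A0 with an ordinary space; the message does not say what it does.
    return name.replace("\u00a0", " ")


def canonical_author(raw: bytes) -> str:
    """Apply fixups both before and after recoding the author name."""
    # First lookup: the name exactly as it appears in the ChangeLog entry.
    if raw in FIXUPS:
        return FIXUPS[raw]
    name = recode(raw)
    # Second lookup: the same table, keyed by the recoded (UTF-8) form.
    return FIXUPS.get(name.encode("utf-8"), name)
```

With a table like this, `canonical_author(b"J. Do\xc3\xa9")` (UTF-8 input) matches on the first lookup and `canonical_author(b"J. Do\xe9")` (ISO-8859-1 input) matches on the second, so a single fixup entry covers both encodings.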