On 28/12/2019 17:14, Joseph Myers wrote:
> On Sat, 28 Dec 2019, Richard Earnshaw (lists) wrote:
>
>> My suggestion would be that we try to canonicalize all the author
>> entries to UTF-8 as that avoids the limitations of ISO-8859-1, but that
>> would probably need further fixups to detect the additional names that
>> need rewriting.
>
> What I've implemented in bugdb.py already includes converting ISO-8859-1
> to UTF-8 (in any case where the author name is not valid UTF-8 - a general
> property of text encodings is that if something is valid UTF-8, it almost
> certainly is already encoded in ASCII or UTF-8 already), with special
> handling of NBSP and with fixups for all the cases where the results of
> converting ISO-8859-1 to UTF-8 looked wrong (i.e. where it looked like the
> name in the original ChangeLog was not in fact UTF-8).
>
> I've also now made bugdb.py check the list of fixups both before and after
> recoding (which may help in some cases where e.g. a fixup is putting a
> name in canonical form, meaning such a fixup doesn't need to be given in
> forms with both UTF-8 and ISO-8859-1 encodings even if the name appears
> with both those encodings in the history).
>
> Because the author extraction is based on the ChangeLog entry included in
> the original commit, any subsequent commits that (wrongly or correctly)
> recoded ChangeLog entries are not relevant.
>

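For reference, the recoding rule described above could be sketched roughly as follows. This is not the actual bugdb.py code: the function names, the shape of the fixup table (keyed by raw bytes) and the choice to turn NBSP into a plain space are all assumptions made for illustration.

```python
NBSP = "\u00a0"


def recode(raw):
    """Decode raw bytes as UTF-8, falling back to ISO-8859-1 if invalid."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte sequence is valid ISO-8859-1, so this cannot fail.
        return raw.decode("iso-8859-1")


def canonical_author(raw, fixups):
    """Return a canonical author name for raw ChangeLog bytes.

    `fixups` is a hypothetical table mapping the bytes of a name as it
    appears in the history to the preferred spelling (a str).
    """
    # First lookup: the name exactly as it appears, before recoding.
    if raw in fixups:
        return fixups[raw]

    name = recode(raw)
    # Assumed NBSP handling: treat it as an ordinary space.
    name = name.replace(NBSP, " ")

    # Second lookup: the recoded name, re-encoded as UTF-8. A fixup given
    # only in UTF-8 form therefore also catches the ISO-8859-1 spelling.
    return fixups.get(name.encode("utf-8"), name)
```

With a table such as `{"Jose Example".encode("utf-8"): "José Example"}`, the same entry would match the name whether it arrived as UTF-8 or as ISO-8859-1 bytes, which is the point of checking both before and after recoding.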
I've added the list of emails that I posted yesterday to the conversion
scripts. I've not written anything to reprocess that list yet; I want to
leave that until we've completed the general review of the changes we
want. For now, auto-generating that data from the list will probably be
easier than maintaining it inside bugdb.py.

R.