On Sat, 28 Dec 2019, Richard Earnshaw (lists) wrote:

> My suggestion would be that we try to canonicalize all the author
> entries to UTF-8 as that avoids the limitations of ISO-8859-1, but that
> would probably need further fixups to detect the additional names that
> need rewriting.

What I've implemented in bugdb.py already includes converting ISO-8859-1 
to UTF-8 (in any case where the author name is not valid UTF-8 - a general 
property of text encodings is that if something is valid UTF-8, it almost 
certainly is in fact encoded in ASCII or UTF-8), with special 
handling of NBSP and with fixups for all the cases where the results of 
converting ISO-8859-1 to UTF-8 looked wrong (i.e. where it looked like the 
name in the original ChangeLog was not in fact ISO-8859-1).
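
As a rough illustration of the recoding step described above (this is a 
minimal sketch, not the actual bugdb.py code): the function name 
recode_author is hypothetical, and replacing NBSP with an ordinary space 
is only an assumption about what "special handling" might mean.

```python
def recode_author(raw: bytes) -> str:
    """Return an author name as text, treating non-UTF-8 input as ISO-8859-1."""
    try:
        # Bytes that decode as UTF-8 are almost certainly ASCII or UTF-8.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte sequence is valid ISO-8859-1, so this cannot fail,
        # but the result may look wrong if the input used another encoding.
        text = raw.decode("iso-8859-1")
    # Assumption: NBSP (U+00A0) is normalized to a plain space.
    return text.replace("\u00a0", " ")
```

Names for which the ISO-8859-1 fallback gives a wrong-looking result 
would still need an explicit fixup.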

I've also now made bugdb.py check the list of fixups both before and after 
recoding (which may help in some cases where e.g. a fixup is putting a 
name in canonical form, meaning such a fixup doesn't need to be given in 
forms with both UTF-8 and ISO-8859-1 encodings even if the name appears 
with both those encodings in the history).
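
A sketch of how checking fixups both before and after recoding could 
work, under stated assumptions: the table layout (raw bytes mapped to the 
canonical name), the function names and the example names are all 
hypothetical and not taken from bugdb.py.

```python
from typing import Dict

# Hypothetical fixup table: name as found in the ChangeLog -> canonical name.
EXAMPLE_FIXUPS: Dict[bytes, str] = {
    # Given once, in UTF-8 form only.
    "J. Müller".encode("utf-8"): "Jürgen Müller",
    # ISO-8859-2 bytes, which would look wrong if recoded as ISO-8859-1.
    b"Pawe\xb3 Example": "Paweł Example",
}

def recode(raw: bytes) -> str:
    """Decode as UTF-8 if valid, otherwise fall back to ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")

def apply_fixups(raw: bytes, fixups: Dict[bytes, str]) -> str:
    # First check: the name exactly as it appears in the history.
    if raw in fixups:
        return fixups[raw]
    # Second check: the name after recoding to UTF-8.
    text = recode(raw)
    return fixups.get(text.encode("utf-8"), text)
```

With this arrangement the single UTF-8 entry for "J. Müller" also matches 
the ISO-8859-1 spelling b"J. M\xfcller", because that spelling is recoded 
before the second lookup.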

Because the author extraction is based on the ChangeLog entry included in 
the original commit, any subsequent commits that (wrongly or correctly) 
recoded ChangeLog entries are not relevant.

-- 
Joseph S. Myers
j...@polyomino.org.uk
