Re: Git conversion: fixing email addresses from ChangeLog files

Richard Earnshaw (lists) Sat, 28 Dec 2019 09:24:08 -0800

On 28/12/2019 17:14, Joseph Myers wrote:
> On Sat, 28 Dec 2019, Richard Earnshaw (lists) wrote:
> 
>> My suggestion would be that we try to canonicalize all the author
>> entries to UTF-8 as that avoids the limitations of ISO-8859-1, but that
>> would probably need further fixups to detect the additional names that
>> need rewriting.
> 
> What I've implemented in bugdb.py already includes converting ISO-8859-1 
> to UTF-8 (in any case where the author name is not valid UTF-8 - a general 
> property of text encodings is that if something is valid UTF-8, it almost 
> certainly is encoded in ASCII or UTF-8 already), with special 
> handling of NBSP and with fixups for all the cases where the results of 
> converting ISO-8859-1 to UTF-8 looked wrong (i.e. where it looked like the 
> name in the original ChangeLog was not in fact ISO-8859-1).
> 
> I've also now made bugdb.py check the list of fixups both before and after 
> recoding (which may help in some cases where e.g. a fixup is putting a 
> name in canonical form, meaning such a fixup doesn't need to be given in 
> forms with both UTF-8 and ISO-8859-1 encodings even if the name appears 
> with both those encodings in the history).
> 
> Because the author extraction is based on the ChangeLog entry included in 
> the original commit, any subsequent commits that (wrongly or correctly) 
> recoded ChangeLog entries are not relevant.
>
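
For anyone following along, the recoding described above can be sketched
roughly as below.  This is only an illustration, not the code in bugdb.py:
the function name canonical_author, the byte-keyed fixups table and the
example names are hypothetical, and replacing NBSP with an ordinary space
is an assumption about what the "special handling" does.

```python
from typing import Dict


def canonical_author(raw: bytes, fixups: Dict[bytes, str]) -> str:
    """Return a UTF-8-clean author name for a raw ChangeLog name.

    Hypothetical sketch: `fixups` maps a name as it appears (as bytes)
    to the canonical form we want in the converted history.
    """
    # First lookup: the bytes exactly as they appear in the ChangeLog.
    if raw in fixups:
        return fixups[raw]

    try:
        # Valid UTF-8 is taken at face value.
        name = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise assume ISO-8859-1; every byte sequence decodes in
        # that encoding, so this step cannot fail.
        name = raw.decode("iso-8859-1")

    # Assumption: NBSP (U+00A0) is replaced by an ordinary space.
    name = name.replace("\u00a0", " ")

    # Second lookup: after recoding, so a single UTF-8 fixup entry also
    # covers the ISO-8859-1 spelling of the same name.
    return fixups.get(name.encode("utf-8"), name)


# Illustrative data only; not a real fixup from the conversion.
example_fixups = {"J. Müller".encode("utf-8"): "Jürgen Müller"}
latin1_result = canonical_author("J. Müller".encode("iso-8859-1"), example_fixups)
utf8_result = canonical_author("J. Müller".encode("utf-8"), example_fixups)
```

With the second lookup in place, both spellings of the example name resolve
to the same canonical form from one fixup entry, which is the point of
checking the list both before and after recoding.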


I've added the list of emails that I posted yesterday to the conversion
scripts.  I've not written anything to reprocess that yet.  I want to
leave that until we've completed the general review of the changes we
want.  Auto-generating that data from the list will probably
be easier than maintaining it inside bugdb.py for now.

R.
