On 28/12/2019 12:04, Jakub Jelinek wrote:
> On Fri, Dec 27, 2019 at 07:47:02PM +0000, Richard Earnshaw (lists) wrote:
>> Email addresses from the ChangeLog files are not validated during
>> commits, so a number of typos exist in the extracted data.  I've
>> extracted the 'Author:' entry from a prototype conversion and then piped
>> that through sort and uniq -c.  Subsequent analysis shows the following
>> addresses/names that are likely in need of some resolution.
>>
>> Several of the names differ only in case and a case-insensitive sort
>> would ignore these.  But some clearly show typing errors.
> 
> Looks like quite a mixed bag of various issues.  Some are caused by some level
> of dysgraphia of lots of us and the right choice is quite obvious, others
> are just that the same name can have different forms and all of them are ok
> (Bradley vs. Brad, Jeffrey vs. Jeff), others whether middle name is used or
> not (and whether . appears after the first letter of middle name or not),
> and another category is whether non-accented name is used, or accented one.
> If we need to choose just one between the various correct forms of
> name/email, I'd say it is important that we at least don't choose forms with
> encoding issues.

I don't know whether tools that analyse git repos to generate statistics
about users' contributions care about canonicalization of names; they may
just key off email addresses.  I'm not going to try to fix those
up unless specifically asked.  For example, some users have

<first.last@domain>

and

<userid@domain>


> Should we give people the choice of accented vs. non-accented form, e.g.
> Uroš started using the accented form recently, using non-accented form in
> the past probably because the encoding of the ChangeLog files in the past used
> to be non-trustable, I remember several times when some people committed a
> ChangeLog change that recoded everything in there as if e.g. the input was
> in ISO-8859-1 and saved it as UTF-8 (which did quite some harm to already
> UTF-8 encoded names).
> E.g. in
>       7 Author: Tobias Schl??ter <tobias.schlue...@physik.uni-muenchen.de>
>      10 Author: Tobias Schl?ter <tobias.schlue...@physik.uni-muenchen.de>
> *    167 Author: Tobias Schlueter <tobias.schlue...@physik.uni-muenchen.de>
>        3 Author: Tobias Schlueter <tobias.shclue...@physik.uni-muenchen.de>
>        2 Author: Tobias Schlueter <tobis.schlue...@physik.uni-muenchen.de>
>       94 Author: Tobias Schl"uter <tobias.schlue...@physik.uni-muenchen.de>
>  
> *     39 Author: Tobias Schl?ter <t...@gcc.gnu.org>
>        1 Author: Tobias Schlueter <t...@gcc.gnu.org>
>        1 Author: Tobias Schl?uter <t...@gcc.gnu.org>
>        4 Author: Tobias Schlüter <t...@gcc.gnu.org>
> the second * certainly doesn't look right even when it is most common,
> Tobias' name is surely Tobias Schlüter with possible transliteration
> Tobias Schlueter.
> *      7 Author: OndÅ?ej Bílka <nel...@seznam.cz>
>        3 Author: Ond?ej Bílka <nel...@seznam.cz>
> Neither of these is correct and the first one looks like an example of the
> ISO-8859-1 to UTF-8 recodings, the latter like ISO-8859-2 encoded name,
> the correct name is Ondřej Bílka in UTF-8 accented and Ondrej Bilka if
> non-accented.
> 
>       Jakub
> 

Yes, accents are tricky.  Some people care massively, others not at all.
I think the above renderings come from the fact that the file I posted
contains more than one encoding style and that breaks any tools that try
to automatically resolve the encoding.  When I grep the email for Ondřej
directly, the second entry displays correctly in a UTF-8 locale.
When I load the attachment I posted yesterday into Emacs it doesn't even
try to render the extended characters and simply displays them as hex
values with a leading \.
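
Since a whole-file guess fails on mixed input, one option is to decode
line by line.  A minimal sketch, assuming each line uses a single encoding
and that anything that is not valid UTF-8 is in one known single-byte
charset; decode_line and its fallback parameter are names I made up for
illustration:

```python
def decode_line(raw, fallback="iso-8859-1"):
    """Return (text, encoding_used) for one line of bytes.

    Strict UTF-8 is tried first.  The fallback is a single-byte charset
    that accepts any byte value, so the choice of fallback determines how
    non-UTF-8 lines are interpreted.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode(fallback), fallback

# The same name stored two different ways, as in a mixed-encoding file.
mixed = [
    "Tobias Schlüter".encode("utf-8"),
    "Tobias Schlüter".encode("iso-8859-1"),
]
decoded = [decode_line(raw)[0] for raw in mixed]

# An ISO-8859-2 line only comes out right if the fallback is told so.
latin2 = decode_line("Ondřej Bílka".encode("iso-8859-2"), fallback="iso-8859-2")
```

The weak point is the fallback: ISO-8859-1 and ISO-8859-2 cannot be told
apart from the bytes alone, so the choice has to be made per author.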

My suggestion would be that we try to canonicalize all the author
entries to UTF-8 as that avoids the limitations of ISO-8859-1, but that
would probably need further fixups to detect the additional names that
need rewriting.
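
One such fixup, sketched under the assumption that the damage is the
recoding Jakub describes (UTF-8 bytes read as ISO-8859-1 and saved again
as UTF-8).  undo_double_encoding is a hypothetical helper, not part of any
existing conversion script:

```python
def undo_double_encoding(text):
    """Reverse one round of 'UTF-8 read as ISO-8859-1' recoding.

    If the text cannot be mapped back to ISO-8859-1 bytes, or those bytes
    are not valid UTF-8, the text is assumed to be fine and is returned
    unchanged.
    """
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

# Reproduce the damage on a correct name, then repair it.
damaged = "Ondřej Bílka".encode("utf-8").decode("iso-8859-1")
repaired = undo_double_encoding(damaged)

# Names that are already correct pass through untouched.
unchanged = undo_double_encoding("Tobias Schlüter")
```

This would not help with entries where the accented byte has already been
replaced by a literal '?', and a name that legitimately contains a
sequence such as "Ã©" would be altered, so any rewritten names would still
want a manual check.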

R.
