On Fri, Dec 27, 2019 at 07:47:02PM +0000, Richard Earnshaw (lists) wrote:
> Email addresses from the ChangeLog files are not validated during
> commits, so a number of typos exist in the extracted data.  I've
> extracted the 'Author:' entry from a prototype conversion and then piped
> that through sort and uniq -c.  Subsequent analysis shows the following
> addresses/names that are likely in need of some resolution.
> 
> Several of the names differ only in case and a case-insensitive sort
> would ignore these.  But some clearly show typing errors.

This looks like quite a mixed bag of issues.  Some are caused by a degree of
dysgraphia that many of us share, and there the right choice is quite obvious.
Others arise because the same name can take different forms that are all fine
(Bradley vs. Brad, Jeffrey vs. Jeff), or depend on whether a middle name is
used (and whether a . follows the middle initial).  Another category is
whether the accented or the non-accented form of a name is used.
If we need to choose just one among the various correct forms of a
name/email, I'd say it is important that we at least don't choose forms with
encoding issues.
Should we give people the choice of accented vs. non-accented form?  E.g.
Uroš started using the accented form recently; he probably used the
non-accented form in the past because the encoding of the ChangeLog files
used to be untrustworthy.  I remember several occasions when people committed
a ChangeLog change that recoded everything in the file as if the input were
ISO-8859-1 and saved it as UTF-8 (which did quite some harm to names that
were already UTF-8 encoded).
E.g. in
      7 Author: Tobias Schl??ter <tobias.schlue...@physik.uni-muenchen.de>
     10 Author: Tobias Schl?ter <tobias.schlue...@physik.uni-muenchen.de>
*    167 Author: Tobias Schlueter <tobias.schlue...@physik.uni-muenchen.de>
       3 Author: Tobias Schlueter <tobias.shclue...@physik.uni-muenchen.de>
       2 Author: Tobias Schlueter <tobis.schlue...@physik.uni-muenchen.de>
      94 Author: Tobias Schl"uter <tobias.schlue...@physik.uni-muenchen.de>
 
*     39 Author: Tobias Schl?ter <t...@gcc.gnu.org>
       1 Author: Tobias Schlueter <t...@gcc.gnu.org>
       1 Author: Tobias Schl?uter <t...@gcc.gnu.org>
       4 Author: Tobias Schlüter <t...@gcc.gnu.org>
the second * certainly doesn't look right even though it is the most common;
Tobias' name is surely Tobias Schlüter, with the possible transliteration
Tobias Schlueter.
*      7 Author: OndÅ?ej Bílka <nel...@seznam.cz>
       3 Author: Ond?ej Bílka <nel...@seznam.cz>
Neither of these is correct.  The first looks like an example of the
ISO-8859-1 to UTF-8 recodings, the latter like an ISO-8859-2 encoded name.
The correct name is Ondřej Bílka in accented UTF-8 and Ondrej Bilka if
non-accented.
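
To illustrate the two kinds of damage described above, here is a rough Python
sketch.  It is not part of the conversion tooling; the function names are
made up for the example and only Python's standard codecs are used.  It
assumes the first kind of damage came from reading UTF-8 bytes as ISO-8859-1
and saving them as UTF-8 again, and the second from ISO-8859-2 bytes being
read as UTF-8.

```python
# Sketch only: function names are hypothetical, chosen for this example.

def recode_as_latin1(name: str) -> str:
    """Read the UTF-8 bytes of a name as if they were ISO-8859-1."""
    return name.encode("utf-8").decode("iso-8859-1")


def undo_recode(damaged: str) -> str:
    """Reverse recode_as_latin1.

    Raises UnicodeError if the text was not damaged in exactly that way,
    so it is not a general repair tool.
    """
    return damaged.encode("iso-8859-1").decode("utf-8")


def read_latin2_as_utf8(name: str) -> str:
    """Store a name as ISO-8859-2 bytes, then decode those bytes as UTF-8.

    Bytes that are not valid UTF-8 become U+FFFD, which many tools
    display as '?'.
    """
    return name.encode("iso-8859-2").decode("utf-8", errors="replace")


# ascii() is used so that unprintable characters show up as escapes.
print(ascii(recode_as_latin1("Ondřej")))     # 'Ond\xc5\x99ej'
print(ascii(read_latin2_as_utf8("Ondřej")))  # 'Ond\ufffdej'
print(undo_recode(recode_as_latin1("Schlüter")))  # Schlüter
```

The first result is 'Å' followed by an unprintable control character, which
would be consistent with the "OndÅ?ej" form; the second has a single
replacement character, consistent with "Ond?ej".  The reversal only works
while the intermediate bytes are intact, so entries where a '?' has already
replaced the original byte would still need fixing by hand.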

        Jakub
