On Fri, Dec 27, 2019 at 07:47:02PM +0000, Richard Earnshaw (lists) wrote:
> Email addresses from the ChangeLog files are not validated during
> commits, so a number of typos exist in the extracted data.  I've
> extracted the 'Author:' entry from a prototype conversion and then piped
> that through sort and uniq -c.  Subsequent analysis shows the following
> addresses/names that are likely in need of some resolution.
>
> Several of the names differ only in case and a case-insensitive sort
> would ignore these.  But some clearly show typing errors.

Looks like quite a mixed bag of various issues.  Some are caused by some
level of dysgraphia in lots of us, and the right choice is quite obvious;
in others the same name simply has different forms and all of them are ok
(Bradley vs. Brad, Jeffrey vs. Jeff); others differ in whether the middle
name is used or not (and whether a . appears after the first letter of
the middle name or not); and another category is whether the non-accented
or the accented name is used.

If we need to choose just one between the various correct forms of
name/email, I'd say it is important that we at least don't choose forms
with encoding issues.  Should we give people the choice of accented vs.
non-accented form?  E.g. Uroš started using the accented form recently,
having used the non-accented form in the past, probably because the
encoding of the ChangeLog files used to be untrustworthy.  I remember
several times when some people committed a ChangeLog change that recoded
everything in there as if e.g. the input was in ISO-8859-1 and saved it
as UTF-8 (which did quite some harm to already UTF-8 encoded names).

E.g. in
      7 Author: Tobias Schl??ter <tobias.schlue...@physik.uni-muenchen.de>
     10 Author: Tobias Schl?ter <tobias.schlue...@physik.uni-muenchen.de>
*   167 Author: Tobias Schlueter <tobias.schlue...@physik.uni-muenchen.de>
      3 Author: Tobias Schlueter <tobias.shclue...@physik.uni-muenchen.de>
      2 Author: Tobias Schlueter <tobis.schlue...@physik.uni-muenchen.de>
     94 Author: Tobias Schl"uter <tobias.schlue...@physik.uni-muenchen.de>
*    39 Author: Tobias Schl?ter <t...@gcc.gnu.org>
      1 Author: Tobias Schlueter <t...@gcc.gnu.org>
      1 Author: Tobias Schl?uter <t...@gcc.gnu.org>
      4 Author: Tobias Schlüter <t...@gcc.gnu.org>
the second * certainly doesn't look right even though it is the most
common; Tobias' name is surely Tobias Schlüter, with the possible
transliteration Tobias Schlueter.

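The recoding damage mentioned above (already UTF-8 text read as if it
were ISO-8859-1 and then saved as UTF-8 again) can be reproduced with a
minimal Python sketch.  The function names are made up for illustration
and are not part of any conversion script; the undo step only works as
long as no byte was replaced by a literal "?" along the way.

```python
def recode_as_latin1(text: str) -> str:
    """Simulate the bad recoding: UTF-8 bytes misread as ISO-8859-1.

    (Hypothetical helper, for illustration only.)
    """
    return text.encode("utf-8").decode("iso-8859-1")


def undo_recode(text: str) -> str:
    """Reverse the bad recoding, assuming no characters were lost."""
    return text.encode("iso-8859-1").decode("utf-8")


name = "Tobias Schlüter"
damaged = recode_as_latin1(name)
print(damaged)               # Tobias SchlÃ¼ter
print(undo_recode(damaged))  # Tobias Schlüter
```

Each accented character becomes two characters after one such round, so
a name that went through it twice is longer still, and entries where the
bytes were turned into "?" cannot be recovered mechanically.
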
*     7 Author: OndÅ?ej BÃlka <nel...@seznam.cz>
      3 Author: Ond?ej Bílka <nel...@seznam.cz>
Neither of these is correct: the first one looks like an example of the
ISO-8859-1 to UTF-8 recodings, the latter like an ISO-8859-2 encoded
name.  The correct name is Ondřej Bílka in UTF-8 accented form and
Ondrej Bilka if non-accented.

	Jakub