On 28/12/2019 12:04, Jakub Jelinek wrote:
> On Fri, Dec 27, 2019 at 07:47:02PM +0000, Richard Earnshaw (lists) wrote:
>> Email addresses from the ChangeLog files are not validated during
>> commits, so a number of typos exist in the extracted data. I've
>> extracted the 'Author:' entry from a prototype conversion and then piped
>> that through sort and uniq -c. Subsequent analysis shows the following
>> addresses/names that are likely in need of some resolution.
>>
>> Several of the names differ only in case and a case-insensitive sort
>> would ignore these. But some clearly show typing errors.
>
> Looks like quite mixed bag of various issues. Some are caused by some level
> of dysgraphia of lots of us and the right choice is quite obvious, others
> are just that the same name can have different forms and all of them are ok
> (Bradley vs. Brad, Jeffrey vs. Jeff), others whether middle name is used or
> not (and whether . appears after the first letter of middle name or not),
> and another category is whether non-accented name is used, or accented one.
> If we need to choose just one between the various correct forms of
> name/email, I'd say it is important that we at least don't choose forms with
> encoding issues.

I don't know whether tools that analyse git repos to generate statistics
about users' contributions care about canonicalization of names; they may
just key off email addresses. I'm not going to try to fix those up unless
specifically asked. For example, some users have both <first.last@domain>
and <userid@domain>.

> Should we give people the choice of accented vs. non-accented form, e.g.
> Uroš started using the accented form recently, using non-accented form in
> the past probably because the encoding of the ChangeLog files in the past used
> to be non-trustable, I remember several times when some people committed a
> ChangeLog change that recoded everything in there as if e.g. the input was
> in ISO-8859-1 and saved it as UTF-8 (which did quite some harm to already
> UTF-8 encoded names).
> E.g. in
>       7 Author: Tobias Schl??ter <tobias.schlue...@physik.uni-muenchen.de>
>      10 Author: Tobias Schl?ter <tobias.schlue...@physik.uni-muenchen.de>
> *   167 Author: Tobias Schlueter <tobias.schlue...@physik.uni-muenchen.de>
>       3 Author: Tobias Schlueter <tobias.shclue...@physik.uni-muenchen.de>
>       2 Author: Tobias Schlueter <tobis.schlue...@physik.uni-muenchen.de>
>      94 Author: Tobias Schl"uter <tobias.schlue...@physik.uni-muenchen.de>
>
> *    39 Author: Tobias Schl?ter <t...@gcc.gnu.org>
>       1 Author: Tobias Schlueter <t...@gcc.gnu.org>
>       1 Author: Tobias Schl?uter <t...@gcc.gnu.org>
>       4 Author: Tobias Schlüter <t...@gcc.gnu.org>
> the second * certainly doesn't look right even when it is most common,
> Tobias' name is surely Tobias Schlüter with possible transliteration
> Tobias Schlueter.
> *     7 Author: OndÅ?ej BÃlka <nel...@seznam.cz>
>       3 Author: Ond?ej Bílka <nel...@seznam.cz>
> Neither of these is correct and the first one looks like an example of the
> ISO-8859-1 to UTF-8 recodings, the latter like ISO-8859-2 encoded name,
> the correct name is Ondřej Bílka in UTF-8 accented and Ondrej Bilka if
> non-accented.
>
> Jakub
>

Yes, accents are tricky.
Some people care massively, others not at all.

I think the above renderings come from the fact that the file I posted
contains more than one encoding style, and that breaks any tools that try
to automatically resolve the encoding. When I grep the email for Ondřej
directly, it displays correctly for the second one in a UTF-8 locale. When
I load the attachment I posted yesterday into emacs, it doesn't even try to
render the extended characters and simply displays them as hex values with
a leading \.

My suggestion would be that we try to canonicalize all the author entries
to UTF-8, as that avoids the limitations of ISO-8859-1, but that would
probably need further fixups to detect the additional names that need
rewriting.

R.
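
For illustration only, here is a minimal Python sketch of what a per-line canonicalization to UTF-8 might look like. It is not the conversion tooling itself. The function names (decode_author_line, undo_double_encoding, canonicalize) and the fallback parameter are hypothetical, and it assumes that every legacy entry is either valid UTF-8 or in a single-byte ISO-8859 encoding that has to be chosen by hand per author.

```python
def decode_author_line(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode one 'Author:' line whose encoding is not recorded anywhere."""
    try:
        # Text with high-bit bytes in a single-byte encoding is very unlikely
        # to also be well-formed UTF-8, so try the strict UTF-8 decode first.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything else is a legacy single-byte encoding.
        # ISO-8859-1 never fails to decode, but it is the wrong guess for
        # ISO-8859-2 names, which need the fallback overridden manually.
        return raw.decode(fallback)


def undo_double_encoding(text: str) -> str:
    """Reverse a 'UTF-8 read as ISO-8859-1, then saved as UTF-8' round trip."""
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in ISO-8859-1, or not UTF-8 underneath:
        # the text was not double-encoded, so leave it unchanged.
        return text


def canonicalize(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Hypothetical helper: decode a line, then repair double encoding."""
    return undo_double_encoding(decode_author_line(raw, fallback))
```

With these definitions, canonicalize("Tobias Schlüter".encode("iso-8859-1")) and canonicalize("Tobias Schlüter".encode("utf-8")) should both yield the same UTF-8 name, while an ISO-8859-2 entry only comes out right when called with fallback="iso-8859-2". Entries where the accented character was already replaced by a literal ? cannot be recovered this way and would still need a manual mapping.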