On 07/12/23 at 20:24 +0100, Andreas Tille wrote: > Am Thu, Dec 07, 2023 at 07:59:38PM +0100 schrieb Lucas Nussbaum: > > On 07/12/23 at 09:58 +0100, Andreas Tille wrote: > > > Hi, > > > > > > by chance I realised that the uploaders table contains some names where > > > names > > > are not stripped: > > > > > > udd=> select '"' || u.name || '"' as name_with_spaces, uploader from > > > uploaders u where name like '% ' or name like ' %' ; > > > name_with_spaces | uploader > > > --------------------------+------------------------------------------- > > > " Mehdi Dogguy" | Mehdi Dogguy <me...@debian.org> > > > " David Paleino" | David Paleino <da...@debian.org> > > > " Stéphane Glondu" | Stéphane Glondu <glo...@debian.org> > > > " Stefano Zacchiroli" | Stefano Zacchiroli <z...@debian.org> > > > " Stefano Zacchiroli" | Stefano Zacchiroli <z...@debian.org> > > > " Stefano Zacchiroli" | Stefano Zacchiroli <z...@debian.org> > > > " Stefano Zacchiroli" | Stefano Zacchiroli <z...@debian.org> > > > " Stefano Zacchiroli" | Stefano Zacchiroli <z...@debian.org> > > > "Andreas Tille " | Andreas Tille <ti...@debian.org> > > > " LI Daobing" | LI Daobing <lidaob...@debian.org> > > > " David Paleino" | David Paleino <da...@debian.org> > > > " Stefano Zacchiroli" | Stefano Zacchiroli <z...@debian.org> > > > " Nikita V. Youshchenko" | Nikita V. Youshchenko <yo...@debian.org> > > > " Nikita V. Youshchenko" | Nikita V. Youshchenko <yo...@debian.org> > > > " Nikita V. Youshchenko" | Nikita V. Youshchenko <yo...@debian.org> > > > " Nikita V. Youshchenko" | Nikita V. Youshchenko <yo...@debian.org> > > > " Nikita V. Youshchenko" | Nikita V. Youshchenko <yo...@debian.org> > > > "Colin Tuckley " | Colin Tuckley <col...@debian.org> > > > "Colin Tuckley " | Colin Tuckley <col...@debian.org> > > > "Colin Tuckley " | Colin Tuckley <col...@debian.org> > > > (20 rows) > > > ... > > > UPDATE uploaders SET name = trim(name), uploader = trim(name) || ' ' > > > || email WHERE name like ' %' or name like '% ' ; > > > > > > > Uploaders is refreshed every few hours from archive data, so a one-time > > UPDATE would not help. UDD usually tries to preserve inaccuracies, so > > those might be interesting for QA work. > > OK. > > > In your case, why don't you use the email address to identify uploaders? > > Since this also does not work: > > udd=> SELECT count(*), uploader FROM uploaders WHERE name ilike '%tille%' > GROUP BY uploader; > count | uploader > -------+------------------------------------ > 1 | Andreas Tille <ti...@debian.org> > 1 | Andreas Tille <andr...@an3as.eu> > 8785 | Andreas Tille <ti...@debian.org> > (3 Zeilen) > > > (possibly combining it with the carnivore data to identify different emails > > belonging to the same person ?) > > I could fiddle around with carnivore but that's overkill for thst > purpose and I insist that not stripping blanks from names does not make > any sense, IMHO. (1 Zeile) > > > BTW: I found > > udd=> SELECT count(*), name FROM (SELECT CASE WHEN changed_by_name = '' THEN > maintainer_name ELSE changed_by_name END AS name FROM upload_history) uh > WHERE name ilike '%tille%' group by name; > count | name > -------+--------------- > 16524 | Andreas Tille > (1 Zeile) > > So why do I have 8707 uploads per uploaders but 16524 per upload_history? > > Is my assumption wrong that both values should match (modulo some wrongly > spelled names)
If you look at the uploaders table, there are three columns: - 'uploader', than contains the raw data - 'name' and 'email' that contain the parsed (and trimmed) data udd=> select uploader, name, email, count(*) from uploaders where uploader ilike '%tille%' group by 1,2,3; uploader | name | email | count ------------------------------------+-----------------+------------------+------- Andreas Tille <ti...@debian.org> | Andreas Tille | ti...@debian.org | 8785 Andreas Tille <andr...@an3as.eu> | Andreas Tille | andr...@an3as.eu | 1 Andreas Tille <ti...@debian.org> | Andreas Tille | ti...@debian.org | 1 So, just use name and/or email? Lucas