[Cloud] Re: Validating multiple usernames?

2021-09-05 Thread Roy Smith
Sigh.  It's even more complicated than that.  It looks like the "name" entry 
doesn't always match the name you passed in the API call, but is subject to 
case mapping, trailing whitespace stripping, and maybe a few other things?

$ curl -s 'https://en.wikipedia.org/w/api.php?action=query&format=json&list=users&usprop=groups%7Ceditcount%7Cgender&ususers=roySmith|roysmith' | json_pp
{
   "query" : {
      "users" : [
         {
            "name" : "RoySmith",
            "gender" : "unknown",
            "groups" : [
               "sysop",
               "*",
               "user",
               "autoconfirmed"
            ],
            "userid" : 130326,
            "editcount" : 58645
         },
         {
            "name" : "Roysmith",
            "missing" : ""
         }
      ]
   },
   "batchcomplete" : ""
}
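
For anyone scripting this rather than using curl, here is a minimal Python sketch of building the same request in batches. The parameters are the ones in the curl call above and the batch size of 50 comes from the quoted reply below; the helper names (`chunked`, `build_users_query`) are made up for illustration and are not part of any library.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"
BATCH_SIZE = 50  # 500 with the "apihighlimits" right, per the quoted reply


def chunked(names, size=BATCH_SIZE):
    """Yield successive slices of at most `size` names."""
    for start in range(0, len(names), size):
        yield names[start:start + size]


def build_users_query(names):
    """Return the list=users URL for one batch of names (hypothetical helper)."""
    params = {
        "action": "query",
        "format": "json",
        "list": "users",
        "usprop": "groups|editcount|gender",
        "ususers": "|".join(names),  # urlencode turns "|" into %7C
    }
    return API_URL + "?" + urlencode(params)


# One URL per batch; fetching and JSON decoding are left out of this sketch.
urls = [build_users_query(batch) for batch in chunked(["roySmith", "roysmith"])]
```

Each URL can then be fetched with whatever HTTP client the tool already uses.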


I'm assuming the entries in the returned "users" list are guaranteed to be in 
the same order as the input parameters?  I can't find anyplace that says this, 
but it seems logical.  Can somebody confirm that it's true?


> On Sep 4, 2021, at 6:46 PM, Roy Smith wrote:
> 
> It turns out this is a little more complicated than it appeared at first:
> usercontribs and list users have different concepts of "invalid".  If you ask 
> for usercontribs on "1.2.3.4", it's valid.  If you pass in "1.2.3.0/24", you 
> get baduser.  But list users returns:
> 
> {
>     "batchcomplete": "",
>     "query": {
>         "users": [
>             {
>                 "name": "1.2.3.4",
>                 "invalid": ""
>             }
>         ]
>     }
> }
> 
> which I guess makes sense in that context since it can't map it to a userid.  
> I can work around this, but mentioning it for the sake of some poor developer 
> searching the archives N years from now trying to figure it out :-)
> 
> 
>> On Aug 19, 2021, at 6:21 PM, Bryan Davis wrote:
>> 
>> On Thu, Aug 19, 2021 at 4:04 PM Roy Smith wrote:
>>> 
>>> I've got a tool which parses sockpuppet investigation (SPI) pages and does 
>>> some analysis.  One of the steps is I need to validate that all of the 
>>> usernames found in the SPI report are valid.  I do that by sequentially 
>>> calling usercontribs on each name with uclimit=1 and seeing if I get a 
>>> baduser error.
>>> 
>>> This works, but it's slow because I need to make 1 API call for each user.  
>>> For a big SPI case, the time to do this swamps everything else.  Is there a 
>>> more efficient way to do this?  Some API call where I can give it a bunch 
>>> of usernames in a batch and have it tell me which ones are invalid?  
>>> Alternatively, is there a regex I could apply on the client side to test if 
>>> a username is valid?
>>> 
>>> The most common type of invalid name I see is when somebody puts down an 
>>> iprange (e.g. 1.2.4.0/24) as a username.  Testing for that client-side 
>>> would be trivial, but it might miss some others.
>> 
>> You can do lookups in batches of 50 (500 if you have the
>> "apihighlimits" right which is commonly granted by the "Bots" group on
>> movement wikis) with
>> > >.
>> 
>> Here's a quick example:
>> >  
>> >
>> 
>> The results will look something like:
>> ```
>> {
>>     "batchcomplete": true,
>>     "query": {
>>         "users": [
>>             {
>>                 "name": "Bryan Davis",
>>                 "missing": true
>>             },
>>             {
>>                 "userid": 2619078,
>>                 "name": "BryanDavis"
>>             },
>>             {
>>                 "userid": 19474624,
>>                 "name": "BDavis (WMF)"
>>             },
>>             {
>>                 "userid": 24257381,
>>                 "name": "Bd808"
>>             }
>>         ]
>>     }
>> }
>> ```
>> 
>> Bryan
>> -- 
>> Bryan Davis  Technical Engagement  Wikimedia Foundation
>> Principal Software Engineer   Boise, ID USA
>> [[m:User:BDavis_(WMF)]]  irc: bd808
>> ___
>> Cloud mailing list -- cloud@lists.wikimedia.org 
>> 
>> List information: 
>> https://lists.wikimedia.org/postorius/lists/cloud.lists.wikimedia.org/ 
>> 
>> 
> 

[Cloud] Re: Validating multiple usernames?

2021-09-05 Thread Roy Smith
Ugh.  That's not even true.  It looks like all the invalid entries are emitted 
first, then the valid ones.  And duplicates are deduplicated.

So, we're down to you give it a bunch of names, and it gives you back a bunch 
of data which may not have the same number of entries as your input list, the 
entries aren't guaranteed to be in the same order as the input (despite the 
fact that the python mwclient goes out of its way to present it as an 
OrderedDict), and the output keys aren't guaranteed to match the input keys.
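
Given all that, one way to consume the response is to ignore position and count entirely and just group the returned names by status. This is only a sketch: `partition_users` is a hypothetical helper, and the sample list is hand-built from entries seen in this thread, not a real response.

```python
def partition_users(users):
    """Group returned names by status, ignoring order and entry count."""
    result = {"ok": set(), "missing": set(), "invalid": set()}
    for entry in users:
        # The flags show up as "" or as true depending on the output
        # format, so test for the key rather than for its value.
        if "invalid" in entry:
            key = "invalid"
        elif "missing" in entry:
            key = "missing"
        else:
            key = "ok"
        result[key].add(entry.get("name"))
    return result


# Hand-built sample, with the invalid entry first as observed.
sample = [
    {"name": "1.2.3.4", "invalid": ""},
    {"name": "Roysmith", "missing": ""},
    {"name": "RoySmith", "userid": 130326},
]
groups = partition_users(sample)
```

The sets are keyed by the name the API returned, so mapping them back to the raw input strings still needs some normalization on the client side.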

> On Sep 5, 2021, at 3:18 PM, Roy Smith wrote:
> 
> I'm assuming the entries in the returned "users" list are guaranteed to be in 
> the same order as the input parameters?  I can't find anyplace that says 
> this, but it seems logical.  Can somebody confirm that it's true?



[Cloud] Re: Validating multiple usernames?

2021-09-05 Thread Bryan Davis
On Sun, Sep 5, 2021 at 1:18 PM Roy Smith wrote:
>
> Sigh.  It's even more complicated than that.  It looks like the "name" entry 
> doesn't always match the name you passed in the API call, but is subject to 
> case mapping, trailing whitespace stripping, and maybe a few other things?

MediaWiki normalizes usernames using non-trivial rules [0]. The level
of abstraction in this code will have you chase through a number of
classes to figure out all of the rules. The "simple" version of
canonicalizing a username is something like:

* Replace all whitespace characters with underscores (`_`)
* Reduce any runs of multiple underscores to a single underscore
* Trim any leading or trailing underscores from the string
* Capitalize the first character of the string

The real rules are a bit more complicated than this [1] and include
rejecting names containing certain special characters or runs of
characters.
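
As a rough illustration, the four "simple" steps above could be sketched in Python like this. The function name `simple_canonical` is made up for the example, it covers only the simplified rules and none of the rejection logic, and "capitalize" is taken to mean uppercasing the first character only (an assumption that matches `roySmith` coming back as `RoySmith`).

```python
import re


def simple_canonical(name: str) -> str:
    """Approximate the 'simple' username canonicalization steps."""
    s = re.sub(r"\s", "_", name)  # whitespace -> underscores
    s = re.sub(r"_+", "_", s)     # collapse runs of underscores
    s = s.strip("_")              # trim leading/trailing underscores
    # str.capitalize() would lowercase the rest, so uppercase only
    # the first character.
    return s[:1].upper() + s[1:]
```

Note that the API output shown earlier in the thread uses spaces rather than underscores in names (`BDavis (WMF)`), so a client comparing against returned names would need to swap the underscores back to spaces.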


> I'm assuming the entries in the returned "users" list are guaranteed to be in 
> the same order as the input parameters?  I can't find anyplace that says 
> this, but it seems logical.  Can somebody confirm that it's true?

I see you've already figured this out from your follow up message, but
for the sake of future readers, no. Each username provided to the
query is normalized before querying the database and any invalid
usernames are output first [2].

[0]: 
https://github.com/wikimedia/mediawiki/blob/02f7392231ef40a0f928fbd5ec791effc24361ff/includes/user/UserNameUtils.php#L244-L317
[1]: 
https://github.com/wikimedia/mediawiki/blob/02f7392231ef40a0f928fbd5ec791effc24361ff/includes/title/MediaWikiTitleCodec.php#L333-L579
[2]: 
https://github.com/wikimedia/mediawiki/blob/02f7392231ef40a0f928fbd5ec791effc24361ff/includes/api/ApiQueryUsers.php#L156-L173

Bryan
-- 
Bryan Davis  Technical Engagement  Wikimedia Foundation
Principal Software Engineer   Boise, ID USA
[[m:User:BDavis_(WMF)]]  irc: bd808