Got it. I am also looking for rough ideas at this point. The edit rate of
fawiki is not that high, some 5-6K per day
<https://quarry.wmflabs.org/query/32328> (and I am guessing 3-4K if
restricting to article namespace). But note that we only care about edits
in the last 6-12 months, by users who have been active in the last 1-2
months. Older edits may point to a resource you no longer have access to,
and inactive users are not particularly helpful for our purposes either.
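
As a rough sketch of that filter (not a tested bot), assuming the edits have already been fetched as (user, timestamp) pairs: the function name, the 365-day and 60-day defaults, and the use of "latest edit in the sample" as a stand-in for "still active" are all illustrative assumptions.

```python
from datetime import datetime, timedelta


def filter_relevant_edits(edits, now, edit_window_days=365, active_window_days=60):
    """Keep recent edits made by users who are still active.

    `edits` is an iterable of (user, timestamp) pairs. Both window sizes
    are assumptions standing in for "6-12 months" and "1-2 months".
    """
    edits = list(edits)
    edit_cutoff = now - timedelta(days=edit_window_days)
    active_cutoff = now - timedelta(days=active_window_days)

    # Latest edit per user, used here as a proxy for "still active".
    last_seen = {}
    for user, ts in edits:
        if user not in last_seen or ts > last_seen[user]:
            last_seen[user] = ts

    return [
        (user, ts)
        for user, ts in edits
        if ts >= edit_cutoff and last_seen[user] >= active_cutoff
    ]
```

Filtering on both conditions before fetching any revision text should cut down the number of diffs that need to be looked at.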

On Thu, Dec 27, 2018 at 1:57 PM John <phoenixoverr...@gmail.com> wrote:

> What’s fawiki’s edit rate? Processing a diff shouldn’t take more than 1-2
> seconds especially if you optimize the logic. I’m just spitballing ideas at
> this point, but the logic should be easy.
>
> On Thu, Dec 27, 2018 at 12:37 PM Huji Lee <huji.h...@gmail.com> wrote:
>
>> We will never know who "owns" which book. We only know that they have
>> used it as a source a number of times. It could very well be that they can
>> simply borrow that book from a library (as is my case, with a lot of
>> books and journals I have used as sources on Wikipedia).
>>
>> The profiling issue is beyond the scope of this discussion, and I will make
>> sure to mention it on fawiki, but one can already "profile" users from their
>> edits (it is quite easy for people to look at my edits on fawiki and
>> realize that I read and write about Persian music;
>> knowing that I also use some of the books on this topic as my sources
>> wouldn't add much to the picture; of note, my real-world life and identity
>> are unrelated to Persian music or music in general, so profiles are not
>> always that revealing anyway).
>>
>> @John: I had not heard of mwparserfromhell and it is really cool! But how
>> exactly does it come into play? The issue is less about being able to parse
>> wikicode (what we really need is pretty much a regex search for the Persian
>> equivalent of the {{cite book}} template, and a second regex pattern that looks
>> for the "name" parameter inside matches for the first one). Frankly, I am
>> less worried about the steps *after* we have found a "cite book" instance, and
>> more about the steps leading to it (running many, many diffs).
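
The two-regex idea could look roughly like the sketch below. It uses the English "cite book" name as a stand-in for the Persian template, and "title" as an assumed parameter name; both would need to be swapped for the real fawiki names. The non-greedy pattern stops at the first "}}", so a template nested inside a parameter value would cut the match short, which is the kind of case a real wikicode parser such as mwparserfromhell is meant to handle.

```python
import re

# First pattern: capture the parameter body of each {{cite book|...}}.
# "cite book" is a placeholder for the Persian template name.
TEMPLATE_RE = re.compile(
    r"\{\{\s*cite book\s*\|(.*?)\}\}", re.IGNORECASE | re.DOTALL
)


def param_pattern(name):
    """Second pattern: find `name = value` at the start or after a pipe."""
    return re.compile(
        r"(?:^|\|)\s*" + re.escape(name) + r"\s*=\s*([^|]*)", re.IGNORECASE
    )


def extract_param(wikitext, name="title"):
    """Return the value of `name` from every cite-book template found."""
    pattern = param_pattern(name)
    values = []
    for body in TEMPLATE_RE.findall(wikitext):
        match = pattern.search(body)
        if match:
            values.append(match.group(1).strip())
    return values
```

Anchoring the second pattern on a pipe or the start of the body keeps a parameter such as "subtitle" from being mistaken for "title".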
>>
>> Perhaps I am not fully understanding your thoughts, so please elaborate.
>>
>> Thank you both!
>>
>> On Thu, Dec 27, 2018 at 1:24 PM T Paris <tparis.w...@gmail.com> wrote:
>>
>>> Could I ask that you guys make this an “opt in” feature? Both because
>>> it’ll speed up the bot and also because once you start identifying which
>>> books people own, you start to develop a profile on people.
>>>
>>>
>>>
>>> v/r,
>>>
>>> TP
>>>
>>> *From: *Huji Lee <huji.h...@gmail.com>
>>> *Sent: *Thursday, December 27, 2018 11:42 AM
>>> *To: *Labs <lab...@lists.wikimedia.org>
>>> *Subject: *[Cloud] List of users who have access to certain references
>>>
>>>
>>>
>>> This is an idea that came up on fawiki, and there is some merit to it. I
>>> just want to figure out the best approach to implement it and would love
>>> your input.
>>>
>>>
>>>
>>> *TL;DR: *We want to sweep through the recent edits in articles, look at
>>> each diff, see if it contains the addition of a "{{cite book}}" template,
>>> and if so, set it aside for later processing by a separate script.
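
One possible way to do the "does this edit add a cite book" check, sketched under the assumption that the text of the revision and of its parent have already been fetched (the fetching itself, via the MW API or pywikibot, is left out). Instead of parsing a rendered diff, this compares the templates found in the two texts as multisets, which is a swapped-in technique and not necessarily how a final bot would do it. The function names and the "cite book" placeholder name are illustrative.

```python
import re
from collections import Counter

# "cite book" is a placeholder for the Persian template name.
CITE_BOOK_RE = re.compile(r"\{\{\s*cite book\b.*?\}\}", re.IGNORECASE | re.DOTALL)


def _normalize(template_text):
    """Collapse whitespace so pure reflowing does not count as a change."""
    return " ".join(template_text.split())


def added_cite_books(old_text, new_text):
    """Return cite-book templates present in new_text but not in old_text."""
    old = Counter(_normalize(t) for t in CITE_BOOK_RE.findall(old_text))
    new = Counter(_normalize(t) for t in CITE_BOOK_RE.findall(new_text))
    # Counter subtraction keeps only templates whose count went up.
    return list((new - old).elements())
```

An edit that only changes a parameter inside an existing template will also show up as an addition here, so a later step may want to compare titles instead of whole templates.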
>>>
>>>
>>>
>>> I wonder if there are already scripts in pywikibot that would help
>>> initiate this. If not, I wonder what the best strategy is for implementing
>>> this using the MW API.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Huji
>>>
>>>
>>>
>>> ------------
>>>
>>>
>>>
>>> Long version:
>>>
>>>
>>>
>>> The idea is to identify users who probably have access to certain
>>> offline sources, so that if another user needs something to be checked in
>>> that source and they don't have access to it, they know who to ask. For
>>> instance, if I have access to a physical copy of Encyclopedia Britannica
>>> (let's say it is a book and is not available digitally), and you want me to
>>> check if it has an entry for Sir Isaac Newton, it would be great if
>>> instead of or in addition to asking on the village pump (which I might not
>>> follow), you would ask me directly.
>>>
>>>
>>>
>>> The assumption is that if the same user keeps adding the same "{{cite
>>> book}}" template in many articles (e.g. if I add {{cite book | title =
>>> Encyclopedia Britannica | ... }} in several edits across several articles),
>>> then that user most likely has access to that source. And if these edits
>>> are relatively recent and the user is still active, then chances are the
>>> user can still access that source if another user asks them to.
>>>
>>>
>>>
>>> So if we find all such edits, we can probably aggregate them into a
>>> table that shows that "Huji" added a {{cite book}} for a book titled
>>> "Encyclopedia Britannica" 17 times, and so on and so forth. Sorting it by
>>> the frequency column, we might have a good list of user-source pairs.
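
The aggregation step could be as small as the sketch below, assuming earlier steps have produced one (user, title) pair per added template. The function name and the min_count threshold are illustrative assumptions, not part of the original proposal.

```python
from collections import Counter


def rank_user_sources(additions, min_count=1):
    """Count (user, title) pairs and sort them by frequency, highest first.

    `additions` is an iterable of (user, title) pairs, one per added
    template. Pairs seen fewer than `min_count` times are dropped.
    """
    counts = Counter(additions)
    return [
        (user, title, n)
        for (user, title), n in counts.most_common()
        if n >= min_count
    ]


if __name__ == "__main__":
    # "ExampleUser" and "Some Other Book" are made-up sample data.
    sample = (
        [("Huji", "Encyclopedia Britannica")] * 3
        + [("ExampleUser", "Encyclopedia Britannica")] * 2
        + [("Huji", "Some Other Book")]
    )
    for user, title, n in rank_user_sources(sample, min_count=2):
        print(f"{user}\t{title}\t{n}")
```

A threshold above 1 helps separate a user who cited a book once from one who keeps coming back to it.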
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Wikimedia Cloud Services mailing list
>>> Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org)
>>> https://lists.wikimedia.org/mailman/listinfo/cloud