Got it. I am also looking for rough ideas at this point. The edit rate of fawiki is not that high, roughly 5-6K edits per day <https://quarry.wmflabs.org/query/32328> (and I am guessing 3-4K if restricted to the article namespace). But note that we only care about edits from the last 6-12 months, made by users who have been active in the last 1-2 months. Older edits may point to a resource the user no longer has access to, and inactive users are not particularly helpful for our purposes either.
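To make the counting step concrete, here is a rough sketch in Python of what I have in mind once we have the text before and after each edit. It is only an illustration under assumptions: the function names are made up, the regexes use the English "cite book" and "title" names (fawiki would need the Persian equivalents), templates nested inside the citation are not handled, and fetching the revisions and filtering them by date and user activity is left out entirely.

```python
import re
from collections import Counter

# Matches a {{cite book ...}} template that has no nested templates inside it.
# Assumption: on fawiki the template and parameter names would be the Persian
# equivalents; the English names are used here only for illustration.
CITE_BOOK = re.compile(r"\{\{\s*cite book\s*\|([^{}]*)\}\}", re.IGNORECASE)
TITLE = re.compile(r"(?:^|\|)\s*title\s*=\s*([^|]*)", re.IGNORECASE)


def book_titles(wikitext):
    """Count the titles of the {{cite book}} templates found in wikitext."""
    titles = Counter()
    for match in CITE_BOOK.finditer(wikitext):
        title = TITLE.search(match.group(1))
        if title and title.group(1).strip():
            titles[title.group(1).strip()] += 1
    return titles


def added_titles(old_text, new_text):
    """Titles cited more often after the edit than before it."""
    # Counter subtraction keeps only the positive differences.
    return book_titles(new_text) - book_titles(old_text)


def tally(revisions):
    """Count (user, title) pairs over (user, old_text, new_text) tuples."""
    counts = Counter()
    for user, old_text, new_text in revisions:
        for title, n in added_titles(old_text, new_text).items():
            counts[(user, title)] += n
    return counts
```

Calling `tally(revisions).most_common()` would then give the user-source pairs sorted by frequency. Comparing the template counts of the two revisions avoids parsing the diff output itself, at the cost of needing both full texts for every edit.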
On Thu, Dec 27, 2018 at 1:57 PM John <phoenixoverr...@gmail.com> wrote:

> What’s fawiki’s edit rate? Processing a diff shouldn’t take more than 1-2
> seconds, especially if you optimize the logic. I’m just spitballing ideas at
> this point, but the logic should be easy.
>
> On Thu, Dec 27, 2018 at 12:37 PM Huji Lee <huji.h...@gmail.com> wrote:
>
>> We will never know who "owns" which book. We only know that they have
>> used it as a source a number of times. It could very well be that they
>> can just easily borrow that book from a library (as is my case, with a lot
>> of books and journals I have used as sources on Wikipedia).
>>
>> The profiling issue is beyond this discussion, and I will make sure to
>> mention that on fawiki, but one can already "profile" users using their
>> edits (it is quite easy for people to look at my edits on fawiki and
>> realize that I read and write about Persian music; knowing that I also
>> use some of the books on this topic as my source wouldn't add much to the
>> picture; of note, my real world life and identity is unrelated to Persian
>> music or music in general, so profiles are not always as revealing
>> anyway).
>>
>> @John: I had not heard of mwparserfromhell and it is really cool! But how
>> exactly does it come into play? The issue is less about being able to
>> parse wikicode (what we really need is pretty much a regex search for the
>> Persian equivalent of the {{cite book}} template, and a second regex
>> pattern that looks for the "name" parameter inside matches for the first
>> one). Frankly, I am less worried about the steps *after* we have found a
>> "cite book" instance, and more about the steps leading to it (running
>> many, many diffs).
>>
>> Perhaps I am not fully understanding your thoughts, so please elaborate.
>>
>> Thank you both!
>>
>> On Thu, Dec 27, 2018 at 1:24 PM T Paris <tparis.w...@gmail.com> wrote:
>>
>>> Could I ask that you guys make this an “opt in” feature? Both because
>>> it’ll speed up the bot and also because once you start identifying which
>>> books people own, you start to develop a profile on people.
>>>
>>> v/r,
>>>
>>> TP
>>>
>>> *From: *Huji Lee <huji.h...@gmail.com>
>>> *Sent: *Thursday, December 27, 2018 11:42 AM
>>> *To: *Labs <lab...@lists.wikimedia.org>
>>> *Subject: *[Cloud] List of users who have access to certain references
>>>
>>> This is an idea that came up on fawiki, and there is some merit to it. I
>>> just want to figure out the best approach to implement it and would love
>>> your input.
>>>
>>> *TL;DR: *We want to sweep through the recent edits in articles, look at
>>> each diff, see if it contains the addition of a "{{cite book}}" template,
>>> and if so, set it aside for future processing by separate code.
>>>
>>> I wonder if there are already scripts in pywikibot that would help
>>> initiate this. If not, I wonder what the best strategy is to implement
>>> this using the MW API.
>>>
>>> Thanks,
>>>
>>> Huji
>>>
>>> ------------
>>>
>>> Long version:
>>>
>>> The idea is to identify users who probably have access to certain
>>> offline sources, so that if another user needs something to be checked in
>>> that source and they don't have access to it, they know who to ask. For
>>> instance, if I have access to a physical copy of Encyclopedia Britannica
>>> (let's say it is a book and is not available digitally), and you want me
>>> to check if it has an entry for Sir Isaac Newton, it would be great if
>>> instead of or in addition to asking on the village pump (which I might
>>> not follow), you would ask me directly.
>>>
>>> The assumption is that if the same user keeps adding the same "{{cite
>>> book}}" template in many articles (e.g. if I add {{cite book | title =
>>> Encyclopedia Britannica | ... }} in several edits across several
>>> articles), then that user most likely has access to that source. And if
>>> these edits are relatively recent and the user is still active, then
>>> chances are the user can still access that source if another user asks
>>> them to.
>>>
>>> So if we find all such edits, we can probably aggregate them into a
>>> table that shows "Huji" added a {{cite book}} for a book titled
>>> "Encyclopedia Britannica" 17 times, and so on and so forth. Sorting it by
>>> the frequency column, we might have a good list of user-source pairs.
>>>
>>> _______________________________________________
>>> Wikimedia Cloud Services mailing list
>>> Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org)
>>> https://lists.wikimedia.org/mailman/listinfo/cloud