Yes, I have a Toolforge account and there are a bunch of cronjobs that run weekly (and a few that run daily).
The code can be found at
https://github.com/PersianWikipedia/fawikibot/tree/master/HujiBot, where
stats.py is the program that actually connects to the DB; weekly.py and
weekly-slow.py are the wrapper scripts in which you can find the SQL
queries themselves. All queries initiate on fawiki_p, so by simply
searching those two Python files for keywords like "commonswiki" or
"enwiki" you will find all the cross-wiki joins we currently have.

On Mon, Nov 16, 2020 at 3:20 PM Joaquin Oltra Hernandez
<jhernan...@wikimedia.org> wrote:

> I have incorporated some of the suggestions and info from the threads
> into the wiki page.
>
> I would like to document specific code examples, especially of SQL
> cross-database joins, and what migrating away from them would look
> like. If you have looked at this and made changes to your queries and
> code, it would be super helpful if you could point me to the
> repo/code so we can show real examples in the documentation.
>
> Sharing real use cases like those of Huji Lee and MusikAnimal is also
> very useful for discussing the changes and getting help -- thanks
> Brooke and ACN.
>
> Huji, how do you run the queries? Do you have a Toolforge project
> with code where you query the DB? Do you use Quarry or maybe PAWS?
> The migration path is different depending on your workflow and
> skills, so that background information helps us provide suggestions.
> Like ACN mentioned, the answer varies a lot depending on context and
> use case; we can try to help you make the changes bit by bit. If you
> start a new thread with the specifics of your code, we should be able
> to help you come to solutions.
>
> On Wed, Nov 11, 2020 at 5:25 AM AntiCompositeNumber
> <anticompositenum...@gmail.com> wrote:
>
>> Most cross-db JOINs can be recreated using two queries and an
>> external tool to filter the results. However, some queries would be
>> simply impractical due to the large amount of data involved, and the
>> query for overlapping local and Commons images is one of them. There
>> are basically two ways to recreate the query: re-implement the inner
>> join, or re-implement a semi-join subquery.
>>
>> Recreating a JOIN is conceptually very simple: get two lists and
>> compare them. However, there are 67,034 files on fawiki, 891,286
>> files on enwiki, and 65,559,375 files on Commons. Simply joining by
>> name would be impossible -- MariaDB would time out a few hundred
>> times before returning all that data, and even if it did, storing
>> those lists even as efficiently as possible would be quite the
>> memory hog. So the query would have to be paginated. The only common
>> identifier we have is the file name, and since the letters in the
>> names aren't evenly distributed, paginating wouldn't exactly be fun.
>>
>> The other option is implementing the Commons lookup like a semi-join
>> subquery: iterate over the local data, paginating any way you want,
>> and then, for every item, query the Commons database for that title.
>> Of course, we're now making a million requests to the database,
>> which isn't going to be very fast simply due to network delays. We
>> could be a little nicer and group a bunch of titles together in each
>> query, which would probably get us down from a million queries to
>> fifty thousand or so.
>>
>> Of course, this all gets more complicated if you want a query more
>> complex than:
>>
>>     SELECT enwiki_p.image.img_name
>>     FROM enwiki_p.image
>>     JOIN commonswiki_p.image
>>       ON enwiki_p.image.img_name = commonswiki_p.image.img_name;
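[For illustration, a rough Python sketch of the batched semi-join
rewrite described above: fetch the local file names, then look them up
on Commons in chunks. It assumes pymysql and Toolforge replica
credentials in ~/replica.my.cnf; the batch size and helper names are
invented, and a real tool would also paginate the local query for
enwiki-sized wikis.]

    import os
    import pymysql

    BATCH = 500  # titles per Commons lookup; tune as needed

    def connect(wiki):
        # Each wiki must now be queried through its own host;
        # cross-database JOINs are no longer possible.
        return pymysql.connect(
            host=f"{wiki}.analytics.db.svc.eqiad.wmflabs",
            database=f"{wiki}_p",
            read_default_file=os.path.expanduser("~/replica.my.cnf"),
        )

    def overlapping_images():
        """File names that exist on both fawiki and Commons."""
        local = connect("fawiki")
        commons = connect("commonswiki")
        try:
            with local.cursor() as cur:
                cur.execute("SELECT img_name FROM image")
                names = [row[0] for row in cur.fetchall()]
            overlap = []
            with commons.cursor() as cur:
                for i in range(0, len(names), BATCH):
                    chunk = names[i:i + BATCH]
                    marks = ",".join(["%s"] * len(chunk))
                    # Semi-join: only names also present on Commons
                    # come back from this query.
                    cur.execute(
                        f"SELECT img_name FROM image"
                        f" WHERE img_name IN ({marks})",
                        chunk,
                    )
                    overlap.extend(row[0] for row in cur.fetchall())
            return overlap
        finally:
            local.close()
            commons.close()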
>> I understand the system engineering reasons for this change, but I
>> think it's worth underscoring exactly how disruptive it will be for
>> the queries that depended on this functionality. I'm certainly no
>> expert, but I'm willing to help wrap queries in Python until they
>> start working again.
>>
>> ACN
>>
>> On Tue, Nov 10, 2020 at 8:48 PM Huji Lee <huji.h...@gmail.com> wrote:
>> >
>> > Cross-wiki JOINs are used by some of the queries we run regularly
>> > for fawiki. One of those queries looks for articles that don't
>> > have an image in their infobox on fawiki but do have one on
>> > enwiki, so that we can use/import that image. Another one JOINs
>> > fawiki data with Commons data to look for redundant images. Yet
>> > another one looks for articles that use an image that doesn't
>> > exist locally (for cleanup purposes) but needs to join with the
>> > Commons DB, because the referenced file might exist there. Lastly,
>> > we have a report that looks for fair-use images on fawiki that
>> > have the same name as an image on enwiki where the enwiki copy was
>> > deleted; this usually indicates an improper application of fair
>> > use, and enwiki -- due to its larger community -- finds and
>> > deletes these faster than we could on fawiki.
>> >
>> > There may be other cases I am unaware of. The point is, losing the
>> > cross-wiki JOIN capability can make some of the above tasks really
>> > difficult or completely impossible.
>> >
>> > On Tue, Nov 10, 2020 at 3:27 PM Joaquin Oltra Hernandez
>> > <jhernan...@wikimedia.org> wrote:
>> >>
>> >> TLDR: Wiki Replicas' architecture is being redesigned for
>> >> stability and performance. Cross-database JOINs will not be
>> >> available, and a host connection will only allow querying its
>> >> associated DB. See [1] for more details.
>> >>
>> >> Hi!
>> >>
>> >> In the interest of making and keeping Wiki Replicas a stable and
>> >> performant service, a new backend architecture is needed. There
>> >> is some impact on the features and usage patterns.
>> >>
>> >> What should I do?
>> >> To avoid breaking changes, you can start making the following
>> >> changes *now*:
>> >>
>> >> - Update existing tools to ensure queries are executed against
>> >>   the proper database connection
>> >>   - E.g.: if you want to query the `eswiki_p` DB, you must
>> >>     connect to the `eswiki.analytics.db.svc.eqiad.wmflabs` host
>> >>     and the `eswiki_p` DB, not to enwiki or other hosts
>> >> - Check your existing tools' and services' queries for
>> >>   cross-database JOINs, and rewrite those joins in application
>> >>   code
>> >>   - E.g.: if you are joining across databases, for example
>> >>     joining `enwiki_p` and `eswiki_p`, you will need to query
>> >>     them separately and filter the results of the separate
>> >>     queries in your code [see the sketch at the end of this
>> >>     thread]
>> >>
>> >> Timeline:
>> >>
>> >> - November - December: Early adopter testing
>> >> - January 2021: Existing and new systems online, transition
>> >>   period starts
>> >> - February 2021: Old hardware is decommissioned
>> >>
>> >> We need your help:
>> >>
>> >> - If you would like to beta test the new architecture, please let
>> >>   us know and we will reach out to you soon
>> >> - Share examples / descriptions of how a tool or service was
>> >>   updated, write a common solution or some example code others
>> >>   can utilize and reference, and help others on IRC and the
>> >>   mailing lists
>> >>
>> >> If you have questions or need help adapting your code or queries,
>> >> please contact us [2] or write on the talk page [3].
>> >>
>> >> We will be sending reminders and more specific examples of the
>> >> changes via email and on the wiki page. For more information, see
>> >> [1].
>> >>
>> >> [1]: https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign
>> >> [2]: https://wikitech.wikimedia.org/wiki/Help:Cloud_Services_communication
>> >> [3]: https://wikitech.wikimedia.org/wiki/Talk:News/Wiki_Replicas_2020_Redesign
>> >>
>> >> --
>> >> Joaquin Oltra Hernandez
>> >> Developer Advocate - Wikimedia Foundation
>
> --
> Joaquin Oltra Hernandez
> Developer Advocate - Wikimedia Foundation
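[To make the announcement's second example concrete, a minimal Python
sketch of the same pattern: query each wiki through its own host, then
apply the former JOIN condition in application code. pymysql, the
helper names, and the example query are illustrative assumptions, and
result sets this large would need pagination in practice.]

    import os
    import pymysql

    def replica(wiki):
        # The new rule: connect to the host that serves the wiki you
        # are querying, and only query that wiki's database.
        return pymysql.connect(
            host=f"{wiki}.analytics.db.svc.eqiad.wmflabs",
            database=f"{wiki}_p",
            read_default_file=os.path.expanduser("~/replica.my.cnf"),
        )

    def titles(wiki, sql):
        conn = replica(wiki)
        try:
            with conn.cursor() as cur:
                cur.execute(sql)
                return {row[0] for row in cur.fetchall()}
        finally:
            conn.close()

    # A former enwiki_p/eswiki_p cross-database JOIN becomes two
    # separate queries plus a client-side set intersection:
    QUERY = "SELECT page_title FROM page WHERE page_namespace = 0"
    in_both = titles("enwiki", QUERY) & titles("eswiki", QUERY)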
_______________________________________________
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud