Yes, I have a Toolforge account and there are a bunch of cronjobs that run
weekly (and a few that run daily).

The code can be found at
https://github.com/PersianWikipedia/fawikibot/tree/master/HujiBot where
stats.py is the program that actually connects to the DB, and weekly.py and
weekly-slow.py are wrapper scripts in which you can find the SQL queries
themselves. All queries run against fawiki_p, so simply searching those
two Python files for keywords like "commonswiki" or "enwiki" will turn up
all the cross-wiki joins we currently have.

On Mon, Nov 16, 2020 at 3:20 PM Joaquin Oltra Hernandez <
jhernan...@wikimedia.org> wrote:

> I have incorporated some of the suggestions and info from the threads
> into the wiki page.
>
> I would like to document specific code examples, especially of
> cross-database SQL JOINs, and what migrating away from them would look
> like. If you have looked at this and made changes to your queries and
> code, it would be super helpful if you could point me to the repo/code to
> show real examples in the documentation.
>
> If you can share real use cases, as Huji Lee and MusikAnimal did, that is
> also very useful for discussion and getting help. Thanks, Brooke and ACN.
>
> Huji, how do you run the queries? Do you have a Toolforge project with
> code where you query the DB? Do you use Quarry or maybe PAWS? The
> migration path is different depending on your workflow and skills, so
> that background information helps us provide suggestions. Like ACN
> mentioned, the answer varies a lot depending on context and use case; we
> can try and help you make the changes bit by bit. If you start a new
> thread with the specifics of your code, we should be able to help come to
> solutions.
>
> On Wed, Nov 11, 2020 at 5:25 AM AntiCompositeNumber <
> anticompositenum...@gmail.com> wrote:
>
>> Most cross-db JOINs can be recreated using two queries and an external
>> tool to filter the results. However, some queries would simply be
>> impractical to recreate due to the large amount of data involved, and
>> the query for overlapping local and Commons images is one of them.
>> There are basically two ways to recreate the query: re-implement the
>> inner join, or re-implement it as a semi-join subquery.
>>
>> Recreating a JOIN is conceptually very simple: get two lists and
>> compare them. However, there are 67,034 files on fawiki, 891,286 files
>> on enwiki, and 65,559,375 files on Commons. Simply joining by name
>> would be impossible: MariaDB would time out a few hundred times before
>> returning all that data, and even if it somehow returned everything,
>> storing those lists as efficiently as possible would still be quite the
>> memory hog. So the query would have to be paginated. The only common
>> identifier we have is the file name, and since the letters in the names
>> aren't evenly distributed, paginating wouldn't exactly be fun.
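>>
>> (Assuming you somehow had both lists in memory, the comparison step
>> itself is just a set intersection; a minimal Python sketch, where
>> fawiki_rows and commons_rows are hypothetical result sets fetched
>> separately from each wiki's image table:)
>>
>>     fawiki_names = {row[0] for row in fawiki_rows}
>>     commons_names = {row[0] for row in commons_rows}
>>     overlap = fawiki_names & commons_names  # files on both wikis
>>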
>> The other option is implementing the Commons lookup as a semi-join
>> subquery: iterate over the local data, paginating any way you want,
>> and then, for every item, query the Commons database for that title. Of
>> course, we're now making a million requests to the database, which
>> isn't going to be very fast simply due to network delays. We could be a
>> little nicer and group a bunch of titles together in each query, which
>> would probably get us down from a million queries to fifty thousand or
>> so. And all of this gets more complicated if you want a query more
>> complex than:
>>
>>     SELECT i.img_name
>>     FROM enwiki_p.image AS i
>>     JOIN commonswiki_p.image AS c ON i.img_name = c.img_name;
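>>
>> (For illustration, a rough sketch of that batched semi-join approach in
>> Python with pymysql; this isn't code from any existing tool, and the
>> host names, credentials file, and batch size are assumptions:)
>>
>>     import pymysql
>>
>>     local = pymysql.connect(
>>         host="fawiki.analytics.db.svc.eqiad.wmflabs",
>>         database="fawiki_p", read_default_file="~/replica.my.cnf")
>>     commons = pymysql.connect(
>>         host="commonswiki.analytics.db.svc.eqiad.wmflabs",
>>         database="commonswiki_p", read_default_file="~/replica.my.cnf")
>>
>>     shared, last = [], b""
>>     while True:
>>         # Keyset-paginate over the local wiki's image names.
>>         with local.cursor() as cur:
>>             cur.execute(
>>                 "SELECT img_name FROM image WHERE img_name > %s "
>>                 "ORDER BY img_name LIMIT 1000", (last,))
>>             batch = [row[0] for row in cur.fetchall()]
>>         if not batch:
>>             break
>>         last = batch[-1]
>>         # Semi-join: ask Commons which of these names it also has.
>>         marks = ",".join(["%s"] * len(batch))
>>         with commons.cursor() as cur:
>>             cur.execute(
>>                 "SELECT img_name FROM image WHERE img_name IN (%s)" % marks,
>>                 batch)
>>             shared.extend(row[0] for row in cur.fetchall())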
>>
>> I understand the system engineering reasons for this change, but I
>> think it's worth underscoring exactly how disruptive it will be for
>> the queries that depended on this functionality. I'm certainly no
>> expert, but I'm willing to help wrap queries in Python until they
>> start working again.
>>
>> ACN
>>
>> On Tue, Nov 10, 2020 at 8:48 PM Huji Lee <huji.h...@gmail.com> wrote:
>> >
>> > Cross-wiki JOINs are used by some of the queries we run regularly for
>> > fawiki. One of those queries looks for articles that don't have an
>> > image in their infobox on fawiki, but do have one on enwiki, so that
>> > we can use/import that image. Another one JOINs fawiki data with
>> > Commons data to look for redundant images. Yet another one looks for
>> > articles that use an image that doesn't exist (for cleanup purposes),
>> > but it needs to join with the Commons DB because the referenced file
>> > might exist there. Lastly, we have a report that looks for fair-use
>> > images on fawiki that have the same name as an image on enwiki where
>> > the enwiki copy was deleted; this usually indicates an improper
>> > application of fair use, and enwiki, due to its larger community,
>> > finds and deletes these faster than we can on fawiki.
>> >
>> > There may be other cases I am unaware of. The point is, losing the
>> > cross-wiki JOIN capability can make some of the above tasks really
>> > difficult or completely impossible.
>> >
>> > On Tue, Nov 10, 2020 at 3:27 PM Joaquin Oltra Hernandez <
>> jhernan...@wikimedia.org> wrote:
>> >>
>> >> TLDR: Wiki Replicas' architecture is being redesigned for stability
>> >> and performance. Cross-database JOINs will not be available, and a
>> >> host connection will only allow querying its associated DB. See [1]
>> >> for more details.
>> >>
>> >> Hi!
>> >>
>> >> In the interest of making and keeping Wiki Replicas a stable and
>> >> performant service, a new backend architecture is needed. There is
>> >> some impact on features and usage patterns.
>> >>
>> >> What should I do? To get ahead of the breaking changes, you can
>> >> start making the following changes *now*:
>> >> - Update existing tools to ensure queries are executed against the
>> >>   proper database connection
>> >>   - E.g.: if you want to query the `eswiki_p` DB, you must connect to
>> >>     the `eswiki.analytics.db.svc.eqiad.wmflabs` host and the
>> >>     `eswiki_p` DB, and not to enwiki or other hosts
>> >> - Check your existing tools' and services' queries for cross-database
>> >>   JOINs, and rewrite the joins in application code
>> >>   - E.g.: if you are doing a join across databases, for example
>> >>     joining `enwiki_p` and `eswiki_p`, you will need to query them
>> >>     separately and filter the results of the separate queries in the
>> >>     code, as in the sketch below
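>> >>
>> >> (A minimal sketch of that pattern in Python with pymysql; not part
>> >> of the announcement, and the table, namespace filter, and LIMIT are
>> >> illustrative only:)
>> >>
>> >>     import pymysql
>> >>
>> >>     def connect(wiki):
>> >>         # One connection per wiki, each to its own host, as the new
>> >>         # architecture requires.
>> >>         return pymysql.connect(
>> >>             host=wiki + ".analytics.db.svc.eqiad.wmflabs",
>> >>             database=wiki + "_p",
>> >>             read_default_file="~/replica.my.cnf")
>> >>
>> >>     def titles(conn):
>> >>         with conn.cursor() as cur:
>> >>             cur.execute("SELECT page_title FROM page "
>> >>                         "WHERE page_namespace = 0 LIMIT 10000")
>> >>             return {row[0] for row in cur.fetchall()}
>> >>
>> >>     # The former cross-database JOIN becomes a set operation in code.
>> >>     common = titles(connect("enwiki")) & titles(connect("eswiki"))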
>> >>
>> >> Timeline:
>> >> - November - December: Early adopter testing
>> >> - January 2021: Existing and new systems online, transition period
>> >>   starts
>> >> - February 2021: Old hardware is decommissioned
>> >>
>> >> We need your help:
>> >> - If you would like to beta test the new architecture, please let us
>> >>   know and we will reach out to you soon
>> >> - Share examples / descriptions of how a tool or service was updated,
>> >>   write a common solution or some example code others can utilize and
>> >>   reference, and help others on IRC and the mailing lists
>> >>
>> >> If you have questions or need help adapting your code or queries,
>> >> please contact us [2], or write on the talk page [3].
>> >>
>> >> We will be sending reminders and more specific examples of the
>> >> changes via email and on the wiki page. For more information, see [1].
>> >>
>> >> [1]: https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign
>> >> [2]: https://wikitech.wikimedia.org/wiki/Help:Cloud_Services_communication
>> >> [3]: https://wikitech.wikimedia.org/wiki/Talk:News/Wiki_Replicas_2020_Redesign
>> >>
>> >> --
>> >> Joaquin Oltra Hernandez
>> >> Developer Advocate - Wikimedia Foundation
>> >
>>
>>
>
>
> --
> Joaquin Oltra Hernandez
> Developer Advocate - Wikimedia Foundation
>
