Is the source code public? Maybe the queries could be improved. I ran into many such issues too after the actor migration, but after taking advantage of the specialized views[0] and of join decomposition (fetch just the actor IDs, i.e. rev_actor, then look up the actor_names in a separate query), my tools are seemingly as fast as they were before.
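The join-decomposition idea above can be sketched roughly as follows. This is a minimal, illustrative example using an in-memory SQLite database rather than the actual Wiki Replicas; the `revision`/`actor` column names mirror the MediaWiki schema, but the data and the `rev_id` filter are made up for the demonstration.

```python
import sqlite3

# Tiny in-memory stand-ins for MediaWiki's `revision` and `actor` tables
# (real column names, made-up data), to show the two-query pattern.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_actor INTEGER);
    CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, actor_name TEXT);
    INSERT INTO revision VALUES (1, 10), (2, 11), (3, 10);
    INSERT INTO actor VALUES (10, 'Alice'), (11, 'Bob');
""")

# Step 1: query only the actor IDs from the (large) revision table,
# without joining anything.
actor_ids = {row[0] for row in conn.execute(
    "SELECT DISTINCT rev_actor FROM revision WHERE rev_id <= 3")}

# Step 2: resolve the (small) set of IDs to names in a separate query,
# instead of joining actor against every revision row.
placeholders = ",".join("?" * len(actor_ids))
names = dict(conn.execute(
    f"SELECT actor_id, actor_name FROM actor "
    f"WHERE actor_id IN ({placeholders})",
    sorted(actor_ids)))

print(names)  # e.g. {10: 'Alice', 11: 'Bob'}
```

The point of splitting the query in two is that each half can use a good index on its own table, which is what makes the pattern effective on the replicas after the actor migration.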
~ MA

[0] https://wikitech.wikimedia.org/wiki/News/Actor_storage_changes_on_the_Wiki_Replicas#Advanced_use_cases_of_specialized_views

On Mon, Apr 13, 2020 at 5:03 PM Huji Lee <huji.h...@gmail.com> wrote:
> I understand. However, I think that the use case we are looking at is
> relatively unique. I also think that the indexes we need may not be
> desirable for all the Wiki Replicas (they would often be multi-column
> indexes geared towards a specific set of queries), and I honestly don't
> want to go through the several weeks (months?) of discussion needed to
> justify them.
>
> Note that if we open the can of "more indexes on Wiki Replicas" worms,
> this would all of a sudden become an all-wiki discussion. I'm not sure
> there are more than a handful of wikis that do the level of page-level
> and user-level analytics that fawiki does, which means that for most
> wikis (and for most Wiki Replica databases) those additional indexes
> may not even be justified.
>
> Even if we were to generalize parts of this approach and bring it to
> Wiki Replicas, I would still argue that doing it at a smaller scale
> (one wiki DB for now) would be a reasonable starting point, no?
>
> On Mon, Apr 13, 2020 at 4:42 PM Bryan Davis <bd...@wikimedia.org> wrote:
>
>> On Sun, Apr 12, 2020 at 7:48 AM Huji Lee <huji.h...@gmail.com> wrote:
>> >
>> > One possible solution is to create a script which is scheduled to
>> > run once a month; the script would download the latest dump of the
>> > wiki database,[3] load it into MySQL/MariaDB, create some additional
>> > indexes that would make our desired queries run faster, and generate
>> > the reports using this database. A separate script can then purge
>> > the data a few days later.
>>
>> If I am understanding your proposal here, I think the main difference
>> from the current Wiki Replicas would be "create some additional
>> indexes that would make our desired queries run faster". We do have
>> some indexes and views in the Wiki Replicas which are specifically
>> designed to make common things faster today. If possible, adding to
>> these rather than building a one-off process of moving lots of data
>> around for your tool would be nice.
>>
>> I say this not because what you are proposing is a ridiculous
>> solution, but because it is a unique solution for your current problem
>> that will not help others who are having similar problems. Having one
>> tool use ToolsDB or a custom Cloud VPS project like this is possible,
>> but having 100 tools try to follow that pattern themselves is not.
>>
>> > Out of an abundance of caution, I thought I should ask for
>> > permission now, rather than forgiveness later. Do we have a process
>> > for getting approval for projects that require gigabytes of storage
>> > and hours of computation, or is what I proposed not even remotely
>> > considered a "large" project, meaning I am being overly cautious?
>>
>> <https://phabricator.wikimedia.org/project/view/2875/>
>>
>> Bryan
>> --
>> Bryan Davis              Technical Engagement      Wikimedia Foundation
>> Principal Software Engineer                               Boise, ID USA
>> [[m:User:BDavis_(WMF)]]                                      irc: bd808
_______________________________________________
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud