Re: [Cloud] Loading wikipedia dump onto Clouds

Huji Lee Mon, 13 Apr 2020 14:04:03 -0700

I understand. However, I think the use case we are looking at is
relatively unique. I also think that the indexes we need may not be desirable
for all the Wiki Replicas (they would often be multi-column indexes geared
towards a specific set of queries), and I honestly don't want to go through
several weeks (months?) of discussion to justify them.
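
To illustrate what I mean by a multi-column index geared towards a specific
query, here is a minimal sketch. It uses Python's built-in sqlite3 module as a
stand-in for MariaDB. The table, column, and index names (revision_demo,
idx_user_ts_page) are hypothetical and much simpler than the real Wiki
Replicas schema; the report query is likewise only an example.

```python
import sqlite3

# Hypothetical report: per-page edit counts for one user since a given date.
REPORT_QUERY = (
    "SELECT rev_page, COUNT(*) AS edits "
    "FROM revision_demo "
    "WHERE rev_user = ? AND rev_timestamp >= ? "
    "GROUP BY rev_page ORDER BY rev_page"
)


def build_demo_db():
    """Create an in-memory table with synthetic rows (not real wiki data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE revision_demo ("
        " rev_id INTEGER PRIMARY KEY,"
        " rev_page INTEGER NOT NULL,"
        " rev_user INTEGER NOT NULL,"
        " rev_timestamp TEXT NOT NULL)"
    )
    rows = [
        (i, i % 50, i % 7, f"2020{(i % 12) + 1:02d}01000000")
        for i in range(1, 1001)
    ]
    conn.executemany("INSERT INTO revision_demo VALUES (?, ?, ?, ?)", rows)
    return conn


def add_report_index(conn):
    """Add a multi-column index whose column order matches the report query."""
    conn.execute(
        "CREATE INDEX idx_user_ts_page "
        "ON revision_demo (rev_user, rev_timestamp, rev_page)"
    )


def run_report(conn, user_id, since):
    return conn.execute(REPORT_QUERY, (user_id, since)).fetchall()


def query_plan(conn, user_id, since):
    """Return the planner's description of how the report query would run."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN " + REPORT_QUERY, (user_id, since)
    ).fetchall()
    return " | ".join(row[-1] for row in rows)


if __name__ == "__main__":
    conn = build_demo_db()
    print("before:", query_plan(conn, 3, "20200601000000"))
    add_report_index(conn)
    print("after: ", query_plan(conn, 3, "20200601000000"))
```

The index only helps queries that filter on rev_user first, which is why I
doubt it would be justified on every Wiki Replica database.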


Note that if we open the "more indexes on Wiki Replicas" can of worms, this
would suddenly become an all-wiki discussion. I'm not sure there
are more than a handful of wikis that do the level of page-level and
user-level analytics that fawiki does, which means that for most wikis (and for
most Wiki Replica databases) those additional indexes may not even be
justified.

Even if we were to generalize parts of this approach and bring it to Wiki
Replicas, I would still argue that doing it on a smaller scale (one wiki
DB for now) would be a reasonable starting point, no?

On Mon, Apr 13, 2020 at 4:42 PM Bryan Davis <bd...@wikimedia.org> wrote:

> On Sun, Apr 12, 2020 at 7:48 AM Huji Lee <huji.h...@gmail.com> wrote:
> >
> > One possible solution is to create a script which is scheduled to run
> > once a month; the script would download the latest dump of the wiki
> > database,[3] load it into MySQL/MariaDB, create some additional indexes
> > that would make our desired queries run faster, and generate the reports
> > using this database. A separate script can then purge the data a few days
> > later.
>
> If I am understanding your proposal here, I think the main difference
> from the current Wiki Replicas would be "create some additional
> indexes that would make our desired queries run faster". We do have
> some indexes and views in the Wiki Replicas which are specifically
> designed to make common things faster today. If possible, adding to
> these rather than building a one-off process of moving lots of data
> round for your tool would be nice.
>
> I say this not because what you are proposing is a ridiculous
> solution, but because it is a unique solution for your current problem
> that will not help others who are having similar problems. Having 1
> tool use ToolsDB or a custom Cloud VPS project like this is possible,
> but having 100 tools try to follow that pattern themselves is not.
>
> > Out of abundance of caution, I thought I should ask for permission now,
> > rather than forgiveness later. Do we have a process for getting approval
> > for projects that require gigabytes of storage and hours of computation, or
> > is what I proposed not even remotely considered a "large" project, meaning
> > I am being overly cautious?
>
> <https://phabricator.wikimedia.org/project/view/2875/>
>
> Bryan
> --
> Bryan Davis              Technical Engagement      Wikimedia Foundation
> Principal Software Engineer                               Boise, ID USA
> [[m:User:BDavis_(WMF)]]                                      irc: bd808
>
> _______________________________________________
> Wikimedia Cloud Services mailing list
> Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org)
> https://lists.wikimedia.org/mailman/listinfo/cloud
