Hi all,

We have a set of database reports (on users, articles, etc.) that we used
to generate on a weekly basis.[1] Ever since the introduction of the *actor*
table,[2] many of the user-related reports have become so slow that their
SQL queries cannot finish within a reasonable time and are killed. Some
other reports have also become slower over time; all of the affected
reports are shown in red in [1].

One possible solution is to create a script that is scheduled to run once
a month; the script would download the latest dump of the wiki database,[3]
load it into MySQL/MariaDB, create some additional indexes that would make
our desired queries run faster, and generate the reports from that local
database. A separate script could then purge the data a few days later.

We can use the current-version-only DB dumps for this purpose. I am
guessing that this process would take several hours to run (somewhere
between 2 and 10) and would require about 2 GB of storage just to download
and decompress the dump file, plus some additional space on the DB side
(for data, indexes, etc.).

Out of an abundance of caution, I thought I should ask for permission now
rather than forgiveness later. Do we have a process for getting approval
for projects that require gigabytes of storage and hours of computation, or
is what I am proposing not even remotely a "large" project, meaning I am
being overly cautious?

Please advise!
Huji


  [1]
https://fa.wikipedia.org/wiki/%D9%88%DB%8C%DA%A9%DB%8C%E2%80%8C%D9%BE%D8%AF%DB%8C%D8%A7:%DA%AF%D8%B2%D8%A7%D8%B1%D8%B4_%D8%AF%DB%8C%D8%AA%D8%A7%D8%A8%DB%8C%D8%B3
  [2] https://phabricator.wikimedia.org/T223406
  [3] https://dumps.wikimedia.org/fawiki/20200401/