It may help to put things in context and remind everyone that among the
most visited websites in the world (of which Wikipedia is #5, per Alexa
[0]), the vast majority do not even use a relational database, which
makes real-time, large-scale analytic queries something that is not
readily available to the uninitiated or to third-party query writers.
The fact that we can run so many of these analytical queries against
near-real-time data is an exceptional and amazing feat.

PS: Bryan misspelled ETL by accident :)

[0]: https://www.alexa.com/topsites


On Mon, Jun 3, 2019 at 7:24 PM Bryan Davis <bd...@wikimedia.org> wrote:

> On Mon, Jun 3, 2019 at 2:56 PM John <phoenixoverr...@gmail.com> wrote:
> >
> > Please don't be patronizing or insulting; I did read it. What used to
> > be one fairly quick query will now be thousands of decoupled queries.
> > I know a number of cases where the decoupled query count can get into
> > the hundreds of thousands. So what used to take 1 query and <2 minutes
> > is now going to hammer the database with ~300k queries and additional
> > logic on the programming side.
>
> The Wiki Replica databases are going through some tough changes for
> "analytic" queries right now. By analytic, I mean the sort of queries
> that tools want/need to do where the data generated by the wikis is
> being combed through to find correlations that are not needed to
> support the "operational" queries that are done by the MediaWiki
> software. The database industry terms OLTP (online transactional
> processing) [0] and OLAP (online analytic processing) [1] are used to
> describe these related, but very different use cases.
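
To make the distinction concrete, here is a rough sketch. The column
names follow the public MediaWiki schema, but both queries are made-up
illustrations rather than anything a particular tool actually runs:

    # A rough sketch of the OLTP vs. OLAP split. Column names follow the
    # public MediaWiki schema, but both queries are made-up examples.

    # OLTP-style: the narrow point lookup MediaWiki itself needs in order
    # to render a single page; it touches one row via an index.
    OLTP_QUERY = """
    SELECT page_id, page_latest
    FROM page
    WHERE page_namespace = 0
      AND page_title = 'Boise,_Idaho'
    """

    # OLAP-style: the broad scan a tool might run to look for patterns;
    # it touches millions of rows and aggregates them.
    OLAP_QUERY = """
    SELECT rev_actor, COUNT(*) AS edits
    FROM revision
    WHERE rev_timestamp >= '20180101000000'
    GROUP BY rev_actor
    ORDER BY edits DESC
    LIMIT 100
    """
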
>
> Folks involved in MediaWiki core work have been making a lot of
> changes to the OLTP schema used by all wikis to make it more efficient
> in storage size and to reduce data duplication. These changes are very
> much needed to keep wikis the size of the larger Wikimedia wikis
> performant as the number of pages grows. Unfortunately, these same
> changes are making many OLAP queries more expensive. The DBA and Cloud
> Services teams will try to do what they can to minimize the amount of
> additional slowness that is introduced, but there are going to be
> limits to what we can accomplish.
>
> Today, the best advice I can give is to ask for help with queries that
> have become too slow to be usable by the tools that need them. This
> mailing list, the #wikimedia-cloud Freenode IRC channel, or
> Phabricator are the best places to ask today. Some queries might be
> easy to make faster; others, sadly, might already be as good as they
> can get and will need more complicated changes, like splitting the
> query into smaller batches of work in various ways.
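
For what it's worth, that "smaller batches" approach can be fairly
mechanical. Here is a minimal sketch in Python, assuming the usual
Toolforge conventions (pymysql, a replica.my.cnf credentials file, the
analytics replica host); the host name, batch size, and the example
query are all illustrative:

    # A minimal sketch of splitting one large analytic query into
    # primary-key-range batches so no single query overwhelms the
    # replicas. Host name, credentials file, and batch size are
    # illustrative, not prescriptive.
    import os
    import pymysql

    conn = pymysql.connect(
        host="enwiki.analytics.db.svc.eqiad.wmflabs",
        db="enwiki_p",
        read_default_file=os.path.expanduser("~/replica.my.cnf"),
    )

    BATCH = 50_000
    totals = {}

    with conn.cursor() as cur:
        cur.execute("SELECT MAX(page_id) FROM page")
        (max_id,) = cur.fetchone()

        # One bounded query per page_id range instead of one giant scan.
        for start in range(0, max_id + 1, BATCH):
            cur.execute(
                """
                SELECT page_namespace, COUNT(*)
                FROM page
                WHERE page_id BETWEEN %s AND %s
                GROUP BY page_namespace
                """,
                (start, start + BATCH - 1),
            )
            for namespace, count in cur.fetchall():
                totals[namespace] = totals.get(namespace, 0) + count

    print(totals)

Whether batching by page_id, rev_id, or timestamp ranges makes sense
depends on the query, but the shape is usually the same: bound each
statement so the work per query stays small.
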
>
> The Technical Engagement team is interested in starting a project
> sometime in the next fiscal year (July 2019 - June 2020) to gather
> requirements for an OLAP database schema that would be easier for
> tools to use. Once we have that we would move into a phase of trying
> to design an ELT (extract, transform, load) pipeline [2] that would convert
> the OLTP schema that MediaWiki has (what you see on the Wiki Replica
> hosts today) into that OLAP schema. This part is likely to be tricky
> to do while still keeping the lag from the production databases
> reasonably low and the security of the data high. This project is all
> speculative at this point, however. I would like to see it happen, but
> it will take help from a lot of folks and a fairly large investment in
> time and probably hardware as well. I can't yet give any timeline for
> completion, or honestly even promise that it will eventually happen.
> I can tell you, however, that we know this is a growing problem, and
> that there are people who would like to help make it better.
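
To make that ETL idea a bit more concrete, one step of such a pipeline
might look roughly like the sketch below. The wide revision_flat target
table is entirely hypothetical (designing the real OLAP schema is
exactly the requirements work Bryan describes), and the join columns
only approximate the current MediaWiki schema:

    # Hypothetical ETL step: flatten the normalized OLTP tables
    # (revision, actor, comment after the recent schema changes) into one
    # wide, analytics-friendly table. "revision_flat" does not exist
    # anywhere; it stands in for whatever OLAP schema gets designed.
    FLATTEN_REVISION_BATCH = """
    INSERT INTO revision_flat
        (rev_id, rev_page, rev_timestamp, actor_name, comment_text)
    SELECT r.rev_id,
           r.rev_page,
           r.rev_timestamp,
           a.actor_name,
           c.comment_text
    FROM revision r
    JOIN actor a ON a.actor_id = r.rev_actor
    JOIN comment c ON c.comment_id = r.rev_comment_id
    WHERE r.rev_id BETWEEN %(start)s AND %(end)s
    """

Run in rev_id batches close behind the production replication stream,
something shaped like this could keep lag reasonably low while giving
tools a schema that does not need several joins per query.
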
>
>
> [0]: https://en.wikipedia.org/wiki/Online_transaction_processing
> [1]: https://en.wikipedia.org/wiki/Online_analytical_processing
> [2]: https://en.wikipedia.org/wiki/Extract,_transform,_load
>
> Bryan
> --
> Bryan Davis              Wikimedia Foundation    <bd...@wikimedia.org>
> [[m:User:BDavis_(WMF)]] Manager, Technical Engagement    Boise, ID USA
> irc: bd808                                        v:415.839.6885 x6855
_______________________________________________
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud
