It may help to put things in context and remind everyone that among the most
visited websites in the world (Wikipedia is #5, per Alexa [0]), the vast
majority do not even use a relational database, making real-time, large-scale
analytic queries something that is not easily available to the uninitiated or
the third-party query writer. The fact that we can run so many of these
analytical queries against near-real-time data is an exceptional and amazing
feat.
PS: Bryan misspelled ETL by accident :)

[0]: https://www.alexa.com/topsites

On Mon, Jun 3, 2019 at 7:24 PM Bryan Davis <bd...@wikimedia.org> wrote:
> On Mon, Jun 3, 2019 at 2:56 PM John <phoenixoverr...@gmail.com> wrote:
> >
> > Please don’t be patronizing insulting. What did read. What used to be
> > one fairly quick query will now be thousands of decoupled queries. I
> > know a number of cases where the decoupled query count can get into
> > the hundreds of thousands. So what used to take 1 query and <2 minutes
> > is now going to hammer the database with ~300k queries and additional
> > logic on the programming side.
>
> The Wiki Replica databases are going through some tough changes for
> "analytic" queries right now. By analytic, I mean the sort of queries
> that tools want/need to do where the data generated by the wikis is
> being combed through to find correlations that are not needed to
> support the "operational" queries that are done by the MediaWiki
> software. The database industry terms OLTP (online transaction
> processing) [0] and OLAP (online analytic processing) [1] are used to
> describe these related, but very different, use cases.
>
> Folks involved in MediaWiki core work have been making a lot of
> changes to the OLTP schema used by all wikis to make it more efficient
> in storage size and to reduce data duplication. These changes are very
> much needed to keep wikis the size of the larger Wikimedia wikis
> performant as the number of pages grows. Unfortunately, these same
> changes are making many OLAP queries more expensive. The DBA and Cloud
> Services teams will try to do what they can to minimize the amount of
> additional slowness that is introduced, but there are going to be
> limits to what we can accomplish.
>
> Today the best advice I can give is to ask for help looking at queries
> which are too slow to work at all for the tools that need them.
> This mailing list, the #wikimedia-cloud Freenode IRC channel, or
> Phabricator are the best places to ask today. Some queries might be
> easy to make faster; others sadly might be as good as they can get and
> need other, more complicated changes, like splitting the query into
> smaller batches of work in various ways.
>
> The Technical Engagement team is interested in starting a project
> sometime in the next fiscal year (July 2019 - June 2020) to gather
> requirements for an OLAP database schema that would be easier for
> tools to use. Once we have that, we would move into a phase of trying
> to design an ELT (extract, transform, load) pipeline [2] that would
> convert the OLTP schema that MediaWiki has (what you see on the Wiki
> Replica hosts today) into that OLAP schema. This part is likely to be
> tricky to do while still keeping the lag from the production databases
> reasonably low and the security of the data high. This project is all
> speculative at this point, however. I would like to see it happen, but
> it will take help from a lot of folks and a fairly large investment in
> time and probably hardware as well. I can't give any timeline for
> completion, or honestly even promise yet that it will eventually
> happen. I can tell you, however, that we know this is a growing
> problem, and that there are people who would like to help make it
> better.
>
> [0]: https://en.wikipedia.org/wiki/Online_transaction_processing
> [1]: https://en.wikipedia.org/wiki/Online_analytical_processing
> [2]: https://en.wikipedia.org/wiki/Extract,_transform,_load
>
> Bryan
> --
> Bryan Davis              Wikimedia Foundation    <bd...@wikimedia.org>
> [[m:User:BDavis_(WMF)]]  Manager, Technical Engagement    Boise, ID USA
> irc: bd808               v:415.839.6885 x6855
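For anyone facing the "split the query into smaller batches" advice above, a
minimal sketch of the usual keyset-pagination pattern follows. Everything in
it is hypothetical: the `page` table and `id` column are illustrative
stand-ins (on the Wiki Replicas you would batch over a real primary key such
as `page_id` or `rev_id`), and an in-memory sqlite3 database stands in for
the replica connection.

```python
# Sketch: walk a large table in primary-key batches instead of one huge
# query, so each individual query stays small and fast for the server.
import sqlite3

def scan_in_batches(conn, batch_size=1000):
    """Yield all rows from the (hypothetical) page table, batch by batch."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, title FROM page WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break  # past the end of the table
        yield from rows
        last_id = rows[-1][0]  # resume after the last id we saw

# Demo with an in-memory database standing in for a replica.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO page VALUES (?, ?)",
    [(i, f"Page_{i}") for i in range(1, 26)],
)
titles = [title for _, title in scan_in_batches(conn, batch_size=10)]
print(len(titles))
```

Batching on the indexed primary key (rather than `OFFSET`) keeps each query
cheap regardless of how far into the table the scan has progressed, which is
why it is the usual way to break one slow analytic query into many fast ones.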
_______________________________________________
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud