Re: [Cloud] Loading wikipedia dump onto Clouds

2020-05-04 Thread Huji Lee
After several weeks of delay, I finally had a chance to submit three Phab tasks around this topic. I have at least 10 more queries that I could add to the list, but I don't want to overwhelm people, so I will wait for those three to be resolved first. Thanks! Huji On Tue, Apr 14, 2020 at 8:53 AM Hu…

Re: [Cloud] Loading wikipedia dump onto Clouds

2020-04-14 Thread Huji Lee
I completely appreciate the points you are making, Bryan and Jaime. And I would very much enjoy "dealing with you" if we end up going the "Cloud VPS project" route! If anything, I keep learning new things from you all. Let's start where you suggested. I will create Phab tickets on which I will seek…

Re: [Cloud] Loading wikipedia dump onto Clouds

2020-04-14 Thread Huji Lee
Yes. If you go to the source of all those pages, there is a hidden HTML element that has the SQL code for that report. Here is one example: [1] [1] https://fa.wikipedia.org/w/index.php?title=%D9%88%DB%8C%DA%A9%DB%8C%E2%80%8C%D9%BE%D8%AF%DB%8C%D8%A7:%DA%AF%D8%B2%D8%A7%D8%B1%D8%B4_%D8%AF%…
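
For anyone curious how such an embedded query could be pulled out programmatically, here is a minimal sketch (not the actual report code): it fetches a report page's rendered HTML through the MediaWiki API and looks for a hidden element holding the SQL. The class name "report-sql" is only a placeholder assumption; the real pages may mark the element differently.

    import re
    import requests

    API = "https://fa.wikipedia.org/w/api.php"

    def report_sql(page_title):
        """Return the SQL embedded in a report page, or None if not found."""
        resp = requests.get(API, params={
            "action": "parse",
            "page": page_title,
            "prop": "text",
            "format": "json",
            "formatversion": 2,
        })
        resp.raise_for_status()
        html = resp.json()["parse"]["text"]
        # Placeholder marker: assume the query sits in a hidden
        # <div class="report-sql">...</div>; the real pages may differ.
        match = re.search(r'<div class="report-sql"[^>]*>(.*?)</div>', html, re.S)
        return match.group(1).strip() if match else None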

Re: [Cloud] Loading wikipedia dump onto Clouds

2020-04-14 Thread Jaime Crespo
Actually, as an idea, I don't think it is a bad one. In fact, there is already a ticket for it: https://phabricator.wikimedia.org/T59617 which is basically one where a summary of the data can be shared, but not the individual rows, so once a summary was created and written into the tables as a sort of…
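
As a rough illustration of that "share the summary, not the rows" idea from T59617, the sketch below runs an aggregate query and publishes only the aggregated result into a separate table that others could read. The connection details, the output table, and the query itself are made up for the example.

    import pymysql

    # Connection details and the report_top_editors table are placeholders.
    conn = pymysql.connect(host="localhost", user="reporter",
                           password="...", database="fawiki")
    with conn.cursor() as cur:
        # Aggregate over rows that cannot themselves be exposed...
        cur.execute("""
            SELECT actor_name, COUNT(*) AS edit_count
            FROM revision JOIN actor ON actor_id = rev_actor
            GROUP BY actor_name
            ORDER BY edit_count DESC
            LIMIT 100
        """)
        summary = cur.fetchall()
        # ...and publish only the aggregated summary for others to read.
        cur.executemany(
            "REPLACE INTO report_top_editors (actor_name, edit_count) VALUES (%s, %s)",
            summary,
        )
    conn.commit()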

Re: [Cloud] Loading wikipedia dump onto Clouds

2020-04-13 Thread Bryan Davis
On Mon, Apr 13, 2020 at 3:03 PM Huji Lee wrote: > > On Mon, Apr 13, 2020 at 4:42 PM Bryan Davis wrote: >> >> On Sun, Apr 12, 2020 at 7:48 AM Huji Lee wrote: >> > >> > One possible solution is to create a script which is scheduled to run once >> > a month; the script would download the latest du…

Re: [Cloud] Loading wikipedia dump onto Clouds

2020-04-13 Thread MusikAnimal
Is the source code public? Maybe the queries could be improved. I ran into many such issues too after the actor migration, but after taking advantage of specialized views[0] and join decomposition (get just the actor IDs, i.e. rev_actor, then the actor names in a separate query), my tools are seemi…
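
For readers unfamiliar with the technique, here is a minimal sketch of that join decomposition against the Wiki Replicas: one query collects just the actor IDs, and a second resolves them to names via the actor_revision specialized view. The host name, the date filter, and the rest of the WHERE clause are placeholders.

    import os
    import pymysql

    # Analytics replica for fawiki; credentials come from replica.my.cnf.
    conn = pymysql.connect(
        host="fawiki.analytics.db.svc.eqiad.wmflabs",
        database="fawiki_p",
        read_default_file=os.path.expanduser("~/replica.my.cnf"),
    )
    with conn.cursor() as cur:
        # Query 1: only the actor IDs for the revisions of interest
        # (the timestamp filter is a placeholder).
        cur.execute("""
            SELECT DISTINCT rev_actor
            FROM revision
            WHERE rev_timestamp >= '20200401000000'
        """)
        actor_ids = [row[0] for row in cur.fetchall()]

        # Query 2: resolve those IDs to names via the actor_revision view,
        # instead of JOINing actor into the first query.
        names = {}
        if actor_ids:
            placeholders = ",".join(["%s"] * len(actor_ids))
            cur.execute(
                "SELECT actor_id, actor_name FROM actor_revision "
                f"WHERE actor_id IN ({placeholders})",
                actor_ids,
            )
            names = dict(cur.fetchall())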

Re: [Cloud] Loading wikipedia dump onto Clouds

2020-04-13 Thread Huji Lee
I understand. However, I think that the use case we are looking at is relatively unique. I also think that the indexes we need may not be desirable for all the Wiki Replicas (they would often be multi-column indexes geared towards a specific set of queries) and I honestly don't want to go through the s…
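
Purely illustrative, to show what "geared towards a specific set of queries" could mean in practice: a multi-column index added to a local copy of the data, matching one particular query shape and of little use to most other queries. The table, column order, and connection details are hypothetical.

    import pymysql

    # Hypothetical local copy of the dump, where we are free to add indexes.
    conn = pymysql.connect(host="localhost", user="report_bot",
                           password="...", database="fawiki_local")
    with conn.cursor() as cur:
        # Tailored to one report (all edits to a page, grouped by actor,
        # ordered by time); other queries would gain little from it.
        cur.execute(
            "CREATE INDEX rev_page_actor_time "
            "ON revision (rev_page, rev_actor, rev_timestamp)"
        )
    conn.commit()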

Re: [Cloud] Loading wikipedia dump onto Clouds

2020-04-13 Thread Bryan Davis
On Sun, Apr 12, 2020 at 7:48 AM Huji Lee wrote: > > One possible solution is to create a script which is scheduled to run once a > month; the script would download the latest dump of the wiki database,[3] > load it into MySQL/MariaDB, create some additional indexes that would make > our desired…
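
A rough sketch of what such a monthly job could look like, assuming for the example that a single table dump is enough: download the latest dump, load it into a local MariaDB database, then add an extra index for the reports. The dump file, database name, and index are illustrative only.

    import gzip
    import shutil
    import subprocess
    import urllib.request

    # One table dump as an example; a real job would loop over several tables.
    DUMP = ("https://dumps.wikimedia.org/fawiki/latest/"
            "fawiki-latest-categorylinks.sql.gz")

    # 1. Download the latest dump.
    urllib.request.urlretrieve(DUMP, "categorylinks.sql.gz")

    # 2. Decompress and load it into a local MariaDB database.
    with gzip.open("categorylinks.sql.gz", "rb") as src, \
            open("categorylinks.sql", "wb") as dst:
        shutil.copyfileobj(src, dst)
    with open("categorylinks.sql", "rb") as sql:
        subprocess.run(["mysql", "fawiki_local"], stdin=sql, check=True)

    # 3. Add an extra index the reports rely on (illustrative only).
    subprocess.run(
        ["mysql", "fawiki_local", "-e",
         "CREATE INDEX cl_to_from ON categorylinks (cl_to, cl_from)"],
        check=True,
    )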