Re: [Foundation-l] Wikistats is back

2009-01-05 Thread Brion Vibber
On 12/24/08 3:31 PM, Brian wrote: > I am still quite shocked at the amount of time the english wikipedia takes > to dump, especially since we seem to have close links to folks who work at > mysql. To me it seems that one of two things must be the case: > > 1. Wikipedia has outgrown mysql, in the se

[Foundation-l] Wikistats is back

2009-01-02 Thread Erik Zachte
A week ago I published new wikistats files, for the first time in 7 months, only to retract them 2 days later, when it turned out that counts for some wikis were completely wrong. After some serious bug hunting I nailed the creepy creature that had been hiding in an unexpected corner (most bugs fin

Re: [Foundation-l] Wikistats is back

2009-01-02 Thread Samuel Klein
Woo!! Thank you belatedly for my new years' dose of infodisiac. --SJ On Wed, Dec 24, 2008 at 5:50 PM, Erik Zachte wrote: > New wikistats reports have been published today, for the first time since > May 2008. The reports have been generated on the new wikistats server > 'Bayes', which is opera

Re: [Foundation-l] Wikistats is back

2009-01-02 Thread Gerard Meijssen
Hoi, On that note ... http://hardware.slashdot.org/article.pl?sid=09/01/02/1546214 Thanks, GerardM 2009/1/1 geni > 2008/12/25 Gerard Meijssen : > > Hoi, > > It is not one either. It has been said repeatedly that the process of a > > straightforward back up is something that is done on a reg

Re: [Foundation-l] Wikistats is back

2009-01-01 Thread geni
2008/12/25 Gerard Meijssen : > Hoi, > It is not one either. It has been said repeatedly that the process of a > straightforward back up is something that is done on a regular basis. No it hasn't -- geni

Re: [Foundation-l] Wikistats is back to May 2008 version

2008-12-25 Thread Ziko van Dijk
Dear Erik, These things happen; I look forward all the more eagerly to the new figures. Good that you reported it, because I was already writing a contribution for a mailing list about the very low figures for new German Wikipedians. Erik, it was nice to get to know you, and in 20

Re: [Foundation-l] Wikistats is back

2008-12-25 Thread Aryeh Gregor
On Wed, Dec 24, 2008 at 7:09 PM, Brian wrote: > Interesting. I realize that the dump is extremely large, but if 7zip is > really the bottleneck then to me the solutions are straightforward: > > 1. Offer an uncompressed version of the dump for download. Bandwidth is > cheap and downloads can be res

[Foundation-l] Wikistats is back to May 2008 version

2008-12-25 Thread Erik Zachte
There is something seriously wrong with the figures for some wikipedias in the new wikistats reports. The figures for some wikis are much too low. When comparing csv files (raw counts) produced in May 2008 and produced recently it is quite easy to tell the difference. For some wikis the data for mo

Re: [Foundation-l] Wikistats is back

2008-12-25 Thread John Vandenberg
On 12/25/08, Gerard Meijssen wrote: > Hoi, > It is not one either. It has been said repeatedly that the process of a > straightforward back up is something that is done on a regular basis. This > however includes a lot of information that we do not allow to be included in > the data export that is

Re: [Foundation-l] Wikistats is back

2008-12-24 Thread Gerard Meijssen
Hoi, It is not one either. It has been said repeatedly that the process of a straightforward back up is something that is done on a regular basis. This however includes a lot of information that we do not allow to be included in the data export that is made available to the public. So never mind wh

Re: [Foundation-l] Wikistats is back

2008-12-24 Thread David Gerard
2008/12/25 geni : > I'd more be thinking of handing over a stack of hard drives to > wikimedia chapter reps at wikimania . 2TB external hard disk, gzip on the fly (gzipping is faster than the network - remember, Wikimedia gzips data going between internal servers in the same rack because CPU is

Re: [Foundation-l] Wikistats is back

2008-12-24 Thread geni
2008/12/25 David Gerard : > 2008/12/25 Brian : > >> But at least this would allow Erik, researchers and archivers to get the >> dump faster than they can get the compressed version. The number of people >> who want this can't be > 100, can it? It would need to be metered by an API >> I guess. > > >

Re: [Foundation-l] Wikistats is back

2008-12-24 Thread Robert Rohde
On Wed, Dec 24, 2008 at 6:29 PM, Brian wrote: > I'm also curious, what is the estimated amount of time to decompress this > thing? Somewhere around 1 week, I'd guesstimate. -Robert Rohde

Re: [Foundation-l] Wikistats is back

2008-12-24 Thread David Gerard
2008/12/25 Brian : > But at least this would allow Erik, researchers and archivers to get the > dump faster than they can get the compressed version. The number of people > who want this can't be > 100, can it? It would need to be metered by an API > I guess. Maybe we can run a sneakernet of DLT

Re: [Foundation-l] Wikistats is back

2008-12-24 Thread Brian
I'm also curious, what is the estimated amount of time to decompress this thing? On Wed, Dec 24, 2008 at 7:24 PM, Brian wrote: > But at least this would allow Erik, researchers and archivers to get the > dump faster than they can get the compressed version. The number of people > who want this c

Re: [Foundation-l] Wikistats is back

2008-12-24 Thread Brian
But at least this would allow Erik, researchers and archivers to get the dump faster than they can get the compressed version. The number of people who want this can't be > 100, can it? It would need to be metered by an API I guess. Cheers, Brian On Wed, Dec 24, 2008 at 7:18 PM, Robert Rohde wro

Re: [Foundation-l] Wikistats is back

2008-12-24 Thread Robert Rohde
On Wed, Dec 24, 2008 at 6:05 PM, Brian wrote: > Hi Robert, > > I'm not sure I agree with you.. > > (3 terabytes / 10 megabytes) seconds in days = 3.64 days > > That is, on my university connection I could download the dump in just a few > days. The only cost is bandwidth. While you might be corre

Re: [Foundation-l] Wikistats is back

2008-12-24 Thread Brian
Hi Robert, I'm not sure I agree with you.. (3 terabytes / 10 megabytes) seconds in days = 3.64 days That is, on my university connection I could download the dump in just a few days. The only cost is bandwidth. On Wed, Dec 24, 2008 at 6:46 PM, Robert Rohde wrote: > On Wed, Dec 24, 2008 at 4:0
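The download-time estimate in Brian's message can be checked with a short sketch. It assumes the "10 megabytes" figure means a sustained rate of 10 megabytes per second, and that the 3.64-day result comes from binary units (1 TB = 2^40 bytes, 1 MB = 2^20 bytes); neither is stated in the post. The function name download_days is hypothetical, used only for illustration.

```python
SECONDS_PER_DAY = 86400


def download_days(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Days needed to transfer size_bytes at a constant rate (no overhead, no stalls)."""
    return size_bytes / rate_bytes_per_s / SECONDS_PER_DAY


# Assumption: binary units, which reproduces the figure quoted in the post.
binary_days = download_days(3 * 2**40, 10 * 2**20)

# Assumption: decimal (SI) units, for comparison.
decimal_days = download_days(3e12, 10e6)

print(f"binary units:  {binary_days:.2f} days")
print(f"decimal units: {decimal_days:.2f} days")
```

Either way the transfer comes out at roughly three and a half days, provided the link actually sustains that rate for the whole download.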

Re: [Foundation-l] Wikistats is back

2008-12-24 Thread Robert Rohde
On Wed, Dec 24, 2008 at 4:09 PM, Brian wrote: > Interesting. I realize that the dump is extremely large, but if 7zip is > really the bottleneck then to me the solutions are straightforward: > > 1. Offer an uncompressed version of the dump for download. Bandwidth is > cheap and downloads can be res

Re: [Foundation-l] Wikistats is back

2008-12-24 Thread David Gerard
2008/12/25 Erik Zachte : > Hi Brian, Brion once explained to me that the post processing of the dump is > the main bottleneck. > Compressing articles with tens of thousands of revisions is a major resource > drain. > Right now every dump is even compressed twice, into bzip2 (for wider > platform c

Re: [Foundation-l] Wikistats is back

2008-12-24 Thread Brian
Also, I wonder if these folks have been consulted for their expertise in compressing wikipedia data: http://prize.hutter1.net/ On Wed, Dec 24, 2008 at 5:09 PM, Brian wrote: > Interesting. I realize that the dump is extremely large, but if 7zip is > really the bottleneck then to me the solutions

Re: [Foundation-l] Wikistats is back

2008-12-24 Thread Brian
Interesting. I realize that the dump is extremely large, but if 7zip is really the bottleneck then to me the solutions are straightforward: 1. Offer an uncompressed version of the dump for download. Bandwidth is cheap and downloads can be resumed, unlike this dump process 2. The WMF offers a servi

[Foundation-l] Wikistats is back

2008-12-24 Thread Erik Zachte
Hi Brian, Brion once explained to me that the post processing of the dump is the main bottleneck. Compressing articles with tens of thousands of revisions is a major resource drain. Right now every dump is even compressed twice, into bzip2 (for wider platform compatibility) and 7zip format (for 2

[Foundation-l] Wikistats is back

2008-12-24 Thread Erik Zachte
John: > For the "Page Views" data on some projects, the May data > looks unusually lower than the June data; > could it be that the May data isn't > a complete month for some projects? Yes, that is indeed the case. I will omit the incomplete month on subsequent reports. Erik Zachte

Re: [Foundation-l] Wikistats is back

2008-12-24 Thread Jon
Thank you Erik! Erik Zachte wrote: > New wikistats reports have been published today, for the first time since > May 2008. The reports have been generated on the new wikistats server > ‘Bayes’, which is operational since a few weeks. The dump process

Re: [Foundation-l] Wikistats is back

2008-12-24 Thread Brian
Nice work Erik! I am still quite shocked at the amount of time the English Wikipedia takes to dump, especially since we seem to have close links to folks who work at MySQL. To me it seems that one of two things must be the case: 1. Wikipedia has outgrown MySQL, in the sense that, while we can put

Re: [Foundation-l] Wikistats is back

2008-12-24 Thread John Vandenberg
Thank you Erik! For the "Page Views" data on some projects, the May data looks unusually lower than the June data; could it be that the May data isn't a complete month for some projects? http://stats.wikimedia.org/wikisource/EN/TablesPageViewsMonthly.htm http://stats.wikimedia.org/wikiquote/EN/Tab

[Foundation-l] Wikistats is back

2008-12-24 Thread Erik Zachte
New wikistats reports have been published today, for the first time since May 2008. The reports have been generated on the new wikistats server ‘Bayes’, which has been operational for a few weeks. The dump process itself had been restarted some weeks earlier; new dumps are now available for all 700+ w