On Thu, Aug 20, 2009 at 9:30 PM, Mark Wagner <carni...@gmail.com> wrote:
> On Thu, Aug 20, 2009 at 14:10, Anthony <wikim...@inbox.org> wrote:
> > "if one chooses a random page from Wikipedia right now, what is the
> > probability of receiving a vandalized revision" The best way to
> > answer that question would be with a manually processed random
> > sample taken from a pre-chosen moment in time. As few as 1000
> > revisions would probably be sufficient, if I know anything about
> > statistics, but I'll let someone with more knowledge of statistics
> > verify or refute that. The results will depend heavily on one's
> > definition of "vandalism", though.
>
> I did this in an informal fashion in 2005 during my "hundred article"
> surveys. Of the 503 pages I looked at, only one was clearly
> vandalized the first time I looked at it, so I'd say a thousand
> samples is probably too small to get any sort of precision on the
> vandalism rate.

Why? My understanding is that, if your methodology was correct, you can
say with 96% confidence that the percentage of vandalized articles is
less than 0.6%. That's useful. With 1000 samples, if you found two
instances of vandalism, you'd have a 97% confidence that the percentage
of vandalized articles is less than 0.5%.

I don't think it's that low, but if you publish the details of your
"hundred article" surveys, I might be persuaded that it is.

If we really do have that figure to that level of assurance, a more
useful statistic would be the percentage of pageviews that result in a
vandalized article. That could be arrived at by weighting by pageviews
while choosing your random sample.

One flaw I found in my proposed methodology is that the "moment in
time" needs to be randomized, since certain times of the day/week/year
might very well experience higher vandalism than others.
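
The confidence figures above can be checked numerically. What follows is a minimal Python sketch, not the method used in this message, which does not say how the 96% and 97% were derived. It assumes a simple binomial model in which each sampled page is independently vandalized with the same probability. All function names are hypothetical. `confidence_rate_below` uses the exact binomial tail, `wald_confidence` uses a two-sided normal approximation for comparison, and `sample_by_pageviews` illustrates a pageview-weighted draw from caller-supplied (here made-up) view counts.

```python
import math
import random


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def confidence_rate_below(k, n, p_max):
    """Confidence that the true rate is below p_max, given k hits in n samples.

    Computed as 1 - P(X <= k | rate = p_max): if the true rate were as high
    as p_max, this is how unlikely it would be to see k or fewer hits.
    """
    return 1.0 - binom_cdf(k, n, p_max)


def wald_confidence(k, n, p_max):
    """Two-sided normal-approximation confidence that the true rate lies
    within (p_max - k/n) of the observed rate k/n.

    Assumption: included only for comparison; the approximation is poor
    when k is very small.
    """
    if k == 0 or k == n:
        raise ValueError("normal approximation is undefined for k == 0 or k == n")
    p_hat = k / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    z = (p_max - p_hat) / se
    return math.erf(z / math.sqrt(2))  # equals 2 * Phi(z) - 1


def sample_by_pageviews(pageviews, k, rng=None):
    """Draw k page titles with probability proportional to their pageviews.

    `pageviews` maps title -> view count (hypothetical input). Sampling is
    with replacement, so a heavily viewed page can appear more than once.
    """
    rng = rng or random.Random()
    titles = list(pageviews)
    weights = [pageviews[t] for t in titles]
    return rng.choices(titles, weights=weights, k=k)


if __name__ == "__main__":
    for k, n, p_max in [(1, 503, 0.006), (2, 1000, 0.005)]:
        exact = confidence_rate_below(k, n, p_max)
        wald = wald_confidence(k, n, p_max)
        print(f"{k}/{n}, rate < {p_max:.1%}: exact {exact:.3f}, normal approx {wald:.3f}")
```

Under these assumptions the normal approximation lands close to the 96% and 97% quoted above, while the exact binomial tail gives noticeably lower confidence (roughly 80% and 88%), since a normal approximation is unreliable when only one or two events are observed.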