On Thu, Apr 14, 2011 at 4:04 PM, harrismh777 <harrismh...@charter.net> wrote:
> How many web crawlers have you built? Are there any web programmers out
> there who need a web bot to hit multiple sites zillions of times a month
> from different places on earth to 'up' the number of hits for economic
> reasons? I've seen my share of this.

A well-behaved spider will (a) have a UA that identifies itself (as a
bot, and preferably as itself - eg "GoogleBot", etc - some even go so
far as to include a URL for more info), and (b) start by fetching
/robots.txt before it goes any further. Servers can recognize
properly-built crawlers. And improperly-built crawlers, deliberately
trying to hammer a server to lie about browser stats? Seriously, do
you think people actually care THAT much?
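That polite behaviour is only a few lines of code. Here's a rough
Python 3 sketch - the bot name and info URL are made up for
illustration, and a real crawler would also need rate limiting,
caching of robots.txt, error handling and so on:

import urllib.robotparser
import urllib.request
from urllib.parse import urlsplit

# (a) Identify ourselves honestly, with a pointer to more info.
# "ExampleBot" and the URL are placeholders, not a real crawler.
USER_AGENT = "ExampleBot/1.0 (+http://example.com/bot-info)"

def polite_fetch(url):
    """Fetch url only if the site's robots.txt permits it."""
    parts = urlsplit(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("{0}://{1}/robots.txt".format(parts.scheme, parts.netloc))
    rp.read()  # (b) fetch and parse robots.txt before anything else
    if not rp.can_fetch(USER_AGENT, url):
        return None  # the site asked us not to crawl this path
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read()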
> How many times have you altered the identity of your web browser so
> that the web site would 'work'? You know, stupid messages from the
> server that say, "We only support IE 6+, upgrade your browser...", so
> you tell it you're using IE 6 and, well no problem.

Yep. Which means that the figures will always be skewed toward IE a
bit. But it's a lot less than you might think; most people don't leave
UA switchers active all the time, and the number of web sites that
require them is dropping. It's true that UA switching will tip the
stats toward IE (I've never seen a site where you have to pretend to
be Google Chrome), but the prevalence of it is, I believe, not all
that high.

> Web site data is bogus. It assumes even distributions... it assumes
> even usage of the site from all surfers, it assumes no web crawlers
> and no bots, it assumes no browser identity tampering, and it assumes
> that there aren't those who for economic reasons are not inflating
> the numbers deliberately (no, really??) from world-owned bot farms.

Even distributions of what?

1) Assuming nothing, it merely gives data. About one site. That's why
overall "browser marketshare" stats have to be done by averaging
multiple sites.

2) Web crawlers - see above. If you've ever looked at AWStats or
Webalizer or *insert stats engine here*, you'll have seen that it will
identify them. AWStats goes a bit further and splits traffic into
"viewed" and "not viewed" even when it's unable to identify the
specific bot. (There's a rough sketch of the idea below.)

3) Yes, it assumes no UA switchers, obviously. It's just based on
headers. But I reckon you could easily identify someone who's using a
switcher, based on other headers - for instance, I doubt very much
that IE6 will send "Accept-Encoding: gzip,deflate,sdch" (which my
Chrome does). (Second sketch below.)
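A rough sketch of the idea behind (2): the stats engine matches the
User-Agent header against known bot signatures. The signature list
here is a tiny made-up sample - AWStats ships with hundreds of
patterns, and its real logic is more involved:

# Tiny illustrative sample of bot signatures.
BOT_SIGNATURES = ("googlebot", "bingbot", "slurp", "spider", "crawl")

def is_bot(user_agent):
    return any(sig in user_agent.lower() for sig in BOT_SIGNATURES)

# "Viewed traffic" is whatever is left once the bots are filtered out.
hits = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/534.30 Chrome/12.0.742.100",
]
viewed = [ua for ua in hits if not is_bot(ua)]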
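And a sketch of the consistency check in (3) - a purely illustrative
heuristic, keyed to the Accept-Encoding example above:

def looks_like_ua_switcher(headers):
    """Flag a request whose headers contradict its claimed browser."""
    ua = headers.get("User-Agent", "")
    enc = headers.get("Accept-Encoding", "")
    # "sdch" in Accept-Encoding was a Chrome giveaway; the real IE6
    # never sent it.
    return "MSIE 6" in ua and "sdch" in enc

print(looks_like_ua_switcher({
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Accept-Encoding": "gzip,deflate,sdch",
}))  # True - claims IE6 but leaks Chrome's encodings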
4) Yes, it assumes people aren't deliberately fiddling the figures.
We're in the realm of conspiracy theories here... does anyone
seriously think that browser stats are THAT important that they'd go
to multiple web servers with deceitful hits? Not forgetting that
they'd have to mix up the IPs, make plausible "browsing sessions"
(with referers and image retrieval and so on), vary the date/times,
etc, etc, etc... and generate enough hits to make a reasonable dent in
the figures.

> There is no reliable way to measure free software usage. But, there
> sure is a lot of posturing going on in the market place ... wonder
> why?

Sure, and there's no reliable way to measure non-free software usage
either. What's the difference? You could count sales of Microsoft
Office, and you could count downloads of Open Office. Neither is any
more accurate than the other; although I think the 24-hour figures for
Firefox 4 / IE 9 downloads are fairly indicative, since people can't
get either off their respective OS install CDs.

And this isn't restricted to software either. Which is more popular,
Coca-Cola or Pepsi? Do more people vote Liberal or Labour, Republican
or Democrat, Whig or Tory? Statisticking is a huge science. Most of it
involves figuring out what's important - anyone can get data, but
getting useful information out of the data takes some work.

Chris Angelico
--
http://mail.python.org/mailman/listinfo/python-list