Raj,

I've been experimenting with R to compute simple statistics from my web logs, somewhat similar to what you're describing. For instance, I'm trying to classify each unique IP or domain-name requestor as 'human' or 'robot' based on the number of seconds between its requests for pages. I've found that the easiest approach, given my (elementary) knowledge of R and my (professional) knowledge of perl, is to run my logs through a perl program that pre-processes the data before submitting it to R. Running my Apache web log through the perl program produces tab-delimited output like this:

[EMAIL PROTECTED]:~/weblogstats$ ./weblogtimediff.pl access_log.20071130.sorted |head
DateTime              Source                           TimeDiff  Type
30/Nov/2007 00:00:47  54.100.68.58.sikkanet.com        15        unknown
30/Nov/2007 00:00:48  54.100.68.58.sikkanet.com        1         unknown
30/Nov/2007 00:01:19  54.100.68.58.sikkanet.com        31        unknown
30/Nov/2007 00:01:25  54.100.68.58.sikkanet.com        6         unknown
30/Nov/2007 00:01:29  ip-61-14-181-116.asianetcom.net  15        unknown
30/Nov/2007 00:01:40  54.100.68.58.sikkanet.com        15        unknown
30/Nov/2007 00:01:41  54.100.68.58.sikkanet.com        1         unknown
30/Nov/2007 00:01:44  llf520049.crawl.yahoo.net        14        robot
30/Nov/2007 00:01:46  ip-61-14-181-116.asianetcom.net  17        unknown
[EMAIL PROTECTED]:~/weblogstats$
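In R itself, the kind of thing I then do looks roughly like this (an illustrative, untested sketch, not my actual script; the file name "timediff.tsv" is made up, and the column names are the ones in the header above). It reads the pre-processed output and draws the box plots I mention below:

# Read the tab-delimited output of the perl pre-processor
# ("timediff.tsv" is an illustrative file name, not the real one).
x <- read.delim("timediff.tsv", stringsAsFactors = FALSE)

# Mean time between requests for each source, keeping its preliminary type.
agg <- aggregate(TimeDiff ~ Source + Type, data = x, FUN = mean)

# Box plots of log(mean interval) by preliminary class; sources whose mean
# interval is zero would need special handling before taking the log.
boxplot(log(TimeDiff) ~ Type, data = agg,
        ylab = "log(mean seconds between requests)")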
In this, I also make a preliminary classification into 'robot' (because it identified itself as such in the browser field), 'human' (because it submitted a text string to my internal search engine), or 'unknown'. Unfortunately, this approach doesn't seem to be working. The distributions of both the 'humans' and the 'robots' appeared to be Poisson by inspection. I therefore created box plots of the log(mean(time intervals)), but the 'humans' and the 'robots' were indistinguishable by inspection. As this is not exactly what I'm paid to do, I only play with it in my spare time, so I haven't tried anything else yet.

If it's of general interest to this group, I'd be happy to publish my program for this. Otherwise, Raj, if you're interested, I'd be happy to send it to you privately.

One oddity I noted is that Apache logs are not always in chronological order. The date/time stamp is when the request occurred, but the entry is written to the log when the request is completed. Thus, for a long download, several shorter subsequent downloads may have been requested and completed before the earlier, long one. I was confused by negative time differences from my program until I discovered this. I now sort my Apache log into chronological order before passing it through my program.

Hope this helps. Let me know if you have any other questions.

-Kevin

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Raj Mathur
Sent: Thursday, January 31, 2008 8:31 AM
To: r-help
Subject: [R] Newbie: Using R to analyse Apache logs

Hi,

I have a requirement to scan Apache logs and discover ``exceptions''. Exceptions can be of two types:

1. A single IP generating a large amount of traffic within a given time frame (for definable values of ``large'' and ``time frame'').

2. A single IP hitting a wide set of URLs on the server (indicates a crawler), again for definable values of ``wide''.

I'm a complete newbie to R (and to statistics), so the questions are:

- Can R help me generate graphs which would help me identify these activities?
- Has someone already done something like this? If so, where could I find it?
- If not, can someone help me with the stats (and R) part to help me achieve these objectives?

Any software that gets created as a result would be released under a FOSS license. Data massaging, tuning, etc. are not an issue. We'd be dealing with a few hundred thousand or a million records a day.

Regards,

-- Raju
--
Raj Mathur    [EMAIL PROTECTED]    http://kandalaya.org/
Freedom in Technology & Software || February 2008 || http://freed.in/
GPG: 78D4 FC67 367F 40E2 0DD5 0FEF C968 D0EF CC68 D17F
PsyTrance & Chill: http://schizoid.in/ || It is the mind that moves

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
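For reference, a minimal sketch of the two checks described in the original request, assuming the Apache log has already been parsed into a data frame named logs with columns ip, time (POSIXct), and url. The data frame, column names, one-hour window, and thresholds are illustrative assumptions, not anything taken from the posts above.

# Hypothetical parsed access log: data frame 'logs' with columns
# ip (character), time (POSIXct), and url (character).

# 1. A single IP generating "large" traffic within a given time frame:
#    count requests per IP per hour and flag counts above a chosen threshold.
logs$hour <- format(logs$time, "%Y-%m-%d %H")
hits <- subset(as.data.frame(table(ip = logs$ip, hour = logs$hour)), Freq > 0)
heavy <- subset(hits, Freq > 1000)   # "large" is definable; 1000 is illustrative

# 2. A single IP hitting a "wide" set of URLs (suggests a crawler):
#    count distinct URLs per IP and flag wide coverage.
urls.per.ip <- tapply(logs$url, logs$ip, function(u) length(unique(u)))
crawlers <- sort(urls.per.ip[urls.per.ip > 500], decreasing = TRUE)  # "wide" is definable

# Simple graphs to eyeball both distributions.
hist(log10(hits$Freq), main = "Requests per IP per hour",
     xlab = "log10(requests in an hour)")
hist(log10(urls.per.ip), main = "Distinct URLs per IP",
     xlab = "log10(distinct URLs)")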