I have a problem: in a few cases "robot-exclusion-useragent" has two or more values. Is there a way to handle this? For example, the robot askjeeves has three names.
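
One thing that might help, offered only as a sketch: read.dcf has an `all` argument that gathers every occurrence of a repeated field instead of keeping a single one. This assumes the extra names in all.txt appear as repeated "robot-exclusion-useragent" lines within a record, which I have not verified; if they are instead packed into one line, the values would need splitting with strsplit. The sample records and the variable names below are made up for illustration.

```r
# Hypothetical two-record sample in the same "Field: value" layout as all.txt.
# The field values are invented for illustration only.
sample_dcf <- c(
  "robot-id: example-a",
  "robot-exclusion-useragent: agent-one",
  "robot-exclusion-useragent: agent-two",
  "",
  "robot-id: example-b",
  "robot-exclusion-useragent: agent-three"
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_dcf, tmp)

# all = TRUE returns a data frame and gathers repeated fields,
# rather than keeping a single occurrence per record.
robots <- read.dcf(tmp, all = TRUE)

# Locate the column by pattern, so the lookup does not depend on
# exactly how the hyphenated field name is stored.
col <- grep("exclusion", names(robots))[1]
agents <- robots[[col]]

# Collapse each record's values into one string for easier viewing.
collapsed <- vapply(agents, paste, character(1), collapse = "; ")
print(collapsed)
```

For the real file the call would be read.dcf("all.txt", all = TRUE), and the result is a data frame rather than the matrix shown in the quoted message below.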

2010/4/13 Barry Rowlingson <b.rowling...@lancaster.ac.uk>:
> On Tue, Apr 13, 2010 at 6:26 PM, Sebastian Kruk <residuo.so...@gmail.com>
> wrote:
>> Dear R-list users:
>>
>> I would like to import a database of web robots,
>> http://www.robotstxt.org/db/all.txt, it's formatted RFC-822, how can
>> I do it?
>
> RFC822 looks very much like R's package DESCRIPTION files, and they
> are read in using read.dcf because they are conformant to 'Debian
> Control File' format. So I tried read.dcf on it:
>
>  > robots = read.dcf("all.txt")
>  > dim(robots)
>  [1] 298  38
>
> so that's a matrix:
>
>  > dimnames(robots)
>  [[1]]
>  NULL
>
>  [[2]]
>   [1] "robot-id"                  "robot-name"
>   [3] "robot-cover-url"           "robot-details-url"
>   [5] "robot-owner-name"          "robot-owner-url"
>   [7] "robot-owner-email"         "robot-status"
>   [9] "robot-purpose"             "robot-type"
>  [11] "robot-platform"            "robot-availability"
>  [13] "robot-exclusion"           "robot-exclusion-useragent"
>  [15] "robot-noindex"             "robot-host"
>  [17] "robot-from"                "robot-useragent"
>  [19] "robot-language"            "robot-description"
>  [21] "robot-history"             "robot-environment"
>  [23] "modified-date"             "modified-by"
>  [25] "robot-nofollow"            "robot-owner-name2"
>  [27] "robot-owner-url2"          "robot-owner-email2"
>  [29] "robot-owner-name3"         "robot-owner-name4"
>  [31] "robot-environment1"        "robot-environment2"
>  [33] "robot-purpose1"            "robot-purpose2"
>  [35] "robot-purpose3"            "robot-platform1"
>  [37] "robot-description1"        "robot-description2"
>
> and I guess it pads out the columns so every row has every possible
> variable value even if it doesn't exist in the record for that robot.
>
> Sorted?
>
> Barry
>

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.