[Haskell-cafe] Wikipedia archiving bot - code review

Gwern Branwen Mon, 25 Jun 2007 17:18:00 -0700

Hey everyone. So I've been learning Haskell for a while now, and I've found the 
best way to move from theory to practice is to just write something useful for 
yourself. Now, I'm keen on editing Wikipedia and I've long wanted some way to 
stop links to external websites from breaking on me. So I wrote this little 
program using the TagSoup library which will download Wikipedia articles, parse 
out external links, and then ask WebCite to archive them.


But there's a problem: no matter how I look at it, it's just way too slow. 
Running on a measly 100 articles at a time, it'll eat up to half my processor 
time and RAM (according to top). I converted it over to ByteStrings since 
that's supposed to be a lot better than regular Strings, but that didn't seem 
to help much.
So I'm curious: in what way could this code be better? How could it be more 
idiomatic or shorter? Particularly, how could it be more efficient either in 
space or time? Any comments are appreciated.

{- Module      :  Main.hs
   License     :  public domain
   Maintainer  :  Gwern Branwen <[EMAIL PROTECTED]>
   Stability   :  unstable
   Portability :  portable
   Functionality: retrieve specified articles from Wikipedia and request
                  WebCite to archive all URLs found.
   TODO: send an equivalent request to the Internet Archive.
         Not in any way rate-limited.
   BUGS: Issues redundant archive requests.
         Currently uses Data.ByteString.Lazy.Char8. If I'm understanding the
         documentation right, this barfs on the full UTF-8 character set, but
         Wikipedia definitely exercises the full UTF-8 set.
   USE: Pipe to stdin a succession of Wikipedia article names (whitespace in
        names should be escaped as '_').
        A valid invocation might be, say:
        '$ echo Fujiwara_no_Teika Fujiwara_no_Shunzei | archive-bot'
        All URLs in [[Fujiwara no Teika]] and [[Fujiwara no Shunzei]] would
        then be backed up.
        If you wanted to run this on all of Wikipedia, you could take the
        current 'all-titles-in-ns0' gzipped file from [[WP:DUMP]], gunzip it,
        and then pipe it into archive-bot. -}

module Main where
import Text.HTML.TagSoup (parseTags, Tag(TagOpen))
import Text.HTML.Download (openURL)
import Data.List (isPrefixOf)
import Monad (liftM)
import Data.Set (toList, fromList)
import qualified Data.ByteString.Lazy.Char8 as B
    (ByteString(), getContents, lines, unlines, pack, unpack, words)

main :: IO ()
main = mapM_ archiveURL =<< (liftM sortNub $ mapM fetchArticleText =<<
                             (liftM B.words $ B.getContents))
  where sortNub :: [[B.ByteString]] -> [B.ByteString]
        sortNub = toList . fromList . concat

fetchArticleText :: B.ByteString -> IO [B.ByteString]
fetchArticleText article =
    liftM (B.lines . extractURLs) (openURL (wikipedia ++ B.unpack article))
  where wikipedia = "http://en.wikipedia.org/wiki/"

extractURLs :: String -> B.ByteString
extractURLs arg = B.unlines $ map B.pack
    [ x | TagOpen "a" atts <- parseTags arg
        , (_, x) <- atts
        , "http://" `isPrefixOf` x ]

archiveURL :: B.ByteString -> IO String
archiveURL url =
    openURL ("www.webcitation.org/archive?url=" ++ B.unpack url ++ emailAddress)
  where emailAddress = "&[EMAIL PROTECTED]"
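
One restructuring that might help with both the memory use and the redundant 
requests, offered only as a sketch and not measured: handle one article at a 
time and carry a Set of URLs already submitted, so the list of every URL from 
every article never has to be built before the first request goes out. The 
names unseen and archiveAll are hypothetical, as are the fetch and archive 
parameters, which stand in for fetchArticleText and archiveURL. Plain String 
is used only to keep the sketch self-contained.

```haskell
import qualified Data.Set as Set
import Data.List (foldl')
import Control.Monad (foldM_)

-- | Given the URLs seen so far and a new batch, return the updated set and
-- the URLs from the batch that were not seen before, in their original order.
unseen :: Set.Set String -> [String] -> (Set.Set String, [String])
unseen seen urls = (seen', reverse new)
  where
    (seen', new) = foldl' step (seen, []) urls
    step (s, acc) u
      | u `Set.member` s = (s, acc)
      | otherwise        = (Set.insert u s, u : acc)

-- | Process articles one at a time. 'fetch' and 'archive' are assumptions:
-- placeholders for something like fetchArticleText and archiveURL.
archiveAll :: (String -> IO [String]) -> (String -> IO ()) -> [String] -> IO ()
archiveAll fetch archive = foldM_ step Set.empty
  where
    step seen article = do
      urls <- fetch article
      let (seen', fresh) = unseen seen urls
      mapM_ archive fresh   -- only URLs not already requested in this run
      return seen'
```

The Set still grows with the number of distinct URLs, so this only bounds 
the per-article working set; it does nothing about requests repeated across 
separate runs.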

--
gwern
MAC10 M3 L34A1 Walther MPL AKS-74 HK-GR6 subsonic rounds ballistic media special
