Hey everyone. So I've been learning Haskell for a while now, and I've found the best way to move from theory to practice is to just write something useful for yourself. Now, I'm keen on editing Wikipedia and I've long wanted some way to stop links to external websites from breaking on me. So I wrote this little program using the TagSoup library which will download Wikipedia articles, parse out external links, and then ask WebCite to archive them.
But there's a problem: no matter how I look at it, it's just way too slow.
Running on a measly 100 articles at a time, it'll eat up to half my processor
time and RAM (according to top). I converted it over to ByteStrings since
that's supposed to be a lot better than regular Strings, but that didn't seem
to help much.
So I'm curious: in what way could this code be better? How could it be more
idiomatic or shorter? Particularly, how could it be more efficient either in
space or time? Any comments are appreciate.
{- Module : Main.hs
License : public domain
Maintainer : Gwern Branwen <[EMAIL PROTECTED]>
Stability : unstable
Portability : portable
Functionality: retrieve specified articles from Wikipedia and request
WebCite to archive all URLs found.
TODO: send an equivalent request to the Internet Archive.
Not in any way rate-limited.
BUGS: Issues redundant archive requests.
Currently uses Data.ByteString.Lazy.Char8. If I'm understanding the
documentation right, this barfs
on the full UTF-8 character set, but Wikipedia definitely exercises
the full UTF-8 set.
USE: Print to stdin a succession of Wikipedia article names (whitespace in
names should be escaped as '_').
A valid invocation might be, say: '$echo Fujiwara_no_Teika
Fujiwara_no_Shunzei | archive-bot'
All URLs in [[Fujiwara no Teika]] and [[Fujiwara no Shunzei]] would
then be backed up.
If you wanted to run this on all of Wikipedia, you could take the
current 'all-titles-in-ns0'
gzipped file from [[WP:DUMP]], gunzip it, and then pipe it into
archive-bot. -}
module Main where
import Text.HTML.TagSoup (parseTags, Tag(TagOpen))
import Text.HTML.Download (openURL)
import Data.List (isPrefixOf)
import Monad (liftM)
import Data.Set (toList, fromList)
import qualified Data.ByteString.Lazy.Char8 as B (ByteString(), getContents,
lines, unlines, pack, unpack, words)
main :: IO ()
main = do mapM_ archiveURL =<< (liftM sortNub $ mapM fetchArticleText =<<
(liftM B.words $ B.getContents))
where sortNub :: [[B.ByteString]] -> [B.ByteString]
sortNub = toList . fromList . concat
fetchArticleText :: B.ByteString -> IO [B.ByteString]
fetchArticleText article = liftM (B.lines . extractURLs) (openURL(wikipedia ++
B.unpack article))
where wikipedia = "http://en.wikipedia.org/wiki/"
extractURLs :: String -> B.ByteString
extractURLs arg = B.unlines $ map B.pack ([x | TagOpen "a" atts <- (parseTags
arg), (_,x) <- atts, "http://" `isPrefixOf` x])
archiveURL :: B.ByteString -> IO String
archiveURL url = openURL("www.webcitation.org/archive?url=" ++ (B.unpack url)
++ emailAddress)
where emailAddress = "&[EMAIL PROTECTED]"
--
gwern
MAC10 M3 L34A1 Walther MPL AKS-74 HK-GR6 subsonic rounds ballistic media special
pgpPotHursSVu.pgp
Description: PGP signature
_______________________________________________ Haskell-Cafe mailing list [email protected] http://www.haskell.org/mailman/listinfo/haskell-cafe
