"softwindow" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
> it is difficult to get all URL's in a page
<snip>

Is this really so hard?:

-----------------
from pyparsing import Literal,Suppress,CharsNotIn,CaselessLiteral,\
        Word,dblQuotedString,alphanums,SkipTo,makeHTMLTags
import urllib

# extract all <a> anchor tags - makeHTMLTags defines a
# fairly robust pair of match patterns, not just "<tag>","</tag>"
linkOpenTag,linkCloseTag = makeHTMLTags("a")
link = linkOpenTag + \
       SkipTo(linkCloseTag).setResultsName("body") + \
       linkCloseTag.suppress()

# read the HTML source from some random URL
serverListPage = urllib.urlopen( "http://www.google.com" )
htmlText = serverListPage.read()
serverListPage.close()

# use the link grammar to scan the HTML source
for toks,strt,end in link.scanString(htmlText):
    print toks.startA.href,"->",toks.body
-----------------

Prints:
/url?sa=p&pref=ig&pval=2&q=http://www.google.com/ig%3Fhl%3Den -> Personalized Home
https://www.google.com/accounts/Login?continue=http://www.google.com/&hl=en -> Sign in
/imghp?hl=en&tab=wi&ie=UTF-8 -> Images
http://groups.google.com/grphp?hl=en&tab=wg&ie=UTF-8 -> Groups
http://news.google.com/nwshp?hl=en&tab=wn&ie=UTF-8 -> News
http://froogle.google.com/frghp?hl=en&tab=wf&ie=UTF-8 -> Froogle
/maphp?hl=en&tab=wl&ie=UTF-8 -> Maps
/intl/en/options/ -> more »
/advanced_search?hl=en -> Advanced Search
/preferences?hl=en -> Preferences
/language_tools?hl=en -> Language Tools
/intl/en/ads/ -> Advertising Programs
/services/ -> Business Solutions
/intl/en/about.html -> About Google

-- Paul
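
P.S. Several of the hrefs printed above are relative (for example /services/ and /intl/en/about.html). If absolute URLs are wanted, they can be resolved against the address the page was fetched from. What follows is only a rough sketch of that step, and it swaps in a different technique: it uses the Python 3 standard library (html.parser and urllib.parse.urljoin) instead of pyparsing. The names LinkCollector, extract_links and sample_html are made up for illustration, and the sample HTML is a hand-written stand-in, not the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect (absolute_url, link_text) pairs
    from <a href="..."> tags. Not part of pyparsing."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []     # finished (url, text) pairs
        self._href = None   # resolved href of the <a> currently open, if any
        self._text = []     # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                # relative hrefs are resolved against the base URL;
                # absolute hrefs pass through unchanged
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html_text, base_url):
    """Return a list of (absolute_url, link_text) tuples."""
    parser = LinkCollector(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hand-written sample standing in for a fetched page (an assumption)
sample_html = (
    '<p><a href="/services/">Business Solutions</a> '
    '<A HREF="http://groups.google.com/grphp?hl=en">Groups</A> '
    '<a name="top">anchor without href</a> '
    '<a href="/intl/en/about.html">About Google</a></p>'
)

for url, text in extract_links(sample_html, "http://www.google.com/"):
    print(url, "->", text)
```

With the sample above this prints three lines: the two relative links come back prefixed with http://www.google.com, the Groups link is left as it was, and the anchor without an href is skipped. Nested anchors and malformed markup are not handled specially, so treat it as a starting point only.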