Adam Monsen wrote:
> The following script is a high-performance link (<a
> href="...">...</a>) extractor.
[...]
> * extract links from text (most likely valid HTML)
[...]
> import re
> import urllib
>
> whiteout = re.compile(r'\s+')
>
> # grabs hyperlinks from text
> href_re = re.compile(r'''
>     <a(?P<attrs>[^>]*        # start of tag
>     href=(?P<delim>["'])     # delimiter
>     (?P<link>[^"']*)         # link
>     (?P=delim)               # delimiter
>     [^>]*)>                  # rest of start tag
>     (?P<content>.*?)         # link content
>     </a>                     # end tag
>     ''', re.VERBOSE | re.IGNORECASE)
A few notes:

The single- or double-quote delimiters are optional in some cases (and
frequently omitted even when required by the current standard). Where
blank spaces may appear in HTML entities is not so clear; to follow the
standard, one would have to acquire the SGML standard, which costs
money.

Popular browsers allow end tags such as "</a >", which the RE above
will reject.

I'm not good at reading REs, but it looks like the first line will
greedily match the entire start tag, and then back-track to find the
href attribute. There appear to be many other good opportunities for a
cleverly-constructed input to force big-time backtracking; for example,
a '>' will end the start tag, but within the delimiters it's just
another character. Can anyone show a worst-case run-time?

Writing a Python RE to match all and only legal anchor tags may not be
possible. Writing a regular expression to do so is definitely not
possible.

[...]
> def getLinks(html_data):
>     newdata = whiteout.sub(' ', html_data)
>     matches = href_re.finditer(newdata)
>     ancs = []
>     for match in matches:
>         d = match.groupdict()
>         a = {}
>         a['href'] = d.get('link', None)

The statement above doesn't seem necessary. The 'href' just gets
over-written below, as just another attribute.

>         a['content'] = d.get('content', None)
>         attr_matches = attrs_re.finditer(d.get('attrs', None))
>         for match in attr_matches:
>             da = match.groupdict()
>             name = da.get('name', None)
>             a[name] = da.get('value', None)
>         ancs.append(a)
>     return ancs

-- 
--Bryan
-- 
http://mail.python.org/mailman/listinfo/python-list
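P.S. For anyone who'd rather sidestep the regex question entirely, here
is a rough sketch of the same extraction built on the standard
library's HTML parser instead (shown with the Python 3 module name
`html.parser`; the class name `LinkExtractor` is mine, not Adam's, and
this handles unquoted attribute values and "</a >"-style end tags that
the RE rejects):

```python
# Sketch only, not Adam's code: extract anchors with the stdlib parser,
# returning the same list-of-dicts shape as getLinks(). Note the parser
# lower-cases tag and attribute names.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.ancs = []        # finished anchors, as dicts
        self._current = None  # anchor dict being filled, or None
        self._content = []    # text chunks inside the current anchor

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # attrs is a list of (name, value) pairs; value is None
            # for bare attributes
            self._current = dict(attrs)
            self._content = []

    def handle_data(self, data):
        if self._current is not None:
            self._content.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._current is not None:
            self._current['content'] = ''.join(self._content)
            self.ancs.append(self._current)
            self._current = None

def getLinks(html_data):
    parser = LinkExtractor()
    parser.feed(html_data)
    return parser.ancs
```

No backtracking to worry about, and malformed markup degrades
gracefully instead of blowing up the match time.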