On Jul 22, 5:43 pm, Filip <pink...@gmail.com> wrote:
>
> My library, rather than parsing the whole input into a tree, processes
> it like a char stream with regular expressions.
>
Filip -

In general, parsing HTML with re's is fraught with easily-overlooked
deviations from the norm. But since you have stepped up to the task,
here are some comments on your re's:

# You should use raw string literals throughout, as in:
# blah_re = re.compile(r'sljdflsflds')
# (note the leading r before the string literal). raw string literals
# really help keep your re expressions clean, so that you don't ever
# have to double up any '\' characters.

# Attributes might be enclosed in single quotes, or not enclosed in
# any quotes at all.
attr_re = re.compile('([\da-z]+?)\s*=\s*\"(.*?)\"', re.DOTALL | re.UNICODE | re.IGNORECASE)

# Needs re.IGNORECASE, and can have tag attributes, such as
# <BR CLEAR="ALL">
line_break_re = re.compile('<br\/?>', re.UNICODE)

# what about HTML entities defined using hex syntax, such as &#xxxx;
amp_re = re.compile('\&(?![a-z]+?\;)', re.UNICODE | re.IGNORECASE)

How would you extract data from a table? For instance, how would you
extract the data entries from the table at this URL:

http://tf.nist.gov/tf-cgi/servers.cgi ?

This would be a good example snippet for your module documentation.

Try extracting all of the <a href=...>sldjlsfjd</a> links from
yahoo.com, and see how much of what you expect actually gets matched.

Good luck!

-- Paul
--
http://mail.python.org/mailman/listinfo/python-list
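
To make the comments on the three re's concrete, here is a rough sketch of how they might look with raw string literals and the cases mentioned above covered (single-quoted and unquoted attribute values, <BR CLEAR="ALL">, decimal and hex entities). The names ATTR_RE, LINE_BREAK_RE, BARE_AMP_RE and parse_attrs are made up for illustration and are not part of Filip's library. This is one possible set of patterns, not a complete HTML tokenizer, and real-world markup will still find ways around it.

```python
import re

# Attribute values may be double-quoted, single-quoted, or unquoted.
# Raw (triple-quoted) string, so no backslash or quote needs doubling up.
ATTR_RE = re.compile(
    r"""([\da-z]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))""",
    re.IGNORECASE,
)

# <br>, <BR/>, <br /> and <BR CLEAR="ALL"> all count as line breaks;
# the \b keeps tags such as <bronze> from matching.
LINE_BREAK_RE = re.compile(r"<br\b[^>]*>", re.IGNORECASE)

# A bare '&' is one that does not start a named (&amp;), decimal (&#169;)
# or hex (&#xA9;) entity.
BARE_AMP_RE = re.compile(
    r"&(?!(?:[a-z][a-z0-9]*|#\d+|#x[0-9a-f]+);)",
    re.IGNORECASE,
)


def parse_attrs(tag_text):
    """Return a dict of lowercased attribute name -> value for one tag."""
    attrs = {}
    for m in ATTR_RE.finditer(tag_text):
        # Exactly one of the three value groups matched; pick that one.
        value = next(g for g in m.groups()[1:] if g is not None)
        attrs[m.group(1).lower()] = value
    return attrs


if __name__ == "__main__":
    print(parse_attrs("<a HREF=\"x.html\" class='nav' target=_blank>"))
    print(LINE_BREAK_RE.sub("\n", 'a<BR CLEAR="ALL">b<br/>c'))
    print(BARE_AMP_RE.sub("&amp;", "AT&T &amp; &#169; &#xA9;"))
```

Even with these changes, things like comments, <script> bodies, or a '>' inside a quoted attribute value will trip up patterns of this kind, which is the reason for the caution at the top of this reply.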