On Jul 23, 3:53 am, Paul McGuire <pt...@austin.rr.com> wrote:
> # You should use raw string literals throughout, as in:
> # blah_re = re.compile(r'sljdflsflds')
> # (note the leading r before the string literal). raw string literals
> # really help keep your re expressions clean, so that you don't ever
> # have to double up any '\' characters.

Thanks, I didn't know about that. I've updated my code.

> # Attributes might be enclosed in single quotes, or not enclosed in
> any quotes at all.
> attr_re = re.compile('([\da-z]+?)\s*=\s*\"(.*?)\"', re.DOTALL |
> re.UNICODE | re.IGNORECASE)

Of course, you mean that an attribute's *value* can be enclosed in
single or double quotes? To be honest, I haven't seen the single-quote
variant in HTML lately, but I checked and it is in the specs, and it
can even be quite useful (you learn something every day). Thank you for
pointing that out; I updated the code accordingly (and just realized
that the condition-check REs need an update too :/).

As for unquoted values, I am not so sure I need to support them. It
would significantly obfuscate my REs, and the practice is rather
deprecated, considered unsafe, and I've seen it only on very old
websites.

> How would you extract data from a table? For instance, how would you
> extract the data entries from the table at this
> URL:http://tf.nist.gov/tf-cgi/servers.cgi? This would be a good example
> snippet for your module documentation.

This really seems like a nice example. I'll surely explain it in my
docs (examples are surely needed there ;)).

> Try extracting all of the <a href=...>sldjlsfjd</a> links from
> yahoo.com, and see how much of what you expect actually gets matched.

The library was used in my humble production environment, processing a
few hundred thousand+ pages and spitting out about 10000 SQL records,
so it does work quite well for a simple task like extracting all links.
However, I can't really say that the task introduced enough diversity
(there were only 9 different page templates) to call the library
'tested'...

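To make the two suggestions above concrete (raw string literals, and
attribute values in either quote style), here is a minimal sketch. It is
not the library's actual code: the name parse_attrs and the exact
pattern are assumptions made for illustration, and unquoted values are
deliberately left unsupported, as discussed above.

```python
import re

# A raw string passes backslashes to the re module untouched,
# so r'\d' is the same two characters as the doubled-up '\\d'.
assert r'\d' == '\\d'

# Hypothetical variant of attr_re (an assumption, not the library's code).
# The named group "quote" captures the opening quote character and the
# backreference (?P=quote) requires the same character to close the value.
attr_re = re.compile(
    r'''([\da-z]+?)\s*=\s*(?P<quote>["'])(.*?)(?P=quote)''',
    re.DOTALL | re.UNICODE | re.IGNORECASE,
)


def parse_attrs(tag_text):
    """Map lowercased attribute names to values (quoted values only)."""
    return {m.group(1).lower(): m.group(3) for m in attr_re.finditer(tag_text)}


print(parse_attrs('''<a href="index.html" title='it says "hi"'>'''))
```

Because the closing quote must match the opening one, a double quote
inside a single-quoted value (and vice versa) does not end the value
early.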
On Jul 26, 5:51 pm, John Machin <sjmac...@lexicon.net> wrote:
> On Jul 23, 11:53 am, Paul McGuire <pt...@austin.rr.com> wrote:
>
> > On Jul 22, 5:43 pm, Filip <pink...@gmail.com> wrote:
>
> > # Needs re.IGNORECASE, and can have tag attributes, such as <BR
> > CLEAR="ALL">
> > line_break_re = re.compile('<br\/?>', re.UNICODE)
>
> Just in case somebody actually uses valid XHTML :-) it might be a good
> idea to allow for <br />
>
> > # what about HTML entities defined using hex syntax, such as &#xxxx;
> > amp_re = re.compile('\&(?![a-z]+?\;)', re.UNICODE | re.IGNORECASE)
>
> What about the decimal syntax ones? E.g. not only &nbsp; and &#xa0;
> but also &#160;
>
> Also, entity names can contain digits e.g. &sup1; &frac34;

Thanks for pointing this out, I fixed that, although it has very little
impact on how the library performs its main task (I'd like to see some
comments on that ;)).

--
http://mail.python.org/mailman/listinfo/python-list
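
For reference, the line-break and ampersand remarks quoted above could
be covered by patterns along these lines. This is only a sketch under
assumptions: the patterns and the helper escape_bare_amps are
illustrative and are not taken from the library.

```python
import re

# Hypothetical patterns (assumptions, not the library's actual code).

# Matches <br>, <br/>, <br /> and <BR CLEAR="ALL">: case-insensitive,
# optional attributes, optional trailing slash. The \b keeps it from
# matching other tags whose names merely start with "br".
line_break_re = re.compile(r'<br\b[^>]*?/?>', re.UNICODE | re.IGNORECASE)

# Matches an '&' that does NOT start an entity reference. Recognized
# references: named (letters then letters/digits, e.g. &sup1;),
# decimal (&#160;) and hexadecimal (&#xa0;).
amp_re = re.compile(
    r'&(?!(?:[a-z][a-z\d]*|#\d+|#x[\da-f]+);)',
    re.UNICODE | re.IGNORECASE,
)


def escape_bare_amps(text):
    """Replace stray ampersands with &amp;, leaving entities alone."""
    return amp_re.sub('&amp;', text)


print(line_break_re.sub('\n', 'one<br>two<BR CLEAR="ALL">three<br />four'))
print(escape_bare_amps('AT&T &nbsp; &#160; &#xA0; &sup1; a & b'))
```

The negative lookahead only checks the shape of the reference, so a
made-up name such as &foo; is also treated as an entity and left
untouched.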