On Nov 16, 2006, at 8:12 AM, Fredrik Lundh wrote:

> Chas Emerick wrote:
>
>> The principle and the practice diverge significantly in our neck of
>> the woods. The current project involves consuming and making sense
>> of extraordinarily (and typically unnecessarily) complex XHTML.
>
> wasn't your original complaint that ET didn't do the "right thing"
> when you removed elements from a mixed-content tree? (something that
> can be trivially handled with a 2-line helper function)
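
For readers following the thread: the helper alluded to in the quoted text is not shown in this message. A minimal sketch of what such a helper might look like is given here, assuming the standard-library `xml.etree.ElementTree` module. The function name `remove_preserving_tail` is hypothetical and is not taken from the thread. The idea is that in ElementTree, the text following an element is stored on that element as `tail`, so it has to be moved to the previous sibling (or to the parent's `text`) before the element is removed.

```python
import xml.etree.ElementTree as ET


def remove_preserving_tail(parent, child):
    """Remove child from parent without losing the text that follows it.

    Hypothetical helper for illustration; not the helper from the thread.
    """
    index = list(parent).index(child)
    if child.tail:
        if index > 0:
            # Append the tail to the previous sibling's tail.
            previous = parent[index - 1]
            previous.tail = (previous.tail or "") + child.tail
        else:
            # No previous sibling: the tail joins the parent's leading text.
            parent.text = (parent.text or "") + child.tail
    parent.remove(child)


if __name__ == "__main__":
    root = ET.fromstring("<p>Hello <b>bold</b> world <i>x</i>!</p>")
    remove_preserving_tail(root, root.find("b"))
    print(ET.tostring(root, encoding="unicode"))
```

A plain `parent.remove(child)` would drop " world " along with the `<b>` element, which is the mixed-content behaviour being discussed.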

Yes, that was the initial issue, but the delta between Elements and
DOM-style elements leads to other issues. There's no doubt that the
needed helpers are simple, but all things being equal, not having to
carry them around anywhere we're doing DOM manipulations is a big plus.

> why mutate the tree if all you want is to extract information from it?
> doesn't sound very efficient to me...

Because we're far from doing anything that is regular or one-off in
nature. We're systematizing the extraction of data from functionally
unstructured content, and it's flatly necessary to normalize the XHTML
into something that can be easily consumed by the processes we've built
that can do that content->data extraction/conversion from plain text,
XML, PDF, and now XHTML.

Remember, corner cases. :-)

- Chas