On 07/02/2008, Jon Ribbens <[EMAIL PROTECTED]> wrote:
> On Thu, Feb 07, 2008 at 05:50:37PM +0000, Michael Sparks wrote:
> > > The code at
> > > http://www.voidspace.org.uk/python/weblog/arch_d7_2005_04_23.shtml#e35
> > > is wrong, for example.
> >
> > That's because it whitelists a collection of tags but doesn't whitelist
> > specific attributes, I presume.
>
> That's certainly a big problem, yes. There are other issues, but more
> importantly from my point of view, it works in completely the wrong
> way ;-) It uses a lax HTML parser to try to work out what's going on
> with the input, and then strips any 'bad data' that it recognises.
> This will fall apart if the HTML is mangled in such a way that the
> 'tag stripper' parser doesn't understand it but a web browser will.
> Given all the different versions of all the different browsers out
> there, this approach is doomed to failure.
>
> The correct way to do it would be to strip everything *except* that
> which is 100% recognised to be allowable, i.e. never allow a '<'
> or '&' character through (or any other character, for that matter)
> unless we know precisely what its effect is and that it complies with
> the HTML spec.
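For what it's worth, here is a minimal sketch of the whitelist approach Jon describes, using the standard library's html.parser: all text is re-escaped by default, and only explicitly allowed tags and attributes are re-emitted. The ALLOWED table and the href check are illustrative assumptions, not a vetted policy.

```python
import html
from html.parser import HTMLParser

# Illustrative whitelist: tag -> set of allowed attributes.
ALLOWED = {
    "b": set(),
    "i": set(),
    "p": set(),
    "a": {"href"},
}

class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # unknown tag: drop it entirely
        safe = []
        for name, value in attrs:
            if name in ALLOWED[tag] and value is not None:
                # only pass http(s) URLs through href, never javascript:
                if name == "href" and not value.startswith(("http://", "https://")):
                    continue
                safe.append(' %s="%s"' % (name, html.escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(safe)))

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # everything else is re-escaped, so no raw '<' or '&' survives
        self.out.append(html.escape(data))

def sanitize(markup):
    s = WhitelistSanitizer()
    s.feed(markup)
    s.close()
    return "".join(s.out)
```

So sanitize('<b onclick="evil()">hi</b>') comes out as '<b>hi</b>', and a javascript: href is dropped from the anchor. It's only a sketch, but it errs in the safe direction: anything the parser doesn't positively recognise is escaped or dropped rather than passed through.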
Hi,

I have used Beautiful Soup for parsing html. It works very nicely, and I
didn't see much of an issue with speed when parsing several hundred html
files every hour or so.

I also rolled my own using various regexes and stuff nicked from a Perl
lib. It was awful and feature-incomplete. Beautiful Soup worked better.

Shaun Laughey.

_______________________________________________
python-uk mailing list
python-uk@python.org
http://mail.python.org/mailman/listinfo/python-uk