On Thu, Feb 07, 2008 at 05:50:37PM +0000, Michael Sparks wrote:
> > The code at
> > http://www.voidspace.org.uk/python/weblog/arch_d7_2005_04_23.shtml#e35
> > is wrong, for example.
>
> That's because it whitelists a collection of tags but doesn't whitelist
> specific attributes, I presume.

That's certainly a big problem, yes. There are other issues, but more
importantly, from my point of view, it works in completely the wrong
way ;-)

It uses a lax HTML parser to try to work out what's going on with the
input, and then strips any 'bad data' that it recognises. This will fall
apart if the HTML is mangled in such a way that the 'tag stripper' parser
doesn't understand it but a web browser does. Given all the different
versions of all the different browsers out there, this approach is doomed
to failure.

The correct way to do it would be to strip everything *except* that which
is 100% recognised to be allowable, i.e. never allow a '<' or '&'
character through (or any other character, for that matter) unless we
know precisely what its effect is and that it complies with the HTML
spec.
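
To make the "allow only what is 100% recognised" idea concrete, here is a
rough sketch in Python. It is not the code from the weblog entry, and the
names (ALLOWED_TAGS, sanitize) and the choice of tags are made up purely
for illustration. It escapes every character by default and passes through
only exact, attribute-free, lowercase tags from a small whitelist. It makes
no attempt to balance tags, so treat it as a starting point rather than a
finished sanitiser.

```python
import html
import re

# Hypothetical whitelist for illustration: bare tags only, no attributes.
ALLOWED_TAGS = {"b", "i", "em", "strong", "p"}

# Matches only an exact lowercase open or close tag such as <b> or </em>.
# Anything with attributes, whitespace or odd casing fails to match and
# is therefore escaped like any other text.
_TAG_RE = re.compile(r"<(/?)([a-z]+)>")


def sanitize(text):
    """Escape everything except exactly-recognised whitelisted tags."""
    out = []
    pos = 0
    for match in _TAG_RE.finditer(text):
        # Text between recognised tags is always escaped.
        out.append(html.escape(text[pos:match.start()], quote=True))
        if match.group(2) in ALLOWED_TAGS:
            # Known-good tag: pass through verbatim.
            out.append(match.group(0))
        else:
            # Well-formed but not whitelisted: escape it.
            out.append(html.escape(match.group(0), quote=True))
        pos = match.end()
    # Escape whatever follows the last recognised tag.
    out.append(html.escape(text[pos:], quote=True))
    return "".join(out)


if __name__ == "__main__":
    print(sanitize('<b>hi</b> <script>alert(1)</script> <b onclick="x">'))
```

The point of doing it this way round is that a tag the sanitiser fails to
understand comes out as inert text rather than slipping through to the
browser.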