Steve Bergman wrote:
> Fredrik Lundh wrote:
>
>> ("sanitizing" HTML data by running filters over encoded 8-bit data is
>> hardly ever the right thing to do...)
>
> I'm very much open to suggestions as to the right way to do this. I'm
> working on this primarily as a learning project and security is my
> motivation for wanting to strip the unprintables.
>
> Is there a better way? (This is a mod_python app, just for reference.)

Deal with encodings properly. Characters being "unprintable" means that you have an encoding mismatch: your output device (usually a terminal, but a browser is a sort of device too) can't make sense of certain byte values and pukes on you. But these bytes come from somewhere and aren't "random". So I suggest you read up on Unicode and encodings, in the context of Python, of course :)

BTW: if that HTML were XHTML, it wouldn't be valid unless the contents matched the encoding specified in the header. That doesn't mean the two never mismatch in practice, usually because of misunderstandings on the programmer's side.

Diez
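
P.S. To make "deal with encodings properly" a bit more concrete, here is a minimal sketch of decoding first and filtering afterwards. It assumes Python 3 and that you know (or have been told by the client) which encoding the incoming bytes use. The names clean_text and declared_encoding are made up for illustration; they are not part of mod_python or any library, and this is not a complete security filter.

```python
import html
import unicodedata


def clean_text(raw: bytes, declared_encoding: str = "utf-8") -> str:
    """Decode bytes to text, drop control characters, escape for HTML.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Decode once, at the boundary. A mismatch between the data and the
    # declared encoding raises UnicodeDecodeError instead of silently
    # producing "unprintable" garbage later on.
    text = raw.decode(declared_encoding)

    # Filter on decoded characters, not on encoded bytes. Here we drop
    # Unicode control characters (category "Cc") but keep tab, LF and CR.
    text = "".join(
        ch for ch in text
        if ch in "\t\n\r" or unicodedata.category(ch) != "Cc"
    )

    # Escape markup-significant characters before putting it into a page.
    return html.escape(text)


if __name__ == "__main__":
    data = "café <b>\x07".encode("latin-1")
    print(clean_text(data, "latin-1"))
```

The point of the ordering is that the filter sees characters rather than bytes, so a multi-byte sequence is never cut in half, and a wrong encoding guess shows up as an exception at the point of input.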