In article <[EMAIL PROTECTED]>, "Fredrik Lundh" <[EMAIL PROTECTED]> wrote:
> Tony Nelson wrote: > > > I'd like to have a fast way to validate large amounts of string data as > > being UTF-8. > > define "validate". All data conforms to the UTF-8 encoding format. I can stand if someone has made data that impersonates UTF-8 that isn't really Unicode. > > I don't see a fast way to do it in Python, though: > > > > unicode(s,'utf-8').encode('utf-8) > > if "validate" means "make sure the byte stream doesn't use invalid > sequences", a plain > > unicode(s, "utf-8") > > should be sufficient. You are correct. I misunderstood what was happening in my code. I apologise for wasting bandwidth and your time (and I wasted my own time as well). Indeed, unicode(s, 'utf-8') will catch the problem and is fast enough for my purpose, adding about 25% to the time to load a file. ________________________________________________________________________ TonyN.:' [EMAIL PROTECTED] ' <http://www.georgeanelson.com/> -- http://mail.python.org/mailman/listinfo/python-list