Tony Nelson wrote:
> I'd like to have a fast way to validate large amounts of string data as
> being UTF-8.
>
> I don't see a fast way to do it in Python, though:
>
>     unicode(s,'utf-8').encode('utf-8')
>
> seems to notice at least some of the time (the unicode() part works but
> the encode() part bombs). I don't consider a RE based solution to be
> fast. GLib provides a routine to do this, and I am using GTK so it's
> included in there somewhere, but I don't see a way to call GLib
> routines. I don't want to write another extension module.
I somehow doubt that the encode bombs. Can you provide some more details, maybe some of the allegedly not-working strings?

Besides that, the round trip is unnecessary: unicode(s, "utf-8") alone should be sufficient. If there are any undecodable byte sequences in there, that call will find them by raising a UnicodeDecodeError.

Regards,

Diez
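A minimal sketch of the decode-based check Diez describes, assuming Python 2 as used in the thread (on Python 3 the equivalent would be calling .decode('utf-8') on a bytes object); the is_valid_utf8 helper name is illustrative, not from the thread:

    def is_valid_utf8(s):
        """Return True if the byte string s is well-formed UTF-8."""
        try:
            unicode(s, 'utf-8')   # decoding alone detects bad sequences
        except UnicodeDecodeError:
            return False
        return True

    print is_valid_utf8('ascii is fine')   # True: ASCII is valid UTF-8
    print is_valid_utf8('\xc3\xa9')        # True: valid two-byte sequence
    print is_valid_utf8('\xff')            # False: 0xff never occurs in UTF-8

Since CPython's UTF-8 codec is implemented in C, this should also be considerably faster than any RE-based check.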