J. Bagg wrote:
> I've checked the original files using od and they don't have BOMs.
>
> I'll remove them in the servlet. The overhead is probably small enough
> unless somebody is doing a massive search. We have a limit anyway to
> prevent somebody stealing the entire set of data.
>
> I started writing the Python search because the ancient C search had
> started putting out BOMs. I'm actually mystified because our home Linux
> box does not add BOMs even though it runs 2.7 but my work one does even
> though it has the same version. The only difference is Fedora 18 v
> Fedora 17.
>
> The BOMs are certainly there:
>
> <86> <AD><FB>%R 10C0203z-621
> %A François-Xavier Le_Bourdonnec
>
> 0000000 206 255 373 % R 1 0 C 0 2 0 3 z -
>
> J
Were these files edited with Notepad? According to
http://docs.python.org/2/library/codecs.html#encodings-and-unicode

"""
To increase the reliability with which a UTF-8 encoding can be detected,
Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig")
for its Notepad program: Before any of the Unicode characters is written
to the file, a UTF-8 encoded BOM (which looks like this as a byte
sequence: 0xef, 0xbb, 0xbf) is written.
"""

To strip off such a UTF-8 encoded BOM you can open the source file with
"utf-8-sig" and write the output to a (different!) file with "utf-8":

    import codecs
    import shutil

    # "utf-8-sig" drops a leading BOM on read; "utf-8" writes none.
    with codecs.open(source, "r", encoding="utf-8-sig") as instream:
        with codecs.open(dest, "w", encoding="utf-8") as outstream:
            shutil.copyfileobj(instream, outstream)
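If the data is already in memory as a byte string rather than in a file, the same check can be done directly against codecs.BOM_UTF8. The following is only a minimal sketch: the helper name strip_utf8_bom and the sample byte strings are made up for illustration, with the "reported" bytes taken from the od output quoted above.

```python
import codecs


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Leading bytes as shown in the quoted od output (octal 206 255 373).
reported = b"\x86\xad\xfb%R 10C0203z-621"

# The same record with a real UTF-8 BOM (0xef 0xbb 0xbf) in front.
with_bom = codecs.BOM_UTF8 + b"%R 10C0203z-621"

print(reported.startswith(codecs.BOM_UTF8))
print(strip_utf8_bom(with_bom) == b"%R 10C0203z-621")
```

Octal 206 255 373 is 0x86 0xad 0xfb, which is not the 0xef 0xbb 0xbf sequence described in the docs, so a check like this leaves the "reported" bytes untouched. It may be worth confirming what those three bytes actually are before relying on "utf-8-sig" to remove them.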