James G. sack (jim) added the comment:

More discussion of utf_8.py decoding behavior (and possible change):

For my needs, I would like the decoding parts of the utf_8 module to treat an initial BOM as an optional signature and skip it if there is one (just like the utf_8_sig decoder). In fact I have a working patch that replaces the utf_8 decode, IncrementalDecoder and StreamReader components with direct transplants from utf_8_sig (as recently repaired; there was a StreamReader error).

However, the reason for discussion is to ask how it might impact existing code. I can imagine there might be utf_8 client code out there which expects to see a leading U+FEFF as (perhaps) a clue that the output should be returned with a BOM signature, say, to accommodate the guessed input requirements of the remote correspondent. Making my work easier might actually make someone else's work (probably, annoyingly) harder.

So what to do? I can just live with code like

    if input[0] == u"\ufeff":
        input = input[1:]

spread around, and of course slightly different for incremental and stream inputs. But I probably wouldn't. I would probably substitute a "my_utf_8" encoding to make my code a little cleaner.

Another thought I had would require "the other guy" to update his code, but at least it wouldn't make his work annoyingly difficult like my original change might have. Here's the basic outline:

- Add another decoder function that returns a 3-tuple

      decode3(input, errors='strict') => (data, consumed, had_bom)

  where had_bom is true if a leading BOM was seen and skipped.

- Then the usual decode is just something like

      def decode(input, errors='strict'):
          return decode3(input, errors)[:2]

- Add a member variable and accessor to both the IncrementalDecoder and StreamReader classes, something like

      def had_bom(self):
          return self._had_bom

  and initialize/set the self._had_bom variable as required. (The attribute needs a name different from the accessor, or it would shadow the method.)

This complicates the interface somewhat and requires some additional documentation.
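
To make the outline above concrete, here is a minimal sketch of what decode3 and a BOM-aware incremental decoder might look like. It is written against Python 3's codecs module rather than the 2.x code under discussion, and it is not the patch itself: decode3, BomAwareIncrementalDecoder and the _had_bom attribute are hypothetical names used only for illustration. The buffering logic follows the same idea as the utf_8_sig incremental decoder (wait until enough bytes have arrived to decide whether the input starts with a BOM).

```python
import codecs

BOM = codecs.BOM_UTF8  # b'\xef\xbb\xbf'


def decode3(input, errors='strict'):
    """Hypothetical helper: decode UTF-8 and report whether a leading BOM was skipped."""
    data = bytes(input)
    had_bom = data.startswith(BOM)
    if had_bom:
        data = data[len(BOM):]
    text, consumed = codecs.utf_8_decode(data, errors, True)
    if had_bom:
        # Count the skipped BOM bytes as consumed input.
        consumed += len(BOM)
    return text, consumed, had_bom


def decode(input, errors='strict'):
    """The usual 2-tuple interface, built on top of decode3."""
    return decode3(input, errors)[:2]


class BomAwareIncrementalDecoder(codecs.BufferedIncrementalDecoder):
    """Hypothetical incremental decoder that skips a leading BOM and remembers it."""

    def __init__(self, errors='strict'):
        super().__init__(errors)
        self._first = True
        self._had_bom = False

    def had_bom(self):
        return self._had_bom

    def _buffer_decode(self, input, errors, final):
        if self._first:
            if len(input) < len(BOM) and BOM.startswith(input) and not final:
                # Too few bytes to tell whether this is a BOM; keep them buffered.
                return "", 0
            self._first = False
            if input[:len(BOM)] == BOM:
                self._had_bom = True
                text, consumed = codecs.utf_8_decode(input[len(BOM):], errors, final)
                return text, consumed + len(BOM)
        return codecs.utf_8_decode(input, errors, final)

    def reset(self):
        super().reset()
        self._first = True
        self._had_bom = False
```

A caller that wants to echo the signature back to its correspondent could check had_bom() after decoding and choose between the utf_8 and utf_8_sig encoders accordingly. A StreamReader variant would need the same flag and is left out here.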

To document my original simple[-minded] idea would require possibly only a few more words in the existing paragraph on utf_8_sig, mentioning that both modules have the same decoding behavior but different encoding behavior.

I thought of a secondary consideration: if utf_8 and utf_8_sig are "almost the same", it's possible that future refactoring might unify them, with the differences contained in behavior flags (e.g., skip_leading_bom). The leading-BOM processing might even be pushed into codecs.utf_8_decode for possible minor advantages.

Is there anybody monitoring this who has an opinion on this?

..jim

----------
versions: +Python 2.6

__________________________________
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1328>
__________________________________