> The codec is doing its job, but I want to override the codepoint for this
> character (plus others) to use the html entity instead (from \227 to
> &mdash; in this case).
>
> I see hints writing your own codec and updating the decoding_map, but I
> could use some more detail.
>
> Is that the best way to solve the problem?

I would say so, yes. Look at the source code of cp1252, and it should be
fairly obvious how a charmap codec works. Make a copy of it, and remove
the EM DASH line. This will give you a codec that won't encode that
character at all anymore.

Then write an error handler that returns u"&mdash;" for the EM DASH
(u"\u2014", which cp1252 encodes as \227), but otherwise continues to
raise errors. See PEP 293 for code examples of error handlers.

Notice that this approach only works for encoding; for decoding, your
scheme can't work, because you would need to specify how &mdash;
occurring in the input should get decoded: as u"&mdash;" or as
u"\u2014"?

Most likely, decoding that output is of no concern to you, in which case
the approach with the error handler is the best way (IMO).

Regards,
Martin
--
http://mail.python.org/mailman/listinfo/python-list
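
A minimal sketch of the approach described above, written for Python 3 (so plain string literals rather than the u"..." form used in the post). Instead of copying cp1252.py to a new module, it derives the encoding map from encodings.cp1252.decoding_table and leaves out the EM DASH entry. The names EM_DASH, ENCODING_MAP, mdash_entity and encode_html are made up for illustration, and the choice of "&mdash;" as the replacement is an assumption about which entity is wanted.

```python
import codecs
from encodings import cp1252

EM_DASH = "\u2014"      # cp1252 encodes this as byte 0x97 (octal \227)
UNDEFINED = "\ufffe"    # marker cp1252.py uses for unassigned bytes

# Character ordinal -> byte value, copied from cp1252 but without EM DASH,
# so the charmap encoder treats EM DASH as unencodable.
ENCODING_MAP = {
    ord(char): byte
    for byte, char in enumerate(cp1252.decoding_table)
    if char not in (UNDEFINED, EM_DASH)
}


def mdash_entity(exc):
    """Error handler: replace EM DASH with an entity, re-raise anything else."""
    if isinstance(exc, UnicodeEncodeError) and exc.object[exc.start] == EM_DASH:
        # Replace only the first offending character and resume right after it.
        return ("&mdash;", exc.start + 1)
    raise exc


codecs.register_error("mdash_entity", mdash_entity)


def encode_html(text):
    """Encode text with the reduced charmap, using the handler above."""
    data, _consumed = codecs.charmap_encode(text, "mdash_entity", ENCODING_MAP)
    return data


if __name__ == "__main__":
    print(encode_html("before \u2014 after"))
```

Characters that cp1252 can represent are encoded as usual; anything else other than the EM DASH still raises UnicodeEncodeError. To handle the "plus others" case, the handler could look the offending character up in a small dict of entities instead of comparing against a single constant.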