Re: modifying a codec

Martin v. Löwis Wed, 05 Nov 2008 23:30:34 -0800

> The codec is doing its job, but I want to override the codepoint for this 
> character (plus others) to use the html entity instead (from \227  to 
> &mdash; in this case).
> 
> I see hints about writing your own codec and updating the decoding_map, but I 
> could use some more detail.
> 
> Is that the best way to solve the problem?


I would say so, yes. Look at the source code of cp1252, and it should be
fairly obvious how a charmap codec works. Make a copy of it, and remove
the EM DASH line. This will give you a codec that just won't encode the
character at all anymore.
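
As a rough sketch of that step (assuming Python 3, and building the
mapping in memory instead of copying the cp1252 source file), something
along these lines should behave like cp1252 minus the EM DASH. The names
ENCODING_MAP and encode_without_em_dash are made up for illustration:

```python
import codecs
from encodings import cp1252

EM_DASH = "\u2014"

# Derive an encoding map from cp1252's decoding table, skipping the
# undefined slots (marked U+FFFE in the table) and the EM DASH entry.
ENCODING_MAP = {
    ord(char): byte
    for byte, char in enumerate(cp1252.decoding_table)
    if char not in ("\ufffe", EM_DASH)
}


def encode_without_em_dash(text, errors="strict"):
    """Encode like cp1252, except that EM DASH is treated as unencodable."""
    data, _consumed = codecs.charmap_encode(text, errors, ENCODING_MAP)
    return data
```

With errors="strict", any text containing an EM DASH should now raise
UnicodeEncodeError, while all other cp1252 characters encode as before.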

Then write an error handler that returns u"&mdash;" for the EM DASH
(u"\u2014", which cp1252 encodes as \227), but otherwise continues to
raise errors. See PEP 293 for code examples of error handlers.
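
A minimal sketch of such a handler, assuming Python 3. The handler name
"mdash_entity" is hypothetical, and the ascii codec is used below only
as a stand-in for a codec that cannot encode the EM DASH:

```python
import codecs

EM_DASH = "\u2014"


def mdash_entity(exc):
    """Substitute &mdash; for an unencodable EM DASH; re-raise otherwise."""
    if isinstance(exc, UnicodeEncodeError):
        if exc.object[exc.start] == EM_DASH:
            # Replace this one character and resume right after it.
            return ("&mdash;", exc.start + 1)
    raise exc


# Make the handler available under a name usable as the errors= argument.
codecs.register_error("mdash_entity", mdash_entity)

# Stand-in usage: ascii cannot encode the EM DASH, so the handler is called.
encoded = "a\u2014b".encode("ascii", "mdash_entity")
```

Any other unencodable character still raises UnicodeEncodeError, which
matches the "otherwise continues to raise errors" behaviour described
above.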

Notice that this approach only works for encoding; for decoding, your
scheme can't work, because you would need to specify how &mdash;
occurring in the input should get decoded: as u"&mdash;" or as
u"\u2014"? Most likely, decoding that output is of no concern to you,
in which case the approach with the error handler is the best way (IMO).

Regards,
Martin
--
http://mail.python.org/mailman/listinfo/python-list
