Terry J. Reedy <tjre...@udel.edu> added the comment:

Looking at cjkencodings.py the format is pretty clear. The file consists of one 
statement that creates one dict that maps encoding names to a pair of (encoded) 
byte strings. The bytes literals are entirely hex escapes, with a maximum of 16 
per chunk (line). From the usage you deduced that the first is encoded with the 
named encoding and the second with utf-8. (For anyone wondering, a separate 
utf-8 string is needed for each encoding because each of the other encodings 
is limited to a different subset of unicode chars.)
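
To make the layout concrete, here is a minimal sketch of that structure as I read it. The names `samples`, `hex_chunks` and `check_pair` are hypothetical, the two-character sample is made up for illustration, and this is not the original formatting program.

```python
# Hypothetical two-character sample; the real test strings are much longer.
sample = "\u4e2d\u6587"

# One dict mapping an encoding name to a pair of byte strings:
# (text encoded with the named encoding, the same text encoded with utf-8).
samples = {
    "gbk": (sample.encode("gbk"), sample.encode("utf-8")),
}


def hex_chunks(data, per_line=16):
    """Return bytes-literal source lines with at most per_line hex escapes each."""
    lines = []
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        escapes = "".join("\\x%02x" % byte for byte in chunk)
        lines.append("b'%s'" % escapes)
    return lines


def check_pair(name, pair):
    """True if both members of the pair decode to the same text."""
    native, utf8 = pair
    return native.decode(name) == utf8.decode("utf-8")


for name, pair in samples.items():
    print(name, check_pair(name, pair))
    for line in hex_chunks(pair[0]):
        print("   ", line)
```

Adding an entry by hand would amount to producing two such chunked literals for the new encoding.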

So I am not convinced that pulling the file apart is a clear win. 
Another entry could be added (the file is formatted with that possibility in 
mind), but it would certainly be much easier if the original formatting program 
were available. I do have a couple of questions.

1. Did one of us create the test strings (if so, how), or do they come from an 
authoritative source (like the unicode site) that created and checked them with 
its reference implementations? If so, the missing pair *is* a puzzle. Anyway, 
if so, is there any possibility that we would need to get new test strings from 
that source? Or are the limitations of these codings definitely fixed?

2. If you create a test file for the hz codec with the hz codec, how do we know 
it is correct? It would only serve to detect changes in the future.
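
To illustrate the concern in question 2, here is a small sketch, assuming a made-up two-character sample and a hypothetical helper `round_trips`. A check built only from the codec's own output passes by construction, so it says nothing about whether the encoded bytes are right.

```python
def round_trips(text, codec="hz"):
    """Encode and decode with the same codec and compare with the input.

    This can only catch a later change in behavior. If the encoder and
    decoder shared a consistent bug, the check would still pass.
    """
    return text.encode(codec).decode(codec) == text


# Hypothetical sample text, not taken from the existing test data.
sample = "\u4e2d\u6587"
print(round_trips(sample))
```

Expected bytes obtained independently of our codec would be needed to turn this into a correctness test.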

----------
components: +Tests

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12057>
_______________________________________