On 19/08/12 15:25, Steven D'Aprano wrote:
Not necessarily. Presumably you're scanning each page into a single
string. Then only the pages containing a supplementary plane char will be
bloated, which is likely to be rare. Especially since I don't expect your
OCR application would recognise many non-BMP characters -- what does
U+110F3, "SORA SOMPENG DIGIT THREE", look like? If the OCR software
doesn't recognise it, you can't get it in your output. (If you do, the
OCR software has a nasty bug.)
Anyway, in my ignorant opinion the proper fix here is to tell the OCR
software not to bother trying to recognise Imperial Aramaic, Domino
Tiles, Phaistos Disc symbols, or Egyptian Hieroglyphs if you aren't
expecting them in your source material. Not only will the scanning go
faster, but you'll get fewer wrong characters.
Consider the automated recognition of a CAPTCHA. As the chars have to be
entered by the user on a keyboard, only the most basic charset can be
used, so the problem of which chars are possible is quite limited.
--
http://mail.python.org/mailman/listinfo/python-list