On Feb 24, 2008, at 4:46 PM, Chris Little wrote:
DM Smith wrote:
I have added a -n flag to osis2mod.
I'm going to add it to the other major importers (osis2gbs & imp2*) just
as soon as I get things into a fairly stable state.
To be available, this flag requires osis2mod to be compiled with ICU
support.
-n stands for "normalize to NFC", the agreed-upon Unicode normalization
form for our UTF-8 text.
When should this flag be used?
1) When the input is UTF-8
and
2) It is not known to be NFC
First, I feel like there's really no reason NOT to perform
normalization, provided that the input is UTF-8. Even if the input is
already in NFC, it won't hurt anything to do it again. It will take
extra time to compile the module, but I feel like it's better to be safe
than sorry in this case.
I mostly agree. But once I know that the module is NFC, I'd rather not
take the hit. I must have made the KJV into a module 100 or more times
before I got it right.
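
Both points above (re-normalizing NFC text is harmless, and known-NFC text can skip the pass) can be illustrated with a minimal sketch. It uses Python's standard unicodedata module as a stand-in, not the ICU calls osis2mod uses; the function names to_nfc and needs_nfc are hypothetical, and is_normalized assumes Python 3.8 or later.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in NFC; input that is already NFC comes back unchanged."""
    return unicodedata.normalize("NFC", text)


def needs_nfc(text: str) -> bool:
    """Pre-check so that input known to be NFC can skip the normalization pass."""
    return not unicodedata.is_normalized("NFC", text)


# 'e' followed by COMBINING ACUTE ACCENT: valid Unicode, but not NFC.
decomposed = "e\u0301"
composed = to_nfc(decomposed)

# Normalizing a second time changes nothing, so running it on NFC input is safe.
assert to_nfc(composed) == composed
```

The pre-check only tells you whether the pass is needed; whether it is actually cheaper than normalizing unconditionally depends on the library and the text.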
Second, your comment about needing UTF-8 input makes me think we should
go ahead and add encoding conversion to the importers as well, possibly
with automatic charset detection.
I'd like to see OSIS modules also be UTF-8.
What mechanism were you thinking of for automatic charset detection? I
have a buggy routine to detect whether something is UTF-8, 7-bit ASCII,
or other. We could use that (once I fix it).
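
A three-way check of that kind might look roughly like the sketch below. This is not the routine mentioned above, just an illustration in Python; classify_encoding is a hypothetical name, and it leans on Python's strict UTF-8 decoder instead of hand-written byte inspection.

```python
def classify_encoding(data: bytes) -> str:
    """Classify raw bytes as 'ascii', 'utf-8', or 'other'."""
    # Pure 7-bit input is valid under ASCII, UTF-8, and cp1252 alike.
    if all(byte < 0x80 for byte in data):
        return "ascii"
    try:
        # The strict decoder rejects malformed or truncated sequences.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "other"
    return "utf-8"
```

Note that "other" only says the bytes are not valid UTF-8; it cannot tell cp1252 from any other 8-bit charset, which is why an explicit declaration in the input is attractive.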
As to automatic charset detection, could we require that every input
to osis2mod have:
<?xml version="1.0" encoding="UTF-8"?>
or
<?xml version="1.0" encoding="cp1252"?>
and use whatever is the value for the encoding attribute?
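
As a rough sketch of that idea (in Python, not osis2mod code), the importer could read the encoding pseudo-attribute from the XML declaration and decode with it. The name declared_encoding is hypothetical; the sketch assumes an ASCII-compatible encoding such as UTF-8 or cp1252, so UTF-16 input would need separate handling, and it falls back to UTF-8 when no declaration is present.

```python
import re

# Matches an XML declaration at the very start of the input and captures
# the value of its encoding pseudo-attribute.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data: bytes, default: str = "UTF-8") -> str:
    """Return the encoding named in the XML declaration, or the default."""
    if data.startswith(b"\xef\xbb\xbf"):  # skip a UTF-8 byte order mark
        data = data[3:]
    match = _XML_DECL.match(data)
    return match.group(1).decode("ascii") if match else default


raw = b'<?xml version="1.0" encoding="cp1252"?><p>caf\xe9</p>'
text = raw.decode(declared_encoding(raw))  # decoded as cp1252
```

Once decoded, the text can be re-encoded as UTF-8 and normalized, regardless of what the input file used.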
-- DM