On 08/04/2012 10:22 AM, David Haslam wrote:
Wow!
What Peter means is that after all the ASCII stuff (up to the tilde), these
are also counted:
0E0030 14 TAG DIGIT ZERO
0E0031 11 TAG DIGIT ONE
0E0032 10 TAG DIGIT TWO
0E0033 7 TAG DIGIT THREE
0E0034 6 TAG DIGIT FOUR
0E0035 5 TAG DIGIT FIVE
0E0042 18 TAG LATIN CAPITAL LETTER B
0E0043 11 TAG LATIN CAPITAL LETTER C
0E0044 16 TAG LATIN CAPITAL LETTER D
0E0046 28 TAG LATIN CAPITAL LETTER F
0E0056 7 TAG LATIN CAPITAL LETTER V
0E0070 21 TAG LATIN SMALL LETTER P
David
Yes, these are intended and fall under the following line of the guidelines:
Use & abuse Unicode tags (http://unicode.org/charts/PDF/UE0000.pdf) to
simplify Regex processing
They are inserted at various division boundaries to simplify regexes. The
B tag marks book boundaries, C is for chapter, D is for div, F is for
footnote, and V is for verse. The p tag represents paragraphs (it still
needs to be capitalized like the others). The digit tags represent
section levels, IIRC.
Unfortunately, no one includes these in fonts, much less keyboards, so
they're a pain to work with, but they simplify regexes so drastically
that they're worth it. And I consider the probability that anyone would
use them in USFM so slim that I'm willing to risk the possibility of
false positives in my regex matching.
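
To illustrate why the tags help, here is a minimal Python sketch. It is not
the converter's actual code: the sample text layout and the helper names
(verses, chapters, strip_tags) are made up for the example. The only thing
taken from the above is the idea that a B/C/V tag character sits at each
book/chapter/verse boundary. Since each marker is a single character that
never occurs in normal text, "everything up to the next boundary" becomes a
plain negated character class.

```python
import re

# Characters from the Unicode Tags block (U+E0000..U+E007F).
TAG_B = "\U000E0042"  # TAG LATIN CAPITAL LETTER B: book boundary
TAG_C = "\U000E0043"  # TAG LATIN CAPITAL LETTER C: chapter boundary
TAG_V = "\U000E0056"  # TAG LATIN CAPITAL LETTER V: verse boundary

# Hypothetical intermediate text with a tag inserted at each boundary.
sample = (
    TAG_B + "GEN"
    + TAG_C + "1"
    + TAG_V + "1 In the beginning"
    + TAG_V + "2 And the earth"
    + TAG_C + "2"
    + TAG_V + "1 Thus the heavens"
)

# A verse runs from a V tag up to the next V, C, or B tag.
VERSE_RE = re.compile(TAG_V + "([^" + TAG_B + TAG_C + TAG_V + "]*)")
# A chapter runs from a C tag up to the next C or B tag (verses included).
CHAPTER_RE = re.compile(TAG_C + "([^" + TAG_B + TAG_C + "]*)")
# Matches any character in the Tags block, for cleanup at the end.
ANY_TAG_RE = re.compile("[\U000E0000-\U000E007F]")


def verses(text):
    """Return the text of every verse, without the leading V tag."""
    return VERSE_RE.findall(text)


def chapters(text):
    """Return the text of every chapter, without the leading C tag."""
    return CHAPTER_RE.findall(text)


def strip_tags(text):
    """Remove all tag characters once processing is finished."""
    return ANY_TAG_RE.sub("", text)
```

Without the markers, the same patterns would need lookaheads for every
possible USFM marker that can end a verse or chapter. The tag characters
have to be stripped again (as in strip_tags) before the final output is
written.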
--Chris
_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org