On 08/04/2012 10:22 AM, David Haslam wrote:
Wow!

What Peter means is that after all the ASCII stuff (up to the tilde), these
are also counted:

0E0030  󠀰       14      TAG DIGIT ZERO
0E0031  󠀱       11      TAG DIGIT ONE
0E0032  󠀲       10      TAG DIGIT TWO
0E0033  󠀳       7       TAG DIGIT THREE
0E0034  󠀴       6       TAG DIGIT FOUR
0E0035  󠀵       5       TAG DIGIT FIVE
0E0042  󠁂       18      TAG LATIN CAPITAL LETTER B
0E0043  󠁃       11      TAG LATIN CAPITAL LETTER C
0E0044  󠁄       16      TAG LATIN CAPITAL LETTER D
0E0046  󠁆       28      TAG LATIN CAPITAL LETTER F
0E0056  󠁖       7       TAG LATIN CAPITAL LETTER V
0E0070  󠁰       21      TAG LATIN SMALL LETTER P


David

Yes, these are intended and fall under the following line of the guidelines:

Use & abuse Unicode tags (http://unicode.org/charts/PDF/UE0000.pdf) to simplify Regex processing

They are inserted at various division boundaries to simplify regexes. The B tag marks book boundaries, C is for chapter, D is for div, F is for footnote, V is for verse, and p (which still needs to be capitalized) represents paragraphs. The digit tags represent section levels, IIRC.
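As a rough illustration of why this helps (a sketch only, not the actual converter code): each tag character sits at U+E0000 plus the code point of the ASCII character it mirrors, so a boundary becomes a single code point that a regex can split or match on. The helper `tag()`, the constants, and the sample text below are made up for the example.

```python
import re

def tag(ch: str) -> str:
    """Return the Unicode tag character mirroring an ASCII character.

    Tag characters occupy U+E0000..U+E007F at the same offsets as ASCII,
    e.g. 'V' (U+0056) maps to TAG LATIN CAPITAL LETTER V (U+E0056).
    """
    return chr(0xE0000 + ord(ch))

CHAPTER = tag("C")  # hypothetical name for the chapter boundary marker
VERSE = tag("V")    # hypothetical name for the verse boundary marker

# Hypothetical intermediate text with markers inserted at boundaries.
text = (
    f"{CHAPTER}1{VERSE}1 In the beginning{VERSE}2 And the earth"
    f"{CHAPTER}2{VERSE}1 Thus the heavens"
)

# Splitting into chapters needs no lookahead: the marker is one code point.
chapters = [c for c in text.split(CHAPTER) if c]

# A verse is "everything up to the next marker", expressed as a simple
# negated character class instead of a pattern over multi-character markup.
verses = re.findall(f"{VERSE}(\\d+) ([^{VERSE}{CHAPTER}]*)", chapters[0])

print(len(chapters))
print(verses)
```

The point of the design is the negated character class: with ordinary multi-character markers, "up to the next marker" usually needs lazy quantifiers or lookaheads.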

Unfortunately, no one includes these in fonts, much less keyboards, so they're a pain to work with, but they simplify regexes so drastically that they're worth it. And I consider the probability that anyone would use them in USFM so slim that I'm willing to risk the possibility of false positives in my regex matching.
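If someone did want to rule out those false positives, one option (an assumption on my part, not something the script is stated to do) would be to scan the USFM input for any character in the Tags block before inserting markers. The function name `find_existing_tags` is hypothetical.

```python
import re

# Any code point in the Unicode Tags block, U+E0000..U+E007F.
TAG_CHARS = re.compile("[\U000E0000-\U000E007F]")

def find_existing_tags(usfm_text: str) -> list[tuple[int, str]]:
    """Return (offset, code point label) for each tag character already present."""
    return [
        (m.start(), f"U+{ord(m.group()):05X}")
        for m in TAG_CHARS.finditer(usfm_text)
    ]

# Ordinary USFM contains none, so the list is empty.
print(find_existing_tags("\\v 1 In the beginning"))

# A stray TAG LATIN CAPITAL LETTER V would be reported with its offset.
print(find_existing_tags("ab\U000E0056c"))
```

An empty result means the markers inserted later cannot collide with anything in the source.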

--Chris


_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
