Re: [sword-devel] usfm2osis.py

Chris Little Sat, 04 Aug 2012 15:57:03 -0700

On 08/04/2012 10:22 AM, David Haslam wrote:

Wow!


What Peter means is that after all the ASCII stuff (up to the tilde), these
are also counted:

0E0030  󠀰       14      TAG DIGIT ZERO
0E0031  󠀱       11      TAG DIGIT ONE
0E0032  󠀲       10      TAG DIGIT TWO
0E0033  󠀳       7       TAG DIGIT THREE
0E0034  󠀴       6       TAG DIGIT FOUR
0E0035  󠀵       5       TAG DIGIT FIVE
0E0042  󠁂       18      TAG LATIN CAPITAL LETTER B
0E0043  󠁃       11      TAG LATIN CAPITAL LETTER C
0E0044  󠁄       16      TAG LATIN CAPITAL LETTER D
0E0046  󠁆       28      TAG LATIN CAPITAL LETTER F
0E0056  󠁖       7       TAG LATIN CAPITAL LETTER V
0E0070  󠁰       21      TAG LATIN SMALL LETTER P


David


Yes, these are intended and fall under the following line of the guidelines:

Use & abuse Unicode tags (http://unicode.org/charts/PDF/UE0000.pdf) to simplify Regex processing

They are inserted at various division boundaries to simplify regexes. So the B-tag marks book boundaries. C is for chapter, D is for div, F is for footnote, V is for verse, and p needs to be capitalized but represents paragraphs. The digit tags represent section levels, IIRC.
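
To illustrate why the markers help, here is a minimal sketch. It is not the code from usfm2osis.py: the helper name mark_boundaries, the VERSE pattern, and the sample text are all made up for this example. Once every boundary carries a sentinel that cannot occur in normal text, "everything up to the next boundary" becomes a plain negated character class instead of a lookahead.

```python
import re

# Characters from the Unicode Tags block (U+E0000..U+E007F).
TAG_C = "\U000E0043"  # TAG LATIN CAPITAL LETTER C, chapter boundary
TAG_V = "\U000E0056"  # TAG LATIN CAPITAL LETTER V, verse boundary

def mark_boundaries(text):
    """Hypothetical helper: prefix each USFM \\c and \\v marker with its tag."""
    text = re.sub(r"\\c\s", TAG_C + r"\g<0>", text)
    text = re.sub(r"\\v\s", TAG_V + r"\g<0>", text)
    return text

# With the sentinels in place, a verse body is simply "anything that is
# not a boundary tag"; no lookahead for the next \v or \c is needed.
VERSE = re.compile(r"\\v\s+(\d+)\s+([^%s%s]*)" % (TAG_V, TAG_C))

sample = r"\c 1 \v 1 In the beginning \v 2 And the earth"
marked = mark_boundaries(sample)
verses = [(num, body.strip()) for num, body in VERSE.findall(marked)]
print(verses)
```

The same idea would extend to the B, D, F and p tags by adding them to the negated class.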

Unfortunately, no one includes these in fonts, much less keyboards, so they're a pain to work with, but they simplify regexes so drastically that they're worth it. And I consider the probability that anyone would use them in USFM so slim that I'm willing to risk the possibility of false positives in my regex matching.
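
If someone did want to rule out the false-positive case, a pre-check along these lines would do it. This is only a sketch: find_tag_characters is a hypothetical name, and nothing here claims the script performs such a check.

```python
import re

# Matches any character in the Unicode Tags block (U+E0000..U+E007F).
TAG_BLOCK = re.compile("[\U000E0000-\U000E007F]")

def find_tag_characters(text):
    """Return (offset, code point) pairs for tag characters found in text."""
    return [(m.start(), "%06X" % ord(m.group(0)))
            for m in TAG_BLOCK.finditer(text)]

# Plain USFM input should yield an empty list before any markers are inserted.
print(find_tag_characters(r"\v 1 In the beginning"))
```

Running it on the raw USFM before conversion would flag any file that already contains these characters.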


--Chris


_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
