Interim progress report.

I downloaded the file Mat_utf8.zip from Cyrille's link and unzipped the 
contents to Mat_utf8-odt

I opened the .odt file using 7-Zip from the Windows Explorer context menu, and 
extracted the file contents.xml
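
For anyone without 7-Zip, the same extraction can be scripted, since an .odt file is an ordinary ZIP archive. The following is only a minimal sketch using Python's standard zipfile module, not what I actually ran. The function name extract_member and the paths in the usage comment are made up for illustration. Standard OpenDocument packages normally name the main XML member content.xml, so the member name is passed in as a parameter rather than hard-coded.

```python
import zipfile
from pathlib import Path


def extract_member(odt_path, member, out_dir):
    """Extract one named member from an .odt (ZIP) archive.

    Returns the path of the extracted file. The member name is a
    parameter because it is an assumption about the archive layout.
    """
    with zipfile.ZipFile(odt_path) as archive:
        names = archive.namelist()
        if member not in names:
            raise KeyError(f"{member!r} not found; archive contains {names}")
        return Path(archive.extract(member, path=out_dir))


# Hypothetical usage (file and folder names are placeholders):
#   xml_path = extract_member("Mat_utf8.odt", "content.xml", "Mat_utf8-odt")
```

Listing the archive members in the error message makes it easy to see what the XML file is really called if the guess is wrong.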

I used the Notepad++ plug-in XMLTools to pretty-print the XML file and saved
the result as contents.pp.xml
This changes only the layout, which makes the file easier to read.
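
A scripted equivalent of the pretty-print step might look like the sketch below. It uses Python's standard xml.dom.minidom rather than the Notepad++ plug-in, so the exact indentation will differ from my file. The function name pretty_print_xml is illustrative only. One caveat: minidom also re-indents text inside mixed-content elements, which is harmless here only because leading and trailing blanks are discarded later anyway.

```python
from xml.dom import minidom


def pretty_print_xml(xml_source):
    """Return an indented copy of an XML document (str or bytes input).

    Whitespace-only lines produced by re-indenting are dropped so that
    the output has one element or text run per line.
    """
    dom = minidom.parseString(xml_source)
    pretty = dom.toprettyxml(indent="  ")
    return "\n".join(line for line in pretty.splitlines() if line.strip())


# Hypothetical usage (file names are placeholders):
#   with open("contents.xml", "rb") as src:
#       text = pretty_print_xml(src.read())
#   with open("contents.pp.xml", "w", encoding="utf-8") as dst:
#       dst.write(text)
```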

I viewed the .pp.xml file in BabelPad, which confirmed that the non-XML text 
was (mostly) Myanmar Unicode.

I used a TextPipe filter to remove all XML tags, the leading and trailing
blanks on each line, and all blank lines.
The output file is contents.pp.txt
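
The filtering step can be approximated in a few lines of Python. This is a rough sketch standing in for the TextPipe filter, not the filter itself: it deletes anything that looks like a tag with a simple regular expression (which would misfire on a ">" inside an attribute value), trims each line, and drops empty lines. It does not decode XML entities such as &amp;. The function name strip_xml_to_text is made up for illustration.

```python
import re

# Anything between "<" and the next ">" is treated as a tag.
TAG = re.compile(r"<[^>]*>")


def strip_xml_to_text(xml_text):
    """Remove tags, trim blanks at line start/end, and drop blank lines."""
    without_tags = TAG.sub("", xml_text)
    trimmed = (line.strip() for line in without_tags.splitlines())
    return "\n".join(line for line in trimmed if line)


# Hypothetical usage (file names are placeholders):
#   with open("contents.pp.xml", encoding="utf-8") as src:
#       plain = strip_xml_to_text(src.read())
#   with open("contents.pp.txt", "w", encoding="utf-8") as dst:
#       dst.write(plain)
```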

The result is readable content in Myanmar Unicode, with some English text
such as "The Gospel according Matthew" near the start.

The file is best viewed using BabelPad with the option Display Colours | Colour 
Code by Script.
This shows Myanmar characters in light green, and non-Myanmar characters in 
other colours.
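
Without BabelPad, a similar check can be done programmatically. The sketch below is an assumption-laden stand-in for the colour coding: it treats only the main Myanmar block (U+1000 to U+109F) as Myanmar, lets plain ASCII through because the file legitimately contains some English, and reports everything else with its line and column. The names is_myanmar and find_suspects are illustrative.

```python
def is_myanmar(ch):
    """True if ch lies in the main Myanmar block, U+1000..U+109F."""
    return 0x1000 <= ord(ch) <= 0x109F


def find_suspects(text):
    """List (line_number, column, character) for every character that is
    neither ASCII nor in the main Myanmar block. Positions are 1-based."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F and not is_myanmar(ch):
                hits.append((line_no, col, ch))
    return hits


# Hypothetical usage (file name is a placeholder):
#   with open("contents.pp.txt", encoding="utf-8") as src:
#       for line_no, col, ch in find_suspects(src.read()):
#           print(line_no, col, f"U+{ord(ch):04X}", ch)
```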

Observations:
1. The font conversion to Unicode left a few scattered characters
unconverted. :(

0000C8  È       18      LATIN CAPITAL LETTER E WITH GRAVE
0000D8  Ø       20      LATIN CAPITAL LETTER O WITH STROKE
0000F2  ò       3       LATIN SMALL LETTER O WITH GRAVE

The complete character frequency analysis is attached.
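
A table in the same shape can be produced with Python's standard library. This is a sketch of one way to do it, not the tool that generated my attachment. It assumes line-break characters are excluded from the counts (I have not verified that my tool does the same), and the function names char_frequency and format_table are made up.

```python
from collections import Counter
import unicodedata


def char_frequency(text):
    """Return rows of (code point, character, count, name), sorted by
    code point. CR and LF are skipped (an assumption, see above)."""
    counts = Counter(ch for ch in text if ch not in "\r\n")
    return [
        (f"{ord(ch):06X}", ch, counts[ch], unicodedata.name(ch, "<unnamed>"))
        for ch in sorted(counts)
    ]


def format_table(rows):
    """Render the rows as tab-separated text with a header line."""
    lines = ["Code point\tCharacter\tCount\tCharacter Name"]
    for code, ch, count, name in rows:
        lines.append(f"{code}\t{ch}\t{count:,}\t{name}")
    return "\n".join(lines)


# Hypothetical usage (file name is a placeholder):
#   with open("contents.pp.txt", encoding="utf-8") as src:
#       print(format_table(char_frequency(src.read())))
```

Sorting by code point keeps the stray Latin-1 characters together just before the Myanmar block, which makes them easy to spot.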

2. A few verse numbers (?) are still present here and there.
3. The content contains section headings and parallel passage headings as well 
as verse text.

I have just uploaded the file contents.pp.zip to a new folder in my Box account 
and added Cyrille & Michael as viewers.


Best regards,

David


‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Monday, May 13, 2019 9:19 AM, Cyrille <lafricai...@gmail.com> wrote:

> Hello,
> I recently received a modern Myanmar translation of the NT, Psalms and
> Proverbs with permission to create a new module.
> But the problems are many... First, to get the text.
> I tried different ways, but it was made with PageMaker!
> I can get the text, but the problem is that I don't have the verse
> numbers, because they are in an adjacent parallel column and when I
> copy it I get only the biblical text.
> I also have a PDF, but when I convert it to text (with pdftotext) the
> columns are mixed.
> Can someone help me with any ideas?
> The next problem is the Unicode... The text is not typed in Unicode but
> uses a special font.
> I can send everything you need or push it to git.crosswire.
>
> Thanks for help.
>
> sword-devel mailing list: sword-devel@crosswire.org
> http://www.crosswire.org/mailman/listinfo/sword-devel
> Instructions to unsubscribe/change your settings at above page


Code point      Character       Count   Character Name
000020          11,545  SPACE
000028  (       149     LEFT PARENTHESIS
000029  )       149     RIGHT PARENTHESIS
00002D  -       1,091   HYPHEN-MINUS
000031  1       4       DIGIT ONE
000032  2       2       DIGIT TWO
000036  6       1       DIGIT SIX
000038  8       1       DIGIT EIGHT
00003B  ;       14      SEMICOLON
000047  G       1       LATIN CAPITAL LETTER G
00004B  K       1       LATIN CAPITAL LETTER K
00004D  M       1       LATIN CAPITAL LETTER M
000053  S       2       LATIN CAPITAL LETTER S
000054  T       1       LATIN CAPITAL LETTER T
000061  a       2       LATIN SMALL LETTER A
000063  c       2       LATIN SMALL LETTER C
000064  d       3       LATIN SMALL LETTER D
000065  e       3       LATIN SMALL LETTER E
000067  g       1       LATIN SMALL LETTER G
000068  h       2       LATIN SMALL LETTER H
000069  i       1       LATIN SMALL LETTER I
00006C  l       1       LATIN SMALL LETTER L
00006D  m       6       LATIN SMALL LETTER M
00006E  n       1       LATIN SMALL LETTER N
00006F  o       2       LATIN SMALL LETTER O
000070  p       1       LATIN SMALL LETTER P
000072  r       1       LATIN SMALL LETTER R
000073  s       1       LATIN SMALL LETTER S
000074  t       2       LATIN SMALL LETTER T
000077  w       1       LATIN SMALL LETTER W
0000C8  È      18      LATIN CAPITAL LETTER E WITH GRAVE
0000D8  Ø      20      LATIN CAPITAL LETTER O WITH STROKE
0000F2  ò      3       LATIN SMALL LETTER O WITH GRAVE
001000  က     7,640   MYANMAR LETTER KA
001001  ခ     2,396   MYANMAR LETTER KHA
001002  ဂ     265     MYANMAR LETTER GA
001004  င     6,256   MYANMAR LETTER NGA
001005  စ     2,392   MYANMAR LETTER CA
001006  ဆ     1,020   MYANMAR LETTER CHA
001007  ဇ     376     MYANMAR LETTER JA
001008  ဈ     3       MYANMAR LETTER JHA
001009  ဉ     154     MYANMAR LETTER NYA
00100A  ည     3,621   MYANMAR LETTER NNYA
00100B  ဋ     4       MYANMAR LETTER TTA
00100C  ဌ     7       MYANMAR LETTER TTHA
00100D  ဍ     9       MYANMAR LETTER DDA
00100F  ဏ     79      MYANMAR LETTER NNA
001010  တ     5,765   MYANMAR LETTER TA
001011  ထ     1,461   MYANMAR LETTER THA
001012  ဒ     204     MYANMAR LETTER DA
001013  ဓ     43      MYANMAR LETTER DHA
001014  န     3,173   MYANMAR LETTER NA
001015  ပ     2,987   MYANMAR LETTER PA
001016  ဖ     974     MYANMAR LETTER PHA
001017  ဗ     38      MYANMAR LETTER BA
001018  ဘ     458     MYANMAR LETTER BHA
001019  မ     5,731   MYANMAR LETTER MA
00101A  ယ     1,455   MYANMAR LETTER YA
00101B  ရ     2,536   MYANMAR LETTER RA
00101C  လ     3,514   MYANMAR LETTER LA
00101D  ဝ     375     MYANMAR LETTER WA
00101E  သ     7,122   MYANMAR LETTER SA
00101F  ဟ     777     MYANMAR LETTER HA
001020  ဠ     1       MYANMAR LETTER LLA
001021  အ     3,239   MYANMAR LETTER A
001024  ဤ     215     MYANMAR LETTER II
001025  ဥ     81      MYANMAR LETTER U
001026  ဦ     198     MYANMAR LETTER UU
001027  ဧ     42      MYANMAR LETTER E
001029  ဩ     12      MYANMAR LETTER O
00102B  ါ     1,453   MYANMAR VOWEL SIGN TALL AA
00102C  ာ     9,440   MYANMAR VOWEL SIGN AA
00102D  ိ     8,154   MYANMAR VOWEL SIGN I
00102E  ီ     876     MYANMAR VOWEL SIGN II
00102F  ု     8,430   MYANMAR VOWEL SIGN U
001030  ူ     2,760   MYANMAR VOWEL SIGN UU
001031  ေ     7,541   MYANMAR VOWEL SIGN E
001032  ဲ     589     MYANMAR VOWEL SIGN AI
001036  ံ     1,129   MYANMAR SIGN ANUSVARA
001037  ့     5,309   MYANMAR SIGN DOT BELOW
001038  း     7,959   MYANMAR SIGN VISARGA
001039  ္     293     MYANMAR SIGN VIRAMA
00103A  ်     18,107  MYANMAR SIGN ASAT
00103B  ျ     2,344   MYANMAR CONSONANT SIGN MEDIAL YA
00103C  ြ     4,347   MYANMAR CONSONANT SIGN MEDIAL RA
00103D  ွ     1,762   MYANMAR CONSONANT SIGN MEDIAL WA
00103E  ှ     2,546   MYANMAR CONSONANT SIGN MEDIAL HA
001040  ၀     90      MYANMAR DIGIT ZERO
001041  ၁     359     MYANMAR DIGIT ONE
001042  ၂     242     MYANMAR DIGIT TWO
001043  ၃     187     MYANMAR DIGIT THREE
001044  ၄     137     MYANMAR DIGIT FOUR
001045  ၅     89      MYANMAR DIGIT FIVE
001046  ၆     81      MYANMAR DIGIT SIX
001047  ၇     61      MYANMAR DIGIT SEVEN
001048  ၈     67      MYANMAR DIGIT EIGHT
001049  ၉     72      MYANMAR DIGIT NINE
00104A  ၊     601     MYANMAR SIGN LITTLE SECTION
00104B  ။     1,489   MYANMAR SIGN SECTION
00104C  ၌     379     MYANMAR SYMBOL LOCATIVE
00104D  ၍     564     MYANMAR SYMBOL COMPLETED
00104E  ၎     54      MYANMAR SYMBOL AFOREMENTIONED
00104F  ၏     1,699   MYANMAR SYMBOL GENITIVE