Re: [sword-devel] Bible in Myanmar

Michael H Tue, 14 May 2019 13:56:05 -0700

You should be able to configure a regex search to find the verse
boundaries.

Once you have the verse boundaries and have laid the text out one verse per
line, it should be possible to assign each row a chapter and verse number
from a reference versification. For example, the 3341st verse in the New
Testament is usually John 20:31 (I don't have that memorized; it's just an
example).
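
A minimal sketch of that numbering step, assuming the text is already one
verse per line and that a table of verse counts per chapter is available for
the book. The name number_verses and the toy counts are made up for
illustration; they are not taken from any SWORD tool or real versification:

```python
from typing import Iterator, Sequence, Tuple


def number_verses(
    lines: Sequence[str], verse_counts: Sequence[int]
) -> Iterator[Tuple[int, int, str]]:
    """Yield (chapter, verse, text) for each verse-per-line row.

    verse_counts[i] is the number of verses in chapter i + 1.
    """
    expected = sum(verse_counts)
    if len(lines) != expected:
        # A mismatch usually means a verse boundary was missed or over-split.
        raise ValueError(f"expected {expected} verses, found {len(lines)}")
    rows = iter(lines)
    for chapter, count in enumerate(verse_counts, start=1):
        for verse in range(1, count + 1):
            yield chapter, verse, next(rows)


if __name__ == "__main__":
    # Toy book: chapter 1 has 3 verses, chapter 2 has 2 (hypothetical counts).
    for chapter, verse, text in number_verses(["a", "b", "c", "d", "e"], [3, 2]):
        print(f"{chapter}:{verse} {text}")
```

The length check is deliberate: if the row count does not match the
reference, every later verse number would be shifted, so it seems safer to
stop than to number silently.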

On Tue, May 14, 2019 at 3:22 PM Cyrille <lafricai...@gmail.com> wrote:

> OK, thank you! I already have all the text in Unicode, but without the
> verse and chapter numbers... I have begun adding them manually...
>
> On 14/05/2019 22:17, David Haslam wrote:
>
> Hi Cyrille
>
> If I can find the time tomorrow or later, I’ll have a look at what might
> be feasible.
>
> Thanks for all these useful links.
>
> David
>
> Sent from ProtonMail Mobile
>
>
> On Tue, May 14, 2019 at 14:08, Cyrille <lafricai...@gmail.com> wrote:
>
> I am sending my message again because the first one was too big.
>
> The conversion to UTF-8 is 99% solved!! I used an online converter:
>
> https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html
> or:
> http://burglish.my-mm.org/latest/trunk/web/fontconv.htm
>
> See the result here
> <https://framadrop.org/r/jKnYnvuQIH#mE+FWcvzD1N/Omnfr7uWMZmI/HZUUVPdvnVVkBFyFrA=>
> .
>
> Now the only problem is how to get the verse and chapter number...
>
>
> On 14/05/2019 13:53, Michael H wrote:
>
> Cyrille, (Peter),
>
> Maybe further discussion on this belongs in Gitlab as issues.  Can I get
> added to this project?
>
> Here are the first few lines of Matthew copied from the PDF:
> ------
> &Sifrmaw;OD; {0Ha*vdusrf;
> The Gospel According to Matthew
> ed'gef;
> usr;f ûyy*k Kd¾v f &iS rf maw;O;D \b0rwS wf r;f
> usr;f ûyy*k Kd¾v f &iS rf maw;O;Don f *gavav;,e,rf S*sL;vrl sK;d tmvaf
> z;O;D \om;jzp\f / (rmu k2;14)
> olonf tcGefcHoltjzpf trIxrf;chJonf/ (vk 5;27)
> a,Zl;ocif\aemufvdkufwynfhrjzpfrD ol\trnfrSm
> av0djzp\f / ool n f wad b;&,d tidk tf e;DwGi f a,Z;lociEf iS ahf wG U Ny;D
>
> -----
> And here are the first few lines of Matthew copied from the Pagemaker
> file:
> -----
> Sifrmaw;OD; {0Ha*vdusrf;
> The Gospel According to Matthew
> ed'gef;
> usrf;�yyk*�dKvf  &Sifrmaw;OD;\b0rSwfwrf;
> usrf;�yyk*�dKvf  &Sifrmaw;OD;onf  *gavav;,e,frS *sL;vlrsKd;
> tmvfaz;OD;\om;jzpf\/ (rmuk 2;14) olonf  tcGefcHoltjzpf trIxrf;chJonf/ (vk
> 5;27) a,Zl;ocif\aemufvdkufwynfhrjzpfrD  ol\trnfrSm av0djzpf\/ olonf
> wdab;&d,tkdifteD;wGif  a,Zl;ocifESifhawGU  NyD;
>
>
> You can see that some letters have changed, and some others are in a
> different order.
>
> The letters that change are likely those at code points that aren't
> compatible with Unicode, which PageMaker reassigned to ensure that the file
> is more widely viewable. Since a conversion is already planned, these won't
> matter as much, but the font embedded in the PDF is different from the font
> attached to the PageMaker file. If you do start from the PDF, you'll need
> to extract the font to get the code points.
>
> The problem is that the PDF export from PageMaker sorts the letters into
> the order they appear on the page. Burmese text has Indic-style
> ligatures, where vowels tend to jump over or under the previous letters,
> sometimes back two or three letters. If you compare the snippets above from
> the beginning of Matthew, you can see that the order differs and that
> some glyphs are modified.
>
> So, from the PDF the letters are out of order, but from PageMaker the
> letters are encoded at control code points. Fixing the control code points
> is easy and happens with the Unicode conversion. Fixing the letter order is
> not easy: you'll need a first-language speaker and plenty of time.
>
> The guidance I received on another group was to use either LO Draw or
> InDesign to export the text from PageMaker. I'll look into LO Draw again,
> but I don't have access to an older version of InDesign (the PageMaker
> import was removed in CS6).
>
>
> On Mon, May 13, 2019 at 10:40 AM Michael H <cma...@gmail.com> wrote:
>
>> I unzipped the PageMaker file, and when I open NT_Proverb/Pagemaker
>> (10.1 MB) with a hex editor, I can find all of the book names and see
>> the text there.
>>
>> To see the raw text, rename NT_Proverb.pmd to NT_Proverb.zip and open it
>> with a zip archive program. The text is in the Pagemaker file at the top
>> level of the archive, but encoded with a lot of extraneous information.
>> (The English text "Matthew" appears at hex offset 7A76972.)
>>
>> When I open the fonts with FontForge, it suggests the fonts are
>> encoded as Unicode (but the glyphs are obviously not in the right spots).
>> However, when I copy the text (I copied from LO Draw), paste it into
>> jEdit, and save it as Unicode, reopening the file gives the warning 'not
>> unicode, text may be missing'.
>>
>> What this means is that some glyphs are encoded at locations
>> that Unicode treats as control or non-printing codes. The text needs to be
>> handled in the specific encoding that matches whatever the original font
>> actually uses. I haven't figured out what the original text files were
>> encoded with. Without that knowledge, I'm not sure my system clipboard or
>> editor (jEdit) will properly preserve the glyphs in unusual locations until
>> the conversion to Unicode, and I don't trust myself to be able to detect
>> whether it has been properly converted.
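
One way to check whether copied text still carries characters at control
code points is to scan it for the Unicode general category Cc. This is only
a sketch: the name find_control_chars is hypothetical, and treating tab,
line feed, and carriage return as harmless is an assumption about what
counts as ordinary layout here:

```python
import unicodedata
from typing import List, Tuple

# Whitespace controls that normally appear in plain text (assumption).
ALLOWED = {"\t", "\n", "\r"}


def find_control_chars(text: str) -> List[Tuple[int, int]]:
    """Return (index, code point) for each control character in text.

    Characters in category Cc cover the C0 and C1 control ranges, which is
    where a legacy font may have placed visible glyphs.
    """
    hits = []
    for index, ch in enumerate(text):
        if ch not in ALLOWED and unicodedata.category(ch) == "Cc":
            hits.append((index, ord(ch)))
    return hits


if __name__ == "__main__":
    sample = "ab\x07c\n\x85"
    for index, code in find_control_chars(sample):
        print(f"offset {index}: U+{code:04X}")
```

A non-empty result on text pasted from LO Draw would suggest that the
clipboard kept those code points; an empty result does not prove that
nothing was dropped on the way.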
>>
>> On Mon, May 13, 2019 at 10:11 AM Cyrille <lafricai...@gmail.com> wrote:
>>
>>> David,
>>> You are probably right about TECkit
>>> <http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=TECkit>;
>>> if we get the text, it will help us convert it to Unicode.
>>> As for how to get the text, your method is beyond my skills :)
>>> If you succeed, please let me know.
>>>
>>> On 13/05/2019 16:21, David Haslam wrote:
>>>
>>> Given the insights from Michael Hart, it may be feasible to temporarily
>>> rearrange the main text stream as follows :
>>>
>>> 1. Replace every EOL by a horizontal tab.
>>> 2. Insert an EOL after each verse end character.
>>>
>>> Observe that the above two steps are wholly reversible such that the
>>> original text stream can be restored later.
>>>
>>> In effect, the text stream is now in verse-per-line (VPL) layout, albeit
>>> without verse tags. Some adjustments may be necessary if there are any
>>> section headings, etc.
>>>
>>> 3. Add line numbers with the first number being reset to 1 at the start
>>> of each chapter, numbers incrementing by 1 for each line.
>>> 4. Add a left margin USFM verse tag \v_
>>>
>>> Steps 3&4 can be implemented in various ways. For my part, I’d use a
>>> bespoke TextPipe filter.
>>>
>>> Another method to consider might be to use Excel formulae. I recall
>>> resorting to such a method in the early days of Go Bible.
>>>
>>> Now restore the original layout by reverting steps 2 & 1, if this is
>>> really necessary. That is, if the original text layout appeared to be
>>> paragraphed.
>>>
>>> 5. Decide how & where to insert paragraph tags.
>>>
>>> 6. Add chapter tags, book ID and main title tags, etc.
>>>
>>> Hope this gives some useful suggestions that point towards a practical
>>> solution.
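
Steps 1 to 4 above can be sketched in Python. This is only an illustration
under stated assumptions: the source text contains no tab characters, the
verse-end marker is a single known string, and the lines passed to
add_verse_tags all belong to one chapter. All function names are
hypothetical:

```python
from typing import List


def to_vpl(text: str, verse_end: str) -> str:
    """Steps 1 and 2: EOL -> tab, then an EOL after each verse-end marker."""
    if "\t" in text:
        # Tabs already present would make the round trip ambiguous.
        raise ValueError("source text must not contain tab characters")
    return text.replace("\n", "\t").replace(verse_end, verse_end + "\n")


def from_vpl(vpl: str) -> str:
    """Undo steps 2 and 1: drop the inserted EOLs, then tab -> EOL."""
    return vpl.replace("\n", "").replace("\t", "\n")


def add_verse_tags(chapter_lines: List[str]) -> List[str]:
    """Steps 3 and 4 for one chapter: number from 1 and prefix a \\v tag."""
    return [f"\\v {n} {line}" for n, line in enumerate(chapter_lines, start=1)]


if __name__ == "__main__":
    source = "aa\nbb/ cc\ndd/"
    vpl = to_vpl(source, "/")
    print(repr(vpl))
    print(from_vpl(vpl) == source)
    print(add_verse_tags(["x", "y"]))
```

Note that from_vpl only restores the untagged stream; if the tags from
add_verse_tags are to survive, they have to be added before the layout is
reverted.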
>>>
>>> Best regards
>>>
>>> David
>>>
>>>
>>> On Mon, May 13, 2019 at 14:57, Michael H <cma...@gmail.com> wrote:
>>>
>>> Cyrille
>>>
>>> LibreOffice Draw attempts to open the PageMaker file, with limited
>>> success. But it confirms that even in the PageMaker source, the verse
>>> numbers are a separate text stream. With this source, there is no way to
>>> copy the text with verse numbers intact. Each book appears to be stored in
>>> its own text stream in the PageMaker file. LO Draw isn't rendering all of
>>> the pages, only the first 10, so I've only explored Matthew further.
>>>
>>> Based on Matthew only, the verses seem to all end with the character "-"
>>> or ";/", which should aid in the reconstruction. I've looked through the
>>> PDF and this seems to be the case for all books visually as well. However,
>>> this isn't perfect: I find 1107 of these characters in Matthew, instead of
>>> the expected 1071 verses.  But since the text stream has a book
>>> introduction, this is likely easily explained. Hopefully this gets you well
>>> down the path to creating a stream with verses.
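
The count above (1107 markers against 1071 expected verses, a surplus of 36)
can be reproduced with a regex. The sketch below assumes the verse-end
markers are exactly "-" and ";/" as they appear in the legacy-encoded
stream; the names VERSE_END, split_verses, and surplus are hypothetical:

```python
import re
from typing import List, Tuple

# Verse-end markers as described for Matthew (assumption: no other variants).
VERSE_END = re.compile(r"-|;/")


def split_verses(text: str) -> Tuple[List[str], str]:
    """Split text after each verse-end marker.

    Returns the list of segments (each ending in a marker) and any
    trailing text that has no marker after it.
    """
    segments = []
    start = 0
    for match in VERSE_END.finditer(text):
        segments.append(text[start:match.end()].strip())
        start = match.end()
    return segments, text[start:].strip()


def surplus(text: str, expected: int) -> int:
    """Number of markers found beyond the expected verse count."""
    return len(VERSE_END.findall(text)) - expected


if __name__ == "__main__":
    segments, tail = split_verses("ab- cd;/ ef")
    print(segments, repr(tail))
    print(surplus("ab- cd;/ ef", 2))
```

A positive surplus points at segments that are not verses, such as the book
introduction, which would need to be set aside before numbering.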
>>>
>>> I would NOT start from the PDF file, but from the PageMaker file. The
>>> PDF almost certainly has a lot of text rearranging and extra characters
>>> like page numbers and running heads. PageMaker has the book text in a
>>> single stream, in a form that will convert to Unicode relatively easily.
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> sword-devel mailing list: sword-devel@crosswire.org
>>> http://www.crosswire.org/mailman/listinfo/sword-devel
>>> Instructions to unsubscribe/change your settings at above page
