If Michael’s observations are anything to go by, then maybe I can script the 
recovery of chapter & verse tags.

We shall see ....

Even if I’m not immediately successful - valuable lessons can be learned in the 
attempt.
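
A rough sketch of how steps 1 to 4 quoted further down this thread might be scripted in Python. Everything named here is an assumption for illustration only: the function names, the `chapter_lengths` input (verse counts per chapter, taken from some reference versification), and the `/` verse-end marker used in the usage note. It also assumes the source text contains no tab characters of its own, otherwise the round trip is not reversible.

```python
def to_verse_per_line(text, verse_end):
    """Steps 1-2: turn every EOL into a tab, then break after each verse end."""
    joined = text.replace("\n", "\t")
    return joined.replace(verse_end, verse_end + "\n").split("\n")


def restore_layout(lines):
    """Undo steps 2 and 1: rejoin the lines, then turn tabs back into EOLs."""
    return "".join(lines).replace("\t", "\n")


def tag_verses(lines, chapter_lengths):
    """Steps 3-4: number the lines per chapter and prefix a USFM \\v tag.

    chapter_lengths is a hypothetical list of verse counts per chapter.
    Whitespace-only lines are skipped; lines left over after the last
    chapter are ignored. A \\c line is emitted at each reset so the
    restart of numbering is visible.
    """
    verses = iter(line for line in lines if line.strip())
    out = []
    for chapter, count in enumerate(chapter_lengths, start=1):
        out.append(f"\\c {chapter}")
        for number in range(1, count + 1):
            line = next(verses, None)
            if line is None:
                return out
            out.append(f"\\v {number} {line.strip()}")
    return out
```

For example, `tag_verses(to_verse_per_line(text, "/"), [25, 23])` would tag the first 25 lines as chapter 1 and the next 23 as chapter 2. Tabs left inside a verse line mark where the original line breaks were, so `restore_layout` can be applied to the untagged lines if the paragraphed layout is needed again.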

David

Sent from ProtonMail Mobile

On Tue, May 14, 2019 at 21:21, Cyrille <lafricai...@gmail.com> wrote:

> OK, thank you! I already have all the text in Unicode, but without the verse 
> numbers and chapters... I have started adding them manually...
>
> On 14/05/2019 22:17, David Haslam wrote:
>
>> Hi Cyrille
>>
>> If I can find the time tomorrow or later, I’ll have a look at what might be 
>> feasible.
>>
>> Thanks for all these useful links.
>>
>> David
>>
>> On Tue, May 14, 2019 at 14:08, Cyrille <lafricai...@gmail.com> wrote:
>>
>>> I am sending my message again because the first one was too big.
>>>
>>> The conversion to UTF-8 is 99% solved!! I used an online converter:
>>> https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html
>>> or:
>>> http://burglish.my-mm.org/latest/trunk/web/fontconv.htm
>>>
>>> See the result 
>>> [here](https://framadrop.org/r/jKnYnvuQIH#mE+FWcvzD1N/Omnfr7uWMZmI/HZUUVPdvnVVkBFyFrA=).
>>>
>>> Now the only problem is how to get the verse and chapter numbers...
>>>
>>> On 14/05/2019 13:53, Michael H wrote:
>>>
>>>> Cyrille, (Peter),
>>>>
>>>> Maybe further discussion on this belongs in Gitlab as issues.  Can I get 
>>>> added to this project?
>>>>
>>>> Here are the first few lines of Matthew copied from the PDF:
>>>> ------
>>>>
>>>> &Sifrmaw;OD; {0Ha*vdusrf;
>>>> The Gospel According to Matthew
>>>> ed'gef;
>>>> usr;f ûyy*k Kd¾v f &iS rf maw;O;D \b0rwS wf r;f
>>>> usr;f ûyy*k Kd¾v f &iS rf maw;O;Don f *gavav;,e,rf S*sL;vrl sK;d tmvaf 
>>>> z;O;D \om;jzp\f / (rmu k2;14)
>>>> olonf tcGefcHoltjzpf trIxrf;chJonf/ (vk 5;27) 
>>>> a,Zl;ocif\aemufvdkufwynfhrjzpfrD ol\trnfrSm
>>>> av0djzp\f / ool n f wad b;&,d tidk tf e;DwGi f a,Z;lociEf iS ahf wG U Ny;D
>>>> -----
>>>> And here are the first few lines of Matthew copied from the Pagemaker file:
>>>> -----
>>>> Sifrmaw;OD; {0Ha*vdusrf;
>>>> The Gospel According to Matthew
>>>> ed'gef;
>>>> usrf;�yyk*�dKvf  &Sifrmaw;OD;\b0rSwfwrf;
>>>> usrf;�yyk*�dKvf  &Sifrmaw;OD;onf  *gavav;,e,frS *sL;vlrsKd; 
>>>> tmvfaz;OD;\om;jzpf\/ (rmuk 2;14) olonf  tcGefcHoltjzpf trIxrf;chJonf/ (vk 
>>>> 5;27) a,Zl;ocif\aemufvdkufwynfhrjzpfrD  ol\trnfrSm av0djzpf\/ olonf  
>>>> wdab;&d,tkdifteD;wGif  a,Zl;ocifESifhawGU  NyD;
>>>>
>>>> You can see that some letters have changed, and some others are in a 
>>>> different order.
>>>>
>>>> The letters that change are likely those code points that aren't 
>>>> compatible with Unicode, and PageMaker reassigned them to ensure that the 
>>>> file is more widely viewable. Since a conversion is already planned, these 
>>>> won't matter as much, but the font embedded in the PDF is different from 
>>>> the font attached to the PageMaker file. If you do start from the PDF, 
>>>> you'll need to extract the font to get the code points.
>>>>
>>>> The problem is that the PDF export from PageMaker sorts the letters into 
>>>> the order they appear on the page. Burmese text has Indian-style 
>>>> ligatures, where vowels tend to jump over or under the previous letters, 
>>>> sometimes back two or three letters. If you compare the snippets above 
>>>> from the beginning of Matthew, you can see that the order differs and 
>>>> that some glyphs are modified.
>>>>
>>>> So, from the PDF the letters are out of order, but from PageMaker the 
>>>> letters are encoded into control code points. Fixing the control code 
>>>> points is easy and happens with the Unicode conversion. Fixing the letter 
>>>> order is not easy. You'll need a first-language speaker and plenty of time.
>>>>
>>>> The guidance I received on another group was to use either LO Draw or 
>>>> InDesign to export the text from PageMaker. I'll look into LO Draw again, 
>>>> but I don't have access to an older version of InDesign (the PageMaker 
>>>> import was removed in CS6).
>>>>
>>>> On Mon, May 13, 2019 at 10:40 AM Michael H <cma...@gmail.com> wrote:
>>>>
>>>>> I unzipped the pagemaker file, and when I open NT_Proverb/Pagemaker 
>>>>> (10.1mb), with a Hex editor, I can 'find' all of the book names, and see 
>>>>> the text there.
>>>>>
>>>>> To see the raw text: rename NT_Proverb.pmd > NT_Proverb.zip and open it 
>>>>> with a zip archive program. The text is in the Pagemaker file at the 
>>>>> top level of the archive, but encoded with a lot of extraneous 
>>>>> information. (The English text "Matthew" appears at hex location 
>>>>> 7A76972).
>>>>>
>>>>> When I open the fonts with FontForge, it suggests the fonts are 
>>>>> encoded as Unicode (but the glyphs are obviously not in the right spots). 
>>>>> However, when I copy the text (I copied from LO Draw), paste it into 
>>>>> jEdit and save that as Unicode, reopening the file gives the warning 'not 
>>>>> unicode, text may be missing'.
>>>>>
>>>>> So, what this means is that there are some glyphs encoded into locations 
>>>>> that unicode treats as control or non-printing codes. The text needs to 
>>>>> be dealt with as a specific encoding that matches whatever the original 
>>>>> font actually uses. I haven't figured out what the original text files 
>>>>> were encoded with. Without that knowledge, I'm not sure my system 
>>>>> clipboard or editor (jedit) will properly respect the glyphs in unusual 
>>>>> locations until the conversion to unicode, and I don't trust myself to be 
>>>>> able to detect if it is or is not properly converted.
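
One way to check pasted text for glyphs sitting in control or non-printing locations, as described above, is to scan it by Unicode general category. This is only a sketch: the function name and the choice of categories to flag are assumptions, and it says nothing about which legacy encoding the original font uses.

```python
import unicodedata

# General categories treated as suspicious here (an assumption):
# Cc control, Cf format, Cn unassigned, Co private use, Cs surrogate.
SUSPICIOUS = {"Cc", "Cf", "Cn", "Co", "Cs"}


def suspicious_chars(text):
    """Map each suspicious character to the offsets where it occurs."""
    found = {}
    for offset, ch in enumerate(text):
        if ch in "\n\r\t":
            continue  # ordinary line breaks and tabs are expected
        if ch == "\ufffd" or unicodedata.category(ch) in SUSPICIOUS:
            found.setdefault(ch, []).append(offset)
    return found
```

An empty result does not prove the conversion is correct; it only shows that no character landed in a control, private-use, or replacement-character slot.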
>>>>>
>>>>> On Mon, May 13, 2019 at 10:11 AM Cyrille <lafricai...@gmail.com> wrote:
>>>>>
>>>>>> David,
>>>>>> You are probably right about 
>>>>>> [TECkit](http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=TECkit).
>>>>>> If we get the text, it will help us convert it to Unicode.
>>>>>> As for how to get the text, your method is beyond my skills :)
>>>>>> If you succeed, please let me know.
>>>>>>
>>>>>> On 13/05/2019 16:21, David Haslam wrote:
>>>>>>
>>>>>>> Given the insights from Michael Hart, it may be feasible to temporarily 
>>>>>>> rearrange the main text stream as follows :
>>>>>>>
>>>>>>> 1. Replace every EOL by a horizontal tab.
>>>>>>> 2. Insert an EOL after each verse end character.
>>>>>>>
>>>>>>> Observe that the above two steps are wholly reversible such that the 
>>>>>>> original text stream can be restored later.
>>>>>>>
>>>>>>> In effect the text stream is now in verse per line (VPL) layout, albeit 
>>>>>>> without verse tags. Some adjustments may be necessary if there are any 
>>>>>>> section headings, etc.
>>>>>>>
>>>>>>> 3. Add line numbers with the first number being reset to 1 at the start 
>>>>>>> of each chapter, numbers incrementing by 1 for each line.
>>>>>>> 4. Add a left margin USFM verse tag \v_
>>>>>>>
>>>>>>> Steps 3&4 can be implemented in various ways. For my part, I’d use a 
>>>>>>> bespoke TextPipe filter.
>>>>>>>
>>>>>>> Another method to consider might be to use Excel formulae. I recall 
>>>>>>> resorting to such a method in the early days of Go Bible.
>>>>>>>
>>>>>>> Now restore the original layout by reverting steps 2 & 1, if this is 
>>>>>>> really necessary. That is, if the original text layout appeared to be 
>>>>>>> paragraphed.
>>>>>>>
>>>>>>> 5. Decide how & where to insert paragraph tags.
>>>>>>>
>>>>>>> 6. Add chapter tags, book ID and main title tags, etc.
>>>>>>>
>>>>>>> Hope this gives some useful suggestions that point towards a practical 
>>>>>>> solution.
>>>>>>>
>>>>>>> Best regards
>>>>>>>
>>>>>>> David
>>>>>>>
>>>>>>> On Mon, May 13, 2019 at 14:57, Michael H <cma...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Cyrille
>>>>>>>>
>>>>>>>> LibreOffice Draw attempts to open the PageMaker file, with limited 
>>>>>>>> success. But it confirms that even in the PageMaker source, the verse 
>>>>>>>> numbers are a separate text stream. With this source, there is no way 
>>>>>>>> to copy the text with verse numbers intact. Each book appears to be 
>>>>>>>> stored in its own text stream in the PageMaker file. LO Draw isn't 
>>>>>>>> rendering all of the pages, only the first 10, so I've only explored 
>>>>>>>> Matthew further.
>>>>>>>>
>>>>>>>> Based on Matthew only, the verses seem to all end with the character 
>>>>>>>> "-" or ";/", which should aid in the reconstruction. I've looked 
>>>>>>>> through the PDF and this seems to be the case for all books visually 
>>>>>>>> as well. However, this isn't perfect: I find 1107 of these characters 
>>>>>>>> in Matthew, instead of the expected 1071 verses.  But since the text 
>>>>>>>> stream has a book introduction, this is likely easily explained. 
>>>>>>>> Hopefully this gets you well down the path to creating a stream with 
>>>>>>>> verses.
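
The count comparison described above could be repeated for other books with a few lines of Python. This is a sketch under assumptions: the function name is made up, and the marker strings passed in are whatever the extracted text stream actually uses as verse terminators.

```python
def marker_surplus(text, markers, expected_verses):
    """Count each candidate verse-end marker and compare with the expected total.

    Returns (counts, surplus); a positive surplus means more markers were
    found than verses expected, e.g. because of a book introduction.
    """
    counts = {marker: text.count(marker) for marker in markers}
    return counts, sum(counts.values()) - expected_verses
```

With the figures reported for Matthew (1107 markers found, 1071 verses expected), the surplus would be 36.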
>>>>>>>>
>>>>>>>> I would NOT start from the PDF file, but from the pagemaker file.  The 
>>>>>>>> PDF almost certainly has a lot of text rearranging and extra 
>>>>>>>> characters like page numbers and running heads.  Pagemaker has the 
>>>>>>>> book text in a single stream, in a form that will convert to unicode 
>>>>>>>> relatively easily.
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> sword-devel mailing list:
>>>>>>> sword-devel@crosswire.org
>>>>>>>
>>>>>>> http://www.crosswire.org/mailman/listinfo/sword-devel
>>>>>>> Instructions to unsubscribe/change your settings at above page
>>>>>>
>>>>
>>