On 14/05/2019 22:55, Cyrille wrote:
>
>
> On 14/05/2019 22:45, Michael H wrote:
>> Cyrille, did you start from the PDF or the PageMaker file?
> PageMaker
>> Either way, you should send a snippet to your source and confirm that
>> the words are still readable. As few as 30 words should be enough.
The converted text? If so, see the attached file.
>>
>> On Tue, May 14, 2019 at 8:09 AM Cyrille <lafricai...@gmail.com
>> <mailto:lafricai...@gmail.com>> wrote:
>>
>> I am sending my message again because the first one was too big.
>>
>> The conversion to UTF-8 is 99% solved!! I used an online converter:
>>
>> https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html
>> or:
>> http://burglish.my-mm.org/latest/trunk/web/fontconv.htm
>>
>> See the result here
>> <https://framadrop.org/r/jKnYnvuQIH#mE+FWcvzD1N/Omnfr7uWMZmI/HZUUVPdvnVVkBFyFrA=>.
>>
>> Now the only problem is how to get the verse and chapter numbers...
>>
>>
>> On 14/05/2019 13:53, Michael H wrote:
>>> Cyrille, (Peter),
>>>
>>> Maybe further discussion on this belongs in GitLab as issues.
>>> Can I get added to this project?
>>>
>>> Here are the first few lines of Matthew copied from the PDF:
>>> ------
>>> &Sifrmaw;OD; {0Ha*vdusrf;
>>> The Gospel According to Matthew
>>> ed'gef;
>>> usr;f ûyy*k Kd¾v f &iS rf maw;O;D \b0rwS wf r;f
>>> usr;f ûyy*k Kd¾v f &iS rf maw;O;Don f *gavav;,e,rf S*sL;vrl sK;d
>>> tmvaf z;O;D \om;jzp\f / (rmu k2;14)
>>> olonf tcGefcHoltjzpf trIxrf;chJonf/ (vk 5;27)
>>> a,Zl;ocif\aemufvdkufwynfhrjzpfrD ol\trnfrSm
>>> av0djzp\f / ool n f wad b;&,d tidk tf e;DwGi f a,Z;lociEf iS ahf
>>> wG U Ny;D
>>>
>>> -----
>>> And here are the first few lines of Matthew copied from the
>>> PageMaker file:
>>> -----
>>> Sifrmaw;OD; {0Ha*vdusrf;
>>> The Gospel According to Matthew
>>> ed'gef;
>>> usrf;�yyk*�dKvf &Sifrmaw;OD;\b0rSwfwrf;
>>> usrf;�yyk*�dKvf &Sifrmaw;OD;onf *gavav;,e,frS *sL;vlrsKd;
>>> tmvfaz;OD;\om;jzpf\/ (rmuk 2;14) olonf tcGefcHoltjzpf
>>> trIxrf;chJonf/ (vk 5;27) a,Zl;ocif\aemufvdkufwynfhrjzpfrD
>>> ol\trnfrSm av0djzpf\/ olonf wdab;&d,tkdifteD;wGif
>>> a,Zl;ocifESifhawGU NyD;
>>>
>>>
>>> You can see that some letters have changed, and some others are
>>> in a different order.
>>>
>>> The letters that change are likely those code points that aren't
>>> compatible with Unicode, and PageMaker reassigned them to ensure
>>> that the file is more widely viewable. Since a conversion is
>>> already planned, these won't matter as much, but the font
>>> embedded in the PDF is different from the font attached to the
>>> PageMaker file. If you do start from the PDF, you'll need to
>>> extract the font to get the code points.
>>>
>>> The problem is that the PDF export from PageMaker sorts the
>>> letters into the order they appear on the page. Burmese text
>>> has Indian-style ligatures, where vowels tend to jump over or
>>> under the previous letters, sometimes back two or three letters.
>>> If you study the snippets above from the beginning of
>>> Matthew, you can see that there is a difference in order, and
>>> that some glyphs are modified.
>>>
>>> So, from the PDF the letters are out of order, but from PageMaker
>>> the letters are encoded into control points. Fixing the control
>>> points is easy and happens with the Unicode conversion. Fixing
>>> the letter order is not easy. You'll need a first-language
>>> speaker and plenty of time.
>>>
>>> The guidance I received on another group was to use either LO
>>> Draw or InDesign to export the text from PageMaker. I'll look
>>> into LO Draw again, but I don't have access to an older version
>>> of InDesign (the PageMaker import was removed in CS6).
>>>
>>>
>>> On Mon, May 13, 2019 at 10:40 AM Michael H <cma...@gmail.com
>>> <mailto:cma...@gmail.com>> wrote:
>>>
>>> I unzipped the PageMaker file, and when I open
>>> NT_Proverb/Pagemaker (10.1 MB) with a hex editor, I can
>>> 'find' all of the book names and see the text there.
>>>
>>> To see the raw text: rename NT_Proverb.pmd to NT_Proverb.zip
>>> and open it with a zip archive program. The text is in the
>>> Pagemaker file at the top level of the archive, but encoded
>>> with a lot of extraneous information. (The English text
>>> "Matthew" appears at hex location 7A76972.)
>>>
>>> When I open the fonts with FontForge, it suggests the
>>> fonts are encoded as Unicode (but the glyphs are obviously
>>> not in the right spot).
>>> However, when I copy the text (I copied from LO Draw),
>>> paste it into jEdit and save that as Unicode, reopening the
>>> file gives a warning: 'not unicode, text may be missing'.
>>>
>>> So, what this means is that there are some glyphs encoded
>>> into locations that Unicode treats as control or
>>> non-printing codes. The text needs to be dealt with as a
>>> specific encoding that matches whatever the original font
>>> actually uses. I haven't figured out what the original text
>>> files were encoded with.
>>> Without that knowledge, I'm not
>>> sure my system clipboard or editor (jEdit) will properly
>>> respect the glyphs in unusual locations until the conversion
>>> to Unicode, and I don't trust myself to be able to detect
>>> whether the text is properly converted.
>>>
>>> On Mon, May 13, 2019 at 10:11 AM Cyrille
>>> <lafricai...@gmail.com <mailto:lafricai...@gmail.com>> wrote:
>>>
>>> David,
>>> You are probably right about TECkit
>>> <http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=TECkit>;
>>> if we get the text, it will help us convert it to Unicode.
>>> As for how to get the text, your method is beyond my
>>> skills :)
>>> If you succeed, please let me know.
>>>
>>> On 13/05/2019 16:21, David Haslam wrote:
>>>> Given the insights from Michael Hart, it may be
>>>> feasible to temporarily rearrange the main text stream
>>>> as follows:
>>>>
>>>> 1. Replace every EOL by a horizontal tab.
>>>> 2. Insert an EOL after each verse end character.
>>>>
>>>> Observe that the above two steps are wholly reversible,
>>>> such that the original text stream can be restored later.
>>>>
>>>> In effect the text stream is now in verse per line
>>>> (VPL) layout, albeit without verse tags. Some
>>>> adjustments may be necessary if there are any section
>>>> headings, etc.
>>>>
>>>> 3. Add line numbers, with the first number being reset
>>>> to 1 at the start of each chapter and numbers incrementing
>>>> by 1 for each line.
>>>> 4. Add a left margin USFM verse tag \v_
>>>>
>>>> Steps 3 & 4 can be implemented in various ways. For my
>>>> part, I’d use a bespoke TextPipe filter.
>>>>
>>>> Another method to consider might be to use Excel
>>>> formulae. I recall resorting to such a method in the
>>>> early days of Go Bible.
>>>>
>>>> Now restore the original layout by reverting steps 2 &
>>>> 1, if this is really necessary. That is, if the
>>>> original text layout appeared to be paragraphed.
>>>>
>>>> 5. Decide how & where to insert paragraph tags.
>>>>
>>>> 6.
>>>> Add chapter tags, book ID and main title tags, etc.
>>>>
>>>> Hope this gives some useful suggestions that point
>>>> towards a practical solution.
>>>>
>>>> Best regards
>>>>
>>>> David
>>>>
>>>>
>>>> Sent from ProtonMail Mobile
>>>>
>>>>
>>>> On Mon, May 13, 2019 at 14:57, Michael H
>>>> <cma...@gmail.com <mailto:cma...@gmail.com>> wrote:
>>>>> Cyrille
>>>>>
>>>>> LibreOffice Draw attempts to open the PageMaker file,
>>>>> with limited success. But it confirms that even in the
>>>>> PageMaker source, the verse numbers are a separate
>>>>> text stream. With this source, there is no way to copy
>>>>> the text with verse numbers intact. The text appears to be
>>>>> stored with each book in its own text stream in the
>>>>> PageMaker file.
>>>>> LO Draw isn't rendering all of the pages, only the
>>>>> first 10, so I've only explored Matthew further.
>>>>>
>>>>> Based on Matthew only, the verses all seem to end with
>>>>> the character "-" or ";/", which should aid in the
>>>>> reconstruction. I've looked through the PDF, and visually
>>>>> this seems to be the case for all books as well.
>>>>> However, this isn't perfect: I find 1107 of these
>>>>> characters in Matthew, instead of the expected 1071
>>>>> verses. But since the text stream has a book
>>>>> introduction, this is likely easily explained.
>>>>> Hopefully this gets you well down the path to creating
>>>>> a stream with verses.
>>>>>
>>>>> I would NOT start from the PDF file, but from the
>>>>> PageMaker file. The PDF almost certainly has a lot of
>>>>> text rearranging and extra characters like page
>>>>> numbers and running heads. PageMaker has the book
>>>>> text in a single stream, in a form that will convert
>>>>> to Unicode relatively easily.
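
For anyone without TextPipe or Excel, steps 1 to 4 of David's outline can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names (to_verse_per_line, restore_layout, tag_verses) are made up for this note, the default terminator "/" is a placeholder (the thread reports "-" or ";/" as verse-end characters in the legacy encoding, so pass whatever the real text uses), and the chapter boundaries have to be supplied by the caller because the text stream carries no chapter markers.

```python
import re


def _terminator_pattern(terminators):
    # Longest alternatives first, so ";/" is preferred over "/" if both are given.
    ordered = sorted(terminators, key=len, reverse=True)
    return re.compile("|".join(re.escape(t) for t in ordered))


def to_verse_per_line(text, terminators=("/",)):
    """Steps 1 and 2: turn every EOL into a tab, then break after each verse end.

    `terminators` is an assumption; it must match the real verse-end characters.
    """
    if "\t" in text:
        raise ValueError("input already contains tabs; step 1 would not be reversible")
    folded = text.replace("\n", "\t")
    return _terminator_pattern(terminators).sub(lambda m: m.group(0) + "\n", folded)


def restore_layout(vpl_text):
    """Undo steps 2 and 1, in that order, giving back the original line breaks."""
    return vpl_text.replace("\n", "").replace("\t", "\n")


def tag_verses(vpl_text, first_verse=1):
    """Steps 3 and 4 for ONE chapter: number the lines and prefix a USFM \\v tag.

    Tabs left over from step 1 are shown as spaces; whitespace-only lines are skipped.
    """
    tagged = []
    number = first_verse
    for line in vpl_text.split("\n"):
        verse = line.replace("\t", " ").strip()
        if not verse:
            continue
        tagged.append(f"\\v {number} {verse}")
        number += 1
    return tagged
```

Calling tag_verses once per chapter restarts the numbering at 1 each time. Any book introduction or section heading that ends in a terminator will be numbered as a verse, which is one possible reason for a count such as 1107 instead of 1071, so the output still needs checking by someone who reads the language.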
>>>>
>>>> _______________________________________________
>>>> sword-devel mailing list: sword-devel@crosswire.org
>>>> http://www.crosswire.org/mailman/listinfo/sword-devel
>>>> Instructions to unsubscribe/change your settings at above page
<<attachment: TIT.zip>>