On 14/05/2019 22:26, David Haslam wrote:
> If Michael’s observations are anything to go by, then maybe I can
> script the recovery of chapter & verse tags.
>
> We shall see ....
>
> Even if I’m not immediately successful - valuable lessons can be
> learned in the attempt.

Very well, I'll wait for you ;)

>
> David
>
> Sent from ProtonMail Mobile
>
>
> On Tue, May 14, 2019 at 21:21, Cyrille <lafricai...@gmail.com
> <mailto:lafricai...@gmail.com>> wrote:
>> OK, thank you! I already have all the text in Unicode, but without the
>> verse numbers and chapters... I began manually...
>>
>> On 14/05/2019 22:17, David Haslam wrote:
>>> Hi Cyrille
>>>
>>> If I can find the time tomorrow or later, I’ll have a look at what
>>> might be feasible.
>>>
>>> Thanks for all these useful links.
>>>
>>> David
>>>
>>> Sent from ProtonMail Mobile
>>>
>>>
>>> On Tue, May 14, 2019 at 14:08, Cyrille <lafricai...@gmail.com
>>> <mailto:lafricai...@gmail.com>> wrote:
>>>> I am sending my message again because it was too big.
>>>>
>>>> The conversion to UTF-8 is 99% solved!! I used an online converter:
>>>> https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html
>>>> or:
>>>> http://burglish.my-mm.org/latest/trunk/web/fontconv.htm
>>>>
>>>> See the result here
>>>> <https://framadrop.org/r/jKnYnvuQIH#mE+FWcvzD1N/Omnfr7uWMZmI/HZUUVPdvnVVkBFyFrA=>.
>>>>
>>>> Now the only problem is how to get the verse and chapter numbers...
>>>>
>>>>
>>>> On 14/05/2019 13:53, Michael H wrote:
>>>>> Cyrille, (Peter),
>>>>>
>>>>> Maybe further discussion on this belongs in GitLab as issues. Can
>>>>> I get added to this project?
>>>>>
>>>>> Here are the first few lines of Matthew copied from the PDF:
>>>>> ------
>>>>> &Sifrmaw;OD; {0Ha*vdusrf;
>>>>> The Gospel According to Matthew
>>>>> ed'gef;
>>>>> usr;f ûyy*k Kd¾v f &iS rf maw;O;D \b0rwS wf r;f
>>>>> usr;f ûyy*k Kd¾v f &iS rf maw;O;Don f *gavav;,e,rf S*sL;vrl sK;d
>>>>> tmvaf z;O;D \om;jzp\f / (rmu k2;14)
>>>>> olonf tcGefcHoltjzpf trIxrf;chJonf/ (vk 5;27)
>>>>> a,Zl;ocif\aemufvdkufwynfhrjzpfrD ol\trnfrSm
>>>>> av0djzp\f / ool n f wad b;&,d tidk tf e;DwGi f a,Z;lociEf iS ahf
>>>>> wG U Ny;D
>>>>>
>>>>> -----
>>>>> And here are the first few lines of Matthew copied from the
>>>>> PageMaker file:
>>>>> -----
>>>>> Sifrmaw;OD; {0Ha*vdusrf;
>>>>> The Gospel According to Matthew
>>>>> ed'gef;
>>>>> usrf;�yyk*�dKvf &Sifrmaw;OD;\b0rSwfwrf;
>>>>> usrf;�yyk*�dKvf &Sifrmaw;OD;onf *gavav;,e,frS *sL;vlrsKd;
>>>>> tmvfaz;OD;\om;jzpf\/ (rmuk 2;14) olonf tcGefcHoltjzpf
>>>>> trIxrf;chJonf/ (vk 5;27) a,Zl;ocif\aemufvdkufwynfhrjzpfrD
>>>>> ol\trnfrSm av0djzpf\/ olonf wdab;&d,tkdifteD;wGif
>>>>> a,Zl;ocifESifhawGU NyD;
>>>>>
>>>>>
>>>>> You can see that some letters have changed, and some others are in
>>>>> a different order.
>>>>>
>>>>> The letters that change are likely those code points that aren't
>>>>> compatible with Unicode, and PageMaker reassigned them to ensure
>>>>> that the file is more widely viewable. Since a conversion is
>>>>> already planned, these won't matter as much, but the font embedded
>>>>> in the PDF is different from the font attached to the PageMaker
>>>>> file. If you do start from the PDF, you'll need to extract the
>>>>> font to get the code points.
>>>>>
>>>>> The problem is that the PDF export from PageMaker sorts the
>>>>> letters into the order they appear on the page. Burmese text has
>>>>> Indic-style ligatures, where vowels tend to jump over or under
>>>>> the previous letters, sometimes back two or three letters.
>>>>> If you study the snippets above from the beginning of Matthew, you
>>>>> can see that the order differs and that some glyphs are modified.
>>>>>
>>>>> So, from the PDF the letters are out of order, but from PageMaker
>>>>> the letters are encoded at control code points. Fixing the control
>>>>> code points is easy and happens with the Unicode conversion. Fixing
>>>>> the letter order is not easy. You'll need a first-language speaker
>>>>> and plenty of time.
>>>>>
>>>>> The guidance I received on another group was to use either LO Draw
>>>>> or InDesign to export the text from PageMaker. I'll look into LO
>>>>> Draw again, but I don't have access to an older version of
>>>>> InDesign (the PageMaker import was removed in CS6).
>>>>>
>>>>>
>>>>> On Mon, May 13, 2019 at 10:40 AM Michael H <cma...@gmail.com
>>>>> <mailto:cma...@gmail.com>> wrote:
>>>>>
>>>>>     I unzipped the PageMaker file, and when I open
>>>>>     NT_Proverb/Pagemaker (10.1 MB) with a hex editor, I can 'find'
>>>>>     all of the book names, and see the text there.
>>>>>
>>>>>     To see the raw text: rename NT_Proverb.pmd > NT_Proverb.zip
>>>>>     and open it with a zip archive program. The text is in the
>>>>>     Pagemaker file at the top level of the archive, but encoded
>>>>>     with a lot of extraneous information. (The English text
>>>>>     "Matthew" appears at hex location 7A76972).
>>>>>
>>>>>     When I open the fonts with FontForge, it suggests the
>>>>>     fonts are encoded as Unicode (but the glyphs are obviously not
>>>>>     in the right spot).
>>>>>     However, when I copy the text (I copied from LO Draw), paste
>>>>>     it into jEdit and save that as Unicode, reopening the file
>>>>>     gives a warning: 'not unicode, text may be missing'.
>>>>>
>>>>>     So, what this means is that there are some glyphs encoded into
>>>>>     locations that Unicode treats as control or non-printing
>>>>>     codes. The text needs to be dealt with as a specific encoding
>>>>>     that matches whatever the original font actually uses.
>>>>>     I haven't figured out what the original text files were encoded
>>>>>     with. Without that knowledge, I'm not sure my system clipboard
>>>>>     or editor (jEdit) will properly respect the glyphs in unusual
>>>>>     locations until the conversion to Unicode, and I don't trust
>>>>>     myself to be able to detect whether or not it is properly
>>>>>     converted.
>>>>>
>>>>>     On Mon, May 13, 2019 at 10:11 AM Cyrille
>>>>>     <lafricai...@gmail.com <mailto:lafricai...@gmail.com>> wrote:
>>>>>
>>>>>         David,
>>>>>         You are probably right about TECkit
>>>>>         <http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=TECkit>;
>>>>>         if we get the text, it will help us convert it to Unicode.
>>>>>         As for how to get the text, your method is beyond my skills :)
>>>>>         If you succeed, please let me know.
>>>>>
>>>>>         On 13/05/2019 16:21, David Haslam wrote:
>>>>>> Given the insights from Michael Hart, it may be feasible
>>>>>> to temporarily rearrange the main text stream as follows:
>>>>>>
>>>>>> 1. Replace every EOL by a horizontal tab.
>>>>>> 2. Insert an EOL after each verse end character.
>>>>>>
>>>>>> Observe that the above two steps are wholly reversible
>>>>>> such that the original text stream can be restored later.
>>>>>>
>>>>>> In effect the text stream is now in verse per line (VPL)
>>>>>> layout, albeit without verse tags. Some adjustments may
>>>>>> be necessary if there are any section headings, etc.
>>>>>>
>>>>>> 3. Add line numbers with the first number being reset to
>>>>>> 1 at the start of each chapter, numbers incrementing by 1
>>>>>> for each line.
>>>>>> 4. Add a left margin USFM verse tag \v_
>>>>>>
>>>>>> Steps 3 & 4 can be implemented in various ways. For my
>>>>>> part, I’d use a bespoke TextPipe filter.
>>>>>>
>>>>>> Another method to consider might be to use Excel
>>>>>> formulae. I recall resorting to such a method in the
>>>>>> early days of Go Bible.
>>>>>>
>>>>>> Now restore the original layout by reverting steps 2 & 1,
>>>>>> if this is really necessary.
>>>>>> That is, if the original text layout appeared to be paragraphed.
>>>>>>
>>>>>> 5. Decide how & where to insert paragraph tags.
>>>>>>
>>>>>> 6. Add chapter tags, book ID and main title tags, etc.
>>>>>>
>>>>>> Hope this gives some useful suggestions that point
>>>>>> towards a practical solution.
>>>>>>
>>>>>> Best regards
>>>>>>
>>>>>> David
>>>>>>
>>>>>>
>>>>>> Sent from ProtonMail Mobile
>>>>>>
>>>>>>
>>>>>> On Mon, May 13, 2019 at 14:57, Michael H
>>>>>> <cma...@gmail.com <mailto:cma...@gmail.com>> wrote:
>>>>>>> Cyrille
>>>>>>>
>>>>>>> LibreOffice Draw attempts to open the PageMaker file,
>>>>>>> with limited success. But it confirms that even in the
>>>>>>> PageMaker source, the verse numbers are a separate text
>>>>>>> stream. With this source, there is no way to copy the
>>>>>>> text with verse numbers intact. Each book appears to be
>>>>>>> stored as its own text stream in the PageMaker file. LO Draw
>>>>>>> isn't rendering all of the pages, only the first 10, so
>>>>>>> I've only explored Matthew further.
>>>>>>>
>>>>>>> Based on Matthew only, the verses seem to all end with
>>>>>>> the character "-" or ";/", which should aid in the
>>>>>>> reconstruction. I've looked through the PDF and this
>>>>>>> seems to be the case for all books visually as well.
>>>>>>> However, this isn't perfect: I find 1107 of these
>>>>>>> characters in Matthew, instead of the expected 1071
>>>>>>> verses. But since the text stream has a book
>>>>>>> introduction, this is likely easily explained. Hopefully
>>>>>>> this gets you well down the path to creating a stream
>>>>>>> with verses.
>>>>>>>
>>>>>>> I would NOT start from the PDF file, but from the
>>>>>>> PageMaker file. The PDF almost certainly has a lot of
>>>>>>> text rearranging and extra characters like page numbers
>>>>>>> and running heads.
>>>>>>> PageMaker has the book text in a single stream, in a form
>>>>>>> that will convert to Unicode relatively easily.
>>>>>>
>>>>>> _______________________________________________
>>>>>> sword-devel mailing list: sword-devel@crosswire.org
>>>>>> <mailto:sword-devel@crosswire.org>
>>>>>> http://www.crosswire.org/mailman/listinfo/sword-devel
>>>>>> Instructions to unsubscribe/change your settings at above page
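
For anyone who wants to try scripting it, here is a rough Python sketch of steps 1 to 4 from David's list in this thread (tab for each EOL, EOL after each verse end, per-chapter numbering, \v tag). Several things here are assumptions and not from the thread: the function names, that the stream contains no tab characters to begin with, that "-" and ";/" really do mark verse ends (Michael's count of 1107 against 1071 suggests they do not always), and that the number of verses in each chapter is known from a standard versification and passed in as `verses_per_chapter`. It also writes \c chapter lines, which David lists separately as step 6.

```python
import re
from itertools import islice


def to_verse_per_line(stream, verse_end_markers):
    # Step 1: replace every EOL with a horizontal tab.
    folded = stream.replace("\r\n", "\n").replace("\n", "\t")
    # Step 2: insert an EOL after each verse-end marker. Longest marker
    # first, so a multi-character marker is matched before a shorter one.
    ordered = sorted(verse_end_markers, key=len, reverse=True)
    pattern = "|".join(re.escape(marker) for marker in ordered)
    broken = re.sub(pattern, lambda m: m.group(0) + "\n", folded)
    return broken.split("\n")


def restore_layout(lines):
    # Undo step 2 (drop the inserted EOLs), then step 1 (tabs back to EOLs).
    # Only exact if the original stream had no tabs of its own.
    return "".join(lines).replace("\t", "\n")


def tag_verses(lines, verses_per_chapter):
    # Steps 3 and 4: number the lines from 1 within each chapter and put
    # a USFM \v tag in the left margin. Tabs become spaces in the output.
    texts = (line.replace("\t", " ").strip() for line in lines)
    verses = iter([text for text in texts if text])
    tagged = []
    for chapter, count in enumerate(verses_per_chapter, start=1):
        chunk = list(islice(verses, count))
        if not chunk:
            break  # ran out of text before running out of chapters
        tagged.append(f"\\c {chapter}")
        for number, text in enumerate(chunk, start=1):
            tagged.append(f"\\v {number} {text}")
    # Anything left over means the marker count and the versification
    # disagree, which needs a human to look at it.
    return tagged, list(verses)


if __name__ == "__main__":
    # Made-up Latin placeholder text, not real data from the PageMaker file.
    sample = "one;/ two\nmore- three;/\n"
    vpl = to_verse_per_line(sample, ["-", ";/"])
    tagged, leftover = tag_verses(vpl, [2, 5])
    print("\n".join(tagged))
    print("leftover lines:", len(leftover))
```

A non-empty `leftover` list, or fewer tagged verses than expected, is the signal that introductions, headings or stray markers still need to be removed by hand before the numbering can be trusted.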