You should be able to configure a regex search to find the verse boundaries.
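
A minimal sketch of such a regex search, in Python. It assumes the verse-end markers reported for Matthew elsewhere in this thread ("-" or ";/" in the legacy encoding); the names `VERSE_END` and `split_verses` are only illustrative, and the pattern would need checking against the real text (and changing once the text is in Unicode).

```python
import re

# Assumption: verse-end markers as reported for Matthew in this thread
# ("-" or ";/" in the legacy encoding). Verify against the real text.
VERSE_END = re.compile(r";/|-")


def split_verses(text, verse_end=VERSE_END):
    """Split text into verses, keeping each end marker with its verse."""
    verses = []
    start = 0
    for match in verse_end.finditer(text):
        # Collapse line breaks and runs of spaces inside the verse.
        verse = " ".join(text[start:match.end()].split())
        if verse:
            verses.append(verse)
        start = match.end()
    tail = " ".join(text[start:].split())
    if tail:
        verses.append(tail)  # trailing text that has no end marker
    return verses


sample = "aaa bbb-\nccc\nddd;/ eee"
for row, verse in enumerate(split_verses(sample), start=1):
    print(row, verse)
```

Headings and book introductions would also be split by this, so the row count will likely be higher than the verse count until those are removed by hand.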
Once you have verse boundaries and have arranged the text as one verse per line, it should be possible to assign each row a chapter and verse number from a reference. For example, the 3341st verse in the New Testament is usually John 20:31 (I don't have that memorized; it's just an example).

On Tue, May 14, 2019 at 3:22 PM Cyrille <lafricai...@gmail.com> wrote:

> Ok thank you! I already have all the text in Unicode, but without the verse and chapter numbers... I began manually...
>
> On 14/05/2019 22:17, David Haslam wrote:
>
> Hi Cyrille
>
> If I can find the time tomorrow or later, I’ll have a look at what might be feasible.
>
> Thanks for all these useful links.
>
> David
>
> On Tue, May 14, 2019 at 14:08, Cyrille <lafricai...@gmail.com> wrote:
>
> I'm sending my message again because it was too big.
>
> The conversion to UTF-8 is 99% solved!! I used an online converter:
>
> https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html
> or:
> http://burglish.my-mm.org/latest/trunk/web/fontconv.htm
>
> See the result here:
> <https://framadrop.org/r/jKnYnvuQIH#mE+FWcvzD1N/Omnfr7uWMZmI/HZUUVPdvnVVkBFyFrA=>
>
> Now the only problem is how to get the verse and chapter numbers...
>
> On 14/05/2019 13:53, Michael H wrote:
>
> Cyrille, (Peter),
>
> Maybe further discussion on this belongs in Gitlab as issues. Can I get added to this project?
>
> Here are the first few lines of Matthew copied from the PDF:
> ------
> &Sifrmaw;OD; {0Ha*vdusrf;
> The Gospel According to Matthew
> ed'gef;
> usr;f ûyy*k Kd¾v f &iS rf maw;O;D \b0rwS wf r;f
> usr;f ûyy*k Kd¾v f &iS rf maw;O;Don f *gavav;,e,rf S*sL;vrl sK;d tmvaf
> z;O;D \om;jzp\f / (rmu k2;14)
> olonf tcGefcHoltjzpf trIxrf;chJonf/ (vk 5;27)
> a,Zl;ocif\aemufvdkufwynfhrjzpfrD ol\trnfrSm
> av0djzp\f / ool n f wad b;&,d tidk tf e;DwGi f a,Z;lociEf iS ahf wG U Ny;D
>
> -----
> And here are the first few lines of Matthew copied from the Pagemaker file:
> -----
> Sifrmaw;OD; {0Ha*vdusrf;
> The Gospel According to Matthew
> ed'gef;
> usrf;�yyk*�dKvf &Sifrmaw;OD;\b0rSwfwrf;
> usrf;�yyk*�dKvf &Sifrmaw;OD;onf *gavav;,e,frS *sL;vlrsKd;
> tmvfaz;OD;\om;jzpf\/ (rmuk 2;14) olonf tcGefcHoltjzpf trIxrf;chJonf/ (vk
> 5;27) a,Zl;ocif\aemufvdkufwynfhrjzpfrD ol\trnfrSm av0djzpf\/ olonf
> wdab;&d,tkdifteD;wGif a,Zl;ocifESifhawGU NyD;
>
> You can see that some letters have changed, and some others are in a different order.
>
> The letters that changed are likely at code points that aren't compatible with Unicode, which Pagemaker reassigned so that the file is more widely viewable. Since a conversion is already planned, these won't matter as much. However, the font embedded in the PDF is different from the font attached to the Pagemaker file, so if you do start from the PDF, you'll need to extract the font to get the code points.
>
> The problem is that the PDF export from Pagemaker sorts the letters into the order they appear on the page. Burmese text has Indian-style ligatures, where vowels tend to jump over or under the previous letters, sometimes back two or three letters. If you study the snippets above from the beginning of Matthew, you can see that the order differs and that some glyphs are modified.
>
> So, from the PDF the letters are out of order, but from Pagemaker the letters are encoded at control code points.
> Fixing the control points is easy and happens with the Unicode conversion. Fixing the letter order is not easy. You'll need a first-language speaker and plenty of time.
>
> The guidance I received on another group was to use either LO Draw or InDesign to export the text from Pagemaker. I'll look into LO Draw again, but I don't have access to an older version of InDesign (the Pagemaker import was removed in CS6).
>
> On Mon, May 13, 2019 at 10:40 AM Michael H <cma...@gmail.com> wrote:
>
>> I unzipped the Pagemaker file, and when I open NT_Proverb/Pagemaker (10.1mb) with a hex editor, I can 'find' all of the book names and see the text there.
>>
>> To see the raw text: rename NT_Proverb.pmd to NT_Proverb.zip and open it with a zip archive program. The text is in the Pagemaker file at the top level of the archive, but encoded with a lot of extraneous information. (The English text "Matthew" appears at hex location 7A76972.)
>>
>> When I open the fonts with FontForge, it suggests the fonts are encoded as Unicode (but the glyphs are obviously not in the right spot). However, when I copy the text (I copied from LO Draw), paste it into jEdit and save that as Unicode, reopening the file gives the warning 'not unicode, text may be missing'.
>>
>> So, what this means is that there are some glyphs encoded into locations that Unicode treats as control or non-printing codes. The text needs to be dealt with as a specific encoding that matches whatever the original font actually uses. I haven't figured out what the original text files were encoded with. Without that knowledge, I'm not sure my system clipboard or editor (jEdit) will properly respect the glyphs in unusual locations until the conversion to Unicode, and I don't trust myself to be able to detect whether it is properly converted.
>>
>> On Mon, May 13, 2019 at 10:11 AM Cyrille <lafricai...@gmail.com> wrote:
>>
>>> David,
>>> You are probably right about TECkit <http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=TECkit>; if we get the text, it will help us to convert it to Unicode.
>>> As for how to get the text, your method is beyond my skills :)
>>> If you succeed, please let me know.
>>>
>>> On 13/05/2019 16:21, David Haslam wrote:
>>>
>>> Given the insights from Michael Hart, it may be feasible to temporarily rearrange the main text stream as follows:
>>>
>>> 1. Replace every EOL by a horizontal tab.
>>> 2. Insert an EOL after each verse end character.
>>>
>>> Observe that the above two steps are wholly reversible, such that the original text stream can be restored later.
>>>
>>> In effect the text stream is now in verse per line (VPL) layout, albeit without verse tags. Some adjustments may be necessary if there are any section headings, etc.
>>>
>>> 3. Add line numbers with the first number being reset to 1 at the start of each chapter, numbers incrementing by 1 for each line.
>>> 4. Add a left margin USFM verse tag \v_
>>>
>>> Steps 3 and 4 can be implemented in various ways. For my part, I’d use a bespoke TextPipe filter.
>>>
>>> Another method to consider might be to use Excel formulae. I recall resorting to such a method in the early days of Go Bible.
>>>
>>> Now restore the original layout by reverting steps 2 and 1, if this is really necessary. That is, if the original text layout appeared to be paragraphed.
>>>
>>> 5. Decide how and where to insert paragraph tags.
>>>
>>> 6. Add chapter tags, book ID and main title tags, etc.
>>>
>>> Hope this gives some useful suggestions that point towards a practical solution.
>>>
>>> Best regards
>>>
>>> David
>>>
>>> On Mon, May 13, 2019 at 14:57, Michael H <cma...@gmail.com> wrote:
>>>
>>> Cyrille
>>>
>>> LibreOffice Draw attempts to open the Pagemaker file, with limited success. But it confirms that even in the Pagemaker source, the verse numbers are a separate text stream. With this source, there is no way to copy the text with verse numbers intact. Each book appears to be stored in its own text stream in the Pagemaker file. LO Draw isn't rendering all of the pages, only the first 10, so I've only explored Matthew further.
>>>
>>> Based on Matthew only, the verses seem to all end with the character "-" or ";/", which should aid in the reconstruction. I've looked through the PDF and this seems to be the case for all books visually as well. However, this isn't perfect: I find 1107 of these characters in Matthew, instead of the expected 1071 verses. But since the text stream has a book introduction, this is likely easily explained. Hopefully this gets you well down the path to creating a stream with verses.
>>>
>>> I would NOT start from the PDF file, but from the Pagemaker file. The PDF almost certainly has a lot of text rearranging and extra characters like page numbers and running heads. Pagemaker has the book text in a single stream, in a form that will convert to Unicode relatively easily.
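
For the numbering discussed in this thread (assigning each row a chapter and verse from a reference, and adding USFM verse tags), here is a rough Python sketch. It assumes the text is already one verse per line with headings and introductions removed. The function name `number_verses` and the `verses_per_chapter` table are hypothetical, and the counts in the example are toy values, not a real versification. It refuses to number anything when the row count does not match the reference, which is the situation described above for Matthew (1107 markers found against 1071 expected verses).

```python
def number_verses(verses, verses_per_chapter):
    """Return USFM lines with chapter and verse markers for verse-per-line text.

    verses: list of verse strings, one per verse, in reading order.
    verses_per_chapter: verse counts per chapter, taken from a reference
    versification (hypothetical input; not supplied here).
    """
    expected = sum(verses_per_chapter)
    if len(verses) != expected:
        # A mismatch usually means headings or introductions are still present.
        raise ValueError(f"expected {expected} verses, found {len(verses)}")
    lines = []
    rows = iter(verses)
    for chapter, count in enumerate(verses_per_chapter, start=1):
        lines.append(f"\\c {chapter}")
        for verse in range(1, count + 1):
            lines.append(f"\\v {verse} {next(rows)}")
    return lines


# Toy example: two chapters with 3 and 2 verses.
for line in number_verses(["a", "b", "c", "d", "e"], [3, 2]):
    print(line)
```

Book ID, title and paragraph tags would still have to be added separately.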
_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page