On 14/05/2019 22:48, Michael H wrote:
> You should be able to configure a regex search to find the verse
> boundaries.
>
> Once you have verse boundaries, if you lay the text out as one verse
> per line, it should be possible to assign each row a chapter and verse
> number from a reference. That is, the 3341st verse in the New Testament
> is usually John 20:31 (I don't have that memorized, just an example.)
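A minimal sketch of the second step described above, assuming the text is already one verse per line and that a versification table (verse counts per chapter) is available. The `VERSIFICATION` excerpt and the function names are illustrative and not taken from any SWORD tool; only the first three chapters of Matthew are filled in.

```python
# Hypothetical versification excerpt: (book, chapter, number of verses).
# A real table would list every chapter of the New Testament.
VERSIFICATION = [
    ("MAT", 1, 25),
    ("MAT", 2, 23),
    ("MAT", 3, 17),
]


def references(versification):
    """Yield (book, chapter, verse) for every verse, in canonical order."""
    for book, chapter, count in versification:
        for verse in range(1, count + 1):
            yield book, chapter, verse


def label_verses(lines, versification):
    """Pair each verse-per-line row with its reference.

    Raises ValueError when the number of rows does not match the table,
    which usually means a verse boundary was missed or split in two.
    """
    refs = list(references(versification))
    if len(lines) != len(refs):
        raise ValueError(
            f"{len(lines)} lines but {len(refs)} verses in the table"
        )
    return [(b, c, v, text) for (b, c, v), text in zip(refs, lines)]
```

The length check is deliberate: a mismatch is the cheapest signal that the boundary search needs another look before any numbers are assigned.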
I have no idea how to do this :)
>
> On Tue, May 14, 2019 at 3:22 PM Cyrille <lafricai...@gmail.com> wrote:
>
>     Ok thank you! I already have all the text in Unicode, but without
>     the verse and chapter numbers... I have begun manually...
>
>     On 14/05/2019 22:17, David Haslam wrote:
>>     Hi Cyrille
>>
>>     If I can find the time tomorrow or later, I’ll have a look at
>>     what might be feasible.
>>
>>     Thanks for all these useful links.
>>
>>     David
>>
>>     Sent from ProtonMail Mobile
>>
>>
>>     On Tue, May 14, 2019 at 14:08, Cyrille <lafricai...@gmail.com> wrote:
>>>     I am sending my message again because it was too big.
>>>
>>>     The conversion to UTF-8 is 99% solved!! I used an online converter:
>>>
>>>     https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html
>>>     or:
>>>     http://burglish.my-mm.org/latest/trunk/web/fontconv.htm
>>>
>>>     See the result here
>>>     <https://framadrop.org/r/jKnYnvuQIH#mE+FWcvzD1N/Omnfr7uWMZmI/HZUUVPdvnVVkBFyFrA=>.
>>>
>>>     Now the only problem is how to get the verse and chapter numbers...
>>>
>>>
>>>     On 14/05/2019 13:53, Michael H wrote:
>>>>     Cyrille, (Peter),
>>>>
>>>>     Maybe further discussion on this belongs in Gitlab as issues.
>>>>     Can I get added to this project?
>>>>
>>>> Here are the first few lines of Matthew copied from the PDF:
>>>> ------
>>>> &Sifrmaw;OD; {0Ha*vdusrf;
>>>> The Gospel According to Matthew
>>>> ed'gef;
>>>> usr;f ûyy*k Kd¾v f &iS rf maw;O;D \b0rwS wf r;f
>>>> usr;f ûyy*k Kd¾v f &iS rf maw;O;Don f *gavav;,e,rf S*sL;vrl
>>>> sK;d tmvaf z;O;D \om;jzp\f / (rmu k2;14)
>>>> olonf tcGefcHoltjzpf trIxrf;chJonf/ (vk 5;27)
>>>> a,Zl;ocif\aemufvdkufwynfhrjzpfrD ol\trnfrSm
>>>> av0djzp\f / ool n f wad b;&,d tidk tf e;DwGi f a,Z;lociEf iS
>>>> ahf wG U Ny;D
>>>>
>>>> -----
>>>> And here are the first few lines of Matthew copied from the
>>>> PageMaker file:
>>>> -----
>>>> Sifrmaw;OD; {0Ha*vdusrf;
>>>> The Gospel According to Matthew
>>>> ed'gef;
>>>> usrf;�yyk*�dKvf &Sifrmaw;OD;\b0rSwfwrf;
>>>> usrf;�yyk*�dKvf &Sifrmaw;OD;onf *gavav;,e,frS *sL;vlrsKd;
>>>> tmvfaz;OD;\om;jzpf\/ (rmuk 2;14) olonf tcGefcHoltjzpf
>>>> trIxrf;chJonf/ (vk 5;27) a,Zl;ocif\aemufvdkufwynfhrjzpfrD
>>>> ol\trnfrSm av0djzpf\/ olonf wdab;&d,tkdifteD;wGif
>>>> a,Zl;ocifESifhawGU NyD;
>>>>
>>>>
>>>> You can see that some letters have changed, and some others are
>>>> in a different order.
>>>>
>>>> The letters that change are likely those code points that aren't
>>>> compatible with Unicode, which PageMaker reassigned to
>>>> ensure that the file is more widely viewable. Since a
>>>> conversion is already planned, these won't matter as much, but
>>>> the font embedded in the PDF is different from the font
>>>> attached to the PageMaker file. If you do start from the PDF,
>>>> you'll need to extract the font to get the code points.
>>>>
>>>> The problem is that the PDF export from PageMaker sorts the
>>>> letters into the order they appear on the page. Burmese text
>>>> has Indic-style ligatures, where vowels tend to jump over or
>>>> under the previous letters, sometimes back two or three letters.
>>>> If you study the snippets above from the beginning of
>>>> Matthew, you can see there is a difference in order, and that
>>>> some glyphs are modified.
>>>>
>>>> So, from the PDF, letters are out of order, but from PageMaker,
>>>> letters are encoded at control code points. Fixing those code
>>>> points is easy and happens with the Unicode conversion. Fixing
>>>> the letter order is not easy. You'll need a first-language
>>>> speaker and plenty of time.
>>>>
>>>> The guidance I received on another group was to use either LO
>>>> Draw or InDesign to export the text from PageMaker. I'll look
>>>> into LO Draw again, but I don't have access to an older version
>>>> of InDesign (the PageMaker import was removed in CS6).
>>>>
>>>>
>>>> On Mon, May 13, 2019 at 10:40 AM Michael H <cma...@gmail.com> wrote:
>>>>
>>>>     I unzipped the PageMaker file, and when I open
>>>>     NT_Proverb/Pagemaker (10.1mb) with a hex editor, I can
>>>>     'find' all of the book names and see the text there.
>>>>
>>>>     To see the raw text: rename NT_Proverb.pmd > NT_Proverb.zip
>>>>     and open it with a zip archive program. The text is in
>>>>     the Pagemaker file at the top level of the archive, but
>>>>     encoded with a lot of extraneous information. (The English
>>>>     text "Matthew" appears at hex location 7A76972).
>>>>
>>>>     When I open the fonts with FontForge, it suggests
>>>>     the fonts are encoded as Unicode (but the glyphs are
>>>>     obviously not in the right spot.)
>>>>     However, when I copy the text (I copied from LO Draw),
>>>>     paste it into jEdit and save that as Unicode, reopening the
>>>>     file gives a warning: 'not unicode, text may be missing'.
>>>>
>>>>     So, what this means is that there are some glyphs encoded
>>>>     into locations that Unicode treats as control or
>>>>     non-printing codes. The text needs to be dealt with as a
>>>>     specific encoding that matches whatever the original font
>>>>     actually uses.
>>>>     I haven't figured out what the original text
>>>>     files were encoded with. Without that knowledge, I'm not
>>>>     sure my system clipboard or editor (jEdit) will properly
>>>>     respect the glyphs in unusual locations until the
>>>>     conversion to Unicode, and I don't trust myself to be able
>>>>     to detect whether or not it is properly converted.
>>>>
>>>>     On Mon, May 13, 2019 at 10:11 AM Cyrille
>>>>     <lafricai...@gmail.com> wrote:
>>>>
>>>>         David,
>>>>         You are probably right about TECkit
>>>>         <http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=TECkit>;
>>>>         if we get the text, it will help us convert it to Unicode.
>>>>         As for how to get the text, your method is beyond my
>>>>         skills :)
>>>>         If you succeed, please let me know.
>>>>
>>>>         On 13/05/2019 16:21, David Haslam wrote:
>>>>> Given the insights from Michael Hart, it may be
>>>>> feasible to temporarily rearrange the main text stream
>>>>> as follows:
>>>>>
>>>>> 1. Replace every EOL by a horizontal tab.
>>>>> 2. Insert an EOL after each verse end character.
>>>>>
>>>>> Observe that the above two steps are wholly reversible,
>>>>> such that the original text stream can be restored later.
>>>>>
>>>>> In effect, the text stream is now in verse per line
>>>>> (VPL) layout, albeit without verse tags. Some
>>>>> adjustments may be necessary if there are any section
>>>>> headings, etc.
>>>>>
>>>>> 3. Add line numbers, with the first number being reset
>>>>>    to 1 at the start of each chapter, numbers
>>>>>    incrementing by 1 for each line.
>>>>> 4. Add a left margin USFM verse tag \v_
>>>>>
>>>>> Steps 3 & 4 can be implemented in various ways. For my
>>>>> part, I’d use a bespoke TextPipe filter.
>>>>>
>>>>> Another method to consider might be to use Excel
>>>>> formulae. I recall resorting to such a method in the
>>>>> early days of Go Bible.
>>>>>
>>>>> Now restore the original layout by reverting steps 2 &
>>>>> 1, if this is really necessary.
>>>>> That is, if the
>>>>> original text layout appeared to be paragraphed.
>>>>>
>>>>> 5. Decide how & where to insert paragraph tags.
>>>>>
>>>>> 6. Add chapter tags, book ID and main title tags, etc.
>>>>>
>>>>> Hope this gives some useful suggestions that point
>>>>> towards a practical solution.
>>>>>
>>>>> Best regards
>>>>>
>>>>> David
>>>>>
>>>>>
>>>>> Sent from ProtonMail Mobile
>>>>>
>>>>>
>>>>> On Mon, May 13, 2019 at 14:57, Michael H
>>>>> <cma...@gmail.com> wrote:
>>>>>> Cyrille
>>>>>>
>>>>>> LibreOffice Draw attempts to open the PageMaker file,
>>>>>> with limited success. But it confirms that even in
>>>>>> the PageMaker source, the verse numbers are a
>>>>>> separate text stream. With this source, there is no
>>>>>> way to copy the text with verse numbers intact. Each
>>>>>> book appears to be stored in its own text stream in
>>>>>> the PageMaker file. LO Draw isn't rendering all of the
>>>>>> pages, only the first 10, so I've only explored
>>>>>> Matthew further.
>>>>>>
>>>>>> Based on Matthew only, the verses seem to all end
>>>>>> with the character "-" or ";/", which should aid in
>>>>>> the reconstruction. I've looked through the PDF and
>>>>>> this seems to be the case for all books visually as
>>>>>> well. However, this isn't perfect: I find 1107 of
>>>>>> these characters in Matthew, instead of the expected
>>>>>> 1071 verses. But since the text stream has a book
>>>>>> introduction, this is likely easily explained.
>>>>>> Hopefully this gets you well down the path to
>>>>>> creating a stream with verses.
>>>>>>
>>>>>> I would NOT start from the PDF file, but from the
>>>>>> PageMaker file. The PDF almost certainly has a lot
>>>>>> of text rearranging and extra characters like page
>>>>>> numbers and running heads. PageMaker has the book
>>>>>> text in a single stream, in a form that will convert
>>>>>> to Unicode relatively easily.
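David's steps 1, 2 and 4 could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the terminators `"-"` and `";/"` are the ones Michael reports for the legacy-encoded Matthew stream, the function names are made up for this example, and the source text is assumed to contain no tab characters (otherwise step 1 is not reversible).

```python
import re


def to_vpl(text, terminators=("-", ";/")):
    """Steps 1-2: turn every EOL into a tab, then add an EOL after each
    verse-end marker. Longer markers are tried first."""
    folded = text.replace("\n", "\t")
    ordered = sorted(terminators, key=len, reverse=True)
    pattern = "|".join(re.escape(t) for t in ordered)
    return re.sub(pattern, lambda m: m.group(0) + "\n", folded)


def from_vpl(vpl):
    """Revert steps 2 and 1: drop the inserted EOLs, restore the old ones."""
    return vpl.replace("\n", "").replace("\t", "\n")


def tag_chapter(vpl):
    """Steps 3-4 for ONE chapter: number the lines from 1 and prefix each
    with a USFM verse marker. Split on "\\n" only, because the legacy
    encoding may use other control codes as glyphs."""
    lines = [ln.strip(" \t") for ln in vpl.split("\n")]
    lines = [ln for ln in lines if ln]
    return [f"\\v {n} {ln}" for n, ln in enumerate(lines, start=1)]


sample = "abc;/ def\nghi-\njkl;/"
vpl = to_vpl(sample)
print(tag_chapter(vpl))
```

Comparing `len(tag_chapter(...))` with the expected verse count for a chapter is one way to locate the surplus markers (1107 found against 1071 expected in Matthew) before any tags are written out.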
_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page