On 15/05/2019 19:18, David Haslam wrote:
> Each of the last 1 or 2 characters of each verse is a regular Myanmar
> punctuation mark.

Do you know which mark?

> We need to be careful how we apply this. There may well be
> some exceptions.
>
> Windows users should install BabelPad. This free Unicode text editor
> is highly recommended.
>
> http://www.babelstone.co.uk/Software/BabelPad.html
>
> It will help in all sorts of ways, not least in analysis.
>
> David
>
> Sent from ProtonMail Mobile
>
> On Wed, May 15, 2019 at 18:08, Cyrille <lafricai...@gmail.com> wrote:
>> I have not understood everything yet, but I trust you. If you are
>> willing to explain it to me, I would like to learn :)
>> What I don't understand is how you can find the marker of each verse
>> and chapter in the UTF-8 text. What is the marker in question?
>>
>> On 15/05/2019 19:03, David Haslam wrote:
>>> Michael’s description matches how I imagined the method
>>> during my waking moments this morning. :)
>>>
>>> David
>>>
>>> On Wed, May 15, 2019 at 17:33, Michael H <cma...@gmail.com> wrote:
>>>> I've been working long hours and emailing in my break time. David
>>>> has the basics of converting to VPL.
>>>>
>>>> I would then make the entire work a column in a spreadsheet.
>>>>
>>>> Then, in other columns, insert a list of book/chapter/verse (BCV)
>>>> references in order.
>>>>
>>>> The BCV and verse-text columns should align and can be verified,
>>>> and adjusted where things don't match perfectly, for example where
>>>> 3 John has 15 verses instead of 14.
>>>>
>>>> Once the columns align, you can merge them into another column via
>>>> concatenation operations (&). This last column becomes your output.
>>>>
>>>> The output needs to take into account that section titles and
>>>> section ranges belong in front of the verse marker. That is a more
>>>> complex search and replace, but it can be done successfully.
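The spreadsheet alignment described above could also be scripted. What follows is only a minimal Python sketch, assuming the text is already one verse per line (VPL) and that the verse count of each chapter is known; the function names `expand_refs` and `merge_vpl` and the reference format "Book C:V" are hypothetical, not something taken from the thread.

```python
from typing import List


def expand_refs(book: str, verses_per_chapter: List[int]) -> List[str]:
    """List every 'Book C:V' reference of one book in order."""
    return [
        f"{book} {chapter}:{verse}"
        for chapter, count in enumerate(verses_per_chapter, start=1)
        for verse in range(1, count + 1)
    ]


def merge_vpl(refs: List[str], verse_lines: List[str]) -> List[str]:
    """Concatenate each reference with its verse text (the '&' step).

    Raises ValueError when the two columns differ in length, so that a
    versification mismatch is fixed by hand instead of silently shifting
    every later verse.
    """
    if len(refs) != len(verse_lines):
        raise ValueError(
            f"{len(refs)} references but {len(verse_lines)} verse lines"
        )
    return [f"{ref} {text}" for ref, text in zip(refs, verse_lines)]
```

The length check plays the role of the manual "verify and adjust" step: it only reports that the columns disagree, and deciding where a verse was split or merged still needs a human eye. Section titles are not handled here.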
>>>>
>>>> On Wed, May 15, 2019 at 11:12 AM David Haslam
>>>> <dfh...@protonmail.com> wrote:
>>>>
>>>> The attachment contains a counted list of Myanmar words
>>>> containing a font conversion error.
>>>> /NB. We need to match these words with what they are in the
>>>> legacy font./
>>>>
>>>> This issue should be discussed with the current maintainer of
>>>> the SIL *TECkit* converter, whoever that may be.
>>>>
>>>> It may be worthwhile asking our friends at the SIL *Writing
>>>> Systems Technology* team. See
>>>> https://scripts.sil.org/default
>>>>
>>>> /Aside: My friend Martin Hosken of SIL knew the late Keith
>>>> Stribley - the former webmaster of ThanLwinSoft./
>>>>
>>>> Best regards,
>>>>
>>>> David
>>>>
>>>> Sent with ProtonMail <https://protonmail.com> Secure Email.
>>>>
>>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>> On Wednesday, May 15, 2019 4:41 PM, David Haslam
>>>> <dfh...@protonmail.com> wrote:
>>>>
>>>>> _*Observations:* (continued)_
>>>>>
>>>>> 5. The string "*Kd;*" also looks anomalous. It's found only
>>>>> once, in
>>>>> ကိုယ်တော်၏ဦးခေါင်းတော်အပေါ်၌ လည်း ဤသူသည်ကား ဂျူးလူမျ Kd;တို့၏ဘုရင်၊
>>>>>
>>>>> 6. It's evident from the PDF file that the text is paragraphed
>>>>> with indented first lines. See
>>>>>
>>>>> https://www.dropbox.com/s/do5e675i19xfomf/Screenshot%202019-05-15%2016.29.10.png?dl=0
>>>>>
>>>>> My hunch is that these leading paragraph indents may have been
>>>>> coded within contents.xml as the self-closing
>>>>> element *<text:tab/>*. There are 372 matches to this.
>>>>>
>>>>> So not only do we need to provide chapter and verse tags (plus
>>>>> section headings & parallel passage titles, etc), we also need
>>>>> to reconstruct all the paragraph tags.
>>>>>
>>>>> /NB. All structural XML indents were removed by the filter
>>>>> "Remove blanks at SOL" in the file *contents.pp.txt* that
>>>>> was output by my simple TextPipe filter.
>>>>> So that's quite a different matter./
>>>>>
>>>>> Best regards,
>>>>>
>>>>> David
>>>>>
>>>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>>> On Wednesday, May 15, 2019 2:22 PM, David Haslam
>>>>> <dfh...@protonmail.com> wrote:
>>>>>
>>>>>> _*Observations:* (continued)_
>>>>>>
>>>>>> 4. In addition to the reported instances of the anomalous 3
>>>>>> characters (*È,Ø,ò*) found after the font conversion,
>>>>>> there are 6 instances of the string "*m;*" that are
>>>>>> also probably due to bugs in the converter.
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> David
>>>>>>
>>>>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>>>> On Wednesday, May 15, 2019 12:41 PM, David Haslam
>>>>>> <dfh...@protonmail.com> wrote:
>>>>>>
>>>>>>> Yep - sure - later I can do that.
>>>>>>>
>>>>>>> David
>>>>>>>
>>>>>>> On Wed, May 15, 2019 at 11:26, Cyrille
>>>>>>> <lafricai...@gmail.com> wrote:
>>>>>>>> David, I have no Box account, and I don't want to create one.
>>>>>>>> Could you upload it to https://framadrop.org/ instead? It's
>>>>>>>> totally free and secure (and private).
>>>>>>>> Thank you.
>>>>>>>>
>>>>>>>> On 15/05/2019 11:46, David Haslam wrote:
>>>>>>>>> Interim progress report.
>>>>>>>>>
>>>>>>>>> I downloaded the file Mat_utf8.zip from Cyrille's link and
>>>>>>>>> unzipped the contents to Mat_utf8-odt
>>>>>>>>>
>>>>>>>>> I opened the .odt file using 7-Zip from the Windows Explorer
>>>>>>>>> context menu, and extracted the file contents.xml
>>>>>>>>>
>>>>>>>>> I used the Notepad++ plug-in XMLTools to pretty print the XML
>>>>>>>>> file and saved it as contents.pp.xml
>>>>>>>>> This is simply a layout change that's easier to read.
>>>>>>>>>
>>>>>>>>> I viewed the .pp.xml file in BabelPad, which confirmed that the
>>>>>>>>> non-XML text was (mostly) Myanmar Unicode.
>>>>>>>>>
>>>>>>>>> I used a TextPipe filter to remove all XML tags, blanks from SOL
>>>>>>>>> & EOL and all blank lines.
>>>>>>>>> The output file is now contents.pp.txt
>>>>>>>>>
>>>>>>>>> This is now readable content in Myanmar Unicode, with some
>>>>>>>>> English text such as "The Gospel according Matthew" near
>>>>>>>>> the start.
>>>>>>>>>
>>>>>>>>> The file is best viewed using BabelPad with the option Display
>>>>>>>>> Colours | Colour Code by Script.
>>>>>>>>> This shows Myanmar characters in light green, and non-Myanmar
>>>>>>>>> characters in other colours.
>>>>>>>>>
>>>>>>>>> Observations:
>>>>>>>>> 1. The font conversion to Unicode left a few scattered characters
>>>>>>>>> unconverted. :(
>>>>>>>>>
>>>>>>>>> 0000C8  È  18  LATIN CAPITAL LETTER E WITH GRAVE
>>>>>>>>> 0000D8  Ø  20  LATIN CAPITAL LETTER O WITH STROKE
>>>>>>>>> 0000F2  ò   3  LATIN SMALL LETTER O WITH GRAVE
>>>>>>>>>
>>>>>>>>> The complete character frequency analysis is attached.
>>>>>>>>>
>>>>>>>>> 2. A few verse numbers (?) are still present here and there.
>>>>>>>>> 3. The content contains section headings and parallel passage
>>>>>>>>> headings as well as verse text.
>>>>>>>>>
>>>>>>>>> I have just uploaded the file contents.pp.zip to a new folder in
>>>>>>>>> my Box account and added Cyrille & Michael as viewers.
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>>
>>>>>>>>> David
>>>>>>>>>
>>>>>>>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>>>>>>> On Monday, May 13, 2019 9:19 AM, Cyrille <lafricai...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>> I recently received a modern Myanmar translation of the NT,
>>>>>>>>>> Psalms and Proverbs, with permission to create a new module.
>>>>>>>>>> But the problems are many...
>>>>>>>>>> The first is getting the text.
>>>>>>>>>> I tried different ways, but the document was made with PageMaker!
>>>>>>>>>> I can get the text, but the problem is that I don't have the
>>>>>>>>>> verse numbers, because they are in a parallel column, and when
>>>>>>>>>> I copy it I get only the biblical text.
>>>>>>>>>> I also have a PDF, but when I convert it to text (with pdftotext)
>>>>>>>>>> the columns are mixed.
>>>>>>>>>> Can someone help me with any ideas?
>>>>>>>>>> The next problem is Unicode... The text is not typed in Unicode
>>>>>>>>>> but uses a special font.
>>>>>>>>>> I can send everything you need or push it to git.crosswire.
>>>>>>>>>>
>>>>>>>>>> Thanks for your help.
>>>>>>>>>>
>>>>>>>>>> sword-devel mailing list: sword-devel@crosswire.org
>>>>>>>>>> http://www.crosswire.org/mailman/listinfo/sword-devel
>>>>>>>>>> Instructions to unsubscribe/change your settings at above page
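The extraction steps in the interim progress report (open the .odt as an archive, take the XML out, strip the tags and blank lines) can probably be reproduced without 7-Zip or TextPipe. A minimal Python sketch follows, under these assumptions: the .odt is a standard OpenDocument package, in which the body part is normally stored as `content.xml`; the function name `odt_paragraphs` is hypothetical; and counting `<text:tab/>` elements is only a way to check the hunch about paragraph indents, not a confirmed encoding of them.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of ODF text elements such as <text:p>, <text:h>, <text:tab/>.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
PARA_TAGS = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}
TAB_TAG = f"{{{TEXT_NS}}}tab"


def odt_paragraphs(odt_file):
    """Return (paragraphs, tab_count) for an .odt path or file object.

    Paragraphs and headings are returned in document order with the
    markup removed, surrounding blanks stripped and empty ones dropped.
    """
    with zipfile.ZipFile(odt_file) as package:
        root = ET.fromstring(package.read("content.xml"))

    paragraphs = []
    tab_count = 0
    for element in root.iter():
        if element.tag == TAB_TAG:
            tab_count += 1
        elif element.tag in PARA_TAGS:
            text = "".join(element.itertext()).strip()
            if text:
                paragraphs.append(text)
    return paragraphs, tab_count
```

Writing the returned paragraphs one per line should give something close to contents.pp.txt, while keeping the paragraph boundaries that the tag-stripping filter throws away. Paragraphs nested inside frames or text boxes would appear twice with this simple loop.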
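The character frequency analysis mentioned in the observations (stray È, Ø and ò left behind by the font conversion) can be approximated in a few lines. This is a rough Python sketch, not the tool David used: the function names are hypothetical, and it assumes that anything outside the three Myanmar Unicode blocks, ASCII and whitespace is suspect.

```python
import unicodedata
from collections import Counter

# Unicode blocks that hold Myanmar script characters.
MYANMAR_RANGES = (
    (0x1000, 0x109F),  # Myanmar
    (0xAA60, 0xAA7F),  # Myanmar Extended-A
    (0xA9E0, 0xA9FF),  # Myanmar Extended-B
)


def is_myanmar(ch):
    """True when the character lies in one of the Myanmar blocks."""
    return any(lo <= ord(ch) <= hi for lo, hi in MYANMAR_RANGES)


def stray_characters(text):
    """Report non-ASCII, non-Myanmar characters with their counts.

    Each entry reads 'CODEPOINT CHAR COUNT NAME', sorted by code point.
    """
    counts = Counter(
        ch for ch in text
        if ord(ch) > 0x7F and not ch.isspace() and not is_myanmar(ch)
    )
    return [
        f"{ord(ch):06X} {ch} {n} {unicodedata.name(ch, 'UNNAMED')}"
        for ch, n in sorted(counts.items())
    ]
```

Because ASCII is skipped, this will not catch the anomalous strings "m;" and "Kd;"; those need a separate search for Latin letters inside Myanmar words.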