[sword-devel] Soft hyphens - a question about mkfastmod and Lucene search

2020-06-11 Thread David Haslam
If the text of a SWORD module has words that contain a soft hyphen (U+00AD), what happens to these when the Lucene search index is created? Are such soft hyphens stripped by mkfastmod? My understanding is that words that contain an ordinary hyphen U+2010 (or hyphen/minus U+002D) are treated as

Re: [sword-devel] Soft hyphens

2017-11-07 Thread David Haslam
Michael Hart's example from Cebuano (a language of the Philippines) should be treated with a degree of caution before transferring the described method to any of the Bantu languages in Africa. This caution may be warranted on the grounds that the alphabets of several Bantu languages contain digrap

Re: [sword-devel] Soft hyphens

2017-11-04 Thread Cyrille
Hi Michael, Thank you for this information; I have to read it carefully. But can you give me an example of a dic file with hyphenation? On 03/11/2017 at 16:31, Michael H wrote: > Hi Cyrille,  > > I am preparing to study breakpoints for Cebuano to produce a hunspell > hyphenation list, bu

Re: [sword-devel] Soft hyphens

2017-11-03 Thread Michael H
Hi Cyrille, I am preparing to study breakpoints for Cebuano to produce a hunspell hyphenation list, but haven't completed the process of implementing it. I am working from 3 paper Cebuano bibles typeset at different times, and manually copying the existing hyphenated words into a list. Here's my

Re: [sword-devel] Soft hyphens

2017-11-03 Thread Cyrille
It is becoming a bit difficult for me to follow this thread with all these technical terms in another language :-) :-( But what I can tell you is that I am very interested in a hyphenation dictionary. I have already created a spelling dictionary for Kikongo, an

Re: [sword-devel] Soft hyphens

2017-11-03 Thread David Haslam
I had similar thoughts as Michael outlined. This morning, I compiled an Excel workbook tabulating the Lingala words found to contain a soft hyphen. It has been attached to the issue in the GitLab repo. https://gitlab.com/lafricain79/LinVB/issues/10 And - yes - it's not only incomplete as a dict

Re: [sword-devel] Soft hyphens

2017-11-02 Thread Michael H
If you have a list of words with valid hyphenation points, it will be very valuable to someone someday if that list is documented as a spelling dictionary, even if it is known to be incomplete. Finding valid hyphenation points takes the biggest chunk of time in preparing for publication, and in many Afric

Re: [sword-devel] Soft hyphens

2017-11-02 Thread David Haslam
Since my 9:26pm reply, I've been a busy bee, and generated a counted list of the Lingala words that contain a soft hyphen. i.e. After I removed the multiple and "useless" occurrences. There are 4584 such words, though one escapee has just "ambushed" me. 001 ­Israel This one begins with a so

Re: [sword-devel] Soft hyphens

2017-11-02 Thread David Haslam
I didn't ignore it, but you may have missed my reply when you started to compose yours. The ZWNJ is indeed the proper character to use. This is a semantic matter, nothing to do with hyphenated word-wrap at line end, which is solely presentational. David -- Sent from: http://sword-dev.350566.n

Re: [sword-devel] Soft hyphens

2017-11-02 Thread ref...@gmx.net
Hi David, I think Michael has made a point which you ignored in your response - Indic and other scripts. The correct character in most of these places though is likely a zero width non joiner space character, at least it would be in Arabic derived scripts. I think the correct solution is that if we

Re: [sword-devel] Soft hyphens

2017-11-02 Thread David Haslam
What would be of interest as a practical benefit for future typesetters is to prepare a comprehensive replace list for all the longer words in the LinVB source text. The search column would contain the word without a soft hyphen. The replace column would contain the same word with a soft hyphen at
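A minimal sketch of how such a replace list might be generated, assuming the hyphenated words have already been collected from the source text. The function name `build_replace_list` and the sample words are hypothetical, and the break points in the samples are for illustration only:

```python
SHY = "\u00ad"  # soft hyphen (U+00AD)

def build_replace_list(hyphenated_words):
    """Return sorted (search, replace) rows.

    search  = the word with every soft hyphen removed
    replace = the word as it was found, soft hyphens included
    If a word occurs with several different hyphenations, this sketch
    keeps only the first one seen (an arbitrary choice).
    """
    rows = {}
    for word in hyphenated_words:
        rows.setdefault(word.replace(SHY, ""), word)
    return sorted(rows.items())

# Illustrative input only; not taken from the LinVB text.
rows = build_replace_list(["vio\u00adlence", "hy\u00adphen"])
```

Each row could then be written out as one line of a two-column search/replace table for a typesetter.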

Re: [sword-devel] Soft hyphens

2017-11-02 Thread David Haslam
Regexp `([ [:punct:]]\xAD|\xAD[ [:punct:]])` is a reasonable definition for a "useless soft hyphen", unless in the language there is a punctuation mark that is used as part of a word. The inventors of some alphabets chose more wisely than others by allocating for the glottal stop the character cal
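For anyone wanting to try that definition outside TextPipe, here is a rough Python equivalent. Python's `re` module has no POSIX `[:punct:]` class, so this sketch substitutes the ASCII set in `string.punctuation`, which is an assumption and will miss non-ASCII punctuation. The function name is hypothetical.

```python
import re
import string

SHY = "\u00ad"  # soft hyphen (U+00AD)

# Stand-in for "[ [:punct:]]": a space or any ASCII punctuation mark.
# This substitution is an assumption; it does not cover non-ASCII marks.
_EDGE = "[ " + re.escape(string.punctuation) + "]"

# A soft hyphen counts as "useless" here when a space or punctuation mark
# sits immediately before it or immediately after it.
USELESS_SHY = re.compile(f"(?<={_EDGE}){SHY}|{SHY}(?={_EDGE})")

def strip_useless_soft_hyphens(text: str) -> str:
    """Delete soft hyphens adjacent to a space or ASCII punctuation."""
    return USELESS_SHY.sub("", text)
```

Unlike the original pattern, the lookarounds leave the neighbouring space or punctuation mark in place, so only the soft hyphen itself is deleted. The caveat about punctuation marks that are part of a word applies equally to this version.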

Re: [sword-devel] Soft hyphens

2017-11-02 Thread DM Smith
I see your point. For them to be useful, every word should have a soft hyphens between syllables (or intra-word semantic breaks). Not just some. It is just as likely in a dynamic word wrap of a browser (or other etext viewer) whose width can change that any word but the first few on a line will

Re: [sword-devel] Soft hyphens

2017-11-02 Thread David Haslam
They should use the ZWNJ rather than the soft hyphen. ZWNJ = Zero Width Non Joiner U+200C. The caution should not have been necessary. David

Re: [sword-devel] Soft hyphens

2017-11-02 Thread David Haslam
Having soft hyphens to improve readability on hand held small devices is fine in theory, but it's not in practice. The more I've thought about soft hyphens, the more I've understood that their use was a kludge for a particular typesetting task at one time for publishing a printed Bible from Quark

Re: [sword-devel] Soft hyphens

2017-11-02 Thread Cyrille
On 02/11/2017 at 15:28, DM Smith wrote: > I don’t think they should be removed upstream except to fix errors. David > classified these as multiple and useless. Regarding useless, I’m not sure > that “punctuation” is such a universal language construct that it can be > included in such a dete

Re: [sword-devel] Soft hyphens

2017-11-02 Thread DM Smith
I don’t think they should be removed upstream except to fix errors. David classified these as multiple and useless. Regarding useless, I’m not sure that “punctuation” is such a universal language construct that it can be included in such a determination. E.g. An apostrophe is often used as a glo

Re: [sword-devel] Soft hyphens

2017-11-02 Thread Michael H
The nonjoiner (U+200C) is probably the best candidate for a proper replacement, but doing something like that really needs native eyes to confirm it still renders the right way. And the nonjoiner character is likely going to have all the same search functionality that the soft hyphen will. Only whe

Re: [sword-devel] Soft hyphens

2017-11-02 Thread Michael H
CAUTION: The soft hyphen is sometimes used in Indian and East Asian language scripts to prevent two adjacent characters from becoming a combined ligature. This is more common in minor languages. It is commonly used when the font in use while being typed is designed for another language using the s

Re: [sword-devel] Soft hyphens

2017-11-02 Thread Cyrille
I just read your proposal on GitLab, which matches mine. So we agree on doing the work on the OSIS file. Adding the option to the conversion script would be great. On 02/11/2017 at 13:35, Cyrille wrote: > > On 02/11/2017 at 13:25, David Haslam wrote: >> It is a much simpler task to remove ALL soft hyphens rather

Re: [sword-devel] Soft hyphens

2017-11-02 Thread Cyrille
On 02/11/2017 at 13:25, David Haslam wrote: > It is a much simpler task to remove ALL soft hyphens rather than removing > only the delinquent ones! My proposal is to remove them in the OSIS file, maybe during the conversion from USFM to OSIS, with o2u.py. Maybe Ryan would agree to add this in

Re: [sword-devel] Soft hyphens

2017-11-02 Thread Cyrille
> Peter > > Sent from my mobile. Please forgive shortness, typos and weird > autocorrects. > > > Original Message ---- > Subject: Re: [sword-devel] Soft hyphens > From: David Haslam > To: sword-devel@crosswire.org > CC: > > >

Re: [sword-devel] Soft hyphens

2017-11-02 Thread David Haslam
It is a much simpler task to remove ALL soft hyphens rather than removing only the delinquent ones! - multiple soft hyphens at the same position in a word - useless soft hyphens (before or after a space or punctuation mark) Delinquent ones were quite a common occurrence in the Lingala source text

Re: [sword-devel] Soft hyphens

2017-11-02 Thread David Haslam
I am recommending the complete removal of soft hyphens because their use is a typographical kludge, not a semantic construction. See https://crosswire.org/wiki/Converting_SFM_Bibles_to_OSIS#Soft_hyphens Being a kludge, there could never be any possibility that any particular word would always have t

Re: [sword-devel] Soft hyphens

2017-11-02 Thread ref...@gmx.net
Leaving aside the module you are working on, how many other modules have the same problem? If it is a few only, we might as well reissue them and worry about engine enhancement later. Peter Sent from my mobile. Please forgive shortness, typos and weird autocorrects. Original Message ---

Re: [sword-devel] Soft hyphens

2017-11-02 Thread David Haslam
Update: Research results of SWORD search for soft hyphens: In Xiphos there is a problem with the exact search. If the same word occurs in the text both with and without a soft hyphen, - A search for the word with a soft hyphen will find only those instances - A search for the word without a soft

Re: [sword-devel] Soft hyphens

2017-11-01 Thread Peter von Kaehne
On Tue, 2017-10-31 at 23:28 +, Peter von Kaehne wrote: > On Tue, 2017-10-31 at 13:53 -0700, David Haslam wrote: > > > > How does SWORD treat soft hyphens, particularly during search? > > Try it out. Create a non-compressed module with some deliberately > splashed about soft hyphens or other t

Re: [sword-devel] Soft hyphens

2017-11-01 Thread David Haslam
My current focus will be to help upstream. I've made a simple bespoke TextPipe filter that can do two tidy ups: 1. Remove multiple soft hyphens 2. Remove useless soft hyphens NB. A soft hyphen is "useless" unless it's between two word characters. Aside: If I reversed the order of 1 & 2, I thin
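The two tidy-ups can be approximated outside TextPipe. This is only a sketch in Python, not the bespoke filter itself: it assumes "word character" means whatever Python's `\w` matches, and the function name is hypothetical.

```python
import re

SHY = "\u00ad"  # soft hyphen (U+00AD)

def tidy_soft_hyphens(text: str) -> str:
    """Apply the two tidy-ups in order: multiples first, then useless ones."""
    # 1. Collapse any run of two or more soft hyphens into a single one.
    text = re.sub(f"{SHY}{{2,}}", SHY, text)
    # 2. Remove every soft hyphen that is not between two word characters.
    #    The soft hyphen itself is not matched by \w.
    return re.sub(rf"(?<!\w){SHY}|{SHY}(?!\w)", "", text)
```

In this sketch the order matters: if step 2 ran first, each soft hyphen in a doubled pair would have a soft hyphen rather than a word character as one neighbour, so both would be removed and the break point lost.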

Re: [sword-devel] Soft hyphens

2017-10-31 Thread Peter von Kaehne
On Tue, 2017-10-31 at 13:53 -0700, David Haslam wrote: > > How does SWORD treat soft hyphens, particularly during search? Try it out. Create a non-compressed module with some deliberately splashed about soft hyphens or other things you want to test. And then search for the words. with or without

[sword-devel] Soft hyphens

2017-10-31 Thread David Haslam
The problem with invisible characters is that you can all too easily key more than one without realising it. This is the case with soft hyphens, which may be found in a few source texts. For example, in a text development currently on my horizon, there are not only a large number of soft hyph

Re: [sword-devel] Soft hyphens?

2017-04-01 Thread David Haslam
Someone once developed an algorithm called *KUCut* to insert zero width spaces into Thai text. Not sure of the current state of play, but I do know that the text used as the test bed for machine learning was the *ThaiKJV* of Philip Pope, which was the source text for our module. An unrelated disc

Re: [sword-devel] Soft hyphens?

2017-04-01 Thread DM Smith
Can Lucene code be improved? Short answer: No. Long answer: I’ve suggested improvements in the past when it was felt that JSword and SWORD should be able to use the same Lucene indexes. Going from memory, the argument against any change was that a mechanism would be needed to know when the index

Re: [sword-devel] Soft hyphens?

2017-04-01 Thread David Haslam
Interesting. Question prompted by an addition to /Tentative suggestions/ in https://crosswire.org/wiki/CrossWire_KJV#KJV_module: Can the Lucene code be improved ? David -- View this message in context: http://sword-dev.350566.n4.nabble.com/Soft-hyphens-tp4657045p4657048.html Sent from the S

Re: [sword-devel] Soft hyphens?

2017-04-01 Thread DM Smith
SWORD uses Lucene’s StandardAnalyzer which in turn uses WhitespaceTokenizer. It doesn’t use WordDelimiterFilter. As such it doesn’t handle hyphenated words well, including soft hyphen. In Him, DM > On Apr 1, 2017, at 8:56 AM, David Haslam wrote: > > Does SWORD search using Lucene igno

[sword-devel] Soft hyphens?

2017-04-01 Thread David Haslam
Does SWORD search using Lucene ignore the presence of a soft hyphen in any word? i.e. If the user searches for 'violence' and the word in the text was 'vio­lence' would it be found? NB. The second instance contains a soft hyphen \xAD between 'vio' and 'lence'. Best regards, David -- View thi
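To make the question concrete, here is a small Python illustration of why the two spellings differ as strings, and of one way a search layer could fold them together. It says nothing about what SWORD or Lucene actually do; `fold_soft_hyphens` and `matches` are hypothetical helpers.

```python
SHY = "\u00ad"  # soft hyphen (U+00AD)

def fold_soft_hyphens(term: str) -> str:
    """Drop soft hyphens so that 'vio<SHY>lence' folds to 'violence'."""
    return term.replace(SHY, "")

def matches(query: str, word: str) -> bool:
    """Compare a query and a text word with soft hyphens ignored on both sides."""
    return fold_soft_hyphens(query) == fold_soft_hyphens(word)

plain = "violence"
hyphenated = "vio" + SHY + "lence"

# A raw comparison treats the two as different strings.
raw_equal = (plain == hyphenated)        # False
# Folding both sides first makes them compare equal.
folded_equal = matches(plain, hyphenated)  # True
```

Applying the same folding both when the index is built and when the query is parsed would be one way to make the two forms equivalent.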

Re: [sword-devel] Soft hyphens?

2011-07-05 Thread David Haslam
During the past few days, we have been working again on the USFM files for the Belarusian translation by Professor Vasilij S. Semukha. This is one of the few translations I have come across for which the source text files contain a significant number of soft hyphens. Extracted from Character Frequ

Re: [sword-devel] Soft hyphens?

2009-04-24 Thread David Haslam
My question was prompted by recent discussions in the context of mobile phones with narrow displays. Equally valid to ponder in the context of narrow windows on more conventional platforms. The longest two words in the English Bible are the proper names in Isaiah 8:1 (Mahershalalhashbaz) and Psal

Re: [sword-devel] Soft hyphens?

2009-04-23 Thread Eeli Kaikkonen
David Haslam wrote: Which SWORD frontends (if any) correctly handle the soft hyphen (U+00AD) to break up big words when necessary? See http://en.wikipedia.org/wiki/Hyphen#Hyphens_in_computing "When flowing text, a system may consider t

[sword-devel] Soft hyphens?

2009-04-23 Thread David Haslam
Which SWORD frontends (if any) correctly handle the soft hyphen (U+00AD) to break up big words when necessary? See http://en.wikipedia.org/wiki/Hyphen#Hyphens_in_computing "When flowing text, a system may consider the soft hyphen to be a