Regexp `([ [:punct:]]\xAD|\xAD[ [:punct:]])` is a reasonable definition for a "useless soft hyphen", unless in the language there is a punctuation mark that is used as part of a word.
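As a rough illustration of that definition and its exception, here is a minimal Python sketch. It is an assumption-laden approximation, not what any SWORD tool does: Python's `re` module has no `[:punct:]` class, so the sketch treats any character whose Unicode general category starts with "P" as punctuation, which is close to but not identical to the POSIX class. The names `is_boundary`, `strip_useless_soft_hyphens` and `word_punct` are hypothetical.

```python
import unicodedata

SOFT_HYPHEN = "\u00AD"


def is_boundary(ch, word_punct=frozenset()):
    """Return True for a space or a punctuation mark that is not
    declared to be part of a word in the language at hand."""
    if ch in word_punct:
        return False
    # Assumption: Unicode categories Pc, Pd, Ps, Pe, Pi, Pf, Po stand in
    # for the POSIX [:punct:] class used in the regexp.
    return ch == " " or unicodedata.category(ch).startswith("P")


def strip_useless_soft_hyphens(text, word_punct=frozenset()):
    """Drop every soft hyphen that sits next to a space or punctuation mark."""
    out = []
    for i, ch in enumerate(text):
        if ch == SOFT_HYPHEN:
            prev_ch = text[i - 1] if i > 0 else None
            next_ch = text[i + 1] if i + 1 < len(text) else None
            if (prev_ch is not None and is_boundary(prev_ch, word_punct)) or (
                next_ch is not None and is_boundary(next_ch, word_punct)
            ):
                continue  # useless: no word to hyphenate on one side
        out.append(ch)
    return "".join(out)
```

For a language where U+2019 is a letter, one would pass `word_punct={"\u2019"}` so that a soft hyphen next to it is kept; with the default empty set it is removed like any other punctuation-adjacent soft hyphen.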
The inventors of some alphabets (e.g. Hawaiian and Tongan) chose more wisely than others by allocating for the glottal stop the character called "modifier letter turned comma" rather than the simple apostrophe or "right single quotation mark".

But I take your point. And yes, Lingala does use the "right single quotation mark" as part of a word.

Code point  Char  Count  Name
U+00AB      «     5,959  LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
U+00BB      »     5,956  RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
U+2018      ‘       694  LEFT SINGLE QUOTATION MARK
U+2019      ’     8,072  RIGHT SINGLE QUOTATION MARK
U+201A      ‚         3  SINGLE LOW-9 QUOTATION MARK
U+201C      “        34  LEFT DOUBLE QUOTATION MARK
U+201D      ”        21  RIGHT DOUBLE QUOTATION MARK

That there are more than 8,000 instances of U+2019 is evidence of this use. Some of the left single quotation marks may be typos, or there may be some real use as a third level of quotation mark?

Anyway, I just checked the OSIS XML file that Cyrille sent 5 days ago. There were no occurrences of either `\xAD\x{2019}` or `\x{2019}\xAD`, so in that sense it was safe to remove the "useless" ones.

Yet given the overall purpose of the soft hyphen, it now seems to me that this is really a question far better addressed during text development than during module build or within SWORD filtering.

Fr Cyrille has agreed that we can postprocess the generated OSIS file by removing them. The only unsettled question concerns the USFM files themselves.

Best regards,

David

--
Sent from: http://sword-dev.350566.n4.nabble.com/
_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page