Re: [sword-devel] Fw: Repurposing U+2019 RIGHT SINGLE QUOTATION MARK as a Lexical Word Divider for the SE Asian scripts that have NO SPACE BETWEEN WORDS

Peter von Kaehne Thu, 29 May 2025 09:28:08 -0700

I think this has been discussed well.

This should be done on a semantic level and not with a kludge and a hack.
The obvious semantic solution is to frame words in w tags and then use CSS, a trigger and option, or whatever is agreed from there.

From: sword-devel <[email protected]> on behalf of David Haslam <[email protected]>
Sent: Thursday, May 29, 2025 3:47 pm
To: sword-devel mailing list <[email protected]>
Cc: Modules Issues <[email protected]>; [email protected] <[email protected]>
Subject: [sword-devel] Fw: Repurposing U+2019 RIGHT SINGLE QUOTATION MARK as a Lexical Word Divider for the SE Asian scripts that have NO SPACE BETWEEN WORDS

NB. I have cancelled the earlier email because the attachment was too large for sword-devel.
It had been in the queue for moderator approval.

The eXperimental module KhmerNTx.zip may now be downloaded from this link on my box.net account.

Please see below for the significant details.

Best regards,

David

Sent with Proton Mail secure email.

------- Forwarded Message -------
From: David Haslam <[email protected]>
Date: On Thursday, May 29th, 2025 at 9:26 AM
Subject: Repurposing U+2019 RIGHT SINGLE QUOTATION MARK as a Lexical Word Divider for the SE Asian scripts that have NO SPACE BETWEEN WORDS
To: sword-devel mailing list <[email protected]>
CC: [email protected] <[email protected]>, Modules Issues <[email protected]>

Dear SWORD Developers (and our Modules Team),

While watching the livestream funeral of OT Scholar the late Gordon D Wenham yesterday (St Mary's Church, Charlton Kings), I had a bright idea.

I'd been working recently on potential improvements for the KhmerNT module relating to marking the Lexical Word Divisions.
Khmer is one of the languages of SE Asia whose Writing System (aka Script) largely has NO SPACE BETWEEN WORDS.
Others include: Lao, Thai, Myanmar (aka Burmese), together with other languages in the region that employ one of these scripts (e.g. Isaan).

Until now, the KhmerNT module has used the ZWSP (Zero Width Space, U+200B) to mark lexical word boundaries.
This helps with SWORD search for whole words, because even though the divisions between words are invisible to human eyes, they are accessible to computer software.
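To illustrate why an invisible divider still helps whole-word search, here is a minimal Python sketch. It is not SWORD code; the function names split_words and contains_word are hypothetical, and the sample is just the first four words of I Peter 1:1 joined with ZWSP.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE, invisible when rendered


def split_words(text: str) -> list[str]:
    """Split on ordinary whitespace, then on ZWSP inside each phrase."""
    words = []
    for phrase in text.split():  # str.split() does not treat ZWSP as whitespace
        words.extend(w for w in phrase.split(ZWSP) if w)
    return words


def contains_word(text: str, word: str) -> bool:
    """Whole-word match, as opposed to a plain substring match."""
    return word in split_words(text)


# Hypothetical sample: two phrases, each holding two ZWSP-separated words.
first_phrase = ["ខ្ញុំ", "ពេត្រុស"]
second_phrase = ["ជា", "សាវក"]
sample = ZWSP.join(first_phrase) + " " + ZWSP.join(second_phrase)

words = split_words(sample)
print(len(words), contains_word(sample, second_phrase[1]))
```

Without the ZWSP characters, each phrase would come back as a single unbroken token and the whole-word test would fail.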

Wouldn't it be nice if ... (cue to sing the melody by the Beach Boys) 🎶

- We could instead use a visible Unicode character
- That character could be hidden by means of an existing SWORD filter

There is such a character!!!
U+2019 is one of the codepoints hidden (or changed) by the filter UTF8GreekAccents.

U+2019 (RIGHT SINGLE QUOTATION MARK) is commonly used in digital editions of the NT Greek as the apostrophe, not as a quotation mark.

In NT Greek, it appears in:

- Elisions: When a vowel at the end of a word is dropped (e.g., δι’ instead of διά before a vowel).
- Contractions or abbreviations: e.g., ἐπ’ for ἐπί, καθ’ for κατά.

While U+2019 is typographically correct for apostrophes in modern typesetting, some older or simpler digital texts may use U+0027 (straight apostrophe). However, U+2019 is the preferred character in high-quality, properly typeset Greek texts.

I then set about to test my idea by making a further update to an already eXperimental version of the module, provisionally named KhmerNTx.

It "worked like a dream". 😎

With Greek accents hidden, the text looks like this:
ខ្ញុំពេត្រុស ជាសាវករបស់ព្រះយេស៊ូគ្រិស្ដ ជូនចំពោះពួកអ្នកដែលព្រះជាម្ចាស់បានជ្រើសរើស ហើយដែលបានបែកខ្ញែកគ្នាទៅស្នាក់នៅបណ្ដោះអាសន្ននៅស្រុកប៉ុនតុស ស្រុកកាឡាទី ស្រុកកាប៉ាដូគា ស្រុកអាស៊ី និងស្រុកប៉ីធូនា (I Peter 1:1 [KhmerNTx])

With Greek accents displayed, the text looks like this:

ខ្ញុំ’ពេត្រុស ជា’សាវក’របស់’ព្រះ’យេស៊ូ’គ្រិស្ដ ជូន’ចំពោះ’ពួកអ្នក’ដែល’ព្រះជាម្ចាស់’បាន’ជ្រើសរើស ហើយ’ដែល’បាន’បែកខ្ញែក’គ្នា’ទៅ’ស្នាក់’នៅ’បណ្ដោះអាសន្ន’នៅ’ស្រុក’ប៉ុនតុស ស្រុក’កាឡាទី ស្រុក’កាប៉ាដូគា ស្រុក’អាស៊ី និង’ស្រុក’ប៉ីធូនា (I Peter 1:1 [KhmerNTx])
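The two renderings above differ only in whether U+2019 is present. A rough Python sketch of that toggle follows. It does not call the SWORD library or the UTF8GreekAccents filter; hide_dividers and zwsp_to_divider are hypothetical helper names, and the ZWSP conversion step is only an assumption about how such a text might be prepared.

```python
DIVIDER = "\u2019"  # RIGHT SINGLE QUOTATION MARK, used here as a visible word divider
ZWSP = "\u200b"     # ZERO WIDTH SPACE, the divider used until now


def zwsp_to_divider(text: str) -> str:
    """Assumed preparation step: swap each invisible ZWSP for the visible divider."""
    return text.replace(ZWSP, DIVIDER)


def hide_dividers(text: str) -> str:
    """Imitate the 'accents hidden' view by dropping every U+2019."""
    return text.replace(DIVIDER, "")


# First four words of I Peter 1:1, in two space-separated phrases.
pieces = [["ខ្ញុំ", "ពេត្រុស"], ["ជា", "សាវក"]]
with_zwsp = " ".join(ZWSP.join(p) for p in pieces)

shown = zwsp_to_divider(with_zwsp)   # dividers visible
hidden = hide_dividers(shown)        # dividers removed

print(shown)
print(hidden)
```

One caveat with this approach: any genuine U+2019 in the text (a real apostrophe or closing quote) would be hidden along with the dividers.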

I have attached the compressed module for any of you to explore & play with further.

Aside: The previous update already made use of the OSIS XML w element to enclose each lexical Khmer word. That remains the case.
In this way, the module source text is ready to be adapted for further enhancements such as adding Strong's numbers, etc, to make a Study Edition.
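As a sketch of the w-element markup mentioned in the aside, the following Python wraps each divider-separated word in an OSIS w element. The helper name wrap_words is hypothetical, and whether the real module source keeps the divider character between w elements is not stated here; this sketch simply drops it.

```python
from xml.sax.saxutils import escape

DIVIDER = "\u2019"  # word divider assumed in the input text


def wrap_words(phrase: str, divider: str = DIVIDER) -> str:
    """Wrap each divider-separated word in <w>...</w>, escaping XML specials."""
    return "".join(f"<w>{escape(word)}</w>" for word in phrase.split(divider) if word)


# Hypothetical input: the first two words of I Peter 1:1.
phrase = DIVIDER.join(["ខ្ញុំ", "ពេត្រុស"])
print(wrap_words(phrase))
```

Attributes such as lemma (for Strong's numbers) could later be added to each w element without changing the word segmentation.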

Steve Hyde and the translators in Cambodia are currently preparing to publish the complete Khmer Bible.
He has requested my assistance in improving the actual word divisions for the 39 OT books.
I've already been sent the source text, exported from their database.

Since early May, I have been exploring how the Grok AI engine can make a positive contribution to the success of this challenging task.
More on that subject later.

Best regards,

David

Sent with Proton Mail secure email.

_______________________________________________
sword-devel mailing list: [email protected]
http://crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
