On Thu, 6 Mar 2003, Adrian Korten wrote:

> We came up against a small problem with our Thai test module. When 
> searching for a word whose characters are part of other words, there is 
> no way to delimit the word. This occurs because Thai has no word breaks. 
> Somehow, the rtf engine seems to break the Thai words reasonably 
> accurately on the display of text. However, that same logic does not 
> seem to be in the search module.

As Troy mentioned, we can turn on ICU's Thai word-breaking for searches.
That, along with the option to display text with whitespace word-breaks
and transliteration with whitespace word-breaks, is the reason I didn't
drop the relatively large Thai dictionary from ICU.
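
To illustrate why segmenting before searching helps, here is a toy sketch.
It is not ICU's algorithm and not SWORD's search code: the four-word
dictionary, the greedy longest-match rule, and all function names are
assumptions made up for the example. It only shows that a plain substring
search finds "water" inside "river", while a search over segmented words
does not.

```python
# Toy word list (assumption); a real segmenter relies on a far larger dictionary.
MOTHER = "แม่"
WATER = "น้ำ"
RIVER = MOTHER + WATER  # "แม่น้ำ": one word, written as "mother" + "water"
BIG = "ใหญ่"
DICTIONARY = {MOTHER, WATER, RIVER, BIG}


def segment(text, dictionary=DICTIONARY):
    """Greedy longest-match segmentation (hypothetical helper).

    Characters that start no dictionary word become one-character tokens.
    """
    longest = max(map(len, dictionary))
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest possible candidate first, then shrink.
        for size in range(min(longest, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if candidate in dictionary:
                break
        else:
            candidate = text[pos]
        words.append(candidate)
        pos += len(candidate)
    return words


def substring_search(text, term):
    """Plain substring match, with no notion of word boundaries."""
    return term in text


def word_search(text, term):
    """Match only when the term is a whole segmented word."""
    return term in segment(text)


if __name__ == "__main__":
    text = RIVER + BIG  # "big river", written without spaces
    print(segment(text))
    print(substring_search(text, WATER))
    print(word_search(text, WATER))
```

The substring search reports a hit for "water" because its characters sit
inside "river"; the word search does not, since segmentation yields "river"
and "big" only.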

> The only alternative that I could come up with is to place Unicode 
> characters in as word breaks. Unicode has various characters to indicate 
> word breaks (non-breaking spaces, hyphenable breaks) invisibly. These 
> would have to be placed in the actual text module as UTF8 characters.

You should encode the text as Unicode recommends, which I assume means
with no divisions between words at all.  Adding tags as Frank suggested
wouldn't help anyway, because the strip filters would remove them before
searching.

--Chris


_______________________________________________
sword-devel mailing list
[EMAIL PROTECTED]
http://www.crosswire.org/mailman/listinfo/sword-devel
