Re: [sword-devel] indexed search discrepancy

Troy A. Griffitts Fri, 28 Aug 2009 15:26:51 -0700

Thanks again Matthew.  Writing quick for lack of time right now.

In general, we avoid the use of wchar_t because it is defined differently
on different systems, making its intended use (as a Unicode character
holder) at best confusing and ambiguous, and at worst essentially useless
for anything other than UTF-16.
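
To make the portability point concrete, here is a minimal sketch that reports how wide wchar_t is on whatever platform compiles it. The helper names wcharBits and wcharHoldsAnyCodePoint are made up for illustration and are not part of SWORD or CLucene. As far as I know the result is typically 16 bits with MSVC on Windows and 32 bits with glibc on Linux, but treat that as an assumption to verify on your own toolchain.

```cpp
#include <climits>
#include <cstddef>

// Number of bits in wchar_t on the platform this is compiled for.
// (Hypothetical helper, for illustration only.)
constexpr std::size_t wcharBits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// True if one wchar_t is wide enough to hold any Unicode code point.
// The highest code point is U+10FFFF, which needs 21 bits.
constexpr bool wcharHoldsAnyCodePoint() {
    return wcharBits() >= 21;
}
```

If wcharHoldsAnyCodePoint() is false, a wchar_t buffer can only carry characters outside the 16-bit range as UTF-16 surrogate pairs, or not at all.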


I could probably look this up, but since you know where everything is in
clucene by now...

What EXACTLY is TCHAR defined as (i.e. what is sizeof(TCHAR))?  Same on
all platforms?

What does lucene_utf8towc return? TCHAR? wchar_t?

What I'm trying to determine is:

Is clucene expecting UTF-16
(which represents code points up to U+FFFF in a single 2-byte unit,
reserving the range 0xD800-0xDFFF for surrogates, and encodes everything
above U+FFFF as a 4-byte surrogate pair)?

... or is clucene just saying 16 bits of unicode glyph space is good
enough for government work; we're not gonna worry about the rest?
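
For reference, the difference between those two possibilities can be sketched roughly as below. Both functions are hypothetical helpers written for this mail, not SWORD or CLucene code: encodeUtf16 follows the standard UTF-16 rules (one unit up to U+FFFF, a surrogate pair above that), while truncateTo16 shows what a consumer that only believes in 16 bits would effectively be left with.

```cpp
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units.
// Hypothetical helper; does not reject invalid input such as
// lone surrogates (0xD800-0xDFFF) or values above 0x10FFFF.
std::vector<std::uint16_t> encodeUtf16(std::uint32_t cp) {
    std::vector<std::uint16_t> out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single 16-bit unit.
        out.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Supplementary planes: subtract 0x10000, leaving 20 bits,
        // then split them into a high and a low surrogate.
        cp -= 0x10000;
        out.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return out;
}

// What a "16 bits is good enough" consumer effectively keeps:
// only the low 16 bits, which is lossy above U+FFFF.
std::uint16_t truncateTo16(std::uint32_t cp) {
    return static_cast<std::uint16_t>(cp & 0xFFFF);
}
```

For example, U+10400 becomes the pair 0xD801 0xDC00 under encodeUtf16, but collapses to 0x0400 (a different character) under truncateTo16.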

From the prose in the definition of the method you gave, it sounds like
knowing the sizeof the return value for lucene_utf8towc might tell us
the answer.
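
Here is a rough sketch of why the return size matters. decodeUtf8 is a hypothetical stand-in that decodes one UTF-8 sequence into a full code point; it is not the clucene implementation, and like the comment quoted below it assumes the input is well formed. fitsIn16Bits (also hypothetical) then says whether that code point could be returned in a 2-byte type without loss.

```cpp
#include <cstdint>

// Decode the single UTF-8 sequence starting at p into a code point.
// Hypothetical stand-in; assumes well-formed input and does no validation.
std::uint32_t decodeUtf8(const unsigned char* p) {
    if (p[0] < 0x80) {
        return p[0];                                   // 1 byte: ASCII
    }
    if ((p[0] & 0xE0) == 0xC0) {                       // 2 bytes
        return ((p[0] & 0x1Fu) << 6) | (p[1] & 0x3Fu);
    }
    if ((p[0] & 0xF0) == 0xE0) {                       // 3 bytes
        return ((p[0] & 0x0Fu) << 12) | ((p[1] & 0x3Fu) << 6)
             | (p[2] & 0x3Fu);
    }
    // 4 bytes: code points above U+FFFF
    return ((p[0] & 0x07u) << 18) | ((p[1] & 0x3Fu) << 12)
         | ((p[2] & 0x3Fu) << 6) | (p[3] & 0x3Fu);
}

// True if the code point can be stored in one 16-bit unit without loss.
bool fitsIn16Bits(std::uint32_t cp) {
    return cp <= 0xFFFF;
}
```

If the real function returns a 2-byte type, any 4-byte UTF-8 sequence would have to be truncated or split somewhere else; if it returns 4 bytes, it can hand back the whole code point.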

Thanks again for doing the legwork.

        -Troy.




Matthew Talbert wrote:
>>> We have methods to convert to both UTF-16 and UTF-32 in our engine,
>>> which don't need a fixed length buffer, so I would like to replace:
>>>
>>> lucene_utf8towcs(wcharBuffer, content, MAX_CONV_SIZE);
>>>
>>> with a call to our code, if we can nail down exactly what clucene wants
>>> in the resultant wcharBuffer
> 
> lucene_utf8towcs calls lucene_utf8towc for every character; the
> comment on the function is this:
> 
> /**
>  * lucene_utf8towc:
>  * @p: a pointer to Unicode character encoded as UTF-8
>  *
>  * Converts a sequence of bytes encoded as UTF-8 to a Unicode character.
>  * If @p does not point to a valid UTF-8 encoded character, results are
>  * undefined. If you are not sure that the bytes are complete
>  * valid Unicode characters, you should use lucene_utf8towc_validated()
>  * instead.
>  *
>  * Return value: the resulting character
>  **/
> 
> The call to doc->Add actually expects a TCHAR, so if your utf8 to
> utf16 conversion can produce a TCHAR, then that's all that would be
> necessary I think.
> 
> Matthew
> 
> _______________________________________________
> sword-devel mailing list: sword-devel@crosswire.org
> http://www.crosswire.org/mailman/listinfo/sword-devel
> Instructions to unsubscribe/change your settings at above page

