TCHAR is even more ambiguous than wchar_t. If UNICODE is defined, then TCHAR is wchar_t; otherwise, it is plain char. I'm away from my computer, but clucene is definitely converting to UTF-16 or UTF-32 depending on platform, so I think it is always proper Unicode. One way or another, the field needs to be converted to a wchar_t buffer containing UTF-16/32.
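To make the "UTF-16 or UTF-32 depending on platform" point concrete, here is a rough sketch of what such a conversion could look like. The helper name utf8ToWide is made up for illustration; it is not part of clucene or SWORD, and I have not checked it against what clucene does internally. It assumes wchar_t is either 2 bytes (UTF-16, surrogate pairs needed above U+FFFF) or 4 bytes (UTF-32), and it does not reject overlong encodings.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not a clucene or SWORD function): decodes UTF-8 into
// a wchar_t string, producing UTF-16 when wchar_t is 2 bytes and UTF-32
// otherwise. Malformed or truncated sequences become U+FFFD. Overlong
// encodings are NOT rejected in this sketch.
std::wstring utf8ToWide(const std::string &in) {
    std::wstring out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;      // code point being assembled
        std::size_t extra;     // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(static_cast<wchar_t>(0xFFFD)); ++i; continue; }
        ++i;

        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            // Each continuation byte must look like 10xxxxxx.
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                ok = false;
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }
        if (!ok || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            out.push_back(static_cast<wchar_t>(0xFFFD));
            continue;
        }

        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            // 16-bit wchar_t: split into a UTF-16 surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            // 32-bit wchar_t (or a BMP character): store the code point as is.
            out.push_back(static_cast<wchar_t>(cp));
        }
    }
    return out;
}
```

The result of utf8ToWide(content).c_str() could then be handed to code expecting a wide TCHAR string, provided the build really does define TCHAR as wchar_t. That still needs confirming against the clucene headers.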
On 8/28/09, Troy A. Griffitts <scr...@crosswire.org> wrote:
> Thanks again Matthew. Writing quick for lack of time right now.
>
> In general, we avoid the use of wchar_t because it is define differently
> on different systems, making its intended use (as a unicode character)
> holder at best essentially useless for anything other than UTF-16, and
> at least confusing and ambiguous.
>
> I could probably look this up, but since you know where everything is in
> clucene by now...
>
> What EXACTLY is TCHAR defined as (i.e. what is sizeof(TCHAR))? Same on
> all platforms?
>
> What does lucene_utf8towc return? TCHAR? wchar_t?
>
> What I'm trying to determine is:
>
> Is clucene expecting UTF-16
> (which can represent 15 bits of unicode glyph space in 2 bytes,
> reserving the upper bit as a multicode indicator, and if set then moves
> to 4+ bytes after 15 bits)?
>
> ... or is clucene just saying 16 bits of unicode glyph space is good
> enough for government work; we're not gonna worry about the rest?
>
> From the pros in the definition of the method you gave, it sounds like
> knowing the sizeof the return value for lucene_utf8towc might tell us
> the answer.
>
> Thanks again for doing the legwork.
>
> -Troy.
>
>
> Matthew Talbert wrote:
>>>> We have methods to convert to both UTF-16 and UTF-32 in our engine,
>>>> which don't need a fixed length buffer, so I would like to replace:
>>>>
>>>> lucene_utf8towcs(wcharBuffer, content, MAX_CONV_SIZE);
>>>>
>>>> with a call to our code, if we can nail down exactly what clucene wants
>>>> in the resultant wcharBuffer
>>
>> lucene_utf8towcs calls lucene_utf8towc for every character; the
>> comment on the function is this:
>>
>> /**
>> * lucene_utf8towc:
>> * @p: a pointer to Unicode character encoded as UTF-8
>> *
>> * Converts a sequence of bytes encoded as UTF-8 to a Unicode character.
>> * If @p does not point to a valid UTF-8 encoded character, results are
>> * undefined. If you are not sure that the bytes are complete
>> * valid Unicode characters, you should use lucene_utf8towc_validated()
>> * instead.
>> *
>> * Return value: the resulting character
>> **/
>>
>> The call to doc->Add actually expects a TCHAR, so if your utf8 to
>> utf16 conversion can produce a TCHAR, then that's all that would be
>> necessary I think.
>>
>> Matthew
>>
>> _______________________________________________
>> sword-devel mailing list: sword-devel@crosswire.org
>> http://www.crosswire.org/mailman/listinfo/sword-devel
>> Instructions to unsubscribe/change your settings at above page