On Fri, Aug 28, 2009 at 7:12 PM, Troy A. Griffitts <scr...@crosswire.org> wrote:
>
> Matthew Talbert wrote:
>> TCHAR is even more ambiguous than wchar_t. If UNICODE is defined then
>> TCHAR is wchar_t; otherwise, it is plain char. I'm away from my
>> computer, but clucene is definitely converting to UTF-16 or UTF-32
>> depending on platform, so I think it is always proper Unicode. One way
>> or another, the field needs to be converted to a wchar_t containing
>> UTF-16/32.
>
> Thanks again Matthew. Can you confirm what I think you've said above:
>
> clucene checks the platform (maybe with something like sizeof(wchar_t))
> and then converts the UTF-8 stream to either a UTF-16 or a UTF-32
> encoded stream? This is hard for me to understand, but it is what I
> think you've stated. Here's why.
>
> You may understand this, but just to make sure: converting a
> variable-length encoding like UTF-8 to single 16-bit values is not
> UTF-16.
>
> There are only a few choices for what lucene_utf8towc can return:
> 32 bits, 16 bits, or some other crazy thing.
>
> *** 32 bits:
> If lucene_utf8towc always returns a single 32-bit value to represent
> the given UTF-8 character, then clucene can handle the full range of
> Unicode, and we still have investigation to do into what
> lucene_utf8towcs does with the return value from lucene_utf8towc.
>
> *** 16 bits:
> If lucene_utf8towc always returns a single 16-bit or 32-bit value, and
> presuming the comment on the method to be true, we should be able to
> conclude that clucene cannot handle the full range of Unicode
> characters on platforms that define wchar_t as 16 bits. 16 bits is not
> enough to represent all Unicode values in a single value.
>
> *** some other crazy thing:
> If lucene_utf8towc somehow can return multiple 16-bit values to
> represent a single character (not sure how it could do this AND have
> the comment on the method still be true without a crazy return object
> (list<wchar_t>?)), then indeed your assessment as I understand it makes
> sense: clucene checks the platform (maybe with something like
> sizeof(wchar_t)) and then converts the UTF-8 stream to either a UTF-16
> or a UTF-32 encoded stream.
>
> So, just to confirm, does lucene_utf8towc really have some way of
> returning multiple values for a single Unicode character on platforms
> that define wchar_t as 16 bits?
>
> Since clucene uses wchar_t, my expected conclusion would have been
> (*** 16 bits), above: full range supported on Linux, 16 bits of
> code-point space supported on Windows.
>
> Thanks again. Please don't rush to a computer to investigate if you're
> not sure. I can also pull the clucene source down when I get home
> tonight.
>
> -Troy.
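Before I get to what I found, here is what your 16-bit vs. 32-bit cases
look like for a character outside the Basic Multilingual Plane, using
U+1D11E (musical G clef) as the example. This is just a minimal
standalone sketch, not CLucene code: a 32-bit wchar_t can hold the code
point directly, while real UTF-16 has to split it into a surrogate pair
of two 16-bit units, so a single 16-bit value can never represent it.

    #include <cstdint>
    #include <cstdio>

    int main() {
        uint32_t cp = 0x1D11E;   // one Unicode code point; needs more than 16 bits

        // *** 32 bits: a 4-byte wchar_t holds the code point directly.
        uint32_t utf32 = cp;     // 0x0001D11E

        // *** 16 bits: a single 2-byte value cannot hold it; UTF-16
        // splits it into a surrogate pair of two 16-bit units.
        uint32_t v = cp - 0x10000;
        uint16_t high = 0xD800 + (v >> 10);    // 0xD834
        uint16_t low  = 0xDC00 + (v & 0x3FF);  // 0xDD1E

        printf("UTF-32: %08X  UTF-16: %04X %04X\n", utf32, high, low);
        return 0;
    }

So "multiple 16-bit values for a single character" is exactly what a
correct UTF-16 conversion would have to produce, which is why the
comment on lucene_utf8towc matters so much.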
OK, I'm still unclear on what's happening after spending time digging
through the source. The part that poses the biggest problem is that
lucene_utf8towc appears (to me) to be getting a correct, 32-bit value,
which it stores as an int. However, it then assigns this int directly
to a wchar_t. So if I'm understanding this correctly, if the Unicode
value happens to be too big to store in 16 bits, this would be
incorrect. It would essentially cause data corruption, yes?

However, there may be more going on here than just this. For instance,
there's a "repl_wchar.h" file, a "PlatformWin32.h" file, and other
config files, all spending a good deal of code on things like _UCS2. So
it's possible that somehow it works correctly. However, the fact that I
got different search results on Windows indicates that there is
certainly a difference. I got more results; would that indicate that
wchar_t is bigger or smaller? I think it would mean smaller.

Matthew
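P.S. Here is a minimal sketch of the assignment I am describing
(assumed shape only, not the actual CLucene source): decode a
supplementary-plane code point into an int, then assign it to wchar_t.
On Linux, where wchar_t is 32 bits, the value survives; on Windows,
where wchar_t is 16 bits, the high bits are silently dropped and you
get an unrelated character, i.e. data corruption.

    #include <cstdio>

    int main() {
        int cp = 0x1D11E;           // code point decoded from UTF-8; needs 17 bits
        wchar_t wc = (wchar_t)cp;   // the suspect assignment

        if (sizeof(wchar_t) == 2)
            // Windows: wc is now 0xD11E, a completely different BMP character.
            printf("truncated: %04X\n", (unsigned)wc);
        else
            // Linux (4-byte wchar_t): the full code point survives.
            printf("preserved: %08X\n", (unsigned)wc);
        return 0;
    }

A correct conversion for a 16-bit wchar_t would instead have to emit the
surrogate pair 0xD834 0xDD1E, i.e. two wchar_t values for one character,
which is what the _UCS2 machinery in repl_wchar.h and PlatformWin32.h
may or may not actually be doing.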