Hi Malte,

This is a modified function that Ken, Richard (and maybe Jacque) had a hand in some time ago. It does essentially the same thing that Slava suggested, but I offer it as it has helped me.
Thanks
Ron

function RawDataToUTF16 pData
   -- Examine the data to determine encoding:
   -- UTF8 has 0xEF 0xBB 0xBF
   -- UTF16BE has 0xFE 0xFF
   -- UTF16LE has 0xFF 0xFE
   switch
      case charToNum(byte 1 of pData) = 0
         -- No BOM, but a leading null byte suggests big-endian UTF-16
         put "UTF16BE" into tTextEncoding
         break
      case charToNum(byte 1 of pData) = 0xFE and charToNum(byte 2 of pData) = 0xFF
         delete byte 1 to 2 of pData
         put "UTF16BE" into tTextEncoding
         break
      case charToNum(byte 1 of pData) = 0xFF and charToNum(byte 2 of pData) = 0xFE
         delete byte 1 to 2 of pData
         put "UTF16LE" into tTextEncoding
         break
      case charToNum(byte 1 of pData) = 0xEF and charToNum(byte 2 of pData) = 0xBB \
            and charToNum(byte 3 of pData) = 0xBF
         -- UTF-8 BOM (it shows up as "Ôªø" when viewed as MacRoman); strip it
         delete byte 1 to 3 of pData
         put "UTF8" into tTextEncoding
         break
      default
         put "UTF8" into tTextEncoding
         break
   end switch
   --
   if tTextEncoding begins with "UTF16" then
      -- Check byte order, swapping if needed:
      if the processor is "x86" then
         put "LE" into tHostByteOrder
      else
         put "BE" into tHostByteOrder
      end if
      if char -2 to -1 of tTextEncoding <> tHostByteOrder then
         put swapbytes(pData) into pData
      end if
      -- Already utf16, so nothing more needs to be done:
      #put uniEncode(uniDecode(pData, utf16),16) into tFieldData
      put pData into tFieldData
   else
      -- Convert from utf8 to Rev's native utf16:
      put uniEncode(pData, "utf8") into tFieldData
   end if
   -- Normalise line endings. "Åv" is presumably the two Shift-JIS bytes
   -- 0x81 0x76 seen as MacRoman; that character's UTF-16 form contains
   -- byte 13, so it is swapped out while CRs are replaced.
   replace uniEncode("Åv","Japanese") with "**" in tFieldData
   replace CRLF with cr in tFieldData
   replace numToChar(13) with cr in tFieldData --affects japanese ?
   replace "**" with uniEncode("Åv","Japanese") in tFieldData
   return tFieldData
end RawDataToUTF16

On Wed, Jun 1, 2011 at 10:56 PM, Slava Paperno <sl...@lexiconbridge.com> wrote:
> Malte,
>
> As I said, I'm discovering these things as I go--I hadn't even heard of LC
> until last month. I'm finding that work with Unicode in LC involves a lot of
> jumping through hoops, but so far I have been able to do everything I
> needed. So don't give up :)
>
> I am not sure why your stack doesn't "know" whether the text in your field
> is UTF-16 or plain ANSI, but here is what I do:
>
> When I read some text from a file into a variable, I assume that it is
> UTF-8.
> There is no harm in that. Even if it turns out to be plain English,
> it can still be treated that way.
>
> When I assign that text to a field, I always use
>
> set the unicodeText of field MyField to uniEncode(myVar, "UTF8")
>
> Now the text in the field is UTF-16. I check to see if the first two bytes
> are decimal 255 followed by decimal 254 (or the reverse, 254 followed by
> 255), and if they are, I delete them, because that's the BOM.
>
> I can read and edit the field using the system's multilanguage input, like
> the Russian keyboard in Windows. Russian and English can be typed in any
> combination, but it is still all UTF-16. Each letter and each punctuation
> mark is a two-byte sequence. If you call length(the unicodeText of field
> MyField) it will report twice the number of characters that you see in the
> field.
>
> So if I have to access character N in the field, I do this:
>
> set the useUnicode to true
> put char (2 * N - 1) to (2 * N) of the unicodeText of field MyField into myChar
> answer charToNum(myChar)
>
> That will show you a decimal number, like 1072 if myChar is a lower case
> Cyrillic a or an ASCII number if it is an English letter.
>
> Even plain English letters must be accessed like that, as two bytes. For
> English, one byte is a null and the other is the ASCII of the letter
> (which of the two comes first depends on the machine's byte order), but
> you don't need to think of that. Just treat every letter as a
> two-char sequence.
>
> If the user types in that field, what he types is in UTF-16.
>
> If I need to do anything with the text in the field, like store it to a
> file, I read it into a variable:
>
> put the unicodeText of field MyField into myVar2
>
> and immediately convert it to UTF-8:
>
> put uniDecode(myVar2, "UTF8") into myVar2
>
> Now myVar2 is UTF-8 and can be stored in a file or processed by scripts.
>
> There are apparently limitations to what you can do with Cyrillic in LC, but
> the things that I have listed all work for me.
>
> Slava
>
>> -----Original Message-----
>> From: use-livecode-boun...@lists.runrev.com [mailto:use-livecode-
>> boun...@lists.runrev.com] On Behalf Of Malte Brill
>> Sent: Wednesday, June 01, 2011 9:23 AM
>> To: use-livecode@lists.runrev.com
>> Subject: Re: Re: Cyrillic input
>>
>> Thanks Mark and Slava!
>>
>> Well, this is getting me a bit further. Now if only I knew if I could
>> reliably check if the text in my field is regular ASCII or UTF encoded,
>> that would really make my day.
>>
>> Cheers,
>>
>> Malte
>
> _______________________________________________
> use-livecode mailing list
> use-livecode@lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your subscription
> preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode
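
On Malte's question of telling plain ASCII from other text in a field, Slava's observation that every character in the unicodeText is a two-byte pair suggests one possible check: walk the pairs and see whether the high byte is always null and the low byte is below 128. The following is only a minimal sketch under that assumption. The handler name unicodeTextIsASCII is hypothetical (it is not from the thread or from LiveCode itself), it probes the host byte order with uniEncode("A") rather than the processor property, and it has not been tested against any particular LiveCode version.

```livecode
-- Hypothetical helper: returns true if every two-byte pair in pUTF16
-- (e.g. the unicodeText of a field) is a 7-bit ASCII character.
function unicodeTextIsASCII pUTF16
   -- Probe the host byte order: uniEncode("A") should be 65,0 on
   -- little-endian hosts and 0,65 on big-endian ones.
   if charToNum(byte 1 of uniEncode("A")) = 65 then
      put 0 into tLowOffset   -- low byte comes first
      put 1 into tHighOffset
   else
      put 1 into tLowOffset   -- high byte comes first
      put 0 into tHighOffset
   end if
   -- Step through the data one two-byte character at a time.
   -- A trailing odd byte, if any, is ignored.
   repeat with i = 1 to length(pUTF16) - 1 step 2
      if charToNum(byte (i + tHighOffset) of pUTF16) <> 0 then return false
      if charToNum(byte (i + tLowOffset) of pUTF16) > 127 then return false
   end repeat
   return true
end unicodeTextIsASCII
```

It could be called as unicodeTextIsASCII(the unicodeText of field "MyField"). Checking the byte order matters because a character such as Cyrillic U+0400 also has one null byte in its pair, so testing for "one null, one small value" without regard to order would misreport it as ASCII.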