On 28/06/2011 18:27, Guy Harris wrote:
> On Jun 28, 2011, at 6:10 AM, Stig Bjørlykke wrote:
>
>> On Tue, Jun 28, 2011 at 2:58 AM, Guy Harris <g...@alum.mit.edu> wrote:
>>> 1) UN*Xes where LANG etc. aren't set to a locale with UTF-8 as the
>>> encoding (are you seeing the issue with Norwegian characters on your
>>> system? If so, what's the setting of LANG?);
>> I only had issues with Norwegian characters in file names reported via
>> simple_dialog(), and my LANG is empty.
> OK, what OS are you using? If it's a UN*X, try compiling and running the
> attached C program; does it print your name correctly on your
> terminal/terminal emulator (it writes it out in UTF-8), and does the file it
> creates (your name is its name - yeah, complete with a space between "Stig"
> and "Bjørlykke", and with no ".txt" at the end) have a name that shows up
> correctly if you do "ls"? If it's Windows, then you're probably just seeing
> bug 5715.
>
>> Another problem is that we still have issues regarding UTF-8 strings
>> in packets. We should really fix that...
> We have an issue regarding strings in packets in general. Strings might be
> in a number of encodings, including ASCII (meaning that any byte with the 8th
> bit set is something that shouldn't be there), other national variants of ISO
> 646, UTF-8, UTF-16, UCS-2 (meaning "only the Basic Multilingual plane, with
> no surrogate pairs"), ISO 8859/x for various values of x, various ISO
> 2022-based encodings (e.g., the EUC encodings), various national standards,
> various DOS and Windows code pages, various Mac OS encodings, EBCDIC,
> whatever encodings are used for SMS, etc., etc., etc, etc.:
>
>     http://en.wikipedia.org/wiki/Template:Character_encoding
>
> I don't know whether all of the encodings in question can be mapped to
> Unicode without information loss. An arbitrary string of octets definitely
> can't be mapped to UTF-8 without information loss; consider a putatively
> UTF-8-encoded string that contains an octet sequence that's not valid in
> UTF-8.
>
> Perhaps, in the Wireshark dissection engine, we should initially store string
> values as a pair {encoding, counted octet string} (counted so that octets
> with the value 0 don't cause problems), and:
>
>     when putting them into a textual representation of the protocol tree or
> into columns or something else to be shown to humans, map them to UTF-8, with
> anything that can't be mapped to UTF-8 - including, if the encoding is
> putatively UTF-8, octet sequences that aren't valid UTF-8 sequences - shown
> as the Unicode replacement character U+FFFD;
>
>     when comparing them in a display filter, attempt to map them to UTF-8
> (and save the result), and:
>
>         if the mapping fails, treat *all* comparisons except for
> inequality as failing, and treat comparisons for inequality as succeeding;
>
>         if the mapping succeeds, compare the two strings;
>
>     when making them available to software inside *Shark (C/C++ code, Lua
> code, Python code, etc.), attempt to convert them to whatever the appropriate
> representation is (presumably UTF-8), and have the routines to fetch those
> values support returning a "conversion failed" indication (or perhaps offer
> both a "convert for display to humans" version that uses U+FFFD for failure
> and a "convert for processing" version that returns "can't do it" for
> failure).
>
> Here's the program I mentioned above:

For reference, here's the test executable output on Win7, using the SDK 7.0
build environment (a cmd.prompt):

c:\temp>test
Stig Bj├©rlykke
Now creating a file with Stig's name as its name

c:\temp>dir
 Volume in drive C has no label.
 Volume Serial Number is D845-44D4

 Directory of c:\temp

29/06/2011  10:30    <DIR>          .
29/06/2011  10:30    <DIR>          ..
29/06/2011  10:30                17 Stig Bjørlykke
29/06/2011  10:28            77,312 test.exe
               2 File(s)         77,329 bytes
               2 Dir(s)  65,078,947,840 bytes free

The output of the executable was the same using Powershell.

-- 
Regards,

Graham Bloice
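
As a rough illustration of the {encoding, counted octet string} proposal
quoted above, here is a minimal Python sketch. It is not Wireshark code: the
names PacketString, for_display, for_processing and filter_compare are
hypothetical, Python codec names stand in for whatever encoding identifiers
the dissection engine would actually use, and the "save the result" caching
step is left out.

```python
import operator
from dataclasses import dataclass
from typing import Optional

# Comparison operators a display filter might apply to a string field.
_OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


@dataclass(frozen=True)
class PacketString:
    """Hypothetical {encoding, counted octet string} pair."""
    encoding: str   # assumed to be a Python codec name, e.g. "utf-8"
    octets: bytes   # bytes carries its own length, so 0x00 octets are fine

    def for_display(self) -> str:
        # "Convert for display to humans": anything that cannot be mapped
        # is shown as the replacement character U+FFFD.
        return self.octets.decode(self.encoding, errors="replace")

    def for_processing(self) -> Optional[str]:
        # "Convert for processing": report failure instead of guessing.
        try:
            return self.octets.decode(self.encoding, errors="strict")
        except UnicodeDecodeError:
            return None


def filter_compare(field: PacketString, op: str, literal: str) -> bool:
    """Compare a field against a filter literal using the proposed rules."""
    value = field.for_processing()
    if value is None:
        # Mapping failed: only inequality succeeds, everything else fails.
        return op == "!="
    return _OPS[op](value, literal)


if __name__ == "__main__":
    # 0xF8 is "ø" in ISO 8859-1 but is not a valid octet in UTF-8.
    latin1 = PacketString("latin-1", b"Stig Bj\xf8rlykke")
    bogus = PacketString("utf-8", b"Stig Bj\xf8rlykke")
    print(latin1.for_display())
    print(bogus.for_display())
    print(filter_compare(bogus, "==", "Stig Bj\u00f8rlykke"))
    print(filter_compare(bogus, "!=", "Stig Bj\u00f8rlykke"))
```

Keeping the raw octets alongside the declared encoding means the display and
processing conversions can differ in how they treat bad input without either
one destroying the original data.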

Sent via:    Wireshark-dev mailing list <wireshark-dev@wireshark.org>
Archives:    http://www.wireshark.org/lists/wireshark-dev