Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

Graham Bloice Wed, 29 Jun 2011 02:39:19 -0700

On 28/06/2011 18:27, Guy Harris wrote:
> On Jun 28, 2011, at 6:10 AM, Stig Bjørlykke wrote:
>
>> On Tue, Jun 28, 2011 at 2:58 AM, Guy Harris <g...@alum.mit.edu> wrote:
>>>        1) UN*Xes where LANG etc. aren't set to a locale with UTF-8 as the 
>>> encoding (are you seeing the issue with Norwegian characters on your 
>>> system?  If so, what's the setting of LANG?);
>> I only had issues with Norwegian characters in file names reported via
>> simple_dialog(), and my LANG is empty.
> OK, what OS are you using?  If it's a UN*X, try compiling and running the 
> attached C program; does it print your name correctly on your 
> terminal/terminal emulator (it writes it out in UTF-8), and does the file it 
> creates (your name is its name - yeah, complete with a space between "Stig" 
> and "Bjørlykke", and with no ".txt" at the end) have a name that shows up 
> correctly if you do "ls"?  If it's Windows, then you're probably just seeing 
> bug 5715.
>
>> Another problem is that we still have issues regarding UTF-8 strings
>> in packets.  We should really fix that...
> We have an issue regarding strings in packets in general.  Strings might be 
> in a number of encodings, including ASCII (meaning that any byte with the 8th 
> bit set is something that shouldn't be there), other national variants of ISO 
> 646, UTF-8, UTF-16, UCS-2 (meaning "only the Basic Multilingual Plane, with 
> no surrogate pairs"), ISO 8859/x for various values of x, various ISO 
> 2022-based encodings (e.g., the EUC encodings), various national standards, 
> various DOS and Windows code pages, various Mac OS encodings, EBCDIC, 
> whatever encodings are used for SMS, etc., etc., etc., etc.:
>
>       http://en.wikipedia.org/wiki/Template:Character_encoding
>
> I don't know whether all of the encodings in question can be mapped to 
> Unicode without information loss.  An arbitrary string of octets definitely 
> can't be mapped to UTF-8 without information loss; consider a putatively 
> UTF-8-encoded string that contains an octet sequence that's not valid in 
> UTF-8.
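
To make the information-loss point concrete, here is a minimal Python sketch (Python is just a convenient choice here, and the helper names decode_strict and decode_for_display are made up for illustration). It takes a name whose "ø" is the single ISO 8859-1 octet 0xF8, which is not valid UTF-8, and shows that a strict decode fails while a lenient decode cannot be reversed:

```python
# 0xF8 is "ø" in ISO 8859-1, but a lone 0xF8 never occurs in well-formed UTF-8.
raw = b"Stig Bj\xf8rlykke"

def decode_strict(octets):
    """Return the decoded text, or None if the octets are not valid UTF-8."""
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        return None

def decode_for_display(octets):
    """Decode for humans, substituting U+FFFD for undecodable octets."""
    return octets.decode("utf-8", errors="replace")

shown = decode_for_display(raw)

print(decode_strict(raw))            # strict decoding reports failure
print(ascii(shown))                  # the bad octet became \ufffd
print(shown.encode("utf-8") == raw)  # re-encoding does not give back the input
```

Once the offending octet has been replaced by U+FFFD, nothing in the resulting string says which octet was originally there, which is the loss being described.
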
>
> Perhaps, in the Wireshark dissection engine, we should initially store string 
> values as a pair {encoding, counted octet string} (counted so that octets 
> with the value 0 don't cause problems), and:
>
>       when putting them into a textual representation of the protocol tree or 
> into columns or something else to be shown to humans, map them to UTF-8, with 
> anything that can't be mapped to UTF-8 - including, if the encoding is 
> putatively UTF-8, octet sequences that aren't valid UTF-8 sequences - shown 
> as the Unicode replacement character U+FFFD;
>
>       when comparing them in a display filter, attempt to map them to UTF-8 
> (and save the result), and:
>
>               if the mapping fails, treat *all* comparisons except for 
> inequality as failing, and treat comparisons for inequality as succeeding;
>
>               if the mapping succeeds, compare the two strings;
>
>       when making them available to software inside *Shark (C/C++ code, Lua 
> code, Python code, etc.), attempt to convert them to whatever the appropriate 
> representation is (presumably UTF-8), and have the routines to fetch those 
> values support returning a "conversion failed" indication (or perhaps offer 
> both a "convert for display to humans" version that uses U+FFFD for failure 
> and a "convert for processing" version that returns "can't do it" for 
> failure).
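
A rough Python sketch of how the proposal above might hang together. Everything named here (EncodedString, for_display, for_processing, filter_compare) is hypothetical and is not Wireshark API; Python codec names stand in for whatever encoding identifiers the dissection engine would really use, and the "save the result" caching step is left out for brevity:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class EncodedString:
    """Hypothetical {encoding, counted octet string} pair."""
    encoding: str  # a Python codec name, standing in for an encoding id
    octets: bytes  # bytes carries its own length, so embedded NULs are fine

    def for_display(self) -> str:
        """Text for humans: undecodable octets become U+FFFD."""
        return self.octets.decode(self.encoding, errors="replace")

    def for_processing(self) -> Optional[str]:
        """Text for code: None signals "conversion failed"."""
        try:
            return self.octets.decode(self.encoding)
        except UnicodeDecodeError:
            return None

def filter_compare(value: EncodedString, op: str, literal: str) -> bool:
    """Display-filter style comparison following the rule quoted above."""
    text = value.for_processing()
    if text is None:
        # Mapping failed: every comparison fails except inequality.
        return op == "!="
    if op == "==":
        return text == literal
    if op == "!=":
        return text != literal
    raise ValueError(f"unsupported operator: {op}")

good = EncodedString("latin-1", b"Bj\xf8rlykke")  # valid in its encoding
bad = EncodedString("utf-8", b"Bj\xf8rlykke")     # putatively UTF-8, but invalid
with_nul = EncodedString("ascii", b"a\x00b")      # zero octet inside the string

print(filter_compare(good, "==", "Bjørlykke"))
print(filter_compare(bad, "==", "Bjørlykke"))
print(filter_compare(bad, "!=", "Bjørlykke"))
print(len(with_nul.for_display()))
```

Keeping the raw octets next to the encoding means the two conversions can differ in how they treat failure without either one destroying the original data.
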
>
> Here's the program I mentioned above:
For reference, here's the test executable output on Win7, using the SDK 7.0
build environment (a cmd.exe prompt):


c:\temp>test
Stig Bj├©rlykke
Now creating a file with Stig's name as its name

c:\temp>dir
 Volume in drive C has no label.
 Volume Serial Number is D845-44D4

 Directory of c:\temp

29/06/2011  10:30    <DIR>          .
29/06/2011  10:30    <DIR>          ..
29/06/2011  10:30                17 Stig BjÃ¸rlykke
29/06/2011  10:28            77,312 test.exe
               2 File(s)         77,329 bytes
               2 Dir(s)  65,078,947,840 bytes free

The output of the executable was the same using PowerShell.
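
The two garbled forms above are at least consistent with the program's UTF-8 bytes being interpreted in two different code pages. A quick Python check, assuming the console is using OEM code page 850 and the ANSI code page is 1252 (both are guesses about this machine, not something I have verified):

```python
name = "Stig Bjørlykke"
utf8 = name.encode("utf-8")          # "ø" becomes the two octets 0xC3 0xB8

# Assumption: the console renders output bytes using OEM code page 850.
as_console = utf8.decode("cp850")

# Assumption: the file name bytes are interpreted in ANSI code page 1252.
as_filename = utf8.decode("cp1252")

print(as_console == "Stig Bj├©rlykke")    # matches the program's console output
print(as_filename == "Stig BjÃ¸rlykke")   # matches the name shown by dir
```

If those assumptions hold, neither the console output path nor the file-creation path is treating the bytes as UTF-8.
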

-- 
Regards,

Graham Bloice

___________________________________________________________________________
Sent via:    Wireshark-dev mailing list <wireshark-dev@wireshark.org>
Archives:    http://www.wireshark.org/lists/wireshark-dev
Unsubscribe: https://wireshark.org/mailman/options/wireshark-dev
             mailto:wireshark-dev-requ...@wireshark.org?subject=unsubscribe
