On 5/10/11 9:17 AM, Greg Wooledge wrote:

>>> Is the accented character
>>> a single-byte character, or a multi-byte character, in your locale?
>>
>> a multi-byte character, i think
>> How to confirm that ?

(Keep in mind as you read my answers that I know very little more than
anyone else about Unicode combining characters and character composition.)

>
>> $ echo /Users/thomas/Downloads/réz | h
>> + echo $'/Users/thomas/Downloads/re?\201z'
>> + hexdump -C
>> 00000000  2f 55 73 65 72 73 2f 74  68 6f 6d 61 73 2f 44 6f  |/Users/thomas/Do|
>> 00000010  77 6e 6c 6f 61 64 73 2f  72 65 cc 81 7a 0a        |wnloads/re..z.|
>> 0000001e
>
> Oh... now this is interesting.  In my locale (not the one I'm writing this
> email from, but the one I tested in), an é is 0xc3 0xa9 which is the UTF-8
> encoding of the Unicode character U+00E9, LATIN SMALL LETTER E WITH ACUTE.
>
> In yours, however, it is 0x65 0xcc 0x81 which is U+0065 LATIN SMALL
> LETTER E followed by U+0301 COMBINING ACUTE ACCENT.

That's valid UTF-8, but it is the decomposed form of the character rather
than the precomposed one.  (The shortest-form rule in UTF-8 applies to how
each individual code point is encoded, not to the choice between U+00E9
and U+0065 U+0301.)  The general problem with combining characters still
exists (the one in the message I referenced in an earlier reply), but this
case has more to do with Mac OS X and its use of both precomposed and
decomposed Unicode than anything.

> I'm not intimately familiar with this stuff myself, but it looks like
> a real bastard to me... I thought the point of UTF-8 was that you could
> read it a byte at a time, and know when you encountered a byte that
> signified the start of a multi-byte character.  But apparently not!
> If I'm interpreting this COMBINING ACUTE ACCENT thing properly, the
> only indicator that you are in a multi-byte character comes with the
> *second* byte, so you have to backtrack.  What idiot thought this up?

It's a way to provide a general mechanism for combining characters.
Unicode defines precomposed characters for the most common accented
letters (e.g., U+00E9), and the U+0301 stuff is a way to add accents to
less common characters without spending a code point on each combination.
It is going to be a bitch to handle.

> With that in mind, let's see if I can reproduce some of this problem.
> Please bear in mind that as I paste this from the test environment
> terminal into the email-writing terminal, I have to make some manual
> adjustments to preserve the observed output.

I doubt you would be able to reproduce this on any system but Mac OS X.
Mac OS X keeps filenames in decomposed Unicode and keyboard input in
precomposed Unicode.  Dragging and dropping filenames doesn't do the
decomposed-precomposed conversion.

> wooledg@wooledg:~$ touch $'re\xcc\x81z'
> wooledg@wooledg:~$ echo r?z
> r?z
> wooledg@wooledg:~$ echo r*z
> réz
> wooledg@wooledg:~$ ls -b r*z
> réz
>
> The terminal, when presented with the string of bytes that is the filename,
> renders it as réz.  However, Bash's globbing does NOT recognize this as
> a three-character filename beginning with 'r' and ending with 'z', as
> the r?z glob was not expanded.  ls -b also doesn't think there is anything
> particularly noteworthy about this filename, which is slightly annoying.
>
> (Bash's failure to glob this might be a second bug, or possibly another
> manifestation of the same bug you're pursuing.)

It's not a bug; that really is two characters.  Just because U+00E9 and
the two-character combination U+0065 U+0301 look the same (I think the
term is identical graphemes) doesn't mean they are identical.  On RHEL 5
and Debian 9, at least, the file system stores filenames using the same
characters as used to create them.  You were able to recreate how Mac OS X
stores filenames, but:

> When I double-click and then middle-click to select and paste the filename
> as rendered by the terminal back into the terminal, however, I do not
> get re\xcc\x81z any more; rather, I get r\xc3\xa9z.  So my attempts
> to reproduce your reported problem in this way fail.

Because something does the decomposed-precomposed conversion.

> The next obvious way to reproduce the problem would be to get bash to
> produce the filename itself through tab completion, rather than pasting.
> With that in mind, I'll try to move the file to a different name that
> will be tab-completable.

The other difference is that drag-and-drop on Mac OS X (at least dropping
from the Finder) produces full pathnames.  I was able to reproduce display
problems (which I haven't yet investigated) using that, but not using tab
completion in the way you did.  (And Mac OS X does seem to have a problem
with wcwidth: wcwidth on U+0301 returns 1 instead of 0.)

Chet

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
		 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/
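
For anyone who wants to check the two spellings of é discussed in this
thread without a Mac, here is a minimal Python sketch.  Nothing in it comes
from the original exchange: Python's standard unicodedata and fnmatch
modules are used as stand-ins, fnmatch only approximates what Bash's
globbing does with r?z, and the standard NFC/NFD normalization forms are
assumed to be a reasonable model of the precomposed/decomposed conversion
on Mac OS X.  The helper name describe is hypothetical.

```python
import fnmatch
import unicodedata

# The two spellings discussed in the thread (escapes keep this file ASCII).
precomposed = "r\u00e9z"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "re\u0301z"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT


def describe(s):
    """Return (number of code points, UTF-8 bytes as spaced hex)."""
    utf8 = s.encode("utf-8")
    return len(s), " ".join(f"{b:02x}" for b in utf8)


print(describe(precomposed))   # (3, '72 c3 a9 7a')
print(describe(decomposed))    # (4, '72 65 cc 81 7a')

# The strings render alike but are different sequences of code points.
print(precomposed == decomposed)                                  # False

# Normalizing converts between the two forms.
print(unicodedata.normalize("NFC", decomposed) == precomposed)    # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)    # True

# A '?' wildcard matches exactly one character, so r?z only matches
# the precomposed, three-code-point spelling.
print(fnmatch.fnmatchcase(decomposed, "r?z"))                     # False
print(fnmatch.fnmatchcase(precomposed, "r?z"))                    # True

# U+0301 has a non-zero canonical combining class, which marks it as a
# combining mark rather than a standalone character.
print(unicodedata.combining("\u0301"))                            # 230
```

Both spellings are well-formed UTF-8; they differ only in which code points
they encode, which is why a byte-for-byte or character-for-character
comparison treats them as different names.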