On 9/30/19 8:39 PM, Geoff Kuenning wrote:

> $'\361' is a valid character in Latin-1, which is how it happened to arise
> in my case.  Also, I tested with the C locale, which should be agnostic to
> character encodings, and got the same result.

That's the strange part. I can't reproduce this with the C locale at all --
it's a separate code path that just treats every byte as a character. I
didn't try a lot of non-UTF8 encodings, but I can't reproduce it on any of
the (mostly western European) ISO8859-1 locales I tried. That's why I ended
up using UTF-8 for my tests and figuring out where the problem was.

> The general Unix philosophy, which in this case says "I'm not going to pass
> judgment on the weird things you do even though I don't understand them",
> argues for being able to handle any arbitrary sequence of bytes, at least
> on Linux.

Yeah, on Linux, at least with the common file systems, the filenames are
still just byte sequences. That's not the case everywhere -- as I said, you
can't even create a file with an invalid byte sequence in the name on Mac
OS X, no matter what your locale is.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
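
For anyone following the thread, here is a minimal sketch of the encoding point discussed above: the single byte 0xF1 (octal 361) is a complete character in ISO-8859-1 but not a valid sequence on its own in UTF-8. This is only an illustration, not the code path in bash that was being debugged. The helper name hex_bytes is hypothetical, and the sketch assumes od, tr, and iconv are installed.

```shell
# Hypothetical helper: print the bytes of its argument as lowercase hex.
hex_bytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

# Build the one-byte string with POSIX printf octal escapes, so the
# sketch does not depend on the shell supporting $'\361' quoting.
name=$(printf '\361')

# Show the raw byte that would end up in the filename.
echo "raw bytes: $(hex_bytes "$name")"

# Interpreted as ISO-8859-1 the byte is a complete character and can be
# converted to UTF-8 (assumes iconv is available).
if latin1=$(printf '%s' "$name" | iconv -f ISO-8859-1 -t UTF-8 2>/dev/null); then
    echo "as ISO-8859-1, re-encoded to UTF-8: $(hex_bytes "$latin1")"
fi

# Interpreted as UTF-8 the byte is an incomplete sequence, so iconv is
# expected to reject it.
if printf '%s' "$name" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
    echo "accepted as UTF-8"
else
    echo "rejected as UTF-8"
fi
```

On Linux with the common file systems, a name like this can be passed to touch as-is, since the kernel stores it as a plain byte sequence. Whether other systems accept it depends on the file system, as noted above.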