Follow-up Comment #12, bug #65108 (group groff):

[comment #11 comment #11:]
> [comment #0 original submission:]
> > we have no way of knowing what the file system's character encoding is.
> > Might be ISO 8859-1, UTF-8, UTF-16BE/LE, or something else entirely.
>
> I'm not sure now if that's a meaningful question.  The file system seems to just store a string of bytes as the file name, and leave it up to the shell how to interpret that.
>
> $ mkdir foo
> $ cd foo
> $ echo résumé | iconv -tutf8 | xargs touch
> $ echo résumé | iconv -tlatin1 | xargs touch
> $ echo * | od -c
> 0000000   r 303 251   s   u   m 303 251       r 351   s   u   m 351  \n
> 0000020
>
> Then a UTF-8 shell produces:
>
> $ ls
> résumé 'r'$'\351''sum'$'\351'
>
> and a Latin-1 shell produces:
>
> $ ls
> résumé résumé
>
> That is, both filenames are valid (but different) strings of Latin-1 characters.  In UTF-8, one of them is a string of valid characters, and one has two invalid bytes in it.  It's also valid Latin-2, Latin-5, Latin-9, and KOI8-R, to name four other encodings supported by _groff_.
>
> This is an ext4 file system, but I would imagine any other Unix-based one would have to work the same in order to interact with shells consistently.

I feel like we're saying the same thing, or compatible things.

A file named "résumé1.ms" might be stored on the file system using either of those character encodings, or, on a Windows system, using UTF-16LE.

A _groff_ user with a document that wants to `so` that file name:

$ grep -F .so résumé.ms
.so résumé1.ms
.so résumé2.ms
.so résumé3.ms

...is going to need either an encoding match between résumé.ms's contents and their file system, or some sophistication about character encodings.

That's why I want to be able to support:

$ grep -F .so résumé.ms
.so r\[u00E9]sum\[u00E9]1.ms
.so r\[u00E9]sum\[u00E9]2.ms
.so r\[u00E9]sum\[u00E9]3.ms

That way a person doesn't have to _preconv_ their document.  Or they *did* _preconv_ their document, and this is what the program left them with, because that tool has no sense of context regarding requests that take file name arguments: `so`, `soquiet`, `mso`, `msoquiet`, `open`, `opena`, `psbb`, `cf`, `fp`, `hpf`, `hpfa`, `nx`, or `trf`.

I feel like we might be talking past each other...?

    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?65108>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/
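
To make the `\[u00E9]` idea above concrete, here is a minimal sketch in Python.  It is only an illustration, not how _groff_ does or will implement this.  The helper names `expand_unicode_escapes` and `candidate_file_names` are hypothetical, and the sketch assumes the simple `\[uXXXX]` form with upper-case hexadecimal digits, as in the example; composite forms are not handled.  It expands the escapes to code points and then shows which byte string the file name would have to be under each candidate encoding.

```python
import re

# Matches Unicode special character escapes of the simple form \[uXXXX]
# (four to six upper-case hexadecimal digits).  Composite forms such as
# \[u0065_0301] are deliberately left alone in this sketch.
_UNICODE_ESCAPE = re.compile(r"\\\[u([0-9A-F]{4,6})\]")


def expand_unicode_escapes(name: str) -> str:
    """Replace each \\[uXXXX] escape in `name` with the code point it names."""
    return _UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), name)


def candidate_file_names(name: str, encodings=("utf-8", "latin-1")) -> dict:
    """Map each encoding to the bytes the file name would be stored as.

    Encodings that cannot represent the expanded name are skipped.
    """
    text = expand_unicode_escapes(name)
    candidates = {}
    for encoding in encodings:
        try:
            candidates[encoding] = text.encode(encoding)
        except UnicodeEncodeError:
            continue
    return candidates


if __name__ == "__main__":
    names = candidate_file_names(r"r\[u00E9]sum\[u00E9]1.ms")
    for encoding, raw in names.items():
        print(encoding, raw)
    # utf-8 b'r\xc3\xa9sum\xc3\xa91.ms'
    # latin-1 b'r\xe9sum\xe91.ms'
```

The bytes `\xc3\xa9` and `\xe9` are the same values that `od -c` shows in octal as `303 251` and `351` in the quoted transcript.  Which candidate actually exists on disk still has to be decided some other way; the sketch only shows that the escaped form is independent of the document's own encoding.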