Follow-up Comment #12, bug #65108 (group groff):

[comment #11 comment #11:]
> [comment #0 original submission:]
> > we have no way of knowing what the file system's character encoding is.
> > Might be ISO 8859-1, UTF-8, UTF-16BE/LE, or something else entirely.
> 
> I'm not sure now if that's a meaningful question.  The file system seems to
> just store a string of bytes as the file name, and leave it up to the shell
> how to interpret that.

> $ mkdir foo
> $ cd foo
> $ echo résumé | iconv -tutf8 | xargs touch
> $ echo résumé | iconv -tlatin1 | xargs touch
> $ echo * | od -c
> 0000000   r 303 251   s   u   m 303 251       r 351   s   u   m 351  \n
> 0000020


> Then a UTF-8 shell produces:

> $ ls
>  résumé  'r'$'\351''sum'$'\351'


> and a Latin-1 shell produces:

> $ ls
> résumé  résumé


> That is, both filenames are valid (but different) strings of Latin-1
> characters.  In UTF-8, one of them is a string of valid characters, and one
> has two invalid bytes in it.

It's also valid Latin-2, Latin-5, Latin-9, and KOI8-R, to name four other
encodings supported by _groff_.
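
To make the point concrete, here is a minimal Python sketch (not anything _groff_ does) that tries the two byte strings from the `od` dump above against each encoding named in this thread.  The helper `try_decode` and the list `ENCODINGS` are names made up for illustration; the codec names are Python's, not _groff_'s.

```python
# The two file names from the od(1) dump above, as raw bytes.
utf8_name = b"r\xc3\xa9sum\xc3\xa9"    # r 303 251 s u m 303 251
latin1_name = b"r\xe9sum\xe9"          # r 351 s u m 351

# Python codec names for the encodings mentioned in this thread.
ENCODINGS = ["utf-8", "latin-1", "iso8859-2", "iso8859-9",
             "iso8859-15", "koi8-r"]


def try_decode(raw, encoding):
    """Return the decoded text, or None if raw is not valid in encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# One row per encoding: how each byte string reads, or None if invalid.
for enc in ENCODINGS:
    as_utf8_bytes = try_decode(utf8_name, enc)
    as_latin1_bytes = try_decode(latin1_name, enc)
    print(f"{enc:11} {as_utf8_bytes!r:14} {as_latin1_bytes!r}")
```

Only the UTF-8 row should report `None`, and only for the second name.  Every single-byte encoding accepts both byte strings, though not all of them read the bytes as the same characters.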
 
> This is an ext4 file system, but I would imagine any other Unix-based one
> would have to work the same in order to interact with shells consistently.

I feel like we're saying the same thing, or compatible things.

A file named "résumé1.ms" might be stored on the file system using either of
those character encodings, or, on a Windows system, using UTF-16LE.  A _groff_
user with a document that wants to `so` that file name:


$ grep -F .so résumé.ms
.so résumé1.ms
.so résumé2.ms
.so résumé3.ms


...is going to need either an encoding match between résumé.ms's contents
and their file system, or some sophistication about character encodings.

That's why I want to be able to support:


$ grep -F .so résumé.ms
.so r\[u00E9]sum\[u00E9]1.ms
.so r\[u00E9]sum\[u00E9]2.ms
.so r\[u00E9]sum\[u00E9]3.ms


That way a person doesn't have to _preconv_ their document.

Or they *did* _preconv_ their document, and this is what the program left them
with, because that tool has no sense of context regarding requests that take
file name arguments: `so`, `soquiet`, `mso`, `msoquiet`, `open`, `opena`,
`psbb`, `cf`, `fp`, `hpf`, `hpfa`, `nx`, or `trf`.
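
As a rough illustration of what supporting such file names would involve, here is a Python sketch, not _groff_ code.  It assumes the escapes have the simple `\[uXXXX]` form with four to six uppercase hexadecimal digits, and it ignores composite forms.  The names `expand_escapes` and `file_name_bytes` are hypothetical, and the target encoding is passed in by the caller because, as discussed above, it can't be discovered from the file system.

```python
import re

# Simple \[uXXXX] escapes only (assumption: 4 to 6 uppercase hex digits).
UNICODE_ESCAPE = re.compile(r"\\\[u([0-9A-F]{4,6})\]")


def expand_escapes(name):
    """Replace each \\[uXXXX] escape in name with the character it names."""
    return UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), name)


def file_name_bytes(name, fs_encoding):
    """Return the bytes to hand to the file system for a file name argument.

    fs_encoding is the caller's guess or configuration; nothing here
    detects it.
    """
    return expand_escapes(name).encode(fs_encoding)


arg = r"r\[u00E9]sum\[u00E9]1.ms"
print(expand_escapes(arg))
print(file_name_bytes(arg, "utf-8"))
print(file_name_bytes(arg, "latin-1"))
```

The same argument yields different byte strings depending on the encoding chosen, which is the reason a user-visible way to select that encoding would still be needed.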

I feel like we might be talking past each other...?


    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?65108>

