On 08/24/2012 08:32 AM, Stefan Sperling wrote:
On Thu, Aug 23, 2012 at 07:19:29PM -0400, Geoff Steckel wrote:
Well, yes, using a character set conversion API in stupid ways can
munge data. How does that relate to anything I was saying?
As long as iconv is only used to display data, not to change file
contents, you're perfectly right.
Yes, that's what I meant (sorry if I wasn't clear enough).

Open the file, allow the user to specify the file's encoding
(and maybe auto-detect it somehow, but always allow the user to
override this), load the data into a buffer, convert the buffer
for display, and show it on the screen.

The user can now edit the buffer in the display encoding.

Before saving, convert back to the file's encoding. If that fails
because the user added characters that cannot be represented in the
original encoding, complain and offer the option to save the file
in a suitable encoding.
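That save-time check might look something like the following iconv(3) sketch (the function name and buffer sizing are illustrative, not from mg): conversion stops at the first unconvertible or invalid sequence and reports its byte offset, so the caller can complain to the user instead of writing a truncated file.

```c
/* Sketch only: strict conversion for saving.  On an unconvertible or
 * invalid sequence it stops and reports the byte offset instead of
 * writing partial output.  Error handling is kept minimal. */
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>

/* Returns a malloc'd converted copy (length in *outlen), or NULL with
 * *badoff set to the offset of the offending input byte. */
char *convert_strict(const char *from, const char *to,
                     const char *in, size_t inlen,
                     size_t *outlen, size_t *badoff)
{
    *outlen = 0;
    *badoff = (size_t)-1;

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return NULL;

    size_t cap = inlen * 4 + 16;        /* generous initial guess */
    char *out = malloc(cap);
    if (out == NULL) { iconv_close(cd); return NULL; }
    char *inp = (char *)in, *outp = out;
    size_t inleft = inlen, outleft = cap;

    while (inleft > 0) {
        if (iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t)-1)
            continue;                   /* made progress; loop re-checks */
        if (errno == E2BIG) {           /* output buffer full: grow it */
            size_t used = (size_t)(outp - out);
            char *tmp = realloc(out, cap * 2);
            if (tmp == NULL) { free(out); iconv_close(cd); return NULL; }
            out = tmp;
            cap *= 2;
            outp = out + used;
            outleft = cap - used;
        } else {                        /* EILSEQ/EINVAL: stop, don't truncate */
            *badoff = (size_t)(inp - in);
            free(out);
            iconv_close(cd);
            return NULL;
        }
    }
    *outlen = (size_t)(outp - out);
    iconv_close(cd);
    return out;
}
```

The point is simply that a NULL return leaves the on-disk file alone; the editor can then offer to save in a different encoding.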

A real example is a L***x editor using iconv: open a 5000-line file,
change line 100; line 500 contains a non-conforming character, and the
file is truncated there.


Not pretty.
Yeah, that's obviously not done right.

We can easily imagine other problems, like a mix of character encodings
ending up in a file by accident. Sometimes this is done on purpose,
however, and then the display conversion step gets interesting; but
at a minimum the editor should display one of the encodings correctly and
allow users to switch the display encoding if necessary.

Another real example: bring up a line containing a non-conforming character.
The line appears blank.

I agree that it takes a great deal of care to implement a multi-character-set
editor such that it works on all useful files while displaying in
a particular locale's character set.
Yes, not every combination can be made to work. E.g. displaying any of
the non-latin1 subset of UTF-8 in a latin1 locale just won't work,
and this must be treated as a user error (invalid input or locale
configuration). And that's fine since it's an expected failure mode.
It just needs to be handled in a way that doesn't destroy data.
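
One way to meet that requirement (a sketch, not how any existing editor
does it): convert lossily for display only, substituting a placeholder
for bytes that cannot be represented, while the original file buffer
stays untouched and is what gets written back on save.

```c
/* Sketch: lossy conversion for display only.  Unconvertible bytes are
 * shown as '?' and skipped; the original file buffer is never touched,
 * so a display failure cannot destroy data on save. */
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>

char *to_display(const char *from, const char *to,
                 const char *in, size_t inlen)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return NULL;

    /* Sized generously; a real editor would grow the buffer on E2BIG. */
    char *out = malloc(inlen * 4 + 16);
    if (out == NULL) { iconv_close(cd); return NULL; }
    char *inp = (char *)in, *outp = out;
    size_t inleft = inlen, outleft = inlen * 4 + 15;

    while (inleft > 0) {
        if (iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t)-1)
            break;                      /* all remaining input consumed */
        if (errno != EILSEQ && errno != EINVAL)
            break;                      /* e.g. E2BIG: give up in this sketch */
        *outp++ = '?';                  /* placeholder, display only */
        outleft--;
        inp++;                          /* skip one offending byte */
        inleft--;
        iconv(cd, NULL, NULL, NULL, NULL);  /* reset conversion state */
    }
    *outp = '\0';
    iconv_close(cd);
    return out;
}
```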

It isn't a trivial task by any account, but the result would be useful.

But for this kind of feature to appear in mg we'll need iconv in base.
As a first step, adding a UTF-8 mode to mg, where file content is expected
to be UTF-8 encoded, would be much easier and already quite useful.
Alas, I've managed club membership lists with a mix of names from
many countries. In order to print these, I've had to edit in hex UTF-8
and then download all the glyphs to the printer. Luckily I had
a program which would automatically do the glyph loading.
Printing a list of CD contents presented the same problem.

These would seem to be valid files with multiple irreconcilable
character sets. At least I didn't have to deal with bidirectional encodings.

Another time I was making banners for my office for all the immigrants
and expatriates. Then I did have to deal with bidirectional characters.
It was an interesting problem: I can't read Arabic or Hangul script,
switching locales before each edit was incredibly error-prone,
and most of the banners needed to have an American-English transliteration,
interpretation, or alternate "English" name.

It's difficult to create a useful tool that presents a pleasing display
without requiring the user to deal with U+xxxx escapes at least some of the time.

I really think iconv should have a variant which emits a unique, displayable
alternate decoding for any character not in the current locale and performs
the unique reverse encoding on output. That would allow editing of any file
in any locale while presenting a simple and intuitive display in the great
majority of cases. It would be reasonable to warn the user about mixed
character sets.
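
Concretely, such a variant might look like this wrapper around plain
iconv(3) (a sketch; the escape format and function name are illustrative):
every unconvertible byte is rendered as a "\xNN" escape that a save routine
can invert exactly. A complete implementation would also have to escape
literal backslashes in convertible text, or the mapping is not unique.

```c
/* Sketch of the proposed variant: render each unconvertible input byte
 * as a reversible "\xNN" escape.  NOTE: a real implementation must also
 * escape literal backslashes, or the mapping is not unique; that step
 * is omitted here for brevity. */
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>

char *to_display_escaped(const char *from, const char *to,
                         const char *in, size_t inlen)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return NULL;

    size_t cap = inlen * 4 + 16;    /* each byte expands to at most 4 chars */
    char *out = malloc(cap);
    if (out == NULL) { iconv_close(cd); return NULL; }
    char *inp = (char *)in, *outp = out;
    size_t inleft = inlen, outleft = cap - 1;

    while (inleft > 0) {
        if (iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t)-1)
            break;                      /* everything converted */
        if (errno != EILSEQ && errno != EINVAL)
            break;
        outp += sprintf(outp, "\\x%02X", (unsigned char)*inp);
        outleft -= 4;
        inp++;                          /* consume the escaped byte */
        inleft--;
        iconv(cd, NULL, NULL, NULL, NULL);  /* reset conversion state */
    }
    *outp = '\0';
    iconv_close(cd);
    return out;
}
```

The reverse encoding on save is then a simple scan for "\xNN" sequences,
re-emitting the original bytes exactly.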

Geoff Steckel
