[ CCs trimmed, since mail to the bug will reach -policy ]

On Mon, 2003-01-06 at 16:07, Jason Gunthorpe wrote:

> Fixing programs that handle terminal input is a different matter IMHO, it's
> something that should be decided on a more case by case basis, and a lot of
> cases might be effortlessly handled just by extending ncurses/slang

A lot of programs don't use curses...

> I think the philosophy should be that everything should be converted to
> UTF-8 after it is read from the terminal. Programs that interface with the
> terminal need to convert.

I generally agree with that.

> Changing programs that handle terminal input is a far smaller scope than
> changing every program that touches argv and every program that does
> terminal input.

If by 'touching argv' you mean 'modifying and creating output based on',
then I hope you agree that we will almost certainly have to make those
programs grok Unicode anyways, as I said before. UTF-8 is a multibyte
encoding, and traversing and manipulating it correctly generally
requires one to use different string functions (although stuff like
strchr(foo, '.') will still work, since ASCII bytes never occur inside
a multibyte sequence).

> If this route is followed then a huge swath of programs are half correct
> already, their only problem is that they will not be converting utf-8 for
> display. That might be best handled through glibc (again, changing
> *everything* just to get around the lack of utf-8 terminals is insane)

Output is a big problem, I agree. But how exactly do you propose to
modify glibc?

> Well, that's not true. At the shell level everything is tagged. The shell
> knows things returned from readdir are utf-8

No, it doesn't! Even if we force users to run a script which converts
all legacy encodings to UTF-8, people will still have files NFS mounted
readonly on other systems, files that they created using a legacy
program, files on CD-ROM or DVD, etc. What do you mean anyways that
everything on the shell level is tagged? How is that possible?

What if I do something like this:

  touch $(nc www.random.org 80)

> When I mean 'all cases' I mean the cases that come up in a system with only
> UTF-8 names in the filesystem, not one that has mixed encodings already
> in the filesystem, that's hopeless.

But mixed encodings will happen in the real world. It is unavoidable.
There is a lot of legacy data.

> > For the case you named above, I think what should happen is that 'ls'
> > converts all the arguments to UTF-8 for internal processing. For the
> > first argument, UTF-8 validation will fail, so ls will try converting
> > from the locale's charset, which will work. The rest of the arguments
> > will validate as UTF-8, so ls just goes on its way.
>
> Eww, that's gross, it isn't definite that UTF-8 validation will always
> fail for non UTF-8 text, you could easily get lucky and type in a word
> that is valid UTF-8, but needs conversion! That's a terribly subtle UI
> bug.

I agree, it sucks and it's pretty gross. But I don't think there is a
better solution.

> Consider the shell to be a scripting language just like python/java and
> look at how it's handled there - all internal strings are UTF-8, functions
> that read/write to the terminal convert automatically, functions exist to
> convert arbitrary text/files.

Yes, but even in Python/Java/C# or whatever, you don't always know the
encoding for sure; what if you're opening up a Debian changelog? By
default the stream will be opened using the user's locale encoding, but
we already mandated that Debian changelogs be UTF-8.

> You have everything needed to make the shell work uniformly in any
> environment, but some cases might require an iconv, but the iconv is
> required for *all* users, not just those with different locale settings. I
> think that's a good goal.

I don't see how you can make iconv just make everything work.

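To make the 'ls' case above concrete, here is a minimal sketch in Python
of the validate-then-fall-back idea, including the "lucky" input that
Jason is worried about. The function name decode_argument and the
choice of ISO-8859-1 as the legacy locale charset are my assumptions for
illustration only; this is not how any existing ls behaves.

```python
def decode_argument(raw: bytes, locale_charset: str) -> str:
    """Decode one argv element: strict UTF-8 first, then the locale charset.

    Hypothetical helper, for illustration only.
    """
    try:
        # bytes.decode is strict by default, so invalid UTF-8 raises.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(locale_charset)


# Typed in an ISO-8859-1 locale: the lone byte E9 is not valid UTF-8,
# so validation fails and the fallback to the locale charset fires.
legacy = "café".encode("iso-8859-1")
print(decode_argument(legacy, "iso-8859-1"))

# Already UTF-8: validation succeeds and the fallback is never used.
modern = "café".encode("utf-8")
print(decode_argument(modern, "iso-8859-1"))

# The "lucky" case: the two ISO-8859-1 characters "Ã©" are the bytes
# C3 A9, which also happen to be the valid UTF-8 encoding of "é".
# Validation succeeds, so the user's intended text is silently lost.
lucky = "Ã©".encode("iso-8859-1")
print(decode_argument(lucky, "iso-8859-1"))
```

The first two calls give the intended text, but the third returns a
single "é" rather than the two characters the user typed. That is the
subtle failure mode: nothing in the bytes says which reading was meant.
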
> The trouble is, the shell interfaces with the terminal, so it is the only
> thing in a position to know how to convert characters coming from the
> terminal to UTF-8, nothing else can do this.

As I said, I don't think the shell knows everything, and I think just
modifying the shell will not fix everything, even if it did.
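
On the earlier point that UTF-8 needs different string handling: a
rough Python sketch of what does and does not keep working when a
program treats UTF-8 text as plain bytes. The file name and the helper
truncate_utf8 are made up for illustration, and the helper assumes its
input is already valid UTF-8.

```python
name = "naïve.txt".encode("utf-8")  # 'ï' takes two bytes: C3 AF

# Searching for an ASCII byte is safe, which is why strchr(foo, '.')
# keeps working: ASCII values never appear inside a multibyte sequence.
dot = name.find(b".")

# Byte length and character count no longer agree.
byte_len = len(name)
char_len = len(name.decode("utf-8"))


def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut valid UTF-8 to at most max_bytes without splitting a character.

    Hypothetical helper, for illustration only.
    """
    if max_bytes >= len(data):
        return data
    end = max_bytes
    # Continuation bytes have the form 10xxxxxx. Back up until the byte
    # at the cut point starts a new character.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


print(dot, byte_len, char_len)
print(truncate_utf8(name, 3))
```

A naive name[:3] would end in a dangling C3 byte and no longer be valid
UTF-8, whereas the helper backs up to the previous character boundary.
Programs that only copy or compare whole strings are unaffected; the
ones that index, truncate or count characters are the ones that need
real changes.
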