On 12/07/2019, Ingo Schwarze <schwa...@usta.de> wrote:
> Hi Ian,
>
> ropers wrote on Fri, Jul 12, 2019 at 01:37:16AM +0200:
>> On 11/07/2019, Ingo Schwarze <schwa...@usta.de> wrote:
>
>>> There is no reason to make it different. ASCII is a subset of Unicode,
>>> with the same numbering. So the "U" looks redundant to me.
>
>> There are several reasons why it isn't redundant:
>
> Your reasons are not part of the solution but part of the problem.
>
> Logically, the task is very simple:
>
> 1. Only UTF-8 input is needed because ASCII is a subset of that,
>    and no other character set or encoding must be supported.
>    (Of course, a method to input arbitrary bytes that do not form
>    characters is also needed, but that is rather tangential to this
>    discussion.)
Oh, no, nono. Just blindly throwing arbitrary eight-bit bytes at the
system isn't what I had in mind with altnumd. That hasn't been what I
had in mind ever since I started reading up on UTF-8 and Unicode in
general.[0]

However, one needs to be careful here: UTF-8-encoded characters are not
strictly the same thing as scan codes or whatever is in the keyboard
buffer. And that's the part I'm reeeally asking for help with; to wit:

>> And again, to me the hardest part of that is figuring out
>> how or where or what exactly to insert into the, or what, keyboard
>> buffer. (...)
>> Any hints on how to even start with that hardest part, or **what to
>> read** or **where to look** would be MORE than welcome.

Grabbing key combos and inserting something back into some keyboard
buffer is still input; rendering the correct UTF-8 character is output.

altnumd with e.g. CP437 support would allow users to type those
decimals (with Alt+numpad), but it should act just the same as if the
correct corresponding U+xxxx code point had been input in some other
way. Wikipedia's CP437 code page table currently, for convenience, also
lists the corresponding Unicode code point for each character[1]:
<https://en.wikipedia.org/wiki/CP437#Character_set>

The idea is to let people type like they already know how to, and it
would produce the same character, even though we're now on Unicode.

HOWEVER, while a CP437 .altnumrc supporting Alt+0 through 255 might be
the default configuration, the more fundamental idea is to make this
completely configurable, which is also how different code pages would
be supported: .altnumrc would contain key-value pairs. On the left, any
decimal is possible, far beyond 255 actually, for
I-need-this-for-the-easter-egg reasons. Thus, on the left: any
unsigned long long decimal.
On the right: any U+xxxx code point (or perhaps a comma-separated list
of code points?):

    # .altnumrc for CP437:
    #####################
    000 U+0000
    001 U+263A
    002 U+263B
    003 U+2665
    004 U+2666
    005 U+2663
    006 U+2660
    007 U+2022
    008 U+25D8
    009 U+25CB
    010 U+25D9
    011 U+2642
    012 U+2640
    013 U+266A
    014 U+266B
    015 U+263C
    (...)
    032 U+0020
    033 U+0021

...and so on. (Dropping leading zeroes would be legal for Alt+0-255;
maybe throughout.)

Want Windows Alt codes instead? No problem: just swap out this
.altnumrc for a CP1252[2] version and restart altnumd. Want
ISO-8859-1[3]? Likewise. Not all numbers would have to be defined; in
fact, both CP1252 and ISO-8859-1 leave some undefined. Heck, if you
wanted, you could roll your own .altnumrc with only two entries, for
the c-cedilla and a-umlaut you mentioned, set at any numbers you
please.

OR at least that's the idea. You're correct that it's all talk so far.
Figuring out if I really can code this? That's where I asked for help
on what to read; see above.

> 2. Physical keys must produce the characters printed on them.
>
> 3. One method is needed to input codepoints numerically, but not
>    more than one.

I kind of agree, except that the Alt code method already exists, has
existed for a long time, and is still widespread.[4] In some sense, any
other new entry method has a higher burden of proof. However, precisely
because altnumd as envisaged already requires most of the U+xxxx code
point-savvy plumbing, I realise that if things actually progressed that
far, then adding on some actually universal Alt+u<codept> entry support
would be marching downhill.

> 4. One method may be convenient to enter often-needed characters
>    quickly (like Compose in X) and likely one method for languages
>    that need very large numbers of characters (i don't know much
>    about those).
>
> Items 1 to 3 are really the meat of the matter. Item 4 is more like
> an add-on for convenience.
The way I read that, you mentioned two things in 4.: the quick shortcut
method and the comprehensive method. altnumd as envisaged *could* do
both, thanks to the crazy plan to make the left-hand decimal a
ridiculously long int; however, it would be much more practical to use
Alt+<numpad> for something short or already-known like CP437, and to
have actual code point support involve the literal entry of Unicode
code points.

Neither of those is a substitute for normal CJK input[5], which I know
a little about. What's normally done, e.g. in Japan, is: people type in
romanization, using the Latin character equivalents for the kana[6]
syllables, and that's automatically converted to kana characters. For
the conversion to kanji there's an input method editor[7], where things
work a little like T9 predictive text entry used to work on ye olde
mobile phone number pad: the IME lists the most common choices first,
and in most cases you just press space and keep typing. These things
don't really get in the way of what we're discussing here; they don't
have much to do with it, in fact. Adding the odd character, sure, but
nobody actually does their normal typing in either Alt codes or U+xxxx.

Btw., if ANYONE:
- knows the currently best-supported, state-of-the-art option for
  Japanese support on OpenBSD <http://ports.su/inputmethods> (i.e.
  which IME port to use), or
- has a spare JAPANESE-layout internal keyboard for a DELL Latitude
  D630 laptop,
shoot me an email.

> That said, i think i'll retire from this thread because we are just
> talking.

No hard feelings. I did want to respond to this email anyway, but
you're under absolutely no obligation to write another response in
turn. Or even read this.

> Besides, i have a strong suspicion that you should pick a
> simpler project, in particular as your first project.

You may be right. It may all come to nothing.

> This one seems seriously difficult conceptually,

I think I've thought much of that through; see above.
And if I fail, these ideas will still be out there, and who knows...

> exceedingly difficult
> technically, in particular regarding the complex kernel-xenocara-userland
> interactions, and *terrifyingly* complicated from a system integration
> perspective - and you know, when the goal is to make something fit
> for practical use (and commit), the system integration part is
> often the most dangerous obstacle in the first place, often challenging
> even for seasoned developers. Nothing wrong with picking a project
> that is *technically* difficult if you feel adventurous (as long
> as it is cleanly self-contained), but do try to start with projects
> where system integration is easy, or expect almost certain eventual
> failure - quite likely after already having invested lots of work.

Actually, this is partly a product of a tiny but related first project
where I have already written (very messy but actual) code. I'd send you
a copy off-list if you're morbidly interested. It's not much, not
pretty, not polished, and not quite finished either. I agree that the
odds of failure are high.[8]

That related project was an ascii program (OpenBSD currently only has
an old man page, and it's rather limited). Surely I could do better?
Famous last words. It turns out that between Unicode and the legacy
codes, figuring out how to CORRECTLY deal with most bytes is really
complex, and I may yet write up my findings, observations and remaining
related questions. To tech@, perhaps.

What I have successfully done, though, is *output* correctly encoded
higher UTF-8 characters to the tty. Yes, to xterm. I hadn't noticed
what you and NilsOla have noticed: that both Compose key and Ctrl+Shift
U+xxxx *input* support in XTerm 344 on OpenBSD 6.5 currently don't seem
to work. Both work in XTerm 322 on Linux. Do we know if that's ever
worked on OpenBSD?

> P.S.
> about broken spam filters:
>
>> I've just noticed yet another false positive where Gmail has
>> classified your email as spam here for the n-th time. I'm not sure if
>> that's just happening to my mailbox, or if it's Gmail-wide or, worse,
>> if lots of MTAs out there treat your emails as spam.
>
> So far, i have heard about outlook.com (which obviously nobody
> should use anyway) occasionally classifying all mail coming from
> the University of Karlsruhe (kit.edu) as spam, and about gmail.com
> doing the same in rare cases. Both of these appear to sometimes
> consider that university - which is among the dozen or so most
> important technical research universities in Germany - as a spam
> site. There is nothing much that i can do about that.
>
>> (There seems to be a trend where big corps are quite happy to
>> discourage people from running their own MTAs
>
> Not just running their own MTAs, also using non-commercial .edu
> infrastructure.
>
>> and increasingly throw their weight around rejecting anything that
>> isn't credentialled up the wazoo with SPF, DKIM, DMARC or whatever,
>
> Of course large advertising corporations do what it takes to grab
> market share, and vendor lock-in by breaking compatibility is a
> classic method for doing that.
>
> If your spam filter is broken, fix it. I can hardly help with that.
> If your ISP won't let you fix it, get a better provider.

I've already created a filter that says never to send your emails to
Spam anymore, but like an arsehole, Gmail still complains thus:

    "This message was not sent to Spam because of a filter that you
    created. [Edit Filters]"

What part of "because I chose to" do they not understand? Google are
notorious for thinking they know better than the user what the latter
wants. You see that attitude in many of their products; that hubris
seems to be a big part of their corporate philosophy. This is almost as
insulting as YouTube telling me off for trusting "the wrong
broadcasters". Yeuch.
> Even if i wanted to contact the kit.edu postmasters to ask whether
> they can do anything about your problem, you didn't provide any
> information whatsoever - like for which exact reason which receiving
> Google mail server classified which sending kit.edu mailserver as
> a spam site, and at which exact time. Such information should be
> sent privately, not to public lists.

I could send you the latest header or Original Sauce(TM) off-list, but
only if you want it. I've fixed the issue on my end, except for
Google's whining. I just thought I'd tell you in case you didn't know.

Thanks so much for your time. :)

Ian

[0] Okay, sorry, I lied again: I have in truth not entirely abandoned
    the idea of perhaps including the "throw exactly those bits at the
    system" functionality as an OPTIONALLY configurable EXTRA feature.
    The issue there is that it would require support for a separate
    syntax in .altnumrc. Perhaps actual 01010101 binary, to avoid any
    confusion with the left-hand side decimals and the right-hand side
    U+hexadecimals? Normal Alt code decimal on the left, hail-Mary
    binary on the right? Of course, throwing random 0x80-0xFF bytes at
    a UTF-8 system could be... interesting, but might have its uses
    for fuzzing and testing.
[1] Sans "U+", which is unusual and misleading to the point of being
    incorrect, but I'm not going to fix that, because let the edit
    warriors slaughter somebody else who is completely correct.
[2] https://en.wikipedia.org/wiki/Windows-1252#Character_set
[3] https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Code_page_layout
[4] Even on systems that do support Unicode. Just not on Unix-likes.
[5] https://en.wikipedia.org/wiki/CJK_characters
[6] https://en.wikipedia.org/wiki/Kana#Hiragana_and_katakana
[7] https://en.wikipedia.org/wiki/Input_method
[8] https://www.youtube.com/watch?v=XiiCkWEavI0&t=42s