On 12/07/2019, Ingo Schwarze <schwa...@usta.de> wrote:
> Hi Ian,
>
> ropers wrote on Fri, Jul 12, 2019 at 01:37:16AM +0200:
>> On 11/07/2019, Ingo Schwarze <schwa...@usta.de> wrote:
>
>>> There is no reason to make it different.  ASCII is a subset of Unicode,
>>> with the same numbering.  So the "U" looks redundant to me.
>
>> There are several reasons why it isn't redundant:
>
> Your reasons are not part of the solution but part of the problem.
>
> Logically, the task is very simple:
>
>  1. Only UTF-8 input is needed because ASCII is a subset of that,
>     and no other character set or encoding must be supported.
>     (Of course, a method to input arbitrary bytes that do not form
>     characters is also needed but that is rather tangential to this
>     discussion).

Oh, no, nono. Just blindly throwing arbitrary eight-bit bytes at the
system isn't what I had in mind with altnumd.  That hasn't been what I
had in mind ever since I started reading up on UTF-8 and Unicode in
general.[0]

However, one needs to be careful here: UTF-8-encoded characters are
not strictly the same thing as scan codes or whatever is in the
keyboard buffer
{
   and that's the part I'm reeeally asking for help with; to wit:
>> And again, to me the hardest part of that is figuring out
>> how or where or what exactly to insert into the, or what, keyboard buffer.
>> (...)
>> Any hints on how to even start with that hardest part, or **what to read**
>> or **where to look** would be MORE than welcome
}.
Grabbing key combos and inserting something back into some keyboard
buffer is still input, rendering the correct UTF-8 character is
output.

altnumd with e.g. CP437 support would allow users to type the
familiar CP437 decimals (with Alt+numpad), but it should act exactly
as if the corresponding U+xxxx code point had been input in some
other way.
  For convenience, Wikipedia's CP437 code page table currently also
lists the corresponding Unicode code point for each character[1]:
<https://en.wikipedia.org/wiki/CP437#Character_set>
  The idea is to let people type the way they already know how, and
have it produce the same characters, even though we're now on Unicode.
  HOWEVER, while a CP437 .altnumrc supporting Alt+0 through 255 might
be the default configuration, the more fundamental idea is to make
this completely configurable, which is also how different code pages
would be supported:
.altnumrc would contain key-value pairs. On the left, any decimal is
possible, far beyond 255 actually, for I-need-this-for-the-easter-egg
reasons. Thus, on the left: any unsigned long long decimal. On the
right: any U+xxxx code point (or perhaps a comma-separated list of
code points?):

#.altnumrc for CP437:
#####################
000 U+0000
001 U+263A
002 U+263B
003 U+2665
004 U+2666
005 U+2663
006 U+2660
007 U+2022
008 U+25D8
009 U+25CB
010 U+25D9
011 U+2642
012 U+2640
013 U+266A
014 U+266B
015 U+263C
(...)
032 U+0020
033 U+0021
...and so on.
(Dropping leading zeroes would be legal for Alt+0-255; maybe throughout.)

Want Windows Alt codes instead? No problem, just swap out this
.altnumrc with a version for CP1252[2] and restart altnumd. Want
ISO-8859-1[3]? Likewise. Not all numbers would have to be defined, and
in fact both CP1252 and ISO-8859-1 leave some undefined. Heck, if you
wanted, you could roll your own .altnumrc with only two entries, for
the c-cedilla and a-umlaut you mentioned, set at any numbers you
please.

OR at least that's the idea. You're correct that it's all talk so far.
Figuring out if I really can code this? That's where I asked for help
on what to read, see above.

>  2. Physical keys must produce the characters printed on them.
>
>  3. One method is needed to input codepoints numerically, but not
>     more than one.

I kind of agree -- except, the Alt code method already exists, and has
existed for a long time, and is still widespread.[4]
In some sense, any other new entry method has a higher burden of proof.
However, precisely because altnumd as envisaged already requires most
of the U+xxxx code point-savvy plumbing, I realise that if things
progressed that far, then adding universal Alt+u<codept> entry
support on top would be marching downhill.

>  4. One method may be convenient to enter often-needed characters
>     quickly (like Compose in X) and likely one method for languages
>     that need very large numbers of characters (i don't know much
>     about those).
>
> Items 1 to 3 are really the meat of the matter.  Item 4 is more like
> an add-on for convenience.

The way I read that, you mentioned two things in 4.: the quick
shortcut method and the comprehensive method. altnumd as envisaged
*could* do both, thanks to the crazy plan to make the left-hand
decimal a ridiculously long int; however, it would be much more
practical to use Alt+<numpad> for something short or already known
like CP437, and to handle full code point support via literal entry
of Unicode code points.
Neither of those is a substitute for normal CJK input[5], which I
know a little about. What's normally done in Japan, for example, is
that people type in romanization, using the Latin-character
equivalents of the kana[6] syllables; that gets automatically
converted to kana characters, and an input method editor[7] handles
the conversion to kanji. It works a little like T9 predictive text
entry used to work on ye olde mobile phone number pad: the IME lists
the most common choices first, and in most cases you just press
space and keep typing.
These things don't really get in the way of what we're discussing
here. They don't have too much to do with it, in fact. Adding the odd
character, sure, but nobody actually does their normal typing in
either Alt codes or U+xxxx.
Btw., if ANYONE:
- knows what's currently best-supported state-of-the-art for Japanese
support on OpenBSD <http://ports.su/inputmethods> (which IME port to
use), or
- has a spare JAPANESE layout internal keyboard for a DELL Latitude
D630 laptop,
shoot me an email.

>
> That said, i think i'll retire from this thread because we are just
> talking.

No hard feelings. I did want to respond to this email anyway, but
you're under absolutely no obligation to write another response in
turn. Or even read this.

> Besides, i have a strong suspicion that you should pick a
> simpler project, in particular as your first project.

You may be right. It may all come to nothing.

> This one seems seriously difficult conceptually,

I think I've thought much of that through, see above. And if I fail,
these ideas will still be out there, and who knows...

> exceedingly difficult
> technically, in particular regarding the complex kernel-xenocara-userland
> interactions, and *terrifyingly* complicated from a system integration
> perspective - and you know, when the goal is to make something fit
> for practical use (and commit), the system integration part is
> often the most dangerous obstacle in the first place, often challenging
> even for seasoned developers.  Nothing wrong with picking a project
> that is *technically* difficult if you feel adventurous (as long
> as it is cleanly self-contained), but do try to start with projects
> where system integration is easy, or expect almost certain eventual
> failure - quite likely after already having invested lots of work.

Actually, this is partly a product of a tiny but related first project
where I have already written (very messy but actual) code. I'd send
you a copy off-list if you're morbidly interested. It's not much and
not pretty and not polished, nor quite finished though. I agree that
the odds of failure are high.[8]

That related project was an ascii program (OpenBSD currently only has
an old man page, and it's rather limited). Surely I could do better?
Famous last words. It turns out that between Unicode and legacy codes,
figuring out how to CORRECTLY deal with most bytes is really complex,
and I may yet write up my findings and observations and remaining
related questions. To tech@, perhaps.

What I have successfully done, though, is *output* correctly encoded
non-ASCII UTF-8 characters to the tty. Yes, to xterm. I hadn't
noticed what you and NilsOla have noticed: that both Compose key and
Ctrl+Shift U+xxxx *input* support on XTerm 344 on OpenBSD 6.5
currently don't seem to work. Both work on XTerm 322 on Linux. Do we
know if that has ever worked on OpenBSD?

> P.S. about broken spam filters:
>
>> I've just noticed yet another false positive where Gmail has
>> classified your email as spam here for the n-th time.  I'm not sure if
>> that's just happening to my mailbox, or if it's Gmail-wide or, worse,
>> if lots of MTAs out there treat your emails as spam.
>
> So far, i have heard about outlook.com (which obviously nobody
> should use anyway) occasionally classifying all mail coming from
> the University of Karlsruhe (kit.edu) as spam, and about gmail.com
> doing the same in rare cases.  Both of these appear to sometimes
> consider that university - which is among the dozen or so most
> important technical research universities in Germany - as a spam
> site.  There is nothing much that i can do about that.
>
>> (There seems to be a trend where big corps are quite happy to
>> discourage people from running their own MTAs
>
> Not just running their own MTAs, also using non-commercial .edu
> infrastructure.
>
>> and increasingly throw their weight around rejecting anything that
>> isn't credentialled up the wazoo with SPF, DKIM, DMARC or whatever,
>
> Of course large advertising corporations do what it takes to grab
> market share, and vendor-lock in by breaking compatibility is a classic
> method for doing that.
>
> If your spam filter is broken, fix it.  I can hardly help with that.
> If your ISP won't let you fix it, get a better provider.

I've already created a filter that says never to send your emails to
Spam anymore, but like an arsehole, Gmail still complains thus:
>> "This message was not sent to Spam because of a filter that you created.
>> [Edit Filters]"
What part of "because I chose to" do they not understand? Google are
notorious for thinking they know better than the user what the latter
wants. You see that attitude in many of their products. That hubris
seems to be a big part of their corporate philosophy. This is almost
as insulting as YouTube telling me off for trusting "the wrong
broadcasters". Yeuch.

> Even if i wanted to contact the kit.edu postmasters to ask whether
> they can do anything about your problem, you didn't provide any
> information whatsoever - like for which exact reason which receiving
> Google mail server classified which sending kit.edu mailserver as
> a spam site, and at which exact time.  Such information should be
> sent privately, not to public lists.

I could send you the latest header or Original Sauce(TM) off-list --
only if you want it though. I've fixed the issue on my end except for
Google's whining. I just thought I'd tell you in case you didn't know.

Thanks so much for your time.
:)
Ian

[0] Okay, sorry, I lied again: I have in truth not entirely abandoned
the idea of perhaps including the "throw exactly those bits at the
system" functionality as an OPTIONALLY configurable EXTRA feature. The
issue there is that it would require support for a separate syntax in
.altnumrc. Perhaps actual 01010101 binary to avoid any confusion with
the left-hand side decimals and the right-hand side U+hexadecimals?
Normal Alt code decimal on the left, hail-Mary binary on the right? Of
course, throwing random 80-FF bytes at a UTF-8 system could be...
interesting, but might have its uses for fuzzing and testing.
[1] sans "U+", which is unusual and misleading to the point of being
incorrect, but I'm not going to fix that; let the edit warriors
slaughter somebody else who is completely correct.
[2] https://en.wikipedia.org/wiki/Windows-1252#Character_set
[3] https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Code_page_layout
[4] Even on systems that do support Unicode. Just not on Unix-likes.
[5] https://en.wikipedia.org/wiki/CJK_characters
[6] https://en.wikipedia.org/wiki/Kana#Hiragana_and_katakana
[7] https://en.wikipedia.org/wiki/Input_method
[8] https://www.youtube.com/watch?v=XiiCkWEavI0&t=42s