On 11/07/2019, Ingo Schwarze <schwa...@usta.de> wrote:
> Hi Ian,

Hi Ingo,
I've just noticed yet another false positive where Gmail has classified
your email as spam here, for the n-th time now. I'm not sure if that's
just happening to my mailbox, or if it's Gmail-wide, or, worse, if lots
of MTAs out there treat your emails as spam. (There seems to be a trend
where big corps are quite happy to discourage people from running their
own MTAs, and increasingly throw their weight around by rejecting
anything that isn't credentialled up the wazoo with SPF, DKIM, DMARC or
whatever; and of course there are powerful interests who want to
deanonymise the Net, which may be related.) Either way, maybe this is
something you'll want to look into from your end?

> ropers wrote on Thu, Jul 11, 2019 at 12:41:45AM +0200:
>
>> While I'm personally only or mainly interested in Alt+numeric input,
>> if altnumd existed, it would probably be comparatively easy to then
>> extend it and add support for Alt+u0000 thru Alt+u10ffff, with the U
>> becoming a reserved keyword unambiguously signifying that what follows
>> will be a Unicode code point between U+0000 and U+10FFFF.
>
> There is no reason to make it different. ASCII is a subset of Unicode,
> with the same numbering. So the "U" looks redundant to me.

There are several reasons why it isn't redundant:

1. Alt codes are decimal, but Unicode code points are written in
hexadecimal. "INGO" typed as Alt+73, Alt+78, Alt+71, Alt+79 would come
out as "sxqy" if those same numbers were read as hex.

2. Unicode code points (format: U+xxxx, mostly, though they go up to
U+10FFFF) are NOT character bytes. I quoted Wikipedia on this in my
email two days ago:

>> [4] <https://en.wikipedia.org/wiki/Character_encoding#Terminology>:
>> "The compromise solution that was eventually found and developed into
>> Unicode was to break the assumption (dating back to telegraph codes)
>> that each character should always directly correspond to a particular
>> sequence of bits.
>> Instead, characters would first be mapped to a
>> universal intermediate representation in the form of abstract numbers
>> called __code points__. Code points would then be represented in a
>> variety of ways and with various default numbers of bits per character
>> (__code units__) depending on context. To encode code points higher
>> than the length of the code unit, such as above 256 for 8-bit units,
>> the solution was to implement variable-width encodings where an escape
>> sequence would signal that subsequent bits should be parsed as a
>> higher code point."

You are correct that in the case of the variable-length UTF-8, and for
the 128 (non-extended, 7-bit) ASCII characters only, this isn't a
problem, because for those characters, code points (U+xxxx) and code
units (bytes) actually ARE still numerically identical. That saving
grace pretty much does not exist with other, non-UTF-8 Unicode
encodings. Okay, maybe it still does if you drop all the leading
zeroes across multiple bytes. However:

3. I would be wary of dropping leading zeroes in the case of Unicode
code point support. With Alt codes, the precedent of optionally
allowing leading zeroes to be dropped has been set, but pretty much
all Unicode documentation I'm aware of consistently prints code points
in the U+xxxx format (or longer, up to U+10FFFF where applicable).
There's a good argument for supporting code point entry exactly as
written, and nobody writes U+0 through U+FFF. If you install the
gucharmap package <http://ports.su/x11/gnome/gucharmap>, it has a
Character Details tab where you can not only see how much UTF-8 code
units (bytes) can differ from code points (U+xxxx), but also that even
for those low code points where both match, the U+xxxx form is still
printed with its leading zeroes intact.

4. One also should be as restrained and conservative as is practical
in terms of "claiming" key combinations, especially claiming them
system-wide.
Yes, users could set up some hotkey somewhere that kills and relaunches
altnumd (I'm not even sure that belongs in altnumd itself), but you
don't want to do that all the time just to type a key combo that
collides with altnumd. "Hold down Alt while typing U, <x>,<x>,<x>,<x>,
then release Alt" is quite specific, and could reduce cases where a
sequence in .altnumrc collides with something else. "Hold down Alt
while typing up to three digits on the number pad, then release Alt" is
also relatively specific, though perhaps one might accept non-numpad
digit entry too, or make that choice configurable. A single-digit
Alt+<n> is more likely to collide with something, though the
long-standing precedent of Alt codes on at least some platforms may
make that less likely.

5. Perhaps there could be an opportunity for simplifying and unifying
Alt+U<codept> and the existing, iffy Ctrl+Shift+U <xxxx> support?
OTOH, maybe it's better to deliberately not collide with that other
method, and maybe that's a good reason for a universal Unicode code
point method to reside at its own key combo.

Remember, my actual goal is Alt code support, not Alt+U<codept>
support. The opportunity to tack on U<codept> support once Alt+<numpad>
support exists was more of an outgrowth, showing that at least for now,
our desires seem to point in the same direction.

I'll pause here, because this has gotten long. "Further bulletins as
events warrant." Or when I get around to it, rather.

All the best now,
Ian

>> There's a huge competence gap between us,
>
> Quite likely. I'm so clueless that right now, i can't even seem to get
> Compose to work even though i'm sure i had it working in the past.
> This is on amd64-current, inside xterm(1) and ksh(1):
>
> $ locale
> LANG=
> LC_COLLATE="en_US.UTF-8"
> LC_CTYPE="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_ALL=en_US.UTF-8
> $ setxkbmap -query -v -v -v
> Setting verbose level to 8
> locale is en_US.UTF-8
> Trying to load rules file ./rules/base...
> Trying to load rules file /usr/X11R6/share/X11/xkb/rules/base...
> Success.
> Applied rules from base:
> rules:      base
> model:      pc105
> layout:     de
> Trying to build keymap using the following components:
> keycodes:   xfree86+aliases(qwertz)
> types:      complete
> compat:     complete
> symbols:    pc+de+inet(pc105)+terminate(ctrl_alt_bksp)
> geometry:   pc(pc105)
> rules:      base
> model:      pc105
> layout:     de
>
> At this point, the caps key toggles caps lock, i.e. pressing
>
>   caps a caps a
>
> results in the input "Aa".
>
> $ setxkbmap -option compose:caps -v -v -v
> Setting verbose level to 8
> locale is en_US.UTF-8
> Trying to load rules file ./rules/base...
> Trying to load rules file /usr/X11R6/share/X11/xkb/rules/base...
> Success.
> Applied rules from base:
> rules:      base
> model:      pc105
> layout:     de
> options:    compose:caps
> Trying to build keymap using the following components:
> keycodes:   xfree86+aliases(qwertz)
> types:      complete
> compat:     complete
> symbols:    pc+de+inet(pc105)+terminate(ctrl_alt_bksp)+compose(caps)
> geometry:   pc(pc105)
>
> Now, the caps key no longer toggles caps lock and becomes a dead key,
> i.e. pressing
>
>   caps , c caps " a
>
> results in the input "ca". However, the resulting input is really
> ASCII-c ASCII-a rather than the expected c-cedille a-umlaut.
> It looks like Compose works well enough to discard the , and ",
> but not well enough to actually generate non-ASCII characters.
>
> Somewhat grumpy today,
> Ingo
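P.S.: Points 1 and 2 above can be checked mechanically. Here's a
minimal Python 3 sketch; Python is used purely for brevity, this is
not altnumd code, and the hypothetical Alt+<numpad> and Alt+U<codept>
behaviours are only simulated with ordinary chr()/encode() calls:

```python
# Point 1: Alt codes are decimal.  Feeding the very same digit strings
# through a hexadecimal interpretation yields different characters.
codes = ["73", "78", "71", "79"]
as_decimal = "".join(chr(int(c, 10)) for c in codes)
as_hex = "".join(chr(int(c, 16)) for c in codes)
print(as_decimal)  # INGO
print(as_hex)      # sxqy

# Point 2: a code point is an abstract number, not a byte sequence.
# U+00E9 (LATIN SMALL LETTER E WITH ACUTE) is one code point, but its
# UTF-8 encoding consists of two code units (bytes).
ch = "\u00e9"
print(f"U+{ord(ch):04X}")  # U+00E9 -- note the leading zeroes
print(" ".join(f"{b:02x}" for b in ch.encode("utf-8")))  # c3 a9
```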