On Sat, Jul 11, 2020 at 08:32:34PM +0000, davidson wrote:
> '!' marks the spot of nonbreaking spaces that made it into OP's first
> report of odd behavior, upon testing the white scissors XCompose rule:
>
> $ grep "WHITE SCISSORS" d-u_xcompose_2020-07-08.nbsp | tr $'\xc2\xa0' \!
> <Multi_key> <s> <x>!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! : "✄"!!!! U2704
> # WHITE SCISSORS

Note that tr does not handle multi-character sequences.  If you pass
something like

tr abc xyz

it does *not* look for "abc" sequences and convert only those sequences.
Rather, it looks at single characters.  It converts 'a' to 'x', and 'b'
to 'y', and 'c' to 'z'.  The number of characters in the first pattern
is supposed to match the number of characters in the second pattern, so
that there is a 1:1 mapping.

GNU tr also does not handle multi-byte *characters* correctly (which
violates POSIX -- it's a known bug).

So, your tr command actually converts all c2 bytes into ! and all a0
bytes into ! as well.  Not *just* c2a0 pairs.  (Your second pattern is
shorter than the first, so GNU tr pads it by repeating its last
character, which is why both bytes end up as '!'.)

Nevertheless, this is useful as a first pass approximation to say that,
hey, there *might* be a bunch of NBSPs here, and you should take a
closer look.

NBSPs most often result when someone gets lazy and pastes a line from a
web page or from a Microsoft Word/Excel document into a Unix terminal or
X11 application, instead of pasting just the characters they actually
want.  Web pages, especially *older* web pages, often use NBSPs for
primitive formatting.
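
If you want that closer look, here is a rough sketch of one way to mark
only the real c2a0 pairs, by having sed match the two-byte sequence in
the C locale instead of mapping single bytes.  The function name
mark_nbsp and the sample line are made up for illustration, and I'm
assuming a POSIX sh with a printf that understands octal escapes.

```shell
#!/bin/sh
# Sketch: replace only real NBSPs (the byte pair c2 a0) with '!'.
# mark_nbsp is a made-up helper name, not a standard tool.
mark_nbsp() {
    # Build the two-byte sequence with octal escapes, which POSIX printf
    # understands (no need for \x escapes or bash's $'...').
    nbsp=$(printf '\302\240')
    # LC_ALL=C makes sed work on bytes; the pattern is the pair, in order.
    LC_ALL=C sed "s/${nbsp}/!/g"
}

# Sample line: "café", a space, "à", one NBSP, then "la carte".
# é is c3 a9 and à is c3 a0, so à shares its second byte with NBSP.
sample='caf\303\251 \303\240\302\240la carte\n'

# Sequence match: only the NBSP is replaced, à is left alone.
printf "$sample" | mark_nbsp

# Byte-wise tr for comparison: every c2 and every a0 byte becomes '!',
# including the a0 inside à.  od -c makes the leftover c3 byte visible.
printf "$sample" | LC_ALL=C tr '\302\240' '!!' | od -c
```

The first pipeline should print the line with a single '!' where the
NBSP was.  The second shows why the tr output can only be treated as a
hint: the à is damaged even though it was never an NBSP.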