Paul Eggert wrote:
On 02/02/2018 03:30 PM, L A Walsh wrote:
> most computer files (vs. user-files) are still single-byte.
That's because so many of them are ASCII. But ASCII files are not the
issue here. grep's behavior hasn't changed when operating on ASCII files
in typical locales. The issue is text using a non-ASCII encoding that is
not compatible with your locale; e.g., if your text file uses ISO 8859-1
but your locale specifies UTF-8.
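
To make the mismatch concrete, here is a minimal Python sketch (not grep's own logic; the helper name is_valid_utf8 is made up for illustration). It shows that ASCII bytes decode cleanly as UTF-8, while the ISO 8859-1 encoding of an accented letter does not:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# "café" in ISO 8859-1 ends with the single byte 0xE9, which is not a
# complete UTF-8 sequence on its own.
latin1_bytes = "café".encode("iso-8859-1")
utf8_bytes = "café".encode("utf-8")

print(is_valid_utf8(b"cafe"))        # pure ASCII text
print(is_valid_utf8(utf8_bytes))     # same word, UTF-8 encoded
print(is_valid_utf8(latin1_bytes))   # same word, ISO 8859-1 encoded
```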
----
I've had my locale as UTF-8 since around 2000. My music collection
needed French, English, Middle Eastern, and now Japanese chars -- so I
set things to UTF-8. I didn't need perfection. For the email, I needed
to know which files the text was in so I could look at those mboxes with
a mail reader or with a text editor. I needed grep to work as a
first-level search tool. It has failed on that score.
Still, if it just searched for the bytes that I put in the search
string, I'm not sure how it would "go wrong".
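
As a rough illustration of what "just search for the bytes" means, here is a minimal Python sketch. It is not how grep is implemented, and the function name files_containing is hypothetical. It opens each file in binary mode and compares raw octets, so no locale or encoding is ever consulted:

```python
def files_containing(pattern: bytes, paths):
    """Return the paths whose raw bytes contain `pattern` on some line.

    Files are read in binary mode, so nothing is decoded and bytes that
    are invalid in the current locale cannot cause a failure.
    """
    matches = []
    for path in paths:
        with open(path, "rb") as f:
            # Iterating a binary file yields bytes split on b"\n".
            if any(pattern in line for line in f):
                matches.append(path)
    return matches

# Example call (the mbox names are placeholders):
#   files_containing(b"search string", ["inbox.mbox", "sent.mbox"])
```

This only finds text whose stored bytes equal the bytes typed, so an ASCII search string would not match inside base64 or quoted-printable message bodies.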
In my experience, UTF-8 has long been winning this battle, in the sense
that UTF-8 is by far the dominant encoding for the non-ASCII files I
regularly use. So I use a UTF-8 locale, and suggest this as a good
default for most users nowadays.
It's not possible to get direct statistics about encoding for all user
files. However, we can see what's being published on the web. Currently
UTF-8 is being used by about 90% of public websites whose character
encoding can be determined, according to the latest W3Techs survey. ISO
8859-1 is in second place, at about 4%. See:
https://w3techs.com/technologies/overview/character_encoding/all
Whereas this one was:
Domain: Non-ISO extended-ASCII text, with very long lines
So theoretically, it would never match any locale.
The problem is that in a mailbox, different emails can have different
encodings. But I didn't care -- I typed in an ASCII string -- so let it
search in octets with no encoding.
Also, a mailbox is very likely to contain long lines (maybe "very long
lines"), but the text I was searching for was under 80 chars.
I'm really surprised it was decided to break compatibility -- I've been
doing searches like this for over two decades. Not often, mind you, but
it's one of the big advantages for me of keeping mailboxes for my IMAP
server in mbox format. Maildir format or others would kill the ability
to search, because of slow file I/O. ;^/