bug#28255: grep erroneously skips Microsoft UTF-8 text files as being binary

2017-08-27 Thread Paul Eggert
Simon wrote: Sorry, my description was slightly ambiguous. I should not have said "skip" so much as "treats the file as binary and does not find a match because each character takes 2 octets as per utf-8".

$ mkdir tmp
$ cd tmp
$
$ printf '\377\376\164\000\145\000\163\000\164\000\061\000\015\000\012\

bug#28255: grep erroneously skips Microsoft UTF-8 text files as being binary

2017-08-27 Thread Paul Eggert
Simon wrote: Windows text files can start with a byte order mark of U+FEFF and then be encoded in UTF-8. These are skipped as being binary files.

I can't reproduce this problem on Fedora 26 x86-64. Here's how I tried:

$ printf '\357\273\277x\n' >t
$ LC_ALL=C grep x t | od -c
000 357 273 2

bug#28255: grep erroneously skips Microsoft UTF-8 text files as being binary

2017-08-27 Thread Simon
Windows text files can start with a byte order mark of U+FEFF and then be encoded in UTF-8. These are skipped as being binary files.
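
The two printf samples in this thread start with different bytes: 357 273 277 (EF BB BF) is U+FEFF encoded in UTF-8, while 377 376 (FF FE) is the UTF-16LE byte order mark. A minimal sketch for telling them apart follows. The function name bom_kind and the file names are hypothetical, it assumes a head that accepts -c, and it only inspects the leading bytes; it says nothing about how grep itself decides that a file is binary, and it would also report a UTF-32LE file as utf-16le.

```shell
# bom_kind FILE: print which byte order mark, if any, FILE starts with.
# Hypothetical helper for illustration; not part of grep.
bom_kind() {
  # Dump the first three bytes as hex with no offsets or spaces, e.g. "efbbbf".
  case "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" in
    efbbbf) echo utf-8 ;;      # U+FEFF encoded as UTF-8
    fffe*)  echo utf-16le ;;   # U+FEFF in little-endian UTF-16
    feff*)  echo utf-16be ;;   # U+FEFF in big-endian UTF-16
    *)      echo none ;;
  esac
}

# Sample files modeled on the leading bytes used in this thread.
dir=$(mktemp -d)
printf '\357\273\277x\n' > "$dir/bom8.txt"
printf '\377\376\164\000' > "$dir/bom16.txt"
printf 'x\n' > "$dir/plain.txt"

bom_kind "$dir/bom8.txt"    # prints utf-8
bom_kind "$dir/bom16.txt"   # prints utf-16le
bom_kind "$dir/plain.txt"   # prints none
```

If the check reports utf-16le, the file is not UTF-8 at all, which would be consistent with each character occupying two octets.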