Eric Blake wrote:
> On 05/23/2015 06:04 PM, L. A. Walsh wrote:
>> the standard & extended REs don't find NULs:
> Because NULs imply binary data,
----
I can think of multiple cases where at least one NUL would be
found in text data -- the prime example being a Microsoft text file.
While MS usually puts a BOM at the beginning of such files
(NT's original format was little-endian UCS-2), one still runs into
the occasional file -- just rarely enough that I never remember the
vim command to convert the buffer to a compatible format, and I
waste time looking it up.
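For the record, something like this usually does it -- a sketch,
assuming the file is little-endian UCS-2 without a BOM: reread the
buffer with the encoding forced, mark it to be rewritten as UTF-8,
and write it out:

    :e ++enc=ucs-2le
    :set fenc=utf-8
    :w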
But more to the point, some unix utilities were designed to
work on files generally -- not just text -- 'strings', for example.
Right now, it seems grep has lost much in the 'robust' category --
I had one file that it bailed on, saying it had an invalid UTF-8
encoding -- but the command was a recursive search starting from
'.', and it didn't name the offending file.
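The workaround I know of -- though I shouldn't need one -- is to
force the C locale so the multibyte validity check never applies
('pattern' standing in for whatever I was actually searching for):

    LC_ALL=C grep -r 'pattern' .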
"-a" doesn't work, BTW:
Ishtar:/tmp> grep -a '\000\000' zeros
Ishtar:/tmp> echo $?
1
Ishtar:/tmp> grep -P '\000\000' zeros
Binary file zeros matches
But there it is -- if grep wasn't meant to handle binary files,
it wouldn't know to call 'zeros' a binary file.
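Part of the difference, as far as I can tell: -P treats '\000' as an
octal escape for the NUL byte, while basic REs don't define that
escape at all. One way to recheck with text mode forced -- a sketch,
assuming a grep built with PCRE:

    grep -caP '\x00\x00' zeros    # -c: just count matching "lines"
                                  # instead of dumping raw NULs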
Many of the coreutils work equally well on binary as on text
(cat, split, tr, and wc, to name a few). But how can 'shuf' claim
to work on input lines yet allow this (quick demo below):
-z, --zero-terminated
line delimiter is NUL, not newline.
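A quick demo of what that switch does -- a sketch, assuming GNU shuf:

    printf 'a\0b\0c\0' | shuf -z | tr '\0' '\n'
    # shuffles NUL-delimited records; the tr is only there to make
    # the output visible on a terminal

So shuf already copes with NUL-delimited "lines" just fine.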
'nl' claims the file 'zeros' (4k of NULs -- created by bash,
which can write a file of zeros but can't read one back) is 1 line.
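What I mean by that parenthetical, as a sketch:

    printf '\0%.0s' {1..4096} > zeros    # bash happily writes 4096 NULs
    IFS= read -r line < zeros            # but a bash variable can't hold
                                         # NUL, so the bytes get dropped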
'pr' will print it (though not too well).
'xargs': <zeros xargs -0 |wc
1 0 4096
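And to the earlier point about cat/tr/wc coping with binary:

    tr '\0' 'x' < zeros | wc -c    # tr rewrites all 4096 bytes happily
    wc -l zeros                    # 0 newlines -- yet nl calls it 1 line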
POSIX is a least common denominator -- it is not a standard
of quality in any way. People argue to dumb down POSIX utils
because some corp wants a posix label but has a few shortcomings --
so they donate enough money and posix changes its rules.
'less' works with it, but 'more' works faster (it just doesn't
display the control chars). But one of the files I searched through
was base64 encoded, and in at least 2 places in the file there was
a run of ~100-200 zeros (in a file of 10k or more). That's what I'm
looking for -- signs of corruption...
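The search I actually want is easy enough to write, assuming a grep
built with PCRE support ('suspect.b64' being a stand-in name):

    grep -aboP '\x00{100,}' suspect.b64 | cut -d: -f1
    # -a: text mode, -b: byte offsets, -o: print only the matches;
    # the cut keeps just the offsets instead of echoing the NUL runs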
> and grepping binary data has unspecified
> results per POSIX. What's more, the NEWS for 2.21 documents that grep
> is now taking the liberty of treating NUL as a line terminator when -a
> is not in effect, thanks to the behavior being otherwise unspecified by
> POSIX.
----
With a "-0" switch, I presume (not default behavior -- that would
be ungood :^/ )
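(For what it's worth, GNU grep already has a switch in this territory:
-z/--null-data makes NUL the line delimiter on both input and output:

    printf 'foo\0bar\0' | grep -z bar | tr '\0' '\n'

-- but that one is opt-in, as it should be.)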
> Try using 'grep -a' to force grep to treat the file as non-binary, in
> spite of the NULs.
----
It doesn't work -- as shown above. I'd say it's a bug,
fair and square...