bug#25707: [PATCH] grep: don't forcefully strip carriage returns

Eric Blake Thu, 16 Feb 2017 09:46:19 -0800

On 02/16/2017 11:26 AM, Eli Zaretskii wrote:

>> I'm of the opinion that undossify_input causes more problems than it
>> solves.  We should trust fopen("r") to do the right thing, rather than
>> reinventing it ourselves.
> 
> FYI: You'd be losing an important feature for non-Cygwin DOS/Windows
> users if you remove undossify_input and decide to trust fopen's "r"
> (or "rt") mode.


"rt" mode is not required to exist. And I don't know any modern
implementation of "r" mode on a system with non-zero O_BINARY that eats
ALL \r - both Cygwin and mingw just change \r\n into \n while still
preserving other \r.  The undossify() code that Paul just removed did
NOT behave the same as text mode (in that it did, perhaps
unintentionally, eat ALL \r).

I count it worse to TRY and reimplement the OS "r" mode and get the
implementation wrong, with more lines of code, than to just trust the OS
to do it correctly in the first place.

The undossify() code may do the right thing on text files, but is
absolutely wrong on binary files.  My alternative patch was less
intrusive, and at least preserves the undossify behavior on text files,
but with the requirement that fcntl(0, F_GETFL) & O_TEXT is a reliable
way to tell if an fd is currently opened in text mode before slamming it
over to the binary mode required for undossify to work.

>  That's because reading a file which was opened in
> text-mode generally removes _all_ CR characters, even if they are not
> followed by a newline; it also stops on the first ^Z character in the
> file, treating it as a kind of "software EOF", a legacy from CP/M
> years.

No, I don't know of any fopen(,"r") code that eats _all_ CR.  Certainly
cygwin's implementation does not do that.

> 
> That's why the original patch switched the file descriptor to binary
> mode (Grep used 'open', not 'fopen', in those days) and used
> undossify_input: that allowed Grep to DTRT with these use cases,
> removing CRs only if they are followed by a newline, and not stopping
> at ^Z.  As a side effect, undossify_input also collects the
> information needed for displaying byte offsets.

Yes, you do make a point that the side effect of reimplementing text
mode ourselves on a forced binary fd lets us "count" byte offsets where
the count could be text while the scan was binary, or where the count
could be binary while the scan was text.  But in reality, are there any
users that ever want a mixed-mode count?  If you are scanning in binary,
you want the binary count; if you are scanning in text, you want the
text count.  And in those two scenarios, with a sane text-mode from the
OS, still give the correct counts with a lot less code in grep.

> 
> It seems to me that when one bumps into some code which looks
> incorrect or less than optimal, and one considers its replacement with
> a more clever code, it would be a good idea to ask the person(s) who
> contributed the original code, in case there was some good reason for
> doing it that way.  Was that done in this case?  If not, it should
> have been.
> 

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

signature.asc
Description: OpenPGP digital signature

bug#25707: [PATCH] grep: don't forcefully strip carriage returns

Reply via email to