On Fri, Jun 09, 2017 at 07:58:48AM +0000, kshe wrote:
> Hi,
>
> There is an ongoing confusion in sed's source about the concept of EOL:
> the code contradicts itself as to whether it should be determined by a
> trailing NUL or by looking up the `len' field of the SPACE structure.
>
> At least two commands (`l' and `s') assume that the pattern space is
> always NUL-terminated. However, at least two other commands (`c' and
> `D') do not comply with that assumption, modifying it by merely updating
> its length attribute. As a result, using `l' or `s' after `c' or `D'
> produces incorrect output.
>
> For instance, in the following excerpt, `l' should print `bb$':
>
> $ printf 'a\nbb\n' | sed '${l;q;};N;D'
> bbbb$
> bb
>
> For the same underlying reason, it also disregards the deletion of the
> pattern space that `c' is supposed to entail:
>
> $ echo abc | sed 'c\
> > text
> > l'
> text
> abc$
>
> The substitute command can likewise get confused and add an extra
> character after a replacement:
>
> $ printf 'a\nbb\n' | sed '${s/$/x/;q;};N;D'
> bbxb
>
> Note how this does not happen when the substitution pattern matches more
> than the empty string:
>
> $ printf 'a\nbb\n' | sed '${s/.$/&x/;q;};N;D'
> bbx
>
> After an empty match, the character that `s' adds to the pattern space
> depends on how memmove(3) modified it beforehand. Here's a very
> simplified version of the original sequence of commands from which I
> discovered these bugs, where `i' appear instead of `b':
>
> $ printf 'a\nbb\n' | sed '${s/$/x/;q;};N;s/a/ZZi/;D'
> bbxi
>
> Similarily, when one believes the pattern space to be empty after a `c'
> command, something like `s/.*//' will magically revive one character
> from that emptiness:
>
> $ echo abc | sed 'c\
> > text
> > s/.*//'
> text
> a
>
> As I came across these in an actual script that I was writing, I was
> quite amazed to find out that they went unnoticed for so long, without
> anyone ever trying to see what `l' does after `c' or `D'. Indeed, the
> `l' command is broken since revision 1.1 of process.c, making that bug
> older than I am. It has also been present in the other BSDs, but:
>
> o FreeBSD fixed it by accident in 2004 by adding support for
> multibyte encodings;
>
> o NetBSD fixed it by accident in 2014 by importing FreeBSD's sed
> (merging their changes afterwards, but that didn't reintroduce
> the bug, since they hadn't touched that part of the code).
>
> The more serious bug regarding empty matches in the substitute command
> was introduced in OpenBSD in 2011, when schwarze@ decided to rewrite its
> main loop, intending to "fix multiple issues regarding the replacement
> of zero-length strings", but apparently with the unfortunate side effect
> of introducing a new issue regarding the replacement of zero-length
> strings. Finally, this commit was adopted by FreeBSD in 2016 when they
> synchronised some of their code with OpenBSD's to ease importing other
> fixes from it.
>
> Thus, OpenBSD still has the old `l' bug as well as the more recent `s'
> one, FreeBSD only has the latter, and since 2014 NetBSD is not longer
> affected by any of these, although the fact that `c' and `D' omit to
> NUL-terminate the pattern space could be considered a bug in itself,
> since at least one piece of code present for 20 years in their tree was
> making the contrary assumption. So who knows where else it was made?
> Actually, with all due respect, probably not even the original
> developers: many of their choices suggest that the end of the pattern
> space was not always meant to be encoded as a trailing NUL, but they
> nevertheless designed the function `lputs' to only take a simple string
> parameter, by which no separate length information could ever be
> accessible.
>
> I therefore have the strong feeling that this is not the last bug of
> BSD's sed. The remaining ones may be hidden in dark corners, but I do
> expect them to show up at some point in the future, considering how
> fragile the existing code looks. The overall situation also prevents
> any significant improvement to be successfully conducted, as changing
> more than two lines will most likely lead to subtle regressions; see the
> last few commits at
>
> https://svnweb.freebsd.org/base/head/usr.bin/sed/process.c
>
> and
>
> http://cvsweb.netbsd.org/cgi-bin/cvsweb.cgi/src/usr.bin/sed/process.c
>
> for recent examples of such failed attempts.
>
> Anyway, until I digress even more, here is a patch for that one, forcing
> `c' and `D' to put a NUL where `l' and `s' expect one to be. By the
> way, whether the fault is on one side or the other is actually unclear.
> It does seem redundant to enforce such a convention when one already has
> access to the length of the corresponding string. As a matter of fact,
> the overwhelming majority of the code does not appear to ever rely on a
> NUL being there at all; for instance, `D' and `P' use memchr(3), not
> strchr(3), which, besides, allows for the manipulation of NUL bytes
> originally present in the input. Hence a more complete patch could
> instead try to unify the whole source to follow one convention
> consistently; however, that would probably take as much time as a full
> rewrite, because that is essentially what would need to be done, so, for
> now, I kept it as simple as it could be, to avoid messing it up even
> more than it already is.
>
> --- a/src/usr.bin/sed/process.c Wed Feb 22 14:09:09 2017
> +++ b/src/usr.bin/sed/process.c Wed Jun 7 09:56:20 2017
> @@ -121,7 +121,7 @@ redirect:
> goto redirect;
> case 'c':
> pd = 1;
> - psl = 0;
> + ps[psl = 0] = '\0';
> if (cp->a2 == NULL || lastaddr || lastline())
> (void)fprintf(outfile, "%s", cp->t);
> break;
> @@ -138,6 +138,7 @@ redirect:
> } else {
> psl -= (p + 1) - ps;
> memmove(ps, p + 1, psl);
> + ps[psl] = '\0';
> goto top;
> }
> case 'g':
>
> Also note how one could have thought of fixing `D' by making its
> memmove(3) go up to `psl + 1' instead of `psl', since there now should
> always be a NUL at the end; however, a separate assignment makes the
> intent clearer, and I read somewhere that explicit was better than
> implicit.
>
> That being said, the more I think about this, the more I am convinced
> that the problematic commands were in fact `l' and `s'. Since the code
> already uses fgetln(3) to read lines, it would have been nicer (and
> optimal) to never rely on NULs. As stated above, this is already the
> case almost everywhere; therefore, if any of the OpenBSD developers
> wanted to elaborate on that idea after committing such an unsatisfying
> patch, I would of course be very happy to help.
>
> Kind regards,
>
> kshe
Thanks for the analysis and diff, I hope to get a chanche to think
about this soon. At least I'll make sure this diff is not forgotten,
-Otto