Re: sed(1): missing NUL in pattern space

Otto Moerbeek Thu, 15 Jun 2017 00:02:30 -0700

On Tue, Jun 13, 2017 at 10:08:11AM +0000, kshe wrote:

> On Sat, 10 Jun 2017 10:25:27 +0000, Otto Moerbeek wrote:
> > Thanks for the analysis and diff, I hope to get a chanche to think
> > about this soon. At least I'll make sure this diff is not forgotten,
> >     -Otto
> 
> I have seen the tests that you recently added to the tree; in the mean
> time, however, while I continued experimenting and reading the source, I
> discovered more bugs (as expected), as well as other trivial issues
> which might be worthy of note before committing any real fixes.
> 
> First of all, my comprehension of the `c' command has slightly changed
> since my last message: I now think its behaviour should be aligned with
> that of `p', `P', `w' and others, all of which produce no output after
> `c', treating a deleted pattern space as truly inexistent, and not
> merely as an empty one.  It would hence appear more coherent to add
> 
>       if (pd)
>               break;
> 
> before calling lputs(), as is done in many other places.  This would
> then change the expected output of one of the newly added tests.
> 
> In general though, that `c' command is a big mess.  It is supposed to
> delete the pattern space: in other implementations of sed that I have
> tested, that means it will essentially behave like `i' followed by `d'.
> In BSD's sed, however, it continues processing commands...  Here's one
> of the many bugs that arise because of that:
> 
>       $ echo abc | sed 'c\
>       > text
>       > s/.*/x/
>       > p
>       > s/.*/y/
>       > p'
>       text
>       x
> 
> So why isn't `y' printed too?  In comparison, using flags instead of
> standalone commands:
> 
>       $ echo abc | sed 'c\
>       > text
>       > s/.*/x/p
>       > s/.*/y/p'
>       text
>       x
>       y
> 
> Here, we do get a `y', but only one (there should be two given that `-n'
> has not been supplied to suppress normal output).  Since at least `p' as
> a flag given to `s' appears to work, let us try the same with `w':
> 
>       $ echo abc | sed 'c\
>       > text
>       > s/.*/x/w /tmp/sed.out
>       > s/.*/y/w /tmp/sed.out'
>       text
>       $ cat /tmp/sed.out
>       x
> 
> But this time it doesn't work.  Now add a third `s' command...
> 
>       $ echo abc | sed 'c\
>       > text
>       > s/.*/x/w /tmp/sed.out
>       > s/.*/y/w /tmp/sed.out
>       > s/.*/z/w /tmp/sed.out'
>       text
>       z
>       $ cat /tmp/sed.out
>       x
>       z
> 
> And it works.  Well, not really: we do see a `z', but still no `y'.
> Indeed, this does not make any sense.
> 
> The reason for this strange behaviour is nevertheless easy to figure out
> when looking at how `s' is implemented, keeping in mind that the
> substitute space is globally defined.  In any case, even if one was to
> fix that particular bug, I would probably be able to find more.  This is
> why I believe that the only sane solution is that, just like `d', `c'
> should simply
> 
>       goto new;
> 
> which would, at the same time, obviate the need for the `deleted' field
> of the SPACE structure, subsequently allowing for substantial cleanups.
> 
> Also, aside of that `c' mess, there is still the fact that NUL bytes
> originally present in the input are not well handled at all.  Of course,
> `l' cannot print past them (whereas `p' can), which is a shame; but
> significantly worse is the inconsistent behaviour of `s' in that case.
> At first, it seems to work well (in the output, `^@' represents a NUL):
> 
>       $ printf 'x\0y' | sed 's/x/_/'
>       _^@y
> 
> It can even handle some empty matches perfectly:
> 
>       $ printf 'x\0y' | sed 's/[[:<:]]/_/' # pattern space survives
>       _x^@y
> 
> But not all of them:
> 
>       $ printf 'x\0y' | sed 's/[[:>:]]/_/' # pattern space is cut off
>       x_
>       $ printf 'x\0y' | sed 's/[[:>:]]/_/2' # likewise
>       x
> 
> One might argue that NUL bytes are not supposed to happen in textual
> input, so that these examples are irrelevant; the real problem however
> is that it also affects newlines, thus perfectly valid text:
> 
>       $ printf 'x\ny' | sed 'N;s/[[:<:]]/_/'
>       _x
>       y
>       $ printf 'x\ny' | sed 'N;s/[[:>:]]/_/'
>       x_
>       $ printf 'x\ny' | sed 'N;s/[[:>:]]/_/2'
>       x
> 
> So maybe substitute() should test for (slen == 0) instead of the
> questionable (*s == '\0' || *s == '\n').  That would seem more correct
> to me, but this is only a guess since I must admit that I don't fully
> understand that part of the code yet.  Here is a diff, for instance,
> that fixes all the above issues.
> 
> --- a/src/usr.bin/sed/process.c       Wed Jun  7 09:56:20 2017
> +++ b/src/usr.bin/sed/process.c       Tue Jun 13 07:21:59 2017
> @@ -387,14 +388,11 @@ substitute(struct s_command *cp)
>                * and at the end of the line, terminate.
>                */
>               if (match[0].rm_so == match[0].rm_eo) {
> -                     if (*s == '\0' || *s == '\n')
> -                             slen = -1;
> -                     else
> -                             slen--;
> -                     if (*s != '\0') {
> -                             cspace(&SS, s++, 1, APPEND);
> -                             le++;
> -                     }
> +                     if (slen == 0)
> +                             break;
> +                     cspace(&SS, s++, 1, APPEND);
> +                     slen--;
> +                     le++;
>                       lastempty = 1;
>               } else
>                       lastempty = 0;
> 
> However, even if it passes the regression tests with the same success as
> the current sed, please do not blindly commit this patch.  Although I
> fail to see what it was, surely there must have been a reason to test
> for that particular condition in the original code, and not simply for
> (slen == 0).  This patch is therefore only intented as an example of how
> a fix might look like, but definitely not as the correct and final
> solution.
> 
> Lastly, I have prepared one more patch dealing with trivial issues that
> I found while browsing the code.  Contrary to the previous one, I am
> very confident that it will not have any undesired effects, although of
> course it should be carefully reviewed, as any patch should be.
> 
> --- a/src/usr.bin/sed/defs.h  Fri Jan 20 10:26:16 2017
> +++ b/src/usr.bin/sed/defs.h  Mon Jun 12 09:32:57 2017
> @@ -129,7 +129,6 @@ typedef struct {
>       size_t len;             /* Current length. */
>       int deleted;            /* If deleted. */
>       int append_newline;     /* If originally terminated by \n. */
> -     char *back;             /* Backing memory. */
>       size_t blen;            /* Backing memory length. */
>  } SPACE;
>  
> --- a/src/usr.bin/sed/process.c       Wed Jun  7 09:56:20 2017
> +++ b/src/usr.bin/sed/process.c       Tue Jun 13 07:21:59 2017
> @@ -77,9 +77,9 @@ static regex_t *defpreg;
>  size_t maxnsub;
>  regmatch_t *match;
>  
> -#define OUT() do {\
> -     fwrite(ps, 1, psl, outfile);\
> -     if (psanl) fputc('\n', outfile);\
> +#define OUT() do {                           \
> +     fwrite(ps, 1, psl, outfile);            \
> +     if (psanl) fputc('\n', outfile);        \
>  } while (0)
>  
>  void
> @@ -145,15 +145,15 @@ redirect:
>                               cspace(&PS, hs, hsl, REPLACE);
>                               break;
>                       case 'G':
> -                             cspace(&PS, "\n", 1, 0);
> -                             cspace(&PS, hs, hsl, 0);
> +                             cspace(&PS, "\n", 1, APPEND);
> +                             cspace(&PS, hs, hsl, APPEND);
>                               break;
>                       case 'h':
>                               cspace(&HS, ps, psl, REPLACE);
>                               break;
>                       case 'H':
> -                             cspace(&HS, "\n", 1, 0);
> -                             cspace(&HS, ps, psl, 0);
> +                             cspace(&HS, "\n", 1, APPEND);
> +                             cspace(&HS, ps, psl, APPEND);
>                               break;
>                       case 'i':
>                               (void)fprintf(outfile, "%s", cp->t);
> @@ -171,7 +171,7 @@ redirect:
>                               break;
>                       case 'N':
>                               flush_appends();
> -                             cspace(&PS, "\n", 1, 0);
> +                             cspace(&PS, "\n", 1, APPEND);
>                               if (!mf_fgets(&PS, 0))
>                                       exit(0);
>                               break;
> @@ -256,8 +256,9 @@ redirect:
>                       cp = cp->next;
>               } /* for all cp */
>  
> -new:         if (!nflag && !pd)
> +             if (!nflag && !pd)
>                       OUT();
> +new:
>               flush_appends();
>       } /* for all lines */
>  }
> @@ -266,7 +267,7 @@ new:              if (!nflag && !pd)
>   * TRUE if the address passed matches the current program state
>   * (lastline, linenumber, ps).
>   */
> -#define      MATCH(a)                                                \
> +#define MATCH(a)                                                     \
>       (a)->type == AT_RE ? regexec_e((a)->u.r, ps, 0, 1, 0, psl) :    \
>           (a)->type == AT_LINE ? linenum == (a)->u.l : lastline()
>  
> @@ -361,7 +362,7 @@ substitute(struct s_command *cp)
>                       cspace(&SS, s, match[0].rm_so - le, APPEND);
>  
>               /* Skip zero-length matches right after other matches. */
> -             if (lastempty || (match[0].rm_so - le) ||
> +             if (lastempty || match[0].rm_so != le ||
>                   match[0].rm_so != match[0].rm_eo) {
>                       if (n <= 1) {
>                               /* Want this match: append replacement. */
> @@ -370,7 +371,7 @@ substitute(struct s_command *cp)
>                                       n = -1;
>                       } else {
>                               /* Want a later match: append original. */
> -                             if (match[0].rm_eo - le)
> +                             if (match[0].rm_eo != le)
>                                       cspace(&SS, s, match[0].rm_eo - le,
>                                           APPEND);
>                               n--;
> @@ -418,7 +416,6 @@ substitute(struct s_command *cp)
>       PS = SS;
>       psanl = tspace.append_newline;
>       SS = tspace;
> -     SS.space = SS.back;
>  
>       /* Handle the 'p' flag. */
>       if (cp->u.s->p)
> @@ -547,13 +544,14 @@ regexec_e(regex_t *preg, const char *string, int eflag
>  static void
>  regsub(SPACE *sp, char *string, char *src)
>  {
> -     int len, no;
> +     int no;
> +     size_t len;
>       char c, *dst;
>  
> -#define      NEEDSP(reqlen)                                                  
> \
> +#define NEEDSP(reqlen)                                                       
> \
>       if (sp->len + (reqlen) + 1 >= sp->blen) {                       \
>               size_t newlen = sp->blen + (reqlen) + 1024;             \
> -             sp->space = sp->back = xrealloc(sp->back, newlen);      \
> +             sp->space = xrealloc(sp->space, newlen);                \
>               sp->blen = newlen;                                      \
>               dst = sp->space + sp->len;                              \
>       }
> @@ -572,8 +570,7 @@ regsub(SPACE *sp, char *string, char *src)
>                       NEEDSP(1);
>                       *dst++ = c;
>                       ++sp->len;
> -             } else if (match[no].rm_so != -1 && match[no].rm_eo != -1) {
> -                     len = match[no].rm_eo - match[no].rm_so;
> +             } else if ((len = match[no].rm_eo - match[no].rm_so) != 0) {
>                       NEEDSP(len);
>                       memmove(dst, string + match[no].rm_so, len);
>                       dst += len;
> @@ -594,18 +591,18 @@ cspace(SPACE *sp, const char *p, size_t len, enum e_sp
>  {
>       size_t tlen;
>  
> +     if (spflag == REPLACE)
> +             sp->len = 0;
> +
>       /* Make sure SPACE has enough memory and ramp up quickly. */
>       tlen = sp->len + len + 1;
>       if (tlen > sp->blen) {
>               size_t newlen = tlen + 1024;
> -             sp->space = sp->back = xrealloc(sp->back, newlen);
> +             sp->space = xrealloc(sp->space, newlen);
>               sp->blen = newlen;
>       }
>  
> -     if (spflag == REPLACE)
> -             sp->len = 0;
> -
> -     memmove(sp->space + sp->len, p, len);
> +     memcpy(sp->space + sp->len, p, len);
>  
>       sp->space[sp->len += len] = '\0';
>  }
> 
> Here is the corresponding list of changes:
> 
> 1.  Remove the `back' field from the SPACE structure, which was useless
> since always equal to the `space' field.
> 
> 2.  Use APPEND in calls to cspace() that previously used 0 directly,
> which could be slightly confusing if one did not know the order of the
> `e_spflag' enum, and also missed the point of it being defined in the
> first place.
> 
> 3.  As suggested by the outdated comment preceding its definition,
> cspace() was originally only meant to append to, not to replace, the
> contents of a SPACE.  The calculation of the size for the potentially
> new buffer was however never adjusted, which meant that, with input
> lines of approximately equal length, the function would allocate roughly
> twice as much memory as necessary, half of that memory staying unused.
> To fix it, conditionally set `sp->len' to zero before the relevent
> assignment, not after.
> 
> 4.  Since cspace() never receives arguments that overlap in memory (as
> that would make little sense), and considering that it is probably one
> of the most frequently called routines, make it use memcpy(3) in lieu of
> memmove(3).
> 
> 5.  Change conditionals of the form (a - b) to (a != b) for better
> readability.
> 
> 6.  Use `size_t' instead of `int' for a variable holding a length.
> 
> 7.  Slightly improve the logic of regsub() to avoid a bunch of no-op
> instructions when `len' is zero.
> 
> 8.  Move a goto label previously redirecting into a conditional that was
> always false within that context.
> 
> 9.  Add missing tabs to align backslashes in macro definitions.
> 
> Kind regards,
> 
> kshe


Hi,

One thing: in general, try to limit posts to one diff including
explanation, with a proper Subject: line, it makes this easier to
track.

        -Otto

Re: sed(1): missing NUL in pattern space

Reply via email to