On Tue, Jun 13, 2017 at 10:08:11AM +0000, kshe wrote:
> On Sat, 10 Jun 2017 10:25:27 +0000, Otto Moerbeek wrote:
> > Thanks for the analysis and diff, I hope to get a chanche to think
> > about this soon. At least I'll make sure this diff is not forgotten,
> > -Otto
>
> I have seen the tests that you recently added to the tree; in the mean
> time, however, while I continued experimenting and reading the source, I
> discovered more bugs (as expected), as well as other trivial issues
> which might be worthy of note before committing any real fixes.
>
> First of all, my comprehension of the `c' command has slightly changed
> since my last message: I now think its behaviour should be aligned with
> that of `p', `P', `w' and others, all of which produce no output after
> `c', treating a deleted pattern space as truly inexistent, and not
> merely as an empty one. It would hence appear more coherent to add
>
> if (pd)
> break;
>
> before calling lputs(), as is done in many other places. This would
> then change the expected output of one of the newly added tests.
>
> In general though, that `c' command is a big mess. It is supposed to
> delete the pattern space: in other implementations of sed that I have
> tested, that means it will essentially behave like `i' followed by `d'.
> In BSD's sed, however, it continues processing commands... Here's one
> of the many bugs that arise because of that:
>
> $ echo abc | sed 'c\
> > text
> > s/.*/x/
> > p
> > s/.*/y/
> > p'
> text
> x
>
> So why isn't `y' printed too? In comparison, using flags instead of
> standalone commands:
>
> $ echo abc | sed 'c\
> > text
> > s/.*/x/p
> > s/.*/y/p'
> text
> x
> y
>
> Here, we do get a `y', but only one (there should be two given that `-n'
> has not been supplied to suppress normal output). Since at least `p' as
> a flag given to `s' appears to work, let us try the same with `w':
>
> $ echo abc | sed 'c\
> > text
> > s/.*/x/w /tmp/sed.out
> > s/.*/y/w /tmp/sed.out'
> text
> $ cat /tmp/sed.out
> x
>
> But this time it doesn't work. Now add a third `s' command...
>
> $ echo abc | sed 'c\
> > text
> > s/.*/x/w /tmp/sed.out
> > s/.*/y/w /tmp/sed.out
> > s/.*/z/w /tmp/sed.out'
> text
> z
> $ cat /tmp/sed.out
> x
> z
>
> And it works. Well, not really: we do see a `z', but still no `y'.
> Indeed, this does not make any sense.
>
> The reason for this strange behaviour is nevertheless easy to figure out
> when looking at how `s' is implemented, keeping in mind that the
> substitute space is globally defined. In any case, even if one was to
> fix that particular bug, I would probably be able to find more. This is
> why I believe that the only sane solution is that, just like `d', `c'
> should simply
>
> goto new;
>
> which would, at the same time, obviate the need for the `deleted' field
> of the SPACE structure, subsequently allowing for substantial cleanups.
>
> Also, aside of that `c' mess, there is still the fact that NUL bytes
> originally present in the input are not well handled at all. Of course,
> `l' cannot print past them (whereas `p' can), which is a shame; but
> significantly worse is the inconsistent behaviour of `s' in that case.
> At first, it seems to work well (in the output, `^@' represents a NUL):
>
> $ printf 'x\0y' | sed 's/x/_/'
> _^@y
>
> It can even handle some empty matches perfectly:
>
> $ printf 'x\0y' | sed 's/[[:<:]]/_/' # pattern space survives
> _x^@y
>
> But not all of them:
>
> $ printf 'x\0y' | sed 's/[[:>:]]/_/' # pattern space is cut off
> x_
> $ printf 'x\0y' | sed 's/[[:>:]]/_/2' # likewise
> x
>
> One might argue that NUL bytes are not supposed to happen in textual
> input, so that these examples are irrelevant; the real problem however
> is that it also affects newlines, thus perfectly valid text:
>
> $ printf 'x\ny' | sed 'N;s/[[:<:]]/_/'
> _x
> y
> $ printf 'x\ny' | sed 'N;s/[[:>:]]/_/'
> x_
> $ printf 'x\ny' | sed 'N;s/[[:>:]]/_/2'
> x
>
> So maybe substitute() should test for (slen == 0) instead of the
> questionable (*s == '\0' || *s == '\n'). That would seem more correct
> to me, but this is only a guess since I must admit that I don't fully
> understand that part of the code yet. Here is a diff, for instance,
> that fixes all the above issues.
>
> --- a/src/usr.bin/sed/process.c Wed Jun 7 09:56:20 2017
> +++ b/src/usr.bin/sed/process.c Tue Jun 13 07:21:59 2017
> @@ -387,14 +388,11 @@ substitute(struct s_command *cp)
> * and at the end of the line, terminate.
> */
> if (match[0].rm_so == match[0].rm_eo) {
> - if (*s == '\0' || *s == '\n')
> - slen = -1;
> - else
> - slen--;
> - if (*s != '\0') {
> - cspace(&SS, s++, 1, APPEND);
> - le++;
> - }
> + if (slen == 0)
> + break;
> + cspace(&SS, s++, 1, APPEND);
> + slen--;
> + le++;
> lastempty = 1;
> } else
> lastempty = 0;
>
> However, even if it passes the regression tests with the same success as
> the current sed, please do not blindly commit this patch. Although I
> fail to see what it was, surely there must have been a reason to test
> for that particular condition in the original code, and not simply for
> (slen == 0). This patch is therefore only intented as an example of how
> a fix might look like, but definitely not as the correct and final
> solution.
>
> Lastly, I have prepared one more patch dealing with trivial issues that
> I found while browsing the code. Contrary to the previous one, I am
> very confident that it will not have any undesired effects, although of
> course it should be carefully reviewed, as any patch should be.
>
> --- a/src/usr.bin/sed/defs.h Fri Jan 20 10:26:16 2017
> +++ b/src/usr.bin/sed/defs.h Mon Jun 12 09:32:57 2017
> @@ -129,7 +129,6 @@ typedef struct {
> size_t len; /* Current length. */
> int deleted; /* If deleted. */
> int append_newline; /* If originally terminated by \n. */
> - char *back; /* Backing memory. */
> size_t blen; /* Backing memory length. */
> } SPACE;
>
> --- a/src/usr.bin/sed/process.c Wed Jun 7 09:56:20 2017
> +++ b/src/usr.bin/sed/process.c Tue Jun 13 07:21:59 2017
> @@ -77,9 +77,9 @@ static regex_t *defpreg;
> size_t maxnsub;
> regmatch_t *match;
>
> -#define OUT() do {\
> - fwrite(ps, 1, psl, outfile);\
> - if (psanl) fputc('\n', outfile);\
> +#define OUT() do { \
> + fwrite(ps, 1, psl, outfile); \
> + if (psanl) fputc('\n', outfile); \
> } while (0)
>
> void
> @@ -145,15 +145,15 @@ redirect:
> cspace(&PS, hs, hsl, REPLACE);
> break;
> case 'G':
> - cspace(&PS, "\n", 1, 0);
> - cspace(&PS, hs, hsl, 0);
> + cspace(&PS, "\n", 1, APPEND);
> + cspace(&PS, hs, hsl, APPEND);
> break;
> case 'h':
> cspace(&HS, ps, psl, REPLACE);
> break;
> case 'H':
> - cspace(&HS, "\n", 1, 0);
> - cspace(&HS, ps, psl, 0);
> + cspace(&HS, "\n", 1, APPEND);
> + cspace(&HS, ps, psl, APPEND);
> break;
> case 'i':
> (void)fprintf(outfile, "%s", cp->t);
> @@ -171,7 +171,7 @@ redirect:
> break;
> case 'N':
> flush_appends();
> - cspace(&PS, "\n", 1, 0);
> + cspace(&PS, "\n", 1, APPEND);
> if (!mf_fgets(&PS, 0))
> exit(0);
> break;
> @@ -256,8 +256,9 @@ redirect:
> cp = cp->next;
> } /* for all cp */
>
> -new: if (!nflag && !pd)
> + if (!nflag && !pd)
> OUT();
> +new:
> flush_appends();
> } /* for all lines */
> }
> @@ -266,7 +267,7 @@ new: if (!nflag && !pd)
> * TRUE if the address passed matches the current program state
> * (lastline, linenumber, ps).
> */
> -#define MATCH(a) \
> +#define MATCH(a) \
> (a)->type == AT_RE ? regexec_e((a)->u.r, ps, 0, 1, 0, psl) : \
> (a)->type == AT_LINE ? linenum == (a)->u.l : lastline()
>
> @@ -361,7 +362,7 @@ substitute(struct s_command *cp)
> cspace(&SS, s, match[0].rm_so - le, APPEND);
>
> /* Skip zero-length matches right after other matches. */
> - if (lastempty || (match[0].rm_so - le) ||
> + if (lastempty || match[0].rm_so != le ||
> match[0].rm_so != match[0].rm_eo) {
> if (n <= 1) {
> /* Want this match: append replacement. */
> @@ -370,7 +371,7 @@ substitute(struct s_command *cp)
> n = -1;
> } else {
> /* Want a later match: append original. */
> - if (match[0].rm_eo - le)
> + if (match[0].rm_eo != le)
> cspace(&SS, s, match[0].rm_eo - le,
> APPEND);
> n--;
> @@ -418,7 +416,6 @@ substitute(struct s_command *cp)
> PS = SS;
> psanl = tspace.append_newline;
> SS = tspace;
> - SS.space = SS.back;
>
> /* Handle the 'p' flag. */
> if (cp->u.s->p)
> @@ -547,13 +544,14 @@ regexec_e(regex_t *preg, const char *string, int eflag
> static void
> regsub(SPACE *sp, char *string, char *src)
> {
> - int len, no;
> + int no;
> + size_t len;
> char c, *dst;
>
> -#define NEEDSP(reqlen)
> \
> +#define NEEDSP(reqlen)
> \
> if (sp->len + (reqlen) + 1 >= sp->blen) { \
> size_t newlen = sp->blen + (reqlen) + 1024; \
> - sp->space = sp->back = xrealloc(sp->back, newlen); \
> + sp->space = xrealloc(sp->space, newlen); \
> sp->blen = newlen; \
> dst = sp->space + sp->len; \
> }
> @@ -572,8 +570,7 @@ regsub(SPACE *sp, char *string, char *src)
> NEEDSP(1);
> *dst++ = c;
> ++sp->len;
> - } else if (match[no].rm_so != -1 && match[no].rm_eo != -1) {
> - len = match[no].rm_eo - match[no].rm_so;
> + } else if ((len = match[no].rm_eo - match[no].rm_so) != 0) {
> NEEDSP(len);
> memmove(dst, string + match[no].rm_so, len);
> dst += len;
> @@ -594,18 +591,18 @@ cspace(SPACE *sp, const char *p, size_t len, enum e_sp
> {
> size_t tlen;
>
> + if (spflag == REPLACE)
> + sp->len = 0;
> +
> /* Make sure SPACE has enough memory and ramp up quickly. */
> tlen = sp->len + len + 1;
> if (tlen > sp->blen) {
> size_t newlen = tlen + 1024;
> - sp->space = sp->back = xrealloc(sp->back, newlen);
> + sp->space = xrealloc(sp->space, newlen);
> sp->blen = newlen;
> }
>
> - if (spflag == REPLACE)
> - sp->len = 0;
> -
> - memmove(sp->space + sp->len, p, len);
> + memcpy(sp->space + sp->len, p, len);
>
> sp->space[sp->len += len] = '\0';
> }
>
> Here is the corresponding list of changes:
>
> 1. Remove the `back' field from the SPACE structure, which was useless
> since always equal to the `space' field.
>
> 2. Use APPEND in calls to cspace() that previously used 0 directly,
> which could be slightly confusing if one did not know the order of the
> `e_spflag' enum, and also missed the point of it being defined in the
> first place.
>
> 3. As suggested by the outdated comment preceding its definition,
> cspace() was originally only meant to append to, not to replace, the
> contents of a SPACE. The calculation of the size for the potentially
> new buffer was however never adjusted, which meant that, with input
> lines of approximately equal length, the function would allocate roughly
> twice as much memory as necessary, half of that memory staying unused.
> To fix it, conditionally set `sp->len' to zero before the relevent
> assignment, not after.
>
> 4. Since cspace() never receives arguments that overlap in memory (as
> that would make little sense), and considering that it is probably one
> of the most frequently called routines, make it use memcpy(3) in lieu of
> memmove(3).
>
> 5. Change conditionals of the form (a - b) to (a != b) for better
> readability.
>
> 6. Use `size_t' instead of `int' for a variable holding a length.
>
> 7. Slightly improve the logic of regsub() to avoid a bunch of no-op
> instructions when `len' is zero.
>
> 8. Move a goto label previously redirecting into a conditional that was
> always false within that context.
>
> 9. Add missing tabs to align backslashes in macro definitions.
>
> Kind regards,
>
> kshe
Hi,
One thing: in general, try to limit posts to one diff including
explanation, with a proper Subject: line, it makes this easier to
track.
-Otto