Re: input file ends with non-newline

Eric Blake Sat, 07 Dec 2024 20:03:18 -0800

On Fri, Dec 06, 2024 at 05:33:09PM -0500, Douglas McIlroy wrote:
> Arguments are collected if a macro name is "immediately" followed by a
> left parenthesis. Experiment shows that arguments are not collected
> when a macro name occurs at the end of a file (without a following
> newline) and the next input file begins with a left parenthesis. I
> believe this behavior is incorrect.
>

GNU m4 is not injecting a missing newline, so much as its input
scanner refuses to recognize tokens that would extend across file
boundaries.  And, as you correctly noted, POSIX says the behavior is
undefined if the input lacks an ending newline, so there we are free
to make it do whatever is more useful.
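To make the file-boundary behavior concrete, here is a minimal sketch (the file names and the "show" macro are made up for illustration, not taken from the original report).  Going by the behavior described above, GNU m4 should print "<>(hello)" for the first pair of files and "ab" for the second, since neither the call nor the macro name is stitched together across the boundary; other implementations may differ.

```shell
# Sketch only: file names and the "show" macro are hypothetical.
dir=$(mktemp -d)

# A macro name at the very end of one file (no trailing newline),
# with its would-be argument list at the start of the next file.
printf '%s' 'changequote([,])define([show],[<$1>])show' > "$dir/first.m4"
printf '%s\n' '(hello)' > "$dir/second.m4"
split_call=$(m4 "$dir/first.m4" "$dir/second.m4")
printf '%s\n' "$split_call"

# A macro name cut in half by the file boundary.
printf '%s' 'changequote([,])define([ab],[AB])a' > "$dir/third.m4"
printf '%s\n' 'b' > "$dir/fourth.m4"
split_name=$(m4 "$dir/third.m4" "$dir/fourth.m4")
printf '%s\n' "$split_name"

rm -rf "$dir"
```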

It's not just file boundaries: at least m4wrap() behaves the same way,
by injecting an artificial end-of-token boundary between the first and
second layers of nested unwrapping. Consider this example:

$ echo 'changequote([,])define([ab],[AB])m4wrap(b)m4wrap(a)' | m4

which outputs "\nAB", but:

$ echo 'changequote([,])define([ab],[AB])m4wrap([m4wrap(b)a])' | m4

changes the output to "\nab".  What's going on?  Under the hood, the
text of one m4wrap is concatenated with whatever text comes next in
the input stream from any other LIFO m4wrap()s at the same level; so
in the first example, a and b end up adjacent when the two m4wrap'd
strings are replayed in reverse order, and together they form a macro
name that gets expanded.  But in the second example, even though there
are no intervening characters in the output stream, there WAS an
intervention: m4 reached the "end-of-file" of the first layer of
m4wrap before diving into the second layer, which is a stronger
boundary than the one between two wraps at the same nesting layer.

But one of the powers of m4 is the ability to create macro names by
concatenation on rescan.  It seems odd that end-of-file should
interfere with what is normally m4's strong point of rescanning
unquoted output to see if new macro calls appear.

Would you like to submit a patch to lift the artificial limitation of
splitting tokens at a file boundary?  Should that be done
unconditionally, or gated somehow (preferably with the default
behavior as it has always been, where you have to opt in to the new
behavior)?  And if gated, would it be by a new command-line argument,
or would it be something you can toggle on or off at will while m4 is
running, or both?  And does such concatenation work across frozen
files?  Should there be limits on what can be concatenated?  You
mention "macro" and "(args)" turning into a call of macro(), but would
"def" and "ine(args)" turn into a call of define(args); or what about
"changequote(<<,>>)<" and "<is this quoted>>"?  Do you really want
file B to behave differently because of whatever was left at the end
of file A?  Thus, my gut feel is that m4 is unlikely to change from
its current behavior, because the design costs outweigh the
corner-case benefits.

> However, the underlying error lies not with m4, but with the input
> file. According to POSIX, a valid nonempty text file must end with a
> newline. Other experiment suggests that m4 silently appends a missing
> newline. Should it not warn when it does so?

First off, it _is_ possible to define a macro that warns if it is
invoked with 0 arguments (ifelse on $# and errprint are handy).  But
that doesn't help you if you want to handle an arbitrary word, rather
than a known macro name, at the end of the file.  And that doesn't
help with your desire to put the macro name in one file and its
(arguments) in another.
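A minimal sketch of such a self-checking macro, assuming a hypothetical macro named "greet" and a made-up warning text (only ifelse, $# and errprint come from the suggestion above):

```shell
# Sketch only: "greet" and the message wording are hypothetical.
dir=$(mktemp -d)
cat > "$dir/warn.m4" <<'EOF'
changequote([,])dnl
define([greet], [ifelse([$#], [0],
  [errprint([warning: greet used without an argument list
])], [Hello, $1!])])dnl
greet(world)
greet
EOF

# Keep stdout and stderr apart so the warning is visible on its own.
warn_out=$(m4 "$dir/warn.m4" 2> "$dir/warn.err")
warn_err=$(cat "$dir/warn.err")
printf 'stdout: %s\n' "$warn_out"
printf 'stderr: %s\n' "$warn_err"

rm -rf "$dir"
```

Note that $# is 0 only when the name appears with no parenthesis at all; "greet()" counts as one empty argument, so it would not trigger the warning.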

Meanwhile, it _is_ a feature of GNU m4 that if the input lacks the
trailing newline, it will produce output that also lacks the trailing
newline (POSIX doesn't say whether that makes any sense - it says
portable use of m4 is limited to the input being a text file, but
not whether the output should be a text file; but that means it is
probably not portable to try to rely on that behavior).  So if we
started warning because of your situation, we might break others who
have come to rely on that extension behavior.  It's also possible that
even when your m4 file ends in a newline, you ended it with " dnl\n"
or had some use of m4wrap that does not itself produce a newline, so
that the output produced is NOT a text file, according to POSIX; POSIX
does not say whether that is well-defined behavior, but I suspect
there may be some non-GNU m4 implementations that supply a trailing
newline for a file when your m4 program doesn't, even though GNU m4
doesn't supply that newline.
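A quick way to observe this is to count output bytes.  The following is only a sketch with made-up sample text; with GNU m4 the first count should equal the input length (no newline added), while other implementations may behave differently:

```shell
# Sketch only: the sample strings are arbitrary.
# Input without a trailing newline: 19 bytes in.
no_nl=$(printf '%s' 'no trailing newline' | m4 | wc -c | tr -d ' ')

# Same text with a trailing newline: 20 bytes in.
with_nl=$(printf '%s\n' 'no trailing newline' | m4 | wc -c | tr -d ' ')

# Input ends in a newline, but dnl swallows it, so the output
# is just "text " and is again not a text file.
dnl_end=$(printf '%s\n' 'text dnl' | m4 | wc -c | tr -d ' ')

printf 'bytes out: %s %s %s\n' "$no_nl" "$with_nl" "$dnl_end"
```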

And lest you think that using m4 to produce output without a trailing
newline is the only worry, there are other ways in which you can
produce a non-text-file output: use of include, syscmd, or undivert to
produce NUL bytes in the output stream (even though GNU m4 itself may
not handle them gracefully), producing lines longer than the platform's
line-buffer limits (although GNU systems try not to have such limits,
there are other systems which do, and m4 recursion makes it very easy
to write something that expands to a large length at the expense of
some CPU churning), or encoding errors (GNU m4 is not yet upgraded to
the Unicode world, so it will happily break bytes apart and mangle
UTF-8 into mojibake if you are not careful).

The rest of my mail is a bit of a tangent: I've long wished that I had
a way to read in a file containing a blob of arbitrary text, and
sanitize it (with translit or patsubst) for further safe processing in
m4; my ideal way would be doing something like:

$ tail -n2 A_head.m4
dnl ...any other setup code above
changequote(`````,''''')define(`data',`````
$ head -n2 A_tail.m4
''''')changequote`'dnl now defn(`data') can access that blob as a
dnl single argument of whatever file(s) are passed in the middle
$ m4 A_head.m4 your_file_here A_tail.m4

where you can pick whatever complicated changequote needed so that you
don't have to worry about the contents of your_file_here inadvertently
triggering m4 syntax.  I can't quite do that with the include builtin,
where the file you just parsed in _will_ be parsed as m4, rather than
raw data.  But GNU m4's current refusal to allow concatenation across
file boundaries means that it errors out on an unterminated string
instead of letting my theoretical trick work, just the same as it
prevents you from continuing a macro name or its parameter list across
the file boundary.

You _can_ do tricks with changecom to do roughly the same: if you know
what the first few bytes of the file are, you can set up a changecom
that starts with those first few bytes and ends with a long sequence
unlikely to be in the middle of the included file - at which point you
can now treat the entire input file as a single m4 comment which
undergoes no further expansion, then trim off your suffix sequence as
you sanitize the data.  But then you are at the issue of how you
detect those first few bytes; perhaps syscmd or esyscmd can be used
for that, although it's another set of interesting language barriers
when you try to write m4 that produces valid shell code for grabbing
untrusted bytes in a sanitized manner.
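
The building block behind that trick can be shown within a single input: changecom with multi-character delimiters makes m4 copy everything between them through untouched.  This is only a sketch of the mechanism, with made-up delimiters "<<RAW" and "RAW>>" and a made-up macro "foo"; it does not attempt the include-based variant or the detection of a file's leading bytes.

```shell
# Sketch only: the delimiters and the "foo" macro are hypothetical.
dir=$(mktemp -d)
cat > "$dir/raw.m4" <<'EOF'
changequote([,])dnl
define([foo], [EXPANDED])dnl
changecom([<<RAW], [RAW>>])dnl
foo <<RAW foo RAW>> foo
EOF

# The two outer "foo" words are expanded; the one between the
# comment delimiters is passed through along with the delimiters.
raw_out=$(m4 "$dir/raw.m4")
printf '%s\n' "$raw_out"

rm -rf "$dir"
```

The delimiters themselves stay in the output, which is why the suffix still has to be trimmed when sanitizing the data.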

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org
