On Fri, Dec 06, 2024 at 05:33:09PM -0500, Douglas McIlroy wrote:
> Arguments are collected if a macro name is "immediately" followed by a
> left parenthesis. Experiment shows that arguments are not collected
> when a macro name occurs at the end of a file (without a following
> newline) and the next input file begins with a left parenthesis. I
> believe this behavior is incorrect.
>

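For reference, the reported behavior can be reproduced with a sketch
along these lines.  The file names a.m4 and b.m4 and the macro foo are
made up for illustration; the only assumption is a GNU m4 on PATH.

```shell
# a.m4 ends with the macro name `foo` and NO trailing newline.
printf 'changequote([,])define([foo],[<$1>])dnl\nfoo' > a.m4
# b.m4 begins with the would-be argument list.
printf '(x)\n' > b.m4

# Two separate input files: the token scan stops at the end of a.m4.
m4 a.m4 b.m4

# One concatenated stream: foo is immediately followed by "(".
cat a.m4 b.m4 | m4
```

With GNU m4 1.4.x I would expect the first command to print "<>(x)"
(foo expanded with no arguments) and the second to print "<x>".
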
GNU m4 is not injecting a missing newline, so much as its input scanner
refuses to recognize tokens that would extend across file boundaries.
And, as you correctly noted, POSIX says the behavior is undefined if the
input lacks an ending newline, so there we are free to make it do
whatever is more useful.

It's not just file boundaries: at least m4wrap() behaves the same way,
by injecting an artificial end-of-token boundary between the first and
second layers of nested unwrapping.  Consider this example:

$ echo 'changequote([,])define([ab],[AB])m4wrap(b)m4wrap(a)' | m4

which outputs "\nAB", but:

$ echo 'changequote([,])define([ab],[AB])m4wrap([m4wrap(b)a])' | m4

changes the output to "\nab".

What's going on?  Under the hood, the output "a" from one m4wrap is
concatenated with whatever text is next in the input stream from any
other LIFO m4wrap()s; so the first example shows a and b adjacent after
the two m4wrap'd text strings are replayed in reverse order, turning
into a macro expansion.  But in the second example, even though there
are no intervening characters in the output stream, there WAS an
intervention - m4 reached the "end-of-file" of the first layer of
m4wrap, and proceeded to dive into the second layer of m4wrap, which was
a stronger boundary than two wraps at the same nesting layer.

But one of the powers of m4 is the ability to create macro names by
concatenation on rescan.  It seems odd that end-of-file should interfere
with what is normally m4's strong point of rescanning unquoted output to
see if new macro calls appear.

Would you like to submit a patch to lift the artificial limitation of
splitting tokens at a file boundary?  Should that be done
unconditionally, or gated somehow (preferably with the default behavior
as it has always been, where you have to opt in to the new behavior)?
And if gated, would it be by a new command-line argument, or would it be
something you can toggle on or off at will while m4 is running, or both?
And does such concatenation work across frozen files?  Should there be
limits on what can be concatenated?  You mention "macro" and "(args)"
turning into a call of macro(), but would "def" and "ine(args)" turn
into a call of define(args); or what about "changequote(<<,>>)<" and
"<is this quoted>>"?  Do you really want file B to behave differently
because of whatever was left at the end of file A?

Thus, my gut feel is that m4 is unlikely to change from its current
behavior, because the design costs outweigh the corner-case benefits.

> However, the underlying error lies not with m4, but with the input
> file. According to POSIX, a valid nonempty text file must end with a
> newline. Other experiment suggests that m4 silently appends a missing
> newline. Should it not warn when it does so?

First off, it _is_ possible to define a macro that warns if it is
invoked with 0 arguments (ifelse on $# and errprint are handy).  But
that doesn't help you if you want to handle an arbitrary word, rather
than a known macro name, at the end of the file.  And that doesn't help
with your desire to put the macro name in one file and its (arguments)
in another.

Meanwhile, it _is_ a feature of GNU m4 that if the input lacks the
trailing newline, it will produce output that also lacks the trailing
newline (POSIX doesn't say whether that makes any sense - it says
portable use of m4 is limited to the input being a text file, but not
whether the output should be a text file; but that means it is probably
not portable to try to rely on that behavior).  So if we started warning
because of your situation, we might break others who have come to rely
on that extension behavior.

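As a rough sketch of the "ifelse on $# and errprint" idea mentioned
above: the macro name greet, the file name warn0.m4, and the wording of
the message are all invented for illustration.  It relies on $# being 0
when the macro name is not followed by a left parenthesis.

```shell
# Write a small m4 program; the quoted 'EOF' keeps the shell from
# touching $#, $1, or the backquotes.
cat > warn0.m4 <<'EOF'
define(`greet', `ifelse($#, 0,
  `errprint(`warning: macro invoked without arguments
')',
  `hello, $1')')dnl
greet(world)
greet
EOF

# The first call expands normally on stdout; the bare name expands to
# nothing on stdout and sends the warning to stderr.
m4 warn0.m4
```

As noted, this only covers macro names you already know about and wrap
this way; it does nothing for an arbitrary word at the end of a file.
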
It's also possible that even when your m4 file ends in a newline, you
ended it with " dnl\n" or had some use of m4wrap that does not itself
produce a newline, so that the output produced is NOT a text file,
according to POSIX; POSIX does not say whether that is well-defined
behavior, but I suspect there may be some non-GNU m4 implementations
that supply a trailing newline for a file when your m4 program doesn't,
even though GNU m4 doesn't supply that newline.

And lest you think that using m4 to produce output without a trailing
newline is the only worry, there are other ways in which you can produce
a non-text-file output: use of include, syscmd, or undivert to produce
NUL bytes in the output stream (even though GNU m4 itself may not handle
them gracefully), producing lines longer than the platform's line-buffer
limits (although GNU systems try not to have such limits, there are
other systems which do, and m4 recursion makes it very easy to write
something that expands to a large length at the expense of some CPU
churning), or encoding errors (GNU m4 is not yet upgraded to the Unicode
world, so it will happily break bytes apart and mangle UTF-8 into
mojibake if you are not careful).

The rest of my mail is a bit of a tangent: I've long wished that I had a
way to read in a file containing a blob of arbitrary text, and sanitize
it (with translit or patsubst) for further safe processing in m4; my
ideal way would be doing something like:

$ tail -n2 A_head.m4
dnl ...any other setup code above
changequote(`````,''''')define(`data',`````
$ head -n2 A_tail.m4
''''')changequote`'dnl now defn(`data') can access that blob as a
dnl single argument of whatever file(s) are passed in the middle
$ m4 A_head.m4 your_file_here A_tail.m4

where you can pick whatever complicated changequote is needed so that
you don't have to worry about the contents of your_file_here
inadvertently triggering m4 syntax.
I can't quite do that with the include builtin, where the file you just
parsed in _will_ be parsed as m4, rather than raw data.  But GNU m4's
current refusal to allow concatenation across file boundaries means that
it errors out on an unterminated string instead of letting my
theoretical trick work, just the same as it prevents you from continuing
a macro name or its parameter list across the file boundary.

You _can_ do tricks with changecom to do roughly the same: if you know
what the first few bytes of the file are, you can set up a changecom
that starts with those first few bytes and ends with a long sequence
unlikely to be in the middle of the included file - at which point you
can now treat the entire input file as a single m4 comment which
undergoes no further expansion, then trim off your suffix sequence as
you sanitize the data.  But then you are at the issue of how you detect
those first few bytes; perhaps syscmd or esyscmd can be used for that,
although it's another set of interesting language barriers when you try
to write m4 that produces valid shell code for grabbing untrusted bytes
in a sanitized manner.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org