Hi Mark,

On Wed, Feb 19, 2020 at 10:40 AM Mark Dilger
<mark.dil...@enterprisedb.com> wrote:
> This would only make sense to me if the string held in $_ had already been 
> checked for safety, but Catalog.pm does very little to verify that the string 
> is safe to eval.  The assumption here, so far as I can infer, is that we 
> don’t embed anything dangerous in our .dat files, so this should be ok.  That 
> may be true for the moment, but I can imagine a day when we start embedding 
> perl functions as quoted text inside a data file, or shell commands as text 
> which look enough like perl for eval() to be able to execute them.  
> Developers who edit these .dat files and mess up the quoting, and then rerun 
> ‘make’ to get the new .c and .h files generated, may not like the side 
> effects.  Perhaps I’m being overly paranoid….

The use case for that seems slim. However, at a brief glance your
module seems more robust in other ways.
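
To spell out the hazard for onlookers: Catalog.pm effectively does a
stringy eval of each entry, so anything executable that sneaks into a
data literal gets run at 'make' time. A contrived sketch (nothing in
our actual .dat files looks like this):

# Hypothetical .dat entry with embedded backticks:
my $line = q!{ descr => 'x' . `touch /tmp/oops`, oid => 1 }!;

# Catalog.pm-style parsing; the backticks run a shell command here.
my $hash_ref;
eval '$hash_ref = ' . $line;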

> Rather than add more code generation logic based on the design of Catalog.pm, 
> I wrote a perl based data file parser that parses .dat files and returns 
> vivified perl data, as Catalog.pm does, but with much stricter parsing logic 
> to make certain nothing dangerous gets eval()ed.  I put the new module in 
> DataFile.pm.
> [...]
> The new parser is more flexible about the structure of the data, which seems 
> good to me for making it easier to add or modify data files in the future.  
> The new parser does not yet have a means of hacking up the data to add 
> autogenerated fields and such that Catalog.pm does, but I think a more clean 
> break between parsing and autovivifying fields would be good anyway.

Separation of concerns sounds like a good idea, but I haven't fully
thought it through. As one advantage, I think it would be nicer to
have indexing.dat and toasting.dat files instead of having to dig that
info out of C macros -- something that became evident while I was
recently experimenting with generating catcache control data.
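
For example (a purely hypothetical format -- no such file exists
today), the info that indexing.h currently buries in the
DECLARE_UNIQUE_INDEX macro for pg_class_oid_index could live in an
entry like:

# Hypothetical indexing.dat entry; the field names are invented:
{ name => 'pg_class_oid_index', oid => '2662', unique => 't',
  table => 'pg_class', using => 'btree', columns => 'oid oid_ops' },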

As for the patch, I have not done a full review, but I have some
comments based on a light read-through:

utils/Makefile:

+# location of commandtag.dat
+headerdir = $(top_srcdir)/src/include/utils

This variable name is too generic for what the comment says it is. A
better abstraction, if we want one, would be the full path to the
commandtag input file. The other script invocations in this Makefile
don't do it this way, but that's a separate patch.

+# location to write generated headers
+sourcedir = $(top_srcdir)/src/backend/utils

Calling the output the source is bound to confuse people. The comment
also implies it covers all generated headers, not just the ones
introduced by the patch. I would just output to the current directory
(i.e. have an --output option defaulting to the empty string). Also,
if we want to output somewhere else, I would imagine it'd be under the
top builddir, not srcdir.
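
What I have in mind is the pattern genbki.pl already uses (a sketch;
the option names here are placeholders):

use strict;
use warnings;
use Getopt::Long;

my $output_path = '';    # default: write to the current directory
my $input_file;
GetOptions(
	'output:s'    => \$output_path,
	'inputfile:s' => \$input_file) or die "wrong arguments";

# Make sure output_path ends in a slash so file names append cleanly.
$output_path .= '/'
  if $output_path ne '' && substr($output_path, -1) ne '/';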

+$(PERL) -I $(top_srcdir)/src/include/utils $<
--headerdir=$(headerdir) --sourcedir=$(sourcedir)
--inputfile=$(headerdir)/commandtag.dat

1. headerdir is entirely unused by the script
2. We can default to working dir for the output as mentioned above
3. -I $(top_srcdir)/src/include/utils is supposed to point to the dir
containing DataFile.pm, but since gencommandtag.pl has "use lib..."
it's probably not needed here. I don't recall why we keep the "-I"
elsewhere. (ditto in Solution.pm)
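
By "use lib..." I'm assuming the script carries the usual FindBin
idiom, something like the following, with the relative path adjusted
to wherever DataFile.pm ends up living:

use FindBin;
use lib "$FindBin::RealBin/../../include/utils";    # locate DataFile.pm
use DataFile;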

I'm thinking it would look something like this:

+$(PERL) $< --inputfile=$(top_srcdir)/src/include/utils/commandtag.dat

--
utils/misc/Makefile

+all: distprep
+
 # Note: guc-file.c is not deleted by 'make clean',
 # since we want to ship it in distribution tarballs.
 clean:
  @rm -f lex.yy.c
+
+maintainer-clean: clean

These added targets seem to be non-functional.

--
DataFile.pm

I haven't studied this in detail, but I'll suggest that if this is
meant to have wider application, maybe it should live in src/tools?

I'm not familiar with using different IO routines depending on the OS
-- what's the benefit of that?

--
gencommandtag.pl

slurp_without_comments() is unused.

sanity_check_data() seems longer than the main body of the script
(minus header boilerplate), and I wonder if we can pare it down some.
For one, I have difficulty imagining anyone would accidentally type an
unprintable or non-ascii character in a command tag and somehow not
realize it. For another, duplicating checks that were done earlier
seems like a maintenance headache.
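
For concreteness, I mean checks along these lines (my paraphrase, not
the patch's exact code):

# Reject anything outside printable ASCII in a tag name.
die "tag \"$tag\" contains a non-ASCII or unprintable character\n"
  if $tag =~ /[^\x20-\x7E]/;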

dataerror() is defined near the top, but other functions are defined
at the bottom.

+# Generate all output internally before outputting anything, to avoid
+# partially overwriting generated files under error conditions

My personal preference is that having this as a design requirement
sacrifices readability for unclear gain, especially since a "chunk"
also includes things like header boilerplate. That said, the script is
short enough that it doesn't make a huge difference either way.
Speaking of boilerplate, it's better for readability to separate it
from actual code, such as here:

typedef enum CommandTag
{
#define FIRST_COMMANDTAG COMMANDTAG_$sorted[0]->{taglabel})
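
What I'd find more readable is printing each piece as we go, with the
boilerplate coming from a helper. A sketch, where boilerplate_header()
and $output_file are hypothetical and the variable names are borrowed
from the fragment above:

my $output_file = 'commandtag.h';    # wherever --output points
open my $fh, '>', $output_file or die "cannot open $output_file: $!";

# Shared boilerplate lives in one (hypothetical) helper ...
print $fh boilerplate_header('commandtag.h');

# ... so the interesting output is plainly visible at the call site.
print $fh "typedef enum CommandTag\n{\n";
print $fh "#define FIRST_COMMANDTAG COMMANDTAG_$sorted[0]->{taglabel}\n";
print $fh "\tCOMMANDTAG_$_->{taglabel},\n" foreach @sorted;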

--
tcop/dest.c

+ * We no longer display LastOid, but to preserve the wire protocol,
+ * we write InvalidOid where the LastOid used to be written.  For
+ * efficiency in the snprintf(), hard-code InvalidOid as zero.

Hmm, is hard-coding zero going to make any difference here?

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

