Hi Mark,

On Wed, Feb 19, 2020 at 10:40 AM Mark Dilger
<mark.dil...@enterprisedb.com> wrote:

> This would only make sense to me if the string held in $_ had already
> been checked for safety, but Catalog.pm does very little to verify
> that the string is safe to eval. The assumption here, so far as I can
> infer, is that we don’t embed anything dangerous in our .dat files, so
> this should be ok. That may be true for the moment, but I can imagine
> a day when we start embedding perl functions as quoted text inside a
> data file, or shell commands as text which look enough like perl for
> eval() to be able to execute them. Developers who edit these .dat
> files and mess up the quoting, and then rerun ‘make’ to get the new .c
> and .h files generated, may not like the side effects. Perhaps I’m
> being overly paranoid….
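For anyone who hasn't looked at Catalog.pm closely, the failure mode
would look something like this (a contrived sketch, not actual
Catalog.pm code or a real .dat entry):

use strict;
use warnings;

# Pretend this string was slurped from a .dat file.  It still parses
# as an innocent-looking hash constructor, but evaluating it runs a
# shell command as a side effect.
my $datum = q[{ name => 'foo', descr => scalar qx{touch /tmp/oops} }];

my $hash = eval $datum;    # the qx{} executes here
die $@ if $@;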
The use case for that seems slim. However, at a brief glance your
module seems more robust in other ways.

> Rather than add more code generation logic based on the design of
> Catalog.pm, I wrote a perl based data file parser that parses .dat
> files and returns vivified perl data, as Catalog.pm does, but with
> much stricter parsing logic to make certain nothing dangerous gets
> eval()ed. I put the new module in DataFile.pm.
> [...]
> The new parser is more flexible about the structure of the data,
> which seems good to me for making it easier to add or modify data
> files in the future. The new parser does not yet have a means of
> hacking up the data to add autogenerated fields and such that
> Catalog.pm does, but I think a more clean break between parsing and
> autovivifying fields would be good anyway.

Separation of concerns sounds like a good idea, but I've not fully
thought it through. One advantage: I think it might be nicer to have
indexing.dat and toasting.dat instead of having to dig that info out
of C macros. This was evident while recently experimenting with
generating catcache control data.

As for the patch, I have not done a full review, but I have some
comments based on a light read-through:

utils/Makefile:

+# location of commandtag.dat
+headerdir = $(top_srcdir)/src/include/utils

This variable name is too generic for what the comment says it holds.
A better abstraction, if we want one, would be the full path to the
commandtag input file. (The other script invocations in this Makefile
don't do it this way, but that's a matter for a separate patch.)

+# location to write generated headers
+sourcedir = $(top_srcdir)/src/backend/utils

Calling the output the "source" is bound to confuse people, and the
comment implies all generated headers, not just the ones introduced by
this patch. I would just output to the current directory (i.e. have an
--output option that defaults to the empty string). Also, if we do
want to output somewhere else, I would expect it to be under the top
builddir, not srcdir.

+$(PERL) -I $(top_srcdir)/src/include/utils $< --headerdir=$(headerdir) --sourcedir=$(sourcedir) --inputfile=$(headerdir)/commandtag.dat

1. headerdir is entirely unused by the script.

2. We can default to the working directory for the output, as
mentioned above.

3. -I $(top_srcdir)/src/include/utils is supposed to point to the
directory containing DataFile.pm, but since gencommandtag.pl has "use
lib ...", it's probably not needed here. I don't recall why we keep
the "-I" elsewhere. (Ditto in Solution.pm.)

I'm thinking the invocation would look something like this:

+$(PERL) $< --inputfile=$(top_srcdir)/src/include/utils/commandtag.dat

--

utils/misc/Makefile:

+all: distprep
+
 # Note: guc-file.c is not deleted by 'make clean',
 # since we want to ship it in distribution tarballs.
 clean:
 	@rm -f lex.yy.c
+
+maintainer-clean: clean

This hunk seems non-functional.

--

DataFile.pm:

I haven't studied this in detail, but I'll suggest that if it is meant
to have wider application, maybe it should live in src/tools?

I'm not familiar with using different IO routines depending on the
OS -- what's the benefit of that?

--

gencommandtag.pl:

slurp_without_comments() is unused.

sanity_check_data() seems longer than the main body of the script
(minus header boilerplate), and I wonder if we can pare it down some.
For one, I have difficulty imagining anyone would accidentally type an
unprintable or non-ascii character in a command tag and somehow not
realize it. For another, duplicating checks that were done earlier
seems like a maintenance headache.
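If we keep any of it, a single pattern match would cover the whole
printable-ASCII question, something like this (a sketch, assuming the
tag text is in $tag and that the patch's dataerror() takes a message
string):

# One pass over the whole tag: anything outside printable ASCII is an
# error, with no need for separate per-character-class checks.
dataerror("tag \"$tag\" contains non-printable or non-ascii characters")
    if $tag !~ /\A[\x20-\x7E]+\z/;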
dataerror() is defined near the top, but the other functions are
defined at the bottom.

+# Generate all output internally before outputting anything, to avoid
+# partially overwriting generated files under error conditions

My personal preference: having this as a design requirement sacrifices
readability for unclear gain, especially since a "chunk" also includes
things like header boilerplate. That said, the script is short enough
that it doesn't make a huge difference either way.

Speaking of boilerplate, it's better for readability to separate that
from actual code, such as:

typedef enum CommandTag
{
#define FIRST_COMMANDTAG COMMANDTAG_$sorted[0]->{taglabel})

--

tcop/dest.c:

+ * We no longer display LastOid, but to preserve the wire protocol,
+ * we write InvalidOid where the LastOid used to be written. For
+ * efficiency in the snprintf(), hard-code InvalidOid as zero.

Hmm, is hard-coding zero going to make any difference here?

--
John Naylor                https://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
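P.S. On the buffer-everything-before-writing point above: if the goal
is just to avoid leaving a partially written file behind on error, the
usual trick is to print to a temporary file as you go and rename() it
into place only after everything has succeeded. A sketch, not code
from the patch (generate_lines() is a stand-in for whatever produces
the output):

use strict;
use warnings;

my $output = 'commandtag.h';

# Write to a temp file first; the rename over the old file happens
# only if every earlier step succeeded, so a mid-run failure leaves
# any previously generated file intact.
open my $fh, '>', "$output.tmp"
    or die "could not open $output.tmp: $!";
print $fh $_ for generate_lines();    # hypothetical generator
close $fh or die "could not close $output.tmp: $!";
rename("$output.tmp", $output)
    or die "could not rename $output.tmp: $!";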