On Fri, Apr 2, 2010 at 3:53 PM, Michael Tharp <g...@partiallystapled.com> wrote: > Most Esteemed Hackers: > > Due to popular demand on #postgresql (by which I mean David Fetter), I have > been spending a little time making the internal SQL parser available to > clients via a C-language SQL function. The function itself is extremely > simple: just a wrapper around a call to raw_parser followed by nodeToString.
Seems reasonable. > Most of the "hard stuff" has been in parsing the output of nodeToString on > the client side. So, I have a few questions to help gauge interest in > related patches: > > Is there interest in a patch to extend nodes/outfuncs.c with support for > serializing more node types? Coverage has been pretty good so far but > various utility statements and their related nodes are missing, e.g. > AlterTableStmt and GrantStmt. I expect that this will be the least > contentious suggestion. This wouldn't bother me provided the code footprint is small. I would be against adding a lot of complexity for this. > The nodeToString format as it stands is somewhat ambiguous with respect to > the type of a node member's value if one does not have access to > readfuncs.c. For example, a T_BitString called foo is serialized as ':foo > b1010' while a char * containing 'b1010' is also serialized as ':foo b1010'. > This may just mean that _outToken needs to escape the leading 'b'. A similar > problem exists for booleans ('true' as a string vs. as a boolean). I am not inclined to change this. Turning the format into something self-describing seems to me to be significant work and a significant compatibility break for a very small amount of benefit. > Additionally, values may span more than one token for certain types e.g. > Datum (":constvalue 4 [ 16 0 0 0 ]"). Plan trees have a few types that don't > have a corresponding read function and output an array of space-separated > integers. PlanInvalItem even seems to use a format containing parentheses, > which the tokenizer splits as if it were a list. While most of these only > occur in plan nodes and thus don't affect my use case (Datum being the > exception), it would be ideal if they could be parsed more > straightforwardly. I'm not inclined to change this, either. > These last two problems perhaps can be worked around by escaping more things > in _outToken, but maybe it would be smarter to make the fields > self-descriptive in terms of type. For example, the field names could be > prefixed with a short string describing its type, which in most cases would > be a single character, e.g. 's:schemaname' for a char*, 'b:true' for a bool, > 'n:...' for any node (including Value nodes), or longer strings for less > commonly used types like the integer arrays in plan nodes (although these > would probably be better as a real integer list). These could be used to > unambiguously parse individual tokens and also to determine how many or what > kind of token to expect for multi-token values such as Datum which would > otherwise require guessing. Does this seem reasonable? Is there another > format that might make more sense? This seems ugly to me and I don't see the utility of it. > As far as I can tell, the current parser in nodes/read.c ignores the field > names entirely, so this can be done without changing postgres' own parsing > code at all and without affecting backwards compatibility of any stored > trees. Does anyone else out there use nodeToString() output in their own > tools, and if so, does this make your life easier or harder? > > Lastly, I'll leave a link to my WIP implementation in case anyone is > interested: > http://bitbucket.org/gxti/parse_sql/src/ > Currently I'm working on adding support for cooked parse trees and figuring > out what, if anything, I need to do to support multibyte encodings. My > personal use is for parsing DDL so the input is decidedly not hostile but > I'd still like to make this a generally useful module. > > Thanks in advance for any comments, tips, or flames sent my way. Thanks for having a thick skin. :-) I'm having a hard time imaging what you could use this for without encoding a lot of information about the meaning of particular constructs. In which case the self-describing stuff is not needed. As you point out downthread, if all you want to do is compare, it's not needed either. ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers