On Mar 22, 4:11 pm, rh0dium <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I am struggling with parsing the following data:
> <snip>
> As a side note: Is this the right approach to using pyparsing. Do we
> start from the inside and work our way out or should I have started
> with looking at the bigger picture ( keyword + "{" + OneOrMore key /
> vals + "}" + ) I started there but could figure out how to look
> multiline - I'm assuming I'd just join them all up?
>
> Thanks

I think your "inside-out" approach is just fine. Start by composing
expressions for the different "pieces" of your input text, then
steadily build up more and more complex forms.

I think the main complication you have is that of using
commaSeparatedList for your list of real numbers. commaSeparatedList
is a very generic helper expression. From the online example
(http://pyparsing.wikispaces.com/space/showimage/commasep.py), here is
a sample of the data that commaSeparatedList will handle:

    "a,b,c,100.2,,3",
    "d, e, j k , m ",
    "'Hello, World', f, g , , 5.1,x",
    "John Doe, 123 Main St., Cleveland, Ohio",
    "Jane Doe, 456 St. James St., Los Angeles , California ",

In other words, the content of the items between commas is pretty much
anything that is *not* a comma. If you change your definition of
atflist to:

    atflist = Suppress("(") + commaSeparatedList # + Suppress(")")

(that is, comment out the trailing right paren), you'll get this
successful parse result:

    ['0.21', '0.24', '0.6', '0.24', '0.24', '0.6)']

Note the ")" stuck to the last item: commaSeparatedList has swallowed
the closing paren, so your original Suppress(")") never gets a chance
to match.

In your example, you are parsing a list of floating point numbers, in
a list delimited by commas, surrounded by parens. This definition of
atflist should give you more control over the parsing process, and
give you real floats to boot:

    floatnum = Combine(Word(nums) + "." + Word(nums) +
                       Optional('e' + oneOf("+ -") + Word(nums)))
    floatnum.setParseAction(lambda t: float(t[0]))
    atflist = Suppress("(") + delimitedList(floatnum) + Suppress(")")

Now I get this output for your parse test:

    [0.20999999999999999, 0.23999999999999999, 0.59999999999999998,
    0.23999999999999999, 0.23999999999999999, 0.59999999999999998]

So you can see that this has actually parsed the numbers and converted
them to floats. I went ahead and added support for scientific notation
in floatnum, since I see that you have several atfvalues that are
standalone floats, some using scientific notation.

To handle those standalone floats, just expand atfvalues to:

    atfvalues = ( floatnum | Word(nums) | atfstr | atflist )

(floatnum has to come before Word(nums): '|' takes the first
alternative that matches, and Word(nums) would happily match just the
leading digits of a float.)

(At this point, I'll go on to show how to parse the rest of the data
structure - if you want to take a stab at it yourself, stop reading
here, and then come back to compare your results with my approach.)

To parse the overall structure, now that you have expressions for the
different component pieces, look into using Dict (or more simply using
the helper function dictOf) to define results names automagically for
you based on the attribute names in the input. Dict does *not* change
any of the parsing or matching logic, it just adds named fields in the
parsed results corresponding to the key names found in the input.

Dict is a complex pyparsing class, but dictOf simplifies things.
dictOf takes two arguments:

    dictOf(keyExpression, valueExpression)

This translates to:

    Dict( OneOrMore( Group(keyExpression + valueExpression) ) )

For example, to parse the lists of entries that look like:

    name = "gtc"
    dielectric = 2.75e-05
    unitTimeName = "ns"
    timePrecision = 1000
    unitLengthName = "micron"
    etc.

just define that this is "a dict of entries each composed of a key
consisting of a Word(alphas), followed by a suppressed '=' sign and an
atfvalues", that is:

    attrDict = dictOf(Word(alphas), Suppress("=") + atfvalues)

dictOf takes care of all of the repetition and grouping necessary for
Dict to do its work.
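
Here is a small self-contained sketch of attrDict on its own, so you
can try it without the rest of your grammar. Two assumptions on my
part: I don't have your atfstr definition, so I substitute a quoted
string with its quotes stripped, and I leave atflist out to keep the
example short. The sample text is a trimmed-down copy of the entries
shown above.

```python
from pyparsing import (Combine, Optional, Suppress, Word, alphas,
                       dictOf, nums, oneOf, quotedString, removeQuotes)

floatnum = Combine(Word(nums) + "." + Word(nums) +
                   Optional("e" + oneOf("+ -") + Word(nums)))
floatnum.setParseAction(lambda t: float(t[0]))

# Assumed stand-in for the poster's atfstr (not shown in the thread).
atfstr = quotedString.copy().setParseAction(removeQuotes)

# floatnum goes first so that "2.75e-05" is not cut short by Word(nums).
atfvalues = floatnum | Word(nums) | atfstr

attrDict = dictOf(Word(alphas), Suppress("=") + atfvalues)

sample = '''
name = "gtc"
dielectric = 2.75e-05
timePrecision = 1000
'''

attrs = attrDict.parseString(sample)
print(attrs.dump())

# Each key found in the input is now a results name.
print(attrs["name"], attrs["dielectric"], attrs["timePrecision"])
```

Note that timePrecision comes back as a string, since Word(nums) has
no parse action attached; only floatnum converts its match.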

These attribute dicts are nested within an outer main dict, which is
"a dict of entries, each with a key of Word(alphas), and a value of an
optional quotedString (an alias, perhaps?), a left brace, an attrDict,
and a right brace," or:

    mainDict = dictOf(
        Word(alphas),
        Optional(quotedString)("alias") +
            Suppress("{") + attrDict + Suppress("}")
        )

By adding this code to what you already have:

    attrDict = dictOf(Word(alphas), Suppress("=") + atfvalues)
    mainDict = dictOf(
        Word(alphas),
        Optional(quotedString)("alias") +
            Suppress("{") + attrDict + Suppress("}")
        )

You can now write:

    md = mainDict.parseString(test1)
    print(md.dump())
    print(md.Layer.lineStyle)

and get this output:

    [['Technology', ['name', 'gtc'], ['dielectric', 2.7500000000000001e-005], ['unitTimeName', 'ns'], ['timePrecision', '1000'], ['unitLengthName', 'micron'], ['lengthPrecision', '1000'], ['gridResolution', '5'], ['unitVoltageName', 'v'], ['voltagePrecision', '1000000'], ['unitCurrentName', 'ma'], ['currentPrecision', '1000'], ['unitPowerName', 'pw'], ['powerPrecision', '1000'], ['unitResistanceName', 'kohm'], ['resistancePrecision', '10000000'], ['unitCapacitanceName', 'pf'], ['capacitancePrecision', '10000000'], ['unitInductanceName', 'nh'], ['inductancePrecision', '100']], ['Tile', 'unit', ['width', 0.22], ['height', 1.6899999999999999]], ['Layer', 'PRBOUNDARY', ['layerNumber', '0'], ['maskName', ''], ['visible', '1'], ['selectable', '1'], ['blink', '0'], ['color', 'cyan'], ['lineStyle', 'solid'], ['pattern', 'blank'], ['pitch', '0'], ['defaultWidth', '0'], ['minWidth', '0'], ['minSpacing', '0']]]
    - Layer: ['PRBOUNDARY', ['layerNumber', '0'], ['maskName', ''], ['visible', '1'], ['selectable', '1'], ['blink', '0'], ['color', 'cyan'], ['lineStyle', 'solid'], ['pattern', 'blank'], ['pitch', '0'], ['defaultWidth', '0'], ['minWidth', '0'], ['minSpacing', '0']]
      - alias: PRBOUNDARY
      - blink: 0
      - color: cyan
      - defaultWidth: 0
      - layerNumber: 0
      - lineStyle: solid
      - maskName:
      - minSpacing: 0
      - minWidth: 0
      - pattern: blank
      - pitch: 0
      - selectable: 1
      - visible: 1
    - Technology: [['name', 'gtc'], ['dielectric', 2.7500000000000001e-005], ['unitTimeName', 'ns'], ['timePrecision', '1000'], ['unitLengthName', 'micron'], ['lengthPrecision', '1000'], ['gridResolution', '5'], ['unitVoltageName', 'v'], ['voltagePrecision', '1000000'], ['unitCurrentName', 'ma'], ['currentPrecision', '1000'], ['unitPowerName', 'pw'], ['powerPrecision', '1000'], ['unitResistanceName', 'kohm'], ['resistancePrecision', '10000000'], ['unitCapacitanceName', 'pf'], ['capacitancePrecision', '10000000'], ['unitInductanceName', 'nh'], ['inductancePrecision', '100']]
      - capacitancePrecision: 10000000
      - currentPrecision: 1000
      - dielectric: 2.75e-005
      - gridResolution: 5
      - inductancePrecision: 100
      - lengthPrecision: 1000
      - name: gtc
      - powerPrecision: 1000
      - resistancePrecision: 10000000
      - timePrecision: 1000
      - unitCapacitanceName: pf
      - unitCurrentName: ma
      - unitInductanceName: nh
      - unitLengthName: micron
      - unitPowerName: pw
      - unitResistanceName: kohm
      - unitTimeName: ns
      - unitVoltageName: v
      - voltagePrecision: 1000000
    - Tile: ['unit', ['width', 0.22], ['height', 1.6899999999999999]]
      - alias: unit
      - height: 1.69
      - width: 0.22
    solid

Cheers!
-- Paul
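
P.S. For anyone following along without the original test1 data, the
nested mainDict/attrDict structure can be tried out with a sketch like
the one below. The sample text is invented (two short blocks modeled
on the Technology and Tile entries in the dump), and qstr is my
assumed stand-in for a quoted string with the quotes stripped; neither
comes from the original grammar.

```python
from pyparsing import (Combine, Optional, Suppress, Word, alphas,
                       dictOf, nums, oneOf, quotedString, removeQuotes)

floatnum = Combine(Word(nums) + "." + Word(nums) +
                   Optional("e" + oneOf("+ -") + Word(nums)))
floatnum.setParseAction(lambda t: float(t[0]))

# Assumption: quoted strings are returned without their quotes.
qstr = quotedString.copy().setParseAction(removeQuotes)

atfvalues = floatnum | Word(nums) | qstr

# Inner dict: key = value entries.
attrDict = dictOf(Word(alphas), Suppress("=") + atfvalues)

# Outer dict: keyword, optional quoted alias, braces around an attrDict.
mainDict = dictOf(
    Word(alphas),
    Optional(qstr)("alias") + Suppress("{") + attrDict + Suppress("}"))

# Invented sample, modeled on the blocks shown in the dump.
sample = '''
Technology {
    name = "gtc"
    dielectric = 2.75e-05
}
Tile "unit" {
    width = 0.22
    height = 1.69
}
'''

md = mainDict.parseString(sample)
print(md.dump())

# Nested lookups go through the outer key, then the inner key.
print(md["Tile"]["alias"], md["Tile"]["width"])
print(md["Technology"]["name"])
```

Blocks without a quoted alias, like Technology here, simply have no
"alias" entry in their results.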