On Dec 13, 9:01 am, Ramdas <[EMAIL PROTECTED]> wrote:
> Hi Paul,
>
> I am cross posting the same to grab your attention at pyparsing forums
> too. 1000 apologies on the same count!
>
> I am a complete newbie to parsing and totally new to pyparsing.
>
> I have adapted your code to store the line numbers as below.
> Surprisingly, the line numbers printed, when I scrap some of the URLs,
> is not accurate and is kind of way off.
>
<snip>

Ramdas -

You will have to send me that URL off-list by e-mail; Google Groups
masks it and I can't pull it up. In my example, I used the Yahoo home
page. What is the URL you used, and which tags' results were off?

Just some comments:

- I did a quasi-verification of my results, using a quick-and-dirty re
  match. This did not give me the line numbers, but did give me counts
  of tag names (if anyone knows how to get the string location of an re
  match, this would be the missing link for an alternative solution to
  this problem). I added this code after the code I posted earlier:

    print "Quick-and-dirty verify using re's"
    import re
    openTagRe = re.compile("<([^ >/!]+)")
    tally2 = defaultdict(int)
    for match in openTagRe.findall(html):
        tally2[match] += 1

    for t in tally2.keys():
        print t,tally2[t],
        if tally2[t] != len(tagLocs[t]):
            print "<<<"
        else:
            print

  This crude verifier turned up no mismatches when parsing the Yahoo
  home page.

- Could the culprit be your unique function? You did not post the code
  for this, so I had to make up my own:

    def unique(lst):
        return sorted(list(set(lst)))

  This does trim some of the line numbers, but I did not try to
  validate this.

- In your getlinenos function, it is not necessary to call
  setParseAction every time. You only need to do this once, probably
  right after you define the tallyTagLineNumber function.

- Here is an abbreviated form of getlinenos:

    def getlinenos(page):
        # clear out tally dict, so as not to get crossover data from
        # a previously-parsed page
        tagLocs.clear()
        anyOpenTag.searchString(page)
        return dict((k,unique(v)) for k,v in tagLocs.items())

  If you wanted, you could even inline the unique logic, without too
  much obfuscation.

-- Paul
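
On the open question above about getting the string location of an re match: match objects expose the offset through start(), and re.finditer yields match objects where findall returns only the matched strings. The following is a minimal sketch of an re-only alternative, not the code from this thread. The function name tag_line_numbers and the sample string are made up for illustration. It reuses the crude open-tag pattern from the verifier above, so it inherits that pattern's limitations (for example, it would also match tags that appear inside comments or script text).

```python
import re
from collections import defaultdict

# Same crude open-tag pattern as the quick-and-dirty verifier.
open_tag_re = re.compile(r"<([^ >/!]+)")


def tag_line_numbers(html):
    """Map each tag name to the sorted, de-duplicated lines where it opens.

    Hypothetical helper for illustration; it is not part of pyparsing
    or of the code posted earlier in the thread.
    """
    tag_locs = defaultdict(set)
    for match in open_tag_re.finditer(html):
        # match.start() is the 0-based offset of the '<' character.
        # Counting the newlines before it gives a 1-based line number.
        lineno = html.count("\n", 0, match.start()) + 1
        tag_locs[match.group(1)].add(lineno)
    # Using sets above makes a separate unique() step unnecessary.
    return dict((tag, sorted(lines)) for tag, lines in tag_locs.items())


sample = "<html>\n<body>\n<p>one</p>\n<p>two</p> <b>x</b>\n</body>\n</html>\n"
result = tag_line_numbers(sample)
```

For the sample string, result maps "p" to lines 3 and 4 and "b" to line 4. Counting newlines from the start of the page for every match is quadratic in the worst case, which is usually acceptable for a single page.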