On Thu, 05 Apr 2007 18:08:46 -0300, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

> I am trying to write a parser for a text string.  Specifically, I am
> trying to take a filename that contains meta-data about the content of
> the A/V file (mpg, mp3, etc.).
>
> I first split the filename into fields separated by spaces and dots.
>
> Then I have a series of regular expression matches.  I like
> Cartesian's 'event-based' parser approach though the event table gets a
> bit unwieldy as it grows.  Also, I would prefer to have the 'action'
> result in a variable assignment specific to the test.  E.g.
>
> def parseName(name):
>     fields = sd.split(name)
>     fields, ext = fields[:-1], fields[-1]
>     year = ''
>     capper = ''
>     series = None
>     episodeNum = None
>     programme = ''
>     episodeName = ''
>     past_title = False
>     for f in fields:
>         if year_re.match(f):
>             year = f
>             past_title = True
>         else:
>             my_match = capper_re.match(f)
>             if my_match:
>                 capper = my_match.group(1)
>                 if capper == 'JJ' or capper == 'JeffreyJacobs':
>                     capper = 'Jeffrey C. Jacobs'
>                 past_title = True
>             else:
>                 my_match = epnum_re.match(f)
>                 if my_match:
>                     series, episodeNum = my_match.group('series', 'episode')
>                     past_title = True
>                 else:
>                     # If I think of other parse elements, they go here.
>                     # Otherwise, name is part of a title; check for
>                     # capitalization
>                     if f[0] >= 'a' and f[0] <= 'z' and f not in do_not_capitalize:
>                         f = f.capitalize()
>                     if past_title:
>                         if episodeName: episodeName += ' '
>                         episodeName += f
>                     else:
>                         if programme: programme += ' '
>                         programme += f
>
>     return programme, series, episodeName, episodeNum, year, capper, ext
>
> Now, the problem with this code is that it assumes only 2 pieces of
> free-form meta-data in the name (i.e. Programme Name and Episode
> Name).  Also, although this is not directly adaptable to Cartesian's
> approach, you COULD rewrite it using a dictionary in the place of
> local variable names so that the event lookup could consist of 3
> properties per event: compiled_re, action_method, dictionary_string.
> But even with that, in the case of the epnum match, two assignments
> are required so perhaps a convoluted scheme such that if
> dictionary_string is a list, for each of the values returned by
> action_method, bind the result to the corresponding ith dictionary
> element named in dictionary_string, which seems a bit convoluted.  And
> the fall-through case is state-dependent since the 'unrecognized
> field' should be shuffled into a different variable dependent on
> state.  Still, if there is a better approach I am certainly up for
> it.  I love event-based parsers so I have no problem with that
> approach in general.

Maybe it's worth using a class instance. Define methods to handle each
matching regex, and keep state in the instance.

class NameParser:

    def handle_year(self, field, match):
        self.year = field
        self.past_title = True

    def handle_capper(self, field, match):
        capper = match.group(1)
        if capper == 'JJ' or capper == 'JeffreyJacobs':
            capper = 'Jeffrey C. Jacobs'
        self.capper = capper
        self.past_title = True

    def parse(self, name):
        fields = sd.split(name)
        for field in fields:
            # The first regex that matches wins; order in the list matters.
            for regex, handler in self.handlers:
                match = regex.match(field)
                if match:
                    handler(self, field, match)
                    break

You have to build the handlers list, containing (regex, handler) items;
the "unknown" case might be a match-all expression at the end.

Well, after playing a bit with decorators I got this:

import re

# sd and do_not_capitalize are the names from your own code.

class NameParser:

    year = ''
    capper = ''
    series = None
    episodeNum = None
    programme = ''
    episodeName = ''
    past_title = False
    handlers = []

    def __init__(self, name):
        self.name = name
        self.parse()

    def handle_this(regex, handlers=handlers):
        # A decorator; associates the function to the regex
        # (Not intended to be used as a normal method! not even a static method!)
        def register(function, regex=regex):
            handlers.append((re.compile(regex), function))
            return function
        return register

    @handle_this(r"\(?\d+\)?")
    def handle_year(self, field, match):
        self.year = field
        self.past_title = True

    @handle_this(r"(expression)")
    def handle_capper(self, field, match):
        capper = match.group(1)
        if capper == 'JJ' or capper == 'JeffreyJacobs':
            capper = 'Jeffrey C. Jacobs'
        self.capper = capper
        self.past_title = True

    # Registered last, so it only sees fields no other handler claimed.
    @handle_this(r".*")
    def handle_unknown(self, field, match):
        if field[0] >= 'a' and field[0] <= 'z' and field not in do_not_capitalize:
            field = field.capitalize()
        if self.past_title:
            if self.episodeName: self.episodeName += ' '
            self.episodeName += field
        else:
            if self.programme: self.programme += ' '
            self.programme += field

    def parse(self):
        fields = sd.split(self.name)
        for field in fields:
            for regex, handler in self.handlers:
                match = regex.match(field)
                if match:
                    handler(self, field, match)
                    break

-- 
Gabriel Genellina
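
For anyone who wants to run the decorator version above, here is a minimal self-contained sketch. Everything the thread leaves undefined is an assumption made up for illustration: FIELD_SEP stands in for sd, DO_NOT_CAPITALIZE for do_not_capitalize, and the capper and episode-number patterns (a "[JJ]"-style tag and "S03E01") are hypothetical formats, not the original poster's expressions. It also adds an episode-number handler, which shows that the two-assignment case needs no special scheme: each handler simply sets whichever attributes it wants.

```python
import re

# Assumptions for illustration only; none of these definitions appear in the thread.
FIELD_SEP = re.compile(r"[ .]+")                      # stands in for `sd`
DO_NOT_CAPITALIZE = {"a", "an", "and", "of", "the"}   # stands in for `do_not_capitalize`
CAPPER_ALIASES = {"JJ": "Jeffrey C. Jacobs",
                  "JeffreyJacobs": "Jeffrey C. Jacobs"}


class NameParser:
    handlers = []  # (compiled regex, handler function) pairs, in registration order

    def __init__(self, name):
        self.year = ""
        self.capper = ""
        self.series = None
        self.episodeNum = None
        self.programme = ""
        self.episodeName = ""
        self.ext = ""
        self.past_title = False
        self.parse(name)

    def handle_this(pattern, handlers=handlers):
        # Decorator factory used only while the class body executes.
        def register(function):
            handlers.append((re.compile(pattern), function))
            return function
        return register

    @handle_this(r"\(?(\d{4})\)?$")
    def handle_year(self, field, match):
        self.year = match.group(1)
        self.past_title = True

    @handle_this(r"\[(\w+)\]$")  # hypothetical capper tag, e.g. [JJ]
    def handle_capper(self, field, match):
        capper = match.group(1)
        self.capper = CAPPER_ALIASES.get(capper, capper)
        self.past_title = True

    @handle_this(r"[Ss](?P<series>\d+)[Ee](?P<episode>\d+)$")  # hypothetical, e.g. S03E01
    def handle_epnum(self, field, match):
        # Two assignments from one match: no extra lookup scheme required.
        self.series, self.episodeNum = match.group("series", "episode")
        self.past_title = True

    @handle_this(r".+")  # registered last: catches every field nothing else claimed
    def handle_unknown(self, field, match):
        if "a" <= field[0] <= "z" and field not in DO_NOT_CAPITALIZE:
            field = field.capitalize()
        # State-dependent fall-through: title words go to a different attribute
        # once any recognized field has been seen.
        target = "episodeName" if self.past_title else "programme"
        current = getattr(self, target)
        setattr(self, target, (current + " " + field) if current else field)

    del handle_this  # only needed while the class body runs

    def parse(self, name):
        fields = FIELD_SEP.split(name)
        fields, self.ext = fields[:-1], fields[-1]
        for field in fields:
            for regex, handler in self.handlers:
                match = regex.match(field)
                if match:
                    handler(self, field, match)
                    break
```

With these assumed patterns, NameParser("doctor.who.S03E01.smith.and.jones.2007.[JJ].mpg") ends up with programme "Doctor Who", series "03", episodeNum "01", episodeName "Smith and Jones", year "2007", capper "Jeffrey C. Jacobs" and ext "mpg". Handlers are tried in the order they were registered, so a new recognizer has to be declared above handle_unknown.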