Licheng Fang wrote:
> Basically, the problem is this:
>
> >>> p = re.compile("do|dolittle")
> >>> p.match("dolittle").group()
> 'do'
>
> Python's NFA regexp engine tries only the first option, and happily
> rests on that. There's another example:
>
> >>> p = re.compile("one(self)?(selfsufficient)?")
> >>> p.match("oneselfsufficient").group()
> 'oneself'
>
> The Python regular expression engine doesn't exhaust all the
> possibilities, but in my application I hope to get the longest possible
> match, starting from a given point.
>
> Is there a way to do this in Python?

Licheng,

If you need regexes, why not just reverse-sort your expressions? This
seems a lot easier and faster than writing another regex compiler.
Reverse-sorting places each target ahead of any shorter target that is a
prefix of it, so the longer alternative is tried first.

    >>> import re
    >>> targets = ['be', 'bee', 'been', 'being']
    >>> targets.sort ()
    >>> targets.reverse ()
    >>> regex = '|'.join (targets)
    >>> re.findall (regex, 'Having been a bee in a former life, I don\'t mind being what I am and wouldn\'t want to be a bee ever again.')
    ['been', 'bee', 'being', 'be', 'bee']

You might also take a look at a stream editor I recently came out with:
http://cheeseshop.python.org/pypi/SE/2.2%20beta

It has been well received, especially by newbies, I believe because it
is so simple to use and allows very compact coding.

    >>> import SE
    >>> Bee_Editor = SE.SE ('be=BE bee=BEE been=BEEN being=BEING')
    >>> Bee_Editor ('Having been a bee in a former life, I don\'t mind being what I am and wouldn\'t want to be a bee ever again.')
    "Having BEEN a BEE in a former life, I don't mind BEING what I am and wouldn't want to BE a BEE ever again."

Because SE works by precedence on length, the targets can be defined in
any order and modular theme sets can be spliced freely to form
supersets.

    >>> SE.SE ('<EAT> be==, bee==, been==, being==,')(above_sting)
    'been,bee,being,be,bee,'

You can do extraction filters, deletion filters and substitutions in any
combination. It does multiple passes and can take files as input
instead of strings, and can write files as output.

    >>> Key_Word_Translator = SE.SE ('''
    ...    "*INT=int substitute"
    ...    "*DECIMAL=decimal substitute"
    ...    "*FACTION=faction substitute"
    ...    "*NUMERALS=numerals substitute"
    ...    # ... etc.
    ... ''')

I don't know if that could serve.

Regards

Frederic
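
P.S. Here is a minimal sketch of the reverse-sorting idea packaged as functions, in case it helps. The names longest_alternation and longest_match are made up for illustration and are not part of re or SE. The first function sorts literal targets by length (rather than alphabetically) and escapes them, so it assumes plain strings, not sub-patterns. The second tries several separate patterns at one position and keeps the match that ends furthest right. It does not make a single pattern with optional groups, such as "one(self)?(selfsufficient)?", find its longest match; you would have to spell the alternatives out as separate patterns.

```python
import re


def longest_alternation(targets):
    """Compile literal targets into one alternation, longest first.

    Hypothetical helper: sorting by length guarantees that a target is
    always tried before any of its own prefixes.
    """
    ordered = sorted(targets, key=len, reverse=True)
    # re.escape keeps characters such as '.' or '|' in a target literal.
    return re.compile("|".join(re.escape(t) for t in ordered))


def longest_match(patterns, text, pos=0):
    """Return the longest match among several patterns anchored at pos.

    Hypothetical helper: each pattern is matched separately and the
    match with the greatest end position wins. Returns None if no
    pattern matches at pos.
    """
    best = None
    for pattern in patterns:
        m = re.compile(pattern).match(text, pos)
        if m is not None and (best is None or m.end() > best.end()):
            best = m
    return best


if __name__ == "__main__":
    p = longest_alternation(["do", "dolittle"])
    print(p.match("dolittle").group())

    m = longest_match(["one", "oneself", "oneselfsufficient"],
                      "oneselfsufficient")
    print(m.group())
```

Trying every pattern separately costs one match attempt per pattern, which is fine for a handful of alternatives but slower than a single compiled alternation for large target lists.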