Hi Steve,

On Dec 30, 12:01 am, Steven D'Aprano <st...@remove-this-cybersource.com.au> wrote:
> On Tue, 29 Dec 2009 21:01:05 -0800, beginner wrote:
> > Hi All,
> >
> > I run into a problem. I have a string s that can be a number of
> > possible things. I use a regular expression code like below to match and
> > parse it. But it looks very ugly. Also, the strings are literally
> > matched twice -- once for matching and once for extraction -- which
> > seems to be very slow. Is there any better way to handle this?
>
> The most important thing you should do is to put the regular expressions
> into named variables, rather than typing them out twice. The names
> should, preferably, describe what they represent.
>
> Oh, and you should use raw strings for regexes. In this particular
> example, I don't think it makes a difference, but if you ever modify the
> strings, it will!
>
> You should get rid of the unnecessary double calls to match. That's just
> wasteful. Also, since re.match tests the start of the string, you don't
> need the leading ^ regex (but you do need the $ to match the end of the
> string).
>
> You should also fix the syntax error, where you have "elif s=='-'"
> instead of "elif s='-'".
>
> You should consider putting the cheapest test(s) first, or even moving
> the expensive tests into a separate function.
>
> And don't be so stingy with spaces in your source code, it helps
> readability by reducing the density of characters.
>
> So, here's my version:
>
> def _re_match_items(s):
>     # Setup some regular expressions.
>     COMMON_RE = r'\$?([-+]?[0-9,]*\.?[0-9,]+)'
>     FLOAT_RE = COMMON_RE + '$'
>     BRACKETED_FLOAT_RE = r'\(' + COMMON_RE + r'\)$'
>     DATE_RE = r'\d{1,2}-\w+-\d{1,2}$'
>     mo = re.match(FLOAT_RE, s)  # "mo" short for "match object"
>     if mo:
>         return float(mo.group(1).replace(',', ''))
>     # Otherwise mo will be None and we go on to the next test.
>     mo = re.match(BRACKETED_FLOAT_RE, s)
>     if mo:
>         return -float(mo.group(1).replace(',', ''))
>     if re.match(DATE_RE, s):
>         return dateutil.parser.parse(s, dayfirst=True)
>     raise ValueError("bad string can't be matched")
>
> def convert_data_item(s):
>     if s = '-':
>         return None
>     else:
>         try:
>             return _re_match_items(s)
>         except ValueError:
>             print "Unrecognized format %s" % s
>             return s
>
> Hope this helps.
>
> --
> Steven

This definitely helps. I don't know if it should be s=='-' or s='-'. I
thought == means equal and = means assignment?

Thanks again,
G
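
On the == question: == is the comparison operator and = is assignment, so the test should read "if s == '-':". Python rejects "if s = '-':" as a syntax error. Below is a minimal sketch of the quoted approach with that one line fixed. It assumes Python 3 (the quoted code uses the Python 2 print statement), precompiles the patterns with re.compile as a convenience, and leaves out the date branch so that it needs only the standard library. The helper name to_float is hypothetical, introduced here just to avoid repeating the comma stripping.

```python
import re

# Raw strings keep the backslashes literal; each pattern is compiled once.
COMMON = r'\$?([-+]?[0-9,]*\.?[0-9,]+)'
FLOAT_RE = re.compile(COMMON + r'$')
BRACKETED_FLOAT_RE = re.compile(r'\(' + COMMON + r'\)$')


def to_float(text):
    """Convert a captured number such as '1,234.50' to a float."""
    return float(text.replace(',', ''))


def convert_data_item(s):
    # '==' compares two values. Writing 's = "-"' here would be an
    # assignment, which Python does not allow inside an 'if' condition.
    if s == '-':
        return None
    # Each pattern is matched once; the match object is reused for extraction.
    mo = FLOAT_RE.match(s)
    if mo:
        return to_float(mo.group(1))
    mo = BRACKETED_FLOAT_RE.match(s)
    if mo:
        # A bracketed amount such as '(12.5)' is treated as negative.
        return -to_float(mo.group(1))
    print("Unrecognized format %s" % s)
    return s
```

With these definitions, convert_data_item('-') returns None, convert_data_item('$1,234.50') returns 1234.5, and convert_data_item('(12.5)') returns -12.5. A string that matches neither pattern is printed as unrecognized and returned unchanged, as in the quoted version.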