> What I mean is, I want to change, e.g.:
>
> "Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434, 72
> S. Ct. 394, 397, 96 L.Ed. 475 (1952)."
>
> into:
>
> "Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434 (1952)."
>
> Generally, the beginning pattern would consist of:
>
> 1. Two names, consisting of one or more words, always separated by a
> "v."
>
> 2. One, two, or three citations, each of which always has a volume
> number ("342") followed by a name, consisting of one or two word
> units always ending with "." ("U.S."), followed by a page number ("429")
>
> 3. Each citation may contain a comma and a second page number (", 434")
>
> 4. Optionally, a parenthesized year ("(1952)")
>
> 5. A final "."
  >>> import re
  >>> tests = ['Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434, 72 S. Ct. 394, 397, 96 L.Ed. 475 (1952).',
  ...     'Joe v. Volcano, Fork, 123 Internet, et. al, 314 U.S. 123, 43, 88 S. Ct. 394, 397, 97 L.Ed. 459 (2005).',
  ...     'Grandma v. RIAA, 314 U.S. 123, 43, 88 S. Ct. 394, 397, 97 L.Ed. 459.']
  >>> r = re.compile(r'(.*?)\s+v\.\s+(.*?)\s+(\d+)\s+U\.S\.\s+((?:\d+,\s*)+)\s*(.*?)(\(\d{4}\))?\.$')
  >>> results = [r.match(x) for x in tests]
  >>> labels = ["Party1", "Party2", "Volume", "Pages", "Extra", "Year"]
  >>> for x in range(0, 3):
  ...     print("Test %i" % x)
  ...     print("=" * 20)
  ...     print("\n".join(["%s: %s" % (a, results[x].group(b)) for a, b in zip(labels, range(1, 7))]))
  ...
  Test 0
  ====================
  Party1: Doremus
  Party2: Board of Education of Hawthorne,
  Volume: 342
  Pages: 429, 434,
  Extra: 72 S. Ct. 394, 397, 96 L.Ed. 475
  Year: (1952)
  Test 1
  ====================
  Party1: Joe
  Party2: Volcano, Fork, 123 Internet, et. al,
  Volume: 314
  Pages: 123, 43,
  Extra: 88 S. Ct. 394, 397, 97 L.Ed. 459
  Year: (2005)
  Test 2
  ====================
  Party1: Grandma
  Party2: RIAA,
  Volume: 314
  Pages: 123, 43,
  Extra: 88 S. Ct. 394, 397, 97 L.Ed. 459
  Year: None

Note that the pattern hard-codes "U.S." as the first reporter. Things
get a little messy if one of the parties has digits followed by
whitespace, followed by "U.S." in their name, such as a fictitious
"99 U.S. Luftballoons". Caveat regextor.

Trailing commas also end up in some of the groups (Party2 and Pages in
the output above). You may want to strip them off before reassembling
the pieces.

Reassemble the pieces as needed. Season to taste. Bake at 350 for
20-25 minutes until golden brown.

HTH, or at least gets you on the path to regexp mangling.

-tkc
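
A minimal sketch of the reassembly step, assuming the same pattern as in the session above and assuming the goal is to keep only the two parties, the U.S. citation, and the year. The function name shorten and the comma-stripping cleanup are illustrative choices, not part of the session above, and this has only been checked against the three sample strings.

```python
import re

# Same pattern as in the session above.
CITE = re.compile(
    r'(.*?)\s+v\.\s+(.*?)\s+(\d+)\s+U\.S\.\s+((?:\d+,\s*)+)\s*(.*?)(\(\d{4}\))?\.$')


def shorten(citation):
    """Drop the parallel citations, keeping parties, U.S. cite, and year.

    shorten() is a hypothetical helper name. Strings the pattern does not
    match are returned unchanged.
    """
    m = CITE.match(citation)
    if m is None:
        return citation
    party1, party2, volume, pages, extra, year = m.groups()
    # Strip the trailing commas (and whitespace) the pattern leaves behind.
    party2 = party2.rstrip(', ')
    pages = pages.strip().rstrip(',')
    short = '%s v. %s, %s U.S. %s' % (party1, party2, volume, pages)
    if year:
        short += ' ' + year
    return short + '.'


print(shorten('Doremus v. Board of Education of Hawthorne, 342 U.S. 429, '
              '434, 72 S. Ct. 394, 397, 96 L.Ed. 475 (1952).'))
# Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434 (1952).
```

One limitation to keep in mind: the Pages group requires a comma after every page number, so a citation whose U.S. cite is not followed by a parallel citation will either lose its last page or not match at all. If your data contains those, the pattern needs loosening first.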