On Tue, 02 May 2006 22:37:04 -0700, ProvoWallis wrote:
> I have a file that looks like this:
>
> <SC>APPEAL<XC>40-24; 40-46; 42-46; 42-48; 42-62; 42-63 <SC>PROC
> GUIDE<XC>92<LT>1(b)(1)
>
> (i.e., <<SC>[chapter name]<XC>[multiple or single book page
> ranges]<SC>[chapter name]<XC>[multiple or single book page
> ranges]<LT>[code]
>
> but I want to change it so that it looks like this
>
> <1><SC>APPEAL<XC>40-24<LT>1(b)(1)
> <1><SC>APPEAL<XC>40-46<LT>1(b)(1)
> <1><SC>APPEAL<XC>42-46<LT>1(b)(1)
> <1><SC>APPEAL<XC>42-48<LT>1(b)(1)
> <1><SC>APPEAL<XC>42-62<LT>1(b)(1)
> <1><SC>APPEAL<XC>42-63<LT>1(b)(1)
> <1><SC>PROC GUIDE<XC>92<LT>1(b)(1)

I'll show my code first, then explain it.

-- cut here -- cut here -- cut here -- cut here -- cut here --
import re

s = "<SC>APPEAL<XC>40-24; 40-46; 42-46; 42-48; 42-62; 42-63 " + \
    "<SC>PROC GUIDE<XC>92<LT>1(b)(1)"

s_space = " "  # a single space
s_empty = ""  # empty string

# raw string, so the backslash in \s reaches the regex engine untouched
pat = re.compile(r"\s*<SC>([^<]+)<XC>([^<]+)")

lst = []

while True:
    m = pat.search(s)
    if not m:
        break

    title = m.group(1).strip()
    xc = m.group(2)
    xc = xc.replace(s_space, s_empty)

    tup = (title, xc)
    lst.append(tup)
    s = pat.sub(s_empty, s, 1)

lt = s.strip()

for title, xc in lst:
    lst_pp = xc.split(";")
    for pp in lst_pp:
        # parenthesized so this runs under both Python 2 and Python 3
        print("<1><SC>%s<XC>%s%s" % (title, pp, lt))
-- cut here -- cut here -- cut here -- cut here -- cut here --

My strategy here is to divide the problem into two separate parts: first,
I collect all the data we need; then, I reformat the collected data and
print it in the desired format.

"pat" is a compiled regular expression. It recognizes the SC and XC codes,
and collects the strings that follow those codes:

([^<]+)

The above regular expression means "any character that is not a '<'", "one
or more of them", and since it's in parentheses it's remembered so we can
collect it later.

So we collect the title and the XC page ranges. We tidy them up a bit:
title.strip() will remove any leading or trailing white space from the
title. The replace() on the XC string gets rid of any spaces; I'm assuming
that the spaces are optional and the semicolons are the real separators
here.

Now, we could save the title and XC string in two lists, but that would be
silly in Python. It's easier to pair them up in a tuple, and save the
tuple in a single list. You can do it in one line, but I made the tuple
explicit ("tup").

After we collect each pair, we use a sub() with a count of 1 to chop just
that match out of the source string. The while loop runs until all the SC
and XC values are collected; anything left over is assumed to be the LT.

Now, we have all the data; it's easy enough to rearrange it.
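As an aside, here is a rough sketch of the same collection step done in
one pass. It is a variant, not the code above: the helper name "collect" is
made up for illustration, it uses findall() instead of the search()/sub()
loop, and it still assumes that whatever is left after removing every
SC/XC pair is the LT code.

```python
import re

# Same pattern as in the code above.
pat = re.compile(r"\s*<SC>([^<]+)<XC>([^<]+)")


def collect(s):
    """Hypothetical helper: return ([(title, xc), ...], lt) for one record."""
    # With two groups in the pattern, findall() returns one
    # (group 1, group 2) tuple per match.
    pairs = [(title.strip(), xc.replace(" ", ""))
             for title, xc in pat.findall(s)]
    # Delete every SC/XC pair; what is left is assumed to be the LT code.
    lt = pat.sub("", s).strip()
    return pairs, lt


s = ("<SC>APPEAL<XC>40-24; 40-46; 42-46; 42-48; 42-62; 42-63 "
     "<SC>PROC GUIDE<XC>92<LT>1(b)(1)")
pairs, lt = collect(s)
print(pairs)
print(lt)
```

For the sample string this should give two tuples, one for APPEAL and one
for PROC GUIDE, with "<LT>1(b)(1)" left over as the LT. The loop version
above does the same job and is easier to step through; this one just
avoids rewriting the source string on every pass.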
We can convert the XC string into a list of page ranges just by calling
.split(";"), which will split on semicolons. Loop over this list, printing
each time, and there you go.

I'll leave packaging these up into tidy functions, reading the data from
the file, etc. as exercises for the reader. :-)

If you have any questions on how this works or why I did things the way I
did, ask away.

Good luck!
-- 
Steve R. Hastings    "Vita est"
[EMAIL PROTECTED]    http://www.blarg.net/~steveha

-- 
http://mail.python.org/mailman/listinfo/python-list