Regular expression to structure HTML
I'm fairly new to regular expressions, and I've spent hours trying to finesse one into a working substitution.

What I'd like to do is extract data elements from HTML and structure them so that they can more readily be imported into a database.

No -- sorry -- I don't want to use BeautifulSoup (though I have for other projects). Humor me, please -- I'd really like to see if this can be done with just regular expressions.

Note that the output is referenced using named groups.

My challenge is successfully matching the HTML tags between the first table row and the second table row.

I'd appreciate any suggestions to improve the approach.

rText = "8583New Horizon Technical Academy, Inc #4Jefferson701149371Career Learning CenterJefferson70113"

rText = re.compile(r'()(?P<LICENSE>\d+)()()()(?P<NAME>[A-Za-z0-9#\s\S\W]+)().+$').sub(r'LICENSE:\g<LICENSE>|NAME: \g<NAME>\n', rText)

print rText

LICENSE:8583|NAME:New Horizon Technical Academy, Inc #4Jefferson701149371Career Learning Center|PARISH:Jefferson|ZIP:70113

-- http://mail.python.org/mailman/listinfo/python-list
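One likely reason the name group runs on past the first row: a class such as `[A-Za-z0-9#\s\S\W]` can match any character at all (`\s\S` alone covers everything), and `+` is greedy, so it takes as much text as it can. A minimal sketch of the difference, in Python 3 syntax (the thread itself uses Python 2); the `<td>`/`<a>` markup is hypothetical, since the original tags did not survive in the post:

```python
import re

# Hypothetical two-row sample; the real markup is an assumption.
rows = ("<td>8583</td><td><a>New Horizon Technical Academy, Inc #4</a></td>"
        "<td>9371</td><td><a>Career Learning Center</a></td>")

# Greedy: [\s\S]+ grabs as much as possible and only gives back enough
# for the LAST </a> in the text to match.
greedy = re.search(r"<a>(?P<NAME>[\s\S]+)</a>", rows)

# Lazy: +? stops at the FIRST place the rest of the pattern can match.
lazy = re.search(r"<a>(?P<NAME>[\s\S]+?)</a>", rows)

greedy_name = greedy.group("NAME")
lazy_name = lazy.group("NAME")
print(greedy_name)
print(lazy_name)
```

With the lazy quantifier the group ends at the first closing tag, which is the behaviour the substitution needs.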
Re: Regular expression to structure HTML
Screw:

>>> html = """ 14313 Python Hammer Institute #2 Jefferson 70114 8583 New Screwdriver Technical Academy, Inc #4 Jefferson 70114 9371 Career RegEx Center Jefferson 70113 """

Hammer:

First remove line returns. Then remove extra spaces. Then insert a line return between rows so that each logical row sits on its own line. For more information, see: http://www.qc4blog.com/?p=55

>>> s = re.sub(r'\n','', html)
>>> s = re.sub(r'\s{2,}', '', s)
>>> s = re.sub('()()', r'\1\n\2', s)
>>> print s
14313Python Hammer Institute #2Jefferson70114
8583New Screwdriver Technical Academy, Inc #4Jefferson70114
9371Career RegEx CenterJefferson70113
>>> p = re.compile(r"()(?P<LICENSE>\d+)()(>> valign=top>)(>> href=lic_details\.asp)(\?lic_number=\d+)(>)(?P<NAME>[\s\S\WA-Za-z0-9]*?)()()(?:>> valign=top>)(?P<PARISH>[\s\WA-Za-z]+)()(>> valign=top>)(?P<ZIP>\d+)()()$", re.M)
>>> n = p.sub(r'LICENSE:\g<LICENSE>|NAME:\g<NAME>|PARISH:\g<PARISH>|ZIP:\g<ZIP>', s)
>>> print n
LICENSE:14313|NAME:Python Hammer Institute #2|PARISH:Jefferson|ZIP:70114
LICENSE:8583|NAME:New Screwdriver Technical Academy, Inc #4|PARISH:Jefferson|ZIP:70114
LICENSE:9371|NAME:Career RegEx Center|PARISH:Jefferson|ZIP:70113
>>>

The solution was to escape the period in the ".asp" string, i.e., "\.asp". I also had to add a "?" qualifier after the "*" in the name grouping to limit its "greediness".

Now, who would like to turn that re.compile pattern into a MULTILINE expression, combining the re.M and re.X flags? Documentation says that one should be able to use the bitwise OR operator (e.g., re.M | re.X), but I sure couldn't get it to work.

Sometimes a hammer actually is the right tool if you hit the screw long and hard enough. I think I'll try to hit some more screws with my new hammer.

Good day.
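On the closing question about combining flags: `re.M | re.X` is the documented way to pass both. One possible reason it appeared not to work is that under `re.X` unescaped whitespace in the pattern is ignored and `#` starts a comment, so a literal space such as the one before `valign=top>` has to be escaped or wrapped in a character class. A sketch in Python 3 syntax, assuming rows of the form `<tr><td valign=top>...</td>...</tr>`; that markup is reconstructed guesswork, because the tags were lost from the post:

```python
import re

# Assumed markup: one <tr> per line, as left by the clean-up steps above.
s = (
    "<tr><td valign=top>14313</td><td valign=top>"
    "<a href=lic_details.asp?lic_number=14313>Python Hammer Institute #2</a></td>"
    "<td valign=top>Jefferson</td><td valign=top>70114</td></tr>\n"
    "<tr><td valign=top>8583</td><td valign=top>"
    "<a href=lic_details.asp?lic_number=8583>New Screwdriver Technical Academy, Inc #4</a></td>"
    "<td valign=top>Jefferson</td><td valign=top>70114</td></tr>"
)

row = re.compile(r"""
    ^<tr><td[ ]valign=top>(?P<LICENSE>\d+)</td>        # license number
    <td[ ]valign=top>
    <a[ ]href=lic_details\.asp\?lic_number=\d+>        # literal dot and question mark escaped
    (?P<NAME>.*?)</a></td>                             # lazy: stop at the first closing anchor
    <td[ ]valign=top>(?P<PARISH>[^<]+)</td>
    <td[ ]valign=top>(?P<ZIP>\d+)</td></tr>$
""", re.M | re.X)

result = row.sub(r"LICENSE:\g<LICENSE>|NAME:\g<NAME>|PARISH:\g<PARISH>|ZIP:\g<ZIP>", s)
print(result)
```

Writing each literal space as `[ ]` keeps it significant in verbose mode, and `re.M` lets `^` and `$` anchor each row on its own line.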
Re: Regular expression to structure HTML
On Oct 2, 11:14 pm, greg wrote:
> Brian D wrote:
> > This isn't merely a question of knowing when to use the right
> > tool. It's a question about how to become a better developer using
> > regular expressions.
>
> It could be said that if you want to learn how to use a
> hammer, it's better to practise on nails rather than
> screws.
>
> --
> Greg

It could be said that the bandwidth in technical forums should be reserved for on-topic exchanges, not flaming intelligent people who might have something to contribute to the forum.

The truth is, I found a solution where others were ostensibly either too lazy to attempt one or too eager to grandstand their superiority to assist.

Who knows -- maybe I'll provide an alternative to BeautifulSoup one day.
How to insert string in each match using RegEx iterator
By what method would a string be inserted at each instance of a RegEx match?

For example:

string = '123 abc 456 def 789 ghi'
newstring = ' INSERT 123 abc INSERT 456 def INSERT 789 ghi'

Here's the code I started with:

>>> rePatt = re.compile('\d+\s')
>>> iterator = rePatt.finditer(string)
>>> count = 0
>>> for match in iterator:
...     if count < 1:
...         print string[0:match.start()] + ' INSERT ' + string[match.start():match.end()]
...     elif count >= 1:
...         print ' INSERT ' + string[match.start():match.end()]
...     count = count + 1

My code returns an empty string.

I'm new to Python, but I'm finding it really enjoyable (with the exception of this challenging puzzle).

Thanks in advance.
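Two things stand out in the loop above: the first match starts at index 0, so the slice `string[0:match.start()]` is empty, and each pass prints only the matched digits, never the text between matches. A minimal sketch of one way to rebuild the whole string with `finditer`, in Python 3 syntax; `insert_before_matches` is a hypothetical helper name, not something from the thread:

```python
import re

def insert_before_matches(pattern, marker, text):
    """Return text with marker inserted in front of every match of pattern."""
    pieces = []
    last_end = 0
    for match in re.finditer(pattern, text):
        pieces.append(text[last_end:match.start()])  # text between matches
        pieces.append(marker)
        pieces.append(match.group())                 # the match itself
        last_end = match.end()
    pieces.append(text[last_end:])                   # tail after the last match
    return "".join(pieces)

newstring = insert_before_matches(r"\d+", "INSERT ", "123 abc 456 def 789 ghi")
print(repr(newstring))
```

The key step is remembering where the previous match ended, so that the unmatched text is copied across as well.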
Re: How to insert string in each match using RegEx iterator
On Jun 9, 11:19 pm, Roy Smith wrote:
> "504cr...@gmail.com" <504cr...@gmail.com> wrote:
> > By what method would a string be inserted at each instance of a RegEx
> > match?
> >
> > For example:
> >
> > string = '123 abc 456 def 789 ghi'
> > newstring = ' INSERT 123 abc INSERT 456 def INSERT 789 ghi'
>
> If you want to do what I think you are saying, you should be looking at the
> join() string method. I'm thinking something along the lines of:
>
> groups = match_object.groups()
> newstring = " INSERT ".join(groups)

Fast answer, Roy. Thanks. That would be a graceful solution if it works. I'll give it a try and post a solution.

Meanwhile, I know there's a logical problem with the way I was concatenating strings in the iterator loop.

Here's a single-instance example of what I'm trying to do:

>>> string = 'abc 123 def 456 ghi 789'
>>> match = rePatt.search(string)
>>> print string[0:match.start()] + 'INSERT ' + string[match.end():len(string)]
abc INSERT def 456 ghi 789
Re: How to insert string in each match using RegEx iterator
Thanks Roy. A little closer to a solution. I'm still working out how to move forward, but this is a good start:

>>> string = 'abc 123 def 456 ghi 789'
>>> rePatt = re.compile('\s\d+\s')
>>> foundGroup = rePatt.findall(string)
>>> newstring = ' INSERT '.join(foundGroup)
>>> print newstring
 123  INSERT  456

What I really want to do is return the full string, not just the matches, with ' INSERT ' placed in front of each match.
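The difference between the two calls can be sketched as follows (Python 3 syntax): `findall` returns only the matched substrings, whereas `sub` returns the whole string with each match rewritten, and `\g<0>` in the replacement stands for the entire match.

```python
import re

string = 'abc 123 def 456 ghi 789'

# findall keeps only the matched substrings and drops everything else.
matches = re.findall(r'\d+', string)

# sub returns the whole string; \g<0> stands for the entire match.
newstring = re.sub(r'\d+', r'INSERT \g<0>', string)

print(matches)
print(newstring)
```

Because the pattern here has no surrounding `\s`, the trailing `789` is matched too.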
Re: How to insert string in each match using RegEx iterator
On Jun 10, 5:17 am, Paul McGuire wrote:
> On Jun 9, 11:13 pm, "504cr...@gmail.com" <504cr...@gmail.com> wrote:
> > By what method would a string be inserted at each instance of a RegEx
> > match?
>
> Some might say that using a parsing library for this problem is
> overkill, but let me just put this out there as another data point for
> you. Pyparsing (http://pyparsing.wikispaces.com) supports callbacks
> that allow you to embellish the matched tokens, and create a new
> string containing the modified text for each match of a pyparsing
> expression. Hmm, maybe the code example is easier to follow than the
> explanation...
>
> from pyparsing import Word, nums, Regex
>
> # an integer is a 'word' composed of numeric characters
> integer = Word(nums)
>
> # or use this if you prefer
> integer = Regex(r'\d+')
>
> # attach a parse action to prefix 'INSERT ' before the matched token
> integer.setParseAction(lambda tokens: "INSERT " + tokens[0])
>
> # use transformString to search through the input, applying the
> # parse action to all matches of the given expression
> test = '123 abc 456 def 789 ghi'
> print integer.transformString(test)
>
> # prints
> # INSERT 123 abc INSERT 456 def INSERT 789 ghi
>
> I offer this because often the simple examples that get posted are
> just the barest tip of the iceberg of what the poster eventually plans
> to tackle.
>
> Good luck in your Pythonic adventure!
> -- Paul

Thanks for all of the instant feedback. I have enumerated three responses below:

First response:

Peter, I wonder if you (or anyone else) might attempt a different explanation for the use of the special sequence '\1' in the RegEx syntax.

The Python documentation explains:

\number
Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'the end' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value number. Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.

In practice, this appears to be the key device in your clever solution:

>>> re.compile(r"(\d+)").sub(r"INSERT \1", string)
'abc INSERT 123 def INSERT 456 ghi INSERT 789'
>>> re.compile(r"(\d+)").sub(r"INSERT ", string)
'abc INSERT  def INSERT  ghi INSERT '

I don't, however, precisely understand what is meant by "the group of the same number" -- or maybe I do, but it isn't explicit. Is this just a shorthand reference to match.group(1) -- if that were valid -- implying that the group's matched text is inserted when the substitution runs?

Second response:

I've encountered a problem with my RegEx learning curve which I'll be posting in a new thread -- how to escape hash characters # in strings being matched, e.g.:

>>> string = re.escape('123#456')
>>> match = re.match('\d+', string)
>>> print match
<_sre.SRE_Match object at 0x00A6A800>
>>> print match.group()
123

Third response:

Paul, thanks for referring me to the Pyparsing module. I'm thoroughly enjoying Python, but I'm not prepared right now to say I've mastered the Pyparsing module. As I continue my work, however, I'll be tackling the problem of parsing addresses, exactly as the Pyparsing module example illustrates. I'm sure I'll want to use it then.
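On the first question, a short sketch may help (Python 3 syntax). In a replacement template, `\1` is filled in with the same text that `match.group(1)` returns for each match, and `Match.expand()` applies such a template to a single match object. Inside a pattern, `\1` instead means "the same text group 1 just matched".

```python
import re

string = 'abc 123 def 456 ghi 789'
pattern = re.compile(r'(\d+)')

# In a replacement template, \1 is the text captured by group 1.
via_sub = pattern.sub(r'INSERT \1', string)

# The same substitution spelled out with a function and match.group(1).
via_func = pattern.sub(lambda m: 'INSERT ' + m.group(1), string)

# expand() fills the template for one match object.
first = pattern.search(string)
expanded = first.expand(r'INSERT \1')

# Inside a pattern, \1 must repeat what group 1 matched.
repeated = re.match(r'(.+) \1', 'the the') is not None
not_repeated = re.match(r'(.+) \1', 'the end') is None

print(via_sub)
print(expanded, repeated, not_repeated)
```

So the template form is a shorthand: `sub` builds the replacement for every match by looking up the numbered groups of that match.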
How to escape # hash character in regex match strings
I've encountered a problem with my RegEx learning curve -- how to escape hash characters # in strings being matched, e.g.:

>>> string = re.escape('123#abc456')
>>> match = re.match('\d+', string)
>>> print match
<_sre.SRE_Match object at 0x00A6A800>
>>> print match.group()
123

The correct result should be:

123456

I've tried to escape the hash symbol in the match string without result.

Any ideas? Is the answer something I overlooked in my lurching Python schooling?
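A note on what is probably happening here: `re.escape()` is meant for text that will be used as a pattern, not for the string being searched, so applying it to the subject only adds a backslash in front of the `#`. `re.match('\d+', ...)` then returns the first run of digits, `123`, because `\d+` cannot match across `#abc`. Assuming the goal is all of the digits joined together, a minimal sketch in Python 3 syntax:

```python
import re

s = '123#abc456'

# re.escape prepares a literal for use in a pattern; on the subject string
# it merely inserts a backslash in front of the hash.
escaped = re.escape(s)

# match() stops at the first non-digit, so only the leading run is returned.
leading = re.match(r'\d+', s).group()

# Collect every run of digits and join them.
all_digits = ''.join(re.findall(r'\d+', s))

print(escaped, leading, all_digits)
```

Outside verbose (`re.X`) mode the hash has no special meaning in a pattern, so it does not need escaping at all.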
Re: How to escape # hash character in regex match strings
On Jun 11, 2:01 am, Lie Ryan wrote:
> 504cr...@gmail.com wrote:
> > I've encountered a problem with my RegEx learning curve -- how to
> > escape hash characters # in strings being matched, e.g.:
> >
> > >>> string = re.escape('123#abc456')
> > >>> match = re.match('\d+', string)
> > >>> print match
> > <_sre.SRE_Match object at 0x00A6A800>
> > >>> print match.group()
> > 123
> >
> > The correct result should be:
> >
> > 123456
> >
> > I've tried to escape the hash symbol in the match string without
> > result.
> >
> > Any ideas? Is the answer something I overlooked in my lurching Python
> > schooling?
>
> As you're not being clear on what you wanted, I'm just guessing this is
> what you wanted:
>
> >>> s = '123#abc456'
> >>> re.match('\d+', re.sub('#\D+', '', s)).group()
> '123456'
> >>> s = '123#this is a comment and is ignored456'
> >>> re.match('\d+', re.sub('#\D+', '', s)).group()
> '123456'

Sorry I wasn't more clear. I really appreciate your reply. It provides half of what I'm hoping to learn.

The hash character is actually a desirable hook to identify a data entity in a scraping routine I'm developing, but not a character I want in the scrubbed data.

In my application, the hash makes a string of alphanumeric characters unique from other alphanumeric strings. The strings I'm looking for are actually manually-entered identifiers, but a real machine-created identifier shouldn't contain that hash character. The correct form is 'A1234509', but it is often entered merely as '#12345' when the first character, representing an alphabet sequence for the month, and the last two characters, representing a two-digit year, can be assumed. Identifying the hash character in a RegEx match is a way of trapping the string and transforming it into its correct machine-generated form.

Other patterns the strings can take in their manually-created form:

A#12345
#1234509

Garbage in, garbage out -- I know. I wish I could tell the people entering the data how challenging it is to work with what they provide, but it is, after all, a screen-scraping routine.

I'm surprised it's been so difficult to find an example of the hash character in a RegEx string for exactly this type of situation, since it's so common in the real world that people want to put a pound symbol in front of a number.

Thanks!
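Outside `re.X` (verbose) mode, `#` has no special meaning in a pattern, so it can simply be matched literally. A sketch of how the variants described above might be normalized, in Python 3 syntax; the five-digit serial, the helper name `normalize_id`, and the default month letter and year are all assumptions made for illustration, not details confirmed in the thread:

```python
import re

# Optional month letter, optional literal '#', five digits, optional 2-digit year.
ID_PATTERN = re.compile(r'^(?P<month>[A-Z])?#?(?P<serial>\d{5})(?P<year>\d{2})?$')

def normalize_id(raw, default_month='A', default_year='09'):
    """Return the identifier in 'A1234509' form, or None if it does not fit."""
    m = ID_PATTERN.match(raw.strip())
    if m is None:
        return None
    month = m.group('month') or default_month   # assumed default when omitted
    year = m.group('year') or default_year      # assumed default when omitted
    return month + m.group('serial') + year

normalized = [normalize_id(x) for x in ('#12345', 'A#12345', '#1234509', 'A1234509')]
print(normalized)
```

Returning `None` for anything that does not fit leaves the caller free to log or hand-check the entries the pattern cannot place.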
Re: How to insert string in each match using RegEx iterator
On Jun 10, 10:13 am, Peter Otten <__pete...@web.de> wrote:
> 504cr...@gmail.com wrote:
> > I wonder if you (or anyone else) might attempt a different explanation
> > for the use of the special sequence '\1' in the RegEx syntax.
> >
> > The Python documentation explains:
> >
> > \number
> > Matches the contents of the group of the same number. Groups are
> > numbered starting from 1. For example, (.+) \1 matches 'the the' or
> > '55 55', but not 'the end' (note the space after the group). This
> > special sequence can only be used to match one of the first 99 groups.
> > If the first digit of number is 0, or number is 3 octal digits long,
> > it will not be interpreted as a group match, but as the character with
> > octal value number. Inside the '[' and ']' of a character class, all
> > numeric escapes are treated as characters.
> >
> > In practice, this appears to be the key device in your
> > clever solution:
> >
> > >>> re.compile(r"(\d+)").sub(r"INSERT \1", string)
> > 'abc INSERT 123 def INSERT 456 ghi INSERT 789'
> > >>> re.compile(r"(\d+)").sub(r"INSERT ", string)
> > 'abc INSERT  def INSERT  ghi INSERT '
> >
> > I don't, however, precisely understand what is meant by "the group of
> > the same number" -- or maybe I do, but it isn't explicit. Is this just
> > a shorthand reference to match.group(1) -- if that were valid --
> > implying that the group match result is printed in the compile
> > execution?
>
> If I understand you correctly you are right. Another example:
>
> >>> re.compile(r"([a-z]+)(\d+)").sub(r"number=\2 word=\1", "a1 zzz42")
> 'number=1 word=a number=42 word=zzz'
>
> For every match of "[a-z]+\d+" in the original string "\1" in
> "number=\2 word=\1" is replaced with the actual match for "[a-z]+" and
> "\2" is replaced with the actual match for "\d+".
>
> The result, e.g. "number=1 word=a", is then used to replace the actual
> match for group 0, i.e. "a1" in the example.
>
> Peter

Wow! That is so cool. I had to process it for a little while to get it.

>>> s = '111bbb333'
>>> re.compile('(\d+)([b]+)(\d+)').sub(r'First string: \1 Second string: \2 Third string: \3', s)
'First string: 111 Second string: bbb Third string: 333'

MRI scans would no doubt reveal that people who attain a mastery of regular expressions must have highly developed areas of the brain. I wonder where the RegEx part of the brain might be located.

That was a really clever teaching device. I really appreciate you taking the time to post it, Peter.

I'm definitely getting a schooling on this list.

Thanks!