Regular expression to structure HTML

2009-10-01 Thread 504cr...@gmail.com
I'm kind of new to regular expressions, and I've spent hours trying to
finesse a regular expression to build a substitution.

What I'd like to do is extract data elements from HTML and structure
them so that they can more readily be imported into a database.

No -- sorry -- I don't want to use BeautifulSoup (though I have for
other projects). Humor me, please -- I'd really like to see if this
can be done with just regular expressions.

Note that the output is referenced using named groups.

My challenge is successfully matching the HTML tags in between the
first table row, and the second table row.

I'd appreciate any suggestions to improve the approach.


rText = "8583New Horizon Technical Academy, Inc #4Jefferson701149371Career Learning CenterJefferson70113"

rText = re.compile(r'()(?P\d+)()()()(?P[A-Za-z0-9#\s\S\W]+)().+$').sub(r'LICENSE:\g|NAME:\g\n', rText)

print rText

LICENSE:8583|NAME:New Horizon Technical Academy, Inc #4Jefferson701149371Career Learning Center|PARISH:Jefferson|ZIP:70113
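
For reference, a minimal sketch of how named groups are referenced from a substitution template. It is written for Python 3 and does not reproduce the pattern above: the semicolon-delimited input and the group names `license`, `name`, `parish`, and `zip` are assumptions made for illustration.

```python
import re

# Hypothetical, already-flattened input; the ";" separators are an assumption.
text = '8583;New Horizon Technical Academy, Inc #4;Jefferson;70114'

# [^;]+ is a negated character class: each field stops at the next ";",
# so no group can run past its own delimiter.
pattern = re.compile(
    r'(?P<license>\d+);(?P<name>[^;]+);(?P<parish>[^;]+);(?P<zip>\d+)'
)

# \g<name> in the replacement refers to the named group in the pattern.
structured = pattern.sub(
    r'LICENSE:\g<license>|NAME:\g<name>|PARISH:\g<parish>|ZIP:\g<zip>', text
)
print(structured)
```

A class such as `[A-Za-z0-9#\s\S\W]+` matches every character, so it keeps consuming input until the rest of the pattern forces it to back off. Restricting the class, as in the sketch, is one way to keep a field from swallowing its neighbours.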



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Regular expression to structure HTML

2009-10-02 Thread 504cr...@gmail.com
Screw:

>>> html = """  

14313


Python
Hammer Institute #2


Jefferson


70114


  

  

8583


New
Screwdriver Technical Academy, Inc #4


Jefferson


70114


  

  

9371


Career
RegEx Center


Jefferson


70113


  """

Hammer:

First remove line returns.
Then remove extra spaces.
Then insert a line return to restore logical rows on each 
combination. For more information, see: http://www.qc4blog.com/?p=55

>>> s = re.sub(r'\n','', html)
>>> s = re.sub(r'\s{2,}', '', s)
>>> s = re.sub('()()', r'\1\n\2', s)
>>> print s
14313Python Hammer Institute #2Jefferson70114
8583New Screwdriver Technical Academy, Inc #4Jefferson70114
9371Career RegEx CenterJefferson70113
>>> p = re.compile(r"()(?P\d+)()(>> valign=top>)(>> href=lic_details\.asp)(\?lic_number=\d+)(>)(?P[\s\S\WA-Za-z0-9]*?)()()(?:>>  valign=top>)(?P[\s\WA-Za-z]+)()(>> valign=top>)(?P\d+)()()$", re.M)
>>> n = p.sub(r'LICENSE:\g|NAME:\g|PARISH:\g|ZIP:\g', s)
>>> print n
LICENSE:14313|NAME:Python Hammer Institute #2|PARISH:Jefferson|ZIP:70114
LICENSE:8583|NAME:New Screwdriver Technical Academy, Inc #4|PARISH:Jefferson|ZIP:70114
LICENSE:9371|NAME:Career RegEx Center|PARISH:Jefferson|ZIP:70113
>>>

The solution was to escape the period in the ".asp" string, e.g.,
"\.asp". I also had to limit the pattern in the grouping that captures
the name by adding a "?" after the "*" quantifier, which makes the
match non-greedy.
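
The effect of those two changes can be seen on a made-up input. This is only a sketch in Python 3, not the pattern from the session above: the markup in `row` and the group names `license` and `name` are assumptions.

```python
import re

# Hypothetical input: two records flattened onto one line.
row = ('<td>8583</td><td><a href=lic_details.asp?lic_number=8583>'
       'New Academy #4</a></td>'
       '<td>9371</td><td><a href=lic_details.asp?lic_number=9371>'
       'Career Center</a></td>')

# "\." matches only a literal period and "\?" only a literal question mark.
prefix = r'<td>(?P<license>\d+)</td><td><a href=lic_details\.asp\?lic_number=\d+>'

# ".*" is greedy and runs to the last "</a>"; ".*?" stops at the first one.
greedy = re.compile(prefix + r'(?P<name>.*)</a>')
lazy = re.compile(prefix + r'(?P<name>.*?)</a>')

greedy_names = [m.group('name') for m in greedy.finditer(row)]
lazy_names = [m.group('name') for m in lazy.finditer(row)]

print(len(greedy_names), lazy_names)
```

With the greedy form the first name group swallows the second record, so only one match is found. The non-greedy form yields one name per record.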

Now, who would like to turn that re.compile pattern into a MULTILINE
expression, combining the re.M and re.X flags?

The documentation says that one should be able to combine the flags with
the bitwise OR operator (e.g., re.M | re.X), but I sure couldn't get it to work.
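
A small sketch of the two flags combined, in Python 3. The input text and the group name `num` are made up for illustration. One possible reason a pattern stops matching under re.X, offered only as a guess about the case above, is that verbose mode drops unescaped whitespace from the pattern and treats "#" as the start of a comment, so literal spaces and hashes have to be escaped.

```python
import re

# Hypothetical multi-line input.
text = "LICENSE 14313\nLICENSE 8583\nLICENSE 9371"

# re.X (VERBOSE) ignores unescaped whitespace in the pattern and treats "#"
# as the start of a comment, so a literal space is written "\ " and a
# literal hash would be written "\#".
# re.M (MULTILINE) lets "^" and "$" match at every line boundary.
pattern = re.compile(r"""
    ^LICENSE\      # literal prefix, with the space escaped
    (?P<num>\d+)   # the digits to capture
    $              # end of each line, thanks to re.M
    """, re.M | re.X)

numbers = pattern.findall(text)
print(numbers)
```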

Sometimes a hammer actually is the right tool if you hit the screw
long and hard enough.

I think I'll try to hit some more screws with my new hammer.

Good day.

On Oct 2, 12:10 am, "504cr...@gmail.com" <504cr...@gmail.com> wrote:
> I'm kind of new to regular expressions, and I've spent hours trying to
> finesse a regular expression to build a substitution.
>
> What I'd like to do is extract data elements from HTML and structure
> them so that they can more readily be imported into a database.
>
> No -- sorry -- I don't want to use BeautifulSoup (though I have for
> other projects). Humor me, please -- I'd really like to see if this
> can be done with just regular expressions.
>
> Note that the output is referenced using named groups.
>
> My challenge is successfully matching the HTML tags in between the
> first table row, and the second table row.
>
> I'd appreciate any suggestions to improve the approach.
>
> rText = "8583 href=lic_details.asp?lic_number=8583>New Horizon Technical Academy,
> Inc #4Jefferson70114 tr>9371 lic_number=9371>Career Learning Center valign=top>Jefferson70113"
>
> rText = re.compile(r'()(?P\d+)()( valign=top>)()(?P[A-
> Za-z0-9#\s\S\W]+)().+$').sub(r'LICENSE:\g|NAME:
> \g\n', rText)
>
> print rText
>
> LICENSE:8583|NAME:New Horizon Technical Academy, Inc #4 valign=top>Jefferson70114 valign=top>9371 lic_number=9371>Career Learning Center|PARISH:Jefferson|ZIP:70113



Re: Regular expression to structure HTML

2009-10-03 Thread 504cr...@gmail.com
On Oct 2, 11:14 pm, greg  wrote:
> Brian D wrote:
> > This isn't merely a question of knowing when to use the right
> > tool. It's a question about how to become a better developer using
> > regular expressions.
>
> It could be said that if you want to learn how to use a
> hammer, it's better to practise on nails rather than
> screws.
>
> --
> Greg

It could be said that the bandwidth in technical forums should be
reserved for on-topic exchanges, not for flaming intelligent people who
might have something to contribute to the forum. The truth is, I found
a solution that others were ostensibly either too lazy to attempt or
too eager to grandstand about their superiority to assist with. Who
knows -- maybe I'll provide an alternative to BeautifulSoup one day.



How to insert string in each match using RegEx iterator

2009-06-09 Thread 504cr...@gmail.com
By what method would a string be inserted at each instance of a RegEx
match?

For example:

string = '123 abc 456 def 789 ghi'
newstring = ' INSERT 123 abc INSERT 456 def INSERT 789 ghi'

Here's the code I started with:

>>> rePatt = re.compile('\d+\s')
>>> iterator = rePatt.finditer(string)
>>> count = 0
>>> for match in iterator:
...     if count < 1:
...         print string[0:match.start()] + ' INSERT ' + string[match.start():match.end()]
...     elif count >= 1:
...         print ' INSERT ' + string[match.start():match.end()]
...     count = count + 1

My code returns an empty string.
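
One way to build the full result with an iterator is to keep the text between matches as well as the matches themselves. The following is a sketch in Python 3. The names `pieces` and `last_end` are illustrative, and the marker is written as 'INSERT ' without a leading space so that the spaces are not doubled.

```python
import re

string = '123 abc 456 def 789 ghi'
pattern = re.compile(r'\d+\s')

pieces = []
last_end = 0
for match in pattern.finditer(string):
    # Copy the text between the previous match and this one unchanged.
    pieces.append(string[last_end:match.start()])
    # Put the marker in front of the matched text.
    pieces.append('INSERT ' + match.group())
    last_end = match.end()
# Keep whatever follows the final match.
pieces.append(string[last_end:])

newstring = ''.join(pieces)
print(newstring)
```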

I'm new to Python, but I'm finding it really enjoyable (with the
exception of this challenging puzzle).

Thanks in advance.


Re: How to insert string in each match using RegEx iterator

2009-06-09 Thread 504cr...@gmail.com
On Jun 9, 11:19 pm, Roy Smith wrote:
> "504cr...@gmail.com" <504cr...@gmail.com> wrote:
> > By what method would a string be inserted at each instance of a RegEx
> > match?
>
> > For example:
>
> > string = '123 abc 456 def 789 ghi'
> > newstring = ' INSERT 123 abc INSERT 456 def INSERT 789 ghi'
>
> If you want to do what I think you are saying, you should be looking at the
> join() string method.  I'm thinking something along the lines of:
>
> groups = match_object.groups()
> newstring = " INSERT ".join(groups)

Fast answer, Roy. Thanks. That would be a graceful solution if it
works. I'll give it a try and post a solution.

Meanwhile, I know there's a logical problem with the way I was
concatenating strings in the iterator loop.

Here's a single instance example of what I'm trying to do:

>>> string = 'abc 123 def 456 ghi 789'
>>> match = rePatt.search(string)
>>> print string[0:match.start()] + 'INSERT ' + string[match.end():len(string)]
abc INSERT def 456 ghi 789


Re: How to insert string in each match using RegEx iterator

2009-06-09 Thread 504cr...@gmail.com
On Jun 9, 11:35 pm, "504cr...@gmail.com" <504cr...@gmail.com> wrote:
> On Jun 9, 11:19 pm, Roy Smith  wrote:
>
>
>
> > "504cr...@gmail.com" <504cr...@gmail.com> wrote:
> > > By what method would a string be inserted at each instance of a RegEx
> > > match?
>
> > > For example:
>
> > > string = '123 abc 456 def 789 ghi'
> > > newstring = ' INSERT 123 abc INSERT 456 def INSERT 789 ghi'
>
> > If you want to do what I think you are saying, you should be looking at the
> > join() string method.  I'm thinking something along the lines of:
>
> > groups = match_object.groups()
> > newstring = " INSERT ".join(groups)
>
> Fast answer, Roy. Thanks. That would be a graceful solution if it
> works. I'll give it a try and post a solution.
>
> Meanwhile, I know there's a logical problem with the way I was
> concatenating strings in the iterator loop.
>
> Here's a single instance example of what I'm trying to do:
>
> >>> string = 'abc 123 def 456 ghi 789'
> >>> match = rePatt.search(string)
> >>> print string[0:match.start()] + 'INSERT ' + string[match.end():len(string)]
>
> abc INSERT def 456 ghi 789

Thanks Roy. A little closer to a solution. I'm still processing how to
step forward, but this is a good start:

>>> string = 'abc 123 def 456 ghi 789'
>>> rePatt = re.compile('\s\d+\s')
>>> foundGroup = rePatt.findall(string)
>>> newstring = ' INSERT '.join(foundGroup)
>>> print newstring
 123  INSERT  456

What I really want to do is return the full string, not just the
matches -- concatenated around the ' INSERT ' string.
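
A sketch of one way to keep the full string, in Python 3: re.sub() copies everything outside the matches unchanged, and `\g<0>` in the replacement stands for the whole match. The pattern here is simplified to `\d+`, which is an assumption about what should count as a match.

```python
import re

string = 'abc 123 def 456 ghi 789'

# \g<0> in the replacement stands for the whole match, so the digits are kept
# and the marker is placed in front of them.
newstring = re.sub(r'\d+', r'INSERT \g<0>', string)
print(newstring)
```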


Re: How to insert string in each match using RegEx iterator

2009-06-10 Thread 504cr...@gmail.com
On Jun 10, 5:17 am, Paul McGuire  wrote:
> On Jun 9, 11:13 pm, "504cr...@gmail.com" <504cr...@gmail.com> wrote:
>
> > By what method would a string be inserted at each instance of a RegEx
> > match?
>
> Some might say that using a parsing library for this problem is
> overkill, but let me just put this out there as another data point for
> you.  Pyparsing (http://pyparsing.wikispaces.com) supports callbacks
> that allow you to embellish the matched tokens, and create a new
> string containing the modified text for each match of a pyparsing
> expression.  Hmm, maybe the code example is easier to follow than the
> explanation...
>
> from pyparsing import Word, nums, Regex
>
> # an integer is a 'word' composed of numeric characters
> integer = Word(nums)
>
> # or use this if you prefer
> integer = Regex(r'\d+')
>
> # attach a parse action to prefix 'INSERT ' before the matched token
> integer.setParseAction(lambda tokens: "INSERT " + tokens[0])
>
> # use transformString to search through the input, applying the
> # parse action to all matches of the given expression
> test = '123 abc 456 def 789 ghi'
> print integer.transformString(test)
>
> # prints
> # INSERT 123 abc INSERT 456 def INSERT 789 ghi
>
> I offer this because often the simple examples that get posted are
> just the barest tip of the iceberg of what the poster eventually plans
> to tackle.
>
> Good luck in your Pythonic adventure!
> -- Paul

Thanks for all of the instant feedback. I have enumerated three
responses below:

First response:

Peter,

I wonder if you (or anyone else) might attempt a different explanation
for the use of the special sequence '\1' in the RegEx syntax.

The Python documentation explains:

\number
Matches the contents of the group of the same number. Groups are
numbered starting from 1. For example, (.+) \1 matches 'the the' or
'55 55', but not 'the end' (note the space after the group). This
special sequence can only be used to match one of the first 99 groups.
If the first digit of number is 0, or number is 3 octal digits long,
it will not be interpreted as a group match, but as the character with
octal value number. Inside the '[' and ']' of a character class, all
numeric escapes are treated as characters.

In practice, this appears to be the key device in your
clever solution:

>>> re.compile(r"(\d+)").sub(r"INSERT \1", string)

'abc INSERT 123 def INSERT 456 ghi INSERT 789'

>>> re.compile(r"(\d+)").sub(r"INSERT ", string)

'abc INSERT  def INSERT  ghi INSERT '

I don't, however, precisely understand what is meant by "the group of
the same number" -- or maybe I do, but it isn't explicit. Is this just
a shorthand reference to match.group(1) -- if that were valid --
implying that the group match result is printed in the compile
execution?
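
A sketch that may make the correspondence concrete, in Python 3. The documentation excerpt above describes `\number` inside a pattern. Inside the replacement argument of sub(), `\1` is filled in with the text of group 1 for each match, which is the same text match.group(1) returns. The function name `add_marker` is illustrative.

```python
import re

string = 'abc 123 def 456 ghi 789'
pattern = re.compile(r'(\d+)')

# Template form: \1 is filled in with group 1 of each match.
with_template = pattern.sub(r'INSERT \1', string)

# Function form: sub() calls the function once per match and uses the
# returned string as the replacement, so match.group(1) is spelled out.
def add_marker(match):
    return 'INSERT ' + match.group(1)

with_function = pattern.sub(add_marker, string)

print(with_template)
print(with_function)
```

Both forms produce the same string. Nothing is printed by the compile step itself: the substitution happens inside sub().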

Second response:

I've encountered a problem with my RegEx learning curve which I'll be
posting in a new thread -- how to escape hash characters # in strings
being matched, e.g.:

>>> string = re.escape('123#456')
>>> match = re.match('\d+', string)
>>> print match

<_sre.SRE_Match object at 0x00A6A800>
>>> print match.group()

123

Third response:

Paul,

Thanks for referring me to the Pyparsing module. I'm thoroughly
enjoying Python, but I'm not prepared right now to say I've mastered
the Pyparsing module. As I continue my work, however, I'll be tackling
the problem of parsing addresses, exactly as the Pyparsing module
example illustrates. I'm sure I'll want to use it then.


How to escape # hash character in regex match strings

2009-06-10 Thread 504cr...@gmail.com
I've encountered a problem with my RegEx learning curve -- how to
escape hash characters # in strings being matched, e.g.:

>>> string = re.escape('123#abc456')
>>> match = re.match('\d+', string)
>>> print match

<_sre.SRE_Match object at 0x00A6A800>
>>> print match.group()

123

The correct result should be:

123456

I've tried to escape the hash symbol in the match string without
result.

Any ideas? Is the answer something I overlooked in my lurching Python
schooling?
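
A sketch of what seems to be going on, in Python 3. It assumes the goal is to collect the digits on both sides of the hash. re.escape() is meant for text that will be used as a pattern, not for the string being searched, and re.match() only looks at the start of the string.

```python
import re

s = '123#abc456'

# "#" has no special meaning in a regular expression unless re.X (VERBOSE)
# is in effect, so it can appear in a pattern unescaped.
has_hash = re.search(r'#', s) is not None

# re.match only looks at the start of the string, so r'\d+' stops at "123".
leading = re.match(r'\d+', s).group()

# Joining every run of digits gives the digits on both sides of "#abc".
digits = ''.join(re.findall(r'\d+', s))

print(has_hash, leading, digits)
```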


Re: How to escape # hash character in regex match strings

2009-06-11 Thread 504cr...@gmail.com
On Jun 11, 2:01 am, Lie Ryan  wrote:
> 504cr...@gmail.com wrote:
> > I've encountered a problem with my RegEx learning curve -- how to
> > escape hash characters # in strings being matched, e.g.:
>
> >>>> string = re.escape('123#abc456')
> >>>> match = re.match('\d+', string)
> >>>> print match
>
> > <_sre.SRE_Match object at 0x00A6A800>
> >>>> print match.group()
>
> > 123
>
> > The correct result should be:
>
> > 123456
>
> > I've tried to escape the hash symbol in the match string without
> > result.
>
> > Any ideas? Is the answer something I overlooked in my lurching Python
> > schooling?
>
> As you're not being clear on what you wanted, I'm just guessing this is
> what you wanted:
>
> >>> s = '123#abc456'
> >>> re.match('\d+', re.sub('#\D+', '', s)).group()
> '123456'
> >>> s = '123#this is a comment and is ignored456'
> >>> re.match('\d+', re.sub('#\D+', '', s)).group()
>
> '123456'

Sorry I wasn't more clear. I positively appreciate your reply. It
provides half of what I'm hoping to learn. The hash character is
actually a desirable hook to identify a data entity in a scraping
routine I'm developing, but not a character I want in the scrubbed
data.

In my application, the hash distinguishes a string of alphanumeric
characters from other alphanumeric strings. The strings I'm looking for
are actually manually-entered identifiers, but a real machine-created
identifier shouldn't contain that hash character. The correct pattern
should be 'A1234509', but is instead often merely entered as '#12345'
when the first character, representing an alphabet sequence for the
month, and the last two characters, representing a two-digit year, can
be assumed. Identifying the hash character in a RegEx match is a way
of trapping the string and transforming it into its correct machine-
generated form.

Other patterns the strings can take in their manually-created
form:

A#12345
#1234509

Garbage in, garbage out -- I know. I wish I could tell the people
entering the data how challenging it is to work with what they
provide, but it is, after all, a screen-scraping routine.

I'm surprised it's been so difficult to find an example of the hash
character in a RegEx string -- for exactly this type of situation,
since it's so common in the real world that people want to put a pound
symbol in front of a number.
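
A sketch of a pattern that accepts the four forms mentioned above, in Python 3. The layout is an assumption inferred only from those examples (optional month letter, optional hash, five digits, optional two-digit year), and the group names `month`, `serial`, and `year` are illustrative.

```python
import re

# Assumed layout, inferred only from the examples in this post: an optional
# month letter, an optional "#", five digits, and an optional two-digit year.
ident = re.compile(r'^(?P<month>[A-Z])?#?(?P<serial>\d{5})(?P<year>\d{2})?$')

samples = ['A1234509', '#12345', 'A#12345', '#1234509']
parsed = [ident.match(text).groupdict() for text in samples]
for text, parts in zip(samples, parsed):
    print(text, parts)
```

Parts that are absent from the input come back as None. Filling them in would need the month and year from some other source, which this sketch does not attempt.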

Thanks!



Re: How to insert string in each match using RegEx iterator

2009-06-11 Thread 504cr...@gmail.com
On Jun 10, 10:13 am, Peter Otten <__pete...@web.de> wrote:
> 504cr...@gmail.com wrote:
> > I wonder if you (or anyone else) might attempt a different explanation
> > for the use of the special sequence '\1' in the RegEx syntax.
>
> > The Python documentation explains:
>
> > \number
> >     Matches the contents of the group of the same number. Groups are
> > numbered starting from 1. For example, (.+) \1 matches 'the the' or
> > '55 55', but not 'the end' (note the space after the group). This
> > special sequence can only be used to match one of the first 99 groups.
> > If the first digit of number is 0, or number is 3 octal digits long,
> > it will not be interpreted as a group match, but as the character with
> > octal value number. Inside the '[' and ']' of a character class, all
> > numeric escapes are treated as characters.
>
> > In practice, this appears to be the key to the key device to your
> > clever solution:
>
> >>>> re.compile(r"(\d+)").sub(r"INSERT \1", string)
>
> > 'abc INSERT 123 def INSERT 456 ghi INSERT 789'
>
> >>>> re.compile(r"(\d+)").sub(r"INSERT ", string)
>
> > 'abc INSERT  def INSERT  ghi INSERT '
>
> > I don't, however, precisely understand what is meant by "the group of
> > the same number" -- or maybe I do, but it isn't explicit. Is this just
> > a shorthand reference to match.group(1) -- if that were valid --
> > implying that the group match result is printed in the compile
> > execution?
>
> If I understand you correctly you are right. Another example:
>
> >>> re.compile(r"([a-z]+)(\d+)").sub(r"number=\2 word=\1", "a1 zzz42")
>
> 'number=1 word=a number=42 word=zzz'
>
> For every match of "[a-z]+\d+" in the original string "\1" in
> "number=\2 word=\1" is replaced with the actual match for "[a-z]+" and
> "\2" is replaced with the actual match for "\d+".
>
> The result, e. g. "number=1 word=a", is then used to replace the actual
> match for group 0, i. e. "a1" in the example.
>
> Peter

Wow! That is so cool. I had to process it for a little while to get
it.

>>> s = '111bbb333'
>>> re.compile('(\d+)([b]+)(\d+)').sub(r'First string: \1 Second string: \2 Third string: \3', s)
'First string: 111 Second string: bbb Third string: 333'

MRI scans would no doubt reveal that people who attain a mastery of
regular expressions must have highly developed areas of the brain. I
wonder where the RegEx part of the brain might be located.

That was a really clever teaching device. I really appreciate you
taking the time to post it, Peter. I'm definitely getting a schooling
on this list.

Thanks!