could ildg said:
> I want to use re because I want to extract something from a html. It
> will be very complicated without using re. But while using re, I
> found that I must exclude a whole word "</td>", certainly, there are
> many many "</td>" in this html.

Actually, for properly processing html, you shouldn't really be using
regular expressions, precisely because the problem is complicated -
regular expressions are too simple and can't properly model a language
like HTML, whose nested tags are generated by a context-free grammar.

If that's only meaningless technical mumbo-jumbo to you, never mind -
the important point is you shouldn't really use an re. Trust me :)

What you want for a job like this is an HTML parser. There's one in the
standard library; if it doesn't suit, there are plenty of third party
ones. I like Beautiful Soup:

http://www.crummy.com/software/BeautifulSoup/

If you insist on using an re, well I'm sure someone on this group will
figure out a solution to your issue that's as good as you're going to
get...

>
> My re is as below:
> _____________________________________________
> r=re.compile(ur'valign=top>(?P<number>\d{1,2})</td><td[^>]*>\s{0,2}'
>              ur'<a href="(?P<url>[^<>]+\.mp3)"( )target=_blank>'
>              ur'(?P<name>.+)</td>',re.UNICODE|re.IGNORECASE)
> _____________________________________________
> There should be over 30 matches in the html. But I find nothing by
> re.finditer(html) because my last line of re is wrong. I can't use
> "(?P<name>.+)</td>" because there are many many "</td>" in the html
> and I just want the ".*" to match what are before the first "</td>".
> So I think if there is some idea I can exclude a word, this will be
> done. Assume there is "NOT(WORD)" can do it, I just need to write the
> last line of the re as "(?P<name>(NOT(</td>))+)</td>".
> But I still have no idea after thinking and trying for a very long time.
>
> In other words, I want the "</td>" of "(?P<name>.+)</td>" to be
> exactly the first "</td>" in this match. And there is more than one
> match in this html, so this must be done by using re.
>
> And I can't use any of your idea because what I want I deal with is a
> very complicated html, not just a single line of word.
>
> I can copy part of the html up to here but it's kinda too lengthy.
>
> On 8/15/05, John Machin <[EMAIL PROTECTED]> wrote:
> > could ildg wrote:
> > > In re, the punctuation "^" can exclude a single character, but I want
> > > to exclude a whole word now. for example I have a string "hi, how are
> > > you. hello", I want to extract all the part before the word "hello",
> > > I can't use ".*[^hello]" because "^" only exclude single char "h" or
> > > "e" or "l" or "o". Will somebody tell me how to do it? Thanks.
> >
> > (1) Why must you use re? It's often a good idea to use string methods
> > where they can do the job you want.
> > (2) What do you want to have happen if "hello" is not in the string?
> >
> > Example:
> >
> > C:\junk>type upto.py
> > def upto(strg, what):
> >     k = strg.find(what)
> >     if k > -1:
> >         return strg[:k]
> >     return None # or raise an exception
> >
> > helo = "hi, how are you? HELLO I'm fine, thank you hello hello hello. that's it"
> >
> > print repr(upto(helo, "HELLO"))
> > print repr(upto(helo, "hello"))
> > print repr(upto(helo, "hi"))
> > print repr(upto(helo, "goodbye"))
> > print repr(upto("", "goodbye"))
> > print repr(upto("", ""))
> >
> > C:\junk>upto.py
> > 'hi, how are you? '
> > "hi, how are you? HELLO I'm fine, thank you "
> > ''
> > None
> > None
> > ''
> >
> > HTH,
> > John
> > --
> > http://mail.python.org/mailman/listinfo/python-list
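
For what it's worth, here is a rough sketch of both routes discussed in this thread, written for current Python 3 rather than the Python 2 of the quoted code. The real page was never posted, so SAMPLE is made-up markup that merely resembles what the pattern expects, and extract() and CellText are illustrative names of my own. The first half shows two ways to make the name group stop at the first "</td>" (a non-greedy quantifier, and a negative lookahead, which is the closest thing re has to the imagined NOT(WORD)). The second half uses the standard library's html.parser to pull out cell text with no re at all.

```python
import re
from html.parser import HTMLParser

# Made-up sample standing in for the real page (an assumption).
SAMPLE = (
    '<tr><td valign=top>1</td><td> <a href="a/one.mp3" target=_blank>'
    'First song</a></td><td>3:10</td></tr>\n'
    '<tr><td valign=top>2</td><td> <a href="b/two.mp3" target=_blank>'
    'Second song</a></td><td>4:05</td></tr>'
)

# Same prefix as the quoted pattern, with "( )" simplified to a plain space.
PREFIX = (r'valign=top>(?P<number>\d{1,2})</td><td[^>]*>\s{0,2}'
          r'<a href="(?P<url>[^<>]+\.mp3)" target=_blank>')

# Greedy ".+" runs on to the LAST "</td>" it can reach on the line.
greedy = re.compile(PREFIX + r'(?P<name>.+)</td>', re.IGNORECASE)
# Non-greedy ".+?" stops at the FIRST "</td>".
lazy = re.compile(PREFIX + r'(?P<name>.+?)</td>', re.IGNORECASE)
# Negative lookahead: take a character only if "</td>" does not start here.
tempered = re.compile(PREFIX + r'(?P<name>(?:(?!</td>).)+)</td>',
                      re.IGNORECASE)


def extract(pattern, html):
    """Return (number, url, name) for every match of pattern in html."""
    return [(m.group('number'), m.group('url'), m.group('name'))
            for m in pattern.finditer(html)]


class CellText(HTMLParser):
    """Collect the text content of every <td> cell, ignoring inner tags."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._buf = None  # None while we are outside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self._buf = []

    def handle_endtag(self, tag):
        if tag == 'td' and self._buf is not None:
            self.cells.append(''.join(self._buf).strip())
            self._buf = None

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)


if __name__ == '__main__':
    for label, pat in (('greedy', greedy), ('lazy', lazy),
                       ('tempered', tempered)):
        print(label, extract(pat, SAMPLE))
    parser = CellText()
    parser.feed(SAMPLE)
    parser.close()
    print(parser.cells)
```

Note that "." does not match a newline by default, so if a cell can span several lines in the real page, either pattern would also need re.DOTALL. The name group still carries the trailing "</a>" because the original pattern ends at "</td>"; the parser version sidesteps that, which is one more reason to prefer it.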