Johny wrote:
> I have the following text
>
> <title>Goods Item 146 (174459989) - OurWebSite</title>
>
> from which I need to extract
> `Goods Item 146 '
>
> Can anyone help with regexp?
> Thank you for help
> L.

In general, parsing HTML with regular expressions is a bad idea. Usually, you use something like BeautifulSoup to parse the HTML, extract the desired field, such as the contents of "<title>", and then work on that. If you try to do this line by line with regular expressions, it will fail when the line breaks aren't where you expect. If you try to run regular expressions over a whole document, other material, such as content in comments, can be misrecognized.

Try something like this:

    import re
    import BeautifulSoup

    # Regular expression to extract group before "(NNNNN)".
    # re.DOTALL lets "." match a newline inside the title text.
    kreextractitem = re.compile(r'^(.*)\(\d+\)', re.DOTALL)

    pagetree = BeautifulSoup.BeautifulSoup(stringcontaininghtml)
    titleitem = pagetree.find({'title':True, 'TITLE':True})
    if titleitem :
        titletext = " ".join(titleitem.findAll(text=True, recursive=True))
        # Text of TITLE item is now in "titletext" as a string.
        groups = kreextractitem.search(titletext)
        if groups :
            goodsitem = groups.group(1).strip()
            # "goodsitem" now contains everything before "(NNNN)"

This approach will work no matter where the line breaks are in the original HTML.

                John Nagle
-- 
http://mail.python.org/mailman/listinfo/python-list
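
For anyone without BeautifulSoup installed, the same parse-first, match-second idea can be sketched with only the standard library. This is an illustration under assumptions, not the code from the post: it swaps in `html.parser.HTMLParser` for BeautifulSoup, the names `TitleCollector` and `extract_goods_item` are made up for this example, and the pattern is a non-greedy variant that stops at the first "(NNNN)" group.

```python
import re
from html.parser import HTMLParser

# Non-greedy variant: capture everything before the first "(NNNN)" group.
ITEM_RE = re.compile(r'^(.*?)\s*\(\d+\)')


class TitleCollector(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> matches as well.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_goods_item(html_text):
    """Return the title text before "(NNNN)", or None if there is no match."""
    parser = TitleCollector()
    parser.feed(html_text)
    parser.close()
    # Collapse all whitespace so line breaks inside the title do not matter.
    title_text = " ".join("".join(parser.parts).split())
    match = ITEM_RE.search(title_text)
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    page = "<html><head><TITLE>Goods Item 146\n (174459989) - OurWebSite</TITLE></head></html>"
    print(extract_goods_item(page))
```

Because the parser hands over only the text inside the title element, a "<title>" that appears inside an HTML comment is not picked up, and collapsing whitespace before matching keeps the regular expression simple.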