inhahe wrote:

> I hope this is an appropriate mailing list for BeautifulSoup questions,
> it's been a long time since I've used python-list and I don't remember if
> third-party modules are on topic. I did try posting to the BeautifulSoup
> mailing list on Google groups, but I've waited a day or two and my message
> hasn't been approved yet.
>
> Say I have the following HTML (I hope this shows up as plain text here
> rather than formatting):
>
> <div style="font-size: 20pt;"><span style="color:
> #000000;"><em><strong>"Is today the day?"</strong></em></span></div>
>
> And I want to extract the "Is today the day?" part. There are other places
> in the document with <em> and <strong>, but this is the only place that
> uses color #000000, so I want to extract anything that's within a color
> #000000 style, even if it's nested multiple levels deep within that.
>
> - Sometimes the color is defined as RGB(0, 0, 0) and sometimes it's
> defined as #000000
> - Sometimes the <strong> is within the <em> and sometimes the <em> is
> within the <strong>.
> - There may be other discrepancies I haven't noticed yet
>
> How can I do this in BeautifulSoup (or is this better done in lxml.html)?
> Thanks
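
I don't see how to do this without a lot of glue code, but this may get you
started:

    import re

    import bs4


    def recursive_attr(elem, path):
        # Follow a chain of child tags such as "strong/em".
        # Returns None as soon as one step of the chain is missing.
        for name in path.split("/"):
            if elem is None:
                break
            elem = getattr(elem, name)
        return elem


    def find(soup):
        # Spans whose style sets the color to black, in either notation.
        black = re.compile(r"color:\s*(RGB\(0,\s*0,\s*0\)|#000000)")
        for outer in soup.find_all("span", style=black):
            # Accept both nesting orders of <strong> and <em>.
            for inner in [
                recursive_attr(outer, "strong/em"),
                recursive_attr(outer, "em/strong"),
            ]:
                if inner is not None:
                    yield inner.string


    def normalize_ws(s):
        # Collapse runs of whitespace (including newlines) into single spaces.
        return " ".join(s.split())


    html = ...
    soup = bs4.BeautifulSoup(html, "html.parser")
    for match in find(soup):
        print(normalize_ws(match))

-- 
https://mail.python.org/mailman/listinfo/python-list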
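
The find() above only knows the two nesting orders strong/em and em/strong. For the "nested multiple levels deep" part of the question, one option is to stop looking at the inner tags at all and collect every piece of text inside the element that carries the black color style. The sketch below does that with the standard library's html.parser instead of BeautifulSoup, so it is a different technique from the code above, not a drop-in replacement. The names BLACK, VOID_TAGS, BlackTextCollector and black_text are made up for this example, and it assumes reasonably well-formed markup in which every non-void tag is closed.

```python
import re
from html.parser import HTMLParser

# "color: #000000" or "color: rgb(0, 0, 0)" inside a style attribute.
# The lookbehind keeps "background-color" from matching.
BLACK = re.compile(
    r"(?<![-\w])color:\s*(?:#000000|rgb\(\s*0\s*,\s*0\s*,\s*0\s*\))",
    re.IGNORECASE,
)

# Tags that never have an end tag, so they must not affect the depth count.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class BlackTextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0     # nesting depth inside a matching element, 0 = outside
        self.parts = []    # text fragments of the match currently being read
        self.matches = []  # finished matches, whitespace-normalized

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif BLACK.search(dict(attrs).get("style") or ""):
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # The matching element just closed: store its collected text.
            self.matches.append(" ".join("".join(self.parts).split()))
            self.parts = []

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def black_text(html):
    parser = BlackTextCollector()
    parser.feed(html)
    parser.close()
    return parser.matches


html = ('<div style="font-size: 20pt;"><span style="color: #000000;">'
        '<em><strong>"Is today the day?"</strong></em></span></div>')
print(black_text(html))  # prints ['"Is today the day?"']
```

With BeautifulSoup the same idea would probably be to keep the find_all() call with the style regex and call get_text() on each matched tag instead of walking strong/em by hand, though I haven't tried that against the real document.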