> On Fri, Jun 5, 2015 at 12:10 PM, Wesley <nisp...@gmail.com> wrote:
> > Hi Laura,
> > Sure, I got special requirement that just parse html file into DOM tree,
> > by only general basic modules, and based on my DOM tree structure, draft an
> > bitmap.
> >
> > So, could you give me an direction how to get the DOM tree?
> > Currently, I just think out to use something like stack, I mean, maybe read
> > the file line by line, adding to a stack data structure(list for example),
> > and, then, got the parent/child relation .etc
> >
> > I don't know if what I said is easy to achieve, I am just trying.
> > Any better suggestions will be great appreciated.
>
> If you want to recreate the same DOM structure that would be created
> by a browser, the standardized algorithm to do so is very complicated,
> but you can find it at
> http://www.w3.org/TR/2011/WD-html5-20110113/parsing.html.
>
> If you're not necessarily seeking perfect fidelity, I would encourage
> you to try to find some way to incorporate beautifulsoup into your
> project. It likely won't produce the same structure that a real
> browser would, but it should do well enough to scrape from even badly
> malformed html.
>
> I recommend against using an XML parser, because HTML isn't XML, and
> such a parser may choke even on perfectly valid HTML such as this:
>
> <!DOCTYPE html>
> <html>
> <head><title>Document</title></head>
> <body>
> First line
> <br>
> Second line
> </body>
> </html>
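The point about XML parsers in the quoted text can be checked with the standard library alone. This is only a minimal sketch, assuming Python 3 and taking `xml.etree.ElementTree` as the XML parser (the quoted message does not name a particular one). `SAMPLE` is the document quoted above, and `parses_as_xml` is a hypothetical helper name, not part of any library:

```python
import xml.etree.ElementTree as ET

# The valid HTML document from the quoted message.
SAMPLE = """<!DOCTYPE html>
<html>
<head><title>Document</title></head>
<body>
First line
<br>
Second line
</body>
</html>"""


def parses_as_xml(text):
    """Return True if text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# <br> has no end tag in HTML, so an XML parser sees </body>
# trying to close the still-open <br> element and gives up.
print(parses_as_xml(SAMPLE))

# Writing the same element XML-style as <br/> makes the document well-formed.
print(parses_as_xml(SAMPLE.replace("<br>", "<br/>")))
```

The sample page further down has the same property: its `<meta>`, `<link>`, `<img>` and `<input>` elements are written without end tags, which is fine in HTML but not in XML.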
Hi,

Hmm, it's really complex. For now I don't need to handle every error case: I'll assume the HTML is well-formed and just generate the DOM tree from it.

HTML sample below:

<!DOCTYPE html>
<!-- saved from url=(0026)http://www.opera.com/about -->
<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="description" content="Opera is an independent Scandinavian company that's been in the business of making web browsers since 1994. Read more about Opera Software here.">
<title>About - Opera Software</title>
<link rel="apple-touch-icon" sizes="57x57" href="http://d2jc9zwbrclgz3.cloudfront.net/static-heap/da/dafd15591b35d4f81ca96cf7de6582d705850ff0/apple-touch-icon-57x57.png">
</head>
<body screen_capture_injected="true"><div style="position: fixed; top: 0px; left: 0px; height: 0px; width: 0px; z-index: 9999999;"><div style="position: fixed; top: 100%; height: 0px;"><div style="position: relative;"></div></div></div>
<!-- Google Tag Manager -->
<nav class="business-menu">
<ul>
<li><a data-action-id="header_item" href="http://operamediaworks.com/">Opera Mediaworks</a></li>
</ul>
</nav>
<main role="main" class="generic_landing_page">
<h1>Who we are, what we do</h1>
<figure class="visuals">
<img src="./About - Opera Software_files/pro-kompaniyu.jpg" alt="" width="900" height="424">
</figure>
<ul class="blocks col3">
<li>
<h3>Vision</h3>
<p>We strive to develop superior products and services for our users around the world, through state-of-the-art technology, innovation, leadership and partnerships.</p><p><a href="http://www.operasoftware.com/company/vision" target="_self">Find out more</a>.</p>
</li>
<li>
</ul>
</main>
<footer class="ns--hf">
<aside>
<div class="hf--extra">
<h2 class="hf--visuallyhidden">Page language</h2>
<div id="language" class="hf--language hf--hover-enabled hf--popup-container">
<input id="language-toggle" class="hf--popup-toggle hf--visuallyhidden" type="checkbox" aria-haspopup="true">
<label for="language-toggle" class="hf--popup-toggle-label" tabindex="0">
<span class="hf--hide-overflow">
<span class="">Select your language:</span>
<span class="">English</span>
</span>
</label>
</div>
</div>
</aside>
<div class="hf--meta hf--clearfix">
<small class="hf--company">Copyright © 2014 Opera Software ASA. All rights reserved.
<a data-action-id="footer_item" href="http://www.opera.com/privacy">Privacy.</a>
<a data-action-id="footer_item" href="http://www.opera.com/terms">Terms of Use.</a>
</small>
</div>
</footer>
</body></html>
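The stack idea from the original question can be sketched with the standard library alone. This is only a rough illustration and not the browser algorithm from the W3C document. It assumes Python 3, and that `html.parser` counts as a "general basic module" for this requirement. `Node`, `TreeBuilder`, `parse_html`, `dump` and `VOID_TAGS` are all hypothetical names made up for the example. The tokenizer reports start tags, end tags and text, and the stack keeps track of which element is currently open:

```python
from html.parser import HTMLParser

# Elements that never have an end tag in HTML, so they are never pushed.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical minimal DOM node: tag name, attributes, parent, children."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # Node objects and text strings, in document order


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]  # top of the stack is the innermost open element

    def handle_starttag(self, tag, attrs):
        parent = self.stack[-1]
        node = Node(tag, attrs, parent)
        parent.children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this name.
        # Anything left open inside it is closed too; stray end tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text between tags
            self.stack[-1].children.append(text)


def parse_html(text):
    """Parse an HTML string and return the hypothetical #document root node."""
    builder = TreeBuilder()
    builder.feed(text)
    builder.close()
    return builder.root


def dump(node, depth=0):
    """Return the tree below node as a list of indented lines."""
    lines = []
    for child in node.children:
        if isinstance(child, Node):
            lines.append("  " * depth + child.tag)
            lines.extend(dump(child, depth + 1))
        else:
            lines.append("  " * depth + repr(child))
    return lines


root = parse_html("<html><body><p>Hi<br>there</p><img src='a.png'></body></html>")
print("\n".join(dump(root)))
```

Reading the file line by line is not necessary with this approach, since `feed()` accepts the whole text (or successive chunks of it) and a tag may span several lines. The doctype and comments are simply dropped here because the corresponding handlers are not overridden. The unclosed `<li>` just before `</ul>` in the sample is closed by the pop-back loop in `handle_endtag` when `</ul>` arrives.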