> Hi,
>
> I'm trying to use the Beautiful Soup package to parse through the
> "bookmarks.html" file which Firefox exports all your bookmarks into.
> I've been struggling with the documentation trying to figure out how to
> extract all the urls. Has anybody got a couple of longer examples using
> Beautiful Soup I could play around with?
>
> Thanks,
> Martin.
Martin,

SE is a stream editor that does not introduce the overhead and complications of overkill parsing. See if it suits your needs: http://cheeseshop.python.org/pypi/SE/2.2%20beta

>>> import SE
>>> Bookmark_Filter = SE.SE ('''
   <EAT>                         # delete all unmatched input
   "~(?i)<a.*?href.*?>~==\n"     # keep hrefs and add a new line
   "~(?i)[^>]+/a>~==\n\n"        # keep text till end of anchor and add two newlines
   |                             # run
   <a= <A= </a>= </A>= href\== HREF\== >=    # delete the noise (extend to your liking)
''')

>>> # 2nd parameter '' commands string output. Default is a file.
>>> print Bookmark_Filter (r'C:\WINDOWS\Application Data\Mozilla\Profiles\default\wwaidm0p.slt\bookmarks.html', '')

...

"http://www.inksupply.com/index.cfm?source=html/main2.html" ADD_DATE="1016024829" LAST_VISIT="1039439802" LAST_CHARSET="ISO-8859-1"
MIS Associates Inc.

"http://www.weink.com/" ADD_DATE="1016034183" LAST_VISIT="1118782455" LAST_CHARSET="windows-1252"
Inkjet, Laser, Copier, Fax Supplies

"http://www.nextrend.com/analysis/content/pr_9-19-2000.asp" ADD_DATE="1018037196" LAST_VISIT="1126289805" LAST_CHARSET="ISO-8859-1"
NexTrend - Press Releases

"http://wp.netscape.com/escapes/search/netsearch_E.html" ADD_DATE="1021644432" LAST_VISIT="1023182857" LAST_CHARSET="ISO-8859-1"
Net Search Page - Google

"http://www.python.org/" ADD_DATE="1021651575" LAST_VISIT="1121690494" LAST_CHARSET="ISO-8859-1"
Python Language Website

"http://www.teldir.com/real/frame.asp?page=http://www.whitepages.ch" ADD_DATE="1027354641" LAST_VISIT="1115386846" LAST_CHARSET="windows-1252"
http://www.teldir.com/real/frame.asp?page=http://www.whitepages.ch

... etc.

You may refine this further by adding more deletions or substitutions. Adding them one by one and examining the output each time around is very easy and straightforward.
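
If you do not have SE installed, roughly the same extraction can be sketched with the standard library's re module. This is not how SE works internally, just an approximation of what the filter above does. The sample data, the ANCHOR pattern and the extract_bookmarks function are illustrative names of my own, and the sketch is written for Python 3, unlike the session above. It also assumes every href value is double-quoted, as in the Firefox export shown.

```python
import re

# Hypothetical sample in the bookmark format that Firefox exports.
SAMPLE = '''<DT><A HREF="http://www.python.org/" ADD_DATE="1021651575" LAST_CHARSET="ISO-8859-1">Python Language Website</A>
<DT><a href="http://www.weink.com/" ADD_DATE="1016034183">Inkjet, Laser, Copier, Fax Supplies</a>
'''

# One anchor: capture the double-quoted href value and the link text up to </a>.
ANCHOR = re.compile(
    r'<a\s[^>]*?href\s*=\s*"([^"]*)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def extract_bookmarks(html):
    """Return a list of (url, title) pairs, one per anchor found in html."""
    return [(url, title.strip()) for url, title in ANCHOR.findall(html)]


if __name__ == "__main__":
    for url, title in extract_bookmarks(SAMPLE):
        print(url)
        print(title)
        print()
```

Unlike the SE filter, this drops the ADD_DATE, LAST_VISIT and LAST_CHARSET attributes and keeps only the URL and the title. To run it on a real file, read the file into a string first and pass that string to extract_bookmarks.
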

The SE object accepts strings as well as file names and returns strings by default, so developing interactively in an IDLE window with a sample data string is fast and painless: you can build the filter incrementally, one step at a time.

>>> Bookmark_Filter.save ('bookmark_filter.se')       # Save definitions to an editable text file
>>> Bookmark_Filter = SE.SE ('bookmark_filter.se')    # Next time, naming the definition file makes the same object

Regards

Frederic

-- 
http://mail.python.org/mailman/listinfo/python-list