> Hi,
>
> I'm trying to use the Beautiful Soup package to parse through the
> "bookmarks.html" file which Firefox exports all your bookmarks into.
> I've been struggling with the documentation trying to figure out how to
> extract all the urls. Has anybody got a couple of longer examples using
> Beautiful Soup I could play around with?
>
> Thanks,
> Martin.


Martin,

   SE is a stream editor that does not introduce the overhead and complications 
of overkill parsing. See if it suits your needs:
http://cheeseshop.python.org/pypi/SE/2.2%20beta

>>> import SE
>>> Bookmark_Filter  = SE.SE ('''
      <EAT>   # delete all unmatched input
      "~(?i)<a.*?href.*?>~==\n"    # keep hrefs and add a new line
      "~(?i)[^>]+/a>~==\n\n"  # keep text till end of anchor and add two newlines
      |   # run
       <a= <A= </a>= </A>= href\== HREF\==  >=      # delete the noise (extend to your liking)
''')

>>> # 2nd parameter '' commands string output. Default is a file.
>>> print Bookmark_Filter (r'C:\WINDOWS\Application Data\Mozilla\Profiles\default\wwaidm0p.slt\bookmarks.html', '')
...

 "http://www.inksupply.com/index.cfm?source=html/main2.html" ADD_DATE="1016024829" LAST_VISIT="1039439802" LAST_CHARSET="ISO-8859-1"
MIS Associates Inc.

 "http://www.weink.com/" ADD_DATE="1016034183" LAST_VISIT="1118782455" LAST_CHARSET="windows-1252"
Inkjet, Laser, Copier, Fax Supplies

 "http://www.nextrend.com/analysis/content/pr_9-19-2000.asp" ADD_DATE="1018037196" LAST_VISIT="1126289805" LAST_CHARSET="ISO-8859-1"
NexTrend - Press Releases

 "http://wp.netscape.com/escapes/search/netsearch_E.html" ADD_DATE="1021644432" LAST_VISIT="1023182857" LAST_CHARSET="ISO-8859-1"
Net Search Page - Google

 "http://www.python.org/" ADD_DATE="1021651575" LAST_VISIT="1121690494" LAST_CHARSET="ISO-8859-1"
Python Language Website

 "http://www.teldir.com/real/frame.asp?page=http://www.whitepages.ch" ADD_DATE="1027354641" LAST_VISIT="1115386846" LAST_CHARSET="windows-1252"
http://www.teldir.com/real/frame.asp?page=http://www.whitepages.ch

... etc.
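
If SE is not at hand, roughly the same extraction can be sketched with the
standard re module. This is not how SE works internally, just a comparable
regex approach in Python 3. The name extract_bookmarks and the sample string
are made up for illustration, and the pattern assumes double-quoted href
values as in the Firefox export above.

```python
import re

# Match one anchor tag; capture the href value and the link text.
# Assumption: href values are double-quoted, as in the export shown above.
ANCHOR = re.compile(
    r'<a\b[^>]*?\bhref="([^"]*)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)

def extract_bookmarks(html):
    """Return a list of (url, title) pairs found in an HTML string."""
    return [(url, title.strip()) for url, title in ANCHOR.findall(html)]

# Hypothetical two-line excerpt in the style of a bookmarks.html file.
sample = (
    '<DT><A HREF="http://www.python.org/" ADD_DATE="1021651575">'
    'Python Language Website</A>\n'
    '<DT><A HREF="http://www.weink.com/" ADD_DATE="1016034183">'
    'Inkjet, Laser, Copier, Fax Supplies</A>\n'
)

for url, title in extract_bookmarks(sample):
    print(url)
    print(title)
    print()
```

To run it on a real file, read the file into a string first and pass that
string to extract_bookmarks.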


You can refine this further by adding more deletions or substitutions.
Adding them one at a time and examining the output after each change is
straightforward. The SE object accepts strings as well as file names, and
for string input it returns a string by default, so developing interactively
in an IDLE window with a sample data string is fast: you can work
incrementally, one step at a time.
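
The same one-rule-at-a-time workflow can be tried without SE. The following
is only a sketch in Python 3 using the standard re module; the RULES list
and the apply_rules helper are hypothetical names, and the sample line is
copied from the output above.

```python
import re

# Clean-up rules applied in order. Add one rule, rerun, inspect the output.
RULES = [
    (r'\s*ADD_DATE="\d+"', ''),         # drop the creation timestamp
    (r'\s*LAST_VISIT="\d+"', ''),       # drop the last-visit timestamp
    (r'\s*LAST_CHARSET="[^"]*"', ''),   # drop the charset attribute
]

def apply_rules(text, rules):
    """Apply each (pattern, replacement) pair to text, in order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

line = (' "http://www.python.org/" ADD_DATE="1021651575" '
        'LAST_VISIT="1121690494" LAST_CHARSET="ISO-8859-1"')

print(apply_rules(line, RULES[:1]))   # after the first rule only
print(apply_rules(line, RULES))       # after all three rules
```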

>>> Bookmark_Filter.save ('bookmark_filter.se')    # Save definitions to an editable text file
>>> Bookmark_Filter = SE.SE ('bookmark_filter.se')    # Next time, naming the definition file makes the same object

Regards

Frederic


-- 
http://mail.python.org/mailman/listinfo/python-list
