Re: Screen scraper to get all 'a title' elements

Grobu Wed, 25 Nov 2015 14:31:56 -0800

Hi

It seems that links on that Wikipedia page follow this structure:
<a href="..." title="...">


You could extract a list of link titles with something like:
re.findall(r'<a[^>]+title="(.+?)"', html)
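
For what it's worth, here is a minimal sketch of a pattern of that shape run against a few of the anchors quoted in the original post. The names sample_html and titles are only for illustration, and the sample is a hand-copied fragment rather than the fetched page. A regex like this is fragile on arbitrary HTML, so treat it as a quick check rather than a robust parser.

```python
import re

# A few anchors copied from the original post (illustrative fragment,
# not the full Wikipedia page).
sample_html = '''
<a title="Accident, Maryland" href="/wiki/Accident,_Maryland">Accident</a>
<a class="new" title="Ala-Lemu (page does not exist)" href="/w/index.php?title=Ala-Lemu&action=edit&redlink=1">Ala-Lemu</a>
<a title="Alert, Nunavut" href="/wiki/Alert,_Nunavut">Alert</a>
'''

# Capture the value of the title attribute of each <a ...> tag.
# The lazy (.+?) stops at the first closing double quote.
titles = re.findall(r'<a[^>]+title="(.+?)"', sample_html)

for title in titles:
    print(title)
```

Note that this yields the title attribute ("Accident, Maryland"), not the visible link text ("Accident").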

HTH,

-Grobu-


On 25/11/15 21:55, MRAB wrote:

On 2015-11-25 20:42, ryguy7272 wrote:

Hello experts.  I'm looking at this url:
https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names

I'm trying to figure out how to list all 'a title' elements.  For
instance, I see the following:
<a title="Accident, Maryland"
href="/wiki/Accident,_Maryland">Accident</a>
<a class="new" title="Ala-Lemu (page does not exist)"
href="/w/index.php?title=Ala-Lemu&action=edit&redlink=1">Ala-Lemu</a>
<a title="Alert, Nunavut" href="/wiki/Alert,_Nunavut">Alert</a>
<a title="Apocalypse Peaks" href="/wiki/Apocalypse_Peaks">Apocalypse
Peaks</a>

So, I tried putting a script together to get 'title'.  Here's my attempt.

import requests
import sys
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text)
for link in soup.findAll('title'):
     print(link)

All that does is get the title of the page.  I tried to get the links
from that url, with this script.

A 'title' element has the form "<title ...>". What you should be looking
for are 'a' elements, those of the form "<a ...>".
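
To illustrate the difference, here is a rough sketch that walks the 'a' elements and prints the link text of those carrying a title attribute. It uses only the standard library's html.parser so it runs without extra packages; the AnchorCollector class and the sample_html string are made up for this example. With BeautifulSoup, something like soup.find_all('a', title=True) should select the same elements.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (title, link text) pairs for every <a> with a title attribute."""

    def __init__(self):
        super().__init__()
        self.anchors = []    # finished (title, text) pairs
        self._title = None   # title of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            title = dict(attrs).get("title")
            if title is not None:
                self._title = title
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a titled <a> element.
        if self._title is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._title is not None:
            self.anchors.append((self._title, "".join(self._text)))
            self._title = None


# Hypothetical fragment: a <title> element plus a few <a> elements.
sample_html = (
    '<title>Sample page</title>'
    '<a title="Accident, Maryland" href="/wiki/Accident,_Maryland">Accident</a>'
    '<a title="Alert, Nunavut" href="/wiki/Alert,_Nunavut">Alert</a>'
    '<a href="/wiki/Main_Page">no title attribute</a>'
)

collector = AnchorCollector()
collector.feed(sample_html)
collector.close()

for title, text in collector.anchors:
    print(text)
```

The <title> element and the anchor without a title attribute are both skipped; only the two titled links are collected.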

import urllib2
import re

#connect to a URL
website = urllib2.urlopen('https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names')


#read html code
html = website.read()

#use re.findall to get all the links
links = re.findall('"((http|ftp)s?://.*?)"', html)

print links

That doesn't work either.  Basically, I'd like to see this.

Accident
Ala-Lemu
Alert
Apocalypse Peaks
Athol
Å
Barbecue
Båstad
Bastardstown
Batman
Bathmen (Battem), Netherlands
...
Worms
Yell
Zigzag
Zzyzx

How can I do that?
Thanks all!!


--
https://mail.python.org/mailman/listinfo/python-list
