Re: Screen scraper to get all 'a title' elements

2015-11-26 Thread Denis McMahon
On Wed, 25 Nov 2015 12:42:00 -0800, ryguy7272 wrote: > Hello experts. I'm looking at this url: > https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names > > I'm trying to figure out how to list all 'a title' elements. a is the element tag, title is an attribute of the htmlanchorelement. co
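
The distinction above (a is the tag, title is one of its attributes) can be sketched with the standard library's html.parser. This is only a minimal illustration, not the approach anyone in the thread posted: the class name AnchorTitleCollector, the helper anchor_titles, and the sample markup are all made up for the example.

```python
from html.parser import HTMLParser


class AnchorTitleCollector(HTMLParser):
    """Collect the title attribute of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # html.parser hands over tag and attribute names in lowercase.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "title" and value is not None:
                self.titles.append(value)


def anchor_titles(html):
    """Return the title values of all <a> elements found in html."""
    collector = AnchorTitleCollector()
    collector.feed(html)
    collector.close()
    return collector.titles


# Invented sample markup; the real page's structure may differ.
sample = (
    "<ul>"
    '<li><a href="/a" title="Accident">Accident</a></li>'
    '<li><a href="/b" title="Ala-Lemu">Ala-Lemu</a></li>'
    '<li><span title="not a link">ignored</span></li>'
    "</ul>"
)

print(anchor_titles(sample))
```

The span's title is skipped because only a elements are inspected, which is the point of treating title as an attribute of the anchor rather than as an element of its own.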

Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread TP
On Wed, Nov 25, 2015 at 12:42 PM, ryguy7272 wrote: > Hello experts. I'm looking at this url: > https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names Wildly offtopic but interesting, easy way to grab/analyze Wikipedia data using F# instead of Python http://evelinag.com/blog/2015/11-18-f-tac

Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread Grobu
Chris, Marko, thank you both for your links and explanations! -- https://mail.python.org/mailman/listinfo/python-list

Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread Chris Angelico
On Thu, Nov 26, 2015 at 10:53 AM, Marko Rauhamaa wrote: > Regular expressions can handle any regular language just fine. They are > commonly used to define the lexical tokens of a language. Not sure about _defining_ them, but they're certainly often used to _recognize_ them, eg in syntax highligh

Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread Marko Rauhamaa
Grobu : > Sorry, I wasn't aware of regex being on the dark side :-) No, regular expressions are great for many purposes. Parsing context-free syntax isn't one of them. See: https://en.wikipedia.org/wiki/Chomsky_hierarchy#The_hierarchy Most modern programming languages including HTML are con
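
A small sketch of why nesting is the sticking point, assuming a toy document with one div inside another (the sample string and the DepthTracker class are invented for illustration). A non-greedy pattern stops at the first closing tag, while a parser that counts open and close tags keeps track of depth:

```python
import re
from html.parser import HTMLParser

# Invented sample: a div nested inside another div.
nested = "<div>outer <div>inner</div> tail</div>"

# The non-greedy match ends at the FIRST </div>, so the outer element's
# content is cut short.
match = re.search(r"<div>(.*?)</div>", nested)
regex_content = match.group(1)
print(repr(regex_content))


class DepthTracker(HTMLParser):
    """Track how deeply <div> elements are nested (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1


tracker = DepthTracker()
tracker.feed(nested)
tracker.close()
print(tracker.max_depth, tracker.depth)
```

The counter is the part a plain regular expression cannot express for arbitrary depth: matching each close tag to its open tag needs memory that grows with the nesting.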

Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread Chris Angelico
On Thu, Nov 26, 2015 at 10:44 AM, Grobu wrote: > On 26/11/15 00:06, Chris Angelico wrote: >> >> On Thu, Nov 26, 2015 at 9:48 AM, ryguy7272 wrote: >>> >>> Thanks!! Is that regex? Can you explain exactly what it is doing? >>> Also, it seems to pick up a lot more than just the list I wanted, but >

Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread Grobu
On 26/11/15 00:06, Chris Angelico wrote: On Thu, Nov 26, 2015 at 9:48 AM, ryguy7272 wrote: Thanks!! Is that regex? Can you explain exactly what it is doing? Also, it seems to pick up a lot more than just the list I wanted, but that's ok, I can see why it does that. Can you just please expla

Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread Chris Angelico
On Thu, Nov 26, 2015 at 10:37 AM, ryguy7272 wrote: > Wow! Awesome! I bookmarked that link! > Thanks for everything!!! Also bookmark this link: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags And read it before you do any parsing of HTML using

Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread ryguy7272
On Wednesday, November 25, 2015 at 6:34:00 PM UTC-5, Grobu wrote: > On 25/11/15 23:48, ryguy7272 wrote: > >> re.findall( r'\]+title="(.+?)"', html ) > [ ... ] > > Thanks!! Is that regex? Can you explain exactly what it is doing? > > Also, it seems to pick up a lot more than just the list I wanted

Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread Grobu
On 25/11/15 23:48, ryguy7272 wrote: re.findall( r'\]+title="(.+?)"', html ) [ ... ] Thanks!! Is that regex? Can you explain exactly what it is doing? Also, it seems to pick up a lot more than just the list I wanted, but that's ok, I can see why it does that. Can you just please explain wha

Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread Chris Angelico
On Thu, Nov 26, 2015 at 9:48 AM, ryguy7272 wrote: > Thanks!! Is that regex? Can you explain exactly what it is doing? > Also, it seems to pick up a lot more than just the list I wanted, but that's > ok, I can see why it does that. > > Can you just please explain what it's doing??? It's a trap!

Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread ryguy7272
On Wednesday, November 25, 2015 at 5:30:14 PM UTC-5, Grobu wrote: > Hi > > It seems that links on that Wikipedia page follow the structure : > > > You could extract a list of link titles with something like : > re.findall( r'\]+title="(.+?)"', html ) > > HTH, > > -Grobu- > > > On 25/11/15 21

Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread Grobu
Hi It seems that links on that Wikipedia page follow the structure : You could extract a list of link titles with something like : re.findall( r'\]+title="(.+?)"', html ) HTH, -Grobu- On 25/11/15 21:55, MRAB wrote: On 2015-11-25 20:42, ryguy7272 wrote: Hello experts. I'm looking at this
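
For readers who want to try the re.findall idea, here is a hedged sketch. It assumes the suggested pattern was along the lines of `<a[^>]+title="(.+?)"`; that exact pattern is an assumption, and the sample markup is invented.

```python
import re

# Invented sample markup: one list entry, one navigation link, one link
# without a title attribute.
html = (
    '<a href="/x" title="Accident">Accident</a> '
    '<a class="nav" href="/" title="Main Page">Home</a> '
    '<a href="/y">no title</a>'
)

# Assumed pattern: an opening <a, any attributes that do not cross ">",
# then the quoted title value captured non-greedily.
titles = re.findall(r'<a[^>]+title="(.+?)"', html)
print(titles)
```

Every anchor with a title attribute matches, including the navigation link, which is consistent with the later remark in the thread that the pattern picks up more than the intended list.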

Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread Chris Angelico
On Thu, Nov 26, 2015 at 9:04 AM, ryguy7272 wrote: > Ok, I guess that makes sense. So, I just tried the script below, and got > nothing... > > import requests > from bs4 import BeautifulSoup > > r = > requests.get("https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names") > soup = Beautiful

Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread ryguy7272
On Wednesday, November 25, 2015 at 3:42:21 PM UTC-5, ryguy7272 wrote: > Hello experts. I'm looking at this url: > https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names > > I'm trying to figure out how to list all 'a title' elements. For instance, I > see the following: > Accident > href=

Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread MRAB
On 2015-11-25 20:42, ryguy7272 wrote: Hello experts. I'm looking at this url: https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names I'm trying to figure out how to list all 'a title' elements. For instance, I see the following: Accident Ala-Lemu Alert Apocalypse Peaks So, I tried putti

Screen scraper to get all 'a title' elements

2015-11-25 Thread ryguy7272
Hello experts. I'm looking at this url: https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names I'm trying to figure out how to list all 'a title' elements. For instance, I see the following: Accident Ala-Lemu Alert Apocalypse Peaks So, I tried putting a script together to get 'title'. He