+ tutor

On Sun, Oct 29, 2017 at 6:57 AM, Kishore Kumar Alajangi
<akishorec...@gmail.com> wrote:
> Hi,
>
> I am facing an issue with listing specific URLs inside a web page:
>
> https://economictimes.indiatimes.com/archive.cms
>
> The page contains link URLs year- and month-wise, e.g.
> /archive/year-2001,month-1.cms
>
> I am able to list all the required URLs using the code below:
>
> from bs4 import BeautifulSoup
> import re, csv
> import urllib.request
> import scrapy
>
> req = urllib.request.Request('http://economictimes.indiatimes.com/archive.cms',
>                              headers={'User-Agent': 'Mozilla/5.0'})
>
> links = []
> totalPosts = []
> url = "http://economictimes.indiatimes.com"
> data = urllib.request.urlopen(req).read()
> page = BeautifulSoup(data, 'html.parser')
>
> # retrieve URLs starting with "/archive/"
> for link in page.findAll('a', href=re.compile('^/archive/')):
>     l = link.get('href')
>     links.append(url + l)
>
> with open("output.txt", "a") as f:
>     for post in links:
>         f.write(post + '\n')
>
> Sample result in the text file:
>
> http://economictimes.indiatimes.com/archive/year-2001,month-1.cms
> http://economictimes.indiatimes.com/archive/year-2001,month-2.cms
> http://economictimes.indiatimes.com/archive/year-2001,month-3.cms
> http://economictimes.indiatimes.com/archive/year-2001,month-4.cms
> http://economictimes.indiatimes.com/archive/year-2001,month-5.cms
> http://economictimes.indiatimes.com/archive/year-2001,month-6.cms
>
> I am storing this list of URLs in a text file. From the month URLs I want
> to retrieve the day URLs, which start with "/archivelist". I am using the
> code below, but I am not getting any result, even though inspect element
> shows that links starting with /archivelist are present, e.g.:
>
> <a href="/archivelist/year-2001,month-3,starttime=36951.cms"></a>
>
> Kindly help me find where I am going wrong.
> from bs4 import BeautifulSoup
> import re, csv
> import urllib.request
> import scrapy
>
> file = open("output.txt", "r")
>
> for i in file:
>     urls = urllib.request.Request(i, headers={'User-Agent': 'Mozilla/5.0'})
>     data1 = urllib.request.urlopen(urls).read()
>     page1 = BeautifulSoup(data1, 'html.parser')
>     for link1 in page1.findAll(href=re.compile('^/archivelist/')):
>         l1 = link1.get('href')
>         print(l1)
>
> Thanks,
> Kishore.

-- 
https://mail.python.org/mailman/listinfo/python-list
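One likely culprit in the second script: iterating over a file object yields each line *including* its trailing newline, so every URL handed to urllib.request.Request ends in '\n' (on recent Pythons urlopen rejects such URLs outright; on older versions it can produce a malformed request that returns an unexpected page). Stripping each line before building the request should help. A minimal sketch of the idea, using io.StringIO to stand in for output.txt (the sample URLs are illustrative):

```python
import io
import urllib.request

# Simulated contents of output.txt; a real run would use open("output.txt").
fake_file = io.StringIO(
    "http://economictimes.indiatimes.com/archive/year-2001,month-1.cms\n"
    "http://economictimes.indiatimes.com/archive/year-2001,month-2.cms\n"
)

for i in fake_file:
    # Each line still carries its trailing newline.
    print(repr(i))

    # Strip it before building the request object.
    clean = i.strip()
    req = urllib.request.Request(clean, headers={'User-Agent': 'Mozilla/5.0'})

    # The request now holds a well-formed URL with no embedded newline.
    print(req.full_url)
```

If the links still do not appear after stripping, it may be worth checking (e.g. by printing data1) whether the /archivelist links are present in the raw HTML at all, since inspect element shows the DOM after any JavaScript has run, and links added by script will not be visible to BeautifulSoup.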