Listing link urls

2017-10-29 Thread Kishore Kumar Alajangi
Hi,

I am facing an issue with listing specific URLs inside a web page:

https://economictimes.indiatimes.com/archive.cms

The page contains link URLs by year and month, for example:
/archive/year-2001,month-1.cms

I am able to list all the required URLs using the code below:

from bs4 import BeautifulSoup
import re
import urllib.request

req = urllib.request.Request('http://economictimes.indiatimes.com/archive.cms',
                             headers={'User-Agent': 'Mozilla/5.0'})

links = []
url = "http://economictimes.indiatimes.com"
data = urllib.request.urlopen(req).read()
page = BeautifulSoup(data, 'html.parser')

# retrieve URLs starting with "/archive/"
for link in page.findAll('a', href=re.compile('^/archive/')):
    l = link.get('href')
    links.append(url + l)

with open("output.txt", "a") as f:
    for post in links:
        f.write(post + '\n')

*sample result in text file:*

http://economictimes.indiatimes.com/archive/year-2001,month-1.cms
http://economictimes.indiatimes.com/archive/year-2001,month-2.cms
http://economictimes.indiatimes.com/archive/year-2001,month-3.cms
http://economictimes.indiatimes.com/archive/year-2001,month-4.cms
http://economictimes.indiatimes.com/archive/year-2001,month-5.cms
http://economictimes.indiatimes.com/archive/year-2001,month-6.cms


The list of URLs is stored in a text file. From the month URLs I want
to retrieve the day URLs starting with "/archivelist" using the code
below, but I am not getting any result. If I check with inspect
element, the URLs starting with /archivelist are present on the page.

Kindly help me see where I am going wrong.

from bs4 import BeautifulSoup
import re
import urllib.request

file = open("output.txt", "r")

for i in file:
    urls = urllib.request.Request(i, headers={'User-Agent': 'Mozilla/5.0'})
    data1 = urllib.request.urlopen(urls).read()
    page1 = BeautifulSoup(data1, 'html.parser')
    for link1 in page1.findAll(href=re.compile('^/archivelist/')):
        l1 = link1.get('href')
        print(l1)
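(A hedged aside, not verified against the live site: iterating over the open file yields each line with its trailing newline, so the URL handed to urllib.request.Request ends in '\n', which can make the request fail or return an error page. A minimal sketch of the same loop with the lines stripped first; the clean_urls helper is hypothetical, introduced here just for illustration:)

```python
def clean_urls(lines):
    """Strip each line (dropping the trailing newline) and skip blanks."""
    return [ln.strip() for ln in lines if ln.strip()]

if __name__ == "__main__":
    # Same scraping loop as above, but with the URLs cleaned first.
    import re
    import urllib.request
    from bs4 import BeautifulSoup

    with open("output.txt") as f:
        for url in clean_urls(f):
            req = urllib.request.Request(url,
                                         headers={'User-Agent': 'Mozilla/5.0'})
            data = urllib.request.urlopen(req).read()
            page = BeautifulSoup(data, 'html.parser')
            for a in page.findAll('a', href=re.compile('^/archivelist/')):
                print(a.get('href'))
```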


Thanks,

Kishore.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Listing link urls

2017-10-29 Thread Kishore Kumar Alajangi
+ tutor

On Sun, Oct 29, 2017 at 6:57 AM, Kishore Kumar Alajangi <
akishorec...@gmail.com> wrote:

> [quoted text snipped]