On Sun, 29 Jun 2014 10:32:00 -0700, subhabangalore wrote:

> I am opening multiple URLs with urllib.open, now one Url has huge html
> source files, like that each one has. As these files are read I am
> trying to concatenate them and put in one txt file as string.
> From this big txt file I am trying to take out each html file body of
> each URL and trying to write and store them

OK, let me clarify what I think you said.

First you concatenate all the web pages into a single file.
Then you extract all the page bodies from the single file and save them 
as separate files.

This seems a silly way to do things; why not just save each html 
body section as you receive it?

This sounds like it should be something as simple as:

from BeautifulSoup import BeautifulSoup
import requests

urlList = [
    "http://something/",
    "http://something/",
    "http://something/",
    # ...
    ]

n = 0
for url in urlList:
    # fetch the page and parse it
    r = requests.get( url )
    soup = BeautifulSoup( r.content )
    # pull out just the <body> element
    body = soup.find( "body" )
    # write it out to its own numbered file
    fp = open( "scraped/body{:0>5d}.htm".format( n ), "w" )
    fp.write( body.prettify() )
    fp.close()
    n += 1

will give you:

scraped/body00000.htm
scraped/body00001.htm
scraped/body00002.htm
........

for as many urls as you have in your url list. (Make sure the target 
directory exists first; a short sketch for creating it follows.)
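
If the directory doesn't exist yet, you can create it before the loop. 
A minimal sketch, assuming the same "scraped" directory name as above:

import os

# create the output directory if it isn't already there
if not os.path.isdir( "scraped" ):
    os.makedirs( "scraped" )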

-- 
Denis McMahon, denismfmcma...@gmail.com