Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread Chris Angelico
On Mon, May 2, 2016 at 4:47 PM, DFS  wrote:
> I'm not specifying a local web cache with either (wouldn't know how or where
> to look).  If you have Windows, you can try it.
> ---
> Option Explicit
> Dim xmlHTTP, fso, fOut, startTime, endTime, webpage, webfile,i
> webpage = "http://econpy.pythonanywhere.com/ex/001.html";
> webfile  = "D:\econpy001.html"
> startTime = Timer
> For i = 1 to 10
>  Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
>  xmlHTTP.Open "GET", webpage
>  xmlHTTP.Send
>  Set fso = CreateObject("Scripting.FileSystemObject")
>  Set fOut = fso.CreateTextFile(webfile, True)
>   fOut.WriteLine xmlHTTP.ResponseText
>  fOut.Close
>  Set fOut= Nothing
>  Set fso = Nothing
>  Set xmlHTTP = Nothing
> Next
> endTime = Timer
> wscript.echo "Finished VBScript in " & FormatNumber(endTime - startTime,3) &
> " seconds"
> ---

There's an easier way to test if there's caching happening. Just crank
the iterations up from 10 to 100 and see what happens to the times. If
your numbers are perfectly fair, they should be perfectly linear in
the iteration count; eg a 1.8 second ten-iteration loop should become
an 18 second hundred-iteration loop. Obviously they won't be exactly
that, but I would expect them to be reasonably close (eg 17-19
seconds, but not 2 seconds).

Then the next thing to test would be to create a deliberately-slow web
server, and connect to that. Put a two-second delay into it, to
simulate a distant or overloaded server, and see if your logs show the
correct result. Something like this:



import time
try:
    import http.server as BaseHTTPServer # Python 3
except ImportError:
    import BaseHTTPServer # Python 2

class SlowHTTP(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-type", "text/html")
        self.end_headers()
        self.wfile.write(b"Hello, ")
        time.sleep(2)  # deliberate delay to simulate a distant/overloaded server
        self.wfile.write(b"world!")

server = BaseHTTPServer.HTTPServer(("", 1234), SlowHTTP)
server.serve_forever()

---

Test that with a web browser or command-line downloader (go to
http://127.0.0.1:1234/), and make sure that (a) it produces "Hello,
world!", and (b) it takes two seconds. Then set your test scripts to
downloading that URL. (Be sure to set them back to low iteration
counts first!) If the times are true and fair, they should all come
out pretty much the same - ten iterations, twenty seconds. And since
all that's changed is the server, this will be an accurate
demonstration of what happens in the real world: network requests
aren't always fast. Incidentally, you can also watch the server's log
to see if it's getting the appropriate number of requests.

It may turn out that changing the web server actually materially
changes your numbers. Comment out the sleep call and try it again -
you might find that your numbers come closer together, because this
naive server doesn't send back 204 NOT MODIFIED responses or anything.
Again, though, this would prove that you're not actually measuring
language performance, because the tests are more dependent on the
server than the client.

Even if the files themselves aren't being cached, you might find that
DNS is. So if you truly want to eliminate variables, replace the name
in your URL with an IP address. It's another thing that might mess
with your timings, without actually being a language feature.
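
(A minimal sketch of that idea, reusing this thread's test URL: resolve the
name once up front so DNS never appears inside the timed loop.)

import socket

host = "econpy.pythonanywhere.com"
ip = socket.gethostbyname(host)  # one DNS lookup, outside the timing loop
url = "http://%s/ex/001.html" % ip
# Note: a virtual-hosted server may still need the original name, e.g.
# requests.get(url, headers={"Host": host})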

Networking has about four billion variables in it. You're messing with
one of the least significant: the programming language :)

ChrisA


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread DFS

On 5/2/2016 2:27 AM, Stephen Hansen wrote:

On Sun, May 1, 2016, at 10:59 PM, DFS wrote:

startTime = time.clock()
for i in range(loops):
    r = urllib2.urlopen(webpage)
    f = open(webfile, "w")
    f.write(r.read())
    f.close()
endTime = time.clock()
print "Finished urllib2 in %.2g seconds" % (endTime - startTime)


Yeah, on my system I get 1.8 seconds total out of this, amounting to 0.18s per loop.


You get 1.8 seconds total for the 10 loops?  That's less than half as 
fast as my results.  Surprising.




I'm again going back to the point of: it's fast enough. When comparing
two small numbers, "twice as slow" is meaningless.


Speed is always meaningful.

I know python is relatively slow, but it's a cool, concise, powerful 
language.  I'm extremely impressed by how tight the code can get.




You have an assumption you haven't answered, that downloading a 10 meg
file will be twice as slow as downloading this tiny file. You haven't
proven that at all.


True.  And it has been my assumption - though not with a 10MB file.



I suspect you have a constant overhead of X, and in this toy example,
that makes it seem twice as slow. But when downloading a file of size,
you'll have the same constant factor, at which point the difference is
irrelevant.


Good point.  Test below.



If you believe otherwise, demonstrate it.


http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=ga&wqhqn=2&qc=Atlanta&rg=30&qhqn=restaurant&sb=zipdisc&ap=2

It's a 58854 byte file when saved to disk (smaller file was 3546 bytes), 
so this is 16.6x larger.  So I would expect python to linearly run in 
16.6 * 0.88 = 14.6 seconds.


10 loops per run

1st run
$ python timeGetHTML.py
Finished urllib in 8.5 seconds
Finished urllib2 in 5.6 seconds
Finished requests in 7.8 seconds
Finished pycurl in 6.5 seconds

wait a couple minutes, then 2nd run
$ python timeGetHTML.py
Finished urllib in 5.6 seconds
Finished urllib2 in 5.7 seconds
Finished requests in 5.2 seconds
Finished pycurl in 6.4 seconds

It's a little more than 1/3 of my estimate - so good news.

(when I was doing these tests, some of the python results were 0.75 
seconds - way too fast, so I checked and no data was written to file, 
and I couldn't even open the webpage with a browser.  Looks like I had 
been temporarily blocked from the site.  After a couple minutes, I was 
able to access it again).


I noticed urllib and curl returned the html as is, but urllib2 and 
requests added enhancements that should make the data easier to parse. 
Based on speed and functionality and documentation, I believe I'll be 
using the requests HTTP library (I will actually be doing a small amount 
of web scraping).



VBScript
1st run: 7.70 seconds
2nd run: 5.38
3rd run: 7.71

So python matches or beats VBScript at this much larger file.  Kewl.




Re: Code Opinion - Enumerate

2016-05-02 Thread Sayth Renshaw
As a reference, here is a functional implementation of Conway's Game of Life.
http://programmablelife.blogspot.com.au/2012/08/conways-game-of-life-in-clojure.html

The author first does it in Clojure and then transliterates it to Python.

Just good for a different view.

Sayth


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread Stephen Hansen
On Mon, May 2, 2016, at 12:37 AM, DFS wrote:
> On 5/2/2016 2:27 AM, Stephen Hansen wrote:
> > I'm again going back to the point of: it's fast enough. When comparing
> > two small numbers, "twice as slow" is meaningless.
> 
> Speed is always meaningful.
> 
> I know python is relatively slow, but it's a cool, concise, powerful 
> language.  I'm extremely impressed by how tight the code can get.

I'm sorry, but no. Speed is not always meaningful. 

It's not even usually meaningful, because you can't quantify what "speed"
is. In context, you're claiming this is twice as slow (even though my
tests show dramatically better performance), but what details are
different?

You're ignoring the fact that Python might have a constant overhead --
meaning, for a 1k download, it might have X speed cost. For a 1meg
download, it might still have the exact same X cost.

Looking narrowly, that overhead looks like "twice as slow", but that's
not meaningful at all. Looking larger, that overhead is a pittance.

You aren't measuring that.

> > You have an assumption you haven't answered, that downloading a 10 meg
> > file will be twice as slow as downloading this tiny file. You haven't
> > proven that at all.
> 
> True.  And it has been my assumption - tho not with 10MB file.

And that assumption is completely invalid.

> I noticed urllib and curl returned the html as is, but urllib2 and 
> requests added enhancements that should make the data easier to parse. 
> Based on speed and functionality and documentation, I believe I'll be 
> using the requests HTTP library (I will actually be doing a small amount 
> of web scraping).

The requests library's added-value is ease-of-use, and its overhead is
likely tiny: so using it means you spend less effort making a thing
happen. I recommend you embrace this. 

> VBScript
> 1st run: 7.70 seconds
> 2nd run: 5.38
> 3rd run: 7.71
> 
> So python matches or beats VBScript at this much larger file.  Kewl.

This is what I'm talking about: Python might have a constant overhead,
but looking at larger operations, it's probably comparable. Not fast,
mind you. Python isn't the fastest language out there. But in real-world
work, it's usually fast enough.

-- 
Stephen Hansen
  m e @ i x o k a i . i o


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread Peter Otten
DFS wrote:

>> Is VB using a local web cache, and Python not?
> 
> I'm not specifying a local web cache with either (wouldn't know how or
> where to look).  If you have Windows, you can try it.

I don't have Windows, but if I'm to believe

http://stackoverflow.com/questions/5235464/how-to-make-microsoft-xmlhttprequest-honor-cache-control-directive

the page is indeed cached and you can disable caching with

> Option Explicit
> Dim xmlHTTP, fso, fOut, startTime, endTime, webpage, webfile,i
> webpage = "http://econpy.pythonanywhere.com/ex/001.html";
> webfile  = "D:\econpy001.html"
> startTime = Timer
> For i = 1 to 10
> Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
> xmlHTTP.Open "GET", webpage
  
  xmlHTTP.setRequestHeader "Cache-Control", "max-age=0"

> xmlHTTP.Send
> Set fso = CreateObject("Scripting.FileSystemObject")
> Set fOut = fso.CreateTextFile(webfile, True)
> fOut.WriteLine xmlHTTP.ResponseText
> fOut.Close
> Set fOut= Nothing
> Set fso = Nothing
> Set xmlHTTP = Nothing
> Next
> endTime = Timer
> wscript.echo "Finished VBScript in " & FormatNumber(endTime -
> startTime,3) & " seconds"
> ---
> save it to a .vbs file and run it like this:
> $cscript /nologo filename.vbs
> 




Re: You gotta love a 2-line python solution

2016-05-02 Thread BartC

On 02/05/2016 04:39, DFS wrote:

To save a webpage to a file:
-
1. import urllib
2. urllib.urlretrieve("http://econpy.pythonanywhere.com/ex/001.html", "D:\\file.html")
-

That's it!

Coming from VB/A background, some of the stuff you can do with python -
with ease - is amazing.


VBScript version
--
1. Option Explicit
2. Dim xmlHTTP, fso, fOut
3. Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
4. xmlHTTP.Open "GET", "http://econpy.pythonanywhere.com/ex/001.html"
5. xmlHTTP.Send
6. Set fso = CreateObject("Scripting.FileSystemObject")
7. Set fOut = fso.CreateTextFile("D:\file.html", True)
8.  fOut.WriteLine xmlHTTP.ResponseText
9. fOut.Close
10. Set fOut = Nothing
11. Set fso  = Nothing
12. Set xmlHTTP = Nothing
--

Technically, that VBS will run with just lines 3-9, but that's still 7
lines of code vs 2 for python.


It seems Python provides a higher level solution compared with VBS. 
Python presumably also has to do those Opens and Sends, but they are 
hidden away inside urllib.urlretrieve.


You can do the same with VB just by wrapping up these lines in a 
subroutine. As you would if this had to be executed in a dozen different 
places for example. Then you could just write:


getfile("http://econpy.pythonanywhere.com/ex/001.html";, "D:/file.html")

in VBS too. (The forward slash in the file name ought to work.)

(I don't know VBS; I assume it does /have/ subroutines? What I haven't 
factored in here is error handling which might yet require more coding 
in VBS compared with Python)


--
Bartc


loading multiple module with same name using importlib.machinery.SourceFileLoader

2016-05-02 Thread ulf . worsoe
I have observed this behaviour, for some reason only on OS X (and Python 
3.5.1): I use importlib.machinery.SourceFileLoader to load a long list of 
modules. The modules are not located in the loader path, and many of them have 
the same name, i.e. I would have:

m1 = importlib.machinery.SourceFileLoader("Module","path/to/m1/Module.py")
m2 = importlib.machinery.SourceFileLoader("Module","path/to/m2/Module.py")

Sometimes the modules will contain members from other modules with the same 
name, e.g. m1/module.py would define a function "m1func" that does not exist in 
m2/module.py, but the function would appear in m2, and examining 
m2.m1func.__code__.co_filename shows that it comes from m1. Members that are 
defined in both m1 and m2 are not overwritten, though.

Is this a bug in importlib.machinery.SourceFileLoader or are we in the Land of 
Undefined Behaviour here?
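
(A sketch of one possible workaround, not a verdict on whether it is a bug:
give each module a unique name so the two files can't collide in sys.modules.
The helper name below is made up for illustration.)

import importlib.util

def load_module_from_path(unique_name, path):
    # Key the spec/module by a unique name so that two files both called
    # "Module.py" get independent module objects.
    spec = importlib.util.spec_from_file_location(unique_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

m1 = load_module_from_path("m1.Module", "path/to/m1/Module.py")
m2 = load_module_from_path("m2.Module", "path/to/m2/Module.py")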


Re: You gotta love a 2-line python solution

2016-05-02 Thread Marko Rauhamaa
BartC :

> On 02/05/2016 04:39, DFS wrote:
>> 2. urllib.urlretrieve("http://econpy.pythonanywhere.com/ex/001.html", "D:\\file.html")
> [...]
>
> It seems Python provides a higher level solution compared with VBS.
> Python presumably also has to do those Opens and Sends, but they are
> hidden away inside urllib.urlretrieve.

Relevant questions include:

 * Is a solution available?

 * Is the solution well thought out?

Python does have a lot of great stuff available, which is nice.
Unfortunately, many of the handy facilities are lacking in the
well-thought-out department.

For example, the urlretrieve() function above blocks. You can't use it
with the asyncio or select modules. You are left with:

   <https://docs.python.org/3/library/asyncio-stream.html#get-http-headers>

Database facilities are notorious offenders. Also, json.load and
json.loads don't allow you to decode JSON in chunks.

If asyncio breaks through, I expect all blocking stdlib function calls
to be adapted for it over the coming years. I'm not overly fond of the
asyncio programming model, but it does sport two new killer features:

 * any blocking operation can be interrupted

 * events can be multiplexed
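
(For contrast, a minimal sketch of a cancellable fetch using the asyncio
stream API - Python 3.5+ syntax, plain HTTP only, with the test URL from
the other thread:)

import asyncio

async def fetch(host, path="/"):
    # Each await yields to the event loop: other tasks can run, and the
    # whole operation can be cancelled - unlike a blocking urlretrieve().
    reader, writer = await asyncio.open_connection(host, 80)
    writer.write("GET {} HTTP/1.0\r\nHost: {}\r\n\r\n".format(path, host).encode())
    body = await reader.read()
    writer.close()
    return body

loop = asyncio.get_event_loop()
print(loop.run_until_complete(fetch("econpy.pythonanywhere.com", "/ex/001.html"))[:200])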


Marko


Re: You gotta love a 2-line python solution

2016-05-02 Thread Steven D'Aprano
On Mon, 2 May 2016 08:12 pm, Marko Rauhamaa wrote:

> For example, the urlretrieve() function above blocks. You can't use it
> with the asyncio or select modules.


The urlretrieve function is one of the oldest functions in the std library.
It literally only exists because Guido was working on a computer somewhere,
found that he didn't have wget, and decided it would be faster to write his
own in Python than download and install wget.

And because this was very early in Python's history, the barrier to getting
into the std lib was much less, especially for stuff Guido wrote himself,
so there it is. These days, I doubt it would be included. It would probably
be a recipe in the docs.

Compared to a full-featured tool like wget or curl, urlretrieve is missing a
lot of stuff which is considered essential, like limiting/configuring the
rate, support for cookies and authentication, retrying on error, etc.



-- 
Steven



Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread Tim Chase
On 2016-05-02 00:06, DFS wrote:
> Then I tested them in loops - the VBScript is MUCH faster: 0.44 for
> 10 iterations, vs 0.88 for python.

In addition to the other debugging recommendations in sibling
threads, a couple other things to try:

1) use a local debugging proxy so that you can compare the headers to
see if anything stands out

2) in light of #1, can you confirm/deny whether one is using gzip
compression and the other isn't?
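
(One quick way to check #2 without a proxy - a sketch with requests, using
the small test page from this thread:)

import requests

r = requests.get("http://econpy.pythonanywhere.com/ex/001.html")
print(r.request.headers.get("Accept-Encoding"))  # what the client offered
print(r.headers.get("Content-Encoding"))         # 'gzip' if the server compressed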

-tkc






Re: Private message regarding: How to prevent the duplication of any value in a column within a CSV file (python)

2016-05-02 Thread Ian Kelly
On Mon, May 2, 2016 at 3:52 AM, Adam Davis  wrote:
> Hi Ian,
>
> I'm really struggling to implement a set into my code as I'm a beginner,
> it's taking me a while to grasp the idea of it. If I was to show you my code
> so you get an idea of my aim/function of the code, would you be able to help
> me at all?

Sure, although I'd recommend posting it to the list so that others
might also be able to help.


starting docker container messes up terminal settings

2016-05-02 Thread Larry Martell
I am starting a docker container from a subprocess.Popen and it works,
but when the script returns, the terminal settings of my shell are
messed up. Nothing is echoed and return doesn't cause a newline. I can
fix this with 'tset' in the terminal, but I don't want to require
that. Has anyone here worked with docker and had seen and solved this
issue?


RE: starting docker container messes up terminal settings

2016-05-02 Thread Joaquin Alzola
>I am starting a docker container from a subprocess.Popen and it works, but 
>when the script returns, the terminal settings of my shell are messed up. 
>Nothing is echoed and return doesn't cause a newline. I can fix this with 
>'tset' in the terminal, but I don't want to require that. Has anyone here 
>worked with docker and had seen and solved this issue?

It is good to post the part of the code you think is causing the error (the 
Popen subprocess call).


Re: What should Python apps do when asked to show help?

2016-05-02 Thread Grant Edwards
On 2016-05-01, c...@zip.com.au  wrote:

>>Didn't the OP specify that he was writing a command-line utility for
>>Linux/Unix?
>>
>>Discussing command line operation for Windows or OS-X seems rather
>>pointless.
>
> OS-X _is_ UNIX. I spent almost all my time on this Mac in terminals. It is a 
> very nice to use UNIX in many regards.

I include what you're doing under the category "Unix".  When I talk
about "OS X", I mean what my 84 year old mother is using.  I assumed
everybody thought that way.  ;)

-- 
Grant Edwards   grant.b.edwardsYow! If I am elected no one
  at   will ever have to do their
  gmail.comlaundry again!



Re: starting docker container messes up terminal settings

2016-05-02 Thread Larry Martell
On Mon, May 2, 2016 at 10:08 AM, Joaquin Alzola
 wrote:
>>I am starting a docker container from a subprocess.Popen and it works, but 
>>when the script returns, the terminal settings of my shell are messed up. 
>>Nothing is echoed and return doesn't cause a newline. I can fix this with 
>>'tset' in the terminal, but I don't want to require that. Has anyone here 
>>worked with docker and had seen and solved this issue?
>
> It is good to put part of the code you think is causing the error (Popen 
> subprocess)

cmd = ['sudo',
   'docker',
   'run',
   '-t',
   '-i',
   'elucidbio/capdata:v2',
   'bash'
]
p = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                     stderr=subprocess.STDOUT)
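
(A possible workaround rather than a confirmed fix: "docker run -t" hands
your tty to the container, which can leave it in raw mode. A sketch that
snapshots and restores the terminal attributes around the call - roughly
what tset does:)

import subprocess
import sys
import termios

fd = sys.stdin.fileno()
saved = termios.tcgetattr(fd)  # snapshot current terminal modes
try:
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT)
    out, _ = p.communicate()
finally:
    termios.tcsetattr(fd, termios.TCSADRAIN, saved)  # restore, like tset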


Re: You gotta love a 2-line python solution

2016-05-02 Thread DFS

On 5/2/2016 5:26 AM, BartC wrote:

On 02/05/2016 04:39, DFS wrote:

To save a webpage to a file:
-
1. import urllib
2. urllib.urlretrieve("http://econpy.pythonanywhere.com/ex/001.html", "D:\\file.html")
-

That's it!

Coming from VB/A background, some of the stuff you can do with python -
with ease - is amazing.


VBScript version
--
1. Option Explicit
2. Dim xmlHTTP, fso, fOut
3. Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
4. xmlHTTP.Open "GET", "http://econpy.pythonanywhere.com/ex/001.html"
5. xmlHTTP.Send
6. Set fso = CreateObject("Scripting.FileSystemObject")
7. Set fOut = fso.CreateTextFile("D:\file.html", True)
8.  fOut.WriteLine xmlHTTP.ResponseText
9. fOut.Close
10. Set fOut = Nothing
11. Set fso  = Nothing
12. Set xmlHTTP = Nothing
--

Technically, that VBS will run with just lines 3-9, but that's still 7
lines of code vs 2 for python.


It seems Python provides a higher level solution compared with VBS.
Python presumably also has to do those Opens and Sends, but they are
hidden away inside urllib.urlretrieve.

You can do the same with VB just by wrapping up these lines in a
subroutine. As you would if this had to be executed in a dozen different
places for example. Then you could just write:

getfile("http://econpy.pythonanywhere.com/ex/001.html";, "D:/file.html")

in VBS too. (The forward slash in the file name ought to work.)



Of course.  Taken to its extreme, I could eventually replace you with 
one line of code :)


But python does it for me.  That would save me 8 lines...




(I don't know VBS; I assume it does /have/ subroutines? What I haven't
factored in here is error handling which might yet require more coding
in VBS compared with Python)


Yeah, VBS has subs and functions.  And strange, limited error handling. 
And a single data type, called Variant.  But it's installed with Windows 
so it's easy to get going with.





Re: You gotta love a 2-line python solution

2016-05-02 Thread Larry Martell
On Mon, May 2, 2016 at 11:15 AM, DFS  wrote:
> Of course.  Taken to its extreme, I could eventually replace you with one
> line of code :)

That reminds me of something I heard many years ago.

Every non-trivial program can be simplified by at least one line of code.
Every non trivial program has at least one bug.

Therefore every non-trivial program can be reduced to one line of code
with a bug.


Re: Python3 html scraper that supports javascript

2016-05-02 Thread zljubisic


I tried to use the following code:

from bs4 import BeautifulSoup
from selenium import webdriver

PHANTOMJS_PATH = 
'C:\\Users\\Zoran\\Downloads\\Obrisi\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe'

url = 
'https://hrti.hrt.hr/#/video/show/2203605/trebizat-prica-o-jednoj-vodi-i-jednom-narodu-dokumentarni-film'

browser = webdriver.PhantomJS(PHANTOMJS_PATH)
browser.get(url)

soup = BeautifulSoup(browser.page_source, "html.parser")

x = soup.prettify()

print(x)


When I print x variable, I would expect to see something like this:
<video src="mediasource:https://hrti.hrt.hr/2e9e9c45-aa23-4d08-9055-cd2d7f2c4d58" id="vjs_video_3_html5_api" class="vjs-tech" preload="none"><source type="application/x-mpegURL" src="https://prd-hrt.spectar.tv/player/get_smil/id/2203605/video_id/2203605/token/Cny6ga5VEQSJ2uZaD2G8pg/token_expiration/1462043309/asset_type/Movie/playlist_template/nginx/channel_name/trebiat__pria_o_jednoj_vodi_i_jednom_narodu_dokumentarni_film/playlist.m3u8?foo=bar"></video>


but I can't come to that point.

Regards.


Re: You gotta love a 2-line python solution

2016-05-02 Thread Manolo Martínez
On 05/02/16 at 11:24am, Larry Martell wrote:
> That reminds me of something I heard many years ago.
> 
> Every non-trivial program can be simplified by at least one line of code.
> Every non trivial program has at least one bug.
> 
> Therefore every non-trivial program can be reduced to one line of code
> with a bug.

Well, not really. Every non-trivial program can be reduced to one line
of code, but then the resulting program is not non-trivial (as it cannot
be further reduced), and therefore there are no guarantees that it will
have a bug.

M


Best way to clean up list items?

2016-05-02 Thread DFS

Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']
Want: list1 = ['Item 1','Item 2']


I wrote this, which works fine, but maybe it can be tidier?

1. list2 = [t.replace("\r\n", "") for t in list1]   #remove \r\n
2. list3 = [t.strip(' ') for t in list2]#trim whitespace
3. list1  = filter(None, list3) #remove empty items


After each step:

1. list2 = ['   Item 1  ','  Item 2  ','  ']   #remove \r\n
2. list3 = ['Item 1','Item 2','']  #trim whitespace
3. list1 = ['Item 1','Item 2'] #remove empty items


Thanks!


Re: Python3 html scraper that supports javascript

2016-05-02 Thread DFS

On 5/2/2016 11:33 AM, zljubi...@gmail.com wrote:



I tried to use the following code:

from bs4 import BeautifulSoup
from selenium import webdriver

PHANTOMJS_PATH = 
'C:\\Users\\Zoran\\Downloads\\Obrisi\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe'

url = 
'https://hrti.hrt.hr/#/video/show/2203605/trebizat-prica-o-jednoj-vodi-i-jednom-narodu-dokumentarni-film'

browser = webdriver.PhantomJS(PHANTOMJS_PATH)
browser.get(url)

soup = BeautifulSoup(browser.page_source, "html.parser")

x = soup.prettify()

print(x)


When I print x variable, I would expect to see something like this:
<video src="mediasource:https://hrti.hrt.hr/2e9e9c45-aa23-4d08-9055-cd2d7f2c4d58" id="vjs_video_3_html5_api" class="vjs-tech" preload="none"><source type="application/x-mpegURL" src="https://prd-hrt.spectar.tv/player/get_smil/id/2203605/video_id/2203605/token/Cny6ga5VEQSJ2uZaD2G8pg/token_expiration/1462043309/asset_type/Movie/playlist_template/nginx/channel_name/trebiat__pria_o_jednoj_vodi_i_jednom_narodu_dokumentarni_film/playlist.m3u8?foo=bar"></video>


but I can't come to that point.

Regards.



I was doing something similar recently.  Try this:

f = open(somefilename)
soup = BeautifulSoup.BeautifulSoup(f)
f.close()
print soup.prettify()




Re: Best way to clean up list items?

2016-05-02 Thread Jussi Piitulainen
DFS writes:

> Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']
> Want: list1 = ['Item 1','Item 2']
>
>
> I wrote this, which works fine, but maybe it can be tidier?
>
> 1. list2 = [t.replace("\r\n", "") for t in list1]   #remove \r\n
> 2. list3 = [t.strip(' ') for t in list2]#trim whitespace
> 3. list1  = filter(None, list3) #remove empty items
>
> After each step:
>
> 1. list2 = ['   Item 1  ','  Item 2  ','  ']   #remove \r\n
> 2. list3 = ['Item 1','Item 2','']  #trim whitespace
> 3. list1 = ['Item 1','Item 2'] #remove empty items

Try filter(None, (t.strip() for t in list1)). The default strip() removes
\r\n along with the spaces.

Funny-looking data you have.


Re: Best way to clean up list items?

2016-05-02 Thread justin walters
On May 2, 2016 10:03 AM, "Jussi Piitulainen" 
wrote:
>
> DFS writes:
>
> > Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']
> > Want: list1 = ['Item 1','Item 2']
> >
> >
> > I wrote this, which works fine, but maybe it can be tidier?
> >
> > 1. list2 = [t.replace("\r\n", "") for t in list1]   #remove \r\n
> > 2. list3 = [t.strip(' ') for t in list2]#trim whitespace
> > 3. list1  = filter(None, list3) #remove empty items
> >
> > After each step:
> >
> > 1. list2 = ['   Item 1  ','  Item 2  ','  ']   #remove \r\n
> > 2. list3 = ['Item 1','Item 2','']  #trim whitespace
> > 3. list1 = ['Item 1','Item 2'] #remove empty items

You could also try compiled regex to remove unwanted characters.

Then loop through the list and do a replace for each item.
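
(A minimal sketch of that approach, assuming the list1 data quoted above:)

import re

ws = re.compile(r'\s+')  # collapses \r\n, tabs and runs of spaces alike
list1 = ['\r\n   Item 1  ', '  Item 2  ', '\r\n  ']
cleaned = [ws.sub(' ', s).strip() for s in list1]
cleaned = [s for s in cleaned if s]  # drop items that were pure whitespace
print(cleaned)  # ['Item 1', 'Item 2']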


Re: Best way to clean up list items?

2016-05-02 Thread Stephen Hansen
On Mon, May 2, 2016, at 09:33 AM, DFS wrote:
> Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']

I'm curious how you got to this point, it seems like you can solve the
problem in how this is generated.

> Want: list1 = ['Item 1','Item 2']

That said:

list1 = [t.strip() for t in list1 if t and not t.isspace()]

-- 
Stephen Hansen
  m e @ i x o k a i . i o


Re: Best way to clean up list items?

2016-05-02 Thread Peter Otten
DFS wrote:

> Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']
> Want: list1 = ['Item 1','Item 2']
> 
> 
> I wrote this, which works fine, but maybe it can be tidier?
> 
> 1. list2 = [t.replace("\r\n", "") for t in list1]   #remove \r\n
> 2. list3 = [t.strip(' ') for t in list2]#trim whitespace
> 3. list1  = filter(None, list3) #remove empty items
> 
> 
> After each step:
> 
> 1. list2 = ['   Item 1  ','  Item 2  ','  ']   #remove \r\n
> 2. list3 = ['Item 1','Item 2','']  #trim whitespace
> 3. list1 = ['Item 1','Item 2'] #remove empty items
> 
> 
> Thanks!

s.strip() strips all whitespace, so you can combine steps 1 and 2:

>>> items = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']
>>> stripped = (s.strip() for s in items)

The (...) instead of [...] denote a generator expression, so the iteration 
has not started yet. The final step uses a list comprehension instead of 
filter():

>>> [s for s in stripped if s]
['Item 1', 'Item 2']

That way the same code works with both Python 2 and Python 3. Note that you 
can iterate over the generator expression only once; if you try it again 
you'll end empty-handed:

>>> [s for s in stripped if s]
[]

If you want to do it in one step here are two options that both involve some 
duplicate work:

>>> [s.strip() for s in items if s and not s.isspace()]
['Item 1', 'Item 2']
>>> [s.strip() for s in items if s.strip()]
['Item 1', 'Item 2']




Re: Python3 html scraper that supports javascript

2016-05-02 Thread Stephen Hansen
On Mon, May 2, 2016, at 08:33 AM, zljubi...@gmail.com wrote:
> I tried to use the following code:
> 
> from bs4 import BeautifulSoup
> from selenium import webdriver
> 
> PHANTOMJS_PATH =
> 'C:\\Users\\Zoran\\Downloads\\Obrisi\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe'
> 
> url =
> 'https://hrti.hrt.hr/#/video/show/2203605/trebizat-prica-o-jednoj-vodi-i-jednom-narodu-dokumentarni-film'
> 
> browser = webdriver.PhantomJS(PHANTOMJS_PATH)
> browser.get(url)
> 
> soup = BeautifulSoup(browser.page_source, "html.parser")
> 
> x = soup.prettify()
> 
> print(x)
> 
> When I print x variable, I would expect to see something like this:
> <video src="mediasource:https://hrti.hrt.hr/2e9e9c45-aa23-4d08-9055-cd2d7f2c4d58"
> id="vjs_video_3_html5_api" class="vjs-tech" preload="none"><source
> type="application/x-mpegURL"
> src="https://prd-hrt.spectar.tv/player/get_smil/id/2203605/video_id/2203605/token/Cny6ga5VEQSJ2uZaD2G8pg/token_expiration/1462043309/asset_type/Movie/playlist_template/nginx/channel_name/trebiat__pria_o_jednoj_vodi_i_jednom_narodu_dokumentarni_film/playlist.m3u8?foo=bar"></video>
> 
> 
> but I can't come to that point.

Why? As important as it is to show code, you need to show what actually
happens and what error message is produced.

-- 
Stephen Hansen
  m e @ i x o k a i . i o


Re: Best way to clean up list items?

2016-05-02 Thread DFS

On 5/2/2016 1:25 PM, Stephen Hansen wrote:

On Mon, May 2, 2016, at 09:33 AM, DFS wrote:

Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']


I'm curious how you got to this point, it seems like you can solve the
problem in how this is generated.



from lxml import html
import requests

webpage = "http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=TN&wqhqn=2&qc=Nashville&rg=30&qhqn=restaurant&sb=zipdisc&ap=2"


page  = requests.get(webpage)
tree  = html.fromstring(page.content)
addr1 = tree.xpath('//span[@class="text3"]/text()')
print 'Addresses: ', addr1


I'd prefer to get clean data in the first place, but I don't know a 
better way to extract it from the HTML.






Re: Best way to clean up list items?

2016-05-02 Thread DFS

On 5/2/2016 12:57 PM, Jussi Piitulainen wrote:

DFS writes:


Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']
Want: list1 = ['Item 1','Item 2']


I wrote this, which works fine, but maybe it can be tidier?

1. list2 = [t.replace("\r\n", "") for t in list1]   #remove \r\n
2. list3 = [t.strip(' ') for t in list2]#trim whitespace
3. list1  = filter(None, list3) #remove empty items

After each step:

1. list2 = ['   Item 1  ','  Item 2  ','  ']   #remove \r\n
2. list3 = ['Item 1','Item 2','']  #trim whitespace
3. list1 = ['Item 1','Item 2'] #remove empty items


Try filter(None, (t.strip() for t in list1)). The default strip() removes
\r\n along with the spaces.


Works and drops a line of code.  Thx.




Funny-looking data you have.


I know - sadly, it's actual data:


from lxml import html
import requests

webpage = "http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=TN&wqhqn=2&qc=Nashville&rg=30&qhqn=restaurant&sb=zipdisc&ap=2"


page  = requests.get(webpage)
tree  = html.fromstring(page.content)
addr1 = tree.xpath('//span[@class="text3"]/text()')
print 'Addresses: ', addr1


I couldn't figure out a better way to extract it from the HTML (maybe 
XML and DOM?)



Re: Best way to clean up list items?

2016-05-02 Thread Stephen Hansen
On Mon, May 2, 2016, at 11:09 AM, DFS wrote:
> I'd prefer to get clean data in the first place, but I don't know a 
> better way to extract it from the HTML.

Ah, right. I didn't know you were scraping HTML. Scraping HTML is rarely
clean so you have to do a lot of cleanup.

-- 
Stephen Hansen
  m e @ i x o k a i . i o


Re: Best way to clean up list items?

2016-05-02 Thread Jussi Piitulainen
DFS writes:

> On 5/2/2016 12:57 PM, Jussi Piitulainen wrote:
>> DFS writes:
>>
>>> Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']
>>> Want: list1 = ['Item 1','Item 2']

. .

>> Funny-looking data you have.
>
> I know - sadly, it's actual data:
>
> 
> from lxml import html
> import requests
>
> webpage =
> "http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=TN&wqhqn=2&qc=Nashville&rg=30&qhqn=restaurant&sb=zipdisc&ap=2";
>
> page  = requests.get(webpage)
> tree  = html.fromstring(page.content)
> addr1 = tree.xpath('//span[@class="text3"]/text()')
> print 'Addresses: ', addr1
> 
>
> I couldn't figure out a better way to extract it from the HTML (maybe
> XML and DOM?)

I should have guessed :) But now I'm a bit worried about those spaces
inside your items. Can it happen that item text is split into strings in
the middle? Then the above sanitation does the wrong thing.

If someone has the right solution, I'm watching, too.
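
(One hedged possibility, reusing the webpage variable from the snippet
above: select the span *elements* rather than their text nodes, then
normalize each element's text_content(), which joins all descendant text
even when it is split across child nodes:)

from lxml import html
import requests

page = requests.get(webpage)
tree = html.fromstring(page.content)
addrs = [' '.join(span.text_content().split())
         for span in tree.xpath('//span[@class="text3"]')]
addrs = [a for a in addrs if a]  # drop spans that held only whitespace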


Re: Best way to clean up list items?

2016-05-02 Thread DFS

On 5/2/2016 2:27 PM, Jussi Piitulainen wrote:

DFS writes:


On 5/2/2016 12:57 PM, Jussi Piitulainen wrote:

DFS writes:


Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']
Want: list1 = ['Item 1','Item 2']


. .


Funny-looking data you have.


I know - sadly, it's actual data:


from lxml import html
import requests

webpage =
"http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=TN&wqhqn=2&qc=Nashville&rg=30&qhqn=restaurant&sb=zipdisc&ap=2";

page  = requests.get(webpage)
tree  = html.fromstring(page.content)
addr1 = tree.xpath('//span[@class="text3"]/text()')
print 'Addresses: ', addr1


I couldn't figure out a better way to extract it from the HTML (maybe
XML and DOM?)


I should have guessed :) But now I'm a bit worried about those spaces
inside your items. Can it happen that item text is split into strings in
the middle?


Meaning split by me, or comes 'malformed' from the data source?



Then the above sanitation does the wrong thing.

If someone has the right solution, I'm watching, too.



Here's the raw data as stored in the tree:

---
1st page

['\r\n', '\r\n1918 W End 
Ave, Nashville, TN 37203', '\r\n
  ', '\r\n1806 Hayes St, Nashville, 
TN 37203', '\r\n', '\r\n 
1701 Broadway, Nashville, TN 37203', '\r\n', '\r\n
209 10th Ave S, Nashville, TN 37203', '\r\n 
   ', '\r\n907 20th Ave S, Nashville, TN 
37212', '\r\n', '\r\n911 
20th Ave S, Nashville, TN 37212', '\r\n', '\r\n 
  1722 W End Ave, Nashville, TN 37203', '\r\n 
 ', '\r\n1905 Hayes St, 
Nashville, TN 37203', '\r\n
  ', '\r\n2000 W End Ave, 
Nashville, TN 37203']


---

Next page

['\r\n', '\r\n120 19th 
Ave N, Nashville, TN 37203', '\r\n
  ', '\r\n1719 W End Ave Ste 101, 
Nashville, TN 37203', '\r\n
  ', '\r\n1922 W End Ave, Nashville, TN 
37203', '\r\n', '\r\n
  909 20th Ave S, Nashville, TN 37212', '\r\n 
 ', '\r\n
  1807 Church St, Nashville, TN 37203', '\r\n 
 ', '\r\n1721 Church St, Nashville, TN 37203', 
'\r\n', '\r\n718 
Division St, Nashville, TN 37203', '\r\n', '\r\n 
   907 12th Ave S, Nashville, TN 37203', '\r\n 
  ', '\r\n204 21st Ave S, 
Nashville, TN 37203', '\r\n
  ', '\r\n1811 Division St, Nashville, 
TN 37203', '\r\n', '\r\n 
903 Gleaves St, Nashville, TN 37203', '\r\n', '\r\n
1720 W End Ave Ste 530, Nashville, TN 37203', '\r\n 
   ', '\r\n
1200 Division St Ste 100-A, Nashville, TN 37203', '\r\n 
   ', '\r\n
422 7th Ave S, Nashville, TN 37203', '\r\n', 
'\r\n605 8th Ave S, Nashville, TN 37203']


and so on
---

I've checked a couple hundred addresses visually, and so far I've only 
seen 2 formats:


1. '\r\n'
2. '\r\n   address  '




Re: Python3 html scraper that supports javascript

2016-05-02 Thread zljubisic

> Why? As important as it is to show code, you need to show what actually
> happens and what error message is produced.

If you run the code you will see that the html I got doesn't have a link to the 
flash video. I should somehow do something (press the play video button maybe) in 
order to get html with a reference to the video file on this page.
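
(One hedged sketch of that idea - the .vjs-big-play-button selector is a
guess based on the video.js class names in the expected output, not
something verified against this site:)

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.PhantomJS(PHANTOMJS_PATH)  # as defined in the first post
browser.get(url)                               # as defined in the first post
wait = WebDriverWait(browser, 30)
# Click the player's big play button, then wait for the <source> element
# carrying the stream URL to appear in the DOM.
play = wait.until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, ".vjs-big-play-button")))
play.click()
source = wait.until(EC.presence_of_element_located((By.TAG_NAME, "source")))
print(source.get_attribute("src"))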

Regards


Need help understanding list structure

2016-05-02 Thread moa47401
I've been using an old text parsing library and have been able to accomplish 
most of what I wanted to do. But I don't understand the list structure it uses 
well enough to build additional methods.

If I print the list, it has thousands of elements within its brackets separated 
by commas as I would expect. But the elements appear to be memory pointers not 
the actual text. Here's an example:

[<gedcom.Element object at 0x...>, <gedcom.Element object at 0x...>, ...]

If I iterate over the list, I do get the actual text of each element and am 
able to use it.

Also, if I iterate over the list and place each element in a new list using 
append, then each element in the new list is the text I expect not memory 
pointers.

But... if I copy the old list to a new list using 

new = old[:] 
or 
new = list(old)

the new list is exactly like the original with memory pointers.

Can someone help me understand why or under what circumstances a list shows 
pointers instead of the text data?





Re: Need help understanding list structure

2016-05-02 Thread Erik

On 02/05/16 22:30, moa47...@gmail.com wrote:

Can someone help me understand why or under what circumstances a list
shows pointers instead of the text data?


When Python's "print" statement/function is invoked, it will print the 
textual representation of the object according to its class's __str__ or 
__repr__ method. That is, the print function prints out whatever text 
the class says it should.

For classes which don't implement a __str__ or __repr__ method, then
the text "" is used - where CLASS is the class
name and ADDRESS is the "memory pointer".

> If I iterate over the list, I do get the actual text of each element
> and am able to use it.
>
> Also, if I iterate over the list and place each element in a new list
> using append, then each element in the new list is the text I expect
> not memory pointers.

Look at the __iter__ method of the class of the object you are iterating 
over. I suspect that it returns string objects, not the objects that are 
in the list itself.


String objects have a __str__ or __repr__ method that represents them as 
the text, so that is what 'print' will output.
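
(A small sketch of the difference:)

class Plain(object):
    pass

class Described(object):
    def __repr__(self):
        return "Described('some text')"

print([Plain(), Described()])
# [<__main__.Plain object at 0x...>, Described('some text')]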


Hope that helps, E.


Re: Need help understanding list structure

2016-05-02 Thread moa47401

> When Python's "print" statement/function is invoked, it will print the 
> textual representation of the object according to its class's __str__ or
> __repr__ method. That is, the print function prints out whatever text
> the class says it should.
> 
> For classes which don't implement a __str__ or __repr__ method, then
> the text "" is used - where CLASS is the class
> name and ADDRESS is the "memory pointer".
> 
>  > If I iterate over the list, I do get the actual text of each element
>  > and am able to use it.
>  >
>  > Also, if I iterate over the list and place each element in a new list
>  > using append, then each element in the new list is the text I expect
>  > not memory pointers.
> 
> Look at the __iter__ method of the class of the object you are iterating 
> over. I suspect that it returns string objects, not the objects that are 
> in the list itself.
> 
> String objects have a __str__ or __repr__ method that represents them as 
> the text, so that is what 'print' will output.
> 
> Hope that helps, E.

Yes, that does help. You're right. The author of the library I'm using didn't 
implement either a __str__ or __repr__ method. Am I correct in assuming that 
parsing a large text file would be quicker returning pointers instead of 
strings? I've never run into this before.




Re: What should Python apps do when asked to show help?

2016-05-02 Thread cs

On 02May2016 14:07, Grant Edwards  wrote:

On 2016-05-01, c...@zip.com.au  wrote:


Didn't the OP specify that he was writing a command-line utility for
Linux/Unix?

Discussing command line operation for Windows or OS-X seems rather
pointless.


OS-X _is_ UNIX. I spent almost all my time on this Mac in terminals. It is a
very nice to use UNIX in many regards.


I include what you're doing under the category "Unix".  When I talk
about "OS X", I mean what my 84 year old mother is using.  I assumed
everybody thought that way.  ;)


Weird. My 79 year old mother uses "Apple". I can only presume there's no "OS X" 
for her.


Cheers,
Cameron Simpson 


Re: Need help understanding list structure

2016-05-02 Thread Michael Torrie
On 05/02/2016 04:33 PM, moa47...@gmail.com wrote:
> Yes, that does help. You're right. The author of the library I'm
> using didn't implement either a __str__ or __repr__ method. Am I
> correct in assuming that parsing a large text file would be quicker
> returning pointers instead of strings? I've never run into this
> before.

I'm not sure what you mean by "returning pointers." The list isn't
returning pointers. It's a list of *objects*.  To be specific, a list of
gedcom.Element objects, though they could be anything, including numbers
or strings.  If you refer to the source code where the Element class is
defined you can see what these objects contain. I suspect they contain a
lot more information than simply text.

Lists of objects is a common idiom in Python.  As you've discovered, if
you shallow copy a list, the new list will contain the exact same
objects.  In many cases, this does not matter.  For example a list of
numbers, which are immutable or unchangeable objects.  It doesn't matter
that the instances are shared, since the instances themselves will never
change.  If the objects are mutable, as they are in your case, a shallow
copy may not always be what you want.
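
(A small sketch of the difference, with a made-up Box class standing in for
gedcom.Element:)

import copy

class Box(object):
    def __init__(self, value):
        self.value = value

old = [Box(1), Box(2)]
new = old[:]               # shallow copy: both lists hold the same Box objects
new[0].value = 99
print(old[0].value)        # 99 - the mutation is visible through both lists

deep = copy.deepcopy(old)  # deep copy: independent Box objects
deep[0].value = 7
print(old[0].value)        # still 99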

As to your question.  A list never shows "pointers" as you say.  A list
always contains objects, and if you simply "print" the list, it will try
to show a representation of the list, using the objects' repr dunder
methods.  Some classes I have used have their repr methods print out
what the constructor would look like, if you were to construct the
object yourself.  This is very useful.  If I recall, this is what
BeautifulSoup objects do, which is incredibly useful.

In your case, as Erik said, the objects you are dealing with don't
provide repr dunder methods, so Python just lets you know they are
objects of a certain class, and what their ids are, which is helpful if
you're trying to determine if two objects are the same object.  These
are not "pointers" in the sense you're talking.  You'll get text if the
object prints text for you. This is true of any object you might store
in the list.

I hope this helps a bit.  Exploring from the interactive prompt as you
are doing is very useful, once you understand what it's saying to you.


Re: Need help understanding list structure

2016-05-02 Thread Ben Finney
moa47...@gmail.com writes:

> Am I correct in assuming that parsing a large text file would be
> quicker returning pointers instead of strings?

What do you mean by “return a pointer”? Python doesn't have pointers.

In the Python language, a container type (such as ‘set’, ‘list’, ‘dict’,
etc.) contains the objects directly. There are no “pointers” there; by
accessing the items of a container, you access the items directly.


What do you mean by “would be quicker”? I am concerned you are seeking
speed of the program at the expense of understandability and clarity of
the code.

Instead, you should be writing clear, maintainable code.

*Only if* the clear, maintainable code you write then actually ends up
being too slow, should you then worry about what parts are quick or slow
by *measuring* the specific parts of code to discover what is actually
occupying the time.
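
(For instance, the stdlib timeit module measures just the snippet under
suspicion, in isolation, repeated enough times to smooth out noise:)

import timeit

print(timeit.timeit("'-'.join(str(n) for n in range(100))", number=10000))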

-- 
 \ “All television is educational television. The question is: |
  `\   what is it teaching?” —Nicholas Johnson |
_o__)  |
Ben Finney



Re: You gotta love a 2-line python solution

2016-05-02 Thread jfong
DFS at 2016/5/2 UTC+8 11:39:33AM wrote:
> To save a webpage to a file:
> -
> 1. import urllib
> 2. urllib.urlretrieve("http://econpy.pythonanywhere.com/ex/001.html", "D:\\file.html")
> -
> 
> That's it!

Why my system can't do it?

Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 19:28:18) [MSC v.1600 32 bit (In
tel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib import urlretrieve
Traceback (most recent call last):
  File "", line 1, in 
ImportError: cannot import name 'urlretrieve'



Re: You gotta love a 2-line python solution

2016-05-02 Thread DFS

On 5/2/2016 8:45 PM, jf...@ms4.hinet.net wrote:

DFS at 2016/5/2 UTC+8 11:39:33AM wrote:

To save a webpage to a file:
-
1. import urllib
2. urllib.urlretrieve("http://econpy.pythonanywhere.com/ex/001.html", "D:\\file.html")
-

That's it!


Why my system can't do it?

Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 19:28:18) [MSC v.1600 32 bit (In
tel)] on win32
Type "help", "copyright", "credits" or "license" for more information.

from urllib import urlretrieve

Traceback (most recent call last):
  File "", line 1, in 
ImportError: cannot import name 'urlretrieve'



try

from urllib.request import urlretrieve

http://stackoverflow.com/questions/21171718/urllib-urlretrieve-file-python-3-3


I'm running python 2.7.11 (32-bit)


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread DFS

On 5/2/2016 4:42 AM, Peter Otten wrote:

DFS wrote:


Is VB using a local web cache, and Python not?


I'm not specifying a local web cache with either (wouldn't know how or
where to look).  If you have Windows, you can try it.


I don't have Windows, but if I'm to believe

http://stackoverflow.com/questions/5235464/how-to-make-microsoft-xmlhttprequest-honor-cache-control-directive

the page is indeed cached and you can disable caching with


Option Explicit
Dim xmlHTTP, fso, fOut, startTime, endTime, webpage, webfile,i
webpage = "http://econpy.pythonanywhere.com/ex/001.html";
webfile  = "D:\econpy001.html"
startTime = Timer
For i = 1 to 10
Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
xmlHTTP.Open "GET", webpage


  xmlHTTP.setRequestHeader "Cache-Control", "max-age=0"



Tried that, and from later on that stackoverflow page:

xmlHTTP.setRequestHeader "Cache-Control", "private"

Neither made a difference.  In fact, I saw faster times than ever - as 
low as 0.41 for 10 loops.



Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread DFS

On 5/2/2016 3:19 AM, Chris Angelico wrote:


There's an easier way to test if there's caching happening. Just crank
the iterations up from 10 to 100 and see what happens to the times. If
your numbers are perfectly fair, they should be perfectly linear in
the iteration count; eg a 1.8 second ten-iteration loop should become
an 18 second hundred-iteration loop. Obviously they won't be exactly
that, but I would expect them to be reasonably close (eg 17-19
seconds, but not 2 seconds).


100 loops
Finished VBScript in 3.953 seconds
Finished VBScript in 3.608 seconds
Finished VBScript in 3.610 seconds

Bit of a per-loop speedup going from 10 to 100.



Then the next thing to test would be to create a deliberately-slow web
server, and connect to that. Put a two-second delay into it, to
simulate a distant or overloaded server, and see if your logs show the
correct result. Something like this:



import time
try:
    import http.server as BaseHTTPServer # Python 3
except ImportError:
    import BaseHTTPServer # Python 2

class SlowHTTP(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-type", "text/html")
        self.end_headers()
        self.wfile.write(b"Hello, ")
        time.sleep(2)  # deliberate delay to simulate a distant/overloaded server
        self.wfile.write(b"world!")

server = BaseHTTPServer.HTTPServer(("", 1234), SlowHTTP)
server.serve_forever()

---

Test that with a web browser or command-line downloader (go to
http://127.0.0.1:1234/), and make sure that (a) it produces "Hello,
world!", and (b) it takes two seconds. Then set your test scripts to
downloading that URL. (Be sure to set them back to low iteration
counts first!) If the times are true and fair, they should all come
out pretty much the same - ten iterations, twenty seconds. And since
all that's changed is the server, this will be an accurate
demonstration of what happens in the real world: network requests
aren't always fast. Incidentally, you can also watch the server's log
to see if it's getting the appropriate number of requests.

It may turn out that changing the web server actually materially
changes your numbers. Comment out the sleep call and try it again -
you might find that your numbers come closer together, because this
naive server doesn't send back 204 NOT MODIFIED responses or anything.
Again, though, this would prove that you're not actually measuring
language performance, because the tests are more dependent on the
server than the client.

Even if the files themselves aren't being cached, you might find that
DNS is. So if you truly want to eliminate variables, replace the name
in your URL with an IP address. It's another thing that might mess
with your timings, without actually being a language feature.

Networking has about four billion variables in it. You're messing with
one of the least significant: the programming language :)

ChrisA



Thanks for the good feedback.




Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread Chris Angelico
On Tue, May 3, 2016 at 11:51 AM, DFS  wrote:
> On 5/2/2016 3:19 AM, Chris Angelico wrote:
>
>> There's an easier way to test if there's caching happening. Just crank
>> the iterations up from 10 to 100 and see what happens to the times. If
>> your numbers are perfectly fair, they should be perfectly linear in
>> the iteration count; eg a 1.8 second ten-iteration loop should become
>> an 18 second hundred-iteration loop. Obviously they won't be exactly
>> that, but I would expect them to be reasonably close (eg 17-19
>> seconds, but not 2 seconds).
>
>
> 100 loops
> Finished VBScript in 3.953 seconds
> Finished VBScript in 3.608 seconds
> Finished VBScript in 3.610 seconds
>
> Bit of a per-loop speedup going from 10 to 100.

How many seconds was it for 10 loops?

ChrisA


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread DFS

On 5/2/2016 10:00 PM, Chris Angelico wrote:

On Tue, May 3, 2016 at 11:51 AM, DFS  wrote:

On 5/2/2016 3:19 AM, Chris Angelico wrote:


There's an easier way to test if there's caching happening. Just crank
the iterations up from 10 to 100 and see what happens to the times. If
your numbers are perfectly fair, they should be perfectly linear in
the iteration count; eg a 1.8 second ten-iteration loop should become
an 18 second hundred-iteration loop. Obviously they won't be exactly
that, but I would expect them to be reasonably close (eg 17-19
seconds, but not 2 seconds).



100 loops
Finished VBScript in 3.953 seconds
Finished VBScript in 3.608 seconds
Finished VBScript in 3.610 seconds

Bit of a per-loop speedup going from 10 to 100.


How many seconds was it for 10 loops?

ChrisA


~0.44




Re: You gotta love a 2-line python solution

2016-05-02 Thread jfong
DFS at 2016/5/3 9:12:24AM wrote:
> try
> 
> from urllib.request import urlretrieve
> 
> http://stackoverflow.com/questions/21171718/urllib-urlretrieve-file-python-3-3
> 
> 
> I'm running python 2.7.11 (32-bit)

Alright, it works...someway.

I try to get a zip file. It works, the file can be unzipped correctly.

>>> from urllib.request import urlretrieve
>>> urlretrieve("http://www.caprilion.com.tw/fed.zip", "d:\\temp\\temp.zip")
('d:\\temp\\temp.zip', <http.client.HTTPMessage object at 0x...>)
>>>

But when I try to get this forum page, it does get a html file but can't be 
viewed normally.

>>> urlretrieve("https://groups.google.com/forum/#!topic/comp.lang.python/jFl3GJbmR7A", "d:\\temp\\temp.html")
('d:\\temp\\temp.html', <http.client.HTTPMessage object at 0x...>)
>>>

I suppose the html is a much more complex situation where more processing needs 
to be done before it can be opened by a web browser:-)



Re: You gotta love a 2-line python solution

2016-05-02 Thread Stephen Hansen
On Mon, May 2, 2016, at 08:27 PM, jf...@ms4.hinet.net wrote:
> But when I try to get this forum page, it does get a html file but can't
> be viewed normally.

What does that mean?

-- 
Stephen Hansen
  m e @ i x o k a i . i o


Re: You gotta love a 2-line python solution

2016-05-02 Thread DFS

On 5/2/2016 11:27 PM, jf...@ms4.hinet.net wrote:

DFS at 2016/5/3 9:12:24AM wrote:

try

from urllib.request import urlretrieve

http://stackoverflow.com/questions/21171718/urllib-urlretrieve-file-python-3-3


I'm running python 2.7.11 (32-bit)


Alright, it works...someway.

I try to get a zip file. It works, the file can be unzipped correctly.


from urllib.request import urlretrieve
urlretrieve("http://www.caprilion.com.tw/fed.zip";, "d:\\temp\\temp.zip")

('d:\\temp\\temp.zip', )




But when I try to get this forum page, it does get a html file but can't be 
viewed normally.


urlretrieve("https://groups.google.com/forum/#!topic/comp.lang.python/jFl3GJ

bmR7A", "d:\\temp\\temp.html")
('d:\\temp\\temp.html', )




I suppose the html is a much more complex situation where more processing needs 
to be done before it can be opened by a web browser:-)



Who knows what Google has done... it won't open in Opera.  The tab title 
shows up, but after 20-30 seconds the screen just stays blank and the 
cursor quits loading.


It's a mess - try running it thru BeautifulSoup.prettify() and it looks 
better.



import BeautifulSoup
import urllib
webfile = "D:\\afile.html"
urllib.urlretrieve("https://groups.google.com/forum/#!topic/comp.lang.python/jFl3GJbmR7A", webfile)
f = open(webfile)
soup = BeautifulSoup.BeautifulSoup(f)
f.close()
print soup.prettify()






Re: You gotta love a 2-line python solution

2016-05-02 Thread jfong
Stephen Hansen at 2016/5/3 11:49:22AM wrote:
> On Mon, May 2, 2016, at 08:27 PM, jf...@ms4.hinet.net wrote:
> > But when I try to get this forum page, it does get a html file but can't
> > be viewed normally.
> 
> What does that mean?
> 
> -- 
> Stephen Hansen
>   m e @ i x o k a i . i o

The page we are looking at:-)
https://groups.google.com/forum/#!topic/comp.lang.python/jFl3GJbmR7A



Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread Michael Torrie
On 05/02/2016 01:37 AM, DFS wrote:
> So python matches or beats VBScript at this much larger file.  Kewl.

If you download something large enough to be meaningful, you'll find the
runtime speeds should all converge to something showing your internet
connection speed.  Try downloading a 4 GB file, for example.  You're
trying to benchmark an io-bound operation.  After you move past the very
small and meaningless examples that simply benchmark the overhead of the
connection building, you'll find that all languages, even compiled
languages like C, should run at the same speed on average.  Neither VBS
nor Python will be faster than each other.

Now if you want to talk about processing the data once you have it,
there we can talk about speeds and optimization.



Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread DFS

On 5/3/2016 12:06 AM, Michael Torrie wrote:


Now if you want to talk about processing the data once you have it,
there we can talk about speeds and optimization.


Be glad to.  Helps me learn python, so bring whatever challenge you want 
and I'll try to keep up.


One small comparison I was able to make was VBA vs python/pyodbc to 
summarize an Access database.  Not quite a fair test, but interesting 
nonetheless.


---

Access 2003 file
Access 2003 VBA code

2,099,101 rows
114 tables  (max row = 600288)
971 columns
  text:  503
  boolean:   4
  numeric:   351
  date-time: 108
  binary:5
309 indexes (25 foreign keys)
333,549,568 bytes on disk
Time: 0.18 seconds

---

same Access 2003 file
32-bit python 2.7.11 + 32-bit pyodbc 3.0.6

2,099,101 rows
114 tables (max row = 600288)
971  columns
  text:  503
  numeric:   351
  date-time: 108
  binary:5
  boolean:   4
309 indexes (foreign keys na via ODBC*)
333,549,568 bytes on disk
Time: 0.49 seconds

* the Access ODBC driver doesn't support
  the SQLForeignKeys function

---
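
(A minimal sketch of the kind of pyodbc summary described above - the
database path below is hypothetical, and the driver string is the stock
Access MDB ODBC driver:)

import pyodbc

conn = pyodbc.connect(r"DRIVER={Microsoft Access Driver (*.mdb)};DBQ=D:\mydb.mdb")
cur = conn.cursor()

# Cursor.tables() enumerates tables via ODBC metadata; count rows per table.
table_names = [t.table_name for t in cur.tables(tableType='TABLE')]
total_rows = 0
for name in table_names:
    cur.execute("SELECT COUNT(*) FROM [%s]" % name)
    total_rows += cur.fetchone()[0]
print "%d tables, %d rows" % (len(table_names), total_rows)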
