Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread Chris Angelico
On Mon, May 2, 2016 at 4:47 PM, DFS  wrote:
> I'm not specifying a local web cache with either (wouldn't know how or where
> to look).  If you have Windows, you can try it.
> ---
> Option Explicit
> Dim xmlHTTP, fso, fOut, startTime, endTime, webpage, webfile,i
> webpage = "http://econpy.pythonanywhere.com/ex/001.html";
> webfile  = "D:\econpy001.html"
> startTime = Timer
> For i = 1 to 10
>  Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
>  xmlHTTP.Open "GET", webpage
>  xmlHTTP.Send
>  Set fso = CreateObject("Scripting.FileSystemObject")
>  Set fOut = fso.CreateTextFile(webfile, True)
>   fOut.WriteLine xmlHTTP.ResponseText
>  fOut.Close
>  Set fOut= Nothing
>  Set fso = Nothing
>  Set xmlHTTP = Nothing
> Next
> endTime = Timer
> wscript.echo "Finished VBScript in " & FormatNumber(endTime - startTime,3) &
> " seconds"
> ---

There's an easier way to test if there's caching happening. Just crank
the iterations up from 10 to 100 and see what happens to the times. If
your numbers are perfectly fair, they should be perfectly linear in
the iteration count; eg a 1.8 second ten-iteration loop should become
an 18 second hundred-iteration loop. Obviously they won't be exactly
that, but I would expect them to be reasonably close (eg 17-19
seconds, but not 2 seconds).

Then the next thing to test would be to create a deliberately-slow web
server, and connect to that. Put a two-second delay into it, to
simulate a distant or overloaded server, and see if your logs show the
correct result. Something like this:



import time
try:
    import http.server as BaseHTTPServer # Python 3
except ImportError:
    import BaseHTTPServer # Python 2

class SlowHTTP(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-type", "text/html")
        self.end_headers()
        self.wfile.write(b"Hello, ")
        time.sleep(2)  # deliberate delay to simulate a distant/overloaded server
        self.wfile.write(b"world!")

server = BaseHTTPServer.HTTPServer(("", 1234), SlowHTTP)
server.serve_forever()

---

Test that with a web browser or command-line downloader (go to
http://127.0.0.1:1234/), and make sure that (a) it produces "Hello,
world!", and (b) it takes two seconds. Then set your test scripts to
downloading that URL. (Be sure to set them back to low iteration
counts first!) If the times are true and fair, they should all come
out pretty much the same - ten iterations, twenty seconds. And since
all that's changed is the server, this will be an accurate
demonstration of what happens in the real world: network requests
aren't always fast. Incidentally, you can also watch the server's log
to see if it's getting the appropriate number of requests.

It may turn out that changing the web server actually materially
changes your numbers. Comment out the sleep call and try it again -
you might find that your numbers come closer together, because this
naive server doesn't send back 204 NOT MODIFIED responses or anything.
Again, though, this would prove that you're not actually measuring
language performance, because the tests are more dependent on the
server than the client.

Even if the files themselves aren't being cached, you might find that
DNS is. So if you truly want to eliminate variables, replace the name
in your URL with an IP address. It's another thing that might mess
with your timings, without actually being a language feature.
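
(A minimal sketch of that idea, reusing this thread's test URL: resolve the
name once up front so DNS never appears inside the timed loop.)

import socket

host = "econpy.pythonanywhere.com"
ip = socket.gethostbyname(host)  # one DNS lookup, outside the timing loop
url = "http://%s/ex/001.html" % ip
# Note: a virtual-hosted server may still need the original name, e.g.
# requests.get(url, headers={"Host": host})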

Networking has about four billion variables in it. You're messing with
one of the least significant: the programming language :)

ChrisA


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread DFS

On 5/2/2016 2:27 AM, Stephen Hansen wrote:

On Sun, May 1, 2016, at 10:59 PM, DFS wrote:

startTime = time.clock()
for i in range(loops):
    r = urllib2.urlopen(webpage)
    f = open(webfile, "w")
    f.write(r.read())
    f.close()
endTime = time.clock()
print "Finished urllib2 in %.2g seconds" % (endTime - startTime)


Yeah, on my system I get 1.8 seconds total out of this, amounting to 0.18s per loop.


You get 1.8 seconds total for the 10 loops?  That's less than half as 
fast as my results.  Surprising.




I'm again going back to the point of: it's fast enough. When comparing
two small numbers, "twice as slow" is meaningless.


Speed is always meaningful.

I know python is relatively slow, but it's a cool, concise, powerful 
language.  I'm extremely impressed by how tight the code can get.




You have an assumption you haven't answered, that downloading a 10 meg
file will be twice as slow as downloading this tiny file. You haven't
proven that at all.


True.  And it has been my assumption - though not with a 10MB file.



I suspect you have a constant overhead of X, and in this toy example,
that makes it seem twice as slow. But when downloading a file of size,
you'll have the same constant factor, at which point the difference is
irrelevant.


Good point.  Test below.



If you believe otherwise, demonstrate it.


http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=ga&wqhqn=2&qc=Atlanta&rg=30&qhqn=restaurant&sb=zipdisc&ap=2

It's a 58854 byte file when saved to disk (smaller file was 3546 bytes), 
so this is 16.6x larger.  So I would expect python to linearly run in 
16.6 * 0.88 = 14.6 seconds.


10 loops per run

1st run
$ python timeGetHTML.py
Finished urllib in 8.5 seconds
Finished urllib2 in 5.6 seconds
Finished requests in 7.8 seconds
Finished pycurl in 6.5 seconds

wait a couple minutes, then 2nd run
$ python timeGetHTML.py
Finished urllib in 5.6 seconds
Finished urllib2 in 5.7 seconds
Finished requests in 5.2 seconds
Finished pycurl in 6.4 seconds

It's a little more than 1/3 of my estimate - so good news.

(when I was doing these tests, some of the python results were 0.75 
seconds - way too fast, so I checked and no data was written to file, 
and I couldn't even open the webpage with a browser.  Looks like I had 
been temporarily blocked from the site.  After a couple minutes, I was 
able to access it again).


I noticed urllib and curl returned the html as is, but urllib2 and 
requests added enhancements that should make the data easier to parse. 
Based on speed and functionality and documentation, I believe I'll be 
using the requests HTTP library (I will actually be doing a small amount 
of web scraping).



VBScript
1st run: 7.70 seconds
2nd run: 5.38
3rd run: 7.71

So python matches or beats VBScript at this much larger file.  Kewl.




Re: Code Opinion - Enumerate

2016-05-02 Thread Sayth Renshaw
As a reference, here is a functional implementation of Conway's Game of Life.
http://programmablelife.blogspot.com.au/2012/08/conways-game-of-life-in-clojure.html

The author first does it in Clojure and then transliterates it to Python.

Just good for a different view.

Sayth


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread Stephen Hansen
On Mon, May 2, 2016, at 12:37 AM, DFS wrote:
> On 5/2/2016 2:27 AM, Stephen Hansen wrote:
> > I'm again going back to the point of: it's fast enough. When comparing
> > two small numbers, "twice as slow" is meaningless.
> 
> Speed is always meaningful.
> 
> I know python is relatively slow, but it's a cool, concise, powerful 
> language.  I'm extremely impressed by how tight the code can get.

I'm sorry, but no. Speed is not always meaningful. 

It's not even usually meaningful, because you can't quantify what "speed"
is. In context, you're claiming this is twice as slow (even though my
tests show dramatically better performance), but what details are
different?

You're ignoring the fact that Python might have a constant overhead --
meaning, for a 1k download, it might have X speed cost. For a 1meg
download, it might still have the exact same X cost.

Looking narrowly, that overhead looks like "twice as slow", but that's
not meaningful at all. Looking larger, that overhead is a pittance.

You aren't measuring that.

> > You have an assumption you haven't answered, that downloading a 10 meg
> > file will be twice as slow as downloading this tiny file. You haven't
> > proven that at all.
> 
> True.  And it has been my assumption - tho not with 10MB file.

And that assumption is completely invalid.

> I noticed urllib and curl returned the html as is, but urllib2 and 
> requests added enhancements that should make the data easier to parse. 
> Based on speed and functionality and documentation, I believe I'll be 
> using the requests HTTP library (I will actually be doing a small amount 
> of web scraping).

The requests library's added-value is ease-of-use, and its overhead is
likely tiny: so using it means you spend less effort making a thing
happen. I recommend you embrace this. 

> VBScript
> 1st run: 7.70 seconds
> 2nd run: 5.38
> 3rd run: 7.71
> 
> So python matches or beats VBScript at this much larger file.  Kewl.

This is what I'm talking about: Python might have a constant overhead,
but looking at larger operations, it's probably comparable. Not fast,
mind you. Python isn't the fastest language out there. But in real-world
work, it's usually fast enough.

-- 
Stephen Hansen
  m e @ i x o k a i . i o


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread Peter Otten
DFS wrote:

>> Is VB using a local web cache, and Python not?
> 
> I'm not specifying a local web cache with either (wouldn't know how or
> where to look).  If you have Windows, you can try it.

I don't have Windows, but if I'm to believe

http://stackoverflow.com/questions/5235464/how-to-make-microsoft-xmlhttprequest-honor-cache-control-directive

the page is indeed cached and you can disable caching with

> Option Explicit
> Dim xmlHTTP, fso, fOut, startTime, endTime, webpage, webfile,i
> webpage = "http://econpy.pythonanywhere.com/ex/001.html";
> webfile  = "D:\econpy001.html"
> startTime = Timer
> For i = 1 to 10
> Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
> xmlHTTP.Open "GET", webpage
  
  xmlHTTP.setRequestHeader "Cache-Control", "max-age=0"

> xmlHTTP.Send
> Set fso = CreateObject("Scripting.FileSystemObject")
> Set fOut = fso.CreateTextFile(webfile, True)
> fOut.WriteLine xmlHTTP.ResponseText
> fOut.Close
> Set fOut= Nothing
> Set fso = Nothing
> Set xmlHTTP = Nothing
> Next
> endTime = Timer
> wscript.echo "Finished VBScript in " & FormatNumber(endTime -
> startTime,3) & " seconds"
> ---
> save it to a .vbs file and run it like this:
> $cscript /nologo filename.vbs
> 




Re: You gotta love a 2-line python solution

2016-05-02 Thread BartC

On 02/05/2016 04:39, DFS wrote:

To save a webpage to a file:
-
1. import urllib
2. urllib.urlretrieve("http://econpy.pythonanywhere.com/ex/001.html", "D:\\file.html")
-

That's it!

Coming from VB/A background, some of the stuff you can do with python -
with ease - is amazing.


VBScript version
--
1. Option Explicit
2. Dim xmlHTTP, fso, fOut
3. Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
4. xmlHTTP.Open "GET", "http://econpy.pythonanywhere.com/ex/001.html"
5. xmlHTTP.Send
6. Set fso = CreateObject("Scripting.FileSystemObject")
7. Set fOut = fso.CreateTextFile("D:\file.html", True)
8.  fOut.WriteLine xmlHTTP.ResponseText
9. fOut.Close
10. Set fOut = Nothing
11. Set fso  = Nothing
12. Set xmlHTTP = Nothing
--

Technically, that VBS will run with just lines 3-9, but that's still 7
lines of code vs 2 for python.


It seems Python provides a higher level solution compared with VBS. 
Python presumably also has to do those Opens and Sends, but they are 
hidden away inside urllib.urlretrieve.


You can do the same with VB just by wrapping up these lines in a 
subroutine. As you would if this had to be executed in a dozen different 
places for example. Then you could just write:


getfile("http://econpy.pythonanywhere.com/ex/001.html";, "D:/file.html")

in VBS too. (The forward slash in the file name ought to work.)

(I don't know VBS; I assume it does /have/ subroutines? What I haven't 
factored in here is error handling which might yet require more coding 
in VBS compared with Python)


--
Bartc


loading multiple module with same name using importlib.machinery.SourceFileLoader

2016-05-02 Thread ulf . worsoe
I have observed this behaviour, for some reason only on OS X (and Python 
3.5.1): I use importlib.machinery.SourceFileLoader to load a long list of 
modules. The modules are not located in the loader path, and many of them have 
the same name, i.e. I would have:

m1 = importlib.machinery.SourceFileLoader("Module","path/to/m1/Module.py")
m2 = importlib.machinery.SourceFileLoader("Module","path/to/m2/Module.py")

Sometimes the modules will contain members from other modules with the same 
name, e.g. m1/module.py would define a function "m1func" that does not exist in 
m2/module.py, but the function would appear in m2, and examining 
m2.m1func.__code__.co_filename shows that it comes from m1. Members that are 
defined in both m1 and m2 are not overwritten, though.

Is this a bug in importlib.machinery.SourceFileLoader or are we in the Land of 
Undefined Behaviour here?
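
(A sketch of one possible workaround, not a verdict on whether it is a bug:
give each module a unique name so the two files can't collide in sys.modules.
The helper name below is made up for illustration.)

import importlib.util

def load_module_from_path(unique_name, path):
    # Key the spec/module by a unique name so that two files both called
    # "Module.py" get independent module objects.
    spec = importlib.util.spec_from_file_location(unique_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

m1 = load_module_from_path("m1.Module", "path/to/m1/Module.py")
m2 = load_module_from_path("m2.Module", "path/to/m2/Module.py")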


Re: You gotta love a 2-line python solution

2016-05-02 Thread Marko Rauhamaa
BartC :

> On 02/05/2016 04:39, DFS wrote:
>> 2. urllib.urlretrieve("http://econpy.pythonanywhere.com/ex/001.html", "D:\\file.html")
> [...]
>
> It seems Python provides a higher level solution compared with VBS.
> Python presumably also has to do those Opens and Sends, but they are
> hidden away inside urllib.urlretrieve.

Relevant questions include:

 * Is a solution available?

 * Is the solution well thought out?

Python does have a lot of great stuff available, which is nice.
Unfortunately, many of the handy facilities are lacking in the
well-thought-out department.

For example, the urlretrieve() function above blocks. You can't use it
with the asyncio or select modules. You are left with:

   <https://docs.python.org/3/library/asyncio-stream.html#get-http-headers>

Database facilities are notorious offenders. Also, json.load and
json.loads don't allow you to decode JSON in chunks.

If asyncio breaks through, I expect all blocking stdlib function calls
to be adapted for it over the coming years. I'm not overly fond of the
asyncio programming model, but it does sport two new killer features:

 * any blocking operation can be interrupted

 * events can be multiplexed
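
(For contrast, a minimal sketch of a cancellable fetch using the asyncio
stream API - Python 3.5+ syntax, plain HTTP only, with the test URL from
the other thread:)

import asyncio

async def fetch(host, path="/"):
    # Each await yields to the event loop: other tasks can run, and the
    # whole operation can be cancelled - unlike a blocking urlretrieve().
    reader, writer = await asyncio.open_connection(host, 80)
    writer.write("GET {} HTTP/1.0\r\nHost: {}\r\n\r\n".format(path, host).encode())
    body = await reader.read()
    writer.close()
    return body

loop = asyncio.get_event_loop()
print(loop.run_until_complete(fetch("econpy.pythonanywhere.com", "/ex/001.html"))[:200])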


Marko


Re: You gotta love a 2-line python solution

2016-05-02 Thread Steven D'Aprano
On Mon, 2 May 2016 08:12 pm, Marko Rauhamaa wrote:

> For example, the urlretrieve() function above blocks. You can't use it
> with the asyncio or select modules.


The urlretrieve function is one of the oldest functions in the std library.
It literally only exists because Guido was working on a computer somewhere,
found that he didn't have wget, and decided it would be faster to write his
own in Python than download and install wget.

And because this was very early in Python's history, the barrier to getting
into the std lib was much less, especially for stuff Guido wrote himself,
so there it is. These days, I doubt it would be included. It would probably
be a recipe in the docs.

Compared to a full-featured tool like wget or curl, urlretrieve is missing a
lot of stuff which is considered essential, like limiting/configuring the
rate, support for cookies and authentication, retrying on error, etc.



-- 
Steven



Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread Tim Chase
On 2016-05-02 00:06, DFS wrote:
> Then I tested them in loops - the VBScript is MUCH faster: 0.44 for
> 10 iterations, vs 0.88 for python.

In addition to the other debugging recommendations in sibling
threads, a couple other things to try:

1) use a local debugging proxy so that you can compare the headers to
see if anything stands out

2) in light of #1, can you confirm/deny whether one is using gzip
compression and the other isn't?
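
(One quick way to check #2 without a proxy - a sketch with requests, using
the small test page from this thread:)

import requests

r = requests.get("http://econpy.pythonanywhere.com/ex/001.html")
print(r.request.headers.get("Accept-Encoding"))  # what the client offered
print(r.headers.get("Content-Encoding"))         # 'gzip' if the server compressed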

-tkc






Re: Private message regarding: How to prevent the duplication of any value in a column within a CSV file (python)

2016-05-02 Thread Ian Kelly
On Mon, May 2, 2016 at 3:52 AM, Adam Davis  wrote:
> Hi Ian,
>
> I'm really struggling to implement a set into my code as I'm a beginner,
> it's taking me a while to grasp the idea of it. If I was to show you my code
> so you get an idea of my aim/function of the code, would you be able to help
> me at all?

Sure, although I'd recommend posting it to the list so that others
might also be able to help.


starting docker container messes up terminal settings

2016-05-02 Thread Larry Martell
I am starting a docker container from a subprocess.Popen and it works,
but when the script returns, the terminal settings of my shell are
messed up. Nothing is echoed and return doesn't cause a newline. I can
fix this with 'tset' in the terminal, but I don't want to require
that. Has anyone here worked with docker and had seen and solved this
issue?


RE: starting docker container messes up terminal settings

2016-05-02 Thread Joaquin Alzola
>I am starting a docker container from a subprocess.Popen and it works, but 
>when the script returns, the terminal settings of my shell are messed up. 
>Nothing is echoed and return doesn't cause a newline. I can fix this with 
>'tset' in the terminal, but I don't want to require that. Has anyone here 
>worked with docker and had seen and solved this issue?

It is good to post the part of the code you think is causing the error (the 
Popen subprocess call).


Re: What should Python apps do when asked to show help?

2016-05-02 Thread Grant Edwards
On 2016-05-01, c...@zip.com.au  wrote:

>>Didn't the OP specify that he was writing a command-line utility for
>>Linux/Unix?
>>
>>Discussing command line operation for Windows or OS-X seems rather
>>pointless.
>
> OS-X _is_ UNIX. I spent almost all my time on this Mac in terminals. It is a 
> very nice to use UNIX in many regards.

I include what you're doing under the category "Unix".  When I talk
about "OS X", I mean what my 84 year old mother is using.  I assumed
everybody thought that way.  ;)

-- 
Grant Edwards   grant.b.edwardsYow! If I am elected no one
  at   will ever have to do their
  gmail.comlaundry again!



Re: starting docker container messes up terminal settings

2016-05-02 Thread Larry Martell
On Mon, May 2, 2016 at 10:08 AM, Joaquin Alzola
 wrote:
>>I am starting a docker container from a subprocess.Popen and it works, but 
>>when the script returns, the terminal settings of my shell are messed up. 
>>Nothing is echoed and return doesn't cause a newline. I can fix this with 
>>'tset' in the terminal, but I don't want to require that. Has anyone here 
>>worked with docker and had seen and solved this issue?
>
> It is good to put part of the code you think is causing the error (Popen 
> subprocess)

cmd = ['sudo',
   'docker',
   'run',
   '-t',
   '-i',
   'elucidbio/capdata:v2',
   'bash'
]
p = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                     stderr=subprocess.STDOUT)
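
(A possible workaround rather than a confirmed fix: "docker run -t" hands
your tty to the container, which can leave it in raw mode. A sketch that
snapshots and restores the terminal attributes around the call - roughly
what tset does:)

import subprocess
import sys
import termios

fd = sys.stdin.fileno()
saved = termios.tcgetattr(fd)  # snapshot current terminal modes
try:
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT)
    out, _ = p.communicate()
finally:
    termios.tcsetattr(fd, termios.TCSADRAIN, saved)  # restore, like tset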


Re: You gotta love a 2-line python solution

2016-05-02 Thread DFS

On 5/2/2016 5:26 AM, BartC wrote:

On 02/05/2016 04:39, DFS wrote:

To save a webpage to a file:
-
1. import urllib
2. urllib.urlretrieve("http://econpy.pythonanywhere.com/ex/001.html", "D:\\file.html")
-

That's it!

Coming from VB/A background, some of the stuff you can do with python -
with ease - is amazing.


VBScript version
--
1. Option Explicit
2. Dim xmlHTTP, fso, fOut
3. Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
4. xmlHTTP.Open "GET", "http://econpy.pythonanywhere.com/ex/001.html"
5. xmlHTTP.Send
6. Set fso = CreateObject("Scripting.FileSystemObject")
7. Set fOut = fso.CreateTextFile("D:\file.html", True)
8.  fOut.WriteLine xmlHTTP.ResponseText
9. fOut.Close
10. Set fOut = Nothing
11. Set fso  = Nothing
12. Set xmlHTTP = Nothing
--

Technically, that VBS will run with just lines 3-9, but that's still 7
lines of code vs 2 for python.


It seems Python provides a higher level solution compared with VBS.
Python presumably also has to do those Opens and Sends, but they are
hidden away inside urllib.urlretrieve.

You can do the same with VB just by wrapping up these lines in a
subroutine. As you would if this had to be executed in a dozen different
places for example. Then you could just write:

getfile("http://econpy.pythonanywhere.com/ex/001.html";, "D:/file.html")

in VBS too. (The forward slash in the file name ought to work.)



Of course.  Taken to its extreme, I could eventually replace you with 
one line of code :)


But python does it for me.  That would save me 8 lines...




(I don't know VBS; I assume it does /have/ subroutines? What I haven't
factored in here is error handling which might yet require more coding
in VBS compared with Python)


Yeah, VBS has subs and functions.  And strange, limited error handling. 
And a single data type, called Variant.  But it's installed with Windows 
so it's easy to get going with.





Re: You gotta love a 2-line python solution

2016-05-02 Thread Larry Martell
On Mon, May 2, 2016 at 11:15 AM, DFS  wrote:
> Of course.  Taken to its extreme, I could eventually replace you with one
> line of code :)

That reminds me of something I heard many years ago.

Every non-trivial program can be simplified by at least one line of code.
Every non trivial program has at least one bug.

Therefore every non-trivial program can be reduced to one line of code
with a bug.


Re: Python3 html scraper that supports javascript

2016-05-02 Thread zljubisic


I tried to use the following code:

from bs4 import BeautifulSoup
from selenium import webdriver

PHANTOMJS_PATH = 
'C:\\Users\\Zoran\\Downloads\\Obrisi\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe'

url = 
'https://hrti.hrt.hr/#/video/show/2203605/trebizat-prica-o-jednoj-vodi-i-jednom-narodu-dokumentarni-film'

browser = webdriver.PhantomJS(PHANTOMJS_PATH)
browser.get(url)

soup = BeautifulSoup(browser.page_source, "html.parser")

x = soup.prettify()

print(x)


When I print x variable, I would expect to see something like this:
<video src="mediasource:https://hrti.hrt.hr/2e9e9c45-aa23-4d08-9055-cd2d7f2c4d58" id="vjs_video_3_html5_api" class="vjs-tech" preload="none"><source type="application/x-mpegURL" src="https://prd-hrt.spectar.tv/player/get_smil/id/2203605/video_id/2203605/token/Cny6ga5VEQSJ2uZaD2G8pg/token_expiration/1462043309/asset_type/Movie/playlist_template/nginx/channel_name/trebiat__pria_o_jednoj_vodi_i_jednom_narodu_dokumentarni_film/playlist.m3u8?foo=bar"></video>


but I can't come to that point.

Regards.


Re: You gotta love a 2-line python solution

2016-05-02 Thread Manolo Martínez
On 05/02/16 at 11:24am, Larry Martell wrote:
> That reminds me of something I heard many years ago.
> 
> Every non-trivial program can be simplified by at least one line of code.
> Every non trivial program has at least one bug.
> 
> Therefore every non-trivial program can be reduced to one line of code
> with a bug.

Well, not really. Every non-trivial program can be reduced to one line
of code, but then the resulting program is not non-trivial (as it cannot
be further reduced), and therefore there are no guarantees that it will
have a bug.

M


Best way to clean up list items?

2016-05-02 Thread DFS

Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']
Want: list1 = ['Item 1','Item 2']


I wrote this, which works fine, but maybe it can be tidier?

1. list2 = [t.replace("\r\n", "") for t in list1]   #remove \r\n
2. list3 = [t.strip(' ') for t in list2]#trim whitespace
3. list1  = filter(None, list3) #remove empty items


After each step:

1. list2 = ['   Item 1  ','  Item 2  ','  ']   #remove \r\n
2. list3 = ['Item 1','Item 2','']  #trim whitespace
3. list1 = ['Item 1','Item 2'] #remove empty items


Thanks!


Re: Python3 html scraper that supports javascript

2016-05-02 Thread DFS

On 5/2/2016 11:33 AM, zljubi...@gmail.com wrote:



I tried to use the following code:

from bs4 import BeautifulSoup
from selenium import webdriver

PHANTOMJS_PATH = 
'C:\\Users\\Zoran\\Downloads\\Obrisi\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe'

url = 
'https://hrti.hrt.hr/#/video/show/2203605/trebizat-prica-o-jednoj-vodi-i-jednom-narodu-dokumentarni-film'

browser = webdriver.PhantomJS(PHANTOMJS_PATH)
browser.get(url)

soup = BeautifulSoup(browser.page_source, "html.parser")

x = soup.prettify()

print(x)


When I print x variable, I would expect to see something like this:
<video src="mediasource:https://hrti.hrt.hr/2e9e9c45-aa23-4d08-9055-cd2d7f2c4d58" id="vjs_video_3_html5_api" class="vjs-tech" preload="none"><source type="application/x-mpegURL" src="https://prd-hrt.spectar.tv/player/get_smil/id/2203605/video_id/2203605/token/Cny6ga5VEQSJ2uZaD2G8pg/token_expiration/1462043309/asset_type/Movie/playlist_template/nginx/channel_name/trebiat__pria_o_jednoj_vodi_i_jednom_narodu_dokumentarni_film/playlist.m3u8?foo=bar"></video>


but I can't come to that point.

Regards.



I was doing something similar recently.  Try this:

f = open(somefilename)
soup = BeautifulSoup.BeautifulSoup(f)
f.close()
print soup.prettify()




Re: Best way to clean up list items?

2016-05-02 Thread Jussi Piitulainen
DFS writes:

> Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']
> Want: list1 = ['Item 1','Item 2']
>
>
> I wrote this, which works fine, but maybe it can be tidier?
>
> 1. list2 = [t.replace("\r\n", "") for t in list1]   #remove \r\n
> 2. list3 = [t.strip(' ') for t in list2]#trim whitespace
> 3. list1  = filter(None, list3) #remove empty items
>
> After each step:
>
> 1. list2 = ['   Item 1  ','  Item 2  ','  ']   #remove \r\n
> 2. list3 = ['Item 1','Item 2','']  #trim whitespace
> 3. list1 = ['Item 1','Item 2'] #remove empty items

Try filter(None, (t.strip() for t in list1)). The default strip() removes
\r\n along with the spaces.

Funny-looking data you have.


Re: Best way to clean up list items?

2016-05-02 Thread justin walters
On May 2, 2016 10:03 AM, "Jussi Piitulainen" 
wrote:
>
> DFS writes:
>
> > Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']
> > Want: list1 = ['Item 1','Item 2']
> >
> >
> > I wrote this, which works fine, but maybe it can be tidier?
> >
> > 1. list2 = [t.replace("\r\n", "") for t in list1]   #remove \r\n
> > 2. list3 = [t.strip(' ') for t in list2]#trim whitespace
> > 3. list1  = filter(None, list3) #remove empty items
> >
> > After each step:
> >
> > 1. list2 = ['   Item 1  ','  Item 2  ','  ']   #remove \r\n
> > 2. list3 = ['Item 1','Item 2','']  #trim whitespace
> > 3. list1 = ['Item 1','Item 2'] #remove empty items

You could also try compiled regex to remove unwanted characters.

Then loop through the list and do a replace for each item.
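
(A minimal sketch of that approach, assuming the list1 data quoted above:)

import re

ws = re.compile(r'\s+')  # collapses \r\n, tabs and runs of spaces alike
list1 = ['\r\n   Item 1  ', '  Item 2  ', '\r\n  ']
cleaned = [ws.sub(' ', s).strip() for s in list1]
cleaned = [s for s in cleaned if s]  # drop items that were pure whitespace
print(cleaned)  # ['Item 1', 'Item 2']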


Re: Best way to clean up list items?

2016-05-02 Thread Stephen Hansen
On Mon, May 2, 2016, at 09:33 AM, DFS wrote:
> Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']

I'm curious how you got to this point, it seems like you can solve the
problem in how this is generated.

> Want: list1 = ['Item 1','Item 2']

That said:

list1 = [t.strip() for t in list1 if t and not t.isspace()]

-- 
Stephen Hansen
  m e @ i x o k a i . i o


Re: Best way to clean up list items?

2016-05-02 Thread Peter Otten
DFS wrote:

> Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']
> Want: list1 = ['Item 1','Item 2']
> 
> 
> I wrote this, which works fine, but maybe it can be tidier?
> 
> 1. list2 = [t.replace("\r\n", "") for t in list1]   #remove \r\n
> 2. list3 = [t.strip(' ') for t in list2]#trim whitespace
> 3. list1  = filter(None, list3) #remove empty items
> 
> 
> After each step:
> 
> 1. list2 = ['   Item 1  ','  Item 2  ','  ']   #remove \r\n
> 2. list3 = ['Item 1','Item 2','']  #trim whitespace
> 3. list1 = ['Item 1','Item 2'] #remove empty items
> 
> 
> Thanks!

s.strip() strips all whitespace, so you can combine steps 1 and 2:

>>> items = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']
>>> stripped = (s.strip() for s in items)

The (...) instead of [...] denote a generator expression, so the iteration 
has not started yet. The final step uses a list comprehension instead of 
filter():

>>> [s for s in stripped if s]
['Item 1', 'Item 2']

That way the same code works with both Python 2 and Python 3. Note that you 
can iterate over the generator expression only once; if you try it again 
you'll end empty-handed:

>>> [s for s in stripped if s]
[]

If you want to do it in one step here are two options that both involve some 
duplicate work:

>>> [s.strip() for s in items if s and not s.isspace()]
['Item 1', 'Item 2']
>>> [s.strip() for s in items if s.strip()]
['Item 1', 'Item 2']




Re: Python3 html scraper that supports javascript

2016-05-02 Thread Stephen Hansen
On Mon, May 2, 2016, at 08:33 AM, zljubi...@gmail.com wrote:
> I tried to use the following code:
> 
> from bs4 import BeautifulSoup
> from selenium import webdriver
> 
> PHANTOMJS_PATH =
> 'C:\\Users\\Zoran\\Downloads\\Obrisi\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe'
> 
> url =
> 'https://hrti.hrt.hr/#/video/show/2203605/trebizat-prica-o-jednoj-vodi-i-jednom-narodu-dokumentarni-film'
> 
> browser = webdriver.PhantomJS(PHANTOMJS_PATH)
> browser.get(url)
> 
> soup = BeautifulSoup(browser.page_source, "html.parser")
> 
> x = soup.prettify()
> 
> print(x)
> 
> When I print x variable, I would expect to see something like this:
> <video src="mediasource:https://hrti.hrt.hr/2e9e9c45-aa23-4d08-9055-cd2d7f2c4d58"
> id="vjs_video_3_html5_api" class="vjs-tech" preload="none"><source
> type="application/x-mpegURL"
> src="https://prd-hrt.spectar.tv/player/get_smil/id/2203605/video_id/2203605/token/Cny6ga5VEQSJ2uZaD2G8pg/token_expiration/1462043309/asset_type/Movie/playlist_template/nginx/channel_name/trebiat__pria_o_jednoj_vodi_i_jednom_narodu_dokumentarni_film/playlist.m3u8?foo=bar"></video>
> 
> 
> but I can't come to that point.

Why? As important as it is to show code, you need to show what actually
happens and what error message is produced.

-- 
Stephen Hansen
  m e @ i x o k a i . i o


Re: Best way to clean up list items?

2016-05-02 Thread DFS

On 5/2/2016 1:25 PM, Stephen Hansen wrote:

On Mon, May 2, 2016, at 09:33 AM, DFS wrote:

Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']


I'm curious how you got to this point, it seems like you can solve the
problem in how this is generated.



from lxml import html
import requests

webpage = "http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=TN&wqhqn=2&qc=Nashville&rg=30&qhqn=restaurant&sb=zipdisc&ap=2"


page  = requests.get(webpage)
tree  = html.fromstring(page.content)
addr1 = tree.xpath('//span[@class="text3"]/text()')
print 'Addresses: ', addr1


I'd prefer to get clean data in the first place, but I don't know a 
better way to extract it from the HTML.






Re: Best way to clean up list items?

2016-05-02 Thread DFS

On 5/2/2016 12:57 PM, Jussi Piitulainen wrote:

DFS writes:


Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']
Want: list1 = ['Item 1','Item 2']


I wrote this, which works fine, but maybe it can be tidier?

1. list2 = [t.replace("\r\n", "") for t in list1]   #remove \r\n
2. list3 = [t.strip(' ') for t in list2]#trim whitespace
3. list1  = filter(None, list3) #remove empty items

After each step:

1. list2 = ['   Item 1  ','  Item 2  ','  ']   #remove \r\n
2. list3 = ['Item 1','Item 2','']  #trim whitespace
3. list1 = ['Item 1','Item 2'] #remove empty items


Try filter(None, (t.strip() for t in list1)). The default strip() removes
\r\n along with the spaces.


Works and drops a line of code.  Thx.




Funny-looking data you have.


I know - sadly, it's actual data:


from lxml import html
import requests

webpage = "http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=TN&wqhqn=2&qc=Nashville&rg=30&qhqn=restaurant&sb=zipdisc&ap=2"


page  = requests.get(webpage)
tree  = html.fromstring(page.content)
addr1 = tree.xpath('//span[@class="text3"]/text()')
print 'Addresses: ', addr1


I couldn't figure out a better way to extract it from the HTML (maybe 
XML and DOM?)



Re: Best way to clean up list items?

2016-05-02 Thread Stephen Hansen
On Mon, May 2, 2016, at 11:09 AM, DFS wrote:
> I'd prefer to get clean data in the first place, but I don't know a 
> better way to extract it from the HTML.

Ah, right. I didn't know you were scraping HTML. Scraping HTML is rarely
clean so you have to do a lot of cleanup.

-- 
Stephen Hansen
  m e @ i x o k a i . i o


Re: Best way to clean up list items?

2016-05-02 Thread Jussi Piitulainen
DFS writes:

> On 5/2/2016 12:57 PM, Jussi Piitulainen wrote:
>> DFS writes:
>>
>>> Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']
>>> Want: list1 = ['Item 1','Item 2']

. .

>> Funny-looking data you have.
>
> I know - sadly, it's actual data:
>
> 
> from lxml import html
> import requests
>
> webpage =
> "http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=TN&wqhqn=2&qc=Nashville&rg=30&qhqn=restaurant&sb=zipdisc&ap=2";
>
> page  = requests.get(webpage)
> tree  = html.fromstring(page.content)
> addr1 = tree.xpath('//span[@class="text3"]/text()')
> print 'Addresses: ', addr1
> 
>
> I couldn't figure out a better way to extract it from the HTML (maybe
> XML and DOM?)

I should have guessed :) But now I'm a bit worried about those spaces
inside your items. Can it happen that item text is split into strings in
the middle? Then the above sanitation does the wrong thing.

If someone has the right solution, I'm watching, too.
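
(One hedged possibility, reusing the webpage variable from the snippet
above: select the span *elements* rather than their text nodes, then
normalize each element's text_content(), which joins all descendant text
even when it is split across child nodes:)

from lxml import html
import requests

page = requests.get(webpage)
tree = html.fromstring(page.content)
addrs = [' '.join(span.text_content().split())
         for span in tree.xpath('//span[@class="text3"]')]
addrs = [a for a in addrs if a]  # drop spans that held only whitespace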


Re: Best way to clean up list items?

2016-05-02 Thread DFS

On 5/2/2016 2:27 PM, Jussi Piitulainen wrote:

DFS writes:


On 5/2/2016 12:57 PM, Jussi Piitulainen wrote:

DFS writes:


Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']
Want: list1 = ['Item 1','Item 2']


. .


Funny-looking data you have.


I know - sadly, it's actual data:


from lxml import html
import requests

webpage =
"http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=TN&wqhqn=2&qc=Nashville&rg=30&qhqn=restaurant&sb=zipdisc&ap=2";

page  = requests.get(webpage)
tree  = html.fromstring(page.content)
addr1 = tree.xpath('//span[@class="text3"]/text()')
print 'Addresses: ', addr1


I couldn't figure out a better way to extract it from the HTML (maybe
XML and DOM?)


I should have guessed :) But now I'm a bit worried about those spaces
inside your items. Can it happen that item text is split into strings in
the middle?


Meaning split by me, or comes 'malformed' from the data source?



Then the above sanitation does the wrong thing.

If someone has the right solution, I'm watching, too.



Here's the raw data as stored in the tree:

---
1st page

['\r\n', '\r\n1918 W End 
Ave, Nashville, TN 37203', '\r\n
  ', '\r\n1806 Hayes St, Nashville, 
TN 37203', '\r\n', '\r\n 
1701 Broadway, Nashville, TN 37203', '\r\n', '\r\n
209 10th Ave S, Nashville, TN 37203', '\r\n 
   ', '\r\n907 20th Ave S, Nashville, TN 
37212', '\r\n', '\r\n911 
20th Ave S, Nashville, TN 37212', '\r\n', '\r\n 
  1722 W End Ave, Nashville, TN 37203', '\r\n 
 ', '\r\n1905 Hayes St, 
Nashville, TN 37203', '\r\n
  ', '\r\n2000 W End Ave, 
Nashville, TN 37203']


---

Next page

['\r\n', '\r\n120 19th 
Ave N, Nashville, TN 37203', '\r\n
  ', '\r\n1719 W End Ave Ste 101, 
Nashville, TN 37203', '\r\n
  ', '\r\n1922 W End Ave, Nashville, TN 
37203', '\r\n', '\r\n
  909 20th Ave S, Nashville, TN 37212', '\r\n 
 ', '\r\n
  1807 Church St, Nashville, TN 37203', '\r\n 
 ', '\r\n1721 Church St, Nashville, TN 37203', 
'\r\n', '\r\n718 
Division St, Nashville, TN 37203', '\r\n', '\r\n 
   907 12th Ave S, Nashville, TN 37203', '\r\n 
  ', '\r\n204 21st Ave S, 
Nashville, TN 37203', '\r\n
  ', '\r\n1811 Division St, Nashville, 
TN 37203', '\r\n', '\r\n 
903 Gleaves St, Nashville, TN 37203', '\r\n', '\r\n
1720 W End Ave Ste 530, Nashville, TN 37203', '\r\n 
   ', '\r\n
1200 Division St Ste 100-A, Nashville, TN 37203', '\r\n 
   ', '\r\n
422 7th Ave S, Nashville, TN 37203', '\r\n', 
'\r\n605 8th Ave S, Nashville, TN 37203']


and so on
---

I've checked a couple hundred addresses visually, and so far I've only 
seen 2 formats:


1. '\r\n'
2. '\r\n   address  '




Re: Python3 html scraper that supports javascript

2016-05-02 Thread zljubisic

> Why? As important as it is to show code, you need to show what actually
> happens and what error message is produced.

If you run the code you will see that the html I got doesn't have a link to the 
flash video. I should somehow do something (press the play video button maybe) in 
order to get html with a reference to the video file on this page.
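
(One hedged sketch of that idea - the .vjs-big-play-button selector is a
guess based on the video.js class names in the expected output, not
something verified against this site:)

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.PhantomJS(PHANTOMJS_PATH)  # as defined in the first post
browser.get(url)                               # as defined in the first post
wait = WebDriverWait(browser, 30)
# Click the player's big play button, then wait for the <source> element
# carrying the stream URL to appear in the DOM.
play = wait.until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, ".vjs-big-play-button")))
play.click()
source = wait.until(EC.presence_of_element_located((By.TAG_NAME, "source")))
print(source.get_attribute("src"))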

Regards


Need help understanding list structure

2016-05-02 Thread moa47401
I've been using an old text parsing library and have been able to accomplish 
most of what I wanted to do. But I don't understand the list structure it uses 
well enough to build additional methods.

If I print the list, it has thousands of elements within its brackets separated 
by commas as I would expect. But the elements appear to be memory pointers not 
the actual text. Here's an example:

[<gedcom.Element object at 0x...>, <gedcom.Element object at 0x...>, ...]

If I iterate over the list, I do get the actual text of each element and am 
able to use it.

Also, if I iterate over the list and place each element in a new list using 
append, then each element in the new list is the text I expect not memory 
pointers.

But... if I copy the old list to a new list using 

new = old[:] 
or 
new = list(old)

the new list is exactly like the original with memory pointers.

Can someone help me understand why or under what circumstances a list shows 
pointers instead of the text data?





Re: Need help understanding list structure

2016-05-02 Thread Erik

On 02/05/16 22:30, moa47...@gmail.com wrote:

Can someone help me understand why or under what circumstances a list
shows pointers instead of the text data?


When Python's "print" statement/function is invoked, it will print the 
textual representation of the object according to its class's __str__ or 
__repr__ method. That is, the print function prints out whatever text 
the class says it should.

For classes which don't implement a __str__ or __repr__ method, then
the text "" is used - where CLASS is the class
name and ADDRESS is the "memory pointer".

> If I iterate over the list, I do get the actual text of each element
> and am able to use it.
>
> Also, if I iterate over the list and place each element in a new list
> using append, then each element in the new list is the text I expect
> not memory pointers.

Look at the __iter__ method of the class of the object you are iterating 
over. I suspect that it returns string objects, not the objects that are 
in the list itself.


String objects have a __str__ or __repr__ method that represents them as 
the text, so that is what 'print' will output.
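
(A small sketch of the difference:)

class Plain(object):
    pass

class Described(object):
    def __repr__(self):
        return "Described('some text')"

print([Plain(), Described()])
# [<__main__.Plain object at 0x...>, Described('some text')]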


Hope that helps, E.


Re: Need help understanding list structure

2016-05-02 Thread moa47401

> When Python's "print" statement/function is invoked, it will print the 
> textual representation of the object according to its class's __str__ or
> __repr__ method. That is, the print function prints out whatever text
> the class says it should.
> 
> For classes which don't implement a __str__ or __repr__ method, then
> the text "" is used - where CLASS is the class
> name and ADDRESS is the "memory pointer".
> 
>  > If I iterate over the list, I do get the actual text of each element
>  > and am able to use it.
>  >
>  > Also, if I iterate over the list and place each element in a new list
>  > using append, then each element in the new list is the text I expect
>  > not memory pointers.
> 
> Look at the __iter__ method of the class of the object you are iterating 
> over. I suspect that it returns string objects, not the objects that are 
> in the list itself.
> 
> String objects have a __str__ or __repr__ method that represents them as 
> the text, so that is what 'print' will output.
> 
> Hope that helps, E.

Yes, that does help. You're right. The author of the library I'm using didn't 
implement either a __str__ or __repr__ method. Am I correct in assuming that 
parsing a large text file would be quicker returning pointers instead of 
strings? I've never run into this before.




Re: What should Python apps do when asked to show help?

2016-05-02 Thread cs

On 02May2016 14:07, Grant Edwards  wrote:

On 2016-05-01, c...@zip.com.au  wrote:


Didn't the OP specify that he was writing a command-line utility for
Linux/Unix?

Discussing command line operation for Windows or OS-X seems rather
pointless.


OS-X _is_ UNIX. I spent almost all my time on this Mac in terminals. It is a
very nice to use UNIX in many regards.


I include what you're doing under the category "Unix".  When I talk
about "OS X", I mean what my 84 year old mother is using.  I assumed
everybody thought that way.  ;)


Weird. My 79 year old mother uses "Apple". I can only presume there's no "OS X" 
for her.


Cheers,
Cameron Simpson 


Re: Need help understanding list structure

2016-05-02 Thread Michael Torrie
On 05/02/2016 04:33 PM, moa47...@gmail.com wrote:
> Yes, that does help. You're right. The author of the library I'm
> using didn't implement either a __str__ or __repr__ method. Am I
> correct in assuming that parsing a large text file would be quicker
> returning pointers instead of strings? I've never run into this
> before.

I'm not sure what you mean by "returning pointers." The list isn't
returning pointers. It's a list of *objects*.  To be specific, a list of
gedcom.Element objects, though they could be anything, including numbers
or strings.  If you refer to the source code where the Element class is
defined you can see what these objects contain. I suspect they contain a
lot more information than simply text.

Lists of objects is a common idiom in Python.  As you've discovered, if
you shallow copy a list, the new list will contain the exact same
objects.  In many cases, this does not matter.  For example a list of
numbers, which are immutable or unchangeable objects.  It doesn't matter
that the instances are shared, since the instances themselves will never
change.  If the objects are mutable, as they are in your case, a shallow
copy may not always be what you want.
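
(A small sketch of the difference, with a made-up Box class standing in for
gedcom.Element:)

import copy

class Box(object):
    def __init__(self, value):
        self.value = value

old = [Box(1), Box(2)]
new = old[:]               # shallow copy: both lists hold the same Box objects
new[0].value = 99
print(old[0].value)        # 99 - the mutation is visible through both lists

deep = copy.deepcopy(old)  # deep copy: independent Box objects
deep[0].value = 7
print(old[0].value)        # still 99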

As to your question.  A list never shows "pointers" as you say.  A list
always contains objects, and if you simply "print" the list, it will try
to show a representation of the list, using the objects' repr dunder
methods.  Some classes I have used have their repr methods print out
what the constructor would look like, if you were to construct the
object yourself.  This is very useful.  If I recall, this is what
BeautifulSoup objects do, which is incredibly useful.

In your case, as Erik said, the objects you are dealing with don't
provide repr dunder methods, so Python just lets you know they are
objects of a certain class, and what their ids are, which is helpful if
you're trying to determine if two objects are the same object.  These
are not "pointers" in the sense you're talking.  You'll get text if the
object prints text for you. This is true of any object you might store
in the list.

I hope this helps a bit.  Exploring from the interactive prompt as you
are doing is very useful, once you understand what it's saying to you.


Re: Need help understanding list structure

2016-05-02 Thread Ben Finney
moa47...@gmail.com writes:

> Am I correct in assuming that parsing a large text file would be
> quicker returning pointers instead of strings?

What do you mean by “return a pointer”? Python doesn't have pointers.

In the Python language, a container type (such as ‘set’, ‘list’, ‘dict’,
etc.) contains the objects directly. There are no “pointers” there; by
accessing the items of a container, you access the items directly.


What do you mean by “would be quicker”? I am concerned you are seeking
speed of the program at the expense of understandability and clarity of
the code.

Instead, you should be writing clear, maintainable code.

*Only if* the clear, maintainable code you write then actually ends up
being too slow, should you then worry about what parts are quick or slow
by *measuring* the specific parts of code to discover what is actually
occupying the time.
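
(For instance, the stdlib timeit module measures just the snippet under
suspicion, in isolation, repeated enough times to smooth out noise:)

import timeit

print(timeit.timeit("'-'.join(str(n) for n in range(100))", number=10000))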

-- 
 \ “All television is educational television. The question is: |
  `\   what is it teaching?” —Nicholas Johnson |
_o__)  |
Ben Finney



Re: You gotta love a 2-line python solution

2016-05-02 Thread jfong
DFS at 2016/5/2 UTC+8 11:39:33AM wrote:
> To save a webpage to a file:
> -
> 1. import urllib
> 2. urllib.urlretrieve("http://econpy.pythonanywhere.com/ex/001.html", "D:\\file.html")
> -
> 
> That's it!

Why my system can't do it?

Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 19:28:18) [MSC v.1600 32 bit (In
tel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib import urlretrieve
Traceback (most recent call last):
  File "", line 1, in 
ImportError: cannot import name 'urlretrieve'



Re: You gotta love a 2-line python solution

2016-05-02 Thread DFS

On 5/2/2016 8:45 PM, jf...@ms4.hinet.net wrote:

DFS at 2016/5/2 UTC+8 11:39:33AM wrote:

To save a webpage to a file:
-
1. import urllib
2. urllib.urlretrieve("http://econpy.pythonanywhere.com/ex/001.html", "D:\\file.html")
-

That's it!


Why my system can't do it?

Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 19:28:18) [MSC v.1600 32 bit (In
tel)] on win32
Type "help", "copyright", "credits" or "license" for more information.

from urllib import urlretrieve

Traceback (most recent call last):
  File "", line 1, in 
ImportError: cannot import name 'urlretrieve'



try

from urllib.request import urlretrieve

http://stackoverflow.com/questions/21171718/urllib-urlretrieve-file-python-3-3


I'm running python 2.7.11 (32-bit)


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread DFS

On 5/2/2016 4:42 AM, Peter Otten wrote:

DFS wrote:


Is VB using a local web cache, and Python not?


I'm not specifying a local web cache with either (wouldn't know how or
where to look).  If you have Windows, you can try it.


I don't have Windows, but if I'm to believe

http://stackoverflow.com/questions/5235464/how-to-make-microsoft-xmlhttprequest-honor-cache-control-directive

the page is indeed cached and you can disable caching with


Option Explicit
Dim xmlHTTP, fso, fOut, startTime, endTime, webpage, webfile,i
webpage = "http://econpy.pythonanywhere.com/ex/001.html";
webfile  = "D:\econpy001.html"
startTime = Timer
For i = 1 to 10
Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
xmlHTTP.Open "GET", webpage


  xmlHTTP.setRequestHeader "Cache-Control", "max-age=0"



Tried that, and from later on that stackoverflow page:

xmlHTTP.setRequestHeader "Cache-Control", "private"

Neither made a difference.  In fact, I saw faster times than ever - as 
low as 0.41 for 10 loops.



Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread DFS

On 5/2/2016 3:19 AM, Chris Angelico wrote:


There's an easier way to test if there's caching happening. Just crank
the iterations up from 10 to 100 and see what happens to the times. If
your numbers are perfectly fair, they should be perfectly linear in
the iteration count; eg a 1.8 second ten-iteration loop should become
an 18 second hundred-iteration loop. Obviously they won't be exactly
that, but I would expect them to be reasonably close (eg 17-19
seconds, but not 2 seconds).


100 loops
Finished VBScript in 3.953 seconds
Finished VBScript in 3.608 seconds
Finished VBScript in 3.610 seconds

Bit of a per-loop speedup going from 10 to 100.



Then the next thing to test would be to create a deliberately-slow web
server, and connect to that. Put a two-second delay into it, to
simulate a distant or overloaded server, and see if your logs show the
correct result. Something like this:



import time
try:
    import http.server as BaseHTTPServer # Python 3
except ImportError:
    import BaseHTTPServer # Python 2

class SlowHTTP(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-type", "text/html")
        self.end_headers()
        self.wfile.write(b"Hello, ")
        time.sleep(2)  # deliberate delay to simulate a distant/overloaded server
        self.wfile.write(b"world!")

server = BaseHTTPServer.HTTPServer(("", 1234), SlowHTTP)
server.serve_forever()

---

Test that with a web browser or command-line downloader (go to
http://127.0.0.1:1234/), and make sure that (a) it produces "Hello,
world!", and (b) it takes two seconds. Then set your test scripts to
downloading that URL. (Be sure to set them back to low iteration
counts first!) If the times are true and fair, they should all come
out pretty much the same - ten iterations, twenty seconds. And since
all that's changed is the server, this will be an accurate
demonstration of what happens in the real world: network requests
aren't always fast. Incidentally, you can also watch the server's log
to see if it's getting the appropriate number of requests.

It may turn out that changing the web server actually materially
changes your numbers. Comment out the sleep call and try it again -
you might find that your numbers come closer together, because this
naive server doesn't send back 204 NOT MODIFIED responses or anything.
Again, though, this would prove that you're not actually measuring
language performance, because the tests are more dependent on the
server than the client.

Even if the files themselves aren't being cached, you might find that
DNS is. So if you truly want to eliminate variables, replace the name
in your URL with an IP address. It's another thing that might mess
with your timings, without actually being a language feature.

Networking has about four billion variables in it. You're messing with
one of the least significant: the programming language :)

ChrisA



Thanks for the good feedback.




Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread Chris Angelico
On Tue, May 3, 2016 at 11:51 AM, DFS  wrote:
> On 5/2/2016 3:19 AM, Chris Angelico wrote:
>
>> There's an easier way to test if there's caching happening. Just crank
>> the iterations up from 10 to 100 and see what happens to the times. If
>> your numbers are perfectly fair, they should be perfectly linear in
>> the iteration count; eg a 1.8 second ten-iteration loop should become
>> an 18 second hundred-iteration loop. Obviously they won't be exactly
>> that, but I would expect them to be reasonably close (eg 17-19
>> seconds, but not 2 seconds).
>
>
> 100 loops
> Finished VBScript in 3.953 seconds
> Finished VBScript in 3.608 seconds
> Finished VBScript in 3.610 seconds
>
> Bit of a per-loop speedup going from 10 to 100.

How many seconds was it for 10 loops?

ChrisA


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread DFS

On 5/2/2016 10:00 PM, Chris Angelico wrote:

On Tue, May 3, 2016 at 11:51 AM, DFS  wrote:

On 5/2/2016 3:19 AM, Chris Angelico wrote:


There's an easier way to test if there's caching happening. Just crank
the iterations up from 10 to 100 and see what happens to the times. If
your numbers are perfectly fair, they should be perfectly linear in
the iteration count; eg a 1.8 second ten-iteration loop should become
an 18 second hundred-iteration loop. Obviously they won't be exactly
that, but I would expect them to be reasonably close (eg 17-19
seconds, but not 2 seconds).



100 loops
Finished VBScript in 3.953 seconds
Finished VBScript in 3.608 seconds
Finished VBScript in 3.610 seconds

Bit of a per-loop speedup going from 10 to 100.


How many seconds was it for 10 loops?

ChrisA


~0.44




Re: You gotta love a 2-line python solution

2016-05-02 Thread jfong
DFS at 2016/5/3 9:12:24AM wrote:
> try
> 
> from urllib.request import urlretrieve
> 
> http://stackoverflow.com/questions/21171718/urllib-urlretrieve-file-python-3-3
> 
> 
> I'm running python 2.7.11 (32-bit)

Alright, it works...someway.

I try to get a zip file. It works, the file can be unzipped correctly.

>>> from urllib.request import urlretrieve
>>> urlretrieve("http://www.caprilion.com.tw/fed.zip", "d:\\temp\\temp.zip")
('d:\\temp\\temp.zip', <http.client.HTTPMessage object at 0x...>)
>>>

But when I try to get this forum page, it does get a html file but can't be 
viewed normally.

>>> urlretrieve("https://groups.google.com/forum/#!topic/comp.lang.python/jFl3GJbmR7A", "d:\\temp\\temp.html")
('d:\\temp\\temp.html', <http.client.HTTPMessage object at 0x...>)
>>>

I suppose the html is a much more complex situation where more processing needs 
to be done before it can be opened by a web browser:-)



Re: You gotta love a 2-line python solution

2016-05-02 Thread Stephen Hansen
On Mon, May 2, 2016, at 08:27 PM, jf...@ms4.hinet.net wrote:
> But when I try to get this forum page, it does get a html file but can't
> be viewed normally.

What does that mean?

-- 
Stephen Hansen
  m e @ i x o k a i . i o


Re: You gotta love a 2-line python solution

2016-05-02 Thread DFS

On 5/2/2016 11:27 PM, jf...@ms4.hinet.net wrote:

DFS at 2016/5/3 9:12:24AM wrote:

try

from urllib.request import urlretrieve

http://stackoverflow.com/questions/21171718/urllib-urlretrieve-file-python-3-3


I'm running python 2.7.11 (32-bit)


Alright, it works...someway.

I try to get a zip file. It works, the file can be unzipped correctly.


from urllib.request import urlretrieve
urlretrieve("http://www.caprilion.com.tw/fed.zip";, "d:\\temp\\temp.zip")

('d:\\temp\\temp.zip', )




But when I try to get this forum page, it does get a html file but can't be 
viewed normally.


urlretrieve("https://groups.google.com/forum/#!topic/comp.lang.python/jFl3GJ

bmR7A", "d:\\temp\\temp.html")
('d:\\temp\\temp.html', )




I suppose the html is a much more complex situation where more processing needs 
to be done before it can be opened by a web browser:-)



Who knows what Google has done... it won't open in Opera.  The tab title 
shows up, but after 20-30 seconds the screen just stays blank and the 
cursor quits loading.


It's a mess - try running it thru BeautifulSoup.prettify() and it looks 
better.



import BeautifulSoup
import urllib
webfile = "D:\\afile.html"
urllib.urlretrieve("https://groups.google.com/forum/#!topic/comp.lang.python/jFl3GJbmR7A", webfile)
f = open(webfile)
soup = BeautifulSoup.BeautifulSoup(f)
f.close()
print soup.prettify()






Re: You gotta love a 2-line python solution

2016-05-02 Thread jfong
Stephen Hansen at 2016/5/3 11:49:22AM wrote:
> On Mon, May 2, 2016, at 08:27 PM, jf...@ms4.hinet.net wrote:
> > But when I try to get this forum page, it does get a html file but can't
> > be viewed normally.
> 
> What does that mean?
> 
> -- 
> Stephen Hansen
>   m e @ i x o k a i . i o

The page we are looking at:-)
https://groups.google.com/forum/#!topic/comp.lang.python/jFl3GJbmR7A



Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread Michael Torrie
On 05/02/2016 01:37 AM, DFS wrote:
> So python matches or beats VBScript at this much larger file.  Kewl.

If you download something large enough to be meaningful, you'll find the
runtime speeds should all converge to something showing your internet
connection speed.  Try downloading a 4 GB file, for example.  You're
trying to benchmark an io-bound operation.  After you move past the very
small and meaningless examples that simply benchmark the overhead of the
connection building, you'll find that all languages, even compiled
languages like C, should run at the same speed on average.  Neither VBS
nor Python will be faster than each other.

Now if you want to talk about processing the data once you have it,
there we can talk about speeds and optimization.



Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread DFS

On 5/3/2016 12:06 AM, Michael Torrie wrote:


Now if you want to talk about processing the data once you have it,
there we can talk about speeds and optimization.


Be glad to.  Helps me learn python, so bring whatever challenge you want 
and I'll try to keep up.


One small comparison I was able to make was VBA vs python/pyodbc to 
summarize an Access database.  Not quite a fair test, but interesting 
nonetheless.


---

Access 2003 file
Access 2003 VBA code

2,099,101 rows
114 tables  (max row = 600288)
971 columns
  text:  503
  boolean:   4
  numeric:   351
  date-time: 108
  binary:5
309 indexes (25 foreign keys)
333,549,568 bytes on disk
Time: 0.18 seconds

---

same Access 2003 file
32-bit python 2.7.11 + 32-bit pyodbc 3.0.6

2,099,101 rows
114 tables (max row = 600288)
971  columns
  text:  503
  numeric:   351
  date-time: 108
  binary:5
  boolean:   4
309 indexes (foreign keys na via ODBC*)
333,549,568 bytes on disk
Time: 0.49 seconds

* the Access ODBC driver doesn't support
  the SQLForeignKeys function

---
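
(A minimal sketch of the kind of pyodbc summary described above - the
database path below is hypothetical, and the driver string is the stock
Access MDB ODBC driver:)

import pyodbc

conn = pyodbc.connect(r"DRIVER={Microsoft Access Driver (*.mdb)};DBQ=D:\mydb.mdb")
cur = conn.cursor()

# Cursor.tables() enumerates tables via ODBC metadata; count rows per table.
table_names = [t.table_name for t in cur.tables(tableType='TABLE')]
total_rows = 0
for name in table_names:
    cur.execute("SELECT COUNT(*) FROM [%s]" % name)
    total_rows += cur.fetchone()[0]
print "%d tables, %d rows" % (len(table_names), total_rows)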
