Re: Fastest way to retrieve and write html contents to file
On Mon, May 2, 2016 at 4:47 PM, DFS wrote:
> I'm not specifying a local web cache with either (wouldn't know how or
> where to look).  If you have Windows, you can try it.
> ---
> Option Explicit
> Dim xmlHTTP, fso, fOut, startTime, endTime, webpage, webfile, i
> webpage = "http://econpy.pythonanywhere.com/ex/001.html"
> webfile = "D:\econpy001.html"
> startTime = Timer
> For i = 1 to 10
>  Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
>  xmlHTTP.Open "GET", webpage
>  xmlHTTP.Send
>  Set fso = CreateObject("Scripting.FileSystemObject")
>  Set fOut = fso.CreateTextFile(webfile, True)
>  fOut.WriteLine xmlHTTP.ResponseText
>  fOut.Close
>  Set fOut = Nothing
>  Set fso = Nothing
>  Set xmlHTTP = Nothing
> Next
> endTime = Timer
> wscript.echo "Finished VBScript in " & FormatNumber(endTime - startTime, 3) & " seconds"
> ---

There's an easier way to test if there's caching happening. Just crank the iterations up from 10 to 100 and see what happens to the times. If your numbers are perfectly fair, they should be perfectly linear in the iteration count; eg a 1.8 second ten-iteration loop should become an 18 second hundred-iteration loop. Obviously they won't be exactly that, but I would expect them to be reasonably close (eg 17-19 seconds, but not 2 seconds).

Then the next thing to test would be to create a deliberately-slow web server, and connect to that. Put a two-second delay into it, to simulate a distant or overloaded server, and see if your logs show the correct result. Something like this:

import time
try:
    import http.server as BaseHTTPServer  # Python 3
except ImportError:
    import BaseHTTPServer  # Python 2

class SlowHTTP(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-type", "text/html")
        self.end_headers()
        self.wfile.write(b"Hello, ")
        time.sleep(2)
        self.wfile.write(b"world!")

server = BaseHTTPServer.HTTPServer(("", 1234), SlowHTTP)
server.serve_forever()

Test that with a web browser or command-line downloader (go to http://127.0.0.1:1234/), and make sure that (a) it produces "Hello, world!", and (b) it takes two seconds. Then set your test scripts to downloading that URL. (Be sure to set them back to low iteration counts first!) If the times are true and fair, they should all come out pretty much the same - ten iterations, twenty seconds. And since all that's changed is the server, this will be an accurate demonstration of what happens in the real world: network requests aren't always fast. Incidentally, you can also watch the server's log to see if it's getting the appropriate number of requests.

It may turn out that changing the web server actually materially changes your numbers. Comment out the sleep call and try it again - you might find that your numbers come closer together, because this naive server doesn't send back 304 NOT MODIFIED responses or anything. Again, though, this would prove that you're not actually measuring language performance, because the tests are more dependent on the server than the client.

Even if the files themselves aren't being cached, you might find that DNS is. So if you truly want to eliminate variables, replace the name in your URL with an IP address. It's another thing that might mess with your timings, without actually being a language feature.

Networking has about four billion variables in it. You're messing with one of the least significant: the programming language :)

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
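A minimal sketch of the DNS-pinning idea above, using the small test page from the thread; note that a virtual-hosted server still routes on the original name, so the sketch keeps it in the Host header:

import socket
try:
    from urllib.request import Request, urlopen   # Python 3
except ImportError:
    from urllib2 import Request, urlopen          # Python 2

host = "econpy.pythonanywhere.com"
ip = socket.gethostbyname(host)              # one-time DNS lookup
req = Request("http://%s/ex/001.html" % ip)  # fetch by IP from here on
req.add_header("Host", host)                 # virtual hosts need this
html = urlopen(req).read()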
Re: Fastest way to retrieve and write html contents to file
On 5/2/2016 2:27 AM, Stephen Hansen wrote:
> On Sun, May 1, 2016, at 10:59 PM, DFS wrote:
>> startTime = time.clock()
>> for i in range(loops):
>>     r = urllib2.urlopen(webpage)
>>     f = open(webfile, "w")
>>     f.write(r.read())
>>     f.close()
>> endTime = time.clock()
>> print "Finished urllib2 in %.2g seconds" % (endTime - startTime)
>
> Yeah on my system I get 1.8 out of this, amounting to 0.18s.

You get 1.8 seconds total for the 10 loops?  That's less than half as fast as my results.  Surprising.

> I'm again going back to the point of: its fast enough. When comparing
> two small numbers, "twice as slow" is meaningless.

Speed is always meaningful.

I know python is relatively slow, but it's a cool, concise, powerful language.  I'm extremely impressed by how tight the code can get.

> You have an assumption you haven't answered, that downloading a 10 meg
> file will be twice as slow as downloading this tiny file. You haven't
> proven that at all.

True.  And it has been my assumption - tho not with a 10MB file.

> I suspect you have a constant overhead of X, and in this toy example,
> that makes it seem twice as slow. But when downloading a file of size,
> you'll have the same constant factor, at which point the difference is
> irrelevant.

Good point.  Test below.

> If you believe otherwise, demonstrate it.

http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=ga&wqhqn=2&qc=Atlanta&rg=30&qhqn=restaurant&sb=zipdisc&ap=2

It's a 58854 byte file when saved to disk (the smaller file was 3546 bytes), so this one is 16.6x larger.  If the cost scaled linearly, I would expect python to run in 16.6 * 0.88 = 14.6 seconds.

10 loops per run

1st run
$ python timeGetHTML.py
Finished urllib in 8.5 seconds
Finished urllib2 in 5.6 seconds
Finished requests in 7.8 seconds
Finished pycurl in 6.5 seconds

wait a couple minutes, then 2nd run
$ python timeGetHTML.py
Finished urllib in 5.6 seconds
Finished urllib2 in 5.7 seconds
Finished requests in 5.2 seconds
Finished pycurl in 6.4 seconds

It's a little more than 1/3 of my estimate - so good news.

(When I was doing these tests, some of the python results were 0.75 seconds - way too fast, so I checked and no data was written to file, and I couldn't even open the webpage with a browser.  Looks like I had been temporarily blocked from the site.  After a couple minutes, I was able to access it again.)

I noticed urllib and curl returned the html as is, but urllib2 and requests added enhancements that should make the data easier to parse.  Based on speed and functionality and documentation, I believe I'll be using the requests HTTP library (I will actually be doing a small amount of web scraping).

VBScript
1st run: 7.70 seconds
2nd run: 5.38
3rd run: 7.71

So python matches or beats VBScript at this much larger file.  Kewl.
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Code Opinion - Enumerate
As a reference, here is a functional implementation of Conway's Game of Life:

http://programmablelife.blogspot.com.au/2012/08/conways-game-of-life-in-clojure.html

The author first does it in Clojure and then transliterates it to Python. Just good for a different view.

Sayth
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Fastest way to retrieve and write html contents to file
On Mon, May 2, 2016, at 12:37 AM, DFS wrote:
> On 5/2/2016 2:27 AM, Stephen Hansen wrote:
> > I'm again going back to the point of: its fast enough. When comparing
> > two small numbers, "twice as slow" is meaningless.
>
> Speed is always meaningful.
>
> I know python is relatively slow, but it's a cool, concise, powerful
> language.  I'm extremely impressed by how tight the code can get.

I'm sorry, but no. Speed is not always meaningful. It's not even usually meaningful, because you can't quantify what "speed" is. In context, you're claiming this is twice as slow (even though my tests show dramatically better performance), but what details are different?

You're ignoring the fact that Python might have a constant overhead -- meaning, for a 1k download, it might have X speed cost. For a 1meg download, it might still have the exact same X cost. Looking narrowly, that overhead looks like "twice as slow", but that's not meaningful at all. Looking larger, that overhead is a pittance. You aren't measuring that.

> > You have an assumption you haven't answered, that downloading a 10 meg
> > file will be twice as slow as downloading this tiny file. You haven't
> > proven that at all.
>
> True.  And it has been my assumption - tho not with 10MB file.

And that assumption is completely invalid.

> I noticed urllib and curl returned the html as is, but urllib2 and
> requests added enhancements that should make the data easier to parse.
> Based on speed and functionality and documentation, I believe I'll be
> using the requests HTTP library (I will actually be doing a small amount
> of web scraping).

The requests library's added value is ease of use, and its overhead is likely tiny: using it means you spend less effort making a thing happen. I recommend you embrace this.

> VBScript
> 1st run: 7.70 seconds
> 2nd run: 5.38
> 3rd run: 7.71
>
> So python matches or beats VBScript at this much larger file.  Kewl.

This is what I'm talking about: Python might have a constant overhead, but looking at larger operations, it's probably comparable. Not fast, mind you. Python isn't the fastest language out there. But in real-world work, it's usually fast enough.

-- 
Stephen Hansen
  m e @ i x o k a i . i o
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Fastest way to retrieve and write html contents to file
DFS wrote:
>> Is VB using a local web cache, and Python not?
>
> I'm not specifying a local web cache with either (wouldn't know how or
> where to look).  If you have Windows, you can try it.

I don't have Windows, but if I'm to believe

http://stackoverflow.com/questions/5235464/how-to-make-microsoft-xmlhttprequest-honor-cache-control-directive

the page is indeed cached and you can disable caching with

> Option Explicit
> Dim xmlHTTP, fso, fOut, startTime, endTime, webpage, webfile, i
> webpage = "http://econpy.pythonanywhere.com/ex/001.html"
> webfile = "D:\econpy001.html"
> startTime = Timer
> For i = 1 to 10
>  Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
>  xmlHTTP.Open "GET", webpage

   xmlHTTP.setRequestHeader "Cache-Control", "max-age=0"

>  xmlHTTP.Send
>  Set fso = CreateObject("Scripting.FileSystemObject")
>  Set fOut = fso.CreateTextFile(webfile, True)
>  fOut.WriteLine xmlHTTP.ResponseText
>  fOut.Close
>  Set fOut = Nothing
>  Set fso = Nothing
>  Set xmlHTTP = Nothing
> Next
> endTime = Timer
> wscript.echo "Finished VBScript in " & FormatNumber(endTime - startTime, 3) & " seconds"
> ---
> save it to a .vbs file and run it like this:
> $cscript /nologo filename.vbs
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: You gotta love a 2-line python solution
On 02/05/2016 04:39, DFS wrote:
> To save a webpage to a file:
> -
> 1. import urllib
> 2. urllib.urlretrieve("http://econpy.pythonanywhere.com/ex/001.html", "D:\file.html")
> -
> That's it!
>
> Coming from VB/A background, some of the stuff you can do with python -
> with ease - is amazing.
>
> VBScript version
> --
> 1.  Option Explicit
> 2.  Dim xmlHTTP, fso, fOut
> 3.  Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
> 4.  xmlHTTP.Open "GET", "http://econpy.pythonanywhere.com/ex/001.html"
> 5.  xmlHTTP.Send
> 6.  Set fso = CreateObject("Scripting.FileSystemObject")
> 7.  Set fOut = fso.CreateTextFile("D:\file.html", True)
> 8.  fOut.WriteLine xmlHTTP.ResponseText
> 9.  fOut.Close
> 10. Set fOut = Nothing
> 11. Set fso = Nothing
> 12. Set xmlHTTP = Nothing
> --
>
> Technically, that VBS will run with just lines 3-9, but that's still 6
> lines of code vs 2 for python.

It seems Python provides a higher-level solution compared with VBS. Python presumably also has to do those Opens and Sends, but they are hidden away inside urllib.urlretrieve.

You can do the same with VB just by wrapping up these lines in a subroutine, as you would if this had to be executed in a dozen different places, for example. Then you could just write:

getfile("http://econpy.pythonanywhere.com/ex/001.html", "D:/file.html")

in VBS too. (The forward slash in the file name ought to work.)

(I don't know VBS; I assume it does /have/ subroutines? What I haven't factored in here is error handling, which might yet require more coding in VBS compared with Python.)

-- 
Bartc
-- 
https://mail.python.org/mailman/listinfo/python-list
loading multiple module with same name using importlib.machinery.SourceFileLoader
I have observed this behaviour, for some reason only on OS X (and Python 3.5.1):

I use importlib.machinery.SourceFileLoader to load a long list of modules. The modules are not located in the loader path, and many of them have the same name, i.e. I would have:

m1 = importlib.machinery.SourceFileLoader("Module", "path/to/m1/Module.py")
m2 = importlib.machinery.SourceFileLoader("Module", "path/to/m2/Module.py")

Sometimes the modules will contain members from other modules with the same name, e.g. m1/Module.py would define a function "m1func" that does not exist in m2/Module.py, but the function would appear in m2, and examining m2.m1func.__code__.co_filename shows that it comes from m1. Members that are defined in both m1 and m2 are not overwritten, though.

Is this a bug in importlib.machinery.SourceFileLoader or are we in the Land of Undefined Behaviour here?
-- 
https://mail.python.org/mailman/listinfo/python-list
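Not a confirmed diagnosis, but the symptom is consistent with the loader re-executing source into an already-registered sys.modules["Module"] entry. A hedged workaround sketch (the paths are the hypothetical ones from the post): give each module a unique name so nothing gets reused.

import importlib.util

def load_unique(name, path):
    # Build a module from an explicit file location under a unique name,
    # so sys.modules never collides between the same-named files.
    spec = importlib.util.spec_from_file_location(name, path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod

m1 = load_unique("m1.Module", "path/to/m1/Module.py")
m2 = load_unique("m2.Module", "path/to/m2/Module.py")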
Re: You gotta love a 2-line python solution
BartC :

> On 02/05/2016 04:39, DFS wrote:
>> 2. urllib.urlretrieve("http://econpy.pythonanywhere.com
>> /ex/001.html","D:\file.html")
> [...]
>
> It seems Python provides a higher level solution compared with VBS.
> Python presumably also has to do those Opens and Sends, but they are
> hidden away inside urllib.urlretrieve.

Relevant questions include:

 * Is a solution available?

 * Is the solution well thought out?

Python does have a lot of great stuff available, which is nice. Unfortunately, many of the handy facilities are lacking in the well-thought-out department.

For example, the urlretrieve() function above blocks. You can't use it with the asyncio or select modules. You are left with:

   <https://docs.python.org/3/library/asyncio-stream.html#get-http-headers>

Database facilities are notorious offenders. Also, json.load and json.loads don't allow you to decode JSON in chunks.

If asyncio breaks through, I expect all blocking stdlib function calls to be adapted for it over the coming years. I'm not overly fond of the asyncio programming model, but it does sport two new killer features:

 * any blocking operation can be interrupted

 * events can be multiplexed

Marko
-- 
https://mail.python.org/mailman/listinfo/python-list
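For reference, the linked docs page boils down to a sketch like this (3.4-era coroutine syntax; the URL is just the small test page from the other thread):

import asyncio
import urllib.parse

@asyncio.coroutine
def print_headers(url):
    parsed = urllib.parse.urlsplit(url)
    reader, writer = yield from asyncio.open_connection(parsed.hostname, 80)
    query = "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (parsed.path, parsed.hostname)
    writer.write(query.encode("latin-1"))
    while True:
        line = yield from reader.readline()
        if not line or line == b"\r\n":
            break                          # end of the header block
        print(line.decode("latin-1").rstrip())
    writer.close()

loop = asyncio.get_event_loop()
loop.run_until_complete(
    print_headers("http://econpy.pythonanywhere.com/ex/001.html"))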
Re: You gotta love a 2-line python solution
On Mon, 2 May 2016 08:12 pm, Marko Rauhamaa wrote:

> For example, the urlretrieve() function above blocks. You can't use it
> with the asyncio or select modules.

The urlretrieve function is one of the oldest functions in the std library. It literally only exists because Guido was working on a computer somewhere, found that he didn't have wget, and decided it would be faster to write his own in Python than to download and install wget. And because this was very early in Python's history, the barrier to getting into the std lib was much lower, especially for stuff Guido wrote himself, so there it is.

These days, I doubt it would be included. It would probably be a recipe in the docs. Compared to a full-featured tool like wget or curl, urlretrieve is missing a lot of stuff which is considered essential, like limiting/configuring the rate, support for cookies and authentication, retrying on error, etc.

-- 
Steven
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Fastest way to retrieve and write html contents to file
On 2016-05-02 00:06, DFS wrote:
> Then I tested them in loops - the VBScript is MUCH faster: 0.44 for
> 10 iterations, vs 0.88 for python.

In addition to the other debugging recommendations in sibling threads, a couple other things to try:

1) use a local debugging proxy so that you can compare the headers to see if anything stands out

2) in light of #1, can you confirm/deny whether one client is using gzip compression and the other isn't? (A quick header check is sketched below.)

-tkc
-- 
https://mail.python.org/mailman/listinfo/python-list
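A hedged way to check point 2 without a proxy, using the requests library already in the test script - compare what the client offered with what the server actually sent back:

import requests

r = requests.get("http://econpy.pythonanywhere.com/ex/001.html")
print(r.request.headers.get("Accept-Encoding"))   # what this client offered
print(r.headers.get("Content-Encoding"))          # what the server used, if anything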
Re: Private message regarding: Howw to prevent the duplication of any value in a column within a CSV file (python)
On Mon, May 2, 2016 at 3:52 AM, Adam Davis wrote:
> Hi Ian,
>
> I'm really struggling to implement a set into my code as I'm a beginner,
> it's taking me a while to grasp the idea of it. If I was to show you my
> code so you get an idea of my aim/function of the code, would you be
> able to help me at all?

Sure, although I'd recommend posting it to the list so that others might also be able to help.
-- 
https://mail.python.org/mailman/listinfo/python-list
starting docker container messes up terminal settings
I am starting a docker container from a subprocess.Popen and it works, but when the script returns, the terminal settings of my shell are messed up. Nothing is echoed and return doesn't cause a newline. I can fix this with 'tset' in the terminal, but I don't want to require that. Has anyone here worked with docker and seen and solved this issue?
-- 
https://mail.python.org/mailman/listinfo/python-list
RE: starting docker container messes up terminal settings
> I am starting a docker container from a subprocess.Popen and it works,
> but when the script returns, the terminal settings of my shell are
> messed up. Nothing is echoed and return doesn't cause a newline. I can
> fix this with 'tset' in the terminal, but I don't want to require that.
> Has anyone here worked with docker and seen and solved this issue?

It is good to put the part of the code you think is causing the error (the Popen subprocess).

This email is confidential and may be subject to privilege. If you are not the intended recipient, please do not copy or disclose its content but contact the sender immediately upon receipt.
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: What should Python apps do when asked to show help?
On 2016-05-01, c...@zip.com.au wrote:
>> Didn't the OP specify that he was writing a command-line utility for
>> Linux/Unix?
>>
>> Discussing command line operation for Windows or OS-X seems rather
>> pointless.
>
> OS-X _is_ UNIX.  I spent almost all my time on this Mac in terminals.
> It is a very nice to use UNIX in many regards.

I include what you're doing under the category "Unix". When I talk about "OS X", I mean what my 84 year old mother is using. I assumed everybody thought that way. ;)

-- 
Grant Edwards               grant.b.edwards        Yow! If I am elected no one
                                  at               will ever have to do their
                              gmail.com            laundry again!
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: starting docker container messes up terminal settings
On Mon, May 2, 2016 at 10:08 AM, Joaquin Alzola wrote:
>> I am starting a docker container from a subprocess.Popen and it works,
>> but when the script returns, the terminal settings of my shell are
>> messed up. Nothing is echoed and return doesn't cause a newline. I can
>> fix this with 'tset' in the terminal, but I don't want to require that.
>> Has anyone here worked with docker and seen and solved this issue?
>
> It is good to put part of the code you think is causing the error (Popen
> subprocess)

cmd = ['sudo', 'docker', 'run', '-t', '-i', 'elucidbio/capdata:v2', 'bash']
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
-- 
https://mail.python.org/mailman/listinfo/python-list
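One hedged workaround (not tested against docker itself): snapshot the terminal attributes before launching the container and restore them afterward, which is roughly what 'tset' repairs by hand. Another thing worth trying is dropping '-t' so docker doesn't allocate a pseudo-tty while stdout is a pipe.

import subprocess
import sys
import termios

fd = sys.stdin.fileno()
saved = termios.tcgetattr(fd)              # snapshot current settings
try:
    cmd = ['sudo', 'docker', 'run', '-t', '-i', 'elucidbio/capdata:v2', 'bash']
    subprocess.call(cmd)
finally:
    termios.tcsetattr(fd, termios.TCSADRAIN, saved)   # restore settings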
Re: You gotta love a 2-line python solution
On 5/2/2016 5:26 AM, BartC wrote:
> On 02/05/2016 04:39, DFS wrote:
>> To save a webpage to a file:
>> -
>> 1. import urllib
>> 2. urllib.urlretrieve("http://econpy.pythonanywhere.com/ex/001.html", "D:\file.html")
>> -
>> That's it!
>>
>> Coming from VB/A background, some of the stuff you can do with python -
>> with ease - is amazing.
>>
>> VBScript version
>> --
>> 1.  Option Explicit
>> 2.  Dim xmlHTTP, fso, fOut
>> 3.  Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
>> 4.  xmlHTTP.Open "GET", "http://econpy.pythonanywhere.com/ex/001.html"
>> 5.  xmlHTTP.Send
>> 6.  Set fso = CreateObject("Scripting.FileSystemObject")
>> 7.  Set fOut = fso.CreateTextFile("D:\file.html", True)
>> 8.  fOut.WriteLine xmlHTTP.ResponseText
>> 9.  fOut.Close
>> 10. Set fOut = Nothing
>> 11. Set fso = Nothing
>> 12. Set xmlHTTP = Nothing
>> --
>>
>> Technically, that VBS will run with just lines 3-9, but that's still 6
>> lines of code vs 2 for python.
>
> It seems Python provides a higher level solution compared with VBS.
> Python presumably also has to do those Opens and Sends, but they are
> hidden away inside urllib.urlretrieve.
>
> You can do the same with VB just by wrapping up these lines in a
> subroutine. As you would if this had to be executed in a dozen
> different places for example. Then you could just write:
>
> getfile("http://econpy.pythonanywhere.com/ex/001.html", "D:/file.html")
>
> in VBS too. (The forward slash in the file name ought to work.)

Of course. Taken to its extreme, I could eventually replace you with one line of code :)

But python does it for me. That would save me 8 lines...

> (I don't know VBS; I assume it does /have/ subroutines? What I haven't
> factored in here is error handling which might yet require more coding
> in VBS compared with Python)

Yeah, VBS has subs and functions. And strange, limited error handling. And a single data type, called Variant. But it's installed with Windows so it's easy to get going with.
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: You gotta love a 2-line python solution
On Mon, May 2, 2016 at 11:15 AM, DFS wrote:
> Of course. Taken to its extreme, I could eventually replace you with one
> line of code :)

That reminds me of something I heard many years ago:

Every non-trivial program can be simplified by at least one line of code.
Every non-trivial program has at least one bug.

Therefore every non-trivial program can be reduced to one line of code with a bug.
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Python3 html scraper that supports javascript
I tried to use the following code:

from bs4 import BeautifulSoup
from selenium import webdriver

PHANTOMJS_PATH = 'C:\\Users\\Zoran\\Downloads\\Obrisi\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe'
url = 'https://hrti.hrt.hr/#/video/show/2203605/trebizat-prica-o-jednoj-vodi-i-jednom-narodu-dokumentarni-film'

browser = webdriver.PhantomJS(PHANTOMJS_PATH)
browser.get(url)
soup = BeautifulSoup(browser.page_source, "html.parser")
x = soup.prettify()
print(x)

When I print the x variable, I would expect to see something like this:

<video src="mediasource:https://hrti.hrt.hr/2e9e9c45-aa23-4d08-9055-cd2d7f2c4d58" id="vjs_video_3_html5_api" class="vjs-tech" preload="none">
  <source type="application/x-mpegURL" src="https://prd-hrt.spectar.tv/player/get_smil/id/2203605/video_id/2203605/token/Cny6ga5VEQSJ2uZaD2G8pg/token_expiration/1462043309/asset_type/Movie/playlist_template/nginx/channel_name/trebiat__pria_o_jednoj_vodi_i_jednom_narodu_dokumentarni_film/playlist.m3u8?foo=bar">
</video>

but I can't come to that point.

Regards.
-- 
https://mail.python.org/mailman/listinfo/python-list
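One hedged guess at the cause: the player markup is injected by JavaScript after the initial load, so page_source is read too early. A sketch with an explicit wait (the 'video' tag name is an assumption about the page, not something verified here):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

PHANTOMJS_PATH = 'C:\\Users\\Zoran\\Downloads\\Obrisi\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe'
url = 'https://hrti.hrt.hr/#/video/show/2203605/trebizat-prica-o-jednoj-vodi-i-jednom-narodu-dokumentarni-film'

browser = webdriver.PhantomJS(PHANTOMJS_PATH)
browser.get(url)
# Block up to 30 seconds until a <video> element appears in the DOM.
WebDriverWait(browser, 30).until(
    EC.presence_of_element_located((By.TAG_NAME, "video")))
html = browser.page_source   # now includes the JS-rendered player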
Re: You gotta love a 2-line python solution
On 05/02/16 at 11:24am, Larry Martell wrote:
> That reminds me of something I heard many years ago.
>
> Every non-trivial program can be simplified by at least one line of code.
> Every non-trivial program has at least one bug.
>
> Therefore every non-trivial program can be reduced to one line of code
> with a bug.

Well, not really. Every non-trivial program can be reduced to one line of code, but then the resulting program is not non-trivial (as it cannot be further reduced), and therefore there are no guarantees that it will have a bug.

M
-- 
https://mail.python.org/mailman/listinfo/python-list
Best way to clean up list items?
Have: list1 = ['\r\n Item 1 ',' Item 2 ','\r\n ']
Want: list1 = ['Item 1','Item 2']

I wrote this, which works fine, but maybe it can be tidier?

1. list2 = [t.replace("\r\n", "") for t in list1]   #remove \r\n
2. list3 = [t.strip(' ') for t in list2]            #trim whitespace
3. list1 = filter(None, list3)                      #remove empty items

After each step:

1. list2 = [' Item 1 ',' Item 2 ',' ']   #remove \r\n
2. list3 = ['Item 1','Item 2','']        #trim whitespace
3. list1 = ['Item 1','Item 2']           #remove empty items

Thanks!
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Python3 html scraper that supports javascript
On 5/2/2016 11:33 AM, zljubi...@gmail.com wrote:
> I tried to use the following code:
>
> from bs4 import BeautifulSoup
> from selenium import webdriver
>
> PHANTOMJS_PATH = 'C:\\Users\\Zoran\\Downloads\\Obrisi\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe'
> url = 'https://hrti.hrt.hr/#/video/show/2203605/trebizat-prica-o-jednoj-vodi-i-jednom-narodu-dokumentarni-film'
>
> browser = webdriver.PhantomJS(PHANTOMJS_PATH)
> browser.get(url)
> soup = BeautifulSoup(browser.page_source, "html.parser")
> x = soup.prettify()
> print(x)
>
> When I print the x variable, I would expect to see something like this:
>
> [...]
>
> but I can't come to that point.

I was doing something similar recently. Try this:

f = open(somefilename)
soup = BeautifulSoup.BeautifulSoup(f)
f.close()
print soup.prettify()
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Best way to clean up list items?
DFS writes:

> Have: list1 = ['\r\n Item 1 ',' Item 2 ','\r\n ']
> Want: list1 = ['Item 1','Item 2']
>
> I wrote this, which works fine, but maybe it can be tidier?
>
> 1. list2 = [t.replace("\r\n", "") for t in list1]   #remove \r\n
> 2. list3 = [t.strip(' ') for t in list2]            #trim whitespace
> 3. list1 = filter(None, list3)                      #remove empty items
>
> After each step:
>
> 1. list2 = [' Item 1 ',' Item 2 ',' ']   #remove \r\n
> 2. list3 = ['Item 1','Item 2','']        #trim whitespace
> 3. list1 = ['Item 1','Item 2']           #remove empty items

Try filter(None, (t.strip() for t in list1)). With no argument, strip() removes all leading and trailing whitespace, including \r\n, so your first two steps collapse into one.

Funny-looking data you have.
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Best way to clean up list items?
On May 2, 2016 10:03 AM, "Jussi Piitulainen" wrote:
> DFS writes:
>
> > Have: list1 = ['\r\n Item 1 ',' Item 2 ','\r\n ']
> > Want: list1 = ['Item 1','Item 2']
> >
> > I wrote this, which works fine, but maybe it can be tidier?
> >
> > 1. list2 = [t.replace("\r\n", "") for t in list1]   #remove \r\n
> > 2. list3 = [t.strip(' ') for t in list2]            #trim whitespace
> > 3. list1 = filter(None, list3)                      #remove empty items
> >
> > After each step:
> >
> > 1. list2 = [' Item 1 ',' Item 2 ',' ']   #remove \r\n
> > 2. list3 = ['Item 1','Item 2','']        #trim whitespace
> > 3. list1 = ['Item 1','Item 2']           #remove empty items

You could also try a compiled regex to remove unwanted characters, then loop through the list and do a replace for each item.
-- 
https://mail.python.org/mailman/listinfo/python-list
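A minimal sketch of that regex idea - one compiled pattern collapses the \r\n and stray runs of spaces in a single pass, then emptied items are filtered out:

import re

ws = re.compile(r'\s+')   # any run of whitespace, including \r\n

list1 = ['\r\n Item 1 ', ' Item 2 ', '\r\n ']
cleaned = [ws.sub(' ', t).strip() for t in list1]
cleaned = [t for t in cleaned if t]   # drop items that became empty
print(cleaned)                        # ['Item 1', 'Item 2']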
Re: Best way to clean up list items?
On Mon, May 2, 2016, at 09:33 AM, DFS wrote:
> Have: list1 = ['\r\n Item 1 ',' Item 2 ','\r\n ']

I'm curious how you got to this point, it seems like you can solve the problem in how this is generated.

> Want: list1 = ['Item 1','Item 2']

That said:

list1 = [t.strip() for t in list1 if t and not t.isspace()]

-- 
Stephen Hansen
  m e @ i x o k a i . i o
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Best way to clean up list items?
DFS wrote:

> Have: list1 = ['\r\n Item 1 ',' Item 2 ','\r\n ']
> Want: list1 = ['Item 1','Item 2']
>
> I wrote this, which works fine, but maybe it can be tidier?
>
> 1. list2 = [t.replace("\r\n", "") for t in list1]   #remove \r\n
> 2. list3 = [t.strip(' ') for t in list2]            #trim whitespace
> 3. list1 = filter(None, list3)                      #remove empty items
>
> After each step:
>
> 1. list2 = [' Item 1 ',' Item 2 ',' ']   #remove \r\n
> 2. list3 = ['Item 1','Item 2','']        #trim whitespace
> 3. list1 = ['Item 1','Item 2']           #remove empty items
>
> Thanks!

s.strip() strips all whitespace, so you can combine steps 1 and 2:

>>> items = ['\r\n Item 1 ',' Item 2 ','\r\n ']
>>> stripped = (s.strip() for s in items)

The (...) instead of [...] denote a generator expression, so the iteration has not started yet. The final step uses a list comprehension instead of filter():

>>> [s for s in stripped if s]
['Item 1', 'Item 2']

That way the same code works with both Python 2 and Python 3. Note that you can iterate over the generator expression only once; if you try it again you'll end up empty-handed:

>>> [s for s in stripped if s]
[]

If you want to do it in one step, here are two options that both involve some duplicate work:

>>> [s.strip() for s in items if s and not s.isspace()]
['Item 1', 'Item 2']
>>> [s.strip() for s in items if s.strip()]
['Item 1', 'Item 2']

-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Python3 html scraper that supports javascript
On Mon, May 2, 2016, at 08:33 AM, zljubi...@gmail.com wrote:
> I tried to use the following code:
>
> from bs4 import BeautifulSoup
> from selenium import webdriver
>
> PHANTOMJS_PATH = 'C:\\Users\\Zoran\\Downloads\\Obrisi\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe'
> url = 'https://hrti.hrt.hr/#/video/show/2203605/trebizat-prica-o-jednoj-vodi-i-jednom-narodu-dokumentarni-film'
>
> browser = webdriver.PhantomJS(PHANTOMJS_PATH)
> browser.get(url)
> soup = BeautifulSoup(browser.page_source, "html.parser")
> x = soup.prettify()
> print(x)
>
> When I print the x variable, I would expect to see something like this:
>
> [...]
>
> but I can't come to that point.

Why? As important as it is to show code, you need to show what actually happens and what error message is produced.

-- 
Stephen Hansen
  m e @ i x o k a i . i o
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Best way to clean up list items?
On 5/2/2016 1:25 PM, Stephen Hansen wrote:
> On Mon, May 2, 2016, at 09:33 AM, DFS wrote:
>> Have: list1 = ['\r\n Item 1 ',' Item 2 ','\r\n ']
>
> I'm curious how you got to this point, it seems like you can solve the
> problem in how this is generated.

from lxml import html
import requests

webpage = "http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=TN&wqhqn=2&qc=Nashville&rg=30&qhqn=restaurant&sb=zipdisc&ap=2"
page = requests.get(webpage)
tree = html.fromstring(page.content)
addr1 = tree.xpath('//span[@class="text3"]/text()')
print 'Addresses: ', addr1

I'd prefer to get clean data in the first place, but I don't know a better way to extract it from the HTML.
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Best way to clean up list items?
On 5/2/2016 12:57 PM, Jussi Piitulainen wrote:
> DFS writes:
>
>> Have: list1 = ['\r\n Item 1 ',' Item 2 ','\r\n ']
>> Want: list1 = ['Item 1','Item 2']
>>
>> I wrote this, which works fine, but maybe it can be tidier?
>>
>> 1. list2 = [t.replace("\r\n", "") for t in list1]   #remove \r\n
>> 2. list3 = [t.strip(' ') for t in list2]            #trim whitespace
>> 3. list1 = filter(None, list3)                      #remove empty items
>
> Try filter(None, (t.strip() for t in list1)). The default.

Works and drops a line of code. Thx.

> Funny-looking data you have.

I know - sadly, it's actual data:

from lxml import html
import requests

webpage = "http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=TN&wqhqn=2&qc=Nashville&rg=30&qhqn=restaurant&sb=zipdisc&ap=2"
page = requests.get(webpage)
tree = html.fromstring(page.content)
addr1 = tree.xpath('//span[@class="text3"]/text()')
print 'Addresses: ', addr1

I couldn't figure out a better way to extract it from the HTML (maybe XML and DOM?)
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Best way to clean up list items?
On Mon, May 2, 2016, at 11:09 AM, DFS wrote:
> I'd prefer to get clean data in the first place, but I don't know a
> better way to extract it from the HTML.

Ah, right. I didn't know you were scraping HTML. Scraping HTML is rarely clean, so you have to do a lot of cleanup.

-- 
Stephen Hansen
  m e @ i x o k a i . i o
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Best way to clean up list items?
DFS writes:

> On 5/2/2016 12:57 PM, Jussi Piitulainen wrote:
>> DFS writes:
>>
>>> Have: list1 = ['\r\n Item 1 ',' Item 2 ','\r\n ']
>>> Want: list1 = ['Item 1','Item 2']
.
.
>> Funny-looking data you have.
>
> I know - sadly, it's actual data:
>
> from lxml import html
> import requests
>
> webpage = "http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=TN&wqhqn=2&qc=Nashville&rg=30&qhqn=restaurant&sb=zipdisc&ap=2"
> page = requests.get(webpage)
> tree = html.fromstring(page.content)
> addr1 = tree.xpath('//span[@class="text3"]/text()')
> print 'Addresses: ', addr1
>
> I couldn't figure out a better way to extract it from the HTML (maybe
> XML and DOM?)

I should have guessed :) But now I'm a bit worried about those spaces inside your items. Can it happen that item text is split into strings in the middle? Then the above sanitation does the wrong thing.

If someone has the right solution, I'm watching, too.
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Best way to clean up list items?
On 5/2/2016 2:27 PM, Jussi Piitulainen wrote:
> I should have guessed :) But now I'm a bit worried about those spaces
> inside your items. Can it happen that item text is split into strings
> in the middle?

Meaning split by me, or comes 'malformed' from the data source?

> Then the above sanitation does the wrong thing. If someone has the
> right solution, I'm watching, too.

Here's the raw data as stored in the tree:

--- 1st page
['\r\n', '\r\n1918 W End Ave, Nashville, TN 37203', '\r\n ', '\r\n1806 Hayes St, Nashville, TN 37203', '\r\n', '\r\n 1701 Broadway, Nashville, TN 37203', '\r\n', '\r\n 209 10th Ave S, Nashville, TN 37203', '\r\n ', '\r\n907 20th Ave S, Nashville, TN 37212', '\r\n', '\r\n911 20th Ave S, Nashville, TN 37212', '\r\n', '\r\n 1722 W End Ave, Nashville, TN 37203', '\r\n ', '\r\n1905 Hayes St, Nashville, TN 37203', '\r\n ', '\r\n2000 W End Ave, Nashville, TN 37203']

--- Next page
['\r\n', '\r\n120 19th Ave N, Nashville, TN 37203', '\r\n ', '\r\n1719 W End Ave Ste 101, Nashville, TN 37203', '\r\n ', '\r\n1922 W End Ave, Nashville, TN 37203', '\r\n', '\r\n 909 20th Ave S, Nashville, TN 37212', '\r\n ', '\r\n 1807 Church St, Nashville, TN 37203', '\r\n ', '\r\n1721 Church St, Nashville, TN 37203', '\r\n', '\r\n718 Division St, Nashville, TN 37203', '\r\n', '\r\n 907 12th Ave S, Nashville, TN 37203', '\r\n ', '\r\n204 21st Ave S, Nashville, TN 37203', '\r\n ', '\r\n1811 Division St, Nashville, TN 37203', '\r\n', '\r\n 903 Gleaves St, Nashville, TN 37203', '\r\n', '\r\n 1720 W End Ave Ste 530, Nashville, TN 37203', '\r\n ', '\r\n 1200 Division St Ste 100-A, Nashville, TN 37203', '\r\n ', '\r\n 422 7th Ave S, Nashville, TN 37203', '\r\n', '\r\n605 8th Ave S, Nashville, TN 37203']

and so on
---

I've checked a couple hundred addresses visually, and so far I've only seen 2 formats:

1. '\r\n'
2. '\r\n address '
-- 
https://mail.python.org/mailman/listinfo/python-list
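On the split-text worry: a hedged alternative is to select the span elements themselves rather than their text nodes, and let lxml's text_content() join whatever pieces each span holds, so an address broken across several text nodes still comes back as one string:

from lxml import html
import requests

webpage = "http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=TN&wqhqn=2&qc=Nashville&rg=30&qhqn=restaurant&sb=zipdisc&ap=2"
page = requests.get(webpage)
tree = html.fromstring(page.content)

# One string per span, regardless of how many text nodes it contains.
addrs = [span.text_content().strip()
         for span in tree.xpath('//span[@class="text3"]')]
addrs = [a for a in addrs if a]   # drop the whitespace-only spans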
Re: Python3 html scraper that supports javascript
> Why? As important as it is to show code, you need to show what actually
> happens and what error message is produced.

If you run the code, you will see that the html I got doesn't have the link to the flash video. I should somehow do something (press the play video button, maybe) in order to get the html with the reference to the video file on this page.

Regards
-- 
https://mail.python.org/mailman/listinfo/python-list
Need help understanding list structure
I've been using an old text parsing library and have been able to accomplish most of what I wanted to do. But I don't understand the list structure it uses well enough to build additional methods.

If I print the list, it has thousands of elements within its brackets separated by commas, as I would expect. But the elements appear to be memory pointers, not the actual text. Here's an example:

[<gedcom.Element object at 0x...>, <gedcom.Element object at 0x...>, ...]

If I iterate over the list, I do get the actual text of each element and am able to use it.

Also, if I iterate over the list and place each element in a new list using append, then each element in the new list is the text I expect, not memory pointers.

But... if I copy the old list to a new list using new = old[:] or new = list(old), the new list is exactly like the original with memory pointers.

Can someone help me understand why or under what circumstances a list shows pointers instead of the text data?
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Need help understanding list structure
On 02/05/16 22:30, moa47...@gmail.com wrote:
> Can someone help me understand why or under what circumstances a list
> shows pointers instead of the text data?

When Python's "print" statement/function is invoked, it will print the textual representation of the object according to its class's __str__ or __repr__ method. That is, the print function prints out whatever text the class says it should.

For classes which don't implement a __str__ or __repr__ method, the text "<CLASS object at ADDRESS>" is used - where CLASS is the class name and ADDRESS is the "memory pointer".

> If I iterate over the list, I do get the actual text of each element
> and am able to use it.
>
> Also, if I iterate over the list and place each element in a new list
> using append, then each element in the new list is the text I expect
> not memory pointers.

Look at the __iter__ method of the class of the object you are iterating over. I suspect that it returns string objects, not the objects that are in the list itself.

String objects have a __str__ or __repr__ method that represents them as the text, so that is what 'print' will output.

Hope that helps, E.
-- 
https://mail.python.org/mailman/listinfo/python-list
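A tiny demonstration of the point: the same kind of list prints very differently depending on whether the element class defines __repr__.

class Plain(object):
    def __init__(self, text):
        self.text = text          # no __repr__, so the default is used

class Nice(Plain):
    def __repr__(self):
        return "Nice(%r)" % self.text

print([Plain("a")])   # [<__main__.Plain object at 0x...>]
print([Nice("a")])    # [Nice('a')]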
Re: Need help understanding list structure
> When Python's "print" statement/function is invoked, it will print the
> textual representation of the object according to its class's __str__
> or __repr__ method. That is, the print function prints out whatever
> text the class says it should.
>
> For classes which don't implement a __str__ or __repr__ method, the
> text "<CLASS object at ADDRESS>" is used - where CLASS is the class
> name and ADDRESS is the "memory pointer".
>
> Look at the __iter__ method of the class of the object you are
> iterating over. I suspect that it returns string objects, not the
> objects that are in the list itself.
>
> String objects have a __str__ or __repr__ method that represents them
> as the text, so that is what 'print' will output.
>
> Hope that helps, E.

Yes, that does help. You're right. The author of the library I'm using didn't implement either a __str__ or __repr__ method.

Am I correct in assuming that parsing a large text file would be quicker returning pointers instead of strings? I've never run into this before.
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: What should Python apps do when asked to show help?
On 02May2016 14:07, Grant Edwards wrote:
> On 2016-05-01, c...@zip.com.au wrote:
>>> Didn't the OP specify that he was writing a command-line utility for
>>> Linux/Unix?
>>>
>>> Discussing command line operation for Windows or OS-X seems rather
>>> pointless.
>>
>> OS-X _is_ UNIX. I spent almost all my time on this Mac in terminals.
>> It is a very nice to use UNIX in many regards.
>
> I include what you're doing under the category "Unix". When I talk
> about "OS X", I mean what my 84 year old mother is using. I assumed
> everybody thought that way. ;)

Weird. My 79 year old mother uses "Apple". I can only presume there's no "OS X" for her.

Cheers, Cameron Simpson
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Need help understanding list structure
On 05/02/2016 04:33 PM, moa47...@gmail.com wrote:
> Yes, that does help. You're right. The author of the library I'm
> using didn't implement either a __str__ or __repr__ method. Am I
> correct in assuming that parsing a large text file would be quicker
> returning pointers instead of strings? I've never run into this
> before.

I'm not sure what you mean by "returning pointers." The list isn't returning pointers. It's a list of *objects*. To be specific, a list of gedcom.Element objects, though they could be anything, including numbers or strings. If you refer to the source code where the Element class is defined you can see what these objects contain. I suspect they contain a lot more information than simply text.

Lists of objects is a common idiom in Python. As you've discovered, if you shallow copy a list, the new list will contain the exact same objects. In many cases, this does not matter - for example, with a list of numbers, which are immutable objects, it doesn't matter that the instances are shared, since the instances themselves will never change. If the objects are mutable, as they are in your case, a shallow copy may not always be what you want.

As to your question: a list never shows "pointers" as you say. A list always contains objects, and if you simply "print" the list, it will try to show a representation of the list, using the objects' repr dunder methods. Some classes I have used have their repr methods print out what the constructor would look like, if you were to construct the object yourself, which is incredibly useful. If I recall, this is what BeautifulSoup objects do.

In your case, as Erik said, the objects you are dealing with don't provide repr dunder methods, so Python just lets you know they are objects of a certain class, and what their ids are, which is helpful if you're trying to determine if two objects are the same object. These are not "pointers" in the sense you're talking. You'll get text if the object prints text for you. This is true of any object you might store in the list.

I hope this helps a bit. Exploring from the interactive prompt as you are doing is very useful, once you understand what it's saying to you.
-- 
https://mail.python.org/mailman/listinfo/python-list
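A short illustration of the shallow-copy point: both copy styles make a new list, but the elements inside are the very same objects.

old = [["a"], ["b"]]
new1 = old[:]
new2 = list(old)
print(new1 is old)         # False - the list itself is new
print(new1[0] is old[0])   # True  - the elements are shared
new1[0].append("c")
print(old[0])              # ['a', 'c'] - mutation shows through the copy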
Re: Need help understanding list structure
moa47...@gmail.com writes:

> Am I correct in assuming that parsing a large text file would be
> quicker returning pointers instead of strings?

What do you mean by "return a pointer"? Python doesn't have pointers. In the Python language, a container type (such as 'set', 'list', 'dict', etc.) contains the objects directly. There are no "pointers" there; by accessing the items of a container, you access the items directly.

What do you mean by "would be quicker"? I am concerned you are seeking speed of the program at the expense of understandability and clarity of the code.

Instead, you should be writing clear, maintainable code. *Only if* the clear, maintainable code you write then actually ends up being too slow should you then worry about what parts are quick or slow, by *measuring* the specific parts of code to discover what is actually occupying the time.

-- 
 \     "All television is educational television. The question is:   |
  `\   what is it teaching?" —Nicholas Johnson                        |
_o__)                                                                 |
Ben Finney
-- 
https://mail.python.org/mailman/listinfo/python-list
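If it does come to measuring, the stdlib timeit module times one specific piece of code in isolation; a minimal sketch:

import timeit

cost = timeit.timeit("new = old[:]",
                     setup="old = list(range(1000))",
                     number=10000)
print("%.6f seconds for 10000 shallow copies" % cost)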
Re: You gotta love a 2-line python solution
DFS at 2016/5/2 UTC+8 11:39:33AM wrote:
> To save a webpage to a file:
> -
> 1. import urllib
> 2. urllib.urlretrieve("http://econpy.pythonanywhere.com/ex/001.html", "D:\file.html")
> -
>
> That's it!

Why can't my system do it?

Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 19:28:18) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib import urlretrieve
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'urlretrieve'
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: You gotta love a 2-line python solution
On 5/2/2016 8:45 PM, jf...@ms4.hinet.net wrote:
> Why can't my system do it?
>
> Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 19:28:18) [MSC v.1600 32 bit (Intel)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from urllib import urlretrieve
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ImportError: cannot import name 'urlretrieve'

try

from urllib.request import urlretrieve

http://stackoverflow.com/questions/21171718/urllib-urlretrieve-file-python-3-3

I'm running python 2.7.11 (32-bit).
-- 
https://mail.python.org/mailman/listinfo/python-list
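For a script that has to run on both versions, a version-agnostic import keeps the original two-liner working everywhere:

try:
    from urllib.request import urlretrieve   # Python 3
except ImportError:
    from urllib import urlretrieve           # Python 2

urlretrieve("http://econpy.pythonanywhere.com/ex/001.html", "D:\\file.html")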
Re: Fastest way to retrieve and write html contents to file
On 5/2/2016 4:42 AM, Peter Otten wrote:
> DFS wrote:
>>> Is VB using a local web cache, and Python not?
>>
>> I'm not specifying a local web cache with either (wouldn't know how or
>> where to look).  If you have Windows, you can try it.
>
> I don't have Windows, but if I'm to believe
>
> http://stackoverflow.com/questions/5235464/how-to-make-microsoft-xmlhttprequest-honor-cache-control-directive
>
> the page is indeed cached and you can disable caching with
>
>> xmlHTTP.Open "GET", webpage
>
>    xmlHTTP.setRequestHeader "Cache-Control", "max-age=0"

Tried that, and from later on that stackoverflow page:

xmlHTTP.setRequestHeader "Cache-Control", "private"

Neither made a difference. In fact, I saw faster times than ever - as low as 0.41 for 10 loops.
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Fastest way to retrieve and write html contents to file
On 5/2/2016 3:19 AM, Chris Angelico wrote:
> There's an easier way to test if there's caching happening. Just crank
> the iterations up from 10 to 100 and see what happens to the times. If
> your numbers are perfectly fair, they should be perfectly linear in
> the iteration count; eg a 1.8 second ten-iteration loop should become
> an 18 second hundred-iteration loop. Obviously they won't be exactly
> that, but I would expect them to be reasonably close (eg 17-19
> seconds, but not 2 seconds).

100 loops
Finished VBScript in 3.953 seconds
Finished VBScript in 3.608 seconds
Finished VBScript in 3.610 seconds

Bit of a per-loop speedup going from 10 to 100.

> [...]
>
> Networking has about four billion variables in it. You're messing with
> one of the least significant: the programming language :)
>
> ChrisA

Thanks for the good feedback.
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Fastest way to retrieve and write html contents to file
On Tue, May 3, 2016 at 11:51 AM, DFS wrote:
> On 5/2/2016 3:19 AM, Chris Angelico wrote:
>> There's an easier way to test if there's caching happening. Just crank
>> the iterations up from 10 to 100 and see what happens to the times. If
>> your numbers are perfectly fair, they should be perfectly linear in
>> the iteration count; eg a 1.8 second ten-iteration loop should become
>> an 18 second hundred-iteration loop. Obviously they won't be exactly
>> that, but I would expect them to be reasonably close (eg 17-19
>> seconds, but not 2 seconds).
>
> 100 loops
> Finished VBScript in 3.953 seconds
> Finished VBScript in 3.608 seconds
> Finished VBScript in 3.610 seconds
>
> Bit of a per-loop speedup going from 10 to 100.

How many seconds was it for 10 loops?

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Fastest way to retrieve and write html contents to file
On 5/2/2016 10:00 PM, Chris Angelico wrote:
> On Tue, May 3, 2016 at 11:51 AM, DFS wrote:
>> On 5/2/2016 3:19 AM, Chris Angelico wrote:
>>> [...]
>>
>> 100 loops
>> Finished VBScript in 3.953 seconds
>> Finished VBScript in 3.608 seconds
>> Finished VBScript in 3.610 seconds
>>
>> Bit of a per-loop speedup going from 10 to 100.
>
> How many seconds was it for 10 loops?

~0.44
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: You gotta love a 2-line python solution
DFS at 2016/5/3 9:12:24AM wrote:
> try
>
> from urllib.request import urlretrieve
>
> http://stackoverflow.com/questions/21171718/urllib-urlretrieve-file-python-3-3
>
> I'm running python 2.7.11 (32-bit)

Alright, it works... somewhat. I tried to get a zip file. It works; the file can be unzipped correctly.

>>> from urllib.request import urlretrieve
>>> urlretrieve("http://www.caprilion.com.tw/fed.zip", "d:\\temp\\temp.zip")
('d:\\temp\\temp.zip', <http.client.HTTPMessage object at 0x...>)
>>>

But when I try to get this forum page, it does get a html file, but it can't be viewed normally.

>>> urlretrieve("https://groups.google.com/forum/#!topic/comp.lang.python/jFl3GJbmR7A", "d:\\temp\\temp.html")
('d:\\temp\\temp.html', <http.client.HTTPMessage object at 0x...>)
>>>

I suppose the html is a much more complex situation where more processing needs to be done before it can be opened by a web browser:-)
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: You gotta love a 2-line python solution
On Mon, May 2, 2016, at 08:27 PM, jf...@ms4.hinet.net wrote:
> But when I try to get this forum page, it does get a html file but can't
> be viewed normally.

What does that mean?

-- 
Stephen Hansen
  m e @ i x o k a i . i o
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: You gotta love a 2-line python solution
On 5/2/2016 11:27 PM, jf...@ms4.hinet.net wrote:
> DFS at 2016/5/3 9:12:24AM wrote:
>> try
>>
>> from urllib.request import urlretrieve
>>
>> http://stackoverflow.com/questions/21171718/urllib-urlretrieve-file-python-3-3
>>
>> I'm running python 2.7.11 (32-bit)
>
> Alright, it works... somewhat. I tried to get a zip file. It works; the
> file can be unzipped correctly.
>
> But when I try to get this forum page, it does get a html file, but it
> can't be viewed normally.
>
> I suppose the html is a much more complex situation where more
> processing needs to be done before it can be opened by a web browser:-)

Who knows what Google has done... it won't open in Opera. The tab title shows up, but after 20-30 seconds the screen just stays blank and the cursor quits loading.

It's a mess - try running it thru BeautifulSoup.prettify() and it looks better (Python 2, matching my 2.7.11 setup):

import urllib
import BeautifulSoup

webfile = "D:\\afile.html"
urllib.urlretrieve("https://groups.google.com/forum/#!topic/comp.lang.python/jFl3GJbmR7A", webfile)
f = open(webfile)
soup = BeautifulSoup.BeautifulSoup(f)
f.close()
print soup.prettify()
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: You gotta love a 2-line python solution
Stephen Hansen at 2016/5/3 11:49:22AM wrote:
> On Mon, May 2, 2016, at 08:27 PM, jf...@ms4.hinet.net wrote:
>> But when I try to get this forum page, it does get a html file but can't
>> be viewed normally.
>
> What does that mean?

The page we are looking at:-)

https://groups.google.com/forum/#!topic/comp.lang.python/jFl3GJbmR7A
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Fastest way to retrieve and write html contents to file
On 05/02/2016 01:37 AM, DFS wrote:
> So python matches or beats VBScript at this much larger file.  Kewl.

If you download something large enough to be meaningful, you'll find the runtime speeds all converge to something reflecting your internet connection speed. Try downloading a 4 GB file, for example. You're trying to benchmark an io-bound operation. Once you move past the very small and meaningless examples that simply benchmark the overhead of building the connection, you'll find that all languages, even compiled languages like C, should run at the same speed on average. Neither VBS nor Python will be faster than the other.

Now if you want to talk about processing the data once you have it, there we can talk about speeds and optimization.
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Fastest way to retrieve and write html contents to file
On 5/3/2016 12:06 AM, Michael Torrie wrote:
> Now if you want to talk about processing the data once you have it,
> there we can talk about speeds and optimization.

Be glad to. Helps me learn python, so bring whatever challenge you want and I'll try to keep up.

One small comparison I was able to make was VBA vs python/pyodbc to summarize an Access database. Not quite a fair test, but interesting nonetheless.

---
Access 2003 file, Access 2003 VBA code
2,099,101 rows
114 tables (max row count = 600288)
971 columns
  text:      503
  numeric:   351
  date-time: 108
  binary:      5
  boolean:     4
309 indexes (25 foreign keys)
333,549,568 bytes on disk
Time: 0.18 seconds

---
same Access 2003 file, 32-bit python 2.7.11 + 32-bit pyodbc 3.0.6
2,099,101 rows
114 tables (max row count = 600288)
971 columns
  text:      503
  numeric:   351
  date-time: 108
  binary:      5
  boolean:     4
309 indexes (foreign keys not available via ODBC*)
333,549,568 bytes on disk
Time: 0.49 seconds

* the Access ODBC driver doesn't support the SQLForeignKeys function
---
-- 
https://mail.python.org/mailman/listinfo/python-list
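For anyone curious, the Python side of a summary like this can be gathered with pyodbc's catalog calls; a rough sketch (the .mdb path is hypothetical, and foreign keys are skipped for the reason noted above):

import pyodbc

conn = pyodbc.connect(
    r"DRIVER={Microsoft Access Driver (*.mdb)};DBQ=D:\data\sample.mdb")
cur = conn.cursor()

# Catalog calls: one row per user table, one row per column.
tables = [t.table_name for t in cur.tables(tableType="TABLE")]
total_rows = 0
total_cols = 0
for t in tables:
    total_cols += len(list(cur.columns(table=t)))
    total_rows += cur.execute("SELECT COUNT(*) FROM [%s]" % t).fetchone()[0]

print("%d tables, %d columns, %d rows" % (len(tables), total_cols, total_rows))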