Program inefficiency?
I wrote the following simple program to loop through our help files and fix some errors (in case you can't see the subtle RE search that's happening, we're replacing spaces in bookmarks with _'s). The program works great except for one thing: it's significantly slower through the later files in the search than through the early ones. Before anyone criticizes, I recognize that the middle section could be simplified with a for loop; I just haven't cleaned it up.

The problem is that the first 300 files take about 10-15 seconds and the last 300 take about 2 minutes. If we do more than about 1500 files in one run, it just hangs and never finishes.

Is there a solution here that I'm missing? What am I doing that is so inefficient?

# File: masseditor.py

import re
import os
import time

def massreplace():
    editfile = open("pathname\editfile.txt")
    filestring = editfile.read()
    filelist = filestring.splitlines()
##    errorcheck = re.compile('(a name=)+(.*)(-)+(.*)(>)+')
    for i in range(len(filelist)):
        source = open(filelist[i])
        starttext = source.read()
        interimtext = replacecycle(starttext)
        interimtext = replacecycle(interimtext)
        interimtext = replacecycle(interimtext)
        interimtext = replacecycle(interimtext)
        interimtext = replacecycle(interimtext)
        interimtext = replacecycle(interimtext)
        interimtext = replacecycle(interimtext)
        interimtext = replacecycle(interimtext)
        interimtext = replacecycle(interimtext)
        interimtext = replacecycle(interimtext)
        interimtext = replacecycle(interimtext)
        interimtext = replacecycle(interimtext)
        finaltext = replacecycle(interimtext)
        source.close()
        source = open(filelist[i],"w")
        source.write(finaltext)
        source.close()
##        if errorcheck.findall(finaltext)!=[]:
##            print errorcheck.findall(finaltext)
##            print filelist[i]
        if i == 100:
            print "done 100"
            print time.clock()
        elif i == 300:
            print "done 300"
            print time.clock()
        elif i == 600:
            print "done 600"
            print time.clock()
        elif i == 1000:
            print "done 1000"
            print time.clock()
    print "done"
    print i
    print time.clock()

def replacecycle(starttext):
    p1= re.compile('(href=|HREF=)+(.*)(#)+(.*)( )+(.*)(">)+')
    p2= re.compile('(name=")+(.*)( )+(.*)(">)+')
    p3= re.compile('(href=|HREF=)+(.*)(#)+(.*)(\')+(.*)(">)+')
    p4= re.compile('(name=")+(.*)(\')+(.*)(">)+')
    p5= re.compile('(href=|HREF=)+(.*)(#)+(.*)(-)+(.*)(">)+')
    p6= re.compile('(name=")+(.*)(-)+(.*)(">)+')
    p7= re.compile('(href=|HREF=)+(.*)(#)+(.*)(<)+(.*)(">)+')
    p8= re.compile('(name=")+(.*)(<)+(.*)(">)+')
    p7= re.compile('(href=|HREF=")+(.*)(#)+(.*)(:)+(.*)(">)+')
    p8= re.compile('(name=")+(.*)(:)+(.*)(">)+')
    p9= re.compile('(href=|HREF=")+(.*)(#)+(.*)(\?)+(.*)(">)+')
    p10= re.compile('(name=")+(.*)(\?)+(.*)(">)+')
    p100= re.compile('(a name=)+(.*)(-)+(.*)(>)+')
    q1= r"\1\2\3\4_\6\7"
    q2= r"\1\2_\4\5"
    interimtext = p1.sub(q1, starttext)
    interimtext = p2.sub(q2, interimtext)
    interimtext = p3.sub(q1, interimtext)
    interimtext = p4.sub(q2, interimtext)
    interimtext = p5.sub(q1, interimtext)
    interimtext = p6.sub(q2, interimtext)
    interimtext = p7.sub(q1, interimtext)
    interimtext = p8.sub(q2, interimtext)
    interimtext = p9.sub(q1, interimtext)
    interimtext = p10.sub(q2, interimtext)
    interimtext = p100.sub(q2, interimtext)
    return interimtext

massreplace()

-- http://mail.python.org/mailman/listinfo/python-list
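For what it's worth, the repeated passes could probably be avoided by matching each attribute value once and replacing every offending character inside it with a callback. What follows is only a sketch in Python 3, not the program from the post: it assumes the attribute values are double-quoted and sit on a single line, and the names BAD_CHARS, NAME_ATTR, HREF_ATTR and fix_bookmarks are made up for illustration.

```python
import re

# Characters the post replaces with "_": space, ', -, <, : and ?
BAD_CHARS = re.compile(r"[ '\-<:?]")

# [^"]* cannot run past the closing quote, so there is no need for
# several greedy (.*) groups or for repeated passes over the text.
NAME_ATTR = re.compile(r'(name=")([^"]*)(")', re.IGNORECASE)
# Only the fragment after "#" is rewritten, as in the original patterns.
HREF_ATTR = re.compile(r'(href="[^"#]*#)([^"]*)(")', re.IGNORECASE)


def _fix(match):
    """Rebuild the attribute with every bad character replaced by '_'."""
    return match.group(1) + BAD_CHARS.sub("_", match.group(2)) + match.group(3)


def fix_bookmarks(html):
    """Return html with bookmark names and link fragments cleaned up."""
    html = NAME_ATTR.sub(_fix, html)
    return HREF_ATTR.sub(_fix, html)


if __name__ == "__main__":
    print(fix_bookmarks('<a name="my book-mark">'))
    print(fix_bookmarks('<a href="page.htm#what\'s new?">x</a>'))
```

One caveat: NAME_ATTR would also touch other tags that carry a name attribute (a meta tag, say), so a real run would want a tighter pattern or a check on the tag.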
Re: Program inefficiency?
I did try moving the re.compile calls up and out of replacecycle(), but it didn't impact the time in any meaningful way (2 seconds maybe)...

I'm not sure what a shell+sed script is... I'm fairly new to Python and my only other coding experience is with VBA... This was my first Python program.

In case it helps... We started with only 6 loops of replacecycle() but had to keep adding progressively more as we found more and more links with lots of spaces in them... As we did that, the program's run time grew progressively longer, scaling multiplicatively with the number of cycles added... This is exactly what I would have expected, and it leads me to believe that the problem does not lie in the replacecycle() def but in the massreplace() def... *shrug*
Re: Program inefficiency?
XP is the OS... the files are split across a ton of subdirectories already...

I'm actually starting to think there's a problem with certain files, however... We create help files for clients using RoboHelp... RoboHelp has source HTML and then "webhelp" HTML, which is what actually goes to the client... I'm trying to mass-maintain the "source" files... Right now, my program works but you've got to delete the webhelp files first...

I figured that (based on the exponential growth in processing time) it was the additional number of files... However, after streamlining the code I got the following results with the webhelp deleted:

done 300
4.1904767226e-006
done 600
7.97062280262
done 900
22.3963802662
done 1200
29.9211888662
done 1375
35.3465962853

and with the webhelp intact:

done 300
4.1904767226e-006
done 600
7.6259175398
done 900
13.3994678095
still processing 10 minutes later

Since the second run stalled somewhere before file 1375 (and in fact still hasn't made it there), I can only assume that it hit one of the webhelp files and freaked out... The thing that's really weird is that the files it's hanging on appear to be some of the most basic files in the whole system (small, not a lot going on... no hits on the RE search)...

So I may just tell the users to delete the webhelp and have RoboHelp recreate it after they've run the program...
Re: Program inefficiency?
No swaps... memory usage is about 14k (these are small HTML files)... no hard drive cranking away or fan on my laptop going nutty... CPU usage isn't even pegged... That's what makes me think it's not some sort of bizarre memory leak... Unfortunately, it also means I'm out of ideas...
Re: Program inefficiency?
For anyone who cares, I figured out the "problem"... The webhelp files that it hits the wall on are the compiled search files... They are the only files in the system with RIDICULOUS line lengths... I'm looking at one right now that has 32767 characters all on one line... I'm absolutely certain that that's the problem...

Thanks for everyone's help.
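Patterns with several greedy (.*) groups can backtrack heavily on a very long line, which would fit what was seen here. One way to sidestep it, assuming the generated search files can simply be left alone, is to skip any file whose longest line is over some limit. This is a rough Python 3 sketch; the function names and the 5000-character threshold are made up for illustration.

```python
def longest_line(text):
    """Length of the longest line in text (0 for empty text)."""
    return max((len(line) for line in text.splitlines()), default=0)


def should_skip(text, limit=5000):
    """True when text has a line longer than limit characters.

    The limit is an arbitrary illustrative value, not something
    taken from the thread.
    """
    return longest_line(text) > limit


if __name__ == "__main__":
    print(should_skip("<p>short line</p>\n<p>another</p>"))
    print(should_skip("x" * 32767))
```

The main loop would call should_skip() on the text it has just read and move on to the next file when it returns True.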
Re: Program inefficiency?
The search is trying to replace the spaces in our bookmarks (and the links that go to those bookmarks)... The bookmark tag looks like this: and the bookmark tag looks like this

Some pitfalls I've already run up against... SOMETIMES (but not often) the a and the href (or name) is split across a line... this led me to just drop the ")+') and the corresponding name replace

and then the one corner case we ran into of

p100= re.compile('(a name=)+(.*)(-)+(.*)(>)+')
Re: Program inefficiency?
I think he's saying it should look like this:

# File: masseditor.py

import re
import os
import time

# One character class covers every character we replace with "_":
# space, ', ?, <, : and -
p1 = re.compile(r'''(href=|HREF=)+(.*)(#)+(.*)([ '?<:-])+(.*)(">)+''')
p2 = re.compile(r'''(name=")+(.*)([ '?<:-])+(.*)(">)+''')
p100 = re.compile('(a name=)+(.*)(-)+(.*)(>)+')
q1 = r"\1\2\3\4_\6\7"
q2 = r"\1\2_\4\5"

def massreplace():
    editfile = open(r"C:\Program Files\Credit Risk Management\Masseditor\editfile.txt")
    filestring = editfile.read()
    editfile.close()
    filelist = filestring.splitlines()
    for i in range(len(filelist)):
        source = open(filelist[i])
        interimtext = source.read()
        source.close()
        # Each pass fixes at most one character per tag, so repeat.
        # The inner loop needs its own variable so it doesn't clobber i.
        for cycle in range(13):
            interimtext = p1.sub(q1, interimtext)
            interimtext = p2.sub(q2, interimtext)
            interimtext = p100.sub(q2, interimtext)
        source = open(filelist[i], "w")
        source.write(interimtext)
        source.close()

massreplace()

I'll try that and see how it works...
Pulling data from a .aspx site
There's a government website which shows public data for banks. We'd like to pull the data down programmatically, but the data is "hidden" behind .aspx...

Is there any way in Python to hook directly into a browser (Firefox or IE) to do the following...

1) Fill in the search criteria
2) Press the "Search" button
3) Press another button (the CSV button) on the resulting page
4) Then grab the data out of the Notepad file that pops up

If this is a wild goose chase, let me know... (or if there's a better way besides Python... I may have to explore writing a Firefox plug-in or something)...
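One approach that avoids driving a browser at all is to fetch the search page, copy its hidden form fields (ASP.NET WebForms pages normally carry state in hidden inputs such as __VIEWSTATE), add the search criteria, and POST the form back. The Python 3 sketch below only builds the POST body; the field names txtBank and btnSearch are hypothetical, and the real ones would have to be read from the page source.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect the name/value pairs of every hidden <input> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            name = attrs.get("name")
            if name:
                self.fields[name] = attrs.get("value") or ""


def build_post_body(page_html, extra_fields):
    """Merge the page's hidden fields with our own and URL-encode them."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    fields = dict(parser.fields)
    fields.update(extra_fields)
    return urlencode(fields)


if __name__ == "__main__":
    # Stand-in for the HTML of the real search page.
    page = ('<form><input type="hidden" name="__VIEWSTATE" value="abc123" />'
            '<input type="text" name="txtBank" /></form>')
    # txtBank and btnSearch are hypothetical field names.
    print(build_post_body(page, {"txtBank": "First National",
                                 "btnSearch": "Search"}))
```

The encoded body could then be sent with urllib.request.urlopen(url, data=body.encode("ascii")), which issues a POST. Whether the site accepts that without cookies or extra headers is something only a trial would show.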
easy_install
For the life of me I cannot figure out how to get easy_install to work. The syntax displayed on the web page does not appear to work properly.

easy_install c:\MySQL_python-1.2.2-py2.4-win32.egg

Is there a simpler way to install a Python egg? Or am I missing something with easy_install?
Re: easy_install
On Feb 8, 9:27 am, "Diez B. Roggisch" wrote:
> hall.j...@gmail.com wrote:
> > For the life of me I can not figure out how to get easy_install to
> > work. The syntax displayed on the web page does not appear to work
> > properly.
> >
> > easy_install c:\MySQL_python-1.2.2-py2.4-win32.egg
>
> It usually works for me - so what does "not appear to work properly"
> actually mean?
>
> Diez

http://peak.telecommunity.com/DevCenter/EasyInstall#downloading-and-installing-a-package seems to imply that after installation I can go to a command prompt and type

easy_install c:\MySQL_python-1.2.2-py2.4-win32.egg

I tried doing this in the Python interpreter and at a straight "cmd" command prompt (the site doesn't really specify). I also tried "import easy_install" and then

easy_install c:\MySQL_python-1.2.2-py2.4-win32.egg
easy_install ("c:\MySQL_python-1.2.2-py2.4-win32.egg")

and a couple of other permutations, and never got it to run (the error messages for the first group were "invalid syntax", and for the second group various flavors of "module not callable").
Re: easy_install
I had it downloaded and sitting in the root c:\ but didn't get it to run because I didn't think about the \scripts folder not being in the Path. Problem solved and fixed. Thank you all for your help.

On a side note, "easy_install MySQL-python" produced the following messages:

Searching for MySQL-python
Reading http://pypi.python.org/simple/MySQL_python/
Reading http://sourceforge.net/projects/mysql-python
Reading http://sourceforge.net/projects/mysql-python/
Best match: MySQL-python 1.2.3b1
Downloading http://osdn.dl.sourceforge.net/sourceforge/mysql-python/MySQL-python-1.2.3b1.tar.gz
Processing MySQL-python-1.2.3b1.tar.gz
Running MySQL-python-1.2.3b1\setup.py -q bdist_egg --dist-dir c:\docume~1\jhall\locals~1\temp\easy_install-t_ph9k\MySQL-python-1.2.3b1\egg-dist-tmp-3gtuz9
error: The system cannot find the file specified

Installing from the hard drive worked fine, however.
Re: socket error: connection refused?
It's a security conflict. You should be able to run it again and have it work. Our company's Cisco does the same thing (even after we approve the app).
tuple.index() and tuple.count()
Before the inevitable response comes, let me assure you I've read through the posts from Guido about this. 7 years ago Guido clearly expressed a displeasure with allowing these methods for tuple. Let me lay out (in a fresh way) why I think we should reconsider.

1) It's counterintuitive to exclude them: It makes very little sense why an indexable data structure wouldn't have .index() as a method. It makes even less sense to not allow .count()
2) There's no technical reason (that I'm aware of) why these can't be added
3) It does not (contrary to one of Guido's assertions) require any relearning of anything. It's a new method that could be added without breaking any code whatsoever (there isn't even a UserTuple.py to break)
4) The additional documentation is relatively minute (especially since it could be copied and pasted virtually verbatim from the list methods)
5) It's MORE Pythonic to do it this way (more intuitive, less boilerplate)
6) It jibes with the help file better. One of Guido's many stated reasons was that tuples are for heterogeneous sequences and lists are for homogeneous sequences. While this may be hypothetically true, the help file does not come close to pointing you in this direction, nor does the implementation of the language.

Example:

"Tuples have many uses. For example: (x, y) coordinate pairs, employee records from a database, etc. Tuples, like strings, are immutable: it is not possible to assign to the individual items of a tuple (you can simulate much of the same effect with slicing and concatenation, though). It is also possible to create tuples which contain mutable objects, such as lists."

is a quote from the help file. Not only does it never mention homogeneous vs. heterogeneous, but it mentions both immutable and mutable, which draws your mind and attention to that aspect.

While tuples and lists may have different uses based on convention, there are really only two reasons to ever use a tuple: efficiency or dictionary keys (or some similar immutability requirement). The implementation contains absolutely NOTHING to reinforce the idea that lists are for homogeneous data. The implementation of the language contains EVERY indication that tuples are second-class citizens only to be used for those limited functions above (in fact, efficiency isn't even talked about in the documentation... I pieced that together from other threads). Tuples could have been implemented as frozenlist just as easily.

The lack of .index() and .count() appears to be primarily motivated by a subtle and silent (at least in the documentation) desire to push towards coding "best practice" rather than by any technical reason. While I'm certainly not a "change for change's sake" kind of guy and I understand the "bang for your buck" thinking, I'm just not seeing the rationale for stopping this so forcibly. I get the impression that if a perfect working patch was submitted, Guido might still reject it, which just seems odd to me.

Again, I'm not trying to raise a stink or open old wounds; I just ran across it in an app, started doing some research and was thoroughly confused (for the record, I'm using the tuples as dictionary keys and had a desire to do k.count() for some edit checks, and realized I had to convert the thing to a list first to run count()).
Re: tuple.index() and tuple.count()
Never mind... a coworker pointed me to this:

http://bugs.python.org/issue1696444

Apparently they're there in py3k...
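A quick check of what that looks like on an interpreter that has the methods (Python 3, and as far as I know Python 2.6 onwards). The tuple contents are made up for illustration.

```python
# A tuple of the kind used as a dictionary key; the values are invented.
key = ("loan", 1001, "loan", 7)

# No conversion to a list is needed.
duplicates = key.count("loan")   # how many times "loan" appears
position = key.index(1001)       # index of the first 1001

if __name__ == "__main__":
    print(duplicates, position)
```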
Preferred method for "Assignment by value"
As a relative newcomer to Python, I haven't done a heck of a lot of hacking around with it. I had my first run-in with Python's quirky (to me at least) tendency to assign by reference rather than by value (I'm coming from a VBA world, so that's the terminology I'm using). I was surprised that these two cases behave so differently:

>>> test = [[1],[2]]
>>> x = test[0]
>>> x[0] = 5
>>> test
[[5], [2]]
>>> x = 1
>>> test
[[5], [2]]
>>> x
1

Now I've done a little reading and I think I understand the problem... My issue is, "What's the 'best practise' way of assigning just the value of something to a new name?"

i.e.

test = [[1,2],[3,4]]

I need to do some data manipulation with the first list in the above list without changing the original. Obviously x = test[0] will not work, as any changes I make will alter the original... I found that I could do this:

x = [] + test[0]

That gets me a "pure" (i.e. unconnected to test[0]) list, but it struck me as a bit kludgy.

Thanks for your time and help.
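For comparison, here are the usual ways of getting an independent copy, sketched in Python 3 with the list from the question. A slice or list() copies only the outer list, which is enough for a flat list of numbers; copy.deepcopy also copies the nested lists.

```python
import copy

test = [[1, 2], [3, 4]]

x = test[0][:]             # slice copy of the first inner list
y = list(test[0])          # same effect, arguably more readable
z = copy.deepcopy(test)    # independent copy of the whole nested structure

# Changing the copies leaves the original untouched.
x[0] = 99
y.append(100)
z[1][0] = -1

if __name__ == "__main__":
    print(test, x, y, z)
```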
Re: Preferred method for "Assignment by value"
Thank you both, the assigning using slicing works perfectly (as I'm sure you knew it would)... It just didn't occur to me because it seemed a little nonintuitive... The specific application was

def dicttolist (inputdict):
    finallist=[]
    for k, v in inputdict.iteritems():
        temp = v
        temp.insert(0,k)
        finallist.append(temp)
    return finallist

to convert a dictionary to a list. We deal with large amounts of bank data, which the dictionary is perfect for since loan number is a perfect key... At the end, though, I have to throw it into a csv file, and the csv writer doesn't like dictionaries (since the key is an iterable string, it iterates over each value in the key).

By changing temp = v to temp = v[:] the code worked perfectly (although changing temp.insert(0,k) to temp = [k] + temp also worked fine... I didn't like that as I knew it was a workaround).

Thanks again for the help.
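The same conversion can be written so that it never touches the dictionary's lists at all, by building a new row for each key. This is a Python 3 sketch (items() instead of iteritems()); the name dict_to_rows and the sample loan data are made up for illustration.

```python
import csv
import io


def dict_to_rows(inputdict):
    """Return one new list per entry: the key followed by its values."""
    return [[key] + list(values) for key, values in inputdict.items()]


if __name__ == "__main__":
    # Invented sample data: loan number -> [balance, rate]
    loans = {"L001": [1000, 0.05], "L002": [2500, 0.04]}
    out = io.StringIO()
    csv.writer(out).writerows(dict_to_rows(loans))
    print(out.getvalue())
```

Because [key] + list(values) creates a fresh list, running the function twice gives the same rows and the dictionary is left as it was.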
Re: Preferred method for "Assignment by value"
I think the fundamental "disconnect" is this issue of mutability and immutability that people talk about (mainly regarding tuples and whether they should be thought of as static lists or not).

Coming from VBA, I have a tendency to think of everything as an array...

So when I create the following

test=[1,2],[3,4],[5,6]

I'm annoyed to find out that I can do the following

test[1][1] = 3

but I can't do

test[1] = [3,3]

and so I throw tuples out the window and never use them again...

The mental disconnect I had (until now) was that my original tuple was in effect "creating" 3 objects (the lists) within a 4th object (the tuple)... Previously, I'd been thinking of the tuple as one big object (mentally forcing it into the same brain space as multi-dimensional arrays in VBA).

This was a nice "aha" moment for me...
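The behaviour described above can be reproduced in a few lines (Python 3 here, though Python 2 behaves the same way). The variable rebinding_error is just an illustrative name.

```python
# The commas alone make this a tuple holding three separate list objects.
test = [1, 2], [3, 4], [5, 6]

# Allowed: this changes the second list, not the tuple itself.
test[1][1] = 3

# Not allowed: this would make the tuple point at a different object.
rebinding_error = None
try:
    test[1] = [3, 3]
except TypeError as exc:
    rebinding_error = exc

if __name__ == "__main__":
    print(test)
    print(rebinding_error)
```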