Finding duplicate file names and modifying them based on elements of the path
I have an interesting problem I'm trying to solve. I have a solution almost working, but it's super ugly, and I know there has to be a better, cleaner way to do it. I have a list of path names that have this form:

/dir0/dir1/dir2/dir3/dir4/dir5/dir6/file

I need to find all the file names (basenames) in the list that are duplicates, and for each one that is a dup, prepend dir4 to the filename as long as the dir4/file pair is unique. If there are multiple dir4/file pairs in the list, then I also need to add a sequence number based on the sorted value of dir5 (which is a date in ddMONyy format). For example, if my list contains:

/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3
/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1
/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2
/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1
/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3

Then I want to end up with:

/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/qwer_01_file3
/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/abcd_file1
/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2
/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/xyz_file1
/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/qwer_00_file3

My solution involves multiple maps and multiple iterations through the data. How would you folks do this? -- http://mail.python.org/mailman/listinfo/python-list
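A minimal sketch of the first step -- spotting which basenames are duplicated at all -- assuming the paths are already in a Python list. This is illustrative only; the renaming rules described above are handled later in the thread, not here:

    import os
    from collections import Counter

    paths = [
        "/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3",
        "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1",
        "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2",
        "/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1",
        "/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3",
    ]

    # count how often each basename occurs
    counts = Counter(os.path.basename(p) for p in paths)

    # only basenames that occur more than once need any renaming
    dups = {name for name, n in counts.items() if n > 1}
    print(dups)   # set of duplicated basenames: file1 and file3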
Re: Finding duplicate file names and modifying them based on elements of the path
On Jul 18, 6:36 pm, Simon Cropper wrote: > On 19/07/12 08:20, larry.mart...@gmail.com wrote: > > > I have an interesting problem I'm trying to solve. I have a solution > > almost working, but it's super ugly, and I know there has to be a > > better, cleaner way to do it. > > > I have a list of path names that have this form: > > > /dir0/dir1/dir2/dir3/dir4/dir5/dir6/file > > > I need to find all the file names (basenames) in the list that are > > duplicates, and for each one that is a dup, prepend dir4 to the > > filename as long as the dir4/file pair is unique. If there are > > multiple dir4/files in the list, then I also need to add a sequence > > number based on the sorted value of dir5 (which is a date in ddMONyy > > format). > > > For example, if my list contains: > > > /dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3 > > /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1 > > /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2 > > /dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1 > > /dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3 > > > Then I want to end up with: > > > /dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/qwer_01_file3 > > /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/abcd_file1 > > /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2 > > /dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/xyz_file1 > > /dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/qwer_00_file3 > > > My solution involves multiple maps and multiple iterations through the > > data. How would you folks do this? > > Hi Larry, > > I am making the assumption that you intend to collapse the directory > tree and store each file in the same directory, otherwise I can't think > of why you need to do this. Hi Simon, thanks for the reply. It's not quite this - what I am doing is creating a zip file with relative path names, and if there are duplicate files the parts of the path that are not being carried over need to get prepended to the file names to make them unique. > > If this is the case, then I would... > > 1. import all the files into an array > 2. parse path to extract fourth level directory name and base name. > 3. reiterate through the array > 3.1 check if base filename exists in recipient directory > 3.2 if not, copy to recipient directory > 3.3 if present, append the directory path then save > 3.4 create log of success or failure > > Personally, I would not have some files with abcd_file1 and others as > file2 because if it is important enough to store a file in a separate > directory you should also note where file2 came from as well. When > looking at your results at a later date you are going to have to open > file2 (which I presume must record where it relates to) to figure out > where it came from. If it is in the name it is easier to review. > > In short, consistency is the name of the game; if you are going to do it > for some then do it for all; and finally it will be easier for others > later to work out what you have done. Yeah, I know, but this is for a client, and this is what they want. -- http://mail.python.org/mailman/listinfo/python-list
Re: Finding duplicate file names and modifying them based on elements of the path
On Jul 18, 4:49 pm, Paul Rubin wrote:
> "larry.mart...@gmail.com" writes:
> > I have an interesting problem I'm trying to solve. I have a solution almost working, but it's super ugly, and I know there has to be a better, cleaner way to do it. ...
> > My solution involves multiple maps and multiple iterations through the data. How would you folks do this?
>
> You could post your code and ask for suggestions how to improve it. There are a lot of not-so-natural constraints in that problem, so it stands to reason that the code will be a bit messy. The whole specification seems like an antipattern though. You should just give a sensible encoding for the filename regardless of whether other fields are duplicated or not. You also don't seem to address the case where basename, dir4, and dir5 are all duplicated.
>
> The approach I'd take for the spec as you wrote it is:
>
> 1. Sort the list on the (basename, dir4, dir5) triple, saving original location (numeric index) of each item
> 2. Use itertools.groupby to group together duplicate basenames.
> 3. Within the groups, use groupby again to gather duplicate dir4's,
> 4. Within -those- groups, group by dir5 and assign sequence numbers in groups where there's more than one file
> 5. Unsort to get the rewritten items back into the original order.
>
> Actual code is left as an exercise.

I replied to this before, but I don't see it, so if this is a duplicate, sorry. Thanks for the reply Paul. I had not heard of itertools. It sounds like just what I need for this. But I am having 1 issue - how do you know how many items are in each group? Without knowing that I have to either make 2 passes through the data, or else work on the previous item (when I'm in an iteration after the first then I know I have dups). But that very quickly gets crazy with trying to keep the previous values. -- http://mail.python.org/mailman/listinfo/python-list
Re: Finding duplicate file names and modifying them based on elements of the path
On Jul 19, 1:02 pm, "Prasad, Ramit" wrote: > > > I am making the assumption that you intend to collapse the directory > > > tree and store each file in the same directory, otherwise I can't think > > > of why you need to do this. > > > Hi Simon, thanks for the reply. It's not quite this - what I am doing > > is creating a zip file with relative path names, and if there are > > duplicate files the parts of the path that are not being carried over > > need to get prepended to the file names to make them unique. > > Depending on the file system of the client, you can hit file name > length limits. I would think it would be better to just create > the full structure in the zip. > > Just something to keep in mind, especially if you see funky behavior. Thanks, but it's not what the client wants. -- http://mail.python.org/mailman/listinfo/python-list
Re: Finding duplicate file names and modifying them based on elements of the path
On Jul 18, 4:49 pm, Paul Rubin wrote:
> "larry.mart...@gmail.com" writes:
> > I have an interesting problem I'm trying to solve. I have a solution almost working, but it's super ugly, and I know there has to be a better, cleaner way to do it. ...
> > My solution involves multiple maps and multiple iterations through the data. How would you folks do this?
>
> You could post your code and ask for suggestions how to improve it. There are a lot of not-so-natural constraints in that problem, so it stands to reason that the code will be a bit messy. The whole specification seems like an antipattern though. You should just give a sensible encoding for the filename regardless of whether other fields are duplicated or not. You also don't seem to address the case where basename, dir4, and dir5 are all duplicated.
>
> The approach I'd take for the spec as you wrote it is:
>
> 1. Sort the list on the (basename, dir4, dir5) triple, saving original location (numeric index) of each item
> 2. Use itertools.groupby to group together duplicate basenames.
> 3. Within the groups, use groupby again to gather duplicate dir4's,
> 4. Within -those- groups, group by dir5 and assign sequence numbers in groups where there's more than one file
> 5. Unsort to get the rewritten items back into the original order.
>
> Actual code is left as an exercise.

Thanks very much for the reply Paul. I did not know about itertools. This seems like it will be perfect for me. But I'm having 1 issue: how do I know how many of a given basename (and similarly how many basename/dir4s) there are? I don't know that I have to modify a file until I've passed it, so I have to do all kinds of contortions to save the previous one, and deal with the last one after I fall out of the loop, and it's getting very nasty.

reports_list is the list sorted on basename, dir4, dir5 (tool is dir4, file_date is dir5):

for file, file_group in groupby(reports_list, lambda x: x[0]):
    # if file is unique in file_group do nothing, but how can I tell if file is unique?
    for tool, tool_group in groupby(file_group, lambda x: x[1]):
        # if tool is unique for file, change file to tool_file, but how can I tell if tool is unique for file?
        for file_date, file_date_group in groupby(tool_group, lambda x: x[2]):

You can't do a len on the iterator that is returned from groupby, and I've tried to do something with imap or defaultdict, but I'm not getting anywhere. I guess I can just make 2 passes through the data, the first time getting counts. Or am I missing something about how groupby works? Thanks! -larry -- http://mail.python.org/mailman/listinfo/python-list
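For what it's worth, the skeleton above can be completed by materializing each group with list() so its length is known. This is only a sketch, and it assumes reports_list holds (basename, dir4, dir5, full_path) tuples already sorted on the first three fields; that layout is an assumption for illustration, not something stated in the thread:

    from itertools import groupby

    def new_names(reports_list):
        renamed = {}   # full_path -> new basename
        for fname, file_group in groupby(reports_list, lambda x: x[0]):
            rows = list(file_group)              # materialize so len() works
            if len(rows) == 1:
                continue                         # basename is unique; leave it alone
            for dir4, dir4_group in groupby(rows, lambda x: x[1]):
                dir4_rows = list(dir4_group)
                if len(dir4_rows) == 1:
                    # dir4/basename pair is unique: just prepend dir4
                    renamed[dir4_rows[0][3]] = "%s_%s" % (dir4, fname)
                else:
                    # several entries share dir4: add a sequence number in dir5 order
                    # (assumes dir5 has been converted to something that sorts chronologically)
                    for seq, row in enumerate(sorted(dir4_rows, key=lambda x: x[2])):
                        renamed[row[3]] = "%s_%02d_%s" % (dir4, seq, fname)
        return renamed

The key point is the list(...) call: once a group is a list it can be counted, indexed and re-sorted without worrying about the underlying iterator.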
Re: Finding duplicate file names and modifying them based on elements of the path
On Jul 19, 1:56 pm, Paul Rubin wrote: > "larry.mart...@gmail.com" writes: > > You can't do a len on the iterator that is returned from groupby, and > > I've tried to do something with imap or defaultdict, but I'm not > > getting anywhere. I guess I can just make 2 passes through the data, > > the first time getting counts. Or am I missing something about how > > groupby works? > > I posted another reply to your other message, which reached me earlier. > If you're still stuck, post again, though I probably won't be able to > reply til tomorrow or the next day. I really appreciate the offer, but I'm going to go with MRAB's solution. It works, and I understand it ;-) -- http://mail.python.org/mailman/listinfo/python-list
Re: Finding duplicate file names and modifying them based on elements of the path
On Jul 19, 3:32 pm, MRAB wrote:
> On 19/07/2012 20:06, larry.mart...@gmail.com wrote:
> > On Jul 19, 1:02 pm, "Prasad, Ramit" wrote:
> > > > > I am making the assumption that you intend to collapse the directory tree and store each file in the same directory, otherwise I can't think of why you need to do this.
> > > > Hi Simon, thanks for the reply. It's not quite this - what I am doing is creating a zip file with relative path names, and if there are duplicate files the parts of the path that are not being carried over need to get prepended to the file names to make them unique.
> > > Depending on the file system of the client, you can hit file name length limits. I would think it would be better to just create the full structure in the zip.
> > > Just something to keep in mind, especially if you see funky behavior.
> > Thanks, but it's not what the client wants.
>
> Here's another solution, not using itertools:
>
> from collections import defaultdict
> from os.path import basename, dirname
> from time import strftime, strptime
>
> # Starting with the original paths
>
> paths = [
>     "/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3",
>     "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1",
>     "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2",
>     "/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1",
>     "/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3",
> ]
>
> def make_dir5_key(path):
>     date = strptime(path.split("/")[6], "%d%b%y")
>     return strftime("%y%b%d", date)
>
> # Collect the paths into a dict keyed by the basename
>
> files = defaultdict(list)
> for path in paths:
>     files[basename(path)].append(path)
>
> # Process a list of paths if there's more than one entry
>
> renaming = []
>
> for name, entries in files.items():
>     if len(entries) > 1:
>         # Collect the paths in each subgroup into a dict keyed by dir4
>         subgroup = defaultdict(list)
>         for path in entries:
>             subgroup[path.split("/")[5]].append(path)
>
>         for dir4, subentries in subgroup.items():
>             # Sort the subentries by dir5 (date)
>             subentries.sort(key=make_dir5_key)
>
>             if len(subentries) > 1:
>                 for index, path in enumerate(subentries):
>                     renaming.append((path, "{}/{}_{:02}_{}".format(dirname(path), dir4, index, name)))
>             else:
>                 path = subentries[0]
>                 renaming.append((path, "{}/{}_{}".format(dirname(path), dir4, name)))
>     else:
>         path = entries[0]
>
> for old_path, new_path in renaming:
>     print("Rename {!r} to {!r}".format(old_path, new_path))

Thanks a million MRAB. I really like this solution. It's very understandable and it works! I had never seen .format before. I had to add the index of the positional args to them to make it work. -- http://mail.python.org/mailman/listinfo/python-list
Re: Finding duplicate file names and modifying them based on elements of the path
On Jul 19, 1:43 pm, Paul Rubin wrote:
> "larry.mart...@gmail.com" writes:
> > Thanks for the reply Paul. I had not heard of itertools. It sounds like just what I need for this. But I am having 1 issue - how do you know how many items are in each group?
>
> Simplest is:
>
> for key, group in groupby(xs, lambda x: (x[-1], x[4], x[5])):
>     gs = list(group)  # convert iterator to a list
>     n = len(gs)       # this is the number of elements
>
> there is some theoretical inelegance in that it requires each group to fit in memory, but you weren't really going to have billions of files with the same basename.
>
> If you're not used to iterators and itertools, note there are some subtleties to using groupby to iterate over files, because an iterator actually has state. It bumps a pointer and maybe consumes some input every time you advance it. In a situation like the above, you've got some nested iterators (the groupby iterator generating groups, and the individual group iterators that come out of the groupby) that wrap the same file handle, so bad confusion can result if you advance both iterators without being careful (one can consume file input that you thought would go to another).

It seems that if you do a list(group) you have consumed the group's iterator. This screwed me up for a while, and seems very counter-intuitive.

> This isn't as bad as it sounds once you get used to it, but it can be a source of frustration at first.
>
> BTW, if you just want to count the elements of an iterator (while consuming it),
>
> n = sum(1 for x in xs)
>
> counts the elements of xs without having to expand it into an in-memory list.
>
> Itertools really makes Python feel a lot more expressive and clean, despite little kinks like the above.

-- http://mail.python.org/mailman/listinfo/python-list
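The consumed-group behavior Larry mentions is easy to see with a toy example (illustration only, not code from the thread):

    from itertools import groupby

    data = ["a", "a", "b", "b", "b"]
    for key, group in groupby(data):
        items = list(group)        # materializing the group consumes its iterator
        print(key, len(items))     # a 2, then b 3
        print(list(group))         # [] -- the same group is already exhausted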
Re: Finding duplicate file names and modifying them based on elements of the path
On Jul 19, 7:01 pm, "larry.mart...@gmail.com" wrote:
> On Jul 19, 3:32 pm, MRAB wrote:
> > [snip -- MRAB's solution, quoted in full in the original message]
>
> Thanks a million MRAB. I really like this solution. It's very understandable and it works! I had never seen .format before. I had to add the index of the positional args to them to make it work.

Also, in make_dir5_key the format specifier for strftime should be %y%m%d so they sort properly. -- http://mail.python.org/mailman/listinfo/python-list
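A way to sidestep the format-string question entirely would be to sort on the parsed date itself rather than on a re-formatted string. This is only a sketch that swaps a different key function into MRAB's code, not something proposed in the thread:

    from time import strptime

    def make_dir5_key(path):
        # time.struct_time values compare chronologically (year, month, day, ...),
        # so no second strftime round-trip is needed
        return strptime(path.split("/")[6], "%d%b%y")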
Re: help with making my code more efficient
On Thursday, December 20, 2012 5:38:03 PM UTC-7, Chris Angelico wrote: > On Fri, Dec 21, 2012 at 11:19 AM, larry.mart...@gmail.com > > wrote: > > > This code works, but it takes way too long to run - e.g. when cdata has > > 600,000 elements (which is typical for my app) it takes 2 hours for this to > > run. > > > > > > Can anyone give me some suggestions on speeding this up? > > > > > > > It sounds like you may have enough data to want to not keep it all in > > memory. Have you considered switching to a database? You could then > > execute SQL queries against it. It came from a database. Originally I was getting just the data I wanted using SQL, but that was taking too long also. I was selecting just the messages I wanted, then for each one of those doing another query to get the data within the time diff of each. That was resulting in tens of thousands of queries. So I changed it to pull all the potential matches at once and then process it in python. -- http://mail.python.org/mailman/listinfo/python-list
Re: help with making my code more efficient
On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote:
> On 12/20/2012 07:19 PM, larry.mart...@gmail.com wrote:
> > I have a list of tuples that contains a tool_id, a time, and a message. I want to select from this list all the elements where the message matches some string, and all the other elements where the time is within some diff of any matching message for that tool.
> > Here is how I am currently doing this:
> No, it's not. This is a fragment of code, without enough clues as to what else is going on. We can guess, but that's likely to make a mess.

Of course it's a fragment - it's part of a large program and I was just showing the relevant parts.

> First question is whether this code works exactly correctly?

Yes, the code works. I end up with just the rows I want.

> Are you only concerned about speed, not fixing features?

Don't know what you mean by 'fixing features'. The code does what I want, it just takes too long.

> As far as I can tell, the logic that includes the time comparison is bogus.

Not at all.

> You don't do anything there to worry about the value of tup[2], just whether some item has a nearby time. Of course, I could misunderstand the spec.

The data comes from a database. tup[2] is a datetime column. tdiff comes from a datetime.timedelta()

> Are you making a global called 'self' ? That name is by convention only used in methods to designate the instance object. What's the attribute self?

Yes, self is my instance object. self.message contains the string of interest that I need to look for.

> Can cdata have duplicates, and are they significant?

No, it will not have duplicates.

> Are you including the time building that as part of your 2 hour measurement?

No, the 2 hours is just the time to run the cdata[:] = [tup for tup in cdata if determine(tup)]

> Is the list sorted in any way?

Yes, the list is sorted by tool and datetime.

> Chances are your performance bottleneck is the doubly-nested loop. You have a list comprehension at top-level code, and inside it calls a function that also loops over the 600,000 items. So the inner loop gets executed 360 billion times. You can cut this down drastically by some judicious sorting, as well as by having a map of lists, where the map is keyed by the tool.

Thanks. I will try that.

> > # record time for each message matching the specified message for each tool
> > messageTimes = {}
> You're building a dictionary; are you actually using the value (1), or is only the key relevant?

Only the keys.

> A set is a dict without a value.

Yes, I could use a set, but I don't think that would make it measurably faster.

> But more importantly, you never look up anything in this dictionary. So why isn't it a list? For that matter, why don't you just use the messageTimes list?

Yes, it could be a list too.

> > for row in cdata: # tool, time, message
> >     if self.message in row[2]:
> >         messageTimes[row[0], row[1]] = 1
> >
> > # now pull out each message that is within the time diff for each matched message
> > # as well as the matched messages themselves
> >
> > def determine(tup):
> >     if self.message in tup[2]: return True # matched message
> >
> >     for (tool, date_time) in messageTimes:
> >         if tool == tup[0]:
> >             if abs(date_time-tup[1]) <= tdiff:
> >                 return True
> >
> >     return False
> >
> > cdata[:] = [tup for tup in cdata if determine(tup)]
> As the code exists, there's no need to copy the list. Just do a simple bind.

This statement is to remove the items from cdata that I don't want. I don't know what you mean by bind. I'm not familiar with that python function.

> > This code works, but it takes way too long to run - e.g. when cdata has 600,000 elements (which is typical for my app) it takes 2 hours for this to run.
> >
> > Can anyone give me some suggestions on speeding this up?

-- http://mail.python.org/mailman/listinfo/python-list
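Dave's "map of lists, where the map is keyed by the tool" suggestion amounts to something like the sketch below. It reuses the names from the fragment above (cdata, self.message, tdiff) and would live in the same method; it is illustrative only, not the code the thread eventually settled on:

    from collections import defaultdict

    # one list of matched-message times per tool, instead of one flat dict
    messageTimes = defaultdict(list)
    for tool, when, message in cdata:
        if self.message in message:
            messageTimes[tool].append(when)

    # determine() now only scans the times recorded for the row's own tool
    def determine(tup):
        if self.message in tup[2]:
            return True
        return any(abs(t - tup[1]) <= tdiff for t in messageTimes[tup[0]])

    cdata = [tup for tup in cdata if determine(tup)]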
Re: help with making my code more efficient
On Thursday, December 20, 2012 8:31:18 PM UTC-7, Dave Angel wrote:
> On 12/20/2012 08:46 PM, larry.mart...@gmail.com wrote:
> > On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote:
> > Of course it's a fragment - it's part of a large program and I was just showing the relevant parts.
> But it seems these are methods in a class, or something, so we're missing context. And you use self without it being an argument to the function. Like it's a global.

I didn't show the entire method, only what I thought was relevant to my issue. The method is declared as:

def generate_data(self):

> > Yes, the code works. I end up with just the rows I want.
> > > Are you only concerned about speed, not fixing features?
> > Don't know what you mean by 'fixing features'. The code does what I want, it just takes too long.
> > > As far as I can tell, the logic that includes the time comparison is bogus.
> > Not at all.
> > > You don't do anything there to worry about the value of tup[2], just whether some item has a nearby time. Of course, I could misunderstand the spec.
> > The data comes from a database. tup[2] is a datetime column. tdiff comes from a datetime.timedelta()
> I thought that tup[1] was the datetime. In any case, the loop makes no sense to me, so I can't really optimize it, just make suggestions.

Yes, tup[1] is the datetime. I mistyped last night.

> > > Are you making a global called 'self' ? That name is by convention only used in methods to designate the instance object. What's the attribute self?
> > Yes, self is my instance object. self.message contains the string of interest that I need to look for.
> > > Can cdata have duplicates, and are they significant?
> > No, it will not have duplicates.
> > > Is the list sorted in any way?
> > Yes, the list is sorted by tool and datetime.
> > > Chances are your performance bottleneck is the doubly-nested loop. You have a list comprehension at top-level code, and inside it calls a function that also loops over the 600,000 items. So the inner loop gets executed 360 billion times. You can cut this down drastically by some judicious sorting, as well as by having a map of lists, where the map is keyed by the tool.
> > Thanks. I will try that.
> So in your first loop, you could simply split the list into separate lists, one per tup[0] value, and the lists as dictionary items, keyed by that tool string.
>
> Then inside the determine() function, make a local ref to the particular list for the tool.
>
>     recs = messageTimes[tup[0]]

I made that change and went from taking over 2 hours to 54 minutes. A dramatic improvement, but still not adequate for my app.

> Instead of a for loop over recs, use a binary search to identify the first item that's >= date_time-tdiff. Then if it's less than date_time+tdiff, return True, otherwise False. Check out the bisect module. Function bisect_left() should do what you want in a sorted list.

Didn't know about bisect. Thanks. I thought it would be my savior for sure. But unfortunately when I added that, it blows up with out of memory.

This was the code I had:

times = messageTimes[tup[0]]

le = bisect.bisect_right(times, tup[1])
ge = bisect.bisect_left(times, tup[1])
return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)

> > > cdata[:] = [tup for tup in cdata if determine(tup)]
> > As the code exists, there's no need to copy the list. Just do a simple bind.
> > This statement is to remove the items from cdata that I don't want. I don't know what you mean by bind. I'm not familiar with that python function.
> Every "assignment" to a simple name is really a rebinding of that name.
>
>     cdata = [tup for tup in cdata if determine(tup)]
>
> will rebind the name to the new object, much quicker than copying. If this is indeed a top-level line, it should be equivalent. But if in fact this is inside some other function, it may violate some other assumptions. In particular, if there are other names for the same object, then you're probably stuck with modifying it in place, using slice notation.

The slice notation was left over from when cdata was a tuple. Now that it's a list I don't need that any more.

> BTW, a set is generally much more memory efficient than a dict, when you don't use the "value". But since I think you'll be better off with a dict of lists, it's a moot point.

I'm going back to square 1 and try to do it all from SQL. -- http://mail.python.org/mailman/listinfo/python-list
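For anyone unfamiliar with bisect, the test being built here can be seen in isolation with plain integers standing in for the datetimes (purely illustrative, not code from the thread):

    import bisect

    times = [10, 20, 30, 40]       # sorted times of matched messages for one tool
    t, tdiff = 27, 5               # the row's own time and the allowed window

    le = bisect.bisect_right(times, t)   # index just past the entries <= t  -> 2
    ge = bisect.bisect_left(times, t)    # index of the first entry >= t     -> 2

    near_before = le and t - times[le - 1] <= tdiff            # 27 - 20 = 7 -> False
    near_after = ge != len(times) and times[ge] - t <= tdiff   # 30 - 27 = 3 -> True
    print(near_before or near_after)                           # True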
Re: help with making my code more efficient
On Friday, December 21, 2012 10:57:19 AM UTC-7, larry@gmail.com wrote:
> [snip -- the earlier exchange with Dave Angel was quoted in full in the original message]
>
> I made that change and went from taking over 2 hours to 54 minutes. A dramatic improvement, but still not adequate for my app.
>
> Didn't know about bisect. Thanks. I thought it would be my savior for sure. But unfortunately when I added that, it blows up with out of memory.

The out of memory error had nothing to do with using bisect. I had introduced a typo that I really thought would have caused a variable referenced before assignment error. But it did not do that, and instead somehow caused all the memory in my machine to get used up. When I fixed that, it worked really well with bisect. The code that was taking 2 hours was down to 20 minutes, and even better, a query that was taking 40 minutes was down to 8 seconds.

Thanks very much for all your help.

> This was the code I had:
>
> times = messageTimes[tup[0]]
>
> le = bisect.bisect_right(times, tup[1])
> ge = bisect.bisect_left(times, tup[1])
> return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)

-- http://mail.python.org/mailman/listinfo/python-list
Re: help with making my code more efficient
On Friday, December 21, 2012 8:19:37 PM UTC-7, Dave Angel wrote:
> On 12/21/2012 03:36 PM, larry.mart...@gmail.com wrote:
> > On Friday, December 21, 2012 10:57:19 AM UTC-7, larry@gmail.com wrote:
> > > Didn't know about bisect. Thanks. I thought it would be my savior for sure. But unfortunately when I added that, it blows up with out of memory.
> > The out of memory error had nothing to do with using bisect. I had introduced a typo that I really thought would have caused a variable referenced before assignment error. But it did not do that, and instead somehow caused all the memory in my machine to get used up. When I fixed that, it worked really well with bisect. The code that was taking 2 hours was down to 20 minutes, and even better, a query that was taking 40 minutes was down to 8 seconds.
> >
> > Thanks very much for all your help.
> Congratulations. But you're not done. A fourfold improvement isn't nearly as much as I'd expect. When you go from an n-squared algorithm to an n-log-n, for n of 600k, I'd be disappointed in less than a 100 fold improvement. Assuming we're talking just about time spent in the cdata = list-comprehension list.
>
> It's also probably possible to simplify, and get some speedup from other parts of the code other than that one loop. For example, if the bisect is really working right, it's probably unnecessary to even copy the original cdata. You said it was sorted. So bisect it directly, and skip the messageTimes intermediate data structure. It may not help, but I suspect it will.
>
> > This was the code I had:
> >
> > times = messageTimes[tup[0]]
> >
> > le = bisect.bisect_right(times, tup[1])
> > ge = bisect.bisect_left(times, tup[1])
> > return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)
>
> I'm stymied making further suggestions to this fragment. Without seeing the changes you apparently made to messageTimes, I can't even be sure this is equivalent. And I wonder if even looking up tup[1] is worthwhile, since your caller already knows the index in cdata. If you used cdata directly, you might not need a sort at all.
>
> I wish you could post some sample data, and identify which records are intended to be True. Obviously you could simplify, and just use ints where it's using datetimes here. But if your rule is simply accept all records that come in a cluster with no delta more than a threshold, then you don't even need any search at all. Just pass the index in, and compare the datetime with its predecessor and successor.
>
> Could we see the whole fragment as you originally had it, but with the optimizations you've done so far? The more I look at that determine() function, the more futile it seems. I just don't understand what the criteria are supposed to be. Your list-comp loops through all of cdata, and for each item, passes that item to determine(). determine() loops through a copy of cdata, checking to see if any of them is within tdiff of tup. But won't that always return True, since you'll encounter the record and compare it to itself?

I think you're misunderstanding what I need to do. I have a set of rows from the database with tool, time, and message. The user has specified a string and a time threshold. From that set of rows I need to return all the rows that contain the user's string and all the other rows that match the tool from the matched rows and have a time within the threshold.

cdata has all the rows. messageTimes has the times of all the matched messages, keyed by tool. In determine() I don't look through cdata - it gets one element from cdata and I see if that should be selected because it either matches the user's message, or it's within the threshold of one that did match.

Here's my original code:

# record time for each message matching the specified message for each tool
messageTimes = {}
for row in cdata: # tool, time, message
    if self.message in row[2]:
        messageTimes[row[0], row[1]] = 1

# now pull out each message that is within the time diff for each matched message
# as well as the matched messages themselves

def determine(tup):
    if self.message in tup[2]: return True # matched message

    for (tool, date_time) in messageTimes:
        if tool == tup[0]:
            if abs(date_time-tup[1]) <= tdiff:
                return True

    return False

cdata[:] = [tup for tup in cdata if determine(tup)]

Here's the code now:

# Parse data and make a list of the time for each message matching the specified message for each tool
messageTimes = defaultdict(list) # a dict with sorted lists

for row in cdata: # tool, time, message
    if self.message in row[2]:
        messageTimes[row[0]].append(row[1])

# now pull out each message that is within the time context for each matched message
# as well as the matched messages themselves

# return true if we should keep this message
def determine(tup):
    if self.message in tup[2]: return True # matched message
    if seconds == 0: return False # no time context specified

    times = messageTimes[tup[0]] # get the list of matched messages for this tool

    le = bisect.bisect_right(times, tup[1]) # find time less than or equal to tup[1]
    ge = bisect.bisect_left(times, tup[1])  # find time greater than or equal to tup[1]
    return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)

cdata = [tup for tup in cdata if determine(tup)]

In my test case, cdata started with 600k rows, 30k matched the user's string, and a total of 110k needed to be returned (which is what cdata ended up with) - the 30k that matched the string, and 80k that were within the time threshold.

I think the point you may have missed is the tool - I only return a row if it's the same tool as a matched message and within the threshold.

I hope I've explained this better. Thanks again. -- http://mail.python.org/mailman/listinfo/python-list
Re: help with making my code more efficient
On Friday, December 21, 2012 11:47:10 PM UTC-7, Dave Angel wrote:
> On 12/21/2012 11:47 PM, larry.mart...@gmail.com wrote:
> > [snip -- Larry's explanation and both versions of the code, quoted in full in the original message]
>
> That is better, and the point I missed noticing before is that messageTimes is substantially smaller than cdata; it's already been filtered down by looking for self.message in its row[2]. The code was there, but I didn't relate. Remember I was bothered that you didn't look at tup[2] when you were comparing for time-similarity. Well, you did that implicitly, since messageTimes was already filtered. Sorry about that.
>
> That also lowers my expectations for improvement ratio, since instead of 600,000 * 600,000, we're talking "only" 600,000 * 30,000, 5% as much. So now my expectations are only 4:1 to 10:1.
>
> Still, there's room for improvement. (1) You should only need one bisect in determi
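Dave's point (1) is cut off above, but a single-bisect version of the window test might look roughly like this. It is a sketch only, assuming times is the sorted list of matched-message datetimes for the row's tool and tdiff is a timedelta:

    import bisect

    def near_a_match(times, when, tdiff):
        # first matched time >= when - tdiff; if it exists and is no later
        # than when + tdiff, the row falls inside the window
        i = bisect.bisect_left(times, when - tdiff)
        return i != len(times) and times[i] <= when + tdiff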
Re: Best way to structure data for efficient searching
On Mar 28, 1:52 pm, Jon Clements wrote:
> On Wednesday, 28 March 2012 19:39:54 UTC+1, larry@gmail.com wrote:
> > I have the following use case:
> >
> > I have a set of data that contains 3 fields, K1, K2 and a timestamp. There are duplicates in the data set, and they all have to be processed.
> >
> > Then I have another set of data with 4 fields: K3, K4, K5, and a timestamp. There are also duplicates in that data set, and they also all have to be processed.
> >
> > I need to find all the items in the second data set where K1==K3 and K2==K4 and the 2 timestamps are within 20 seconds of each other.
> >
> > I have this working, but the way I did it seems very inefficient - I simply put the data in 2 arrays (as tuples) and then walked through the entire second data set once for each item in the first data set, looking for matches.
> >
> > Is there a better, more efficient way I could have done this?
>
> It might not be more *efficient* but others might find it more readable, and it'd be easier to change later. Try an in-memory SQL DB (such as sqlite3) and query as (untested)
>
> select t2.* from t1 join t2 on k1=k3 and k2=k4 where abs(t1.timestamp - t2.timestamp) < 20

This is part of a django app, and the data came from mysql. Through a mechanism I can't change at this time (it would take a lot of code changes and this new functionality is needed ASAP) I get all the data at once and have to winnow it down.

> Failing that, two (default)dicts with a tuple as the pair, then use that as your base.

Since there are duplicates, I can't use a dict. And if I have any extraneous data in the keys (i.e. something to make them unique) then I still have to walk through the entire dict to find the matches. -- http://mail.python.org/mailman/listinfo/python-list
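Spelled out a little, Jon's in-memory sqlite3 idea could look like this -- untested in the same spirit as his reply, with made-up table and column names; first_data_set and second_data_set stand for the two lists of tuples, and the timestamps are assumed to be plain numbers (e.g. epoch seconds) so abs() works:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("create table t1 (k1, k2, ts)")
    conn.execute("create table t2 (k3, k4, k5, ts)")
    conn.executemany("insert into t1 values (?, ?, ?)", first_data_set)
    conn.executemany("insert into t2 values (?, ?, ?, ?)", second_data_set)

    rows = conn.execute("""
        select t2.*
        from t1 join t2 on t1.k1 = t2.k3 and t1.k2 = t2.k4
        where abs(t1.ts - t2.ts) < 20
    """).fetchall()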
Best way to structure data for efficient searching
I have the following use case: I have a set of data that contains 3 fields, K1, K2 and a timestamp. There are duplicates in the data set, and they all have to be processed. Then I have another set of data with 4 fields: K3, K4, K5, and a timestamp. There are also duplicates in that data set, and they also all have to be processed. I need to find all the items in the second data set where K1==K3 and K2==K4 and the 2 timestamps are within 20 seconds of each other. I have this working, but the way I did it seems very inefficient - I simply put the data in 2 arrays (as tuples) and then walked through the entire second data set once for each item in the first data set, looking for matches. Is there a better, more efficient way I could have done this? -- http://mail.python.org/mailman/listinfo/python-list
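One non-SQL way to avoid walking the whole second set for every item in the first is to index the first set by its key pair and only compare timestamps within a bucket. A sketch, assuming the records are plain tuples and the timestamps are numbers of seconds:

    from collections import defaultdict

    # index the first data set by (K1, K2); duplicates simply pile up in the list
    by_key = defaultdict(list)
    for k1, k2, ts in first_data_set:
        by_key[(k1, k2)].append(ts)

    matches = []
    for k3, k4, k5, ts in second_data_set:
        for t in by_key.get((k3, k4), ()):
            if abs(t - ts) <= 20:
                matches.append((k3, k4, k5, ts))
                break

Duplicates are not a problem here because each key maps to a list of timestamps rather than a single value.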
parsing nested unbounded XML fields with ElementTree
I have an XML file that has an element called "Node". These can be nested to any depth and the depth of the nesting is not known to me. I need to parse the file and preserve the nesting. For example, if the XML file had:

<Node Name="A">
  <Node Name="B">
    <Node Name="C">
      <Node Name="D">
        <Node Name="E"/>
      </Node>
    </Node>
  </Node>
</Node>

When I'm parsing Node "E" I need to know I'm in A/B/C/D/E. Problem is I don't know how deep this can be. This is the code I have so far:

nodes = []

def parseChild(c):
    if c.tag == 'Node':
        if 'Name' in c.attrib:
            nodes.append(c.attrib['Name'])
        for c1 in c:
            parseChild(c1)
    else:
        for node in nodes:
            print node,
        print c.tag

for parent in tree.getiterator():
    for child in parent:
        for x in child:
            parseChild(x)

My problem is that I don't know when I'm done with a node and I should remove a level of nesting. I would think this is a fairly common situation, but I could not find any examples of parsing a file like this. Perhaps I'm going about it completely wrong. -- https://mail.python.org/mailman/listinfo/python-list
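One way to know when a Node is finished is to carry the path down through the recursion instead of keeping a single global list: each call extends its own copy, so nothing ever has to be popped. A sketch with ElementTree; the file name and the printing are placeholders, and it assumes each Node carries a Name attribute as in the code above:

    import xml.etree.ElementTree as ET

    def walk(elem, path):
        if elem.tag == 'Node':
            path = path + [elem.attrib.get('Name', '?')]
            print('/'.join(path))      # e.g. A/B/C/D/E when Node "E" is reached
        for child in elem:
            walk(child, path)          # the extended path is seen only by this subtree

    tree = ET.parse('nodes.xml')       # placeholder file name
    walk(tree.getroot(), [])

The alternative is to keep the append but add a matching nodes.pop() after the loop over children; passing the path as an argument just makes that bookkeeping automatic.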
python script hangs when run from subprocess
I have a python script and when I run it directly from the command line it runs to completion. But I need to run it from another script. I do that like this:

p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
rv = p.wait()
out_buf = p.stdout.read()

When I do this, wait never returns. If I trace the underlying script it's always in the same write to stderr that never seems to complete:

write(2, "/KA22/05Feb12/Images/12063LBO003"..., 24457

I run many other scripts and commands in the same manner, and they all complete, it's just this one. Anyone have any ideas why this is happening, and how I can further debug or fix this? TIA! -larry -- https://mail.python.org/mailman/listinfo/python-list
Re: python script hangs when run from subprocess
On Saturday, September 7, 2013 5:19:25 AM UTC-6, Peter Otten wrote:
> larry.mart...@gmail.com wrote:
> > I have a python script and when I run it directly from the command line it runs to completion. But I need to run it from another script. I do that like this:
> >
> > p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
> > rv = p.wait()
> > out_buf = p.stdout.read()
> >
> > When I do this, wait never returns. If I trace the underlying script it's always in the same write to stderr that never seems to complete:
> >
> > write(2, "/KA22/05Feb12/Images/12063LBO003"..., 24457
> >
> > I run many other scripts and commands in the same manner, and they all complete, it's just this one. Anyone have any ideas why this is happening, and how I can further debug or fix this?
>
> The script writes to an OS buffer, and when that buffer is full it blocks forever. p.wait() in turn then waits forever for the script to terminate...
>
> As a fix try
>
> out_buf = subprocess.Popen(...).communicate()[0]
>
> This uses threads or select (depending on the OS) to avoid the problem -- and is prominently mentioned in the documentation:
>
> http://docs.python.org/2/library/subprocess.html#subprocess.Popen.wait

Thanks. I hadn't seen this. I'll check it out. -- https://mail.python.org/mailman/listinfo/python-list
Re: python script hangs when run from subprocess
On Saturday, September 7, 2013 9:47:47 AM UTC-6, Nobody wrote: > On Sat, 07 Sep 2013 03:55:02 -0700, larry.mart...@gmail.com wrote: > > > > > I have a python script and when I run it directly from the command line > > > it runs to completion. But I need to run it from another script. I do > > > that like this: > > > > > > p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) > > > rv = p.wait() > > > out_buf = p.stdout.read() > > > > > > When I do this, wait never returns. > > > > The last two statements are the wrong way around. If you're reading a > > process' output via a pipe, you shouldn't wait() for it until it has > > closed its end of the pipe. > > > > As it stands, you have a potential deadlock. If the subprocess tries to > > write more data than will fit into the pipe, it will block until the > > parent reads from the pipe. But the parent won't read from the pipe until > > after the subprocess has terminated, which won't happen because the > > subprocess is blocked waiting for the parent to read from the pipe ... Thanks. I reversed the order of the wait and read calls, and it no longer hangs. -- https://mail.python.org/mailman/listinfo/python-list
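For completeness, the communicate() pattern Peter pointed at avoids the ordering question entirely, since it reads the pipe and waits in one call. A sketch, with cmd standing in for the real command line:

    import subprocess

    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out_buf, _ = p.communicate()   # drains the pipe to EOF, then reaps the process
    rv = p.returncode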
setup.py not found
I'm trying to install a package (cx_Oracle) on a mac running 10.5.7. I've done this on other platforms, but never on a mac. I followed the instructions given, but when I try and run setup I get:

Apollo:instantclient_10_2 user$ python setup.py build
/System/Library/Frameworks/Python.framework/Versions/2.5/Resources/Python.app/Contents/MacOS/Python: can't open file 'setup.py': [Errno 2] No such file or directory

Is there something else I have to install first to have this? Thanks! -- http://mail.python.org/mailman/listinfo/python-list