Finding duplicate file names and modifying them based on elements of the path

2012-07-18 Thread larry.mart...@gmail.com
I have an interesting problem I'm trying to solve. I have a solution
almost working, but it's super ugly, and I know there has to be a
better, cleaner way to do it.

I have a list of path names that have this form:

/dir0/dir1/dir2/dir3/dir4/dir5/dir6/file

I need to find all the file names (basenames) in the list that are
duplicates, and for each one that is a dup, prepend dir4 to the
filename as long as the dir4/file pair is unique. If there are
multiple dir4/files in the list, then I also need to add a sequence
number based on the sorted value of dir5 (which is a date in ddMONyy
format).

For example, if my list contains:

/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3
/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1
/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2
/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1
/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3

Then I want to end up with:

/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/qwer_01_file3
/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/abcd_file1
/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2
/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/xyz_file1
/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/qwer_00_file3

My solution involves multiple maps and multiple iterations through the
data. How would you folks do this?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Finding duplicate file names and modifying them based on elements of the path

2012-07-19 Thread larry.mart...@gmail.com
On Jul 18, 6:36 pm, Simon Cropper
 wrote:
> On 19/07/12 08:20, larry.mart...@gmail.com wrote:
>
> > I have an interesting problem I'm trying to solve. I have a solution
> > almost working, but it's super ugly, and know there has to be a
> > better, cleaner way to do it.
>
> > I have a list of path names that have this form:
>
> > /dir0/dir1/dir2/dir3/dir4/dir5/dir6/file
>
> > I need to find all the file names (basenames) in the list that are
> > duplicates, and for each one that is a dup, prepend dir4 to the
> > filename as long as the dir4/file pair is unique. If there are
> > multiple dir4/files in the list, then I also need to add a sequence
> > number based on the sorted value of dir5 (which is a date in ddMONyy
> > format).
>
> > For example, if my list contains:
>
> > /dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3
> > /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1
> > /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2
> > /dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1
> > /dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3
>
> > Then I want to end up with:
>
> > /dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/qwer_01_file3
> > /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/abcd_file1
> > /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2
> > /dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/xyz_file1
> > /dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/qwer_00_file3
>
> > My solution involves multiple maps and multiple iterations through the
> > data. How would you folks do this?
>
> Hi Larry,
>
> I am making the assumption that you intend to collapse the directory
> tree and store each file in the same directory, otherwise I can't think
> of why you need to do this.

Hi Simon, thanks for the reply. It's not quite that - what I am doing
is creating a zip file with relative path names, and if there are
duplicate files, the parts of the path that are not being carried over
need to get prepended to the file names to make them unique.
>
> If this is the case, then I would...
>
> 1. import all the files into an array
> 2. parse path to extract fourth level directory name and base name.
> 3. reiterate through the array
>     3.1 check if base filename exists in recipient directory
>     3.2 if not, copy to recipient directory
>     3.3 if present, append the directory path then save
>     3.4 create log of success or failure
>
> Personally, I would not have some files with abcd_file1 and others as
> file2 because if it is important enough to store a file in a separate
> directory you should also note where file2 came from as well. When
> looking at your results at a later date you are going to have to open
> file2 (which I presume must record where it relates to) to figure out
> where it came from. If it is in the name it is easier to review.
>
> In short, consistency is the name of the game; if you are going to do it
> for some then do it for all; and finally it will be easier for others
> later to work out what you have done.

Yeah, I know, but this is for a client, and this is what they want.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Finding duplicate file names and modifying them based on elements of the path

2012-07-19 Thread larry.mart...@gmail.com
On Jul 18, 4:49 pm, Paul Rubin  wrote:
> "larry.mart...@gmail.com"  writes:
> > I have an interesting problem I'm trying to solve. I have a solution
> > almost working, but it's super ugly, and know there has to be a
> > better, cleaner way to do it. ...
>
> > My solution involves multiple maps and multiple iterations through the
> > data. How would you folks do this?
>
> You could post your code and ask for suggestions how to improve it.
> There are a lot of not-so-natural constraints in that problem, so it
> stands to reason that the code will be a bit messy.  The whole
> specification seems like an antipattern though.  You should just give a
> sensible encoding for the filename regardless of whether other fields
> are duplicated or not.  You also don't seem to address the case where
> basename, dir4, and dir5 are all duplicated.
>
> The approach I'd take for the spec as you wrote it is:
>
> 1. Sort the list on the (basename, dir4, dir5) triple, saving original
>    location (numeric index) of each item
> 2. Use itertools.groupby to group together duplicate basenames.
> 3. Within the groups, use groupby again to gather duplicate dir4's,
> 4. Within -those- groups, group by dir5 and assign sequence numbers in
>    groups where there's more than one file
> 5. Unsort to get the rewritten items back into the original order.
>
> Actual code is left as an exercise.
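
For reference, here is a rough sketch of that plan. It is not the poster's
actual code, and it assumes every path splits as /dir0/.../dir4/dir5/dir6/file,
so that dir4 and dir5 are fields 5 and 6 of path.split("/"):

import os
from itertools import groupby
from time import strptime

def rename_dups(paths):
    # Decorate each path with (basename, dir4, dir5, original index), then
    # sort on (basename, dir4, dir5-as-date) so groupby sees duplicates
    # side by side. The original index lets us restore the input order.
    keyed = sorted(
        ((os.path.basename(p), p.split("/")[5], p.split("/")[6], i, p)
         for i, p in enumerate(paths)),
        key=lambda t: (t[0], t[1], strptime(t[2], "%d%b%y")))

    out = [None] * len(paths)
    for base, base_group in groupby(keyed, key=lambda t: t[0]):
        base_group = list(base_group)
        if len(base_group) == 1:            # basename is unique: leave it alone
            out[base_group[0][3]] = base_group[0][4]
            continue
        for dir4, d4_group in groupby(base_group, key=lambda t: t[1]):
            d4_group = list(d4_group)
            for seq, (_, _, _, i, p) in enumerate(d4_group):
                head = os.path.dirname(p)
                if len(d4_group) == 1:      # dir4/basename pair is unique
                    out[i] = "%s/%s_%s" % (head, dir4, base)
                else:                       # sequence number from the dir5 sort
                    out[i] = "%s/%s_%02d_%s" % (head, dir4, seq, base)
    return out

Run on the example list above, this should yield the expected qwer_00_/qwer_01_,
abcd_ and xyz_ names while leaving file2 untouched.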

I replied to this before, but I don't see it, so if this is a duplicate,
sorry.

Thanks for the reply Paul. I had not heard of itertools. It sounds
like just what I need for this. But I am having one issue - how do you
know how many items are in each group? Without knowing that I have to
either make two passes through the data, or else work on the previous
item (once I'm in an iteration after the first, I know I have
dups). But that very quickly gets crazy with trying to keep the
previous values.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Finding duplicate file names and modifying them based on elements of the path

2012-07-19 Thread larry.mart...@gmail.com
On Jul 19, 1:02 pm, "Prasad, Ramit"  wrote:
> > > I am making the assumption that you intend to collapse the directory
> > > tree and store each file in the same directory, otherwise I can't think
> > > of why you need to do this.
>
> > Hi Simon, thanks for the reply. It's not quite this - what I am doing
> > is creating a zip file with relative path names, and if there are
> > duplicate files the parts of the path that are not be carried over
> > need to get prepended to the file names to make then unique,
>
> Depending on the file system of the client, you can hit file name
> length limits. I would think it would be better to just create
> the full structure in the zip.
>
> Just something to keep in mind, especially if you see funky behavior.

Thanks, but it's not what the client wants.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Finding duplicate file names and modifying them based on elements of the path

2012-07-19 Thread larry.mart...@gmail.com
On Jul 18, 4:49 pm, Paul Rubin  wrote:
> "larry.mart...@gmail.com"  writes:
> > I have an interesting problem I'm trying to solve. I have a solution
> > almost working, but it's super ugly, and know there has to be a
> > better, cleaner way to do it. ...
>
> > My solution involves multiple maps and multiple iterations through the
> > data. How would you folks do this?
>
> You could post your code and ask for suggestions how to improve it.
> There are a lot of not-so-natural constraints in that problem, so it
> stands to reason that the code will be a bit messy.  The whole
> specification seems like an antipattern though.  You should just give a
> sensible encoding for the filename regardless of whether other fields
> are duplicated or not.  You also don't seem to address the case where
> basename, dir4, and dir5 are all duplicated.
>
> The approach I'd take for the spec as you wrote it is:
>
> 1. Sort the list on the (basename, dir4, dir5) triple, saving original
>    location (numeric index) of each item
> 2. Use itertools.groupby to group together duplicate basenames.
> 3. Within the groups, use groupby again to gather duplicate dir4's,
> 4. Within -those- groups, group by dir5 and assign sequence numbers in
>    groups where there's more than one file
> 5. Unsort to get the rewritten items back into the original order.
>
> Actual code is left as an exercise.

Thanks very much for the reply Paul. I did not know about itertools.
This seems like it will be perfect for me. But I'm having one issue: how
do I know how many of a given basename (and similarly how many
basename/dir4s) there are? I don't know that I have to modify a file
until I've passed it, so I have to do all kinds of contortions to save
the previous one, and deal with the last one after I fall out of the
loop, and it's getting very nasty.

reports_list is the list sorted on basename, dir4, dir5 (tool is dir4,
file_date is dir5):

for file, file_group in groupby(reports_list, lambda x: x[0]):
    # if file is unique in file_group do nothing, but how can I tell if file is unique?
    for tool, tool_group in groupby(file_group, lambda x: x[1]):
        # if tool is unique for file, change file to tool_file, but how can I tell if tool is unique for file?
        for file_date, file_date_group in groupby(tool_group, lambda x: x[2]):

You can't do a len on the iterator that is returned from groupby, and
I've tried to do something with imap or  defaultdict, but I'm not
getting anywhere. I guess I can just make 2 passes through the data,
the first time getting counts. Or am I missing something about how
groupby works?

Thanks!
-larry
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Finding duplicate file names and modifying them based on elements of the path

2012-07-19 Thread larry.mart...@gmail.com
On Jul 19, 1:56 pm, Paul Rubin  wrote:
> "larry.mart...@gmail.com"  writes:
> > You can't do a len on the iterator that is returned from groupby, and
> > I've tried to do something with imap or      defaultdict, but I'm not
> > getting anywhere. I guess I can just make 2 passes through the data,
> > the first time getting counts. Or am I missing something about how
> > groupby works?
>
> I posted another reply to your other message, which reached me earlier.
> If you're still stuck, post again, though I probably won't be able to
> reply til tomorrow or the next day.

I really appreciate the offer, but I'm going to go with MRAB's
solution. It works, and I understand it ;-)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Finding duplicate file names and modifying them based on elements of the path

2012-07-19 Thread larry.mart...@gmail.com
On Jul 19, 3:32 pm, MRAB  wrote:
> On 19/07/2012 20:06, larry.mart...@gmail.com wrote:
>
> > On Jul 19, 1:02 pm, "Prasad, Ramit"  wrote:
> >> > > I am making the assumption that you intend to collapse the directory
> >> > > tree and store each file in the same directory, otherwise I can't think
> >> > > of why you need to do this.
>
> >> > Hi Simon, thanks for the reply. It's not quite this - what I am doing
> >> > is creating a zip file with relative path names, and if there are
> >> > duplicate files the parts of the path that are not be carried over
> >> > need to get prepended to the file names to make then unique,
>
> >> Depending on the file system of the client, you can hit file name
> >> length limits. I would think it would be better to just create
> >> the full structure in the zip.
>
> >> Just something to keep in mind, especially if you see funky behavior.
>
> > Thanks, but it's not what the client wants.
>
> Here's another solution, not using itertools:
>
> from collections import defaultdict
> from os.path import basename, dirname
> from time import strftime, strptime
>
> # Starting with the original paths
>
> paths = [
>      "/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3",
>      "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1",
>      "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2",
>      "/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1",
>      "/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3",
> ]
>
> def make_dir5_key(path):
>      date = strptime(path.split("/")[6], "%d%b%y")
>      return strftime("%y%b%d", date)
>
> # Collect the paths into a dict keyed by the basename
>
> files = defaultdict(list)
> for path in paths:
>      files[basename(path)].append(path)
>
> # Process a list of paths if there's more than one entry
>
> renaming = []
>
> for name, entries in files.items():
>      if len(entries) > 1:
>          # Collect the paths in each subgroup into a dict keyed by dir4
>
>          subgroup = defaultdict(list)
>          for path in entries:
>              subgroup[path.split("/")[5]].append(path)
>
>          for dir4, subentries in subgroup.items():
>              # Sort the subentries by dir5 (date)
>              subentries.sort(key=make_dir5_key)
>
>              if len(subentries) > 1:
>                  for index, path in enumerate(subentries):
>                      renaming.append((path,
> "{}/{}_{:02}_{}".format(dirname(path), dir4, index, name)))
>              else:
>                  path = subentries[0]
>                  renaming.append((path, "{}/{}_{}".format(dirname(path),
> dir4, name)))
>      else:
>          path = entries[0]
>
> for old_path, new_path in renaming:
>      print("Rename {!r} to {!r}".format(old_path, new_path))

Thanks a million MRAB. I really like this solution. It's very
understandable and it works! I had never seen .format before. I had to
add the index of the positional args to them to make it work.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Finding duplicate file names and modifying them based on elements of the path

2012-07-19 Thread larry.mart...@gmail.com
On Jul 19, 1:43 pm, Paul Rubin  wrote:
> "larry.mart...@gmail.com"  writes:
> > Thanks for the reply Paul. I had not heard of itertools. It sounds
> > like just what I need for this. But I am having 1 issue - how do you
> > know how many items are in each group?
>
> Simplest is:
>
>   for key, group in groupby(xs, lambda x:(x[-1],x[4],x[5])):
>      gs = list(group)  # convert iterator to a list
>      n = len(gs)       # this is the number of elements
>
> there is some theoretical inelegance in that it requires each group to
> fit in memory, but you weren't really going to have billions of files
> with the same basename.
>
> If you're not used to iterators and itertools, note there are some
> subtleties to using groupby to iterate over files, because an iterator
> actually has state.  It bumps a pointer and maybe consumes some input
> every time you advance it.  In a situation like the above, you've got
> some nexted iterators (the groupby iterator generating groups, and the
> individual group iterators that come out of the groupby) that wrap the
> same file handle, so bad confusion can result if you advance both
> iterators without being careful (one can consume file input that you
> thought would go to another).

It seems that if you do a list(group) you have consumed the group's
iterator. This screwed me up for a while, and seems very counter-intuitive.
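
A tiny illustration of that behavior: each group from groupby is an iterator
tied to the groupby object's shared state, so it can only be walked once
(toy data, not the report paths):

from itertools import groupby

data = ["apple", "avocado", "banana", "blueberry", "cherry"]
for letter, group in groupby(data, key=lambda w: w[0]):
    items = list(group)                   # materialize the group once
    print("%s %d %s" % (letter, len(items), items))
    print(list(group))                    # second pass over the same group: []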

> This isn't as bad as it sounds once you get used to it, but it can be
> a source of frustration at first.
>
> BTW, if you just want to count the elements of an iterator (while
> consuming it),
>
>      n = sum(1 for x in xs)
>
> counts the elements of xs without having to expand it into an in-memory
> list.
>
> Itertools really makes Python feel a lot more expressive and clean,
> despite little kinks like the above.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Finding duplicate file names and modifying them based on elements of the path

2012-07-19 Thread larry.mart...@gmail.com
On Jul 19, 7:01 pm, "larry.mart...@gmail.com"
 wrote:
> On Jul 19, 3:32 pm, MRAB  wrote:
>
> > On 19/07/2012 20:06, larry.mart...@gmail.com wrote:
>
> > > On Jul 19, 1:02 pm, "Prasad, Ramit"  wrote:
> > >> > > I am making the assumption that you intend to collapse the directory
> > >> > > tree and store each file in the same directory, otherwise I can't 
> > >> > > think
> > >> > > of why you need to do this.
>
> > >> > Hi Simon, thanks for the reply. It's not quite this - what I am doing
> > >> > is creating a zip file with relative path names, and if there are
> > >> > duplicate files the parts of the path that are not be carried over
> > >> > need to get prepended to the file names to make then unique,
>
> > >> Depending on the file system of the client, you can hit file name
> > >> length limits. I would think it would be better to just create
> > >> the full structure in the zip.
>
> > >> Just something to keep in mind, especially if you see funky behavior.
>
> > > Thanks, but it's not what the client wants.
>
> > Here's another solution, not using itertools:
>
> > from collections import defaultdict
> > from os.path import basename, dirname
> > from time import strftime, strptime
>
> > # Starting with the original paths
>
> > paths = [
> >      "/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3",
> >      "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1",
> >      "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2",
> >      "/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1",
> >      "/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3",
> > ]
>
> > def make_dir5_key(path):
> >      date = strptime(path.split("/")[6], "%d%b%y")
> >      return strftime("%y%b%d", date)
>
> > # Collect the paths into a dict keyed by the basename
>
> > files = defaultdict(list)
> > for path in paths:
> >      files[basename(path)].append(path)
>
> > # Process a list of paths if there's more than one entry
>
> > renaming = []
>
> > for name, entries in files.items():
> >      if len(entries) > 1:
> >          # Collect the paths in each subgroup into a dict keyed by dir4
>
> >          subgroup = defaultdict(list)
> >          for path in entries:
> >              subgroup[path.split("/")[5]].append(path)
>
> >          for dir4, subentries in subgroup.items():
> >              # Sort the subentries by dir5 (date)
> >              subentries.sort(key=make_dir5_key)
>
> >              if len(subentries) > 1:
> >                  for index, path in enumerate(subentries):
> >                      renaming.append((path,
> > "{}/{}_{:02}_{}".format(dirname(path), dir4, index, name)))
> >              else:
> >                  path = subentries[0]
> >                  renaming.append((path, "{}/{}_{}".format(dirname(path),
> > dir4, name)))
> >      else:
> >          path = entries[0]
>
> > for old_path, new_path in renaming:
> >      print("Rename {!r} to {!r}".format(old_path, new_path))
>
> Thanks a million MRAB. I really like this solution. It's very
> understandable and it works! I had never seen .format before. I had to
> add the index of the positional args to them to make it work.

Also, in make_dir5_key the format specifier for strftime should be
%y%m%d so they sort properly.
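
With that change, the helper reads (everything else as in MRAB's code above):

def make_dir5_key(path):
    # "08Jan12" -> "120108"; a yymmdd string sorts correctly as text
    date = strptime(path.split("/")[6], "%d%b%y")
    return strftime("%y%m%d", date)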
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: help with making my code more efficient

2012-12-20 Thread larry.mart...@gmail.com
On Thursday, December 20, 2012 5:38:03 PM UTC-7, Chris Angelico wrote:
> On Fri, Dec 21, 2012 at 11:19 AM, larry.mart...@gmail.com wrote:
> > This code works, but it takes way too long to run - e.g. when cdata has
> > 600,000 elements (which is typical for my app) it takes 2 hours for this to
> > run.
> >
> > Can anyone give me some suggestions on speeding this up?
>
> It sounds like you may have enough data to want to not keep it all in
> memory. Have you considered switching to a database? You could then
> execute SQL queries against it.

It came from a database. Originally I was getting just the data I wanted using 
SQL, but that was taking too long also. I was selecting just the messages I 
wanted, then for each one of those doing another query to get the data within 
the time diff of each. That was resulting in tens of thousands of queries. So I 
changed it to pull all the potential matches at once and then process it in 
python. 

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: help with making my code more efficient

2012-12-20 Thread larry.mart...@gmail.com
On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote:
> On 12/20/2012 07:19 PM, larry.mart...@gmail.com wrote:
> 
> > I have a list of tuples that contains a tool_id, a time, and a message. I 
> > want to select from this list all the elements where the message matches 
> > some string, and all the other elements where the time is within some diff 
> > of any matching message for that tool. 
> 
> > Here is how I am currently doing this:
> 
> No, it's not.  This is a fragment of code, without enough clues as to
> 
> what else is going.  We can guess, but that's likely to make a mess.

Of course it's a fragment - it's part of a large program and I was just showing 
the relevant parts. 

> First question is whether this code works exactly correctly?  

Yes, the code works. I end up with just the rows I want. 

> Are you only concerned about speed, not fixing features?  

Don't know what you mean by 'fixing features'. The code does what I want, it 
just takes too long.

> As far as I can tell, the logic that includes the time comparison is bogus.  

Not at all. 

> You don't do  anything there to worry about the value of tup[2], just whether 
> some
> item has a nearby time.  Of course, I could misunderstand the spec.

The data comes from a database. tup[2] is a datetime column. tdiff comes from a 
datetime.timedelta() 

> Are you making a global called 'self' ?  That name is by convention only
> used in methods to designate the instance object.  What's the attribute
> self?

Yes, self is my instance object. self.message contains the string of interest 
that I need to look for. 

> Can cdata have duplicates, and are they significant? 

No, it will not have duplicates.

> Are you including  the time building that as part of your 2 hour measurement? 

No, the 2 hours is just the time to run the 

cdata[:] = [tup for tup in cdata if determine(tup)]

> Is the list sorted in any way?

Yes, the list is sorted by tool and datetime.

> Chances are your performance bottleneck is the doubly-nested loop.  You
> have a list comprehension at top-level code, and inside it calls a
> function that also loops over the 600,000 items.  So the inner loop gets
> executed 360 billion times.  You can cut this down drastically by some
> judicious sorting, as well as by having a map of lists, where the map is
> keyed by the tool.

Thanks. I will try that.
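
A rough sketch of that restructuring, for reference (as in the original
fragment, self.message and tdiff come from the enclosing method; cdata rows
are (tool, time, message) and are already sorted by tool and datetime, so each
per-tool list of matched times comes out sorted):

from collections import defaultdict

messageTimes = defaultdict(list)            # tool -> times of matched messages
for tool, date_time, message in cdata:
    if self.message in message:
        messageTimes[tool].append(date_time)

def determine(tup):
    if self.message in tup[2]:
        return True                         # matched message
    for date_time in messageTimes[tup[0]]:  # only this tool's matched times
        if abs(date_time - tup[1]) <= tdiff:
            return True
    return False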

> > # record time for each message matching the specified message for each tool
> 
> > messageTimes = {}
>
> You're building a dictionary;  are you actually using the value (1), or
>  is only the key relevant?  

Only the keys.

> A set is a dict without a value.  

Yes, I could use a set, but I don't think that would make it measurably faster. 

> But more mportantly, you never look up anything in this dictionary.  So why
> isn't it a list?  For that matter, why don't you just use the
> messageTimes list?

Yes, it could be a list too.
 
> > for row in cdata:   # tool, time, message
> >     if self.message in row[2]:
> >         messageTimes[row[0], row[1]] = 1
> >
> > # now pull out each message that is within the time diff for each matched message
> > # as well as the matched messages themselves
> >
> > def determine(tup):
> >     if self.message in tup[2]: return True  # matched message
> >
> >     for (tool, date_time) in messageTimes:
> >         if tool == tup[0]:
> >             if abs(date_time-tup[1]) <= tdiff:
> >                 return True
> >
> >     return False
> >
> > cdata[:] = [tup for tup in cdata if determine(tup)]
> 
> 
> 
> As the code exists, there's no need to copy the list.  Just do a simple
> bind.

This statement is to remove the items from cdata that I don't want. I don't 
know what you mean by bind. I'm not familiar with that python function. 

> 
> 
> 
> >
> 
> > This code works, but it takes way too long to run - e.g. when cdata has 
> > 600,000 elements (which is typical for my app) it takes 2 hours for this to 
> > run. 
> 
> >
> 
> > Can anyone give me some suggestions on speeding this up?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: help with making my code more efficient

2012-12-21 Thread larry.mart...@gmail.com
On Thursday, December 20, 2012 8:31:18 PM UTC-7, Dave Angel wrote:
> On 12/20/2012 08:46 PM, larry.mart...@gmail.com wrote:
> 
> > On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote:
> 
> >> 
> 
> > Of course it's a fragment - it's part of a large program and I was just 
> > showing the relevant parts. 
> 
> But it seems these are methods in a class, or something, so we're
> missing context.  And you use self without it being an argument to the
> function.  Like it's a global.

I didn't show the entire method, only what I thought was relevant to my issue. 
The method is declared as:

def generate_data(self):

> 
> > 
> 
> > Yes, the code works. I end up with just the rows I want.
> 
> >> Are you only concerned about speed, not fixing features?  
> 
> > Don't know what you mean by 'fixing features'. The code does what I want, 
> > it just takes too long.
> 
> >
> 
> >> As far as I can tell, the logic that includes the time comparison is 
> >> bogus.  
> 
> > Not at all. 
> 
> >
> 
> >> You don't do  anything there to worry about the value of tup[2], just 
> >> whether some
> 
> >> item has a nearby time.  Of course, I could misunderstand the spec.
> 
> > The data comes from a database. tup[2] is a datetime column. tdiff comes 
> > from a datetime.timedelta() 
> 
> I thought that tup[1] was the datetime.  In any case, the loop makes no
> sense to me, so I can't really optimize it, just make suggestions.

Yes, tup[1] is the datetime. I mistyped last night. 

> >> Are you making a global called 'self' ?  That name is by convention only
> 
> >> used in methods to designate the instance object.  What's the attribute
> 
> >> self?
> 
> > Yes, self is my instance object. self.message contains the string of 
> > interest that I need to look for. 
> 
> >
> 
> >> Can cdata have duplicates, and are they significant? 
> 
> > No, it will not have duplicates.
> 
> >
> 
> >> Is the list sorted in any way?
> 
> > Yes, the list is sorted by tool and datetime.
> 
> >
> 
> >> Chances are your performance bottleneck is the doubly-nested loop.  You
> 
> >> have a list comprehension at top-level code, and inside it calls a
> 
> >> function that also loops over the 600,000 items.  So the inner loop gets
> 
> >> executed 360 billion times.  You can cut this down drastically by some
> 
> >> judicious sorting, as well as by having a map of lists, where the map is
> 
> >> keyed by the tool.
> 
> > Thanks. I will try that.
> 
> 
> 
> So in your first loop, you could simply split the list into separate
> lists, one per tup[0] value, and the lists as dictionary items, keyed by
> that tool string.
> 
> Then inside the determine() function, make a local ref to the particular
> 
> list for the tool.
> 
>recs = messageTimes[tup[0]]

I made that change and went from taking over 2 hours to 54 minutes. A dramatic 
improvement, but still not adequate for my app. 

> Instead of a for loop over recs, use a binary search to identify the
> first item that's >= date_time-tdiff.  Then if it's less than
> date_time+tdiff, return True, otherwise False.  Check out the bisect
> module.  Function bisect_left() should do what you want in a sorted list.

Didn't know about bisect. Thanks. I thought it would be my savior for sure. But
unfortunately when I added that, it blows up with out of memory. This was the
code I had:

times = messageTimes[tup[0]]
le = bisect.bisect_right(times, tup[1])
ge = bisect.bisect_left(times, tup[1])
return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)
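
To spell out what the two bisects do on the sorted per-tool list: they find the
nearest matched time on each side of tup[1], and the row is kept if either
neighbor is within tdiff. A toy example with ints instead of datetimes:

import bisect

times = [10, 20, 30]                 # sorted matched times for one tool
t, tdiff = 27, 5

le = bisect.bisect_right(times, t)   # 2: times[:2] are <= 27
ge = bisect.bisect_left(times, t)    # 2: times[2:] are >= 27
near = (le and t - times[le-1] <= tdiff) or (ge != len(times) and times[ge] - t <= tdiff)
print(near)                          # True: 30 is within 5 of 27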

> >>> cdata[:] = [tup for tup in cdata if determine(tup)]
> 
> >> As the code exists, there's no need to copy the list.  Just do a simple 
> >> bind.
> 
> > This statement is to remove the items from cdata that I don't want. I don't 
> > know what you mean by bind. I'm not familiar with that python function. 
> 
> Every "assignment" to a simple name is really a rebinding of that name.
> cdata = [tup for tup in cdata if determine(tup)]
> 
> will rebind the name to the new object, much quicker than copying.  If
> this is indeed a top-level line, it should be equivalent.  But if in
> fact this is inside some other function, it may violate some other
> assumptions.  In particular, if there are other names for the same
> object, then you're probably stuck with modifying it in place, using
> slice notation.

The slice notation was left over from when cdata was a tuple. Now that it's a 
list I don't need that any more. 

> BTW, a set is generally much more memory efficient than a dict, when you
> don't use the "value".  But since I think you'll be better off with a
> dict of lists, it's a moot point.

I'm going back to square one and try to do it all from SQL.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: help with making my code more efficient

2012-12-21 Thread larry.mart...@gmail.com
On Friday, December 21, 2012 10:57:19 AM UTC-7, larry@gmail.com wrote:
> On Thursday, December 20, 2012 8:31:18 PM UTC-7, Dave Angel wrote:
> > On 12/20/2012 08:46 PM, larry.mart...@gmail.com wrote: 
> > > On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote: 
> > >> 
 
> > > Of course it's a fragment - it's part of a large program and I was just 
> > > showing the relevant parts. 
 
> > But it seems these are methods in a class, or something, so we're 
> > missing context.  And you use self without it being an argument to the 
> > function.  Like it's a global.

> I didn't show the entire method, only what I thought was relevant to my 
> issue. The method is declared as:
>  
> def generate_data(self):

> > > 
 
> > > Yes, the code works. I end up with just the rows I want.
  
> > >> Are you only concerned about speed, not fixing features?  
 
> > > Don't know what you mean by 'fixing features'. The code does what I want, 
> > > it just takes too long.
 
> > >> As far as I can tell, the logic that includes the time comparison is 
> > >> bogus.  
 
> > > Not at all. 

> > >> You don't do  anything there to worry about the value of tup[2], just 
> > >> whether some
> > >> item has a nearby time.  Of course, I could misunderstand the spec.

> > > The data comes from a database. tup[2] is a datetime column. tdiff comes 
> > > from a datetime.timedelta() 
 
> > I thought that tup[1] was the datetime.  In any case, the loop makes no
> > sense to me, so I can't really optimize it, just make suggestions.
 
> Yes, tup[1] is the datetime. I mistyped last night. 

> > >> Are you making a global called 'self' ?  That name is by convention only
> > >> used in methods to designate the instance object.  What's the attribute
> > >> self?

> > > Yes, self is my instance object. self.message contains the string of 
> > > interest that I need to look for. 

> > >> Can cdata have duplicates, and are they significant? 

> > > No, it will not have duplicates.

> > >> Is the list sorted in any way?
 
> > > Yes, the list is sorted by tool and datetime.

> > >> Chances are your performance bottleneck is the doubly-nested loop.  You

> > >> have a list comprehension at top-level code, and inside it calls a

> > >> function that also loops over the 600,000 items.  So the inner loop gets

> > >> executed 360 billion times.  You can cut this down drastically by some

> > >> judicious sorting, as well as by having a map of lists, where the map is
> > >> keyed by the tool.

> > > Thanks. I will try that.

> > So in your first loop, you could simply split the list into separate 
> > lists, one per tup[0] value, and the lists as dictionary items, keyed by
> > that tool string.

> > Then inside the determine() function, make a local ref to the particular 
> > list for the tool.
> >recs = messageTimes[tup[0]]

> I made that change ant went from taking over 2 hours to 54 minutes. A 
> dramatic improvement, but still not adequate for my app. 

> > Instead of a for loop over recs, use a binary search to identify the 
> > first item that's >= date_time-tdiff.  Then if it's less than 
> > date_time+tdiff, return True, otherwise False.  Check out the bisect 
> > module.  Function bisect_left() should do what you want in a sorted list.

> Didn't know about bisect. Thanks. I thought it would be my savior for sure. 
> But unfortunaly when I added that, it blows up with out of memory. 

The out of memory error had nothing to do with using bisect. I had introduced a 
typo that I really thought would have caused a variable referenced before 
assignment error. But it did not do that, and instead somehow caused all the 
memory in my machine to get used up. When I fixed that, it worked really well 
with bisect. The code that was taking 2 hours was down to 20 minutes, and even 
better, a query that was taking 40 minutes was down to 8 seconds. 

Thanks very much for all your help. 




> This was the code I had:
>  
> times = messageTimes[tup[0]]
> 
> le = bisect.bisect_right(times, tup[1])
> 
> ge = bisect.bisect_left(times, tup[1])
> 
> return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and 
> times[ge]-tup[1] <= tdiff)
> 
> 
> 
> > >>> cdata[:] = [tup for tup in cdata if determine(tup)]
> 
> > 
> 
> > >> As the code exists, there's no need to copy the list.  Just do a simple 

Re: help with making my code more efficient

2012-12-21 Thread larry.mart...@gmail.com
On Friday, December 21, 2012 8:19:37 PM UTC-7, Dave Angel wrote:
> On 12/21/2012 03:36 PM, larry.mart...@gmail.com wrote:
> 
> > On Friday, December 21, 2012 10:57:19 AM UTC-7, larry@gmail.com wrote:
> 
> >> 
> 
> >> Didn't know about bisect. Thanks. I thought it would be my savior for 
> >> sure. But unfortunaly when I added that, it blows up with out of memory. 
> 
> > The out of memory error had nothing to do with using bisect. I had 
> > introduced a typo that I really though would have caused a variable 
> > referenced before assignment error. But it did not do that, and instead 
> > somehow caused all the memory in my machine to get used up. When I fixed 
> > that, it worked really well with bisect. The code that was taking 2 hours 
> > was down to 20 minutes, and even better, a query that was taking 40 minutes 
> > was down to 8 seconds. 
> 
> >
> 
> > Thanks very much for all your help. 
> 
> Congratulations. But you're not done. A fourfold improvement isn't
> nearly as much as I'd expect.  When you go from a n-squared algorithm to
> an n-log-n, for n of 600k, I'd be disappointed in less than a 100 fold 
> improvement.  Assuming we're talking just about time spent in the
> cdata = list-comprehension line.
> 
> It's also probably possible to simplify, and get some speedup from other
> parts of the code other than that one loop.  For example, if the bisect
> is really working right, it's probably unnecessary to even copy the
> original cdata.  You said it was sorted.  So bisect it directly, and
> skip the messageTimes intermediate data structure.  It may not help, but
> i suspect it will.
> >
> 
> >> This was the code I had:
> >> 
> >> times = messageTimes[tup[0]]
> 
> >>
> 
> >> le = bisect.bisect_right(times, tup[1])
> >> ge = bisect.bisect_left(times, tup[1])
> >> return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and 
> >> times[ge]-tup[1] <= tdiff)
> 
> 
> 
> I'm stymied making further suggestions to this fragment.  Without seeing
> the changes you apparently made to messageTimes, I can't even be sure 
> this is equivalent.  And I wonder if even looking up tup[1] is 
> worthwhile, since your caller already knows the index in cdata.  If you
> used cdata directly, you might not need a sort at all.
>  
> I wish you could post some sample data, and identify which records are 
> intended to be True.  Obviously you could simplify, and just use ints 
> where it's using datetimes here.  But if your rule is simply accept all
> records that come in a cluster with no delta more than a threshold, then
> you don't even need any search at all.  Just pass the index in, and
> compare the datetime with its predecessor and successor.
> 
> Could we see the whole fragment as you originally had it, but with the 
> optimizations you've done so far?  The more I look at that determine() 
> function, the more futile it seems.  I just don't understand what the
> criteria are supposed to be.  Your list-comp loops through all of cdata,
> and for each item, passes that item to determine().  determine() loops
> through a copy of cdata, checking to see if any of them is within tdiff
> of tup.  But won't that always return True, since you'll encounter the
> record and compare it to itself?

I think you're misunderstanding what I need to do. I have a set of rows from 
the database with tool, time, and message. The user has specified a string and 
a time threshold. From that set of rows I need to return all the rows that 
contain the user's string and all the other rows that match the tool from the 
matched rows and have a time within the threshold.

cdata has all the rows. messageTimes has the times of all the matched messages, 
keyed by tool. In determine() I don't look through cdata - it gets one element 
from cdata and I see if that should be selected because it either matches the 
user's message, or it's within the threshold of one that did match.

Here's my original code:

# record time for each message matching the specified message for each tool
messageTimes = {}
for row in cdata:   # tool, time, message
    if self.message in row[2]:
        messageTimes[row[0], row[1]] = 1

# now pull out each message that is within the time diff for each matched message
# as well as the matched messages themselves

def determine(tup):
    if self.message in tup[2]: return True  # matched message

    for (tool, date_time) in messageTimes:
        if tool == tup[0]:
            if abs(date_time-tup[1]) <= tdiff:
                return True

    return False

cdata[:] = [tup for tup in cdata if determine(tup)]

Re: help with making my code more efficient

2012-12-24 Thread larry.mart...@gmail.com
On Friday, December 21, 2012 11:47:10 PM UTC-7, Dave Angel wrote:
> On 12/21/2012 11:47 PM, larry.mart...@gmail.com wrote: 
> > On Friday, December 21, 2012 8:19:37 PM UTC-7, Dave Angel wrote:
> >> On 12/21/2012 03:36 PM, larry.mart...@gmail.com wrote:
> >>
> > I think you're misunderstanding what I need to do. I have a set of rows 
> > from the database with tool, time, and message. The user has specified a 
> > string and a time threshold. From that set of rows I need to return all the 
> > rows that contain the user's string and all the other rows that match the 
> > tool from the matched rows and have a time within the threshold. 
> >
> > cdata has all the rows. messageTimes has the times of all the matched 
> > messages, keyed by tool. In determine() I don't look though cdata - it gets 
> > one element from cdata and I see if that should be selected because it 
> > either matches the user's message, or it's within the threshold of one that 
> > did match.
> > 
> > Here's my original code: 
> >
> > # record time for each message matching the specified message for each tool
> > messageTimes = {}
> > for row in cdata:   # tool, time, message
> >     if self.message in row[2]:
> >         messageTimes[row[0], row[1]] = 1
> >
> > # now pull out each message that is within the time diff for each matched message
> > # as well as the matched messages themselves
> >
> > def determine(tup):
> >     if self.message in tup[2]: return True  # matched message
> >
> >     for (tool, date_time) in messageTimes:
> >         if tool == tup[0]:
> >             if abs(date_time-tup[1]) <= tdiff:
> >                 return True
> >
> >     return False
> >
> > cdata[:] = [tup for tup in cdata if determine(tup)]
> >
> > Here's the code now: 
> > 
> > 
> > # Parse data and make a list of the time for each message matching
> > # the specified message for each tool
> > messageTimes = defaultdict(list)    # a dict with sorted lists
> >
> > for row in cdata:   # tool, time, message
> >     if self.message in row[2]:
> >         messageTimes[row[0]].append(row[1])
> >
> > # now pull out each message that is within the time context for
> > # each matched message
> > # as well as the matched messages themselves
> >
> > # return true if we should keep this message
> > def determine(tup):
> >     if self.message in tup[2]: return True   # matched message
> >     if seconds == 0: return False            # no time context specified
> >
> >     times = messageTimes[tup[0]]    # get the list of matched messages for this tool
> >
> >     le = bisect.bisect_right(times, tup[1])  # find time less than or equal to tup[1]
> >     ge = bisect.bisect_left(times, tup[1])   # find time greater than or equal to tup[1]
> >     return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)
> >
> > cdata = [tup for tup in cdata if determine(tup)]
> > 
> > In my test case, cdata started with 600k rows, 30k matched the users 
> > string, and a total of 110k needed to be returned (which is what cdata 
> > ended up with) - the 30k that matched the string, and 80k that were within 
> > the time threshold.  
> >
> > I think the point you may have missed is the tool - I only return a row if 
> > it's the same tool as a matched message and within the threshold. 
> > 
> > I hope I've explained this better. Thanks again.   
> 
> That is better, and the point I missed noticing before is that 
> messageTimes is substantially smaller than cdata;  it's already been 
> filtered down by looking for self.message in its row[2].  The code was 
> there, but I didn't relate.  Remember I was bothered that you didn't 
> look at tup[2] when you were comparing for time-similarity.  Well, you 
> did that implicitly, since messageTimes was already filtered.  Sorry 
> about that. 
> 
> That also lowers my expectations for improvement ratio, since instead of 
> 600,000 * 600,000, we're talking "only" 600,000 * 30,000, 5% as much. So
> now my expectations are only 4:1 to 10:1. 
>
> Still, there's room for improvement.  (1) You should only need one
> bisect in determi

Re: Best way to structure data for efficient searching

2012-04-02 Thread larry.mart...@gmail.com
On Mar 28, 1:52 pm, Jon Clements  wrote:
> On Wednesday, 28 March 2012 19:39:54 UTC+1, larry@gmail.com  wrote:
> > I have the following use case:
>
> > I have a set of data that is contains 3 fields, K1, K2 and a
> > timestamp. There are duplicates in the data set, and they all have to
> > processed.
>
> > Then I have another set of data with 4 fields: K3, K4, K5, and a
> > timestamp. There are also duplicates in that data set, and they also
> > all have to be processed.
>
> > I need to find all the items in the second data set where K1==K3 and
> > K2==K4 and the 2 timestamps are within 20 seconds of each other.
>
> > I have this working, but the way I did it seems very inefficient - I
> > simply put the data in 2 arrays (as tuples) and then walked through
> > the entire second data set once for each item in the first data set,
> > looking for matches.
>
> > Is there a better, more efficient way I could have done this?
>
> It might not be more *efficient* but others might find it more readable, and 
> it'd be easier to change later. Try an in-memory SQL DB (such as sqlite3) and 
> query as (untested)
>
> select t2.* from t1 join t2 on k1=k3 and k2=k4 where abs(t1.timestamp - 
> t2.timestamp) < 20

This is part of a Django app, and the data came from MySQL. Through a
mechanism I can't change at this time (it would take a lot of code
changes and this new functionality is needed ASAP) I get all the data
at once and have to winnow it down.


> Failing that, two (default)dicts with a tuple as the pair, then use that as 
> your base.

Since there are duplicates, I can't use a dict. And if I have any
extraneous data in the keys (i.e. something to make them unique) then
I still have to walk through the entire dict to find the matches.
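
For completeness, a rough sketch of the dict-of-lists variant Jon mentions -
duplicate keys simply accumulate multiple timestamps under each key, so they
are handled; data1 and data2 are placeholder names here, and the timestamps
are assumed to be plain numbers (e.g. epoch seconds):

from collections import defaultdict
import bisect

# Index the first data set: (K1, K2) -> sorted list of timestamps.
index = defaultdict(list)
for k1, k2, ts in data1:
    index[(k1, k2)].append(ts)
for ts_list in index.values():
    ts_list.sort()

# For each record in the second set, look for any indexed timestamp
# with the same key pair within 20 seconds.
matches = []
for k3, k4, k5, ts in data2:
    ts_list = index.get((k3, k4), [])
    i = bisect.bisect_left(ts_list, ts - 20)          # first timestamp >= ts - 20
    if i < len(ts_list) and ts_list[i] <= ts + 20:    # ...and also <= ts + 20
        matches.append((k3, k4, k5, ts))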
-- 
http://mail.python.org/mailman/listinfo/python-list


Best way to structure data for efficient searching

2012-04-02 Thread larry.mart...@gmail.com
I have the following use case:

I have a set of data that contains 3 fields, K1, K2 and a
timestamp. There are duplicates in the data set, and they all have to
be processed.

Then I have another set of data with 4 fields: K3, K4, K5, and a
timestamp. There are also duplicates in that data set, and they also
all have to be processed.

I need to find all the items in the second data set where K1==K3 and
K2==K4 and the 2 timestamps are within 20 seconds of each other.

I have this working, but the way I did it seems very inefficient - I
simply put the data in 2 arrays (as tuples) and then walked through
the entire second data set once for each item in the first data set,
looking for matches.

Is there a better, more efficient way I could have done this?
-- 
http://mail.python.org/mailman/listinfo/python-list


parsing nested unbounded XML fields with ElementTree

2013-11-25 Thread larry.mart...@gmail.com
I have an XML file that has an element called "Node". These can be nested to 
any depth and the depth of the nesting is not known to me. I need to parse the 
file and preserve the nesting. For example, if the XML file had:

<Node Name="A">
  <Node Name="B">
    <Node Name="C">
      <Node Name="D">
        <Node Name="E"/>
      </Node>
    </Node>
  </Node>
</Node>

When I'm parsing Node "E" I need to know I'm in A/B/C/D/E. Problem is I don't 
know how deep this can be. This is the code I have so far:

nodes = []

def parseChild(c):
    if c.tag == 'Node':
        if 'Name' in c.attrib:
            nodes.append(c.attrib['Name'])
        for c1 in c:
            parseChild(c1)
    else:
        for node in nodes:
            print node,
        print c.tag

for parent in tree.getiterator():
    for child in parent:
        for x in child:
            parseChild(x)

My problem is that I don't know when I'm done with a node and I should remove a 
level of nesting. I would think this is a fairly common situation, but I could 
not find any examples of parsing a file like this. Perhaps I'm going about it 
completely wrong.
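
One common pattern for this, sketched here under assumptions (hypothetical
file name, structure not tested against the real data): push the name on the
way into a Node and pop it after recursing into its children, so the list
always reflects the current nesting:

import xml.etree.ElementTree as ET

tree = ET.parse('nodes.xml')    # hypothetical file name
nodes = []                      # names of the Node elements currently open

def parseChild(c):
    if c.tag == 'Node':
        has_name = 'Name' in c.attrib
        if has_name:
            nodes.append(c.attrib['Name'])   # entering this Node
        for c1 in c:
            parseChild(c1)
        if has_name:
            nodes.pop()                      # done with this Node: drop a level
    else:
        print '/'.join(nodes), c.tag         # e.g. A/B/C/D/E SomeTag

parseChild(tree.getroot())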
-- 
https://mail.python.org/mailman/listinfo/python-list


python script hangs when run from subprocess

2013-09-07 Thread larry.mart...@gmail.com
I have a python script and when I run it directly from the command line it runs 
to completion. But I need to run it from another script. I do that like this:

p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
rv = p.wait()
out_buf = p.stdout.read()

When I do this, wait never returns. If I trace the underlying script it's 
always in the same write to stderr that never seems to complete:

write(2, "/KA22/05Feb12/Images/12063LBO003"..., 24457

I run many other scripts and commands in the same manner, and they all 
complete, it's just this one. Anyone have any ideas why this is happening, and 
how I can further debug or fix this?

TIA!
-larry
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: python script hangs when run from subprocess

2013-09-07 Thread larry.mart...@gmail.com
On Saturday, September 7, 2013 5:19:25 AM UTC-6, Peter Otten wrote:
> larry.mart...@gmail.com wrote:
>
> > I have a python script and when I run it directly from the command line it
> > runs to completion. But I need to run it from another script. I do that
> > like this:
> >
> > p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
> > rv = p.wait()
> > out_buf = p.stdout.read()
> >
> > When I do this, wait never returns. If I trace the underlying script it's
> > always in the same write to stderr that never seems to complete:
> >
> > write(2, "/KA22/05Feb12/Images/12063LBO003"..., 24457
> >
> > I run many other scripts and commands in the same manner, and they all
> > complete, it's just this one. Anyone have any ideas why this is happening,
> > and how I can further debug or fix this?
>
> The script writes to an OS buffer, and when that buffer is full it blocks
> forever. p.wait() in turn then waits forever for the script to terminate...
>
> As a fix try
>
> out_buf = subprocess.Popen(...).communicate()[0]
>
> This uses threads or select (depending on the OS) to avoid the problem --
> and is prominently mentioned in the documentation:
>
> http://docs.python.org/2/library/subprocess.html#subprocess.Popen.wait

Thanks. I hadn't seen this. I'll check it out. 
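
For the archive, the suggested pattern spelled out (cmd as in the original
post; with stderr merged into stdout, communicate() returns the combined
output and None for stderr):

import subprocess

p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
out_buf, _ = p.communicate()   # drains the pipe to EOF, then reaps the child
rv = p.returncode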
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: python script hangs when run from subprocess

2013-09-07 Thread larry.mart...@gmail.com
On Saturday, September 7, 2013 9:47:47 AM UTC-6, Nobody wrote:
> On Sat, 07 Sep 2013 03:55:02 -0700, larry.mart...@gmail.com wrote:
>
> > I have a python script and when I run it directly from the command line
> > it runs to completion. But I need to run it from another script. I do
> > that like this:
> >
> > p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
> > rv = p.wait()
> > out_buf = p.stdout.read()
> >
> > When I do this, wait never returns.
>
> The last two statements are the wrong way around. If you're reading a
> process' output via a pipe, you shouldn't wait() for it until it has
> closed its end of the pipe.
>
> As it stands, you have a potential deadlock. If the subprocess tries to
> write more data than will fit into the pipe, it will block until the
> parent reads from the pipe. But the parent won't read from the pipe until
> after the subprocess has terminated, which won't happen because the
> subprocess is blocked waiting for the parent to read from the pipe ...

Thanks. I reversed the order of the wait and read calls, and it no longer 
hangs. 
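
That is, roughly:

p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
out_buf = p.stdout.read()   # drain the pipe first, so the child can't block on a full pipe
rv = p.wait()               # only then wait for the exit status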
-- 
https://mail.python.org/mailman/listinfo/python-list


setup.py not found

2009-07-17 Thread larry.mart...@gmail.com
I'm trying to install a package (cx_Oracle) on a Mac running 10.5.7.
I've done this on other platforms, but never on a mac. I followed the
instructions given, but when I try and run setup I get:

Apollo:instantclient_10_2 user$ python setup.py build
/System/Library/Frameworks/Python.framework/Versions/2.5/Resources/
Python.app/Contents/MacOS/Python: can't open file 'setup.py': [Errno
2] No such file or directory

Is there something else I have to install first to have this?

Thanks!
-- 
http://mail.python.org/mailman/listinfo/python-list