Steven D'Aprano wrote:
> On Fri, 28 Oct 2005 06:22:11 -0700, [EMAIL PROTECTED] wrote:
>
>>Which is quite fast. The only problem is that the file might be huge.
>
> What *you* call huge and what *Python* calls huge may be very different
> indeed. What are you calling huge?
>
I'm not saying that it is too big for Python. I am saying that it is too
big for the systems it is going to run on. These files can be 22 MB or
5 GB or ..., depending on the situation. It might not be okay to run a
tool that claims that much memory, even if it is available.

>>I really have no need for reading the entire file into a string as I am
>>doing here. All I want is to count occurrences of this substring. Can I
>>somehow count occurrences in a file without reading it into a string
>>first?
>
> Magic?
>

That would be nice :) But you misunderstand me...

> You have to read the file into memory at some stage, otherwise how can you
> see what value the bytes are?

I haven't said that I would like to scan the file without reading it. I
am just saying that the .count() functionality implemented for strings
could just as well be applied to some abstraction such as a stream (I
come from C++). In C++, the count() functionality would be separated as
much as possible from any concrete datatype (such as a string),
precisely because it is a concept that is applicable at a more abstract
level. I should be able to say "count the substring occurrences of this
stream" or "using this iterator" or something to that effect. If I could
say

    print file("filename", "rb").count("\x00\x00\x01\x00")

(or something like that) instead of the original

    print file("filename", "rb").read().count("\x00\x00\x01\x00")

it would be exactly what I am after. What is the conceptual difference?
The first solution should be at least as fast as the second. I have to
read and compare the characters anyway. I just don't need to store them
in a string. In essence, I should be able to use the "count occurrences"
functionality on more things, such as a file, or even better, a file
read through a buffer with a size specified by me.

>
> Here is another thought. What are you going to do with the count when you
> are done?
> That sounds to me like a pretty pointless result: "Hi user, the
> file XYZ has 27 occurrences of bitpattern \x00\x00\x01\x00. Would you like
> to do another file?"
>

It might sound pointless to you, but it is not pointless for my
purposes :) If you must know, the above one-liner actually counts the
number of frames in an MPEG2 file. I want to know this number for a
number of files for various reasons. I don't want it to take forever.

> If you are planning to use this count to do something, perhaps there is a
> more efficient way to combine the two steps into one -- especially
> valuable if your files really are huge.
>

Of course, but I don't need to do anything else in this case.

/David
-- 
http://mail.python.org/mailman/listinfo/python-list
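
For reference, the buffered counting asked for in the message could look roughly like the sketch below. It assumes Python 3 (where a file opened with "rb" yields bytes) rather than the Python 2 one-liner quoted above, and the names count_in_stream and chunk_size are hypothetical, not part of any standard library API. The idea is to read fixed-size chunks, carry at most len(pattern) - 1 unconsumed bytes from one read to the next so that a match straddling a chunk boundary is still found, and count non-overlapping occurrences the same way bytes.count() does.

```python
import io


def count_in_stream(stream, pattern, chunk_size=1 << 20):
    """Count non-overlapping occurrences of `pattern` in a binary stream.

    Hypothetical helper: only about chunk_size + len(pattern) bytes are
    held in memory at any time, instead of the whole file.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    keep = len(pattern) - 1  # longest prefix of a match that can end a buffer
    count = 0
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        pos = 0
        while True:
            hit = buf.find(pattern, pos)
            if hit < 0:
                break
            count += 1
            pos = hit + len(pattern)  # resume after the match (non-overlapping)
        # Carry over only bytes that could start a match straddling the next
        # read, and never bytes that already belong to a counted match.
        buf = buf[max(pos, len(buf) - keep):]
    return count


# Small in-memory check against bytes.count(), using a tiny buffer size.
sample = b"ab\x00\x00\x01\x00cd\x00\x00\x01\x00"
assert count_in_stream(io.BytesIO(sample), b"\x00\x00\x01\x00", 3) == sample.count(b"\x00\x00\x01\x00")
```

For a real file this would be called as count_in_stream(open("filename", "rb"), b"\x00\x00\x01\x00"), with chunk_size chosen to fit the memory budget of the target system. Whether this is as fast as the read().count() one-liner depends on the chunk size and the I/O layer, so it is worth timing on the actual files.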