On 21 Nov 2005 13:59:12 -0800, [EMAIL PROTECTED] wrote:

>I tried the solutions you provided..these are not as robust as i
>thought would be...
>may be i should put the problem more clearly...
>
>here it goes....
>
>I have a bunch of documents and each document has a header which is
>common to all files. I read each file process it and compute the
>frequency of words in each file. now I want to ignore the header in
>each file. It is easy if the header is always at the top. but
>apparently its not. it could be at the bottom as well. So I want a
>function which goes through the file content and ignores the common
>header and return the remaining text to compute the frequencies..Also
>the header is not just one line..it includes licences and all other
>stuff and may be 50 to 60 lines as well..This "remove_header" has to be
>much more efficient as the files may be huge. As this is a very small
>part of the whole problem i dont want this to slow down my entire
>code...
>
Does this "header" have a fixed constant string at its beginning and a
similar fixed string at its end, with possibly variable text between?
And can there be multiple headers (i.e., header+ instead of header)?
Assuming this is a grammar[1] of your file:

    datafile: [leading_string] header+ [trailing_string]
    header: header_start header_middle header_end

0) is this a text file of lines? or?
1) is header_start a fixed constant string?
2) does header_start begin with the first character of a line?
3) does it end with the end of the same or
3a) subsequent line?
4) does header_end begin at the beginning of a line?
4a) like 3
4b) like 3a
5) can we ignore header_middle as never containing header_end in any
   form (e.g. in quotes or comments etc)?
6) Anything else you can think of ;-)

[1] using [x] to mean optional x and some_name to mean a string composed
by some rules given by some_name: ... (or described in prose as here ;-)
and some_name+ to mean one or more some_name. (BTW some_name would mean
exactly one, [some_name] zero or one, some_name* zero or more, and
some_name+ one or more).

What's needed is the final resolution to actual constants or patterns of
primitives. Can you define

    header_start: "The actual fixed constant character string defining the header"
    header_end: "whatever?"

Regards,
Bengt Richter
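
P.S. To show why the answers matter: if 1) is yes, header_end is also a
fixed constant string, and 5) is yes, then something like the sketch
below might be all that's needed. It is only a sketch under those
assumptions. The name remove_headers and the START/END markers are made
up for illustration, and it assumes the whole file fits in memory as one
string. It uses str.find, so it scans for the markers without splitting
the text into lines.

```python
def remove_headers(text, header_start, header_end):
    """Return text with every header_start ... header_end span removed.

    Assumes both markers are fixed constant strings and that
    header_end never occurs inside header_middle.
    """
    kept = []
    pos = 0
    while True:
        start = text.find(header_start, pos)
        if start == -1:
            break  # no more headers
        end = text.find(header_end, start + len(header_start))
        if end == -1:
            break  # unterminated header: leave the rest untouched
        kept.append(text[pos:start])    # text before this header
        pos = end + len(header_end)     # resume after this header
    kept.append(text[pos:])             # text after the last header
    return "".join(kept)


# Made-up markers and data, only to show the call.
START = "### BEGIN LICENSE ###"
END = "### END LICENSE ###"
sample = "first words\n" + START + "\nlicence text\nmore\n" + END + "\nlast words\n"
cleaned = remove_headers(sample, START, END)
```

Here cleaned keeps only the text outside the marked span, wherever the
header sits in the file. If the markers turn out to be patterns rather
than constants, the same loop could be driven by the re module instead.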