I'm building a file parser but I have a problem I'm not sure how to solve. The files this will parse have the potential to be huge (multiple GBs). There are distinct sections of the file that I want to read into separate dictionaries to perform different operations on. Each section has specific begin and end statements like the following:
    KEYWORD
    . . .
    END KEYWORD

The very first thing I do is read the entire file contents into a string. I then store the contents in a list, splitting on line ends as follows:

    file_lines = file_contents.split('\n')

Next, I build smaller lists from the different sections using the begin and end keywords:

    begin_index = file_lines.index(begin_keyword)
    end_index = file_lines.index(end_keyword)
    small_list = file_lines[begin_index + 1:end_index]

I then plan on parsing each list to build the different dictionaries.

The problem is that one begin statement is a substring of another begin statement, as in the following example:

    BAR
    . . .
    END BAR
    FOOBAR
    . . .
    END FOOBAR

I can't just look for the line in the list that contains BAR, because FOOBAR might come first in the list. My list would then look like

    [foobar_1, foobar_2, ..., foobar_n, ..., bar_1, bar_2, ..., bar_m]

I don't really want to use regular expressions, but I don't see a way around this without doing so. Does anyone have any suggestions on how to accomplish this? If regexps are the way to go, is there an efficient way to parse the contents of a potentially large list using regular expressions?

Any help is appreciated!

Thanks,
Aaron

--
"'Tis better to be silent and be thought a fool, than to speak and remove all doubt." -- Abraham Lincoln
--
http://mail.python.org/mailman/listinfo/python-list
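[Editor's note: the approach and the substring pitfall described above can be reproduced in a few lines. This is only a sketch; the sample text is built from the BAR/FOOBAR example, and the variable names are illustrative.]

```python
# Sample input in the shape described: a FOOBAR section followed by a BAR
# section, each delimited by its own begin/end keywords.
file_contents = """\
FOOBAR
foobar_1
foobar_2
END FOOBAR
BAR
bar_1
bar_2
END BAR
"""

file_lines = file_contents.split('\n')

# list.index() matches only a line that EQUALS the keyword exactly, so it
# skips 'FOOBAR' when asked for 'BAR' and the slice picks up the right lines:
begin_index = file_lines.index('BAR')
end_index = file_lines.index('END BAR')
small_list = file_lines[begin_index + 1:end_index]
print(small_list)  # ['bar_1', 'bar_2']

# The pitfall: a substring/containment test finds the FOOBAR line first,
# which is the failure mode described in the post.
first_containing_bar = next(i for i, line in enumerate(file_lines)
                            if 'BAR' in line)
print(first_containing_bar)  # 0 -- the FOOBAR line, not the BAR line
```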