I have been thinking about the thread where the job seemed to be to read in a log file and, if some string was found, process the line before it and generate some report. Is that generally correct?
The questioner suggested they needed the entire file both as one string and as a list of strings. One suggestion, if so, was to read the entire file twice, once as a whole and once as lines. Another was to read either version ONCE and use cheaper Python methods to make the second copy. We also looked at a similar issue about using a buffer to keep the last N lines.

I thought of another tack that may be in between but still allow serious functionality.

OUTLINE: Use just the readlines version to get a list of strings representing each line. Assuming the searched text is static, there is no need for regular expressions. You can simply ask:

    if 'something' in line:

But if you find it, you may not have an index, so using enumerate (or zip) to make a tuple may be of use. You can do a list comprehension on an enumerate object to get both the indices where the requested 'something' was found and optionally the line contents (ignored), and then use the indices to look at the line (or more) before.

Here is an example using a fake program where I create the four lines and generate only the tuples needed for further processing, or just the indices for the line ABOVE where the text was found. It can be done with simple list comprehensions or made into a generator expression.

-CODE-
"""
Sample code showing how to read a (simulated) file, search for a fixed
string, and return the item number in a list of strings for further
processing, including of earlier lines.
"""
from pprint import pprint

# Make test data without use of a file.
fromfile = """alpha line one
beta line two
gamma line three
alphabet line four"""

lines = fromfile.split('\n')

print("RAW data: ", lines)  # just for illustration

# Tuples of (index, line) for every line containing 'bet'.
errors = [(index, line)
          for (index, line) in enumerate(lines)
          if 'bet' in line]

# Or just the index of the line ABOVE each match.
just_indices = [index - 1
                for (index, line) in enumerate(lines)
                if 'bet' in line]

print("ERROR tuples:")  # just for illustration
pprint(errors)
print("Just error indices:")
pprint(just_indices)
-END-CODE-

-OUTPUT-
RAW data:  ['alpha line one', 'beta line two', 'gamma line three', 'alphabet line four']
ERROR tuples:
[(1, 'beta line two'), (3, 'alphabet line four')]
Just error indices:
[0, 2]
-END-OUTPUT-

Again, this shows two ways, and only one is needed. But the next step would be to iterate over the results and process the earlier line to find whatever it is you need to report. There are many ways to do that, such as:

    for (index, ignore) in errors:

or

    for index in just_indices:

You can use a regular expression on one line at a time, and so on.

Again, the other methods mentioned work fine, and using a deque to store earlier lines in a limited buffer, while not reading the entire file into memory, would also be a good way.

Warning: the above assumes the text searched for will never be in the zeroth line. Otherwise, you need to check, as accessing index -1 may actually return the LAST line!

As stated many times, there seem to be an amazing number of ways to do anything. As an example, I mentioned using zip above. One obvious method is to zip the lines with a range object, making it look just like enumerate. A more subtle one would be to make a copy of the list of lines, the same length but with each line's contents shifted by one. Zip that to the original and you get tuples of (line 0, null), then (line 1, line 0), up to (line n, line n-1). Yes, that doubles memory use, but you can solve so much more in one somewhat more complicated list comprehension. Anyone want to guess how?
If you recall, some regular expression matches something on the previous line. Let us make believe you wrote a function called do_the_match(line, pattern) that applies the regular expression to the line of text and returns the matching text, or perhaps an empty string if not found. If you define that function, then the following code will work.

First, make my funny zip: I make lines_after as a copy of lines shifted over by one, or more exactly circularly permuted by one. The goal is to search in line_after and, if found, do the regular expression match in the line before:

    >>> lines_after = lines[1:]
    >>> lines_after.append(lines[0])
    >>> lines_after
    ['beta line two', 'gamma line three', 'alphabet line four', 'alpha line one']
    >>> lines
    ['alpha line one', 'beta line two', 'gamma line three', 'alphabet line four']
    >>> list(zip(lines_after, lines))
    [('beta line two', 'alpha line one'), ('gamma line three', 'beta line two'), ('alphabet line four', 'gamma line three'), ('alpha line one', 'alphabet line four')]

So the list comprehension looks something like this:

    matches = [do_the_match(line, pattern)
               for (line_after, line) in zip(lines_after, lines)
               if 'something' in line_after]

Need I mention the above code was run in Python 3.7.0?

I think it is important to try to learn the idioms and odd customs of a language, even if it sometimes results in more memory or CPU usage, but maybe it is better not to overly complicate the code until nobody understands it. The latter may be, in a mild way, 'elegant', but even if I made the variable names more descriptive, it might be harder to understand than a simple enumerated version or several passes in more normal iteration statements.

Feel free to comment. I have a thick skin and love to learn from others.

Avi
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
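P.S. For anyone who wants to run the "funny zip" sketch end to end, here is a self-contained version. The do_the_match body is my own stand-in (the original left it undefined): it applies re.search and returns the matched text or an empty string, and the pattern r'\w+$' (last word of the line) is just an illustrative choice:

```python
import re

def do_the_match(line, pattern):
    """Stand-in for the hypothetical do_the_match: return the text the
    regular expression matches in this line, or '' if no match."""
    found = re.search(pattern, line)
    return found.group(0) if found else ''

lines = ['alpha line one', 'beta line two',
         'gamma line three', 'alphabet line four']

# Circularly permute by one so zip pairs each line with its successor.
lines_after = lines[1:] + [lines[0]]

# Where 'bet' appears in the FOLLOWING line, run the regular
# expression on the current (earlier) line.
matches = [do_the_match(line, r'\w+$')
           for (line_after, line) in zip(lines_after, lines)
           if 'bet' in line_after]

print(matches)  # -> ['one', 'three']
```

Because of the circular shift, the last pair is (line 0, line n), so a hit in the file's first line would report the file's LAST line, which is the same wrap-around caveat noted earlier.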