On 2014-04-26 23:53, oyster wrote:
> I will try to explain my situation to my best, but English is not my
> native language, I don't know whether I can make it clear at last.

Your follow-up reply made much more sense and your written English is
far better than many native speakers'. :-)

> Every SECTION starts with 2 special lines; these 2 lines is special
> because they have some same characters (the length is not const for
> different section) at the beginning; these same characters is called
> the KEY for this section. For every 2 neighbor sections, they have
> different KEYs.

I suspect you have a minimum number of characters (or words) to
consider, otherwise a single character duplicated at the beginning of
the line would delimit a section, such as

  abcd
  afgh

because they share the commonality of an "a".

The code I provided earlier should give you what you describe. I've
tweaked and tested, and provided it below. Note that I require a
minimum overlap of 6 characters (MIN_LEN). It also gathers the initial
stuff (that you want to discard) under the empty key, so you can
either delete that, or ignore it.

> I need a method to split the whole text into SECTIONs and to know
> all the KEYs
>
> I have tried to solve this problem via re module

I don't think the re module will be as much help here.

-tkc

from collections import defaultdict
import itertools as it
MIN_LEN = 6
def overlap(s1, s2):
    "Given 2 strings, return the initial overlap between them"
    return ''.join(
        c1
        for c1, c2
        in it.takewhile(
            lambda pair: pair[0] == pair[1],
            it.izip(s1, s2)
            )
        )
prevline = ""
# the initial key under which preamble gets stored
output = defaultdict(list)
key = None
with open("data.txt") as f:
    for line in f:
        if len(line) >= MIN_LEN and prevline[:MIN_LEN] == line[:MIN_LEN]:
            key = overlap(prevline, line)
        output[key].append(line)
        prevline = line
for k,v in output.items():
    print str(k).center(60,'=')
    print ''.join(v)

--
https://mail.python.org/mailman/listinfo/python-list
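For anyone running this on Python 3, the approach above can be sketched
roughly as follows. This is an adaptation, not the code as posted:
it.izip becomes the built-in zip, and the loop is wrapped in a
hypothetical helper, split_sections(), that takes any iterable of lines
instead of opening data.txt itself. It also makes one assumption that
the posted loop does not: since a section is described as starting with
2 special lines, the first of the pair is moved out of the previous
section and into the new one once the second line reveals the key.

```python
from collections import defaultdict
from itertools import takewhile

MIN_LEN = 6  # minimum shared prefix length, as in the post


def overlap(s1, s2):
    """Given 2 strings, return the initial overlap between them."""
    return "".join(
        c1
        for c1, c2 in takewhile(lambda pair: pair[0] == pair[1], zip(s1, s2))
    )


def split_sections(lines, min_len=MIN_LEN):
    """Group lines by section key (hypothetical helper, not in the post).

    Lines seen before the first key pair are stored under the key None.
    """
    output = defaultdict(list)
    key = None
    prevline = ""
    for line in lines:
        if len(line) >= min_len and prevline[:min_len] == line[:min_len]:
            # Assumption: prevline was the first of the two key lines, so
            # take it back out of the section it was provisionally added to.
            if output[key] and output[key][-1] == prevline:
                output[key].pop()
            key = overlap(prevline, line)
            output[key].append(prevline)
        output[key].append(line)
        prevline = line
    # Drop sections left empty by the move above (e.g. no preamble).
    return {k: v for k, v in output.items() if v}
```

To use it on a file, pass the open file object, e.g.
split_sections(f) inside "with open("data.txt") as f:", then iterate
over the returned dict's items() to print each key and its lines. Note
that sections which are not neighbors but happen to share a key would
end up merged under that key, just as in the posted code.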