Hello, list. I have a list of sentences in text files that I use to filter out some data. I have managed the list so badly that it has become a real mess.
Let's say the list contains the sentences below:

1. "Python has been an important part of Google since the beginning, and remains so as the system grows and evolves. "
2. "Python has been an important part of Google"
3. "important part of Google"

As you can see, sentence 2 is a subset (a prefix) of sentence 1, so I don't need to keep sentence 1 on the list. (For some reason, it's no problem to have sentence 3. The only sentences I want to remove are those that begin with another sentence on the list.)

So I decided to clean up the list. I tried to do this in a simple brute-force manner, like:

---------------------------------------------------------------------------
# read the entries without their trailing newlines, shortest first
sorted_list = sorted((l.rstrip('\n') for l in open('thelist')), key=len)
for line in sorted_list[:]:
    # every other entry that begins with this one is redundant
    unneeded = [line2 for line2 in sorted_list
                if line2 != line and line2.startswith(line)]
    sorted_list = list(set(sorted_list) - set(unneeded))
    ....
---------------------------------------------------------------------------

This is far too slow to be useful, because each list is big (more than 100 MB, about 3 million lines) and I have more than 100 lists.

I'm not familiar with algorithms, data structures, or large-scale data processing, so any advice, suggestions, and recommendations will be appreciated.

Thank you in advance.
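
P.S. To pin down the result I'm after, here is a small sketch run on the three example sentences. It rests on an assumption I have not checked against the real files: after a plain alphabetical sort, any sentence that begins with another one should come right after that shorter sentence, so a single pass comparing against the last kept sentence may be enough. The name drop_prefixed is made up for illustration, and the output comes back in sorted order rather than the original order.

```python
def drop_prefixed(lines):
    """Keep only the entries that do not begin with another entry."""
    kept = []
    # Blank entries are skipped: every string starts with "", so a blank
    # line would otherwise make everything else look redundant.
    for line in sorted(l for l in lines if l):
        # Assumption: in sorted order, an entry that has a prefix in the
        # list follows the most recently kept entry, so one comparison
        # per entry is enough. Exact duplicates are dropped as well.
        if kept and line.startswith(kept[-1]):
            continue
        kept.append(line)
    return kept


sentences = [
    "Python has been an important part of Google since the beginning, "
    "and remains so as the system grows and evolves. ",
    "Python has been an important part of Google",
    "important part of Google",
]

for sentence in drop_prefixed(sentences):
    print(repr(sentence))
```

On this input, sentence 1 should be dropped while sentences 2 and 3 are kept. I have not measured how this behaves on a 3-million-line file.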