Dear All,

Here I am, with another newbie question. I am trying to extract some lines from a fasta (text) file which match the headers in another file, i.e.:

Fasta file:
>header1|info1:info2_info3
general text
>header2|info1:info2_info3
general text
Headers file:
header1|info1:info2_info3
header2|info1:info2_info3

I want to create a third file, similar to the first one, but containing only the headers and text of the entries listed in the second file. I also want to print how many headers from the second file were actually found in the first.

I have written a script which seems to work, but with a couple of 'side effects'. Here is my script:

-------------------------------------------------------------------
import re

class Extractor():
    def __init__(self,headers_file, fasta_file,output_file):
        with open(headers_file,'r') as inp0:
            counter0=0
            container=''
            inp0_bis=inp0.read().split('\n')
            for x in inp0_bis:
                container+=x.replace(':','_').replace('|','_')
            with open(fasta_file,'r') as inp1:
                inp1_bis=inp1.read().split('>')
                for i in inp1_bis:
                    i_bis= i.split('\n')
                    match = re.search(i_bis[0].replace(':','_').replace('|','_'),container)
                    if match:
                        counter0+=1
                        with open(output_file,'at') as out0:
                            out0.write('>'+i)
        print '{} sequences were found'.format(counter0)
-------------------------------------------------------------------

Side effects:
1) The very first header is written as >>header1 rather than >header1.
2) The number of sequences reported is 1 more than the number actually found!

Have you got any thoughts about causes/solutions?

Thanks for your time!

P.S.: I think I have removed the double posting... not sure...

Max
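
A possible explanation for both side effects, offered as a sketch rather than a definitive diagnosis: when the fasta text starts with '>', split('>') returns an empty string as its first element. re.search('', container) matches for an empty pattern, so that empty chunk is counted once and written out as a lone '>', which then sits in front of the real '>header1'. The version below skips empty chunks and uses exact set membership instead of a regex. It assumes each line of the headers file equals a fasta header line exactly (without the '>'). The function names and the extra 'header3' record are hypothetical, added only for illustration, and the functions work on strings so the example runs without any files.

```python
def read_wanted(headers_text):
    """Return the set of non-empty, stripped header lines."""
    return {line.strip() for line in headers_text.splitlines() if line.strip()}


def extract_records(fasta_text, wanted):
    """Return the fasta records whose header line is in `wanted`."""
    kept = []
    for chunk in fasta_text.split('>'):
        if not chunk.strip():
            # Text starting with '>' yields an empty first chunk; skip it
            # so it is neither counted nor written as a stray '>'.
            continue
        # The header is everything up to the first newline of the record.
        header = chunk.split('\n', 1)[0].strip()
        if header in wanted:
            kept.append('>' + chunk)
    return kept


# Sample data modelled on the post; 'header3' is a made-up extra record.
fasta = (
    ">header1|info1:info2_info3\ngeneral text\n"
    ">header2|info1:info2_info3\ngeneral text\n"
    ">header3|info1:info2_info3\ngeneral text\n"
)
headers = "header1|info1:info2_info3\nheader2|info1:info2_info3\n"

records = extract_records(fasta, read_wanted(headers))
found = len(records)
print('{} sequences were found'.format(found))
```

To produce the third file, ''.join(records) can be written in a single open(output_file, 'w') call. Opening in 'w' mode rather than 'at' also avoids appending to the output left over from an earlier run.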