On 25/9/2013 16:06, mstagliamonte wrote: > Dear All, > > Here I am, with another newbie question. I am trying to extract some lines > from a fasta (text) file which match the headers in another file. i.e: > Fasta file: >>header1|info1:info2_info3 > general text >>header2|info1:info2_info3 > general text > > headers file: > header1|info1:info2_info3 > header2|info1:info2_info3 > > I want to create a third file, similar to the first one, but only containing > headers and text of what is listed in the second file. Also, I want to print > out how many headers were actually found from the second file to match the > first. > > I have done a script which seems to work, but with a couple of 'side effects' > Here is my script: > ------------------------------------------------------------------- > import re > class Extractor(): > > def __init__(self,headers_file, fasta_file,output_file): > with open(headers_file,'r') as inp0: > counter0=0 > container='' > inp0_bis=inp0.read().split('\n') > for x in inp0_bis: > container+=x.replace(':','_').replace('|','_') > with open(fasta_file,'r') as inp1: > inp1_bis=inp1.read().split('>') > for i in inp1_bis: > i_bis= i.split('\n') > > match = > re.search(i_bis[0].replace(':','_').replace('|','_'),container) > if match: > counter0+=1 > with open(output_file,'at') as out0: > out0.write('>'+i) > print '{} sequences were found'.format(counter0) > > ------------------------------------------------------------------- > Side effects: > 1) The very first header is written as >>header1 rather than >header1 > 2) the number of sequences found is 1 more than the ones actually found! > > Have you got any thoughts about causes/solutions? >
The cause is the same. The first line in the file starts with a "<", and you're splitting on the same. So the first item of inp1_bis is the empty string. That string is certainly contained within container, so it matches, and produces a result of ">" You can "fix" this by adding a line after the "for i in inp1_bis" line if not i: continue But it also seems to me you're doing the search backwards. if the Fasta file has a line like: >der it would be considered a match! Seems to me you'd want to only match lines which contain an entire header. -- DaveA -- https://mail.python.org/mailman/listinfo/python-list