On 25/09/2013 21:06, mstagliamonte wrote:
Dear All,

Here I am, with another newbie question. I am trying to extract some lines from 
a fasta (text) file which match the headers in another file. i.e:
Fasta file:
header1|info1:info2_info3
general text
header2|info1:info2_info3
general text

headers file:
header1|info1:info2_info3
header2|info1:info2_info3

I want to create a third file, similar to the first one, but only containing 
headers and text of what is listed in the second file. Also, I want to print 
out how many headers were actually found from the second file to match the 
first.

I have done a script which seems to work, but with a couple of 'side effects'
Here is my script:
-------------------------------------------------------------------
import re
class Extractor():

     def __init__(self,headers_file, fasta_file,output_file):
         with open(headers_file,'r') as inp0:
             counter0=0
             container=''
             inp0_bis=inp0.read().split('\n')
             for x in inp0_bis:
                 container+=x.replace(':','_').replace('|','_')
             with open(fasta_file,'r') as inp1:
                 inp1_bis=inp1.read().split('>')
                 for i in inp1_bis:
                     i_bis= i.split('\n')
                     match = 
re.search(i_bis[0].replace(':','_').replace('|','_'),container)
                     if match:
                         counter0+=1
                         with open(output_file,'at') as out0:
                             out0.write('>'+i)
              print '{} sequences were found'.format(counter0)

-------------------------------------------------------------------
Side effects:
1) The very first header is written as >>header1 rather than >header1
2) the number of sequences found is 1 more than the ones actually found!

Have you got any thoughts about causes/solutions?

Thanks for your time!

Here's my version:

class Extractor():
    def __init__(self, headers_file, fasta_file, output_file):
        with open(headers_file) as inp:
            headers = set('>' + line for line in inp)

        counter = 0
        accept = False

        with open(fasta_file) as inp, open(output_file, 'w') as out:
            for line in inp:
                if line.startswith('>'):
                    accept = line in headers
                    if accept:
                        counter += 1

                if accept:
                    out.write(line)

        print '{} sequences were found'.format(counter)

--
https://mail.python.org/mailman/listinfo/python-list

Reply via email to