[EMAIL PROTECTED] wrote:
> Hi,
>
> I have a file with several entries in the form:
>
> AFFX-BioB-5_at E. coli /GEN=bioB /gb:J04423.1 NOTE=SIF
> corresponding to nucleotides 2032-2305 of /gb:J04423.1 DEF=E.coli
> 7,8-diamino-pelargonic acid (bioA), biotin synthetase (bioB),
> 7-keto-8-amino-pelargonic acid synthetase (bioF), bioC protein, and
> dethiobiotin synthetase (bioD), complete cds.
>
> 1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 /FEA=FLmRNA
> /CNT=482 /TID=Mm.17989.1 /TIER=FL+Stack /STK=281 /UG=Mm.17989 /LL=12469
> /DEF=Mus musculus chaperonin subunit 8 (theta) (Cct8), mRNA.
> /PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 /gb:BC009007.1
>
> and I would like to create a file that has only the following:
>
> AFFX-BioB-5_at /GEN=bioB /gb:J04423.1
>
> 1415785_a_at /gb:NM_009840.1 /GEN=Cct8
>
> Could anyone please tell me how can I do it?
>
> Many thanks in advance
> Sofia

Here's my first iteration:

C:\junk>type sofia.py
prefixes = ['/GEN=', '/gb:']

def extract(fname):
    f = open(fname, 'r')
    chunks = [[]]
    for line in f:
        words = line.split()
        if words:
            chunks[-1].extend(words)
        else:
            chunks.append([])
    for chunk in chunks:
        if not chunk:
            continue
        output = [chunk[0]]
        for word in chunk:
            for prefix in prefixes:
                if word.startswith(prefix):
                    output.append(word)
                    break
        print ' '.join(output)

if __name__ == "__main__":
    import sys
    extract(sys.argv[1])

C:\junk>sofia.py sofia.txt
AFFX-BioB-5_at /GEN=bioB /gb:J04423.1 /gb:J04423.1
1415785_a_at /gb:NM_009840.1 /GEN=Cct8 /gb:BC009007.1

Before I fix the duplicate in the first line, you need to say whether
you really want the /gb:BC009007.1 in the second line thrown away --
IOW, what's the rule? For each prefix, either
(1) get the first "word" that starts with that prefix, or
(2) get all unique such words.
You choose.

Cheers,
John
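
P.S. In case it helps you decide, here is a rough sketch of how the two rules might look side by side. It is only an illustration, not a finished fix: the function name select_words and its rule argument are made up for this example, and it assumes each chunk is already a list of whitespace-separated words whose first item is the probe ID.

```python
PREFIXES = ['/GEN=', '/gb:']

def select_words(chunk, prefixes=PREFIXES, rule='first'):
    """Return the probe ID plus the prefixed words picked by `rule`.

    rule='first'  : keep only the first word found for each prefix
    rule='unique' : keep every distinct matching word, in order of appearance
    (Hypothetical helper; both names are assumptions for illustration.)
    """
    output = [chunk[0]]
    seen_prefixes = set()   # prefixes already satisfied (rule 'first')
    seen_words = set()      # words already emitted (rule 'unique')
    for word in chunk[1:]:
        for prefix in prefixes:
            if not word.startswith(prefix):
                continue
            if rule == 'first' and prefix not in seen_prefixes:
                seen_prefixes.add(prefix)
                output.append(word)
            elif rule == 'unique' and word not in seen_words:
                seen_words.add(word)
                output.append(word)
            break  # a word can match at most one prefix
    return output
```

On your two sample entries, rule='first' gives exactly the lines you asked for, while rule='unique' also keeps /gb:BC009007.1 on the second one. Either way the repeated /gb:J04423.1 on the first line goes away.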