Hi Danny,

Thanks for your reply. I have been using BioPython for a long time. I found its BLAST parser buggy (IMPO); otherwise BioPython has many cool modules.
In my case the parser did not iterate over the hits, and it turned out to be difficult for me to look into it in detail. Also, I wanted to work more on my own so that I get a better understanding of parsing XML documents.

I was about to post to the list seeking a systematic explanation of the example in the PLR:

http://www.python.org/doc/lib/dom-example.html

Frankly, it looked rather complex.

Could I request you to explain your pseudocode? It is confusing when you say to call a function within another function.

### pseudocode ###
> def parse_Hit(node):
>     ## get at the Hit_hsps element, and call parse_Hit_hsps() on it.
>
>
> def parse_Hit_hsps(node):
>     ## get all of the Hsp elements, and call parse_Hsp() on each one of
>     ## them.
>
>
> def parse_Hsp(node):
>     ## extract the query and hit coordinates out of the node.
> ######

After posting my previous question, I have been working to get the output. I wrote the following lines (ref: Jones and Drake, Python & XML):

###
from xml.dom import minidom
import sys

def findTextnodes(nodeList):
    for subnode in nodeList:
        if subnode.nodeType == subnode.ELEMENT_NODE:
            print("Element node: " + subnode.tagName)
            # recurse into the children of this element
            findTextnodes(subnode.childNodes)
        elif subnode.nodeType == subnode.TEXT_NODE:
            print("text node:" + subnode.data)

doc = minidom.parse(sys.stdin)
findTextnodes(doc.childNodes)
###

My aim: I want to extract all HSPs where there is more than one. This matters because I can build an exon structure from the result. My intended output is a tab-delimited text file with these columns:

Hit_id  Hsp_evalue  Hsp_query-from  Hsp_query-to  Hsp_hit-from  Hsp_hit-to

I will work along the lines of your suggestions. I have looked at the example that you asked me to look at, but I did not understand it. I will post my questions in my next e-mail.

Thanks.
-K

My output, although this is not what I wanted :-(

Element node: BlastOutput_query-ID
text node:lcl|1_4694
text node:
Element node: BlastOutput_query-def
text node:gi|4508026|ref|NM_003423.1| Homo sapiens zinc finger protein 43 (HTF6) (ZNF43), mRNA
text node:
Element node: BlastOutput_query-len
text node:3003
text node:
Element node: BlastOutput_param
text node:
Element node: Parameters
text node:
Element node: Parameters_expect
text node:10
text node:
Element node: Parameters_sc-match
text node:1
text node:
Element node: Parameters_sc-mismatch
text node:-3
text node:
Element node: Parameters_gap-open
text node:5
text node:
Element node: Parameters_gap-extend
text node:2
text node:
text node:
text node:
Element node: BlastOutput_iterations
text node:
Element node: Iteration
text node:
Element node: Iteration_iter-num
text node:1
text node:
Element node: Iteration_hits
text node:
Element node: Hit
text node:
Element node: Hit_num
text node:1
text node:
Element node: Hit_id
text node:gi|22814739|gb|BU508506.1|
text node:
Element node: Hit_def
text node:AGENCOURT_10094591 NIH_MGC_71 Homo sapiens cDNA clone IMAGE:6502598 5', mRNA sequence.
text node: Element node: Hit_accession text node:BU508506 text node: Element node: Hit_len text node:912 text node: Element node: Hit_hsps text node: Element node: Hsp text node: Element node: Hsp_num text node:1 text node: Element node: Hsp_bit-score text node:1485.28 text node: Element node: Hsp_score text node:749 text node: Element node: Hsp_evalue text node:0 text node: Element node: Hsp_query-from text node:715 text node: Element node: Hsp_query-to text node:1513 text node: Element node: Hsp_hit-from text node:1 text node: Element node: Hsp_hit-to text node:804 text node: Element node: Hsp_query-frame text node:1 text node: Element node: Hsp_hit-frame text node:1 text node: Element node: Hsp_identity text node:794 text node: Element node: Hsp_positive text node:794 text node: Element node: Hsp_gaps text node:5 text node: Element node: Hsp_align-len text node:804 text node: Element node: Hsp_qseq text node:TGGATTTAACCAATGTTTGCCAGCTACCCAGAGCAAAATATTTCTATTTGATAAATGTGTGAAAGCCTTTCATAAATTTTCAAATTCAAACAGACATAAGATAAGCCATACTGAAAAAAAACTTTTCAAATGCAAAGAATGTGGCAAATCATTTTGCATGCTTCCACATCTAGCTCAACATAAAATAATTCATACCAGAGTGAATTTCTGCAAATGTGAAAAATGTGGAAAAGCTTTTAACTGCCCTTCAATCATCACTAAACATAAGAGAATTAATACTGGAGAGAAACCCTACACATGTGAAGAATGTGGCAAAGTCTTTAATTGGTCCTCACGCCTTACTACACATAAAAAAAATTATACTAGATACAAACTCTACAAATGTGAAGAATGTGGCAAAGCTTTTAACAAGTCCTCAATCCTTACTACCCATAAGATAATTCGCACTGGAGAGAAATTCTACAAATGTAAAGAATGTGCCAAAGCTTTTAACCAATCCTCAAACCTTACTGAACATAAGAAAATTCATCCTGGAGAGAAACCTTACAAATGTGAAGAATGTGGCAAAGCCTTTAACTGGCCCTCAACTCTTACTAAACATAAGAGAATTCATACTGGAGAGAAACCCTACACATGTGAAGAATGTGGCAAAGCTTTTAACCAGTTCTCAAACCTTACTACACATAAGAGAATCCATACTGCAGAGAAATTCTATAAATGTACAGAATGT-GGTGAAGCTTTT-AGCCGGTCCTCAAACCTTACTAAACAT-AAGAAAATTCATACT--GAAAAGAAACCCTAC text node: Element node: Hsp_hseq text 
node:TGGATTTAACCAATGTTTGCCAGCTACCCAGAGCAAAATATTTCTATTTGATAAATGTGTGAAAGCCTTTCATAAATTTTCAAATTCAAACAGACATAAGATAAGCCATACTGAAAAAAAACTTTTCAAATGCAAAGAATGTGGCAAATCATTTTGCATGCTTCCACATCTAGCTCAACATAAAATAATTCATACCAGAGTGAATTTCTGCAAATGTGAAAAATGTGGAAAAGCTTTTAACTGCCCTTCAATCATCACTAAACATAAGAGAATTAATACTGGAGAGAAACCCTACACATGTGAAGAATGTGGCAAAGTCTTTAATTGGTCCTCACGCCTTACTACACATAAAAAAAATTATACTAGATACAAACTCTACAAATGTGAAGAATGTGGCAAAGCTTTTAACAAGTCCTCAATCCTTACTACCCATAAGATAATTCGCACTGGAGAGAAATTCTACAAATGTAAAGAATGTGCCAAAGCTTTTAACCAATCCTCAAACCTTACTGAACATAAGAAAATTCATCCTGGAGAGAAACCTTACAAATGTGAAGAATGTGGCAAAGCCTTTAACTGGCCCTCAACTCTTACTAAACATAAGAGAATTCATACTGGAGAGAAACCCTACACATGTGAAGAATGTGGCAAAGCCTTTAACCAGTTCTCAAACCTTACTACACATAAGAGAATCCATACTGCAGAGAAATTCTATAAATGTACAGAATGTGGGTGAAGCTTTTAACCCGGCCCTCAAACCTTACTAAACATAAAAAAAATTCATACTTGAAAAAGAAACCCTAC text node: Element node: Hsp_midline text node:|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| |||||||||||| | |||| |||||||||||||||||||| || |||||||||||| |||||||||||||| text node: text node: Element node: Hsp text node: Element node: Hsp_num text node:2 text node: --- Danny Yoo <[EMAIL PROTECTED]> wrote: > > > On Fri, 31 Dec 2004, kumar s wrote: > > > I am trying to parse BLAST output (Basic Local > Alignment Search Tool, > > size around more than 250 KB ). 
> > [xml text cut]
>
>
> Hi Kumar,
>
> Just as a side note: have you looked at Biopython yet?
>
>     http://biopython.org/
>
> I mention this because Biopython comes with parsers for BLAST; it's
> possible that you may not even need to touch XML parsing if the BLAST
> parsers in Biopython are sufficiently good.  Other people have already
> solved the parsing problem for BLAST: you may be able to take advantage
> of that work.
>
>
> > I wanted to parse out :
> >
> > <Hsp_query-from></Hsp_query-from>
> > <Hsp_query-to></Hsp_query-to>
> > <Hsp_hit-from></Hsp_hit-from>
> > <Hsp_hit-to></Hsp_hit-to>
>
> Ok, I see that you are trying to get the content of the High Scoring
> Pair (HSP) query and hit coordinates.
>
>
> > I wrote a very small 4 line code to obtain it.
> >
> > for bls in doc.getElementsByTagName('Hsp_num'):
> >     bls.normalize()
> >     if bls.firstChild.data >1:
> >         print(bls.firstChild.data)
>
> This might not work.  'bls.firstChild.data' is a string, not a number,
> so the expression:
>
>     bls.firstChild.data > 1
>
> is most likely buggy.  Here, try using this function to get the text out
> of an element:
>
> ###
> def get_text(node):
>     """Returns the child text contents of the node."""
>     buffer = []
>     for c in node.childNodes:
>         if c.nodeType == c.TEXT_NODE:
>             buffer.append(c.data)
>     return ''.join(buffer)
> ###
>
> (code adapted from: http://www.python.org/doc/lib/dom-example.html)
>
>
> For example:
>
> ###
> >>> import xml.dom.minidom
> >>> doc = xml.dom.minidom.parseString("<a><b>hello</b><b>world</b></a>")
> >>> for bnode in doc.getElementsByTagName('b'):
> ...     print("I see: " + get_text(bnode))
> ...
> I see: hello
> I see: world
> ###
>
>
> > Could any one help me directing how to get the elements in that tag.
>
> One way to approach structured parsing problems systematically is to
> write a function for each particular element type that you're trying to
> parse.
>
> From the sample XML that you've shown us, it appears that your document
> consists of a single 'Hit' root node.
> Each 'Hit' appears to have a 'Hit_hsps' element.  A 'Hit_hsps' element
> can have several 'Hsp's associated with it.  And a 'Hsp' element contains
> those coordinates that you are interested in.
>
>
> More formally, we can structure our parsing code to match the structure
> of the data:
>
> ### pseudocode ###
> def parse_Hit(node):
>     ## get at the Hit_hsps element, and call parse_Hit_hsps() on it.
>
>
> def parse_Hit_hsps(node):
>     ## get all of the Hsp elements, and call parse_Hsp() on each one of
>     ## them.
>
>
> def parse_Hsp(node):
>     ## extract the query and hit coordinates out of the node.
> ######
>
>
> To see another example of this kind of program structure, see:
>
>     http://www.python.org/doc/lib/dom-example.html
>
>
> Please feel free to ask more questions.  Good luck to you.
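
The "one function per element type" idea in the pseudocode above can be sketched with xml.dom.minidom roughly as follows. This is only an illustration under stated assumptions, not a definitive implementation: the element names come from the output shown in this message, but the sample document is trimmed down, the second Hsp and the second Hit contain invented values, and the helper names (first_text, hsp_rows, HSP_FIELDS, SAMPLE) are hypothetical. It also assumes the aim is to keep only hits that have more than one HSP.

```python
from xml.dom import minidom

# Hypothetical, heavily trimmed BLAST XML. Element names follow the output
# shown in this message; the second Hsp and the second Hit are invented.
SAMPLE = """<Iteration_hits>
  <Hit>
    <Hit_id>gi|22814739|gb|BU508506.1|</Hit_id>
    <Hit_hsps>
      <Hsp>
        <Hsp_num>1</Hsp_num>
        <Hsp_evalue>0</Hsp_evalue>
        <Hsp_query-from>715</Hsp_query-from>
        <Hsp_query-to>1513</Hsp_query-to>
        <Hsp_hit-from>1</Hsp_hit-from>
        <Hsp_hit-to>804</Hsp_hit-to>
      </Hsp>
      <Hsp>
        <Hsp_num>2</Hsp_num>
        <Hsp_evalue>1e-20</Hsp_evalue>
        <Hsp_query-from>1600</Hsp_query-from>
        <Hsp_query-to>1750</Hsp_query-to>
        <Hsp_hit-from>810</Hsp_hit-from>
        <Hsp_hit-to>960</Hsp_hit-to>
      </Hsp>
    </Hit_hsps>
  </Hit>
  <Hit>
    <Hit_id>hypothetical_single_hsp_hit</Hit_id>
    <Hit_hsps>
      <Hsp>
        <Hsp_num>1</Hsp_num>
        <Hsp_evalue>0.5</Hsp_evalue>
        <Hsp_query-from>10</Hsp_query-from>
        <Hsp_query-to>40</Hsp_query-to>
        <Hsp_hit-from>5</Hsp_hit-from>
        <Hsp_hit-to>35</Hsp_hit-to>
      </Hsp>
    </Hit_hsps>
  </Hit>
</Iteration_hits>"""

# Columns wanted for each HSP, in output order (after Hit_id).
HSP_FIELDS = ["Hsp_evalue", "Hsp_query-from", "Hsp_query-to",
              "Hsp_hit-from", "Hsp_hit-to"]


def get_text(node):
    """Return the concatenated text of the node's direct text children."""
    return "".join(c.data for c in node.childNodes
                   if c.nodeType == c.TEXT_NODE)


def first_text(node, tag):
    """Text of the first descendant element named `tag`, whitespace stripped."""
    return get_text(node.getElementsByTagName(tag)[0]).strip()


def parse_hsp(hsp_node):
    """Extract the e-value and the query/hit coordinates from one Hsp."""
    return [first_text(hsp_node, tag) for tag in HSP_FIELDS]


def parse_hit_hsps(hsps_node):
    """Call parse_hsp() on every Hsp element inside a Hit_hsps element."""
    return [parse_hsp(hsp) for hsp in hsps_node.getElementsByTagName("Hsp")]


def parse_hit(hit_node):
    """Find the Hit_hsps element of a Hit and prefix each row with Hit_id."""
    hit_id = first_text(hit_node, "Hit_id")
    hsps_node = hit_node.getElementsByTagName("Hit_hsps")[0]
    return [[hit_id] + row for row in parse_hit_hsps(hsps_node)]


def hsp_rows(doc, min_hsps=2):
    """Collect rows for every Hit that has at least `min_hsps` HSPs."""
    rows = []
    for hit in doc.getElementsByTagName("Hit"):
        hit_rows = parse_hit(hit)
        if len(hit_rows) >= min_hsps:
            rows.extend(hit_rows)
    return rows


doc = minidom.parseString(SAMPLE)
for row in hsp_rows(doc):
    print("\t".join(row))
```

Each function only knows about its own element and hands the children down to the next function, which is what "calling a function within another function" amounts to here. All values come back as strings, so anything that needs a numeric comparison (for example Hsp_num greater than 1) would have to be converted with int() or float() first. To read a real file instead of the sample, minidom.parse(sys.stdin) could replace the parseString() call.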
