Re: [Tutor] Parsing a block of XML text

kumar s Fri, 31 Dec 2004 23:02:52 -0800

Hi Danny,
  Thanks for your reply. I have been using BioPython
for a long time. I found its BLAST parser buggy
(IMPO); otherwise BioPython has many cool modules.


In my case the parser did not iterate over hits, and it
turned out to be difficult for me to look into it in detail.
Also, I wanted to work more on my own so that I get
a better understanding of parsing XML documents.

I was about to post to the list asking for a
systematic explanation of the example in the PLR
(Python Library Reference):
http://www.python.org/doc/lib/dom-example.html

Frankly, it looked rather complex. Could I ask you to
explain your pseudocode? I find it confusing when you say
to call one function from within another function.


### pseudocode ###
> def parse_Hit(node):
>     ## get at the Hit_hsps element, and call
>     ## parse_Hit_hsps() on it.
>
>
> def parse_Hit_hsps(node):
>     ## get all of the Hsp elements, and call
>     ## parse_Hsp() on each one of them.
>
>
> def parse_Hsp(node):
>     ## extract the query and hit coordinates out of
>     ## the node.
> ######



After posting my previous question, I have been
working to get the output. I wrote the following
lines. 

(Ref: Jones and Drake - Python & XML)
from xml.dom import minidom
import sys

def findTextnodes(nodeList):
    # Walk the tree recursively, printing every element and text node.
    for subnode in nodeList:
        if subnode.nodeType == subnode.ELEMENT_NODE:
            print("Element node: " + subnode.tagName)
            # Recurse into this element's children.
            findTextnodes(subnode.childNodes)
        elif subnode.nodeType == subnode.TEXT_NODE:
            print("text node:" + subnode.data)

doc = minidom.parse(sys.stdin)
findTextnodes(doc.childNodes)

My aim:
I want to extract all HSPs whose Hsp_num is greater than 1. The
significance of this is that I can create an exon
structure based on the result.
My intended output is a tab-delimited text file with:
Hit_id
Hsp_evalue
Hsp_query-from
Hsp_query-to
Hsp_hit-from
Hsp_hit-to
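
A minimal sketch of how the tab-delimited output listed above could be produced, assuming Python 3 and the element names that appear in this post. The helper names (first_text, hsp_rows, write_tsv), the min_hsp_num parameter, and the second Hsp's values in the sample are made up for illustration; the "Hsp_num greater than 1" filter is one reading of the stated aim.

```python
import csv
import io
import sys
from xml.dom import minidom

# Small stand-in for the real BLAST XML. The first Hsp uses values from this
# post; the second Hsp's values are invented for illustration only.
SAMPLE = """<Hit>
  <Hit_id>gi|22814739|gb|BU508506.1|</Hit_id>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_evalue>0</Hsp_evalue>
      <Hsp_query-from>715</Hsp_query-from>
      <Hsp_query-to>1513</Hsp_query-to>
      <Hsp_hit-from>1</Hsp_hit-from>
      <Hsp_hit-to>804</Hsp_hit-to>
    </Hsp>
    <Hsp>
      <Hsp_num>2</Hsp_num>
      <Hsp_evalue>1e-50</Hsp_evalue>
      <Hsp_query-from>1600</Hsp_query-from>
      <Hsp_query-to>1700</Hsp_query-to>
      <Hsp_hit-from>805</Hsp_hit-from>
      <Hsp_hit-to>905</Hsp_hit-to>
    </Hsp>
  </Hit_hsps>
</Hit>"""

HSP_FIELDS = ["Hsp_evalue", "Hsp_query-from", "Hsp_query-to",
              "Hsp_hit-from", "Hsp_hit-to"]


def first_text(node, tag):
    """Return the stripped text of the first <tag> descendant, or ''."""
    found = node.getElementsByTagName(tag)
    if not found:
        return ""
    return "".join(c.data for c in found[0].childNodes
                   if c.nodeType == c.TEXT_NODE).strip()


def hsp_rows(doc, min_hsp_num=1):
    """Yield one row per Hsp whose Hsp_num is greater than min_hsp_num."""
    for hit in doc.getElementsByTagName("Hit"):
        hit_id = first_text(hit, "Hit_id")
        for hsp in hit.getElementsByTagName("Hsp"):
            # int() raises ValueError if Hsp_num is missing or not numeric.
            if int(first_text(hsp, "Hsp_num")) > min_hsp_num:
                yield [hit_id] + [first_text(hsp, f) for f in HSP_FIELDS]


def write_tsv(doc, out):
    """Write a header line and the selected Hsp rows as tab-delimited text."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["Hit_id"] + HSP_FIELDS)
    writer.writerows(hsp_rows(doc))


if __name__ == "__main__":
    write_tsv(minidom.parseString(SAMPLE), sys.stdout)
```

For a real file, minidom.parse(sys.stdin) can take the place of minidom.parseString(SAMPLE).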

I will work along the lines of your suggestions. I have looked
at the example that you pointed me to, but I did not
understand it. I will post my questions in my next
e-mail.

Thanks.
-K

My output: Although this is not what I wanted :-(

  
Element node: BlastOutput_query-ID
text node:lcl|1_4694
text node:
  
Element node: BlastOutput_query-def
text node:gi|4508026|ref|NM_003423.1| Homo sapiens zinc finger protein 43 (HTF6) (ZNF43), mRNA
text node:
  
Element node: BlastOutput_query-len
text node:3003
text node:
  
Element node: BlastOutput_param
text node:
    
Element node: Parameters
text node:
      
Element node: Parameters_expect
text node:10
text node:
      
Element node: Parameters_sc-match
text node:1
text node:
      
Element node: Parameters_sc-mismatch
text node:-3
text node:
      
Element node: Parameters_gap-open
text node:5
text node:
      
Element node: Parameters_gap-extend
text node:2
text node:
    
text node:
  
text node:
  
Element node: BlastOutput_iterations
text node:
    
Element node: Iteration
text node:
      
Element node: Iteration_iter-num
text node:1
text node:
      
Element node: Iteration_hits
text node:
        
Element node: Hit
text node:
          
Element node: Hit_num
text node:1
text node:
          
Element node: Hit_id
text node:gi|22814739|gb|BU508506.1|
text node:
          
Element node: Hit_def
text node:AGENCOURT_10094591 NIH_MGC_71 Homo sapiens cDNA clone IMAGE:6502598 5', mRNA sequence.
text node:
          
Element node: Hit_accession
text node:BU508506
text node:
          
Element node: Hit_len
text node:912
text node:
          
Element node: Hit_hsps
text node:
            
Element node: Hsp
text node:
              
Element node: Hsp_num
text node:1
text node:
              
Element node: Hsp_bit-score
text node:1485.28
text node:
              
Element node: Hsp_score
text node:749
text node:
              
Element node: Hsp_evalue
text node:0
text node:
              
Element node: Hsp_query-from
text node:715
text node:
              
Element node: Hsp_query-to
text node:1513
text node:
              
Element node: Hsp_hit-from
text node:1
text node:
              
Element node: Hsp_hit-to
text node:804
text node:
              
Element node: Hsp_query-frame
text node:1
text node:
              
Element node: Hsp_hit-frame
text node:1
text node:
              
Element node: Hsp_identity
text node:794
text node:
              
Element node: Hsp_positive
text node:794
text node:
              
Element node: Hsp_gaps
text node:5
text node:
              
Element node: Hsp_align-len
text node:804
text node:
              
Element node: Hsp_qseq
text
node:TGGATTTAACCAATGTTTGCCAGCTACCCAGAGCAAAATATTTCTATTTGATAAATGTGTGAAAGCCTTTCATAAATTTTCAAATTCAAACAGACATAAGATAAGCCATACTGAAAAAAAACTTTTCAAATGCAAAGAATGTGGCAAATCATTTTGCATGCTTCCACATCTAGCTCAACATAAAATAATTCATACCAGAGTGAATTTCTGCAAATGTGAAAAATGTGGAAAAGCTTTTAACTGCCCTTCAATCATCACTAAACATAAGAGAATTAATACTGGAGAGAAACCCTACACATGTGAAGAATGTGGCAAAGTCTTTAATTGGTCCTCACGCCTTACTACACATAAAAAAAATTATACTAGATACAAACTCTACAAATGTGAAGAATGTGGCAAAGCTTTTAACAAGTCCTCAATCCTTACTACCCATAAGATAATTCGCACTGGAGAGAAATTCTACAAATGTAAAGAATGTGCCAAAGCTTTTAACCAATCCTCAAACCTTACTGAACATAAGAAAATTCATCCTGGAGAGAAACCTTACAAATGTGAAGAATGTGGCAAAGCCTTTAACTGGCCCTCAACTCTTACTAAACATAAGAGAATTCATACTGGAGAGAAACCCTACACATGTGAAGAATGTGGCAAAGCTTTTAACCAGTTCTCAAACCTTACTACACATAAGAGAATCCATACTGCAGAGAAATTCTATAAATGTACAGAATGT-GGTGAAGCTTTT-AGCCGGTCCTCAAACCTTACTAAACAT-AAGAAAATTCATACT--GAAAAGAAACCCTAC
text node:
              
Element node: Hsp_hseq
text
node:TGGATTTAACCAATGTTTGCCAGCTACCCAGAGCAAAATATTTCTATTTGATAAATGTGTGAAAGCCTTTCATAAATTTTCAAATTCAAACAGACATAAGATAAGCCATACTGAAAAAAAACTTTTCAAATGCAAAGAATGTGGCAAATCATTTTGCATGCTTCCACATCTAGCTCAACATAAAATAATTCATACCAGAGTGAATTTCTGCAAATGTGAAAAATGTGGAAAAGCTTTTAACTGCCCTTCAATCATCACTAAACATAAGAGAATTAATACTGGAGAGAAACCCTACACATGTGAAGAATGTGGCAAAGTCTTTAATTGGTCCTCACGCCTTACTACACATAAAAAAAATTATACTAGATACAAACTCTACAAATGTGAAGAATGTGGCAAAGCTTTTAACAAGTCCTCAATCCTTACTACCCATAAGATAATTCGCACTGGAGAGAAATTCTACAAATGTAAAGAATGTGCCAAAGCTTTTAACCAATCCTCAAACCTTACTGAACATAAGAAAATTCATCCTGGAGAGAAACCTTACAAATGTGAAGAATGTGGCAAAGCCTTTAACTGGCCCTCAACTCTTACTAAACATAAGAGAATTCATACTGGAGAGAAACCCTACACATGTGAAGAATGTGGCAAAGCCTTTAACCAGTTCTCAAACCTTACTACACATAAGAGAATCCATACTGCAGAGAAATTCTATAAATGTACAGAATGTGGGTGAAGCTTTTAACCCGGCCCTCAAACCTTACTAAACATAAAAAAAATTCATACTTGAAAAAGAAACCCTAC
text node:
              
Element node: Hsp_midline
text
node:||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|||||||||||| | |||| |||||||||||||||||||| ||
||||||||||||   ||||||||||||||
text node:
            
text node:
            
Element node: Hsp
text node:
              
Element node: Hsp_num
text node:2
text node:
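
The bare "text node:" lines in the output above are whitespace-only text nodes: as far as minidom is concerned, the newlines and indentation between tags are text too. A minimal sketch, assuming Python 3, of a walk that skips them and pairs each element with its own text; the function name iter_element_text and the sample string are hypothetical.

```python
from xml.dom import minidom


def iter_element_text(node_list):
    """Yield (tagName, text) for every element that directly holds non-blank text."""
    for node in node_list:
        if node.nodeType != node.ELEMENT_NODE:
            continue  # ignore text, comment and other non-element nodes here
        # Join this element's own text children and drop surrounding whitespace.
        text = "".join(child.data for child in node.childNodes
                       if child.nodeType == child.TEXT_NODE).strip()
        if text:
            yield node.tagName, text
        # Recurse so nested elements are reported as well.
        yield from iter_element_text(node.childNodes)


SAMPLE = """<Hsp>
  <Hsp_num>1</Hsp_num>
  <Hsp_query-from>715</Hsp_query-from>
</Hsp>"""

if __name__ == "__main__":
    doc = minidom.parseString(SAMPLE)
    for tag, text in iter_element_text(doc.childNodes):
        print(tag + "\t" + text)
```

Container elements such as Hsp hold only indentation, so they produce no line of their own; only the leaf elements are printed.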

--- Danny Yoo <[EMAIL PROTECTED]> wrote:

> 
> 
> On Fri, 31 Dec 2004, kumar s wrote:
> 
> > I am trying to parse BLAST output (Basic Local Alignment Search Tool,
> > size more than 250 KB).
> 
> [xml text cut]
> 
> 
> Hi Kumar,
> 
> Just as a side note: have you looked at Biopython yet?
>
>     http://biopython.org/
>
> I mention this because Biopython comes with parsers for BLAST; it's
> possible that you may not even need to touch XML parsing if the BLAST
> parsers in Biopython are sufficiently good.  Other people have already
> solved the parsing problem for BLAST: you may be able to take advantage of
> that work.
> 
> 
> > I wanted to parse out :
> >
> > <Hsp_query-from></Hsp_query-from>
> > <Hsp_query-to></Hsp_query-to>
> > <Hsp_hit-from></Hsp_hit-from>
> > <Hsp_hit-to></Hsp_hit-to>
> 
> Ok, I see that you are trying to get the content of the High Scoring Pair
> (HSP) query and hit coordinates.
> 
> 
> 
> > I wrote a very small 4-line piece of code to obtain it.
> >
> > for bls in doc.getElementsByTagName('Hsp_num'):
> >     bls.normalize()
> >     if bls.firstChild.data >1:
> >             print bls.firstChild.data
> 
> This might not work.  'bls.firstChild.data' is a string, not a number, so
> the expression:
>
>     bls.firstChild.data > 1
>
> is most likely buggy.  Here, try using this function to get the text out
> of an element:
> 
> ###
> def get_text(node):
>     """Returns the child text contents of the node."""
>     buffer = []
>     for c in node.childNodes:
>         if c.nodeType == c.TEXT_NODE:
>             buffer.append(c.data)
>     return ''.join(buffer)
> ###
> 
> (code adapted from:
> http://www.python.org/doc/lib/dom-example.html)
> 
> 
> 
> For example:
> 
> ###
> >>> import xml.dom.minidom
> >>> doc = xml.dom.minidom.parseString("<a><b>hello</b><b>world</b></a>")
> >>> for bnode in doc.getElementsByTagName('b'):
> ...     print("I see: " + get_text(bnode))
> ...
> I see: hello
> I see: world
> ###
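
To make the string-versus-number point concrete: the text of an element has to be converted with int() before it is compared with 1. In Python 3 comparing a string with an integer raises TypeError, while older Python 2 interpreters returned an answer unrelated to the digits. A minimal sketch, assuming Python 3; the function name hsp_numbers_above and the inline sample document are made up for illustration.

```python
import xml.dom.minidom


def get_text(node):
    """Return the concatenated text of the node's direct text children."""
    return "".join(c.data for c in node.childNodes
                   if c.nodeType == c.TEXT_NODE)


def hsp_numbers_above(doc, threshold):
    """Return every Hsp_num value, as an int, that is greater than threshold."""
    numbers = []
    for node in doc.getElementsByTagName("Hsp_num"):
        value = int(get_text(node))  # int() tolerates surrounding whitespace
        if value > threshold:
            numbers.append(value)
    return numbers


doc = xml.dom.minidom.parseString(
    "<Hit_hsps>"
    "<Hsp><Hsp_num>1</Hsp_num></Hsp>"
    "<Hsp><Hsp_num>2</Hsp_num></Hsp>"
    "</Hit_hsps>"
)

if __name__ == "__main__":
    print(hsp_numbers_above(doc, 1))
```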
> 
> 
> 
> 
> > Could anyone help direct me on how to get the elements in that tag?
> 
> One way to approach structured parsing problems systematically is to write
> a function for each particular element type that you're trying to parse.
>
> From the sample XML that you've shown us, it appears that your document
> consists of a single 'Hit' root node.  Each 'Hit' appears to have a
> 'Hit_hsps' element.  A 'Hit_hsps' element can have several 'Hsp's
> associated with it.  And a 'Hsp' element contains those coordinates that you
> are interested in.
> 
> 
> More formally, we can structure our parsing code to match the structure
> of the data:
> 
> ### pseudocode ###
> def parse_Hit(node):
>     ## get at the Hit_hsps element, and call
>     ## parse_Hit_hsps() on it.
>
>
> def parse_Hit_hsps(node):
>     ## get all of the Hsp elements, and call
>     ## parse_Hsp() on each one of them.
>
>
> def parse_Hsp(node):
>     ## extract the query and hit coordinates out of
>     ## the node.
> ######
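
One way the pseudocode could be filled in, as a sketch rather than a definitive implementation. It assumes Python 3, a 'Hit' root element as described above, and coordinates that are plain integers; the helper child_elements and the returned dictionary layout are choices made for this illustration. Each function handles one element type and hands its children to the next function down, which is what "calling a function within another function" amounts to here.

```python
from xml.dom import minidom


def get_text(node):
    """Return the concatenated text of the node's direct text children."""
    return "".join(c.data for c in node.childNodes
                   if c.nodeType == c.TEXT_NODE)


def child_elements(node, tag):
    """Return the direct child elements of node whose tag name is tag."""
    return [c for c in node.childNodes
            if c.nodeType == c.ELEMENT_NODE and c.tagName == tag]


def parse_Hit(node):
    """Find each Hit_hsps child and pass it on to parse_Hit_hsps()."""
    results = []
    for hsps in child_elements(node, "Hit_hsps"):
        results.extend(parse_Hit_hsps(hsps))
    return results


def parse_Hit_hsps(node):
    """Call parse_Hsp() on every Hsp child and collect the results."""
    return [parse_Hsp(hsp) for hsp in child_elements(node, "Hsp")]


def parse_Hsp(node):
    """Extract the query and hit coordinates of one Hsp as a dict of ints."""
    coords = {}
    for tag in ("Hsp_query-from", "Hsp_query-to",
                "Hsp_hit-from", "Hsp_hit-to"):
        # Assumes each coordinate element is present exactly once.
        coords[tag] = int(get_text(child_elements(node, tag)[0]))
    return coords


# Sample built from values shown earlier in this thread.
SAMPLE = """<Hit>
  <Hit_hsps>
    <Hsp>
      <Hsp_query-from>715</Hsp_query-from>
      <Hsp_query-to>1513</Hsp_query-to>
      <Hsp_hit-from>1</Hsp_hit-from>
      <Hsp_hit-to>804</Hsp_hit-to>
    </Hsp>
  </Hit_hsps>
</Hit>"""

if __name__ == "__main__":
    doc = minidom.parseString(SAMPLE)
    for coords in parse_Hit(doc.documentElement):
        print(coords)
```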
> 
> 
> To see another example of this kind of program structure, see:
>
>     http://www.python.org/doc/lib/dom-example.html
>
>
> Please feel free to ask more questions.  Good luck to you.
> 
> 

