jog <[EMAIL PROTECTED]> wrote:
> Hi,
> I want to get text out of some nodes of a huge XML file (1.5 GB). The
> architecture of the XML file is something like this:
> <parent>
>   <page>
>     <title>bla</title>
>     <id></id>
>     <revision>
>       <id></id>
>       <text>blablabla</text>
>     </revision>
>   </page>
>   <page>
>   </page>
>   ....
> </parent>
> I want to combine the text from page:title and page:revision:text for
> every single page element. One by one, I want to index these combined
> texts (one index for each page).
> What is the most efficient API for that: SAX (I don't think so), DOM,
> or pulldom? Or should I just use XPath somehow?
> I don't want to do anything else with this XML file afterwards.
> I hope someone will understand me...
> Thank you very much
> Jog
I would use the Expat interface from Python, Awk, or even Bash shell.
I'm most familiar with the shell interface to Expat, which would go
something like

    start()  # Usage: start tag att=value ...
    {
        case $1 in
            page) unset title text ;;
        esac
    }
    data()  # Usage: data text
    {
        case ${XML_TAG_STACK[0]}.${XML_TAG_STACK[1]}.${XML_TAG_STACK[2]} in
            title.page.*)        title=$1 ;;
            text.revision.page)  text=$1 ;;
        esac
    }
    end()  # Usage: end tag
    {
        case $1 in
            page) echo "title=$title text=$text" ;;
        esac
    }

    expat -s start -d data -e end < file.xml

--
William Park <[EMAIL PROTECTED]>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
           http://home.eol.ca/~parkw/thinflash.html
BashDiff:  Super Bash shell
           http://freshmeat.net/projects/bashdiff/
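For the "Expat interface from Python" option, a minimal sketch using the
standard library's xml.parsers.expat binding might look like the
following. The file name "file.xml", the PageExtractor class name, and
the print() call standing in for the actual indexing step are my own
placeholders, not something from the original post.

    import xml.parsers.expat

    class PageExtractor:
        def __init__(self):
            self.stack = []      # names of currently open elements
            self.title = []
            self.text = []

        def start(self, name, attrs):
            self.stack.append(name)
            if name == 'page':
                self.title, self.text = [], []

        def chars(self, data):
            # Expat may deliver one text node in several chunks,
            # so accumulate pieces instead of assigning once.
            if not self.stack:
                return
            if self.stack[-1] == 'title' and 'page' in self.stack:
                self.title.append(data)
            elif self.stack[-1] == 'text' and 'revision' in self.stack:
                self.text.append(data)

        def end(self, name):
            self.stack.pop()
            if name == 'page':
                combined = ''.join(self.title) + ' ' + ''.join(self.text)
                print(combined)  # index the combined text here

    parser = xml.parsers.expat.ParserCreate()
    handler = PageExtractor()
    parser.StartElementHandler = handler.start
    parser.CharacterDataHandler = handler.chars
    parser.EndElementHandler = handler.end
    with open('file.xml', 'rb') as f:
        parser.ParseFile(f)

Because the parser streams the file, only one <page> worth of text is
held in memory at a time, which is what makes this workable on a 1.5 GB
input where DOM would not be.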