jog <[EMAIL PROTECTED]> wrote:
> Hi,
> I want to get text out of some nodes of a huge xml file (1,5 GB). The
> architecture of the xml file is something like this
> <parent>
>    <page>
>     <title>bla</title>
>     <id></id>
>     <revision>
>       <id></id>
>       <text>blablabla</text>
>     </revision>
>    </page>
>    <page>
>    </page>
>     ....
> </parent>
> I want to combine the text out of page:title and page:revision:text for
> every single page element. One by one I want to index these combined
> texts (so for each page one index)
> What is the most efficient API for that: SAX (I don't think so), DOM
> or pulldom?
> Or should I just use XPath somehow?
> I don't want to do anything else with this xml file afterwards.
> I hope someone will understand me.....
> Thank you very much
> Jog

I would use the Expat interface from Python, Awk, or even the Bash shell.  I'm
most familiar with the shell interface to Expat, which would go something
like this:
    start()             # Usage: start tag att=value ...
    {
        case $1 in
            page) unset title text ;;
        esac
    }
    data()              # Usage: data text
    {
        # Patterns assume XML_TAG_STACK[0] is the innermost open tag.
        case ${XML_TAG_STACK[0]}.${XML_TAG_STACK[1]}.${XML_TAG_STACK[2]} in
            title.page.*) title=$1 ;;
            text.revision.page) text=$1 ;;
        esac
    }
    end()               # Usage: end tag
    {
        case $1 in
            page) echo "title=$title text=$text" ;;
        esac
    }
    expat -s start -d data -e end < file.xml
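
The same start/data/end handler idea carries over to Python's Expat binding
(xml.parsers.expat).  What follows is only a rough sketch, not a tested
solution for the 1.5 GB file: the names iter_pages and chunk_size are made up
for illustration, the input is assumed to be opened in binary mode, and the
layout is assumed to be the one in the question with <revision> properly
closed.  It feeds the parser in chunks, so memory use stays flat regardless of
file size.

```python
import io
import xml.parsers.expat


def iter_pages(stream, chunk_size=64 * 1024):
    """Yield (title, text) for every <page>, reading `stream` in chunks.

    `stream` is assumed to be a binary file-like object.
    """
    stack = []                          # open element names, innermost last
    parts = {"title": [], "text": []}   # pieces for the current <page>
    done = []                           # finished pages not yet yielded

    def start(name, attrs):
        stack.append(name)
        if name == "page":              # new page: forget the previous one
            parts["title"].clear()
            parts["text"].clear()

    def data(chunk):
        # Expat may deliver one text node in several pieces, so append.
        if stack[-2:] == ["page", "title"]:
            parts["title"].append(chunk)
        elif stack[-3:] == ["page", "revision", "text"]:
            parts["text"].append(chunk)

    def end(name):
        stack.pop()
        if name == "page":
            done.append(("".join(parts["title"]), "".join(parts["text"])))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = data
    parser.EndElementHandler = end

    while True:
        block = stream.read(chunk_size)
        if not block:
            break
        parser.Parse(block, False)      # more data will follow
        yield from done
        done.clear()
    parser.Parse(b"", True)             # signal end of document
    yield from done
    done.clear()


# Small in-memory demo; a real run would pass open("file.xml", "rb").
sample = (b"<parent><page><title>bla</title><id>1</id><revision><id>2</id>"
          b"<text>blablabla</text></revision></page><page></page></parent>")
for title, text in iter_pages(io.BytesIO(sample)):
    print("title=%s text=%s" % (title, text))
```

Each (title, text) pair can be handed to the indexer as soon as it is yielded,
so nothing beyond the current page has to be kept around.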

William Park <[EMAIL PROTECTED]>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
BashDiff: Super Bash shell