jog <[EMAIL PROTECTED]> wrote:
> Hi,
> I want to get text out of some nodes of a huge xml file (1,5 GB). The
> architecture of the xml file is something like this
> <parent>
>    <page>
>     <title>bla</title>
>     <id></id>
>     <revision>
>       <id></id>
>       <text>blablabla</text>
>     </revision>
>    </page>
>    <page>
>    </page>
>     ....
> </parent>
> I want to combine the text out of page:title and page:revision:text for
> every single page element. One by one I want to index these combined
> texts (so for each page one index)
> What is the most efficient API for that: SAX (I don't think so), DOM,
> or pulldom?
> Or should I just use XPath somehow?
> I don't want to do anything else with this XML file afterwards.
> I hope someone will understand me.....
> Thank you very much
> Jog

I would use the Expat interface from Python, Awk, or even the Bash shell.
I'm most familiar with the shell interface to Expat, which would go
something like
    
    start()             # Usage: start tag att=value ...
    {
        case $1 in
            page) unset title text ;;
        esac
    }
    data()              # Usage: data text
    {
        case ${XML_TAG_STACK[0]}.${XML_TAG_STACK[1]}.${XML_TAG_STACK[2]} in
            title.page.*) title=$1 ;;
            text.revision.page) text=$1 ;;
        esac
    }
    end()               # Usage: end tag
    {
        case $1 in
            page) echo "title=$title text=$text" ;;
        esac
    }
    expat -s start -d data -e end < file.xml
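
The same callbacks map fairly directly onto Python's standard
xml.parsers.expat module.  What follows is only a rough sketch, not tested
against a real 1.5 GB dump: the names parse_pages and on_page are made up
for illustration, and it assumes the page/title and page/revision/text
layout shown in the question.  Expat may hand over one text node in
several chunks, so the sketch collects the chunks and joins them when the
page closes.

```python
import io
import xml.parsers.expat


def parse_pages(stream, on_page):
    """Call on_page(title, text) once for every <page> in a binary stream.

    parse_pages and on_page are illustrative names, not part of any library.
    """
    stack = []                              # open element names, outermost first
    parts = {"title": [], "text": []}       # text chunks for the current page

    def start(name, attrs):
        stack.append(name)
        if name == "page":                  # new page: forget the previous one
            parts["title"].clear()
            parts["text"].clear()

    def data(chunk):
        # One text node can arrive in several pieces, so accumulate them.
        if stack[-2:] == ["page", "title"]:
            parts["title"].append(chunk)
        elif stack[-3:] == ["page", "revision", "text"]:
            parts["text"].append(chunk)

    def end(name):
        stack.pop()
        if name == "page":                  # page complete: hand it over
            on_page("".join(parts["title"]), "".join(parts["text"]))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = data
    parser.EndElementHandler = end
    parser.ParseFile(stream)                # reads the stream piece by piece


if __name__ == "__main__":
    # Tiny in-memory stand-in for the real file.
    demo = io.BytesIO(
        b"<parent><page><title>bla</title><id></id>"
        b"<revision><id></id><text>blablabla</text></revision>"
        b"</page></parent>"
    )
    parse_pages(demo, lambda title, text: print(f"title={title} text={text}"))
```

For the real file, open it in binary mode (open("file.xml", "rb")) and
pass a callback that feeds each title/text pair to the indexer.  Only the
current page is held in memory, which is the point of using Expat here.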

-- 
William Park <[EMAIL PROTECTED]>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
           http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
          http://freshmeat.net/projects/bashdiff/
-- 
http://mail.python.org/mailman/listinfo/python-list
