Hi,

I was looking at using the ElementTree parser in Python to pull out a more
or less plain-text version of a module quickly for search indexing.
Incidentally, it is quite a bit faster than calling striptext: on the ESV
and KJV, it took about 80% of the time striptext takes.
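
A minimal sketch of the idea, assuming each raw entry is wrapped in a
throwaway root element so that ElementTree accepts it as one document. The
names entry_to_text and the <v> wrapper are made up for this illustration
and are not part of any SWORD API. Note that any text inside <note> elements
would also be kept by this approach:

```python
import xml.etree.ElementTree as ET

def entry_to_text(raw):
    """Return the text of one raw OSIS entry with all markup dropped."""
    # Hypothetical <v> wrapper: gives the fragment a single root element.
    root = ET.fromstring("<v>" + raw + "</v>")
    # itertext() yields element text and tail strings in document order;
    # split/join collapses runs of whitespace into single spaces.
    return " ".join("".join(root.itertext()).split())

sample = ('So Joseph died at the age of 110.'
          '<note osisRef="Gen.50.26" n="33"></note> After they embalmed him')
print(entry_to_text(sample))
```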

I ran into problems trying it on the NETfree, however: there seem to be
trailing OSIS tags at the end of books.
For example, from Genesis 50:26:
'So Joseph died at the age of 110.<note osisRef="Gen.50.26" n="33"></note>
After they embalmed him, his body<note osisRef="Gen.50.26" n="34"></note>
was placed in a coffin in Egypt.<milestone type="line" /><milestone
type="line" /> </div> *<chapter eID="Gen.50"/></div>*'

The last two tags in bold shouldn't be there - they are unmatched anywhere,
and removing them allows parsing to work.
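
The failure can be reproduced with a shortened fragment. This is only a
sketch: the opening <div> below stands in for the one in the chapter heading,
and the parses helper and <v> wrapper are names invented for the example.

```python
import xml.etree.ElementTree as ET

def parses(raw):
    """Return True if the fragment is well-formed inside a dummy root."""
    try:
        # Hypothetical <v> wrapper so the fragment has a single root.
        ET.fromstring("<v>" + raw + "</v>")
        return True
    except ET.ParseError:
        return False

# Shortened stand-in for the entry, with the heading's <div> included.
balanced = ('<div>was placed in a coffin in Egypt.'
            '<milestone type="line" /> </div>')
# The same text with the two trailing tags appended.
with_trailing = balanced + '<chapter eID="Gen.50"/></div>'

print(parses(balanced))
print(parses(with_trailing))
```

The second fragment is rejected because the final </div> has no opening tag
left to close.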

The third-to-last tag, a div, matches a tag in the heading of the chapter.
Is the raw entry of a verse meant to be valid XML when taken by itself?
If so, this is also invalid.

God Bless,
Ben
-------------------------------------------------------------------------------------------
The Lord is not slow to fulfill his promise as some count slowness,
but is patient toward you, not wishing that any should perish,
but that all should reach repentance.
2 Peter 3:9 (ESV)
_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
