Thank you, Terry, Dan and Dieter, for encouraging me to post here. I have
already solved the problem, albeit not very efficiently. Perhaps it is
useful to present the solution here anyway, in case someone can shed
more light on it.
My job is to parse a complicated XML document (ISO metadata) and pick up
the values of certain fields under certain conditions. For the most part
this goes well. I am working with xml.etree.ElementTree, which has proved
sufficient for most of the project. JSON is not an option within this
project.
The specific trouble was in this section, itself the child of a more
complicated parent: (for simplicity tags are renamed and namespaces removed)
<tagA>
  <tagB>
    <tagC>
      <string>Something</string>
    </tagC>
    <tagC>
      <string>Something else</string>
    </tagC>
    <tagC>
      <note>
        <title>
          <string>value</string>
        </title>
        <date0>
          <date1>
            <date2>
              <gco:Date>2020-11-06</gco:Date>
            </date2>
            <dateType>
              <code blah lots of strange things blah />
            </dateType>
          </date1>
        </date0>
      </note>
    </tagC>
  </tagB>
</tagA>
Basically, I have to get what is in tagC/string but only if the value of
tagC/note/title/string is "value". As you see, there are several tagC,
all children of tagB, but tagC can have different meanings(!). And no, I
have no control over how these XML fields are constructed.
In principle it is easy to use "findall" to get the strings for tagC:
    elem.findall("./tagA/tagB/tagC/string")
and then take the text of each result and join them in case there is
more than one tagC/string, like: "Something, Something else".
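A minimal sketch of that unconditional step, for illustration only. The <root> wrapper and the SAMPLE document are assumptions standing in for the real, namespaced metadata; "elem" is taken to be the parent of tagA, as in the findall call above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample; the real document is larger and namespaced.
SAMPLE = """
<root>
  <tagA>
    <tagB>
      <tagC><string>Something</string></tagC>
      <tagC><string>Something else</string></tagC>
    </tagB>
  </tagA>
</root>
"""

elem = ET.fromstring(SAMPLE)

# Collect the text of every tagC/string in document order, then join them.
values = [s.text for s in elem.findall("./tagA/tagB/tagC/string")]
joined = ", ".join(values)
print(joined)
```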
However, the hard part is getting those strings only when
tagC/note/title/string='value'. I was expecting to find a way of
specifying such a condition in square brackets, like
[@string='value'] or [@/tagC/note/title/string='value'], as is usual in
XPath and, to some extent, possible in xml.etree. However, this proved
difficult (at least for me). So this is the "brute" solution I implemented:
- find all children of tagA/tagB
- check if /tagA/tagB/tagC/note/title/string has "value"
- if yes find all tagA/tagB/tagC/string
In quasi-Python:
    # elem is an already parsed Element that is the parent of tagA
    strings = []
    for tag_b in elem.findall("./tagA/tagB"):
        title = tag_b.find("./tagC/note/title/string")
        # collect the strings only when the note title matches
        if title is not None and title.text == "value":
            for item in tag_b.findall("./tagC/string"):
                strings.append(item.text)
Crude, but it works. As I wrote above, I was hoping that a bracketed
clause of the type [@ ...] in the first "findall" would do the job more
efficiently, but alas my knowledge of XML is too rudimentary.
Perhaps something to tinker with in the coming weeks.
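For the bracketed clause, one possibility is the child-text predicate [tag='text'] from ElementTree's limited XPath support (the [@...] form is for attributes only). The sketch below is untested against the real metadata: the <root> wrapper, SAMPLE and the helper name strings_for_title are hypothetical, and namespaces are left out as in the simplified example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample; tag names are the renamed ones from above.
SAMPLE = """
<root>
  <tagA>
    <tagB>
      <tagC><string>Something</string></tagC>
      <tagC><string>Something else</string></tagC>
      <tagC>
        <note>
          <title><string>value</string></title>
        </note>
      </tagC>
    </tagB>
  </tagA>
</root>
"""


def strings_for_title(elem, wanted):
    """Return tagC/string texts from each tagB whose note title equals wanted."""
    found = []
    for tag_b in elem.findall("./tagA/tagB"):
        # title[string='...'] selects a title that has a direct <string> child
        # with exactly that text. Assumes wanted contains no single quote.
        if tag_b.find(f"./tagC/note/title[string='{wanted}']") is not None:
            found.extend(s.text for s in tag_b.findall("./tagC/string"))
    return found


elem = ET.fromstring(SAMPLE)
print(strings_for_title(elem, "value"))
print(strings_for_title(elem, "other"))

# Single-path variant: match the title, climb back up to tagB with "..",
# then descend to the sibling tagC/string elements.
single = elem.findall(
    "./tagA/tagB/tagC/note/title[string='value']/../../../tagC/string"
)
print([s.text for s in single])
```

The loop version keeps the condition and the collection as separate, readable steps. The single path does the same in one expression but is harder to read once real namespaces are added.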
Have a nice weekend!
On 2020-11-06 20:10, Terry Reedy wrote:
> On 11/6/2020 11:17 AM, Hernán De Angelis wrote:
>> I am confronting some XML parsing challenges and would like to ask
>> some questions to more knowledgeable Python users. Apparently there
>> exists a group for such questions but that list (xml-sig) has
>> apparently not received (or archived) posts since May 2018(!). I
>> wonder if there are other list or forum for Python XML questions, or
>> if this list would be fine for that.
> If you don't hear otherwise, try here. Or try stackoverflow.com and
> tag questions with python and xml.