I'm attempting to do the following:

A) Read/scan/iterate/etc. through a semi-large XML file (about 135 MB)
B) Grab specific fields and output them to a tab-delimited text file
The only problem I'm having is that the tab-delimited text file requires the values in a different order from the one in which they appear in the XML file. Example below.

<Title>
  <Item ID="1234abcd">
    <ItemVal ValueID="image" value="image.jpg" />
    <ItemVal ValueID="name" value="My Wonderful Product 1" />
    <ItemVal ValueID="description" value="My Wonderful Product 1 is a wonderful product, indeed." />
  </Item>
  <Item ID="2345bcde">
    <ItemVal ValueID="image" value="image2.jpg" />
    <ItemVal ValueID="name" value="My Wonderful Product 2" />
    <ItemVal ValueID="description" value="My Wonderful Product 2 is a wonderful product, indeed." />
  </Item>
  <Item ID="3456cdef">
    <ItemVal ValueID="image" value="image3.jpg" />
    <ItemVal ValueID="description" value="My Wonderful Product 3 is a wonderful product, indeed." />
    <ItemVal ValueID="name" value="My Wonderful Product 3" />
  </Item>
</Title>

(Note: The last item, "3456cdef", has its description value before the name, whereas in the previous items it comes after. This is to simulate the XML data with which I am working.)

And the tab-delimited text file should appear as follows (tabs are shown as 2 spaces, for the sake of readability here):

(ID,name,description,image)
1234abcd  My Wonderful Product 1  My Wonderful Product 1 is a wonderful product, indeed.  image.jpg
2345bcde  My Wonderful Product 2  My Wonderful Product 2 is a wonderful product, indeed.  image2.jpg
3456cdef  My Wonderful Product 3  My Wonderful Product 3 is a wonderful product, indeed.  image3.jpg

Currently, I'm working with the lxml library for iteration and parsing, though this is proving to be a bit of a challenge for data that needs to be reorganized (such as mine). Sample below.

''' Start code '''

from lxml import etree

def main():
    # Far too much room would be taken up if I were to paste my
    # real code here, so I will give a smaller example of what
    # I'm doing. Also, I do realize this is a very naive way to do
    # what it is I'm trying to accomplish... besides the fact
    # that it doesn't work as intended in the first place.

    out = open('output.txt','w')
    cat = etree.parse('catalog.xml')

    for el in cat.iter():
        # Search for the first item, make a new line for it
        # and output the ID
        if el.tag == "Item":
            out.write("\n%s\t" % (el.attrib['ID']))
        elif el.tag == "ItemVal":
            if el.attrib['ValueID'] == "name":
                out.write("%s\t" % (el.attrib['value']))
            elif el.attrib['ValueID'] == "description":
                out.write("%s\t" % (el.attrib['value']))
            elif el.attrib['ValueID'] == "image":
                out.write("%s\t" % (el.attrib['value']))
    out.close()

if __name__ == '__main__':
    main()

''' End code '''

I now realize that etree.iter() is meant to be used in an entirely different fashion, but my brain is stuck on this naive way of coding. If someone could give me a push in any correct direction, I would be most grateful.
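
One possible direction, offered as a rough sketch rather than a definitive answer: handle each Item as a whole, collect its ItemVal children into a dict keyed by ValueID, and then write the values in whatever order the output needs. The sketch below uses the standard library's xml.etree.ElementTree.iterparse so it runs without extra dependencies; lxml.etree provides an iterparse with a similar interface. The names FIELDS, items_to_rows and write_tsv are made up for illustration, and the code assumes every value sits in a "value" attribute and contains no tabs or newlines.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical column order for the output, after the ID column.
FIELDS = ("name", "description", "image")


def items_to_rows(source, fields=FIELDS):
    """Yield one list per <Item>: the ID, then each field in the requested order."""
    # "end" events fire once an element and all of its children have been read.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "Item":
            continue
        # Key the children by ValueID so their order in the XML no longer matters.
        values = {v.get("ValueID"): v.get("value", "") for v in elem.iter("ItemVal")}
        # Missing fields become empty strings (an assumption, not a requirement).
        yield [elem.get("ID", "")] + [values.get(f, "") for f in fields]
        # Drop the children of the finished Item to limit memory use on large files.
        elem.clear()


def write_tsv(source, out, fields=FIELDS):
    """Write one tab-delimited line per <Item> to the open text file `out`."""
    for row in items_to_rows(source, fields):
        out.write("\t".join(row) + "\n")


if __name__ == "__main__":
    # Small in-memory sample standing in for catalog.xml.
    sample = b"""<Title>
  <Item ID="1234abcd">
    <ItemVal ValueID="image" value="image.jpg" />
    <ItemVal ValueID="name" value="My Wonderful Product 1" />
    <ItemVal ValueID="description" value="First description." />
  </Item>
  <Item ID="3456cdef">
    <ItemVal ValueID="image" value="image3.jpg" />
    <ItemVal ValueID="description" value="Third description." />
    <ItemVal ValueID="name" value="My Wonderful Product 3" />
  </Item>
</Title>"""
    buffer = io.StringIO()
    write_tsv(io.BytesIO(sample), buffer)
    print(buffer.getvalue(), end="")
```

For the real file, something like write_tsv('catalog.xml', open('output.txt', 'w')) should behave the same way, since iterparse also accepts a filename. If values might contain tabs or newlines, the csv module with delimiter='\t' would probably be a safer way to write the rows.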