On Sat, Oct 23, 2010 at 4:40 PM, Devon <dshur...@gmail.com> wrote:
> I must quickly and efficiently parse some data contained in multiple
> XML files in order to perform some learning algorithms on the data.
> Info:
>
> I have thousands of files, each file corresponds to a single song.
> Each XML file contains information extracted from the song (called
> features). Examples include tempo, time signature, pitch classes, etc.
> An example from the beginning of one of these files looks like:
>
> <analysis decoder="Quicktime" version="0x7608000">
> <track duration="29.12331" endOfFadeIn="0.00000"
> startOfFadeOut="29.12331" loudness="-12.097" tempo="71.031"
> tempoConfidence="0.386" timeSignature="4"
> timeSignatureConfidence="0.974" key="11" keyConfidence="1.000"
> mode="0" modeConfidence="1.000">
> <sections>
> <section start="0.00000" duration="7.35887"/>
> <section start="7.35887" duration="13.03414"/>
> <section start="20.39301" duration="8.73030"/>
> </sections>
> <segments>
> <segment start="0.00000" duration="0.56000">
> <loudness>
> <dB time="0">-60.000</dB>
> <dB time="0.45279" type="max">-59.897</dB>
> </loudness>
> <pitches>
> <pitch class="0">0.589</pitch>
> <pitch class="1">0.446</pitch>
> <pitch class="2">0.518</pitch>
> <pitch class="3">1.000</pitch>
> <pitch class="4">0.850</pitch>
> <pitch class="5">0.414</pitch>
> <pitch class="6">0.326</pitch>
> <pitch class="7">0.304</pitch>
> <pitch class="8">0.415</pitch>
> <pitch class="9">0.566</pitch>
> <pitch class="10">0.353</pitch>
> <pitch class="11">0.350</pitch>
>
> I am a statistician and therefore used to data being stored in
> CSV-like files, with each row being a datapoint, and each column being
> a feature. I would like to parse the data out of these XML files and
> write them out into a CSV file. Any help would be greatly appreciated.
> Mostly I am looking for a point in the right direction.

ElementTree is a good way to go for XML parsing:
http://docs.python.org/library/xml.etree.elementtree.html
http://effbot.org/zone/element-index.htm
http://codespeak.net/lxml/

And for CSV writing there's obviously:
http://docs.python.org/library/csv.html

> And I am also more
> concerned about how to use the tags in the XML files to build feature
> names so I do not have to hard code them. For example, the first
> feature given by the above code would be "track duration" with a value
> of 29.12331

You'll probably want to look at namedtuple
(http://docs.python.org/library/collections.html#collections.namedtuple)
or the "bunch" recipe (google for "Python bunch").

Cheers,
Chris
--
http://blog.rebertia.com
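
To make the ElementTree plus csv suggestion concrete, here is a minimal sketch of one way to build feature names from the tags instead of hard-coding them. Several things in it are assumptions rather than anything from the original post: the helper names track_features and write_csv are hypothetical, the "tag_attribute" naming scheme (e.g. track_duration) is just one possible convention, every <track> attribute is assumed to be numeric, and SAMPLE is a trimmed, properly closed stand-in for the truncated snippet quoted above. It uses plain dicts with csv.DictWriter rather than namedtuple, simply because dicts map directly onto CSV columns.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Trimmed, closed stand-in for the sample in the question (illustrative only).
SAMPLE = """<analysis decoder="Quicktime" version="0x7608000">
  <track duration="29.12331" tempo="71.031" timeSignature="4" key="11" mode="0">
    <sections>
      <section start="0.00000" duration="7.35887"/>
    </sections>
  </track>
</analysis>"""


def track_features(xml_text):
    """Return {feature_name: value} built from the <track> attributes.

    Feature names are derived as "<tag>_<attribute>", so nothing is
    hard-coded. Assumes every attribute value parses as a float.
    """
    root = ET.fromstring(xml_text)
    track = root.find("track")
    return {
        "%s_%s" % (track.tag, name): float(value)
        for name, value in track.attrib.items()
    }


def write_csv(rows, out):
    """Write one CSV row per song; columns are the union of all feature names."""
    fieldnames = sorted(set().union(*rows))
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return fieldnames


# One row per song; here there is only the single sample "file".
rows = [track_features(SAMPLE)]
buf = io.StringIO()
write_csv(rows, buf)
print(buf.getvalue())
```

For real files you would presumably loop over the paths (for instance with glob.glob) and call ET.parse(path).getroot() instead of ET.fromstring. The per-segment data (loudness, pitches) does not fit the one-row-per-song layout directly, so it would need either its own CSV with one row per segment or some aggregation per song.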