I wrote:

> Can you suggest a fast, efficient way to use Perl to extract selected
> data from an XML file?...
First of all, thank you to everyone who promptly replied to my query.

Second, I was not quite clear in my question. Many people said I should write an XSLT style sheet to transform my XML documents into HTML. This is in fact what I do. But I need a process that not only transforms each of my documents but also creates author and title indexes to my collection, and therefore I need to extract bits of data from each of my original XML files.

Third, most of the replies fell into two categories: 1) use an XSLT style sheet as a sort of "subroutine", and 2) use XML::Twig.

Fourth, I tried both of these approaches plus my own, and timed them. I had to process 1.5 MB of data in nineteen files. Tiny. Ironically, my original code was the fastest at 96 seconds. The XSLT implementation came in second at 101 seconds, and the XML::Twig implementation, while straightforward, came in last at 141 seconds. (See the attached code snippets.)

Since my original implementation is still the fastest, and the newer implementations do not improve the speed of the application, I must assume that the process is slow because of the XSLT transformations themselves. These transformations are straightforward:

  # transform the document and save it
  my $doc       = $parser->parse_file($file);
  my $results   = $stylesheet->transform($doc);
  my $html_file = "$HTML_DIR/$id.html";
  open my $out, '>', $html_file or die "Cannot write $html_file: $!";
  print $out $stylesheet->output_string($results);
  close $out;

  # convert the HTML to plain text and save it
  my $html      = parse_htmlfile($html_file);
  my $text_file = "$TEXT_DIR/$id.txt";
  open $out, '>', $text_file or die "Cannot write $text_file: $!";
  print $out $formatter->format($html);
  close $out;

When my collection grows big I will have to figure out a better way to batch transform my documents. I might even have to break down and write a shell script to call xsltproc directly. (Blasphemy!)

-- 
Eric Lease Morgan
University Libraries of Notre Dame
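The message infers that the transformations dominate the run time because swapping the extraction code barely changed the totals. One way to check that inference directly would be to time each phase separately. The following is only a minimal sketch using the core Time::HiRes module; the `timed` helper, the `%elapsed` hash, and the phase labels are hypothetical names, and the two workloads are stand-ins for the real parse, transform, and format calls.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

my %elapsed;    # phase label => total seconds spent in that phase

# Run $code, add its wall-clock time to $elapsed{$label},
# and pass the code's (scalar) return value back to the caller.
sub timed {
    my ($label, $code) = @_;
    my $start  = time();
    my $result = $code->();
    $elapsed{$label} += time() - $start;
    return $result;
}

# Stand-in workloads (assumptions for illustration). In the real loop
# these would wrap the extraction code and the transform/format code.
timed('extract',   sub { my $n = 0; $n += $_ for 1 .. 10_000; $n });
timed('transform', sub { my $n = 0; $n += $_ for 1 .. 50_000; $n });

# Report the accumulated time per phase.
printf "%-10s %.4f s\n", $_, $elapsed{$_} for sort keys %elapsed;
```

Because the totals accumulate across calls, wrapping each step inside the per-file loop would give one figure per phase for the whole batch of nineteen files.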
subroutines.txt
# my original code
print "Processing $file...\n";
my $doc    = $parser->parse_file($file);
my $root   = $doc->getDocumentElement;
my @header = $root->findnodes('teiHeader');
my $author = $header[0]->findvalue('fileDesc/titleStmt/author');
my $title  = $header[0]->findvalue('fileDesc/titleStmt/title');
my $id     = $header[0]->findvalue('fileDesc/publicationStmt/idno');
print "  author: $author\n  title: $title\n  id: $id\n\n";

# using an XSLT stylesheet
print "Processing $file...\n";
my $style      = $parser->parse_file($AUTIID);
my $stylesheet = $xslt->parse_stylesheet($style);
my $doc        = $parser->parse_file($file);
my $results    = $stylesheet->transform($doc);
my $fullResult = $stylesheet->output_string($results);
my @fullResult = split /#/, $fullResult;
my $title      = $fullResult[0];
my $author     = $fullResult[1];
my $id         = $fullResult[2];
print "  author: $author\n  title: $title\n  id: $id\n\n";

# using XML::Twig
print "Processing $file...\n";
my ($author, $title, $id);
my $twig = XML::Twig->new(TwigHandlers => {
    'teiHeader/fileDesc/titleStmt/author'     => sub { $author = $_[1]->text },
    'teiHeader/fileDesc/titleStmt/title'      => sub { $title  = $_[1]->text },
    'teiHeader/fileDesc/publicationStmt/idno' => sub { $id     = $_[1]->text },
});
$twig->parsefile($file);
print "  author: $author\n  title: $title\n  id: $id\n\n";
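If the XSLT snippet above runs once per file, it re-parses and re-compiles the same stylesheet nineteen times. A possible refinement, sketched here under that assumption, is to build the stylesheet once and reuse it. The `stylesheet_for` helper, the `%compiled` cache, and the file name `extract.xsl` are hypothetical. The builder is a stand-in so the block runs without any XML modules installed. The commented-out builder assumes `$xslt` is an XML::LibXSLT object and `$parser` an XML::LibXML parser, which the method names suggest but the snippets do not state.

```perl
use strict;
use warnings;

my %compiled;    # stylesheet path => compiled stylesheet object

# Return the cached object for $path, calling $build->($path) only the
# first time that path is requested.
sub stylesheet_for {
    my ($path, $build) = @_;
    $compiled{$path} = $build->($path) unless exists $compiled{$path};
    return $compiled{$path};
}

# Stand-in builder that counts how often it runs. With the objects
# assumed above, the real builder might look like:
#   sub { $xslt->parse_stylesheet($parser->parse_file($_[0])) }
my $builds = 0;
my $build  = sub { $builds++; return "compiled:$_[0]" };

# Simulate nineteen files all asking for the same stylesheet.
stylesheet_for('extract.xsl', $build) for 1 .. 19;
print "builder ran $builds time(s)\n";
```

Whether this would close the gap between the 96-second and 101-second runs is not something the snippets can show; it only removes repeated setup work from the loop.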