Please, please, please, do not try to parse XML with regexps. They only
work in the simplest cases. There are perfectly good XML modules
designed to parse XML for you, and they are not that hard to use.

The following code parses an XML file similar to the one you described,
but with an additional tag (<ARTICLES></ARTICLES>), since XML must have
one and only one root tag. I added this tag because I assumed you have
more than one article per file. If that is true, then the XML you
described is not well formed. However, it would be simple to wrap this
tag around the file's contents before attempting to parse them. If there
is in fact only one article per file, then remove the outer foreach and
replace $articles->children with $xmlobj->children.
<code>
#!/usr/bin/perl -w
use strict;
use XML::Parser; #parse XML into an internal format
use XML::SimpleObject; #easy to use frontend to XML::Parser
if (@ARGV != 2) { die "Usage: $0 news.xml index.html" }
my $parser = new XML::Parser (ErrorContext => 2, Style => "Tree");
my $xmlobj = new XML::SimpleObject ($parser->parsefile($ARGV[0]));
open HTML, ">$ARGV[1]" or die "Could not open $ARGV[1]:$!";
select HTML;
print "
<html>
<head>
<title>
News Articles for " . localtime() . "
</title>
</head>
<body>
<table>";
foreach my $articles ($xmlobj->children) { #get the top tag
    foreach my $article ($articles->children) { #get all articles
        my $file = $article->child('PUB')->value . '-' .
                   $article->child('RUB')->value . '-' .
                   $article->child('LEV')->value . '-' .
                   $article->child('DAT')->value;
        $file =~ s/[^\w.-]//g; #remove anything not alphanumeric, _, -, or .
        open FH, ">$file" or die "Could not open $file:$!";
        print FH $article->child('BRO')->value;
        close FH;
        print
            "<tr><td>", $article->child('ORD')->value, "</td></tr>\n",
            "<tr><td>", $article->child('LEV')->value, "</td></tr>\n",
            "<tr><td>", $article->child('DAT')->value, "</td></tr>\n",
            "<tr><td>", $article->child('PUB')->value, "</td></tr>\n",
            "<tr><td><a href=\"$file\">", $article->child('RUB')->value,
            "</a></td></tr>\n","<tr><td>", $article->child('INL')->value,
            "</td></tr>\n",
            "<tr><td></td></tr>";
    }
}
print "
</table>
</body>
</html>";
close HTML;
</code>
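
If your files really do hold several <ART> blocks with no root tag, the wrapping step I mentioned could look roughly like the sketch below. This is untested against your real data. wrap_in_root() is a name I made up for illustration, and it assumes the file is a bare run of <ART>...</ART> elements with no <?xml ...?> declaration at the top, as in the sample you posted. It reads the file into a string, adds the root tag, and hands the string to parse() instead of giving parsefile() a file name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# wrap_in_root() is a hypothetical helper, not part of the script above.
# It assumes the input is a bare run of <ART>...</ART> elements with no
# <?xml ...?> declaration, as in the sample that was posted.
sub wrap_in_root {
    my ($xml, $root) = @_;
    # Make sure the closing root tag starts on its own line.
    $xml .= "\n" unless $xml =~ /\n\z/;
    return "<$root>\n$xml</$root>\n";
}

# The parsing part only runs when a file name is given on the command line.
if (@ARGV) {
    require XML::Parser;
    require XML::SimpleObject;

    # Slurp the whole file into one string.
    open my $in, '<', $ARGV[0] or die "Could not open $ARGV[0]:$!";
    my $raw = do { local $/; <$in> };
    close $in;

    my $parser = XML::Parser->new(ErrorContext => 2, Style => 'Tree');
    # parse() takes the document as a string; parsefile() takes a file name.
    my $xmlobj = XML::SimpleObject->new(
        $parser->parse(wrap_in_root($raw, 'ARTICLES'))
    );
    # $xmlobj can now be walked with the same nested foreach as above.
}
```

Doing the wrapping in memory leaves the file you receive untouched, so the same input can be processed again if something goes wrong.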
On 19 Jun 2001 13:34:03 +0100, Nigel Wetters wrote:
> I think I can give you some clues. Here's some code out of the Perl Cookbook (6.8
> Extracting a Range of Lines), which I've adapted for you. You should be able to nest
> such structures to get what you want.
>
> my $extracted_lines = '';
> while (<>) {
> if (/BEGIN PATTERN/ .. /END PATTERN/) {
> # line falls between BEGIN and END in the
> # text, inclusive
> $extracted_lines .= $_;
> } else {
> # now, we're outside the pattern
> process($extracted_lines) if $extracted_lines;
> $extracted_lines = '';
> }
> }
> sub process
> {
> # do stuff with the extracted lines
> # maybe performing more regex's
> }
>
> >>> Morgan <[EMAIL PROTECTED]> 06/19/01 01:12pm >>>
> Hi
>
> I'm newbee perl developer and a rookie of xml :(
>
> Is there anyone who can give me some hints or help me out with a problem
> I have?
>
> Here is the problem.
> I will recive newsarticles three times a day in xml format and I need to
> automaticly publish those articels on a web page, on the first page it
> should only show the tags down to </INL>
> tag and a link to the whole page.
>
> Here is a sample of the xml format.
>
> <ART>
> <ORD>anbud</ORD>
> <LEV>2001-06-14</LEV>
> <DAT>14-06-01</DAT>
> <PUB>DAGENS INDUSTRI</PUB>
> <RUB>Dragkamp om förlusttåg</RUB>
> <INL>Here is the indroduction about the article and when the word
> anbud comes up it is enclosed in <HIT>anbud</HIT> tags.
> This is the word we use as criteria on the articels we should recive.
> </INL>
> <BRO>
> Here comes the rest of the document, thats the whole article.
> The article ends with
> </BRO>
> </ART>
>
>
> Raven
--
Today is Setting Orange, the 24th day of Confusion in the YOLD 3167
Wibble.