<snip />
>
> Notwithstanding my other comment, this code is also inefficient,
> both tactically and strategically.
I know it was horrendous code; it was just the first thing that popped
into my head. Once I had it working I was going to make it more
efficient.
>
> Take for example the string "\200abc"...
>
> After you replace "\200" with "&200;", the next character examined
> is the "2" that you've just inserted, since you do not bump along
> the value of $i appropriately. Lots of wasted character movements
> there. You really want the next loop to look at "a", not "2".
Oops, I didn't even think about that. It gave the correct results so I
moved on.
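
For the record, a minimal sketch of what bumping $i would look like if
the loop were kept. The sub name escape_high_loop is made up for
illustration, and the entity form follows the "&#" . ord(...) . ";"
replacement from the regex version rather than my original loop:

```perl
use strict;
use warnings;

# Hypothetical helper: loop-based escaping that skips past the text it inserts.
sub escape_high_loop {
    my ($s) = @_;
    for (my $i = 0; $i < length($s); $i++) {
        my $code = ord(substr($s, $i, 1));
        next if $code < 128;              # plain ASCII, leave it alone
        my $entity = "&#$code;";
        substr($s, $i, 1) = $entity;      # replace one character with the entity
        $i += length($entity) - 1;        # the loop's $i++ then lands on the next original character
    }
    return $s;
}

print escape_high_loop("\200abc"), "\n";
```

With the extra increment, the iteration after the replacement looks at
"a" instead of rescanning the inserted text.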
>
> But more importantly, all those ord()s and substr()s are using
> the wrong parts of Perl for basic string processing. This will
> execute much much faster:
>
> $file =~ s/([\200-\377])/"&#".ord($1).";"/ge;
>
> A regex match here is appropriate. The "/g" replaces the outer loop,
> and the expression provides the replacement text without overlap.
I knew if I posted the terrible for loop version someone would give me a
nice regexp for it. I can't "think" in terms of regexps yet (well, I
can think in terms of simple Vimish regexps, but not Perlish regexps).
I will get there with time, but it has only been three months. I could
not remember how to make a function call inside the regexp; otherwise I
would have used: $file =~ s/([^\0-\177])/"&#".ord($1).";"/ge. I got to
"$file =~ s/([^\0-\177])/&#\1;/g" and thought "How the hell am I going
to get the ord in there?" So I took the easy way out and wrote the
terrible for loop.
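
The missing piece was the /e modifier, which makes Perl evaluate the
replacement as code instead of treating it as a string. A small sketch
with a made-up sample string (the Latin-1 bytes for "förlusttåg"). Note
that escapes like \177 are octal, so \177 is decimal 127, the top of
ASCII:

```perl
use strict;
use warnings;

# Sample data (an assumption for illustration): Latin-1 bytes for "förlusttåg".
my $text = "f\366rlustt\345g";

# /g repeats the match across the string; /e evaluates the replacement as
# a Perl expression, so ord($1) is actually called for each match.
(my $escaped = $text) =~ s/([^\0-\177])/"&#" . ord($1) . ";"/ge;

print "$escaped\n";
```

Each character outside the ASCII range is replaced by its numeric
character reference, and the scan resumes after the replacement text.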
>
> Whenever you think "change string", you should first think "regex
> replacement", not anything else. Rarely will something else be
> better. (For "change character to character", think "transliterate".)
>
> The moral of the story is that quick and dirty... may end up
> being just dirty. :-)
>
> --
> Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
> <[EMAIL PROTECTED]> <URL:http://www.stonehenge.com/merlyn/>
> Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
> See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
>
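
To make sure I understood the "transliterate" remark, here is a small
sketch of tr/// on a made-up string. It maps characters one for one and
can also simply count them:

```perl
use strict;
use warnings;

# Sample string is an assumption for illustration only.
my $title = "Dragkamp om tag";

# Character-for-character mapping: lowercase ASCII to uppercase.
(my $upper = $title) =~ tr/a-z/A-Z/;

# With an empty replacement list and no /d, tr only counts matches.
my $spaces = ($title =~ tr/ //);

print "$upper has $spaces spaces\n";
```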
From your earlier email I got the impression that all that needed to be
done to fix the character problem was to add "<?xml version='1.0'
encoding='ISO-8859-1'?>" to the XML to keep XML::Parser from bombing on
the foreign characters. This works, but I am still getting "Dragkamp om
fÃ¶rlusttÃ¥g" instead of "Dragkamp om förlusttåg" as output. My current
bet is that these suckers are 2 byte ISO characters (hence the ö being
printed as Ã¶). Is there some step I am still missing? I have copied
the latest version of the code below.
<code>
#!/usr/bin/perl -w
use strict;
use XML::Parser; #parse XML into an internal format
use XML::SimpleObject; #easy to use frontend to XML::Parser
if (@ARGV != 2) { die "Usage: $0 news.xml index.html" }
open FH, $ARGV[0] or die "Could not open $ARGV[0]:$!";
my $file = "<?xml version='1.0' encoding='ISO-8859-1'?>\n";
#grab the whole file
{ local ($/) = undef; $file .= <FH>; }
close FH;
#this shouldn't be necessary since we are setting encoding='ISO-8859-1'
#$file =~ s/([^\0-\177])/"&#".ord($1).";"/ge;
my $parser = new XML::Parser (ErrorContext => 2, Style => "Tree");
my $xmlobj = new XML::SimpleObject ($parser->parse($file));
open HTML, ">$ARGV[1]" or die "Could not open $ARGV[1]:$!";
select HTML;
print "
<html>
<head>
<title>
News Articles for " . localtime() . "
</title>
</head>
<body>
<table>";
foreach my $articles ($xmlobj->children) { #get the top tag
    foreach my $article ($articles->children) { #get all articles
        print STDOUT $article->child('RUB')->value, "\n";
        my $file = $article->child('PUB')->value . '-' .
                   $article->child('RUB')->value . '-' .
                   $article->child('LEV')->value . '-' .
                   $article->child('DAT')->value;
        $file =~ s/[^\w.-]//g; #remove anything not alphanumeric, _, -, or .
        open FH, ">$file.art" or die "Could not open $file.art:$!";
        print FH $article->child('BRO')->value;
        close FH;
        print
            "<tr><td>", $article->child('ORD')->value, "</td></tr>\n",
            "<tr><td>", $article->child('LEV')->value, "</td></tr>\n",
            "<tr><td>", $article->child('DAT')->value, "</td></tr>\n",
            "<tr><td>", $article->child('PUB')->value, "</td></tr>\n",
            #link to the .art file written above
            "<tr><td><a href=\"$file.art\">", $article->child('RUB')->value, "</a></td></tr>\n",
            "<tr><td>", $article->child('INL')->value, "</td></tr>\n",
            "<tr><td></td></tr>";
    }
}
print "
</table>
</body>
</html>";
close HTML;
</code>
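
One possible explanation for the two-character output described above,
offered as a guess rather than a confirmed diagnosis: XML::Parser hands
back strings in UTF-8 regardless of the input encoding, and a UTF-8 "ö"
is the two bytes C3 B6, which a Latin-1 viewer shows as two characters.
The sketch below assumes a Perl with the core Encode module (5.8 or
later) and builds the UTF-8 bytes by hand instead of calling the parser:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for a value returned by the parser (an assumption): the UTF-8
# bytes for "förlusttåg", where ö is C3 B6 and å is C3 A5.
my $utf8_bytes = "f\xC3\xB6rlustt\xC3\xA5g";

# Decode the bytes into characters, then re-encode as ISO-8859-1 so that
# a Latin-1 page or terminal shows one byte per accented letter.
my $latin1 = encode('ISO-8859-1', decode('UTF-8', $utf8_bytes));

printf "UTF-8: %d bytes, Latin-1: %d bytes\n",
    length($utf8_bytes), length($latin1);
```

If that guess is right, the alternative would be to leave the values
alone and declare the generated HTML as UTF-8 instead.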
--
Today is Sweetmorn, the 25th day of Confusion in the YOLD 3167
Or not.