Re: xml problem

Chas Owens Wed, 20 Jun 2001 09:36:31 -0700
<snip />
> 
> Notwithstanding my other comment, this code is also inefficient,
> both tactically and strategically.

I know it was horrendous code, it was just the first thing that popped
into my head.  After I had it working I was going to make it more
efficient.

> 
> Take for example the string "\200abc"...
> 
> After you replace "\200" with "&200;", the next character examined
> is the "2" that you've just inserted, since you do not bump along
> the value of $i appropriately.  Lots of wasted character movements
> there.  You really want the next loop to look at "a", not "2".

Oops, I didn't even think about that.  It gave the correct results so I
moved on.
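
For reference, the bump-along fix can be sketched as follows. This is a hypothetical reconstruction, since the original loop is not included in this message; the names $text, $i and $entity are made up for the illustration, and the sample string is the "\200abc" from the example above.

```perl
use strict;
use warnings;

# Hypothetical loop: replace each 8-bit character with a numeric
# entity, then skip past the inserted text instead of rescanning it.
my $text = "\200abc";
my $i    = 0;

while ($i < length $text) {
    my $char = substr $text, $i, 1;
    if (ord($char) > 127) {
        my $entity = "&#" . ord($char) . ";";
        substr($text, $i, 1) = $entity;  # replace one character in place
        $i += length $entity;            # next pass looks at "a", not the entity
    }
    else {
        $i++;
    }
}

print "$text\n";
```

The result is the same as without the bump, which is why the bug went unnoticed; only the number of characters examined changes.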

> 
> But more importantly, all those ord()s and substr()s are using
> the wrong parts of Perl for basic string processing.  This will
> execute much much faster:
> 
>   $file =~ s/([\200-\377])/"&#".ord($1).";"/ge;
> 
> A regex match here is appropriate.  The "/g" replaces the outer loop,
> and the expression provides the replacement text without overlap.

I knew if I posted the terrible for loop version someone would give me a
nice regexp for it.  I can't "think" in terms of regexps yet (well, I
can think in terms of simple Vimish regexps, but not Perlish regexps).
I will get there with time, but it has only been three months.  I could
not remember how to make the function call inside of the regexp else I
would have used: $file =~ s/([^\0-\177])/"&#".ord($1).";"/ge.  I got to
"$file =~ s/([^\0-\177])/&#\1;/g" and thought "How the hell am I going
to get the ord in there?"  So I took the easy way out and wrote the
terrible for loop.
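
A minimal sketch of the /e form on a sample string (the sample value and the name $escaped are assumptions for the illustration). Backslash-digit escapes in a character class are octal, so the last 7-bit ASCII character, decimal 127, is written \177; writing the decimal value after the backslash names a different character.

```perl
use strict;
use warnings;

# Sample input: "förlusttåg" as ISO-8859-1 bytes (\366 = 246, \345 = 229).
my $file = "f\366rlustt\345g";

# /e evaluates the replacement as Perl code, so ord($1) is called for
# every match; /g repeats the substitution across the whole string.
(my $escaped = $file) =~ s/([^\0-\177])/"&#" . ord($1) . ";"/ge;

print "$escaped\n";
```

The negated class [^\0-\177] matches the same bytes as [\200-\377] when the string holds only 8-bit characters.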

> 
> Whenever you think "change string", you should first think "regex
> replacement", not anything else.  Rarely will something else be
> better. (For "change character to character", think "transliterate".)
> 
> The moral of the story is that quick and dirty... may end up
> being just dirty. :-)
> 
> -- 
> Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
> <[EMAIL PROTECTED]> <URL:http://www.stonehenge.com/merlyn/>
> Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
> See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
>

From your earlier email I got the impression that all that needed to be
done to fix the character problem was to add "<?xml version='1.0'
encoding='ISO-8859-1'?>" to the XML to force XML::Parser to not bomb on
the foreign characters.  This works, but I am still getting "Dragkamp om
fÃ¶rlusttÃ¥g" instead of "Dragkamp om förlusttåg" as output.  My current
bet is that these suckers are 2 byte ISO characters (hence the ö being
printed as Ã¶).  Is there some step I am still missing?  I have copied
the latest version of the code below.
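
Separately from the script that follows, the symptom itself can be reproduced in isolation. This is only a sketch under two assumptions: that the parser hands text back as UTF-8 regardless of the declared input encoding, and that a Perl with the core Encode module is available. The variable names are made up for the illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: "ö" arrives as the two UTF-8 bytes 0xC3 0xB6 and "å" as
# 0xC3 0xA5, rather than as single ISO-8859-1 bytes.
my $utf8_bytes = "f\xC3\xB6rlustt\xC3\xA5g";

# Read as ISO-8859-1, every byte is its own character, so each
# accented letter shows up as two characters.
my $misread = decode('ISO-8859-1', $utf8_bytes);

# Decoding as UTF-8 first and re-encoding yields one byte per letter.
my $latin1 = encode('ISO-8859-1', decode('UTF-8', $utf8_bytes));

printf "%d characters when misread, %d bytes after conversion\n",
    length $misread, length $latin1;
```

If that assumption holds, the remaining step would be converting the values before printing, or declaring the HTML output as UTF-8 instead.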

<code>
#!/usr/bin/perl -w

use strict;
use XML::Parser;       #parse XML into an internal format
use XML::SimpleObject; #easy to use frontend to XML::Parser

if (@ARGV != 2) { die "Usage: $0 news.xml index.html" }

open FH, $ARGV[0] or die "Could not open $ARGV[0]:$!";

my $file = "<?xml version='1.0' encoding='ISO-8859-1'?>\n";
#grab the whole file
{ local ($/) = undef; $file .= <FH>; }

close FH;

#this shouldn't be necessary since we are setting encoding='ISO-8859-1'
#$file =~ s/([^\0-\177])/"&#".ord($1).";"/ge;

my $parser = new XML::Parser (ErrorContext => 2, Style => "Tree");
my $xmlobj = new XML::SimpleObject ($parser->parse($file));

open HTML, ">$ARGV[1]" or die "Could not open $ARGV[1]:$!";
select HTML;

print "
<html>
<head>
<title>
News Articles for " . localtime() .  "
</title>
</head>
<body>
<table>";

foreach my $articles ($xmlobj->children) { #get the top tag
   foreach my $article ($articles->children) { #get all articles
      print STDOUT $article->child('RUB')->value, "\n";
      my $file = $article->child('PUB')->value . '-' .
                 $article->child('RUB')->value . '-' .
                 $article->child('LEV')->value . '-' .
                 $article->child('DAT')->value;
      $file =~ s/[^\w.-]//g; #remove anything not alphanumeric, _, -, or .
      open FH, ">$file.art" or die "Could not open $file.art:$!";
      print FH $article->child('BRO')->value;
      close FH;
      print
"<tr><td>", $article->child('ORD')->value, "</td></tr>\n",
"<tr><td>", $article->child('LEV')->value, "</td></tr>\n",
"<tr><td>", $article->child('DAT')->value, "</td></tr>\n",
"<tr><td>", $article->child('PUB')->value, "</td></tr>\n",
"<tr><td><a href=\"$file.art\">", $article->child('RUB')->value, "</a></td></tr>\n",
"<tr><td>", $article->child('INL')->value, "</td></tr>\n",
"<tr><td></td></tr>";
   }
}

print "
</table>
</body>
</html>";

close HTML;
</code>
 
--
Today is Sweetmorn, the 25th day of Confusion in the YOLD 3167
Or not.