On Wed, 18 Apr 2012 22:23:37 +0200
Manfred Lotz <manfred.l...@arcor.de> wrote:

> On Thu, 19 Apr 2012 06:15:47 +1000
> "Owen" <rc...@pcug.org.au> wrote:
> 
> > 
> > > Hi there,
> > > I've got a question about XML::Mini.
> > >
> > > When parsing an XML document I want to preserve white space for
> > > certain reasons. However, it doesn't really work.
> > >
> > > Minimal example:
> > >
> > > #! /usr/bin/perl
> > >
> > >
> > > use strict;
> > > use warnings;
> > > use Data::Dumper;
> > > use XML::Mini::Document;
> > >
> > > my $XMLString = "<book>  Learning Perl </book>";
> > >
> > > my $xmlDoc = XML::Mini::Document->new();
> > >
> > > $XML::Mini::IgnoreWhitespaces = 0;
> > >
> > > # init the doc from an XML string
> > > $xmlDoc->parse($XMLString);
> > >
> > > my $xmlHash = $xmlDoc->toHash();
> > >
> > > print Dumper($xmlHash);
> > >
> > >
> > > I get the following output:
> > > $VAR1 = {
> > >           'book' => 'Learning Perl '
> > >         };
> > >
> > >
> > > I would have expected to get
> > >    'book' => '  Learning Perl '
> > >
> > > instead.
> > >
> > >
> > > Any idea what's going wrong?
> > 
> > 
> > What happens if you set $XML::Mini::IgnoreWhitespaces = 1?
> > 
> > Seems to me that 1 = yes
> > 
> 
> This is true.
> 
> > What does the documentation say?
> > 
> 
> If I set it to 1 then I get
>   'book' => 'Learning Perl'
> 
> which is even worse. Please note that I don't want white space to be
> ignored.
> 
> 

Hm, I had no better idea than to look at the source code. I think I
found what happens.

 if ($XMLString =~ 
   m/^\s*(<\s*([^\s>]+)([^>]+)\/\s*>|   # <unary \/>
          <\?\s*([^\s>]+)\s*([^>]*)\?>| # <? headers ?>
          <!--(.+?)-->| # <!-- comments -->
          <!\[CDATA\s*\[(.*?)\]\]\s*>\s*|       # CDATA
          <!DOCTYPE\s*([^\[>]*)(\[.*?\])?\s*>\s*| # DOCTYPE
          <!ENTITY\s*([^"'>]+)\s*(["'])([^\11]+)\11\s*>\s*| # ENTITY
          ([^<]+))(.*)/xogsmi) # plain text      

IMHO, here is the bug. The leading ^\s* sits in front of all
alternatives, so leading white space is always deleted. That is OK as
long as the token is not plain text.

I changed it like this:
if ($XMLString =~ 
   m/(^\s*<\s*([^\s>]+)([^>]+)\/\s*>|   #<unary \/> 
      ^\s*<\?\s*([^\s>]+)\s*([^>]*)\?>| # <? headers ?>
      ^\s*<!--(.+?)-->| # <!-- comments -->
      ^\s*<!\[CDATA\s*\[(.*?)\]\]\s*>\s*|       # CDATA
      ^\s*<!DOCTYPE\s*([^\[>]*)(\[.*?\])?\s*>\s*| # DOCTYPE
      ^\s*<!ENTITY\s*([^"'>]+)\s*(["'])([^\11]+)\11\s*>\s*| # ENTITY
      ([^<]+))(.*)/xogsmi) # plain text     


Now leading white space is deleted in all cases except plain text, and
the example prints:


$VAR1 = {
          'book' => '  Learning Perl '
        };
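
To see the effect in isolation, here is a minimal sketch with a reduced
pattern: one tag alternative plus the plain-text alternative. The
pattern names, the helper leading_text and the sample input are made up
for illustration; this is not the XML::Mini code itself, only the same
anchoring idea.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical, reduced patterns (not the XML::Mini regex). They differ
# only in where the ^\s* prefix sits relative to the alternation.
my $anchor_outside = qr/^\s*(<[^>]+>|([^<]+))(.*)/s;   # prefix before all alternatives
my $anchor_inside  = qr/(^\s*<[^>]+>|([^<]+))(.*)/s;   # prefix only on the tag alternative

# Return the plain text captured at the start of $input, or undef if the
# pattern does not match or the first token is not plain text.
sub leading_text {
    my ($re, $input) = @_;
    return $input =~ $re ? $2 : undef;
}

# Assumed remainder of the document once <book> has been consumed.
my $rest = "  Learning Perl </book>";

printf "outside: '%s'\n", leading_text($anchor_outside, $rest);
printf "inside:  '%s'\n", leading_text($anchor_inside,  $rest);
```

This should print 'Learning Perl ' for the first pattern and
'  Learning Perl ' for the second. One caveat I have not checked against
XML::Mini: with the prefix moved inside, the plain-text alternative is
no longer anchored, so on malformed input the match could start later
in the string.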



-- 
Manfred