On Wed, 22 Oct 2014 14:13:17 +0200, Vincent Lefevre wrote:

> Control: retitle -1 libhtml-html5-parser-perl: UTF-8 character breaks
>  parse_file
>
> As a consequence of this bug, html2xhtml doesn't work at all when
> applied on a file. No problems when the HTML document is provided
> in the standard input, though.
[..]
> parse_file is used in the former test (like in my original bug report),
> and parse_string is used in the latter test. Thus it seems that's
> parse_file that is broken. Hence the retitle.

Thanks for all those test cases.
Out of curiosity, I looked at the code a bit and used your test HTML
file with bin/html2xhtml:

So what happens is:

lib/HTML/HTML5/Parser.pm: parse_file():
    my $response = HTML::HTML5::Parser::UA->get($file, $opts->{user_agent});
lib/HTML/HTML5/Parser/UA.pm: get():
    interestingly takes the _get_lwp route for file:/// and returns stuff
lib/HTML/HTML5/Parser.pm: parse_file():
    then takes $response->{decoded_content};
    which generates, when printed, a wide character warning, and
    presumably from here on things go south
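
For reference, a minimal sketch of how such a warning typically arises in Perl: a string that has already been decoded into characters is printed to a handle without an encoding layer. The sample data (a euro sign) and the use of the null device are assumptions for illustration only; this is not code from the module.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Spec;

# Hypothetical sample data: the UTF-8 bytes of U+20AC (euro sign).
my $bytes = "\xE2\x82\xAC";
# Character string, comparable to what decoded_content holds.
my $chars = decode('UTF-8', $bytes);

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# Print to a handle with no encoding layer.
open(my $raw, '>', File::Spec->devnull) or die "open: $!";
print {$raw} $chars;
close($raw);
my @raw_warnings = @warnings;

# Print the same string to a handle with an explicit UTF-8 layer.
@warnings = ();
open(my $enc, '>:encoding(UTF-8)', File::Spec->devnull) or die "open: $!";
print {$enc} $chars;
close($enc);
my @enc_warnings = @warnings;

printf "without layer: %d warning(s), with layer: %d warning(s)\n",
    scalar @raw_warnings, scalar @enc_warnings;
```

So the warning by itself only says that the value is a character string rather than a byte string; whether that is what breaks the parser further down is not shown by this sketch.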

What helps is:

- replace in lib/HTML/HTML5/Parser.pm
    $response->{decoded_content} with $response->{content}
  which feels a bit dangerous
- or in lib/HTML/HTML5/Parser/UA.pm's get:
  move the
    if ($uri =~ /^file:/i)
  up so it's the first alternative and then _get_fs is used
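
To illustrate why the first option feels dangerous: raw content and decoded content are different kinds of strings (bytes vs. characters), and a consumer that expects one but receives the other can garble the text. The sketch below uses only Encode and a made-up document body; whether the parser actually decodes its input a second time is an assumption, not something verified here.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical document body: "e acute" encoded as UTF-8 (two bytes).
my $content = "<p>\xC3\xA9</p>";
# Character string, comparable to decoded_content.
my $decoded = decode('UTF-8', $content);

printf "content: %d bytes, decoded: %d characters\n",
    length($content), length($decoded);

# Encoding the characters again gives back the original bytes.
my $roundtrip = encode('UTF-8', $decoded);
print "round trip matches raw content\n" if $roundtrip eq $content;

# Assumption for illustration: a consumer that treats its input as UTF-8
# bytes and decodes it once more. Fed the already decoded string, it no
# longer gets the same text back.
my $copy  = $decoded;
my $twice = decode('UTF-8', $copy);
print "decoding twice changes the text\n" if $twice ne $decoded;
```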
The latter change would be, as a diff:
#v+
--- a/lib/HTML/HTML5/Parser/UA.pm
+++ b/lib/HTML/HTML5/Parser/UA.pm
@@ -18,14 +18,14 @@ sub get
 {
     my ($class, $uri, $ua) = @_;
+    if ($uri =~ /^file:/i)
+        { goto \&_get_fs }
     if (ref $ua and $ua->isa('HTTP::Tiny') and $uri =~ /^https?:/i)
         { goto \&_get_tiny }
     if (ref $ua and $ua->isa('LWP::UserAgent'))
         { goto \&_get_lwp }
     if (UNIVERSAL::can('LWP::UserAgent', 'can') and not $NO_LWP)
         { goto \&_get_lwp }
-    if ($uri =~ /^file:/i)
-        { goto \&_get_fs }
     goto \&_get_tiny;
 }
#v-
While this helps for reading local files, I guess the _get_lwp() case
might still be buggy.
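
A quick way to check the reordered dispatch in isolation is a stand-in that returns the route name instead of jumping to it. The function pick_route and its $have_lwp flag are hypothetical and only mirror the order of the tests in the patched get(); the user agent objects are empty blessed hashes, not real HTTP::Tiny or LWP::UserAgent instances.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the patched get(): same order of tests,
# but it returns the name of the route instead of doing a goto.
# $have_lwp replaces the real check for LWP::UserAgent being usable.
sub pick_route {
    my ($uri, $ua, $have_lwp) = @_;
    return '_get_fs'   if $uri =~ /^file:/i;
    return '_get_tiny' if ref $ua and $ua->isa('HTTP::Tiny') and $uri =~ /^https?:/i;
    return '_get_lwp'  if ref $ua and $ua->isa('LWP::UserAgent');
    return '_get_lwp'  if $have_lwp;
    return '_get_tiny';
}

# Empty objects blessed into the relevant classes, for illustration only.
my $tiny = bless {}, 'HTTP::Tiny';
my $lwp  = bless {}, 'LWP::UserAgent';

for my $case (
    [ 'file:///tmp/test.html', undef, 1 ],
    [ 'file:///tmp/test.html', $lwp,  1 ],
    [ 'http://example.org/',   $tiny, 1 ],
    [ 'http://example.org/',   undef, 1 ],
    [ 'http://example.org/',   undef, 0 ],
) {
    my ($uri, $ua, $have_lwp) = @$case;
    printf "%-24s ua=%-15s lwp=%d -> %s\n",
        $uri, (ref $ua || 'none'), $have_lwp, pick_route($uri, $ua, $have_lwp);
}
```

With this order, a file: URI goes to _get_fs even when LWP is available or an LWP::UserAgent object is passed in, which is the point of the patch; all other URIs are routed as before.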
Cheers,
gregor
--
 .''`.  https://info.comodo.priv.at/ - Debian Developer https://www.debian.org
 : :' : OpenPGP fingerprint D1E1 316E 93A7 60A8 104D 85FA BB3A 6801 8649 AA06
 `. `'  Member of VIBE!AT & SPI, fellow of the Free Software Foundation Europe
   `-

