On Friday 5 June 2015 14:31:17, you wrote:
> On Fri, 05 Jun 2015 14:34:42 +0200, Mathieu ROY wrote:
> > Ok, so after further testing, it turns out that if I change the coding of
> > the string from UTF-8 to ISO-8859..., it encode to the proper entities.
>
> Good.
>
> > I obviously can adjust the script to pre convert UTF-8 to ISO-8859
>
> Or just add "use utf8;" to your script if it contains utf8-encoded
> strings.

That works for the test script all right. But in the script I'm actually
working on, the string is imported from an image's EXIF data, and in that
case "use utf8;" has no effect at all: the string is UTF-8 and
encode_entities fails to convert it properly.

Instead of keeping strings as UTF-8 and expecting HTML::Entities to cope
with them (it does not), I actually need to do the opposite: convert UTF-8
to Perl's internal format and then call encode_entities.

Consider the following:

  $ cat test.pl
  #!/usr/bin/perl
  use utf8;
  use HTML::Entities;
  open(INPUT, "< testdata");
  while (<INPUT>) { print encode_entities($_), "\n" }
  close(INPUT);

  $ echo "vis-à-vis Beyoncé's naïve\npapier-mâché résumé" > testdata
  $ perl test.pl
  vis-à -vis Beyoncé's naïve\npapier- mâché résumé

Back to square one. Now, without "use utf8;" but decoding:

  #!/usr/bin/perl
  use HTML::Entities;
  use Encode qw(decode);
  use Encode::Detect::Detector;
  open(INPUT, "< testdata");
  while (<INPUT>) { print encode_entities(decode(detect($_),$_)), "\n" }
  close(INPUT);

  $ perl test.pl
  vis-à-vis Beyoncé's naïve\npapier-mâché résumé

> > but it should be at least documented (but I dont see any reason why
> > encode_entities should actually not be able to deal with UTF-8)
>
> That's how encoding in perl works in general, and I'm sure it's
> documented somewhere :)
> (I just don't find the correct perldoc right now ...)

I expected "use utf8" / "no utf8" to be sort of transitional, and thought
they should be avoided whenever not absolutely necessary. The description of
"use utf8;" mentions: "When UTF-8 becomes the standard source format, this
pragma will effectively become a no-op." Well, that day, if that day comes,
HTML::Entities will definitely have to deal properly with UTF-8 first hand. :-)

Anyway, in the meantime, I tend to prefer forcing strings to be decoded into
the internal format rather than declaring that all strings are UTF-8.

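The decode-then-encode approach described above can be sketched as follows.
This is only a minimal sketch, not the actual script: it assumes the input is
already known to be UTF-8, so it calls Encode's decode directly instead of
guessing with Encode::Detect::Detector, and the helper name
entities_from_utf8 is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# Hypothetical helper: takes raw UTF-8 bytes (e.g. read from a file or
# from EXIF data) and returns a string with HTML entities.
sub entities_from_utf8 {
    my ($bytes) = @_;
    # Assumption: the input really is UTF-8. Decode it into Perl's
    # internal character format first.
    my $chars = decode('UTF-8', $bytes);
    # encode_entities now sees one character per accented letter.
    return encode_entities($chars);
}

# "résumé" as raw UTF-8 bytes, written with escapes so that this
# example does not depend on the encoding of the source file.
my $raw = "r\xc3\xa9sum\xc3\xa9";
print entities_from_utf8($raw), "\n";
```

Calling encode_entities directly on $raw would instead treat each byte as a
separate Latin-1 character, which is the behaviour seen in the first test
above.
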
Regards,

--
http://yeupou.wordpress.com/