Hi Laurent,

On Tue, 18 Jan 2005 at 11:48:05 +0100, Laurent Fousse wrote:
> I've found more combined log lines that trigger the bug. They all
> contain a referer field which is a result from a search engine with
> encoded accentuated characters in the url (like %E9).
>
> Using "lr_log2report -o xml" you can see the produced xml declares an
> utf-8 encoding but the encoding actually used is iso-8859-1, hence the
> bug.

You are absolutely right. For the record, here's the proof:

----------
$ echo 'silence.lateralis.org - - [11/Jan/2005:02:27:17 +0100] "GET /pl/synth/ HTTP/1.1" 200 14356 "http://www.google.fr/search?num=100&hl=fr&ie=ISO-8859-1&q=images+de+synth%E8se&btnG=Rechercher&meta=" "Mozilla/4.0 (compatible; MSIE 5.17; Mac_PowerPC)"' | \
    lr_log2report -o xml combined > ~/tmp/291063.xml
$ lr_xml2report -o txt ~/tmp/291063.xml > /dev/null
Formatting report as txt in -...
lr_xml2report: ERROR not well-formed (invalid token) at line 1004, column 46, byte 53763 at /usr/lib/perl5/XML/Parser.pm line 187
$ recode cp1252/..u8 < tmp/291063.xml > tmp/291063.utf8.xml
$ lr_xml2report -o txt ~/tmp/291063.utf8.xml > /dev/null
Formatting report as txt in -...
$
----------

The latest Lire upstream snapshot, lire-2.0.1.99.1, suffers from the same
bug. The generated XML file carries the wrong header,
'<?xml version="1.0" encoding="UTF-8"?>', although its content is not
UTF-8. Changing the header to '<?xml version="1.0" encoding="ISO-8859-1"?>'
is another way to work around the problem.

I'll investigate more. Thanks for your well-documented bug report!

Bye,

Joost

-- 
. . http://logreport.com/ | '.| /^LogReport$/ | Lire http://logreport.org/
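
P.S. For anyone who wants to see the mismatch without installing Lire, here is a minimal Python sketch of what I believe is going on. It is not Lire's code: the <referer> element and the helper name parses() are made up for illustration, and I am assuming the %E8 byte ends up undecoded in the report body. It only shows that a raw 0xE8 byte is rejected under a UTF-8 declaration and accepted under an ISO-8859-1 one, and that recoding the bytes (as recode did above) also helps.

```python
import xml.etree.ElementTree as ET
from urllib.parse import unquote_to_bytes

# Percent-decode the search term from the referer in the log line above.
# %E8 becomes the single byte 0xE8, which is "è" in ISO-8859-1 / cp1252.
raw = unquote_to_bytes("synth%E8se")

# Hypothetical element name; Lire's real report schema is different.
body = b"<referer>" + raw + b"</referer>"

utf8_doc = b'<?xml version="1.0" encoding="UTF-8"?>' + body
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body

# Equivalent of the recode step: reinterpret the bytes as cp1252 and
# re-encode them as real UTF-8, keeping the UTF-8 declaration.
recoded_doc = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    + body.decode("cp1252").encode("utf-8")
)


def parses(doc: bytes) -> bool:
    """Return True if the document is well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


utf8_ok = parses(utf8_doc)        # wrong declaration for these bytes
latin1_ok = parses(latin1_doc)    # declaration matches the bytes
recoded_ok = parses(recoded_doc)  # bytes changed to match the declaration

print(utf8_ok, latin1_ok, recoded_ok)
```

Either workaround makes the declaration and the bytes agree; which side Lire should fix is what I still have to look into.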

