Package: lftp
Version: 3.5.6-1
Severity: important
Tags: patch

UTF-8 multi-byte characters are not correctly encoded into URLs. These characters include, for example, accented vowels, so they appear very frequently in European languages such as French (my own). Although UTF-8 encoded web pages are not widespread yet, I believe it is good practice to encourage Unicode.

Here is an example website which fails with lftp:

> $ lftp http://files.iai.heig-vd.ch/Enseignement/ <<EOF
> cd Supports%20de%20cours/Acquisition\ de\ données\ \&\ CEM/
> EOF

Here is the output I get:

> $ lftp http://files.iai.heig-vd.ch/Enseignement/Supports%20de%20cours/
> cd ok, cwd=/Enseignement/Supports de cours
> lftp files.iai.heig-vd.ch:/Enseignement/Supports de cours>
> cd Acquisition\ de\ données\ \&\ CEM/
> cd: Access failed: 404 Not Found (/Enseignement/Supports de
> cours/Acquisition de données & CEM)
> lftp files.iai.heig-vd.ch:/Enseignement/Supports de cours> exit

I wrote a naïve patch to URL-encode some of these characters. It seems to work for the example page, but it still misses most UTF-8 characters: it only handles two-byte sequences whose first byte is 0xC3. While I figure out how to do it correctly, maybe you can point me to some relevant information, or to upstream developers who would be interested?

Oh, and thank you for maintaining this =)

--
billitch

-- System Information:
Debian Release: 4.0
  APT prefers stable
  APT policy: (990, 'stable'), (500, 'testing')
Architecture: i386 (i686)
Shell:  /bin/sh linked to /bin/bash
Kernel: Linux 2.6.18-4-686
Locale: LANG=fr_FR.UTF-8, LC_CTYPE=fr_FR.UTF-8 (charmap=UTF-8)

Versions of packages lftp depends on:
ii  libc6          2.3.6.ds1-13  GNU C Library: Shared libraries
ii  libexpat1      1.95.8-3.4    XML parsing C library - runtime li
ii  libgcc1        1:4.1.1-21    GCC support library
ii  libgcrypt11    1.2.3-2       LGPL Crypto library - runtime libr
ii  libgnutls13    1.4.4-3       the GNU TLS library - runtime libr
ii  libgpg-error0  1.4-1         library for common error values an
ii  libncurses5    5.5-5         Shared libraries for terminal hand
ii  libreadline5   5.2-2         GNU readline and history libraries
ii  libtasn1-3     0.3.6-2       Manage ASN.1 structures (runtime)
ii  netbase        4.29          Basic TCP/IP networking system
ii  zlib1g         1:1.2.3-13    compression library - runtime

lftp recommends no packages.

-- no debconf information

diff -ur lftp-3.5.6.orig/src/url.cc lftp-3.5.6/src/url.cc
--- lftp-3.5.6.orig/src/url.cc	2006-02-06 11:59:59.000000000 +0100
+++ lftp-3.5.6/src/url.cc	2007-08-08 08:08:37.000000000 +0200
@@ -441,6 +441,7 @@
 
 /* Encodes the unsafe characters (listed in URL_UNSAFE) in a given
    string, returning a malloc-ed %XX encoded string.  */
+inline char *cat_quoted (char *p, const unsigned char c);
 #define need_quote(c) (!unsafe || iscntrl((unsigned char)(c)) || strchr(unsafe,(c)))
 char *url::encode_string (const char *s,char *res,const char *unsafe)
 {
@@ -462,10 +463,12 @@
    {
       if (need_quote(*s))
       {
-         const unsigned char c = *s;
-         *p++ = '%';
-         sprintf(p,"%02X",c);
-         p+=2;
+         p = cat_quoted (p, *s);
+         if ((unsigned char) *s == 0xC3 && s[1])
+         {
+            s++;
+            p = cat_quoted (p, *s);
+         }
       }
       else
          *p++ = *s;
@@ -474,6 +477,14 @@
    return res;
 }
 
+inline char *cat_quoted (char *p, const unsigned char c)
+{
+   *p++ = '%';
+   sprintf(p,"%02X",c);
+   p+=2;
+   return p;
+}
+
 bool url::dir_needs_trailing_slash(const char *proto)
 {
    if(!proto)
diff -ur lftp-3.5.6.orig/src/url.h lftp-3.5.6/src/url.h
--- lftp-3.5.6.orig/src/url.h	2006-02-06 12:00:06.000000000 +0100
+++ lftp-3.5.6/src/url.h	2007-08-08 08:04:48.000000000 +0200
@@ -47,7 +47,7 @@
    char *Combine(const char *home=0,bool use_rfc1738=true);
 };
 
-# define URL_UNSAFE " <>\"%{}|\\^[]`"
+# define URL_UNSAFE " <>\"%{}|\\^[]`\xC3"
 # define URL_PATH_UNSAFE URL_UNSAFE"#;?"
 # define URL_HOST_UNSAFE URL_UNSAFE":/"
 # define URL_PORT_UNSAFE URL_UNSAFE"/"
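
The patch above quotes only the lead byte 0xC3 and the byte that follows it, which is why other UTF-8 characters are still missed. A possible generalisation, sketched here as an assumption and not as upstream's fix, relies on the fact that every byte of a multi-byte UTF-8 sequence has its high bit set, so quoting all bytes >= 0x80 covers sequences of any length. The name encode_bytes and its std::string interface are hypothetical; this is a standalone helper, not lftp's url::encode_string.

```cpp
#include <cctype>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical standalone helper (not part of lftp): percent-encode every
// byte that is outside 7-bit ASCII, is a control character, or is listed
// in `unsafe`. All other bytes are copied through unchanged.
std::string encode_bytes(const std::string &s, const char *unsafe)
{
    std::string out;
    for (unsigned char c : s) {
        // Bytes >= 0x80 are lead or continuation bytes of a multi-byte
        // UTF-8 sequence, so each of them is quoted individually.
        bool quote = c >= 0x80
                  || std::iscntrl(c)
                  || std::strchr(unsafe, c) != nullptr;
        if (quote) {
            char buf[4];  // '%' + two hex digits + terminating NUL
            std::snprintf(buf, sizeof buf, "%%%02X", static_cast<unsigned>(c));
            out += buf;
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}
```

With " " as the unsafe set, the UTF-8 bytes of "données" (where "é" is 0xC3 0xA9) come out as "donn%C3%A9es". The same idea could in principle be applied inside need_quote() by also testing the high bit of the byte, which would make the special case for 0xC3 and the change to URL_UNSAFE unnecessary; whether that suits lftp's other callers of encode_string has not been checked here.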