Package: lftp
Version: 3.5.6-1
Severity: important
Tags: patch

Multi-byte UTF-8 characters are not correctly encoded into URLs. Accented
vowels are one example, so the problem shows up very frequently in
European languages such as French (my own).

Although UTF-8 encoded web pages are not widespread yet, I believe it is
good practice to encourage Unicode. Here is an example website that
fails with lftp:

> $ lftp http://files.iai.heig-vd.ch/Enseignement/ <<EOF
> cd Supports%20de%20cours/Acquisition\ de\ données\ \&\ CEM/
> EOF

Here is the output I get:
> $ lftp http://files.iai.heig-vd.ch/Enseignement/Supports%20de%20cours/
> cd ok, cwd=/Enseignement/Supports de cours
> lftp files.iai.heig-vd.ch:/Enseignement/Supports de cours> cd Acquisition\ de\ données\ \&\ CEM/
> cd: Access failed: 404 Not Found (/Enseignement/Supports de cours/Acquisition de données & CEM)
> lftp files.iai.heig-vd.ch:/Enseignement/Supports de cours> exit

I wrote a naive patch (below) that URL-encodes some of these characters.
It seems to work for the example page, but it only handles two-byte
sequences whose lead byte is 0xC3, so it still misses most UTF-8
characters. While I figure out how to do it correctly, could you point
me to some relevant information, or to upstream developers who might be
interested?
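
One possible generalization, shown only as a standalone sketch and not
as lftp code: rather than special-casing the 0xC3 lead byte, quote every
byte that is outside printable ASCII, in addition to the characters
listed as unsafe. The function name encode_bytes and its std::string
interface are hypothetical (url::encode_string works on char buffers),
and the sketch assumes the server expects the path bytes exactly as the
client has them, which here means UTF-8.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper, not part of lftp: percent-encode every byte that is
// an ASCII control character, DEL, a non-ASCII byte (>= 0x80), or listed
// in `unsafe`. Each byte of a multi-byte UTF-8 sequence is quoted on its
// own, so no lead byte needs special handling.
std::string encode_bytes(const std::string &s, const char *unsafe)
{
  static const char hex[] = "0123456789ABCDEF";
  std::string out;
  for (std::string::size_type i = 0; i < s.size(); ++i)
    {
      const unsigned char c = static_cast<unsigned char>(s[i]);
      const bool quote = c < 0x20 || c >= 0x7F
                         || std::strchr(unsafe, c) != 0;
      if (quote)
        {
          out += '%';
          out += hex[c >> 4];    // high nibble
          out += hex[c & 0x0F];  // low nibble
        }
      else
        out += static_cast<char>(c);
    }
  return out;
}
```

With an unsafe set of " %", the UTF-8 bytes of "é" (0xC3 0xA9) come out
as %C3%A9, and a three-byte character such as the euro sign (0xE2 0x82
0xAC) comes out as %E2%82%AC, which the 0xC3-only patch would miss.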

Oh, and thank you for maintaining this =)

-- 
  billitch


-- System Information:
Debian Release: 4.0
  APT prefers stable
  APT policy: (990, 'stable'), (500, 'testing')
Architecture: i386 (i686)
Shell:  /bin/sh linked to /bin/bash
Kernel: Linux 2.6.18-4-686
Locale: LANG=fr_FR.UTF-8, LC_CTYPE=fr_FR.UTF-8 (charmap=UTF-8)

Versions of packages lftp depends on:
ii  libc6                       2.3.6.ds1-13 GNU C Library: Shared libraries
ii  libexpat1                   1.95.8-3.4   XML parsing C library - runtime li
ii  libgcc1                     1:4.1.1-21   GCC support library
ii  libgcrypt11                 1.2.3-2      LGPL Crypto library - runtime libr
ii  libgnutls13                 1.4.4-3      the GNU TLS library - runtime libr
ii  libgpg-error0               1.4-1        library for common error values an
ii  libncurses5                 5.5-5        Shared libraries for terminal hand
ii  libreadline5                5.2-2        GNU readline and history libraries
ii  libtasn1-3                  0.3.6-2      Manage ASN.1 structures (runtime)
ii  netbase                     4.29         Basic TCP/IP networking system
ii  zlib1g                      1:1.2.3-13   compression library - runtime

lftp recommends no packages.

-- no debconf information
diff -ur lftp-3.5.6.orig/src/url.cc lftp-3.5.6/src/url.cc
--- lftp-3.5.6.orig/src/url.cc	2006-02-06 11:59:59.000000000 +0100
+++ lftp-3.5.6/src/url.cc	2007-08-08 08:08:37.000000000 +0200
@@ -441,6 +441,7 @@
 
 /* Encodes the unsafe characters (listed in URL_UNSAFE) in a given
    string, returning a malloc-ed %XX encoded string.  */
+inline char *cat_quoted (char *p, const unsigned char c);
 #define need_quote(c) (!unsafe || iscntrl((unsigned char)(c)) || strchr(unsafe,(c)))
 char *url::encode_string (const char *s,char *res,const char *unsafe)
 {
@@ -462,10 +463,12 @@
   {
     if (need_quote(*s))
       {
-	const unsigned char c = *s;
-	*p++ = '%';
-	sprintf(p,"%02X",c);
-	p+=2;
+	p = cat_quoted (p, *s);
+	if ((unsigned char) *s == 0xC3 && s[1])
+	  {
+	    s++;
+	    p = cat_quoted (p, *s);
+	  }
       }
     else
       *p++ = *s;
@@ -474,6 +477,14 @@
   return res;
 }
 
+inline char *cat_quoted (char *p, const unsigned char c)
+{
+  *p++ = '%';
+  sprintf(p,"%02X",c);
+  p+=2;
+  return p;
+}
+
 bool url::dir_needs_trailing_slash(const char *proto)
 {
    if(!proto)
diff -ur lftp-3.5.6.orig/src/url.h lftp-3.5.6/src/url.h
--- lftp-3.5.6.orig/src/url.h	2006-02-06 12:00:06.000000000 +0100
+++ lftp-3.5.6/src/url.h	2007-08-08 08:04:48.000000000 +0200
@@ -47,7 +47,7 @@
    char *Combine(const char *home=0,bool use_rfc1738=true);
 };
 
-# define URL_UNSAFE " <>\"%{}|\\^[]`"
+# define URL_UNSAFE " <>\"%{}|\\^[]`\xC3"
 # define URL_PATH_UNSAFE URL_UNSAFE"#;?"
 # define URL_HOST_UNSAFE URL_UNSAFE":/"
 # define URL_PORT_UNSAFE URL_UNSAFE"/"
