(1) With legacy code pages (not UTF-8), certain non-ASCII
    Unicode characters in file names are converted in a lossy
    way ("best-fit mapping"). For example, if active code page
    is Windows-1252 and there is a file name with U+2044 in it,
    it becomes ASCII '/'. Thus, malicious file names result in
    directory traversal.

    Fix by reading the file names as wide chars and converting
    them losslessly to active code page. If lossless conversion
    isn't possible, skip the name. At the end of the directory,
    if anything was skipped, set errno to EOVERFLOW. It should
    be the correct choice according to POSIX readdir() docs.

    The check for lossless conversion is useful even with the UTF-8
    code page because unpaired surrogates (invalid UTF-16) are
    converted to U+FFFD. It seems likely that file names with
    unpaired surrogates cannot be accessed via multibyte APIs.

    For more information, see the blog post by Orange Tsai and
    Splitline from the DEVCORE Research Team.[*]

(2) With UTF-8 code page, file names can be longer than MAX_PATH bytes.
    For example, a long name may fit in MAX_PATH when encoded as
    Windows-1252 and then readdir() returns it correctly. If the same
    name exceeds MAX_PATH as UTF-8, readdir() used to fail. Such
    file names can still be opened etc. as long as their wide char
    encoding doesn't exceed MAX_PATH.

(3) If application is declared as "long path aware" in an application
    manifest, then file names can be longer than MAX_PATH even in the
    native UTF-16 encoding (up to about 32767 wide chars). Fix by
    dynamically allocating the buffer for the absolute file name.

[*] 
https://devco.re/blog/2025/01/09/worstfit-unveiling-hidden-transformers-in-windows-ansi/

* * *

The above text is also in the commit message of the attached patch. A
few notes:

  - rewinddir.c I didn't test!

  - I used EILSEQ if opendir() is given an invalid multibyte string. I
    feel it's the most descriptive code. POSIX doesn't document EILSEQ
    or even EINVAL for opendir() though.

  - The old code used EINVAL if GetFullPathName() failed. I kept the
    EINVAL there but I'm not sure what is best. ENOENT could be an
    alternative but perhaps EINVAL helps debugging if GetFullPathName()
    does fail.

  - ChangeLog is missing and the commit message format isn't in the
    format used in Gnulib. If this patch is wanted, I hope someone can
    help with ChangeLog and the commit message.

  - It's not ideal that three issues are fixed in a single patch but
    those issues were fairly entangled in the code so I didn't try to
    separate them after I was done.

There might be more to fix in Gnulib:

  - lib/stat.c uses FindFirstFileA(). Likely it needs a fix too but I
    haven't tested it.

  - lib/progreloc.c and lib/relocatable.c use GetModuleFileNameA(). It
    sounds likely that it uses best-fit mapping. It's also MAX_PATH
    limited.

The "best-fit mapping" or "WorstFit" issue category affects many apps,
especially those that have been ported from POSIX systems. This
particular Gnulib issue is just one thing.

XZ Utils was affected too. I fixed it by switching to UTF-8 (which
doesn't help on past Windows versions though). These Windows details
were all new to me in September 2024, so I wrote a few notes how
application manifest was set up in XZ Utils:

    
https://github.com/tukaani-project/xz/blob/master/src/common/w32_application.manifest.comments.txt

There are so many apps ported from POSIX systems that it's a pain to
manually add even a few lines of code to every one of them to protect
them against malicious command line arguments etc. GCC can add an
application manifest by default already but if it is used to enable
UTF-8, apps need to be compatible with it too.

* * *

I don't have nice and proper tests for this. :-( The attached tarball
contains an ugly program that creates a few directories for manual
testing (run create-test-files in an empty dir). One filename it
creates is so long that it only works if for long path aware
applications. The tarball contains three manifests but only one is
used. The others might be convenient when testing the Gnulib code.

I tested the patch via gzip 1.13 only. gzip is convenient for testing
because it's quick to build and

    gzip -dcfr somedir

will recursively "cat" all files to stdout (including uncompressed
files).

MinGW-w64 provides <dirent.h> and then Gnulib's code isn't used. I
don't have MSVC but I tested with MinGW-w64: I copied the files to
gzip's lib directory and forced the dirent replacements to be used:

    ../configure --host=x86_64-w64-mingw32 \
                 ac_cv_header_dirent_h=no \
                 ac_cv_header_dirent_dirent_h=no \
                 ac_cv_func_closedir=no \
                 ac_cv_func_opendir=no \
                 ac_cv_func_readdir=no

gzip's tailor.h has

    #  define MAX_PATH_LEN  260

for native Windows which has to be removed or increased significantly
to test long UTF-8 names. gzip.c defaults to 1024 when tailor.h doesn't
define it. The longest test file name also requires that
utf8-longpath.rc is built and linked into gzip.exe (and that requires
changing the Windows registry setting and rebooting). The shorter name
works without longPathAware (utf8.rc) in the manifest too if the
leading file name components are short enough.

gzip uses Gnulib's savedir module which doesn't return any file names
if there is an error. So in that sense gzip isn't the best test program
because one problematic file name will make gzip skip all files in a
directory; the delaying of EOVERFLOW isn't useful when using savedir.

* * *

Very minor things:

In Gnulib's lib/pathmax.h, there is a comment about PATH_MAX from 2011:

    Undefine the original value, because mingw's <limits.h> gets it
    wrong.

The code in pathmax.h does no harm but the issue was fixed in MinGW-w64
in 2010. I haven't looked at MinGW so I don't know if the issue is
still there.

Gnulib uses "mingw" in docs. Does it refer to MinGW or MinGW-w64 or
both? In configure triplets, MinGW uses "pc" as the vendor and
MinGW-w64 uses "w64".

-- 
Lasse Collin

Attachment: create-test-files.tar.xz
Description: application/xz

>From 69c54c1abbce9713157d2d47a1cb25aeb0963540 Mon Sep 17 00:00:00 2001
From: Lasse Collin <lasse.col...@tukaani.org>
Date: Sun, 12 Jan 2025 14:31:20 +0200
Subject: [PATCH] opendir, readdir, rewinddir: Fix issues on native Windows

(1) With legacy code pages (not UTF-8), certain non-ASCII
    Unicode characters in file names are converted in a lossy
    way ("best-fit mapping"). For example, if active code page
    is Windows-1252 and there is a file name with U+2044 in it,
    it becomes ASCII '/'. Thus, malicious file names result in
    directory traversal.

    Fix by reading the file names as wide chars and converting
    them losslessly to active code page. If lossless conversion
    isn't possible, skip the name. At the end of the directory,
    if anything was skipped, set errno to EOVERFLOW. It should
    be the correct choice according to POSIX readdir() docs.

    The check for lossless conversion is useful even with the UTF-8
    code page because unpaired surrogates (invalid UTF-16) are
    converted to U+FFFD. It seems likely that file names with
    unpaired surrogates cannot be accessed via multibyte APIs.

    For more information, see the blog post by Orange Tsai and
    Splitline from the DEVCORE Research Team.[*]

(2) With UTF-8 code page, file names can be longer than MAX_PATH bytes.
    For example, a long name may fit in MAX_PATH when encoded as
    Windows-1252 and then readdir() returns it correctly. If the same
    name exceeds MAX_PATH as UTF-8, readdir() used to fail. Such
    file names can still be opened etc. as long as their wide char
    encoding doesn't exceed MAX_PATH.

(3) If application is declared as "long path aware" in an application
    manifest, then file names can be longer than MAX_PATH even in the
    native UTF-16 encoding (up to about 32767 wide chars). Fix by
    dynamically allocating the buffer for the absolute file name.

[*] https://devco.re/blog/2025/01/09/worstfit-unveiling-hidden-transformers-in-windows-ansi/
---
 lib/dirent-private.h |  33 +++++++---
 lib/opendir.c        | 140 +++++++++++++++++++++++++++----------------
 lib/readdir.c        |  37 ++++++++----
 lib/rewinddir.c      |   7 +--
 4 files changed, 141 insertions(+), 76 deletions(-)

diff --git a/lib/dirent-private.h b/lib/dirent-private.h
index 22d847106a..cc5e81300b 100644
--- a/lib/dirent-private.h
+++ b/lib/dirent-private.h
@@ -37,10 +37,7 @@ struct gl_directory
 
 # define WIN32_LEAN_AND_MEAN
 # include <windows.h>
-
-/* Don't assume that UNICODE is not defined.  */
-# undef WIN32_FIND_DATA
-# define WIN32_FIND_DATA WIN32_FIND_DATAA
+# include <limits.h>
 
 struct gl_directory
 {
@@ -53,13 +50,35 @@ struct gl_directory
      0 means the entry needs to be filled using FindNextFile.
      A positive value is an error code.  */
   int status;
+  /* If one or more file names cannot be encoded in the active code page,
+     this is set to true. If end of the directory is otherwise reached
+     successfully, readdir() sets errno to EOVERFLOW.  */
+  bool has_problematic_chars;
   /* Handle, reading the directory, at current position.  */
   HANDLE current;
   /* Found directory entry.  */
-  WIN32_FIND_DATA entry;
-  /* Argument to pass to FindFirstFile.  It consists of the absolutized
+  WIN32_FIND_DATAW entry;
+  /* readdir() returns a pointer to this struct. d_name is entry.cFileName
+     as multibyte string. A file name is at most 255 UTF-16 code units
+     (not code points) followed by the terminating null character. The
+     length as multibyte string depends on the active code page:
+       * In single-byte code pages like Windows-1252 it's 255 bytes
+         plus the terminating null character.
+       * UTF-8: The longest name uses 3-byte UTF-8 characters.
+         4-byte UTF-8 characters consume two UTF-16 code units, thus
+         3-byte chars result in the longest possible UTF-8 file name.
+       * GB18030: Some BMP chars like U+FFFD are four bytes.
+       * MB_LEN_MAX is 5 though. Use that to be certain that we can
+         handle all long names.
+     NOTE: With char-based APIs, file names with unpaired surrogates
+     (invalid UTF-16) are inaccessible even with the UTF-8 code page.  */
+  struct {
+    char d_type;
+    char d_name[255 * MB_LEN_MAX + 1];
+  } dirent_entry;
+  /* Argument to pass to FindFirstFileW.  It consists of the absolutized
      directory name, followed by a directory separator and the wildcards.  */
-  char dir_name_mask[1];
+  wchar_t dir_name_mask[1];
 };
 
 #endif
diff --git a/lib/opendir.c b/lib/opendir.c
index a121e1befa..6d94322ec5 100644
--- a/lib/opendir.c
+++ b/lib/opendir.c
@@ -44,16 +44,6 @@
 # include <unistd.h>
 #endif
 
-#if defined _WIN32 && ! defined __CYGWIN__
-/* Don't assume that UNICODE is not defined.  */
-# undef WIN32_FIND_DATA
-# define WIN32_FIND_DATA WIN32_FIND_DATAA
-# undef GetFullPathName
-# define GetFullPathName GetFullPathNameA
-# undef FindFirstFile
-# define FindFirstFile FindFirstFileA
-#endif
-
 DIR *
 opendir (const char *dir_name)
 #undef opendir
@@ -90,10 +80,6 @@ opendir (const char *dir_name)
 
 #else
 
-  char dir_name_mask[MAX_PATH + 1 + 1 + 1];
-  int status;
-  HANDLE current;
-  WIN32_FIND_DATA entry;
   struct gl_directory *dirp;
 
   if (dir_name[0] == '\0')
@@ -102,36 +88,103 @@ opendir (const char *dir_name)
       return NULL;
     }
 
-  /* Make the dir_name absolute, so that we continue reading the same
-     directory if the current directory changed between this opendir()
-     call and a subsequent rewinddir() call.  */
-  if (!GetFullPathName (dir_name, MAX_PATH, dir_name_mask, NULL))
-    {
-      errno = EINVAL;
-      return NULL;
-    }
-
-  /* Append the mask.
-     "*" and "*.*" appear to be equivalent.  */
   {
-    char *p;
+    /* Convert dir_name to UTF-16. */
+    int wdir_name_size = MultiByteToWideChar (CP_ACP, MB_ERR_INVALID_CHARS,
+                                              dir_name, -1, NULL, 0);
+    if (wdir_name_size <= 0)
+      {
+        /* dir_name isn't a valid multibyte string. */
+        errno = EILSEQ;
+        return NULL;
+      }
+
+    wchar_t *wdir_name = malloc (wdir_name_size * sizeof (wchar_t));
+    if (wdir_name == NULL)
+      {
+        errno = ENOMEM;
+        return NULL;
+      }
+
+    if (MultiByteToWideChar (CP_ACP, MB_ERR_INVALID_CHARS,
+                             dir_name, -1, wdir_name, wdir_name_size)
+                            != wdir_name_size)
+      {
+        /* Conversion succeeded earlier so this should never be reached. */
+        free (wdir_name);
+        errno = EILSEQ;
+        return NULL;
+      }
+
+    /* Make the dir_name absolute, so that we continue reading the same
+       directory if the current directory changed between this opendir()
+       call and a subsequent rewinddir() call.
+
+       Long path aware applications need to support pathnames longer than
+       MAX_PATH. Call GetFullPathNameW() in a loop because the required
+       buffer size can change if something renames a file name component
+       between our GetFullPathNameW() calls.  */
+    dirp = NULL;
+    DWORD dir_name_mask_size = 0;
 
-    p = dir_name_mask + strlen (dir_name_mask);
-    if (p > dir_name_mask && !ISSLASH (p[-1]))
-      *p++ = '\\';
-    *p++ = '*';
-    *p = '\0';
+    while (true)
+      {
+        DWORD n = GetFullPathNameW (wdir_name, dir_name_mask_size,
+                                    dirp == NULL ? NULL : dirp->dir_name_mask,
+                                    NULL);
+        if (n == 0)
+          {
+            free (dirp);
+            free (wdir_name);
+            errno = EINVAL;
+            return NULL;
+          }
+
+         /* When there is enough space, the return value doesn't include
+            the terminating \0. When there isn't enough space, the \0 is
+            counted in n.  */
+        if (n < dir_name_mask_size)
+          break;
+
+        dir_name_mask_size = n;
+
+        /* dir_name_mask_size includes the terminating \0. We also need space
+           to append the two characters "\*".  */
+        struct gl_directory *new_dirp =
+          realloc(dirp, offsetof (struct gl_directory, dir_name_mask[0])
+                        + (dir_name_mask_size + 1 + 1) * sizeof (wchar_t));
+        if (new_dirp == NULL)
+          {
+            free (dirp);
+            free (wdir_name);
+            errno = ENOMEM;
+          }
+
+        dirp = new_dirp;
+      }
+
+    free (wdir_name);
+
+    /* Append the mask.
+       "*" and "*.*" appear to be equivalent.
+       Because wchar_t is UTF-16, casting to int for ISSLASH is OK.  */
+    wchar_t *p = dirp->dir_name_mask + dir_name_mask_size - 1;
+    if (p > dirp->dir_name_mask && !ISSLASH ((int) p[-1]))
+      *p++ = L'\\';
+    *p++ = L'*';
+    *p = L'\0';
   }
 
   /* Start searching the directory.  */
-  status = -1;
-  current = FindFirstFile (dir_name_mask, &entry);
-  if (current == INVALID_HANDLE_VALUE)
+  dirp->status = -1;
+  dirp->has_problematic_chars = false;
+  dirp->current = FindFirstFileW (dirp->dir_name_mask, &dirp->entry);
+  if (dirp->current == INVALID_HANDLE_VALUE)
     {
       switch (GetLastError ())
         {
         case ERROR_FILE_NOT_FOUND:
-          status = -2;
+          dirp->status = -2;
           break;
         case ERROR_PATH_NOT_FOUND:
           errno = ENOENT;
@@ -148,24 +201,7 @@ opendir (const char *dir_name)
         }
     }
 
-  /* Allocate the result.  */
-  dirp =
-    (struct gl_directory *)
-    malloc (offsetof (struct gl_directory, dir_name_mask[0])
-            + strlen (dir_name_mask) + 1);
-  if (dirp == NULL)
-    {
-      if (current != INVALID_HANDLE_VALUE)
-        FindClose (current);
-      errno = ENOMEM;
-      return NULL;
-    }
   dirp->fd_to_close = -1;
-  dirp->status = status;
-  dirp->current = current;
-  if (status == -1)
-    memcpy (&dirp->entry, &entry, sizeof (WIN32_FIND_DATA));
-  strcpy (dirp->dir_name_mask, dir_name_mask);
 
 #endif
 
diff --git a/lib/readdir.c b/lib/readdir.c
index 78225ec486..caa1c5d946 100644
--- a/lib/readdir.c
+++ b/lib/readdir.c
@@ -26,10 +26,6 @@
 # include "dirent-private.h"
 #endif
 
-/* Don't assume that UNICODE is not defined.  */
-#undef FindNextFile
-#define FindNextFile FindNextFileA
-
 struct dirent *
 readdir (DIR *dirp)
 #undef readdir
@@ -38,7 +34,6 @@ readdir (DIR *dirp)
   return readdir (dirp->real_dirp);
 #else
   char type;
-  struct dirent *result;
 
   /* There is no need to add code to produce entries for "." and "..".
      According to the POSIX:2008 section "4.12 Pathname Resolution"
@@ -49,6 +44,7 @@ readdir (DIR *dirp)
         for dot and one entry shall be returned for dot-dot; otherwise,
         they shall not be returned."  */
 
+again:
   switch (dirp->status)
     {
     case -2:
@@ -57,12 +53,17 @@ readdir (DIR *dirp)
     case -1:
       break;
     case 0:
-      if (!FindNextFile (dirp->current, &dirp->entry))
+      if (!FindNextFileW (dirp->current, &dirp->entry))
         {
           switch (GetLastError ())
             {
             case ERROR_NO_MORE_FILES:
               dirp->status = -2;
+              if (dirp->has_problematic_chars)
+                {
+                  dirp->status = EOVERFLOW;
+                  errno = dirp->status;
+                }
               return NULL;
             default:
               errno = EIO;
@@ -98,12 +99,24 @@ readdir (DIR *dirp)
   else
     type = DT_UNKNOWN;
 
-  /* Reuse the memory of dirp->entry for the result.  */
-  result =
-    (struct dirent *)
-    ((char *) dirp->entry.cFileName - offsetof (struct dirent, d_name[0]));
-  result->d_type = type;
+  dirp->dirent_entry.d_type = type;
+
+  /* Try to convert to multibyte string.  */
+  BOOL conv_was_lossy = TRUE;
+  int conv_result = WideCharToMultiByte (CP_ACP, WC_NO_BEST_FIT_CHARS,
+                                         dirp->entry.cFileName, -1,
+                                         dirp->dirent_entry.d_name,
+                                         sizeof (dirp->dirent_entry.d_name),
+                                         NULL, &conv_was_lossy);
+  if (conv_result <= 0 || conv_was_lossy)
+    {
+      /* This file name cannot be encoded in the active code page.
+         Try to list the other file names still but remember that
+         this problem occurred.  */
+      dirp->has_problematic_chars = true;
+      goto again;
+    }
 
-  return result;
+  return (struct dirent *) &dirp->dirent_entry;
 #endif
 }
diff --git a/lib/rewinddir.c b/lib/rewinddir.c
index 3fd31cdf4a..60bc7cf38f 100644
--- a/lib/rewinddir.c
+++ b/lib/rewinddir.c
@@ -25,10 +25,6 @@
 # include "dirent-private.h"
 #endif
 
-/* Don't assume that UNICODE is not defined.  */
-#undef FindFirstFile
-#define FindFirstFile FindFirstFileA
-
 void
 rewinddir (DIR *dirp)
 #undef rewinddir
@@ -42,7 +38,8 @@ rewinddir (DIR *dirp)
 
   /* Like in opendir().  */
   dirp->status = -1;
-  dirp->current = FindFirstFile (dirp->dir_name_mask, &dirp->entry);
+  dirp->has_problematic_chars = false;
+  dirp->current = FindFirstFileW (dirp->dir_name_mask, &dirp->entry);
   if (dirp->current == INVALID_HANDLE_VALUE)
     {
       switch (GetLastError ())
-- 
2.47.1

Reply via email to