(1) With legacy code pages (not UTF-8), certain non-ASCII Unicode characters in file names are converted in a lossy way ("best-fit mapping"). For example, if active code page is Windows-1252 and there is a file name with U+2044 in it, it becomes ASCII '/'. Thus, malicious file names result in directory traversal.
Fix by reading the file names as wide chars and converting them losslessly to active code page. If lossless conversion isn't possible, skip the name. At the end of the directory, if anything was skipped, set errno to EOVERFLOW. It should be the correct choice according to POSIX readdir() docs. The check for lossless conversion is useful even with the UTF-8 code page because unpaired surrogates (invalid UTF-16) are converted to U+FFFD. It seems likely that file names with unpaired surrogates cannot be accessed via multibyte APIs. For more information, see the blog post by Orange Tsai and Splitline from the DEVCORE Research Team.[*] (2) With UTF-8 code page, file names can be longer than MAX_PATH bytes. For example, a long name may fit in MAX_PATH when encoded as Windows-1252 and then readdir() returns it correctly. If the same name exceeds MAX_PATH as UTF-8, readdir() used to fail. Such file names can still be opened etc. as long as their wide char encoding doesn't exceed MAX_PATH. (3) If application is declared as "long path aware" in an application manifest, then file names can be longer than MAX_PATH even in the native UTF-16 encoding (up to about 32767 wide chars). Fix by dynamically allocating the buffer for the absolute file name. [*] https://devco.re/blog/2025/01/09/worstfit-unveiling-hidden-transformers-in-windows-ansi/ * * * The above text is also in the commit message of the attached patch. A few notes: - rewinddir.c I didn't test! - I used EILSEQ if opendir() is given an invalid multibyte string. I feel it's the most descriptive code. POSIX doesn't document EILSEQ or even EINVAL for opendir() though. - The old code used EINVAL if GetFullPathName() failed. I kept the EINVAL there but I'm not sure what is best. ENOENT could be an alternative but perhaps EINVAL helps debugging if GetFullPathName() does fail. - ChangeLog is missing and the commit message format isn't in the format used in Gnulib. If this patch is wanted, I hope someone can help with ChangeLog and the commit message. - It's not ideal that three issues are fixed in a single patch but those issues were fairly entangled in the code so I didn't try to separate them after I was done. There might be more to fix in Gnulib: - lib/stat.c uses FindFirstFileA(). Likely it needs a fix too but I haven't tested it. - lib/progreloc.c and lib/relocatable.c use GetModuleFileNameA(). It sounds likely that it uses best-fit mapping. It's also MAX_PATH limited. The "best-fit mapping" or "WorstFit" issue category affects many apps, especially those that have been ported from POSIX systems. This particular Gnulib issue is just one thing. XZ Utils was affected too. I fixed it by switching to UTF-8 (which doesn't help on past Windows versions though). These Windows details were all new to me in September 2024, so I wrote a few notes how application manifest was set up in XZ Utils: https://github.com/tukaani-project/xz/blob/master/src/common/w32_application.manifest.comments.txt There are so many apps ported from POSIX systems that it's a pain to manually add even a few lines of code to every one of them to protect them against malicious command line arguments etc. GCC can add an application manifest by default already but if it is used to enable UTF-8, apps need to be compatible with it too. * * * I don't have nice and proper tests for this. :-( The attached tarball contains an ugly program that creates a few directories for manual testing (run create-test-files in an empty dir). One filename it creates is so long that it only works if for long path aware applications. The tarball contains three manifests but only one is used. The others might be convenient when testing the Gnulib code. I tested the patch via gzip 1.13 only. gzip is convenient for testing because it's quick to build and gzip -dcfr somedir will recursively "cat" all files to stdout (including uncompressed files). MinGW-w64 provides <dirent.h> and then Gnulib's code isn't used. I don't have MSVC but I tested with MinGW-w64: I copied the files to gzip's lib directory and forced the dirent replacements to be used: ../configure --host=x86_64-w64-mingw32 \ ac_cv_header_dirent_h=no \ ac_cv_header_dirent_dirent_h=no \ ac_cv_func_closedir=no \ ac_cv_func_opendir=no \ ac_cv_func_readdir=no gzip's tailor.h has # define MAX_PATH_LEN 260 for native Windows which has to be removed or increased significantly to test long UTF-8 names. gzip.c defaults to 1024 when tailor.h doesn't define it. The longest test file name also requires that utf8-longpath.rc is built and linked into gzip.exe (and that requires changing the Windows registry setting and rebooting). The shorter name works without longPathAware (utf8.rc) in the manifest too if the leading file name components are short enough. gzip uses Gnulib's savedir module which doesn't return any file names if there is an error. So in that sense gzip isn't the best test program because one problematic file name will make gzip skip all files in a directory; the delaying of EOVERFLOW isn't useful when using savedir. * * * Very minor things: In Gnulib's lib/pathmax.h, there is a comment about PATH_MAX from 2011: Undefine the original value, because mingw's <limits.h> gets it wrong. The code in pathmax.h does no harm but the issue was fixed in MinGW-w64 in 2010. I haven't looked at MinGW so I don't know if the issue is still there. Gnulib uses "mingw" in docs. Does it refer to MinGW or MinGW-w64 or both? In configure triplets, MinGW uses "pc" as the vendor and MinGW-w64 uses "w64". -- Lasse Collin
create-test-files.tar.xz
Description: application/xz
>From 69c54c1abbce9713157d2d47a1cb25aeb0963540 Mon Sep 17 00:00:00 2001 From: Lasse Collin <lasse.col...@tukaani.org> Date: Sun, 12 Jan 2025 14:31:20 +0200 Subject: [PATCH] opendir, readdir, rewinddir: Fix issues on native Windows (1) With legacy code pages (not UTF-8), certain non-ASCII Unicode characters in file names are converted in a lossy way ("best-fit mapping"). For example, if active code page is Windows-1252 and there is a file name with U+2044 in it, it becomes ASCII '/'. Thus, malicious file names result in directory traversal. Fix by reading the file names as wide chars and converting them losslessly to active code page. If lossless conversion isn't possible, skip the name. At the end of the directory, if anything was skipped, set errno to EOVERFLOW. It should be the correct choice according to POSIX readdir() docs. The check for lossless conversion is useful even with the UTF-8 code page because unpaired surrogates (invalid UTF-16) are converted to U+FFFD. It seems likely that file names with unpaired surrogates cannot be accessed via multibyte APIs. For more information, see the blog post by Orange Tsai and Splitline from the DEVCORE Research Team.[*] (2) With UTF-8 code page, file names can be longer than MAX_PATH bytes. For example, a long name may fit in MAX_PATH when encoded as Windows-1252 and then readdir() returns it correctly. If the same name exceeds MAX_PATH as UTF-8, readdir() used to fail. Such file names can still be opened etc. as long as their wide char encoding doesn't exceed MAX_PATH. (3) If application is declared as "long path aware" in an application manifest, then file names can be longer than MAX_PATH even in the native UTF-16 encoding (up to about 32767 wide chars). Fix by dynamically allocating the buffer for the absolute file name. [*] https://devco.re/blog/2025/01/09/worstfit-unveiling-hidden-transformers-in-windows-ansi/ --- lib/dirent-private.h | 33 +++++++--- lib/opendir.c | 140 +++++++++++++++++++++++++++---------------- lib/readdir.c | 37 ++++++++---- lib/rewinddir.c | 7 +-- 4 files changed, 141 insertions(+), 76 deletions(-) diff --git a/lib/dirent-private.h b/lib/dirent-private.h index 22d847106a..cc5e81300b 100644 --- a/lib/dirent-private.h +++ b/lib/dirent-private.h @@ -37,10 +37,7 @@ struct gl_directory # define WIN32_LEAN_AND_MEAN # include <windows.h> - -/* Don't assume that UNICODE is not defined. */ -# undef WIN32_FIND_DATA -# define WIN32_FIND_DATA WIN32_FIND_DATAA +# include <limits.h> struct gl_directory { @@ -53,13 +50,35 @@ struct gl_directory 0 means the entry needs to be filled using FindNextFile. A positive value is an error code. */ int status; + /* If one or more file names cannot be encoded in the active code page, + this is set to true. If end of the directory is otherwise reached + successfully, readdir() sets errno to EOVERFLOW. */ + bool has_problematic_chars; /* Handle, reading the directory, at current position. */ HANDLE current; /* Found directory entry. */ - WIN32_FIND_DATA entry; - /* Argument to pass to FindFirstFile. It consists of the absolutized + WIN32_FIND_DATAW entry; + /* readdir() returns a pointer to this struct. d_name is entry.cFileName + as multibyte string. A file name is at most 255 UTF-16 code units + (not code points) followed by the terminating null character. The + length as multibyte string depends on the active code page: + * In single-byte code pages like Windows-1252 it's 255 bytes + plus the terminating null character. + * UTF-8: The longest name uses 3-byte UTF-8 characters. + 4-byte UTF-8 characters consume two UTF-16 code units, thus + 3-byte chars result in the longest possible UTF-8 file name. + * GB18030: Some BMP chars like U+FFFD are four bytes. + * MB_LEN_MAX is 5 though. Use that to be certain that we can + handle all long names. + NOTE: With char-based APIs, file names with unpaired surrogates + (invalid UTF-16) are inaccessible even with the UTF-8 code page. */ + struct { + char d_type; + char d_name[255 * MB_LEN_MAX + 1]; + } dirent_entry; + /* Argument to pass to FindFirstFileW. It consists of the absolutized directory name, followed by a directory separator and the wildcards. */ - char dir_name_mask[1]; + wchar_t dir_name_mask[1]; }; #endif diff --git a/lib/opendir.c b/lib/opendir.c index a121e1befa..6d94322ec5 100644 --- a/lib/opendir.c +++ b/lib/opendir.c @@ -44,16 +44,6 @@ # include <unistd.h> #endif -#if defined _WIN32 && ! defined __CYGWIN__ -/* Don't assume that UNICODE is not defined. */ -# undef WIN32_FIND_DATA -# define WIN32_FIND_DATA WIN32_FIND_DATAA -# undef GetFullPathName -# define GetFullPathName GetFullPathNameA -# undef FindFirstFile -# define FindFirstFile FindFirstFileA -#endif - DIR * opendir (const char *dir_name) #undef opendir @@ -90,10 +80,6 @@ opendir (const char *dir_name) #else - char dir_name_mask[MAX_PATH + 1 + 1 + 1]; - int status; - HANDLE current; - WIN32_FIND_DATA entry; struct gl_directory *dirp; if (dir_name[0] == '\0') @@ -102,36 +88,103 @@ opendir (const char *dir_name) return NULL; } - /* Make the dir_name absolute, so that we continue reading the same - directory if the current directory changed between this opendir() - call and a subsequent rewinddir() call. */ - if (!GetFullPathName (dir_name, MAX_PATH, dir_name_mask, NULL)) - { - errno = EINVAL; - return NULL; - } - - /* Append the mask. - "*" and "*.*" appear to be equivalent. */ { - char *p; + /* Convert dir_name to UTF-16. */ + int wdir_name_size = MultiByteToWideChar (CP_ACP, MB_ERR_INVALID_CHARS, + dir_name, -1, NULL, 0); + if (wdir_name_size <= 0) + { + /* dir_name isn't a valid multibyte string. */ + errno = EILSEQ; + return NULL; + } + + wchar_t *wdir_name = malloc (wdir_name_size * sizeof (wchar_t)); + if (wdir_name == NULL) + { + errno = ENOMEM; + return NULL; + } + + if (MultiByteToWideChar (CP_ACP, MB_ERR_INVALID_CHARS, + dir_name, -1, wdir_name, wdir_name_size) + != wdir_name_size) + { + /* Conversion succeeded earlier so this should never be reached. */ + free (wdir_name); + errno = EILSEQ; + return NULL; + } + + /* Make the dir_name absolute, so that we continue reading the same + directory if the current directory changed between this opendir() + call and a subsequent rewinddir() call. + + Long path aware applications need to support pathnames longer than + MAX_PATH. Call GetFullPathNameW() in a loop because the required + buffer size can change if something renames a file name component + between our GetFullPathNameW() calls. */ + dirp = NULL; + DWORD dir_name_mask_size = 0; - p = dir_name_mask + strlen (dir_name_mask); - if (p > dir_name_mask && !ISSLASH (p[-1])) - *p++ = '\\'; - *p++ = '*'; - *p = '\0'; + while (true) + { + DWORD n = GetFullPathNameW (wdir_name, dir_name_mask_size, + dirp == NULL ? NULL : dirp->dir_name_mask, + NULL); + if (n == 0) + { + free (dirp); + free (wdir_name); + errno = EINVAL; + return NULL; + } + + /* When there is enough space, the return value doesn't include + the terminating \0. When there isn't enough space, the \0 is + counted in n. */ + if (n < dir_name_mask_size) + break; + + dir_name_mask_size = n; + + /* dir_name_mask_size includes the terminating \0. We also need space + to append the two characters "\*". */ + struct gl_directory *new_dirp = + realloc(dirp, offsetof (struct gl_directory, dir_name_mask[0]) + + (dir_name_mask_size + 1 + 1) * sizeof (wchar_t)); + if (new_dirp == NULL) + { + free (dirp); + free (wdir_name); + errno = ENOMEM; + } + + dirp = new_dirp; + } + + free (wdir_name); + + /* Append the mask. + "*" and "*.*" appear to be equivalent. + Because wchar_t is UTF-16, casting to int for ISSLASH is OK. */ + wchar_t *p = dirp->dir_name_mask + dir_name_mask_size - 1; + if (p > dirp->dir_name_mask && !ISSLASH ((int) p[-1])) + *p++ = L'\\'; + *p++ = L'*'; + *p = L'\0'; } /* Start searching the directory. */ - status = -1; - current = FindFirstFile (dir_name_mask, &entry); - if (current == INVALID_HANDLE_VALUE) + dirp->status = -1; + dirp->has_problematic_chars = false; + dirp->current = FindFirstFileW (dirp->dir_name_mask, &dirp->entry); + if (dirp->current == INVALID_HANDLE_VALUE) { switch (GetLastError ()) { case ERROR_FILE_NOT_FOUND: - status = -2; + dirp->status = -2; break; case ERROR_PATH_NOT_FOUND: errno = ENOENT; @@ -148,24 +201,7 @@ opendir (const char *dir_name) } } - /* Allocate the result. */ - dirp = - (struct gl_directory *) - malloc (offsetof (struct gl_directory, dir_name_mask[0]) - + strlen (dir_name_mask) + 1); - if (dirp == NULL) - { - if (current != INVALID_HANDLE_VALUE) - FindClose (current); - errno = ENOMEM; - return NULL; - } dirp->fd_to_close = -1; - dirp->status = status; - dirp->current = current; - if (status == -1) - memcpy (&dirp->entry, &entry, sizeof (WIN32_FIND_DATA)); - strcpy (dirp->dir_name_mask, dir_name_mask); #endif diff --git a/lib/readdir.c b/lib/readdir.c index 78225ec486..caa1c5d946 100644 --- a/lib/readdir.c +++ b/lib/readdir.c @@ -26,10 +26,6 @@ # include "dirent-private.h" #endif -/* Don't assume that UNICODE is not defined. */ -#undef FindNextFile -#define FindNextFile FindNextFileA - struct dirent * readdir (DIR *dirp) #undef readdir @@ -38,7 +34,6 @@ readdir (DIR *dirp) return readdir (dirp->real_dirp); #else char type; - struct dirent *result; /* There is no need to add code to produce entries for "." and "..". According to the POSIX:2008 section "4.12 Pathname Resolution" @@ -49,6 +44,7 @@ readdir (DIR *dirp) for dot and one entry shall be returned for dot-dot; otherwise, they shall not be returned." */ +again: switch (dirp->status) { case -2: @@ -57,12 +53,17 @@ readdir (DIR *dirp) case -1: break; case 0: - if (!FindNextFile (dirp->current, &dirp->entry)) + if (!FindNextFileW (dirp->current, &dirp->entry)) { switch (GetLastError ()) { case ERROR_NO_MORE_FILES: dirp->status = -2; + if (dirp->has_problematic_chars) + { + dirp->status = EOVERFLOW; + errno = dirp->status; + } return NULL; default: errno = EIO; @@ -98,12 +99,24 @@ readdir (DIR *dirp) else type = DT_UNKNOWN; - /* Reuse the memory of dirp->entry for the result. */ - result = - (struct dirent *) - ((char *) dirp->entry.cFileName - offsetof (struct dirent, d_name[0])); - result->d_type = type; + dirp->dirent_entry.d_type = type; + + /* Try to convert to multibyte string. */ + BOOL conv_was_lossy = TRUE; + int conv_result = WideCharToMultiByte (CP_ACP, WC_NO_BEST_FIT_CHARS, + dirp->entry.cFileName, -1, + dirp->dirent_entry.d_name, + sizeof (dirp->dirent_entry.d_name), + NULL, &conv_was_lossy); + if (conv_result <= 0 || conv_was_lossy) + { + /* This file name cannot be encoded in the active code page. + Try to list the other file names still but remember that + this problem occurred. */ + dirp->has_problematic_chars = true; + goto again; + } - return result; + return (struct dirent *) &dirp->dirent_entry; #endif } diff --git a/lib/rewinddir.c b/lib/rewinddir.c index 3fd31cdf4a..60bc7cf38f 100644 --- a/lib/rewinddir.c +++ b/lib/rewinddir.c @@ -25,10 +25,6 @@ # include "dirent-private.h" #endif -/* Don't assume that UNICODE is not defined. */ -#undef FindFirstFile -#define FindFirstFile FindFirstFileA - void rewinddir (DIR *dirp) #undef rewinddir @@ -42,7 +38,8 @@ rewinddir (DIR *dirp) /* Like in opendir(). */ dirp->status = -1; - dirp->current = FindFirstFile (dirp->dir_name_mask, &dirp->entry); + dirp->has_problematic_chars = false; + dirp->current = FindFirstFileW (dirp->dir_name_mask, &dirp->entry); if (dirp->current == INVALID_HANDLE_VALUE) { switch (GetLastError ()) -- 2.47.1