Re: license request for mgetgroups
On 2013-05-22 Eric Blake wrote: > Since getgrouplist is implemented in glibc as LGPLv2+, is there any > objection to relicensing the following modules as LGPLv2+? [...] > getgroups [LGPLv3+] => *Jim, *Eric, Bruno, Paul > getugroups [GPLv3+] => *Jim, Eric, Paul, Bruno, Lasse > mgetgroups [GPLv3+] => *Jim, *Eric, Bruno > xalloc-oversized [GPLv3+] => [*all] Eric, Paul, Jim, (Bruno) Relicensing is fine with me. (I barely touched the code.) -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: argp --help formatting bug with non-ASCII characters
On 2024-04-12 Bruno Haible wrote: > How to reproduce: > 1. Build GNU tar (1.33 or 1.35, for example). > 2. Install it. > 3. You can see misindentation: > $ LC_ALL=de_DE.UTF-8 src/tar --help > > ... > abzubilden > --mode=ÄNDERUNGEN den (symbolischen) Modus ÄNDERUNGEN für > hinzugefügte Dateien erzwingen [...] > In other words, it should use the Gnulib module 'mbswidth'. This is one of the issues. The other one is with the wrapping of the description text. It counts 1 byte == 1 column but in languages which mostly use non-ASCII chars it can be mostly 2 bytes == 1 column. Then the wrapping occurs far too early, resulting in narrow output. -- Lasse Collin
Re: supporting in the UTF-8 environment on native Windows
I commented on bug-gettext that these fixes work for me when environment variables LANG, LC_*, or LANGUAGE aren't set. Thanks again! The commit 00211fc69c92 ("setlocale: Support the UTF-8 environment on native Windows.") introduces Windows support for C.UTF-8 in setlocale.c. Emulating C.UTF-8 with "English_United States.65001" looks like a great idea with LC_CTYPE. However, with some other locale categories it's problematic. I think "C.UTF-8" should map to a mixed locale. I noticed the informative link[1] in the Gettext commit 3873b7f1c777 ("intl: Treat C.UTF-8 locale like C locale."). The wiki page makes me suspect that LC_ALL=C.UTF-8 should set LC_CTYPE to "English_United States.65001" while setting all other categories to "C". This seems pretty clear for LC_COLLATE and LC_NUMERIC but I'm not sure about the other categories. [1] https://sourceware.org/glibc/wiki/Proposals/C.UTF-8 LC_COLLATE=C.UTF-8 should sort in Unicode codepoint order. In UTF-8 this is the same as byte order, thus "C" does the right thing (at least for valid UTF-8 inputs). In contrast, "English_United States.65001" sorts by English rules, for example, putting "ä" before "b". Test with strcoll("b", "ä"). LC_NUMERIC can make a difference if thousand separators are requested. Thousand separators in printf() are a POSIX feature that UCRT's printf() doesn't support. However, MinGW-w64's replacement via "#define __USE_MINGW_ANSI_STDIO 1" provides an implementation that does support thousand separators. -- Lasse Collin
[PATCH] opendir, readdir, rewinddir: Fix issues on native Windows
ac_cv_func_readdir=no gzip's tailor.h has # define MAX_PATH_LEN 260 for native Windows which has to be removed or increased significantly to test long UTF-8 names. gzip.c defaults to 1024 when tailor.h doesn't define it. The longest test file name also requires that utf8-longpath.rc is built and linked into gzip.exe (and that requires changing the Windows registry setting and rebooting). The shorter name works without longPathAware (utf8.rc) in the manifest too if the leading file name components are short enough. gzip uses Gnulib's savedir module which doesn't return any file names if there is an error. So in that sense gzip isn't the best test program because one problematic file name will make gzip skip all files in a directory; the delaying of EOVERFLOW isn't useful when using savedir. * * * Very minor things: In Gnulib's lib/pathmax.h, there is a comment about PATH_MAX from 2011: Undefine the original value, because mingw's gets it wrong. The code in pathmax.h does no harm but the issue was fixed in MinGW-w64 in 2010. I haven't looked at MinGW so I don't know if the issue is still there. Gnulib uses "mingw" in docs. Does it refer to MinGW or MinGW-w64 or both? In configure triplets, MinGW uses "pc" as the vendor and MinGW-w64 uses "w64". -- Lasse Collin create-test-files.tar.xz Description: application/xz >From 69c54c1abbce9713157d2d47a1cb25aeb0963540 Mon Sep 17 00:00:00 2001 From: Lasse Collin Date: Sun, 12 Jan 2025 14:31:20 +0200 Subject: [PATCH] opendir, readdir, rewinddir: Fix issues on native Windows (1) With legacy code pages (not UTF-8), certain non-ASCII Unicode characters in file names are converted in a lossy way ("best-fit mapping"). For example, if active code page is Windows-1252 and there is a file name with U+2044 in it, it becomes ASCII '/'. Thus, malicious file names result in directory traversal. Fix by reading the file names as wide chars and converting them losslessly to active code page. If lossless conversion isn't possible, skip the name. At the end of the directory, if anything was skipped, set errno to EOVERFLOW. It should be the correct choice according to POSIX readdir() docs. The check for lossless conversion is useful even with the UTF-8 code page because unpaired surrogates (invalid UTF-16) are converted to U+FFFD. It seems likely that file names with unpaired surrogates cannot be accessed via multibyte APIs. For more information, see the blog post by Orange Tsai and Splitline from the DEVCORE Research Team.[*] (2) With UTF-8 code page, file names can be longer than MAX_PATH bytes. For example, a long name may fit in MAX_PATH when encoded as Windows-1252 and then readdir() returns it correctly. If the same name exceeds MAX_PATH as UTF-8, readdir() used to fail. Such file names can still be opened etc. as long as their wide char encoding doesn't exceed MAX_PATH. (3) If application is declared as "long path aware" in an application manifest, then file names can be longer than MAX_PATH even in the native UTF-16 encoding (up to about 32767 wide chars). Fix by dynamically allocating the buffer for the absolute file name. [*] https://devco.re/blog/2025/01/09/worstfit-unveiling-hidden-transformers-in-windows-ansi/ --- lib/dirent-private.h | 33 +++--- lib/opendir.c| 140 +++ lib/readdir.c| 37 lib/rewinddir.c | 7 +-- 4 files changed, 141 insertions(+), 76 deletions(-) diff --git a/lib/dirent-private.h b/lib/dirent-private.h index 22d847106a..cc5e81300b 100644 --- a/lib/dirent-private.h +++ b/lib/dirent-private.h @@ -37,10 +37,7 @@ struct gl_directory # define WIN32_LEAN_AND_MEAN # include - -/* Don't assume that UNICODE is not defined. */ -# undef WIN32_FIND_DATA -# define WIN32_FIND_DATA WIN32_FIND_DATAA +# include struct gl_directory { @@ -53,13 +50,35 @@ struct gl_directory 0 means the entry needs to be filled using FindNextFile. A positive value is an error code. */ int status; + /* If one or more file names cannot be encoded in the active code page, + this is set to true. If end of the directory is otherwise reached + successfully, readdir() sets errno to EOVERFLOW. */ + bool has_problematic_chars; /* Handle, reading the directory, at current position. */ HANDLE current; /* Found directory entry. */ - WIN32_FIND_DATA entry; - /* Argument to pass to FindFirstFile. It consists of the absolutized + WIN32_FIND_DATAW entry; + /* readdir() returns a pointer to this struct. d_name is entry.cFileName + as multibyte string. A file name is at most 255 UTF-16 code units + (not code points) followed by the terminating null character. The + length as multibyte string depends on the a
Re: crc-x86_64: undefined behaviour
On 2025-01-17 Simon Josefsson wrote: > Lasse Collin writes: > > > [1] > > https://github.com/tukaani-project/xz/blob/master/src/liblzma/check/crc_x86_clmul.h > > > > Oh. Do you see any chance to merge your code into gnulib? If there > is any feature in your implementation that we don't have. And then > use it in xz? My code hasn't been included in any xz release yet. I don't want to backport it to 5.6.x and 5.7.1alpha isn't out yet. So the code hasn't been tested by many people yet. A CRC function tries to produce the correct result as fast as possible. There isn't much else to compare feature wise. It is fine to me if the code is adapted into Gnulib. However, a significant amount of effort has been put into the version that is in Gnulib already, so I don't want make any recommendations, especially since my version hasn't been widely tested yet. (I was afraid that if I link to my code, it's easily seen as if I'm advertising a replacement. Sorry.) > It seems a bit silly to have a lot of different > implementations around for this fairly low-level simple code, and as > we saw in this thread, maintaining it does show up bugs... Yes, it's a bit silly and duplicate effort. I suppose there are multiple reasons why many implementations exists, for example, avoiding extra dependencies, licensing questions, and just having fun. There has to be a reason why a CRC32 function is wanted in Gnulib instead of adding, for example, zlib as a dependency to packages that need CRC32. Even GNU has more than one CLMUL CRC implementation. See Libgcrypt: https://git.gnupg.org/cgi-bin/gitweb.cgi?p=libgcrypt.git;a=blob;f=cipher/crc-intel-pclmul.c I tested Gnulib's lib/crc-x86_64-pclmul.c (v1.0-1387-g5dd298feb0) a bit. I changed the target attribute to plain "pclmul" to make it work on a processor that lacks AVX. With 0-3 byte inputs the result differs from what I see from other implementations. I didn't investigate why it happens. I attached a test program that tests alignment variations but the problem can be seen without such a loop already. About sanitizers: They can be annoying with SIMD code. If a function is passed an unaligned buffer, it would be fine to round the address down to an aligned value, do an aligned read, and ignore the out-of-bounds bytes. One can do it in assembly because sanitizers don't see it. In contrast to sanitizers, Valgrind is happy if the extra bytes are thrown away. -- Lasse Collin /* From Gnulib version, comment out #includes for , "crc-x86_64.h", and "crc.h", and add . Possibly omit ",avx" from the attribute too. gcc -O2 -Wall -Wextra crc32-test.c -DUSE_ZLIB -lz gcc -O2 -Wall -Wextra crc32-test.c -DUSE_LIBLZMA -llzma gcc -O2 -Wall -Wextra crc32-test.c crc-x86_64-pclmul-modified.c */ #include #include #include #if defined(USE_ZLIB) # include # define CRC32(buf, size, crc) crc32(crc, buf, size) #elif defined(USE_LIBLZMA) # include # define CRC32(buf, size, crc) lzma_crc32(buf, size, crc) #else extern uint32_t crc32_update_no_xor_pclmul( uint32_t crc, const void *buf, size_t len); # define CRC32(buf, size, crc) ~crc32_update_no_xor_pclmul(~(crc), buf, size) #endif #define END_SIZE 2000 static uint8_t r[END_SIZE + 16 * 17]; // Yes, it's a bit oversized. static uint32_t test(size_t size) { uint32_t crc = 0; for (size_t k = 0; k < 16 * 17; k += 17) crc = CRC32(r + k, size, crc); return crc; } extern int main(void) { uint32_t seed = 29; for (size_t i = 0; i < sizeof(r); ++i) { seed = seed * 1103515245 + 12345; r[i] = (uint8_t)(seed >> 22); } for (size_t size = 0; size < END_SIZE; ++size) printf("%-4zu 0x%08" PRIX32 "\n", size, test(size)); return 0; }
Re: valgrind vs. sanitizers
On 2025-01-18 Paul Eggert wrote: > On 2025-01-18 11:45, Lasse Collin wrote: > > On 2025-01-18 Paul Eggert wrote: > >> Does the unaligned read trick work even with CheriBSD's memory-safe > >> model? That is an edge case that might need an ifdef or something. > >> > > > > I'm not familiar with CheriBSD but the trick never crosses a cache > > line boundary (or page boundary). So the memory-safe model has to > > be really strict to not allow it. > > Yes, it's really that strict. That's really nice, especially if performance is still acceptable. :-) > One way to fix bugs like this on CHERI is to use the alignment trick > only if the macro __CHERI_PURE_CAPABILITY__ is not defined. I will keep this in mind but the primarily lesson for me is that such tricks are a worse idea than I had thought. > Do you have access to cfarm ? I don't. Thanks for the idea. -- Lasse Collin
Re: crc-x86_64: undefined behaviour
On 2025-01-18 Sam Russell wrote: > IIRC we test values from 0-16 bytes in the unit test and the pclmul > implementation doesn't get hit for smaller values? Correct. lib/crc.c has if (pclmul_enabled && len >= 16) return crc32_update_no_xor_pclmul (crc, buf, len); so the behavior with short inputs is irrelevant. I'm sorry for the false alarm. Having assert(len >= 16); at the beginning of the pclmul function could have made it obvious but that's not an excuse, I should have looked at crc.c too. -- Lasse Collin
Re: valgrind vs. sanitizers
On 2025-01-18 Bruno Haible wrote: > Valgrind was a tool without replacement, for many years, when > sanitizers did not exist. Nowadays, however, I generally prefer > testing with sanitizers than with valgrind because there are some > bugs that ASAN finds and valgrind doesn't [1]. For example, when you > have a struct with two adjacent arrays, ASAN can find buffer overruns > of the first array, while valgrind can't. Just this week, sanitizers > have found a real bug in coreutils [2]. > > Regarding your trick to do an aligned read on (addr & -alignment) > instead of an unaligned read on (addr): I find it good that ASAN > catches this, because this trick amounts to exploiting a coincidental > property of current hardware. I agree. One could mark the tricks so that sanitizers skip them. There are attributes to do that but those can make people suspicious that one is trying to prevent fuzzers from finding security issues. Writing extra code to keep sanitizers happy is the alternative. Intrinsics usage is already platform specific (in practice at least) so doing platform specific tricks should be possible somehow because it can be more convenient to write with intrinsics than assembly. Inline assembly also avoids sanitizers. It doesn't look as suspicious as attributes that disable sanitizers so people are happier (as long as they don't realize this). > Similarly to accessing (addr + (1 << n)) for 48 < n < 64: some > hardware allows this, but it's an ISO C violation nevertheless. That wouldn't be a good idea in assembly code either. -- Lasse Collin
Re: valgrind vs. sanitizers
On 2025-01-18 Paul Eggert wrote: > Does the unaligned read trick work even with CheriBSD's memory-safe > model? That is an edge case that might need an ifdef or something. I'm not familiar with CheriBSD but the trick never crosses a cache line boundary (or page boundary). So the memory-safe model has to be really strict to not allow it. > The aligned read trick reminds me a bit of the "adding 0 to a null > pointer gives you a null pointer" trick. That also is a violation of > ISO C but it works everywhere (even on CheriBSD) and Gnulib assumes > it in places. Oh that... https://github.com/tukaani-project/xz/commit/30e95bb44c36ae26b2ab12a94343b215fec285e7 I heard the standardization news last year. It was a nice surprise. -- Lasse Collin
Re: [PATCH] opendir, readdir, rewinddir: Fix issues on native Windows
On 2025-01-12 Paul Eggert wrote: > On 2025-01-12 12:26, Lasse Collin wrote: > > The patch makes readdir() detect that lossless conversion isn't > > possible and inform the application with EOVERFLOW > > Wouldn't EILSEQ be more appropriate? That's what "open" is supposed > to do with names that the file system doesn't support. I pondered it before sending the patch. POSIX.1-2024 readdir() [1]: [EOVERFLOW] One of the values in the structure to be returned cannot be represented correctly. Since the UTF-16 encoded string cannot be correctly converted to a more limited character set, clearly it's about a value that "cannot be represented correctly". Error message from strerror(EOVERFLOW) is confusing in this case though. strerror(EILSEQ) feels more logical but it's not explicitly listed for readdir() in POSIX. > Also, the POSIX spec suggests that readdir should return a null > pointer right away with errno set, rather than wait for the end of > the directory. A subsequent readdir resumes traversal of the > directory, even after such an error. Doing it this nicer way should > avoid the need for the new label and goto, and it would also let the > caller count how many bad entries the directory has. Returning an error immediately makes the code slightly simpler. I wonder how many apps continue after any error though. The example in the POSIX readdir() page terminates the search after any error but it's a simplified example. In GNU coreutils, src/ls.c, print_dir() has a loop that calls readdir().[2] It handles two errno values specially: - ENOENT is treated the same as successfully reaching the end of the directory. - EOVERFLOW results in an error message but directory reading is continued still. All other errors make print_dir() stop reading the directory. The behavior of ls seems reasonable in context of what errno values are listed for readdir() in POSIX. If readdir() returns the more logical sounding EILSEQ, it means that GNU ls won't attempt to list the remaining directory entries. Thus, I think using EOVERFLOW is better than EILSEQ when character set conversion cannot be done correctly. I will change the code to return EOVERFLOW immediately instead of delaying it. [1] https://pubs.opengroup.org/onlinepubs/9799919799.2024edition/functions/readdir_r.html [2] https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/ls.c#n3086 -- Lasse Collin
Re: [PATCH] opendir, readdir, rewinddir: Fix issues on native Windows
On 2025-01-12 Bruno Haible wrote: > As I understand it, the patch modifies the conversion > wchar_t[] (UTF-16) --> char[] > used for file names, and it does so only for opendir+readdir. The way I see it, the patch adds error detection to the conversion. Names that pass the check are returned just like they were before. If a file name contains the infinity symbol, let's say "∞.txt", best-fit mapping to Windows-1252 makes current readdir() return "8.txt". Passing "8.txt" to open() won't open "∞.txt". If the active code page is Windows-1252, one cannot use Gnulib's open module to access "∞.txt". The patch makes readdir() detect that lossless conversion isn't possible and inform the application with EOVERFLOW instead of returning a made-up name that might even be dangerous. I don't see how this could affect open(), chdir(), etc. negatively. If you suspect that I'm missing something, please try to explain again. Thanks! -- Lasse Collin
Re: [PATCH] opendir, readdir, rewinddir: Fix issues on native Windows
On 2025-01-13 Paul Eggert wrote: > On 2025-01-13 00:05, Lasse Collin wrote: > > > I pondered it before sending the patch. POSIX.1-2024 readdir() [1]: > > > > [EOVERFLOW] > > One of the values in the structure to be returned cannot be > > represented correctly. > > Yes, but the list of error numbers in the readdir ERRORS section is > not exhaustive. For example, readdir can plausibly fail with EIO even > though EIO is not in the list. See > <https://pubs.opengroup.org/onlinepubs/9799919799/functions/V2_chap02.html#tag_16_03>, > > which says "Implementations may support additional errors not > included in this list, may generate errors included in this list > under circumstances other than those described here, or may contain > extensions or limitations that prevent some errors from occurring." Thanks! I'm aware that additional error numbers are common. However, portable programs might not know what those numbers mean on a specific operating system, thus programs need to interpret the numbers conservatively. Most of the time this works well: an error message is displayed and the operation doesn't continue. If a program wants to continue calling readdir() after an error, it has to know the errno values with which continuation works. POSIX documents EBADF and ENOENT for readdir(), and their description doesn't give an impression that further readdir() calls might succeed. Attempting to continue after EBADF or ENOENT (or EIO) could result in an infinite loop. The description of EOVERFLOW is less scary. (It would still be useful if it explicitly said that one can continue after EOVERFLOW, especially because the example code doesn't do so.) My peek at GNU ls affirmed my thought that EOVERFLOW is special in portability context. Many apps don't continue after EOVERFLOW but I guess even fewer do so after EILSEQ. There is also: Implementations shall not generate a different error number from one required by this volume of POSIX.1-2024 for an error condition described in this volume of POSIX.1-2024, but may generate additional errors unless explicitly disallowed for a particular function. Pedantic interpretation might be that EILSEQ isn't allowed in this situation because EOVERFLOW already describes the error condition. In practice EILSEQ is fine if the diagnostics are delayed until the end of the directory. Applications that know the implementation-specific detail can then even count the number of conversion errors. > EILSEQ is more-appropriate than EOVERFLOW here, so I'd use it. [...] > PS. As you mention, it's fine (and indeed a good idea) to delay the > EILSEQ error to the end, as too much code mistakenly treats any null > return from readdir as EOF. Thanks! I agree now. That is, behave like v2-counting.patch but with EILSEQ instead of EOVERFLOW. > > If readdir() returns the more logical sounding EILSEQ, it means that > > GNU ls won't attempt to list the remaining directory entries. > > Oh, that's OK, we'd just change GNU ls. This would make for a better > 'ls' for the user, as the diagnostics would be more informative. That helps 'ls'. Other apps on Windows may trip on EILSEQ (and probably on EOVERFLOW too) if accessing a directory with even one problematic name. In some cases it might be a better user experience to list a few non-existing lookalike names in GUI. Perhaps this was the compatibility concern in the other branch of this thread. The behavior needs to be configurable. > > In GNU coreutils, src/ls.c, print_dir() has a loop that calls > > readdir().[2] It handles two errno values specially: > > > > - ENOENT is treated the same as successfully reaching the end of > > the directory. > > > > - EOVERFLOW results in an error message but directory reading is > > continued still. > > The EOVERFLOW treatment is buggy because errno might have changed > since readdir was called Those are so easy to miss. :-/ I had fixed a similar error only a few days ago but I missed it in ls.c even though I was specifically looking at its errno handling. -- Lasse Collin
Re: [PATCH] opendir, readdir, rewinddir: Fix issues on native Windows
On 2025-01-13 Lasse Collin wrote: > On 2025-01-12 Paul Eggert wrote: > > Also, the POSIX spec suggests that readdir should return a null > > pointer right away with errno set, rather than wait for the end of > > the directory. A subsequent readdir resumes traversal of the > > directory, even after such an error. Doing it this nicer way should > > avoid the need for the new label and goto, and it would also let the > > caller count how many bad entries the directory has. > > Returning an error immediately makes the code slightly simpler. I > wonder how many apps continue after any error though. POSIX allows readdir() to buffer the input, thus I guess reordering the failures until the end isn't strictly prohibited. The motivation was to be friendlier to users of programs that don't handle EOVERFLOW specially/correctly. With reordering they would see all other files. At this point I have no opinion if it is a good or bad idea. I attached two versions: - v2-simple.patch uses EOVERFLOW immediately as was discussed earlier. - v2-counting.patch counts the number of problematic file names. At the end of the directory, EOVERFLOW is set as many times as there were problematic file names. This way callers can count the number of problems. I fixed a few a few bad bugs that were in the first patch: - The return value from GetFullPathNameW was mishandled, which could result in non-working mask string. - dirp->status was set to EOVERFLOW. It made readdir() set it infinitely at the end of the directory. - "return NULL;" was missing from one error path. Thank you for the feedback! I learned a few things once again. Similar issues in MinGW-w64's implementation may be fixed at some point, and the comments here might a mistake or two. So if it turns out that my patches cannot be merged to Gnulib for now, this discussion has still been valuable. Thanks! -- Lasse Collin From ee205ebf07c9546df14f5d514675305de652e2cc Mon Sep 17 00:00:00 2001 From: Lasse Collin Date: Mon, 13 Jan 2025 17:31:13 +0200 Subject: [PATCH] opendir, readdir, rewinddir: Fix issues on native Windows (1) With legacy code pages (not UTF-8), certain non-ASCII Unicode characters in file names are converted in a lossy way ("best-fit mapping"). For example, if active code page is Windows-1252 and there is a file name with U+2044 in it, it becomes ASCII '/'. Thus, malicious file names result in directory traversal. Fix by reading the file names as wide chars and converting them losslessly to active code page. If lossless conversion isn't possible, skip the name. At the end of the directory, return NULL with errno set to EOVERFLOW as many times as there were skipped files. While EILSEQ could result in a better error message from strerror(), POSIX readdir() docs say that EOVERFLOW is the correct choice. Some apps like "ls" from GNU coreutils correctly continue reading the directory after EOVERFLOW but many apps don't. Delaying the error allows such apps to see all other file names still. The downside is delayed error detection in apps that will fail their task on any error anyway. The check for lossless conversion is useful even with the UTF-8 code page because unpaired surrogates (invalid UTF-16) are converted to U+FFFD. It seems likely that file names with unpaired surrogates cannot be accessed via multibyte APIs. For more information, see the blog post by Orange Tsai and Splitline from the DEVCORE Research Team.[*] (2) With UTF-8 code page, file names can be longer than MAX_PATH bytes. For example, a long name may fit in MAX_PATH when encoded as Windows-1252 and then readdir() returns it correctly. If the same name exceeds MAX_PATH as UTF-8, readdir() used to fail. Such file names can still be opened etc. as long as their wide char encoding doesn't exceed MAX_PATH. (3) If application is declared as "long path aware" in an application manifest, then file names can be longer than MAX_PATH even in the native UTF-16 encoding (up to about 32767 wide chars). Fix by dynamically allocating the buffer for the absolute file name. [*] https://devco.re/blog/2025/01/09/worstfit-unveiling-hidden-transformers-in-windows-ansi/ --- lib/dirent-private.h | 33 +++--- lib/opendir.c| 141 +++ lib/readdir.c| 45 ++ lib/rewinddir.c | 7 +-- 4 files changed, 150 insertions(+), 76 deletions(-) diff --git a/lib/dirent-private.h b/lib/dirent-private.h index 22d847106a..42ddb0d0f1 100644 --- a/lib/dirent-private.h +++ b/lib/dirent-private.h @@ -37,10 +37,7 @@ struct gl_directory # define WIN32_LEAN_AND_MEAN # include - -/* Don't assume that UNICODE is not defined. */ -# undef WIN32_FIND_DAT
Re: [PATCH] opendir, readdir, rewinddir: Fix issues on native Windows
On 2025-01-13 Bruno Haible via Gnulib discussion list wrote: > Without your patch, an application has the guarantee that when one of > the file system APIs has returned a file name, the other file system > APIs will be able to handle it (ignoring ENAMETOOLONG cases, which > rarely occur in practice). Your patch breaks this guarantee. The first version of the patch had bad bugs. I apologize if those bugs gave you the above impression. My patch to readdir() makes it use EOVERFLOW with names that cannot be correctly converted. The other names are returned identically to the old code, except that the patched code is also able to return longer names instead of failing with EIO. Some modules like "open" handle those long names without changes (works fine in gzip). But since the long name support is the only difference in names that are actually returned, I suppose this is where the guarantee you mentioned is broken. > Therefore my request to do this change of conversion on *all* file > system APIs, or on none. Understood. At least the stat module could be fixed to give it long name support. Other modules could be verified for long name support too. Perhaps EOVERFLOW handling is also worth checking. This is too much for me in the foreseeable future, so if it means that the current patch isn't usable alone, then that's how it is. The feedback I have received has still been useful. -- Lasse Collin
Re: supporting in the UTF-8 environment on native Windows
On 2024-12-23 Bruno Haible wrote: > Lasse Collin reported in > <https://lists.gnu.org/archive/html/bug-gettext/2024-12/msg00111.html> > that the setlocale() override from GNU libintl does not support the > UTF-8 environment of native Windows correctly. That setlocale() > override is based on the setlocale() override from gnulib. So let me > add that support here. Thanks! I looked at the commits but I didn't test anything yet. (1) In 9f7ff4f423cd ("localename-unsafe: Support the UTF-8 environment on native Windows."), the N(name) macro is used with strings that include @modifier. For example, N("az_AZ@cyrillic") can expand to "az...@cyrillic.utf-8". Similarly in 00211fc69c92 ("setlocale: Support the UTF-8 environment on native Windows."), ".65001" is appended after the @modifier. However, the typical order would be az_AZ.UTF-8@cyrillic. I suppose you had a reason to use .65001 instead of .UTF-8 or .utf8. I expect identical behavior from those. The MS setlocale() docs use variants of .UTF8: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale?view=msvc-170#utf-8-support (2) In 2f4391fde862 ("setlocale tests: Test in the UTF-8 environment on native Windows."), the condition (strlen (name) > 6 && strcmp (name + strlen (name) - 6, ".UTF-8") == 0) matches the two long strings below it too, making those two extra strcmp calls redundant. (3) When a manifest is added via a resource file, a possible default manifest from the toolchain is replaced; they aren't merged. For example, on MSYS2, the mingw-w64-ucrt-x86_64-gcc package depends on mingw-w64-ucrt-x86_64-windows-default-manifest. The manifest comes from Cygwin: https://sourceware.org/git/?p=cygwin-apps/windows-default-manifest.git;a=blob;f=default-manifest.rc Omitting the section makes the application run with Vista as the Operating System Context. Omitting the section makes Windows treat the application as not UAC compliant, that is, a pre-Vista app that needs compatibility tricks. Probably these don't matter with the current tests. I suggest changing it still because it's still an odd combination to have UTF-8 without marking the app compatible with recent Windows versions. (4) The output from windres goes to a file with the .res suffix but the format is overridden with --output-format=coff. This looks weird because windres defaults to --output-format=res for files that use the .res suffix. For coff, the .o suffix would be logical, and --output-format option wouldn't be needed. See the paragraphs near the beginning of the info node (binutils)windres. A simple command should be enough: windres input.rc output.o > In fact, there are apparently two variants of this mode: > - the legacy Windows settings variant: when you haven't ever > (or recently?) changed the system default locale of Windows 10, > - the modern Windows settings variant: when you have changed > the system default locale of Windows 10. > With the legacy Windows settings, the setlocale() function produces > locale names such as "English_United States.65001" or > "English_United States.utf8". With the modern Windows settings, it > produces "en_US.UTF-8" instead. (This is with both mingw and MSVC, > according to my testing.) I don't know enough about Windows to comment much. I only tested on one Win10 system which returned the long spellings. If native setlocale(LC_ALL, "") can indeed result in "en_US" or "en_US.UTF-8", I wonder if it can result in "az-Cyrl_AZ.UTF-8" too. I don't see how Gnulib or Gettext would map such a locale name to az_AZ.UTF-8@cyrillic. (az_AZ@cyrillic was the first one with @ in localename-unsafe.c, thus I looked at that in MS docs too.) The codeset seems to be a part of the language name: https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-lcid/a9eac961-e77d-41a6-90a5-ce1a8b0cdb9c Locale format doesn't use @modifier: https://learn.microsoft.com/en-us/cpp/c-runtime-library/locale-names-languages-and-country-region-strings?view=msvc-170 -- Lasse Collin
Re: supporting in the UTF-8 environment on native Windows
On 2024-12-24 Bruno Haible wrote: > Lasse Collin wrote: > > (1) > > In 9f7ff4f423cd ("localename-unsafe: Support the UTF-8 environment > > on native Windows."), the N(name) macro is used with strings that > > include @modifier. For example, N("az_AZ@cyrillic") can expand to > > "az...@cyrillic.utf-8". Similarly in 00211fc69c92 ("setlocale: > > Support the UTF-8 environment on native Windows."), ".65001" is > > appended after the @modifier. However, the typical order would be > > az_AZ.UTF-8@cyrillic. > > Good point. Fixed through the patch below. Thanks. Isn't there a similar issue in setlocale.c after the commit 00211fc69c92? /* llCC_buf now contains language[_territory][@modifier] */ if (strcmp (llCC_buf, locale) != 0) { if (is_utf8) { char buf[64+6]; strcpy (buf, llCC_buf); strcat (buf, ".65001"); result = setlocale (category, buf); } else result = setlocale (category, llCC_buf); > > I suppose you had a reason to use .65001 instead of .UTF-8 or .utf8. > > I expect identical behavior from those. > > Yes: There was some period (ca. 5 years ago) when Windows supported > the .65001 suffix but not the .utf8 suffix. The ability to use .utf8 > to denote code page 65001 was added a bit later. [...] > Yeah, the docs don't tell you everything (as usual with Microsoft). Makes sense. Thanks for the explanation. > > (2) > > In 2f4391fde862 ("setlocale tests: Test in the UTF-8 environment on > > native Windows."), the condition > > > > (strlen (name) > 6 && strcmp (name + strlen (name) - 6, > > ".UTF-8") == 0) > > > > matches the two long strings below it too, making those two extra > > strcmp calls redundant. > > Correct. Still, it's useful to have a writedown of what the output in > the legacy mode is. (Unit test code is not optimized for performance.) I agree it's a fine way to document things. It could be useful to have a comment saying that those conditions are there for documentation purposes. > > (3) > > When a manifest is added via a resource file, a possible default > > manifest from the toolchain is replaced; they aren't merged. For > > example, on MSYS2, the mingw-w64-ucrt-x86_64-gcc package depends on > > mingw-w64-ucrt-x86_64-windows-default-manifest. The manifest comes > > from Cygwin: > > > > > > https://sourceware.org/git/?p=cygwin-apps/windows-default-manifest.git;a=blob;f=default-manifest.rc > > > > Omitting the section makes the application run with > > Vista as the Operating System Context. Omitting the > > section makes Windows treat the application as not UAC compliant, > > that is, a pre-Vista app that needs compatibility tricks. > > > > Probably these don't matter with the current tests. I suggest > > changing it still because it's still an odd combination to have > > UTF-8 without marking the app compatible with recent Windows > > versions. > > I picked the smallest XML file that has the desired effect. > > Also, I don't like enumerating Windows versions in this way because > it's not future-proof: Since Windows 11, 12, 13, etc. are not listed, > it would only be matter of time until the test breaks in a new Windows > version. It's only a test program which doesn't affect the binaries being installed. As long as it works for testing I guess it's fine enough. One clarification though: If one omits , Win10 behaves as if you had put only Vista compatibility in the manifest: Task Manager shows Vista as the Operating System Context for the program. That is, one is sort of requesting "emulation" of Vista in that case. There is no separate entry for Windows 11 because Windows 10 and 11 use the same entry. What will happen with Win12 is unknown, of course. > > (4) > > The output from windres goes to a file with the .res suffix but the > > format is overridden with --output-format=coff. This looks weird > > because windres defaults to --output-format=res for files that use > > the .res suffix. For coff, the .o suffix would be logical, and > > --output-format option wouldn't be needed. > > Maybe it looks weird, but IIRC it's the best way that I found that > does not run into 32-bit / 64-bit problems. For example, if 'windres' > is 64-bit but I'm compiling with a 32-bit targetting gcc. (Some > toolchain installations have a -windres program, but some > other toolchain installations have only one windres program for all > targets.) OK. It was only the .res suffix that looked weird because windres already recognizes .res for a different file format than COFF. -- Lasse Collin
Re: [PATCH] opendir, readdir, rewinddir: Fix issues on native Windows
On 2025-01-14 Paul Eggert wrote: > On 2025-01-14 02:43, Lasse Collin wrote: > > Pedantic interpretation might be that EILSEQ isn't allowed in this > > situation because EOVERFLOW already describes the error condition. > > No, because EOVERFLOW is intended for things like inode numbers don't > fit. It is not intended for things like invalid byte sequences. I too associate EOVERFLOW with numbers, not strings. Your interpretation is the most reasonable. Don't take the following too seriously (feel free to even ignore). :-) I noticed that a discussion similar to ours happened around glibc's readdir_r in 2013. While it's a deprecated function, it's errno usage is very similar to readdir. The discussion was about - ENAMETOOLONG vs. EOVERFLOW (vs. ERANGE) and which is allowed by POSIX, - delaying the error until the end of the directory, and - if one can continue reading the directory after some errors or if it can result in an infinite loop. It's not a long thread: https://sourceware.org/legacy-ml/libc-alpha/2013-05/msg00445.html From the current glibc manual section '(libc)Reading/Closing Directory': To distinguish between an end-of-directory condition or an error, you must set ‘errno’ to zero before calling ‘readdir’. To avoid entering an infinite loop, you should stop reading from the directory after the first error. ... • On some systems, ‘readdir_r’ cannot read directory entries with very long names. If such a name is encountered, the GNU C Library implementation of ‘readdir_r’ returns with an error code of ‘ENAMETOOLONG’ after the final directory entry has been read. Those paragraphs were added as part of the readdir_r fix: https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=91ce40854d0b7f865cf5024ef95a8026b76096f3 The current readdir_r code: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/readdir_r.c;hb=2642002380aafb71a1d3b569b6d7ebeab3284816 It doesn't count the number of problematic files. If my reading is correct, at the end of the directory the code will keep returning NULL with errno = ENAMETOOLONG. If so, one shouldn't continue after ENAMETOOLONG with this implementation. It seems that continuing after an error needs platform-specific knowledge in practice. (Maybe EOVERFLOW is an exception.) The behavior isn't needed in readdir in glibc. It's in readdir_r only which one hopefully won't use anyway. > > Other apps on Windows may trip on EILSEQ > > We can cross those bridges when we come to them. > > There is no perfect solution here. But for 'ls', the only program > where we know this matters, EILSEQ would be better than EOVERFLOW. Now I think that it's good enough if readdir on Windows delays EILSEQ (and ENAMETOOLONG) until the end of the directory and returns that error once. That is, if readdir is called again, it returns NULL without modifying errno to prevent infinite loops. It means that apps cannot count the errors and only one error code is remembered if two types of errors occur, but it's simpler code. Most apps stop after the first error anyway. The upside is that then there is no pressure to make GNU ls to continue reading the directory after EILSEQ. ;-) -- Lasse Collin
Re: crc-x86_64: undefined behaviour
I'm fairly new on the list and haven't read earlier discussion about the crc-x86_64 module, so I hope this doesn't come out wrong. I rewrote the CRC CLMUL code[1] in XZ Utils six months ago, and I'm commenting based on that experience. On 2025-01-16 Paul Eggert wrote: > That being said, it does appear there are a lot of unaligned word > accesses nearby, which can't be good for performance even if the > hardware doesn't trap. With the code variants I tried last year, the penalty from unaligned reads was very small. The buffer would need to be large until the extra cost of aligning at the beginning would pay off. This was with 128-bit registers. It might be different with wider registers. On 2025-01-17 Sam Russell wrote: > Most performant would probably be an alignment check at the start and > then manually processing the first N bytes. Another option could be > to simply cast data to unsigned char* and then we can guarantee the > compiler doesn't hit alignment issues? Try using __m128i_u* instead of __m128i*. The former has aligned(1) attribute. It's not available in old GCC versions though. I kept the input as uint8_t* and cast it for the loads in a wrapper: static inline __m128i my_load128(const uint8_t *p) { return _mm_loadu_si128((const __m128i *)p); } I suppose __m128i_u* would be more correct above too but sanitizers don't complain about it. Perhaps sanitizers treat the intrinsics differently than memcpy. [1] https://github.com/tukaani-project/xz/blob/master/src/liblzma/check/crc_x86_clmul.h -- Lasse Collin
Re: [PATCH] opendir, readdir, rewinddir: Fix issues on native Windows
On 2025-01-14 Bruno Haible wrote: > Maybe we should spread the word about the UTF-8 mode, that you made > us aware of recently [1]. Yes, UTF-8 compatibility would be good nowadays. If a user has set it globally in Windows settings (a beta feature), then programs use UTF-8 even when they don't have it set in an application manifest. (1) While UTF-8 can result in crazy-long file name *components* (over 259 bytes without \0), they aren't common. If Gnulib's readdir and stat have the limit, it might not matter too much in practice. "open" doesn't have such a limit. My patch had 255 * MB_LEN_MAX + 1, and MB_LEN_MAX is 5. Now I think it's execessive even in a theoretical case. I'm not sure but I suspect that UTF-8 is the only *locale* code page that has longer than two-byte characters. So 255 * 3 + 1 should be enough which is still a big number. (2) Making programs "long path aware" could matter more. The limit of 259 wide chars (excluding the \0) is enforced on the absolute file name even if program uses relative file names. The constant MAX_PATH appears in very few files in Gnulib but opendir.c is one of them. The \\?\ prefix bypasses the MAX_PATH limit but long path aware programs don't need that trick. I don't plan to continue my side quest to Windows issues further. :-) Thanks for the comments! -- Lasse Collin
Re: MinGW-w64 fenv changes.
On 2025-06-13 Bruno Haible wrote: > Lasse Collin wrote: > > I mentioned this thread on #mingw-w64 on OFTC. I was asked if I > > could ask the Gnulib developers to report the bugs with minimal > > reproducible examples to the mingw-w64-public mailing list[1]. > > I usually don't spend much time reporting bugs for packages that are > not part of the GNU system, or at least licensed under GPL / LGPL. > Anyone is free to use Gnulib's tests as a self-tests for their > platform, like for example the glibc people do [1]. > > But since you are asking in a friendly way, let me try to help. > > I'm not subscribing to the mingw-w64 mailing list; instead, you can > find this mail archived under [2]. Thank you very much! I have forwarded the information to mingw-w64 developers: https://sourceforge.net/p/mingw-w64/mailman/mingw-w64-public/thread/20250614194520.67f75e35.lasse.collin%40tukaani.org/ -- Lasse Collin
Re: MinGW-w64 fenv changes.
On 2025-06-11 Bruno Haible via Gnulib discussion list wrote: > In summary, the mingw approach that considers both the 387 unit and > the SSE floating-point unit basically works. But, as usual with > mingw, it comes with bugs: > - The older mingw bugs in this area are still present. > - They added two new bugs. I mentioned this thread on #mingw-w64 on OFTC. I was asked if I could ask the Gnulib developers to report the bugs with minimal reproducible examples to the mingw-w64-public mailing list[1]. I'm not able to help with the bugs themselves; I'm only relaying the message. [1] https://sourceforge.net/projects/mingw-w64/lists/mingw-w64-public I think posting requires one to be subscribed. Attachments should have .txt suffix because the list filters out many file types. -- Lasse Collin