Re: license request for mgetgroups

2013-05-22 Thread Lasse Collin
On 2013-05-22 Eric Blake wrote:
> Since getgrouplist is implemented in glibc as LGPLv2+, is there any
> objection to relicensing the following modules as LGPLv2+?
[...]
> getgroups [LGPLv3+] => *Jim, *Eric, Bruno, Paul
> getugroups [GPLv3+] => *Jim, Eric, Paul, Bruno, Lasse
> mgetgroups [GPLv3+] => *Jim, *Eric, Bruno
> xalloc-oversized [GPLv3+] => [*all] Eric, Paul, Jim, (Bruno)

Relicensing is fine with me. (I barely touched the code.)

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: argp --help formatting bug with non-ASCII characters

2024-04-13 Thread Lasse Collin
On 2024-04-12 Bruno Haible wrote:
> How to reproduce:
> 1. Build GNU tar (1.33 or 1.35, for example).
> 2. Install it.
> 3. You can see misindentation:
> $ LC_ALL=de_DE.UTF-8 src/tar --help
> 
> ...
>  abzubilden
>   --mode=ÄNDERUNGEN den (symbolischen) Modus ÄNDERUNGEN für
>  hinzugefügte Dateien erzwingen
[...]
> In other words, it should use the Gnulib module 'mbswidth'.

This is one of the issues. The other one is with the wrapping of the
description text. It counts 1 byte == 1 column but in languages which
mostly use non-ASCII chars it can be mostly 2 bytes == 1 column. Then
the wrapping occurs far too early, resulting in narrow output.

-- 
Lasse Collin



Re: supporting in the UTF-8 environment on native Windows

2024-12-26 Thread Lasse Collin
I commented on bug-gettext that these fixes work for me when
environment variables LANG, LC_*, or LANGUAGE aren't set. Thanks again!

The commit 00211fc69c92 ("setlocale: Support the UTF-8 environment on
native Windows.") introduces Windows support for C.UTF-8 in
setlocale.c. Emulating C.UTF-8 with "English_United States.65001" looks
like a great idea with LC_CTYPE. However, with some other locale
categories it's problematic. I think "C.UTF-8" should map to a mixed
locale.

I noticed the informative link[1] in the Gettext commit 3873b7f1c777
("intl: Treat C.UTF-8 locale like C locale."). The wiki page makes
me suspect that LC_ALL=C.UTF-8 should set LC_CTYPE to
"English_United States.65001" while setting all other categories to
"C". This seems pretty clear for LC_COLLATE and LC_NUMERIC but I'm not
sure about the other categories.

[1] https://sourceware.org/glibc/wiki/Proposals/C.UTF-8

LC_COLLATE=C.UTF-8 should sort in Unicode codepoint order. In UTF-8 this
is the same as byte order, thus "C" does the right thing (at least for
valid UTF-8 inputs). In contrast, "English_United States.65001" sorts by
English rules, for example, putting "ä" before "b". Test with
strcoll("b", "ä").

LC_NUMERIC can make a difference if thousand separators are requested.
Thousand separators in printf() are a POSIX feature that UCRT's printf()
doesn't support. However, MinGW-w64's replacement via "#define
__USE_MINGW_ANSI_STDIO 1" provides an implementation that does support
thousand separators.

-- 
Lasse Collin



[PATCH] opendir, readdir, rewinddir: Fix issues on native Windows

2025-01-12 Thread Lasse Collin
 ac_cv_func_readdir=no

gzip's tailor.h has

#  define MAX_PATH_LEN  260

for native Windows which has to be removed or increased significantly
to test long UTF-8 names. gzip.c defaults to 1024 when tailor.h doesn't
define it. The longest test file name also requires that
utf8-longpath.rc is built and linked into gzip.exe (and that requires
changing the Windows registry setting and rebooting). The shorter name
works without longPathAware (utf8.rc) in the manifest too if the
leading file name components are short enough.

gzip uses Gnulib's savedir module which doesn't return any file names
if there is an error. So in that sense gzip isn't the best test program
because one problematic file name will make gzip skip all files in a
directory; the delaying of EOVERFLOW isn't useful when using savedir.

* * *

Very minor things:

In Gnulib's lib/pathmax.h, there is a comment about PATH_MAX from 2011:

Undefine the original value, because mingw's  gets it
wrong.

The code in pathmax.h does no harm but the issue was fixed in MinGW-w64
in 2010. I haven't looked at MinGW so I don't know if the issue is
still there.

Gnulib uses "mingw" in docs. Does it refer to MinGW or MinGW-w64 or
both? In configure triplets, MinGW uses "pc" as the vendor and
MinGW-w64 uses "w64".

-- 
Lasse Collin


create-test-files.tar.xz
Description: application/xz
>From 69c54c1abbce9713157d2d47a1cb25aeb0963540 Mon Sep 17 00:00:00 2001
From: Lasse Collin 
Date: Sun, 12 Jan 2025 14:31:20 +0200
Subject: [PATCH] opendir, readdir, rewinddir: Fix issues on native Windows

(1) With legacy code pages (not UTF-8), certain non-ASCII
Unicode characters in file names are converted in a lossy
way ("best-fit mapping"). For example, if active code page
is Windows-1252 and there is a file name with U+2044 in it,
it becomes ASCII '/'. Thus, malicious file names result in
directory traversal.

Fix by reading the file names as wide chars and converting
them losslessly to active code page. If lossless conversion
isn't possible, skip the name. At the end of the directory,
if anything was skipped, set errno to EOVERFLOW. It should
be the correct choice according to POSIX readdir() docs.

The check for lossless conversion is useful even with the UTF-8
code page because unpaired surrogates (invalid UTF-16) are
converted to U+FFFD. It seems likely that file names with
unpaired surrogates cannot be accessed via multibyte APIs.

For more information, see the blog post by Orange Tsai and
Splitline from the DEVCORE Research Team.[*]

(2) With UTF-8 code page, file names can be longer than MAX_PATH bytes.
For example, a long name may fit in MAX_PATH when encoded as
Windows-1252 and then readdir() returns it correctly. If the same
name exceeds MAX_PATH as UTF-8, readdir() used to fail. Such
file names can still be opened etc. as long as their wide char
encoding doesn't exceed MAX_PATH.

(3) If application is declared as "long path aware" in an application
manifest, then file names can be longer than MAX_PATH even in the
native UTF-16 encoding (up to about 32767 wide chars). Fix by
dynamically allocating the buffer for the absolute file name.

[*] https://devco.re/blog/2025/01/09/worstfit-unveiling-hidden-transformers-in-windows-ansi/
---
 lib/dirent-private.h |  33 +++---
 lib/opendir.c| 140 +++
 lib/readdir.c|  37 
 lib/rewinddir.c  |   7 +--
 4 files changed, 141 insertions(+), 76 deletions(-)

diff --git a/lib/dirent-private.h b/lib/dirent-private.h
index 22d847106a..cc5e81300b 100644
--- a/lib/dirent-private.h
+++ b/lib/dirent-private.h
@@ -37,10 +37,7 @@ struct gl_directory
 
 # define WIN32_LEAN_AND_MEAN
 # include 
-
-/* Don't assume that UNICODE is not defined.  */
-# undef WIN32_FIND_DATA
-# define WIN32_FIND_DATA WIN32_FIND_DATAA
+# include 
 
 struct gl_directory
 {
@@ -53,13 +50,35 @@ struct gl_directory
  0 means the entry needs to be filled using FindNextFile.
  A positive value is an error code.  */
   int status;
+  /* If one or more file names cannot be encoded in the active code page,
+ this is set to true. If end of the directory is otherwise reached
+ successfully, readdir() sets errno to EOVERFLOW.  */
+  bool has_problematic_chars;
   /* Handle, reading the directory, at current position.  */
   HANDLE current;
   /* Found directory entry.  */
-  WIN32_FIND_DATA entry;
-  /* Argument to pass to FindFirstFile.  It consists of the absolutized
+  WIN32_FIND_DATAW entry;
+  /* readdir() returns a pointer to this struct. d_name is entry.cFileName
+ as multibyte string. A file name is at most 255 UTF-16 code units
+ (not code points) followed by the terminating null character. The
+ length as multibyte string depends on the a

Re: crc-x86_64: undefined behaviour

2025-01-18 Thread Lasse Collin
On 2025-01-17 Simon Josefsson wrote:
> Lasse Collin  writes:
> 
> > [1] 
> > https://github.com/tukaani-project/xz/blob/master/src/liblzma/check/crc_x86_clmul.h
> >   
> 
> Oh.  Do you see any chance to merge your code into gnulib?  If there
> is any feature in your implementation that we don't have.  And then
> use it in xz?

My code hasn't been included in any xz release yet. I don't want to
backport it to 5.6.x and 5.7.1alpha isn't out yet. So the code hasn't
been tested by many people yet.

A CRC function tries to produce the correct result as fast as possible.
There isn't much else to compare feature wise. It is fine to me if the
code is adapted into Gnulib. However, a significant amount of effort has
been put into the version that is in Gnulib already, so I don't want
make any recommendations, especially since my version hasn't been widely
tested yet. (I was afraid that if I link to my code, it's easily seen as
if I'm advertising a replacement. Sorry.)

> It seems a bit silly to have a lot of different
> implementations around for this fairly low-level simple code, and as
> we saw in this thread, maintaining it does show up bugs...

Yes, it's a bit silly and duplicate effort. I suppose there are
multiple reasons why many implementations exists, for example, avoiding
extra dependencies, licensing questions, and just having fun. There has
to be a reason why a CRC32 function is wanted in Gnulib instead of
adding, for example, zlib as a dependency to packages that need CRC32.

Even GNU has more than one CLMUL CRC implementation. See Libgcrypt:


https://git.gnupg.org/cgi-bin/gitweb.cgi?p=libgcrypt.git;a=blob;f=cipher/crc-intel-pclmul.c

I tested Gnulib's lib/crc-x86_64-pclmul.c (v1.0-1387-g5dd298feb0) a
bit. I changed the target attribute to plain "pclmul" to make it work
on a processor that lacks AVX. With 0-3 byte inputs the result differs
from what I see from other implementations. I didn't investigate why it
happens. I attached a test program that tests alignment variations but
the problem can be seen without such a loop already.

About sanitizers: They can be annoying with SIMD code. If a function is
passed an unaligned buffer, it would be fine to round the address down
to an aligned value, do an aligned read, and ignore the out-of-bounds
bytes. One can do it in assembly because sanitizers don't see it. In
contrast to sanitizers, Valgrind is happy if the extra bytes are thrown
away.

-- 
Lasse Collin
/*
From Gnulib version, comment out #includes for , "crc-x86_64.h",
and "crc.h", and add . Possibly omit ",avx" from the attribute too.

gcc -O2 -Wall -Wextra crc32-test.c -DUSE_ZLIB -lz
gcc -O2 -Wall -Wextra crc32-test.c -DUSE_LIBLZMA -llzma
gcc -O2 -Wall -Wextra crc32-test.c crc-x86_64-pclmul-modified.c
*/

#include 
#include 
#include 

#if defined(USE_ZLIB)
#   include 
#   define CRC32(buf, size, crc) crc32(crc, buf, size)
#elif defined(USE_LIBLZMA)
#   include 
#   define CRC32(buf, size, crc) lzma_crc32(buf, size, crc)
#else
extern uint32_t crc32_update_no_xor_pclmul(
uint32_t crc, const void *buf, size_t len);
#   define CRC32(buf, size, crc) ~crc32_update_no_xor_pclmul(~(crc), buf, size)
#endif


#define END_SIZE 2000

static uint8_t r[END_SIZE + 16 * 17]; // Yes, it's a bit oversized.


static uint32_t
test(size_t size)
{
uint32_t crc = 0;

for (size_t k = 0; k < 16 * 17; k += 17)
crc = CRC32(r + k, size, crc);

return crc;
}


extern int
main(void)
{
uint32_t seed = 29;
for (size_t i = 0; i < sizeof(r); ++i) {
seed = seed * 1103515245 + 12345;
r[i] = (uint8_t)(seed >> 22);
}

for (size_t size = 0; size < END_SIZE; ++size)
printf("%-4zu  0x%08" PRIX32 "\n", size, test(size));

return 0;
}


Re: valgrind vs. sanitizers

2025-01-21 Thread Lasse Collin
On 2025-01-18 Paul Eggert wrote:
> On 2025-01-18 11:45, Lasse Collin wrote:
> > On 2025-01-18 Paul Eggert wrote:  
> >> Does the unaligned read trick work even with CheriBSD's memory-safe
> >> model? That is an edge case that might need an ifdef or something.
> >>  
> > 
> > I'm not familiar with CheriBSD but the trick never crosses a cache
> > line boundary (or page boundary). So the memory-safe model has to
> > be really strict to not allow it.  
> 
> Yes, it's really that strict.

That's really nice, especially if performance is still acceptable. :-)

> One way to fix bugs like this on CHERI is to use the alignment trick 
> only if the macro __CHERI_PURE_CAPABILITY__ is not defined. 

I will keep this in mind but the primarily lesson for me is that such
tricks are a worse idea than I had thought.

> Do you have access to cfarm ?

I don't. Thanks for the idea.

-- 
Lasse Collin



Re: crc-x86_64: undefined behaviour

2025-01-21 Thread Lasse Collin
On 2025-01-18 Sam Russell wrote:
> IIRC we test values from 0-16 bytes in the unit test and the pclmul
> implementation doesn't get hit for smaller values?

Correct. lib/crc.c has

if (pclmul_enabled && len >= 16)
  return crc32_update_no_xor_pclmul (crc, buf, len);

so the behavior with short inputs is irrelevant. I'm sorry for the
false alarm. Having

assert(len >= 16);

at the beginning of the pclmul function could have made it obvious but
that's not an excuse, I should have looked at crc.c too.

-- 
Lasse Collin



Re: valgrind vs. sanitizers

2025-01-18 Thread Lasse Collin
On 2025-01-18 Bruno Haible wrote:
> Valgrind was a tool without replacement, for many years, when
> sanitizers did not exist. Nowadays, however, I generally prefer
> testing with sanitizers than with valgrind because there are some
> bugs that ASAN finds and valgrind doesn't [1]. For example, when you
> have a struct with two adjacent arrays, ASAN can find buffer overruns
> of the first array, while valgrind can't. Just this week, sanitizers
> have found a real bug in coreutils [2].
> 
> Regarding your trick to do an aligned read on (addr & -alignment)
> instead of an unaligned read on (addr): I find it good that ASAN
> catches this, because this trick amounts to exploiting a coincidental
> property of current hardware.

I agree. One could mark the tricks so that sanitizers skip them. There
are attributes to do that but those can make people suspicious that one
is trying to prevent fuzzers from finding security issues. Writing
extra code to keep sanitizers happy is the alternative. Intrinsics
usage is already platform specific (in practice at least) so doing
platform specific tricks should be possible somehow because it can be
more convenient to write with intrinsics than assembly.

Inline assembly also avoids sanitizers. It doesn't look as suspicious
as attributes that disable sanitizers so people are happier (as long as
they don't realize this).

> Similarly to accessing (addr + (1 << n)) for 48 < n < 64: some
> hardware allows this, but it's an ISO C violation nevertheless.

That wouldn't be a good idea in assembly code either.

-- 
Lasse Collin



Re: valgrind vs. sanitizers

2025-01-18 Thread Lasse Collin
On 2025-01-18 Paul Eggert wrote:
> Does the unaligned read trick work even with CheriBSD's memory-safe 
> model? That is an edge case that might need an ifdef or something.

I'm not familiar with CheriBSD but the trick never crosses a cache line
boundary (or page boundary). So the memory-safe model has to be really
strict to not allow it.

> The aligned read trick reminds me a bit of the "adding 0 to a null 
> pointer gives you a null pointer" trick. That also is a violation of
> ISO C but it works everywhere (even on CheriBSD) and Gnulib assumes
> it in places.

Oh that...


https://github.com/tukaani-project/xz/commit/30e95bb44c36ae26b2ab12a94343b215fec285e7

I heard the standardization news last year. It was a nice surprise.

-- 
Lasse Collin



Re: [PATCH] opendir, readdir, rewinddir: Fix issues on native Windows

2025-01-13 Thread Lasse Collin
On 2025-01-12 Paul Eggert wrote:
> On 2025-01-12 12:26, Lasse Collin wrote:
> > The patch makes readdir() detect that lossless conversion isn't
> > possible and inform the application with EOVERFLOW  
> 
> Wouldn't EILSEQ be more appropriate? That's what "open" is supposed
> to do with names that the file system doesn't support.

I pondered it before sending the patch. POSIX.1-2024 readdir() [1]:

[EOVERFLOW]
One of the values in the structure to be returned cannot be
represented correctly.

Since the UTF-16 encoded string cannot be correctly converted to a more
limited character set, clearly it's about a value that "cannot be
represented correctly". Error message from strerror(EOVERFLOW) is
confusing in this case though. strerror(EILSEQ) feels more logical
but it's not explicitly listed for readdir() in POSIX.

> Also, the POSIX spec suggests that readdir should return a null
> pointer right away with errno set, rather than wait for the end of
> the directory. A subsequent readdir resumes traversal of the
> directory, even after such an error. Doing it this nicer way should
> avoid the need for the new label and goto, and it would also let the
> caller count how many bad entries the directory has.

Returning an error immediately makes the code slightly simpler. I
wonder how many apps continue after any error though. The example in
the POSIX readdir() page terminates the search after any error but it's
a simplified example.

In GNU coreutils, src/ls.c, print_dir() has a loop that calls
readdir().[2] It handles two errno values specially:

  - ENOENT is treated the same as successfully reaching the end of
the directory.

  - EOVERFLOW results in an error message but directory reading is
continued still.

All other errors make print_dir() stop reading the directory. The
behavior of ls seems reasonable in context of what errno values are
listed for readdir() in POSIX.

If readdir() returns the more logical sounding EILSEQ, it means that
GNU ls won't attempt to list the remaining directory entries. Thus, I
think using EOVERFLOW is better than EILSEQ when character set
conversion cannot be done correctly.

I will change the code to return EOVERFLOW immediately instead of
delaying it.

[1] 
https://pubs.opengroup.org/onlinepubs/9799919799.2024edition/functions/readdir_r.html

[2] https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/ls.c#n3086

-- 
Lasse Collin



Re: [PATCH] opendir, readdir, rewinddir: Fix issues on native Windows

2025-01-12 Thread Lasse Collin
On 2025-01-12 Bruno Haible wrote:
> As I understand it, the patch modifies the conversion
>   wchar_t[] (UTF-16)  -->  char[]
> used for file names, and it does so only for opendir+readdir.

The way I see it, the patch adds error detection to the conversion.
Names that pass the check are returned just like they were before.

If a file name contains the infinity symbol, let's say "∞.txt",
best-fit mapping to Windows-1252 makes current readdir() return "8.txt".
Passing "8.txt" to open() won't open "∞.txt". If the active code page is
Windows-1252, one cannot use Gnulib's open module to access "∞.txt".

The patch makes readdir() detect that lossless conversion isn't
possible and inform the application with EOVERFLOW instead of returning
a made-up name that might even be dangerous. I don't see how this could
affect open(), chdir(), etc. negatively.

If you suspect that I'm missing something, please try to explain again.
Thanks!

-- 
Lasse Collin



Re: [PATCH] opendir, readdir, rewinddir: Fix issues on native Windows

2025-01-14 Thread Lasse Collin
On 2025-01-13 Paul Eggert wrote:
> On 2025-01-13 00:05, Lasse Collin wrote:
> 
> > I pondered it before sending the patch. POSIX.1-2024 readdir() [1]:
> > 
> >  [EOVERFLOW]
> >  One of the values in the structure to be returned cannot be
> >  represented correctly.  
> 
> Yes, but the list of error numbers in the readdir ERRORS section is
> not exhaustive. For example, readdir can plausibly fail with EIO even
> though EIO is not in the list. See 
> <https://pubs.opengroup.org/onlinepubs/9799919799/functions/V2_chap02.html#tag_16_03>,
>  
> which says "Implementations may support additional errors not
> included in this list, may generate errors included in this list
> under circumstances other than those described here, or may contain
> extensions or limitations that prevent some errors from occurring."

Thanks! I'm aware that additional error numbers are common. However,
portable programs might not know what those numbers mean on a specific
operating system, thus programs need to interpret the numbers
conservatively. Most of the time this works well: an error message is
displayed and the operation doesn't continue.

If a program wants to continue calling readdir() after an error, it has
to know the errno values with which continuation works. POSIX documents
EBADF and ENOENT for readdir(), and their description doesn't give an
impression that further readdir() calls might succeed. Attempting to
continue after EBADF or ENOENT (or EIO) could result in an infinite
loop.

The description of EOVERFLOW is less scary. (It would still be useful
if it explicitly said that one can continue after EOVERFLOW, especially
because the example code doesn't do so.) My peek at GNU ls affirmed my
thought that EOVERFLOW is special in portability context. Many apps
don't continue after EOVERFLOW but I guess even fewer do so after
EILSEQ.

There is also:

Implementations shall not generate a different error number from
one required by this volume of POSIX.1-2024 for an error condition
described in this volume of POSIX.1-2024, but may generate
additional errors unless explicitly disallowed for a particular
function.

Pedantic interpretation might be that EILSEQ isn't allowed in this
situation because EOVERFLOW already describes the error condition. In
practice EILSEQ is fine if the diagnostics are delayed until the end of
the directory. Applications that know the implementation-specific
detail can then even count the number of conversion errors.

> EILSEQ is more-appropriate than EOVERFLOW here, so I'd use it.
[...] 
> PS. As you mention, it's fine (and indeed a good idea) to delay the 
> EILSEQ error to the end, as too much code mistakenly treats any null 
> return from readdir as EOF.

Thanks! I agree now. That is, behave like v2-counting.patch but with
EILSEQ instead of EOVERFLOW.

> > If readdir() returns the more logical sounding EILSEQ, it means that
> > GNU ls won't attempt to list the remaining directory entries.  
> 
> Oh, that's OK, we'd just change GNU ls. This would make for a better 
> 'ls' for the user, as the diagnostics would be more informative.

That helps 'ls'. Other apps on Windows may trip on EILSEQ (and probably
on EOVERFLOW too) if accessing a directory with even one problematic
name. In some cases it might be a better user experience to list a few
non-existing lookalike names in GUI. Perhaps this was the compatibility
concern in the other branch of this thread. The behavior needs to be
configurable.

> > In GNU coreutils, src/ls.c, print_dir() has a loop that calls
> > readdir().[2] It handles two errno values specially:
> > 
> >   - ENOENT is treated the same as successfully reaching the end of
> > the directory.
> > 
> >   - EOVERFLOW results in an error message but directory reading is
> > continued still.  
> 
> The EOVERFLOW treatment is buggy because errno might have changed
> since readdir was called

Those are so easy to miss. :-/ I had fixed a similar error only a few
days ago but I missed it in ls.c even though I was specifically looking
at its errno handling.

-- 
Lasse Collin



Re: [PATCH] opendir, readdir, rewinddir: Fix issues on native Windows

2025-01-13 Thread Lasse Collin
On 2025-01-13 Lasse Collin wrote:
> On 2025-01-12 Paul Eggert wrote:
> > Also, the POSIX spec suggests that readdir should return a null
> > pointer right away with errno set, rather than wait for the end of
> > the directory. A subsequent readdir resumes traversal of the
> > directory, even after such an error. Doing it this nicer way should
> > avoid the need for the new label and goto, and it would also let the
> > caller count how many bad entries the directory has.  
> 
> Returning an error immediately makes the code slightly simpler. I
> wonder how many apps continue after any error though.

POSIX allows readdir() to buffer the input, thus I guess reordering the
failures until the end isn't strictly prohibited. The motivation was to
be friendlier to users of programs that don't handle EOVERFLOW
specially/correctly. With reordering they would see all other files. At
this point I have no opinion if it is a good or bad idea.

I attached two versions:

  - v2-simple.patch uses EOVERFLOW immediately as was discussed
earlier.

  - v2-counting.patch counts the number of problematic file names. At
the end of the directory, EOVERFLOW is set as many times as there
were problematic file names. This way callers can count the number
of problems.

I fixed a few a few bad bugs that were in the first patch:

  - The return value from GetFullPathNameW was mishandled, which could
result in non-working mask string.

  - dirp->status was set to EOVERFLOW. It made readdir() set it
infinitely at the end of the directory.

  - "return NULL;" was missing from one error path.

Thank you for the feedback! I learned a few things once again. Similar
issues in MinGW-w64's  implementation may be fixed at some
point, and the comments here might a mistake or two. So if it turns out
that my patches cannot be merged to Gnulib for now, this discussion has
still been valuable. Thanks!

-- 
Lasse Collin
From ee205ebf07c9546df14f5d514675305de652e2cc Mon Sep 17 00:00:00 2001
From: Lasse Collin 
Date: Mon, 13 Jan 2025 17:31:13 +0200
Subject: [PATCH] opendir, readdir, rewinddir: Fix issues on native Windows

(1) With legacy code pages (not UTF-8), certain non-ASCII
Unicode characters in file names are converted in a lossy
way ("best-fit mapping"). For example, if active code page
is Windows-1252 and there is a file name with U+2044 in it,
it becomes ASCII '/'. Thus, malicious file names result in
directory traversal.

Fix by reading the file names as wide chars and converting
them losslessly to active code page. If lossless conversion
isn't possible, skip the name. At the end of the directory,
return NULL with errno set to EOVERFLOW as many times as
there were skipped files.

While EILSEQ could result in a better error message from strerror(),
POSIX readdir() docs say that EOVERFLOW is the correct choice.
Some apps like "ls" from GNU coreutils correctly continue reading
the directory after EOVERFLOW but many apps don't. Delaying
the error allows such apps to see all other file names still.
The downside is delayed error detection in apps that will fail
their task on any error anyway.

The check for lossless conversion is useful even with the UTF-8
code page because unpaired surrogates (invalid UTF-16) are
converted to U+FFFD. It seems likely that file names with
unpaired surrogates cannot be accessed via multibyte APIs.

For more information, see the blog post by Orange Tsai and
Splitline from the DEVCORE Research Team.[*]

(2) With UTF-8 code page, file names can be longer than MAX_PATH bytes.
For example, a long name may fit in MAX_PATH when encoded as
Windows-1252 and then readdir() returns it correctly. If the same
name exceeds MAX_PATH as UTF-8, readdir() used to fail. Such
file names can still be opened etc. as long as their wide char
encoding doesn't exceed MAX_PATH.

(3) If application is declared as "long path aware" in an application
manifest, then file names can be longer than MAX_PATH even in the
native UTF-16 encoding (up to about 32767 wide chars). Fix by
dynamically allocating the buffer for the absolute file name.

[*] https://devco.re/blog/2025/01/09/worstfit-unveiling-hidden-transformers-in-windows-ansi/
---
 lib/dirent-private.h |  33 +++---
 lib/opendir.c| 141 +++
 lib/readdir.c|  45 ++
 lib/rewinddir.c  |   7 +--
 4 files changed, 150 insertions(+), 76 deletions(-)

diff --git a/lib/dirent-private.h b/lib/dirent-private.h
index 22d847106a..42ddb0d0f1 100644
--- a/lib/dirent-private.h
+++ b/lib/dirent-private.h
@@ -37,10 +37,7 @@ struct gl_directory
 
 # define WIN32_LEAN_AND_MEAN
 # include 
-
-/* Don't assume that UNICODE is not defined.  */
-# undef WIN32_FIND_DAT

Re: [PATCH] opendir, readdir, rewinddir: Fix issues on native Windows

2025-01-13 Thread Lasse Collin
On 2025-01-13 Bruno Haible via Gnulib discussion list wrote:
> Without your patch, an application has the guarantee that when one of
> the file system APIs has returned a file name, the other file system
> APIs will be able to handle it (ignoring ENAMETOOLONG cases, which
> rarely occur in practice). Your patch breaks this guarantee.

The first version of the patch had bad bugs. I apologize if those bugs
gave you the above impression.

My patch to readdir() makes it use EOVERFLOW with names that cannot be
correctly converted. The other names are returned identically to the
old code, except that the patched code is also able to return longer
names instead of failing with EIO.

Some modules like "open" handle those long names without changes (works
fine in gzip). But since the long name support is the only difference
in names that are actually returned, I suppose this is where the
guarantee you mentioned is broken.

> Therefore my request to do this change of conversion on *all* file
> system APIs, or on none.

Understood. At least the stat module could be fixed to give it long
name support. Other modules could be verified for long name support
too. Perhaps EOVERFLOW handling is also worth checking. This is too much
for me in the foreseeable future, so if it means that the current patch
isn't usable alone, then that's how it is. The feedback I have received
has still been useful.

-- 
Lasse Collin



Re: supporting in the UTF-8 environment on native Windows

2024-12-24 Thread Lasse Collin
On 2024-12-23 Bruno Haible wrote:
> Lasse Collin reported in
> <https://lists.gnu.org/archive/html/bug-gettext/2024-12/msg00111.html>
> that the setlocale() override from GNU libintl does not support the
> UTF-8 environment of native Windows correctly. That setlocale()
> override is based on the setlocale() override from gnulib. So let me
> add that support here.

Thanks! I looked at the commits but I didn't test anything yet.

(1)
In 9f7ff4f423cd ("localename-unsafe: Support the UTF-8 environment on
native Windows."), the N(name) macro is used with strings that include
@modifier. For example, N("az_AZ@cyrillic") can expand to
"az...@cyrillic.utf-8". Similarly in 00211fc69c92 ("setlocale: Support
the UTF-8 environment on native Windows."), ".65001" is appended after
the @modifier. However, the typical order would be az_AZ.UTF-8@cyrillic.

I suppose you had a reason to use .65001 instead of .UTF-8 or .utf8.
I expect identical behavior from those. The MS setlocale() docs use
variants of .UTF8:


https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale?view=msvc-170#utf-8-support

(2)
In 2f4391fde862 ("setlocale tests: Test in the UTF-8 environment on
native Windows."), the condition

(strlen (name) > 6 && strcmp (name + strlen (name) - 6, ".UTF-8") == 0)

matches the two long strings below it too, making those two extra
strcmp calls redundant.

(3)
When a manifest is added via a resource file, a possible default
manifest from the toolchain is replaced; they aren't merged. For
example, on MSYS2, the mingw-w64-ucrt-x86_64-gcc package depends on
mingw-w64-ucrt-x86_64-windows-default-manifest. The manifest comes from
Cygwin:


https://sourceware.org/git/?p=cygwin-apps/windows-default-manifest.git;a=blob;f=default-manifest.rc

Omitting the  section makes the application run with
Vista as the Operating System Context. Omitting the  section
makes Windows treat the application as not UAC compliant, that is, a
pre-Vista app that needs compatibility tricks.

Probably these don't matter with the current tests. I suggest changing
it still because it's still an odd combination to have UTF-8 without
marking the app compatible with recent Windows versions.

(4)
The output from windres goes to a file with the .res suffix but the
format is overridden with --output-format=coff. This looks weird
because windres defaults to --output-format=res for files that use the
.res suffix. For coff, the .o suffix would be logical, and
--output-format option wouldn't be needed.

See the paragraphs near the beginning of the info node
(binutils)windres. A simple command should be enough:

windres input.rc output.o

> In fact, there are apparently two variants of this mode:
>   - the legacy Windows settings variant: when you haven't ever
> (or recently?) changed the system default locale of Windows 10,
>   - the modern Windows settings variant: when you have changed
> the system default locale of Windows 10.
> With the legacy Windows settings, the setlocale() function produces
> locale names such as "English_United States.65001" or
> "English_United States.utf8". With the modern Windows settings, it
> produces "en_US.UTF-8" instead. (This is with both mingw and MSVC,
> according to my testing.)

I don't know enough about Windows to comment much. I only tested on one
Win10 system which returned the long spellings.

If native setlocale(LC_ALL, "") can indeed result in "en_US" or
"en_US.UTF-8", I wonder if it can result in "az-Cyrl_AZ.UTF-8" too. I
don't see how Gnulib or Gettext would map such a locale name to
az_AZ.UTF-8@cyrillic. (az_AZ@cyrillic was the first one with @ in
localename-unsafe.c, thus I looked at that in MS docs too.)

The codeset seems to be a part of the language name:


https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-lcid/a9eac961-e77d-41a6-90a5-ce1a8b0cdb9c

Locale format doesn't use @modifier:


https://learn.microsoft.com/en-us/cpp/c-runtime-library/locale-names-languages-and-country-region-strings?view=msvc-170

-- 
Lasse Collin



Re: supporting in the UTF-8 environment on native Windows

2024-12-25 Thread Lasse Collin
On 2024-12-24 Bruno Haible wrote:
> Lasse Collin wrote:
> > (1)
> > In 9f7ff4f423cd ("localename-unsafe: Support the UTF-8 environment
> > on native Windows."), the N(name) macro is used with strings that
> > include @modifier. For example, N("az_AZ@cyrillic") can expand to
> > "az...@cyrillic.utf-8". Similarly in 00211fc69c92 ("setlocale:
> > Support the UTF-8 environment on native Windows."), ".65001" is
> > appended after the @modifier. However, the typical order would be
> > az_AZ.UTF-8@cyrillic.
> 
> Good point. Fixed through the patch below.

Thanks. Isn't there a similar issue in setlocale.c after the commit
00211fc69c92?

  /* llCC_buf now contains
   language[_territory][@modifier]
   */
  if (strcmp (llCC_buf, locale) != 0)
{
  if (is_utf8)
{
  char buf[64+6];
  strcpy (buf, llCC_buf);
  strcat (buf, ".65001");
  result = setlocale (category, buf);
}
  else
result = setlocale (category, llCC_buf);

> > I suppose you had a reason to use .65001 instead of .UTF-8 or .utf8.
> > I expect identical behavior from those.  
> 
> Yes: There was some period (ca. 5 years ago) when Windows supported
> the .65001 suffix but not the .utf8 suffix. The ability to use .utf8
> to denote code page 65001 was added a bit later.
[...]
> Yeah, the docs don't tell you everything (as usual with Microsoft).

Makes sense. Thanks for the explanation.

> > (2)
> > In 2f4391fde862 ("setlocale tests: Test in the UTF-8 environment on
> > native Windows."), the condition
> > 
> > (strlen (name) > 6 && strcmp (name + strlen (name) - 6,
> > ".UTF-8") == 0)
> > 
> > matches the two long strings below it too, making those two extra
> > strcmp calls redundant.  
> 
> Correct. Still, it's useful to have a writedown of what the output in
> the legacy mode is. (Unit test code is not optimized for performance.)

I agree it's a fine way to document things. It could be useful to have
a comment saying that those conditions are there for documentation
purposes.

> > (3)
> > When a manifest is added via a resource file, a possible default
> > manifest from the toolchain is replaced; they aren't merged. For
> > example, on MSYS2, the mingw-w64-ucrt-x86_64-gcc package depends on
> > mingw-w64-ucrt-x86_64-windows-default-manifest. The manifest comes
> > from Cygwin:
> > 
> > 
> > https://sourceware.org/git/?p=cygwin-apps/windows-default-manifest.git;a=blob;f=default-manifest.rc
> > 
> > Omitting the  section makes the application run with
> > Vista as the Operating System Context. Omitting the 
> > section makes Windows treat the application as not UAC compliant,
> > that is, a pre-Vista app that needs compatibility tricks.
> > 
> > Probably these don't matter with the current tests. I suggest
> > changing it still because it's still an odd combination to have
> > UTF-8 without marking the app compatible with recent Windows
> > versions.  
> 
> I picked the smallest XML file that has the desired effect.
> 
> Also, I don't like enumerating Windows versions in this way because
> it's not future-proof: Since Windows 11, 12, 13, etc. are not listed,
> it would only be matter of time until the test breaks in a new Windows
> version.

It's only a test program which doesn't affect the binaries being
installed. As long as it works for testing I guess it's fine enough.

One clarification though: If one omits , Win10 behaves
as if you had put only Vista compatibility in the manifest: Task
Manager shows Vista as the Operating System Context for the program.
That is, one is sort of requesting "emulation" of Vista in that case.

There is no separate entry for Windows 11 because Windows 10 and 11 use
the same entry. What will happen with Win12 is unknown, of course.

> > (4)
> > The output from windres goes to a file with the .res suffix but the
> > format is overridden with --output-format=coff. This looks weird
> > because windres defaults to --output-format=res for files that use
> > the .res suffix. For coff, the .o suffix would be logical, and
> > --output-format option wouldn't be needed.  
> 
> Maybe it looks weird, but IIRC it's the best way that I found that
> does not run into 32-bit / 64-bit problems. For example, if 'windres'
> is 64-bit but I'm compiling with a 32-bit targetting gcc. (Some
> toolchain installations have a -windres program, but some
> other toolchain installations have only one windres program for all
> targets.)

OK. It was only the .res suffix that looked weird because windres
already recognizes .res for a different file format than COFF.

-- 
Lasse Collin



Re: [PATCH] opendir, readdir, rewinddir: Fix issues on native Windows

2025-01-17 Thread Lasse Collin
On 2025-01-14 Paul Eggert wrote:
> On 2025-01-14 02:43, Lasse Collin wrote:
> > Pedantic interpretation might be that EILSEQ isn't allowed in this
> > situation because EOVERFLOW already describes the error condition.  
> 
> No, because EOVERFLOW is intended for things like inode numbers don't 
> fit. It is not intended for things like invalid byte sequences.

I too associate EOVERFLOW with numbers, not strings. Your
interpretation is the most reasonable.

Don't take the following too seriously (feel free to even ignore). :-)
I noticed that a discussion similar to ours happened around glibc's
readdir_r in 2013. While it's a deprecated function, it's errno usage
is very similar to readdir. The discussion was about

  - ENAMETOOLONG vs. EOVERFLOW (vs. ERANGE) and which is allowed
by POSIX,

  - delaying the error until the end of the directory, and

  - if one can continue reading the directory after some errors or if
it can result in an infinite loop.

It's not a long thread:

https://sourceware.org/legacy-ml/libc-alpha/2013-05/msg00445.html

From the current glibc manual section '(libc)Reading/Closing Directory':

To distinguish between an end-of-directory condition or an error,
you must set ‘errno’ to zero before calling ‘readdir’.  To avoid
entering an infinite loop, you should stop reading from the
directory after the first error.

...

• On some systems, ‘readdir_r’ cannot read directory entries
  with very long names.  If such a name is encountered, the GNU
  C Library implementation of ‘readdir_r’ returns with an error
  code of ‘ENAMETOOLONG’ after the final directory entry has
  been read.

Those paragraphs were added as part of the readdir_r fix:


https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=91ce40854d0b7f865cf5024ef95a8026b76096f3

The current readdir_r code:


https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/readdir_r.c;hb=2642002380aafb71a1d3b569b6d7ebeab3284816

It doesn't count the number of problematic files. If my reading is
correct, at the end of the directory the code will keep returning NULL
with errno = ENAMETOOLONG. If so, one shouldn't continue after
ENAMETOOLONG with this implementation. It seems that continuing after an
error needs platform-specific knowledge in practice. (Maybe EOVERFLOW
is an exception.)

The behavior isn't needed in readdir in glibc. It's in readdir_r only
which one hopefully won't use anyway.

> > Other apps on Windows may trip on EILSEQ  
> 
> We can cross those bridges when we come to them.
> 
> There is no perfect solution here. But for 'ls', the only program
> where we know this matters, EILSEQ would be better than EOVERFLOW.

Now I think that it's good enough if readdir on Windows delays EILSEQ
(and ENAMETOOLONG) until the end of the directory and returns that
error once. That is, if readdir is called again, it returns NULL
without modifying errno to prevent infinite loops.

It means that apps cannot count the errors and only one error code is
remembered if two types of errors occur, but it's simpler code. Most
apps stop after the first error anyway. The upside is that then there
is no pressure to make GNU ls to continue reading the directory after
EILSEQ. ;-)

-- 
Lasse Collin



Re: crc-x86_64: undefined behaviour

2025-01-17 Thread Lasse Collin
I'm fairly new on the list and haven't read earlier discussion about
the crc-x86_64 module, so I hope this doesn't come out wrong. I rewrote
the CRC CLMUL code[1] in XZ Utils six months ago, and I'm commenting
based on that experience.

On 2025-01-16 Paul Eggert wrote:
> That being said, it does appear there are a lot of unaligned word 
> accesses nearby, which can't be good for performance even if the 
> hardware doesn't trap.

With the code variants I tried last year, the penalty from unaligned
reads was very small. The buffer would need to be large until the extra
cost of aligning at the beginning would pay off. This was with 128-bit
registers. It might be different with wider registers.

On 2025-01-17 Sam Russell wrote:
> Most performant would probably be an alignment check at the start and
> then manually processing the first N bytes. Another option could be
> to simply cast data to unsigned char* and then we can guarantee the
> compiler doesn't hit alignment issues?

Try using __m128i_u* instead of __m128i*. The former has aligned(1)
attribute. It's not available in old GCC versions though.

I kept the input as uint8_t* and cast it for the loads in a wrapper:

static inline __m128i
my_load128(const uint8_t *p)
{
return _mm_loadu_si128((const __m128i *)p);
}

I suppose __m128i_u* would be more correct above too but sanitizers
don't complain about it. Perhaps sanitizers treat the intrinsics
differently than memcpy.

[1] 
https://github.com/tukaani-project/xz/blob/master/src/liblzma/check/crc_x86_clmul.h

-- 
Lasse Collin



Re: [PATCH] opendir, readdir, rewinddir: Fix issues on native Windows

2025-01-17 Thread Lasse Collin
On 2025-01-14 Bruno Haible wrote:
> Maybe we should spread the word about the UTF-8 mode, that you made
> us aware of recently [1].

Yes, UTF-8 compatibility would be good nowadays. If a user has set it
globally in Windows settings (a beta feature), then programs use UTF-8
even when they don't have it set in an application manifest.

(1) While UTF-8 can result in crazy-long file name *components* (over
259 bytes without \0), they aren't common. If Gnulib's readdir and
stat have the limit, it might not matter too much in practice.
"open" doesn't have such a limit.

My patch had 255 * MB_LEN_MAX + 1, and MB_LEN_MAX is 5. Now I think
it's execessive even in a theoretical case. I'm not sure but I
suspect that UTF-8 is the only *locale* code page that has longer
than two-byte characters. So 255 * 3 + 1 should be enough which is
still a big number.

(2) Making programs "long path aware" could matter more. The limit of
259 wide chars (excluding the \0) is enforced on the absolute file
name even if program uses relative file names. The constant MAX_PATH
appears in very few files in Gnulib but opendir.c is one of them.
The \\?\ prefix bypasses the MAX_PATH limit but long path aware
programs don't need that trick.

I don't plan to continue my side quest to Windows issues further. :-)
Thanks for the comments!

-- 
Lasse Collin



Re: MinGW-w64 fenv changes.

2025-06-14 Thread Lasse Collin
On 2025-06-13 Bruno Haible wrote:
> Lasse Collin wrote:
> > I mentioned this thread on #mingw-w64 on OFTC. I was asked if I
> > could ask the Gnulib developers to report the bugs with minimal
> > reproducible examples to the mingw-w64-public mailing list[1].  
> 
> I usually don't spend much time reporting bugs for packages that are
> not part of the GNU system, or at least licensed under GPL / LGPL.
> Anyone is free to use Gnulib's tests as a self-tests for their
> platform, like for example the glibc people do [1].
> 
> But since you are asking in a friendly way, let me try to help.
> 
> I'm not subscribing to the mingw-w64 mailing list; instead, you can
> find this mail archived under [2].

Thank you very much! I have forwarded the information to mingw-w64
developers:


https://sourceforge.net/p/mingw-w64/mailman/mingw-w64-public/thread/20250614194520.67f75e35.lasse.collin%40tukaani.org/

-- 
Lasse Collin



Re: MinGW-w64 fenv changes.

2025-06-12 Thread Lasse Collin
On 2025-06-11 Bruno Haible via Gnulib discussion list wrote:
> In summary, the mingw approach that considers both the 387 unit and
> the SSE floating-point unit basically works. But, as usual with
> mingw, it comes with bugs:
>   - The older mingw bugs in this area are still present.
>   - They added two new bugs.

I mentioned this thread on #mingw-w64 on OFTC. I was asked if I
could ask the Gnulib developers to report the bugs with minimal
reproducible examples to the mingw-w64-public mailing list[1]. I'm not
able to help with the bugs themselves; I'm only relaying the message.

[1] https://sourceforge.net/projects/mingw-w64/lists/mingw-w64-public
I think posting requires one to be subscribed. Attachments should
have .txt suffix because the list filters out many file types.

-- 
Lasse Collin