On 12/25/2024 6:06 PM, Ken Brown wrote:
On 12/25/2024 1:08 PM, Ken Brown wrote:
Hi Corinna,

I know you're on vacation, but I'm writing this now while it's fresh in my mind.  We can discuss it when you return.

This is a follow-up to

   https://cygwin.com/pipermail/cygwin-developers/2024- December/012715.html

but I'm starting a new thread for clarity, and I'll start from scratch.

I'm back to thinking that mmap_is_attached_or_noreserve is broken. I'm attaching two test cases to show this.  To run them, create a file hello.txt with one line, and then compile and run the two programs.

They both take a region of size 128K and map the two halves of it separately with MAP_NORESERVE.  The only difference between the two cases is the order in which the halves are mapped.  They then call read(), passing the whole region of size 128K as buffer.  If mmap_is_attached_or_noreserve is working, the calls to read should succeed.  But only the one in mmap_noreserve2 succeeds, and the other fails with a bus error.  I'll now try to explain what's going on.

The task is to find one or more noreserve records that completely cover the interval [addr, addr+len].

After a successful call to match(start_addr, len, u_addr, u_len), we have the following situation:

     |---------|---------------------|-----------------|
start_addr   u_addr          u_addr+u_len          start_addr+len

The interval [u_addr, u_addr+u_len] is the intersection of [start_addr, start_addr+len] with the interval of the mmap_record. Note that u_addr  >= start_addr and u_addr+u_len <= start_addr+len, but I've shown in the picture the case where both inequalities are strict.

At this point we know that the intersection is contained in a noreserve mmap_record.  But we commit the entire interval [start_addr, u_addr+u_len], whose length is commit_len, which is written as

   u_len - (start_addr - u_addr) = (u_addr + u_len) - start_addr.

There's no problem if u_addr = start_addr.  But if u_addr > start_addr, then the commit includes [start_addr, u_addr], which is not at this point known to be contained in a noreserve mmap_record. Moreover, even if it is contained in a noreserve record that we just haven't come to yet, experiments show that the call to VirtualAlloc in line 794 fails in this case.  Apparently you can't commit in one call to VirtualAlloc a region that was reserved [Windows terminology] in two separate calls.  I can't find this documented anywhere, but that's what happens in the attached mmap_noreserve1.c, as I verified by stepping through the code under gdb.

As I said in an earlier message, I don't see how to fix this except by making sure that the noreserve maps in map_list occur in order of their starting address.  Then if we ever find u_addr > start_addr, we know we'll never find a noreserve mmap_record covering [start_addr, u_addr], and we can break out of the loop.  If you agree that there's a problem, does this seem like the right fix?

I apologize for the length of this message, but I wanted to write everything out in detail in order to be clear.  I've written a patch (also attached) that attempts to do does what I suggested in the previous paragraph.  Unfortunately, when I install it, mintty hangs on startup without showing the bash prompt, and a grep process is left running.  So there must be something wrong with my patch, or else it triggers a latent bug somewhere else.

Takashi, if you're reading this, could you take a look at my patch and see if you spot any obvious errors?  Or do you think this could be related to your recent changes?

I've find two errors, one in my patch and one in the original code: There were two places in the main loop of mmap_is_attached_or_noreserve where there was a "break" that should have been "continue".  Version 2 of the patch (attached) tries to fix these.  But I still get the same mintty hang on startup.  So there's still something wrong.

Note that the original bug that I'm trying to fix caused SIGBUS to be raised in some places where it shouldn't have been (due to a failed call to VirtualAlloc).  The new break/continue bug I just found had the opposite effect: it caused SIGBUS not to be raised in some places where it should have been.
I've cleaned up and improved the patch, and I've sent it to cygwin-patches. I'm stuck at this point.

Ken

Reply via email to