But that's the thing... in this case, the fd lifetime is owned by the X server. In fact, nouveau doesn't touch that fd in mesa beyond dup'ing it, and then exclusively using the dup'd fd.
On Sun, Jun 28, 2015 at 12:23 AM, Mario Kleiner <mario.kleiner...@gmail.com> wrote: > Ok, maybe one thing for the commit message, as i just made the same mixup in > my reply: The fd's are not owned by the x-server, but by the mesa screens > (pipe screens? dri screens?) they represent, so if such a screen goes away, > the fd goes away. Using "X server" would be confusing, especially under > dri3, although each such "mesa screen" corresponds to a x-screen. > > -mario > > > On 06/28/2015 06:03 AM, Mario Kleiner wrote: >> >> On 06/28/2015 05:41 AM, Ilia Mirkin wrote: >>> >>> Oh duh. Thanks for the super-detailed explanation. To rephrase what >>> you said in a slightly shorter manner: >>> >>> See bug https://bugs.freedesktop.org/show_bug.cgi?id=79823 for why we >>> need to dupfd (which we are already, although radeon might not be). >>> >> >> Radeon doesn't so far. My 2nd patch for radeon from earlier today adds >> that. >> >>> Except instead of sticking the dup'd fd into the hash table, whose >>> lifetime matches that of the device, we stick the other one in, which >>> is effectively owned by the X server. As a result, those fstat's might >>> fail down the line, and all sorts of hell will break loose. >>> >> >> Yes. Essentially we should make sure the fd's we keep around have the >> same lifetime as the data structure in which we need to use it. That was >> the point, the server owned fd went away while we still needed it. >> >>> Would you be opposed to me rewriting your commit message to basically >>> reflect the above? Perhaps your original one did as well, but it >>> wasn't clear to me. I'd also like to throw some assert's in to make >>> sure the fstat's don't fail -- does that sound reasonable? >>> >> >> Oh please, yes! My commit messages often have this disease that they are >> either too terse, when i think the problem is obvious, or too long and >> somewhat convoluted when i'm going overboard in the other direction. Any >> editing/shortening for clarity more than happily accepted :) >> >> thanks, >> -mario >> >>> Another solution, btw, is to stick <stat.st_dev, stat.st_ino, >>> stat.st_rdev> as the real key in the hash, although that'd involve >>> making pointers. Probably not worth it. >>> >>> Cheers, >>> >>> -ilia >>> >>> On Sat, Jun 27, 2015 at 11:13 PM, Mario Kleiner >>> <mario.kleiner...@gmail.com> wrote: >>>> >>>> On 06/28/2015 03:48 AM, Ilia Mirkin wrote: >>>>> >>>>> >>>>> On Fri, Jun 5, 2015 at 9:36 AM, Mario Kleiner >>>>> <mario.kleiner...@gmail.com> wrote: >>>>>> >>>>>> >>>>>> The dup'ed fd owned by the nouveau_screen for a device node >>>>>> must also be used as key for the winsys hash table, instead >>>>>> of using the original fd passed in for a screen, to make >>>>>> multi-x-screen ZaphodHeads configurations work on nouveau. >>>>>> >>>>>> This prevents the following crash scenario that was observed >>>>>> when a dynamically loaded rendering plugin used OpenGL on a >>>>>> ZaphodHeads setup, e.g., on a dual x-screen setup. At first >>>>>> load the plugin worked, but after unloading and reloading it, >>>>>> the next rendering operation crashed: >>>>>> >>>>>> 1. Client, e.g., a plugin, calls glXQueryVersion. >>>>>> >>>>>> 2. DRI screens for the x-screens 0 and 1 are created, one shared >>>>>> nouveau_screen is created for the shared device node of both >>>>>> screens, but the original fd of x-screen 0 is used as identifying >>>>>> key in the hash table, instead of the dup()ed fd of x-screen 0 >>>>>> which is owned by the nouveau_screen. nouveau_screen's refcount >>>>>> is now 2. >>>>> >>>>> >>>>> >>>>> See below, but it shouldn't matter which fd gets used. >>>>> >>>>>> >>>>>> 3. Regular rendering happens by the client plugin, then the plugin >>>>>> gets unloaded. >>>>>> >>>>>> 4. XCloseDisplay(). x-screen 0 gets its DRI screen destroyed, >>>>>> nouveau_drm_screen_unref() drops the refcount to 1, calling mesa >>>>>> code then closes the fd of x-screen 0, so now the fd which is >>>>>> used as key in the hash table is invalid. x-screen 1 gets >>>>>> destroyed, nouveau_drm_screen_unref() drops the refcount to 0, >>>>>> the nouveau_screen gets destroyed, but removal of its entry >>>>>> in the hash table fails, because the invalid fd in the hash >>>>>> table no longer matches anything (fstat() on the fd is used >>>>>> for hashing and key comparison, but fstat() on an already closed >>>>>> fd fails and returns bogus results). x-screen 1 closes its fd. >>>>>> >>>>>> Now all fd's are closed, the nouveau_screen destroyed, but >>>>>> there is a dangling reference to the nouveau_screen in the >>>>>> hash table. >>>>>> >>>>>> 5. Some OpenGL client plugin gets loaded again and calls >>>>>> glXQueryVersion. Step 2 above repeats, but because a >>>>>> dangling reference with a matching fd is found in the winsys >>>>>> hash table, no new nouveau_screen is created this time. Instead >>>>>> the invalid pointer to the old nouveau_screen is recycled, >>>>>> which points to nirvana -> Crash. >>>>>> >>>>>> This problem is avoided by use of the dup()ed fd which is >>>>>> owned by the nouveau_screen and has the same lifetime as >>>>>> the nouveau_screen itself. >>>>> >>>>> >>>>> >>>>> I need to think about this some more, but... this shouldn't happen :) >>>>> >>>>> In fact, the whole dupfd thing was added there for ZaphodHeads screens >>>>> in the first place. See >>>>> https://bugs.freedesktop.org/show_bug.cgi?id=79823 and commit >>>>> a59f2bb17bcc which fixed it. >>>>> >>>>> Note that the hash has the following hash/eq functions: >>>>> >>>>> static unsigned hash_fd(void *key) >>>>> { >>>>> int fd = pointer_to_intptr(key); >>>>> struct stat stat; >>>>> fstat(fd, &stat); >>>>> >>>>> return stat.st_dev ^ stat.st_ino ^ stat.st_rdev; >>>>> } >>>>> >>>>> static int compare_fd(void *key1, void *key2) >>>>> { >>>>> int fd1 = pointer_to_intptr(key1); >>>>> int fd2 = pointer_to_intptr(key2); >>>>> struct stat stat1, stat2; >>>>> fstat(fd1, &stat1); >>>>> fstat(fd2, &stat2); >>>>> >>>>> return stat1.st_dev != stat2.st_dev || >>>>> stat1.st_ino != stat2.st_ino || >>>>> stat1.st_rdev != stat2.st_rdev; >>>>> } >>>>> >>>>> so fd and dupfd should get hashed to the same thing. I suspect there's >>>>> something else going on in your application... >>>>> >>>> >>>> My application is a set of dynamically loaded plugins running inside >>>> another >>>> host app (Psychtoolbox-3 inside GNU/Octave), so what happens often is >>>> that >>>> the OpenGL using plugin gets unloaded and reloaded at runtime, so a >>>> after a >>>> OpenGL session has ended with a XCloseDisplay() tearing the winsys >>>> down, you >>>> can easily have it restart at plugin reload with another XOpenDisplay -> >>>> glXQueryVersion -> .... sequence, where the new call to >>>> glXQueryVersion will >>>> trigger a recreation of the winsys, which will find the stale entry >>>> in the >>>> hash table pointing to nowhere instead of the previously released >>>> nouveau_screen -> use-after-free -> boom! >>>> >>>> The reason this fails is because during destruction of the 2nd, 3rd etc. >>>> x-screen, the already closed fd associated with the 1st x-screen is >>>> fed into >>>> the compare_fd function, so fstat() errors out on the invalid fd. I >>>> added >>>> printf's etc. on both nouveau and now radeon to verify the fstat >>>> gives me >>>> EBADF errors. So the hash calculation goes wrong when trying to find the >>>> matching element in the hash table with a fd that has a matching hash >>>> -> The >>>> element which should be removed is ignored/not removed because it >>>> contains >>>> an already closed fd for which no proper hash can be calculated >>>> anymore -> >>>> hash comparison during search goes wrong. >>>> >>>> This is because multiple x-screens, e.g., 2 x-screens are destroyed >>>> in order >>>> 0, 1 by FreeScreenConfigs() as part of XCloseDisplay(). >>>> FreeScreenConfigs() >>>> calls dri2DestroyScreen() in dri2.c or dri3_destroy_screen in dri3.c, >>>> which >>>> are essentially identical, e.g.,: >>>> >>>> static void >>>> dri2DestroyScreen(struct glx_screen *base) >>>> { >>>> struct dri2_screen *psc = (struct dri2_screen *) base; >>>> >>>> /* Free the direct rendering per screen data */ >>>> (*psc->core->destroyScreen) (psc->driScreen); >>>> >>>> --> This core->destroyScreen calls eventually into the winsys, which >>>> then >>>> drops the refcount of the nouveau_screen or radeon winsys structure >>>> by one, >>>> and tries to release the struct and remove the hash table entry once the >>>> refcount drops to zero. >>>> >>>> driDestroyConfigs(psc->driver_configs); >>>> close(psc->fd); >>>> >>>> -> This close() closes the fd of a screen immediately after closing down >>>> that screen. At the time screen 1 is closed down, the fd associated with >>>> screen 0 is already closed. >>>> >>>> free(psc); >>>> } >>>> >>>> So you have this sequence during creation: >>>> >>>> glXQueryVersion() -> AllocAndFetchScreenConfigs(): >>>> >>>> create screen 0 and its fd, e.g., fd==5, dup(fd), e.g., fd==6 for the >>>> nouveau_screen, but store x-screen 0's fd==5 *itself* in the hash >>>> table as >>>> key. >>>> >>>> create screen 1. This now creates an fd 7 >>>> >>>> => refcount is now 2. >>>> >>>> >>>> And this sequence during destruction at XCloseDisplay(); >>>> >>>> FreeScreenConfigs(): >>>> >>>> free screen 0: core->destroyScreen() -> ... -> >>>> nouveau_drm_screen_unref() => >>>> refcount-- is now 1 >>>> close(fd==5) of screen 0 and thereby the fd==5 stored inside the hash >>>> table >>>> as key. From now on using fcntl(fd==5) for hash calculation on that >>>> element >>>> will malfunction. >>>> >>>> free screen 1: -> nouveau_drm_screen_unref -> refcount-- is now 0. -> >>>> Try to >>>> remove hash table entry => fails because fcntl(fd==5) on the already >>>> closed >>>> fd==5 of screen 0 fails with EBADF. => Hash table keeps its stale >>>> element >>>> referencing the struct nouveau_screen. >>>> >>>> Free nouveau_screen struct (close dup()ed fd==6 for nouveau_screen, >>>> close >>>> down stuff, free the struct). >>>> -> close(fd==7) associated with screen 1. >>>> >>>> Now all fd's (5,6,7) are closed, the struct is gone, the hash table >>>> contains >>>> a stale element with a no longer existent fd==5 and a pointer to an >>>> already >>>> free()d struct nouveau_screen. >>>> >>>> Now we reload the plugin and do glXQueryVersion() again, so >>>> AllocAndFetchScreenConfigs() is executed again to create the winsys >>>> etc. for >>>> screens 0 and 1. >>>> >>>> create screen 0 and its fd: Here the first unused fd in the >>>> filedescriptor >>>> table is recycled, which happens to be fd==5, just as in the first >>>> session >>>> with the plugin! Now suddenly fd=5 is again the valid fd for x-screen >>>> 0, and >>>> it points to the same /dev/drm/card0 device file as before. That >>>> means when >>>> nouveau_drm_screen_create(fd==5) is called, it finds the stale >>>> element in >>>> the hash table from the previous session, because the previously >>>> closed fd 5 >>>> in that element is now valid and open again and points to the same >>>> device >>>> file as in the previous session, ergo has the same hash etc. >>>> nouveau_drm_screen_create() concludes that a fully initialized >>>> nouveau_screen already exists for this session, except it doesn't >>>> exist - >>>> the pointer in the hash table points to invalid previously freed memory. >>>> When it tries to access that memory we have a use-after-free and the >>>> application segfaults. >>>> >>>> So it depends a bit on what happens in the applications memory >>>> between the >>>> two consecutive OpenGL sessions etc. how long it takes for an invalid >>>> memory >>>> access to happen, but eventually it hits invalid data. Maybe >>>> occassionally >>>> it gets lucky and the freed memory is still accessible and the struct >>>> intact, so stuff works by chance. >>>> >>>> If the application happened to open other files inbetween the 1st and >>>> 2nd >>>> session, then the relevant fd in the 2nd session may not be exactly >>>> the same >>>> as in the first session, because recycling that fd didn't work, so we >>>> are >>>> only left with a dead but harmless entry in the hash table instead of a >>>> crasher. I assume something like that prevented my crashes on radeon >>>> in the >>>> past, because when i wasn't expecting/looking for trouble i didn't use a >>>> test sequence carefully made to trigger the bug for certain. >>>> >>>> Similar things happen on radeon for the same reason, ergo the 2nd >>>> patch with >>>> a similar fix. >>>> >>>> I don't have the gdb backtrace for nouveau anymore, but the traces for >>>> radeon are attached to this mail to show you the flow of execution, >>>> similar >>>> to nouveau. >>>> >>>> -mario _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev