Follow-up Comment #29, bug#64806 (group make): As proposed by Eli, I have investigated the process of mutex creation and inheritance; please find below a brief summary regarding the implementation of key functions in _posixos.c_ and _w32os.c_, a proposal for bringing the two approaches closer to each other then finally eliminating the hang experienced previously.
According to _os.h_, _osync_setup()_ is a function "called in the parent make to set up output sync initially": * In _posixos.c_ this function is implemented by opening a temporary file and storing its descriptor and name in variables _osync_handle_ and _osync_tmpfile_ respectively; the handle is configured for _not_ being inherited. * In _w32os.c_ this function is implemented by creating an unnamed mutex, setting its handle to be _inherited_ and storing the handle in _osync_handle_ (since the mutex is unnamed, there is no variable equivalent to _osync_tmpfile_). According to _os.h_, _osync_get_mutex()_ is a function that "returns an allocated buffer containing output sync info to pass to child instances, or NULL if not needed": * In _posixos.c_ this function returns a dynamically allocated copy of _osync_tmpfile_ prefixed with _MUTEX_PREFIX_ ("fnm:"), i.e., a file name prefixed with a known constant. * In _w32os.c_ this function returns a dynamically allocated string that holds _osync_handle_ printed in hexadecimal (note that apparently the _MUTEX_PREFIX_ macro is there in _w32os.c_, but not used). According to _os.h_, _osync_parse_mutex()_ is a function "called in a child instance to obtain info on the output sync mutex": * In _posixos.c_ this function extracts the name of the temporary file (created by _osync_setup()_ and name prefixed by _osync_get_mutex()_) and opens it; note that this behavior does not fully conform to the description in _os.h_, because here we not only "obtain info on" the mutex but actually _gain access to it_ by opening the file. * In _w32os.c_ this function extracts the numeric mutex handle (created by _osync_setup()_ and written into string in hexadecimal by _osync_get_mutex()_); since here we do not open any object just really extract information about some inherited resource, this behavior seems to better match the description in _os.h_. According to _os.h_, _osync_clear()_ is a function to "clean up this instance's output sync facilities": * In _posixos.c_ this closes the handle; additionally if the actual process is the topmost make process, also deletes the file. Note that (unless we are in the topmost make process), this function does not prevent children from using the temporary file for synchronization since the file remains existing. * In _w32os.c_ this function closes the handle of the mutex. Note that, closing the handle prevents its inheritance to children, furthermore, since we only remembered the handle as a number (no name for the mutex), a child process calling _osync_parse_mutex()_ will falsely consider the number returned to be an inherited mutex handle; this is made even worse by the fact that several pipes are opened upon child process creation thus the numeric handle will likely be valid but refer to some pipe or file, not the mutex (this is the explanation for apparent hang: child process tries to wait for a "mutex" that is in reality some terminal stream). This behavior is not a problem, as long as no child process is created after calling _osync_clear()_ but triggered if a child process is started after _osync_clear()_ -- apparently this happens only under quite special circumstances, namely when a Makefile includes another makefile (component.mk in my example) that triggers inclusion of further generated makefiles (_.d_ files in my example). Functions _osync_acquire()_ and _osync_release()_ are quite obvious: * In _posixos.c osync_acquire()_ locks the temporary file, _osync_release()_ unlocks it. * In _w32os.c osync_acquire()_ locks waits for the mutex (_WaitForSingleObject()_), _osync_release()_ releases the mutex (_ReleaseMutex()_). Summary: * _osync_setup()_ creates some object that processes can synchronize on (file on POSIX, mutex on Windows). * _osync_get_mutex()_ creates a string representation that can be used for referring to the synchronization object and can be passed as command line argument to child processes (prefixed file name on POSIX, handle number on Windows). * _osync_parse_mutex()_ extracts the information from the command line argument and (i) on POSIX it additionally opens the temporary file while (ii) on Windows the mutex is not opened but expected to be inherited. * _osync_clear()_ closes the file or mutex; the POSIX implementation explicitly deletes the temporary file in the topmost make process, while the Windows implementation relies on the garbage collection upon exiting the last process that had a handle on the unnamed mutex object. The key difference between the two approaches is that _osync_parse_mutex()_ on POSIX only expects that the file used for synchronization exists (i.e., has been created by the parent make process) while on Windows it not only expects that the mutex exists but also that the handle of that mutex is inherited -- this second assumption is broken if a child process is created after _osync_clear()_. > So I think the next step is to understand which call to osync_clear closes the handle. Maybe we shouldn't make that call, at least on Windows? Based on the observations above, I would not go in the direction proposed in the most recent comment "[...] Maybe we shouldn't make that call, at least on Windows? [...]". I would say that this approach would further extend the conceptual difference between POSIX and Windows implementations of the mutual exclusion mechanism. I would rather propose bringing them closer together by not relying on inheritance of the mutex handle but -- similarly to named temporary files on POSIX -- using a named mutex and instead of passing its handle as a number to child processes (which will break if we close the handle in _osync_clear()_), passing the _name_ of the mutex. Please find below my proposals for implementation of mutex handling according to this scheme: /** Name prefix for alternative mutex usage (MUTEX_PREFIX was not used in the original). */ #define MUTEX_ALT_PREFIX "make-mutex-" /** Name of the mutex for alternative mutex usage. */ static char *osync_mutex_name = NULL; /* Alternative mutex setup implementation: creates a mutex named "make-mutex-<PID of the topmost make process>" */ void osync_setup_alt() { /* Construct name of the mutex by appending ID of the top-level make process */ static const char * const format = MUTEX_ALT_PREFIX "-%" PRIu32; const uint32_t pid = (uint32_t) GetCurrentProcessId(); const int name_len = 1 + snprintf(NULL, 0, format, pid); /* Remember name of the mutext */ osync_mutex_name = xmalloc(name_len); snprintf(osync_mutex_name, name_len, format, pid); osync_handle = CreateMutex(NULL, FALSE, osync_mutex_name); if (0 == osync_handle) { fprintf(stderr, "CreateMutex: error %lu\n", GetLastError()); errno = ENOLCK; } } /* Alternative implementation for copying name of the mutext: copies previously constructed name in allocated buffer */ char * osync_get_mutex_alt() { return osync_enabled() ? xstrdup(osync_mutex_name) : NULL; } /* Alternative implementation for osync_parse_mutex(): extract the name of the mutex from command line argument and open it */ unsigned int osync_parse_mutex_alt(const char *mutex) { unsigned ret = 0; if (0 == strncmp(mutex, MUTEX_ALT_PREFIX, strlen(MUTEX_ALT_PREFIX))) { osync_handle = OpenMutex(MUTEX_ALL_ACCESS, FALSE, mutex); if (NULL != osync_handle) ret = 1; else fprintf(stderr, "Cannot open mutex: %s\n", mutex); } else { fprintf(stderr, "Ill-formed mutex name: %s\n", mutex); } return ret; } Having replaced _osync_setup()_, _osync_get_mutex()_ and _osync_parse_mutex()_ with these alternative implementations eliminated the hang experienced previously. Additionally, while analyzing the behavior of make regarding obtaining and releasing mutexes, I experienced that _osync_parse_mutex()_ seems to be called more than once. Apparently _decode_output_sync_flags()_ does not only "decode" the synchronization flags, but also calls _osync_parse_mutex()_ which -- as highlighted above -- in contrary to its name, not only parses a string but also opens a file on POSIX. This was not a problem at least on Windows until now, since _osync_parse_mutex()_ on Windows really only extracted a number from a string (I suspect that the POSIX implementation wastes a few file descriptors this way due to opening the same file several times, but have not yet checked). However when transitioning to the named mutex thus bringing POSIX and Windows implementations closer, the mutex would be opened more then once this way. The reason for calling _osync_parse_mutex()_ in _decode_output_sync_flags()_ is not clear for me, because _osync_parse_mutex()_ function is called in _main()_ anyway. Thus I would propose removing the call to _osync_parse_mutex()_ from _decode_output_sync_flags()_: static void decode_output_sync_flags (void) { #ifdef NO_OUTPUT_SYNC output_sync = OUTPUT_SYNC_NONE; #else if (output_sync_option) { if (streq (output_sync_option, "none")) output_sync = OUTPUT_SYNC_NONE; else if (streq (output_sync_option, "line")) output_sync = OUTPUT_SYNC_LINE; else if (streq (output_sync_option, "target")) output_sync = OUTPUT_SYNC_TARGET; else if (streq (output_sync_option, "recurse")) output_sync = OUTPUT_SYNC_RECURSE; else OS (fatal, NILF, _("unknown output-sync type '%s'"), output_sync_option); } #if 0 /* To be removed */ if (sync_mutex) osync_parse_mutex (sync_mutex); #endif /* To be removed */ #endif } With the above changes the hang experienced previously seems to have been eliminated and the POSIX and Windows implementations are moved closed to each other. Unfortunately I am not aware of the reasons for having deviated from the POSIX implementation originally (i.e., relying on inheritance instead of opening a named synchronization object); this may be due to some performance concerns which I am not familiar with. Finally, _osync_parse_mutex()_ could use some clarification: its name or at least the documentation comment should reflect that here we not only do some string parsing but actually obtaining resources. _______________________________________________________ Reply to this item at: <https://savannah.gnu.org/bugs/?64806> _______________________________________________ Message sent via Savannah https://savannah.gnu.org/