[issue4944] os.fsync() doesn't work as expect in Windows
New submission from Javen Wang : I encountered a very strange issue in file flush operation in Windows. Here's the scenario of my application: 1. The main thread of my application will create makefiles sequentially. 2. Once a makefile is generated, launch a separate process calling nmake.exe to run it in parallell. The main thread then create another makefile until no more makefiles to create. 3. The number of new processes is limited by command line options. My application has been running for almost a year without any problem. But, after I made some changes recently to the content of makefile generated, "nmake.exe" in separate process sometimes complains that the makefile was not found. But when I went into the directory, the makefile was there. Because I didn't change any thing related to file creation and the new makefiles are a little bit less than before, I guessed that the makefile just created hasn't been flushed to disk because of size change so the new process could not see it in a short time. So I decided add code to force flush file buffer after writing the file content (I didn't flush the file explicitly before). I did it like below: Fd = open(File, "w") Fd.write(Content) Fd.flush() os.fsync(Fd.fileno()) Fd.close() The strangest thing happened. The "makefile" missing happened more frequently than no flush operation. I searched the web but no answer there. Finally I tried to use Windows file API to create the file via pywin32 extension. The problem's gone. import win32file Fd = win32file.CreateFile( File, win32file.GENERIC_WRITE, 0, None, win32file.CREATE_ALWAYS, win32file.FILE_ATTRIBUTE_NORMAL, None ) win32file.WriteFile(Fd, str(Content), None) win32file.FlushFileBuffers(Fd) win32file.CloseHandle(Fd) I tried writing small python extension in C to make use Windows API to create file like above. It also works well, even I removed the FlushFileBuffers() calling. I think that there's a bug in Python file buffer mechanism. -- components: Windows messages: 79829 nosy: javen72 severity: normal status: open title: os.fsync() doesn't work as expect in Windows type: behavior versions: Python 2.5 ___ Python tracker <http://bugs.python.org/issue4944> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4944] os.fsync() doesn't work as expect in Windows
Javen Wang added the comment: gagenellina, My application is a little bit different from your test code. It won't wait for the exit of new process and there're file writing operations during makefile running. I changed the test code to be real multi-process and tried many file sizes. But I cannot reproduce it. Maybe my application is more complicated situation. ___ Python tracker <http://bugs.python.org/issue4944> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4944] os.fsync() doesn't work as expect in Windows
Javen Wang added the comment: The created files are on local drive. I saw the problem on the laptop (XP-SP2), desktop(XP-SP3) and server (Win2003). But there's no such problem on the Linux and Mac boxes. I tried to use fopen/fwrite in my extension according to your suggestion. The problem wasn't be reproduced. It seems the bug is more alike in Python part. My application is a build system and is also an open source project. Is it possible for you to download it and try it in your box? I created temporary user (py123, password: 123py123) for you (just in case) and here's steps of how to reproduce it. 1. Checkout the build system source code in, for example, c:\test C:\test> svn co --username py123 --password 123py123 https://buildtools.tianocore.org/svn/buildtools/trunk/BaseTools tools 2. Checkout the source code to be built against in c:\test C:\test> svn co --username py123 --password 123py123 https://edk2.tianocore.org/svn/edk2/trunk/edk2 edk2 3. Change the source code between line 222 and line 229 of c:\test\tools\Source\Python\Common\Misc.py (SaveFileOnChange function) like below: Fd = open(File, "wb") Fd.write(Content) Fd.flush() os.fsync(Fd.fileno()) Fd.close() 4. In c:\test\edk2, run C:\test\edk2> edksetup.bat C:\test\edk2> set PYTHONPATH=C:\test\tools\Source\Python C:\test\edk2> python.exe C:\test\tools\Source\Python\build\build.py -n 2 -p MdeModulePkg\MdeModulePkg.dsc -a IA32 -s 5. If the application stops with message like "makefile not found" or "AutoGen.h not found" message, that means the problem happened. Visual Studio 2005 is needed to reproduce it. ___ Python tracker <http://bugs.python.org/issue4944> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4944] os.fsync() doesn't work as expect in Windows
Javen Wang added the comment: There would be more chances to see the problem by doing this: C:\test\edk2> python.exe C:\test\tools\Source\Python\build\build.py -n 2 -p IntelFrameworkModulePkg\IntelFrameworkModulePkg.dsc -a IA32 -s ___ Python tracker <http://bugs.python.org/issue4944> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4944] os.fsync() doesn't work as expect in Windows
Javen Wang added the comment: Thank you very much for the trying. You might miss the step 4 in my previous message. The step is: C:\test\edk2> edksetup.bat C:\test\edk2> set PYTHONPATH=C:\test\tools\Source\Python C:\test\edk2> python.exe C:\test\tools\Source\Python\build\build.py -n 2 -p IntelFrameworkModulePkg\IntelFrameworkModulePkg.dsc -a IA32 -s The Visual Studio 2005 must be in the standard installation directory. Otherwise the C:\test\edk2\Conf\tools_def.txt needs to be changed to reflect the real path. And I tried to disabled all antivirus services and the problem is still there. Don't worry about the binary version of build in edk2\BaseTools\Bin\Win32 (linked against Python 2.5.2). The step I told you is to execute my application from Python script directly. And I tried to execute from script source against Python 2.5.4 and the problem is the same. And no matter running the build from script source or the freeze-ed binary, the results are the same either. If it's hard or inconvenient for you to reproduce it, could you please give me any advice or suggestion on how to debug it (in the interpreter) and where's most possible place the root cause would be in the Python interpreter's code? I can try to change something in the Python interpreter's code, rebuild it and try it on my machine. Although I have work around for this problem, I'd like to root cause it to avoid further possible build break of our project. ___ Python tracker <http://bugs.python.org/issue4944> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4944] os.fsync() doesn't work as expect in Windows
Javen Wang added the comment: I agree multithread programming is kind of difficult and tricky. But I don't think there's a race condition in _MultiThreadBuildPlatform method, because I do have only one producer. And the consumer consumes the product only when it's done (i.e. the file is created and closed). The only race condition, I think, it's in the Python or Windows file system, in which the file created might not be seen by other process due to the file buffer mechanism. I think the flush() method of file object and os.fsync() are to remove the race condition. But it seems that they are not working as they're supposed to be. What I know is that os.fsync() calls the _commit() which calls FlushFileBuffers(). Why no problem if I call the FlushFileBuffers() directly? That's why I think the most possible race condition is in Python file buffer operation which is out of the control of my Python code. I'm sorry that I didn't realize there's 150M code to checkout. Thanks for your patience. Actually they are not code of my application itself. They are the code used to test my application because my application is a build system which needs source code to build. The real code of my application is in the, for my example, directory of C:\test\tools\Source\Python with just about 3M source files :-) And I think I have narrowed down the issue in the file creation in SaveFileOnChange function in C:\test\tools\Source\Python\Common\Misc.py. I know it's very hard to reproduce issue in multi-thread context. And I cannot say absolutely there's no bug in my code. It's OK for you to close this tracker. But it would be better to let it open for a few days so that I can do more investigation then. Anyway, thanks for the trying again. ___ Python tracker <http://bugs.python.org/issue4944> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4944] os.fsync() doesn't work as expect in Windows
Javen Wang added the comment: I narrowed down the root cause in the GIL of Python. I read the source code of implementing os.fsync and found it's using Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS to enclose the calling of _commit() in order to be thread safe. I tried to add them in my C extension and then the issue was reproduced. It looks like the thread state operation or GIL caused a race condition and let os.fsync() not flush the makefile in buffer to disk before it returns. The multi-thread scenario in my application is a. the main thread produces makefile b. once a makefile is generated, the main thread launches a new thread in which a separated process is launched to run the makefile until it exits. I think there's no issue in this multi-thread scenario. And the Python 2.5.2 Maunal, Python/C API Ch8.1, says GIL must be taken care when doing some blocking I/O operations, but it caused thread unsafe. So I still think there's a bug in the Python. ___ Python tracker <http://bugs.python.org/issue4944> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4944] os.fsync() doesn't work as expect in Windows
Javen Wang added the comment: I'm quite certain that the makefile is generated before the "make" program is launched in separated process. Follow is the original code where the makefile is created (line 14) and a task is put in queue (line 19). It's executed in the main thread (no parallel execution here). There's no way that the process is launched before the makefile generation. 01for Module in Pa.Platform.Modules: 02Ma = ModuleAutoGen(Wa, Module, BuildTarget, ToolChain, Arch, self.PlatformFile) 03if Ma == None: 04continue 05# generate AutoGen files and Makefile here 06if self.Target not in ['clean', 'cleanlib', 'cleanall', 'run', 'fds']: 07# for target which must generate AutoGen code and makefile 08if not self.SkipAutoGen or self.Target == 'genc': 09Ma.CreateCodeFile(True) 10if self.Target == "genc": 11continue 12 13if not self.SkipAutoGen or self.Target == 'genmake': 14Ma.CreateMakeFile(True) 15if self.Target == "genmake": 16continue 17 18# Generate build task for the module which will be launched separately 19Bt = BuildTask.New(ModuleMakeUnit(Ma, self.Target)) I think the Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS should not be used for fsync in the Python core code (in posixmodule.c, right?). Because fsync is mostly used to solve file operation issue in parallel execution context, it should not give out CPU time to other threads or processes when it's running. ___ Python tracker <http://bugs.python.org/issue4944> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4944] os.fsync() doesn't work as expect in Windows
Javen Wang added the comment: I did trace the order of file creation and process launch. It shows the file is indeed created before the process launch. I did another investigation. I added a new process, "cmd.exe /c copy Makefile NewMakefile", to copy the the Makefile created, just before launching the "nmake.exe" process. The strangest thing happened again: each makefile was copied successful but there's still "makefile not found" reported by "nmake.exe" process. I tried standalone copy application "xcopy.exe" and the result was the same. So I guess that "cmd.exe", "xcopy.exe" and "nmake.exe" might use different families of API or different mode (sync vs async) to access file. I decided to try the code you provided. In checkfile.c, fopen is used to test the file existence. I changed it to the Windows native API CreateFile and I also added a file write operation in order to make it more similar to the real case. Eventually, the problem was reproduced in your code. Following are the successful number of creating 1000 files 5 times in different file sizes: Create file in 403 bytes, 985, 992, 984, 989, 992 (no flush after creation) 883, 886, 907, 909, 915 (flush after creation) Create file in 4061 bytes 983, 976, 982, 977, 983 (no flush after creation) 654, 672, 684, 686, 648 (flush after creation) Create file in 16461 bytes: 970, 967, 963, 963, 971 (no flush after creation) 598, 664, 711, 653, 623 (flush after creation) In summary: a. Using fopen to test a file existence in check_file.c will never report failure, no matter buffer is flushed or not. b. Using CreateFile (read mode) to test a file existence in check_file.c will always report failure. The bigger the file size will cause more failure reported; the flush operation after file creation in test_file_flush.py will cause more failure reported; the flush operation after new file creation in check_file.c will cause more failure reported; no flush operation in both check_file.c and test_file_flush.py almost cause no failure. I don't know what's root cause: GIL, Python thread state switch, Python file buffer or Windows FileCreate API. I'm just certain there's race condition between Python and Windows. Added file: http://bugs.python.org/file12807/checkfile.c ___ Python tracker <http://bugs.python.org/issue4944> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4944] os.fsync() doesn't work as expect in Windows
Javen Wang added the comment: The test code and script which can reproduce the problem has been uploaded. Added file: http://bugs.python.org/file12808/test_file_flush.py ___ Python tracker <http://bugs.python.org/issue4944> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4944] os.fsync() doesn't work as expect in Windows
Javen Wang added the comment: I'm using Python2.5 in which close_fds is not available in Windows. And I cannot use Python2.6 because I need to redirect the stdout and stderr and there's performance concern. I have questions on the root cause: a. why doesn't fopen() has sharing issue? b. why don't os.close() and FileObject.close() really close the file? ___ Python tracker <http://bugs.python.org/issue4944> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4944] os.fsync() doesn't work as expect in Windows
Javen Wang added the comment: I don't have Python2.6 installed. I just have ever read a bench article comparing the performance between different version of Python, including 2.5, 2.6 and 3.0. That article shows, for the same script, Python2.6 has longer running time than Python2.5. My application is a build system and the users are very care about the time spent in build. That's why I have the performance concern. What concerns me more is the stdout/stderr redirection. Python2.6 manual says they cannot be redirected if close_fds is set to True. My application relies on the I/O redirection to control the screen output from subprocess. Actually I have had a work-around for this issue. It works very well so far. I reported a bug here just because I want the Python to be better. I learned it one year ago but I love this language. I just hope nobody else encounter such problem again. If you guys think it won't be fixed in Python 2.5 or has been fixed in Python 2.6, it's OK to close this tracker. Thanks for your time. ___ Python tracker <http://bugs.python.org/issue4944> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com