Hello Teepean,

On 11/17/2024 11:32 AM, Teepean via Cygwin wrote:
I raised this issue a couple of years ago on cygwin-developer, but now that the problem has manifested again with recent versions of Cygwin I decided to post it to the general discussion list.
This (main Cygwin) list is the correct place for reports like this. There is no need to contact me (or other maintainers/devs) off-list.
Given that the result of the investigation a couple years ago was, essentially, no change to Cygwin's malloc*, why has the problem manifested again recently? Have you been benchmarking/testing all along? Can you be more specific about which recent Cygwin versions?
*My own benchmark, building the Cygwin tree, showed that there wasn't much difference between the half-dozen malloc implementations I tried and they were all spending more time in Windows' ntdll.dll than the current Cygwin malloc (==dlmalloc), though a little less time in Cygwin itself.
Steps to Reproduce
1. Compile BWA normally https://github.com/lh3/bwa/
What's involved with that? Clone the repo, ./configure, make? Anything else?
2. Compile BWA with rpmalloc and the following patch:

    // In thread worker function:
    #ifdef __CYGWIN__
    rpmalloc_thread_initialize();
    #endif
    // ... thread work ...
    #ifdef __CYGWIN__
    rpmalloc_thread_finalize(1);
    #endif
Where does that patch go? Assume I know nothing about BWA.
3. Run both versions with the following command:

    time ./bwa mem -t 11 chr19_KI270866v1_alt.fasta test_1.fastq test_2.fastq > testorigsingle.sam

Without Patch (Default malloc):

    [M::mem_process_seqs] Processed 120000 reads in 30.296 CPU sec, 3.743 real sec
    [main] Real time: 3.883 sec; CPU: 30.436 sec
    real 0m3.907s
    user 0m19.186s
    sys 0m11.265s

With Patch (rpmalloc):

    [M::mem_process_seqs] Processed 120000 reads in 7.530 CPU sec, 0.702 real sec
    [main] Real time: 0.830 sec; CPU: 7.640 sec
    real 0m0.868s
    user 0m7.343s
    sys 0m0.327s
Are these examples of runs one would do "in production"? Or are you running much longer-lasting processing in the usual case?
Analysis
1. The default malloc implementation shows extremely high system time (11.265s) compared to the rpmalloc version (0.327s)
2. Total real time is about 4.5x slower with default malloc
3. The dramatic difference in system time suggests heavy contention in the memory allocation subsystem
4. The issue only manifests on Cygwin with bwa; the same code performs normally on native Linux and MacOS
Are you saying there is non-bwa code that runs on Cygwin comparably to Linux and Mac?
5. The issue manifests with recent versions of Cygwin but does work with older versions
Again, it would really help if you could give Cygwin versions or at least dates here...
The issue becomes more pronounced with higher thread counts
That I believe; dlmalloc as it is currently set up for Cygwin is not the greatest performer for heavy thread usage.
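To take BWA out of the picture, something like the following could serve as a stand-alone check of how malloc/free behaves as the thread count goes up. It is only a minimal sketch assuming POSIX threads. The file name, `run_alloc_stress`, `worker`, the allocation sizes and the `BENCH_MAIN` switch are all made up for illustration and come from neither BWA nor Cygwin.

```c
/* malloc_stress.c: hypothetical stand-alone sketch, not part of BWA or Cygwin.
 * Assumed build line: gcc -O2 -pthread -DBENCH_MAIN malloc_stress.c -o malloc_stress
 */
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_THREADS 64

struct worker_arg {
    long iters;      /* malloc/free pairs requested for this thread */
    long completed;  /* pairs this thread actually completed */
};

/* Each thread does nothing but allocate, touch, and free small blocks. */
static void *worker(void *p)
{
    struct worker_arg *arg = p;
    assert(arg != NULL);

    for (long i = 0; i < arg->iters; i++) {
        size_t sz = 16 + (size_t)(i % 64) * 16;   /* sizes from 16 to 1024 bytes */
        char *buf = malloc(sz);
        if (buf == NULL)
            break;
        /* Volatile stores keep the compiler from eliding the malloc/free pair. */
        ((volatile char *)buf)[0] = (char)i;
        ((volatile char *)buf)[sz - 1] = (char)i;
        free(buf);
        arg->completed++;
    }
    return NULL;
}

/* Runs nthreads workers; returns total completed pairs, or -1 on error. */
long run_alloc_stress(int nthreads, long iters_per_thread)
{
    pthread_t tid[MAX_THREADS];
    struct worker_arg args[MAX_THREADS];
    int started = 0;
    long total = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;

    for (int i = 0; i < nthreads; i++) {
        args[i].iters = iters_per_thread;
        args[i].completed = 0;
        if (pthread_create(&tid[i], NULL, worker, &args[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tid[i], NULL);
        total += args[i].completed;
    }
    return started == nthreads ? total : -1;
}

#ifdef BENCH_MAIN
int main(int argc, char **argv)
{
    int nthreads = argc > 1 ? atoi(argv[1]) : 4;
    long iters = argc > 2 ? atol(argv[2]) : 1000000L;
    long total = run_alloc_stress(nthreads, iters);

    if (total < 0) {
        fprintf(stderr, "usage: %s [threads 1..%d] [iterations]\n",
                argv[0], MAX_THREADS);
        return 1;
    }
    printf("%d threads, %ld malloc/free pairs\n", nthreads, total);
    return 0;
}
#endif
```

Built with -DBENCH_MAIN, this could be run as "time ./malloc_stress 1" and then "time ./malloc_stress 11", comparing the sys figures between the two runs and between Cygwin versions. Whether it reproduces what BWA shows is an open question, since BWA's allocation pattern is surely different.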
The patched code is located here in branch Cygwin: https://github.com/WGSExtract/bwa.git

Simple testsuite. Run bash testsuite.sh. The testsuite includes a version compiled with an older version of Cygwin called bwa_working.exe
https://drive.google.com/file/d/1jtbQVUAcCmpJM-8Exi0C6pzDXcEo4cV6/view?usp=drive_link
I'll glance at this stuff when I can but I hope to have some answers to my questions above from you to save me some time.
..mark