Hello Teepean,

On 11/17/2024 11:32 AM, Teepean via Cygwin wrote:
I raised this issue a couple of years ago on cygwin-developer, but now that the 
problem has manifested again with recent versions of Cygwin I decided to post 
this to the general discussion list.

This (main Cygwin) list is the correct place for reports like this. There is no need to contact me (or other maintainers/devs) off-list.

Given that the result of the investigation a couple years ago was, essentially, no change to Cygwin's malloc*, why has the problem manifested again recently? Have you been benchmarking/testing all along? Can you be more specific about which recent Cygwin versions?

*My own benchmark, building the Cygwin tree, showed that there wasn't much difference between the half-dozen malloc implementations I tried and they were all spending more time in Windows' ntdll.dll than the current Cygwin malloc (==dlmalloc), though a little less time in Cygwin itself.

Steps to Reproduce

1. Compile BWA normally

https://github.com/lh3/bwa/

What's involved with that? Clone the repo, ./configure, make? Anything else?

2. Compile BWA with rpmalloc and the following patch:


// In thread worker function:
#ifdef __CYGWIN__
rpmalloc_thread_initialize();
#endif


// ... thread work ...
#ifdef __CYGWIN__
rpmalloc_thread_finalize(1);
#endif

Where does that patch go? Assume I know nothing about BWA.
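
For readers equally unfamiliar with BWA: per-thread init/finalize calls of this kind normally bracket the body of the thread start routine, so that the initialize call runs before the thread's first allocation and the finalize call runs just before the thread returns. The following is only a minimal sketch of that placement using POSIX threads. The stub_* functions, worker, run_workers and MAX_WORKERS are hypothetical stand-ins invented for illustration so the example builds without rpmalloc; it is not BWA's code, and the actual patch is in the branch linked further down.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define MAX_WORKERS 16   /* hypothetical limit, for the sketch only */

/* Hypothetical stand-ins so this builds without rpmalloc. In the patch quoted
   above, the calls are rpmalloc_thread_initialize() and
   rpmalloc_thread_finalize(1), guarded by #ifdef __CYGWIN__. */
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
static int init_calls;
static int finalize_calls;

static void stub_thread_initialize(void)
{
    pthread_mutex_lock(&count_lock);
    init_calls++;
    pthread_mutex_unlock(&count_lock);
}

static void stub_thread_finalize(int release_caches)
{
    (void)release_caches;   /* unused by the stub */
    pthread_mutex_lock(&count_lock);
    finalize_calls++;
    pthread_mutex_unlock(&count_lock);
}

/* Thread start routine: init first, finalize last, work in between. */
static void *worker(void *arg)
{
    int *ok = arg;

    stub_thread_initialize();      /* before the thread's first allocation */

    char *buf = malloc(64);        /* ... thread work ... */
    *ok = (buf != NULL);
    free(buf);

    stub_thread_finalize(1);       /* just before the thread returns */
    return NULL;
}

/* Starts nthreads workers and returns how many allocated successfully. */
static int run_workers(int nthreads)
{
    pthread_t tid[MAX_WORKERS];
    int ok[MAX_WORKERS] = {0};
    int started = 0, succeeded = 0;

    assert(nthreads >= 1 && nthreads <= MAX_WORKERS);
    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tid[i], NULL, worker, &ok[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tid[i], NULL);
        succeeded += ok[i];
    }
    return succeeded;
}
```

With this shape, every thread that runs worker makes exactly one initialize call and one finalize call, which is the pairing the quoted patch appears to aim for.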

3. Run both versions with the following command:
time ./bwa mem -t 11 chr19_KI270866v1_alt.fasta test_1.fastq test_2.fastq > testorigsingle.sam


Without Patch (Default malloc):


[M::mem_process_seqs] Processed 120000 reads in 30.296 CPU sec, 3.743 real sec
[main] Real time: 3.883 sec; CPU: 30.436 sec
real    0m3.907s
user    0m19.186s
sys     0m11.265s


With Patch (rpmalloc):


[M::mem_process_seqs] Processed 120000 reads in 7.530 CPU sec, 0.702 real sec
[main] Real time: 0.830 sec; CPU: 7.640 sec
real    0m0.868s
user    0m7.343s
sys     0m0.327s

Are these examples of runs one would do "in production"? Or are you running much longer-lasting processing in the usual case?

Analysis

1. The default malloc implementation shows extremely high system time (11.265s) 
compared to the rpmalloc version (0.327s)
2. Total real time is about 4.5x slower with default malloc
3. The dramatic difference in system time suggests heavy contention in the 
memory allocation subsystem
4. The issue only manifests on Cygwin with bwa; the same code performs normally 
on native Linux and macOS

Are you saying there is non-bwa code that runs on Cygwin comparably to Linux and Mac?

5. The issue manifests with recent versions of Cygwin but does work with older 
versions

Again, it would really help if you could give Cygwin versions or at least dates here...
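
For reference, the ratios in points 1 and 2 can be recomputed from the two sets of timings quoted earlier in this message. The small sketch below does only that arithmetic; the struct and function names are made up for illustration, and the numbers are copied from the quoted output, not re-measured.

```c
#include <assert.h>

/* Timings in seconds, copied from the two runs quoted in this message. */
struct run_times {
    double real, user, sys;
};

static const struct run_times default_malloc_run = { 3.907, 19.186, 11.265 };
static const struct run_times rpmalloc_run       = { 0.868,  7.343,  0.327 };

/* How many times larger a is than b. */
static double ratio(double a, double b)
{
    assert(b > 0.0);
    return a / b;
}

/* Share of total CPU time (user + sys) reported in the "sys" column. */
static double sys_fraction(struct run_times t)
{
    return ratio(t.sys, t.user + t.sys);
}
```

Using these helpers, ratio(default_malloc_run.real, rpmalloc_run.real) comes to about 4.5 and ratio(default_malloc_run.sys, rpmalloc_run.sys) to about 34. The sys share of CPU time is roughly 37% for the default-malloc run against roughly 4% for the rpmalloc run.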

The issue becomes more pronounced with higher thread counts

That I believe; dlmalloc as it is currently set up for Cygwin is not the greatest performer for heavy thread usage.
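
One way to check thread scaling independently of BWA would be a small malloc/free loop run at several thread counts. The following is a minimal sketch, assuming POSIX threads and clock_gettime are available. All names in it (alloc_worker, run_alloc_benchmark, MAX_THREADS, struct worker_arg) are hypothetical, and the allocation pattern is arbitrary rather than modelled on BWA's.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime under strict -std modes */
#endif
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define MAX_THREADS 64   /* hypothetical upper bound for the sketch */

struct worker_arg {
    long iterations;   /* malloc/free pairs to attempt */
    long completed;    /* pairs actually completed */
};

/* Each thread repeatedly allocates, touches and frees a small buffer. */
static void *alloc_worker(void *p)
{
    struct worker_arg *arg = p;

    for (long i = 0; i < arg->iterations; i++) {
        size_t size = 16 + (size_t)(i % 64) * 16;   /* 16 to 1024 bytes */
        char *buf = malloc(size);
        if (buf == NULL)
            break;
        memset(buf, 0, size);   /* touch the memory */
        free(buf);
        arg->completed++;
    }
    return NULL;
}

/* Runs nthreads workers and returns elapsed wall-clock seconds, or -1.0 on
   error. *total_completed receives the number of malloc/free pairs done. */
static double run_alloc_benchmark(int nthreads, long iterations,
                                  long *total_completed)
{
    pthread_t tid[MAX_THREADS];
    struct worker_arg args[MAX_THREADS];
    struct timespec start, end;
    int started = 0;

    assert(total_completed != NULL);
    *total_completed = 0;
    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < nthreads; i++) {
        args[i].iterations = iterations;
        args[i].completed = 0;
        if (pthread_create(&tid[i], NULL, alloc_worker, &args[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tid[i], NULL);
        *total_completed += args[i].completed;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    if (started != nthreads)
        return -1.0;
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling run_alloc_benchmark with 1, 2, 4, 8, ... threads and the same per-thread iteration count, once on Cygwin and once on Linux, would show whether elapsed time grows with thread count on one platform but not the other. No results are claimed here; a loop this simple may also be too cache-friendly to reproduce what BWA triggers.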

The patched code is located here in branch Cygwin:


https://github.com/WGSExtract/bwa.git


A simple test suite is included: run bash testsuite.sh. It also contains a 
binary compiled with an older version of Cygwin, called bwa_working.exe


https://drive.google.com/file/d/1jtbQVUAcCmpJM-8Exi0C6pzDXcEo4cV6/view?usp=drive_link

I'll glance at this stuff when I can but I hope to have some answers to my questions above from you to save me some time.

..mark


--
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple
