Hello,
I am having a problem that is apparently related to memmove and am looking
for some advice on how to investigate further. This winter I have been working
to simplify GLZA source code and make it more readable. GLZA is an advanced
open source straight-line grammar compressor first released in 2015.
Among these changes was replacing some rather bloated code with memmove and
memset in various locations. The program started crashing occasionally and
after extensively reviewing the changes, I was unable to find a cause for these
crashes. So I installed gdb to try to find out what was going on and was
apparently able to find the cause of the problem. As a new gdb user, I am not
very comfortable with trusting the results of what gdb is showing, but it is
pointing directly to one of the code changes I made. I backed out of this code
change and the program has not crashed after 3 days of nearly continuous
testing.
So here is what gdb reports when backtrace is run immediately after it
reports a "SIGTRAP":
(gdb) bt full
#0 0x00007ff9dd8aa98b in KERNELBASE!DebugBreak () from /cygdrive/c/Windows/system32/KERNELBASE.dll
        No symbol table info available.
#1 0x00007ff9ca3b6417 in cygwin1!.assert () from /cygdrive/c/Windows/cygwin1.dll
        No symbol table info available.
#2 0x00007ff9ca3cfb18 in secure_getenv () from /cygdrive/c/Windows/cygwin1.dll
        No symbol table info available.
#3 0x00007ff9e03dd82d in ntdll!.chkstk () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
        No symbol table info available.
#4 0x00007ff9e038916b in ntdll!RtlRaiseException () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
        No symbol table info available.
#5 0x00007ff9e03dc9ee in ntdll!KiUserExceptionDispatcher () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
        No symbol table info available.
#6 0x00007ff9ca3b12a9 in memmove () from /cygdrive/c/Windows/cygwin1.dll
        No symbol table info available.
#7 0x0000000100409a7c in rank_scores_thread (arg=0x6ffece890010) at GLZAcompress.c:904
        new_score_rank = 2633
        new_score_lmi2 = 183964750
        new_score_pmi2 = 183964725
        rank = 4380
        max_rank = 2633
        num_symbols = 25
        new_score_lmi = 92079851
        new_score_pmi = 92079826
        thread_data_ptr = 0x6ffece890010
        max_scores = 4883
        candidates_index = 0xa00034470
        score_index = 4380
        node_score_num_symbols = 7
        num_candidates = 4381
        node_ptrs_num = 12224
        local_write_index = 12225
        rank_scores_buffer = 0x6ffece890020
        candidates = 0x6ffece990020
        score = 47.6283531
#8 0x00007ff9ca412eec in cygwin1!.getreent () from /cygdrive/c/Windows/cygwin1.dll
        No symbol table info available.
#9 0x00007ff9ca3b47d3 in cygwin1!.assert () from /cygdrive/c/Windows/cygwin1.dll
        No symbol table info available.
#10 0x0000000000000000 in ?? ()
        No symbol table info available.
GLZAcompress.c line 904 is as follows and is in code that runs as a separate
thread created in main:
    memmove(&candidates_index[new_score_rank+1], &candidates_index[new_score_rank],
            2 * (rank - new_score_rank));
This does point directly to where a code change was made.
candidates_index is allocated in main and not ever intentionally changed until
deallocated at the end of program execution:
    if (0 == (candidates_index = (uint16_t *)malloc(max_scores * sizeof(uint16_t))))
        fprintf(stderr, "ERROR - memory allocation failed\n");
This value is passed to the thread in a structure pointed to by the thread arg.
The value 0xa00034470 for candidates_index is similar to what is reported on
subsequent runs with added code to print this value so I don't think it's
corrupted, but would need to duplicate the crash after checking the initial
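value to be 100% certain. With gdb reporting that rank = 4380 and
new_score_rank = 2633 at the time of the SIGTRAP, this should be a move of
1747 uint16_t values (3494 bytes) up by one element. The source and
destination overlap and their addresses differ by only 2 bytes, so the copy
has to run backward, from the highest address down. The last element written
is candidates_index[4380], which is within the 4883 elements allocated.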
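To make that arithmetic concrete, here is a minimal single-threaded sketch that performs the same shift both ways and compares the results. The function names (shift_up_memmove, shift_up_loop, shifts_agree) are made up for this illustration and are not from GLZAcompress.c; the asserts are bounds checks I would assume hold in the real code, based on the values gdb reported.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: shift candidates_index[new_score_rank .. rank-1]
   up by one element with memmove, as in the changed code. */
static void shift_up_memmove(uint16_t *candidates_index, size_t new_score_rank,
                             size_t rank, size_t max_scores)
{
    assert(new_score_rank <= rank);  /* otherwise the byte count wraps around */
    assert(rank < max_scores);       /* the last element written is index rank */
    memmove(&candidates_index[new_score_rank + 1],
            &candidates_index[new_score_rank],
            (rank - new_score_rank) * sizeof(uint16_t));
}

/* Hypothetical helper: the same shift as a plain loop that copies
   from the highest index down, like the original hand-written code. */
static void shift_up_loop(uint16_t *candidates_index, size_t new_score_rank,
                          size_t rank)
{
    for (size_t i = rank; i > new_score_rank; i--)
        candidates_index[i] = candidates_index[i - 1];
}

/* Returns 1 if both shifts give identical arrays, 0 if they differ,
   and -1 if memory allocation fails. */
static int shifts_agree(size_t new_score_rank, size_t rank, size_t max_scores)
{
    uint16_t *a = malloc(max_scores * sizeof(uint16_t));
    uint16_t *b = malloc(max_scores * sizeof(uint16_t));
    int same;

    if (a == NULL || b == NULL) {
        free(a);
        free(b);
        return -1;
    }
    for (size_t i = 0; i < max_scores; i++)
        a[i] = b[i] = (uint16_t)(i * 7 + 1);  /* arbitrary fill pattern */

    shift_up_memmove(a, new_score_rank, rank, max_scores);
    shift_up_loop(b, new_score_rank, rank);
    same = (memcmp(a, b, max_scores * sizeof(uint16_t)) == 0);

    free(a);
    free(b);
    return same;
}
```

Calling shifts_agree(2633, 4380, 4883) reproduces the sizes from the backtrace. A check like this only tests the single-threaded behavior, so passing it would not rule out a problem that depends on timing or on other threads.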
Prior to this code change and for the last 3 days I have been using this code
instead and not seen any crashes:
    uint16_t * score_ptr = &candidates_index[new_score_rank];
    uint16_t * candidate_ptr = &candidates_index[rank];
    while (candidate_ptr >= score_ptr + 8) {
      *candidate_ptr = *(candidate_ptr - 1);
      *(candidate_ptr - 1) = *(candidate_ptr - 2);
      *(candidate_ptr - 2) = *(candidate_ptr - 3);
      *(candidate_ptr - 3) = *(candidate_ptr - 4);
      *(candidate_ptr - 4) = *(candidate_ptr - 5);
      *(candidate_ptr - 5) = *(candidate_ptr - 6);
      *(candidate_ptr - 6) = *(candidate_ptr - 7);
      *(candidate_ptr - 7) = *(candidate_ptr - 8);
      candidate_ptr -= 8;
    }
    while (candidate_ptr > score_ptr) {
      *candidate_ptr = *(candidate_ptr - 1);
      candidate_ptr--;
    }
Yes, it's bloated code that should do the same thing as the memmove, but most
importantly the code has never caused any problems. Interestingly, even this
code shows memmove in the assembly code (gcc -S), but only for the second while
loop. The looping code for the first while loop looks like this and moves 8
uint16_t's in just 5 instructions, so it is perhaps not as inefficient as the
source code looks:
.L25:
        movdqu  -16(%rax), %xmm1
        subq    $16, %rax
        movups  %xmm1, 2(%rax)
        cmpq    %rdx, %rax
        jnb     .L25
It may or may not matter, but the code this is happening on is very CPU
intensive - there can be up to 8 threads running at the same time when this
problem occurs. The problem doesn't occur consistently; it seems to be rather
random. The program runs about 500 iterations of ranking up to the top 30,000
new grammar rule candidates over nearly 4 hours on my test case and has crashed
on different iterations each time it has crashed, even though the thread that
seems to be crashing should be seeing exactly the same data each time the
program is run. The malloc'ed array address could be changing; I haven't
checked that.
I find it really hard to believe there is a bug in memmove but that seems to be
what gdb and my testing are indicating. So I am looking for advice on how to
better understand what is causing the program to crash. I would like to review
the code memmove is using, but have not been able to figure out how to track
that down. Any help in understanding what code the compiler is using for
memmove would be helpful. Are there other things I could possibly be
overlooking? Are there any other things I should review or report that would be
helpful? I could try to write a simplified test case if that would be useful.
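As a starting point for such a test case, something along these lines might work. It is only a sketch: the names (stress_thread, run_memmove_stress, STRESS_ELEMS) are invented here, each thread works on its own private buffer rather than on GLZA's real data, and the positions come from a simple pseudo-random sequence rather than real scores. It would need to be built with -pthread.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define STRESS_ELEMS 4883        /* max_scores value seen in the backtrace */
#define STRESS_MAX_THREADS 64    /* arbitrary limit for this sketch */

struct stress_arg {
    unsigned iterations;  /* number of shifts this thread performs */
    unsigned seed;        /* per-thread seed for the position generator */
    int failures;         /* mismatches found, or -1 if malloc failed */
};

/* Each thread repeatedly shifts a range up by one element with memmove
   and checks the result against a plain backward copy loop. */
static void *stress_thread(void *p)
{
    struct stress_arg *arg = p;
    const size_t bytes = STRESS_ELEMS * sizeof(uint16_t);
    uint16_t *buf = malloc(bytes);
    uint16_t *ref = malloc(bytes);
    unsigned state = arg->seed;

    arg->failures = 0;
    if (buf == NULL || ref == NULL) {
        free(buf);
        free(ref);
        arg->failures = -1;
        return NULL;
    }
    for (size_t i = 0; i < STRESS_ELEMS; i++)
        buf[i] = ref[i] = (uint16_t)i;

    for (unsigned n = 0; n < arg->iterations; n++) {
        /* Linear congruential generator; picks new_score_rank <= rank. */
        state = state * 1103515245u + 12345u;
        size_t rank = (state >> 8) % STRESS_ELEMS;
        state = state * 1103515245u + 12345u;
        size_t new_score_rank = (state >> 8) % (rank + 1);

        memmove(&buf[new_score_rank + 1], &buf[new_score_rank],
                (rank - new_score_rank) * sizeof(uint16_t));
        for (size_t i = rank; i > new_score_rank; i--)
            ref[i] = ref[i - 1];

        if (memcmp(buf, ref, bytes) != 0) {
            arg->failures++;
            memcpy(buf, ref, bytes);  /* resynchronize and keep going */
        }
    }
    free(buf);
    free(ref);
    return NULL;
}

/* Runs num_threads stress threads at once. Returns the total number of
   mismatches, or -1 if a thread could not be set up. */
static int run_memmove_stress(int num_threads, unsigned iterations)
{
    pthread_t tids[STRESS_MAX_THREADS];
    struct stress_arg args[STRESS_MAX_THREADS];
    int started = 0, total = 0, setup_failed = 0;

    if (num_threads < 1 || num_threads > STRESS_MAX_THREADS)
        return -1;
    for (int i = 0; i < num_threads; i++) {
        args[i].iterations = iterations;
        args[i].seed = (unsigned)i + 1u;
        args[i].failures = 0;
        if (pthread_create(&tids[i], NULL, stress_thread, &args[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
        if (args[i].failures < 0)
            setup_failed = 1;
        else
            total += args[i].failures;
    }
    return (setup_failed || started < num_threads) ? -1 : total;
}
```

For example, run_memmove_stress(8, 1000000) would match the 8 threads mentioned above. A run that returns 0 without crashing proves little by itself given how rare the crash is, but a SIGTRAP or a nonzero count from a program this small would be much easier to report and investigate than the full compressor.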
Best Regards,
Kennon Conrad
cygcheck.out
Description: Binary data
Makefile.bin
Description: Binary data

