Hi, I'm finally gearing up to transition my local OpenDNSSEC + SoftHSM over to "supported" versions (2.x all around).

Quite a while back I set up a test system to perform the migration on, and left it running. Apparently ods-signerd had stopped or crashed without my noticing (the enforcer continued to run, though). Now that I have come around to re-activating the test installation, I find that ods-signerd all too often SEGVs and aborts, e.g. half a CPU-hour after startup.

I have not found the actual root cause or a robustness fix for this problem. What I did find is this: the crash is triggered by the call to ixfr_del_rr() in this section of code, inside rrset_sign() in signer/src/signer/rrset.c:

    for(i=0; i<nrrsigs; i++) {
        if(matchedsignatures[i].signature == NULL) {
            if (rrsigs[i] != NULL) {
                if (zone->db->is_initialized) {
                    pthread_mutex_lock(&zone->ixfr->ixfr_lock);
                    ixfr_del_rr(zone->ixfr, rrsigs[i]->rr);
                    pthread_mutex_unlock(&zone->ixfr->ixfr_lock);
                }
                while((signature = collection_iterator(rrset->rrsigs))) {
                    if(signature == rrsigs[i]) {
                        collection_del_cursor(rrset->rrsigs);
                    }
                }
            }
        } else
            ++reusedsigs;
    }

The crash itself happened inside the ldns library:

    (gdb) where
    #0  0x00007f7ff78484cc in ldns_rr_owner (rr=0x48333c66ffe51200) at ./rr.c:913
    #1  0x00007f7ff78491d0 in ldns_rr_clone (rr=0x48333c66ffe51200) at ./rr.c:1404
    #2  0x0000000000414ca8 in ixfr_del_rr (ixfr=0x7f7ff7ea7df0, rr=0x48333c66ffe51200) at signer/ixfr.c:134
    #3  0x0000000000419319 in rrset_sign (ctx=ctx@entry=0x7f7ff138a000, rrset=rrset@entry=0x7f7fdb345f40, signtime=1613494513) at signer/rrset.c:758
    #4  0x000000000040f42e in drudge (worker=0x7f7ff7e8b700) at daemon/signertasks.c:196
    #5  0x000000000043b1c4 in runthread (data=0x7f7fee13fcd0) at janitor.c:318
    #6  0x00007f7ff540c072 in ?? () from /usr/lib/libpthread.so.1
    #7  0x00007f7ff5887bb0 in ?? () from /usr/lib/libc.so.12
    #8  0x0000000000200000 in ?? ()
    #9  0x0000000000000000 in ?? ()
    (gdb)

The 'rr' pointer in frames 0-2 is clearly bogus.

I looked at the rrsigs[x]->rr pointers in the debugger (I attached gdb to the process after starting it), and at the time of the crash all of them pointed to unmapped memory.

What I found was that my /var/opendnssec/tmp contained a lot of leftover and rather old *.xfrd-state files. Once I removed those (or rather moved them elsewhere, which is what I did), ods-signerd no longer crashed, and according to the logs it continues to do useful work. Where it previously crashed after about 30 minutes, it has now consumed more than 330 minutes of CPU time and is still going.

I am guessing that the stale contents of the *.xfrd-state files violated some assumptions built into the code. I could have wished for some more sanity checking and robustness...

So ... even though neither the actual fault nor a fix has been found, this may prove useful as a workaround to consider should you face a similar situation (however unlikely that is...).

Regards,

- Håvard
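
P.S. For anyone wanting to try the same workaround, here is a minimal sketch of the move step. The function name move_xfrd_state and the holding directory in the usage line are made up for illustration; only /var/opendnssec/tmp and the *.xfrd-state pattern come from the setup described above. It moves the files aside rather than deleting them, so they can be put back if needed.

```shell
# Sketch only: move *.xfrd-state files from one directory to another.
# The function name and both arguments are illustrative assumptions,
# not part of OpenDNSSEC.
move_xfrd_state() {
    src=$1     # directory holding the state files, e.g. /var/opendnssec/tmp
    dest=$2    # holding directory; files are kept here, not deleted
    mkdir -p "$dest" || return 1
    for f in "$src"/*.xfrd-state; do
        # When nothing matches, the pattern is left as a literal string; skip it.
        [ -e "$f" ] || continue
        mv -- "$f" "$dest"/ || return 1
    done
}
```

Usage would be something like `move_xfrd_state /var/opendnssec/tmp /var/opendnssec/xfrd-state.old`, presumably best run while ods-signerd is stopped so that it does not write the files back in the meantime.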