Hi, we have now made our second failed attempt at converting our production OpenDNSSEC 1.4.14 installation to 2.1.8.
So far we have found one specific problem and one more vague problem: 1) We're using <NSEC3> for denial-of-existence. NSEC3 uses a "salt" value as an input value to the process. If we move away the old /var/opendnssec/signconf/ directory and create it anew, OpenDNSSEC will populate it with an xml file per zone. However, they all have this part: <Denial> <NSEC3> <Hash> <Algorithm>1</Algorithm> <Iterations>5</Iterations> <Salt>0</Salt> </Hash> </NSEC3> </Denial> This mixes just *terribly* with OpenDNSSEC's code, because of ods_status nsec3params_create_salt(const char* salt_str, uint8_t* salt_len, uint8_t** salt) { ... *salt_len = (uint8_t) strlen(salt_str); if (*salt_len == 1 && salt_str[0] == '-') { *salt_len = 0; *salt = NULL; return ODS_STATUS_OK; } else if (*salt_len % 2 != 0) { ods_log_error("[%s] invalid salt %s", nsec3_str, salt_str); *salt = NULL; return ODS_STATUS_ERR; } ... and of course we get the "invalid salt 0" message logged, together with this sequence: Mar 4 22:23:02 tilfeldigvis ods-signerd: [file] openfile /var/opendnssec/signconf/4.0.6.6.7.7.7.4.nrenum.net.xml count 1 Mar 4 22:23:02 tilfeldigvis ods-signerd: [nsec3] invalid salt 0 Mar 4 22:23:02 tilfeldigvis ods-signerd: [nsec3] unable to create: create salt failed Mar 4 22:23:02 tilfeldigvis ods-signerd: [signconf] unable to read signconf /var/opendnssec/signconf/4.0.6.6.7.7.7.4.nrenum.net.xml: nsec3params_create() failed Mar 4 22:23:02 tilfeldigvis ods-signerd: [signconf] unable to update signconf: failed to read file /var/opendnssec/signconf/4.0.6.6.7.7.7.4.nrenum.net.xml (Memory allocation error) Mar 4 22:23:02 tilfeldigvis ods-signerd: [zone] unable to load signconf for zone 4.0.6.6.7.7.7.4.nrenum.net: signconf /var/opendnssec/signconf/4.0.6.6.7.7.7.4.nrenum.net.xml Memory allocation error Mar 4 22:23:02 tilfeldigvis ods-signerd: [tools] unable to load signconf for zone 4.0.6.6.7.7.7.4.nrenum.net: Memory allocation error Mar 4 22:23:02 tilfeldigvis ods-signerd: WARNING: unable to sign zone 4.0.6.6.7.7.7.4.nrenum.net, signconf is not ready Mar 4 22:23:02 tilfeldigvis ods-signerd: back-off task [configure] for zone 4.0.6.6.7.7.7.4.nrenum.net with 480 seconds Mar 4 22:23:02 tilfeldigvis ods-signerd: [scheduler] schedule task [configure] for 4.0.6.6.7.7.7.4.nrenum.net The "Memory allocation error" messages are just Terribly Misleading, and might send an operator on a wild goose chase about adjusting OpenDNSSEC's resource limits. They're lies, I tell you! Now, the old signconf files typically *do* have a non-zero salt value, e.g. <Denial> <NSEC3> <Hash> <Algorithm>1</Algorithm> <Iterations>5</Iterations> <Salt>cd79fa0214ff93b7</Salt> </Hash> </NSEC3> </Denial> and I suspect this data is just a reflection of what's in the kasp.db file. So ... did the NSEC3 not get converted / transferred to the new kasp.db file in the conversion? Why does it end up as the single-character 0 in OpenDNSSEC 2? It ended up as "0" in the policy table in the converted kasp.db file. Should it instead have been NULL or the empty string? One of my colleagues who helped me doing the conversion used sqlite3 to change the global config in kasp.db to use NSEC instead of NSEC3, and that seemed to have brought the process further along. 2) We seem to have issues related to SoftHSM2, which we're converting to at the same time. The problem we're having is that the level of error messages related to SoftHSM2 is nearly binary: either it works or it doesn't, and not a word about "why" when it doesn't. As an operator this leaves me stumped about what's really going on and what possible correction can be made. You might find this set of log messages in the log: Mar 4 22:28:50 tilfeldigvis ods-signerd: [zone] unable to publish keys for zone 0.0.1.0.0.0.0.0.0.0.7.0.1.0.0.2.ip6.arpa: error creating libhsm context Mar 4 22:28:50 tilfeldigvis ods-signerd: [tools] unable to read zone 0.0.1.0.0.0.0.0.0.0.7.0.1.0.0.2.ip6.arpa: failed to publish dnskeys (HSM error) Mar 4 22:28:50 tilfeldigvis ods-signerd: CRITICAL: failed to sign zone 0.0.1.0.0.0.0.0.0.0.7.0.1.0.0.2.ip6.arpa: HSM error Mar 4 22:28:50 tilfeldigvis ods-signerd: back-off task [read] for zone 0.0.1.0.0.0.0.0.0.0.7.0.1.0.0.2.ip6.arpa with 60 seconds Of course, it doesn't get any better after 60 seconds. If we follow the first "error creatign libhsm context", we get to signer/src/signer/zone.c's zone_publish_dnskeys(), which does: /* hsm access */ if (!skip_hsm_access) { ctx = hsm_create_context(); if (ctx == NULL) { ods_log_error("[%s] unable to publish keys for zone %s: " "error creating libhsm context", zone_str, zone->name); return ODS_STATUS_HSM_ERR; } } and if we nest on hsm_create_context() to libhsm/src/lib/libhsm.c, which basically does hsm_ctx_clone() based on that module's global _hsm_ctx within locking. Now, hsm_ctx_clone() does hsm_session_clone(), and returns NULL if that fails. hsm_session_clone() in turn does hsm_get_slot_id(), which may store nice error messages such as hsm_ctx_set_error(ctx, HSM_ERROR, "hsm_get_slot_id()", "No slots found in HSM"); hsm_ctx_set_error(ctx, HSM_ERROR, "hsm_get_slot_id()", "Too many slots found in HSM"); hsm_ctx_set_error(ctx, HSM_ERROR, "hsm_get_slot_id()", "Could not allocate slot ID table"); hsm_ctx_set_error(ctx, -1, "hsm_get_slot_id()", "could not find token with the name %s", token_name); in the "ctx" parameter. However, remember, that is the module- global _hsm_ctx. Apparently there's no logging done in libhsm.c of any errors (it's a library, according to the name), and _hsm_ctx isn't readily available in the file which calls hsm_create_context(), so the opportunity of getting a proper error message to explain *why* the HSM module is unhappy is lost. I would call this an "interface design error". Now, I'm not entirely certain how one might go about verifying that the HSM module is happy with the SoftHSM2 database. At least the ods-migrate step succeeded, it said: ods @ tilfeldigvis: {27} ods-migrate Reading config file '/usr/pkg/etc/opendnssec/conf.xml'.. Connecting to HSM.. Connecting to database.. Computing keytags, this could take a while. Added keytags for 1653 keys. Finishing.. ods@ tilfeldigvis: {28} and took nearly 70 CPU minutes (yikes!) Also, ods-enforcer is showing all the keys -- an example: Keys: Zone: Keytype: State: Date of next transition: Size: Algorithm: CKA_ID: Repository: KeyTag: 0.0.7.5.5.3.7.7.4.nrenum.net KSK active now 2048 8 90ef217860f6f15368d3b4b3aedf71d8 SoftHSM 33904 0.9.4.8.4.6.7.4.nrenum.net KSK active now 2048 8 7f4796b2d79055a95c79fdad430bf2fa SoftHSM 40595 1.0.7.5.5.3.7.7.4.nrenum.net KSK active now 2048 8 de5f61ba82664928dbc6c26377595b1c SoftHSM 43282 1.9.4.8.4.6.7.4.nrenum.net KSK active now 2048 8 7af945a7ed5188d398e541a30f2eea7a SoftHSM 14144 2.5.4.2.2.7.4.nrenum.net KSK active now 2048 8 aafcc72f2b33f2cacec3216c37021b1e SoftHSM 56149 2.9.4.8.4.6.7.4.nrenum.net KSK active now 2048 8 77d95adf77729bcf6e89d2800206aade SoftHSM 19222 3.5.4.2.2.7.4.nrenum.net KSK active now 2048 8 b3929e9d4686f6d14350fd8efc3a07ef SoftHSM 33047 3.9.4.8.4.6.7.4.nrenum.net KSK active now 2048 8 c22698b4bafe25812adfe33218c395fb SoftHSM 60532 However, I'm not sure whether listing the keys actually causes access to the SoftHSM. Is there anything else I can/should do to minimally verify that the SoftHSM2 module is working as intended? So ... this leaves me without any answer as to what might be wrong and what I ought to do about it. Help! Best regards, - HÃ¥vard _______________________________________________ Opendnssec-user mailing list Opendnssec-user@lists.opendnssec.org https://lists.opendnssec.org/mailman/listinfo/opendnssec-user