Over the weekend, a few zones including libreswan.org failed validation. The logs showed a stream of:
Sep 21 14:11:56 ns0 ods-enforcerd: No change to: /var/opendnssec/signconf/libreswan.org.xml Sep 21 14:33:10 ns0 ods-signerd: [zone] unable to publish dnskeys for zone libreswan.org: error creating dnskey Sep 21 14:33:10 ns0 ods-signerd: [tools] unable to read zone libreswan.org: failed to publish dnskeys (General error) Sep 21 14:33:10 ns0 ods-signerd: [worker[4]] CRITICAL: failed to sign zone libreswan.org: General error Sep 21 14:33:10 ns0 ods-signerd: [worker[4]] backoff task [read] for zone libreswan.org with 3600 seconds Sep 21 15:11:57 ns0 ods-enforcerd: Zone libreswan.org found. Sep 21 15:11:57 ns0 ods-enforcerd: Policy for libreswan.org set to default. Sep 21 15:11:57 ns0 ods-enforcerd: Config will be output to /var/opendnssec/signconf/libreswan.org.xml. Sep 21 15:11:57 ns0 ods-enforcerd: No change to: /var/opendnssec/signconf/libreswan.org.xml A restart did not seem to help: Sep 22 09:44:25 ns0 ods-signerd: [hsm] sign init: CKR_KEY_HANDLE_INVALID Sep 22 09:44:25 ns0 ods-signerd: [hsm] error signing rrset with libhsm Sep 22 09:44:25 ns0 ods-signerd: [rrset] unable to sign RRset[6]: lhsm_sign() failed Sep 22 09:44:25 ns0 ods-signerd: [worker[3]] sign zone httpca.org failed: 1 RRsets failed Sep 22 09:44:25 ns0 ods-signerd: [worker[3]] CRITICAL: failed to sign zone httpca.org: General error Sep 22 09:44:25 ns0 ods-signerd: [worker[3]] backoff task [sign] for zone httpca.org with 3600 seconds This one seems due to a service restart leading to two ods-signerd's running. One not referenced in the pid file. I used a hammer to restart. Unfortunately, it remained broken: Sep 22 10:25:34 ns0 ods-signerd: [worker[1]] CRITICAL: failed to sign zone libreswan.org: General error Sep 22 10:25:34 ns0 ods-signerd: [worker[1]] backoff task [read] for zone libreswan.org with 60 seconds Sep 22 10:25:34 ns0 ods-signerd: [tools] unable to read zone libreswan.com: failed to publish dnskeys (General error) Sep 22 10:25:34 ns0 ods-signerd: [worker[4]] CRITICAL: failed to sign zone libreswan.com: General error Sep 22 10:25:34 ns0 ods-signerd: [worker[4]] backoff task [read] for zone libreswan.com with 60 seconds Sep 22 10:25:34 ns0 ods-signerd: [hsm] unable to get key: key f11737ffedb1723031545fe488c89bad not found Sep 22 10:25:34 ns0 ods-signerd: [zone] unable to publish dnskeys for zone httpca.org: error creating dnskey Sep 22 10:25:34 ns0 ods-signerd: [tools] unable to read zone httpca.org: failed to publish dnskeys (General error) Sep 22 10:25:34 ns0 ods-signerd: [worker[2]] CRITICAL: failed to sign zone httpca.org: General error Sep 22 10:25:34 ns0 ods-signerd: [worker[2]] backoff task [read] for zone httpca.org with 60 seconds Looking around I saw: [root@ns0 log]# ods-ksmutil key list --verbose |grep "NOT IN" SQLite database set to: /var/opendnssec/kasp.db libreswan.org ZSK active 2014-09-25 05:44:55 (retire) 2048 8 e9fa25e8214d2920c9489069bdfe61e6 SoftHSM NOT IN repository libreswan.net ZSK active 2014-09-25 05:44:55 (retire) 2048 8 635cc83526dfafc2c76605ae7db96ea3 SoftHSM NOT IN repository libreswan.com ZSK active 2014-09-25 05:44:55 (retire) 2048 8 cd09e7a69e890ee7ad41a30e9a323b73 SoftHSM NOT IN repository httpca.org ZSK active 2014-09-23 03:44:15 (retire) 2048 8 f11737ffedb1723031545fe488c89bad SoftHSM NOT IN repository It is possible that testing with softhsm-2 and then reverting to softhsm-1 caused these to happen, if these keys were generated during the 2 days of running softhsm-v2. Running ods-ksmutil key delete --cka_id <id> --force seemed to have killed it and forced it to generate new ones Sep 22 10:37:00 ns0 ods-enforcerd: INFO: Promoting ZSK from publish to active as this is the first pass for the zone Sep 22 10:37:00 ns0 ods-enforcerd: WARNING: Making non-backed up ZSK active, PLEASE make sure that you know the potential problems of using keys which are not recoverable Sep 22 10:37:00 ns0 ods-enforcerd: Called signer engine: /usr/sbin/ods-signer update httpca.org Sep 22 10:37:00 ns0 ods-enforcerd: Disconnecting from Database... Sep 22 10:37:00 ns0 ods-enforcerd: Sleeping for 3600 seconds. Sep 22 10:37:00 ns0 ods-signerd: [signconf] zone httpca.org signconf: RESIGN[PT7200S] REFRESH[PT604800S] VALIDITY[PT1209600S] DENIAL[PT1209600S] JITTER[PT43200S] OFFSET[PT3600S] NSEC[50] DNSKEYTTL[PT3600S] SOATTL[PT3600S] MINIMUM[PT3600S] SERIAL[datecounter] etc. I guess I really need to look at shipping a mechanism to ensure CRITICAL errors are mailed to root to prevent zone outages. Paul _______________________________________________ Opendnssec-user mailing list Opendnssec-user@lists.opendnssec.org https://lists.opendnssec.org/mailman/listinfo/opendnssec-user