Public bug reported: Description =========== If I try to schedule a VM immediately after restarting nova-scheduler service, I get this error on every second scheduling request:
keystoneauth1.exceptions.connection.SSLError: SSL exception connecting to https://api.ng1.os.ops.xx.xx:8778/allocation_candidates?limit=1000&member_of=in.... (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2578)'))) NOTE: Full trace in attachment. Is most likely due to the fact that after nova-scheduler restart, several worker processes based on the parent of SchedulerManager() are created in nova/cmd/scheduler.py respectively in ../oslo_service/service.py and SchedulerManager(), in its __init__() method, initializes the placement client by calling the report.report_client_singleton() method from scheduler/client/report.py, which should always return the same* instance of class SchedulerReportClient(): which creates the actual client also within its __init__() method self._client = self._create_client() line:230 approx. This already opens a real socket on the placement service. But this socket is inherited by descendants, so the whole situation after the restart (workers=2) is as follows: root@tt-os1-lab1:~# for P in $(pgrep -f "usr/bin/python3 /usr/bin/nova-scheduler");do echo $P ;lsof -p $P |grep :8778 ;done PID:1693940 nova-sche 1693940 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED) PID:1693999 nova-sche 1693999 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED) PID:1694000 nova-sche 1694000 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED) NOTE: Do you notice that everyone has the same socket open. If scheduling occurs while this shared socket is open, the first scheduling will pass without error and the worker will open a new solo connection for it. The situation looks like this: PID:1693940 nova-sche 1693940 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED) PID:1693999 nova-sche 1693999 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED) PID:1694000 nova-sche 1694000 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED) nova-sche 1694000 nova 21u IPv4 2361698804 0t0 TCP tt-os1-lab1.ko.xx.xx:47444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED) And finally the second attempt in the sequence fails and the open tcp connections on placement look like this: root@tt-os1-lab1:~# for P in $(pgrep -f "usr/bin/python3 /usr/bin/nova-scheduler");do echo $P ;lsof -p $P |grep :8778 ;done 1693940 1693999 1694000 nova-sche 1694000 nova 21u IPv4 2361698804 0t0 TCP tt-os1-lab1.ko.xx.xx:47444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED) >From this moment on, each process opens its own connection and everything runs >fine. But I don't understand why the first request passes without error. Steps to reproduce ================== * placement service with keepalive set up (in my case 65s) * more that 1 nova-scheduler worker (in my case worker==2) * restart nova-scheduler * and make two scheduling calls within the keealive time interval * second call gives you SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC error Expected result =============== The parent process does not open any socket for placement, or closes it before the child forks Environment =========== * Ubuntu 22.04.5 LTS * nova and nova-scheduler 29.0.1-0ubuntu1.4~cloud0 * I also checked the current code in the master repository and the situation is the same there * python3-openssl 21.0.0-1 * openssl 3.0.2-0ubuntu1.18 Related patches =========== https://review.opendev.org/c/x/tobiko/+/880152 https://review.opendev.org/c/openstack/freezer/+/456758 *) Actually, I don't know if the singleton works correctly and as expected when not using thready but separate forked processes. Is that OK? ** Affects: nova Importance: Undecided Status: New ** Attachment added: "decryption failed or bad record mac error" https://bugs.launchpad.net/bugs/2089388/+attachment/5839439/+files/nova-scheduler-SSL_DECRYPTION_FAILED_OR_BAD_RECORD_MAC_full_log.txt -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/2089388 Title: nova-scheduler error: SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC Status in OpenStack Compute (nova): New Bug description: Description =========== If I try to schedule a VM immediately after restarting nova-scheduler service, I get this error on every second scheduling request: keystoneauth1.exceptions.connection.SSLError: SSL exception connecting to https://api.ng1.os.ops.xx.xx:8778/allocation_candidates?limit=1000&member_of=in.... (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2578)'))) NOTE: Full trace in attachment. Is most likely due to the fact that after nova-scheduler restart, several worker processes based on the parent of SchedulerManager() are created in nova/cmd/scheduler.py respectively in ../oslo_service/service.py and SchedulerManager(), in its __init__() method, initializes the placement client by calling the report.report_client_singleton() method from scheduler/client/report.py, which should always return the same* instance of class SchedulerReportClient(): which creates the actual client also within its __init__() method self._client = self._create_client() line:230 approx. This already opens a real socket on the placement service. But this socket is inherited by descendants, so the whole situation after the restart (workers=2) is as follows: root@tt-os1-lab1:~# for P in $(pgrep -f "usr/bin/python3 /usr/bin/nova-scheduler");do echo $P ;lsof -p $P |grep :8778 ;done PID:1693940 nova-sche 1693940 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED) PID:1693999 nova-sche 1693999 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED) PID:1694000 nova-sche 1694000 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED) NOTE: Do you notice that everyone has the same socket open. If scheduling occurs while this shared socket is open, the first scheduling will pass without error and the worker will open a new solo connection for it. The situation looks like this: PID:1693940 nova-sche 1693940 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED) PID:1693999 nova-sche 1693999 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED) PID:1694000 nova-sche 1694000 nova 11u IPv4 2361702700 0t0 TCP tt-os1-lab1.ko.xx.xx:34444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED) nova-sche 1694000 nova 21u IPv4 2361698804 0t0 TCP tt-os1-lab1.ko.xx.xx:47444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED) And finally the second attempt in the sequence fails and the open tcp connections on placement look like this: root@tt-os1-lab1:~# for P in $(pgrep -f "usr/bin/python3 /usr/bin/nova-scheduler");do echo $P ;lsof -p $P |grep :8778 ;done 1693940 1693999 1694000 nova-sche 1694000 nova 21u IPv4 2361698804 0t0 TCP tt-os1-lab1.ko.xx.xx:47444->tt-os1-lab1.ko.xx.xx:8778 (ESTABLISHED) From this moment on, each process opens its own connection and everything runs fine. But I don't understand why the first request passes without error. Steps to reproduce ================== * placement service with keepalive set up (in my case 65s) * more that 1 nova-scheduler worker (in my case worker==2) * restart nova-scheduler * and make two scheduling calls within the keealive time interval * second call gives you SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC error Expected result =============== The parent process does not open any socket for placement, or closes it before the child forks Environment =========== * Ubuntu 22.04.5 LTS * nova and nova-scheduler 29.0.1-0ubuntu1.4~cloud0 * I also checked the current code in the master repository and the situation is the same there * python3-openssl 21.0.0-1 * openssl 3.0.2-0ubuntu1.18 Related patches =========== https://review.opendev.org/c/x/tobiko/+/880152 https://review.opendev.org/c/openstack/freezer/+/456758 *) Actually, I don't know if the singleton works correctly and as expected when not using thready but separate forked processes. Is that OK? To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/2089388/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp