Public bug reported:

When the Ceph pool backing Glance is full (and goes read-only), Glance I/O calls never return, and the worker handling the API call is effectively a zombie.
If enough I/O requests are made (for example, 4 when you have 4 workers), Glance can no longer respond to any kind of request. You need to restart Glance to get responses again.

ceph status:
  cluster:
    id:     ce9a32e4-9768-457a-b811-225b710aeb58
    health: HEALTH_ERR
            3 full osd(s)
            3 pool(s) full
            1 pool(s) have no replicas configured

  services:
    mon: 1 daemons, quorum bm0.lxd (age 2h)
    mgr: bm0.lxd(active, since 2h)
    osd: 3 osds: 3 up (since 2h), 3 in (since 2h)

  data:
    pools:   3 pools, 161 pgs
    objects: 6.92k objects, 47 GiB
    usage:   143 GiB used, 6.8 GiB / 150 GiB avail
    pgs:     161 active+clean

ceph osd dump | grep ratio:
  full_ratio 0.95
  backfillfull_ratio 0.9
  nearfull_ratio 0.85

Here's the response from the Apache 2 HTTP server proxying for Glance:

  openstack image delete 0a582014-832a-4f2a-9944-4111812fe6b2
  Failed to delete image with name or ID '0a582014-832a-4f2a-9944-4111812fe6b2': HttpException: 502: Server Error for url: http://10.206.54.243:80/openstack-glance/v2/images/0a582014-832a-4f2a-9944-4111812fe6b2, The proxy server could not handle the requestReason: Error reading from remote server: 502 Proxy Error: Proxy Error: Apache/2.4.52 (Ubuntu) Server at 10.206.54.243 Port 9292: The proxy server received an invalid: response from an upstream server.
  Failed to delete 1 of 1 images.

The last log for these requests at debug level is:

  DEBUG glance_store.location [None req-4cdf1de9-fbe2-49a8-92d4-db0902773af2 e7cc50bfcb1246479c5b9397048377fe d0c1adff192b40e9989460336bab7c8c - - e152fb5db324433ba53d8ead347c6802 e152fb5db324433ba53d8ead347c6802] Registering scheme rbd with {'ceph': {'store': <glance_store._drivers.rbd.Store object at 0x7ff3c5d1b820>, 'location_class': <class 'glance_store._drivers.rbd.StoreLocation'>, 'store_entry': 'rbd'}} register_scheme_backend_map /usr/lib/python3/dist-packages/glance_store/location.py:132

To work around this, I adjusted the full_ratio to allow writes again, and deleted images. But Glance should have a mechanism to detect this condition, or a timeout.
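To illustrate the kind of timeout asked for above, here is a minimal sketch in plain Python. It is not Glance or glance_store code: `call_with_timeout` and `StoreTimeout` are hypothetical names, and the blocking store call is stood in for by any callable. The idea is only that the caller stops waiting after a deadline and can return an error instead of hanging.

```python
import threading
import time


class StoreTimeout(Exception):
    """Hypothetical error raised when a backend call exceeds its deadline."""


def call_with_timeout(func, timeout, *args, **kwargs):
    """Run func(*args, **kwargs) in a daemon thread and wait at most `timeout` seconds.

    Note: the blocked thread is NOT cancelled; it is only abandoned, so the
    caller can report an error instead of waiting forever.
    """
    outcome = {}

    def runner():
        try:
            outcome["value"] = func(*args, **kwargs)
        except Exception as exc:  # hand the error back to the caller
            outcome["error"] = exc

    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(timeout)

    if worker.is_alive():
        raise StoreTimeout(f"backend call did not finish within {timeout}s")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


if __name__ == "__main__":
    # time.sleep stands in for an I/O call that never returns in time.
    try:
        call_with_timeout(time.sleep, 0.1, 2.0)
    except StoreTimeout as exc:
        print("gave up:", exc)
```

A guard like this only frees the caller. The underlying I/O stays blocked until the pool accepts writes again, so a real fix would more likely need a timeout at the Ceph client level.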
Versions:
  glance 27.0.0
  ceph 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)

** Affects: glance
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/2059768

Title:
  glance hangs when rbd pool in read only

Status in Glance:
  New
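As a rough check of the figures quoted in the report (143 GiB used of 150 GiB, with ratios 0.85 / 0.9 / 0.95), the sketch below compares the cluster-wide usage fraction against the three thresholds. The function name `usage_state` is hypothetical, and this is only an approximation: Ceph evaluates these ratios per OSD, not on the cluster total.

```python
def usage_state(used_gib, total_gib, nearfull=0.85, backfillfull=0.90, full=0.95):
    """Return (usage fraction, label) for the given capacity figures.

    Defaults mirror the ratios shown in the report's `ceph osd dump` output.
    """
    ratio = used_gib / total_gib
    if ratio >= full:
        state = "full"
    elif ratio >= backfillfull:
        state = "backfillfull"
    elif ratio >= nearfull:
        state = "nearfull"
    else:
        state = "ok"
    return ratio, state


if __name__ == "__main__":
    ratio, state = usage_state(143, 150)
    print(f"{ratio:.3f} {state}")
```

With the reported numbers, the fraction is 143 / 150, about 0.953, which is just above the 0.95 full_ratio. That is consistent with the HEALTH_ERR "full" state in the status output.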