On 12/04/17 00:45, Glenn Enright wrote:
> On 12/04/17 10:23, Andrew Cooper wrote:
>> On 11/04/2017 23:13, Glenn Enright wrote:
>>> On 11/04/17 21:49, Dietmar Hahn wrote:
>>>> On Tuesday, 11 April 2017, 20:03:14, Glenn Enright wrote:
>>>>> On 11/04/17 17:59, Juergen Gross wrote:
>>>>>> On 11/04/17 07:25, Glenn Enright wrote:
>>>>>>> Hi all
>>>>>>>
>>>>>>> We are seeing an odd issue when destroying domU domains with
>>>>>>> xl destroy: under recent 4.9 kernels a (null) domain is left behind.
>>>>>>
>>>>>> I guess this is the dom0 kernel version?
>>>>>>
>>>>>>> This has occurred on a variety of hardware, with no obvious
>>>>>>> commonality.
>>>>>>>
>>>>>>> 4.4.55 does not show this behavior.
>>>>>>>
>>>>>>> On my test machine I have the following packages installed under
>>>>>>> centos6, from https://xen.crc.id.au/
>>>>>>>
>>>>>>> ~]# rpm -qa | grep xen
>>>>>>> xen47-licenses-4.7.2-4.el6.x86_64
>>>>>>> xen47-4.7.2-4.el6.x86_64
>>>>>>> kernel-xen-4.9.21-1.el6xen.x86_64
>>>>>>> xen47-ocaml-4.7.2-4.el6.x86_64
>>>>>>> xen47-libs-4.7.2-4.el6.x86_64
>>>>>>> xen47-libcacard-4.7.2-4.el6.x86_64
>>>>>>> xen47-hypervisor-4.7.2-4.el6.x86_64
>>>>>>> xen47-runtime-4.7.2-4.el6.x86_64
>>>>>>> kernel-xen-firmware-4.9.21-1.el6xen.x86_64
>>>>>>>
>>>>>>> I've also replicated the issue with 4.9.17 and 4.9.20
>>>>>>>
>>>>>>> To replicate, on a cleanly booted dom0 with one pv VM, I run the
>>>>>>> following on the VM
>>>>>>>
>>>>>>> {
>>>>>>> while true; do
>>>>>>>  dd bs=1M count=512 if=/dev/zero of=test conv=fdatasync
>>>>>>> done
>>>>>>> }
>>>>>>>
>>>>>>> Then on the dom0 I do this sequence to reliably get a null domain.
>>>>>>> This occurs with oxenstored and xenstored both.
>>>>>>>
>>>>>>> {
>>>>>>> xl sync 1
>>>>>>> xl destroy 1
>>>>>>> }
>>>>>>>
>>>>>>> xl list then renders something like ...
>>>>>>>
>>>>>>> (null)                                       1     4     4     --p--d       9.8     0
>>>>>>
>>>>>> Something is referencing the domain, e.g. some of its memory pages
>>>>>> are still mapped by dom0.
>>>>
>>>> You can try
>>>> # xl debug-keys q
>>>> and further
>>>> # xl dmesg
>>>> to see the output of the previous command. The 'q' dumps domain
>>>> (and guest debug) info.
>>>> # xl debug-keys h
>>>> prints all possible parameters, if you need more information.
>>>>
>>>> Dietmar.
>>>>
>>>
>>> I've done this as requested, below is the output.
>>>
>>> <snip>
>>> (XEN) Memory pages belonging to domain 1:
>>> (XEN)     DomPage 0000000000071c00: caf=00000001, taf=7400000000000001
>>> (XEN)     DomPage 0000000000071c01: caf=00000001, taf=7400000000000001
>>> (XEN)     DomPage 0000000000071c02: caf=00000001, taf=7400000000000001
>>> (XEN)     DomPage 0000000000071c03: caf=00000001, taf=7400000000000001
>>> (XEN)     DomPage 0000000000071c04: caf=00000001, taf=7400000000000001
>>> (XEN)     DomPage 0000000000071c05: caf=00000001, taf=7400000000000001
>>> (XEN)     DomPage 0000000000071c06: caf=00000001, taf=7400000000000001
>>> (XEN)     DomPage 0000000000071c07: caf=00000001, taf=7400000000000001
>>> (XEN)     DomPage 0000000000071c08: caf=00000001, taf=7400000000000001
>>> (XEN)     DomPage 0000000000071c09: caf=00000001, taf=7400000000000001
>>> (XEN)     DomPage 0000000000071c0a: caf=00000001, taf=7400000000000001
>>> (XEN)     DomPage 0000000000071c0b: caf=00000001, taf=7400000000000001
>>> (XEN)     DomPage 0000000000071c0c: caf=00000001, taf=7400000000000001
>>> (XEN)     DomPage 0000000000071c0d: caf=00000001, taf=7400000000000001
>>> (XEN)     DomPage 0000000000071c0e: caf=00000001, taf=7400000000000001
>>> (XEN)     DomPage 0000000000071c0f: caf=00000001, taf=7400000000000001
>>
>> There are 16 pages still referenced from somewhere.
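
To check that count without reading the dump by eye, something like the
sketch below may help. It assumes the 'q' output has the same layout as
the excerpt quoted above ("Memory pages belonging to domain N:" followed
by "DomPage" lines); count_dompages is a made-up helper name, not an xl
subcommand.

```shell
#!/bin/sh
# Hypothetical helper: count the "DomPage" lines listed for one domain
# in `xl dmesg` output read from stdin. Assumes the log layout shown in
# the excerpt above; other Xen versions may format this differently.
count_dompages() {
    domid="$1"
    awk -v domid="$domid" '
        # A new per-domain page list starts here; remember whether it
        # belongs to the domain we were asked about.
        /Memory pages belonging to domain/ {
            in_dom = ($0 ~ ("domain " domid ":"))
            next
        }
        # Count page entries only while inside the wanted list.
        in_dom && /DomPage/ { n++ }
        END { print n + 0 }
    '
}
```

Possible use on the dom0, after the destroy has left the (null) domain
behind:

    xl debug-keys q
    xl dmesg | count_dompages 1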

Just a wild guess: could you please try the attached kernel patch? This
might give us some more diagnostic data...


Juergen
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index 8fe61b5dc5a6..304d5d130e0c 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -313,7 +313,7 @@ static int xen_blkif_disconnect(struct xen_blkif *blkif)
 static void xen_blkif_free(struct xen_blkif *blkif)
 {
 
-	xen_blkif_disconnect(blkif);
+	WARN_ON(xen_blkif_disconnect(blkif));
 	xen_vbd_free(&blkif->vbd);
 
 	/* Make sure everything is drained before shutting down */
@@ -505,7 +505,7 @@ static int xen_blkbk_remove(struct xenbus_device *dev)
 	dev_set_drvdata(&dev->dev, NULL);
 
 	if (be->blkif)
-		xen_blkif_disconnect(be->blkif);
+		WARN_ON(xen_blkif_disconnect(be->blkif));
 
 	/* Put the reference we set in xen_blkif_alloc(). */
 	xen_blkif_put(be->blkif);
@@ -792,7 +792,7 @@ static void frontend_changed(struct xenbus_device *dev,
 			 * Clean up so that memory resources can be used by
 			 * other devices. connect_ring reported already error.
 			 */
-			xen_blkif_disconnect(be->blkif);
+			WARN_ON(xen_blkif_disconnect(be->blkif));
 			break;
 		}
 		xen_update_blkif_status(be->blkif);
@@ -803,7 +803,7 @@ static void frontend_changed(struct xenbus_device *dev,
 		break;
 
 	case XenbusStateClosed:
-		xen_blkif_disconnect(be->blkif);
+		WARN_ON(xen_blkif_disconnect(be->blkif));
 		xenbus_switch_state(dev, XenbusStateClosed);
 		if (xenbus_dev_is_online(dev))
 			break;
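
With the patch applied, any failing xen_blkif_disconnect() call should
show up as a WARN_ON trace in the dom0 kernel log. A rough way to pull
those traces out is sketched below. It assumes the kernel's "WARNING:"
line includes the source file path, and that grep supports -A (GNU grep
does); blkback_warnings is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: print kernel WARNING lines that mention
# xen-blkback, plus the following lines of the trace, from a kernel log
# read on stdin. The optional first argument is the number of trailing
# context lines (default 20).
blkback_warnings() {
    # -A is a GNU/BSD grep extension, not POSIX.
    grep -A "${1:-20}" 'WARNING:.*xen-blkback'
}
```

Possible use on the dom0, after reproducing the (null) domain:

    dmesg | blkback_warnings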
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel
