----- Original Message -----
> From: "Nir Soffer" <[email protected]>
> To: "Trey Dockendorf" <[email protected]>
> Cc: "users" <[email protected]>, "Michal Skrivanek" <[email protected]>
> Sent: Wednesday, February 12, 2014 10:04:04 AM
> Subject: Re: [Users] Host Non-Operational from sanlock and VM fails to migrate
> 
[...]
> The vm was starting a migration to the other host:
> 
> Thread-26::DEBUG::2014-02-03 07:49:18,067::BindingXMLRPC::965::vds::(wrapper)
> client [192.168.202.99]::call vmMigrate with ({'tunneled': 'false',
> 'dstqemu': '192.168.202.103',
> 'src': 'vm01.brazos.tamu.edu', 'dst': 'vm02.brazos.tamu.edu:54321', 'vmId':
> '741f9811-db68-4dc4-a88a-7cb9be576e57', 'method': 'online'},) {} flowID
> [7829ae2a]
> Thread-26::DEBUG::2014-02-03 07:49:18,067::API::463::vds::(migrate)
> {'tunneled': 'false', 'dstqemu': '192.168.202.103', 'src':
> 'vm01.brazos.tamu.edu', 'dst': 'vm02.brazos.tamu.
> edu:54321', 'vmId': '741f9811-db68-4dc4-a88a-7cb9be576e57', 'method':
> 'online'}
> Thread-26::DEBUG::2014-02-03 07:49:18,068::BindingXMLRPC::972::vds::(wrapper)
> return vmMigrate with {'status': {'message': 'Migration in progress',
> 'code': 0}, 'progress': 0}
> 
> The migration was almost complete after 20 seconds:
> 
> Thread-29::INFO::2014-02-03 07:49:38,329::vm::815::vm.Vm::(run)
> vmId=`741f9811-db68-4dc4-a88a-7cb9be576e57`::Migration Progress: 20 seconds
> elapsed, 99% of data processed, 99% of mem processed
> 
> But it never completed:
> 
> Thread-29::WARNING::2014-02-03 07:54:38,383::vm::792::vm.Vm::(run)
> vmId=`741f9811-db68-4dc4-a88a-7cb9be576e57`::Migration is stuck: Hasn't
> progressed in 300.054134846 seconds. Aborting.
> 
> CCing Michal to inspect why the migration has failed.

Hi,

I had a look at the logs, and this looks like another libvirt/QEMU I/O-related
issue.

If QEMU on the source host cannot reliably access storage, the migration may
get stuck. That appears to be the case here: VDSM detected that the migration
was not progressing and aborted it.
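
The stall detection itself is simple in principle. A minimal sketch of the idea
(illustrative names, not VDSM's actual code; the 300-second timeout matches the
"Hasn't progressed in 300 seconds" message in the log above):

```python
import time

class MigrationMonitor:
    """Flags a migration whose reported progress percentage has not
    advanced within `timeout` seconds, so the caller can abort it."""

    def __init__(self, timeout=300, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_progress = -1
        self.last_change = clock()

    def report(self, progress):
        """Feed a progress sample; return True when the job looks stuck."""
        if progress > self.last_progress:
            self.last_progress = progress
            self.last_change = self.clock()
            return False
        return self.clock() - self.last_change > self.timeout

# Replaying the samples from the log above with a fake clock:
t = [0.0]
mon = MigrationMonitor(timeout=300, clock=lambda: t[0])
t[0] = 20.0
print(mon.report(99))   # 99% after 20 seconds: still progressing -> False
t[0] = 321.0
print(mon.report(99))   # same 99% more than 300s later: stuck -> True
```

Note that this only detects the *absence of progress*; it cannot tell why the
migration stalled, which is exactly the limitation discussed below.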

libvirt provides a flag to detect these scenarios, VIR_MIGRATE_ABORT_ON_ERROR,
which we already use, but unfortunately it is not yet 100% reliable, for the
reasons outlined below in this mail.
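
For reference, this is roughly how a management layer passes that flag through
the libvirt Python bindings. The flag names are real libvirt API; the domain
name and URIs are made up, and actually running this needs libvirt-python and a
reachable libvirtd, so the sketch only records whether it could run:

```python
status = "sketch only"
try:
    import libvirt

    # VIR_MIGRATE_ABORT_ON_ERROR asks libvirt to cancel the migration job
    # if an I/O error is reported -- which presumes QEMU reports one.
    flags = (libvirt.VIR_MIGRATE_LIVE
             | libvirt.VIR_MIGRATE_ABORT_ON_ERROR)

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("vm-example")            # hypothetical domain
    dom.migrateToURI3("qemu+tcp://dst.example.com/system", {}, flags)
    status = "migration requested"
except Exception as exc:  # no bindings / no libvirtd when run as a sketch
    print("not run here:", exc)

print(status)
```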

We are aware of this issue and are actively working to improve the handling
of such scenarios, but most of that work is on the QEMU side.

The core issue is that when we use NFS (or iSCSI) and an I/O error occurs,
QEMU can get blocked inside the kernel, waiting for the faulty I/O operation
to complete, and thus never reports an I/O error. The outcome depends on which
specific operation fails, and there are many possible cases and error
scenarios.

Of course, if QEMU is blocked and fails to report the I/O error, libvirt can
do nothing to report or recover from the error, and VDSM can do even less.
This is known and acknowledged by both the libvirt and QEMU developers.
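
A toy illustration of why this is so hard to handle (pure Python, nothing to do
with QEMU internals): when an operation blocks instead of failing, the layers
above see no error at all, only the absence of progress after a timeout.

```python
import threading

def stuck_io(done):
    # Stand-in for an I/O request blocked inside the kernel: it never
    # completes, so no error is ever raised for anyone to report.
    done.wait()

done = threading.Event()
worker = threading.Thread(target=stuck_io, args=(done,), daemon=True)
worker.start()

# All the management side can observe is that the operation never returns;
# there is no error to act on, only a hang it must guess the meaning of.
worker.join(timeout=0.2)
stuck = worker.is_alive()
print("stuck, no error reported" if stuck else "completed")
done.set()  # unblock the worker for a clean exit
```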

But there is some good news, because newer versions of QEMU bring improvements
in this area: QEMU recently gained native block drivers[1], which, among other
things, make it more robust in the presence of I/O errors and should improve
error reporting as well. RHEL 7 should ship a version of QEMU with native
iSCSI support; hopefully NFS will follow soon enough.

HTH,

+++

[1] for example, native iSCSI support, recently merged:
http://comments.gmane.org/gmane.comp.emulators.qemu/92599
Work on NFS is ongoing.

-- 
Francesco Romani
RedHat Engineering Virtualization R & D
IRC: fromani
_______________________________________________
Users mailing list
[email protected]
http://lists.ovirt.org/mailman/listinfo/users
