On 03/29/2016 01:01 PM, Kaustubh Kelkar wrote:
-----Original Message-----
From: Rick Jones [mailto:rick.jon...@hpe.com]
Sent: Tuesday, March 29, 2016 1:43 PM
To: openstack@lists.openstack.org
Subject: Re: [Openstack] Compute downloading corrupted image from Glance
On 03/29/2016 10:17 AM, Kaustubh Kelkar wrote:
Every time I tried to download the image on the compute, I get a new
hash value (albeit, a wrong one).
On the compute node, what is the type of NIC and its driver and such?
[Kaustubh] It is an Intel X710 NIC with i40e driver. The NIC is part of the
integrated card on a Dell R730.
lspci -v | grep -A 1 Ethernet
[Kaustubh] (Output redacted to show only the relevant interface)
01:00.1 Ethernet controller: Intel Corporation Ethernet 10G 2P X710 Adapter
(rev 01)
Subsystem: Dell Device 0000
It wasn't assigned a sub-device ID? (Device 0000). I'm not all that
familiar with Dell kit but that seems a trifle odd.
ethtool -i <interfacename>
[Kaustubh] root@dchi:/home/kkelkar# ethtool -i em2
driver: i40e
version: 1.4.25
firmware-version: 4.41 0x80001863 16.5.20
bus-info: 0000:01:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
And are any of the stateless offloads enabled?
ethtool -k <interfacename>
[Kaustubh] root@dchi:/home/kkelkar# ethtool -k em2
Features for em2:
trimmed...
Those would include checksum offload, and things built on top of it
like TSO, GSO, LRO and/or GRO.
If you find that checksum offload is enabled, and you disable it,
does the corrupt image download problem go away? If so, you have a
problem with your NIC and/or its driver getting the offloads wrong
and/or corrupting the traffic in a place outside the protection of
the offloaded checksumming. One of the central assumptions with the
likes of checksum offload in a NIC is that anything "above" the
checksum offload in the NIC has some sort of data protection - at
least parity, if not ECC. This includes components in the NIC
itself, the I/O bus etc etc.
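
To make that assumption concrete, here is a toy Python sketch. It is not a model of the i40e driver or the X710 hardware. It assumes a single bit flip somewhere between host memory and the point where the checksum is computed, and it uses the 16-bit one's-complement sum in the style of RFC 1071. The names flip_bit, transmit and receiver_accepts are made up for illustration.

```python
from typing import Tuple


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def flip_bit(data: bytes, index: int = 0) -> bytes:
    """Hypothetical stand-in for corruption on the bus or inside the NIC."""
    return data[:index] + bytes([data[index] ^ 0x01]) + data[index + 1:]


def transmit(payload: bytes, offload: bool) -> Tuple[bytes, int]:
    """Return (bytes on the wire, checksum on the wire) for the toy model."""
    if offload:
        # Corruption happens before the NIC computes the checksum,
        # so the checksum covers the already damaged bytes.
        damaged = flip_bit(payload)
        return damaged, internet_checksum(damaged)
    # Host computes the checksum over good bytes, corruption comes later.
    checksum = internet_checksum(payload)
    return flip_bit(payload), checksum


def receiver_accepts(data: bytes, checksum: int) -> bool:
    return internet_checksum(data) == checksum


if __name__ == "__main__":
    for offload in (True, False):
        data, csum = transmit(b"glance image chunk", offload)
        print("offload", offload, "accepted", receiver_accepts(data, csum))
```

With offload on, the receiver accepts the damaged bytes because the checksum was computed after the damage. With offload off, the same damage is caught. That is why disabling the offload is a useful experiment here.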
If disabling checksum offload on the compute node doesn't resolve the
matter, you might consider the same on the controller.
[Kaustubh] I ended up disabling checksumming, TSO, GSO and GRO on
both controller and the compute so the ethtool output looks as above.
Now, the problem can only be reproduced intermittently. At times,
compute node still gets a corrupted image.
Ah, that ethtool -k output was after, not before. I was initially
confused because I'd not expected the offloads to be disabled by default.
If the issue is still intermittent I'd *guess* it was timing related.
You might see if there are any increases in the bad checksum stats in
netstat.
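
If it is easier to script than to eyeball netstat, something like the following Python sketch could be run before and after a download to compare the counters. It assumes a Linux kernel that exposes an InCsumErrors field in /proc/net/snmp, which older kernels may not. The function names are made up for illustration.

```python
def checksum_error_counters(snmp_text: str) -> dict:
    """Return {protocol: InCsumErrors} from text in /proc/net/snmp format."""
    counters = {}
    lines = snmp_text.splitlines()
    # The file comes in pairs: a line of field names, then a line of values.
    for header, values in zip(lines[::2], lines[1::2]):
        proto, names = header.split(":", 1)
        _, numbers = values.split(":", 1)
        fields = dict(zip(names.split(), numbers.split()))
        if "InCsumErrors" in fields:
            counters[proto] = int(fields["InCsumErrors"])
    return counters


def read_counters(path: str = "/proc/net/snmp") -> dict:
    """Read the live counters; the path is the usual Linux location."""
    with open(path) as f:
        return checksum_error_counters(f.read())


if __name__ == "__main__":
    print(read_counters())
```

A counter that grows across a glance download on either node would point at corruption being caught on receive.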
Other bits of straw-grasping would include, but not be limited to:
*) Transfer the image via scp and see if that always works OK
*) Run something like netperf TCP_STREAM or iperf and see if you see
checksum errors accumulating.
*) Perhaps create a fake image of the same size with a fixed pattern and
transfer that via glance and see if it ever complains. If it does, you
can look to see where the pattern breaks in terms of offset into the
file and how it breaks. If it is then reproducible you can then
consider getting packet traces at either end and looking through those
to see if it was indeed good or bad at the sender and such.
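
For the fixed-pattern idea, a minimal Python sketch might look like the following. The pattern is my own choice, not anything glance requires: every 8-byte block stores its own byte offset, so a damaged block tells you where it should have been. The names write_pattern and first_mismatch are hypothetical.

```python
import struct

BLOCK = 8            # each block holds its own offset as a big-endian 64-bit int
CHUNK = 64 * 1024    # write in 64 KiB pieces; must be a multiple of BLOCK


def write_pattern(path: str, size: int) -> None:
    """Create a file of exactly `size` bytes filled with offset-stamped blocks."""
    with open(path, "wb") as f:
        for start in range(0, size, CHUNK):
            end = min(start + CHUNK, size)
            buf = b"".join(struct.pack(">Q", o) for o in range(start, end, BLOCK))
            f.write(buf[: end - start])  # trim a partial final block


def first_mismatch(path: str):
    """Return (offset, expected, found) for the first bad block, or None."""
    offset = 0
    with open(path, "rb") as f:
        while True:
            found = f.read(BLOCK)
            if not found:
                return None
            expected = struct.pack(">Q", offset)[: len(found)]
            if found != expected:
                return offset, expected, found
            offset += len(found)


if __name__ == "__main__":
    write_pattern("pattern.img", 1024 * 1024)
    print(first_mismatch("pattern.img"))
```

Run write_pattern on the controller with the size of the real image, upload the result to glance, then run first_mismatch on the copy that lands on the compute node. The block-at-a-time scan is slow for multi-gigabyte files, but it reports both the offset and the bytes that arrived, which is what you would then look for in the packet traces.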
rick jones