> On 17.12.2014 at 08:13, Laszlo Ersek <ler...@redhat.com> wrote:
>
>> On 12/16/14 21:41, Peter Maydell wrote:
>>> On 16 December 2014 at 19:00, Laszlo Ersek <ler...@redhat.com> wrote:
>>> The root of this question is what each of
>>>
>>> enum device_endian {
>>>     DEVICE_NATIVE_ENDIAN,
>>>     DEVICE_BIG_ENDIAN,
>>>     DEVICE_LITTLE_ENDIAN,
>>> };
>>>
>>> means.
>>
>> An opening remark: endianness is a horribly confusing topic and
>> support of more than one endianness is even worse. I may have made
>> some inadvertent errors in this reply; if you think so please
>> let me know and I'll have another stab at it.
>>
>> That said: the device_endian options indicate what a device's
>> internal idea of its endianness is. This is most relevant if
>> a device accepts accesses at wider than byte width
>> (for instance, if you can read a 32-bit value from
>> address offset 0 and also an 8-bit value from offset 3 then
>> how do those values line up? If you read a 32-bit value then
>> which way round is it compared to what the device's io read/write
>> function return?).
>>
>> NATIVE_ENDIAN means "same order as the CPU's main data bus's
>> natural representation". (Note that this is not necessarily
>> the same as "the endianness the CPU currently has"; on ARM
>> you can flip the CPU between LE and BE at runtime, which
>> is basically inserting a byte-swizzling step between data
>> accesses and the CPU's data bus, which is always LE for ARMv7+.)
>>
>> Note that RAM accessible to a KVM guest is always NATIVE_ENDIAN
>> (because the guest vcpu reads it directly with the h/w cpu).
>>
>>> Consider the following call tree, which implements the splitting /
>>> combining of an MMIO read:
>>>
>>> memory_region_dispatch_read() [memory.c]
>>>   memory_region_dispatch_read1()
>>>     access_with_adjusted_size()
>>>       memory_region_big_endian()
>>>       for each byte in the wide access:
>>>         memory_region_read_accessor()
>>>           fw_cfg_data_mem_read() [hw/nvram/fw_cfg.c]
>>>             fw_cfg_read()
>>>   adjust_endianness()
>>>     memory_region_wrong_endianness()
>>>     bswapXX()
>>>
>>> The function access_with_adjusted_size() always iterates over the MMIO
>>> address range in incrementing address order. However, it can calculate
>>> the shift count for memory_region_read_accessor() in two ways.
>>>
>>> When memory_region_big_endian() returns "true", the shift count
>>> decreases as the MMIO address increases.
>>>
>>> When memory_region_big_endian() returns "false", the shift count
>>> increases as the MMIO address increases.
>>
>> Yep, because this is how the device has told us that it thinks
>> of accesses as being put together.
>>
>> The column in your table "host value" is the 16 bit value read from
>> the device, ie what we have decided (based on device_endian) that
>> it would have returned us if it had supported a 16 bit read directly
>> itself. BE devices compose 16 bit values with the high byte first,
>> LE devices with the low byte first, and native-endian devices
>> in the same order as guest-endianness.
>>
>>> In memory_region_read_accessor(), the shift count is used for a logical
>>> (ie. numeric) bitwise left shift (<<). That operator works with numeric
>>> values and hides (ie. accounts for) host endianness.
>>>
>>> Let's consider
>>> - an array of two bytes, [0xaa, 0xbb],
>>> - a uint16_t access made from the guest,
>>> - and all twelve possible cases.
>>>
>>> In the table below, the "host", "guest" and "device_endian" columns are
>>> input variables. The columns to the right are calculated / derived
>>> values.
>>>
>>> After memory_region_dispatch_read1() constructs the value that is
>>> visible in the "host value" column, it is passed to adjust_endianness().
>>> If memory_region_wrong_endianness() returns "true", then the host value
>>> is byte-swapped. The result is then passed to the guest.
>>>
>>>
>>>  #  host  guest  device_endian  memory_region_big_endian()  host value  host repr.    memory_region_wrong_endianness()  guest repr.   guest value
>>> --  ----  -----  -------------  --------------------------  ----------  ------------  --------------------------------  ------------  -----------
>>>  1  LE    LE     native         0                           0xbbaa      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xbbaa
>>>  2  LE    LE     BE             1                           0xaabb      [0xbb, 0xaa]  1                                 [0xaa, 0xbb]  0xbbaa
>>>  3  LE    LE     LE             0                           0xbbaa      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xbbaa
>>>
>>>  4  LE    BE     native         1                           0xaabb      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xbbaa
>>>  5  LE    BE     BE             1                           0xaabb      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xbbaa
>>>  6  LE    BE     LE             0                           0xbbaa      [0xaa, 0xbb]  1                                 [0xbb, 0xaa]  0xbbaa
>>>
>>>  7  BE    LE     native         0                           0xbbaa      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xaabb
>>>  8  BE    LE     BE             1                           0xaabb      [0xaa, 0xbb]  1                                 [0xbb, 0xaa]  0xaabb
>>>  9  BE    LE     LE             0                           0xbbaa      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xaabb
>>>
>>> 10  BE    BE     native         1                           0xaabb      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xaabb
>>> 11  BE    BE     BE             1                           0xaabb      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xaabb
>>> 12  BE    BE     LE             0                           0xbbaa      [0xbb, 0xaa]  1                                 [0xaa, 0xbb]  0xaabb
>>
>> The column you have labelled 'guest repr' here is the returned data
>> from io_mem_read, whose API contract is "write the data from the
>> device into this host C uint16_t (or whatever) such that it is the
>> value returned by the device read as a native host value". It's
>> not related to the guest order at all.
>>
>> So for instance, io_mem_read() is called by cpu_physical_memory_rw(),
>> which passes it a local variable "val". So now "val" has the
>> "guest repr" column's bytes in it, and (as a host C variable) the
>> value:
>> 1,2,3 : 0xbbaa
>> 4,5,6 : 0xaabb
>> 7,8,9 : 0xbbaa
>> 10,11,12 : 0xaabb
>>
>> As you can see, this depends on the "guest endianness" (which is
>> kind of the endianness of the bus): a BE guest 16 bit access to
>> this device would return the 16 bit value 0xaabb, and an LE guest
>> 0xbbaa, and we have exactly those values in our host C variable.
>> If this is TCG, then we'll copy that 16 bit host value into the
>> CPUState struct field corresponding to the destination guest
>> register as-is. (TCG CPUState fields are always in host-C-order.)
>>
>> However, to pass them up to KVM we have to put them into a
>> buffer in RAM as per the KVM_EXIT_MMIO ABI. So:
>> cpu_physical_memory_rw() calls stw_p(buf, val) [which is "store
>> in target CPU endianness"], so now buf has the bytes:
>> 1,2,3 : [0xaa, 0xbb]
>> 4,5,6 : [0xaa, 0xbb]
>> 7,8,9 : [0xaa, 0xbb]
>> 10,11,12 : [0xaa, 0xbb]
>>
>> ...which is the same for every case.
>>
>> This buffer is (for KVM) the run->mmio.data buffer, whose semantics
>> are "the value as it would appear if the VCPU performed a load or store
>> of the appropriate width directly to the byte array". Which is what we
>> want -- your device has two bytes in order 0xaa, 0xbb, and we did
>> a 16 bit load in the guest CPU, so we should get the same answer as if
>> we did a 16 bit load of RAM containing 0xaa, 0xbb. That will be
>> 0xaabb if the VCPU is big-endian, and 0xbbaa if it is not.
>>
>> I think the fact that all of these things come out to the same
>> set of bytes in the mmio.data buffer is the indication that all
>> QEMU's byte swapping is correct.
>>
>>> Looking at the two rightmost columns, we must conclude:
>>>
>>> (a) The "device_endian" field has absolutely no significance wrt. what
>>> the guest sees. In each triplet of cases, when we go from
>>> DEVICE_NATIVE_ENDIAN to DEVICE_BIG_ENDIAN to DEVICE_LITTLE_ENDIAN,
>>> the guest sees the exact same value.
>>>
>>> I don't understand this result (it makes me doubt device_endian
>>> makes any sense). What did I do wrong?
>>
>> I think the reason you didn't see any difference between the
>> various device_endian settings is that you defined your device
>> as only supporting byte accesses. If a device presents itself as
>> just an array of bytes then it doesn't have an endianness, really.
>>
>> As Paolo says, if you make your device support wider accesses
>> directly and build up the value yourself then you'll see that
>> setting the device_endian to LE vs BE vs native does change
>> the values you see in the guest (and that you'll need to set it
>> to LE and interpret the words in the guest as LE to get the
>> right behaviour).
>>
>>> This means that this interface is *value preserving*, not
>>> representation preserving. In other words: when host and guest
>>> endiannesses are identical, the *array* is transferred okay (the
>>> guest array matches the initial host array [0xaa, 0xbb]). When guest
>>> and host differ in endianness, the guest will see an incorrect
>>> *array*.
>>
>> Think of a device which supports only byte accesses as being
>> like a lump of RAM. A big-endian guest CPU which reads 32 bits
>> at a time will get different values in registers to an LE guest
>> which does the same.
>>
>> *However*, if both CPUs just do "read 32 bits; write 32 bits to
>> RAM" (ie a kind of memcpy but with the source being the mmio
>> register rather than some other bit of RAM) then you should get
>> a bytewise-correct copy of the data in RAM.
>>
>> So I *think* that would be a viable approach: have your QEMU
>> device code as it is now, and just make sure that if the guest
>> is doing wider-than-byte accesses it does them as
>> do {
>>     load word from mmio register;
>>     store word to RAM;
>> } while (count);
>>
>> and doesn't try to interpret the byte order of the values while
>> they're in the CPU register in the middle of the loop.
>>
>> Paolo's suggestion would work too, if you prefer that.
>>
>>> I apologize for wasting everyone's time, but I think both results are
>>> very non-intuitive.
>>
>> Everything around endianness handling is non-intuitive --
>> it's the nature of the problem, I'm afraid. (Some of this is
>> also QEMU's fault for not having good documentation of the
>> semantics of each of the layers involved in memory accesses.
>> I have on my todo list the vague idea of trying to write these
>> all up as a blog post.)
>
> Thanks for taking the time to write this up. My analysis must have
> missed at least two important things then:
> - the device's read/write function needs to consider address & size, and
> return values that match host byte order. fw_cfg doesn't conform ATM.
> - there's one more layer outside the call tree that I checked that can
> perform endianness conversion.
>
> I'll try to implement Paolo's suggestion (ie. support wide accesses in
> fw_cfg internally, not relying on splitting/combining by memory.c).
Awesome :). Please define it as device little endian while you're at it. That
should give us fewer headaches if we want to support wide accesses on ppc.
Alex