Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 11/14] ioport: Switch dispatching to memory core layer

Benjamin Herrenschmidt Fri, 12 Jul 2013 16:50:58 -0700

On Sat, 2013-07-13 at 00:10 +0100, Peter Maydell wrote:

> The block marked "byteswap here" does "byte invariant bigendian",
> so byte accesses are unchanged, 16 bit accesses have the two bytes
> of data flipped, and 32 bit accesses have the four bytes flipped;
> this happens as the data passes through; addresses are unchanged.
> It only happens if the CPU is configured by the guest to operate
> in big-endian mode, obviously.
> (Contrast 'word invariant bigendian', which is what ARM used to do,
> where the addresses are changed but the data is not. That would be
> pretty painful to implement in the memory region API though it is
> of course trivial in hardware since it is just XORing of the low
> address bits according to the access size...)
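
To make the two schemes concrete, here is a minimal sketch in C of how I read the description above. It is not QEMU code: bus_read_le(), swap_data() and the two read_*() helpers are made-up names, the "bus" is just a little-endian byte array, and accesses are assumed to be naturally aligned with a size of 1, 2 or 4 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical little-endian bus: mem[addr] is the least significant byte. */
static uint32_t bus_read_le(const uint8_t *mem, uint32_t addr, unsigned size)
{
    uint32_t val = 0;
    for (unsigned i = 0; i < size; i++) {
        val |= (uint32_t)mem[addr + i] << (8 * i);
    }
    return val;
}

/* Reverse the bytes of a 2- or 4-byte value; 1-byte values are unchanged. */
static uint32_t swap_data(uint32_t val, unsigned size)
{
    switch (size) {
    case 2:
        return ((val & 0xff) << 8) | ((val >> 8) & 0xff);
    case 4:
        return ((val & 0xff) << 24) | ((val & 0xff00) << 8) |
               ((val >> 8) & 0xff00) | (val >> 24);
    default:
        return val;
    }
}

/* Byte invariant: the address is untouched, the data is swapped. */
static uint32_t read_byte_invariant_be(const uint8_t *mem, uint32_t addr,
                                       unsigned size)
{
    return swap_data(bus_read_le(mem, addr, size), size);
}

/* Word invariant: the data is untouched, the low address bits are XORed
 * according to the access size (3 for bytes, 2 for halfwords, 0 for words). */
static uint32_t read_word_invariant_be(const uint8_t *mem, uint32_t addr,
                                       unsigned size)
{
    return bus_read_le(mem, addr ^ (4 - size), size);
}

/* Small usage example over the byte sequence 11 22 33 44. */
void invariance_demo(void)
{
    const uint8_t mem[4] = { 0x11, 0x22, 0x33, 0x44 };

    assert(read_byte_invariant_be(mem, 0, 1) == 0x11);
    assert(read_word_invariant_be(mem, 0, 1) == 0x44);
}
```

In the byte invariant case a byte read of address 0 still sees the byte stored at address 0; in the word invariant case it sees the byte at address 3, which is the "which byte is at what address" difference discussed in this thread.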


Which means that ARM switches the byte order of its bus when
switching endianness, which is very, very wrong... oh well.

PowerPCs operate in both endiannesses and nowadays do not require any
adjustment at the bus level. It's simple: the definition that matters,
and the only definition that matters for data, is which byte is at what
address, and that shouldn't change. Addresses, for their part, should be
in a fixed significance order and not change order based on the
"endianness" of the processor.

Or to put it differently, the byte order of data should be fixed, and so
should the "endianness" of the bus (which in that case is purely the byte
order of addresses). ARM seems to have chosen to go from fixing the
byte order of data & changing the bus endianness, to changing the byte
order of data to preserve the bus endianness, or something along those
lines. Both approaches are WRONG.

Anyway, whatever, what is done is done. The point is, what you have is a
byte lane swapper, similar to what older ppc did, which is indeed
needed if your bus flips around. I would still not model that using
the term "endianness" for either the bus or the bridge. Again, it's purely
a statement of which byte lane coming out corresponds to which *address*,
regardless of byte significance.

> > Again, the only endian
> > attribute that exists are the byte order of the original access (which
> > byte has the lowest address, regardless of significance of those bytes
> > in the target, ie, purely from a qemu standpoint, in the variable that
> > carries the access around inside qemu, which byte has the lowest
> > address)
> 
> What does this even mean? At the point where a memory access leaves
> the CPU (emulation or real hardware) it has (a) an address and
> (b) a width -- a 16 bit access is neither big nor little endian,

Not from the point of view of the bus, indeed. The value inside might or
might not have an endianness, but that's purely somebody else's business.

> it's just a request for 16 bits of data (on real hardware it's
> typically a bus transaction on a bunch of data lines with some
> control lines indicating transaction width). Now the CPU emulation
> may internally be intending to put that data into its emulated
> register one way round or the other, but that's an internal detail
> of the CPU emulation. (Similarly for stores.)

My statement above meant (sorry if it wasn't clear) that what matters is
this: when qemu carries that request around (and thus carries those 16 bits
in a u16 variable inside qemu itself, i.e. the argument of some
load/store callback), the only attribute of interest is which of the
bytes in that u16 variable is the first in ascending address order.

I.e. this is a property of the processor bus: which of the 2 bytes
that form that 16-bit wide "access" is supposed to go at what address
is *defined* as part of the bus definition (though ARM seems to flip it
around).

Obviously, it should be done according to the host endianness, and thus
one would expect that qemu just "storing" that in memory results in the
two bytes being laid out in the right order.
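
As an illustration of that expectation, here is a minimal sketch in C. The helper names (store_u16_host(), load_u16_host(), carry_from_bytes()) are made up for this example and are not QEMU functions; the only assumption is that the u16 carried around is built so that a plain host store puts its bytes back in ascending address order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Plain host store: no conversion, the host lays the two bytes out. */
static void store_u16_host(uint8_t *ram, uint16_t val)
{
    memcpy(ram, &val, sizeof(val));
}

/* Plain host load: the inverse of store_u16_host(). */
static uint16_t load_u16_host(const uint8_t *ram)
{
    uint16_t val;
    memcpy(&val, ram, sizeof(val));
    return val;
}

/* Build the carried u16 from the two bytes of the access, given in
 * ascending address order. The numeric value differs between hosts,
 * but the in-memory layout does not. */
static uint16_t carry_from_bytes(uint8_t first, uint8_t second)
{
    const uint8_t bytes[2] = { first, second };
    return load_u16_host(bytes);
}

/* Usage example: the byte meant for the lowest address lands there. */
void host_order_demo(void)
{
    uint8_t ram[2] = { 0, 0 };

    store_u16_host(ram, carry_from_bytes(0xAA, 0xBB));
    assert(ram[0] == 0xAA && ram[1] == 0xBB);
}
```

The assertion holds on both little-endian and big-endian hosts, because the carried value is defined by which byte comes first in address order and not by byte significance.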

My point here is that if any conversion at that level is needed, it's
purely somewhere in TCG to ensure that what's in the variable carried
out of TCG is represented according to the intended byte order of the
original access.

That's why the word "endianness" is so confusing. At this point, if
anything, the only endianness that exists is the one of the host CPU :-)

Now when we cross your byte-lanes adjusting bridge in qemu, two things
can happen.

 - The bridge is configured properly and it's a nop
 - The bridge is *not* configured properly, and you basically need to
perform a lane swap on anything crossing that bridge.

I wouldn't call that "endian". I wouldn't model that by saying that
some range of addresses is "LE" or "BE" or "Host Endian" or whatever ...

The best way to represent that in qemu would be to have some kind of
"lane swap" attribute which is set based on the (mis)match of the
CPU and bridge configuration.
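
Something along these lines, as a rough sketch only. HypotheticalBridge, its fields and both functions are invented for illustration and do not correspond to any existing qemu structure or API; access sizes are assumed to be 1, 2 or 4 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bridge state: two configuration values that are supposed
 * to be kept in sync by the guest. */
typedef struct HypotheticalBridge {
    int cpu_byte_order_cfg;     /* what the guest programmed into the CPU */
    int bridge_byte_order_cfg;  /* what the guest programmed into the bridge */
} HypotheticalBridge;

/* The only attribute the memory layer would need: swap lanes or not. */
static bool bridge_lane_swap(const HypotheticalBridge *b)
{
    return b->cpu_byte_order_cfg != b->bridge_byte_order_cfg;
}

/* Pass a value of 'size' bytes through the bridge. */
static uint32_t bridge_cross(const HypotheticalBridge *b, uint32_t val,
                             unsigned size)
{
    uint32_t out = 0;

    if (!bridge_lane_swap(b)) {
        return val;             /* configuration in sync: a nop */
    }
    for (unsigned i = 0; i < size; i++) {
        /* Move byte lane i to lane (size - 1 - i). */
        out |= ((val >> (8 * i)) & 0xff) << (8 * (size - 1 - i));
    }
    return out;
}

/* Usage example: in sync is a nop, out of sync swaps the lanes. */
void lane_swap_demo(void)
{
    const HypotheticalBridge in_sync = { 1, 1 };
    const HypotheticalBridge mismatched = { 1, 0 };

    assert(bridge_cross(&in_sync, 0x1122, 2) == 0x1122);
    assert(bridge_cross(&mismatched, 0x1122, 2) == 0x2211);
}
```

Note that nothing here is labelled "LE" or "BE": the attribute only says whether the lanes get swapped, and crossing two mismatched bridges in a row gives back the original value.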

> >, and the same on the target device (at which point a concept of
> > significance does apply, but it's a guest driver business to get it
> > right, qemu just need to make sure byte 0 goes to byte 0).
> 
> Similarly, at the target device end there is no concept
> of a "big endian access" -- we make a request for 16
> bits of data at a particular address (via the MemoryRegion
> API) and the device returns 16 bits of data. It's entirely
> possible to design hardware so that byte access to address
> X, halfword access to address X and word access to address
> X all return entirely different data (though it would be
> a bit perverse.) (As an implementation convenience we may
> choose to provide helper infrastructure so you don't have
> to actually implement all of byte/halfword/word access by hand.)

In fact, I think some x86 IO devices did that perverse thing in the
past. It's not always possible, though, and on modern busses it tends not
to be. Often, busses do *not* have low address bits below their
width but use byte enables instead. Now of course the device could be
perverse enough to try to use those to "deduce" the address and still
return different values, but I doubt anybody does that :-)

However, this is academic. The point is that the bus the device is on has
a definition of what byte is expected first in ascending address order,
which should match the way qemu carries the data.

Endianness at the device level is thus purely about the interpretation
of that data at the register level, because qemu is not HW.

> > If a bridge flips things around in a way that breaks the
> > model,
> 
> That breaks what model?

Byte order invariance, which is what we care about. If the bridge
preserves that, then it's a nop for qemu.

> > then add some property describing the flipping
> > properties but don't call it "big
> > endian" or "little endian" at the bridge level, that has no meaning,
> > confuses things and introduces breakage like we have seen.
> 
> I'm happy to call the property "byteswap", yes, because
> that's what it does. If you did two of these in a row you'd
> get a no-op.

Right, but the "byteswap" configuration of the bridge you mentioned is
based on the configuration of the processor bus byte order, and if those
are in sync, the bridge should essentially be a nop for all intents and
purposes.

If byte order is preserved, then the "value" qemu carries around can go
unmodified, all the way.

> >> (Our other serious endianness problem is that we don't really
> >> do very well at supporting a TCG CPU arbitrarily flipping
> >> endianness -- TARGET_WORDS_BIGENDIAN is a compile time setting
> >> and ideally it should not be.)
> >
> > Our experience is that it actually works fine for almost everything
> > except virtio :-) ie mostly TARGET_WORDS_BIGENDIAN is irrelevant (and
> > should be).
> 
> I agree that TARGET_WORDS_BIGENDIAN *should* go away, but
> it exists currently. Do you actually implement a CPU which
> does dynamic endianness flipping? Is it at all efficient
> in the config which is the opposite of whatever
> TARGET_WORDS_BIGENDIAN says?

Yes. I haven't measured the speed (in fact I haven't looked at the code
either, others have, but I know it works).

Cheers,
Ben.
