Philippe Mathieu-Daudé <phi...@redhat.com> writes: > Hi Markus. > > On 7/16/19 5:12 PM, Markus Armbruster wrote: >> "Dr. David Alan Gilbert" <dgilb...@redhat.com> writes: >> >>> * Markus Armbruster (arm...@redhat.com) wrote: >>>> Philippe asked me to have a look at this one, so here goes. > > Thanks a lot for your careful analysis. > > I got scared the uh-oh you raised would get this series or rework of it > still refused for the release, so I went for a quick-and-dirty bugfix.
This close to the release, minimal bug fix now and cleanup later makes lots of sense. > Since this bugfix got merged (as commit 3a283507c0347), I can now think > again about how to properly fix this (if it is fixable...). > >>>> Philippe Mathieu-Daudé <phi...@redhat.com> writes: >>>> >>>>> In the document [*] the "Read Array Flowchart", the READ_ARRAY >>>>> command has a value of 0xff. >>>>> >>>>> Use the correct value in the pflash model. >>>>> >>>>> There is no change of behavior in the guest, because: >>>>> - when the guest were sending 0xFF, the reset_flash label >>>>> was setting the command value as 0x00 >>>>> - 0x00 was used internally for READ_ARRAY >>>> >>>> *Groan* >>>> >>>> Is this cleanup, or does it fix an observable bug? > > Well it depends where you stand. > > I have a few patches on top of this adding trace events (4.2 material) > and while debugging it was not making sense with the CFI specs. > > 1/ The guest writes 0xFF to go in READ_ARRAY mode, the model report a > warning and take the switch default case which calls pflash_reset(), > which happens to set the flash in READ_ARRAY. This one, in pflash_write()? switch (pfl->wcycle) { case 0: ... ---> case 0xff: /* Read array mode */ DPRINTF("%s: Read array mode\n", __func__); goto reset_flash; ... } return; ... reset_flash: trace_pflash_reset(); memory_region_rom_device_set_romd(&pfl->mem, true); pfl->wcycle = 0; pfl->cmd = 0; I can't see a warning here. Let's ignore the tracepoint. Is the memory_region_rom_device_set_romd() appropriate for READ_ARRAY? pfl->wcycle = 0 is a no-op. pfl->cmd = 0 is part of the "use 0 instead 0f 0xFF to represent READ_ARRAY state" madness. By the way, use of tracing and DPRINTF() in the same .c is an anti-pattern. Care to convert the remaining DPRINTF() into tracepoints? Feel free to delete useless ones, if any. > 2/ Then a later series adds the CFI specs timings (like the CFI02 > model), because it would useful to test the UEFI Capsule Update feature > with real-time behavior. For the 'Virt' pflash, the timing is disabled. > Now running a non-Virt pflash, it becomes very slow because each time > the guest goes into READ_ARRAY mode, the reset delay (which is the > longest) occurs. Feels like a latent bug. Adding timing turns it into a real one. > Talking with Laszlo, I figured for 1/ instead of fixing the model, I can > display 0x00 as 0xFF and ignore the pflash_reset() when the caller is > not system_reset(). Dirty again. > > For 2/ it is not that easy, it will depends if there is more interest > from the UEFI community (Intel parallel NOR flashes are used on x86 and > aarch64 UEFI platforms). > > If we justify 1/ and 2/ are not important, then it is simply a cleanup. If it's a bug fix, have the commit message explain the bug. If it's just cleanup, heave the commit message say so. >>>>> To keep migration with older versions behaving correctly, we >>>>> decide to always migrate the READ_ARRAY as 0x00. >>>>> >>>>> If the CFI open standard decide to assign a new command of value >>>>> 0x00, this model is flawed because it uses this value internally. >>>>> If a guest eventually requires this new CFI feature, a different >>>>> model will be required (or this same model but breaking backward >>>>> migration). So it is safe to keep migrating READ_ARRAY as 0x00. >>>> >>>> We could perhaps keep migration working for "benign" device states, with >>>> judicious use of subsections. We'll cross that bridge when we get to >>>> it. >>>> >>>>> [*] "Common Flash Interface (CFI) and Command Sets" >>>>> (Intel Application Note 646) >>>>> Appendix B "Basic Command Set" >>>>> >>>>> Reviewed-by: John Snow <js...@redhat.com> >>>>> Reviewed-by: Alistair Francis <alistair.fran...@wdc.com> >>>>> Regression-tested-by: Laszlo Ersek <ler...@redhat.com> >>>>> Signed-off-by: Philippe Mathieu-Daudé <phi...@redhat.com> >>>>> --- >>>>> v3: Handle migrating the 'cmd' field. >>>>> v4: Handle migrating to older QEMU (Dave) >>>>> v5: Add a paragraph about why this model is flawed due to >>>>> historically using READ_ARRAY as 0x00 (Dave, Peter). >>>>> >>>>> Since Laszlo stated he did not test migration [*], I'm keeping his >>>>> test tag, because the change with v2 has no impact in the tests >>>>> he ran. >>>>> >>>>> Likewise I'm keeping John and Alistair tags, but I'd like an extra >>>>> review for the migration change, thanks! >>>>> >>>>> [*] https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg00679.html >>>>> --- >>>>> hw/block/pflash_cfi01.c | 57 ++++++++++++++++++++++++++++++++++------- >>>>> 1 file changed, 48 insertions(+), 9 deletions(-) >>>>> >>>>> diff --git a/hw/block/pflash_cfi01.c b/hw/block/pflash_cfi01.c >>>>> index 9e34fd4e82..85bb2132c0 100644 >>>>> --- a/hw/block/pflash_cfi01.c >>>>> +++ b/hw/block/pflash_cfi01.c >>>>> @@ -96,6 +96,37 @@ struct PFlashCFI01 { >>>>> bool old_multiple_chip_handling; >>>>> }; >>>>> >>>>> +static int pflash_pre_save(void *opaque) >>>>> +{ >>>>> + PFlashCFI01 *s = opaque; >>>>> + >>>>> + /* >>>>> + * Previous to QEMU v4.1 an incorrect value of 0x00 was used for the >>>>> + * READ_ARRAY command. To preserve migrating to these older version, >>>>> + * always migrate the READ_ARRAY command as 0x00. >>>>> + */ >>>>> + if (s->cmd == 0xff) { >>>>> + s->cmd = 0x00; >>>>> + } >>>>> + >>>>> + return 0; >>>>> +} >>>>> + >>>>> +static int pflash_post_save(void *opaque) >>>>> +{ >>>>> + PFlashCFI01 *s = opaque; >>>>> + >>>>> + /* >>>>> + * If migration failed, the guest will continue to run. >>>>> + * Restore the correct READ_ARRAY value. >>>>> + */ >>>>> + if (s->cmd == 0x00) { >>>>> + s->cmd = 0xff; >>>>> + } >>>>> + >>>>> + return 0; >>>>> +} >>>> >>>> Uh, this gives me a queasy feeling. Perhaps David can assuage it. >>> >>> See the previous 4 versions of discussion.... >> >> Jumped in at v5; sorry about that. >> >>>> I figure the intent is to migrate PFlashCFI01 member @cmd value 0xFF as >>>> 0x00, for migration compatibility to and from older versions. >>>> >>>> You do this by monkey-patching it to 0x00 before migration, and to 0xFF >>>> afterwards. On the incoming side, you replace 0x00 by 0xFF, in >>>> pflash_post_load() below. >>>> >>>> Questions: >>>> >>>> * Can anything but the code that sends @cmd see the temporary 0x00 value >>>> between pflash_pre_save() and pflash_post_save() >>> >>> It is the same pflash data structure; but all CPUs are stopped and we're >>> just walking the list of devices serialising them; so no nothing should >>> be seeing that value. >> >> Sounds good. >> >>> (There is another way to do this, which is to produce a temporary >>> structure at this point, populate the temporary structure and migrate >>> that) >> >> Not necessary. >> >> The uh-ohs below still need assuaging, not necessarily yours. >> >>> Dave >>> >>>> * Consider the matrix source \in { old, new } x dest \in { old, new } x >>>> @cmd on source in { 0x00, 0xFF }. What does migration put into @cmd >>>> on dest? Eight cases: >>>> >>>> source @cmd -> wire -> dest @cmd >>>> old 0x00 -> 0x00 -> old 0x00 (1) >>>> new 0xFF (2) >>>> old 0xFF -> 0xFF -> old 0xFF (3) >>>> new 0xFF (4) >>>> new 0x00 -> 0x00 -> old 0x00 (5) >>>> new 0xFF (6) >>>> new 0xFF -> 0x00 -> old 0x00 (7) >>>> new 0xFF (8) >>>> >>>> Old -> old (cases 1 and 3) is unaffected by this patch. >>>> >>>> New -> new leaves 0xFF unchanged (8). It changes 0x00 to 0xFF (6). >>>> Uh-oh. Can this happen? Rephrasing the question: can @cmd ever be >>>> 0x00 with this patch applied? > > 0x00 is not in the spec (but as suggested Peter Maydell, the spec can > eventually assign the value in the future [*]). So no guests use it. > This value is only set: > - when the (old) model is initialized, without any access to the guest > - if the guest wrote an incorrect value, hitting the switch default case. > The guest can not read this value (value internal to the state machine). > > So answer: Yes :( Could zero @cmd be avoided? > [*] However the spec has a way to announce supported features to the guest. > >>>> >>>> Old -> new leaves 0xFF unchanged (4). It changes 0x00 to 0xFF (2), >>>> which I think is intentional. >>>> >>>> New -> old leaves 0x00 unchanged (5). It changes 0xFF to 0x00 (7), >>>> which I think is intentional. >>>> >>>> Old -> new -> old leaves 0x00 unchanged. Good. It changes 0xFF to >>>> 0x00. Uh-oh. Can @cmd ever be 0xFF before this patch? > > I understand the full question as "Can @cmd ever be 0xFF [in Old] before > this patch?". > > Answer: No, neither the guest nor the state machine can set @cmd to 0xFF > in Old. Good. >>>> >>>> New -> old -> new leaves 0xFF unchanged. Good. It changes 0x00 to >>>> 0xFF. Same uh-oh as for new -> new. > > "Same uh-oh", do you mean "Can @cmd ever be 0x00 [in New] before this > patch?"? Uh, isn't "[in New] before this patch" is a contradiction? Old = QEMU before this patch New = QEMU with this patch applied New -> old -> new = migrate from QEMU with the patch to QEMU without the patch, then migrate again to QEMU with the patch. If @cmd can be 0x00 initially, then the first migration is (5), and @cmd remains 0x00. The second migration is (2), and @cmd becomes 0xFF. The two migrations together change device state, which could be bad. > Same answer: No, neither the guest nor the state machine can set @cmd to > 0x00 in New. I'm afraid you gave a different answer to the same question above: "Yes :(". >>>> >>>>> + >>>>> static int pflash_post_load(void *opaque, int version_id); >>>>> >>>>> static const VMStateDescription vmstate_pflash = { >>>>> @@ -103,6 +134,8 @@ static const VMStateDescription vmstate_pflash = { >>>>> .version_id = 1, >>>>> .minimum_version_id = 1, >>>>> .post_load = pflash_post_load, >>>>> + .pre_save = pflash_pre_save, >>>>> + .post_save = pflash_post_save, >>>>> .fields = (VMStateField[]) { >>>>> VMSTATE_UINT8(wcycle, PFlashCFI01), >>>>> VMSTATE_UINT8(cmd, PFlashCFI01), >>>>> @@ -277,10 +310,9 @@ static uint32_t pflash_read(PFlashCFI01 *pfl, hwaddr >>>>> offset, >>>>> /* This should never happen : reset state & treat it as a read */ >>>>> DPRINTF("%s: unknown command state: %x\n", __func__, pfl->cmd); >>>>> pfl->wcycle = 0; >>>>> - pfl->cmd = 0; >>>>> + pfl->cmd = 0xff; >>>>> /* fall through to read code */ >>>>> - case 0x00: >>>>> - /* Flash area read */ >>>>> + case 0xff: /* Read Array */ >>>>> ret = pflash_data_read(pfl, offset, width, be); >>>>> break; >>>> >>>> On 0xFF, we no longer zap pfl->wcycle and pfl->cmd. > > We have 2 ways to set @cmd=0xFF. > > - Write 0xFF, write an invalid command, > finish a multicycle operation (wcycle returns to 0): > > pflash_write() goto reset_flash, then calls > memory_region_rom_device_set_romd(). set wcycle=0, cmd=READ_ARRAY. > > Next read() will be in ROMD mode, we won't reach pflash_read(). > > - if next access is write, we'll enter the same pflash_write(). > > - The /* This should never happen */ comment in pflash_read(). > > It might happens migrating? So we migrated crap, the guest wants to > read, the crap defaults to READ_ARRAY in I/O mode. Wrong in many ways, > not sure what the guest expects there, probably not ARRAY data. > Anyway, we stay in this READ_ARRAY I/O mode until the guest eventually > does a write access. Wrong. > > The case "The state machine set 0xff, let the device in I/O mode" so we > expect to answer to a read() with READ_ARRAY is wrong too, the device > should already be in ROMD mode. > >>>> >>>> On 0x00, we do. > > Because we have no idea how we got there... Neither what we should do. The part I actually understand here is "this device model is wrong in so many ways". Is this device model fixable with reasonable effort? Would starting over be easier? >>>> >>>> We zap pfl->cmd to 0xFF instead of 0x00. Same below after label >>>> error_flash and reset_flash. Related: initialization to 0xFF instead of >>>> 0x00 in pflash_cfi01_realize(). I *guess* these changes together ensure >>>> pfl->cmd can't become 0x00. Correct? >>>> >>>>> case 0x10: /* Single byte program */ >>>>> @@ -448,8 +480,6 @@ static void pflash_write(PFlashCFI01 *pfl, hwaddr >>>>> offset, >>>>> case 0: >>>>> /* read mode */ >>>>> switch (cmd) { >>>>> - case 0x00: /* ??? */ >>>>> - goto reset_flash; >>>> >>>> On 0x00, we now use default: goto error_flash. Can this happen? > > This could happen if the the guest is writing crap, so we correctly > report this as an error. A bug fix of sorts. >>>> >>>>> case 0x10: /* Single Byte Program */ >>>>> case 0x40: /* Single Byte Program */ >>>>> DPRINTF("%s: Single Byte Program\n", __func__); >>>>> @@ -526,7 +556,7 @@ static void pflash_write(PFlashCFI01 *pfl, hwaddr >>>>> offset, >>>>> if (cmd == 0xd0) { /* confirm */ >>>>> pfl->wcycle = 0; >>>>> pfl->status |= 0x80; >>>>> - } else if (cmd == 0xff) { /* read array mode */ >>>>> + } else if (cmd == 0xff) { /* Read Array */ >>>>> goto reset_flash; >>>>> } else >>>>> goto error_flash; >>>>> @@ -553,7 +583,7 @@ static void pflash_write(PFlashCFI01 *pfl, hwaddr >>>>> offset, >>>>> } else if (cmd == 0x01) { >>>>> pfl->wcycle = 0; >>>>> pfl->status |= 0x80; >>>>> - } else if (cmd == 0xff) { >>>>> + } else if (cmd == 0xff) { /* read array mode */ >>>> >>>> Your new comment is phrased the way you corrected in the previous hunk. >>>> Intentional? > > No :/ Too many rebases. Easy enough to clean up :) >>>> >>>>> goto reset_flash; >>>>> } else { >>>>> DPRINTF("%s: Unknown (un)locking command\n", __func__); >>>>> @@ -645,7 +675,7 @@ static void pflash_write(PFlashCFI01 *pfl, hwaddr >>>>> offset, >>>> error_flash: >>>> qemu_log_mask(LOG_UNIMP, "%s: Unimplemented flash cmd sequence " >>>> "(offset " TARGET_FMT_plx ", wcycle 0x%x cmd 0x%x >>>> value 0x%x)" >>>> "\n", __func__, offset, pfl->wcycle, pfl->cmd, value); >>>> >>>> reset_flash: >>>>> trace_pflash_reset(); >>>>> memory_region_rom_device_set_romd(&pfl->mem, true); >>>>> pfl->wcycle = 0; >>>>> - pfl->cmd = 0; >>>>> + pfl->cmd = 0xff; >>>>> } >>>>> >>>>> >>>>> @@ -761,7 +791,7 @@ static void pflash_cfi01_realize(DeviceState *dev, >>>>> Error **errp) >>>>> } >>>>> >>>>> pfl->wcycle = 0; >>>>> - pfl->cmd = 0; >>>>> + pfl->cmd = 0xff; >>>>> pfl->status = 0; >>>>> /* Hardcoded CFI table */ >>>>> /* Standard "QRY" string */ >>>>> @@ -1001,5 +1031,14 @@ static int pflash_post_load(void *opaque, int >>>>> version_id) >>>>> pfl->vmstate = >>>>> qemu_add_vm_change_state_handler(postload_update_cb, >>>>> pfl); >>>>> } >>>>> + >>>>> + /* >>>>> + * Previous to QEMU v4.1 an incorrect value of 0x00 was used for the >>>>> + * READ_ARRAY command. >>>>> + */ >>>>> + if (pfl->cmd == 0x00) { >>>>> + pfl->cmd = 0xff; >>>>> + } >>>>> + >>>>> return 0; >>>>> } >>> -- >>> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK