On 04.01.2013, at 05:22, Bhushan Bharat-R65777 wrote: > > >> -----Original Message----- >> From: Marcelo Tosatti [mailto:mtosa...@redhat.com] >> Sent: Friday, January 04, 2013 7:08 AM >> To: Alexander Graf >> Cc: jjhe...@linux.vnet.ibm.com; Christian Borntraeger; Anthony Liguori; qemu- >> de...@nongnu.org qemu-devel; Bhushan Bharat-R65777 >> Subject: Re: [Qemu-devel] [PATCH 7/7] KVM regsync: Fix >> do_kvm_cpu_synchronize_state data integrity issue >> >> On Thu, Jan 03, 2013 at 08:09:22PM +0100, Alexander Graf wrote: >>> >>> On 03.01.2013, at 19:48, Jason J. Herne wrote: >>> >>>> On 01/03/2013 08:56 AM, Alexander Graf wrote: >>>>>> static void do_kvm_cpu_synchronize_state(void *_args) >>>>>>> { >>>>>>> struct kvm_cpu_syncstate_args *args = _args; >>>>>>> + CPUArchState *env = args->env; >>>>>>> + int register_level = args->register_level; >>>>>>> >>>>> This probably becomes more readable if we explicitly revert back to >> unsynced state first: >>>>> >>>>> /* Write back local modifications at our current level */ if >>>>> (register_level > env->kvm_vcpu_dirty) { >>>>> kvm_arch_put_registers(...); >>>>> env->kvm_vcpu_dirty = 0; >>>>> } >>>>> >>>>> and then do the sync we are requested to do: >>>>> >>>>> if (!env->kvm_vcpu_dirty) { >>>>> ... >>>>> } >>>> >>>> I agree, but only if we add a second conditional to the if 1st statement >>>> as >> such: >>>> >>>> if (args->env->kvm_vcpu_dirty && register_level > >>>> env->kvm_vcpu_dirty) >>>> >>>> This is to cover the case where the caller is asking for register level "1" >> and we're already dirty at level "2". In this case, nothing should happen and >> we'll need the "args->env->kvm_vcpu_dirty" to ensure that is the case. >>> >>> As before, I'd prefer to make this explicit: >>> >>>> >>>> static void do_kvm_cpu_synchronize_state(void *_args) { >>>> struct kvm_cpu_syncstate_args *args = _args; >>>> CPUArchState *env = args->env; >>>> int register_level = args->register_level; >>> >>> if (register_level > env->kvm_vcpu_dirty) { >>> /* We are more dirty than we need to - all is well */ >>> return; >>> } >>> >>>> >>>> /* Write back local modifications at our current level */ >>>> if (args->env->kvm_vcpu_dirty && register_level > env->kvm_vcpu_dirty) { >>>> kvm_arch_put_registers(env, env->kvm_vcpu_dirty); >>>> env->kvm_vcpu_dirty = 0; >>>> } >>>> >>>> if (!args->env->kvm_vcpu_dirty) { >>>> kvm_arch_get_registers(env, register_level); >>>> env->kvm_vcpu_dirty = register_level; >>>> } >>>> } >>>> >>>> Do you agree? Thanks for your time. :) >>> >>> Please also check out the discussions I've had with Bharat about his >>> watchdog >> patches. There we need a mechanism to synchronize registers only when we >> actually need to, in order to avoid potential race conditions with a kernel >> timer. >>> >>> That particular case doesn't work well with levels. We can have multiple >> different potential race producers in the kernel that we need to avoid >> individually, so we can't always synchronize all of them when only one of >> them >> needs to be synchronized. >>> >>> The big question is what we should be doing about this. We basically have 3 >> options: >>> >>> * implement levels, treat racy registers as "manually synchronized", as >> Bharat's latest patch set does >>> * implement levels, add a bitmap for additional special synchronization >>> bits >>> * replace levels by bitmap >>> >>> I'm quite frankly not sure which one of the 3 would be the best way forward. >>> >>> >>> Alex >> >> Implement read/write accessors for individual registers, similarly to how its >> done in the kernel. See kvm_cache_regs.h. > > Read/write for individual registers can be heavy for cases where multiple > register needed to be updated. Is not it?
It depends. We can still have classes of registers to synchronize that we could even synchronize using MANY_REG. For writes, things become even faster with individual accessors since we can easily coalesce all register writes until we hit the vcpu again into a MANY_REG array and write them in one go. > > For cases where multiple register needed to be synchronized then I would like > to elaborate the options as: > > Option 1: > int timer_func(CPUArchState env) > { > cpu_synchronize_state(env); > //update env->timer_registers > env->updated_regs_type |= TIMER_REGS_TYPE_BIT; To scale this one, it's probably also more clever to swap the logic: env->kvm_sync_extra |= SYNC_TIMER_REGS; cpu_synchronize_state(env); /* gets timer regs */ /* update env->timer_registers */ /* timer regs will be synchronized later because kvm_sync_extra has SYNC_TIMER_REGS set */ Your case right now is special in that we don't need to get any register state, but only write it. However, if we do cpu_synchronize_state(env); /* will not get timer regs */ env->kvm_sync_extra |= SYNC_TIMER_REGS; then the next cpu_synchronize_state(env); would overwrite the timer values again. So we would also need to indicate that the bits are dirty: cpu_synchronize_state(env); /* will not get timer regs */ env->kvm_sync_extra |= SYNC_TIMER_REGS; env->kvm_sync_dirty |= SYNC_TIMER_REGS; which cpu_synchronize_state() could use to make sure it doesn't read back any timer regs again. It would obviously also have to set it in its normal operation when it executed first ;). > } > > Change the kvm_arch_put_registers() to add followings: > > int kvm_arch_put_registers(CPUArchState *env, int level) > { > > If (env->updated_regs_type) { > struct kvm_sregs sregs; > If (env->updated_regs_type & TIMER_REGS_TYPE_BIT) { > // update sregs->timer_registers; > // may be we can have a bitmap to tell kernel for what > actually updated > } > If (env->updated_regs_type & XX_CPU_REGS_TYPE) { > // update respective registers in sregs > } > > ioctl(env, KVM_GET_SREGS, &sregs); > } > } > > This mechanism will get all registers state but this is required only once > per entry in QEMU. I don't understand this bit. > > > Option 2: > Define read_regs_type() and write_regs_type() for cases where it requires > multiple registers updates: > > int read_regs_type((CPUArchState env, void *regs_ptr, int reg_type) > { > -> reg_type are bitmap like: TIMER_REGS, DEBUG_REGS, PERFMON_REGS etc. > > Switch(reg_type) { > Case TIMER_REGS: > Struct *timer_regs = regs_ptr; > Ioctl(KVM_GET_SREGS, timer_regs, CPU_reg_type); > Break; > > Default: > Printf(error); > } > > Similarly will be the write function: > int write_regs_type((CPUArchState env, int reg_type) > { > -> reg_type are bitmap like: TIMER_REGS, DEBUG_REGS, PERFMON_REGS etc. > > Switch(reg_type) { > Case TIMER_REGS: > Struct timer_regs; > Timer_regs.reg1 = env->timer_regs->reg1; > Timer_regs.reg2 = env->timer_regs->reg2; > Ioctl(env, KVM_SET_SREGS, &timer_regs, CPU_reg_type); > Break; > > Default: > Printf(error); > } > > Int timer_func(CPUxxState env) > { > Struct timer_regs; > read_regs_type((env, &timer_regs,TIMER_REGS); > //update env->timer_registers > Write_regs_type(env, TIMER_REGS) > } > > This will get one type of register_types and can cause multiple IOCTL per > entry to QEMU. This is still too explicit. How about static inline void ppc_set_tsr(CPUState *env, target_ulong val) { env->kvm_sync_extra |= SYNC_TIMER_REGS; cpu_synchronize_registers(env); env->spr[SPR_TSR] = val; } static inline void ppc_set_tcr(CPUState *env, target_ulong val) { env->kvm_sync_extra |= SYNC_TIMER_REGS; cpu_synchronize_registers(env); env->spr[SPR_TCR] = val; } int timer_func(CPUState *env) { ppc_set_tsr(env, 0); ppc_set_tcr(env, 0); } The main benefit from this method is that we can do a simple search/replace to get rid of essentially all synchronization pitfalls where we might miss out on some code path that then doesn't synchronize its register set. > ------- > > Keep on using GET/SET_ONE_REG when only one registers needed to be updated. We can always use ONE_REG for any of the variants. The main question I have is whether it makes sense to move from /* state subset only touched by the VCPU itself during runtime */ #define KVM_PUT_RUNTIME_STATE 1 /* state subset modified during VCPU reset */ #define KVM_PUT_RESET_STATE 2 /* full state set, modified during initialization or on vmload */ #define KVM_PUT_FULL_STATE 3 to a bitmap. So we would always only synchronize KVM_PUT_RUNTIME_STATE, but keep a bitmap as explained above around for any state that is more elaborate. For easy conversion, we could even add bits for SYNC_RESET and SYNC_FULL = (SYNC_RESET | SYNC_FULL_REAL). That should get rid of all the level based logic, giving us a lot more flexibility on what we want to synchronize. Alex