On 04/11/2014 07:40 PM, Alexander Graf wrote: > > On 10.04.14 16:31, Alexey Kardashevskiy wrote: >> On 04/10/2014 10:34 PM, Alexander Graf wrote: >>> On 03.04.14 15:14, Alexey Kardashevskiy wrote: >>>> This allows guests to have a different timebase origin from the host. >>>> >>>> This is needed for migration, where a guest can migrate from one host >>>> to another and the two hosts might have a different timebase origin. >>>> However, the timebase seen by the guest must not go backwards, and >>>> should go forwards only by a small amount corresponding to the time >>>> taken for the migration. >>>> >>>> This is only supported for recent POWER hardware which has the TBU40 >>>> (timebase upper 40 bits) register. That includes POWER6, 7, 8 but not >>>> 970. >>>> >>>> This adds kvm_access_one_reg() to access a special register which is not >>>> in env->spr. >>>> >>>> The feature must be present in the host kernel. >>>> >>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> >>>> --- >>>> Changes: >>>> v4: >>>> * made it per machine timebase offser rather than per CPU >>>> >>>> v3: >>>> * kvm_access_one_reg moved out to a separate patch >>>> * tb_offset and host_timebase were replaced with guest_timebase as >>>> the destionation does not really care of offset on the source >>>> >>>> v2: >>>> * bumped the vmstate_ppc_cpu version >>>> * defined version for the env.tb_env field >>>> --- >>>> hw/ppc/ppc.c | 120 >>>> +++++++++++++++++++++++++++++++++++++++++++++++++ >>>> hw/ppc/spapr.c | 3 +- >>>> include/hw/ppc/spapr.h | 2 + >>>> target-ppc/cpu-qom.h | 16 +++++++ >>>> target-ppc/kvm.c | 5 +++ >>>> target-ppc/machine.c | 4 +- >>>> trace-events | 3 ++ >>>> 7 files changed, 151 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/hw/ppc/ppc.c b/hw/ppc/ppc.c >>>> index 9c2a132..b51db1b 100644 >>>> --- a/hw/ppc/ppc.c >>>> +++ b/hw/ppc/ppc.c >>>> @@ -29,9 +29,11 @@ >>>> #include "sysemu/cpus.h" >>>> #include "hw/timer/m48t59.h" >>>> #include "qemu/log.h" >>>> +#include "qemu/error-report.h" >>>> #include "hw/loader.h" >>>> #include "sysemu/kvm.h" >>>> #include "kvm_ppc.h" >>>> +#include "trace.h" >>>> //#define PPC_DEBUG_IRQ >>>> //#define PPC_DEBUG_TB >>>> @@ -797,6 +799,124 @@ static void cpu_ppc_set_tb_clk (void *opaque, >>>> uint32_t freq) >>>> cpu_ppc_store_purr(cpu, 0x0000000000000000ULL); >>>> } >>>> +/* >>>> + * Calculate timebase on the destination side of migration >>>> + * >>>> + * We calculate new timebase offset as shown below: >>>> + * 1) Gtb2 = Gtb1 + max(tod2 - tod1, 0) >>>> + * Gtb2 = tb2 + off2 >>>> + * 2) tb2 + off2 = Gtb1 + max(tod2 - tod1, 0) >>>> + * 3) off2 = Gtb1 - tb2 + max(tod2 - tod1, 0) >>>> + * >>>> + * where: >>>> + * Gtb2 - destination guest timebase >>>> + * tb2 - destination host timebase >>>> + * off2 - destination timebase offset >>>> + * tod2 - destination time of the day >>>> + * Gtb1 - source guest timebase >>>> + * tod1 - source time of the day >>>> + * >>>> + * The result we want is in @off2 >>>> + * >>>> + * Two conditions must be met for @off2: >>>> + * 1) off2 must be multiple of 2^24 ticks as it will be set via TBU40 SPR >>>> + * 2) Gtb2 >= Gtb1 >>>> + */ >>>> +static int64_t cpu_ppc_adjust_tb_offset(PPCTimebaseOffset *tb) >>>> +{ >>>> + uint64_t tb2, tod2; >>>> + int64_t off2; >>>> + int ratio = tb->freq / 1000000; >>>> + struct timeval tv; >>>> + >>>> + tb2 = cpu_get_real_ticks(); >>>> + gettimeofday(&tv, NULL); >>>> + tod2 = tv.tv_sec * 1000000 + tv.tv_usec; >>>> + >>>> + off2 = tb->guest_timebase - tb2; >>>> + if ((tod2 > tb->time_of_the_day) && >>>> + (tod2 - tb->time_of_the_day < 1000000)) { >>>> + off2 += (tod2 - tb->time_of_the_day) * ratio; >>>> + } >>>> + off2 = ROUND_UP(off2, 1 << 24); >>>> + >>>> + return off2; >>>> +} >>> I *think* what you're trying to say here is that you want >>> >>> assert(source_timebase_freq == timebase_freq); >>> >>> migration_duration_ns = host_ns - source_host_ns; >>> guest_tb = source_guest_tb + ns_scaled_to_tb(min(0, migration_duration_ns); >>> kvm_set_guest_tb(guest_tb); >>> -> kvm_set_one_reg(KVM_REG_PPC_TB_OFFSET, guest_tb - mftb()); >>> >>> But I honestly have not managed to read that from the code. Either this >>> really is what you're trying to do and the code is just very hard to read >>> (which means it needs to be written more easily) or you're doing something >>> different which I don't understand. >> >> Is this any better? >> >> static int64_t cpu_ppc_adjust_tb_offset(PPCTimebaseOffset *tb) >> { >> struct timeval tv; >> int64_t migration_duration_ns, migration_duration_tb; > > If I read the code correctly you're operating in us, not ns, no? > >> int64_t guest_tb, host_ns; >> int ratio = tb->freq / 1000000; > > #define USEC_PER_SEC 1000000 > > You're also losing quite a bit of precision here, no?
Am I? For tb_freq==512MHz and the way I use it below? > >> int64_t off; >> >> gettimeofday(&tv, NULL); >> host_ns = tv.tv_sec * 1000000 + tv.tv_usec; > > host_us = get_clock_realtime() / 1000; ? > >> migration_duration_ns = MIN(1000000, > > Why is it MIN(1000000)? Is a migration supposed to last at least 1sec? Why? It should probably be max_downtime (30ms which I am trying to change to 300ms in another patch), something like that. And if it is bigger than 1 seconds, it is either veeeery slow connection or time is not synchronized between hosts, and we try to be precise only if time is in sync. >> host_ns - tb->time_of_the_day); >> migration_duration_tb = migration_duration_ns * ratio; >> >> guest_tb = tb->guest_timebase + MIN(0, migration_duration_tb); >> >> off = guest_tb - cpu_get_real_ticks(); > > It's probably easier to read when you create one function that just returns > a guest TB value adjusted by the time the last measurement happened. The > fact that the KVM register wants an offset is a KVM implementation detail. > The TB adjustment should happen generically. > >> >> return off; >> } >> >> >>> We also designed the PPC_TB_OFFSET ONE_REG in a way that it always rounds >>> up to its 40 bit granularity, so no need to do this in QEMU. In fact, we >>> don't want to do it in QEMU in case there will be a more fine-grained SPR >>> in the future. >> I believe rounding was not in the kernel when I started making this... >> >> >>> And from all I understand the timebase frequency is now architecturally >>> specified, so it won't change for newer cores, no? >> I asked people in our lab. Everyone says that it should not change but >> noone would bet on it too much. > > When it changes and you want to live migrate, you'll need to implement a > guest TB scale register and the whole idea of a "TB offset" ONE_REG is absurd. > > The more I think about this the more I realize we should have created a > "guest TB value", not a "guest TB offset" ONE_REG. > >> >> >>> And if we migrate TCG >>> guests it will be the same between two hosts. >> And G5 uses 33333333. I really do not understand why it is bad to >> send-and-check timer frequency. Why? > > Because the guest will continue to run at a different TB frequency on the > new host and break. > >> Is the rest ok? Thanks for review! > > Not sure. Please rework everything according to the comments, make the code > readable enough that your wife understands it and then resend it :). I am about to post v5, please take a look. Thanks! -- Alexey