> Btw, your prior response was too hard to properly read, due to excess blank
> lines with at the same time squashed leading blanks. Together with your
> apparent inability to avoid top-posting, I think you really want to adjust
> your mail program's configuration.
I suggest we skip the discussion of formatting and the number of spaces in
messages and instead focus on the topic of the thread. I already have a hard
time troubleshooting hard-to-reproduce bugs; that their descriptions are hard
to read because of the number of spaces is the least of the difficulties.

> That invocation of get_s_time_fixed() reduces to scale_delta() (without
> further rdtsc_ordered()), as non-zero at_tsc is passed in all cases. IOW
> it's not quite clear to me what change you are suggesting (that would
> actually make a functional difference).

Replacing get_s_time_fixed() with scale_delta() removes the dependency of the
calculation on the previous local_stime value, and it is that dependency which
accumulates lag between cores. The reason is that rdtsc_ordered() is not
executed on all cores at the same moment; each core samples the TSC offset by
the IPI delivery latency (ipi_speed below). So we get:

core0: current_rdtsc
core1: current_rdtsc + ipi_speed
coreN: current_rdtsc + ipi_speed * N

Since the IPIs are sent out one after another in a loop starting from core0,
with get_s_time_fixed() each core's local_stime is computed as:

coreN: local_stime = local_stime + scale_delta((current_rdtsc + ipi_speed * N) - local_tsc)

so the time on core N differs by ipi_speed * N. And because the previous
local_stime is used as the base, and that previous value was itself already
offset, the difference accumulates from calibration to calibration.

With scale_delta() we instead get:

coreN: local_stime = scale_delta(current_rdtsc + ipi_speed * N)

There is still a per-core difference, but it no longer accumulates and the
offsets stay within normal limits.

In case it is still unclear: if local_stime in get_s_time_fixed() is already
offset relative to the other cores, then the fact that rdtsc_ordered() and
local_tsc are not offset changes nothing, because the delta is applied on top
of local_stime:

core0_local_stime + (rdtsc_ordered() - local_tsc) !=
core1_local_stime + (rdtsc_ordered() - local_tsc)
// even if rdtsc_ordered() and local_tsc are equal across cores

On 96-core configurations, up to a millisecond of skew can accumulate in
local_stime over a week of operation, and that is a significant difference.
This is because I boot with cpufreq=xen:performance max_cstate=1, which means
that in my configuration local_stime is never overwritten by master_stime.

(A small stand-alone sketch contrasting the chained and the direct conversion
is appended at the very end of this mail, below the quoted thread.)

Thanks.

On Mon, Jan 12, 2026 at 2:45 PM Jan Beulich <[email protected]> wrote:
> On 12.01.2026 11:31, Anton Markov wrote:
> > Bit rounding isn't the main issue; the difference in ipi delivery to the
> > cores accumulates due to the ordering. Replacing get_s_time_fixed with
> > scale_delta in time_calibration_rendezvous_tail should be sufficient. This
> > configuration won't accumulate errors, but bit rounding can still cause a
> > significant difference from calibration to calibration, but it's not as
> > significant.
>
> That invocation of get_s_time_fixed() reduces to scale_delta() (without
> further rdtsc_ordered()), as non-zero at_tsc is passed in all cases. IOW
> it's not quite clear to me what change you are suggesting (that would
> actually make a functional difference).
>
> Btw, your prior response was too hard to properly read, due to excess blank
> lines with at the same time squashed leading blanks. Together with your
> apparent inability to avoid top-posting, I think you really want to adjust
> your mail program's configuration.
>
> Jan
>
> > On Fri, Jan 9, 2026 at 7:11 PM Anton Markov <[email protected]> wrote:
> >
> >> You're right.
> >> These aren't interrupts in get_s_time_fixed, but rather a cumulative
> >> error in the sequence due to integer rounding. I added logging of the
> >> current local_stime to local_time_calibration and noticed that the
> >> timestamp between cores is gradually increasing. If the server has been
> >> running for weeks, this could be a very large value.
> >>
> >> ```
> >> @@ -1732,6 +1753,8 @@ static void cf_check local_time_calibration(void)
> >>      if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
> >>      {
> >>          /* Atomically read cpu_calibration struct and write cpu_time struct. */
> >> +        printk("update stime on time calibrate %d, %lu -> %lu (%lu, %lu)\n", smp_processor_id(), t->stamp.local_stime, c->local_stime,
> >> +               t->last_seen_ns, t->last_seen_tsc);
> >>          local_irq_disable();
> >>          t->stamp = *c;
> >>          local_irq_enable();
> >> ```
> >>
> >> 2 hours of work:
> >>
> >> ```
> >> (XEN) update stime on time calibrate 0, 8564145820102 -> 8565145861597 (8565145862216, 0)
> >> (XEN) update stime on time calibrate 1, 8564145820129 -> 8565145861609 (8565145863957, 0)
> >> (XEN) update stime on time calibrate 3, 8564145819996 -> 8565145861491 (8565145864800, 0)
> >> (XEN) update stime on time calibrate 2, 8564145820099 -> 8565145861609 (8565145865372, 0)
> >>
> >> 8565145861609 - 8565145861491 = 115 * 3 (3.00 GHz) = 345 lag
> >> ```
> >>
> >> 6 hours of work:
> >>
> >> ```
> >> (XEN) update stime on time calibrate 0, 22914730829200 -> 22915730869993 (22915730870665, 0)
> >> (XEN) update stime on time calibrate 1, 22914730829073 -> 22915730869889 (22915730870693, 0)
> >> (XEN) update stime on time calibrate 2, 22914730829052 -> 22915730869841 (22915730872231, 0)
> >> (XEN) update stime on time calibrate 3, 22914730828892 -> 22915730869696 (22915730872096, 0)
> >>
> >> 22915730869993 - 22915730869696 = 297 * 3 (3.00 GHz) = 891 lag
> >> ```
> >>
> >> Clarification, according to my xen settings:
> >>
> >> ```
> >> ucode=scan dom0_mem=53923M,max:53923M dom0_max_vcpus=4-96 dom0_vcpus_pin=0
> >> force-ept=1 ept=no-ad,no-pml hap_1gb=0 hap_2mb=0 altp2m=1
> >> hpet=legacy-replacement smt=1 spec-ctrl=no gnttab_max_frames=512
> >> cpufreq=xen:performance max_cstate=1 sched=credit sched-gran=cpu apicv=0
> >> sched_credit_tslice_ms=5 sched_ratelimit_us=500
> >> ```
> >>
> >> Processors I tested on:
> >>
> >> ```
> >> Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz
> >>
> >> Flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx
> >> fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl
> >> nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor est ssse3 cx16
> >> sse4_1 sse4_2 popcnt aes xsave avx f16c hypervisor lahf_lm cpuid_fault
> >> fsgsbase erms xsaveopt arch_capabilities
> >> ```
> >>
> >> ```
> >> Intel(R) Xeon(R) Gold 5318Y CPU @ 2.10GHz x2 (NUMA)
> >>
> >> Flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx
> >> fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl
> >> nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor est ssse3 fma cx16
> >> sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm
> >> 3dnowprefetch cpuid_fault ibpb fsgsbase bmi1 avx2 bmi2 erms rtm avx512f
> >> avx512dq rdseed adx avx512ifma clflushopt clwb avx512cd sha_ni avx512bw
> >> avx512vl xsaveopt xsavec xgetbv1 avx512vbmi avx512_vbmi2 gfni vaes
> >> vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear
> >> arch_capabilities
> >> ```
> >>
> >> Next I moved the code to r3 to speed up playback:
> >>
> >> ```
> >> #include <stdint.h>
> >> #include <stdio.h>
> >>
> >> static __inline__ unsigned long long rdtsc(void)
> >> {
> >>     unsigned hi, lo;
> >>     __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
> >>     return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
> >> }
> >>
> >> #define MILLISECS(_ms) ((int64_t)((_ms) * 1000000ULL))
> >>
> >> struct cpu_time_stamp {
> >>     uint64_t local_tsc;
> >>     int64_t local_stime;
> >>     int64_t master_stime;
> >> };
> >>
> >> struct time_scale {
> >>     int shift;
> >>     uint32_t mul_frac;
> >> };
> >>
> >> static inline uint32_t div_frac(uint32_t dividend, uint32_t divisor)
> >> {
> >>     uint32_t quotient, remainder;
> >>
> >>     asm (
> >>         "divl %4"
> >>         : "=a" (quotient), "=d" (remainder)
> >>         : "0" (0), "1" (dividend), "r" (divisor) );
> >>
> >>     return quotient;
> >> }
> >>
> >> void set_time_scale(struct time_scale *ts, uint64_t ticks_per_sec)
> >> {
> >>     uint64_t tps64 = ticks_per_sec;
> >>     uint32_t tps32;
> >>     int shift = 0;
> >>
> >>     while ( tps64 > (MILLISECS(1000)*2) )
> >>     {
> >>         tps64 >>= 1;
> >>         shift--;
> >>     }
> >>
> >>     tps32 = (uint32_t)tps64;
> >>     while ( tps32 <= (uint32_t)MILLISECS(1000) )
> >>     {
> >>         tps32 <<= 1;
> >>         shift++;
> >>     }
> >>
> >>     ts->mul_frac = div_frac(MILLISECS(1000), tps32);
> >>     ts->shift = shift;
> >> }
> >>
> >> uint64_t scale_delta(uint64_t delta, const struct time_scale *scale)
> >> {
> >>     uint64_t product;
> >>
> >>     if ( scale->shift < 0 )
> >>         delta >>= -scale->shift;
> >>     else
> >>         delta <<= scale->shift;
> >>
> >>     asm (
> >>         "mulq %2 ; shrd $32,%1,%0"
> >>         : "=a" (product), "=d" (delta)
> >>         : "rm" (delta), "0" ((uint64_t)scale->mul_frac) );
> >>
> >>     return product;
> >> }
> >>
> >> #define _TS_MUL_FRAC_IDENTITY 0x80000000UL
> >>
> >> static inline struct time_scale scale_reciprocal(struct time_scale scale)
> >> {
> >>     struct time_scale reciprocal;
> >>     uint32_t dividend;
> >>
> >>     dividend = _TS_MUL_FRAC_IDENTITY;
> >>     reciprocal.shift = 1 - scale.shift;
> >>     while ( dividend >= scale.mul_frac )
> >>     {
> >>         dividend >>= 1;
> >>         reciprocal.shift++;
> >>     }
> >>
> >>     asm (
> >>         "divl %4"
> >>         : "=a" (reciprocal.mul_frac), "=d" (dividend)
> >>         : "0" (0), "1" (dividend), "r" (scale.mul_frac) );
> >>
> >>     return reciprocal;
> >> }
> >>
> >> int64_t get_s_time_fixed(const struct cpu_time_stamp *t, const struct time_scale *ts, uint64_t at_tsc)
> >> {
> >>     uint64_t tsc, delta;
> >>
> >>     if ( at_tsc )
> >>         tsc = at_tsc;
> >>     else
> >>         tsc = rdtsc();
> >>     delta = tsc - t->local_tsc;
> >>     return t->local_stime + scale_delta(delta, ts);
> >> }
> >>
> >> int main() {
> >>     struct cpu_time_stamp ct;
> >>     struct time_scale ts;
> >>     struct time_scale back;
> >>
> >>     uint64_t start = rdtsc();
> >>
> >>     set_time_scale(&ts, 3000000000);
> >>     back = scale_reciprocal(ts);
> >>
> >>     ct.local_tsc = start;
> >>     ct.local_stime = scale_delta(start, &ts);
> >>
> >>     while (1) {
> >>         uint64_t new_tsc = rdtsc();
> >>         ct.local_stime = get_s_time_fixed(&ct, &ts, new_tsc);
> >>         ct.local_tsc = new_tsc;
> >>
> >>         uint64_t tmp_tsc = rdtsc();
> >>         printf("%lu %lu\n", tmp_tsc,
> >>                scale_delta(get_s_time_fixed(&ct, &ts, tmp_tsc), &back));
> >>     }
> >>
> >>     return 0;
> >> }
> >> ```
> >>
> >> It's enough to run the loop for 10-20 seconds to see the problem.
> >> scale_delta rounds the least significant bits during calculation, which
> >> causes the error to accumulate, at different rates on different cores,
> >> depending on the least significant bits at the time of calibration.
> >>
> >> Now let's talk about why dwm reacts this way. When a snapshot is reversed,
> >> last_guest_time in hvm_get_guest_time_fixed is set to 0, which doesn't
> >> prevent time from flowing backwards. This means that cache_tsc_offset in
> >> hvm_get_guest_tsc_fixed may be configured correctly on one physical core,
> >> but due to shedding on a physical core with a lagging tsc, the guest may
> >> occasionally see a tsc value lower than it saw before the snapshot was
> >> reversed. If this happens to the dwm code, it terminates with an error.
> >>
> >> A quick solution to this problem might be to save the last_seen_tsc
> >> parameter in a snapshot for each core, followed by validation.
> >>
> >> The correct solution is to remove the rounding of the least significant
> >> bits from the sequence.
> >>
> >> On Wed, Jan 7, 2026 at 11:02 AM Jan Beulich <[email protected]> wrote:
> >>
> >>> On 06.01.2026 21:10, Антон Марков wrote:
> >>>> Hi, I'm not sure about the other places. In hvm_load_cpu_ctxt
> >>>> (xen/arch/x86/hvm/hvm.c), it was easy to catch because
> >>>> process_pending_softirqs is frequently called there, which in turn
> >>>> processes softirqs from the timer (where the timestamp is updated).
> >>>> After I fixed sync_tsc in hvm_load_cpu_ctxt, the problem stopped
> >>>> reproducing under no load. However, when the number of vCPUs is 4 times
> >>>> greater than the number of CPUs (under heavy load), the problem rarely
> >>>> reoccurs (mostly during snapshot restores during
> >>>> process_pending_softirqs calls), and this is no longer a simple case. If
> >>>> get_s_time_fixed can indeed be interrupted during execution after
> >>>> rdtsc_ordered, then the current fix is insufficient. It's necessary to
> >>>> atomically copy "t->stamp" to the stack using local_irq_disable and
> >>>> local_irq_enable (as in local_time_calibration), and then work with the
> >>>> copy, confident in its lifetime and immutability. But until
> >>>> get_s_time_fixed is proven to be interruptible, this is premature, so
> >>>> your fix is sufficient. I think I need more information and testing to
> >>>> say more.
> >>>
> >>> While the cpu_calibration per-CPU variable is updated from IRQ context,
> >>> the cpu_time one isn't. Hence t->stamp's contents cannot change behind
> >>> the back of get_s_time_fixed(). I wonder whether ...
> >>>
> >>>> Regarding the other scale_delta calls, if they include values
> >>>> calculated from externally saved tsc values that could have become
> >>>> stale during the process_pending_softirqs call, this definitely needs
> >>>> to be fixed.
> >>>
> >>> ... another similar issue (possibly one not included in the set of
> >>> remarks I have in the patch, as none of those look related to what you
> >>> describe) might be causing the remaining, more rare problems you say you
> >>> see. That set of remarks is actually a result of me going over all other
> >>> scale_delta() calls, but of course I may have got the analysis wrong.
> >>>
> >>> As to using 4 times as many vCPU-s as there are pCPU-s (and then heavy
> >>> load) - while I don't think we have a support statement for such upstream
> >>> (when probably we should), iirc for our (SUSE's) products we would
> >>> consider that unsupported. Just fyi.
> >>>
> >>> Also, btw, please don't top-post.
> >>>
> >>> Jan
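
P.S. The sketch referenced near the top of this mail: a minimal, stand-alone
user-space program (not Xen code) contrasting the chained local_stime update
with a single direct conversion. It only models the integer truncation that
each scale_delta()-style step introduces, i.e. the rounding aspect of the
accumulation discussed earlier in the thread; the 3.00 GHz frequency, the
one-second calibration period and the fixed per-step remainder are
illustrative assumptions, and it relies on unsigned __int128 (GCC/Clang).

```
#include <stdint.h>
#include <stdio.h>

/*
 * ns = cycles * 10^9 / (3 * 10^9), expressed the way set_time_scale() above
 * would for a 3.00 GHz TSC: shift = -1, mul_frac = floor(2^32 * 10^9 / 1.5e9).
 */
#define MUL_FRAC 2863311530u    /* ~(2/3) * 2^32 */

/* Truncating fixed-point conversion, standing in for scale_delta(). */
static uint64_t scale_delta_like(uint64_t delta)
{
    return (uint64_t)(((unsigned __int128)(delta >> 1) * MUL_FRAC) >> 32);
}

int main(void)
{
    const uint64_t period = 3000000000ULL; /* one second worth of 3 GHz cycles */
    const unsigned steps  = 7 * 24 * 3600; /* one calibration per second, one week */
    const uint64_t extra  = 1236;          /* illustrative fixed per-step remainder */

    uint64_t chained = 0, total_cycles = 0;

    for ( unsigned i = 0; i < steps; i++ )
    {
        uint64_t delta = period + extra;         /* what this core sees per step */
        chained      += scale_delta_like(delta); /* chained: truncates every step */
        total_cycles += delta;
    }

    /* Direct: convert the whole interval once; truncation happens only once. */
    uint64_t direct = scale_delta_like(total_cycles);

    printf("chained %llu ns, direct %llu ns, accumulated loss %llu ns\n",
           (unsigned long long)chained, (unsigned long long)direct,
           (unsigned long long)(direct - chained));
    return 0;
}
```

With the chained scheme every step discards a fraction of a nanosecond, so
over a week of one-second steps the loss grows to several hundred microseconds
in this particular setup, while the single direct conversion truncates only
once and stays within a nanosecond of the exact value.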
