>
> Btw, your prior response was too hard to properly read, due to excess blank
> lines with at the same time squashed leading blanks. Together with your
> apparent inability to avoid top-posting, I think you really want to adjust
> your mail program's configuration.

I suggest we set aside the discussion of formatting and the number of spaces
in messages and focus on the topic of the thread. I already have a hard
enough time troubleshooting hard-to-reproduce bugs; that their descriptions
are awkward to read because of spacing is probably the least of the
difficulties.

> That invocation of get_s_time_fixed() reduces to scale_delta() (without
> further rdtsc_ordered()), as non-zero at_tsc is passed in all cases. IOW
> it's not quite clear to me what change you are suggesting (that would
> actually make a functional difference).

Replacing get_s_time_fixed with scale_delta removes the dependency of the
calculation on the previous local_stime value, which is what accumulates lag
between cores. The reason is that rdtsc_ordered is not executed at the same
moment on all cores: each core's read is offset by the IPI delivery latency.
So we get:

core0: current_rdtsc;
core1: current_rdtsc + ipi_latency;
coreN: current_rdtsc + ipi_latency * N;

Since the IPIs are sent out from core0 to each core in turn, in the version
with get_s_time_fixed the local_stime calculation takes the following form:

coreN: local_stime = local_stime + scale_delta((current_rdtsc + ipi_latency * N) - local_rdtsc);

This means the time on each core will differ by ipi_latency * N. And since
we use the previous local_stime value, the difference accumulates, because
the previous local_stime was itself already offset. In the version with
scale_delta we get:

coreN: local_stime = scale_delta(current_rdtsc + ipi_latency * N);

This means there is still a per-core difference, but it does not accumulate,
and the offsets stay within normal limits.
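
To make the accumulation mechanism concrete, here is a small standalone
simulation (not Xen code; the 3 GHz-to-ns scale factor, the 300-cycle IPI
offset, and the pseudo-jitter are made-up illustrative values, and it needs
a compiler with unsigned __int128). It computes local_stime both ways for a
few simulated cores whose calibration TSC reads are skewed by an IPI-like
offset: the chained variant drifts apart as rounds accumulate, while
rescaling the raw TSC stays bounded.

```
#include <stdint.h>
#include <stdio.h>

#define NR_CPUS 4
#define ROUNDS  604800               /* one calibration per second for a week */
#define PERIOD  3000000000ULL        /* 3 GHz TSC, 1 s between calibrations */

/* Simplified stand-in for scale_delta(): 3 GHz cycles -> ns, truncating
 * the low bits the same way the mulq/shrd sequence does. */
static uint64_t scale(uint64_t cycles)
{
    return (uint64_t)(((unsigned __int128)cycles * 0x55555555u) >> 32);
}

/* Deterministic pseudo-jitter standing in for variable IPI delivery time. */
static uint64_t jitter(uint32_t *seed)
{
    *seed = *seed * 1664525u + 1013904223u;
    return *seed % 300;              /* 0..299 extra cycles */
}

int main(void)
{
    uint64_t tsc = 0;
    uint64_t chained[NR_CPUS] = { 0 }, stamp_tsc[NR_CPUS] = { 0 };
    uint64_t absolute[NR_CPUS] = { 0 };
    uint32_t seed[NR_CPUS] = { 1, 2, 3, 4 };

    for ( unsigned r = 0; r < ROUNDS; r++ )
    {
        tsc += PERIOD;
        for ( unsigned cpu = 0; cpu < NR_CPUS; cpu++ )
        {
            /* Each core sees the calibration TSC a bit later than core0. */
            uint64_t now = tsc + 300ULL * cpu + jitter(&seed[cpu]);

            /* get_s_time_fixed()-style: add a truncated delta onto the
             * previous local_stime, so the truncation error accumulates. */
            chained[cpu] += scale(now - stamp_tsc[cpu]);
            stamp_tsc[cpu] = now;

            /* scale_delta()-on-the-raw-TSC style: a single truncation. */
            absolute[cpu] = scale(now);
        }
    }

    uint64_t cmin = chained[0], cmax = chained[0];
    uint64_t amin = absolute[0], amax = absolute[0];
    for ( unsigned cpu = 1; cpu < NR_CPUS; cpu++ )
    {
        if ( chained[cpu] < cmin ) cmin = chained[cpu];
        if ( chained[cpu] > cmax ) cmax = chained[cpu];
        if ( absolute[cpu] < amin ) amin = absolute[cpu];
        if ( absolute[cpu] > amax ) amax = absolute[cpu];
    }

    /* Spread across cores after ROUNDS calibrations, plus how far core0's
     * chained value has fallen behind a direct rescale of its own TSC. */
    printf("chained  spread: %lld ns, core0 deficit: %lld ns\n",
           (long long)(cmax - cmin),
           (long long)(scale(stamp_tsc[0]) - chained[0]));
    printf("absolute spread: %lld ns\n", (long long)(amax - amin));
    return 0;
}
```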

If it's still unclear: if your local_stime in get_s_time_fixed is offset
relative to the other cores, then the fact that rdtsc_ordered and local_tsc
are not offset changes nothing, because the delta is applied on top of
local_stime:

core0_local_stime + (rdtsc_ordered() - local_tsc) !=
core1_local_stime + (rdtsc_ordered() - local_tsc);
// even if rdtsc_ordered() and local_tsc are equal across cores

On 96-core configurations, up to a millisecond of lag can accumulate in
local_stime over a week of operation, which is a significant difference.
This is because I use cpufreq=xen:performance max_cstate=1, meaning that in
my configuration local_stime is never overwritten by master_stime.
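
In terms of the userspace reproduction program quoted further down, the
change I'm suggesting amounts to the sketch below (a sketch only; in Xen
itself the equivalent spot is where time_calibration_rendezvous_tail() fills
in c->local_stime). It assumes system time and the TSC share an epoch, which
the toy program guarantees by initializing local_stime = scale_delta(start, &ts).

```
/* Sketch against the reproduction program further down, not against
 * xen/arch/x86/time.c: derive local_stime from the raw TSC at every
 * calibration instead of chaining it off the previous stamp. */
while (1) {
    uint64_t new_tsc = rdtsc();

    /* was: ct.local_stime = get_s_time_fixed(&ct, &ts, new_tsc); */
    ct.local_stime = scale_delta(new_tsc, &ts);
    ct.local_tsc = new_tsc;

    uint64_t tmp_tsc = rdtsc();
    printf("%lu %lu\n", tmp_tsc,
           scale_delta(get_s_time_fixed(&ct, &ts, tmp_tsc), &back));
}
```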

Thanks.

On Mon, Jan 12, 2026 at 2:45 PM Jan Beulich <[email protected]> wrote:

> On 12.01.2026 11:31, Anton Markov wrote:
> > Bit rounding isn't the main issue; the difference in ipi delivery to the
> > cores accumulates due to the ordering. Replacing get_s_time_fixed with
> > scale_delta in time_calibration_rendezvous_tail should be sufficient.
> This
> > configuration won't accumulate errors, but bit rounding can still cause a
> > significant difference from calibration to calibration, but it's not as
> > significant.
>
> That invocation of get_s_time_fixed() reduces to scale_delta() (without
> further rdtsc_ordered()), as non-zero at_tsc is passed in all cases. IOW
> it's not quite clear to me what change you are suggesting (that would
> actually make a functional difference).
>
> Btw, your prior response was too hard to properly read, due to excess blank
> lines with at the same time squashed leading blanks. Together with your
> apparent inability to avoid top-posting, I think you really want to adjust
> your mail program's configuration.
>
> Jan
>
> > On Fri, Jan 9, 2026 at 7:11 PM Anton Markov <[email protected]>
> wrote:
> >
> >> You're right. These aren't interrupts in get_s_time_fixed, but rather a
> >> cumulative error in the sequence due to integer rounding. I added
> logging
> >> of the current local_stime to local_time_calibration and noticed that
> the
> >> timestamp between cores is gradually increasing. If the server has been
> >> running for weeks, this could be a very large value.
> >>
> >>
> >> ```
> >> @@ -1732,6 +1753,8 @@ static void cf_check local_time_calibration(void)
> >>      if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
> >>      {
> >>          /* Atomically read cpu_calibration struct and write cpu_time struct. */
> >> +        printk("update stime on time calibrate %d, %lu -> %lu (%lu, %lu)\n", smp_processor_id(), t->stamp.local_stime, c->local_stime,
> >> +               t->last_seen_ns, t->last_seen_tsc);
> >>          local_irq_disable();
> >>          t->stamp = *c;
> >>          local_irq_enable();
> >> ```
> >>
> >>
> >> 2 hours of work:
> >>
> >> ```
> >> (XEN) update stime on time calibrate 0, 8564145820102 -> 8565145861597 (8565145862216, 0)
> >> (XEN) update stime on time calibrate 1, 8564145820129 -> 8565145861609 (8565145863957, 0)
> >> (XEN) update stime on time calibrate 3, 8564145819996 -> 8565145861491 (8565145864800, 0)
> >> (XEN) update stime on time calibrate 2, 8564145820099 -> 8565145861609 (8565145865372, 0)
> >>
> >> 8565145861609 - 8565145861491 = 115 * 3 (3.00 GHz) = 345 lag
> >> ```
> >>
> >> 6 hours of work:
> >>
> >> ```
> >> (XEN) update stime on time calibrate 0, 22914730829200 -> 22915730869993 (22915730870665, 0)
> >> (XEN) update stime on time calibrate 1, 22914730829073 -> 22915730869889 (22915730870693, 0)
> >> (XEN) update stime on time calibrate 2, 22914730829052 -> 22915730869841 (22915730872231, 0)
> >> (XEN) update stime on time calibrate 3, 22914730828892 -> 22915730869696 (22915730872096, 0)
> >>
> >> 22915730869993 - 22915730869696 = 297 * 3 (3.00 GHz) = 891 lag
> >> ```
> >>
> >>
> >> Clarification, according to my xen settings:
> >>
> >> ```
> >> ucode=scan dom0_mem=53923M,max:53923M dom0_max_vcpus=4-96 dom0_vcpus_pin=0
> >> force-ept=1 ept=no-ad,no-pml hap_1gb=0 hap_2mb=0 altp2m=1
> >> hpet=legacy-replacement smt=1 spec-ctrl=no gnttab_max_frames=512
> >> cpufreq=xen:performance max_cstate=1 sched=credit sched-gran=cpu apicv=0
> >> sched_credit_tslice_ms=5 sched_ratelimit_us=500
> >> ```
> >>
> >>
> >> Processors I tested on:
> >>
> >> ```
> >> Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz
> >>
> >> Flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx
> >> fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl
> >> nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor est ssse3 cx16
> >> sse4_1 sse4_2 popcnt aes xsave avx f16c hypervisor lahf_lm cpuid_fault
> >> fsgsbase erms xsaveopt arch_capabilities
> >> ```
> >>
> >> ```
> >> Intel(R) Xeon(R) Gold 5318Y CPU @ 2.10GHz x2 (NUMA)
> >>
> >> Flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx
> >> fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl
> >> nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor est ssse3 fma cx16
> >> sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm
> >> 3dnowprefetch cpuid_fault ibpb fsgsbase bmi1 avx2 bmi2 erms rtm avx512f
> >> avx512dq rdseed adx avx512ifma clflushopt clwb avx512cd sha_ni avx512bw
> >> avx512vl xsaveopt xsavec xgetbv1 avx512vbmi avx512_vbmi2 gfni vaes
> >> vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear
> >> arch_capabilities
> >> ```
> >>
> >>
> >> Next I moved the code to r3 to speed up playback:
> >>
> >>
> >> ```
> >> #include <stdint.h>
> >> #include <stdio.h>
> >>
> >> static __inline__ unsigned long long rdtsc(void)
> >> {
> >>     unsigned hi, lo;
> >>     __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
> >>     return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
> >> }
> >>
> >> #define MILLISECS(_ms) ((int64_t)((_ms) * 1000000ULL))
> >>
> >> struct cpu_time_stamp {
> >>     uint64_t local_tsc;
> >>     int64_t local_stime;
> >>     int64_t master_stime;
> >> };
> >>
> >> struct time_scale {
> >>     int shift;
> >>     uint32_t mul_frac;
> >> };
> >>
> >> static inline uint32_t div_frac(uint32_t dividend, uint32_t divisor)
> >> {
> >>     uint32_t quotient, remainder;
> >>
> >>     asm (
> >>         "divl %4"
> >>         : "=a" (quotient), "=d" (remainder)
> >>         : "0" (0), "1" (dividend), "r" (divisor) );
> >>
> >>     return quotient;
> >> }
> >>
> >> void set_time_scale(struct time_scale *ts, uint64_t ticks_per_sec)
> >> {
> >>     uint64_t tps64 = ticks_per_sec;
> >>     uint32_t tps32;
> >>     int shift = 0;
> >>
> >>     while ( tps64 > (MILLISECS(1000)*2) )
> >>     {
> >>         tps64 >>= 1;
> >>         shift--;
> >>     }
> >>
> >>     tps32 = (uint32_t)tps64;
> >>     while ( tps32 <= (uint32_t)MILLISECS(1000) )
> >>     {
> >>         tps32 <<= 1;
> >>         shift++;
> >>     }
> >>
> >>     ts->mul_frac = div_frac(MILLISECS(1000), tps32);
> >>     ts->shift = shift;
> >> }
> >>
> >> uint64_t scale_delta(uint64_t delta, const struct time_scale *scale)
> >> {
> >>     uint64_t product;
> >>
> >>     if ( scale->shift < 0 )
> >>         delta >>= -scale->shift;
> >>     else
> >>         delta <<= scale->shift;
> >>
> >>     asm (
> >>         "mulq %2 ; shrd $32,%1,%0"
> >>         : "=a" (product), "=d" (delta)
> >>         : "rm" (delta), "0" ((uint64_t)scale->mul_frac) );
> >>
> >>     return product;
> >> }
> >>
> >> #define _TS_MUL_FRAC_IDENTITY 0x80000000UL
> >>
> >> static inline struct time_scale scale_reciprocal(struct time_scale scale)
> >> {
> >>     struct time_scale reciprocal;
> >>     uint32_t dividend;
> >>
> >>     dividend = _TS_MUL_FRAC_IDENTITY;
> >>     reciprocal.shift = 1 - scale.shift;
> >>     while ( dividend >= scale.mul_frac )
> >>     {
> >>         dividend >>= 1;
> >>         reciprocal.shift++;
> >>     }
> >>
> >>     asm (
> >>         "divl %4"
> >>         : "=a" (reciprocal.mul_frac), "=d" (dividend)
> >>         : "0" (0), "1" (dividend), "r" (scale.mul_frac) );
> >>
> >>     return reciprocal;
> >> }
> >>
> >> int64_t get_s_time_fixed(const struct cpu_time_stamp *t,
> >>                          const struct time_scale *ts, uint64_t at_tsc)
> >> {
> >>     uint64_t tsc, delta;
> >>
> >>     if ( at_tsc )
> >>         tsc = at_tsc;
> >>     else
> >>         tsc = rdtsc();
> >>     delta = tsc - t->local_tsc;
> >>     return t->local_stime + scale_delta(delta, ts);
> >> }
> >>
> >> int main() {
> >>     struct cpu_time_stamp ct;
> >>     struct time_scale ts;
> >>     struct time_scale back;
> >>
> >>     uint64_t start = rdtsc();
> >>
> >>     set_time_scale(&ts, 3000000000);
> >>     back = scale_reciprocal(ts);
> >>
> >>     ct.local_tsc = start;
> >>     ct.local_stime = scale_delta(start, &ts);
> >>
> >>     while (1) {
> >>         uint64_t new_tsc = rdtsc();
> >>         ct.local_stime = get_s_time_fixed(&ct, &ts, new_tsc);
> >>         ct.local_tsc = new_tsc;
> >>
> >>         uint64_t tmp_tsc = rdtsc();
> >>         printf("%lu %lu\n", tmp_tsc,
> >>                scale_delta(get_s_time_fixed(&ct, &ts, tmp_tsc), &back));
> >>     }
> >>
> >>     return 0;
> >> }
> >> ```
> >>
> >>
> >> It's enough to run the loop for 10-20 seconds to see the problem.
> >> scale_delta rounds the least significant bits during calculation, which
> >> causes the error to accumulate, at different rates on different cores,
> >> depending on the least significant bits at the time of calibration.
> >>
> >>
> >> Now let's talk about why dwm reacts this way. When a snapshot is
> >> reverted, last_guest_time in hvm_get_guest_time_fixed is set to 0, which
> >> doesn't prevent time from flowing backwards. This means that
> >> cache_tsc_offset in hvm_get_guest_tsc_fixed may be configured correctly
> >> on one physical core, but after being scheduled onto a physical core
> >> with a lagging tsc, the guest may occasionally see a tsc value lower
> >> than it saw before the snapshot was reverted. If this happens to the dwm
> >> code, it terminates with an error.
> >>
> >>
> >> A quick solution to this problem might be to save the last_seen_tsc
> >> parameter in a snapshot for each core, followed by validation.
> >>
> >>
> >> The correct solution is to remove the rounding of the least significant
> >> bits from the sequence.
> >>
> >> On Wed, Jan 7, 2026 at 11:02 AM Jan Beulich <[email protected]> wrote:
> >>
> >>> On 06.01.2026 21:10, Антон Марков wrote:
> >>>> Hi, I'm not sure about the other places. In hvm_load_cpu_ctxt
> >>>> (xen/arch/x86/hvm/hvm.c ), it was easy to catch because
> >>>> process_pending_softirqs is frequently called there, which in turn
> >>>> processes softirqs from the timer (where the timestamp is updated).
> >>>> After I fixed sync_tsc in hvm_load_cpu_ctxt, the problem stopped
> >>>> reproducing under no load. However, when the number of vCPUs is 4
> times
> >>>> greater than the number of CPUs (under heavy load), the problem rarely
> >>>> reoccurs (mostly during snapshot restores during
> >>>> process_pending_softirqs calls), and this is no longer a simple case.
> >>> If
> >>>> get_s_time_fixed can indeed be interrupted during execution after
> >>>> rdtsc_ordered, then the current fix is insufficient. It's necessary to
> >>>> atomically copy "t->stamp" to the stack using local_irq_disable and
> >>>> local_irq_enable (as in local_time_calibration), and then work with
> the
> >>>> copy, confident in its lifetime and immutability. But until
> >>>> get_s_time_fixed is proven to be interruptible, this is premature, so
> >>>> your fix is sufficient. I think I need more information and testing to
> >>>> say more.
> >>>
> >>> While the cpu_calibration per-CPU variable is updated from IRQ context,
> >>> the cpu_time one isn't. Hence t->stamp's contents cannot change behind
> >>> the back of get_s_time_fixed(). I wonder whether ...
> >>>
> >>>> Regarding the other scale_delta calls, if they include values
> >>>> calculated from externally saved tsc values that could have become
> >>>> stale during the process_pending_softirqs call, this definitely needs
> >>> to
> >>>> be fixed.
> >>>
> >>> ... another similar issue (possibly one not included in the set of
> >>> remarks I have in the patch, as none of those look related to what you
> >>> describe) might be causing the remaining, more rare problems you say
> you
> >>> see. That set of remarks is actually a result of me going over all
> other
> >>> scale_delta() calls, but of course I may have got the analysis wrong.
> >>>
> >>> As to using 4 times as many vCPU-s as there are pCPU-s (and then heavy
> >>> load) - while I don't think we have a support statement for such
> upstream
> >>> (when probably we should), iirc for our (SUSE's) products we would
> >>> consider that unsupported. Just fyi.
> >>>
> >>> Also, btw, please don't top-post.
> >>>
> >>> Jan
> >>>
> >>
> >
>
>
