On 12/06/2011 07:25 PM, Paolo Bonzini wrote:
> is_dup_page is already proceeding in 32-bit chunks.  Changing it to 16
> bytes using Altivec or SSE is easy, and provides a noticeable improvement.
> Pierre Riteau measured 30->25 seconds on a 16GB guest, I measured 4.6->3.9
> seconds on a 6GB guest (best of three times for me; dunno for Pierre).
> Both of them are approximately a 15% improvement.
>
> I tried playing with non-temporal prefetches, but I did not get any
> improvement (though I did get less cache misses, so the patch was doing
> its job).

It's worthwhile anyway IMO.

>
> +static int is_dup_page(uint8_t *page)
>  {
> -    uint32_t val = ch << 24 | ch << 16 | ch << 8 | ch;
> -    uint32_t *array = (uint32_t *)page;
> +    VECTYPE *p = (VECTYPE *)page;
> +    VECTYPE val = SPLAT(p);
>

I think you can drop the SPLAT and just compare against zero.  Full page
repeats of anything but zero are unlikely, so we can simplify the code a
bit here.  If we do go with non-temporal loads, it saves an additional miss.

-- 
error compiling committee.c: too many arguments to function
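
For illustration only, the zero-only comparison suggested above might look
roughly like the sketch below.  It is not taken from the patch: the names
is_zero_page and SKETCH_PAGE_SIZE are hypothetical, the 4 KiB page size is
an assumption, and the SSE2 path uses unaligned loads purely to keep the
example safe for any buffer.  Since the comparison value is a constant
zero, nothing has to be loaded from the page to build it, which is where
the saved miss would come from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Assumption: 4 KiB pages; real code would use the target page size. */
#define SKETCH_PAGE_SIZE 4096

/* Hypothetical helper: returns 1 if every byte of the page is zero. */
static int is_zero_page(const uint8_t *page)
{
    size_t i;

    assert(page != NULL);

#if defined(__SSE2__)
    /* The comparison value is a constant, so no SPLAT of page data. */
    const __m128i zero = _mm_setzero_si128();

    for (i = 0; i < SKETCH_PAGE_SIZE; i += 16) {
        /* Unaligned load, so the sketch works for any buffer. */
        __m128i v = _mm_loadu_si128((const __m128i *)(page + i));

        /* The mask is 0xFFFF only when all 16 bytes compare equal. */
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero)) != 0xFFFF) {
            return 0;
        }
    }
#else
    /* Portable fallback: check 8 bytes at a time. */
    for (i = 0; i < SKETCH_PAGE_SIZE; i += sizeof(uint64_t)) {
        uint64_t w;

        memcpy(&w, page + i, sizeof(w));
        if (w != 0) {
            return 0;
        }
    }
#endif
    return 1;
}
```

A caller would pass a pointer to one full page and treat a return value of
1 as "page is all zeroes"; a page filled with any other repeated byte is
reported as not duplicated under this simplification.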