2007/12/22, Frank Hofmann <[EMAIL PROTECTED]>:
>
>
> On Sat, 22 Dec 2007, Minskey Guo wrote:
>
> >
> > On 2007-12-22, at 9:17 PM, 陶捷 TaoJie wrote:
> >
> >> Hi Bart,
> >>
> >> I noticed this email just now :(
> >> Thank you for your advice.
> >>
> >> Are there any barrier instructions on x86/x64 that could force rdtsc to
> >> behave synchronously?
> >
> > iret, xchg, cpuid, sfence, lock, etc., but cpuid changes eax etc., and
> > sfence is not available on all Pentiums (PIII ???)
>
> We had this discussion with AMD a while ago; if I remember correctly, but
Do you remember the topic of that discussion?

> Bart may well step in here, is that the only thing that's guaranteed in
> all situations and fully vendor/chip-rev independent is CPUID. Which is
> sort of a barrier sledgehammer. Takes thousands of cycles.

Is it a sledgehammer because it takes thousands of cycles?
If it takes thousands of cycles, it will affect the test result a bit.
But it seems like a good generic solution.

By the way, on P4 and later Intel platforms, which instruction is the best
barrier? And on AMD Opteron and later AMD platforms?

> Wondering - what _exactly_ are you planning to do ? Instruction-based
> sampling can be done via CPU performance monitoring counters, the old
> "sample time, do something, sample time again" is sort-of superseded by
> those. High-level access in Solaris would be via the cpc(7d) driver.

OK, I'll look for some articles about performance monitoring counters and
the cpc driver in Solaris.

I have a program whose behavior I want to analyze in detail. In short, I want
to know the time cost of each sub-flow within the whole program flow.
Suppose the program flow is a long vertical line like

main
func1
func2
func3
"some key instructions in func3" (record it as "#1")
func4
func3
func2
func1
"some key instructions in func1" (record it as "#2")
"exit in main"

I'm interested in:
How much time does func4 take?
How much time does #1 take?
How much time does #2 take?
How much time does the control transfer from func2 to func3 (a function
call) take?
During func4, the program may be interrupted by some event. If so, how much
time does the interruption take, and how long does the program need to
regain the CPU? If not, that's all right.

Are there any good suggestions for this problem?

Kind Regards,
TJ

> FrankH.
>
> >
> >
> >> My concern is
> >> ---------------
> >> rdtsc
> >> [barrier]
> >> AA
> >> BB
> >> CC
> >> ....
> >> XX
> >> [barrier]
> >> rdtsc
> >> ----------------
> >> (2nd rdtsc - 1st rdtsc) should be the time cost of these inner
> >> instructions/functions.
> >> And it should be equal to or greater than the actual cost.
> >>
> >> Are there any barrier instructions to force the 1st rdtsc to execute
> >> before AA and the 2nd rdtsc to execute after XX?
> >> Using some consecutive nops? Or some other instruction?
> >
> > sfence
> > rdtsc
> > xxxxx
> >
> > sfence
> > rdtsc
> >
> >
> > maybe cpuid is usable if eax can be corrupted, or you can save it
> > somewhere before cpuid.
> >
> > -minskey
> >
> >
> >
> >>
> >> Btw, you mentioned you had experience with performance measuring.
> >> Are there any recommended articles about performance measuring on the
> >> x86/x64 platform?
> >> Are there any recommended articles about measuring instruction cost?
> >> For example, some books say nop costs 1 cycle on Pentium and 3 cycles
> >> on 386. How are these precise costs obtained?
> >>
> >> Thank you :)
> >>
> >> Another question:
> >> On an SMP or multi-core (or say CMT) platform, each processor/core has
> >> its own tsc register on its chip, doesn't it?
> >> Then how can gethrtime() guarantee to provide a system-wide time? I
> >> mean, if a program runs on CPU1 for a while and then runs on CPU2,
> >> would gethrtime() - gethrtime() be the precise time cost? Does
> >> gethrtime() read ticks from the CPU's tsc register or from a
> >> system-wide timer (e.g. the 8253 chip on x86)?
> >>
> >> I'm not familiar with timers... sorry for these stupid questions :-(
> >>
> >>
> >> Kind Regards,
> >> TJ
> >>
> >>
> >> 2007/10/30, Bart Smaalders <[EMAIL PROTECTED]>:
> >> 陶捷 TaoJie wrote:
> >>> Dear all:
> >>>
> >>> My platform is:
> >>> Intel Pentium 4 CPU
> >>> OpenSolaris B74, built by myself
> >>> Sun Studio 11
> >>>
> >>> In my program, I use asm("rdtsc") to measure the time cost between two
> >>> rdtsc.
> >>> For example:
> >>> int some_func(...)
> >>> {
> >>>     long long time1, time2;
> >>>     int i = 3198, j = 324;
> >>>
> >>>     asm volatile("rdtsc" : "=A" (time1));
> >>>
> >>>     ....
> >>>     i = i + j * i / j;
> >>>
> >>>     asm volatile("rdtsc" : "=A" (time2));
> >>>
> >>>     return i;
> >>> }
> >>>
> >>> int main(...)
> >>> {
> >>>     ....
> >>>     some_func();
> >>>     ....
> >>> }
> >>>
> >>> When I compile this program using "cc example.c" and disassemble a.out
> >>> with dis, the program logic is ok. The output is
> >>> some_func()
> >>> main+0x36:  0f 31        rdtsc
> >>> main+0x38:  89 45 f4     movl   %eax,-0xc(%ebp)
> >>> main+0x3b:  89 55 f8     movl   %edx,-0x8(%ebp)
> >>> main+0x3e:  8b 45 e8     movl   -0x18(%ebp),%eax
> >>> main+0x41:  03 45 e4     addl   -0x1c(%ebp),%eax
> >>> main+0x44:  89 45 e8     movl   %eax,-0x18(%ebp)
> >>> main+0x47:  8b 45 e8     movl   -0x18(%ebp),%eax
> >>> main+0x4a:  0f af 45 e4  imull  -0x1c(%ebp),%eax
> >>> main+0x4e:  89 45 e8     movl   %eax,-0x18(%ebp)
> >>> main+0x51:  8b 45 e8     movl   -0x18(%ebp),%eax
> >>> main+0x54:  99           cltd
> >>> main+0x55:  f7 7d e4     idivl  -0x1c(%ebp)
> >>> main+0x58:  8b d0        movl   %eax,%edx
> >>> main+0x5a:  89 55 e8     movl   %edx,-0x18(%ebp)
> >>> main+0x5d:  0f 31        rdtsc
> >>> main+0x5f:  89 45 ec     movl   %eax,-0x14(%ebp)
> >>> main+0x62:  89 55 f0     movl   %edx,-0x10(%ebp)
> >>>
> >>> When I compile this program using "cc -xO5", the dis output is
> >>> some_func()
> >>> main+0x7:   0f 31        rdtsc
> >>> main+0x9:   89 45 e8     movl   %eax,-0x18(%ebp)
> >>> main+0xc:   89 55 ec     movl   %edx,-0x14(%ebp)
> >>> main+0xf:   0f 31        rdtsc
> >>> main+0x11:  89 45 f0     movl   %eax,-0x10(%ebp)
> >>> main+0x14:  89 55 f4     movl   %edx,-0xc(%ebp)
> >>> main+0x17:  8b 5d f0     movl   -0x10(%ebp),%ebx
> >>> main+0x1a:  8b 45 f4     movl   -0xc(%ebp),%eax
> >>> main+0x1d:  8b 4d e8     movl   -0x18(%ebp),%ecx
> >>> main+0x20:  8b 55 ec     movl   -0x14(%ebp),%edx
> >>> main+0x23:  2b d9        subl   %ecx,%ebx
> >>> main+0x25:  1b c2        sbbl   %edx,%eax
> >>> main+0x27:  89 5d e0     movl   %ebx,-0x20(%ebp)
> >>> main+0x2a:  89 45 e4     movl   %eax,-0x1c(%ebp)
> >>>
> >>> Now the program logic is wrong!
> >>> Sun cc thinks the rdtscs are unrelated to the other parts of
> >>> some_func, and so it moves the second asm("rdtsc") earlier!
> >>> In this case, I can't measure the time cost.
> >>>
> >>> So how can I stop Sun cc from optimizing between these two asm
> >>> statements while still using -xO5 optimization for the whole program?
> >>> I mean the second rdtsc should be placed strictly after the statement
> >>> i = i + j * i / j. (I know the instructions will be executed out of
> >>> order on an x86 CPU, and the result may not be very precise, but it
> >>> still works.)
> >>> Any good ideas?
> >>>
> >>> TIA
> >>>
> >>> Regards,
> >>> TJ
> >>> _______________________________________________
> >>> opensolaris-discuss mailing list
> >>> [EMAIL PROTECTED]
> >>
> >> You're going to be very frustrated with this approach because:
> >>
> >> 1) rdtsc is not a synchronizing instruction; the cpu may perform
> >> the load earlier than you think it does.
> >> 2) you'll need to bind your program to a cpu as tsc counters are not
> >> the same at boot.
> >>
> >> My suggestion is to repeat the activity a sufficient number of times
> >> such that you can afford to use gethrtime() to measure the time
> >> interval. This is the approach we took w/ libmicro (see performance
> >> community) and has worked reasonably well.
> >>
> >> - Bart
> >>
> >>
> >> --
> >> Bart Smaalders Solaris Kernel Performance
> >> [EMAIL PROTECTED] http://blogs.sun.com/barts
> >>
> >> _______________________________________________
> >> perf-discuss mailing list
> >> perf-discuss@opensolaris.org
> >
>
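For the barrier question raised in this thread, here is a minimal sketch of the "cpuid, then rdtsc" pattern. It is only an illustration under stated assumptions: it assumes an x86 or x86-64 target and a compiler that accepts GCC-style extended asm, and the names tsc_serialized, sample_work and measure_cycles are made up for this example. The "memory" clobber is meant to keep the compiler from moving loads and stores across the timestamp reads; arithmetic held purely in registers may still be moved, which is why the workload stores its result through a volatile variable.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the thread): CPUID as a serializing
 * barrier, immediately followed by RDTSC. x86 / x86-64 only. */
static inline uint64_t tsc_serialized(void)
{
    uint32_t eax = 0, ebx, ecx = 0, edx;
    uint32_t lo, hi;

    /* CPUID leaf 0. It overwrites eax, ebx, ecx and edx, so all four
     * registers are declared as outputs instead of being saved by hand. */
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    (void)eax; (void)ebx; (void)ecx; (void)edx;

    /* RDTSC returns the counter in edx:eax. Separate "=a"/"=d" outputs are
     * used because the "=A" pair constraint only means edx:eax on 32-bit. */
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi) : : "memory");

    return ((uint64_t)hi << 32) | lo;
}

/* Placeholder workload: the arithmetic from some_func() in the thread.
 * The volatile sink forces the result to be stored. */
static volatile int sink;

static void sample_work(void)
{
    int i = 3198, j = 324;
    sink = i + j * i / j;
}

/* Cycles spent in fn(), including the cost of the barriers themselves. */
static uint64_t measure_cycles(void (*fn)(void))
{
    assert(fn != 0);
    uint64_t start = tsc_serialized();
    fn();
    uint64_t end = tsc_serialized();
    return end - start;
}
```

A call such as measure_cycles(sample_work) returns the tick difference. The barrier overhead is part of that number, so timing an empty function first and subtracting it gives a rough correction. The thread should still be bound to one CPU, as Bart points out.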
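Bart's suggestion (repeat the activity enough times that a coarser clock is good enough) could be sketched roughly as follows. This is an assumption-laden illustration, not libmicro's code: gethrtime() is Solaris-specific, so the sketch uses the POSIX call clock_gettime(CLOCK_MONOTONIC) as a stand-in, and now_ns, sample_work and avg_ns_per_call are hypothetical names.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Stand-in for Solaris gethrtime(): monotonic time in nanoseconds. */
static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Placeholder workload: the arithmetic from some_func() in the thread.
 * The volatile sink keeps the store from being optimized away. */
static volatile int sink;

static void sample_work(void)
{
    int i = 3198, j = 324;
    sink = i + j * i / j;
}

/* Run fn() `iterations` times and return the average cost per call in
 * nanoseconds. Loop and call overhead are included in the result. */
static double avg_ns_per_call(void (*fn)(void), long iterations)
{
    assert(fn != 0 && iterations > 0);

    int64_t start = now_ns();
    for (long n = 0; n < iterations; n++)
        fn();
    int64_t end = now_ns();

    return (double)(end - start) / (double)iterations;
}
```

For example, avg_ns_per_call(sample_work, 1000000) averages over a million calls, so the cost of reading the clock twice is spread thinly enough that it no longer dominates the measurement.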