On Thu, 25 Sep 2008, Przemyslaw Czerpak wrote:

Hi All,

> The cost of TLS access is strictly compiler/OS dependent. I've
> just made an interesting experiment to compare the code using a
> stack pointer to a dynamically allocated stack instead of a static
> stack address in ST programs.
> I made a very simple modification. In hbstack.c for ST mode I changed:
>    extern HB_STACK hb_stack;
> to:
>    extern PHB_STACK hb_stack_ptr;
>    # define hb_stack ( * hb_stack_ptr )
> and in estack.c:
>    # if defined( HB_STACK_MACROS )
>       HB_STACK hb_stack;
>    # else
>       static HB_STACK hb_stack;
>    # endif
> to:
>    HB_STACK _hb_stack_;
>    PHB_STACK hb_stack_ptr = &_hb_stack_;

And now I have compared the assembler code generated by BCC-5.5 and
GCC-4.3.1 for such a modified HVM and this simple code:

   void func( void )
   {
      hb_stackPush();
      hb_stackPop();
   }

BCC with -4 -5 -6 -O2 gives:

   ;
   ;   void func( void )
   ;   {
   ;      hb_stackPush();
   ;
   @3:
      mov       eax,dword ptr [_hb_stack_ptr]
      add       dword ptr [eax+4],4
      mov       edx,dword ptr [eax+4]
      mov       ecx,dword ptr [_hb_stack_ptr]
      cmp       edx,dword ptr [ecx+8]
      jne       short @4
      call      _hb_stackIncrease
   ;
   ;      hb_stackPop();
   ;
   @4:
      mov       eax,dword ptr [_hb_stack_ptr]
      sub       dword ptr [eax+4],4
      mov       edx,dword ptr [_hb_stack_ptr]
      mov       ecx,dword ptr [edx+4]
      mov       eax,dword ptr [ecx]
      test      dword ptr [eax],46085
      je        short @5
      push      eax
      call      _hb_itemClear
      pop       ecx
   ;
   ;   }
   ;
   @5:
   @6:
      ret

Please note that _hb_stack_ptr is always accessed 4 times.
With my GCC the final code for -O3 is:

   func:
      pushl   %ebp
      movl    %esp, %ebp
      subl    $8, %esp
      movl    hb_stack_ptr, %ecx
      movl    4(%ecx), %eax
      addl    $4, %eax
      cmpl    8(%ecx), %eax
      movl    %eax, 4(%ecx)
      je      .L6
   .L2:
      movl    4(%ecx), %edx
      leal    -4(%edx), %eax
      movl    %eax, 4(%ecx)
      movl    -4(%edx), %eax
      testw   $-19451, (%eax)
      jne     .L7
      leave
      ret
   .L7:
      movl    %eax, (%esp)
      call    hb_itemClear
      leave
      ret
   .L6:
      call    hb_stackIncrease
      movl    hb_stack_ptr, %ecx
      jmp     .L2

It accesses hb_stack_ptr only _ONCE_ during normal code execution.
The second load of hb_stack_ptr happens only when an external function
like hb_stackIncrease() has to be called (in practice never, or a few
times in the whole application lifetime).

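The same difference can be sketched at C source level. This is only a
minimal illustration, not Harbour code: DEMO_STACK, demo_stack_ptr and
both functions are hypothetical names made up for this example, and the
structure is much simpler than the real HB_STACK. The first function goes
through the global pointer on every access, so a compiler that does not
keep the pointer in a register may reload it each time. The second one
reads the global once into a local variable, which is roughly what GCC
did on its own in the listing above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of an item stack; NOT Harbour's real HB_STACK. */
typedef struct
{
   int * items;   /* base of the item array */
   int * top;     /* next free slot */
   int * end;     /* one past the last usable slot */
} DEMO_STACK;

#define DEMO_CAPACITY 16

static int        demo_items[ DEMO_CAPACITY ];
static DEMO_STACK demo_stack = { demo_items, demo_items,
                                 demo_items + DEMO_CAPACITY };

/* Global pointer to the stack; it plays the role of hb_stack_ptr. */
DEMO_STACK * demo_stack_ptr = &demo_stack;

/* Style 1: every access goes through the global pointer. A compiler
   that does not cache the pointer may reload it for each use. */
int demo_push_pop_global( int value )
{
   assert( demo_stack_ptr->top < demo_stack_ptr->end );
   *demo_stack_ptr->top = value;    /* store the pushed value */
   demo_stack_ptr->top++;           /* push */
   demo_stack_ptr->top--;           /* pop */
   return *demo_stack_ptr->top;     /* value that was just popped */
}

/* Style 2: read the global pointer once and work on the local copy. */
int demo_push_pop_cached( int value )
{
   DEMO_STACK * s = demo_stack_ptr; /* single load of the global */

   assert( s->top < s->end );
   *s->top = value;                 /* store the pushed value */
   s->top++;                        /* push */
   s->top--;                        /* pop */
   return *s->top;                  /* value that was just popped */
}
```

Both functions return the value they were given and leave the stack
empty, so they behave identically; only the number of loads of the global
pointer may differ. The local copy is safe only as long as nothing called
in between replaces the pointer, which is presumably why GCC reloads
hb_stack_ptr after the call to hb_stackIncrease().
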
And this explains the speed difference. With such optimization the
overhead in my builds is minimal when native TLS variables are used.
GCC has always been optimized to reduce memory access, while BCC seems
to be hardcoded for x86 machines where the cost of a memory operation
was relatively small in the past. Today CPU data caches reduce the
overhead, but it is still not friendly code for the CPU optimization
logic. It also shows why TLS costs so much in BCC: four calls instead of
one with my GCC in such a simple example.

best regards,
Przemek
_______________________________________________
Harbour mailing list
Harbour@harbour-project.org
http://lists.harbour-project.org/mailman/listinfo/harbour