On Thu, 25 Sep 2008, Przemyslaw Czerpak wrote:

Hi All,

> The cost of TLS access is strictly compiler/OS dependent. I've
> just made an interesting experiment to compare the code for using
> a stack pointer to a dynamically allocated stack instead of a static
> stack address in ST programs.
> I made a very simple modification. In hbstack.c for ST mode I changed:
>       extern HB_STACK hb_stack;
> to:
>       extern PHB_STACK hb_stack_ptr;
> #     define hb_stack      ( * hb_stack_ptr )
> and in estack.c:
>    #  if defined( HB_STACK_MACROS )
>          HB_STACK hb_stack;
>    #  else
>          static HB_STACK hb_stack;
>    #  endif
> to:
>       HB_STACK _hb_stack_;
>       PHB_STACK hb_stack_ptr = &_hb_stack_;

And now I compared the BCC-5.5 and GCC-4.3.1 assembler code generated
for such a modified HVM and this simple code:

   void func( void )
   {
      hb_stackPush();
      hb_stackPop();
   }
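
For readers without the HVM sources at hand, here is a rough sketch of
what the two macros appear to do, inferred only from the listings below
(an increment and a compare against an end pointer, then a decrement and
a test of the item type against the mask 46085). The ITEM and STACK
types, the field names and the two stub functions are simplified,
hypothetical stand-ins, not the real Harbour definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for HB_ITEM and HB_STACK. */
typedef struct { unsigned int type; } ITEM;

typedef struct
{
   ITEM ** pItems;   /* base of the item pointer array      */
   ITEM ** pPos;     /* current position                    */
   ITEM ** pEnd;     /* one past the last allocated slot    */
} STACK;

#define COMPLEX_MASK 0xB405u   /* 46085, the constant seen in the listings */

static ITEM   s_items[ 4 ];
static ITEM * s_slots[ 4 ] = { &s_items[ 0 ], &s_items[ 1 ],
                               &s_items[ 2 ], &s_items[ 3 ] };
static STACK  s_stack = { s_slots, s_slots, s_slots + 4 };

/* The modification from the quoted mail: access through a pointer. */
STACK * stack_ptr = &s_stack;
#define stack ( *stack_ptr )

int increase_calls = 0;
int clear_calls = 0;

/* Stubs: the real functions reallocate the stack and release item data. */
static void stackIncrease( void ) { ++increase_calls; }
static void itemClear( ITEM * pItem ) { pItem->type = 0; ++clear_calls; }

#define stackPush() \
   do { if( ++stack.pPos == stack.pEnd ) stackIncrease(); } while( 0 )

#define stackPop() \
   do { --stack.pPos; \
        if( ( *stack.pPos )->type & COMPLEX_MASK ) \
           itemClear( *stack.pPos ); } while( 0 )

void func( void )
{
   stackPush();
   stackPop();
}
```

Each use of `stack` in the macros expands to `*stack_ptr`, so the source
names the pointer four times; how many of these become real memory loads
is up to the compiler.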

BCC with -4 -5 -6 -O2 gives:
         ;      
         ;      void func( void )
         ;      {
         ;         hb_stackPush();
         ;      
      @3:
        mov       eax,dword ptr [_hb_stack_ptr]
        add       dword ptr [eax+4],4
        mov       edx,dword ptr [eax+4]
        mov       ecx,dword ptr [_hb_stack_ptr]
        cmp       edx,dword ptr [ecx+8]
        jne       short @4
        call      _hb_stackIncrease
         ;      
         ;         hb_stackPop();
         ;      
      @4:
        mov       eax,dword ptr [_hb_stack_ptr]
        sub       dword ptr [eax+4],4
        mov       edx,dword ptr [_hb_stack_ptr]
        mov       ecx,dword ptr [edx+4]
        mov       eax,dword ptr [ecx]
        test      dword ptr [eax],46085
        je        short @5
        push      eax
        call      _hb_itemClear
        pop       ecx
         ;      
         ;      }
         ;      
      @5:
      @6:
        ret 

Please note that _hb_stack_ptr is always accessed 4 times.
With my GCC the final code for -O3 is:
      func:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $8, %esp
        movl    hb_stack_ptr, %ecx
        movl    4(%ecx), %eax
        addl    $4, %eax
        cmpl    8(%ecx), %eax
        movl    %eax, 4(%ecx)
        je      .L6
      .L2:
        movl    4(%ecx), %edx
        leal    -4(%edx), %eax
        movl    %eax, 4(%ecx)
        movl    -4(%edx), %eax
        testw   $-19451, (%eax)
        jne     .L7
        leave
        ret
      .L7:
        movl    %eax, (%esp)
        call    hb_itemClear
        leave
        ret
      .L6:
        call    hb_stackIncrease
        movl    hb_stack_ptr, %ecx
        jmp     .L2

It accesses hb_stack_ptr only _ONCE_ during normal code execution.
The second hb_stack_ptr load is used only when an external function
like hb_stackIncrease() has to be called (in practice never, or a few
times in the whole application life).
And this explains the speed difference. With such optimization
the overhead in my builds is minimal when native TLS variables
are used. GCC has always been optimized to reduce memory access, while
BCC seems to be hardcoded for x86 machines where the cost of a memory
operation was relatively small in the past. Now CPU data caches
reduce the overhead, but it's still not friendly code for CPU
optimization logic.
It also shows why TLS costs so much in BCC: four TLS accesses instead
of one with my GCC in such a simple example.
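
As an illustration only, the single-load pattern that GCC produces can
also be written by hand, by copying the pointer into a local variable
and reloading it only after a call that might change it. This is a
sketch with hypothetical, simplified types and stubs (not the real HVM
code); whether it actually helps with a given compiler would have to be
measured:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for HB_ITEM and HB_STACK. */
typedef struct { unsigned int type; } ITEM;
typedef struct { ITEM ** pItems; ITEM ** pPos; ITEM ** pEnd; } STACK;

#define COMPLEX_MASK 0xB405u   /* 46085, the constant seen in the listings */

static ITEM   s_item;
static ITEM * s_slots[ 2 ] = { &s_item, &s_item };
static STACK  s_stack = { s_slots, s_slots, s_slots + 2 };

/* In an MT build this pointer would be a thread-local variable. */
STACK * stack_ptr = &s_stack;

int clear_calls = 0;

/* Stubs for the external functions. */
static void stackIncrease( void ) { }
static void itemClear( ITEM * pItem ) { pItem->type = 0; ++clear_calls; }

void func_cached( void )
{
   STACK * pStack = stack_ptr;      /* the only load on the normal path */

   if( ++pStack->pPos == pStack->pEnd )
   {
      stackIncrease();
      pStack = stack_ptr;           /* reload after the external call,
                                       as the GCC listing does at .L6 */
   }

   --pStack->pPos;
   if( ( *pStack->pPos )->type & COMPLEX_MASK )
      itemClear( *pStack->pPos );
}
```

The local copy makes the single load explicit in the source instead of
relying on the optimizer to find it.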

best regards,
Przemek