Hi again,

It looks like TLS access is the most expensive part and that it reduces
performance in BCC builds: the cost of an ABI in which the VM pointer
is not passed to functions :-(.
We can do three things:
   1. Add some tricks to reduce TLS access, like HB_THREAD_STUB in
      hvm.c in xHarbour. It makes the code a little ugly, though it
      would probably improve MT speed by a few percent.
   2. Change the ABI so that each Harbour function which may need
      HVM access receives a pointer to HB_STACK. That is quite easy
      for HB_FUNC(), but for internal functions it would mean much
      more work.
   3. Leave it as is and wait for new hardware and OSes, where TLS
      access is usually much faster, very often thanks to native
      hardware support.
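
To make the first two options concrete, here is a minimal C sketch. It is not Harbour code: vm_stack and all function names are hypothetical, and __thread stands in for whatever TLS mechanism the build uses. It only contrasts a helper that looks up the per-thread stack through TLS on every call with one that receives the pointer from its caller after a single lookup, which is roughly the idea behind both the stub trick and the ABI change.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-thread VM stack (not the real HB_STACK). */
typedef struct {
    long counter;
} vm_stack;

/* One instance per thread; __thread is the GCC/Clang/BCC spelling. */
static __thread vm_stack tls_stack;

/* Current ABI: the helper finds the stack through TLS by itself. */
static void bump_via_tls(void)
{
    tls_stack.counter++;
}

/* Changed ABI: the caller passes the stack pointer down. */
static void bump_via_arg(vm_stack *stack)
{
    stack->counter++;
}

/* n helper calls, each one free to perform its own TLS lookup. */
long run_tls_each_call(int n)
{
    assert(n >= 0);
    tls_stack.counter = 0;
    for (int i = 0; i < n; i++)
        bump_via_tls();
    return tls_stack.counter;
}

/* The TLS address is resolved once, then reused for every helper call. */
long run_cached_pointer(int n)
{
    assert(n >= 0);
    vm_stack *stack = &tls_stack;   /* the only TLS lookup in this function */
    stack->counter = 0;
    for (int i = 0; i < n; i++)
        bump_via_arg(stack);
    return stack->counter;
}
```

Both functions compute the same result; the difference is only in how often the thread's stack address has to be found. An optimizing compiler may hide part of that difference in a toy like this, so it shows the shape of the change, not its speed.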


I've spent some time (well, not much, an hour...) finding out how TLS works on Windows. The original idea is based on the undocumented (but de facto standard) use of the fs segment register. In Win32, on both 9x and NT, the fs segment register points to the Win32 Thread Information Block (TIB). The structure is described at http://en.wikipedia.org/wiki/Win32_Thread_Information_Block

The TIB can be used to access some thread-specific data (including TLS values) as an alternative to Win32 API calls. For example, the single asm instruction
   mov eax,fs:[24h]
could be used instead of the GetCurrentThreadId Win32 API call.

To allow for TIB extensions (or for other reasons), the TIB should not be accessed directly through the fs segment. Instead, the address of the TIB is read from fs:[18h], and this address is then used to access the TIB through the ordinary data segment register ds. So GetCurrentThreadId could be implemented in asm as:
   mov eax,fs:[18h]
   mov eax,[eax+24h]
   ret
Actually, this is the exact code of GetCurrentThreadId in kernel32.dll.

BCC's TLS is implemented as follows:
  int __thread  a;
is compiled to:
  call ___GetTls
  mov  eax,[eax+some_offset]

The function __GetTls itself is:
  mov  eax,[tls_index]
  mov  edx,fs:[2Ch]
  mov  eax,[edx+eax*4]
  ret
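
For readers who prefer C to asm, the two lookups above can be modelled on a toy memory image. This is only a sketch under loud assumptions: the byte array, all addresses, the slot index 3 and the variable offset 8 are made up for illustration, and the code does not touch the real fs segment. Only the offsets 18h, 24h and 2Ch come from the listings above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* TIB offsets used in the asm listings. */
enum { TIB_SELF = 0x18, TIB_THREAD_ID = 0x24, TIB_TLS_ARRAY = 0x2C };

/* Read a 32-bit value from the simulated address space. */
static uint32_t load32(const uint8_t *mem, uint32_t addr)
{
    uint32_t value;
    memcpy(&value, mem + addr, sizeof value);
    return value;
}

/* Write a 32-bit value into the simulated address space. */
static void store32(uint8_t *mem, uint32_t addr, uint32_t value)
{
    memcpy(mem + addr, &value, sizeof value);
}

/* mov eax,fs:[18h]  /  mov eax,[eax+24h] */
uint32_t model_thread_id(const uint8_t *mem, uint32_t fs_base)
{
    uint32_t tib = load32(mem, fs_base + TIB_SELF);
    return load32(mem, tib + TIB_THREAD_ID);
}

/* mov eax,[tls_index]  /  mov edx,fs:[2Ch]  /  mov eax,[edx+eax*4] */
uint32_t model_get_tls(const uint8_t *mem, uint32_t fs_base,
                       uint32_t tls_index_addr)
{
    uint32_t eax = load32(mem, tls_index_addr);
    uint32_t edx = load32(mem, fs_base + TIB_TLS_ARRAY);
    return load32(mem, edx + eax * 4);
}

/* Build a made-up memory image and read "int __thread a" from it. */
uint32_t model_demo(void)
{
    static uint8_t mem[0x400];
    const uint32_t tib = 0x100, slots = 0x200, block = 0x300;
    const uint32_t tls_index_addr = 0x10, some_offset = 8;

    memset(mem, 0, sizeof mem);
    store32(mem, tib + TIB_SELF, tib);          /* TIB points at itself   */
    store32(mem, tib + TIB_THREAD_ID, 1234);    /* made-up thread ID      */
    store32(mem, tib + TIB_TLS_ARRAY, slots);   /* per-thread slot array  */
    store32(mem, tls_index_addr, 3);            /* module's tls_index     */
    store32(mem, slots + 3 * 4, block);         /* slot 3 -> TLS block    */
    store32(mem, block + some_offset, 42);      /* value of variable a    */

    assert(model_thread_id(mem, tib) == 1234);

    /* call ___GetTls  /  mov eax,[eax+some_offset] */
    uint32_t base = model_get_tls(mem, tib, tls_index_addr);
    return load32(mem, base + some_offset);
}
```

Calling model_demo() walks the same chain as the compiled access to a: tls_index, then the slot array found through the TIB, then the thread's TLS block, then the variable at its offset.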

Win32 API's TlsGetValue() is a little less optimal. It creates a stack frame (push ebp, etc.), checks that the tls_index value does not exceed the maximum allowed value, and clears the last error value. This adds 7 more CPU instructions. These few additional instructions should not cause a big overhead, but my test shows a big speed difference: speedtst.prg takes 100 seconds with HB_USE_TLS and 170 seconds without it (HB_USE_TLS commented out in hbthread.h). The reason for such a big overhead is not clear to me. During application load, Win32 API calls are redirected to DLLs via an additional jmp instruction, but this does not explain such a big overhead.
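
The extra work described above can be sketched in C as well. This is a model, not the real kernel32 code: model_thread, the slot limit of 64 and the error value are hypothetical names and numbers chosen for the example. It only shows which steps the checked getter performs that the lean, __GetTls-style getter skips.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_TLS_SLOTS   64   /* hypothetical slot limit for this model */
#define MODEL_ERR_INVALID 87   /* hypothetical "bad index" error value   */

/* Hypothetical per-thread record: a slot array plus a last-error field. */
typedef struct {
    void *slots[MODEL_TLS_SLOTS];
    unsigned long last_error;
} model_thread;

/* Lean lookup in the style of __GetTls: index straight into the array. */
void *lean_get(const model_thread *t, unsigned long index)
{
    assert(t != NULL);
    return t->slots[index];
}

/* Checked lookup in the style described for TlsGetValue():
   validate the index, clear the last error, then read the slot. */
void *checked_get(model_thread *t, unsigned long index)
{
    assert(t != NULL);
    if (index >= MODEL_TLS_SLOTS) {
        t->last_error = MODEL_ERR_INVALID;
        return NULL;
    }
    t->last_error = 0;
    return t->slots[index];
}
```

For a valid index both getters return the same pointer; the checked one additionally performs a compare, a branch and a store. That matches the handful of extra instructions counted above and, as said, does not by itself explain a 70 percent slowdown.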


Best regards,
Mindaugas

