priv.onet.pl)

Mindaugas Kavaliauskas Tue, 16 Sep 2008 15:54:54 -0700

Hi again,

It looks like TLS access is the most expensive part, and it reduces
performance in BCC builds - the cost of an ABI in which the VM pointer
is not passed to functions :-(.
We can do three things:
   1. add some tricks to reduce TLS access, like HB_THREAD_STUB in
      xHarbour's hvm.c; this makes the code a little bit ugly,
      though it will probably improve MT speed by a few percent.
   2. change the ABI so that each Harbour function which may
      need HVM access receives a pointer to HB_STACK. This is quite
      easy for HB_FUNC(), but for internal functions it will require
      much more work.
   3. leave it as is and wait for new hardware and OSes, where
      TLS access is usually greatly improved, very often by native
      hardware support.
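
To illustrate the difference between the current approach and options
1 and 2, here is a minimal C sketch. It is not Harbour code: vm_stack,
get_stack() and the push/pop helpers are hypothetical stand-ins for
HB_STACK and the HVM internals, and the lookup counter exists only to
make the number of TLS accesses visible. It assumes a C11 compiler
with _Thread_local support.

```c
#include <assert.h>

#define VM_STACK_SIZE 16

/* Hypothetical per-thread VM state; stands in for the real HB_STACK. */
typedef struct { int items[VM_STACK_SIZE]; int top; } vm_stack;

static _Thread_local vm_stack tls_stack;  /* one instance per thread */
static _Thread_local long tls_lookups;    /* counts accessor calls, illustration only */

/* Every call models one TLS access (the expensive part). */
static vm_stack *get_stack(void) { tls_lookups++; return &tls_stack; }

static long lookups(void) { return tls_lookups; }

/* Current style: each helper looks the stack up again. */
static void push_tls(int v)
{
    vm_stack *s = get_stack();
    assert(s->top < VM_STACK_SIZE);
    s->items[s->top++] = v;
}

static int pop_tls(void)
{
    vm_stack *s = get_stack();
    assert(s->top > 0);
    return s->items[--s->top];
}

/* Options 1 and 2: helpers receive the pointer instead of looking it up. */
static void push_ptr(vm_stack *s, int v)
{
    assert(s->top < VM_STACK_SIZE);
    s->items[s->top++] = v;
}

static int pop_ptr(vm_stack *s)
{
    assert(s->top > 0);
    return s->items[--s->top];
}

/* Adds two numbers through the VM stack, one TLS lookup per helper call. */
static int add_with_tls(int a, int b)
{
    push_tls(a);
    push_tls(b);
    return pop_tls() + pop_tls();
}

/* Same work, but the pointer is fetched once (option 1, stub style)
   and then handed to the helpers (the calling convention of option 2). */
static int add_with_ptr(int a, int b)
{
    vm_stack *s = get_stack();
    push_ptr(s, a);
    push_ptr(s, b);
    return pop_ptr(s) + pop_ptr(s);
}
```

With these definitions add_with_tls(2, 3) goes through get_stack()
four times, while add_with_ptr(2, 3) does it once; both return 5.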

I've spent some time (well, not much, an hour...) to find out how TLS
works on Windows. The original idea is based on the undocumented (but
de facto standard) fs segment. In Win32, both 9x and NT, the fs segment
register points to the Win32 Thread Information Block (TIB). The
structure is described at
http://en.wikipedia.org/wiki/Win32_Thread_Information_Block

The TIB can be used to access some thread-specific data (including TLS
values) as an alternative to Win32 API calls. For example, a single asm
instruction

   mov eax,fs:[24h]
could be used instead of the GetCurrentThreadId Win32 API call.

To allow for TIB extensions (or for other reasons), the TIB should not
be accessed directly through the fs segment. Instead, the address of
the TIB is obtained from fs:[18h], and this address is used to access
the TIB through the common data segment register ds. So
GetCurrentThreadId could be implemented in asm as:

   mov eax,fs:[18h]
   mov eax,[eax+24h]
   ret
Actually, this is the exact code of GetCurrentThreadId in kernel32.dll.

BCC's TLS is implemented as follows. An access to a variable declared as
  int __thread  a;
would be compiled to:
  call ___GetTls
  mov  eax,[eax+some_offset]

The function __GetTls itself is:
  mov  eax,[tls_index]
  mov  edx,fs:[2Ch]
  mov  eax,[edx+eax*4]
  ret
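
The lookups above can be restated in C as a small model. This is only
a sketch: model_tib is a hypothetical struct that names the three TIB
fields mentioned in this message (the comments give their 32-bit
offsets), and the functions mirror the asm sequences one to one. It
does not touch the real fs segment, so it runs on any platform.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the Thread Information Block. Only the
   fields discussed above are modelled. */
typedef struct model_tib {
    struct model_tib *self;   /* fs:[18h] - linear address of the TIB */
    unsigned long thread_id;  /* fs:[24h] - current thread ID */
    void **tls_slots;         /* fs:[2Ch] - pointer to the TLS slot array */
} model_tib;

/* mov eax,fs:[18h] / mov eax,[eax+24h] */
static unsigned long model_thread_id(const model_tib *fs)
{
    assert(fs->self != NULL);
    return fs->self->thread_id;
}

/* mov eax,[tls_index] / mov edx,fs:[2Ch] / mov eax,[edx+eax*4] */
static void *model_get_tls(const model_tib *fs, unsigned tls_index)
{
    assert(fs->tls_slots != NULL);
    return fs->tls_slots[tls_index];
}

/* call ___GetTls / mov eax,[eax+some_offset]
   Reads an int that lives at a fixed offset inside the thread's block. */
static int model_read_int(const model_tib *fs, unsigned tls_index,
                          size_t offset)
{
    const char *block = model_get_tls(fs, tls_index);
    return *(const int *)(block + offset);
}
```

The point of the model is that reading a __thread variable costs one
indexed load to find the thread's block plus one load at a constant
offset, with no validation of tls_index anywhere.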

Win32 API's TlsGetValue() is a little less optimal. It creates a stack
frame (push ebp, etc.), checks that the tls_index value does not exceed
the maximum allowed value, and clears the last error value. This adds 7
more CPU instructions. These few additional instructions should not
cause a big overhead, but my test shows a big speed difference:
speedtst.prg takes 100 seconds with HB_USE_TLS and 170 seconds without
HB_USE_TLS (HB_USE_TLS commented out inside hbthread.h). The reason for
such a big overhead is not clear enough to me. Win32 API calls are
redirected to DLLs through an additional jmp instruction, set up during
the application load process, but this does not explain such a big
overhead.
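
For comparison, the extra work described for TlsGetValue() can be
modelled next to the direct lookup. This is a sketch based only on the
description above, not on the real kernel32 code: MODEL_SLOT_LIMIT,
MODEL_ERR_BAD_INDEX and model_last_error are hypothetical names and
values chosen for the illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_SLOT_LIMIT    64u  /* assumed limit, for this model only */
#define MODEL_ERR_BAD_INDEX 1ul  /* hypothetical nonzero error code */

/* Stands in for the thread's last-error value. */
static unsigned long model_last_error;

/* Direct lookup, as in __GetTls: no checks, no side effects. */
static void *model_direct_get(void **slots, unsigned index)
{
    assert(slots != NULL);
    return slots[index];
}

/* Checked lookup, following the description of TlsGetValue(). */
static void *model_checked_get(void **slots, unsigned index)
{
    assert(slots != NULL);
    if (index >= MODEL_SLOT_LIMIT) {   /* range check on the index */
        model_last_error = MODEL_ERR_BAD_INDEX;
        return NULL;
    }
    model_last_error = 0;              /* clear the last error on success */
    return slots[index];
}
```

The checked version adds one compare, one branch and one store per
call, which fits the observation that the extra instructions alone
should not account for a 70% slowdown.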



Best regards,
Mindaugas

