Hi again,
It looks like the most expensive part is TLS access, and it reduces
the performance in BCC builds - the cost of an ABI in which the VM
pointer is not passed to functions :-(.
We can do three things:
1. add some tricks to reduce TLS access, like HB_THREAD_STUB in
hvm.c in xHarbour. It makes the code a little bit ugly, though
it will probably improve the MT speed by a few percent.
2. we can change the ABI so that each Harbour function which may
need HVM access receives a pointer to HB_STACK. This is quite
easy for HB_FUNC(), but for internal functions it will require
much more work.
3. we can leave it as is and wait for new hardware and OSes, where
TLS access is usually much faster, very often thanks to native
hardware support.
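To make the difference between the current ABI and option 2 concrete,
here is a minimal sketch in C. It is not Harbour code: SketchStack and
the push_* functions are hypothetical names, and C11 _Thread_local
stands in for whatever TLS mechanism the build really uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-thread HVM stack; not the real HB_STACK. */
typedef struct {
    int items;   /* number of items currently on the stack */
} SketchStack;

/* One stack per thread, reached through compiler-level TLS (C11). */
static _Thread_local SketchStack s_stack;

/* Current ABI style: each function that needs the stack does its own
   TLS lookup. */
void push_via_tls(void)
{
    SketchStack *stack = &s_stack;   /* TLS access on every call */
    stack->items++;
}

/* Option 2 style: the caller resolves the pointer and passes it down. */
void push_via_param(SketchStack *stack)
{
    assert(stack != NULL);
    stack->items++;
}

/* Pushes n items with a single TLS access; returns the new item count. */
int push_many(int n)
{
    SketchStack *stack = &s_stack;   /* one TLS access for the whole loop */
    for (int i = 0; i < n; i++)
        push_via_param(stack);
    return stack->items;
}
```

With the pointer passed as a parameter, the number of TLS lookups no
longer depends on how many internal functions are called, which is
where the saving in option 2 would come from.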
I've spent some time (well, not much, an hour...) finding out how TLS
works on Windows.
The original idea is based on the undocumented (but de facto standard)
fs segment register. In Win32, on both 9x and NT, the fs segment
register points to the Win32 Thread Information Block (TIB). The
structure is described at
http://en.wikipedia.org/wiki/Win32_Thread_Information_Block
The TIB can be used to access some thread-specific data (including TLS
values) as an alternative to Win32 API calls. For example, a single asm
instruction
mov eax,fs:[24h]
could be used instead of the GetCurrentThreadId Win32 API call.
To allow for TIB extensions (or for other reasons), the TIB should not
be accessed directly through the fs segment. Instead, the address of the
TIB is obtained from fs:[18h], and this address is used to access the
TIB through the common data segment register ds. So GetCurrentThreadId
could be implemented in asm as:
mov eax,fs:[18h]
mov eax,[eax+24h]
ret
Actually, this is the exact code of GetCurrentThreadId in kernel32.dll.
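The two-step lookup can be modelled in portable C. The sketch below
does not touch the real fs register: it simulates a small flat memory
in an array, and only the offsets 18h and 24h come from the text above.
All function names and MEM_SIZE are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TIB_SELF      0x18u   /* linear address of the TIB itself */
#define TIB_THREAD_ID 0x24u   /* current thread ID */
#define MEM_SIZE      0x400u  /* size of the simulated address space */

/* Simulated flat memory; "addresses" are plain offsets into this array. */
static uint8_t g_mem[MEM_SIZE];

/* Reads a 32-bit value from the simulated memory. */
uint32_t read32(uint32_t addr)
{
    uint32_t v;
    assert(addr <= MEM_SIZE - 4u);
    memcpy(&v, &g_mem[addr], sizeof v);
    return v;
}

/* Writes a 32-bit value to the simulated memory. */
void write32(uint32_t addr, uint32_t v)
{
    assert(addr <= MEM_SIZE - 4u);
    memcpy(&g_mem[addr], &v, sizeof v);
}

/* Places a fake TIB at tib_addr, filling only the two fields used here. */
void setup_tib(uint32_t tib_addr, uint32_t thread_id)
{
    write32(tib_addr + TIB_SELF, tib_addr);
    write32(tib_addr + TIB_THREAD_ID, thread_id);
}

/* Mirrors: mov eax,fs:[18h] / mov eax,[eax+24h] / ret
   fs_base plays the role of the fs segment base. */
uint32_t model_get_current_thread_id(uint32_t fs_base)
{
    uint32_t eax = read32(fs_base + TIB_SELF);   /* mov eax,fs:[18h]  */
    eax = read32(eax + TIB_THREAD_ID);           /* mov eax,[eax+24h] */
    return eax;
}
```

For example, after setup_tib(0x100, 1234) a call to
model_get_current_thread_id(0x100) first reads the self pointer 0x100
and then the thread ID stored at 0x124.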
BCC's TLS is implemented as follows. An access to a variable declared as
int __thread a;
is compiled to:
call ___GetTls
mov eax,[eax+some_offset]
The function __GetTls itself is:
mov eax,[tls_index]
mov edx,fs:[2Ch]
mov eax,[edx+eax*4]
ret
Win32 API's TlsGetValue() is a little less optimal. It creates a stack
frame (push ebp, etc.), checks that the tls_index value does not exceed
the maximum allowed value, and clears the last error value. This adds 7
more CPU instructions. These few additional instructions should not
cause a big overhead, but my test shows a big speed difference:
speedtst.prg takes 100 seconds with HB_USE_TLS and 170 seconds without
HB_USE_TLS (HB_USE_TLS commented out inside hbthread.h). The reason for
such a big overhead is not clear to me. Win32 API calls are redirected
to the DLLs via an additional jmp instruction, set up during the
application load process, but this does not explain such a big overhead.
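To show how small the difference between the two lookups is, here is a
rough model in C. It is only a sketch of the behaviour described above,
not the real kernel32 or BCC runtime code: ModelThread, MODEL_TLS_SLOTS
and MODEL_ERR_BAD_INDEX are hypothetical, and the error value is just a
placeholder.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_TLS_SLOTS     64u  /* hypothetical slot limit for this model */
#define MODEL_ERR_BAD_INDEX 1u   /* placeholder error code, not a Win32 value */

/* Hypothetical per-thread block: only the parts the two lookups touch. */
typedef struct {
    void        *tls_slots[MODEL_TLS_SLOTS]; /* the array fs:[2Ch] points to */
    unsigned int last_error;                 /* per-thread last error value */
} ModelThread;

/* Direct lookup in the spirit of __GetTls: index into the array, no checks. */
void *model_get_tls(const ModelThread *t, unsigned int tls_index)
{
    return t->tls_slots[tls_index];
}

/* Checked lookup in the spirit of TlsGetValue(): validates the index and
   clears the last error value before returning the slot. */
void *model_tls_get_value(ModelThread *t, unsigned int tls_index)
{
    assert(t != NULL);
    if (tls_index >= MODEL_TLS_SLOTS) {
        t->last_error = MODEL_ERR_BAD_INDEX;
        return NULL;
    }
    t->last_error = 0u;
    return t->tls_slots[tls_index];
}
```

The checked variant adds one comparison and one store per call, which
matches the observation that the extra instructions alone should not
account for the measured difference.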
Best regards,
Mindaugas