Hi,

Przemyslaw Czerpak wrote:
Maybe BCC does not inline InterLocked*() functions or
they are not as efficient as they can be.
Interlocked*() functions are WinAPI functions. They cannot be inlined.

I know that they are Windows API functions, but
MSDN says that each compiler should try to inline them
using its own intrinsics.

I've reverse-engineered the .obj files produced by BCC. It does not inline the Interlocked*() functions, but emits Win32 API calls instead.
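To illustrate what "inlining using its own intrinsics" means, here is a minimal sketch. MSVC exposes _InterlockedIncrement() as a true compiler intrinsic, while GCC/Clang offer the __sync_*() builtins which compile down to a single "lock xadd" on x86. The wrapper name hb_atomic_inc() is invented for this example and is not part of Harbour:

```c
/* Sketch: let the compiler inline the atomic operation instead of
 * calling the InterlockedIncrement() export from kernel32.dll.
 * hb_atomic_inc() is a hypothetical wrapper name. */
#if defined( _MSC_VER )
   #include <intrin.h>
   #pragma intrinsic( _InterlockedIncrement )
   static long hb_atomic_inc( volatile long * p )
   {
      /* expands inline to "lock xadd"/"lock inc", no function call */
      return _InterlockedIncrement( p );
   }
#else
   static long hb_atomic_inc( volatile long * p )
   {
      /* GCC/Clang builtin; also becomes a single locked instruction */
      return __sync_add_and_fetch( p, 1 );
   }
#endif
```

BCC apparently takes neither path and generates a plain call into the DLL export, which explains part of the overhead measured below.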


Harbour. Does anyone know how to include asm code in C?

If it's only a few instructions then you can use __emit__( ... );
see rtl/hbtone.c.

I'll take a look.


It was a default build using make_b32.bat, i.e. with memstat. Here are the results without memstat:
[...]
total application time:                   130.69  164.55

Much better, though the difference is still quite large.
You can easily check the cost of the Interlocked*() operations
by redefining them in hbthreads.h as:

   #define HB_ATOM_INC( p )    ( ++(*(p)) )
   #define HB_ATOM_DEC( p )    ( --(*(p)) )

Here are new results:

ARR_LEN =         16                     ST      MT      MT
N_LOOPS =    1000000                                   (++(*(p)))
empty loops overhead =                   0.16    0.30    0.34
CPU usage -> secondsCPU()

c:=L_C ->                                0.22    0.38    0.33
n:=L_N ->                                0.22    0.25    0.31
d:=L_D ->                                0.19    0.25    0.31
c:=M_C ->                                0.25    0.41    0.31
n:=M_N ->                                0.22    0.28    0.30
d:=M_D ->                                0.23    0.25    0.30
(sh) c:=F_C ->                           0.36    0.88    0.78
(sh) n:=F_N ->                           0.58    0.66    0.63
(sh) d:=F_D ->                           0.30    0.36    0.34
(ex) c:=F_C ->                           0.38    0.88    0.80
(ex) n:=F_N ->                           0.56    0.66    0.63
(ex) d:=F_D ->                           0.28    0.36    0.33
n:=o:GenCode ->                          0.50    0.78    0.75
n:=o[8] ->                               0.45    0.61    0.56
round(i/1000,2) ->                       0.73    0.94    1.00
str(i/1000) ->                           1.50    2.59    2.53
val(a3[i%ARR_LEN+1]) ->                  1.34    1.81    1.70
dtos(j+i%10000-5000) ->                  1.42    2.13    2.13
eval({||i%ARR_LEN}) ->                   0.67    1.03    1.09
eval({|x|x%ARR_LEN},i) ->                0.75    1.23    1.25
eval({|x|f1(x)},i) ->                    1.17    1.83    1.83
&('f1('+str(i)+')') ->                   7.22   13.39   12.38
eval([&('{|x|f1(x)}')]) ->               1.17    1.83    1.84
j := valtype(a)+valtype(i) ->            1.09    2.06    1.97
j := str(i%100,2) $ a2[i%ARR_LEN+1] ->   2.31    3.70    3.48
j := val(a2[i%ARR_LEN+1]) ->             1.53    2.14    1.98
j := a2[i%ARR_LEN+1] == s ->             1.09    1.69    1.39
j := a2[i%ARR_LEN+1] = s ->              1.16    1.70    1.45
j := a2[i%ARR_LEN+1] >= s ->             1.19    1.69    1.44
j := a2[i%ARR_LEN+1] < s ->              1.13    1.69    1.47
aadd(aa,{i,j,s,a,a2,t,bc}) ->            4.89    7.31    6.31
f0() ->                                  0.34    0.55    0.56
f1(i) ->                                 0.59    0.91    0.97
f2(c[8]) ->                              0.47    0.78    0.75
f2(c[40000]) ->                          0.45    0.75    0.75
f2(@c[40000]) ->                         0.38    0.61    0.66
f2(c[40000]); c2:=c ->                   0.70    1.16    1.05
f2(@c[40000]); c2:=c ->                  0.59    1.05    1.00
f3(a,a2,c,i,j,t,bc) ->                   1.20    2.05    1.77
f2(a2) ->                                0.48    0.80    0.73
s:=f4() ->                               1.69    2.41    2.39
s:=f5() ->                               0.75    1.45    1.42
ascan(a,i%ARR_LEN) ->                    0.78    1.17    1.14
ascan(a2,c+chr(i%64+64)) ->              2.64    3.91    3.66
ascan(a,{|x|x==i%ARR_LEN}) ->            7.81   14.44   13.52
===============================================================
total application time:                 61.33  101.38   98.66
total real time:                        62.16  102.20   99.49

Interlocked*() does not consume a lot of time. The MT-ST time difference is the same as in the previous tests, ~40 seconds, but after dropping the significance of the f4() call, the MT/ST ratio became even larger, ~1.6.


It looks like the most expensive part is TLS access, and it reduces
the performance of BCC builds - the cost of an ABI in which the VM
pointer is not passed to functions :-(.
We can do three things:
   1. add some tricks to reduce TLS access, like HB_THREAD_STUB in
      xHarbour's hvm.c, but it makes the code a little bit ugly,
      though it will probably improve MT speed by a few percent.
   2. we can change the ABI so that each Harbour function which may
      need HVM access receives a pointer to HB_STACK. Quite easy
      for HB_FUNC(), but for internal functions it will require much
      more work.
   3. we can leave it as is and wait for new hardware and OSes, where
      TLS access is usually greatly improved, often by native
      hardware support.

I know very little about TLS. How are TLS functions and the __thread keyword implemented in other compilers and OSes on the same x86 hardware? Why is the MT/ST speed ratio not so big on Linux? Can we implement our own TLS based on the GCC/glibc idea?
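For what it's worth, a minimal sketch of the GCC/glibc style, assuming an x86 Linux target: a __thread variable is resolved at compile/link time to a %fs- or %gs-relative load (often a single mov), whereas the portable pthread_getspecific() / Win32 TlsGetValue() route always costs a function call. That difference is one plausible reason the MT/ST ratio is smaller on Linux:

```c
/* assumed GCC: compiler-level TLS via the __thread keyword;
 * on x86 Linux access typically compiles to one segment-relative mov,
 * not a library call like pthread_getspecific()/TlsGetValue() */
static __thread long s_tls_counter = 0;

long tls_bump( void )
{
   /* each thread sees and increments its own private counter */
   return ++s_tls_counter;
}
```

A hand-rolled TLS in the same spirit would reserve a slot reachable from a per-thread base register, but that is exactly what the compiler's __thread support already does where available.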


PS. Do you have an assembler version of the InterlockedDec() function?

If you mean using the xadd instruction, it's almost the same as increment. Note that as a standalone function the pointer argument has to be loaded from the stack first ([esp] itself holds the return address); with a stdcall convention the final instruction would be "ret 4". Increment is:
    mov         ecx,[esp+4]     ; pointer to the counter
    mov         eax,1
    lock xadd   [ecx],eax       ; eax receives the old value
    inc         eax             ; return the new value
    ret
Decrement is:
    mov         ecx,[esp+4]
    mov         eax,-1
    lock xadd   [ecx],eax
    dec         eax
    ret
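The same decrement can be embedded directly in C with GCC-style extended inline assembly (a sketch assuming an x86 target and a 32-bit int; on other architectures the __sync_sub_and_fetch() builtin would be the portable choice):

```c
/* Atomic decrement via "lock xadd", GCC/Clang inline-asm syntax,
 * x86 only. Returns the new (decremented) value. */
static int atomic_dec( volatile int * p )
{
   int r = -1;
   __asm__ __volatile__( "lock; xaddl %0, %1"
                         : "+r" ( r ), "+m" ( *p )
                         :
                         : "memory" );
   /* xadd left the OLD value in r; subtract 1 to get the new one */
   return r - 1;
}
```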


Best regards,
Mindaugas
_______________________________________________
Harbour mailing list
Harbour@harbour-project.org
http://lists.harbour-project.org/mailman/listinfo/harbour
