On Fri, 03 Oct 2008, Maurilio Longo wrote:
> And the winner is.... :)
> they're slower than before. a lot!
> I attach two files, speedtst.log which is built with -gc3 and hbvmmt and
> speedtstmt.log which is built with the same options but started with a
> parameter so that it runs with several threads.

The results are much worse than on my Linux box. There are a few possible
reasons:

   1. The cost of task switching is probably much higher in OS2 than in
      Linux because it has to reload memory descriptors for each thread -
      this is forced by the TLS format.
   2. The cost of simultaneous memory allocation is very high. A different
      memory manager (MM) should help here.
   3. The overhead caused by freezing a thread in a critical section is
      very high.

In all tests only two Harbour mutexes are used (I do not count thread
start/stop synchronization because that code is called only once):

The 1-st is inside hb_dynsymFind(), which is used by these tests:

   [ x := &( "f1(" + str(i) + ")" ) ].............................50.81
   [ bc := &( "{|x|f1(x)}" ); eval( bc, i ) ]....................142.90

The 2-nd is inside garbage.c and is used to add/remove each new GC block
to/from the linked list. It's used by tests which allocate/deallocate new GC
items in each loop iteration:

   [ eval( { || i % 16 } ) ].....................................138.32
   [ eval( { |x| x % 16 }, i ) ].................................140.12
   [ eval( { |x| f1( x ) }, i ) ]................................138.03
   [ bc := &( "{|x|f1(x)}" ); eval( bc, i ) ]....................142.90
   [ ascan( a, { |x| x == i % 16 } ) ]...........................133.02
   [ if i%4000==0;a:={};end; aadd(a,{i,1,.T.,s,s2,a2,bc}) ]......131.08

These are the most expensive tests. Please look at the difference for the
same test when an existing codeblock is reused.

Codeblock created in each loop iteration and then evaluated:

   [ eval( { || i % 16 } ) ].....................................138.32

Codeblock created outside the loop and evaluated:

   [ eval( bc := { || i % 16 } ) ].................................1.02

I'll also add a test for creating and copying an empty array to make this
overhead easier to see.

The code covered by the critical section inside garbage.c is minimal:

   HB_GC_LOCK
   hb_gc[Un]Link( &s_pCurrBlock, pAlloc );
   HB_GC_UNLOCK

and hb_gc[Un]Link() is very simple and short, e.g.
hb_gcLink() looks like this:

   static void hb_gcLink( HB_GARBAGE_PTR *pList, HB_GARBAGE_PTR pAlloc )
   {
      if( *pList )
      {
         pAlloc->pNext = *pList;
         pAlloc->pPrev = (*pList)->pPrev;
         pAlloc->pPrev->pNext = pAlloc;
         (*pList)->pPrev = pAlloc;
      }
      else
         *pList = pAlloc->pNext = pAlloc->pPrev = pAlloc;
   }

It's only a few machine instructions. This suggests that the whole overhead
is caused by the very high cost of critical sections in OS2. Because this
code is so simple and short and should execute extremely fast, we can try
to use our own spinlocks instead of OS level semaphores, which seem to be
not efficient enough.

The code covered by the critical section is probably less than 1% of the
whole code executed in the loop, but it kills the performance in OS2. It's
probably caused by a series of two locks (the MM probably also has its own
critical section) which exposes some bad synchronization behavior in OS2,
or mutexes are simply that expensive in this system.

If you are interested then I can implement a small inline ASM function for
OS2 GCC builds and we can compare the results.

To Windows users: I'm very interested in MS-Windows results from a real
multi CPU machine. Does anyone have such a computer and can test it?
It's necessary to move the main() function to the beginning of the code.

I'll replace the current speedtst.prg in SVN with this one in a while.
It seems to be much better for testing different MT aspects, gives much
more valuable results, and the code is also much easier to update.

best regards,
Przemek
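
A minimal sketch of the spinlock idea discussed in this message, for comparison with the OS level semaphore. It is not Harbour code: it assumes a C11 compiler with <stdatomic.h> instead of the inline ASM function proposed for OS2 GCC builds, and Node, s_gcSpin, spin_lock(), spin_unlock(), link_node() and gc_link_locked() are hypothetical names standing in for HB_GARBAGE_PTR, HB_GC_LOCK / HB_GC_UNLOCK and hb_gcLink().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a GC block header; not Harbour's real type. */
typedef struct Node
{
   struct Node * pNext;
   struct Node * pPrev;
} Node;

/* One flag protects the list head; clear means "unlocked". */
static atomic_flag s_gcSpin = ATOMIC_FLAG_INIT;
static Node * s_pCurrBlock = NULL;

/* Busy-wait until the flag was clear and we managed to set it. */
static void spin_lock( atomic_flag * pLock )
{
   while( atomic_flag_test_and_set_explicit( pLock, memory_order_acquire ) )
      ;  /* spin: the protected section is only a few instructions long */
}

/* Release the lock so another spinning thread can take it. */
static void spin_unlock( atomic_flag * pLock )
{
   atomic_flag_clear_explicit( pLock, memory_order_release );
}

/* Same circular doubly linked list insertion as hb_gcLink() above. */
static void link_node( Node ** pList, Node * pAlloc )
{
   assert( pList != NULL && pAlloc != NULL );

   if( *pList )
   {
      pAlloc->pNext = *pList;
      pAlloc->pPrev = ( *pList )->pPrev;
      pAlloc->pPrev->pNext = pAlloc;
      ( *pList )->pPrev = pAlloc;
   }
   else
      *pList = pAlloc->pNext = pAlloc->pPrev = pAlloc;
}

/* Link a block while holding the spinlock instead of an OS mutex. */
static void gc_link_locked( Node * pAlloc )
{
   spin_lock( &s_gcSpin );
   link_node( &s_pCurrBlock, pAlloc );
   spin_unlock( &s_gcSpin );
}
```

Spinning like this only pays off when the protected code is very short and the competing threads run on different CPUs; on a single CPU a waiting thread burns the rest of its time slice, so a real implementation would probably need to yield or fall back to an OS level lock after some number of spins.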