On Thu, 01 Jan 2004 21:32:22 -0500, [EMAIL PROTECTED] (Uri Guttman) wrote:

UG> Uri Guttman
NS> Nigel Sandever. (Mostly not reproduced here!)
NS> REENTRANCY

UG> this is true for c level threads but not necessarily true for VM level
UG> threads. if the VM is atomic at its operation level and can't be
UG> preempted (i.e. it is not using kernel threads with time slicing),

One of the major points of what I am (was?) proposing is that there would be
a 1:1 mapping between VM threads and kernel threads.

The second point is that atomic operation of the VM threads does not imply
that they cannot be preempted. It only means that when one VM thread is
preempted by the scheduler while inside a critical section, no other thread
that is waiting to enter that same critical section will be scheduled until
the first thread exits it.

UG> then
UG> it could use thread unsafe calls (as long as it keeps those silly static
UG> buffers clean).

The point about avoiding calling the CRT directly from the main bulk of the
code does not preclude the use of (suitably thread-aware) CRT calls
completely. The idea is to acknowledge that

A) Not all CRTs are equally well crafted or complete.
B) Thread-aware CRTs are not available with all compilers for all platforms.

By writing the main bulk of the code in terms of a virtual CRT using macros,
the main bulk of the code becomes platform independent. The only change
required when porting the code to a different platform is to write a set of
suitable expansions of the macros for that platform/compiler. Where the
available CRT is up to the job, the macros expand directly to the underlying
CRT call. Where it is not, a suitable alternative can be written that
bypasses the CRT and goes directly to the OS.

As an example: compiling P5 with Borland 5.5 for NT, I encountered the
restriction that the Borland runtime didn't support fseek() or ftell() on
files > 4GB. The underlying platform does, and it was relatively trivial to
replace the calls to the CRT functions with appropriate NT syscalls and so
enable both USE_LARGE_FILES and PERLIO. The only difficulty was that instead
of being able to make the changes in a single place, it was necessary to
make essentially the same change in 3 or 4 different files. To compound
matters, the modifications required conditional compilation directives at
all 4 places, and these had to be interspersed with other #ifdefs for other
platforms and other purposes.

The idea was to avoid this morass of conditional compilation and
copy-and-paste coding by reducing the conditional directives in the main
code to a single

  #if defined( THIS_PLATFORM )
  #  include "this_platform_crt.h"
  #endif

Within that header would be a simple set of #ifdefs to include compiler
specific header files, and within those, all the CRT macro expansions. Thus
I sought to ease both porting and maintenance.

It would also become possible to add virtual syscall definitions to the high
level code, and expand them differently, perhaps radically so, on a platform
by platform basis. In essence, creating a virtual OS for the VM.
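To make that concrete, here is a minimal sketch of what one such macro might
look like, built around the large-file seek example above. The macro name
PERL_FSEEK64, the header name this_platform_crt.h and the helper
win32_fseek64() are purely illustrative (they are not existing Perl or
Parrot names); SetFilePointer, _fileno and _get_osfhandle are the real
calls.

    /* this_platform_crt.h (hypothetical) */
    #if defined(WIN32) && defined(__BORLANDC__)
    /* The Borland CRT cannot seek past 4GB, so bypass it and call the OS. */
    #  include <windows.h>
    #  include <stdio.h>
    #  include <io.h>
    static int win32_fseek64(FILE *f, __int64 offset, int whence)
    {
        /* SEEK_SET/SEEK_CUR/SEEK_END happen to match FILE_BEGIN/CURRENT/END */
        LARGE_INTEGER li;
        li.QuadPart = offset;
        li.LowPart  = SetFilePointer((HANDLE)_get_osfhandle(_fileno(f)),
                                     li.LowPart, &li.HighPart, (DWORD)whence);
        return (li.LowPart == INVALID_SET_FILE_POINTER
                && GetLastError() != NO_ERROR) ? -1 : 0;
    }
    #  define PERL_FSEEK64(f, off, whence)  win32_fseek64((f), (off), (whence))
    #else
    /* A CRT that is up to the job: expand straight to the CRT call. */
    #  include <stdio.h>
    #  define PERL_FSEEK64(f, off, whence)  fseeko((f), (off), (whence))
    #endif

On a platform whose CRT already copes, the macro costs nothing beyond the
normal call; everywhere else, the change is confined to this one header.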
UG> parrot will (according to dan) use one interpreter per
UG> VM thread and those may run on kernel threads.

From what I have read, the decision as to whether each VM thread also gets a
separate interpreter was one of the things Dan was opening up for
discussion?

Personally, with the perspective of (the forkless) NT platform, and my
experience with trying to make good use of both p5's 5005threads and
ithreads implementations, I am completely against the idea of VM threads
having a 1:1 relationship with interpreters.

The runtime costs of duplicating all shared data between interpreters, tying
all shared variables, added to the cost of locking, throw away almost all of
the benefit of using threads. Combine that overhead with the restrictions of

a) no shared references, therefore:
b) no shared objects, therefore:
c) no shared methods;
d) no shared compound data structures;

and the only hope for multiprocessing becomes forking, which Win32 doesn't
support natively, and which, without the benefit of process-level COW
performed by the kernel, results in a kludged emulation of a non-native
facility that will never be usable in any real sense of the word.

UG> it may be possible to
UG> disable preemption and/ or time slicing so the VM threads will be atomic
UG> at the VM operation level and then we don't have to worry as much about
UG> thread unsafe libs.

I was trying to demonstrate that there is no need to disable the scheduler
in order for VM threads to achieve atomic VM-level operations.

UG> but i gather that people want real preemption and
UG> priorities and time slicing so that idea may be moot.

Yes. I am one of them :)

UG> but on most
UG> platforms that support kernel threads there are thread safe versions of
UG> most/ all the c lib stuff.

As I said above, this is true, but it is also compiler dependent. And even
where thread-safe CRTs are available, they aren't always reliable.
Insulating the bulk of the sources from the vagaries of different CRTs was
my aim.

UG> now, external libs that get linked in under
UG> nci is a different story.

I can't address this as I have no idea what nci is :), and nothing
applicable to this context came up in a google for the term.

NS> ATOMICITY AND CRITICAL SECTIONS

UG> that is what i mentioned above, disabling time slicing/ preemption when
UG> desired. it is not just a win32 concept. hell, turning off interrupts
UG> during interrupt handlers goes way back! redmond just likes to rename
UG> stuff and act like they invented it. :)

I cannot emphasise enough that critical sections are different from both
Events and Semaphores. The distinction is that a critical section does not
disable time slicing/preemption. It prevents any thread attempting to enter
an owned critical section from receiving a timeslice until the current owner
relinquishes it. This means that other threads in the same process, and
other processes, continue to be timesliced preemptively. Only the thread(s)
waiting to enter the specified critical section are blocked. There is no
'spinning' involved.

UG> in effect it sounds like a thread shared mutex. it could be implemented
UG> in kernel or process space.

They are indeed a form of mutex, but the process-space state, and avoiding
the need to switch to kernel mode, is crucial: it is this that makes them
lighter weight than a conventional IPC Mutex. A CRITICAL_SECTION is a small
structure allocated in user space. It must be allocated by the user-space
code, and then initialised through an API call. Once initialised, it cannot
be moved, copied or modified (directly) by the user code. If you have the
time and inclination, there is a brief (25 line) description of them at
[http://msdn.microsoft.com/library/en-us/dllproc/base/critical_section_objects.asp]
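For those without a Win32 background, this is roughly how one would be used
to serialise VM-level operations. InitializeCriticalSection,
EnterCriticalSection, LeaveCriticalSection and DeleteCriticalSection are the
real API calls described at that link; the surrounding names (op_lock,
vm_execute_op) are just illustrative.

    #include <windows.h>

    /* One critical section guarding VM-level operations (hypothetical
       placement; in practice it would live in the interpreter).        */
    static CRITICAL_SECTION op_lock;

    void vm_startup(void)
    {
        /* Must be initialised once before any thread tries to enter it. */
        InitializeCriticalSection(&op_lock);
    }

    void vm_execute_op(void (*op)(void *), void *pmc)
    {
        EnterCriticalSection(&op_lock);   /* waiters block; no busy spinning */
        op(pmc);                          /* the atomic VM-level operation   */
        LeaveCriticalSection(&op_lock);
    }

    void vm_shutdown(void)
    {
        DeleteCriticalSection(&op_lock);
    }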
UG> that flag setting needs to be atomic or a mutex or similar. plain flag
UG> setting won't work. also the blocking has to be kernel level (so a
UG> kernel mutex/ semaphore is needed) so the kernel slicing will work.

Indeed. A simple flag *is* all that is necessary to protect shared (fat)
data structures, *if* the VMs are the only code that can achieve concurrent
access to the contents of the PMC, and the VMs are already interlocked at
the operation level.

A PMC is an object, and the only code that can affect the state of that
object is its methods. The methods can only be invoked by the VM. Each
method constitutes an (atomic) operation at the VM level. All that is
required to protect an object from corruption through concurrent access and
state change is to prevent two (or more) VMs from trying to access the same
object simultaneously.

In order for the VM to address the PMC and load it into a VM register, it
must know where it is; i.e. it must have a pointer. It can use this pointer
to access the PMC with a Bit Test and Set operation (x86 BTS)
[http://www.itis.mn.it/linux/quarta/x86/bts.htm]. This is a CPU-atomic
operation, and it occurs before the VM enters the VM-operations critical
section.

Once the bit has been set in the PMC header by one VM, if the scheduler
intervened exactly before the next instruction in that thread and scheduled
a second VM thread to run, and that VM also attempted to access the same
PMC, the first operation it would try to perform is the same BTS
instruction. The next instruction then tests CF (the carry flag) and knows
that the PMC is already in use. Hence, it relinquishes (yields) the rest of
its timeslice. It will continue to do so until the result of the BTS is that
CF is clear. This will only happen when the original VM that caused the bit
to be set has entered the critsec, performed and completed the operation,
and exited the critsec, at which point it does a Bit Test and Reset on the
bit (BTR) [http://www.itis.mn.it/linux/quarta/x86/btr.htm].

Every time the second (or subsequent) VM attempts to use a PMC that is
already in use, it finds the bit set and immediately yields. This can only
happen during the interval between the first VM issuing the BTS and that
same VM entering the critsec. Once the critsec has been entered, no other VM
that could load that PMC would even be scheduled.
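A rough C rendering of that sequence, assuming a hypothetical in_use flag in
the PMC header. InterlockedExchange and SwitchToThread are real Win32 calls
standing in for the raw BTS/BTR instructions and the yield; a real
implementation would more likely use the CPU instructions or compiler
intrinsics directly.

    #include <windows.h>

    typedef struct PMC {
        volatile LONG in_use;   /* hypothetical "busy" bit in the PMC header */
        /* ... the rest of the PMC ... */
    } PMC;

    /* Claim the PMC before entering the VM-operation critical section. */
    static void pmc_acquire(PMC *pmc)
    {
        /* InterlockedExchange is an atomic test-and-set: it stores 1 and
           returns the previous value, much as BTS leaves the old bit in CF. */
        while (InterlockedExchange(&pmc->in_use, 1) != 0)
            SwitchToThread();   /* already in use: yield the rest of the slice */
    }

    /* Release it once the operation completes (the BTR step). */
    static void pmc_release(PMC *pmc)
    {
        InterlockedExchange(&pmc->in_use, 0);
    }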
UG> we can't stop stupid coders from themselves.

Agreed :)

UG> more than test/set is needed as it has to be atomic. so CPU's have that
UG> instruction but we can't get at it directly from c and we have to
UG> emulate it on platforms which don't have it. assuming it is short and
UG> fast is not a good idea. now the atomic semaphore problem was solved
UG> long ago by dijkstra and it is a two phase test/set gizmo. it needs
UG> several machine instructions or a few lines of c. this is not a trivial
UG> amount of overhead for each access to a shared object. also this type
UG> of lock is in user space and won't block the kernel thread running this
UG> VM thread. so we really need kernel semaphores and that means more
UG> storage on each PMC that could be locked.

I think this is all covered above.

UG> this is part of why dan said that the unthreaded version will be
UG> designed to be faster. if you don't have all those locks and don't
UG> compile in the storage for them, it will be faster. i would find it
UG> amazing if someone could design a threaded system that actually ran
UG> faster than the same thing with threads ripped out (even in a single
UG> threaded application).

If my description above is to be believed, then the BTS opcode plus a
conditional jump (JNC) [http://www.itis.mn.it/linux/quarta/x86/jcc.htm] to
skip over the yield are the only penalty of leaving the VM interlocking code
in place for non-threaded versions. As the flag would never test set on
entry, the jump would always be taken and the yield code would never come
into play.

The costs associated with entering and exiting the critsec are minimal, but
even this code could be bypassed until a flag is set in the interpreter to
indicate that a second VM thread has been created.

NS> CONCLUSION (TENTATIVE, NO PROOF)
NS> As all VHLL entities are PMCs at the VMI level, the sharing of data
NS> between threads at the VHLL level is done entirely through those
NS> PMCs. If no single PMC can be the subject of an opcode on two threads
NS> concurrently, there /should/ be no opportunity for conflict.

UG> then you have no sharing. the issue is when you do have sharing. the
UG> ithreads architecture is no IMPLICIT sharing so you can start out like
UG> that. variables (and their underlying PMCs) can be declared shared at
UG> compile or runtime (and with perl, who can tell the difference?
UG> :). those shared things must each have a lock (preferably kernel level
UG> so the locking thread can block and not spin) and that requires storage
UG> and extra code wrapping access to those things.

Again, I think everything here was dealt with above.

NS> As all VMI internal state are encapsulated within the VMI register set,
NS> and each thread has it's own set of registers, the internal state of the
NS> VMI(s) should be equally secure.

UG> you can't lock internal state nor do you need to as only the interpreter
UG> can see them. the possibly shared things are variables and data.

By "internal state of the VMI" (interpreter), I was referring to such things
as the VHLL program that the interpreter is running. If a subroutine comes
into existence at runtime (through eval, for example) then the user program
changes, and with it, the state of the interpreter. If the garbage collector
runs, again the internal state of the interpreter is modified. If these
operations, like all user-program directed operations, are run within the VM
as atomic operations, then the state of the interpreter is protected.

UG> there may be times when a GC run needs to be initiated DURING a VM
UG> operation.

This is a matter of design. The suggestion that memory allocation be managed
at the class level was central to the proposal. If memory needs to be
allocated as part of a VM operation, then it is associated with a single
PMC: the subject of the operation. If the pool of memory used by that PMC is
entirely local to the PMC, and allocated by an operation within the PMC,
then a failure to obtain memory requires that only *that* pool of memory be
garbage collected. As such, the GC cycle can be run on that pool as a
(nested) VM-level operation. The limited scope of the pool to be GC'd makes
this a much cheaper increment than a global GC would be in the same
circumstances. If the GC doesn't succeed in locating and freeing space,
another (few) pages of the reserved allocation are then committed. No other
pool is affected. No shuffling or copying is required. See below.
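The allocation path would then look something like the following sketch.
Every name in it (ClassPool, pool_take_free_slot, pool_local_gc,
pool_commit_more_pages) is hypothetical, and the helpers are only declared,
not defined; the point is simply the order of the fallbacks.

    #include <stddef.h>

    typedef struct ClassPool ClassPool;

    void *pool_take_free_slot(ClassPool *pool, size_t size); /* from free list */
    void  pool_local_gc(ClassPool *pool);       /* GC *this* pool only */
    void  pool_commit_more_pages(ClassPool *pool, size_t at_least);

    void *class_alloc(ClassPool *pool, size_t size)
    {
        void *p = pool_take_free_slot(pool, size);
        if (p != NULL)
            return p;

        /* Allocation failed: run GC over this pool only, as a nested
           VM-level operation. No other pool is touched or locked.      */
        pool_local_gc(pool);
        p = pool_take_free_slot(pool, size);
        if (p != NULL)
            return p;

        /* Still no room: commit a few more pages of the class's
           reserved address range and retry (see MEMORY MANAGEMENT).    */
        pool_commit_more_pages(pool, size);
        return pool_take_free_slot(pool, size);
    }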
UG> if the op requires an immediate large chunk of ram it can
UG> trigger a GC pass or allocation request. you can't force those things to
UG> only happen between normal ops (which is what making them into ops
UG> does). so GC and allocation both need to be able to lock all shared
UG> things in their interpreter (and not just do a process global lock) so
UG> those things won't be modified by the other threads that share them.
UG> this is what dan is all so scared about. you can't hide gc and
UG> allocation under a VM carpet. it has to be out there and visible at all
UG> times and it needs total access to an interpreter's stack and other
UG> stuff.

By moving away from a single central pool, or even many central pools of
similarly sized objects, to a much finer-grained set of pools, each managed
directly by the consumer class of that pool, allocation and GC become
operations of the PMC (class) itself.

Other pool(s) of memory, used to satisfy the interpreter's needs for its own
use as opposed to those done on behalf of the user program, would also be
required. But these will tend to have a relatively longer life and will be
allocated outside of the VM operations. The separation of interpreter memory
requirements from those of the user program also translates into less work
to do at both allocation and GC time for either.

UG> there is no real main loop anymore. there are multiple main loops (one
UG> in each interpreter. remember, each interpreter is mapped to a real
UG> kernel thread (if possible).

Again, I think I have covered this above. I know there will be multiple
"main loops". In the scheme outlined, there would be one "main loop" per
VM/kernel thread, and these would share a single interpreter. Indeed,
multiple interpreters within the same process would be possible, and each
interpreter would be able to have multiple VM threads.

Two or more interpreters (and the VMs that run them) would never, in fact
could never, interfere with each other. Their state, including their memory
pools, garbage collectors and everything else, would be entirely separate.
The only exceptions would be unavoidably process-global OS entities, like
files, sockets, Events etc., but these are already managed by multi-threaded
CRTs and there should be no reason to make extra provision for them.

UG> you can have a single thread dedicated to
UG> handling signals and events but each of the others must check that
UG> process global event queue.

Agreed. This is the way I would implement asynchronous events. In fact,
every Win32 process already has an event queue by default. This is used by
native GUI applications to manage events for single- and multi-threaded
applications, including the message-passing mechanisms for communicating
between windows running on different threads. In part, this is the source of
some of the inspiration for the model I am suggesting. I don't have a great
deal of VM interpreter knowledge, but many of the same problems arise in
threaded GUI applications, and I have a lot of experience with those.

In fact, the Win32 port of P5 utilises this event queue to provide its (very
limited) emulation of *nix signals. It would theoretically be possible to
extend that current emulation to cover more than the 2 or 3 signals now
supported. By extending this mechanism, it would be possible to enable
SIG_ALRM and SIG_CHLD, for example, though the latter would be limited to
interoperation with other pseudo-forked children.

UG> and the classic problem is there too, which
UG> thread gets what events which is particularly nasty with signals.

This only becomes a problem if the process can be considered to be more than
a single program -- as with pseudo-forked children. However, if the raiser
of the signal can somehow identify the thread it wishes to signal, then
there is no problem tagging the signal with its destination and having each
thread only respond to signals tagged for it. Where a signal is raised by
the OS, which has no way to target individual threads/interpreters, the
signal would be acted upon by ALL threads in the process.
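As a sketch of how such tagging might look on Win32, where each thread can
have its own message queue: PostThreadMessage, PeekMessage and the WM_APP
range are the real API pieces; WM_APP_SIGNAL, raise_signal_in_thread and
check_pending_signals are hypothetical names, not existing Perl or Parrot
code.

    #include <windows.h>

    #define WM_APP_SIGNAL (WM_APP + 1)   /* hypothetical "VM signal" message */

    /* Raiser side: deliver a 'signal' to one specific VM/kernel thread.
       The thread id identifies the destination, so only that thread sees it. */
    BOOL raise_signal_in_thread(DWORD thread_id, int sig)
    {
        return PostThreadMessage(thread_id, WM_APP_SIGNAL, (WPARAM)sig, 0);
    }

    /* Receiver side: each VM thread drains its own queue between ops.
       (The queue is created the first time the thread calls PeekMessage.) */
    void check_pending_signals(void)
    {
        MSG msg;
        while (PeekMessage(&msg, NULL, WM_APP_SIGNAL, WM_APP_SIGNAL, PM_REMOVE)) {
            int sig = (int)msg.wParam;
            /* hand sig to the interpreter's dispatcher here */
            (void)sig;
        }
    }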
NS> MEMORY MANAGEMENT

UG> the problem is too many different memory pools. you will waste enormous
UG> amounts of unused space in each different sized pool as they all will
UG> allocate slabs and break them up to their own granularity. so you will
UG> have much of the allocated ram just sitting in pools and not being used
UG> unless each class creates enough objects to use them.

Virtual memory allocation
[http://msdn.microsoft.com/library/en-us/memory/base/virtualalloc.asp] can
be done in terms of reservation rather than committal. Each class would
reserve some sized lump of virtual memory, but only commit a small amount
(in 4KB pages) of the total reserved.

UG> see above. this is a fine scheme for a few queues of special size but
UG> not for supporting hundreds of queues of differing sizes. and then you
UG> get the real nasty problem of a class adding attributes at run time
UG> which will necessitate changing the storage needed for all its instances
UG> (this is the notification stuff dan has mentioned). this would require a
UG> massive copy of all instances and an allocation of a completely new ram
UG> pool of the new size and all the overhead of setting it up, etc.

Theoretically, each VirtualAlloc call can reserve as much virtual memory as
it likes (up to the limit of the available address space) and then commit it
in 4KB chunks on demand. The fact that no two such 'full-memory'
reservations could ever be completely committed doesn't matter. The
OS/hardware takes care of mapping the committed virtual memory to physical
memory. As far as the program is concerned, there is no shuffling of memory
required to expand any single allocation.

UG> how do you know who will grow quickly? larry hasn't published the specs
UG> for PSI::ESP yet. :) also applications never see page faults so we can't
UG> use that.

There are some indications (even without ESP :) of which classes are likely
to grow faster than others. The SCALAR class is likely to see considerably
more activity than the FATAL_ERROR_DUMP class, for example. The former might
pre-commit 1MB on start-up and extend in 1MB increments, whereas the latter
might opt to have no pre-committal at all and expand only when allocation is
requested. It might even be possible to have the reservation, pre-committal
and extension values configurable at run time on a class-by-class basis.
These might be optional class creation/load-time parameters.
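A minimal sketch of that reserve-then-commit pattern. VirtualAlloc and
VirtualFree are the real calls; the pool_create/pool_grow names and the
256MB/64KB figures are arbitrary illustrations, not proposed values.

    #include <windows.h>

    #define POOL_RESERVE (256 * 1024 * 1024) /* arbitrary per-class reservation */
    #define POOL_COMMIT  (64 * 1024)         /* grow the pool 64KB at a time    */

    /* Reserve a large stretch of address space for a class's pool, but
       commit (i.e. back with real pages) only the first chunk of it.    */
    char *pool_create(size_t *committed)
    {
        char *base = VirtualAlloc(NULL, POOL_RESERVE, MEM_RESERVE, PAGE_NOACCESS);
        if (base == NULL)
            return NULL;
        if (VirtualAlloc(base, POOL_COMMIT, MEM_COMMIT, PAGE_READWRITE) == NULL) {
            VirtualFree(base, 0, MEM_RELEASE);
            return NULL;
        }
        *committed = POOL_COMMIT;
        return base;
    }

    /* Commit the next chunk of the already-reserved range on demand.
       The pool's addresses never move, so nothing has to be shuffled.   */
    int pool_grow(char *base, size_t *committed)
    {
        if (*committed + POOL_COMMIT > POOL_RESERVE)
            return 0;  /* reservation exhausted */
        if (VirtualAlloc(base + *committed, POOL_COMMIT,
                         MEM_COMMIT, PAGE_READWRITE) == NULL)
            return 0;
        *committed += POOL_COMMIT;
        return 1;
    }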
NS> COOPERATIVE MULTIPROCESSING AND CO_ROUTINES.

UG> you seem to be crossing the kernel and VM boundaries here.

I don't believe I have.

UG> parrot will
UG> have co-routines but at the VM level.

In the model I describe, a VM thread *is* a kernel thread.

UG> they won't map easily onto any form
UG> of kernel coros since parrot needs to keep a reference to a parrot level
UG> stack and all sorts of VM level info in the coro object/thing.

Agreed. This is why each VM thread, in addition to having its own set of VM
registers, also needs its own VM stack(s), just as kernel threads have at
the CPU level.

UG> this
UG> problem is similar to the problem of mapping VM threads onto kernel
UG> threads. there are many ways to do it and none are perfect.

I doubt this mechanism gets any closer to perfect than any other, but I do
think that it is a) feasible, on Win32 at least, and b) has some unique
features worthy of consideration.

I don't know if all the requisite (Win32) OS facilities are available on all
other threaded platforms. There is no doubt in my mind that equivalent
facilities are either available (probably under different names), or could
be made available. Linux runs on the same hardware as Win32, and the rest is
just software. Whether anything that doesn't exist could be fabricated at a
level that wouldn't mean a kernel rebuild, I simply don't know.

UG> a VM just
UG> doesn't have the internal granularity of a kernel and the ability to
UG> truly block threads when waiting.

The scheme of using a single critsec to arbitrate and interlock between VM
threads is the basis of the mechanism for doing exactly this. When a thread
attempts to enter a critsec that has already been entered by another thread,
it is effectively removed from the scheduler's tables and will not be
dispatched until the thread holding the critsec exits it. No other threads
are affected unless they too attempt to enter the same critsec.

UG> VM coros also can't utilize the kernel
UG> for switching since a kernel stack only needs a pointer and a few
UG> registers to be swapped vs a larger and more complex swap with VM coros.

Each of the co-operating routines would run within its own VM thread, and
that VM thread would be converted to a fibre (in Win32 terms)
[http://msdn.microsoft.com/library/en-us/dllproc/base/threadproc.asp]. As
each VM thread already has the wherewithal to retain *all* of its state
across scheduler switches, once the thread is converted to a fibre it
retains those same abilities. The only change is that now, instead of the
switching between VMs occurring at the scheduler's behest, it occurs as a
result of a user program directive. I'm not sure what the Parrot, IMCC or P6
language-level terms for the transfer of control are or will be, but at the
Win32 level it would just translate to SwitchToFiber( lpfibre );
[http://msdn.microsoft.com/library/en-us/dllproc/base/threadproc.asp]

That's not a complete spec., but it's the root of the idea.
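A minimal, self-contained illustration of that transfer of control, assuming
two co-operating routines. ConvertThreadToFiber, CreateFiber, SwitchToFiber
and DeleteFiber are the real Win32 calls; the names main_fiber, coro_fiber
and coro_body are illustrative only.

    #include <windows.h>
    #include <stdio.h>

    /* Two co-operating "VM threads" as fibres: control only transfers when
       one side explicitly yields to the other, never at the scheduler's whim. */
    static LPVOID main_fiber;
    static LPVOID coro_fiber;

    static VOID CALLBACK coro_body(LPVOID param)
    {
        int i;
        (void)param;
        for (i = 0; i < 3; i++) {
            printf("coro: step %d\n", i);
            SwitchToFiber(main_fiber);     /* explicit transfer back */
        }
        SwitchToFiber(main_fiber);         /* a fibre must never return */
    }

    int main(void)
    {
        int i;
        /* The calling thread must itself become a fibre before switching. */
        main_fiber = ConvertThreadToFiber(NULL);
        coro_fiber = CreateFiber(0, coro_body, NULL);

        for (i = 0; i < 3; i++) {
            SwitchToFiber(coro_fiber);     /* resume the co-routine */
            printf("main: resumed after coro step %d\n", i);
        }
        DeleteFiber(coro_fiber);
        return 0;
    }

The significant point is that, unlike a preemptive switch, nothing happens
until one side explicitly asks for it, which is exactly the property a
co-routine needs.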
UG> you have addressed many issues here and i think that is what dan
UG> wanted. i hope i clarified some of them and explained why there are no
UG> simple answers. we have to bite some performance bullets somewhere due
UG> to the very lofty functional goals we have set.

Likewise, I hope that the above, in conjunction with the offsite links, will
serve as a clarification of my ideas, and as evidence that I did not arrive
at the proposals without considerable thought and research. That's no
guarantee that they are ultimately workable, and that, in part, was the
source of my reluctance to post them here earlier. I am working on a trivial
implementation of most of the core ideas by way of a proof-of-concept, but
it will be some time before this is demonstrable.

I fully appreciate that I am late to the party, and it may be too late to
consider altering the design of Parrot to accommodate these ideas. It may
well be that some or all of the facilities upon which the concept is based
are simply too Win32-specific to translate into a cross-platform
implementation. It's also more than just possible that there are gaping
holes that I haven't addressed, or even considered.

My purpose in posting was prompted by Dan's opening of the discussion, and
by the hope that the skills and expertise of the denizens of this list would
both scrutinise the ideas (as you did) and perhaps be able to relate my
Win32 view of the world to *nix (and other) views. I have no doubt that even
if some of the ideas I have postulated could be used in Parrot to good
effect, they would require (perhaps considerable) refinement and translation
along the way by all those people here who are far more experienced with
interpreter and VM design than I am.

Whilst I have tried to cover everything I can foresee, there is so much that
I do not yet understand about Parrot in particular, and VM interpreters in
general, that I fully accept that none of this might fly here. But as
someone recently pointed out, there is no point in having ideas unless you
are prepared to voice them, and support them against summary dismissal. If
Dan speaks up and says "It's too much, and too little, too late" or "it is
simply too platform-specific and too speculative to even consider", I accept
that.

UG> the key is to keep the
UG> design and api clean and as elegant as possible while keeping up the
UG> performance. and almost all of this is moot in a pure event system which
UG> is why i like them better than threads. :)

Once again, thanks for taking the time to read and digest my original post,
and for your comprehensive feedback. Without it, it is very difficult to
know which parts of what I wrote need clarification.

I had several attempts at writing the original up. They varied from a
12,000-word diatribe that tried (and probably failed) to explain every
detail of the ideas, which would never have been read and would have
patronised anyone brave enough to read it, to the other extreme, which
talked totally in terms of Win32 APIs with no explanation at all. That would
have been like talking Greek to most people here.

UG> uri

Regards, and forever holding my peace :),

Nigel Sandever.