>>>>> "NS" == Nigel Sandever <[EMAIL PROTECTED]> writes:
NS> ATOMICITY AND CRITICAL SECTIONS

UG> that is what i mentioned above, disabling time slicing/preemption when
UG> desired. it is not just a win32 concept. hell, turning off interrupts
UG> during interrupt handlers goes way back! redmond just likes to rename
UG> stuff and act like they invented it. :)

NS> I cannot emphasize enough that Critical Sections are different from
NS> both Events and Semaphores.

NS> The distinction is that a critical section does not disable time
NS> slicing/preemption. It prevents any thread attempting to enter an
NS> owned critical section from receiving a timeslice until the current
NS> owner relinquishes it. This means that other threads in the same process
NS> and other processes continue to timeslice preemptively. Only the
NS> thread(s) waiting to enter the specified critical section are blocked.
NS> There is no 'spinning' involved.

it has to disable slicing/preemption if there are multiple kernel threads
running that have access to that data.

UG> in effect it sounds like a thread shared mutex. it could be implemented
UG> in kernel or process space.

NS> They are indeed a form of mutex, but the process space state, and
NS> avoiding the need to switch to kernel mode, is crucial, as it is
NS> this that makes them lighter weight than a conventional IPC Mutex.

NS> A CRITICAL_SECTION is a DWORD allocated in user space. It must be
NS> allocated by the user space code, and then initialised through a
NS> syscall. Once initialised, it cannot be moved, copied or modified
NS> (directly) by the user code.

NS> If you have the time and inclination, there is a brief (25 line)
NS> description of them at
NS> [http://msdn.microsoft.com/library/en-us/dllproc/base/critical_section_objects.asp]

UG> that flag setting needs to be atomic or a mutex or similar. plain flag
UG> setting won't work. also the blocking has to be kernel level (so a
UG> kernel mutex/semaphore is needed) so the kernel slicing will work.

NS> Indeed. A simple flag *is* all that is necessary to protect shared (fat)
NS> data structures, *if* the VMs are the only code that could achieve
NS> concurrent access to the contents of the PMC, and these are already
NS> interlocked at the operational level.

NS> A PMC is an object, and the only code that can affect the state of
NS> that object are its methods. The methods can only be invoked by
NS> the VM. Each method constitutes an (atomic) operation at the VM
NS> level.

NS> All that is required to protect an object from corruption through
NS> concurrent access and state change is to prevent two (or more) VMs
NS> trying to access the same object simultaneously. In order for the VM to
NS> address the PMC and load it into a VM register, it must know where
NS> it is. Ie. it must have a pointer. It can use this pointer to access
NS> the PMC with a Bit Test and Set operation. (x86 BTS)
NS> [http://www.itis.mn.it/linux/quarta/x86/bts.htm] This is a CPU atomic
NS> operation. This occurs before the VM enters the VM operations critical
NS> section.

ding! ding! ding! you just brought in a cpu specific instruction which is
not guaranteed to be on any other arch. in fact many have such a beast but
again, it is not accessible from c. and still some archs don't have it, so
it has to be emulated by dijkstra's algorithm, and that requires two
counters and a little chunk of code. you can't bring x86 centrism into
this. the fact that redmond/intel threads can make use of this instruction
to do 'critical sections' is a major point why it is bad for a portable
system with many architectures to be supported.
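For concreteness, here is a minimal sketch of the kind of two-flag software
mutual exclusion alluded to above (Peterson's algorithm for two threads).
It is written with C11 atomics, which postdate this discussion and are used
purely for illustration; the compilers of the day would have needed
platform-specific barriers or assembler instead. The busy-wait at the end
is exactly the spinning overhead being objected to:

    /* Hypothetical sketch: two-thread software mutual exclusion
     * (Peterson's algorithm) -- the "two counters and a little chunk
     * of code" needed where no atomic test-and-set is available. */
    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_bool wants[2];   /* one "I want in" flag per thread */
    static atomic_int  turn;       /* whose turn it is to wait */

    void enter_critical(int self)  /* self is 0 or 1 */
    {
        int other = 1 - self;
        atomic_store(&wants[self], true);
        atomic_store(&turn, other);          /* defer to the other thread */
        /* Spin until the other thread either does not want in, or it is
         * our turn.  This is a busy-wait in user space: the thread stays
         * runnable and burns its timeslice instead of blocking. */
        while (atomic_load(&wants[other]) && atomic_load(&turn) == other)
            ;                                /* spin */
    }

    void leave_critical(int self)
    {
        atomic_store(&wants[self], false);
    }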
NS> Once the bit has been set in the PMC header by one VM, if the scheduler
NS> intervened exactly before the next instruction in that thread, and
NS> scheduled a second VM thread to run, and that VM also attempts to access
NS> that same PMC, the first operation it tries to perform is the same BTS
NS> instruction. The next instruction then tests the CF (Carry flag) and
NS> knows that the PMC is already in use. Hence, it then relinquishes
NS> (yields) the rest of its timeslice.

i am well aware about test/set and atomic instructions. my point in my
previous reply was that it can't be assumed to be on a particular arch. so
it has to be assumed to be on none and it needs a pure software solution.
sure, a platform specific macro/assembler lib could be done but that is a
pain as well.

NS> Every time the second (or subsequent) VM attempts to use a PMC
NS> that is already in-use, it finds the bit set and immediately
NS> yields. This will only happen during the interval between the
NS> first VM issuing the BTS, and that same VM entering the
NS> critsec. Once the critsec has been entered, no other VM that could
NS> load that PMC would even be scheduled.

yield is also a kernel specific thing, and what will wake up the thread
after that? it has to block on some kernel thingy and the test/set flag is
not one. you are conflating the vm/kernel boundaries here with intel
specific solutions. the only sure way to block a kernel thread is a kernel
level mutex/lock/semaphore. but that is not a good solution to add to every
PMC that could be shared, or it has to be added at runtime to a PMC that is
being shared. we want fine grained locking, which means locks around each
shared thing. but that would mean a lock inside each shared thing, and that
is a lot of wasted overhead for all those things that aren't shared. and it
is a slower solution as well. and above all of this is any GC or event
handling threads that might need to lock everything in the entire process,
so they may need a global lock that all threads have to obey.

UG> more than test/set is needed as it has to be atomic. so CPU's have that
UG> instruction but we can't get at it directly from c and we have to
UG> emulate it on platforms which don't have it. assuming it is short and
UG> fast is not a good idea. now the atomic semaphore problem was solved
UG> long ago by dijkstra and it is a two phase test/set gizmo. it needs
UG> several machine instructions or a few lines of c. this is not a trivial
UG> amount of overhead for each access to a shared object. also this type
UG> of lock is in user space and won't block the kernel thread running this
UG> VM thread. so we really need kernel semaphores and that means more
UG> storage on each PMC that could be locked.

NS> I think this is all covered above.

i disagree. it is covered for the intel case only and that is not good
enough.

NS> If my description above is to be believed, then the BTS opcode + a JCXZ
NS> [http://www.itis.mn.it/linux/quarta/x86/jcc.htm]
NS> to skip over the yield are the only penalty of leaving the VM
NS> interlocking code in place for non-threaded versions. As the flag would
NS> never test as set on entry, the jump would always be taken and the yield
NS> code would never come into play.
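To make the BTS-then-yield sequence described above concrete, here is a
rough approximation with GCC's __sync_lock_test_and_set builtin and POSIX
sched_yield() standing in for the x86 BTS/JC pair. These primitives, the
struct layout and the pmc_acquire/pmc_release names are not part of the
proposal and are assumed only for illustration. Note that the yielding
thread stays runnable and simply retests the flag when it is next
scheduled, which is the wake-up/spinning problem raised above:

    /* Hypothetical sketch of the flag-in-the-PMC-header scheme. */
    #include <sched.h>

    typedef struct pmc {
        int in_use;            /* the "flag bit" in the PMC header */
        /* ... rest of the PMC ... */
    } PMC;

    static void pmc_acquire(PMC *p)
    {
        /* __sync_lock_test_and_set returns the previous value:
         * 0 means we just claimed the flag, 1 means another VM owns it. */
        while (__sync_lock_test_and_set(&p->in_use, 1) != 0)
            sched_yield();     /* give up the rest of the timeslice; the
                                  kernel does not know what we wait for */
    }

    static void pmc_release(PMC *p)
    {
        __sync_lock_release(&p->in_use);   /* clear the flag */
    }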
NS> The costs associated with entering and exiting critsec are
NS> minimal, but this code could be bypassed until a flag is set in the
NS> interpreter to indicate that a second VM thread has been created.

again, intel specific and it has to be tossed out.

NS> CONCLUSION (TENTATIVE, NO PROOF)

NS> As all VHLL entities are PMCs at the VMI level, the sharing of data
NS> between threads at the VHLL level is done entirely through those
NS> PMCs. If no single PMC can be the subject of an opcode on two threads
NS> concurrently, there /should/ be no opportunity for conflict.

UG> then you have no sharing. the issue is when you do have sharing. the
UG> ithreads architecture has no IMPLICIT sharing so you can start out like
UG> that. variables (and their underlying PMCs) can be declared shared at
UG> compile or runtime (and with perl, who can tell the difference?
UG> :). those shared things must each have a lock (preferably kernel level
UG> so the locking thread can block and not spin) and that requires storage
UG> and extra code wrapping access to those things.

NS> Again. I think everything here was dealt with above.

again, i disagree.

NS> As all VMI internal state is encapsulated within the VMI register
NS> set, and each thread has its own set of registers, the internal
NS> state of the VMI(s) should be equally secure.

UG> you can't lock internal state nor do you need to as only the
UG> interpreter can see them. the possibly shared things are variables
UG> and data.

NS> By "internal state of the VMI" (interpreter), I was referring to
NS> such things as the VHLL program that the interpreter is
NS> running. If a subroutine comes into existence at runtime (through
NS> eval for example) then the user program changes, and with it, the
NS> state of the interpreter. If the garbage collector runs, again
NS> the internal state of the interpreter is modified.

NS> If these operations, like all user-program directed operations,
NS> are run within the VM as atomic operations, then the state of the
NS> interpreter is protected.

but atomic operations are not available on all platforms. so you need more
work for that, and since you have kernel threads you need kernel locks.
test/set in user space is not good enough as you can't block in user space
nor yield and have the kernel wake you up later. the kernel has to be aware
of the test/set location and that means it can't all be done in user space.

UG> there may be times when a GC run needs to be initiated DURING a VM
UG> operation.

NS> This is a matter of design. The suggestion that memory allocation
NS> was managed at the class level was central to the proposal. If
NS> memory needs to be allocated as part of a VM operation, then it is
NS> associated with a single PMC; the subject of the operation. If the
NS> pool of memory used by that PMC is entirely local to the PMC and
NS> allocated by an operation within the PMC, then a failure to obtain
NS> memory requires that only *that* pool of memory need be garbage
NS> collected. As such, the GC cycle can be run on that pool, as a
NS> (nested) VM level operation. The limited scope of the pool to be
NS> GC'd makes this a much cheaper increment than global GC would be
NS> in the same circumstances. If the GC doesn't succeed in locating
NS> and freeing space, another (few) pages of the reserved allocation
NS> are then committed.

but what about shared PMCs? which ram pool gets GC'ed or allocated when one
of the sharing threads needs to deal with it?

NS> No other pool is affected. No shuffling or copying is required. See below.

again, sharing breaks that. thread specific pools still need to be aware of
each other, so threads need to lock them with kernel locks.
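As an illustration of the kind of per-PMC kernel-level lock being argued
for here, a minimal sketch with a POSIX pthread_mutex_t embedded in every
potentially shared PMC. The struct layout, the is_shared flag and the
pmc_op() wrapper are invented for this example and are not Parrot code; the
point is the storage paid by every lockable PMC and the extra code wrapping
each access:

    #include <pthread.h>

    typedef struct pmc {
        int             is_shared;  /* set when the variable is declared shared */
        pthread_mutex_t lock;       /* storage overhead carried by the PMC */
        /* ... vtable pointer, data, flags ... */
    } PMC;

    static void pmc_op(PMC *p)
    {
        if (p->is_shared)
            pthread_mutex_lock(&p->lock);    /* blocks in the kernel, no spin */

        /* ... perform the vtable operation on the PMC ... */

        if (p->is_shared)
            pthread_mutex_unlock(&p->lock);
    }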
NS> By moving away from a single central pool, or even many central
NS> pools of similarly sized objects, to a much finer grained set of
NS> pools, each managed directly by the consumer class of that
NS> pool, allocation and GC become operations of the PMC (class)
NS> itself.

they are already operations of the PMC for most cases. but sharing breaks
that, and so does GC, which can happen inside an operation and not just
between them.

NS> Indeed, multiple interpreters within the same process would be possible.
NS> And each interpreter would be able to have multiple VM threads. Two or
NS> more interpreters (and the VM(s) that run them) would never, in fact
NS> could never, interfere with each other. Their state, including their
NS> memory pools, garbage collectors and everything else, would be separate.

there is one VM to a kernel thread and one interpreter per VM. the parrot
VM is effectively an interpreter per parrot thread. you can't get away from
that.

UG> and the classic problem is there too, which
UG> thread gets what events, which is particularly nasty with signals.

NS> This only becomes a problem if the process can be considered to be more
NS> than a single program -- as with pseudo-forked children.

not in a unix environment. real multi-threaded processes already have this
problem.

NS> However, if the raiser of the signal can somehow identify the
NS> thread it wishes to signal, then there is no problem tagging the
NS> signal with its destination and having each thread only respond
NS> to signals tagged for it. Where a signal is raised by the OS, which
NS> has no way to target individual threads/interpreters, then the
NS> signal would be acted upon by ALL threads in the process.

that is a kernel issue and most signalling systems don't have thread
specific arguments AFAIK.

NS> MEMORY MANAGEMENT

UG> the problem is too many different memory pools. you will waste enormous
UG> amounts of unused space in each different sized pool as they all will
UG> allocate slabs and break them up to their own granularity. so you will
UG> have much of the allocated ram just sitting in pools and not being used
UG> unless each class creates enough objects to use them.

NS> Virtual memory allocation
NS> [http://msdn.microsoft.com/library/en-us/memory/base/virtualalloc.asp]
NS> can be done in terms of reservation rather than committal.
NS> Each class would reserve some sized lump of virtual memory, but
NS> only commit a small amount (in 4k pages) of the total reserved.

virtual ram is what counts on unix. you can't request some large amount
without using real swap space. it may not allocate real ram until later
(page faulting on demand) but it is swap space that counts. it is used up
as soon as you allocate it.

UG> see above. this is a fine scheme for a few queues of special size but
UG> not for supporting hundreds of queues of differing sizes. and then you
UG> get the real nasty problem of a class adding attributes at run time,
UG> which will necessitate changing the storage needed for all its instances
UG> (this is the notification stuff dan has mentioned). this would require a
UG> massive copy of all instances and an allocation of a completely new ram
UG> pool of the new size and all the overhead of setting it up, etc.

NS> Theoretically, each VirtualAlloc call can reserve as much virtual memory
NS> as it likes (up to the limit of physical memory size) and then commit
NS> this in 4K chunks on demand. The fact that no two such 'full-memory'
NS> reservations could ever be completely committed doesn't matter. The
NS> OS/hardware takes care of mapping the committed virtual memory to
NS> physical memory. As far as the program is concerned, there is no
NS> shuffling of memory required to expand any single allocation.

again, redmond specific. physical ram is never the issue and virtual ram is
used as you allocate it. there is no way to separate the two on unix.
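For reference, a minimal sketch of the Win32 reserve-then-commit pattern
being described, using the real VirtualAlloc/VirtualFree flags. The pool
size and the notion of one such reservation per PMC class are assumptions
made only for the illustration:

    #include <windows.h>

    #define POOL_RESERVE  (16 * 1024 * 1024)  /* address space only; size is hypothetical */
    #define POOL_COMMIT   (4 * 1024)          /* one page actually backed at a time */

    int main(void)
    {
        /* Reserve a large range of virtual addresses for this class's pool.
         * Nothing is committed yet. */
        char *pool = VirtualAlloc(NULL, POOL_RESERVE, MEM_RESERVE, PAGE_NOACCESS);

        /* Commit pages on demand from the front of the reserved range. */
        char *page = VirtualAlloc(pool, POOL_COMMIT, MEM_COMMIT, PAGE_READWRITE);
        page[0] = 0;   /* the committed page is now usable memory */

        /* ... carve objects out of the committed page; commit more as needed ... */

        VirtualFree(pool, 0, MEM_RELEASE);    /* release the whole reservation */
        return 0;
    }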
NS> COOPERATIVE MULTIPROCESSING AND CO-ROUTINES

UG> you seem to be crossing the kernel and VM boundaries here.

NS> I don't believe I have.

UG> parrot will
UG> have co-routines but at the VM level.

NS> In the model I describe, a VM thread *is* a kernel thread.

which means a kernel level lock and not a user level lock. this is the
disagreement we have. this is where you are crossing the kernel/VM boundary
(even if you think not).

UG> they won't map easily onto any form of kernel coros since parrot
UG> needs to keep a reference to a parrot level stack and all sorts of
UG> VM level info in the coro object/thing.

NS> Agreed. This is why each VM thread, in addition to having its own
NS> set of VM registers, also needs its own VM stack(s), just as
NS> kernel threads have at the CPU level.

that is expected. it is the locking issue where we disagree.

UG> this problem is similar to the problem of mapping VM threads onto
UG> kernel threads. there are many ways to do it and none are perfect.

NS> I doubt this mechanism gets any closer to perfect than any other, but
NS> I do think that it is
NS> a) feasible, on win32 at least, and
NS> b) has some unique features worthy of consideration.

what win32 does is not portable and not useful at a VM level. kernels and
their threads can work well together. portable VMs and their threads are
another beast that can't rely on a particular architecture instruction or
kernel feature.

NS> I don't know if all the requisite (win32) OS facilities are
NS> available on all other threaded platforms. There is no doubt in my
NS> mind that equivalent facilities are either available (probably
NS> under different names), or could be made available. Linux runs on
NS> the same hardware as Win32, and the rest is just software. Whether
NS> anything that doesn't exist could be fabricated at a level that
NS> wouldn't mean a kernel rebuild I simply don't know.

hardware is tossed out with portable VM design. we have a user space
process written in c with a VM and VM threads that are to be based on
kernel threads. the locking issue for shared objects is the toughest nut to
crack right now. there is no simple or fast solution to this given the
known constraints. intel/redmond specific solutions are not applicable
(though we can learn from them).

UG> a VM just doesn't have the internal granularity of a kernel and
UG> the ability to truly block threads when waiting.

NS> The scheme of using a single critsec to arbitrate and interlock
NS> between VM threads is the basis of the mechanism for doing exactly
NS> this. When a thread attempts to enter a critsec that is already
NS> entered by another thread, it is effectively removed from the
NS> scheduler's tables and will not be dispatched until the thread
NS> holding the critsec exits it. No other threads are affected unless
NS> they too attempt to enter the same critsec.

this is where i feel you are contradicting yourself. you just did a kernel
level block in this critical section, but above you claim it can be done
with just a user level test/set instruction. the kernel handles scheduling,
but that means a thread has to block on a lock or yield on some kernel
level thing. a user level test/set will never get the kernel's attention
and the thread will have to spin on the test.
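To pin down what the single "VM operation" critsec amounts to, here is a
minimal sketch in Win32 terms. Only the CRITICAL_SECTION calls are real
APIs; vm_init() and vm_run_op() are invented names for the illustration.
Note that EnterCriticalSection on a contended critsec is itself a
kernel-level wait, which is the contradiction being pointed out:

    #include <windows.h>

    static CRITICAL_SECTION vm_op_cs;     /* shared by all VM threads */

    void vm_init(void)
    {
        InitializeCriticalSection(&vm_op_cs);
    }

    void vm_run_op(void *op)
    {
        /* If another VM thread owns the critsec, this thread is suspended
         * by the kernel until the owner leaves; no user-space spinning. */
        EnterCriticalSection(&vm_op_cs);

        /* ... execute one opcode atomically with respect to other VM threads ... */

        LeaveCriticalSection(&vm_op_cs);
    }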
UG> VM coros also can't utilize the kernel
UG> for switching since a kernel stack only needs a pointer and a few
UG> registers to be swapped vs a larger and more complex swap with VM
UG> coros.

NS> By having both (each) thread in the co-processes running within
NS> its own VM thread, and converting that VM thread to a fibre (in
NS> win32 terms)
NS> [http://msdn.microsoft.com/library/en-us/dllproc/base/threadproc.asp]

NS> As each VM thread already has the wherewithal to retain *all*
NS> state across scheduler switches, once the thread is converted to a
NS> fibre, it retains these same abilities. The only change is that
NS> now, instead of the switching between VMs occurring at the
NS> scheduler's behest, it occurs as a result of a user program
NS> directive.

ok, i can see where you got the test/set and yield stuff from now. fibres
seem to be manually scheduled threads. this means user code has to decide
to yield to another thread blocking on the same lock. this means all locks
must have a list of threads/fibres blocking on that lock. and manually
scheduled stuff is not appropriate for parrot. finally, it is redmond
specific again and very unportable. i have never heard of this fibre
concept on any unix flavor.

NS> I'm not sure what the Parrot, IMCC or P6 language level terms for
NS> the transfer of control are or will be, but at the (win32 level) it
NS> would just translate to SwitchToFiber( lpfibre );
NS> [http://msdn.microsoft.com/library/en-us/dllproc/base/threadproc.asp]

and that is not possible on other platforms. also it means parrot would
have to have its own scheduler, with all the pain that brings.

UG> you have addressed many issues here and i think that is what dan
UG> wanted. i hope i clarified some of them and explained why there
UG> are no simple answers. we have to bite some performance bullets
UG> somewhere due to the very lofty functional goals we have set.

NS> Likewise, I hope that the above, in conjunction with the offsite
NS> links, will serve as a clarification of my ideas, and as evidence
NS> that I did not arrive at the proposals without considerable
NS> thought and research.

NS> That's no guarantee that they are ultimately workable, and that,
NS> in part, was the source of my reluctance to post them here
NS> earlier. I am working on a trivial implementation of most of the
NS> core ideas by way of a proof-of-concept, but it will be some time
NS> before this is demonstrable.

NS> I fully appreciate that I am late to the party and it may be too
NS> late to consider altering the design of Parrot to accommodate
NS> these ideas.

NS> It may well be that some or all of the facilities upon which the
NS> concept is based are simply too Win32 specific to translate into a
NS> cross-platform implementation.

that is a big point and one that i don't see as possible. redmond can do
what they want with their kernel and user procs. parrot can only use kernel
concepts that are reasonably portable across most platforms. kernel threads
are reasonably portable (FSDO of reasonable) but anything beyond that, such
as fibres, test/set in user space, etc., is not. locks have to be in kernel
space since we can't do a fibre yield in user space on any other platform.
so this rules out user space test/set as well since that would need a
thread to spin instead of blocking. your ideas make sense but only on
redmond/intel, which is not the target space for parrot.
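For readers who have not met the Win32 fibre APIs discussed above, here is
a minimal, self-contained sketch of the kind of user-directed switch being
described. ConvertThreadToFiber, CreateFiber, SwitchToFiber and DeleteFiber
are the real Win32 calls; the idea of one fibre per VM coroutine and the
coro_body contents are assumed purely for illustration, and, as noted,
nothing equivalent is portable:

    #include <windows.h>

    static LPVOID main_fiber;
    static LPVOID coro_fiber;

    static VOID CALLBACK coro_body(PVOID arg)
    {
        /* ... run the coroutine's VM ops ... */
        SwitchToFiber(main_fiber);    /* explicit, user-directed yield back;
                                         a fibre routine must never return */
    }

    int main(void)
    {
        /* The current (VM) thread must first become a fibre itself. */
        main_fiber = ConvertThreadToFiber(NULL);

        /* Each VM coroutine gets its own fibre, i.e. its own stack. */
        coro_fiber = CreateFiber(0, coro_body, NULL);

        SwitchToFiber(coro_fiber);    /* transfer of control; the kernel
                                         scheduler is not involved, which is
                                         the objection raised above */
        DeleteFiber(coro_fiber);
        return 0;
    }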
thanx,

uri

-- 
Uri Guttman  ------  [EMAIL PROTECTED]  -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs  ----------------------------  http://jobs.perl.org