>>>>> "NS" == Nigel Sandever <[EMAIL PROTECTED]> writes:

  NS> REENTRANCY

  NS> Not only must the VMI be coded in a reentrant fashion, with all
  NS> state addressed through pointers (references) loaded into its
  NS> virtual register set; all the code underlying it, including
  NS> syscalls and the CRT, must be equally reentrant. Many APIs within
  NS> many CRTs *are not reentrant* (e.g. strtok()). All state must be
  NS> kept on a per-thread, not a per-process, basis.

  NS> To this end, I would propose that no CRT APIs be used directly from
  NS> the main code!

  NS> Instead, a full set of CRT-like macros would be inherited from a
  NS> header file, where either the real CRT API would be called, or an
  NS> alternative implementation. This header file would be conditionally
  NS> included on the basis of the target platform. This concentrates the
  NS> bulk, if not all, of the platform-specific code into a single file
  NS> (or set of files).

this is true for c level threads but not necessarily true for VM level
threads. if the VM is atomic at its operation level and can't be
preempted (i.e. it is not using kernel threads with time slicing), then
it could use thread unsafe calls (as long as it keeps those silly static
buffers clean). parrot will (according to dan) use one interpreter per
VM thread and those may run on kernel threads. it may be possible to
disable preemption and/or time slicing so the VM threads will be atomic
at the VM operation level and then we don't have to worry as much about
thread unsafe libs. but i gather that people want real preemption and
priorities and time slicing so that idea may be moot. but on most
platforms that support kernel threads there are thread safe versions of
most/all the c lib stuff. now, external libs that get linked in under
nci (the native call interface) are a different story.
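
for the record, reentrant variants of the worst offenders already exist
on both sides of the fence. a minimal sketch of the kind of wrapper
header NS proposes (the file and macro names are made up, not from any
actual parrot source):

    /* crt_wrap.h -- hypothetical wrapper header */
    #ifndef CRT_WRAP_H
    #define CRT_WRAP_H

    #include <string.h>

    #ifdef _WIN32
    /* msvc's reentrant variant: the caller supplies the state pointer
     * that plain strtok() keeps in a static buffer */
    #define CRT_STRTOK(str, delim, state) strtok_s((str), (delim), (state))
    #else
    /* posix reentrant variant, same idea */
    #define CRT_STRTOK(str, delim, state) strtok_r((str), (delim), (state))
    #endif

    #endif /* CRT_WRAP_H */

either way the caller owns the tokenizer state, so two threads can't
stomp on each other.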


  NS> ATOMICITY AND CRITICAL SECTIONS

  NS> Atomicity of the VMI opcodes can be achieved by having the core
  NS> loop of the interpreter enter and exit a critical section each
  NS> time it processes an opcode. For those not familiar with critical
  NS> sections, they are a (Win32) OS construct that prevents any
  NS> cooperating thread within a process from receiving a timeslice
  NS> until the thread that entered the critical section has exited it.

that is what i mentioned above, disabling timeslicing/preemption when
desired. it is not just a win32 concept. hell, turning off interrupts
during interrupt handlers goes way back! redmond just likes to rename
stuff and act like they invented it. :)
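
a minimal sketch of the loop NS describes, assuming win32 critical
sections (Interp, Opcode, fetch_next_op and execute_op are
placeholders, not parrot's real names):

    #include <windows.h>

    typedef struct Opcode Opcode;
    typedef struct Interp { volatile int running; /* ... */ } Interp;
    Opcode *fetch_next_op(Interp *);           /* placeholder prototypes */
    void    execute_op(Interp *, Opcode *);

    static CRITICAL_SECTION vm_cs;  /* InitializeCriticalSection(&vm_cs)
                                     * once at startup */

    void run_loop(Interp *interp)
    {
        while (interp->running) {
            EnterCriticalSection(&vm_cs);  /* no cooperating thread runs
                                            * an op while we hold this */
            execute_op(interp, fetch_next_op(interp));
            LeaveCriticalSection(&vm_cs);  /* other threads may be
                                            * scheduled here, between ops */
        }
    }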

  NS> Unlike semaphores and events, critsecs only operate within a
  NS> single process. They are relatively lightweight as there is no
  NS> need to enter kernel mode for their operation. They are, in
  NS> effect, a CPU-specific "test and set" operation applied to a
  NS> piece of user-mode memory. Their lightweight nature means that
  NS> they are faster than inter-process semaphore mechanisms. When
  NS> used in a process that currently has only a single thread in
  NS> operation, they have an almost negligible performance effect upon
  NS> that thread. They also operate correctly on SMP machines.

in effect it sounds like a thread shared mutex. it could be implemented
in kernel or process space.
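
for comparison, the posix analogue: a mutex whose pshared attribute is
PTHREAD_PROCESS_PRIVATE (the default, shown explicitly here) only
synchronizes threads within one process, just like a critsec:

    #include <pthread.h>

    pthread_mutex_t vm_lock;

    void vm_lock_init(void)
    {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_PRIVATE);
        pthread_mutex_init(&vm_lock, &attr);
        pthread_mutexattr_destroy(&attr);
    }

many implementations also spin briefly in user space before sleeping in
the kernel, which is where the "lightweight" feel comes from.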

  NS> If two threads attempt concurrent operations on the same PMC, the
  NS> first thread accessing the PMC sets the flag. When a second
  NS> thread attempts to access it, it finds the flag set and blocks
  NS> (relinquishing its timeslice) until the first thread has
  NS> completed and cleared the flag. This doesn't solve the potential
  NS> for deadlock problems arising from the user-level code, though it
  NS> may be possible for the VMI to detect and diagnose these at
  NS> runtime, in a similar way to how deep-recursion detection is done
  NS> in P5.

that flag setting needs to be atomic, or use a mutex or similar. plain
flag setting won't work. also the blocking has to be at kernel level (so
a kernel mutex/semaphore is needed) so the kernel scheduler can actually
block the waiting thread.
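
to make the race concrete: a plain "if (!flag) flag = 1;" lets two
threads both see 0 and both proceed. a sketch of the atomic version,
using c11 atomics as a stand-in for the cpu instruction (c11 is an
anachronism here, and note this spin+yield still never blocks in the
kernel, which is exactly the objection above):

    #include <stdatomic.h>
    #include <sched.h>

    typedef struct PMC {
        atomic_flag in_use;   /* hypothetical per-PMC lock bit;
                               * initialize with ATOMIC_FLAG_INIT */
        /* ... real PMC fields ... */
    } PMC;

    void pmc_lock(PMC *pmc)
    {
        /* atomically sets the flag and returns its previous value:
         * only one thread ever sees "was clear" */
        while (atomic_flag_test_and_set(&pmc->in_use))
            sched_yield();    /* user-space spin + yield, no kernel block */
    }

    void pmc_unlock(PMC *pmc)
    {
        atomic_flag_clear(&pmc->in_use);
    }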

deadlock detection is a known problem in the DB world and we can do
similar things. but are those locks VM level or kernel level? if the
coder wants to prevent deadlocks they have to do a higher level lock
(analogous to a full DB lock or transaction). we can't save stupid
coders from themselves.

  NS> The cost to the single-threaded user application is a test&set
  NS> operation (a few cycles) per object-method, plus the flag, which
  NS> need only be one bit, but may be more efficiently coded as a
  NS> 32-bit entity.

more than test/set is needed as it has to be atomic. some cpus have that
instruction but we can't get at it directly from c, and we have to
emulate it on platforms which don't have it. assuming it is short and
fast is not a good idea. now, the atomic semaphore problem was solved
long ago by dijkstra and it is a two-phase test/set gizmo. it needs
several machine instructions or a few lines of c. this is not a trivial
amount of overhead for each access to a shared object. also this type
of lock is in user space and won't block the kernel thread running this
VM thread. so we really need kernel semaphores and that means more
storage on each PMC that could be locked.
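
dijkstra's two-phase idea in sketch form: spin briefly in user space,
then park on a kernel semaphore so the scheduler can run someone else.
illustrative only (real implementations, futexes and friends, are
subtler), and every name here is made up:

    #include <stdatomic.h>
    #include <semaphore.h>

    typedef struct {
        atomic_int state;     /* 0 = free, 1 = held */
        sem_t      waiters;   /* sem_init(&l->waiters, 0, 0) at setup */
    } TwoPhaseLock;

    void tp_lock(TwoPhaseLock *l)
    {
        int expected;
        for (int i = 0; i < 100; i++) {   /* phase 1: cheap user-space spin */
            expected = 0;
            if (atomic_compare_exchange_weak(&l->state, &expected, 1))
                return;
        }
        for (;;) {                        /* phase 2: sleep in the kernel */
            expected = 0;
            if (atomic_compare_exchange_weak(&l->state, &expected, 1))
                return;
            sem_wait(&l->waiters);
        }
    }

    void tp_unlock(TwoPhaseLock *l)
    {
        atomic_store(&l->state, 0);
        sem_post(&l->waiters);   /* a post with no waiters leaves a stale
                                  * count; fine for a sketch, not for real */
    }

note the storage: an int plus a whole kernel semaphore per lockable PMC.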

this is part of why dan said that the unthreaded version will be
designed to be faster. if you don't have all those locks and don't
compile in the storage for them, it will be faster. i would find it
amazing if someone could design a threaded system that actually ran
faster than the same thing with threads ripped out (even in a single
threaded application).

  NS> CONCLUSION (TENTATIVE, NO PROOF)

  NS> As all VHLL entities are PMCs at the VMI level, the sharing of data
  NS> between threads at the VHLL level is done entirely through those
  NS> PMCs. If no single PMC can be the subject of an opcode on two threads
  NS> concurrently, there /should/ be no opportunity for conflict.

then you have no sharing. the issue is when you do have sharing. the
ithreads architecture has no IMPLICIT sharing, so you can start out like
that. variables (and their underlying PMCs) can be declared shared at
compile or runtime (and with perl, who can tell the difference?
:). those shared things must each have a lock (preferably kernel level,
so the locking thread can block and not spin) and that requires storage
and extra code wrapping access to those things.
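
a sketch of that storage and code cost, with a made-up
PARROT_HAS_THREADS flag: the threaded build pays a lock slot in every
shareable PMC plus the wrapping code; the unthreaded build compiles
both away:

    #include <pthread.h>

    typedef struct PMC {
        void *vtable;
        void *data;
    #ifdef PARROT_HAS_THREADS
        pthread_mutex_t *lock;   /* NULL until the PMC is actually shared */
    #endif
    } PMC;

    #ifdef PARROT_HAS_THREADS
    #  define PMC_LOCK(p)   do { if ((p)->lock) pthread_mutex_lock((p)->lock);   } while (0)
    #  define PMC_UNLOCK(p) do { if ((p)->lock) pthread_mutex_unlock((p)->lock); } while (0)
    #else
    #  define PMC_LOCK(p)    /* empty: no locks, no storage, faster */
    #  define PMC_UNLOCK(p)
    #endif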

  NS> As all VMI internal state is encapsulated within the VMI register
  NS> set, and each thread has its own set of registers, the internal
  NS> state of the VMI(s) should be equally secure.

you can't lock internal state, nor do you need to, as only the
interpreter can see it. the possibly shared things are variables and data.

  NS> Other internal housekeeping operations, memory allocation, garbage
  NS> collection etc. are performed as "sysopcodes", executed by the VMI
  NS> within the auspices of the critical section, and thus secured.

there may be times when a GC run needs to be initiated DURING a VM
operation. if the op requires an immediate large chunk of ram it can
trigger a GC pass or allocation request. you can't force those things to
only happen between normal ops (which is what making them into ops
does). so GC and allocation both need to be able to lock all shared
things in their interpreter (and not just do a process-global lock) so
those things won't be modified by the other threads that share them.
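
roughly what that implies, with placeholder names (none of this is
parrot's actual api):

    #include <stddef.h>

    typedef struct Interp Interp;
    void *pool_alloc(Interp *, size_t);
    void  lock_shared_set(Interp *);   /* locks only what THIS interp shares */
    void  unlock_shared_set(Interp *);
    void  gc_run(Interp *);

    void *vm_alloc(Interp *interp, size_t size)
    {
        void *mem = pool_alloc(interp, size);
        if (mem == NULL) {               /* ran dry in the middle of an op */
            lock_shared_set(interp);     /* not a process-global lock */
            gc_run(interp);              /* safe to walk stacks and pools now */
            unlock_shared_set(interp);
            mem = pool_alloc(interp, size);
        }
        return mem;
    }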

this is what dan is all so scared about. you can't hide gc and
allocation under a VM carpet. it has to be out there and visible at all
times and it needs total access to an interpreter's stack and other
stuff.

  NS> All asynchronous operations are performed by one or more non-VMI
  NS> threads and do not adjust the state of the VMI directly. Instead,
  NS> they queue notifications of events and results to the VMI, and
  NS> these are detected by the VMI within the body of the main
  NS> loop. Once detected, an appropriate sysopcode is dispatched within
  NS> the critical section in the normal way.

there is no real main loop anymore. there are multiple main loops (one
in each interpreter). remember, each interpreter is mapped to a real
kernel thread (if possible). you can have a single thread dedicated to
handling signals and events but each of the others must check that
process-global event queue. and the classic problem is there too: which
thread gets which events, which is particularly nasty with signals.
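
a sketch of those multiple main loops, with made-up names: each
interpreter's kernel thread drains the process-global queue between
ops, and the lock on that queue is itself another contention point:

    #include <pthread.h>

    typedef struct Event  { struct Event *next; /* payload... */ } Event;
    typedef struct Interp { volatile int running; /* ... */ } Interp;
    void dispatch_event(Interp *, Event *);   /* placeholder prototypes */
    void execute_next_op(Interp *);

    pthread_mutex_t event_q_lock = PTHREAD_MUTEX_INITIALIZER;
    Event *event_q_head;     /* filled by the dedicated signal/event thread */

    static Event *next_event(void)
    {
        pthread_mutex_lock(&event_q_lock);
        Event *ev = event_q_head;
        if (ev)
            event_q_head = ev->next;
        pthread_mutex_unlock(&event_q_lock);
        return ev;
    }

    void interp_main_loop(Interp *interp)   /* one per kernel thread */
    {
        while (interp->running) {
            Event *ev;
            while ((ev = next_event()) != NULL)
                dispatch_event(interp, ev);
            execute_next_op(interp);
        }
    }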

  NS> MEMORY MANAGEMENT

  NS> The thought struck me that, with the exception of strings and
  NS> (other) compound data types, all the instances of any given class
  NS> are always the same size. Where an existing class is dynamically
  NS> extended in some way, the result is the creation of a new
  NS> (anonymous) class, as pre-existing instances of the original class
  NS> would not be modified. As such, the class would seem to be the
  NS> ideal point at which to pool the allocation and deallocation of
  NS> memory.

the problem is too many different memory pools. you will waste enormous
amounts of unused space in each different-sized pool, as they all will
allocate slabs and break them up to their own granularity. so much of
the allocated ram will just sit in pools, not being used, unless each
class creates enough objects to use it.

  NS> If each class is given its own memory pool that is a multiple of
  NS> its instance size, that pool effectively becomes a (C-style)
  NS> array of instances. As each element is exactly the right size, the
  NS> only time the pool would need to grow is when all the existing
  NS> elements are in use and another is required. Once an instance has
  NS> been freed, its slot in the array would be available and exactly
  NS> the right size for the next instantiation. There would never be a
  NS> requirement to coalesce allocations. Whether the free slots are
  NS> found through a free-space chain, or even a linear search, the GC
  NS> would only need to process this pool when a new instance of this
  NS> class is being allocated. It would then only need to scan a single
  NS> pool of (same-sized) reference objects to GC the existing
  NS> instances.

see above. this is a fine scheme for a few pools of special size but
not for supporting hundreds of pools of differing sizes. and then you
get the real nasty problem of a class adding attributes at run time,
which will necessitate changing the storage needed for all its instances
(this is the notification stuff dan has mentioned). this would require a
massive copy of all instances and an allocation of a completely new ram
pool of the new size and all the overhead of setting it up, etc.
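
for reference, the per-class free-list pool NS describes is about this
much code (names are illustrative):

    #include <stdlib.h>

    typedef struct FreeSlot { struct FreeSlot *next; } FreeSlot;

    typedef struct ClassPool {
        size_t    slot_size;   /* == instance size, >= sizeof(FreeSlot) */
        FreeSlot *free_list;   /* chain of freed, exactly-fitting slots */
    } ClassPool;

    void *pool_take(ClassPool *pool)
    {
        if (pool->free_list) {            /* reuse a freed slot: exact fit */
            FreeSlot *slot = pool->free_list;
            pool->free_list = slot->next;
            return slot;
        }
        return malloc(pool->slot_size);   /* sketch; a real pool carves slabs */
    }

    void pool_give(ClassPool *pool, void *mem)
    {
        FreeSlot *slot = mem;             /* the dead instance becomes a node */
        slot->next = pool->free_list;
        pool->free_list = slot;
    }

the scheme itself is simple; the problem above is having hundreds of
these, each hoarding its own slabs.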

  NS> If the larger segments of memory allocated for each pool were
  NS> allocated as virtual memory segments, rather than as arbitrary
  NS> chunks of heap memory, the size of each individual pool can be
  NS> grown in place, at runtime, without the need to copy the existing
  NS> space to a new, larger allocation, and without the need to
  NS> shuffle any other memory pools around to accommodate the increase
  NS> in size. The OS's Virtual Memory Management takes care of the
  NS> whole process. It also becomes possible to reserve a sizeable
  NS> pre-allocation for important pools and those known to be likely to
  NS> grow quickly, without actually consuming the entire reservation
  NS> from the physical memory pool before it is needed. The OS
  NS> will then take care of notification of an individual pool's need
  NS> to grow through page faults, and the growth becomes a process of
  NS> simply committing another (few) pages from the pre-allocation.

how do you know which pools will grow quickly? larry hasn't published
the specs for PSI::ESP yet. :) also, applications never see page faults
(the OS handles them transparently), so we can't use them as a growth
notification.
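
that said, the reserve-then-commit half of the idea can be driven
explicitly on win32, with no page-fault notification needed. a sketch
(the reservation size is an arbitrary figure):

    #include <windows.h>

    #define POOL_RESERVE (64 * 1024 * 1024)

    static void  *pool_base;
    static size_t pool_committed;

    void pool_init(void)
    {
        /* reserves address space only; no physical pages consumed yet */
        pool_base = VirtualAlloc(NULL, POOL_RESERVE,
                                 MEM_RESERVE, PAGE_NOACCESS);
    }

    void pool_grow(size_t bytes)
    {
        /* commit in place: the base address never moves, so nothing
         * has to be copied or shuffled */
        VirtualAlloc((char *)pool_base + pool_committed, bytes,
                     MEM_COMMIT, PAGE_READWRITE);
        pool_committed += bytes;
    }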

  NS> COOPERATIVE MULTIPROCESSING AND CO_ROUTINES.

  NS> I have no understanding (yet) of how the co-routines that Parrot is
  NS> promising are implemented on *nix systems. But unless the concept of
  NS> co-routines is encapsulated at the higher level, with the implementation
  NS> being pushed down to platform-specific code, or unless *nix also supports
  NS> a concept very similar to Win32 fibres, it will be impossible to utilise
  NS> this "tailor made" OS facility. Thus, an implementation of co-routines
  NS> designed to operate well on *nix platforms is likely to need to be
  NS> emulated on Win32, with high-level requirements that won't fit well with
  NS> the underlying OS. This could result in a similarly sub-optimal
  NS> emulation of a *nix concept on Win32 as currently exists for the fork
  NS> emulation. Other non-*nix platforms would likely suffer from this
  NS> force-fit as well.

you seem to be crossing the kernel and VM boundaries here. parrot will
have coroutines, but at the VM level. they won't map easily onto any form
of kernel coros since parrot needs to keep a reference to a parrot-level
stack and all sorts of VM-level info in the coro object/thing. this
problem is similar to the problem of mapping VM threads onto kernel
threads. there are many ways to do it and none are perfect. a VM just
doesn't have the internal granularity of a kernel and the ability to
truly block threads when waiting. VM coros also can't utilize the kernel
for switching, since a kernel switch only needs to swap a stack pointer
and a few registers, vs the larger and more complex state swap VM coros
need.
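
to see the size difference, here is a guess at what a VM coro has to
swap (the types and layout are invented for illustration; parrot does
use 32 registers per type, but nothing else here is from its source):

    typedef struct Opcode Opcode;
    typedef double        Reg;      /* stand-in register type */
    typedef struct PMC    PMC;
    typedef struct Stack  Stack;
    typedef struct Interp Interp;

    typedef struct VMCoro {
        Opcode *pc;              /* position in bytecode */
        Reg     num_regs[32];    /* a full register set per type... */
        PMC    *pmc_regs[32];
        Stack  *user_stack;      /* parrot-level stack, not the C stack */
        Stack  *control_stack;
        /* ... lexical pads, exception handlers, etc. ... */
    } VMCoro;

    void save_state(Interp *, VMCoro *);      /* placeholders */
    void restore_state(Interp *, VMCoro *);

    void coro_switch(Interp *interp, VMCoro *from, VMCoro *to)
    {
        save_state(interp, from);     /* many words of state out... */
        restore_state(interp, to);    /* ...and many words back in */
    }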

you have addressed many issues here and i think that is what dan
wanted. i hope i clarified some of them and explained why there are no
simple answers. we have to bite some performance bullets somewhere due
to the very lofty functional goals we have set. the key is to keep the
design and api clean and as elegant as possible while keeping up the
performance. and almost all of this is moot in a pure event system,
which is why i like events better than threads. :)

uri

-- 
Uri Guttman  ------  [EMAIL PROTECTED]  -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs  ----------------------------  http://jobs.perl.org
