>>>>> "NS" == Nigel Sandever <[EMAIL PROTECTED]> writes:
NS> REENTRANCY
NS>
NS> Not only must the VMI be coded in a reentrant fashion, with all
NS> state addressed through pointers (references) loaded into its
NS> virtual register set; all the code underlying it, including
NS> syscalls and the CRT, must equally be reentrant. Many APIs within
NS> many CRTs *are not reentrant* (e.g. strtok()). All state must be
NS> on a per-thread, not a per-process, basis.
NS>
NS> To this end, I would propose that no CRT APIs are used directly
NS> from the main code! Instead, a full set of CRT-like macros would
NS> be inherited from a header file, where either the real CRT API
NS> would be called, or an alternative implementation. This header
NS> file would be conditionally included on the basis of the target
NS> platform. This concentrates the bulk, if not all, of the
NS> platform-specific code into a single file (or set of files).

this is true for c level threads but not necessarily true for VM level
threads. if the VM is atomic at its operation level and can't be
preempted (i.e. it is not using kernel threads with time slicing), then
it could use thread unsafe calls (as long as it keeps those silly
static buffers clean).

parrot will (according to dan) use one interpreter per VM thread, and
those may run on kernel threads. it may be possible to disable
preemption and/or time slicing so the VM threads will be atomic at the
VM operation level, and then we don't have to worry as much about
thread unsafe libs. but i gather that people want real preemption,
priorities, and time slicing, so that idea may be moot. on most
platforms that support kernel threads there are thread safe versions
of most/all the c lib stuff. now, external libs that get linked in
under NCI are a different story.
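to make the header idea concrete, here is a minimal sketch of what
such a CRT-wrapping header could look like. everything named PARROT_*
or parrot_* below is invented for illustration; strtok() and the
posix strtok_r() are the only real CRT calls in it.

    /* crt_wrap.h -- hypothetical sketch of the proposed CRT-wrapping
     * header. one branch per platform; the rest of the core would
     * call only the PARROT_* macros, never the CRT directly. */
    #ifndef CRT_WRAP_H
    #define CRT_WRAP_H

    #include <string.h>

    #if defined(__unix__) || defined(__APPLE__)
    /* posix has a reentrant strtok; the caller carries the state
     * pointer explicitly instead of relying on a static buffer */
    #  define PARROT_STRTOK(s, sep, state)  strtok_r((s), (sep), (state))
    #else
    /* platforms without strtok_r get a private reimplementation,
     * kept in the platform-specific source file */
    char *parrot_strtok_r(char *s, const char *sep, char **state);
    #  define PARROT_STRTOK(s, sep, state)  parrot_strtok_r((s), (sep), (state))
    #endif

    #endif /* CRT_WRAP_H */

note that this only localizes the unsafe CRT calls; it doesn't help
with the VM-level atomicity question above.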
NS> ATOMICITY AND CRITICAL SECTIONS
NS>
NS> Atomicity of the VMI opcodes can be achieved by having the core
NS> loop of the interpreter enter and exit a critical section each
NS> time it processes an opcode. For those not familiar with critical
NS> sections, they are a (Win32) OS construct that prevents any
NS> cooperating thread within a process from getting a timeslice
NS> until the thread that entered the critical section has exited it.

that is what i mentioned above, disabling timeslicing/preemption when
desired. it is not just a win32 concept. hell, turning off interrupts
during interrupt handlers goes way back! redmond just likes to rename
stuff and act like they invented it. :)

NS> Unlike semaphores and events, critsecs only operate within a
NS> single process. They are relatively lightweight as there is no
NS> need to enter kernel mode for their operation. They are, in
NS> effect, a CPU-specific "test and set" operation applied to a
NS> piece of user-mode memory. Their lightweight nature means that
NS> they are faster than inter-process semaphore mechanisms. When
NS> used in a process that currently has only a single thread in
NS> operation, they have an almost negligible performance effect upon
NS> that thread. They also operate correctly on SMP machines.

in effect it sounds like a thread shared mutex. it could be
implemented in kernel or process space.

NS> If two threads attempt concurrent operations on the same PMC, the
NS> first thread accessing the PMC sets the flag. When a second
NS> thread attempts to access it, it finds the flag set and blocks
NS> (relinquishing its timeslice) until the first thread has
NS> completed and cleared the flag. This doesn't solve the potential
NS> for deadlock problems arising from the user-level code, though it
NS> may be possible for the VMI to detect and diagnose these at
NS> runtime, in a similar way to how deep-recursion detection is done
NS> in P5.

that flag setting needs to be atomic, or a mutex or similar. plain
flag setting won't work. also, the blocking has to be kernel level (so
a kernel mutex/semaphore is needed) so the kernel slicing will work.
deadlock detection is a known problem in the DB world and we can do
similar things. but are those locks VM level or kernel level? if the
coder wants to prevent deadlocks they have to do a higher level lock
(analogous to a full DB lock or transaction). we can't stop stupid
coders from themselves.

NS> The cost to the single-threaded user application is for a
NS> test&set operation (a few cycles), per object-method, plus the
NS> flag, which need only be one bit, but may be more efficiently
NS> coded as a 32-bit entity.

more than test/set is needed as it has to be atomic. some cpus have
that instruction, but we can't get at it directly from c, and we have
to emulate it on platforms which don't have it. assuming it is short
and fast is not a good idea. now, the atomic semaphore problem was
solved long ago by dijkstra and it is a two phase test/set gizmo. it
needs several machine instructions or a few lines of c. this is not a
trivial amount of overhead for each access to a shared object. also,
this type of lock is in user space and won't block the kernel thread
running this VM thread. so we really need kernel semaphores, and that
means more storage on each PMC that could be locked.

this is part of why dan said that the unthreaded version will be
designed to be faster. if you don't have all those locks and don't
compile in the storage for them, it will be faster. i would find it
amazing if someone could design a threaded system that actually ran
faster than the same thing with threads ripped out (even in a single
threaded application).
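a minimal sketch of the kind of per-PMC kernel lock described above,
using a pthread mutex (which blocks the thread instead of spinning).
the Pmc struct and the pmc_* names are invented here just to show the
shape and the cost of it:

    #include <pthread.h>

    typedef struct Pmc {
        pthread_mutex_t lock;   /* per-PMC storage cost of being lockable */
        void           *data;
    } Pmc;

    static void pmc_init(Pmc *p, void *data)
    {
        pthread_mutex_init(&p->lock, NULL);
        p->data = data;
    }

    /* every vtable call on a shared PMC gets wrapped like this, so
     * each VM-level op on the PMC is atomic with respect to the
     * other VM threads sharing it */
    static void pmc_vtable_call(Pmc *p, void (*method)(Pmc *))
    {
        pthread_mutex_lock(&p->lock);   /* blocks, yields the kernel thread */
        method(p);
        pthread_mutex_unlock(&p->lock);
    }

an unshared PMC could skip the mutex (and the storage for it)
entirely, which is exactly where the unthreaded-build-is-faster
argument comes from.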
NS> CONCLUSION (TENTATIVE, NO PROOF)
NS>
NS> As all VHLL entities are PMCs at the VMI level, the sharing of
NS> data between threads at the VHLL level is done entirely through
NS> those PMCs. If no single PMC can be the subject of an opcode on
NS> two threads concurrently, there /should/ be no opportunity for
NS> conflict.

then you have no sharing. the issue is when you do have sharing. the
ithreads architecture has no IMPLICIT sharing, so you can start out
like that. variables (and their underlying PMCs) can be declared
shared at compile or runtime (and with perl, who can tell the
difference? :). those shared things must each have a lock (preferably
kernel level so the locking thread can block and not spin), and that
requires storage and extra code wrapping access to those things.

NS> As all VMI internal state is encapsulated within the VMI register
NS> set, and each thread has its own set of registers, the internal
NS> state of the VMI(s) should be equally secure.

you can't lock internal state, nor do you need to, as only the
interpreter can see it. the possibly shared things are variables and
data.

NS> Other internal housekeeping operations, memory allocation,
NS> garbage collection etc. are performed as "sysopcodes", performed
NS> by the VMI within the auspices of the critical section, and thus
NS> secured.

there may be times when a GC run needs to be initiated DURING a VM
operation. if the op requires an immediate large chunk of ram, it can
trigger a GC pass or allocation request. you can't force those things
to only happen between normal ops (which is what making them into ops
does). so GC and allocation both need to be able to lock all shared
things in their interpreter (and not just do a process global lock)
so those things won't be modified by the other threads that share
them. this is what dan is so scared about. you can't hide gc and
allocation under a VM carpet. it has to be out there and visible at
all times, and it needs total access to an interpreter's stack and
other stuff.

NS> All asynchronous operations are performed by one or more non-VMI
NS> threads and do not adjust the state of the VMI directly. Instead,
NS> they queue notifications of events, and results, to the VMI, and
NS> these are detected by the VMI within the body of the main loop.
NS> Once detected, an appropriate sysopcode is dispatched within the
NS> critical section in the normal way.

there is no real main loop anymore. there are multiple main loops
(one in each interpreter). remember, each interpreter is mapped to a
real kernel thread (if possible). you can have a single thread
dedicated to handling signals and events, but each of the others must
check that process global event queue. and the classic problem is
there too: which thread gets what events? that is particularly nasty
with signals.

NS> MEMORY MANAGEMENT
NS>
NS> The thought struck me that with the exception of strings and
NS> (other) compound data types, all the instances of any given class
NS> are always the same size. Where an existing class is dynamically
NS> extended in some way, the result is the creation of a new
NS> (anonymous) class, as pre-existing instances of the original
NS> class would not be modified. As such, the class would seem to be
NS> the ideal point at which to pool the allocation and deallocation
NS> of memory.

the problem is too many different memory pools. you will waste
enormous amounts of unused space in each different sized pool, as
they all will allocate slabs and break them up to their own
granularity. so much of the allocated ram will just sit in pools, not
being used, unless each class creates enough objects to use it.

NS> If each class is given its own memory pool that is a multiple of
NS> its instance size, that pool effectively becomes a (C-style)
NS> array of instances. As each element is exactly the right size,
NS> the only time the pool would need to grow is when all the
NS> existing elements are in-use and another is required. Once an
NS> instance has been freed, its slot in the array would be available
NS> and exactly the right size for the next instantiation. There
NS> would never be a requirement to coalesce allocations. Whether the
NS> free slots are found through a free space chain, or even a linear
NS> search, the GC would only need to process this pool when a new
NS> instance of this class is being allocated. It would then only
NS> need to scan a single pool of (same-sized) reference objects to
NS> GC the existing instances.

see above. this is a fine scheme for a few queues of special size,
but not for supporting hundreds of queues of differing sizes. and
then you get the real nasty problem of a class adding attributes at
run time, which will necessitate changing the storage needed for all
its instances (this is the notification stuff dan has mentioned).
this would require a massive copy of all instances and an allocation
of a completely new ram pool of the new size, with all the overhead
of setting it up, etc.
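for reference, the fixed-size scheme itself is tiny; here is a toy
sketch (all names invented, no locking, minimal error handling) of
one such per-class pool. the waste described above comes from needing
one of these, each with its own partly empty slabs, for every class
in the program:

    #include <assert.h>
    #include <stdlib.h>

    /* a freed slot is reused to hold the free-list link, so slots
     * must be at least pointer sized */
    typedef struct Slot { struct Slot *next; } Slot;

    typedef struct ClassPool {
        size_t  slot_size;   /* every instance of the class is this big */
        Slot   *free_list;   /* freed slots, each exactly the right size */
    } ClassPool;

    static int pool_grow(ClassPool *p, size_t nslots)
    {
        /* one slab, carved into nslots equal slots; never coalesced */
        char *slab = malloc(p->slot_size * nslots);
        if (!slab)
            return -1;
        for (size_t i = 0; i < nslots; i++) {
            Slot *s = (Slot *)(slab + i * p->slot_size);
            s->next = p->free_list;
            p->free_list = s;
        }
        return 0;
    }

    static void *pool_alloc(ClassPool *p)
    {
        assert(p->slot_size >= sizeof(Slot));
        if (!p->free_list && pool_grow(p, 64) != 0)
            return NULL;            /* grows only when exhausted */
        Slot *s = p->free_list;
        p->free_list = s->next;
        return s;
    }

    static void pool_free(ClassPool *p, void *mem)
    {
        Slot *s = mem;              /* slot goes straight back on the list */
        s->next = p->free_list;
        p->free_list = s;
    }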
NS> If the larger segments of memory allocated for each pool were
NS> allocated as virtual memory segments, rather than as arbitrary
NS> chunks of heap memory, the size of each individual pool could be
NS> grown in-place, at runtime, without the need to copy the existing
NS> space to the new larger allocation, and without the need to
NS> shuffle any other memory pools around to accommodate the increase
NS> in size. The OS's Virtual Memory Management takes care of the
NS> whole process. It also becomes possible to reserve a sizeable
NS> pre-allocation for important pools and those known to be likely
NS> to grow quickly, without actually consuming the entire
NS> reservation from the physical memory pool before it is actually
NS> needed. The OS will then take care of notification of an
NS> individual pool's need to grow through page faults, and the
NS> growth becomes a process of simply committing another (few) pages
NS> from the pre-allocation.

how do you know who will grow quickly? larry hasn't published the
specs for PSI::ESP yet. :) also, applications never see page faults,
so we can't use that.

NS> COOPERATIVE MULTIPROCESSING AND CO-ROUTINES
NS>
NS> I have no understanding (yet) of how the co-routines that Parrot
NS> is promising are implemented on *nix systems. But unless the
NS> concept of co-routines is encapsulated at the higher level, with
NS> the implementation being pushed down to platform specific code,
NS> and unless *nix also supports a concept very similar to Win32
NS> fibres, it will be impossible to utilise this "tailor made" OS
NS> facility. Thus, the implementation of co-routines designed to
NS> operate well on *nix platforms is likely to need to be emulated
NS> on Win32, with high level requirements that won't fit well with
NS> the underlying OS. This could result in a similarly sub-optimal
NS> emulation of a *nix concept on Win32 as currently exists for the
NS> fork emulation. Other non-*nix platforms would likely suffer from
NS> this force fit as well.

you seem to be crossing the kernel and VM boundaries here. parrot
will have coroutines, but at the VM level. they won't map easily onto
any form of kernel coros, since parrot needs to keep a reference to a
parrot level stack and all sorts of VM level info in the coro
object/thing. this problem is similar to the problem of mapping VM
threads onto kernel threads. there are many ways to do it and none
are perfect. a VM just doesn't have the internal granularity of a
kernel and the ability to truly block threads when waiting. VM coros
also can't utilize the kernel for switching, since swapping a kernel
stack only needs a pointer and a few registers, vs the larger and
more complex swap VM coros require.

you have addressed many issues here, and i think that is what dan
wanted. i hope i clarified some of them and explained why there are
no simple answers. we have to bite some performance bullets somewhere
due to the very lofty functional goals we have set. the key is to
keep the design and api clean and as elegant as possible while
keeping up the performance.

and almost all of this is moot in a pure event system, which is why i
like them better than threads. :)

uri

-- 
Uri Guttman  ------  [EMAIL PROTECTED]  --------  http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs  ----------------------------  http://jobs.perl.org