On Thu, 01 Jan 2004 21:32:22 -0500, [EMAIL PROTECTED] (Uri Guttman) wrote:

UG> Uri Guttman
NS> Nigel Sandever. (Mostly not reproduced here!)
NS> REENTRANCY

UG> this is true for c level threads but not necessarily true for VM level
UG> threads. if the VM is atomic at its operation level and can't be
UG> preempted (i.e. it is not using kernel threads with time slicing),

One of the major points of what I am (was?) proposing is that there would be
a 1:1 mapping between VM threads and kernel threads.

The second point is that atomic operation of the VM threads does not imply
that they cannot be preempted. It only means that when one VM thread is
preempted by the scheduler while inside a critical section, no other thread
that is waiting to enter that same critical section will be scheduled until
the first thread exits it.

UG> then
UG> it could use thread unsafe calls (as long as it keeps those silly static
UG> buffers clean).

The point about avoiding calling the CRT directly from the main bulk of the
code does not preclude the use of (suitably thread-aware) CRT calls
completely. The idea is to acknowledge that

A) Not all CRTs are equally well crafted or complete.
B) Thread-aware CRTs are not available with all compilers for all platforms.

By writing the main bulk of the code in terms of a virtual CRT using macros,
the main bulk of the code becomes platform independent. The only change
required when porting the code to a different platform is to write a set of
suitable expansions of the macros for that platform/compiler. Where the
available CRT is up to the job, the macros expand directly to the underlying
CRT call. Where it is not, a suitable alternative can be written that
bypasses the CRT and goes directly to the OS.

As an example: compiling P5 with Borland 5.5 for NT, I encountered the
restriction that the Borland runtime didn't support fseek() or ftell() on
files > 4GB. The underlying platform does, and it was relatively trivial to
replace the calls to the CRT functions with appropriate NT syscalls and so
enable both USE_LARGE_FILES and PERLIO. The only difficulty was that instead
of being able to make the changes in a single place, it was necessary to
make essentially the same change in 3 or 4 different files. To compound
matters, the modifications required conditional compilation directives at
all 4 places, and these had to be interspersed with other #ifdefs for other
platforms and other purposes.

The idea was to avoid this morass of conditional compilation and
copy-and-paste coding by reducing the conditional directives in the main
code to a single

  #if defined( THIS_PLATFORM )
  #  include "this_platform_crt.h"
  #endif

Within that header would be a simple set of #ifdefs to include compiler
specific header files, and within those, all the CRT macro expansions. Thus
I sought to ease both porting and maintenance.

It would also become possible to add virtual syscall definitions to the high
level code, and expand them differently, perhaps radically so, on a platform
by platform basis. In essence, creating a virtual OS for the VM.
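To make that concrete, here is a minimal sketch of what one such macro might
look like, built around the large-file seek example above. The macro name
PERL_FSEEK64, the header name this_platform_crt.h and the helper
win32_fseek64() are purely illustrative (they are not existing Perl or
Parrot names); SetFilePointer, _fileno and _get_osfhandle are the real
calls.

    /* this_platform_crt.h (hypothetical) */
    #if defined(WIN32) && defined(__BORLANDC__)
    /* The Borland CRT cannot seek past 4GB, so bypass it and call the OS. */
    #  include <windows.h>
    #  include <stdio.h>
    #  include <io.h>
    static int win32_fseek64(FILE *f, __int64 offset, int whence)
    {
        /* SEEK_SET/SEEK_CUR/SEEK_END happen to match FILE_BEGIN/CURRENT/END */
        LARGE_INTEGER li;
        li.QuadPart = offset;
        li.LowPart  = SetFilePointer((HANDLE)_get_osfhandle(_fileno(f)),
                                     li.LowPart, &li.HighPart, (DWORD)whence);
        return (li.LowPart == INVALID_SET_FILE_POINTER
                && GetLastError() != NO_ERROR) ? -1 : 0;
    }
    #  define PERL_FSEEK64(f, off, whence)  win32_fseek64((f), (off), (whence))
    #else
    /* A CRT that is up to the job: expand straight to the CRT call. */
    #  include <stdio.h>
    #  define PERL_FSEEK64(f, off, whence)  fseeko((f), (off), (whence))
    #endif

On a platform whose CRT already copes, the macro costs nothing beyond the
normal call; everywhere else, the change is confined to this one header.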
UG> parrot will (according to dan) use one interpreter per
UG> VM thread and those may run on kernel threads.

From what I have read, the decision as to whether each VM thread also gets a
separate interpreter was one of the things Dan was opening up for
discussion?

Personally, with the perspective of (the forkless) NT platform, and my
experience with trying to make good use of both p5's 5005threads and
ithreads implementations, I am completely against the idea of VM threads
having a 1:1 relationship with interpreters.

The runtime costs of duplicating all shared data between interpreters, tying
all shared variables, added to the cost of locking, throw away almost all of
the benefit of using threads. Combine that overhead with the restrictions of

a) no shared references, therefore:
b) no shared objects, therefore:
c) no shared methods;
d) no shared compound data structures;

and the only hope for multiprocessing becomes forking, which Win32 doesn't
support natively, and which, without the benefit of process-level COW
performed by the kernel, results in a kludged emulation of a non-native
facility that will never be usable in any real sense of the word.

UG> it may be possible to
UG> disable preemption and/ or time slicing so the VM threads will be atomic
UG> at the VM operation level and then we don't have to worry as much about
UG> thread unsafe libs.

I was trying to demonstrate that there is no need to disable the scheduler
in order for VM threads to achieve atomic VM-level operations.

UG> but i gather that people want real preemption and
UG> priorities and time slicing so that idea may be moot.

Yes. I am one of them :)

UG> but on most
UG> platforms that support kernel threads there are thread safe versions of
UG> most/ all the c lib stuff.

As I said above, this is true, but it is also compiler dependent. And even
where thread-safe CRTs are available, they aren't always reliable.
Insulating the bulk of the sources from the vagaries of different CRTs was
my aim.

UG> now, external libs that get linked in under
UG> nci is a different story.

I can't address this as I have no idea what nci is :), and nothing
applicable to this context came up in a google for the term.

NS> ATOMICITY AND CRITICAL SECTIONS

UG> that is what i mentioned above, disabling time slicing/ preemption when
UG> desired. it is not just a win32 concept. hell, turning off interrupts
UG> during interrupt handlers goes way back! redmond just likes to rename
UG> stuff and act like they invented it. :)

I cannot emphasise enough that critical sections are different from both
Events and Semaphores. The distinction is that a critical section does not
disable time slicing/preemption. It prevents any thread attempting to enter
an owned critical section from receiving a timeslice until the current owner
relinquishes it. This means that other threads in the same process, and
other processes, continue to be timesliced preemptively. Only the thread(s)
waiting to enter the specified critical section are blocked. There is no
'spinning' involved.

UG> in effect it sounds like a thread shared mutex. it could be implemented
UG> in kernel or process space.

They are indeed a form of mutex, but the process-space state, and avoiding
the need to switch to kernel mode, is crucial: it is this that makes them
lighter weight than a conventional IPC Mutex. A CRITICAL_SECTION is a small
structure allocated in user space. It must be allocated by the user-space
code, and then initialised through an API call. Once initialised, it cannot
be moved, copied or modified (directly) by the user code. If you have the
time and inclination, there is a brief (25 line) description of them at
[http://msdn.microsoft.com/library/en-us/dllproc/base/critical_section_objects.asp]
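For those without a Win32 background, this is roughly how one would be used
to serialise VM-level operations. InitializeCriticalSection,
EnterCriticalSection, LeaveCriticalSection and DeleteCriticalSection are the
real API calls described at that link; the surrounding names (op_lock,
vm_execute_op) are just illustrative.

    #include <windows.h>

    /* One critical section guarding VM-level operations (hypothetical
       placement; in practice it would live in the interpreter).        */
    static CRITICAL_SECTION op_lock;

    void vm_startup(void)
    {
        /* Must be initialised once before any thread tries to enter it. */
        InitializeCriticalSection(&op_lock);
    }

    void vm_execute_op(void (*op)(void *), void *pmc)
    {
        EnterCriticalSection(&op_lock);   /* waiters block; no busy spinning */
        op(pmc);                          /* the atomic VM-level operation   */
        LeaveCriticalSection(&op_lock);
    }

    void vm_shutdown(void)
    {
        DeleteCriticalSection(&op_lock);
    }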
UG> that flag setting needs to be atomic or a mutex or similar. plain flag
UG> setting won't work. also the blocking has to be kernel level (so a
UG> kernel mutex/ semaphore is needed) so the kernel slicing will work.

Indeed. A simple flag *is* all that is necessary to protect shared (fat)
data structures, *if* the VMs are the only code that can achieve concurrent
access to the contents of the PMC, and the VMs are already interlocked at
the operation level.

A PMC is an object, and the only code that can affect the state of that
object is its methods. The methods can only be invoked by the VM. Each
method constitutes an (atomic) operation at the VM level. All that is
required to protect an object from corruption through concurrent access and
state change is to prevent two (or more) VMs from trying to access the same
object simultaneously.

In order for the VM to address the PMC and load it into a VM register, it
must know where it is; i.e. it must have a pointer. It can use this pointer
to access the PMC with a Bit Test and Set operation (x86 BTS)
[http://www.itis.mn.it/linux/quarta/x86/bts.htm]. This is a CPU-atomic
operation, and it occurs before the VM enters the VM-operations critical
section.

Once the bit has been set in the PMC header by one VM, if the scheduler
intervened exactly before the next instruction in that thread and scheduled
a second VM thread to run, and that VM also attempted to access the same
PMC, the first operation it would try to perform is the same BTS
instruction. The next instruction then tests CF (the carry flag) and knows
that the PMC is already in use. Hence, it relinquishes (yields) the rest of
its timeslice. It will continue to do so until the result of the BTS is that
CF is clear. This will only happen when the original VM that caused the bit
to be set has entered the critsec, performed and completed the operation,
and exited the critsec, at which point it does a Bit Test and Reset on the
bit (BTR) [http://www.itis.mn.it/linux/quarta/x86/btr.htm].

Every time the second (or subsequent) VM attempts to use a PMC that is
already in use, it finds the bit set and immediately yields. This can only
happen during the interval between the first VM issuing the BTS and that
same VM entering the critsec. Once the critsec has been entered, no other VM
that could load that PMC would even be scheduled.
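A rough C rendering of that sequence, assuming a hypothetical in_use flag in
the PMC header. InterlockedExchange and SwitchToThread are real Win32 calls
standing in for the raw BTS/BTR instructions and the yield; a real
implementation would more likely use the CPU instructions or compiler
intrinsics directly.

    #include <windows.h>

    typedef struct PMC {
        volatile LONG in_use;   /* hypothetical "busy" bit in the PMC header */
        /* ... the rest of the PMC ... */
    } PMC;

    /* Claim the PMC before entering the VM-operation critical section. */
    static void pmc_acquire(PMC *pmc)
    {
        /* InterlockedExchange is an atomic test-and-set: it stores 1 and
           returns the previous value, much as BTS leaves the old bit in CF. */
        while (InterlockedExchange(&pmc->in_use, 1) != 0)
            SwitchToThread();   /* already in use: yield the rest of the slice */
    }

    /* Release it once the operation completes (the BTR step). */
    static void pmc_release(PMC *pmc)
    {
        InterlockedExchange(&pmc->in_use, 0);
    }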
UG> we can't stop stupid coders from themselves.

Agreed :)

UG> more than test/set is needed as it has to be atomic. so CPU's have that
UG> instruction but we can't get at it directly from c and we have to
UG> emulate it on platforms which don't have it. assuming it is short and
UG> fast is not a good idea. now the atomic semaphore problem was solved
UG> long ago by dijkstra and it is a two phase test/set gizmo. it needs
UG> several machine instructions or a few lines of c. this is not a trivial
UG> amount of overhead for each access to a shared object. also this type
UG> of lock is in user space and won't block the kernel thread running this
UG> VM thread. so we really need kernel semaphores and that means more
UG> storage on each PMC that could be locked.

I think this is all covered above.

UG> this is part of why dan said that the unthreaded version will be
UG> designed to be faster. if you don't have all those locks and don't
UG> compile in the storage for them, it will be faster. i would find it
UG> amazing if someone could design a threaded system that actually ran
UG> faster than the same thing with threads ripped out (even in a single
UG> threaded application).

If my description above is to be believed, then the BTS opcode plus a
conditional jump (JNC) [http://www.itis.mn.it/linux/quarta/x86/jcc.htm] to
skip over the yield are the only penalty of leaving the VM interlocking code
in place for non-threaded versions. As the flag would never test set on
entry, the jump would always be taken and the yield code would never come
into play.

The costs associated with entering and exiting the critsec are minimal, but
even this code could be bypassed until a flag is set in the interpreter to
indicate that a second VM thread has been created.

NS> CONCLUSION (TENTATIVE, NO PROOF)
NS> As all VHLL entities are PMCs at the VMI level, the sharing of data
NS> between threads at the VHLL level is done entirely through those
NS> PMCs. If no single PMC can be the subject of an opcode on two threads
NS> concurrently, there /should/ be no opportunity for conflict.

UG> then you have no sharing. the issue is when you do have sharing. the
UG> ithreads architecture is no IMPLICIT sharing so you can start out like
UG> that. variables (and their underlying PMCs) can be declared shared at
UG> compile or runtime (and with perl, who can tell the difference?
UG> :). those shared things must each have a lock (preferably kernel level
UG> so the locking thread can block and not spin) and that requires storage
UG> and extra code wrapping access to those things.

Again, I think everything here was dealt with above.

NS> As all VMI internal state are encapsulated within the VMI register set,
NS> and each thread has it's own set of registers, the internal state of the
NS> VMI(s) should be equally secure.

UG> you can't lock internal state nor do you need to as only the interpreter
UG> can see them. the possibly shared things are variables and data.

By "internal state of the VMI" (interpreter), I was referring to such things
as the VHLL program that the interpreter is running. If a subroutine comes
into existence at runtime (through eval, for example) then the user program
changes, and with it, the state of the interpreter. If the garbage collector
runs, again the internal state of the interpreter is modified. If these
operations, like all user-program directed operations, are run within the VM
as atomic operations, then the state of the interpreter is protected.

UG> there may be times when a GC run needs to be initiated DURING a VM
UG> operation.

This is a matter of design. The suggestion that memory allocation be managed
at the class level was central to the proposal. If memory needs to be
allocated as part of a VM operation, then it is associated with a single
PMC: the subject of the operation. If the pool of memory used by that PMC is
entirely local to the PMC, and allocated by an operation within the PMC,
then a failure to obtain memory requires that only *that* pool of memory be
garbage collected. As such, the GC cycle can be run on that pool as a
(nested) VM-level operation. The limited scope of the pool to be GC'd makes
this a much cheaper increment than a global GC would be in the same
circumstances. If the GC doesn't succeed in locating and freeing space,
another (few) pages of the reserved allocation are then committed. No other
pool is affected. No shuffling or copying is required. See below.
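The allocation path would then look something like the following sketch.
Every name in it (ClassPool, pool_take_free_slot, pool_local_gc,
pool_commit_more_pages) is hypothetical, and the helpers are only declared,
not defined; the point is simply the order of the fallbacks.

    #include <stddef.h>

    typedef struct ClassPool ClassPool;

    void *pool_take_free_slot(ClassPool *pool, size_t size); /* from free list */
    void  pool_local_gc(ClassPool *pool);       /* GC *this* pool only */
    void  pool_commit_more_pages(ClassPool *pool, size_t at_least);

    void *class_alloc(ClassPool *pool, size_t size)
    {
        void *p = pool_take_free_slot(pool, size);
        if (p != NULL)
            return p;

        /* Allocation failed: run GC over this pool only, as a nested
           VM-level operation. No other pool is touched or locked.      */
        pool_local_gc(pool);
        p = pool_take_free_slot(pool, size);
        if (p != NULL)
            return p;

        /* Still no room: commit a few more pages of the class's
           reserved address range and retry (see MEMORY MANAGEMENT).    */
        pool_commit_more_pages(pool, size);
        return pool_take_free_slot(pool, size);
    }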
UG> if the op requires an immediate large chunk of ram it can
UG> trigger a GC pass or allocation request. you can't force those things to
UG> only happen between normal ops (which is what making them into ops
UG> does). so GC and allocation both need to be able to lock all shared
UG> things in their interpreter (and not just do a process global lock) so
UG> those things won't be modified by the other threads that share them.
UG> this is what dan is all so scared about. you can't hide gc and
UG> allocation under a VM carpet. it has to be out there and visible at all
UG> times and it needs total access to an interpreter's stack and other
UG> stuff.

By moving away from a single central pool, or even many central pools of
similarly sized objects, to a much finer-grained set of pools, each managed
directly by the consumer class of that pool, allocation and GC become
operations of the PMC (class) itself.

Other pool(s) of memory, used to satisfy the interpreter's needs for its own
use as opposed to those done on behalf of the user program, would also be
required. But these will tend to have a relatively longer life and will be
allocated outside of the VM operations. The separation of interpreter memory
requirements from those of the user program also translates into less work
to do at both allocation and GC time for either.

UG> there is no real main loop anymore. there are multiple main loops (one
UG> in each interpreter. remember, each interpreter is mapped to a real
UG> kernel thread (if possible).

Again, I think I have covered this above. I know there will be multiple
"main loops". In the scheme outlined, there would be one "main loop" per
VM/kernel thread, and these would share a single interpreter. Indeed,
multiple interpreters within the same process would be possible, and each
interpreter would be able to have multiple VM threads.

Two or more interpreters (and the VMs that run them) would never, in fact
could never, interfere with each other. Their state, including their memory
pools, garbage collectors and everything else, would be entirely separate.
The only exceptions would be unavoidably process-global OS entities, like
files, sockets, Events etc., but these are already managed by multi-threaded
CRTs and there should be no reason to make extra provision for them.

UG> you can have a single thread dedicated to
UG> handling signals and events but each of the others must check that
UG> process global event queue.

Agreed. This is the way I would implement asynchronous events. In fact,
every Win32 process already has an event queue by default. This is used by
native GUI applications to manage events for single- and multi-threaded
applications, including the message-passing mechanisms for communicating
between windows running on different threads. In part, this is the source of
some of the inspiration for the model I am suggesting. I don't have a great
deal of VM interpreter knowledge, but many of the same problems arise in
threaded GUI applications, and I have a lot of experience with those.

In fact, the Win32 port of P5 utilises this event queue to provide its (very
limited) emulation of *nix signals. It would theoretically be possible to
extend that current emulation to cover more than the 2 or 3 signals now
supported. By extending this mechanism, it would be possible to enable
SIG_ALRM and SIG_CHLD, for example, though the latter would be limited to
interoperation with other pseudo-forked children.

UG> and the classic problem is there too, which
UG> thread gets what events which is particularly nasty with signals.

This only becomes a problem if the process can be considered to be more than
a single program -- as with pseudo-forked children. However, if the raiser
of the signal can somehow identify the thread it wishes to signal, then
there is no problem tagging the signal with its destination and having each
thread only respond to signals tagged for it. Where a signal is raised by
the OS, which has no way to target individual threads/interpreters, the
signal would be acted upon by ALL threads in the process.
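As a sketch of how such tagging might look on Win32, where each thread can
have its own message queue: PostThreadMessage, PeekMessage and the WM_APP
range are the real API pieces; WM_APP_SIGNAL, raise_signal_in_thread and
check_pending_signals are hypothetical names, not existing Perl or Parrot
code.

    #include <windows.h>

    #define WM_APP_SIGNAL (WM_APP + 1)   /* hypothetical "VM signal" message */

    /* Raiser side: deliver a 'signal' to one specific VM/kernel thread.
       The thread id identifies the destination, so only that thread sees it. */
    BOOL raise_signal_in_thread(DWORD thread_id, int sig)
    {
        return PostThreadMessage(thread_id, WM_APP_SIGNAL, (WPARAM)sig, 0);
    }

    /* Receiver side: each VM thread drains its own queue between ops.
       (The queue is created the first time the thread calls PeekMessage.) */
    void check_pending_signals(void)
    {
        MSG msg;
        while (PeekMessage(&msg, NULL, WM_APP_SIGNAL, WM_APP_SIGNAL, PM_REMOVE)) {
            int sig = (int)msg.wParam;
            /* hand sig to the interpreter's dispatcher here */
            (void)sig;
        }
    }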
NS> MEMORY MANAGEMENT

UG> the problem is too many different memory pools. you will waste enormous
UG> amounts of unused space in each different sized pool as they all will
UG> allocate slabs and break them up to their own granularity. so you will
UG> have much of the allocated ram just sitting in pools and not being used
UG> unless each class creates enough objects to use them.

Virtual memory allocation
[http://msdn.microsoft.com/library/en-us/memory/base/virtualalloc.asp] can
be done in terms of reservation rather than committal. Each class would
reserve some sized lump of virtual memory, but only commit a small amount
(in 4KB pages) of the total reserved.

UG> see above. this is a fine scheme for a few queues of special size but
UG> not for supporting hundreds of queues of differing sizes. and then you
UG> get the real nasty problem of a class adding attributes at run time
UG> which will necessitate changing the storage needed for all its instances
UG> (this is the notification stuff dan has mentioned). this would require a
UG> massive copy of all instances and an allocation of a completely new ram
UG> pool of the new size and all the overhead of setting it up, etc.

Theoretically, each VirtualAlloc call can reserve as much virtual memory as
it likes (up to the limit of the available address space) and then commit it
in 4KB chunks on demand. The fact that no two such 'full-memory'
reservations could ever be completely committed doesn't matter. The
OS/hardware takes care of mapping the committed virtual memory to physical
memory. As far as the program is concerned, there is no shuffling of memory
required to expand any single allocation.

UG> how do you know who will grow quickly? larry hasn't published the specs
UG> for PSI::ESP yet. :) also applications never see page faults so we can't
UG> use that.

There are some indications (even without ESP :) of which classes are likely
to grow faster than others. The SCALAR class is likely to see considerably
more activity than the FATAL_ERROR_DUMP class, for example. The former might
pre-commit 1MB on start-up and extend in 1MB increments, whereas the latter
might opt to have no pre-committal at all and expand only when allocation is
requested. It might even be possible to have the reservation, pre-committal
and extension values configurable at run time on a class-by-class basis.
These might be optional class creation/load-time parameters.
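A minimal sketch of that reserve-then-commit pattern. VirtualAlloc and
VirtualFree are the real calls; the pool_create/pool_grow names and the
256MB/64KB figures are arbitrary illustrations, not proposed values.

    #include <windows.h>

    #define POOL_RESERVE (256 * 1024 * 1024) /* arbitrary per-class reservation */
    #define POOL_COMMIT  (64 * 1024)         /* grow the pool 64KB at a time    */

    /* Reserve a large stretch of address space for a class's pool, but
       commit (i.e. back with real pages) only the first chunk of it.    */
    char *pool_create(size_t *committed)
    {
        char *base = VirtualAlloc(NULL, POOL_RESERVE, MEM_RESERVE, PAGE_NOACCESS);
        if (base == NULL)
            return NULL;
        if (VirtualAlloc(base, POOL_COMMIT, MEM_COMMIT, PAGE_READWRITE) == NULL) {
            VirtualFree(base, 0, MEM_RELEASE);
            return NULL;
        }
        *committed = POOL_COMMIT;
        return base;
    }

    /* Commit the next chunk of the already-reserved range on demand.
       The pool's addresses never move, so nothing has to be shuffled.   */
    int pool_grow(char *base, size_t *committed)
    {
        if (*committed + POOL_COMMIT > POOL_RESERVE)
            return 0;  /* reservation exhausted */
        if (VirtualAlloc(base + *committed, POOL_COMMIT,
                         MEM_COMMIT, PAGE_READWRITE) == NULL)
            return 0;
        *committed += POOL_COMMIT;
        return 1;
    }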
NS> COOPERATIVE MULTIPROCESSING AND CO_ROUTINES.

UG> you seem to be crossing the kernel and VM boundaries here.

I don't believe I have.

UG> parrot will
UG> have co-routines but at the VM level.

In the model I describe, a VM thread *is* a kernel thread.

UG> they won't map easily onto any form
UG> of kernel coros since parrot needs to keep a reference to a parrot level
UG> stack and all sorts of VM level info in the coro object/thing.

Agreed. This is why each VM thread, in addition to having its own set of VM
registers, also needs its own VM stack(s), just as kernel threads have at
the CPU level.

UG> this
UG> problem is similar to the problem of mapping VM threads onto kernel
UG> threads. there are many ways to do it and none are perfect.

I doubt this mechanism gets any closer to perfect than any other, but I do
think that it is a) feasible, on Win32 at least, and b) has some unique
features worthy of consideration.

I don't know if all the requisite (Win32) OS facilities are available on all
other threaded platforms. There is no doubt in my mind that equivalent
facilities are either available (probably under different names), or could
be made available. Linux runs on the same hardware as Win32, and the rest is
just software. Whether anything that doesn't exist could be fabricated at a
level that wouldn't mean a kernel rebuild, I simply don't know.

UG> a VM just
UG> doesn't have the internal granularity of a kernel and the ability to
UG> truly block threads when waiting.

The scheme of using a single critsec to arbitrate and interlock between VM
threads is the basis of the mechanism for doing exactly this. When a thread
attempts to enter a critsec that has already been entered by another thread,
it is effectively removed from the scheduler's tables and will not be
dispatched until the thread holding the critsec exits it. No other threads
are affected unless they too attempt to enter the same critsec.

UG> VM coros also can't utilize the kernel
UG> for switching since a kernel stack only needs a pointer and a few
UG> registers to be swapped vs a larger and more complex swap with VM coros.

Each of the co-operating routines would run within its own VM thread, and
that VM thread would be converted to a fibre (in Win32 terms)
[http://msdn.microsoft.com/library/en-us/dllproc/base/threadproc.asp]. As
each VM thread already has the wherewithal to retain *all* of its state
across scheduler switches, once the thread is converted to a fibre it
retains those same abilities. The only change is that now, instead of the
switching between VMs occurring at the scheduler's behest, it occurs as a
result of a user program directive. I'm not sure what the Parrot, IMCC or P6
language-level terms for the transfer of control are or will be, but at the
Win32 level it would just translate to SwitchToFiber( lpfibre );
[http://msdn.microsoft.com/library/en-us/dllproc/base/threadproc.asp]

That's not a complete spec., but it's the root of the idea.
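A minimal, self-contained illustration of that transfer of control, assuming
two co-operating routines. ConvertThreadToFiber, CreateFiber, SwitchToFiber
and DeleteFiber are the real Win32 calls; the names main_fiber, coro_fiber
and coro_body are illustrative only.

    #include <windows.h>
    #include <stdio.h>

    /* Two co-operating "VM threads" as fibres: control only transfers when
       one side explicitly yields to the other, never at the scheduler's whim. */
    static LPVOID main_fiber;
    static LPVOID coro_fiber;

    static VOID CALLBACK coro_body(LPVOID param)
    {
        int i;
        (void)param;
        for (i = 0; i < 3; i++) {
            printf("coro: step %d\n", i);
            SwitchToFiber(main_fiber);     /* explicit transfer back */
        }
        SwitchToFiber(main_fiber);         /* a fibre must never return */
    }

    int main(void)
    {
        int i;
        /* The calling thread must itself become a fibre before switching. */
        main_fiber = ConvertThreadToFiber(NULL);
        coro_fiber = CreateFiber(0, coro_body, NULL);

        for (i = 0; i < 3; i++) {
            SwitchToFiber(coro_fiber);     /* resume the co-routine */
            printf("main: resumed after coro step %d\n", i);
        }
        DeleteFiber(coro_fiber);
        return 0;
    }

The significant point is that, unlike a preemptive switch, nothing happens
until one side explicitly asks for it, which is exactly the property a
co-routine needs.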
UG> you have addressed many issues here and i think that is what dan
UG> wanted. i hope i clarified some of them and explained why there are no
UG> simple answers. we have to bite some performance bullets somewhere due
UG> to the very lofty functional goals we have set.

Likewise, I hope that the above, in conjunction with the offsite links, will
serve as a clarification of my ideas, and as evidence that I did not arrive
at the proposals without considerable thought and research. That's no
guarantee that they are ultimately workable, and that, in part, was the
source of my reluctance to post them here earlier. I am working on a trivial
implementation of most of the core ideas by way of a proof-of-concept, but
it will be some time before this is demonstrable.

I fully appreciate that I am late to the party, and it may be too late to
consider altering the design of Parrot to accommodate these ideas. It may
well be that some or all of the facilities upon which the concept is based
are simply too Win32-specific to translate into a cross-platform
implementation. It's also more than just possible that there are gaping
holes that I haven't addressed, or even considered.

My purpose in posting was prompted by Dan's opening of the discussion, and
by the hope that the skills and expertise of the denizens of this list would
both scrutinise the ideas (as you did) and perhaps be able to relate my
Win32 view of the world to *nix (and other) views. I have no doubt that even
if some of the ideas I have postulated could be used in Parrot to good
effect, they would require (perhaps considerable) refinement and translation
along the way by all those people here who are far more experienced with
interpreter and VM design than I am.

Whilst I have tried to cover everything I can foresee, there is so much that
I do not yet understand about Parrot in particular, and VM interpreters in
general, that I fully accept that none of this might fly here. But as
someone recently pointed out, there is no point in having ideas unless you
are prepared to voice them, and support them against summary dismissal. If
Dan speaks up and says "It's too much, and too little, too late" or "it is
simply too platform-specific and too speculative to even consider", I accept
that.

UG> the key is to keep the
UG> design and api clean and as elegant as possible while keeping up the
UG> performance. and almost all of this is moot in a pure event system which
UG> is why i like them better than threads. :)

Once again, thanks for taking the time to read and digest my original post,
and for your comprehensive feedback. Without it, it is very difficult to
know which parts of what I wrote need clarification.

I had several attempts at writing the original up. They varied from a
12,000-word diatribe that tried (and probably failed) to explain every
detail of the ideas, which would never have been read and would have
patronised anyone brave enough to read it, to the other extreme, which
talked totally in terms of Win32 APIs with no explanation at all. That would
have been like talking Greek to most people here.

UG> uri

Regards, and forever holding my peace :),

Nigel Sandever.