>>>>> "NS" == Nigel Sandever <[EMAIL PROTECTED]> writes:
NS> ATOMICITY AND CRITICAL SECTIONS

UG> that is what i mentioned above, disabling time slicing/preemption when
UG> desired. it is not just a win32 concept. hell, turning off interrupts
UG> during interrupt handlers goes way back! redmond just likes to rename
UG> stuff and act like they invented it. :)

NS> I cannot emphasize enough that Critical Sections are different from
NS> both Events and Semaphores.

NS> The distinction is that a critical section does not disable time
NS> slicing/preemption. It prevents any thread attempting to enter an
NS> owned critical section from receiving a timeslice until the current
NS> owner relinquishes it. This means that other threads in the same process
NS> and other processes continue to timeslice preemptively. Only the
NS> thread(s) waiting to enter the specified critical section are blocked.
NS> There is no 'spinning' involved.

it has to disable slicing/preemption if there are multiple kernel threads
running that have access to that data.

UG> in effect it sounds like a thread shared mutex. it could be implemented
UG> in kernel or process space.

NS> They are indeed a form of mutex, but the process space state, and
NS> avoiding the need to switch to kernel mode, is crucial, as it is
NS> this that makes them lighter weight than a conventional IPC Mutex.

NS> A CRITICAL_SECTION is a DWORD allocated in user space. It must be
NS> allocated by the user space code, and then initialised through a
NS> syscall. Once initialised, it cannot be moved, copied or modified
NS> (directly) by the user code.

NS> If you have the time and inclination, there is a brief (25 line)
NS> description of them at
NS> [http://msdn.microsoft.com/library/en-us/dllproc/base/critical_section_objects.asp]

UG> that flag setting needs to be atomic or a mutex or similar. plain flag
UG> setting won't work. also the blocking has to be kernel level (so a
UG> kernel mutex/semaphore is needed) so the kernel slicing will work.

NS> Indeed. A simple flag *is* all that is necessary to protect shared (fat)
NS> data structures, *if* the VMs are the only code that could achieve
NS> concurrent access to the contents of the PMC, and these are already
NS> interlocked at the operational level.

NS> A PMC is an object, and the only code that can affect the state of
NS> that object are its methods. The methods can only be invoked by
NS> the VM. Each method constitutes an (atomic) operation at the VM
NS> level.

NS> All that is required to protect an object from corruption through
NS> concurrent access and state change is to prevent two (or more) VMs
NS> trying to access the same object simultaneously. In order for the VM to
NS> address the PMC and load it into a VM register, it must know where
NS> it is. Ie. it must have a pointer. It can use this pointer to access
NS> the PMC with a Bit Test and Set operation. (x86 BTS)
NS> [http://www.itis.mn.it/linux/quarta/x86/bts.htm] This is a CPU atomic
NS> operation. This occurs before the VM enters the VM operations critical
NS> section.

ding! ding! ding! you just brought in a cpu specific instruction which is
not guaranteed to be on any other arch. in fact many have such a beast but
again, it is not accessible from c. and still some archs don't have it, so
it has to be emulated by dijkstra's algorithm, and that requires two
counters and a little chunk of code. you can't bring x86 centrism into
this. the fact that redmond/intel threads can make use of this instruction
to do 'critical sections' is a major point why it is bad for a portable
system with many architectures to be supported.
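For concreteness, here is a minimal sketch of the kind of two-flag software
mutual exclusion alluded to above (Peterson's algorithm for two threads).
It is written with C11 atomics, which postdate this discussion and are used
purely for illustration; the compilers of the day would have needed
platform-specific barriers or assembler instead. The busy-wait at the end
is exactly the spinning overhead being objected to:

    /* Hypothetical sketch: two-thread software mutual exclusion
     * (Peterson's algorithm) -- the "two counters and a little chunk
     * of code" needed where no atomic test-and-set is available. */
    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_bool wants[2];   /* one "I want in" flag per thread */
    static atomic_int  turn;       /* whose turn it is to wait */

    void enter_critical(int self)  /* self is 0 or 1 */
    {
        int other = 1 - self;
        atomic_store(&wants[self], true);
        atomic_store(&turn, other);          /* defer to the other thread */
        /* Spin until the other thread either does not want in, or it is
         * our turn.  This is a busy-wait in user space: the thread stays
         * runnable and burns its timeslice instead of blocking. */
        while (atomic_load(&wants[other]) && atomic_load(&turn) == other)
            ;                                /* spin */
    }

    void leave_critical(int self)
    {
        atomic_store(&wants[self], false);
    }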
NS> Once the bit has been set in the PMC header by one VM, if the scheduler
NS> intervened exactly before the next instruction in that thread, and
NS> scheduled a second VM thread to run, and that VM also attempts to access
NS> that same PMC, the first operation it tries to perform is the same BTS
NS> instruction. The next instruction then tests the CF (Carry flag) and
NS> knows that the PMC is already in use. Hence, it then relinquishes
NS> (yields) the rest of its timeslice.

i am well aware about test/set and atomic instructions. my point in my
previous reply was that it can't be assumed to be on a particular arch. so
it has to be assumed to be on none and it needs a pure software solution.
sure, a platform specific macro/assembler lib could be done but that is a
pain as well.

NS> Every time the second (or subsequent) VM attempts to use a PMC
NS> that is already in-use, it finds the bit set and immediately
NS> yields. This will only happen during the interval between the
NS> first VM issuing the BTS, and that same VM entering the
NS> critsec. Once the critsec has been entered, no other VM that could
NS> load that PMC would even be scheduled.

yield is also a kernel specific thing, and what will wake up the thread
after that? it has to block on some kernel thingy and the test/set flag is
not one. you are conflating the vm/kernel boundaries here with intel
specific solutions. the only sure way to block a kernel thread is a kernel
level mutex/lock/semaphore. but that is not a good solution to add to every
PMC that could be shared, or it has to be added at runtime to a PMC that is
being shared. we want fine grained locking, which means locks around each
shared thing. but that would mean a lock inside each shared thing, and that
is a lot of wasted overhead for all those things that aren't shared. and it
is a slower solution as well. and above all of this is any GC or event
handling threads that might need to lock everything in the entire process,
so they may need a global lock that all threads have to obey.

UG> more than test/set is needed as it has to be atomic. so CPU's have that
UG> instruction but we can't get at it directly from c and we have to
UG> emulate it on platforms which don't have it. assuming it is short and
UG> fast is not a good idea. now the atomic semaphore problem was solved
UG> long ago by dijkstra and it is a two phase test/set gizmo. it needs
UG> several machine instructions or a few lines of c. this is not a trivial
UG> amount of overhead for each access to a shared object. also this type
UG> of lock is in user space and won't block the kernel thread running this
UG> VM thread. so we really need kernel semaphores and that means more
UG> storage on each PMC that could be locked.

NS> I think this is all covered above.

i disagree. it is covered for the intel case only and that is not good
enough.

NS> If my description above is to be believed, then the BTS opcode + a JCXZ
NS> [http://www.itis.mn.it/linux/quarta/x86/jcc.htm]
NS> to skip over the yield are the only penalty of leaving the VM
NS> interlocking code in place for non-threaded versions. As the flag would
NS> never test as set on entry, the jump would always be taken and the yield
NS> code would never come into play.
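To make the BTS-then-yield sequence described above concrete, here is a
rough approximation with GCC's __sync_lock_test_and_set builtin and POSIX
sched_yield() standing in for the x86 BTS/JC pair. These primitives, the
struct layout and the pmc_acquire/pmc_release names are not part of the
proposal and are assumed only for illustration. Note that the yielding
thread stays runnable and simply retests the flag when it is next
scheduled, which is the wake-up/spinning problem raised above:

    /* Hypothetical sketch of the flag-in-the-PMC-header scheme. */
    #include <sched.h>

    typedef struct pmc {
        int in_use;            /* the "flag bit" in the PMC header */
        /* ... rest of the PMC ... */
    } PMC;

    static void pmc_acquire(PMC *p)
    {
        /* __sync_lock_test_and_set returns the previous value:
         * 0 means we just claimed the flag, 1 means another VM owns it. */
        while (__sync_lock_test_and_set(&p->in_use, 1) != 0)
            sched_yield();     /* give up the rest of the timeslice; the
                                  kernel does not know what we wait for */
    }

    static void pmc_release(PMC *p)
    {
        __sync_lock_release(&p->in_use);   /* clear the flag */
    }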
NS> The costs associated with entering and exiting critsec are
NS> minimal, but this code could be bypassed until a flag is set in the
NS> interpreter to indicate that a second VM thread has been created.

again, intel specific and it has to be tossed out.

NS> CONCLUSION (TENTATIVE, NO PROOF)

NS> As all VHLL entities are PMCs at the VMI level, the sharing of data
NS> between threads at the VHLL level is done entirely through those
NS> PMCs. If no single PMC can be the subject of an opcode on two threads
NS> concurrently, there /should/ be no opportunity for conflict.

UG> then you have no sharing. the issue is when you do have sharing. the
UG> ithreads architecture has no IMPLICIT sharing so you can start out like
UG> that. variables (and their underlying PMCs) can be declared shared at
UG> compile or runtime (and with perl, who can tell the difference?
UG> :). those shared things must each have a lock (preferably kernel level
UG> so the locking thread can block and not spin) and that requires storage
UG> and extra code wrapping access to those things.

NS> Again. I think everything here was dealt with above.

again, i disagree.

NS> As all VMI internal state is encapsulated within the VMI register
NS> set, and each thread has its own set of registers, the internal
NS> state of the VMI(s) should be equally secure.

UG> you can't lock internal state nor do you need to as only the
UG> interpreter can see them. the possibly shared things are variables
UG> and data.

NS> By "internal state of the VMI" (interpreter), I was referring to
NS> such things as the VHLL program that the interpreter is
NS> running. If a subroutine comes into existence at runtime (through
NS> eval for example) then the user program changes, and with it, the
NS> state of the interpreter. If the garbage collector runs, again
NS> the internal state of the interpreter is modified.

NS> If these operations, like all user-program directed operations,
NS> are run within the VM as atomic operations, then the state of the
NS> interpreter is protected.

but atomic operations are not available on all platforms. so you need more
work for that, and since you have kernel threads you need kernel locks.
test/set in user space is not good enough as you can't block in user space
nor yield and have the kernel wake you up later. the kernel has to be aware
of the test/set location and that means it can't all be done in user space.

UG> there may be times when a GC run needs to be initiated DURING a VM
UG> operation.

NS> This is a matter of design. The suggestion that memory allocation
NS> was managed at the class level was central to the proposal. If
NS> memory needs to be allocated as part of a VM operation, then it is
NS> associated with a single PMC; the subject of the operation. If the
NS> pool of memory used by that PMC is entirely local to the PMC and
NS> allocated by an operation within the PMC, then a failure to obtain
NS> memory requires that only *that* pool of memory need be garbage
NS> collected. As such, the GC cycle can be run on that pool, as a
NS> (nested) VM level operation. The limited scope of the pool to be
NS> GC'd makes this a much cheaper increment than global GC would be
NS> in the same circumstances. If the GC doesn't succeed in locating
NS> and freeing space, another (few) pages of the reserved allocation
NS> are then committed.

but what about shared PMCs? which ram pool gets GC'ed or allocated when one
of the sharing threads needs to deal with it?

NS> No other pool is affected. No shuffling or copying is required. See below.

again, sharing breaks that. thread specific pools still need to be aware of
each other, so threads need to lock them with kernel locks.
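As an illustration of the kind of per-PMC kernel-level lock being argued
for here, a minimal sketch with a POSIX pthread_mutex_t embedded in every
potentially shared PMC. The struct layout, the is_shared flag and the
pmc_op() wrapper are invented for this example and are not Parrot code; the
point is the storage paid by every lockable PMC and the extra code wrapping
each access:

    #include <pthread.h>

    typedef struct pmc {
        int             is_shared;  /* set when the variable is declared shared */
        pthread_mutex_t lock;       /* storage overhead carried by the PMC */
        /* ... vtable pointer, data, flags ... */
    } PMC;

    static void pmc_op(PMC *p)
    {
        if (p->is_shared)
            pthread_mutex_lock(&p->lock);    /* blocks in the kernel, no spin */

        /* ... perform the vtable operation on the PMC ... */

        if (p->is_shared)
            pthread_mutex_unlock(&p->lock);
    }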
NS> By moving away from a single central pool, or even many central
NS> pools of similarly sized objects, to a much finer grained set of
NS> pools, each managed directly by the consumer class of that
NS> pool, allocation and GC become operations of the PMC (class)
NS> itself.

they are already operations of the PMC for most cases. but sharing breaks
that, and so does GC, which can happen inside an operation and not just
between them.

NS> Indeed, multiple interpreters within the same process would be possible.
NS> And each interpreter would be able to have multiple VM threads. Two or
NS> more interpreters (and the VM(s) that run them) would never, in fact
NS> could never, interfere with each other. Their state, including their
NS> memory pools, garbage collectors and everything else, would be separate.

there is one VM to a kernel thread and one interpreter per VM. the parrot
VM is effectively an interpreter per parrot thread. you can't get away from
that.

UG> and the classic problem is there too, which
UG> thread gets what events, which is particularly nasty with signals.

NS> This only becomes a problem if the process can be considered to be more
NS> than a single program -- as with pseudo-forked children.

not in a unix environment. real multi-threaded processes already have this
problem.

NS> However, if the raiser of the signal can somehow identify the
NS> thread it wishes to signal, then there is no problem tagging the
NS> signal with its destination and having each thread only respond
NS> to signals tagged for it. Where a signal is raised by the OS, which
NS> has no way to target individual threads/interpreters, then the
NS> signal would be acted upon by ALL threads in the process.

that is a kernel issue and most signalling systems don't have thread
specific arguments AFAIK.

NS> MEMORY MANAGEMENT

UG> the problem is too many different memory pools. you will waste enormous
UG> amounts of unused space in each different sized pool as they all will
UG> allocate slabs and break them up to their own granularity. so you will
UG> have much of the allocated ram just sitting in pools and not being used
UG> unless each class creates enough objects to use them.

NS> Virtual memory allocation
NS> [http://msdn.microsoft.com/library/en-us/memory/base/virtualalloc.asp]
NS> can be done in terms of reservation rather than committal.
NS> Each class would reserve some sized lump of virtual memory, but
NS> only commit a small amount (in 4k pages) of the total reserved.

virtual ram is what counts on unix. you can't request some large amount
without using real swap space. it may not allocate real ram until later
(page faulting on demand) but it is swap space that counts. it is used up
as soon as you allocate it.

UG> see above. this is a fine scheme for a few queues of special size but
UG> not for supporting hundreds of queues of differing sizes. and then you
UG> get the real nasty problem of a class adding attributes at run time,
UG> which will necessitate changing the storage needed for all its instances
UG> (this is the notification stuff dan has mentioned). this would require a
UG> massive copy of all instances and an allocation of a completely new ram
UG> pool of the new size and all the overhead of setting it up, etc.

NS> Theoretically, each VirtualAlloc call can reserve as much virtual memory
NS> as it likes (up to the limit of physical memory size) and then commit
NS> this in 4K chunks on demand. The fact that no two such 'full-memory'
NS> reservations could ever be completely committed doesn't matter. The
NS> OS/hardware takes care of mapping the committed virtual memory to
NS> physical memory. As far as the program is concerned, there is no
NS> shuffling of memory required to expand any single allocation.

again, redmond specific. physical ram is never the issue and virtual ram is
used as you allocate it. there is no way to separate the two on unix.
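For reference, a minimal sketch of the Win32 reserve-then-commit pattern
being described, using the real VirtualAlloc/VirtualFree flags. The pool
size and the notion of one such reservation per PMC class are assumptions
made only for the illustration:

    #include <windows.h>

    #define POOL_RESERVE  (16 * 1024 * 1024)  /* address space only; size is hypothetical */
    #define POOL_COMMIT   (4 * 1024)          /* one page actually backed at a time */

    int main(void)
    {
        /* Reserve a large range of virtual addresses for this class's pool.
         * Nothing is committed yet. */
        char *pool = VirtualAlloc(NULL, POOL_RESERVE, MEM_RESERVE, PAGE_NOACCESS);

        /* Commit pages on demand from the front of the reserved range. */
        char *page = VirtualAlloc(pool, POOL_COMMIT, MEM_COMMIT, PAGE_READWRITE);
        page[0] = 0;   /* the committed page is now usable memory */

        /* ... carve objects out of the committed page; commit more as needed ... */

        VirtualFree(pool, 0, MEM_RELEASE);    /* release the whole reservation */
        return 0;
    }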
NS> COOPERATIVE MULTIPROCESSING AND CO-ROUTINES

UG> you seem to be crossing the kernel and VM boundaries here.

NS> I don't believe I have.

UG> parrot will
UG> have co-routines but at the VM level.

NS> In the model I describe, a VM thread *is* a kernel thread.

which means a kernel level lock and not a user level lock. this is the
disagreement we have. this is where you are crossing the kernel/VM boundary
(even if you think not).

UG> they won't map easily onto any form of kernel coros since parrot
UG> needs to keep a reference to a parrot level stack and all sorts of
UG> VM level info in the coro object/thing.

NS> Agreed. This is why each VM thread, in addition to having its own
NS> set of VM registers, also needs its own VM stack(s), just as
NS> kernel threads have at the CPU level.

that is expected. it is the locking issue where we disagree.

UG> this problem is similar to the problem of mapping VM threads onto
UG> kernel threads. there are many ways to do it and none are perfect.

NS> I doubt this mechanism gets any closer to perfect than any other, but
NS> I do think that it is
NS> a) feasible, on win32 at least, and
NS> b) has some unique features worthy of consideration.

what win32 does is not portable and not useful at a VM level. kernels and
their threads can work well together. portable VMs and their threads are
another beast that can't rely on a particular architecture instruction or
kernel feature.

NS> I don't know if all the requisite (win32) OS facilities are
NS> available on all other threaded platforms. There is no doubt in my
NS> mind that equivalent facilities are either available (probably
NS> under different names), or could be made available. Linux runs on
NS> the same hardware as Win32, and the rest is just software. Whether
NS> anything that doesn't exist could be fabricated at a level that
NS> wouldn't mean a kernel rebuild I simply don't know.

hardware is tossed out with portable VM design. we have a user space
process written in c with a VM and VM threads that are to be based on
kernel threads. the locking issue for shared objects is the toughest nut to
crack right now. there is no simple or fast solution to this given the
known constraints. intel/redmond specific solutions are not applicable
(though we can learn from them).

UG> a VM just doesn't have the internal granularity of a kernel and
UG> the ability to truly block threads when waiting.

NS> The scheme of using a single critsec to arbitrate and interlock
NS> between VM threads is the basis of the mechanism for doing exactly
NS> this. When a thread attempts to enter a critsec that is already
NS> entered by another thread, it is effectively removed from the
NS> scheduler's tables and will not be dispatched until the thread
NS> holding the critsec exits it. No other threads are affected unless
NS> they too attempt to enter the same critsec.

this is where i feel you are contradicting yourself. you just did a kernel
level block in this critical section, but above you claim it can be done
with just a user level test/set instruction. the kernel handles scheduling,
but that means a thread has to block on a lock or yield on some kernel
level thing. a user level test/set will never get the kernel's attention
and the thread will have to spin on the test.
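To pin down what the single "VM operation" critsec amounts to, here is a
minimal sketch in Win32 terms. Only the CRITICAL_SECTION calls are real
APIs; vm_init() and vm_run_op() are invented names for the illustration.
Note that EnterCriticalSection on a contended critsec is itself a
kernel-level wait, which is the contradiction being pointed out:

    #include <windows.h>

    static CRITICAL_SECTION vm_op_cs;     /* shared by all VM threads */

    void vm_init(void)
    {
        InitializeCriticalSection(&vm_op_cs);
    }

    void vm_run_op(void *op)
    {
        /* If another VM thread owns the critsec, this thread is suspended
         * by the kernel until the owner leaves; no user-space spinning. */
        EnterCriticalSection(&vm_op_cs);

        /* ... execute one opcode atomically with respect to other VM threads ... */

        LeaveCriticalSection(&vm_op_cs);
    }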
UG> VM coros also can't utilize the kernel
UG> for switching since a kernel stack only needs a pointer and a few
UG> registers to be swapped vs a larger and more complex swap with VM
UG> coros.

NS> By having both (each) thread in the co-processes running within
NS> its own VM thread, and converting that VM thread to a fibre (in
NS> win32 terms)
NS> [http://msdn.microsoft.com/library/en-us/dllproc/base/threadproc.asp]

NS> As each VM thread already has the wherewithal to retain *all*
NS> state across scheduler switches, once the thread is converted to a
NS> fibre, it retains these same abilities. The only change is that
NS> now, instead of the switching between VMs occurring at the
NS> scheduler's behest, it occurs as a result of a user program
NS> directive.

ok, i can see where you got the test/set and yield stuff from now. fibres
seem to be manually scheduled threads. this means user code has to decide
to yield to another thread blocking on the same lock. this means all locks
must have a list of threads/fibres blocking on that lock. and manually
scheduled stuff is not appropriate for parrot. finally, it is redmond
specific again and very unportable. i have never heard of this fibre
concept on any unix flavor.

NS> I'm not sure what the Parrot, IMCC or P6 language level terms for
NS> the transfer of control are or will be, but at the (win32 level) it
NS> would just translate to SwitchToFiber( lpfibre );
NS> [http://msdn.microsoft.com/library/en-us/dllproc/base/threadproc.asp]

and that is not possible on other platforms. also it means parrot would
have to have its own scheduler, with all the pain that brings.

UG> you have addressed many issues here and i think that is what dan
UG> wanted. i hope i clarified some of them and explained why there
UG> are no simple answers. we have to bite some performance bullets
UG> somewhere due to the very lofty functional goals we have set.

NS> Likewise, I hope that the above, in conjunction with the offsite
NS> links, will serve as a clarification of my ideas, and as evidence
NS> that I did not arrive at the proposals without considerable
NS> thought and research.

NS> That's no guarantee that they are ultimately workable, and that,
NS> in part, was the source of my reluctance to post them here
NS> earlier. I am working on a trivial implementation of most of the
NS> core ideas by way of a proof-of-concept, but it will be some time
NS> before this is demonstrable.

NS> I fully appreciate that I am late to the party and it may be too
NS> late to consider altering the design of Parrot to accommodate
NS> these ideas.

NS> It may well be that some or all of the facilities upon which the
NS> concept is based are simply too Win32 specific to translate into a
NS> cross-platform implementation.

that is a big point and one that i don't see as possible. redmond can do
what they want with their kernel and user procs. parrot can only use kernel
concepts that are reasonably portable across most platforms. kernel threads
are reasonably portable (FSDO of reasonable) but anything beyond that, such
as fibres, test/set in user space, etc., is not. locks have to be in kernel
space since we can't do a fibre yield in user space on any other platform.
so this rules out user space test/set as well since that would need a
thread to spin instead of blocking. your ideas make sense but only on
redmond/intel, which is not the target space for parrot.
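For readers who have not met the Win32 fibre APIs discussed above, here is
a minimal, self-contained sketch of the kind of user-directed switch being
described. ConvertThreadToFiber, CreateFiber, SwitchToFiber and DeleteFiber
are the real Win32 calls; the idea of one fibre per VM coroutine and the
coro_body contents are assumed purely for illustration, and, as noted,
nothing equivalent is portable:

    #include <windows.h>

    static LPVOID main_fiber;
    static LPVOID coro_fiber;

    static VOID CALLBACK coro_body(PVOID arg)
    {
        /* ... run the coroutine's VM ops ... */
        SwitchToFiber(main_fiber);    /* explicit, user-directed yield back;
                                         a fibre routine must never return */
    }

    int main(void)
    {
        /* The current (VM) thread must first become a fibre itself. */
        main_fiber = ConvertThreadToFiber(NULL);

        /* Each VM coroutine gets its own fibre, i.e. its own stack. */
        coro_fiber = CreateFiber(0, coro_body, NULL);

        SwitchToFiber(coro_fiber);    /* transfer of control; the kernel
                                         scheduler is not involved, which is
                                         the objection raised above */
        DeleteFiber(coro_fiber);
        return 0;
    }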
thanx,

uri

-- 
Uri Guttman  ------  [EMAIL PROTECTED]  -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs  ----------------------------  http://jobs.perl.org