This is a threading proposal, in the form of a collection of notes.

Rationale:

We need to constantly compare parrot to the JVM, and in particular have a deep understanding of cases where the JVM performs well (or better than parrot). The reason for this is obvious: the JVM is the product of several years of R&D, and we need to benefit from that where we can. Of course, there are places where the two will necessarily differ (due to differences such as static v. dynamic typing of their target languages), but these necessary differences don't account for everything, and it's important for us to understand if individual performance differences are fundamental, or incidental. In particular, it's instructive to look at Java's threading strategy, and its impact on GC.

I have 2 fundamental assumptions:

1) As stated by others before, Parrot needs to guarantee that its internal state is protected, even in the absence of proper synchronization by user code, though user-visible data structures may become corrupted from the user's perspective. (That is, a hash might suddenly lose some entries, but it can't become corrupted in such a way that a crash or memory leak might result.) To put it formally, no sequence of byte code can have the possibility of causing an attempt to read from, or write to, unallocated memory. [That may not cover all cases of corruption, but it gets at the main point.] This is particularly important in an embedded context--we want to be able to guarantee that bad bytecode can't crash the process in which parrot is embedded. Note that Java seems to provide similar guarantees (more on this below).

2) My second assumption is that if we can provide a safe and performant implementation which allows "full-access" threads (that is, without data-access restrictions imposed between threads), then access-restricted threads (such as Perl 5's ithreads) can be implemented on top of this. Note that thread creation in the access-restricted case will necessarily be slower, due to the need to copy (or COW-copy) additional data structures.


So the theme is taking lessons from the JVM wherever we can. The focus is on how we can provide safe and high-performance threading without data-access restrictions between threads (in light of assumption 2 above).



Notes and considerations:


1) As mentioned above, Java seems to guarantee robustness in the absence of proper user-level synchronization. In particular, section 2.19 of the JVM spec <http://java.sun.com/docs/books/vmspec/2nd-edition/html/Concepts.doc.html#33308> states the following:

"When a thread uses the value of a variable, the value it obtains is in fact a value stored into the variable by that thread or by some other thread. This is true even if the program does not contain code for proper synchronization. For example, if two threads store references to different objects into the same reference value, the variable will subsequently contain a reference to one object or the other, not a reference to some other object or a corrupted reference value. (There is a special exception for long and double values; see §8.4.)"

(It also further states, "In the absence of explicit synchronization, an implementation is free to update the main memory in an order that may be surprising. Therefore, the programmer who prefers to avoid surprises should use explicit synchronization.")

(Section 8.1 <http://java.sun.com/docs/books/vmspec/2nd-edition/html/Threads.doc.html#22197> also clarifies, "A variable is any location within a program that may be stored into. This includes not only class variables and instance variables, but also components of arrays.")

2) It seems that Java provides these guarantees in part by relying on the atomicity of stores of 32-bit sized data. (This is "atomic" in the sense of "no word tearing", meaning that when such a value is written to a memory location, any thread reading that location will see either the old or the new value, and not some other value.) Section 8.4 of the JVM spec <http://java.sun.com/docs/books/vmspec/2nd-edition/html/Threads.doc.html#22244> implies this, albeit indirectly. Due to the ubiquity of the JVM, it seems reasonable for Parrot to similarly rely on atomic stores, without compromising the range of platforms on which parrot can run. In practice, modern processors likely provide such a guarantee natively, without the need for explicit locking (though explicit locking could be employed in the seemingly rare case of a platform which provides threads but not atomic memory updates). If Java is relying on some other mechanism here, we need to understand what that is.

3) There seem to be 3 basic categories of things which need thread-safety: data structures internal to the VM (eg, opcode arrays, register stacks), variables (globals, registers, etc.; most importantly, those which hold references to aggregate types), and the internals of aggregate types (PMCs and strings). Things in these categories which we don't need to worry about are things which are logically local to a thread, no matter what the thread semantic--registers, lexical pads, etc. Of the three categories, I think that the aggregate types pose the biggest challenge. Most of the internals of the VM are likely not logically shared between threads (and those which are should probably be protected by explicit locking, and again this is likely true whether or not we actually support full-access threads), and global variables probably rely on a thread-safe hash implementation (and in particular, may be slower to access than lexical variables). So aggregate types are the key "problem to be solved".

4) On a related note, access to the global stash should probably happen via a vtable-like function pointer hanging off of the interpreter struct, to allow data-access (sharing) restrictions to be imposed when necessary. (That is, this use of different functions here will allow restricted access v. full access to globals.)

5) Java seems to use a check-in/check-out model for access to global data, in which global data "lives" in a central store, but is copied back-and-forth to thread-local storage for modification. I don't fully understand the performance benefit which I assume this provides, but this is something else we need to understand. It's discussed in detail in the threads section of the JVM spec, <http://java.sun.com/docs/books/vmspec/2nd-edition/html/Threads.doc.html>.

6) A key aspect of the safety guarantees provided by Java (in addition to atomicity of 32-bit stores) seems to be non-shrinking of allocated memory blocks. For example, if an object needs to acquire additional memory in order to expand the backing store of a hash, it has to allocate a new (larger) memory block, copy over the values from the previous block, and leave the previous block to be garbage-collected. In practice, parrot already does this, in the case of the compacting collector. (It's realloc(), without immediately deallocating the old block.) This is part of what allows safety in the light of concurrent modification (because stale pointers and offsets can still refer to valid structures).

7) Java 1.4 provides several different GC strategies, but most (if not all) involve a stop-the-world stage, in which all threads are stopped. (Note that this doesn't mean that the whole GC process needs to happen with computation stopped, merely that some part of it may need to happen with threads stopped.) So a strategy in which data is not owned per-thread (from the point of view of GC) should be reasonable, since GC may be necessarily a global endeavor. In parrot, if we have devised a way to provide timely delivery of events, then we should be able to use the event infrastructure to suspend all threads. (That is, we don't need or want to suspend threads at the pthreads level.) See <http://developers.sun.com/techtopics/mobility/midp/articles/ garbagecollection2/> and the previous version <http://developers.sun.com/techtopics/mobility/midp/articles/garbage/> for a discussion of Java's GC strategies.

8) It should be possible to provide per-thread allocation pools, such that header or string body allocations can occur without locking, despite the fact that the allocated items are not considered to be "owned" by a thread. (Since GC will occur globally, as discussed in item 7, the deallocation will not be logically per-thread.)

9) As mentioned above, PMC thread-robustness seems to be the main challenge. (Where "thread-robustness" means non-crashing in the light of concurrent modification.) I think that we can provide this in 3 ways: (a) minimize the number of custom PMCs that need to be written; that is, emphasize use of the object infrastructure rather than custom PMCs to provide HLL-level data types. (b) provide tools and guidelines for implementing thread-robust PMCs; for instance, the memory-expansion policy mentioned in item 6 above (Java provides a native System.arraycopy() to assist with this, for example), favoring the use of offsets/indexes rather than pointers into data blocks, and out-of-band creation of data structure (initialization prior to storage in a globally-accessible location). (c) use actual locking internally when necessary, and only when running multithreaded. Simple PMCs such as PerlInt are probably already thread-robust. Notably, PerlString should be straightforward to make thread-robust as well, as long as its flags are only changed at create time or during GC, and a new string header is put into place when the backing store is expanded. (Specifically, if headers can be guaranteed to be allocated from zero-filled memory, then bufstart and buflen will either be consistent, or one of them will be zero/NULL.) Thread-robustness is probably easiest in the case of leaf structures such as these (ie, structures which hold few pointers).

10) For PMC internal use, the lock-only-when-threaded approach mentioned in item 9 can be easily provided via a LOCK_THIS_PMC macro which expands to something like "if(INTERP->is_threaded) { pthread_mutex_lock(&SELF->mutex) }". (It's safe to check an interpreter->is_threaded flag without locking, because its value will only change when single-threaded.)

11) I don't think it's particularly relevant that Java's strings are immutable. StringBuffers are not, nor are collections such as HashMap, so Java still needs to deal with thread-robustness of data structures. Also, parrot has facilities for immutability, which HLLs can take advantage of as an optimization (though probably Perl6 will not, except for things such as string literals).

A few benchmarks (based on testing on Mac OS X, single CPU):

1) A pthread_mutex_lock/pthread_mutex_unlock pair on an uncontested mutex seems to have an overhead equivalent to about 7 function calls. (Or, since it _is_ 2 function calls, you could say the overhead is 5 function calls above that.) Also, pthread_mutex_trylock is only slightly faster.

2) In a pure-computation context with no locking, there isn't a penalty to being multithreaded. That is, counting to a large number in each of 40 threads takes no more time (seemingly, slightly less) than counting up to 40*large-number in a single thread. This is likely because the context-switch overhead is already being paid in context switching between processes.

3) In Perl 5.8.1, it is much faster to perform 1000 forks (and waits) than it is to create (and join) 1000 threads. In C, the opposite is true.

I'll publish some actual benchmarking numbers, with source code, separately. (They're just sort of interesting to have on hand.)

Jeff
