This is a threading proposal, in the form of a collection of notes.
Rationale:
We need to constantly compare parrot to the JVM, and in particular have
a deep understanding of cases where the JVM performs well (or better
than parrot). The reason for this is obvious: the JVM is the product of
several years of R&D, and we need to benefit from that where we can. Of
course, there are places where the two will necessarily differ (due to
differences such as static v. dynamic typing of their target
languages), but these necessary differences don't account for
everything, and it's important for us to understand if individual
performance differences are fundamental, or incidental. In particular,
it's instructive to look at Java's threading strategy, and its impact
on GC.
I have 2 fundamental assumptions:
1) As stated by others before, Parrot needs to guarantee that its
internal state is protected, even in the absence of proper
synchronization by user code, though user-visible data structures may
become corrupted from the user's perspective. (That is, a hash might
suddenly lose some entries, but it can't become corrupted in such a way
that a crash or memory leak might result.) To put it formally, no
sequence of byte code can have the possibility of causing an attempt to
read from, or write to, unallocated memory. [That may not cover all
cases of corruption, but it gets at the main point.] This is
particularly important in an embedded context--we want to be able to
guarantee that bad bytecode can't crash the process in which parrot is
embedded. Note that Java seems to provide similar guarantees (more on
this below).
2) My second assumption is that if we can provide a safe and performant
implementation which allows "full-access" threads (that is, without
data-access restrictions imposed between threads), then
access-restricted threads (such as Perl 5's ithreads) can be
implemented on top of this. Note that thread creation in the
access-restricted case will necessarily be slower, due to the need to
copy (or COW-copy) additional data structures.
So the theme is taking lessons from the JVM wherever we can. The focus
is on how we can provide safe and high-performance threading without
data-access restrictions between threads (in light of assumption 2
above).
Notes and considerations:
1) As mentioned above, Java seems to guarantee robustness in the
absence of proper user-level synchronization. In particular, section
2.19 of the JVM spec
<http://java.sun.com/docs/books/vmspec/2nd-edition/html/
Concepts.doc.html#33308> states the following:
"When a thread uses the value of a variable, the value it obtains is in
fact a value stored into the variable by that thread or by some other
thread. This is true even if the program does not contain code for
proper synchronization. For example, if two threads store references to
different objects into the same reference value, the variable will
subsequently contain a reference to one object or the other, not a
reference to some other object or a corrupted reference value. (There
is a special exception for long and double values; see §8.4.)"
(It also further states, "In the absence of explicit synchronization,
an implementation is free to update the main memory in an order that
may be surprising. Therefore, the programmer who prefers to avoid
surprises should use explicit synchronization.")
(Section 8.1
<http://java.sun.com/docs/books/vmspec/2nd-edition/html/
Threads.doc.html#22197> also clarifies, "A variable is any location
within a program that may be stored into. This includes not only class
variables and instance variables, but also components of arrays.")
2) It seems that Java provides these guarantees in part by relying on
the atomicity of stores of 32-bit sized data. (This is "atomic" in the
sense of "no word tearing", meaning that when such a value is written
to a memory location, any thread reading that location will see either
the old or the new value, and not some other value.) Section 8.4 of the
JVM spec
<http://java.sun.com/docs/books/vmspec/2nd-edition/html/
Threads.doc.html#22244> implies this, albeit indirectly. Due to the
ubiquity of the JVM, it seems reasonable for Parrot to similarly rely
on atomic stores, without compromising the range of platforms on which
parrot can run. In practice, modern processors likely provide such a
guarantee natively, without the need for explicit locking (though
explicit locking could be employed in the seemingly rare case of a
platform which provides threads but not atomic memory updates). If Java
is relying on some other mechanism here, we need to understand what
that is.
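As a concrete illustration of that assumption (plain C, not Parrot
code; the names are made up): a writer updating an aligned, word-sized
slot while another thread reads it. The assumption is that the reader
gets either the old reference or the new one, never a torn mixture of
bytes--on a platform without that guarantee, the store and load below
would need to be wrapped in explicit locking, as noted above.

    #include <stdint.h>

    typedef uint32_t pmc_ref;      /* stands in for a 32-bit reference */

    volatile pmc_ref shared_slot;  /* a location visible to two threads */

    void writer(pmc_ref new_pmc)
    {
        /* Assumed untorn: the whole 32-bit value is stored at once. */
        shared_slot = new_pmc;
    }

    pmc_ref reader(void)
    {
        /* Sees either the old or the new reference, never a mixture. */
        return shared_slot;
    }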
3) There seem to be 3 basic categories of things which need
thread-safety: data structures internal to the VM (eg, opcode arrays,
register stacks), variables (globals, registers, etc.; most
importantly, those which hold references to aggregate types), and the
internals of aggregate types (PMCs and strings). Things in these
categories which we don't need to worry about are those which are
logically local to a thread, regardless of the threading
semantics--registers, lexical pads, etc. Of the three categories, I
think that the aggregate types pose the biggest challenge. Most of the
internals of the VM are likely not logically shared between threads
(and those which are should probably be protected by explicit locking,
and again this is likely true whether or not we actually support
full-access threads), and global variables probably rely on a
thread-safe hash implementation (and in particular, may be slower to
access than lexical variables). So aggregate types are the key "problem
to be solved".
4) On a related note, access to the global stash should probably happen
via a vtable-like function pointer hanging off of the interpreter
struct, to allow data-access (sharing) restrictions to be imposed when
necessary. (That is, using different functions here will allow
restricted access v. full access to globals.)
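As a rough sketch of what that indirection could look like (the struct
and function names below are hypothetical, not the actual Parrot API):

    typedef struct Stash Stash;   /* opaque global stash */
    typedef struct PMC   PMC;     /* opaque PMC */

    typedef struct Interp {
        Stash *globals;
        /* vtable-like entry points for global access; a full-access or a
         * restricted implementation is installed per interpreter. */
        PMC  *(*get_global)(struct Interp *, const char *name);
        void  (*set_global)(struct Interp *, const char *name, PMC *value);
    } Interp;

    /* Bytecode-level access to globals always goes through the pointers: */
    PMC *fetch_global(Interp *interp, const char *name)
    {
        return interp->get_global(interp, name);
    }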
5) Java seems to use a check-in/check-out model for access to global
data, in which global data "lives" in a central store, but is copied
back-and-forth to thread-local storage for modification. I don't fully
understand the performance benefit which I assume this provides, but
this is something else we need to understand. It's discussed in detail
in the threads section of the JVM spec,
<http://java.sun.com/docs/books/vmspec/2nd-edition/html/
Threads.doc.html>.
6) A key aspect of the safety guarantees provided by Java (in addition
to atomicity of 32-bit stores) seems to be non-shrinking of allocated
memory blocks. For example, if an object needs to acquire additional
memory in order to expand the backing store of a hash, it has to
allocate a new (larger) memory block, copy over the values from the
previous block, and leave the previous block to be garbage-collected.
In practice, parrot already does this, in the case of the compacting
collector. (It's realloc(), without immediately deallocating the old
block.) This is part of what allows safety in the light of concurrent
modification (because stale pointers and offsets can still refer to
valid structures).
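A minimal sketch of that growth policy (illustrative names, not actual
parrot functions):

    #include <stddef.h>
    #include <string.h>

    typedef struct {
        void   *bufstart;   /* current backing store */
        size_t  buflen;     /* its size in bytes */
    } Buffer;

    /* gc_alloc() stands in for allocation from the collected pool. */
    extern void *gc_alloc(size_t size);

    void buffer_grow(Buffer *b, size_t new_len)
    {
        void *new_block = gc_alloc(new_len);
        memcpy(new_block, b->bufstart, b->buflen);
        /* The old block is *not* freed here--it stays valid until the
         * collector reclaims it, so stale pointers still point at real
         * memory. */
        b->bufstart = new_block;
        b->buflen   = new_len;
    }

(Making the bufstart/buflen pair look consistent to a concurrent reader
is a separate issue; item 9 below gets at that by replacing a whole
header rather than updating its fields in place.)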
7) Java 1.4 provides several different GC strategies, but most (if not
all) involve a stop-the-world stage, in which all threads are stopped.
(Note that this doesn't mean that the whole GC process needs to happen
with computation stopped, merely that some part of it may need to
happen with threads stopped.) So a strategy in which data is not owned
per-thread (from the point of view of GC) should be reasonable, since
GC may be necessarily a global endeavor. In parrot, if we have devised
a way to provide timely delivery of events, then we should be able to
use the event infrastructure to suspend all threads. (That is, we don't
need or want to suspend threads at the pthreads level.) See
<http://developers.sun.com/techtopics/mobility/midp/articles/
garbagecollection2/> and the previous version
<http://developers.sun.com/techtopics/mobility/midp/articles/garbage/>
for a discussion of Java's GC strategies.
8) It should be possible to provide per-thread allocation pools, such
that header or string body allocations can occur without locking,
despite the fact that the allocated items are not considered to be
"owned" by a thread. (Since GC will occur globally, as discussed in
item 7, the deallocation will not be logically per-thread.)
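A rough sketch of what such a pool might look like (illustrative only;
arena_refill() is a made-up name, and __thread could be replaced with
pthread_getspecific() on platforms without compiler-supported
thread-local storage):

    #include <pthread.h>
    #include <stddef.h>

    typedef struct Header {
        struct Header *next_free;
        /* ... actual header fields ... */
    } Header;

    static __thread Header *local_free_list;    /* per-thread pool */
    static pthread_mutex_t  arena_lock = PTHREAD_MUTEX_INITIALIZER;

    extern Header *arena_refill(size_t count);  /* hands back a free chain */

    Header *allocate_header(void)
    {
        if (!local_free_list) {
            /* Slow path: lock only to grab a batch from the shared arena. */
            pthread_mutex_lock(&arena_lock);
            local_free_list = arena_refill(64);
            pthread_mutex_unlock(&arena_lock);
        }
        Header *h = local_free_list;
        local_free_list = h->next_free;
        return h;
    }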
9) As mentioned above, PMC thread-robustness seems to be the main
challenge. (Where "thread-robustness" means non-crashing in the light
of concurrent modification.) I think that we can provide this in 3
ways: (a) minimize the number of custom PMCs that need to be written;
that is, emphasize use of the object infrastructure rather than custom
PMCs to provide HLL-level data types. (b) provide tools and guidelines
for implementing thread-robust PMCs; for instance, the memory-expansion
policy mentioned in item 6 above (Java provides a native
System.arraycopy() to assist with this, for example), favoring the use
of offsets/indexes rather than pointers into data blocks, and
out-of-band creation of data structures (initialization prior to storage
in a globally-accessible location). (c) use actual locking internally
when necessary, and only when running multithreaded. Simple PMCs such
as PerlInt are probably already thread-robust. Notably, PerlString
should be straightforward to make thread-robust as well, as long as its
flags are only changed at create time or during GC, and a new string
header is put into place when the backing store is expanded.
(Specifically, if headers can be guaranteed to be allocated from
zero-filled memory, then bufstart and buflen will either be consistent,
or one of them will be zero/NULL.) Thread-robustness is probably
easiest in the case of leaf structures such as these (ie, structures
which hold few pointers).
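To illustrate the PerlString case (the structures below are
illustrative, not the actual Parrot layout), the expansion can happen
out of band and then be published with a single pointer store, so a
concurrent reader sees either the old consistent header or the new one:

    #include <stddef.h>
    #include <string.h>

    typedef struct StringHeader {
        char   *bufstart;
        size_t  buflen;
    } StringHeader;

    typedef struct PerlString {
        StringHeader *hdr;   /* readers follow this single pointer */
    } PerlString;

    extern void *gc_alloc_zeroed(size_t size);  /* zero-filled, GC-managed */

    void string_expand(PerlString *s, size_t new_len)
    {
        /* Out-of-band creation: nothing below is visible to other threads
         * until the final pointer store. */
        StringHeader *new_hdr = gc_alloc_zeroed(sizeof *new_hdr);
        new_hdr->bufstart = gc_alloc_zeroed(new_len);
        memcpy(new_hdr->bufstart, s->hdr->bufstart, s->hdr->buflen);
        new_hdr->buflen = new_len;

        /* A single word-sized (and so, per item 2, untorn) store; the old
         * header and its buffer stay valid until the collector reclaims
         * them. */
        s->hdr = new_hdr;
    }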
10) For PMC internal use, the lock-only-when-threaded approach
mentioned in item 9 can easily be provided via a LOCK_THIS_PMC macro
which expands to something like "if (INTERP->is_threaded) {
pthread_mutex_lock(SELF->mutex); }". (It's safe to check an
interpreter->is_threaded flag without locking, because its value will
only change when single-threaded.)
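Spelled out as a macro pair (a sketch; the is_threaded flag and the
per-PMC mutex pointer are the ones hypothesized above):

    #include <pthread.h>

    #define LOCK_THIS_PMC(interp, pmc)                \
        do {                                          \
            if ((interp)->is_threaded)                \
                pthread_mutex_lock((pmc)->mutex);     \
        } while (0)

    #define UNLOCK_THIS_PMC(interp, pmc)              \
        do {                                          \
            if ((interp)->is_threaded)                \
                pthread_mutex_unlock((pmc)->mutex);   \
        } while (0)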
11) I don't think it's particularly relevant that Java's strings are
immutable. StringBuffers are not, nor are collections such as HashMap,
so Java still needs to deal with thread-robustness of data structures.
Also, parrot has facilities for immutability, which HLLs can take
advantage of as an optimization (though probably Perl6 will not, except
for things such as string literals).
A few benchmarks (based on testing on Mac OS X, single CPU):
1) A pthread_mutex_lock/pthread_mutex_unlock pair on an uncontested
mutex seems to have an overhead equivalent to about 7 function calls.
(Or, since it _is_ 2 function calls, you could say the overhead is 5
function calls above that.) Also, pthread_mutex_trylock is only
slightly faster. (A sketch of this measurement appears below.)
2) In a pure-computation context with no locking, there isn't a penalty
to being multithreaded. That is, counting to a large number in each of
40 threads takes no more time (seemingly, slightly less) than counting
up to 40*large-number in a single thread. This is likely because the
context-switch overhead is already being paid in context switching
between processes.
3) In Perl 5.8.1, it is much faster to perform 1000 forks (and waits)
than it is to create (and join) 1000 threads. In C, the opposite is
true.
I'll publish some actual benchmarking numbers, with source code,
separately. (They're just sort of interesting to have on hand.)
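In the meantime, here's a rough sketch (not the code referred to above)
of the kind of measurement behind benchmark 1: time a large number of
lock/unlock pairs on an uncontested mutex against the same number of
calls to a trivial function, and compare the per-iteration costs.

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define N 10000000L

    static volatile int sink;

    static void trivial(void) { sink++; }

    int main(void)
    {
        pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
        clock_t t0, t1, t2;
        long i;

        t0 = clock();
        for (i = 0; i < N; i++) {
            pthread_mutex_lock(&m);
            pthread_mutex_unlock(&m);
        }
        t1 = clock();
        for (i = 0; i < N; i++)
            trivial();
        t2 = clock();

        printf("lock/unlock pair: %.1f ns each\n",
               (t1 - t0) * 1e9 / CLOCKS_PER_SEC / N);
        printf("function call:    %.1f ns each\n",
               (t2 - t1) * 1e9 / CLOCKS_PER_SEC / N);
        return 0;
    }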
Jeff