Gordon Henriksen <[EMAIL PROTECTED]> wrote:

> Leopold Toetsch wrote:
>> Increasing the union size, so that each pointer is distinct, is not an
>> option. This imposes considerable overhead on a non-threaded program
>> too, due to its bigger PMC size.

> That was the brute-force approach, separating out all pointers. If the
> scalar hierarchy doesn't use all 4 of the pointers, then the bloat can
> be reduced.

Yep. Your proposal is a very thorough analysis of what the problems of
morph() currently are.

I can imagine that we have a distinct string_val pointer that isn't part
of the value union. Morph is currently implemented only (and necessary)
for PerlScalar types. So if a PerlString's string_val member is valid at
any time, we could probably save a lot of locking overhead.

> And what of the per-PMC mutex? Is that not also considerable overhead?
> More than an unused field, even.

We have to weigh single-CPU (non-threaded) performance against threaded
performance. For the latter, we have a wide spectrum of possibilities,
from single-CPU up to many-CPU NUMA systems. I currently don't want to
slow down the "normal" non-threaded case.

>> To keep internal state consistent we have to LOCK shared PMCs, that's
>> it. This locking is sometimes necessary for reading too.

> Sometimes? Unless parrot can prove a PMC is not shared, PMC locking is
> ALWAYS necessary for ALL accesses to ANY PMC.

get_integer() on a PerlInt PMC is always safe. *If* the vtable is
pointing to a PerlInt PMC, it yields a correct value (atomic int access
presumed). If for some reason (which I can't imagine now) the vtable
pointer and the cache union are out of sync, get_integer would produce a
wrong value, which is currently considered to be a user problem (i.e.
missing user-level locking).

The same holds for e.g. PerlNum, which might read a mixture of lo and hi
words, but again, pmc->vtable->get_number of a PerlNum is a safe
operation on shared PMCs without locking too.
The locking primitives provide AFAIK the (possibly needed) memory
barriers, so that the vtable *and* the cache values are updated to point
to consistent data during the *locking* of mutating vtable methods.

> BENCHMARK USER SYS %CPU TOTAL
> That's only 2.1%.

[ big snip on benchmarks ]

All these benchmarks show a scheme like: allocate once and use once.
I.e. these benchmarks don't reuse the PMCs. They don't show real-life
(RL) program behavior. We don't have any benchmarks currently that
expose a near-worst-case slowdown, which went towards 400% for doubled
PMC sizes.

We had some test programs (C only, with simulated allocation schemes)
that shuffled allocated PMCs in memory (a typical effect of reusing PMC
memory a few times). The current 20-byte PMC slowed down by 200%, the
old 32-byte PMC by 400%. The point is that when you get allocated PMCs
from the free_list, they are more and more scattered in the PMC arenas.
All current benchmarks tend to touch contiguous memory, while RL (or a
bit longer-running) programs don't.

That said, I do consider PMC sizes and cache pollution the major
performance issues of RL Parrot programs. These benchmarks don't show
that yet.

> What do you think the overall performance effect of fine-grained
> locking will be? You just showed in a microbenchmark that it's 400%
> for some operations.

Yes. That's the locking overhead for the *fastest* PMC vtable methods.

I think that we'll be able to have different locking strategies in the
long run. If you want a more scalable application on a many-CPU system,
a build option may provide this. For a one- or two-CPU system, we can do
less fine-grained locking with less overhead. That might be a global
interpreter lock for that specific case.

> And remember, these overheads are ON TOP OF the user's synchronization
> requirements; the PMC locks will rarely coincide with the user's
> high-level synchronization requirements.

User-level locking isn't laid out yet.
But my 2¢ towards that are: a user will put a lock around unsafe (that
is, shared) variable access. We have to lock internally a lot of the
time anyway to keep our data integrity, which is guaranteed. So the
question is, why not provide that integrity per se (the user will need
it anyway and lock). I don't see a difference here for data integrity,
*but* if all locking is under our control, we can optimize it, and it
won't conflict and shouldn't deadlock.

> If these are the two options, I as a user would rather have a separate
> threaded parrot executable which takes the 2.1% hit, rather than the
> 400% overhead as per above. It's easily the difference between usable
> threads and YAFATTP (yet another failed attempt to thread perl).

All these numbers are by far too premature to have any impact on RL
applications. Please note that even with a 400% slowdown for one vtable
operation, the mops_p.pasm benchmark would still run 4 to 8 times faster
than an *unthreaded* perl5. Thread spawning is currently 8 times faster
than on perl5.

> Gordon Henriksen

leo