Gordon Henriksen <[EMAIL PROTECTED]> wrote:
> Leopold Toetsch wrote:

>> Increasing the union size, so that each pointer is distinct, is not an
>> option. This imposes considerable overhead on a non-threaded program
>> too, due to its bigger PMC size.

> That was the brute-force approach, separating out all pointers. If the
> scalar hierarchy doesn't use all 4 of the pointers, then the bloat can
> be reduced.

Yep. Your proposal is a very thorough analysis of what the problems of
morph() currently are. I can imagine that we have a distinct string_val
pointer that isn't part of the value union. morph() is currently
implemented (and necessary) only for PerlScalar types. So if a
PerlString's string_val member is valid at any time, we could probably
save a lot of locking overhead.

> And what of the per-PMC mutex? Is that not also considerable overhead?
> More than an unused field, even.

We have to weigh single-CPU (non-threaded) performance against threaded
performance. For the latter, we have a wide spectrum of possibilities,
from single-CPU up to many-CPU NUMA systems.

I currently don't want to slow down the "normal" non-threaded case.

>> To keep internal state consistent we have to LOCK shared PMCs, that's
>> it. This locking is sometimes necessary for reading too.

> Sometimes? Unless parrot can prove a PMC is not shared, PMC locking is
> ALWAYS necessary for ALL accesses to ANY PMC.

get_integer() on a PerlInt PMC is always safe. *If* the vtable is
pointing to a PerlInt PMC, it yields a correct value (atomic int access
presumed). If for some reason (which I can't imagine now) the vtable
pointer and the cache union are out of sync, get_integer() would produce
a wrong value, which is currently considered to be a user problem (i.e.
missing user-level locking).

The same holds for e.g. PerlNum, which might read a mixture of lo and hi
words; but again, pmc->vtable->get_number on a PerlNum is a safe
operation on shared PMCs without locking too.

AFAIK the locking primitives provide the (possibly needed) memory
barriers, so that the vtable *and* the cache values are updated to point
to consistent data while a mutating vtable method holds the lock.

> That's only 2.1%.

[ big snip on benchmarks ]

All these benchmarks follow a scheme like: allocate once, use once. I.e.
these benchmarks don't reuse the PMCs, so they don't show real-life (RL)
program behavior. We currently don't have any benchmark that exposes a
near-worst-case slowdown, which went towards 400% for doubled PMC sizes.

We had some test programs (C only, with simulated allocation schemes)
that shuffled the allocated PMCs in memory (a typical effect of reusing
PMC memory a few times). The current 20-byte PMC slowed down by 200%;
the old 32-byte PMC slowed down by 400%.

The point is that when you get allocated PMCs from the free_list, they
are more and more scattered across the PMC arenas. All current
benchmarks tend to touch contiguous memory, while RL (or somewhat
longer-running) programs don't.

That said, I do consider PMC size and cache pollution the major
performance issues for RL Parrot programs. These benchmarks just don't
show that yet.

> What do you think the overall performance effect of fine-grained locking
> will be? You just showed in a microbenchmark that it's 400% for some
> operations.

Yes. That's the locking overhead for the *fastest* PMC vtable methods. I
think we'll be able to have different locking strategies in the long
run. If you want a more scalable application on a many-CPU system, a
build option may provide this. For a one- or two-CPU system, we can do
less fine-grained locking with less overhead; that might be a global
interpreter lock for that specific case.

> And remember, these overheads are ON TOP OF the user's synchronization
> requirements; the PMC locks will rarely coincide with the user's
> high-level synchronization requirements.

User-level locking isn't laid out yet. But my 2¢ towards that: a user
will put a lock around unsafe (that is, shared) variable accesses. We
have to lock internally a lot of the time anyway to keep our data
integrity, which is guaranteed. So the question is: why not provide that
integrity per se (the user will need it anyway and lock)? I don't see a
difference here for data integrity, *but* if all locking is under our
control, we can optimize it, and it doesn't conflict and shouldn't
deadlock.

> If these are the two options, I as a user would rather have a separate
> threaded parrot executable which takes the 2.1% hit, rather than the
> 400% overhead as per above. It's easily the difference between usable
> threads and YAFATTP (yet another failed attempt to thread perl).

All these numbers are by far too premature to have any impact on RL
applications.

Please note that even with a 400% slowdown for one vtable operation, the
mops_p.pasm benchmark would still run four to eight times faster than on
an *unthreaded* perl5. Thread spawning is currently 8 times faster than
on perl5.

> ?

!!!1

> Gordon Henriksen

leo
