Hello!

As promised/threatened earlier in another forum...

                                                        Thanx, Paul

------------------------------------------------------------------------

The Real-Time Preemption patch modified RCU to permit its read-side
critical sections to be safely preempted.  This has worked, but there
are a few serious problems with this variant of RCU.  [If you just want
to skip directly to the code, search for "synchronize_kernel(void)".
There are five occurrences, each a variation on the theme.  I recommend
the fifth one.  The third one might also be OK in some environments.
If you have better approaches, please do not keep them a secret!!!]

So, why am I saying that there are problems with the real-time preemption
implementation of RCU?

o       RCU read-side critical sections cannot be freely nested,
        since the read-side critical section now acquires locks.
        This means that the real-time preemption variant of RCU
        is subject to deadlock conditions that "classic" RCU
        is immune to.  This is not just a theoretical concern; for
        example, see nf_hook_slow() in linux/net/core/netfilter.c:

                +       /*
                +        * PREEMPT_RT semantics: different-type read-locks
                +        * dont nest that easily:
                +        */
                +//     rcu_read_lock_read(&ptype_lock);

        A number of other RCU read-side critical sections have been
        similarly disabled (17 total in the patch).  Perhaps most embedded
        systems will not be using netfilter heavily, but this does bring
        up a very real stability concern, even on UP systems (since
        preemption will eventually occur when and where least expected).

o       RCU read-side critical sections cannot always be unconditionally
        upgraded to write-side critical sections.
        For example, in classic RCU, it is perfectly legal to do
        the following:

                rcu_read_lock();
                list_for_each_entry_rcu(p, head, list) {
                        if (p->needs_update) {
                                spin_lock(&update_lock);
                                update_it(p);
                                spin_unlock(&update_lock);
                        }
                }
                rcu_read_unlock();

        This would need to change to the following for real-time
        preempt kernels:

                rcu_read_lock_spin(&update_lock);
                list_for_each_entry_rcu(p, head, list) {
                        if (p->needs_update) {
                                spin_lock(&update_lock);
                                update_it(p);
                                spin_unlock(&update_lock);
                        }
                }
                rcu_read_unlock_spin(&update_lock);

        This results in self-deadlock, since rcu_read_lock_spin() has
        already read-acquired update_lock, which spin_lock() then
        attempts to acquire again.

o       There is an API expansion, with five different variants
        of rcu_read_lock():

        API                         # uses
        ------------------------    ------
        rcu_read_lock_spin()            11
        rcu_read_unlock_spin()          12
        rcu_read_lock_read()            42
        rcu_read_unlock_read()          42
        rcu_read_lock_bh_read()          2
        rcu_read_unlock_bh_read()        3
        rcu_read_lock_down_read()       14
        rcu_read_unlock_up_read()       20
        rcu_read_lock_nort()             3
        rcu_read_unlock_nort()           4

        TOTAL                          153

o       The need to modify lots of RCU code expands the size of this
        patch -- roughly 10% of the 20K lines of this patch are devoted
        to modifying RCU code to meet this new API.  10% may not sound
        like much, but it comes to more than 2,000 lines of context diffs.

Seems to me that it would be good to have an RCU implementation
that meets the requirements of the Real-Time Preemption patch,
but that is 100% compatible with the "classic RCU" API.  Such
an implementation must meet a number of requirements, which are
listed at the end of this message (search for "REQUIREMENTS").

I have looked into a number of seductive but subtly broken "solutions"
to this problem.  The solution listed here is not perfect, but I believe
that it has enough advantages to be worth pursuing.  The solution is
quite simple, and I feel a bit embarrassed that it took me so long
to come up with it.  All I can say in my defense is that the idea
of -adding- locks to improve scalability and eliminate deadlocks is
quite counterintuitive to me.  And, like I said earlier, if you
know of a better approach, please don't keep it a secret!

The following verbiage steps through several variations on this
solution, as follows:

1.      "Toy" implementation that has numerous API, scalability,
        and realtime problems, but is a very simple 28-line
        illustration of the underlying principles.  (In case you
        get excited about this being much smaller than "classic
        RCU", keep in mind that a similar "toy" implementation of
        "classic RCU" is even smaller.  To say nothing of faster
        and more scalable.)

2.      Expanded version of the "toy" implementation that exactly
        matches the "classic RCU" API, but still has realtime and
        scalability problems.

3.      Expanded version of solution #2 that meets realtime
        requirements (at least -I- think it does...).  It still does
        not scale well.

4.      Expanded version of solution #2 that scales, but that does
        not meet realtime requirements.

5.      The final version, which both scales and meets realtime
        requirements, as well as exactly matching the "classic RCU"
        API.

I have tested this approach, but only in user-level scaffolding.  All of
these implementations should therefore be regarded with great suspicion:
they are untested in the kernel and probably don't even compile.  Besides
which, I certainly can't
claim to fully understand the real-time preempt patch, so I am bound
to have gotten something wrong somewhere.  In any case, none of these
implementations are a suitable replacement for "classic RCU" on large
servers, since they acquire locks in the RCU read-side critical sections.
However, they should scale enough to support small SMP systems, inflicting
only a modest performance penalty.

I believe that implementation #5 is most appropriate for real-time
preempt kernels.  Note that #1, #2, and #4 are not at all appropriate for
-any- kernel.  In theory, #3 might be appropriate, but if I understand
the real-time preempt implementation of reader-writer locks, it will
not perform well if there are long RCU read-side critical sections,
even in UP kernels.


------------------------------------------------------------------------

1. "Toy" Implementation.

[Inappropriate for actual use.]

The basic principle may be familiar to people who worked with the Linux
2.4 and early 2.5 networking code.  The RCU read-side critical section
read-acquires a lock, recursively if need be.  This lock is -not- acquired
by the update side, thus preventing the deadlocks caused by the current
real-time preempt implementation of RCU (and thus also deviating from the
aforementioned networking approach).  Instead, the lock is write-acquired
(and then immediately released) before freeing any memory removed from the
data structure.  The lock cannot be write-acquired until all currently
running read-side critical sections complete.  Therefore, the memory
cannot be released until all read-side critical sections have released
any references that they might have to the memory.

        rwlock_t rcu_deferral_lock = RW_LOCK_UNLOCKED;

        void
        rcu_read_lock(void)
        {
                read_lock(&rcu_deferral_lock);
        }

        void
        rcu_read_unlock(void)
        {
                read_unlock(&rcu_deferral_lock);
        }

        void
        synchronize_kernel(void)
        {
                write_lock(&rcu_deferral_lock);
                write_unlock(&rcu_deferral_lock);
        }

        void
        call_rcu(struct rcu_head *head,
                 void (*func)(struct rcu_head *rcu))
        {
                synchronize_kernel();
                func(head);
        }

This is quite small and simple, but before getting too excited, keep
in mind that a similar "toy" implementation of classic RCU would be
even smaller and simpler.
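
For the curious, such a classic-RCU "toy" might look something like the
following.  This is purely illustrative: run_on() is a hypothetical helper
that migrates the caller to the specified CPU (thereby forcing a context
switch there), and the scheme relies on read-side critical sections never
blocking:

        void rcu_read_lock(void) { preempt_disable(); }

        void rcu_read_unlock(void) { preempt_enable(); }

        void
        synchronize_kernel(void)
        {
                int cpu;

                for_each_cpu(cpu) {
                        run_on(cpu);  /* a context switch is a quiescent state. */
                }
        }

Here call_rcu() could be implemented exactly as in the sketch above.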

Note that, just as with classic RCU, the read-side and update-side
critical sections can run concurrently.  This means that the current
rcu_dereference() and rcu_assign_pointer() macros remain just as they
are in the current in-tree RCU implementation.  This is also true of
the other four implementations.

Note also that, just as with classic RCU, there has to be some sort
of mechanism (lock, semaphore, non-blocking synchronization, ...)
that prevents concurrent updates from making a mess.
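
For example, a hypothetical updater guarded by a per-list spinlock might
look as follows (the my_entry, my_lock, my_free_rcu, and my_remove names
are illustrative only):

        struct my_entry {
                struct list_head list;
                struct rcu_head rcu;
        };

        spinlock_t my_lock = SPIN_LOCK_UNLOCKED;

        void my_free_rcu(struct rcu_head *head)
        {
                kfree(container_of(head, struct my_entry, rcu));
        }

        void my_remove(struct my_entry *p)
        {
                spin_lock(&my_lock);             /* exclude other updaters. */
                list_del_rcu(&p->list);          /* readers may still hold references. */
                spin_unlock(&my_lock);
                call_rcu(&p->rcu, my_free_rcu);  /* free only after a grace period. */
        }

Readers traverse the list under rcu_read_lock(), exactly as they would
with classic RCU.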

Quick Quiz 1:  In what way does implementation #1 violate the
        current "classic RCU" API?  (Search for "ANSWER" if you can't
        wait.)

------------------------------------------------------------------------

2. "Toy" Implementation That Exactly Matches the "Classic RCU" API.

[Inappropriate for actual use.]

The following implementation exactly matches the "Classic RCU" API,
but does not meet realtime requirements, nor does it scale.

Quick Quiz 2: Why does implementation #2 not meet realtime requirements?

Quick Quiz 3: Why does implementation #2 fail to scale?

        struct rcu_data {
                long            batch;
                struct rcu_head *waitlist;
                struct rcu_head **waittail;
        };
        struct rcu_ctrlblk {
                rwlock_t        lock;
                long            batch;
        };
        DECLARE_PER_CPU(struct rcu_data, rcu_data);
        struct rcu_ctrlblk rcu_ctrlblk = {
                .lock = RW_LOCK_UNLOCKED,
                .batch = 0,
        };

        void
        rcu_read_lock(void)
        {
                read_lock(&rcu_ctrlblk.lock);
        }

        void
        rcu_read_unlock(void)
        {
                read_unlock(&rcu_ctrlblk.lock);
        }

        void
        synchronize_kernel(void)
        {
                write_lock(&rcu_ctrlblk.lock);
                rcu_ctrlblk.batch++;
                write_unlock(&rcu_ctrlblk.lock);
        }

        void
        call_rcu(struct rcu_head *head,
                 void (*func)(struct rcu_head *rcu))
        {
                unsigned long flags;
                struct rcu_data *rdp;

                head->func = func;
                head->next = NULL;
                local_irq_save(flags);
                rcu_do_my_batch();
                rdp = &__get_cpu_var(rcu_data);
                *rdp->waittail = head;
                rdp->waittail = &head->next;
                local_irq_restore(flags);
        }

        void rcu_do_my_batch(void)
        {
                unsigned long flags;
                struct rcu_data *rdp;
                struct rcu_head *next, *list;

                local_irq_save(flags);
                rdp = &__get_cpu_var(rcu_data);
                smp_mb();  /* prevent sampling batch # before list removal. */
                if (rdp->batch == rcu_ctrlblk.batch) {
                        local_irq_restore(flags);
                        return;
                }
                list = rdp->waitlist;
                rdp->waitlist = NULL;
                rdp->waittail = &rdp->waitlist;
                rdp->batch = rcu_ctrlblk.batch;
                local_irq_restore(flags);
                while (list) {
                        next = list->next;
                        list->func(list);
                        list = next;
                }
        }

        /*
         * Note: rcu_dereference() and rcu_assign_pointer() unchanged
         * from current in-tree implementation.
         */

The trick here is that call_rcu() does not write-acquire the lock,
either directly or indirectly.  This means that there is no way
for it to deadlock with an RCU read-side critical section.
However, the callbacks are still invoked with interrupts disabled,
so this implementation still does not meet realtime requirements.
In addition, it still uses a global reader-writer lock, so it does
not scale, either.

However, it -does- faithfully implement the classic RCU API.
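
In particular, unlike implementation #1, it is once again legal to invoke
call_rcu() from within an RCU read-side critical section, since call_rcu()
never write-acquires the reader-writer lock.  Reusing the illustrative
my_lock and my_free_rcu from earlier:

        rcu_read_lock();
        /* ... traverse the list, decide that entry p must go ... */
        spin_lock(&my_lock);
        list_del_rcu(&p->list);
        spin_unlock(&my_lock);
        call_rcu(&p->rcu, my_free_rcu);  /* no write-acquisition, no deadlock. */
        rcu_read_unlock();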


------------------------------------------------------------------------

3. Meets Realtime Requirements, Doesn't Scale

The trick here is to break the callback processing out into a separate
rcu_process_callbacks() function, so that it may be invoked from a kernel
daemon, a work queue, or whatever.  It must be called periodically.
If the kernel is under memory pressure, one can get immediate relief
by doing the following:

        synchronize_kernel();
        rcu_process_callbacks();

Even more relief can be obtained by using something like
smp_call_function() in order to invoke rcu_process_callbacks() on each
CPU in an SMP system.  Alternatively, one could awaken the kernel daemons
or schedule the function into the appropriate work queue, depending
on how this code fragment is integrated.  Or, perhaps better yet,
restrict use of this approach to UP kernels.  ;-)

        struct rcu_data {
                long            batch;
                struct rcu_head *waitlist;
                struct rcu_head **waittail;
                struct rcu_head *donelist;
                struct rcu_head **donetail;
        };
        struct rcu_ctrlblk {
                rwlock_t        lock;
                long            batch;
        };
        DECLARE_PER_CPU(struct rcu_data, rcu_data);
        struct rcu_ctrlblk rcu_ctrlblk = {
                .lock = RW_LOCK_UNLOCKED,
                .batch = 0,
        };

        void
        rcu_read_lock(void)
        {
                read_lock(&rcu_ctrlblk.lock);
        }

        void
        rcu_read_unlock(void)
        {
                read_unlock(&rcu_ctrlblk.lock);
        }

        void
        synchronize_kernel(void)
        {
                write_lock(&rcu_ctrlblk.lock);
                rcu_ctrlblk.batch++;
                write_unlock(&rcu_ctrlblk.lock);
        }

        void
        call_rcu(struct rcu_head *head,
                 void (*func)(struct rcu_head *rcu))
        {
                unsigned long flags;
                struct rcu_data *rdp;

                head->func = func;
                head->next = NULL;
                local_irq_save(flags);
                rcu_advance_callbacks();
                rdp = &__get_cpu_var(rcu_data);
                *rdp->waittail = head;
                rdp->waittail = &head->next;
                local_irq_restore(flags);
        }

        void rcu_advance_callbacks(void)
        {
                unsigned long flags;
                struct rcu_data *rdp;

                local_irq_save(flags);
                rdp = &__get_cpu_var(rcu_data);
                smp_mb();  /* prevent sampling batch # before list removal. */
                if (rdp->batch != rcu_ctrlblk.batch) {
                        *rdp->donetail = rdp->waitlist;
                        rdp->donetail = rdp->waittail;
                        rdp->waitlist = NULL;
                        rdp->waittail = &rdp->waitlist;
                        rdp->batch = rcu_ctrlblk.batch;
                }
                local_irq_restore(flags);
        }

        void rcu_process_callbacks(void)
        {
                unsigned long flags;
                struct rcu_head *next, *list;
                struct rcu_data *rdp;

                local_irq_save(flags);
                rdp = &__get_cpu_var(rcu_data);
                list = rdp->donelist;
                if (list == NULL) {
                        local_irq_restore(flags);
                        return;
                }
                rdp->donelist = NULL;
                rdp->donetail = &rdp->donelist;
                local_irq_restore(flags);
                while (list) {
                        next = list->next;
                        list->func(list);
                        list = next;
                }
        }

        /*
         * Note: rcu_dereference() and rcu_assign_pointer() unchanged
         * from current in-tree implementation.
         */

Since a single global rwlock is used, this implementation still does
not scale.
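
As for how rcu_process_callbacks() might be "called periodically", one
hypothetical arrangement is a per-CPU daemon along the following lines.
This is a sketch only (the krcud name, the ten-per-second pacing, and the
thread setup are assumptions, not part of the patch), and it assumes that
the thread stays bound to its CPU:

        static int krcud(void *unused)
        {
                for (;;) {
                        set_current_state(TASK_INTERRUPTIBLE);
                        schedule_timeout(HZ / 10);   /* pace grace periods. */
                        synchronize_kernel();        /* advance the global batch. */
                        rcu_advance_callbacks();     /* wait list -> done list. */
                        rcu_process_callbacks();     /* invoke ready callbacks. */
                }
                return 0;
        }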


------------------------------------------------------------------------

4. Scalable Implementation

[Inappropriate for realtime use.]

This variant scales, though not well enough to recommend use on a
mid-range POWER machine, let alone a full-blown Altix.  In addition,
it fails to meet some realtime requirements.

Quick Quiz 4: Why does implementation #4 fail the realtime test?

        #define GRACE_PERIODS_PER_SEC 10

        struct rcu_data {
                rwlock_t        lock;
                long            batch;
                struct rcu_head *waitlist;
                struct rcu_head **waittail;
        };
        struct rcu_ctrlblk {
                long            batch;
        };
        DECLARE_PER_CPU(struct rcu_data, rcu_data);
        struct rcu_ctrlblk rcu_ctrlblk = {
                .batch = 0,
        };

        void
        rcu_read_lock(void)
        {
                read_lock(&__get_cpu_var(rcu_data).lock);
        }

        void
        rcu_read_unlock(void)
        {
                read_unlock(&__get_cpu_var(rcu_data).lock);
        }

        void
        _synchronize_kernel(void)
        {
                int cpu;

                for_each_cpu(cpu) {
                        write_lock(&per_cpu(rcu_data, cpu).lock);
                }
                rcu_ctrlblk.batch++;
                for_each_cpu(cpu) {
                        write_unlock(&per_cpu(rcu_data, cpu).lock);
                }
        }

        void
        synchronize_kernel(void)
        {
                long oldbatch;

                smp_mb();
                oldbatch = rcu_ctrlblk.batch;
                schedule_timeout(HZ/GRACE_PERIODS_PER_SEC);
                if (rcu_ctrlblk.batch == oldbatch) {
                        _synchronize_kernel();
                }
        }

        void
        call_rcu(struct rcu_head *head,
                 void (*func)(struct rcu_head *rcu))
        {
                unsigned long flags;
                struct rcu_data *rdp;

                head->func = func;
                head->next = NULL;
                local_irq_save(flags);
                rcu_do_my_batch();
                rdp = &__get_cpu_var(rcu_data);
                *rdp->waittail = head;
                rdp->waittail = &head->next;
                local_irq_restore(flags);
        }

        void rcu_do_my_batch(void)
        {
                unsigned long flags;
                struct rcu_data *rdp;
                struct rcu_head *next, *list;

                local_irq_save(flags);
                rdp = &__get_cpu_var(rcu_data);
                smp_mb();  /* prevent sampling batch # before list removal. */
                if (rdp->batch == rcu_ctrlblk.batch) {
                        local_irq_restore(flags);
                        return;
                }
                list = rdp->waitlist;
                rdp->waitlist = NULL;
                rdp->waittail = &rdp->waitlist;
                rdp->batch = rcu_ctrlblk.batch;
                local_irq_restore(flags);
                while (list) {
                        next = list->next;
                        list->func(list);
                        list = next;
                }
        }

        /*
         * Note: rcu_dereference() and rcu_assign_pointer() unchanged
         * from current in-tree implementation.
         */

Quick Quiz 5: Why the use of for_each_cpu() rather than (say)
        for_each_online_cpu() in _synchronize_kernel() in
        implementation #4?

Quick Quiz 6: Why aren't there write-side contention problems
        in implementation #4?


------------------------------------------------------------------------

5. Scalability -and- Realtime Response.

The trick here is to record in the task structure which CPU's lock
rcu_read_lock() acquired, along with the read-side nesting depth.  If the
task is preempted and migrates to another CPU, there will be some slight
inefficiency, but the correct lock will still be released, preserving
correctness.
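
This assumes a pair of new fields in the task structure, along the
following (illustrative) lines:

        /* Added to struct task_struct (sketch): */
        int             rcu_read_lock_nesting;  /* # of nested rcu_read_lock()s. */
        rwlock_t        *rcu_read_lock_ptr;     /* rwlock read-held by this task. */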

        #define GRACE_PERIODS_PER_SEC 10

        struct rcu_data {
                rwlock_t        lock;
                long            batch;
                struct rcu_head *waitlist;
                struct rcu_head **waittail;
                struct rcu_head *donelist;
                struct rcu_head **donetail;
        };
        struct rcu_ctrlblk {
                long            batch;
        };
        DECLARE_PER_CPU(struct rcu_data, rcu_data);
        struct rcu_ctrlblk rcu_ctrlblk = {
                .batch = 0,
        };

        void
        rcu_read_lock(void)
        {
                preempt_disable();
                if (current->rcu_read_lock_nesting++ == 0) {
                        current->rcu_read_lock_ptr =
                                &__get_cpu_var(rcu_data).lock;
                        read_lock(current->rcu_read_lock_ptr);
                }
                preempt_enable();
        }

        void
        rcu_read_unlock(void)
        {
                preempt_disable();
                if (--current->rcu_read_lock_nesting == 0) {
                        read_unlock(current->rcu_read_lock_ptr);
                }
                preempt_enable();
        }

        void
        _synchronize_kernel(void)
        {
                int cpu;

                for_each_cpu(cpu) {
                        write_lock(&per_cpu(rcu_data, cpu).lock);
                }
                rcu_ctrlblk.batch++;
                for_each_cpu(cpu) {
                        write_unlock(&per_cpu(rcu_data, cpu).lock);
                }
        }

        void
        synchronize_kernel(void)
        {
                long oldbatch;

                smp_mb();
                oldbatch = rcu_ctrlblk.batch;
                schedule_timeout(HZ/GRACE_PERIODS_PER_SEC);
                if (rcu_ctrlblk.batch == oldbatch) {
                        _synchronize_kernel();
                }
        }

        void
        call_rcu(struct rcu_head *head,
                 void (*func)(struct rcu_head *rcu))
        {
                unsigned long flags;
                struct rcu_data *rdp;

                head->func = func;
                head->next = NULL;
                local_irq_save(flags);
                rcu_advance_callbacks();
                rdp = &__get_cpu_var(rcu_data);
                *rdp->waittail = head;
                rdp->waittail = &head->next;
                local_irq_restore(flags);
        }

        void rcu_advance_callbacks(void)
        {
                unsigned long flags;
                struct rcu_data *rdp;

                local_irq_save(flags);  /* allow invocation from OOM handler. */
                rdp = &__get_cpu_var(rcu_data);
                smp_mb(); /* prevent sampling batch # before list removal. */
                if (rdp->batch != rcu_ctrlblk.batch) {
                        *rdp->donetail = rdp->waitlist;
                        rdp->donetail = rdp->waittail;
                        rdp->waitlist = NULL;
                        rdp->waittail = &rdp->waitlist;
                        rdp->batch = rcu_ctrlblk.batch;
                }
                local_irq_restore(flags);
        }

        void rcu_process_callbacks(void)
        {
                unsigned long flags;
                struct rcu_head *next, *list;
                struct rcu_data *rdp;

                local_irq_save(flags);
                rdp = &__get_cpu_var(rcu_data);
                list = rdp->donelist;
                if (list == NULL) {
                        local_irq_restore(flags);
                        return;
                }
                rdp->donelist = NULL;
                rdp->donetail = &rdp->donelist;
                local_irq_restore(flags);
                while (list) {
                        next = list->next;
                        list->func(list);
                        list = next;
                }
        }

        /*
         * Note: rcu_dereference() and rcu_assign_pointer() unchanged
         * from current in-tree implementation.
         */

This implementation meets all of the requirements, -except- that it
imposes significant (but probably not crippling) synchronization overhead
on RCU read-side critical sections.  This implementation is probably
serviceable on systems with up to 2-4 CPUs, and perhaps a few more.
However, it is not going to cut it on a mid-range POWER system, much
less a high-end Altix system.  So, if you don't need heavy-duty realtime
response, you should stick with "classic RCU".  Or, better yet, come
up with a single RCU implementation that is suited to both realtime and
server usage!
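
Note also that rcu_read_lock() now nests freely, since inner invocations
simply bump the per-task nesting count.  So something like the following
(with gp and do_something_with() being purely illustrative) is once again
legal, just as in classic RCU:

        rcu_read_lock();           /* outermost: read-acquires this CPU's lock. */
        rcu_read_lock();           /* nested: just increments the count. */
        p = rcu_dereference(gp);
        if (p != NULL)
                do_something_with(p);
        rcu_read_unlock();
        rcu_read_unlock();         /* outermost: releases the lock. */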

Quick Quiz 7: What if there is real interrupt or softirq code that
        needs to call rcu_read_lock()?

Quick Quiz 8: What other bugs are present in implementation #5?


------------------------------------------------------------------------

REQUIREMENTS

These are -not- in strict priority order.  Nor can they be, since
different people have different priorities.  ;-)

1. Deferred destruction.

        No data element may be destroyed (for example, freed) while an
        RCU read-side critical section is referencing it.  This property
        absolutely must be implemented, as failing to do so will break
        pretty much all the code that uses RCU.

2. Reliable.

        The implementation must not be prone to gratuitous failure; it
        must be able to run 24x7 for months at a time, and preferably
        for years.

        [This is a problem for so-called "hazard pointer" mechanisms,
        which require that hazard pointers be allocated and freed.
        So what do you do when the allocation fails while nested
        way down in the bowels of some lock's critical section?
        Avoiding this sort of unwind code was a major motivation
        for Jack and me to come up with RCU in the first place!]

3. Callable from IRQ.

        Since RCU is used from softirq state, a realtime implementation
        must either eliminate softirq and interrupt states, eliminate use
        of RCU from these states, or make RCU callable from these states.

        [I believe that the real-time preemption patch moves code
        from softirq/interrupt to process context, but could easily be
        missing something.  If need be, there are ways of handling cases
        where realtime RCU is called from softirq and interrupt contexts;
        for an example approach, see the Quick Quiz answers.]

4. Preemptible read side.

        RCU read-side critical sections can (in theory, anyway) be quite
        large, degrading realtime scheduling response.  Preemptible RCU
        read-side critical sections avoid such degradation.  Manual
        insertion of "preemption points" might be an alternative as well.
        But I have no intention of trying to settle the long-running
        argument between proponents of preemption and of preemption
        points.  Not today, anyway!  ;-)

        [This is one of classic RCU's weak points.  No surprise, as
        it was not designed with aggressive realtime in mind.]

5. Small memory footprint.

        Many realtime systems are configured with modest amounts
        of memory, so it is highly desirable to limit the number of
        outstanding RCU callbacks.  It is also necessary to force
        "reaping" of callbacks should the system run short of memory.

        [This is another of classic RCU's weak points.  No surprise,
        it was not designed with tiny-memory embedded systems in
        mind.]

6. Independent of memory blocks.

        The implementation should not make assumptions about the size
        and extent of the data elements being protected by RCU, since
        such assumptions constrain memory allocation design and can
        impose increased complexity.

        [This requirement makes it difficult to use so-called
        "hazard pointer" techniques.]

7. Synchronization-free read side.

        RCU read-side critical sections should avoid expensive
        synchronization instructions or cache misses.  It is most
        important to avoid global locks and atomic instructions that
        modify global memory, as these operations can inflict severe
        performance and scalability bottlenecks.  Avoiding memory barriers
        is desirable but not as critical.

        [This is where classic RCU shines.  The realtime scheme does not
        have a synchronization-free read side, but implementation #5 at
        least has decent cache locality for read-mostly workloads.]

8. Freely nestable read-side critical sections.

        Restrictions on nestability can result in code duplication,
        so should be avoided if possible.

        [This is where the current real-time preempt RCU implementation
        has -serious- problems.]

9. Unconditional read-to-write upgrade.

        RCU permits a read-side critical section to acquire the
        corresponding write-side lock.  This capability can be convenient
        in some cases.  However, since it is rarely used, it cannot be
        considered mandatory.  Then again, working around the lack of
        such upgrade usually results in code bloat.

        [This is another big issue with the current real-time preempt
        RCU implementation.]

10. Compatible API.

        A realtime RCU implementation should have an API compatible with
        that in the kernel.

        [Another big issue with the current real-time preempt RCU
        implementation.]


------------------------------------------------------------------------

ANSWERS TO QUICK QUIZ

Quick Quiz 1:  In what way does implementation #1 violate the
        current "classic RCU" API?

        Answer: Unlike "classic RCU", it is not legal to invoke
                call_rcu() from within an RCU read-side critical
                section.  Doing so will result in deadlock.

Quick Quiz 2: Why does implementation #2 not meet realtime requirements?

        Answer: Because all the callbacks are invoked with interrupts
                disabled.  If there are a lot of callbacks, realtime
                latency will suffer.

Quick Quiz 3: Why does implementation #2 fail to scale?

        Answer: Because of the large cache-miss penalties of modern
                microprocessors.  See http://linuxjournal.com/article/6993
                for more details on why this is so.

Quick Quiz 4: Why does implementation #4 fail the realtime test?

        Answer: Because the realtime-preempt version of locks can be
                preempted.  Therefore, the rcu_read_unlock() might
                run on a different CPU than did the corresponding
                rcu_read_lock(), resulting in horrible confusion and
                a great decrease in the expected lifespan of your
                kernel.
                
                Also because it reverts to invoking the callbacks
                with interrupts disabled.

Quick Quiz 5: Why the use of for_each_cpu() rather than (say)
        for_each_online_cpu() in _synchronize_kernel()?

        Answer: To completely avoid any races with CPU hotplug.
                Admittedly a large hammer to use, but it does
                have the advantage of simplicity.  Suggestions
                for improvements welcome!

Quick Quiz 6: Why aren't there write-side contention problems
        in implementation #4?

        Answer: Because synchronize_kernel() refuses to acquire
                the lock more than 10 times a second.  Of course,
                if you invoke _synchronize_kernel() frequently
                from an OOM handler, or if you are running on
                a 100-CPU machine, some adjustments may be needed.
                In a -lot- of places in the latter case.

Quick Quiz 7: What if there is real interrupt or softirq code that
        needs to call rcu_read_lock()?

        Answer: Use something like the call_rcu_bh() API.  This could
                have a similar implementation to the ones laid out here,
                but rcu_read_lock_bh() must disable preemption, and
                perhaps also interrupts.  As must _synchronize_kernel().

Quick Quiz 8: What other bugs are present in implementation #5?

        Answer: Beats me!  If I knew, I might have fixed them.  ;-)