On Monday 16 May 2011 14:21:27 John Baldwin wrote:
> On Monday, May 16, 2011 1:46:33 pm Max Laier wrote:
> > I'd like to propose that we move forward with fixing the race, before
> > redesigning the API - please.
> > 
> > We agree that there is a problem with the exit rendezvous.  Let's fix
> > that. We agree that the generation approach proposed by NetAPP is the
> > right way to go?  Can we check it in, then?  Again, I'd like some
> > comments in the code if at all possible.
> 
> How about this:
> 
> Index: subr_smp.c
> ===================================================================
> --- subr_smp.c        (revision 221939)
> +++ subr_smp.c        (working copy)
> @@ -111,6 +111,7 @@ static void (*volatile smp_rv_action_func)(void *a
>  static void (*volatile smp_rv_teardown_func)(void *arg);
>  static void *volatile smp_rv_func_arg;
>  static volatile int smp_rv_waiters[3];
> +static volatile int smp_rv_generation;
> 
>  /*
>   * Shared mutex to restrict busywaits between smp_rendezvous() and
> @@ -311,39 +312,62 @@ restart_cpus(cpumask_t map)
>  void
>  smp_rendezvous_action(void)
>  {
> -     void* local_func_arg = smp_rv_func_arg;
> -     void (*local_setup_func)(void*)   = smp_rv_setup_func;
> -     void (*local_action_func)(void*)   = smp_rv_action_func;
> -     void (*local_teardown_func)(void*) = smp_rv_teardown_func;
> +     void *local_func_arg;
> +     void (*local_setup_func)(void*);
> +     void (*local_action_func)(void*);
> +     void (*local_teardown_func)(void*);
> +     int generation;
> 
>       /* Ensure we have up-to-date values. */
>       atomic_add_acq_int(&smp_rv_waiters[0], 1);
>       while (smp_rv_waiters[0] < smp_rv_ncpus)
>               cpu_spinwait();
> 
> -     /* setup function */
> +     /* Fetch rendezvous parameters after acquire barrier. */
> +     local_func_arg = smp_rv_func_arg;
> +     local_setup_func = smp_rv_setup_func;
> +     local_action_func = smp_rv_action_func;
> +     local_teardown_func = smp_rv_teardown_func;
> +     generation = smp_rv_generation;
> +
> +     /*
> +      * If requested, run a setup function before the main action
> +      * function.  Ensure all CPUs have completed the setup
> +      * function before moving on to the action function.
> +      */
>       if (local_setup_func != smp_no_rendevous_barrier) {
> >               if (local_setup_func != NULL)
> >                       local_setup_func(local_func_arg);
> -
> -             /* spin on entry rendezvous */
>               atomic_add_int(&smp_rv_waiters[1], 1);
>               while (smp_rv_waiters[1] < smp_rv_ncpus)
>                       cpu_spinwait();
>       }
> 
> -     /* action function */
>       if (local_action_func != NULL)
>               local_action_func(local_func_arg);
> 
> -     /* spin on exit rendezvous */
> +     /*
> +      * Signal that the main action has been completed.  If a
> +      * full exit rendezvous is requested, then all CPUs will
> +      * wait here until all CPUs have finished the main action.
> +      *
> +      * Note that the write by the last CPU to finish the action
> +      * may become visible to different CPUs at different times.
> +      * As a result, the CPU that initiated the rendezvous may
> +      * exit the rendezvous and drop the lock allowing another
> +      * rendezvous to be initiated on the same CPU or a different
> +      * CPU.  In that case the exit sentinel may be cleared before
> +      * all CPUs have noticed causing those CPUs to hang forever.
> +      * Workaround this by using a generation count to notice when
> +      * this race occurs and to exit the rendezvous in that case.
> +      */

Maybe add an assertion here while we are at it:

   MPASS(generation == smp_rv_generation);

>       atomic_add_int(&smp_rv_waiters[2], 1);
>       if (local_teardown_func == smp_no_rendevous_barrier)
>                  return;
> -     while (smp_rv_waiters[2] < smp_rv_ncpus)
> +     while (smp_rv_waiters[2] < smp_rv_ncpus &&
> +         generation == smp_rv_generation)
>               cpu_spinwait();
> 
> -     /* teardown function */
>       if (local_teardown_func != NULL)
>               local_teardown_func(local_func_arg);
>  }
> @@ -374,10 +398,11 @@ smp_rendezvous_cpus(cpumask_t map,
>       if (ncpus == 0)
>               panic("ncpus is 0 with map=0x%x", map);
> 
> -     /* obtain rendezvous lock */
>       mtx_lock_spin(&smp_ipi_mtx);
> 
> -     /* set static function pointers */
> +     atomic_add_acq_int(&smp_rv_generation, 1);
> +
> +     /* Pass rendezvous parameters via global variables. */
>       smp_rv_ncpus = ncpus;
>       smp_rv_setup_func = setup_func;
>       smp_rv_action_func = action_func;
> @@ -387,18 +412,25 @@ smp_rendezvous_cpus(cpumask_t map,
>       smp_rv_waiters[2] = 0;
>       atomic_store_rel_int(&smp_rv_waiters[0], 0);
> 
> -     /* signal other processors, which will enter the IPI with interrupts off */
> +     /*
> +      * Signal other processors, which will enter the IPI with
> +      * interrupts off.
> +      */
>       ipi_selected(map & ~(1 << curcpu), IPI_RENDEZVOUS);
> 
>       /* Check if the current CPU is in the map */
>       if ((map & (1 << curcpu)) != 0)
>               smp_rendezvous_action();
> 
> +     /*
> +      * If the caller did not request an exit barrier to be enforced
> +      * on each CPU, ensure that this CPU waits for all the other
> +      * CPUs to finish the rendezvous.
> +      */
>       if (teardown_func == smp_no_rendevous_barrier)
>               while (atomic_load_acq_int(&smp_rv_waiters[2]) < ncpus)
>                       cpu_spinwait();
> 
> -     /* release lock */
>       mtx_unlock_spin(&smp_ipi_mtx);
>  }
> 
> > A second commit should take care of the entry sync - which may or may not
> > be required on some architectures.  If we decide that we don't need it,
> > we should remove it.  Otherwise, we must move all the assignments from
> > global smp_rv_* to the stack in smp_rendezvous_action to after the sync.
> >  Otherwise we should just kill it.  In any case, it would be nice to
> > clean this up.
> 
> The patch also fixes this in the cautious fashion (not removing the early
> rendezvous).

Good idea for now.  We can always remove the sync later, and the assignment 
on declaration is a style(9) violation anyway.

> > After that, I have one more bugfix for rmlock(9) - one of the three
> > consumers of this API (that I am aware of).  As it uses a spinlock
> > inside its IPI action, we have to mark smp_rendezvous_action a critical
> > section otherwise the spin unlock might yield ... with the expected
> > consequences.
> 
> Yes, we need to fix that.  Humm, it doesn't preempt when you do a
> critical_exit() though?  Or do you use a hand-rolled critical exit that
> doesn't do a deferred preemption?

Right now I just did a manual td_critnest++/--, but I guess ...

> Actually, I'm curious how the spin unlock inside the IPI could yield the
> CPU.  Oh, is rmlock doing a wakeup inside the IPI handler?  I guess that is
> ok as long as the critical_exit() just defers the preemption to the end of
> the IPI handler.

... the earliest point where it is safe to preempt is after doing the 

   atomic_add_int(&smp_rv_waiters[2], 1);

so that we can start other IPIs again.  However, since we don't accept new 
IPIs until we signal EOI in the MD code (on amd64), this might still not be a 
good place to do the yield?!?

The spin unlock boils down to a critical_exit() and unless we did a 
critical_enter() at some point during the rendezvous setup, we will yield() if 
we owepreempt.  I'm not quite sure how that can happen, but it seems like 
there is a path that allows the scheduler to set it from a foreign CPU.

> > It is arguable if you should be allowed to use spinlocks in the rendevous
> > at all, but that's beside the point.
> 
> I'm less convinced this is bad since we don't use NMIs for these, so we
> know the interrupted thread doesn't have any spin locks in the scheduler,
> etc.

Agreed.  I'll send a patch as soon as this one's in, but it really is a simple 
matter of adding critical_enter/exit (or the manual version thereof) to the 
right places.

Thanks,
  Max
_______________________________________________
[email protected] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "[email protected]"