On 17.11.2016 19:16, David Miller wrote:
> From: Hannes Frederic Sowa <han...@stressinduktion.org>
> Date: Thu, 17 Nov 2016 18:20:39 +0100
>
>> Hi,
>>
>> On 17.11.2016 17:45, David Miller wrote:
>>> From: Hannes Frederic Sowa <han...@stressinduktion.org>
>>> Date: Thu, 17 Nov 2016 15:36:48 +0100
>>>
>>>> The other way is the journal idea I had, which uses an rb-tree with
>>>> timestamps as keys (can be lamport timestamps). You insert into the
>>>> tree until the dump is finished and use it as a queue later to
>>>> shuffle stuff into the hardware.
>>>
>>> If you have this "place" where pending inserts are stored, you have
>>> a policy decision to make.
>>>
>>> First of all, what do other lookups see when there are pending
>>> entries?
>>
>> I think this is a problem with the current approach already, as the
>> delayed work queue already postpones the insert for an undecidable
>> amount of time (and reorders depending on which CPU the entry was
>> inserted on and the fib notifier was called).
>>
>> For user space queries we would still query the in-kernel table.
>
> Ok, I think I might misunderstand something.
>
> What is going into this journal exactly? The actual full software and
> hardware insert operation, or just the notification to the hardware
> device driver notifiers?

The journal is only used as a time-ordered queue for updating the
hardware in the correct order. The enqueue side is the fib notifier
only. If no fib notifier is registered we don't use this code at all
(and we also don't touch the lock protecting the journal in the fib
insertion/deletion path - the fast in-kernel path is untouched -
otherwise it is just a spin_lock taken under rtnl_lock in the slow
path).

The fib notifier enqueues the event with a timestamp into this journal
and can also merge entries while they are in the queue. E.g. if we get
a delete from the fib notifier while the rcu walk indicated an addition
of the same entry, we can merge the two at that point and, depending on
the timestamps, either remove the entry or drop the deletion event.

We start dequeueing the fib entries into the hardware as soon as the
rcu dump is finished; at that point the queue is up to date with all
events. New events can still be added to the journal (with appropriate
locking) during this time, and since the queue was once in a properly
synced state we stay properly synchronized. After the dump we keep up
with the queue in steady state, so syncing happens ASAP. Maybe we can
also drop the journal then. A minimal sketch of such a journal follows
below.
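Roughly like this (a hypothetical sketch only - fib_journal,
fib_journal_entry, fib_journal_enqueue() and fib_journal_flush() are
made-up names, not existing kernel APIs, and the route key is
simplified to a prefix):

#include <linux/types.h>
#include <linux/rbtree.h>
#include <linux/spinlock.h>
#include <linux/slab.h>

enum fib_journal_op { FIB_J_ADD, FIB_J_DEL };

struct fib_journal_entry {
	struct rb_node		node;
	u64			ts;	/* lamport timestamp, the rb-tree key */
	enum fib_journal_op	op;
	u32			dst;	/* simplified route key */
	u8			dst_len;
};

struct fib_journal {
	struct rb_root	root;	/* initialized to RB_ROOT */
	spinlock_t	lock;	/* only taken in the config path */
};

/* Enqueue side: called from the fib notifier only. A real version
 * would first search for an already queued event for the same prefix
 * and merge with it based on the timestamps (drop the delete, or
 * remove the pending add). */
static void fib_journal_enqueue(struct fib_journal *j,
				struct fib_journal_entry *e)
{
	struct rb_node **p = &j->root.rb_node, *parent = NULL;

	spin_lock(&j->lock);
	while (*p) {
		struct fib_journal_entry *cur;

		parent = *p;
		cur = rb_entry(parent, struct fib_journal_entry, node);
		if (e->ts < cur->ts)
			p = &parent->rb_left;
		else
			p = &parent->rb_right;
	}
	rb_link_node(&e->node, parent, p);
	rb_insert_color(&e->node, &j->root);
	spin_unlock(&j->lock);
}

/* Dequeue side: once the rcu dump is finished, replay the journal
 * oldest-first into the hardware. */
static void fib_journal_flush(struct fib_journal *j)
{
	struct rb_node *n;

	spin_lock(&j->lock);
	while ((n = rb_first(&j->root))) {
		struct fib_journal_entry *e =
			rb_entry(n, struct fib_journal_entry, node);

		rb_erase(n, &j->root);
		/* ... program e->op for e->dst/e->dst_len into hw ... */
		kfree(e);
	}
	spin_unlock(&j->lock);
}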
Something like the described queue is implemented here (I haven't
checked whether it exactly matches the specification; it certainly
provides more features):

https://github.com/bus1/bus1/blob/master/ipc/bus1/util/queue.h
https://github.com/bus1/bus1/blob/master/ipc/bus1/util/queue.c

For this to work the config path needs to add timestamps to the
fib_infos or fib_aliases.

> The "lookup" I'm mostly concerned with is the fast path where the
> packets being processed actually look up a route.

This doesn't change at all. All of this code will be hidden in a
library that gets attached to the fib notifier, which sits in the
configuration code path.

> I do not think we can return success on the insert to the user yet
> have the route lookup dataplane not return that route on a lookup.

We don't change the kernel fast path at all. If we add or delete a
route in software and hardware, the kernel indicates success as soon
as the entry has been added to the software table; the entry is also
queued up in the journal. The journal is processed lazily, and if an
error happens during that (e.g. the hardware signals that its table is
full), abort is called and all packets go through the software path
ASAP. User space will always see the route as it was added in the
first place, and after the driver has called abort the packets are
also processed accordingly.

I can imagine this can get very complicated. David's approach with a
counter to check for modifications and a limited number of retries
probably works, too, especially because the hardware will probably be
initialized before the routing daemons start up and will hopefully
stay synced up the whole time. So maybe this is over-engineered, but I
have no idea how long the hardware needs to sync up e.g. a full IPv4
routing table (if that is actually the current goal of this). See the
P.S. below for a tiny sketch of the counter idea.

Bye,
Hannes
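P.S.: The counter/retry approach as I understand it, as a hypothetical
sketch - fib_seq_read() and fib_dump_to_hw() are made-up stand-ins for
"read a FIB modification counter" and "dump the table into the
hardware", not existing kernel functions:

#include <linux/errno.h>

struct net;

unsigned int fib_seq_read(struct net *net);	/* stand-in */
int fib_dump_to_hw(struct net *net);		/* stand-in */

#define FIB_HW_SYNC_RETRIES	5

static int fib_hw_sync(struct net *net)
{
	int tries = FIB_HW_SYNC_RETRIES;

	while (tries--) {
		unsigned int seq = fib_seq_read(net);
		int err = fib_dump_to_hw(net);

		if (err)
			return err;
		/* If nothing was modified underneath the dump, the
		 * hardware is in sync; otherwise try again. */
		if (fib_seq_read(net) == seq)
			return 0;
	}
	/* Give up: the driver would call abort here and let the
	 * software path handle all routes. */
	return -EBUSY;
}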