Re: [vpp-dev] worker barrier state

Christian Hopps Tue, 05 May 2020 23:20:42 -0700


> On May 4, 2020, at 3:59 AM, Neale Ranns via lists.fd.io 
> <nranns=cisco....@lists.fd.io> wrote:
> 
> 
> Hi Chris,
> 
> With SAs there are two scenarios to consider for inflight packets
> 1) the SA is unlinked
> 2) the SA is deleted.
> 
> We've talked at length about how to deal with 2).
> By 'unlinked' I mean that whatever config dictated that an SA be used has now 
> gone (like tunnel protection or SPD policy). An inflight packet that is 
> processed by an IPSec node would (by either scheme we discussed for 1)) 
> retrieve the SA do encrypt/decrypt and then attempt to send the packet on its 
> merry way; this is the point at which it could fail. I say 'could' because it 
> depends on how the unlink affected the vlib graph. In today's tunnel 
> protection esp_encrpyt does vnet_feature_next(), this is not going to end 
> well once esp_encrypt is no longer a feature on the arc. In tomorrow's tunnel 
> protection we'll change that:
>   https://gerrit.fd.io/r/c/vpp/+/26265
> and it should be safe. But, what if the next API removes the tunnel whilst 
> there are still inflight packets? Is it still safe? Is it still correct to 
> send encrypted tunnelled packets?


It's safe to send in-flight encrypted packets for an SA that was deleted after 
they were encrypted, this reduces to buffering. After the SA is deleted new 
packets won't follow that path (they won't have the index+gen associate with 
them at all b/c of policy). If the user configures things such that packets 
which used to be encrypted should no longer encrypted, well then that's what 
they asked for for *newly arived* packets.

The security issue to be careful of here would be to have in-flight packets 
that progressed through valid policy such that they were "in the tunnel" and 
waiting to be encrypted, and then send them un-encrypted. Hopefully the change 
you reference above does not do that.

> 
> I think I'm coming round to the opinion that the safest way to approach this 
> is to ensure that if the SA can be found, whatever state it is in (created, 
> unlinked or deleted) then it needs to have a flag that states whether it 
> should be used or the packet dropped. We'd update this state when the SA is 
> [un]linked, with the barrier held.

Why does the SA object needs to be kept around? The indexes generation number 
being unequal is enough to say the object is deleted. Per the concern I 
mentioned above, we should make sure the packet doesn't *forget* that its been 
associated with that an index+generation number (i.e., it's in the tunnel), but 
that's orthogonal to whether the SA state itself is still in the pool.

> 
> On a somewhat related topic, you probably saw:
>  https://gerrit.fd.io/r/c/vpp/+/26811
> as an example of getting MP safe APIs wrong.

I made a similar change locally back when we started talking about this. :)

Thanks,
Chris.



> /neale
> 
> On 24/04/2020 16:34, "Christian Hopps" <cho...@chopps.org> wrote:
> 
> 
>    Hi Neale,
> 
>    Comments also inline...
> 
>    Neale Ranns (nranns) <nra...@cisco.com> writes:
> 
>> Hi Chris,
>> 
>> Comments inline...
>> 
>> On 15/04/2020 15:14, "Christian Hopps" <cho...@chopps.org> wrote:
>> 
>>    Hi Neale,
>> 
>>    I agree that something like 4, is probably the correct approach. I had a 
>> side-meeting with some of the ARM folks (Govind and Honnappa), and we 
>> thought using a generation number for the state rather than just waiting 
>> "long-enough" to recycle it could work. The generation number would be the 
>> atomic value associated with the state. So consider this API:
>> 
>>     - MP-safe pools store generation numbers alongside each object.
>>     - When you allocate a new object from the pool you get an index and 
>> generation number.
>>     - When storing the object index you also save the generation number.
>>     - When getting a pointer to the object you pass the API the index and 
>> generation number and it will return NULL if the generation number did not 
>> match the one stored with the object in the pool.
>>     - When you delete a pool object its generation number is incremented 
>> (with barrier).
>> 
>>    The size of the generation number needs to be large enough to guarantee 
>> there is no wrap with objects still in the system that have stored the 
>> generation number. Technically this is a "long-enough" aspect of the scheme. 
>> :) One could imagine using less than 64 bits for the combination of index 
>> and generation, if that was important.
>> 
>> It's a good scheme, I like it.
>> I assume the pool indices would be 64 bit and the separation between vector 
>> index and generation would be hidden from the user. Maybe a 32 bit value 
>> would suffice in most cases, but why skimp...
> 
>    I was thinking to keep the index and generation number separate at the 
> most basic API, to allow for selecting the size of the each independently and 
> for efficient storage. I'm thinking for some applications one might want to 
> do something like
> 
>    cacheline_packed_struct {
>        ...
>        u32 foo_index;
>        u32 bar_index;
>        u16 foo_gen;
>        u16 bar_gen;
>        ...
>    };
> 
>    a set of general purpose macros could be created for combining the 2 
> values into a single integer value though.
> 
>>    The advantage over just waiting N seconds to recycle the index is that 
>> the system scales better, i.e., if you just wait N seconds to reuse, and are 
>> creating and deleting objects at a significant rate, your pool can blow up 
>> in the N seconds of time. With the generation number this is not a problem 
>> as you can re-use the object immediately. Another advantage is that you 
>> don't have to have the timer logic (looping per pool or processing all 
>> pools) to free up old indices.
>> 
>> Yes, for my time based scheme, the size of the pool will become dependent on 
>> some integration over a rate of change, which is not deterministic, which is 
>> not great, but I don't suppose all APIs are subject to large churn.
>> With the generation scheme the pool always requires more memory, since 
>> you're storing a generation value for each index, but being a deterministic 
>> size (even though probably bigger), I'd probably take that.
>> I wouldn't use timer logic in my scheme. I'd make the pool's free-list a 
>> fifo (as opposed to the stack it is today) and each entry in the list has 
>> the index and the time it was added. If t_now - t_head < t_wrap I can pull 
>> from the free-list, else the pool needs to grow.
> 
>    FWIW, I think the space used for the timestamp would be similar to the 
> space used for the generation number, probably it's a wash.
> 
>>    The generation number scheme will still need the thread barrier to 
>> increment the generation number to make sure no-one is using the object in 
>> parallel. But this is a common problem with deleting non-reference-counted 
>> shared state I believe.
>> 
>> I don't think you strictly need the barrier, you can still use a 
>> make-before-break update. One downside of the generation approach is that 
>> nodes that try and fetch the state using the index will get NULL, so the 
>> only option is to drop, as opposed to what the make-before-break change 
>> determined. Mind you, this is probably fine for most practical purposes. 
>> Again if we're talking SAs, then at this point the SA is decoupled from the 
>> graph (i.e. it's no longer protecting the tunnel or it's not linked to a 
>> policy in the SPD), so drop is all we can do anyway.
> 
>    You need the barrier to make sure no in-flight packets are concurrently 
> using the object the index/gen pointed at. Strictly speaking, you could 
> increment the generation number w/o the barrier, but then you have to hold 
> the barrier during free/re-use of the pool object. The beauty of the 
> generation number is that you only hold the barrier while you increment it, 
> and the free/re-use of the object is done outside the barrier.
> 
>> I see it as a trade-off between a cost for every packet forwarded versus how 
>> many may be dropped during API calls. I wouldn't want the scheme employed to 
>> ensure safe delete to affect the overall packet through put - most of the 
>> time I'm not changing the state...
>> 
>> Now we have a few potential schemes in mind, IIRC you focus was on the 
>> deletion of SAs. Can you remind me again what additional state you had 
>> associated with the SA that you needed to deal with.
> 
>    My case currently is extra state that I associate with SAs, so it is 
> effectively the same as the SA state. Part of that state includes packet 
> queues but those also should be covered by the SA index+generation number I 
> believe.
> 
>    My overall architecture concern (for traffic flow security) is that the 
> users use of the SA tunnel (i.e., the inner traffic) must never affect the 
> TFS tunnel (outer traffic) in any measurable way or I have directly leaked 
> information. In isolation (a single SA) the tear-down/bring-up isn't an issue 
> b/c the tunnel is being created or deleted; however, I need to also consider 
> the security implications of multiple SA tunnel use and minimizing (but 
> really eliminating) control plane side-effects on all established/running SAs.
> 
>    I think starting off with a barrier based solution with generation numbers 
> is OK, if we keep in mind that a reference count type solution might be 
> needed if we run into unforeseen issues with the barrier.
> 
>    To make the barrier solution work lossless requires building enough buffer 
> capacity into the system to handle waiting for barrier acquisition (mentioned 
> this in another thread, but nodes need to be steadily handling under 
> VLIB_FRAME_SIZE/2 worth of packets), and then only holding the barrier to 
> increment the generation number (i.e., actually freeing the resources for the 
> SA can happen outside the barrier after the generation number is incremented).
> 
>    Thanks,
>    Chris.
> 
>> 
>> 
>> /neale
>> 
>> 
>>    Thanks,
>>    Chris.
>> 
>>> On Apr 15, 2020, at 5:38 AM, Neale Ranns (nranns) <nra...@cisco.com> wrote:
>>> 
>>> 
>>> Hi Chris,
>>> 
>>> Firstly, apologies for the lengthy delay.
>>> 
>>> When I say 'state' in the following I'm referring to some object[s] that 
>>> are used to forward packets.
>>> 
>>> I'd classify the possible solution space as:
>>> 1) maintain per-packet counters for the state to indicate how many packets 
>>> currently refer to that state.
>>>    Pros; we know exactly when the state is no longer required and can be 
>>> safely removed.
>>>    Cons; significant per-packet cost, similar to maintaining counters. For 
>>> reference, on my [aging] system enabling adjacency counters takes 
>>> ip4-rewrite from 2.52e1 to 3.49e1 clocks. The wait times could be large 
>>> (equivalent to flushing queues).
>>> 2) flush queues; ensure that there are no packets in flight, anywhere, when 
>>> the workers stop at the barrier.
>>>   Pros; It's certainly safe to delete state under these conditions.
>>>   Cons; for handoff this could be known, though the wait time would be 
>>> long. For async crypto HW this may not be knowable and if it is the wait 
>>> times would be large. Either way we may end up waiting for a worst-case 
>>> scenario, which is way longer that actually needed.
>>> 3) epochs; maintain a global epoch; each time an API is called, the epoch 
>>> is bumped. Packets entering the system get stamped with the current epoch. 
>>> If a node sees a packet whose epoch does not match the global one, it is 
>>> dropped.
>>>   Pros: simple scheme, low/negligible DP cost.
>>>   Cons: all inflight packets would be dropped on all API calls, not just 
>>> the packets that would use the state that is being deleted.
>>> 4) MP safe: remove the state with the workers unblocked. This is a 
>>> multi-stage process. Firstly, unlink the state from the lookup 
>>> data-structures so no more packets can find it. Secondly, 'atomically' 
>>> update the state so that packets using it still perform a consistent action 
>>> (probably drop). Thirdly, don't reuse that state (i.e. recycle its pool 
>>> index) until all the inflight packets pass through the system 
>>> (mis-forwarding must be avoided). Make-before-break, if that term means 
>>> anything to you __
>>>   Pros; MP safe is always good, since there's less packet drops. Zero 
>>> per-packet DP cost.
>>>   Cons; it's not easy to get right nor test.
>>> 
>>> IMHO the drawbacks of options 1, 2 & 3 rule them out, which leaves us only 
>>> 4.
>>> 
>>> For option 4, the first and second steps are very much dependent on the 
>>> type of state we're talking about. For SAs for example, unlinking the SA 
>>> from the lookup data-structure is accomplished using a separate API from 
>>> the SA delete*. The final step we can easily accomplish with a new version 
>>> of the pool allocator whose free-list prevents reuse for say 5 seconds (an 
>>> age in DP terms).
>>> 
>>> Thoughts?
>>> 
>>> /neale
>>> 
>>> * I note that a SA delete is already (optimistically) marked MP safe, which 
>>> assumes the system flushes inbetween these API calls.
>>> 
>>> 
>>> 
>>> 
>>> On 26/03/2020 16:09, "Christian Hopps" <cho...@chopps.org> wrote:
>>> 
>>> 
>>> 
>>>> On Mar 25, 2020, at 1:39 PM, Dave Barach via Lists.Fd.Io 
>>>> <dbarach=cisco....@lists.fd.io> wrote:
>>>> 
>>>> Vlib_main_t *vm->main_loop_count.
>>>> 
>>>> One trip around the main loop accounts for all per-worker local graph 
>>>> edges / acyclic graph behaviors.
>>>> 
>>>> As to the magic number E (not to be confused with e): repeatedly handing 
>>>> off packets from thread to thread seems like a bad implementation 
>>>> strategy. The packet tracer will tell you how many handoffs are involved 
>>>> in a certain path, as will a bit of code inspection.
>>> 
>>>   No, it would not be a good implementation strategy. :)
>>> 
>>>   However, I was looking at trying to code this in an upstreamable way, and 
>>> I didn't think I got to make assumptions about how others might wire things 
>>> together. I suppose we could just define a maximum number of handoffs and 
>>> then if users violated that number they would need to increase it?
>>> 
>>>> Neale has some experience with this scenario, maybe he can share some 
>>>> thoughts...
>>> 
>>>   Hoping so. :)
>>> 
>>>   I noticed that crypto engine handoffs were added to the non-dpdk ipsec 
>>> encrypt/decrypt in master, which seems somewhat relevant.
>>> 
>>>   Thanks,
>>>   Chris.
>>> 
> 
> 
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#16242): https://lists.fd.io/g/vpp-dev/message/16242
Mute This Topic: https://lists.fd.io/mt/72542383/21656
Group Owner: vpp-dev+ow...@lists.fd.io
Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub  [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-

Re: [vpp-dev] worker barrier state

Reply via email to