On Wed, May 23, 2018 at 7:54 AM, Jon Rosen (jrosen) <jro...@cisco.com> wrote:
>> > For the ring, there is no requirement to allocate exactly the amount
>> > specified by the user request. Safer than relying on shared memory
>> > and simpler than the extra allocation in this patch would be to allocate
>> > extra shadow memory at the end of the ring (and not mmap that).
>> >
>> > That still leaves an extra cold cacheline vs using tp_padding.
>>
>> Given my lack of experience and knowledge in writing kernel code
>> it was easier for me to allocate the shadow ring as a separate
>> structure. Of course it's not about me and my skills so if it's
>> more appropriate to allocate at the tail of the existing ring
>> then certainly I can look at doing that.
>
> The memory for the ring is not one contiguous block, it's an array of
> blocks of pages (or 'order' sized blocks of pages). I don't think
> increasing the size of each of the blocks to provide storage would be
> such a good idea as it will risk spilling over into the next order and
> wasting lots of memory. I suspect it's also more complex than a single
> shadow ring to do both the allocation and the access.
>
> It could be tacked onto the end of the pg_vec[] used to store the
> pointers to the blocks. The challenge with that is that a pg_vec[] is
> created for each of the RX and TX rings, so either it would have to
> allocate unnecessary storage for TX or the caller will have to say if
> extra space should be allocated or not. E.g.:
>
> static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order,
>                                 int scratch, void **scratch_p)
>
> I'm not sure avoiding the extra allocation and moving it to the
> pg_vec[] for the RX ring is going to get the simplification you were
> hoping for. Is there another way of storing the shadow ring which
> I should consider?
I did indeed mean attaching extra pages to pg_vec[]. It should be
simpler than a separate structure, but I may be wrong.

Either way, I still would prefer to avoid the shadow buffer
completely. It incurs complexity and cycle cost on all users solely
because of the rare (non-existent?) consumer that overwrites the
padding bytes.

Perhaps we can use the padding and still avoid deadlock by writing a
timed value. The simplest would be jiffies >> N. Then only a process
that writes this exact value would be subject to drops, and then still
only for a limited period.

Instead of depending on wall clock time, like jiffies, another option
would be to keep a percpu array of values. Each cpu has a zero entry
if it is not writing, nonzero if it is. If a writer encounters a
number in the padding that is > num_cpus, then the value is garbage
written by userspace. If it is <= num_cpus, it is adhered to only
until that cpu clears its entry, which is guaranteed to happen
eventually.

Just a quick thought. This might not fly at all upon closer scrutiny.
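
To make the jiffies variant a bit more concrete, a rough, untested
sketch. All names here are invented, and the u32 pointer merely stands
in for however the padding word in the ring entry header would be
addressed:

  #include <linux/compiler.h>
  #include <linux/jiffies.h>

  #define TP_INUSE_SHIFT	10	/* marker value changes every 2^10 jiffies */

  static u32 tp_inuse_marker(void)
  {
  	/* Keep the marker nonzero so that 0 can still mean "not in use". */
  	return (u32)(jiffies >> TP_INUSE_SHIFT) | 1;
  }

  static void tp_mark_in_use(u32 *pad)
  {
  	WRITE_ONCE(*pad, tp_inuse_marker());
  }

  static void tp_clear_in_use(u32 *pad)
  {
  	WRITE_ONCE(*pad, 0);
  }

  static bool tp_is_in_use(const u32 *pad)
  {
  	/*
  	 * Only the current marker value blocks the producer.  A value
  	 * copied or forged by userspace stops matching once jiffies move
  	 * past the window, so it can cause drops only for a bounded time.
  	 */
  	return READ_ONCE(*pad) == tp_inuse_marker();
  }

The shift would have to be chosen large enough that the marker does not
roll over while a slot is still legitimately marked.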
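
And roughly the same shape for the percpu variant, again with invented
names and only as a sketch, assuming the mark/clear happen with the
ring spin_lock held (BHs off) so the cpu cannot change in between:

  #include <linux/compiler.h>
  #include <linux/cpumask.h>
  #include <linux/percpu.h>
  #include <linux/smp.h>

  static DEFINE_PER_CPU(u32, tp_slot_writer);	/* nonzero while this cpu fills a slot */

  static void tp_mark_in_use(u32 *pad)
  {
  	this_cpu_write(tp_slot_writer, 1);
  	WRITE_ONCE(*pad, smp_processor_id() + 1);	/* 0 is reserved for "free" */
  }

  static void tp_clear_in_use(u32 *pad)
  {
  	WRITE_ONCE(*pad, 0);
  	this_cpu_write(tp_slot_writer, 0);
  }

  static bool tp_is_in_use(const u32 *pad)
  {
  	u32 v = READ_ONCE(*pad);

  	if (!v)
  		return false;		/* nobody marked it */
  	if (v > nr_cpu_ids || !cpu_possible(v - 1))
  		return false;		/* garbage written by userspace, ignore */
  	/*
  	 * A plausible cpu id: honor it only while that cpu still says it
  	 * is writing.  The cpu clears its entry when it finishes, so a
  	 * value copied or forged by userspace cannot stall the producer
  	 * indefinitely.
  	 */
  	return READ_ONCE(per_cpu(tp_slot_writer, v - 1));
  }

The percpu flag is what bounds the damage: whatever userspace writes
into the padding, the producer only defers to it for as long as some
cpu actually claims to be writing.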