On 04/17/2018 04:34 PM, Quentin Monnet wrote:
> Add documentation for eBPF helper functions to bpf.h user header file.
> This documentation can be parsed with the Python script provided in
> another commit of the patch series, in order to provide an RST document
> that can later be converted into a man page.
> 
> The objective is to make the documentation easily understandable and
> accessible to all eBPF developers, including beginners.
> 
> This patch contains descriptions for the following helper functions, all
> written by Alexei:
> 
> - bpf_map_lookup_elem()
> - bpf_map_update_elem()
> - bpf_map_delete_elem()
> - bpf_probe_read()
> - bpf_ktime_get_ns()
> - bpf_trace_printk()
> - bpf_skb_store_bytes()
> - bpf_l3_csum_replace()
> - bpf_l4_csum_replace()
> - bpf_tail_call()
> - bpf_clone_redirect()
> 
> v3:
> - bpf_map_lookup_elem(): Fix description of restrictions for flags
>   related to the existence of the entry.
> - bpf_trace_printk(): State that trace_pipe can be configured. Fix
>   return value in case an unknown format specifier is met. Add a note on
>   kernel log notice when the helper is used. Edit example.
> - bpf_tail_call(): Improve comment on stack inheritance.
> - bpf_clone_redirect(): Improve description of BPF_F_INGRESS flag.
> 
> Cc: Alexei Starovoitov <a...@kernel.org>
> Signed-off-by: Quentin Monnet <quentin.mon...@netronome.com>

Thanks for doing all this work, Quentin!

Just some small improvements while reading over it:

> ---
>  include/uapi/linux/bpf.h | 210 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 210 insertions(+)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 45f77f01e672..02b7d522b3c0 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -381,6 +381,216 @@ union bpf_attr {
>   * intentional, removing them would break paragraphs for rst2man.
>   *
>   * Start of BPF helper function descriptions:
> + *
> + * void *bpf_map_lookup_elem(struct bpf_map *map, void *key)

const void *key

> + *   Description
> + *           Perform a lookup in *map* for an entry associated to *key*.
> + *   Return
> + *           Map value associated to *key*, or **NULL** if no entry was
> + *           found.
> + *
> + * int bpf_map_update_elem(struct bpf_map *map, void *key, void *value, u64 flags)

const void *key, const void *value

> + *   Description
> + *           Add or update the value of the entry associated to *key* in
> + *           *map* with *value*. *flags* is one of:
> + *
> + *           **BPF_NOEXIST**
> + *                   The entry for *key* must not exist in the map.
> + *           **BPF_EXIST**
> + *                   The entry for *key* must already exist in the map.
> + *           **BPF_ANY**
> + *                   No condition on the existence of the entry for *key*.
> + *
> + *           Flag value **BPF_NOEXIST** cannot be used for maps of types
> + *           **BPF_MAP_TYPE_ARRAY** or **BPF_MAP_TYPE_PERCPU_ARRAY** (all
> + *           elements always exist); the helper would return an error.
> + *   Return
> + *           0 on success, or a negative error in case of failure.
> + *
> + * int bpf_map_delete_elem(struct bpf_map *map, void *key)

const void *key

> + *   Description
> + *           Delete entry with *key* from *map*.
> + *   Return
> + *           0 on success, or a negative error in case of failure.
> + *
> + * int bpf_probe_read(void *dst, u32 size, const void *src)
> + *   Description
> + *           For tracing programs, safely attempt to read *size* bytes from
> + *           address *src* and store the data in *dst*.
> + *   Return
> + *           0 on success, or a negative error in case of failure.
> + *
> + * u64 bpf_ktime_get_ns(void)
> + *   Description
> + *           Return the time elapsed since system boot, in nanoseconds.
> + *   Return
> + *           Current *ktime*.
> + *
> + * int bpf_trace_printk(const char *fmt, u32 fmt_size, ...)
> + *   Description
> + *           This helper is a "printk()-like" facility for debugging. It
> + *           prints a message defined by format *fmt* (of size *fmt_size*)
> + *           to file *\/sys/kernel/debug/tracing/trace* from DebugFS, if
> + *           available. It can take up to three additional **u64**
> + *           arguments (as for eBPF helpers in general, the total number
> + *           of arguments is limited to five).
> + *
> + *           Each time the helper is called, it appends a line to the trace.
> + *           The format of the trace is customizable, and the exact output
> + *           one will get depends on the options set in
> + *           *\/sys/kernel/debug/tracing/trace_options* (see also the
> + *           *README* file under the same directory). However, it usually
> + *           defaults to something like:
> + *
> + *           ::
> + *
> + *                   telnet-470   [001] .N.. 419421.045894: 0x00000001: <formatted msg>
> + *
> + *           In the above:
> + *
> + *                   * ``telnet`` is the name of the current task.
> + *                   * ``470`` is the PID of the current task.
> + *                   * ``001`` is the CPU number on which the task is
> + *                     running.
> + *                   * In ``.N..``, each character refers to a set of
> + *                     options (whether irqs are enabled, scheduling
> + *                     options, whether hard/softirqs are running, level of
> + *                     preempt_disabled respectively). **N** means that
> + *                     **TIF_NEED_RESCHED** and **PREEMPT_NEED_RESCHED**
> + *                     are set.
> + *                   * ``419421.045894`` is a timestamp.
> + *                   * ``0x00000001`` is a fake value used by BPF for the
> + *                     instruction pointer register.
> + *                   * ``<formatted msg>`` is the message formatted with
> + *                     *fmt*.
> + *
> + *           The conversion specifiers supported by *fmt* are similar to,
> + *           but more limited than, those of printk(). They are **%d**, **%i**,
> + *           **%u**, **%x**, **%ld**, **%li**, **%lu**, **%lx**, **%lld**,
> + *           **%lli**, **%llu**, **%llx**, **%p**, **%s**. No modifier (size
> + *           of field, padding with zeroes, etc.) is available, and the
> + *           helper will return **-EINVAL** (but print nothing) if it
> + *           encounters an unknown specifier.
> + *
> + *           Also, note that **bpf_trace_printk**\ () is slow, and should
> + *           only be used for debugging purposes. For this reason, a notice
> + *           block (spanning several lines) stating that this helper should
> + *           not be used "for production use" is printed to the kernel logs
> + *           the first time the helper is used (or more precisely, when
> + *           **trace_printk**\ () buffers are allocated). For passing values
> + *           to user space, perf events should be preferred.
> + *   Return
> + *           The number of bytes written to the buffer, or a negative error
> + *           in case of failure.
> + *
> + * int bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len, u64 flags)
> + *   Description
> + *           Store *len* bytes from address *from* into the packet
> + *           associated to *skb*, at *offset*. *flags* are a combination of
> + *           **BPF_F_RECOMPUTE_CSUM** (automatically recompute the
> + *           checksum for the packet after storing the bytes) and
> + *           **BPF_F_INVALIDATE_HASH** (set *skb*\ **->hash**, *skb*\
> + *           **->swhash** and *skb*\ **->l4hash** to 0).
> + *
> + *           A call to this helper is susceptible to change data from the
> + *           packet. Therefore, at load time, all checks on pointers

Perhaps: "to change the underlying packet buffer". It is not the data itself
that changes, but a potential reallocation (e.g. to unclone the skb) which
would otherwise lead to a use-after-free if the verifier did not force the
pointers to be invalidated.

> + *           previously done by the verifier are invalidated and must be
> + *           performed again.

I would add something like "if used in combination with direct packet access",
since that is the only place where this comment is relevant.

> + *   Return
> + *           0 on success, or a negative error in case of failure.
> + *
> + * int bpf_l3_csum_replace(struct sk_buff *skb, u32 offset, u64 from, u64 to, u64 size)
> + *   Description
> + *           Recompute the IP checksum for the packet associated to *skb*.

Perhaps we could say something like 'L3 (e.g. IP)'. The helper itself has zero
knowledge of the underlying protocol used. It's solely replacing the csum and
updating skb->csum when necessary.

> + *           Computation is incremental, so the helper must know the former
> + *           value of the header field that was modified (*from*), the new
> + *           value of this field (*to*), and the number of bytes (2 or 4)
> + *           for this field, stored in *size*. Alternatively, it is possible
> + *           to store the difference between the previous and the new values
> + *           of the header field in *to*, by setting *from* and *size* to 0.

I would add that this works in combination with the csum_diff() helper, which
allows for more flexibility and can handle sizes larger than 2 or 4.

> + *           For both methods, *offset* indicates the location of the IP
> + *           checksum within the packet.
> + *
> + *           A call to this helper is susceptible to change data from the
> + *           packet. Therefore, at load time, all checks on pointers
> + *           previously done by the verifier are invalidated and must be
> + *           performed again.
> + *   Return
> + *           0 on success, or a negative error in case of failure.
> + *
> + * int bpf_l4_csum_replace(struct sk_buff *skb, u32 offset, u64 from, u64 to, u64 flags)
> + *   Description
> + *           Recompute the TCP or UDP checksum for the packet associated to

See prior comment, L4 (e.g. TCP, UDP or ICMP).

> + *           *skb*. Computation is incremental, so the helper must know the
> + *           former value of the header field that was modified (*from*),
> + *           the new value of this field (*to*), and the number of bytes (2
> + *           or 4) for this field, stored in the lowest four bits of
> + *           *flags*. Alternatively, it is possible to store the difference
> + *           between the previous and the new values of the header field in
> + *           *to*, by setting *from* and the four lowest bits of *flags* to

Same reference for csum_diff().

> + *           0. For both methods, *offset* indicates the location of the
> + *           checksum within the packet. In addition to the size of the
> + *           field, actual flags can be added (with a bitwise OR) to
> + *           *flags*. With
> + *           **BPF_F_MARK_MANGLED_0**, a null checksum is left untouched
> + *           (unless **BPF_F_MARK_ENFORCE** is added as well), and for
> + *           updates resulting in a null checksum the value is set to
> + *           **CSUM_MANGLED_0** instead. Flag **BPF_F_PSEUDO_HDR**
> + *           indicates the checksum is to be computed against a
> + *           pseudo-header.
> + *
> + *           A call to this helper is susceptible to change data from the
> + *           packet. Therefore, at load time, all checks on pointers
> + *           previously done by the verifier are invalidated and must be
> + *           performed again.
> + *   Return
> + *           0 on success, or a negative error in case of failure.
> + *
> + * int bpf_tail_call(void *ctx, struct bpf_map *prog_array_map, u32 index)
> + *   Description
> + *           This special helper is used to trigger a "tail call", or in
> + *           other words, to jump into another eBPF program. The same stack
> + *           frame is used (but values on stack and in registers for the
> + *           caller are not accessible to the callee). This mechanism allows
> + *           for program chaining, either for raising the maximum number of
> + *           available eBPF instructions, or to execute given programs in
> + *           conditional blocks. For security reasons, there is an upper
> + *           limit to the number of successive tail calls that can be
> + *           performed.
> + *
> + *           Upon call of this helper, the program attempts to jump into a
> + *           program referenced at index *index* in *prog_array_map*, a
> + *           special map of type **BPF_MAP_TYPE_PROG_ARRAY**, and passes
> + *           *ctx*, a pointer to the context.
> + *
> + *           If the call succeeds, the kernel immediately runs the first
> + *           instruction of the new program. This is not a function call,
> + *           and it never goes back to the previous program. If the call

Nit: s/goes back/returns/

> + *           fails, then the helper has no effect, and the caller continues
> + *           to run its own instructions. A call can fail if the destination

Maybe: s/to run its own/to run its subsequent/

> + *           program for the jump does not exist (i.e. *index* is greater
> + *           than or equal to the number of entries in *prog_array_map*),
> + *           or if the
> + *           maximum number of tail calls has been reached for this chain of
> + *           programs. This limit is defined in the kernel by the macro
> + *           **MAX_TAIL_CALL_CNT** (not accessible to user space), which
> + *           is currently set to 32.
> + *   Return
> + *           0 on success, or a negative error in case of failure.
> + *
> + * int bpf_clone_redirect(struct sk_buff *skb, u32 ifindex, u64 flags)
> + *   Description
> + *           Clone and redirect the packet associated to *skb* to another
> + *           net device of index *ifindex*. Both ingress and egress
> + *           interfaces can be used for redirection. The **BPF_F_INGRESS**
> + *           value in *flags* is used to make the distinction (ingress path
> + *           is selected if the flag is present, egress path otherwise).
> + *           This is the only flag supported for now.

Would probably make sense to describe the relation to the bpf_redirect() helper
in one sentence: bpf_clone_redirect() has the associated cost of duplicating
the skb, but the redirect is carried out directly from within the BPF prog,
whereas bpf_redirect() is more efficient (no duplication) but is handled
through an action code, with the actual redirection only happening after the
BPF prog has returned.

> + *           A call to this helper is susceptible to change data from the
> + *           packet. Therefore, at load time, all checks on pointers
> + *           previously done by the verifier are invalidated and must be
> + *           performed again.
> + *   Return
> + *           0 on success, or a negative error in case of failure.
>   */
>  #define __BPF_FUNC_MAPPER(FN)                \
>       FN(unspec),                     \
> 
