`tc-bpf` ( 8 )

BPF programmable classifier and actions for ingress/egress queueing disciplines
Examples

   eBPF TOOLING
       A full blown example including eBPF agent code can be found
       inside the iproute2 source package under: examples/bpf/

       As prerequisites, the kernel needs to have the eBPF system call,
       namely bpf(2), enabled and must ship with the cls_bpf and act_bpf
       kernel modules for the traffic control subsystem. To enable
       eBPF/cBPF JIT support, depending on which of the two the given
       architecture supports:

           echo 1 > /proc/sys/net/core/bpf_jit_enable

       A given restricted C file can be compiled via LLVM as:

           clang -O2 -emit-llvm -c bpf.c -o - | \
           llc -march=bpf -filetype=obj -o bpf.o

       The compiler invocation might still be simplified in the future,
       so for now it is handy to wrap this construct in one way or
       another, for example:

           __bcc() {
                   clang -O2 -emit-llvm -c "$1" -o - | \
                   llc -march=bpf -filetype=obj -o "$(basename "$1" .c).o"
           }

           alias bcc=__bcc

       A minimal, stand-alone unit, which matches on all traffic with
       the default classid (return code of -1) looks like:

           #include <linux/bpf.h>

           #ifndef __section
           # define __section(x)  __attribute__((section(x), used))
           #endif

           __section("classifier") int cls_main(struct __sk_buff *skb)
           {
                   return -1;
           }

           char __license[] __section("license") = "GPL";

       More examples can be found further below in subsection eBPF
       PROGRAMMING, since the focus here is on tooling.

       There can be various other sections, for example, also for
       actions. Thus, an object file in eBPF can contain multiple entry
       points. A specific entry point must, however, always be specified
       when configuring with tc. A license must be part of the
       restricted C code and the license string syntax is the same as
       with Linux kernel modules. The kernel reserves the right to
       restrict some eBPF helper functions to GPL compatible licenses
       only, and thus may reject a program from loading into the kernel
       when such a license mismatch occurs.

       The resulting object file from the compilation can be inspected
       with the usual set of tools that also operate on normal object
       files, for example objdump(1) for inspecting ELF section headers:

           objdump -h bpf.o
           [...]
           3 classifier    000007f8  0000000000000000  0000000000000000  00000040  2**3
                           CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
           4 action-mark   00000088  0000000000000000  0000000000000000  00000838  2**3
                           CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
           5 action-rand   00000098  0000000000000000  0000000000000000  000008c0  2**3
                           CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
           6 maps          00000030  0000000000000000  0000000000000000  00000958  2**2
                           CONTENTS, ALLOC, LOAD, DATA
           7 license       00000004  0000000000000000  0000000000000000  00000988  2**0
                           CONTENTS, ALLOC, LOAD, DATA
           [...]

       Adding an eBPF classifier from an object file that contains a
       classifier in the default ELF section is trivial (note that
       instead of "object-file" also shortcuts such as "obj" can be
       used):

           bcc bpf.c
           tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1

       In case the classifier resides in ELF section "mycls", then that
       same command needs to be invoked as:

           tc filter add dev em1 parent 1: bpf obj bpf.o sec mycls flowid 1:1

       Dumping the classifier configuration will tell the location of
       the classifier, in other words that it's from object file "bpf.o"
       under section "mycls":

           tc filter show dev em1
           filter parent 1: protocol all pref 49152 bpf
           filter parent 1: protocol all pref 49152 bpf handle 0x1 flowid 1:1 bpf.o:[mycls]

       The same program can also be installed on ingress qdisc side as
       opposed to egress ...

           tc qdisc add dev em1 handle ffff: ingress
           tc filter add dev em1 parent ffff: bpf obj bpf.o sec mycls \
                                    flowid ffff:1

       ... and again dumped from there:

           tc filter show dev em1 parent ffff:
           filter protocol all pref 49152 bpf
           filter protocol all pref 49152 bpf handle 0x1 flowid ffff:1 bpf.o:[mycls]

       Attaching a classifier and action on ingress has the restriction
       that ingress doesn't have an actual underlying queueing
       discipline. What ingress can do is classify, mangle, redirect or
       drop packets. When queueing is required on the ingress side,
       ingress must redirect packets to the ifb device; otherwise,
       policing can be used. Moreover, ingress can be used to have an
       early drop point for unwanted packets before they hit upper
       layers of the networking stack, to perform network accounting
       with eBPF maps that could be shared with egress, or to have an
       early mangle and/or redirection point to different networking
       devices.

       Multiple eBPF actions and classifiers can be placed into a single
       object file within various sections. In that case, non-default
       section names must be provided, which is the case for both
       actions in this example:

           tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1 \
                                    action bpf obj bpf.o sec action-mark \
                                    action bpf obj bpf.o sec action-rand ok

       The advantage of this is that the classifier and the two actions
       can then share eBPF maps with each other, if implemented in the
       programs.

       In order to access eBPF maps from user space beyond tc(8) setup
       lifetime, the ownership can be transferred to an eBPF agent via
       Unix domain sockets. There are two possibilities for implementing
       this:

       1) Implement a custom eBPF agent that takes care of setting up
       the Unix domain socket and implementing the protocol that tc(8)
       dictates. A code example of this can be found inside the iproute2
       source package under: examples/bpf/

       2) Use tc exec for transferring the eBPF map file descriptors
       through a Unix domain socket, and spawning an application such as
       sh(1). The advantage of this approach is that tc will place the
       file descriptors into the environment and thus make them
       available just like the stdin, stdout and stderr file
       descriptors. User applications run from within this fd-owner
       shell can therefore terminate and restart without losing the eBPF
       map file descriptors. Example invocation with the previous
       classifier and action mixture:

           tc exec bpf imp /tmp/bpf
           tc filter add dev em1 parent 1: bpf obj bpf.o exp /tmp/bpf \
                                    flowid 1:1 \
                                    action bpf obj bpf.o sec action-mark \
                                    action bpf obj bpf.o sec action-rand ok

       Assuming that eBPF maps are shared between classifier and
       actions, it's enough to export them once, for example, from
       within the classifier or action command. tc will set up all eBPF
       map file descriptors at the time when the object file is first
       parsed.

       When a shell has been spawned, the environment will have a couple
       of eBPF related variables. BPF_NUM_MAPS provides the total number
       of maps that have been transferred over the Unix domain socket.
       BPF_MAP<X>'s value is the file descriptor number that can be
       accessed in eBPF agent applications, in other words, it can
       directly be used as the file descriptor value for the bpf(2)
       system call to retrieve or alter eBPF map values. <X> denotes the
       identifier of the eBPF map. It corresponds to the id member of
       struct bpf_elf_map from the tc eBPF map specification.

       The environment in this example looks as follows:

           sh# env | grep BPF
               BPF_NUM_MAPS=3
               BPF_MAP1=6
               BPF_MAP0=5
               BPF_MAP2=7
           sh# ls -la /proc/self/fd
               [...]
               lrwx------. 1 root root 64 Apr 14 16:46 5 -> anon_inode:bpf-map
               lrwx------. 1 root root 64 Apr 14 16:46 6 -> anon_inode:bpf-map
               lrwx------. 1 root root 64 Apr 14 16:46 7 -> anon_inode:bpf-map
           sh# my_bpf_agent
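
       As a minimal sketch of how an agent started from such a shell
       might pick up the descriptors, the following user space C code
       reads BPF_NUM_MAPS and BPF_MAP<X> from the environment. The
       helper names agent_num_maps() and agent_map_fd() are
       hypothetical and not part of tc or iproute2; the returned value
       would then be passed as the map file descriptor to bpf(2):

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Return the file descriptor exported as BPF_MAP<id>, or -1 if the
 * variable is missing or malformed. Hypothetical helper name.
 */
int agent_map_fd(unsigned int id)
{
        char name[32];
        const char *val;
        char *end;
        long fd;

        snprintf(name, sizeof(name), "BPF_MAP%u", id);
        val = getenv(name);
        if (!val || !*val)
                return -1;

        fd = strtol(val, &end, 10);
        if (*end != '\0' || fd < 0)
                return -1;

        return (int)fd;
}

/* Return the number of maps announced in BPF_NUM_MAPS, or 0 if unset. */
unsigned int agent_num_maps(void)
{
        const char *val = getenv("BPF_NUM_MAPS");

        return val ? (unsigned int)strtoul(val, NULL, 10) : 0;
}
```

       With the environment shown above, agent_map_fd(1) would yield 6,
       the descriptor of the map whose id member is 1.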

       eBPF agents are very useful in that they can prepopulate eBPF
       maps from user space, monitor statistics via maps and, based on
       that feedback, for example rewrite classids in eBPF map values
       during runtime. Given that eBPF agents are implemented as normal
       applications, they can also dynamically receive traffic control
       policies from external controllers and push them down into eBPF
       maps to adapt to network conditions. Moreover, eBPF maps can also
       be shared with other eBPF program types (e.g. tracing), so very
       powerful combinations can be implemented.

   eBPF PROGRAMMING
       eBPF classifiers and actions are implemented in restricted C
       syntax (in the future, additional language frontends could be
       supported).

       The header file linux/bpf.h provides eBPF helper functions that
       can be called from an eBPF program. This man page only provides
       two minimal, stand-alone examples. Have a look at examples/bpf
       from the iproute2 source package for a fully fledged flow
       dissector example that better demonstrates some of the
       possibilities with eBPF.

       Supported 32 bit classifier return codes from the C program and
       their meanings:
           0, denotes a mismatch
           -1, denotes the default classid configured from the command
           line
           else, everything else will override the default classid to
           provide a facility for non-linear matching

       Supported 32 bit action return codes from the C program and their
       meanings (linux/pkt_cls.h):
           TC_ACT_OK (0), will terminate the packet processing pipeline
           and allow the packet to proceed
           TC_ACT_SHOT (2), will terminate the packet processing
           pipeline and drop the packet
           TC_ACT_UNSPEC (-1), will use the default action configured
           from tc (similar to returning -1 from a classifier)
           TC_ACT_PIPE (3), will iterate to the next action, if
           available
           TC_ACT_RECLASSIFY (1), will terminate the packet processing
           pipeline and start classification from the beginning
           else, everything else is an unspecified return code

       Both classifier and action return codes are supported in eBPF and
       cBPF programs.
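
       The way TC_ACT_PIPE chains actions together can be pictured with
       a small user space model. This is only a sketch and not kernel
       code: the constants are redefined locally with the values listed
       above, run_action_chain() is a hypothetical name, and treating
       every code other than TC_ACT_PIPE as ending the walk is a
       simplifying assumption:

```c
#include <stddef.h>

/* Return codes as listed above (linux/pkt_cls.h), redefined locally. */
enum {
        ACT_UNSPEC     = -1,
        ACT_OK         =  0,
        ACT_RECLASSIFY =  1,
        ACT_SHOT       =  2,
        ACT_PIPE       =  3,
};

/*
 * Walk the verdicts returned by a list of actions in order and return
 * the one that ends the walk: the first verdict other than ACT_PIPE,
 * or ACT_PIPE itself if every action asked to continue.
 */
int run_action_chain(const int *verdicts, size_t n)
{
        size_t i;

        for (i = 0; i < n; i++)
                if (verdicts[i] != ACT_PIPE)
                        return verdicts[i];

        return ACT_PIPE;
}
```

       For a list of actions returning PIPE, PIPE, SHOT, OK in that
       order, the model stops at the third action and the packet is
       dropped; the fourth action is never consulted.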

       To demonstrate restricted C syntax, a minimal toy classifier
       example is provided, which assumes that egress packets, for
       instance originating from a container, have previously been
       marked with a value in the interval [0, 255]. The program keeps
       statistics on different marks for user space and maps the classid
       to the root qdisc with the marking itself as the minor handle:

           #include <stdint.h>
           #include <asm/types.h>

           #include <linux/bpf.h>
           #include <linux/pkt_sched.h>

           #include "helpers.h"

           struct tuple {
                   long packets;
                   long bytes;
           };

           #define BPF_MAP_ID_STATS        1 /* agent's map identifier */
           #define BPF_MAX_MARK            256

           struct bpf_elf_map __section("maps") map_stats = {
                   .type           =       BPF_MAP_TYPE_ARRAY,
                   .id             =       BPF_MAP_ID_STATS,
                   .size_key       =       sizeof(uint32_t),
                   .size_value     =       sizeof(struct tuple),
                   .max_elem       =       BPF_MAX_MARK,
                   .pinning        =       PIN_GLOBAL_NS,
           };

           static inline void cls_update_stats(const struct __sk_buff *skb,
                                               uint32_t mark)
           {
                   struct tuple *tu;

                   tu = bpf_map_lookup_elem(&map_stats, &mark);
                   if (likely(tu)) {
                           __sync_fetch_and_add(&tu->packets, 1);
                           __sync_fetch_and_add(&tu->bytes, skb->len);
                   }
           }

           __section("cls") int cls_main(struct __sk_buff *skb)
           {
                   uint32_t mark = skb->mark;

                   if (unlikely(mark >= BPF_MAX_MARK))
                           return 0;

                   cls_update_stats(skb, mark);

                   return TC_H_MAKE(TC_H_ROOT, mark);
           }

           char __license[] __section("license") = "GPL";
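
       To see which classid the classifier above hands back for a given
       mark, its return value can be reproduced in user space. The
       sketch below mirrors the TC_H_MAKE, TC_H_ROOT and mask
       definitions from linux/pkt_sched.h under local names so that it
       builds stand-alone; the function names are hypothetical:

```c
#include <stdint.h>

/* Values mirrored from linux/pkt_sched.h under local names. */
#define H_MAJ_MASK 0xFFFF0000U
#define H_MIN_MASK 0x0000FFFFU
#define H_ROOT     0xFFFFFFFFU
#define H_MAKE(maj, min) (((maj) & H_MAJ_MASK) | ((min) & H_MIN_MASK))
#define MAX_MARK   256

/* User space model of cls_main(): 0 means mismatch, else the classid. */
uint32_t classid_for_mark(uint32_t mark)
{
        if (mark >= MAX_MARK)
                return 0;

        return H_MAKE(H_ROOT, mark);
}

/* Split a classid into the major and minor numbers of "major:minor". */
uint32_t classid_major(uint32_t classid)
{
        return classid >> 16;
}

uint32_t classid_minor(uint32_t classid)
{
        return classid & H_MIN_MASK;
}
```

       A mark of 5 therefore yields 0xffff0005, that is major ffff and
       minor 5, while a mark of 256 or above yields 0 and is treated as
       a mismatch.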

       Another small example is a port redirector which demuxes
       destination port 80 into the interval [8080, 8087] steered by
       RSS, and which can be attached to the ingress qdisc. Adding the
       egress counterpart and IPv6 support is left as an exercise to
       the reader:

           #include <asm/types.h>
           #include <asm/byteorder.h>

           #include <linux/bpf.h>
           #include <linux/filter.h>
           #include <linux/in.h>
           #include <linux/if_ether.h>
           #include <linux/ip.h>
           #include <linux/tcp.h>

           #include "helpers.h"

           static inline void set_tcp_dport(struct __sk_buff *skb, int nh_off,
                                            __u16 old_port, __u16 new_port)
           {
                   bpf_l4_csum_replace(skb, nh_off + offsetof(struct tcphdr, check),
                                       old_port, new_port, sizeof(new_port));
                   bpf_skb_store_bytes(skb, nh_off + offsetof(struct tcphdr, dest),
                                       &new_port, sizeof(new_port), 0);
           }

           static inline int lb_do_ipv4(struct __sk_buff *skb, int nh_off)
           {
                   __u16 dport, dport_new = 8080, off;
                   __u8 ip_proto, ip_vl;

                   ip_proto = load_byte(skb, nh_off +
                                        offsetof(struct iphdr, protocol));
                   if (ip_proto != IPPROTO_TCP)
                           return 0;

                   ip_vl = load_byte(skb, nh_off);
                   if (likely(ip_vl == 0x45))
                           nh_off += sizeof(struct iphdr);
                   else
                           nh_off += (ip_vl & 0xF) << 2;

                   dport = load_half(skb, nh_off + offsetof(struct tcphdr, dest));
                   if (dport != 80)
                           return 0;

                   off = skb->queue_mapping & 7;
                   set_tcp_dport(skb, nh_off - BPF_LL_OFF, __constant_htons(80),
                                 __cpu_to_be16(dport_new + off));
                   return -1;
           }

           __section("lb") int lb_main(struct __sk_buff *skb)
           {
                   int ret = 0, nh_off = BPF_LL_OFF + ETH_HLEN;

                   if (likely(skb->protocol == __constant_htons(ETH_P_IP)))
                           ret = lb_do_ipv4(skb, nh_off);

                   return ret;
           }

           char __license[] __section("license") = "GPL";
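
       The two pieces of arithmetic in lb_do_ipv4() can be checked in
       isolation. The following user space sketch, with hypothetical
       function names, reproduces the IPv4 header length calculation
       from the version/IHL byte and the choice of the new destination
       port from the queue mapping:

```c
#include <stdint.h>

/*
 * Length of the IPv4 header in bytes, taken from the low nibble (IHL)
 * of the first header byte, which counts 32 bit words.
 */
unsigned int ipv4_header_len(uint8_t ip_vl)
{
        return (unsigned int)(ip_vl & 0xF) << 2;
}

/*
 * New destination port: the low three bits of the queue mapping select
 * one of eight ports starting at 8080.
 */
uint16_t demux_port(uint32_t queue_mapping)
{
        const uint16_t base = 8080;

        return (uint16_t)(base + (queue_mapping & 7));
}
```

       A first byte of 0x45 gives a 20 byte header, which is the fast
       path in the example, and queue mappings 0 through 7 map to ports
       8080 through 8087, wrapping around for higher queue numbers.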

       The helper header file helpers.h used in both examples is:

           /* Misc helper macros. */
           #define __section(x) __attribute__((section(x), used))
           #define offsetof(x, y) __builtin_offsetof(x, y)
           #define likely(x) __builtin_expect(!!(x), 1)
           #define unlikely(x) __builtin_expect(!!(x), 0)

           /* Object pinning settings */
           #define PIN_NONE       0
           #define PIN_OBJECT_NS  1
           #define PIN_GLOBAL_NS  2

           /* ELF map definition */
           struct bpf_elf_map {
               __u32 type;
               __u32 size_key;
               __u32 size_value;
               __u32 max_elem;
               __u32 flags;
               __u32 id;
               __u32 pinning;
               __u32 inner_id;
               __u32 inner_idx;
           };

           /* Some used BPF function calls. */
           static int (*bpf_skb_store_bytes)(void *ctx, int off, void *from,
                                             int len, int flags) =
                 (void *) BPF_FUNC_skb_store_bytes;
           static int (*bpf_l4_csum_replace)(void *ctx, int off, int from,
                                             int to, int flags) =
                 (void *) BPF_FUNC_l4_csum_replace;
           static void *(*bpf_map_lookup_elem)(void *map, void *key) =
                 (void *) BPF_FUNC_map_lookup_elem;

           /* Some used BPF intrinsics. */
           unsigned long long load_byte(void *skb, unsigned long long off)
               asm ("llvm.bpf.load.byte");
           unsigned long long load_half(void *skb, unsigned long long off)
               asm ("llvm.bpf.load.half");

       As a best practice, we recommend having only a single eBPF
       classifier loaded in tc and performing all necessary matching and
       mangling from there, instead of a list of individual classifiers
       and separate actions. A single classifier tailored for a given
       use-case will be most efficient to run.

   eBPF DEBUGGING
       Both tc filter and action commands for bpf support an optional
       verbose parameter that can be used to inspect the eBPF verifier
       log. It is dumped by default in case of an error.

       In case the eBPF/cBPF JIT compiler has been enabled, it can also
       be instructed to emit a debug output of the resulting opcode
       image into the kernel log, which can be read via dmesg(1):

           echo 2 > /proc/sys/net/core/bpf_jit_enable

       The Linux kernel source tree additionally ships a small helper
       called bpf_jit_disasm under tools/net/ that reads out the opcode
       image dump from the kernel log and dumps the resulting
       disassembly:

           bpf_jit_disasm -o

       Other than that, the Linux kernel also contains an extensive
       eBPF/cBPF test suite module called test_bpf. Upon ...

           modprobe test_bpf

       ... it runs a variety of test cases and dumps the results into
       the kernel log, which can be inspected with dmesg(1). The results
       can differ depending on whether the JIT compiler is enabled or
       not. If any test case fails, the module will fail to load. In
       such cases, we urge you to file a bug report to the related JIT
       authors and the Linux kernel and networking mailing lists.

   cBPF
       Although we generally recommend switching to implementing eBPF
       classifier and actions, for the sake of completeness, a few words
       on how to program in cBPF follow here.

       As with eBPF, the bpf_jit_enable switch can be enabled as
       mentioned already. Tooling such as bpf_jit_disasm also works
       independently of whether eBPF or cBPF code is being loaded.

       Unlike in eBPF, classifier and action are not implemented in
       restricted C, but rather in a minimal assembler-like language or
       with the help of other tooling.

       The raw interface with tc takes opcodes directly. For example,
       the most minimal classifier matching on every packet resulting in
       the default classid of 1:1 looks like:

           tc filter add dev em1 parent 1: bpf bytecode '1,6 0 0 4294967295,' flowid 1:1

       The first decimal of the bytecode sequence denotes the number of
       subsequent 4-tuples of cBPF opcodes. Each such 4-tuple consists
       of the decimals c t f k, where c represents the cBPF opcode, t
       the jump true offset target, f the jump false offset target and
       k the immediate constant/literal. Here, this denotes an
       unconditional return from the program with immediate value of -1.
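
       To make the format concrete, the following user space sketch
       splits such a bytecode string back into its 4-tuples. It is an
       illustration only and not how tc parses the argument; struct
       cbpf_insn and cbpf_parse() are hypothetical names:

```c
#include <stdint.h>
#include <stdio.h>

struct cbpf_insn {
        uint16_t code;  /* c: cBPF opcode */
        uint8_t  jt;    /* t: jump true offset */
        uint8_t  jf;    /* f: jump false offset */
        uint32_t k;     /* k: immediate constant/literal */
};

/*
 * Parse "<n>,c t f k,c t f k,..." into out[]. Returns the number of
 * instructions parsed, or -1 on malformed input or if n exceeds max.
 */
int cbpf_parse(const char *s, struct cbpf_insn *out, unsigned int max)
{
        unsigned int n, c, t, f, k, i;
        int used = -1;

        /* Leading instruction count followed by a comma. */
        if (sscanf(s, "%u,%n", &n, &used) != 1 || used < 0 || n > max)
                return -1;
        s += used;

        for (i = 0; i < n; i++) {
                used = -1;
                if (sscanf(s, "%u %u %u %u%n", &c, &t, &f, &k, &used) != 4 ||
                    used < 0)
                        return -1;
                s += used;
                if (*s == ',')  /* separator or optional trailing comma */
                        s++;

                out[i].code = (uint16_t)c;
                out[i].jt = (uint8_t)t;
                out[i].jf = (uint8_t)f;
                out[i].k = k;
        }

        return (int)n;
}
```

       Applied to '1,6 0 0 4294967295,' this yields one instruction
       with opcode 6 (BPF_RET | BPF_K), both jump offsets 0, and k equal
       to 0xffffffff, which is -1 when read as a signed 32 bit value.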

       For egress classification, Willem de Bruijn implemented a minimal
       stand-alone helper tool under the GNU General Public License
       version 2 for the iptables(8) BPF extension, which abuses the
       libpcap internal classic BPF compiler. His code, adapted here for
       usage with tc(8):

           #include <pcap.h>
           #include <stdio.h>

           int main(int argc, char **argv)
           {
                   struct bpf_program prog;
                   struct bpf_insn *ins;
                   int i, ret, dlt = DLT_RAW;

                   if (argc < 2 || argc > 3)
                           return 1;
                   if (argc == 3) {
                           dlt = pcap_datalink_name_to_val(argv[1]);
                           if (dlt == -1)
                                   return 1;
                   }

                   ret = pcap_compile_nopcap(-1, dlt, &prog, argv[argc - 1],
                                             1, PCAP_NETMASK_UNKNOWN);
                   if (ret)
                           return 1;

                   printf("%d,", prog.bf_len);
                   ins = prog.bf_insns;

                   for (i = 0; i < prog.bf_len - 1; ++ins, ++i)
                           printf("%u %u %u %u,", ins->code,
                                  ins->jt, ins->jf, ins->k);
                   printf("%u %u %u %u",
                          ins->code, ins->jt, ins->jf, ins->k);

                   pcap_freecode(&prog);
                   return 0;
           }

       Given this small helper, any tcpdump(8) filter expression can be
       abused as a classifier where a match will result in the default
       classid:

           bpftool EN10MB 'tcp[tcpflags] & tcp-syn != 0' > /var/bpf/tcp-syn
           tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn \
                                    flowid 1:1

       Basically, such a minimal generator is equivalent to:

           tcpdump -iem1 -ddd 'tcp[tcpflags] & tcp-syn != 0' | \
           tr '\n' ',' > /var/bpf/tcp-syn

       Since libpcap does not support all Linux-specific cBPF extensions
       in its compiler, the Linux kernel also ships a minimal BPF
       assembler called bpf_asm under tools/net/ for providing full
       control. For detailed syntax and semantics on implementing such
       programs by hand, see references under FURTHER READING.

       Trivial toy example in bpf_asm for classifying IPv4/TCP packets,
       saved in a text file called foobar:

           ldh [12]
           jne #0x800, drop
           ldb [23]
           jneq #6, drop
           ret #-1
           drop: ret #0

       Similarly, such a classifier can be loaded as:

           bpf_asm foobar > /var/bpf/tcp-syn
           tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn \
                                    flowid 1:1

       For BPF classifiers, the Linux kernel additionally provides a
       small BPF debugger called bpf_dbg under tools/net/, which can be
       used to test a classifier against pcap files, single-step through
       the classifier program, add breakpoints and dump register
       contents during runtime.

       Implementing an action in classic BPF is rather limited in the
       sense that packet mangling is not supported. Therefore, it's
       generally recommended to make the switch to eBPF, whenever
       possible.
Source text at man7.org