.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.

=====================
BPF_MAP_TYPE_LPM_TRIE
=====================

.. note::
   - ``BPF_MAP_TYPE_LPM_TRIE`` was introduced in kernel version 4.11

``BPF_MAP_TYPE_LPM_TRIE`` provides a longest prefix match algorithm that
can be used to match IP addresses to a stored set of prefixes.
Internally, data is stored in an unbalanced trie of nodes that uses
``prefixlen,data`` pairs as its keys. The ``data`` is interpreted in
network byte order, i.e. big endian, so ``data[0]`` stores the most
significant byte.
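
As a minimal sketch of the byte-order requirement, the helper below fills a
key from an IPv4 address held as a host-order integer. The helper name
``set_ipv4_key()`` is hypothetical, and the key layout mirrors the
``struct ipv4_lpm_key`` used in the examples later in this document, written
with ``<stdint.h>`` types so that it builds outside the kernel tree:

.. code-block:: c

    #include <arpa/inet.h>
    #include <assert.h>
    #include <stdint.h>

    /* Same layout as the IPv4 key in the Examples section */
    struct ipv4_lpm_key {
            uint32_t prefixlen;
            uint32_t data;          /* address in network byte order */
    };

    /* Hypothetical helper: host_addr is in host byte order,
     * e.g. 0xC0A80000 for 192.168.0.0.
     */
    static void set_ipv4_key(struct ipv4_lpm_key *key, uint32_t host_addr,
                             uint32_t prefixlen)
    {
            assert(key && prefixlen <= 32);
            key->prefixlen = prefixlen;
            /* htonl() puts the most significant byte first in memory */
            key->data = htonl(host_addr);
    }

After ``set_ipv4_key(&key, 0xC0A80000, 16)``, the first byte of ``key.data``
in memory is ``192`` on both little endian and big endian hosts.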

LPM tries may be created with a maximum prefix length that is a multiple
of 8, in the range from 8 to 2048. The key used for lookup and update
operations is a ``struct bpf_lpm_trie_key_u8``, extended by
``max_prefixlen/8`` bytes.

- For IPv4 addresses the data length is 4 bytes
- For IPv6 addresses the data length is 16 bytes
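
The sizes above can be sketched as plain C structures. The struct names and
the ``lpm_key_size()`` helper are illustrative assumptions, not kernel UAPI;
they only show that a key is a 4-byte ``prefixlen`` followed by
``max_prefixlen/8`` bytes of data:

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical key layouts, not kernel UAPI */
    struct ipv4_lpm_key_bytes {
            uint32_t prefixlen;     /* significant bits, 0..32 */
            uint8_t  data[4];       /* most significant byte first */
    };

    struct ipv6_lpm_key_bytes {
            uint32_t prefixlen;     /* significant bits, 0..128 */
            uint8_t  data[16];
    };

    /* Key size in bytes for a given maximum prefix length in bits */
    static size_t lpm_key_size(uint32_t max_prefixlen)
    {
            assert(max_prefixlen >= 8 && max_prefixlen <= 2048 &&
                   max_prefixlen % 8 == 0);
            return sizeof(uint32_t) + max_prefixlen / 8;
    }

With these definitions, ``lpm_key_size(32)`` equals
``sizeof(struct ipv4_lpm_key_bytes)`` and ``lpm_key_size(128)`` equals
``sizeof(struct ipv6_lpm_key_bytes)``.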

The value type stored in the LPM trie can be any user defined type.

.. note::
   When creating a map of type ``BPF_MAP_TYPE_LPM_TRIE`` you must set the
   ``BPF_F_NO_PREALLOC`` flag.

Usage
=====

Kernel BPF
----------

bpf_map_lookup_elem()
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

   void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)

The longest prefix entry for a given data value can be found using the
``bpf_map_lookup_elem()`` helper. This helper returns a pointer to the
value associated with the longest matching ``key``, or ``NULL`` if no
entry was found.

The ``key`` should have ``prefixlen`` set to ``max_prefixlen`` when
performing longest prefix lookups. For example, when searching for the
longest prefix match for an IPv4 address, ``prefixlen`` should be set to
``32``.
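
To illustrate what "longest matching" means, the following userspace sketch
emulates the lookup result with a linear scan over a small table. It is not
how the kernel implements the trie, and ``struct lpm_entry``,
``prefix_matches()`` and ``lpm_lookup()`` are hypothetical names used only
for this illustration:

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical stored prefix: prefixlen bits of data are significant */
    struct lpm_entry {
            uint32_t prefixlen;
            uint8_t  data[4];
    };

    /* True if the first e->prefixlen bits of addr equal those of e->data */
    static int prefix_matches(const struct lpm_entry *e, const uint8_t addr[4])
    {
            uint32_t full = e->prefixlen / 8, rem = e->prefixlen % 8;
            uint32_t i;

            assert(e->prefixlen <= 32);
            for (i = 0; i < full; i++)
                    if (e->data[i] != addr[i])
                            return 0;
            if (rem) {
                    /* Compare only the top rem bits of the next byte */
                    uint8_t mask = (uint8_t)(0xff << (8 - rem));

                    if ((e->data[full] ^ addr[full]) & mask)
                            return 0;
            }
            return 1;
    }

    /* Linear-scan stand-in for the trie: returns the matching entry with
     * the largest prefixlen, or NULL if nothing matches.
     */
    static const struct lpm_entry *lpm_lookup(const struct lpm_entry *tbl,
                                              size_t n, const uint8_t addr[4])
    {
            const struct lpm_entry *best = NULL;
            size_t i;

            for (i = 0; i < n; i++)
                    if (prefix_matches(&tbl[i], addr) &&
                        (!best || tbl[i].prefixlen > best->prefixlen))
                            best = &tbl[i];
            return best;
    }

Given entries for ``10.0.0.0/8`` and ``10.1.0.0/16``, a lookup of
``10.1.2.3`` selects the ``/16`` entry, while ``10.2.0.1`` falls back to the
``/8`` entry.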

bpf_map_update_elem()
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

   long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)

Prefix entries can be added or updated using the ``bpf_map_update_elem()``
helper. This helper replaces existing elements atomically.

``bpf_map_update_elem()`` returns ``0`` on success, or negative error in
case of failure.

.. note::
   The flags parameter must be one of ``BPF_ANY``, ``BPF_NOEXIST`` or
   ``BPF_EXIST``, but the value is ignored, giving ``BPF_ANY`` semantics.

bpf_map_delete_elem()
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

   long bpf_map_delete_elem(struct bpf_map *map, const void *key)

Prefix entries can be deleted using the ``bpf_map_delete_elem()``
helper. This helper returns ``0`` on success, or negative error in case
of failure.

Userspace
---------

Access from userspace uses libbpf APIs with the same names as above, with
the map identified by its file descriptor ``fd``.

bpf_map_get_next_key()
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

   int bpf_map_get_next_key(int fd, const void *cur_key, void *next_key)

A userspace program can iterate through the entries in an LPM trie using
libbpf's ``bpf_map_get_next_key()`` function. The first key can be
fetched by calling ``bpf_map_get_next_key()`` with ``cur_key`` set to
``NULL``. Subsequent calls will fetch the next key that follows the
current key. ``bpf_map_get_next_key()`` returns ``0`` on success,
``-ENOENT`` if ``cur_key`` is the last key in the trie, or negative
error in case of failure.

``bpf_map_get_next_key()`` will iterate through the LPM trie elements
from leftmost leaf first. This means that iteration will return more
specific keys before less specific ones.

Examples
========

Please see ``tools/testing/selftests/bpf/test_lpm_map.c`` for examples
of LPM trie usage from userspace. The code snippets below demonstrate
API usage.

Kernel BPF
----------

The following BPF code snippet shows how to declare a new LPM trie for IPv4
address prefixes:

.. code-block:: c

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct ipv4_lpm_key {
            __u32 prefixlen;
            __u32 data;
    };

    struct {
            __uint(type, BPF_MAP_TYPE_LPM_TRIE);
            __type(key, struct ipv4_lpm_key);
            __type(value, __u32);
            __uint(map_flags, BPF_F_NO_PREALLOC);
            __uint(max_entries, 255);
    } ipv4_lpm_map SEC(".maps");

The following BPF code snippet shows how to look up by IPv4 address:

.. code-block:: c

    void *lookup(__u32 ipaddr)
    {
            struct ipv4_lpm_key key = {
                    .prefixlen = 32,
                    .data = ipaddr
            };

            return bpf_map_lookup_elem(&ipv4_lpm_map, &key);
    }

Userspace
---------

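The following snippet shows how to insert an IPv4 prefix entry into an
LPM trie. Here ``struct value`` stands for the user defined value type of
the map, and ``addr`` is expected in network byte order: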

.. code-block:: c

    int add_prefix_entry(int lpm_fd, __u32 addr, __u32 prefixlen, struct value *value)
    {
            struct ipv4_lpm_key ipv4_key = {
                    .prefixlen = prefixlen,
                    .data = addr
            };
            return bpf_map_update_elem(lpm_fd, &ipv4_key, value, BPF_ANY);
    }
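
Callers often start from a textual prefix such as ``192.168.0.0/16``. The
sketch below shows one way to turn such a string into a key before passing it
to a function like ``add_prefix_entry()``. The ``parse_ipv4_cidr()`` helper is
hypothetical, and the key struct is redeclared with ``<stdint.h>`` types so
the snippet stands alone:

.. code-block:: c

    #include <arpa/inet.h>
    #include <assert.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    struct ipv4_lpm_key {
            uint32_t prefixlen;
            uint32_t data;          /* address in network byte order */
    };

    /* Hypothetical helper: parse "a.b.c.d/len" into key.
     * Returns 0 on success, -1 on malformed input.
     */
    static int parse_ipv4_cidr(const char *cidr, struct ipv4_lpm_key *key)
    {
            char addr[INET_ADDRSTRLEN];
            const char *slash;
            struct in_addr in;
            unsigned long len;
            char *end;

            assert(cidr && key);
            slash = strchr(cidr, '/');
            if (!slash || (size_t)(slash - cidr) >= sizeof(addr))
                    return -1;
            memcpy(addr, cidr, slash - cidr);
            addr[slash - cidr] = '\0';

            len = strtoul(slash + 1, &end, 10);
            if (end == slash + 1 || *end != '\0' || len > 32)
                    return -1;
            /* inet_pton() stores the address in network byte order */
            if (inet_pton(AF_INET, addr, &in) != 1)
                    return -1;

            key->prefixlen = (uint32_t)len;
            key->data = in.s_addr;
            return 0;
    }

On success, ``key.data`` can be used as the ``addr`` argument and
``key.prefixlen`` as the ``prefixlen`` argument of the insert function above.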

The following snippet shows a userspace program walking through the entries
of an LPM trie:


.. code-block:: c

    #include <bpf/libbpf.h>
    #include <bpf/bpf.h>

    void iterate_lpm_trie(int map_fd)
    {
            struct ipv4_lpm_key *cur_key = NULL;
            struct ipv4_lpm_key next_key;
            struct value value;
            int err;

            for (;;) {
                    err = bpf_map_get_next_key(map_fd, cur_key, &next_key);
                    if (err)
                            break;

                    /* The entry may have been deleted since its key was returned */
                    if (bpf_map_lookup_elem(map_fd, &next_key, &value) == 0) {
                            /* Use next_key and value here */
                    }

                    cur_key = &next_key;
            }
    }