ARC: Add eBPF JIT support

This will add eBPF JIT support to the 32-bit ARCv2 processors. The
implementation is qualified by running the BPF tests on a Synopsys HSDK
board with "ARC HS38 v2.1c at 500 MHz" as the 4-core CPU.

The test_bpf.ko reports 2-10 fold improvements in execution time of its
tests. For instance:

test_bpf: #33 tcpdump port 22 jited:0 704 1766 2104 PASS
test_bpf: #33 tcpdump port 22 jited:1 120  224  260 PASS

test_bpf: #141 ALU_DIV_X: 4294967295 / 4294967295 = 1 jited:0 238 PASS
test_bpf: #141 ALU_DIV_X: 4294967295 / 4294967295 = 1 jited:1  23 PASS

test_bpf: #776 JMP32_JGE_K: all ... magnitudes jited:0 2034681 PASS
test_bpf: #776 JMP32_JGE_K: all ... magnitudes jited:1 1020022 PASS

Deployment and structure
------------------------
The related codes are added to "arch/arc/net":

- bpf_jit.h       -- The interface that a back-end translator must provide
- bpf_jit_core.c  -- Knows how to handle the input eBPF byte stream
- bpf_jit_arcv2.c -- The back-end code that knows the translation logic

The bpf_int_jit_compile() at the end of bpf_jit_core.c is the entrance
to the whole process. Normally, the translation is done in one pass,
namely the "normal pass". In case some relocations are not known during
this pass, some data (arc_jit_data) is allocated for the next pass to
come. This possible next (and last) pass is called the "extra pass".

1. Normal pass       # The necessary pass
     1a. Dry run       # Get the whole JIT length, epilogue offset, etc.
     1b. Emit phase    # Allocate memory and start emitting instructions
2. Extra pass        # Only needed if there are relocations to be fixed
     2a. Patch relocations

Support status
--------------
The JIT compiler supports BPF instructions up to "cpu=v4". However, it
does not yet provide support for:

- Tail calls
- Atomic operations
- 64-bit division/remainder
- BPF_PROBE_MEM* (exception table)

The result of "test_bpf" test suite on an HSDK board is:

hsdk-lnx# insmod test_bpf.ko test_suite=test_bpf

  test_bpf: Summary: 863 PASSED, 186 FAILED, [851/851 JIT'ed]

All the failing test cases are due to the ones that were not JIT'ed.
Categorically, they can be represented as:

  .-----------.------------.-------------.
  | test type |   opcodes  | # of cases  |
  |-----------+------------+-------------|
  | atomic    | 0xC3, 0xDB |         149 |
  | div64     | 0x37, 0x3F |          22 |
  | mod64     | 0x97, 0x9F |          15 |
  `-----------^------------+-------------|
                           | (total) 186 |
                           `-------------'

Setup: build config
-------------------
The following configs must be set to have a working JIT test:

  CONFIG_BPF_JIT=y
  CONFIG_BPF_JIT_ALWAYS_ON=y
  CONFIG_TEST_BPF=m

The following options are not necessary for the tests module,
but are good to have:

  CONFIG_DEBUG_INFO=y             # prerequisite for below
  CONFIG_DEBUG_INFO_BTF=y         # so bpftool can generate vmlinux.h

  CONFIG_FTRACE=y                 #
  CONFIG_BPF_SYSCALL=y            # all these options lead to
  CONFIG_KPROBE_EVENTS=y          # having CONFIG_BPF_EVENTS=y
  CONFIG_PERF_EVENTS=y            #

Some BPF programs provide data through /sys/kernel/debug:
  CONFIG_DEBUG_FS=y
arc# mount -t debugfs debugfs /sys/kernel/debug

Setup: elfutils
---------------
The libdw.{so,a} library that is used by pahole for processing
the final binary must come from elfutils 0.189 or newer. The
support for ARCv2 [1] has been added since that version.

[1]
https://sourceware.org/git/?p=elfutils.git;a=commit;h=de3d46b3e7

Setup: pahole
-------------
The line below in linux/scripts/Makefile.btf must be commented out:

pahole-flags-$(call test-ge, $(pahole-ver), 121) += --btf_gen_floats

Or else, the build will fail:

$ make V=1
  ...
  BTF     .btf.vmlinux.bin.o
pahole -J --btf_gen_floats                    \
       -j --lang_exclude=rust                 \
       --skip_encoding_btf_inconsistent_proto \
       --btf_gen_optimized .tmp_vmlinux.btf
Complex, interval and imaginary float types are not supported
Encountered error while encoding BTF.
  ...
  BTFIDS  vmlinux
./tools/bpf/resolve_btfids/resolve_btfids vmlinux
libbpf: failed to find '.BTF' ELF section in vmlinux
FAILED: load BTF from vmlinux: No data available

This is due to the fact that the ARC toolchains generate
"complex float" DIE entries in libgcc and at the moment, pahole
can't handle such entries.

Running the tests
-----------------
host$ scp /bld/linux/lib/test_bpf.ko arc:
arc # sysctl net.core.bpf_jit_enable=1
arc # insmod test_bpf.ko test_suite=test_bpf
      ...
      test_bpf: #1048 Staggered jumps: JMP32_JSLE_X jited:1 697811 PASS
      test_bpf: Summary: 863 PASSED, 186 FAILED, [851/851 JIT'ed]

Acknowledgments
---------------
- Claudiu Zissulescu for his unwavering support
- Yuriy Kolerov for testing and troubleshooting
- Vladimir Isaev for the pahole workaround
- Sergey Matyukevich for paving the road by adding the interpreter support

Signed-off-by: Shahab Vahedi <shahab@synopsys.com>
Link: https://lore.kernel.org/r/20240430145604.38592-1-list+bpf@vahedi.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This commit is contained in:
Shahab Vahedi 2024-04-30 16:56:04 +02:00 committed by Alexei Starovoitov
parent fcd1ed89a0
commit f122668ddc
9 changed files with 4611 additions and 2 deletions

View File

@ -72,6 +72,7 @@ two flavors of JITs, the newer eBPF JIT currently supported on:
- riscv64
- riscv32
- loongarch64
- arc
And the older cBPF JIT supported on the following archs:

View File

@ -513,7 +513,7 @@ JIT compiler
------------
The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC,
PowerPC, ARM, ARM64, MIPS, RISC-V and s390 and can be enabled through
PowerPC, ARM, ARM64, MIPS, RISC-V, s390, and ARC and can be enabled through
CONFIG_BPF_JIT. The JIT compiler is transparently invoked for each
attached filter from user space or for internal kernel users if it has
been previously enabled by root::
@ -650,7 +650,7 @@ before a conversion to the new layout is being done behind the scenes!
Currently, the classic BPF format is being used for JITing on most
32-bit architectures, whereas x86-64, aarch64, s390x, powerpc64,
sparc64, arm32, riscv64, riscv32, loongarch64 perform JIT compilation
sparc64, arm32, riscv64, riscv32, loongarch64, arc perform JIT compilation
from eBPF instruction set.
Testing

View File

@ -3712,6 +3712,12 @@ S: Maintained
F: Documentation/devicetree/bindings/iio/imu/bosch,bmi323.yaml
F: drivers/iio/imu/bmi323/
BPF JIT for ARC
M: Shahab Vahedi <shahab@synopsys.com>
L: bpf@vger.kernel.org
S: Maintained
F: arch/arc/net/
BPF JIT for ARM
M: Russell King <linux@armlinux.org.uk>
M: Puranjay Mohan <puranjay12@gmail.com>

View File

@ -1,6 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
obj-y += kernel/
obj-y += mm/
obj-y += net/
# for cleaning
subdir- += boot

View File

@ -52,6 +52,7 @@ config ARC
select PCI_SYSCALL if PCI
select HAVE_ARCH_JUMP_LABEL if ISA_ARCV2 && !CPU_ENDIAN_BE32
select TRACE_IRQFLAGS_SUPPORT
select HAVE_EBPF_JIT if ISA_ARCV2
config LOCKDEP_SUPPORT
def_bool y

6
arch/arc/net/Makefile Normal file
View File

@ -0,0 +1,6 @@
# SPDX-License-Identifier: GPL-2.0-only
ifeq ($(CONFIG_ISA_ARCV2),y)
obj-$(CONFIG_BPF_JIT) += bpf_jit_core.o
obj-$(CONFIG_BPF_JIT) += bpf_jit_arcv2.o
endif

164
arch/arc/net/bpf_jit.h Normal file
View File

@ -0,0 +1,164 @@
/* SPDX-License-Identifier: GPL-2.0 */
/*
* The interface that a back-end should provide to bpf_jit_core.c.
*
* Copyright (c) 2024 Synopsys Inc.
* Author: Shahab Vahedi <shahab@synopsys.com>
*/
#ifndef _ARC_BPF_JIT_H
#define _ARC_BPF_JIT_H
#include <linux/bpf.h>
#include <linux/filter.h>
/* Print debug info and assert. */
//#define ARC_BPF_JIT_DEBUG
/* Determine the address type of the target. */
#ifdef CONFIG_ISA_ARCV2
#define ARC_ADDR u32
#endif
/*
* For the translation of some BPF instructions, a temporary register
* might be needed for some interim data.
*/
#define JIT_REG_TMP MAX_BPF_JIT_REG
/*
* Buffer access: If buffer "b" is not NULL, advance by "n" bytes.
*
* This macro must be used in any place that potentially requires a
* "buf + len". This way, we make sure that the "buf" argument for
* the underlying "arc_*(buf, ...)" ends up as NULL instead of something
* like "0+4" or "0+8", etc. Those "arc_*()" functions check their "buf"
* value to decide if instructions should be emitted or not.
*/
#define BUF(b, n) (((b) != NULL) ? ((b) + (n)) : (b))
/************** Functions that the back-end must provide **************/
/* Extension for 32-bit operations. */
inline u8 zext(u8 *buf, u8 rd);
/***** Moves *****/
u8 mov_r32(u8 *buf, u8 rd, u8 rs, u8 sign_ext);
u8 mov_r32_i32(u8 *buf, u8 reg, s32 imm);
u8 mov_r64(u8 *buf, u8 rd, u8 rs, u8 sign_ext);
u8 mov_r64_i32(u8 *buf, u8 reg, s32 imm);
u8 mov_r64_i64(u8 *buf, u8 reg, u32 lo, u32 hi);
/***** Loads and stores *****/
u8 load_r(u8 *buf, u8 rd, u8 rs, s16 off, u8 size, bool sign_ext);
u8 store_r(u8 *buf, u8 rd, u8 rs, s16 off, u8 size);
u8 store_i(u8 *buf, s32 imm, u8 rd, s16 off, u8 size);
/***** Addition *****/
u8 add_r32(u8 *buf, u8 rd, u8 rs);
u8 add_r32_i32(u8 *buf, u8 rd, s32 imm);
u8 add_r64(u8 *buf, u8 rd, u8 rs);
u8 add_r64_i32(u8 *buf, u8 rd, s32 imm);
/***** Subtraction *****/
u8 sub_r32(u8 *buf, u8 rd, u8 rs);
u8 sub_r32_i32(u8 *buf, u8 rd, s32 imm);
u8 sub_r64(u8 *buf, u8 rd, u8 rs);
u8 sub_r64_i32(u8 *buf, u8 rd, s32 imm);
/***** Multiplication *****/
u8 mul_r32(u8 *buf, u8 rd, u8 rs);
u8 mul_r32_i32(u8 *buf, u8 rd, s32 imm);
u8 mul_r64(u8 *buf, u8 rd, u8 rs);
u8 mul_r64_i32(u8 *buf, u8 rd, s32 imm);
/***** Division *****/
u8 div_r32(u8 *buf, u8 rd, u8 rs, bool sign_ext);
u8 div_r32_i32(u8 *buf, u8 rd, s32 imm, bool sign_ext);
/***** Remainder *****/
u8 mod_r32(u8 *buf, u8 rd, u8 rs, bool sign_ext);
u8 mod_r32_i32(u8 *buf, u8 rd, s32 imm, bool sign_ext);
/***** Bitwise AND *****/
u8 and_r32(u8 *buf, u8 rd, u8 rs);
u8 and_r32_i32(u8 *buf, u8 rd, s32 imm);
u8 and_r64(u8 *buf, u8 rd, u8 rs);
u8 and_r64_i32(u8 *buf, u8 rd, s32 imm);
/***** Bitwise OR *****/
u8 or_r32(u8 *buf, u8 rd, u8 rs);
u8 or_r32_i32(u8 *buf, u8 rd, s32 imm);
u8 or_r64(u8 *buf, u8 rd, u8 rs);
u8 or_r64_i32(u8 *buf, u8 rd, s32 imm);
/***** Bitwise XOR *****/
u8 xor_r32(u8 *buf, u8 rd, u8 rs);
u8 xor_r32_i32(u8 *buf, u8 rd, s32 imm);
u8 xor_r64(u8 *buf, u8 rd, u8 rs);
u8 xor_r64_i32(u8 *buf, u8 rd, s32 imm);
/***** Bitwise Negate *****/
u8 neg_r32(u8 *buf, u8 r);
u8 neg_r64(u8 *buf, u8 r);
/***** Bitwise left shift *****/
u8 lsh_r32(u8 *buf, u8 rd, u8 rs);
u8 lsh_r32_i32(u8 *buf, u8 rd, u8 imm);
u8 lsh_r64(u8 *buf, u8 rd, u8 rs);
u8 lsh_r64_i32(u8 *buf, u8 rd, s32 imm);
/***** Bitwise right shift (logical) *****/
u8 rsh_r32(u8 *buf, u8 rd, u8 rs);
u8 rsh_r32_i32(u8 *buf, u8 rd, u8 imm);
u8 rsh_r64(u8 *buf, u8 rd, u8 rs);
u8 rsh_r64_i32(u8 *buf, u8 rd, s32 imm);
/***** Bitwise right shift (arithmetic) *****/
u8 arsh_r32(u8 *buf, u8 rd, u8 rs);
u8 arsh_r32_i32(u8 *buf, u8 rd, u8 imm);
u8 arsh_r64(u8 *buf, u8 rd, u8 rs);
u8 arsh_r64_i32(u8 *buf, u8 rd, s32 imm);
/***** Frame related *****/
u32 mask_for_used_regs(u8 bpf_reg, bool is_call);
u8 arc_prologue(u8 *buf, u32 usage, u16 frame_size);
u8 arc_epilogue(u8 *buf, u32 usage, u16 frame_size);
/***** Jumps *****/
/*
* Different sorts of conditions (ARC enum as opposed to BPF_*).
*
* Do not change the order of enums here. ARC_CC_SLE+1 is used
* to determine the number of JCCs.
*/
enum ARC_CC {
ARC_CC_UGT = 0, /* unsigned > */
ARC_CC_UGE, /* unsigned >= */
ARC_CC_ULT, /* unsigned < */
ARC_CC_ULE, /* unsigned <= */
ARC_CC_SGT, /* signed > */
ARC_CC_SGE, /* signed >= */
ARC_CC_SLT, /* signed < */
ARC_CC_SLE, /* signed <= */
ARC_CC_AL, /* always */
ARC_CC_EQ, /* == */
ARC_CC_NE, /* != */
ARC_CC_SET, /* test */
ARC_CC_LAST
};
/*
* A few notes:
*
* - check_jmp_*() are prerequisites before calling the gen_jmp_*().
* They return "true" if the jump is possible and "false" otherwise.
*
* - The notion of "*_off" is to emphasize that these parameters are
* merely offsets in the JIT stream and not absolute addresses. One
* can look at them as addresses if the JIT code would start from
* address 0x0000_0000. Nonetheless, since the buffer address for the
* JIT is on a word-aligned address, this works and actually makes
* things simpler (offsets are in the range of u32 which is more than
* enough).
*/
bool check_jmp_32(u32 curr_off, u32 targ_off, u8 cond);
bool check_jmp_64(u32 curr_off, u32 targ_off, u8 cond);
u8 gen_jmp_32(u8 *buf, u8 rd, u8 rs, u8 cond, u32 c_off, u32 t_off);
u8 gen_jmp_64(u8 *buf, u8 rd, u8 rs, u8 cond, u32 c_off, u32 t_off);
/***** Miscellaneous *****/
u8 gen_func_call(u8 *buf, ARC_ADDR func_addr, bool external_func);
u8 arc_to_bpf_return(u8 *buf);
/*
* - Perform byte swaps on "rd" based on the "size".
* - If "force" is set, do it unconditionally. Otherwise, consider the
* desired "endian"ness and the host endianness.
* - For data "size"s up to 32 bits, perform a zero-extension if asked
* by the "do_zext" boolean.
*/
u8 gen_swap(u8 *buf, u8 rd, u8 size, u8 endian, bool force, bool do_zext);
#endif /* _ARC_BPF_JIT_H */

3005
arch/arc/net/bpf_jit_arcv2.c Normal file

File diff suppressed because it is too large Load Diff

1425
arch/arc/net/bpf_jit_core.c Normal file

File diff suppressed because it is too large Load Diff