linux-stable/net
Daniel Borkmann e2e9b6541d cls_bpf: add initial eBPF support for programmable classifiers
This work extends the "classic" BPF programmable tc classifier by
extending its scope also to native eBPF code!

This allows for user space to implement own custom, 'safe' C like
classifiers (or whatever other frontend language LLVM et al may
provide in future), that can then be compiled with the LLVM eBPF
backend to an eBPF elf file. The result of this can be loaded into
the kernel via iproute2's tc. In the kernel, they can be JITed on
major archs and thus run in native performance.

Simple, minimal toy example to demonstrate the workflow:

  #include <linux/ip.h>
  #include <linux/if_ether.h>
  #include <linux/bpf.h>

  #include "tc_bpf_api.h"

  __section("classify")
  int cls_main(struct sk_buff *skb)
  {
    return (0x800 << 16) | load_byte(skb, ETH_HLEN + __builtin_offsetof(struct iphdr, tos));
  }

  char __license[] __section("license") = "GPL";

The classifier can then be compiled into eBPF opcodes and loaded
via tc, for example:

  clang -O2 -emit-llvm -c cls.c -o - | llc -march=bpf -filetype=obj -o cls.o
  tc filter add dev em1 parent 1: bpf cls.o [...]

As it has been demonstrated, the scope can even reach up to a fully
fledged flow dissector (similarly as in samples/bpf/sockex2_kern.c).

For tc, maps are allowed to be used, but from kernel context only,
in other words, eBPF code can keep state across filter invocations.
In future, we perhaps may reattach from a different application to
those maps e.g., to read out collected statistics/state.

Similarly as in socket filters, we may extend functionality for eBPF
classifiers over time depending on the use cases. For that purpose,
cls_bpf programs are using BPF_PROG_TYPE_SCHED_CLS program type, so
we can allow additional functions/accessors (e.g. an ABI compatible
offset translation to skb fields/metadata). For an initial cls_bpf
support, we allow the same set of helper functions as eBPF socket
filters, but we could diverge at some point in time w/o problem.

I was wondering whether cls_bpf and act_bpf could share C programs,
I can imagine that at some point, we introduce i) further common
handlers for both (or even beyond their scope), and/or if truly needed
ii) some restricted function space for each of them. Both can be
abstracted easily through struct bpf_verifier_ops in future.

The context of cls_bpf versus act_bpf is slightly different though:
a cls_bpf program will return a specific classid whereas act_bpf a
drop/non-drop return code, latter may also in future mangle skbs.
That said, we can surely have a "classify" and "action" section in
a single object file, or considered mentioned constraint add a
possibility of a shared section.

The workflow for getting native eBPF running from tc [1] is as
follows: for f_bpf, I've added a slightly modified ELF parser code
from Alexei's kernel sample, which reads out the LLVM compiled
object, sets up maps (and dynamically fixes up map fds) if any, and
loads the eBPF instructions all centrally through the bpf syscall.

The resulting fd from the loaded program itself is being passed down
to cls_bpf, which looks up struct bpf_prog from the fd store, and
holds reference, so that it stays available also after tc program
lifetime. On tc filter destruction, it will then drop its reference.

Moreover, I've also added the optional possibility to annotate an
eBPF filter with a name (e.g. path to object file, or something
else if preferred) so that when tc dumps currently installed filters,
some more context can be given to an admin for a given instance (as
opposed to just the file descriptor number).

Last but not least, bpf_prog_get() and bpf_prog_put() needed to be
exported, so that eBPF can be used from cls_bpf built as a module.
Thanks to 60a3b2253c ("net: bpf: make eBPF interpreter images
read-only") I think this is of no concern since anything wanting to
alter eBPF opcode after verification stage would crash the kernel.

  [1] http://git.breakpoint.cc/cgit/dborkman/iproute2.git/log/?h=ebpf

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-01 14:05:19 -05:00
..
6lowpan net/6lowpan: Remove FSF address from GPL statement. 2014-12-05 12:43:04 +01:00
9p
802
8021q vlan: advertise link netns via netlink 2015-01-23 17:51:15 -08:00
appletalk new helper: memcpy_from_msg() 2014-11-24 04:28:48 -05:00
atm put iov_iter into msghdr 2014-12-09 16:29:03 -05:00
ax25 new helper: memcpy_from_msg() 2014-11-24 04:28:48 -05:00
batman-adv batman-adv: Fix use of seq_has_overflowed() 2015-02-22 17:00:08 -05:00
bluetooth Bluetooth: Fix potential NULL dereference 2015-02-03 09:02:12 +01:00
bridge bridge: fix link notification skb size calculation to include vlan ranges 2015-02-26 11:25:43 -05:00
caif caif: remove wrong dev_net_set() call 2015-01-29 14:20:02 -08:00
can netlink: make nlmsg_end() and genlmsg_end() void 2015-01-18 01:03:45 -05:00
ceph mm: gup: use get_user_pages_unlocked 2015-02-11 17:06:05 -08:00
core ebpf: move read-only fields to bpf_prog and shrink bpf_prog_aux 2015-03-01 14:05:19 -05:00
dcb dcbnl : Disable software interrupts before taking dcb_lock 2014-11-16 14:50:52 -05:00
dccp net: introduce helper macro for_each_cmsghdr 2014-12-10 22:41:55 -05:00
decnet netlink: Fix bugs in nlmsg_end() conversions. 2015-01-18 23:36:08 -05:00
dns_resolver
dsa net: dsa: Introduce dsa_is_port_initialized 2015-02-25 17:57:48 -05:00
ethernet net: Add Transparent Ethernet Bridging GRO support. 2015-01-02 15:46:41 -05:00
hsr
ieee802154 netlink: make nlmsg_end() and genlmsg_end() void 2015-01-18 01:03:45 -05:00
ipv4 tcp: cleanup static functions 2015-02-28 16:56:51 -05:00
ipv6 tcp: cleanup static functions 2015-02-28 16:56:51 -05:00
ipx switch ipxrtr_route_packet() from iovec to msghdr 2014-11-24 04:28:49 -05:00
irda irda: use msecs_to_jiffies for conversions 2015-01-30 18:08:25 -08:00
iucv net: introduce helper macro for_each_cmsghdr 2014-12-10 22:41:55 -05:00
key new helper: memcpy_from_msg() 2014-11-24 04:28:48 -05:00
l2tp netlink: make nlmsg_end() and genlmsg_end() void 2015-01-18 01:03:45 -05:00
lapb
llc net: llc: use correct size for sysctl timeout entries 2015-01-25 00:23:21 -08:00
mac80211 Last round of updates for net-next: 2015-02-04 14:57:45 -08:00
mac802154 mac802154: fix kbuild test robot warning 2015-01-03 01:51:51 +01:00
mpls net: mark some potential candidates __read_mostly 2015-01-30 17:58:39 -08:00
netfilter net: Remove state argument from skb_find_text() 2015-02-22 15:59:54 -05:00
netlabel Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security 2015-02-11 20:25:11 -08:00
netlink Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-02-05 14:33:28 -08:00
netrom new helper: memcpy_from_msg() 2014-11-24 04:28:48 -05:00
nfc NFC: nci: Move NFCEE discovery logic 2015-02-04 09:15:18 +01:00
openvswitch openvswitch: Fix key serialization. 2015-02-14 20:20:40 -08:00
packet netlink: make nlmsg_end() and genlmsg_end() void 2015-01-18 01:03:45 -05:00
phonet phonet netlink: allow multiple messages per skb in route dump 2015-01-19 16:20:17 -05:00
rds rds: rds_cong_queue_updates needs to defer the congestion update transmission 2015-02-11 14:35:44 -08:00
rfkill Last round of updates for net-next: 2015-02-04 14:57:45 -08:00
rose new helper: memcpy_from_msg() 2014-11-24 04:28:48 -05:00
rxrpc Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-02-04 20:46:55 -08:00
sched cls_bpf: add initial eBPF support for programmable classifiers 2015-03-01 14:05:19 -05:00
sctp Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-02-05 14:33:28 -08:00
sunrpc Merge branch 'for-3.20' of git://linux-nfs.org/~bfields/linux 2015-02-12 10:39:41 -08:00
switchdev swdevice: add new apis to set and del bridge port attributes 2015-02-01 23:16:34 -08:00
tipc tipc: make media address offset a common define 2015-02-27 18:18:48 -05:00
unix net: remove sock_iocb 2015-01-28 23:15:07 -08:00
vmw_vsock vmci: propagate msghdr all way down to __qp_memcpy_to_queue() 2015-02-04 01:34:14 -05:00
wimax
wireless Last round of updates for net-next: 2015-02-04 14:57:45 -08:00
x25 new helper: memcpy_from_msg() 2014-11-24 04:28:48 -05:00
xfrm netlink: make nlmsg_end() and genlmsg_end() void 2015-01-18 01:03:45 -05:00
compat.c net: __aligned(size) is preferred over __attribute__((aligned(size))) 2015-02-22 17:01:22 -05:00
Kconfig net: introduce generic switch devices support 2014-12-02 20:01:20 -08:00
Makefile Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-12-16 15:53:03 -08:00
socket.c net: switch sockets to ->read_iter/->write_iter 2015-02-04 01:34:15 -05:00
sysctl_net.c