linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-10-11 11:09:13 +00:00

Author	SHA1	Message	Date
Arkadi Sharshevsky	3580732448	devlink: Move dpipe entry clear function into devlink The entry clear routine can be shared between the drivers, thus it is moved inside devlink. Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-24 09:33:16 -07:00
Arkadi Sharshevsky	ffd3cdccf2	devlink: Add support for dynamic table size Up until now the dpipe table's size was static and known at registration time. The host table does not have constant size and it is resized in dynamic manner. In order to support this behavior the size is changed to be obtained dynamically via an op. This patch also adjust the current dpipe table for the new API. Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-24 09:33:16 -07:00
Arkadi Sharshevsky	3fb886ecea	devlink: Add IPv4 header for dpipe This will be used by the IPv4 host table which will be introduced in the following patches. This header is global and can be reused by many drivers. Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-24 09:33:16 -07:00
Arkadi Sharshevsky	1177009131	devlink: Add Ethernet header for dpipe This will be used by the IPv4 host table which will be introduced in the following patches. This header is global and can be reused by many drivers. Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-24 09:33:15 -07:00
Paolo Abeni	257a73031d	net/sock: allow the user to set negative peek offset This is necessary to allow the user to disable peeking with offset once it's enabled. Unix sockets already allow the above, with this patch we permit it for udp[6] sockets, too. Fixes: `627d2d6b55` ("udp: enable MSG_PEEK at non-zero offset") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-23 22:18:44 -07:00
Daniel Borkmann	e4a8e817d3	bpf: misc xdp redirect cleanups Few cleanups including: bpf_redirect_map() is really XDP only due to the return code. Move it to a more appropriate location where we do the XDP redirect handling and change it's name into bpf_xdp_redirect_map() to make it consistent to the bpf_xdp_redirect() helper. xdp_do_redirect_map() helper can be static since only used out of filter.c file. Drop the goto in xdp_do_generic_redirect() and only return errors directly. In xdp_do_flush_map() only clear ri->map_to_flush which is the arg we're using in that function, ri->map is cleared earlier along with ri->ifindex. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-22 21:26:29 -07:00
David S. Miller	e2a7c34fb2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2017-08-21 17:06:42 -07:00
David Lamparter	e65a4955b0	net: check type when freeing metadata dst Commit `3fcece12bc` ("net: store port/representator id in metadata_dst") added a new type field to metadata_dst, but metadata_dst_free() wasn't updated to check it before freeing the METADATA_IP_TUNNEL specific dst cache entry. This is not currently causing problems since it's far enough back in the struct to be zeroed for the only other type currently in existance (METADATA_HW_PORT_MUX), but nevertheless it's not correct. Fixes: `3fcece12bc` ("net: store port/representator id in metadata_dst") Signed-off-by: David Lamparter <equinox@diac24.net> Cc: Jakub Kicinski <jakub.kicinski@netronome.com> Cc: Sridhar Samudrala <sridhar.samudrala@intel.com> Cc: Simon Horman <horms@verge.net.au> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-21 10:57:38 -07:00
David S. Miller	a43dce9358	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2017-08-21 1) Support RX checksum with IPsec crypto offload for esp4/esp6. From Ilan Tayari. 2) Fixup IPv6 checksums when doing IPsec crypto offload. From Yossi Kuperman. 3) Auto load the xfrom offload modules if a user installs a SA that requests IPsec offload. From Ilan Tayari. 4) Clear RX offload informations in xfrm_input to not confuse the TX path with stale offload informations. From Ilan Tayari. 5) Allow IPsec GSO for local sockets if the crypto operation will be offloaded. 6) Support setting of an output mark to the xfrm_state. This mark can be used to to do the tunnel route lookup. From Lorenzo Colitti. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-21 09:29:47 -07:00
stephen hemminger	6648c65e7e	net: style cleanups Make code closer to current style. Mostly whitespace changes. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-18 22:38:47 -07:00
stephen hemminger	667e427bc3	net: mark receive queue attributes ro_after_init Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-18 22:38:46 -07:00
stephen hemminger	2b9c758129	net: make queue attributes ro_after_init The XPS queue attributes can be ro_after_init. Also use __ATTR_RX macros to simplify initialization. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-18 22:38:46 -07:00
stephen hemminger	170c658afc	net: make BQL sysfs attributes ro_after_init Also fix macro to not have ; at end. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-18 22:38:46 -07:00
stephen hemminger	718ad681ef	net: drop unused attribute argument from sysfs queue funcs The show and store functions don't need/use the attribute. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-18 22:38:31 -07:00
stephen hemminger	ec6cc5993c	net: make net sysfs attributes ro_after_init The attributes of net devices are immutable. Ideally, attribute groups would contain const attributes but there are too many places that do modifications of list during startup (in other code) to allow that. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-18 22:37:35 -07:00
stephen hemminger	737aec57c6	net: constify net_ns_type_operations This can be const. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-18 22:37:27 -07:00
stephen hemminger	e6d473e635	net: make net_class ro_after_init The net_class in sysfs is only modified on init. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-18 22:37:21 -07:00
stephen hemminger	b793dc5c6e	net: constify netdev_class_file These functions are wrapper arount class_create_file which can take a const attribute. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-18 22:37:12 -07:00
stephen hemminger	d0d6683716	net: don't decrement kobj reference count on init failure If kobject_init_and_add failed, then the failure path would decrement the reference count of the queue kobject whose reference count was already zero. Fixes: `114cf58021` ("bql: Byte queue limits") Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-18 22:37:03 -07:00
Jesper Dangaard Brouer	4c03bdd7b5	xdp: adjust xdp redirect tracepoint to include return error code The return error code need to be included in the tracepoint xdp:xdp_redirect, else its not possible to distinguish successful or failed XDP_REDIRECT transmits. XDP have no queuing mechanism. Thus, it is fairly easily to overrun a NIC transmit queue. The eBPF program invoking helpers (bpf_redirect or bpf_redirect_map) to redirect a packet doesn't get any feedback whether the packet was actually transmitted. Info on failed transmits in the tracepoint xdp:xdp_redirect, is interesting as this opens for providing a feedback-loop to the receiving XDP program. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-18 16:18:40 -07:00
Eric Dumazet	9620fef27e	ipv4: convert dst_metrics.refcnt from atomic_t to refcount_t refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-18 15:14:07 -07:00
Matthew Dawson	a0917e0bc6	datagram: When peeking datagrams with offset < 0 don't skip empty skbs Due to commit `e6afc8ace6` ("udp: remove headers from UDP packets before queueing"), when udp packets are being peeked the requested extra offset is always 0 as there is no need to skip the udp header. However, when the offset is 0 and the next skb is of length 0, it is only returned once. The behaviour can be seen with the following python script: from socket import *; f=socket(AF_INET6, SOCK_DGRAM \| SOCK_NONBLOCK, 0); g=socket(AF_INET6, SOCK_DGRAM \| SOCK_NONBLOCK, 0); f.bind(('::', 0)); addr=('::1', f.getsockname()[1]); g.sendto(b'', addr) g.sendto(b'b', addr) print(f.recvfrom(10, MSG_PEEK)); print(f.recvfrom(10, MSG_PEEK)); Where the expected output should be the empty string twice. Instead, make sk_peek_offset return negative values, and pass those values to __skb_try_recv_datagram/__skb_try_recv_from_queue. If the passed offset to __skb_try_recv_from_queue is negative, the checked skb is never skipped. __skb_try_recv_from_queue will then ensure the offset is reset back to 0 if a peek is requested without an offset, unless no packets are found. Also simplify the if condition in __skb_try_recv_from_queue. If _off is greater then 0, and off is greater then or equal to skb->len, then (_off \|\| skb->len) must always be true assuming skb->len >= 0 is always true. Also remove a redundant check around a call to sk_peek_offset in af_unix.c, as it double checked if MSG_PEEK was set in the flags. V2: - Moved the negative fixup into __skb_try_recv_from_queue, and remove now redundant checks - Fix peeking in udp{,v6}_recvmsg to report the right value when the offset is 0 V3: - Marked new branch in __skb_try_recv_from_queue as unlikely. Signed-off-by: Matthew Dawson <matthew@mjdsystems.ca> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-18 15:12:54 -07:00
Daniel Borkmann	047b0ecd68	bpf: reuse tc bpf prologue for sk skb progs Given both program types are effecitvely doing the same in the prologue, just reuse the one that we had for tc and only adapt to the corresponding drop verdict value. That way, we don't need to have the duplicate from `8a31db5615` ("bpf: add access to sock fields and pkt data from sk_skb programs") to maintain. Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-17 10:25:19 -07:00
Daniel Borkmann	4d6a75b65d	bpf: no need to nullify ri->map in xdp_do_redirect We are guaranteed to have a NULL ri->map in this branch since we test for it earlier, so we don't need to reset it here. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-17 10:15:53 -07:00
Jesper Dangaard Brouer	e543002f77	qdisc: add tracepoint qdisc:qdisc_dequeue for dequeued SKBs The main purpose of this tracepoint is to monitor bulk dequeue in the network qdisc layer, as it cannot be deducted from the existing qdisc stats. The txq_state can be used for determining the reason for zero packet dequeues, see enum netdev_queue_state_t. Notice all packets doesn't necessary activate this tracepoint. As qdiscs with flag TCQ_F_CAN_BYPASS, can directly invoke sch_direct_xmit() when qdisc_qlen is zero. Remember that perf record supports filters like: perf record -e qdisc:qdisc_dequeue \ --filter 'ifindex == 4 && (packets > 1 \|\| txq_state > 0)' Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-16 14:10:10 -07:00
John Fastabend	8a31db5615	bpf: add access to sock fields and pkt data from sk_skb programs Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-16 11:27:53 -07:00
John Fastabend	174a79ff95	bpf: sockmap with sk redirect support Recently we added a new map type called dev map used to forward XDP packets between ports (`6093ec2dc3`). This patches introduces a similar notion for sockets. A sockmap allows users to add participating sockets to a map. When sockets are added to the map enough context is stored with the map entry to use the entry with a new helper bpf_sk_redirect_map(map, key, flags) This helper (analogous to bpf_redirect_map in XDP) is given the map and an entry in the map. When called from a sockmap program, discussed below, the skb will be sent on the socket using skb_send_sock(). With the above we need a bpf program to call the helper from that will then implement the send logic. The initial site implemented in this series is the recv_sock hook. For this to work we implemented a map attach command to add attributes to a map. In sockmap we add two programs a parse program and a verdict program. The parse program uses strparser to build messages and pass them to the verdict program. The parse programs use the normal strparser semantics. The verdict program is of type SK_SKB. The verdict program returns a verdict SK_DROP, or SK_REDIRECT for now. Additional actions may be added later. When SK_REDIRECT is returned, expected when bpf program uses bpf_sk_redirect_map(), the sockmap logic will consult per cpu variables set by the helper routine and pull the sock entry out of the sock map. This pattern follows the existing redirect logic in cls and xdp programs. This gives the flow, recv_sock -> str_parser (parse_prog) -> verdict_prog -> skb_send_sock \ -> kfree_skb As an example use case a message based load balancer may use specific logic in the verdict program to select the sock to send on. Sample programs are provided in future patches that hopefully illustrate the user interfaces. Also selftests are in follow-on patches. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-16 11:27:53 -07:00
John Fastabend	b005fd189c	bpf: introduce new program type for skbs on sockets A class of programs, run from strparser and soon from a new map type called sock map, are used with skb as the context but on established sockets. By creating a specific program type for these we can use bpf helpers that expect full sockets and get the verifier to ensure these helpers are not used out of context. The new type is BPF_PROG_TYPE_SK_SKB. This patch introduces the infrastructure and type. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-16 11:27:53 -07:00
John Fastabend	db5980d804	net: fixes for skb_send_sock A couple fixes to new skb_send_sock infrastructure. However, no users currently exist for this code (adding user in next handful of patches) so it should not be possible to trigger a panic with existing in-kernel code. Fixes: `306b13eb3c` ("proto_ops: Add locked held versions of sendmsg and sendpage") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-16 11:27:52 -07:00
David S. Miller	463910e2df	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2017-08-15 20:23:23 -07:00
Craig Gallek	7324157b8a	dsa: fix flow disector null pointer A recent change to fix up DSA device behavior made the assumption that all skbs passing through the flow disector will be associated with a device. This does not appear to be a safe assumption. Syzkaller found the crash below by attaching a BPF socket filter that tries to find the payload offset of a packet passing between two unix sockets. kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 0 PID: 2940 Comm: syzkaller872007 Not tainted 4.13.0-rc4-next-20170811 #1 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 task: ffff8801d1b425c0 task.stack: ffff8801d0bc0000 RIP: 0010:__skb_flow_dissect+0xdcd/0x3ae0 net/core/flow_dissector.c:445 RSP: 0018:ffff8801d0bc7340 EFLAGS: 00010206 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000060 RSI: ffffffff856dc080 RDI: 0000000000000300 RBP: ffff8801d0bc7870 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000008 R11: ffffed003a178f1e R12: 0000000000000000 R13: 0000000000000000 R14: ffffffff856dc080 R15: ffff8801ce223140 FS: 00000000016ed880(0000) GS:ffff8801dc000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020008000 CR3: 00000001ce22d000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: skb_flow_dissect_flow_keys include/linux/skbuff.h:1176 [inline] skb_get_poff+0x9a/0x1a0 net/core/flow_dissector.c:1079 ______skb_get_pay_offset net/core/filter.c:114 [inline] __skb_get_pay_offset+0x15/0x20 net/core/filter.c:112 Code: 80 3c 02 00 44 89 6d 10 0f 85 44 2b 00 00 4d 8b 67 20 48 b8 00 00 00 00 00 fc ff df 49 8d bc 24 00 03 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 13 2b 00 00 4d 8b a4 24 00 03 00 00 4d 85 e4 RIP: __skb_flow_dissect+0xdcd/0x3ae0 net/core/flow_dissector.c:445 RSP: ffff8801d0bc7340 Fixes: `43e665287f` ("net-next: dsa: fix flow dissection") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Craig Gallek <kraig@google.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-15 17:18:35 -07:00
Jason Wang	7c4974786f	net: export some generic xdp helpers This patch tries to export some generic xdp helpers to drivers. This can let driver to do XDP for a specific skb. This is useful for the case when the packet is hard to be processed at page level directly (e.g jumbo/GSO frame). With this patch, there's no need for driver to forbid the XDP set when configuration is not suitable. Instead, it can defer the XDP for packets that is hard to be processed directly after skb is created. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-13 19:56:07 -07:00
Jakub Sitnicki	d0225784be	rtnelink: Move link dump consistency check out of the loop Calls to rtnl_dump_ifinfo() are protected by RTNL lock. So are the {list,unlist}_netdevice() calls where we bump the net->dev_base_seq number. For this reason net->dev_base_seq can't change under out feet while we're looping over links in rtnl_dump_ifinfo(). So move the check for net->dev_base_seq change (since the last time we were called) out of the loop. This way we avoid giving a wrong impression that there are concurrent updates to the link list going on while we're iterating over them. Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-13 19:43:57 -07:00
Daniel Borkmann	2ed46ce45e	bpf: fix two missing target_size settings in bpf_convert_ctx_access When CONFIG_NET_SCHED or CONFIG_NET_RX_BUSY_POLL is /not/ set and we try a narrow __sk_buff load of tc_index or napi_id, respectively, then verifier rightfully complains that it's misconfigured, because we need to set target_size in each of the two cases. The rewrite for the ctx access is just a dummy op, but needs to pass, so fix this up. Fixes: `f96da09473` ("bpf: simplify narrower ctx access") Reported-by: Shubham Bansal <illusionist.neo@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-11 14:59:24 -07:00
Tonghao Zhang	939912216f	net: skb_needs_check() removes CHECKSUM_UNNECESSARY check for tx. Because we remove the UFO support, we will also remove the CHECKSUM_UNNECESSARY check in skb_needs_check(). Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-11 14:45:52 -07:00
Florian Westphal	8caa38b56c	rtnetlink: fallback to UNSPEC if current family has no doit callback We need to use PF_UNSPEC in case the requested family has no doit callback, otherwise this now fails with EOPNOTSUPP instead of running the unspec doit callback, as before. Fixes: `6853dd4881` ("rtnetlink: protect handler table with rcu") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-10 09:50:22 -07:00
Florian Westphal	d38a65125f	rtnetlink: init handler refcounts to 1 If using CONFIG_REFCOUNT_FULL=y we get following splat: refcount_t: increment on 0; use-after-free. WARNING: CPU: 0 PID: 304 at lib/refcount.c:152 refcount_inc+0x47/0x50 Call Trace: rtnetlink_rcv_msg+0x191/0x260 ... This warning is harmless (0 is "no callback running", not "memory was freed"). Use '1' as the new 'no handler is running' base instead of 0 to avoid this. Fixes: `019a316992` ("rtnetlink: add reference counting to prevent module unload while dump is in progress") Reported-by: Sabrina Dubroca <sdubroca@redhat.com> Reported-by: kernel test robot <fengguang.wu@intel.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-10 09:50:22 -07:00
Florian Westphal	8515ae3843	rtnetlink: switch rtnl_link_get_slave_info_data_size to rcu David Ahern reports following splat: RTNL: assertion failed at net/core/dev.c (5717) netdev_master_upper_dev_get+0x5f/0x70 if_nlmsg_size+0x158/0x240 rtnl_calcit.isra.26+0xa3/0xf0 rtnl_link_get_slave_info_data_size currently assumes RTNL protection, but there appears to be no hard requirement for this, so use rcu instead. At the time of this writing, there are three 'get_slave_size' callbacks (now invoked under rcu): bond_get_slave_size, vrf_get_slave_size and br_port_get_slave_size, all return constant only (i.e. they don't sleep). Fixes: `6853dd4881` ("rtnetlink: protect handler table with rcu") Reported-by: David Ahern <dsahern@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-10 09:50:22 -07:00
Florian Westphal	5c2bb9b6e2	rtnetlink: do not use RTM_GETLINK directly Userspace sends RTM_GETLINK type, but the kernel substracts RTM_BASE from this, i.e. 'type' doesn't contain RTM_GETLINK anymore but instead RTM_GETLINK - RTM_BASE. This caused the calcit callback to not be invoked when it should have been (and vice versa). While at it, also fix a off-by one when checking family index. vs handler array size. Fixes: `e1fa6d216d` ("rtnetlink: call rtnl_calcit directly") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-10 09:50:22 -07:00
Florian Westphal	377cb24884	rtnetlink: use rcu_dereference_raw to silence rcu splat Ido reports a rcu splat in __rtnl_register. The splat is correct; as rtnl_register doesn't grab any logs and doesn't use rcu locks either. It has always been like this. handler families are not registered in parallel so there are no races wrt. the kmalloc ordering. The only reason to use rcu_dereference in the first place was to avoid sparse from complaining about this. Thus this switches to _raw() to not have rcu checks here. The alternative is to add rtnl locking to register/unregister, however, I don't see a compelling reason to do so as this has been lockless for the past twenty years or so. Fixes: `6853dd4881` ("rtnetlink: protect handler table with rcu") Reported-by: Ido Schimmel <idosch@idosch.org> Signed-off-by: Florian Westphal <fw@strlen.de> Tested-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-10 09:50:21 -07:00
John Crispin	2d57164564	net: core: fix compile error inside flow_dissector due to new dsa callback The following error was introduced by commit `43e665287f` ("net-next: dsa: fix flow dissection") due to a missing #if guard net/core/flow_dissector.c: In function '__skb_flow_dissect': net/core/flow_dissector.c:448:18: error: 'struct net_device' has no member named 'dsa_ptr' ops = skb->dev->dsa_ptr->tag_ops; ^ make[3]: *** [net/core/flow_dissector.o] Error 1 Signed-off-by: John Crispin <john@phrozen.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-10 09:16:05 -07:00
John Crispin	43e665287f	net-next: dsa: fix flow dissection RPS and probably other kernel features are currently broken on some if not all DSA devices. The root cause of this is that skb_hash will call the flow_dissector. At this point the skb still contains the magic switch header and the skb->protocol field is not set up to the correct 802.3 value yet. By the time the tag specific code is called, removing the header and properly setting the protocol an invalid hash is already set. In the case of the mt7530 this will result in all flows always having the same hash. Signed-off-by: Muciri Gatimu <muciri@openmesh.com> Signed-off-by: Shashidhar Lakkavalli <shashidhar.lakkavalli@openmesh.com> Signed-off-by: John Crispin <john@phrozen.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-09 22:51:47 -07:00
Florian Westphal	165b911725	net: call newid/getid without rtnl mutex held Both functions take nsid_lock and don't rely on rtnl lock. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-09 16:57:38 -07:00
Florian Westphal	62256f98f2	rtnetlink: add RTNL_FLAG_DOIT_UNLOCKED Allow callers to tell rtnetlink core that its doit callback should be invoked without holding rtnl mutex. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-09 16:57:38 -07:00
Florian Westphal	6853dd4881	rtnetlink: protect handler table with rcu Note that netlink dumps still acquire rtnl mutex via the netlink dump infrastructure. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-09 16:57:38 -07:00
Florian Westphal	0cc09020ae	rtnetlink: small rtnl lock pushdown instead of rtnl lock/unload at the top level, push it down to the called function. This is just an intermediate step, next commit switches protection of the rtnl_link ops table to rcu, in which case (for dumps) the rtnl lock is acquired only by the netlink dumper infrastructure (current lock/unlock/dump/lock/unlock rtnl sequence becomes rcu lock/rcu unlock/dump). Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-09 16:57:38 -07:00
Florian Westphal	019a316992	rtnetlink: add reference counting to prevent module unload while dump is in progress I don't see what prevents rmmod (unregister_all is called) while a dump is active. Even if we'd add rtnl lock/unlock pair to unregister_all (as done here), thats not enough either as rtnl_lock is released right before the dump process starts. So this adds a refcount: * acquire rtnl mutex * bump refcount * release mutex * start the dump ... and make unregister_all remove the callbacks (no new dumps possible) and then wait until refcount is 0. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-09 16:57:38 -07:00
Florian Westphal	b97bac64a5	rtnetlink: make rtnl_register accept a flags parameter This change allows us to later indicate to rtnetlink core that certain doit functions should be called without acquiring rtnl_mutex. This change should have no effect, we simply replace the last (now unused) calcit argument with the new flag. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-09 16:57:38 -07:00
Florian Westphal	e1fa6d216d	rtnetlink: call rtnl_calcit directly There is only a single place in the kernel that regisers the "calcit" callback (to determine min allocation for dumps). This is in rtnetlink.c for PF_UNSPEC RTM_GETLINK. The function that checks for calcit presence at run time will first check the requested family (which will always fail for !PF_UNSPEC as no callsite registers this), then falls back to checking PF_UNSPEC. Therefore we can just check if type is RTM_GETLINK and then do a direct call. Because of fallback to PF_UNSPEC all RTM_GETLINK types used this regardless of family. This has the advantage that we don't need to allocate space for the function pointer for all the other families. A followup patch will drop the calcit function pointer from the rtnl_link callback structure. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-09 16:57:38 -07:00
Daniel Borkmann	92b31a9af7	bpf: add BPF_J{LT,LE,SLT,SLE} instructions Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=), BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that particularly JLT/JLE counterparts involving immediates need to be rewritten from e.g. X < [IMM] by swapping arguments into [IMM] > X, meaning the immediate first is required to be loaded into a register Y := [IMM], such that then we can compare with Y > X. Note that the destination operand is always required to be a register. This has the downside of having unnecessarily increased register pressure, meaning complex program would need to spill other registers temporarily to stack in order to obtain an unused register for the [IMM]. Loading to registers will thus also affect state pruning since we need to account for that register use and potentially those registers that had to be spilled/filled again. As a consequence slightly more stack space might have been used due to spilling, and BPF programs are a bit longer due to extra code involving the register load and potentially required spill/fills. Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=) counterparts to the eBPF instruction set. Modifying LLVM to remove the NegateCC() workaround in a PoC patch at [1] and allowing it to also emit the new instructions resulted in cilium's BPF programs that are injected into the fast-path to have a reduced program length in the range of 2-3% (e.g. accumulated main and tail call sections from one of the object file reduced from 4864 to 4729 insns), reduced complexity in the range of 10-30% (e.g. accumulated sections reduced in one of the cases from 116432 to 88428 insns), and reduced stack usage in the range of 1-5% (e.g. accumulated sections from one of the object files reduced from 824 to 784b). The modification for LLVM will be incorporated in a backwards compatible way. Plan is for LLVM to have i) a target specific option to offer a possibility to explicitly enable the extension by the user (as we have with -m target specific extensions today for various CPU insns), and ii) have the kernel checked for presence of the extensions and enable them transparently when the user is selecting more aggressive options such as -march=native in a bpf target context. (Other frontends generating BPF byte code, e.g. ply can probe the kernel directly for its code generation.) [1] https://github.com/borkmann/llvm/tree/bpf-insns Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-09 16:53:56 -07:00

1 2 3 4 5 ...

5011 commits