Commit graph

2014 commits

Author SHA1 Message Date
David S. Miller
e4655e4a79 Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:

====================
40GbE Intel Wired LAN Driver Updates 2017-10-13

This series contains updates to mqprio and i40e.

Amritha introduces a new hardware offload mode in tc/mqprio where the TCs,
the queue configurations and bandwidth rate limits are offloaded to the
hardware. The existing mqprio framework is extended to configure the queue
counts and layout and also added support for rate limiting. This is
achieved through new netlink attributes for the 'mode' option which takes
values such as 'dcb' (default) and 'channel' and a 'shaper' option for
QoS attributes such as bandwidth rate limits in hw mode 1.  Legacy devices
can fall back to the existing setup supporting hw mode 1 without these
additional options where only the TCs are offloaded and then the 'mode'
and 'shaper' options defaults to DCB support.  The i40e driver enables the
new mqprio hardware offload mechanism factoring the TCs, queue
configuration and bandwidth rates by creating HW channel VSIs.
In this new mode, the priority to traffic class mapping and the user
specified queue ranges are used to configure the traffic class when the
'mode' option is set to 'channel'. This is achieved by creating HW
channels(VSI). A new channel is created for each of the traffic class
configuration offloaded via mqprio framework except for the first TC (TC0)
which is for the main VSI. TC0 for the main VSI is also reconfigured as
per user provided queue parameters. Finally, bandwidth rate limits are set
on these traffic classes through the shaper attribute by sending these
rates in addition to the number of TCs and the queue configurations.

Colin Ian King makes an array of constant values "constant".

Alan fixes and issue where on some firmware versions, we were failing to
actually fill out the phy_types which caused ethtool to not report any
link types.  Also hardened against a potentially malicious VF by not
letting the VF to reset itself after requesting to change the number of
queues (via ethtool), let the PF reset the VF to institute the requested
changes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14 18:49:42 -07:00
Cong Wang
6578759447 net_sched: fix a compile warning in act_ife
Apparently ife_meta_id2name() is only called when
CONFIG_MODULES is defined.

This fixes:

net/sched/act_ife.c:251:20: warning: ‘ife_meta_id2name’ defined but not used [-Wunused-function]
 static const char *ife_meta_id2name(u32 metaid)
                    ^~~~~~~~~~~~~~~~

Fixes: d3f24ba895 ("net sched actions: fix module auto-loading")
Cc: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14 18:43:45 -07:00
Amritha Nambiar
4e8b86c062 mqprio: Introduce new hardware offload mode and shaper in mqprio
The offload types currently supported in mqprio are 0 (no offload) and
1 (offload only TCs) by setting these values for the 'hw' option. If
offloads are supported by setting the 'hw' option to 1, the default
offload mode is 'dcb' where only the TC values are offloaded to the
device. This patch introduces a new hardware offload mode called
'channel' with 'hw' set to 1 in mqprio which makes full use of the
mqprio options, the TCs, the queue configurations and the QoS parameters
for the TCs. This is achieved through a new netlink attribute for the
'mode' option which takes values such as 'dcb' (default) and 'channel'.
The 'channel' mode also supports QoS attributes for traffic class such as
minimum and maximum values for bandwidth rate limits.

This patch enables configuring additional HW shaper attributes associated
with a traffic class. Currently the shaper for bandwidth rate limiting is
supported which takes options such as minimum and maximum bandwidth rates
and are offloaded to the hardware in the 'channel' mode. The min and max
limits for bandwidth rates are provided by the user along with the TCs
and the queue configurations when creating the mqprio qdisc. The interface
can be extended to support new HW shapers in future through the 'shaper'
attribute.

Introduces a new data structure 'tc_mqprio_qopt_offload' for offloading
mqprio queue options and use this to be shared between the kernel and
device driver. This contains a copy of the existing data structure
for mqprio queue options. This new data structure can be extended when
adding new attributes for traffic class such as mode, shaper, shaper
parameters (bandwidth rate limits). The existing data structure for mqprio
queue options will be shared between the kernel and userspace.

Example:
  queues 4@0 4@4 hw 1 mode channel shaper bw_rlimit\
  min_rate 1Gbit 2Gbit max_rate 4Gbit 5Gbit

To dump the bandwidth rates:

qdisc mqprio 804a: root  tc 2 map 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
             queues:(0:3) (4:7)
             mode:channel
             shaper:bw_rlimit   min_rate:1Gbit 2Gbit   max_rate:4Gbit 5Gbit

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-13 13:23:35 -07:00
Alexander Aring
aa9fd9a325 sched: act: ife: update parameters via rcu handling
This patch changes the parameter updating via RCU and not protected by a
spinlock anymore. This reduce the time that the spinlock is being held.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-12 22:23:03 -07:00
Alexander Aring
ced273eacf sched: act: ife: migrate to use per-cpu counters
This patch migrates the current counter handling which is protected by a
spinlock to a per-cpu counter handling. This reduce the time where the
spinlock is being held.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-12 22:23:03 -07:00
Alexander Aring
734534e9a8 sched: act: ife: move encode/decode check to init
This patch adds the check of the two possible ife handlings encode
and decode to the init callback. The decode value is for usability
aspect and used in userspace code only. The current code offers encode
else decode only. This patch avoids any other option than this.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-12 22:23:02 -07:00
Roman Mashak
d3f24ba895 net sched actions: fix module auto-loading
Macro __stringify_1() can stringify a macro argument, however IFE_META_*
are enums, so they never expand, however request_module expects an integer
in IFE module name, so as a result it always fails to auto-load.

Fixes: ef6980b6be ("introduce IFE action")
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-12 22:13:20 -07:00
Roman Mashak
8f04748016 net sched actions: change IFE modules alias names
Make style of module alias name consistent with other subsystems in kernel,
for example net devices.

Fixes: 084e2f6566 ("Support to encoding decoding skb mark on IFE action")
Fixes: 200e10f469 ("Support to encoding decoding skb prio on IFE action")
Fixes: 408fbc22ef ("net sched ife action: Introduce skb tcindex metadata encap decap")
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-12 22:13:20 -07:00
Jiri Pirko
7578d7b45e net: sched: remove unused tcf_exts_get_dev helper and cls_flower->egress_dev
The helper and the struct field ares no longer used by any code,
so remove them.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-11 20:15:43 -07:00
Jiri Pirko
717503b9cf net: sched: convert cls_flower->egress_dev users to tc_setup_cb_egdev infra
The only user of cls_flower->egress_dev is mlx5. So do the conversion
there alongside with the code originating the call in cls_flower
function fl_hw_replace_filter to the newly introduced egress device
callback infrastucture.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-11 20:15:43 -07:00
Jiri Pirko
b3f55bdda8 net: sched: introduce per-egress action device callbacks
Introduce infrastructure that allows drivers to register callbacks that
are called whenever tc would offload inserted rule and specified device
acts as tc action egress device.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-11 20:15:43 -07:00
Jiri Pirko
843e79d05a net: sched: make tc_action_ops->get_dev return dev and avoid passing net
Return dev directly, NULL if not possible. That is enough.

Makes no sense to pass struct net * to get_dev op, as there is only one
net possible, the one the action was created in. So just store it in
mirred priv and use directly.

Rename the mirred op callback function.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-11 20:15:42 -07:00
Eric Dumazet
18a4c0eab2 net: add rb_to_skb() and other rb tree helpers
Geeralize private netem_rb_to_skb()

TCP rtx queue will soon be converted to rb-tree,
so we will need skb_rbtree_walk() helpers.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 00:28:53 +01:00
Simon Horman
a38402bc50 flow_dissector: dissect tunnel info
Move dissection of tunnel info from the flower classifier to the flow
dissector where all other dissection occurs.  This should not have any
behavioural affect on other users of the flow dissector.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-02 11:06:07 -07:00
Colin Ian King
b1c49d1420 net_sched: remove redundant assignment to ret
The assignment of -EINVAL to variable ret is redundant as it
is being overwritten on the following error exit paths or
to the return value from the following call to basic_set_parms.
Fix this up by removing it. Cleans up clang warning message:

net/sched/cls_basic.c:185:2: warning: Value stored to 'err' is never read

Fixes: 1d8134fea2 ("net_sched: use idr to allocate basic filter handles")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-01 04:08:00 +01:00
Cong Wang
e7614370d6 net_sched: use idr to allocate u32 filter handles
Instead of calling u32_lookup_ht() in a loop to find
a unused handle, just switch to idr API to allocate
new handles. u32 filters are special as the handle
could contain a hash table id and a key id, so we
need two IDR to allocate each of them.

Cc: Chris Mi <chrism@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-28 09:43:41 -07:00
Cong Wang
1d8134fea2 net_sched: use idr to allocate basic filter handles
Instead of calling basic_get() in a loop to find
a unused handle, just switch to idr API to allocate
new handles.

Cc: Chris Mi <chrism@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-28 09:43:41 -07:00
Cong Wang
76cf546c28 net_sched: use idr to allocate bpf filter handles
Instead of calling cls_bpf_get() in a loop to find
a unused handle, just switch to idr API to allocate
new handles.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Chris Mi <chrism@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-28 09:43:41 -07:00
Daniel Borkmann
6aaae2b6c4 bpf: rename bpf_compute_data_end into bpf_compute_data_pointers
Just do the rename into bpf_compute_data_pointers() as we'll add
one more pointer here to recompute.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-26 13:36:44 -07:00
Eric Dumazet
3aa605f28b sch_netem: faster rb tree removal
While running TCP tests involving netem storing millions of packets,
I had the idea to speed up tfifo_reset() and did experiments.

I tried the rbtree_postorder_for_each_entry_safe() method that is
used in skb_rbtree_purge() but discovered it was slower than the
current tfifo_reset() method.

I measured time taken to release skbs with three occupation levels :
10^4, 10^5 and 10^6 skbs with three methods :

1) (current 'naive' method)

	while ((p = rb_first(&q->t_root))) {
		struct sk_buff *skb = netem_rb_to_skb(p);

		rb_erase(p, &q->t_root);
		rtnl_kfree_skbs(skb, skb);
	}

2) Use rb_next() instead of rb_first() in the loop :

	p = rb_first(&q->t_root);
	while (p) {
		struct sk_buff *skb = netem_rb_to_skb(p);

		p = rb_next(p);
		rb_erase(&skb->rbnode, &q->t_root);
		rtnl_kfree_skbs(skb, skb);
	}

3) "optimized" method using rbtree_postorder_for_each_entry_safe()

	struct sk_buff *skb, *next;

	rbtree_postorder_for_each_entry_safe(skb, next,
					     &q->t_root, rbnode) {
               rtnl_kfree_skbs(skb, skb);
	}
	q->t_root = RB_ROOT;

Results :

method_1:while (rb_first()) rb_erase() 10000 skbs in 690378 ns (69 ns per skb)
method_2:rb_first; while (p) { p = rb_next(p); ...}  10000 skbs in 541846 ns (54 ns per skb)
method_3:rbtree_postorder_for_each_entry_safe() 10000 skbs in 868307 ns (86 ns per skb)

method_1:while (rb_first()) rb_erase() 99996 skbs in 7804021 ns (78 ns per skb)
method_2:rb_first; while (p) { p = rb_next(p); ...}  100000 skbs in 5942456 ns (59 ns per skb)
method_3:rbtree_postorder_for_each_entry_safe() 100000 skbs in 11584940 ns (115 ns per skb)

method_1:while (rb_first()) rb_erase() 1000000 skbs in 108577838 ns (108 ns per skb)
method_2:rb_first; while (p) { p = rb_next(p); ...}  1000000 skbs in 82619635 ns (82 ns per skb)
method_3:rbtree_postorder_for_each_entry_safe() 1000000 skbs in 127328743 ns (127 ns per skb)

Method 2) is simply faster, probably because it maintains a smaller
working size set.

Note that this is the method we use in tcp_ofo_queue() already.

I will also change skb_rbtree_purge() in a second patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-25 20:31:32 -07:00
David S. Miller
1f8d31d189 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-09-23 10:16:53 -07:00
Cong Wang
fe2502e49b net_sched: remove cls_flower idr on failure
Fixes: c15ab236d6 ("net/sched: Change cls_flower to use IDR")
Cc: Chris Mi <chrism@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-21 15:13:52 -07:00
Konstantin Khlebnikov
21f4d5cc25 net_sched/hfsc: fix curve activation in hfsc_change_class()
If real-time or fair-share curves are enabled in hfsc_change_class()
class isn't inserted into rb-trees yet. Thus init_ed() and init_vf()
must be called in place of update_ed() and update_vf().

Remove isn't required because for now curves cannot be disabled.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-21 11:56:32 -07:00
Konstantin Khlebnikov
c8e1812960 net_sched: always reset qdisc backlog in qdisc_reset()
SKB stored in qdisc->gso_skb also counted into backlog.

Some qdiscs don't reset backlog to zero in ->reset(),
for example sfq just dequeue and free all queued skb.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-21 11:56:32 -07:00
Cong Wang
752fbcc334 net_sched: no need to free qdisc in RCU callback
gen estimator has been rewritten in commit 1c0d32fde5
("net_sched: gen_estimator: complete rewrite of rate estimators"),
the caller no longer needs to wait for a grace period. So this
patch gets rid of it.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-19 16:30:03 -07:00
Eric Dumazet
bffa72cf7f net: sk_buff rbnode reorg
skb->rbnode shares space with skb->next, skb->prev and skb->tstamp

Current uses (TCP receive ofo queue and netem) need to save/restore
tstamp, while skb->dev is either NULL (TCP) or a constant for a given
queue (netem).

Since we plan using an RB tree for TCP retransmit queue to speedup SACK
processing with large BDP, this patch exchanges skb->dev and
skb->tstamp.

This saves some overhead in both TCP and netem.

v2: removes the swtstamp field from struct tcp_skb_cb

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Wei Wang <weiwan@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-19 15:20:22 -07:00
Eric Dumazet
3c75f6ee13 net_sched: sch_htb: add per class overlimits counter
HTB qdisc overlimits counter is properly increased, but we have no per
class counter, meaning it is difficult to diagnose HTB problems.

This patch adds this counter, visible in "tc -s class show dev eth0",
with current iproute2.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-18 19:12:14 -07:00
Colin Ian King
7e5dd53f6e net_sched: use explicit size of struct tcmsg, remove need to declare tcm
Pointer tcm is being initialized and is never read, it is only being used
to determine the size of struct tcmsg.  Clean this up by removing
variable tcm and explicitly using the sizeof struct tcmsg rather than *tcm.
Cleans up clang warning:

warning: Value stored to 'tcm' during its initialization is never read

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-18 16:52:26 -07:00
Davide Caratti
3ff4cbec87 net/sched: cls_matchall: fix crash when used with classful qdisc
this script, edited from Linux Advanced Routing and Traffic Control guide

tc q a dev en0 root handle 1: htb default a
tc c a dev en0 parent 1:  classid 1:1 htb rate 6mbit burst 15k
tc c a dev en0 parent 1:1 classid 1:a htb rate 5mbit ceil 6mbit burst 15k
tc c a dev en0 parent 1:1 classid 1:b htb rate 1mbit ceil 6mbit burst 15k
tc f a dev en0 parent 1:0 prio 1 $clsname $clsargs classid 1:b
ping $address -c1
tc -s c s dev en0

classifies traffic to 1:b or 1:a, depending on whether the packet matches
or not the pattern $clsargs of filter $clsname. However, when $clsname is
'matchall', a systematic crash can be observed in htb_classify(). HTB and
classful qdiscs don't assign initial value to struct tcf_result, but then
they expect it to contain valid values after filters have been run. Thus,
current 'matchall' ignores the TCA_MATCHALL_CLASSID attribute, configured
by user, and makes HTB (and classful qdiscs) dereference random pointers.

By assigning head->res to *res in mall_classify(), before the actions are
invoked, we fix this crash and enable TCA_MATCHALL_CLASSID functionality,
that had no effect on 'matchall' classifier since its first introduction.

BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1460213
Reported-by: Jiri Benc <jbenc@redhat.com>
Fixes: b87f7936a9 ("net/sched: introduce Match-all classifier")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-18 16:37:36 -07:00
Jiri Pirko
255cd50f20 net: sched: fix use-after-free in tcf_action_destroy and tcf_del_walker
Recent commit d7fb60b9ca ("net_sched: get rid of tcfa_rcu") removed
freeing in call_rcu, which changed already existing hard-to-hit
race condition into 100% hit:

[  598.599825] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[  598.607782] IP: tcf_action_destroy+0xc0/0x140

Or:

[   40.858924] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[   40.862840] IP: tcf_generic_walker+0x534/0x820

Fix this by storing the ops and use them directly for module_put call.

Fixes: a85a970af2 ("net_sched: move tc_action into tcf_common")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-13 09:34:08 -07:00
Cong Wang
1697c4bb52 net_sched: carefully handle tcf_block_put()
As pointed out by Jiri, there is still a race condition between
tcf_block_put() and tcf_chain_destroy() in a RCU callback. There
is no way to make it correct without proper locking or synchronization,
because both operate on a shared list.

Locking is hard, because the only lock we can pick here is a spinlock,
however, in tc_dump_tfilter() we iterate this list with a sleeping
function called (tcf_chain_dump()), which makes using a lock to protect
chain_list almost impossible.

Jiri suggested the idea of holding a refcnt before flushing, this works
because it guarantees us there would be no parallel tcf_chain_destroy()
during the loop, therefore the race condition is gone. But we have to
be very careful with proper synchronization with RCU callbacks.

Suggested-by: Jiri Pirko <jiri@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-12 20:41:02 -07:00
Cong Wang
e2ef754453 net_sched: fix reference counting of tc filter chain
This patch fixes the following ugliness of tc filter chain refcnt:

a) tp proto should hold a refcnt to the chain too. This significantly
   simplifies the logic.

b) Chain 0 is no longer special, it is created with refcnt=1 like any
   other chains. All the ugliness in tcf_chain_put() can be gone!

c) No need to handle the flushing oddly, because block still holds
   chain 0, it can not be released, this guarantees block is the last
   user.

d) The race condition with RCU callbacks is easier to handle with just
   a rcu_barrier(). Much easier to understand, nothing to hide. Thanks
   to the previous patch. Please see also the comments in code.

e) Make the code understandable by humans, much less error-prone.

Fixes: 744a4cf63e ("net: sched: fix use after free when tcf_chain_destroy is called multiple times")
Fixes: 5bc1701881 ("net: sched: introduce multichain support for filters")
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-12 20:41:02 -07:00
Cong Wang
d7fb60b9ca net_sched: get rid of tcfa_rcu
gen estimator has been rewritten in commit 1c0d32fde5
("net_sched: gen_estimator: complete rewrite of rate estimators"),
the caller is no longer needed to wait for a grace period.
So this patch gets rid of it.

This also completely closes a race condition between action free
path and filter chain add/remove path for the following patch.
Because otherwise the nested RCU callback can't be caught by
rcu_barrier().

Please see also the comments in code.

Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-12 20:41:02 -07:00
Josh Hunt
230cfd2dbc net/sched: fix pointer check in gen_handle
Fixes sparse warning about pointer in gen_handle:
net/sched/cls_rsvp.h:392:40: warning: Using plain integer as NULL pointer

Fixes: 8113c09567 ("net_sched: use void pointer for filter handle")
Signed-off-by: Josh Hunt <johunt@akamai.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-11 14:34:52 -07:00
Jiri Pirko
80532384af net: sched: fix memleak for chain zero
There's a memleak happening for chain 0. The thing is, chain 0 needs to
be always present, not created on demand. Therefore tcf_block_get upon
creation of block calls the tcf_chain_create function directly. The
chain is created with refcnt == 1, which is not correct in this case and
causes the memleak. So move the refcnt increment into tcf_chain_get
function even for the case when chain needs to be created.

Reported-by: Jakub Kicinski <kubakici@wp.pl>
Fixes: 5bc1701881 ("net: sched: introduce multichain support for filters")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Tested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-07 19:17:20 -07:00
Gao Feng
39ad1297a2 sched: Use __qdisc_drop instead of kfree_skb in sch_prio and sch_qfq
The commit 520ac30f45 ("net_sched: drop packets after root qdisc lock
is released) made a big change of tc for performance. There are two points
left in sch_prio and sch_qfq which are not changed with that commit. Now
enhance them now with __qdisc_drop.

Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-06 21:20:07 -07:00
Jakub Kicinski
2c8468dcf8 net: sched: don't use GFP_KERNEL under spin lock
The new TC IDR code uses GFP_KERNEL under spin lock.  Which leads
to:

[  582.621091] BUG: sleeping function called from invalid context at ../mm/slab.h:416
[  582.629721] in_atomic(): 1, irqs_disabled(): 0, pid: 3379, name: tc
[  582.636939] 2 locks held by tc/3379:
[  582.641049]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff910354ce>] rtnetlink_rcv_msg+0x92e/0x1400
[  582.650958]  #1:  (&(&tn->idrinfo->lock)->rlock){+.-.+.}, at: [<ffffffff9110a5e0>] tcf_idr_create+0x2f0/0x8e0
[  582.662217] Preemption disabled at:
[  582.662222] [<ffffffff9110a5e0>] tcf_idr_create+0x2f0/0x8e0
[  582.672592] CPU: 9 PID: 3379 Comm: tc Tainted: G        W       4.13.0-rc7-debug-00648-g43503a79b9f0 #287
[  582.683432] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 11/08/2016
[  582.691937] Call Trace:
...
[  582.742460]  kmem_cache_alloc+0x286/0x540
[  582.747055]  radix_tree_node_alloc.constprop.6+0x4a/0x450
[  582.753209]  idr_get_free_cmn+0x627/0xf80
...
[  582.815525]  idr_alloc_cmn+0x1a8/0x270
...
[  582.833804]  tcf_idr_create+0x31b/0x8e0
...

Try to preallocate the memory with idr_prealloc(GFP_KERNEL)
(as suggested by Eric Dumazet), and change the allocation
flags under spin lock.

Fixes: 65a206c01e ("net/sched: Change act_api and act_xxx modules to use IDR")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-05 14:48:29 -07:00
David S. Miller
6026e043d0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three cases of simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01 17:42:05 -07:00
Cong Wang
07d79fc7d9 net_sched: add reverse binding for tc class
TC filters when used as classifiers are bound to TC classes.
However, there is a hidden difference when adding them in different
orders:

1. If we add tc classes before its filters, everything is fine.
   Logically, the classes exist before we specify their ID's in
   filters, it is easy to bind them together, just as in the current
   code base.

2. If we add tc filters before the tc classes they bind, we have to
   do dynamic lookup in fast path. What's worse, this happens all
   the time not just once, because on fast path tcf_result is passed
   on stack, there is no way to propagate back to the one in tc filters.

This hidden difference hurts performance silently if we have many tc
classes in hierarchy.

This patch intends to close this gap by doing the reverse binding when
we create a new class, in this case we can actually search all the
filters in its parent, match and fixup by classid. And because
tcf_result is specific to each type of tc filter, we have to introduce
a new ops for each filter to tell how to bind the class.

Note, we still can NOT totally get rid of those class lookup in
->enqueue() because cgroup and flow filters have no way to determine
the classid at setup time, they still have to go through dynamic lookup.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-31 11:40:52 -07:00
Nikolay Aleksandrov
c2d6511e6a sch_tbf: fix two null pointer dereferences on init failure
sch_tbf calls qdisc_watchdog_cancel() in both its ->reset and ->destroy
callbacks but it may fail before the timer is initialized due to missing
options (either not supplied by user-space or set as a default qdisc),
also q->qdisc is used by ->reset and ->destroy so we need it initialized.

Reproduce:
$ sysctl net.core.default_qdisc=tbf
$ ip l set ethX up

Crash log:
[  959.160172] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[  959.160323] IP: qdisc_reset+0xa/0x5c
[  959.160400] PGD 59cdb067
[  959.160401] P4D 59cdb067
[  959.160466] PUD 59ccb067
[  959.160532] PMD 0
[  959.160597]
[  959.160706] Oops: 0000 [#1] SMP
[  959.160778] Modules linked in: sch_tbf sch_sfb sch_prio sch_netem
[  959.160891] CPU: 2 PID: 1562 Comm: ip Not tainted 4.13.0-rc6+ #62
[  959.160998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[  959.161157] task: ffff880059c9a700 task.stack: ffff8800376d0000
[  959.161263] RIP: 0010:qdisc_reset+0xa/0x5c
[  959.161347] RSP: 0018:ffff8800376d3610 EFLAGS: 00010286
[  959.161531] RAX: ffffffffa001b1dd RBX: ffff8800373a2800 RCX: 0000000000000000
[  959.161733] RDX: ffffffff8215f160 RSI: ffffffff8215f160 RDI: 0000000000000000
[  959.161939] RBP: ffff8800376d3618 R08: 00000000014080c0 R09: 00000000ffffffff
[  959.162141] R10: ffff8800376d3578 R11: 0000000000000020 R12: ffffffffa001d2c0
[  959.162343] R13: ffff880037538000 R14: 00000000ffffffff R15: 0000000000000001
[  959.162546] FS:  00007fcc5126b740(0000) GS:ffff88005d900000(0000) knlGS:0000000000000000
[  959.162844] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  959.163030] CR2: 0000000000000018 CR3: 000000005abc4000 CR4: 00000000000406e0
[  959.163233] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  959.163436] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  959.163638] Call Trace:
[  959.163788]  tbf_reset+0x19/0x64 [sch_tbf]
[  959.163957]  qdisc_destroy+0x8b/0xe5
[  959.164119]  qdisc_create_dflt+0x86/0x94
[  959.164284]  ? dev_activate+0x129/0x129
[  959.164449]  attach_one_default_qdisc+0x36/0x63
[  959.164623]  netdev_for_each_tx_queue+0x3d/0x48
[  959.164795]  dev_activate+0x4b/0x129
[  959.164957]  __dev_open+0xe7/0x104
[  959.165118]  __dev_change_flags+0xc6/0x15c
[  959.165287]  dev_change_flags+0x25/0x59
[  959.165451]  do_setlink+0x30c/0xb3f
[  959.165613]  ? check_chain_key+0xb0/0xfd
[  959.165782]  rtnl_newlink+0x3a4/0x729
[  959.165947]  ? rtnl_newlink+0x117/0x729
[  959.166121]  ? ns_capable_common+0xd/0xb1
[  959.166288]  ? ns_capable+0x13/0x15
[  959.166450]  rtnetlink_rcv_msg+0x188/0x197
[  959.166617]  ? rcu_read_unlock+0x3e/0x5f
[  959.166783]  ? rtnl_newlink+0x729/0x729
[  959.166948]  netlink_rcv_skb+0x6c/0xce
[  959.167113]  rtnetlink_rcv+0x23/0x2a
[  959.167273]  netlink_unicast+0x103/0x181
[  959.167439]  netlink_sendmsg+0x326/0x337
[  959.167607]  sock_sendmsg_nosec+0x14/0x3f
[  959.167772]  sock_sendmsg+0x29/0x2e
[  959.167932]  ___sys_sendmsg+0x209/0x28b
[  959.168098]  ? do_raw_spin_unlock+0xcd/0xf8
[  959.168267]  ? _raw_spin_unlock+0x27/0x31
[  959.168432]  ? __handle_mm_fault+0x651/0xdb1
[  959.168602]  ? check_chain_key+0xb0/0xfd
[  959.168773]  __sys_sendmsg+0x45/0x63
[  959.168934]  ? __sys_sendmsg+0x45/0x63
[  959.169100]  SyS_sendmsg+0x19/0x1b
[  959.169260]  entry_SYSCALL_64_fastpath+0x23/0xc2
[  959.169432] RIP: 0033:0x7fcc5097e690
[  959.169592] RSP: 002b:00007ffd0d5c7b48 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[  959.169887] RAX: ffffffffffffffda RBX: ffffffff810d278c RCX: 00007fcc5097e690
[  959.170089] RDX: 0000000000000000 RSI: 00007ffd0d5c7b90 RDI: 0000000000000003
[  959.170292] RBP: ffff8800376d3f98 R08: 0000000000000001 R09: 0000000000000003
[  959.170494] R10: 00007ffd0d5c7910 R11: 0000000000000246 R12: 0000000000000006
[  959.170697] R13: 000000000066f1a0 R14: 00007ffd0d5cfc40 R15: 0000000000000000
[  959.170900]  ? trace_hardirqs_off_caller+0xa7/0xcf
[  959.171076] Code: 00 41 c7 84 24 14 01 00 00 00 00 00 00 41 c7 84 24
98 00 00 00 00 00 00 00 41 5c 41 5d 41 5e 5d c3 66 66 66 66 90 55 48 89
e5 53 <48> 8b 47 18 48 89 fb 48 8b 40 48 48 85 c0 74 02 ff d0 48 8b bb
[  959.171637] RIP: qdisc_reset+0xa/0x5c RSP: ffff8800376d3610
[  959.171821] CR2: 0000000000000018

Fixes: 87b60cfacf ("net_sched: fix error recovery at qdisc creation")
Fixes: 0fbbeb1ba4 ("[PKT_SCHED]: Fix missing qdisc_destroy() in qdisc_create_dflt()")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-30 15:26:12 -07:00
Nikolay Aleksandrov
e232657661 sch_sfq: fix null pointer dereference on init failure
Currently only a memory allocation failure can lead to this, so let's
initialize the timer first.

Fixes: 6529eaba33 ("net: sched: introduce tcf block infractructure")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-30 15:26:12 -07:00
Nikolay Aleksandrov
634576a184 sch_netem: avoid null pointer deref on init failure
netem can fail in ->init due to missing options (either not supplied by
user-space or used as a default qdisc) causing a timer->base null
pointer deref in its ->destroy() and ->reset() callbacks.

Reproduce:
$ sysctl net.core.default_qdisc=netem
$ ip l set ethX up

Crash log:
[ 1814.846943] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 1814.847181] IP: hrtimer_active+0x17/0x8a
[ 1814.847270] PGD 59c34067
[ 1814.847271] P4D 59c34067
[ 1814.847337] PUD 37374067
[ 1814.847403] PMD 0
[ 1814.847468]
[ 1814.847582] Oops: 0000 [#1] SMP
[ 1814.847655] Modules linked in: sch_netem(O) sch_fq_codel(O)
[ 1814.847761] CPU: 3 PID: 1573 Comm: ip Tainted: G           O 4.13.0-rc6+ #62
[ 1814.847884] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 1814.848043] task: ffff88003723a700 task.stack: ffff88005adc8000
[ 1814.848235] RIP: 0010:hrtimer_active+0x17/0x8a
[ 1814.848407] RSP: 0018:ffff88005adcb590 EFLAGS: 00010246
[ 1814.848590] RAX: 0000000000000000 RBX: ffff880058e359d8 RCX: 0000000000000000
[ 1814.848793] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880058e359d8
[ 1814.848998] RBP: ffff88005adcb5b0 R08: 00000000014080c0 R09: 00000000ffffffff
[ 1814.849204] R10: ffff88005adcb660 R11: 0000000000000020 R12: 0000000000000000
[ 1814.849410] R13: ffff880058e359d8 R14: 00000000ffffffff R15: 0000000000000001
[ 1814.849616] FS:  00007f733bbca740(0000) GS:ffff88005d980000(0000) knlGS:0000000000000000
[ 1814.849919] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1814.850107] CR2: 0000000000000000 CR3: 0000000059f0d000 CR4: 00000000000406e0
[ 1814.850313] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1814.850518] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1814.850723] Call Trace:
[ 1814.850875]  hrtimer_try_to_cancel+0x1a/0x93
[ 1814.851047]  hrtimer_cancel+0x15/0x20
[ 1814.851211]  qdisc_watchdog_cancel+0x12/0x14
[ 1814.851383]  netem_reset+0xe6/0xed [sch_netem]
[ 1814.851561]  qdisc_destroy+0x8b/0xe5
[ 1814.851723]  qdisc_create_dflt+0x86/0x94
[ 1814.851890]  ? dev_activate+0x129/0x129
[ 1814.852057]  attach_one_default_qdisc+0x36/0x63
[ 1814.852232]  netdev_for_each_tx_queue+0x3d/0x48
[ 1814.852406]  dev_activate+0x4b/0x129
[ 1814.852569]  __dev_open+0xe7/0x104
[ 1814.852730]  __dev_change_flags+0xc6/0x15c
[ 1814.852899]  dev_change_flags+0x25/0x59
[ 1814.853064]  do_setlink+0x30c/0xb3f
[ 1814.853228]  ? check_chain_key+0xb0/0xfd
[ 1814.853396]  ? check_chain_key+0xb0/0xfd
[ 1814.853565]  rtnl_newlink+0x3a4/0x729
[ 1814.853728]  ? rtnl_newlink+0x117/0x729
[ 1814.853905]  ? ns_capable_common+0xd/0xb1
[ 1814.854072]  ? ns_capable+0x13/0x15
[ 1814.854234]  rtnetlink_rcv_msg+0x188/0x197
[ 1814.854404]  ? rcu_read_unlock+0x3e/0x5f
[ 1814.854572]  ? rtnl_newlink+0x729/0x729
[ 1814.854737]  netlink_rcv_skb+0x6c/0xce
[ 1814.854902]  rtnetlink_rcv+0x23/0x2a
[ 1814.855064]  netlink_unicast+0x103/0x181
[ 1814.855230]  netlink_sendmsg+0x326/0x337
[ 1814.855398]  sock_sendmsg_nosec+0x14/0x3f
[ 1814.855584]  sock_sendmsg+0x29/0x2e
[ 1814.855747]  ___sys_sendmsg+0x209/0x28b
[ 1814.855912]  ? do_raw_spin_unlock+0xcd/0xf8
[ 1814.856082]  ? _raw_spin_unlock+0x27/0x31
[ 1814.856251]  ? __handle_mm_fault+0x651/0xdb1
[ 1814.856421]  ? check_chain_key+0xb0/0xfd
[ 1814.856592]  __sys_sendmsg+0x45/0x63
[ 1814.856755]  ? __sys_sendmsg+0x45/0x63
[ 1814.856923]  SyS_sendmsg+0x19/0x1b
[ 1814.857083]  entry_SYSCALL_64_fastpath+0x23/0xc2
[ 1814.857256] RIP: 0033:0x7f733b2dd690
[ 1814.857419] RSP: 002b:00007ffe1d3387d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[ 1814.858238] RAX: ffffffffffffffda RBX: ffffffff810d278c RCX: 00007f733b2dd690
[ 1814.858445] RDX: 0000000000000000 RSI: 00007ffe1d338820 RDI: 0000000000000003
[ 1814.858651] RBP: ffff88005adcbf98 R08: 0000000000000001 R09: 0000000000000003
[ 1814.858856] R10: 00007ffe1d3385a0 R11: 0000000000000246 R12: 0000000000000002
[ 1814.859060] R13: 000000000066f1a0 R14: 00007ffe1d3408d0 R15: 0000000000000000
[ 1814.859267]  ? trace_hardirqs_off_caller+0xa7/0xcf
[ 1814.859446] Code: 10 55 48 89 c7 48 89 e5 e8 45 a1 fb ff 31 c0 5d c3
31 c0 c3 66 66 66 66 90 55 48 89 e5 41 56 41 55 41 54 53 49 89 fd 49 8b
45 30 <4c> 8b 20 41 8b 5c 24 38 31 c9 31 d2 48 c7 c7 50 8e 1d 82 41 89
[ 1814.860022] RIP: hrtimer_active+0x17/0x8a RSP: ffff88005adcb590
[ 1814.860214] CR2: 0000000000000000

Fixes: 87b60cfacf ("net_sched: fix error recovery at qdisc creation")
Fixes: 0fbbeb1ba4 ("[PKT_SCHED]: Fix missing qdisc_destroy() in qdisc_create_dflt()")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-30 15:26:11 -07:00
Nikolay Aleksandrov
30c31d746d sch_fq_codel: avoid double free on init failure
It is very unlikely to happen but the backlogs memory allocation
could fail and will free q->flows, but then ->destroy() will free
q->flows too. For correctness remove the first free and let ->destroy
clean up.

Fixes: 87b60cfacf ("net_sched: fix error recovery at qdisc creation")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-30 15:26:11 -07:00
Nikolay Aleksandrov
3501d05992 sch_cbq: fix null pointer dereferences on init failure
CBQ can fail on ->init by wrong nl attributes or simply for missing any,
f.e. if it's set as a default qdisc then TCA_OPTIONS (opt) will be NULL
when it is activated. The first thing init does is parse opt but it will
dereference a null pointer if used as a default qdisc, also since init
failure at default qdisc invokes ->reset() which cancels all timers then
we'll also dereference two more null pointers (timer->base) as they were
never initialized.

To reproduce:
$ sysctl net.core.default_qdisc=cbq
$ ip l set ethX up

Crash log of the first null ptr deref:
[44727.907454] BUG: unable to handle kernel NULL pointer dereference at (null)
[44727.907600] IP: cbq_init+0x27/0x205
[44727.907676] PGD 59ff4067
[44727.907677] P4D 59ff4067
[44727.907742] PUD 59c70067
[44727.907807] PMD 0
[44727.907873]
[44727.907982] Oops: 0000 [#1] SMP
[44727.908054] Modules linked in:
[44727.908126] CPU: 1 PID: 21312 Comm: ip Not tainted 4.13.0-rc6+ #60
[44727.908235] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[44727.908477] task: ffff88005ad42700 task.stack: ffff880037214000
[44727.908672] RIP: 0010:cbq_init+0x27/0x205
[44727.908838] RSP: 0018:ffff8800372175f0 EFLAGS: 00010286
[44727.909018] RAX: ffffffff816c3852 RBX: ffff880058c53800 RCX: 0000000000000000
[44727.909222] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffff8800372175f8
[44727.909427] RBP: ffff880037217650 R08: ffffffff81b0f380 R09: 0000000000000000
[44727.909631] R10: ffff880037217660 R11: 0000000000000020 R12: ffffffff822a44c0
[44727.909835] R13: ffff880058b92000 R14: 00000000ffffffff R15: 0000000000000001
[44727.910040] FS:  00007ff8bc583740(0000) GS:ffff88005d880000(0000) knlGS:0000000000000000
[44727.910339] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[44727.910525] CR2: 0000000000000000 CR3: 00000000371e5000 CR4: 00000000000406e0
[44727.910731] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[44727.910936] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[44727.911141] Call Trace:
[44727.911291]  ? lockdep_init_map+0xb6/0x1ba
[44727.911461]  ? qdisc_alloc+0x14e/0x187
[44727.911626]  qdisc_create_dflt+0x7a/0x94
[44727.911794]  ? dev_activate+0x129/0x129
[44727.911959]  attach_one_default_qdisc+0x36/0x63
[44727.912132]  netdev_for_each_tx_queue+0x3d/0x48
[44727.912305]  dev_activate+0x4b/0x129
[44727.912468]  __dev_open+0xe7/0x104
[44727.912631]  __dev_change_flags+0xc6/0x15c
[44727.912799]  dev_change_flags+0x25/0x59
[44727.912966]  do_setlink+0x30c/0xb3f
[44727.913129]  ? check_chain_key+0xb0/0xfd
[44727.913294]  ? check_chain_key+0xb0/0xfd
[44727.913463]  rtnl_newlink+0x3a4/0x729
[44727.913626]  ? rtnl_newlink+0x117/0x729
[44727.913801]  ? ns_capable_common+0xd/0xb1
[44727.913968]  ? ns_capable+0x13/0x15
[44727.914131]  rtnetlink_rcv_msg+0x188/0x197
[44727.914300]  ? rcu_read_unlock+0x3e/0x5f
[44727.914465]  ? rtnl_newlink+0x729/0x729
[44727.914630]  netlink_rcv_skb+0x6c/0xce
[44727.914796]  rtnetlink_rcv+0x23/0x2a
[44727.914956]  netlink_unicast+0x103/0x181
[44727.915122]  netlink_sendmsg+0x326/0x337
[44727.915291]  sock_sendmsg_nosec+0x14/0x3f
[44727.915459]  sock_sendmsg+0x29/0x2e
[44727.915619]  ___sys_sendmsg+0x209/0x28b
[44727.915784]  ? do_raw_spin_unlock+0xcd/0xf8
[44727.915954]  ? _raw_spin_unlock+0x27/0x31
[44727.916121]  ? __handle_mm_fault+0x651/0xdb1
[44727.916290]  ? check_chain_key+0xb0/0xfd
[44727.916461]  __sys_sendmsg+0x45/0x63
[44727.916626]  ? __sys_sendmsg+0x45/0x63
[44727.916792]  SyS_sendmsg+0x19/0x1b
[44727.916950]  entry_SYSCALL_64_fastpath+0x23/0xc2
[44727.917125] RIP: 0033:0x7ff8bbc96690
[44727.917286] RSP: 002b:00007ffc360991e8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[44727.917579] RAX: ffffffffffffffda RBX: ffffffff810d278c RCX: 00007ff8bbc96690
[44727.917783] RDX: 0000000000000000 RSI: 00007ffc36099230 RDI: 0000000000000003
[44727.917987] RBP: ffff880037217f98 R08: 0000000000000001 R09: 0000000000000003
[44727.918190] R10: 00007ffc36098fb0 R11: 0000000000000246 R12: 0000000000000006
[44727.918393] R13: 000000000066f1a0 R14: 00007ffc360a12e0 R15: 0000000000000000
[44727.918597]  ? trace_hardirqs_off_caller+0xa7/0xcf
[44727.918774] Code: 41 5f 5d c3 66 66 66 66 90 55 48 8d 56 04 45 31 c9
49 c7 c0 80 f3 b0 81 48 89 e5 41 55 41 54 53 48 89 fb 48 8d 7d a8 48 83
ec 48 <0f> b7 0e be 07 00 00 00 83 e9 04 e8 e6 f7 d8 ff 85 c0 0f 88 bb
[44727.919332] RIP: cbq_init+0x27/0x205 RSP: ffff8800372175f0
[44727.919516] CR2: 0000000000000000

Fixes: 0fbbeb1ba4 ("[PKT_SCHED]: Fix missing qdisc_destroy() in qdisc_create_dflt()")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-30 15:26:11 -07:00
Nikolay Aleksandrov
3bdac362a2 sch_hfsc: fix null pointer deref and double free on init failure
Depending on where ->init fails we can get a null pointer deref due to
uninitialized hires timer (watchdog) or a double free of the qdisc hash
because it is already freed by ->destroy().

Fixes: 8d55373875 ("net/sched/hfsc: allocate tcf block for hfsc root class")
Fixes: 87b60cfacf ("net_sched: fix error recovery at qdisc creation")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-30 15:26:11 -07:00
Nikolay Aleksandrov
32db864d33 sch_hhf: fix null pointer dereference on init failure
If sch_hhf fails in its ->init() function (either due to wrong
user-space arguments as below or memory alloc failure of hh_flows) it
will do a null pointer deref of q->hh_flows in its ->destroy() function.

To reproduce the crash:
$ tc qdisc add dev eth0 root hhf quantum 2000000 non_hh_weight 10000000

Crash log:
[  690.654882] BUG: unable to handle kernel NULL pointer dereference at (null)
[  690.655565] IP: hhf_destroy+0x48/0xbc
[  690.655944] PGD 37345067
[  690.655948] P4D 37345067
[  690.656252] PUD 58402067
[  690.656554] PMD 0
[  690.656857]
[  690.657362] Oops: 0000 [#1] SMP
[  690.657696] Modules linked in:
[  690.658032] CPU: 3 PID: 920 Comm: tc Not tainted 4.13.0-rc6+ #57
[  690.658525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[  690.659255] task: ffff880058578000 task.stack: ffff88005acbc000
[  690.659747] RIP: 0010:hhf_destroy+0x48/0xbc
[  690.660146] RSP: 0018:ffff88005acbf9e0 EFLAGS: 00010246
[  690.660601] RAX: 0000000000000000 RBX: 0000000000000020 RCX: 0000000000000000
[  690.661155] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff821f63f0
[  690.661710] RBP: ffff88005acbfa08 R08: ffffffff81b10a90 R09: 0000000000000000
[  690.662267] R10: 00000000f42b7019 R11: ffff880058578000 R12: 00000000ffffffea
[  690.662820] R13: ffff8800372f6400 R14: 0000000000000000 R15: 0000000000000000
[  690.663769] FS:  00007f8ae5e8b740(0000) GS:ffff88005d980000(0000) knlGS:0000000000000000
[  690.667069] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  690.667965] CR2: 0000000000000000 CR3: 0000000058523000 CR4: 00000000000406e0
[  690.668918] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  690.669945] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  690.671003] Call Trace:
[  690.671743]  qdisc_create+0x377/0x3fd
[  690.672534]  tc_modify_qdisc+0x4d2/0x4fd
[  690.673324]  rtnetlink_rcv_msg+0x188/0x197
[  690.674204]  ? rcu_read_unlock+0x3e/0x5f
[  690.675091]  ? rtnl_newlink+0x729/0x729
[  690.675877]  netlink_rcv_skb+0x6c/0xce
[  690.676648]  rtnetlink_rcv+0x23/0x2a
[  690.677405]  netlink_unicast+0x103/0x181
[  690.678179]  netlink_sendmsg+0x326/0x337
[  690.678958]  sock_sendmsg_nosec+0x14/0x3f
[  690.679743]  sock_sendmsg+0x29/0x2e
[  690.680506]  ___sys_sendmsg+0x209/0x28b
[  690.681283]  ? __handle_mm_fault+0xc7d/0xdb1
[  690.681915]  ? check_chain_key+0xb0/0xfd
[  690.682449]  __sys_sendmsg+0x45/0x63
[  690.682954]  ? __sys_sendmsg+0x45/0x63
[  690.683471]  SyS_sendmsg+0x19/0x1b
[  690.683974]  entry_SYSCALL_64_fastpath+0x23/0xc2
[  690.684516] RIP: 0033:0x7f8ae529d690
[  690.685016] RSP: 002b:00007fff26d2d6b8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[  690.685931] RAX: ffffffffffffffda RBX: ffffffff810d278c RCX: 00007f8ae529d690
[  690.686573] RDX: 0000000000000000 RSI: 00007fff26d2d700 RDI: 0000000000000003
[  690.687047] RBP: ffff88005acbff98 R08: 0000000000000001 R09: 0000000000000000
[  690.687519] R10: 00007fff26d2d480 R11: 0000000000000246 R12: 0000000000000002
[  690.687996] R13: 0000000001258070 R14: 0000000000000001 R15: 0000000000000000
[  690.688475]  ? trace_hardirqs_off_caller+0xa7/0xcf
[  690.688887] Code: 00 00 e8 2a 02 ae ff 49 8b bc 1d 60 02 00 00 48 83
c3 08 e8 19 02 ae ff 48 83 fb 20 75 dc 45 31 f6 4d 89 f7 4d 03 bd 20 02
00 00 <49> 8b 07 49 39 c7 75 24 49 83 c6 10 49 81 fe 00 40 00 00 75 e1
[  690.690200] RIP: hhf_destroy+0x48/0xbc RSP: ffff88005acbf9e0
[  690.690636] CR2: 0000000000000000

Fixes: 87b60cfacf ("net_sched: fix error recovery at qdisc creation")
Fixes: 10239edf86 ("net-qdisc-hhf: Heavy-Hitter Filter (HHF) qdisc")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-30 15:26:11 -07:00
Nikolay Aleksandrov
e89d469e3b sch_multiq: fix double free on init failure
The below commit added a call to ->destroy() on init failure, but multiq
still frees ->queues on error in init, but ->queues is also freed by
->destroy() thus we get double free and corrupted memory.

Very easy to reproduce (eth0 not multiqueue):
$ tc qdisc add dev eth0 root multiq
RTNETLINK answers: Operation not supported
$ ip l add dumdum type dummy
(crash)

Trace log:
[ 3929.467747] general protection fault: 0000 [#1] SMP
[ 3929.468083] Modules linked in:
[ 3929.468302] CPU: 3 PID: 967 Comm: ip Not tainted 4.13.0-rc6+ #56
[ 3929.468625] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 3929.469124] task: ffff88003716a700 task.stack: ffff88005872c000
[ 3929.469449] RIP: 0010:__kmalloc_track_caller+0x117/0x1be
[ 3929.469746] RSP: 0018:ffff88005872f6a0 EFLAGS: 00010246
[ 3929.470042] RAX: 00000000000002de RBX: 0000000058a59000 RCX: 00000000000002df
[ 3929.470406] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff821f7020
[ 3929.470770] RBP: ffff88005872f6e8 R08: 000000000001f010 R09: 0000000000000000
[ 3929.471133] R10: ffff88005872f730 R11: 0000000000008cdd R12: ff006d75646d7564
[ 3929.471496] R13: 00000000014000c0 R14: ffff88005b403c00 R15: ffff88005b403c00
[ 3929.471869] FS:  00007f0b70480740(0000) GS:ffff88005d980000(0000) knlGS:0000000000000000
[ 3929.472286] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3929.472677] CR2: 00007ffcee4f3000 CR3: 0000000059d45000 CR4: 00000000000406e0
[ 3929.473209] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3929.474109] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3929.474873] Call Trace:
[ 3929.475337]  ? kstrdup_const+0x23/0x25
[ 3929.475863]  kstrdup+0x2e/0x4b
[ 3929.476338]  kstrdup_const+0x23/0x25
[ 3929.478084]  __kernfs_new_node+0x28/0xbc
[ 3929.478478]  kernfs_new_node+0x35/0x55
[ 3929.478929]  kernfs_create_link+0x23/0x76
[ 3929.479478]  sysfs_do_create_link_sd.isra.2+0x85/0xd7
[ 3929.480096]  sysfs_create_link+0x33/0x35
[ 3929.480649]  device_add+0x200/0x589
[ 3929.481184]  netdev_register_kobject+0x7c/0x12f
[ 3929.481711]  register_netdevice+0x373/0x471
[ 3929.482174]  rtnl_newlink+0x614/0x729
[ 3929.482610]  ? rtnl_newlink+0x17f/0x729
[ 3929.483080]  rtnetlink_rcv_msg+0x188/0x197
[ 3929.483533]  ? rcu_read_unlock+0x3e/0x5f
[ 3929.483984]  ? rtnl_newlink+0x729/0x729
[ 3929.484420]  netlink_rcv_skb+0x6c/0xce
[ 3929.484858]  rtnetlink_rcv+0x23/0x2a
[ 3929.485291]  netlink_unicast+0x103/0x181
[ 3929.485735]  netlink_sendmsg+0x326/0x337
[ 3929.486181]  sock_sendmsg_nosec+0x14/0x3f
[ 3929.486614]  sock_sendmsg+0x29/0x2e
[ 3929.486973]  ___sys_sendmsg+0x209/0x28b
[ 3929.487340]  ? do_raw_spin_unlock+0xcd/0xf8
[ 3929.487719]  ? _raw_spin_unlock+0x27/0x31
[ 3929.488092]  ? __handle_mm_fault+0x651/0xdb1
[ 3929.488471]  ? check_chain_key+0xb0/0xfd
[ 3929.488847]  __sys_sendmsg+0x45/0x63
[ 3929.489206]  ? __sys_sendmsg+0x45/0x63
[ 3929.489576]  SyS_sendmsg+0x19/0x1b
[ 3929.489901]  entry_SYSCALL_64_fastpath+0x23/0xc2
[ 3929.490172] RIP: 0033:0x7f0b6fb93690
[ 3929.490423] RSP: 002b:00007ffcee4ed588 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[ 3929.490881] RAX: ffffffffffffffda RBX: ffffffff810d278c RCX: 00007f0b6fb93690
[ 3929.491198] RDX: 0000000000000000 RSI: 00007ffcee4ed5d0 RDI: 0000000000000003
[ 3929.491521] RBP: ffff88005872ff98 R08: 0000000000000001 R09: 0000000000000000
[ 3929.491801] R10: 00007ffcee4ed350 R11: 0000000000000246 R12: 0000000000000002
[ 3929.492075] R13: 000000000066f1a0 R14: 00007ffcee4f5680 R15: 0000000000000000
[ 3929.492352]  ? trace_hardirqs_off_caller+0xa7/0xcf
[ 3929.492590] Code: 8b 45 c0 48 8b 45 b8 74 17 48 8b 4d c8 83 ca ff 44
89 ee 4c 89 f7 e8 83 ca ff ff 49 89 c4 eb 49 49 63 56 20 48 8d 48 01 4d
8b 06 <49> 8b 1c 14 48 89 c2 4c 89 e0 65 49 0f c7 08 0f 94 c0 83 f0 01
[ 3929.493335] RIP: __kmalloc_track_caller+0x117/0x1be RSP: ffff88005872f6a0

Fixes: 87b60cfacf ("net_sched: fix error recovery at qdisc creation")
Fixes: f07d150129 ("multiq: Further multiqueue cleanup")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-30 15:26:11 -07:00
Nikolay Aleksandrov
88c2ace69d sch_htb: fix crash on init failure
The commit below added a call to the ->destroy() callback for all qdiscs
which failed in their ->init(), but some were not prepared for such
change and can't handle partially initialized qdisc. HTB is one of them
and if any error occurs before the qdisc watchdog timer and qdisc work are
initialized then we can hit either a null ptr deref (timer->base) when
canceling in ->destroy or lockdep error info about trying to register
a non-static key and a stack dump. So to fix these two move the watchdog
timer and workqueue init before anything that can err out.
To reproduce userspace needs to send broken htb qdisc create request,
tested with a modified tc (q_htb.c).

Trace log:
[ 2710.897602] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 2710.897977] IP: hrtimer_active+0x17/0x8a
[ 2710.898174] PGD 58fab067
[ 2710.898175] P4D 58fab067
[ 2710.898353] PUD 586c0067
[ 2710.898531] PMD 0
[ 2710.898710]
[ 2710.899045] Oops: 0000 [#1] SMP
[ 2710.899232] Modules linked in:
[ 2710.899419] CPU: 1 PID: 950 Comm: tc Not tainted 4.13.0-rc6+ #54
[ 2710.899646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 2710.900035] task: ffff880059ed2700 task.stack: ffff88005ad4c000
[ 2710.900262] RIP: 0010:hrtimer_active+0x17/0x8a
[ 2710.900467] RSP: 0018:ffff88005ad4f960 EFLAGS: 00010246
[ 2710.900684] RAX: 0000000000000000 RBX: ffff88003701e298 RCX: 0000000000000000
[ 2710.900933] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88003701e298
[ 2710.901177] RBP: ffff88005ad4f980 R08: 0000000000000001 R09: 0000000000000001
[ 2710.901419] R10: ffff88005ad4f800 R11: 0000000000000400 R12: 0000000000000000
[ 2710.901663] R13: ffff88003701e298 R14: ffffffff822a4540 R15: ffff88005ad4fac0
[ 2710.901907] FS:  00007f2f5e90f740(0000) GS:ffff88005d880000(0000) knlGS:0000000000000000
[ 2710.902277] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2710.902500] CR2: 0000000000000000 CR3: 0000000058ca3000 CR4: 00000000000406e0
[ 2710.902744] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2710.902977] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2710.903180] Call Trace:
[ 2710.903332]  hrtimer_try_to_cancel+0x1a/0x93
[ 2710.903504]  hrtimer_cancel+0x15/0x20
[ 2710.903667]  qdisc_watchdog_cancel+0x12/0x14
[ 2710.903866]  htb_destroy+0x2e/0xf7
[ 2710.904097]  qdisc_create+0x377/0x3fd
[ 2710.904330]  tc_modify_qdisc+0x4d2/0x4fd
[ 2710.904511]  rtnetlink_rcv_msg+0x188/0x197
[ 2710.904682]  ? rcu_read_unlock+0x3e/0x5f
[ 2710.904849]  ? rtnl_newlink+0x729/0x729
[ 2710.905017]  netlink_rcv_skb+0x6c/0xce
[ 2710.905183]  rtnetlink_rcv+0x23/0x2a
[ 2710.905345]  netlink_unicast+0x103/0x181
[ 2710.905511]  netlink_sendmsg+0x326/0x337
[ 2710.905679]  sock_sendmsg_nosec+0x14/0x3f
[ 2710.905847]  sock_sendmsg+0x29/0x2e
[ 2710.906010]  ___sys_sendmsg+0x209/0x28b
[ 2710.906176]  ? do_raw_spin_unlock+0xcd/0xf8
[ 2710.906346]  ? _raw_spin_unlock+0x27/0x31
[ 2710.906514]  ? __handle_mm_fault+0x651/0xdb1
[ 2710.906685]  ? check_chain_key+0xb0/0xfd
[ 2710.906855]  __sys_sendmsg+0x45/0x63
[ 2710.907018]  ? __sys_sendmsg+0x45/0x63
[ 2710.907185]  SyS_sendmsg+0x19/0x1b
[ 2710.907344]  entry_SYSCALL_64_fastpath+0x23/0xc2

Note that probably this bug goes further back because the default qdisc
handling always calls ->destroy on init failure too.

Fixes: 87b60cfacf ("net_sched: fix error recovery at qdisc creation")
Fixes: 0fbbeb1ba4 ("[PKT_SCHED]: Fix missing qdisc_destroy() in qdisc_create_dflt()")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-30 15:26:11 -07:00
Chris Mi
65a206c01e net/sched: Change act_api and act_xxx modules to use IDR
Typically, each TC filter has its own action. All the actions of the
same type are saved in its hash table. But the hash buckets are too
small that it degrades to a list. And the performance is greatly
affected. For example, it takes about 0m11.914s to insert 64K rules.
If we convert the hash table to IDR, it only takes about 0m1.500s.
The improvement is huge.

But please note that the test result is based on previous patch that
cls_flower uses IDR.

Signed-off-by: Chris Mi <chrism@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-30 14:38:51 -07:00
Chris Mi
c15ab236d6 net/sched: Change cls_flower to use IDR
Currently, all filters with the same priority are linked in a doubly
linked list. Every filter should have a unique handle. To make the
handle unique, we need to iterate the list every time to see if the
handle exists or not when inserting a new filter. It is time-consuming.
For example, it takes about 5m3.169s to insert 64K rules.

This patch changes cls_flower to use IDR. With this patch, it
takes about 0m1.127s to insert 64K rules. The improvement is huge.

But please note that in this testing, all filters share the same action.
If every filter has a unique action, that is another bottleneck.
Follow-up patch in this patchset addresses that.

Signed-off-by: Chris Mi <chrism@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-30 14:36:44 -07:00
Alexander Aring
b522ed6ed6 act_ife: use registered ife_type as fallback
This patch handles a default IFE type if it's not given by user space
netlink api. The default IFE type will be the registered ethertype by
IEEE for IFE ForCES.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-29 15:14:18 -07:00
Gao Feng
f9ab7425b3 sched: sfq: drop packets after root qdisc lock is released
The commit 520ac30f45 ("net_sched: drop packets after root qdisc lock
is released) made a big change of tc for performance. But there are
some points which are not changed in SFQ enqueue operation.
1. Fail to find the SFQ hash slot;
2. When the queue is full;

Now use qdisc_drop instead free skb directly.

Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-28 15:51:42 -07:00
WANG Cong
3cd904ecbb net_sched: kill u32_node pointer in Qdisc
It is ugly to hide a u32-filter-specific pointer inside Qdisc,
this breaks the TC layers:

1. Qdisc is a generic representation, should not have any specific
   data of any type

2. Qdisc layer is above filter layer, should only save filters in
   the list of struct tcf_proto.

This pointer is used as the head of the chain of u32 hash tables,
that is struct tc_u_hnode, because u32 filter is very special,
it allows to create multiple hash tables within one qdisc and
across multiple u32 filters.

Instead of using this ugly pointer, we can just save it in a global
hash table key'ed by (dev ifindex, qdisc handle), therefore we can
still treat it as a per qdisc basis data structure conceptually.

Of course, because of network namespaces, this key is not unique
at all, but it is fine as we already have a pointer to Qdisc in
struct tc_u_common, we can just compare the pointers when collision.

And this only affects slow paths, has no impact to fast path,
thanks to the pointer ->tp_c.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25 17:19:10 -07:00
WANG Cong
143976ce99 net_sched: remove tc class reference counting
For TC classes, their ->get() and ->put() are always paired, and the
reference counting is completely useless, because:

1) For class modification and dumping paths, we already hold RTNL lock,
   so all of these ->get(),->change(),->put() are atomic.

2) For filter bindiing/unbinding, we use other reference counter than
   this one, and they should have RTNL lock too.

3) For ->qlen_notify(), it is special because it is called on ->enqueue()
   path, but we already hold qdisc tree lock there, and we hold this
   tree lock when graft or delete the class too, so it should not be gone
   or changed until we release the tree lock.

Therefore, this patch removes ->get() and ->put(), but:

1) Adds a new ->find() to find the pointer to a class by classid, no
   refcnt.

2) Move the original class destroy upon the last refcnt into ->delete(),
   right after releasing tree lock. This is fine because the class is
   already removed from hash when holding the lock.

For those who also use ->put() as ->unbind(), just rename them to reflect
this change.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25 17:19:10 -07:00
WANG Cong
14546ba1e5 net_sched: introduce tclass_del_notify()
Like for TC actions, ->delete() is a special case,
we have to prepare and fill the notification before delete
otherwise would get use-after-free after we remove the
reference count.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25 17:19:10 -07:00
WANG Cong
27d7f07c49 net_sched: get rid of more forward declarations
This is not needed if we move them up properly.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25 17:19:10 -07:00
Eric Dumazet
551143d8d9 net_sched: fix a refcount_t issue with noop_qdisc
syzkaller reported a refcount_t warning [1]

Issue here is that noop_qdisc refcnt was never really considered as
a true refcount, since qdisc_destroy() found TCQ_F_BUILTIN set :

if (qdisc->flags & TCQ_F_BUILTIN ||
    !refcount_dec_and_test(&qdisc->refcnt)))
	return;

Meaning that all atomic_inc() we did on noop_qdisc.refcnt were not
really needed, but harmless until refcount_t came.

To fix this problem, we simply need to not increment noop_qdisc.refcnt,
since we never decrement it.

[1]
refcount_t: increment on 0; use-after-free.
------------[ cut here ]------------
WARNING: CPU: 0 PID: 21754 at lib/refcount.c:152 refcount_inc+0x47/0x50 lib/refcount.c:152
Kernel panic - not syncing: panic_on_warn set ...

CPU: 0 PID: 21754 Comm: syz-executor7 Not tainted 4.13.0-rc6+ #20
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:52
 panic+0x1e4/0x417 kernel/panic.c:180
 __warn+0x1c4/0x1d9 kernel/panic.c:541
 report_bug+0x211/0x2d0 lib/bug.c:183
 fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:190
 do_trap_no_signal arch/x86/kernel/traps.c:224 [inline]
 do_trap+0x260/0x390 arch/x86/kernel/traps.c:273
 do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:310
 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:323
 invalid_op+0x1e/0x30 arch/x86/entry/entry_64.S:846
RIP: 0010:refcount_inc+0x47/0x50 lib/refcount.c:152
RSP: 0018:ffff8801c43477a0 EFLAGS: 00010282
RAX: 000000000000002b RBX: ffffffff86093c14 RCX: 0000000000000000
RDX: 000000000000002b RSI: ffffffff8159314e RDI: ffffed0038868ee8
RBP: ffff8801c43477a8 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff86093ac0
R13: 0000000000000001 R14: ffff8801d0f3bac0 R15: dffffc0000000000
 attach_default_qdiscs net/sched/sch_generic.c:792 [inline]
 dev_activate+0x7d3/0xaa0 net/sched/sch_generic.c:833
 __dev_open+0x227/0x330 net/core/dev.c:1380
 __dev_change_flags+0x695/0x990 net/core/dev.c:6726
 dev_change_flags+0x88/0x140 net/core/dev.c:6792
 dev_ifsioc+0x5a6/0x930 net/core/dev_ioctl.c:256
 dev_ioctl+0x2bc/0xf90 net/core/dev_ioctl.c:554
 sock_do_ioctl+0x94/0xb0 net/socket.c:968
 sock_ioctl+0x2c2/0x440 net/socket.c:1058
 vfs_ioctl fs/ioctl.c:45 [inline]
 do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685
 SYSC_ioctl fs/ioctl.c:700 [inline]
 SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691

Fixes: 7b93640502 ("net, sched: convert Qdisc.refcnt from atomic_t to refcount_t")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Reshetova, Elena <elena.reshetova@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24 21:28:24 -07:00
Jiri Pirko
30d65e8f96 net: sched: don't do tcf_chain_flush from tcf_chain_destroy
tcf_chain_flush needs to be called with RTNL. However, on
free_tcf->
 tcf_action_goto_chain_fini->
  tcf_chain_put->
   tcf_chain_destroy->
    tcf_chain_flush
callpath, it is called without RTNL.
This issue was notified by following warning:

[  155.599052] WARNING: suspicious RCU usage
[  155.603165] 4.13.0-rc5jiri+ #54 Not tainted
[  155.607456] -----------------------------
[  155.611561] net/sched/cls_api.c:195 suspicious rcu_dereference_protected() usage!

Since on this callpath, the chain is guaranteed to be already empty
by check in tcf_chain_put, move the tcf_chain_flush call out and call it
only where it is needed - into tcf_block_put.

Fixes: db50514f9a ("net: sched: add termination action to allow goto chain")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-22 14:39:58 -07:00
Jiri Pirko
744a4cf63e net: sched: fix use after free when tcf_chain_destroy is called multiple times
The goto_chain termination action takes a reference of a chain. In that
case, there is an issue when block_put is called tcf_chain_destroy
directly. The follo-up call of tcf_chain_put by goto_chain action free
works with memory that is already freed. This was caught by kasan:

[  220.337908] BUG: KASAN: use-after-free in tcf_chain_put+0x1b/0x50
[  220.344103] Read of size 4 at addr ffff88036d1f2cec by task systemd-journal/261
[  220.353047] CPU: 0 PID: 261 Comm: systemd-journal Not tainted 4.13.0-rc5jiri+ #54
[  220.360661] Hardware name: Mellanox Technologies Ltd. Mellanox switch/Mellanox x86 mezzanine board, BIOS 4.6.5 08/02/2016
[  220.371784] Call Trace:
[  220.374290]  <IRQ>
[  220.376355]  dump_stack+0xd5/0x150
[  220.391485]  print_address_description+0x86/0x410
[  220.396308]  kasan_report+0x181/0x4c0
[  220.415211]  tcf_chain_put+0x1b/0x50
[  220.418949]  free_tcf+0x95/0xc0

So allow tcf_chain_destroy to be called multiple times, free only in
case the reference count drops to 0.

Fixes: 5bc1701881 ("net: sched: introduce multichain support for filters")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-22 14:39:58 -07:00
Eric Dumazet
9695fe6f21 net: sched: use kvmalloc() for class hash tables
High order GFP_KERNEL allocations can stress the host badly.

Use modern kvmalloc_array()/kvfree() instead of custom
allocations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-22 14:37:24 -07:00
David S. Miller
e2a7c34fb2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-08-21 17:06:42 -07:00
Xin Long
4f8a881acc net: sched: fix NULL pointer dereference when action calls some targets
As we know in some target's checkentry it may dereference par.entryinfo
to check entry stuff inside. But when sched action calls xt_check_target,
par.entryinfo is set with NULL. It would cause kernel panic when calling
some targets.

It can be reproduce with:
  # tc qd add dev eth1 ingress handle ffff:
  # tc filter add dev eth1 parent ffff: u32 match u32 0 0 action xt \
    -j ECN --ecn-tcp-remove

It could also crash kernel when using target CLUSTERIP or TPROXY.

By now there's no proper value for par.entryinfo in ipt_init_target,
but it can not be set with NULL. This patch is to void all these
panics by setting it with an ipt_entry obj with all members = 0.

Note that this issue has been there since the very beginning.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-18 16:25:49 -07:00
Jiri Pirko
acc8b31665 net: sched: fix p_filter_chain check in tcf_chain_flush
The dereference before check is wrong and leads to an oops when
p_filter_chain is NULL. The check needs to be done on the pointer to
prevent NULL dereference.

Fixes: f93e1cdcf4 ("net/sched: fix filter flushing")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-18 10:19:11 -07:00
Jiri Pirko
d978db8dbe net: sched: cls_flower: fix ndo_setup_tc type for stats call
I made a stupid mistake using TC_CLSFLOWER_STATS instead of
TC_SETUP_CLSFLOWER. Funny thing is that both are defined as "2" so it
actually did not cause any harm. Anyway, fixing it now.

Fixes: 2572ac53c4 ("net: sched: make type an argument for ndo_setup_tc")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16 14:29:41 -07:00
Jesper Dangaard Brouer
e543002f77 qdisc: add tracepoint qdisc:qdisc_dequeue for dequeued SKBs
The main purpose of this tracepoint is to monitor bulk dequeue
in the network qdisc layer, as it cannot be deducted from the
existing qdisc stats.

The txq_state can be used for determining the reason for zero packet
dequeues, see enum netdev_queue_state_t.

Notice all packets doesn't necessary activate this tracepoint. As
qdiscs with flag TCQ_F_CAN_BYPASS, can directly invoke
sch_direct_xmit() when qdisc_qlen is zero.

Remember that perf record supports filters like:

 perf record -e qdisc:qdisc_dequeue \
  --filter 'ifindex == 4 && (packets > 1 || txq_state > 0)'

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16 14:10:10 -07:00
Konstantin Khlebnikov
6b0355f4a9 net_sched/hfsc: opencode trivial set_active() and set_passive()
Any move comment abount update_vf() into right place.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16 10:55:34 -07:00
Konstantin Khlebnikov
959466588a net_sched: call qlen_notify only if child qdisc is empty
This callback is used for deactivating class in parent qdisc.
This is cheaper to test queue length right here.

Also this allows to catch draining screwed backlog and prevent
second deactivation of already inactive parent class which will
crash kernel for sure. Kernel with print warning at destruction
of child qdisc where no packets but backlog is not zero.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16 10:55:34 -07:00
David S. Miller
463910e2df Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-08-15 20:23:23 -07:00
Konstantin Khlebnikov
c90e95147c net_sched: remove warning from qdisc_hash_add
It was added in commit e57a784d8c ("pkt_sched: set root qdisc
before change() in attach_default_qdiscs()") to hide duplicates
from "tc qdisc show" for incative deivices.

After 59cc1f61f ("net: sched: convert qdisc linked list to hashtable")
it triggered when classful qdisc is added to inactive device because
default qdiscs are added before switching root qdisc.

Anyway after commit ea32746953 ("net: sched: avoid duplicates in
qdisc dump") duplicates are filtered right in dumper.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-15 17:16:39 -07:00
Konstantin Khlebnikov
325d5dc3f7 net_sched/sfq: update hierarchical backlog when drop packet
When sfq_enqueue() drops head packet or packet from another queue it
have to update backlog at upper qdiscs too.

Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-15 17:16:39 -07:00
Konstantin Khlebnikov
898904226b net_sched: reset pointers to tcf blocks in classful qdiscs' destructors
Traffic filters could keep direct pointers to classes in classful qdisc,
thus qdisc destruction first removes all filters before freeing classes.
Class destruction methods also tries to free attached filters but now
this isn't safe because tcf_block_put() unlike to tcf_destroy_chain()
cannot be called second time.

This patch set class->block to NULL after first tcf_block_put() and
turn second call into no-op.

Fixes: 6529eaba33 ("net: sched: introduce tcf block infractructure")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-15 17:16:39 -07:00
Konstantin Khlebnikov
8d55373875 net/sched/hfsc: allocate tcf block for hfsc root class
Without this filters cannot be attached.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: 6529eaba33 ("net: sched: introduce tcf block infractructure")
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11 14:30:30 -07:00
Jiri Pirko
7b06e8aed2 net: sched: remove cops->tcf_cl_offload
cops->tcf_cl_offload is no longer needed, as the drivers check what they
can and cannot offload using the classid identify helpers. So remove this.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11 13:47:01 -07:00
David S. Miller
3b2b69efec Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Mainline had UFO fixes, but UFO is removed in net-next so we
take the HEAD hunks.

Minor context conflict in bcmsysport statistics bug fix.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 12:11:16 -07:00
Xin Long
96d9703050 net: sched: set xt_tgchk_param par.nft_compat as 0 in ipt_init_target
Commit 55917a21d0 ("netfilter: x_tables: add context to know if
extension runs from nft_compat") introduced a member nft_compat to
xt_tgchk_param structure.

But it didn't set it's value for ipt_init_target. With unexpected
value in par.nft_compat, it may return unexpected result in some
target's checkentry.

This patch is to set all it's fields as 0 and only initialize the
non-zero fields in ipt_init_target.

v1->v2:
  As Wang Cong's suggestion, fix it by setting all it's fields as
  0 and only initializing the non-zero fields.

Fixes: 55917a21d0 ("netfilter: x_tables: add context to know if extension runs from nft_compat")
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 22:46:44 -07:00
Florian Westphal
b97bac64a5 rtnetlink: make rtnl_register accept a flags parameter
This change allows us to later indicate to rtnetlink core that certain
doit functions should be called without acquiring rtnl_mutex.

This change should have no effect, we simply replace the last (now
unused) calcit argument with the new flag.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 16:57:38 -07:00
David S. Miller
3118e6e19d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The UDP offload conflict is dealt with by simply taking what is
in net-next where we have removed all of the UFO handling code
entirely.

The TCP conflict was a case of local variables in a function
being removed from both net and net-next.

In netvsc we had an assignment right next to where a missing
set of u64 stats sync object inits were added.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 16:28:45 -07:00
Xin Long
ec0acb0931 net: sched: set xt_tgchk_param par.net properly in ipt_init_target
Now xt_tgchk_param par in ipt_init_target is a local varibale,
par.net is not initialized there. Later when xt_check_target
calls target's checkentry in which it may access par.net, it
would cause kernel panic.

Jaroslav found this panic when running:

  # ip link add TestIface type dummy
  # tc qd add dev TestIface ingress handle ffff:
  # tc filter add dev TestIface parent ffff: u32 match u32 0 0 \
    action xt -j CONNMARK --set-mark 4

This patch is to pass net param into ipt_init_target and set
par.net with it properly in there.

v1->v2:
  As Wang Cong pointed, I missed ipt_net_id != xt_net_id, so fix
  it by also passing net_id to __tcf_ipt_init.
v2->v3:
  Missed the fixes tag, so add it.

Fixes: ecb2421b5d ("netfilter: add and use nf_ct_netns_get/put")
Reported-by: Jaroslav Aster <jaster@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-08 20:38:00 -07:00
WANG Cong
7120371c8e net_sched: get rid of some forward declarations
If we move up tcf_fill_node() we can get rid of these
forward declarations.

Also, move down tfilter_notify_chain() to group them together.

Reported-by: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-08 18:17:39 -07:00
WANG Cong
8113c09567 net_sched: use void pointer for filter handle
Now we use 'unsigned long fh' as a pointer in every place,
it is safe to convert it to a void pointer now. This gets
rid of many casts to pointer.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 14:12:17 -07:00
WANG Cong
54df2cf819 net_sched: refactor notification code for RTM_DELTFILTER
It is confusing to use 'unsigned long fh' as both a handle
and a pointer, especially commit 9ee7837449
("net sched filters: fix notification of filter delete with proper handle").

This patch introduces tfilter_del_notify() so that we can
pass it as a pointer as before, and we don't need to check
RTM_DELTFILTER in tcf_fill_node() any more.

This prepares for the next patch.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 14:12:17 -07:00
Jiri Pirko
de4784ca03 net: sched: get rid of struct tc_to_netdev
Get rid of struct tc_to_netdev which is now just unnecessary container
and rather pass per-type structures down to drivers directly.
Along with that, consolidate the naming of per-type structure variables
in cls_*.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 09:42:37 -07:00
Jiri Pirko
d7c1c8d2e5 net: sched: move prio into cls_common
prio is not cls_flower specific, but it is meaningful for all
classifiers. Seems that only mlxsw cares about the value. Obviously,
cls offload in other drivers is broken.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 09:42:37 -07:00
Jiri Pirko
5fd9fc4e20 net: sched: push cls related args into cls_common structure
As ndo_setup_tc is generic offload op for whole tc subsystem, does not
really make sense to have cls-specific args. So move them under
cls_common structurure which is embedded in all cls structs.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 09:42:37 -07:00
Jiri Pirko
3e0e826643 net: sched: make egress_dev flag part of flower offload struct
Since this is specific to flower now, make it part of the flower offload
struct.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 09:42:35 -07:00
Jiri Pirko
ade9b65884 net: sched: rename TC_SETUP_MATCHALL to TC_SETUP_CLSMATCHALL
In order to be aligned with the rest of the types, rename
TC_SETUP_MATCHALL to TC_SETUP_CLSMATCHALL.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 09:42:35 -07:00
Jiri Pirko
2572ac53c4 net: sched: make type an argument for ndo_setup_tc
Since the type is always present, push it to be a separate argument to
ndo_setup_tc. On the way, name the type enum and use it for arg type.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 09:42:35 -07:00
Jiri Pirko
9b0d4446b5 net: sched: avoid atomic swap in tcf_exts_change
tcf_exts_change is always called on newly created exts, which are not used
on fastpath. Therefore, simple struct copy is enough.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
705c709126 net: sched: cls_u32: no need to call tcf_exts_change for newly allocated struct
As the n struct was allocated right before u32_set_parms call,
no need to use tcf_exts_change to do atomic change, and we can just
fill-up the unused exts struct directly by tcf_exts_validate.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
8c98d571bb net: sched: cls_route: no need to call tcf_exts_change for newly allocated struct
As the f struct was allocated right before route4_set_parms call,
no need to use tcf_exts_change to do atomic change, and we can just
fill-up the unused exts struct directly by tcf_exts_validate.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
c09fc2e11e net: sched: cls_flow: no need to call tcf_exts_change for newly allocated struct
As the fnew struct just was allocated, so no need to use tcf_exts_change
to do atomic change, and we can just fill-up the unused exts struct
directly by tcf_exts_validate.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
8cc6251381 net: sched: cls_cgroup: no need to call tcf_exts_change for newly allocated struct
As the new struct just was allocated, so no need to use tcf_exts_change
to do atomic change, and we can just fill-up the unused exts struct
directly by tcf_exts_validate.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
6839da326d net: sched: cls_bpf: no need to call tcf_exts_change for newly allocated struct
As the prog struct was allocated right before cls_bpf_set_parms call,
no need to use tcf_exts_change to do atomic change, and we can just
fill-up the unused exts struct directly by tcf_exts_validate.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
ff1f8ca080 net: sched: cls_basic: no need to call tcf_exts_change for newly allocated struct
As the f struct was allocated right before basic_set_parms call, no need
to use tcf_exts_change to do atomic change, and we can just fill-up
the unused exts struct directly by tcf_exts_validate.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
a74cb36980 net: sched: cls_matchall: no need to call tcf_exts_change for newly allocated struct
As the head struct was allocated right before mall_set_parms call,
no need to use tcf_exts_change to do atomic change, and we can just
fill-up the unused exts struct directly by tcf_exts_validate.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
94611bff6e net: sched: cls_fw: no need to call tcf_exts_change for newly allocated struct
As the f struct was allocated right before fw_set_parms call, no need
to use tcf_exts_change to do atomic change, and we can just fill-up
the unused exts struct directly by tcf_exts_validate.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
455075292b net: sched: cls_flower: no need to call tcf_exts_change for newly allocated struct
As the f struct was allocated right before fl_set_parms call, no need
to use tcf_exts_change to do atomic change, and we can just fill-up
the unused exts struct directly by tcf_exts_validate.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
1e5003af37 net: sched: cls_fw: rename fw_change_attrs function
Since the function name is misleading since it is not changing
anything, name it similarly to other cls.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
6a725c481d net: sched: cls_bpf: rename cls_bpf_modify_existing function
The name cls_bpf_modify_existing is highly misleading, as it indeed does
not modify anything existing. It does not modify at all.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
978dfd8d14 net: sched: use tcf_exts_has_actions instead of exts->nr_actions
For check in tcf_exts_dump use tcf_exts_has_actions helper instead
of exts->nr_actions for checking if there are any actions present.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
ec1a9cca0e net: sched: remove check for number of actions in tcf_exts_exec
Leave it to tcf_action_exec to return TC_ACT_OK in case there is no
action present.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:23 -07:00
Jiri Pirko
6fc6d06e53 net: sched: remove redundant helpers tcf_exts_is_predicative and tcf_exts_is_available
These two helpers are doing the same as tcf_exts_has_actions, so remove
them and use tcf_exts_has_actions instead.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:23 -07:00
Jiri Pirko
3bcc0cec81 net: sched: change names of action number helpers to be aligned with the rest
The rest of the helpers are named tcf_exts_*, so change the name of
the action number helpers to be aligned. While at it, change to inline
functions.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:23 -07:00
Jiri Pirko
4ebc1e3cfc net: sched: remove unneeded tcf_em_tree_change
Since tcf_em_tree_validate could be always called on a newly created
filter, there is no need for this change function.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:23 -07:00
Jiri Pirko
f7ebdff757 net: sched: sch_atm: use Qdisc_class_common structure
Even if it is only for classid now, use this common struct a be aligned
with the rest of the classful qdiscs.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:23 -07:00
Jamal Hadi Salim
e62e484df0 net sched actions: add time filter for action dumping
This patch adds support for filtering based on time since last used.
When we are dumping a large number of actions it is useful to
have the option of filtering based on when the action was last
used to reduce the amount of data crossing to user space.

With this patch the user space app sets the TCA_ROOT_TIME_DELTA
attribute with the value in milliseconds with "time of interest
since now".  The kernel converts this to jiffies and does the
filtering comparison matching entries that have seen activity
since then and returns them to user space.
Old kernels and old tc continue to work in legacy mode since
they dont specify this attribute.

Some example (we have 400 actions bound to 400 filters); at
installation time. Using updated when tc setting the time of
interest to 120 seconds earlier (we see 400 actions):
prompt$ hackedtc actions ls action gact since 120000| grep index | wc -l
400

go get some coffee and wait for > 120 seconds and try again:

prompt$ hackedtc actions ls action gact since 120000 | grep index | wc -l
0

Lets see a filter bound to one of these actions:
....
filter pref 10 u32
filter pref 10 u32 fh 800: ht divisor 1
filter pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:10  (rule hit 2 success 1)
  match 7f000002/ffffffff at 12 (success 1 )
    action order 1: gact action pass
     random type none pass val 0
     index 23 ref 2 bind 1 installed 1145 sec used 802 sec
    Action statistics:
    Sent 84 bytes 1 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0
....

that coffee took long, no? It was good.

Now lets ping -c 1 127.0.0.2, then run the actions again:
prompt$ hackedtc actions ls action gact since 120 | grep index | wc -l
1

More details please:
prompt$ hackedtc -s actions ls action gact since 120000

    action order 0: gact action pass
     random type none pass val 0
     index 23 ref 2 bind 1 installed 1270 sec used 30 sec
    Action statistics:
    Sent 168 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0

And the filter?

filter pref 10 u32
filter pref 10 u32 fh 800: ht divisor 1
filter pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:10  (rule hit 4 success 2)
  match 7f000002/ffffffff at 12 (success 2 )
    action order 1: gact action pass
     random type none pass val 0
     index 23 ref 2 bind 1 installed 1324 sec used 84 sec
    Action statistics:
    Sent 168 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-30 19:28:08 -07:00
Jamal Hadi Salim
90825b23a8 net sched actions: dump more than TCA_ACT_MAX_PRIO actions per batch
When you dump hundreds of thousands of actions, getting only 32 per
dump batch even when the socket buffer and memory allocations allow
is inefficient.

With this change, the user will get as many as possibly fitting
within the given constraints available to the kernel.

The top level action TLV space is extended. An attribute
TCA_ROOT_FLAGS is used to carry flags; flag TCA_FLAG_LARGE_DUMP_ON
is set by the user indicating the user is capable of processing
these large dumps. Older user space which doesnt set this flag
doesnt get the large (than 32) batches.
The kernel uses the TCA_ROOT_COUNT attribute to tell the user how many
actions are put in a single batch. As such user space app knows how long
to iterate (independent of the type of action being dumped)
instead of hardcoded maximum of 32 thus maintaining backward compat.

Some results dumping 1.5M actions below:
first an unpatched tc which doesnt understand these features...

prompt$ time -p tc actions ls action gact | grep index | wc -l
1500000
real 1388.43
user 2.07
sys 1386.79

Now lets see a patched tc which sets the correct flags when requesting
a dump:

prompt$ time -p updatedtc actions ls action gact | grep index | wc -l
1500000
real 178.13
user 2.02
sys 176.96

That is about 8x performance improvement for tc app which sets its
receive buffer to about 32K.

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-30 19:28:08 -07:00
Jamal Hadi Salim
df823b0297 net sched actions: Use proper root attribute table for actions
Bug fix for an issue which has been around for about a decade.
We got away with it because the enumeration was larger than needed.

Fixes: 7ba699c604 ("[NET_SCHED]: Convert actions from rtnetlink to new netlink API")
Suggested-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-30 19:28:08 -07:00
David S. Miller
7a68ada6ec Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-07-21 03:38:43 +01:00
Linus Torvalds
96080f6977 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) BPF verifier signed/unsigned value tracking fix, from Daniel
    Borkmann, Edward Cree, and Josef Bacik.

 2) Fix memory allocation length when setting up calls to
    ->ndo_set_mac_address, from Cong Wang.

 3) Add a new cxgb4 device ID, from Ganesh Goudar.

 4) Fix FIB refcount handling, we have to set it's initial value before
    the configure callback (which can bump it). From David Ahern.

 5) Fix double-free in qcom/emac driver, from Timur Tabi.

 6) A bunch of gcc-7 string format overflow warning fixes from Arnd
    Bergmann.

 7) Fix link level headroom tests in ip_do_fragment(), from Vasily
    Averin.

 8) Fix chunk walking in SCTP when iterating over error and parameter
    headers. From Alexander Potapenko.

 9) TCP BBR congestion control fixes from Neal Cardwell.

10) Fix SKB fragment handling in bcmgenet driver, from Doug Berger.

11) BPF_CGROUP_RUN_PROG_SOCK_OPS needs to check for null __sk, from Cong
    Wang.

12) xmit_recursion in ppp driver needs to be per-device not per-cpu,
    from Gao Feng.

13) Cannot release skb->dst in UDP if IP options processing needs it.
    From Paolo Abeni.

14) Some netdev ioctl ifr_name[] NULL termination fixes. From Alexander
    Levin and myself.

15) Revert some rtnetlink notification changes that are causing
    regressions, from David Ahern.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (83 commits)
  net: bonding: Fix transmit load balancing in balance-alb mode
  rds: Make sure updates to cp_send_gen can be observed
  net: ethernet: ti: cpsw: Push the request_irq function to the end of probe
  ipv4: initialize fib_trie prior to register_netdev_notifier call.
  rtnetlink: allocate more memory for dev_set_mac_address()
  net: dsa: b53: Add missing ARL entries for BCM53125
  bpf: more tests for mixed signed and unsigned bounds checks
  bpf: add test for mixed signed and unsigned bounds checks
  bpf: fix up test cases with mixed signed/unsigned bounds
  bpf: allow to specify log level and reduce it for test_verifier
  bpf: fix mixed signed/unsigned derived min/max value bounds
  ipv6: avoid overflow of offset in ip6_find_1stfragopt
  net: tehuti: don't process data if it has not been copied from userspace
  Revert "rtnetlink: Do not generate notifications for CHANGEADDR event"
  net: dsa: mv88e6xxx: Enable CMODE config support for 6390X
  dt-binding: ptp: Add SoC compatibility strings for dte ptp clock
  NET: dwmac: Make dwmac reset unconditional
  net: Zero terminate ifr_name in dev_ifname().
  wireless: wext: terminate ifr name coming from userspace
  netfilter: fix netfilter_net_init() return
  ...
2017-07-20 16:33:39 -07:00
David S. Miller
880388aa3c net: Remove all references to SKB_GSO_UDP.
Such packets are no longer possible.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-17 09:52:58 -07:00
Roman Mashak
c4c4290c17 net sched actions: rename act_get_notify() to tcf_get_notify()
Make name consistent with other TC event notification routines, such as
tcf_add_notify() and tcf_del_notify()

Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-14 08:52:33 -07:00
Michal Hocko
dcda9b0471 mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
__GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
the page allocator.  This has been true but only for allocations
requests larger than PAGE_ALLOC_COSTLY_ORDER.  It has been always
ignored for smaller sizes.  This is a bit unfortunate because there is
no way to express the same semantic for those requests and they are
considered too important to fail so they might end up looping in the
page allocator for ever, similarly to GFP_NOFAIL requests.

Now that the whole tree has been cleaned up and accidental or misled
usage of __GFP_REPEAT flag has been removed for !costly requests we can
give the original flag a better name and more importantly a more useful
semantic.  Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
that the allocator would try really hard but there is no promise of a
success.  This will work independent of the order and overrides the
default allocator behavior.  Page allocator users have several levels of
guarantee vs.  cost options (take GFP_KERNEL as an example)

 - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
   attempt to free memory at all. The most light weight mode which even
   doesn't kick the background reclaim. Should be used carefully because
   it might deplete the memory and the next user might hit the more
   aggressive reclaim

 - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
   allocation without any attempt to free memory from the current
   context but can wake kswapd to reclaim memory if the zone is below
   the low watermark. Can be used from either atomic contexts or when
   the request is a performance optimization and there is another
   fallback for a slow path.

 - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
   non sleeping allocation with an expensive fallback so it can access
   some portion of memory reserves. Usually used from interrupt/bh
   context with an expensive slow path fallback.

 - GFP_KERNEL - both background and direct reclaim are allowed and the
   _default_ page allocator behavior is used. That means that !costly
   allocation requests are basically nofail but there is no guarantee of
   that behavior so failures have to be checked properly by callers
   (e.g. OOM killer victim is allowed to fail currently).

 - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
   and all allocation requests fail early rather than cause disruptive
   reclaim (one round of reclaim in this implementation). The OOM killer
   is not invoked.

 - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
   behavior and all allocation requests try really hard. The request
   will fail if the reclaim cannot make any progress. The OOM killer
   won't be triggered.

 - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
   and all allocation requests will loop endlessly until they succeed.
   This might be really dangerous especially for larger orders.

Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
because they already had their semantic.  No new users are added.
__alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
there is no progress and we have already passed the OOM point.

This means that all the reclaim opportunities have been exhausted except
the most disruptive one (the OOM killer) and a user defined fallback
behavior is more sensible than keep retrying in the page allocator.

[akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
[mhocko@suse.com: semantic fix]
  Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
[mhocko@kernel.org: address other thing spotted by Vlastimil]
  Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alex Belits <alex.belits@cavium.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: David Daney <david.daney@cavium.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: NeilBrown <neilb@suse.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:03 -07:00
Reshetova, Elena
7b93640502 net, sched: convert Qdisc.refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-04 22:35:16 +01:00
Reshetova, Elena
41c6d650f6 net: convert sock.sk_refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

This patch uses refcount_inc_not_zero() instead of
atomic_inc_not_zero_hint() due to absense of a _hint()
version of refcount API. If the hint() version must
be used, we might need to revisit API.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:08 -07:00
Reshetova, Elena
14afee4b60 net: convert sock.sk_wmem_alloc from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:08 -07:00
David S. Miller
b079115937 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
A set of overlapping changes in macvlan and the rocker
driver, nothing serious.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-30 12:43:08 -04:00
Gao Feng
c1a4872ebf net: sched: Fix one possible panic when no destroy callback
When qdisc fail to init, qdisc_create would invoke the destroy callback
to cleanup. But there is no check if the callback exists really. So it
would cause the panic if there is no real destroy callback like the qdisc
codel, fq, and so on.

Take codel as an example following:
When a malicious user constructs one invalid netlink msg, it would cause
codel_init->codel_change->nla_parse_nested failed.
Then kernel would invoke the destroy callback directly but qdisc codel
doesn't define one. It causes one panic as a result.

Now add one the check for destroy to avoid the possible panic.

Fixes: 87b60cfacf ("net_sched: fix error recovery at qdisc creation")
Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-29 12:55:12 -04:00
Daniel Borkmann
e86283071f bpf: expose prog id for cls_bpf and act_bpf
In order to be able to retrieve the attached programs from cls_bpf
and act_bpf, we need to expose the prog ids via netlink so that
an application can later on get an fd based on the id through the
BPF_PROG_GET_FD_BY_ID command, and dump related prog info via
BPF_OBJ_GET_INFO_BY_FD command for bpf(2).

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-21 15:14:23 -04:00
Jiri Benc
86087e170c net: sched: act_tunnel_key: make UDP checksum configurable
Allow requesting of zero UDP checksum for encapsulated packets. The name and
meaning of the attribute is "NO_CSUM" in order to have the same meaning of
the attribute missing and being 0.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 14:21:03 -04:00
Jiri Benc
63fe4c39d2 net: sched: act_tunnel_key: request UDP checksum by default
There's currently no way to request (outer) UDP checksum with
act_tunnel_key. This is problem especially for IPv6. Right now, tunnel_key
action with IPv6 does not work without going through hassles: both sides
have to have udp6zerocsumrx configured on the tunnel interface. This is
obviously not a good solution universally.

It makes more sense to compute the UDP checksum by default even for IPv4.
Just set the default to request the checksum when using act_tunnel_key.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 14:21:03 -04:00
David S. Miller
0ddead90b2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The conflicts were two cases of overlapping changes in
batman-adv and the qed driver.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 11:59:32 -04:00
Dan Carpenter
c4f65b09b4 net/act_pedit: fix an error code
I'm reviewing static checker warnings where we do ERR_PTR(0), which is
the same as NULL.  I'm pretty sure we intended to return ERR_PTR(-EINVAL)
here.  Sometimes these bugs lead to a NULL dereference but I don't
immediately see that problem here.

Fixes: 71d0ed7079 ("net/act_pedit: Support using offset relative to the conventional network headers")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Amir Vadai <amir@vadai.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-14 15:24:18 -04:00
WANG Cong
74030603df net_sched: move tcf_lock down after gen_replace_estimator()
Laura reported a sleep-in-atomic kernel warning inside
tcf_act_police_init() which calls gen_replace_estimator() with
spinlock protection.

It is not necessary in this case, we already have RTNL lock here
so it is enough to protect concurrent writers. For the reader,
i.e. tcf_act_police(), it needs to make decision based on this
rate estimator, in the worst case we drop more/less packets than
necessary while changing the rate in parallel, it is still acceptable.

Reported-by: Laura Abbott <labbott@redhat.com>
Reported-by: Nick Huber <nicholashuber@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-14 14:39:19 -04:00
Jiri Pirko
a5fcf8a6c9 net: propagate tc filter chain index down the ndo_setup_tc call
We need to push the chain index down to the drivers, so they have the
information to which chain the rule belongs. For now, no driver supports
multichain offload, so only chain 0 is supported. This is needed to
prevent chain squashes during offload for now. Later this will be used
to implement multichain offload.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-08 09:55:53 -04:00
Jiri Pirko
e25ea21ffa net: sched: introduce a TRAP control action
There is need to instruct the HW offloaded path to push certain matched
packets to cpu/kernel for further analysis. So this patch introduces a
new TRAP control action to TC.

For kernel datapath, this action does not make much sense. So with the
same logic as in HW, new TRAP behaves similar to STOLEN. The skb is just
dropped in the datapath (and virtually ejected to an upper level, which
does not exist in case of kernel).

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06 12:45:23 -04:00
Jiri Pirko
8ec1507dc9 net: sched: select cls when cls_act is enabled
It really makes no sense to have cls_act enabled without cls. In that
case, the cls_act code is dead. So select it.

This also fixes an issue recently reported by kbuild robot:
[linux-next:master 1326/4151] net/sched/act_api.c:37:18: error: implicit declaration of function 'tcf_chain_get'

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Fixes: db50514f9a ("net: sched: add termination action to allow goto chain")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-05 10:56:36 -04:00
Or Gerlitz
4d80cc0aaa net/sched: cls_flower: add support for matching on ip tos and ttl
Benefit from the support of ip header fields dissection and
allow users to set rules matching on ipv4 tos and ttl or
ipv6 traffic-class and hoplimit.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-04 18:12:24 -04:00
WANG Cong
367a8ce896 net_sched: only create filter chains for new filters/actions
tcf_chain_get() always creates a new filter chain if not found
in existing ones. This is totally unnecessary when we get or
delete filters, new chain should be only created for new filters
(or new actions).

Fixes: 5bc1701881 ("net: sched: introduce multichain support for filters")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-25 12:15:05 -04:00
Jiri Pirko
ee538dcea2 net: sched: cls_api: make reclassify return all the way back to the original tp
With the introduction of chain goto action, the reclassification would
cause the re-iteration of the actual chain. It makes more sense to restart
the whole thing and re-iterate starting from the original tp - start
of chain 0.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-25 12:11:49 -04:00
Jiri Pirko
fdfc7dd6ca net/sched: flower: add support for matching on tcp flags
Benefit from the support of tcp flags dissection and allow user to
insert rules matching on tcp flags.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-24 16:22:11 -04:00
Jiri Pirko
f93e1cdcf4 net/sched: fix filter flushing
When user instructs to remove all filters from chain, we cannot destroy
the chain as other actions may hold a reference. Also the put in errout
would try to destroy it again. So instead, just walk the chain and remove
all existing filters.

Fixes: 5bc1701881 ("net: sched: introduce multichain support for filters")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-23 11:00:07 -04:00
Jiri Pirko
31efcc250a net/sched: properly assign RCU pointer in tcf_chain_tp_insert/remove
*p_filter_chain is rcu-dereferenced on reader path. So here in writer,
property assign the pointer.

Fixes: 2190d1d094 ("net: sched: introduce helpers to work with filter chains")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-23 11:00:06 -04:00
David S. Miller
218b6a5b23 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-05-22 23:32:48 -04:00
Jiri Pirko
2d76b2f8b5 net: sched: cls_matchall: fix null pointer dereference
Since the head is guaranteed by the check above to be null, the call_rcu
would explode. Remove the previously logically dead code that was made
logically very much alive and kicking.

Fixes: 985538eee0 ("net/sched: remove redundant null check on head")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-22 14:54:16 -04:00
Davide Caratti
dba003067a net: use skb->csum_not_inet to identify packets needing crc32c
skb->csum_not_inet carries the indication on which algorithm is needed to
compute checksum on skb in the transmit path, when skb->ip_summed is equal
to CHECKSUM_PARTIAL. If skb carries a SCTP packet and crc32c hasn't been
yet written in L4 header, skb->csum_not_inet is assigned to 1; otherwise,
assume Internet Checksum is needed and thus set skb->csum_not_inet to 0.

Suggested-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-19 19:21:29 -04:00
David S. Miller
9d4f97f97b sch_dsmark: Fix uninitialized variable warning.
We still need to initialize err to -EINVAL for
the case where 'opt' is NULL in dsmark_init().

Fixes: 6529eaba33 ("net: sched: introduce tcf block infractructure")
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 16:04:38 -04:00
Jiri Pirko
db50514f9a net: sched: add termination action to allow goto chain
Introduce new type of termination action called "goto_chain". This allows
user to specify a chain to be processed. This action type is
then processed as a return value in tcf_classify loop in similar
way as "reclassify" is, only it does not reset to the first filter
in chain but rather reset to the first filter of the desired chain.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:22:13 -04:00
Jiri Pirko
9fb9f251d2 net: sched: push tp down to action init
Tp pointer will be needed by the next patch in order to get the chain.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:22:13 -04:00
Jiri Pirko
5bc1701881 net: sched: introduce multichain support for filters
Instead of having only one filter per block, introduce a list of chains
for every block. Create chain 0 by default. UAPI is extended so the user
can specify which chain he wants to change. If the new attribute is not
specified, chain 0 is used. That allows to maintain backward
compatibility. If chain does not exist and user wants to manipulate with
it, new chain is created with specified index. Also, when last filter is
removed from the chain, the chain is destroyed.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:22:13 -04:00
Jiri Pirko
acb31fae3b net: sched: push chain dump to a separate function
Since there will be multiple chains to dump, push chain dumping code to
a separate function.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:22:13 -04:00
Jiri Pirko
2190d1d094 net: sched: introduce helpers to work with filter chains
Introduce struct tcf_chain object and set of helpers around it. Wraps up
insertion, deletion and search in the filter chain.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:22:13 -04:00
Jiri Pirko
7961973a00 net: sched: move TC_H_MAJ macro call into tcf_auto_prio
Call the helper from the function rather than to always adjust the
return value of the function.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:22:13 -04:00
Jiri Pirko
9d36d9e545 net: sched: replace nprio by a bool to make the function more readable
The use of "nprio" variable in tc_ctl_tfilter is a bit cryptic and makes
a reader wonder what is going on for a while. So help him to understand
this priority allocation dance a litte bit better.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:22:13 -04:00
Jiri Pirko
fbe9c5b01f net: sched: rename tcf_destroy_chain helper
Make the name consistent with the rest of the helpers around.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:22:13 -04:00
Jiri Pirko
6529eaba33 net: sched: introduce tcf block infractructure
Currently, the filter chains are direcly put into the private structures
of qdiscs. In order to be able to have multiple chains per qdisc and to
allow filter chains sharing among qdiscs, there is a need for common
object that would hold the chains. This introduces such object and calls
it "tcf_block".

Helpers to get and put the blocks are provided to be called from
individual qdisc code. Also, the original filter_list pointers are left
in qdisc privs to allow the entry into tcf_block processing without any
added overhead of possible multiple pointer dereference on fast path.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:22:13 -04:00
Jiri Pirko
87d83093bf net: sched: move tc_classify function to cls_api.c
Move tc_classify function to cls_api.c where it belongs, rename it to
fit the namespace.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:22:13 -04:00
Eric Dumazet
218af599fa tcp: internal implementation for pacing
BBR congestion control depends on pacing, and pacing is
currently handled by sch_fq packet scheduler for performance reasons,
and also because implemening pacing with FQ was convenient to truly
avoid bursts.

However there are many cases where this packet scheduler constraint
is not practical.
- Many linux hosts are not focusing on handling thousands of TCP
  flows in the most efficient way.
- Some routers use fq_codel or other AQM, but still would like
  to use BBR for the few TCP flows they initiate/terminate.

This patch implements an automatic fallback to internal pacing.

Pacing is requested either by BBR or use of SO_MAX_PACING_RATE option.

If sch_fq happens to be in the egress path, pacing is delegated to
the qdisc, otherwise pacing is done by TCP itself.

One advantage of pacing from TCP stack is to get more precise rtt
estimations, and less work done from TX completion, since TCP Small
queue limits are not generally hit. Setups with single TX queue but
many cpus might even benefit from this.

Note that unlike sch_fq, we do not take into account header sizes.
Taking care of these headers would add additional complexity for
no practical differences in behavior.

Some performance numbers using 800 TCP_STREAM flows rate limited to
~48 Mbit per second on 40Gbit NIC.

If MQ+pfifo_fast is used on the NIC :

$ sar -n DEV 1 5 | grep eth
14:48:44         eth0 725743.00 2932134.00  46776.76 4335184.68      0.00      0.00      1.00
14:48:45         eth0 725349.00 2932112.00  46751.86 4335158.90      0.00      0.00      0.00
14:48:46         eth0 725101.00 2931153.00  46735.07 4333748.63      0.00      0.00      0.00
14:48:47         eth0 725099.00 2931161.00  46735.11 4333760.44      0.00      0.00      1.00
14:48:48         eth0 725160.00 2931731.00  46738.88 4334606.07      0.00      0.00      0.00
Average:         eth0 725290.40 2931658.20  46747.54 4334491.74      0.00      0.00      0.40
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  0      0 259825920  45644 2708324    0    0    21     2  247   98  0  0 100  0  0
 4  0      0 259823744  45644 2708356    0    0     0     0 2400825 159843  0 19 81  0  0
 0  0      0 259824208  45644 2708072    0    0     0     0 2407351 159929  0 19 81  0  0
 1  0      0 259824592  45644 2708128    0    0     0     0 2405183 160386  0 19 80  0  0
 1  0      0 259824272  45644 2707868    0    0     0    32 2396361 158037  0 19 81  0  0

Now use MQ+FQ :

lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
lpaa23:~# tc qdisc replace dev eth0 root mq

$ sar -n DEV 1 5 | grep eth
14:49:57         eth0 678614.00 2727930.00  43739.13 4033279.14      0.00      0.00      0.00
14:49:58         eth0 677620.00 2723971.00  43674.69 4027429.62      0.00      0.00      1.00
14:49:59         eth0 676396.00 2719050.00  43596.83 4020125.02      0.00      0.00      0.00
14:50:00         eth0 675197.00 2714173.00  43518.62 4012938.90      0.00      0.00      1.00
14:50:01         eth0 676388.00 2719063.00  43595.47 4020171.64      0.00      0.00      0.00
Average:         eth0 676843.00 2720837.40  43624.95 4022788.86      0.00      0.00      0.40
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 259832240  46008 2710912    0    0    21     2  223  192  0  1 99  0  0
 1  0      0 259832896  46008 2710744    0    0     0     0 1702206 198078  0 17 82  0  0
 0  0      0 259830272  46008 2710596    0    0     0     0 1696340 197756  1 17 83  0  0
 4  0      0 259829168  46024 2710584    0    0    16     0 1688472 197158  1 17 82  0  0
 3  0      0 259830224  46024 2710408    0    0     0     0 1692450 197212  0 18 82  0  0

As expected, number of interrupts per second is very different.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-16 15:43:31 -04:00
Eric Dumazet
cb395b2010 net: sched: optimize class dumps
In commit 59cc1f61f0 ("net: sched: convert qdisc linked list to
hashtable") we missed the opportunity to considerably speed up
tc_dump_tclass_root() if a qdisc handle is provided by user.

Instead of iterating all the qdiscs, use qdisc_match_from_root()
to directly get the one we look for.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-11 21:37:40 -04:00
Michal Hocko
da6bc57a8f net: use kvmalloc with __GFP_REPEAT rather than open coded variant
fq_alloc_node, alloc_netdev_mqs and netif_alloc* open code kmalloc with
vmalloc fallback.  Use the kvmalloc variant instead.  Keep the
__GFP_REPEAT flag based on explanation from Eric:

 "At the time, tests on the hardware I had in my labs showed that
  vmalloc() could deliver pages spread all over the memory and that was
  a small penalty (once memory is fragmented enough, not at boot time)"

The way how the code is constructed means, however, that we prefer to go
and hit the OOM killer before we fall back to the vmalloc for requests
<=32kB (with 4kB pages) in the current code.  This is rather disruptive
for something that can be achived with the fallback.  On the other hand
__GFP_REPEAT doesn't have any useful semantic for these requests.  So
the effect of this patch is that requests which fit into 32kB will fall
back to vmalloc easier now.

Link: http://lkml.kernel.org/r/20170306103327.2766-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Eric Dumazet <edumazet@google.com>
Cc: David Miller <davem@davemloft.net>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:13 -07:00