Commit Graph

154 Commits

Author SHA1 Message Date
Eric Dumazet 2170a3d1c2 net: sched: limit TC_ACT_REPEAT loops
commit 5740d06890 upstream.

We have been living dangerously, at the mercy of malicious users,
abusing TC_ACT_REPEAT, as shown by this syzpot report [1].

Add an arbitrary limit (32) to the number of times an action can
return TC_ACT_REPEAT.

v2: switch the limit to 32 instead of 10.
    Use net_warn_ratelimited() instead of pr_err_once().

[1] (C repro available on demand)

rcu: INFO: rcu_preempt self-detected stall on CPU
rcu:    1-...!: (10500 ticks this GP) idle=021/1/0x4000000000000000 softirq=5592/5592 fqs=0
        (t=10502 jiffies g=5305 q=190)
rcu: rcu_preempt kthread timer wakeup didn't happen for 10502 jiffies! g5305 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
rcu:    Possible timer handling issue on cpu=0 timer-softirq=3527
rcu: rcu_preempt kthread starved for 10505 jiffies! g5305 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
rcu:    Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_preempt     state:I stack:29344 pid:   14 ppid:     2 flags:0x00004000
Call Trace:
 <TASK>
 context_switch kernel/sched/core.c:4986 [inline]
 __schedule+0xab2/0x4db0 kernel/sched/core.c:6295
 schedule+0xd2/0x260 kernel/sched/core.c:6368
 schedule_timeout+0x14a/0x2a0 kernel/time/timer.c:1881
 rcu_gp_fqs_loop+0x186/0x810 kernel/rcu/tree.c:1963
 rcu_gp_kthread+0x1de/0x320 kernel/rcu/tree.c:2136
 kthread+0x2e9/0x3a0 kernel/kthread.c:377
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
 </TASK>
rcu: Stack dump where RCU GP kthread last ran:
Sending NMI from CPU 1 to CPUs 0:
NMI backtrace for cpu 0
CPU: 0 PID: 3646 Comm: syz-executor358 Not tainted 5.17.0-rc3-syzkaller-00149-gbf8e59fd315f #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:rep_nop arch/x86/include/asm/vdso/processor.h:13 [inline]
RIP: 0010:cpu_relax arch/x86/include/asm/vdso/processor.h:18 [inline]
RIP: 0010:pv_wait_head_or_lock kernel/locking/qspinlock_paravirt.h:437 [inline]
RIP: 0010:__pv_queued_spin_lock_slowpath+0x3b8/0xb40 kernel/locking/qspinlock.c:508
Code: 48 89 eb c6 45 01 01 41 bc 00 80 00 00 48 c1 e9 03 83 e3 07 41 be 01 00 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8d 2c 01 eb 0c <f3> 90 41 83 ec 01 0f 84 72 04 00 00 41 0f b6 45 00 38 d8 7f 08 84
RSP: 0018:ffffc9000283f1b0 EFLAGS: 00000206
RAX: 0000000000000003 RBX: 0000000000000000 RCX: 1ffff1100fc0071e
RDX: 0000000000000001 RSI: 0000000000000201 RDI: 0000000000000000
RBP: ffff88807e0038f0 R08: 0000000000000001 R09: ffffffff8ffbf9ff
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000004c1e
R13: ffffed100fc0071e R14: 0000000000000001 R15: ffff8880b9c3aa80
FS:  00005555562bf300(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffdbfef12b8 CR3: 00000000723c2000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 pv_queued_spin_lock_slowpath arch/x86/include/asm/paravirt.h:591 [inline]
 queued_spin_lock_slowpath arch/x86/include/asm/qspinlock.h:51 [inline]
 queued_spin_lock include/asm-generic/qspinlock.h:85 [inline]
 do_raw_spin_lock+0x200/0x2b0 kernel/locking/spinlock_debug.c:115
 spin_lock_bh include/linux/spinlock.h:354 [inline]
 sch_tree_lock include/net/sch_generic.h:610 [inline]
 sch_tree_lock include/net/sch_generic.h:605 [inline]
 prio_tune+0x3b9/0xb50 net/sched/sch_prio.c:211
 prio_init+0x5c/0x80 net/sched/sch_prio.c:244
 qdisc_create.constprop.0+0x44a/0x10f0 net/sched/sch_api.c:1253
 tc_modify_qdisc+0x4c5/0x1980 net/sched/sch_api.c:1660
 rtnetlink_rcv_msg+0x413/0xb80 net/core/rtnetlink.c:5594
 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
 netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
 netlink_unicast+0x539/0x7e0 net/netlink/af_netlink.c:1343
 netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:705 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:725
 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2413
 ___sys_sendmsg+0xf3/0x170 net/socket.c:2467
 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2496
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f7ee98aae99
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 41 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffdbfef12d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007ffdbfef1300 RCX: 00007f7ee98aae99
RDX: 0000000000000000 RSI: 0000000020000000 RDI: 0000000000000003
RBP: 0000000000000000 R08: 000000000000000d R09: 000000000000000d
R10: 000000000000000d R11: 0000000000000246 R12: 00007ffdbfef12f0
R13: 00000000000f4240 R14: 000000000004ca47 R15: 00007ffdbfef12e4
 </TASK>
INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 2.293 msecs
NMI backtrace for cpu 1
CPU: 1 PID: 3260 Comm: kworker/1:3 Not tainted 5.17.0-rc3-syzkaller-00149-gbf8e59fd315f #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: mld mld_ifc_work
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
 nmi_cpu_backtrace.cold+0x47/0x144 lib/nmi_backtrace.c:111
 nmi_trigger_cpumask_backtrace+0x1b3/0x230 lib/nmi_backtrace.c:62
 trigger_single_cpu_backtrace include/linux/nmi.h:164 [inline]
 rcu_dump_cpu_stacks+0x25e/0x3f0 kernel/rcu/tree_stall.h:343
 print_cpu_stall kernel/rcu/tree_stall.h:604 [inline]
 check_cpu_stall kernel/rcu/tree_stall.h:688 [inline]
 rcu_pending kernel/rcu/tree.c:3919 [inline]
 rcu_sched_clock_irq.cold+0x5c/0x759 kernel/rcu/tree.c:2617
 update_process_times+0x16d/0x200 kernel/time/timer.c:1785
 tick_sched_handle+0x9b/0x180 kernel/time/tick-sched.c:226
 tick_sched_timer+0x1b0/0x2d0 kernel/time/tick-sched.c:1428
 __run_hrtimer kernel/time/hrtimer.c:1685 [inline]
 __hrtimer_run_queues+0x1c0/0xe50 kernel/time/hrtimer.c:1749
 hrtimer_interrupt+0x31c/0x790 kernel/time/hrtimer.c:1811
 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1086 [inline]
 __sysvec_apic_timer_interrupt+0x146/0x530 arch/x86/kernel/apic/apic.c:1103
 sysvec_apic_timer_interrupt+0x8e/0xc0 arch/x86/kernel/apic/apic.c:1097
 </IRQ>
 <TASK>
 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:638
RIP: 0010:__sanitizer_cov_trace_const_cmp4+0xc/0x70 kernel/kcov.c:286
Code: 00 00 00 48 89 7c 30 e8 48 89 4c 30 f0 4c 89 54 d8 20 48 89 10 5b c3 0f 1f 80 00 00 00 00 41 89 f8 bf 03 00 00 00 4c 8b 14 24 <89> f1 65 48 8b 34 25 00 70 02 00 e8 14 f9 ff ff 84 c0 74 4b 48 8b
RSP: 0018:ffffc90002c5eea8 EFLAGS: 00000246
RAX: 0000000000000007 RBX: ffff88801c625800 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
RBP: ffff8880137d3100 R08: 0000000000000000 R09: 0000000000000000
R10: ffffffff874fcd88 R11: 0000000000000000 R12: ffff88801d692dc0
R13: ffff8880137d3104 R14: 0000000000000000 R15: ffff88801d692de8
 tcf_police_act+0x358/0x11d0 net/sched/act_police.c:256
 tcf_action_exec net/sched/act_api.c:1049 [inline]
 tcf_action_exec+0x1a6/0x530 net/sched/act_api.c:1026
 tcf_exts_exec include/net/pkt_cls.h:326 [inline]
 route4_classify+0xef0/0x1400 net/sched/cls_route.c:179
 __tcf_classify net/sched/cls_api.c:1549 [inline]
 tcf_classify+0x3e8/0x9d0 net/sched/cls_api.c:1615
 prio_classify net/sched/sch_prio.c:42 [inline]
 prio_enqueue+0x3a7/0x790 net/sched/sch_prio.c:75
 dev_qdisc_enqueue+0x40/0x300 net/core/dev.c:3668
 __dev_xmit_skb net/core/dev.c:3756 [inline]
 __dev_queue_xmit+0x1f61/0x3660 net/core/dev.c:4081
 neigh_hh_output include/net/neighbour.h:533 [inline]
 neigh_output include/net/neighbour.h:547 [inline]
 ip_finish_output2+0x14dc/0x2170 net/ipv4/ip_output.c:228
 __ip_finish_output net/ipv4/ip_output.c:306 [inline]
 __ip_finish_output+0x396/0x650 net/ipv4/ip_output.c:288
 ip_finish_output+0x32/0x200 net/ipv4/ip_output.c:316
 NF_HOOK_COND include/linux/netfilter.h:296 [inline]
 ip_output+0x196/0x310 net/ipv4/ip_output.c:430
 dst_output include/net/dst.h:451 [inline]
 ip_local_out+0xaf/0x1a0 net/ipv4/ip_output.c:126
 iptunnel_xmit+0x628/0xa50 net/ipv4/ip_tunnel_core.c:82
 geneve_xmit_skb drivers/net/geneve.c:966 [inline]
 geneve_xmit+0x10c8/0x3530 drivers/net/geneve.c:1077
 __netdev_start_xmit include/linux/netdevice.h:4683 [inline]
 netdev_start_xmit include/linux/netdevice.h:4697 [inline]
 xmit_one net/core/dev.c:3473 [inline]
 dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3489
 __dev_queue_xmit+0x2985/0x3660 net/core/dev.c:4116
 neigh_hh_output include/net/neighbour.h:533 [inline]
 neigh_output include/net/neighbour.h:547 [inline]
 ip6_finish_output2+0xf7a/0x14f0 net/ipv6/ip6_output.c:126
 __ip6_finish_output net/ipv6/ip6_output.c:191 [inline]
 __ip6_finish_output+0x61e/0xe90 net/ipv6/ip6_output.c:170
 ip6_finish_output+0x32/0x200 net/ipv6/ip6_output.c:201
 NF_HOOK_COND include/linux/netfilter.h:296 [inline]
 ip6_output+0x1e4/0x530 net/ipv6/ip6_output.c:224
 dst_output include/net/dst.h:451 [inline]
 NF_HOOK include/linux/netfilter.h:307 [inline]
 NF_HOOK include/linux/netfilter.h:301 [inline]
 mld_sendpack+0x9a3/0xe40 net/ipv6/mcast.c:1826
 mld_send_cr net/ipv6/mcast.c:2127 [inline]
 mld_ifc_work+0x71c/0xdc0 net/ipv6/mcast.c:2659
 process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307
 worker_thread+0x657/0x1110 kernel/workqueue.c:2454
 kthread+0x2e9/0x3a0 kernel/kthread.c:377
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
 </TASK>
----------------
Code disassembly (best guess):
   0:   48 89 eb                mov    %rbp,%rbx
   3:   c6 45 01 01             movb   $0x1,0x1(%rbp)
   7:   41 bc 00 80 00 00       mov    $0x8000,%r12d
   d:   48 c1 e9 03             shr    $0x3,%rcx
  11:   83 e3 07                and    $0x7,%ebx
  14:   41 be 01 00 00 00       mov    $0x1,%r14d
  1a:   48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
  21:   fc ff df
  24:   4c 8d 2c 01             lea    (%rcx,%rax,1),%r13
  28:   eb 0c                   jmp    0x36
* 2a:   f3 90                   pause <-- trapping instruction
  2c:   41 83 ec 01             sub    $0x1,%r12d
  30:   0f 84 72 04 00 00       je     0x4a8
  36:   41 0f b6 45 00          movzbl 0x0(%r13),%eax
  3b:   38 d8                   cmp    %bl,%al
  3d:   7f 08                   jg     0x47
  3f:   84                      .byte 0x84

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Reported-by: syzbot <syzkaller@googlegroups.com>
Link: https://lore.kernel.org/r/20220215235305.3272331-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-23 11:57:35 +01:00
Eric Dumazet eccfa2109a net: avoid potential infinite loop in tc_ctl_action()
[ Upstream commit 39f13ea2f6 ]

tc_ctl_action() has the ability to loop forever if tcf_action_add()
returns -EAGAIN.

This special case has been done in case a module needed to be loaded,
but it turns out that tcf_add_notify() could also return -EAGAIN
if the socket sk_rcvbuf limit is hit.

We need to separate the two cases, and only loop for the module
loading case.

While we are at it, add a limit of 10 attempts since unbounded
loops are always scary.

syzbot repro was something like :

socket(PF_NETLINK, SOCK_RAW|SOCK_NONBLOCK, NETLINK_ROUTE) = 3
write(3, ..., 38) = 38
setsockopt(3, SOL_SOCKET, SO_RCVBUF, [0], 4) = 0
sendmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{..., 388}], msg_controllen=0, msg_flags=0x10}, ...)

NMI backtrace for cpu 0
CPU: 0 PID: 1054 Comm: khungtaskd Not tainted 5.4.0-rc1+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 nmi_cpu_backtrace.cold+0x70/0xb2 lib/nmi_backtrace.c:101
 nmi_trigger_cpumask_backtrace+0x23b/0x28b lib/nmi_backtrace.c:62
 arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
 trigger_all_cpu_backtrace include/linux/nmi.h:146 [inline]
 check_hung_uninterruptible_tasks kernel/hung_task.c:205 [inline]
 watchdog+0x9d0/0xef0 kernel/hung_task.c:289
 kthread+0x361/0x430 kernel/kthread.c:255
 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
Sending NMI from CPU 0 to CPUs 1:
NMI backtrace for cpu 1
CPU: 1 PID: 8859 Comm: syz-executor910 Not tainted 5.4.0-rc1+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:arch_local_save_flags arch/x86/include/asm/paravirt.h:751 [inline]
RIP: 0010:lockdep_hardirqs_off+0x1df/0x2e0 kernel/locking/lockdep.c:3453
Code: 5c 08 00 00 5b 41 5c 41 5d 5d c3 48 c7 c0 58 1d f3 88 48 ba 00 00 00 00 00 fc ff df 48 c1 e8 03 80 3c 10 00 0f 85 d3 00 00 00 <48> 83 3d 21 9e 99 07 00 0f 84 b9 00 00 00 9c 58 0f 1f 44 00 00 f6
RSP: 0018:ffff8880a6f3f1b8 EFLAGS: 00000046
RAX: 1ffffffff11e63ab RBX: ffff88808c9c6080 RCX: 0000000000000000
RDX: dffffc0000000000 RSI: 0000000000000000 RDI: ffff88808c9c6914
RBP: ffff8880a6f3f1d0 R08: ffff88808c9c6080 R09: fffffbfff16be5d1
R10: fffffbfff16be5d0 R11: 0000000000000003 R12: ffffffff8746591f
R13: ffff88808c9c6080 R14: ffffffff8746591f R15: 0000000000000003
FS:  00000000011e4880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffff600400 CR3: 00000000a8920000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 trace_hardirqs_off+0x62/0x240 kernel/trace/trace_preemptirq.c:45
 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:108 [inline]
 _raw_spin_lock_irqsave+0x6f/0xcd kernel/locking/spinlock.c:159
 __wake_up_common_lock+0xc8/0x150 kernel/sched/wait.c:122
 __wake_up+0xe/0x10 kernel/sched/wait.c:142
 netlink_unlock_table net/netlink/af_netlink.c:466 [inline]
 netlink_unlock_table net/netlink/af_netlink.c:463 [inline]
 netlink_broadcast_filtered+0x705/0xb80 net/netlink/af_netlink.c:1514
 netlink_broadcast+0x3a/0x50 net/netlink/af_netlink.c:1534
 rtnetlink_send+0xdd/0x110 net/core/rtnetlink.c:714
 tcf_add_notify net/sched/act_api.c:1343 [inline]
 tcf_action_add+0x243/0x370 net/sched/act_api.c:1362
 tc_ctl_action+0x3b5/0x4bc net/sched/act_api.c:1410
 rtnetlink_rcv_msg+0x463/0xb00 net/core/rtnetlink.c:5386
 netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
 rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5404
 netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
 netlink_unicast+0x531/0x710 net/netlink/af_netlink.c:1328
 netlink_sendmsg+0x8a5/0xd60 net/netlink/af_netlink.c:1917
 sock_sendmsg_nosec net/socket.c:637 [inline]
 sock_sendmsg+0xd7/0x130 net/socket.c:657
 ___sys_sendmsg+0x803/0x920 net/socket.c:2311
 __sys_sendmsg+0x105/0x1d0 net/socket.c:2356
 __do_sys_sendmsg net/socket.c:2365 [inline]
 __se_sys_sendmsg net/socket.c:2363 [inline]
 __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2363
 do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x440939

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot+cf0adbb9c28c8866c788@syzkaller.appspotmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-29 09:17:03 +01:00
Craig Dillabaugh cd19a9b12a net sched actions: fix dumping which requires several messages to user space
[ Upstream commit 734549eb55 ]

Fixes a bug in the tcf_dump_walker function that can cause some actions
to not be reported when dumping a large number of actions. This issue
became more aggrevated when cookies feature was added. In particular
this issue is manifest when large cookie values are assigned to the
actions and when enough actions are created that the resulting table
must be dumped in multiple batches.

The number of actions returned in each batch is limited by the total
number of actions and the memory buffer size.  With small cookies
the numeric limit is reached before the buffer size limit, which avoids
the code path triggering this bug. When large cookies are used buffer
fills before the numeric limit, and the erroneous code path is hit.

For example after creating 32 csum actions with the cookie
aaaabbbbccccdddd

$ tc actions ls action csum
total acts 26

    action order 0: csum (tcp) action continue
    index 1 ref 1 bind 0
    cookie aaaabbbbccccdddd

    .....

    action order 25: csum (tcp) action continue
    index 26 ref 1 bind 0
    cookie aaaabbbbccccdddd
total acts 6

    action order 0: csum (tcp) action continue
    index 28 ref 1 bind 0
    cookie aaaabbbbccccdddd

    ......

    action order 5: csum (tcp) action continue
    index 32 ref 1 bind 0
    cookie aaaabbbbccccdddd

Note that the action with index 27 is omitted from the report.

Fixes: 4b3550ef53 ("[NET_SCHED]: Use nla_nest_start/nla_nest_end")"
Signed-off-by: Craig Dillabaugh <cdillaba@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-12 12:32:23 +02:00
Cong Wang c7e460ce55 Revert "net_sched: hold netns refcnt for each action"
This reverts commit ceffcc5e25.
If we hold that refcnt, the netns can never be destroyed until
all actions are destroyed by user, this breaks our netns design
which we expect all actions are destroyed when we destroy the
whole netns.

Cc: Lucas Bates <lucasb@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-09 10:03:09 +09:00
Cong Wang ceffcc5e25 net_sched: hold netns refcnt for each action
TC actions have been destroyed asynchronously for a long time,
previously in a RCU callback and now in a workqueue. If we
don't hold a refcnt for its netns, we could use the per netns
data structure, struct tcf_idrinfo, after it has been freed by
netns workqueue.

Hold refcnt to ensure netns destroy happens after all actions
are gone.

Fixes: ddf97ccdd7 ("net_sched: add network namespace support for tc actions")
Reported-by: Lucas Bates <lucasb@mojatatu.com>
Tested-by: Lucas Bates <lucasb@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-03 10:30:38 +09:00
Cong Wang a159d3c4b8 net_sched: acquire RTNL in tc_action_net_exit()
I forgot to acquire RTNL in tc_action_net_exit()
which leads that action ops->cleanup() is not always
called with RTNL. This usually is not a big deal because
this function is called after all netns refcnt are gone,
but given RTNL protects more than just actions, add it
for safety and consistency.

Also add an assertion to catch other potential bugs.

Fixes: ddf97ccdd7 ("net_sched: add network namespace support for tc actions")
Reported-by: Lucas Bates <lucasb@mojatatu.com>
Tested-by: Lucas Bates <lucasb@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-03 10:30:38 +09:00
Jiri Pirko 255cd50f20 net: sched: fix use-after-free in tcf_action_destroy and tcf_del_walker
Recent commit d7fb60b9ca ("net_sched: get rid of tcfa_rcu") removed
freeing in call_rcu, which changed already existing hard-to-hit
race condition into 100% hit:

[  598.599825] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[  598.607782] IP: tcf_action_destroy+0xc0/0x140

Or:

[   40.858924] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[   40.862840] IP: tcf_generic_walker+0x534/0x820

Fix this by storing the ops and use them directly for module_put call.

Fixes: a85a970af2 ("net_sched: move tc_action into tcf_common")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-13 09:34:08 -07:00
Cong Wang d7fb60b9ca net_sched: get rid of tcfa_rcu
gen estimator has been rewritten in commit 1c0d32fde5
("net_sched: gen_estimator: complete rewrite of rate estimators"),
the caller is no longer needed to wait for a grace period.
So this patch gets rid of it.

This also completely closes a race condition between action free
path and filter chain add/remove path for the following patch.
Because otherwise the nested RCU callback can't be caught by
rcu_barrier().

Please see also the comments in code.

Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-12 20:41:02 -07:00
Jakub Kicinski 2c8468dcf8 net: sched: don't use GFP_KERNEL under spin lock
The new TC IDR code uses GFP_KERNEL under spin lock.  Which leads
to:

[  582.621091] BUG: sleeping function called from invalid context at ../mm/slab.h:416
[  582.629721] in_atomic(): 1, irqs_disabled(): 0, pid: 3379, name: tc
[  582.636939] 2 locks held by tc/3379:
[  582.641049]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff910354ce>] rtnetlink_rcv_msg+0x92e/0x1400
[  582.650958]  #1:  (&(&tn->idrinfo->lock)->rlock){+.-.+.}, at: [<ffffffff9110a5e0>] tcf_idr_create+0x2f0/0x8e0
[  582.662217] Preemption disabled at:
[  582.662222] [<ffffffff9110a5e0>] tcf_idr_create+0x2f0/0x8e0
[  582.672592] CPU: 9 PID: 3379 Comm: tc Tainted: G        W       4.13.0-rc7-debug-00648-g43503a79b9f0 #287
[  582.683432] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 11/08/2016
[  582.691937] Call Trace:
...
[  582.742460]  kmem_cache_alloc+0x286/0x540
[  582.747055]  radix_tree_node_alloc.constprop.6+0x4a/0x450
[  582.753209]  idr_get_free_cmn+0x627/0xf80
...
[  582.815525]  idr_alloc_cmn+0x1a8/0x270
...
[  582.833804]  tcf_idr_create+0x31b/0x8e0
...

Try to preallocate the memory with idr_prealloc(GFP_KERNEL)
(as suggested by Eric Dumazet), and change the allocation
flags under spin lock.

Fixes: 65a206c01e ("net/sched: Change act_api and act_xxx modules to use IDR")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-05 14:48:29 -07:00
Chris Mi 65a206c01e net/sched: Change act_api and act_xxx modules to use IDR
Typically, each TC filter has its own action. All the actions of the
same type are saved in its hash table. But the hash buckets are too
small that it degrades to a list. And the performance is greatly
affected. For example, it takes about 0m11.914s to insert 64K rules.
If we convert the hash table to IDR, it only takes about 0m1.500s.
The improvement is huge.

But please note that the test result is based on previous patch that
cls_flower uses IDR.

Signed-off-by: Chris Mi <chrism@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-30 14:38:51 -07:00
Florian Westphal b97bac64a5 rtnetlink: make rtnl_register accept a flags parameter
This change allows us to later indicate to rtnetlink core that certain
doit functions should be called without acquiring rtnl_mutex.

This change should have no effect, we simply replace the last (now
unused) calcit argument with the new flag.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 16:57:38 -07:00
Jiri Pirko ec1a9cca0e net: sched: remove check for number of actions in tcf_exts_exec
Leave it to tcf_action_exec to return TC_ACT_OK in case there is no
action present.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:23 -07:00
Jamal Hadi Salim e62e484df0 net sched actions: add time filter for action dumping
This patch adds support for filtering based on time since last used.
When we are dumping a large number of actions it is useful to
have the option of filtering based on when the action was last
used to reduce the amount of data crossing to user space.

With this patch the user space app sets the TCA_ROOT_TIME_DELTA
attribute with the value in milliseconds with "time of interest
since now".  The kernel converts this to jiffies and does the
filtering comparison matching entries that have seen activity
since then and returns them to user space.
Old kernels and old tc continue to work in legacy mode since
they dont specify this attribute.

Some example (we have 400 actions bound to 400 filters); at
installation time. Using updated when tc setting the time of
interest to 120 seconds earlier (we see 400 actions):
prompt$ hackedtc actions ls action gact since 120000| grep index | wc -l
400

go get some coffee and wait for > 120 seconds and try again:

prompt$ hackedtc actions ls action gact since 120000 | grep index | wc -l
0

Lets see a filter bound to one of these actions:
....
filter pref 10 u32
filter pref 10 u32 fh 800: ht divisor 1
filter pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:10  (rule hit 2 success 1)
  match 7f000002/ffffffff at 12 (success 1 )
    action order 1: gact action pass
     random type none pass val 0
     index 23 ref 2 bind 1 installed 1145 sec used 802 sec
    Action statistics:
    Sent 84 bytes 1 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0
....

that coffee took long, no? It was good.

Now lets ping -c 1 127.0.0.2, then run the actions again:
prompt$ hackedtc actions ls action gact since 120 | grep index | wc -l
1

More details please:
prompt$ hackedtc -s actions ls action gact since 120000

    action order 0: gact action pass
     random type none pass val 0
     index 23 ref 2 bind 1 installed 1270 sec used 30 sec
    Action statistics:
    Sent 168 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0

And the filter?

filter pref 10 u32
filter pref 10 u32 fh 800: ht divisor 1
filter pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:10  (rule hit 4 success 2)
  match 7f000002/ffffffff at 12 (success 2 )
    action order 1: gact action pass
     random type none pass val 0
     index 23 ref 2 bind 1 installed 1324 sec used 84 sec
    Action statistics:
    Sent 168 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-30 19:28:08 -07:00
Jamal Hadi Salim 90825b23a8 net sched actions: dump more than TCA_ACT_MAX_PRIO actions per batch
When you dump hundreds of thousands of actions, getting only 32 per
dump batch even when the socket buffer and memory allocations allow
is inefficient.

With this change, the user will get as many as possibly fitting
within the given constraints available to the kernel.

The top level action TLV space is extended. An attribute
TCA_ROOT_FLAGS is used to carry flags; flag TCA_FLAG_LARGE_DUMP_ON
is set by the user indicating the user is capable of processing
these large dumps. Older user space which doesnt set this flag
doesnt get the large (than 32) batches.
The kernel uses the TCA_ROOT_COUNT attribute to tell the user how many
actions are put in a single batch. As such user space app knows how long
to iterate (independent of the type of action being dumped)
instead of hardcoded maximum of 32 thus maintaining backward compat.

Some results dumping 1.5M actions below:
first an unpatched tc which doesnt understand these features...

prompt$ time -p tc actions ls action gact | grep index | wc -l
1500000
real 1388.43
user 2.07
sys 1386.79

Now lets see a patched tc which sets the correct flags when requesting
a dump:

prompt$ time -p updatedtc actions ls action gact | grep index | wc -l
1500000
real 178.13
user 2.02
sys 176.96

That is about 8x performance improvement for tc app which sets its
receive buffer to about 32K.

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-30 19:28:08 -07:00
Jamal Hadi Salim df823b0297 net sched actions: Use proper root attribute table for actions
Bug fix for an issue which has been around for about a decade.
We got away with it because the enumeration was larger than needed.

Fixes: 7ba699c604 ("[NET_SCHED]: Convert actions from rtnetlink to new netlink API")
Suggested-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-30 19:28:08 -07:00
Roman Mashak c4c4290c17 net sched actions: rename act_get_notify() to tcf_get_notify()
Make name consistent with other TC event notification routines, such as
tcf_add_notify() and tcf_del_notify()

Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-14 08:52:33 -07:00
WANG Cong 367a8ce896 net_sched: only create filter chains for new filters/actions
tcf_chain_get() always creates a new filter chain if not found
in existing ones. This is totally unnecessary when we get or
delete filters, new chain should be only created for new filters
(or new actions).

Fixes: 5bc1701881 ("net: sched: introduce multichain support for filters")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-25 12:15:05 -04:00
Jiri Pirko db50514f9a net: sched: add termination action to allow goto chain
Introduce new type of termination action called "goto_chain". This allows
user to specify a chain to be processed. This action type is
then processed as a return value in tcf_classify loop in similar
way as "reclassify" is, only it does not reset to the first filter
in chain but rather reset to the first filter of the desired chain.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:22:13 -04:00
Jiri Pirko 9fb9f251d2 net: sched: push tp down to action init
Tp pointer will be needed by the next patch in order to get the chain.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:22:13 -04:00
Jiri Pirko 9da3242e6a net: sched: add helpers to handle extended actions
Jump is now the only one using value action opcode. This is going to
change soon. So introduce helpers to work with this. Convert TC_ACT_JUMP.

This also fixes the TC_ACT_JUMP check, which is incorrectly done as a
bit check, not a value check.

Fixes: e0ee84ded7 ("net sched actions: Complete the JUMPX opcode")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-02 15:33:54 -04:00
Jamal Hadi Salim e0ee84ded7 net sched actions: Complete the JUMPX opcode
per discussion at netconf/netdev:
When we have an action that is capable of branching (example a policer),
we can achieve a continuation of the action graph by programming a
"continue" where we find an exact replica of the same filter rule with a lower
priority and the remainder of the action graph. When you have 100s of thousands
of filters which require such a feature it gets very inefficient to do two
lookups.

This patch completes a leftover feature of action codes. Its time has come.

Example below where a user labels packets with a different skbmark on ingress
of a port depending on whether they have/not exceeded the configured rate.
This mark is then used to make further decisions on some egress port.

 #rate control, very low so we can easily see the effect
sudo $TC actions add action police rate 1kbit burst 90k \
conform-exceed pipe/jump 2 index 10
 # skbedit index 11 will be used if the user conforms
sudo $TC actions add action skbedit mark 11 ok index 11
 # skbedit index 12 will be used if the user does not conform
sudo $TC actions add action skbedit mark 12 ok index 12

 #lets bind the user ..
sudo $TC filter add dev $ETH parent ffff: protocol ip prio 8 u32 \
match ip dst 127.0.0.8/32 flowid 1:10 \
action police index 10 \
action skbedit index 11 \
action skbedit index 12

 #run a ping -f and see what happens..
 #
jhs@foobar:~$ sudo $TC -s filter ls dev $ETH parent ffff: protocol ip
filter pref 8 u32
filter pref 8 u32 fh 800: ht divisor 1
filter pref 8 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:10  (rule hit 2800 success 1005)
  match 7f000008/ffffffff at 16 (success 1005 )
	action order 1:  police 0xa rate 1Kbit burst 23440b mtu 2Kb action pipe/jump 2 overhead 0b
	ref 2 bind 1 installed 207 sec used 122 sec
	Action statistics:
	Sent 84420 bytes 1005 pkt (dropped 0, overlimits 721 requeues 0)
	backlog 0b 0p requeues 0

	action order 2:  skbedit mark 11 pass
	 index 11 ref 2 bind 1 installed 204 sec used 122 sec
 	Action statistics:
	Sent 60564 bytes 721 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

	action order 3:  skbedit mark 12 pass
	 index 12 ref 2 bind 1 installed 201 sec used 122 sec
 	Action statistics:
	Sent 23856 bytes 284 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

Not bad, about 28% non-conforming packets..

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 11:30:06 -04:00
David S. Miller fb796707d7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Both conflict were simple overlapping changes.

In the kaweth case, Eric Dumazet's skb_cow() bug fix overlapped the
conversion of the driver in net-next to use in-netdev stats.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 20:23:53 -07:00
Wolfgang Bumiller e0535ce58b net sched actions: allocate act cookie early
Policing filters do not use the TCA_ACT_* enum and the tb[]
nlattr array in tcf_action_init_1() doesn't get filled for
them so we should not try to look for a TCA_ACT_COOKIE
attribute in the then uninitialized array.
The error handling in cookie allocation then calls
tcf_hash_release() leading to invalid memory access later
on.
Additionally, if cookie allocation fails after an already
existing non-policing filter has successfully been changed,
tcf_action_release() should not be called, also we would
have to roll back the changes in the error handling, so
instead we now allocate the cookie early and assign it on
success at the end.

CVE-2017-7979
Fixes: 1045ba77a5 ("net sched actions: Add support for user cookies")
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-20 16:32:07 -04:00
David Ahern c21ef3e343 net: rtnetlink: plumb extended ack to doit function
Add netlink_ext_ack arg to rtnl_doit_func. Pass extack arg to nlmsg_parse
for doit functions that call it directly.

This is the first step to using extended error reporting in rtnetlink.
>From here individual subsystems can be updated to set netlink_ext_ack as
needed.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 15:35:38 -04:00
Johannes Berg fceb6435e8 netlink: pass extended ACK struct to parsing functions
Pass the new extended ACK reporting struct to all of the generic
netlink parsing functions. For now, pass NULL in almost all callers
(except for some in the core.)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13 13:58:22 -04:00
Roman Mashak 37f1c63e3e net sched actions: do not overwrite status of action creation.
nla_memdup_cookie was overwriting err value, declared at function
scope and earlier initialized with result of ->init(). At success
nla_memdup_cookie() returns 0, and thus module refcnt decremented,
although the action was installed.

$ sudo tc actions add action pass index 1 cookie 1234
$ sudo tc actions ls action gact

        action order 0: gact action pass
         random type none pass val 0
         index 1 ref 1 bind 0
$
$ lsmod
Module                  Size  Used by
act_gact               16384  0
...
$
$ sudo rmmod act_gact
[   52.310283] ------------[ cut here ]------------
[   52.312551] WARNING: CPU: 1 PID: 455 at kernel/module.c:1113
module_put+0x99/0xa0
[   52.316278] Modules linked in: act_gact(-) crct10dif_pclmul crc32_pclmul
ghash_clmulni_intel psmouse pcbc evbug aesni_intel aes_x86_64 crypto_simd
serio_raw glue_helper pcspkr cryptd
[   52.322285] CPU: 1 PID: 455 Comm: rmmod Not tainted 4.10.0+ #11
[   52.324261] Call Trace:
[   52.325132]  dump_stack+0x63/0x87
[   52.326236]  __warn+0xd1/0xf0
[   52.326260]  warn_slowpath_null+0x1d/0x20
[   52.326260]  module_put+0x99/0xa0
[   52.326260]  tcf_hashinfo_destroy+0x7f/0x90
[   52.326260]  gact_exit_net+0x27/0x40 [act_gact]
[   52.326260]  ops_exit_list.isra.6+0x38/0x60
[   52.326260]  unregister_pernet_operations+0x90/0xe0
[   52.326260]  unregister_pernet_subsys+0x21/0x30
[   52.326260]  tcf_unregister_action+0x68/0xa0
[   52.326260]  gact_cleanup_module+0x17/0xa0f [act_gact]
[   52.326260]  SyS_delete_module+0x1ba/0x220
[   52.326260]  entry_SYSCALL_64_fastpath+0x1e/0xad
[   52.326260] RIP: 0033:0x7f527ffae367
[   52.326260] RSP: 002b:00007ffeb402a598 EFLAGS: 00000202 ORIG_RAX:
00000000000000b0
[   52.326260] RAX: ffffffffffffffda RBX: 0000559b069912a0 RCX: 00007f527ffae367
[   52.326260] RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559b06991308
[   52.326260] RBP: 0000000000000003 R08: 00007f5280264420 R09: 00007ffeb4029511
[   52.326260] R10: 000000000000087b R11: 0000000000000202 R12: 00007ffeb4029580
[   52.326260] R13: 0000000000000000 R14: 0000000000000000 R15: 0000559b069912a0
[   52.354856] ---[ end trace 90d89401542b0db6 ]---
$

With the fix:

$ sudo modprobe act_gact
$ lsmod
Module                  Size  Used by
act_gact               16384  0
...
$ sudo tc actions add action pass index 1 cookie 1234
$ sudo tc actions ls action gact

        action order 0: gact action pass
         random type none pass val 0
         index 1 ref 1 bind 0
$
$ lsmod
Module                  Size  Used by
act_gact               16384  1
...
$ sudo rmmod act_gact
rmmod: ERROR: Module act_gact is in use
$
$ sudo /home/mrv/bin/tc actions del action gact index 1
$ sudo rmmod act_gact
$ lsmod
Module                  Size  Used by
$

Fixes: 1045ba77a ("net sched actions: Add support for user cookies")
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-26 21:31:32 -05:00
Roman Mashak edb9d1bff4 net sched actions: decrement module reference count after table flush.
When tc actions are loaded as a module and no actions have been installed,
flushing them would result in actions removed from the memory, but modules
reference count not being decremented, so that the modules would not be
unloaded.

Following is example with GACT action:

% sudo modprobe act_gact
% lsmod
Module                  Size  Used by
act_gact               16384  0
%
% sudo tc actions ls action gact
%
% sudo tc actions flush action gact
% lsmod
Module                  Size  Used by
act_gact               16384  1
% sudo tc actions flush action gact
% lsmod
Module                  Size  Used by
act_gact               16384  2
% sudo rmmod act_gact
rmmod: ERROR: Module act_gact is in use
....

After the fix:
% lsmod
Module                  Size  Used by
act_gact               16384  0
%
% sudo tc actions add action pass index 1
% sudo tc actions add action pass index 2
% sudo tc actions add action pass index 3
% lsmod
Module                  Size  Used by
act_gact               16384  3
%
% sudo tc actions flush action gact
% lsmod
Module                  Size  Used by
act_gact               16384  0
%
% sudo tc actions flush action gact
% lsmod
Module                  Size  Used by
act_gact               16384  0
% sudo rmmod act_gact
% lsmod
Module                  Size  Used by
%

Fixes: f97017cdef ("net-sched: Fix actions flushing")
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-26 21:28:41 -05:00
Wei Yongjun 6f2e3f7d97 net_sched: nla_memdup_cookie() can be static
Fixes the following sparse warning:

net/sched/act_api.c:532:5: warning:
 symbol 'nla_memdup_cookie' was not declared. Should it be static?

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-14 15:02:17 -05:00
Jamal Hadi Salim 1045ba77a5 net sched actions: Add support for user cookies
Introduce optional 128-bit action cookie.
Like all other cookie schemes in the networking world (eg in protocols
like http or existing kernel fib protocol field, etc) the idea is to save
user state that when retrieved serves as a correlator. The kernel
_should not_ intepret it.  The user can store whatever they wish in the
128 bits.

Sample exercise(showing variable length use of cookie)

.. create an accept action with cookie a1b2c3d4
sudo $TC actions add action ok index 1 cookie a1b2c3d4

.. dump all gact actions..
sudo $TC -s actions ls action gact

    action order 0: gact action pass
     random type none pass val 0
     index 1 ref 1 bind 0 installed 5 sec used 5 sec
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0
    cookie a1b2c3d4

.. bind the accept action to a filter..
sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \
u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 1

... send some traffic..
$ ping 127.0.0.1 -c 3
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.020 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.027 ms
64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.038 ms

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 12:37:04 -05:00
David S. Miller 580bdf5650 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-01-17 15:19:37 -05:00
Jamal Hadi Salim 0faa9cb5b3 net sched actions: fix refcnt when GETing of action after bind
Demonstrating the issue:

.. add a drop action
$sudo $TC actions add action drop index 10

.. retrieve it
$ sudo $TC -s actions get action gact index 10

	action order 1: gact action drop
	 random type none pass val 0
	 index 10 ref 2 bind 0 installed 29 sec used 29 sec
 	Action statistics:
	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

... bug 1 above: reference is two.
    Reference is actually 1 but we forget to subtract 1.

... do a GET again and we see the same issue
    try a few times and nothing changes
~$ sudo $TC -s actions get action gact index 10

	action order 1: gact action drop
	 random type none pass val 0
	 index 10 ref 2 bind 0 installed 31 sec used 31 sec
 	Action statistics:
	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

... lets try to bind the action to a filter..
$ sudo $TC qdisc add dev lo ingress
$ sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \
  u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 10

... and now a few GETs:
$ sudo $TC -s actions get action gact index 10

	action order 1: gact action drop
	 random type none pass val 0
	 index 10 ref 3 bind 1 installed 204 sec used 204 sec
 	Action statistics:
	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

$ sudo $TC -s actions get action gact index 10

	action order 1: gact action drop
	 random type none pass val 0
	 index 10 ref 4 bind 1 installed 206 sec used 206 sec
 	Action statistics:
	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

$ sudo $TC -s actions get action gact index 10

	action order 1: gact action drop
	 random type none pass val 0
	 index 10 ref 5 bind 1 installed 235 sec used 235 sec
 	Action statistics:
	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

.... as can be observed the reference count keeps going up.

After the fix

$ sudo $TC actions add action drop index 10
$ sudo $TC -s actions get action gact index 10

	action order 1: gact action drop
	 random type none pass val 0
	 index 10 ref 1 bind 0 installed 4 sec used 4 sec
 	Action statistics:
	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

$ sudo $TC -s actions get action gact index 10

	action order 1: gact action drop
	 random type none pass val 0
	 index 10 ref 1 bind 0 installed 6 sec used 6 sec
 	Action statistics:
	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

$ sudo $TC qdisc add dev lo ingress
$ sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \
  u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 10

$ sudo $TC -s actions get action gact index 10

	action order 1: gact action drop
	 random type none pass val 0
	 index 10 ref 2 bind 1 installed 32 sec used 32 sec
 	Action statistics:
	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

$ sudo $TC -s actions get action gact index 10

	action order 1: gact action drop
	 random type none pass val 0
	 index 10 ref 2 bind 1 installed 33 sec used 33 sec
 	Action statistics:
	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

Fixes: aecc5cefc3 ("net sched actions: fix GETing actions")
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 19:43:19 -05:00
Willem de Bruijn e7246e122a net-tc: extract skip classify bit from tc_verd
Packets sent by the IFB device skip subsequent tc classification.
A single bit governs this state. Move it out of tc_verd in
anticipation of removing that __u16 completely.

The new bitfield tc_skip_classify temporarily uses one bit of a
hole, until tc_verd is removed completely in a follow-up patch.

Remove the bit hole comment. It could be 2, 3, 4 or 5 bits long.
With that many options, little value in documenting it.

Introduce a helper function to deduplicate the logic in the two
sites that check this bit.

The field tc_skip_classify is set only in IFB on skbs cloned in
act_mirred, so original packet sources do not have to clear the
bit when reusing packets (notably, pktgen and octeon).

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-08 20:58:52 -05:00
Eric Dumazet 1c0d32fde5 net_sched: gen_estimator: complete rewrite of rate estimators
1) Old code was hard to maintain, due to complex lock chains.
   (We probably will be able to remove some kfree_rcu() in callers)

2) Using a single timer to update all estimators does not scale.

3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity
   is not supposed to work well)

In this rewrite :

- I removed the RB tree that had to be scanned in
  gen_estimator_active(). qdisc dumps should be much faster.

- Each estimator has its own timer.

- Estimations are maintained in net_rate_estimator structure,
  instead of dirtying the qdisc. Minor, but part of the simplification.

- Reading the estimator uses RCU and a seqcount to provide proper
  support for 32bit kernels.

- We reduce memory need when estimators are not used, since
  we store a pointer, instead of the bytes/packets counters.

- xt_rateest_mt() no longer has to grab a spinlock.
  (In the future, xt_rateest_tg() could be switched to per cpu counters)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 15:21:59 -05:00
Johannes Berg 4700e9ce6e net_sched actions: use nla_parse_nested()
Use nla_parse_nested instead of open-coding the call to
nla_parse() with the attribute data/len.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:01:01 -04:00
WANG Cong ab102b80ce net_sched: reorder pernet ops and act ops registrations
Krister reported a kernel NULL pointer dereference after
tcf_action_init_1() invokes a_o->init(), it is a race condition
where one thread calling tcf_register_action() to initialize
the netns data after putting act ops in the global list and
the other thread searching the list and then calling
a_o->init(net, ...).

Fix this by moving the pernet ops registration before making
the action ops visible. This is fine because: a) we don't
rely on act_base in pernet ops->init(), b) in the worst case we
have a fully initialized netns but ops is still not ready so
new actions still can't be created.

Reported-by: Krister Johansen <kjlx@templeofstupid.com>
Tested-by: Krister Johansen <kjlx@templeofstupid.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 10:26:43 -04:00
Jamal Hadi Salim aecc5cefc3 net sched actions: fix GETing actions
With the batch changes that translated transient actions into
a temporary list lost in the translation was the fact that
tcf_action_destroy() will eventually delete the action from
the permanent location if the refcount is zero.

Example of what broke:
...add a gact action to drop
sudo $TC actions add action drop index 10
...now retrieve it, looks good
sudo $TC actions get action gact index 10
...retrieve it again and find it is gone!
sudo $TC actions get action gact index 10

Fixes: 22dc13c837 ("net_sched: convert tcf_exts from list to pointer array"),
Fixes: 824a7e8863 ("net_sched: remove an unnecessary list_del()")
Fixes: f07fed82ad ("net_sched: remove the leftover cleanup_a()")

Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-20 23:34:55 -04:00
Jamal Hadi Salim 5a7a5555a3 net sched: stylistic cleanups
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 22:04:14 -04:00
WANG Cong 22dc13c837 net_sched: convert tcf_exts from list to pointer array
As pointed out by Jamal, an action could be shared by
multiple filters, so we can't use list to chain them
any more after we get rid of the original tc_action.
Instead, we could just save pointers to these actions
in tcf_exts, since they are refcount'ed, so convert
the list to an array of pointers.

The "ugly" part is the action API still accepts list
as a parameter, I just introduce a helper function to
convert the array of pointers to a list, instead of
relying on the C99 feature to iterate the array.

Fixes: a85a970af2 ("net_sched: move tc_action into tcf_common")
Reported-by: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-17 19:27:51 -04:00
WANG Cong 824a7e8863 net_sched: remove an unnecessary list_del()
This list_del() for tc action is not needed actually,
because we only use this list to chain bulk operations,
therefore should not be carried for latter operations.

Fixes: ec0595cc44 ("net_sched: get rid of struct tcf_common")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-17 19:27:51 -04:00
WANG Cong f07fed82ad net_sched: remove the leftover cleanup_a()
After refactoring tc_action into tcf_common, we no
longer need to cleanup temporary "actions" in list,
they are permanently stored in the hashtable.

Fixes: a85a970af2 ("net_sched: move tc_action into tcf_common")
Reported-by: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-17 19:27:51 -04:00
WANG Cong ec0595cc44 net_sched: get rid of struct tcf_common
After the previous patch, struct tc_action should be enough
to represent the generic tc action, tcf_common is not necessary
any more. This patch gets rid of it to make tc action code
more readable.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-25 21:49:20 -07:00
WANG Cong a85a970af2 net_sched: move tc_action into tcf_common
struct tc_action is confusing, currently we use it for two purposes:
1) Pass in arguments and carry out results from helper functions
2) A generic representation for tc actions

The first one is error-prone, since we need to make sure we don't
miss anything. This patch aims to get rid of this use, by moving
tc_action into tcf_common, so that they are allocated together
in hashtable and can be cast'ed easily.

And together with the following patch, we could really make
tc_action a generic representation for all tc actions and each
type of action can inherit from it.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-25 21:49:19 -07:00
David S. Miller ee58b57100 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several cases of overlapping changes, except the packet scheduler
conflicts which deal with the addition of the free list parameter
to qdisc_enqueue().

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-30 05:03:36 -04:00
Jamal Hadi Salim ebecaa6662 net sched actions: bug fix dumping actions directly didnt produce NLMSG_DONE
This refers to commands to direct action access as follows:

sudo tc actions add action drop index 12
sudo tc actions add action pipe index 10

And then dumping them like so:
sudo tc actions ls action gact

iproute2 worked because it depended on absence of TCA_ACT_TAB TLV
as end of message.
This fix has been tested with iproute2 and is backward compatible.

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:02:53 -07:00
WANG Cong b2313077ed net_sched: make tcf_hash_check() boolean
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 12:43:35 -07:00
Eric Dumazet edb09eb17e net: sched: do not acquire qdisc spinlock in qdisc/class stats dump
Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by Google BwE host
agent [1] are problematic at scale :

For each qdisc/class found in the dump, we currently lock the root qdisc
spinlock in order to get stats. Sampling stats every 5 seconds from
thousands of HTB classes is a challenge when the root qdisc spinlock is
under high pressure. Not only the dumps take time, they also slow
down the fast path (queue/dequeue packets) by 10 % to 20 % in some cases.

An audit of existing qdiscs showed that sch_fq_codel is the only qdisc
that might need the qdisc lock in fq_codel_dump_stats() and
fq_codel_dump_class_stats()

In v2 of this patch, I now use the Qdisc running seqcount to provide
consistent reads of packets/bytes counters, regardless of 32/64 bit arches.

I also changed rate estimators to use the same infrastructure
so that they no longer need to lock root qdisc lock.

[1]
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdf

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kevin Athey <kda@google.com>
Cc: Xiaotian Pei <xiaotian@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07 16:37:14 -07:00
Jamal Hadi Salim 0b0f43fe2e net sched: indentation and other OCD stylistic fixes
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
2016-06-07 15:53:54 -07:00
Jamal Hadi Salim 53eb440f4a net sched actions: introduce timestamp for firsttime use
Useful to know when the action was first used for accounting
(and debugging)

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07 15:53:43 -07:00
Nicolas Dichtel 9854518ea0 sched: align nlattr properly when needed
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26 12:00:49 -04:00
WANG Cong ddf97ccdd7 net_sched: add network namespace support for tc actions
Currently tc actions are stored in a per-module hashtable,
therefore are visible to all network namespaces. This is
probably the last part of the tc subsystem which is not
aware of netns now. This patch makes them per-netns,
several tc action API's need to be adjusted for this.

The tc action API code is ugly due to historical reasons,
we need to refactor that code in the future.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-25 14:16:21 -05:00