linux-stable/kernel
Shakeel Butt 01f4cb0005 cgroup: memcg: net: do not associate sock with unrelated cgroup
[ Upstream commit e876ecc67d ]

We are testing network memory accounting in our setup and noticed
inconsistent network memory usage and often unrelated cgroups network
usage correlates with testing workload. On further inspection, it
seems like mem_cgroup_sk_alloc() and cgroup_sk_alloc() are broken in
irq context specially for cgroup v1.

mem_cgroup_sk_alloc() and cgroup_sk_alloc() can be called in irq context
and kind of assumes that this can only happen from sk_clone_lock()
and the source sock object has already associated cgroup. However in
cgroup v1, where network memory accounting is opt-in, the source sock
can be unassociated with any cgroup and the new cloned sock can get
associated with unrelated interrupted cgroup.

Cgroup v2 can also suffer if the source sock object was created by
process in the root cgroup or if sk_alloc() is called in irq context.
The fix is to just do nothing in interrupt.

WARNING: Please note that about half of the TCP sockets are allocated
from the IRQ context, so, memory used by such sockets will not be
accouted by the memcg.

The stack trace of mem_cgroup_sk_alloc() from IRQ-context:

CPU: 70 PID: 12720 Comm: ssh Tainted:  5.6.0-smp-DEV #1
Hardware name: ...
Call Trace:
 <IRQ>
 dump_stack+0x57/0x75
 mem_cgroup_sk_alloc+0xe9/0xf0
 sk_clone_lock+0x2a7/0x420
 inet_csk_clone_lock+0x1b/0x110
 tcp_create_openreq_child+0x23/0x3b0
 tcp_v6_syn_recv_sock+0x88/0x730
 tcp_check_req+0x429/0x560
 tcp_v6_rcv+0x72d/0xa40
 ip6_protocol_deliver_rcu+0xc9/0x400
 ip6_input+0x44/0xd0
 ? ip6_protocol_deliver_rcu+0x400/0x400
 ip6_rcv_finish+0x71/0x80
 ipv6_rcv+0x5b/0xe0
 ? ip6_sublist_rcv+0x2e0/0x2e0
 process_backlog+0x108/0x1e0
 net_rx_action+0x26b/0x460
 __do_softirq+0x104/0x2a6
 do_softirq_own_stack+0x2a/0x40
 </IRQ>
 do_softirq.part.19+0x40/0x50
 __local_bh_enable_ip+0x51/0x60
 ip6_finish_output2+0x23d/0x520
 ? ip6table_mangle_hook+0x55/0x160
 __ip6_finish_output+0xa1/0x100
 ip6_finish_output+0x30/0xd0
 ip6_output+0x73/0x120
 ? __ip6_finish_output+0x100/0x100
 ip6_xmit+0x2e3/0x600
 ? ipv6_anycast_cleanup+0x50/0x50
 ? inet6_csk_route_socket+0x136/0x1e0
 ? skb_free_head+0x1e/0x30
 inet6_csk_xmit+0x95/0xf0
 __tcp_transmit_skb+0x5b4/0xb20
 __tcp_send_ack.part.60+0xa3/0x110
 tcp_send_ack+0x1d/0x20
 tcp_rcv_state_process+0xe64/0xe80
 ? tcp_v6_connect+0x5d1/0x5f0
 tcp_v6_do_rcv+0x1b1/0x3f0
 ? tcp_v6_do_rcv+0x1b1/0x3f0
 __release_sock+0x7f/0xd0
 release_sock+0x30/0xa0
 __inet_stream_connect+0x1c3/0x3b0
 ? prepare_to_wait+0xb0/0xb0
 inet_stream_connect+0x3b/0x60
 __sys_connect+0x101/0x120
 ? __sys_getsockopt+0x11b/0x140
 __x64_sys_connect+0x1a/0x20
 do_syscall_64+0x51/0x200
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

The stack trace of mem_cgroup_sk_alloc() from IRQ-context:
Fixes: 2d75807383 ("mm: memcontrol: consolidate cgroup socket tracking")
Fixes: d979a39d72 ("cgroup: duplicate cgroup reference when cloning sockets")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-18 07:17:43 +01:00
..
bpf bpf, offload: Replace bitwise AND by logical AND in bpf_prog_offload_info_fill 2020-02-28 17:22:27 +01:00
cgroup cgroup: memcg: net: do not associate sock with unrelated cgroup 2020-03-18 07:17:43 +01:00
configs
debug kgdb: don't use a notifier to enter kgdb at panic; call directly 2019-09-25 17:51:40 -07:00
dma dma-direct: don't check swiotlb=force in dma_direct_map_resource 2020-01-26 10:01:08 +01:00
events perf/core: Fix mlock accounting in perf_mmap() 2020-02-11 04:35:54 -08:00
gcov Revert "um: Enable CONFIG_CONSTRUCTORS" 2020-02-01 09:34:53 +00:00
irq genirq/proc: Reject invalid affinity masks (again) 2020-02-28 17:22:26 +01:00
livepatch livepatch: Nullify obj->mod in klp_module_coming()'s error path 2019-08-19 13:03:37 +02:00
locking locking/lockdep: Fix lockdep_stats indentation problem 2020-03-05 16:43:51 +01:00
power ACPI: PM: s2idle: Avoid possible race related to the EC GPE 2020-02-19 19:52:56 +01:00
printk printk: fix exclusive_console replaying 2020-02-24 08:36:24 +01:00
rcu rcu: Allow only one expedited GP to run concurrently with wakeups 2020-03-05 16:43:50 +01:00
sched sched/fair: Optimize select_idle_cpu 2020-03-05 16:43:48 +01:00
time lib/vdso: Update coarse timekeeper unconditionally 2020-03-05 16:43:49 +01:00
trace blktrace: fix dereference after null check 2020-03-12 13:00:10 +01:00
.gitignore
acct.c
async.c
audit.c audit: always check the netlink payload length in audit_receive_msg() 2020-03-05 16:43:42 +01:00
audit.h audit/stable-5.3 PR 20190702 2019-07-08 18:55:42 -07:00
audit_fsnotify.c
audit_tree.c
audit_watch.c audit_get_nd(): don't unlock parent too early 2019-11-10 11:56:55 -05:00
auditfilter.c audit: fix error handling in audit_data_to_entry() 2020-03-05 16:43:42 +01:00
auditsc.c
backtracetest.c
bounds.c
capability.c
compat.c
configs.c kernel/configs: Replace GPL boilerplate code with SPDX identifier 2019-07-30 18:34:15 +02:00
context_tracking.c
cpu.c cpu/hotplug, stop_machine: Fix stop_machine vs hotplug order 2020-02-24 08:36:23 +01:00
cpu_pm.c
crash_core.c
crash_dump.c
cred.c keys: Fix request_key() cache 2020-01-17 19:48:42 +01:00
delayacct.c
dma.c
elfcore.c kernel/elfcore.c: include proper prototypes 2019-09-25 17:51:39 -07:00
exec_domain.c
exit.c exit: panic before exit_mm() on global init exit 2020-01-09 10:20:01 +01:00
extable.c extable: Add function to search only kernel exception table 2019-08-21 22:23:48 +10:00
fail_function.c
fork.c clone3: ensure copy_thread_tls is implemented 2020-01-14 20:08:35 +01:00
freezer.c Revert "libata, freezer: avoid block device removal while system is frozen" 2019-10-06 09:11:37 -06:00
futex.c futex: Prevent exit livelock 2019-11-29 10:10:14 +01:00
gen_kheaders.sh kheaders: substituting --sort in archive creation 2019-10-17 09:08:19 +09:00
groups.c
hung_task.c
iomem.c mm/nvdimm: add is_ioremap_addr and use that to check ioremap address 2019-07-12 11:05:40 -07:00
irq_work.c
jump_label.c jump_label: Don't warn on __exit jump entries 2019-08-29 15:10:10 +01:00
kallsyms.c kallsyms: Don't let kallsyms_lookup_size_offset() fail on retrieving the first symbol 2019-08-27 16:19:56 +01:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt sched/rt, Kconfig: Unbreak def/oldconfig with CONFIG_PREEMPT=y 2019-07-22 18:05:11 +02:00
kcov.c
kexec.c kexec_load: Disable at runtime if the kernel is locked down 2019-08-19 21:54:15 -07:00
kexec_core.c kexec: bail out upon SIGKILL when allocating memory. 2019-09-25 17:51:40 -07:00
kexec_elf.c kexec_elf: support 32 bit ELF files 2019-09-06 23:58:44 +02:00
kexec_file.c Merge branch 'next-lockdown' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security 2019-09-28 08:14:15 -07:00
kexec_internal.h
kheaders.c
kmod.c
kprobes.c kprobes: Fix optimize_kprobe()/unoptimize_kprobe() cancellation logic 2020-03-12 13:00:09 +01:00
ksysfs.c
kthread.c kthread: make __kthread_queue_delayed_work static 2019-10-16 09:20:58 -07:00
latencytop.c
Makefile Merge branch 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity 2019-09-27 19:37:27 -07:00
module-internal.h
module.c module: avoid setting info->name early in case we can fall back to info->mod->name 2020-02-24 08:36:54 +01:00
module_signature.c MODSIGN: Export module signature definitions 2019-08-05 18:39:56 -04:00
module_signing.c MODSIGN: Export module signature definitions 2019-08-05 18:39:56 -04:00
notifier.c
nsproxy.c
padata.c padata: validate cpumask without removed CPU during offline 2020-02-24 08:36:34 +01:00
panic.c panic: ensure preemption is disabled during panic() 2019-10-07 15:47:19 -07:00
params.c lockdown: Lock down module params that specify hardware parameters (eg. ioport) 2019-08-19 21:54:16 -07:00
pid.c kernel/pid.c: convert struct pid count to refcount_t 2019-07-16 19:23:24 -07:00
pid_namespace.c proc/sysctl: add shared variables for range check 2019-07-18 17:08:07 -07:00
profile.c
ptrace.c ptrace: reintroduce usage of subjective credentials in ptrace_has_cap() 2020-01-23 08:22:36 +01:00
range.c
reboot.c
relay.c
resource.c mm/memory_hotplug.c: use PFN_UP / PFN_DOWN in walk_system_ram_range() 2019-09-24 15:54:09 -07:00
rseq.c
seccomp.c seccomp: Check that seccomp_notif is zeroed out by the user 2020-01-09 10:19:57 +01:00
signal.c cgroup: freezer: call cgroup_enter_frozen() with preemption disabled in ptrace_stop() 2019-10-11 08:39:57 -07:00
smp.c smp: Warn on function calls from softirq context 2019-07-20 11:27:16 +02:00
smpboot.c
smpboot.h
softirq.c Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-07-08 11:01:13 -07:00
stackleak.c
stacktrace.c stacktrace: Don't skip first entry on noncurrent tasks 2019-11-04 21:19:25 +01:00
stop_machine.c stop_machine: Avoid potential race behaviour 2019-10-17 12:47:12 +02:00
sys.c Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-09-17 12:35:15 -07:00
sys_ni.c
sysctl.c kernel: sysctl: make drop_caches write-only 2020-01-04 19:18:32 +01:00
sysctl_binary.c
task_work.c
taskstats.c taskstats: fix data-race 2020-01-09 10:19:54 +01:00
test_kprobes.c
torture.c torture: Remove exporting of internal functions 2019-08-01 14:30:22 -07:00
tracepoint.c The main changes in this release include: 2019-07-18 11:51:00 -07:00
tsacct.c
ucount.c proc/sysctl: add shared variables for range check 2019-07-18 17:08:07 -07:00
uid16.c
uid16.h
umh.c
up.c
user-return-notifier.c
user.c Keyrings namespacing 2019-07-08 19:36:47 -07:00
user_namespace.c Keyrings namespacing 2019-07-08 19:36:47 -07:00
utsname.c
utsname_sysctl.c
watchdog.c watchdog/softlockup: Enforce that timestamp is valid on boot 2020-02-24 08:36:52 +01:00
watchdog_hld.c
workqueue.c workqueue: Add RCU annotation for pwq list walk 2020-01-26 10:01:07 +01:00
workqueue_internal.h