linux-stable/net
Kirill Tkhai 1a57feb847 net: Introduce net_sem for protection of pernet_list
Currently, the mutex is mostly used to protect pernet operations
list. It orders setup_net() and cleanup_net() with parallel
{un,}register_pernet_operations() calls, so ->exit{,batch} methods
of the same pernet operations are executed for a dying net, as
were used to call ->init methods, even after the net namespace
is unlinked from net_namespace_list in cleanup_net().

But there are several problems with scalability. The first one
is that more than one net can't be created or destroyed
at the same moment on the node. For big machines with many cpus
running many containers it's very sensitive.

The second one is that it's need to synchronize_rcu() after net
is removed from net_namespace_list():

Destroy net_ns:
cleanup_net()
  mutex_lock(&net_mutex)
  list_del_rcu(&net->list)
  synchronize_rcu()                                  <--- Sleep there for ages
  list_for_each_entry_reverse(ops, &pernet_list, list)
    ops_exit_list(ops, &net_exit_list)
  list_for_each_entry_reverse(ops, &pernet_list, list)
    ops_free_list(ops, &net_exit_list)
  mutex_unlock(&net_mutex)

This primitive is not fast, especially on the systems with many processors
and/or when preemptible RCU is enabled in config. So, all the time, while
cleanup_net() is waiting for RCU grace period, creation of new net namespaces
is not possible, the tasks, who makes it, are sleeping on the same mutex:

Create net_ns:
copy_net_ns()
  mutex_lock_killable(&net_mutex)                    <--- Sleep there for ages

I observed 20-30 seconds hangs of "unshare -n" on ordinary 8-cpu laptop
with preemptible RCU enabled after CRIU tests round is finished.

The solution is to convert net_mutex to the rw_semaphore and add fine grain
locks to really small number of pernet_operations, what really need them.

Then, pernet_operations::init/::exit methods, modifying the net-related data,
will require down_read() locking only, while down_write() will be used
for changing pernet_list (i.e., when modules are being loaded and unloaded).

This gives signify performance increase, after all patch set is applied,
like you may see here:

%for i in {1..10000}; do unshare -n bash -c exit; done

*before*
real 1m40,377s
user 0m9,672s
sys 0m19,928s

*after*
real 0m17,007s
user 0m5,311s
sys 0m11,779

(5.8 times faster)

This patch starts replacing net_mutex to net_sem. It adds rw_semaphore,
describes the variables it protects, and makes to use, where appropriate.
net_mutex is still present, and next patches will kick it out step-by-step.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 10:36:04 -05:00
..
6lowpan
9p vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
802
8021q net: delete /proc THIS_MODULE references 2018-01-16 15:01:33 -05:00
appletalk net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
atm net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
ax25 net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
batman-adv vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
bluetooth net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
bpf bpf: fix null pointer deref in bpf_prog_test_run_xdp 2018-02-01 07:43:56 -08:00
bridge net: bridge: Fix uninitialized error in br_fdb_sync_static() 2018-02-01 09:47:37 -05:00
caif vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
can net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
ceph libceph: check kstrndup() return value 2018-01-29 18:36:12 +01:00
core net: Introduce net_sem for protection of pernet_list 2018-02-13 10:36:04 -05:00
dcb
dccp vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
decnet net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
dns_resolver afs: Support the AFS dynamic root 2018-02-06 14:43:37 +00:00
dsa net: dsa: Support internal phy on 'cpu' port 2018-01-23 19:22:38 -05:00
ethernet
hsr
ieee802154
ife
ipv4 net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
ipv6 net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
iucv net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
kcm vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
key af_key: Fix memory leak in key_notify_policy. 2018-01-10 09:45:11 +01:00
l2tp net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
l3mdev
lapb
llc net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
mac80211 debugfs_sta: Remove unneeded semicolons 2018-01-22 14:03:28 +01:00
mac802154
mpls mpls, nospec: Sanitize array index in mpls_label_ok() 2018-02-08 15:24:12 -05:00
ncsi net/ncsi: Don't take any action on HNCDSC AEN 2017-12-18 14:50:11 -05:00
netfilter netfilter: nf_flow_offload: fix use-after-free and a resource leak 2018-02-07 11:55:52 +01:00
netlabel
netlink net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
netrom net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
nfc net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
nsh
openvswitch openvswitch: Remove padding from packet before L3+ conntrack processing 2018-02-01 09:46:22 -05:00
packet net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
phonet net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
psample
qrtr net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
rds net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
rfkill vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
rose net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
rxrpc vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
sched Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-02-09 15:34:18 -08:00
sctp net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
smc net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
strparser strparser: Call sock_owned_by_user_nocheck 2017-12-28 14:28:22 -05:00
sunrpc net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
switchdev
tipc net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
tls net: add a UID to use for ULP socket assignment 2018-02-06 11:39:31 +01:00
unix net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
vmw_vsock net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
wimax
wireless Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-02-04 11:45:55 -08:00
x25 net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
xfrm Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next 2018-01-26 10:22:53 -05:00
compat.c
Kconfig Staging/IIO patches for 4.16-rc1 2018-02-01 09:51:57 -08:00
Makefile
socket.c net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
sysctl_net.c