linux-stable/net
Jon Paul Maloy 35c55c9877 tipc: add neighbor monitoring framework
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.

This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.

This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:

- Each node makes up a linearly ascending, circular list of all its N
  known neighbors, based on their TIPC node identity. This algorithm
  must be the same on all nodes.

- The node then selects the next M = sqrt(N) - 1 nodes downstream from
  itself in the list, and chooses to actively monitor those. This is
  called its "local monitoring domain".

- It creates a domain record describing the monitoring domain, and
  piggy-backs this in the data area of all neighbor monitoring messages
  (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
  the cluster eventually (default within 400 ms) will learn about
  its monitoring domain.

- Whenever a node discovers a change in its local domain, e.g., a node
  has been added or has gone down, it creates and sends out a new
  version of its node record to inform all neighbors about the change.

- A node receiving a domain record from anybody outside its local domain
  matches this against its own list (which may not look the same), and
  chooses to not actively monitor those members of the received domain
  record that are also present in its own list. Instead, it relies on
  indications from the direct monitoring nodes if an indirectly
  monitored node has gone up or down. If a node is indicated lost, the
  receiving node temporarily activates its own direct monitoring towards
  that node in order to confirm, or not, that it is actually gone.

- Since each node is actively monitoring sqrt(N) downstream neighbors,
  each node is also actively monitored by the same number of upstream
  neighbors. This means that all non-direct monitoring nodes normally
  will receive sqrt(N) indications that a node is gone.

- A major drawback with ring monitoring is how it handles failures that
  cause massive network partitionings. If both a lost node and all its
  direct monitoring neighbors are inside the lost partition, the nodes in
  the remaining partition will never receive indications about the loss.
  To overcome this, each node also chooses to actively monitor some
  nodes outside its local domain. Those nodes are called remote domain
  "heads", and are selected in such a way that no node in the cluster
  will be more than two direct monitoring hops away. Because of this,
  each node, apart from monitoring the member of its local domain, will
  also typically monitor sqrt(N) remote head nodes.

- As an optimization, local list status, domain status and domain
  records are marked with a generation number. This saves senders from
  unnecessarily conveying  unaltered domain records, and receivers from
  performing unneeded re-adaptations of their node monitoring list, such
  as re-assigning domain heads.

- As a measure of caution we have added the possibility to disable the
  new algorithm through configuration. We do this by keeping a threshold
  value for the cluster size; a cluster that grows beyond this value
  will switch from full-mesh to ring monitoring, and vice versa when
  it shrinks below the value. This means that if the threshold is set to
  a value larger than any anticipated cluster size (default size is 32)
  the new algorithm is effectively disabled. A patch set for altering the
  threshold value and for listing the table contents will follow shortly.

- This change is fully backwards compatible.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:06:28 -07:00
..
6lowpan 6lowpan: move mac802154 header 2016-04-13 10:41:10 +02:00
9p remove lots of IS_ERR_VALUE abuses 2016-05-27 15:26:11 -07:00
802
8021q vlan: Propagate MAC address to VLANs 2016-05-31 11:56:48 -07:00
appletalk
atm net/atm: sk_err_soft must be positive 2016-05-23 13:51:10 -07:00
ax25
batman-adv sched: remove NET_XMIT_POLICED 2016-06-12 22:02:11 -04:00
bluetooth net: add netdev_lockdep_set_classes() helper 2016-06-09 13:28:37 -07:00
bridge bridge: Don't insert unnecessary local fdb entry on changing mac address 2016-06-08 00:31:38 -07:00
caif
can sock: enable timestamping using control messages 2016-04-04 15:50:30 -04:00
ceph libceph: use %s instead of %pE in dout()s 2016-05-30 23:00:23 +02:00
core sched: remove NET_XMIT_POLICED 2016-06-12 22:02:11 -04:00
dcb
dccp dccp: do not assume DCCP code is non preemptible 2016-05-02 17:02:25 -04:00
decnet decnet: Do not build routes to devices without decnet private data. 2016-04-10 23:01:30 -04:00
dns_resolver KEYS: Add a facility to restrict new links into a keyring 2016-04-11 22:37:37 +01:00
dsa net: dsa: Initialize CPU port ethtool ops per tree 2016-06-08 11:23:42 -07:00
ethernet
hsr net/hsr: Use setup_timer and mod_timer. 2016-05-16 14:00:43 -04:00
ieee802154 net: add netdev_lockdep_set_classes() helper 2016-06-09 13:28:37 -07:00
ipv4 tcp: return sizeof tcp_dctcp_info in dctcp_get_info() 2016-06-14 23:46:30 -07:00
ipv6 net: vrf: Handle ipv6 multicast and link-local addresses 2016-06-15 12:34:34 -07:00
ipx
irda TTY and Serial driver update for 4.7-rc1 2016-05-20 20:57:27 -07:00
iucv af_iucv: use paged SKBs for big inbound messages 2016-06-15 12:21:05 -07:00
kcm kcm: fix a signedness in kcm_splice_read() 2016-05-19 11:26:51 -07:00
key
l2tp ipv6: use TOS marks from sockets for routing decision 2016-06-11 15:33:26 -07:00
l3mdev net: l3mdev: Remove const from flowi6 arg to get_rt6_dst 2016-06-15 12:34:34 -07:00
lapb net/lapb: tuse %*ph to dump buffers 2016-05-29 22:33:25 -07:00
llc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-05-09 15:59:24 -04:00
mac80211 For the next cycle, we have the following: 2016-06-10 23:13:32 -07:00
mac802154
mpls skbuff: introduce skb_gso_validate_mtu 2016-06-03 19:37:21 -04:00
netfilter Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-06-10 11:52:24 -07:00
netlabel netlabel: fix a problem with netlbl_secattr_catmap_setrng() 2016-04-05 16:10:47 -04:00
netlink net/netlink/af_netlink.h: Remove unused structure. 2016-06-09 22:26:24 -07:00
netrom
nfc nfc: nci: Add nci_nfcc_loopback to the nci core 2016-05-04 01:48:16 +02:00
openvswitch openvswitch: Add packet truncation support. 2016-06-10 17:58:03 -07:00
packet packet: use common code for virtio_net_hdr and skb GSO conversion 2016-06-10 23:03:56 -07:00
phonet
qrtr Merge tag 'qcom-soc-for-4.7-2' into net-next 2016-05-17 14:11:19 -04:00
rds RDS: Update rds_conn_destroy to be MP capable 2016-06-14 23:50:44 -07:00
rfkill rfkill: Use switch to demux userspace operations 2016-04-05 10:48:53 +02:00
rose
rxrpc rxrpc: Update the comments in ar-internal.h to reflect renames 2016-06-13 13:38:51 +01:00
sched net_sched: make tcf_hash_check() boolean 2016-06-15 12:43:35 -07:00
sctp sctp: fix error return code in sctp_init() 2016-06-14 23:45:42 -07:00
sunrpc NFS client updates for Linux 4.7 2016-05-26 10:33:33 -07:00
switchdev switchdev: pass pointer to fib_info instead of copy 2016-05-17 13:58:49 -04:00
tipc tipc: add neighbor monitoring framework 2016-06-15 14:06:28 -07:00
unix constify security_path_{mkdir,mknod,symlink} 2016-03-28 00:47:27 -04:00
vmw_vsock Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-05-09 15:59:24 -04:00
wimax
wireless For the next cycle, we have the following: 2016-06-10 23:13:32 -07:00
x25 net: fix a kernel infoleak in x25 module 2016-05-09 22:45:33 -04:00
xfrm Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-05-09 15:59:24 -04:00
compat.c packet: compat support for sock_fprog 2016-06-09 23:41:03 -07:00
Kconfig bpf: add generic constant blinding for use in jits 2016-05-16 13:49:32 -04:00
Makefile net: Add Qualcomm IPC router 2016-05-08 23:46:14 -04:00
socket.c fs: poll/select/recvmmsg: use timespec64 for timeout events 2016-05-19 19:12:14 -07:00
sysctl_net.c