linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-10-29 23:53:32 +00:00

Author	SHA1	Message	Date
Vakul Garg	d2bdd26812	net/tls: Use aead_request_alloc/free for request alloc/free Instead of kzalloc/free for aead_request allocation and free, use functions aead_request_alloc(), aead_request_free(). It ensures that any sensitive crypto material held in crypto transforms is securely erased from memory. Signed-off-by: Vakul Garg <vakul.garg@nxp.com> Acked-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-12 14:44:11 -07:00
Pieter Jansen van Vuuren	cba54f9cf4	tc-testing: add geneve options in tunnel_key unit tests Extend tc tunnel_key action unit tests with geneve options. Tests include testing single and multiple geneve options, as well as testing geneve options that are expected to fail. Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com> Acked-by: Lucas Bates <lucasb@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-12 12:34:31 -07:00
David S. Miller	f1fbeada1b	Merge branch 'be2net-small-structures-clean-up' Ivan Vecera says: ==================== be2net: small structures clean-up The series: - removes unused / unnecessary fields in several be2net structures - re-orders fields in some structures to eliminate holes and cache-line crossings - as a result reduces size of main struct be_adapter by 4kB ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-12 00:03:31 -07:00
Ivan Vecera	28ace84b10	be2net: move rss_flags field in rss_info to ensure proper alignment The current position of the .rss_flags field in struct rss_info causes the .rsstable and .rss_queue fields (both 128 bytes long) to cross cache-line boundaries. Moving it to the end properly aligns all fields. Before patch: struct rss_info { u64 rss_flags; /* 0 8 */ u8 rsstable[128]; /* 8 128 */ /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */ u8 rss_queue[128]; /* 136 128 */ /* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */ u8 rss_hkey[40]; /* 264 40 */ }; After patch: struct rss_info { u8 rsstable[128]; /* 0 128 */ /* --- cacheline 2 boundary (128 bytes) --- */ u8 rss_queue[128]; /* 128 128 */ /* --- cacheline 4 boundary (256 bytes) --- */ u8 rss_hkey[40]; /* 256 40 */ u64 rss_flags; /* 296 8 */ }; Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-12 00:03:31 -07:00
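The offsets quoted in the rss_info commit above can be reproduced outside the kernel. The following is a minimal sketch using Python's ctypes, assuming a 64-byte cache line and a struct that itself starts on a cache-line boundary; the class names and the `lines_touched` helper are hypothetical and only mirror the field order shown in the commit message.

```python
import ctypes

CACHELINE = 64  # assumed cache-line size in bytes, matching the pahole output


class RssInfoBefore(ctypes.Structure):
    # Field order before the patch: rss_flags comes first.
    _fields_ = [
        ("rss_flags", ctypes.c_uint64),
        ("rsstable", ctypes.c_uint8 * 128),
        ("rss_queue", ctypes.c_uint8 * 128),
        ("rss_hkey", ctypes.c_uint8 * 40),
    ]


class RssInfoAfter(ctypes.Structure):
    # Field order after the patch: rss_flags moved to the end.
    _fields_ = [
        ("rsstable", ctypes.c_uint8 * 128),
        ("rss_queue", ctypes.c_uint8 * 128),
        ("rss_hkey", ctypes.c_uint8 * 40),
        ("rss_flags", ctypes.c_uint64),
    ]


def lines_touched(struct, field):
    """Number of cache lines a field spans (struct assumed line-aligned)."""
    desc = getattr(struct, field)
    first = desc.offset // CACHELINE
    last = (desc.offset + desc.size - 1) // CACHELINE
    return last - first + 1


for struct in (RssInfoBefore, RssInfoAfter):
    for name, _ in struct._fields_:
        desc = getattr(struct, name)
        print(struct.__name__, name, desc.offset, desc.size,
              lines_touched(struct, name))
```

In this model each 128-byte table spans three cache lines before the move and two after it, while the total struct size stays the same.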
Ivan Vecera	03d231a963	be2net: re-order fields in be_error_recovert to avoid hole - Unionize two u8 fields where only one of them is used depending on NIC chipset. - Move recovery_supported field after that union These changes eliminate a 7-byte hole in the struct and make it smaller by 8 bytes. Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-12 00:03:31 -07:00
Ivan Vecera	f9520b86dc	be2net: remove unused tx_jiffies field from be_tx_stats Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-12 00:03:30 -07:00
Ivan Vecera	646d2c10aa	be2net: move txcp field in be_tx_obj to eliminate holes in the struct Before patch: struct be_tx_obj { u32 db_offset; /* 0 4 */ /* XXX 4 bytes hole, try to pack */ struct be_queue_info q; /* 8 56 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct be_queue_info cq; /* 64 56 */ struct be_tx_compl_info txcp; /* 120 4 */ /* XXX 4 bytes hole, try to pack */ /* --- cacheline 2 boundary (128 bytes) --- */ struct sk_buff *sent_skb_list[2048]; /* 128 16384 */ ... }; After patch: struct be_tx_obj { u32 db_offset; /* 0 4 */ struct be_tx_compl_info txcp; /* 4 4 */ struct be_queue_info q; /* 8 56 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct be_queue_info cq; /* 64 56 */ struct sk_buff *sent_skb_list[2048]; /* 120 16384 */ ... }; Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-12 00:03:30 -07:00
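The two holes reported for be_tx_obj follow from ordinary alignment padding. The sketch below is a simplified, pahole-like layout calculator, not the kernel's or pahole's actual logic. The sizes come from the commit message; the alignments (8 bytes for be_queue_info and the pointer array, 4 bytes for the u32-sized fields) are assumptions for a 64-bit build, and the `layout` helper is hypothetical.

```python
def layout(fields):
    """Place fields in order, honouring alignment.

    fields: list of (name, size, align) tuples.
    Returns (placed, holes), where placed is a list of (name, offset, size)
    and holes is a list of (offset, length) padding gaps between fields.
    """
    offset = 0
    placed, holes = [], []
    for name, size, align in fields:
        aligned = (offset + align - 1) // align * align  # round up to alignment
        if aligned > offset:
            holes.append((offset, aligned - offset))
        placed.append((name, aligned, size))
        offset = aligned + size
    return placed, holes


# Sizes from the commit message; alignments are assumed (64-bit build).
before = [
    ("db_offset", 4, 4),
    ("q", 56, 8),
    ("cq", 56, 8),
    ("txcp", 4, 4),
    ("sent_skb_list", 16384, 8),
]
after = [
    ("db_offset", 4, 4),
    ("txcp", 4, 4),
    ("q", 56, 8),
    ("cq", 56, 8),
    ("sent_skb_list", 16384, 8),
]

for label, fields in (("before", before), ("after", after)):
    placed, holes = layout(fields)
    print(label, "holes:", holes, "list offset:", placed[-1][1])
```

Under these assumptions, moving txcp next to db_offset lets the two 4-byte fields share one 8-byte slot, which removes both holes and moves sent_skb_list from offset 128 to 120.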
Ivan Vecera	e9c74cd85c	be2net: reorder fields in be_eq_obj structure Re-order fields in struct be_eq_obj to ensure that .napi field begins at start of cache-line. Also the .adapter field is moved to the first cache-line next to .q field and 3 fields (idx,msi_idx,spurious_intr) and the 4-bytes hole to 3rd cache-line. Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-12 00:03:30 -07:00
Ivan Vecera	d6d9704af8	be2net: remove desc field from be_eq_obj The event queue description (be_eq_obj.desc) field is used only to format string for IRQ name and it is not really needed to hold this value. Remove it and use local variable to format string for IRQ name. Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-12 00:03:30 -07:00
Ivan Vecera	c1328a27bb	be2net: remove unused old custom busy-poll fields The commit `fb6113e688` ("be2net: get rid of custom busy poll code") replaced custom busy-poll code by the generic one but left several macros and fields in struct be_eq_obj that are currently unused. Remove this stuff. Fixes: `fb6113e688` ("be2net: get rid of custom busy poll code") Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-12 00:03:30 -07:00
Ivan Vecera	a5d7fcb689	be2net: remove unused old AIC info The commit `2632bafd74` ("be2net: fix adaptive interrupt coalescing") introduced a separate struct be_aic_obj to hold AIC information but unfortunately left the old stuff in be_eq_obj. So remove it. Fixes: `2632bafd74` ("be2net: fix adaptive interrupt coalescing") Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-12 00:03:30 -07:00
Ivan Khoronzhuk	d0c694fc7b	net: ethernet: ti: cpts: break cycle once late ts is matched The late ts queue can contain a bunch of skbs while hi rate testing, no need to check all of them if timestamp is already matched. Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-12 00:00:07 -07:00
Petr Machata	4280129838	selftests: forwarding: mirror_gre_nh: Unset rp_filter on host VRF The mirrored packets arrive at $h3 encapsulated in GRE/IPv4, with IP address from 192.0.2.128/28 network. However the interface is configured as a member of 192.0.2.160/28 and there's no route directing traffic from the former network through that interface. Correspondingly, the RP filter on the VRF rejects it. Therefore turn off the VRF's RP filter. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 23:59:27 -07:00
David S. Miller	d90a5215c8	Merge branch 'mlxsw-ERSPAN-Take-LACP-state-into-consideration' Ido Schimmel says: ==================== mlxsw: ERSPAN: Take LACP state into consideration Petr says: When offloading mirror-to-gretap, mlxsw needs to preroute the path that the encapsulated packet will take. That path may include a LAG device above a front panel port. So far, mlxsw resolved the path to the first up front panel slave of the LAG interface, but that only reflects administrative state of the port. It neglects to consider whether the port actually has a carrier, and what the LACP state is. This patch set aims to address these problems. Patch #1 publishes team_port_get_rcu(). Then in patch #2, a new function is introduced, mlxsw_sp_port_dev_check(). That returns, for a given netdevice that is a slave of a LAG device, whether that device is "txable", i.e. whether the LAG master would send traffic through it. Since there's no good place to put LAG-wide helpers, introduce a new header include/net/lag.h. Finally in patch #3, fix the slave selection logic to take into consideration whether a given slave has a carrier and whether it is txable. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 23:10:20 -07:00
Petr Machata	b5de82f3df	mlxsw: spectrum_span: Change LAG lower selection When offloading mirror-to-gretap, mlxsw needs to preroute the path that the encapsulated packet will take. That path may include a LAG device above a front panel port. So far, mlxsw resolved the path to the first up front panel slave of the LAG interface, but that only reflects administrative state of the port. It neglects to consider whether the port actually has a carrier, and what the LACP state is. So instead of checking upness of the device, check carrier state and txability. Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 23:10:19 -07:00
Petr Machata	eeed992b77	net: Add lag.h, net_lag_port_dev_txable() LAG devices (team or bond) recognize for each one of their slave devices whether LAG traffic is going to be sent through that device. Bond calls such devices "active", team calls them "txable". When this state changes, a NETDEV_CHANGELOWERSTATE notification is distributed, together with a netdev_notifier_changelowerstate_info structure that for LAG devices includes a tx_enabled flag that refers to the new state. The notification thus makes it possible to react to the changes in txability in drivers. However there's no way to query txability from the outside on demand. That is problematic namely for mlxsw, which when resolving ERSPAN packet path, may encounter a LAG device, and needs to determine which of the slaves it should choose. To that end, introduce a new function, net_lag_port_dev_txable(), which determines whether a given slave device is "active" or "txable" (depending on the flavor of the LAG device). That function then dispatches to per-LAG-flavor helpers, bond_is_active_slave_dev() resp. team_port_dev_txable(). Because there currently is no good place where net_lag_port_dev_txable() should be added, introduce a new header file, lag.h, which should from now on hold any logic common to both team and bond. (But keep netif_is_lag_master() together with the rest of netif_is_*_master() functions). Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 23:10:19 -07:00
Petr Machata	3443b00e07	team: Publish team_port_get_rcu() A follow-up patch adds a new entry point, team_port_dev_txable(). Making it an ordinary exported function would mean that any module that may need the service in one of the supported configurations also unconditionally needs to pull in the team module, whether or not the user actually intends to create team interfaces. To prevent that, team_port_dev_txable() is defined in if_team.h, and therefore all dependencies of that function also need to be publicly-visible. Therefore move team_port_get_rcu() from team.c to if_team.h. Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 23:10:19 -07:00
Travis Brown	80fd2d6ca5	macvlan: Change status when lower device goes down Today macvlan ignores the notification when a lower device goes administratively down, preventing the lack of connectivity from bubbling up. Processing NETDEV_DOWN results in a macvlan state of LOWERLAYERDOWN with NO-CARRIER which should be easy to interpret in userspace. 2: lower: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000 3: macvlan@lower: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 qdisc noqueue state LOWERLAYERDOWN mode DEFAULT group default qlen 1000 Signed-off-by: Suresh Krishnan <skrishnan@arista.com> Signed-off-by: Travis Brown <travisb@arista.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 23:07:22 -07:00
David S. Miller	0e97c4fb18	Merge branch 'tipc-make-link-protocol-more-resilient' Jon Maloy says: ==================== tipc: make link protocol more resilient These two commits make the link protocol more resilient to infrastructures with frequent packet duplication and long delays. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 23:06:14 -07:00
Jon Maloy	7ea817f4e8	tipc: check session number before accepting link protocol messages In some virtual environments we observe significantly more packet reordering and longer delays than we have traditionally been used to. This makes stricter checks on the session number of incoming link protocol messages necessary; until now it has only been validated for RESET messages. Since the other two message types, ACTIVATE and STATE messages, also carry this number, it is easy to extend the validation check to those messages. We also introduce a flag indicating whether a link has a valid peer session number. This eliminates the mixing of 32- and 16-bit arithmetic we are currently using to achieve this. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 23:06:14 -07:00
Jon Maloy	9012de5089	tipc: add sequence number check for link STATE messages Some switch infrastructures produce huge amounts of packet duplicates. This becomes a problem if those messages are STATE/NACK protocol messages, causing unnecessary retransmissions of already accepted packets. We now introduce a unique sequence number per STATE protocol message so that duplicates can be identified and ignored. This will also be useful when tracing such cases, and to avert replay attacks when TIPC is encrypted. For compatibility reasons we have to introduce a new capability flag TIPC_LINK_PROTO_SEQNO to handle this new feature. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 23:06:14 -07:00
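The duplicate detection described in the STATE sequence-number commit above can be illustrated with a small receiver model. This is only a sketch: the 16-bit field width, the `StateReceiver` class and the `is_newer` helper are assumptions for illustration and are not taken from the TIPC source.

```python
SEQ_BITS = 16            # assumed width of the per-message sequence number
SEQ_MOD = 1 << SEQ_BITS


def is_newer(seqno, last):
    """True if seqno is ahead of last in wraparound (modular) arithmetic."""
    diff = (seqno - last) % SEQ_MOD
    return 0 < diff < SEQ_MOD // 2


class StateReceiver:
    """Accepts each STATE message once; duplicates and stale copies are ignored."""

    def __init__(self):
        self.last_seqno = None  # nothing accepted yet

    def accept(self, seqno):
        if self.last_seqno is not None and not is_newer(seqno, self.last_seqno):
            return False        # duplicate or older than the last accepted one
        self.last_seqno = seqno
        return True


rx = StateReceiver()
for seqno in (65534, 65534, 65535, 0, 65535):
    print(seqno, rx.accept(seqno))
```

The modular comparison keeps the check working when the counter wraps from 65535 back to 0, which a plain `>` comparison would not.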
David S. Miller	e32f55f373	Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue Jeff Kirsher says: ==================== L2 Fwd Offload & 10GbE Intel Driver Updates 2018-07-09 This patch series is meant to allow support for the L2 forward offload, aka MACVLAN offload without the need for using ndo_select_queue. The existing solution currently requires that we use ndo_select_queue in the transmit path if we want to associate specific Tx queues with a given MACVLAN interface. In order to get away from this we need to repurpose the tc_to_txq array and XPS pointer for the MACVLAN interface and use those as a means of accessing the queues on the lower device. As a result we cannot offload a device that is configured as multiqueue, however it doesn't really make sense to configure a macvlan interfaced as being multiqueue anyway since it doesn't really have a qdisc of its own in the first place. The big changes in this set are: Allow lower device to update tc_to_txq and XPS map of offloaded MACVLAN Disable XPS for single queue devices Replace accel_priv with sb_dev in ndo_select_queue Add sb_dev parameter to fallback function for ndo_select_queue Consolidated ndo_select_queue functions that appeared to be duplicates ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 23:03:32 -07:00
Deepti Raghavan	4929c9428a	tcp: expose both send and receive intervals for rate sample Congestion control algorithms, which access the rate sample through the tcp_cong_control function, only have access to the maximum of the send and receive interval, for cases where the acknowledgment rate may be inaccurate due to ACK compression or decimation. Algorithms may want to use send rates and receive rates as separate signals. Signed-off-by: Deepti Raghavan <deeptir@mit.edu> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 23:01:56 -07:00
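To see why separate send and receive intervals can matter, consider the arithmetic in the rate-sample commit above. The sketch below is illustrative only: the `RateSample` class and its field names are hypothetical stand-ins, not the kernel's struct rate_sample.

```python
from dataclasses import dataclass


@dataclass
class RateSample:
    delivered: int          # packets delivered during the sample
    snd_interval_us: int    # time over which those packets were sent
    rcv_interval_us: int    # time over which their ACKs arrived

    @property
    def interval_us(self):
        # The conservative interval: the longer of the two phases.
        return max(self.snd_interval_us, self.rcv_interval_us)


def rate_pps(packets, interval_us):
    """Packets per second over the given interval."""
    return packets * 1_000_000 / interval_us


# ACK compression: 100 packets sent over 10 ms, ACKs arrive within 2 ms.
rs = RateSample(delivered=100, snd_interval_us=10_000, rcv_interval_us=2_000)
print(rate_pps(rs.delivered, rs.interval_us))      # rate over the max interval
print(rate_pps(rs.delivered, rs.snd_interval_us))  # send rate
print(rate_pps(rs.delivered, rs.rcv_interval_us))  # receive (ACK) rate
```

With only the maximum available, an algorithm sees a single rate; with both intervals it can tell that the ACK rate here is five times the send rate, a sign of ACK compression.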
Vlad Buslov	e0479b670d	net: sched: fix unprotected access to rcu cookie pointer Fix action attribute size calculation function to take rcu read lock and access act_cookie pointer with rcu dereference. Fixes: `eec94fdb04` ("net: sched: use rcu for action cookie update") Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 23:01:02 -07:00
David S. Miller	2368957ab5	Merge branch 'cxgb4-move-stats-fetched-from-firmware-to-debugfs' Rahul Lakkireddy says: ==================== cxgb4: move stats fetched from firmware to debugfs Some stats are fetched via slow firmware mailbox, which can cause packet drops under heavy load. So, this series removes these stats from ethtool -S and expose them via debugfs. Patch 1 removes stats fetched via firmware from ethtool -S. Patch 2 exposes stats removed in Patch 1 via debugfs. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 22:59:39 -07:00
Rahul Lakkireddy	31e5f5c3e9	cxgb4: expose stats fetched from firmware via debugfs Expose stats obtained from firmware via debugfs. These stats can't be part of ethtool -S because the slow firmware mailbox can cause packet drops under heavy load. Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 22:59:38 -07:00
Rahul Lakkireddy	b351b16d8a	cxgb4: remove stats fetched from firmware When running ethtool -S, some stats are requested from firmware. Since getting these stats via firmware mailbox is slow, some packets get dropped under heavy load while running ethtool -S. So, remove these stats from ethtool -S. Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 22:59:38 -07:00
Antoine Tenart	b32b088181	net: mvpp2: explicitly include linux/interrupt.h The Marvell PPv2 driver uses interrupts and tasklets but does not explicitly include linux/interrupt.h, relying on implicit includes. This one in particular is included by chance after a long illogical chain of inclusions. Fix this so we do not get future build breaks. Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 22:56:52 -07:00
Jan Dakinevich	fb8ed3af74	cnic: use kvzalloc to allocate memory for csk_tbl Size of csk_tbl is about 58K, which means 3rd order page allocation. kvzalloc provides a fallback if no high order memory is available. Signed-off-by: Jan Dakinevich <jan.dakinevich@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 22:55:52 -07:00
Colin Ian King	3c546728df	wimax/i2400m: remove redundant variables ack_status, bcf and protocol Variables ack_status, bcf and protocol are being assigned but are never used hence they are redundant and can be removed. Also declare ack_type as unsigned int rather than unsigned to clean up a checkpatch warning. Cleans up clang warnings: warning: variable 'ack_status' set but not used [-Wunused-but-set-variable] warning: variable 'bcf' set but not used [-Wunused-but-set-variable] warning: variable 'protocol' set but not used [-Wunused-but-set-variable] Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 22:54:25 -07:00
Vlad Buslov	01e866bf07	net: sched: act_ife: fix memory leak in ife init Free params if tcf_idr_check_alloc() returned error. Fixes: `0190c1d452` ("net: sched: atomically check-allocate action") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 22:53:00 -07:00
Arjun Vynipadath	8dce04f1fd	cxgb4: specify IQTYPE in fw_iq_cmd congestion argument passed to t4_sge_alloc_rxq() is used to differentiate between nic/ofld queues. Signed-off-by: Arjun Vynipadath <arjun@chelsio.com> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 22:52:10 -07:00
David S. Miller	57cd07fbf7	Merge branch 'net-ipv6-addr_gen_mode-fixes' Sabrina Dubroca says: ==================== net/ipv6: addr_gen_mode fixes This series fixes bugs in handling of the addr_gen_mode option, mainly related to the sysctl. A minor netlink issue was also present in the initial commit introducing the option on a per-netdevice basis. v2: add patch 4, requested by David Ahern during review of v1 add patch 5, missing documentation for the sysctl patches 1, 2, 3 are unchanged ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 22:50:46 -07:00
Sabrina Dubroca	f168db5e25	Documentation: ip-sysctl.txt: document addr_gen_mode addr_gen_mode was introduced without documentation; add it now. Fixes: `d35a00b8e3` ("net/ipv6: allow sysctl to change link-local address generation mode") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 22:50:45 -07:00
Sabrina Dubroca	f24c5987dd	net/ipv6: propagate net.ipv6.conf.all.addr_gen_mode to devices This aligns the addr_gen_mode sysctl with the expected behavior of the "all" variant. Fixes: `d35a00b8e3` ("net/ipv6: allow sysctl to change link-local address generation mode") Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 22:50:45 -07:00
Sabrina Dubroca	bdd72f4133	net/ipv6: reserve room for IFLA_INET6_ADDR_GEN_MODE inet6_ifla6_size() is called to check how much space is needed by inet6_fill_link_af() and inet6_fill_ifinfo(), both of which include the IFLA_INET6_ADDR_GEN_MODE attribute. Reserve some room for it. Fixes: `bc91b0f07a` ("ipv6: addrconf: implement address generation modes") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 22:50:45 -07:00
Sabrina Dubroca	70c30d76e5	net/ipv6: don't reinitialize ndev->cnf.addr_gen_mode on new inet6_dev The value has already been copied from this netns's devconf_dflt, it shouldn't be reset to the global kernel default. Fixes: `d35a00b8e3` ("net/ipv6: allow sysctl to change link-local address generation mode") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 22:50:45 -07:00
Sabrina Dubroca	c6dbf7aaa4	net/ipv6: fix addrconf_sysctl_addr_gen_mode addrconf_sysctl_addr_gen_mode() has multiple problems. First, it ignores the errors returned by proc_dointvec(). addrconf_sysctl_addr_gen_mode() calls proc_dointvec() directly, which writes the value to memory, and then checks if it's valid and may return EINVAL. If a bad value is given, the value displayed when reading net.ipv6.conf.foo.addr_gen_mode next time will be invalid. In case the value provided by the user was valid, addrconf_dev_config() won't be called since idev->cnf.addr_gen_mode has already been updated. Fix this in the usual way we deal with values that need to be checked after the proc_do*() helper has returned: define a local ctl_table and storage, call proc_dointvec() on that temporary area, then check and store. addrconf_sysctl_addr_gen_mode() also writes the new value to the global ipv6_devconf_dflt, when we're writing to some netns's default, so that new netns will inherit the value that was set by the change occurring in any netns. That doesn't make any sense, so let's drop this assignment. Finally, since addr_gen_mode is a __u32, switch to proc_douintvec(). Fixes: `d35a00b8e3` ("net/ipv6: allow sysctl to change link-local address generation mode") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 22:50:45 -07:00
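The "parse into temporary storage, then check and store" pattern from the addrconf_sysctl_addr_gen_mode commit above can be modelled in a few lines. This is a sketch under stated assumptions, not the kernel code: `parse_uint` is a hypothetical stand-in for proc_douintvec(), the `conf` dict stands in for the device configuration, `on_change` stands in for the reconfiguration call, and the set of valid modes (0 to 3) reflects my understanding of the address generation mode values.

```python
import errno

# Assumed valid modes: EUI64, NONE, STABLE_PRIVACY, RANDOM.
VALID_MODES = {0, 1, 2, 3}


def parse_uint(text):
    """Hypothetical stand-in for proc_douintvec(): returns (err, value)."""
    try:
        value = int(text.strip())
    except ValueError:
        return -errno.EINVAL, None
    if value < 0 or value > 0xFFFFFFFF:
        return -errno.EINVAL, None
    return 0, value


def write_addr_gen_mode(conf, text, on_change):
    """Validate before storing, so a bad write never reaches conf."""
    err, new_mode = parse_uint(text)      # parsed into a temporary only
    if err:
        return err                        # parser errors are propagated
    if new_mode not in VALID_MODES:
        return -errno.EINVAL              # conf is left untouched
    if conf["addr_gen_mode"] != new_mode:
        conf["addr_gen_mode"] = new_mode  # store only after validation
        on_change(new_mode)               # old value still known, so this fires
    return 0


conf = {"addr_gen_mode": 0}
changes = []
print(write_addr_gen_mode(conf, "7", changes.append), conf, changes)
print(write_addr_gen_mode(conf, "2", changes.append), conf, changes)
```

Because the stored value changes only after validation, a rejected write is never visible on the next read, and the comparison against the old value still works for valid writes.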
Jianbo Liu	5e9a0fe492	net/sched: flower: Fix null pointer dereference when run tc vlan command Zahari issued a tc vlan command without setting vlan_ethtype, which crashes the kernel. To avoid this, we must check that tb[TCA_FLOWER_KEY_VLAN_ETH_TYPE] is not null before using it. Also we don't need to dump vlan_ethtype or cvlan_ethtype in this case. Fixes: `d64efd0926` ('net/sched: flower: Add supprt for matching on QinQ vlan headers') Signed-off-by: Jianbo Liu <jianbol@mellanox.com> Reported-by: Zahari Doychev <zahari.doychev@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-11 22:48:13 -07:00
Petr Machata	db560d1612	selftests: forwarding: mirror_lib: Tighten up VLAN capture The function do_test_span_vlan_dir_ips() is used for testing whether mirrored packets are VLAN-encapsulated. But since it only considers VLAN encapsulation, it may end up matching unmirrored ARP traffic as well. One consequence is a rare failure of mirror_gre_vlan_bridge_1q's test_gretap_untagged_egress. Decreasing ping cadence in mirror_test() makes the problem easily reproducible. Therefore tighten up the match criterion to only count those 802.1q packets where the next header is IP. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-10 22:58:10 -07:00
David S. Miller	5025b99c96	Merge branch 'cake-qdisc' Toke Høiland-Jørgensen says: ==================== sched: Add Common Applications Kept Enhanced (cake) qdisc This patch series adds the CAKE qdisc, and has been split up to ease review. I have attempted to split out each configurable feature into its own patch. The first commit adds the base shaper and packet scheduler, while subsequent commits add the optional features. The full userspace API and most data structures are included in this commit, but options not understood in the base version will be ignored. The result of applying the entire series is identical to the out of tree version that have seen extensive testing in previous deployments, most notably as an out of tree patch to OpenWrt. However, note that I have only compile tested the individual patches; so the whole series should be considered as a unit. --- Changelog v19: - Rebase to current net-next. - Don't rely on the value of sch->q.qlen to break loops; fixes possible infinite loop on multi-queue devices. - Don't overwrite NAT flag when setting flow mode. v18: - Rework classification logic in the diffserv case to always hash if filter doesn't select a queue, and to run TC filters before selecting the diffserv tin (allowing filter to influence this). - Make sure we always call qdisc_watchdog_init() in cake_init(), so we don't crash in cake_destroy(). v17: - Rebase to newest net-next and move the conntrack callback to nf_ct_hook - Fix a compile error when NF_CONNTRACK is unset. v16: - Move conntrack lookup function into conntrack core and read it via RCU so it is only active when the nf_conntrack module is loaded. This avoids the module dependency on conntrack for NAT mode. Thanks to Pablo for the idea. v15: - Handle ECN flags in ACK filter v14: - Handle seqno wraps and DSACKs in ACK filter v13: - Avoid ktime_t to scalar compares - Add class dumping and basic stats - Fail with ENOTSUPP when requesting NAT mode and conntrack is not available. - Parse all TCP options in ACK filter and make sure to only drop safe ones. Also handle SACK ranges properly. v12: - Get rid of custom time typedefs. Use ktime_t for time and u64 for duration instead. v11: - Fix overhead compensation calculation for GSO packets - Change configured rate to be u64 (I ran out of bits before I ran out of CPU when testing the effects of the above) v10: - Christmas tree gardening (fix variable declarations to be in reverse line length order) v9: - Remove duplicated checks around kvfree() and just call it unconditionally. - Don't pass __GFP_NOWARN when allocating memory - Move options in cake_dump() that are related to optional features to later patches implementing the features. - Support attaching filters to the qdisc and use the classification result to select flow queue. - Support overriding diffserv priority tin from skb->priority v8: - Remove inline keyword from function definitions - Simplify ACK filter; remove the complex state handling to make the logic easier to follow. This will potentially be a bit less efficient, but I have not been able to measure a difference. v7: - Split up patch into a series to ease review. - Constify the ACK filter. v6: - Fix 6in4 encapsulation checks in ACK filter code - Checkpatch fixes v5: - Refactor ACK filter code and hopefully fix the safety issues properly this time. v4: - Only split GSO packets if shaping at speeds <= 1Gbps - Fix overhead calculation code to also work for GSO packets - Don't re-implement kvzalloc() - Remove local header include from out-of-tree build (fixes kbuild-bot complaint). - Several fixes to the ACK filter: - Check pskb_may_pull() before deref of transport headers. - Don't run ACK filter logic on split GSO packets - Fix TCP sequence number compare to deal with wraparounds v3: - Use IS_REACHABLE() macro to fix compilation when sch_cake is built-in and conntrack is a module. - Switch the stats output to use nested netlink attributes instead of a versioned struct. - Remove GPL boilerplate. - Fix array initialisation style. v2: - Fix kbuild test bot complaint - Clean up the netlink ABI - Fix checkpatch complaints - A few tweaks to the behaviour of cake based on testing carried out while writing the paper. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-10 20:06:35 -07:00
Toke Høiland-Jørgensen	0c850344d3	sch_cake: Conditionally split GSO segments At lower bandwidths, the transmission time of a single GSO segment can add an unacceptable amount of latency due to HOL blocking. Furthermore, with a software shaper, any tuning mechanism employed by the kernel to control the maximum size of GSO segments is thrown off by the artificial limit on bandwidth. For this reason, we split GSO segments into their individual packets iff the shaper is active and configured to a bandwidth <= 1 Gbps. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-10 20:06:34 -07:00
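The latency argument in the GSO-splitting commit above comes down to serialization delay. The sketch below works through the arithmetic and the stated split condition; the function names and the 64 KiB example segment size are assumptions for illustration, not values taken from sch_cake.

```python
GBPS = 1_000_000_000  # 1 Gbps in bits per second


def serialization_delay(nbytes, rate_bps):
    """Seconds needed to put nbytes on a link running at rate_bps."""
    return nbytes * 8 / rate_bps


def should_split_gso(shaper_active, rate_bps):
    """Split only if the shaper is active and configured at <= 1 Gbps."""
    return shaper_active and rate_bps <= GBPS


# A hypothetical 64 KiB GSO segment at a few shaped rates.
for rate in (10_000_000, 100_000_000, GBPS):
    delay_ms = serialization_delay(65536, rate) * 1000
    print(rate, round(delay_ms, 3), should_split_gso(True, rate))
```

At 10 Mbps a single unsplit 64 KiB segment occupies the link for roughly 52 ms, during which every packet queued behind it has to wait. That waiting is the head-of-line blocking the commit refers to.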
Toke Høiland-Jørgensen	a729b7f0bd	sch_cake: Add overhead compensation support to the rate shaper This commit adds configurable overhead compensation support to the rate shaper. With this feature, userspace can configure the actual bottleneck link overhead and encapsulation mode used, which will be used by the shaper to calculate the precise duration of each packet on the wire. This feature is needed because CAKE is often deployed one or two hops upstream of the actual bottleneck (which can be, e.g., inside a DSL or cable modem). In this case, the link layer characteristics and overhead reported by the kernel does not match the actual bottleneck. Being able to set the actual values in use makes it possible to configure the shaper rate much closer to the actual bottleneck rate (our experience shows it is possible to get with 0.1% of the actual physical bottleneck rate), thus keeping latency low without sacrificing bandwidth. The overhead compensation has three tunables: A fixed per-packet overhead size (which, if set, will be accounted from the IP packet header), a minimum packet size (MPU) and a framing mode supporting either ATM or PTM framing. We include a set of common keywords in TC to help users configure the right parameters. If no overhead value is set, the value reported by the kernel is used. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-10 20:06:34 -07:00
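The three tunables in the overhead compensation commit above (per-packet overhead, MPU, framing mode) can be combined into a single wire-length calculation. The following is a sketch under assumptions, not the sch_cake implementation: ATM framing is modelled as 48 payload bytes per 53-byte cell, PTM framing as one extra byte per 64 bytes or part thereof, the overhead is applied before the MPU, and the `wire_length` function and its parameter names are hypothetical. The fallback to the kernel-reported overhead when none is configured is not modelled.

```python
ATM_CELL_PAYLOAD = 48  # payload bytes carried per ATM cell
ATM_CELL_SIZE = 53     # bytes on the wire per ATM cell


def wire_length(ip_len, overhead=0, mpu=0, framing=None):
    """Estimated on-the-wire size of a packet at the bottleneck link.

    ip_len:   packet length counted from the IP header
    overhead: fixed per-packet overhead in bytes
    mpu:      minimum packet size in bytes
    framing:  None, "atm" or "ptm"
    """
    length = ip_len + overhead
    length = max(length, mpu)
    if framing == "atm":
        cells = -(-length // ATM_CELL_PAYLOAD)  # ceiling division
        length = cells * ATM_CELL_SIZE
    elif framing == "ptm":
        length += -(-length // 64)              # assumed 64b/65b encoding cost
    return length


print(wire_length(100))
print(wire_length(100, overhead=10, framing="atm"))
print(wire_length(20, mpu=64, framing="atm"))
print(wire_length(100, framing="ptm"))
```

The ATM case shows why compensation matters: a 100-byte packet needs three cells and so costs 159 bytes on the wire, more than half again its nominal size, which a shaper that counts only IP bytes would underestimate.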
Toke Høiland-Jørgensen	83f8fd69af	sch_cake: Add DiffServ handling This adds support for DiffServ-based priority queueing to CAKE. If the shaper is in use, each priority tier gets its own virtual clock, which limits that tier's rate to a fraction of the overall shaped rate, to discourage trying to game the priority mechanism. CAKE defaults to a simple, three-tier mode that interprets most code points as "best effort", but places CS1 traffic into a low-priority "bulk" tier which is assigned 1/16 of the total rate, and a few code points indicating latency-sensitive or control traffic (specifically TOS4, VA, EF, CS6, CS7) into a "latency sensitive" high-priority tier, which is assigned 1/4 rate. The other supported DiffServ modes are a 4-tier mode matching the 802.11e precedence rules, as well as two 8-tier modes, one of which implements strict precedence of the eight priority levels. This commit also adds an optional DiffServ 'wash' mode, which will zero out the DSCP fields of any packet passing through CAKE. While this can technically be done with other mechanisms in the kernel, having the feature available in CAKE significantly decreases configuration complexity; and the implementation cost is low on top of the other DiffServ-handling code. Filters and applications can set the skb->priority field to override the DSCP-based classification into tiers. If TC_H_MAJ(skb->priority) matches CAKE's qdisc handle, the minor number will be interpreted as a priority tier if it is less than or equal to the number of configured priority tiers. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-10 20:06:34 -07:00
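The skb->priority override and the tier rate shares described in the DiffServ commit above can be sketched as follows. The major/minor masks correspond to the TC_H_MAJ and TC_H_MIN handle split; everything else is an assumption for illustration, in particular the `select_tin` and `tier_rates` helpers, the zero-based tin index derived as minor minus one, and the exclusion of minor 0.

```python
TC_H_MAJ_MASK = 0xFFFF0000
TC_H_MIN_MASK = 0x0000FFFF


def tc_h_maj(handle):
    return handle & TC_H_MAJ_MASK


def tc_h_min(handle):
    return handle & TC_H_MIN_MASK


def select_tin(priority, qdisc_handle, n_tins, dscp_tin):
    """Pick a priority tier (tin) for a packet.

    If the major part of priority matches the qdisc handle and the minor
    number is within the configured tiers, the minor number overrides the
    DSCP-based choice; otherwise dscp_tin is used.
    """
    minor = tc_h_min(priority)
    if tc_h_maj(priority) == qdisc_handle and 0 < minor <= n_tins:
        return minor - 1  # assumed mapping from minor number to tin index
    return dscp_tin


def tier_rates(total_bps):
    """Rate limits of the default three-tier mode, as stated in the commit."""
    return {
        "bulk": total_bps // 16,              # CS1 traffic: 1/16 of the rate
        "latency_sensitive": total_bps // 4,  # e.g. EF, CS6, CS7: 1/4
    }


handle = 0x00010000  # qdisc handle "1:"
print(select_tin(0x00010002, handle, 3, dscp_tin=0))  # override applies
print(select_tin(0x00020002, handle, 3, dscp_tin=0))  # other qdisc: DSCP tin
print(tier_rates(16_000_000))
```

Capping the high-priority tier at a fraction of the shaped rate is what discourages gaming: marking all traffic as latency-sensitive yields at most a quarter of the bandwidth at elevated priority.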
Toke Høiland-Jørgensen	ea82511518	sch_cake: Add NAT awareness to packet classifier When CAKE is deployed on a gateway that also performs NAT (which is a common deployment mode), the host fairness mechanism cannot distinguish internal hosts from each other, and so fails to work correctly. To fix this, we add an optional NAT awareness mode, which will query the kernel conntrack mechanism to obtain the pre-NAT addresses for each packet and use that in the flow and host hashing. When the shaper is enabled and the host is already performing NAT, the cost of this lookup is negligible. However, in unlimited mode with no NAT being performed, there is a significant CPU cost at higher bandwidths. For this reason, the feature is turned off by default. Cc: netfilter-devel@vger.kernel.org Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-10 20:06:34 -07:00
Toke Høiland-Jørgensen	b60a60405f	netfilter: Add nf_ct_get_tuple_skb global lookup function This adds a global netfilter function to extract a conntrack tuple from an skb. The function uses a new function added to nf_ct_hook, which will try to get the tuple from skb->_nfct, and do a full lookup if that fails. This makes it possible to use the lookup function before the skb has passed through the conntrack init hooks (e.g., in an ingress qdisc). The tuple is copied to the caller to avoid issues with reference counting. The function returns false if conntrack is not loaded, allowing it to be used without incurring a module dependency on conntrack. This is used by the NAT mode in sch_cake. Cc: netfilter-devel@vger.kernel.org Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-10 20:06:34 -07:00
Toke Høiland-Jørgensen	8b7138814f	sch_cake: Add optional ACK filter The ACK filter is an optional feature of CAKE which is designed to improve performance on links with very asymmetrical rate limits. On such links (which are unfortunately quite prevalent, especially for DSL and cable subscribers), the downstream throughput can be limited by the number of ACKs capable of being transmitted in the upstream direction. Filtering ACKs can, in general, have adverse effects on TCP performance because it interferes with ACK clocking (especially in slow start), and it reduces the flow's resiliency to ACKs being dropped further along the path. To alleviate these drawbacks, the ACK filter in CAKE tries its best to always keep enough ACKs queued to ensure forward progress in the TCP flow being filtered. It does this by only filtering redundant ACKs. In its default 'conservative' mode, the filter will always keep at least two redundant ACKs in the queue, while in 'aggressive' mode, it will filter down to a single ACK. The ACK filter works by inspecting the per-flow queue on every packet enqueue. Starting at the head of the queue, the filter looks for another eligible packet to drop (so the ACK being dropped is always closer to the head of the queue than the packet being enqueued). An ACK is eligible only if it ACKs fewer bytes than the new packet being enqueued, including any SACK options. This prevents duplicate ACKs from being filtered, to avoid interfering with retransmission logic. In addition, we check TCP header options and only drop those that are known to not interfere with sender state. In particular, packets with unknown option codes are never dropped. In aggressive mode, an eligible packet is always dropped, while in conservative mode, at least two ACKs are kept in the queue. Only pure ACKs (with no data segments) are considered eligible for dropping, but when an ACK with data segments is enqueued, this can cause another pure ACK to become eligible for dropping. 
The approach described above ensures that this ACK filter avoids most of the drawbacks of a naive filtering mechanism that only keeps flow state but does not inspect the queue. This is the rationale for including the ACK filter in CAKE itself rather than as a separate module (as a TC filter, for instance). Our performance evaluation has shown that on a 30/1 Mbps link with a bidirectional traffic test (RRUL), turning on the ACK filter on the upstream link improves downstream throughput by ~20% (both modes) and upstream throughput by ~12% in conservative mode and ~40% in aggressive mode, at the cost of ~5ms of inter-flow latency due to the increased congestion. In really pathological cases, the effect can be much larger; for instance, the ACK filter increases the achievable downstream throughput on a link with 100 Kbps in the upstream direction by an order of magnitude (from ~2.5 Mbps to ~25 Mbps). Finally, even though we consider the ACK filter to be safer than most, we do not recommend turning it on everywhere: on more symmetrical link bandwidths the effect is negligible at best. Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-10 20:06:34 -07:00
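The eligibility rule from the ACK filter commit above can be approximated with a deliberately simplified sketch. It is not the sch_cake implementation: `ack_filter_pick` and `struct queued_pkt` are hypothetical, SACK blocks and TCP option checks are ignored, and "keep at least two redundant ACKs" is modelled as "drop only when at least three queued ACKs are eligible".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-flow queue entry: only the fields this sketch needs. */
struct queued_pkt {
	uint32_t ack_seq;  /* cumulative ACK number carried by the packet */
	bool pure_ack;     /* true if the packet carries no data segments */
};

/* True if ACK number a acknowledges fewer bytes than b (wrap-safe). */
static bool acks_fewer(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/*
 * Scan the flow queue from the head and return the index of the packet
 * to drop when a packet carrying new_ack_seq is enqueued, or -1 if
 * nothing should be dropped.
 */
static int ack_filter_pick(const struct queued_pkt *q, int n,
			   uint32_t new_ack_seq, bool aggressive)
{
	int first = -1;
	int eligible = 0;
	int i;

	assert(n == 0 || q != NULL);

	for (i = 0; i < n; i++) {
		/* Only pure ACKs that are strictly older than the new one qualify,
		 * so duplicate ACKs are never filtered. */
		if (!q[i].pure_ack || !acks_fewer(q[i].ack_seq, new_ack_seq))
			continue;
		if (first < 0)
			first = i;
		eligible++;
	}

	if (first < 0)
		return -1;
	if (aggressive)
		return first;  /* an eligible packet is always dropped */

	/* Conservative (assumed model): leave two redundant ACKs queued. */
	return eligible > 2 ? first : -1;
}
```

Returning the eligible packet closest to the head mirrors the commit's statement that the dropped ACK always sits nearer the head of the queue than the packet being enqueued.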
Toke Høiland-Jørgensen	7298de9cd7	sch_cake: Add ingress mode The ingress mode is meant to be enabled when CAKE runs downlink of the actual bottleneck (such as on an IFB device). The mode changes the shaper to also account dropped packets to the shaped rate, as these have already traversed the bottleneck. Enabling ingress mode will also tune the AQM to always keep at least two packets queued for each flow. This is done by scaling the minimum queue occupancy level that will disable the AQM by the number of active bulk flows. The rationale for this is that retransmits are more expensive in ingress mode, since dropped packets have to traverse the bottleneck again when they are retransmitted; thus, being more lenient and keeping a minimum number of packets queued will improve throughput in cases where the number of active flows is so large that they saturate the bottleneck even at their minimum window size. This commit also adds a separate switch to enable ingress mode rate autoscaling. If enabled, the autoscaling code will observe the actual traffic rate and adjust the shaper rate to match it. This can help avoid latency increases in the case where the actual bottleneck rate decreases below the shaped rate. The scaling filters out spikes by an EWMA filter. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-10 20:06:34 -07:00
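The EWMA filtering mentioned for rate autoscaling in the commit above can be illustrated with a generic shift-based filter. This is a sketch of the general technique, not the autoscaling code itself: the function name `ewma_update` and the choice of a power-of-two weight (1 / 2^shift) are assumptions, since the commit message only states that an EWMA filter is used.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Exponentially weighted moving average with weight 1 / 2^shift:
 * the old average keeps (1 - 1/2^shift) of its value and the new
 * sample contributes 1/2^shift. A larger shift smooths out spikes
 * more strongly but reacts more slowly to real rate changes.
 */
static uint64_t ewma_update(uint64_t avg, uint64_t sample, unsigned int shift)
{
	assert(shift < 64);

	avg -= avg >> shift;
	avg += sample >> shift;
	return avg;
}
```

With shift 3, a single sample ten times larger than the running average of 800 only moves the average to 1700, which is how a filter of this kind keeps a short spike from pulling the shaper rate far from the long-term rate.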
Toke Høiland-Jørgensen	046f6fd5da	sched: Add Common Applications Kept Enhanced (cake) qdisc
sch_cake targets the home router use case and is intended to squeeze the most bandwidth and the lowest latency out of even the slowest ISP links and routers, while presenting an API simple enough that even an ISP can configure it.
Example of use on a cable ISP uplink:
    tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter
To shape a cable download link (ifb and tc-mirred setup elided):
    tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash
CAKE is filled with:
* A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel derived Flow Queuing system, which autoconfigures based on the bandwidth.
* A novel "triple-isolate" mode (the default) which balances per-host and per-flow FQ even through NAT.
* A deficit-based shaper that can also be used in an unlimited mode.
* 8-way set-associative hashing to reduce flow collisions to a minimum.
* A reasonable interpretation of various diffserv latency/loss tradeoffs.
* Support for zeroing diffserv markings for entering and exiting traffic.
* Support for interacting well with Docsis 3.0 shaper framing.
* Extensive support for DSL framing types.
* Support for ack filtering.
* Extensive statistics for measuring loss, ECN markings, and latency variation.
A paper describing the design of CAKE is available at https://arxiv.org/abs/1804.07617, and will be published at the 2018 IEEE International Symposium on Local and Metropolitan Area Networks (LANMAN).
This patch adds the base shaper and packet scheduler, while subsequent commits add the optional (configurable) features. The full userspace API and most data structures are included in this commit, but options not understood in the base version will be ignored.
Various versions have been baking as an out-of-tree build for kernel versions going back to 3.10, as the embedded router world has been running a few years behind mainline Linux.
A stable version has been generally available on lede-17.01 and later.
sch_cake replaces a combination of iptables, tc filter, htb and fq_codel in the sqm-scripts, with sane defaults and vastly simpler configuration.
CAKE's principal author is Jonathan Morton, with contributions from Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller, Ryan Mounce, Tony Ambardar, Dean Scarff, Nils Andreas Svee, Dave Täht, and Loganaden Velvindron.
Testing from Pete Heist, Georgios Amanakis, and the many other members of the cake@lists.bufferbloat.net mailing list.
    tc -s qdisc show dev eth2
    qdisc cake 8017: root refcnt 2 bandwidth 1Gbit diffserv3 triple-isolate split-gso rtt 100.0ms noatm overhead 38 mpu 84
     Sent 51504294511 bytes 37724591 pkt (dropped 6, overlimits 64958695 requeues 12)
     backlog 0b 0p requeues 12
     memory used: 1053008b of 15140Kb
     capacity estimate: 970Mbit
     min/max network layer size:           28 /    1500
     min/max overhead-adjusted size:       84 /    1538
     average network hdr offset:           14

                        Bulk  Best Effort        Voice
      thresh       62500Kbit        1Gbit      250Mbit
      target           5.0ms        5.0ms        5.0ms
      interval       100.0ms      100.0ms      100.0ms
      pk_delay           5us          5us          6us
      av_delay           3us          2us          2us
      sp_delay           2us          1us          1us
      backlog             0b           0b           0b
      pkts           3164050     25030267      9530280
      bytes       3227519915  35396974782  12879808898
      way_inds             0            8            0
      way_miss            21          366           25
      way_cols             0            0            0
      drops                5            0            1
      marks                0            0            0
      ack_drop             0            0            0
      sp_flows             1            3            0
      bk_flows             0            1            1
      un_flows             0            0            0
      max_len          68130        68130        68130
Tested-by: Pete Heist <peteheist@gmail.com>
Tested-by: Georgios Amanakis <gamanakis@gmail.com>
Signed-off-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-10 20:06:34 -07:00
Jesus Sanchez-Palencia	52b509218f	net: Use __u32 in uapi net_tstamp.h We are not supposed to use u32 in uapi, so change the flags member of struct sock_txtime from u32 to __u32 instead. Fixes: `80b14dee2b` ("net: Add a new socket option for a future transmit time") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-09 16:31:28 -07:00

767838 commits