Commit graph

66840 commits

Author SHA1 Message Date
Jakub Kicinski
659f4e02f1 mac802154: use dev_addr_set()
Commit 406f42fa0d ("net-next: When a bond have a massive amount
of VLANs...") introduced an rbtree for faster Ethernet address
lookup. To maintain netdev->dev_addr in this tree we need to make
all the writes to it go through appropriate helpers.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-20 14:27:40 +01:00
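
A minimal sketch of the conversion pattern, assuming the dev_addr_set()
helper this series relies on; example_set_hw_addr() is a hypothetical
stand-in for the actual mac802154 call site:

  static void example_set_hw_addr(struct net_device *dev, const u8 *addr)
  {
          /* before: direct write, invisible to the address-tracking rbtree */
          /* memcpy(dev->dev_addr, addr, dev->addr_len); */

          /* after: the helper keeps dev_addr and its rbtree entry in sync */
          dev_addr_set(dev, addr);
  }
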
Jakub Kicinski
47ce5f1e3e batman-adv: prepare for const netdev->dev_addr
netdev->dev_addr will be constant soon; make sure the const
qualifier is propagated through batman-adv.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-20 14:22:25 +01:00
Leon Romanovsky
cb3dc8901b devlink: Remove extra device_lock assert checks
PCI core code in the pci_call_probe() has a path that doesn't hold
device_lock. It happens because the ->probe() is called through the
workqueue mechanism.

   349 static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
   350                           const struct pci_device_id *id)
   351 {
   352
....
   377         if (cpu < nr_cpu_ids)
   378                 error = work_on_cpu(cpu, local_pci_probe, &ddi);

Luckily, the core still ensures that only a single flow is executed,
so it is safe to remove the assert checks, which were only added for
annotation purposes anyway.

Fixes: b88f7b1203 ("devlink: Annotate devlink API calls")
Reported-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-19 13:16:14 +01:00
Sebastian Andrzej Siewior
e22db7bd55 net: sched: Allow statistics reads from softirq.
Eric reported that the rate estimator reads statistics from softirq
context, which in turn triggers a warning introduced in the statistics rework.

The warning is too cautious. The updates happen in the softirq context
so reads from softirq are fine since the writes can not be preempted.
The updates/writes happen during qdisc_run() which ensures one writer
and the softirq context.
The remaining bad context for reading statistics remains in hard-IRQ
because it may preempt a writer.

Fixes: 29cbcd8582 ("net: sched: Remove Qdisc::running sequence counter")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-19 13:07:35 +01:00
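
An illustrative sketch only (not the actual check in the patch) of what
the relaxed assertion amounts to, using the standard in_hardirq() helper;
example_read_stats_check() is hypothetical:

  static void example_read_stats_check(void)
  {
          /* softirq readers are fine: the writers also run in softirq
           * context and cannot be preempted by them; only a hard-IRQ
           * reader could interrupt a writer, so warn about that alone.
           */
          WARN_ON_ONCE(in_hardirq());
  }
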
Petr Machata
6b3efbfa4e net: sch_tbf: Add a graft command
As another qdisc is linked to the TBF, the latter should issue an event to
give drivers a chance to react to the grafting. In other qdiscs, this event
is called GRAFT, so follow suit with TBF as well.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-19 12:24:51 +01:00
David S. Miller
7adaf56edd Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for net-next:

1) Add new run_estimation toggle to IPVS to stop the estimation_timer
   logic, from Dust Li.

2) Relax superfluous dynset check on NFT_SET_TIMEOUT.

3) Add egress hook, from Lukas Wunner.

4) Nowadays, almost all hook functions in x_table land just call the hook
   evaluation loop. Remove remaining hook wrappers from iptables and IPVS.
   From Florian Westphal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-18 14:05:25 +01:00
Alvin Šipraga
1521d5adfc net: dsa: tag_rtl8_4: add realtek 8 byte protocol 4 tag
This commit implements a basic version of the 8 byte tag protocol used
in the Realtek RTL8365MB-VC unmanaged switch, which carries with it a
protocol version of 0x04.

The implementation itself only handles the parsing of the EtherType
value and Realtek protocol version, together with the source or
destination port fields. The rest is left unimplemented for now.

The tag format is described in a confidential document provided to my
company by Realtek Semiconductor Corp. Permission has been granted by
the vendor to publish this driver based on that material, together with
an extract from the document describing the tag format and its fields.
It is hoped that this will help future implementors who do not have
access to the material but who wish to extend the functionality of
drivers for chips which use this protocol.

In addition, two possible values of the REASON field are specified,
based on experiments on my end. Realtek does not specify what value this
field can take.

Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-18 14:02:56 +01:00
Alvin Šipraga
9cb8edda21 net: dsa: move NET_DSA_TAG_RTL4_A to right place in Kconfig/Makefile
Move things around a little so that this tag driver is alphabetically
ordered. The Kconfig file is sorted based on the tristate text.

Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-18 14:02:55 +01:00
Alvin Šipraga
487d3855b6 net: dsa: allow reporting of standard ethtool stats for slave devices
Jakub pointed out that we have a new ethtool API for reporting device
statistics in a standardized way, via .get_eth_{phy,mac,ctrl}_stats.
Add a small amount of plumbing to allow DSA drivers to take advantage of
this when exposing statistics.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-18 14:02:55 +01:00
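
A sketch of how a driver might use the new plumbing, assuming the DSA
callback mirrors the ethtool signature with (ds, port) in place of the
netdev; example_get_eth_mac_stats() and the hw_read_*() helpers are
hypothetical:

  static void example_get_eth_mac_stats(struct dsa_switch *ds, int port,
                                        struct ethtool_eth_mac_stats *s)
  {
          /* fill the standardized IEEE 802.3 MAC counters */
          s->FramesTransmittedOK      = hw_read_tx_ok(ds, port);
          s->FramesReceivedOK         = hw_read_rx_ok(ds, port);
          s->FrameCheckSequenceErrors = hw_read_fcs_err(ds, port);
  }

  static const struct dsa_switch_ops example_ops = {
          /* ... */
          .get_eth_mac_stats = example_get_eth_mac_stats,
  };
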
Ahmed S. Darwish
29cbcd8582 net: sched: Remove Qdisc::running sequence counter
The Qdisc::running sequence counter has two uses:

  1. Reliably reading qdisc's tc statistics while the qdisc is running
     (a seqcount read/retry loop at gnet_stats_add_basic()).

  2. As a flag, indicating whether the qdisc in question is running
     (without any retry loops).

For the first usage, the Qdisc::running sequence counter write section,
qdisc_run_begin() => qdisc_run_end(), covers a much wider area than what
is actually needed: the raw qdisc's bstats update. A u64_stats sync
point was thus introduced (in previous commits) inside the bstats
structure itself. A local u64_stats write section is then started and
stopped for the bstats updates.

Use that u64_stats sync point mechanism for the bstats read/retry loop
at gnet_stats_add_basic().

For the second qdisc->running usage, a __QDISC_STATE_RUNNING bit flag,
accessed with atomic bitops, is sufficient. Using a bit flag instead of
a sequence counter at qdisc_run_begin/end() and qdisc_is_running() leads
to the SMP barriers implicitly added through raw_read_seqcount() and
write_seqcount_begin/end() getting removed. All call sites have been
surveyed though, and no required ordering was identified.

Now that the qdisc->running sequence counter is no longer used, remove
it.

Note, using u64_stats implies no sequence counter protection for 64-bit
architectures. This can lead to the qdisc tc statistics "packets" vs.
"bytes" values getting out of sync on rare occasions. The individual
values will still be valid.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-18 12:54:41 +01:00
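
A sketch of the bit-flag scheme described above; which member of struct
Qdisc holds the flag word is left out here, and example_run_begin/end()
are illustrative names:

  static bool example_run_begin(unsigned long *state)
  {
          /* true only for the caller that actually sets the bit */
          return !test_and_set_bit(__QDISC_STATE_RUNNING, state);
  }

  static void example_run_end(unsigned long *state)
  {
          clear_bit(__QDISC_STATE_RUNNING, state);
  }
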
Ahmed S. Darwish
50dc9a8572 net: sched: Merge Qdisc::bstats and Qdisc::cpu_bstats data types
The only factor differentiating per-CPU bstats data type (struct
gnet_stats_basic_cpu) from the packed non-per-CPU one (struct
gnet_stats_basic_packed) was a u64_stats sync point inside the former.
The two data types are now equivalent: earlier commits added a u64_stats
sync point to the latter.

Combine both data types into "struct gnet_stats_basic_sync". This
eliminates redundancy and simplifies the bstats read/write APIs.

Use u64_stats_t for bstats "packets" and "bytes" data types. On 64-bit
architectures, u64_stats sync points do not use sequence counter
protection.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-18 12:54:41 +01:00
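
A sketch of the combined data type, assuming this exact field set
(alignment annotations omitted):

  struct gnet_stats_basic_sync {
          u64_stats_t           bytes;
          u64_stats_t           packets;
          struct u64_stats_sync syncp;  /* no seqcount on 64-bit */
  };
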
Ahmed S. Darwish
f56940daa5 net: sched: Use _bstats_update/set() instead of raw writes
The Qdisc::running sequence counter, used to protect Qdisc::bstats reads
from parallel writes, is in the process of being removed. Qdisc::bstats
read/writes will synchronize using an internal u64_stats sync point
instead.

Modify all bstats writes to use _bstats_update(). This ensures that
the internal u64_stats sync point is always acquired and released as
appropriate.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-18 12:54:41 +01:00
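
A sketch of the write helper, assuming the gnet_stats_basic_sync layout
sketched for the related commit above; example_bstats_update() stands in
for the real _bstats_update():

  static inline void example_bstats_update(struct gnet_stats_basic_sync *b,
                                           __u64 bytes, __u32 packets)
  {
          u64_stats_update_begin(&b->syncp);
          u64_stats_add(&b->bytes, bytes);
          u64_stats_add(&b->packets, packets);
          u64_stats_update_end(&b->syncp);
  }
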
Ahmed S. Darwish
67c9e6270f net: sched: Protect Qdisc::bstats with u64_stats
The not-per-CPU variant of qdisc tc (traffic control) statistics,
Qdisc::gnet_stats_basic_packed bstats, is protected with Qdisc::running
sequence counter.

This sequence counter is used for reliably protecting bstats reads from
parallel writes. Meanwhile, the seqcount's write section covers a much
wider area than bstats update: qdisc_run_begin() => qdisc_run_end().

That read/write section asymmetry can lead to needless retries of the
read section. To prepare for removing the Qdisc::running sequence
counter altogether, introduce a u64_stats sync point inside bstats
instead.

Modify _bstats_update() to start/end the bstats u64_stats write
section.

For bisectability, and finer commits granularity, the bstats read
section is still protected with a Qdisc::running read/retry loop and
qdisc_run_begin/end() still starts/ends that seqcount write section.
Once all call sites are modified to use _bstats_update(), the
Qdisc::running seqcount will be removed and bstats read/retry loop will
be modified to utilize the internal u64_stats sync point.

Note, using u64_stats implies no sequence counter protection for 64-bit
architectures. This can lead to the statistics "packets" vs. "bytes"
values getting out of sync on rare occasions. The individual values will
still be valid.

[bigeasy: Minor commit message edits, init all gnet_stats_basic_packed.]

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-18 12:54:41 +01:00
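
The reader counterpart, as a sketch under the same assumptions about the
bstats layout; example_read_bstats() is a hypothetical name:

  static void example_read_bstats(const struct gnet_stats_basic_sync *b,
                                  u64 *bytes, u64 *packets)
  {
          unsigned int start;

          do {
                  start    = u64_stats_fetch_begin(&b->syncp);
                  *bytes   = u64_stats_read(&b->bytes);
                  *packets = u64_stats_read(&b->packets);
          } while (u64_stats_fetch_retry(&b->syncp, start));
  }
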
Sebastian Andrzej Siewior
10940eb746 gen_stats: Move remaining users to gnet_stats_add_queue().
The gnet_stats_queue::qlen member is only used in the SMP case.

qdisc_qstats_qlen_backlog() needs to add qdisc_qlen() to qstats.qlen to
have the same value as that provided by qdisc_qlen_sum().

gnet_stats_copy_queue() needs to overwrite the resulting qstats.qlen
field with the qlen value submitted by the caller, since the accumulated
value may differ from it.

Let both functions use gnet_stats_add_queue() and remove unused
__gnet_stats_copy_queue().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-18 12:54:41 +01:00
Sebastian Andrzej Siewior
7361df4606 mq, mqprio: Use gnet_stats_add_queue().
gnet_stats_add_basic() and gnet_stats_add_queue() add up the statistics
so they can be used directly for both the per-CPU and global case.

gnet_stats_add_queue() copies either the Qdisc's per-CPU
gnet_stats_queue::qlen or the global member. The global
gnet_stats_queue::qlen isn't touched in the per-CPU case, so there is no
need to consider it in the global case.

In the per-CPU case, the sum of global gnet_stats_queue::qlen and
the per-CPU gnet_stats_queue::qlen was assigned to sch->q.qlen and
sch->qstats.qlen. Now both fields are copied individually.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-18 12:54:41 +01:00
Sebastian Andrzej Siewior
448e163f8b gen_stats: Add gnet_stats_add_queue().
This function will replace __gnet_stats_copy_queue(). It reads all
arguments and adds them into the passed gnet_stats_queue argument.
In contrast to __gnet_stats_copy_queue() it also copies the qlen member.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-18 12:54:41 +01:00
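
A sketch of the accumulation the helper performs for the non-per-CPU
case, using the gnet_stats_queue fields; example_add_queue() is an
illustrative name and the per-CPU handling is omitted:

  static void example_add_queue(struct gnet_stats_queue *sum,
                                const struct gnet_stats_queue *q)
  {
          sum->qlen       += q->qlen;
          sum->backlog    += q->backlog;
          sum->drops      += q->drops;
          sum->requeues   += q->requeues;
          sum->overlimits += q->overlimits;
  }
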
Sebastian Andrzej Siewior
fbf307c89e gen_stats: Add instead Set the value in __gnet_stats_copy_basic().
__gnet_stats_copy_basic() always assigns the value to the bstats
argument, overwriting the previous value. The per-CPU version that was
added later always accumulates the values in the returned
gnet_stats_basic_packed argument.

Based on review there are five users of that function as of today:
- est_fetch_counters(), ___gnet_stats_copy_basic()
  memset() bstats to zero; single invocation.

- mq_dump(), mqprio_dump(), mqprio_dump_class_stats()
  memset() bstats to zero; multiple invocations, but the per-CPU path is
  not used due to !qdisc_is_percpu_stats().

Add the values in __gnet_stats_copy_basic() instead of overwriting them.
Rename the function to gnet_stats_add_basic() to make this more obvious.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-18 12:54:40 +01:00
Lukas Wunner
ffdd33dd9c netfilter: core: Fix clang warnings about unused static inlines
Unlike gcc, clang warns about unused static inlines that are not in an
include file:

  net/netfilter/core.c:344:20: error: unused function 'nf_ingress_hook' [-Werror,-Wunused-function]
  static inline bool nf_ingress_hook(const struct nf_hook_ops *reg, int pf)
                     ^
  net/netfilter/core.c:353:20: error: unused function 'nf_egress_hook' [-Werror,-Wunused-function]
  static inline bool nf_egress_hook(const struct nf_hook_ops *reg, int pf)
                     ^

According to commit 6863f5643d ("kbuild: allow Clang to find unused
static inline functions for W=1 build"), the proper resolution is to
mark the affected functions as __maybe_unused.  An alternative approach
would be to move them to include/linux/netfilter_netdev.h, but since
Pablo didn't do that in commit ddcfa710d4 ("netfilter: add
nf_ingress_hook() helper function"), I'm guessing __maybe_unused is
preferred.

This fixes both the warning introduced by Pablo in v5.10 as well as the
one recently introduced by myself with commit 42df6e1d22 ("netfilter:
Introduce egress hook").

Fixes: ddcfa710d4 ("netfilter: add nf_ingress_hook() helper function")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-17 15:49:08 +02:00
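
The fix boils down to the following annotation pattern (sketch; the body
shown here is a placeholder, the real one stays unchanged):

  static __maybe_unused inline bool nf_ingress_hook(const struct nf_hook_ops *reg,
                                                    int pf)
  {
          /* ... original body ... */
          return false;  /* placeholder for this sketch */
  }
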
Kyungrok Chung
254ec036db net: make use of helper netif_is_bridge_master()
Make use of netdev helper functions to improve code readability.
Replace 'dev->priv_flags & IFF_EBRIDGE' with netif_is_bridge_master(dev).

Signed-off-by: Kyungrok Chung <acadx0@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16 15:02:56 +01:00
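
The conversion is mechanical; a sketch of the before/after with a
hypothetical example_is_bridge() wrapper:

  static bool example_is_bridge(const struct net_device *dev)
  {
          /* before: open-coded flag test */
          /* return !!(dev->priv_flags & IFF_EBRIDGE); */

          /* after: self-documenting helper */
          return netif_is_bridge_master(dev);
  }
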
Karsten Graul
29397e34c7 net/smc: stop links when their GID is removed
With SMC-Rv2 the GID is an IP address that can be deleted from the
device. When an IB_EVENT_GID_CHANGE event is received, iterate over
all active links and check whether their GID is still defined. If not,
stop the affected link.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16 14:58:13 +01:00
Karsten Graul
b0539f5edd net/smc: add netlink support for SMC-Rv2
Implement the netlink support for SMC-Rv2 related attributes that are
provided to user space.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16 14:58:13 +01:00
Karsten Graul
b4ba4652b3 net/smc: extend LLC layer for SMC-Rv2
Add support for large v2 LLC control messages in smc_llc.c.
The new large work request buffer allows combining control
messages into one packet that previously had to be spread over
several packets.
Add handling of the new v2 LLC messages.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16 14:58:13 +01:00
Karsten Graul
8799e310fb net/smc: add v2 support to the work request layer
In the work request layer define one large v2 buffer for each link group
that is used to transmit and receive large LLC control messages.
Add the completion queue handling for this buffer.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16 14:58:13 +01:00
Karsten Graul
24fb68111d net/smc: retrieve v2 gid from IB device
In smc_ib.c, scan for RoCE devices that support UDP encapsulation.
Find an eligible device and check that there is a route to the
remote peer.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16 14:58:13 +01:00
Karsten Graul
8ade200c26 net/smc: add v2 format of CLC decline message
The CLC decline message changed with SMC-Rv2 and supports up to
4 additional diagnosis codes.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16 14:58:13 +01:00
Karsten Graul
e49300a6bf net/smc: add listen processing for SMC-Rv2
Implement the server side of the SMC-Rv2 processing. Process incoming
CLC messages, find eligible devices and check for a valid route to the
remote peer.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16 14:58:13 +01:00
Karsten Graul
e5c4744cfb net/smc: add SMC-Rv2 connection establishment
Send a CLC proposal message, and the remote side process this type of
message and determine the target GID. Check for a valid route to this
GID, and complete the connection establishment.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16 14:58:12 +01:00
Karsten Graul
42042dbbc2 net/smc: prepare for SMC-Rv2 connection
Prepare the connection establishment with SMC-Rv2. Detect eligible
RoCE cards and indicate all supported SMC modes for the connection.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16 14:58:12 +01:00
Karsten Graul
ed990df29f net/smc: save stack space and allocate smc_init_info
The struct smc_init_info grew over time; it's time to save stack space
and allocate this struct dynamically.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16 14:58:12 +01:00
Jakub Kicinski
24bcbe1cc6 net: stream: don't purge sk_error_queue in sk_stream_kill_queues()
sk_stream_kill_queues() can be called on close when there are
still outstanding skbs to transmit. Those skbs may try to queue
notifications to the error queue (e.g. timestamps).
If sk_stream_kill_queues() purges the queue without taking
its lock the queue may get corrupted, and skbs leaked.

This shows up as a warning about an rmem leak:

WARNING: CPU: 24 PID: 0 at net/ipv4/af_inet.c:154 inet_sock_destruct+0x...

The leak is always a multiple of 0x300 bytes (the value is in
%rax on my builds, so RAX: 0000000000000300). 0x300 is truesize of
an empty sk_buff. Indeed if we dump the socket state at the time
of the warning the sk_error_queue is often (but not always)
corrupted. The ->next pointer points back at the list head,
but not the ->prev pointer. Indeed we can find the leaked skb
by scanning the kernel memory for something that looks like
an skb with ->sk = socket in question, and ->truesize = 0x300.
The contents of ->cb[] of the skb confirms the suspicion that
it is indeed a timestamp notification (as generated in
__skb_complete_tx_timestamp()).

Removing purging of sk_error_queue should be okay, since
inet_sock_destruct() does it again once all socket refs
are gone. Eric suggests this may cause sockets that go
thru disconnect() to maintain notifications from the
previous incarnations of the socket, but that should be
okay since the race was there anyway, and disconnect()
is not exactly dependable.

Thanks to Jonathan Lemon and Omar Sandoval for help at various
stages of tracing the issue.

Fixes: cb9eff0978 ("net: new user space API for time stamping of incoming and outgoing packets")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16 09:06:09 +01:00
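
For context, a sketch of the helper semantics behind the race: the
double-underscore purge walks the list without taking the queue lock,
while the locked variant dequeues under sk_error_queue.lock. The fix
simply drops the purge, since inet_sock_destruct() purges the error
queue once all references are gone; example_safe_purge() below is
purely illustrative:

  static void example_safe_purge(struct sk_buff_head *list)
  {
          /* __skb_queue_purge(list) would walk the list lockless and can
           * race with producers such as timestamp completions
           */
          skb_queue_purge(list);  /* locked dequeue + kfree_skb() per skb */
  }
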
Mat Martineau
3828c51472 mptcp: Make mptcp_pm_nl_mp_prio_send_ack() static
This function is only used within pm_netlink.c now.

Fixes: 067065422f ("mptcp: add the outgoing MP_PRIO support")
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16 08:46:08 +01:00
Paolo Abeni
72bcbc46a5 mptcp: increase default max additional subflows to 2
The current default does not allow additional subflows, mostly
as a safety restriction to avoid uncontrolled resource consumption
on busy servers.

Still, the system admin and/or the application have to opt in to
MPTCP explicitly. After that, they need to change (increase) the
default maximum number of additional subflows.

Let's set that to a reasonable default and make end-users' lives easier.

Additionally we need to update some self-tests accordingly.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16 08:46:08 +01:00
Tim Gardner
29211e7db2 mptcp: Avoid NULL dereference in mptcp_getsockopt_subflow_addrs()
Coverity complains of a possible NULL dereference in
mptcp_getsockopt_subflow_addrs():

 861       } else if (sk->sk_family == AF_INET6) {
    	3. returned_null: inet6_sk returns NULL. [show details]
    	4. var_assigned: Assigning: np = NULL return value from inet6_sk.
 862                const struct ipv6_pinfo *np = inet6_sk(sk);

Fix this by checking for NULL.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/231
Fixes: c11c5906bc ("mptcp: add MPTCP_SUBFLOW_ADDRS getsockopt support")
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
[mjm: Added WARN_ON_ONCE() to the unexpected case]
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16 08:46:08 +01:00
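
A sketch of the guard, extending the quoted fragment; WARN_ON_ONCE()
follows the note above, while the -EINVAL return value is an assumption:

  } else if (sk->sk_family == AF_INET6) {
          const struct ipv6_pinfo *np = inet6_sk(sk);

          /* inet6_sk() returns NULL for non-IPv6 sockets; bail out
           * instead of dereferencing it
           */
          if (WARN_ON_ONCE(!np))
                  return -EINVAL;
          /* ... */
  }
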
Eric Dumazet
e72aeb9ee0 fq_codel: implement L4S style ce_threshold_ect1 marking
Add TCA_FQ_CODEL_CE_THRESHOLD_ECT1 boolean option to select Low Latency,
Low Loss, Scalable Throughput (L4S) style marking, along with ce_threshold.

If enabled, only packets with ECT(1) can be transformed to CE
if their sojourn time is above the ce_threshold.

Note that this new option does not change rules for codel law.
In particular, if TCA_FQ_CODEL_ECN is left enabled (this is
the default when fq_codel qdisc is created), ECT(0) packets can
still get CE if codel law (as governed by limit/target) decides so.

Section 4.3.b of current draft [1] states:

b.  A scheduler with per-flow queues such as FQ-CoDel or FQ-PIE can
    be used for L4S.  For instance within each queue of an FQ-CoDel
    system, as well as a CoDel AQM, there is typically also ECN
    marking at an immediate (unsmoothed) shallow threshold to support
    use in data centres (see Sec.5.2.7 of [RFC8290]).  This can be
    modified so that the shallow threshold is solely applied to
    ECT(1) packets.  Then if there is a flow of non-ECN or ECT(0)
    packets in the per-flow-queue, the Classic AQM (e.g.  CoDel) is
    applied; while if there is a flow of ECT(1) packets in the queue,
    the shallower (typically sub-millisecond) threshold is applied.

Tested:

tc qd replace dev eth1 root fq_codel ce_threshold_ect1 50usec

netperf ... -t TCP_STREAM -- K dctcp

tc -s -d qd sh dev eth1
qdisc fq_codel 8022: root refcnt 32 limit 10240p flows 1024 quantum 9212 target 5ms ce_threshold_ect1 49us interval 100ms memory_limit 32Mb ecn drop_batch 64
 Sent 14388596616 bytes 9543449 pkt (dropped 0, overlimits 0 requeues 152013)
 backlog 0b 0p requeues 152013
  maxpacket 68130 drop_overlimit 0 new_flow_count 95678 ecn_mark 0 ce_mark 7639
  new_flows_len 0 old_flows_len 0

[1] L4S current draft:
https://datatracker.ietf.org/doc/html/draft-ietf-tsvwg-l4s-arch

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Ingemar Johansson S <ingemar.s.johansson@ericsson.com>
Cc: Tom Henderson <tomh@tomh.org>
Cc: Bob Briscoe <in@bobbriscoe.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-15 11:33:08 +01:00
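
A sketch of the marking rule, IPv4-only for brevity and using the
standard INET_ECN_* helpers; example_l4s_mark() and its parameters are
illustrative, not the actual fq_codel integration:

  static void example_l4s_mark(struct sk_buff *skb, u64 sojourn_ns,
                               u64 ce_threshold_ns, bool ce_threshold_ect1)
  {
          u8 ecn = ip_hdr(skb)->tos & INET_ECN_MASK;

          if (sojourn_ns <= ce_threshold_ns)
                  return;
          if (ce_threshold_ect1 && ecn != INET_ECN_ECT_1)
                  return;               /* leave Classic traffic to CoDel */
          INET_ECN_set_ce(skb);         /* immediate (unsmoothed) CE mark */
  }
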
Eric Dumazet
19757cebf0 tcp: switch orphan_count to bare per-cpu counters
Use of percpu_counter structure to track count of orphaned
sockets is causing problems on modern hosts with 256 cpus
or more.

Stefan Bach reported serious spinlock contention in real workloads,
which I was able to reproduce with a netfilter rule dropping
incoming FIN packets.

    53.56%  server  [kernel.kallsyms]      [k] queued_spin_lock_slowpath
            |
            ---queued_spin_lock_slowpath
               |
                --53.51%--_raw_spin_lock_irqsave
                          |
                           --53.51%--__percpu_counter_sum
                                     tcp_check_oom
                                     |
                                     |--39.03%--__tcp_close
                                     |          tcp_close
                                     |          inet_release
                                     |          inet6_release
                                     |          sock_close
                                     |          __fput
                                     |          ____fput
                                     |          task_work_run
                                     |          exit_to_usermode_loop
                                     |          do_syscall_64
                                     |          entry_SYSCALL_64_after_hwframe
                                     |          __GI___libc_close
                                     |
                                      --14.48%--tcp_out_of_resources
                                                tcp_write_timeout
                                                tcp_retransmit_timer
                                                tcp_write_timer_handler
                                                tcp_write_timer
                                                call_timer_fn
                                                expire_timers
                                                __run_timers
                                                run_timer_softirq
                                                __softirqentry_text_start

As explained in commit cf86a086a1 ("net/dst: use a smaller percpu_counter
batch for dst entries accounting"), default batch size is too big
for the default value of tcp_max_orphans (262144).

But even if we reduce batch sizes, there would still be cases
where the estimated count of orphans is beyond the limit,
and where tcp_too_many_orphans() has to call the expensive
percpu_counter_sum_positive().

One solution is to use plain per-cpu counters, and have
a timer to periodically refresh this cache.

Updating this cache every 100ms seems about right; TCP pressure
state does not change radically over shorter periods.

percpu_counter was nice 15 years ago while hosts had less
than 16 cpus, not anymore by current standards.

v2: Fix the build issue for CONFIG_CRYPTO_DEV_CHELSIO_TLS=m,
    reported by kernel test robot <lkp@intel.com>
    Remove unused socket argument from tcp_too_many_orphans()

Fixes: dd24c00191 ("net: Use a percpu_counter for orphan_count")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Stefan Bach <sfb@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-15 11:28:34 +01:00
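
A sketch of the bare per-cpu counter plus periodically refreshed cache
described above; all names here are illustrative, not the actual ones:

  static DEFINE_PER_CPU(int, example_orphan_count);
  static int example_orphan_cache;        /* refreshed from a timer */

  static void example_orphan_inc(void)
  {
          this_cpu_inc(example_orphan_count);
  }

  /* run roughly every 100ms */
  static void example_orphan_refresh(void)
  {
          int total = 0, cpu;

          for_each_possible_cpu(cpu)
                  total += per_cpu(example_orphan_count, cpu);
          WRITE_ONCE(example_orphan_cache, total);
  }
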
Matt Johnston
0b93aed284 mctp: Avoid leak of mctp_sk_key
mctp_key_alloc() returns a key already referenced.

The mctp_route_input() path receives a packet for a bind socket and
allocates a key. It passes the key to mctp_key_add() which takes a
refcount and adds the key to lists. mctp_route_input() should then
release its own refcount when setting the key pointer to NULL.

In the mctp_alloc_local_tag() path (for mctp_local_output()) we
similarly need to unref the key before returning (mctp_reserve_tag()
takes a refcount and adds the key to lists).

Fixes: 73c618456d ("mctp: locking, lifetime and validity changes for sk_keys")
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Reviewed-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-15 11:22:08 +01:00
Yunsheng Lin
d00e60ee54 page_pool: disable dma mapping support for 32-bit arch with 64-bit DMA
As 32-bit architectures with 64-bit DMA seem to be rare these days,
and page pool would carry a lot of code and complexity for systems
that possibly never need it, disable DMA mapping support for such
systems.

If drivers really want to work on such systems, they have to implement
their own DMA-mapping fallback tracking outside page_pool.

Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-15 10:54:20 +01:00
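
A sketch of the kind of guard this implies, assuming the existing
PP_FLAG_DMA_MAP flag; example_check_dma_support() and its exact
placement are illustrative:

  static int example_check_dma_support(unsigned int flags)
  {
          /* 32-bit arch with 64-bit DMA: dma_addr_t does not fit into the
           * space page_pool uses for tracking, so refuse DMA mapping
           */
          if ((flags & PP_FLAG_DMA_MAP) &&
              sizeof(dma_addr_t) > sizeof(unsigned long))
                  return -EINVAL;
          return 0;
  }
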
Daniel Borkmann
30fc7efa38 net, neigh: Reject creating NUD_PERMANENT with NTF_MANAGED entries
The combination of NUD_PERMANENT + NTF_MANAGED is not supported and does
not make sense either given the former indicates a static/fixed neighbor
entry whereas the latter a dynamically resolved one. While it is possible
to transition from one over to the other, we should however reject such
creation attempts.

Fixes: 7482e3841d ("net, neigh: Add NTF_MANAGED flag for managed neighbor entries")
Suggested-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-14 19:16:21 -07:00
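
A sketch of the rejection, assuming it sits where the request flags are
validated; example_validate() is a hypothetical wrapper and the extack
message wording is an assumption:

  static int example_validate(const struct ndmsg *ndm, u32 flags,
                              struct netlink_ext_ack *extack)
  {
          /* NUD_PERMANENT means static, NTF_MANAGED means periodically
           * resolved by the kernel: the combination makes no sense
           */
          if ((ndm->ndm_state & NUD_PERMANENT) && (flags & NTF_MANAGED)) {
                  NL_SET_ERR_MSG(extack, "NUD_PERMANENT entries cannot be managed");
                  return -EINVAL;
          }
          return 0;
  }
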
Daniel Borkmann
c8e80c1169 net, neigh: Use NLA_POLICY_MASK helper for NDA_FLAGS_EXT attribute
Instead of open-coding a check for invalid bits in NTF_EXT_MASK, we can just
use the NLA_POLICY_MASK() helper instead, and simplify NDA_FLAGS_EXT sanity
check this way.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-14 19:16:21 -07:00
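
The resulting policy entry looks roughly like this (sketch; the array
name is illustrative):

  static const struct nla_policy example_nda_policy[NDA_MAX + 1] = {
          /* reject any bit outside the allowed extended-flag mask */
          [NDA_FLAGS_EXT] = NLA_POLICY_MASK(NLA_U32, NTF_EXT_MASK),
  };
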
Daniel Borkmann
507c2f1d29 net, neigh: Add build-time assertion to avoid neigh->flags overflow
Currently, NDA_FLAGS_EXT flags allow a maximum of 24 bits to be used for
extended neighbor flags. These are eventually fed into neigh->flags by
shifting with NTF_EXT_SHIFT as per commit 2c611ad97a ("net, neigh:
Extend neigh->flags to 32 bit to allow for extensions").

If really ever needed in future, the full 32 bits from NDA_FLAGS_EXT can
be used, it would only require to move neigh->flags from u32 to u64 inside
the kernel.

Add a build-time assertion such that when extending the NTF_EXT_MASK with
new bits, we'll trigger an error once we surpass the 24th bit. This assumes
that no bit holes in new NTF_EXT_* flags will slip in from UAPI, but I
think this is reasonable to assume.

Suggested-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-14 19:16:21 -07:00
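
A sketch of such an assertion (the exact expression in the patch may
differ); it would live in an init path or another function body, since
BUILD_BUG_ON() cannot be used at file scope:

  /* all extended flags, once shifted by NTF_EXT_SHIFT, must still fit
   * into the 32-bit neigh->flags word
   */
  BUILD_BUG_ON(NTF_EXT_MASK & ~(U32_MAX >> NTF_EXT_SHIFT));
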
Jakub Kicinski
e15f5972b8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
tools/testing/selftests/net/ioam6.sh
  7b1700e009 ("selftests: net: modify IOAM tests for undef bits")
  bf77b1400a ("selftests: net: Test for the IOAM encapsulation with IPv6")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-14 16:50:14 -07:00
Florian Westphal
c650c35a25 netfilter: ipvs: merge ipv4 + ipv6 icmp reply handlers
Similar to earlier patches: allow ipv4 and ipv6 to use the
same handler.  ipv4 and ipv6 specific actions can be done by
checking state->pf.

v2: split the pf == NFPROTO_IPV4 check (Julian Anastasov)

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14 23:06:54 +02:00
Florian Westphal
540ff44b28 netfilter: ipvs: remove unneeded input wrappers
After the earlier patch, ip_vs_hook_in can be used directly.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14 23:06:54 +02:00
Florian Westphal
8a9941b42d netfilter: ipvs: remove unneeded output wrappers
After the earlier patch, we can use ip_vs_out_hook directly.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14 23:06:53 +02:00
Florian Westphal
9dd43a5f4b netfilter: ipvs: prepare for hook function reduction
ipvs has multiple one-line wrappers for hooks, compact them.

To avoid a large patch make the two most common helpers use the same
function signature as hooks.

Next patches can then remove the oneline wrappers.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14 23:06:53 +02:00
Florian Westphal
f0d6764f7d netfilter: ebtables: allow use of ebt_do_table as hookfn
This is possible now that the xt_table structure is passed via *priv.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14 23:06:53 +02:00
Florian Westphal
44b5990e7b netfilter: ip6tables: allow use of ip6t_do_table as hookfn
This is possible now that the xt_table structure is passed via *priv.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14 23:06:53 +02:00
Florian Westphal
e8d225b600 netfilter: arp_tables: allow use of arpt_do_table as hookfn
This is possible now that the xt_table structure is passed in via *priv.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14 23:06:53 +02:00
Florian Westphal
8844e01062 netfilter: iptables: allow use of ipt_do_table as hookfn
This is possible now that the xt_table structure is passed in via *priv.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14 23:06:53 +02:00
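
A sketch of what this series enables across ebtables/ip6tables/arptables/
iptables: the table evaluation function registered directly as the hook,
with the xt_table carried in .priv; example_ops and the chosen hook point
are illustrative, and the real ops are set up at table registration time:

  static const struct nf_hook_ops example_ops = {
          .hook     = ipt_do_table,     /* now matches nf_hookfn */
          .pf       = NFPROTO_IPV4,
          .hooknum  = NF_INET_LOCAL_IN,
          .priority = NF_IP_PRI_FILTER,
          /* .priv is pointed at the struct xt_table when registering */
  };
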
Pablo Neira Ayuso
0d7308c0ff af_packet: Introduce egress hook
Add egress hook for AF_PACKET sockets that have the PACKET_QDISC_BYPASS
socket option set to on, which allows packets to escape without being
filtered in the egress path.

This patch only updates the AF_PACKET path, it does not update
dev_direct_xmit() so the XDP infrastructure has a chance to bypass
Netfilter.

[lukas: acquire rcu_read_lock, fix typos, rebase]
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14 23:06:44 +02:00
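
For reference, the socket option whose transmit path this hook now covers
can be enabled from user space as in this minimal sketch (requires
CAP_NET_RAW; error handling abbreviated):

  #include <sys/socket.h>
  #include <arpa/inet.h>
  #include <linux/if_packet.h>
  #include <linux/if_ether.h>

  int open_bypass_socket(void)
  {
          int one = 1;
          int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

          if (fd < 0)
                  return -1;
          /* skip the qdisc layer on transmit; with this patch the
           * netfilter egress hook still sees such packets
           */
          setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one));
          return fd;
  }
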