[ Upstream commit d120d1a63b ]
Now that the 32bit UP oddity is gone and 32bit uses always a sequence
count, there is no need for the fetch_irq() variants anymore.
Convert to the regular interface.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stable-dep-of: 1dbd8d9a82 ("ipvs: use u64_stats_t for the per-cpu counters")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7d7cfb48d8 ]
icmp conntrack will set icmp redirects as RELATED, but icmpv6 will not
do this.
For icmpv6, only icmp errors (code <= 128) are examined for RELATED state.
ICMPV6 Redirects are part of neighbour discovery mechanism, those are
handled by marking a selected subset (e.g. neighbour solicitations) as
UNTRACKED, but not REDIRECT -- they will thus be flagged as INVALID.
Add minimal support for REDIRECTs. No parsing of neighbour options is
added for simplicity, so this will only check that we have the embeeded
original header (ND_OPT_REDIRECT_HDR), and then attempt to do a flow
lookup for this tuple.
Also extend the existing test case to cover redirects.
Fixes: 9fb9cbb108 ("[NETFILTER]: Add nf_conntrack subsystem.")
Reported-by: Eric Garver <eric@garver.life>
Link: https://github.com/firewalld/firewalld/issues/1046
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Garver <eric@garver.life>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9464d0b68f ]
Currently in nf_conntrack_hash_check_insert(), when it fails in
nf_ct_ext_valid_pre/post(), NF_CT_STAT_INC() will be called in the
preemptible context, a call trace can be triggered:
BUG: using __this_cpu_add() in preemptible [00000000] code: conntrack/1636
caller is nf_conntrack_hash_check_insert+0x45/0x430 [nf_conntrack]
Call Trace:
<TASK>
dump_stack_lvl+0x33/0x46
check_preemption_disabled+0xc3/0xf0
nf_conntrack_hash_check_insert+0x45/0x430 [nf_conntrack]
ctnetlink_create_conntrack+0x3cd/0x4e0 [nf_conntrack_netlink]
ctnetlink_new_conntrack+0x1c0/0x450 [nf_conntrack_netlink]
nfnetlink_rcv_msg+0x277/0x2f0 [nfnetlink]
netlink_rcv_skb+0x50/0x100
nfnetlink_rcv+0x65/0x144 [nfnetlink]
netlink_unicast+0x1ae/0x290
netlink_sendmsg+0x257/0x4f0
sock_sendmsg+0x5f/0x70
This patch is to fix it by changing to use NF_CT_STAT_INC_ATOMIC() for
nf_ct_ext_valid_pre/post() check in nf_conntrack_hash_check_insert(),
as well as nf_ct_ext_valid_post() in __nf_conntrack_confirm().
Note that nf_ct_ext_valid_pre() check in __nf_conntrack_confirm() is
safe to use NF_CT_STAT_INC(), as it's under local_bh_disable().
Fixes: c56716c69c ("netfilter: extensions: introduce extension genid count")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a81047154e ]
flow_offload_queue_work() can be called in workqueue without
bh disabled, like the call trace showed in my act_ct testing,
calling NF_FLOW_TABLE_STAT_INC() there would cause a call
trace:
BUG: using __this_cpu_add() in preemptible [00000000] code: kworker/u4:0/138560
caller is flow_offload_queue_work+0xec/0x1b0 [nf_flow_table]
Workqueue: act_ct_workqueue tcf_ct_flow_table_cleanup_work [act_ct]
Call Trace:
<TASK>
dump_stack_lvl+0x33/0x46
check_preemption_disabled+0xc3/0xf0
flow_offload_queue_work+0xec/0x1b0 [nf_flow_table]
nf_flow_table_iterate+0x138/0x170 [nf_flow_table]
nf_flow_table_free+0x140/0x1a0 [nf_flow_table]
tcf_ct_flow_table_cleanup_work+0x2f/0x2b0 [act_ct]
process_one_work+0x6a3/0x1030
worker_thread+0x8a/0xdf0
This patch fixes it by using NF_FLOW_TABLE_STAT_INC_ATOMIC()
instead in flow_offload_queue_work().
Note that for FLOW_CLS_REPLACE branch in flow_offload_queue_work(),
it may not be called in preemptible path, but it's good to use
NF_FLOW_TABLE_STAT_INC_ATOMIC() for all cases in
flow_offload_queue_work().
Fixes: b038177636 ("netfilter: nf_flow_table: count pending offload workqueue tasks")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 97d4d394b5 ]
Embarrassingly, nft_pipapo_insert() checked for interval validity in
the first field only.
The start_p and end_p pointers were reset to key data from the first
field at every iteration of the loop which was supposed to go over
the set fields.
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit bcd9e3c165 ]
nf_flow_table_block_setup and the driver TC_SETUP_FT call can modify the flow
block cb list while they are being traversed elsewhere, causing a crash.
Add a write lock around the calls to protect readers
Fixes: c29f74e0df ("netfilter: nf_flow_table: hardware offload support")
Reported-by: Chad Monroe <chad.monroe@smartrg.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6a66ce44a5 ]
The commit 510841da1f ("netfilter: ipset: enforce documented limit to
prevent allocating huge memory") was too strict and prevented to add up to
64 clashing elements to a hash:net,iface type of set. This patch fixes the
issue and now the type behaves as documented.
Fixes: 510841da1f ("netfilter: ipset: enforce documented limit to prevent allocating huge memory")
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c7aa1a76d4 ]
This patch introduced a regression: commit 48596a8ddc ("netfilter:
ipset: Fix adding an IPv4 range containing more than 2^31 addresses")
The variable e.ip is passed to adtfn() function which finally adds the
ip address to the set. The patch above refactored the for loop and moved
e.ip = htonl(ip) to the end of the for loop.
What this means is that if the value of "ip" changes between the first
assignement of e.ip and the forloop, then e.ip is pointing to a
different ip address than "ip".
Test case:
$ ipset create jdtest_tmp hash:ip family inet hashsize 2048 maxelem 100000
$ ipset add jdtest_tmp 10.0.1.1/31
ipset v6.21.1: Element cannot be added to the set: it's already added
The value of ip gets updated inside the "else if (tb[IPSET_ATTR_CIDR])"
block but e.ip is still pointing to the old value.
Fixes: 48596a8ddc ("netfilter: ipset: Fix adding an IPv4 range containing more than 2^31 addresses")
Reviewed-by: Joshua Hunt <johunt@akamai.com>
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 33c7aba0b4 ]
Elements with an end interval flag set on do not store extensions. The
global set definition is currently setting on the timeout and stateful
expression for end interval elements.
This leads to skipping end interval elements from the set->ops->walk()
path as the expired check bogusly reports true.
Moreover, do not set up stateful expressions for elements with end
interval flag set on since this is never used.
Fixes: 65038428b2 ("netfilter: nf_tables: allow to specify stateful expression in set definition")
Fixes: 8d8540c4f5 ("netfilter: nft_set_rbtree: add timeout support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 52d1aa8b82 ]
nf_conn:mark can be read from and written to in parallel. Use
READ_ONCE()/WRITE_ONCE() for reads and writes to prevent unwanted
compiler optimizations.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 710d21fdff upstream.
In preparation for FORTIFY_SOURCE doing bounds-check on memcpy(),
switch from __nlmsg_put to nlmsg_put(), and explain the bounds check
for dealing with the memcpy() across a composite flexible array struct.
Avoids this future run-time warning:
memcpy: detected field-spanning write (size 32) of single field "&errmsg->msg" at net/netlink/af_netlink.c:2447 (size 16)
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Florian Westphal <fw@strlen.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: syzbot <syzkaller@googlegroups.com>
Cc: netfilter-devel@vger.kernel.org
Cc: coreteam@netfilter.org
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220901071336.1418572-1-keescook@chromium.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 03c1f1ef15 ]
syzbot reported a warning like below [1]:
WARNING: CPU: 3 PID: 9 at net/netfilter/nf_tables_api.c:10096 nf_tables_exit_net+0x71c/0x840
Modules linked in:
CPU: 2 PID: 9 Comm: kworker/u8:0 Tainted: G W 6.1.0-rc3-00072-g8e5423e991e8 #47
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
Workqueue: netns cleanup_net
RIP: 0010:nf_tables_exit_net+0x71c/0x840
...
Call Trace:
<TASK>
? __nft_release_table+0xfc0/0xfc0
ops_exit_list+0xb5/0x180
cleanup_net+0x506/0xb10
? unregister_pernet_device+0x80/0x80
process_one_work+0xa38/0x1730
? pwq_dec_nr_in_flight+0x2b0/0x2b0
? rwlock_bug.part.0+0x90/0x90
? _raw_spin_lock_irq+0x46/0x50
worker_thread+0x67e/0x10e0
? process_one_work+0x1730/0x1730
kthread+0x2e5/0x3a0
? kthread_complete_and_exit+0x40/0x40
ret_from_fork+0x1f/0x30
</TASK>
In nf_tables_exit_net(), there is a case where nft_net->commit_list is
empty but nft_net->module_list is not empty. Such a case occurs with
the following scenario:
1. nfnetlink_rcv_batch() is called
2. nf_tables_newset() returns -EAGAIN and NFNL_BATCH_FAILURE bit is
set to status
3. nf_tables_abort() is called with NFNL_ABORT_AUTOLOAD
(nft_net->commit_list is released, but nft_net->module_list is not
because of NFNL_ABORT_AUTOLOAD flag)
4. Jump to replay label
5. netlink_skb_clone() fails and returns from the function (this is
caused by fault injection in the reproducer of syzbot)
This patch fixes this issue by calling __nf_tables_abort() when
nft_net->module_list is not empty in nf_tables_exit_net().
Fixes: eb014de4fd ("netfilter: nf_tables: autoload modules from the abort path")
Link: https://syzkaller.appspot.com/bug?id=802aba2422de4218ad0c01b46c9525cc9d4e4aa3 [1]
Reported-by: syzbot+178efee9e2d7f87f5103@syzkaller.appspotmail.com
Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 03832a32bf ]
When type is NFNL_CB_MUTEX and -EAGAIN error occur in nfnetlink_rcv_msg(),
it does not execute nfnl_unlock(). That would trigger potential dead lock.
Fixes: 50f2db9e36 ("netfilter: nfnetlink: consolidate callback types")
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 510841da1f ]
Daniel Xu reported that the hash:net,iface type of the ipset subsystem does
not limit adding the same network with different interfaces to a set, which
can lead to huge memory usage or allocation failure.
The quick reproducer is
$ ipset create ACL.IN.ALL_PERMIT hash:net,iface hashsize 1048576 timeout 0
$ for i in $(seq 0 100); do /sbin/ipset add ACL.IN.ALL_PERMIT 0.0.0.0/0,kaf_$i timeout 0 -exist; done
The backtrace when vmalloc fails:
[Tue Oct 25 00:13:08 2022] ipset: vmalloc error: size 1073741848, exceeds total pages
<...>
[Tue Oct 25 00:13:08 2022] Call Trace:
[Tue Oct 25 00:13:08 2022] <TASK>
[Tue Oct 25 00:13:08 2022] dump_stack_lvl+0x48/0x60
[Tue Oct 25 00:13:08 2022] warn_alloc+0x155/0x180
[Tue Oct 25 00:13:08 2022] __vmalloc_node_range+0x72a/0x760
[Tue Oct 25 00:13:08 2022] ? hash_netiface4_add+0x7c0/0xb20
[Tue Oct 25 00:13:08 2022] ? __kmalloc_large_node+0x4a/0x90
[Tue Oct 25 00:13:08 2022] kvmalloc_node+0xa6/0xd0
[Tue Oct 25 00:13:08 2022] ? hash_netiface4_resize+0x99/0x710
<...>
The fix is to enforce the limit documented in the ipset(8) manpage:
> The internal restriction of the hash:net,iface set type is that the same
> network prefix cannot be stored with more than 64 different interfaces
> in a single set.
Fixes: ccf0a4b7fc ("netfilter: ipset: Add bucketsize parameter to all hash types")
Reported-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5663ed63ad ]
During the initialization of ip_vs_app_net_init(), if file ip_vs_app
fails to be created, the initialization is successful by default.
Therefore, the ip_vs_app file doesn't be found during the remove in
ip_vs_app_net_cleanup(). It will cause WRNING.
The following is the stack information:
name 'ip_vs_app'
WARNING: CPU: 1 PID: 9 at fs/proc/generic.c:712 remove_proc_entry+0x389/0x460
Modules linked in:
Workqueue: netns cleanup_net
RIP: 0010:remove_proc_entry+0x389/0x460
Call Trace:
<TASK>
ops_exit_list+0x125/0x170
cleanup_net+0x4ea/0xb00
process_one_work+0x9bf/0x1710
worker_thread+0x665/0x1080
kthread+0x2e4/0x3a0
ret_from_fork+0x1f/0x30
</TASK>
Fixes: 457c4cbc5a ("[NET]: Make /proc/net per network namespace")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3d00c6a0da ]
During the initialization of ip_vs_conn_net_init(), if file ip_vs_conn
or ip_vs_conn_sync fails to be created, the initialization is successful
by default. Therefore, the ip_vs_conn or ip_vs_conn_sync file doesn't
be found during the remove.
The following is the stack information:
name 'ip_vs_conn_sync'
WARNING: CPU: 3 PID: 9 at fs/proc/generic.c:712
remove_proc_entry+0x389/0x460
Modules linked in:
Workqueue: netns cleanup_net
RIP: 0010:remove_proc_entry+0x389/0x460
Call Trace:
<TASK>
__ip_vs_cleanup_batch+0x7d/0x120
ops_exit_list+0x125/0x170
cleanup_net+0x4ea/0xb00
process_one_work+0x9bf/0x1710
worker_thread+0x665/0x1080
kthread+0x2e4/0x3a0
ret_from_fork+0x1f/0x30
</TASK>
Fixes: 61b1ab4583 ("IPVS: netns, add basic init per netns.")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5c26159c97 ]
The `char` type with no explicit sign is sometimes signed and sometimes
unsigned. This code will break on platforms such as arm, where char is
unsigned. So mark it here as explicitly signed, so that the
todrop_counter decrement and subsequent comparison is correct.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 26b5934ff4 ]
No need to postpone this to the commit release path, since no packets
are walking over this object, this is accessed from control plane only.
This helped uncovered UAF triggered by races with the netlink notifier.
Fixes: 9dd732e0bd ("netfilter: nf_tables: memleak flow rule from commit path")
Reported-by: syzbot+8f747f62763bc6c32916@syzkaller.appspotmail.com
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d4bc8271db ]
commit release path is invoked via call_rcu and it runs lockless to
release the objects after rcu grace period. The netlink notifier handler
might win race to remove objects that the transaction context is still
referencing from the commit release path.
Call rcu_barrier() to ensure pending rcu callbacks run to completion
if the list of transactions to be destroyed is not empty.
Fixes: 6001a930ce ("netfilter: nftables: introduce table ownership")
Reported-by: syzbot+8f747f62763bc6c32916@syzkaller.appspotmail.com
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 96df8360db ]
Otherwise EINVAL is bogusly reported to userspace when deleting a set
element. NFTA_SET_ELEM_KEY_END does not need to be set in case of:
- insertion: if not present, start key is used as end key.
- deletion: only start key needs to be specified, end key is ignored.
Hence, relax the sanity check.
Fixes: 88cccd908d ("netfilter: nf_tables: NFTA_SET_ELEM_KEY_END requires concat and interval flags")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2aa1927570 ]
The previous commit changed the way the rescheduling delay is computed
which has a side effect: the bias is now represented as much as the
other entries in the rescheduling delay which makes the logic to kick in
only with very large sets, as the initial interval is very large
(INT_MAX).
Revisit the GC initial bias to allow more frequent GC for smaller sets
while still avoiding wakeups when a machine is mostly idle. We're moving
from a large initial value to pretending we have 100 entries expiring at
the upper bound. This way only a few entries having a small timeout
won't impact much the rescheduling delay and non-idle machines will have
enough entries to lower the delay when needed. This also improves
readability as the initial bias is now linked to what is computed
instead of being an arbitrary large value.
Fixes: 2cfadb761d ("netfilter: conntrack: revisit gc autotuning")
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 95eabdd207 ]
Commit 2cfadb761d ("netfilter: conntrack: revisit gc autotuning")
changed the eviction rescheduling to the use average expiry of scanned
entries (within 1-60s) by doing:
for (...) {
expires = clamp(nf_ct_expires(tmp), ...);
next_run += expires;
next_run /= 2;
}
The issue is the above will make the average ('next_run' here) more
dependent on the last expiration values than the firsts (for sets > 2).
Depending on the expiration values used to compute the average, the
result can be quite different than what's expected. To fix this we can
do the following:
for (...) {
expires = clamp(nf_ct_expires(tmp), ...);
next_run += (expires - next_run) / ++count;
}
Fixes: 2cfadb761d ("netfilter: conntrack: revisit gc autotuning")
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
We can't use ct->lock, this is already used by the seqadj internals.
When using ftp helper + nat, seqadj will attempt to acquire ct->lock
again.
Revert back to a global lock for now.
Fixes: c783a29c7e ("netfilter: nf_ct_ftp: prefer skb_linearize")
Reported-by: Bruno de Paula Larini <bruno.larini@riosoft.com.br>
Signed-off-by: Florian Westphal <fw@strlen.de>
It seems to me that percpu memory for chain stats started leaking since
commit 3bc158f8d0 ("netfilter: nf_tables: map basechain priority to
hardware priority") when nft_chain_offload_priority() returned an error.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Fixes: 3bc158f8d0 ("netfilter: nf_tables: map basechain priority to hardware priority")
Signed-off-by: Florian Westphal <fw@strlen.de>
syzbot is reporting underflow of nft_counters_enabled counter at
nf_tables_addchain() [1], for commit 43eb8949cf ("netfilter:
nf_tables: do not leave chain stats enabled on error") missed that
nf_tables_chain_destroy() after nft_basechain_init() in the error path of
nf_tables_addchain() decrements the counter because nft_basechain_init()
makes nft_is_base_chain() return true by setting NFT_CHAIN_BASE flag.
Increment the counter immediately after returning from
nft_basechain_init().
Link: https://syzkaller.appspot.com/bug?extid=b5d82a651b71cd8a75ab [1]
Reported-by: syzbot <syzbot+b5d82a651b71cd8a75ab@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: syzbot <syzbot+b5d82a651b71cd8a75ab@syzkaller.appspotmail.com>
Fixes: 43eb8949cf ("netfilter: nf_tables: do not leave chain stats enabled on error")
Signed-off-by: Florian Westphal <fw@strlen.de>
nf_osf_find() incorrectly returns true on mismatch, this leads to
copying uninitialized memory area in nft_osf which can be used to leak
stale kernel stack data to userspace.
Fixes: 22c7652cda ("netfilter: nft_osf: Add version option support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
CTCP messages should only be at the start of an IRC message, not
anywhere within it.
While the helper only decodes packes in the ORIGINAL direction, its
possible to make a client send a CTCP message back by empedding one into
a PING request. As-is, thats enough to make the helper believe that it
saw a CTCP message.
Fixes: 869f37d8e4 ("[NETFILTER]: nf_conntrack/nf_nat: add IRC helper port")
Signed-off-by: David Leadbeater <dgl@dgl.cx>
Signed-off-by: Florian Westphal <fw@strlen.de>
ct_sip_next_header and ct_sip_get_header return an absolute
value of matchoff, not a shift from current dataoff.
So dataoff should be assigned matchoff, not incremented by it.
This issue can be seen in the scenario when there are multiple
Contact headers and the first one is using a hostname and other headers
use IP addresses. In this case, ct_sip_walk_headers will work as follows:
The first ct_sip_get_header call to will find the first Contact header
but will return -1 as the header uses a hostname. But matchoff will
be changed to the offset of this header. After that, dataoff should be
set to matchoff, so that the next ct_sip_get_header call find the next
Contact header. But instead of assigning dataoff to matchoff, it is
incremented by it, which is not correct, as matchoff is an absolute
value of the offset. So on the next call to the ct_sip_get_header,
dataoff will be incorrect, and the next Contact header may not be
found at all.
Fixes: 05e3ced297 ("[NETFILTER]: nf_conntrack_sip: introduce SIP-URI parsing helper")
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Ensure the match happens in the right direction, previously the
destination used was the server, not the NAT host, as the comment
shows the code intended.
Additionally nf_nat_irc uses port 0 as a signal and there's no valid way
it can appear in a DCC message, so consider port 0 also forged.
Fixes: 869f37d8e4 ("[NETFILTER]: nf_conntrack/nf_nat: add IRC helper port")
Signed-off-by: David Leadbeater <dgl@dgl.cx>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
__nf_ct_try_assign_helper() remains in place but it now requires a
template to configure the helper.
A toggle to disable automatic helper assignment was added by:
a900689264 ("netfilter: nf_ct_helper: allow to disable automatic helper assignment")
in 2012 to address the issues described in "Secure use of iptables and
connection tracking helpers". Automatic conntrack helper assignment was
disabled by:
3bb398d925 ("netfilter: nf_ct_helper: disable automatic helper assignment")
back in 2016.
This patch removes the sysctl and modparam toggles, users now have to
rely on explicit conntrack helper configuration via ruleset.
Update tools/testing/selftests/netfilter/nft_conntrack_helper.sh to
check that auto-assignment does not happen anymore.
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Fix crash with malformed ebtables blob which do not provide all
entry points, from Florian Westphal.
2) Fix possible TCP connection clogging up with default 5-days
timeout in conntrack, from Florian.
3) Fix crash in nf_tables tproxy with unsupported chains, also from Florian.
4) Do not allow to update implicit chains.
5) Make table handle allocation per-netns to fix data race.
6) Do not truncated payload length and offset, and checksum offset.
Instead report EINVAl.
7) Enable chain stats update via static key iff no error occurs.
8) Restrict osf expression to ip, ip6 and inet families.
9) Restrict tunnel expression to netdev family.
10) Fix crash when trying to bind again an already bound chain.
11) Flowtable garbage collector might leave behind pending work to
delete entries. This patch comes with a previous preparation patch
as dependency.
12) Allow net.netfilter.nf_conntrack_frag6_high_thresh to be lowered,
from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_defrag_ipv6: allow nf_conntrack_frag6_high_thresh increases
netfilter: flowtable: fix stuck flows on cleanup due to pending work
netfilter: flowtable: add function to invoke garbage collection immediately
netfilter: nf_tables: disallow binding to already bound chain
netfilter: nft_tunnel: restrict it to netdev family
netfilter: nft_osf: restrict osf to ipv4, ipv6 and inet families
netfilter: nf_tables: do not leave chain stats enabled on error
netfilter: nft_payload: do not truncate csum_offset and csum_type
netfilter: nft_payload: report ERANGE for too long offset and length
netfilter: nf_tables: make table handle allocation per-netns friendly
netfilter: nf_tables: disallow updates of implicit chain
netfilter: nft_tproxy: restrict to prerouting hook
netfilter: conntrack: work around exceeded receive window
netfilter: ebtables: reject blobs that don't provide all entry points
====================
Link: https://lore.kernel.org/r/20220824220330.64283-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
While reading sysctl_[rw]mem_(max|default), they can be changed
concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To clear the flow table on flow table free, the following sequence
normally happens in order:
1) gc_step work is stopped to disable any further stats/del requests.
2) All flow table entries are set to teardown state.
3) Run gc_step which will queue HW del work for each flow table entry.
4) Waiting for the above del work to finish (flush).
5) Run gc_step again, deleting all entries from the flow table.
6) Flow table is freed.
But if a flow table entry already has pending HW stats or HW add work
step 3 will not queue HW del work (it will be skipped), step 4 will wait
for the pending add/stats to finish, and step 5 will queue HW del work
which might execute after freeing of the flow table.
To fix the above, this patch flushes the pending work, then it sets the
teardown flag to all flows in the flowtable and it forces a garbage
collector run to queue work to remove the flows from hardware, then it
flushes this new pending work and (finally) it forces another garbage
collector run to remove the entry from the software flowtable.
Stack trace:
[47773.882335] BUG: KASAN: use-after-free in down_read+0x99/0x460
[47773.883634] Write of size 8 at addr ffff888103b45aa8 by task kworker/u20:6/543704
[47773.885634] CPU: 3 PID: 543704 Comm: kworker/u20:6 Not tainted 5.12.0-rc7+ #2
[47773.886745] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009)
[47773.888438] Workqueue: nf_ft_offload_del flow_offload_work_handler [nf_flow_table]
[47773.889727] Call Trace:
[47773.890214] dump_stack+0xbb/0x107
[47773.890818] print_address_description.constprop.0+0x18/0x140
[47773.892990] kasan_report.cold+0x7c/0xd8
[47773.894459] kasan_check_range+0x145/0x1a0
[47773.895174] down_read+0x99/0x460
[47773.899706] nf_flow_offload_tuple+0x24f/0x3c0 [nf_flow_table]
[47773.907137] flow_offload_work_handler+0x72d/0xbe0 [nf_flow_table]
[47773.913372] process_one_work+0x8ac/0x14e0
[47773.921325]
[47773.921325] Allocated by task 592159:
[47773.922031] kasan_save_stack+0x1b/0x40
[47773.922730] __kasan_kmalloc+0x7a/0x90
[47773.923411] tcf_ct_flow_table_get+0x3cb/0x1230 [act_ct]
[47773.924363] tcf_ct_init+0x71c/0x1156 [act_ct]
[47773.925207] tcf_action_init_1+0x45b/0x700
[47773.925987] tcf_action_init+0x453/0x6b0
[47773.926692] tcf_exts_validate+0x3d0/0x600
[47773.927419] fl_change+0x757/0x4a51 [cls_flower]
[47773.928227] tc_new_tfilter+0x89a/0x2070
[47773.936652]
[47773.936652] Freed by task 543704:
[47773.937303] kasan_save_stack+0x1b/0x40
[47773.938039] kasan_set_track+0x1c/0x30
[47773.938731] kasan_set_free_info+0x20/0x30
[47773.939467] __kasan_slab_free+0xe7/0x120
[47773.940194] slab_free_freelist_hook+0x86/0x190
[47773.941038] kfree+0xce/0x3a0
[47773.941644] tcf_ct_flow_table_cleanup_work
Original patch description and stack trace by Paul Blakey.
Fixes: c29f74e0df ("netfilter: nf_flow_table: hardware offload support")
Reported-by: Paul Blakey <paulb@nvidia.com>
Tested-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Only allow to use this expression from NFPROTO_NETDEV family.
Fixes: af308b94a2 ("netfilter: nf_tables: add tunnel support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
As it was originally intended, restrict extension to supported families.
Fixes: b96af92d6e ("netfilter: nf_tables: implement Passive OS fingerprint module in nft_osf")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Error might occur later in the nf_tables_addchain() codepath, enable
static key only after transaction has been created.
Fixes: 9f08ea8481 ("netfilter: nf_tables: keep chain counters away from hot path")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Instead report ERANGE if csum_offset is too long, and EOPNOTSUPP if type
is not support.
Fixes: 7ec3f7b47b ("netfilter: nft_payload: add packet mangling support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Instead of offset and length are truncation to u8, report ERANGE.
Fixes: 96518518cc ("netfilter: add nftables")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
TPROXY is only allowed from prerouting, but nft_tproxy doesn't check this.
This fixes a crash (null dereference) when using tproxy from e.g. output.
Fixes: 4ed8eb6570 ("netfilter: nf_tables: Add native tproxy support")
Reported-by: Shell Chen <xierch@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
When a TCP sends more bytes than allowed by the receive window, all future
packets can be marked as invalid.
This can clog up the conntrack table because of 5-day default timeout.
Sequence of packets:
01 initiator > responder: [S], seq 171, win 5840, options [mss 1330,sackOK,TS val 63 ecr 0,nop,wscale 1]
02 responder > initiator: [S.], seq 33211, ack 172, win 65535, options [mss 1460,sackOK,TS val 010 ecr 63,nop,wscale 8]
03 initiator > responder: [.], ack 33212, win 2920, options [nop,nop,TS val 068 ecr 010], length 0
04 initiator > responder: [P.], seq 172:240, ack 33212, win 2920, options [nop,nop,TS val 279 ecr 010], length 68
Window is 5840 starting from 33212 -> 39052.
05 responder > initiator: [.], ack 240, win 256, options [nop,nop,TS val 872 ecr 279], length 0
06 responder > initiator: [.], seq 33212:34530, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318
This is fine, conntrack will flag the connection as having outstanding
data (UNACKED), which lowers the conntrack timeout to 300s.
07 responder > initiator: [.], seq 34530:35848, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318
08 responder > initiator: [.], seq 35848:37166, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318
09 responder > initiator: [.], seq 37166:38484, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318
10 responder > initiator: [.], seq 38484:39802, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318
Packet 10 is already sending more than permitted, but conntrack doesn't
validate this (only seq is tested vs. maxend, not 'seq+len').
38484 is acceptable, but only up to 39052, so this packet should
not have been sent (or only 568 bytes, not 1318).
At this point, connection is still in '300s' mode.
Next packet however will get flagged:
11 responder > initiator: [P.], seq 39802:40128, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 326
nf_ct_proto_6: SEQ is over the upper bound (over the window of the receiver) .. LEN=378 .. SEQ=39802 ACK=240 ACK PSH ..
Now, a couple of replies/acks comes in:
12 initiator > responder: [.], ack 34530, win 4368,
[.. irrelevant acks removed ]
16 initiator > responder: [.], ack 39802, win 8712, options [nop,nop,TS val 296201291 ecr 2982371892], length 0
This ack is significant -- this acks the last packet send by the
responder that conntrack considered valid.
This means that ack == td_end. This will withdraw the
'unacked data' flag, the connection moves back to the 5-day timeout
of established conntracks.
17 initiator > responder: ack 40128, win 10030, ...
This packet is also flagged as invalid.
Because conntrack only updates state based on packets that are
considered valid, packet 11 'did not exist' and that gets us:
nf_ct_proto_6: ACK is over upper bound 39803 (ACKed data not seen yet) .. SEQ=240 ACK=40128 WINDOW=10030 RES=0x00 ACK URG
Because this received and processed by the endpoints, the conntrack entry
remains in a bad state, no packets will ever be considered valid again:
30 responder > initiator: [F.], seq 40432, ack 2045, win 391, ..
31 initiator > responder: [.], ack 40433, win 11348, ..
32 initiator > responder: [F.], seq 2045, ack 40433, win 11348 ..
... all trigger 'ACK is over bound' test and we end up with
non-early-evictable 5-day default timeout.
NB: This patch triggers a bunch of checkpatch warnings because of silly
indent. I will resend the cleanup series linked below to reduce the
indent level once this change has propagated to net-next.
I could route the cleanup via nf but that causes extra backport work for
stable maintainers.
Link: https://lore.kernel.org/netfilter-devel/20220720175228.17880-1-fw@strlen.de/T/#mb1d7147d36294573cc4f81d00f9f8dadfdd06cd8
Signed-off-by: Florian Westphal <fw@strlen.de>
NF_CONNTRACK_PROCFS was marked obsolete in commit 54b07dca68
("netfilter: provide config option to disable ancient procfs parts") in
v3.3.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Since f3a2181e16 ("netfilter: nf_tables: Support for sets with
multiple ranged fields"), it possible to combine intervals and
concatenations. Later on, ef516e8625 ("netfilter: nf_tables:
reintroduce the NFT_SET_CONCAT flag") provides the NFT_SET_CONCAT flag
for userspace to report that the set stores a concatenation.
Make sure NFT_SET_CONCAT is set on if field_count is specified for
consistency. Otherwise, if NFT_SET_CONCAT is specified with no
field_count, bail out with EINVAL.
Fixes: ef516e8625 ("netfilter: nf_tables: reintroduce the NFT_SET_CONCAT flag")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
These flags are mutually exclusive, report EINVAL in this case.
Fixes: aaa31047a6 ("netfilter: nftables: add catch-all set element support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>