When a TCP sends more bytes than allowed by the receive window, all future
packets can be marked as invalid.
This can clog up the conntrack table because of 5-day default timeout.
Sequence of packets:
01 initiator > responder: [S], seq 171, win 5840, options [mss 1330,sackOK,TS val 63 ecr 0,nop,wscale 1]
02 responder > initiator: [S.], seq 33211, ack 172, win 65535, options [mss 1460,sackOK,TS val 010 ecr 63,nop,wscale 8]
03 initiator > responder: [.], ack 33212, win 2920, options [nop,nop,TS val 068 ecr 010], length 0
04 initiator > responder: [P.], seq 172:240, ack 33212, win 2920, options [nop,nop,TS val 279 ecr 010], length 68
Window is 5840 starting from 33212 -> 39052.
05 responder > initiator: [.], ack 240, win 256, options [nop,nop,TS val 872 ecr 279], length 0
06 responder > initiator: [.], seq 33212:34530, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318
This is fine, conntrack will flag the connection as having outstanding
data (UNACKED), which lowers the conntrack timeout to 300s.
07 responder > initiator: [.], seq 34530:35848, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318
08 responder > initiator: [.], seq 35848:37166, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318
09 responder > initiator: [.], seq 37166:38484, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318
10 responder > initiator: [.], seq 38484:39802, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318
Packet 10 is already sending more than permitted, but conntrack doesn't
validate this (only seq is tested vs. maxend, not 'seq+len').
38484 is acceptable, but only up to 39052, so this packet should
not have been sent (or only 568 bytes, not 1318).
At this point, connection is still in '300s' mode.
Next packet however will get flagged:
11 responder > initiator: [P.], seq 39802:40128, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 326
nf_ct_proto_6: SEQ is over the upper bound (over the window of the receiver) .. LEN=378 .. SEQ=39802 ACK=240 ACK PSH ..
Now, a couple of replies/acks comes in:
12 initiator > responder: [.], ack 34530, win 4368,
[.. irrelevant acks removed ]
16 initiator > responder: [.], ack 39802, win 8712, options [nop,nop,TS val 296201291 ecr 2982371892], length 0
This ack is significant -- this acks the last packet send by the
responder that conntrack considered valid.
This means that ack == td_end. This will withdraw the
'unacked data' flag, the connection moves back to the 5-day timeout
of established conntracks.
17 initiator > responder: ack 40128, win 10030, ...
This packet is also flagged as invalid.
Because conntrack only updates state based on packets that are
considered valid, packet 11 'did not exist' and that gets us:
nf_ct_proto_6: ACK is over upper bound 39803 (ACKed data not seen yet) .. SEQ=240 ACK=40128 WINDOW=10030 RES=0x00 ACK URG
Because this received and processed by the endpoints, the conntrack entry
remains in a bad state, no packets will ever be considered valid again:
30 responder > initiator: [F.], seq 40432, ack 2045, win 391, ..
31 initiator > responder: [.], ack 40433, win 11348, ..
32 initiator > responder: [F.], seq 2045, ack 40433, win 11348 ..
... all trigger 'ACK is over bound' test and we end up with
non-early-evictable 5-day default timeout.
NB: This patch triggers a bunch of checkpatch warnings because of silly
indent. I will resend the cleanup series linked below to reduce the
indent level once this change has propagated to net-next.
I could route the cleanup via nf but that causes extra backport work for
stable maintainers.
Link: https://lore.kernel.org/netfilter-devel/20220720175228.17880-1-fw@strlen.de/T/#mb1d7147d36294573cc4f81d00f9f8dadfdd06cd8
Signed-off-by: Florian Westphal <fw@strlen.de>
Harshit Mogalapalli says:
In ebt_do_table() function dereferencing 'private->hook_entry[hook]'
can lead to NULL pointer dereference. [..] Kernel panic:
general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f]
[..]
RIP: 0010:ebt_do_table+0x1dc/0x1ce0
Code: 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 5c 16 00 00 48 b8 00 00 00 00 00 fc ff df 49 8b 6c df 08 48 8d 7d 2c 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 88
[..]
Call Trace:
nf_hook_slow+0xb1/0x170
__br_forward+0x289/0x730
maybe_deliver+0x24b/0x380
br_flood+0xc6/0x390
br_dev_xmit+0xa2e/0x12c0
For some reason ebtables rejects blobs that provide entry points that are
not supported by the table, but what it should instead reject is the
opposite: blobs that DO NOT provide an entry point supported by the table.
t->valid_hooks is the bitmask of hooks (input, forward ...) that will see
packets. Providing an entry point that is not support is harmless
(never called/used), but the inverse isn't: it results in a crash
because the ebtables traverser doesn't expect a NULL blob for a location
its receiving packets for.
Instead of fixing all the individual checks, do what iptables is doing and
reject all blobs that differ from the expected hooks.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
The freq Qos request would be removed repeatedly if the cpufreq policy
relates to more than one CPU. Then, it would cause the "called for unknown
object" warning.
Remove the freq Qos request for each CPU relates to the cpufreq policy,
instead of removing repeatedly for the last CPU of it.
Fixes: a1bb46c36c ("ACPI: processor: Add QoS requests for all CPUs")
Reported-by: Jeremy Linton <Jeremy.Linton@arm.com>
Tested-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Riwen Lu <luriwen@kylinos.cn>
Cc: 5.4+ <stable@vger.kernel.org> # 5.4+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If ->uring_cmd returned an error value different from -EAGAIN or
-EIOCBQUEUED, it gets overridden with IOU_OK. This invites trouble
as caller (io_uring core code) handles IOU_OK differently than other
error codes.
Fix this by returning the actual error code.
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a driver returns -EOPNOTSUPP in dsa_port_bridge_join() but failed
to provide a reason for it, DSA attempts to set the extack to say that
software fallback will kick in.
The problem is, when we use brctl and the legacy bridge ioctls, the
extack will be NULL, and DSA dereferences it in the process of setting
it.
Sergei Antonov proves this using the following stack trace:
Unable to handle kernel NULL pointer dereference at virtual address 00000000
PC is at dsa_slave_changeupper+0x5c/0x158
dsa_slave_changeupper from raw_notifier_call_chain+0x38/0x6c
raw_notifier_call_chain from __netdev_upper_dev_link+0x198/0x3b4
__netdev_upper_dev_link from netdev_master_upper_dev_link+0x50/0x78
netdev_master_upper_dev_link from br_add_if+0x430/0x7f4
br_add_if from br_ioctl_stub+0x170/0x530
br_ioctl_stub from br_ioctl_call+0x54/0x7c
br_ioctl_call from dev_ifsioc+0x4e0/0x6bc
dev_ifsioc from dev_ioctl+0x2f8/0x758
dev_ioctl from sock_ioctl+0x5f0/0x674
sock_ioctl from sys_ioctl+0x518/0xe40
sys_ioctl from ret_fast_syscall+0x0/0x1c
Fix the problem by only overriding the extack if non-NULL.
Fixes: 1c6e8088d9 ("net: dsa: allow port_bridge_join() to override extack message")
Link: https://lore.kernel.org/netdev/CABikg9wx7vB5eRDAYtvAm7fprJ09Ta27a4ZazC=NX5K4wn6pWA@mail.gmail.com/
Reported-by: Sergei Antonov <saproj@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Sergei Antonov <saproj@gmail.com>
Link: https://lore.kernel.org/r/20220819173925.3581871-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Looks to have been left out in an oversight.
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: Sainath Grandhi <sainath.grandhi@intel.com>
Fixes: 235a9d89da ('ipvtap: IP-VLAN based tap driver')
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Link: https://lore.kernel.org/r/20220821130808.12143-1-zenczykowski@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The passed in index should be validated against the number of registered
files we have, it needs to be smaller than the index value to avoid going
one beyond the end.
Fixes: 78a861b949 ("io_uring: add sync cancelation API through io_uring_register()")
Reported-by: Luo Likang <luolikang@nsfocus.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make it easy for liburing to integrate uapi header with the kernel.
Previously, when this header changes, the liburing side can't directly
copy this header file due to some small differences. Sync them.
Link: https://lore.kernel.org/io-uring/f1feef16-6ea2-0653-238f-4aaee35060b6@kernel.dk
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Dylan Yudaken <dylany@fb.com>
Cc: Facebook Kernel Team <kernel-team@fb.com>
Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
File include/linux/io_uring_types.h doesn't have a maintainer, add it
to the io_uring section.
Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently as part of handling a SME access trap we flush the SVE register
state. This is not needed and would corrupt register state if the task has
access to the SVE registers already. For non-streaming mode accesses the
required flushing will be done in the SVE access trap. For streaming
mode SVE register accesses the architecture guarantees that the register
state will be flushed when streaming mode is entered or exited so there is
no need for us to do so. Simply remove the register initialisation.
Fixes: 8bd7f91c03 ("arm64/sme: Implement traps and syscall handling for SME")
Signed-off-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20220817182324.638214-5-broonie@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
Currently when taking a SME access trap we allocate storage for the SVE
register state in order to be able to handle storage of streaming mode SVE.
Due to the original usage in a purely SVE context the SVE register state
allocation this also flushes the register state for SVE if storage was
already allocated but in the SME context this is not desirable. For a SME
access trap to be taken the task must not be in streaming mode so either
there already is SVE register state present for regular SVE mode which would
be corrupted or the task does not have TIF_SVE and the flush is redundant.
Fix this by adding a flag to sve_alloc() indicating if we are in a SVE
context and need to flush the state. Freshly allocated storage is always
zeroed either way.
Fixes: 8bd7f91c03 ("arm64/sme: Implement traps and syscall handling for SME")
Signed-off-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20220817182324.638214-4-broonie@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
When handling a signal delivered to a context with streaming mode enabled
we will disable streaming mode for the signal handler, when doing so we
should also flush the saved FPSIMD register state like exiting streaming
mode in the hardware would do so that if that state is reloaded we get the
same behaviour. Without this we will reload whatever the last FPSIMD state
that was saved for the task was.
Fixes: 40a8e87bb3 ("arm64/sme: Disable ZA and streaming mode when handling signals")
Signed-off-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20220817182324.638214-3-broonie@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
The signal code has a limit of 64K on the size of a stack frame that it
will generate, if this limit is exceeded then a process will be killed if
it receives a signal. Unfortunately with the advent of SME this limit is
too small - the maximum possible size of the ZA register alone is 64K. This
is not an issue for practical systems at present but is easily seen using
virtual platforms.
Raise the limit to 256K, this is substantially more than could be used by
any current architecture extension.
Signed-off-by: Mark Brown <broonie@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20220817182324.638214-2-broonie@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
Ard noticed that since we converted CTR_EL0 to automatic generation we have
been seeing errors on some systems handling the value of cache_type_cwg()
such as
CPU features: No Cache Writeback Granule information, assuming 128
This is because the manual definition of CTR_EL0_CWG_MASK was done without
a shift while our convention is to define the mask after shifting. This
means that the user in cache_type_cwg() was broken as it was written for
the manually written shift then mask. Fix this by converting to use
SYS_FIELD_GET().
The only other field where the _MASK for this register is used is IminLine
which is at offset 0 so unaffected.
Fixes: 9a3634d023 ("arm64/sysreg: Convert CTR_EL0 to automatic generation")
Reported-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20220818213613.733091-4-broonie@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
The SYS_FIELD_ macros in sysreg.h use definitions from bitfield.h but there
is no direct inclusion of it, add one to ensure that sysreg.h is directly
usable.
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20220818213613.733091-2-broonie@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
Though acpi_find_last_cache_level() always returned signed value and the
document states it will return any errors caused by lack of a PPTT table,
it never returned negative values before.
Commit 0c80f9e165 ("ACPI: PPTT: Leave the table mapped for the runtime usage")
however changed it by returning -ENOENT if no PPTT was found. The value
returned from acpi_find_last_cache_level() is then assigned to unsigned
fw_level.
It will result in the number of cache leaves calculated incorrectly as
a huge value which will then cause the following warning from __alloc_pages
as the order would be great than MAX_ORDER because of incorrect and huge
cache leaves value.
| WARNING: CPU: 0 PID: 1 at mm/page_alloc.c:5407 __alloc_pages+0x74/0x314
| Modules linked in:
| CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.19.0-10393-g7c2a8d3ac4c0 #73
| pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : __alloc_pages+0x74/0x314
| lr : alloc_pages+0xe8/0x318
| Call trace:
| __alloc_pages+0x74/0x314
| alloc_pages+0xe8/0x318
| kmalloc_order_trace+0x68/0x1dc
| __kmalloc+0x240/0x338
| detect_cache_attributes+0xe0/0x56c
| update_siblings_masks+0x38/0x284
| store_cpu_topology+0x78/0x84
| smp_prepare_cpus+0x48/0x134
| kernel_init_freeable+0xc4/0x14c
| kernel_init+0x2c/0x1b4
| ret_from_fork+0x10/0x20
Fix the same by changing fw_level to be signed integer and return the
error from init_cache_level() early in case of error.
Reported-and-Tested-by: Bruno Goncalves <bgoncalv@redhat.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Link: https://lore.kernel.org/r/20220808084640.3165368-1-sudeep.holla@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
The AMU counter AMEVCNTR01 (constant counter) should increment at the same
rate as the system counter. On affected Cortex-A510 cores, AMEVCNTR01
increments incorrectly giving a significantly higher output value. This
results in inaccurate task scheduler utilization tracking and incorrect
feedback on CPU frequency.
Work around this problem by returning 0 when reading the affected counter
in key locations that results in disabling all users of this counter from
using it either for frequency invariance or as FFH reference counter. This
effect is the same to firmware disabling affected counters.
Details on how the two features are affected by this erratum:
- AMU counters will not be used for frequency invariance for affected
CPUs and CPUs in the same cpufreq policy. AMUs can still be used for
frequency invariance for unaffected CPUs in the system. Although
unlikely, if no alternative method can be found to support frequency
invariance for affected CPUs (cpufreq based or solution based on
platform counters) frequency invariance will be disabled. Please check
the chapter on frequency invariance at
Documentation/scheduler/sched-capacity.rst for details of its effect.
- Given that FFH can be used to fetch either the core or constant counter
values, restrictions are lifted regarding any of these counters
returning a valid (!0) value. Therefore FFH is considered supported
if there is a least one CPU that support AMUs, independent of any
counters being disabled or affected by this erratum. Clarifying
comments are now added to the cpc_ffh_supported(), cpu_read_constcnt()
and cpu_read_corecnt() functions.
The above is achieved through adding a new erratum: ARM64_ERRATUM_2457168.
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: James Morse <james.morse@arm.com>
Link: https://lore.kernel.org/r/20220819103050.24211-1-ionela.voinescu@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
On arm64, "rodata=full" has been suppored (but not documented) since
commit:
c55191e96c ("arm64: mm: apply r/o permissions of VM areas to its linear alias as well")
As it's necessary to determine the rodata configuration early during
boot, arm64 has an early_param() handler for this, whereas init/main.c
has a __setup() handler which is run later.
Unfortunately, this split meant that since commit:
f9a40b0890 ("init/main.c: return 1 from handled __setup() functions")
... passing "rodata=full" would result in a spurious warning from the
__setup() handler (though RO permissions would be configured
appropriately).
Further, "rodata=full" has been broken since commit:
0d6ea3ac94 ("lib/kstrtox.c: add "false"/"true" support to kstrtobool()")
... which caused strtobool() to parse "full" as false (in addition to
many other values not documented for the "rodata=" kernel parameter.
This patch fixes this breakage by:
* Moving the core parameter parser to an __early_param(), such that it
is available early.
* Adding an (optional) arch hook which arm64 can use to parse "full".
* Updating the documentation to mention that "full" is valid for arm64.
* Having the core parameter parser handle "on" and "off" explicitly,
such that any undocumented values (e.g. typos such as "ful") are
reported as errors rather than being silently accepted.
Note that __setup() and early_param() have opposite conventions for
their return values, where __setup() uses 1 to indicate a parameter was
handled and early_param() uses 0 to indicate a parameter was handled.
Fixes: f9a40b0890 ("init/main.c: return 1 from handled __setup() functions")
Fixes: 0d6ea3ac94 ("lib/kstrtox.c: add "false"/"true" support to kstrtobool()")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jagdish Gediya <jvgediya@linux.ibm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20220817154022.3974645-1-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Unify horizontal spacing (remove extra newlines) which
are sensitive to visual presentation by Sphinx.
Fixes: 5e64b862c4 ("arm64/sme: Basic enumeration support")
Signed-off-by: Martin Liska <mliska@suse.cz>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/84e3d6cc-75cf-d6f3-9bb8-be02075aaf6d@suse.cz
Signed-off-by: Will Deacon <will@kernel.org>
smb3 fallocate punch hole was not grabbing the inode or filemap_invalidate
locks so could have race with pagemap reinstantiating the page.
Cc: stable@vger.kernel.org
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
smb3 fallocate zero range was not grabbing the inode or filemap_invalidate
locks so could have race with pagemap reinstantiating the page.
Cc: stable@vger.kernel.org
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Although commit 88f1669019 ("scsi: sd: Rework asynchronous resume support")
eliminates a delay for some ATA disks after resume, it causes resume of ATA
disks to fail on other setups. See also:
* "Resume process hangs for 5-6 seconds starting sometime in 5.16"
(https://bugzilla.kernel.org/show_bug.cgi?id=215880).
* Geert's regression report
(https://lore.kernel.org/linux-scsi/alpine.DEB.2.22.394.2207191125130.1006766@ramsan.of.borg/).
This is what I understand about this issue:
* During resume, ata_port_pm_resume() starts the SCSI error handler. This
changes the SCSI host state into SHOST_RECOVERY and causes
scsi_queue_rq() to return BLK_STS_RESOURCE.
* sd_resume() calls sd_start_stop_device() for ATA devices. That function
in turn calls sd_submit_start() which tries to submit a START STOP UNIT
command. That command can only be submitted after the SCSI error handler
has changed the SCSI host state back to SHOST_RUNNING.
* The SCSI error handler runs on its own thread and calls
schedule_work(&(ap->scsi_rescan_task)). That causes
ata_scsi_dev_rescan() to be called from the context of a kernel
workqueue. That call hangs in blk_mq_get_tag(). I'm not sure why - maybe
because all available tags have been allocated by sd_submit_start()
calls (this is a guess).
Link: https://lore.kernel.org/r/20220816172638.538734-1-bvanassche@acm.org
Fixes: 88f1669019 ("scsi: sd: Rework asynchronous resume support")
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: gzhqyz@gmail.com
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reported-by: gzhqyz@gmail.com
Reported-and-tested-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: John Garry <john.garry@huawei.com>
Tested-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Jonathan Toppins says:
====================
bonding: 802.3ad: fix no transmission of LACPDUs
Configuring a bond in a specific order can leave the bond in a state
where it never transmits LACPDUs.
The first patch adds some kselftest infrastructure and the reproducer
that demonstrates the problem. The second patch fixes the issue. The
new third patch makes ad_ticks_per_sec a static const and removes the
passing of this variable via the stack.
====================
Link: https://lore.kernel.org/r/cover.1660919940.git.jtoppins@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The value is only ever set once in bond_3ad_initialize and only ever
read otherwise. There seems to be no reason to set the variable via
bond_3ad_initialize when setting the global variable will do. Change
ad_ticks_per_sec to a const to enforce its read-only usage.
Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is caused by the global variable ad_ticks_per_sec being zero as
demonstrated by the reproducer script discussed below. This causes
all timer values in __ad_timer_to_ticks to be zero, resulting
in the periodic timer to never fire.
To reproduce:
Run the script in
`tools/testing/selftests/drivers/net/bonding/bond-break-lacpdu-tx.sh` which
puts bonding into a state where it never transmits LACPDUs.
line 44: ip link add fbond type bond mode 4 miimon 200 \
xmit_hash_policy 1 ad_actor_sys_prio 65535 lacp_rate fast
setting bond param: ad_actor_sys_prio
given:
params.ad_actor_system = 0
call stack:
bond_option_ad_actor_sys_prio()
-> bond_3ad_update_ad_actor_settings()
-> set ad.system.sys_priority = bond->params.ad_actor_sys_prio
-> ad.system.sys_mac_addr = bond->dev->dev_addr; because
params.ad_actor_system == 0
results:
ad.system.sys_mac_addr = bond->dev->dev_addr
line 48: ip link set fbond address 52:54:00:3B:7C:A6
setting bond MAC addr
call stack:
bond->dev->dev_addr = new_mac
line 52: ip link set fbond type bond ad_actor_sys_prio 65535
setting bond param: ad_actor_sys_prio
given:
params.ad_actor_system = 0
call stack:
bond_option_ad_actor_sys_prio()
-> bond_3ad_update_ad_actor_settings()
-> set ad.system.sys_priority = bond->params.ad_actor_sys_prio
-> ad.system.sys_mac_addr = bond->dev->dev_addr; because
params.ad_actor_system == 0
results:
ad.system.sys_mac_addr = bond->dev->dev_addr
line 60: ip link set veth1-bond down master fbond
given:
params.ad_actor_system = 0
params.mode = BOND_MODE_8023AD
ad.system.sys_mac_addr == bond->dev->dev_addr
call stack:
bond_enslave
-> bond_3ad_initialize(); because first slave
-> if ad.system.sys_mac_addr != bond->dev->dev_addr
return
results:
Nothing is run in bond_3ad_initialize() because dev_addr equals
sys_mac_addr leaving the global ad_ticks_per_sec zero as it is
never initialized anywhere else.
The if check around the contents of bond_3ad_initialize() is no longer
needed due to commit 5ee14e6d33 ("bonding: 3ad: apply ad_actor settings
changes immediately") which sets ad.system.sys_mac_addr if any one of
the bonding parameters whos set function calls
bond_3ad_update_ad_actor_settings(). This is because if
ad.system.sys_mac_addr is zero it will be set to the current bond mac
address, this causes the if check to never be true.
Fixes: 5ee14e6d33 ("bonding: 3ad: apply ad_actor settings changes immediately")
Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This creates a test collection in drivers/net/bonding for bonding
specific kernel selftests.
The first test is a reproducer that provisions a bond and given the
specific order in how the ip-link(8) commands are issued the bond never
transmits an LACPDU frame on any of its slaves.
Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since priv->rx_mapping[i] is maped in moxart_mac_open(), we
should unmap it from moxart_mac_stop(). Fixes 2 warnings.
1. During error unwinding in moxart_mac_probe(): "goto init_fail;",
then moxart_mac_free_memory() calls dma_unmap_single() with
priv->rx_mapping[i] pointers zeroed.
WARNING: CPU: 0 PID: 1 at kernel/dma/debug.c:963 check_unmap+0x704/0x980
DMA-API: moxart-ethernet 92000000.mac: device driver tries to free DMA memory it has not allocated [device address=0x0000000000000000] [size=1600 bytes]
CPU: 0 PID: 1 Comm: swapper Not tainted 5.19.0+ #60
Hardware name: Generic DT based system
unwind_backtrace from show_stack+0x10/0x14
show_stack from dump_stack_lvl+0x34/0x44
dump_stack_lvl from __warn+0xbc/0x1f0
__warn from warn_slowpath_fmt+0x94/0xc8
warn_slowpath_fmt from check_unmap+0x704/0x980
check_unmap from debug_dma_unmap_page+0x8c/0x9c
debug_dma_unmap_page from moxart_mac_free_memory+0x3c/0xa8
moxart_mac_free_memory from moxart_mac_probe+0x190/0x218
moxart_mac_probe from platform_probe+0x48/0x88
platform_probe from really_probe+0xc0/0x2e4
2. After commands:
ip link set dev eth0 down
ip link set dev eth0 up
WARNING: CPU: 0 PID: 55 at kernel/dma/debug.c:570 add_dma_entry+0x204/0x2ec
DMA-API: moxart-ethernet 92000000.mac: cacheline tracking EEXIST, overlapping mappings aren't supported
CPU: 0 PID: 55 Comm: ip Not tainted 5.19.0+ #57
Hardware name: Generic DT based system
unwind_backtrace from show_stack+0x10/0x14
show_stack from dump_stack_lvl+0x34/0x44
dump_stack_lvl from __warn+0xbc/0x1f0
__warn from warn_slowpath_fmt+0x94/0xc8
warn_slowpath_fmt from add_dma_entry+0x204/0x2ec
add_dma_entry from dma_map_page_attrs+0x110/0x328
dma_map_page_attrs from moxart_mac_open+0x134/0x320
moxart_mac_open from __dev_open+0x11c/0x1ec
__dev_open from __dev_change_flags+0x194/0x22c
__dev_change_flags from dev_change_flags+0x14/0x44
dev_change_flags from devinet_ioctl+0x6d4/0x93c
devinet_ioctl from inet_ioctl+0x1ac/0x25c
v1 -> v2:
Extraneous change removed.
Fixes: 6c821bd9ed ("net: Add MOXA ART SoCs ethernet driver")
Signed-off-by: Sergei Antonov <saproj@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20220819110519.1230877-1-saproj@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For some MAC drivers, they set the mac_managed_pm to true in its
->ndo_open() callback. So before the mac_managed_pm is set to true,
we still want to leverage the mdio_bus_phy_suspend()/resume() for
the phy device suspend and resume. In this case, the phy device is
in PHY_READY, and we shouldn't warn about this. It also seems that
the check of mac_managed_pm in WARN_ON is redundant since we already
check this in the entry of mdio_bus_phy_resume(), so drop it.
Fixes: 744d23c71a ("net: phy: Warn about incorrect mdio_bus_phy_resume() state")
Signed-off-by: Xiaolei Wang <xiaolei.wang@windriver.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20220819082451.1992102-1-xiaolei.wang@windriver.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In ipa_smem_init(), a Qualcomm SMEM region is allocated (if needed)
and then its virtual address is fetched using qcom_smem_get(). The
physical address associated with that region is also fetched.
The physical address is adjusted so that it is page-aligned, and an
attempt is made to update the size of the region to compensate for
any non-zero adjustment.
But that adjustment isn't done properly. The physical address is
aligned twice, and as a result the size is never actually adjusted.
Fix this by *not* aligning the "addr" local variable, and instead
making the "phys" local variable be the adjusted "addr" value.
Fixes: a0036bb413 ("net: ipa: define SMEM memory region for IPA")
Signed-off-by: Alex Elder <elder@linaro.org>
Link: https://lore.kernel.org/r/20220818134206.567618-1-elder@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
DSA has multiple ways of specifying a MAC connection to an internal PHY.
One requires a DT description like this:
port@0 {
reg = <0>;
phy-handle = <&internal_phy>;
phy-mode = "internal";
};
(which is IMO the recommended approach, as it is the clearest
description)
but it is also possible to leave the specification as just:
port@0 {
reg = <0>;
}
and if the driver implements ds->ops->phy_read and ds->ops->phy_write,
the DSA framework "knows" it should create a ds->slave_mii_bus, and it
should connect to a non-OF-based internal PHY on this MDIO bus, at an
MDIO address equal to the port address.
There is also an intermediary way of describing things:
port@0 {
reg = <0>;
phy-handle = <&internal_phy>;
};
In case 2, DSA calls phylink_connect_phy() and in case 3, it calls
phylink_of_phy_connect(). In both cases, phylink_create() has been
called with a phy_interface_t of PHY_INTERFACE_MODE_NA, and in both
cases, PHY_INTERFACE_MODE_NA is translated into phy->interface.
It is important to note that phy_device_create() initializes
dev->interface = PHY_INTERFACE_MODE_GMII, and so, when we use
phylink_create(PHY_INTERFACE_MODE_NA), no one will override this, and we
will end up with a PHY_INTERFACE_MODE_GMII interface inherited from the
PHY.
All this means that in order to maintain compatibility with device tree
blobs where the phy-mode property is missing, we need to allow the
"gmii" phy-mode and treat it as "internal".
Fixes: 2c709e0bda ("net: dsa: microchip: ksz8795: add phylink support")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216320
Reported-by: Craig McQueen <craig@mcqueen.id.au>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Tested-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Link: https://lore.kernel.org/r/20220818143250.2797111-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Audit_alloc_mark() assign pathname to audit_mark->path, on error path
from fsnotify_add_inode_mark(), fsnotify_put_mark will free memory
of audit_mark->path, but the caller of audit_alloc_mark will free
the pathname again, so there will be double free problem.
Fix this by resetting audit_mark->path to NULL pointer on error path
from fsnotify_add_inode_mark().
Cc: stable@vger.kernel.org
Fixes: 7b12932340 ("fsnotify: Add group pointer in fsnotify_init_mark()")
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Use the list_for_each_entry_safe() macro to prevent dereferencing "obj"
after it has been freed.
Fixes: c4dfe704f5 ("net/mlx5e: kTLS, Recycle objects of device-offloaded TLS TX connections")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Unlock before returning on this error path.
Fixes: f1bc646c9a ("net/mlx5: Use devl_ API in mlx5_esw_offloads_devlink_port_register")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The cited commit reintroduced the ability to set hw-tc-offload
in switchdev mode by reusing NIC mode calls without modifying it
to support both modes, this can cause an illegal memory access
when trying to turn hw-tc-offload off.
Fix this by using the right TC_FLAG when checking if tc rules
are installed while disabling hw-tc-offload.
Fixes: d3cbd4254d ("net/mlx5e: Add ndo_set_feature for uplink representor")
Signed-off-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
There is a missing policer validation when offloading police action
with tc action api. Add it.
Fixes: 7d1a5ce46e ("net/mlx5e: TC, Support tc action api for police")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Driver caches packet merge type in mlx5e_params instance which must be
in perfect sync with the netdev_feature's bit.
Prior to this patch, in certain conditions (*) LRO state was set in
mlx5e_params, while netdev_feature's bit was off. Causing the LRO to
be applied on the RQs (HW level).
(*) This can happen only on profile init (mlx5e_build_nic_params()),
when RQ expect non-linear SKB and PCI is fast enough in comparison to
link width.
Solution: remove setting of packet merge type from
mlx5e_build_nic_params() as netdev features are not updated.
Fixes: 619a8f2a42 ("net/mlx5e: Use linear SKB in Striding RQ")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
When the driver unloads, give/reclaim_pages may fail as PF driver in
teardown flow, current code will lead to the following kernel log print
'failed reclaiming pages: err 0'.
Fix it to get same behavior as before the cited commits,
by calling mlx5_cmd_check before handling error state.
mlx5_cmd_check will verify if the returned error is an actual error
needed to be handled by the driver or not and will return an
appropriate value.
Fixes: 8d564292a1 ("net/mlx5: Remove redundant error on reclaim pages")
Fixes: 4dac2f10ad ("net/mlx5: Remove redundant notify fail on give pages")
Signed-off-by: Roy Novich <royno@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Make sure to modify the rule for uplink forwarding only for the case
where destination vport number is MLX5_VPORT_UPLINK.
Fixes: 94db331778 ("net/mlx5: Support multiport eswitch mode")
Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
When querying mlx5 non-uplink representors capabilities with ethtool
rx-vlan-offload is marked as "off [fixed]". However, it is actually always
enabled because mlx5e_params->vlan_strip_disable is 0 by default when
initializing struct mlx5e_params instance. Fix the issue by explicitly
setting the vlan_strip_disable to 'true' for non-uplink representors.
Fixes: cb67b83292 ("net/mlx5e: Introduce SRIOV VF representors")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Highlights include:
Stable fixes
- NFS: Fix another fsync() issue after a server reboot
Bugfixes
- NFS: unlink/rmdir shouldn't call d_delete() twice on ENOENT
- NFS: Fix missing unlock in nfs_unlink()
- Add sanity checking of the file type used by __nfs42_ssc_open
- Fix a case where we're failing to set task->tk_rpc_status
Cleanups
- Remove the flag NFS_CONTEXT_RESEND_WRITES that got obsoleted by the
fsync() fix
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmMDk3wACgkQZwvnipYK
APJLpw/+ONqG16L5W31/BzGJ80DlG9CERMad7Yt8+lk+ih574k/OrCotHThMyBm9
2TfY3S8zD9QoLnsPesDKeoc6AYyL3el0Wo2vKmWlGvrirvrzNt9nMc61CDMs2IHT
kN7gjO2P1LCZln8GTE87C4tI3Pg0Cwr4UUlyHHjMSKdYuJckJugj1gDvblSjn5h4
bGKGEJ9G71G1REn013sVqmQ6huvQ3iif07X5NaN7T5e+TpNFet/0AlTmrA9zsUDI
WPm+efP+ieTmihvhqOSYdV31uHN/ECx4p60ITzAlWwPYPyXr1M0r9acUGX10ENna
eX1B9nyxbUAzO6rxxPgXi3LXgvmgRDVEmbSs5IL985XR2zsVR+AF3dgAwoJXqV9y
7mAtoiwyqe3idvaK+mHU4OWCSqdhZbauJJ+Jc0ZHZHy2vzHPS2CWcpvXHjVTw63R
txOkUFL89SwnqJv03N6CZt4OyY1av97dDOEvPqHuRx4NyfT3v/QvF5W3V/UvLnt2
hTPNGIRUPZU1lpfqEgd7NXWO6LLtkWK2MciRGVnSFf2S5uKYqvlbPsVqWc6CviXc
Mu4o2RoctkIwxexSfHY0p7UQrbu3OvYgTuIzgy6cIZ2GK70L29UpJYBe5YEh9Qru
J/Pgn1ZSdGgDgwqzR8S92PTbbKq1caOqnGReFdyJDnCetb6LrrA=
=tpKO
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
"Stable fixes:
- NFS: Fix another fsync() issue after a server reboot
Bugfixes:
- NFS: unlink/rmdir shouldn't call d_delete() twice on ENOENT
- NFS: Fix missing unlock in nfs_unlink()
- Add sanity checking of the file type used by __nfs42_ssc_open
- Fix a case where we're failing to set task->tk_rpc_status
Cleanups:
- Remove the NFS_CONTEXT_RESEND_WRITES flag that got obsoleted by the
fsync() fix"
* tag 'nfs-for-5.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
SUNRPC: RPC level errors should set task->tk_rpc_status
NFSv4.2 fix problems with __nfs42_ssc_open
NFS: unlink/rmdir shouldn't call d_delete() twice on ENOENT
NFS: Cleanup to remove unused flag NFS_CONTEXT_RESEND_WRITES
NFS: Remove a bogus flag setting in pnfs_write_done_resend_to_mds
NFS: Fix another fsync() issue after a server reboot
NFS: Fix missing unlock in nfs_unlink()
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYwNkcwAKCRCRxhvAZXjc
ohHlAQC0Sb0jZxLDHeKXr6lHR+a+jOYTisM/8GkCygBhYBqlFgD+KclaIVJp9v/1
O88/iv91XfomkDSxNknv+MxoWE2i7Ao=
=hvvJ
-----END PGP SIGNATURE-----
Merge tag 'fs.idmapped.fixes.v6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull idmapping fixes from Christian Brauner:
- Since Seth joined as co-maintainer for idmapped mounts we decided to
use a shared git tree. Konstantin suggested we use vfs/idmapping.git
on kernel.org under the vfs/ namespace. So this updates the tree in
the maintainers file.
- Ensure that POSIX ACLs checking, getting, and setting works correctly
for filesystems mountable with a filesystem idmapping that want to
support idmapped mounts.
Since no filesystems mountable with an fs_idmapping do yet support
idmapped mounts there is no problem. But this could change in the
future, so add a check to refuse to create idmapped mounts when the
mounter is not privileged over the mount's idmapping.
- Check that caller is privileged over the idmapping that will be
attached to a mount.
Currently no FS_USERNS_MOUNT filesystems support idmapped mounts,
thus this is not a problem as only CAP_SYS_ADMIN in init_user_ns is
allowed to set up idmapped mounts. But this could change in the
future, so add a check to refuse to create idmapped mounts when the
mounter is not privileged over the mount's idmapping.
- Fix POSIX ACLs for ntfs3. While looking at our current POSIX ACL
handling in the context of some overlayfs work I went through a range
of other filesystems checking how they handle them currently and
encountered a few bugs in ntfs3.
I've sent this some time ago and the fixes haven't been picked up
even though the pull request for other ntfs3 fixes got sent after.
This should really be fixed as right now POSIX ACLs are broken in
certain circumstances for ntfs3.
* tag 'fs.idmapped.fixes.v6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
ntfs: fix acl handling
fs: require CAP_SYS_ADMIN in target namespace for idmapped mounts
MAINTAINERS: update idmapping tree
acl: handle idmapped mounts for idmapped filesystems