using same source and destination ip/port for flow hash calculation
within the two directions.
Signed-off-by: zhang kai <zhangkaiheb@126.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The `input_mt_get_slot_by_key` function may return a negative value
if an error occurs (e.g. running out of slots). If this occurs we
should really avoid reporting any data for the slot.
Signed-off-by: Ping Cheng <ping.cheng@wacom.com>
Signed-off-by: Jason Gerecke <jason.gerecke@wacom.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Commit 670e90924b ("HID: wacom: support named keys on older devices")
added support for sending named events from the soft buttons on the
24HDT and 27QHDT. In the process, however, it inadvertantly disabled the
touchscreen of the 24HDT and 27QHDT by default. The
`wacom_set_shared_values` function would normally enable touch by default
but because it checks the state of the non-shared `has_mute_touch_switch`
flag and `wacom_setup_touch_input_capabilities` sets the state of the
/shared/ version, touch ends up being disabled by default.
This patch sets the non-shared flag, letting `wacom_set_shared_values`
take care of copying the value over to the shared version and setting
the default touch state to "on".
Fixes: 670e90924b ("HID: wacom: support named keys on older devices")
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Jason Gerecke <jason.gerecke@wacom.com>
Reviewed-by: Ping Cheng <ping.cheng@wacom.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
There is a spelling mistake in the Kconfig text. Fix it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
The Keychron K1 wireless keyboard has a set of Apple-like function keys
and an Fn key that works like on an Apple bluetooth keyboard. It
identifies as an Apple Alu RevB ANSI keyboard (05ac:024f) over USB and
BT. Use hid-apple for it so the Fn key and function keys work correctly.
Signed-off-by: Haochen Tong <i@hexchain.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
There is a missing space in "relyingon".
Add it.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Leon Romanovsky says:
====================
Remove duplicated devlink registration check
Changelog:
v1:
* Added two new patches that remove registration field from mlx5 and ti drivers.
v0: https://lore.kernel.org/lkml/ed7bbb1e4c51dd58e6035a058e93d16f883b09ce.1627215829.git.leonro@nvidia.com
--------------------------------------------------------------------
Both registered flag and devlink pointer are set at the same time
and indicate the same thing - devlink/devlink_port are ready. Instead
of checking ->registered use devlink pointer as an indication.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Both registered flag and devlink pointer are set at the same time
and indicate the same thing - devlink/devlink_port are ready. Instead
of checking ->registered use devlink pointer as an indication.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Devlink is an integral part of mlx5 driver and all flows ensure that
devlink_*_register() will success. That makes the ->registered check
an obsolete.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit that introduced devlink support released devlink resources in
wrong order, that made an unwind flow to be asymmetrical. In addition,
the am65-cpsw-nuss used internal to devlink core field - registered.
In order to fix the unwind flow and remove such access to the
registered field, rewrite the code to call devlink_port_unregister only
on registered ports.
Fixes: 58356eb31d ("net: ti: am65-cpsw-nuss: Add devlink support")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a use after free memory corruption during module exit:
- nfcsim_exit()
- nfcsim_device_free(dev0)
- nfc_digital_unregister_device()
This iterates over command queue and frees all commands,
- dev->up = false
- nfcsim_link_shutdown()
- nfcsim_link_recv_wake()
This wakes the sleeping thread nfcsim_link_recv_skb().
- nfcsim_link_recv_skb()
Wake from wait_event_interruptible_timeout(),
call directly the deb->cb callback even though (dev->up == false),
- digital_send_cmd_complete()
Dereference of "struct digital_cmd" cmd which was freed earlier by
nfc_digital_unregister_device().
This causes memory corruption shortly after (with unrelated stack
trace):
nfc nfc0: NFC: nfcsim_recv_wq: Device is down
llcp: nfc_llcp_recv: err -19
nfc nfc1: NFC: nfcsim_recv_wq: Device is down
BUG: unable to handle page fault for address: ffffffffffffffed
Call Trace:
fsnotify+0x54b/0x5c0
__fsnotify_parent+0x1fe/0x300
? vfs_write+0x27c/0x390
vfs_write+0x27c/0x390
ksys_write+0x63/0xe0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
KASAN report:
BUG: KASAN: use-after-free in digital_send_cmd_complete+0x16/0x50
Write of size 8 at addr ffff88800a05f720 by task kworker/0:2/71
Workqueue: events nfcsim_recv_wq [nfcsim]
Call Trace:
dump_stack_lvl+0x45/0x59
print_address_description.constprop.0+0x21/0x140
? digital_send_cmd_complete+0x16/0x50
? digital_send_cmd_complete+0x16/0x50
kasan_report.cold+0x7f/0x11b
? digital_send_cmd_complete+0x16/0x50
? digital_dep_link_down+0x60/0x60
digital_send_cmd_complete+0x16/0x50
nfcsim_recv_wq+0x38f/0x3d5 [nfcsim]
? nfcsim_in_send_cmd+0x4a/0x4a [nfcsim]
? lock_is_held_type+0x98/0x110
? finish_wait+0x110/0x110
? rcu_read_lock_sched_held+0x9c/0xd0
? rcu_read_lock_bh_held+0xb0/0xb0
? lockdep_hardirqs_on_prepare+0x12e/0x1f0
This flow of calling digital_send_cmd_complete() callback on driver exit
is specific to nfcsim which implements reading and sending work queues.
Since the NFC digital device was unregistered, the callback should not
be called.
Fixes: 204bddcb50 ("NFC: nfcsim: Make use of the Digital layer")
Cc: <stable@vger.kernel.org>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace pci_enable_device() with pcim_enable_device(),
pci_disable_device() and pci_release_regions() will be
called in release automatically.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As Ben Hutchings noticed, this check should have been inverted: the call
returns true in case of success.
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Fixes: 0c5dc070ff ("sctp: validate from_addr_param return")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the function s3fwrn5_fw_download(), the 'ret' is not assigned,
so the correct value should be given in dev_err function.
Fixes: a0302ff590 ("nfc: s3fwrn5: remove unnecessary label")
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: Tang Bin <tangbin@cmss.chinamobile.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmEAkk4ACgkQSD+KveBX
+j6hFgf/eIRpqg3mivOXjwo3LF34cP02OBStbb5S5S2MFX60n8Z1GXew42I01iON
zUv6RRmtgJiWHedWUYaXnPHE3hqXlX7Vr2fuBXlNJdPG87jZBBK5FLfQCSPBmcyQ
ZK778uSlgarBL3kGwPkPLLwMypMgHpTCkvCb9RIOD2HzXPCjftP4IBQwEEV2zyaj
d7mtbndGeOyRyEh2RO7ABdJNIAta3Z6EfZCe61mYhVg6c49fkoZlX4/vdX5FhtZM
Pvode2ReD49vigpvfDpSSFea6W+qWtTz83/J9Dbnt8byLdzY5VKm3hanv7o41ksg
FyLHz08HPJjP6Mwikel/ZyG5Bb5FFw==
=lD83
-----END PGP SIGNATURE-----
Merge tag 'mlx5-fixes-2021-07-27' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5 fixes 2021-07-27
This series introduces some fixes to mlx5 driver.
Please pull and let me know if there is any problem.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
blkdev_get_no_open acquires a reference to the block_device through
the block device inode and then tries to acquire a device model
reference to the gendisk. But at this point the disk migh already
be freed (although the race is free). Fix this by only freeing the
gendisk from the whole device bdevs ->free_inode callback as well.
Fixes: 22ae8ce8b8 ("block: simplify bdev/disk lookup in blkdev_get")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210722075402.983367-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
iocg_wake_fn() open-codes wait_queue_entry removal and wakeup because it
wants the wq_entry to be always removed whether it ended up waking the
task or not. finish_wait() tests whether wq_entry needs removal without
grabbing the wait_queue lock and expects the waker to use
list_del_init_careful() after all waking operations are complete, which
iocg_wake_fn() didn't do. The operation order was wrong and the regular
list_del_init() was used.
The result is that if a waiter wakes up racing the waker, it can free pop
the wq_entry off stack before the waker is still looking at it, which can
lead to a backtrace like the following.
[7312084.588951] general protection fault, probably for non-canonical address 0x586bf4005b2b88: 0000 [#1] SMP
...
[7312084.647079] RIP: 0010:queued_spin_lock_slowpath+0x171/0x1b0
...
[7312084.858314] Call Trace:
[7312084.863548] _raw_spin_lock_irqsave+0x22/0x30
[7312084.872605] try_to_wake_up+0x4c/0x4f0
[7312084.880444] iocg_wake_fn+0x71/0x80
[7312084.887763] __wake_up_common+0x71/0x140
[7312084.895951] iocg_kick_waitq+0xe8/0x2b0
[7312084.903964] ioc_rqos_throttle+0x275/0x650
[7312084.922423] __rq_qos_throttle+0x20/0x30
[7312084.930608] blk_mq_make_request+0x120/0x650
[7312084.939490] generic_make_request+0xca/0x310
[7312084.957600] submit_bio+0x173/0x200
[7312084.981806] swap_readpage+0x15c/0x240
[7312084.989646] read_swap_cache_async+0x58/0x60
[7312084.998527] swap_cluster_readahead+0x201/0x320
[7312085.023432] swapin_readahead+0x2df/0x450
[7312085.040672] do_swap_page+0x52f/0x820
[7312085.058259] handle_mm_fault+0xa16/0x1420
[7312085.066620] do_page_fault+0x2c6/0x5c0
[7312085.074459] page_fault+0x2f/0x40
Fix it by switching to list_del_init_careful() and putting it at the end.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Rik van Riel <riel@surriel.com>
Fixes: 7caa47151a ("blkcg: implement blk-iocost")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The result of __dev_get_by_index() is not checked for NULL and then gets
dereferenced immediately.
Also, __dev_get_by_index() must be called while holding either RTNL lock
or @dev_base_lock, which isn't satisfied by mlx5e_hairpin_get_mdev() or
its callers. This makes the underlying hlist_for_each_entry() loop not
safe, and can have adverse effects in itself.
Fix by using dev_get_by_index() and handling nullptr return value when
ifindex device is not found. Update mlx5e_hairpin_get_mdev() callers to
check for possible PTR_ERR() result.
Fixes: 77ab67b7f0 ("net/mlx5e: Basic setup of hairpin object")
Addresses-Coverity: ("Dereference null return value")
Signed-off-by: Dima Chumak <dchumak@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
When fw_fatal reporter reports an error, the firmware in not responding.
Unload the device to ensure that the driver closes all its resources,
even if recovery is not due (user disabled auto-recovery or reporter is
in grace period). On successful recovery the device is loaded back up.
Fixes: b3bd076f75 ("net/mlx5: Report devlink health on FW fatal issues")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Set the correct pci-device pointer to the ptp-RQ. This allows access to
dma_mask and avoids allocation request with wrong pci-device.
Fixes: a099da8ffc ("net/mlx5e: Add RQ to PTP channel")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add PTP-RQ to the loop when setting rx-vlan-offload feature via ethtool.
On PTP-RQ's creation, set rx-vlan-offload into its parameters.
Fixes: a099da8ffc ("net/mlx5e: Add RQ to PTP channel")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
If a feature flag is only present in features, but not in hw_features,
the user can't reset it. Although hw_features may contain NETIF_F_HW_TC
by the point where the driver checks whether HTB offload is supported,
this flag is controlled by another condition that may not hold. Set it
explicitly to make sure the user can disable it.
Fixes: 214baf2287 ("net/mlx5e: Support HTB offload")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
When HW aggregates packets for an LRO session, it writes the payload
of two consecutive packets of a flow contiguously, so that they usually
share a cacheline.
The first byte of a packet's payload is written immediately after
the last byte of the preceding packet.
In this flow, there are two consecutive write requests to the shared
cacheline:
1. Regular write for the earlier packet.
2. Read-modify-write for the following packet.
In case of relaxed-ordering on, these two writes might be re-ordered.
Using the end padding optimization (to avoid partial write for the last
cacheline of a packet) becomes problematic if the two writes occur
out-of-order, as the padding would overwrite payload that belongs to
the following packet, causing data corruption.
Avoid this by disabling the end padding optimization when both
LRO and relaxed-ordering are enabled.
Fixes: 17347d5430 ("net/mlx5e: Add support for PCI relaxed ordering")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This is the same check as LAG mode checks if to enable lag.
This will fix adding peer miss rules if lag is not supported
and even an incorrect rules in socket direct mode.
Also fix the incorrect comment on mlx5_get_next_phys_dev() as flow #1
doesn't exists.
Fixes: ac004b8321 ("net/mlx5e: E-Switch, Add peer miss rules")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Destination vport vhca id is valid flag is set only merged eswitch isn't supported.
Change destination vport vhca id value to be set also only when merged eswitch
is supported.
Fixes: e4ad91f23f ("net/mlx5e: Split offloaded eswitch TC rules for port mirroring")
Signed-off-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Fix a bug when flow table is created in priority that already
has other flow tables as shown in the below diagram.
If the new flow table (FT-B) has the lowest level in the priority,
we need to connect the flow tables from the previous priority (p0)
to this new table. In addition when this flow table is destroyed
(FT-B), we need to connect the flow tables from the previous
priority (p0) to the next level flow table (FT-C) in the same
priority of the destroyed table (if exists).
---------
|root_ns|
---------
|
--------------------------------
| | |
---------- ---------- ---------
|p(prio)-x| | p-y | | p-n |
---------- ---------- ---------
| |
---------------- ------------------
|ns(e.g bypass)| |ns(e.g. kernel) |
---------------- ------------------
| | |
------- ------ ----
| p0 | | p1 | |p2|
------- ------ ----
| | \
-------- ------- ------
| FT-A | |FT-B | |FT-C|
-------- ------- ------
Fixes: f90edfd279 ("net/mlx5_core: Connect flow tables")
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Alex Elder says:
====================
net: ipa: add clock references
This series continues preparation for implementing runtime power
management for IPA. We need to ensure that the IPA core clock and
interconnects are operational whenever IPA hardware is accessed.
And in particular this means that any external entry point that can
lead to accessing IPA hardware must guarantee the hardware is "up"
when it is accessed.
The first four patches in this series take IPA clock references when
needed by such external entry points, dropping those references in
those same functions when they are no longer required.
The last patch is a bit different, though it too prepares for
enabling runtime power management. It avoids suspending/resuming
endpoints if setup is not complete.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Until we complete the setup stage of initialization, GSI is not
initialized and therefore endpoints aren't usable. So avoid
suspending endpoints during system suspend unless setup is complete.
Clear the setup_complete flag at the top of ipa_teardown() to
reflect the fact that things are no longer in setup state.
Get rid of a misplaced (and superfluous) comment.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IPA network device can be opened at any time, and an opened
network device can be stopped any time. Both of these callback
functions require access to the hardware, and therefore they need
the IPA clock to be operational. Take an IPA clock reference in
both the ->open and ->stop callback functions, dropping the
reference when they are done accessing hardware.
The ->start_xmit callback requires a little different handling,
and that will be added separately.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The remoteproc SSR callback function for the modem requires hardware
access when handling a modem crash or shutdown. Take and later
release an IPA clock reference in ipa_modem_crashed(), to ensure the
hardware is operational.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Two places call ipa_setup(). The first, ipa_probe(), holds an IPA
clock reference when calling ipa_setup() (if the AP is responsible
for IPA firmware loading). But if the modem is loading IPA
firmware, ipa_smp2p_modem_setup_ready_isr() calls ipa_setup() after
the modem has signaled the hardware is ready. This can happen at
any time, and there is no guarantee the hardware is active.
Have ipa_smp2p_modem_setup() take an IPA clock reference before it
calls ipa_setup(), and release it once setup is complete.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Any entry point that leads to IPA hardware access must ensure the
hardware is operational (clocked). Currently we ensure this by
taking an extra clock reference during setup that is not released
until we receive a system suspend request. But this extra reference
will soon go away.
When the platform driver ->probe function is called, we first need
hardware access in ipa_config(). Although ipa_config() takes an IPA
clock reference, it the special reference taken to prevent suspending
the hardware.
Have ipa_probe() take a reference before calling ipa_config(), so
that the "no-suspend" reference can eventually go away. Drop this
reference before ipa_probe() returns.
Similarly, the driver ->remove function can be called at any time.
Take an IPA clock reference at the beginning of that function, and
drop it again after the deconfig stage has completed (at which point
hardware access is no longer needed).
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the blk_mq_sched_alloc_tags() -> blk_mq_alloc_rqs() call fails, then we
call blk_mq_sched_free_tags() -> blk_mq_free_rqs().
It is incorrect to do so, as any rqs would have already been freed in the
blk_mq_alloc_rqs() call.
Fix by calling blk_mq_free_rq_map() only directly.
Fixes: 6917ff0b5b ("blk-mq-sched: refactor scheduler initialization")
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1627378373-148090-1-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If backlog handler is running during a tear down operation we may enqueue
data on the ingress msg queue while tear down is trying to free it.
sk_psock_backlog()
sk_psock_handle_skb()
skb_psock_skb_ingress()
sk_psock_skb_ingress_enqueue()
sk_psock_queue_msg(psock,msg)
spin_lock(ingress_lock)
sk_psock_zap_ingress()
_sk_psock_purge_ingerss_msg()
_sk_psock_purge_ingress_msg()
-- free ingress_msg list --
spin_unlock(ingress_lock)
spin_lock(ingress_lock)
list_add_tail(msg,ingress_msg) <- entry on list with no one
left to free it.
spin_unlock(ingress_lock)
To fix we only enqueue from backlog if the ENABLED bit is set. The tear
down logic clears the bit with ingress_lock set so we wont enqueue the
msg in the last step.
Fixes: 799aa7f98d ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210727160500.1713554-4-john.fastabend@gmail.com
Its possible if a socket is closed and the receive thread is under memory
pressure it may have cached a skb. We need to ensure these skbs are
free'd along with the normal ingress_skb queue.
Before 799aa7f98d ("skmsg: Avoid lock_sock() in sk_psock_backlog()") tear
down and backlog processing both had sock_lock for the common case of
socket close or unhash. So it was not possible to have both running in
parrallel so all we would need is the kfree in those kernels.
But, latest kernels include the commit 799aa7f98d5e and this requires a
bit more work. Without the ingress_lock guarding reading/writing the
state->skb case its possible the tear down could run before the state
update causing it to leak memory or worse when the backlog reads the state
it could potentially run interleaved with the tear down and we might end up
free'ing the state->skb from tear down side but already have the reference
from backlog side. To resolve such races we wrap accesses in ingress_lock
on both sides serializing tear down and backlog case. In both cases this
only happens after an EAGAIN error case so having an extra lock in place
is likely fine. The normal path will skip the locks.
Note, we check state->skb before grabbing lock. This works because
we can only enqueue with the mutex we hold already. Avoiding a race
on adding state->skb after the check. And if tear down path is running
that is also fine if the tear down path then removes state->skb we
will simply set skb=NULL and the subsequent goto is skipped. This
slight complication avoids locking in normal case.
With this fix we no longer see this warning splat from tcp side on
socket close when we hit the above case with redirect to ingress self.
[224913.935822] WARNING: CPU: 3 PID: 32100 at net/core/stream.c:208 sk_stream_kill_queues+0x212/0x220
[224913.935841] Modules linked in: fuse overlay bpf_preload x86_pkg_temp_thermal intel_uncore wmi_bmof squashfs sch_fq_codel efivarfs ip_tables x_tables uas xhci_pci ixgbe mdio xfrm_algo xhci_hcd wmi
[224913.935897] CPU: 3 PID: 32100 Comm: fgs-bench Tainted: G I 5.14.0-rc1alu+ #181
[224913.935908] Hardware name: Dell Inc. Precision 5820 Tower/002KVM, BIOS 1.9.2 01/24/2019
[224913.935914] RIP: 0010:sk_stream_kill_queues+0x212/0x220
[224913.935923] Code: 8b 83 20 02 00 00 85 c0 75 20 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 89 df e8 2b 11 fe ff eb c3 0f 0b e9 7c ff ff ff 0f 0b eb ce <0f> 0b 5b 5d 41 5c 41 5d 41 5e 41 5f c3 90 0f 1f 44 00 00 41 57 41
[224913.935932] RSP: 0018:ffff88816271fd38 EFLAGS: 00010206
[224913.935941] RAX: 0000000000000ae8 RBX: ffff88815acd5240 RCX: dffffc0000000000
[224913.935948] RDX: 0000000000000003 RSI: 0000000000000ae8 RDI: ffff88815acd5460
[224913.935954] RBP: ffff88815acd5460 R08: ffffffff955c0ae8 R09: fffffbfff2e6f543
[224913.935961] R10: ffffffff9737aa17 R11: fffffbfff2e6f542 R12: ffff88815acd5390
[224913.935967] R13: ffff88815acd5480 R14: ffffffff98d0c080 R15: ffffffff96267500
[224913.935974] FS: 00007f86e6bd1700(0000) GS:ffff888451cc0000(0000) knlGS:0000000000000000
[224913.935981] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[224913.935988] CR2: 000000c0008eb000 CR3: 00000001020e0005 CR4: 00000000003706e0
[224913.935994] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[224913.936000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[224913.936007] Call Trace:
[224913.936016] inet_csk_destroy_sock+0xba/0x1f0
[224913.936033] __tcp_close+0x620/0x790
[224913.936047] tcp_close+0x20/0x80
[224913.936056] inet_release+0x8f/0xf0
[224913.936070] __sock_release+0x72/0x120
[224913.936083] sock_close+0x14/0x20
Fixes: a136678c0b ("bpf: sk_msg, zap ingress queue on psock down")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210727160500.1713554-3-john.fastabend@gmail.com
We don't want strparser to run and pass skbs into skmsg handlers when
the psock is null. We just sk_drop them in this case. When removing
a live socket from map it means extra drops that we do not need to
incur. Move the zap below strparser close to avoid this condition.
This way we stop the stream parser first stopping it from processing
packets and then delete the psock.
Fixes: a136678c0b ("bpf: sk_msg, zap ingress queue on psock down")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210727160500.1713554-2-john.fastabend@gmail.com
- Many more irdma fixups from bots/etc
- bnxt_re regression in their counters from a FW upgrade
- User triggerable memory leak in rxe
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAmD/PpEACgkQOG33FX4g
mxouhBAAgpjHUgALVAcYPyqXyAqr9KLIv6HB79OWP5W8I+XDDAbedoIJy3FWWm1d
z5l0kBwUCH216KyNrXlikqBM61dluA8N9J7lPCMOolU0dLMaC0MeXCBir2VV4NU9
lSVasmrBPnZFWk38VgTC4rZr4lt+JLyKYwg7BBozCoDhcX/TuYv2HFB9YCJ9Vv+D
u50K1Aj7/fhXfInDuIcICHfvE0d2YaN0+apEQ/Mk11vVGcxxjv/mYraGyFqZkN0M
6yuNZXXIPj9x/gPY8kWp4mBipY6W/cnjLzUEeKQem2rOXpS1baxK9PZlW+X8qlMq
P+9s6EGuxDWcYfU1VuTBwmuM91/kKch6nYtmhHPTv0Wrk/fFZpXM8kD+mi8Vbg4c
aE+0DfpLLnpyu8/BKNmO7s0STT2Ea0VCrlWlGwjg93Y53qCwFsexg/HXO27U8DS2
m2f0DKuGvIHlaUW7OaNADmAzpQ3NikVohBL9hAApNbgjBN8xrifSGdiuUbJ6MU0l
a207yVSQUc96eS+fDJRAngUgYEoO9o1JVeHXGaMs6Rwpu5iSHk1wpQTcQ7vTSeJP
CNjPWAoFe/Jv/S4r0E2s43u0ajV22G6NyWVUQ+eIb8BD5J7fTDn6466vnH8RyQNY
OgwXaYnoadC27SDVr9o6LyLNJYyqUnA1nmZcLFl8gE0LenxyldU=
=g54k
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"Nothing very exciting here, mainly just a bunch of irdma fixes. irdma
is a new driver this cycle so it to be expected.
- Many more irdma fixups from bots/etc
- bnxt_re regression in their counters from a FW upgrade
- User triggerable memory leak in rxe"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/irdma: Change returned type of irdma_setup_virt_qp to void
RDMA/irdma: Change the returned type of irdma_set_hw_rsrc to void
RDMA/irdma: change the returned type of irdma_sc_repost_aeq_entries to void
RDMA/irdma: Check vsi pointer before using it
RDMA/rxe: Fix memory leak in error path code
RDMA/irdma: Change the returned type to void
RDMA/irdma: Make spdxcheck.py happy
RDMA/irdma: Fix unused variable total_size warning
RDMA/bnxt_re: Fix stats counters
Pull cgroup fix from Tejun Heo:
"Fix leak of filesystem context root which is triggered by LTP.
Not too likely to be a problem in non-testing environments"
* 'for-5.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup1: fix leaked context root causing sporadic NULL deref in LTP
The arguments to the KVM_CLEAR_DIRTY_LOG ioctl include a pointer,
therefore it needs a compat ioctl implementation. Otherwise,
32-bit userspace fails to invoke it on 64-bit kernels; for x86
it might work fine by chance if the padding is zero, but not
on big-endian architectures.
Reported-by: Thomas Sattler
Cc: stable@vger.kernel.org
Fixes: 2a31b9db15 ("kvm: introduce manual dirty log reprotect")
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SMT siblings share caches and other hardware, and busy halt polling
will degrade its sibling performance if its sibling is working
Sean Christopherson suggested as below:
"Rather than disallowing halt-polling entirely, on x86 it should be
sufficient to simply have the hardware thread yield to its sibling(s)
via PAUSE. It probably won't get back all performance, but I would
expect it to be close.
This compiles on all KVM architectures, and AFAICT the intended usage
of cpu_relax() is identical for all architectures."
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Message-Id: <20210727111247.55510-1-lirongqing@baidu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently when SVM is enabled in guest CPUID, AVIC is inhibited as soon
as the guest CPUID is set.
AVIC happens to be fully disabled on all vCPUs by the time any guest
entry starts (if after migration the entry can be nested).
The reason is that currently we disable avic right away on vCPU from which
the kvm_request_apicv_update was called and for this case, it happens to be
called on all vCPUs (by svm_vcpu_after_set_cpuid).
After we stop doing this, AVIC will end up being disabled only when
KVM_REQ_APICV_UPDATE is processed which is after we done switching to the
nested guest.
Fix this by just using vmcb01 in svm_refresh_apicv_exec_ctrl for avic
(which is a right thing to do anyway).
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210713142023.106183-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It is possible that AVIC was requested to be disabled but
not yet disabled, e.g if the nested entry is done right
after svm_vcpu_after_set_cpuid.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210713142023.106183-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It is possible for AVIC inhibit and AVIC active state to be mismatched.
Currently we disable AVIC right away on vCPU which started the AVIC inhibit
request thus this warning doesn't trigger but at least in theory,
if svm_set_vintr is called at the same time on multiple vCPUs,
the warning can happen.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210713142023.106183-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
commit bc9e9e672d ("KVM: debugfs: Reuse binary stats descriptors")
did replace the old definitions with the binary ones. While doing that
it missed that some files are names different than the counters. This
is especially important for kvm_stat which does have special handling
for counters named instruction_*.
Fixes: commit bc9e9e672d ("KVM: debugfs: Reuse binary stats descriptors")
CC: Jing Zhang <jingzhangos@google.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Message-Id: <20210726150108.5603-1-borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>