Commit Graph

1155004 Commits

Author SHA1 Message Date
Heinz Mauelshagen aa07f9d806 dm: adjust EXPORT_SYMBOL() to follow functions immediately
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14 14:23:07 -05:00
Heinz Mauelshagen 2e84fecf19 dm: avoid split of quoted strings where possible
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14 14:23:07 -05:00
Heinz Mauelshagen 2d0f25cbc0 dm: remove unnecessary braces from single statement blocks
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14 14:23:06 -05:00
Heinz Mauelshagen 0ef0b4717a dm: add missing empty lines
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14 14:23:06 -05:00
Heinz Mauelshagen 02f10ba178 dm: add argument identifier names
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14 14:23:06 -05:00
Heinz Mauelshagen 8ca817c43e dm: avoid spaces before function arguments or in favour of tabs
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14 14:23:06 -05:00
Heinz Mauelshagen beecc8438c dm block-manager: avoid not required parentheses
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14 14:23:06 -05:00
Heinz Mauelshagen ced6e475c3 dm crypt: correct 'foo*' to 'foo *'
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14 14:23:06 -05:00
Heinz Mauelshagen 03b1888770 dm: fix trailing statements
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14 14:23:06 -05:00
Heinz Mauelshagen 43be9c743c dm: fix undue/missing spaces
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14 14:23:06 -05:00
Heinz Mauelshagen a4a82ce3d2 dm: correct block comments format.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14 14:23:06 -05:00
Heinz Mauelshagen 255e264649 dm: address indent/space issues
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14 14:23:06 -05:00
Heinz Mauelshagen 96422281ba dm: address space issues relative to switch/while/for/...
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14 14:23:06 -05:00
Heinz Mauelshagen 2f06cd12e1 dm: avoid initializing static variables
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14 14:23:06 -05:00
Heinz Mauelshagen 44bc08ed63 dm: enclose complex macros into parentheses where possible
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14 14:23:06 -05:00
Heinz Mauelshagen d715fa2357 dm: avoid assignment in if conditions
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14 14:23:06 -05:00
Heinz Mauelshagen 86a3238c7b dm: change "unsigned" to "unsigned int"
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14 14:23:06 -05:00
Heinz Mauelshagen 238d991f05 dm: use fsleep() instead of msleep() for deterministic sleep duration
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14 14:23:06 -05:00
Heinz Mauelshagen 0d78954a2d dm: prefer kmap_local_page() instead of deprecated kmap_atomic()
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14 14:23:06 -05:00
Heinz Mauelshagen 3bd9400307 dm: add missing SPDX-License-Indentifiers
'GPL-2.0-only' is used instead of 'GPL-2.0' because SPDX has
deprecated its use.

Suggested-by: John Wiele <jwiele@redhat.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14 14:23:06 -05:00
Mikulas Patocka 7533afa1d2 dm: send just one event on resize, not two
Device mapper sends an uevent when the device is suspended, using the
function set_capacity_and_notify. However, this causes a race condition
with udev.

Udev skips scanning dm devices that are suspended. If we send an uevent
while we are suspended, udev will be racing with device mapper resume
code. If the device mapper resume code wins the race, udev will process
the uevent after the device is resumed and it will properly scan the
device.

However, if udev wins the race, it will receive the uevent, find out that
the dm device is suspended and skip scanning the device. This causes bugs
such as systemd unmounting the device - see
https://bugzilla.redhat.com/show_bug.cgi?id=2158628

This commit fixes this race.

We replace the function set_capacity_and_notify with set_capacity, so that
the uevent is not sent at this point. In do_resume, we detect if the
capacity has changed and we pass a boolean variable need_resize_uevent to
dm_kobject_uevent. dm_kobject_uevent adds "RESIZE=1" to the uevent if
need_resize_uevent is set.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Peter Rajnoha <prajnoha@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14 14:22:27 -05:00
Benjamin Marzinski d1c0e1587e dm table: check that a dm device doesn't reference itself
If a DM device's table references itself, it will crash the kernel with an
infinite recursion.  Check for a self-reference in dm_get_device(). This
is a quick check, but it won't catch more complicated circular references.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-12 10:20:56 -05:00
Yu Zhe efdd3c3375 dm raid: fix some spelling mistakes in comments
Signed-off-by: Yu Zhe <yuzhe@nfschina.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-12 10:20:56 -05:00
Nathan Huckleberry c25da5b7ba dm verity: stop using WQ_UNBOUND for verify_wq
Setting WQ_UNBOUND increases scheduler latency on ARM64.  This is
likely due to the asymmetric architecture of ARM64 processors.

I've been unable to reproduce the results that claim WQ_UNBOUND gives
a performance boost on x86-64.

This flag is causing performance issues for multiple subsystems within
Android.  Notably, the same slowdown exists for decompression with
EROFS.

| open-prebuilt-camera  | WQ_UNBOUND | ~WQ_UNBOUND   |
|-----------------------|------------|---------------|
| verity wait time (us) | 11746      | 119 (-98%)    |
| erofs wait time (us)  | 357805     | 174205 (-51%) |

| sha256 ramdisk random read | WQ_UNBOUND    | ~WQ_UNBOUND |
|----------------------------|-----------=---|-------------|
| arm64 (accelerated)        | bw=42.4MiB/s  | bw=212MiB/s |
| arm64 (generic)            | bw=16.5MiB/s  | bw=48MiB/s  |
| x86_64 (generic)           | bw=233MiB/s   | bw=230MiB/s |

Using a alloc_workqueue() @max_active arg of num_online_cpus() only
made sense with WQ_UNBOUND. Switch the @max_active arg to 0 (aka
default, which is 256 per-cpu).

Also, eliminate 'wq_flags' since it really doesn't serve a purpose.

Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Nathan Huckleberry <nhuck@google.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-02 14:26:09 -05:00
Jiapeng Chong 5cd6d1d53a dm integrity: Remove bi_sector that's only used by commented debug code
drivers/md/dm-integrity.c:1738:13: warning: variable 'bi_sector' set but not used.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3895
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-02 14:26:09 -05:00
Christophe JAILLET fc772580a3 dm crypt: Slightly simplify crypt_set_keyring_key()
Use strchr() instead of strpbrk() when there is only 1 element in the set
of characters to look for.

This potentially saves a few cycles, but gcc does already account for
optimizing this pattern thanks to it's fold_builtin_strpbrk().

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-02 14:26:09 -05:00
Sergey Shtylyov 151d812251 dm ioctl: drop always-false condition
The expression 'indata[3] > ULONG_MAX' always evaluates to false since
indata[] is declared as an array of *unsigned long* elements and #define
ULONG_MAX represents the max value of that exact type...

Note that gcc seems to be able to detect the dead code here and eliminate
this check anyway...

Found by Linux Verification Center (linuxtesting.org) with the SVACE static
analysis tool.

Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-02 14:26:09 -05:00
Mikulas Patocka aa56b9b759 dm flakey: fix logic when corrupting a bio
If "corrupt_bio_byte" is set to corrupt reads and corrupt_bio_flags is
used, dm-flakey would erroneously return all writes as errors. Likewise,
if "corrupt_bio_byte" is set to corrupt writes, dm-flakey would return
errors for all reads.

Fix the logic so that if fc->corrupt_bio_byte is non-zero, dm-flakey
will not abort reads on writes with an error.

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-02 14:26:09 -05:00
Mikulas Patocka 8eb29c4fbf dm flakey: fix a bug with 32-bit highmem systems
The function page_address does not work with 32-bit systems with high
memory. Use bvec_kmap_local/kunmap_local instead.

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-01 11:35:30 -05:00
Mikulas Patocka f50714b57a dm flakey: don't corrupt the zero page
When we need to zero some range on a block device, the function
__blkdev_issue_zero_pages submits a write bio with the bio vector pointing
to the zero page. If we use dm-flakey with corrupt bio writes option, it
will corrupt the content of the zero page which results in crashes of
various userspace programs. Glibc assumes that memory returned by mmap is
zeroed and it uses it for calloc implementation; if the newly mapped
memory is not zeroed, calloc will return non-zeroed memory.

Fix this bug by testing if the page is equal to ZERO_PAGE(0) and
avoiding the corruption in this case.

Cc: stable@vger.kernel.org
Fixes: a00f5276e2 ("dm flakey: Properly corrupt multi-page bios.")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-01 11:35:30 -05:00
Joe Thornber 22c40e134c dm cache: Add some documentation to dm-cache-background-tracker.h
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-01-30 14:20:04 -05:00
Joe Thornber 95ab80a8a0 dm cache: free background tracker's queued work in btracker_destroy
Otherwise the kernel can BUG with:

[ 2245.426978] =============================================================================
[ 2245.435155] BUG bt_work (Tainted: G    B   W         ): Objects remaining in bt_work on __kmem_cache_shutdown()
[ 2245.445233] -----------------------------------------------------------------------------
[ 2245.445233]
[ 2245.454879] Slab 0x00000000b0ce2b30 objects=64 used=2 fp=0x000000000a3c6a4e flags=0x17ffffc0000200(slab|node=0|zone=2|lastcpupid=0x1fffff)
[ 2245.467300] CPU: 7 PID: 10805 Comm: lvm Kdump: loaded Tainted: G    B   W          6.0.0-rc2 #19
[ 2245.476078] Hardware name: Dell Inc. PowerEdge R7525/0590KW, BIOS 2.5.6 10/06/2021
[ 2245.483646] Call Trace:
[ 2245.486100]  <TASK>
[ 2245.488206]  dump_stack_lvl+0x34/0x48
[ 2245.491878]  slab_err+0x95/0xcd
[ 2245.495028]  __kmem_cache_shutdown.cold+0x31/0x136
[ 2245.499821]  kmem_cache_destroy+0x49/0x130
[ 2245.503928]  btracker_destroy+0x12/0x20 [dm_cache]
[ 2245.508728]  smq_destroy+0x15/0x60 [dm_cache_smq]
[ 2245.513435]  dm_cache_policy_destroy+0x12/0x20 [dm_cache]
[ 2245.518834]  destroy+0xc0/0x110 [dm_cache]
[ 2245.522933]  dm_table_destroy+0x5c/0x120 [dm_mod]
[ 2245.527649]  __dm_destroy+0x10e/0x1c0 [dm_mod]
[ 2245.532102]  dev_remove+0x117/0x190 [dm_mod]
[ 2245.536384]  ctl_ioctl+0x1a2/0x290 [dm_mod]
[ 2245.540579]  dm_ctl_ioctl+0xa/0x20 [dm_mod]
[ 2245.544773]  __x64_sys_ioctl+0x8a/0xc0
[ 2245.548524]  do_syscall_64+0x5c/0x90
[ 2245.552104]  ? syscall_exit_to_user_mode+0x12/0x30
[ 2245.556897]  ? do_syscall_64+0x69/0x90
[ 2245.560648]  ? do_syscall_64+0x69/0x90
[ 2245.564394]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 2245.569447] RIP: 0033:0x7fe52583ec6b
...
[ 2245.646771] ------------[ cut here ]------------
[ 2245.651395] kmem_cache_destroy bt_work: Slab cache still has objects when called from btracker_destroy+0x12/0x20 [dm_cache]
[ 2245.651408] WARNING: CPU: 7 PID: 10805 at mm/slab_common.c:478 kmem_cache_destroy+0x128/0x130

Found using: lvm2-testsuite --only "cache-single-split.sh"

Ben bisected and found that commit 0495e337b7 ("mm/slab_common:
Deleting kobject in kmem_cache_destroy() without holding
slab_mutex/cpu_hotplug_lock") first exposed dm-cache's incomplete
cleanup of its background tracker work objects.

Reported-by: Benjamin Marzinski <bmarzins@redhat.com>
Tested-by: Benjamin Marzinski <bmarzins@redhat.com>
Cc: stable@vger.kernel.org # 6.0+
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-01-30 14:20:04 -05:00
Mike Snitzer c87791bcc4 dm: improve shrinker debug names
Commit e33c267ab7 ("mm: shrinkers: provide shrinkers with names")
chose some fairly bad names for DM's shrinkers.

Fixes: e33c267ab7 ("mm: shrinkers: provide shrinkers with names")
Signed-off-by : Mike Snitzer <snitzer@kernel.org>
2023-01-30 14:20:04 -05:00
Ulf Hansson 4a6a7bc21d block: Default to use cgroup support for BFQ
Assuming that both Kconfig options, BLK_CGROUP and IOSCHED_BFQ are set, we
most likely want cgroup support for BFQ too (BFQ_GROUP_IOSCHED), so let's
make it default y.

Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/r/20230130121240.159456-1-ulf.hansson@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-30 09:42:42 -07:00
Kemeng Shi 323745a3aa block, bfq: remove unused bfq_wr_max_time in struct bfq_data
bfqd->bfq_wr_max_time is set to 0 in bfq_init_queue and is never changed.
It is only used in bfq_wr_duration when bfq_wr_max_time > 0 which never
meets, so bfqd->bfq_wr_max_time is not used actually. Just remove it.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-9-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 20:03:49 -07:00
Kemeng Shi 87c971de81 block, bfq: remove unnecessary goto tag in bfq_dispatch_rq_from_bfqq
We jump to tag only for returning current rq. Return directly to
remove this tag.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116095153.3810101-8-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 20:03:49 -07:00
Kemeng Shi 433d4b03e7 block, bfq: remove redundant check in bfq_put_cooperator
We have already avoided a circular list in bfq_setup_merge (see comments
in bfq_setup_merge() for details), so bfq_queue will not appear in it's
new_bfqq list. Just remove this check.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-7-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 20:03:49 -07:00
Kemeng Shi 86f8382e6d block, bfq: remove unnecessary dereference to get async_bfqq
The async_bfqq is assigned with bfqq->bic->bfqq[0], use it directly.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-6-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 20:03:49 -07:00
Kemeng Shi 8ac2e43c35 block, bfq: use helper macro RQ_BFQQ to get bfqq of request
Use helper macro RQ_BFQQ to get bfqq of request.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-5-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 20:03:49 -07:00
Kemeng Shi 1c970450a7 block, bfq: initialize bfqq->decrease_time_jif correctly
Inject limit is updated or reset when time_is_before_eq_jiffies(
decrease_time_jif + several msecs) or think-time state changes.
decrease_time_jif is initialized to 0 and will be set to current jiffies
when inject limit is updated or reset. If the jiffies is slightly greater
than LONG_MAX, time_is_after_eq_jiffies(0) will keep for a long time, so as
time_is_after_eq_jiffies(decrease_time_jif + several msecs). If the
think-time state never chages, then the injection will not work as expected
for long time.

To be more specific:
Function bfq_update_inject_limit maybe triggered when jiffies pasts
decrease_time_jif + msecs_to_jiffies(10) in bfq_add_request by setting
bfqd->wait_dispatch to true.
Function bfq_reset_inject_limit are called in two conditions:
1. jiffies pasts bfqq->decrease_time_jif + msecs_to_jiffies(1000) in
function bfq_add_request.
2. jiffies pasts bfqq->decrease_time_jif + msecs_to_jiffies(100) or
bfq think-time state change from short to long.

Fix this by initializing bfqq->decrease_time_jif to current jiffies
to trigger service injection soon when service injection conditions
are met.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-4-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 20:03:49 -07:00
Kemeng Shi bebeb9e582 block, bfq: remove unsed parameter reason in bfq_bfqq_is_slow
Parameter reason is never used, just remove it.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-3-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 20:03:49 -07:00
Kemeng Shi 0c3e09e885 block, bfq: correctly raise inject limit in bfq_choose_bfqq_for_injection
Function bfq_choose_bfqq_for_injection may temporarily raise inject limit
to one request if current inject_limit is 0 before search of the source
queue for injection. However the search below will reset inject limit to
bfqd->in_service_queue which is zero for raised inject limit. Then the
temporarily raised inject limit never works as expected.
Assigment limit to bfqd->in_service_queue in search is needed as limit
maybe overwriten to min_t(unsigned int, 1, limit) for condition that
a large in-flight request is on non-rotational devices in found queue.
So we need to reset limit to bfqd->in_service_queue for normal case.

Actually, we have already make sure bfqd->rq_in_driver is < limit before
search, then
 -Limit is >= 1 as bfqd->rq_in_driver is >= 0. Then min_t(unsigned int,
1, limit) is always 1. So we can simply check bfqd->rq_in_driver with
1 instead of result of min_t(unsigned int, 1, limit) for larget request in
non-rotational device case to avoid overwritting limit and the bug is gone.
 -For normal case, we have already check bfqd->rq_in_driver is < limit,
so we can return found bfqq unconditionally to remove unncessary check.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-2-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 20:03:49 -07:00
Kemeng Shi b5fcf7871a sbitmap: correct wake_batch recalculation to avoid potential IO hung
Commit 180dccb0db ("blk-mq: fix tag_get wait task can't be awakened")
mentioned that in case of shared tags, there could be just one real
active hctx(queue) because of lazy detection of tag idle. Then driver tag
allocation may wait forever on this real active hctx(queue) if wake_batch
is > hctx_max_depth where hctx_max_depth is available tags depth for the
actve hctx(queue). However, the condition wake_batch > hctx_max_depth is
not strong enough to avoid IO hung as the sbitmap_queue_wake_up will only
wake up one wait queue for each wake_batch even though there is only one
waiter in the woken wait queue. After this, there is only one tag to free
and wake_batch may not be reached anymore. Commit 180dccb0db ("blk-mq:
fix tag_get wait task can't be awakened") methioned that driver tag
allocation may wait forever. Actually, the inactive hctx(queue) will be
truely idle after at most 30 seconds and will call blk_mq_tag_wakeup_all
to wake one waiter per wait queue to break the hung. But IO hung for 30
seconds is also not acceptable. Set batch size to small enough that depth
of the shared hctx(queue) is enough to wake up all of the queues like
sbq_calc_wake_batch do to fix this potential IO hung.

Although hctx_max_depth will be clamped to at least 4 while wake_batch
recalculation does not do the clamp, the wake_batch will be always
recalculated to 1 when hctx_max_depth <= 4.

Fixes: 180dccb0db ("blk-mq: fix tag_get wait task can't be awakened")
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-6-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 20:03:01 -07:00
Kemeng Shi 678418c612 sbitmap: add sbitmap_find_bit to remove repeat code in __sbitmap_get/__sbitmap_get_shallow
There are three differences between __sbitmap_get and
__sbitmap_get_shallow when searching free bit:
1. __sbitmap_get_shallow limit number of bit to search per word.
__sbitmap_get has no such limit.
2. __sbitmap_get_shallow always searches with wrap set. __sbitmap_get set
wrap according to round_robin.
3. __sbitmap_get_shallow always searches from first bit in first word.
__sbitmap_get searches from first bit when round_robin is not set
otherwise searches from SB_NR_TO_BIT(sb, alloc_hint).

Add helper function sbitmap_find_bit function to do common search while
accept "limit depth per word", "wrap flag" and "first bit to
search" from caller to support the need of both __sbitmap_get and
__sbitmap_get_shallow.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-5-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 20:03:01 -07:00
Kemeng Shi 08470a98a7 sbitmap: rewrite sbitmap_find_bit_in_index to reduce repeat code
Rewrite sbitmap_find_bit_in_index as following:
1. Rename sbitmap_find_bit_in_index to sbitmap_find_bit_in_word
2. Accept "struct sbitmap_word *" directly instead of accepting
"struct sbitmap *" and "int index" to get "struct sbitmap_word *".
3. Accept depth/shallow_depth and wrap for __sbitmap_get_word from caller
to support need of both __sbitmap_get_shallow and __sbitmap_get.

With helper function sbitmap_find_bit_in_word, we can remove repeat
code in __sbitmap_get_shallow to find bit considring deferred clear.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-4-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 20:03:01 -07:00
Kemeng Shi 903e86f3a6 sbitmap: remove redundant check in __sbitmap_queue_get_batch
Commit fbb564a557 ("lib/sbitmap: Fix invalid loop in
__sbitmap_queue_get_batch()") mentioned that "Checking free bits when
setting the target bits. Otherwise, it may reuse the busying bits."
This commit add check to make sure all masked bits in word before
cmpxchg is zero. Then the existing check after cmpxchg to check any
zero bit is existing in masked bits in word is redundant.

Actually, old value of word before cmpxchg is stored in val and we
will filter out busy bits in val by "(get_mask & ~val)" after cmpxchg.
So we will not reuse busy bits methioned in commit fbb564a557
("lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()"). Revert
new-added check to remove redundant check.

Fixes: fbb564a557 ("lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()")
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-3-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 20:03:01 -07:00
Kemeng Shi f1591a8bb3 sbitmap: remove unnecessary calculation of alloc_hint in __sbitmap_get_shallow
Updates to alloc_hint in the loop in __sbitmap_get_shallow() are mostly
pointless and equivalent to setting alloc_hint to zero (because
SB_NR_TO_BIT() considers only low sb->shift bits from alloc_hint). So
simplify the logic.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116205059.3821738-2-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 20:03:01 -07:00
Yu Kuai f1c006f1c6 blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()
Currently parent pd can be freed before child pd:

t1: remove cgroup C1
blkcg_destroy_blkgs
 blkg_destroy
  list_del_init(&blkg->q_node)
  // remove blkg from queue list
  percpu_ref_kill(&blkg->refcnt)
   blkg_release
    call_rcu

t2: from t1
__blkg_release
 blkg_free
  schedule_work
			t4: deactivate policy
			blkcg_deactivate_policy
			 pd_free_fn
			 // parent of C1 is freed first
t3: from t2
 blkg_free_workfn
  pd_free_fn

If policy(for example, ioc_timer_fn() from iocost) access parent pd from
child pd after pd_offline_fn(), then UAF can be triggered.

Fix the problem by delaying 'list_del_init(&blkg->q_node)' from
blkg_destroy() to blkg_free_workfn(), and using a new disk level mutex to
synchronize blkg_free_workfn() and blkcg_deactivate_policy().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230119110350.2287325-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:19:04 -07:00
Yu Kuai dfd6200a09 blk-cgroup: support to track if policy is online
A new field 'online' is added to blkg_policy_data to fix following
2 problem:

1) In blkcg_activate_policy(), if pd_alloc_fn() with 'GFP_NOWAIT'
   failed, 'queue_lock' will be dropped and pd_alloc_fn() will try again
   without 'GFP_NOWAIT'. In the meantime, remove cgroup can race with
   it, and pd_offline_fn() will be called without pd_init_fn() and
   pd_online_fn(). This way null-ptr-deference can be triggered.

2) In order to synchronize pd_free_fn() from blkg_free_workfn() and
   blkcg_deactivate_policy(), 'list_del_init(&blkg->q_node)' will be
   delayed to blkg_free_workfn(), hence pd_offline_fn() can be called
   first in blkg_destroy(), and then blkcg_deactivate_policy() will
   call it again, we must prevent it.

The new field 'online' will be set after pd_online_fn() and will be
cleared after pd_offline_fn(), in the meantime pd_offline_fn() will only
be called if 'online' is set.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230119110350.2287325-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:19:04 -07:00
Yu Kuai c7241babf0 blk-cgroup: dropping parent refcount after pd_free_fn() is done
Some cgroup policies will access parent pd through child pd even
after pd_offline_fn() is done. If pd_free_fn() for parent is called
before child, then UAF can be triggered. Hence it's better to guarantee
the order of pd_free_fn().

Currently refcount of parent blkg is dropped in __blkg_release(), which
is before pd_free_fn() is called in blkg_free_work_fn() while
blkg_free_work_fn() is called asynchronously.

This patch make sure pd_free_fn() called from removing cgroup is ordered
by delaying dropping parent refcount after calling pd_free_fn() for
child.

BTW, pd_free_fn() will also be called from blkcg_deactivate_policy()
from deleting device, and following patches will guarantee the order.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230119110350.2287325-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:19:04 -07:00