Commit Graph

185 Commits

Author SHA1 Message Date
Kent Overstreet d4e3b928ab closures: CLOSURE_CALLBACK() to fix type punning
Control flow integrity is now checking that type signatures match on
indirect function calls. That breaks closures, which embed a work_struct
in a closure in such a way that a closure_fn may also be used as a
workqueue fn by the underlying closure code.

So we have to change closure fns to take a work_struct as their
argument - but that results in a loss of clarity, as closure fns have
different semantics from normal workqueue functions (they run owning a
ref on the closure, which must be released with continue_at() or
closure_return()).

Thus, this patch introduces CLOSURE_CALLBACK() and closure_type() macros
as suggested by Kees, to smooth things over a bit.
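
A rough sketch of how the two macros fit together (the exact definitions live
in include/linux/closure.h; the macro bodies and the example callback below
are illustrative assumptions, not a quote of the patch):

	/* the callback now has a workqueue-compatible type ... */
	#define CLOSURE_CALLBACK(name)	void name(struct work_struct *ws)

	/* ... and closure_type() recovers the enclosing object from the work_struct */
	#define closure_type(name, type, member)				\
		struct closure *cl = container_of(ws, struct closure, work);	\
		type *name = container_of(cl, type, member)

	static CLOSURE_CALLBACK(example_write_done)
	{
		closure_type(b, struct btree, io.cl);	/* assumed member path */

		/* runs owning a ref on the closure; must still end with
		 * continue_at() or closure_return() as before */
	}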

Suggested-by: Kees Cook <keescook@chromium.org>
Cc: Coly Li <colyli@suse.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-24 00:29:58 -05:00
Qi Zheng a6a1eb6214 bcache: dynamically allocate the md-bcache shrinker
In preparation for implementing lockless slab shrink, use new APIs to
dynamically allocate the md-bcache shrinker, so that it can be freed
asynchronously via RCU. Then it doesn't need to wait for an RCU read-side
critical section when releasing the struct cache_set.
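
A hedged sketch of the resulting allocation pattern (field names follow the
new shrinker_alloc()/shrinker_register()/shrinker_free() API; the
bcache-specific values shown are assumptions):

	c->shrink = shrinker_alloc(0, "md-bcache:%pU", c->set_uuid);
	if (!c->shrink) {
		pr_warn("bcache: %s: could not allocate shrinker\n", __func__);
		return 0;	/* shrinker registration is best-effort here */
	}

	c->shrink->count_objects = bch_mca_count;
	c->shrink->scan_objects  = bch_mca_scan;
	c->shrink->seeks         = 4;
	c->shrink->batch         = c->btree_pages * 2;
	c->shrink->private_data  = c;

	shrinker_register(c->shrink);

	/* teardown: shrinker_free() replaces unregister_shrinker() and lets the
	 * object be freed asynchronously via RCU */
	shrinker_free(c->shrink);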

Link: https://lkml.kernel.org/r/20230911094444.68966-27-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Cc: Coly Li <colyli@suse.de>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Chuck Lever <cel@kernel.org>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Airlie <airlied@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Marijn Suijten <marijn.suijten@somainline.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sean Paul <sean@poorly.run>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Song Liu <song@kernel.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:32:25 -07:00
Linus Torvalds bc6cb4d5bc Locking changes for v6.5:
- Introduce cmpxchg128() -- aka. the demise of cmpxchg_double().
 
   The cmpxchg128() family of functions is basically & functionally
   the same as cmpxchg_double(), but with a saner interface: instead
   of a 6-parameter horror that forced u128 - u64/u64-halves layout
   details on the interface and exposed users to complexity,
   fragility & bugs, use a natural 3-parameter interface with u128 types.
 
 - Restructure the generated atomic headers, and add
   kerneldoc comments for all of the generic atomic{,64,_long}_t
   operations. Generated definitions are much cleaner now,
   and come with documentation.
 
 - Implement lock_set_cmp_fn() on lockdep, for defining an ordering
   when taking multiple locks of the same type. This gets rid of
   one use of lockdep_set_novalidate_class() in the bcache code.
 
 - Fix raw_cpu_generic_try_cmpxchg() bug due to an unintended
   variable shadowing generating garbage code on Clang on certain
   ARM builds.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmSav3wRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gDyxAAjCHQjpolrre7fRpyiTDwqzIKT27H04vQ
 zrQVlVc42WBnn9pe8LthGy43/RvYvqlZvLoLONA4fMkuYriM6nSMsoZjeUmE+6Rs
 QAElQC74P5YvEBOa67VNY3/M7sj22ftDe7ODtVV8OrnPjMk1sQNRvaK025Cs3yig
 8MAI//hHGNmyVAp1dPYZMJNqxGCvluReLZ4SaUJFCMrg7YgUXgCBj/5Gi07TlKxn
 sT8BFCssoEW/B9FXkh59B1t6FBCZoSy4XSZfsZe0uVAUJ4XDEOO+zBgaWFCedNQT
 wP323ryBgMrkzUKA8j2/o5d3QnMA1GcBfHNNlvAl/fOfrxWXzDZnOEY26YcaLMa0
 YIuRF/JNbPZlt6DCUVBUEvMPpfNYi18dFN0rat1a6xL2L4w+tm55y3mFtSsg76Ka
 r7L2nWlRrAGXnuA+VEPqkqbSWRUSWOv5hT2Mcyb5BqqZRsxBETn6G8GVAzIO6j6v
 giyfUdA8Z9wmMZ7NtB6usxe3p1lXtnZ/shCE7ZHXm6xstyZrSXaHgOSgAnB9DcuJ
 7KpGIhhSODQSwC/h/J0KEpb9Pr/5jCWmXAQ2DWnZK6ndt1jUfFi8pfK58wm0AuAM
 o9t8Mx3o8wZjbMdt6up9OIM1HyFiMx2BSaZK+8f/bWemHQ0xwez5g4k5O5AwVOaC
 x9Nt+Tp0Ze4=
 =DsYj
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:

 - Introduce cmpxchg128() -- aka. the demise of cmpxchg_double()

   The cmpxchg128() family of functions is basically & functionally the
   same as cmpxchg_double(), but with a saner interface.

   Instead of a 6-parameter horror that forced u128 - u64/u64-halves
   layout details on the interface and exposed users to complexity,
   fragility & bugs, use a natural 3-parameter interface with u128
   types.

 - Restructure the generated atomic headers, and add kerneldoc comments
   for all of the generic atomic{,64,_long}_t operations.

   The generated definitions are much cleaner now, and come with
   documentation.

 - Implement lock_set_cmp_fn() on lockdep, for defining an ordering when
   taking multiple locks of the same type.

   This gets rid of one use of lockdep_set_novalidate_class() in the
   bcache code.

 - Fix raw_cpu_generic_try_cmpxchg() bug due to an unintended variable
   shadowing generating garbage code on Clang on certain ARM builds.
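
For the first item, a hedged illustration of the interface difference (a
sketch only; exact prototypes are per-architecture):

	/* old: cmpxchg_double() took two pointers plus u64 old/new halves,
	 * six parameters in total, exposing the u128 layout to every caller */

	/* new: one naturally aligned u128 location, u128 old and new values */
	static bool update_slot(u128 *slot, u128 old, u128 new)
	{
		/* cmpxchg128() returns the previous contents of *slot */
		return cmpxchg128(slot, old, new) == old;
	}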

* tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
  locking/atomic: scripts: fix ${atomic}_dec_if_positive() kerneldoc
  percpu: Fix self-assignment of __old in raw_cpu_generic_try_cmpxchg()
  locking/atomic: treewide: delete arch_atomic_*() kerneldoc
  locking/atomic: docs: Add atomic operations to the driver basic API documentation
  locking/atomic: scripts: generate kerneldoc comments
  docs: scripts: kernel-doc: accept bitwise negation like ~@var
  locking/atomic: scripts: simplify raw_atomic*() definitions
  locking/atomic: scripts: simplify raw_atomic_long*() definitions
  locking/atomic: scripts: split pfx/name/sfx/order
  locking/atomic: scripts: restructure fallback ifdeffery
  locking/atomic: scripts: build raw_atomic_long*() directly
  locking/atomic: treewide: use raw_atomic*_<op>()
  locking/atomic: scripts: add trivial raw_atomic*_<op>()
  locking/atomic: scripts: factor out order template generation
  locking/atomic: scripts: remove leftover "${mult}"
  locking/atomic: scripts: remove bogus order parameter
  locking/atomic: xtensa: add preprocessor symbols
  locking/atomic: x86: add preprocessor symbols
  locking/atomic: sparc: add preprocessor symbols
  locking/atomic: sh: add preprocessor symbols
  ...
2023-06-27 14:14:30 -07:00
Mingzhe Zou f0854489fc bcache: fixup btree_cache_wait list damage
We get a kernel crash about "list_add corruption. next->prev should be
prev (ffff9c801bc01210), but was ffff9c77b688237c.
(next=ffffae586d8afe68)."

crash> struct list_head 0xffff9c801bc01210
struct list_head {
  next = 0xffffae586d8afe68,
  prev = 0xffffae586d8afe68
}
crash> struct list_head 0xffff9c77b688237c
struct list_head {
  next = 0x0,
  prev = 0x0
}
crash> struct list_head 0xffffae586d8afe68
struct list_head struct: invalid kernel virtual address: ffffae586d8afe68  type: "gdb_readmem_callback"
Cannot access memory at address 0xffffae586d8afe68

[230469.019492] Call Trace:
[230469.032041]  prepare_to_wait+0x8a/0xb0
[230469.044363]  ? bch_btree_keys_free+0x6c/0xc0 [escache]
[230469.056533]  mca_cannibalize_lock+0x72/0x90 [escache]
[230469.068788]  mca_alloc+0x2ae/0x450 [escache]
[230469.080790]  bch_btree_node_get+0x136/0x2d0 [escache]
[230469.092681]  bch_btree_check_thread+0x1e1/0x260 [escache]
[230469.104382]  ? finish_wait+0x80/0x80
[230469.115884]  ? bch_btree_check_recurse+0x1a0/0x1a0 [escache]
[230469.127259]  kthread+0x112/0x130
[230469.138448]  ? kthread_flush_work_fn+0x10/0x10
[230469.149477]  ret_from_fork+0x35/0x40

bch_btree_check_thread() and bch_dirty_init_thread() may call
mca_cannibalize() to cannibalize other cached btree nodes. Only one thread
can do it at a time, so the op of other threads will be added to the
btree_cache_wait list.

We must call finish_wait() to remove op from btree_cache_wait before
freeing its memory; otherwise, the list will be damaged. We should also
call bch_cannibalize_unlock() to release btree_cache_alloc_lock and wake
up the other waiters.
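
A minimal sketch of the required ordering (names taken from the description
above, surrounding code simplified):

	/* before the checking thread frees the memory that embeds its wait entry */
	finish_wait(&c->btree_cache_wait, &op->wait);	/* unlink from the wait list */
	bch_cannibalize_unlock(c);	/* drop btree_cache_alloc_lock, wake other waiters */
	/* only now is it safe to free the structure containing op->wait */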

Fixes: 8e7102273f ("bcache: make bch_btree_check() to be multithreaded")
Fixes: b144e45fc5 ("bcache: make bch_sectors_dirty_init() to be multithreaded")
Cc: stable@vger.kernel.org
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-7-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-15 07:32:55 -06:00
Zheng Wang 80fca8a10b bcache: Fix __bch_btree_node_alloc to make the failure behavior consistent
In some specific situations, the return value of __bch_btree_node_alloc
may be NULL. This may lead to a potential NULL pointer dereference in
caller functions, for example along the call chain
btree_split->bch_btree_node_alloc->__bch_btree_node_alloc.

Fix it by initializing the return value in __bch_btree_node_alloc.

Fixes: cafe563591 ("bcache: A block layer cache")
Cc: stable@vger.kernel.org
Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-6-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-15 07:32:00 -06:00
Zheng Wang 028ddcac47 bcache: Remove unnecessary NULL point check in node allocations
Due to the previous fix of __bch_btree_node_alloc, the return value will
never be a NULL pointer, so IS_ERR is enough to handle the failure
situation. Fix it by replacing the IS_ERR_OR_NULL check with an IS_ERR check.
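
A small hedged sketch of the resulting check in a caller (argument list
abbreviated from the call chain named in the previous commit):

	struct btree *n = bch_btree_node_alloc(b->c, op, b->level, b->parent);

	if (IS_ERR(n))		/* previously IS_ERR_OR_NULL(n); NULL can no longer happen */
		return n;	/* propagate the ERR_PTR */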

Fixes: cafe563591 ("bcache: A block layer cache")
Cc: stable@vger.kernel.org
Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-5-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-15 07:30:43 -06:00
Kent Overstreet 4c8a49244c bcache: Convert to lock_cmp_fn
Replace one of bcache's lockdep_set_novalidate_class() usages with the
newly introduced custom lock nesting annotation.
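
A hedged sketch of the ordering rule such a comparator expresses (the real
bcache comparator orders by btree level and key; the tie-break below is a
simplified stand-in, not the patch itself):

	static int btree_lock_cmp_fn(const struct lockdep_map *_a,
				     const struct lockdep_map *_b)
	{
		const struct btree *a = container_of(_a, struct btree, lock.dep_map);
		const struct btree *b = container_of(_b, struct btree, lock.dep_map);

		if (a->level != b->level)
			return a->level > b->level ? -1 : 1;	/* parents before children */
		return a < b ? -1 : a > b;			/* simplified tie-break */
	}

	/* the comparator is attached to each btree node's lock via the new lockdep
	 * annotation, replacing lockdep_set_novalidate_class(&b->lock) */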

[peterz: changelog]
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Coly Li <colyli@suse.de>
Link: https://lkml.kernel.org/r/20230509195847.1745548-2-kent.overstreet@linux.dev
2023-05-24 12:21:22 +02:00
Roman Gushchin e33c267ab7 mm: shrinkers: provide shrinkers with names
Currently shrinkers are anonymous objects.  For debugging purposes they
can be identified by count/scan function names, but it's not always
useful: e.g.  for superblock's shrinkers it's nice to have at least an
idea of to which superblock the shrinker belongs.

This commit adds names to shrinkers.  register_shrinker() and
prealloc_shrinker() are extended to take a format string and arguments
used to generate a name.
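
For example, a caller can now pass a printf-style name at registration time
(a hedged sketch; the superblock case mirrors the "sb-<type>" entries listed
below):

	/* shows up under /sys/kernel/debug/shrinker/ as "sb-<fstype>-<id>" */
	err = register_shrinker(&s->s_shrink, "sb-%s", type->name);
	if (err)
		goto fail;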

In some cases it's not possible to determine a good name at the time when
a shrinker is allocated.  For such cases shrinker_debugfs_rename() is
provided.

The expected format is:
    <subsystem>-<shrinker_type>[:<instance>]-<id>
For some shrinkers an instance can be encoded as a (MAJOR:MINOR) pair.

After this change the shrinker debugfs directory looks like:
  $ cd /sys/kernel/debug/shrinker/
  $ ls
    dquota-cache-16     sb-devpts-28     sb-proc-47       sb-tmpfs-42
    mm-shadow-18        sb-devtmpfs-5    sb-proc-48       sb-tmpfs-43
    mm-zspool:zram0-34  sb-hugetlbfs-17  sb-pstore-31     sb-tmpfs-44
    rcu-kfree-0         sb-hugetlbfs-33  sb-rootfs-2      sb-tmpfs-49
    sb-aio-20           sb-iomem-12      sb-securityfs-6  sb-tracefs-13
    sb-anon_inodefs-15  sb-mqueue-21     sb-selinuxfs-22  sb-xfs:vda1-36
    sb-bdev-3           sb-nsfs-4        sb-sockfs-8      sb-zsmalloc-19
    sb-bpf-32           sb-pipefs-14     sb-sysfs-26      thp-deferred_split-10
    sb-btrfs:vda2-24    sb-proc-25       sb-tmpfs-1       thp-zero-9
    sb-cgroup2-30       sb-proc-39       sb-tmpfs-27      xfs-buf:vda1-37
    sb-configfs-23      sb-proc-41       sb-tmpfs-29      xfs-inodegc:vda1-38
    sb-dax-11           sb-proc-45       sb-tmpfs-35
    sb-debugfs-7        sb-proc-46       sb-tmpfs-40

[roman.gushchin@linux.dev: fix build warnings]
  Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle
  Reported-by: kernel test robot <lkp@intel.com>
Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03 18:08:40 -07:00
Coly Li 7d6b902ea0 bcache: memset on stack variables in bch_btree_check() and bch_sectors_dirty_init()
The local variables check_state (in bch_btree_check()) and state (in
bch_sectors_dirty_init()) should be fully zeroed: before they were moved
onto the stack, they were dynamically allocated with kzalloc(), which
zeroed them.
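
A minimal sketch of the intent (simplified):

	struct btree_check_state check_state;

	/* previously kzalloc()'d; on the stack it must be zeroed explicitly */
	memset(&check_state, 0, sizeof(check_state));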

Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20220527152818.27545-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-27 09:49:48 -06:00
Coly Li 622536443b bcache: improve multithreaded bch_btree_check()
Commit 8e7102273f ("bcache: make bch_btree_check() to be
multithreaded") makes bch_btree_check() much faster when checking all
btree nodes during cache device registration. But it isn't in ideal
shape yet and can still be improved.

This patch does the following things to improve the current parallel
btree node check by multiple threads in bch_btree_check():
- Add a read lock on the root node while checking all the btree nodes with
  multiple threads. Although it is not currently mandatory, it is good to
  hold a read lock in the code logic.
- Remove the local variable 'char name[32]', and generate the kernel thread
  name string directly when calling kthread_run().
- Allocate the local variable "struct btree_check_state check_state" on the
  stack and avoid unnecessary dynamic memory allocation for it.
- Reduce BCH_BTR_CHKTHREAD_MAX from 64 to 12, which is more than enough.
- Increment check_state->started only after a kernel thread is successfully
  created.
- When waiting for all checking kernel threads to finish, use wait_event()
  instead of wait_event_interruptible().

With this change, the code is more clear, and some potential error
conditions are avoided.

Fixes: 8e7102273f ("bcache: make bch_btree_check() to be multithreaded")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220524102336.10684-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-24 06:19:33 -06:00
Mingzhe Zou 887554ab96 bcache: fixup multiple threads crash
When multiple threads check btree nodes in parallel, the main thread
waits for all threads to stop or for the CACHE_SET_IO_DISABLE flag:

wait_event_interruptible(check_state->wait,
                         atomic_read(&check_state->started) == 0 ||
                         test_bit(CACHE_SET_IO_DISABLE, &c->flags));

However, bch_btree_node_read and bch_btree_node_read_done may call
bch_cache_set_error, which sets CACHE_SET_IO_DISABLE. If the flag is
already set, the main thread returns an error. At the same time, some
threads may still be running and dereference a NULL pointer, and the
kernel will crash.

This patch changes the event wait condition: the main thread must
wait for all threads to stop.
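
A hedged sketch of the changed wait (the essential change is dropping the
CACHE_SET_IO_DISABLE early-out quoted above):

	wait_event_interruptible(check_state->wait,
				 atomic_read(&check_state->started) == 0);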

Fixes: 8e7102273f ("bcache: make bch_btree_check() to be multithreaded")
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Coly Li <colyli@suse.de>
2022-03-06 22:33:45 +08:00
Coly Li 2878feaed5 bcache: Revert "bcache: use bvec_virt"
This reverts commit 2fd3e5efe7.

The above commit replaces page_address(bv->bv_page) with bvec_virt(bv) to
avoid directly accessing bv->bv_page, but when bv->bv_offset is not zero,
page_address(bv->bv_page) is not equal to bvec_virt(bv). In such a case
memory corruption may happen, because memory in the next page is tainted
by the following line in do_btree_node_write():
	memcpy(bvec_virt(bv), addr, PAGE_SIZE);

This patch reverts the mentioned commit to avoid the memory corruption.
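
The distinction at the heart of the revert, as a small sketch (bvec_virt()
is defined along these lines in the bvec header):

	/* bvec_virt() includes the offset within the page ... */
	static inline void *bvec_virt(struct bio_vec *bv)
	{
		return page_address(bv->bv_page) + bv->bv_offset;
	}

	/* ... so with bv_offset != 0, memcpy(bvec_virt(bv), addr, PAGE_SIZE)
	 * writes past the end of bv_page, unlike page_address(bv->bv_page) */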

Fixes: 2fd3e5efe7 ("bcache: use bvec_virt")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org # 5.15
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211103151041.70516-1-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-08 06:23:17 -07:00
Christoph Hellwig 39fa7a9555 bcache: remove bch_crc64_update
bch_crc64_update is an entirely pointless wrapper around crc64_be.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20211020143812.6403-9-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 08:40:54 -06:00
Christoph Hellwig 2fd3e5efe7 bcache: use bvec_virt
Use bvec_virt instead of open coding it.  Note that the existing code is
fine despite ignoring bv_offset as the bio is known to contain exactly
one page from the page allocator per bio_vec.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20210804095634.460779-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-16 10:50:33 -06:00
Christoph Hellwig 11e9560e6c bcache: remove PTR_CACHE
Remove the PTR_CACHE inline and replace it with a direct dereference
of c->cache.

(Coly Li: fix the typo from PTR_BUCKET to PTR_CACHE in commit log)

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20210411134316.80274-3-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 08:37:55 -06:00
Kai Krakow d797bd9897 bcache: Give btree_io_wq correct semantics again
Before killing `btree_io_wq`, the queue was allocated using
`create_singlethread_workqueue()`, which has `WQ_MEM_RECLAIM`. After
killing it and moving the work to `system_wq`, that property was lost,
while `system_wq` is not single threaded either.

Let's combine both worlds and make it multi threaded but able to
reclaim memory.
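
A hedged sketch of the re-created queue (the queue name string is an
assumption):

	/* max_active = 0 keeps the default concurrency (multi-threaded), while
	 * WQ_MEM_RECLAIM guarantees a rescuer so btree I/O can always progress */
	btree_io_wq = alloc_workqueue("bch_btree_io", WQ_MEM_RECLAIM, 0);
	if (!btree_io_wq)
		return -ENOMEM;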

Cc: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org # 5.4+
Signed-off-by: Kai Krakow <kai@kaishome.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 08:06:00 -07:00
Kai Krakow 9f233ffe02 Revert "bcache: Kill btree_io_wq"
This reverts commit 56b30770b2.

With the btree using the `system_wq`, I seem to see a lot more desktop
latency than I should.

After some more investigation, it looks like the original assumption
of 56b3077 is no longer true, and bcache has a very high potential of
congesting the `system_wq`. In turn, this introduces laggy desktop
performance, IO stalls (at least with btrfs), and delayed input events.

So let's revert this. It's important to note that the semantics of
using `system_wq` previously mean that `btree_io_wq` should be created
before and destroyed after other bcache wqs to keep the same
assumptions.

Cc: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org # 5.4+
Signed-off-by: Kai Krakow <kai@kaishome.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 08:06:00 -07:00
Coly Li 4a784266c6 bcache: remove embedded struct cache_sb from struct cache_set
Since bcache code was merged into the mainline kernel, each cache set has
only one single cache in it. The multiple-caches framework is there but the
code is far from complete. Considering that multiple copies of cached
data can also be stored on e.g. md raid1 devices, there is no real need to
support multiple caches in one cache set.

The previous preparation patches fix the dependencies of explicitly
making a cache set have only a single cache. Now we don't have to maintain
an embedded partial super block in struct cache_set; the in-memory super
block can be directly referenced from struct cache.

This patch removes the embedded struct cache_sb from struct cache_set,
and fixes all locations that referenced this removed super block to
reference the in-memory super block of struct cache instead.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02 14:25:30 -06:00
Coly Li 4e1ebae3ee bcache: only use block_bytes() on struct cache
Because struct cache_set and struct cache both contain a struct cache_sb,
the macro block_bytes() can be used on both of them. Once the embedded
struct cache_sb is removed from struct cache_set, this macro won't be
used on struct cache_set anymore.

This patch unifies all block_bytes() usage only on struct cache, this is
one of the preparation to remove the embedded struct cache_sb from
struct cache_set.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02 14:25:30 -06:00
Coly Li 08fdb2cddb bcache: remove for_each_cache()
Since each cache_set now explicitly has a single cache, for_each_cache()
is unnecessary. This patch removes the macro, updates all locations
where it was used, and makes sure the code logic remains consistent.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02 14:25:29 -06:00
Coly Li 17e4aed830 bcache: remove 'int n' from parameter list of bch_bucket_alloc_set()
The parameter 'int n' of bch_bucket_alloc_set() is not clearly
defined. From the code comments n is the number of buckets to allocate,
but from the code itself 'n' is the maximum number of caches to iterate.
In all the locations where bch_bucket_alloc_set() is called, 'n' is
always 1.

This patch removes the confusing and unnecessary 'int n' from the
parameter list of bch_bucket_alloc_set(), and explicitly allocates only
1 bucket for its callers.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02 14:25:29 -06:00
Dongsheng Yang 7e59c506c3 bcache: check c->root with IS_ERR_OR_NULL() in mca_reserve()
In the mca_reserve(c) macro, we check whether root is NULL or not.
But that's not enough: when we read the root node in run_cache_set(),
if we get an error in bch_btree_node_read_done(), ERR_PTR(-EIO) will be
assigned to c->root.

We then continue to unregister, but before calling
unregister_shrinker(&c->shrink) there is a possibility that
bch_mca_count() is called, and we get a crash with a call trace like this:

[ 2149.876008] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000b5
... ...
[ 2150.598931] Call trace:
[ 2150.606439]  bch_mca_count+0x58/0x98 [escache]
[ 2150.615866]  do_shrink_slab+0x54/0x310
[ 2150.624429]  shrink_slab+0x248/0x2d0
[ 2150.632633]  drop_slab_node+0x54/0x88
[ 2150.640746]  drop_slab+0x50/0x88
[ 2150.648228]  drop_caches_sysctl_handler+0xf0/0x118
[ 2150.657219]  proc_sys_call_handler.isra.18+0xb8/0x110
[ 2150.666342]  proc_sys_write+0x40/0x50
[ 2150.673889]  __vfs_write+0x48/0x90
[ 2150.681095]  vfs_write+0xac/0x1b8
[ 2150.688145]  ksys_write+0x6c/0xd0
[ 2150.695127]  __arm64_sys_write+0x24/0x30
[ 2150.702749]  el0_svc_handler+0xa0/0x128
[ 2150.710296]  el0_svc+0x8/0xc

Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02 14:25:29 -06:00
Coly Li bf6af17065 bcache: handle cache set verify_ondisk properly for bucket size > 8MB
In bch_btree_cache_alloc(), when CONFIG_BCACHE_DEBUG is configured,
allocating memory for c->verify_ondisk may fail if the bucket size is > 8MB,
which requires __get_free_pages() to allocate contiguous pages with
order > 11 (the default MAX_ORDER of the Linux buddy allocator). Such an
oversized allocation will fail and cause 2 problems:
- When CONFIG_BCACHE_DEBUG is configured, bch_btree_verify() does not
  work, because c->verify_ondisk is NULL and bch_btree_verify() returns
  immediately.
- bch_btree_cache_alloc() will fail due to the failed c->verify_ondisk
  allocation, so the whole cache device registration fails. And because of
  this failure, the first problem of bch_btree_verify() never has a chance
  to be triggered.

This patch fixes the above problems by two means:
1) If page allocation for c->verify_ondisk fails, set it to NULL and
   return -ENOMEM from bch_btree_cache_alloc().
2) When calling __get_free_pages() to allocate c->verify_ondisk pages,
   use ilog2(meta_bucket_pages(&c->sb)) to make sure the resulting page
   order is always <= MAX_ORDER (or CONFIG_FORCE_MAX_ZONEORDER).
   Then the buddy system won't directly reject the allocation request.
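
A hedged sketch combining the two means above (the surrounding function is
abbreviated):

	c->verify_ondisk = (void *)__get_free_pages(GFP_KERNEL,
				ilog2(meta_bucket_pages(&c->sb)));
	if (!c->verify_ondisk)
		/* fail bch_btree_cache_alloc() cleanly instead of crashing later */
		return -ENOMEM;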

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 07:38:20 -06:00
Coly Li 5fe4886785 bcache: allocate meta data pages as compound pages
Some bcache metadata is allocated as multiple pages and used as bio
bv_page for I/O to the cache device, for example cache_set->uuids,
cache->disk_buckets, journal_write->data and bset_tree->data.

For such metadata memory, all the allocated pages should be treated
as a single memory block, so the memory management and underlying I/O
code can handle them more cleanly.

This patch adds the __GFP_COMP flag to all the locations allocating
order > 0 pages for the above metadata. Their pages are then treated
as compound pages.
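
A minimal illustration of the change at one such allocation site:

	/* before: order-n pages not linked together */
	p = (void *)__get_free_pages(GFP_KERNEL, order);

	/* after: mark the allocation as a single compound page */
	p = (void *)__get_free_pages(GFP_KERNEL | __GFP_COMP, order);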

Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 07:38:19 -06:00
Christoph Hellwig ed00aabd5e block: rename generic_make_request to submit_bio_noacct
generic_make_request has always been very confusingly misnamed, so rename
it to submit_bio_noacct to make it clear that it is submit_bio minus
accounting and a few checks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 07:27:24 -06:00
Zhiqiang Liu be23e83733 bcache: fix potential deadlock problem in btree_gc_coalesce
coccicheck reports:
  drivers/md//bcache/btree.c:1538:1-7: preceding lock on line 1417

In the btree_gc_coalesce function, if the coalescing process fails, we go
to the out_nocoalesce tag directly without releasing
new_nodes[i]->write_lock. This then causes a deadlock when trying to
acquire new_nodes[i]->write_lock in order to free new_nodes[i] before
returning.

The btree_gc_coalesce flow is as follows:
	if alloc new_nodes[i] fails:
		goto out_nocoalesce;
	// obtain new_nodes[i]->write_lock
	mutex_lock(&new_nodes[i]->write_lock)
	// main coalescing process
	for (i = nodes - 1; i > 0; --i)
		[snipped]
		if coalescing process fails:
			// Here, directly goto out_nocoalesce
			 // tag will cause a deadlock
			goto out_nocoalesce;
		[snipped]
	// release new_nodes[i]->write_lock
	mutex_unlock(&new_nodes[i]->write_lock)
	// coalescing succeeded, return
	return;
out_nocoalesce:
	btree_node_free(new_nodes[i])	// free new_nodes[i]
	// obtain new_nodes[i]->write_lock
	mutex_lock(&new_nodes[i]->write_lock);
	// set flag for reuse
	clear_bit(BTREE_NODE_dirty, &new_nodes[i]->flags);
	// release new_nodes[i]->write_lock
	mutex_unlock(&new_nodes[i]->write_lock);

To fix the problem, we add a new tag 'out_unlock_nocoalesce' that
releases new_nodes[i]->write_lock before the out_nocoalesce tag. If the
coalescing process fails, we go to the out_unlock_nocoalesce tag to
release new_nodes[i]->write_lock before freeing new_nodes[i] under the
out_nocoalesce tag.
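
A hedged sketch of the new unwind path (loop bounds and cleanup simplified):

	out_unlock_nocoalesce:
		for (i = 0; i < nodes; i++)
			mutex_unlock(&new_nodes[i]->write_lock);

	out_nocoalesce:
		/* existing cleanup: btree_node_free() may now retake write_lock safely */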

(Coly Li helps to clean up commit log format.)

Fixes: 2a285686c1 ("bcache: btree locking rework")
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-14 16:47:56 -06:00
Joe Perches 46f5aa8806 bcache: Convert pr_<level> uses to a more typical style
Remove the trailing newline from the define of pr_fmt and add newlines
to the uses.

Miscellanea:

o Convert bch_bkey_dump from multiple uses of pr_err to pr_cont
  as the earlier conversion was inappropriately done, causing multiple
  lines to be emitted where only a single output line was desired
o Use vsprintf extension %pV in bch_cache_set_error to avoid multiple
  line output where only a single line output was desired
o Coalesce formats

Fixes: 6ae63e3501 ("bcache: replace printk() by pr_*() routines")

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27 05:19:36 -06:00
Colin Ian King 3b5b7b1f70 bcache: remove redundant variables i and n
Variables i and n are being assigned but are never used. They are
redundant and can be removed.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Coly Li <colyli@suse.de>
Addresses-Coverity: ("Unused value")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27 05:19:36 -06:00
Coly Li eb9b6666d6 bcache: optimize barrier usage for atomic operations
The idea of this patch is from Davidlohr Bueso, who posted a patch
for bcache to optimize barrier usage for read-modify-write atomic
bitops. Indeed such optimization can also apply to other locations
where smp_mb() is used before or after an atomic operation.

This patch replaces smp_mb() with smp_mb__before_atomic() or
smp_mb__after_atomic() in btree.c and writeback.c, where it is used
to synchronize memory caches with other cores a bit earlier. Although
the locations are not on a hot code path, it never hurts to make
things a little better.
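
A small illustration of the substitution (the bit operation shown is
representative, not a specific line from the patch):

	/* before: full barrier, heavier than needed on some architectures */
	smp_mb();
	set_bit(CACHE_SET_IO_DISABLE, &c->flags);

	/* after: barrier specialised for a following non-value-returning atomic RMW */
	smp_mb__before_atomic();
	set_bit(CACHE_SET_IO_DISABLE, &c->flags);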

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-22 10:06:57 -06:00
Coly Li 8e7102273f bcache: make bch_btree_check() to be multithreaded
When registering a cache device, bch_btree_check() is called to check
all btree nodes, to make sure the btree is consistent and not
corrupted.

bch_btree_check() is executed recursively in a single thread; when a lot
of data is cached and the btree is huge, it may take a very long time
to check all the btree nodes. In my testing, I observed it took
around 50 minutes to finish bch_btree_check().

When checking the bcache btree nodes, the cache set is not running yet
and the whole tree is in a read-only state, so it is safe to create
multiple threads to check the btree in parallel.

This patch creates multiple threads, and each thread checks, one by one,
the sub-trees indexed by keys from the btree root node. The parallel
thread count depends on how many keys are in the btree root node. At
most BCH_BTR_CHKTHREAD_MAX (64) threads can be created, but in practice
it should be min(cpu-number/2, root-node-keys-number).

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-22 10:06:57 -06:00
Coly Li feac1a70b8 bcache: add bcache_ prefix to btree_root() and btree() macros
This patch changes the macros btree_root() and btree() to bcache_btree_root()
and bcache_btree(), to avoid potential generic name clashes in the future.

NOTE: for product kernel maintainers, this patch can be skipped if
you feel the renames introduce inconvenience to patch backporting.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-22 10:06:57 -06:00
Coly Li 253a99d95d bcache: move macro btree() and btree_root() into btree.h
In order to accelerate bcache registration, the macros btree() and
btree_root() will be referenced outside of btree.c. This patch moves
them from btree.c into btree.h, together with the related function
declarations, in preparation for the following changes.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-22 10:06:57 -06:00
Jens Axboe 764b53b26c Revert "bcache: ignore pending signals when creating gc and allocator thread"
This reverts commit 0b96da639a.

We can't just go flushing random signals, under the assumption that the
OOM killer will just do something else. It's not safe from the OOM
perspective, and it could also cause other signals to get randomly lost.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 20:01:32 -07:00
Coly Li 0b96da639a bcache: ignore pending signals when creating gc and allocator thread
When running a cache set, all the bcache btree nodes of this cache set will
be checked by bch_btree_check(). If the bcache btree is very large,
iterating all the btree nodes will occupy too much system memory and
the bcache registering process might be selected and killed by the system
OOM killer. kthread_run() will fail if the current process has a pending
signal, therefore the kthread creation in run_cache_set() for the gc and
allocator kernel threads will very probably fail for a very large
bcache btree.

Indeed such an OOM is safe and the registering process will exit after
the registration is done. Therefore this patch flushes pending signals
during the cache set start up, specifically in bch_cache_allocator_start()
and bch_gc_thread_start(), to make sure run_cache_set() won't fail for
large cached data sets.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-13 08:53:49 -07:00
Coly Li e3de04469a bcache: reap from tail of c->btree_cache in bch_mca_scan()
When shrinking the btree node cache from c->btree_cache in bch_mca_scan(),
whether the selected node is reaped or not, it will be rotated from
the head to the tail of the c->btree_cache list. But in the bcache journal
code, when flushing the btree nodes with the oldest journal entry, btree
nodes are iterated and selected from the tail of the c->btree_cache list in
btree_flush_write(). The list_rotate_left() in bch_mca_scan() makes
btree_flush_write() iterate more nodes in c->btree_cache in reverse
order.

This patch just reaps the selected btree node cache, and does not move it
from the head to the tail of the c->btree_cache list. Then bch_mca_scan()
will not mess up the c->btree_cache list for btree_flush_write().

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23 11:40:02 -07:00
Coly Li d5c9c470b0 bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan()
In order to skip the most recently freed btree node caches, currently
in bch_mca_scan() the first 3 caches in the c->btree_cache_freeable list
are skipped when shrinking bcache node caches. The related code in
bch_mca_scan() is,

 737 list_for_each_entry_safe(b, t, &c->btree_cache_freeable, list) {
 738         if (nr <= 0)
 739                 goto out;
 740
 741         if (++i > 3 &&
 742             !mca_reap(b, 0, false)) {
             		lines free cache memory
 746         }
 747         nr--;
 748 }

The problem is, if the virtual memory code calls bch_mca_scan() and
the calculated 'nr' is 1 or 2, then in the above loop nothing will
be shrunk. In such a case, if the slub/slab manager calls bch_mca_scan()
many times with a small scan number, it does not help to shrink
cache memory and just wastes CPU cycles.

This patch selects btree node caches from the tail of the
c->btree_cache_freeable list instead, so the most recently freed caches
can still be allocated by mca_alloc(), and at least 1 node can be shrunk.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23 11:40:02 -07:00
Coly Li 125d98edd1 bcache: remove member accessed from struct btree
The member 'accessed' of struct btree is used in bch_mca_scan() when
shrinking btree node caches. The original idea is, if b->accessed is
set, clear it and look at the next btree node cache on the c->btree_cache
list, and only shrink the caches whose b->accessed is cleared. Then
only cold btree node caches will be shrunk.

But when I/O pressure is high, it is very probable that b->accessed
of a btree node cache will be set again in bch_btree_node_get()
before bch_mca_scan() selects it again. Then there is no chance for
bch_mca_scan() to shrink enough memory back to the slub or slab system.

This patch removes the member accessed from struct btree, so once a
btree node cache is selected, it is shrunk immediately. With this
change, bch_mca_scan() may release btree node caches more efficiently.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23 11:40:02 -07:00
Jens Axboe 00b89892c8 Revert "bcache: fix fifo index swapping condition in journal_pin_cmp()"
Coly says:

"Guoju Fang talked to me today, he told me this change was unnecessary
and I was over-thought.

Then I realize fifo_idx() uses a mask to handle the array index overflow
condition, so the index swap in journal_pin_cmp() won't happen. And yes,
Guoju and Kent are correct.

Since you already applied this patch, can you please to remove this
patch from your for-next branch? This single patch does not break
things, but it is unnecessary at this moment."

This reverts commit c0e0954e90.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-18 08:35:47 -07:00
Coly Li 9fcc34b1a6 bcache: at least try to shrink 1 node in bch_mca_scan()
In bch_mca_scan(), the number of btree nodes to shrink is calculated
by code like this,
	unsigned long nr = sc->nr_to_scan;

        nr /= c->btree_pages;
        nr = min_t(unsigned long, nr, mca_can_free(c));
The variable sc->nr_to_scan is the number of objects (here, the number of
bcache B+tree nodes) to shrink, and the pointer variable sc is passed from
the memory management code as a parameter of the callback.

If sc->nr_to_scan is smaller than c->btree_pages, after the above
calculation variable 'nr' will be 0 and nothing will be shrunk. It is
frequently observed that sc->nr_to_scan is only set to 1 or 2, which makes
nr zero. Then bch_mca_scan() will do nothing more than acquiring
and releasing the mutex c->bucket_lock.

This patch checks whether nr is 0 after the above calculation; if so, it
is set to 1, so that bch_mca_scan() will at least try to shrink a single
B+tree node.
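
A minimal sketch of the fix against the calculation quoted above:

	unsigned long nr = sc->nr_to_scan;

	nr /= c->btree_pages;
	if (nr == 0)
		nr = 1;		/* always try to shrink at least one B+tree node */
	nr = min_t(unsigned long, nr, mca_can_free(c));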

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 15:42:50 -07:00
Coly Li 5dccefd3ea bcache: add code comments in bch_btree_leaf_dirty()
This patch adds code comments in bch_btree_leaf_dirty() to explain
why w->journal should always reference the eldest journal pin of
all the bkeys being written into the btree node, to make the bcache
journal code easier to understand.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 15:42:50 -07:00
Guoju Fang 34cf78bf34 bcache: fix a lost wake-up problem caused by mca_cannibalize_lock
This patch fixes a lost wake-up problem caused by the race between
mca_cannibalize_lock and bch_cannibalize_unlock.

Consider two processes, A and B. Process A is executing
mca_cannibalize_lock, while process B holds c->btree_cache_alloc_lock
and is executing bch_cannibalize_unlock. The problem happens in the
window after process A executes cmpxchg but before it executes
prepare_to_wait. In this time slice process B executes wake_up; only
afterwards does process A execute prepare_to_wait and set its state to
TASK_INTERRUPTIBLE. Then process A goes to sleep but nobody will ever
wake it up. This problem may cause the bcache device to hang.
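
The general shape of the lost-wakeup-safe pattern, as a hedged sketch
(generic names, not the literal bcache code):

	DEFINE_WAIT(wait);

	/* register on the waitqueue *before* the final condition check, so a
	 * wake_up() that lands in between is not lost */
	prepare_to_wait(&wq, &wait, TASK_INTERRUPTIBLE);
	if (!condition_is_true())	/* hypothetical re-check */
		schedule();
	finish_wait(&wq, &wait);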

Signed-off-by: Guoju Fang <fangguoju@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 15:42:50 -07:00
Coly Li c0e0954e90 bcache: fix fifo index swapping condition in journal_pin_cmp()
The fifo structure journal.pin is implemented as a circular buffer; when
the back index reaches the highest location of the buffer, it wraps
around to 0. Once the wrap happens, a smaller fifo index might be
associated with a newer journal entry, so the btree node with the oldest
journal entry won't be selected in bch_btree_leaf_dirty() to reference
the dirty B+tree leaf node. This means the bcache journal may fail to
protect an unflushed oldest dirty B+tree leaf node across a power
failure, and this B+tree leaf node may be inconsistent after rebooting
from the power failure.

This patch fixes the fifo index comparison logic in journal_pin_cmp(),
to avoid a potentially corrupted B+tree leaf node when the back index of
the journal pin wraps around.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 15:42:49 -07:00
Coly Li 50a260e859 bcache: fix race in btree_flush_write()
There is a race between mca_reap(), btree_node_free() and the journal code
in btree_flush_write(), which results in very rare and strange deadlocks
or panics that are very hard to reproduce.

Let me explain how the race happens. In btree_flush_write() the btree
node with the oldest journal pin is selected, then it is flushed to the
cache device; the select-and-flush is a two-step operation. Between these
two steps, several things may happen inside the race window,
- The selected btree node was reaped by mca_reap() and allocated to
  another requester for another btree node.
- The selected btree node was selected, flushed and released by the mca
  shrink callback bch_mca_scan().
When btree_flush_write() tries to flush the selected btree node, firstly
b->write_lock is held by mutex_lock(). If the race happens and the
memory of the selected btree node has been allocated to another btree
node whose write_lock is already held, a deadlock very probably happens
here. A worse case is that the memory of the selected btree node has been
released; then all references to this btree node (e.g. b->write_lock)
will trigger a NULL pointer dereference panic.

This race was introduced in commit cafe563591 ("bcache: A block layer
cache"), and enlarged by commit c4dc2497d5 ("bcache: fix high CPU
occupancy during journal"), which selected 128 btree nodes and flushed
them one-by-one in a quite long time period.

Such a race was not easy to reproduce before. On a Lenovo SR650 server
with 48 Xeon cores, configured with 1 NVMe SSD as cache device and an MD
raid0 device assembled from 3 NVMe SSDs as backing device, this race can
be observed around once every 10,000 calls of btree_flush_write(). Both
deadlocks and kernel panics happened as an aftermath of the race.

The idea of the fix is to add a btree flag BTREE_NODE_journal_flush. It
is set when selecting btree nodes, and cleared after the btree nodes are
flushed. Then when mca_reap() selects a btree node with this bit set,
the btree node will be skipped. Since mca_reap() only reaps btree nodes
without the BTREE_NODE_journal_flush flag, such a race is avoided.

One corner case should be noticed, which is btree_node_free(). It might
be called in some error handling code paths. For example the following
code piece from btree_split(),
        2149 err_free2:
        2150         bkey_put(b->c, &n2->key);
        2151         btree_node_free(n2);
        2152         rw_unlock(true, n2);
        2153 err_free1:
        2154         bkey_put(b->c, &n1->key);
        2155         btree_node_free(n1);
        2156         rw_unlock(true, n1);
At lines 2151 and 2155, the btree nodes n2 and n1 are released without
mca_reap(), so BTREE_NODE_journal_flush also needs to be checked here.
If btree_node_free() is called directly in such an error handling path,
and the selected btree node has the BTREE_NODE_journal_flush bit set, just
delay for 1 us and retry. In this case the btree node won't be skipped;
the code simply retries until the BTREE_NODE_journal_flush bit is cleared,
and then frees the btree node memory.
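
A hedged sketch of the two sides of the new flag (the helper names are
assumed to follow the existing BTREE_FLAG() convention):

	/* journal side: mark the node for the whole select-and-flush window */
	set_btree_node_journal_flush(b);
	/* ... flush the node to the cache device ... */
	clear_bit(BTREE_NODE_journal_flush, &b->flags);

	/* mca_reap() side: never reap a node the journal code is still flushing */
	if (btree_node_journal_flush(b)) {
		rw_unlock(true, b);
		return -ENOMEM;
	}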

Fixes: cafe563591 ("bcache: A block layer cache")
Signed-off-by: Coly Li <colyli@suse.de>
Reported-and-tested-by: kbuild test robot <lkp@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:18 -06:00
Coly Li 41508bb7d4 bcache: add comments for mutex_lock(&b->write_lock)
When accessing or modifying BTREE_NODE_dirty bit, it is not always
necessary to acquire b->write_lock. In bch_btree_cache_free() and
mca_reap() acquiring b->write_lock is necessary, and this patch adds
comments to explain why mutex_lock(&b->write_lock) is necessary for
checking or clearing BTREE_NODE_dirty bit there.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:17 -06:00
Coly Li e5ec5f4765 bcache: only clear BTREE_NODE_dirty bit when it is set
In bch_btree_cache_free() and btree_node_free(), the BTREE_NODE_dirty bit
is always cleared no matter whether the btree node is dirty or not. The
code looks like this,
	if (btree_node_dirty(b))
		btree_complete_write(b, btree_current_write(b));
	clear_bit(BTREE_NODE_dirty, &b->flags);

Indeed if btree_node_dirty(b) returns false, it means the BTREE_NODE_dirty
bit is already cleared, so it is unnecessary to clear the bit again.

This patch only clears BTREE_NODE_dirty when btree_node_dirty(b) is
true (the bit is set), to save a few CPU cycles.
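
The resulting shape, for clarity (a sketch of the described change):

	if (btree_node_dirty(b)) {
		btree_complete_write(b, btree_current_write(b));
		clear_bit(BTREE_NODE_dirty, &b->flags);
	}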

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:17 -06:00
Coly Li bd9026c8a7 bcache: remove unncessary code in bch_btree_keys_init()
Function bch_btree_keys_init() initializes b->set[].size and
b->set[].data to zero. As the code comments indicate, this code is indeed
unnecessary, because both struct btree_keys and struct bset_tree are
embedded into struct btree; when struct btree is zero-filled by kzalloc()
in mca_bucket_alloc(), b->set[].size and b->set[].data are already
initialized to 0 (a.k.a NULL).

This patch removes the redundant code, and adds comments in
bch_btree_keys_init() and mca_bucket_alloc() to explain why it's safe.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:15 -06:00
Christoph Hellwig 2b070cfe58 block: remove the i argument to bio_for_each_segment_all
We only have two callers that need the integer loop iterator, and they
can easily maintain it themselves.
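
After this change the macro takes only the bio, the bio_vec cursor and the
bvec_iter_all iterator; a caller that still wants an index keeps it locally
(a hedged sketch):

	struct bio_vec *bvec;
	struct bvec_iter_all iter_all;
	int i = 0;	/* only if this caller really needs an index */

	bio_for_each_segment_all(bvec, bio, iter_all) {
		/* ... use bvec ... */
		i++;	/* maintained by the caller, no longer by the macro */
	}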

Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 09:26:13 -06:00
Christoph Hellwig f936b06ae5 bcache: clean up do_btree_node_write a bit
Use a variable containing the buffer address instead of the
soon-to-be-removed integer iterator from bio_for_each_segment_all.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 09:26:11 -06:00
Shenghui Wang f16277ca20 bcache: fix wrong usage use-after-freed on keylist in out_nocoalesce branch of btree_gc_coalesce
Elements of the keylist should be accessed before the list is freed.
Move the bch_keylist_free() call to after the while loop to avoid
accessing freed content.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:29 -06:00
Ming Lei 6dc4f100c1 block: allow bio_for_each_segment_all() to iterate over multi-page bvec
This patch introduces one extra iterator variable to bio_for_each_segment_all(),
so that bio_for_each_segment_all() can iterate over multi-page bvecs.

Given it is just a mechanical & simple change for all bio_for_each_segment_all()
users, this patch makes the tree-wide change in one single patch, so that we
can avoid using a temporary helper for this conversion.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15 08:40:11 -07:00