Commit graph

376 commits

Author SHA1 Message Date
Christoph Hellwig
05bdb99653 block: replace fmode_t with a block-specific type for block open flags
The only overlap between the block open flags mapped into the fmode_t and
other uses of fmode_t are FMODE_READ and FMODE_WRITE.  Define a new
blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and
->ioctl and stop abusing fmode_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jack Wang <jinpu.wang@ionos.com>		[rnbd]
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:05 -06:00
Christoph Hellwig
ae220766d8 block: remove the unused mode argument to ->release
The mode argument to the ->release block_device_operation is never used,
so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Jack Wang <jinpu.wang@ionos.com>			[rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig
d32e2bf837 block: pass a gendisk to ->open
->open is only called on the whole device.  Make that explicit by
passing a gendisk instead of the block_device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Jack Wang <jinpu.wang@ionos.com>		[rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Zhong Jinghua
f12bc113ce nbd: Add the maximum limit of allocated index in nbd_dev_add
If the index allocated by idr_alloc greater than MINORMASK >> part_shift,
the device number will overflow, resulting in failure to create a block
device.

Fix it by imiting the size of the max allocation.

Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230605122159.2134384-1-zhongjinghua@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-07 07:50:48 -06:00
Ivan Orlov
4913cfcf01 nbd: Fix debugfs_create_dir error checking
The debugfs_create_dir function returns ERR_PTR in case of error, and the
only correct way to check if an error occurred is 'IS_ERR' inline function.
This patch will replace the null-comparison with IS_ERR.

Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com>
Link: https://lore.kernel.org/r/20230512130533.98709-1-ivan.orlov0322@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-12 08:56:33 -06:00
Eric Blake
bd9e9916c3 block nbd: use req.cookie instead of req.handle
The NBD spec was recently changed [1] to refer to the opaque client
identifier as a 'cookie' rather than a 'handle', but has for a much
longer time listed it as a 64-bit value, and declares that all values
in the NBD protocol are sent in network byte order (big-endian).

Because the value is opaque to the server, it doesn't usually matter
what endianness we send as the client - as long as we are consistent
that either we byte-swap on both write and read, or on neither, then
we can match server replies back to our requests.  That said, our
internal use of the cookie is as a 64-bit number (well, as two 32-bit
numbers concatenated together), rather than as 8 individual bytes; so
prior to this commit, we ARE leaking the native endianness of our
internals as a client out to the server.  We don't know of any server
that will actually inspect the opaque value and behave differently
depending on whether a little-endian or big-endian client is sending
requests, but since we DO log the cookie value, a wireshark capture of
the network traffic is easier to correlate back to the kernel traffic
of a big-endian host (where the u64 and char[8] representations are
the same) than of a little-endian host (where if wireshark honors the
NBD spec and displays a u64 in network byte order, it is byte-swapped
from what the kernel logged).

The fix in this patch is thus two-part: it now consistently uses
network byte order for the opaque value (no difference to a big-endian
machine, but an extra byteswap on a little-endian machine; probably in
the noise compared to the overhead of network traffic in general), and
now uses a 64-bit integer instead of char[8] as its preferred access
to the opaque value (direct assignment instead of memcpy()).

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20230410180611.1051618-4-eblake@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-27 19:15:11 -06:00
Chaitanya Kulkarni
3f89ac587b block/drivers: remove dead clear of random flag
QUEUE_FLAG_ADD_RANDOM is not set before we clear it for "null_blk",
"brd", "nbd", "zram", and "bcache" since by default we don't set
"QUEUE_FLAG_ADD_RANDOM" to MQ ops.

Remove dead clear of QUEUE_FLAG_ADD_RANDOM in above listed drivers.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> #zram
Link: https://lore.kernel.org/r/20230424234628.45544-2-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-25 08:02:11 -06:00
Zhong Jinghua
55793ea54d nbd: fix incomplete validation of ioctl arg
We tested and found an alarm caused by nbd_ioctl arg without verification.
The UBSAN warning calltrace like below:

UBSAN: Undefined behaviour in fs/buffer.c:1709:35
signed integer overflow:
-9223372036854775808 - 1 cannot be represented in type 'long long int'
CPU: 3 PID: 2523 Comm: syz-executor.0 Not tainted 4.19.90 #1
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace+0x0/0x3f0 arch/arm64/kernel/time.c:78
 show_stack+0x28/0x38 arch/arm64/kernel/traps.c:158
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x170/0x1dc lib/dump_stack.c:118
 ubsan_epilogue+0x18/0xb4 lib/ubsan.c:161
 handle_overflow+0x188/0x1dc lib/ubsan.c:192
 __ubsan_handle_sub_overflow+0x34/0x44 lib/ubsan.c:206
 __block_write_full_page+0x94c/0xa20 fs/buffer.c:1709
 block_write_full_page+0x1f0/0x280 fs/buffer.c:2934
 blkdev_writepage+0x34/0x40 fs/block_dev.c:607
 __writepage+0x68/0xe8 mm/page-writeback.c:2305
 write_cache_pages+0x44c/0xc70 mm/page-writeback.c:2240
 generic_writepages+0xdc/0x148 mm/page-writeback.c:2329
 blkdev_writepages+0x2c/0x38 fs/block_dev.c:2114
 do_writepages+0xd4/0x250 mm/page-writeback.c:2344

The reason for triggering this warning is __block_write_full_page()
-> i_size_read(inode) - 1 overflow.
inode->i_size is assigned in __nbd_ioctl() -> nbd_set_size() -> bytesize.
We think it is necessary to limit the size of arg to prevent errors.

Moreover, __nbd_ioctl() -> nbd_add_socket(), arg will be cast to int.
Assuming the value of arg is 0x80000000000000001) (on a 64-bit machine),
it will become 1 after the coercion, which will return unexpected results.

Fix it by adding checks to prevent passing in too large numbers.

Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20230206145805.2645671-1-zhongjinghua@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-20 13:43:44 -06:00
Jakub Kicinski
7399b886b5 nbd: use the structured req attr check
Use the macro for checking presence of required attributes.
It has the advantage of reporting to the user which attr
was missing in a machine-readable format (extack).

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20230224021301.1630703-2-kuba@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-14 14:17:13 -06:00
Jakub Kicinski
9195b317f1 nbd: allow genl access outside init_net
NBD doesn't have much to do with networking, allow users outside
init_net to access the family.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20230224021301.1630703-1-kuba@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-14 14:17:13 -06:00
Benjamin Coddington
98123866fc Treewide: Stop corrupting socket's task_frag
Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
GFP_NOIO flag on sk_allocation which the networking system uses to decide
when it is safe to use current->task_frag.  The results of this are
unexpected corruption in task_frag when SUNRPC is involved in memory
reclaim.

The corruption can be seen in crashes, but the root cause is often
difficult to ascertain as a crashing machine's stack trace will have no
evidence of being near NFS or SUNRPC code.  I believe this problem to
be much more pervasive than reports to the community may indicate.

Fix this by having kernel users of sockets that may corrupt task_frag due
to reclaim set sk_use_task_frag = false.  Preemptively correcting this
situation for users that still set sk_allocation allows them to convert to
memalloc_nofs_save/restore without the same unexpected corruptions that are
sure to follow, unlikely to show up in testing, and difficult to bisect.

CC: Philipp Reisner <philipp.reisner@linbit.com>
CC: Lars Ellenberg <lars.ellenberg@linbit.com>
CC: "Christoph Böhmwalder" <christoph.boehmwalder@linbit.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Josef Bacik <josef@toxicpanda.com>
CC: Keith Busch <kbusch@kernel.org>
CC: Christoph Hellwig <hch@lst.de>
CC: Sagi Grimberg <sagi@grimberg.me>
CC: Lee Duncan <lduncan@suse.com>
CC: Chris Leech <cleech@redhat.com>
CC: Mike Christie <michael.christie@oracle.com>
CC: "James E.J. Bottomley" <jejb@linux.ibm.com>
CC: "Martin K. Petersen" <martin.petersen@oracle.com>
CC: Valentina Manea <valentina.manea.m@gmail.com>
CC: Shuah Khan <shuah@kernel.org>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: David Howells <dhowells@redhat.com>
CC: Marc Dionne <marc.dionne@auristor.com>
CC: Steve French <sfrench@samba.org>
CC: Christine Caulfield <ccaulfie@redhat.com>
CC: David Teigland <teigland@redhat.com>
CC: Mark Fasheh <mark@fasheh.com>
CC: Joel Becker <jlbec@evilplan.org>
CC: Joseph Qi <joseph.qi@linux.alibaba.com>
CC: Eric Van Hensbergen <ericvh@gmail.com>
CC: Latchesar Ionkov <lucho@ionkov.net>
CC: Dominique Martinet <asmadeus@codewreck.org>
CC: Ilya Dryomov <idryomov@gmail.com>
CC: Xiubo Li <xiubli@redhat.com>
CC: Chuck Lever <chuck.lever@oracle.com>
CC: Jeff Layton <jlayton@kernel.org>
CC: Trond Myklebust <trond.myklebust@hammerspace.com>
CC: Anna Schumaker <anna@kernel.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>

Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-19 17:28:49 -08:00
Al Viro
de4eda9de2 use less confusing names for iov_iter direction initializers
READ/WRITE proved to be actively confusing - the meanings are
"data destination, as used with read(2)" and "data source, as
used with write(2)", but people keep interpreting those as
"we read data from it" and "we write data to it", i.e. exactly
the wrong way.

Call them ITER_DEST and ITER_SOURCE - at least that is harder
to misinterpret...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-11-25 13:01:55 -05:00
Linus Torvalds
e8bc52cb8d Driver core changes for 6.1-rc1
Here is the big set of driver core and debug printk changes for 6.1-rc1.
 Included in here is:
 	- dynamic debug updates for the core and the drm subsystem.  The
 	  drm changes have all been acked by the relevant maintainers.
 	- kernfs fixes for syzbot reported problems
 	- kernfs refactors and updates for cgroup requirements
 	- magic number cleanups and removals from the kernel tree (they
 	  were not being used and they really did not actually do
 	  anything.)
 	- other tiny cleanups
 
 All of these have been in linux-next for a while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCY0BYUA8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ylozwCdFRlcghaf7XBUyNgRZRwMC+oQI8EAn1G/nEDE
 6aFd2er41uK0IGQnSmYO
 =OK0k
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the big set of driver core and debug printk changes for
  6.1-rc1. Included in here is:

   - dynamic debug updates for the core and the drm subsystem. The drm
     changes have all been acked by the relevant maintainers

   - kernfs fixes for syzbot reported problems

   - kernfs refactors and updates for cgroup requirements

   - magic number cleanups and removals from the kernel tree (they were
     not being used and they really did not actually do anything)

   - other tiny cleanups

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'driver-core-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (74 commits)
  docs: filesystems: sysfs: Make text and code for ->show() consistent
  Documentation: NBD_REQUEST_MAGIC isn't a magic number
  a.out: restore CMAGIC
  device property: Add const qualifier to device_get_match_data() parameter
  drm_print: add _ddebug descriptor to drm_*dbg prototypes
  drm_print: prefer bare printk KERN_DEBUG on generic fn
  drm_print: optimize drm_debug_enabled for jump-label
  drm-print: add drm_dbg_driver to improve namespace symmetry
  drm-print.h: include dyndbg header
  drm_print: wrap drm_*_dbg in dyndbg descriptor factory macro
  drm_print: interpose drm_*dbg with forwarding macros
  drm: POC drm on dyndbg - use in core, 2 helpers, 3 drivers.
  drm_print: condense enum drm_debug_category
  debugfs: use DEFINE_SHOW_ATTRIBUTE to define debugfs_regset32_fops
  driver core: use IS_ERR_OR_NULL() helper in device_create_groups_vargs()
  Documentation: ENI155_MAGIC isn't a magic number
  Documentation: NBD_REPLY_MAGIC isn't a magic number
  nbd: remove define-only NBD_MAGIC, previously magic number
  Documentation: FW_HEADER_MAGIC isn't a magic number
  Documentation: EEPROM_MAGIC_VALUE isn't a magic number
  ...
2022-10-07 17:04:10 -07:00
Linus Torvalds
513389809e for-6.1/block-2022-10-03
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM67XkQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiHoD/9eN+6YnNRPu5+2zeGnnm1Nlwic6YMZeORr
 KFIeC0COMWoFhNBIPFkgAKT+0qIH+uGt5UsHSM3Y5La7wMR8yLxD4PAnvTZ/Ijtt
 yxVIOmonJoQ0OrQ2kTbvDXL/9OCUrzwXXyUIEPJnH0Ca1mxeNOgDHbE7VGF6DMul
 0D3pI8qs2WLnHlDi1V/8kH5qZ6WoAJSDcb8sTzOUVnyveZPNaZhGQJuHA2XAYMtg
 fqKMDJqgmNk6jdTMUgdF5B+rV64PQoCy28I7fXqGkEe+RE5TBy57vAa0XY84V8XR
 /a8CEuwMts2ypk1hIcJG8Vv8K6u5war9yPM5MTngKsoMpzNIlhrhaJQVyjKdcs+E
 Ixwzexu6xTYcrcq+mUARgeTh79FzTBM/uXEdbCG2G3S6HPd6UZWUJZGfxw/l0Aem
 V4xB7lj6SQaJDU1iJCYUaHcekNXhQAPvyVG+R2ED1SO3McTpTPIM1aeigxw6vj7u
 bH3Kfdr94Z8HNuoLuiS6YYfjNt2Shf4LEB6GxKJ9TYHtyhdOyO0H64jGHpygrWqN
 cSnkWPUqUUNpF7srKM0ZgbliCshvmyJc4aMOFd0gBY/kXf5J/j7IXvh8TFCi9rHH
 0KyZH3/3Zsu9geUn3ynznlr4FXU+BcqE6boaa/iWb9sN1m+Rvaahv8cSch/dh44a
 vQNj/iOBQA==
 =R05e
 -----END PGP SIGNATURE-----

Merge tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe pull requests via Christoph:
      - handle number of queue changes in the TCP and RDMA drivers
        (Daniel Wagner)
      - allow changing the number of queues in nvmet (Daniel Wagner)
      - also consider host_iface when checking ip options (Daniel
        Wagner)
      - don't map pages which can't come from HIGHMEM (Fabio M. De
        Francesco)
      - avoid unnecessary flush bios in nvmet (Guixin Liu)
      - shrink and better pack the nvme_iod structure (Keith Busch)
      - add comment for unaligned "fake" nqn (Linjun Bao)
      - print actual source IP address through sysfs "address" attr
        (Martin Belanger)
      - various cleanups (Jackie Liu, Wolfram Sang, Genjian Zhang)
      - handle effects after freeing the request (Keith Busch)
      - copy firmware_rev on each init (Keith Busch)
      - restrict management ioctls to admin (Keith Busch)
      - ensure subsystem reset is single threaded (Keith Busch)
      - report the actual number of tagset maps in nvme-pci (Keith
        Busch)
      - small fabrics authentication fixups (Christoph Hellwig)
      - add common code for tagset allocation and freeing (Christoph
        Hellwig)
      - stop using the request_queue in nvmet (Christoph Hellwig)
      - set min_align_mask before calculating max_hw_sectors (Rishabh
        Bhatnagar)
      - send a rediscover uevent when a persistent discovery controller
        reconnects (Sagi Grimberg)
      - misc nvmet-tcp fixes (Varun Prakash, zhenwei pi)

 - MD pull request via Song:
      - Various raid5 fix and clean up, by Logan Gunthorpe and David
        Sloan.
      - Raid10 performance optimization, by Yu Kuai.

 - sbitmap wakeup hang fixes (Hugh, Keith, Jan, Yu)

 - IO scheduler switching quisce fix (Keith)

 - s390/dasd block driver updates (Stefan)

 - support for recovery for the ublk driver (ZiyangZhang)

 - rnbd drivers fixes and updates (Guoqing, Santosh, ye, Christoph)

 - blk-mq and null_blk map fixes (Bart)

 - various bcache fixes (Coly, Jilin, Jules)

 - nbd signal hang fix (Shigeru)

 - block writeback throttling fix (Yu)

 - optimize the passthrough mapping handling (me)

 - prepare block cgroups to being gendisk based (Christoph)

 - get rid of an old PSI hack in the block layer, moving it to the
   callers instead where it belongs (Christoph)

 - blk-throttle fixes and cleanups (Yu)

 - misc fixes and cleanups (Liu Shixin, Liu Song, Miaohe, Pankaj,
   Ping-Xiang, Wolfram, Saurabh, Li Jinlin, Li Lei, Lin, Li zeming,
   Miaohe, Bart, Coly, Gaosheng

* tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux: (162 commits)
  sbitmap: fix lockup while swapping
  block: add rationale for not using blk_mq_plug() when applicable
  block: adapt blk_mq_plug() to not plug for writes that require a zone lock
  s390/dasd: use blk_mq_alloc_disk
  blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep
  nvmet: don't look at the request_queue in nvmet_bdev_set_limits
  nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all
  blk-mq: use quiesced elevator switch when reinitializing queues
  block: replace blk_queue_nowait with bdev_nowait
  nvme: remove nvme_ctrl_init_connect_q
  nvme-loop: use the tagset alloc/free helpers
  nvme-loop: store the generic nvme_ctrl in set->driver_data
  nvme-loop: initialize sqsize later
  nvme-fc: use the tagset alloc/free helpers
  nvme-fc: store the generic nvme_ctrl in set->driver_data
  nvme-fc: keep ctrl->sqsize in sync with opts->queue_size
  nvme-rdma: use the tagset alloc/free helpers
  nvme-rdma: store the generic nvme_ctrl in set->driver_data
  nvme-tcp: use the tagset alloc/free helpers
  nvme-tcp: store the generic nvme_ctrl in set->driver_data
  ...
2022-10-07 09:19:14 -07:00
наб
bd5926220f nbd: remove define-only NBD_MAGIC, previously magic number
commit f4507164e7 ("nbd: rename the nbd_device variable from lo to
nbd") renamed LO_MAGIC to NBD_MAGIC; commit 5ea8d10802 ("nbd:
separate out the config information") removed the last users of that

Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Link: https://lore.kernel.org/r/10a80681c5966fed1a1afc696e3db114f481514c.1663280877.git.nabijaczleweli@nabijaczleweli.xyz
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-24 14:17:19 +02:00
Shigeru Yoshida
1de7c3cf48 nbd: Fix hung when signal interrupts nbd_start_device_ioctl()
syzbot reported hung task [1].  The following program is a simplified
version of the reproducer:

int main(void)
{
	int sv[2], fd;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return 1;
	if ((fd = open("/dev/nbd0", 0)) < 0)
		return 1;
	if (ioctl(fd, NBD_SET_SIZE_BLOCKS, 0x81) < 0)
		return 1;
	if (ioctl(fd, NBD_SET_SOCK, sv[0]) < 0)
		return 1;
	if (ioctl(fd, NBD_DO_IT) < 0)
		return 1;
	return 0;
}

When signal interrupt nbd_start_device_ioctl() waiting the condition
atomic_read(&config->recv_threads) == 0, the task can hung because it
waits the completion of the inflight IOs.

This patch fixes the issue by clearing queue, not just shutdown, when
signal interrupt nbd_start_device_ioctl().

Link: https://syzkaller.appspot.com/bug?id=7d89a3ffacd2b83fdd39549bc4d8e0a89ef21239 [1]
Reported-by: syzbot+38e6c55d4969a14c1534@syzkaller.appspotmail.com
Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20220907163502.577561-1-syoshida@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-08 08:41:36 -06:00
Jakub Kicinski
9c5d03d362 genetlink: start to validate reserved header bytes
We had historically not checked that genlmsghdr.reserved
is 0 on input which prevents us from using those precious
bytes in the future.

One use case would be to extend the cmd field, which is
currently just 8 bits wide and 256 is not a lot of commands
for some core families.

To make sure that new families do the right thing by default
put the onus of opting out of validation on existing families.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com> (NetLabel)
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-08-29 12:47:15 +01:00
Yu Kuai
bc9da6dd06 nbd: add missing definition of pr_fmt
commit 1243172d58 ("nbd: use pr_err to output error message") tries
to define pr_fmt and use short pr_err() to output error message,
however, the definition is missed.

This patch also remove existing "nbd:" inside pr_err().

Fixes: 1243172d58 ("nbd: use pr_err to output error message")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20220723082427.3890655-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:41 -06:00
John Garry
2dd6532e95 blk-mq: Drop 'reserved' arg of busy_tag_iter_fn
We no longer use the 'reserved' arg in busy_tag_iter_fn for any iter
function so it may be dropped.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me> #nvme
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/1657109034-206040-6-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06 06:33:53 -06:00
John Garry
9bdb4833dd blk-mq: Drop blk_mq_ops.timeout 'reserved' arg
With new API blk_mq_is_reserved_rq() we can tell if a request is from
the reserved pool, so stop passing 'reserved' arg. There is actually
only a single user of that arg for all the callback implementations, which
can use blk_mq_is_reserved_rq() instead.

This will also allow us to stop passing the same 'reserved' around the
blk-mq iter functions next.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # For MMC
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/1657109034-206040-4-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06 06:33:53 -06:00
Christoph Hellwig
8b9ab62662 block: remove blk_cleanup_disk
blk_cleanup_disk is nothing but a trivial wrapper for put_disk now,
so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20220619060552.1850436-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-28 06:33:15 -06:00
Yu Kuai
1243172d58 nbd: use pr_err to output error message
Instead of using the long printk(KERN_ERR "nbd: ...") to
output error message, defining pr_fmt and using
the short pr_err("") to do that. The replacemen is done
by using the following command:

  sed -i 's/printk(KERN_ERR "nbd: /pr_err("/g' \
		  drivers/block/nbd.c

This patch also rewrap to 80 columns where possible.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20220521073749.3146892-7-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-27 20:39:33 -06:00
Zhang Wensheng
858f1bf65d nbd: fix possible overflow on 'first_minor' in nbd_dev_add()
When 'index' is a big numbers, it may become negative which forced
to 'int'. then 'index << part_shift' might overflow to a positive
value that is not greater than '0xfffff', then sysfs might complains
about duplicate creation. Because of this, move the 'index' judgment
to the front will fix it and be better.

Fixes: b0d9111a2d ("nbd: use an idr to keep track of nbd devices")
Fixes: 940c264984 ("nbd: fix possible overflow for 'first_minor' in nbd_dev_add()")
Signed-off-by: Zhang Wensheng <zhangwensheng5@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20220521073749.3146892-6-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-27 20:39:33 -06:00
Yu Kuai
09dadb5985 nbd: fix io hung while disconnecting device
In our tests, "qemu-nbd" triggers a io hung:

INFO: task qemu-nbd:11445 blocked for more than 368 seconds.
      Not tainted 5.18.0-rc3-next-20220422-00003-g2176915513ca #884
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:qemu-nbd        state:D stack:    0 pid:11445 ppid:     1 flags:0x00000000
Call Trace:
 <TASK>
 __schedule+0x480/0x1050
 ? _raw_spin_lock_irqsave+0x3e/0xb0
 schedule+0x9c/0x1b0
 blk_mq_freeze_queue_wait+0x9d/0xf0
 ? ipi_rseq+0x70/0x70
 blk_mq_freeze_queue+0x2b/0x40
 nbd_add_socket+0x6b/0x270 [nbd]
 nbd_ioctl+0x383/0x510 [nbd]
 blkdev_ioctl+0x18e/0x3e0
 __x64_sys_ioctl+0xac/0x120
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fd8ff706577
RSP: 002b:00007fd8fcdfebf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000040000000 RCX: 00007fd8ff706577
RDX: 000000000000000d RSI: 000000000000ab00 RDI: 000000000000000f
RBP: 000000000000000f R08: 000000000000fbe8 R09: 000055fe497c62b0
R10: 00000002aff20000 R11: 0000000000000246 R12: 000000000000006d
R13: 0000000000000000 R14: 00007ffe82dc5e70 R15: 00007fd8fcdff9c0

"qemu-ndb -d" will call ioctl 'NBD_DISCONNECT' first, however, following
message was found:

block nbd0: Send disconnect failed -32

Which indicate that something is wrong with the server. Then,
"qemu-nbd -d" will call ioctl 'NBD_CLEAR_SOCK', however ioctl can't clear
requests after commit 2516ab1543fd("nbd: only clear the queue on device
teardown"). And in the meantime, request can't complete through timeout
because nbd_xmit_timeout() will always return 'BLK_EH_RESET_TIMER', which
means such request will never be completed in this situation.

Now that the flag 'NBD_CMD_INFLIGHT' can make sure requests won't
complete multiple times, switch back to call nbd_clear_sock() in
nbd_clear_sock_ioctl(), so that inflight requests can be cleared.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20220521073749.3146892-5-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-27 20:39:33 -06:00
Yu Kuai
2895f1831e nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
Otherwise io will hung because request will only be completed if the
cmd has the flag 'NBD_CMD_INFLIGHT'.

Fixes: 07175cb1ba ("nbd: make sure request completion won't concurrent")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20220521073749.3146892-4-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-27 20:39:33 -06:00
Yu Kuai
c55b2b983b nbd: fix race between nbd_alloc_config() and module removal
When nbd module is being removing, nbd_alloc_config() may be
called concurrently by nbd_genl_connect(), although try_module_get()
will return false, but nbd_alloc_config() doesn't handle it.

The race may lead to the leak of nbd_config and its related
resources (e.g, recv_workq) and oops in nbd_read_stat() due
to the unload of nbd module as shown below:

  BUG: kernel NULL pointer dereference, address: 0000000000000040
  Oops: 0000 [#1] SMP PTI
  CPU: 5 PID: 13840 Comm: kworker/u17:33 Not tainted 5.14.0+ #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
  Workqueue: knbd16-recv recv_work [nbd]
  RIP: 0010:nbd_read_stat.cold+0x130/0x1a4 [nbd]
  Call Trace:
   recv_work+0x3b/0xb0 [nbd]
   process_one_work+0x1ed/0x390
   worker_thread+0x4a/0x3d0
   kthread+0x12a/0x150
   ret_from_fork+0x22/0x30

Fixing it by checking the return value of try_module_get()
in nbd_alloc_config(). As nbd_alloc_config() may return ERR_PTR(-ENODEV),
assign nbd->config only when nbd_alloc_config() succeeds to ensure
the value of nbd->config is binary (valid or NULL).

Also adding a debug message to check the reference counter
of nbd_config during module removal.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20220521073749.3146892-3-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-27 20:39:33 -06:00
Yu Kuai
06c4da89c2 nbd: call genl_unregister_family() first in nbd_cleanup()
Otherwise there may be race between module removal and the handling of
netlink command, which can lead to the oops as shown below:

  BUG: kernel NULL pointer dereference, address: 0000000000000098
  Oops: 0002 [#1] SMP PTI
  CPU: 1 PID: 31299 Comm: nbd-client Tainted: G            E     5.14.0-rc4
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
  RIP: 0010:down_write+0x1a/0x50
  Call Trace:
   start_creating+0x89/0x130
   debugfs_create_dir+0x1b/0x130
   nbd_start_device+0x13d/0x390 [nbd]
   nbd_genl_connect+0x42f/0x748 [nbd]
   genl_family_rcv_msg_doit.isra.0+0xec/0x150
   genl_rcv_msg+0xe5/0x1e0
   netlink_rcv_skb+0x55/0x100
   genl_rcv+0x29/0x40
   netlink_unicast+0x1a8/0x250
   netlink_sendmsg+0x21b/0x430
   ____sys_sendmsg+0x2a4/0x2d0
   ___sys_sendmsg+0x81/0xc0
   __sys_sendmsg+0x62/0xb0
   __x64_sys_sendmsg+0x1f/0x30
   do_syscall_64+0x3b/0xc0
   entry_SYSCALL_64_after_hwframe+0x44/0xae
  Modules linked in: nbd(E-)

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20220521073749.3146892-2-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-27 20:39:33 -06:00
Xie Yongji
491bf8f236 nbd: Fix hung on disconnect request if socket is closed before
When userspace closes the socket before sending a disconnect
request, the following I/O requests will be blocked in
wait_for_reconnect() until dead timeout. This will cause the
following disconnect request also hung on blk_mq_quiesce_queue().
That means we have no way to disconnect a nbd device if there
are some I/O requests waiting for reconnecting until dead timeout.
It's not expected. So let's wake up the thread waiting for
reconnecting directly when a disconnect request is sent.

Reported-by: Xu Jianhai <zero.xu@bytedance.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20220322080639.142-1-xieyongji@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-16 06:19:35 -06:00
Christoph Hellwig
4a04d517c5 nbd: don't set the discard_alignment queue limit
The discard_alignment queue limit is named a bit misleading means the
offset into the block device at which the discard granularity starts.
Setting it to the discard granularity as done by nbd is mostly harmless
but also useless.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20220418045314.360785-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-03 10:38:50 -06:00
Christoph Hellwig
dbdc1be325 block: add a disk_openers helper
Add a helper that returns the openers for a given gendisk to avoid having
drivers poke into disk->part0 to get at this information in a somewhat
cumbersome way.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220330052917.2566582-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-18 06:54:09 -06:00
Christoph Hellwig
2a852a693f nbd: use the correct block_device in nbd_bdev_reset
The bdev parameter to ->ioctl contains the block device that the ioctl
is called on, which can be the partition.  But the openers check in
nbd_bdev_reset really needs to check use the whole device, so switch to
using that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220330052917.2566582-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-18 06:54:09 -06:00
Christoph Hellwig
70200574cc block: remove QUEUE_FLAG_DISCARD
Just use a non-zero max_discard_sectors as an indicator for discard
support, similar to what is done for write zeroes.

The only places where needs special attention is the RAID5 driver,
which must clear discard support for security reasons by default,
even if the default stacking rules would allow for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd]
Acked-by: Jan Höppner <hoeppner@linux.ibm.com> [s390]
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220415045258.199825-25-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17 19:49:59 -06:00
Jens Axboe
7198bfc201 Revert "nbd: fix possible overflow on 'first_minor' in nbd_dev_add()"
This reverts commit 6d35d04a9e.

Both Gabriel and Borislav report that this commit casues a regression
with nbd:

sysfs: cannot create duplicate filename '/dev/block/43:0'

Revert it before 5.18-rc1 and we'll investigage this separately in
due time.

Link: https://lore.kernel.org/all/YkiJTnFOt9bTv6A2@zn.tnic/
Reported-by: Gabriel L. Somlo <somlo@cmu.edu>
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-02 11:40:23 -06:00
Zhang Wensheng
6d35d04a9e nbd: fix possible overflow on 'first_minor' in nbd_dev_add()
When 'index' is a big numbers, it may become negative which forced
to 'int'. then 'index << part_shift' might overflow to a positive
value that is not greater than '0xfffff', then sysfs might complains
about duplicate creation. Because of this, move the 'index' judgment
to the front will fix it and be better.

Fixes: b0d9111a2d ("nbd: use an idr to keep track of nbd devices")
Fixes: 940c264984 ("nbd: fix possible overflow for 'first_minor' in nbd_dev_add()")
Signed-off-by: Zhang Wensheng <zhangwensheng5@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20220310093224.4002895-1-zhangwensheng5@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-31 08:30:21 -06:00
Linus Torvalds
cb690f5238 for-5.16/drivers-2021-11-09
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmGKqOcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpr3yEADD9Cx8oNk3KzWV3c3JlIR4JQtvpczS3dho
 KkGU0D5fOh1sViXbLBNr6VxypcEIKQoHxDQQ6qid1kOu/B3mCNM1duLsVjyj3Qa0
 7nbm2dVUsD/EVDuXedRmMvcfCUx6Z23DbpI182wXtIPaCsEEmsANzHnZNg38OV44
 25SYG0QUvb9ViSz1Y1GORu0ttEJNF2GhZfiBpb0WveRnY7eTSL/PnHNDzHsSeFv4
 zD0W205g7jKbt0+57kgNElTz7DbdM3p8XVex+aXPlFaHz2qx4ZoJJIsaMv/P8tT5
 14b50cB41xnPvlGTvqr1WfZZfJocDNq2rG+fh6N5D1sO86ogWpj7psiiADfa0pb6
 ZWoJqhk3BvEUMPQ5N/BJ/8j3FWGIYWtKQf4QcyxrJYpqDwtwbBfMlzKkc7JMPFYk
 JAi6uq1uF5SbA4x99G90tK85LvxsbkseyIYXgBJ/GIyW5doIPkD9TPDEzJMCdHOe
 laynHS5PMHzuhPLuEDDn9sTVXpZWAMBnoy4j1L4wGmBjiogYWLTSJVobODzCAqHY
 1Va2oP6SXfCdVRkCysFbcrdsjJuoIWlMKrdE40tNvkmU0v7sEX0Zd+GLHiaWdIZa
 fgxC9fmZtDDOowCp+Iw0VaAqPeeptmyUrof06ZktJleOAscX7kSwbxPdmr1FM0jy
 dbnLDyaq/A==
 =QaFI
 -----END PGP SIGNATURE-----

Merge tag 'for-5.16/drivers-2021-11-09' of git://git.kernel.dk/linux-block

Pull more block driver updates from Jens Axboe:

 - Last series adding error handling support for add_disk() in drivers.
   After this one, and once the SCSI side has been merged, we can
   finally annotate add_disk() as must_check. (Luis)

 - bcache fixes (Coly)

 - zram fixes (Ming)

 - ataflop locking fix (Tetsuo)

 - nbd fixes (Ye, Yu)

 - MD merge via Song
      - Cleanup (Yang)
      - sysfs fix (Guoqing)

 - Misc fixes (Geert, Wu, luo)

* tag 'for-5.16/drivers-2021-11-09' of git://git.kernel.dk/linux-block: (34 commits)
  bcache: Revert "bcache: use bvec_virt"
  ataflop: Add missing semicolon to return statement
  floppy: address add_disk() error handling on probe
  ataflop: address add_disk() error handling on probe
  block: update __register_blkdev() probe documentation
  ataflop: remove ataflop_probe_lock mutex
  mtd/ubi/block: add error handling support for add_disk()
  block/sunvdc: add error handling support for add_disk()
  z2ram: add error handling support for add_disk()
  nvdimm/pmem: use add_disk() error handling
  nvdimm/pmem: cleanup the disk if pmem_release_disk() is yet assigned
  nvdimm/blk: add error handling support for add_disk()
  nvdimm/blk: avoid calling del_gendisk() on early failures
  nvdimm/btt: add error handling support for add_disk()
  nvdimm/btt: use goto error labels on btt_blk_init()
  loop: Remove duplicate assignments
  drbd: Fix double free problem in drbd_create_device
  nvdimm/btt: do not call del_gendisk() if not needed
  bcache: fix use-after-free problem in bcache_device_free()
  zram: replace fsync_bdev with sync_blockdev
  ...
2021-11-09 11:24:08 -08:00
Yu Kuai
494dbee341 nbd: error out if socket index doesn't match in nbd_handle_reply()
commit fcf3d633d8 ("nbd: check sock index in nbd_read_stat()") just
add error message when socket index doesn't match. Since the request
and reply must be transmitted over the same socket, it's ok to error
out in such situation.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20211101092538.1155842-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-02 14:42:41 -06:00
Ye Bin
e2daec488c nbd: Fix hungtask when nbd_config_put
I got follow issue:
[  247.381177] INFO: task kworker/u10:0:47 blocked for more than 120 seconds.
[  247.382644]       Not tainted 4.19.90-dirty #140
[  247.383502] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  247.385027] Call Trace:
[  247.388384]  schedule+0xb8/0x3c0
[  247.388966]  schedule_timeout+0x2b4/0x380
[  247.392815]  wait_for_completion+0x367/0x510
[  247.397713]  flush_workqueue+0x32b/0x1340
[  247.402700]  drain_workqueue+0xda/0x3c0
[  247.403442]  destroy_workqueue+0x7b/0x690
[  247.405014]  nbd_config_put.cold+0x2f9/0x5b6
[  247.405823]  recv_work+0x1fd/0x2b0
[  247.406485]  process_one_work+0x70b/0x1610
[  247.407262]  worker_thread+0x5a9/0x1060
[  247.408699]  kthread+0x35e/0x430
[  247.410918]  ret_from_fork+0x1f/0x30

We can reproduce issue as follows:
1. Inject memory fault in nbd_start_device
-1244,10 +1248,18 @@ static int nbd_start_device(struct nbd_device *nbd)
        nbd_dev_dbg_init(nbd);
        for (i = 0; i < num_connections; i++) {
                struct recv_thread_args *args;
-
-               args = kzalloc(sizeof(*args), GFP_KERNEL);
+
+               if (i == 1) {
+                       args = NULL;
+                       printk("%s: inject malloc error\n", __func__);
+               }
+               else
+                       args = kzalloc(sizeof(*args), GFP_KERNEL);
2. Inject delay in recv_work
-757,6 +760,8 @@ static void recv_work(struct work_struct *work)

                blk_mq_complete_request(blk_mq_rq_from_pdu(cmd));
        }
+       printk("%s: comm=%s pid=%d\n", __func__, current->comm, current->pid);
+       mdelay(5 * 1000);
        nbd_config_put(nbd);
        atomic_dec(&config->recv_threads);
        wake_up(&config->recv_wq);
3. Create nbd server
nbd-server 8000 /tmp/disk
4. Create nbd client
nbd-client localhost 8000 /dev/nbd1
Then will trigger above issue.

Reason is when add delay in recv_work, lead to release the last reference
of 'nbd->config_refs'. nbd_config_put will call flush_workqueue to make
all work finish. Obviously, it will lead to deadloop.
To solve this issue, according to Josef's suggestion move 'recv_work'
init from start device to nbd_dev_add, then destroy 'recv_work'when
nbd device teardown.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20211102015237.2309763-5-yebin10@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-02 10:50:27 -06:00
Ye Bin
69beb62ff0 nbd: Fix incorrect error handle when first_minor is illegal in nbd_dev_add
If first_minor is illegal will goto out_free_idr label, this will miss
cleanup disk.

Fixes: b1a811633f ("block: nbd: add sanity check for first_minor")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20211102015237.2309763-4-yebin10@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-02 10:50:27 -06:00
Yu Kuai
940c264984 nbd: fix possible overflow for 'first_minor' in nbd_dev_add()
If 'part_shift' is not zero, then 'index << part_shift' might
overflow to a value that is not greater than '0xfffff', then sysfs
might complains about duplicate creation.

Fixes: b0d9111a2d ("nbd: use an idr to keep track of nbd devices")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20211102015237.2309763-3-yebin10@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-02 10:50:27 -06:00
Yu Kuai
e4c4871a73 nbd: fix max value for 'first_minor'
commit b1a811633f ("block: nbd: add sanity check for first_minor")
checks that 'first_minor' should not be greater than 0xff, which is
wrong. Whitout the commit, the details that when user pass 0x100000,
it ends up create sysfs dir "/sys/block/43:0" are as follows:

nbd_dev_add
 disk->first_minor = index << part_shift
  -> default part_shift is 5, first_minor is 0x2000000
  device_add_disk
   ddev->devt = MKDEV(disk->major, disk->first_minor)
    -> (0x2b << 20) | (0x2000000) = 0x2b00000
   device_add
    device_create_sys_dev_entry
	 format_dev_t
	  sprintf(buffer, "%u:%u", MAJOR(dev), MINOR(dev));
	   -> got 43:0
	  sysfs_create_link -> /sys/block/43:0

By the way, with the wrong fix, when part_shift is the default value,
only 8 ndb devices can be created since 8 << 5 is greater than 0xff.

Since the max bits for 'first_minor' should be the same as what
MKDEV() does, which is 20. Change the upper bound of 'first_minor'
from 0xff to 0xfffff.

Fixes: b1a811633f ("block: nbd: add sanity check for first_minor")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20211102015237.2309763-2-yebin10@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-02 10:50:27 -06:00
Linus Torvalds
643a7234e0 for-5.16/drivers-2021-10-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmF8KFsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgph1ZEACwNuHkAZcIgNzKhzuLP9OjMhv9vV+q254G
 /EcM31e+qgRioMd0ihbVsgW76jOwLEmb3ldKGcN+0Wo5+Sv9Im8+wAWYY1REOZO5
 ZTUBfAzhEh63/EtqTFiU8U+7dmXqy4z7NaICnhlynjwkd3IT+I561os6kcqwJMMr
 G+Q1Cnk9rgCMIoLOCoVThIpjmjyZzF33qJb2VEIkHfkot62iNdpABWaSASF+CCba
 z8LfbvLAYz3YLl4thXlLJFU282T5y7gzgSomGvX4F0rMJSbqFbgoNEPxaYw9CvzC
 uC6MnYCYdCdvVkWVm1b8I8LYzPd5GrpVOSh3JQGvuA4Ppv2IyJCDSruYGgVUlhao
 cVPzuHCqNCfKk0ykYVRZy9oKiBk5wmFeKM/lSHu408y8VNraPNIAEpB6sA9qGr22
 AYr8lNh3JDr0g8dtFsDOq+7u3MANW0KQozfzwTPZo6NjzEE1D2jIg39Ljiijo9+Y
 3pU8pitIAhsKd2KhW1H6LmtJbF4dX756VKYDXOhzgORU0NZYgvGhBIj9tAdpQR0S
 xeae5Kj0/wBGcqR/owf/n1EY/q7rWgNDETnsBhbmzMZyhwH3L6zhT+bfD8YoQCHY
 ueyqhyIUe4YBxTrIpICqwDlqaMYAmQ0jRaci+bK9ovVlQ89FQ9o/BE2COPlI/DGX
 w+rUmmoX4g==
 =HiWU
 -----END PGP SIGNATURE-----

Merge tag 'for-5.16/drivers-2021-10-29' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:

 - paride driver cleanups (Christoph)

 - Remove cryptoloop support (Christoph)

 - null_blk poll support (me)

 - Now that add_disk() supports proper error handling, add it to various
   drivers (Luis)

 - Make ataflop actually work again (Michael)

 - s390 dasd fixes (Stefan, Heiko)

 - nbd fixes (Yu, Ye)

 - Remove redundant wq flush in mtip32xx (Christophe)

 - NVMe updates
      - fix a multipath partition scanning deadlock (Hannes Reinecke)
      - generate uevent once a multipath namespace is operational again
        (Hannes Reinecke)
      - support unique discovery controller NQNs (Hannes Reinecke)
      - fix use-after-free when a port is removed (Israel Rukshin)
      - clear shadow doorbell memory on resets (Keith Busch)
      - use struct_size (Len Baker)
      - add error handling support for add_disk (Luis Chamberlain)
      - limit the maximal queue size for RDMA controllers (Max Gurtovoy)
      - use a few more symbolic names (Max Gurtovoy)
      - fix error code in nvme_rdma_setup_ctrl (Max Gurtovoy)
      - add support for ->map_queues on FC (Saurav Kashyap)
      - support the current discovery subsystem entry (Hannes Reinecke)
      - use flex_array_size and struct_size (Len Baker)

 - bcache fixes (Christoph, Coly, Chao, Lin, Qing)

 - MD updates (Christoph, Guoqing, Xiao)

 - Misc fixes (Dan, Ding, Jiapeng, Shin'ichiro, Ye)

* tag 'for-5.16/drivers-2021-10-29' of git://git.kernel.dk/linux-block: (117 commits)
  null_blk: Fix handling of submit_queues and poll_queues attributes
  block: ataflop: Fix warning comparing pointer to 0
  bcache: replace snprintf in show functions with sysfs_emit
  bcache: move uapi header bcache.h to bcache code directory
  nvmet: use flex_array_size and struct_size
  nvmet: register discovery subsystem as 'current'
  nvmet: switch check for subsystem type
  nvme: add new discovery log page entry definitions
  block: ataflop: more blk-mq refactoring fixes
  block: remove support for cryptoloop and the xor transfer
  mtd: add add_disk() error handling
  rnbd: add error handling support for add_disk()
  um/drivers/ubd_kern: add error handling support for add_disk()
  m68k/emu/nfblock: add error handling support for add_disk()
  xen-blkfront: add error handling support for add_disk()
  bcache: add error handling support for add_disk()
  dm: add add_disk() error handling
  block: aoe: fixup coccinelle warnings
  nvmet: use struct_size over open coded arithmetic
  nvme: drop scan_lock and always kick requeue list when removing namespaces
  ...
2021-11-01 09:27:38 -07:00
Xie Yongji
c4318d6cd0 nbd: Use blk_validate_block_size() to validate block size
Use the block layer helper to validate block size instead
of open coding it.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Link: https://lore.kernel.org/r/20211026144015.188-3-xieyongji@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-27 14:15:53 -06:00
Xie Yongji
435c2acb30 nbd: Use invalidate_disk() helper on disconnect
When a nbd device encounters a writeback error, that error will
get propagated to the bd_inode's wb_err field. Then if this nbd
device's backend is disconnected and another is attached, we will
get back the previous writeback error on fsync, which is unexpected.

To fix it, let's use invalidate_disk() helper to invalidate the
disk on disconnect instead of just setting disk's capacity to zero.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210922123711.187-5-xieyongji@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21 10:12:41 -06:00
Ye Bin
0c98057be9 nbd: Fix use-after-free in pid_show
I got issue as follows:
[  263.886511] BUG: KASAN: use-after-free in pid_show+0x11f/0x13f
[  263.888359] Read of size 4 at addr ffff8880bf0648c0 by task cat/746
[  263.890479] CPU: 0 PID: 746 Comm: cat Not tainted 4.19.90-dirty #140
[  263.893162] Call Trace:
[  263.893509]  dump_stack+0x108/0x15f
[  263.893999]  print_address_description+0xa5/0x372
[  263.894641]  kasan_report.cold+0x236/0x2a8
[  263.895696]  __asan_report_load4_noabort+0x25/0x30
[  263.896365]  pid_show+0x11f/0x13f
[  263.897422]  dev_attr_show+0x48/0x90
[  263.898361]  sysfs_kf_seq_show+0x24d/0x4b0
[  263.899479]  kernfs_seq_show+0x14e/0x1b0
[  263.900029]  seq_read+0x43f/0x1150
[  263.900499]  kernfs_fop_read+0xc7/0x5a0
[  263.903764]  vfs_read+0x113/0x350
[  263.904231]  ksys_read+0x103/0x270
[  263.905230]  __x64_sys_read+0x77/0xc0
[  263.906284]  do_syscall_64+0x106/0x360
[  263.906797]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reproduce this issue as follows:
1. nbd-server 8000 /tmp/disk
2. nbd-client localhost 8000 /dev/nbd1
3. cat /sys/block/nbd1/pid
Then trigger use-after-free in pid_show.

Reason is after do step '2', nbd-client progress is already exit. So
it's task_struct already freed.
To solve this issue, revert part of 6521d39a64b3's modify and remove
useless 'recv_task' member of nbd_device.

Fixes: 6521d39a64 ("nbd: Remove variable 'pid'")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20211020073959.2679255-1-yebin10@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 08:09:56 -06:00
Yu Kuai
8663b210f8 nbd: fix uaf in nbd_handle_reply()
There is a problem that nbd_handle_reply() might access freed request:

1) At first, a normal io is submitted and completed with scheduler:

internel_tag = blk_mq_get_tag -> get tag from sched_tags
 blk_mq_rq_ctx_init
  sched_tags->rq[internel_tag] = sched_tag->static_rq[internel_tag]
...
blk_mq_get_driver_tag
 __blk_mq_get_driver_tag -> get tag from tags
 tags->rq[tag] = sched_tag->static_rq[internel_tag]

So, both tags->rq[tag] and sched_tags->rq[internel_tag] are pointing
to the request: sched_tags->static_rq[internal_tag]. Even if the
io is finished.

2) nbd server send a reply with random tag directly:

recv_work
 nbd_handle_reply
  blk_mq_tag_to_rq(tags, tag)
   rq = tags->rq[tag]

3) if the sched_tags->static_rq is freed:

blk_mq_sched_free_requests
 blk_mq_free_rqs(q->tag_set, hctx->sched_tags, i)
  -> step 2) access rq before clearing rq mapping
  blk_mq_clear_rq_mapping(set, tags, hctx_idx);
  __free_pages() -> rq is freed here

4) Then, nbd continue to use the freed request in nbd_handle_reply

Fix the problem by get 'q_usage_counter' before blk_mq_tag_to_rq(),
thus request is ensured not to be freed because 'q_usage_counter' is
not zero.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210916141810.2325276-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 14:50:37 -06:00
Yu Kuai
3fe1db626a nbd: partition nbd_read_stat() into nbd_read_reply() and nbd_handle_reply()
Prepare to fix uaf in nbd_read_stat(), no functional changes.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20210916093350.1410403-7-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 14:50:37 -06:00
Yu Kuai
f52c0e0823 nbd: clean up return value checking of sock_xmit()
Check if sock_xmit() return 0 is useless because it'll never return
0, comment it and remove such checkings.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20210916093350.1410403-6-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 14:50:37 -06:00
Yu Kuai
0de2b7a4dd nbd: don't start request if nbd_queue_rq() failed
commit 6a468d5990 ("nbd: don't start req until after the dead
connection logic") move blk_mq_start_request() from nbd_queue_rq()
to nbd_handle_cmd() to skip starting request if the connection is
dead. However, request is still started in other error paths.

Currently, blk_mq_end_request() will be called immediately if
nbd_queue_rq() failed, thus start request in such situation is
useless. So remove blk_mq_start_request() from error paths in
nbd_handle_cmd().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20210916093350.1410403-5-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 14:50:37 -06:00
Yu Kuai
fcf3d633d8 nbd: check sock index in nbd_read_stat()
The sock that clent send request in nbd_send_cmd() and receive reply
in nbd_read_stat() should be the same.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20210916093350.1410403-4-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 14:50:37 -06:00
Yu Kuai
07175cb1ba nbd: make sure request completion won't concurrent
commit cddce01160 ("nbd: Aovid double completion of a request")
try to fix that nbd_clear_que() and recv_work() can complete a
request concurrently. However, the problem still exists:

t1                    t2                     t3

nbd_disconnect_and_put
 flush_workqueue
                      recv_work
                       blk_mq_complete_request
                        blk_mq_complete_request_remote -> this is true
                         WRITE_ONCE(rq->state, MQ_RQ_COMPLETE)
                          blk_mq_raise_softirq
                                             blk_done_softirq
                                              blk_complete_reqs
                                               nbd_complete_rq
                                                blk_mq_end_request
                                                 blk_mq_free_request
                                                  WRITE_ONCE(rq->state, MQ_RQ_IDLE)
  nbd_clear_que
   blk_mq_tagset_busy_iter
    nbd_clear_req
                                                   __blk_mq_free_request
                                                    blk_mq_put_tag
     blk_mq_complete_request -> complete again

There are three places where request can be completed in nbd:
recv_work(), nbd_clear_que() and nbd_xmit_timeout(). Since they
all hold cmd->lock before completing the request, it's easy to
avoid the problem by setting and checking a cmd flag.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20210916093350.1410403-3-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 14:50:37 -06:00