linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-09-03 17:40:04 +00:00

Author	SHA1	Message	Date
Zhihao Cheng	927cc5cec3	ubifs: ubifs_add_orphan: Fix a memory leak bug Memory leak occurs when files with extended attributes are added to orphan list. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Fixes: `988bec4131` ("ubifs: orphan: Handle xattrs like files") Signed-off-by: Richard Weinberger <richard@nod.at>	2020-03-30 23:02:30 +02:00
Zhihao Cheng	81423c7855	ubifs: ubifs_jnl_write_inode: Fix a memory leak bug When inodes with extended attributes are evicted, xent is not freed in one exit branch. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Fixes: `9ca2d73264` ("ubifs: Limit number of xattrs per inode") Signed-off-by: Richard Weinberger <richard@nod.at>	2020-03-30 23:00:36 +02:00
Linus Torvalds	59838093be	Driver core patches for 5.7-rc1 Here is the "big" set of driver core changes for 5.7-rc1. Nothing huge in here, just lots of little firmware core changes and use of new apis, a libfs fix, a debugfs api change, and some driver core deferred probe rework. All of these have been in linux-next for a while with no reported issues. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXoHLIg8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+yle2ACgjJJzRJl9Ckae3ms+9CS4OSFFZPsAoKSrXmFc Z7goYQdZo1zz8c0RYDrJ =Y91m -----END PGP SIGNATURE----- Merge tag 'driver-core-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the "big" set of driver core changes for 5.7-rc1. Nothing huge in here, just lots of little firmware core changes and use of new apis, a libfs fix, a debugfs api change, and some driver core deferred probe rework. All of these have been in linux-next for a while with no reported issues" * tag 'driver-core-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (44 commits) Revert "driver core: Set fw_devlink to "permissive" behavior by default" driver core: Set fw_devlink to "permissive" behavior by default driver core: Replace open-coded list_last_entry() driver core: Read atomic counter once in driver_probe_done() libfs: fix infoleak in simple_attr_read() driver core: Add device links from fwnode only for the primary device platform/x86: touchscreen_dmi: Add info for the Chuwi Vi8 Plus tablet platform/x86: touchscreen_dmi: Add EFI embedded firmware info support Input: icn8505 - Switch to firmware_request_platform for retreiving the fw Input: silead - Switch to firmware_request_platform for retreiving the fw selftests: firmware: Add firmware_request_platform tests test_firmware: add support for firmware_request_platform firmware: Add new platform fallback mechanism and firmware_request_platform() Revert "drivers: base: power: wakeup.c: Use built-in RCU list checking" drivers: base: power: wakeup.c: Use built-in RCU list checking component: allow missing unbind callback debugfs: remove return value of debugfs_create_file_size() debugfs: Check module state before warning in {full/open}_proxy_open() firmware: fix a double abort case with fw_load_sysfs_fallback arch_topology: Fix putting invalid cpu clk ...	2020-03-30 13:59:52 -07:00
Richard Weinberger	4ab25ac8b2	ubifs: Fix ubifs_tnc_lookup() usage in do_kill_orphans() Orphans are allowed to point to deleted inodes. So -ENOENT is not a fatal error. Reported-by: Кочетков Максим <fido_max@inbox.ru> Reported-and-tested-by: "Christian Berger" <Christian.Berger@de.bosch.com> Tested-by: Karl Olsen <karl@micro-technic.com> Tested-by: Jef Driesen <jef.driesen@niko.eu> Fixes: `ee1438ce5d` ("ubifs: Check link count of inodes when killing orphans.") Signed-off-by: Richard Weinberger <richard@nod.at>	2020-03-30 22:59:23 +02:00
Linus Torvalds	c271bdbf38	pstore updates - Improve failure paths (chenqiwu) - Fix ftrace position index (Vasily Averin) - Use proper flexible-array member (Gustavo A. R. Silva) -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl6Bc00WHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJvUOEACudEIGbwcMkBOtWOlEDGMzMx1q 4NMOuewTCAUYVw7JqAuXhilVbATbxX3r46wah3qoF786/94tNTL//43WuhjSHL+W 6aBpL42vm4i3NNPw9nHNU6bSDiXsdQRLg0pTrQHQDUzlpR63PrgoVUiyxwS5Hoaq Php2qyLT4gtWi6zMxFJtLAuzJ5ye24odr4jep/BdGifUY5NfMXPbnhqij/3YTLTV B0XjgCCn3a/WyXnV9iKBacHVbp9qMHfY9BHKvIhmaOFR7Ef6TOW/Q1QgO2mOMJnY mD8w9Usz9+DGxXzZJRPHTn2Pd0kelMONIq5Tbt0va617KgWiyAxMlXZHiu+ZSMWj rI4piMUP1aP2+bCm0ST9FpP8lMoDmI/Wl2GtgUwxdbdMF+tbLFMLd8Y+xLIu0WSR TCkzCtnM/3zU4dPeOoptiIxWYyyoy2RXEThmeOBnibOZkNstcaVY03rogahW/H6Z m97cKMnMoVrEVFSEZCzwHuWTaJ6LTUT4lXz2X8VN3Yro994qcQut0R1IChyCA0t6 SgzvspbzBxHoIPxw03Ef82D5fAiWgTsdQ2lfkUy2j6zeJS5S9o4kuAjTBFDnnDfv kqOTDy0CxzDp0nB2x6cjSxVCxxOxZUVXglQj8X0hLIi/smu3zTxBGZSoZ8BA+6xt mY5bx6opEG/Oteg73Q== =g7H5 -----END PGP SIGNATURE----- Merge tag 'pstore-v5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull pstore updates from Kees Cook: "These mostly some minor cleanups and a bug fix for an ftrace corner case: - Improve failure paths (chenqiwu) - Fix ftrace position index (Vasily Averin) - Use proper flexible-array member (Gustavo A. R. Silva)" * tag 'pstore-v5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: pstore/ram: Replace zero-length array with flexible-array member pstore: pstore_ftrace_seq_next should increase position index pstore/ram: remove unnecessary ramoops_unregister_dummy() pstore/platform: fix potential mem leak if pstore_init_fs failed	2020-03-30 13:09:34 -07:00
Linus Torvalds	377ad0c28c	Changes since last update: - Convert radix tree usage to XArray; - Fix shrink scan count on multiple filesystem instances; - Better handling for specific corrupted images; - Update my email address in MAINTAINERS. -----BEGIN PGP SIGNATURE----- iIwEABYIADQWIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCXoFRvBYcZ2FveGlhbmcy NUBodWF3ZWkuY29tAAoJEDk3MdwfteYEswMBAMtsyo6TqWPToKt/eAJMbvt5vRGf y4XGEx67a1Ds7/LqAQCtOs+0HMWlK3F2DDljpA7Tg2QvRBJwFlhET6YZOAIcDQ== =XsWD -----END PGP SIGNATURE----- Merge tag 'erofs-for-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs updates from Gao Xiang: "Updates with a XArray adaptation, several fixes for shrinker and corrupted images are ready for this cycle. All commits have been stress tested with no noticeable smoke out and have been in linux-next as well. Summary: - Convert radix tree usage to XArray - Fix shrink scan count on multiple filesystem instances - Better handling for specific corrupted images - Update my email address in MAINTAINERS" * tag 'erofs-for-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: MAINTAINERS: erofs: update my email address erofs: handle corrupted images whose decompressed size less than it'd be erofs: use LZ4_decompress_safe() for full decoding erofs: correct the remaining shrink objects erofs: convert workstn to XArray	2020-03-30 12:49:33 -07:00
Linus Torvalds	481ed297d9	This has been a busy cycle for documentation work. Highlights include: - Lots of RST conversion work by Mauro, Daniel ALmeida, and others. Maybe someday we'll get to the end of this stuff...maybe... - Some organizational work to bring some order to the core-api manual. - Various new docs and additions to the existing documentation. - Typo fixes, warning fixes, ... -----BEGIN PGP SIGNATURE----- iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl6BLf4PHGNvcmJldEBs d24ubmV0AAoJEBdDWhNsDH5YLhkIAIhcg6gxp0oZZ3KDfQyhvej0EWQGVDNkmloQ O1VOSV3RJsZL9HwN9xSNnNfN5+hw5RUYVbn1s201uj6kovZY9qcTpHP2LCizUeGb eFkSTmzkyAuAbJjuVLgMPDerJPEew0HnudiToeSpQeoIL1WB6YGd4/5H/cN1KLex 8ggjllcY0wOgbiFffmK6+tavDv7vT0lKTdwKRYh2nxu7zrPVVd1ZnW+RtntdTVQt i+xwV6/YdWtg5C53IwBPpeyubX40vqaIjU8rzpLq5SCVbsZN14sSR709m1AYCOK0 i4VDWEhfA2XBi6Nycl5U0czuGziwoHrTgSCkS1mmSDujnpgfKM8= =6YOS -----END PGP SIGNATURE----- Merge tag 'docs-5.7' of git://git.lwn.net/linux Pull documentation updates from Jonathan Corbet: "This has been a busy cycle for documentation work. Highlights include: - Lots of RST conversion work by Mauro, Daniel ALmeida, and others. Maybe someday we'll get to the end of this stuff...maybe... - Some organizational work to bring some order to the core-api manual. - Various new docs and additions to the existing documentation. - Typo fixes, warning fixes, ..." * tag 'docs-5.7' of git://git.lwn.net/linux: (123 commits) Documentation: x86: exception-tables: document CONFIG_BUILDTIME_TABLE_SORT MAINTAINERS: adjust to filesystem doc ReST conversion docs: deprecated.rst: Add BUG()-family doc: zh_CN: add translation for virtiofs doc: zh_CN: index files in filesystems subdirectory docs: locking: Drop :c:func: throughout docs: locking: Add 'need' to hardirq section docs: conf.py: avoid thousands of duplicate label warning on Sphinx docs: prevent warnings due to autosectionlabel docs: fix reference to core-api/namespaces.rst docs: fix pointers to io-mapping.rst and io_ordering.rst files Documentation: Better document the softlockup_panic sysctl docs: hw-vuln: tsx_async_abort.rst: get rid of an unused ref docs: perf: imx-ddr.rst: get rid of a warning docs: filesystems: fuse.rst: supress a Sphinx warning docs: translations: it: avoid duplicate refs at programming-language.rst docs: driver.rst: supress two ReSt warnings docs: trace: events.rst: convert some new stuff to ReST format Documentation: Add io_ordering.rst to driver-api manual Documentation: Add io-mapping.rst to driver-api manual ...	2020-03-30 12:45:23 -07:00
Linus Torvalds	e59cd88028	for-5.7/io_uring-2020-03-29 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl6BJEMQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpie7D/9gN4zhykYDfcgamfxMtTbpla2PdTnWoJxP fjy/Nx2FySakmccaiCGQSQ1rzD1L67UQkJgEH6hPTomJvA4FaOmJ+ZSaExMy55LH ZT+nD3zQ9SCuA0DEpfxbsCP1tbnoXSMQNt8Tyh0x8PAoxp5bI0eRczOju1QWLWTS tjBEMZNipN6krrV9RPWT0S5Z31/yGr/sXprCSHFV9Ypzwrx58Tj2i6F9gR7FVbLs nV2/O8taEn0sMQIz8TVHKol/TBalluGrC4M/bOeS3faP3BPN4TT24Gtc0LAKEibk F49/SX7FzwhOdl43Bdkbe2bbL86p+zOLSf0IMBwMm0DJl4aiOljRUYTSYRolgGgm Ebw9QhemTwbxxeD2nEriA4EAeYvTx69RDlN2eVilwwfJ48Xz9fVm3GNYG7LISeON k3/TyZOBQH2SZ2Hc3oF2Mq9j1UPHXZHUUsUNlNcN+aM9SFHcWkRi6xZWemTJHJZ4 zFss5RZHo0+RLBa8rrx8xaO8iWrc73+FuRhr9eSsmyPIj+OZ4ezEFRRRHwtk2fgv dZvD413AyCI1c+3LlBusESMsrtXyY8p9O9buNTzHy3ZUtHe0ERmYV2m/a83A5pXo Kia/5aJbPIC61bAkCCkiVo+W9OASJ6o5+3CXl5sM9lGTbDXjcofzewmd+RHPestx xVbzeR9UIw== =bYLJ -----END PGP SIGNATURE----- Merge tag 'for-5.7/io_uring-2020-03-29' of git://git.kernel.dk/linux-block Pull io_uring updates from Jens Axboe: "Here are the io_uring changes for this merge window. Light on new features this time around (just splice + buffer selection), lots of cleanups, fixes, and improvements to existing support. In particular, this contains: - Cleanup fixed file update handling for stack fallback (Hillf) - Re-work of how pollable async IO is handled, we no longer require thread offload to handle that. Instead we rely using poll to drive this, with task_work execution. - In conjunction with the above, allow expendable buffer selection, so that poll+recv (for example) no longer has to be a split operation. - Make sure we honor RLIMIT_FSIZE for buffered writes - Add support for splice (Pavel) - Linked work inheritance fixes and optimizations (Pavel) - Async work fixes and cleanups (Pavel) - Improve io-wq locking (Pavel) - Hashed link write improvements (Pavel) - SETUP_IOPOLL\|SETUP_SQPOLL improvements (Xiaoguang)" * tag 'for-5.7/io_uring-2020-03-29' of git://git.kernel.dk/linux-block: (54 commits) io_uring: cleanup io_alloc_async_ctx() io_uring: fix missing 'return' in comment io-wq: handle hashed writes in chains io-uring: drop 'free_pfile' in struct io_file_put io-uring: drop completion when removing file io_uring: Fix ->data corruption on re-enqueue io-wq: close cancel gap for hashed linked work io_uring: make spdxcheck.py happy io_uring: honor original task RLIMIT_FSIZE io-wq: hash dependent work io-wq: split hashing and enqueueing io-wq: don't resched if there is no work io-wq: remove duplicated cancel code io_uring: fix truncated async read/readv and write/writev retry io_uring: dual license io_uring.h uapi header io_uring: io_uring_enter(2) don't poll while SETUP_IOPOLL\|SETUP_SQPOLL enabled io_uring: Fix unused function warnings io_uring: add end-of-bits marker and build time verify it io_uring: provide means of removing buffers io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_RECVMSG ...	2020-03-30 12:18:49 -07:00
Linus Torvalds	10f36b1e80	for-5.7/block-2020-03-29 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl6BJCoQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpvziEACqQC+QRKiqR6X5yaPWJ9LqjKE7lfI1PUb7 0a1z1mKuf8d6z0qNleUwdSOEaS5zJiswou2K8GLvEtTQH41QYsQkxc9GLjAyTveK szAyzZaa3BNUy9hkczm9i2arv3fI8XoTE3JvRM0e9wL8fBJDYCtKtHFJvF4hisOQ ydaJlU6tcwzd9bdV7K5dLwBxu3AeAJjzS3Tyfw25u9N9O/btUxJ91RTqBb2+Xeoz AVasfRlAqf/CzdjxCCmDgWE2QM4852pAeQ7UJJBGISNWNoiwkezMg+6HD0jEOLee bQ8uDyQdihIWTY+/zQasotX8/71uLV8QgtjWLXR9zrjrubIBWHGzoWSQ4kPg5DfQ bJmKO0VvWN2sshZEpWvzzAFGYxZViNphbK2Pb4hKOcv7jtMcC8mmEogh/7EqbD/n KB3IM9qVoXM8INm5o0dTy5uDRJxiHiHYkqsZaKz55BB/R4Geym5TINT3nXgxhQrn JoSwp4zdm3/NJOySruDi2eETqWJC2bsz3FsQSyCQTPOuP0nLtFKBb1UKHpmYTCXG H4LCyCKFJ6s006qBcdaNPZBw1mrSNwoxEulHnpYA4BFfPeXi72yrnMZQkdwWONpW LIVuD0hBm8X/pulbvEEdjzXBqZVkqK3xFX+uX5+bnwwaUKddXAC/h9SQKpBP2Mbb AeZToMklKw== =6Glq -----END PGP SIGNATURE----- Merge tag 'for-5.7/block-2020-03-29' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: - Online capacity resizing (Balbir) - Number of hardware queue change fixes (Bart) - null_blk fault injection addition (Bart) - Cleanup of queue allocation, unifying the node/no-node API (Christoph) - Cleanup of genhd, moving code to where it makes sense (Christoph) - Cleanup of the partition handling code (Christoph) - disk stat fixes/improvements (Konstantin) - BFQ improvements (Paolo) - Various fixes and improvements * tag 'for-5.7/block-2020-03-29' of git://git.kernel.dk/linux-block: (72 commits) block: return NULL in blk_alloc_queue() on error block: move bio_map_* to blk-map.c Revert "blkdev: check for valid request queue before issuing flush" block: simplify queue allocation bcache: pass the make_request methods to blk_queue_make_request null_blk: use blk_mq_init_queue_data block: add a blk_mq_init_queue_data helper block: move the ->devnode callback to struct block_device_operations block: move the part_stat* helpers from genhd.h to a new header block: move block layer internals out of include/linux/genhd.h block: move guard_bio_eod to bio.c block: unexport get_gendisk block: unexport disk_map_sector_rcu block: unexport disk_get_part block: mark part_in_flight and part_in_flight_rw static block: mark block_depr static block: factor out requeue handling from dispatch code block/diskstats: replace time_in_queue with sum of request times block/diskstats: accumulate all per-cpu counters in one pass block/diskstats: more accurate approximation of io_ticks for slow disks ...	2020-03-30 11:20:13 -07:00
Bob Peterson	75b46c437f	gfs2: Fix oversight in gfs2_ail1_flush Ordinarily, function gfs2_ail1_start_one issues a write request for one item on the ail1 list, then returns -EBUSY. This makes the caller, gfs2_ail1_flush, loop around and start another. However, it was not clearing the -EBUSY return code each time through the loop. So on rare occasions, like when the wbc runs out of nr_to_write, it remained set to -EBUSY, which triggered an error and withdraw. This patch sets the return code to 0 each time through the restart loop so this won't happen anymore. Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2020-03-30 07:55:35 -05:00
Luis Henriques	ef9157259f	ceph: fix snapshot directory timestamps The .snap directory timestamps are kept at 0 (1970-01-01 00:00), which isn't consistent with what the fuse client does. This patch makes the behaviour consistent, by setting these timestamps (atime, btime, ctime, mtime) to those of the parent directory. Cc: Marc Roos <M.Roos@f1-outsourcing.eu> Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:43 +02:00
Yan, Zheng	9bccb76574	ceph: wait for async creating inode before requesting new max size ceph_check_caps() can't request new max size for async creating inode. This may make ceph_get_caps() loop busily until getting reply of the async create. Also, wait for async creating reply before calling ceph_renew_caps(). Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:43 +02:00
Yan, Zheng	0aa971b6fd	ceph: don't skip updating wanted caps when cap is stale 1. try_get_cap_refs() fails to get caps and finds that mds_wanted does not include what it wants. It returns -ESTALE. 2. ceph_get_caps() calls ceph_renew_caps(). ceph_renew_caps() finds that inode has cap, so it calls ceph_check_caps(). 3. ceph_check_caps() finds that issued caps (without checking if it's stale) already includes caps wanted by open file, so it skips updating wanted caps. Above events can cause an infinite loop inside ceph_get_caps(). Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:43 +02:00
Yan, Zheng	42d70f8e31	ceph: request new max size only when there is auth cap When there is no auth cap, check_max_size() can't do anything and may cause an infinite loop inside ceph_get_caps(). Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:43 +02:00
Yan, Zheng	546d402085	ceph: cleanup return error of try_get_cap_refs() Returns 0 if caps were not able to be acquired (yet), 1 if cap acquisition succeeded, or a negative error code. There are 3 special error codes: -EAGAIN: need to sleep but non-blocking is specified -EFBIG: ask caller to call check_max_size() and try again. -ESTALE: ask caller to call ceph_renew_caps() and try again. [ jlayton: add WARN_ON_ONCE check for -EAGAIN ] Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:42 +02:00
Qiujun Huang	c6d5029603	ceph: return ceph_mdsc_do_request() errors from __get_parent() Return the error returned by ceph_mdsc_do_request(). Otherwise, r_target_inode ends up being NULL this ends up returning ENOENT regardless of the error. Signed-off-by: Qiujun Huang <hqjagain@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:42 +02:00
Yan, Zheng	bf73c62e7f	ceph: check all mds' caps after page writeback If an inode has caps from multiple mds's, the following can happen: - non-auth mds revokes Fsc. Fcb is used, so page writeback is queued. - when writeback finishes, ceph_check_caps() is called with auth only flag. ceph_check_caps() invalidates pagecache, but skips checking any non-auth caps. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:42 +02:00
Yan, Zheng	11ba6b9cee	ceph: update i_requested_max_size only when sending cap msg to auth mds Non-auth mds can't do anything to 'update max' cap message. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:42 +02:00
Yan, Zheng	135e671e54	ceph: simplify calling of ceph_get_fmode() Originally, calling ceph_get_fmode() for open files is by thread that handles request reply. There is a small window between updating caps and and waking the request initiator. We need to prevent ceph_check_caps() from releasing wanted caps in the window. Previous patches made fill_inode() call __ceph_touch_fmode() for open file requests. This prevented ceph_check_caps() from releasing wanted caps for 'caps_wanted_delay_min' seconds, enough for request initiator to get woken up and call ceph_get_fmode(). This allows us to now call ceph_get_fmode() in ceph_open() instead. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:42 +02:00
Yan, Zheng	a0d93e327f	ceph: remove delay check logic from ceph_check_caps() __ceph_caps_file_wanted() already checks 'caps_wanted_delay_min' and 'caps_wanted_delay_max'. There is no need to duplicate the logic in ceph_check_caps() and __send_cap() Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:42 +02:00
Yan, Zheng	719a2514e9	ceph: consider inode's last read/write when calculating wanted caps Add i_last_rd and i_last_wr to ceph_inode_info. These fields are used to track the last time the client acquired read/write caps for the inode. If there is no read/write on an inode for 'caps_wanted_delay_max' seconds, __ceph_caps_file_wanted() does not request caps for read/write even there are open files. Call __ceph_touch_fmode() for dir operations. __ceph_caps_file_wanted() calculates dir's wanted caps according to last dir read/modification. If there is recent dir read, dir inode wants CEPH_CAP_ANY_SHARED caps. If there is recent dir modification, also wants CEPH_CAP_FILE_EXCL. Readdir is a special case. Dir inode wants CEPH_CAP_FILE_EXCL after readdir, as with that, modifications do not need to release CEPH_CAP_FILE_SHARED or invalidate all dentry leases issued by readdir. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:42 +02:00
Yan, Zheng	c0e385b106	ceph: always renew caps if mds_wanted is insufficient Original code only renews caps for inodes with CEPH_I_CAP_DROPPED flag, which indicates that mds has closed the session and caps were dropped. Remove this flag in preparation for not requesting caps for idle open files. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:42 +02:00
Yan, Zheng	3313f66a57	ceph: update dentry lease for async create Otherwise ceph_d_delete() may return 1 for the dentry, which makes dput() prune the dentry and clear parent dir's complete flag. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:42 +02:00
Jeff Layton	9a8d03ca2e	ceph: attempt to do async create when possible With the Octopus release, the MDS will hand out directory create caps. If we have Fxc caps on the directory, and complete directory information or a known negative dentry, then we can return without waiting on the reply, allowing the open() call to return very quickly to userland. We use the normal ceph_fill_inode() routine to fill in the inode, so we have to gin up some reply inode information with what we'd expect the newly-created inode to have. The client assumes that it has a full set of caps on the new inode, and that the MDS will revoke them when there is conflicting access. This functionality is gated on the wsync/nowsync mount options. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:42 +02:00
Jeff Layton	785892fe88	ceph: cache layout in parent dir on first sync create If a create is done, then typically we'll end up writing to the file soon afterward. We don't want to wait for the reply before doing that when doing an async create, so that means we need the layout for the new file before we've gotten the response from the MDS. All files created in a directory will initially inherit the same layout, so copy off the requisite info from the first synchronous create in the directory, and save it in a new i_cached_layout field. Zero out the layout when we lose Dc caps in the dir. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:42 +02:00
Jeff Layton	6deb8008a8	ceph: add new MDS req field to hold delegated inode number Add new request field to hold the delegated inode number. Encode that into the message when it's set. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:42 +02:00
Jeff Layton	d484648787	ceph: decode interval_sets for delegated inos Starting in Octopus, the MDS will hand out caps that allow the client to do asynchronous file creates under certain conditions. As part of that, the MDS will delegate ranges of inode numbers to the client. Add the infrastructure to decode these ranges, and stuff them into an xarray for later consumption by the async creation code. Because the xarray code currently only handles unsigned long indexes, and those are 32-bits on 32-bit arches, we only enable the decoding when running on a 64-bit arch. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:42 +02:00
Jeff Layton	966c716018	ceph: make ceph_fill_inode non-static Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:42 +02:00
Jeff Layton	2ccb45462a	ceph: perform asynchronous unlink if we have sufficient caps The MDS is getting a new lock-caching facility that will allow it to cache the necessary locks to allow asynchronous directory operations. Since the CEPH_CAP_FILE_* caps are currently unused on directories, we can repurpose those bits for this purpose. When performing an unlink, if we have Fx on the parent directory, and CEPH_CAP_DIR_UNLINK (aka Fr), and we know that the dentry being removed is the primary link, then then we can fire off an unlink request immediately and don't need to wait on reply before returning. In that situation, just fix up the dcache and link count and return immediately after issuing the call to the MDS. This does mean that we need to hold an extra reference to the inode being unlinked, and extra references to the caps to avoid races. Those references are put and error handling is done in the r_callback routine. If the operation ends up failing, then set a writeback error on the directory inode, and the inode itself that can be fetched later by an fsync on the dir. The behavior of dir caps is slightly different from caps on normal files. Because these are just considered an optimization, if the session is reconnected, we will not automatically reclaim them. They are instead considered lost until we do another synchronous op in the parent directory. Async dirops are enabled via the "nowsync" mount option, which is patterned after the xfs "wsync" mount option. For now, the default is "wsync", but eventually we may flip that. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:42 +02:00
Yan, Zheng	173e70e8ac	ceph: don't take refs to want mask unless we have all bits If we don't have all of the cap bits for the want mask in try_get_cap_refs, then just take refs on the need bits. Signed-off-by: "Yan, Zheng" <ukernel@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:41 +02:00
Jeff Layton	a25949b990	ceph: cap tracking for async directory operations Track and correctly handle directory caps for asynchronous operations. Add aliases for Frc caps that we now designate at Dcu caps (when dealing with directories). Unlike file caps, we don't reclaim these when the session goes away, and instead preemptively release them. In-flight async dirops are instead handled during reconnect phase. The client needs to re-do a synchronous operation in order to re-get directory caps. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:41 +02:00
Jeff Layton	40dcf75e82	ceph: make __take_cap_refs non-static Rename it to ceph_take_cap_refs and make it available to other files. Also replace a comment with a lockdep assertion. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:41 +02:00
Jeff Layton	891f3f5a6a	ceph: add infrastructure for waiting for async create to complete When we issue an async create, we must ensure that any later on-the-wire requests involving it wait for the create reply. Expand i_ceph_flags to be an unsigned long, and add a new bit that MDS requests can wait on. If the bit is set in the inode when sending caps, then don't send it and just return that it has been delayed. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:41 +02:00
Jeff Layton	f5e17aed3a	ceph: track primary dentry link Newer versions of the MDS will flag a dentry as "primary". In later patches, we'll need to consult this info, so track it in di->flags. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:41 +02:00
Jeff Layton	3bb48b4142	ceph: add flag to designate that a request is asynchronous ...and ensure that such requests are never queued. The MDS has need to know that a request is asynchronous so add flags and proper infrastructure for that. Also, delegated inode numbers and directory caps are associated with the session, so ensure that async requests are always transmitted on the first attempt and are never queued to wait for session reestablishment. If it does end up looking like we'll need to queue the request, then have it return -EJUKEBOX so the caller can reattempt with a synchronous request. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:41 +02:00
Jeff Layton	c7e4f85ce9	ceph: more caps.c lockdep assertions Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:41 +02:00
Jeff Layton	e8a4d26771	ceph: clean up kick_flushing_inode_caps() The last thing that this function does is release i_ceph_lock, so have the caller do that instead. Add a lockdep assertion to ensure that the function is always called with i_ceph_lock held. Change the prototype to take a ceph_inode_info pointer and drop the separate mdsc argument as we can get that from the session. While at it, make it non-static. We'll need this to kick any flushing caps once the create reply comes in. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:41 +02:00
Xiubo Li	8ccf7fcce1	ceph: return ETIMEDOUT errno to userland when request timed out req->r_timeout is only used during mounting, so this error will be more accurate. URL: https://tracker.ceph.com/issues/44215 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:41 +02:00
Luis Henriques	1b0c3b9f91	ceph: re-org copy_file_range and fix some error paths This patch re-organizes copy_file_range, trying to fix a few issues in the error handling. Here's the summary: - Abort copy if initial do_splice_direct() returns fewer bytes than requested. - Move the 'size' initialization (with i_size_read()) further down in the code, after the initial call to do_splice_direct(). This avoids issues with a possibly stale value if a manual copy is done. - Move the object copy loop into a separate function. This makes it easier to handle errors (e.g, dirtying caps and updating the MDS metadata if only some objects have been copied before an error has occurred). - Added calls to ceph_oloc_destroy() to avoid leaking memory with src_oloc and dst_oloc - After the object copy loop, the new file size to be reported to the MDS (if there's file size change) is now the actual file size, and not the size after an eventual extra manual copy. - Added a few dout() to show the number of bytes copied in the two manual copies and in the object copy loop. Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:41 +02:00
Jeff Layton	058daab79d	ceph: move to a dedicated slabcache for mds requests On my machine (x86_64) this struct is 952 bytes, which gets rounded up to 1024 by kmalloc. Move this to a dedicated slabcache, so we can allocate them without the extra 72 bytes of overhead per. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:41 +02:00
Jeff Layton	c36d641493	ceph: reorganize fields in ceph_mds_request This shrinks the struct size by 16 bytes. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:41 +02:00
Andreas Gruenbacher	cb03c14390	ceph: switch to page_mkwrite_check_truncate in ceph_page_mkwrite Use the "page has been truncated" logic in page_mkwrite_check_truncate instead of reimplementing it here. Other than with the existing code, fail with -EFAULT / VM_FAULT_NOPAGE when page_offset(page) == size here as well, as should be expected. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:41 +02:00
Gustavo A. R. Silva	f682dc713c	ceph: replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit `7649773293` ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:40 +02:00
Yan, Zheng	bbb480ab05	ceph: check if file lock exists before sending unlock request When a process exits, kernel closes its files. locks_remove_file() is called to remove file locks on these files. locks_remove_file() tries unlocking files even there is no file lock. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:40 +02:00
Xiubo Li	5107d7d505	ceph: move ceph_osdc_{read,write}pages to ceph.ko Since these helpers are only used by ceph.ko, move them there and rename them with _sync_ qualifiers. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:40 +02:00
Jeff Layton	70837470b4	ceph: don't ClearPageChecked in ceph_invalidatepage() CephFS doesn't set this bit to begin with, so there should be no need to clear it. Reported-by: David Howells <dhowells@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:40 +02:00
Ilya Dryomov	072eaf3c0f	libceph: drop CEPH_DEFINE_SHOW_FUNC Although CEPH_DEFINE_SHOW_FUNC is much older, it now duplicates DEFINE_SHOW_ATTRIBUTE from linux/seq_file.h. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2020-03-30 12:42:40 +02:00
Yan, Zheng	525d15e8e5	ceph: check inode type for CEPH_CAP_FILE_{CACHE,RD,REXTEND,LAZYIO} These bits will have new meaning for directory inodes. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:40 +02:00
Jeff Layton	f85122afeb	ceph: add refcounting for Fx caps In future patches we'll be taking and relying on Fx caps. Add proper refcounting for them. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:40 +02:00
Jeff Layton	3db0a2fc56	ceph: register MDS request with dir inode from the start When the unsafe reply to a request comes in, the request is put on the r_unsafe_dir inode's list. In future patches, we're going to need to wait on requests that may not have gotten an unsafe reply yet. Change __register_request to put the entry on the dir inode's list when the pointer is set in the request, and don't check the CEPH_MDS_R_GOT_UNSAFE flag when unregistering it. The only place that uses this list today is fsync codepath, and with the coming changes, we'll want to wait on all operations whether it has gotten an unsafe reply or not. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-30 12:42:39 +02:00
Nathan Chancellor	6def1a1d2d	fanotify: Fix the checks in fanotify_fsid_equal Clang warns: fs/notify/fanotify/fanotify.c:28:23: warning: self-comparison always evaluates to true [-Wtautological-compare] return fsid1->val[0] == fsid1->val[0] && fsid2->val[1] == fsid2->val[1]; ^ fs/notify/fanotify/fanotify.c:28:57: warning: self-comparison always evaluates to true [-Wtautological-compare] return fsid1->val[0] == fsid1->val[0] && fsid2->val[1] == fsid2->val[1]; ^ 2 warnings generated. The intention was clearly to compare val[0] and val[1] in the two different fsid structs. Fix it otherwise this function always returns true. Fixes: `afc894c784` ("fanotify: Store fanotify handles differently") Link: https://github.com/ClangBuiltLinux/linux/issues/952 Link: https://lore.kernel.org/r/20200327171030.30625-1-natechancellor@gmail.com Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Jan Kara <jack@suse.cz>	2020-03-30 12:40:53 +02:00
David S. Miller	f0b5989745	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Minor comment conflict in mac80211. Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-29 21:25:29 -07:00
Steve French	f460c50274	cifs: update internal module version number To 2.26 Signed-off-by: Steve French <stfrench@microsoft.com>	2020-03-29 16:59:31 -05:00
Long Li	3946d0d04b	cifs: Allocate encryption header through kmalloc When encryption is used, smb2_transform_hdr is defined on the stack and is passed to the transport. This doesn't work with RDMA as the buffer needs to be DMA'ed. Fix it by using kmalloc. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-03-29 16:42:54 -05:00
Long Li	4ebb8795a7	cifs: smbd: Check and extend sender credits in interrupt context When a RDMA packet is received and server is extending send credits, we should check and unblock senders immediately in IRQ context. Doing it in a worker queue causes unnecessary delay and doesn't save much CPU on the receive path. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-03-29 16:42:36 -05:00
Long Li	f7950cb05d	cifs: smbd: Calculate the correct maximum packet size for segmented SMBDirect send/receive The packet size needs to take account of SMB2 header size and possible encryption header size. This is only done when signing is used and it is for RDMA send/receive, not read/write. Also remove the dead SMBD code in smb2_negotiate_r(w)size. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-03-29 16:41:49 -05:00
Andy Shevchenko	b58c4e9619	hostfs: Use kasprintf() instead of fixed buffer formatting Improve readability and maintainability by replacing a hardcoded string allocation and formatting by the use of the kasprintf() helper. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2020-03-29 23:23:00 +02:00
Josh Triplett	df41460a21	ext4: fix incorrect group count in ext4_fill_super error message ext4_fill_super doublechecks the number of groups before mounting; if that check fails, the resulting error message prints the group count from the ext4_sb_info sbi, which hasn't been set yet. Print the freshly computed group count instead (which at that point has just been computed in "blocks_count"). Signed-off-by: Josh Triplett <josh@joshtriplett.org> Fixes: `4ec1102813` ("ext4: Add sanity checks for the superblock before mounting the filesystem") Link: https://lore.kernel.org/r/8b957cd1513fcc4550fe675c10bcce2175c33a49.1585431964.git.josh@joshtriplett.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-03-28 23:11:39 -04:00
Josh Triplett	b9c538da4e	ext4: fix incorrect inodes per group in error message If ext4_fill_super detects an invalid number of inodes per group, the resulting error message printed the number of blocks per group, rather than the number of inodes per group. Fix it to print the correct value. Fixes: `cd6bb35bf7` ("ext4: use more strict checks for inodes_per_block on mount") Link: https://lore.kernel.org/r/8be03355983a08e5d4eed480944613454d7e2550.1585434649.git.josh@joshtriplett.org Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-03-28 23:11:12 -04:00
Ritesh Harjani	626b035b81	ext4: don't set dioread_nolock by default for blocksize < pagesize Currently on calling echo 3 > drop_caches on host machine, we see FS corruption in the guest. This happens on Power machine where blocksize < pagesize. So as a temporary workaound don't enable dioread_nolock by default for blocksize < pagesize until we identify the root cause. Also emit a warning msg in case if this mount option is manually enabled for blocksize < pagesize. Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20200327200744.12473-1-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-03-28 23:11:04 -04:00
Brian Foster	d4bc4c5fd1	xfs: return locked status of inode buffer on xfsaild push If the inode buffer backing a particular inode is locked, xfs_iflush() returns -EAGAIN and xfs_inode_item_push() skips the inode. It still returns success to xfsaild, however, which bypasses the xfsaild backoff heuristic. Update xfs_inode_item_push() to return locked status if the inode buffer couldn't be locked. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-03-28 09:40:12 -07:00
Brian Foster	8d3d7e2b35	xfs: trylock underlying buffer on dquot flush A dquot flush currently blocks on the buffer lock for the underlying dquot buffer. In turn, this causes xfsaild to block rather than continue processing other items in the meantime. Update xfs_qm_dqflush() to trylock the buffer, similar to how inode buffers are handled, and return -EAGAIN if the lock fails. Fix up any callers that don't currently handle the error properly. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-03-28 09:40:11 -07:00
Kaixu Xia	63337b63e7	xfs: remove unnecessary ternary from xfs_create Since the "no-allocation" reservations for file creations has been removed, the resblks value should be larger than zero, so remove unnecessary ternary conditional. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: s/judgment/ternary/] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-03-28 09:40:11 -07:00
Trond Myklebust	1de3af9883	NFS: Remove unused FLUSH_SYNC support in nfs_initiate_pgio() If the FLUSH_SYNC flag is set, nfs_initiate_pgio() will currently wait for completion, and then return the status of the I/O operation. What we actually want to report in nfs_pageio_doio() is whether or not the RPC call was launched successfully, whereas actual I/O status is intended handled in the reply callbacks. Since FLUSH_SYNC is never set by any of the callers anyway, let's just remove that code altogether. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-28 11:54:20 -04:00
Thomas Gleixner	f1e67e355c	fs/buffer: Make BH_Uptodate_Lock bit_spin_lock a regular spinlock_t Bit spinlocks are problematic if PREEMPT_RT is enabled, because they disable preemption, which is undesired for latency reasons and breaks when regular spinlocks are taken within the bit_spinlock locked region because regular spinlocks are converted to 'sleeping spinlocks' on RT. PREEMPT_RT replaced the bit spinlocks with regular spinlocks to avoid this problem. The replacement was done conditionaly at compile time, but Christoph requested to do an unconditional conversion. Jan suggested to move the spinlock into a existing padding hole which avoids a size increase of struct buffer_head on production kernels. As a benefit the lock gains lockdep coverage. [ bigeasy: Remove the wrapper and use always spinlock_t and move it into the padding hole ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@infradead.org> Link: https://lkml.kernel.org/r/20191118132824.rclhrbujqh4b4g4d@linutronix.de	2020-03-28 13:21:08 +01:00
Trond Myklebust	cbd7be43c4	pNFS/flexfiles: Specify the layout segment range in LAYOUTGET Move from requesting only full file layout segments, to requesting layout segments that match our I/O size. This means the server is still free to return a full file layout, but we will no longer error out if it does not. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-27 16:34:35 -04:00
Trond Myklebust	e70430d939	pNFS/flexfiles: remove requirement for whole file layouts Remove the requirement that the server always sends whole file layouts. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-27 16:34:35 -04:00
Trond Myklebust	e1e54ab710	pNFS/flexfiles: Check the layout segment range before doing I/O When starting to read or write with a layout segment, check that the range matches our request. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-27 16:34:35 -04:00
Trond Myklebust	660d1eb223	pNFS/flexfile: Don't merge layout segments if the mirrors don't match Check that the number of mirrors, and the mirror information matches before deciding to merge layout segments in pNFS/flexfiles. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-27 16:34:35 -04:00
Trond Myklebust	e18c18ebd7	NFS/pNFS: Fix pnfs_layout_mark_request_commit() invalid layout segment handling Fix up pnfs_layout_mark_request_commit() to alway reschedule the write if the layout segment is invalid. Also minor cleanup. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-27 16:34:35 -04:00
Trond Myklebust	c84bea5944	NFS/pNFS: Simplify bucket layout segment reference counting Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-27 16:34:35 -04:00
Trond Myklebust	9c455a8c1e	NFS/pNFS: Clean up pNFS commit operations Move the pNFS commit related operations into a separate structure that can be carried by the pnfs_ds_commit_info. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-27 16:34:35 -04:00
Trond Myklebust	0aa647b736	NFS: Remove bucket array from struct pnfs_ds_commit_info Remove the unused bucket array in struct pnfs_ds_commit_info. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-27 16:34:35 -04:00
Trond Myklebust	fb6b53ba40	NFS/pNFS: Add a helper pnfs_generic_search_commit_reqs() Lift filelayout_search_commit_reqs() into the generic pnfs/nfs code, and add support for commit arrays. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-27 16:34:35 -04:00
Trond Myklebust	ba827c9abb	pNFS: Enable per-layout segment commit structures Enable adding and lookup of per-layout segment commits in filelayout and flexfilelayout. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-27 16:34:34 -04:00
Trond Myklebust	a9901899b6	pNFS: Add infrastructure for cleaning up per-layout commit structures Ensure that both the file and flexfiles layout types clean up when freeing the layout segments. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-27 16:34:34 -04:00
Trond Myklebust	e3b9f7e60b	NFS/pNFS: Support commit arrays in nfs_clear_pnfs_ds_commit_verifiers() Add support for scanning the full list of per-layout segment commit arrays to nfs_clear_pnfs_ds_commit_verifiers(). Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-27 16:34:34 -04:00
Trond Myklebust	1f28476dcb	NFS: Fix O_DIRECT commit verifier handling Instead of trying to save the commit verifiers and checking them against previous writes, adopt the same strategy as for buffered writes, of just checking the verifiers at commit time. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-27 16:34:34 -04:00
Trond Myklebust	fb5f7f20cd	NFS: commit errors should be fatal Fix the O_DIRECT code to avoid retries if the COMMIT fails with a fatal error. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-27 16:34:34 -04:00
Trond Myklebust	18f4129696	NFS/pNFS: Allow O_DIRECT to release the DS commitinfo Add a pNFS callback to allow the O_DIRECT code to release the DS commitinfo when freeing the dreq. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-27 16:34:34 -04:00
Trond Myklebust	0cb1f6df8a	pNFS: Support per-layout segment commits in pnfs_generic_commit_pagelist() Add support for scanning the full list of per-layout segment commit arrays to pnfs_generic_commit_pagelist(). Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-27 16:34:34 -04:00
Trond Myklebust	fce9ed0302	pNFS: Support per-layout segment commits in pnfs_generic_recover_commit_reqs() Add support for scanning the full list of per-layout segment commit arrays to pnfs_generic_recover_commit_reqs(). Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-27 16:34:34 -04:00
Trond Myklebust	a8e3765e51	NFSv4/pNFS: Scan the full list of commit arrays when committing Add support for scanning the full list of per-layout segment commit arrays to pnfs_generic_scan_commit_lists() Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-27 16:34:34 -04:00
Trond Myklebust	c21e716884	NFSv4/pnfs: Support a list of commit arrays in struct pnfs_ds_commit_info When we have multiple layout segments with different lists of mirrored data, we need to track the commits on a per layout segment basis. This patch adds a list to support this tracking in struct pnfs_ds_commit_info. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-27 16:34:34 -04:00
Bob Peterson	c953a735c7	gfs2: change from write to read lock for sd_log_flush_lock in journal replay Function gfs2_recover_func grabs the sd_log_flush_lock rw_semaphore in write mode. This is unnecessary because we only need to prevent log flush from using sd_log_bio bio while it does. Therefore, a read lock will be enough. This is a small step in cleaning up log flush. Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2020-03-27 14:08:05 -05:00
Bob Peterson	9592ea80ad	gfs2: instrumentation wrt ail1 stuck Before this patch, if the ail1 flush got stuck for some reason, there were no clues as to why. This patch introduces a check for getting stuck for more than a minute, and if it happens, it dumps the items still remaining on the ail1 list. Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2020-03-27 14:08:05 -05:00
Bob Peterson	e04d339bd8	gfs2: don't lock sd_log_flush_lock in try_rgrp_unlink In function try_rgrp_unlink, we added a temporary lock of the sd_log_flush_lock while searching the bitmaps. This protected us from problems in which dinodes being freed were still in a state of flux because the rgrp was in an active transaction. It was a kludge. Now that we've straightened out the code for inode eviction, deletes, and all the recovery mess, we no longer need this kludge. This patch removes it, and should improve performance. Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2020-03-27 14:08:05 -05:00
Andreas Gruenbacher	4bd684bc01	gfs2: Remove unnecessary gfs2_qa_{get,put} pairs We now get the quota data structure when opening a file writable and put it when closing that writable file descriptor, so there no longer is a need for gfs2_qa_{get,put} while we're holding a writable file descriptor. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2020-03-27 14:08:05 -05:00
Andreas Gruenbacher	1595548fe7	gfs2: Split gfs2_rsqa_delete into gfs2_rs_delete and gfs2_qa_put Keeping reservations and quotas separate helps reviewing the code. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2020-03-27 14:08:04 -05:00
Bob Peterson	2fba46a04c	gfs2: Change inode qa_data to allow multiple users Before this patch, multiple users called gfs2_qa_alloc which allocated a qadata structure to the inode, if quotas are turned on. Later, in file close or evict, the structure was deleted with gfs2_qa_delete. But there can be several competing processes who need access to the structure. There were races between file close (release) and the others. Thus, a release could delete the structure out from under a process that relied upon its existence. For example, chown. This patch changes the management of the qadata structures to be a get/put scheme. Function gfs2_qa_alloc has been changed to gfs2_qa_get and if the structure is allocated, the count essentially starts out at 1. Function gfs2_qa_delete has been renamed to gfs2_qa_put, and the last guy to decrement the count to 0 frees the memory. Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2020-03-27 14:08:04 -05:00
Bob Peterson	d580712a37	gfs2: eliminate gfs2_rsqa_alloc in favor of gfs2_qa_alloc Before this patch, multiple callers called gfs2_rsqa_alloc to force the existence of a reservations structure and a quota data structure if needed. However, now the reservations are handled separately, so the quota data is only the quota data. So we eliminate the one in favor of just calling gfs2_qa_alloc directly. Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2020-03-27 14:08:04 -05:00
Andreas Gruenbacher	969183bc68	gfs2: Switch to list_{first,last}_entry Replace open-coded versions of list_first_entry and list_last_entry with those functions. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2020-03-27 14:08:04 -05:00
Andreas Gruenbacher	40e7e86ef1	gfs2: Clean up inode initialization and teardown When allocating a new inode, mark the iopen glock holder as uninitialized to make sure gfs2_evict_inode won't fail after an incomplete create or lookup. In gfs2_evict_inode, allow the inode glock to be NULL and remove the duplicate iopen glock teardown code. In gfs2_inode_lookup, don't tear down things that gfs2_evict_inode will already tear down. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2020-03-27 14:08:04 -05:00
Steve French	edad734c74	smb3: use SMB2_SIGNATURE_SIZE define It clarifies the code slightly to use SMB2_SIGNATURE_SIZE define rather than 16. Suggested-by: Henning Schild <henning.schild@siemens.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-03-27 12:47:41 -05:00
Amir Goldstein	926e94d79b	ovl: enable xino automatically in more cases So far, with xino=auto, we only enable xino if we know that all underlying filesystem use 32bit inode numbers. When users configure overlay with xino=auto, they already declare that they are ready to handle 64bit inode number from overlay. It is a very common case, that underlying filesystem uses 64bit ino, but rarely or never uses the high inode number bits (e.g. tmpfs, xfs). Leaving it for the users to declare high ino bits are unused with xino=on is not a recipe for many users to enjoy the benefits of xino. There appears to be very little reason not to enable xino when users declare xino=auto even if we do not know how many bits underlying filesystem uses for inode numbers. In the worst case of xino bits overflow by real inode number, we already fall back to the non-xino behavior - real inode number with unique pseudo dev or to non persistent inode number and overlay st_dev (for directories). The only annoyance from auto enabling xino is that xino bits overflow emits a warning to kmsg. Suppress those warnings unless users explicitly asked for xino=on, suggesting that they expected high ino bits to be unused by underlying filesystem. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-27 16:51:02 +01:00
Amir Goldstein	dfe51d47b7	ovl: avoid possible inode number collisions with xino=on When xino feature is enabled and a real directory inode number overflows the lower xino bits, we cannot map this directory inode number to a unique and persistent inode number and we fall back to the real inode st_ino and overlay st_dev. The real inode st_ino with high bits may collide with a lower inode number on overlay st_dev that was mapped using xino. To avoid possible collision with legitimate xino values, map a non persistent inode number to a dedicated range in the xino address space. The dedicated range is created by adding one more bit to the number of reserved high xino bits. We could have added just one more fsid, but that would have had the undesired effect of changing persistent overlay inode numbers on kernel or require more complex xino mapping code. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-27 16:51:02 +01:00
Amir Goldstein	4d314f7859	ovl: use a private non-persistent ino pool There is no reason to deplete the system's global get_next_ino() pool for overlay non-persistent inode numbers and there is no reason at all to allocate non-persistent inode numbers for non-directories. For non-directories, it is much better to leave i_ino the same as real i_ino, to be consistent with st_ino/d_ino. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-27 16:51:02 +01:00
Miklos Szeredi	83552eacdf	ovl: fix WARN_ON nlink drop to zero Changes to underlying layers should not cause WARN_ON(), but this repro does: mkdir w l u mnt sudo mount -t overlay -o workdir=w,lowerdir=l,upperdir=u overlay mnt touch mnt/h ln u/h u/k rm -rf mnt/k rm -rf mnt/h dmesg ------------[ cut here ]------------ WARNING: CPU: 1 PID: 116244 at fs/inode.c:302 drop_nlink+0x28/0x40 After upper hardlinks were added while overlay is mounted, unlinking all overlay hardlinks drops overlay nlink to zero before all upper inodes are unlinked. After unlink/rename prevent i_nlink from going to zero if there are still hashed aliases (i.e. cached hard links to the victim) remaining. Reported-by: Phasip <phasip@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-27 16:51:02 +01:00
Darrick J. Wong	5cc3c006eb	xfs: don't write a corrupt unmount record to force summary counter recalc In commit `f467cad95f`, I added the ability to force a recalculation of the filesystem summary counters if they seemed incorrect. This was done (not entirely correctly) by tweaking the log code to write an unmount record without the UMOUNT_TRANS flag set. At next mount, the log recovery code will fail to find the unmount record and go into recovery, which triggers the recalculation. What actually gets written to the log is what ought to be an unmount record, but without any flags set to indicate what kind of record it actually is. This worked to trigger the recalculation, but we shouldn't write bogus log records when we could simply write nothing. Fixes: `f467cad95f` ("xfs: force summary counter recalc at next mount") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-03-27 08:32:55 -07:00
Dave Chinner	5806165a66	xfs: factor inode lookup from xfs_ifree_cluster There's lots of indent in this code which makes it a bit hard to follow. We are also going to completely rework the inode lookup code as part of the inode reclaim rework, so factor out the inode lookup code from the inode cluster freeing code. Based on prototype code from Christoph Hellwig. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-03-27 08:32:55 -07:00
Dave Chinner	8eb807bd83	xfs: tail updates only need to occur when LSN changes We currently wake anything waiting on the log tail to move whenever the log item at the tail of the log is removed. Historically this was fine behaviour because there were very few items at any given LSN. But with delayed logging, there may be thousands of items at any given LSN, and we can't move the tail until they are all gone. Hence if we are removing them in near tail-first order, we might be waking up processes waiting on the tail LSN to change (e.g. log space waiters) repeatedly without them being able to make progress. This also occurs with the new sync push waiters, and can result in thousands of spurious wakeups every second when under heavy direct reclaim pressure. To fix this, check that the tail LSN has actually changed on the AIL before triggering wakeups. This will reduce the number of spurious wakeups when doing bulk AIL removal and make this code much more efficient. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-03-27 08:32:55 -07:00
Dave Chinner	4165994ac9	xfs: factor common AIL item deletion code Factor the common AIL deletion code that does all the wakeups into a helper so we only have one copy of this somewhat tricky code to interface with all the wakeups necessary when the LSN of the log tail changes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-03-27 08:32:55 -07:00
Dave Chinner	d59eadaea2	xfs: correctly acount for reclaimable slabs The XFS inode item slab actually reclaimed by inode shrinker callbacks from the memory reclaim subsystem. These should be marked as reclaimable so the mm subsystem has the full picture of how much memory it can actually reclaim from the XFS slab caches. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-03-27 08:32:54 -07:00
Dave Chinner	12eba65b28	xfs: Improve metadata buffer reclaim accountability The buffer cache shrinker frees more than just the xfs_buf slab objects - it also frees the pages attached to the buffers. Make sure the memory reclaim code accounts for this memory being freed correctly, similar to how the inode shrinker accounts for pages freed from the page cache due to mapping invalidation. We also need to make sure that the mm subsystem knows these are reclaimable objects. We provide the memory reclaim subsystem with a a shrinker to reclaim xfs_bufs, so we should really mark the slab that way. We also have a lot of xfs_bufs in a busy system, spread them around like we do inodes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-03-27 08:32:54 -07:00
Dave Chinner	2def2845cc	xfs: don't allow log IO to be throttled Running metadata intensive workloads, I've been seeing the AIL pushing getting stuck on pinned buffers and triggering log forces. The log force is taking a long time to run because the log IO is getting throttled by wbt_wait() - the block layer writeback throttle. It's being throttled because there is a huge amount of metadata writeback going on which is filling the request queue. IOWs, we have a priority inversion problem here. Mark the log IO bios with REQ_IDLE so they don't get throttled by the block layer writeback throttle. When we are forcing the CIL, we are likely to need to to tens of log IOs, and they are issued as fast as they can be build and IO completed. Hence REQ_IDLE is appropriate - it's an indication that more IO will follow shortly. And because we also set REQ_SYNC, the writeback throttle will now treat log IO the same way it treats direct IO writes - it will not throttle them at all. Hence we solve the priority inversion problem caused by the writeback throttle being unable to distinguish between high priority log IO and background metadata writeback. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-03-27 08:32:54 -07:00
Dave Chinner	0e7ab7efe7	xfs: Throttle commits on delayed background CIL push In certain situations the background CIL push can be indefinitely delayed. While we have workarounds from the obvious cases now, it doesn't solve the underlying issue. This issue is that there is no upper limit on the CIL where we will either force or wait for a background push to start, hence allowing the CIL to grow without bound until it consumes all log space. To fix this, add a new wait queue to the CIL which allows background pushes to wait for the CIL context to be switched out. This happens when the push starts, so it will allow us to block incoming transaction commit completion until the push has started. This will only affect processes that are running modifications, and only when the CIL threshold has been significantly overrun. This has no apparent impact on performance, and doesn't even trigger until over 45 million inodes had been created in a 16-way fsmark test on a 2GB log. That was limiting at 64MB of log space used, so the active CIL size is only about 3% of the total log in that case. The concurrent removal of those files did not trigger the background sleep at all. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-03-27 08:32:54 -07:00
Dave Chinner	108a42358a	xfs: Lower CIL flush limit for large logs The current CIL size aggregation limit is 1/8th the log size. This means for large logs we might be aggregating at least 250MB of dirty objects in memory before the CIL is flushed to the journal. With CIL shadow buffers sitting around, this means the CIL is often consuming >500MB of temporary memory that is all allocated under GFP_NOFS conditions. Flushing the CIL can take some time to do if there is other IO ongoing, and can introduce substantial log force latency by itself. It also pins the memory until the objects are in the AIL and can be written back and reclaimed by shrinkers. Hence this threshold also tends to determine the minimum amount of memory XFS can operate in under heavy modification without triggering the OOM killer. Modify the CIL space limit to prevent such huge amounts of pinned metadata from aggregating. We can have 2MB of log IO in flight at once, so limit aggregation to 16x this size. This threshold was chosen as it little impact on performance (on 16-way fsmark) or log traffic but pins a lot less memory on large logs especially under heavy memory pressure. An aggregation limit of 8x had 5-10% performance degradation and a 50% increase in log throughput for the same workload, so clearly that was too small for highly concurrent workloads on large logs. This was found via trace analysis of AIL behaviour. e.g. insertion from a single CIL flush: xfs_ail_insert: old lsn 0/0 new lsn 1/3033090 type XFS_LI_INODE flags IN_AIL $ grep xfs_ail_insert /mnt/scratch/s.t \|grep "new lsn 1/3033090" \|wc -l 1721823 $ So there were 1.7 million objects inserted into the AIL from this CIL checkpoint, the first at 2323.392108, the last at 2325.667566 which was the end of the trace (i.e. it hadn't finished). Clearly a major problem. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-03-27 08:32:54 -07:00
Dave Chinner	b843299ba5	xfs: remove some stale comments from the log code Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-03-27 08:32:54 -07:00
Dave Chinner	3c702f9590	xfs: refactor unmount record writing Separate out the unmount record writing from the rest of the ticket and log state futzing necessary to make it work. This is a no-op, just makes the code cleaner and places the unmount record formatting and writing alongside the commit record formatting and writing code. We can also get rid of the ticket flag clearing before the xlog_write() call because it no longer cares about the state of XLOG_TIC_INITED. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-03-27 08:32:54 -07:00
Dave Chinner	f10e925def	xfs: merge xlog_commit_record with xlog_write_done xlog_write_done() is just a thin wrapper around xlog_commit_record(), so they can be merged together easily. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-03-27 08:32:54 -07:00
Christoph Hellwig	8b41e3f98e	xfs: split xlog_ticket_done Remove xlog_ticket_done and just call the renamed low-level helpers for ungranting or regranting log space directly. To make that a little the reference put on the ticket and all tracing is moved into the actual helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-03-27 08:32:53 -07:00
Dave Chinner	70e42f2d47	xfs: kill XLOG_TIC_INITED It is not longer used or checked by anything, so remove the last traces from the log ticket code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-03-27 08:32:53 -07:00
Dave Chinner	dd401770b0	xfs: refactor and split xfs_log_done() xfs_log_done() does two separate things. Firstly, it triggers commit records to be written for permanent transactions, and secondly it releases or regrants transaction reservation space. Since delayed logging was introduced, transactions no longer write directly to the log, hence they never have the XLOG_TIC_INITED flag cleared on them. Hence transactions never write commit records to the log and only need to modify reservation space. Split up xfs_log_done into two parts, and only call the parts of the operation needed for the context xfs_log_done() is currently being called from. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-03-27 08:32:53 -07:00
Dave Chinner	9590e9c684	xfs: re-order initial space accounting checks in xlog_write Commit and unmount records records do not need start records to be written, so rearrange the logic in xlog_write() to remove the need to check for XLOG_TIC_INITED to determine if we should account for the space used by a start record. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-03-27 08:32:53 -07:00
Dave Chinner	7ec949212d	xfs: don't try to write a start record into every iclog The xlog_write() function iterates over iclogs until it completes writing all the log vectors passed in. The ticket tracks whether a start record has been written or not, so only the first iclog gets a start record. We only ever pass single use tickets to xlog_write() so we only ever need to write a start record once per xlog_write() call. Hence we don't need to store whether we should write a start record in the ticket as the callers provide all the information we need to determine if a start record should be written. For the moment, we have to ensure that we clear the XLOG_TIC_INITED appropriately so the code in xfs_log_done() still works correctly for committing transactions. (darrick: Note the slight behavior change that we always deduct the size of the op header from the ticket, even for unmount records) Signed-off-by: Dave Chinner <dchinner@redhat.com> [hch: pass an explicit need_start_rec argument] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-03-27 08:32:53 -07:00
Darrick J. Wong	f8e566c0f5	xfs: validate the realtime geometry in xfs_validate_sb_common Validate the geometry of the realtime geometry when we mount the filesystem, so that we don't abruptly shut down the filesystem later on. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2020-03-27 08:32:53 -07:00
Xiaoguang Wang	3d9932a8b2	io_uring: cleanup io_alloc_async_ctx() Cleanup io_alloc_async_ctx() a bit, add a new __io_alloc_async_ctx(), so io_setup_async_rw() won't need to check whether async_ctx is true or false again. Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-27 08:54:06 -06:00
Sergey Alirzaev	52cbee2a57	9p: read only once on O_NONBLOCK A proper way to handle O_NONBLOCK would be making the requests and responses happen asynchronously, but this would require serious code refactoring. Link: http://lkml.kernel.org/r/20200205003457.24340-2-l29ah@cock.li Signed-off-by: Sergey Alirzaev <l29ah@cock.li> Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>	2020-03-27 09:29:56 +00:00
zhengbin	5195881739	9p: Remove unneeded semicolon Fixes coccicheck warning: fs/9p/vfs_inode.c:146:3-4: Unneeded semicolon Link: http://lkml.kernel.org/r/1576752517-58292-1-git-send-email-zhengbin13@huawei.com Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>	2020-03-27 09:29:56 +00:00
Krzysztof Kozlowski	1f5bd6a202	9p: Fix Kconfig indentation Adjust indentation from spaces to tab (+optional two spaces) as in coding style with command like: $ sed -e 's/^ /\t/' -i */Kconfig Link: http://lkml.kernel.org/r/20191120134340.16770-1-krzk@kernel.org Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>	2020-03-27 09:29:56 +00:00
David Howells	9efcc4a129	afs: Fix unpinned address list during probing When it's probing all of a fileserver's interfaces to find which one is best to use, afs_do_probe_fileserver() takes a lock on the server record and notes the pointer to the address list. It doesn't, however, pin the address list, so as soon as it drops the lock, there's nothing to stop the address list from being freed under us. Fix this by taking a ref on the address list inside the locked section and dropping it at the end of the function. Fixes: `3bf0fb6f33` ("afs: Probe multiple fileservers simultaneously") Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-03-26 16:04:29 -07:00
Linus Torvalds	60268940cd	A patch for a rather old regression in fullness handling and two memory leak fixes, marked for stable. -----BEGIN PGP SIGNATURE----- iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl59DCwTHGlkcnlvbW92 QGdtYWlsLmNvbQAKCRBKf944AhHzi5oGB/943a7gIBV52PD3MGCnI8RWjgHkk3d0 en2JNI6i7hf7GD7GplMGkc0D8INBJhCZo1mwzX36QXYA3BeXKARkNXvEE+AZ4dX5 XbUiPE5WuUwxcT9sE9rTiCurx1ToN/XUlA27Vbt9J67U08w5BjJ3utO1LuW7z2ME NPx6aw6tdwIEeNJBo4ge8y9vPKevtXqhkCbzSb2kn+tMhoMPuJ3RIj8kWIF7mYWZ ofwOFoDnOfQuH+9ZA/mT4jL7ifR0am5QptHSD9kxge2mKlc0pmoABZK6sWNPOslg jQaEiefH77K/IxRyAsQNM7iHbUzKpZGbqAHx92MU0redUjUWNdCDGUmF =c01Y -----END PGP SIGNATURE----- Merge tag 'ceph-for-5.6-rc8' of git://github.com/ceph/ceph-client Pull ceph fixes from Ilya Dryomov: "A patch for a rather old regression in fullness handling and two memory leak fixes, marked for stable" * tag 'ceph-for-5.6-rc8' of git://github.com/ceph/ceph-client: ceph: fix memory leak in ceph_cleanup_snapid_map() libceph: fix alloc_msg_with_page_vector() memory leaks ceph: check POOL_FLAG_FULL/NEARFULL in addition to OSDMAP_FULL/NEARFULL	2020-03-26 15:44:41 -07:00
Darrick J. Wong	27fb5a72f5	xfs: prohibit fs freezing when using empty transactions I noticed that fsfreeze can take a very long time to freeze an XFS if there happens to be a GETFSMAP caller running in the background. I also happened to notice the following in dmesg: ------------[ cut here ]------------ WARNING: CPU: 2 PID: 43492 at fs/xfs/xfs_super.c:853 xfs_quiesce_attr+0x83/0x90 [xfs] Modules linked in: xfs libcrc32c ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip_set_hash_ip ip_set_hash_net xt_tcpudp xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables bfq iptable_filter sch_fq_codel ip_tables x_tables nfsv4 af_packet [last unloaded: xfs] CPU: 2 PID: 43492 Comm: xfs_io Not tainted 5.6.0-rc4-djw #rc4 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014 RIP: 0010:xfs_quiesce_attr+0x83/0x90 [xfs] Code: 7c 07 00 00 85 c0 75 22 48 89 df 5b e9 96 c1 00 00 48 c7 c6 b0 2d 38 a0 48 89 df e8 57 64 ff ff 8b 83 7c 07 00 00 85 c0 74 de <0f> 0b 48 89 df 5b e9 72 c1 00 00 66 90 0f 1f 44 00 00 41 55 41 54 RSP: 0018:ffffc900030f3e28 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffff88802ac54000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff81e4a6f0 RDI: 00000000ffffffff RBP: ffff88807859f070 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000010 R12: 0000000000000000 R13: ffff88807859f388 R14: ffff88807859f4b8 R15: ffff88807859f5e8 FS: 00007fad1c6c0fc0(0000) GS:ffff88807e000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f0c7d237000 CR3: 0000000077f01003 CR4: 00000000001606a0 Call Trace: xfs_fs_freeze+0x25/0x40 [xfs] freeze_super+0xc8/0x180 do_vfs_ioctl+0x70b/0x750 ? __fget_files+0x135/0x210 ksys_ioctl+0x3a/0xb0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x50/0x1a0 entry_SYSCALL_64_after_hwframe+0x49/0xbe These two things appear to be related. The assertion trips when another thread initiates a fsmap request (which uses an empty transaction) after the freezer waited for m_active_trans to hit zero but before the the freezer executes the WARN_ON just prior to calling xfs_log_quiesce. The lengthy delays in freezing happen because the freezer calls xfs_wait_buftarg to clean out the buffer lru list. Meanwhile, the GETFSMAP caller is continuing to grab and release buffers, which means that it can take a very long time for the buffer lru list to empty out. We fix both of these races by calling sb_start_write to obtain freeze protection while using empty transactions for GETFSMAP and for metadata scrubbing. The other two users occur during mount, during which time we cannot fs freeze. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2020-03-26 08:19:24 -07:00
Brian Foster	842a42d126	xfs: shutdown on failure to add page to log bio If the bio_add_page() call fails, we proceed to write out a partially constructed log buffer. This corrupts the physical log such that log recovery is not possible. Worse, persistent occurrences of this error eventually lead to a BUG_ON() failure in bio_split() as iclogs wrap the end of the physical log, which triggers log recovery on subsequent mount. Rather than warn about writing out a corrupted log buffer, shutdown the fs as is done for any log I/O related error. This preserves the consistency of the physical log such that log recovery succeeds on a subsequent mount. Note that this was observed on a 64k page debug kernel without upstream commit `59bb47985c` ("mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)"), which demonstrated frequent iclog bio overflows due to unaligned (slab allocated) iclog data buffers. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-03-26 08:19:24 -07:00
Darrick J. Wong	d59f44d3e7	xfs: directory bestfree check should release buffers When we're checking bestfree information in directory blocks, always drop the block buffer at the end of the function. We should always release resources when we're done using them. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2020-03-26 08:19:24 -07:00
Darrick J. Wong	afbabf5630	xfs: drop all altpath buffers at the end of the sibling check The dirattr btree checking code uses the altpath substructure of the dirattr state structure to check the sibling pointers of dir/attr tree blocks. At the end of sibling checks, xfs_da3_path_shift could have changed multiple levels of buffer pointers in the altpath structure. Although we release the leaf level buffer, this isn't enough -- we also need to release the node buffers that are unique to the altpath. Not releasing all of the altpath buffers leaves them locked to the transaction. This is suboptimal because we should release resources when we don't need them anymore. Fix the function to loop all levels of the altpath, and fix the return logic so that we always run the loop. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2020-03-26 08:19:24 -07:00
Darrick J. Wong	5885539f0a	xfs: preserve default grace interval during quotacheck When quotacheck runs, it zeroes all the timer fields in every dquot. Unfortunately, it also does this to the root dquot, which erases any preconfigured grace intervals and warning limits that the administrator may have set. Worse yet, the incore copies of those variables remain set. This cache coherence problem manifests itself as the grace interval mysteriously being reset back to the defaults at the /next/ mount. Fix it by not resetting the root disk dquot's timer and warning fields. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2020-03-26 08:19:24 -07:00
Eric Whitney	c8980e1980	ext4: disable dioread_nolock whenever delayed allocation is disabled The patch "ext4: make dioread_nolock the default" (`244adf6426`) causes generic/422 to fail when run in kvm-xfstests' ext3conv test case. This applies both the dioread_nolock and nodelalloc mount options, a combination not previously tested by kvm-xfstests. The failure occurs because the dioread_nolock code path splits a previously fallocated multiblock extent into a series of single block extents when overwriting a portion of that extent. That causes allocation of an extent tree leaf node and a reshuffling of extents. Once writeback is completed, the individual extents are recombined into a single extent, the extent is moved again, and the leaf node is deleted. The difference in block utilization before and after writeback due to the leaf node triggers the failure. The original reason for this behavior was to avoid ENOSPC when handling I/O completions during writeback in the dioread_nolock code paths when delayed allocation is disabled. It may no longer be necessary, because code was added in the past to reserve extra space to solve this problem when delayed allocation is enabled, and this code may also apply when delayed allocation is disabled. Until this can be verified, don't use the dioread_nolock code paths if delayed allocation is disabled. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Link: https://lore.kernel.org/r/20200319150028.24592-1-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-03-26 10:57:42 -04:00
Eric Sandeen	c96e2b8564	ext4: do not commit super on read-only bdev Under some circumstances we may encounter a filesystem error on a read-only block device, and if we try to save the error info to the superblock and commit it, we'll wind up with a noisy error and backtrace, i.e.: [ 3337.146838] EXT4-fs error (device pmem1p2): ext4_get_journal_inode:4634: comm mount: inode #0: comm mount: iget: illegal inode # ------------[ cut here ]------------ generic_make_request: Trying to write to read-only block-device pmem1p2 (partno 2) WARNING: CPU: 107 PID: 115347 at block/blk-core.c:788 generic_make_request_checks+0x6b4/0x7d0 ... To avoid this, commit the error info in the superblock only if the block device is writable. Reported-by: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/4b6e774d-cc00-3469-7abb-108eb151071a@sandeen.net Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-03-26 10:56:53 -04:00
Jan Kara	d05466b27b	ext4: avoid ENOSPC when avoiding to reuse recently deleted inodes When ext4 is running on a filesystem without a journal, it tries not to reuse recently deleted inodes to provide better chances for filesystem recovery in case of crash. However this logic forbids reuse of freed inodes for up to 5 minutes and especially for filesystems with smaller number of inodes can lead to ENOSPC errors returned when allocating new inodes. Fix the problem by allowing to reuse recently deleted inode if there's no other inode free in the scanned range. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200318121317.31941-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-03-26 10:55:00 -04:00
Ritesh Harjani	5e47868fb9	ext4: unregister sysfs path before destroying jbd2 journal Call ext4_unregister_sysfs(), before destroying jbd2 journal, since below might cause, NULL pointer dereference issue. This got reported with LTP tests. ext4_put_super() cat /sys/fs/ext4/loop2/journal_task \| ext4_attr_show(); ext4_jbd2_journal_destroy(); \| \| journal_task_show() \| \| \| task_pid_vnr(NULL); sbi->s_journal = NULL; Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200318061301.4320-1-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-03-26 10:52:49 -04:00
Ritesh Harjani	f1eec3b0d0	ext4: check for non-zero journal inum in ext4_calculate_overhead While calculating overhead for internal journal, also check that j_inum shouldn't be 0. Otherwise we get below error with xfstests generic/050 with external journal (XXX_LOGDEV config) enabled. It could be simply reproduced with loop device with an external journal and marking blockdev as RO before mounting. [ 3337.146838] EXT4-fs error (device pmem1p2): ext4_get_journal_inode:4634: comm mount: inode #0: comm mount: iget: illegal inode # ------------[ cut here ]------------ generic_make_request: Trying to write to read-only block-device pmem1p2 (partno 2) WARNING: CPU: 107 PID: 115347 at block/blk-core.c:788 generic_make_request_checks+0x6b4/0x7d0 CPU: 107 PID: 115347 Comm: mount Tainted: G L --------- -t - 4.18.0-167.el8.ppc64le #1 NIP: c0000000006f6d44 LR: c0000000006f6d40 CTR: 0000000030041dd4 <...> NIP [c0000000006f6d44] generic_make_request_checks+0x6b4/0x7d0 LR [c0000000006f6d40] generic_make_request_checks+0x6b0/0x7d0 <...> Call Trace: generic_make_request_checks+0x6b0/0x7d0 (unreliable) generic_make_request+0x3c/0x420 submit_bio+0xd8/0x200 submit_bh_wbc+0x1e8/0x250 __sync_dirty_buffer+0xd0/0x210 ext4_commit_super+0x310/0x420 [ext4] __ext4_error+0xa4/0x1e0 [ext4] __ext4_iget+0x388/0xe10 [ext4] ext4_get_journal_inode+0x40/0x150 [ext4] ext4_calculate_overhead+0x5a8/0x610 [ext4] ext4_fill_super+0x3188/0x3260 [ext4] mount_bdev+0x778/0x8f0 ext4_mount+0x28/0x50 [ext4] mount_fs+0x74/0x230 vfs_kern_mount.part.6+0x6c/0x250 do_mount+0x2fc/0x1280 sys_mount+0x158/0x180 system_call+0x5c/0x70 EXT4-fs (pmem1p2): no journal found EXT4-fs (pmem1p2): can't get journal size EXT4-fs (pmem1p2): mounted filesystem without journal. Opts: dax,norecovery Fixes: `3c816ded78` ("ext4: use journal inode to determine journal overhead") Reported-by: Harish Sriram <harish@linux.ibm.com> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200316093038.25485-1-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-03-26 10:52:45 -04:00
Trond Myklebust	d7242c4641	pNFS: Add a helper to allocate the array of buckets Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-26 10:52:04 -04:00
Trond Myklebust	19573c939a	NFS/pNFS: Refactor pnfs_generic_commit_pagelist() Refactor pnfs_generic_commit_pagelist() to simplify the conversion to layout segment based commit lists. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-26 10:52:04 -04:00
Trond Myklebust	329651b1f1	pNFS/flexfiles: Simplify allocation of the mirror array Just allocate the array at the end of the layout segment structure, instead of allocating it as a separate array of pointers. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-26 10:52:04 -04:00
David S. Miller	9fb16955fb	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Overlapping header include additions in macsec.c A bug fix in 'net' overlapping with the removal of 'version' string in ena_netdev.c Overlapping test additions in selftests Makefile Overlapping PCI ID table adjustments in iwlwifi driver. Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-25 18:58:11 -07:00
Amir Goldstein	44d705b037	fanotify: report name info for FAN_DIR_MODIFY event Report event FAN_DIR_MODIFY with name in a variable length record similar to how fid's are reported. With name info reporting implemented, setting FAN_DIR_MODIFY in mark mask is now allowed. When events are reported with name, the reported fid identifies the directory and the name follows the fid. The info record type for this event info is FAN_EVENT_INFO_TYPE_DFID_NAME. For now, all reported events have at most one info record which is either FAN_EVENT_INFO_TYPE_FID or FAN_EVENT_INFO_TYPE_DFID_NAME (for FAN_DIR_MODIFY). Later on, events "on child" will report both records. There are several ways that an application can use this information: 1. When watching a single directory, the name is always relative to the watched directory, so application need to fstatat(2) the name relative to the watched directory. 2. When watching a set of directories, the application could keep a map of dirfd for all watched directories and hash the map by fid obtained with name_to_handle_at(2). When getting a name event, the fid in the event info could be used to lookup the base dirfd in the map and then call fstatat(2) with that dirfd. 3. When watching a filesystem (FAN_MARK_FILESYSTEM) or a large set of directories, the application could use open_by_handle_at(2) with the fid in event info to obtain dirfd for the directory where event happened and call fstatat(2) with this dirfd. The last option scales better for a large number of watched directories. The first two options may be available in the future also for non privileged fanotify watchers, because open_by_handle_at(2) requires the CAP_DAC_READ_SEARCH capability. Link: https://lore.kernel.org/r/20200319151022.31456-15-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2020-03-25 23:17:16 +01:00
Amir Goldstein	cacfb956d4	fanotify: record name info for FAN_DIR_MODIFY event For FAN_DIR_MODIFY event, allocate a variable size event struct to store the dir entry name along side the directory file handle. At this point, name info reporting is not yet implemented, so trying to set FAN_DIR_MODIFY in mark mask will return -EINVAL. Link: https://lore.kernel.org/r/20200319151022.31456-14-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2020-03-25 23:17:10 +01:00
Linus Torvalds	1b649e0bca	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from David Miller: 1) Fix deadlock in bpf_send_signal() from Yonghong Song. 2) Fix off by one in kTLS offload of mlx5, from Tariq Toukan. 3) Add missing locking in iwlwifi mvm code, from Avraham Stern. 4) Fix MSG_WAITALL handling in rxrpc, from David Howells. 5) Need to hold RTNL mutex in tcindex_partial_destroy_work(), from Cong Wang. 6) Fix producer race condition in AF_PACKET, from Willem de Bruijn. 7) cls_route removes the wrong filter during change operations, from Cong Wang. 8) Reject unrecognized request flags in ethtool netlink code, from Michal Kubecek. 9) Need to keep MAC in reset until PHY is up in bcmgenet driver, from Doug Berger. 10) Don't leak ct zone template in act_ct during replace, from Paul Blakey. 11) Fix flushing of offloaded netfilter flowtable flows, also from Paul Blakey. 12) Fix throughput drop during tx backpressure in cxgb4, from Rahul Lakkireddy. 13) Don't let a non-NULL skb->dev leave the TCP stack, from Eric Dumazet. 14) TCP_QUEUE_SEQ socket option has to update tp->copied_seq as well, also from Eric Dumazet. 15) Restrict macsec to ethernet devices, from Willem de Bruijn. 16) Fix reference leak in some ethtool _SET handlers, from Michal Kubecek. 17) Fix accidental disabling of MSI for some r8169 chips, from Heiner Kallweit. git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (138 commits) net: Fix CONFIG_NET_CLS_ACT=n and CONFIG_NFT_FWD_NETDEV={y, m} build net: ena: Add PCI shutdown handler to allow safe kexec selftests/net/forwarding: define libs as TEST_PROGS_EXTENDED selftests/net: add missing tests to Makefile r8169: re-enable MSI on RTL8168c net: phy: mdio-bcm-unimac: Fix clock handling cxgb4/ptp: pass the sign of offset delta in FW CMD net: dsa: tag_8021q: replace dsa_8021q_remove_header with __skb_vlan_pop net: cbs: Fix software cbs to consider packet sending time net/mlx5e: Do not recover from a non-fatal syndrome net/mlx5e: Fix ICOSQ recovery flow with Striding RQ net/mlx5e: Fix missing reset of SW metadata in Striding RQ reset net/mlx5e: Enhance ICOSQ WQE info fields net/mlx5_core: Set IB capability mask1 to fix ib_srpt connection failure selftests: netfilter: add nfqueue test case netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress netfilter: nft_fwd_netdev: validate family and chain type netfilter: nft_set_rbtree: Detect partial overlaps on insertion netfilter: nft_set_rbtree: Introduce and use nft_rbtree_interval_start() netfilter: nft_set_pipapo: Separate partial and complete overlap cases on insertion ...	2020-03-25 13:58:05 -07:00
Linus Torvalds	e2cf67f668	zonefs fixes for 5.6 final A single fix in this pull request to correctly handle the size of read-only zone files (from me). Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCXnrJJAAKCRDdoc3SxdoY dgZXAQDK88T4sdtFq1Fl1PuP+oyzHml+xgNo0djZQOdicnD6qQD8CgMGDFQQG4dv Ral+67qEyvUABGt0Vkmy29wuN8El6wQ= =+1D9 -----END PGP SIGNATURE----- Merge tag 'zonefs-5.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs Pull zonefs fix from Damien Le Moal: "A single fix from me to correctly handle the size of read-only zone files" * tag 'zonefs-5.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs: zonfs: Fix handling of read-only zones	2020-03-25 10:34:02 -07:00
Christoph Hellwig	c6a564ffad	block: move the part_stat* helpers from genhd.h to a new header These macros are just used by a few files. Move them out of genhd.h, which is included everywhere into a new standalone header. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-25 09:50:09 -06:00
Christoph Hellwig	29125ed624	block: move guard_bio_eod to bio.c This is bio layer functionality and not related to buffer heads. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-25 09:50:08 -06:00
Robbie Ko	6ff06729c2	btrfs: fix missing semaphore unlock in btrfs_sync_file Ordered ops are started twice in sync file, once outside of inode mutex and once inside, taking the dio semaphore. There was one error path missing the semaphore unlock. Fixes: `aab15e8ec2` ("Btrfs: fix rare chances for data loss when doing a fast fsync") CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> [ add changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-25 16:29:16 +01:00
Josef Bacik	351cbf6e44	btrfs: use nofs allocations for running delayed items Zygo reported the following lockdep splat while testing the balance patches ====================================================== WARNING: possible circular locking dependency detected 5.6.0-c6f0579d496a+ #53 Not tainted ------------------------------------------------------ kswapd0/1133 is trying to acquire lock: ffff888092f622c0 (&delayed_node->mutex){+.+.}, at: __btrfs_release_delayed_node+0x7c/0x5b0 but task is already holding lock: ffffffff8fc5f860 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (fs_reclaim){+.+.}: fs_reclaim_acquire.part.91+0x29/0x30 fs_reclaim_acquire+0x19/0x20 kmem_cache_alloc_trace+0x32/0x740 add_block_entry+0x45/0x260 btrfs_ref_tree_mod+0x6e2/0x8b0 btrfs_alloc_tree_block+0x789/0x880 alloc_tree_block_no_bg_flush+0xc6/0xf0 __btrfs_cow_block+0x270/0x940 btrfs_cow_block+0x1ba/0x3a0 btrfs_search_slot+0x999/0x1030 btrfs_insert_empty_items+0x81/0xe0 btrfs_insert_delayed_items+0x128/0x7d0 __btrfs_run_delayed_items+0xf4/0x2a0 btrfs_run_delayed_items+0x13/0x20 btrfs_commit_transaction+0x5cc/0x1390 insert_balance_item.isra.39+0x6b2/0x6e0 btrfs_balance+0x72d/0x18d0 btrfs_ioctl_balance+0x3de/0x4c0 btrfs_ioctl+0x30ab/0x44a0 ksys_ioctl+0xa1/0xe0 __x64_sys_ioctl+0x43/0x50 do_syscall_64+0x77/0x2c0 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #0 (&delayed_node->mutex){+.+.}: __lock_acquire+0x197e/0x2550 lock_acquire+0x103/0x220 __mutex_lock+0x13d/0xce0 mutex_lock_nested+0x1b/0x20 __btrfs_release_delayed_node+0x7c/0x5b0 btrfs_remove_delayed_node+0x49/0x50 btrfs_evict_inode+0x6fc/0x900 evict+0x19a/0x2c0 dispose_list+0xa0/0xe0 prune_icache_sb+0xbd/0xf0 super_cache_scan+0x1b5/0x250 do_shrink_slab+0x1f6/0x530 shrink_slab+0x32e/0x410 shrink_node+0x2a5/0xba0 balance_pgdat+0x4bd/0x8a0 kswapd+0x35a/0x800 kthread+0x1e9/0x210 ret_from_fork+0x3a/0x50 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(&delayed_node->mutex); lock(fs_reclaim); lock(&delayed_node->mutex); * DEADLOCK * 3 locks held by kswapd0/1133: #0: ffffffff8fc5f860 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30 #1: ffffffff8fc380d8 (shrinker_rwsem){++++}, at: shrink_slab+0x1e8/0x410 #2: ffff8881e0e6c0e8 (&type->s_umount_key#42){++++}, at: trylock_super+0x1b/0x70 stack backtrace: CPU: 2 PID: 1133 Comm: kswapd0 Not tainted 5.6.0-c6f0579d496a+ #53 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 Call Trace: dump_stack+0xc1/0x11a print_circular_bug.isra.38.cold.57+0x145/0x14a check_noncircular+0x2a9/0x2f0 ? print_circular_bug.isra.38+0x130/0x130 ? stack_trace_consume_entry+0x90/0x90 ? save_trace+0x3cc/0x420 __lock_acquire+0x197e/0x2550 ? btrfs_inode_clear_file_extent_range+0x9b/0xb0 ? register_lock_class+0x960/0x960 lock_acquire+0x103/0x220 ? __btrfs_release_delayed_node+0x7c/0x5b0 __mutex_lock+0x13d/0xce0 ? __btrfs_release_delayed_node+0x7c/0x5b0 ? __asan_loadN+0xf/0x20 ? pvclock_clocksource_read+0xeb/0x190 ? __btrfs_release_delayed_node+0x7c/0x5b0 ? mutex_lock_io_nested+0xc20/0xc20 ? __kasan_check_read+0x11/0x20 ? check_chain_key+0x1e6/0x2e0 mutex_lock_nested+0x1b/0x20 ? mutex_lock_nested+0x1b/0x20 __btrfs_release_delayed_node+0x7c/0x5b0 btrfs_remove_delayed_node+0x49/0x50 btrfs_evict_inode+0x6fc/0x900 ? btrfs_setattr+0x840/0x840 ? do_raw_spin_unlock+0xa8/0x140 evict+0x19a/0x2c0 dispose_list+0xa0/0xe0 prune_icache_sb+0xbd/0xf0 ? invalidate_inodes+0x310/0x310 super_cache_scan+0x1b5/0x250 do_shrink_slab+0x1f6/0x530 shrink_slab+0x32e/0x410 ? do_shrink_slab+0x530/0x530 ? do_shrink_slab+0x530/0x530 ? __kasan_check_read+0x11/0x20 ? mem_cgroup_protected+0x13d/0x260 shrink_node+0x2a5/0xba0 balance_pgdat+0x4bd/0x8a0 ? mem_cgroup_shrink_node+0x490/0x490 ? _raw_spin_unlock_irq+0x27/0x40 ? finish_task_switch+0xce/0x390 ? rcu_read_lock_bh_held+0xb0/0xb0 kswapd+0x35a/0x800 ? _raw_spin_unlock_irqrestore+0x4c/0x60 ? balance_pgdat+0x8a0/0x8a0 ? finish_wait+0x110/0x110 ? __kasan_check_read+0x11/0x20 ? __kthread_parkme+0xc6/0xe0 ? balance_pgdat+0x8a0/0x8a0 kthread+0x1e9/0x210 ? kthread_create_worker_on_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x50 This is because we hold that delayed node's mutex while doing tree operations. Fix this by just wrapping the searches in nofs. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-25 16:26:00 +01:00
Bernd Edlinger	76518d3798	proc: io_accounting: Use new infrastructure to fix deadlocks in execve This changes do_io_accounting to use the new exec_update_mutex instead of cred_guard_mutex. This fixes possible deadlocks when the trace is accessing /proc/$pid/io for instance. This should be safe, as the credentials are only used for reading. Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-03-25 10:04:01 -05:00
Bernd Edlinger	2db9dbf71b	proc: Use new infrastructure to fix deadlocks in execve This changes lock_trace to use the new exec_update_mutex instead of cred_guard_mutex. This fixes possible deadlocks when the trace is accessing /proc/$pid/stack for instance. This should be safe, as the credentials are only used for reading, and task->mm is updated on execve under the new exec_update_mutex. Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-03-25 10:04:01 -05:00
Eric W. Biederman	eea9673250	exec: Add exec_update_mutex to replace cred_guard_mutex The cred_guard_mutex is problematic as it is held over possibly indefinite waits for userspace. The possible indefinite waits for userspace that I have identified are: The cred_guard_mutex is held in PTRACE_EVENT_EXIT waiting for the tracer. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm(). The cred_guard_mutex is held over "get_user(futex_offset, ...") in exit_robust_list. The cred_guard_mutex held over copy_strings. The functions get_user and put_user can trigger a page fault which can potentially wait indefinitely in the case of userfaultfd or if userspace implements part of the page fault path. In any of those cases the userspace process that the kernel is waiting for might make a different system call that winds up taking the cred_guard_mutex and result in deadlock. Holding a mutex over any of those possibly indefinite waits for userspace does not appear necessary. Add exec_update_mutex that will just cover updating the process during exec where the permissions and the objects pointed to by the task struct may be out of sync. The plan is to switch the users of cred_guard_mutex to exec_update_mutex one by one. This lets us move forward while still being careful and not introducing any regressions. Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/ Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/ Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/ Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/ Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/ Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.") Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2") Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-03-25 10:03:36 -05:00
Eric W. Biederman	ccf0fa6be0	exec: Move exec_mmap right after de_thread in flush_old_exec I have read through the code in exec_mmap and I do not see anything that depends on sighand or the sighand lock, or on signals in anyway so this should be safe. This rearrangement of code has two significant benefits. It makes the determination of passing the point of no return by testing bprm->mm accurate. All failures prior to that point in flush_old_exec are either truly recoverable or they are fatal. Further this consolidates all of the possible indefinite waits for userspace together at the top of flush_old_exec. The possible wait for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault to be resolved in clear_child_tid, and the possible wait for a page fault in exit_robust_list. This consolidation allows the creation of a mutex to replace cred_guard_mutex that is not held over possible indefinite userspace waits. Which will allow removing deadlock scenarios from the kernel. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-03-25 10:03:21 -05:00
Eric W. Biederman	153ffb6ba4	exec: Move cleanup of posix timers on exec out of de_thread These functions have very little to do with de_thread move them out of de_thread an into flush_old_exec proper so it can be more clearly seen what flush_old_exec is doing. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-03-25 10:00:38 -05:00
Eric W. Biederman	0216915592	exec: Factor unshare_sighand out of de_thread and call it separately This makes the code clearer and makes it easier to implement a mutex that is not taken over any locations that may block indefinitely waiting for userspace. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-03-25 10:00:21 -05:00
Eric W. Biederman	2ca7be7d55	exec: Only compute current once in flush_old_exec Make it clear that current only needs to be computed once in flush_old_exec. This may have some efficiency improvements and it makes the code easier to change. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-03-25 10:00:07 -05:00
Petr Vorel	aa3367c91d	NFS: Don't specify NFS version in "UDP not supported" error UDP was originally disabled in `6da1a03436` for NFSv4. Later in `b24ee6c64c` UDP is by default disabled by NFS_DISABLE_UDP_SUPPORT=y for all NFS versions. Therefore remove v4 from error message. Fixes: `b24ee6c64c` ("NFS: allow deprecation of NFS UDP protocol") Signed-off-by: Petr Vorel <pvorel@suse.cz> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-25 08:46:34 -04:00
Liwei Song	89c8023fd4	nfsroot: set tcp as the default transport protocol UDP is disabled by default in commit `b24ee6c64c` ("NFS: allow deprecation of NFS UDP protocol"), but the default mount options is still udp, change it to tcp to avoid the "Unsupported transport protocol udp" error if no protocol is specified when mount nfs. Fixes: `b24ee6c64c` ("NFS: allow deprecation of NFS UDP protocol") Signed-off-by: Liwei Song <liwei.song@windriver.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-25 08:45:47 -04:00
Masahiro Yamada	d198b34f38	.gitignore: add SPDX License Identifier Add SPDX License Identifier to all .gitignore files. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-03-25 11:50:48 +01:00
Jan Kara	01affd5471	fanotify: Drop fanotify_event_has_fid() When some events have directory id and some object id, fanotify_event_has_fid() becomes mostly useless and confusing because we usually need to know which type of file handle the event has. So just drop the function and use fanotify_event_object_fh() instead. Signed-off-by: Jan Kara <jack@suse.cz>	2020-03-25 10:27:16 +01:00
Amir Goldstein	d766b55361	fanotify: prepare to report both parent and child fid's For some events, we are going to report both child and parent fid's, so pass fsid and file handle as arguments to copy_fid_to_user(), which is going to be called with parent and child file handles. Link: https://lore.kernel.org/r/20200319151022.31456-13-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2020-03-25 10:27:16 +01:00
Amir Goldstein	9e2ba2c34f	fanotify: send FAN_DIR_MODIFY event flavor with dir inode and name Dirent events are going to be supported in two flavors: 1. Directory fid info + mask that includes the specific event types (e.g. FAN_CREATE) and an optional FAN_ONDIR flag. 2. Directory fid info + name + mask that includes only FAN_DIR_MODIFY. To request the second event flavor, user needs to set the event type FAN_DIR_MODIFY in the mark mask. The first flavor is supported since kernel v5.1 for groups initialized with flag FAN_REPORT_FID. It is intended to be used for watching directories in "batch mode" - the watcher is notified when directory is changed and re-scans the directory content in response. This event flavor is stored more compactly in the event queue, so it is optimal for workloads with frequent directory changes. The second event flavor is intended to be used for watching large directories, where the cost of re-scan of the directory on every change is considered too high. The watcher getting the event with the directory fid and entry name is expected to call fstatat(2) to query the content of the entry after the change. Legacy inotify events are reported with name and event mask (e.g. "foo", FAN_CREATE \| FAN_ONDIR). That can lead users to the conclusion that there is currently an entry "foo" that is a sub-directory, when in fact "foo" may be negative or non-dir by the time user gets the event. To make it clear that the current state of the named entry is unknown, when reporting an event with name info, fanotify obfuscates the specific event types (e.g. create,delete,rename) and uses a common event type - FAN_DIR_MODIFY to describe the change. This should make it harder for users to make wrong assumptions and write buggy filesystem monitors. At this point, name info reporting is not yet implemented, so trying to set FAN_DIR_MODIFY in mark mask will return -EINVAL. Link: https://lore.kernel.org/r/20200319151022.31456-12-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2020-03-25 10:27:16 +01:00
Jan Kara	7088f35720	fanotify: divorce fanotify_path_event and fanotify_fid_event Breakup the union and make them both inherit from abstract fanotify_event. fanotify_path_event, fanotify_fid_event and fanotify_perm_event inherit from fanotify_event. type field in abstract fanotify_event determines the concrete event type. fanotify_path_event, fanotify_fid_event and fanotify_perm_event are allocated from separate memcache pools. Rename fanotify_perm_event casting macro to FANOTIFY_PERM(), so that FANOTIFY_PE() and FANOTIFY_FE() can be used as casting macros to fanotify_path_event and fanotify_fid_event. [JK: Cleanup FANOTIFY_PE() and FANOTIFY_FE() to be proper inline functions and remove requirement that fanotify_event is the first in event structures] Link: https://lore.kernel.org/r/20200319151022.31456-11-amir73il@gmail.com Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2020-03-25 10:27:16 +01:00
Jan Kara	afc894c784	fanotify: Store fanotify handles differently Currently, struct fanotify_fid groups fsid and file handle and is unioned together with struct path to save space. Also there is fh_type and fh_len directly in struct fanotify_event to avoid padding overhead. In the follwing patches, we will be adding more event types and this packing makes code difficult to follow. So unpack everything and create struct fanotify_fh which groups members logically related to file handle to make code easier to follow. In the following patch we will pack things again differently to make events smaller. Signed-off-by: Jan Kara <jack@suse.cz>	2020-03-25 10:27:16 +01:00
Jan Kara	a741c2febe	fanotify: Simplify create_fd() create_fd() is never used with invalid path. Also the only thing it needs to know from fanotify_event is the path. Simplify the function to take path directly and assume it is correct. Signed-off-by: Jan Kara <jack@suse.cz>	2020-03-25 10:22:54 +01:00
Chucheng Luo	bff6035d0c	io_uring: fix missing 'return' in comment The missing 'return' work may make it hard for other developers to understand it. Signed-off-by: Chucheng Luo <luochucheng@vivo.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-24 21:46:36 -06:00
Damien Le Moal	ccf4ad7da0	zonfs: Fix handling of read-only zones The write pointer of zones in the read-only consition is defined as invalid by the SCSI ZBC and ATA ZAC specifications. It is thus not possible to determine the correct size of a read-only zone file on mount. Fix this by handling read-only zones in the same manner as offline zones by disabling all accesses to the zone (read and write) and initializing the inode size of the read-only zone to 0). For zones found to be in the read-only condition at runtime, only disable write access to the zone and keep the size of the zone file to its last updated value to allow the user to recover previously written data. Also fix zonefs documentation file to reflect this change. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>	2020-03-25 11:28:26 +09:00
Christoph Hellwig	ea3edd4dc2	block: remove __bdevname There is no good reason for __bdevname to exist. Just open code printing the string in the callers. For three of them the format string can be trivially merged into existing printk statements, and in init/do_mounts.c we can at least do the scnprintf once at the start of the function, and unconditional of CONFIG_BLOCK to make the output for tiny configfs a little more helpful. Acked-by: Theodore Ts'o <tytso@mit.edu> # for ext4 Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-24 07:57:07 -06:00
Eric Biggers	a65cab7d7f	libfs: fix infoleak in simple_attr_read() Reading from a debugfs file at a nonzero position, without first reading at position 0, leaks uninitialized memory to userspace. It's a bit tricky to do this, since lseek() and pread() aren't allowed on these files, and write() doesn't update the position on them. But writing to them with splice() does update the position: #define _GNU_SOURCE 1 #include <fcntl.h> #include <stdio.h> #include <unistd.h> int main() { int pipes[2], fd, n, i; char buf[32]; pipe(pipes); write(pipes[1], "0", 1); fd = open("/sys/kernel/debug/fault_around_bytes", O_RDWR); splice(pipes[0], NULL, fd, NULL, 1, 0); n = read(fd, buf, sizeof(buf)); for (i = 0; i < n; i++) printf("%02x", buf[i]); printf("\n"); } Output: 5a5a5a5a5a5a5a5a5a5a5a5a5a5a5a5a5a5a5a5a5a5a5a30 Fix the infoleak by making simple_attr_read() always fill simple_attr::get_buf if it hasn't been filled yet. Reported-by: syzbot+fcab69d1ada3e8d6f06b@syzkaller.appspotmail.com Reported-by: Alexander Potapenko <glider@google.com> Fixes: `acaefc25d2` ("[PATCH] libfs: add simple attribute files") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20200308023849.988264-1-ebiggers@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-03-24 13:27:16 +01:00
Amir Goldstein	55bf882c7f	fanotify: fix merging marks masks with FAN_ONDIR Change the logic of FAN_ONDIR in two ways that are similar to the logic of FAN_EVENT_ON_CHILD, that was fixed in commit `54a307ba8d` ("fanotify: fix logic of events on child"): 1. The flag is meaningless in ignore mask 2. The flag refers only to events in the mask of the mark where it is set This is what the fanotify_mark.2 man page says about FAN_ONDIR: "Without this flag, only events for files are created." It doesn't say anything about setting this flag in ignore mask to stop getting events on directories nor can I think of any setup where this capability would be useful. Currently, when marks masks are merged, the FAN_ONDIR flag set in one mark affects the events that are set in another mark's mask and this behavior causes unexpected results. For example, a user adds a mark on a directory with mask FAN_ATTRIB \| FAN_ONDIR and a mount mark with mask FAN_OPEN (without FAN_ONDIR). An opendir() of that directory (which is inside that mount) generates a FAN_OPEN event even though neither of the marks requested to get open events on directories. Link: https://lore.kernel.org/r/20200319151022.31456-10-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2020-03-24 12:06:32 +01:00
Amir Goldstein	f367a62a7c	fanotify: merge duplicate events on parent and child With inotify, when a watch is set on a directory and on its child, an event on the child is reported twice, once with wd of the parent watch and once with wd of the child watch without the filename. With fanotify, when a watch is set on a directory and on its child, an event on the child is reported twice, but it has the exact same information - either an open file descriptor of the child or an encoded fid of the child. The reason that the two identical events are not merged is because the object id used for merging events in the queue is the child inode in one event and parent inode in the other. For events with path or dentry data, use the victim inode instead of the watched inode as the object id for event merging, so that the event reported on parent will be merged with the event reported on the child. Link: https://lore.kernel.org/r/20200319151022.31456-9-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2020-03-24 11:29:12 +01:00
Amir Goldstein	dfc2d2594e	fsnotify: replace inode pointer with an object id The event inode field is used only for comparison in queue merges and cannot be dereferenced after handle_event(), because it does not hold a refcount on the inode. Replace it with an abstract id to do the same thing. Link: https://lore.kernel.org/r/20200319151022.31456-8-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2020-03-24 11:28:00 +01:00
Ingo Molnar	baf5fe7618	Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU changes from Paul E. McKenney: - Make kfree_rcu() use kfree_bulk() for added performance - RCU updates - Callback-overload handling updates - Tasks-RCU KCSAN and sparse updates - Locking torture test and RCU torture test updates - Documentation updates - Miscellaneous fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2020-03-24 10:10:09 +01:00
Pavel Begunkov	86f3cd1b58	io-wq: handle hashed writes in chains We always punt async buffered writes to an io-wq helper, as the core kernel does not have IOCB_NOWAIT support for that. Most buffered async writes complete very quickly, as it's just a copy operation. This means that doing multiple locking roundtrips on the shared wqe lock for each buffered write is wasteful. Additionally, buffered writes are hashed work items, which means that any buffered write to a given file is serialized. Keep identicaly hashed work items contiguously in @wqe->work_list, and track a tail for each hash bucket. On dequeue of a hashed item, splice all of the same hash in one go using the tracked tail. Until the batch is done, the caller doesn't have to synchronize with the wqe or worker locks again. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-23 14:58:07 -06:00
Amir Goldstein	017de65fe5	fsnotify: simplify arguments passing to fsnotify_parent() Instead of passing both dentry and path and having to figure out which one to use, pass data/data_type to simplify the code. Link: https://lore.kernel.org/r/20200319151022.31456-6-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2020-03-23 18:22:48 +01:00
Amir Goldstein	aa93bdc550	fsnotify: use helpers to access data by data_type Create helpers to access path and inode from different data types. Link: https://lore.kernel.org/r/20200319151022.31456-5-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2020-03-23 18:19:06 +01:00
Takashi Iwai	abdd9feb45	btrfs: sysfs: Use scnprintf() instead of snprintf() snprintf() is a hard-to-use function, and it's especially difficult to use it properly for concatenating substrings in a buffer with a limited size. Since snprintf() returns the would-be-output size, not the actual size, the subsequent use of snprintf() may point to the incorrect position easily. Also, returning the value from snprintf() directly to sysfs show function would pass a bogus value that is higher than the actually truncated string. That said, although the current code doesn't actually overflow the buffer with PAGE_SIZE, it's a usage that shouldn't be done. Or it's worse; this gives a wrong confidence as if it were doing safe operations. This patch replaces such snprintf() calls with a safer version, scnprintf(). It returns the actual output size, hence it's more intuitive and the code does what's expected. Signed-off-by: Takashi Iwai <tiwai@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 18:14:47 +01:00
Josef Bacik	39dba8739c	btrfs: do not resolve backrefs for roots that are being deleted Zygo reported a deadlock where a task was stuck in the inode logical resolve code. The deadlock looks like this Task 1 btrfs_ioctl_logical_to_ino ->iterate_inodes_from_logical ->iterate_extent_inodes ->path->search_commit_root isn't set, so a transaction is started ->resolve_indirect_ref for a root that's being deleted ->search for our key, attempt to lock a node, DEADLOCK Task 2 btrfs_drop_snapshot ->walk down to a leaf, lock it, walk up, lock node ->end transaction ->start transaction -> wait_cur_trans Task 3 btrfs_commit_transaction ->wait_event(cur_trans->write_wait, num_writers == 1) DEADLOCK We are holding a transaction open in btrfs_ioctl_logical_to_ino while we try to resolve our references. btrfs_drop_snapshot() holds onto its locks while it stops and starts transaction handles, because it assumes nobody is going to touch the root now. Commit just does what commit does, waiting for the writers to finish, blocking any new trans handles from starting. Fix this by making the backref code not try to resolve backrefs of roots that are currently being deleted. This will keep us from walking into a snapshot that's currently being deleted. This problem was harder to hit before because we rarely broke out of the snapshot delete halfway through, but with my delayed ref throttling code it happened much more often. However we've always been able to do this, so it's not a new problem. Fixes: `8da6d5815c` ("Btrfs: added btrfs_find_all_roots()") Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:03:51 +01:00
Josef Bacik	ea287ab157	btrfs: track reloc roots based on their commit root bytenr We always search the commit root of the extent tree for looking up back references, however we track the reloc roots based on their current bytenr. This is wrong, if we commit the transaction between relocating tree blocks we could end up in this code in build_backref_tree if (key.objectid == key.offset) { /* * Only root blocks of reloc trees use backref * pointing to itself. */ root = find_reloc_root(rc, cur->bytenr); ASSERT(root); cur->root = root; break; } find_reloc_root() is looking based on the bytenr we had in the commit root, but if we've COWed this reloc root we will not find that bytenr, and we will trip over the ASSERT(root). Fix this by using the commit_root->start bytenr for indexing the commit root. Then we change the __update_reloc_root() caller to be used when we switch the commit root for the reloc root during commit. This fixes the panic I was seeing when we started throttling relocation for delayed refs. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:03:51 +01:00
Josef Bacik	50dbbb71c7	btrfs: restart relocate_tree_blocks properly There are two bugs here, but fixing them independently would just result in pain if you happened to bisect between the two patches. First is how we handle the -EAGAIN from relocate_tree_block(). We don't set error, unless we happen to be the first node, which makes no sense, I have no idea what the code was trying to accomplish here. We in fact _do_ want err set here so that we know we need to restart in relocate_block_group(). Also we need finish_pending_nodes() to not actually call link_to_upper(), because we didn't actually relocate the block. And then if we do get -EAGAIN we do not want to set our backref cache last_trans to the one before ours. This would force us to update our backref cache if we didn't cross transaction ids, which would mean we'd have some nodes updated to their new_bytenr, but still able to find their old bytenr because we're searching the same commit root as the last time we went through relocate_tree_blocks. Fixing these two things keeps us from panicing when we start breaking out of relocate_tree_blocks() either for delayed ref flushing or enospc. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:03:51 +01:00
Josef Bacik	5f6b2e5cd6	btrfs: reloc: reorder reservation before root selection Since we're not only checking for metadata reservations but also if we need to throttle our delayed ref generation, reorder reserve_metadata_space() above the select_one_root() call in relocate_tree_block(). The reason we want this is because select_reloc_root() will mess with the backref cache, and if we're going to bail we want to be able to cleanly remove this node from the backref cache and come back along to regenerate it. Move it up so this is the first thing we do to make restarting cleaner. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:03:50 +01:00
Josef Bacik	d7ff00f608	btrfs: do not readahead in build_backref_tree Here we are just searching down to the bytenr we're building the backref tree for, and all of it's paths to the roots. These bytenrs are not guaranteed to be anywhere near each other, so readahead just generates extra latency. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:03:50 +01:00
Josef Bacik	cd22a51c66	btrfs: do not use readahead for running delayed refs Readahead will generate a lot of extra reads for adjacent nodes, but when running delayed refs we have no idea if the next ref is going to be adjacent or not, so this potentially just generates a lot of extra IO. To make matters worse each ref is truly just looking for one item, it doesn't generally search forward, so we simply don't need it here. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:03:50 +01:00
Nikolay Borisov	9babda9f33	btrfs: Remove async_transid from btrfs_mksubvol/create_subvol/create_snapshot With BTRFS_SUBVOL_CREATE_ASYNC support remove it's no longer required to pass the async_transid parameter so remove it and any code using it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:02:00 +01:00
Nikolay Borisov	5d54c67ecc	btrfs: Remove transid argument from btrfs_ioctl_snap_create_transid btrfs_ioctl_snap_create_transid no longer takes a transid argument, so remove it and rename the function to __btrfs_ioctl_snap_create to reflect it's an internal, worker function. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:02:00 +01:00
Nikolay Borisov	9c1036fdb1	btrfs: Remove BTRFS_SUBVOL_CREATE_ASYNC support This functionality was deprecated in kernel 5.4. Since no one has complained of the impending removal it's time we did so. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:02:00 +01:00
Josef Bacik	c75e839414	btrfs: kill the subvol_srcu Now that we have proper root ref counting everywhere we can kill the subvol_srcu. * removal of fs_info::subvol_srcu reduces size of fs_info by 1176 bytes * the refcount_t used for the references checks for accidental 0->1 in cases where the root lifetime would not be properly protected * there's a leak detector for roots to catch unfreed roots at umount time * SRCU served us well over the years but is was not a proper synchronization mechanism for some cases Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:02:00 +01:00
Josef Bacik	efc3453494	btrfs: make btrfs_cleanup_fs_roots use the radix tree lock The radix root is primarily protected by the fs_roots_radix_lock, so use that to lookup and get a ref on all of our fs roots in btrfs_cleanup_fs_roots. The tree reference is taken in the protected section as before. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:59 +01:00
Josef Bacik	4785e24fa5	btrfs: don't take an extra root ref at allocation time Now that all the users of roots take references for them we can drop the extra root ref we've been taking. Before we had roots at 2 refs for the life of the file system, one for the radix tree, and one simply for existing. Now that we have proper ref accounting in all places that use roots we can drop this extra ref simply for existing as we no longer need it. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:59 +01:00
Josef Bacik	dc9492c14c	btrfs: hold a ref on the root on the dead roots list At the point we add a root to the dead roots list we have no open inodes for that root, so we need to hold a ref on that root to keep it from disappearing. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:59 +01:00
Josef Bacik	5c8fd99fec	btrfs: make inodes hold a ref on their roots If we make sure all the inodes have refs on their root we don't have to worry about the root disappearing while we have open inodes. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:59 +01:00
Josef Bacik	8c38938c7b	btrfs: move the root freeing stuff into btrfs_put_root There are a few different ways to free roots, either you allocated them yourself and you just do free_extent_buffer(root->node); free_extent_buffer(root->commit_node); btrfs_put_root(root); Which is the pattern for log roots. Or for snapshots/subvolumes that are being dropped you simply call btrfs_free_fs_root() which does all the cleanup for you. Unify this all into btrfs_put_root(), so that we don't free up things associated with the root until the last reference is dropped. This makes the root freeing code much more significant. The only caveat is at close_ctree() time we have to free the extent buffers for all of our main roots (extent_root, chunk_root, etc) because we have to drop the btree_inode and we'll run into issues if we hold onto those nodes until ->kill_sb() time. This will be addressed in the future when we kill the btree_inode. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:59 +01:00
Josef Bacik	0e996e7fcf	btrfs: move ino_cache_inode dropping out of btrfs_free_fs_root We are going to make root life be controlled soley by refcounting, and inodes will be one of the things that hold a ref on the root. This means we need to handle dropping the ino_cache_inode outside of the root freeing logic, so move it into btrfs_drop_and_free_fs_root() so it is cleaned up properly on unmount. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:58 +01:00
Josef Bacik	3fd6372758	btrfs: make the extent buffer leak check per fs info I'm going to make the entire destruction of btrfs_root's controlled by their refcount, so it will be helpful to notice if we're leaking their eb's on umount. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:58 +01:00
Josef Bacik	7b7b74315b	btrfs: remove a BUG_ON() from merge_reloc_roots() This was pretty subtle, we default to reloc roots having 0 root refs, so if we crash in the middle of the relocation they can just be deleted. If we successfully complete the relocation operations we'll set our root refs to 1 in prepare_to_merge() and then go on to merge_reloc_roots(). At prepare_to_merge() time if any of the reloc roots have a 0 reference still, we will remove that reloc root from our reloc root rb tree, and then clean it up later. However this only happens if we successfully start a transaction. If we've aborted previously we will skip this step completely, and only have reloc roots with a reference count of 0, but were never properly removed from the reloc control's rb tree. This isn't a problem per-se, our references are held by the list the reloc roots are on, and by the original root the reloc root belongs to. If we end up in this situation all the reloc roots will be added to the dirty_reloc_list, and then properly dropped at that point. The reloc control will be free'd and the rb tree is no longer used. There were two options when fixing this, one was to remove the BUG_ON(), the other was to make prepare_to_merge() handle the case where we couldn't start a trans handle. IMO this is the cleaner solution. I started with handling the error in prepare_to_merge(), but it turned out super ugly. And in the end this BUG_ON() simply doesn't matter, the cleanup was happening properly, we were just panicing because this BUG_ON() only matters in the success case. So I've opted to just remove it and add a comment where it was. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:58 +01:00
Josef Bacik	f44deb7442	btrfs: hold a ref on the root->reloc_root We previously were relying on root->reloc_root to be cleaned up by the drop snapshot, or the error handling. However if btrfs_drop_snapshot() failed it wouldn't drop the ref for the root. Also we sort of depend on the right thing to happen with moving reloc roots between lists and the fs root they belong to, which makes it hard to figure out who owns the reference. Fix this by explicitly holding a reference on the reloc root for roo->reloc_root. This means that we hold two references on reloc roots, one for whichever reloc_roots list it's attached to, and the root->reloc_root we're on. This makes it easier to reason out who owns a reference on the root, and when it needs to be dropped. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:58 +01:00
Josef Bacik	f28de8d8fd	btrfs: clear DEAD_RELOC_TREE before dropping the reloc root The DEAD_RELOC_TREE flag is in place in order to avoid a use after free in init_reloc_root, tracking the presence of reloc_root. However adding the explicit tree references in previous patches makes the use after free impossible because at this point we no longer have a reloc_control set on the fs_info and thus cannot enter the function. So move this to be coupled with clearing the root->reloc_root so we're consistent with all other operations of the reloc root. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:58 +01:00
Josef Bacik	1a0afa0ecf	btrfs: free the reloc_control in a consistent way If we have an error while processing the reloc roots we could leak roots that were added to rc->reloc_roots before we hit the error. We could have also not removed the reloc tree mapping from our rb_tree, so clean up any remaining nodes in the reloc root rb_tree. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ use rbtree_postorder_for_each_entry_safe ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:57 +01:00
Josef Bacik	2abc726ab4	btrfs: do not init a reloc root if we aren't relocating We previously were checking if the root had a dead root before accessing root->reloc_root in order to avoid a use-after-free type bug. However this scenario happens after we've unset the reloc control, so we would have been saved if we'd simply checked for fs_info->reloc_control. At this point during relocation we no longer need to be creating new reloc roots, so simply move this check above the reloc_root checks to avoid any future races and confusion. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:57 +01:00
Josef Bacik	6217b0fadd	btrfs: reloc: clean dirty subvols if we fail to start a transaction If we do merge_reloc_roots() we could insert a few roots onto the dirty subvol roots list, where we hold a ref on them. If we fail to start the transaction we need to run clean_dirty_subvols() in order to cleanup the refs. CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:57 +01:00
Josef Bacik	fb2d83eefe	btrfs: unset reloc control if we fail to recover If we fail to load an fs root, or fail to start a transaction we can bail without unsetting the reloc control, which leads to problems later when we free the reloc control but still have it attached to the file system. In the normal path we'll end up calling unset_reloc_control() twice, but all it does is set fs_info->reloc_control = NULL, and we can only have one balance at a time so it's not racey. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:57 +01:00
Josef Bacik	8e19c9732a	btrfs: drop block from cache on error in relocation If we have an error while building the backref tree in relocation we'll process all the pending edges and then free the node. However if we integrated some edges into the cache we'll lose our link to those edges by simply freeing this node, which means we'll leak memory and references to any roots that we've found. Instead we need to use remove_backref_node(), which walks through all of the edges that are still linked to this node and free's them up and drops any root references we may be holding. CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:57 +01:00
Qu Wenruo	19b546d7a1	btrfs: relocation: Use btrfs_find_all_leafs to locate data extent parent tree leaves In relocation, we need to locate all parent tree leaves referring to one data extent, thus we have a complex mechanism to iterate throught extent tree and subvolume trees to locate the related leaves. However this is already done in backref.c, we have btrfs_find_all_leafs(), which can return a ulist containing all leaves referring to that data extent. Use btrfs_find_all_leafs() to replace find_data_references(). There is a special handling for v1 space cache data extents, where we need to delete the v1 space cache data extents, to avoid those data extents to hang the data relocation. In this patch, the special handling is done by re-iterating the root tree leaf. Although it's a little less efficient than the old handling, considering we can reuse a lot of code, it should be acceptable. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:57 +01:00
Josef Bacik	b39c8f5a39	btrfs: fix ref-verify to catch operations on 0 ref extents While debugging I noticed I wasn't getting ref verify errors before everything blew up. Turns out it's because we don't warn when we try to add a normal ref via btrfs_inc_ref() if the block entry exists but has 0 references. This is incorrect, we should never be doing anything other than adding a new extent once a block entry drops to 0 references. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:56 +01:00
Filipe Manana	0a8068a3dd	btrfs: make ranged full fsyncs more efficient Commit `0c713cbab6` ("Btrfs: fix race between ranged fsync and writeback of adjacent ranges") fixed a bug where we could end up with file extent items in a log tree that represent file ranges that overlap due to a race between the hole detection of a ranged full fsync and writeback for a different file range. The problem was solved by forcing any ranged full fsync to become a non-ranged full fsync - setting the range start to 0 and the end offset to LLONG_MAX. This was a simple solution because the code that detected and marked holes was very complex, it used to be done at copy_items() and implied several searches on the fs/subvolume tree. The drawback of that solution was that we started to flush delalloc for the entire file and wait for all the ordered extents to complete for ranged full fsyncs (including ordered extents covering ranges completely outside the given range). Fortunatelly ranged full fsyncs are not the most common case (hopefully for most workloads). However a later fix for detecting and marking holes was made by commit `0e56315ca1` ("Btrfs: fix missing hole after hole punching and fsync when using NO_HOLES") and it simplified a lot the detection of holes, and now copy_items() no longer does it and we do it in a much more simple way at btrfs_log_holes(). This makes it now possible to simply make the code that detects holes to operate only on the initial range and no longer need to operate on the whole file, while also avoiding the need to flush delalloc for the entire file and wait for ordered extents that cover ranges that don't overlap the given range. Another special care is that we must skip file extent items that fall entirely outside the fsync range when copying inode items from the fs/subvolume tree into the log tree - this is to avoid races with ordered extent completion for extents falling outside the fsync range, which could cause us to end up with file extent items in the log tree that have overlapping ranges - for example if the fsync range is [1Mb, 2Mb], when we copy inode items we could copy an extent item for the range [0, 512K], then release the search path and before moving to the next leaf, an ordered extent for a range of [256Kb, 512Kb] completes - this would cause us to copy the new extent item for range [256Kb, 512Kb] into the log tree after we have copied one for the range [0, 512Kb] - the extents overlap, resulting in a corruption. So this change just does these steps: 1) When the NO_HOLES feature is enabled it leaves the initial range intact - no longer sets it to [0, LLONG_MAX] when the full sync bit is set in the inode. If NO_HOLES is not enabled, always set the range to a full, just like before this change, to avoid missing file extent items representing holes after replaying the log (for both full and fast fsyncs); 2) Make the hole detection code to operate only on the fsync range; 3) Make the code that copies items from the fs/subvolume tree to skip copying file extent items that cover a range completely outside the range of the fsync. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:56 +01:00
Filipe Manana	da447009a2	btrfs: factor out inode items copy loop from btrfs_log_inode() The function btrfs_log_inode() is quite large and so is its loop which iterates the inode items from the fs/subvolume tree and copies them into a log tree. Because this is a large loop inside a very large function and because an upcoming patch in this series needs to add some more logic inside that loop, move the loop into a helper function to make it a bit more manageable. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:56 +01:00
Filipe Manana	a5eeb3d17b	btrfs: add helper to get the end offset of a file extent item Getting the end offset for a file extent item requires a bit of code since the extent can be either inline or regular/prealloc. There are some places all over the code base that open code this logic and in another patch later in this series it will be needed again. Therefore encapsulate this logic in a helper function and use it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:56 +01:00
Filipe Manana	95418ed1d1	btrfs: fix missing file extent item for hole after ranged fsync When doing a fast fsync for a range that starts at an offset greater than zero, we can end up with a log that when replayed causes the respective inode miss a file extent item representing a hole if we are not using the NO_HOLES feature. This is because for fast fsyncs we don't log any extents that cover a range different from the one requested in the fsync. Example scenario to trigger it: $ mkfs.btrfs -O ^no-holes -f /dev/sdd $ mount /dev/sdd /mnt # Create a file with a single 256K and fsync it to clear to full sync # bit in the inode - we want the msync below to trigger a fast fsync. $ xfs_io -f -c "pwrite -S 0xab 0 256K" -c "fsync" /mnt/foo # Force a transaction commit and wipe out the log tree. $ sync # Dirty 768K of data, increasing the file size to 1Mb, and flush only # the range from 256K to 512K without updating the log tree # (sync_file_range() does not trigger fsync, it only starts writeback # and waits for it to finish). $ xfs_io -c "pwrite -S 0xcd 256K 768K" /mnt/foo $ xfs_io -c "sync_range -abw 256K 256K" /mnt/foo # Now dirty the range from 768K to 1M again and sync that range. $ xfs_io -c "mmap -w 768K 256K" \ -c "mwrite -S 0xef 768K 256K" \ -c "msync -s 768K 256K" \ -c "munmap" \ /mnt/foo <power fail> # Mount to replay the log. $ mount /dev/sdd /mnt $ umount /mnt $ btrfs check /dev/sdd Opening filesystem to check... Checking filesystem on /dev/sdd UUID: 482fb574-b288-478e-a190-a9c44a78fca6 [1/7] checking root items [2/7] checking extents [3/7] checking free space cache [4/7] checking fs roots root 5 inode 257 errors 100, file extent discount Found file extent holes: start: 262144, len: 524288 ERROR: errors found in fs roots found 720896 bytes used, error(s) found total csum bytes: 512 total tree bytes: 131072 total fs tree bytes: 32768 total extent tree bytes: 16384 btree space waste bytes: 123514 file data blocks allocated: 589824 referenced 589824 Fix this issue by setting the range to full (0 to LLONG_MAX) when the NO_HOLES feature is not enabled. This results in extra work being done but it gives the guarantee we don't end up with missing holes after replaying the log. CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:56 +01:00
Nikolay Borisov	db161806dc	btrfs: account ticket size at add/delete time Instead of iterating all pending tickets on the normal/priority list to sum their total size the cost can be amortized across ticket addition/ removal. This turns O(n) + O(m) (where n is the size of the normal list and m of the priority list) into O(1). This will mostly have effect in workloads that experience heavy flushing. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:55 +01:00
Roman Gushchin	f8e6608180	btrfs: implement migratepage callback for data pages Currently btrfs doesn't provide a migratepage callback for data pages. It means that fallback_migrate_page() is used to migrate btrfs pages. fallback_migrate_page() cannot move dirty pages, instead it tries to flush them (in sync mode) or just fails (in async mode). In the sync mode pages which are scheduled to be processed by btrfs_writepage_fixup_worker() can't be effectively flushed by the migration code, because there is no established way to wait for the completion of the delayed work. It all leads to page migration failures. To fix it the patch implements a btrs-specific migratepage callback, which is similar to iomap_migrate_page() used by some other fs, except it does take care of the PagePrivate2 flag which is used for data ordering purposes. Reviewed-by: Chris Mason <clm@fb.com> Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:55 +01:00
Nikolay Borisov	0078a9f941	btrfs: Remove block_rsv parameter from btrfs_drop_snapshot It's no longer used following `30d40577e3` ("btrfs: reloc: Also queue orphan reloc tree for cleanup to avoid BUG_ON()"), so just remove it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:55 +01:00
Nikolay Borisov	63f018be57	btrfs: Remove __ prefix from btrfs_block_rsv_release Currently the non-prefixed version is a simple wrapper used to hide the 4th argument of the prefixed version. This doesn't bring much value in practice and only makes the code harder to follow by adding another level of indirection. Rectify this by removing the __ prefix and have only one public function to release bytes from a block reservation. No semantic changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:55 +01:00
Qu Wenruo	f31ea0888c	btrfs: relocation: Check cancel request after each extent found When relocating data block groups with tons of small extents, or large metadata block groups, there can be over 200,000 extents. We will iterate all extents of such block group in relocate_block_group(), where iteration itself can be kinda time-consuming. So when user want to cancel the balance, the extent iteration loop can be another target. This patch will add the cancelling check in the extent iteration loop of relocate_block_group() to make balance cancelling faster. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:55 +01:00
Qu Wenruo	7f913c7cfe	btrfs: relocation: Check cancel request after each data page read When relocating a data extents with large large data extents, we spend most of our time in relocate_file_extent_cluster() at stage "moving data extents": 1) \| btrfs_relocate_block_group [btrfs]() { 1) \| relocate_file_extent_cluster [btrfs]() { 1) $ 6586769 us \| } 1) + 18.260 us \| relocate_file_extent_cluster [btrfs](); 1) + 15.770 us \| relocate_file_extent_cluster [btrfs](); 1) $ 8916340 us \| } 1) \| btrfs_relocate_block_group [btrfs]() { 1) \| relocate_file_extent_cluster [btrfs]() { 1) $ 11611586 us \| } 1) + 16.930 us \| relocate_file_extent_cluster [btrfs](); 1) + 15.870 us \| relocate_file_extent_cluster [btrfs](); 1) $ 14986130 us \| } To make data relocation cancelling quicker, add extra balance cancelling check after each page read in relocate_file_extent_cluster(). Cleanup and error handling uses the same mechanism as if the whole process finished Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:54 +01:00
Qu Wenruo	726a342120	btrfs: relocation: add error injection points for cancelling balance Introduce a new error injection point, should_cancel_balance(). It's just a wrapper of atomic_read(&fs_info->balance_cancel_req), but allows us to override the return value. Currently there are only one locations using this function: - btrfs_balance() It checks cancel before each block group. There are other locations checking fs_info->balance_cancel_req, but they are not used as an indicator to exit, so there is no need to use the wrapper. But there will be more locations coming, and some locations can cause kernel panic if not handled properly. So introduce this error injection to provide better test interface. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:54 +01:00
Filipe Manana	05a5a7621c	Btrfs: implement full reflink support for inline extents There are a few cases where we don't allow cloning an inline extent into the destination inode, returning -EOPNOTSUPP to user space. This was done to prevent several types of file corruption and because it's not very straightforward to deal with these cases, as they can't rely on simply copying the inline extent between leaves. Such cases require copying the inline extent's data into the respective page of the destination inode. Not supporting these cases makes it harder and more cumbersome to write applications/libraries that work on any filesystem with reflink support, since all these cases for which btrfs fails with -EOPNOTSUPP work just fine on xfs for example. These unsupported cases are also not documented anywhere and explaining which exact cases fail require a bit of too technical understanding of btrfs's internal (inline extents and when and where can they exist in a file), so it's not really user friendly. Also some test cases from fstests that use fsx, such as generic/522 for example, can sporadically fail because they trigger one of these cases, and fsx expects all operations to succeed. This change adds supports for cloning all these cases by copying the inline extent's data into the respective page of the destination inode. With this change test case btrfs/112 from fstests fails because it expects some clone operations to fail, so it will be updated. Also a new test case that exercises all these previously unsupported cases will be added to fstests. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:54 +01:00
Filipe Manana	a61e1e0df9	Btrfs: simplify inline extent handling when doing reflinks We can not reflink parts of an inline extent, we must always reflink the whole inline extent. We know that inline extents always start at file offset 0 and that can never represent an amount of data larger then the filesystem's sector size (both compressed and uncompressed). We also have had the constraints that reflink operations must have a start offset that is aligned to the sector size and an end offset that is also aligned or it ends the inode's i_size, so there's no way for user space to be able to do a reflink operation that will refer to only a part of an inline extent. Initially there was a bug in the inlining code that could allow compressed inline extents that encoded more than 1 page, but that was fixed in 2008 by commit `70b99e6959` ("Btrfs: Compression corner fixes") since that was problematic. So remove all the extent cloning code that deals with the possibility of cloning only partial inline extents. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:54 +01:00
Filipe Manana	6a17738100	Btrfs: move all reflink implementation code into its own file The reflink code is quite large and has been living in ioctl.c since ever. It has grown over the years after many bug fixes and improvements, and since I'm planning on making some further improvements on it, it's time to get it better organized by moving into its own file, reflink.c (similar to what xfs does for example). This change only moves the code out of ioctl.c into the new file, it doesn't do any other change. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:54 +01:00
Gustavo A. R. Silva	a8753ee3a8	btrfs: scrub: Replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero." [1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit `7649773293` ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:54 +01:00
Gustavo A. R. Silva	7593f4c53c	btrfs: rcu-string: Replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero." [1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit `7649773293` ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:53 +01:00
Gustavo A. R. Silva	17b238acf7	btrfs: delayed-inode: Replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero." [1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit `7649773293` ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:53 +01:00
Madhuparna Bhowmik	29566c9c77	btrfs: add RCU locks around block group initialization The space_info list is normally RCU protected and should be traversed with rcu_read_lock held. There's a warning [29.104756] WARNING: suspicious RCU usage [29.105046] 5.6.0-rc4-next-20200305 #1 Not tainted [29.105231] ----------------------------- [29.105401] fs/btrfs/block-group.c:2011 RCU-list traversed in non-reader section!! pointing out that the locking is missing in btrfs_read_block_groups. However this is not necessary as the list traversal happens at mount time when there's no other thread potentially accessing the list. To fix the warning and for consistency let's add the RCU lock/unlock, the code won't be affected much as it's doing some lightweight operations. Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:53 +01:00
Nikolay Borisov	65cd6d9e30	btrfs: Open code insert_extent_backref No need to add a level of indirection for hiding a simple 'if'. Open code insert_extent_backref in its sole caller. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:53 +01:00
Nikolay Borisov	c6600d9ac6	btrfs: Remove impossible BUG_ON in get_tree_block_key relocate_tree_blocks calls get_tree_block_key for a block iff that block has its ->key_ready equal false. Thus the BUG_ON in the latter function cannot ever be triggered so remove it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:53 +01:00
David Sterba	5ba366c399	btrfs: balance: factor out convert profile validation The validation follows the same steps for all three block group types, the existing helper validate_convert_profile can be enhanced and do more of the common things. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:52 +01:00
David Sterba	c67b38925b	btrfs: return void from csum_tree_block Now that csum_tree_block is not returning any errors, we can make csum_tree_block return void and simplify callers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:52 +01:00
David Sterba	e9be5a303d	btrfs: simplify tree block checksumming loop Thw whole point of csum_tree_block is to iterate over all extent buffer pages and pass it to checksumming functions. The bytes where checksum is stored must be skipped, thus map_private_extent_buffer. This complicates further offset calculations. As the first page will be always present, checksum the relevant bytes unconditionally and then do a simple iteration over the remaining pages. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:52 +01:00
David Sterba	59a0fcdb48	btrfs: inline checksum name and driver definitions There's an unnecessary indirection in the checksum definition table, pointer and the string itself. The strings are short and the overall size of one entry is now 24 bytes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:52 +01:00
Nikolay Borisov	11c67b1a40	btrfs: Rename __btrfs_alloc_chunk to btrfs_alloc_chunk Having btrfs_alloc_chunk doesn't bring any value since it encapsulates a lockdep assert and a call to find_next_chunk. Simply rename the internal __btrfs_alloc_chunk function to the public one and remove it's 2nd parameter as all callers always pass the return value of find_next_chunk. Finally, migrate the call to lockdep_assert_held so as to not lose the check. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:52 +01:00
Josef Bacik	fa121a26b2	btrfs: fix btrfs_calc_reclaim_metadata_size calculation I noticed while running my snapshot torture test that we were getting a lot of metadata chunks allocated with very little actually used. Digging into this we would commit the transaction, still not have enough space, and then force a chunk allocation. I noticed that we were barely flushing any delalloc at all, despite the fact that we had around 13gib of outstanding delalloc reservations. It turns out this is because of our btrfs_calc_reclaim_metadata_size() calculation. It _only_ takes into account the outstanding ticket sizes, which isn't the whole story. In this particular workload we're slowly filling up the disk, which means our overcommit space will suddenly become a lot less, and our outstanding reservations will be well more than what we can handle. However we are only flushing based on our ticket size, which is much less than we need to actually reclaim. So fix btrfs_calc_reclaim_metadata_size() to take into account the overage in the case that we've gotten less available space suddenly. This makes it so we attempt to reclaim a lot more delalloc space, which allows us to make our reservations and we no longer are allocating a bunch of needless metadata chunks. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:52 +01:00
Filipe Manana	f0cc2cd701	Btrfs: fix crash during unmount due to race with delayed inode workers During unmount we can have a job from the delayed inode items work queue still running, that can lead to at least two bad things: 1) A crash, because the worker can try to create a transaction just after the fs roots were freed; 2) A transaction leak, because the worker can create a transaction before the fs roots are freed and just after we committed the last transaction and after we stopped the transaction kthread. A stack trace example of the crash: [79011.691214] kernel BUG at lib/radix-tree.c:982! [79011.692056] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI [79011.693180] CPU: 3 PID: 1394 Comm: kworker/u8:2 Tainted: G W 5.6.0-rc2-btrfs-next-54 #2 (...) [79011.696789] Workqueue: btrfs-delayed-meta btrfs_work_helper [btrfs] [79011.697904] RIP: 0010:radix_tree_tag_set+0xe7/0x170 (...) [79011.702014] RSP: 0018:ffffb3c84a317ca0 EFLAGS: 00010293 [79011.702949] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [79011.704202] RDX: ffffb3c84a317cb0 RSI: ffffb3c84a317ca8 RDI: ffff8db3931340a0 [79011.705463] RBP: 0000000000000005 R08: 0000000000000005 R09: ffffffff974629d0 [79011.706756] R10: ffffb3c84a317bc0 R11: 0000000000000001 R12: ffff8db393134000 [79011.708010] R13: ffff8db3931340a0 R14: ffff8db393134068 R15: 0000000000000001 [79011.709270] FS: 0000000000000000(0000) GS:ffff8db3b6a00000(0000) knlGS:0000000000000000 [79011.710699] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [79011.711710] CR2: 00007f22c2a0a000 CR3: 0000000232ad4005 CR4: 00000000003606e0 [79011.712958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [79011.714205] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [79011.715448] Call Trace: [79011.715925] record_root_in_trans+0x72/0xf0 [btrfs] [79011.716819] btrfs_record_root_in_trans+0x4b/0x70 [btrfs] [79011.717925] start_transaction+0xdd/0x5c0 [btrfs] [79011.718829] btrfs_async_run_delayed_root+0x17e/0x2b0 [btrfs] [79011.719915] btrfs_work_helper+0xaa/0x720 [btrfs] [79011.720773] process_one_work+0x26d/0x6a0 [79011.721497] worker_thread+0x4f/0x3e0 [79011.722153] ? process_one_work+0x6a0/0x6a0 [79011.722901] kthread+0x103/0x140 [79011.723481] ? kthread_create_worker_on_cpu+0x70/0x70 [79011.724379] ret_from_fork+0x3a/0x50 (...) The following diagram shows a sequence of steps that lead to the crash during ummount of the filesystem: CPU 1 CPU 2 CPU 3 btrfs_punch_hole() btrfs_btree_balance_dirty() btrfs_balance_delayed_items() --> sees fs_info->delayed_root->items with value 200, which is greater than BTRFS_DELAYED_BACKGROUND (128) and smaller than BTRFS_DELAYED_WRITEBACK (512) btrfs_wq_run_delayed_node() --> queues a job for fs_info->delayed_workers to run btrfs_async_run_delayed_root() btrfs_async_run_delayed_root() --> job queued by CPU 1 --> starts picking and running delayed nodes from the prepare_list list close_ctree() btrfs_delete_unused_bgs() btrfs_commit_super() btrfs_join_transaction() --> gets transaction N btrfs_commit_transaction(N) --> set transaction state to TRANTS_STATE_COMMIT_START btrfs_first_prepared_delayed_node() --> picks delayed node X through the prepared_list list btrfs_run_delayed_items() btrfs_first_delayed_node() --> also picks delayed node X but through the node_list list __btrfs_commit_inode_delayed_items() --> runs all delayed items from this node and drops the node's item count to 0 through call to btrfs_release_delayed_inode() --> finishes running any remaining delayed nodes --> finishes transaction commit --> stops cleaner and transaction threads btrfs_free_fs_roots() --> frees all roots and removes them from the radix tree fs_info->fs_roots_radix btrfs_join_transaction() start_transaction() btrfs_record_root_in_trans() record_root_in_trans() radix_tree_tag_set() --> crashes because the root is not in the radix tree anymore If the worker is able to call btrfs_join_transaction() before the unmount task frees the fs roots, we end up leaking a transaction and all its resources, since after the call to btrfs_commit_super() and stopping the transaction kthread, we don't expect to have any transaction open anymore. When this situation happens the worker has a delayed node that has no more items to run, since the task calling btrfs_run_delayed_items(), which is doing a transaction commit, picks the same node and runs all its items first. We can not wait for the worker to complete when running delayed items through btrfs_run_delayed_items(), because we call that function in several phases of a transaction commit, and that could cause a deadlock because the worker calls btrfs_join_transaction() and the task doing the transaction commit may have already set the transaction state to TRANS_STATE_COMMIT_DOING. Also it's not possible to get into a situation where only some of the items of a delayed node are added to the fs/subvolume tree in the current transaction and the remaining ones in the next transaction, because when running the items of a delayed inode we lock its mutex, effectively waiting for the worker if the worker is running the items of the delayed node already. Since this can only cause issues when unmounting a filesystem, fix it in a simple way by waiting for any jobs on the delayed workers queue before calling btrfs_commit_supper() at close_ctree(). This works because at this point no one can call btrfs_btree_balance_dirty() or btrfs_balance_delayed_items(), and if we end up waiting for any worker to complete, btrfs_commit_super() will commit the transaction created by the worker. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:51 +01:00
Naohiro Aota	7e89540942	btrfs: factor out prepare_allocation() for extent allocation This function finally factor out prepare_allocation() form find_free_extent(). This function is called before the allocation loop and a specific allocator function like prepare_allocation_clustered() should initialize their private information and can set proper hint_byte to indicate where to start the allocation with. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:51 +01:00
Naohiro Aota	45d8e033b2	btrfs: skip LOOP_NO_EMPTY_SIZE if not clustered allocation LOOP_NO_EMPTY_SIZE is solely dedicated for clustered allocation. So, we can skip this stage and give up the allocation. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:51 +01:00
Naohiro Aota	c70e2139dc	btrfs: factor out chunk_allocation_failed() for extent allocation Factor out chunk_allocation_failed() from find_free_extent_update_loop(). This function is called when it failed to allocate a chunk. The function can modify "ffe_ctl->loop" and return 0 to continue with the next stage. Or, it can return -ENOSPC to give up here. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:51 +01:00
Naohiro Aota	15b7ee6584	btrfs: drop unnecessary arguments from find_free_extent_update_loop() Now that, we don't use last_ptr and use_cluster in the function. Drop these arguments from it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:51 +01:00
Naohiro Aota	0ab9724bf5	btrfs: factor out found_extent() for extent allocation Factor out found_extent() from find_free_extent_update_loop(). This function is called when a proper extent is found and before returning from find_free_extent(). Hook functions like found_extent_clustered() should save information for a next allocation. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:50 +01:00
Naohiro Aota	baba50624f	btrfs: factor out release_block_group() Factor out release_block_group() from find_free_extent(). This function is called when it gives up an allocation from a block group. Each allocation policy should reset its information for an allocation in the next block group. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:50 +01:00
Naohiro Aota	897cae7948	btrfs: drop unnecessary arguments from clustered allocation functions Now that, find_free_extent_clustered() and find_free_extent_unclustered() can access "last_ptr" from the "clustered" variable, we can drop it from the arguments. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:50 +01:00
Naohiro Aota	c668690dc0	btrfs: factor out do_allocation() for extent allocation Factor out do_allocation() from find_free_extent(). This function do an actual extent allocation in a given block group. The ffe_ctl->policy is used to determine the actual allocator function to use. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:50 +01:00
Naohiro Aota	c10859be9b	btrfs: move variables for clustered allocation into find_free_extent_ctl Move "last_ptr" and "use_cluster" into struct find_free_extent_ctl, so that hook functions for clustered allocator can use these variables. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:50 +01:00
Naohiro Aota	ea544149a4	btrfs: move hint_byte into find_free_extent_ctl This commit moves hint_byte into find_free_extent_ctl, so that we can modify the hint_byte in the other functions. This will help us split find_free_extent further. This commit also renames the function argument "hint_byte" to "hint_byte_orig" to avoid misuse. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:49 +01:00
Naohiro Aota	cb2f96f8ab	btrfs: introduce extent allocation policy This commit introduces extent allocation policy for btrfs. This policy controls how btrfs allocate an extents from block groups. There is no functional change introduced with this commit. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:49 +01:00
Naohiro Aota	6aafb30384	btrfs: parameterize dev_extent_min for chunk allocation Currently, we ignore a device whose available space is less than "BTRFS_STRIPE_LEN * dev_stripes". This is a lower limit for current allocation policy (to maximize the number of stripes). This commit parameterizes dev_extent_min, so that other policies can set their own lower limitat to ignore a device with insufficient space. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:49 +01:00
Naohiro Aota	dce580ca40	btrfs: factor out create_chunk() Factor out create_chunk() from __btrfs_alloc_chunk(). This function finally creates a chunk. There is no functional changes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:49 +01:00
Naohiro Aota	5badf512ec	btrfs: factor out decide_stripe_size() Factor out decide_stripe_size() from __btrfs_alloc_chunk(). This function calculates the actual stripe size to allocate. decide_stripe_size() handles the common case to round down the 'ndevs' to 'devs_increment' and check the upper and lower limitation of 'ndevs'. decide_stripe_size_regular() decides the size of a stripe and the size of a chunk. The policy is to maximize the number of stripes. This commit has no functional changes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:49 +01:00
Naohiro Aota	560156cb25	btrfs: factor out gather_device_info() Factor out gather_device_info() from __btrfs_alloc_chunk(). This function iterates over devices list and gather information about devices. This commit also introduces "max_avail" and "dev_extent_min" to fold the same calculation to one variable. This commit has no functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:49 +01:00
Naohiro Aota	27c314d5ca	btrfs: factor out init_alloc_chunk_ctl Factor out init_alloc_chunk_ctl() from __btrfs_alloc_chunk(). This function initialises parameters of "struct alloc_chunk_ctl" for allocation. init_alloc_chunk_ctl() handles a common part of the initialisation to load the RAID parameters from btrfs_raid_array. init_alloc_chunk_ctl_policy_regular() decides some parameters for its allocation. The last "else" case in the original code is moved to __btrfs_alloc_chunk() to handle the error case in the common code. Replace the BUG_ON with ASSERT() and error return at the same time. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:48 +01:00
Naohiro Aota	4f2bafe8a4	btrfs: introduce alloc_chunk_ctl Introduce "struct alloc_chunk_ctl" to wrap needed parameters for the chunk allocation. This will be used to split __btrfs_alloc_chunk() into smaller functions. This commit folds a number of local variables in __btrfs_alloc_chunk() into one "struct alloc_chunk_ctl ctl". There is no functional change. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:48 +01:00
Naohiro Aota	3b4ffa4088	btrfs: refactor find_free_dev_extent_start() Factor out two functions from find_free_dev_extent_start(). dev_extent_search_start() decides the starting position of the search. dev_extent_hole_check() checks if a hole found is suitable for device extent allocation. These functions also have the switch-cases to change the allocation behavior depending on the policy. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:48 +01:00
Naohiro Aota	c4a816c67c	btrfs: introduce chunk allocation policy Introduce chunk allocation policy for btrfs. This policy controls how chunks and device extents are allocated from devices. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:48 +01:00
Naohiro Aota	b25c19f49e	btrfs: handle invalid profile in chunk allocation Do not BUG_ON() when an invalid profile is passed to __btrfs_alloc_chunk(). Instead return -EINVAL with ASSERT() to catch a bug in the development stage. Suggested-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:48 +01:00
Naohiro Aota	52d40aba68	btrfs: change full_search to bool in find_free_extent_update_loop While the "full_search" variable defined in find_free_extent() is bool, but the full_search argument of find_free_extent_update_loop() is defined as int. Let's trivially fix the argument type. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:47 +01:00
Qu Wenruo	daf475c915	btrfs: qgroup: Remove the unnecesaary spin lock for qgroup_rescan_running After the previous patch, qgroup_rescan_running is protected by btrfs_fs_info::qgroup_rescan_lock, thus no need for the extra spinlock. Suggested-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:47 +01:00
Qu Wenruo	d61acbbf54	btrfs: qgroup: ensure qgroup_rescan_running is only set when the worker is at least queued [BUG] There are some reports about btrfs wait forever to unmount itself, with the following call trace: INFO: task umount:4631 blocked for more than 491 seconds. Tainted: G X 5.3.8-2-default #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. umount D 0 4631 3337 0x00000000 Call Trace: ([<00000000174adf7a>] __schedule+0x342/0x748) [<00000000174ae3ca>] schedule+0x4a/0xd8 [<00000000174b1f08>] schedule_timeout+0x218/0x420 [<00000000174af10c>] wait_for_common+0x104/0x1d8 [<000003ff804d6994>] btrfs_qgroup_wait_for_completion+0x84/0xb0 [btrfs] [<000003ff8044a616>] close_ctree+0x4e/0x380 [btrfs] [<0000000016fa3136>] generic_shutdown_super+0x8e/0x158 [<0000000016fa34d6>] kill_anon_super+0x26/0x40 [<000003ff8041ba88>] btrfs_kill_super+0x28/0xc8 [btrfs] [<0000000016fa39f8>] deactivate_locked_super+0x68/0x98 [<0000000016fcb198>] cleanup_mnt+0xc0/0x140 [<0000000016d6a846>] task_work_run+0xc6/0x110 [<0000000016d04f76>] do_notify_resume+0xae/0xb8 [<00000000174b30ae>] system_call+0xe2/0x2c8 [CAUSE] The problem happens when we have called qgroup_rescan_init(), but not queued the worker. It can be caused mostly by error handling. Qgroup ioctl thread \| Unmount thread ----------------------------------------+----------------------------------- \| btrfs_qgroup_rescan() \| \|- qgroup_rescan_init() \| \| \|- qgroup_rescan_running = true; \| \| \| \|- trans = btrfs_join_transaction() \| \| Some error happened \| \| \| \|- btrfs_qgroup_rescan() returns error \| But qgroup_rescan_running == true; \| \| close_ctree() \| \|- btrfs_qgroup_wait_for_completion() \| \|- running == true; \| \|- wait_for_completion(); btrfs_qgroup_rescan_worker is never queued, thus no one is going to wake up close_ctree() and we get a deadlock. All involved qgroup_rescan_init() callers are: - btrfs_qgroup_rescan() The example above. It's possible to trigger the deadlock when error happened. - btrfs_quota_enable() Not possible. Just after qgroup_rescan_init() we queue the work. - btrfs_read_qgroup_config() It's possible to trigger the deadlock. It only init the work, the work queueing happens in btrfs_qgroup_rescan_resume(). Thus if error happened in between, deadlock is possible. We shouldn't set fs_info->qgroup_rescan_running just in qgroup_rescan_init(), as at that stage we haven't yet queued qgroup rescan worker to run. [FIX] Set qgroup_rescan_running before queueing the work, so that we ensure the rescan work is queued when we wait for it. Fixes: `8d9eddad19` ("Btrfs: fix qgroup rescan worker initialization") Signed-off-by: Jeff Mahoney <jeffm@suse.com> [ Change subject and cause analyse, use a smaller fix ] Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:47 +01:00
Andy Shevchenko	807fc790aa	btrfs: switch to use new generic UUID API There are new types and helpers that are supposed to be used in new code. As a preparation to get rid of legacy types and API functions do the conversion here. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:47 +01:00

... 3 4 5 6 7 ...

63909 commits