Commit graph

70664 commits

Author SHA1 Message Date
Colin Ian King
f4bf74d829 fs/proc/generic.c: fix incorrect pde_is_permanent check
Currently the pde_is_permanent() check is being run on root multiple times
rather than on the next proc directory entry.  This looks like a
copy-paste error.  Fix this by replacing root with next.

Addresses-Coverity: ("Copy-paste error")
Link: https://lkml.kernel.org/r/20210318122633.14222-1-colin.king@canonical.com
Fixes: d919b33daf ("proc: faster open/read/close with "permanent" files")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-06 19:24:11 -07:00
Linus Torvalds
7ac86b3dca Notable items here are a series to take advantage of David Howells'
netfs helper library from Jeff, three new filesystem client metrics
 from Xiubo, ceph.dir.rsnaps vxattr from Yanhu and two auth-related
 fixes from myself, marked for stable.  Interspersed is a smattering
 of assorted fixes and cleanups across the filesystem.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmCT8IITHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzizgqCACYbyY4Yr/2C8fZsn+P9rd97zRTbcC6
 eufTZwnlECLnc89BxJQRk9a2UpDJfC8RMM3/9tmiulc8G4M+ggVbdFQTCzsZox3c
 vLAunGeVyfKIY+16Bv2RNuoO3KeeZm5aB3jXJ5QcUPcXmd4XnHKI1FU2ebC56UJb
 pxxfHpE6fb59r6Ek1e5uUFyta4KDMrvwXozghuAPEgT1GpKeA9zMIGI0CkQbBHlW
 PWHpcahTiT6GWa/d9ud0CnfssiBxVydWyKTz9xppYC6LNdsZUf9tBmYYGRklcjoA
 yAwPSuqxNmg+7uWubEawc0+a/3fXORgp2SF7Rbp1XYE+HpfnMF1J+nIn
 =IO5c
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.13-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "Notable items here are

   - a series to take advantage of David Howells' netfs helper library
     from Jeff

   - three new filesystem client metrics from Xiubo

   - ceph.dir.rsnaps vxattr from Yanhu

   - two auth-related fixes from myself, marked for stable.

  Interspersed is a smattering of assorted fixes and cleanups across the
  filesystem"

* tag 'ceph-for-5.13-rc1' of git://github.com/ceph/ceph-client: (24 commits)
  libceph: allow addrvecs with a single NONE/blank address
  libceph: don't set global_id until we get an auth ticket
  libceph: bump CephXAuthenticate encoding version
  ceph: don't allow access to MDS-private inodes
  ceph: fix up some bare fetches of i_size
  ceph: convert some PAGE_SIZE invocations to thp_size()
  ceph: support getting ceph.dir.rsnaps vxattr
  ceph: drop pinned_page parameter from ceph_get_caps
  ceph: fix inode leak on getattr error in __fh_to_dentry
  ceph: only check pool permissions for regular files
  ceph: send opened files/pinned caps/opened inodes metrics to MDS daemon
  ceph: avoid counting the same request twice or more
  ceph: rename the metric helpers
  ceph: fix kerneldoc copypasta over ceph_start_io_direct
  ceph: use attach/detach_page_private for tracking snap context
  ceph: don't use d_add in ceph_handle_snapdir
  ceph: don't clobber i_snap_caps on non-I_NEW inode
  ceph: fix fall-through warnings for Clang
  ceph: convert ceph_readpages to ceph_readahead
  ceph: convert ceph_write_begin to netfs_write_begin
  ...
2021-05-06 10:27:02 -07:00
Linus Torvalds
682a8e2b41 Code cleanups and a bug fix
- W=1 compiler warning cleanups
 - Mutex initialization simplification
 - Protect against NULL pointer exception during mount
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEKvuQkp28KvPJn/fVfL7QslYSS/0FAmCTTQoRHGNvZGVAdHlo
 aWNrcy5jb20ACgkQfL7QslYSS/3M4RAAlpWKtZlUd83gr2wS7jKX+WR6Bmc9yu2Q
 hH9eJRca1O3LQHzRfsh6Y4PQmQD10kPOZUD4Qfy5FgYebNu/yapjx+2O7VpXeiiu
 b+Ien7l5f2qyBUjjIypW7m0T52YHDV5vc2nWkGoxYl3g6ecla3THdCmzpMD7iAUa
 7tNdDd+zF+I2Lm0nVW2r/wLSUHoWs1d5u5GKOWkWgyEa80A4F6nHR6g+1GuUYhpE
 8jfo5NnGGYjM3M+uH2MUBGan7K99YH+bTTsCSeEOSx7MK1X7zeH9pKaHIpPIYI6n
 m7iymS0qAfkdlK8GEMLVTgjVPY3Jh6f7iLug2+js4y+j2sXsBZvJzQHRS6uKu4YP
 dW/Da/0qntnnIyE0gD/SnwRGjTfv3LBbnw2vJSv2d9z9N50gIHJB43ZFP4XYfK0n
 WWHce7W7O+lo+4ZPTDd4G6Nz96Fi9Bl6bMA71Pw6P/J9KWnl6LOyp95YHifDlZ+f
 asYkxOvHDJagA6mKos0lnc/GT8IPAe65p6Jsq0IE32hplS4Cq9ajrUZ+3+VHsADr
 ZvG84IWe2vK2fWgvcaSpxLx3BhJTh/qkN/ISOykzVMWbet114XeXCC4hOQOVqU/A
 fsJgq6TeMMlpBeUnHobRN0AihGpatgiaMq7GtlZyEmk4Rkotf65vw/gmnR301pDq
 tTVVPw1XTdQ=
 =aHZx
 -----END PGP SIGNATURE-----

Merge tag 'ecryptfs-5.13-rc1-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs

Pull ecryptfs updates from Tyler Hicks:
 "Code cleanups and a bug fix

   - W=1 compiler warning cleanups

   - Mutex initialization simplification

   - Protect against NULL pointer exception during mount"

* tag 'ecryptfs-5.13-rc1-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
  ecryptfs: fix kernel panic with null dev_name
  ecryptfs: remove unused helpers
  ecryptfs: Fix typo in message
  eCryptfs: Use DEFINE_MUTEX() for mutex lock
  ecryptfs: keystore: Fix some kernel-doc issues and demote non-conformant headers
  ecryptfs: inode: Help out nearly-there header and demote non-conformant ones
  ecryptfs: mmap: Help out one function header and demote other abuses
  ecryptfs: crypto: Supply some missing param descriptions and demote abuses
  ecryptfs: miscdev: File headers are not good kernel-doc candidates
  ecryptfs: main: Demote a bunch of non-conformant kernel-doc headers
  ecryptfs: messaging: Add missing param descriptions and demote abuses
  ecryptfs: super: Fix formatting, naming and kernel-doc abuses
  ecryptfs: file: Demote kernel-doc abuses
  ecryptfs: kthread: Demote file header and provide description for 'cred'
  ecryptfs: dentry: File headers are not good candidates for kernel-doc
  ecryptfs: debug: Demote a couple of kernel-doc abuses
  ecryptfs: read_write: File headers do not make good candidates for kernel-doc
  ecryptfs: use DEFINE_MUTEX() for mutex lock
  eCryptfs: add a semicolon
2021-05-06 10:06:39 -07:00
yangerkun
cf7b39a0cb block: reexpand iov_iter after read/write
We get a bug:

BUG: KASAN: slab-out-of-bounds in iov_iter_revert+0x11c/0x404
lib/iov_iter.c:1139
Read of size 8 at addr ffff0000d3fb11f8 by task

CPU: 0 PID: 12582 Comm: syz-executor.2 Not tainted
5.10.0-00843-g352c8610ccd2 #2
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace+0x0/0x2d0 arch/arm64/kernel/stacktrace.c:132
 show_stack+0x28/0x34 arch/arm64/kernel/stacktrace.c:196
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x110/0x164 lib/dump_stack.c:118
 print_address_description+0x78/0x5c8 mm/kasan/report.c:385
 __kasan_report mm/kasan/report.c:545 [inline]
 kasan_report+0x148/0x1e4 mm/kasan/report.c:562
 check_memory_region_inline mm/kasan/generic.c:183 [inline]
 __asan_load8+0xb4/0xbc mm/kasan/generic.c:252
 iov_iter_revert+0x11c/0x404 lib/iov_iter.c:1139
 io_read fs/io_uring.c:3421 [inline]
 io_issue_sqe+0x2344/0x2d64 fs/io_uring.c:5943
 __io_queue_sqe+0x19c/0x520 fs/io_uring.c:6260
 io_queue_sqe+0x2a4/0x590 fs/io_uring.c:6326
 io_submit_sqe fs/io_uring.c:6395 [inline]
 io_submit_sqes+0x4c0/0xa04 fs/io_uring.c:6624
 __do_sys_io_uring_enter fs/io_uring.c:9013 [inline]
 __se_sys_io_uring_enter fs/io_uring.c:8960 [inline]
 __arm64_sys_io_uring_enter+0x190/0x708 fs/io_uring.c:8960
 __invoke_syscall arch/arm64/kernel/syscall.c:36 [inline]
 invoke_syscall arch/arm64/kernel/syscall.c:48 [inline]
 el0_svc_common arch/arm64/kernel/syscall.c:158 [inline]
 do_el0_svc+0x120/0x290 arch/arm64/kernel/syscall.c:227
 el0_svc+0x1c/0x28 arch/arm64/kernel/entry-common.c:367
 el0_sync_handler+0x98/0x170 arch/arm64/kernel/entry-common.c:383
 el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:670

Allocated by task 12570:
 stack_trace_save+0x80/0xb8 kernel/stacktrace.c:121
 kasan_save_stack mm/kasan/common.c:48 [inline]
 kasan_set_track mm/kasan/common.c:56 [inline]
 __kasan_kmalloc+0xdc/0x120 mm/kasan/common.c:461
 kasan_kmalloc+0xc/0x14 mm/kasan/common.c:475
 __kmalloc+0x23c/0x334 mm/slub.c:3970
 kmalloc include/linux/slab.h:557 [inline]
 __io_alloc_async_data+0x68/0x9c fs/io_uring.c:3210
 io_setup_async_rw fs/io_uring.c:3229 [inline]
 io_read fs/io_uring.c:3436 [inline]
 io_issue_sqe+0x2954/0x2d64 fs/io_uring.c:5943
 __io_queue_sqe+0x19c/0x520 fs/io_uring.c:6260
 io_queue_sqe+0x2a4/0x590 fs/io_uring.c:6326
 io_submit_sqe fs/io_uring.c:6395 [inline]
 io_submit_sqes+0x4c0/0xa04 fs/io_uring.c:6624
 __do_sys_io_uring_enter fs/io_uring.c:9013 [inline]
 __se_sys_io_uring_enter fs/io_uring.c:8960 [inline]
 __arm64_sys_io_uring_enter+0x190/0x708 fs/io_uring.c:8960
 __invoke_syscall arch/arm64/kernel/syscall.c:36 [inline]
 invoke_syscall arch/arm64/kernel/syscall.c:48 [inline]
 el0_svc_common arch/arm64/kernel/syscall.c:158 [inline]
 do_el0_svc+0x120/0x290 arch/arm64/kernel/syscall.c:227
 el0_svc+0x1c/0x28 arch/arm64/kernel/entry-common.c:367
 el0_sync_handler+0x98/0x170 arch/arm64/kernel/entry-common.c:383
 el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:670

Freed by task 12570:
 stack_trace_save+0x80/0xb8 kernel/stacktrace.c:121
 kasan_save_stack mm/kasan/common.c:48 [inline]
 kasan_set_track+0x38/0x6c mm/kasan/common.c:56
 kasan_set_free_info+0x20/0x40 mm/kasan/generic.c:355
 __kasan_slab_free+0x124/0x150 mm/kasan/common.c:422
 kasan_slab_free+0x10/0x1c mm/kasan/common.c:431
 slab_free_hook mm/slub.c:1544 [inline]
 slab_free_freelist_hook mm/slub.c:1577 [inline]
 slab_free mm/slub.c:3142 [inline]
 kfree+0x104/0x38c mm/slub.c:4124
 io_dismantle_req fs/io_uring.c:1855 [inline]
 __io_free_req+0x70/0x254 fs/io_uring.c:1867
 io_put_req_find_next fs/io_uring.c:2173 [inline]
 __io_queue_sqe+0x1fc/0x520 fs/io_uring.c:6279
 __io_req_task_submit+0x154/0x21c fs/io_uring.c:2051
 io_req_task_submit+0x2c/0x44 fs/io_uring.c:2063
 task_work_run+0xdc/0x128 kernel/task_work.c:151
 get_signal+0x6f8/0x980 kernel/signal.c:2562
 do_signal+0x108/0x3a4 arch/arm64/kernel/signal.c:658
 do_notify_resume+0xbc/0x25c arch/arm64/kernel/signal.c:722
 work_pending+0xc/0x180

blkdev_read_iter can truncate iov_iter's count since the count + pos may
exceed the size of the blkdev. This will confuse io_read that we have
consume the iovec. And once we do the iov_iter_revert in io_read, we
will trigger the slab-out-of-bounds. Fix it by reexpand the count with
size has been truncated.

blkdev_write_iter can trigger the problem too.

Signed-off-by: yangerkun <yangerkun@huawei.com>
Acked-by: Pavel Begunkov <asml.silencec@gmail.com>
Link: https://lore.kernel.org/r/20210401071807.3328235-1-yangerkun@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-06 09:24:03 -06:00
Thadeu Lima de Souza Cascardo
d1f8280887 io_uring: truncate lengths larger than MAX_RW_COUNT on provide buffers
Read and write operations are capped to MAX_RW_COUNT. Some read ops rely on
that limit, and that is not guaranteed by the IORING_OP_PROVIDE_BUFFERS.

Truncate those lengths when doing io_add_buffers, so buffer addresses still
use the uncapped length.

Also, take the chance and change struct io_buffer len member to __u32, so
it matches struct io_provide_buffer len member.

This fixes CVE-2021-3491, also reported as ZDI-CAN-13546.

Fixes: ddf0322db7 ("io_uring: add IORING_OP_PROVIDE_BUFFERS")
Reported-by: Billy Jheng Bing-Jhong (@st424204)
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-05 15:17:35 -06:00
Linus Torvalds
8404c9fbc8 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "The remainder of the main mm/ queue.

  143 patches.

  Subsystems affected by this patch series (all mm): pagecache, hugetlb,
  userfaultfd, vmscan, compaction, migration, cma, ksm, vmstat, mmap,
  kconfig, util, memory-hotplug, zswap, zsmalloc, highmem, cleanups, and
  kfence"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (143 commits)
  kfence: use power-efficient work queue to run delayed work
  kfence: maximize allocation wait timeout duration
  kfence: await for allocation using wait_event
  kfence: zero guard page after out-of-bounds access
  mm/process_vm_access.c: remove duplicate include
  mm/mempool: minor coding style tweaks
  mm/highmem.c: fix coding style issue
  btrfs: use memzero_page() instead of open coded kmap pattern
  iov_iter: lift memzero_page() to highmem.h
  mm/zsmalloc: use BUG_ON instead of if condition followed by BUG.
  mm/zswap.c: switch from strlcpy to strscpy
  arm64/Kconfig: introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
  x86/Kconfig: introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
  mm,memory_hotplug: add kernel boot option to enable memmap_on_memory
  acpi,memhotplug: enable MHP_MEMMAP_ON_MEMORY when supported
  mm,memory_hotplug: allocate memmap from the added memory range
  mm,memory_hotplug: factor out adjusting present pages into adjust_present_page_count()
  mm,memory_hotplug: relax fully spanned sections check
  drivers/base/memory: introduce memory_block_{online,offline}
  mm/memory_hotplug: remove broken locking of zone PCP structures during hot remove
  ...
2021-05-05 13:50:15 -07:00
Linus Torvalds
a79cdfba68 Additional fixes and clean-ups for NFSD since tags/nfsd-5.13,
including a fix to grant read delegations for files open for
 writing.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmCJz0UACgkQM2qzM29m
 f5einQ//ZqErt5sYcvQw5Onkt+lDHp13XgjIVGo1DrAegrdoTMT+jpUfYSbDLEuC
 B+G2+rUGHpNZ017mzoAmzoeA+pKsdRX+YAy/i8K+7r/cr6T9v78yoX9rx1rbEQEq
 QFJm0fGrFLydzaxRpVq5by7yCKD2DaCQL6DefcXQitfKlfRJ8i/D/vXVBb4FJcmg
 4qRJ7RCcck5gqfInFJ+ZKRjC/9Oj9bNUJz2Ph9mWH1qDDKachgnfWYqrnFQdjYTr
 /Tb+6gyqnRplHU7LmPYSREZqrS3CuvPX0MSXKcFhITj0teaF3b7MArIsSrpw/GGi
 kKrc/K+46COA/Ej0stdGev+Fe3GRlPKUk7UgdD3uWvQrDZ5WdcvN1N7xyCHk90qO
 pOmU3iQuFIBJLaHfwzDaPUJZKMsEO+hsd+liwJjBg6WD4DDLYSQT7jglwYwCxeV4
 ywJi9C3DKaM8kpSBbnMUreHdIIz1d8hNifM4PKgtKGpaXaVlO+rxbkQfZjVAF7Sk
 uRXIegRi+YSJY7RJIhT+NcmmJbyQOEXu9UyUJmqpIzbzmiLF/K2qUk5jPxFLgBpq
 CHmdEIfcoGhA1UqAlynplk5+I5QvhzjxENZJ2Bz8Xwn/uDebKlNhrQeXQP1mQ8dK
 3kJ3RUN/yQxgYCXIQWg/ug51hSZ5Y6c7RzaJeW359V5DbPKBQOU=
 =HB+N
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull more nfsd updates from Chuck Lever:
 "Additional fixes and clean-ups for NFSD since tags/nfsd-5.13,
  including a fix to grant read delegations for files open for writing"

* tag 'nfsd-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  SUNRPC: Fix null pointer dereference in svc_rqst_free()
  SUNRPC: fix ternary sign expansion bug in tracing
  nfsd: Fix fall-through warnings for Clang
  nfsd: grant read delegations to clients holding writes
  nfsd: reshuffle some code
  nfsd: track filehandle aliasing in nfs4_files
  nfsd: hash nfs4_files by inode number
  nfsd: ensure new clients break delegations
  nfsd: removed unused argument in nfsd_startup_generic()
  nfsd: remove unused function
  svcrdma: Pass a useful error code to the send_err tracepoint
  svcrdma: Rename goto labels in svc_rdma_sendto()
  svcrdma: Don't leak send_ctxt on Send errors
2021-05-05 13:44:19 -07:00
Linus Torvalds
7c9e41e0ef 10 CIFS/SMB3 changesets including some important multichannel fixes, as well as support for handle leases (deferred close) and shutdown support
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmCSKIIACgkQiiy9cAdy
 T1F+Hgv+L3NkOFwMvBgGjHP9b+Lkv/YWGKeJkLwQW1xqoHIUHn0/+C5+9ScJGBZc
 WVuzp4pqEIgv4my4UQiyqVwzcmz4BqY2KTDJzBYtqANt6pVp1w6YtC2GplgJE3J2
 qoQh1RwZaqSXfjcoPSRnv5EiSF6DbHlBUhPMd53qOE9pwaf/38i9/M3d9G7EIB8h
 rRNmpGtFuzBHdtGQ2b+4+8ftCIpEBDu/OXcA6QXMUMcvKGaruU39NOxBuW6a/VO5
 9P47Nsof3dlN758uesoQT2VMEc0pcpwAs9BwOkinfXWGUyNqJmbPNvddIOlaP/dv
 vG58n/+JqvWUKgEnrNk5h+wD7wmXpgxpQ523sD5k6bID1hc+vh4lXf+O+iltbYtc
 1ce9ITglSVxA7z4qwFWhtawBy1j1YyvltTAGvhnzdtKZLRk6e5AYIFOUn9O+AMJw
 Eofk4lD0kNTdXyMkveGluRMBXrOzKMdmfw5FW/9hObYgebEGpTQkGyIMpaStraZM
 8hDNAGTk
 =BFOj
 -----END PGP SIGNATURE-----

Merge tag '5.13-rc-smb3-part2' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs updates from Steve French:
 "Ten CIFS/SMB3 changes - including two marked for stable - including
  some important multichannel fixes, as well as support for handle
  leases (deferred close) and shutdown support:

   - some important multichannel fixes

   - support for handle leases (deferred close)

   - shutdown support (which is also helpful since it enables multiple
     xfstests)

   - enable negotiating stronger encryption by default (GCM256)

   - improve wireshark debugging by allowing more options for root to
     dump decryption keys

  SambaXP and the SMB3 Plugfest test event are going on now so I am
  expecting more patches over the next few days due to extra testing
  (including more multichannel fixes)"

* tag '5.13-rc-smb3-part2' of git://git.samba.org/sfrench/cifs-2.6:
  fs/cifs: Fix resource leak
  Cifs: Fix kernel oops caused by deferred close for files.
  cifs: fix regression when mounting shares with prefix paths
  cifs: use echo_interval even when connection not ready.
  cifs: detect dead connections only when echoes are enabled.
  smb3.1.1: allow dumping keys for multiuser mounts
  smb3.1.1: allow dumping GCM256 keys to improve debugging of encrypted shares
  cifs: add shutdown support
  cifs: Deferred close for files
  smb3.1.1: enable negotiating stronger encryption by default
2021-05-05 13:37:07 -07:00
Ira Weiny
d048b9c2a7 btrfs: use memzero_page() instead of open coded kmap pattern
There are many places where kmap/memset/kunmap patterns occur.

Use the newly lifted memzero_page() to eliminate direct uses of kmap and
leverage the new core functions use of kmap_local_page().

The development of this patch was aided by the following coccinelle
script:

// <smpl>
// SPDX-License-Identifier: GPL-2.0-only
// Find kmap/memset/kunmap pattern and replace with memset*page calls
//
// NOTE: Offsets and other expressions may be more complex than what the script
// will automatically generate.  Therefore a catchall rule is provided to find
// the pattern which then must be evaluated by hand.
//
// Confidence: Low
// Copyright: (C) 2021 Intel Corporation
// URL: http://coccinelle.lip6.fr/
// Comments:
// Options:

//
// Then the memset pattern
//
@ memset_rule1 @
expression page, V, L, Off;
identifier ptr;
type VP;
@@

(
-VP ptr = kmap(page);
|
-ptr = kmap(page);
|
-VP ptr = kmap_atomic(page);
|
-ptr = kmap_atomic(page);
)
<+...
(
-memset(ptr, 0, L);
+memzero_page(page, 0, L);
|
-memset(ptr + Off, 0, L);
+memzero_page(page, Off, L);
|
-memset(ptr, V, L);
+memset_page(page, V, 0, L);
|
-memset(ptr + Off, V, L);
+memset_page(page, V, Off, L);
)
...+>
(
-kunmap(page);
|
-kunmap_atomic(ptr);
)

// Remove any pointers left unused
@
depends on memset_rule1
@
identifier memset_rule1.ptr;
type VP, VP1;
@@

-VP ptr;
	... when != ptr;
? VP1 ptr;

//
// Catch all
//
@ memset_rule2 @
expression page;
identifier ptr;
expression GenTo, GenSize, GenValue;
type VP;
@@

(
-VP ptr = kmap(page);
|
-ptr = kmap(page);
|
-VP ptr = kmap_atomic(page);
|
-ptr = kmap_atomic(page);
)
<+...
(
//
// Some call sites have complex expressions within the memset/memcpy
// The follow are catch alls which need to be evaluated by hand.
//
-memset(GenTo, 0, GenSize);
+memzero_pageExtra(page, GenTo, GenSize);
|
-memset(GenTo, GenValue, GenSize);
+memset_pageExtra(page, GenValue, GenTo, GenSize);
)
...+>
(
-kunmap(page);
|
-kunmap_atomic(ptr);
)

// Remove any pointers left unused
@
depends on memset_rule2
@
identifier memset_rule2.ptr;
type VP, VP1;
@@

-VP ptr;
	... when != ptr;
? VP1 ptr;

// </smpl>

Link: https://lkml.kernel.org/r/20210309212137.2610186-4-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:27 -07:00
Anshuman Khandual
855f9a8e87 mm: generalize SYS_SUPPORTS_HUGETLBFS (rename as ARCH_SUPPORTS_HUGETLBFS)
SYS_SUPPORTS_HUGETLBFS config has duplicate definitions on platforms
that subscribe it.  Instead, just make it a generic option which can be
selected on applicable platforms.

Also rename it as ARCH_SUPPORTS_HUGETLBFS instead.  This reduces code
duplication and makes it cleaner.

Link: https://lkml.kernel.org/r/1617259448-22529-3-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>	[riscv]
Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:25 -07:00
Minchan Kim
8cc621d2f4 mm: fs: invalidate BH LRU during page migration
Pages containing buffer_heads that are in one of the per-CPU buffer_head
LRU caches will be pinned and thus cannot be migrated.  This can prevent
CMA allocations from succeeding, which are often used on platforms with
co-processors (such as a DSP) that can only use physically contiguous
memory.  It can also prevent memory hot-unplugging from succeeding,
which involves migrating at least MIN_MEMORY_BLOCK_SIZE bytes of memory,
which ranges from 8 MiB to 1 GiB based on the architecture in use.

Correspondingly, invalidate the BH LRU caches before a migration starts
and stop any buffer_head from being cached in the LRU caches, until
migration has finished.

Link: https://lkml.kernel.org/r/20210319175127.886124-3-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
Reported-by: Laura Abbott <labbott@kernel.org>
Tested-by: Oliver Sang <oliver.sang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: John Dias <joaodias@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:24 -07:00
Axel Rasmussen
f619147104 userfaultfd: add UFFDIO_CONTINUE ioctl
This ioctl is how userspace ought to resolve "minor" userfaults.  The
idea is, userspace is notified that a minor fault has occurred.  It
might change the contents of the page using its second non-UFFD mapping,
or not.  Then, it calls UFFDIO_CONTINUE to tell the kernel "I have
ensured the page contents are correct, carry on setting up the mapping".

Note that it doesn't make much sense to use UFFDIO_{COPY,ZEROPAGE} for
MINOR registered VMAs.  ZEROPAGE maps the VMA to the zero page; but in
the minor fault case, we already have some pre-existing underlying page.
Likewise, UFFDIO_COPY isn't useful if we have a second non-UFFD mapping.
We'd just use memcpy() or similar instead.

It turns out hugetlb_mcopy_atomic_pte() already does very close to what
we want, if an existing page is provided via `struct page **pagep`.  We
already special-case the behavior a bit for the UFFDIO_ZEROPAGE case, so
just extend that design: add an enum for the three modes of operation,
and make the small adjustments needed for the MCOPY_ATOMIC_CONTINUE
case.  (Basically, look up the existing page, and avoid adding the
existing page to the page cache or calling set_page_huge_active() on
it.)

Link: https://lkml.kernel.org/r/20210301222728.176417-5-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Price <steven.price@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:22 -07:00
Axel Rasmussen
7677f7fd8b userfaultfd: add minor fault registration mode
Patch series "userfaultfd: add minor fault handling", v9.

Overview
========

This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
When enabled (via the UFFDIO_API ioctl), this feature means that any
hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
get events for "minor" faults.  By "minor" fault, I mean the following
situation:

Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
memory).  One of the mappings is registered with userfaultfd (in minor
mode), and the other is not.  Via the non-UFFD mapping, the underlying
pages have already been allocated & filled with some contents.  The UFFD
mapping has not yet been faulted in; when it is touched for the first
time, this results in what I'm calling a "minor" fault.  As a concrete
example, when working with hugetlbfs, we have huge_pte_none(), but
find_lock_page() finds an existing page.

We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE.  The idea
is, userspace resolves the fault by either a) doing nothing if the
contents are already correct, or b) updating the underlying contents using
the second, non-UFFD mapping (via memcpy/memset or similar, or something
fancier like RDMA, or etc...).  In either case, userspace issues
UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
correct, carry on setting up the mapping".

Use Case
========

Consider the use case of VM live migration (e.g. under QEMU/KVM):

1. While a VM is still running, we copy the contents of its memory to a
   target machine. The pages are populated on the target by writing to the
   non-UFFD mapping, using the setup described above. The VM is still running
   (and therefore its memory is likely changing), so this may be repeated
   several times, until we decide the target is "up to date enough".

2. We pause the VM on the source, and start executing on the target machine.
   During this gap, the VM's user(s) will *see* a pause, so it is desirable to
   minimize this window.

3. Between the last time any page was copied from the source to the target, and
   when the VM was paused, the contents of that page may have changed - and
   therefore the copy we have on the target machine is out of date. Although we
   can keep track of which pages are out of date, for VMs with large amounts of
   memory, it is "slow" to transfer this information to the target machine. We
   want to resume execution before such a transfer would complete.

4. So, the guest begins executing on the target machine. The first time it
   touches its memory (via the UFFD-registered mapping), userspace wants to
   intercept this fault. Userspace checks whether or not the page is up to date,
   and if not, copies the updated page from the source machine, via the non-UFFD
   mapping. Finally, whether a copy was performed or not, userspace issues a
   UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
   are correct, carry on setting up the mapping".

We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.

Interaction with Existing APIs
==============================

Because this is a feature, a registered VMA could potentially receive both
missing and minor faults.  I spent some time thinking through how the
existing API interacts with the new feature:

UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page.  If UFFDIO_CONTINUE is used on a non-minor fault:

- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.

UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
Without modifications, the existing codepath assumes a new page needs to
be allocated.  This is okay, since userspace must have a second
non-UFFD-registered mapping anyway, thus there isn't much reason to want
to use these in any case (just memcpy or memset or similar).

- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
  in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
  -ENOENT in that case (regardless of the kind of fault).

Future Work
===========

This series only supports hugetlbfs.  I have a second series in flight to
support shmem as well, extending the functionality.  This series is more
mature than the shmem support at this point, and the functionality works
fully on hugetlbfs, so this series can be merged first and then shmem
support will follow.

This patch (of 6):

This feature allows userspace to intercept "minor" faults.  By "minor"
faults, I mean the following situation:

Let there exist two mappings (i.e., VMAs) to the same page(s).  One of the
mappings is registered with userfaultfd (in minor mode), and the other is
not.  Via the non-UFFD mapping, the underlying pages have already been
allocated & filled with some contents.  The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what
I'm calling a "minor" fault.  As a concrete example, when working with
hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
page.

This commit adds the new registration mode, and sets the relevant flag on
the VMAs being registered.  In the hugetlb fault path, if we find that we
have huge_pte_none(), but find_lock_page() does indeed find an existing
page, then we have a "minor" fault, and if the VMA has the userfaultfd
registration flag, we call into userfaultfd to handle it.

This is implemented as a new registration mode, instead of an API feature.
This is because the alternative implementation has significant drawbacks
[1].

However, doing it this was requires we allocate a VM_* flag for the new
registration mode.  On 32-bit systems, there are no unused bits, so this
feature is only supported on architectures with
CONFIG_ARCH_USES_HIGH_VMA_FLAGS.  When attempting to register a VMA in
MINOR mode on 32-bit architectures, we return -EINVAL.

[1] https://lore.kernel.org/patchwork/patch/1380226/

[peterx@redhat.com: fix minor fault page leak]
  Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com

Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:22 -07:00
Miaohe Lin
15b8365363 mm/hugetlb: remove unused variable pseudo_vma in remove_inode_hugepages()
The local variable pseudo_vma is not used anymore.

Link: https://lkml.kernel.org/r/20210410072348.20437-6-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Feilong Lin <linfeilong@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:21 -07:00
Miaohe Lin
d4241a049a mm/hugetlb: avoid calculating fault_mutex_hash in truncate_op case
The fault_mutex hashing overhead can be avoided in truncate_op case
because page faults can not race with truncation in this routine.

So calculate hash for fault_mutex only in !truncate_op case to save some
cpu cycles.

Link: https://lkml.kernel.org/r/20210308112809.26107-6-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:20 -07:00
Miaohe Lin
04adbc3f7b mm/hugetlb: use some helper functions to cleanup code
Patch series "Some cleanups for hugetlb".

This series contains cleanups to remove unnecessary VM_BUG_ON_PAGE, use
helper function and so on.  I also collect some previous patches into this
series in case they are forgotten.

This patch (of 5):

We could use pages_per_huge_page to get the number of pages per hugepage,
use get_hstate_idx to calculate hstate index, and use hstate_is_gigantic
to check if a hstate is gigantic to make code more succinct.

Link: https://lkml.kernel.org/r/20210308112809.26107-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20210308112809.26107-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:20 -07:00
Peter Xu
6dfeaff93b hugetlb/userfaultfd: unshare all pmds for hugetlbfs when register wp
Huge pmd sharing for hugetlbfs is racy with userfaultfd-wp because
userfaultfd-wp is always based on pgtable entries, so they cannot be
shared.

Walk the hugetlb range and unshare all such mappings if there is, right
before UFFDIO_REGISTER will succeed and return to userspace.

This will pair with want_pmd_share() in hugetlb code so that huge pmd
sharing is completely disabled for userfaultfd-wp registered range.

Link: https://lkml.kernel.org/r/20210218231206.15524-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Price <steven.price@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:20 -07:00
Hugh Dickins
786b31121a mm: remove nrexceptional from inode: remove BUG_ON
clear_inode()'s BUG_ON(!mapping_empty(&inode->i_data)) is unsafe: we
know of two ways in which nodes can and do (on rare occasions) get left
behind.  Until those are fixed, do not BUG_ON() nor even WARN_ON().

Yes, this will then leak those nodes (or the next user of the struct
inode may use them); but this has been happening for years, and the new
BUG_ON(!mapping_empty) was only guilty of revealing that.  A proper fix
will follow, but no hurry.

Link: https://lkml.kernel.org/r/alpine.LSU.2.11.2104292229380.16080@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:20 -07:00
Matthew Wilcox (Oracle)
8bc3c481b3 mm: remove nrexceptional from inode
We no longer track anything in nrexceptional, so remove it, saving 8 bytes
per inode.

Link: https://lkml.kernel.org/r/20201026151849.24232-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Vishal Verma <vishal.l.verma@intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:20 -07:00
Matthew Wilcox (Oracle)
7f0e07fb02 dax: account DAX entries as nrpages
Simplify mapping_needs_writeback() by accounting DAX entries as pages
instead of exceptional entries.

Link: https://lkml.kernel.org/r/20201026151849.24232-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Vishal Verma <vishal.l.verma@intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:19 -07:00
Matthew Wilcox (Oracle)
7716506ada mm: introduce and use mapping_empty()
Patch series "Remove nrexceptional tracking", v2.

We actually use nrexceptional for very little these days.  It's a minor
pain to keep in sync with nrpages, but the pain becomes much bigger with
the THP patches because we don't know how many indices a shadow entry
occupies.  It's easier to just remove it than keep it accurate.

Also, we save 8 bytes per inode which is nothing to sneeze at; on my
laptop, it would improve shmem_inode_cache from 22 to 23 objects per
16kB, and inode_cache from 26 to 27 objects.  Combined, that saves
a megabyte of memory from a combined usage of 25MB for both caches.
Unfortunately, ext4 doesn't cross a magic boundary, so it doesn't save
any memory for ext4.

This patch (of 4):

Instead of checking the two counters (nrpages and nrexceptional), we can
just check whether i_pages is empty.

Link: https://lkml.kernel.org/r/20201026151849.24232-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20201026151849.24232-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Vishal Verma <vishal.l.verma@intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:19 -07:00
Linus Torvalds
51f629446c This pull request contains changes for JFFS2, UBI and UBIFS
JFFS2:
 - Use splice_write()
 - Fix for a slab-out-of-bounds bug
 
 UBI:
 - Fix for clang related warnings
 - Code cleanup
 
 UBIFS:
 - Fix for inode rebirth at replay
 - Set s_uuid
 - Use zstd for default filesystem
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAmCRqNcWHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wcGbEACLv9Qm7N5z3CAG0aGC2LagEx4c
 u6fiL3EQx+wI82NKMG/KRdGeR3J18JA4xoM9FJ1ZLil9aUSvUjtLBq+wBRiKHnDn
 Oj4rgRVgwAMc1UdE00UAttCnorKXlsVWxg2+iD1CEZkkKbDN3dk7BkgtwJCjUmCy
 iD2oM02S0csLu0Uk400BaNFwOHG3AFVOU0xXeqyO2MgXcClEwfXNH8qmDQpZPmwy
 b42jd0FXxfWBnzPmrURZ163Yt/iXlNLFaHkMdJgJ23A+4eNWk5+BCjxMzj08u+mn
 XLIBGEegC/rHPYnBGqb/AhsQwoGxTqRTA0gmhkMX4j0goWpj7bO4+fwBu7DsOv/S
 kC09sobx+E8xIgoMPXwcr4twWBcbTctclZgoBRNa0BUILTVa6ifwdRxt0LlNmLLr
 wwZpRuekSBnKTEyD2c2rKnuCMZlg7f7dz6OXa+/HINy0YFvr2yFs0uF4RXpwaARz
 yR0iC80hrhGDrUA1+A2uIUZdNOI4ymR//9Of/9Qj+3o41Kl/0FpLPg8pH+CCLVTG
 iw2D3qJeX2N3VLdhb1ynQ5mLkzc3NWqEjf9bR1Aif3ti6FA5duueOctHhZTsC6Uq
 uzGQPXCpRqLbF1LoPtcmajUrgvl8tWIu9lVce5ncMA5D0q4VVc8t2yPFNdsBRyXl
 t5siLiHUdWIDvWQFmQ==
 =IwH7
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs

Pull JFFS2, UBI and UBIFS updates from Richard Weinberger:
 "JFFS2:
   - Use splice_write()
   - Fix for a slab-out-of-bounds bug

  UBI:
   - Fix for clang related warnings
   - Code cleanup

  UBIFS:
   - Fix for inode rebirth at replay
   - Set s_uuid
   - Use zstd for default filesystem"

* tag 'for-linus-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
  ubi: Remove unnecessary struct declaration
  jffs2: Hook up splice_write callback
  jffs2: avoid Wempty-body warnings
  jffs2: Fix kasan slab-out-of-bounds problem
  ubi: Fix fall-through warnings for Clang
  ubifs: Report max LEB count at mount time
  ubifs: Set s_uuid in super block to support ima/evm uuid options
  ubifs: Default to zstd compression
  ubifs: Only check replay with inode type to judge if inode linked
2021-05-04 18:08:40 -07:00
Linus Torvalds
d0195c7d7a f2fs-for-5.13-rc1
In this round, we added a new mount option, "checkpoint_merge", which introduces
 a kernel thread dealing with the f2fs checkpoints. Once we start to manage the
 IO priority along with blk-cgroup, the checkpoint operation can be processed in
 a lower priority under the process context. Since the checkpoint holds all the
 filesystem operations, we give a higher priority to the checkpoint thread all
 the time.
 
 Enhancement:
 - introduce gc_merge mount option to introduce a checkpoint thread
 - improve to run discard thread efficiently
 - allow modular compression algorithms
 - expose # of overprivision segments to sysfs
 - expose runtime compression stat to sysfs
 
 Bug fix:
 - fix OOB memory access by the node id lookup
 - avoid touching checkpointed data in the checkpoint-disabled mode
 - fix the resizing flow to avoid kernel panic and race conditions
 - fix block allocation issues on pinned files
 - address some swapfile issues
 - fix hugtask problem and kernel panic during atomic write operations
 - don't start checkpoint thread in RO
 
 And, we've cleaned up some kernel coding style and build warnings. In addition,
 we fixed some minor race conditions and error handling routines.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmCQhVIACgkQQBSofoJI
 UNIggA/8DZINzFLMCj6+6P5wNAWj3nYtx/FnwZ7C31f8qkiZjgA4LfONUnDvV7sU
 GS8MuLQz4eTYfqU2rVgGiSm+aCkEOovnk7C7Huo7pezgqYb+5J6ACXsqdU3dcD5M
 kShJMqLKcTKqtOMbnJrdGvmw1/ysuAi7UhSSgVV+9NQxlhxADnagOGbQ7lXNSV3R
 spGMWazGY2uA5DFCCa4lMX79lyFATCzEKB3SKAW5r+8QSmxJY8ViK2Er7AnvwRJz
 XJ/QJ8ALNb/GGyHzBWFv3P6Yxo/G3FkUvTIc5Rhi9P2lUgjI2NALokj7AOnfNh4a
 uSXHVlNrrfH+gpx9xr5z8MUmroCYCCOJ6EhnVweqViRmekY8jSb2HxmUtDTIf19U
 LWl3gtD2GDQx6CY0a0K58Oa2Lp0Bp9MWUdPA/4P21EymZwXum7aCkhV+DnigcoCj
 yCmKlI8nIpCS97dIO/7MsnG6Tu/7c+Prytd2ezUo+6hlkXPZk8shs+elnjlWu/6V
 3ZpSWKzQsPJL8U7eB9H04AxEokrrXm3fRhR86C7JdkEe0gyGFf3dB/G+jqzcNi5m
 ZpxiCeZ8RbNmPpnH8NWeYHk9uDKMOXuUPFYDoaOwImNWfqj0jfhiTxHo4MyBLAuk
 MT632ICcuJLvwgnSbMAI3U7v6+dZXKH4y6U7IHFjxhI7beMzxmI=
 =P5uk
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we added a new mount option, "checkpoint_merge", which
  introduces a kernel thread dealing with the f2fs checkpoints. Once we
  start to manage the IO priority along with blk-cgroup, the checkpoint
  operation can be processed in a lower priority under the process
  context. Since the checkpoint holds all the filesystem operations, we
  give a higher priority to the checkpoint thread all the time.

  Enhancements:
   - introduce gc_merge mount option to introduce a checkpoint thread
   - improve to run discard thread efficiently
   - allow modular compression algorithms
   - expose # of overprivision segments to sysfs
   - expose runtime compression stat to sysfs

  Bug fixes:
   - fix OOB memory access by the node id lookup
   - avoid touching checkpointed data in the checkpoint-disabled mode
   - fix the resizing flow to avoid kernel panic and race conditions
   - fix block allocation issues on pinned files
   - address some swapfile issues
   - fix hugtask problem and kernel panic during atomic write operations
   - don't start checkpoint thread in RO

  And, we've cleaned up some kernel coding style and build warnings. In
  addition, we fixed some minor race conditions and error handling
  routines"

* tag 'f2fs-for-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (48 commits)
  f2fs: drop inplace IO if fs status is abnormal
  f2fs: compress: remove unneed check condition
  f2fs: clean up left deprecated IO trace codes
  f2fs: avoid using native allocate_segment_by_default()
  f2fs: remove unnecessary struct declaration
  f2fs: fix to avoid NULL pointer dereference
  f2fs: avoid duplicated codes for cleanup
  f2fs: document: add description about compressed space handling
  f2fs: clean up build warnings
  f2fs: fix the periodic wakeups of discard thread
  f2fs: fix to avoid accessing invalid fio in f2fs_allocate_data_block()
  f2fs: fix to avoid GC/mmap race with f2fs_truncate()
  f2fs: set checkpoint_merge by default
  f2fs: Fix a hungtask problem in atomic write
  f2fs: fix to restrict mount condition on readonly block device
  f2fs: introduce gc_merge mount option
  f2fs: fix to cover __allocate_new_section() with curseg_lock
  f2fs: fix wrong alloc_type in f2fs_do_replace_block
  f2fs: delete empty compress.h
  f2fs: fix a typo in inode.c
  ...
2021-05-04 18:03:38 -07:00
Linus Torvalds
51e6f07cb1 M68knommu fixes include:
. fix interrupt range check for ColdFire SIMR interrupt controller
 . add support for gapless sections flat format binary (needed by RISC-V)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEmsfM6tQwfNjBOxr3TiQVqaG9L4AFAmCQjRUACgkQTiQVqaG9
 L4ADMg//d2y2BjROaMQdV4ykd06/GrulZNRWvVSdDDmR8O5NU3z5zhmyeVuB0mHA
 OYn/wbWzuqmD7JVpJijytTUFsvtpsMkqRdL56xApFYfJ5RK9VEs34aonBko9C0Fp
 d7e7Wc++HdNEhUh2klFn7w4cMsOGAj9WHyC8h3bGOqXroYobFv+Zd8l+xlNl152o
 Hiqh5VL8+cBe1Bo+daVarISofc1O4DkXlQrQVQAwB+H2fM0n1F7YXPW2/kMOc76W
 cAd9w5otz+ACWhX5CnScmEFHUE5QM4yutllb8w6sVpFZly1p/9pBZna+BT5t22JJ
 Uv65nGIqLyVP9nnCSyeH3BPr+KUe50rX4Squb9qJpikowjLesghX+voivG1EkgkG
 A/3WUX/kXG/zBo96taedSPo6ZLmuW1wcyY2NrIPll78Q80W4VBI1a2pkA2vm0/1Z
 0VoxAYCYO+tGhzS4xDnvNl1+VQb7NEwrzySdh+TFh0hvhv3CGOnPsiT/kZ9CSgjI
 IkuvTXO/XTf4TRjKtwOGkZKrIrPqpVCt565h6i8s1h7qZf9vK3tIQPKpDQ04Dth5
 iE9ncQd7KYTCORsYWE0XrD5BztbwpbMqtD1usX2NXZWPqY4ciLucCPnR1i21qQZ5
 V1ld8KKXdu5gvQX50dGsoQXlggeYgDjvsXW1dJNMRzprHiNq7Bc=
 =ZbEq
 -----END PGP SIGNATURE-----

Merge tag 'm68knommu-for-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu

Pull m68knommu updates from Greg Ungerer:

 - a fix for interrupt number range checking for the ColdFire SIMR
   interrupt controller.

 - changes for the binfmt_flat binary loader to allow RISC-V nommu
   support it needs to be able to accept flat binaries that have no gap
   between the text and data sections.

* tag 'm68knommu-for-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
  m68k: coldfire: fix irq ranges
  riscv: Disable data start offset in flat binaries
  binfmt_flat: allow not offsetting data start
2021-05-04 10:48:05 -07:00
Khaled ROMDHANI
bae4c0c1c2 fs/cifs: Fix resource leak
The -EIO error return path is leaking memory allocated
to page. Fix this by moving the allocation block after
the check of cifs_forced_shutdown.

Addresses-Coverity: ("Resource leak")
Fixes: 087f757b01 ("cifs: add shutdown support")
Signed-off-by: Khaled ROMDHANI <khaledromdhani216@gmail.com>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-05-04 11:53:15 -05:00
Rohith Surabattula
78c09634f7 Cifs: Fix kernel oops caused by deferred close for files.
Fix regression issue caused by deferred close for files.

Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-05-04 11:53:15 -05:00
Paulo Alcantara
5c1acf3fe0 cifs: fix regression when mounting shares with prefix paths
The commit 315db9a05b ("cifs: fix leak in cifs_smb3_do_mount() ctx")
revealed an existing bug when mounting shares that contain a prefix
path or DFS links.

cifs_setup_volume_info() requires the @devname to contain the full
path (UNC + prefix) to update the fs context with the new UNC and
prepath values, however we were passing only the UNC
path (old_ctx->UNC) in @device thus discarding any prefix paths.

Instead of concatenating both old_ctx->{UNC,prepath} and pass it in
@devname, just keep the dup'ed values of UNC and prepath in
cifs_sb->ctx after calling smb3_fs_context_dup(), and fix
smb3_parse_devname() to correctly parse and not leak the new UNC and
prefix paths.

Cc: <stable@vger.kernel.org> # v5.11+
Fixes: 315db9a05b ("cifs: fix leak in cifs_smb3_do_mount() ctx")
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Acked-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-05-04 11:52:56 -05:00
Tom Rix
77364faf21 btrfs: initialize return variable in cleanup_free_space_cache_v1
Static analysis reports this problem

  free-space-cache.c:3965:2: warning: Undefined or garbage value returned
    return ret;
    ^~~~~~~~~~

ret is set in the node handling loop.  Treat doing nothing as a success
and initialize ret to 0, although it's unlikely the loop would be
skipped. We always have block groups, but as it could lead to
transaction abort in the caller it's better to be safe.

CC: stable@vger.kernel.org # 5.12+
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-04 18:05:15 +02:00
Brian Foster
6e552494fb iomap: remove unused private field from ioend
The only remaining user of ->io_private is the generic ioend merging
infrastructure. The only user of that is XFS, which no longer sets
->io_private or passes an associated merge callback. Remove the
unused parameter and the ->io_private field.

CC: linux-fsdevel@vger.kernel.org
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-05-04 08:54:29 -07:00
Darrick J. Wong
8e9800f9f2 xfs: don't allow log writes if the data device is readonly
While running generic/050 with an external log, I observed this warning
in dmesg:

Trying to write to read-only block-device sda4 (partno 4)
WARNING: CPU: 2 PID: 215677 at block/blk-core.c:704 submit_bio_checks+0x256/0x510
Call Trace:
 submit_bio_noacct+0x2c/0x430
 _xfs_buf_ioapply+0x283/0x3c0 [xfs]
 __xfs_buf_submit+0x6a/0x210 [xfs]
 xfs_buf_delwri_submit_buffers+0xf8/0x270 [xfs]
 xfsaild+0x2db/0xc50 [xfs]
 kthread+0x14b/0x170

I think this happened because we tried to cover the log after a readonly
mount, and the AIL tried to write the primary superblock to the data
device.  The test marks the data device readonly, but it doesn't do the
same to the external log device.  Therefore, XFS thinks that the log is
writable, even though AIL writes whine to dmesg because the data device
is read only.

Fix this by amending xfs_log_writable to prevent writes when the AIL
can't possible write anything into the filesystem.

Note: As for the external log or the rt devices being readonly--
xfs_blkdev_get will complain about that if we aren't doing a norecovery
mount.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-05-04 08:43:27 -07:00
Naohiro Aota
784daf2b96 btrfs: zoned: sanity check zone type
The fstests test case generic/475 creates a dm-linear device that gets
changed to a dm-error device. This leads to errors in loading the block
group's zone information when running on a zoned file system, ultimately
resulting in a list corruption. When running on a kernel with list
debugging enabled this leads to the following crash.

 BTRFS: error (device dm-2) in cleanup_transaction:1953: errno=-5 IO failure
 kernel BUG at lib/list_debug.c:54!
 invalid opcode: 0000 [#1] SMP PTI
 CPU: 1 PID: 2433 Comm: umount Tainted: G        W         5.12.0+ #1018
 RIP: 0010:__list_del_entry_valid.cold+0x1d/0x47
 RSP: 0018:ffffc90001473df0 EFLAGS: 00010296
 RAX: 0000000000000054 RBX: ffff8881038fd000 RCX: ffffc90001473c90
 RDX: 0000000100001a31 RSI: 0000000000000003 RDI: 0000000000000003
 RBP: ffff888308871108 R08: 0000000000000003 R09: 0000000000000001
 R10: 3961373532383838 R11: 6666666620736177 R12: ffff888308871000
 R13: ffff8881038fd088 R14: ffff8881038fdc78 R15: dead000000000100
 FS:  00007f353c9b1540(0000) GS:ffff888627d00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f353cc2c710 CR3: 000000018e13c000 CR4: 00000000000006a0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  btrfs_free_block_groups+0xc9/0x310 [btrfs]
  close_ctree+0x2ee/0x31a [btrfs]
  ? call_rcu+0x8f/0x270
  ? mutex_lock+0x1c/0x40
  generic_shutdown_super+0x67/0x100
  kill_anon_super+0x14/0x30
  btrfs_kill_super+0x12/0x20 [btrfs]
  deactivate_locked_super+0x31/0x90
  cleanup_mnt+0x13e/0x1b0
  task_work_run+0x63/0xb0
  exit_to_user_mode_loop+0xd9/0xe0
  exit_to_user_mode_prepare+0x3e/0x60
  syscall_exit_to_user_mode+0x1d/0x50
  entry_SYSCALL_64_after_hwframe+0x44/0xae

As dm-error has no support for zones, btrfs will run it's zone emulation
mode on this device. The zone emulation mode emulates conventional zones,
so bail out if the zone bitmap that gets populated on mount sees the zone
as sequential while we're thinking it's a conventional zone when creating
a block group.

Note: this scenario is unlikely in a real wold application and can only
happen by this (ab)use of device-mapper targets.

CC: stable@vger.kernel.org # 5.12+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-04 14:41:43 +02:00
Anand Jain
5e753a817b btrfs: fix unmountable seed device after fstrim
The following test case reproduces an issue of wrongly freeing in-use
blocks on the readonly seed device when fstrim is called on the rw sprout
device. As shown below.

Create a seed device and add a sprout device to it:

  $ mkfs.btrfs -fq -dsingle -msingle /dev/loop0
  $ btrfstune -S 1 /dev/loop0
  $ mount /dev/loop0 /btrfs
  $ btrfs dev add -f /dev/loop1 /btrfs
  BTRFS info (device loop0): relocating block group 290455552 flags system
  BTRFS info (device loop0): relocating block group 1048576 flags system
  BTRFS info (device loop0): disk added /dev/loop1
  $ umount /btrfs

Mount the sprout device and run fstrim:

  $ mount /dev/loop1 /btrfs
  $ fstrim /btrfs
  $ umount /btrfs

Now try to mount the seed device, and it fails:

  $ mount /dev/loop0 /btrfs
  mount: /btrfs: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error.

Block 5292032 is missing on the readonly seed device:

 $ dmesg -kt | tail
 <snip>
 BTRFS error (device loop0): bad tree block start, want 5292032 have 0
 BTRFS warning (device loop0): couldn't read-tree root
 BTRFS error (device loop0): open_ctree failed

From the dump-tree of the seed device (taken before the fstrim). Block
5292032 belonged to the block group starting at 5242880:

  $ btrfs inspect dump-tree -e /dev/loop0 | grep -A1 BLOCK_GROUP
  <snip>
  item 3 key (5242880 BLOCK_GROUP_ITEM 8388608) itemoff 16169 itemsize 24
  	block group used 114688 chunk_objectid 256 flags METADATA
  <snip>

From the dump-tree of the sprout device (taken before the fstrim).
fstrim used block-group 5242880 to find the related free space to free:

  $ btrfs inspect dump-tree -e /dev/loop1 | grep -A1 BLOCK_GROUP
  <snip>
  item 1 key (5242880 BLOCK_GROUP_ITEM 8388608) itemoff 16226 itemsize 24
  	block group used 32768 chunk_objectid 256 flags METADATA
  <snip>

BPF kernel tracing the fstrim command finds the missing block 5292032
within the range of the discarded blocks as below:

  kprobe:btrfs_discard_extent {
  	printf("freeing start %llu end %llu num_bytes %llu:\n",
  		arg1, arg1+arg2, arg2);
  }

  freeing start 5259264 end 5406720 num_bytes 147456
  <snip>

Fix this by avoiding the discard command to the readonly seed device.

Reported-by: Chris Murphy <lists@colorremedies.com>
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-04 14:41:37 +02:00
Linus Torvalds
9b1f61d5d7 tracing updates for 5.13
New feature:
 
  The "func-no-repeats" option in tracefs/options directory. When set
  the function tracer will detect if the current function being traced
  is the same as the previous one, and instead of recording it, it will
  keep track of the number of times that the function is repeated in a row.
  And when another function is recorded, it will write a new event that
  shows the function that repeated, the number of times it repeated and
  the time stamp of when the last repeated function occurred.
 
 Enhancements:
 
  In order to implement the above "func-no-repeats" option, the ring
  buffer timestamp can now give the accurate timestamp of the event
  as it is being recorded, instead of having to record an absolute
  timestamp for all events. This helps the histogram code which no longer
  needs to waste ring buffer space.
 
  New validation logic to make sure all trace events that access
  dereferenced pointers do so in a safe way, and will warn otherwise.
 
 Fixes:
 
  No longer limit the PIDs of tasks that are recorded for "saved_cmdlines"
  to PID_MAX_DEFAULT (32768), as systemd now allows for a much larger
  range. This caused the mapping of PIDs to the task names to be dropped
  for all tasks with a PID greater than 32768.
 
  Change trace_clock_global() to never block. This caused a deadlock.
 
 Clean ups:
 
  Typos, prototype fixes, and removing of duplicate or unused code.
 
  Better management of ftrace_page allocations.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYI/1vBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qiL0AP9EemIC5TDh2oihqLRNeUjdTu0ryEoM
 HRFqxozSF985twD/bfkt86KQC8rLHwxTbxQZ863bmdaC6cMGFhWiF+H/MAs=
 =psYt
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
 "New feature:

   - A new "func-no-repeats" option in tracefs/options directory.

     When set the function tracer will detect if the current function
     being traced is the same as the previous one, and instead of
     recording it, it will keep track of the number of times that the
     function is repeated in a row. And when another function is
     recorded, it will write a new event that shows the function that
     repeated, the number of times it repeated and the time stamp of
     when the last repeated function occurred.

  Enhancements:

   - In order to implement the above "func-no-repeats" option, the ring
     buffer timestamp can now give the accurate timestamp of the event
     as it is being recorded, instead of having to record an absolute
     timestamp for all events. This helps the histogram code which no
     longer needs to waste ring buffer space.

   - New validation logic to make sure all trace events that access
     dereferenced pointers do so in a safe way, and will warn otherwise.

  Fixes:

   - No longer limit the PIDs of tasks that are recorded for
     "saved_cmdlines" to PID_MAX_DEFAULT (32768), as systemd now allows
     for a much larger range. This caused the mapping of PIDs to the
     task names to be dropped for all tasks with a PID greater than
     32768.

   - Change trace_clock_global() to never block. This caused a deadlock.

  Clean ups:

   - Typos, prototype fixes, and removing of duplicate or unused code.

   - Better management of ftrace_page allocations"

* tag 'trace-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (32 commits)
  tracing: Restructure trace_clock_global() to never block
  tracing: Map all PIDs to command lines
  ftrace: Reuse the output of the function tracer for func_repeats
  tracing: Add "func_no_repeats" option for function tracing
  tracing: Unify the logic for function tracing options
  tracing: Add method for recording "func_repeats" events
  tracing: Add "last_func_repeats" to struct trace_array
  tracing: Define new ftrace event "func_repeats"
  tracing: Define static void trace_print_time()
  ftrace: Simplify the calculation of page number for ftrace_page->records some more
  ftrace: Store the order of pages allocated in ftrace_page
  tracing: Remove unused argument from "ring_buffer_time_stamp()
  tracing: Remove duplicate struct declaration in trace_events.h
  tracing: Update create_system_filter() kernel-doc comment
  tracing: A minor cleanup for create_system_filter()
  kernel: trace: Mundane typo fixes in the file trace_events_filter.c
  tracing: Fix various typos in comments
  scripts/recordmcount.pl: Make vim and emacs indent the same
  scripts/recordmcount.pl: Make indent spacing consistent
  tracing: Add a verifier to check string pointers for trace events
  ...
2021-05-03 11:19:54 -07:00
Linus Torvalds
23806a3e96 Merge branch 'work.file' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull receive_fd update from Al Viro:
 "Cleanup of receive_fd mess"

* 'work.file' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: split receive_fd_replace from __receive_fd
2021-05-03 11:05:28 -07:00
Shyam Prasad N
5b2abdafbe cifs: use echo_interval even when connection not ready.
When the tcp connection is not ready to send requests,
we keep retrying echo with an interval of zero.

This seems unnecessary, and this fix changes the interval
between echoes to what is specified as echo_interval.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-05-03 11:54:29 -05:00
Shyam Prasad N
f4916649f9 cifs: detect dead connections only when echoes are enabled.
We can detect server unresponsiveness only if echoes are enabled.
Echoes can be disabled under two scenarios:
1. The connection is low on credits, so we've disabled echoes/oplocks.
2. The connection has not seen any request till now (other than
negotiate/sess-setup), which is when we enable these two, based on
the credits available.

So this fix will check for dead connection, only when echo is enabled.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
CC: <stable@vger.kernel.org> # v5.8+
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-05-03 11:49:13 -05:00
Steve French
7ba3d1cdb7 smb3.1.1: allow dumping keys for multiuser mounts
When mounted multiuser it is hard to dump keys for the other sessions
which makes it hard to debug using network traces (e.g. using wireshark).

Suggested-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-05-03 11:45:36 -05:00
Steve French
aa22ebc382 smb3.1.1: allow dumping GCM256 keys to improve debugging of encrypted shares
Previously we were only able to dump CCM or GCM-128 keys (see "smbinfo keys" e.g.)
to allow network debugging (e.g. wireshark) of mounts to SMB3.1.1 encrypted
shares.  But with the addition of GCM-256 support, we have to be able to dump
32 byte instead of 16 byte keys which requires adding an additional ioctl
for that.

Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-05-03 11:43:37 -05:00
Steve French
087f757b01 cifs: add shutdown support
Various filesystem support the shutdown ioctl which is used by various
xfstests. The shutdown ioctl sets a flag on the superblock which
prevents open, unlink, symlink, hardlink, rmdir, create etc.
on the file system until unmount and remounted. The two flags supported
in this patch are:

  FSOP_GOING_FLAGS_LOGFLUSH and FSOP_GOING_FLAGS_NOLOGFLUSH

which require very little other than blocking new operations (since
we do not cache writes to metadata on the client with cifs.ko).
FSOP_GOING_FLAGS_DEFAULT is not supported yet, but could be added in
the future but would need to call syncfs or equivalent to write out
pending data on the mount.

With this patch various xfstests now work including tests 043 through
046 for example.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2021-05-03 11:21:22 -05:00
Rohith Surabattula
c3f207ab29 cifs: Deferred close for files
When file is closed, SMB2 close request is not sent to server
immediately and is deferred for acregmax defined interval. When file is
reopened by same process for read or write, the file handle
is reused if an oplock is held.

When client receives a oplock/lease break, file is closed immediately
if reference count is zero, else oplock is downgraded.

Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-05-03 11:20:35 -05:00
Linus Torvalds
9ccce092fc orangefs: implement orangefs_readahead
mm/readahead.c/read_pages was quite a bit different back
 when I put my open-coded readahead logic into orangefs_readpage.
 It seemed to work as designed then, it is a trainwreck now.
 
 This patch implements orangefs_readahead using new xarray
 and readahead_expand features that have just been pulled and
 removes all my open-coded readahead logic.
 
 This patch results in an extreme read performance improvement,
 these sample numbers are from my test VM:
 
 Here's an example of what's upstream in
 5.11.8-200.fc33.x86_64:
 
 30+0 records in
 30+0 records out
 125829120 bytes (126 MB, 120 MiB) copied, 5.77943 s, 21.8 MB/s
 
 And here's this version of orangefs_readahead on top of
 5.12.0-rc4:
 
 30+0 records in
 30+0 records out
 125829120 bytes (126 MB, 120 MiB) copied, 0.325919 s, 386 MB/s
 
 There are four xfstest regressions with this patch. David Howells
 and Matthew Wilcox have been helping me work with this code. One
 of the regressions has gone away with the most recent version of
 their code that I'm using. I hope this patch can be
 pulled even though there are still a few regressions, and that
 we can try to get them resolved during the RC period.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEIGSFVdO6eop9nER2z0QOqevODb4FAmCPCUsACgkQz0QOqevO
 Db77DQ/7B8V7RPlQ8C6HJlSuCED67W9isCG5CdzGobVafBrirbUusanQJRhjrIZO
 Voy0NYsR/rsM3K1tNk9AE7rlbT4UQibeUXwFVcVjBvtyXBiTgjbROc2AP4pjxAWu
 erH2McMEbrYjgrevwR/PKxyD8wS6vTX2InnI4yvlkbfEz04u/KkTSu0oN4UCU/8u
 8/drWDTIgZz6wffb1RpMFsCP77tfVWIWlRlH39u9OTe4fhPMug8jN+uOBrfyYxdp
 snJWznyeSYCQ4q/KkPkjfSUTDmx3+E1WeSHMNviHfwENdbcUAojk2O9wepBwJhQn
 r0DFU2yM+132oRkWO1DF7If1FRfvcmHjE4bmlLBSg+xgKOKpdMCs7Nf+s1Sji+w/
 8xTAPWzdqBeW6z4nIncvZPtjtes3979mJ/Jm/f4GLonAQB6yPJcIzA8gl5EEgXI3
 20pAt2JNCgCHVhHQso5fkLINlpND/cwlbOEOjyrNXIoJJngGDRo9FQ/osGBaLv5i
 n3XWC41lYnX9nqJ2FuVLBuZ+Jv1k5XSQualpyGGVTFaYp/jZVbjUOgJk7QPNsWl7
 9cUZAMVdDW6y7z1aZ2bu5y7VFIkPe4nfZNqrgXX+YySq0uOTrQBegkQRp1pu3t8m
 P3P9lVqcrn/kw+FASZborq921Njw+YDHvZuYfrnbF7J0sUL0fu4=
 =09Vm
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.13-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

Pull orangefs updates from Mike Marshall:
 "orangefs: implement orangefs_readahead

  mm/readahead.c/read_pages was quite a bit different back when I put my
  open-coded readahead logic into orangefs_readpage. That logic seemed
  to work as designed back then, it is a trainwreck now.

  This implements orangefs_readahead using the new xarray and
  readahead_expand features and removes all my open-coded readahead
  logic.

  This results in an extreme read performance improvement, these sample
  numbers are from my test VM:

  Here's an example of what's upstream in
  5.11.8-200.fc33.x86_64:

     30+0 records in
     30+0 records out
     125829120 bytes (126 MB, 120 MiB) copied, 5.77943 s, 21.8 MB/s

  And here's this version of orangefs_readahead on top of 5.12.0-rc4:

     30+0 records in
     30+0 records out
     125829120 bytes (126 MB, 120 MiB) copied, 0.325919 s, 386 MB/s

  There are four xfstest regressions with this patch. David Howells and
  Matthew Wilcox have been helping me work with this code"

* tag 'for-linus-5.13-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
  orangefs: leave files in the page cache for a few micro seconds at least
  Orangef: implement orangefs_readahead.
2021-05-02 14:13:46 -07:00
Linus Torvalds
27787ba3fa Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted stuff all over the place"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  useful constants: struct qstr for ".."
  hostfs_open(): don't open-code file_dentry()
  whack-a-mole: kill strlen_user() (again)
  autofs: should_expire() argument is guaranteed to be positive
  apparmor:match_mn() - constify devpath argument
  buffer: a small optimization in grow_buffers
  get rid of autofs_getpath()
  constify dentry argument of dentry_path()/dentry_path_raw()
2021-05-02 09:14:01 -07:00
Linus Torvalds
b28866f4bb Merge branch 'work.ecryptfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull exryptfs updates from Al Viro:
 "The interesting part here is (ecryptfs) lock_parent() fixes - its
  treatment of ->d_parent had been very wrong.

  The rest is trivial cleanups"

* 'work.ecryptfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ecryptfs: ecryptfs_dentry_info->crypt_stat is never used
  ecryptfs: get rid of unused accessors
  ecryptfs: saner API for lock_parent()
  ecryptfs: get rid of pointless dget/dput in ->symlink() and ->link()
2021-05-02 09:05:54 -07:00
Linus Torvalds
17ae69aba8 Add Landlock, a new LSM from Mickaël Salaün <mic@linux.microsoft.com>
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEgycj0O+d1G2aycA8rZhLv9lQBTwFAmCInP4ACgkQrZhLv9lQ
 BTza0g//dTeb9woC9H7qlEhK4l9yk62lTss60Q8X7m7ZSNfdL4tiEbi64SgK+iOW
 OOegbrOEb8Kzh4KJJYmVlVZ5YUWyH4szgmee1wnylBdsWiWaPLPF3Cflz77apy6T
 TiiBsJd7rRE29FKheaMt34B41BMh8QHESN+DzjzJWsFoi/uNxjgSs2W16XuSupKu
 bpRmB1pYNXMlrkzz7taL05jndZYE5arVriqlxgAsuLOFOp/ER7zecrjImdCM/4kL
 W6ej0R1fz2Geh6CsLBJVE+bKWSQ82q5a4xZEkSYuQHXgZV5eywE5UKu8ssQcRgQA
 VmGUY5k73rfY9Ofupf2gCaf/JSJNXKO/8Xjg0zAdklKtmgFjtna5Tyg9I90j7zn+
 5swSpKuRpilN8MQH+6GWAnfqQlNoviTOpFeq3LwBtNVVOh08cOg6lko/bmebBC+R
 TeQPACKS0Q0gCDPm9RYoU1pMUuYgfOwVfVRZK1prgi2Co7ZBUMOvYbNoKYoPIydr
 ENBYljlU1OYwbzgR2nE+24fvhU8xdNOVG1xXYPAEHShu+p7dLIWRLhl8UCtRQpSR
 1ofeVaJjgjrp29O+1OIQjB2kwCaRdfv/Gq1mztE/VlMU/r++E62OEzcH0aS+mnrg
 yzfyUdI8IFv1q6FGT9yNSifWUWxQPmOKuC8kXsKYfqfJsFwKmHM=
 =uCN4
 -----END PGP SIGNATURE-----

Merge tag 'landlock_v34' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security

Pull Landlock LSM from James Morris:
 "Add Landlock, a new LSM from Mickaël Salaün.

  Briefly, Landlock provides for unprivileged application sandboxing.

  From Mickaël's cover letter:
    "The goal of Landlock is to enable to restrict ambient rights (e.g.
     global filesystem access) for a set of processes. Because Landlock
     is a stackable LSM [1], it makes possible to create safe security
     sandboxes as new security layers in addition to the existing
     system-wide access-controls. This kind of sandbox is expected to
     help mitigate the security impact of bugs or unexpected/malicious
     behaviors in user-space applications. Landlock empowers any
     process, including unprivileged ones, to securely restrict
     themselves.

     Landlock is inspired by seccomp-bpf but instead of filtering
     syscalls and their raw arguments, a Landlock rule can restrict the
     use of kernel objects like file hierarchies, according to the
     kernel semantic. Landlock also takes inspiration from other OS
     sandbox mechanisms: XNU Sandbox, FreeBSD Capsicum or OpenBSD
     Pledge/Unveil.

     In this current form, Landlock misses some access-control features.
     This enables to minimize this patch series and ease review. This
     series still addresses multiple use cases, especially with the
     combined use of seccomp-bpf: applications with built-in sandboxing,
     init systems, security sandbox tools and security-oriented APIs [2]"

  The cover letter and v34 posting is here:

      https://lore.kernel.org/linux-security-module/20210422154123.13086-1-mic@digikod.net/

  See also:

      https://landlock.io/

  This code has had extensive design discussion and review over several
  years"

Link: https://lore.kernel.org/lkml/50db058a-7dde-441b-a7f9-f6837fe8b69f@schaufler-ca.com/ [1]
Link: https://lore.kernel.org/lkml/f646e1c7-33cf-333f-070c-0a40ad0468cd@digikod.net/ [2]

* tag 'landlock_v34' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
  landlock: Enable user space to infer supported features
  landlock: Add user and kernel documentation
  samples/landlock: Add a sandbox manager example
  selftests/landlock: Add user space tests
  landlock: Add syscall implementations
  arch: Wire up Landlock syscalls
  fs,security: Add sb_delete hook
  landlock: Support filesystem access-control
  LSM: Infrastructure management of the superblock
  landlock: Add ptrace restrictions
  landlock: Set up the security framework and manage credentials
  landlock: Add ruleset and domain management
  landlock: Add object management
2021-05-01 18:50:44 -07:00
David Howells
22650f1481 afs: Fix speculative status fetches
The generic/464 xfstest causes kAFS to emit occasional warnings of the
form:

        kAFS: vnode modified {100055:8a} 30->31 YFS.StoreData64 (c=6015)

This indicates that the data version received back from the server did not
match the expected value (the DV should be incremented monotonically for
each individual modification op committed to a vnode).

What is happening is that a lookup call is doing a bulk status fetch
speculatively on a bunch of vnodes in a directory besides getting the
status of the vnode it's actually interested in.  This is racing with a
StoreData operation (though it could also occur with, say, a MakeDir op).

On the client, a modification operation locks the vnode, but the bulk
status fetch only locks the parent directory, so no ordering is imposed
there (thereby avoiding an avenue to deadlock).

On the server, the StoreData op handler doesn't lock the vnode until it's
received all the request data, and downgrades the lock after committing the
data until it has finished sending change notifications to other clients -
which allows the status fetch to occur before it has finished.

This means that:

 - a status fetch can access the target vnode either side of the exclusive
   section of the modification

 - the status fetch could start before the modification, yet finish after,
   and vice-versa.

 - the status fetch and the modification RPCs can complete in either order.

 - the status fetch can return either the before or the after DV from the
   modification.

 - the status fetch might regress the locally cached DV.

Some of these are handled by the previous fix[1], but that's not sufficient
because it checks the DV it received against the DV it cached at the start
of the op, but the DV might've been updated in the meantime by a locally
generated modification op.

Fix this by the following means:

 (1) Keep track of when we're performing a modification operation on a
     vnode.  This is done by marking vnode parameters with a 'modification'
     note that causes the AFS_VNODE_MODIFYING flag to be set on the vnode
     for the duration.

 (2) Alter the speculation race detection to ignore speculative status
     fetches if either the vnode is marked as being modified or the data
     version number is not what we expected.

Note that whilst the "vnode modified" warning does get recovered from as it
causes the client to refetch the status at the next opportunity, it will
also invalidate the pagecache, so changes might get lost.

Fixes: a9e5c87ca7 ("afs: Fix speculative status fetch going out of order wrt to modifications")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-and-reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/160605082531.252452.14708077925602709042.stgit@warthog.procyon.org.uk/ [1]
Link: https://lore.kernel.org/linux-fsdevel/161961335926.39335.2552653972195467566.stgit@warthog.procyon.org.uk/ # v1
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-01 11:55:36 -07:00
Masahiro Yamada
9009b45581 .gitignore: prefix local generated files with a slash
The pattern prefixed with '/' matches files in the same directory,
but not ones in sub-directories.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Miguel Ojeda <ojeda@kernel.org>
Acked-by: Rob Herring <robh@kernel.org>
Acked-by: Andra Paraschiv <andraprs@amazon.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Gabriel Krisman Bertazi <krisman@collabora.com>
2021-05-02 00:43:35 +09:00
Linus Torvalds
9f67672a81 New features for ext4 this cycle include support for encrypted
casefold, ensure that deleted file names are cleared in directory
 blocks by zeroing directory entries when they are unlinked or moved as
 part of a hash tree node split.  We also improve the block allocator's
 performance on a freshly mounted file system by prefetching block
 bitmaps.
 
 There are also the usual cleanups and bug fixes, including fixing a
 page cache invalidation race when there is mixed buffered and direct
 I/O and the block size is less than page size, and allow the dax flag
 to be set and cleared on inline directories.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmCLei4ACgkQ8vlZVpUN
 gaPZkgf/VH08xjMf3VthC+BpvVmChQXfV4yjigHbO2pmPyYWZhyJzkEGCQD8u2eB
 b7ShW+B1NCifcTU34xAkKHwEtakzzEv3WIMrT1oZNWrpfo8tt850EkwQggaGGDpd
 /HnP1/wLtziJ5hE6DwutmX7qB4VFghVj898MjDrEPSOBqItOjWps9mn/JWL7SHyI
 Dqzhf5XZTYPaXWuJmSmKw3q8O70JDHnZe/rRWlfX1jLI5KDtqp71Nw1B+gszUB66
 IUdncyZKvInsyjYhkbCQ8U6WFih82MrbKeuGYDp/RFvg5eMELEYkwT9j0ofuDHq8
 zn62sAlbOXv1DiqkPDHKVm9GkHx8/g==
 =UpnH
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "New features for ext4 this cycle include support for encrypted
  casefold, ensure that deleted file names are cleared in directory
  blocks by zeroing directory entries when they are unlinked or moved as
  part of a hash tree node split. We also improve the block allocator's
  performance on a freshly mounted file system by prefetching block
  bitmaps.

  There are also the usual cleanups and bug fixes, including fixing a
  page cache invalidation race when there is mixed buffered and direct
  I/O and the block size is less than page size, and allow the dax flag
  to be set and cleared on inline directories"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (32 commits)
  ext4: wipe ext4_dir_entry2 upon file deletion
  ext4: Fix occasional generic/418 failure
  fs: fix reporting supported extra file attributes for statx()
  ext4: allow the dax flag to be set and cleared on inline directories
  ext4: fix debug format string warning
  ext4: fix trailing whitespace
  ext4: fix various seppling typos
  ext4: fix error return code in ext4_fc_perform_commit()
  ext4: annotate data race in jbd2_journal_dirty_metadata()
  ext4: annotate data race in start_this_handle()
  ext4: fix ext4_error_err save negative errno into superblock
  ext4: fix error code in ext4_commit_super
  ext4: always panic when errors=panic is specified
  ext4: delete redundant uptodate check for buffer
  ext4: do not set SB_ACTIVE in ext4_orphan_cleanup()
  ext4: make prefetch_block_bitmaps default
  ext4: add proc files to monitor new structures
  ext4: improve cr 0 / cr 1 group scanning
  ext4: add MB_NUM_ORDERS macro
  ext4: add mballoc stats proc file
  ...
2021-04-30 15:35:30 -07:00
Linus Torvalds
6bab076a3d dlm for 5.13
This set includes more dlm networking cleanups and improvements for
 making dlm shutdowns more robust.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJgjC8hAAoJEDgbc8f8gGmqf2kP/AgQEu1T1OKoJ8K0hEi4Axsn
 CIn09+Sb+mVnEa5LHGESw7WbsoodalMnn7D1GExpTXRy015d/SQjBdG9oWuWYnWO
 ouo/ElHAAtU4XCLcGIFYwv7vgUY+ag6Nc2yAvkIpETBKiCl6IHkoIwz1+B+Qgboo
 DOiu3iCYaPB3w13guMETDFZTrzM29k0VTnEgnGVSCKznm1gHUo7WuB17xyKOVFiY
 Y3KkzjqD+NOqtbXlTIEIBQ2PRD2pX1hroto5Qi85/HM/W/yieNYjIM51+kjX5SVy
 VJB/9JpVP/819TDuHPz3xHSxgXgXePWstuAH1WruUfoB4iLx0Fw2WI3snld2VFBn
 5fFZ1+F/YnBQZ4ElQ6E1YciQLQM8SEA7vCgBpNYkWe+3DFvoKKeYNZEO2lTGtvnr
 EYcie8epQKj6eddLbW/hYvO44vhlcp8lV0AwT1mSCHHKn/8GbNB6kk1gW4PIxOqm
 1PQ4Z5VBrxnxxoPD2wIZbmqIgPXKZIxg93SrR0RYyJbjoSgGz45n6I3dLwR5KUNE
 9PQfx/UhmyVqE7UELdarb0hDaPXXzhh+T0LH6YHtihmAhfCA/dhiOYUhkWNtXCkE
 sQwWQq17xnC5juH/vp2ispBnSjUJq1kdVnN4l1RGkMNaSeVZeORH0A8jMwYE3vsK
 A6KMaBkfBpvuiY7UHCcK
 =0iOq
 -----END PGP SIGNATURE-----

Merge tag 'dlm-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm updates from David Teigland:
 "This includes more dlm networking cleanups and improvements for making
  dlm shutdowns more robust"

* tag 'dlm-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  fs: dlm: fix missing unlock on error in accept_from_sock()
  fs: dlm: add shutdown hook
  fs: dlm: flush swork on shutdown
  fs: dlm: remove unaligned memory access handling
  fs: dlm: check on minimum msglen size
  fs: dlm: simplify writequeue handling
  fs: dlm: use GFP_ZERO for page buffer
  fs: dlm: change allocation limits
  fs: dlm: add check if dlm is currently running
  fs: dlm: add errno handling to check callback
  fs: dlm: set subclass for othercon sock_mutex
  fs: dlm: set connected bit after accept
  fs: dlm: fix mark setting deadlock
  fs: dlm: fix debugfs dump
2021-04-30 15:28:25 -07:00
Linus Torvalds
9ec1efbf9d fuse update for 5.13
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYIwY/wAKCRDh3BK/laaZ
 PNSmAPwLFCBGegvwxUSguiPmIXpDrrlG+USwTzGlxhVOg2ETGgEA6D+Lsz2uCBI3
 xLkPAXD6uTbWLp13YtUSMXK+LR8V5wc=
 =Fl+Q
 -----END PGP SIGNATURE-----

Merge tag 'fuse-update-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse updates from Miklos Szeredi:

 - Fix a page locking bug in write (introduced in 2.6.26)

 - Allow sgid bit to be killed in setacl()

 - Miscellaneous fixes and cleanups

* tag 'fuse-update-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  cuse: simplify refcount
  cuse: prevent clone
  virtiofs: fix userns
  virtiofs: remove useless function
  virtiofs: split requests that exceed virtqueue size
  virtiofs: fix memory leak in virtio_fs_probe()
  fuse: invalidate attrs when page writeback completes
  fuse: add a flag FUSE_SETXATTR_ACL_KILL_SGID to kill SGID
  fuse: extend FUSE_SETXATTR request
  fuse: fix matching of FUSE_DEV_IOC_CLONE command
  fuse: fix a typo
  fuse: don't zero pages twice
  fuse: fix typo for fuse_conn.max_pages comment
  fuse: fix write deadlock
2021-04-30 15:23:16 -07:00
Linus Torvalds
d652502ef4 overlayfs update for 5.13
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYIwTsgAKCRDh3BK/laaZ
 PDktAP41eScbCiFzXDRjXw9S7Wfd8HEct0y1p+9BUh8m3VdHfwEA0pDlJWNaJdYW
 nFixPJ5GsAfxo+1ags0vn06CUS/K4gA=
 =QlbJ
 -----END PGP SIGNATURE-----

Merge tag 'ovl-update-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs update from Miklos Szeredi:

 - Fix a regression introduced in 5.2 that resulted in valid overlayfs
   mounts being rejected with ELOOP (Too many levels of symbolic links)

 - Fix bugs found by various tools

 - Miscellaneous improvements and cleanups

* tag 'ovl-update-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: add debug print to ovl_do_getxattr()
  ovl: invalidate readdir cache on changes to dir with origin
  ovl: allow upperdir inside lowerdir
  ovl: show "userxattr" in the mount data
  ovl: trivial typo fixes in the file inode.c
  ovl: fix misspellings using codespell tool
  ovl: do not copy attr several times
  ovl: remove ovl_map_dev_ino() return value
  ovl: fix error for ovl_fill_super()
  ovl: fix missing revert_creds() on error path
  ovl: fix leaked dentry
  ovl: restrict lower null uuid for "xino=auto"
  ovl: check that upperdir path is not on a read-only mount
  ovl: plumb through flush method
2021-04-30 15:17:08 -07:00
Brian Geffon
14d071134c Revert "mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio"
This reverts commit cd544fd1dc.

As discussed in [1] this commit was a no-op because the mapping type was
checked in vma_to_resize before move_vma is ever called.  This meant that
vm_ops->mremap() would never be called on such mappings.  Furthermore,
we've since expanded support of MREMAP_DONTUNMAP to non-anonymous
mappings, and these special mappings are still protected by the existing
check of !VM_DONTEXPAND and !VM_PFNMAP which will result in a -EINVAL.

1. https://lkml.org/lkml/2020/12/28/2340

Link: https://lkml.kernel.org/r/20210323182520.2712101-2-bgeffon@google.com
Signed-off-by: Brian Geffon <bgeffon@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Alejandro Colomar <alx.manpages@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Michael S . Tsirkin" <mst@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:39 -07:00
Jens Axboe
985b71db17 iomap: use filemap_range_needs_writeback() for O_DIRECT reads
For reads, use the better variant of checking for the need to call
filemap_write_and_wait_range() when doing O_DIRECT.  This avoids falling
back to the slow path for IOCB_NOWAIT, if there are no pages to wait for
(or write out).

Link: https://lkml.kernel.org/r/20210224164455.1096727-4-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:36 -07:00
Randy Dunlap
21ae3ad163 vfs: fs_parser: clean up kernel-doc warnings
Fix kernel-doc notation function arguments to eliminate two kernel-doc
warnings:

  fs_parser.c:322: warning: Excess function parameter 'name' description in 'validate_constant_table'
  fs_parser.c:367: warning: Function parameter or member 'name' not described in 'fs_validate_description'

Link: https://lkml.kernel.org/r/20210407033743.9701-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:35 -07:00
Jiapeng Chong
ccf33ec4a7 ocfs2/dlm: remove unused function
Fix the following clang warning:

  fs/ocfs2/dlm/dlmrecovery.c:129:20: warning: unused function 'dlm_reset_recovery' [-Wunused-function].

Link: https://lkml.kernel.org/r/1618382761-5784-1-git-send-email-jiapeng.chong@linux.alibaba.com
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:35 -07:00
Bhaskar Chowdhury
f13604a2b9 ocfs2: fix a typo
s/cluter/cluster/

Link: https://lkml.kernel.org/r/20210324072931.5056-1-unixbhaskar@gmail.com
Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:35 -07:00
Joseph Qi
f9630ec9d9 ocfs2: map flags directly in flags_to_o2dlm()
Use macro map_flag() is tricky and coccicheck outputs the following
warning:

  fs/ocfs2/stack_o2cb.c:69:5-16: Unneeded variable: "o2dlm_flags"

So map flags directly in flags_to_o2dlm() to make coccicheck happy.
And remove BUG_ON() here as well to simplify code since it runs well
a long time.

Link: https://lkml.kernel.org/r/1616138664-35935-1-git-send-email-joseph.qi@linux.alibaba.com
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:35 -07:00
Yang Li
1634852df7 ocfs2: replace DEFINE_SIMPLE_ATTRIBUTE with DEFINE_DEBUGFS_ATTRIBUTE
Fix the following coccicheck warning:

  fs/ocfs2/blockcheck.c:232:0-23: WARNING: blockcheck_fops should be defined with DEFINE_DEBUGFS_ATTRIBUTE

Link: https://lkml.kernel.org/r/1614155230-57292-1-git-send-email-yang.lee@linux.alibaba.com
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:35 -07:00
Zqiang
bb6659cc0a io_uring: Fix memory leak in io_sqe_buffers_register()
unreferenced object 0xffff8881123bf0a0 (size 32):
comm "syz-executor557", pid 8384, jiffies 4294946143 (age 12.360s)
backtrace:
[<ffffffff81469b71>] kmalloc_node include/linux/slab.h:579 [inline]
[<ffffffff81469b71>] kvmalloc_node+0x61/0xf0 mm/util.c:587
[<ffffffff815f0b3f>] kvmalloc include/linux/mm.h:795 [inline]
[<ffffffff815f0b3f>] kvmalloc_array include/linux/mm.h:813 [inline]
[<ffffffff815f0b3f>] kvcalloc include/linux/mm.h:818 [inline]
[<ffffffff815f0b3f>] io_rsrc_data_alloc+0x4f/0xc0 fs/io_uring.c:7164
[<ffffffff815f26d8>] io_sqe_buffers_register+0x98/0x3d0 fs/io_uring.c:8383
[<ffffffff815f84a7>] __io_uring_register+0xf67/0x18c0 fs/io_uring.c:9986
[<ffffffff81609222>] __do_sys_io_uring_register fs/io_uring.c:10091 [inline]
[<ffffffff81609222>] __se_sys_io_uring_register fs/io_uring.c:10071 [inline]
[<ffffffff81609222>] __x64_sys_io_uring_register+0x112/0x230 fs/io_uring.c:10071
[<ffffffff842f616a>] do_syscall_64+0x3a/0xb0 arch/x86/entry/common.c:47
[<ffffffff84400068>] entry_SYSCALL_64_after_hwframe+0x44/0xae

Fix data->tags memory leak, through io_rsrc_data_free() to release
data memory space.

Reported-by: syzbot+0f32d05d8b6cd8d7ea3e@syzkaller.appspotmail.com
Signed-off-by: Zqiang <qiang.zhang@windriver.com>
Link: https://lore.kernel.org/r/20210430082515.13886-1-qiang.zhang@windriver.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-30 06:44:22 -06:00
Linus Torvalds
b0030af53a Kbuild updates for v5.13
- Evaluate $(call cc-option,...) etc. only for build targets
 
  - Add CONFIG_VMLINUX_MAP to generate .map file when linking vmlinux
 
  - Remove unnecessary --gcc-toolchains Clang flag because the --prefix
    flag finds the toolchains
 
  - Do not pass Clang's --prefix flag when using the integrated as
 
  - Check the assembler version in Kconfig time
 
  - Add new CONFIG options, AS_VERSION, AS_IS_GNU, AS_IS_LLVM to clean up
    some dependencies in Kconfig
 
  - Fix invalid Module.symvers creation when building only modules without
    vmlinux
 
  - Fix false-positive modpost warnings when CONFIG_TRIM_UNUSED_KSYMS is
    set, but there is no module to build
 
  - Refactor module installation Makefile
 
  - Support zstd for module compression
 
  - Convert alpha and ia64 to use generic shell scripts to generate the
    syscall headers
 
  - Add a new elfnote to indicate if the kernel was built with LTO, which
    will be used by pahole
 
  - Flatten the directory structure under include/config/ so CONFIG options
    and filenames match
 
  - Change the deb source package name from linux-$(KERNELRELEASE) to
    linux-upstream
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmCKOLUVHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGdq8P/2z+saxIWGXVWt0ggavR0vimcY4e
 NQIKGu9uZpo/lfoC78UG8HO+XvzvPUrcRuOX+WIVr2GfScgVnweDukexUAY0/2oi
 4UvqhndJ0sjEwRj8mXXJ0O+PED+OtgrqrbhkLq9wHQd/jpSD4XEWXwn1g1XVrTZu
 WbwP6b1G/Rnjp2lz3HKC017rPkmfsCFQB7r+hbJGKhT0rCaceheUuBvGa/XqLknr
 IOyaUAY76u3Gtj6fVY1rk70kQgDMF8+LJPgdSSZ/XPCvbNJQAeop36EeRNfmxGIh
 vQhFJRJeqy+K5MhCpdGtTGYDawlmQVn/f/99SkDw9F04S4ZL2Xnaaqw4L1QDhjTh
 xBlckbPvmq36F4xSqWd5kYF3iwS+LsEJROwZKFLEVDb3zMsRQPEGQM/556QmrBi2
 5KXzwOYEJKuobWr1hQ3PwLumJKTPGLvGEFB3Bq2eG8LrgpOAHPI4ejC2EBu0vCez
 QbskP2lPlMj3MbL5iZg+6ZRlOChZ7RUrSDj6+iTeOcinmXHqQONCL6qy+um4Rfcb
 zUkfwTlqM9d88u6AbO2VvQMOobMjvp4bvmqi/Xv8IiTukLHco4tc8zTuySmZwSyI
 rd3RKYn367qWztX5YyaoGRPVmlMG7ssbRc4fkXiV13vfeZebNfVwlX/CHv9+IWwN
 RVnMhYBhUZR68h6z
 =ti9L
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:

 - Evaluate $(call cc-option,...) etc. only for build targets

 - Add CONFIG_VMLINUX_MAP to generate .map file when linking vmlinux

 - Remove unnecessary --gcc-toolchains Clang flag because the --prefix
   flag finds the toolchains

 - Do not pass Clang's --prefix flag when using the integrated as

 - Check the assembler version in Kconfig time

 - Add new CONFIG options, AS_VERSION, AS_IS_GNU, AS_IS_LLVM to clean up
   some dependencies in Kconfig

 - Fix invalid Module.symvers creation when building only modules
   without vmlinux

 - Fix false-positive modpost warnings when CONFIG_TRIM_UNUSED_KSYMS is
   set, but there is no module to build

 - Refactor module installation Makefile

 - Support zstd for module compression

 - Convert alpha and ia64 to use generic shell scripts to generate the
   syscall headers

 - Add a new elfnote to indicate if the kernel was built with LTO, which
   will be used by pahole

 - Flatten the directory structure under include/config/ so CONFIG
   options and filenames match

 - Change the deb source package name from linux-$(KERNELRELEASE) to
   linux-upstream

* tag 'kbuild-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (42 commits)
  kbuild: Add $(KBUILD_HOSTLDFLAGS) to 'has_libelf' test
  kbuild: deb-pkg: change the source package name to linux-upstream
  tools: do not include scripts/Kbuild.include
  kbuild: redo fake deps at include/config/*.h
  kbuild: remove TMPO from try-run
  MAINTAINERS: add pattern for dummy-tools
  kbuild: add an elfnote for whether vmlinux is built with lto
  ia64: syscalls: switch to generic syscallhdr.sh
  ia64: syscalls: switch to generic syscalltbl.sh
  alpha: syscalls: switch to generic syscallhdr.sh
  alpha: syscalls: switch to generic syscalltbl.sh
  sysctl: use min() helper for namecmp()
  kbuild: add support for zstd compressed modules
  kbuild: remove CONFIG_MODULE_COMPRESS
  kbuild: merge scripts/Makefile.modsign to scripts/Makefile.modinst
  kbuild: move module strip/compression code into scripts/Makefile.modinst
  kbuild: refactor scripts/Makefile.modinst
  kbuild: rename extmod-prefix to extmod_prefix
  kbuild: check module name conflict for external modules as well
  kbuild: show the target directory for depmod log
  ...
2021-04-29 14:24:39 -07:00
Colin Ian King
cf3770e784 io_uring: Fix premature return from loop and memory leak
Currently the -EINVAL error return path is leaking memory allocated
to data. Fix this by not returning immediately but instead setting
the error return variable to -EINVAL and breaking out of the loop.

Kudos to Pavel Begunkov for suggesting a correct fix.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20210429104602.62676-1-colin.king@canonical.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-29 13:26:19 -06:00
Pavel Begunkov
47b228ce6f io_uring: fix unchecked error in switch_start()
io_rsrc_node_switch_start() can fail, don't forget to check returned
error code.

Reported-by: syzbot+a4715dd4b7c866136f79@syzkaller.appspotmail.com
Fixes: eae071c9b4 ("io_uring: prepare fixed rw for dynanic buffers")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c4c06e2f3f0c8e43bd8d0a266c79055bcc6b6e60.1619693112.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-29 13:26:19 -06:00
Pavel Begunkov
6224843d56 io_uring: allow empty slots for reg buffers
Allow empty reg buffer slots any request using which should fail. This
allows users to not register all buffers in advance, but do it lazily
and/or on demand via updates. That is achieved by setting iov_base and
iov_len to zero for registration and/or buffer updates. Empty buffer
can't have a non-zero tag.

Implementation details: to not add extra overhead to io_import_fixed(),
create a dummy buffer crafted to fail any request using it, and set it
to all empty buffer slots.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7e95e4d700082baaf010c648c72ac764c9cc8826.1619611868.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-29 13:26:19 -06:00
Pavel Begunkov
b0d658ec88 io_uring: add more build check for uapi
Add a couple of BUILD_BUG_ON() checking some rsrc uapi structs and SQE
flags.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ff960df4d5026b9fb5bfd80994b9d3667d3926da.1619536280.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-29 13:26:18 -06:00
Pavel Begunkov
dddca22636 io_uring: dont overlap internal and user req flags
CQE flags take one byte that we store in req->flags together with other
REQ_F_* internal flags. CQE flags are copied directly into req and then
verified that requires some handling on failures, e.g. to make sure that
that copy doesn't set some of the internal flags.

Move all internal flags to take bits after the first byte, so we don't
need extra handling and make it safer overall.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b8b5b02d1ab9d786fcc7db4a3fe86db6b70b8987.1619536280.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-29 13:26:18 -06:00
Pavel Begunkov
2840f710f2 io_uring: fix drain with rsrc CQEs
Resource emitted CQEs are not bound to requests, so fix up counters used
for DRAIN/defer logic.

Fixes: b60c8dce33 ("io_uring: preparation for rsrc tagging")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2b32f5f0a40d5928c3466d028f936e167f0654be.1619536280.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-29 13:26:18 -06:00
Linus Torvalds
9d31d23389 Networking changes for 5.13.
Core:
 
  - bpf:
 	- allow bpf programs calling kernel functions (initially to
 	  reuse TCP congestion control implementations)
 	- enable task local storage for tracing programs - remove the
 	  need to store per-task state in hash maps, and allow tracing
 	  programs access to task local storage previously added for
 	  BPF_LSM
 	- add bpf_for_each_map_elem() helper, allowing programs to
 	  walk all map elements in a more robust and easier to verify
 	  fashion
 	- sockmap: support UDP and cross-protocol BPF_SK_SKB_VERDICT
 	  redirection
 	- lpm: add support for batched ops in LPM trie
 	- add BTF_KIND_FLOAT support - mostly to allow use of BTF
 	  on s390 which has floats in its headers files
 	- improve BPF syscall documentation and extend the use of kdoc
 	  parsing scripts we already employ for bpf-helpers
 	- libbpf, bpftool: support static linking of BPF ELF files
 	- improve support for encapsulation of L2 packets
 
  - xdp: restructure redirect actions to avoid a runtime lookup,
 	improving performance by 4-8% in microbenchmarks
 
  - xsk: build skb by page (aka generic zerocopy xmit) - improve
 	performance of software AF_XDP path by 33% for devices
 	which don't need headers in the linear skb part (e.g. virtio)
 
  - nexthop: resilient next-hop groups - improve path stability
 	on next-hops group changes (incl. offload for mlxsw)
 
  - ipv6: segment routing: add support for IPv4 decapsulation
 
  - icmp: add support for RFC 8335 extended PROBE messages
 
  - inet: use bigger hash table for IP ID generation
 
  - tcp: deal better with delayed TX completions - make sure we don't
 	give up on fast TCP retransmissions only because driver is
 	slow in reporting that it completed transmitting the original
 
  - tcp: reorder tcp_congestion_ops for better cache locality
 
  - mptcp:
 	- add sockopt support for common TCP options
 	- add support for common TCP msg flags
 	- include multiple address ids in RM_ADDR
 	- add reset option support for resetting one subflow
 
  - udp: GRO L4 improvements - improve 'forward' / 'frag_list'
 	co-existence with UDP tunnel GRO, allowing the first to take
 	place correctly	even for encapsulated UDP traffic
 
  - micro-optimize dev_gro_receive() and flow dissection, avoid
 	retpoline overhead on VLAN and TEB GRO
 
  - use less memory for sysctls, add a new sysctl type, to allow using
 	u8 instead of "int" and "long" and shrink networking sysctls
 
  - veth: allow GRO without XDP - this allows aggregating UDP
 	packets before handing them off to routing, bridge, OvS, etc.
 
  - allow specifing ifindex when device is moved to another namespace
 
  - netfilter:
 	- nft_socket: add support for cgroupsv2
 	- nftables: add catch-all set element - special element used
 	  to define a default action in case normal lookup missed
 	- use net_generic infra in many modules to avoid allocating
 	  per-ns memory unnecessarily
 
  - xps: improve the xps handling to avoid potential out-of-bound
 	accesses and use-after-free when XPS change race with other
 	re-configuration under traffic
 
  - add a config knob to turn off per-cpu netdev refcnt to catch
 	underflows in testing
 
 Device APIs:
 
  - add WWAN subsystem to organize the WWAN interfaces better and
    hopefully start driving towards more unified and vendor-
    -independent APIs
 
  - ethtool:
 	- add interface for reading IEEE MIB stats (incl. mlx5 and
 	  bnxt support)
 	- allow network drivers to dump arbitrary SFP EEPROM data,
 	  current offset+length API was a poor fit for modern SFP
 	  which define EEPROM in terms of pages (incl. mlx5 support)
 
  - act_police, flow_offload: add support for packet-per-second
 	policing (incl. offload for nfp)
 
  - psample: add additional metadata attributes like transit delay
 	for packets sampled from switch HW (and corresponding egress
 	and policy-based sampling in the mlxsw driver)
 
  - dsa: improve support for sandwiched LAGs with bridge and DSA
 
  - netfilter:
 	- flowtable: use direct xmit in topologies with IP
 	  forwarding, bridging, vlans etc.
 	- nftables: counter hardware offload support
 
  - Bluetooth:
 	- improvements for firmware download w/ Intel devices
 	- add support for reading AOSP vendor capabilities
 	- add support for virtio transport driver
 
  - mac80211:
 	- allow concurrent monitor iface and ethernet rx decap
 	- set priority and queue mapping for injected frames
 
  - phy: add support for Clause-45 PHY Loopback
 
  - pci/iov: add sysfs MSI-X vector assignment interface
 	to distribute MSI-X resources to VFs (incl. mlx5 support)
 
 New hardware/drivers:
 
  - dsa: mv88e6xxx: add support for Marvell mv88e6393x -
 	11-port Ethernet switch with 8x 1-Gigabit Ethernet
 	and 3x 10-Gigabit interfaces.
 
  - dsa: support for legacy Broadcom tags used on BCM5325, BCM5365
 	and BCM63xx switches
 
  - Microchip KSZ8863 and KSZ8873; 3x 10/100Mbps Ethernet switches
 
  - ath11k: support for QCN9074 a 802.11ax device
 
  - Bluetooth: Broadcom BCM4330 and BMC4334
 
  - phy: Marvell 88X2222 transceiver support
 
  - mdio: add BCM6368 MDIO mux bus controller
 
  - r8152: support RTL8153 and RTL8156 (USB Ethernet) chips
 
  - mana: driver for Microsoft Azure Network Adapter (MANA)
 
  - Actions Semi Owl Ethernet MAC
 
  - can: driver for ETAS ES58X CAN/USB interfaces
 
 Pure driver changes:
 
  - add XDP support to: enetc, igc, stmmac
  - add AF_XDP support to: stmmac
 
  - virtio:
 	- page_to_skb() use build_skb when there's sufficient tailroom
 	  (21% improvement for 1000B UDP frames)
 	- support XDP even without dedicated Tx queues - share the Tx
 	  queues with the stack when necessary
 
  - mlx5:
 	- flow rules: add support for mirroring with conntrack,
 	  matching on ICMP, GTP, flex filters and more
 	- support packet sampling with flow offloads
 	- persist uplink representor netdev across eswitch mode
 	  changes
 	- allow coexistence of CQE compression and HW time-stamping
 	- add ethtool extended link error state reporting
 
  - ice, iavf: support flow filters, UDP Segmentation Offload
 
  - dpaa2-switch:
 	- move the driver out of staging
 	- add spanning tree (STP) support
 	- add rx copybreak support
 	- add tc flower hardware offload on ingress traffic
 
  - ionic:
 	- implement Rx page reuse
 	- support HW PTP time-stamping
 
  - octeon: support TC hardware offloads - flower matching on ingress
 	and egress ratelimitting.
 
  - stmmac:
 	- add RX frame steering based on VLAN priority in tc flower
 	- support frame preemption (FPE)
 	- intel: add cross time-stamping freq difference adjustment
 
  - ocelot:
 	- support forwarding of MRP frames in HW
 	- support multiple bridges
 	- support PTP Sync one-step timestamping
 
  - dsa: mv88e6xxx, dpaa2-switch: offload bridge port flags like
 	learning, flooding etc.
 
  - ipa: add IPA v4.5, v4.9 and v4.11 support (Qualcomm SDX55, SM8350,
 	SC7280 SoCs)
 
  - mt7601u: enable TDLS support
 
  - mt76:
 	- add support for 802.3 rx frames (mt7915/mt7615)
 	- mt7915 flash pre-calibration support
 	- mt7921/mt7663 runtime power management fixes
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmCKFPIACgkQMUZtbf5S
 Irtw0g/+NA8bWdHNgG4H5rya0pv2z3IieLRmSdDfKRQQXcJpklawc5MKVVaTee/Q
 5/QqgPdCsu1LAU6JXBKsKmyDDaMlQKdWuKbOqDSiAQKoMesZStTEHf9d851ZzgxA
 Cdb6O7BD3lBl/IN+oxNG+KcmD1LKquTPKGySq2mQtEdLO12ekAsranzmj4voKffd
 q9tBShpXQ7Dq77DLYfiQXVCvsizNcbbJFuxX0o9Lpb9+61ZyYAbogZSa9ypiZZwR
 I/9azRBtJg7UV1aD/cLuAfy66Qh7t63+rCxVazs5Os8jVO26P/jQdisnnOe/x+p9
 wYEmKm3GSu0V4SAPxkWW+ooKusflCeqDoMIuooKt6kbP6BRj540veGw3Ww/m5YFr
 7pLQkTSP/tSjuGQIdBE1LOP5LBO8DZeC8Kiop9V0fzAW9hFSZbEq25WW0bPj8QQO
 zA4Z7yWlslvxcfY2BdJX3wD8klaINkl/8fDWZFFsBdfFX2VeLtm7Xfduw34BJpvU
 rYT3oWr6PhtkPAKR32SUcemSfeWgIVU41eSshzRz3kez1NngBUuLlSGGSEaKbes5
 pZVt6pYFFVByyf6MTHFEoQvafZfEw04JILZpo4R5V8iTHzom0kD3Py064sBiXEw2
 B6t+OW4qgcxGblpFkK2lD4kR2s1TPUs0ckVO6sAy1x8q60KKKjY=
 =vcbA
 -----END PGP SIGNATURE-----

Merge tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core:

   - bpf:
        - allow bpf programs calling kernel functions (initially to
          reuse TCP congestion control implementations)
        - enable task local storage for tracing programs - remove the
          need to store per-task state in hash maps, and allow tracing
          programs access to task local storage previously added for
          BPF_LSM
        - add bpf_for_each_map_elem() helper, allowing programs to walk
          all map elements in a more robust and easier to verify fashion
        - sockmap: support UDP and cross-protocol BPF_SK_SKB_VERDICT
          redirection
        - lpm: add support for batched ops in LPM trie
        - add BTF_KIND_FLOAT support - mostly to allow use of BTF on
          s390 which has floats in its headers files
        - improve BPF syscall documentation and extend the use of kdoc
          parsing scripts we already employ for bpf-helpers
        - libbpf, bpftool: support static linking of BPF ELF files
        - improve support for encapsulation of L2 packets

   - xdp: restructure redirect actions to avoid a runtime lookup,
     improving performance by 4-8% in microbenchmarks

   - xsk: build skb by page (aka generic zerocopy xmit) - improve
     performance of software AF_XDP path by 33% for devices which don't
     need headers in the linear skb part (e.g. virtio)

   - nexthop: resilient next-hop groups - improve path stability on
     next-hops group changes (incl. offload for mlxsw)

   - ipv6: segment routing: add support for IPv4 decapsulation

   - icmp: add support for RFC 8335 extended PROBE messages

   - inet: use bigger hash table for IP ID generation

   - tcp: deal better with delayed TX completions - make sure we don't
     give up on fast TCP retransmissions only because driver is slow in
     reporting that it completed transmitting the original

   - tcp: reorder tcp_congestion_ops for better cache locality

   - mptcp:
        - add sockopt support for common TCP options
        - add support for common TCP msg flags
        - include multiple address ids in RM_ADDR
        - add reset option support for resetting one subflow

   - udp: GRO L4 improvements - improve 'forward' / 'frag_list'
     co-existence with UDP tunnel GRO, allowing the first to take place
     correctly even for encapsulated UDP traffic

   - micro-optimize dev_gro_receive() and flow dissection, avoid
     retpoline overhead on VLAN and TEB GRO

   - use less memory for sysctls, add a new sysctl type, to allow using
     u8 instead of "int" and "long" and shrink networking sysctls

   - veth: allow GRO without XDP - this allows aggregating UDP packets
     before handing them off to routing, bridge, OvS, etc.

   - allow specifing ifindex when device is moved to another namespace

   - netfilter:
        - nft_socket: add support for cgroupsv2
        - nftables: add catch-all set element - special element used to
          define a default action in case normal lookup missed
        - use net_generic infra in many modules to avoid allocating
          per-ns memory unnecessarily

   - xps: improve the xps handling to avoid potential out-of-bound
     accesses and use-after-free when XPS change race with other
     re-configuration under traffic

   - add a config knob to turn off per-cpu netdev refcnt to catch
     underflows in testing

  Device APIs:

   - add WWAN subsystem to organize the WWAN interfaces better and
     hopefully start driving towards more unified and vendor-
     independent APIs

   - ethtool:
        - add interface for reading IEEE MIB stats (incl. mlx5 and bnxt
          support)
        - allow network drivers to dump arbitrary SFP EEPROM data,
          current offset+length API was a poor fit for modern SFP which
          define EEPROM in terms of pages (incl. mlx5 support)

   - act_police, flow_offload: add support for packet-per-second
     policing (incl. offload for nfp)

   - psample: add additional metadata attributes like transit delay for
     packets sampled from switch HW (and corresponding egress and
     policy-based sampling in the mlxsw driver)

   - dsa: improve support for sandwiched LAGs with bridge and DSA

   - netfilter:
        - flowtable: use direct xmit in topologies with IP forwarding,
          bridging, vlans etc.
        - nftables: counter hardware offload support

   - Bluetooth:
        - improvements for firmware download w/ Intel devices
        - add support for reading AOSP vendor capabilities
        - add support for virtio transport driver

   - mac80211:
        - allow concurrent monitor iface and ethernet rx decap
        - set priority and queue mapping for injected frames

   - phy: add support for Clause-45 PHY Loopback

   - pci/iov: add sysfs MSI-X vector assignment interface to distribute
     MSI-X resources to VFs (incl. mlx5 support)

  New hardware/drivers:

   - dsa: mv88e6xxx: add support for Marvell mv88e6393x - 11-port
     Ethernet switch with 8x 1-Gigabit Ethernet and 3x 10-Gigabit
     interfaces.

   - dsa: support for legacy Broadcom tags used on BCM5325, BCM5365 and
     BCM63xx switches

   - Microchip KSZ8863 and KSZ8873; 3x 10/100Mbps Ethernet switches

   - ath11k: support for QCN9074 a 802.11ax device

   - Bluetooth: Broadcom BCM4330 and BMC4334

   - phy: Marvell 88X2222 transceiver support

   - mdio: add BCM6368 MDIO mux bus controller

   - r8152: support RTL8153 and RTL8156 (USB Ethernet) chips

   - mana: driver for Microsoft Azure Network Adapter (MANA)

   - Actions Semi Owl Ethernet MAC

   - can: driver for ETAS ES58X CAN/USB interfaces

  Pure driver changes:

   - add XDP support to: enetc, igc, stmmac

   - add AF_XDP support to: stmmac

   - virtio:
        - page_to_skb() use build_skb when there's sufficient tailroom
          (21% improvement for 1000B UDP frames)
        - support XDP even without dedicated Tx queues - share the Tx
          queues with the stack when necessary

   - mlx5:
        - flow rules: add support for mirroring with conntrack, matching
          on ICMP, GTP, flex filters and more
        - support packet sampling with flow offloads
        - persist uplink representor netdev across eswitch mode changes
        - allow coexistence of CQE compression and HW time-stamping
        - add ethtool extended link error state reporting

   - ice, iavf: support flow filters, UDP Segmentation Offload

   - dpaa2-switch:
        - move the driver out of staging
        - add spanning tree (STP) support
        - add rx copybreak support
        - add tc flower hardware offload on ingress traffic

   - ionic:
        - implement Rx page reuse
        - support HW PTP time-stamping

   - octeon: support TC hardware offloads - flower matching on ingress
     and egress ratelimitting.

   - stmmac:
        - add RX frame steering based on VLAN priority in tc flower
        - support frame preemption (FPE)
        - intel: add cross time-stamping freq difference adjustment

   - ocelot:
        - support forwarding of MRP frames in HW
        - support multiple bridges
        - support PTP Sync one-step timestamping

   - dsa: mv88e6xxx, dpaa2-switch: offload bridge port flags like
     learning, flooding etc.

   - ipa: add IPA v4.5, v4.9 and v4.11 support (Qualcomm SDX55, SM8350,
     SC7280 SoCs)

   - mt7601u: enable TDLS support

   - mt76:
        - add support for 802.3 rx frames (mt7915/mt7615)
        - mt7915 flash pre-calibration support
        - mt7921/mt7663 runtime power management fixes"

* tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2451 commits)
  net: selftest: fix build issue if INET is disabled
  net: netrom: nr_in: Remove redundant assignment to ns
  net: tun: Remove redundant assignment to ret
  net: phy: marvell: add downshift support for M88E1240
  net: dsa: ksz: Make reg_mib_cnt a u8 as it never exceeds 255
  net/sched: act_ct: Remove redundant ct get and check
  icmp: standardize naming of RFC 8335 PROBE constants
  bpf, selftests: Update array map tests for per-cpu batched ops
  bpf: Add batched ops support for percpu array
  bpf: Implement formatted output helpers with bstr_printf
  seq_file: Add a seq_bprintf function
  sfc: adjust efx->xdp_tx_queue_count with the real number of initialized queues
  net:nfc:digital: Fix a double free in digital_tg_recv_dep_req
  net: fix a concurrency bug in l2tp_tunnel_register()
  net/smc: Remove redundant assignment to rc
  mpls: Remove redundant assignment to err
  llc2: Remove redundant assignment to rc
  net/tls: Remove redundant initialization of record
  rds: Remove redundant assignment to nr_sig
  dt-bindings: net: mdio-gpio: add compatible for microchip,mdio-smi0
  ...
2021-04-29 11:57:23 -07:00
Linus Torvalds
3644286f6c \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmCJUfIACgkQnJ2qBz9k
 QNkStAf8CA7beya7LZ/GGN7HzXhv2cs+IpUFhRkynLklEM0lxKsOEagLFSZxkoMD
 IBSRSo4odkkderqI9W/yp+9OYhOd9+BQCq4isg1Gh9Tf5xANJEpLvBAPnWVhooJs
 9CrYZQY9Bdf+fF/8GHbKlrMAYm56vBCmWqyWTEtWUyPBOA12in2ZHQJmCa+5+nge
 zTT/B5cvuhN5K7uYhGM4YfeCU5DBmmvD4sV6YBTkQOgCU0bEF0f9R3JjHDo34a1s
 yqna3ypqKNRhsJVs8F+aOGRieUYxFoRqtYNHZK3qI9i07v7ndoTm5jzGN6OFlKs3
 U3rF9/+cBgeESahWG6IjHIqhXGXNhg==
 =KjNm
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify updates from Jan Kara:

 - support for limited fanotify functionality for unpriviledged users

 - faster merging of fanotify events

 - a few smaller fsnotify improvements

* tag 'fsnotify_for_v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  shmem: allow reporting fanotify events with file handles on tmpfs
  fs: introduce a wrapper uuid_to_fsid()
  fanotify_user: use upper_32_bits() to verify mask
  fanotify: support limited functionality for unprivileged users
  fanotify: configurable limits via sysfs
  fanotify: limit number of event merge attempts
  fsnotify: use hash table for faster events merge
  fanotify: mix event info and pid into merge key hash
  fanotify: reduce event objectid to 29-bit hash
  fsnotify: allow fsnotify_{peek,remove}_first_event with empty queue
2021-04-29 11:06:13 -07:00
Linus Torvalds
767fcbc80f \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmCJU1UACgkQnJ2qBz9k
 QNk62AgAgp05OIXU/AgObb7DvSyI3ycwCV8PeWBpwD8yoDAh5x0tmT7vnJu974p6
 yHdnF7rr69ZzvbNCHLJ5kRykRlUao9W7cO5fdOW1uTpL7Ic60QuJMks/NfgVTHp1
 2zIQmBDerfn1/LTK8r2pPGcvtcjRcr7Ep4beN0Duw57lfVMJhjsNRPnBbXGBcp0r
 QzKk4/8V3DCZvOw+XNC3nto7avjvf+nU9sJmuh83546eqh0atjWivvO5aAlDOe6W
 rhBiLlmP0in5u2n1fYqzI1OQvtgtleyEZT2G0CrbAZn0xjmV/if9wl+3K6TOwDvR
 778xDEX7sZCaO/xkB+WK3hrd15ftKg==
 =0kYE
 -----END PGP SIGNATURE-----

Merge tag 'for_v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull quota, ext2, reiserfs updates from Jan Kara:

 - support for path (instead of device) based quotactl syscall
   (quotactl_path(2))

 - ext2 conversion to kmap_local()

 - other minor cleanups & fixes

* tag 'for_v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fs/reiserfs/journal.c: delete useless variables
  fs/ext2: Replace kmap() with kmap_local_page()
  ext2: Match up ext2_put_page() with ext2_dotdot() and ext2_find_entry()
  fs/ext2/: fix misspellings using codespell tool
  quota: report warning limits for realtime space quotas
  quota: wire up quotactl_path
  quota: Add mountpath based quota support
2021-04-29 10:51:29 -07:00
Linus Torvalds
d2b6f8a179 New code for 5.13:
- Various minor fixes in online scrub.
 - Prevent metadata files from being automatically inactivated.
 - Validate btree heights by the computed per-btree limits.
 - Don't warn about remounting with deprecated mount options.
 - Initialize attr forks at create time if we suspect we're going to need
   to store them.
 - Reduce memory reallocation workouts in the logging code.
 - Fix some theoretical math calculation errors in logged buffers that
   span multiple discontig memory ranges but contiguous ondisk regions.
 - Speedups in dirty buffer bitmap handling.
 - Make type verifier functions more inline-happy to reduce overhead.
 - Reduce debug overhead in directory checking code.
 - Many many typo fixes.
 - Begin to handle the permanent loss of the very end of a filesystem.
 - Fold struct xfs_icdinode into xfs_inode.
 - Deprecate the long defunct BMV_IF_NO_DMAPI_READ from the bmapx ioctl.
 - Remove a broken directory block format check from online scrub.
 - Fix a bug where we could produce an unnecessarily tall data fork btree
   when creating an attr fork.
 - Fix scrub and readonly remounts racing.
 - Fix a writeback ioend log deadlock problem by dropping the behavior
   where we could preallocate a setfilesize transaction.
 - Fix some bugs in the new extent count checking code.
 - Fix some bugs in the attr fork preallocation code.
 - Refactor if_flags out of the incore inode fork data structure.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmB6MFUACgkQ+H93GTRK
 tOvigBAAlpzBUXnZVo+U18u0tSHnq5c1zbXYcf5GPhQv9w3n3TlPi3YhK2vgEXlI
 TULwsdU+an30oqWkQiVrwQjKPVaTWeWE3K0sA2MlYX9L2CwPPde4x5hwhyppfQxq
 mQyu0suWp480ao7vToXAgZ751OdZRtGu8sRQ7rVQ/FVf9K4R8EqpZMEynNry25f+
 hpK235hpf4IUC9E1A4pE2hNBSr/LGPIyu1t5sZsfazcNmtpKcauy5R5b8Pdnzo2/
 WFa6PoeE8SRIp4OxZY/c/4QUI5cRubJGyoB+kbl0hg69uYIJO+pc+R69yrQPD9Z+
 JDW/FktH+Zz4pstFsC+qnSvhRaF2DvXpvXrIldonQ2Z2ByVqbs3r6HzKySlWQ+QE
 jU717HApWl/ADI/kVD2IuQnrbU+Q8Ue8thzgQeEpTRWsea2HzPMofNi5FImU2ulw
 g4V7PleQWJ6AsLhcpfA46Y+CUAtjTD1Tvj67JpXuWJ+MFTB4hRm3U7zgCtV/0c3T
 wBBUybQjDoVA6DDr6CP/9ki1k0BO3wKJGlZMR0bkEsuxXdFNTvHEz5lmueYT/Wxc
 D91+oRbna9NpEeIVFGo6lhMIu2t0iYssFdgQKyn1jXrpGXKvOklP8zDjRdPnnQmz
 plT2ajlXPIjc6KjOTP2mbVqKs059LuJoYV7gIWwM7CgtFsMIrd8=
 =oRKe
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.13-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Darrick Wong:
 "The notable user-visible addition this cycle is ability to remove
  space from the last AG in a filesystem. This is the first of many
  changes needed for full-fledged support for shrinking a filesystem.
  Still needed are (a) the ability to reorganize files and metadata away
  from the end of the fs; (b) the ability to remove entire allocation
  groups; (c) shrink support for realtime volumes; and (d) thorough
  testing of (a-c).

  There are a number of performance improvements in this code drop: Dave
  streamlined various parts of the buffer logging code and reduced the
  cost of various debugging checks, and added the ability to pre-create
  the xattr structures while creating files. Brian eliminated
  transaction reservations that were being held across writeback (thus
  reducing livelock potential.

  Other random pieces: Pavel fixed the repetitve warnings about
  deprecated mount options, I fixed online fsck to behave itself when a
  readonly remount comes in during scrub, and refactored various other
  parts of that code, Christoph contributed a lot of refactoring this
  cycle. The xfs_icdinode structure has been absorbed into the (incore)
  xfs_inode structure, and the format and flags handling around
  xfs_inode_fork structures has been simplified. Chandan provided a
  number of fixes for extent count overflow related problems that have
  been shaken out by debugging knobs added during 5.12.

  Summary:

   - Various minor fixes in online scrub.

   - Prevent metadata files from being automatically inactivated.

   - Validate btree heights by the computed per-btree limits.

   - Don't warn about remounting with deprecated mount options.

   - Initialize attr forks at create time if we suspect we're going to
     need to store them.

   - Reduce memory reallocation workouts in the logging code.

   - Fix some theoretical math calculation errors in logged buffers that
     span multiple discontig memory ranges but contiguous ondisk
     regions.

   - Speedups in dirty buffer bitmap handling.

   - Make type verifier functions more inline-happy to reduce overhead.

   - Reduce debug overhead in directory checking code.

   - Many many typo fixes.

   - Begin to handle the permanent loss of the very end of a filesystem.

   - Fold struct xfs_icdinode into xfs_inode.

   - Deprecate the long defunct BMV_IF_NO_DMAPI_READ from the bmapx
     ioctl.

   - Remove a broken directory block format check from online scrub.

   - Fix a bug where we could produce an unnecessarily tall data fork
     btree when creating an attr fork.

   - Fix scrub and readonly remounts racing.

   - Fix a writeback ioend log deadlock problem by dropping the behavior
     where we could preallocate a setfilesize transaction.

   - Fix some bugs in the new extent count checking code.

   - Fix some bugs in the attr fork preallocation code.

   - Refactor if_flags out of the incore inode fork data structure"

* tag 'xfs-5.13-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (77 commits)
  xfs: remove xfs_quiesce_attr declaration
  xfs: remove XFS_IFEXTENTS
  xfs: remove XFS_IFINLINE
  xfs: remove XFS_IFBROOT
  xfs: only look at the fork format in xfs_idestroy_fork
  xfs: simplify xfs_attr_remove_args
  xfs: rename and simplify xfs_bmap_one_block
  xfs: move the XFS_IFEXTENTS check into xfs_iread_extents
  xfs: drop unnecessary setfilesize helper
  xfs: drop unused ioend private merge and setfilesize code
  xfs: open code ioend needs workqueue helper
  xfs: drop submit side trans alloc for append ioends
  xfs: fix return of uninitialized value in variable error
  xfs: get rid of the ip parameter to xchk_setup_*
  xfs: fix scrub and remount-ro protection when running scrub
  xfs: move the check for post-EOF mappings into xfs_can_free_eofblocks
  xfs: move the xfs_can_free_eofblocks call under the IOLOCK
  xfs: precalculate default inode attribute offset
  xfs: default attr fork size does not handle device inodes
  xfs: inode fork allocation depends on XFS_IFEXTENT flag
  ...
2021-04-29 10:43:51 -07:00
Linus Torvalds
f2c80837e2 Changes in gfs2:
- Fix some compiler and kernel-doc warnings.
 
 - Various minor cleanups and optimizations.
 
 - Add a new sysfs gfs2 status file with some filesystem wide
   information.
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmCKiWIUHGFncnVlbmJh
 QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTqXOA//cgEMi+WZ0pQ1m4Z7Yk58ArAGXOW4
 L+efdMjk2zoqgixF502tQzaa2ctz6XpukF4oZbM+Jc+yZxrbZ5CLjUIOWc9RH+Id
 WQwj5+5GLbMAPnn5ksHUCTK9V+1oAlpgoY4fMtLdKq234Y6xqWj5qBjvtGUTLFAl
 ACvy8FUZplFOkaOSBqgh221LT4Oh0Wthe/Elq5qvqwBfdAiE/p1sHSi2FWxktIlU
 wV3PKL96rFsnWN8E6jqyJR1RNJ5d5MYA+PDkTHKcoqcXZrzw4mfu2tCh88Bh9wFb
 MEyjtLxE09G1+3Li/T/Tb7qbRKWvxEmkLZXaFAjRUp7zYPvM6twKSg8nihcBDtLi
 UgvTrc208CYvYj7QpRQ1dU9lEg47A46rB8dgLz+ymlpNNk/G0gqgvWLevMKnBfaX
 AkZviI6qm1iNCBd6wWWPUKqR0qrCWqoe9N8F7cWyZBki7dKkoj29Gt1X1SeIQMjd
 n8Mkv6Btd39kBt3DydXlCEaREMQYeDrxBJHxur234hEfFLFraFj5tYFjoeSODZdg
 Uxsn5X5dgLy/hHjps8YcuBoEgRMR/aKovK5G1FXDcQR5O6UByqtJuJTJBT8jxAld
 vLYHqO6vxdgGATaYAkuGLSLrJjwES+7tEXjtrdarswGo55dPitwtOB1NRWuOor/Z
 uTnzJbykMdIzEHs=
 =VblJ
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-for-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 updates from Andreas Gruenbacher:

 - Fix some compiler and kernel-doc warnings

 - Various minor cleanups and optimizations

 - Add a new sysfs gfs2 status file with some filesystem wide
   information

* tag 'gfs2-for-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Fix fall-through warnings for Clang
  gfs2: Fix a number of kernel-doc warnings
  gfs2: Make gfs2_setattr_simple static
  gfs2: Add new sysfs file for gfs2 status
  gfs2: Silence possible null pointer dereference warning
  gfs2: Turn gfs2_meta_indirect_buffer into gfs2_meta_buffer
  gfs2: Replace gfs2_lblk_to_dblk with gfs2_get_extent
  gfs2: Turn gfs2_extent_map into gfs2_{get,alloc}_extent
  gfs2: Add new gfs2_iomap_get helper
  gfs2: Remove unused variable sb_format
  gfs2: Fix dir.c function parameter descriptions
  gfs2: Eliminate gh parameter from go_xmote_bh func
  gfs2: don't create empty buffers for NO_CREATE
2021-04-29 10:33:35 -07:00
Linus Torvalds
8ae8932c6a Description for this pull request:
- Improve write performance with dirsync mount option.
  - Improve lookup performance.
  - Add support for FITRIM ioctl.
  - Fix a bug with discard option.
 -----BEGIN PGP SIGNATURE-----
 
 iQJMBAABCgA2FiEE6NzKS6Uv/XAAGHgyZwv7A1FEIQgFAmCIADwYHG5hbWphZS5q
 ZW9uQHNhbXN1bmcuY29tAAoJEGcL+wNRRCEIalwP/jepL2Y7B8FOhMH1R4whbY6x
 5fFlG608tR7jxUEMpAfMTcFC2+5nKpHN1keoHV8SXdJz2esd1ZUknaok5CIVQFYX
 I7TTOWn/genJ6nVmHeTLZLAxAWznSlIqRdeuSnDMLMKSLDn7noW5l+vzgYy5dnaF
 SzICtNXJqA7D7PcMIwTtfszcSEwmt1aLiRZZabZ+y0xSF2ha0mRB6B3hCx9sDXh5
 iVVbSn10lk6ULENYcjaUZhF1Dt9Lv4pOZ9drr8bVfRmWBeWspB/X4/TgO+Mf/xP8
 5el7OL965ofUaaCmhyfDMrWCBffeFbf0K8sUM/psiAUuKTugKvlZniZ0RB8zLEvZ
 Bn2yvr+walpdkVNBbv3qi32/4xAx5ng90rSdqX3bvY3Zq1UDW49MIIZOh6uQCNnD
 9ravgGoU2ujVcUQUKEU01dj1ora83IKQ7WuxcGXQOt/xNFMRG1FmiKgS7aj2HgMM
 ax9pjZZY4H8ZUZOC2hpph+x6Kc+ZQDqzR6+0qbWM2/hlLgIURDXTqVe3m1DCuvDI
 c2rvtpVtpYGz+HTZW+krUzc4kOAfUsS98D+7bLS5X/f4x4e5FxQDGyXaCNUI0OCr
 ed8kHmHQsKvbIui4RHgZvZylj2xO34arSuMURL41ifuT9jFoQVxPcGluXNvbdcu8
 yOyqeaqwgh5AJzXPVsRv
 =igtQ
 -----END PGP SIGNATURE-----

Merge tag 'exfat-for-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat

Pull exfat updates from Namjae Jeon:

 - Improve write performance with dirsync mount option

 - Improve lookup performance

 - Add support for FITRIM ioctl

 - Fix a bug with discard option

* tag 'exfat-for-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
  exfat: speed up iterate/lookup by fixing start point of traversing cluster chain
  exfat: improve write performance when dirsync enabled
  exfat: add support ioctl and FITRIM function
  exfat: introduce bitmap_lock for cluster bitmap access
  exfat: fix erroneous discard when clear cluster bit
2021-04-29 10:32:18 -07:00
Darrick J. Wong
d4f74e162d xfs: fix xfs_reflink_unshare usage of filemap_write_and_wait_range
The final parameter of filemap_write_and_wait_range is the end of the
range to flush, not the length of the range to flush.

Fixes: 46afb0628b ("xfs: only flush the unshared range in xfs_reflink_unshare")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-04-29 07:45:44 -07:00
Brian Foster
fd43cf600c xfs: set aside allocation btree blocks from block reservation
The blocks used for allocation btrees (bnobt and countbt) are
technically considered free space. This is because as free space is
used, allocbt blocks are removed and naturally become available for
traditional allocation. However, this means that a significant
portion of free space may consist of in-use btree blocks if free
space is severely fragmented.

On large filesystems with large perag reservations, this can lead to
a rare but nasty condition where a significant amount of physical
free space is available, but the majority of actual usable blocks
consist of in-use allocbt blocks. We have a record of a (~12TB, 32
AG) filesystem with multiple AGs in a state with ~2.5GB or so free
blocks tracked across ~300 total allocbt blocks, but effectively at
100% full because the the free space is entirely consumed by
refcountbt perag reservation.

Such a large perag reservation is by design on large filesystems.
The problem is that because the free space is so fragmented, this AG
contributes the 300 or so allocbt blocks to the global counters as
free space. If this pattern repeats across enough AGs, the
filesystem lands in a state where global block reservation can
outrun physical block availability. For example, a streaming
buffered write on the affected filesystem continues to allow delayed
allocation beyond the point where writeback starts to fail due to
physical block allocation failures. The expected behavior is for the
delalloc block reservation to fail gracefully with -ENOSPC before
physical block allocation failure is a possibility.

To address this problem, set aside in-use allocbt blocks at
reservation time and thus ensure they cannot be reserved until truly
available for physical allocation. This allows alloc btree metadata
to continue to reside in free space, but dynamically adjusts
reservation availability based on internal state. Note that the
logic requires that the allocbt counter is fully populated at
reservation time before it is fully effective. We currently rely on
the mount time AGF scan in the perag reservation initialization code
for this dependency on filesystems where it's most important (i.e.
with active perag reservations).

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-29 07:45:44 -07:00
Brian Foster
16eaab839a xfs: introduce in-core global counter of allocbt blocks
Introduce an in-core counter to track the sum of all allocbt blocks
used by the filesystem. This value is currently tracked per-ag via
the ->agf_btreeblks field in the AGF, which also happens to include
rmapbt blocks. A global, in-core count of allocbt blocks is required
to identify the subset of global ->m_fdblocks that consists of
unavailable blocks currently used for allocation btrees. To support
this calculation at block reservation time, construct a similar
global counter for allocbt blocks, populate it on first read of each
AGF and update it as allocbt blocks are used and released.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-29 07:45:44 -07:00
Brian Foster
2675ad3890 xfs: unconditionally read all AGFs on mounts with perag reservation
perag reservation is enabled at mount time on a per AG basis. The
upcoming change to set aside allocbt blocks from block reservation
requires a populated allocbt counter as soon as possible after mount
to be fully effective against large perag reservations. Therefore as
a preparation step, initialize the pagf on all mounts where at least
one reservation is active. Note that this already occurs to some
degree on most default format filesystems as reservation requirement
calculations already depend on the AGF or AGI, depending on the
reservation type.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-04-29 07:45:44 -07:00
Darrick J. Wong
e147a756ab xfs: count free space btree blocks when scrubbing pre-lazysbcount fses
Since agf_btreeblks didn't exist before the lazysbcount feature, the fs
summary count scrubber needs to walk the free space btrees to determine
the amount of space being used by those btrees.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
2021-04-29 07:44:19 -07:00
Dave Chinner
6543990a16 xfs: update superblock counters correctly for !lazysbcount
Keep the mount superblock counters up to date for !lazysbcount
filesystems so that when we log the superblock they do not need
updating in any way because they are already correct.

It's found by what Zorro reported:
1. mkfs.xfs -f -l lazy-count=0 -m crc=0 $dev
2. mount $dev $mnt
3. fsstress -d $mnt -p 100 -n 1000 (maybe need more or less io load)
4. umount $mnt
5. xfs_repair -n $dev
and I've seen no problem with this patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reported-by: Zorro Lang <zlang@redhat.com>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-04-29 07:44:18 -07:00
Darrick J. Wong
e6c01077ec xfs: don't check agf_btreeblks on pre-lazysbcount filesystems
The AGF free space btree block counter wasn't added until the
lazysbcount feature was added to XFS midway through the life of the V4
format, so ignore the field when checking.  Online AGF repair requires
rmapbt, so it doesn't need the feature check.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-04-29 07:44:18 -07:00
Darrick J. Wong
1aec7c3d05 xfs: remove obsolete AGF counter debugging
In commit f8f2835a9c we changed the behavior of XFS to use EFIs to
remove blocks from an overfilled AGFL because there were complaints
about transaction overruns that stemmed from trying to free multiple
blocks in a single transaction.

Unfortunately, that commit missed a subtlety in the debug-mode
transaction accounting when a realtime volume is attached.  If a
realtime file undergoes a data fork mapping change such that realtime
extents are allocated (or freed) in the same transaction that a data
device block is also allocated (or freed), we can trip a debugging
assertion.  This can happen (for example) if a realtime extent is
allocated and it is necessary to reshape the bmbt to hold the new
mapping.

When we go to allocate a bmbt block from an AG, the first thing the data
device block allocator does is ensure that the freelist is the proper
length.  If the freelist is too long, it will trim the freelist to the
proper length.

In debug mode, trimming the freelist calls xfs_trans_agflist_delta() to
record the decrement in the AG free list count.  Prior to f8f28 we would
put the free block back in the free space btrees in the same
transaction, which calls xfs_trans_agblocks_delta() to record the
increment in the AG free block count.  Since AGFL blocks are included in
the global free block count (fdblocks), there is no corresponding
fdblocks update, so the AGFL free satisfies the following condition in
xfs_trans_apply_sb_deltas:

	/*
	 * Check that superblock mods match the mods made to AGF counters.
	 */
	ASSERT((tp->t_fdblocks_delta + tp->t_res_fdblocks_delta) ==
	       (tp->t_ag_freeblks_delta + tp->t_ag_flist_delta +
		tp->t_ag_btree_delta));

The comparison here used to be: (X + 0) == ((X+1) + -1 + 0), where X is
the number blocks that were allocated.

After commit f8f28 we defer the block freeing to the next chained
transaction, which means that the calls to xfs_trans_agflist_delta and
xfs_trans_agblocks_delta occur in separate transactions.  The (first)
transaction that shortens the free list trips on the comparison, which
has now become:

(X + 0) == ((X) + -1 + 0)

because we haven't freed the AGFL block yet; we've only logged an
intention to free it.  When the second transaction (the deferred free)
commits, it will evaluate the expression as:

(0 + 0) == (1 + 0 + 0)

and trip over that in turn.

At this point, the astute reader may note that the two commits tagged by
this patch have been in the kernel for a long time but haven't generated
any bug reports.  How is it that the author became aware of this bug?

This originally surfaced as an intermittent failure when I was testing
realtime rmap, but a different bug report by Zorro Lang reveals the same
assertion occuring on !lazysbcount filesystems.

The common factor to both reports (and why this problem wasn't
previously reported) becomes apparent if we consider when
xfs_trans_apply_sb_deltas is called by __xfs_trans_commit():

	if (tp->t_flags & XFS_TRANS_SB_DIRTY)
		xfs_trans_apply_sb_deltas(tp);

With a modern lazysbcount filesystem, transactions update only the
percpu counters, so they don't need to set XFS_TRANS_SB_DIRTY, hence
xfs_trans_apply_sb_deltas is rarely called.

However, updates to the count of free realtime extents are not part of
lazysbcount, so XFS_TRANS_SB_DIRTY will be set on transactions adding or
removing data fork mappings to realtime files; similarly,
XFS_TRANS_SB_DIRTY is always set on !lazysbcount filesystems.

Dave mentioned in response to an earlier version of this patch:

"IIUC, what you are saying is that this debug code is simply not
exercised in normal testing and hasn't been for the past decade?  And it
still won't be exercised on anything other than realtime device testing?

"...it was debugging code from 1994 that was largely turned into dead
code when lazysbcounters were introduced in 2007. Hence I'm not sure it
holds any value anymore."

This debugging code isn't especially helpful - you can modify the
flcount on one AG and the freeblks of another AG, and it won't trigger.
Add the fact that nobody noticed for a decade, and let's just get rid of
it (and start testing realtime :P).

This bug was found by running generic/051 on either a V4 filesystem
lacking lazysbcount; or a V5 filesystem with a realtime volume.

Cc: bfoster@redhat.com, zlang@redhat.com
Fixes: f8f2835a9c ("xfs: defer agfl block frees when dfops is available")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-04-29 07:44:18 -07:00
Mike Marshall
211f9f2e05 orangefs: leave files in the page cache for a few micro seconds at least
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2021-04-29 08:06:05 -04:00
Linus Torvalds
625434dafd for-5.13/io_uring-2021-04-27
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCIRBUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjt5D/9de6zCaha6CyfIIPiU+crropQ2jPzO49cb
 WzcOCmdhSv0GtYlhdnIqCOo5p8mRDWJAEBU9upTDTCWOx9hwr5Ms0TCNQHxuQ/T0
 4Ll+/cMsOxeTypiykfMtOG9TEmYSria2vTJKLgpyaP4ohfJa3uT7r2NZ8NK/8T4t
 wwbJ+jCSKewelI1l0XD8k8LBU39FS/KRgLTdfYj/rCW3PWt/ZE2eSIYjZQvMCVOC
 3fIdgOOJAMQVQafz+YAeJd2E+/l5/8YcJVKpJMVtBNbqTHIjA4EsInZauy8TpBgW
 OzJ3I+XdF70qZM119tI/nXw3sb0e+UV0fRsIXLkOwTEBzowernrAtsEwAOP+qFKS
 2YnqSKOSjMO5d5Mpkz6T0MDMloU45jph88lUH0RoShVxGa7jv+TMOL6QU1oOyxc1
 +gPPbApQs9WtSZDHsTJ0xFLpol804UDQmwb38mHdzedDVSE7iip1jANkw6LEhKkJ
 Mlg60ZF1Z305G+cDhrbs02ZGVa+fzbrtXtLlTqZw8bNX9lBp0JLtDpzskjbnUmck
 6A04nfg+Eto5GvAn+FRBuOCPridLEk2K6ygko/gwQWsYCgqkCgRuqjlIQCSZy5iu
 jHEFixIXKn6eACf+YzLVxSLyEQrmFyDSypbN7LvzoKJYo/loy8Q1+42nGlrVC3zi
 +CB1NokPng==
 =ZJ8L
 -----END PGP SIGNATURE-----

Merge tag 'for-5.13/io_uring-2021-04-27' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:

 - Support for multi-shot mode for POLL requests

 - More efficient reference counting. This is shamelessly stolen from
   the mm side. Even though referencing is mostly single/dual user, the
   128 count was retained to keep the code the same. Maybe this
   should/could be made generic at some point.

 - Removal of the need to have a manager thread for each ring. The
   manager threads only job was checking and creating new io-threads as
   needed, instead we handle this from the queue path.

 - Allow SQPOLL without CAP_SYS_ADMIN or CAP_SYS_NICE. Since 5.12, this
   thread is "just" a regular application thread, so no need to restrict
   use of it anymore.

 - Cleanup of how internal async poll data lifetime is managed.

 - Fix for syzbot reported crash on SQPOLL cancelation.

 - Make buffer registration more like file registrations, which includes
   flexibility in avoiding full set unregistration and re-registration.

 - Fix for io-wq affinity setting.

 - Be a bit more defensive in task->pf_io_worker setup.

 - Various SQPOLL fixes.

 - Cleanup of SQPOLL creds handling.

 - Improvements to in-flight request tracking.

 - File registration cleanups.

 - Tons of cleanups and little fixes

* tag 'for-5.13/io_uring-2021-04-27' of git://git.kernel.dk/linux-block: (156 commits)
  io_uring: maintain drain logic for multishot poll requests
  io_uring: Check current->io_uring in io_uring_cancel_sqpoll
  io_uring: fix NULL reg-buffer
  io_uring: simplify SQPOLL cancellations
  io_uring: fix work_exit sqpoll cancellations
  io_uring: Fix uninitialized variable up.resv
  io_uring: fix invalid error check after malloc
  io_uring: io_sq_thread() no longer needs to reset current->pf_io_worker
  kernel: always initialize task->pf_io_worker to NULL
  io_uring: update sq_thread_idle after ctx deleted
  io_uring: add full-fledged dynamic buffers support
  io_uring: implement fixed buffers registration similar to fixed files
  io_uring: prepare fixed rw for dynanic buffers
  io_uring: keep table of pointers to ubufs
  io_uring: add generic rsrc update with tags
  io_uring: add IORING_REGISTER_RSRC
  io_uring: enumerate dynamic resources
  io_uring: add generic path for rsrc update
  io_uring: preparation for rsrc tagging
  io_uring: decouple CQE filling from requests
  ...
2021-04-28 14:56:09 -07:00
Linus Torvalds
fc05860628 for-5.13/drivers-2021-04-27
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCIJYcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpieWD/92qbtWl/z+9oCY212xV+YMoMqj/vGROX+U
 9i/FQJ3AIC/AUoNjZeW3NIbiaNqde5mrLlUSCHgn6RLsHK7p0GQJ4ohpbIGFG5+i
 2+Efm+vjlCxLVGrkeZEwMtsht7w/NbOYDr1Rgv9b4lQ6iWI11Mg8E337Whl1me1k
 h6bEXaioK9yqxYtsLgcn9I1qQ2p7gok0HX7zFU/XxEUZylqH6E4vQhj2+NL8UUqE
 7siFHADZE99Z7LXtOkl8YyOlGU52RCUzqDHWydvkipKjgYBi95HLXGT64Z+WCEvz
 HI54oVDRWr+uWdqDFfy+ncHm8pNeP0GV9JPhDz4ELRTSndoxB2il7wRLvp6wxV9d
 8Y4j7vb30i+8GGbM0c79dnlG76D9r5ivbTKixcXFKB128NusQR6JymIv1pKlSKhk
 H871/iOarrepAAUwVR5CtldDDJCy/q1Hks+7UXbaM3F9iNitxsJNZryQq9xdTu/N
 ThFOTz+VECG4RJLxIwmsWGiLgwr52/ybAl2MBcn+s7uC4jM/TFKpdQBfQnOAiINb
 MLlfuYRRSMg1Osb2fYZneR2ifmSNOMRdDJb+tsZGz4xWmZcj0uL4QgqcsOvuiOEQ
 veF/Ky50qw57hWtiEhvqa7/WIxzNF3G3wejqqA8hpT9Qifu0QawYTnXGUttYNBB1
 mO9R3/ccaw==
 =c0x4
 -----END PGP SIGNATURE-----

Merge tag 'for-5.13/drivers-2021-04-27' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:

 - MD changes via Song:
        - raid5 POWER fix
        - raid1 failure fix
        - UAF fix for md cluster
        - mddev_find_or_alloc() clean up
        - Fix NULL pointer deref with external bitmap
        - Performance improvement for raid10 discard requests
        - Fix missing information of /proc/mdstat

 - rsxx const qualifier removal (Arnd)

 - Expose allocated brd pages (Calvin)

 - rnbd via Gioh Kim:
        - Change maintainer
        - Change domain address of maintainers' email
        - Add polling IO mode and document update
        - Fix memory leak and some bug detected by static code analysis
          tools
        - Code refactoring

 - Series of floppy cleanups/fixes (Denis)

 - s390 dasd fixes (Julian)

 - kerneldoc fixes (Lee)

 - null_blk double free (Lv)

 - null_blk virtual boundary addition (Max)

 - Remove xsysace driver (Michal)

 - umem driver removal (Davidlohr)

 - ataflop fixes (Dan)

 - Revalidate disk removal (Christoph)

 - Bounce buffer cleanups (Christoph)

 - Mark lightnvm as deprecated (Christoph)

 - mtip32xx init cleanups (Shixin)

 - Various fixes (Tian, Gustavo, Coly, Yang, Zhang, Zhiqiang)

* tag 'for-5.13/drivers-2021-04-27' of git://git.kernel.dk/linux-block: (143 commits)
  async_xor: increase src_offs when dropping destination page
  drivers/block/null_blk/main: Fix a double free in null_init.
  md/raid1: properly indicate failure when ending a failed write request
  md-cluster: fix use-after-free issue when removing rdev
  nvme: introduce generic per-namespace chardev
  nvme: cleanup nvme_configure_apst
  nvme: do not try to reconfigure APST when the controller is not live
  nvme: add 'kato' sysfs attribute
  nvme: sanitize KATO setting
  nvmet: avoid queuing keep-alive timer if it is disabled
  brd: expose number of allocated pages in debugfs
  ataflop: fix off by one in ataflop_probe()
  ataflop: potential out of bounds in do_format()
  drbd: Fix fall-through warnings for Clang
  block/rnbd: Use strscpy instead of strlcpy
  block/rnbd-clt-sysfs: Remove copy buffer overlap in rnbd_clt_get_path_name
  block/rnbd-clt: Remove max_segment_size
  block/rnbd-clt: Generate kobject_uevent when the rnbd device state changes
  block/rnbd-srv: Remove unused arguments of rnbd_srv_rdma_ev
  Documentation/ABI/rnbd-clt: Add description for nr_poll_queues
  ...
2021-04-28 14:39:37 -07:00
Linus Torvalds
6c00292113 for-5.13/block-2021-04-27
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCIJW0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpr8sD/4qP+MsFTB1IFUu8fW7BjBPdduoK8Vq9o3S
 HB8iF/yhJZ73nLecMMdn/jTO8SCW0Iw+okywW3BugGnNPbwXo0UQ4jLhzbTts76P
 JvZaguZFhBsF3ceFOt3CRCQDOeoDfMp3sitLUVivkN+2vwMs9vJpVNaEeUjcCC1Z
 8QjlpqYSMuakTwEn7QhlnKxVWn1V2B6PDjZMcf48ONRZGsCkoOXH1SE4Ge8nxjqa
 KHKO5bvwgRzGhKpvdHEIl8dmFL9WEWElBVoY3vE2EHL0SPE32zHlxtYLS0NAhY2M
 aprkJ0QP0Rgl8HpYiCstwAnJGKDg4a0ArWhf/CJTuLAWmTNFR7v5n7vw2SilJHTG
 0FtiFiOnpvvBmUC0B1PUEQX8AiFcdXueLb6xboExcp2WtxIAe8wPoGFl6T1tobBY
 qsfWggGs/vD1RVrJISPC+20cJemcRyeakMV48w+n3Lt/ES3IEv/LXx6PO/PbXvOo
 B7HJXTofkoaX52A/1+NxraGapwzhYouhi6Sb6Fc++X59/a/oBuOUGuur0eZ+/oWA
 9787mUUDmW/sahfZUgZh5AxqKo2jJULjeggANCICW9/RN6duV8TBQVOLW1/0Wddp
 9lndiA9ZMveWF+J19+sjBoiYMYawLmURaOlDK77ctTCcR/ji3l4GZ+2KvBEMeIT8
 O1OYEnwaIQ==
 =oza6
 -----END PGP SIGNATURE-----

Merge tag 'for-5.13/block-2021-04-27' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "Pretty quiet round this time, which is nice. In detail:

   - Series revamping bounce buffer support (Christoph)

   - Dead code removal (Christoph, Bart)

   - Partition iteration revamp, now using xarray (Christoph)

   - Passthrough request scheduler improvements (Lin)

   - Series of BFQ improvements (Paolo)

   - Fix ioprio task iteration (Peter)

   - Various little tweaks and fixes (Tejun, Saravanan, Bhaskar, Max,
     Nikolay)"

* tag 'for-5.13/block-2021-04-27' of git://git.kernel.dk/linux-block: (41 commits)
  blk-iocost: don't ignore vrate_min on QD contention
  blk-mq: Fix spurious debugfs directory creation during initialization
  bfq/mq-deadline: remove redundant check for passthrough request
  blk-mq: bypass IO scheduler's limit_depth for passthrough request
  block: Remove an obsolete comment from sg_io()
  block: move bio_list_copy_data to pktcdvd
  block: remove zero_fill_bio_iter
  block: add queue_to_disk() to get gendisk from request_queue
  block: remove an incorrect check from blk_rq_append_bio
  block: initialize ret in bdev_disk_changed
  block: Fix sys_ioprio_set(.which=IOPRIO_WHO_PGRP) task iteration
  block: remove disk_part_iter
  block: simplify diskstats_show
  block: simplify show_partition
  block: simplify printk_all_partitions
  block: simplify partition_overlaps
  block: simplify partition removal
  block: take bd_mutex around delete_partitions in del_gendisk
  block: refactor blk_drop_partitions
  block: move more syncing and invalidation to delete_partition
  ...
2021-04-28 14:27:12 -07:00
Linus Torvalds
16b3d0cf5b Scheduler updates for this cycle are:
- Clean up SCHED_DEBUG: move the decades old mess of sysctl, procfs and debugfs interfaces
    to a unified debugfs interface.
 
  - Signals: Allow caching one sigqueue object per task, to improve performance & latencies.
 
  - Improve newidle_balance() irq-off latencies on systems with a large number of CPU cgroups.
 
  - Improve energy-aware scheduling
 
  - Improve the PELT metrics for certain workloads
 
  - Reintroduce select_idle_smt() to improve load-balancing locality - but without the previous
    regressions
 
  - Add 'scheduler latency debugging': warn after long periods of pending need_resched. This
    is an opt-in feature that requires the enabling of the LATENCY_WARN scheduler feature,
    or the use of the resched_latency_warn_ms=xx boot parameter.
 
  - CPU hotplug fixes for HP-rollback, and for the 'fail' interface. Fix remaining
    balance_push() vs. hotplug holes/races
 
  - PSI fixes, plus allow /proc/pressure/ files to be written by CAP_SYS_RESOURCE tasks as well
 
  - Fix/improve various load-balancing corner cases vs. capacity margins
 
  - Fix sched topology on systems with NUMA diameter of 3 or above
 
  - Fix PF_KTHREAD vs to_kthread() race
 
  - Minor rseq optimizations
 
  - Misc cleanups, optimizations, fixes and smaller updates
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmCJInsRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1i5XxAArh0b+fwXlkVGzTUly7HQjhU7lFbChnmF
 h6ToyNLi6pXoZ14VC/WoRIME+RzK3gmw9cEFaSLVPxbkbekTcyWS78kqmcg1/j2v
 kO/20QhXobiIxVskYfoMmqSavZ5mKhMWBqtFXkCuYfxwGylas0VVdh3AZLJ7N21G
 WEoFh99pVULwWnPHxM2ZQ87Ex9BkGKbsBTswxWpprCfXLqD0N2hHlABpwJP78zRf
 VniWFOcC7lslILCFawb7CqGgAwbgV85nDRS4QCuCKisrkFywvjJrEeu/W+h1NfhF
 d6ves/osNdEAM1DSALoxwEA42An8l8xh8NyJnl8JZV00LW0DM108O5/7pf5Zcryc
 RHV3RxA7skgezBh5uThvo60QzNK+kVMatI4qpQEHxLE52CaDl/fBu1Cgb/VUxnIl
 AEBfyiFbk+skHpuMFKtl30Tx3M+yJKMTzFPd4kYjHYGEDwtAcXcB3dJQW48A79i3
 H3IWcDcXpk5Rjo2UZmaXdt/qlj7mP6U0xdOUq8ZK6JOC4uY9skszVGsfuNN9QQ5u
 2E2YKKVrGFoQydl4C8R6A7axL2VzIJszHFZNipd8E3YOyW7PWRAkr02tOOkBTj8N
 dLMcNM7aPJWqEYiEIjEzGQN20pweJ1dRA29LDuOswKh+7W2bWTQFh6F2Q8Haansc
 RVg5PDzl+Mc=
 =E7mz
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - Clean up SCHED_DEBUG: move the decades old mess of sysctl, procfs and
   debugfs interfaces to a unified debugfs interface.

 - Signals: Allow caching one sigqueue object per task, to improve
   performance & latencies.

 - Improve newidle_balance() irq-off latencies on systems with a large
   number of CPU cgroups.

 - Improve energy-aware scheduling

 - Improve the PELT metrics for certain workloads

 - Reintroduce select_idle_smt() to improve load-balancing locality -
   but without the previous regressions

 - Add 'scheduler latency debugging': warn after long periods of pending
   need_resched. This is an opt-in feature that requires the enabling of
   the LATENCY_WARN scheduler feature, or the use of the
   resched_latency_warn_ms=xx boot parameter.

 - CPU hotplug fixes for HP-rollback, and for the 'fail' interface. Fix
   remaining balance_push() vs. hotplug holes/races

 - PSI fixes, plus allow /proc/pressure/ files to be written by
   CAP_SYS_RESOURCE tasks as well

 - Fix/improve various load-balancing corner cases vs. capacity margins

 - Fix sched topology on systems with NUMA diameter of 3 or above

 - Fix PF_KTHREAD vs to_kthread() race

 - Minor rseq optimizations

 - Misc cleanups, optimizations, fixes and smaller updates

* tag 'sched-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
  cpumask/hotplug: Fix cpu_dying() state tracking
  kthread: Fix PF_KTHREAD vs to_kthread() race
  sched/debug: Fix cgroup_path[] serialization
  sched,psi: Handle potential task count underflow bugs more gracefully
  sched: Warn on long periods of pending need_resched
  sched/fair: Move update_nohz_stats() to the CONFIG_NO_HZ_COMMON block to simplify the code & fix an unused function warning
  sched/debug: Rename the sched_debug parameter to sched_verbose
  sched,fair: Alternative sched_slice()
  sched: Move /proc/sched_debug to debugfs
  sched,debug: Convert sysctl sched_domains to debugfs
  debugfs: Implement debugfs_create_str()
  sched,preempt: Move preempt_dynamic to debug.c
  sched: Move SCHED_DEBUG sysctl to debugfs
  sched: Don't make LATENCYTOP select SCHED_DEBUG
  sched: Remove sched_schedstats sysctl out from under SCHED_DEBUG
  sched/numa: Allow runtime enabling/disabling of NUMA balance without SCHED_DEBUG
  sched: Use cpu_dying() to fix balance_push vs hotplug-rollback
  cpumask: Introduce DYING mask
  cpumask: Make cpu_{online,possible,present,active}() inline
  rseq: Optimise rseq_get_rseq_cs() and clear_rseq_cs()
  ...
2021-04-28 13:33:57 -07:00
Linus Torvalds
42dec9a936 Perf events changes in this cycle were:
- Improve Intel uncore PMU support:
 
      - Parse uncore 'discovery tables' - a new hardware capability enumeration method
        introduced on the latest Intel platforms. This table is in a well-defined PCI
        namespace location and is read via MMIO. It is organized in an rbtree.
 
        These uncore tables will allow the discovery of standard counter blocks, but
        fancier counters still need to be enumerated explicitly.
 
      - Add Alder Lake support
 
      - Improve IIO stacks to PMON mapping support on Skylake servers
 
  - Add Intel Alder Lake PMU support - which requires the introduction of 'hybrid' CPUs
    and PMUs. Alder Lake is a mix of Golden Cove ('big') and Gracemont ('small' - Atom derived)
    cores.
 
    The CPU-side feature set is entirely symmetrical - but on the PMU side there's
    core type dependent PMU functionality.
 
  - Reduce data loss with CPU level hardware tracing on Intel PT / AUX profiling, by
    fixing the AUX allocation watermark logic.
 
  - Improve ring buffer allocation on NUMA systems
 
  - Put 'struct perf_event' into their separate kmem_cache pool
 
  - Add support for synchronous signals for select perf events. The immediate motivation
    is to support low-overhead sampling-based race detection for user-space code. The
    feature consists of the following main changes:
 
     - Add thread-only event inheritance via perf_event_attr::inherit_thread, which limits
       inheritance of events to CLONE_THREAD.
 
     - Add the ability for events to not leak through exec(), via perf_event_attr::remove_on_exec.
 
     - Allow the generation of SIGTRAP via perf_event_attr::sigtrap, extend siginfo with an u64
       ::si_perf, and add the breakpoint information to ::si_addr and ::si_perf if the event is
       PERF_TYPE_BREAKPOINT.
 
    The siginfo support is adequate for breakpoints right now - but the new field can be used
    to introduce support for other types of metadata passed over siginfo as well.
 
  - Misc fixes, cleanups and smaller updates.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmCJGpERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1j9zBAAuVbG2snV6SBSdXLhQcM66N3NckOXvSY5
 QjjhQcuwJQEK/NJB3266K5d8qSmdyRBsWf3GCsrmyBT67P1V28K44Pu7oCV0UDtf
 mpVRjEP0oR7hNsANSSgo8Fa4ZD7H5waX7dK7925Tvw8By3mMoZoddiD/84WJHhxO
 NDF+GRFaRj+/dpbhV8cdCoXTjYdkC36vYuZs3b9lu0tS9D/AJgsNy7TinLvO02Cs
 5peP+2y29dgvCXiGBiuJtEA6JyGnX3nUJCvfOZZ/DWDc3fdduARlRrc5Aiq4n/wY
 UdSkw1VTZBlZ1wMSdmHQVeC5RIH3uWUtRoNqy0Yc90lBm55AQ0EENwIfWDUDC5zy
 USdBqWTNWKMBxlEilUIyqKPQK8LW/31TRzqy8BWKPNcZt5yP5YS1SjAJRDDjSwL/
 I+OBw1vjLJamYh8oNiD5b+VLqNQba81jFASfv+HVWcULumnY6ImECCpkg289Fkpi
 BVR065boifJDlyENXFbvTxyMBXQsZfA+EhtxG7ju2Ni+TokBbogyCb3L2injPt9g
 7jjtTOqmfad4gX1WSc+215iYZMkgECcUd9E+BfOseEjBohqlo7yNKIfYnT8mE/Xq
 nb7eHjyvLiE8tRtZ+7SjsujOMHv9LhWFAbSaxU/kEVzpkp0zyd6mnnslDKaaHLhz
 goUMOL/D0lg=
 =NhQ7
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf event updates from Ingo Molnar:

 - Improve Intel uncore PMU support:

     - Parse uncore 'discovery tables' - a new hardware capability
       enumeration method introduced on the latest Intel platforms. This
       table is in a well-defined PCI namespace location and is read via
       MMIO. It is organized in an rbtree.

       These uncore tables will allow the discovery of standard counter
       blocks, but fancier counters still need to be enumerated
       explicitly.

     - Add Alder Lake support

     - Improve IIO stacks to PMON mapping support on Skylake servers

 - Add Intel Alder Lake PMU support - which requires the introduction of
   'hybrid' CPUs and PMUs. Alder Lake is a mix of Golden Cove ('big')
   and Gracemont ('small' - Atom derived) cores.

   The CPU-side feature set is entirely symmetrical - but on the PMU
   side there's core type dependent PMU functionality.

 - Reduce data loss with CPU level hardware tracing on Intel PT / AUX
   profiling, by fixing the AUX allocation watermark logic.

 - Improve ring buffer allocation on NUMA systems

 - Put 'struct perf_event' into their separate kmem_cache pool

 - Add support for synchronous signals for select perf events. The
   immediate motivation is to support low-overhead sampling-based race
   detection for user-space code. The feature consists of the following
   main changes:

     - Add thread-only event inheritance via
       perf_event_attr::inherit_thread, which limits inheritance of
       events to CLONE_THREAD.

     - Add the ability for events to not leak through exec(), via
       perf_event_attr::remove_on_exec.

     - Allow the generation of SIGTRAP via perf_event_attr::sigtrap,
       extend siginfo with an u64 ::si_perf, and add the breakpoint
       information to ::si_addr and ::si_perf if the event is
       PERF_TYPE_BREAKPOINT.

   The siginfo support is adequate for breakpoints right now - but the
   new field can be used to introduce support for other types of
   metadata passed over siginfo as well.

 - Misc fixes, cleanups and smaller updates.

* tag 'perf-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
  signal, perf: Add missing TRAP_PERF case in siginfo_layout()
  signal, perf: Fix siginfo_t by avoiding u64 on 32-bit architectures
  perf/x86: Allow for 8<num_fixed_counters<16
  perf/x86/rapl: Add support for Intel Alder Lake
  perf/x86/cstate: Add Alder Lake CPU support
  perf/x86/msr: Add Alder Lake CPU support
  perf/x86/intel/uncore: Add Alder Lake support
  perf: Extend PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE
  perf/x86/intel: Add Alder Lake Hybrid support
  perf/x86: Support filter_match callback
  perf/x86/intel: Add attr_update for Hybrid PMUs
  perf/x86: Add structures for the attributes of Hybrid PMUs
  perf/x86: Register hybrid PMUs
  perf/x86: Factor out x86_pmu_show_pmu_cap
  perf/x86: Remove temporary pmu assignment in event_init
  perf/x86/intel: Factor out intel_pmu_check_extra_regs
  perf/x86/intel: Factor out intel_pmu_check_event_constraints
  perf/x86/intel: Factor out intel_pmu_check_num_counters
  perf/x86: Hybrid PMU support for extra_regs
  perf/x86: Hybrid PMU support for event constraints
  ...
2021-04-28 13:03:44 -07:00
Filipe Manana
f9baa501b4 btrfs: fix deadlock when cloning inline extents and using qgroups
There are a few exceptional cases where cloning an inline extent needs to
copy the inline extent data into a page of the destination inode.

When this happens, we end up starting a transaction while having a dirty
page for the destination inode and while having the range locked in the
destination's inode iotree too. Because when reserving metadata space
for a transaction we may need to flush existing delalloc in case there is
not enough free space, we have a mechanism in place to prevent a deadlock,
which was introduced in commit 3d45f221ce ("btrfs: fix deadlock when
cloning inline extent and low on free metadata space").

However when using qgroups, a transaction also reserves metadata qgroup
space, which can also result in flushing delalloc in case there is not
enough available space at the moment. When this happens we deadlock, since
flushing delalloc requires locking the file range in the inode's iotree
and the range was already locked at the very beginning of the clone
operation, before attempting to start the transaction.

When this issue happens, stack traces like the following are reported:

  [72747.556262] task:kworker/u81:9   state:D stack:    0 pid:  225 ppid:     2 flags:0x00004000
  [72747.556268] Workqueue: writeback wb_workfn (flush-btrfs-1142)
  [72747.556271] Call Trace:
  [72747.556273]  __schedule+0x296/0x760
  [72747.556277]  schedule+0x3c/0xa0
  [72747.556279]  io_schedule+0x12/0x40
  [72747.556284]  __lock_page+0x13c/0x280
  [72747.556287]  ? generic_file_readonly_mmap+0x70/0x70
  [72747.556325]  extent_write_cache_pages+0x22a/0x440 [btrfs]
  [72747.556331]  ? __set_page_dirty_nobuffers+0xe7/0x160
  [72747.556358]  ? set_extent_buffer_dirty+0x5e/0x80 [btrfs]
  [72747.556362]  ? update_group_capacity+0x25/0x210
  [72747.556366]  ? cpumask_next_and+0x1a/0x20
  [72747.556391]  extent_writepages+0x44/0xa0 [btrfs]
  [72747.556394]  do_writepages+0x41/0xd0
  [72747.556398]  __writeback_single_inode+0x39/0x2a0
  [72747.556403]  writeback_sb_inodes+0x1ea/0x440
  [72747.556407]  __writeback_inodes_wb+0x5f/0xc0
  [72747.556410]  wb_writeback+0x235/0x2b0
  [72747.556414]  ? get_nr_inodes+0x35/0x50
  [72747.556417]  wb_workfn+0x354/0x490
  [72747.556420]  ? newidle_balance+0x2c5/0x3e0
  [72747.556424]  process_one_work+0x1aa/0x340
  [72747.556426]  worker_thread+0x30/0x390
  [72747.556429]  ? create_worker+0x1a0/0x1a0
  [72747.556432]  kthread+0x116/0x130
  [72747.556435]  ? kthread_park+0x80/0x80
  [72747.556438]  ret_from_fork+0x1f/0x30

  [72747.566958] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
  [72747.566961] Call Trace:
  [72747.566964]  __schedule+0x296/0x760
  [72747.566968]  ? finish_wait+0x80/0x80
  [72747.566970]  schedule+0x3c/0xa0
  [72747.566995]  wait_extent_bit.constprop.68+0x13b/0x1c0 [btrfs]
  [72747.566999]  ? finish_wait+0x80/0x80
  [72747.567024]  lock_extent_bits+0x37/0x90 [btrfs]
  [72747.567047]  btrfs_invalidatepage+0x299/0x2c0 [btrfs]
  [72747.567051]  ? find_get_pages_range_tag+0x2cd/0x380
  [72747.567076]  __extent_writepage+0x203/0x320 [btrfs]
  [72747.567102]  extent_write_cache_pages+0x2bb/0x440 [btrfs]
  [72747.567106]  ? update_load_avg+0x7e/0x5f0
  [72747.567109]  ? enqueue_entity+0xf4/0x6f0
  [72747.567134]  extent_writepages+0x44/0xa0 [btrfs]
  [72747.567137]  ? enqueue_task_fair+0x93/0x6f0
  [72747.567140]  do_writepages+0x41/0xd0
  [72747.567144]  __filemap_fdatawrite_range+0xc7/0x100
  [72747.567167]  btrfs_run_delalloc_work+0x17/0x40 [btrfs]
  [72747.567195]  btrfs_work_helper+0xc2/0x300 [btrfs]
  [72747.567200]  process_one_work+0x1aa/0x340
  [72747.567202]  worker_thread+0x30/0x390
  [72747.567205]  ? create_worker+0x1a0/0x1a0
  [72747.567208]  kthread+0x116/0x130
  [72747.567211]  ? kthread_park+0x80/0x80
  [72747.567214]  ret_from_fork+0x1f/0x30

  [72747.569686] task:fsstress        state:D stack:    0 pid:841421 ppid:841417 flags:0x00000000
  [72747.569689] Call Trace:
  [72747.569691]  __schedule+0x296/0x760
  [72747.569694]  schedule+0x3c/0xa0
  [72747.569721]  try_flush_qgroup+0x95/0x140 [btrfs]
  [72747.569725]  ? finish_wait+0x80/0x80
  [72747.569753]  btrfs_qgroup_reserve_data+0x34/0x50 [btrfs]
  [72747.569781]  btrfs_check_data_free_space+0x5f/0xa0 [btrfs]
  [72747.569804]  btrfs_buffered_write+0x1f7/0x7f0 [btrfs]
  [72747.569810]  ? path_lookupat.isra.48+0x97/0x140
  [72747.569833]  btrfs_file_write_iter+0x81/0x410 [btrfs]
  [72747.569836]  ? __kmalloc+0x16a/0x2c0
  [72747.569839]  do_iter_readv_writev+0x160/0x1c0
  [72747.569843]  do_iter_write+0x80/0x1b0
  [72747.569847]  vfs_writev+0x84/0x140
  [72747.569869]  ? btrfs_file_llseek+0x38/0x270 [btrfs]
  [72747.569873]  do_writev+0x65/0x100
  [72747.569876]  do_syscall_64+0x33/0x40
  [72747.569879]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

  [72747.569899] task:fsstress        state:D stack:    0 pid:841424 ppid:841417 flags:0x00004000
  [72747.569903] Call Trace:
  [72747.569906]  __schedule+0x296/0x760
  [72747.569909]  schedule+0x3c/0xa0
  [72747.569936]  try_flush_qgroup+0x95/0x140 [btrfs]
  [72747.569940]  ? finish_wait+0x80/0x80
  [72747.569967]  __btrfs_qgroup_reserve_meta+0x36/0x50 [btrfs]
  [72747.569989]  start_transaction+0x279/0x580 [btrfs]
  [72747.570014]  clone_copy_inline_extent+0x332/0x490 [btrfs]
  [72747.570041]  btrfs_clone+0x5b7/0x7a0 [btrfs]
  [72747.570068]  ? lock_extent_bits+0x64/0x90 [btrfs]
  [72747.570095]  btrfs_clone_files+0xfc/0x150 [btrfs]
  [72747.570122]  btrfs_remap_file_range+0x3d8/0x4a0 [btrfs]
  [72747.570126]  do_clone_file_range+0xed/0x200
  [72747.570131]  vfs_clone_file_range+0x37/0x110
  [72747.570134]  ioctl_file_clone+0x7d/0xb0
  [72747.570137]  do_vfs_ioctl+0x138/0x630
  [72747.570140]  __x64_sys_ioctl+0x62/0xc0
  [72747.570143]  do_syscall_64+0x33/0x40
  [72747.570146]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

So fix this by skipping the flush of delalloc for an inode that is
flagged with BTRFS_INODE_NO_DELALLOC_FLUSH, meaning it is currently under
such a special case of cloning an inline extent, when flushing delalloc
during qgroup metadata reservation.

The special cases for cloning inline extents were added in kernel 5.7 by
by commit 05a5a7621c ("Btrfs: implement full reflink support for
inline extents"), while having qgroup metadata space reservation flushing
delalloc when low on space was added in kernel 5.9 by commit
c53e965360 ("btrfs: qgroup: try to flush qgroup space when we get
-EDQUOT"). So use a "Fixes:" tag for the later commit to ease stable
kernel backports.

Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20210421083137.31E3.409509F4@e16-tech.com/
Fixes: c53e965360 ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
CC: stable@vger.kernel.org # 5.9+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-28 20:09:47 +02:00
Filipe Manana
626e9f41f7 btrfs: fix race leading to unpersisted data and metadata on fsync
When doing a fast fsync on a file, there is a race which can result in the
fsync returning success to user space without logging the inode and without
durably persisting new data.

The following example shows one possible scenario for this:

   $ mkfs.btrfs -f /dev/sdc
   $ mount /dev/sdc /mnt

   $ touch /mnt/bar
   $ xfs_io -f -c "pwrite -S 0xab 0 1M" -c "fsync" /mnt/baz

   # Now we have:
   # file bar == inode 257
   # file baz == inode 258

   $ mv /mnt/baz /mnt/foo

   # Now we have:
   # file bar == inode 257
   # file foo == inode 258

   $ xfs_io -c "pwrite -S 0xcd 0 1M" /mnt/foo

   # fsync bar before foo, it is important to trigger the race.
   $ xfs_io -c "fsync" /mnt/bar
   $ xfs_io -c "fsync" /mnt/foo

   # After this:
   # inode 257, file bar, is empty
   # inode 258, file foo, has 1M filled with 0xcd

   <power failure>

   # Replay the log:
   $ mount /dev/sdc /mnt

   # After this point file foo should have 1M filled with 0xcd and not 0xab

The following steps explain how the race happens:

1) Before the first fsync of inode 258, when it has the "baz" name, its
   ->logged_trans is 0, ->last_sub_trans is 0 and ->last_log_commit is -1.
   The inode also has the full sync flag set;

2) After the first fsync, we set inode 258 ->logged_trans to 6, which is
   the generation of the current transaction, and set ->last_log_commit
   to 0, which is the current value of ->last_sub_trans (done at
   btrfs_log_inode()).

   The full sync flag is cleared from the inode during the fsync.

   The log sub transaction that was committed had an ID of 0 and when we
   synced the log, at btrfs_sync_log(), we incremented root->log_transid
   from 0 to 1;

3) During the rename:

   We update inode 258, through btrfs_update_inode(), and that causes its
   ->last_sub_trans to be set to 1 (the current log transaction ID), and
   ->last_log_commit remains with a value of 0.

   After updating inode 258, because we have previously logged the inode
   in the previous fsync, we log again the inode through the call to
   btrfs_log_new_name(). This results in updating the inode's
   ->last_log_commit from 0 to 1 (the current value of its
   ->last_sub_trans).

   The ->last_sub_trans of inode 257 is updated to 1, which is the ID of
   the next log transaction;

4) Then a buffered write against inode 258 is made. This leaves the value
   of ->last_sub_trans as 1 (the ID of the current log transaction, stored
   at root->log_transid);

5) Then an fsync against inode 257 (or any other inode other than 258),
   happens. This results in committing the log transaction with ID 1,
   which results in updating root->last_log_commit to 1 and bumping
   root->log_transid from 1 to 2;

6) Then an fsync against inode 258 starts. We flush delalloc and wait only
   for writeback to complete, since the full sync flag is not set in the
   inode's runtime flags - we do not wait for ordered extents to complete.

   Then, at btrfs_sync_file(), we call btrfs_inode_in_log() before the
   ordered extent completes. The call returns true:

     static inline bool btrfs_inode_in_log(...)
     {
         bool ret = false;

         spin_lock(&inode->lock);
         if (inode->logged_trans == generation &&
             inode->last_sub_trans <= inode->last_log_commit &&
             inode->last_sub_trans <= inode->root->last_log_commit)
                 ret = true;
         spin_unlock(&inode->lock);
         return ret;
     }

   generation has a value of 6 (fs_info->generation), ->logged_trans also
   has a value of 6 (set when we logged the inode during the first fsync
   and when logging it during the rename), ->last_sub_trans has a value
   of 1, set during the rename (step 3), ->last_log_commit also has a
   value of 1 (set in step 3) and root->last_log_commit has a value of 1,
   which was set in step 5 when fsyncing inode 257.

   As a consequence we don't log the inode, any new extents and do not
   sync the log, resulting in a data loss if a power failure happens
   after the fsync and before the current transaction commits.
   Also, because we do not log the inode, after a power failure the mtime
   and ctime of the inode do not match those we had before.

   When the ordered extent completes before we call btrfs_inode_in_log(),
   then the call returns false and we log the inode and sync the log,
   since at the end of ordered extent completion we update the inode and
   set ->last_sub_trans to 2 (the value of root->log_transid) and
   ->last_log_commit to 1.

This problem is found after removing the check for the emptiness of the
inode's list of modified extents in the recent commit 209ecbb858
("btrfs: remove stale comment and logic from btrfs_inode_in_log()"),
added in the 5.13 merge window. However checking the emptiness of the
list is not really the way to solve this problem, and was never intended
to, because while that solves the problem for COW writes, the problem
persists for NOCOW writes because in that case the list is always empty.

In the case of NOCOW writes, even though we wait for the writeback to
complete before returning from btrfs_sync_file(), we end up not logging
the inode, which has a new mtime/ctime, and because we don't sync the log,
we never issue disk barriers (send REQ_PREFLUSH to the device) since that
only happens when we sync the log (when we write super blocks at
btrfs_sync_log()). So effectively, for a NOCOW case, when we return from
btrfs_sync_file() to user space, we are not guaranteeing that the data is
durably persisted on disk.

Also, while the example above uses a rename exchange to show how the
problem happens, it is not the only way to trigger it. An alternative
could be adding a new hard link to inode 258, since that also results
in calling btrfs_log_new_name() and updating the inode in the log.
An example reproducer using the addition of a hard link instead of a
rename operation:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt

  $ touch /mnt/bar
  $ xfs_io -f -c "pwrite -S 0xab 0 1M" -c "fsync" /mnt/foo

  $ ln /mnt/foo /mnt/foo_link
  $ xfs_io -c "pwrite -S 0xcd 0 1M" /mnt/foo

  $ xfs_io -c "fsync" /mnt/bar
  $ xfs_io -c "fsync" /mnt/foo

  <power failure>

  # Replay the log:
  $ mount /dev/sdc /mnt

  # After this point file foo often has 1M filled with 0xab and not 0xcd

The reasons leading to the final fsync of file foo, inode 258, not
persisting the new data are the same as for the previous example with
a rename operation.

So fix by never skipping logging and log syncing when there are still any
ordered extents in flight. To avoid making the conditional if statement
that checks if logging an inode is needed harder to read, place all the
logic into an helper function with separate if statements to make it more
manageable and easier to read.

A test case for fstests will follow soon.

For NOCOW writes, the problem existed before commit b5e6c3e170
("btrfs: always wait on ordered extents at fsync time"), introduced in
kernel 4.19, then it went away with that commit since we started to always
wait for ordered extent completion before logging.

The problem came back again once the fast fsync path was changed again to
avoid waiting for ordered extent completion, in commit 487781796d
("btrfs: make fast fsyncs wait only for writeback"), added in kernel 5.10.

However, for COW writes, the race only happens after the recent
commit 209ecbb858 ("btrfs: remove stale comment and logic from
btrfs_inode_in_log()"), introduced in the 5.13 merge window. For NOCOW
writes, the bug existed before that commit. So tag 5.10+ as the release
for stable backports.

CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-28 20:09:45 +02:00
Filipe Manana
ffb7c2e923 btrfs: do not consider send context as valid when trying to flush qgroups
At qgroup.c:try_flush_qgroup() we are asserting that current->journal_info
is either NULL or has the value BTRFS_SEND_TRANS_STUB.

However allowing for BTRFS_SEND_TRANS_STUB makes no sense because:

1) It is misleading, because send operations are read-only and do not
   ever need to reserve qgroup space;

2) We already assert that current->journal_info != BTRFS_SEND_TRANS_STUB
   at transaction.c:start_transaction();

3) On a kernel without CONFIG_BTRFS_ASSERT=y set, it would result in
   a crash if try_flush_qgroup() is ever called in a send context, because
   at transaction.c:start_transaction we cast current->journal_info into
   a struct btrfs_trans_handle pointer and then dereference it.

So just do allow a send context at try_flush_qgroup() and update the
comment about it.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-28 20:09:42 +02:00
Filipe Manana
adbd914dcd btrfs: zoned: fix silent data loss after failure splitting ordered extent
On a zoned filesystem, sometimes we need to split an ordered extent into 3
different ordered extents. The original ordered extent is shortened, at
the front and at the rear, and we create two other new ordered extents to
represent the trimmed parts of the original ordered extent.

After adjusting the original ordered extent, we create an ordered extent
to represent the pre-range, and that may fail with ENOMEM for example.
After that we always try to create the ordered extent for the post-range,
and if that happens to succeed we end up returning success to the caller
as we overwrite the 'ret' variable which contained the previous error.

This means we end up with a file range for which there is no ordered
extent, which results in the range never getting a new file extent item
pointing to the new data location. And since the split operation did
not return an error, writeback does not fail and the inode's mapping is
not flagged with an error, resulting in a subsequent fsync not reporting
an error either.

It's possibly very unlikely to have the creation of the post-range ordered
extent succeed after the creation of the pre-range ordered extent failed,
but it's not impossible.

So fix this by making sure we only create the post-range ordered extent
if there was no error creating the ordered extent for the pre-range.

Fixes: d22002fd37 ("btrfs: zoned: split ordered extent when bio is sent")
CC: stable@vger.kernel.org # 5.12+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-28 20:09:38 +02:00
Mike Marshall
0c4b7cadd1 Orangef: implement orangefs_readahead.
Also remove open-coded readahead logic from orangefs_readpage.

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2021-04-28 07:33:04 -04:00
Steve French
fee742b502 smb3.1.1: enable negotiating stronger encryption by default
Now that stronger encryption (gcm256) has been more broadly
tested, and confirmed to work with multiple servers (Windows
and Azure for example), enable it by default.  Although gcm256 is
the second choice we offer (after gcm128 which should be faster),
this change allows mounts to server which are configured to
require the strongest encryption to work (without changing a module
load parameter).

Signed-off-by: Steve French <stfrench@microsoft.com>
Suggested-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2021-04-27 23:18:40 -05:00
Linus Torvalds
7f3d08b255 printk changes for 5.13
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmCIBMIACgkQUqAMR0iA
 lPIt9w//bbHUN/JsNtLCs/849oExdUn/thVajrD5yELttYZXhdzbXncNdkGX9tlU
 4JmExmUoqKYdN6JhSnrcYvckHj7XXZM7pVh9IdzqRh10MEXIQ+7IUHjQc8034Zs/
 W4/oZmfMtBjszap+cJ9hvdp9qaJkPz/fRLGlrbjc1K4hhxDa1gGmeD35SKswGltm
 q6RzX3uRl5JbBrYsLoqb28MGYRHhjf2+Pvndoj+5Nn9FtwPSot6jAkyqY5Y6iJlS
 W2EsFqOt+Kv7/I93FyQlnXC6Nx7vntmow7knmmGPXDf2BqLb0J8Bxl3fwuzpQoao
 nZzL/p9GQ4ZXF6y8gRV8+RzPIcftBdayOswEDGH0LzlTkbAe/9Sq9Lo7a4Z8jxHW
 ro0P+PSRK5Ksm7jvpVmSTg+Nt+XqDA5zA1lAorX1UjsyeDDNF9ndQ4C+ZNhCKo54
 y+RDgtAArJMIvsHLQ53ReoOct5NnGVNb8G/r3bIAu+Dn6K3nesr6fP1XG8iduseL
 yFlLB7w214BQMr2B/C+8lQvj54wWE4lea2+LNvObxC5b8puYj0fEniUxTYP6bcB5
 QT+LfTToufYz4US7ggJy6hoEfohifGWVvDHbn9tXmyXotSTHH7pHdYypqY+UO+kl
 7BkwzNFCm4qCIKsg8nyJxT2hDOlpcCrQx1dBIjveMqJ0c5+ahXU=
 =ovSn
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux

Pull printk updates from Petr Mladek:

 - Stop synchronizing kernel log buffer readers by logbuf_lock. As a
   result, the access to the buffer is fully lockless now.

   Note that printk() itself still uses locks because it tries to flush
   the messages to the console immediately. Also the per-CPU temporary
   buffers are still there because they prevent infinite recursion and
   serialize backtraces from NMI. All this is going to change in the
   future.

 - kmsg_dump API rework and cleanup as a side effect of the logbuf_lock
   removal.

 - Make bstr_printf() aware that %pf and %pF formats could deference the
   given pointer.

 - Show also page flags by %pGp format.

 - Clarify the documentation for plain pointer printing.

 - Do not show no_hash_pointers warning multiple times.

 - Update Senozhatsky email address.

 - Some clean up.

* tag 'printk-for-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: (24 commits)
  lib/vsprintf.c: remove leftover 'f' and 'F' cases from bstr_printf()
  printk: clarify the documentation for plain pointer printing
  kernel/printk.c: Fixed mundane typos
  printk: rename vprintk_func to vprintk
  vsprintf: dump full information of page flags in pGp
  mm, slub: don't combine pr_err with INFO
  mm, slub: use pGp to print page flags
  MAINTAINERS: update Senozhatsky email address
  lib/vsprintf: do not show no_hash_pointers message multiple times
  printk: console: remove unnecessary safe buffer usage
  printk: kmsg_dump: remove _nolock() variants
  printk: remove logbuf_lock
  printk: introduce a kmsg_dump iterator
  printk: kmsg_dumper: remove @active field
  printk: add syslog_lock
  printk: use atomic64_t for devkmsg_user.seq
  printk: use seqcount_latch for clear_seq
  printk: introduce CONSOLE_LOG_MAX
  printk: consolidate kmsg_dump_get_buffer/syslog_print_all code
  printk: refactor kmsg_dump_get_buffer()
  ...
2021-04-27 18:09:44 -07:00
Florent Revest
76d6a13383 seq_file: Add a seq_bprintf function
Similarly to seq_buf_bprintf in lib/seq_buf.c, this function writes a
printf formatted string with arguments provided in a "binary
representation" built by functions such as vbin_printf.

Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210427174313.860948-2-revest@chromium.org
2021-04-27 15:50:15 -07:00
Jeff Layton
d4f6b31d72 ceph: don't allow access to MDS-private inodes
The MDS reserves a set of inodes for its own usage, and these should
never be accessible to clients. Add a new helper to vet a proposed
inode number against that range, and complain loudly and refuse to
create or look it up if it's in it.

Also, ensure that the MDS doesn't try to delegate inodes that are in
that range or lower. Print a warning if it does, and don't save the
range in the xarray.

URL: https://tracker.ceph.com/issues/49922
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:23 +02:00
Jeff Layton
2d6795fbb8 ceph: fix up some bare fetches of i_size
We need to use i_size_read(), which properly handles the torn read
case on 32-bit arches.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:23 +02:00
Jeff Layton
8ff2d290c8 ceph: convert some PAGE_SIZE invocations to thp_size()
Start preparing to allow the use of THPs in the pagecache with ceph by
making it use thp_size() in lieu of PAGE_SIZE in the appropriate places.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:23 +02:00
Yanhu Cao
e7f7295250 ceph: support getting ceph.dir.rsnaps vxattr
Add support for grabbing the rsnaps value out of the inode info in
traces, and exposing that via ceph.dir.rsnaps xattr.

Signed-off-by: Yanhu Cao <gmayyyha@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:23 +02:00
Jeff Layton
e72968e15b ceph: drop pinned_page parameter from ceph_get_caps
All of the existing callers that don't set this to NULL just drop the
page reference at some arbitrary point later in processing. There's no
point in keeping a page reference that we don't use, so just drop the
reference immediately after checking the Uptodate flag.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:23 +02:00
Jeff Layton
1775c7ddac ceph: fix inode leak on getattr error in __fh_to_dentry
Fixes: 878dabb641 ("ceph: don't return -ESTALE if there's still an open file")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:23 +02:00
Jeff Layton
e9b2250156 ceph: only check pool permissions for regular files
There is no need to do a ceph_pool_perm_check() on anything that isn't a
regular file, as the MDS is what handles talking to the OSD in those
cases. Just return 0 if it's not a regular file.

Reported-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:23 +02:00