Commit Graph

196 Commits

Author SHA1 Message Date
Matthew Wilcox (Oracle) 908a1ad894 iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic()
copy_page_from_iter_atomic() already handles !highmem compound
pages correctly, but if we are passed a highmem compound page,
each base page needs to be mapped & unmapped individually.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-07-24 18:03:43 -04:00
Matthew Wilcox (Oracle) f7f9a0c873 iov_iter: Map the page later in copy_page_from_iter_atomic()
Remove a couple of calls to kunmap_atomic() in the rare error cases.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-07-24 18:03:43 -04:00
Peter Zijlstra 719a937b70 iov_iter: Mark copy_iovec_from_user() noclone
Extend commit 50f9a76ef1 ("iov_iter: Mark
copy_compat_iovec_from_user() noinline") to also cover
copy_iovec_from_user(). Different compiler versions cause the same
problem on different functions.

lib/iov_iter.o: warning: objtool: .altinstr_replacement+0x1f: redundant UACCESS disable
lib/iov_iter.o: warning: objtool: iovec_from_user+0x84: call to copy_iovec_from_user.part.0() with UACCESS enabled
lib/iov_iter.o: warning: objtool: __import_iovec+0x143: call to copy_iovec_from_user.part.0() with UACCESS enabled

Fixes: 50f9a76ef1 ("iov_iter: Mark copy_compat_iovec_from_user() noinline")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lkml.kernel.org/r/20230616124354.GD4253@hirez.programming.kicks-ass.net
2023-07-10 09:52:28 +02:00
Christoph Hellwig 84bd06c632 iov_iter: remove iov_iter_get_pages and iov_iter_get_pages_alloc
Now that the direct I/O helpers have switched to use
iov_iter_extract_pages, these helpers are unused.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230614140341.521331-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-16 10:08:09 -06:00
David Howells 3fc40265ae iov_iter: Kill ITER_PIPE
The ITER_PIPE-type iterator was only used by generic_file_splice_read() and
that has been replaced and removed.  This leaves ITER_PIPE unused - so
remove it too.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20230522135018.2742245-31-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:17 -06:00
Kefeng Wang 245f092268 mm: hwpoison: coredump: support recovery from dump_user_range()
dump_user_range() is used to copy the user page to a coredump file, but if
a hardware memory error occurred during copy, which called from
__kernel_write_iter() in dump_user_range(), it crashes,

  CPU: 112 PID: 7014 Comm: mca-recover Not tainted 6.3.0-rc2 #425

  pc : __memcpy+0x110/0x260
  lr : _copy_from_iter+0x3bc/0x4c8
  ...
  Call trace:
   __memcpy+0x110/0x260
   copy_page_from_iter+0xcc/0x130
   pipe_write+0x164/0x6d8
   __kernel_write_iter+0x9c/0x210
   dump_user_range+0xc8/0x1d8
   elf_core_dump+0x308/0x368
   do_coredump+0x2e8/0xa40
   get_signal+0x59c/0x788
   do_signal+0x118/0x1f8
   do_notify_resume+0xf0/0x280
   el0_da+0x130/0x138
   el0t_64_sync_handler+0x68/0xc0
   el0t_64_sync+0x188/0x190

Generally, the '->write_iter' of file ops will use copy_page_from_iter()
and copy_page_from_iter_atomic(), change memcpy() to copy_mc_to_kernel()
in both of them to handle #MC during source read, which stop coredump
processing and kill the task instead of kernel panic, but the source
address may not always a user address, so introduce a new copy_mc flag in
struct iov_iter{} to indicate that the iter could do a safe memory copy,
also introduce the helpers to set/cleck the flag, for now, it's only used
in coredump's dump_user_range(), but it could expand to any other
scenarios to fix the similar issue.

Link: https://lkml.kernel.org/r/20230417045323.11054-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Tong Tiangen <tongtiangen@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-02 17:21:50 -07:00
Linus Torvalds 7fa8a8ee94 - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread.
 
 - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav.
 
 - zsmalloc performance improvements from Sergey Senozhatsky.
 
 - Yue Zhao has found and fixed some data race issues around the
   alteration of memcg userspace tunables.
 
 - VFS rationalizations from Christoph Hellwig:
 
   - removal of most of the callers of write_one_page().
 
   - make __filemap_get_folio()'s return value more useful
 
 - Luis Chamberlain has changed tmpfs so it no longer requires swap
   backing.  Use `mount -o noswap'.
 
 - Qi Zheng has made the slab shrinkers operate locklessly, providing
   some scalability benefits.
 
 - Keith Busch has improved dmapool's performance, making part of its
   operations O(1) rather than O(n).
 
 - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
   permitting userspace to wr-protect anon memory unpopulated ptes.
 
 - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather
   than exclusive, and has fixed a bunch of errors which were caused by its
   unintuitive meaning.
 
 - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
   which causes minor faults to install a write-protected pte.
 
 - Vlastimil Babka has done some maintenance work on vma_merge():
   cleanups to the kernel code and improvements to our userspace test
   harness.
 
 - Cleanups to do_fault_around() by Lorenzo Stoakes.
 
 - Mike Rapoport has moved a lot of initialization code out of various
   mm/ files and into mm/mm_init.c.
 
 - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
   DRM, but DRM doesn't use it any more.
 
 - Lorenzo has also coverted read_kcore() and vread() to use iterators
   and has thereby removed the use of bounce buffers in some cases.
 
 - Lorenzo has also contributed further cleanups of vma_merge().
 
 - Chaitanya Prakash provides some fixes to the mmap selftesting code.
 
 - Matthew Wilcox changes xfs and afs so they no longer take sleeping
   locks in ->map_page(), a step towards RCUification of pagefaults.
 
 - Suren Baghdasaryan has improved mmap_lock scalability by switching to
   per-VMA locking.
 
 - Frederic Weisbecker has reworked the percpu cache draining so that it
   no longer causes latency glitches on cpu isolated workloads.
 
 - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
   logic.
 
 - Liu Shixin has changed zswap's initialization so we no longer waste a
   chunk of memory if zswap is not being used.
 
 - Yosry Ahmed has improved the performance of memcg statistics flushing.
 
 - David Stevens has fixed several issues involving khugepaged,
   userfaultfd and shmem.
 
 - Christoph Hellwig has provided some cleanup work to zram's IO-related
   code paths.
 
 - David Hildenbrand has fixed up some issues in the selftest code's
   testing of our pte state changing.
 
 - Pankaj Raghav has made page_endio() unneeded and has removed it.
 
 - Peter Xu contributed some rationalizations of the userfaultfd
   selftests.
 
 - Yosry Ahmed has fixed an issue around memcg's page recalim accounting.
 
 - Chaitanya Prakash has fixed some arm-related issues in the
   selftests/mm code.
 
 - Longlong Xia has improved the way in which KSM handles hwpoisoned
   pages.
 
 - Peter Xu fixes a few issues with uffd-wp at fork() time.
 
 - Stefan Roesch has changed KSM so that it may now be used on a
   per-process and per-cgroup basis.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr3zQAKCRDdBJ7gKXxA
 jlLoAP0fpQBipwFxED0Us4SKQfupV6z4caXNJGPeay7Aj11/kQD/aMRC2uPfgr96
 eMG3kwn2pqkB9ST2QpkaRbxA//eMbQY=
 =J+Dj
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
   switching from a user process to a kernel thread.

 - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
   Raghav.

 - zsmalloc performance improvements from Sergey Senozhatsky.

 - Yue Zhao has found and fixed some data race issues around the
   alteration of memcg userspace tunables.

 - VFS rationalizations from Christoph Hellwig:
     - removal of most of the callers of write_one_page()
     - make __filemap_get_folio()'s return value more useful

 - Luis Chamberlain has changed tmpfs so it no longer requires swap
   backing. Use `mount -o noswap'.

 - Qi Zheng has made the slab shrinkers operate locklessly, providing
   some scalability benefits.

 - Keith Busch has improved dmapool's performance, making part of its
   operations O(1) rather than O(n).

 - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
   permitting userspace to wr-protect anon memory unpopulated ptes.

 - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
   rather than exclusive, and has fixed a bunch of errors which were
   caused by its unintuitive meaning.

 - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
   which causes minor faults to install a write-protected pte.

 - Vlastimil Babka has done some maintenance work on vma_merge():
   cleanups to the kernel code and improvements to our userspace test
   harness.

 - Cleanups to do_fault_around() by Lorenzo Stoakes.

 - Mike Rapoport has moved a lot of initialization code out of various
   mm/ files and into mm/mm_init.c.

 - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
   DRM, but DRM doesn't use it any more.

 - Lorenzo has also coverted read_kcore() and vread() to use iterators
   and has thereby removed the use of bounce buffers in some cases.

 - Lorenzo has also contributed further cleanups of vma_merge().

 - Chaitanya Prakash provides some fixes to the mmap selftesting code.

 - Matthew Wilcox changes xfs and afs so they no longer take sleeping
   locks in ->map_page(), a step towards RCUification of pagefaults.

 - Suren Baghdasaryan has improved mmap_lock scalability by switching to
   per-VMA locking.

 - Frederic Weisbecker has reworked the percpu cache draining so that it
   no longer causes latency glitches on cpu isolated workloads.

 - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
   logic.

 - Liu Shixin has changed zswap's initialization so we no longer waste a
   chunk of memory if zswap is not being used.

 - Yosry Ahmed has improved the performance of memcg statistics
   flushing.

 - David Stevens has fixed several issues involving khugepaged,
   userfaultfd and shmem.

 - Christoph Hellwig has provided some cleanup work to zram's IO-related
   code paths.

 - David Hildenbrand has fixed up some issues in the selftest code's
   testing of our pte state changing.

 - Pankaj Raghav has made page_endio() unneeded and has removed it.

 - Peter Xu contributed some rationalizations of the userfaultfd
   selftests.

 - Yosry Ahmed has fixed an issue around memcg's page recalim
   accounting.

 - Chaitanya Prakash has fixed some arm-related issues in the
   selftests/mm code.

 - Longlong Xia has improved the way in which KSM handles hwpoisoned
   pages.

 - Peter Xu fixes a few issues with uffd-wp at fork() time.

 - Stefan Roesch has changed KSM so that it may now be used on a
   per-process and per-cgroup basis.

* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
  mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
  shmem: restrict noswap option to initial user namespace
  mm/khugepaged: fix conflicting mods to collapse_file()
  sparse: remove unnecessary 0 values from rc
  mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
  hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
  maple_tree: fix allocation in mas_sparse_area()
  mm: do not increment pgfault stats when page fault handler retries
  zsmalloc: allow only one active pool compaction context
  selftests/mm: add new selftests for KSM
  mm: add new KSM process and sysfs knobs
  mm: add new api to enable ksm per process
  mm: shrinkers: fix debugfs file permissions
  mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
  migrate_pages_batch: fix statistics for longterm pin retry
  userfaultfd: use helper function range_in_vma()
  lib/show_mem.c: use for_each_populated_zone() simplify code
  mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
  fs/buffer: convert create_page_buffers to folio_create_buffers
  fs/buffer: add folio_create_empty_buffers helper
  ...
2023-04-27 19:42:02 -07:00
Linus Torvalds 487c20b016 iov: improve copy_iovec_from_user() code generation
Use the same pattern as the compat version of this code does: instead of
copying the whole array to a kernel buffer and then having a separate
phase of verifying it, just do it one entry at a time, verifying as you
go.

On Jens' /dev/zero readv() test this improves performance by ~6%.

[ This was obviously triggered by Jens' ITER_UBUF updates series ]

Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/all/de35d11d-bce7-e976-7372-1f2caf417103@kernel.dk/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-04-24 10:37:17 -07:00
Josh Poimboeuf 50f9a76ef1 iov_iter: Mark copy_compat_iovec_from_user() noinline
After commit 6376ce56feb6 ("iov_iter: import single vector iovecs as
ITER_UBUF"), GCC does an inter-procedural compiler optimization which
moves the user_access_begin() out of copy_compat_iovec_from_user() and
into its callers:

  lib/iov_iter.o: warning: objtool: .altinstr_replacement+0x0: redundant UACCESS disable
  lib/iov_iter.o: warning: objtool: iovec_from_user.part.0+0xc7: call to copy_compat_iovec_from_user.part.0() with UACCESS enabled
  lib/iov_iter.o: warning: objtool: __import_iovec+0x21d: call to copy_compat_iovec_from_user.part.0() with UACCESS enabled

Enforce the "no UACCESS enable across function boundaries" rule by
disabling cloning for copy_compat_iovec_from_user().

Fixes: 6376ce56feb6 ("iov_iter: import single vector iovecs as ITER_UBUF")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
https://lkml.kernel.org/lkml/20230327120017.6bb826d7@canb.auug.org.au
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-12 10:46:48 -06:00
Lorenzo Stoakes 4f80818b4a iov_iter: add copy_page_to_iter_nofault()
Provide a means to copy a page to user space from an iterator, aborting if
a page fault would occur.  This supports compound pages, but may be passed
a tail page with an offset extending further into the compound page, so we
cannot pass a folio.

This allows for this function to be called from atomic context and _try_
to user pages if they are faulted in, aborting if not.

The function does not use _copy_to_iter() in order to not specify
might_fault(), this is similar to copy_page_from_iter_atomic().

This is being added in order that an iteratable form of vread() can be
implemented while holding spinlocks.

Link: https://lkml.kernel.org/r/19734729defb0f498a76bdec1bef3ac48a3af3e8.1679511146.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 19:42:57 -07:00
Jens Axboe 3b2deb0e46 iov_iter: import single vector iovecs as ITER_UBUF
Add a special case to __import_iovec(), which imports a single segment
iovec as an ITER_UBUF rather than an ITER_IOVEC. ITER_UBUF is cheaper
to iterate than ITER_IOVEC, and for a single segment iovec, there's no
point in using a segmented iterator.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-30 08:12:29 -06:00
Jens Axboe e03ad4ee27 iov_iter: convert import_single_range() to ITER_UBUF
Since we're just importing a single vector, we don't have to turn it
into an ITER_IOVEC. Instead turn it into an ITER_UBUF, which is cheaper
to iterate.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-30 08:12:29 -06:00
Jens Axboe de4f5fed3f iov_iter: add iter_iovec() helper
This returns a pointer to the current iovec entry in the iterator. Only
useful with ITER_IOVEC right now, but it prepares us to treat ITER_UBUF
and ITER_IOVEC identically for the first segment.

Rename struct iov_iter->iov to iov_iter->__iov to find any potentially
troublesome spots, and also to prevent anyone from adding new code that
accesses iter->iov directly.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-30 08:12:29 -06:00
Linus Torvalds 307e14c039 46 fs/cifs (smb3 client) changesets, 37 in fs/cifs and 9 for related helper functions and cleanup outside from Dave Howells and Willy
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmP2kaAACgkQiiy9cAdy
 T1Eergv9FHVs7hS0anJF0xgRghR4+g0m5UUo08iJazgJdDgcS5JY+ZasIpYpEsG3
 QmsIT33XVYZypXoOzjMSsPlwo6esTCJQScVLz85e4ebedCbCBDks+wVQcbfTzD5/
 KrwmUoTBLU0L/ppFhqRk9k53nrSf1SXCWPthjdfWa3mTHdIVM4kQJruTWwUDiJXp
 mdYwTx6FnTNer3QWetNzYOwdUgLu3rk0zLcBwQNCo6g5LOpA44iFfEAO4zeiOuZT
 LMDPbDj0nWQyWPLLdcbtsn2laYyEBDBLZevLirSaqPQ/KCtGcw0mBt6dCAzg8/CM
 ONqHHxdEpvPON8Sxujcn4CxpXhl0nCLwwtKtWU4rt7IevI9U+PynNl57TtJJ16/s
 b3XD2QVbFjlcdAMTmArvqnogdzoC3mZu1R1IRs+jukhLAOqZiLN6o/E2HAllt47i
 krzXeXIzQr10w9fnJ7LtIc/7IUFgtUfrOkg4TKyNcnRVHQaSSxv+JLRgqMPOr/M0
 I7zt0G0j
 =4hIT
 -----END PGP SIGNATURE-----

Merge tag '6.3-rc-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs client updates from Steve French:
 "The largest subset of this is from David Howells et al: making the
  cifs/smb3 driver pass iov_iters down to the lowest layers, directly to
  the network transport rather than passing lists of pages around,
  helping multiple areas:

   - Pin user pages, thereby fixing the race between concurrent DIO read
     and fork, where the pages containing the DIO read buffer may end up
     belonging to the child process and not the parent - with the result
     that the parent might not see the retrieved data.

   - cifs shouldn't take refs on pages extracted from non-user-backed
     iterators (eg. KVEC). With these changes, cifs will apply the
     appropriate cleanup.

   - Making it easier to transition to using folios in cifs rather than
     pages by dealing with them through BVEC and XARRAY iterators.

   - Allowing cifs to use the new splice function

  The remainder are:

   - fixes for stable, including various fixes for uninitialized memory,
     wrong length field causing mount issue to very old servers,
     important directory lease fixes and reconnect fixes

   - cleanups (unused code removal, change one element array usage, and
     a change form strtobool to kstrtobool, and Kconfig cleanups)

   - SMBDIRECT (RDMA) fixes including iov_iter integration and UAF fixes

   - reconnect fixes

   - multichannel fixes, including improving channel allocation (to
     least used channel)

   - remove the last use of lock_page_killable by moving to
     folio_lock_killable"

* tag '6.3-rc-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: (46 commits)
  update internal module version number for cifs.ko
  cifs: update ip_addr for ses only for primary chan setup
  cifs: use tcon allocation functions even for dummy tcon
  cifs: use the least loaded channel for sending requests
  cifs: DIO to/from KVEC-type iterators should now work
  cifs: Remove unused code
  cifs: Build the RDMA SGE list directly from an iterator
  cifs: Change the I/O paths to use an iterator rather than a page list
  cifs: Add a function to read into an iter from a socket
  cifs: Add some helper functions
  cifs: Add a function to Hash the contents of an iterator
  cifs: Add a function to build an RDMA SGE list from an iterator
  netfs: Add a function to extract an iterator into a scatterlist
  netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator
  cifs: Implement splice_read to pass down ITER_BVEC not ITER_PIPE
  splice: Export filemap/direct_splice_read()
  iov_iter: Add a function to extract a page list from an iterator
  iov_iter: Define flags to qualify page extraction.
  splice: Add a func to do a splice from an O_DIRECT file without ITER_PIPE
  splice: Add a func to do a splice from a buffered file without ITER_PIPE
  ...
2023-02-22 17:12:44 -08:00
David Howells 7d58fe7310 iov_iter: Add a function to extract a page list from an iterator
Add a function, iov_iter_extract_pages(), to extract a list of pages from
an iterator.  The pages may be returned with a pin added or nothing,
depending on the type of iterator.

Add a second function, iov_iter_extract_will_pin(), to determine how the
cleanup should be done.

There are two cases:

 (1) ITER_IOVEC or ITER_UBUF iterator.

     Extracted pages will have pins (FOLL_PIN) obtained on them so that a
     concurrent fork() will forcibly copy the page so that DMA is done
     to/from the parent's buffer and is unavailable to/unaffected by the
     child process.

     iov_iter_extract_will_pin() will return true for this case.  The
     caller should use something like unpin_user_page() to dispose of the
     page.

 (2) Any other sort of iterator.

     No refs or pins are obtained on the page, the assumption is made that
     the caller will manage page retention.

     iov_iter_extract_will_pin() will return false.  The pages don't need
     additional disposal.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: John Hubbard <jhubbard@nvidia.com>
cc: David Hildenbrand <david@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 17:25:43 -06:00
David Howells f62e52d127 iov_iter: Define flags to qualify page extraction.
Define flags to qualify page extraction to pass into iov_iter_*_pages*()
rather than passing in FOLL_* flags.

For now only a flag to allow peer-to-peer DMA is supported.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 17:25:43 -06:00
David Howells 07073eb01c splice: Add a func to do a splice from a buffered file without ITER_PIPE
Provide a function to do splice read from a buffered file, pulling the
folios out of the pagecache directly by calling filemap_get_pages() to do
any required reading and then pasting the returned folios into the pipe.

A helper function is provided to do the actual folio pasting and will
handle multipage folios by splicing as many of the relevant subpages as
will fit into the pipe.

The code is loosely based on filemap_read() and might belong in
mm/filemap.c with that as it needs to use filemap_get_pages().

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 17:25:43 -06:00
Keith Busch 4397a17c1d iov_iter: move iter_ubuf check inside restore WARN
io_uring is using iter_ubuf types for single vector requests. We expect
state restore may happen for this type now, and it is already handled
correctly, so suppress the warning.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-01-08 20:59:17 -07:00
Jens Axboe 2ad9bd8332 iov: add import_ubuf()
Like import_single_range(), but for ITER_UBUF.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-01-08 20:59:17 -07:00
Linus Torvalds ce8a79d560 for-6.2/block-2022-12-08
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmOScsgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpi5ID/9pLXFYOq1+uDjU0KO/MdjMjK8Ukr34lCnk
 WkajRLheE8JBKOFDE54XJk56sQSZHX9bTWqziar0h1fioh7FlQR/tVvzsERCm2M9
 2y9THJNJygC68wgybStyiKlshFjl7TD7Kv5N9Y3xP3mkQygT+D6o8fXZk5xQbYyH
 YdFSoq4rJVHxRL03yzQiReGGIYdOUEQQh8l1FiLwLlKa3lXAey1KuxWIzksVN0KK
 aZB4QhiBpOiPgDHUVisq2XtyQjpZ2byoCImPzgrcqk9Jo4esvm/e6esrg4xlsvII
 LKFFkTmbVqjUZtFjqakFHmfuzVor4nU5f+xb90ZHExuuODYckkxWp5rWhf9QwqqI
 0ik6WYgI1/5vnHnX8f2DYzOFQf9qa/rLgg0CshyUODlD6RfHa9vntqYvlIFkmOBd
 Q7KblIoK8YTzUS1M+v7X8JQ7gDR2KwygH37Da2KJS+vgvfIb8kJGr1ZORuhJuJJ7
 Bl69gaNkHTHrqufp7UI64YXfueeuNu2J9z3zwzGoxeaFaofF/phDn0/2gCQE1fQI
 XBhsMw+ETqI6B2SPHMnzYDu2DM1S8ZTOYQlaD4G3uqgWnAM1tG707395uAy5yu4n
 D5azU1fVG4UocoNIyPujpaoSRs2zWZycEFEeUQkhyDDww/j4hlHi6H33eOnk0zsr
 wxzFGfvHfw==
 =k/vv
 -----END PGP SIGNATURE-----

Merge tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe pull requests via Christoph:
      - Support some passthrough commands without CAP_SYS_ADMIN (Kanchan
        Joshi)
      - Refactor PCIe probing and reset (Christoph Hellwig)
      - Various fabrics authentication fixes and improvements (Sagi
        Grimberg)
      - Avoid fallback to sequential scan due to transient issues (Uday
        Shankar)
      - Implement support for the DEAC bit in Write Zeroes (Christoph
        Hellwig)
      - Allow overriding the IEEE OUI and firmware revision in configfs
        for nvmet (Aleksandr Miloserdov)
      - Force reconnect when number of queue changes in nvmet (Daniel
        Wagner)
      - Minor fixes and improvements (Uros Bizjak, Joel Granados, Sagi
        Grimberg, Christoph Hellwig, Christophe JAILLET)
      - Fix and cleanup nvme-fc req allocation (Chaitanya Kulkarni)
      - Use the common tagset helpers in nvme-pci driver (Christoph
        Hellwig)
      - Cleanup the nvme-pci removal path (Christoph Hellwig)
      - Use kstrtobool() instead of strtobool (Christophe JAILLET)
      - Allow unprivileged passthrough of Identify Controller (Joel
        Granados)
      - Support io stats on the mpath device (Sagi Grimberg)
      - Minor nvmet cleanup (Sagi Grimberg)

 - MD pull requests via Song:
      - Code cleanups (Christoph)
      - Various fixes

 - Floppy pull request from Denis:
      - Fix a memory leak in the init error path (Yuan)

 - Series fixing some batch wakeup issues with sbitmap (Gabriel)

 - Removal of the pktcdvd driver that was deprecated more than 5 years
   ago, and subsequent removal of the devnode callback in struct
   block_device_operations as no users are now left (Greg)

 - Fix for partition read on an exclusively opened bdev (Jan)

 - Series of elevator API cleanups (Jinlong, Christoph)

 - Series of fixes and cleanups for blk-iocost (Kemeng)

 - Series of fixes and cleanups for blk-throttle (Kemeng)

 - Series adding concurrent support for sync queues in BFQ (Yu)

 - Series bringing drbd a bit closer to the out-of-tree maintained
   version (Christian, Joel, Lars, Philipp)

 - Misc drbd fixes (Wang)

 - blk-wbt fixes and tweaks for enable/disable (Yu)

 - Fixes for mq-deadline for zoned devices (Damien)

 - Add support for read-only and offline zones for null_blk
   (Shin'ichiro)

 - Series fixing the delayed holder tracking, as used by DM (Yu,
   Christoph)

 - Series enabling bio alloc caching for IRQ based IO (Pavel)

 - Series enabling userspace peer-to-peer DMA (Logan)

 - BFQ waker fixes (Khazhismel)

 - Series fixing elevator refcount issues (Christoph, Jinlong)

 - Series cleaning up references around queue destruction (Christoph)

 - Series doing quiesce by tagset, enabling cleanups in drivers
   (Christoph, Chao)

 - Series untangling the queue kobject and queue references (Christoph)

 - Misc fixes and cleanups (Bart, David, Dawei, Jinlong, Kemeng, Ye,
   Yang, Waiman, Shin'ichiro, Randy, Pankaj, Christoph)

* tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux: (247 commits)
  blktrace: Fix output non-blktrace event when blk_classic option enabled
  block: sed-opal: Don't include <linux/kernel.h>
  sed-opal: allow using IOC_OPAL_SAVE for locking too
  blk-cgroup: Fix typo in comment
  block: remove bio_set_op_attrs
  nvmet: don't open-code NVME_NS_ATTR_RO enumeration
  nvme-pci: use the tagset alloc/free helpers
  nvme: add the Apple shared tag workaround to nvme_alloc_io_tag_set
  nvme: only set reserved_tags in nvme_alloc_io_tag_set for fabrics controllers
  nvme: consolidate setting the tagset flags
  nvme: pass nr_maps explicitly to nvme_alloc_io_tag_set
  block: bio_copy_data_iter
  nvme-pci: split out a nvme_pci_ctrl_is_dead helper
  nvme-pci: return early on ctrl state mismatch in nvme_reset_work
  nvme-pci: rename nvme_disable_io_queues
  nvme-pci: cleanup nvme_suspend_queue
  nvme-pci: remove nvme_pci_disable
  nvme-pci: remove nvme_disable_admin_queue
  nvme: merge nvme_shutdown_ctrl into nvme_disable_ctrl
  nvme: use nvme_wait_ready in nvme_shutdown_ctrl
  ...
2022-12-13 10:43:59 -08:00
Al Viro a41dad905e iov_iter: saner checks for attempt to copy to/from iterator
instead of "don't do it to ITER_PIPE" check for ->data_source being
false on copying from iterator.  Check for !->data_source for
copying to iterator, while we are at it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-11-25 13:01:21 -05:00
Al Viro c67f1fd2b2 csum_and_copy_to_iter(): handle ITER_DISCARD
Not hard to implement - we are not copying anything here, so
csum_and_memcpy() is not usable, but calculating a checksum
of source directly is trivial...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-11-25 13:01:20 -05:00
Al Viro 40a86061a5 get rid of unlikely() on page_copy_sane() calls
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-11-25 13:01:20 -05:00
Logan Gunthorpe d82076403c iov_iter: introduce iov_iter_get_pages_[alloc_]flags()
Add iov_iter_get_pages_flags() and iov_iter_get_pages_alloc_flags()
which take a flags argument that is passed to get_user_pages_fast().

This is so that FOLL_PCI_P2PDMA can be passed when appropriate.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221021174116.7200-4-logang@deltatee.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-09 11:29:20 -07:00
Alexander Potapenko 33b75c1d88 instrumented.h: allow instrumenting both sides of copy_from_user()
Introduce instrument_copy_from_user_before() and
instrument_copy_from_user_after() hooks to be invoked before and after the
call to copy_from_user().

KASAN and KCSAN will be only using instrument_copy_from_user_before(), but
for KMSAN we'll need to insert code after copy_from_user().

Link: https://lkml.kernel.org/r/20220915150417.722975-4-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:18 -07:00
Al Viro c03f05f183 fix copy_page_from_iter() for compound destinations
had been broken for ITER_BVEC et.al. since ever (OK, v3.17 when
ITER_BVEC had first appeared)...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:26 -04:00
Al Viro f0f6b614f8 copy_page_to_iter(): don't split high-order page in case of ITER_PIPE
... just shove it into one pipe_buffer.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:25 -04:00
Al Viro 310d9d5a50 expand those iov_iter_advance()...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:25 -04:00
Al Viro 746de1f86f pipe_get_pages(): switch to append_pipe()
now that we are advancing the iterator, there's no need to
treat the first page separately - just call append_pipe()
in a loop.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:25 -04:00
Al Viro eba2d3d798 get rid of non-advancing variants
mechanical change; will be further massaged in subsequent commits

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:24 -04:00
Al Viro 3cf42da327 iov_iter: saner helper for page array allocation
All call sites of get_pages_array() are essenitally identical now.
Replace with common helper...

Returns number of slots available in resulting array or 0 on OOM;
it's up to the caller to make sure it doesn't ask to zero-entry
array (i.e. neither maxpages nor size are allowed to be zero).

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:22 -04:00
Al Viro 8520008417 fold __pipe_get_pages() into pipe_get_pages()
... and don't mangle maxsize there - turn the loop into counting
one instead.  Easier to see that we won't run out of array that
way.  Note that special treatment of the partial buffer in that
thing is an artifact of the non-advancing semantics of
iov_iter_get_pages() - if not for that, it would be append_pipe(),
same as the body of the loop that follows it.  IOW, once we make
iov_iter_get_pages() advancing, the whole thing will turn into
	calculate how many pages do we want
	allocate an array (if needed)
	call append_pipe() that many times.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:21 -04:00
Al Viro 0aa4fc32f5 ITER_XARRAY: don't open-code DIV_ROUND_UP()
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:21 -04:00
Al Viro 451c0ba947 unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts
same as for pipes and xarrays; after that iov_iter_get_pages() becomes
a wrapper for __iov_iter_get_pages_alloc().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:21 -04:00
Al Viro 68fe506f37 unify xarray_get_pages() and xarray_get_pages_alloc()
same as for pipes

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:20 -04:00
Al Viro acbdeb8320 unify pipe_get_pages() and pipe_get_pages_alloc()
The differences between those two are
* pipe_get_pages() gets a non-NULL struct page ** value pointing to
preallocated array + array size.
* pipe_get_pages_alloc() gets an address of struct page ** variable that
contains NULL, allocates the array and (on success) stores its address in
that variable.

	Not hard to combine - always pass struct page ***, have
the previous pipe_get_pages_alloc() caller pass ~0U as cap for
array size.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:20 -04:00
Al Viro c81ce28df5 iov_iter_get_pages(): sanity-check arguments
zero maxpages is bogus, but best treated as "just return 0";
NULL pages, OTOH, should be treated as a hard bug.

get rid of now completely useless checks in xarray_get_pages{,_alloc}().

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:20 -04:00
Al Viro 91329559eb iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper
Incidentally, ITER_XARRAY did *not* free the sucker in case when
iter_xarray_populate_pages() returned 0...

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:19 -04:00
Al Viro 12d426ab64 ITER_PIPE: fold data_start() and pipe_space_for_user() together
All their callers are next to each other; all of them
want the total amount of pages and, possibly, the
offset in the partial final buffer.

Combine into a new helper (pipe_npages()), fix the
bogosity in pipe_space_for_user(), while we are at it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:19 -04:00
Al Viro 10f525a8cd ITER_PIPE: cache the type of last buffer
We often need to find whether the last buffer is anon or not, and
currently it's rather clumsy:
	check if ->iov_offset is non-zero (i.e. that pipe is not empty)
	if so, get the corresponding pipe_buffer and check its ->ops
	if it's &default_pipe_buf_ops, we have an anon buffer.

Let's replace the use of ->iov_offset (which is nowhere near similar to
its role for other flavours) with signed field (->last_offset), with
the following rules:
	empty, no buffers occupied:		0
	anon, with bytes up to N-1 filled:	N
	zero-copy, with bytes up to N-1 filled:	-N

That way abs(i->last_offset) is equal to what used to be in i->iov_offset
and empty vs. anon vs. zero-copy can be distinguished by the sign of
i->last_offset.

	Checks for "should we extend the last buffer or should we start
a new one?" become easier to follow that way.

	Note that most of the operations can only be done in a sane
state - i.e. when the pipe has nothing past the current position of
iterator.  About the only thing that could be done outside of that
state is iov_iter_advance(), which transitions to the sane state by
truncating the pipe.  There are only two cases where we leave the
sane state:
	1) iov_iter_get_pages()/iov_iter_get_pages_alloc().  Will be
dealt with later, when we make get_pages advancing - the callers are
actually happier that way.
	2) iov_iter copied, then something is put into the copy.  Since
they share the underlying pipe, the original gets behind.  When we
decide that we are done with the copy (original is not usable until then)
we advance the original.  direct_io used to be done that way; nowadays
it operates on the original and we do iov_iter_revert() to discard
the excessive data.  At the moment there's nothing in the kernel that
could do that to ITER_PIPE iterators, so this reason for insane state
is theoretical right now.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:18 -04:00
Al Viro 92acdc4f37 ITER_PIPE: clean iov_iter_revert()
Fold pipe_truncate() into it, clean up.  We can release buffers
in the same loop where we walk backwards to the iterator beginning
looking for the place where the new position will be.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:18 -04:00
Al Viro 2c855de933 ITER_PIPE: clean pipe_advance() up
instead of setting ->iov_offset for new position and calling
pipe_truncate() to adjust ->len of the last buffer and discard
everything after it, adjust ->len at the same time we set ->iov_offset
and use pipe_discard_from() to deal with buffers past that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:18 -04:00
Al Viro ca59196754 ITER_PIPE: lose iter_head argument of __pipe_get_pages()
it's only used to get to the partial buffer we can add to,
and that's always the last one, i.e. pipe->head - 1.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:17 -04:00
Al Viro e3b42964f8 ITER_PIPE: fold push_pipe() into __pipe_get_pages()
Expand the only remaining call of push_pipe() (in
__pipe_get_pages()), combine it with the page-collecting loop there.

Note that the only reason it's not a loop doing append_pipe() is
that append_pipe() is advancing, while iov_iter_get_pages() is not.
As soon as it switches to saner semantics, this thing will switch
to using append_pipe().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:17 -04:00
Al Viro 8fad7767ed ITER_PIPE: allocate buffers as we go in copy-to-pipe primitives
New helper: append_pipe().  Extends the last buffer if possible,
allocates a new one otherwise.  Returns page and offset in it
on success, NULL on failure.  iov_iter is advanced past the
data we've got.

Use that instead of push_pipe() in copy-to-pipe primitives;
they get simpler that way.  Handling of short copy (in "mc" one)
is done simply by iov_iter_revert() - iov_iter is in consistent
state after that one, so we can use that.

[Fix for braino caught by Liu Xinpeng <liuxp11@chinatelecom.cn> folded in]
[another braino fix, this time in copy_pipe_to_iter() and pipe_zero();
caught by testcase from Hugh Dickins]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:17 -04:00
Al Viro 47b7fcae41 ITER_PIPE: helpers for adding pipe buffers
There are only two kinds of pipe_buffer in the area used by ITER_PIPE.

1) anonymous - copy_to_iter() et.al. end up creating those and copying
data there.  They have zero ->offset, and their ->ops points to
default_pipe_page_ops.

2) zero-copy ones - those come from copy_page_to_iter(), and page
comes from caller.  ->offset is also caller-supplied - it might be
non-zero.  ->ops points to page_cache_pipe_buf_ops.

Move creation and insertion of those into helpers - push_anon(pipe, size)
and push_page(pipe, page, offset, size) resp., separating them from
the "could we avoid creating a new buffer by merging with the current
head?" logics.

Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:16 -04:00
Al Viro 2dcedb2a54 ITER_PIPE: helper for getting pipe buffer by index
pipe_buffer instances of a pipe are organized as a ring buffer,
with power-of-2 size.  Indices are kept *not* reduced modulo ring
size, so the buffer refered to by index N is
	pipe->bufs[N & (pipe->ring_size - 1)].

Ring size can change over the lifetime of a pipe, but not while
the pipe is locked.  So for any iov_iter primitives it's a constant.
Original conversion of pipes to this layout went overboard trying
to microoptimize that - calculating pipe->ring_size - 1, storing
it in a local variable and using through the function.  In some
cases it might be warranted, but most of the times it only
obfuscates what's going on in there.

Introduce a helper (pipe_buf(pipe, N)) that would encapsulate
that and use it in the obvious cases.  More will follow...

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:16 -04:00
Al Viro fcb14cb1bd new iov_iter flavour - ITER_UBUF
Equivalent of single-segment iovec.  Initialized by iov_iter_ubuf(),
checked for by iter_is_ubuf(), otherwise behaves like ITER_IOVEC
ones.

We are going to expose the things like ->write_iter() et.al. to those
in subsequent commits.

New predicate (user_backed_iter()) that is true for ITER_IOVEC and
ITER_UBUF; places like direct-IO handling should use that for
checking that pages we modify after getting them from iov_iter_get_pages()
would need to be dirtied.

DO NOT assume that replacing iter_is_iovec() with user_backed_iter()
will solve all problems - there's code that uses iter_is_iovec() to
decide how to poke around in iov_iter guts and for that the predicate
replacement obviously won't suffice.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:15 -04:00
Linus Torvalds d9b58ab789 backportable fix for copy_to_iter_mc() - the second part of
iov_iter work will pretty much overwrite that one, but it's
 much harder to backport.
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYurHhgAKCRBZ7Krx/gZQ
 6+cgAP9XLVFQn0UE1e1JTOitdC/pqjBDMUDQ7SDazrmiVWoMcAEAm+PJSl7GVvMi
 fo3fIlaullRT3DjCidAmbbaOOZqkmgA=
 =z+N2
 -----END PGP SIGNATURE-----

Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull copy_to_iter_mc fix from Al Viro:
 "Backportable fix for copy_to_iter_mc() - the second part of iov_iter
  work will pretty much overwrite this, but would be much harder to
  backport"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix short copy handling in copy_mc_pipe_to_iter()
2022-08-03 13:59:15 -07:00
Linus Torvalds 5264406cdb iov_iter work, part 1 - isolated cleanups and optimizations.
One of the goals is to reduce the overhead of using ->read_iter()
 and ->write_iter() instead of ->read()/->write(); new_sync_{read,write}()
 has a surprising amount of overhead, in particular inside iocb_flags().
 That's why the beginning of the series is in this pile; it's not directly
 iov_iter-related, but it's a part of the same work...
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYurGOQAKCRBZ7Krx/gZQ
 6ysyAP91lvBfMRepcxpd9kvtuzWkU8A3rfSziZZteEHANB9Q7QEAiPn2a2OjWkcZ
 uAyUWfCkHCNx+dSMkEvUgR5okQ0exAM=
 =9UCV
 -----END PGP SIGNATURE-----

Merge tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs iov_iter updates from Al Viro:
 "Part 1 - isolated cleanups and optimizations.

  One of the goals is to reduce the overhead of using ->read_iter() and
  ->write_iter() instead of ->read()/->write().

  new_sync_{read,write}() has a surprising amount of overhead, in
  particular inside iocb_flags(). That's the explanation for the
  beginning of the series is in this pile; it's not directly
  iov_iter-related, but it's a part of the same work..."

* tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  first_iovec_segment(): just return address
  iov_iter: massage calling conventions for first_{iovec,bvec}_segment()
  iov_iter: first_{iovec,bvec}_segment() - simplify a bit
  iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment()
  iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT
  iov_iter_bvec_advance(): don't bother with bvec_iter
  copy_page_{to,from}_iter(): switch iovec variants to generic
  keep iocb_flags() result cached in struct file
  iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC
  struct file: use anonymous union member for rcuhead and llist
  btrfs: use IOMAP_DIO_NOSYNC
  teach iomap_dio_rw() to suppress dsync
  No need of likely/unlikely on calls of check_copy_size()
2022-08-03 13:50:22 -07:00