Commit Graph

64 Commits

Author SHA1 Message Date
Matthew Wilcox (Oracle) 2a0774c288 XArray: set the marks correctly when splitting an entry
If we created a new node to replace an entry which had search marks set,
we were setting the search mark on every entry in that node.  That works
fine when we're splitting to order 0, but when splitting to a larger
order, we must not set the search marks on the sibling entries.

Link: https://lkml.kernel.org/r/20240501153120.4094530-1-willy@infradead.org
Fixes: c010d47f10 ("mm: thp: split huge page to any lower order pages")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/ZjFGCOYk3FK_zVy3@bombadil.infradead.org
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05 17:28:08 -07:00
Philipp Stanner e7716c74e3 xarray: Document necessary flag in alloc functions
Adds a new line to the docstrings of functions wrapping __xa_alloc() and
__xa_alloc_cyclic(), informing about the necessity of flag XA_FLAGS_ALLOC
being set previously.

The documentation so far says that functions wrapping __xa_alloc() and
__xa_alloc_cyclic() are supposed to return either -ENOMEM or -EBUSY in
case of an error. If the xarray has been initialized without the flag
XA_FLAGS_ALLOC, however, they fail with a different, undocumented error
code.

As hinted at in Documentation/core-api/xarray.rst, wrappers around these
functions should only be invoked when the flag has been set. The
functions' documentation should reflect that as well.

Signed-off-by: Philipp Stanner <pstanner@redhat.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2023-09-05 19:01:38 -04:00
Matthew Wilcox (Oracle) cbc0285433 XArray: Do not return sibling entries from xa_load()
It is possible for xa_load() to observe a sibling entry pointing to
another sibling entry.  An example:

Thread A:		Thread B:
			xa_store_range(xa, entry, 188, 191, gfp);
xa_load(xa, 191);
entry = xa_entry(xa, node, 63);
[entry is a sibling of 188]
			xa_store_range(xa, entry, 184, 191, gfp);
if (xa_is_sibling(entry))
offset = xa_to_sibling(entry);
entry = xa_entry(xas->xa, node, offset);
[entry is now a sibling of 184]

It is sufficient to go around this loop until we hit a non-sibling entry.
Sibling entries always point earlier in the node, so we are guaranteed
to terminate this search.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Fixes: 6b24ca4a1a ("mm: Use multi-index entries in the page cache")
Cc: stable@vger.kernel.org
2023-07-28 15:37:45 -04:00
Arnd Bergmann bde1597d0f radix-tree: move declarations to header
The xarray.c file contains the only call to radix_tree_node_rcu_free(),
and it comes with its own extern declaration for it.  This means the
function definition causes a missing-prototype warning:

lib/radix-tree.c:288:6: error: no previous prototype for 'radix_tree_node_rcu_free' [-Werror=missing-prototypes]

Instead, move the declaration for this function to a new header that can
be included by both, and do the same for the radix_tree_node_cachep
variable that has the same underlying problem but does not cause a warning
with gcc.

[zhangpeng.00@bytedance.com: fix building radix tree test suite]
  Link: https://lkml.kernel.org/r/20230521095450.21332-1-zhangpeng.00@bytedance.com
Link: https://lkml.kernel.org/r/20230516194212.548910-1-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:50 -07:00
Matthew Wilcox (Oracle) 69a37a8ba1 mm/huge_memory: Fix xarray node memory leak
If xas_split_alloc() fails to allocate the necessary nodes to complete the
xarray entry split, it sets the xa_state to -ENOMEM, which xas_nomem()
then interprets as "Please allocate more memory", not as "Please free
any unnecessary memory" (which was the intended outcome).  It's confusing
to use xas_nomem() to free memory in this context, so call xas_destroy()
instead.

Reported-by: syzbot+9e27a75a8c24f3fe75c1@syzkaller.appspotmail.com
Fixes: 6b24ca4a1a ("mm: Use multi-index entries in the page cache")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-06-09 16:24:25 -04:00
Matthew Wilcox (Oracle) 63b1898fff XArray: Disallow sibling entries of nodes
There is a race between xas_split() and xas_load() which can result in
the wrong page being returned, and thus data corruption.  Fortunately,
it's hard to hit (syzbot took three months to find it) and often guarded
with VM_BUG_ON().

The anatomy of this race is:

thread A			thread B
order-9 page is stored at index 0x200
				lookup of page at index 0x274
page split starts
				load of sibling entry at offset 9
stores nodes at offsets 8-15
				load of entry at offset 8

The entry at offset 8 turns out to be a node, and so we descend into it,
and load the page at index 0x234 instead of 0x274.  This is hard to fix
on the split side; we could replace the entire node that contains the
order-9 page instead of replacing the eight entries.  Fixing it on
the lookup side is easier; just disallow sibling entries that point
to nodes.  This cannot ever be a useful thing as the descent would not
know the correct offset to use within the new node.

The test suite continues to pass, but I have not added a new test for
this bug.

Reported-by: syzbot+cf4cf13056f85dec2c40@syzkaller.appspotmail.com
Tested-by: syzbot+cf4cf13056f85dec2c40@syzkaller.appspotmail.com
Fixes: 6b24ca4a1a ("mm: Use multi-index entries in the page cache")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-04-22 15:35:40 -04:00
Linus Torvalds 5a3fe95d76 XArray update for 5.18:
- Documentation update
  - Fix test-suite build after move of bitmap.h
  - Fix xas_create_range() when a large entry is already present
  - Fix xas_split() of a shadow entry
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmJHBfoACgkQDpNsjXcp
 gj4eGggAlBsHZCBDT1wY45hQjaZA+GlI1Q7M8/x+MkaK3CN6O3FMdNcbUx/KVkMJ
 YItwoh9X5VywsMD4ASxPqT/3t2lJFV7ldNvwQpLr1eVSP34XsVxprYDgT09a/CXS
 JEwLoyy18FMCZJTWPdszGvazrtAaQmvEMwcz3Y9km93qVx5o+dvninGsKWfOuu+O
 b/+VIv0wHG0RfsXVrC10BfzMlqe50YMrLOWVrb66+XDdjtITeZ2M7PXRtsa5iOtG
 TDFzngSrOl59gqqhvDrhZOHY2S+wJnuCaXiG6w6rBLDRucZ5p2x4WWYeqtZGQlDk
 nLi6wMAp3fTt6+JlbXPtT01RHWZEyw==
 =xrXd
 -----END PGP SIGNATURE-----

Merge tag 'xarray-5.18' of git://git.infradead.org/users/willy/xarray

Pull XArray updates from Matthew Wilcox:

 - Documentation update

 - Fix test-suite build after move of bitmap.h

 - Fix xas_create_range() when a large entry is already present

 - Fix xas_split() of a shadow entry

* tag 'xarray-5.18' of git://git.infradead.org/users/willy/xarray:
  XArray: Update the LRU list in xas_split()
  XArray: Fix xas_create_range() when multi-order entry present
  XArray: Include bitmap.h from xarray.h
  XArray: Document the locking requirement for the xa_state
2022-04-01 13:40:44 -07:00
Matthew Wilcox (Oracle) 3ed4bb7715 XArray: Update the LRU list in xas_split()
When splitting a value entry, we may need to add the new nodes to the LRU
list and remove the parent node from the LRU list.  The WARN_ON checks
in shadow_lru_isolate() catch this oversight.  This bug was latent
until we stopped splitting folios in shrink_page_list() with commit
820c4e2e6f ("mm/vmscan: Free non-shmem folios without splitting them").
That allows the creation of large shadow entries, and subsequently when
trying to page in a small page, we will split the large shadow entry
in __filemap_add_folio().

Fixes: 8fc75643c5 ("XArray: add xas_split")
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-31 08:52:57 -04:00
Matthew Wilcox (Oracle) 3e3c658055 XArray: Fix xas_create_range() when multi-order entry present
If there is already an entry present that is of order >= XA_CHUNK_SHIFT
when we call xas_create_range(), xas_create_range() will misinterpret
that entry as a node and dereference xa_node->parent, generally leading
to a crash that looks something like this:

general protection fault, probably for non-canonical address 0xdffffc0000000001:
0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
CPU: 0 PID: 32 Comm: khugepaged Not tainted 5.17.0-rc8-syzkaller-00003-g56e337f2cf13 #0
RIP: 0010:xa_parent_locked include/linux/xarray.h:1207 [inline]
RIP: 0010:xas_create_range+0x2d9/0x6e0 lib/xarray.c:725

It's deterministically reproducable once you know what the problem is,
but producing it in a live kernel requires khugepaged to hit a race.
While the problem has been present since xas_create_range() was
introduced, I'm not aware of a way to hit it before the page cache was
converted to use multi-index entries.

Fixes: 6b24ca4a1a ("mm: Use multi-index entries in the page cache")
Reported-by: syzbot+0d2b0bf32ca5cfd09f2e@syzkaller.appspotmail.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-28 19:25:11 -04:00
Muchun Song 9bbdc0f324 xarray: use kmem_cache_alloc_lru to allocate xa_node
The workingset will add the xa_node to the shadow_nodes list.  So the
allocation of xa_node should be done by kmem_cache_alloc_lru().  Using
xas_set_lru() to pass the list_lru which we want to insert xa_node into to
set up the xa_node reclaim context correctly.

Link: https://lkml.kernel.org/r/20220228122126.37293-9-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Alex Shi <alexs@kernel.org>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Kari Argillander <kari.argillander@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22 15:57:03 -07:00
Matthew Wilcox (Oracle) 25a8de7f8d XArray: Add xas_advance()
Add a new helper function to help iterate over multi-index entries.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-08 00:28:41 -05:00
Matthew Wilcox (Oracle) 3012110d71 XArray: Fix splitting to non-zero orders
Splitting an order-4 entry into order-2 entries would leave the array
containing pointers to 000040008000c000 instead of 000044448888cccc.
This is a one-character fix, but enhance the test suite to check this
case.

Reported-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2021-03-30 13:42:33 -04:00
Matthew Wilcox (Oracle) 12efebab09 XArray: Fix split documentation
I wrote the documentation backwards; the new order of the entry is stored
in the xas and the caller passes the old entry.

Reported-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2021-03-30 13:42:33 -04:00
Linus Torvalds c4d6fe7311 XArray updates for 5.9
- Fix the test suite after introduction of the local_lock
  - Fix a bug in the IDA spotted by Coverity
  - Change the API that allows the workingset code to delete a node
  - Fix xas_reload() when dealing with entries that occupy multiple indices
  - Add a few more tests to the test suite
  - Fix an unsigned int being shifted into an unsigned long
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAl+OzzAACgkQDpNsjXcp
 gj5YFgf/cV99dyPaal7AfMwhVwFcuVjIRH4S/VeOHkjS2QT1lpu3ffqfKALVR8vU
 3IObM3oDCmLk0mYz9O+V/udVJoBYWiduI0LZhR6+V5ZrDjbw/d4VdCbwOplpeF5x
 rntyI9r8f5d4LxBJ/moLjsosc1KfCzyVnV389eZRvZ8Muxuyc73WdAwZZZfD79nY
 66gScEXQokU99zqJJ1nWfh05XTcTsKF25fVBGMLZTUBAytoFyPuC/kO2z8Uq9lEi
 Ug6gDClskSB7A2W5gvprMcoUAVYcHfTb0wqJD5/MhkHyoTdcWdW8Re0kssXvD86V
 KwlBdYQ/JuskgY/hbynZ/FP3p8+t1Q==
 =12E/
 -----END PGP SIGNATURE-----

Merge tag 'xarray-5.9' of git://git.infradead.org/users/willy/xarray

Pull XArray updates from Matthew Wilcox:

 - Fix the test suite after introduction of the local_lock

 - Fix a bug in the IDA spotted by Coverity

 - Change the API that allows the workingset code to delete a node

 - Fix xas_reload() when dealing with entries that occupy multiple
   indices

 - Add a few more tests to the test suite

 - Fix an unsigned int being shifted into an unsigned long

* tag 'xarray-5.9' of git://git.infradead.org/users/willy/xarray:
  XArray: Fix xas_create_range for ranges above 4 billion
  radix-tree: fix the comment of radix_tree_next_slot()
  XArray: Fix xas_reload for multi-index entries
  XArray: Add private interface for workingset node deletion
  XArray: Fix xas_for_each_conflict documentation
  XArray: Test marked multiorder iterations
  XArray: Test two more things about xa_cmpxchg
  ida: Free allocated bitmap in error path
  radix tree test suite: Fix compilation
2020-10-20 14:39:37 -07:00
Matthew Wilcox (Oracle) 8fc75643c5 XArray: add xas_split
In order to use multi-index entries for huge pages in the page cache, we
need to be able to split a multi-index entry (eg if a file is truncated in
the middle of a huge page entry).  This version does not support splitting
more than one level of the tree at a time.  This is an acceptable
limitation for the page cache as we do not expect to support order-12
pages in the near future.

[akpm@linux-foundation.org: export xas_split_alloc() to modules]
[willy@infradead.org: fix xarray split]
  Link: https://lkml.kernel.org/r/20200910175450.GV6583@casper.infradead.org
[willy@infradead.org: fix xarray]
  Link: https://lkml.kernel.org/r/20201001233943.GW20115@casper.infradead.org

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Qian Cai <cai@lca.pw>
Cc: Song Liu <songliubraving@fb.com>
Link: https://lkml.kernel.org/r/20200903183029.14930-3-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:15 -07:00
Matthew Wilcox (Oracle) 57417cebc9 XArray: add xa_get_order
Patch series "Fix read-only THP for non-tmpfs filesystems".

As described more verbosely in the [3/3] changelog, we can inadvertently
put an order-0 page in the page cache which occupies 512 consecutive
entries.  Users are running into this if they enable the
READ_ONLY_THP_FOR_FS config option; see
https://bugzilla.kernel.org/show_bug.cgi?id=206569 and Qian Cai has also
reported it here:
https://lore.kernel.org/lkml/20200616013309.GB815@lca.pw/

This is a rather intrusive way of fixing the problem, but has the
advantage that I've actually been testing it with the THP patches, which
means that it sees far more use than it does upstream -- indeed, Song has
been entirely unable to reproduce it.  It also has the advantage that it
removes a few patches from my gargantuan backlog of THP patches.

This patch (of 3):

This function returns the order of the entry at the index.  We need this
because there isn't space in the shadow entry to encode its order.

[akpm@linux-foundation.org: export xa_get_order to modules]

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Qian Cai <cai@lca.pw>
Cc: Song Liu <songliubraving@fb.com>
Link: https://lkml.kernel.org/r/20200903183029.14930-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20200903183029.14930-2-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:15 -07:00
Matthew Wilcox (Oracle) 84c34df158 XArray: Fix xas_create_range for ranges above 4 billion
The 'sibs' variable would be shifted as a 32-bit integer, so if 'shift'
is more than 32, this is undefined behaviour.  In practice, this doesn't
happen because the page cache is the only user and nobody uses 16TB pages.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2020-10-13 08:53:29 -04:00
Matthew Wilcox (Oracle) f82cd2f0b5 XArray: Add private interface for workingset node deletion
Move the tricky bits of dealing with the XArray from the workingset
code to the XArray.  Make it clear in the documentation that this is a
private interface, and only export it for the benefit of the test suite.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2020-10-13 08:41:26 -04:00
Matthew Wilcox (Oracle) 7e934cf5ac xarray: Fix early termination of xas_for_each_marked
xas_for_each_marked() is using entry == NULL as a termination condition
of the iteration. When xas_for_each_marked() is used protected only by
RCU, this can however race with xas_store(xas, NULL) in the following
way:

TASK1                                   TASK2
page_cache_delete()         	        find_get_pages_range_tag()
                                          xas_for_each_marked()
                                            xas_find_marked()
                                              off = xas_find_chunk()

  xas_store(&xas, NULL)
    xas_init_marks(&xas);
    ...
    rcu_assign_pointer(*slot, NULL);
                                              entry = xa_entry(off);

And thus xas_for_each_marked() terminates prematurely possibly leading
to missed entries in the iteration (translating to missing writeback of
some pages or a similar problem).

If we find a NULL entry that has been marked, skip it (unless we're trying
to allocate an entry).

Reported-by: Jan Kara <jack@suse.cz>
CC: stable@vger.kernel.org
Fixes: ef8e5717db ("page cache: Convert delete_batch to XArray")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2020-03-12 17:42:08 -04:00
Matthew Wilcox (Oracle) d8e93e3f22 XArray: Optimise xas_sibling() if !CONFIG_XARRAY_MULTI
If CONFIG_XARRAY_MULTI is disabled, then xas_sibling() must be false.

Reported-by: JaeJoon Jung <rgbi3307@gmail.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2020-02-27 07:37:40 -05:00
Matthew Wilcox (Oracle) c36d451ad3 XArray: Fix xas_pause for large multi-index entries
Inspired by the recent Coverity report, I looked for other places where
the offset wasn't being converted to an unsigned long before being
shifted, and I found one in xas_pause() when the entry being paused is
of order >32.

Fixes: b803b42823 ("xarray: Add XArray iterators")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: stable@vger.kernel.org
2020-01-31 15:09:49 -05:00
Matthew Wilcox (Oracle) bd40b17ca4 XArray: Fix xa_find_next for large multi-index entries
Coverity pointed out that xas_sibling() was shifting xa_offset without
promoting it to an unsigned long first, so the shift could cause an
overflow and we'd get the wrong answer.  The fix is obvious, and the
new test-case provokes UBSAN to report an error:
runtime error: shift exponent 60 is too large for 32-bit type 'int'

Fixes: 19c30f4dd0 ("XArray: Fix xa_find_after with multi-index entries")
Reported-by: Bjorn Helgaas <bhelgaas@google.com>
Reported-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: stable@vger.kernel.org
2020-01-31 15:09:36 -05:00
Matthew Wilcox (Oracle) c44aa5e8ab XArray: Fix xas_find returning too many entries
If you call xas_find() with the initial index > max, it should have
returned NULL but was returning the entry at index.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: stable@vger.kernel.org
2020-01-17 22:33:33 -05:00
Matthew Wilcox (Oracle) 19c30f4dd0 XArray: Fix xa_find_after with multi-index entries
If the entry is of an order which is a multiple of XA_CHUNK_SIZE,
the current detection of sibling entries does not work.  Factor out
an xas_sibling() function to make xa_find_after() a little more
understandable, and write a new implementation that doesn't suffer from
the same bug.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: stable@vger.kernel.org
2020-01-17 22:33:27 -05:00
Matthew Wilcox (Oracle) 430f24f94c XArray: Fix infinite loop with entry at ULONG_MAX
If there is an entry at ULONG_MAX, xa_for_each() will overflow the
'index + 1' in xa_find_after() and wrap around to 0.  Catch this case
and terminate the loop by returning NULL.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: stable@vger.kernel.org
2020-01-17 22:32:24 -05:00
Matthew Wilcox (Oracle) 82a22311b7 XArray: Fix xas_pause at ULONG_MAX
If we were unlucky enough to call xas_pause() when the index was at
ULONG_MAX (or a multi-slot entry which ends at ULONG_MAX), we would
wrap the index back around to 0 and restart the iteration from the
beginning.  Use the XAS_BOUNDS state to indicate that we should just
stop the iteration.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2019-11-08 23:48:40 -05:00
Matthew Wilcox (Oracle) 91abab8383 XArray: Fix xas_next() with a single entry at 0
If there is only a single entry at 0, the first time we call xas_next(),
we return the entry.  Unfortunately, all subsequent times we call
xas_next(), we also return the entry at 0 instead of noticing that the
xa_index is now greater than zero.  This broke find_get_pages_contig().

Fixes: 64d3e9a9e0 ("xarray: Step through an XArray")
Reported-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2019-07-01 17:11:16 -04:00
Johannes Weiner 7b785645e8 mm: fix page cache convergence regression
Since a283348629 ("page cache: Finish XArray conversion"), on most
major Linux distributions, the page cache doesn't correctly transition
when the hot data set is changing, and leaves the new pages thrashing
indefinitely instead of kicking out the cold ones.

On a freshly booted, freshly ssh'd into virtual machine with 1G RAM
running stock Arch Linux:

[root@ham ~]# ./reclaimtest.sh
+ dd of=workingset-a bs=1M count=0 seek=600
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ ./mincore workingset-a
153600/153600 workingset-a
+ dd of=workingset-b bs=1M count=0 seek=600
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
104029/153600 workingset-a
120086/153600 workingset-b
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
104029/153600 workingset-a
120268/153600 workingset-b

workingset-b is a 600M file on a 1G host that is otherwise entirely
idle. No matter how often it's being accessed, it won't get cached.

While investigating, I noticed that the non-resident information gets
aggressively reclaimed - /proc/vmstat::workingset_nodereclaim. This is
a problem because a workingset transition like this relies on the
non-resident information tracked in the page cache tree of evicted
file ranges: when the cache faults are refaults of recently evicted
cache, we challenge the existing active set, and that allows a new
workingset to establish itself.

Tracing the shrinker that maintains this memory revealed that all page
cache tree nodes were allocated to the root cgroup. This is a problem,
because 1) the shrinker sizes the amount of non-resident information
it keeps to the size of the cgroup's other memory and 2) on most major
Linux distributions, only kernel threads live in the root cgroup and
everything else gets put into services or session groups:

[root@ham ~]# cat /proc/self/cgroup
0::/user.slice/user-0.slice/session-c1.scope

As a result, we basically maintain no non-resident information for the
workloads running on the system, thus breaking the caching algorithm.

Looking through the code, I found the culprit in the above-mentioned
patch: when switching from the radix tree to xarray, it dropped the
__GFP_ACCOUNT flag from the tree node allocations - the flag that
makes sure the allocated memory gets charged to and tracked by the
cgroup of the calling process - in this case, the one doing the fault.

To fix this, allow xarray users to specify per-tree flag that makes
xarray allocate nodes using __GFP_ACCOUNT. Then restore the page cache
tree annotation to request such cgroup tracking for the cache nodes.

With this patch applied, the page cache correctly converges on new
workingsets again after just a few iterations:

[root@ham ~]# ./reclaimtest.sh
+ dd of=workingset-a bs=1M count=0 seek=600
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ ./mincore workingset-a
153600/153600 workingset-a
+ dd of=workingset-b bs=1M count=0 seek=600
+ cat workingset-b
+ ./mincore workingset-a workingset-b
124607/153600 workingset-a
87876/153600 workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
81313/153600 workingset-a
133321/153600 workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
63036/153600 workingset-a
153600/153600 workingset-b

Cc: stable@vger.kernel.org # 4.20+
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2019-05-31 13:52:41 -04:00
Matthew Wilcox 4a5c8d8989 XArray: Fix xa_reserve for 2-byte aligned entries
If we reserve index 0, the next entry to be stored there might be 2-byte
aligned.  That means we have to create the root xa_node at the time of
reserving the initial entry.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2019-02-21 17:54:44 -05:00
Matthew Wilcox 2fbe967b3e XArray: Fix xa_erase of 2-byte aligned entries
xas_store() was interpreting the entry it found in the array as a node
entry if the bottom two bits had value 2.  That's only true if either
the entry is in the root node or in a non-leaf node.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2019-02-21 17:36:45 -05:00
Matthew Wilcox 962033d55d XArray: Use xa_cmpxchg to implement xa_reserve
Jason feels this is clearer, and it saves a function and an exported
symbol.

Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
2019-02-20 17:08:54 -05:00
Matthew Wilcox b38f6c5027 XArray: Fix xa_release in allocating arrays
xa_cmpxchg() was a little too magic in turning ZERO entries into NULL,
and would leave the entry set to the ZERO entry instead of releasing
it for future use.  After careful review of existing users of
xa_cmpxchg(), change the semantics so that it does not translate either
incoming argument from NULL into ZERO entries.

Add several tests to the test-suite to make sure this problem doesn't
come back.

Reported-by: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
2019-02-20 17:08:54 -05:00
Matthew Wilcox 2fa044e51a XArray: Add cyclic allocation
This differs slightly from the IDR equivalent in five ways.

1. It can allocate up to UINT_MAX instead of being limited to INT_MAX,
   like xa_alloc().  Also like xa_alloc(), it will write to the 'id'
   pointer before placing the entry in the XArray.
2. The 'next' cursor is allocated separately from the XArray instead
   of being part of the IDR.  This saves memory for all the users which
   do not use the cyclic allocation API and suits some users better.
3. It returns -EBUSY instead of -ENOSPC.
4. It will attempt to wrap back to the minimum value on memory allocation
   failure as well as on an -EBUSY error, assuming that a user would
   rather allocate a small ID than suffer an ID allocation failure.
5. It reports whether it has wrapped, which is important to some users.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2019-02-06 13:32:25 -05:00
Matthew Wilcox a3e4d3f97e XArray: Redesign xa_alloc API
It was too easy to forget to initialise the start index.  Add an
xa_limit data structure which can be used to pass min & max, and
define a couple of special values for common cases.  Also add some
more tests cribbed from the IDR test suite.  Change the return value
from -ENOSPC to -EBUSY to match xa_insert().

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2019-02-06 13:32:23 -05:00
Matthew Wilcox 3ccaf57a6a XArray: Add support for 1s-based allocation
A lot of places want to allocate IDs starting at 1 instead of 0.
While the xa_alloc() API supports this, it's not very efficient if lots
of IDs are allocated, due to having to walk down to the bottom of the
tree to see if ID 1 is available, then all the way over to the next
non-allocated ID.  This method marks ID 0 as being occupied which wastes
one slot in the XArray, but preserves xa_empty() as working.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2019-02-06 13:13:24 -05:00
Matthew Wilcox fd9dc93e36 XArray: Change xa_insert to return -EBUSY
Userspace translates EEXIST to "File exists" which isn't a very good
error message for the problem.  "Device or resource busy" is a better
indication of what went wrong.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2019-02-06 13:12:15 -05:00
Matthew Wilcox 809ab9371c XArray: Update xa_erase family descriptions
xa_erase does not allocate memory and doesn't have a gfp parameter.
Update the descriptions of all four variants to be more useful.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2019-02-04 23:16:58 -05:00
Matthew Wilcox b0606fed6e XArray: Honour reserved entries in xa_insert
xa_insert() should treat reserved entries as occupied, not as available.
Also, it should treat requests to insert a NULL pointer as a request
to reserve the slot.  Add xa_insert_bh() and xa_insert_irq() for
completeness.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2019-01-06 22:12:58 -05:00
Matthew Wilcox 76b4e52995 XArray: Permit storing 2-byte-aligned pointers
On m68k, statically allocated pointers may only be two-byte aligned.
This clashes with the XArray's method for tagging internal pointers.
Permit storing these pointers in single slots (ie not in multislots).

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2019-01-06 22:12:57 -05:00
Matthew Wilcox 02669b17a4 XArray: Turn xa_init_flags into a static inline
A regular xa_init_flags() put all dynamically-initialised XArrays into
the same locking class.  That leads to lockdep believing that taking
one XArray lock while holding another is a deadlock.  It's possible to
work around some of these situations with separate locking classes for
irq/bh/regular XArrays, and SINGLE_DEPTH_NESTING, but that's ugly, and
it doesn't work for all situations (where we have completely unrelated
XArrays).

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2019-01-06 21:24:43 -05:00
Matthew Wilcox 48483614de XArray: Fix xa_alloc when id exceeds max
Specifying a starting ID greater than the maximum ID isn't something
attempted very often, but it should fail.  It was succeeding due to
xas_find_marked() returning the wrong error state, so add tests for
both xa_alloc() and xas_find_marked().

Fixes: b803b42823 ("xarray: Add XArray iterators")
Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-12-13 14:07:33 -05:00
Matthew Wilcox 44a4a66b61 XArray: Correct xa_store_range
The explicit '64' should have been BITS_PER_LONG, but while looking at
this code I realised I meant to use __ffs(), not ilog2().

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-11-16 16:27:28 -05:00
Matthew Wilcox 804dfaf01b XArray: Fix Documentation
Minor fixes.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-11-05 16:38:10 -05:00
Matthew Wilcox d9c480435a XArray: Handle NULL pointers differently for allocation
For allocating XArrays, it makes sense to distinguish beteen erasing an
entry and storing NULL.  Storing NULL keeps the index allocated with a
NULL pointer associated with it while xa_erase() frees the index.  Some
existing IDR users rely on this ability.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-11-05 16:38:09 -05:00
Matthew Wilcox 611f318637 XArray: Unify xa_store and __xa_store
Saves around 115 bytes on a tinyconfig build and reduces the amount
of code duplication in the XArray implementation.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-11-05 16:38:09 -05:00
Matthew Wilcox 9c16bb8890 XArray: Turn xa_erase into an exported function
Make xa_erase() take the spinlock and then call __xa_erase(), but make
it out of line since it's such a common function.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-11-05 16:38:09 -05:00
Matthew Wilcox c5beb07e7a XArray: Unify xa_cmpxchg and __xa_cmpxchg
xa_cmpxchg() was one of the largest functions in the xarray
implementation.  By turning it into a wrapper and having the callers
take the lock (like several other functions), we save 160 bytes on a
tinyconfig build and reduce the duplication in xarray.c.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-11-05 16:38:08 -05:00
Matthew Wilcox 4c0608f4a0 XArray: Regularise xa_reserve
The xa_reserve() function was a little unusual in that it attempted to
be callable for all kinds of locking scenarios.  Make it look like the
other APIs with __xa_reserve, xa_reserve_bh and xa_reserve_irq variants.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-11-05 16:38:08 -05:00
Matthew Wilcox 9ee5a3b7ee XArray: Export __xa_foo to non-GPL modules
Without this, it's not possible to use static inlines like xa_store_bh()
and xa_erase_irq().

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-11-05 14:56:58 -05:00
Matthew Wilcox 8229706e03 XArray: Fix xa_for_each with a single element at 0
The following sequence of calls would result in an infinite loop in
xa_find_after():

	xa_store(xa, 0, x, GFP_KERNEL);
	index = 0;
	xa_for_each(xa, entry, index, ULONG_MAX, XA_PRESENT) { }

xa_find_after() was confusing the situation where we found no entry in
the tree with finding a multiorder entry, so it would look for the
successor entry forever.  Just check for this case explicitly.  Includes
a few new checks in the test suite to be sure this doesn't reappear.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-11-05 14:56:46 -05:00