Commit Graph

73950 Commits

Author SHA1 Message Date
Shyam Prasad N 73f9bfbe3d cifs: maintain a state machine for tcp/smb/tcon sessions
If functions like cifs_negotiate_protocol, cifs_setup_session,
cifs_tree_connect are called in parallel on different channels,
each of these will be execute the requests. This maybe unnecessary
in some cases, and only the first caller may need to do the work.

This is achieved by having more states for the tcp/smb/tcon session
status fields. And tracking the state of reconnection based on the
state machine.

For example:
for tcp connections:
CifsNew/CifsNeedReconnect ->
  CifsNeedNegotiate ->
    CifsInNegotiate ->
      CifsNeedSessSetup ->
        CifsInSessSetup ->
          CifsGood

for smb sessions:
CifsNew/CifsNeedReconnect ->
  CifsGood

for tcon:
CifsNew/CifsNeedReconnect ->
  CifsInFilesInvalidate ->
    CifsNeedTcon ->
      CifsInTcon ->
        CifsGood

If any channel reconnect sees that it's in the middle of
transition to CifsGood, then they can skip the function.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-07 20:09:22 -06:00
Enzo Matsumiya 1913e1116a cifs: fix hang on cifs_get_next_mid()
Mount will hang if using SMB1 and DFS.

This is because every call to get_next_mid() will, unconditionally,
mark tcpStatus to CifsNeedReconnect before even establishing the
initial connect, because "reconnect" variable was not initialized.

Initializing "reconnect" to false fix this issue.

Fixes: 220c5bc25d87 ("cifs: take cifs_tcp_ses_lock for status checks")
Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-07 20:07:11 -06:00
Shyam Prasad N 080dc5e565 cifs: take cifs_tcp_ses_lock for status checks
While checking/updating status for tcp ses, smb ses or tcon,
we take GlobalMid_Lock. This doesn't make any sense.
Replaced it with cifs_tcp_ses_lock.

Ideally, we should take a spin lock per struct.
But since tcp ses, smb ses and tcon objects won't add up to a lot,
I think there should not be too much contention.

Also, in few other places, these are checked without locking.
Added locking for these.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-07 20:07:07 -06:00
Shyam Prasad N 183eea2ee5 cifs: reconnect only the connection and not smb session where possible
With the new per-channel bitmask for reconnect, we have an option to
reconnect the tcp session associated with the channel without reconnecting
the smb session. i.e. if there are still channels to operate on, we can
continue to use the smb session and tcon.

However, there are cases where it makes sense to reconnect the smb session
even when there are active channels underneath. For example for
SMB session expiry.

With this patch, we'll have an option to do either, and use the correct
option for specific cases.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-02 20:38:46 -06:00
Shyam Prasad N 2e0fa298d1 cifs: add WARN_ON for when chan_count goes below minimum
chan_count keeps track of the total number of channels.
Since at least the primary channel will always be connected,
this value can never go below 1. Warn if that happens.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-02 20:38:46 -06:00
Shyam Prasad N 66eb0c6e66 cifs: adjust DebugData to use chans_need_reconnect for conn status
Use ses->chans_need_reconnect bitmask to print the connection
status of each channel under an SMB session.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-02 20:38:46 -06:00
Shyam Prasad N f486ef8e20 cifs: use the chans_need_reconnect bitmap for reconnect status
We use the concept of "binding" when one of the secondary channel
is in the process of connecting/reconnecting to the server. Till this
binding process completes, and the channel is bound to an existing session,
we redirect traffic from other established channels on the binding channel,
effectively blocking all traffic till individual channels get reconnected.

With my last set of commits, we can get rid of this binding serialization.
We now have a bitmap of connection states for each channel. We will use
this bitmap instead for tracking channel status.

Having a bitmap also now enables us to keep the session alive, as long
as even a single channel underneath is alive.

Unfortunately, this also meant that we need to supply the tcp connection
info for the channel during all negotiate and session setup functions.
These changes have resulted in a slightly bigger code churn.
However, I expect perf and robustness improvements in the mchan scenario
after this change.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-02 20:38:46 -06:00
Shyam Prasad N d1a931ce2e cifs: track individual channel status using chans_need_reconnect
We needed a way to identify the channels under the smb session
which are in reconnect, so that the traffic to other channels
can continue. So I replaced the bool need_reconnect with
a bitmask identifying all the channels that need reconnection
(named chans_need_reconnect). When a channel needs reconnection,
the bit corresponding to the index of the server in ses->chans
is used to set this bitmask. Checking if no channels or all
the channels need reconnect then becomes very easy.

Also wrote some helper macros for checking and setting the bits.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-02 20:38:46 -06:00
Colin Ian King 0b66fa776c cifs: remove redundant assignment to pointer p
The pointer p is being assigned with a value that is never read. The
pointer is being re-assigned a different value inside the do-while
loop. The assignment is redundant and can be removed.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-02 20:38:45 -06:00
Christian Brauner 012e332286 fs/mount_setattr: always cleanup mount_kattr
Make sure that finish_mount_kattr() is called after mount_kattr was
succesfully built in both the success and failure case to prevent
leaking any references we took when we built it.  We returned early if
path lookup failed thereby risking to leak an additional reference we
took when building mount_kattr when an idmapped mount was requested.

Cc: linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: 9caccd4154 ("fs: introduce MOUNT_ATTR_IDMAP")
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-30 15:12:13 -08:00
Linus Torvalds 7a29b11da9 three ksmbd fixes, all for stable as well. Two fix potential unitialized memory and one fixes a security problem where encryption is unitentionally disabled from some clients
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmG+uVcACgkQiiy9cAdy
 T1EbsQv+NRGXy7WoI4V0QQSY6Q3uWVZwKXwe8MO4+wHex7t5ZDaGDTUboExZ9gnK
 W63t/pq+hsb4Z8JI//vWdcVdVgnUexrcbHt3ANKGPPwZz1/uttIismyqLGkw+Ntc
 c99NKve+Q6LNuU1TEF+RRDd3wgrT3ZD3odlRW+xN0w6ZQ7BPMz38OlemSmUaYqUc
 fh2S3W0Vp3/gxLNfjp1GMrOqjQ8qtTZIzHtLLXHzJKGUYxtq3d77+jP36BAA2Wvn
 ufBNFO/Y94gRekKyw+KiW8+E5ceDHNits4SCI/C97txyBnBCbo7gu4TKi+ChDcek
 auDqiLrFp/GuuU2D5TuOw8GeFPNQo08O6IAkcR3ftnxcTM8rvqYPdvRNnEkgkHKH
 Fu/FWk8wXJLGlymW7bMjeA4bDs0e8z6HjqFWd8d0xeYOismcdO7OZuY4iaC73lCy
 Nv/2shlzocO68nNd0kxU7gkSPkv1LynVE5IV55DQaJpryUWe71gClldiNS/HTm4c
 OTnxbAfN
 =tOhf
 -----END PGP SIGNATURE-----

Merge tag '5.16-rc5-ksmbd-fixes' of git://git.samba.org/ksmbd

Pull ksmbd fixes from Steve French:
 "Three ksmbd fixes, all for stable as well.

  Two fix potential unitialized memory and one fixes a security problem
  where encryption is unitentionally disabled from some clients"

* tag '5.16-rc5-ksmbd-fixes' of git://git.samba.org/ksmbd:
  ksmbd: disable SMB2_GLOBAL_CAP_ENCRYPTION for SMB 3.1.1
  ksmbd: fix uninitialized symbol 'pntsd_size'
  ksmbd: fix error code in ndr_read_int32()
2021-12-23 17:15:23 -08:00
Linus Torvalds a026fa5404 io_uring-5.16-2021-12-23
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmHEmDcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpu1EEACnW4afkADzsu5L0DOYV0k2171cBQFaIeNw
 FJb3ebp6iHbWS0BBnseKKqdpFOQoGF6q8lcDdyx/cmUH7217cZTRDt1jxxXPwmGP
 i3Zy2HMh6Dq/RX2t0Be7oMEFZeRyB+FIQySvsR4JSfhOeCF+x0kg1Rvwljd4u/t/
 sEM4BEPl1ssm1YO5ZXvhnxCoX7eEhS3pbnJj5ZV/y46YTLr9crJQBYfeOHOflEvp
 iyTfK6NGy/CDR8a9DTXrYaGRozd7J8ArO+5K4qdLtNP7Tc6+bMrV/CE5l7gwWdlU
 06eeRxGt/4NQwfWD2aphRaRM5Qa/VcssMy5ikKf98UKvalzwbaAl7Hk5ywbxdQPV
 cgiHTQa1gwpY7FAw+KQwE2m0iAqaLs4eXfFNFt68yJ2NKc3tVWsMXtFskuBdz/E7
 TeoXQTEb5PwDsjHmmFbFlW2OuJajEeIJfRG0pXellOgqm8VDruvjemNDFlg80QmC
 TpF5y/dbhIht9BHTt+K5mniRZXlVfjePok04XWQgYaIvkYATbGzC4vXMAQdwfW6y
 kO6Ez7zaDjQJShDKZvni5VlXhWwWQEjm01f1tLmljPiLmZH+bwGQVin1jHgYpskW
 4mUZD2UpwNY5N5QbEfKeQ/xCxl7EQAN8WPY4HoCssK139WFHczoGTECl6/DFC4sB
 TrTMi9rRaA==
 =CQfA
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.16-2021-12-23' of git://git.kernel.dk/linux-block

Pull io_uring fix from Jens Axboe:
 "Single fix for not clearing kiocb->ki_pos back to 0 for a stream,
  destined for stable as well"

* tag 'io_uring-5.16-2021-12-23' of git://git.kernel.dk/linux-block:
  io_uring: zero iocb->ki_pos for stream file types
2021-12-23 15:32:07 -08:00
Jens Axboe 7b9762a5e8 io_uring: zero iocb->ki_pos for stream file types
io_uring supports using offset == -1 for using the current file position,
and we read that in as part of read/write command setup. For the non-iter
read/write types we pass in NULL for the position pointer, but for the
iter types we should not be passing any anything but 0 for the position
for a stream.

Clear kiocb->ki_pos if the file is a stream, don't leave it as -1. If we
do, then the request will error with -ESPIPE.

Fixes: ba04291eb6 ("io_uring: allow use of offset == -1 to mean file position")
Link: https://github.com/axboe/liburing/discussions/501
Reported-by: Samuel Williams <samuel.williams@oriontransfer.co.nz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-22 20:34:32 -07:00
Linus Torvalds 5dbdc4c565 Fixes:
- Address a buffer overrun reported by Anatoly Trosinenko
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmHArD0ACgkQM2qzM29m
 f5ddQRAAnPQnaLHGzvXkLKP76EMkFWlCPZk4pkNy1LuHRyU3F+D2Pj1qZBJpqFCa
 gzRh2N1U+zPguwsoKlPuPjHUqqU4D2Wf0DHo9xsU0gvY5B86m4bebJgpK3zXpD3m
 37gyM/UMe744D5A2Kh/yyKCEWR+8STh5/a956dCi22Z7qyEhPQDkrbk9yLBsDo9t
 NIb2rV/tvdvQvjzBd5om4Sm8QAXrdqoChK619b/T3v46shj/OX2Rqneid1deJi0U
 czrz19g/hzzvaTlrdXmFT0w9qZQ3Md/T2wDtiKtc/XoTF7ZsBHZFOwLhcDRuTXJO
 UAfXzXkf1WmhrQZOqJqHCuWFf01vdc3++8L01PXyxVupwGYS/uvdy0VNdO/u5hDr
 ibYjkrMpTKT14Q7iyPU7CCPolWipVpKt3pMwXuusCz8ky8oMQQSmpjdkofXutVnP
 NNuzx8iqW8N4Vo86Itoau7qEFM0FcxWR0Ut2F0EKiOjiS63Ccg9wxNbu3rJe/Wpb
 gpb3ICpPgyOaSUI1D0NEEibbRyiOAE+ldiDdpGHyOGgcK672jaD/5NAvD3s18FRp
 2V7Xf/vY3Qiggakopk+hEKYNh2aFXkhZmUf+rNNbhEHpgJGQ6HU1rlSw4+SfrdPK
 vKx/uDuUr5aahnjrZDaLv1sGp4Nq+m3KEjhlQGKSVi25n4xXgtc=
 =FIJ5
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fix from Chuck Lever:
 "Address a buffer overrun reported by Anatoly Trosinenko"

* tag 'nfsd-5.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: Fix READDIR buffer overflow
2021-12-21 12:02:36 -08:00
Linus Torvalds 9273d6cb99 two cifs/smb3 fixes, one fscache related, and one mount parsing related for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmG9V/0ACgkQiiy9cAdy
 T1GbQAv+N/2p7mMzhxcB91xlU6DSo2kZhC1FcmiXld3BEiMzkef6Rtb/lzW1AtBP
 kNgqjVU7MMmdFQNKLhrHUV2BNCtLWKOC2I0xTZqr+nBlGMJkiNUBzUWW8DcnLd1e
 zGky7Wck8HHu7LdnZA3TVlPhNV9KjdlHMEYBBJFlOc779a/Q0KyHLdwhP9mpSa2d
 QrVRHkBMylNUqF4WiLKQ0I8eJFhKtd7pJNAW1qGeKe2YDV6kZQiXwHOLIZb7SHET
 9Qtj83IwsQ13cLTmb7tGibj0Fn+GEEqgBv40PjSMuTCoiCpfhcLrx0lWT5hxlxao
 9w3Yw1pPPaJXbxCmL9b3Bqm3vnNqMaN7HqxsuGIFwnulGFB4GJY2l+ZQPcTHzWG0
 oCHlimlixH7RKVEfjIDhidC1C0ReZI8/1XuHPGE17Ysxt6L7dH/a7J2y7SlIdYTZ
 ZVvVT0Kib74I7ZUJH1ymc4MdAVnBIkiTGOrQk/FU+cHOyzMUpNcwXiCqkvF77QUa
 hDW4hkzE
 =MW5I
 -----END PGP SIGNATURE-----

Merge tag '5.16-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Two cifs/smb3 fixes, one fscache related, and one mount parsing
  related for stable"

* tag '5.16-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: sanitize multiple delimiters in prepath
  cifs: ignore resource_id while getting fscache super cookie
2021-12-19 11:23:02 -08:00
Chuck Lever 53b1119a6e NFSD: Fix READDIR buffer overflow
If a client sends a READDIR count argument that is too small (say,
zero), then the buffer size calculation in the new init_dirlist
helper functions results in an underflow, allowing the XDR stream
functions to write beyond the actual buffer.

This calculation has always been suspect. NFSD has never sanity-
checked the READDIR count argument, but the old entry encoders
managed the problem correctly.

With the commits below, entry encoding changed, exposing the
underflow to the pointer arithmetic in xdr_reserve_space().

Modern NFS clients attempt to retrieve as much data as possible
for each READDIR request. Also, we have no unit tests that
exercise the behavior of READDIR at the lower bound of @count
values. Thus this case was missed during testing.

Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Fixes: f5dcccd647 ("NFSD: Update the NFSv2 READDIR entry encoder to use struct xdr_stream")
Fixes: 7f87fc2d34 ("NFSD: Update NFSv3 READDIR entry encoders to use struct xdr_stream")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-12-18 17:11:06 -05:00
Linus Torvalds 1887bf5cc4 zonefs fixes for 5.16-rc6
One fix and one trivial update for rc6:
 * Add MODULE_ALIAS_FS to get automatic module loading on mount (from
   Naohiro)
 * Update Damien's email address in the MAINTAINERS file (from me).
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCYb0hkgAKCRDdoc3SxdoY
 dlciAP4lGpsiFcO7TdLY2W64EHgIkcstGx1UitsqBTR5iZnp6QD/bAfaHOaTNQDG
 Nr8GznsB3di4WRbFoV7BhBsKFcC4cgA=
 =GWY0
 -----END PGP SIGNATURE-----

Merge tag 'zonefs-5.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs

Pull zonefs fixes from Damien Le Moal:
 "One fix and one trivial update for rc6:

   - Add MODULE_ALIAS_FS to get automatic module loading on mount
     (Naohiro)

   - Update Damien's email address in the MAINTAINERS file (me)"

* tag 'zonefs-5.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
  MAITAINERS: Change zonefs maintainer email address
  zonefs: add MODULE_ALIAS_FS
2021-12-17 17:19:51 -08:00
Marcos Del Sol Vives 83912d6d55 ksmbd: disable SMB2_GLOBAL_CAP_ENCRYPTION for SMB 3.1.1
According to the official Microsoft MS-SMB2 document section 3.3.5.4, this
flag should be used only for 3.0 and 3.0.2 dialects. Setting it for 3.1.1
is a violation of the specification.

This causes my Windows 10 client to detect an anomaly in the negotiation,
and disable encryption entirely despite being explicitly enabled in ksmbd,
causing all data transfers to go in plain text.

Fixes: e2f34481b2 ("cifsd: add server-side procedures for SMB3")
Cc: stable@vger.kernel.org # v5.15
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Marcos Del Sol Vives <marcos@orca.pet>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-12-17 19:19:45 -06:00
Thiago Rafael Becker a31080899d cifs: sanitize multiple delimiters in prepath
mount.cifs can pass a device with multiple delimiters in it. This will
cause rename(2) to fail with ENOENT.

V2:
  - Make sanitize_path more readable.
  - Fix multiple delimiters between UNC and prepath.
  - Avoid a memory leak if a bad user starts putting a lot of delimiters
    in the path on purpose.

BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=2031200
Fixes: 24e0a1eff9 ("cifs: switch to new mount api")
Cc: stable@vger.kernel.org # 5.11+
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Thiago Rafael Becker <trbecker@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-12-17 19:16:49 -06:00
Shyam Prasad N b774302e88 cifs: ignore resource_id while getting fscache super cookie
We have a cyclic dependency between fscache super cookie
and root inode cookie. The super cookie relies on
tcon->resource_id, which gets populated from the root inode
number. However, fetching the root inode initializes inode
cookie as a child of super cookie, which is yet to be populated.

resource_id is only used as auxdata to check the validity of
super cookie. We can completely avoid setting resource_id to
remove the circular dependency. Since vol creation time and
vol serial numbers are used for auxdata, we should be fine.
Additionally, there will be auxiliary data check for each
inode cookie as well.

Fixes: 5bf91ef03d ("cifs: wait for tcon resource_id before getting fscache super")
CC: David Howells <dhowells@redhat.com>
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-12-17 19:09:06 -06:00
Linus Torvalds 9609134186 for-5.16-rc5-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmG8+tEACgkQxWXV+ddt
 WDuuGA/9E75ZMqsMLW5az7z8Rt5voBjPeweyRHmGCLZKpgfaj0QjrJRvu0CTKU/W
 zCSQf+ShTTY2D3cmh1eEwKyX/waKQ71qBrMX/SgIeA0OjmlhK1UGB18MF5sAVGCB
 mymVYJh7IntYJE7S7OiMUL/yILmIWZYrYT+iaPZlIc9M6h0b1gjMIsE0VEmxJMCN
 X8RAQ4CfL9bpTTKItNehSyXx+J7TB5yamh5AspaiB/ivyN1DcUcsFf3AoaWXeh2D
 YIBzq4WbGnDMfUdWXKE2rqDfQgaTXtN9ffGUvphJnegg8Tqfp29LyLZ1GU0qGSXc
 /K8g5QNmM3nhubXq2MG5zfbHPJ1H2CgnvkDqiCcyeop+09yj/ugxTt+ULaIbJL76
 pKSpcuIFXTmoW2Z7ZwijIEX4H5Dgk2l2DbE8SkJT4LjJybgpHfBT1KDQrj5iQdx+
 XgmG/CbRELuGGltJNuldp0SqIyMNRgDuv6Rheg9N9H73m9epwfH5oiM0Fj/FYyQ6
 lfgle6DQCP4xaDmk1zA9zrJHTUqi8Caeyg+tQYT6AbkoeCoXnvEAPgv9OOGe1M+C
 Ks7zeAseWs3A/j/+wCdiCKombOfR+AY3RGkPzlodUJj4YYOTyXrigtb5yhTz6Zdv
 ozVBZ71LUMMOf0NV45mGtqsiLqyfe3cnlqj1XtHQKaajyjgHvW8=
 =G7CE
 -----END PGP SIGNATURE-----

Merge tag 'for-5.16-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more fixes, almost all error handling one-liners and for stable.

   - regression fix in directory logging items

   - regression fix of extent buffer status bits handling after an error

   - fix memory leak in error handling path in tree-log

   - fix freeing invalid anon device number when handling errors during
     subvolume creation

   - fix warning when freeing leaf after subvolume creation failure

   - fix missing blkdev put in device scan error handling

   - fix invalid delayed ref after subvolume creation failure"

* tag 'for-5.16-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix missing blkdev_put() call in btrfs_scan_one_device()
  btrfs: fix warning when freeing leaf after subvolume creation failure
  btrfs: fix invalid delayed ref after subvolume creation failure
  btrfs: check WRITE_ERR when trying to read an extent buffer
  btrfs: fix missing last dir item offset update when logging directory
  btrfs: fix double free of anon_dev after failure to create subvolume
  btrfs: fix memory leak in __add_inode_ref()
2021-12-17 13:50:58 -08:00
Linus Torvalds cb29eee3b2 io_uring-5.16-2021-12-17
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmG8wLIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpncuD/wI9UnDapmyWodpXCw+f/GN0ZfuGzKAblkR
 yYJcA3spTIYHHZiR4KCKke1vM+j8PoPBTGwCuDX9EhHFq9EIEAAAFqQwbta85Rcu
 c3iKhb7qWGN2VQwqdmQDZtegdblM5usNsY3fNUA3iaGeQy+7qvYEyk0u5CDSIKCQ
 oR0NQpE/cHYlyoqUbITlCvOwD32QvcuHK2oCBnGW02FYiKq3gddJ3EPxnBkLcTUI
 zeHRCtslZTFrZexapQ/w4rAJ0O6bMR/08TYjtRDxLB4bTpYR34Hx23tvEsZk4jqE
 olpUlJDwV2B+vZ1aAKqCrabbkNQpiws7Y1EMSwM1pqqZYaPhpVeEmcqkpDNsSgY2
 s764iXPKjJnKB6XWUuuShobOjiwwk0rAyNdWKVdvzNVkEnhpNnFVmfllChOiphqM
 jHua88LHpRJpmHSDdfJMO9qT1uffixv1AIfhEoc0hBCddwp9RwqhijelbJDYIjou
 KKE7aip0rzwK90tOTFGP9lD7cUl3aFig3yQNLaJpXybm2Qa+xE//hdUk4lm1i57c
 ciEYtz8S827lCx/CzhpWoTszG11cpT249IMXyxtspoZZh44VTXYC83EeYI2cyG24
 q8zfmgMLNMYn6dc4dP5oAA/d0r7s9AY9GvyblhwmLGpjO5mANqJPZAtl2PpSYWg6
 jVh1NDEyKw==
 =IY1e
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.16-2021-12-17' of git://git.kernel.dk/linux-block

Pull io_uring fix from Jens Axboe:
 "Just a single fix, fixing an issue with the worker creation change
  that was merged last week"

* tag 'io_uring-5.16-2021-12-17' of git://git.kernel.dk/linux-block:
  io-wq: drop wqe lock before creating new worker
2021-12-17 11:31:46 -08:00
Naohiro Aota 8ffea2599f zonefs: add MODULE_ALIAS_FS
Add MODULE_ALIAS_FS() to load the module automatically when you do "mount
-t zonefs".

Fixes: 8dcc1a9d90 ("fs: New zonefs file system")
Cc: stable <stable@vger.kernel.org> # 5.6+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <jth@kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
2021-12-17 16:56:35 +09:00
Namjae Jeon f2e78affc4 ksmbd: fix uninitialized symbol 'pntsd_size'
No check for if "rc" is an error code for build_sec_desc().
This can cause problems with using uninitialized pntsd_size.

Fixes: e2f34481b2 ("cifsd: add server-side procedures for SMB3")
Cc: stable@vger.kernel.org # v5.15
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-12-16 12:36:49 -06:00
Dan Carpenter ef399469d9 ksmbd: fix error code in ndr_read_int32()
This is a failure path and it should return -EINVAL instead of success.
Otherwise it could result in the caller using uninitialized memory.

Fixes: 303fff2b8c ("ksmbd: add validation for ndr read/write functions")
Cc: stable@vger.kernel.org # v5.15
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-12-16 12:36:49 -06:00
David Howells 1744a22ae9 afs: Fix mmap
Fix afs_add_open_map() to check that the vnode isn't already on the list
when it adds it.  It's possible that afs_drop_open_mmap() decremented
the cb_nr_mmap counter, but hadn't yet got into the locked section to
remove it.

Also vnode->cb_mmap_link should be initialised, so fix that too.

Fixes: 6e0e99d58a ("afs: Fix mmap coherency vs 3rd-party changes")
Reported-by: kafs-testing+fedora34_64checkkafs-build-300@auristor.com
Suggested-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: kafs-testing+fedora34_64checkkafs-build-300@auristor.com
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/686465.1639435380@warthog.procyon.org.uk/ # v1
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-16 09:10:13 -08:00
Linus Torvalds 2b14864acb An SGID directory handling fix (marked for stable), a metrics
accounting fix and two fixups to appease static checkers.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmG54aMTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzixPBB/4rf0d+oJbBNQQIjTVSO09G9A5pxnU1
 CY++W5xppv2hQQX8Czh8iLLYSP3tRh+1xpAlKT+DFsRI3GTh5tXiPf7Mpu76noo3
 UAKxiz5NmtqiwrHKZtpsQEJlmwtAFa+qFVJIOiqq1l+gFeiYMueoFEGDgdSCLJrE
 P6TL9CZx06Yf+fef7hJnXKR3YM0qras1jV6Adpvu+2g/ciaThzJF3YdyAFYFp7nl
 sGnPTXlJ1DDIsiOAgj0oGeQICpjQJOSGNZtxVjhJx5dJjqgKjim0V4MIShqsW/tg
 n8LzjrdGh/lR2QfCgOfYmvFrVTeb9QQyF57sT034rGrP6VXlzCkALPi3
 =FQag
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.16-rc6' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "An SGID directory handling fix (marked for stable), a metrics
  accounting fix and two fixups to appease static checkers"

* tag 'ceph-for-5.16-rc6' of git://github.com/ceph/ceph-client:
  ceph: fix up non-directory creation in SGID directories
  ceph: initialize pathlen variable in reconnect_caps_cb
  ceph: initialize i_size variable in ceph_sync_read
  ceph: fix duplicate increment of opened_inodes metric
2021-12-15 11:06:41 -08:00
Shin'ichiro Kawasaki 4989d4a0ae btrfs: fix missing blkdev_put() call in btrfs_scan_one_device()
The function btrfs_scan_one_device() calls blkdev_get_by_path() and
blkdev_put() to get and release its target block device. However, when
btrfs_sb_log_location_bdev() fails, blkdev_put() is not called and the
block device is left without clean up. This triggered failure of fstests
generic/085. Fix the failure path of btrfs_sb_log_location_bdev() to
call blkdev_put().

Fixes: 12659251ca ("btrfs: implement log-structured superblock for ZONED mode")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-15 17:07:34 +01:00
Filipe Manana 212a58fda9 btrfs: fix warning when freeing leaf after subvolume creation failure
When creating a subvolume, at ioctl.c:create_subvol(), if we fail to
insert the root item for the new subvolume into the root tree, we can
trigger the following warning:

[78961.741046] WARNING: CPU: 0 PID: 4079814 at fs/btrfs/extent-tree.c:3357 btrfs_free_tree_block+0x2af/0x310 [btrfs]
[78961.743344] Modules linked in:
[78961.749440]  dm_snapshot dm_thin_pool (...)
[78961.773648] CPU: 0 PID: 4079814 Comm: fsstress Not tainted 5.16.0-rc4-btrfs-next-108 #1
[78961.775198] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[78961.777266] RIP: 0010:btrfs_free_tree_block+0x2af/0x310 [btrfs]
[78961.778398] Code: 17 00 48 85 (...)
[78961.781067] RSP: 0018:ffffaa4001657b28 EFLAGS: 00010202
[78961.781877] RAX: 0000000000000213 RBX: ffff897f8a796910 RCX: 0000000000000000
[78961.782780] RDX: 0000000000000000 RSI: 0000000011004000 RDI: 00000000ffffffff
[78961.783764] RBP: ffff8981f490e800 R08: 0000000000000001 R09: 0000000000000000
[78961.784740] R10: 0000000000000000 R11: 0000000000000001 R12: ffff897fc963fcc8
[78961.785665] R13: 0000000000000001 R14: ffff898063548000 R15: ffff898063548000
[78961.786620] FS:  00007f31283c6b80(0000) GS:ffff8982ace00000(0000) knlGS:0000000000000000
[78961.787717] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[78961.788598] CR2: 00007f31285c3000 CR3: 000000023fcc8003 CR4: 0000000000370ef0
[78961.789568] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[78961.790585] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[78961.791684] Call Trace:
[78961.792082]  <TASK>
[78961.792359]  create_subvol+0x5d1/0x9a0 [btrfs]
[78961.793054]  btrfs_mksubvol+0x447/0x4c0 [btrfs]
[78961.794009]  ? preempt_count_add+0x49/0xa0
[78961.794705]  __btrfs_ioctl_snap_create+0x123/0x190 [btrfs]
[78961.795712]  ? _copy_from_user+0x66/0xa0
[78961.796382]  btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs]
[78961.797392]  btrfs_ioctl+0xd1e/0x35c0 [btrfs]
[78961.798172]  ? __slab_free+0x10a/0x360
[78961.798820]  ? rcu_read_lock_sched_held+0x12/0x60
[78961.799664]  ? lock_release+0x223/0x4a0
[78961.800321]  ? lock_acquired+0x19f/0x420
[78961.800992]  ? rcu_read_lock_sched_held+0x12/0x60
[78961.801796]  ? trace_hardirqs_on+0x1b/0xe0
[78961.802495]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
[78961.803358]  ? kmem_cache_free+0x321/0x3c0
[78961.804071]  ? __x64_sys_ioctl+0x83/0xb0
[78961.804711]  __x64_sys_ioctl+0x83/0xb0
[78961.805348]  do_syscall_64+0x3b/0xc0
[78961.805969]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[78961.806830] RIP: 0033:0x7f31284bc957
[78961.807517] Code: 3c 1c 48 f7 d8 (...)

This is because we are calling btrfs_free_tree_block() on an extent
buffer that is dirty. Fix that by cleaning the extent buffer, with
btrfs_clean_tree_block(), before freeing it.

This was triggered by test case generic/475 from fstests.

Fixes: 67addf2900 ("btrfs: fix metadata extent leak after failure to create subvolume")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-15 17:07:33 +01:00
Filipe Manana 7a1636089a btrfs: fix invalid delayed ref after subvolume creation failure
When creating a subvolume, at ioctl.c:create_subvol(), if we fail to
insert the new root's root item into the root tree, we are freeing the
metadata extent we reserved for the new root to prevent a metadata
extent leak, as we don't abort the transaction at that point (since
there is nothing at that point that is irreversible).

However we allocated the metadata extent for the new root which we are
creating for the new subvolume, so its delayed reference refers to the
ID of this new root. But when we free the metadata extent we pass the
root of the subvolume where the new subvolume is located to
btrfs_free_tree_block() - this is incorrect because this will generate
a delayed reference that refers to the ID of the parent subvolume's root,
and not to ID of the new root.

This results in a failure when running delayed references that leads to
a transaction abort and a trace like the following:

[3868.738042] RIP: 0010:__btrfs_free_extent+0x709/0x950 [btrfs]
[3868.739857] Code: 68 0f 85 e6 fb ff (...)
[3868.742963] RSP: 0018:ffffb0e9045cf910 EFLAGS: 00010246
[3868.743908] RAX: 00000000fffffffe RBX: 00000000fffffffe RCX: 0000000000000002
[3868.745312] RDX: 00000000fffffffe RSI: 0000000000000002 RDI: ffff90b0cd793b88
[3868.746643] RBP: 000000000e5d8000 R08: 0000000000000000 R09: ffff90b0cd793b88
[3868.747979] R10: 0000000000000002 R11: 00014ded97944d68 R12: 0000000000000000
[3868.749373] R13: ffff90b09afe4a28 R14: 0000000000000000 R15: ffff90b0cd793b88
[3868.750725] FS:  00007f281c4a8b80(0000) GS:ffff90b3ada00000(0000) knlGS:0000000000000000
[3868.752275] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3868.753515] CR2: 00007f281c6a5000 CR3: 0000000108a42006 CR4: 0000000000370ee0
[3868.754869] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[3868.756228] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[3868.757803] Call Trace:
[3868.758281]  <TASK>
[3868.758655]  ? btrfs_merge_delayed_refs+0x178/0x1c0 [btrfs]
[3868.759827]  __btrfs_run_delayed_refs+0x2b1/0x1250 [btrfs]
[3868.761047]  btrfs_run_delayed_refs+0x86/0x210 [btrfs]
[3868.762069]  ? lock_acquired+0x19f/0x420
[3868.762829]  btrfs_commit_transaction+0x69/0xb20 [btrfs]
[3868.763860]  ? _raw_spin_unlock+0x29/0x40
[3868.764614]  ? btrfs_block_rsv_release+0x1c2/0x1e0 [btrfs]
[3868.765870]  create_subvol+0x1d8/0x9a0 [btrfs]
[3868.766766]  btrfs_mksubvol+0x447/0x4c0 [btrfs]
[3868.767669]  ? preempt_count_add+0x49/0xa0
[3868.768444]  __btrfs_ioctl_snap_create+0x123/0x190 [btrfs]
[3868.769639]  ? _copy_from_user+0x66/0xa0
[3868.770391]  btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs]
[3868.771495]  btrfs_ioctl+0xd1e/0x35c0 [btrfs]
[3868.772364]  ? __slab_free+0x10a/0x360
[3868.773198]  ? rcu_read_lock_sched_held+0x12/0x60
[3868.774121]  ? lock_release+0x223/0x4a0
[3868.774863]  ? lock_acquired+0x19f/0x420
[3868.775634]  ? rcu_read_lock_sched_held+0x12/0x60
[3868.776530]  ? trace_hardirqs_on+0x1b/0xe0
[3868.777373]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
[3868.778280]  ? kmem_cache_free+0x321/0x3c0
[3868.779011]  ? __x64_sys_ioctl+0x83/0xb0
[3868.779718]  __x64_sys_ioctl+0x83/0xb0
[3868.780387]  do_syscall_64+0x3b/0xc0
[3868.781059]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[3868.781953] RIP: 0033:0x7f281c59e957
[3868.782585] Code: 3c 1c 48 f7 d8 4c (...)
[3868.785867] RSP: 002b:00007ffe1f83e2b8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[3868.787198] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f281c59e957
[3868.788450] RDX: 00007ffe1f83e2c0 RSI: 0000000050009418 RDI: 0000000000000003
[3868.789748] RBP: 00007ffe1f83f300 R08: 0000000000000000 R09: 00007ffe1f83fe36
[3868.791214] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000003
[3868.792468] R13: 0000000000000003 R14: 00007ffe1f83e2c0 R15: 00000000000003cc
[3868.793765]  </TASK>
[3868.794037] irq event stamp: 0
[3868.794548] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[3868.795670] hardirqs last disabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040
[3868.797086] softirqs last  enabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040
[3868.798309] softirqs last disabled at (0): [<0000000000000000>] 0x0
[3868.799284] ---[ end trace be24c7002fe27747 ]---
[3868.799928] BTRFS info (device dm-0): leaf 241188864 gen 1268 total ptrs 214 free space 469 owner 2
[3868.801133] BTRFS info (device dm-0): refs 2 lock_owner 225627 current 225627
[3868.802056]  item 0 key (237436928 169 0) itemoff 16250 itemsize 33
[3868.802863]          extent refs 1 gen 1265 flags 2
[3868.803447]          ref#0: tree block backref root 1610
(...)
[3869.064354]  item 114 key (241008640 169 0) itemoff 12488 itemsize 33
[3869.065421]          extent refs 1 gen 1268 flags 2
[3869.066115]          ref#0: tree block backref root 1689
(...)
[3869.403834] BTRFS error (device dm-0): unable to find ref byte nr 241008640 parent 0 root 1622  owner 0 offset 0
[3869.405641] BTRFS: error (device dm-0) in __btrfs_free_extent:3076: errno=-2 No such entry
[3869.407138] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2159: errno=-2 No such entry

Fix this by passing the new subvolume's root ID to btrfs_free_tree_block().
This requires changing the root argument of btrfs_free_tree_block() from
struct btrfs_root * to a u64, since at this point during the subvolume
creation we have not yet created the struct btrfs_root for the new
subvolume, and btrfs_free_tree_block() only needs a root ID and nothing
else from a struct btrfs_root.

This was triggered by test case generic/475 from fstests.

Fixes: 67addf2900 ("btrfs: fix metadata extent leak after failure to create subvolume")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-15 17:07:33 +01:00
Josef Bacik 651740a502 btrfs: check WRITE_ERR when trying to read an extent buffer
Filipe reported a hang when we have errors on btrfs.  This turned out to
be a side-effect of my fix c2e3930529 ("btrfs: clear extent buffer
uptodate when we fail to write it") which made it so we clear
EXTENT_BUFFER_UPTODATE on an eb when we fail to write it out.

Below is a paste of Filipe's analysis he got from using drgn to debug
the hang

"""
btree readahead code calls read_extent_buffer_pages(), sets ->io_pages to
a value while writeback of all pages has not yet completed:
   --> writeback for the first 3 pages finishes, we clear
       EXTENT_BUFFER_UPTODATE from eb on the first page when we get an
       error.
   --> at this point eb->io_pages is 1 and we cleared Uptodate bit from the
       first 3 pages
   --> read_extent_buffer_pages() does not see EXTENT_BUFFER_UPTODATE() so
       it continues, it's able to lock the pages since we obviously don't
       hold the pages locked during writeback
   --> read_extent_buffer_pages() then computes 'num_reads' as 3, and sets
       eb->io_pages to 3, since only the first page does not have Uptodate
       bit set at this point
   --> writeback for the remaining page completes, we ended decrementing
       eb->io_pages by 1, resulting in eb->io_pages == 2, and therefore
       never calling end_extent_buffer_writeback(), so
       EXTENT_BUFFER_WRITEBACK remains in the eb's flags
   --> of course, when the read bio completes, it doesn't and shouldn't
       call end_extent_buffer_writeback()
   --> we should clear EXTENT_BUFFER_UPTODATE only after all pages of
       the eb finished writeback?  or maybe make the read pages code
       wait for writeback of all pages of the eb to complete before
       checking which pages need to be read, touch ->io_pages, submit
       read bio, etc

writeback bit never cleared means we can hang when aborting a
transaction, at:

    btrfs_cleanup_one_transaction()
       btrfs_destroy_marked_extents()
         wait_on_extent_buffer_writeback()
"""

This is a problem because our writes are not synchronized with reads in
any way.  We clear the UPTODATE flag and then we can easily come in and
try to read the EB while we're still waiting on other bio's to
complete.

We have two options here, we could lock all the pages, and then check to
see if eb->io_pages != 0 to know if we've already got an outstanding
write on the eb.

Or we can simply check to see if we have WRITE_ERR set on this extent
buffer.  We set this bit _before_ we clear UPTODATE, so if the read gets
triggered because we aren't UPTODATE because of a write error we're
guaranteed to have WRITE_ERR set, and in this case we can simply return
-EIO.  This will fix the reported hang.

Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: c2e3930529 ("btrfs: clear extent buffer uptodate when we fail to write it")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-15 17:07:31 +01:00
Filipe Manana 1b2e5e5c7f btrfs: fix missing last dir item offset update when logging directory
When logging a directory, once we finish processing a leaf that is full
of dir items, if we find the next leaf was not modified in the current
transaction, we grab the first key of that next leaf and log it as to
mark the end of a key range boundary.

However we did not update the value of ctx->last_dir_item_offset, which
tracks the offset of the last logged key. This can result in subsequent
logging of the same directory in the current transaction to not realize
that key was already logged, and then add it to the middle of a batch
that starts with a lower key, resulting later in a leaf with one key
that is duplicated and at non-consecutive slots. When that happens we get
an error later when writing out the leaf, reporting that there is a pair
of keys in wrong order. The report is something like the following:

Dec 13 21:44:50 kernel: BTRFS critical (device dm-0): corrupt leaf:
root=18446744073709551610 block=118444032 slot=21, bad key order, prev
(704687 84 4146773349) current (704687 84 1063561078)
Dec 13 21:44:50 kernel: BTRFS info (device dm-0): leaf 118444032 gen
91449 total ptrs 39 free space 546 owner 18446744073709551610
Dec 13 21:44:50 kernel:         item 0 key (704687 1 0) itemoff 3835
itemsize 160
Dec 13 21:44:50 kernel:                 inode generation 35532 size
1026 mode 40755
Dec 13 21:44:50 kernel:         item 1 key (704687 12 704685) itemoff
3822 itemsize 13
Dec 13 21:44:50 kernel:         item 2 key (704687 24 3817753667)
itemoff 3736 itemsize 86
Dec 13 21:44:50 kernel:         item 3 key (704687 60 0) itemoff 3728 itemsize 8
Dec 13 21:44:50 kernel:         item 4 key (704687 72 0) itemoff 3720 itemsize 8
Dec 13 21:44:50 kernel:         item 5 key (704687 84 140445108)
itemoff 3666 itemsize 54
Dec 13 21:44:50 kernel:                 dir oid 704793 type 1
Dec 13 21:44:50 kernel:         item 6 key (704687 84 298800632)
itemoff 3599 itemsize 67
Dec 13 21:44:50 kernel:                 dir oid 707849 type 2
Dec 13 21:44:50 kernel:         item 7 key (704687 84 476147658)
itemoff 3532 itemsize 67
Dec 13 21:44:50 kernel:                 dir oid 707901 type 2
Dec 13 21:44:50 kernel:         item 8 key (704687 84 633818382)
itemoff 3471 itemsize 61
Dec 13 21:44:50 kernel:                 dir oid 704694 type 2
Dec 13 21:44:50 kernel:         item 9 key (704687 84 654256665)
itemoff 3403 itemsize 68
Dec 13 21:44:50 kernel:                 dir oid 707841 type 1
Dec 13 21:44:50 kernel:         item 10 key (704687 84 995843418)
itemoff 3331 itemsize 72
Dec 13 21:44:50 kernel:                 dir oid 2167736 type 1
Dec 13 21:44:50 kernel:         item 11 key (704687 84 1063561078)
itemoff 3278 itemsize 53
Dec 13 21:44:50 kernel:                 dir oid 704799 type 2
Dec 13 21:44:50 kernel:         item 12 key (704687 84 1101156010)
itemoff 3225 itemsize 53
Dec 13 21:44:50 kernel:                 dir oid 704696 type 1
Dec 13 21:44:50 kernel:         item 13 key (704687 84 2521936574)
itemoff 3173 itemsize 52
Dec 13 21:44:50 kernel:                 dir oid 704704 type 2
Dec 13 21:44:50 kernel:         item 14 key (704687 84 2618368432)
itemoff 3112 itemsize 61
Dec 13 21:44:50 kernel:                 dir oid 704738 type 1
Dec 13 21:44:50 kernel:         item 15 key (704687 84 2676316190)
itemoff 3046 itemsize 66
Dec 13 21:44:50 kernel:                 dir oid 2167729 type 1
Dec 13 21:44:50 kernel:         item 16 key (704687 84 3319104192)
itemoff 2986 itemsize 60
Dec 13 21:44:50 kernel:                 dir oid 704745 type 2
Dec 13 21:44:50 kernel:         item 17 key (704687 84 3908046265)
itemoff 2929 itemsize 57
Dec 13 21:44:50 kernel:                 dir oid 2167734 type 1
Dec 13 21:44:50 kernel:         item 18 key (704687 84 3945713089)
itemoff 2857 itemsize 72
Dec 13 21:44:50 kernel:                 dir oid 2167730 type 1
Dec 13 21:44:50 kernel:         item 19 key (704687 84 4077169308)
itemoff 2795 itemsize 62
Dec 13 21:44:50 kernel:                 dir oid 704688 type 1
Dec 13 21:44:50 kernel:         item 20 key (704687 84 4146773349)
itemoff 2727 itemsize 68
Dec 13 21:44:50 kernel:                 dir oid 707892 type 1
Dec 13 21:44:50 kernel:         item 21 key (704687 84 1063561078)
itemoff 2674 itemsize 53
Dec 13 21:44:50 kernel:                 dir oid 704799 type 2
Dec 13 21:44:50 kernel:         item 22 key (704687 96 2) itemoff 2612
itemsize 62
Dec 13 21:44:50 kernel:         item 23 key (704687 96 6) itemoff 2551
itemsize 61
Dec 13 21:44:50 kernel:         item 24 key (704687 96 7) itemoff 2498
itemsize 53
Dec 13 21:44:50 kernel:         item 25 key (704687 96 12) itemoff
2446 itemsize 52
Dec 13 21:44:50 kernel:         item 26 key (704687 96 14) itemoff
2385 itemsize 61
Dec 13 21:44:50 kernel:         item 27 key (704687 96 18) itemoff
2325 itemsize 60
Dec 13 21:44:50 kernel:         item 28 key (704687 96 24) itemoff
2271 itemsize 54
Dec 13 21:44:50 kernel:         item 29 key (704687 96 28) itemoff
2218 itemsize 53
Dec 13 21:44:50 kernel:         item 30 key (704687 96 62) itemoff
2150 itemsize 68
Dec 13 21:44:50 kernel:         item 31 key (704687 96 66) itemoff
2083 itemsize 67
Dec 13 21:44:50 kernel:         item 32 key (704687 96 75) itemoff
2015 itemsize 68
Dec 13 21:44:50 kernel:         item 33 key (704687 96 79) itemoff
1948 itemsize 67
Dec 13 21:44:50 kernel:         item 34 key (704687 96 82) itemoff
1882 itemsize 66
Dec 13 21:44:50 kernel:         item 35 key (704687 96 83) itemoff
1810 itemsize 72
Dec 13 21:44:50 kernel:         item 36 key (704687 96 85) itemoff
1753 itemsize 57
Dec 13 21:44:50 kernel:         item 37 key (704687 96 87) itemoff
1681 itemsize 72
Dec 13 21:44:50 kernel:         item 38 key (704694 1 0) itemoff 1521
itemsize 160
Dec 13 21:44:50 kernel:                 inode generation 35534 size 30
mode 40755
Dec 13 21:44:50 kernel: BTRFS error (device dm-0): block=118444032
write time tree block corruption detected

So fix that by adding the missing update of ctx->last_dir_item_offset with
the offset of the boundary key.

Reported-by: Chris Murphy <lists@colorremedies.com>
Link: https://lore.kernel.org/linux-btrfs/CAJCQCtT+RSzpUjbMq+UfzNUMe1X5+1G+DnAGbHC=OZ=iRS24jg@mail.gmail.com/
Fixes: dc2872247e ("btrfs: keep track of the last logged keys when logging a directory")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-14 15:52:10 +01:00
Filipe Manana 33fab97249 btrfs: fix double free of anon_dev after failure to create subvolume
When creating a subvolume, at create_subvol(), we allocate an anonymous
device and later call btrfs_get_new_fs_root(), which in turn just calls
btrfs_get_root_ref(). There we call btrfs_init_fs_root() which assigns
the anonymous device to the root, but if after that call there's an error,
when we jump to 'fail' label, we call btrfs_put_root(), which frees the
anonymous device and then returns an error that is propagated back to
create_subvol(). Than create_subvol() frees the anonymous device again.

When this happens, if the anonymous device was not reallocated after
the first time it was freed with btrfs_put_root(), we get a kernel
message like the following:

  (...)
  [13950.282466] BTRFS: error (device dm-0) in create_subvol:663: errno=-5 IO failure
  [13950.283027] ida_free called for id=65 which is not allocated.
  [13950.285974] BTRFS info (device dm-0): forced readonly
  (...)

If the anonymous device gets reallocated by another btrfs filesystem
or any other kernel subsystem, then bad things can happen.

So fix this by setting the root's anonymous device to 0 at
btrfs_get_root_ref(), before we call btrfs_put_root(), if an error
happened.

Fixes: 2dfb1e43f5 ("btrfs: preallocate anon block device at first phase of snapshot creation")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-14 15:52:04 +01:00
Jianglei Nie f35838a693 btrfs: fix memory leak in __add_inode_ref()
Line 1169 (#3) allocates a memory chunk for victim_name by kmalloc(),
but  when the function returns in line 1184 (#4) victim_name allocated
by line 1169 (#3) is not freed, which will lead to a memory leak.
There is a similar snippet of code in this function as allocating a memory
chunk for victim_name in line 1104 (#1) as well as releasing the memory
in line 1116 (#2).

We should kfree() victim_name when the return value of backref_in_log()
is less than zero and before the function returns in line 1184 (#4).

1057 static inline int __add_inode_ref(struct btrfs_trans_handle *trans,
1058 				  struct btrfs_root *root,
1059 				  struct btrfs_path *path,
1060 				  struct btrfs_root *log_root,
1061 				  struct btrfs_inode *dir,
1062 				  struct btrfs_inode *inode,
1063 				  u64 inode_objectid, u64 parent_objectid,
1064 				  u64 ref_index, char *name, int namelen,
1065 				  int *search_done)
1066 {

1104 	victim_name = kmalloc(victim_name_len, GFP_NOFS);
	// #1: kmalloc (victim_name-1)
1105 	if (!victim_name)
1106 		return -ENOMEM;

1112	ret = backref_in_log(log_root, &search_key,
1113			parent_objectid, victim_name,
1114			victim_name_len);
1115	if (ret < 0) {
1116		kfree(victim_name); // #2: kfree (victim_name-1)
1117		return ret;
1118	} else if (!ret) {

1169 	victim_name = kmalloc(victim_name_len, GFP_NOFS);
	// #3: kmalloc (victim_name-2)
1170 	if (!victim_name)
1171 		return -ENOMEM;

1180 	ret = backref_in_log(log_root, &search_key,
1181 			parent_objectid, victim_name,
1182 			victim_name_len);
1183 	if (ret < 0) {
1184 		return ret; // #4: missing kfree (victim_name-2)
1185 	} else if (!ret) {

1241 	return 0;
1242 }

Fixes: d3316c8233 ("btrfs: Properly handle backref_in_log retval")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Jianglei Nie <niejianglei2021@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-14 15:52:00 +01:00
Linus Torvalds e386dfc56f fget: clarify and improve __fget_files() implementation
Commit 054aa8d439 ("fget: check that the fd still exists after getting
a ref to it") fixed a race with getting a reference to a file just as it
was being closed.  It was a fairly minimal patch, and I didn't think
re-checking the file pointer lookup would be a measurable overhead,
since it was all right there and cached.

But I was wrong, as pointed out by the kernel test robot.

The 'poll2' case of the will-it-scale.per_thread_ops benchmark regressed
quite noticeably.  Admittedly it seems to be a very artificial test:
doing "poll()" system calls on regular files in a very tight loop in
multiple threads.

That means that basically all the time is spent just looking up file
descriptors without ever doing anything useful with them (not that doing
'poll()' on a regular file is useful to begin with).  And as a result it
shows the extra "re-check fd" cost as a sore thumb.

Happily, the regression is fixable by just writing the code to loook up
the fd to be better and clearer.  There's still a cost to verify the
file pointer, but now it's basically in the noise even for that
benchmark that does nothing else - and the code is more understandable
and has better comments too.

[ Side note: this patch is also a classic case of one that looks very
  messy with the default greedy Myers diff - it's much more legible with
  either the patience of histogram diff algorithm ]

Link: https://lore.kernel.org/lkml/20211210053743.GA36420@xsang-OptiPlex-9020/
Link: https://lore.kernel.org/lkml/20211213083154.GA20853@linux.intel.com/
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Carel Si <beibei.si@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-13 10:55:30 -08:00
Jens Axboe d800c65c2d io-wq: drop wqe lock before creating new worker
We have two io-wq creation paths:

- On queue enqueue
- When a worker goes to sleep

The latter invokes worker creation with the wqe->lock held, but that can
run into problems if we end up exiting and need to cancel the queued work.
syzbot caught this:

============================================
WARNING: possible recursive locking detected
5.16.0-rc4-syzkaller #0 Not tainted
--------------------------------------------
iou-wrk-6468/6471 is trying to acquire lock:
ffff88801aa98018 (&wqe->lock){+.+.}-{2:2}, at: io_worker_cancel_cb+0xb7/0x210 fs/io-wq.c:187

but task is already holding lock:
ffff88801aa98018 (&wqe->lock){+.+.}-{2:2}, at: io_wq_worker_sleeping+0xb6/0x140 fs/io-wq.c:700

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&wqe->lock);
  lock(&wqe->lock);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

1 lock held by iou-wrk-6468/6471:
 #0: ffff88801aa98018 (&wqe->lock){+.+.}-{2:2}, at: io_wq_worker_sleeping+0xb6/0x140 fs/io-wq.c:700

stack backtrace:
CPU: 1 PID: 6471 Comm: iou-wrk-6468 Not tainted 5.16.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x1dc/0x2d8 lib/dump_stack.c:106
 print_deadlock_bug kernel/locking/lockdep.c:2956 [inline]
 check_deadlock kernel/locking/lockdep.c:2999 [inline]
 validate_chain+0x5984/0x8240 kernel/locking/lockdep.c:3788
 __lock_acquire+0x1382/0x2b00 kernel/locking/lockdep.c:5027
 lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5637
 __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
 _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:154
 io_worker_cancel_cb+0xb7/0x210 fs/io-wq.c:187
 io_wq_cancel_tw_create fs/io-wq.c:1220 [inline]
 io_queue_worker_create+0x3cf/0x4c0 fs/io-wq.c:372
 io_wq_worker_sleeping+0xbe/0x140 fs/io-wq.c:701
 sched_submit_work kernel/sched/core.c:6295 [inline]
 schedule+0x67/0x1f0 kernel/sched/core.c:6323
 schedule_timeout+0xac/0x300 kernel/time/timer.c:1857
 wait_woken+0xca/0x1b0 kernel/sched/wait.c:460
 unix_msg_wait_data net/unix/unix_bpf.c:32 [inline]
 unix_bpf_recvmsg+0x7f9/0xe20 net/unix/unix_bpf.c:77
 unix_stream_recvmsg+0x214/0x2c0 net/unix/af_unix.c:2832
 sock_recvmsg_nosec net/socket.c:944 [inline]
 sock_recvmsg net/socket.c:962 [inline]
 sock_read_iter+0x3a7/0x4d0 net/socket.c:1035
 call_read_iter include/linux/fs.h:2156 [inline]
 io_iter_do_read fs/io_uring.c:3501 [inline]
 io_read fs/io_uring.c:3558 [inline]
 io_issue_sqe+0x144c/0x9590 fs/io_uring.c:6671
 io_wq_submit_work+0x2d8/0x790 fs/io_uring.c:6836
 io_worker_handle_work+0x808/0xdd0 fs/io-wq.c:574
 io_wqe_worker+0x395/0x870 fs/io-wq.c:630
 ret_from_fork+0x1f/0x30

We can safely drop the lock before doing work creation, making the two
contexts the same in that regard.

Reported-by: syzbot+b18b8be69df33a3918e9@syzkaller.appspotmail.com
Fixes: 71a8538754 ("io-wq: check for wq exit after adding new worker task_work")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-13 09:04:01 -07:00
Linus Torvalds e034d9cbf9 Fixes for 5.16-rc4:
- Fix a data corruption vector that can result from the ro remount
    process failing to clear all speculative preallocations from files
    and the rw remount process not noticing the incomplete cleanup.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmGyQ6UACgkQ+H93GTRK
 tOskJA/+Jg+fn+Vs7dHwBN+SaHCBH81RT+YD9gbI7Lq5xcK16K7BWz3Hf360J9uJ
 BpVn0fv6/+3bNg0/QMWUE67CotB2S43CzWnltUN0ymiGgAEO1eMKL1JHRtZ/89d4
 pG2Lr9W1QsxMqP9L/TgxqHb5WC2xtromDLMWa7qYL8aH4qp4Dx6hv++u3qhbdTis
 sfvTOMsDKTwK5RC+7yB2qxr/Txn7viIlBLjTWR3UvvO0K1LxjuBTXpBsiVeQE92T
 ZqLgAhWDBu/wZFBibqtMsnLF0Wmn5oIZJgBoxp/+Ze8saUsuYdAI1X65ESdVmw7E
 ZJtSyZObDHOlRhvDn7Vlb8BNF7QY4l4n2eeGFB4zoPdy0jjz/7QKCqMSm7YCMXqc
 WycjeEIo1H+ZEk0AWu7KrTUAaO3zzqHB/DJt4zQA/w4QdSD14bAsyp8ZWyUEMnWy
 7M0Pc0q2ptzOOp1rjd2J+0s9YSc24/dceW1lCDpIDSkOuxO82HdCxlWQ/YPEt2Lx
 gGv2hll8Cg0j18VThB++XZ3ozImFu+sZhkQPPCM/kKY0tWZvd4Yby1bWLNzS+ppE
 RIKh+hyhFTXmr0C2iJ8fGNFu/KA3V0efztvCIATHDmGcC6f4WePUQyLzZGFgS0Yw
 uBPmvqWeUWySgfauIvqW7Oosf0pbKvYVEd+1fcLVdh0sgr2MUPE=
 =N9Wf
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.16-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fix from Darrick Wong:
 "This fixes a race between a readonly remount process and other
  processes that hold a file IOLOCK on files that previously experienced
  copy on write, that could result in severe filesystem corruption if
  the filesystem is then remounted rw.

  I think this is fairly rare (since the only reliable reproducer I have
  that fits the second criteria is the experimental xfs_scrub program),
  but the race is clear, so we still need to fix this.

  Summary:

   - Fix a data corruption vector that can result from the ro remount
     process failing to clear all speculative preallocations from files
     and the rw remount process not noticing the incomplete cleanup"

* tag 'xfs-5.16-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: remove all COW fork extents when remounting readonly
2021-12-11 16:21:06 -08:00
Linus Torvalds f152165ada io_uring-5.16-2021-12-10
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmG0PwYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgphbiEACv95PrGVuOzgXwC1Ydk4Km5Yyt8vANKcxA
 LFzeScrRhGG28vHCLZuPSxtFn9nGjBJPDKscDWQE0tpq8NwB6BAjr3Ps2AR4tI3M
 2bn1ssHIHADdkaV1ixwTtQynHgCvOtKSTmLJ+TTDIYdgwAyRVlRBtpke/nXn5VCR
 kUdQ/3hHSPbEY+WvrDz7yAvDIMxTrBHmt8TbDiI1FXzNf8AOd32ekpBNzwe2hTQi
 OLhpaQ60S74UG78c4OGg6SV+twW0nkp1DyTf8MhC3WoBBnTwQdc/BpLNJbnUwH57
 lJVJTz0sknbnQCtifJ8/Wjzn6r95cOwbYvaGgXyi7KtTGn0QXLcK+FohNVizFR0x
 qTMWlFJ5aPmzqed6oMZ2m8iaGxR9813I0GP54Hsqw4PmoPb1GDN5FvZIN+MKdaJd
 PSp9YH7RDHsusYZjjaUtHIcxhxztblf5wNVAkmkfrFZzNArnwsw0k9gX7DI0f+GX
 +TazZzX8gLRonAskBHgoSDT+8WOzDOHYYIVVe0g/8TR+jW8oGDd9D0sJbH74dV1Y
 3C4WLbJCquBzatotmcmgkt9CJzZz9y373cVVNu4HZgH49qjEZCJuFgBaTpanvwHk
 Qk5KTl9yIfjs+H2FDWa2Pd6LnsVvuIpwOd4R37RDLGM2ZiX100eX1VbD6RxtIehQ
 PPKP13kYAQ==
 =BUVd
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.16-2021-12-10' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "A few fixes that are all bound for stable:

   - Two syzbot reports for io-wq that turned out to be separate fixes,
     but ultimately very closely related

   - io_uring task_work running on cancelations"

* tag 'io_uring-5.16-2021-12-10' of git://git.kernel.dk/linux-block:
  io-wq: check for wq exit after adding new worker task_work
  io_uring: ensure task_work gets run as part of cancelations
  io-wq: remove spurious bit clear on task_work addition
2021-12-11 09:19:44 -08:00
Linus Torvalds 6f51352929 for-5.16-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmGwxl8ACgkQxWXV+ddt
 WDvSeQ//T+mv4o7ucldQZCVdN0TTVCkQUhia+ZdMwBcPty2/ZEdap+KEIVmfCV/v
 OLRmSNkIPDhHcIc/O3zJ1/AY0DFbb9brYGkMD/qidgPbRArhDZSrDIr+xnrKJ0iq
 HFxM01B54l8hJe6GWIGFuuOz+nXUP1o9SfiDOwMDTkqgzz1JSvPec70RKMxG8pTC
 4plVrGaUXkKTC8WyBXnSvkP2gvjfJxqnEKv2Ru1eP1t7Bq65aed0+Z32Mogzl2ip
 ZSHVlRergeB03xF9YErWSfgofVWLEIXgLCh/Wfq73sMnHUHUGM/dpEKxP911MI00
 A5TuTl3I25cVnDv3qMayBPECMCCGZc2vUyXTLCWUNYLb/vEUTI36QBu/KglRURmO
 Zx20B+3/7Yu9pFeZ23S+nlNdGDADmjkOfvZIZDoYDzqBAKWM6kZ/9oWiib21Uwi6
 ql5oZNl7G6UXLPvgJoq8dqZIj/HYeLEHeqwf/tepSVQLXYzQyWYDMp686z958XI1
 K/A/TKaKk19nn1Dhsz4KJeb3xhMFlFN30K7BNjdH3XRH1RrwJP6pLF/WUduJq2qn
 rAvJoODLHkby1D9+HqSKW/RToeR5NbBYYdfB0Q8zpy9zF/gZX7wgFc21/4B7qtib
 hkKfNSfAVjedOJJWTFCUDnZwwrqtcrKxYzUpdcls/O1AjifK7+U=
 =Mh19
 -----END PGP SIGNATURE-----

Merge tag 'for-5.16-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more regression fixes and stable patches, mostly one-liners.

  Regression fixes:

   - fix pointer/ERR_PTR mismatch returned from memdup_user

   - reset dedicated zoned mode relocation block group to avoid using it
     and filling it without any recourse

  Fixes:

   - handle a case to FITRIM range (also to make fstests/generic/260
     work)

   - fix warning when extent buffer state and pages get out of sync
     after an IO error

   - fix transaction abort when syncing due to missing mapping error set
     on metadata inode after inlining a compressed file

   - fix transaction abort due to tree-log and zoned mode interacting in
     an unexpected way

   - fix memory leak of additional extent data when qgroup reservation
     fails

   - do proper handling of slot search call when deleting root refs"

* tag 'for-5.16-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: replace the BUG_ON in btrfs_del_root_ref with proper error handling
  btrfs: zoned: clear data relocation bg on zone finish
  btrfs: free exchange changeset on failures
  btrfs: fix re-dirty process of tree-log nodes
  btrfs: call mapping_set_error() on btree inode with a write error
  btrfs: clear extent buffer uptodate when we fail to write it
  btrfs: fail if fstrim_range->start == U64_MAX
  btrfs: fix error pointer dereference in btrfs_ioctl_rm_dev_v2()
2021-12-10 17:28:02 -08:00
Linus Torvalds e1b96811e2 two cifs/smb3 fixes, one for stable, the other fixes a recently reported NTLMSSP auth problem
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmGzuJoACgkQiiy9cAdy
 T1H9MAv/cOAk4iUhgDUsa+HYsBNiAQ+IOOu/WOZI56jEV/qVP0I3cpkiIJenG5gu
 ZurivI4smsgNokHOFoT3vjtLXTfTl0OdHHY/mftd5IGIPG+KnXcg3+ZaE3T+fUV3
 uHX0cH3a3Azo3RGf2rRiTfW6u9FXJnb9aAUTif7UDVwsU37wUAQrKEmcazaUXdaT
 +j3KpqwGhCgbkneKuAd/FDTAg4wciJgRg3aE/4W2s2ovFiF6vsUcrmhQD1zP1EZh
 sPWdx4/U+WpeV02RfKBLQlXi6ofqRF5qRT4HkD07G5Zhz8OOZcm06wYFNl/+aYhF
 lTOTa9KAoTPfV2tFlTGZVEy07ggEFd3wDxeAPulDrqY8etwWSwRAHHRi5HstMlIX
 iYz06DLS+LSPyriy6GV6CIL01OPg8vkeLPbrLFbs8oLSVQgmx74UDS339oxJMdFe
 kkhLgtfRr5DfxS3uD0u17aBDL4ullH84dWdlsUKiUfGvVsBcoVSCPIQeWqkB8xjA
 OTKih3J1
 =+1L9
 -----END PGP SIGNATURE-----

Merge tag '5.16-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Two cifs/smb3 fixes - one for stable, the other fixes a recently
  reported NTLMSSP auth problem"

* tag '5.16-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: fix ntlmssp auth when there is no key exchange
  cifs: Fix crash on unload of cifs_arc4.ko
2021-12-10 17:24:57 -08:00
Linus Torvalds e80bdc5ed0 Fix a race on startup and another in the delegation code. The latter
has been around for years, but I suspect recent changes may have
 widened the race window a little, so I'd like to go ahead and get it in.
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAmGzuB8VHGJmaWVsZHNA
 ZmllbGRzZXMub3JnAAoJECebzXlCjuG+rYQP/1N61E3rf2MOfF4sluubaGGi0fVR
 cx3w8pTDSALfxAiz0ZTA6hJbZN0SR7MZuXMhWDHT/RiG7jFfuMJACU81BWn0cCPJ
 zuJziE3Y6fElGi5Ifp71yMLABqTlFPOUHBZkpzaADqXGwmZk4fecXmqKbtqa9kCU
 1mCQJlZDGU1ljGVYuXU4SaBOzsrCpPfbxby2ZWFDs8bu5cGKDTq9OSVF1LaGzjMN
 2+v0y2Qni+dRfTBCGAmxF8AukRa2r5i2rfHRatgKQO8PdROv/pguy1hYjjeeofhX
 /Jlq2RIVl1/9/Fwar4Cew2+fpQhpaY62wUuo4KG4PK5WZ92LGDjSpYqvu/I4lI6z
 BXdge0HzDPRnpZtD+tu4IJ7nQwG+JnO2Ws91rPxwT6DDX4zX9gNZc+7EGevVeXJI
 XTO+61Tyz+7/6YLR8UvnvyER9KvKjQiUWZkO81LZpuAAwn1PE3rahD48jxkmwwYE
 m5EvfuA6YR8oyk3Ml+nZNuKwgrNqPcvohyJz4QHpNpqmvOlVrb9I1k90JzIm8D9Q
 09M8An1Z3AGw76+RMnvBnWewn1Nx1bKIS32SJk9x3zBoeyXLS17G6FmpEhW81xvN
 LoToRqD/GNx+4AtfvkHqwv2/Qd1rvONCsjbZv+2P0DAj8ZEOWn0YHk18WlvAIZUg
 4ipyZbwMUbXeytKP
 =GgUu
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.16-2' of git://linux-nfs.org/~bfields/linux

Pull nfsd fixes from Bruce Fields:
 "Fix a race on startup and another in the delegation code.

  The latter has been around for years, but I suspect recent changes may
  have widened the race window a little, so I'd like to go ahead and get
  it in"

* tag 'nfsd-5.16-2' of git://linux-nfs.org/~bfields/linux:
  nfsd: fix use-after-free due to delegation race
  nfsd: Fix nsfd startup race (again)
2021-12-10 17:17:53 -08:00
Linus Torvalds 257dcf2923 tracing, ftrace and tracefs fixes:
- Have tracefs honor the gid mount option
 
  - Have new files in tracefs inherit the parent ownership
 
  - Have direct_ops unregister when it has no more functions
 
  - Properly clean up the ops when unregistering multi direct ops
 
  - Add a sample module to test the multiple direct ops
 
  - Fix memory leak in error path of __create_synth_event()
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYbOgPBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qgOtAP0YD+cRLxnRKA376oQVB8zmuZ3mZ/4x
 6M1hqruSDlno3AEA19PyHpxl7flFwnBb6Gnfo9VeefcMS5ENDH9p1aHX4wU=
 =Tr6t
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Tracing, ftrace and tracefs fixes:

   - Have tracefs honor the gid mount option

   - Have new files in tracefs inherit the parent ownership

   - Have direct_ops unregister when it has no more functions

   - Properly clean up the ops when unregistering multi direct ops

   - Add a sample module to test the multiple direct ops

   - Fix memory leak in error path of __create_synth_event()"

* tag 'trace-v5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix possible memory leak in __create_synth_event() error path
  ftrace/samples: Add module to test multi direct modify interface
  ftrace: Add cleanup to unregister_ftrace_direct_multi
  ftrace: Use direct_ops hash in unregister_ftrace_direct
  tracefs: Set all files to the same group ownership as the mount option
  tracefs: Have new files inherit the ownership of their parent
2021-12-10 14:24:05 -08:00
Linus Torvalds 0d21e66847 aio poll fixes for 5.16-rc5
Fix three bugs in aio poll, and one issue with POLLFREE more broadly:
 
   - aio poll didn't handle POLLFREE, causing a use-after-free.
   - aio poll could block while the file is ready.
   - aio poll called eventfd_signal() when it isn't allowed.
   - POLLFREE didn't handle multiple exclusive waiters correctly.
 
 This has been tested with the libaio test suite, as well as with test
 programs I wrote that reproduce the first two bugs.  I am sending this
 pull request myself as no one seems to be maintaining this code.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCYbObthQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK+3mAQC9W8ApzBleEPI6FXzIIo5AiQT/2jGl
 7FbO1MtkdUBU4QEAzf+VWl4Z4BJTgxl44avRdVDpXGAMnbWkd7heY+e3HwA=
 =mp+r
 -----END PGP SIGNATURE-----

Merge tag 'aio-poll-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux

Pull aio poll fixes from Eric Biggers:
 "Fix three bugs in aio poll, and one issue with POLLFREE more broadly:

   - aio poll didn't handle POLLFREE, causing a use-after-free.

   - aio poll could block while the file is ready.

   - aio poll called eventfd_signal() when it isn't allowed.

   - POLLFREE didn't handle multiple exclusive waiters correctly.

  This has been tested with the libaio test suite, as well as with test
  programs I wrote that reproduce the first two bugs. I am sending this
  pull request myself as no one seems to be maintaining this code"

* tag 'aio-poll-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
  aio: Fix incorrect usage of eventfd_signal_allowed()
  aio: fix use-after-free due to missing POLLFREE handling
  aio: keep poll requests on waitqueue until completed
  signalfd: use wake_up_pollfree()
  binder: use wake_up_pollfree()
  wait: add wake_up_pollfree()
2021-12-10 14:15:39 -08:00
Jens Axboe 71a8538754 io-wq: check for wq exit after adding new worker task_work
We check IO_WQ_BIT_EXIT before attempting to create a new worker, and
wq exit cancels pending work if we have any. But it's possible to have
a race between the two, where creation checks exit finding it not set,
but we're in the process of exiting. The exit side will cancel pending
creation task_work, but there's a gap where we add task_work after we've
canceled existing creations at exit time.

Fix this by checking the EXIT bit post adding the creation task_work.
If it's set, run the same cancelation that exit does.

Reported-and-tested-by: syzbot+b60c982cb0efc5e05a47@syzkaller.appspotmail.com
Reviewed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-10 13:56:28 -07:00
Jens Axboe 78a7806020 io_uring: ensure task_work gets run as part of cancelations
If we successfully cancel a work item but that work item needs to be
processed through task_work, then we can be sleeping uninterruptibly
in io_uring_cancel_generic() and never process it. Hence we don't
make forward progress and we end up with an uninterruptible sleep
warning.

While in there, correct a comment that should be IFF, not IIF.

Reported-and-tested-by: syzbot+21e6887c0be14181206d@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-10 13:56:28 -07:00
J. Bruce Fields 548ec0805c nfsd: fix use-after-free due to delegation race
A delegation break could arrive as soon as we've called vfs_setlease.  A
delegation break runs a callback which immediately (in
nfsd4_cb_recall_prepare) adds the delegation to del_recall_lru.  If we
then exit nfs4_set_delegation without hashing the delegation, it will be
freed as soon as the callback is done with it, without ever being
removed from del_recall_lru.

Symptoms show up later as use-after-free or list corruption warnings,
usually in the laundromat thread.

I suspect aba2072f45 "nfsd: grant read delegations to clients holding
writes" made this bug easier to hit, but I looked as far back as v3.0
and it looks to me it already had the same problem.  So I'm not sure
where the bug was introduced; it may have been there from the beginning.

Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-12-10 11:55:15 -05:00
Alexander Sverdlin b10252c7ae nfsd: Fix nsfd startup race (again)
Commit bd5ae9288d ("nfsd: register pernet ops last, unregister first")
has re-opened rpc_pipefs_event() race against nfsd_net_id registration
(register_pernet_subsys()) which has been fixed by commit bb7ffbf29e
("nfsd: fix nsfd startup race triggering BUG_ON").

Restore the order of register_pernet_subsys() vs register_cld_notifier().
Add WARN_ON() to prevent a future regression.

Crash info:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000012
CPU: 8 PID: 345 Comm: mount Not tainted 5.4.144-... #1
pc : rpc_pipefs_event+0x54/0x120 [nfsd]
lr : rpc_pipefs_event+0x48/0x120 [nfsd]
Call trace:
 rpc_pipefs_event+0x54/0x120 [nfsd]
 blocking_notifier_call_chain
 rpc_fill_super
 get_tree_keyed
 rpc_fs_get_tree
 vfs_get_tree
 do_mount
 ksys_mount
 __arm64_sys_mount
 el0_svc_handler
 el0_svc

Fixes: bd5ae9288d ("nfsd: register pernet ops last, unregister first")
Cc: stable@vger.kernel.org
Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-12-10 11:54:59 -05:00
Xie Yongji 4b37498653 aio: Fix incorrect usage of eventfd_signal_allowed()
We should defer eventfd_signal() to the workqueue when
eventfd_signal_allowed() return false rather than return
true.

Fixes: b542e383d8 ("eventfd: Make signal recursion protection a task bit")
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Link: https://lore.kernel.org/r/20210913111928.98-1-xieyongji@bytedance.com
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-12-09 10:52:55 -08:00
Eric Biggers 50252e4b5e aio: fix use-after-free due to missing POLLFREE handling
signalfd_poll() and binder_poll() are special in that they use a
waitqueue whose lifetime is the current task, rather than the struct
file as is normally the case.  This is okay for blocking polls, since a
blocking poll occurs within one task; however, non-blocking polls
require another solution.  This solution is for the queue to be cleared
before it is freed, by sending a POLLFREE notification to all waiters.

Unfortunately, only eventpoll handles POLLFREE.  A second type of
non-blocking poll, aio poll, was added in kernel v4.18, and it doesn't
handle POLLFREE.  This allows a use-after-free to occur if a signalfd or
binder fd is polled with aio poll, and the waitqueue gets freed.

Fix this by making aio poll handle POLLFREE.

A patch by Ramji Jiyani <ramjiyani@google.com>
(https://lore.kernel.org/r/20211027011834.2497484-1-ramjiyani@google.com)
tried to do this by making aio_poll_wake() always complete the request
inline if POLLFREE is seen.  However, that solution had two bugs.
First, it introduced a deadlock, as it unconditionally locked the aio
context while holding the waitqueue lock, which inverts the normal
locking order.  Second, it didn't consider that POLLFREE notifications
are missed while the request has been temporarily de-queued.

The second problem was solved by my previous patch.  This patch then
properly fixes the use-after-free by handling POLLFREE in a
deadlock-free way.  It does this by taking advantage of the fact that
freeing of the waitqueue is RCU-delayed, similar to what eventpoll does.

Fixes: 2c14fa838c ("aio: implement IOCB_CMD_POLL")
Cc: <stable@vger.kernel.org> # v4.18+
Link: https://lore.kernel.org/r/20211209010455.42744-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-12-09 10:49:56 -08:00
Eric Biggers 363bee27e2 aio: keep poll requests on waitqueue until completed
Currently, aio_poll_wake() will always remove the poll request from the
waitqueue.  Then, if aio_poll_complete_work() sees that none of the
polled events are ready and the request isn't cancelled, it re-adds the
request to the waitqueue.  (This can easily happen when polling a file
that doesn't pass an event mask when waking up its waitqueue.)

This is fundamentally broken for two reasons:

  1. If a wakeup occurs between vfs_poll() and the request being
     re-added to the waitqueue, it will be missed because the request
     wasn't on the waitqueue at the time.  Therefore, IOCB_CMD_POLL
     might never complete even if the polled file is ready.

  2. When the request isn't on the waitqueue, there is no way to be
     notified that the waitqueue is being freed (which happens when its
     lifetime is shorter than the struct file's).  This is supposed to
     happen via the waitqueue entries being woken up with POLLFREE.

Therefore, leave the requests on the waitqueue until they are actually
completed (or cancelled).  To keep track of when aio_poll_complete_work
needs to be scheduled, use new fields in struct poll_iocb.  Remove the
'done' field which is now redundant.

Note that this is consistent with how sys_poll() and eventpoll work;
their wakeup functions do *not* remove the waitqueue entries.

Fixes: 2c14fa838c ("aio: implement IOCB_CMD_POLL")
Cc: <stable@vger.kernel.org> # v4.18+
Link: https://lore.kernel.org/r/20211209010455.42744-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-12-09 10:49:56 -08:00