linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-10-06 00:39:48 +00:00

Author	SHA1	Message	Date
Harald Freudenberger	5cf1a563a3	s390/ap: fix vanishing crypto cards in SE environment A secure execution (SE, also known as confidential computing) guest may see asynchronous errors on a crypto firmware queue. The current implementation to gather information about cards and queues in ap_queue_info() simply returns if an asynchronous error is pending on the firmware queue. If such a situation happened and it was the only queue visible for a crypto card within an SE guest, then the card vanished from sysfs as the AP bus scan function refuses to hold a card without any type information. As lszcrypt evaluates sysfs, such a card vanished from the lszcrypt card listing, leaving the user baffled and with no way to reset the queue and thus clear the pending asynchronous error. This patch improves the named function to also evaluate GR2 of the TAPQ in case an asynchronous error is pending. If a non-null value is stored there, the info is now processed. In the end, a queue with a pending asynchronous error does not lead to a vanishing card any more. Reviewed-by: Holger Dengler <dengler@linux.ibm.com> Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>	2023-11-05 22:34:57 +01:00
Ingo Franzki	cfaef6516e	s390/zcrypt: don't report online if card or queue is in check-stop state If a card or a queue is in check-stop state, it can't be used by applications to perform crypto operations. Applications check the 'online' sysfs attribute to check if a card or queue is usable. Report a card or queue as offline in case it is in check-stop state. Furthermore, don't allow to set a card or queue online, if it is in check-stop state. This is similar to when the card or the queue is in deconfigured state, there it is also reported as being offline, and it can't be set online. Reviewed-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: Ingo Franzki <ifranzki@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>	2023-11-05 22:34:57 +01:00
Heiko Carstens	aa44433ac4	s390: add USER_STACKTRACE support Use the perf_callchain_user() code as blueprint to also add support for USER_STACKTRACE. To describe how to use this cite the commit message of the LoongArch implementation which came with commit `4d7bf939df` ("LoongArch: Add USER_STACKTRACE support"), but replace -fno-omit-frame-pointer option with the s390 specific -mbackchain option: ====================================================================== To get the best stacktrace output, you can compile your userspace programs with frame pointers (at least glibc + the app you are tracing). 1, export "CC = gcc -mbackchain"; 2, compile your programs with "CC"; 3, use uprobe to get stacktrace output. ... echo 'p:malloc /usr/lib64/libc.so.6:0x0a4704 size=%r2:u64' > uprobe_events echo 'p:free /usr/lib64/libc.so.6:0x0a4d50 ptr=%r2:u64' >> uprobe_events echo 'comm == "demo"' > ./events/uprobes/malloc/filter echo 'comm == "demo"' > ./events/uprobes/free/filter echo 1 > ./options/userstacktrace echo 1 > ./options/sym-userobj ... ====================================================================== Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>	2023-11-05 22:34:57 +01:00
Heiko Carstens	504b73d00a	s390/perf: implement perf_callchain_user() Daan De Meyer and Neal Gompa reported that s390 does not support perf user stack unwinding. This was never implemented since this requires user space to be compiled with the -mbackchain compile option, which until now no distribution did. However this is going to change with Fedora. Therefore provide a perf_callchain_user() implementation. Note that due to the way s390 sets up stack frames the provided call chains can contain invalid values. This is especially true for the first stack frame, where it is not possible to tell if the return address has been written to the stack already or not. Reported-by: Daan De Meyer <daan.j.demeyer@gmail.com> Reported-by: Neal Gompa <ngompa@fedoraproject.org> Closes: https://lore.kernel.org/all/CAO8sHcn3+_qrnvp0580aK7jN0Wion5F7KYeBAa4MnCY4mqABPA@mail.gmail.com/ Link: https://lore.kernel.org/all/20231030123558.10816-A-hca@linux.ibm.com Reviewed-by: Neal Gompa <ngompa@fedoraproject.org> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>	2023-11-05 22:34:57 +01:00
Harald Freudenberger	e14aec2302	s390/ap: fix AP bus crash on early config change callback invocation Fix kernel crash in AP bus code caused by very early invocation of the config change callback function via SCLP. After a fresh IML of the machine the crypto cards are still offline and will get switched online only with activation of any LPAR which has the card in its configuration. A crypto card coming online is reported to the LPAR via SCLP and the AP bus offers a callback function to get this kind of information. However, it may happen that the callback is invoked before the AP bus init function is complete. As the callback triggers a synchronous AP bus scan, the scan may already run but some internal states are not initialized by the AP bus init function resulting in a crash like this: [ 11.635859] Unable to handle kernel pointer dereference in virtual kernel address space [ 11.635861] Failing address: 0000000000000000 TEID: 0000000000000887 [ 11.635862] Fault in home space mode while using kernel ASCE. 
[ 11.635864] AS:00000000894c4007 R3:00000001fece8007 S:00000001fece7800 P:000000000000013d [ 11.635879] Oops: 0004 ilc:1 [#1] SMP [ 11.635882] Modules linked in: [ 11.635884] CPU: 5 PID: 42 Comm: kworker/5:0 Not tainted 6.6.0-rc3-00003-g4dbf7cdc6b42 #12 [ 11.635886] Hardware name: IBM 3931 A01 751 (LPAR) [ 11.635887] Workqueue: events_long ap_scan_bus [ 11.635891] Krnl PSW : 0704c00180000000 0000000000000000 (0x0) [ 11.635895] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3 [ 11.635897] Krnl GPRS: 0000000001000a00 0000000000000000 0000000000000006 0000000089591940 [ 11.635899] 0000000080000000 0000000000000a00 0000000000000000 0000000000000000 [ 11.635901] 0000000081870c00 0000000089591000 000000008834e4e2 0000000002625a00 [ 11.635903] 0000000081734200 0000038000913c18 000000008834c6d6 0000038000913ac8 [ 11.635906] Krnl Code:>0000000000000000: 0000 illegal [ 11.635906] 0000000000000002: 0000 illegal [ 11.635906] 0000000000000004: 0000 illegal [ 11.635906] 0000000000000006: 0000 illegal [ 11.635906] 0000000000000008: 0000 illegal [ 11.635906] 000000000000000a: 0000 illegal [ 11.635906] 000000000000000c: 0000 illegal [ 11.635906] 000000000000000e: 0000 illegal [ 11.635915] Call Trace: [ 11.635916] [<0000000000000000>] 0x0 [ 11.635918] [<000000008834e4e2>] ap_queue_init_state+0x82/0xb8 [ 11.635921] [<000000008834ba1c>] ap_scan_domains+0x6fc/0x740 [ 11.635923] [<000000008834c092>] ap_scan_adapter+0x632/0x8b0 [ 11.635925] [<000000008834c3e4>] ap_scan_bus+0xd4/0x288 [ 11.635927] [<00000000879a33ba>] process_one_work+0x19a/0x410 [ 11.635930] Discipline DIAG cannot be used without z/VM [ 11.635930] [<00000000879a3a2c>] worker_thread+0x3fc/0x560 [ 11.635933] [<00000000879aea60>] kthread+0x120/0x128 [ 11.635936] [<000000008792afa4>] __ret_from_fork+0x3c/0x58 [ 11.635938] [<00000000885ebe62>] ret_from_fork+0xa/0x30 [ 11.635942] Last Breaking-Event-Address: [ 11.635942] [<000000008834c6d4>] ap_wait+0xcc/0x148 This patch improves the ap_bus_force_rescan() function 
which is invoked by the config change callback by checking if a first initial AP bus scan has been done. If not, the force rescan request is simply ignored. In any case, it does not make sense to trigger AP bus re-scans before the very first bus scan is complete. Cc: stable@vger.kernel.org Reviewed-by: Holger Dengler <dengler@linux.ibm.com> Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>	2023-11-05 22:34:57 +01:00
Harald Freudenberger	c40284b364	s390/ap: re-enable interrupt for AP queues This patch introduces some code lines which check for interrupt support enabled on an AP queue after a reply has been received. This invocation point has been chosen as there is a good chance to have the queue empty at that time. As the enablement of the irq implies a state machine change, the queue should not have any pending requests or unreceived replies. Reviewed-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Holger Dengler <dengler@linux.ibm.com> Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>	2023-11-05 22:34:57 +01:00
Harald Freudenberger	01c89ab7f8	s390/ap: rework to use irq info from ap queue status This patch reworks the irq handling and reporting code for the AP queue interrupt handling to always use the irq info from the queue status. Until now the interrupt status of an AP queue was stored into a bool variable within the ap_queue struct. This variable was set on a successful interrupt enablement and cleared with kicking a reset. However, it may be that the interrupt state is manipulated outband for example by a hypervisor. This patch removes this variable and instead the irq bit from the AP queue status which is always reflecting the current irq state is used. Reviewed-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Holger Dengler <dengler@linux.ibm.com> Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>	2023-11-05 22:34:57 +01:00
Alexander Gordeev	0547e0bd9b	s390/mm: add missing conversion to use ptdescs Commit `6326c26c15` ("s390: convert various pgalloc functions to use ptdescs") missed to convert tlb_remove_table() into tlb_remove_ptdesc() in few locations. Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>	2023-11-05 22:34:57 +01:00
Linus Torvalds	77fa2fbe87	vhost,virtio,vdpa: features, fixes, cleanups vdpa/mlx5: VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK new maintainer vdpa: support for vq descriptor mappings decouple reset of iotlb mapping from device reset fixes, cleanups all over the place Signed-off-by: Michael S. Tsirkin <mst@redhat.com> -----BEGIN PGP SIGNATURE----- iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmVCUMYPHG1zdEByZWRo YXQuY29tAAoJECgfDbjSjVRp4L0H/RKcnNXPRqzhhBI1XVQ11Z8CO8WjovcmJalu ADHNEGmvuWnY79fp9eLiZ4iVaTx1qbzqIB5Q500DJ65jh71W7UQ8ww6CGjNUoRGs Zoe4G09WoOf4bvDZZzVV7ml/AzMdsHWSZK8pxY3QI9CsC9Zfp9hg20QYxPylCqYx SIJx7w2MkoojfmtOHRx1WUxaQz99yfU4Z0C5PxtRE1HGN6/a1aY0P0CAl5jq8uCK U5sCRsfCmP7VKlspeEddMiPA35ADbCiysSobCbwGVQEs5cHpMUX7KWa+oV0tF/PY 9uyJb2rJy6zG3tXmL4XNib665ZR86HX6qiWRfm2nBQQStuHaJyg= =mXgo -----END PGP SIGNATURE----- Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost Pull virtio updates from Michael Tsirkin: "vhost,virtio,vdpa: features, fixes, cleanups. vdpa/mlx5: - VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK - new maintainer vdpa: - support for vq descriptor mappings - decouple reset of iotlb mapping from device reset and fixes, cleanups all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (34 commits) vdpa_sim: implement .reset_map support vdpa/mlx5: implement .reset_map driver op vhost-vdpa: clean iotlb map during reset for older userspace vdpa: introduce .compat_reset operation callback vhost-vdpa: introduce IOTLB_PERSIST backend feature bit vhost-vdpa: reset vendor specific mapping to initial state in .release vdpa: introduce .reset_map operation callback virtio_pci: add check for common cfg size virtio-blk: fix implicit overflow on virtio_max_dma_size virtio_pci: add build offset check for the new common cfg items virtio: add definition of VIRTIO_F_NOTIF_CONFIG_DATA feature bit vduse: make vduse_class constant vhost-scsi: Spelling s/preceeding/preceding/g virtio: kdoc for struct 
virtio_pci_modern_device vdpa: Update sysfs ABI documentation MAINTAINERS: Add myself as mlx5_vdpa driver virtio-balloon: correct the comment of virtballoon_migratepage() mlx5_vdpa: offer VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK vdpa/mlx5: Update cvq iotlb mapping on ASID change vdpa/mlx5: Make iotlb helper functions more generic ...	2023-11-05 09:02:32 -10:00
Linus Torvalds	1cfb751165	firewire updates for 6.7 The pull request includes a slight change for flexible length of array in core function. Kees Cook provides a patch to annotate the array embedded in fw_node structure referring to structure member for the length of array. The annotation would be defined by future extension of C compilers, and used for access bound-check at run-time enabled by UBSAN and FORTIFY_SOURCE. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQE66IEYNDXNBPeGKSsLtaWM8LwEwUCZUebWgAKCRCsLtaWM8Lw E9ONAQDJgJrJ8NkA2L/RMq+og3hdz+PFSNgCjLMudpJOzqBr8gD+Mds+Kjkt3PyL gNeixdoc2H2xaGw+Lij/czASbWiywQ4= =+/gn -----END PGP SIGNATURE----- Merge tag 'firewire-updates-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394 Pull firewire update from Takashi Sakamoto: "A slight change for flexible length of array in core function. Kees Cook provides a patch to annotate the array embedded in fw_node structure referring to structure member for the length of array. The annotation would be defined by future extension of C compilers, and used for access bound-check at run-time enabled by UBSAN and FORTIFY_SOURCE" * tag 'firewire-updates-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394: firewire: Annotate struct fw_node with __counted_by	2023-11-05 08:50:41 -10:00
Linus Torvalds	3d05e493e4	I2C has largely driver updates for 6.7., i.e. feature additions (like adding transfers while in atomic mode), using new helpers (like devm_clk_get_enabled), new IDs, documentation fixes and additions... you name it. The core got a memleak fix and better support for nested muxes. -----BEGIN PGP SIGNATURE----- iQJCBAABCgAtFiEEOZGx6rniZ1Gk92RdFA3kzBSgKbYFAmVHnGkPHHdzYUBrZXJu ZWwub3JnAAoJEBQN5MwUoCm2ZosP9jLusVXmRi3BSwnCqNVBD6z6fDcYpIi71zTF P9GALIdxf2lPwUs+GrS/GVsdNzIhA6Fl5847uGaSd9ijaVjaBgjIqt2DRy7zBdwD bvE7mbSRBYLmUF3SgxrmS559MiYTTOJvImR1BTUTOgHXSfnymufCv+s2MGKiI9aa 6jK9SuFXRwgMubzQkl4VT6By/TsnuCF7+9pr2fU9lR5b+mN5JT0EYBF5CPfpdfca UcREhZXkHd758icJ/T0CSqZ6y7/3RmAC8L+vQBcluoXucFfTcIqdZ44+SaiV9pAh kicvsm5Vyme/hCe+mnBRVWDBSJUpvgj48084qZj+QCfrsaqBadA8KCMZFB1ZLi3e eMtEFlt4OtXwL2PpbGjDhG59XWQosPc5ii7Fwu1EA8flW2p+H5NoRPCrFUEBKvDa BkHduxRykJVm86L6vFh6XN93JnbWzBqE6IKnJWEnPNcA2+2RbcenYWghQDE8vS6E bJ2SdHC2S1VOY6KKN6HsGJT4tfpKPP/o+8eabkhrLups/tilNhbct5xjOCtRaV74 mSkwn+I5yJJ43ZdDrTycbe2qd5yrE0hXRfBDXiDpsVlKAJOXtEzhfNJc1j8oNdKX BSz6Bci3GSKcycrSjPJxu0ynW6GgMzvo18N+JENhSuRA5WNVJqrVq76fTDTRSf5T C/ouFqg= =nfX/ -----END PGP SIGNATURE----- Merge tag 'i2c-for-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c updates from Wolfram Sang: "I2C has largely driver updates for 6.7, i.e. feature additions (like adding transfers while in atomic mode), using new helpers (like devm_clk_get_enabled), new IDs, documentation fixes and additions... you name it. 
The core got a memleak fix and better support for nested muxes" * tag 'i2c-for-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (53 commits) i2c: s3c2410: make i2c_s3c_irq_nextbyte() void i2c: qcom-geni: add ACPI device id for sc8180x Documentation: i2c: add fault code for not supporting 10 bit addresses i2c: sun6i-p2wi: Prevent potential division by zero i2c: mux: demux-pinctrl: Convert to use sysfs_emit_at() API i2c: i801: Use new helper acpi_use_parent_companion ACPI: Add helper acpi_use_parent_companion MAINTAINERS: add YAML file for i2c-demux-pinctrl i2c: core: fix lockdep warning for sparsely nested adapter chain i2c: axxia: eliminate kernel-doc warnings dt-bindings: i2c: i2c-demux-pinctrl: Convert to json-schema i2c: stm32f7: Use devm_clk_get_enabled() i2c: stm32f4: Use devm_clk_get_enabled() i2c: stm32f7: add description of atomic in struct stm32f7_i2c_dev i2c: fix memleak in i2c_new_client_device() i2c: exynos5: Calculate t_scl_l, t_scl_h according to i2c spec i2c: i801: Simplify class-based client device instantiation i2c: exynos5: add support for atomic transfers i2c: at91-core: Use devm_clk_get_enabled() eeprom: at24: add ST M24C64-D Additional Write lockable page support ...	2023-11-05 08:41:14 -10:00
Linus Torvalds	2153fc3d68	This pull request contains updates for UBI and UBIFS - UBI Fastmap improvements - Minor issues found by static analysis bots in both UBI and UBIFS - Fix for wrong dentry length UBIFS in fscrypt mode -----BEGIN PGP SIGNATURE----- iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAmVHbHYWHHJpY2hhcmRA c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wSdwEACWV8rZGKamOgnidf5WR8ycQUFd mPnTwPaRAZNjCq5LgeoMPcV1osucWb+vN+dO7o3KQSk2N/UF1iC6Lhs6giMj8MuU ZJ4v6bfPB0m0d5cQy+X8D1NH/yTQ7vzJrpuu3xDJeywLZFt7P6bSJRlX36Zr6WWC BVmVKeiqaB0qnlkpTf4BkIbgXR8Pcv/pXkYjsRcHnSJzpLQD6ByBErQwL2LKGUkE 3ULajJW2oRs8FA6YWfzTcARS3BOpIPJtUyA8RFi8uWKWks16R1Vg4FFfjMVoTzf9 Fe6f2KB+SQxX4dzhGMuVQTGgJuRVacktiL6fLHzvpE2mcKY7E5/b0F39P3BnFhID 2QtojZMXtgFpqz/Vg2W1vNrvHrN++e0sVgqIoxrN8PaE4ciP0yhkhoTEQwx1HWu5 qIxNH8j6USMnfybLgyphVDIoLXzOzhQqY5RWzIYN454jsFg9PeNPkqB3SH9h63j+ nZZ8v+PHeq2wxyYTg4vV+GPKcAlQ/XdKzM9w2Hh+Gp+XIgcO9i4Ag5Qku2lnjPY3 HvKq9t19YSDZkwMnmqeUY/hPB5OK9c0ukirsipQ6ACcQYj0h5hpVHlBGhX8r2cc2 EhdPNP11FhnxR3EcjBDBtSYAKmBs/LDT7gYrjo0013ciMZpVHFREx8G8Icp97366 spp8j/xwAAOyvw6/xA== =fJhk -----END PGP SIGNATURE----- Merge tag 'ubifs-for-linus-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs Pull UBI and UBIFS updates from Richard Weinberger: - UBI Fastmap improvements - Minor issues found by static analysis bots in both UBI and UBIFS - Fix for wrong dentry length UBIFS in fscrypt mode * tag 'ubifs-for-linus-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs: ubifs: ubifs_link: Fix wrong name len calculating when UBIFS is encrypted ubi: block: Fix use-after-free in ubiblock_cleanup ubifs: fix possible dereference after free ubi: fastmap: Add control in 'UBI_IOCATT' ioctl to reserve PEBs for filling pools ubi: fastmap: Add module parameter to control reserving filling pool PEBs ubi: fastmap: Fix lapsed wear leveling for first 64 PEBs ubi: fastmap: Get wl PEB even ec beyonds the 'max' if free PEBs are run out ubi: fastmap: may_reserve_for_fm: Don't reserve PEB if fm_anchor 
exists ubi: fastmap: Remove unneeded break condition while filling pools ubi: fastmap: Wait until there are enough free PEBs before filling pools ubi: fastmap: Use free pebs reserved for bad block handling ubi: Replace erase_block() with sync_erase() ubi: fastmap: Allocate memory with GFP_NOFS in ubi_update_fastmap ubi: fastmap: erase_block: Get erase counter from wl_entry rather than flash ubi: fastmap: Fix missed ec updating after erasing old fastmap data block ubifs: Fix missing error code err ubifs: Fix memory leak of bud->log_hash ubifs: Fix some kernel-doc comments	2023-11-05 08:28:32 -10:00
Kent Overstreet	c7046ed0cf	bcachefs: Improve stripe checksum error message We now include the name of the device in the error message - and also increment the number of checksum errors on that device. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-05 13:14:22 -05:00
Kent Overstreet	853960d00b	bcachefs: Simplify, fix bch2_backpointer_get_key() - backpointer_not_found() checks backpointers_no_use_write_buffer, no need to do it in backpointer_get_key(). - always use backpointer_get_node() for pointers to nodes: backpointer_get_key() was sometimes returning the key from the root node unlocked. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-05 13:14:22 -05:00
Kent Overstreet	daba90f2da	bcachefs: kill thing_it_points_to arg to backpointer_not_found() This can be calculated locally. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-05 13:14:22 -05:00
Kent Overstreet	aa98266588	bcachefs: bch2_ec_read_extent() now takes btree_trans We're not supposed to have more than one btree_trans at a time in a given thread - that causes recursive locking deadlocks. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-05 13:13:57 -05:00
Kent Overstreet	da4aa3b001	bcachefs: bch2_stripe_to_text() now prints ptr gens Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-05 13:13:57 -05:00
Kent Overstreet	769b360049	bcachefs: Don't iterate over journal entries just for btree roots Small performance optimization, and a bit of a code cleanup too. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-05 13:12:18 -05:00
Kent Overstreet	80396a4749	bcachefs: Break up bch2_journal_write() Split up bch2_journal_write() to simplify locking: - bch2_journal_write_pick_flush(), which needs j->lock - bch2_journal_write_prep, which operates on the journal buffer to be written and will need the upcoming buf_lock for synchronization with the btree write buffer flush path Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-05 13:12:18 -05:00
Kent Overstreet	a973de85e3	bcachefs: Replace ERANGE with private error codes We avoid using standard error codes: private, per-callsite error codes make debugging easier. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-05 13:12:18 -05:00
Kent Overstreet	a8958a1a95	bcachefs: bkey_copy() is no longer a macro Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-05 13:12:18 -05:00
Kent Overstreet	103ffe9aaf	bcachefs: x-macro-ify inode flags enum This lets us use bch2_prt_bitflags to print them out. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-05 13:12:18 -05:00
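The x-macro idea behind the commit above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the flag names, the `DEMO_` prefix, and `demo_prt_bitflags()` are hypothetical stand-ins and are not the bcachefs definitions or the real `bch2_prt_bitflags()`. The point it shows is that one list expands into both the enum and a matching name table, so a generic printer can turn a flags word into text.

```c
#include <string.h>

/* Hypothetical flag list; the names are illustrative, not the bcachefs ones. */
#define DEMO_INODE_FLAGS()	\
	x(sync,      0)		\
	x(immutable, 1)		\
	x(append,    2)

/* Expand the list once into an enum of bit masks... */
enum demo_inode_flags {
#define x(name, bit) DEMO_INODE_##name = 1U << (bit),
	DEMO_INODE_FLAGS()
#undef x
};

/* ...and once more into a name table indexed by bit number. */
static const char * const demo_inode_flag_strs[] = {
#define x(name, bit) [bit] = #name,
	DEMO_INODE_FLAGS()
#undef x
};

/* Write the comma-separated names of the set bits into buf (len must be > 0). */
static void demo_prt_bitflags(char *buf, size_t len, unsigned int flags)
{
	size_t n = sizeof(demo_inode_flag_strs) / sizeof(demo_inode_flag_strs[0]);

	buf[0] = '\0';
	for (size_t i = 0; i < n; i++) {
		if (!(flags & (1U << i)))
			continue;
		if (buf[0])
			strncat(buf, ",", len - strlen(buf) - 1);
		strncat(buf, demo_inode_flag_strs[i], len - strlen(buf) - 1);
	}
}
```

Because the enum and the string table come from the same list, adding a flag in one place keeps both in sync, which is what makes a single generic bit-flag printer usable.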
Kent Overstreet	d4c8bb69d0	bcachefs: Convert bch2_fs_open() to darray Open coded dynamic arrays are deprecated. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-05 13:12:17 -05:00
Kent Overstreet	0f0fc31238	bcachefs: Move __bch2_members_v2_get_mut to sb-members.h Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-05 13:12:17 -05:00
Kent Overstreet	59154f2c66	bcachefs: bch2_prt_datetime() Improved, better named version of pr_time(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-05 13:12:08 -05:00
Palmer Dabbelt	3ce99bd635	Merge patch series "Improve PTDUMP and introduce new fields" Yu Chien Peter Lin <peterlin@andestech.com> says: This patchset enhances PTDUMP by providing additional information from pagetable entries. The first patch fixes the RSW field, while the second and third patches introduce the PBMT and NAPOT fields, respectively, for RV64 systems. * b4-shazam-merge: riscv: Introduce NAPOT field to PTDUMP riscv: Introduce PBMT field to PTDUMP riscv: Improve PTDUMP to show RSW with non-zero value Link: https://lore.kernel.org/r/20230921025022.3989723-1-peterlin@andestech.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>	2023-11-05 09:41:57 -08:00
Yu Chien Peter Lin	015c3c3704	riscv: Introduce NAPOT field to PTDUMP This patch introduces the NAPOT field to PTDUMP, allowing it to display the letter "N" for pages that have the 63rd bit set. Signed-off-by: Yu Chien Peter Lin <peterlin@andestech.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Tested-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20230921025022.3989723-4-peterlin@andestech.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>	2023-11-05 09:41:55 -08:00
Yu Chien Peter Lin	0713ff3371	riscv: Introduce PBMT field to PTDUMP This patch introduces the PBMT field to the PTDUMP, so it can display the memory attributes for NC or IO. Signed-off-by: Yu Chien Peter Lin <peterlin@andestech.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Tested-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20230921025022.3989723-3-peterlin@andestech.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>	2023-11-05 09:41:54 -08:00
Yu Chien Peter Lin	d5d2c264d3	riscv: Improve PTDUMP to show RSW with non-zero value The RSW field can be used to encode 2 bits of software-defined information. Currently, PTDUMP only prints "RSW" when its value is 1 or 3. To fix this issue and improve the debugging experience with PTDUMP, we redefine _PAGE_SPECIAL to its original value and use _PAGE_SOFT as the RSW mask, allowing it to print the RSW with any non-zero value. This patch also removes the val from the struct prot_bits as it is no longer needed. Signed-off-by: Yu Chien Peter Lin <peterlin@andestech.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Tested-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20230921025022.3989723-2-peterlin@andestech.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>	2023-11-05 09:41:53 -08:00
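The three PTDUMP patches above each decode one field of an RV64 page-table entry. A minimal sketch of that decoding follows, assuming the bit layout from the RISC-V privileged specification (RSW in bits 9:8, PBMT in bits 62:61, N in bit 63). The `DEMO_` macros and `demo_pte_*()` helpers are hypothetical and are not the kernel's `_PAGE_*` definitions or the PTDUMP code itself.

```c
#include <stdint.h>

/* Assumed bit positions; the DEMO_ names are illustrative only. */
#define DEMO_PTE_RSW_SHIFT	8
#define DEMO_PTE_RSW_MASK	(UINT64_C(3) << DEMO_PTE_RSW_SHIFT)	/* bits 9:8 */
#define DEMO_PTE_PBMT_SHIFT	61
#define DEMO_PTE_PBMT_MASK	(UINT64_C(3) << DEMO_PTE_PBMT_SHIFT)	/* bits 62:61 */
#define DEMO_PTE_NAPOT		(UINT64_C(1) << 63)			/* bit 63 */

/* Return the 2-bit software-defined RSW value (0..3). */
static unsigned int demo_pte_rsw(uint64_t pte)
{
	return (unsigned int)((pte & DEMO_PTE_RSW_MASK) >> DEMO_PTE_RSW_SHIFT);
}

/* Name the page-based memory type encoded in the PBMT field. */
static const char *demo_pte_pbmt(uint64_t pte)
{
	static const char * const names[] = { "PMA", "NC", "IO", "reserved" };

	return names[(pte & DEMO_PTE_PBMT_MASK) >> DEMO_PTE_PBMT_SHIFT];
}

/* Report whether the NAPOT (N) bit is set. */
static int demo_pte_napot(uint64_t pte)
{
	return (pte & DEMO_PTE_NAPOT) != 0;
}
```

Masking the full two-bit RSW field, rather than testing a single bit, is what lets a value of 2 show up as well as 1 and 3.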
Conor Dooley	d3eabf2f2c	RISC-V: capitalise CMO op macros The CMO op macros initially used lower case, as the original iteration of the ALT_CMO_OP alternative stringified the first parameter to finalise the assembly for the standard variant. As a knock-on, the T-Head versions of these CMOs had to use mixed case defines. Commit `dd23e95358` ("RISC-V: replace cbom instructions with an insn-def") removed the asm construction with stringify, replacing it with an insn-def macro, rendering the lower case surplus to requirements. As far as I can tell from a brief check, CBO_zero does not see similar use and didn't require the mixed case define in the first place. Replace the lower case characters now for consistency with other insn-def macros in the standard and T-Head forms, and adjust the callsites. Suggested-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Conor Dooley <conor.dooley@microchip.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Link: https://lore.kernel.org/r/20230915-aloe-dollar-994937477776@spud Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>	2023-11-05 09:11:23 -08:00
Jisheng Zhang	c20d36cc2a	riscv: don't probe unaligned access speed if already done If the misaligned_access_speed percpu variable is no longer the so-called "HWPROBE MISALIGNED UNKNOWN", the probe has already happened (this is possible, for example, when one CPU is hotplugged off and then on again) and the percpu variable has been set. Don't probe again in this case. Signed-off-by: Jisheng Zhang <jszhang@kernel.org> Fixes: `584ea6564b` ("RISC-V: Probe for unaligned access speed") Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20230912154040.3306-1-jszhang@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>	2023-11-05 09:05:51 -08:00
Jinyu Tang	07863871df	riscv: defconfig : add CONFIG_MMC_DW for starfive If these config options are not set, MMC does not work on the JH7110 and the rootfs cannot be found when using an SD card. So set CONFIG_MMC_DW=y like the arm64 defconfig, and set CONFIG_MMC_DW_STARFIVE=y for StarFive. Then the StarFive VF2 board can boot an SD card rootfs with the mainline defconfig and dtb. Signed-off-by: Jinyu Tang <tangjinyu@tinylab.org> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20230912133128.5247-1-tangjinyu@tinylab.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>	2023-11-05 09:04:54 -08:00
Haorong Lu	ce4f78f1b5	riscv: signal: handle syscall restart before get_signal In the current riscv implementation, blocking syscalls like read() may not correctly restart after being interrupted by ptrace. This problem arises when the syscall restart process in arch_do_signal_or_restart() is bypassed due to changes to the regs->cause register, such as an ebreak instruction. Steps to reproduce: 1. Interrupt the tracee process with PTRACE_SEIZE & PTRACE_INTERRUPT. 2. Backup original registers and instruction at new_pc. 3. Change pc to new_pc, and inject an instruction (like ebreak) to this address. 4. Resume with PTRACE_CONT and wait for the process to stop again after executing ebreak. 5. Restore original registers and instructions, and detach from the tracee process. 6. Now the read() syscall in tracee will return -1 with errno set to ERESTARTSYS. Specifically, during an interrupt, the regs->cause changes from EXC_SYSCALL to EXC_BREAKPOINT due to the injected ebreak, which is inaccessible via ptrace so we cannot restore it. This alteration breaks the syscall restart condition and ends the read() syscall with an ERESTARTSYS error. According to include/linux/errno.h, it should never be seen by user programs. X86 can avoid this issue as it checks the syscall condition using a register (orig_ax) exposed to user space. Arm64 handles syscall restart before calling get_signal, where it could be paused and inspected by ptrace/debugger. This patch adjusts the riscv implementation to arm64 style, which also checks syscall using a kernel register (syscallno). It ensures the syscall restart process is not bypassed when changes to the cause register occur, providing more consistent behavior across various architectures. For a simplified reproduction program, feel free to visit: https://github.com/ancientmodern/riscv-ptrace-bug-demo. 
Signed-off-by: Haorong Lu <ancientmodern4@gmail.com> Link: https://lore.kernel.org/r/20230803224458.4156006-1-ancientmodern4@gmail.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>	2023-11-05 06:43:01 -08:00
Palmer Dabbelt	0619ff9f02	Merge patch series "Add support to handle misaligned accesses in S-mode" Clément Léger <cleger@rivosinc.com> says: Since commit 61cadb9 ("Provide new description of misaligned load/store behavior compatible with privileged architecture.") in the RISC-V ISA manual, it is stated that misaligned load/store might not be supported. However, the RISC-V kernel uABI describes that misaligned accesses are supported. In order to support that, this series adds support for S-mode handling of misaligned accesses as well as support for prctl(PR_UNALIGN). Handling misaligned access in the kernel allows for finer-grained control of the misaligned access behavior, and thanks to the prctl() call, can allow disabling misaligned access emulation to generate SIGBUS. User space can then optimize its software by removing such accesses based on SIGBUS generation. This series is useful when using an SBI implementation that does not handle misaligned traps as well as for detecting misaligned accesses generated by userspace applications using the prctl(PR_SET_UNALIGN) feature. This series can be tested using the spike simulator[1] and a modified openSBI version[2] which always delegates misaligned load/store to S-mode. A test[3] that exercises various instructions/registers can be executed to verify the unaligned access support. 
[1] https://github.com/riscv-software-src/riscv-isa-sim [2] https://github.com/rivosinc/opensbi/tree/dev/cleger/no_misaligned [3] https://github.com/clementleger/unaligned_test * b4-shazam-merge: riscv: add support for PR_SET_UNALIGN and PR_GET_UNALIGN riscv: report misaligned accesses emulation to hwprobe riscv: annotate check_unaligned_access_boot_cpu() with __init riscv: add support for sysctl unaligned_enabled control riscv: add floating point insn support to misaligned access emulation riscv: report perf event for misaligned fault riscv: add support for misaligned trap handling in S-mode riscv: remove unused functions in traps_misaligned.c Link: https://lore.kernel.org/r/20231004151405.521596-1-cleger@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>	2023-11-05 06:42:51 -08:00
Kees Cook	c12d7aa7ff	firewire: Annotate struct fw_node with __counted_by Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). As found with Coccinelle[1], add __counted_by for struct fw_node. [1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci Cc: Takashi Sakamoto <o-takashi@sakamocchi.jp> Cc: linux1394-devel@lists.sourceforge.net Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://lore.kernel.org/r/20230922175334.work.335-kees@kernel.org Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>	2023-11-05 21:15:17 +09:00
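As a rough illustration of the annotation described in the commit above, the sketch below attaches a counted-by attribute to a flexible array member. It is written under assumptions: `struct demo_node`, `demo_counted_by()` and `demo_node_alloc()` are hypothetical names, not the firewire `struct fw_node` or the kernel's `__counted_by` macro, and the attribute expands to nothing on compilers that do not provide it.

```c
#include <stdlib.h>

/* Use the compiler attribute where available, otherwise fall back to a no-op. */
#if defined(__has_attribute)
# if __has_attribute(counted_by)
#  define demo_counted_by(member) __attribute__((counted_by(member)))
# endif
#endif
#ifndef demo_counted_by
# define demo_counted_by(member)
#endif

/* Hypothetical structure: port_count states how many ports[] entries exist. */
struct demo_node {
	unsigned int port_count;
	int ports[] demo_counted_by(port_count);
};

/* Allocate a node with room for port_count entries and record the count
 * before any element of ports[] is touched. */
static struct demo_node *demo_node_alloc(unsigned int port_count)
{
	struct demo_node *n;

	n = calloc(1, sizeof(*n) + port_count * sizeof(n->ports[0]));
	if (n)
		n->port_count = port_count;
	return n;
}
```

Setting the counter immediately after allocation matters: a bounds checker that relies on the annotation compares every `ports[i]` access against the current value of `port_count`.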
Linus Torvalds	1c41041124	I3C for 6.7 Core: - handle IBI in the proper order Drivers: - cdns: fix status register access - mipi-i3c-hci: many fixes now that the driver has been actually tested - svc: many IBI fixes, correct compatible string, fix hot join corner cases -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEBqsFVZXh8s/0O5JiY6TcMGxwOjIFAmVGzKMACgkQY6TcMGxw OjKcIxAAiPLfjxU6c9tbGfOOvV/rM9gDKZvGmFeawJMIP6LAEGkdp6AvU2bhoy7j zKcEYCYPVvdY7Rh9HiQrcoSXjgN5XPlc5hoEyW4E+OoWyXC2FA+8DItmSpRaF+Sn KqsPlLjbNKhUTd1Y+tzVGqHOCLuD+U4EQOZNqaAFNiU6wSXaVRo7Idvwznv9B6u9 RhJIcjPfh8M7Rr5DtKFr+zw/8TQB5maj4cWMZC0iphxSXCs6GlkVlxwI2S3PuPUI AbZJg/aGmczkce164dIzo5cBYL4Rq9JIqOUUpEhVBlKblSzKqIu0WM28KF7eASg5 Y/4YOh8xKO0NZtveBwxTkaO2LaJ1phqEVSfkxtXLH+fnlqQTyHWyyOFuYKm2UTKR vdASQeHo/rhYK/7qeGPjoYi8QRXNwI94oYgCuTNFr0l3SFocIHYJA81dpZtZs83l hIpUyfC1v9edXaCe9gdZb/pUw9akOoqYi6wm33YvRWKYC2W1qeLoH0hB1XSF39mP eXwNoF8Odp+e1BCgoVGWAVDDEUAT1b270ZmBLM+2qjP1UoIgARevDFEFnelvcyCQ dte0GRSOHtfbgcjL75+P/RcyGguY2TgpvFgTSv7tcc5lh5lYsEk775/pONahESmV 3Jjf+8q0R31gWPQNQEDxsp1tpPXvM5YaQzy9u45/CWYoEfH/Z4s= =rFcf -----END PGP SIGNATURE----- Merge tag 'i3c/for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux Pull i3c updates from Alexandre Belloni: "There are now more fixes because as stated in my previous pull request, people now have access to actual hardware. Core: - handle IBI in the proper order Drivers: - cdns: fix status register access - mipi-i3c-hci: many fixes now that the driver has been actually tested - svc: many IBI fixes, correct compatible string, fix hot join corner cases" * tag 'i3c/for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux: (29 commits) i3c: master: handle IBIs in order they came i3c: master: mipi-i3c-hci: Fix a kernel panic for accessing DAT_data. 
i3c: master: svc: fix compatibility string mismatch with binding doc i3c: master: svc: fix random hot join failure since timeout error i3c: master: svc: fix SDA keep low when polling IBIWON timeout happen i3c: master: svc: fix check wrong status register in irq handler i3c: master: svc: fix ibi may not return mandatory data byte i3c: master: svc: fix wrong data return when IBI happen during start frame i3c: master: svc: fix race condition in ibi work thread i3c: Fix typo "Provisional ID" to "Provisioned ID" i3c: Fix potential refcount leak in i3c_master_register_new_i3c_devs i3c: mipi-i3c-hci: Resume controller after aborted transfer i3c: mipi-i3c-hci: Resume controller explicitly i3c: mipi-i3c-hci: Fix missing xfer->completion in hci_cmd_v1_daa() i3c: mipi-i3c-hci: Do not unmap region not mapped for transfer i3c: mipi-i3c-hci: Set number of SW enabled Ring Bundles earlier i3c: mipi-i3c-hci: Fix race between bus cleanup and interrupt i3c: mipi-i3c-hci: Set ring start request together with enable i3c: mipi-i3c-hci: Remove BUG() when Ring Abort request times out i3c: mipi-i3c-hci: Fix out of bounds access in hci_dma_irq_handler ...	2023-11-04 16:25:36 -10:00
Linus Torvalds	b8cc56d041	cxl for v6.7 - Add support for RCH (Restricted CXL Host) Error recovery - Fix several region assembly bugs - Fix mem-device lifetime issues relative to the sanitize command and RCH topology. - Refactor ACPI table parsing for CDAT parsing re-use in preparation for CXL QOS support. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCZUaowQAKCRDfioYZHlFs Z75rAP44azzLPwJtva7Ur60KpNsGuoZKhvWWdeI1/zo9k4pHbwEA/Vaf/GGo0U5k bMkoTmwPTd7YY79B5HNUQSZsqF9wlAc= =TEQ0 -----END PGP SIGNATURE----- Merge tag 'cxl-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl Pull CXL (Compute Express Link) updates from Dan Williams: "The main new functionality this time is work to allow Linux to natively handle CXL link protocol errors signalled via PCIe AER for current generation CXL platforms. This required some enlightenment of the PCIe AER core to workaround the fact that current generation RCH (Restricted CXL Host) platforms physically hide topology details and registers via a mechanism called RCRB (Root Complex Register Block). The next major highlight is reworks to address bugs in parsing region configurations for next generation VH (Virtual Host) topologies. The old broken algorithm is replaced with a simpler one that significantly increases the number of region configurations supported by Linux. This is again relevant for error handling so that forward and reverse address translation of memory errors can be carried out by Linux for memory regions instantiated by platform firmware. As for other cross-tree work, the ACPI table parsing code has been refactored for reuse parsing the "CDAT" structure which is an ACPI-like data structure that is reported by CXL devices. That work is in preparation for v6.8 support for CXL QoS. Think of this as dynamic generation of NUMA node topology information generated by Linux rather than platform firmware. 
Lastly, a number of internal object lifetime issues have been resolved along with misc. fixes and feature updates (decoders_committed sysfs ABI). Summary: - Add support for RCH (Restricted CXL Host) Error recovery - Fix several region assembly bugs - Fix mem-device lifetime issues relative to the sanitize command and RCH topology. - Refactor ACPI table parsing for CDAT parsing re-use in preparation for CXL QOS support" * tag 'cxl-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (50 commits) lib/fw_table: Remove acpi_parse_entries_array() export cxl/pci: Change CXL AER support check to use native AER cxl/hdm: Remove broken error path cxl/hdm: Fix && vs \|\| bug acpi: Move common tables helper functions to common lib cxl: Add support for reading CXL switch CDAT table cxl: Add checksum verification to CDAT from CXL cxl: Export QTG ids from CFMWS to sysfs as qos_class attribute cxl: Add decoders_committed sysfs attribute to cxl_port cxl: Add cxl_decoders_committed() helper cxl/core/regs: Rework cxl_map_pmu_regs() to use map->dev for devm cxl/core/regs: Rename phys_addr in cxl_map_component_regs() PCI/AER: Unmask RCEC internal errors to enable RCH downstream port error handling PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler cxl/pci: Disable root port interrupts in RCH mode cxl/pci: Add RCH downstream port error logging cxl/pci: Map RCH downstream AER registers for logging protocol errors cxl/pci: Update CXL error logging to use RAS register address PCI/AER: Refactor cper_print_aer() for use by CXL driver module cxl/pci: Add RCH downstream port AER register discovery ...	2023-11-04 16:20:36 -10:00
Kent Overstreet	bf61dcdfc1	bcachefs: CONFIG_BCACHEFS_DEBUG_TRANSACTIONS no longer defaults to y BCACHEFS_DEBUG_TRANSACTIONS is useful, but it's too expensive to have on by default - and it hasn't been coming up in bug reports. Turn it off by default until we figure out a way to make it cheaper. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-04 22:19:13 -04:00
Kent Overstreet	9fcdd23b6e	bcachefs: Add a comment for BTREE_INSERT_NOJOURNAL usage BTREE_INSERT_NOJOURNAL is primarily used for a performance optimization related to inode updates and fsync - document it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-04 22:19:13 -04:00
Kent Overstreet	d3c7727bb9	bcachefs: rebalance_work btree is not a snapshots btree rebalance_work entries may refer to entries in the extents btree, which is a snapshots btree, or they may also refer to entries in the reflink btree, which is not. Hence rebalance_work keys may use the snapshot field but it's not required to be nonzero - add a new btree flag to reflect this. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-04 22:19:13 -04:00
Kent Overstreet	01ccee225a	bcachefs: Add missing printk newlines This was causing error messages in -tools to not get printed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-04 22:19:13 -04:00
Kent Overstreet	5a53f851e6	bcachefs: Fix recovery when forced to use JSET_NO_FLUSH journal entry When we didn't find anything in the journal that we'd like to use, and we're forced to use whatever we can find - that entry will have been a JSET_NO_FLUSH entry with a garbage last_seq value, since it's not normally used. Initialize it to something sane, for bch2_fs_journal_start(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-04 22:19:13 -04:00
Kent Overstreet	ce3e9a8a10	bcachefs: .get_parent() should return an error pointer Delete the useless check for inum == 0; we'll return -ENOENT without it, which is what we want. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-04 22:19:13 -04:00
Kent Overstreet	4bd156c4b4	bcachefs: Fix bch2_delete_dead_inodes() - the fsck_err() check for the filesystem being clean was incorrect, causing us to always fail to delete unlinked inodes - if a snapshot had been taken, the unlinked inode needs to be propagated to snapshot leaves so the unlink can happen there - fixed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-04 22:19:13 -04:00
Brian Foster	7cb2a7895d	bcachefs: use swab40 for bch_backpointer.bucket_offset bitfield The bucket_offset field of bch_backpointer is a 40-bit bitfield, but the bch2_backpointer_swab() helper uses swab32. This leads to inconsistency when an on-disk fs is accessed from an opposite endian machine. As it turns out, we already have an internal swab40() helper that is used from the bch_alloc_v4 swab callback. Lift it into the backpointers header file and use it consistently in both places. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-04 22:19:13 -04:00
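Why a 32-bit swap is the wrong tool for a 40-bit field can be seen from a small sketch of a 40-bit byte swap. The function name `demo_swab40` is hypothetical and this is an illustration of the technique, not a copy of the bcachefs helper: it reverses the five low-order bytes of a value and discards anything above bit 39.

```c
#include <stdint.h>

/*
 * Reverse the byte order of the low 40 bits (5 bytes) of x.
 * Byte 0 trades places with byte 4, byte 1 with byte 3, and
 * byte 2 stays in the middle.
 */
uint64_t demo_swab40(uint64_t x)
{
	return ((x & 0x00000000ffULL) << 32) |
	       ((x & 0x000000ff00ULL) << 16) |
	       ((x & 0x0000ff0000ULL)) |
	       ((x & 0x00ff000000ULL) >> 16) |
	       ((x & 0xff00000000ULL) >> 32);
}
```

A 32-bit swap would reorder only four of the five bytes and leave the fifth in place, so the field would decode to a different offset on a machine of the opposite endianness.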
Brian Foster	0996c72a0f	bcachefs: byte order swap bch_alloc_v4.fragmentation_lru field A simple test to populate a filesystem on one CPU architecture and fsck on an arch of the opposite byte order produces errors related to the fragmentation LRU. This occurs because the 64-bit fragmentation_lru field is not byte-order swapped when reads detect that the on-disk/bset key values were written in opposite byte-order of the current CPU. Update the bch2_alloc_v4 swab callback to handle fragmentation_lru as is done for other multi-byte fields. This doesn't affect existing filesystems when accessed by CPUs of the same endianness because the ->swab() callback is only called when the bset flags indicate an endianness mismatch between the CPU and on-disk data. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-04 22:19:13 -04:00
Brian Foster	2a4e749760	bcachefs: allow writeback to fill bio completely The bcachefs folio writeback code includes a bio full check as well as a fixed size check to determine when to split off and submit writeback I/O. The inclusive check of the latter against the limit means that writeback can submit slightly prematurely. This is not a functional problem, but results in unnecessarily split I/Os and extent merging. This can be observed with a buffered write sized exactly to the current maximum value (1MB) and with key_merging_disabled=1. The latter prevents the merge from the second write such that a subsequent check of the extent list shows a 1020k extent followed by a contiguous 4k extent. The purpose for the fixed size check is also undocumented and somewhat obscure. Lift this check into a new helper that wraps the bio check, fix the comparison logic, and add a comment to document the purpose and how we might improve on this in the future. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-04 22:19:13 -04:00
Brian Foster	0e91d3a6d5	bcachefs: fix odebug warn and lockdep splat due to on-stack rhashtable Guenter Roeck reports a lockdep splat and DEBUG_OBJECTS_WORK related warning when bch2_copygc_thread() initializes its rhashtable. The lockdep splat relates to a warning print caused by the fact that the rhashtable exists on the stack but is not annotated as such. This is something that could be addressed by INIT_WORK_ONSTACK(), but rhashtable doesn't expose that control and it probably isn't worth the churn for just one user. Instead, dynamically allocate the buckets_in_flight structure and avoid the splat that way. Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-04 22:19:13 -04:00
Brian Foster	e0fb0dccfd	bcachefs: update alloc cursor in early bucket allocator A recent bug report uncovered a scenario where a filesystem never runs with freespace_initialized, and therefore the user observes significantly degraded write performance by virtue of running the early bucket allocator. The associated bug aside, the primary cause of the performance drop in this particular instance is that the early bucket allocator does not update the allocation cursor. This means that every allocation walks the alloc btree from the first bucket of the associated device looking for a bucket marked as free space. Update the early allocator code to set the alloc cursor to the last processed position in the tree, similar to how the freelist allocator behaves. With the alloc_cursor being updated, the retry logic also needs to be updated to restart from the beginning of the device when a free bucket is not available between the cursor and the end of the device. Track the restart position in a first_bucket variable to make the code a bit more easily readable and consistent with the freelist allocator. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-04 22:19:13 -04:00
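The cursor-and-restart behaviour described in this commit can be sketched with a plain array standing in for the alloc btree. Everything here is a hypothetical simplification (the `demo_dev` struct, the boolean free map, the `DEMO_NO_BUCKET` sentinel), not the bcachefs implementation: the scan starts at the saved cursor, and only when nothing is free between the cursor and the end of the device does it restart from the first bucket.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NO_BUCKET ((size_t)-1)

/* Hypothetical device: a free map plus the saved allocation cursor. */
struct demo_dev {
	bool *bucket_free;    /* true if the bucket is free */
	size_t first_bucket;  /* first allocatable bucket */
	size_t nbuckets;      /* one past the last bucket */
	size_t alloc_cursor;  /* where the previous scan stopped */
};

/* Return a free bucket index, or DEMO_NO_BUCKET if the device is full. */
size_t demo_bucket_alloc(struct demo_dev *dev)
{
	size_t cursor = dev->alloc_cursor;

	if (cursor < dev->first_bucket || cursor >= dev->nbuckets)
		cursor = dev->first_bucket;

	/* Pass 0 scans cursor..end; pass 1 restarts at first_bucket..cursor. */
	size_t lo[2] = { cursor, dev->first_bucket };
	size_t hi[2] = { dev->nbuckets, cursor };

	for (int pass = 0; pass < 2; pass++) {
		for (size_t b = lo[pass]; b < hi[pass]; b++) {
			if (dev->bucket_free[b]) {
				dev->bucket_free[b] = false;
				dev->alloc_cursor = b + 1; /* resume here next time */
				return b;
			}
		}
	}
	return DEMO_NO_BUCKET;
}
```

Without the `alloc_cursor` update, every call would rescan from `first_bucket`, which is the repeated walk the commit message identifies as the main cause of the slowdown.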
Brian Foster	385a82f62a	bcachefs: serialize on cached key in early bucket allocator bcachefs had a transient bug where freespace_initialized was not properly being set, which led to unexpected use of the early bucket allocator at runtime. This issue has been fixed, but the existence of it uncovered a coherency issue in the early bucket allocation code that is somewhat related to how uncached iterators deal with the key cache. The problem itself manifests as occasional failure of generic/113 due to corruption, often seen as a duplicate backpointer or multiple data types per-bucket error. The immediate cause of the error is a racing bucket allocation along the lines of the following sequence: - Task 1 selects key A in bch2_bucket_alloc_early() and schedules. - Task 2 selects the same key A, but proceeds to complete the allocation and associated I/O, after which it releases the open_bucket. - Task 1 resumes with key A, but does not recognize the bucket is now allocated because the open_bucket has been removed from the hash when it was released in the previous step. This generally shouldn't happen because the allocating task updates the alloc btree key before releasing the bucket. This is not sufficient in this particular instance, however, because an uncached iterator for a cached btree doesn't actually lock the key cache slot when no key exists for a given slot in the cache. Thus the fact that the allocation side updates the cached key means that multiple uncached iters can stumble across the same alloc key and duplicate the bucket allocation as described above. This is something that probably needs a longer term fix in the iterator code. As a short term fix, close the race through explicit use of a cached iterator for likely allocation candidates. We don't want to scan the btree with a cached iterator because that would unnecessarily pollute the cache. 
This mitigates cache pollution by primarily scanning the tree with an uncached iterator, but closes the race by creating a key cache entry for any prospective slot prior to the bucket allocation attempt (also similar to how _alloc_freelist() works via try_alloc_bucket()). This survives many iterations of generic/113 on a kernel hacked to always use the early bucket allocator. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-11-04 22:19:13 -04:00

1233466 commits