linux-stable


Barry Song 43b3dfdd04 arm64: support batched/deferred tlb shootdown during page reclamation/migration

On x86, batched and deferred tlb shootdown has led to a 90% performance
increase on tlb shootdown. On arm64, HW can do tlb shootdown without a
software IPI, but a sync tlbi is still quite expensive.

Even running the simplest program that requires swapout demonstrates this:

    #include <sys/types.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <string.h>

    int main()
    {
    #define SIZE (1 * 1024 * 1024)
            volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);

            memset(p, 0x88, SIZE);

            for (int k = 0; k < 10000; k++) {
                    /* swap in */
                    for (int i = 0; i < SIZE; i += 4096) {
                            (void)p[i];
                    }

                    /* swap out */
                    madvise(p, SIZE, MADV_PAGEOUT);
            }
    }

Perf result on snapdragon 888 with 8 cores, using zRAM as the swap block
device:

    ~ # perf record taskset -c 4 ./a.out
    [ perf record: Woken up 10 times to write data ]
    [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
    ~ # perf report
    # To display the perf.data header info, please use --header/--header-only options.
    # To display the perf.data header info, please use --header/--header-only options.
    #
    #
    # Total Lost Samples: 0
    #
    # Samples: 60K of event 'cycles'
    # Event count (approx.): 35706225414
    #
    # Overhead  Command  Shared Object      Symbol
    # ........  .......  .................  ......
    #
        21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
         8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
         6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
         6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
         5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
         3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
         3.49%  a.out    [kernel.kallsyms]  [k] memset64
         1.63%  a.out    [kernel.kallsyms]  [k] clear_page
         1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
         1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
         1.23%  a.out    [kernel.kallsyms]  [k] xas_load
         1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock

ptep_clear_flush() takes 5.36% CPU in the micro-benchmark swapping in/out
a page mapped by only one process. If the page is mapped by multiple
processes, typically more than 100 on a phone, the overhead would be much
higher, as we have to run the tlb flush 100 times for one single page.
In addition, tlb flush overhead increases with the number of CPU cores
due to the poor scalability of tlb shootdown in HW, so ARM64 servers
should expect much higher overhead.

Further perf annotate shows that 95% of the cpu time of ptep_clear_flush
is actually spent in the final dsb(), waiting for the completion of the
tlb flush. This gives us a very good chance to leverage the existing
batched tlb support in the kernel. The minimum modification is that we
only send the async tlbi in the first stage, and we send the dsb when we
have to sync in the second stage.

With the above micro-benchmark, the elapsed time to finish the program
decreases by around 5%.

Typical elapsed time w/o patch:

    ~ # time taskset -c 4 ./a.out
    0.21user 14.34system 0:14.69elapsed

w/ patch:

    ~ # time taskset -c 4 ./a.out
    0.22user 13.45system 0:13.80elapsed

Also tested with the benchmark in the commit on a Kunpeng920 arm64 server
and observed an improvement of around 12.5% with the command
`time ./swap_bench`.

            w/o             w/
    real    0m13.460s       0m11.771s
    user    0m0.248s        0m0.279s
    sys     0m12.039s       0m11.458s

Originally a 16.99% overhead of ptep_clear_flush() was noticed, which has
been eliminated by this patch:

    [root@localhost yang]# perf record -- ./swap_bench && perf report
    [...]
    16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush

It was tested on 4, 8 and 128 CPU platforms and is shown to be beneficial
on large systems, but may bring no improvement on small systems such as a
4 CPU platform.

This patch also improves the performance of page migration. Using pmbench
and migrating the pages of pmbench between node 0 and node 1 100 times
for 1G of memory, this patch decreases the time used by around 20% (prev
18.338318910 sec, after 13.981866350 sec) and saves the time used by
ptep_clear_flush().

Link: https://lkml.kernel.org/r/20230717131004.12662-5-yangyicong@huawei.com
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Barry Song <baohua@kernel.org>
Cc: Darren Hart <darren@os.amperecomputing.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: lipeifeng <lipeifeng@oppo.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2023-08-18 10:12:37 -07:00
..
ABI	HWPOISON: offline support: fix spelling in Documentation/ABI/	2023-08-18 10:12:18 -07:00
PCI	…
RCU	…
accel	…
accounting	…
admin-guide	mm/memory_hotplug: document the signal_pending() check in offline_pages()	2023-08-18 10:12:19 -07:00
arch	- Work around an erratum on GIC700, where a race between a CPU	2023-07-30 10:59:19 -07:00
block	…
bpf	…
cdrom	…
core-api	…
cpu-freq	…
crypto	…
dev-tools	…
devicetree	Devicetree fixes for v6.5:	2023-07-22 10:28:22 -07:00
doc-guide	…
driver-api	Fixes for pci_clean_master, error handling in driver inits, and various	2023-07-09 09:35:51 -07:00
fault-injection	…
fb	…
features	arm64: support batched/deferred tlb shootdown during page reclamation/migration	2023-08-18 10:12:37 -07:00
filesystems	tmpfs: fix Documentation of noswap and huge mount options	2023-07-27 13:07:03 -07:00
firmware-guide	…
firmware_class	…
fpga	…
gpu	…
hid	…
hwmon	…
i2c	…
iio	…
images	…
infiniband	…
input	…
isdn	…
kbuild	…
kernel-hacking	…
leds	…
litmus-tests	…
livepatch	…
locking	…
loongarch	…
maintainer	…
mhi	…
mips	…
misc-devices	…
mm	…
netlabel	…
netlink	…
networking	docs: net: clarify the NAPI rules around XDP Tx	2023-07-21 18:51:37 -07:00
nvdimm	…
nvme	…
pcmcia	…
peci	…
power	…
powerpc	…
process	Documentation: embargoed-hardware-issues.rst: add AMD to the list	2023-07-26 09:39:34 +02:00
riscv	Documentation: RISC-V: hwprobe: Fix a formatting error	2023-07-11 10:43:51 -07:00
rust	…
s390	…
scheduler	…
scsi	…
security	…
sound	…
sphinx	…
sphinx-static	…
spi	…
staging	…
target	…
timers	…
tools	…
trace	…
translations	A half-dozen late arriving docs patches. They are mostly fixes, but we	2023-07-06 22:15:38 -07:00
usb	…
userspace-api	media updates for v6.5-rc1	2023-07-05 10:42:32 -07:00
virt	A half-dozen late arriving docs patches. They are mostly fixes, but we	2023-07-06 22:15:38 -07:00
w1	…
watchdog	…
wmi	platform/x86: dell-ddv: Fix mangled list in documentation	2023-07-11 12:15:30 +02:00
.gitignore	…
Changes	…
CodingStyle	…
Kconfig	…
Makefile	…
SubmittingPatches	…
atomic_bitops.txt	…
atomic_t.txt	…
conf.py	…
docutils.conf	…
dontdiff	…
index.rst	…
memory-barriers.txt	…
subsystem-apis.rst	…