linux-stable/fs
Daniel Colascione 493b0e9d94 mm: add /proc/pid/smaps_rollup
/proc/pid/smaps_rollup is a new proc file that improves the performance
of user programs that determine aggregate memory statistics (e.g., total
PSS) of a process.

Android regularly "samples" the memory usage of various processes in
order to balance its memory pool sizes.  This sampling process involves
opening /proc/pid/smaps and summing certain fields.  For very large
processes, sampling memory use this way can take several hundred
milliseconds, due mostly to the overhead of the seq_printf calls in
task_mmu.c.

smaps_rollup improves the situation.  It contains most of the fields of
/proc/pid/smaps, but instead of a set of fields for each VMA,
smaps_rollup instead contains one synthetic smaps-format entry
representing the whole process.  In the single smaps_rollup synthetic
entry, each field is the summation of the corresponding field in all of
the real-smaps VMAs.  Using a common format for smaps_rollup and smaps
allows userspace parsers to repurpose parsers meant for use with
non-rollup smaps for smaps_rollup, and it allows userspace to switch
between smaps_rollup and smaps at runtime (say, based on the
availability of smaps_rollup in a given kernel) with minimal fuss.

By using smaps_rollup instead of smaps, a caller can avoid the
significant overhead of formatting, reading, and parsing each of a large
process's potentially very numerous memory mappings.  For sampling
system_server's PSS in Android, we measured a 12x speedup, representing
a savings of several hundred milliseconds.

One alternative to a new per-process proc file would have been including
PSS information in /proc/pid/status.  We considered this option but
thought that PSS would be too expensive (by a few orders of magnitude)
to collect relative to what's already emitted as part of
/proc/pid/status, and slowing every user of /proc/pid/status for the
sake of readers that happen to want PSS feels wrong.

The code itself works by reusing the existing VMA-walking framework we
use for regular smaps generation and keeping the mem_size_stats
structure around between VMA walks instead of using a fresh one for each
VMA.  In this way, summation happens automatically.  We let seq_file
walk over the VMAs just as it does for regular smaps and just emit
nothing to the seq_file until we hit the last VMA.

Benchmarks:

    using smaps:
    iterations:1000 pid:1163 pss:220023808
    0m29.46s real 0m08.28s user 0m20.98s system

    using smaps_rollup:
    iterations:1000 pid:1163 pss:220702720
    0m04.39s real 0m00.03s user 0m04.31s system

We're using the PSS samples we collect asynchronously for
system-management tasks like fine-tuning oom_adj_score, memory use
tracking for debugging, application-level memory-use attribution, and
deciding whether we want to kill large processes during system idle
maintenance windows.  Android has been using PSS for these purposes for
a long time; as the average process VMA count has increased and and
devices become more efficiency-conscious, PSS-collection inefficiency
has started to matter more.  IMHO, it'd be a lot safer to optimize the
existing PSS-collection model, which has been fine-tuned over the years,
instead of changing the memory tracking approach entirely to work around
smaps-generation inefficiency.

Tim said:

: There are two main reasons why Android gathers PSS information:
:
: 1. Android devices can show the user the amount of memory used per
:    application via the settings app.  This is a less important use case.
:
: 2. We log PSS to help identify leaks in applications.  We have found
:    an enormous number of bugs (in the Android platform, in Google's own
:    apps, and in third-party applications) using this data.
:
: To do this, system_server (the main process in Android userspace) will
: sample the PSS of a process three seconds after it changes state (for
: example, app is launched and becomes the foreground application) and about
: every ten minutes after that.  The net result is that PSS collection is
: regularly running on at least one process in the system (usually a few
: times a minute while the screen is on, less when screen is off due to
: suspend).  PSS of a process is an incredibly useful stat to track, and we
: aren't going to get rid of it.  We've looked at some very hacky approaches
: using RSS ("take the RSS of the target process, subtract the RSS of the
: zygote process that is the parent of all Android apps") to reduce the
: accounting time, but it regularly overestimated the memory used by 20+
: percent.  Accordingly, I don't think that there's a good alternative to
: using PSS.
:
: We started looking into PSS collection performance after we noticed random
: frequency spikes while a phone's screen was off; occasionally, one of the
: CPU clusters would ramp to a high frequency because there was 200-300ms of
: constant CPU work from a single thread in the main Android userspace
: process.  The work causing the spike (which is reasonable governor
: behavior given the amount of CPU time needed) was always PSS collection.
: As a result, Android is burning more power than we should be on PSS
: collection.
:
: The other issue (and why I'm less sure about improving smaps as a
: long-term solution) is that the number of VMAs per process has increased
: significantly from release to release.  After trying to figure out why we
: were seeing these 200-300ms PSS collection times on Android O but had not
: noticed it in previous versions, we found that the number of VMAs in the
: main system process increased by 50% from Android N to Android O (from
: ~1800 to ~2700) and varying increases in every userspace process.  Android
: M to N also had an increase in the number of VMAs, although not as much.
: I'm not sure why this is increasing so much over time, but thinking about
: ASLR and ways to make ASLR better, I expect that this will continue to
: increase going forward.  I would not be surprised if we hit 5000 VMAs on
: the main Android process (system_server) by 2020.
:
: If we assume that the number of VMAs is going to increase over time, then
: doing anything we can do to reduce the overhead of each VMA during PSS
: collection seems like the right way to go, and that means outputting an
: aggregate statistic (to avoid whatever overhead there is per line in
: writing smaps and in reading each line from userspace).

Link: http://lkml.kernel.org/r/20170812022148.178293-1-dancol@google.com
Signed-off-by: Daniel Colascione <dancol@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sonny Rao <sonnyrao@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:30 -07:00
..
9p fscache: remove unused ->now_uncached callback 2017-09-06 17:27:26 -07:00
adfs
affs
afs fscache: remove unused ->now_uncached callback 2017-09-06 17:27:26 -07:00
autofs4
befs Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
bfs
btrfs Btrfs: fix blk_status_t/errno confusion 2017-08-24 17:19:02 +02:00
cachefiles
ceph fscache: remove unused ->now_uncached callback 2017-09-06 17:27:26 -07:00
cifs fscache: remove unused ->now_uncached callback 2017-09-06 17:27:26 -07:00
coda
configfs
cramfs
crypto
debugfs Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
devpts pty: Repair TIOCGPTPEER 2017-08-24 13:23:03 -07:00
dlm
ecryptfs
efivarfs
efs
exofs
exportfs
ext2 dax: use common 4k zero page for dax mmap reads 2017-09-06 17:27:24 -07:00
ext4 mm: remove nr_pages argument from pagevec_lookup{,_range}() 2017-09-06 17:27:27 -07:00
f2fs f2fs: avoid cpu lockup 2017-07-17 19:23:18 -07:00
fat
freevxfs
fscache mm: remove nr_pages argument from pagevec_lookup{,_range}() 2017-09-06 17:27:27 -07:00
fuse Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse 2017-08-11 11:20:48 -07:00
gfs2 Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
hfs
hfsplus hfsplus: Don't clear SGID when inheriting ACLs 2017-07-18 18:23:39 +02:00
hostfs
hpfs
hugetlbfs mm: remove nr_pages argument from pagevec_lookup{,_range}() 2017-09-06 17:27:27 -07:00
isofs isofs: Fix off-by-one in 'session' mount option parsing 2017-07-18 12:33:16 +02:00
jbd2
jffs2
jfs jfs should use MAX_LFS_FILESIZE when calculating s_maxbytes 2017-08-31 17:02:21 -07:00
kernfs kernfs: Clarify lockdep name for kn->count 2017-08-28 16:50:15 +02:00
lockd
minix
ncpfs
nfs fscache: remove unused ->now_uncached callback 2017-09-06 17:27:26 -07:00
nfs_common
nfsd annotate RWF_... flags 2017-08-31 17:32:38 -04:00
nilfs2 mm: remove nr_pages argument from pagevec_lookup{,_range}() 2017-09-06 17:27:27 -07:00
nls
notify
ntfs
ocfs2 ocfs2: clean up some dead code 2017-09-06 17:27:24 -07:00
omfs
openpromfs
orangefs Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
overlayfs overlayfs, locking: Remove smp_mb__before_spinlock() usage 2017-08-10 12:29:02 +02:00
proc mm: add /proc/pid/smaps_rollup 2017-09-06 17:27:30 -07:00
pstore Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
qnx4
qnx6
quota quota: correct space limit check 2017-08-07 16:51:28 +02:00
ramfs mm: make pagevec_lookup() update index 2017-09-06 17:27:26 -07:00
reiserfs reiserfs: preserve i_mode if __reiserfs_set_acl() fails 2017-07-18 11:24:08 +02:00
romfs
squashfs
sysfs
sysv
tracefs
ubifs ubifs: Set double hash cookie also for RENAME_EXCHANGE 2017-07-14 22:50:57 +02:00
udf
ufs
xfs dax: use common 4k zero page for dax mmap reads 2017-09-06 17:27:24 -07:00
aio.c
anon_inodes.c
attr.c
bad_inode.c
binfmt_aout.c
binfmt_elf.c x86/elf: Remove the unnecessary ADDR_NO_RANDOMIZE checks 2017-08-16 20:32:02 +02:00
binfmt_elf_fdpic.c
binfmt_em86.c
binfmt_flat.c binfmt_flat: Use %u to format u32 2017-07-16 09:24:05 -07:00
binfmt_misc.c
binfmt_script.c
block_dev.c
buffer.c mm: remove nr_pages argument from pagevec_lookup{,_range}() 2017-09-06 17:27:27 -07:00
char_dev.c char_dev: order /proc/devices by major number 2017-07-17 15:28:50 +02:00
compat.c
compat_binfmt_elf.c
compat_ioctl.c
coredump.c
dax.c dax: initialize variable pfn before using it 2017-09-06 17:27:24 -07:00
dcache.c Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
dcookies.c
direct-io.c
drop_caches.c
eventfd.c
eventpoll.c epoll: fix race between ep_poll_callback(POLLFREE) and ep_free()/ep_remove() 2017-09-01 13:07:35 -07:00
exec.c
fcntl.c
fhandle.c
file.c
file_table.c
filesystems.c Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
fs-writeback.c
fs_pin.c
fs_struct.c
inode.c
internal.h
ioctl.c
iomap.c iomap: fix integer truncation issues in the zeroing and dirtying helpers 2017-08-11 16:56:33 -07:00
Kconfig
Kconfig.binfmt
libfs.c
locks.c
Makefile
mbcache.c
mount.h Now that IPC and other changes have landed, enable manual markings for 2017-07-19 08:55:18 -07:00
mpage.c
namei.c Now that IPC and other changes have landed, enable manual markings for 2017-07-19 08:55:18 -07:00
namespace.c Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
no-block.c
nsfs.c
open.c
pipe.c
pnode.c
pnode.h
posix_acl.c
proc_namespace.c
read_write.c annotate RWF_... flags 2017-08-31 17:32:38 -04:00
readdir.c
select.c fs/select: Fix memory corruption in compat_get_fd_set() 2017-08-28 16:09:19 -07:00
seq_file.c
signalfd.c
splice.c
stack.c
stat.c
statfs.c
super.c
sync.c fs/sync.c: remove unnecessary NULL f_mapping check in sync_file_range 2017-09-06 17:27:28 -07:00
timerfd.c
userfaultfd.c userfaultfd: provide pid in userfault msg - add feat union 2017-09-06 17:27:29 -07:00
utimes.c
xattr.c