linux-stable/fs
Mike Kravetz 527cabfffb hugetlbfs: fix races and page leaks during migration
commit cb6acd01e2 upstream.

hugetlb pages should only be migrated if they are 'active'.  The
routines set/clear_page_huge_active() modify the active state of hugetlb
pages.

When a new hugetlb page is allocated at fault time, set_page_huge_active
is called before the page is locked.  Therefore, another thread could
race and migrate the page while it is being added to page table by the
fault code.  This race is somewhat hard to trigger, but can be seen by
strategically adding udelay to simulate worst case scheduling behavior.
Depending on 'how' the code races, various BUG()s could be triggered.

To address this issue, simply delay the set_page_huge_active call until
after the page is successfully added to the page table.

Hugetlb pages can also be leaked at migration time if the pages are
associated with a file in an explicitly mounted hugetlbfs filesystem.
For example, consider a two node system with 4GB worth of huge pages
available.  A program mmaps a 2G file in a hugetlbfs filesystem.  It
then migrates the pages associated with the file from one node to
another.  When the program exits, huge page counts are as follows:

  node0
  1024    free_hugepages
  1024    nr_hugepages

  node1
  0       free_hugepages
  1024    nr_hugepages

  Filesystem                         Size  Used Avail Use% Mounted on
  nodev                              4.0G  2.0G  2.0G  50% /var/opt/hugepool

That is as expected.  2G of huge pages are taken from the free_hugepages
counts, and 2G is the size of the file in the explicitly mounted
filesystem.  If the file is then removed, the counts become:

  node0
  1024    free_hugepages
  1024    nr_hugepages

  node1
  1024    free_hugepages
  1024    nr_hugepages

  Filesystem                         Size  Used Avail Use% Mounted on
  nodev                              4.0G  2.0G  2.0G  50% /var/opt/hugepool

Note that the filesystem still shows 2G of pages used, while there
actually are no huge pages in use.  The only way to 'fix' the filesystem
accounting is to unmount the filesystem

If a hugetlb page is associated with an explicitly mounted filesystem,
this information in contained in the page_private field.  At migration
time, this information is not preserved.  To fix, simply transfer
page_private from old to new page at migration time if necessary.

There is a related race with removing a huge page from a file and
migration.  When a huge page is removed from the pagecache, the
page_mapping() field is cleared, yet page_private remains set until the
page is actually freed by free_huge_page().  A page could be migrated
while in this state.  However, since page_mapping() is not set the
hugetlbfs specific routine to transfer page_private is not called and we
leak the page count in the filesystem.

To fix that, check for this condition before migrating a huge page.  If
the condition is detected, return EBUSY for the page.

Link: http://lkml.kernel.org/r/74510272-7319-7372-9ea6-ec914734c179@oracle.com
Link: http://lkml.kernel.org/r/20190212221400.3512-1-mike.kravetz@oracle.com
Fixes: bcc5422230 ("mm: hugetlb: introduce page_huge_active")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: <stable@vger.kernel.org>
[mike.kravetz@oracle.com: v2]
  Link: http://lkml.kernel.org/r/7534d322-d782-8ac6-1c8d-a8dc380eb3ab@oracle.com
[mike.kravetz@oracle.com: update comment and changelog]
  Link: http://lkml.kernel.org/r/420bcfd6-158b-38e4-98da-26d0cd85bd01@oracle.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-05 17:58:53 +01:00
..
9p v9fs_dir_readdir: fix double-free on p9stat_read error 2018-12-01 09:37:27 +01:00
adfs adfs: use timespec64 for time conversion 2018-08-22 10:52:51 -07:00
affs
afs afs: Fix key refcounting in file locking code 2019-02-27 10:08:56 +01:00
autofs Merge branch 'akpm' (patches from Andrew) 2018-08-22 12:34:08 -07:00
befs
bfs bfs: add sanity check at bfs_fill_super() 2018-12-01 09:37:27 +01:00
btrfs btrfs: use tagged writepage to mitigate livelock of snapshot 2019-02-12 19:47:11 +01:00
cachefiles fscache, cachefiles: remove redundant variable 'cache' 2018-12-17 09:24:40 +01:00
ceph ceph: avoid repeatedly adding inode to mdsc->snap_flush_list 2019-02-27 10:08:50 +01:00
cifs CIFS: Do not assume one credit for async responses 2019-02-20 10:25:44 +01:00
coda
configfs
cramfs Cramfs: fix abad comparison when wrap-arounds occur 2018-11-13 11:08:55 -08:00
crypto crypto: speck - remove Speck 2018-11-13 11:08:46 -08:00
debugfs debugfs: fix debugfs_rename parameter checking 2019-02-15 08:10:11 +01:00
devpts devpts: Convert to new IDA API 2018-08-21 23:54:17 -04:00
dlm dlm: Don't swamp the CPU with callbacks queued during recovery 2019-02-12 19:46:58 +01:00
ecryptfs
efivarfs efivars: Call guid_parse() against guid_t type of variable 2018-07-22 14:13:44 +02:00
efs
exofs fs/exofs: fix potential memory leak in mount option parsing 2018-11-27 16:13:00 +01:00
exportfs exportfs: do not read dentry after free 2018-12-17 09:24:35 +01:00
ext2 ext2: fix potential use after free 2018-12-05 19:32:11 +01:00
ext4 Revert "ext4: use ext4_write_inode() when fsyncing w/o a journal" 2019-02-15 08:10:13 +01:00
f2fs f2fs: fix sbi->extent_list corruption issue 2019-02-12 19:47:17 +01:00
fat fs/fat/fatent.c: add cond_resched() to fat_count_free_clusters() 2018-10-13 09:31:03 +02:00
freevxfs
fscache fscache: fix race between enablement and dropping of object 2018-12-17 09:24:40 +01:00
fuse fuse: handle zero sized retrieve correctly 2019-02-12 19:47:24 +01:00
gfs2 gfs2: Revert "Fix loop in gfs2_rbm_find" 2019-02-06 17:30:13 +01:00
hfs hfs: do not free node before using 2018-12-17 09:24:41 +01:00
hfsplus hfsplus: do not free node before using 2018-12-17 09:24:41 +01:00
hostfs vfs: discard ATTR_ATTR_FLAG 2018-08-17 16:20:28 -07:00
hpfs hpfs: remove unnecessary checks on the value of r when assigning error code 2018-08-25 12:42:33 -07:00
hugetlbfs hugetlbfs: fix races and page leaks during migration 2019-03-05 17:58:53 +01:00
isofs isofs: reject hardware sector size > 2048 bytes 2018-08-21 11:37:41 +02:00
jbd2 jbd2: fix use after free in jbd2_log_do_checkpoint() 2018-11-13 11:08:43 -08:00
jffs2 jffs2: Fix use of uninitialized delayed_work, lockdep breakage 2019-01-26 09:32:37 +01:00
jfs Just one jfs patch for 4.19 2018-08-15 22:47:23 -07:00
kernfs Driver core patches for 4.19-rc1 2018-08-18 11:44:53 -07:00
lockd lockd: Show pid of lockd for remote locks 2019-01-13 09:51:08 +01:00
minix
nfs NFS: nfs_compare_mount_options always compare auth flavors. 2019-02-12 19:47:16 +01:00
nfs_common
nfsd Revert "nfsd4: return default lease period" 2019-02-20 10:25:47 +01:00
nilfs2 nilfs2: convert to SPDX license tags 2018-09-04 16:45:02 -07:00
nls
notify inotify: Fix fd refcount leak in inotify_add_watch(). 2019-01-31 08:14:34 +01:00
ntfs ntfs: mft: remove VLA usage 2018-08-17 16:20:27 -07:00
ocfs2 ocfs2: improve ocfs2 Makefile 2019-02-12 19:47:18 +01:00
omfs
openpromfs
orangefs orangefs: remove redundant pointer orangefs_inode 2018-08-14 12:07:14 -04:00
overlayfs ovl: fix missing override creds in link of a metacopy upper 2018-12-19 19:19:51 +01:00
proc proc, oom: do not report alien mms when setting oom_score_adj 2019-02-27 10:08:50 +01:00
pstore pstore/ram: Do not treat empty buffers as valid 2019-01-26 09:32:37 +01:00
qnx4
qnx6
quota quota: Lock s_umount in exclusive mode for Q_XQUOTA{ON,OFF} quotactls. 2019-01-26 09:32:42 +01:00
ramfs
reiserfs reiserfs: propagate errors from fill_with_dentries() properly 2018-11-27 16:12:59 +01:00
romfs
squashfs Squashfs: Compute expected length from inode size rather than block length 2018-08-02 09:34:02 -07:00
sysfs Driver core patches for 4.19-rc1 2018-08-18 11:44:53 -07:00
sysv sysv: return 'err' instead of 0 in __sysv_write_inode 2018-12-17 09:24:30 +01:00
tracefs tracefs: Annotate tracefs_ops with __ro_after_init 2018-07-31 11:32:44 -04:00
ubifs ubifs: Handle re-linking of inodes correctly while recovery 2018-12-29 13:37:55 +01:00
udf udf: Fix BUG on corrupted inode 2019-02-12 19:47:09 +01:00
ufs fs/ufs: use ktime_get_real_seconds for sb and cg timestamps 2018-08-17 16:20:27 -07:00
xfs xfs: eof trim writeback mapping as soon as it is cached 2019-02-12 19:47:23 +01:00
aio.c aio: fix spectre gadget in lookup_ioctx 2018-12-19 19:19:50 +01:00
anon_inodes.c
attr.c
bad_inode.c
binfmt_aout.c
binfmt_elf.c Here are the main MIPS changes for 4.19. 2018-08-13 19:24:32 -07:00
binfmt_elf_fdpic.c
binfmt_em86.c
binfmt_flat.c
binfmt_misc.c
binfmt_script.c Revert "exec: load_script: don't blindly truncate shebang string" 2019-02-15 09:09:54 +01:00
block_dev.c blockdev: Fix livelocks on loop device 2019-01-22 21:40:36 +01:00
buffer.c notifier: Remove notifier header file wherever not used 2018-08-30 12:56:40 +02:00
char_dev.c
compat.c
compat_binfmt_elf.c
compat_ioctl.c media: dvb/audio.h: get rid of unused APIs 2018-07-30 16:21:49 -04:00
coredump.c
d_path.c
dax.c dax: Use non-exclusive wait in wait_entry_unlocked() 2019-01-09 17:38:46 +01:00
dcache.c fs/dcache: Fix incorrect nr_dentry_unused accounting in shrink_dcache_sb() 2019-02-06 17:30:11 +01:00
dcookies.c
direct-io.c direct-io: allow direct writes to empty inodes 2019-03-05 17:58:50 +01:00
drop_caches.c
eventfd.c
eventpoll.c fs/epoll: drop ovflist branch prediction 2019-02-12 19:47:19 +01:00
exec.c Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2018-08-21 13:47:29 -07:00
fcntl.c signal: Don't send signals to tasks that don't exist 2018-08-15 23:03:20 -05:00
fhandle.c
file.c
file_table.c overlayfs update for 4.19 2018-08-21 18:19:09 -07:00
filesystems.c
fs-writeback.c writeback: synchronize sync(2) against cgroup writeback membership switches 2019-03-05 17:58:50 +01:00
fs_pin.c
fs_struct.c
inode.c Revert "mm: don't reclaim inodes with many attached pages" 2019-02-20 10:25:47 +01:00
internal.h overlayfs update for 4.19 2018-08-21 18:19:09 -07:00
ioctl.c vfs: fix FIGETBSZ ioctl on an overlayfs file 2018-11-21 09:19:14 +01:00
iomap.c iomap: don't search past page end in iomap_is_partially_uptodate 2019-01-26 09:32:43 +01:00
Kconfig
Kconfig.binfmt kconfig: move the "Executable file formats" menu to fs/Kconfig.binfmt 2018-08-02 08:06:55 +09:00
libfs.c
locks.c overlayfs update for 4.19 2018-08-21 18:19:09 -07:00
Makefile
mbcache.c
mount.h
mpage.c mpage: mpage_readpages() should submit IO as read-ahead 2018-08-17 16:20:29 -07:00
namei.c Revert "vfs: Allow userns root to call mknod on owned filesystems." 2018-12-29 13:37:54 +01:00
namespace.c mnt: fix __detach_mounts infinite loop 2018-11-21 09:19:22 +01:00
no-block.c
nsfs.c
open.c overlayfs update for 4.19 2018-08-21 18:19:09 -07:00
pipe.c Merge branch 'work.open3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-08-13 19:58:36 -07:00
pnode.c
pnode.h
posix_acl.c
proc_namespace.c
read_write.c vfs: swap names of {do,vfs}_clone_file_range() 2018-09-24 10:54:01 +02:00
readdir.c
select.c
seq_file.c fs/seq_file.c: simplify seq_file iteration code and interface 2018-08-17 16:20:28 -07:00
signalfd.c
splice.c
stack.c
stat.c
statfs.c
super.c Merge branch 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax 2018-08-26 11:48:42 -07:00
sync.c
timerfd.c Merge branch 'work.aio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-08-13 20:56:23 -07:00
userfaultfd.c userfaultfd: clear flag if remap event not enabled 2019-01-26 09:32:43 +01:00
utimes.c
xattr.c sysfs: Do not return POSIX ACL xattrs via listxattr 2018-09-18 07:30:48 -04:00