linux-stable/fs/btrfs
Filipe Manana de554a4fa6 Btrfs: fix scrub race leading to use-after-free
While running a scrub on a kernel with CONFIG_DEBUG_PAGEALLOC=y, I got
the following trace:

[68127.807663] BUG: unable to handle kernel paging request at ffff8803f8947a50
[68127.807663] IP: [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
[68127.807663] PGD 3003067 PUD 43e1f5067 PMD 43e030067 PTE 80000003f8947060
[68127.807663] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[68127.807663] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc processor parpo
[68127.807663] CPU: 2 PID: 3081 Comm: kworker/u8:5 Not tainted 3.18.0-rc6-btrfs-next-3+ #4
[68127.807663] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[68127.807663] Workqueue: btrfs-btrfs-scrub btrfs_scrub_helper [btrfs]
[68127.807663] task: ffff880101fc5250 ti: ffff8803f097c000 task.ti: ffff8803f097c000
[68127.807663] RIP: 0010:[<ffffffff8107da31>]  [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
[68127.807663] RSP: 0018:ffff8803f097fbb8  EFLAGS: 00010093
[68127.807663] RAX: 0000000028dd386c RBX: ffff8803f8947a50 RCX: 0000000028dd3854
[68127.807663] RDX: 0000000000000018 RSI: 0000000000000002 RDI: 0000000000000001
[68127.807663] RBP: ffff8803f097fbd8 R08: 0000000000000004 R09: 0000000000000001
[68127.807663] R10: ffff880102620980 R11: ffff8801f3e8c900 R12: 000000000001d390
[68127.807663] R13: 00000000cabd13c8 R14: ffff8803f8947800 R15: ffff88037c574f00
[68127.807663] FS:  0000000000000000(0000) GS:ffff88043dd00000(0000) knlGS:0000000000000000
[68127.807663] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[68127.807663] CR2: ffff8803f8947a50 CR3: 00000000b6481000 CR4: 00000000000006e0
[68127.807663] Stack:
[68127.807663]  ffffffff823942a8 ffff8803f8947a50 ffff8802a3416f80 0000000000000000
[68127.807663]  ffff8803f097fc18 ffffffff8141e7c0 ffffffff81072948 000000000034f314
[68127.807663]  ffff8803f097fc08 0000000000000292 ffff8803f097fc48 ffff8803f8947a50
[68127.807663] Call Trace:
[68127.807663]  [<ffffffff8141e7c0>] _raw_spin_lock_irqsave+0x4b/0x55
[68127.807663]  [<ffffffff81072948>] ? __wake_up+0x22/0x4b
[68127.807663]  [<ffffffff81072948>] __wake_up+0x22/0x4b
[68127.807663]  [<ffffffffa0392327>] scrub_pending_bio_dec+0x32/0x36 [btrfs]
[68127.807663]  [<ffffffffa0395e70>] scrub_bio_end_io_worker+0x5a3/0x5c9 [btrfs]
[68127.807663]  [<ffffffff810e0c7c>] ? time_hardirqs_off+0x15/0x28
[68127.807663]  [<ffffffff81078106>] ? trace_hardirqs_off_caller+0x4c/0xb9
[68127.807663]  [<ffffffffa0372a7c>] normal_work_helper+0xf1/0x238 [btrfs]
[68127.807663]  [<ffffffffa0372d3d>] btrfs_scrub_helper+0x12/0x14 [btrfs]
[68127.807663]  [<ffffffff810582d2>] process_one_work+0x1e4/0x3b6
[68127.807663]  [<ffffffff81078180>] ? trace_hardirqs_off+0xd/0xf
[68127.807663]  [<ffffffff81058dc9>] worker_thread+0x1fb/0x2a8
[68127.807663]  [<ffffffff81058bce>] ? rescuer_thread+0x219/0x219
[68127.807663]  [<ffffffff8105cd75>] kthread+0xdb/0xe3
[68127.807663]  [<ffffffff8105cc9a>] ? __kthread_parkme+0x67/0x67
[68127.807663]  [<ffffffff8141f1ec>] ret_from_fork+0x7c/0xb0
[68127.807663]  [<ffffffff8105cc9a>] ? __kthread_parkme+0x67/0x67
[68127.807663] Code: 39 c2 75 14 8d 8a 00 00 01 00 89 d0 f0 0f b1 0b 39 d0 0f 84 81 00 00 00 4c 69 2d 27 86 99 00 fa 00 00 00 45 31 e4 4d 39 ec 74 2b <8b> 13 89 d0 c1 e8 10 66 39 c2 75
[68127.807663] RIP  [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
[68127.807663]  RSP <ffff8803f097fbb8>
[68127.807663] CR2: ffff8803f8947a50
[68127.807663] ---[ end trace d7045aac00a66cd8 ]---

This is due to a race that can happen in a very tiny time window and is
illustrated by the following sequence diagram:

         CPU 1                                                     CPU 2

                                                                btrfs_scrub_dev()
scrub_bio_end_io_worker()
   scrub_pending_bio_dec()
       atomic_dec(&sctx->bios_in_flight)
                                                                   wait sctx->bios_in_flight == 0
                                                                   wait sctx->workers_pending == 0
                                                                   mutex_lock(&fs_info->scrub_lock)
                                                                   (...)
                                                                   mutex_unlock(&fs_info->scrub_lock)
                                                                   scrub_free_ctx(sctx)
                                                                      kfree(sctx)
       wake_up(&sctx->list_wait)
          __wake_up()
              spin_lock_irqsave(&sctx->list_wait.lock, flags)

Another variation of this scenario that results in the same use-after-free
issue is:

         CPU 1                                                     CPU 2

                                                                btrfs_scrub_dev()
                                                                   wait sctx->bios_in_flight == 0
scrub_bio_end_io_worker()
   scrub_pending_bio_dec()
       __wake_up(&sctx->list_wait)
          spin_lock_irqsave(&sctx->list_wait.lock, flags)
          default_wake_function()
              wake up task at CPU 2
                                                                   wait sctx->workers_pending == 0
                                                                   mutex_lock(&fs_info->scrub_lock)
                                                                   (...)
                                                                   mutex_unlock(&fs_info->scrub_lock)
                                                                   scrub_free_ctx(sctx)
                                                                      kfree(sctx)
          spin_unlock_irqrestore(&sctx->list_wait.lock, flags)

Fix this by holding the scrub lock while doing the wakeup.
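
The locking pattern can be sketched in userspace with pthreads. This is only
an analogue under stated assumptions, not the kernel patch itself: the names
scrub_lock, struct ctx, pending_dec(), worker() and run_scrub() are
hypothetical stand-ins for fs_info->scrub_lock, struct scrub_ctx,
scrub_pending_bio_dec(), the bio end_io worker and the tail of
btrfs_scrub_dev(). The point it illustrates is that the waker keeps an outer,
longer-lived lock held across both the decrement and the wakeup, so the
waiter cannot get past that lock and free the context while the waker is
still touching it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for fs_info->scrub_lock: it outlives every context. */
static pthread_mutex_t scrub_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical stand-in for struct scrub_ctx. */
struct ctx {
	pthread_mutex_t wait_lock;  /* plays the role of list_wait.lock */
	pthread_cond_t  wait_cond;  /* plays the role of list_wait */
	int             in_flight;  /* plays the role of bios_in_flight */
};

/* Analogue of scrub_pending_bio_dec() with the outer lock held. */
static void pending_dec(struct ctx *c)
{
	pthread_mutex_lock(&scrub_lock);
	pthread_mutex_lock(&c->wait_lock);
	c->in_flight--;
	pthread_cond_broadcast(&c->wait_cond);
	pthread_mutex_unlock(&c->wait_lock);   /* last access to *c */
	pthread_mutex_unlock(&scrub_lock);
}

static void *worker(void *arg)
{
	pending_dec(arg);
	return NULL;
}

/* Analogue of the tail of btrfs_scrub_dev(): wait, then free the context. */
static int run_scrub(int nworkers)
{
	struct ctx *c;
	pthread_t *tids;
	int i;

	assert(nworkers > 0);
	c = malloc(sizeof(*c));
	tids = malloc(sizeof(*tids) * (size_t)nworkers);
	assert(c && tids);

	pthread_mutex_init(&c->wait_lock, NULL);
	pthread_cond_init(&c->wait_cond, NULL);
	c->in_flight = nworkers;

	for (i = 0; i < nworkers; i++) {
		int rc = pthread_create(&tids[i], NULL, worker, c);
		assert(rc == 0);
	}

	/* Wait until every worker has decremented the counter. */
	pthread_mutex_lock(&c->wait_lock);
	while (c->in_flight > 0)
		pthread_cond_wait(&c->wait_cond, &c->wait_lock);
	pthread_mutex_unlock(&c->wait_lock);

	/* A worker still inside pending_dec() holds scrub_lock, so taking
	 * it here guarantees that no worker touches *c any more. */
	pthread_mutex_lock(&scrub_lock);
	pthread_mutex_unlock(&scrub_lock);

	pthread_cond_destroy(&c->wait_cond);
	pthread_mutex_destroy(&c->wait_lock);
	free(c);

	for (i = 0; i < nworkers; i++)
		pthread_join(tids[i], NULL);
	free(tids);
	return 0;
}
```

Calling run_scrub(4) from a main() and building with -pthread exercises the
pattern. Without the scrub_lock acquisition in pending_dec(), the free(c)
could run between a worker's decrement and its wakeup, which is the window
shown in the diagrams above.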

This isn't a recent regression; the issue has been around since the scrub
feature was added (2011, commit a2de733c78).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:24:50 -08:00
tests btrfs: switch extent_state state to unsigned 2015-01-21 18:02:04 -08:00
acl.c btrfs: remove useless ACL check 2014-06-09 17:20:42 -07:00
async-thread.c btrfs: remove unlikely from NULL checks 2014-10-02 16:06:19 +02:00
async-thread.h Btrfs: implement repair function when direct read fails 2014-09-17 13:39:01 -07:00
backref.c btrfs: cleanup, remove inode_ref_info helper 2015-01-14 19:23:47 +01:00
backref.h btrfs: cleanup, remove inode_item_info helper 2015-01-14 19:23:47 +01:00
btrfs_inode.h Btrfs: Add code to support file creation time 2015-02-02 18:39:16 -08:00
check-integrity.c Btrfs: include vmalloc.h in check-integrity.c 2014-11-25 06:01:11 -08:00
check-integrity.h block: submit_bio_wait() conversions 2013-11-24 16:33:41 -07:00
compression.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs 2014-12-12 11:15:23 -08:00
compression.h btrfs: zero out left over bytes after processing compression streams 2014-11-30 09:33:51 -08:00
ctree.c Btrfs: insert_new_root: Fix lock type of the extent buffer. 2015-01-22 05:42:23 -08:00
ctree.h Btrfs: fix race between transaction commit and empty block group removal 2015-02-02 19:24:48 -08:00
delayed-inode.c Btrfs: Add code to support file creation time 2015-02-02 18:39:16 -08:00
delayed-inode.h Btrfs: introduce the delayed inode ref deletion for the single link inode 2014-01-28 13:20:09 -08:00
delayed-ref.c Btrfs: rework qgroup accounting 2014-06-09 17:20:48 -07:00
delayed-ref.h Btrfs: rework qgroup accounting 2014-06-09 17:20:48 -07:00
dev-replace.c Btrfs: btrfs_rm_dev_replace_blocked(): Use wait_event() 2015-01-21 18:06:48 -08:00
dev-replace.h
dir-item.c Btrfs: make xattr replace operations atomic 2014-11-20 17:20:07 -08:00
disk-io.c Btrfs: fix race between transaction commit and empty block group removal 2015-02-02 19:24:48 -08:00
disk-io.h btrfs: sink blocksize parameter to btrfs_find_create_tree_block 2014-12-12 18:07:21 +01:00
export.c btrfs: kill the key type accessor helpers 2014-09-17 13:37:12 -07:00
export.h
extent-tree.c Btrfs: fix race between transaction commit and empty block group removal 2015-02-02 19:24:48 -08:00
extent_io.c Btrfs: add ref_count and free function for btrfs_bio 2015-01-21 18:06:48 -08:00
extent_io.h btrfs: switch extent_state state to unsigned 2015-01-21 18:02:04 -08:00
extent_map.c Btrfs: do not move em to modified list when unpinning 2014-11-21 11:59:54 -08:00
extent_map.h Btrfs: fix NULL pointer crash when running balance and scrub concurrently 2014-06-19 14:20:55 -07:00
file-item.c Btrfs: fix kfree on list_head in btrfs_lookup_csums_range error cleanup 2014-11-04 06:59:04 -08:00
file.c Btrfs: fix snapshot inconsistency after a file write followed by truncate 2014-11-25 07:41:23 -08:00
free-space-cache.c Btrfs: track dirty block groups on their own list 2015-01-21 17:36:52 -08:00
free-space-cache.h Btrfs: fix race between writing free space cache and trimming 2014-12-02 18:35:09 -08:00
hash.c btrfs: LLVMLinux: Remove VLAIS 2014-10-14 10:51:22 +02:00
hash.h Btrfs: fix btrfs boot when compiled as built-in 2014-01-28 13:20:31 -08:00
inode-item.c Btrfs: fix fsync log replay for inodes with a mix of regular refs and extrefs 2015-01-21 18:02:05 -08:00
inode-map.c Btrfs: fix race between writing free space cache and trimming 2014-12-02 18:35:09 -08:00
inode-map.h
inode.c Btrfs: Add code to support file creation time 2015-02-02 18:39:16 -08:00
ioctl.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs 2014-12-12 11:15:23 -08:00
Kconfig Btrfs: fix btrfs boot when compiled as built-in 2014-01-28 13:20:31 -08:00
locking.c btrfs: fix lockups from btrfs_clear_path_blocking 2014-11-19 10:34:35 -08:00
locking.h btrfs: fix lockups from btrfs_clear_path_blocking 2014-11-19 10:34:35 -08:00
lzo.c btrfs: zero out left over bytes after processing compression streams 2014-11-30 09:33:51 -08:00
Makefile Btrfs: add sanity tests for new qgroup accounting code 2014-06-09 17:20:49 -07:00
math.h
ordered-data.c Btrfs: collect only the necessary ordered extents on ranged fsync 2014-11-21 11:59:56 -08:00
ordered-data.h Btrfs: collect only the necessary ordered extents on ranged fsync 2014-11-21 11:59:56 -08:00
orphan.c btrfs: kill the key type accessor helpers 2014-09-17 13:37:12 -07:00
print-tree.c btrfs: remove parameter blocksize from read_tree_block 2014-10-02 17:14:50 +02:00
print-tree.h
props.c Btrfs: add support for inode properties 2014-01-28 13:20:24 -08:00
props.h Btrfs: add support for inode properties 2014-01-28 13:20:24 -08:00
qgroup.c btrfs: qgroup: move WARN_ON() to the correct location. 2015-01-21 18:22:37 -08:00
qgroup.h btrfs: qgroup: account shared subtrees during snapshot delete 2014-08-15 07:43:14 -07:00
raid56.c Btrfs: Include map_type in raid_bio 2015-01-21 18:06:49 -08:00
raid56.h Btrfs: Make raid_map array be inlined in btrfs_bio structure 2015-01-21 18:06:47 -08:00
rcu-string.h
reada.c Btrfs: add ref_count and free function for btrfs_bio 2015-01-21 18:06:48 -08:00
relocation.c btrfs: sink blocksize parameter to tree_block_processed 2014-12-12 18:07:22 +01:00
root-tree.c Btrfs: use bitfield instead of integer data type for the some variants in btrfs_root 2014-06-09 17:20:40 -07:00
scrub.c Btrfs: fix scrub race leading to use-after-free 2015-02-02 19:24:50 -08:00
send.c btrfs: kill btrfs_inode_*time helpers 2015-02-02 18:39:07 -08:00
send.h
struct-funcs.c
super.c btrfs: remove a no-op unfreeze superbock callback 2015-01-21 18:02:04 -08:00
sysfs.c Btrfs: add missing cleanup on sysfs init failure 2015-02-02 19:24:49 -08:00
sysfs.h btrfs: code optimize: BTRFS_ATTR_RW could set the mode 2014-09-17 13:37:59 -07:00
transaction.c Btrfs: track dirty block groups on their own list 2015-01-21 17:36:52 -08:00
transaction.h Btrfs: track dirty block groups on their own list 2015-01-21 17:36:52 -08:00
tree-defrag.c Btrfs: use bitfield instead of integer data type for the some variants in btrfs_root 2014-06-09 17:20:40 -07:00
tree-log.c btrfs: kill btrfs_inode_*time helpers 2015-02-02 18:39:07 -08:00
tree-log.h Btrfs: fix data corruption after fast fsync and writeback error 2014-09-19 06:57:51 -07:00
ulist.c Btrfs: do not export ulist functions 2014-01-29 07:06:27 -08:00
ulist.h Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch 2014-08-15 07:43:19 -07:00
uuid-tree.c Btrfs: make btrfs_search_forward return with nodes unlocked 2014-09-17 13:38:02 -07:00
volumes.c btrfs: add more checks to btrfs_read_sys_array 2015-02-02 19:24:39 -08:00
volumes.h Btrfs: Include map_type in raid_bio 2015-01-21 18:06:49 -08:00
xattr.c Btrfs: make xattr replace operations atomic 2014-11-20 17:20:07 -08:00
xattr.h btrfs: use generic posix ACL infrastructure 2014-01-25 23:58:18 -05:00
zlib.c btrfs: zero out left over bytes after processing compression streams 2014-11-30 09:33:51 -08:00