linux-stable/net/smc
Wenjia Zhang b615238e5b net/smc: fix deadlock triggered by cancel_delayed_work_syn()
[ Upstream commit 13085e1b5c ]

The following LOCKDEP was detected:
		Workqueue: events smc_lgr_free_work [smc]
		WARNING: possible circular locking dependency detected
		6.1.0-20221027.rc2.git8.56bc5b569087.300.fc36.s390x+debug #1 Not tainted
		------------------------------------------------------
		kworker/3:0/176251 is trying to acquire lock:
		00000000f1467148 ((wq_completion)smc_tx_wq-00000000#2){+.+.}-{0:0},
			at: __flush_workqueue+0x7a/0x4f0
		but task is already holding lock:
		0000037fffe97dc8 ((work_completion)(&(&lgr->free_work)->work)){+.+.}-{0:0},
			at: process_one_work+0x232/0x730
		which lock already depends on the new lock.
		the existing dependency chain (in reverse order) is:
		-> #4 ((work_completion)(&(&lgr->free_work)->work)){+.+.}-{0:0}:
		       __lock_acquire+0x58e/0xbd8
		       lock_acquire.part.0+0xe2/0x248
		       lock_acquire+0xac/0x1c8
		       __flush_work+0x76/0xf0
		       __cancel_work_timer+0x170/0x220
		       __smc_lgr_terminate.part.0+0x34/0x1c0 [smc]
		       smc_connect_rdma+0x15e/0x418 [smc]
		       __smc_connect+0x234/0x480 [smc]
		       smc_connect+0x1d6/0x230 [smc]
		       __sys_connect+0x90/0xc0
		       __do_sys_socketcall+0x186/0x370
		       __do_syscall+0x1da/0x208
		       system_call+0x82/0xb0
		-> #3 (smc_client_lgr_pending){+.+.}-{3:3}:
		       __lock_acquire+0x58e/0xbd8
		       lock_acquire.part.0+0xe2/0x248
		       lock_acquire+0xac/0x1c8
		       __mutex_lock+0x96/0x8e8
		       mutex_lock_nested+0x32/0x40
		       smc_connect_rdma+0xa4/0x418 [smc]
		       __smc_connect+0x234/0x480 [smc]
		       smc_connect+0x1d6/0x230 [smc]
		       __sys_connect+0x90/0xc0
		       __do_sys_socketcall+0x186/0x370
		       __do_syscall+0x1da/0x208
		       system_call+0x82/0xb0
		-> #2 (sk_lock-AF_SMC){+.+.}-{0:0}:
		       __lock_acquire+0x58e/0xbd8
		       lock_acquire.part.0+0xe2/0x248
		       lock_acquire+0xac/0x1c8
		       lock_sock_nested+0x46/0xa8
		       smc_tx_work+0x34/0x50 [smc]
		       process_one_work+0x30c/0x730
		       worker_thread+0x62/0x420
		       kthread+0x138/0x150
		       __ret_from_fork+0x3c/0x58
		       ret_from_fork+0xa/0x40
		-> #1 ((work_completion)(&(&smc->conn.tx_work)->work)){+.+.}-{0:0}:
		       __lock_acquire+0x58e/0xbd8
		       lock_acquire.part.0+0xe2/0x248
		       lock_acquire+0xac/0x1c8
		       process_one_work+0x2bc/0x730
		       worker_thread+0x62/0x420
		       kthread+0x138/0x150
		       __ret_from_fork+0x3c/0x58
		       ret_from_fork+0xa/0x40
		-> #0 ((wq_completion)smc_tx_wq-00000000#2){+.+.}-{0:0}:
		       check_prev_add+0xd8/0xe88
		       validate_chain+0x70c/0xb20
		       __lock_acquire+0x58e/0xbd8
		       lock_acquire.part.0+0xe2/0x248
		       lock_acquire+0xac/0x1c8
		       __flush_workqueue+0xaa/0x4f0
		       drain_workqueue+0xaa/0x158
		       destroy_workqueue+0x44/0x2d8
		       smc_lgr_free+0x9e/0xf8 [smc]
		       process_one_work+0x30c/0x730
		       worker_thread+0x62/0x420
		       kthread+0x138/0x150
		       __ret_from_fork+0x3c/0x58
		       ret_from_fork+0xa/0x40
		other info that might help us debug this:
		Chain exists of:
		  (wq_completion)smc_tx_wq-00000000#2
	  	  --> smc_client_lgr_pending
		  --> (work_completion)(&(&lgr->free_work)->work)
		 Possible unsafe locking scenario:
		       CPU0                    CPU1
		       ----                    ----
		  lock((work_completion)(&(&lgr->free_work)->work));
		                   lock(smc_client_lgr_pending);
		                   lock((work_completion)
					(&(&lgr->free_work)->work));
		  lock((wq_completion)smc_tx_wq-00000000#2);
		 *** DEADLOCK ***
		2 locks held by kworker/3:0/176251:
		 #0: 0000000080183548
			((wq_completion)events){+.+.}-{0:0},
				at: process_one_work+0x232/0x730
		 #1: 0000037fffe97dc8
			((work_completion)
			 (&(&lgr->free_work)->work)){+.+.}-{0:0},
				at: process_one_work+0x232/0x730
		stack backtrace:
		CPU: 3 PID: 176251 Comm: kworker/3:0 Not tainted
		Hardware name: IBM 8561 T01 701 (z/VM 7.2.0)
		Call Trace:
		 [<000000002983c3e4>] dump_stack_lvl+0xac/0x100
		 [<0000000028b477ae>] check_noncircular+0x13e/0x160
		 [<0000000028b48808>] check_prev_add+0xd8/0xe88
		 [<0000000028b49cc4>] validate_chain+0x70c/0xb20
		 [<0000000028b4bd26>] __lock_acquire+0x58e/0xbd8
		 [<0000000028b4cf6a>] lock_acquire.part.0+0xe2/0x248
		 [<0000000028b4d17c>] lock_acquire+0xac/0x1c8
		 [<0000000028addaaa>] __flush_workqueue+0xaa/0x4f0
		 [<0000000028addf9a>] drain_workqueue+0xaa/0x158
		 [<0000000028ae303c>] destroy_workqueue+0x44/0x2d8
		 [<000003ff8029af26>] smc_lgr_free+0x9e/0xf8 [smc]
		 [<0000000028adf3d4>] process_one_work+0x30c/0x730
		 [<0000000028adf85a>] worker_thread+0x62/0x420
		 [<0000000028aeac50>] kthread+0x138/0x150
		 [<0000000028a63914>] __ret_from_fork+0x3c/0x58
		 [<00000000298503da>] ret_from_fork+0xa/0x40
		INFO: lockdep is turned off.
===================================================================

This deadlock occurs because cancel_delayed_work_sync() waits for
the work(&lgr->free_work) to finish, while the &lgr->free_work
waits for the work(lgr->tx_wq), which needs the sk_lock-AF_SMC, that
is already used under the mutex_lock.

The solution is to use cancel_delayed_work() instead, which kills
off a pending work.

Fixes: a52bcc919b ("net/smc: improve termination processing")
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Jan Karcher <jaka@linux.ibm.com>
Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-22 13:31:26 +01:00
..
af_smc.c net/smc: fix fallback failed while sendmsg with fastopen 2023-03-17 08:48:56 +01:00
Kconfig
Makefile net/smc: Add SMC statistics support 2021-06-16 12:54:02 -07:00
smc.h net/smc: Forward wakeup to smc socket waitqueue after fallback 2022-02-08 18:34:09 +01:00
smc_cdc.c net/smc: fix NULL sndbuf_desc in smc_cdc_tx_handler() 2023-03-22 13:31:25 +01:00
smc_cdc.h net/smc: fix kernel panic caused by race of smc_sock 2022-01-05 12:42:36 +01:00
smc_clc.c net/smc: add missing error check in smc_clc_prfx_set() 2021-09-21 10:54:16 +01:00
smc_clc.h net/smc: Add support for obtaining system information 2020-12-01 17:56:13 -08:00
smc_close.c net/smc: Keep smc_close_final rc during active close 2021-12-08 09:04:50 +01:00
smc_close.h
smc_core.c net/smc: fix deadlock triggered by cancel_delayed_work_syn() 2023-03-22 13:31:26 +01:00
smc_core.h net/smc: Reset conn->lgr when link group registration fails 2022-01-27 11:03:53 +01:00
smc_diag.c net/smc: Introduce SMCR get link command 2020-12-01 17:56:13 -08:00
smc_ib.c net/smc: fix kernel panic caused by race of smc_sock 2022-01-05 12:42:36 +01:00
smc_ib.h net/smc: fix kernel panic caused by race of smc_sock 2022-01-05 12:42:36 +01:00
smc_ism.c net/smc: no need to flush smcd_dev's event_wq before destroying it 2021-06-03 13:54:49 -07:00
smc_ism.h net/smc: Add support for obtaining SMCD device list 2020-12-01 17:56:13 -08:00
smc_llc.c tcp: Fix data-races around keepalive sysctl knobs. 2022-07-29 17:25:17 +02:00
smc_llc.h net/smc: move add link processing for new device into llc layer 2020-07-19 15:30:22 -07:00
smc_netlink.c net/smc: Add netlink support for SMC fallback statistics 2021-06-16 12:54:02 -07:00
smc_netlink.h net/smc: Add netlink support for SMC fallback statistics 2021-06-16 12:54:02 -07:00
smc_netns.h net/smc: introduce list of pnetids for Ethernet devices 2020-09-28 15:19:03 -07:00
smc_pnet.c net/smc: Fix NULL pointer dereference in smc_pnet_find_ib() 2022-04-20 09:34:11 +02:00
smc_pnet.h net/smc: Use a mutex for locking "struct smc_pnettable" 2022-03-02 11:47:59 +01:00
smc_rx.c net/smc: non blocking recvmsg() return -EAGAIN when no data and signal_pending 2022-05-18 10:26:51 +02:00
smc_rx.h
smc_stats.c net/smc: Fix ENODATA tests in smc_nl_get_fback_stats() 2021-06-21 12:16:58 -07:00
smc_stats.h net/smc: Make SMC statistics network namespace aware 2021-06-16 12:54:02 -07:00
smc_tx.c net/smc: Send directly when TCP_CORK is cleared 2022-04-13 20:59:02 +02:00
smc_tx.h net/smc: Send directly when TCP_CORK is cleared 2022-04-13 20:59:02 +02:00
smc_wr.c net/smc: fix kernel panic caused by race of smc_sock 2022-01-05 12:42:36 +01:00
smc_wr.h net/smc: fix kernel panic caused by race of smc_sock 2022-01-05 12:42:36 +01:00