Commit graph

20 commits

Author SHA1 Message Date
Alex Deucher
5f152b5e69 drm/amdgpu: rename amdgpu_gpu_recover
add device to the name for consistency.

Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-18 10:59:58 -05:00
Andrey Grodzovsky
8854695add drm/amdgpu: Simplify amdgpu_lockup_timeout usage.
With introduction of amdgpu_gpu_recovery we don't need any more
to rely on amdgpu_lockup_timeout == 0 for disabling GPU reset.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-15 17:15:00 -05:00
Andrey Grodzovsky
dcebf026e6 drm/amdgpu: Add gpu_recovery parameter
Add new parameter to control GPU recovery procedure.

v2:
Add auto logic where reset is disabled for bare metal and enabled
for SR-IOV.
Allow forced reset from debugfs.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-15 17:14:50 -05:00
Shaoyun Liu
4fd09a19a6 drm/admgpu: Reduce the usage of soc15ip.h
Remove the header where it's not used.

Acked-by: Christian Konig <christian.koenig@amd.com>
Signed-off-by: Shaoyun Liu <Shaoyun.Liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-08 11:35:19 -05:00
Feifei Xu
fb960bd283 drm/amd/include:cleanup vega10 header files.
Remove asic_reg/vega10 folder.

Signed-off-by: Feifei Xu <Feifei.Xu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-06 12:48:22 -05:00
Feifei Xu
f0a58aa3f2 drm/amd/include:cleanup vega10 nbio header files.
Cleanup asic_reg/vega10/NBIO folder.

Signed-off-by: Feifei Xu <Feifei.Xu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-06 12:48:21 -05:00
Feifei Xu
cde5c34f63 drm/amd/include:cleanup vega10 gc header files.
Cleanup asic_reg/vega10/GC folder.

Signed-off-by: Feifei Xu <Feifei.Xu@amd.com>

Signed-off-by: Feifei Xu <Feifei.Xu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-06 12:48:20 -05:00
Monk Liu
34a4d2bf06 drm/amdgpu:fix random missing of FLR NOTIFY
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:41:32 -05:00
Monk Liu
5740682e66 drm/amdgpu:implement new GPU recover(v3)
1,new imple names amdgpu_gpu_recover which gives more hint
on what it does compared with gpu_reset

2,gpu_recover unify bare-metal and SR-IOV, only the asic reset
part is implemented differently

3,gpu_recover will increase hang job karma and mark its entity/context
as guilty if exceeds limit

V2:

4,in scheduler main routine the job from guilty context  will be immedialy
fake signaled after it poped from queue and its fence be set with
"-ECANCELED" error

5,in scheduler recovery routine all jobs from the guilty entity would be
dropped

6,in run_job() routine the real IB submission would be skipped if @skip parameter
equales true or there was VRAM lost occured.

V3:

7,replace deprecated gpu reset, use new gpu recover

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:41:30 -05:00
pding
b59142384e drm/amdgpu/virt: implement wait_reset callbacks for vi/ai
Reviewed-by: Monk Liu <monk.liu@amd.com>
Signed-off-by: pding <Pixel.Ding@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-04 16:33:14 -05:00
Horace Chen
2dc8f81e4f drm/amdgpu: SR-IOV data exchange between PF&VF
SR-IOV need to exchange some data between PF&VF through shared VRAM

PF will copy some necessary firmware and information to the shared
VRAM. It also requires some information from VF. PF will send a
key through mailbox2 to help guest calculate checksum so that it can
verify whether the data is correct.

So check the data on the specified offset of the shared VRAM, if the
checksum is right, read values from it and write some VF information
next to the data from PF.

Signed-off-by: Horace Chen <horace.chen@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-10-19 15:26:59 -04:00
Gavin Wan
890419409a drm/amdgpu: Support passing amdgpu critical error to host via GPU Mailbox.
This feature works for SRIOV enviroment. For non-SRIOV enviroment, the
trans_error function does nothing.

The error information includes error_code (16bit), error_flags(16bit)
and error_data(64bit). Since there are not many errors, we keep the
errors in an array and transfer all errors to Host before amdgpu
initialization function (amdgpu_device_init) exit.

Signed-off-by: Gavin Wan <Gavin.Wan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-07-14 11:05:52 -04:00
Monk Liu
0c63e11340 drm/amdgpu:only call flr_work under infinite timeout
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-05-24 17:40:39 -04:00
Monk Liu
7225f8736c drm/amdgpu:use job* to replace voluntary
that way we can know which job cause hang and
can do per sched reset/recovery instead of all
sched.

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-05-24 17:40:38 -04:00
Xiangliang Yu
034b6867a4 drm/amdgpu/virt: change AI ack-irq message to debug level
Change message to debug level as VI does.

Signed-off-by: Xiangliang Yu <Xiangliang.Yu@amd.com>
Reviewed-by: Monk Liu <Monk.Liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-05-24 17:40:22 -04:00
Monk Liu
17b2e332a2 drm/amdgpu:need som change on vega10 mailbox
if sriov gpu reset is invoked by job timeout, it is run
in a global work-queue which is very slow and better not call
msleep ortherwise it takes long time to get back CPU.

so make below changes:

1: Change msleep 1 to mdelay 5
2: Ignore the ack fail from pf after time out,
   because VF FLR will clear ack, sometime VF FLR is done
   prior to the beginning of poll_ack so we can ignore this ack

TODO:
Put job_timedout (and the following gpu reset) in a driver thread,
instead of the global work_struct.

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Xiangliang Yu <Xiangliang.Yu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-05-24 17:40:18 -04:00
Monk Liu
3af906f0cf drm/amdgpu:fix cannot receive rcv/ack irq bug
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Xiangliang Yu <Xiangliang.Yu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-05-24 17:40:18 -04:00
Monk Liu
f98b617ed5 drm/amdgpu:implement the reset MB func for vega10
they are lack in the bringup stage, we need them for GPU reset
feature.

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-04-06 13:28:05 -04:00
Monk Liu
94b4fd725b drm/amdgpu:fix typo for mxgpu_ai
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-04-06 13:28:05 -04:00
Xiangliang Yu
c9c9de93a3 drm/amdgpu/virt: impl mailbox for ai
Implement mailbox protocol for AI so that guest vf can communicate
with GPU hypervisor.

Signed-off-by: Xiangliang Yu <Xiangliang.Yu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Monk Liu <Monk.Liu@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-03-29 23:55:05 -04:00