Commit Graph

1003 Commits

Author SHA1 Message Date
Nirmoy Das 575e55ee4f drm/amdgpu: recover gart table at resume
Get rid off pin/unpin of gart BO at resume/suspend and
instead pin only once and try to recover gart content
at resume time. This is much more stable in case there
is OOM situation at 2nd call to amdgpu_device_evict_resources()
while evicting GART table.

v3: remove gart recovery from other places
v2: pin gart at amdgpu_gart_table_vram_alloc()

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-11 15:44:26 -05:00
Nirmoy Das 1dd8b1b987 drm/amdgpu: do not pass ttm_resource_manager to gtt_mgr
Do not allow exported amdgpu_gtt_mgr_*() to accept
any ttm_resource_manager pointer. Also there is no need
to force other module to call a ttm function just to
eventually call gtt_mgr functions.

v4: remove unused adev.
v3: upcast mgr from ttm resopurce manager instead of
getting it from adev.
v2: pass adev's gtt_mgr instead of adev.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-11 15:44:26 -05:00
Leslie Shi 62d5f9f711 drm/amdgpu: Unmap MMIO mappings when device is not unplugged
Patch: 3efb17ae7e92 ("drm/amdgpu: Call amdgpu_device_unmap_mmio() if device
is unplugged to prevent crash in GPU initialization failure") makes call to
amdgpu_device_unmap_mmio() conditioned on device unplugged. This patch unmaps
MMIO mappings even when device is not unplugged.

v2: Add condition of drm_dev_enter() to deleted unmaps in patch
"drm/amdgpu: Unmap all MMIO mappings"

Signed-off-by: Leslie Shi <Yuliang.Shi@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-11 15:44:26 -05:00
Mario Limonciello e53d9665ab drm/amdgpu: explicitly check for s0ix when evicting resources
This codepath should be running in both s0ix and s3, but only does
currently because s3 and s0ix are both set in the s0ix case.

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Acked-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-07 17:20:38 -05:00
Dave Airlie cb6846fbb8 Merge tag 'amd-drm-next-5.17-2021-12-30' of ssh://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-5.17-2021-12-30:

amdgpu:
- Suspend/resume fixes
- Fence fix
- Misc code cleanups
- IP discovery fixes
- SRIOV fixes
- RAS fixes
- GMC 8 VRAM detection fix
- FRU fixes for Aldebaran
- Display fixes

amdkfd:
- SVM fixes
- IP discovery fixes

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211230141032.613596-1-alexander.deucher@amd.com
2021-12-31 10:59:17 +10:00
Alex Deucher 0637d41786 drm/amdgpu: no DC support for headless chips
Chips with no display hardware should return false for
DC support.

v2: drop Arcturus and Aldebaran

Fixes: f7f12b2582 ("drm/amdgpu: default to true in amdgpu_device_asic_has_dc_support")
Reviewed-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Reported-by: Tareque Md.Hanif <tarequemd.hanif@yahoo.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-30 08:54:45 -05:00
Surbhi Kakarya b6fd6e0f5e drm/amdgpu: Check the memory can be accesssed by ttm_device_clear_dma_mappings.
If the event guard is enabled and VF doesn't receive an ack from PF for full access,
the guest driver load crashes.
This is caused due to the call to ttm_device_clear_dma_mappings with non-initialized
mman during driver tear down.

This patch adds the necessary condition to check if the mman initialization passed or not
and takes the path based on the condition output.

Signed-off-by: Surbhi Kakarya <Surbhi.Kakarya@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-30 08:54:43 -05:00
Leslie Shi 87172e89dc drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure
[Why]
In amdgpu_driver_load_kms, when amdgpu_device_init returns error during driver modprobe, it
will start the error handle path immediately and call into amdgpu_device_unmap_mmio as well
to release mapped VRAM. However, in the following release callback, driver stills visits the
unmapped memory like vcn.inst[i].fw_shared_cpu_addr in vcn_v3_0_sw_fini. So a kernel crash occurs.

[How]
call amdgpu_device_unmap_mmio() if device is unplugged to prevent invalid memory address in
vcn_v3_0_sw_fini() when GPU initialization failure.

Signed-off-by: Leslie Shi <Yuliang.Shi@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-30 08:54:24 -05:00
sashank saye 4da8b63944 drm/amdgpu: Send Message to SMU on aldebaran passthrough for sbr handling
For Aldebaran chip passthrough case we need to intimate SMU
about special handling for SBR.On older chips we send
LightSBR to SMU, enabling the same for Aldebaran. Slight
difference, compared to previous chips, is on Aldebaran, SMU
would do a heavy reset on SBR. Hence, the word Heavy
instead of Light SBR is used for SMU to differentiate.

Reviewed by: Shaoyun.liu <Shaoyun.liu@amd.com>
Signed-off-by: sashank saye <sashank.saye@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-28 16:03:19 -05:00
Victor Skvortsov 4a0165f060 drm/amdgpu: get xgmi info before ip_init
Driver needs to call get_xgmi_info() before ip_init
to determine whether it needs to handle a pending hive reset.

Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com>
Reviewed-by: David Nieto <david.nieto@amd.com>
Reviewed by: shaoyun.liu <Shaoyun.lui@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-28 16:02:17 -05:00
Dave Airlie b06103b532 Merge tag 'amd-drm-next-5.17-2021-12-16' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amdgpu:
- Add some display debugfs entries
- RAS fixes
- SR-IOV fixes
- W=1 fixes
- Documentation fixes
- IH timestamp fix
- Misc power fixes
- IP discovery fixes
- Large driver documentation updates
- Multi-GPU memory use reductions
- Misc display fixes and cleanups
- Add new SMU debug option

amdkfd:
- SVM fixes

radeon:
- Fix typo in comment

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211216202731.5900-1-alexander.deucher@amd.com
2021-12-23 11:55:28 +10:00
Victor Skvortsov 892deb4826 drm/amdgpu: Separate vf2pf work item init from virt data exchange
We want to be able to call virt data exchange conditionally
after gmc sw init to reserve bad pages as early as possible.
Since this is a conditional call, we will need
to call it again unconditionally later in the init sequence.

Refactor the data exchange function so it can be
called multiple times without re-initializing the work item.

v2: Cleaned up the code. Kept the original call to init_exchange_data()
inside early init to initialize the work item, afterwards call
exchange_data() when needed.

Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com>
Reviewed By: Shaoyun.liu <Shaoyun.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-16 14:08:20 -05:00
Huang Rui 5c1e6fa49e drm/amdgpu: introduce new amdgpu_fence object to indicate the job embedded fence
The job embedded fence donesn't initialize the flags at
dma_fence_init(). Then we will go a wrong way in
amdgpu_fence_get_timeline_name callback and trigger a null pointer panic
once we enabled the trace event here. So introduce new amdgpu_fence
object to indicate the job embedded fence.

[  156.131790] BUG: kernel NULL pointer dereference, address: 00000000000002a0
[  156.131804] #PF: supervisor read access in kernel mode
[  156.131811] #PF: error_code(0x0000) - not-present page
[  156.131817] PGD 0 P4D 0
[  156.131824] Oops: 0000 [#1] PREEMPT SMP PTI
[  156.131832] CPU: 6 PID: 1404 Comm: sdma0 Tainted: G           OE     5.16.0-rc1-custom #1
[  156.131842] Hardware name: Gigabyte Technology Co., Ltd. Z170XP-SLI/Z170XP-SLI-CF, BIOS F20 11/04/2016
[  156.131848] RIP: 0010:strlen+0x0/0x20
[  156.131859] Code: 89 c0 c3 0f 1f 80 00 00 00 00 48 01 fe eb 0f 0f b6 07 38 d0 74 10 48 83 c7 01 84 c0 74 05 48 39 f7 75 ec 31 c0 c3 48 89 f8 c3 <80> 3f 00 74 10 48 89 f8 48 83 c0 01 80 38 00 75 f7 48 29 f8 c3 31
[  156.131872] RSP: 0018:ffff9bd0018dbcf8 EFLAGS: 00010206
[  156.131880] RAX: 00000000000002a0 RBX: ffff8d0305ef01b0 RCX: 000000000000000b
[  156.131888] RDX: ffff8d03772ab924 RSI: ffff8d0305ef01b0 RDI: 00000000000002a0
[  156.131895] RBP: ffff9bd0018dbd60 R08: ffff8d03002094d0 R09: 0000000000000000
[  156.131901] R10: 000000000000005e R11: 0000000000000065 R12: ffff8d03002094d0
[  156.131907] R13: 000000000000001f R14: 0000000000070018 R15: 0000000000000007
[  156.131914] FS:  0000000000000000(0000) GS:ffff8d062ed80000(0000) knlGS:0000000000000000
[  156.131923] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  156.131929] CR2: 00000000000002a0 CR3: 000000001120a005 CR4: 00000000003706e0
[  156.131937] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  156.131942] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  156.131949] Call Trace:
[  156.131953]  <TASK>
[  156.131957]  ? trace_event_raw_event_dma_fence+0xcc/0x200
[  156.131973]  ? ring_buffer_unlock_commit+0x23/0x130
[  156.131982]  dma_fence_init+0x92/0xb0
[  156.131993]  amdgpu_fence_emit+0x10d/0x2b0 [amdgpu]
[  156.132302]  amdgpu_ib_schedule+0x2f9/0x580 [amdgpu]
[  156.132586]  amdgpu_job_run+0xed/0x220 [amdgpu]

v2: fix mismatch warning between the prototype and function name (Ray, kernel test robot)

Signed-off-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-16 13:42:36 -05:00
Yann Dirson 03f2abb07e amdgpu: fix some kernel-doc markup
Those are not today pulled by the sphinx doc, but better be ready.

Signed-off-by: Yann Dirson <ydirson@free.fr>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-14 16:10:53 -05:00
Guchun Chen e0f943b4f9 drm/amdgpu: use adev_to_drm to get drm_device pointer
Updated for consistency when accessing drm_device from amdgpu driver.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-14 16:09:24 -05:00
Daniel Vetter 99b03ca651 Linux 5.16-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmG2fU0eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGC7EH/3R7Rt+OD8Wn8Ss3
 w8V+dBxVwa2u2oMTyUHPxaeOXZ7bi38XlUdLFPOK/76bGwO0a5TmYZqsWdRbGyT0
 HfcYjHsQ0lbJXk/nh2oM47oJxJXVpThIHXJEk0FZ0Y5t+DYjIYlNHzqZymUyhLem
 St74zgWcyT+MXuqY34vB827FJDUnOxhhhi85tObeunaSPAomy9aiYidSC1ARREnz
 iz2VUntP/QnRnKVvL2nUZNzcz1xL5vfCRSKsRGRSv3qW1Y/1M71ylt6JVmSftWq+
 VmMdFxFhdrb1OK/1ct/930Un/UP2NG9EJsWxote2XYlnVSZHzDqH7lUhbqgdCcLz
 1m2tVNY=
 =7wRd
 -----END PGP SIGNATURE-----

Merge v5.16-rc5 into drm-next

Thomas Zimmermann requested a fixes backmerge, specifically also for
96c5f82ef0 ("drm/vc4: fix error code in vc4_create_object()")

Just a bunch of adjacent changes conflicts, even the big pile of them
in vc4.

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
2021-12-14 10:24:28 +01:00
Philip Yang 4a74c38cd6 drm/amdgpu: Detect if amdgpu in IOMMU direct map mode
If host and amdgpu IOMMU is not enabled or IOMMU is pass through mode,
set adev->ram_is_direct_mapped flag which will be used to optimize
memory usage for multi GPU mappings.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-13 16:33:16 -05:00
Lang Yu 34f3a4a98b drm/amdgpu: introduce a kind of halt state for amdgpu device
It is useful to maintain error context when debugging
SW/FW issues. Introduce amdgpu_device_halt() for this
purpose. It will bring hardware to a kind of halt state,
so that no one can touch it any more.

Compare to a simple hang, the system will keep stable
at least for SSH access. Then it should be trivial to
inspect the hardware state and see what's going on.

v2:
 - Set adev->no_hw_access earlier to avoid potential crashes.(Christian)

Suggested-by: Christian Koenig <christian.koenig@amd.com>
Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Lang Yu <lang.yu@amd.com>
Reviewed-by: Christian Koenig <christian.koenig@amd.co>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-13 16:33:16 -05:00
Lang Yu 613aa3ea74 drm/amdgpu: only hw fini SMU fisrt for ASICs need that
We found some headaches on ASICs don't need that,
so remove that for them.

Suggested-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Lang Yu <lang.yu@amd.com>
Reviewed-by: Kevin Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-13 16:32:35 -05:00
Isabella Basso bbe04dec5c drm/amd: fix improper docstring syntax
This fixes various warnings relating to erroneous docstring syntax, of
which some are listed below:

 warning: Function parameter or member 'adev' not described in
 'amdgpu_atomfirmware_ras_rom_addr'
 ...
 warning: expecting prototype for amdgpu_atpx_validate_functions().
 Prototype was for amdgpu_atpx_validate() instead
 ...
 warning: Excess function parameter 'mem' description in 'amdgpu_preempt_mgr_new'
 ...
 warning: Cannot understand  * @kfd_get_cu_occupancy - Collect number of
 waves in-flight on this device
 ...
 warning: This comment starts with '/**', but isn't a kernel-doc
 comment. Refer Documentation/doc-guide/kernel-doc.rst

Signed-off-by: Isabella Basso <isabbasso@riseup.net>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-13 16:32:34 -05:00
Zhigang Luo a5f67c939e drm/amdgpu: recover XGMI topology for SRIOV VF after reset
For SRIOV VF, the XGMI topology was not recovered after reset. This
change added code to SRIOV VF reset function to update XGMI topology
for SRIOV VF after reset.

Signed-off-by: Zhigang Luo <zhigang.luo@amd.com>
Reviewed-by: Shaoyun Liu <shaoyun.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-13 16:32:34 -05:00
Zhigang Luo 175ac6ec6b drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF
On SRIOV, host driver can support FLR(function level reset) on individual VF
within the hive which might bring the individual device back to normal without
the necessary to execute the hive reset. If the FLR failed , host driver will
trigger the hive reset, each guest VF will get reset notification before the
real hive reset been executed. The VF device can handle the reset request
individually in it's reset work handler.

This change updated gpu recover sequence to skip reset other device in
the same hive for SRIOV VF.

Signed-off-by: Zhigang Luo <zhigang.luo@amd.com>
Reviewed-by: Shaoyun Liu <shaoyun.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-13 16:32:34 -05:00
shaoyunl 428890a3fe drm/amdgpu: adjust the kfd reset sequence in reset sriov function
This change revert previous commits:
9f4f2c1a35 ("drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov")
271fd38ce5 ("drm/amdgpu: move kfd post_reset out of reset_sriov function")

This change moves the amdgpu_amdkfd_pre_reset to an earlier place
in amdgpu_device_reset_sriov, presumably to address the sequence issue
that the first patch was originally meant to fix.

Some register access(GRBM_GFX_CNTL) only be allowed on full access
mode. Move kfd_pre_reset and  kfd_post_reset back inside reset_sriov
function.

Fixes: 9f4f2c1a35 ("drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov")
Fixes: 271fd38ce5 ("drm/amdgpu: move kfd post_reset out of reset_sriov function")
Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-01 17:09:30 -05:00
Flora Cui 1053b9c948 drm/amdgpu: check atomic flag to differeniate with legacy path
since vkms support atomic KMS interface

Signed-off-by: Flora Cui <flora.cui@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Acked-by: Alex Deucher <aleander.deucher@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-01 16:59:38 -05:00
shaoyunl 992110d747 drm/amdgpu: adjust the kfd reset sequence in reset sriov function
This change revert previous commits:
9f4f2c1a35 ("drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov")
271fd38ce5 ("drm/amdgpu: move kfd post_reset out of reset_sriov function")

This change moves the amdgpu_amdkfd_pre_reset to an earlier place
in amdgpu_device_reset_sriov, presumably to address the sequence issue
that the first patch was originally meant to fix.

Some register access(GRBM_GFX_CNTL) only be allowed on full access
mode. Move kfd_pre_reset and  kfd_post_reset back inside reset_sriov
function.

Fixes: 9f4f2c1a35 ("drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov")
Fixes: 271fd38ce5 ("drm/amdgpu: move kfd post_reset out of reset_sriov function")
Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-01 16:07:33 -05:00
Stanley.Yang 232d1d43b5 drm/amdgpu: fix disable ras feature failed when unload drvier v2
v2:
    still need call ras_disable_all_featrures to handle
    ras initilization failure case.

Function amdgpu_device_fini_hw is called before amdgpu_device_fini_sw,
so ras ta will unload before send ras disable command, ras dsiable operation
must before hw fini.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-01 16:03:22 -05:00
Flora Cui 700de2c8aa drm/amdgpu: check atomic flag to differeniate with legacy path
since vkms support atomic KMS interface

Signed-off-by: Flora Cui <flora.cui@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Acked-by: Alex Deucher <aleander.deucher@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-01 16:03:03 -05:00
shaoyunl 271fd38ce5 drm/amdgpu: move kfd post_reset out of reset_sriov function
Fixes: 9f4f2c1a35 ("drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov")

For sriov XGMI  configuration, the host driver will handle the hive reset,
so in guest side, the reset_sriov only be called once on one device. This will
make kfd post_reset unblanced with kfd pre_reset since kfd pre_reset already
been moved out of reset_sriov function. Move kfd post_reset out of reset_sriov
function to make them balance .

Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-24 15:08:58 -05:00
shaoyunl 4f30d920d1 drm/amdgpu: move kfd post_reset out of reset_sriov function
Fixes: 9f4f2c1a35 ("drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov")

For sriov XGMI  configuration, the host driver will handle the hive reset,
so in guest side, the reset_sriov only be called once on one device. This will
make kfd post_reset unblanced with kfd pre_reset since kfd pre_reset already
been moved out of reset_sriov function. Move kfd post_reset out of reset_sriov
function to make them balance .

Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-24 14:06:52 -05:00
Evan Quan 6c08e0ef87 drm/amd/pm: avoid duplicate powergate/ungate setting
Just bail out if the target IP block is already in the desired
powergate/ungate state. This can avoid some duplicate settings
which sometimes may cause unexpected issues.

Link: https://lore.kernel.org/all/YV81vidWQLWvATMM@zn.tnic/
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=214921
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=215025
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1789
Signed-off-by: Evan Quan <evan.quan@amd.com>
Tested-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-22 14:45:02 -05:00
Evan Quan 6ee27ee27b drm/amd/pm: avoid duplicate powergate/ungate setting
Just bail out if the target IP block is already in the desired
powergate/ungate state. This can avoid some duplicate settings
which sometimes may cause unexpected issues.

Link: https://lore.kernel.org/all/YV81vidWQLWvATMM@zn.tnic/
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=214921
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=215025
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1789
Fixes: bf756fb833 ("drm/amdgpu: add missing cleanups for Polaris12 UVD/VCE on suspend")
Signed-off-by: Evan Quan <evan.quan@amd.com>
Tested-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2021-11-17 17:41:20 -05:00
Evan Quan 087451f372 drm/amdgpu: use generic fb helpers instead of setting up AMD own's.
With the shadow buffer support from generic framebuffer emulation, it's
possible now to have runpm kicked when no update for console.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17 16:58:03 -05:00
shaoyunl 9f4f2c1a35 drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov
The KFD pre_reset should be called before reset been executed, it will
hold the lock to prevent other rocm process to sent the packlage to hiq
during host execute the real reset on the HW

Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-09 17:08:00 -05:00
Alex Deucher 2d32ffd6e9 drm/amdgpu: fix SI handling in amdgpu_device_asic_has_dc_support()
Properly handle SI DC support when CONFIG_DRM_AMD_DC_SI is not
set.

Fixes: f7f12b2582 ("drm/amdgpu: default to true in amdgpu_device_asic_has_dc_support")
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-05 14:12:53 -04:00
James Zhu 93cec18478 drm/amdgpu: remove duplicated kfd_resume_iommu
Remove duplicated kfd_resume_iommu which already runs
in mdgpu_amdkfd_device_init.

Tested-By: Ken Moffat <zarniwhoop@ntlworld.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: James Zhu <James.Zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-03 12:22:08 -04:00
Jingwen Chen 38d4e4638e drm/amd/amdgpu: fix bad job hw_fence use after free in advance tdr
[Why]
In advance tdr mode, the real bad job will be resubmitted twice, while
in drm_sched_resubmit_jobs_ext, there's a dma_fence_put, so the bad job
is put one more time than other jobs.

[How]
Adding dma_fence_get before resbumit job in
amdgpu_device_recheck_guilty_jobs and put the fence for normal jobs

Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-03 12:22:07 -04:00
Lang Yu a5c5d8d50e drm/amdgpu: fix a potential memory leak in amdgpu_device_fini_sw()
amdgpu_fence_driver_sw_fini() should be executed before
amdgpu_device_ip_fini(), otherwise fence driver resource
won't be properly freed as adev->rings have been tore down.

Fixes: 72c8c97b15 ("drm/amdgpu: Split amdgpu_device_fini into early and late")

Signed-off-by: Lang Yu <lang.yu@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-28 14:26:12 -04:00
Dave Airlie 970eae1560 Linux 5.15-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmF298ceHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGIJYH/1rsEFQQ6caeQdy1
 z9eFIe48DNM4l7bFk+qEj2UAbzPdahVJ299Mg5fW0n2CDemOc9/n0b9TxQ37YObi
 mOzu0xwJVupIxkyFMPQSSc2q8aLm67NSpJy08DsmaNses5hSvu8x15RPHLQTybjt
 SwtKns+jpCq79P1GWbrB5e5UkLb0VNoxNp4L1U4pMrYGcEkJUXbaxNY2V/JcXdM7
 Vtn+qN0T/J6V6QVftv0t8Ecj3bjEnmL3kZHaTaNg3dGeKRpCGyHc5lcBQ0cNFG6t
 vjZ9VbuhBzGI3TN2tHH5hpA1UXo7HPBBCwQqxF1jeGLGHULikYwZ3TAPWqL3QZqC
 9cxr9SY=
 =p75d
 -----END PGP SIGNATURE-----

BackMerge tag 'v5.15-rc7' into drm-next

The msm next tree is based on rc3, so let's just backmerge rc7 before pulling it in.

Signed-off-by: Dave Airlie <airlied@redhat.com>
2021-10-28 14:59:38 +10:00
YuBiao Wang a3848df60b drm/amd/amdgpu: Do irq_fini_hw after ip_fini_early
[Why]
drm_irq_uninstall is called in irq_fini_hw so that irq is disabled in sw
stage. SMU (and maybe other IP blocks) fini_hw will call irq_put for
cleanup and the whole cleanup process will be skipped because of
drm->irq_enable = false.

[How]
Move ip_fini_early before irq_fini_hw.

Signed-off-by: YuBiao Wang <YuBiao.Wang@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-19 17:16:50 -04:00
Yifan Zhang afd18180c0 drm/amdkfd: fix boot failure when iommu is disabled in Picasso.
When IOMMU disabled in sbios and kfd in iommuv2 path, iommuv2
init will fail. But this failure should not block amdgpu driver init.

Reported-by: youling <youling257@gmail.com>
Tested-by: youling <youling257@gmail.com>
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: James Zhu <James.Zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-13 14:15:52 -04:00
Guchun Chen c58a863b1c drm/amdgpu: use adev_to_drm for consistency when accessing drm_device
adev_to_drm is used everywhere, so improve recent changes
when accessing drm_device pointer from amdgpu_device.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-08 13:22:13 -04:00
Nirmoy Das 58144d2837 drm/amdgpu: unify BO evicting method in amdgpu_ttm
Unify BO evicting functionality for possible memory
types in amdgpu_ttm.c.

Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-07 11:55:46 -04:00
Guchun Chen 248b061689 drm/amdgpu: handle the case of pci_channel_io_frozen only in amdgpu_pci_resume
In current code, when a PCI error state pci_channel_io_normal is detectd,
it will report PCI_ERS_RESULT_CAN_RECOVER status to PCI driver, and PCI
driver will continue the execution of PCI resume callback report_resume by
pci_walk_bridge, and the callback will go into amdgpu_pci_resume
finally, where write lock is releasd unconditionally without acquiring
such lock first. In this case, a deadlock will happen when other threads
start to acquire the read lock.

To fix this, add a member in amdgpu_device strucutre to cache
pci_channel_state, and only continue the execution in amdgpu_pci_resume
when it's pci_channel_io_frozen.

Fixes: c9a6b82f45 ("drm/amdgpu: Implement DPC recovery")
Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-05 13:02:31 -04:00
Yifan Zhang 714d9e4574 drm/amdgpu: init iommu after amdkfd device init
This patch is to fix clinfo failure in Raven/Picasso:

Number of platforms: 1
  Platform Profile: FULL_PROFILE
  Platform Version: OpenCL 2.2 AMD-APP (3364.0)
  Platform Name: AMD Accelerated Parallel Processing
  Platform Vendor: Advanced Micro Devices, Inc.
  Platform Extensions: cl_khr_icd cl_amd_event_callback

  Platform Name: AMD Accelerated Parallel Processing Number of devices: 0

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: James Zhu <James.Zhu@amd.com>
Tested-by: James Zhu <James.Zhu@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-05 13:02:20 -04:00
Guchun Chen e17e27f9bd drm/amdgpu: handle the case of pci_channel_io_frozen only in amdgpu_pci_resume
In current code, when a PCI error state pci_channel_io_normal is detectd,
it will report PCI_ERS_RESULT_CAN_RECOVER status to PCI driver, and PCI
driver will continue the execution of PCI resume callback report_resume by
pci_walk_bridge, and the callback will go into amdgpu_pci_resume
finally, where write lock is releasd unconditionally without acquiring
such lock first. In this case, a deadlock will happen when other threads
start to acquire the read lock.

To fix this, add a member in amdgpu_device strucutre to cache
pci_channel_state, and only continue the execution in amdgpu_pci_resume
when it's pci_channel_io_frozen.

Fixes: c9a6b82f45 ("drm/amdgpu: Implement DPC recovery")
Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-05 12:23:56 -04:00
Christian König 127aedf979 drm/amdgpu: print warning and taint kernel if lockup timeout is disabled
Make sure that we notice this in error reports.

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-05 12:23:42 -04:00
Christian König c8365dbda0 drm/amdgpu: revert "Add autodump debugfs node for gpu reset v8"
This reverts commit 728e7e0cd6.

Further discussion reveals that this feature is severely broken
and needs to be reverted ASAP.

GPU reset can never be delayed by userspace even for debugging or
otherwise we can run into in kernel deadlocks.

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Nirmoy Das <nirmoy.das@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-05 12:22:36 -04:00
Yifan Zhang 286826d7d9 drm/amdgpu: init iommu after amdkfd device init
This patch is to fix clinfo failure in Raven/Picasso:

Number of platforms: 1
  Platform Profile: FULL_PROFILE
  Platform Version: OpenCL 2.2 AMD-APP (3364.0)
  Platform Name: AMD Accelerated Parallel Processing
  Platform Vendor: Advanced Micro Devices, Inc.
  Platform Extensions: cl_khr_icd cl_amd_event_callback

  Platform Name: AMD Accelerated Parallel Processing Number of devices: 0

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: James Zhu <James.Zhu@amd.com>
Tested-by: James Zhu <James.Zhu@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-05 12:22:29 -04:00
Alex Deucher 3ae695d691 drm/amdgpu: add new asic_type for IP discovery
Add a new asic type for asics where we don't have an
explicit entry in the PCI ID list.  We don't need
an asic type for these asics, other than something higher
than the existing ones, so just use this for all new
asics.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-04 15:23:01 -04:00
Alex Deucher f7f12b2582 drm/amdgpu: default to true in amdgpu_device_asic_has_dc_support
We are not going to support any new chips with the old
non-DC code so make it the default.

Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-04 15:23:00 -04:00