Documentation: add a page on amdgpu debugging

Covers GPU page fault debugging and adds a reference
to umr.

v2: update client ids to include SQC/G
v3: Remove duplicate text
v4: add umr documentation link, fix typo

Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher 2024-03-13 13:19:54 -04:00
parent e58acb7613
commit 23808afc61
2 changed files with 81 additions and 0 deletions


@@ -0,0 +1,80 @@
===============
GPU Debugging
===============
GPUVM Debugging
===============
To aid in debugging GPU virtual memory related problems, the driver supports a
number of module parameters:
`vm_fault_stop` - If non-0, halt the GPU memory controller on a GPU page fault.
`vm_update_mode` - If non-0, use the CPU to update GPU page tables rather than
the GPU.
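To check how these parameters are currently set, you can read them back from
sysfs. The following is a minimal sketch, assuming the parameters are exported
under `/sys/module/amdgpu/parameters` (the usual sysfs location for module
parameters); the helper name `read_vm_debug_params` is hypothetical and not
part of the driver.

```python
from pathlib import Path

# Assumption: the standard sysfs location for module parameters.
PARAM_DIR = Path("/sys/module/amdgpu/parameters")


def read_vm_debug_params(param_dir=PARAM_DIR):
    """Return the current values of the two parameters described above."""
    values = {}
    for name in ("vm_fault_stop", "vm_update_mode"):
        path = Path(param_dir) / name
        # None means the file is missing, e.g. the module is not loaded
        # or the parameter is not exported on this kernel.
        values[name] = int(path.read_text().strip()) if path.is_file() else None
    return values


if __name__ == "__main__":
    for name, value in read_vm_debug_params().items():
        print(f"{name} = {value}")
```

Passing a different directory as `param_dir` makes the helper easy to try on a
machine without an AMD GPU.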
Decoding a GPUVM Page Fault
===========================
If you see a GPU page fault in the kernel log, you can decode it to figure
out what is going wrong in your application. A page fault in your kernel
log may look something like this:
::

    [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32777, for process glxinfo pid 2424 thread glxinfo:cs0 pid 2425)
      in page starting at address 0x0000800102800000 from IH client 0x1b (UTCL2)
    VM_L2_PROTECTION_FAULT_STATUS:0x00301030
        Faulty UTCL2 client ID: TCP (0x8)
        MORE_FAULTS: 0x0
        WALKER_ERROR: 0x0
        PERMISSION_FAULTS: 0x3
        MAPPING_ERROR: 0x0
        RW: 0x0

First you have the memory hub, either gfxhub or mmhub. gfxhub is the memory
hub used for graphics, compute, and sdma on some chips. mmhub is the
memory hub used for multi-media and sdma on some chips.
Next you have the vmid and pasid. If the vmid is 0, this fault was likely
caused by the kernel driver or firmware. If the vmid is non-0, it is generally
a fault in a user application. The pasid is used to link a vmid to a system
process id. If the process is active when the fault happens, the process
information will be printed.
The GPU virtual address that caused the fault comes next.
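The fields described so far can be pulled out of the log text mechanically.
The following is a minimal sketch, assuming the message layout matches the
example above; the regular expressions and the `parse_fault_header` helper are
illustrative and may need adjusting for other kernel versions.

```python
import re

# Assumption: both patterns are modelled on the example log in this document.
HEADER_RE = re.compile(r"\[(?P<hub>\w+)\].*?vmid:(?P<vmid>\d+)\s+pasid:(?P<pasid>\d+)")
ADDR_RE = re.compile(r"address (?P<addr>0x[0-9a-fA-F]+)")


def parse_fault_header(log_text):
    """Extract hub, vmid, pasid, and faulting address from a fault message."""
    header = HEADER_RE.search(log_text)
    addr = ADDR_RE.search(log_text)
    if header is None or addr is None:
        return None
    vmid = int(header["vmid"])
    return {
        "hub": header["hub"],
        "vmid": vmid,
        "pasid": int(header["pasid"]),
        "address": int(addr["addr"], 16),
        # vmid 0 points at the kernel driver or firmware; any other
        # vmid generally points at a user application.
        "likely_origin": "kernel driver or firmware" if vmid == 0 else "user application",
    }


if __name__ == "__main__":
    example = (
        "[gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32777, "
        "for process glxinfo pid 2424 thread glxinfo:cs0 pid 2425)\n"
        "in page starting at address 0x0000800102800000 from IH client 0x1b (UTCL2)\n"
    )
    print(parse_fault_header(example))
```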
The client ID indicates the GPU block that caused the fault.
Some common client IDs:
- CB/DB: The color/depth backend of the graphics pipe
- CPF: Command Processor Frontend
- CPC: Command Processor Compute
- CPG: Command Processor Graphics
- TCP/SQC/SQG: Shaders
- SDMA: SDMA engines
- VCN: Video encode/decode engines
- JPEG: JPEG engines
PERMISSION_FAULTS describes which faults were encountered:
- bit 0: the PTE was not valid
- bit 1: the PTE read bit was not set
- bit 2: the PTE write bit was not set
- bit 3: the PTE execute bit was not set
Finally, RW indicates whether the access was a read (0) or a write (1).
In the example above, a shader (client ID = TCP) generated a read (RW = 0x0) to
an invalid page (PERMISSION_FAULTS = 0x3, meaning the PTE was not valid and its
read bit was not set) at GPU virtual address 0x0000800102800000. The user can
then inspect their shader code and resource descriptor state to determine what
caused the GPU page fault.
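The PERMISSION_FAULTS and RW decoding described above can be sketched in a few
lines. This is an illustrative helper rather than driver code: the bit
meanings are taken from the list above, and the function names are
hypothetical.

```python
# Bit meanings as listed in this document for PERMISSION_FAULTS.
PERMISSION_FAULT_BITS = {
    0: "the PTE was not valid",
    1: "the PTE read bit was not set",
    2: "the PTE write bit was not set",
    3: "the PTE execute bit was not set",
}


def decode_permission_faults(value):
    """Return a description for every bit set in PERMISSION_FAULTS."""
    return [text for bit, text in PERMISSION_FAULT_BITS.items() if value & (1 << bit)]


def decode_rw(value):
    """Map the RW field to the access type (0 = read, 1 = write)."""
    return {0: "read", 1: "write"}.get(value, "unknown")


if __name__ == "__main__":
    # Values from the example fault: PERMISSION_FAULTS: 0x3, RW: 0x0
    for reason in decode_permission_faults(0x3):
        print(reason)
    print("access type:", decode_rw(0x0))
```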
UMR
===
`umr <https://gitlab.freedesktop.org/tomstdenis/umr>`_ is a general purpose
GPU debugging and diagnostics tool. Please see the umr
`documentation <https://umr.readthedocs.io/en/main/>`_ for more information
about its capabilities.


@@ -15,4 +15,5 @@ Next (GCN), Radeon DNA (RDNA), and Compute DNA (CDNA) architectures.
ras
thermal
driver-misc
debugging
amdgpu-glossary