Commit Graph

1089 Commits

Author SHA1 Message Date
Oded Gabbay 259cee1c24 habanalabs: eliminate aggregate use warning
When doing sizeof() and giving as argument a dereference of
a pointer-to-a-pointer object, clang will issue a warning.

Eliminate the warning by passing struct <name>*

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-20 15:52:27 +03:00
Tomer Tayar e403856468 habanalabs/gaudi: use 8KB aligned address for TPC kernels
I$ prefetch is enabled when sending a TPC kernel to initialize the TPC
memory, and it has a restriction that the base address will be aligned
to 8KB.
Currently the base address is 128 bytes from the start address of the
device SRAM, so prefetching will start 128 bytes before the actual
kernel memory.
Modify the kernel address to be 8KB aligned.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-20 15:51:54 +03:00
farah kassabri 6b9b9e244f habanalabs: remove some f/w descriptor validations
To be forward-backward compatible with the firmware in the initial
communication during preboot, we need to remove the validation of the
header size. This will allow us to add more fields to the
lkd_fw_comms_desc structure.

Instead of the validation of the header size, we just print warning
when some mismatch in descriptor has been revealed, and we calculate
the CRC base on descriptor size reported by the firmware instead of
calculating it ourselves.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-20 15:46:45 +03:00
Ohad Sharabi 8412bb69ed habanalabs: build ASICs from new to old
Newer ASICs code changes more often, has more chance to fail
compilation. So, let's compile them first so errors in those files
will fail compilation sooner.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-20 15:42:08 +03:00
Ofir Bitton bb677d527e habanalabs/gaudi2: allow user to flush PCIE by read
In order for the user to flush PCIE he needs to read some register
from PCIE block. The chosen register is SPECIAL_GLBL_SPARE_0 and
hence needs to be unsecured.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 15:10:01 +03:00
Oded Gabbay 4f3ce5e0d0 habanalabs: failure to open device due to reset is debug level
If the user wants to open the device, and the device is currently in
reset, the user will get an error from the open().

We don't need to display an error in the dmesg for that as it is
not a real error and we can spam the kernel log with this message.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 15:09:04 +03:00
Li zeming 006fd8cb65 habanalabs/gaudi2: Remove unnecessary (void*) conversions
The void pointer object can be directly assigned to different structure
objects, it does not need to be cast.

Signed-off-by: Li zeming <zeming@nfschina.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 15:08:40 +03:00
Dani Liberman 0c88760f8f habanalabs/gaudi2: add secured attestation info uapi
User will provide a nonce via the ioctl, and will retrieve
secured attestation data of the boot, generated using given
nonce.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 15:08:40 +03:00
Dani Liberman 43657dadfe habanalabs/gaudi2: add handling to pmmu events in eqe handler
In order to get the error cause and the captured address in case of
page fault, added pmmu events to eqe handler.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 15:08:39 +03:00
Tal Cohen ff13b900b0 habanalabs/gaudi: change TPC Assert to use TPC DEC instead of QMAN err
This change is done while there is a problem to use QMAN error for
TPC assert async. The problem involves security limitation that exists
to generate the assert via QMAN error.

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 15:08:39 +03:00
Dani Liberman 97a78e3d8e habanalabs: rename error info structure
As a preparation for adding more errors to it,
change to more suitable name.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 15:08:39 +03:00
farah kassabri 04d53cd2a6 habanalabs/gaudi2: get f/w reset status register dynamically
Get the firmware reset status address from the dynamic registers
we read from the firmware instead of using a define.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 15:08:39 +03:00
Tomer Tayar f0b6d3cc29 habanalabs/gaudi2: increase hard-reset sleep time to 2 sec
The access to the device registers is blocked during hard reset, until
preboot runs and allows the access to specific registers, including the
PSOC BTM_FSM register which is used to know when the reset is done.
Between the reset request and until this register is polled there is a
small delay of 500 msec which is not enough for F/W to process the reset
and for preboot to run, so the register might be accessed while it is
blocked.
To avoid it, increase the delay to 2 sec.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 15:08:39 +03:00
Tomer Tayar cecde184ca habanalabs/gaudi2: print RAZWI info upon PCIe access error
Add the dump of the RAZWI information when a PCIe access is blocked by
RR.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 15:08:39 +03:00
Oded Gabbay 82736b063f habanalabs: MMU invalidation h/w is per device
The code used the mmu mutex to protect access to the context's page
tables and invalidation of the MMU cache. Because pgt are per
context, the mmu mutex was a member of the context object.

The problem is that the device has a single MMU invalidation h/w
(per MMU). Therefore, the mmu mutex should not be a property of the
context but a property of the device.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 15:08:39 +03:00
Tal Cohen 6f0818c9fc habanalabs: new notifier events for device state
Add new notifier events that inform several device states.
General H/W error raised on device general H/W error occurs.
User engine error is raised when a device engine informs of an error.

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 15:08:38 +03:00
Oded Gabbay c833ac1a5f habanalabs/gaudi2: free event irq if init fails
In case initialization fails after event irq was requested, we need to
release that irq.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 15:08:38 +03:00
Ohad Sharabi 76925f55c9 habanalabs: fix resetting the DRAM BAR
Current code does not takes into account the new DRAM region base
and so calculated address is wrong and can lead to crush.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 15:08:38 +03:00
Ofir Bitton 0626fa1a4d habanalabs: add support for new cpucp return codes
Firmware now responds with a more detailed cpucp return codes.
Driver can now distinguish between error and debug return codes.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 15:08:38 +03:00
Tomer Tayar a0fc8688c0 habanalabs/gaudi2: read F/W security indication after hard reset
F/W security status might change after every reset.

Add the reading of the preboot status to the hard reset sequence, which
among others reads this security indication.

As this preboot status reading includes the waiting for the preboot to
be ready, it can be removed from the CPU init which is done in a later
stage.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 15:08:38 +03:00
Ofir Bitton aee3fd74fe habanalabs/gaudi: rename mme cfg error response print
Current description is misleading hence we rename it to a more
suitable error description.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 15:08:38 +03:00
farah kassabri 62adba0a55 habanalabs: fix possible hole in device va
cb_map_mem() uses gen_pool_alloc() to get virtual address for
mapping a CB.
The mapping is done in chunks of page size, so if the CB size is
larger, it is possible that the allocated virtual addresses won't
be consecutive.
User retrieves this device VA which returns the virtual address
in the first va_block. If there is a "hole" in the virtual addresses,
user can configure a HW block with a bad device VA.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 15:08:38 +03:00
Ofir Bitton f5ec364c9e habanalabs: send device activity in a proper context
'Device activity open packet' should be sent outside of mutex as
there is no real necessity for a lock.
In addition 'device activity close packet' should be sent upon an
actual release of the device.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 15:08:37 +03:00
farah kassabri 4745b2f0d0 habanalabs: send device active message to f/w
As part of the RAS that is done by the f/w, we should send a message
to the f/w when a user either acquires or releases the device.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 15:08:37 +03:00
Ofir Bitton 0855bf8b17 habanalabs/gaudi2: dump detailed information upon RAZWI
In order to improve debuggability, we add all available information
when a RAZWI event occur.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19 15:08:37 +03:00
farah kassabri 988262ef2f habanalabs/gaudi2: log critical events with no rate limit
When we have a storm of errors of HBM ECC SERR we can reach a situation
where driver start hard reset flow without logging the error cause
that caused the hard reset due to logs rate limiting.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:53 +03:00
Ofir Bitton d155df4f62 habanalabs: ignore EEPROM errors during boot
EEPROM errors reported by firmware are basically warnings and
should not fail the boot process.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:53 +03:00
Ofir Bitton c38f72370b habanalabs: perform context switch flow only if needed
Except Goya, none of our ASICs require context switch flow, hence we
enable this flow only where it is needed.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:53 +03:00
Dafna Hirschfeld 262042af13 habanalabs: set command buffer host VA dynamically
Set the addresses for userspace command buffer dynamically
instead of hard-coded. There is no reason for it to
be hard-coded.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:53 +03:00
Ohad Sharabi 0263256791 habanalabs: trace DMA allocations
This patch add tracepoints in the code for DMA allocation.
The main purpose is to be able to cross data with the map operations and
determine whether memory violation occurred, for example free DMA
allocation before unmapping it from device memory.

To achieve this the DMA alloc/free code flows were refactored so that a
single DMA tracepoint will catch many flows.

To get better understanding of what happened in the DMA allocations
the real allocating function is added to the trace as well.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:53 +03:00
Ohad Sharabi 4eb87df3d0 habanalabs: trace MMU map/unmap page
This patch utilize the defined tracepoint to trace the MMU's pages
map/unmap operations.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:53 +03:00
Ohad Sharabi 191a4443c3 habanalabs: define trace events
This patch adds trace events for habanalabs driver to gain all the
benefits such an infrastructure can supply.

The following events were added:
- MMU map/unmap: to be able to track driver's memory allocations
- DMA alloc/free: to track our DMA allocation

the above trace points in conjunction will help us map the device memory
usage as well as to be able to track memory violations.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Acked-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:53 +03:00
Rajarama Manjukody Bhat 7b5d13c9ca habanalabs/gaudi2: assigning PQFs for ARC f/w in PDMA
Assigning 3 PQFs in PDMA1 and 2 PQFs in PDMA0 for ARC firmware usage.

Signed-off-by: Rajarama Manjukody Bhat <rmbhat@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:52 +03:00
Tomer Tayar fb855768d3 habanalabs: fix calculation of DRAM base address in PCIe BAR
The calculation of the device DRAM base address before setting the
relevant PCIe BAR to point at it, has an assumption that this BAR is
used to access only the DRAM, and thus the covered DRAM size is a power
of 2.
In future ASICs it is not necessarily true, so need to update the
calculation to support also a non-power-of-2 size.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:52 +03:00
Dafna Hirschfeld 46e49f434f habanalabs: if map page fails don't try to unmap it
The original code tried to unmap a page that was not mapped as part of
the map page error path.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:52 +03:00
Oded Gabbay 6173572f29 habanalabs: select FW_LOADER in Kconfig
The driver is loading firmware to the device and we use the firmware
loading functions from the FW_LOADER module.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:52 +03:00
Omer Shpigelman 273190d420 habanalabs: add cdev index data member
Instead of recalculating the cdev index, store it in a dedicated data
member. This data member is intended to be passed to other drivers using
the auxiliary bus infra and hence this new data member is necessary in
case that the calculation is changed in the future.

Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:52 +03:00
Dafna Hirschfeld 75bc3986fc habanalabs: fix bug when setting va block size
the size of a block is always 'block->end - block->start + 1'

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:52 +03:00
Ofir Bitton 38a4358009 habanalabs: expose device security status using info ioctl
In order for the user to know if he is running on a secured device
or not, we add it also to the hw_ip info ioctl.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:52 +03:00
Ofir Bitton 6457271f64 habanalabs: expose device security status through sysfs
In order for the user to know if he is running on a secured device
or not, a sysfs node is added.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:52 +03:00
Ofir Bitton 107a5bcc0b habanalabs: remove secured PCI IDs
Secured PCI ID will not be supported in new asics because the
security status can always be read from the f/w.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:52 +03:00
Tomer Tayar 65d3c63513 habanalabs: fix H/W block handling for partial unmappings
Several munmap() calls can be done or a mapped H/W block that has a
larger size than a page size.
Releasing the object should be done only when all mapped range is
unmapped.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:51 +03:00
Dani Liberman 07ecaa0d85 habanalabs: unify hwmon resources clean up
Since hwmon fini code is common for all asics, unified it to common
function.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:51 +03:00
Tal Cohen 194e515c79 habanalabs/gaudi2: new API to control engine cores running mode
The current flow of halting the engine cores is implemented by command
buffers built by the user space and sent towards the Driver.

This current flow is broken since the user space does not know when
the cores actually halt as sending a workload is async op.

Therefore the application can not free the memory that is mapped
to the engine cores.

This new API allows the user space to control the running mode. The
API call is sync (returns after the cores are set to the
requested mode).

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:51 +03:00
Oded Gabbay 07056f58e4 habanalabs: remove left-over code from bring-up
There is some left-over code from the gaudi2 bring-up that wasn't
removed so far.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:51 +03:00
farah kassabri 6419b5232e habanalabs/gaudi2: change device f/w security check
On Gaudi2 the f/w always configures the PCIe iATU and allows access to
scratchpad registers. Therefore, we can know if the f/w is secured
by reading a status bit from the f/w registers.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:51 +03:00
Oded Gabbay ab6c08f0d5 habanalabs: move common function out of debugfs.c
A common function that is called from multiple places can't be
located in degugfs.c because that file is only compiled if
debugfs is enabled in the kernel config file.

This can lead to undefined symbol compilation error.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:51 +03:00
Tomer Tayar f0d4944c20 habanalabs: add a missing lock for in_reset indication
Add a missing lock in hl_device_resume() when it assigns a value to the
'in_reset' indication.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:51 +03:00
Tomer Tayar 5f46217221 habanalabs: fix vma fields assignments order in hl_hw_block_mmap()
In hl_hw_block_mmap(), the vma's 'vm_private_data' and 'vm_ops' fields
are assigned before filling the content of the private data.
In between there is a call to the ASIC hw_block_mmap() function, and if
it fails, the vma close function will be called with a bad private data
value.
Fix the order of assignments to avoid this issue.

In hl_hw_block_mmap() the vma's 'vm_private_data and vm_ops are assigned
before setting the

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:51 +03:00
Tomer Tayar 7fa6c0fe8b habanalabs: avoid returning a valid handle if map_block() fails
map_block() sets the block id handle even if get_hw_block_id() fails,
and in this case it uses block id 0 which might be a valid id.
Modify it to set the handle only if get_hw_block_id() succeeds.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:50 +03:00