Commit graph

1035 commits

Author SHA1 Message Date
Dani Liberman
f018c54e3d habanalabs: add uapi to retrieve engines status
Currently, to get engines status, user needed to read debugfs file
with root permissions.

This new uapi allows user apace apps retrieve status, so for example,
in case of failure, status can be retrieved immediately by the
application itself which runs without root permissions.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:50 +03:00
Oded Gabbay
5f92c1e296 habanalabs: remove all kdma locks
We don't use KDMA concurrently in the driver. The only use is through
debugfs and we don't protect concurrent access through it.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:50 +03:00
Ohad Sharabi
0c819c9a04 habanalabs: wrap macro arg with parentheses
The macro argument <val> is cast-ed to u32 in some of the places.
Because this arg can be some arithmetic computation (e.g. address +
offset) the cast should be on the whole expression.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:50 +03:00
Bharat Jauhari
f25a72b8b9 habanalabs: fix spelling mistakes
Cosmetic commit, no logical changes. It just fixes the spelling
mistakes.

Signed-off-by: Bharat Jauhari <bjauhari@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:50 +03:00
Ofir Bitton
ae937492ec habanalabs/gaudi2: remove old interrupt mappings
Interrupt enumration has changed some time ago but the old mapping
was accidentally left in the driver.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:50 +03:00
Oded Gabbay
cd6b0cea89 habanalabs/gaudi: increase default cs timeout to 10 minutes
In order to improve scalability and reduce host overhead, it is better
to increase the default TDR timeout of Gaudi1 from 30 seconds to
10 minutes.

This will allow the DL Framework (e.g. PyTorch, TensorFlow) to remove
the host sync they are using now and improve overall performance on
scaleout training.

Note that one can always set the timeout to a custom value via
a kernel module parameter given during driver load.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:49 +03:00
Ohad Sharabi
913bd4179b habanalabs: add return code field to module iterator
Up until now the module iterator called void callback functions
and so caller activating callback that may fail suffered from 2 issues:
1. The need to "plant" return called in the private data. This is a
   drawback since the iterator itself should not be aware of the private
   data of the caller.
2. Due to 1 even in a failure the iterator would keep iterating instead
   of break upon error.

To overcome this an optional rc field added to the iterator context.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:49 +03:00
Tomer Tayar
28742772a0 habanalabs/gaudi2: enable all MMU SPI/SEI interrupts
Currently only part of the MMU SPI/SEI interrupts are enabled, although
there is no real reason to not enable all.
The only exception is "burst_fifo_full" which is expected for PMMU
because it has a 2 entries FIFO, and thus is it not enabled for it.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:49 +03:00
Ofir Bitton
bc9b271e6c habanalabs: rename non_hard_reset to compute_reset
In order to be more explicit we should use the term compute_reset
for describing the reset in which only the compute engines gets
reset.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:49 +03:00
Colin Ian King
e4507995da habanalabs: Fix spelling mistake "Scrubing" -> "Scrubbing"
There is a spelling mistake in a dev_dbg message. Fix it.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:49 +03:00
Yang Li
2d4c09e3f9 habanalabs: Simplify bool conversion
Fix the following coccicheck warning:
./drivers/misc/habanalabs/gaudi2/gaudi2.c:9727:48-53: WARNING:
conversion to bool not needed here

Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:49 +03:00
Dani Liberman
71386e11f2 habanalabs: removed seq_file parameter from is_idle asic functions
Change is_idle functions so it would be more usable outside debugfs.

Do this by replacing seq_file parameter with regular string.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18 13:29:49 +03:00
Oded Gabbay
a919b823ab habanalabs: move h/w dirty message to debug
H/W being dirty during initialization is completely expected in case
f/w tools are used before loading the driver. As it is not an error,
and as it doesn't give any meaningful information to the user,
no point of printing it.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:31 +03:00
Oded Gabbay
0b0ae02440 habanalabs: rename soft reset to compute reset
Doing compute reset can be the traditional inference soft reset
that is supported only in Goya.

Or it can be the new reset upon device release, which is supported
in Gaudi2 and above.

Therefore, wherever suitable, use the terminology of compute reset
instead of soft reset.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:31 +03:00
Oded Gabbay
e3b20f3ee4 habanalabs: add status of reset after device release
The user might want to know the device is in reset after device
release, which is not an erroneous event as a regular reset.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:31 +03:00
Oded Gabbay
bd4a338886 habanalabs: fix update of is_in_soft_reset
reset_info.is_in_soft_reset should be updated both before in_reset
and inside the spin lock of the reset info structure.

The reasons are:

- When we are inside soft reset, it implies we are in reset. Therefore,
  if someone checks if we are in soft reset, he can deduce we are
  in reset, while the opposite is not correct and might be misleading.

- Both these flags are changed together so they must be changed
  inside the reset info spinlock.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:31 +03:00
Ofir Bitton
08f0aa9548 habanalabs: expose only valid debugfs nodes
In case security is enabled on the device, some debugfs nodes will
fail. Hence, we do not expose them.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:31 +03:00
Tomer Tayar
168fc71857 habanalabs/gaudi2: map virtual MSI-X doorbell memory for user
Upon the initialization of a user context, map the host memory page of
the virtual MSI-X doorbell in the device MMU.
A reserved VA is used for this purpose, so user can use it directly
without any allocation/map operation.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:31 +03:00
Tomer Tayar
3f043b3192 habanalabs/gaudi2: modify decoder to use virtual MSI-X doorbell
Modify the decoder wrapper blocks to generate interrupts using the
virtual MSI-X doorbell.

As a decoder wrapper block cannot write directly to HBW upon completion,
it writes instead to SOB which is monitored by a master monitor.
When resolved, this monitor will be the one to actually write to the
virtual MSI-X doorbell.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:31 +03:00
Tomer Tayar
bfbf5a0a71 habanalabs/gaudi2: modify CS completion CQ to use virtual MSI-X doorbell
Modify the CQ which is used for CS completion, to use the virtual MSI-X
doorbell.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:30 +03:00
Tomer Tayar
25ad863839 habanalabs/gaudi2: replace defines for reserved sob/mob with enums
Following patches are going to add more reserved sync objects and
monitors.
To make the counting of these reserved resources simpler, replace the
existing RESERVED_* defines with enumerations.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:30 +03:00
Tomer Tayar
1cf596c6b9 habanalabs/gaudi2: configure virtual MSI-X doorbell interface
Due to a watchdog timer in the LBW path, writes to the MSI-X doorbell
can return sporadic error responses.
To work-around this issue, a virtual MSI-X doorbell on the HBW path is
configured, using the MSI-X AXI slave interface in the PCIe controller.
Upon an access to a configured HBW host address, the controller will
generate MSI-X interrupt instead of treating the access as regular host
memory access.

This patch allocates the dedicate host memory page, and communicate the
address to F/W, so it will configure the relevant address match
registers in the controller, and will use this address to generate MSI-X
interrupts for F/W events.

Following patches will handle other initiators in the device, to move
them to use the virtual MSI-X doorbell.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:30 +03:00
Tomer Tayar
af2e650b36 habanalabs: add a value field to hl_fw_send_pci_access_msg()
For gaudi2 we need to send a value to F/W as part of the
PCI_ACCESS packet.
As a preparation, modify hl_fw_send_pci_access_msg() to have a 'value'
field.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:30 +03:00
Ohad Sharabi
20cd88a775 habanalabs: fixes to the poll-timeout macros
- use conventional internal macro variables (double underscore prefix)
- adjust address casting
- on register poll using ELBI use ELBI read rather than BAR read on
  error condition
- remove unused macro

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:30 +03:00
Ohad Sharabi
3fc252670b habanalabs/gaudi2: use DIV_ROUND_UP_SECTOR_T instead of roundup
roundup will create an error in 32-bit architectures as we use
64-bit variables.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:30 +03:00
Oded Gabbay
b596ad6f11 habanalabs: initialize variable explicitly
Fix warning of
"warning: ‘old_base’ may be used uninitialized in this function"

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:30 +03:00
Christophe JAILLET
6d24b4d17d habanalabs: Use the bitmap API to allocate bitmaps
Use bitmap_zalloc()/bitmap_free() instead of hand-writing them.

It is less verbose and it improves the semantic.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:30 +03:00
Oded Gabbay
ead36b1981 habanalabs/gaudi2: remove unused defines
There were some defines that are unused in the current upstreamed
code.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:30 +03:00
Oded Gabbay
cf008f5acb habanalabs: make sure variable is set before used
timestamp could be unset in both _hl_interrupt_wait_ioctl() and
_hl_interrupt_wait_ioctl_user_addr() so it is better to explicitly
initialize it to 0 when declaring it.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:29 +03:00
Oded Gabbay
f2d9ec872c habanalabs: don't declare tmp twice in same function
tmp is declared in the scope of the function cs_do_release() and
inside a block inside that function.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:29 +03:00
Ofir Bitton
cc81c0f3b0 habanalabs: do not set max power on a secured device
Max power API is not supported in secured devices. Hence, we should
skip setting it during boot.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:29 +03:00
Oded Gabbay
e475acabb9 habanalabs/gaudi2: SM mask can only be 8-bit
Otherwise, due to how we calculate it, we might fail in FIELD_PREP
checks.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:29 +03:00
Oded Gabbay
c979cb5d8b habanalabs/gaudi2: remove unused variable
glbl_sts_clr_val was set but never used

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:29 +03:00
Oded Gabbay
18913d6870 habanalabs: allow detection of unsupported f/w packets
If we send a packet to the f/w, and that packet is unsupported, we
want to be able to identify this situation and possibly ignore this.

Therefore, if the f/w returned an error, we need to propagate it
to the callers in the result value, if those callers were interested
in it.

In addition, no point of printing the error code here because each
caller prints its own error with a specific message.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:29 +03:00
Sagiv Ozeri
ea9770e653 habanalabs: save f/w preboot minor version
We need this property for backward compatibility against the f/w.

Signed-off-by: Sagiv Ozeri <sozeri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:29 +03:00
Ofir Bitton
d6a66d5960 habanalabs: add support for common decoder interrupts
User application should be able to get notification for any decoder
completion. Hence, we introduce a new interface in which a user
can wait for all current decoder pending interrupts.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:29 +03:00
Ofir Bitton
1a6609cdd4 habanalabs: naming refactor of user interrupt flow
Current naming convention can be misleading. Hence renaming some
variables and defines in order to be more explicit.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:28 +03:00
Ohad Sharabi
2b9e583d0a habanalabs: wait for preboot ready after hard reset
Currently we are not waiting for preboot ready after hard reset.
This leads to a race in which COMMs protocol begins but will get no
response from the f/w.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:28 +03:00
Ofir Bitton
a85e389a84 habanalabs/gaudi2: reset device upon critical ECC event
Correctable ECC events are not fatal, but as they accumulate, the f/w
can decide that a hard-rest is required. This indication is
propagated to the host using the existing ECC event interface.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:28 +03:00
Oded Gabbay
6b4e8a12b2 habanalabs: enable gaudi2 code in driver
Enable the Gaudi2 ASIC code in the pci probe callback of the driver so
the driver will handle Gaudi2 ASICs.

Add the PCI ID to the PCI table and add the ASIC enum value to all
relevant places.

Fixup the device parameters initialization for Gaudi2.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:28 +03:00
Moti Haimovski
8aa1e1e605 habanalabs: add gaudi2 MMU support
Gaudi2 has new MMU units. A PMMU for device->host accesses, and HMMU
for HBM accesses.

The page tables of both MMUs are located in the host's memory (referred
to in the code as host-resident pgt).

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:28 +03:00
Oded Gabbay
f73c637645 habanalabs: add gaudi2 wait-for-CS support
In Gaudi2 we moved to a different wait for command submission
completion model. Instead of receiving interrupt only on external
queues, we use the device's sync manager to notify us when the
entire command submission finishes.

This enables us to remove the categorization of queues to external
and internal, and treat each queue equally, without the need to parse
and patch any command buffer.

This change also requires refactoring to the IRQ handling of
CS completions.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:28 +03:00
Benjamin Dotan
de88aa67af habanalabs/gaudi2: add gaudi2 profiler module
Add the Gaudi2 code to initialize the ASIC's profiler. The profile
receives its initialization values from the user, same as in Gaudi2,
but the code to initialize is in the driver because the configuration
space of the device is not directly exposed to the user.

Signed-off-by: Benjamin Dotan <bdotan@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:28 +03:00
Ofir Bitton
4567214686 habanalabs/gaudi2: add gaudi2 security module
Use the generic security module to block all registers in the ASIC and
then open only those that are needed to be accessed by the user.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:28 +03:00
Ofir Bitton
e392d1bd04 habanalabs: add generic security module
As the ASICs become more complex and have many more registers, we need
a better way to configure the security properties.

As a reminder, we have two dedicated mechanisms for security:
Range Registers and Protection bits. Those mechanisms protect sensitive
memory and configuration areas inside the device.

The generic module handles the low-level part of the configuration,
because the configuration mechanism is identical in all ASICs. The
difference is the address ranges and register names.

Any ASIC that use this block should first block all the register
blocks in the ASIC. Then, it should open only the registers that
need to be accessed by the user (This is opposed to Goya and Gaudi,
where we blocked only what should not be accesses by the user).

The module contains several functions, to unblock single register,
multiple registers, entire blocks, ranges, ranges with mask.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:27 +03:00
Oded Gabbay
c47082c22d habanalabs: remove obsolete device variables used for testing
There are a couple of device variables that are used for testing
purposes and they are set to fixed values.

Remove the variables that are not relevant anymore and document the
remaining variables.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:27 +03:00
Oded Gabbay
be7813eaa6 habanalabs: initialize new asic properties
New asic properties were added for Gaudi2. We want to initialize
and use them, when relevant, also for Goya and Gaudi.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:27 +03:00
Oded Gabbay
9e17258c78 habanalabs: add unsupported functions
There are a number of new ASIC-specific functions that were added
for Gaudi2. To make the common code work, we need to define empty
implementations of those functions for Goya and Gaudi.

Some functions will return error if called with Goya/Gaudi.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:27 +03:00
Oded Gabbay
d7bb1ac89b habanalabs: add gaudi2 asic-specific code
Add the ASIC-specific code for Gaudi2. Supply (almost) all of the
function callbacks that the driver's common code need to initialize,
finalize and submit workloads to the Gaudi2 ASIC.

It also contains the code to initialize the F/W of the Gaudi2 ASIC
and to receive events from the F/W.

It contains new debugfs entry to dump razwi events. razwi is a case
where the device's engines create a transaction that reaches an
invalid destination.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:27 +03:00
Oded Gabbay
01d9ccf865 habanalabs/gaudi2: add asic registers header files
Add the relevant GAUDI2 ASIC registers header files. These files are
generated automatically from a tool maintained by the VLSI engineers.

There are more files which are not upstreamed because only very few
defines from those files are used in the driver. For those files, I
copied the relevant defines into gaudi2_regs.h and gaudi2_masks.h, to
reduce the size of this patch.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12 09:09:27 +03:00