As hard-reset can be requested during soft-reset, driver must allow
it or else critical events received during soft-reset will be
ignored.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Atomic operations during reset are replaced by a spinlock in order
to have the ability to protect more than a single variable.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Unify variables related to device reset, which will help us to
add some new reset functionality in future patches.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
This patch fixes issue in which we have timeout for multi-CS although
the CS in the list actually completed.
Example scenario (the two threads marked as WAIT for the thread that
handles the wait_for_multi_cs and CMPL as the thread that signal
completion for both CS and multi-CS):
1. Submit CS with sequence X
2. [WAIT]: call wait_for_multi_cs with single CS X
3. [CMPL]: CS X do invoke complete_all for both CS and multi-CS
(multi_cs_completion_done still false)
4. [WAIT]: enter poll_fences, reinit the completion and find the CS
as completed when asking on the fence but multi_cs_done is
still false it returns that no CS actually completed
5. [CMPL]: set multi_cs_handling_done as true
6. [WAIT]: wait for completion but no CS to awake the wait context
and hence wait till timeout
Solution: if CS detected as completed in poll_fences but multi_cs_done
is still false invoke complete_all to the multi-CS completion
and so it will not go to sleep in wait_for_completion but
rather will have a "second chance" to wait for
multi_cs_completion_done.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
In some cases the driver cannot configure ASID of some engines due to
the security level of the relevant registers.
For this a new CPU-CP packet is introduced, which will allow the driver
to ask the F/W to do this configuration instead.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
-ENOTTY is returned in case of error in the ioctl arguments themselves,
such as function that doesn't exists.
In all other cases, where the error is in the arguments of the custom
data structures that we define that are passed in the various ioctls,
we need to return -EINVAL.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Current sysfs implementation does not take endianness into
consideration when dumping the cpld version.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Currently the cq counters are allocated in userspace memory,
and mapped by the driver to the device address space.
A new requirement that is part of new future API related to this one,
requires that cq counters will be allocated in kernel memory.
We leverage the existing cb_create API with KERNEL_MAPPED flag set to
allocate this memory.
That way we gain two things:
1. The memory cannot be freed while in use since it's protected
by refcount in driver.
2. No need to wake up the user thread upon each interrupt from CQ,
because the kernel has direct access to the counter. Therefore,
it can make comparison with the target value in the interrupt
handler and wake up the user thread only if the counter reaches the
target value. This is instead of waking the thread up to copy counter
value from user then go sleep again if target value wasn't reached.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
By the original design we assumed that if we "miss" multi CS completion
it is of no severe consequence as we'll just call wait_for_multi_cs
again.
Sequence of events for such scenario:
1. user submit CS with sequence N
2. user calls wait for multi-CS with only CS #N in the list
3. the multi CS call starts with poll of the CSs but find that none
completed (while CS #N did not completed yet)
4. now, multi CS #N complete but multi CS CTX was not yet created for
the above multi-CS. so, attempt to complete multi-CS fails (as no
multi CS CTX exist)
5. wait_for_multi_cs call now does init_wait_multi_cs_completion (and
for this create the multi-CS CTX)
6. wait_for_multi_cs wits on completion but will not get one as CS #N
already completed
To fix the issue we initialize the multi-CS CTX prior polling the
fences.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
As BTL can be replaced by ROM we should modify relevant error print.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
During the MMU development the MMU header files were left with unclean
definitions:
- MMU "version specific" definitions that were left in the mmu_general
file
- unused definitions
This patch attempts, where possible, to keep definitions that can serve
multiple MMU versions (but that are not tightly bound with specific MMU
arch) in the mmu_general header file (e.g. different definitions for
number of HOPs).
Otherwise, move MMU version specific definitions (e.g. HOPs masks and
shifts) to the specific MMU version file.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
As we allow soft-reset to be performed only on inference devices,
having the sysfs nodes may cause a confusion. Hence, we remove those
nodes on training ASICs.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Currently sysfs support dumping a single infineon version, in
future asics we will have two infineon versions.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Need to allow user retrieve data during reset and afterwards without
the need to reopen the device.
Did it by seperating the user peocesses list into two lists:
1. fpriv_list which contains list of user processes that opened
the device (currently only one).
2. fpriv_ctrl_list which contains list of user processes that opened
the control device. This processes in this list shall not be
killed during reset, only when the device is suddenly removed from
PCI chain.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
In legacy f/w that use old hwmon.h file, the values of the hwmon
enums are different than the values that are in newer kernels (5.6
and above).
Therefore, to support working with those f/w, we need to do some
fixup before registering with the hwmon subsystem and also when
calling the functions that communicate with the f/w to retrieve
sensors information.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
In order to increase cpucp messaging reliability we will add
the current PI value to the descriptor sent to F/W.
F/W will wait for the PI value as an indication of a valid packet.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
The driver supports only a single user anyway, so there is no point
in checking whether we are in_debug state when a user tries to open
the device, because if we are in_debug, it means a user is already
using the device.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Current clock throttling period returned from driver was wrong due
to wrong time comparison.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
The original multi-CS design assumption that stream masters are used
exclusively (i.e. multi-CS with set of stream master QIDs will not get
completed by CS not from the multi-CS set) is inaccurate.
Thus multi-CS behavior is now modified not to treat such case as an
error.
Instead, if we have multi-CS completion but we detect that no CS from
the list is actually completed we will do another multi-CS wait (with
modified timeout).
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
It was an error to save the compute context's pointer in the device
structure, as it allowed its use without proper ref-cnt.
Change the variable to a flag that only indicates whether there is
an active compute context. Code that needs the pointer will now
be forced to use proper internal APIs to get the pointer.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
There are multiple places where the code needs to get the context's
pointer and increment its ref cnt. This is the proper way instead
of using the compute context pointer in the device structure.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Pass the user's context pointer into the etr configuration function
to extract its ASID.
Using the compute_ctx pointer is an error as it is just an indication
of whether a user has opened the compute device.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Compute context pointer in hdev shouldn't be used for fetching the
context's pointer.
If an object needs the context's pointer, it should get it while
incrementing its kref, and when the object is released, put it.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
The driver supports only a single context. Therefore, no need to check
if the user context that is closed is the compute context. The user
context, if exists, is always the compute context.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Fix a bug where in case of failure to allocate idr, the handle's
memory wasn't freed as part of the error handling code.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
The reset flags used by the reset thread are currently a mix of
hard-coded values and a specific flag which is passed from the context
that initiates the reset.
To make it easier to pass more flags in future from this context to the
reset thread, modify it to pass all the original reset flags to the
thread.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Because info ioctl is used to retrieve data, some of its opcodes may be
used during hard reset.
Other ioctls should be blocked while device is not operational.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
For debug purpose, add SOB address and SOB initial counter value
before current submission to uAPI output.
Using SOB address and initial counter, user can calculate how much of
the submmision has been completed.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reporting FW errors involves reading of the error registers.
In case we have a corrupted FW descriptor we cannot do that since the
dynamic scratchpad is potentially corrupted as well and may cause kernel
crush when attempting access to a corrupted register offset.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Driver should handle events during soft-reset as F/W is not
going through reset and it keeps sending events towards host.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Currently we dump the physical IRQ line index in host if an event
is received during reset. This ID is confusing as it means nothing
to the user.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
In new f/w versions, it is required to explicitly indicate the power
information type when querying the F/W for power info.
When getting the current power level it should be set to power_input.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Some info ioctls can be served even if the device is disabled or
in reset. Hence, we enable more info ioctls during reset, as these
ioctls do not require any H/W nor F/W communication.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Race example scenario:
1. User have 2 threads that waits on multi CS:
- thread_0 waits on QID 0 and uses multi CS context 0.
- thread_1 waits on QID 1 and uses multi CS context 1.
2. thread_1 got completion and release multi CS context 1.
3. CS related to multi CS of thread_0 starts executing
complete_multi_cs function, the first iteration of the loop
completes the multi CS of thread_0, hence multi CS context 0
is released.
4. thread_1 waits on QID 1 and uses multi CS context 0.
5. thread_0 waits on QID 0 and uses multi CS context 1.
6. The second iterattion of the loop (from step 3) starts, which
means, start checking multi CS context 1:
- multi CS contetxt is being used by thread_0 waiting on QID 0.
- The fence of the CS (still CS from step 3) has QID map the same
as the multi CS context 1.
- multi CS context 1 (thread_0) gets completion on CS that triggered
already thread_0 (with multi CS context 0) and is no longer
being waited on.
Fixed by exiting the loop in complete_multi_cs after getting completion
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
As device boot warnings clears the indication from the error mask,
they must be located together before the unknown error validation.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
GAUDI supports only hard-reset. Therefore, this function should
return an error of operation not permitted.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
The ASIC-specific soft_reset_late_init() is now called after either
soft-reset or reset-upon-device-release. Therefore, it needs a more
appropriate name.
No need to split it to two functions, as an ASIC either supports
soft-reset or reset-upon-device-release.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reset upon device release is not a soft-reset from user/system point
of view. As such, we shouldn't count that reset in the statistics we
gather and expose to the monitoring applications.
We also shouldn't print soft-reset when doing the reset upon device
release.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Changing the frequency automatically is only done in Goya. In future
ASICs this is done inside the firmware. Therefore, move the common code
into the Goya specific files.
Main changes as part of the commit are:
1. The thread for setting frequency is moved from device_late_init
to goya_late_init
2. hl_device_set_frequency is removed from hl_device_open as it is
not relevant for other ASICs and for Goya it is taken care by
the thread
3. hl_device_set_frequency is renamed as goya_set_frequency
Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Hard-reset is mutually exclusive with reset-on-device-release.
Therefore, if such a request arrives to the reset function, abort
the reset and return an error to the callee.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Currently there is a deadlock in driver in scenarios where MMU
cache invalidation fails. The issue is basically device reset
being performed without releasing the MMU mutex.
The solution is to skip device reset as it is not necessary.
In addition we introduce a slight code refactor that prints the
invalidation error from a single location.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Getting the used PLL index with which to send the CPUPU packet relies on
the CPUCP info packet.
In case CPU queues are not enabled getting the PLL index will issue an
error and in some ASICs will also fail the driver load.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
If a device reset has started, there is a chance that the heartbeat
function will fail because the device is disabled at the beginning
of the reset function.
In that case, we don't want the error message to appear in the log.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
A new uAPI is added for debug purposes of the user-space to retrieve
errors related data from previous session (before device reset was
performed).
Inforamtion is filled when a razwi or CS timeout happens and can
contain one of the following:
1. Retrieve timestamp of last time the device was opened and razwi or
CS timeout happened.
2. Retrieve information about last CS timeout.
3. Retrieve information about last razwi error.
This information doesn't contain user data, so no danger of data
leakage between users.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
AS TPM error indication is not fatal, driver should dump a warning
and continue booting.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
I2C debugfs support is limited to 1 byte. We extend functionality
to more than 1 byte by using one of the pad fields as a length.
No backward compatibility issues as new F/W versions will treat 0
length as a 1 byte length transaction.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Divide the code into 3 different parts:
- Copy kernel parameters
- Setting device behaivor per asic
- Fixup of various device parameters according to the device behaivor.
In addition, remove non-relevant code for upstream (simulator support).
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Add implementation for new opcodes in the INFO IOCTL:
1. Retrieve the replaced DRAM rows from f/w.
2. Retrieve the pending DRAM rows from f/w.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>