1. Add a `LLAMA_SUPPORTS_GPU_OFFLOAD` define to `llama.h` (defined when compiled with CLBlast or cuBLAS)
2. Update the argument handling in the common example code to only show the `-ngl`, `--n-gpu-layers` option when GPU offload is possible.
3. Add an entry for the `-ngl`, `--n-gpu-layers` option to the `main` and `server` examples documentation
4. Update `main` and `server` examples documentation to use the new style dash separator argument format
5. Update the `server` example to use dash separators for its arguments and adds `-ngl` to `--help` (only shown when compiled with appropriate support). It will still support `--memory_f32` and `--ctx_size` for compatibility.
6. Add a warning discouraging use of `--memory-f32` for the `main` and `server` examples `--help` text as well as documentation. Rationale: https://github.com/ggerganov/llama.cpp/discussions/1593#discussioncomment-6004356
Set `LLAMA_BUILD_SERVER` in workflow so the `server` example gets build. This currently only applies to Windows builds because it seems like only Windows binary artifacts are included in releases.
Add `server` example target to `Makefile` (still uses `LLAMA_BUILD_SERVER` define and does not build by default)
Fix issue where `vdot` binary wasn't removed when running `make clean`.
Fix compile warnings in `server` example.
Add `.hpp` files to trigger workflow (the server example has one).
Improvements to loading the session with `--prompt-cache` in the `main` example.
1. Fix an issue where the `--seed` parameter was ignored when loading a cached prompt.
2. When loading a cached prompt, you previously had to specify the saved prompt (or a prefix of it) again. This pull changes that behavior to default to the prompt that was cached if a prompt wasn't specified by the user.
* xor hack
* block y dim
* loop unrolling
* Fixed cmake LLAMA_CUDA_BY option
* Removed hipblas compatibility code
* Define GGML_CUDA_DMMV_BLOCK_Y if not defined
* Fewer iters, more ops per iter
* Renamed DMMV X/Y compilation options