- added functionality to find the smallest fitting buffer instead of the first buffer found that is >= the requested size (a sketch of the idea follows below)
-- this prevents two buffer allocations in sequence from grabbing a huge buffer for a small tensor and then requiring a new buffer for the 2nd tensor
-- in my test this saved 1 GB of VRAM that is now free for more offloading
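A minimal sketch of the best-fit lookup, assuming a simple per-device pool of (ptr, size) entries; the struct and function names are illustrative, not the actual ggml-cuda symbols:

```cpp
#include <cstddef>
#include <cuda_runtime.h>

#define MAX_DEVICES      16
#define MAX_POOL_BUFFERS 256

// hypothetical pool entry; the real ggml-cuda pool differs in detail
struct cuda_buffer {
    void * ptr  = nullptr;
    size_t size = 0;
};

// one simple pool per device (illustrative layout)
static cuda_buffer g_buffer_pool[MAX_DEVICES][MAX_POOL_BUFFERS];

// return the smallest pooled buffer that still satisfies the request,
// instead of the first one that happens to be large enough
static void * pool_malloc_best_fit(int device, size_t size, size_t * actual_size) {
    int best = -1;
    for (int i = 0; i < MAX_POOL_BUFFERS; ++i) {
        cuda_buffer & b = g_buffer_pool[device][i];
        if (b.ptr != nullptr && b.size >= size) {
            if (best == -1 || b.size < g_buffer_pool[device][best].size) {
                best = i; // remember the tightest fit so far
            }
        }
    }
    if (best != -1) {
        cuda_buffer & b = g_buffer_pool[device][best];
        void * ptr   = b.ptr;
        *actual_size = b.size;
        b.ptr  = nullptr;
        b.size = 0;
        return ptr;
    }
    // nothing reusable in the pool: allocate a fresh buffer
    void * ptr = nullptr;
    cudaMalloc(&ptr, size);
    *actual_size = size;
    return ptr;
}
```

The only difference from a first-fit lookup is that the loop keeps scanning for the tightest buffer instead of returning the first one that is large enough.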
CUDA free buffers:
- added a helper function that frees all unused buffers of a device, preventing the huge F32 cuBLAS buffers from needlessly occupying VRAM after token ingestion
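A minimal sketch of such a helper, reusing the illustrative pool layout from the sketch above; the real helper and its name may differ:

```cpp
// free every buffer still sitting in the pool of `device`, e.g. right after
// prompt ingestion when the large cuBLAS F32 buffers are no longer needed
// (names and pool layout are illustrative, see the sketch above)
static void pool_free_all_buffers(int device) {
    cudaSetDevice(device);
    for (int i = 0; i < MAX_POOL_BUFFERS; ++i) {
        cuda_buffer & b = g_buffer_pool[device][i];
        if (b.ptr != nullptr) {
            cudaFree(b.ptr);
            b.ptr  = nullptr;
            b.size = 0;
        }
    }
}
```

Calling this once after token ingestion returns the temporary buffers to the driver, so that VRAM becomes available for offloading more layers.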
libfalcon:
- corrected vram_overhead calculation to account for the actual non-weight buffers needed during inference
- added vram_overhead for n_batch > 1, as this switches prompt ingestion into a 32-bit dequantization mode for cuBLAS which needs almost 2 GB of VRAM buffers
- corrected the automated layer distribution to fill VRAM with as many layers as possible (a sketch of the approach follows after the recommendations below)
From here on it is recommended to use --ngl 100 and -b 1 for CUDA processing.
In addition it is recommended to set -t to 1, or to one less than the number of CPU cores (depends on the CPU and GPU used).
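A rough sketch of how such an automated distribution can be computed, using hypothetical names for the free VRAM, the overhead/reserved/scratch amounts mentioned above, and the per-layer tensor sizes (this is not the actual libfalcon code):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Decide how many of the last layers fit into VRAM, given the free VRAM
// reported by the driver and the non-weight buffers that must stay available.
// All names and the exact accounting are illustrative.
static int layers_that_fit(uint64_t vram_free,
                           uint64_t vram_overhead,   // e.g. cuBLAS dequant buffers for n_batch > 1
                           uint64_t vram_reserved,   // safety margin kept unallocated
                           uint64_t vram_scratch,    // per-context scratch buffer
                           const std::vector<uint64_t> & layer_sizes) {
    const uint64_t fixed = vram_overhead + vram_reserved + vram_scratch;
    if (vram_free <= fixed) {
        return 0;
    }
    const uint64_t budget = vram_free - fixed;

    int n_fit = 0;
    uint64_t used = 0;
    // offload from the last layer backwards until the budget is exhausted
    for (int i = (int) layer_sizes.size() - 1; i >= 0; --i) {
        if (used + layer_sizes[i] > budget) {
            break;
        }
        used += layer_sizes[i];
        n_fit++;
    }
    fprintf(stderr, "offloading %d layers, using %llu of %llu budget bytes\n",
            n_fit, (unsigned long long) used, (unsigned long long) budget);
    return n_fit;
}
```

Requesting --ngl 100 then effectively means "as many layers as the budget allows".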
- disabled offload of non-layer tensors for now (not working yet)
- corrected the tensor size calculation for VRAM
- added some more detailed VRAM reporting
- added an automated skip of tensors that would not fit in VRAM (significant slowdown if --ngl is too high, probably from temporary CUDA buffer copies); a sketch of the skip follows after this list
- added vram_overhead and vram_reserved; those are not pretty but currently needed to get the VRAM usage right
- moved the VRAM scratch buffer allocation up a bit so its usage is available for the skip
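A simplified sketch of the per-tensor skip, again with illustrative names; it checks the running VRAM usage plus the reserved and overhead amounts before offloading a tensor and keeps the tensor on the CPU if it would not fit:

```cpp
#include <cstdint>
#include <cstdio>

// Before offloading a tensor, check whether it still fits next to what has
// already been placed on the GPU. If not, keep it in system RAM instead of
// failing the allocation. Names and accounting are illustrative.
static bool try_offload_tensor(const char * name,
                               uint64_t tensor_size,
                               uint64_t vram_free,
                               uint64_t vram_reserved,
                               uint64_t vram_overhead,
                               uint64_t * vram_used) {
    const uint64_t needed = *vram_used + tensor_size + vram_reserved + vram_overhead;
    if (needed > vram_free) {
        fprintf(stderr, "skipping %s (%.1f MB): would exceed free VRAM\n",
                name, tensor_size / 1024.0 / 1024.0);
        return false; // tensor stays on the CPU
    }
    *vram_used += tensor_size;
    return true;
}
```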
Added falcon main and library based on llama.cpp
CPU inference works (getting ~260 ms/token on 7B 16-bit Falcon)
Tested with 7B 16-bit and the two Shakespeare models (both in 16-bit precision only)
TODO/WIP:
1) quantization runs and creates a ggjt 3 file, but something is wrong with the quantized model binary
- even quantization from 16 -> 16 fails; something is wrong with the tensors produced
2) mmap should work with quantized binaries once 1) is solved
3) CUDA support is mostly there but currently disabled (all-CPU backend)
4) memory/context calculations are off, and the GPU memory calculations are wrong as well
5) the Python conversion script is a pre-GGML 1 version (tokens without scores)
6) some stuff is still called "llama"; some of it should be renamed to a generic name as it works for both
7) the GGML file produced by the current Python script uses an old ftype method
Makefiles:
CMake on Windows with Build Tools works
the Makefile for Linux/MSYS was adjusted blindly but not tested yet; something may have been missed
Changes to the codebase:
* repeat2 has been added to ggml (jploski - https://github.com/ggerganov/ggml/pull/231) including the backward variant (untested, probably fails)
* minor changes to work with Falcon (name length)
* libfalcon is the previous "llama.cpp" and falcon_main is the previous main.cpp