Commit graph

88 commits

Author SHA1 Message Date
comex
f9ef010e6b [WIP, broken] Importer for GPTQ quantized LLaMA models
Based on: https://github.com/qwopqwop200/GPTQ-for-LLaMa

Current status: Something is busted.  The output starts out decent, but
quickly degrades into gibberish.  This doesn't happen with either the
original GPTQ-for-LLaMa using the same weights, or llama.cpp when using
weights quantized by its own quantizer.  Is there a bug in the
conversion script that somehow only comes into play with a large context
size?

I did notice one potential issue.  It's clearly not the main cause of
the gibberish, since it doesn't happen when using q4_1 weights quantized
by llama.cpp itself, but it seems concerning.  When doing a matrix
multiplication of f16 * f32 => f32 or q4_1 * f32 => f32, at least when
the multiplication is not done with BLAS, the intermediate results are
stored in the smaller format rather than f32.  This seems like an
unnecessary waste of precision, especially in the q4_1 case.

I was originally hoping to validate the results by matching the Python
implementation's output exactly, but precision and non-associativity
issues make this very difficult, including when performing matrix
multiplications and, especially, computing norms.

Anyway, design details:

The models being imported store per-layer weights in essentially q4_1
format, although the addend and scale are shared across an entire row
rather than every group of 32 weights.  This script duplicates the
addend and scale to match ggml's expectations, at the cost of wasting
some memory.

However, there are two differences which I accommodated changing the
output format (and adding corresponding support to main.cpp) rather than
having the script match the existing one:

- The tok_embeddings and output weights (i.e. the weights that aren't
  per-layer) are f16 instead of q4_1.  They could be converted to q4_1,
  and the impact of the loss of precision would probably be low, but
  this would rule out exactly matching the Python implementation's
  output for validation.

- There is no sharding, since the input doesn't have it, and for a
  CPU-only implementation it seems more useful to avoid having to deal
  with multiple files.

The new format is differentiated from existing q4_1 format by changing
the 'f16' header flag to a new value, 4.  That said, I think a cleaner
approach would be to change main.cpp to support loading each tensor with
an arbitrary sharding configuration and type rather than hardcoding
specific combinations of types.  So far I've wasted too much time
debugging to try implementing this...
2023-03-19 11:44:36 -07:00
Alex Nguyen
d3f202d57b
Remove unused code since n_vocab is model.hparams.n_vocab (#262) 2023-03-18 13:51:49 +00:00
Justin Suess
e03e359730
fixed warning with std::ignore about unused function result (#151)
fixed warning with std::ignore about unused function result
2023-03-18 11:44:09 +00:00
Gary Linscott
a81d0c2a17
Fix n^2 loop in tokenization (#254)
This causes long prompts to parse very slowly.
2023-03-18 11:17:19 +00:00
anzz1
b2de7f18df
CI Improvements (#230)
* CI Improvements

Manual build feature, autoreleases for Windows

* better CI naming convention

use branch name in releases and tags
2023-03-18 09:27:12 +02:00
Niklas Korz
a292747893
Nix flake (#40)
* Nix flake

* Nix: only add Accelerate framework on macOS

* Nix: development shel, direnv and compatibility

* Nix: use python packages supplied by withPackages

* Nix: remove channel compatibility

* Nix: fix ARM neon dotproduct on macOS

---------

Co-authored-by: Pavol Rusnak <pavol@rusnak.io>
2023-03-17 23:03:48 +01:00
thement
c9f670a177
Implement non-greedy tokenizer that tries to maximize token lengths (#242)
* Implement non-greedy tokenizer that tries to maximize token lengths

* Insert single space in front of the prompt

- this is to match original llama tokenizer behavior

---------

Co-authored-by: Jakub Horak <jakub.horak@ibawizard.net>
2023-03-17 21:05:58 +01:00
Georgi Gerganov
4f54609110
Default to 4 threads (#243) 2023-03-17 21:46:46 +02:00
Georgi Gerganov
e81b9c81c1
Update Contributing section 2023-03-17 20:30:04 +02:00
Stephan Walter
367946c668
Don't tell users to use a bad number of threads (#243)
The readme tells people to use the command line option "-t 8", causing 8
threads to be started. On systems with fewer than 8 cores, this causes a
significant slowdown. Remove the option from the example command lines
and use /proc/cpuinfo on Linux to determine a sensible default.
2023-03-17 19:47:35 +02:00
mmyjona
6b0df5ccf3
add ptread link to fix cmake build under linux (#114)
* add ptread link to fix cmake build under linux

* add cmake to linux and macos platform

* separate make and cmake workflow

---------

Co-authored-by: Sebastián A <sebastian.aedo29@gmail.com>
2023-03-17 13:38:24 -03:00
Bernat Vadell
2af23d3043
🚀 Dockerize llamacpp (#132)
* feat: dockerize llamacpp

* feat: split build & runtime stages

* split dockerfile into main & tools

* add quantize into tool docker image

* Update .devops/tools.sh

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* add docker action pipeline

* change CI to publish at github docker registry

* fix name runs-on macOS-latest is macos-latest (lowercase)

* include docker versioned images

* fix github action docker

* fix docker.yml

* feat: include all-in-one command tool & update readme.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-17 10:47:06 +01:00
Matvey Soloviev
904d2a8d6a
Q4_1 quantization (#193)
* Add AVX2 version of ggml_vec_dot_q4_1

* Small optimisations to q4_1 dot product (@Const-me)

* Rearrange Q4_1 quantization to work for multipart models. (Fix #152)

* Fix ggml_vec_mad_q4_1 too

* Fix non-vectorised q4_1 vec mul
2023-03-17 06:48:39 +02:00
Georgi Gerganov
721311070e
Update README.md 2023-03-16 15:00:09 +02:00
Georgi Gerganov
ac15de7895
Expand "Contributing" section 2023-03-16 08:55:13 +02:00
Georgi Gerganov
273abc47ff
Update hot topics - RMSnorm 2023-03-16 07:12:12 +02:00
Nebula
9b4a15b17d
Fix RMS norm in GGML (#191) 2023-03-15 19:29:25 -04:00
hoangmit
6eac39ba95
Add RMS norm and use it (#187)
* add ggml_rms_norm

* update op num
2023-03-16 00:41:38 +02:00
moritzbrantner
27944c4206
fixed typo (#178) 2023-03-15 22:35:25 +02:00
Rickey Bowers Jr
2d15d6c9a9
add SIGINT support for _WIN32 environments (#120)
* add SIGINT support for _WIN32 environments

* perhaps more consistent
2023-03-15 21:56:24 +02:00
Justin Suess
2d64715ad4
added ctx_size parameter (#148)
* added ctx_size parameter

* added it in more places

* Apply suggestions from code review

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-15 21:42:40 +02:00
Justin Suess
16b2c61a22
fixed color reset on exit (#149)
* fixed color reset on exit

* added sigint handler for ansi_color_reset

* Update main.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-15 21:39:38 +02:00
Musab Gultekin
977295c700
Fix potential licensing issue (#126)
* Update README.md

* Update README.md

remove facebook
2023-03-15 21:39:06 +02:00
Ronsor
956dfda8ad
Use tokenizer.vocab_size() instead of hardcoding 32000 in convert-pth-to-ggml.py (#142)
There are ways that special tokens or other new tokens could be added to the tokenizer; therefore it's probably best not to assume the vocabulary is only 32000 tokens.
2023-03-15 21:37:50 +02:00
hoangmit
113e685d18
inline -> static inline for "bytesFromNibbles" (#161)
Without "static" prefix, it fails to compile in clang
2023-03-15 21:05:14 +02:00
Ronsor
47857e564c
Don't use vdotq_s32 if it's not available (#139)
* Don't use vdotq_s32 if it's not available

`dotprod` extensions aren't available on some ARM CPUs (e.g. Raspberry Pi 4), so check for them and only use them if they're available.

Reintroduces the code removed in 84d9015 if `__ARM_FEATURE_DOTPROD` isn't defined.

* Update ggml.c

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-14 21:34:37 +02:00
Radoslav Gerganov
60f819a2b1
Add section to README on how to run the project on Android (#130) 2023-03-14 15:30:08 +02:00
Georgi Gerganov
97ab2b2578
Add Misc section + update hot topics + minor fixes 2023-03-14 09:43:52 +02:00
Sebastián A
2f700a2738
Add windows to the CI (#98) 2023-03-13 22:29:10 +02:00
Georgi Gerganov
c09a9cfb06
CMake build in Release by default (#75) 2023-03-13 21:22:15 +02:00
Georgi Gerganov
7ec903d3c1
Update contribution section, hot topics, limitations, etc. 2023-03-13 19:21:51 +02:00
Georgi Gerganov
4497ad819c
Print system information 2023-03-13 19:15:08 +02:00
Sebastián A
ed6849cc07
Initial support for CMake (#75) 2023-03-13 19:12:33 +02:00
Thomas Klausner
41be0a3b3d
Add NetBSD support. (#90) 2023-03-13 18:40:54 +02:00
Pavol Rusnak
671d5cac15
Use fprintf for diagnostic output (#48)
keep printf only for printing model output

one can now use ./main ... 2>dev/null to suppress any diagnostic output
2023-03-13 18:39:56 +02:00
Georgi Gerganov
84d9015c4a
Use vdotq_s32 to improve performance (#67)
* 10% performance boost on ARM

* Back to original change
2023-03-13 18:36:44 +02:00
uint256_t
63fd76fbb0
Reduce model loading time (#43)
* Use buffering

* Use vector

* Minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-13 18:33:43 +02:00
Val Kharitonov
2a20f48efa
Fix UTF-8 handling (including colors) (#79) 2023-03-13 18:24:18 +02:00
Pavol Rusnak
d1f224712d
Add quantize script for batch quantization (#92)
* Add quantize script for batch quantization

* Indentation

* README for new quantize.sh

* Fix script name

* Fix file list on Mac OS

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-13 18:15:20 +02:00
Georgi Gerganov
1808ee0500
Add initial contribution guidelines 2023-03-13 09:42:26 +02:00
Matvey Soloviev
a169bb889c Gate signal support on being on a unixoid system. (#74) 2023-03-13 04:08:01 +01:00
Matvey Soloviev
460c482540 Fix token count accounting 2023-03-13 01:04:41 +01:00
Georgi Gerganov
c80e2a8f2a
Revert "10% performance boost on ARM"
This reverts commit 113a9e83eb.

There are some reports for illegal instruction.
Moved this stuff to vdotq_s32 branch until resolve
2023-03-13 01:28:08 +02:00
Georgi Gerganov
54a0e66ea0
Check for vdotq_s32 availability 2023-03-13 01:21:03 +02:00
Georgi Gerganov
543c57e991
Ammend to previous commit - forgot to update non-QRDMX branch 2023-03-13 01:05:24 +02:00
Georgi Gerganov
113a9e83eb
10% performance boost on ARM 2023-03-13 00:56:10 +02:00
Matvey Soloviev
404fac0d62
Fix color getting reset before prompt output done (#65)
(cherry picked from commit 7eb2987619feee04c40eff69b604017d09919cb6)
2023-03-13 00:07:34 +02:00
Georgi Gerganov
1a0a74300f
Update README.md 2023-03-12 23:39:01 +02:00
Matvey Soloviev
96ea727f47
Add interactive mode (#61)
* Initial work on interactive mode.

* Improve interactive mode. Make rev. prompt optional.

* Update README to explain interactive mode.

* Fix OS X build
2023-03-12 23:13:28 +02:00
Marc Köhlbrugge
9661954835
Fix typo in README (#45) 2023-03-12 22:30:08 +02:00