- Work towards improving non-optimized build support
- Introduce MODE=zero which is -O0 without ASAN/UBSAN
- Use system GCC when ~/.cosmo.mk has USE_SYSTEM_TOOLCHAIN=1
- Have package.com check that .privileged code doesn't call non-privileged code
llama.com can now load weights that use the new file format introduced
a few weeks ago. Unlike llama.cpp, we will keep supporting old file
formats in our tool, so you don't need to convert your weights when the
upstream project makes breaking changes. That said, using ggjt v3 does
make AVX2 inference about 5% faster for me.
- Fix UX issues with llama.com
- Do housekeeping on libm code
- Add more vectorization to GGML
- Get GGJT quantizer programs working well
- Have the quantizer keep the output layer as f16c
- Prefetching improves performance by 15% if you use fewer threads
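The log does not say where the prefetches were placed in GGML. As a generic illustration of the technique, here is a minimal sketch of a dot product that prefetches ahead of the current index. The function name, the look-ahead distance, and the 64-byte cache line size are assumptions made for this example; __builtin_prefetch is the GCC/Clang builtin.

```c
#include <stddef.h>

/* Hypothetical tuning values, not taken from GGML. */
#define FLOATS_PER_LINE 16 /* assumes 64-byte cache lines */
#define PREFETCH_AHEAD 64  /* assumed look-ahead distance, in elements */

/* Dot product that hints upcoming cache lines to the CPU. */
float dot_prefetch(const float *a, const float *b, size_t n) {
  float sum = 0;
  for (size_t i = 0; i < n; ++i) {
    /* Issue one hint per cache line rather than one per element. */
    if (i % FLOATS_PER_LINE == 0 && i + PREFETCH_AHEAD < n) {
      /* Arguments: address, rw (0 = read), locality (3 = keep in cache). */
      __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 3);
      __builtin_prefetch(&b[i + PREFETCH_AHEAD], 0, 3);
    }
    sum += a[i] * b[i];
  }
  return sum;
}
```

The prefetch is only a hint, so the result is the same with or without it. Whether it helps depends on the hardware and on how the work is split, so the look-ahead distance would need to be measured rather than assumed.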
$ make -j8 o//third_party/radpajama/radpajama.com
$ make -j8 o//third_party/radpajama/radpajama-chat.com
This change gets the radpajama.mk config working. The package depends
on THIRD_PARTY_GGML but is configured to call ggjt_v1() so that the
library provides the old quantizers. The ggml_quantize_chunk() API now
dispatches to older quantizers based on the configured version.
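The dispatch described above could look roughly like the following sketch. Every name here is a hypothetical stand-in (use_format_v1() plays the role of ggjt_v1(), quantize_chunk() the role of ggml_quantize_chunk()), and the two toy quantizers only exist to show that the selected version changes the result. They are not the real GGML algorithms.

```c
#include <stddef.h>

/* All names below are hypothetical stand-ins, not the real GGML symbols. */
enum format_version { FORMAT_V1 = 1, FORMAT_V3 = 3 };

static enum format_version g_version = FORMAT_V3; /* newest by default */

/* A client that needs the old behavior selects it once at startup. */
void use_format_v1(void) { g_version = FORMAT_V1; }

typedef size_t quantize_fn(const float *src, signed char *dst, size_t n);

/* Clamp to the int8 range; the cast truncates toward zero. */
static signed char clamp8(float x) {
  if (x > 127) return 127;
  if (x < -127) return -127;
  return (signed char)x;
}

/* Toy "old" quantizer: truncates toward zero. */
static size_t quantize_v1(const float *src, signed char *dst, size_t n) {
  for (size_t i = 0; i < n; ++i) dst[i] = clamp8(src[i]);
  return n;
}

/* Toy "new" quantizer: rounds to nearest. */
static size_t quantize_v3(const float *src, signed char *dst, size_t n) {
  for (size_t i = 0; i < n; ++i)
    dst[i] = clamp8(src[i] + (src[i] < 0 ? -0.5f : 0.5f));
  return n;
}

/* Single entry point that dispatches on the configured version. */
size_t quantize_chunk(const float *src, signed char *dst, size_t n) {
  quantize_fn *impl = g_version == FORMAT_V1 ? quantize_v1 : quantize_v3;
  return impl(src, dst, n);
}
```

Keeping one entry point means callers such as the quantizer programs do not need to know which format a package targets. Only the one-time version selection differs between packages.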
Example use case for JSON completion:
$ m=opt
$ make -j16 m=$m o/$m/third_party/ggml/llama.com
$ o/$m/third_party/ggml/llama.com -m llama.bin -p '{"key": "life", "val": ' -r '}'
42}
Stopping at a reverse prompt gives better control over where the output
ends. More sophisticated facilities for controlling text generation
will be provided soon.
- Introduce -v and --verbose flags
- Don't print stats / diagnostics unless -v is passed
- Reduce --top_p default from 0.95 to 0.70
- Change --reverse-prompt to no longer imply --interactive
- Permit --reverse-prompt to specify a custom EOS when non-interactive
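To illustrate what treating the reverse prompt as a custom EOS means in non-interactive mode, here is a minimal sketch that appends generated pieces to a buffer and stops as soon as the buffer ends with the stop string. The helper names and the array of pre-made pieces are assumptions for this example. llama.com's actual generation loop is not shown in this log.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if `out` ends with the non-empty `stop`. */
bool ends_with(const char *out, const char *stop) {
  size_t n = strlen(out), m = strlen(stop);
  return m > 0 && m <= n && memcmp(out + n - m, stop, m) == 0;
}

/* Appends pieces to buf until the stop string appears, the pieces run
   out, or buf is full. Returns how many pieces were consumed. */
size_t generate_until(const char *const *pieces, size_t count,
                      const char *stop, char *buf, size_t cap) {
  size_t used = 0, i;
  if (!cap) return 0;
  buf[0] = 0;
  for (i = 0; i < count; ++i) {
    size_t len = strlen(pieces[i]);
    if (used + len + 1 > cap) break; /* no room left for this piece */
    memcpy(buf + used, pieces[i], len + 1);
    used += len;
    if (ends_with(buf, stop)) {
      ++i; /* count the piece that completed the stop string */
      break;
    }
  }
  return i;
}
```

With the pieces "42", "}", and further text, and a stop string of "}", the loop consumes two pieces and leaves "42}" in the buffer, which mirrors the JSON completion example.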