- Fix UX issues with llama.com
- Do housekeeping on libm code
- Add more vectorization to GGML
- Get GGJT quantizer programs working well
- Have the quantizer keep the output layer as f16c
- Prefetching improves performance by 15% if you use fewer threads

The libm code from musl wasn't being used, since most of these functions
are implemented using x87, which is faster than a library intended for
RISC machines.
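
For readers unfamiliar with the prefetching item in the list above, here
is a rough sketch of the general technique. It is not the code from this
change: dot_prefetch and PREFETCH_AHEAD are hypothetical names, the
distance of 64 elements is an untuned assumption, and it relies on the
GCC/Clang __builtin_prefetch builtin.

```c
#include <stddef.h>

/* Assumed, untuned look-ahead distance, measured in elements. */
#define PREFETCH_AHEAD 64

/* Hypothetical helper (not from GGML): a dot product that asks the CPU
   to start loading data it will need a few iterations from now. */
float dot_prefetch(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* Second argument 0 = read access; third argument 3 = keep
               the line in all cache levels (high temporal locality). */
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 3);
            __builtin_prefetch(&b[i + PREFETCH_AHEAD], 0, 3);
        }
        sum += a[i] * b[i];
    }
    return sum;
}
```

Prefetch hints never change the result, only when memory is fetched, so
whether they help depends on the thread count and memory bandwidth and
has to be measured.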